Shannon Entropy Analysis of the Genome Code

This paper studies the chromosome information of twenty-five species, namely, mammals, fishes, birds, insects, nematodes, fungus, and one plant. A quantifying scheme inspired by the state space representation of dynamical systems is formulated. Based on this algorithm, the information of each chromosome is converted into a bidimensional distribution. The plots are then analyzed and characterized by means of Shannon entropy. The large volume of information is integrated by averaging the lengths and entropy quantities of each species. The results can be easily visualized, revealing quantitative global genomic information.


Introduction
Genome sequencing produced a huge volume of information that is now available for computational processing. Deschavanne et al. [1] explored DNA structures of genomes by means of a tool derived from chaos dynamics. Murphy et al. [2] studied the genome sequences of four species to infer early events in placental mammal phylogeny. Ebersberger et al. [3] developed a phylogenetic analysis of several DNA sequence alignments from human, chimpanzee, gorilla, orangutan, and rhesus. Prasad et al. [4] analyzed a genomic sequence generated from 41 mammals and 3 other vertebrates. In [5], Bolshoy reported a novel compositional complexity-based method for sequence analysis. The study shows that the method indicated periodicities and related features in several sets of DNA sequences. In [6], Liu et al. analyzed several aspects of the information content of the Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, Saccharomyces cerevisiae, and Escherichia coli genomes. In [7], Sims et al. used an alignment-free method in which l-mer frequency profiles of whole genomes are used for comparison. Macropol et al. [8] proposed an algorithm based on repeated random walks (RRWs) and applied the technique to a functional network of yeast genes, identifying statistically significant clusters of proteins. Kozobay-Avraham et al. [9] performed a genome analysis of DNA curvature distributions in coding and noncoding regions of prokaryotic genomes to evaluate the assistance of mathematical and statistical procedures. Two methods were applied, producing similar clusterings that reflect genomic attributes and environmental conditions of the species' habitat. Çarkacıoğlu et al. [10] proposed the bi-k-bi clustering for finding association rules of gene pairs that can easily operate on large-scale and multiple heterogeneous data sets. Kaplunovsky et al. [11] investigated correlations between certain properties of exons in a gene and genomic trees obtained with different approaches of clustering based on exonic parameters. They concluded that the best approach was based on distances among four principal components obtained by factor analysis, followed by the application of clustering algorithms. Sualp and Can [12] computed several graph theoretic measures on a protein-protein interaction network of a target organism as indicators of network context. Machado et al. [13] studied the human DNA from the perspective of system dynamics, associating entropy and the Fourier transform.
Based on the genomic data, this paper studies the deoxyribonucleic acid (DNA) code of twenty-five species. Having in mind the tools used in system and chaos analysis, a state space representation and an entropy measure are adopted. The state space plots reveal complex evolutions, resembling those revealed by chaotic systems and suggesting that the DNA information can be tackled by numerical tools. Given the large number of chromosomes and species involved in the study, the information is synthesized by means of the arithmetic averages of the entropy and the chromosome length. This strategy allows a simple quantitative visualization of the global genomic information of each species. Bearing these ideas in mind, this paper is organized as follows. Section 2 presents the DNA code mapping concepts and the Shannon entropy characterization of the resulting numerical data. Section 3 analyzes the DNA entropy content of 489 chromosomes corresponding to twenty-five species, including several mammals, fishes, birds, insects, nematodes, fungus, and one plant. Finally, Section 4 outlines the main conclusions.

Mapping the DNA Code and Quantification by Means of Entropy
The DNA helix encodes information by means of four distinct nitrogenous bases {thymine, cytosine, adenine, guanine}, usually denoted by the symbols {T, C, A, G}. Besides the four symbols, the chromosome data files include a fifth symbol {N}, which is believed to have no practical meaning for the DNA decoding. Each base connects with only one type of base on the other side, forming the base pairings A-T and C-G.
The problem of DNA decoding is addressed in this paper using an algorithm inspired by the state space representation used in the analysis of dynamical systems. This method was formulated by Roy et al. [14] and later addressed in conjunction with fractal dimension by Machado [15]. In the present paper, the scheme is improved by connecting the state plane with the entropy measure. The proposed strategy consists of implementing the following translation scheme: (i) the A-T and C-G pairs are represented in the horizontal and vertical Cartesian axes, respectively, and (ii) each base along the DNA strand is converted to a one-step increment of size δ > 0, being +δ (−δ) for the first (second) base in each bonding pair. In the case of symbol {N}, no action is taken. Therefore, the DNA information, corresponding to the succession of bases, is converted into a trajectory representative of the dynamical evolution. Furthermore, the translation preserves the base pairing logic and does not introduce any preconception biasing the DNA information.
In [15], the box counting method was adopted for characterizing the fractal image in the state plane. However, box counting is an approximate method that requires large images in order to have a reasonable precision and does not quantify the case of successive trajectories passing through the same points. Having this fact in mind, this paper proposes an alternative method that takes into account the number of trajectories passing through a given point in the state plane. First, as in the case of using images, the minimum and maximum values along each axis are calculated and the trajectories are rescaled in order to fit a matrix M of size n × n. Second, the points in the trajectories are quantized and counted for each cell in matrix M. Third, the matrix M is converted to a bidimensional histogram by dividing each cell count, which represents the number of trajectory points that fit inside the cell boundaries, by the total number of trajectory points.
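The translation and histogram steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `trajectory` and `histogram` are hypothetical, the choice of which base in each pair receives +δ (here A and C) is arbitrary, and the starting point (0, 0) is counted as a trajectory point.

```python
from collections import Counter

# Unit steps: A/T on the horizontal axis, C/G on the vertical axis.
# Assumption of this sketch: A and C take +delta, T and G take -delta.
STEPS = {"A": (1, 0), "T": (-1, 0), "C": (0, 1), "G": (0, -1)}


def trajectory(sequence, delta=1):
    """Return the list of state-plane points visited by the DNA walk."""
    x = y = 0
    points = [(x, y)]  # assumption: the origin counts as a point
    for base in sequence.upper():
        step = STEPS.get(base)
        if step is None:  # symbol N (or any other character): no action
            continue
        x += delta * step[0]
        y += delta * step[1]
        points.append((x, y))
    return points


def histogram(points, n=100):
    """Rescale points to an n x n grid and return relative frequencies.

    The result maps (column, row) cell indices to the fraction of
    trajectory points that fall inside that cell; empty cells are omitted.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    def cell(value, lo, hi):
        if hi == lo:  # degenerate axis: everything falls in cell 0
            return 0
        # The maximum value is clamped into the last cell.
        return min(int((value - lo) / (hi - lo) * n), n - 1)

    counts = Counter(
        (cell(x, x_min, x_max), cell(y, y_min, y_max)) for x, y in points
    )
    total = len(points)
    return {key: count / total for key, count in counts.items()}
```

Because the histogram keeps a count per cell, a cell crossed many times weighs more than one crossed once, which is the property that a monochrome image and box counting cannot capture.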
The characterization of bidimensional histograms can be accomplished by several indices. In this paper, the Shannon entropy is adopted [16-21]. Statistical indices based on moments can be used, but that option requires a high number of measures. In fact, describing the frequency distribution of 25 species using the mean, variance, skewness, and kurtosis runs counter to the goal of designing an assertive characterization and visualization methodology. Furthermore, the histograms reveal irregular shapes, which precludes reducing the total number and considering only a limited set of indices.
The concept of entropy was developed by Ludwig Boltzmann when analyzing the statistical behavior of a system's microscopic components. In information theory, entropy was devised by Claude Shannon to study the amount of information in a transmitted message. The Shannon entropy H, satisfying the Shannon-Khinchin axioms, is defined as
$$H(X) = -\sum_{x \in X} p(x) \ln p(x),$$
where p(x) is the probability that event x ∈ X occurs.
For bidimensional probability distributions, the expression becomes
$$H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \ln p(x, y),$$
where p(x, y) is the joint probability distribution function of (X, Y). The entropy index H is applied to 25 species having the main characteristics depicted in Table 1 and totaling 489 chromosomes.
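As an illustration of the bidimensional entropy, the short Python sketch below evaluates H over the relative frequencies of the histogram cells. The function name `shannon_entropy` is hypothetical, the natural logarithm is assumed, and empty cells are skipped under the usual convention that 0 ln 0 = 0.

```python
import math


def shannon_entropy(probabilities):
    """Return H = -sum(p * ln p) over the cells with p > 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


# A distribution concentrated in a single cell carries no uncertainty.
single_cell = shannon_entropy([1.0])

# A uniform distribution over all cells of an n x n grid gives ln(n * n),
# the largest value H can take on that grid.
n = 100
uniform = shannon_entropy([1.0 / (n * n)] * (n * n))
```

Under the natural logarithm assumption, the uniform case on a 100 × 100 grid gives ln(10^4) ≈ 9.21, so any histogram whose trajectories leave part of the state plane unvisited yields a smaller value.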

DNA Entropy and Chromosome Length
The code in each of the 489 chromosomes is converted to a state plane portrait, and the bidimensional histogram is described in the light of the entropy measure. Several experiments varying the number of cells of the n × n matrix M demonstrated that there are only minor numerical differences once large values are adopted, and it was found that n = 100 is a good compromise between precision and computational requirements.
Figure 1 shows, as examples, the two-dimensional state plane plots and the corresponding bidimensional distributions of relative frequency of the chromosomes Am1, Hu1, Tg1, and Zf1. The horizontal and vertical axes are not represented since they make no useful contribution to the calculations. The charts of the 489 chromosomes were analyzed, and it was concluded that (i) the plots vary considerably and are a signature of each case, (ii) there were significant areas of the state plane that were not visited by the trajectories, and (iii) there were parts of the charts constituted by lines or by parts of lines along the 45 or −45 degree direction.
For each chromosome, the Shannon entropy was calculated. For example, the bidimensional histograms of Figure 1 yield the values $H_{\mathrm{Am1}} = 7.092$, $H_{\mathrm{Hu1}} = 7.242$, $H_{\mathrm{Tg1}} = 6.240$, and $H_{\mathrm{Zf1}} = 6.676$. The quality of the entropy index was verified by two sets of experiments, namely, by comparing it with two alternative measures and by assessing three artificial test files. In the first set of experiments, the fractal dimension FD of the two-dimensional state portraits and the mutual information I of the bidimensional relative frequency distribution were calculated as alternative measures.
For estimating the fractal dimension, the box counting method was adopted [22-24]. For a set S in an n-dimensional space and any ε > 0, if there is a number FD such that $N_\varepsilon(S) \sim 1/\varepsilon^{FD}$ as ε → 0, where $N_\varepsilon(S)$ is the minimum number of n-dimensional cubes of side length ε needed to cover S, we say that the box counting dimension of S is FD. This reasoning leads to the expression
$$FD = \lim_{\varepsilon \to 0} \frac{\ln N_\varepsilon(S)}{\ln (1/\varepsilon)},$$
which can be implemented with image processing algorithms.
In our case, S consists of the state plane monochrome images and small values of ε are reached by accessing images at the pixel level.
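A box counting estimate at the pixel level can be sketched as below. This is an illustrative Python approximation, not the procedure used in the paper: the function names, the set-of-pixels input format, and the particular box sizes are assumptions, and the limit ε → 0 is replaced by a least-squares slope of ln N against ln(1/ε) over a few box sizes.

```python
import math


def box_count(pixels, size):
    """Number of size x size boxes holding at least one occupied pixel."""
    return len({(x // size, y // size) for x, y in pixels})


def box_counting_dimension(pixels, sizes=(1, 2, 4, 8, 16)):
    """Least-squares slope of ln N(eps) against ln(1/eps)."""
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(pixels, s)) for s in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    den = sum((a - mean_x) ** 2 for a in xs)
    return num / den


# Sanity checks on sets whose dimension is known.
filled_square = {(x, y) for x in range(64) for y in range(64)}
straight_line = {(x, 0) for x in range(64)}
fd_square = box_counting_dimension(filled_square)  # close to 2
fd_line = box_counting_dimension(straight_line)    # close to 1
```

Note that `box_count` only records whether a box is occupied, so repeated visits to the same pixel do not change the estimate.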
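The mutual information I of two random variables measures the dependence between them and is defined as
$$I(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \ln \frac{p(x, y)}{p(x)\, p(y)}.$$
The mutual information can be expressed as
$$I(X, Y) = H(X) + H(Y) - H(X, Y), \qquad (3.1)$$
where H(X) and H(Y) are the marginal entropies.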
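The two forms of the mutual information can be checked against each other numerically. The Python sketch below is illustrative only: the function names and the dictionary representation of the joint distribution, `{(x, y): p}`, are assumptions, and the natural logarithm is used throughout.

```python
import math
from collections import defaultdict


def marginals(joint):
    """Return the marginal distributions of X and Y from {(x, y): p}."""
    px = defaultdict(float)
    py = defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return px, py


def entropy(probabilities):
    """Shannon entropy with natural logarithm, skipping empty cells."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def mutual_information(joint):
    """I(X, Y) computed directly from the definition."""
    px, py = marginals(joint)
    return sum(
        p * math.log(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


def mutual_information_from_entropies(joint):
    """I(X, Y) = H(X) + H(Y) - H(X, Y)."""
    px, py = marginals(joint)
    return (entropy(px.values()) + entropy(py.values())
            - entropy(joint.values()))
```

When the marginal entropies are nearly constant across data sets, the second form makes I an approximately linear, decreasing function of the joint entropy.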
In the second set of experiments, three files with random permutations of 1-, 2-, and 3-symbol sequences of {T, C, A, G} were generated and treated as if they were chromosome files. For these files, the probabilities are identical (i.e., 1/4, 1/16, and 1/64 for the 1-, 2-, and 3-symbol sequences, respectively), and the number of generating iterations was adjusted so that each file was 1 megabyte long.
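One possible reading of this generation procedure is sketched below in Python. The interpretation is an assumption: at each iteration one word of k symbols is drawn uniformly from the 4^k possible words, so every word has probability 1/4, 1/16, or 1/64 for k = 1, 2, or 3. The function name, the seed, and the truncation to the exact target length are also choices made for this sketch.

```python
import itertools
import random


def random_word_sequence(word_length, total_length, seed=0):
    """Concatenate uniformly drawn words over {T, C, A, G}."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    words = ["".join(w) for w in itertools.product("TCAG", repeat=word_length)]
    # Enough iterations to reach the target length, rounding up.
    n_words = -(-total_length // word_length)
    sequence = "".join(rng.choice(words) for _ in range(n_words))
    return sequence[:total_length]


# Three artificial test sequences of about one megabyte each.
test_sequences = {k: random_word_sequence(k, 1_000_000) for k in (1, 2, 3)}
```

Each sequence can then be processed exactly like a chromosome file, which is what allows the natural and artificial data to be compared on the same plots.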
Figure 2 shows the entropy H of the state plane histogram versus the fractal dimension FD and the mutual information I. The black circles represent the 489 chromosomes, while the white markers at the right corners, namely, the square, the triangle, and the diamond, represent the test sequences of 1, 2, and 3 symbols. We verify that there is a strong correlation in both cases, and, therefore, the results are expected to be qualitatively of the same type. In the case of FD, this is due to the fact that the relative frequency distribution concentrates into a few spots, making the information along the z-axis less significant than that represented by the x- and y-axes. Nevertheless, also for that same reason, H is slightly superior to FD. Since the relative frequency of the four symbols is approximately identical, in the case of I the marginal entropies are almost constant and expression (3.1) leads to a linear relationship with H. Concerning the three test files, we observe that the white markers are located at the right limits of the set of points, and, consequently, the proposed scheme is capable of distinguishing between the natural and the artificial data files.
Figure 3 shows the relationship between the entropy H and the length L of the 489 chromosomes. Analyzing each of the species individually, we observe some grouping that reflects the qualitative analysis carried out initially for each separate plot. For each species, an individual map can be plotted, showing the relative similarities of the chromosomes. For example, Figure 4 represents the locus of H versus L for the 24 and the 16 chromosomes of Hu and Am, respectively. The white markers represent the arithmetic average of the horizontal and vertical coordinates for each species and can be interpreted as the "center" of each set of chromosomes. For the Hu, we observe that chromosomes 4 and Y are in opposite parts of the set, while, for the Am, chromosomes 4 and 8 are the most distant ones. Figure 3 includes a considerable number of points, and, therefore, some sort of integration is necessary. From this perspective, for each species, the arithmetic averages of the entropy and of the logarithm of the chromosome lengths (i.e., Av(H) versus Av(ln L)) are applied. The plot depicted in Figure 5 reveals the emergence of patterns that are in accordance with phylogenetics. The corresponding numerical values are listed in the two rightmost columns of Table 1.
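The per-species integration step amounts to two arithmetic averages per species. A minimal Python sketch follows; the function name and the record layout `(species, length, entropy)` are assumptions, and the numbers in the usage example are placeholders chosen for illustration, not values from Table 1.

```python
import math
from collections import defaultdict


def species_averages(records):
    """Map each species to (average ln L, average H) over its chromosomes.

    records: iterable of (species, chromosome_length, entropy) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [sum ln L, sum H, count]
    for species, length, entropy in records:
        acc = sums[species]
        acc[0] += math.log(length)
        acc[1] += entropy
        acc[2] += 1
    return {s: (a[0] / a[2], a[1] / a[2]) for s, a in sums.items()}


# Placeholder records for two hypothetical species "S1" and "S2".
example = [
    ("S1", 1_000_000, 7.0),
    ("S1", 2_000_000, 7.2),
    ("S2", 500_000, 6.5),
]
averages = species_averages(example)
```

Averaging ln L rather than L keeps a single very long chromosome from dominating the horizontal coordinate of a species.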
The less complex species are located at the left, and the mammals are plotted at the right. Within the cluster of mammals, the primates {Ho, Ch, Or} form a subcluster. Among the mammals, it is interesting to notice that Mm is close to the primates and that the marsupial Op occupies an extreme position, relatively distant from the placental mammals. Concerning the remaining points, we verify that Cb is almost indistinguishable from Ce. In a middle position, we have the clusters of birds {Ck, Tg} and fishes {Tn, St, Me, Zf}. It is interesting to see that the plant At is located between the insect Am and the fish Zf. Finally, at the extreme left, we have Sc.
Since the mammals lie relatively close together in a narrow region of the map, it is important to analyze the zoom represented in Figure 6, which makes clear the proximity not only of the primates {Ho, Ch, Or} but also of Mm and Rn.

Conclusions
Chromosomes have a code based on a four-symbol alphabet, and the information can be analyzed with tools adopted in dynamical systems. In this paper, a translation scheme for converting the DNA sequence into a state plane trajectory was adopted. The application to the 489 data files of 25 species revealed bidimensional histograms representative of each chromosome. The results were processed by means of Shannon entropy, and, in order to obtain a simple visualization, the values were averaged for each species. The map of entropy versus chromosome length revealed the emergence of comprehensive patterns of the species' relative characteristics. It was verified that the mammals form a cluster located in a narrow area of the map and that the mouse and rat are relatively close to the primates, while the marsupial is far from the placental species.

Figure 1: State plane portraits and relative frequency distribution of the chromosomes: (a) Am1, (b) Hu1, (c) Tg1, (d) Zf1.
In the definition of the mutual information, p(x) and p(y) are the marginal probability distribution functions of X and Y, respectively.

Figure 2: Entropy H versus (a) fractal dimension FD, (b) mutual information I. The black circles represent the 489 chromosomes. The white markers, namely, the square, triangle, and diamond, represent the test sequences of 1, 2, and 3 symbols.

Figure 3: Entropy H versus chromosome length L for the 25 species.

Figure 4: Entropy H versus logarithm of chromosome length ln L for the 24 and the 16 chromosomes of Hu and Am, respectively. The white markers represent the arithmetic average of the horizontal and vertical coordinates for each set of chromosomes.

Figure 5: Arithmetic averages of entropy versus logarithm of chromosome length, Av(H) versus Av(ln L), for the 25 species.

Figure 6: Averages of entropy versus logarithm of chromosome length, Av(H) versus Av(ln L), for the 11 mammals.

Table 1: Species and chromosomes.