Entropy and Multifractality for the Myeloma Multiple TET 2 Gene

The nucleotide and amino-acid distributions are studied for two mRNA variants of a gene that codes for a protein involved in multiple myeloma. Some patches and symmetries are singled out, thus showing some distinctions between the two variants. Fractal dimensions and entropy are discussed as well.


Introduction
In some recent papers, the concepts of fractality [1-19] and entropy [19-21] have been considered as fundamental parameters to investigate the existence of correlations [22-36] and simple rules [37] in DNA sequences. In particular, it has been observed that an increasing fractal dimension [7-13] can be related to a degeneration in sequences, having as a consequence the pathological evolution of related diseases. A fundamental role is played by the concept of information entropy [20, 21]: a change in the nucleotide distribution in DNA implies a corresponding change in the information content and, as a consequence, a variation in the entropy. Since cell activity is functionally dependent on the nucleotide distribution, our task is to better understand this distribution and/or the existence of large-scale structure [1-6, 15, 22-37], so that we can relate the functional activity of cells to some epitomizing patches in the nucleotide distribution. In the following, we also propose to take into account the information content of the amino-acid distribution. In particular, we will see that the amino-acid distribution shows a higher-level structure and some patchiness that are undetectable in the nucleotide distribution. Our statistical approach is based on the transformation of the symbolic string into a numerical string by the Voss indicator function [4, 5], which is a discrete binary function. The indicator matrix is defined on this function, and the fractal dimension and entropy can be simply computed on this matrix. We will compare the fractal dimension and complexity of two mRNA variants of the TET 2 (ten-eleven translocation 2) gene downloaded from the gene bank [38] (similar data are also available from [39-41]), showing that these parameters can be used to classify the two variants.
Multiple myeloma is a pathology that involves the plasma cell, but it can move and spread through the whole body. Some aspects are still unclear; however, it is known that this pathology is characterized by the activation of abnormal genes through chromosomal translocations and other genetic anomalies. One of the genes involved in the onset and progression of multiple myeloma is TET 2. In fact, it is present in some myelodysplastic syndromes, and it seems to play a key role when it is subject to mutation. The TET 2 gene is related to myelopoiesis; in fact, it encodes a protein that is significantly expressed in hematopoietic cells and granulocytes.

Multiple Myeloma and the Oncogene TET 2
Multiple myeloma (MM) is a blood cancer of the plasma cell. Myeloma originates in a specific type of cell, the plasma cell, but it can move, so that it spreads through the blood to the whole body. Like other cancers, multiple myeloma develops in steps. Myeloma begins when a normal plasma cell becomes abnormal. The abnormal cell divides, and the new cells divide again and again, thus increasing the number of abnormal cells. Myeloma cells collect in the bone marrow and in the solid part of the bone. These malignant plasma cells produce a paraprotein, an inactive antibody also known as M-protein or Bence Jones protein, that attacks bone marrow, bones, blood, and kidneys. As a consequence, extensive destruction occurs within the skeleton, involving multiple bones and resulting in widespread bone pain and multiple fractures; for this reason, the disease is called multiple myeloma. Some genetic factors are also involved in this pathology. In the absence of other symptoms and clinical signs, this condition is more properly called benign monoclonal gammopathy of uncertain significance (MGUS). The uncertainty about its future progression reflects the fact that benign conditions may also evolve into MM. It is likely that the evolution of MGUS into MM depends on many mutations of the MGUS clone. Initially, MM progresses slowly, but afterward it becomes more aggressive. The signs that characterize the onset of multiple myeloma are mostly a high concentration of calcium ions with damage to the kidneys, the weakening of the immune system with abnormal production of immunoglobulin, and some other signs such as evident osteoporosis. Both MGUS and MM are characterized by the presence of alterations in gene expression [42-46]. The chromosomes most involved are 1, 11, 13, and 14. The alteration at chromosome 1 is found in half of MM patients [47-54]. The same chromosomal aberrations seem to be evident both in MM and in MGUS, thus supporting the thesis that these two diseases are closely related [53].
The TET 2 gene is located on chromosome 4, exactly at 4q24. More precisely, the TET2 gene is located from base pair 106,067,942 to base pair 106,200,957 on chromosome 4, as shown in Figure 1 [38].
The TET 2 gene plays a key role in the conversion of methylcytosine (5mC) to 5-hydroxymethylcytosine (hmC); moreover, it is related to myelopoiesis. Several roles have been noted for hmC, for example, (1) remodeling of chromatin structure, (2) recruitment of some factors, and (3) demethylation of cytosine [55, 56]. The TET2 gene encodes a protein that is significantly expressed in hematopoietic cells and granulocytes. In almost all patients with myelodysplastic syndromes, the protein is decreased in peripheral blood granulocytes. The TET 2 gene is usually mutated in myeloproliferative disorders (MPDs). MPDs are part of a larger group of disorders called myeloproliferative neoplasms (MPNs). The mutation of TET 2 characterizes a disorder known as systemic mast cell disease, but TET 2 is above all mutated in myelodysplastic syndromes [57].
We will see that, by using some parameters defined on the indicator function, we can single out some patches which characterize abnormal functional activity [1-3, 35, 58, 59].

DNA Representation
The DNA, as well as the mRNA, of each organism of a given species is a sequence of a specific number of base pairs defined on the 4-element alphabet of nucleotides

{A, C, G, T}, with A adenine, C cytosine, G guanine, T thymine. (3.1)

Since the base pairs are distributed along a double helix, the helix, when straightened, appears as a complementary double-strand system. The two sequences on opposite strands are complementary in the sense that opposite nucleotides must fulfil the pairing rules of base pairs (A with T and G with C) between purines (A and G) and pyrimidines (T and C). In a DNA sequence, there are some subsequences with special meaning, which can be roughly subdivided into coding and noncoding regions. In particular, genes belonging to coding regions are characteristic sequences of base pairs, and the genes in turn are made of alternating subsequences of exons and introns (except in prokaryotes, where the introns are missing). Each exon region is made of triplets of adjacent bases called codons. There are 64 possible codons, since 4^3 = 64 is the number of ordered triplets that can be formed from the 4 nucleotides. There are only 20 amino acids; therefore, the correspondence from codons to amino acids is many to one. The 20-element alphabet of amino acids is given in Table 1.
Compared with H1, the H2 variant has a distinct and shorter C-terminus; H1 is also represented by a longer transcript [37].
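The pairing rule and the codon count described above can be checked with a short Python sketch. It assumes the standard genetic code, written as the usual compact string with bases enumerated in the order T, C, A, G; the names `complementary_strand`, `translate`, and `CODON_TABLE` are illustrative and are not taken from the original analysis.

```python
from itertools import product

# Pairing rule between opposite strands: A with T, G with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complementary_strand(seq):
    """Return the base-by-base complement of a DNA string (no reversal)."""
    return "".join(COMPLEMENT[b] for b in seq)

# Standard genetic code; codons are enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    """Translate consecutive codons; a trailing incomplete codon is ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

n_codons = len(CODON_TABLE)                       # 4**3 ordered triplets
n_amino = len(set(CODON_TABLE.values()) - {"*"})  # '*' marks stop codons
print(n_codons, n_amino)
```

The many-to-one correspondence is visible in the table itself: 64 keys map onto 20 amino-acid letters plus the stop symbol.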

Dot Plot on the Indicator Matrix
In this section, we define the indicator matrix [4, 5], on which the computation of multifractality and entropy is based.
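As a minimal sketch of the idea, the following Python fragment builds a binary indicator matrix for two symbolic sequences and prints it as a text dot plot. It assumes the usual Voss-type definition (entry 1 where the two symbols coincide, 0 otherwise); the function names and the toy sequence are hypothetical.

```python
def indicator_matrix(seq_a, seq_b):
    """u[h][k] = 1 if seq_a[h] == seq_b[k], else 0 (assumed definition)."""
    return [[1 if a == b else 0 for b in seq_b] for a in seq_a]

def dot_plot(matrix):
    """Render the binary matrix as text: '#' for 1 and '.' for 0."""
    return "\n".join("".join("#" if v else "." for v in row) for row in matrix)

# Autocorrelation case: the same sequence on both axes.
u = indicator_matrix("ACGA", "ACGA")
print(dot_plot(u))
```

When both arguments are the same sequence, the matrix is symmetric with ones on the main diagonal; passing two different sequences gives the cross-correlation plot instead.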

Indicator Function for the 20-Symbols Alphabet of Amino Acids
As a generalization of the 4-symbol alphabet of nucleotides, we can consider the 20-symbol alphabet of amino acids. After translating the two sequences H1 and H2 into their amino-acid components, we can see that the corresponding dot plots show some higher-level structure with respect to the distribution of nucleotides (see Figure 2). In particular, H2 shows a special pattern which is more evident in the amino-acid dot plot.

Frequency Distribution
The probability distribution of nucleotides can be defined by the frequency with which the nucleotide X is found at position n. This value can be approximated by the frequency count, on the indicator matrix, of the nucleotide distribution before n. For the transcript variants, we thus obtain the probability density distribution of Figure 3, which tends to assume different constant values for the different nucleotides, thus showing that nucleotides are heterogeneously distributed.
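The frequency count described above can be sketched in Python as a running frequency: for each position n, the fraction of the first n symbols equal to X. This is one plausible reading of the approximation, stated here as an assumption; the function name and the toy sequence are illustrative only.

```python
def running_frequency(seq, symbol):
    """Return p where p[n-1] is the fraction of the first n symbols equal to `symbol`."""
    count = 0
    freqs = []
    for n, s in enumerate(seq, start=1):
        if s == symbol:
            count += 1
        freqs.append(count / n)
    return freqs

seq = "AACGATTA"  # toy sequence, not taken from TET2
p = {x: running_frequency(seq, x) for x in "ACGT"}
print(p["A"])
```

At every position the four running frequencies sum to 1, which is the normalization used later for the entropy.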

Distribution of the Essential Amino Acids
Analogously to the nucleotide frequency distribution, we can also compute the amino-acid distribution, that is, the frequency with which the amino acid X is found at position n.
In particular, we have noticed that, even if the nucleotide distribution is nearly the same in both sequences H1 and H2, the same amino acid shows different distributions in each sequence. In other words, the "second"-level distribution seems to be organized according to a different distribution law (see Figures 4 and 5).

Fractal Dimension
The frequency distribution implies a corresponding frequency of correlation in the correlation matrix. By using the indicator matrix, it is possible to give a simple formula which enables the fractal dimension to be estimated. If we compare the fractal dimensions of the two mRNA sequences H1 and H2, we can see (Figure 6) that the fractal dimension of the nucleotide distribution tends, for both variants, to the value 1.26.
It is interesting to notice that the corresponding amino acids of the two sequences have more or less the same fractal dimension, which tends for both (Figure 7) to the value 1.29.
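The formula used for the fractal dimension is not recoverable from the text above, so the following Python sketch substitutes a generic box-counting estimate on a binary matrix. It is meant only to illustrate how a dimension can be read off an indicator matrix; it is not the authors' formula and is not expected to reproduce the values 1.26 and 1.29. All names and the box sizes are assumptions.

```python
import math

def box_count(matrix, size):
    """Number of size x size boxes that contain at least one 1."""
    n = len(matrix)
    boxes = 0
    for r in range(0, n, size):
        for c in range(0, n, size):
            if any(matrix[i][j]
                   for i in range(r, min(r + size, n))
                   for j in range(c, min(c + size, n))):
                boxes += 1
    return boxes

def box_counting_dimension(matrix, sizes=(1, 2, 4, 8)):
    """Least-squares slope of log N(s) against log(1/s); needs at least one 1 in the matrix."""
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(matrix, s)) for s in sizes]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

n = 32
diagonal = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
full = [[1] * n for _ in range(n)]
d_line = box_counting_dimension(diagonal)   # a line: slope close to 1
d_plane = box_counting_dimension(full)      # a filled square: slope close to 2
print(d_line, d_plane)
```

The two test matrices bracket the possible range: an indicator matrix of a real sequence contains the main diagonal plus scattered off-diagonal matches, so its estimate falls between these two limiting cases.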

Entropy Estimate
As a measure of the information distribution, we consider the normalized Shannon entropy, which is defined, for a distribution over the alphabet $\aleph$, as

$$H(n) = -\frac{1}{\log|\aleph|} \sum_{i} p_{X_i}(n)\,\log p_{X_i}(n),$$

where $p_{X_i}(n)$ is given by (5.1) for nucleotides and (5.2) for amino acids, $|\aleph|$ is the number of symbols in the alphabet, and $0 \log 0$ is taken to be 0. Since $\sum_i p_{X_i}(n) = 1$ for all $n$, the main values of this function are the following.
(1) If $p_{X_i}(n) = 1$ and $p_{X_j}(n) = 0$ ($j \neq i$), then $H(n) = 0$. This happens when the information is concentrated in only one symbol.
(2) If $p_{X_i}(n) = p_{X_j}(n) = 1/|\aleph|$ ($i \neq j$), then $H(n) = 1$. In this case, the information is equally distributed over all symbols.
(3) In general, $0 \leq H(n) \leq 1$, so the information content is distributed over the range [0, 1].
Therefore, the entropy is a nonnegative function ranging in the interval [0, 1]; the minimum value is obtained when the distribution is concentrated on a single symbol, while the maximum value is obtained when all symbols are equally distributed.
In particular, for higher values of n, according to the frequency definition of probability, the entropy tends to the constant value 1 (see Figures 8 and 9) for both nucleotides and amino acids. However, in the first case (Figure 8), the entropy of H1 is lower than that of H2, while, on the contrary, for the corresponding amino acids, the entropy of H1 is greater than that of H2.
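The limiting cases listed in this section can be verified numerically. The Python sketch below computes a normalized Shannon entropy from the symbol frequencies of the first n symbols, dividing by the logarithm of the alphabet size so that the result lies in [0, 1]; taking the frequencies over a prefix is an assumption consistent with the frequency definition of probability, and the function names are illustrative.

```python
import math
from collections import Counter

def normalized_entropy(seq, alphabet_size):
    """Shannon entropy of the symbol frequencies in seq, divided by log(alphabet_size)."""
    n = len(seq)
    counts = Counter(seq)
    # Symbols that never occur contribute nothing (convention 0 log 0 = 0).
    h = sum((c / n) * math.log(n / c) for c in counts.values())
    return h / math.log(alphabet_size)

def entropy_profile(seq, alphabet_size):
    """Entropy of each prefix seq[:n], for n = 1, ..., len(seq)."""
    return [normalized_entropy(seq[:n], alphabet_size)
            for n in range(1, len(seq) + 1)]

h_single = normalized_entropy("AAAA", 4)    # all information in one symbol
h_uniform = normalized_entropy("ACGT", 4)   # symbols equally distributed
profile = entropy_profile("ACGTTGCAACGT", 4)
print(h_single, h_uniform)
```

For amino acids the same function applies with `alphabet_size=20`.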

Conclusions
In this paper, two mRNA variants (isoforms) of the TET2 gene have been analyzed through their nucleotide and amino-acid distributions. By using the indicator function (and matrix), the fractal dimension and the entropy have been easily computed. We have noticed that, at

When $D_1^N \equiv D_2^N$, the indicator function shows the existence of autocorrelation on the same sequence. According to (4.5), the indicator map of the N-length sequence can be easily represented by the N × N sparse matrix of binary values {0, 1}, and this matrix can be visualized by the following autocorrelation dot plot:

Figure 2: Dot plot for the distribution of the first 50 nucleotides in the H1-H1 and H2-H2 DNA sequences (a, b) and of the corresponding amino acids (d, e). In (c) and (f), the cross correlations H1-H2 are given.

Figure 3: Probability density distribution of nucleotides along the TET2 oncogene variants H1 (brown) and H2 (blue).

Figure 7: Fractal dimension as a function of the length, i = 10, ..., 200, for the amino acids of H1 (red) and H2 (blue).

Figure 8: Entropy for the first 300 nucleotides of the sequences H1 (red) and H2 (blue).
In the following, we will analyze two mRNA sequences, H1 and H2, the two transcript variants downloaded from the National Center for Biotechnology Information [38]. Some differences between the two variants are the following: H2 differs from H1 in the 5' UTR (untranslated region) and in the 3' UTR (untranslated region); furthermore, the H2 variant,