Studying the Polypeptide Sequence (α-Code) of Escherichia coli

This paper is devoted to algebraically simulating the α-code of the bacterium Escherichia coli and to studying contrast factors (words) in its polypeptide sequence. We use methods of the spectral theory of graphs, which we previously employed for enumerating De Bruijn and Kautz sequences. The empirical material is borrowed from a computer investigation of contrast factors in the polypeptide sequences of prokaryotes.


Introduction
It was proposed [1,2] to divide 19 of the 20 amino acids into two subgroups: an alanine subgroup (of fatty amino acids: alanine, phenylalanine, isoleucine, leucine, methionine, proline, threonine, and valine) and a glycine subgroup (of more polar amino acids: cysteine, aspartic acid, glutamic acid, glycine, histidine, lysine, asparagine, glutamine, arginine, tryptophan, and tyrosine), while serine remains a spare element of the classification. In shorthand notation, this gives a, f, i, l, m, p, t, and v (the alanine subgroup); c, d, e, g, h, k, n, q, r, w, and y (the glycine subgroup); and s (a free character). The three numbers 1, 2, and 3 were picked to represent the two main subgroups and the character s, respectively.
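The recoding described above can be sketched in a few lines of Python. The grouping of letters is taken from the text; the names `ALANINE_GROUP`, `GLYCINE_GROUP`, `SPARE`, and `encode`, as well as the sample peptide, are hypothetical and serve only as an illustration.

```python
# Grouping taken from the text; all identifiers are illustrative assumptions.
ALANINE_GROUP = set("afilmptv")      # coded as 1
GLYCINE_GROUP = set("cdeghknqrwy")   # coded as 2
SPARE = "s"                          # serine, coded as 3


def encode(peptide: str) -> str:
    """Map a one-letter amino acid string to a string over {1, 2, 3}."""
    digits = []
    for residue in peptide.lower():
        if residue in ALANINE_GROUP:
            digits.append("1")
        elif residue in GLYCINE_GROUP:
            digits.append("2")
        elif residue == SPARE:
            digits.append("3")
        else:
            raise ValueError(f"unexpected residue: {residue!r}")
    return "".join(digits)


print(encode("MKTAYS"))  # made-up peptide; prints 121123
```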
Brute statistics, under the natural ratio of 1s to 2s being 0.526 : 0.474, had predicted an almost regular distribution (alternation) of the two ciphers. To check this hypothesis, the frequencies of all $2^n$ ($1 \le n \le 11$) possible substrings of length $n$ were found in the genomic sequence of E. coli. The results showed at once that the perfectly alternating substrings are in fact the least frequent ones in the entire genomic sequence. However, visual inspection suggested that the main condition for a near-statistical distribution of the two subgroups of amino acids may be hidden in the grouping of equal ciphers (either 1 or 2) into adjacent pairs. That is, in lieu of the code 121212⋯12 it should be 112211⋯22, where the overall ratio of ciphers stays unaltered.
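A minimal sketch of the frequency count mentioned above, assuming overlapping occurrences are counted and that the sequence has already been recoded into 1s and 2s. The function name `substring_counts` and the toy input are hypothetical; the toy string is not E. coli data.

```python
from collections import Counter
from itertools import product


def substring_counts(code: str, n: int) -> dict:
    """Count every word of length n over {1, 2} in an encoded string (overlapping)."""
    seen = Counter(code[i:i + n] for i in range(len(code) - n + 1))
    # Report all 2**n words, including those that never occur.
    return {"".join(w): seen.get("".join(w), 0) for w in product("12", repeat=n)}


toy = "1122112211221212"   # hypothetical toy input
counts = substring_counts(toy, 4)
print(counts["1212"], counts["1122"])  # prints 1 3
```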
The pairing of equal ciphers reminded us of the so-called α-code in polypeptides. According to it, one turn of the polypeptide spiral involves 3.5 amino acids on average. Since the smallest integer multiple of 3.5 is 7, it was of interest to interpret the α-code as one in which all structural features are due to conditionally grouping amino acids into consecutive sevens.
Merging the ideas of sevens (which suit the interpretation of the α-code) and pairs (which better obey the natural proportion of amino acids and agree with experimental observation) allows us to put forward a universal model of the α-code. In this model, the distribution of amino acids along the entire polypeptide sequence should maximize the mean number of pairs of equal integers (1 or 2) contained in its consecutive sevens. Thus, in the optimal case a seven has 3 pairs (11 and/or 22) and one unpaired number (1 or 2), which can either lie between any two of the three pairs or lie outside them (e.g., it may form a pair with an equal number of the adjacent seven). However, it is more interesting to consider the case when the window of length 7 cuts out a substring without 3 such pairs; this is also a necessary pattern in the sequence.
Logically, we could readily establish that the optimal α-code should have no fewer than 2 pairs of adjacent equal numbers in every window of length 7. Another criterion that seemed useful was that the α-code should reject the subsequences 1212 and 2121, since these allow no more than one pair 11 or 22 in most seven-cipher substrings that contain them. Experimentally, there was a third criterion (the strongest of all we know): the α-code should avoid the palindrome of the type 1211121. Here, however, very serious reservations must be made, because substituting 2 for 1 (and conversely) in this palindrome produces (longer) substrings that are, on the contrary, not very frequent in the natural sequence of E. coli. For longer palindromes including 1211121 (or 2122212), the situation may seem rather ambiguous.
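The first two criteria above can be explored with a short script. This is only a sketch under one reading of the text: a "pair" is taken to be two adjacent equal ciphers, and pairs within a window are not allowed to overlap (counted greedily from the left). The function names are hypothetical.

```python
from itertools import product


def disjoint_pairs(window: str) -> int:
    """Greedy left-to-right count of non-overlapping adjacent equal pairs (11 or 22)."""
    count, i = 0, 0
    while i + 1 < len(window):
        if window[i] == window[i + 1]:
            count += 1
            i += 2          # both ciphers of the pair are used up
        else:
            i += 1
    return count


def has_forbidden(word: str) -> bool:
    """True if the word contains 1212 or 2121."""
    return "1212" in word or "2121" in word


sevens = ["".join(w) for w in product("12", repeat=7)]
with_forbidden = [w for w in sevens if has_forbidden(w)]
at_most_one = [w for w in with_forbidden if disjoint_pairs(w) <= 1]
# How many sevens contain a forbidden word, and how many of those have at most one pair.
print(len(with_forbidden), len(at_most_one))
```

The printed counts depend on the chosen definition of a pair; a different reading (for example, overlapping pairs) would change them.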
That is why we first tried to test the simplest (to implement) of the above mathematical criteria, the prohibition of 1212 and 2121, in a model α-code. Here, we turn to some rigorous mathematical matters.

Preliminaries
First of all, we must introduce an ancillary digraph $G_0$, which is constructed as follows. The vertices of $G_0$ are all eight ordered triples of 1s and 2s (i.e., 111, 112, 121, 211, 122, 212, 221, and 222), and an arc (a self-loop) goes from one vertex to another (to this same vertex) if the last two ciphers (on the right) of the first triple coincide with the first two ciphers of the second triple. For instance, there is an arc from 111 to 112 (with two common ciphers: 11) and also an oriented self-loop attached to the vertex 111 (i.e., an arc from 111 into 111 itself).
Using the connectivity of $G_0$, one can mentally travel in it from any vertex $v$, consecutively traversing arcs and visiting adjacent vertices. Passing to an adjacent vertex along the respective arc (or returning to the original vertex $v$ through the self-loop attached to it, if any) means that one has made a walk of length 1, and so on. What is essential is that every walk of length $k$ in $G_0$ in fact covers $k + 3$ ciphers, since every next step adds one new cipher, while the very first vertex already contains three ciphers by definition. It is important to note that the auxiliary digraph $G_0$ allows one to reproduce every sequence of 1s and 2s of length $k + 3$ ($k \ge 0$), including sequences that contain 1212 and 2121.
In order to rule out all forbidden sequences with 1212 and/or 2121, we derive from $G_0$ a "better" working digraph $G$. Specifically, $G$ is the above digraph $G_0$ less the two opposite arcs going from 121 to 212 and, vice versa, from 212 to 121. Now the problem of enumerating all (model) α-code sequences of length $n = k + 3$ ($n \ge 3$) reduces to that of enumerating all walks of length $k$ in the working digraph $G$.
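The reduction described above can be checked directly. The following sketch builds the working digraph from its definition, counts walks by dynamic programming, and compares the result with a brute-force count of words avoiding 1212 and 2121. All identifiers are hypothetical and the code is an illustration, not the authors' implementation.

```python
from itertools import product

VERTICES = ["".join(t) for t in product("12", repeat=3)]
REMOVED = {("121", "212"), ("212", "121")}

# Arc u -> v when the last two ciphers of u equal the first two ciphers of v.
ARCS = [(u, v) for u in VERTICES for v in VERTICES
        if u[1:] == v[:2] and (u, v) not in REMOVED]


def count_walks(k: int) -> int:
    """Number of walks of length k, by dynamic programming over end vertices."""
    ending_at = {v: 1 for v in VERTICES}          # walks of length 0
    for _ in range(k):
        nxt = {v: 0 for v in VERTICES}
        for u, v in ARCS:
            nxt[v] += ending_at[u]
        ending_at = nxt
    return sum(ending_at.values())


def count_strings(n: int) -> int:
    """Brute-force count of words of length n over {1, 2} avoiding 1212 and 2121."""
    words = ("".join(w) for w in product("12", repeat=n))
    return sum(1 for w in words if "1212" not in w and "2121" not in w)


# A walk of length k corresponds to a word of length k + 3.
for k in range(6):
    assert count_walks(k) == count_strings(k + 3)
print([count_walks(k) for k in range(6)])  # prints [8, 14, 26, 48, 88, 162]
```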
Also, we need to construct a derivative matrix $\bar{A} = J - I - A$, where $A$ is the adjacency matrix of $G$, $J$ is the matrix of all 1s, and $I$ is the identity matrix; namely,
$$\bar{A} = \begin{bmatrix}
-1 & 1 & 1 & 1 & 0 & 1 & 1 & 1 \\
1 & -1 & 1 & 1 & 1 & 0 & 1 & 1 \\
0 & 1 & 0 & 1 & 0 & 1 & 1 & 1 \\
1 & 0 & 1 & 0 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 1 & 0 & 1 \\
1 & 1 & 0 & 1 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 1 & 1 & 1 & 0 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 1 & 0
\end{bmatrix}.$$
The matrix $\bar{A}$ is the adjacency matrix of the complementary digraph $\bar{G}$. The corresponding characteristic polynomials are $P(G; x) = \det(xI - A)$ and $P(\bar{G}; x) = \det(xI - \bar{A})$. Let $W(x) = \sum_{k \ge 0} N_k x^k$ be the generating function of the number of walks in the digraph $G$, wherein the coefficient $N_k$ of $x^k$ is the number of walks of length $k$ in $G$. Taking into account that a walk of length $k$ ($k \ge 0$) corresponds to a substring of length $k + 3$ of the α-code, one can also write down the generating function $S(x)$ for the number of substrings (of the α-code) of length $n$. In the next subsection, we shall demonstrate that $W(x)$ can readily be calculated analytically with the aid of the spectral theory of graphs.
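A minimal sketch of building the derivative matrix from its definition (all-ones matrix minus identity minus adjacency matrix). Two things are assumptions of this sketch and are not stated in the text: the vertex order in `ORDER`, and the convention that entry (i, j) records an arc from the j-th to the i-th vertex. Neither choice affects the characteristic polynomial.

```python
# Assumed vertex order and arc-direction convention; see the note above.
ORDER = ["111", "222", "112", "221", "211", "122", "121", "212"]
REMOVED = {("121", "212"), ("212", "121")}


def has_arc(u: str, v: str) -> bool:
    """Arc u -> v of the working digraph (the two opposite arcs are removed)."""
    return u[1:] == v[:2] and (u, v) not in REMOVED


m = len(ORDER)
# Entry (i, j) is 1 when there is an arc from ORDER[j] to ORDER[i].
A = [[int(has_arc(ORDER[j], ORDER[i])) for j in range(m)] for i in range(m)]
# Derivative matrix: all-ones minus identity minus adjacency.
A_bar = [[1 - int(i == j) - A[i][j] for j in range(m)] for i in range(m)]

for row in A_bar:
    print(" ".join(f"{x:2d}" for x in row))
```

With these assumptions the diagonal entries equal to -1 mark the two self-loops, at 111 and 222.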

Main Part
We start this subsection by adapting to our notation the following fundamental result (see Theorem 1.11 in [3]).

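Lemma 1. Let $G$ be a finite connected (di)graph. Then the generating function $W(x)$ of the number of walks in $G$ is
$$W(x) = \frac{1}{x}\left[(-1)^m \, \frac{P\!\left(\bar{G};\, -\frac{x+1}{x}\right)}{P\!\left(G;\, \frac{1}{x}\right)} - 1\right], \qquad (6)$$
where $m$ is the number of vertices in $G$.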
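The lemma can be checked numerically for the working digraph. The sketch below assumes that the walk generating function has the standard form ((-1)^m P(complement; -(x+1)/x) / P(digraph; 1/x) - 1) / x, where P denotes a characteristic polynomial, and compares its value with a truncated power series whose coefficients come from repeated multiplication by the adjacency matrix. Exact rational arithmetic is used; all identifiers are hypothetical.

```python
from fractions import Fraction
from itertools import product

V = ["".join(t) for t in product("12", repeat=3)]
REMOVED = {("121", "212"), ("212", "121")}
m = len(V)
A = [[int(u[1:] == v[:2] and (u, v) not in REMOVED) for v in V] for u in V]
A_bar = [[1 - int(i == j) - A[i][j] for j in range(m)] for i in range(m)]


def det(M):
    """Exact determinant by Gaussian elimination over Fractions."""
    M = [[Fraction(x) for x in row] for row in M]
    size, d = len(M), Fraction(1)
    for c in range(size):
        p = next((r for r in range(c, size) if M[r][c] != 0), None)
        if p is None:
            return Fraction(0)
        if p != c:
            M[c], M[p] = M[p], M[c]
            d = -d
        d *= M[c][c]
        for r in range(c + 1, size):
            f = M[r][c] / M[c][c]
            M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return d


def charpoly_at(M, x):
    """Value of det(x*I - M) at the point x."""
    size = len(M)
    return det([[(x if i == j else 0) - M[i][j] for j in range(size)]
                for i in range(size)])


def walks_rhs(x):
    """Assumed closed form of the walk generating function (see lead-in)."""
    ratio = charpoly_at(A_bar, -(x + 1) / x) / charpoly_at(A, 1 / x)
    return ((-1) ** m * ratio - 1) / x


def walks_series(x, terms=80):
    """Truncated series sum of N_k * x**k, with N_k obtained from powers of A."""
    vec, total = [Fraction(1)] * m, Fraction(0)
    for k in range(terms):
        total += sum(vec) * x ** k
        vec = [sum(A[i][j] * vec[j] for j in range(m)) for i in range(m)]
    return total


x = Fraction(1, 10)   # a point well inside the radius of convergence
print(float(walks_rhs(x)), float(walks_series(x)))
```

The two printed values should agree to within the truncation error of the series.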
Journal of Theoretical Chemistry
For the practical use of (6), one should substitute into its right-hand side the polynomials $P(G; x)$ and $P(\bar{G}; x)$ obtained above. After elementary but tedious manipulations (omitted here), we arrived at an explicit formula for $W(x)$ for our specific digraph $G$.
It should also be noted that, along with the exact result given by Theorem 1.11 of [3], there exists a remarkable asymptotic evaluation of $N_k$, based on Theorem 1.12 in [3]. In our specific case, it gives a simple formula in terms of 8, the number of vertices in the digraph $G$, and 1.83929, its eigenvalue $\lambda_1$ of maximum modulus ($|\lambda_1| > |\lambda_j|$; $2 \le j \le 8$). From it, an approximate formula for the entire series $W(x)$ can be deduced.
From the spectral results obtained for the number of walks, one can immediately derive the respective working formulae for the number of substrings of length $n$ ($n \ge 3$) of the α-code; this yields our final expression.
Beckmann et al. [4] and Brendel et al. [5] studied a measure for evaluating the deviation of the frequency of a string of length $n$ in a genomic sequence from its statistically expected magnitude. Here, we adopt their approach [4,5]. Let $w = w_1 w_2 \cdots w_n$ be a sequence of letters, of length $n$, encoding amino acids in a genome or random sequence ($w_i \in A_{20}$ = {a, c, d, e, f, g, h, i, k, l, m, n, p, q, r, s, t, v, w, y}), and let $N(w)$ be the observed frequency of the string $w$ in a genome. Given the observed frequencies of $(n-2)$- and $(n-1)$-mers, one can calculate the expected frequency $E(w)$ of $n$-mers in a genome. To measure the deviation of the observed frequency of the string $w$ from its expected occurrence in a genome sequence, one further defines the standard deviation std($w$) of $w$. A string $w$ is called a contrast factor (or word) if the absolute value of std($w$) is greater than or equal to some threshold value (which is set to 3.0 in [4,5]). In other words, a contrast word should obey the condition |std($w$)| ≥ 3.
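The eigenvalue 1.83929 quoted in the spectral discussion above can be reproduced with a plain power iteration on the adjacency matrix of the working digraph. This is a sketch with hypothetical names; it relies only on the definition of the digraph (eight triples, arcs by two-cipher overlap, with the arcs between 121 and 212 removed).

```python
from itertools import product

V = ["".join(t) for t in product("12", repeat=3)]
REMOVED = {("121", "212"), ("212", "121")}
A = [[int(u[1:] == v[:2] and (u, v) not in REMOVED) for v in V] for u in V]


def dominant_eigenvalue(M, steps=200):
    """Power iteration started from the all-ones vector, normalised by the maximum."""
    size = len(M)
    vec = [1.0] * size
    lam = 0.0
    for _ in range(steps):
        new = [sum(M[i][j] * vec[j] for j in range(size)) for i in range(size)]
        lam = max(new)                 # current eigenvalue estimate
        vec = [x / lam for x in new]   # rescale so the largest entry is 1
    return lam


print(round(dominant_eigenvalue(A), 5))  # prints 1.83929
```

The ratio of consecutive walk counts (14/8, 26/14, 48/26, and so on) approaches the same value as the walk length grows.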
Table 1 contains the values std($w$) for all 31 8-character contrast factors in the polypeptide sequence of Escherichia coli, wherein the factors are numbered according to the full inventory of all 256 possible 8-character factors over the two-character alphabet $A_2$ = {1, 2}, listed in nonincreasing order of their frequencies $N(w)$. Now we shall consider some conclusions.
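To make the notion of a contrast value concrete, here is a sketch under two loudly stated assumptions that are not taken from the text: the expected count of a word is estimated from the counts of its two (n-1)-mers and its inner (n-2)-mer as N(prefix) * N(suffix) / N(core), and the deviation is normalised by the square root of the expected count (bounded below by 1). The exact formulas of [4,5] may differ. Function names and the toy string are hypothetical.

```python
from collections import Counter
from math import sqrt


def count(code: str, n: int) -> Counter:
    """Overlapping counts of all substrings of length n."""
    return Counter(code[i:i + n] for i in range(len(code) - n + 1))


def contrast_value(code: str, word: str) -> float:
    """Assumed contrast measure for a word of length n >= 3 (see lead-in)."""
    n = len(word)
    observed = count(code, n)[word]
    prefix = count(code, n - 1)[word[:-1]]
    suffix = count(code, n - 1)[word[1:]]
    core = count(code, n - 2)[word[1:-1]]
    # Assumed estimate: N(prefix) * N(suffix) / N(core).
    expected = prefix * suffix / core if core else 0.0
    # Assumed normalisation: sqrt(expected), bounded below by 1.
    return (observed - expected) / max(sqrt(expected), 1.0)


toy = "1122112211221212"   # hypothetical toy input, not E. coli data
value = contrast_value(toy, "1122")
print(value, abs(value) >= 3.0)
```

Under these assumptions, a word would be reported as a contrast factor when the absolute value returned by `contrast_value` reaches the threshold 3.0.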

Conclusions
The following inferences can be drawn.
(1) The distribution of the most contrasting factors approximately corresponds to that of the most frequent ones. For instance, the first 16 of the 31 contrast factors fall into the group of the 41 most frequent factors, whereas the 50 least frequent factors include no contrast factors whatsoever. The last, 31st, contrast word is 206th in the complete list of all 8-character sequences over $A_2$.
(2) It is clearly seen that the "nonpolar" alanine subgroup of amino acids, denoted by 1, and the "polar" glycine subgroup, denoted by 2, play asymmetric parts in composing the contrast words. For instance, 1 is the last character of a word 12 times, while 2 is at the tail 19 times; admittedly, at the head of a word the respective numbers are 14 and 17, whose difference seems less essential. The sums 12 + 14 = 26 and 19 + 17 = 36 also preserve this imbalance of frequencies. Thus, mutually substituting 1s for all 2s, and vice versa, in the words collected in Table 1 does not leave the table invariant. Note additionally that, out of the 8 × 31 = 248 characters comprising the 31 contrast words, 2 appears only 86 times (34.68% of cases). Separately, the number of occurrences of 2 in Table 1 for the words with std($w$) ≥ 3 is 48 (19.35% of cases), and that for the words with std($w$) ≤ −3 is 38 (15.32% of cases). Since the share of 2 in the natural polypeptide sequence of Escherichia coli is 0.474, contrast factors can comprise only a minor part of its total length (a trivial qualitative fact that is easily deduced from the data in Table 1). But another readily seen fact is of paramount importance: contrast factors occur chiefly due to the presence of the fattier alanine subgroup of amino acids, denoted by 1.
(3) Since (the most preferable) contrast factors with std($w$) ≥ 3 and (the most avoidable) contrast factors with std($w$) ≤ −3 have practically equal frequencies in Table 1 (16 and 15, respectively), both types of contrast factors play a commanding role in forming the polypeptide sequence (in particular, that of Escherichia coli). That is, the synthesis of polypeptides in nature is carried out so as to give preference to the maximum number of admissible "preferable" factors and to reject the maximum number of avoidable factors. Though the contrast factors themselves comprise only a minor part of the polypeptide sequence (see above), their existence, with the pronounced preference for some of them and avoidance of the others, crucially controls the synthesis of polypeptides. Accordingly, noncontrast factors occur in a quite statistical way and thereby ensure the conservation of the natural ratio of polar and fatty amino acids. The current study emphasizes the significance of compiling a special dictionary of contrast factors for polypeptides.
(4) Thus, the role of noncontrast factors is to be the main building material for the amino acid sequence of Escherichia coli and, as we may guess, also of such a sequence in any other organism. Since our proposed mathematical approach is applicable to an arbitrary amino acid sequence, it would be of interest to check it for different organisms, as well as to investigate amino acid factors of different lengths (longer than 7, as in the present work).
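The percentages quoted in inference (2) follow from simple arithmetic on the character counts given there, as this short check illustrates (the variable names are hypothetical).

```python
total = 8 * 31                       # characters in the 31 contrast words


def share(k: int) -> float:
    """Percentage of k out of the total number of characters, to two decimals."""
    return round(100 * k / total, 2)


print(total, share(86), share(48), share(38))  # prints 248 34.68 19.35 15.32
```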
The material of this paper has been borrowed entirely from [6]. Additionally, the interested reader may also consult the unsolved combinatorial problems in our earlier publications [7,8].