Early-Stage Folding in Proteins (In Silico) Sequence-to-Structure Relation

A sequence-to-structure library has been created based on the complete PDB database. The tetrapeptide was selected as a unit representing a well-defined structural motif. Seven structural forms were introduced for structure classification. The early-stage folding conformations were used as the objects for structure analysis and classification. The degree of determinability was estimated for the sequence-to-structure and structure-to-sequence relations. Probability calculus and informational entropy were applied for quantitative estimation of the mutual relation between them. The structural motifs representing different forms of loops and bends were found to favor particular sequences in structure-to-sequence analysis.


INTRODUCTION
Prediction of three-dimensional protein structures remains a major challenge to modern molecular biology. On the one hand, identical pentapeptide sequences exist in completely different tertiary structures in proteins [1]; on the other, different amino acid sequences can adopt approximately the same three-dimensional structure. However, the patterns of sequence conservation can be used for protein structure prediction [2,3,4]. Usually, secondary structure definition has been used for ab initio methods as a common starting conformation for protein structure prediction [5]. A large body of experimental and theoretical evidence suggests that local structure is frequently encoded in short segments of protein sequence. A definite relation has been found between the amino acid sequence of a region and the supersecondary structure into which it folds, and this relation was found to be independent of the remaining sequence of the molecule [6,7]. Early studies of local sequence-structure relationships and secondary structure prediction were based on either simple physical principles [8] or statistics [9,10,11,12]. Nearest-neighbor methods use a database of proteins with known three-dimensional structures to predict the conformational states of a test protein [13,14,15,16]. Some methods are based on nonlinear algorithms known as neural nets [17,18,19] or hidden Markov models [20,21,22,23]. In addition to studies of sequence-to-structure relationships focused on determining the propensity of amino acids for predefined local structures [24,25,26,27], others involve determining patterns of sequence-to-structure correlations [21,22,28,29,30]. The evolutionary information contained in multiple sequence alignments has been widely used for secondary structure prediction [31,32,33,34,35,36,37,38].
Prediction of the percentage composition of α-helix, β-strand, and irregular structure based on the percentage of amino acid composition, without regard to sequence, permits proteins to be assigned to groups, as all α, all β, and mixed α/β [5,39].
Structure representation is simplified in many models. Side chains are limited to one representative virtual atom; virtual Cα − Cα bonds are often introduced to decrease the number of atoms present in the peptide bond [40,41]. The search for structure representation in other than the φ, ψ angles conformational space has been continuing [42].
Other models are based on limitation of the conformational space. One of them divided the Ramachandran map into four low-energy basins [43,44]. In another study, all sterically allowed conformations for short polyalanine chains were enumerated using discrete bins called mesostates [45]. The need to limit the conformational space was also asserted [46,47].
The model introduced in this paper is based on limitation of the conformational space to the particular part of the Ramachandran map. The structures created according to this limited conformational subspace are assumed to represent early-stage structural forms of protein folding in silico.
In this paper, in contrast to the commonly used basis of final native structures of proteins, the early-stage folding conformation of the polypeptide chain is the criterion for structure classification.
Two approaches are the basis for the early-stage folding model presented in this paper.
(1) The geometry of the polypeptide chain can be expressed using parameters other than the φ, ψ angles. These new parameters are the V-angle (the dihedral angle between two sequential peptide bond planes) and the R-radius (the radius of curvature), which was found to depend on the V-angle in the form of a second-degree polynomial. Details on the background of the geometric model based on V and R [48,49] are recapitulated briefly in "appendix A."
(2) The structures satisfying the V-to-R relation appeared to distinguish a part of the Ramachandran map (the complete conformational space), delivering a limited conformational subspace (an ellipse path on the Ramachandran map). It was shown that the amount of information carried by an amino acid is significantly lower than the amount of information needed to predict the φ, ψ angles (a point on the Ramachandran map). These two amounts of information can be balanced by restricting the conformational space to the subspace distinguished by the simplified model presented above. Details on the background of the information-theory-based model [50] are reviewed briefly in "appendix B." The conformational subspace found to satisfy both the geometric characteristics (polypeptide chain reduced to peptide bond planes, with side chains ignored) and the condition of information balancing appeared to select the part of the Ramachandran map which can be treated as the early-stage conformational subspace.
The introduced model of early-stage folding was extended to make it applicable to the creation of starting structural forms of proteins for an energy-minimization procedure oriented to protein structure prediction. The aim of this paper is to characterize the sequence-to-structure and structure-to-sequence contingency tables and their possible applicability.
The structures created according to the limited conformational subspace can be reached in two different ways: (1) as the partial unfolding (Figures 1a-1e) and (2) as the basis for the initial structure assumed to represent early-stage folding (Figures 1f-1j). The partial unfolding of the native structural form (called the "step-back" structure in this paper) is expressed by changing the φ, ψ angles to the φ_sb, ψ_sb angles (the φ_sb, ψ_sb angles belong to the ellipse path, and their values are obtained according to the criterion of the shortest distance between φ, ψ and the ellipse, as shown in Figure 1b). The second approach, in which the structure is created on the basis of the φ_es, ψ_es angles (φ_es, ψ_es denote the dihedral angles belonging to the ellipse and representing a particular probability maximum), is based on the library of sequence-to-structure relations for tetrapeptides.
A scheme summarizing the two procedures, partial unfolding and partial folding, is shown below (Figure 1). The procedure called partial unfolding starts at the native structure of the protein (Figure 1a). The values of the φ, ψ angles present in the protein are changed (according to the shortest-distance criterion) to the values of the angles belonging to the ellipse (φ_sb, ψ_sb). When these dihedral angles are applied, the structure of the same protein appears as shown in Figure 1c. When this procedure is applied to all proteins present in the Protein Data Bank, a probability profile can be obtained which represents the distribution of φ, ψ angles in the limited conformational subspace. The distribution is different for each amino acid, although some characteristic maxima can be distinguished. The profile shown in Figure 1d represents Glu (the ellipse-equation t-parameter equal to 0° represents the point φ = 90° and ψ = −90°, and t then increases clockwise). Particular probability maxima can be recognized using the letter codes also shown in Figure 2. These letter codes are used to classify the structures of proteins in their early-stage folding (in silico) (Figure 1e).
The opposite procedure, aimed at protein folding, is also shown in Figures 1f-1j. The starting point in this procedure is the amino acid sequence of a particular protein.
After selecting four-amino-acid fragments (in an overlapping system), four different structural codes (for the same tetrapeptide) can be attributed on the basis of the contingency table described above (Figure 1f). Only a particular fragment of the probability profile (according to the letter code) can be recognized in this case. In consequence, the φ_es, ψ_es values representing the location of the probability maximum on the t-axis can be attributed to a particular sequence (Figure 1g). This is why the φ_es, ψ_es angles differ from the φ_sb, ψ_sb angles. In consequence, the structure of the transforming growth factor β binding protein-like domain (protein selected as an example, PDB ID: 1APJ) created according to the φ_es, ψ_es angles and shown in Figure 1h differs from the (φ_sb, ψ_sb)-based structure. The "sb" (step-back) and "es" (early-stage) structures differ because the probability distribution is continuous in the "sb" procedure and discrete in the "es" procedure. The next step in the prediction procedure is energy minimization, which in some cases brings the structure closer to the native one (Figure 1j).
The structures created according to the ellipse path, treated as starting structures for the energy-minimization procedure, deliver forms that approach the native structure after one simple optimization procedure. BPTI [51], ribonuclease [50], and, to some extent, human hemoglobin α and β chains [52] and lysozyme [53] were used as the model molecules. All these examples proved that the ellipse-path-limited conformational subspace helped define the initial structure for the energy-minimization procedure, leading to proper, native-like structures without any forms inconsistent with protein-like ones. When the energy-minimization procedure is not sufficient to deliver the proper native-like structure of the protein (which can be seen in Figures 1a and 1j), an additional procedure is necessary (Figure 1i). It is under study now and will be published in the near future.
[Figure 1 panel labels: Native structure; Step-back conformation; Sequence-to-structure contingency table; AA sequence; Early-stage conformation; Late-stage folding simulation (not applied) + Energy minimization; Predicted structure; Step-back unfolding path; Folding simulation path.]

Early-stage folding structure classification
All proteins present in PDB (release January 2003) were taken for analysis [54]. Letter codes have been used for sequence identification. A letter code system is introduced in this paper for structure representation in protein early-stage folding (in silico) based on the probability distribution of φ, ψ angles along the ellipse-path-limited conformational subspace (see "appendix B"). To easily distinguish the structure codes versus sequence codes, the former are printed in bold and the latter in italics in this work.
Comparison of distributions between three-state secondary structures indicated four-amino-acid fragments as the most common ones for α-helices, β-strands, and loops [21,55]. The tetrapeptide was adopted as the unit for investigation of the sequence-structure relation.
The probability distribution along the ellipse, which is assumed to represent the limited conformational subspace, is the basis for the structure classification introduced in this paper. The profile of the probability distribution (of all amino acids) along the ellipse path is shown in Figure 2. Figure 2a shows the usual distribution of φ, ψ angles as found in proteins together with the ellipse path.
The procedure of moving particular φ, ψ angles to the ellipse path is also shown in Figure 2a. The shortest distance between particular φ, ψ angles (a point on the Ramachandran map) and a point belonging to the ellipse path locates the φ_e, ψ_e dihedral angles (e denotes belonging to the ellipse), which determine the early stage for a particular amino acid of the polypeptide chain. After moving all φ, ψ angles to the ellipse path, the profile of the probability distribution can be obtained, as shown in Figure 2b. The t-parameter is the ellipse parameter present in the equation shown in "appendix A." The t-parameter equal to zero represents the point φ = 90° and ψ = −90° on the Ramachandran map and increases clockwise, as shown in Figure 2c. Seven probability maxima can be distinguished in this profile. Each of them is letter coded.
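The shortest-distance step described above can be sketched as follows. This is only a minimal illustration: the ellipse used here is a hypothetical one (half-axes `A` and `B` and its orientation are assumptions, chosen solely so that t = 0 falls at φ = 90°, ψ = −90° and t increases clockwise); the actual ellipse equation is the one referred to in "appendix A." The function names `ellipse_point` and `nearest_t` are likewise illustrative, and angular periodicity of the map is ignored for simplicity.

```python
import math

# Hypothetical ellipse parameters (NOT the values used in the paper).
# A is fixed so that t = 0 lands on (phi, psi) = (90, -90), as stated in the text.
A = 90.0 * math.sqrt(2.0)  # long half-axis in degrees
B = 50.0                   # short half-axis in degrees (illustrative assumption)


def ellipse_point(t_deg):
    """Return (phi, psi) on the assumed ellipse for parameter t given in degrees."""
    t = math.radians(t_deg)
    s = math.sqrt(0.5)
    # Long axis along (1, -1), short axis along (1, 1);
    # the minus sign on the sine term makes t increase clockwise.
    phi = s * (A * math.cos(t) - B * math.sin(t))
    psi = s * (-A * math.cos(t) - B * math.sin(t))
    return phi, psi


def nearest_t(phi, psi, step=1.0):
    """Brute-force search for the t whose ellipse point is closest to (phi, psi)."""
    n = int(round(360.0 / step))
    candidates = [k * step for k in range(n)]
    return min(candidates, key=lambda t: math.dist((phi, psi), ellipse_point(t)))


# Example: move a typical right-handed helical point onto the assumed ellipse.
t_helix = nearest_t(-60.0, -45.0)
phi_e, psi_e = ellipse_point(t_helix)
```

A brute-force scan over t is used because it is easy to verify; a closed-form or iterative projection would be preferable for large data sets.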
This coding system was applied to classify the structures of all proteins analyzed. The codes introduced according to the probability distribution shown in Figure 2b are interpreted as follows: C (t-value range) represents right-handed helical structures, E represents β-structural forms, and G represents left-handed helices. The β-structural forms are differentiated (some amino acids like Ala, Ser, Asp reveal two probability maxima [50]); this is why code F also represents β-like structures.
Although all other letters represent structural forms not identified in the traditional classification, the presence of probability maxima suggests the need to distinguish these categories (code A mostly for Pro and Gly, code B represented mostly by Asn and Asp, and code D characteristic for Tyr and Asn, to take a few examples).

The contingency table
A window size of four amino acids (analogous to the open reading frame in nucleotide identification) with a one-amino-acid step (overlapping system) was applied to code the sequences and structures in proteins. Potentially 160 000 (20^4) different sequences for tetrapeptides can occur (columns). Taking seven different structural forms for each amino acid in a tetrapeptide, 2401 (7^4) structural forms can be distinguished for a tetrapeptide (rows). These numbers give an idea of the size of the contingency table under consideration. For all cells, probability values of $p^t$, $p^c$, and $p^r$ were calculated as follows:
\begin{align}
p^t_{ij} &= \frac{n_{ij}}{N^t}, \tag{1}\\
p^c_{ij} &= \frac{n_{ij}}{N^c_j}, \tag{2}\\
p^r_{ij} &= \frac{n_{ij}}{N^r_i}, \tag{3}
\end{align}
where i denotes a particular structure (row), j denotes a particular sequence (column), $n_{ij}$ is the number of polypeptide chains belonging to the ith structure and representing the jth sequence, $N^t$ is the total number of ORFs, and $N^c_j$ and $N^r_i$ denote the number of ORFs belonging to a particular jth sequence and ith structure, respectively. The table expressing all probabilities ($p^t_{ij}$, $p^c_{ij}$, and $p^r_{ij}$) is available on request at http://www.bioinformatics.cm-uj.krakow.pl/earlystage/. All values are expressed on a logarithmic scale because of the very low probability values in the cells of the table.
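As an illustration of how the three probabilities of a cell relate to the raw counts, the following minimal sketch builds a tiny contingency table from (structure, sequence) pairs. The function name `contingency_probabilities` and the toy observations are invented for this example; the normalizations (by the grand total, by the column total, and by the row total) follow the definitions of N t, N c j, and N r i given above.

```python
from collections import Counter


def contingency_probabilities(pairs):
    """Compute p_t, p_c, p_r for every nonzero cell.

    pairs: iterable of (structure_code, sequence) tetrapeptide observations.
    Returns three dicts keyed by (structure_code, sequence).
    """
    n = Counter(pairs)            # n_ij: count per (structure i, sequence j)
    total = sum(n.values())       # N^t: all observed tetrapeptides
    row = Counter()               # N^r_i: total per structure (row)
    col = Counter()               # N^c_j: total per sequence (column)
    for (structure, sequence), count in n.items():
        row[structure] += count
        col[sequence] += count
    p_t = {key: count / total for key, count in n.items()}
    p_c = {key: count / col[key[1]] for key, count in n.items()}
    p_r = {key: count / row[key[0]] for key, count in n.items()}
    return p_t, p_c, p_r


# Toy observations (invented for illustration, not taken from the PDB-derived table).
toy = [("CCCC", "AAAA"), ("CCCC", "AAAA"), ("CCCC", "AAAA"),
       ("EEEE", "AAAA"), ("EEEE", "VIVI")]
p_t, p_c, p_r = contingency_probabilities(toy)
```

Only nonzero cells are stored, which matters for a table of roughly 2400 rows by 147 000 columns in which most cells are empty.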

Information entropy as a measure of sequence-to-structure and structure-to-sequence predictability
High values of probability calculated as above (relative to potential probability values) can disclose highly coupled pairs of structure and sequence. Ranking the probability values can extract the highly determined relations for both sequence-to-structure and structure-to-sequence.
Structural predictability can also be measured using informational entropy calculation. According to Shannon's definition [56], the amount of information can be calculated as follows:
\begin{equation}
I_i = -\log_2 p_i, \tag{4}
\end{equation}
where $I_i$ expresses the amount of information (in bits) dependent on $p_i$, the probability of event i. This definition is very useful for measuring the amount of information carried by a particular simple (elementary) event. In the case of a complex event, for which several solutions are possible, informational entropy can be calculated, expressing the level of uncertainty in predicting the solution. Informational entropy according to Shannon's definition is as follows:
\begin{equation}
SE = -\sum_{i=1}^{n} p_i \log_2 p_i, \tag{5}
\end{equation}
where n is the number of possible solutions for the event under consideration (the number of elementary events). SE reaches its maximum value for all $p_i$ equal to each other, that is, each ith solution is equally probable for the event under consideration and no solution is preferred. The maximum value depends on the number of possible solutions for the event (n).
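The two Shannon quantities used here can be written as a short sketch. The function names `information` and `shannon_entropy` and the example distributions are illustrative choices, not part of the original work.

```python
import math


def information(p):
    """Amount of information (bits) carried by an elementary event of probability p."""
    return -math.log2(p)


def shannon_entropy(probs):
    """Shannon entropy (bits) of a distribution; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


# Four equally probable solutions give the maximum entropy for n = 4.
uniform = [0.25, 0.25, 0.25, 0.25]
# A biased distribution over the same four solutions has lower entropy.
biased = [0.7, 0.1, 0.1, 0.1]

se_uniform = shannon_entropy(uniform)
se_biased = shannon_entropy(biased)
```

The uniform case reproduces the statement that SE is maximal when no solution is preferred, and that this maximum depends only on the number of solutions.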
SE equal to zero (a single solution with probability 1.0) represents the determinate case in which only one solution is possible. The higher the difference between max SE and SE, the higher the degree of determinability in the given case. A high max SE − SE value means that the case is realized by a few solutions and that some of them occur with higher probability, which can be interpreted as a case with higher determinability (biased event). SE, max SE, and the values of the differences between them can be calculated for all rows, $SE^r$ (structural preferences versus amino acid sequence), and for columns, $SE^c$ (sequence preference for a particular structural form), in the contingency table. $SE^r$ allowed extraction of structures highly determined by the sequence; $SE^c$ extracted structures highly attributed to a particular sequence.
SE for each column (particular sequence) in the contingency table was calculated as follows:
\begin{equation}
SE^c_j = -\sum_{i=1}^{N^0_j} p^c_{ij} \log_2 p^c_{ij}, \tag{6}
\end{equation}
where $SE^c_j$ denotes informational entropy for the j-column, i denotes a particular row (structure), $N^0_j$ is the number of nonzero cells in the j-column, and $p^c_{ij}$ is calculated according to (2).
The value $SE^c_j$ as calculated according to (6) measures the level of uncertainty in predicting structure for the jth sequence. The closer the SE value to zero, the higher the chance of a successful prediction. max SE expresses quantitatively the level of uncertainty in the most difficult case for making a decision. For the j-column (sequence):
\begin{equation}
\max SE^c_j = -\sum_{i=1}^{N^0_j} \frac{1}{N^0_j} \log_2 \frac{1}{N^0_j} = \log_2 N^0_j. \tag{7}
\end{equation}
Thus the difference between the two quantities ((6) and (7)) can be used as the "distance" between the most difficult situation (all solutions equally possible, that is, a random solution) and the situation observed in the case under consideration. For the j-column:
\begin{equation}
\Delta SE^c_j = \max SE^c_j - SE^c_j. \tag{8}
\end{equation}
Analogous calculations for rows (structures) were performed. For each i-row, the values of $SE^r_i$, $\max SE^r_i$, and $\Delta SE^r_i$ were calculated.
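The column-wise quantities described above (SE, max SE, and their difference) can be sketched for a single sequence as follows. The function name `delta_entropy` and the example counts are hypothetical; the sketch assumes that probabilities are obtained by normalizing the nonzero cell counts of the column and that max SE corresponds to equal probability over those nonzero cells.

```python
import math


def delta_entropy(counts):
    """Return (SE, max_SE, delta_SE) in bits for one column of the table.

    counts: occurrences of each structure for one sequence; zero cells are
    ignored, and at least one cell must be nonzero.
    """
    nonzero = [c for c in counts if c > 0]
    if not nonzero:
        raise ValueError("column has no nonzero cells")
    total = sum(nonzero)
    se = -sum((c / total) * math.log2(c / total) for c in nonzero)
    max_se = math.log2(len(nonzero))  # entropy when all nonzero cells are equally probable
    return se, max_se, max_se - se


# A sequence seen equally often in four structures: no preference, delta is zero.
random_like = delta_entropy([5, 5, 5, 5])
# A sequence found almost always in one structure: strongly biased, delta is large.
determined = delta_entropy([97, 1, 1, 1, 0])
```

The same function applies unchanged to a row of the table, in which case the counts are the occurrences of each sequence for one structure.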

Structures coded according to the introduced system
Structures of all proteins present in the PDB (release January 2003) [56] were analyzed. The φ, ψ angles were calculated for each amino acid. The φ_e, ψ_e angles were calculated according to the shortest distance versus the ellipse. A letter code was assigned for each amino acid according to the ellipse path fragment. Since the tetrapeptide was used as the structural unit, four letters coded one structural unit. The overlapping reading frame system was applied, which means that one amino acid step was applied in structure classification. The maximum combination of seven letter codes for a four-letter string is equal to 2401. This means that 2401 different four-letter strings were expected to be found. It turned out that only 2397 different strings were found in real proteins. Since there are 20 amino acids and four amino acids were taken for the unit, 160 000 different sequences of tetrapeptides were expected; 146 940 different sequences were found in the proteins under consideration.
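The overlapping reading frame and the expected table dimensions can be checked with a few lines. The helper name `tetrapeptide_windows`, the example sequence, and the example structure codes are invented for illustration.

```python
def tetrapeptide_windows(chain, size=4):
    """Return all overlapping windows of the given size, moving one residue at a time."""
    return [chain[i:i + size] for i in range(len(chain) - size + 1)]


# Invented example: a six-residue sequence and its per-residue structure codes (A-G).
sequence = "GSAAKL"
structure = "CCCCEE"

# Each window yields one (structure string, sequence string) observation.
pairs = list(zip(tetrapeptide_windows(structure), tetrapeptide_windows(sequence)))

# Upper bounds on the number of distinct columns and rows of the contingency table.
max_sequences = 20 ** 4   # 20 amino acids, four positions
max_structures = 7 ** 4   # 7 structure codes, four positions
```

A chain of length L contributes L − 3 tetrapeptides, so chains shorter than four residues contribute none.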

Contingency table
Each tetrapeptide found in proteins was described by a four-letter string expressing the sequence and a four-letter string expressing the structure. Each tetrapeptide with a known sequence and known structure can be ordered in the form of a table. The rows of the table represent structures and the columns represent sequences. Finally a 2397 × 146 940 table was constructed. To distinguish the structure codes from sequence codes, sequence codes are in bold capital letters and structure codes in italics. The scheme of the contingency table is presented in Table 1. The total number of tetrapeptides in the analyzed database was found to be 1 529 987. Global analysis of the contingency table shows that the maximum number of different structures attributed to the same tetrapeptide is 144. This tetrapeptide appeared to be of the sequence GSAA. The maximum number of different sequences was found for α-helix (CCCC: 90 587) and for β-structure (EEEE: 47 809). Four structures were not found in the library: ABAB, ABBD, ABFB, DBAB.

Information entropy calculation
SE, max SE, and the value of the difference between these two quantities (ΔSE) were calculated according to the procedure presented in "material and methods." They can be calculated for columns (sequences) and for rows (structures) separately. The calculation of $SE^c_j$ for the j-column expresses the information entropy related to the structural differentiation of a particular sequence. The calculation of $SE^r_i$ for the i-row in the contingency table expresses the sequential differentiation for a particular structure. max SE according to information entropy characteristics expresses the entropy for the case in which all the nonzero cells have equal probability. For $SE^c_j = \max SE^c_j$, all structures for a particular sequence are equally probable. Equal probability for a set of elementary events (different structures) represents the random situation. The bigger the difference max SE − SE, the more deterministic the case. This is why the difference (ΔSE) between SE and max SE was taken to measure the degree of structure-to-sequence (or vice versa) determination.
The interpretation of Tables 2 and 3 is as follows. The structural predictability for a particular sequence can be estimated in the first case, and the predictability of the sequence for a particular structure in the latter case. The results for only the top ten structures and top ten sequences are shown in Tables 2 and 3. The highest structural predictability for a particular sequence is found for polyalanine, confirming it as a highly probable helical structure. Generally, the highly predictable structures for particular sequences are helical forms (Table 2).
The sequence predictability for particular structural forms displayed a quite unexpected regularity. The structures representing irregular structural forms appeared to reveal the strongest entropy decrease versus the random distribution of sequences. This can be seen by analyzing the letter codes for the structures (Table 3).
The top ten structures presented in Table 3 are also shown in Figure 3. In summary, one can say that when a particular irregular structural form is expected in a protein, there are preferable sequences to build these irregular motifs; they are shown in Table 3. This seems to be of particular relevance for threading procedures oriented to the production of new proteins not observed in nature.

DISCUSSION
Particular classes of amino acid relations to particular structural forms in proteins were recently found to solve the problem of structure predictability [57]. All papers concerning this subject linked sequence with structure as it appears in the final native form of the protein. The model introduced in this paper represents an approach to the relation between sequence and structure in the early-stage folding structural form; the bases for the model are presented in detail elsewhere [48,49,50], and verified by BPTI [51], ribonuclease [50], hemoglobin [52], and lysozyme [53] folding. The (in silico) early-stage structures of these proteins can be found in the corresponding publications.
Figure 3. Structures of tetrapeptides with highest structure-to-sequence determinability as found using informational entropy calculation (see "material and methods" and Table 1). Gray terminal fragments represent the extended form of polyalanine (tetrapeptides) to emphasize the mutual spatial orientation of terminal fragments. Other colors distinguish ellipse fragments as follows: red (A), green (B), violet (C), sky-blue (D), yellow (E), dark blue (F), orange (G). The data for creation of these structures is given in Table 1 and Figure 2.
Several algorithms for quantitatively assigning α-helix, β-strand, and loop regions for proteins with known structure have been developed [58,59,60,61]. The three-dimensional model presented in this paper shows that it is enough to select seven fragments of the ellipse with well-defined probability maxima to be able to predict the early-stage structural form.
The high structure-to-sequence relation found for loops (Table 3, Figure 3) may be particularly important, since a recent survey of 31 genomes indicated that disordered segments longer than 50 residues are very prevalent [62]. Helices, sheets, and turns together account for only about 50%-55% of all protein structure on average [63]; the remaining structures are classified as several types of loops [63,64]. Current estimations suggest that over 50% of proteins in eukaryotes may carry unstructured regions of more than 40 residues in length [65], while less than 1% of the proteins in the PDB contain such long disordered regions. These observations taken together imply that many proteins with disordered regions would be unlikely to form crystals [66]. Proteins containing long, disordered segments under physiological conditions are frequently involved in regulatory functions [67], and the structural disorder may be relieved upon binding of the protein to its target molecule [68,69]. Intrinsically unstructured proteins and regions, which are also known as natively unfolded and intrinsically disordered, differ from structured globular proteins and domains with regard to many attributes, including amino acid composition, sequence complexity, hydrophobicity, charge, flexibility, and type and rate of amino acid substitutions over evolutionary time [66]. Compared to highly ordered secondary structure regions, the loops and turns are more difficult to identify due to the absence of hydrogen bonding and repeating backbone dihedral angle patterns [70]. The first computational tool indicating the predictability of disordered regions from protein sequence [71] was a neural network predictor (PONDR). Several other disorder predictors have been published since then [72,73,74]. Statistically based turn propensity used over a four-residue window was described [75]. The inverse folding problem is the design of protein sequences that have a desired structure [76,77].
It is impossible to mention even a small part of the papers dealing with the sequence-to-structure relation. Recently, it was concluded that the probability of any state (φ,ψ) is influenced by the full sequence and not only by the local structure [78].
A genome-scale fold recognition program exploring the knowledge-based structure-derived score function for a particular residue was proposed incorporating three terms: backbone torsion, buried surface, and contact energy [79].
Unlike many others, our model, dual in nature, incorporating sequential and structural information, predicts sequence-to-structure as well as structure-to-sequence.
The contingency table was independently analyzed using another statistics-related method (Meus J, Stefaniak J. The Z coefficient as a measure of dependence in contingency tables (unpublished data), Meus J, Brylinski M, Piwowar P, et al. A tabular approach to the sequence-to-structure relation in proteins (unpublished data)). High accordance was found between the results presented in this paper and in the statistical analysis: the top ten sequences and structures presented in Table 1 were found to be among the most highly correlated, both in sequence-to-structure and in structure-to-sequence, on the ranking list created by the alternate calculation method. The order of the two ranking lists is very similar, additionally confirming the reliability of the model presented.
Aside from early-stage structure prediction, the contingency table presented may contribute to conventional secondary structure prediction, local and supersecondary structure prediction, location of transmembrane regions in proteins, location of genes, or sequence design.
The list of highly determinable tetrapeptides (in sequence-to-structure and structure-to-sequence relations) also allowed the SPI (structure predictability index) scale to be defined [80]. Applied to amino acid sequences, this scale helps to measure the degree of difficulty of structure prediction for a particular amino acid sequence without knowledge of the final, native structure of the protein.
The sequence-to-structure and structure-to-sequence contingency tables, which are created on the basis of all proteins of known structure (step-back procedure), can be used to create the early-stage folding (in silico) structure. Applied to other (late-stage folding) procedures, they presumably can enable protein structure prediction. The early-stage form was used as the object for comparison to simplify the presentation of the structure (seven possibilities). The SPI (structure predictability index) parameter, attributed to any amino acid sequence, allows estimation of the degree of difficulty in structure prediction. The probability values (which can be higher or lower) taken from particular cells of the contingency table can tell how often a particular structure occurs in the protein database so far. The information entropy-based classification presented in this paper allows highly distributed structural forms to be distinguished for a particular tetrapeptide sequence.

APPENDIX A
The main assumption for the model presented below is that all structural forms of polypeptides in proteins can be treated as helical. The β-structure in this approach is a helix with a very large radius of curvature. The radius of curvature depends on the V-angle, which expresses the dihedral angle between two sequential peptide bond planes. The quantitative analysis of the relation between these two parameters (V and R) used the following procedure.
(1) The structure of the alanine pentapeptide was created for each 5° grid point on the Ramachandran map. Each alanine present in the pentapeptide represented the φ, ψ angles appropriate for a particular grid point.
(2) Before the parameters (R, V) were calculated, all structures (for each grid point) were oriented in a unified way: the averaged position of the carbonyl oxygen atoms and the averaged position of carbonyl carbon atoms determined the Z-axis.
(3) The radius of curvature was calculated for projections of Cα atoms on the xy plane. The radius of curvature for extended (and β-structural) forms is very large (theoretically infinite). This is why the natural logarithmic scale was introduced to express the magnitude of R.
(4) The V-angle was calculated as the difference between the tilt of the central peptide bond plane and the tilt of two (averaged) neighboring peptide bond planes.
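Step (3) can be illustrated with a minimal sketch. The text does not state how the radius of curvature was obtained from the projected Cα positions, so the approach below (the radius of the circle through three consecutive projected points) is an assumption made only for illustration; the function names `radius_of_curvature` and `ln_radius` are likewise hypothetical. It shows why extended forms give a theoretically infinite radius and why a logarithmic scale is convenient.

```python
import math


def radius_of_curvature(p1, p2, p3):
    """Radius of the circle through three 2D points; math.inf if they are collinear."""
    a = math.dist(p2, p3)
    b = math.dist(p1, p3)
    c = math.dist(p1, p2)
    # Twice the signed triangle area, from the 2D cross product.
    cross = (p2[0] - p1[0]) * (p3[1] - p1[1]) - (p2[1] - p1[1]) * (p3[0] - p1[0])
    area = abs(cross) / 2.0
    if area == 0.0:
        return math.inf  # straight line: infinite radius (extended form)
    return a * b * c / (4.0 * area)


def ln_radius(p1, p2, p3):
    """Natural logarithm of the radius of curvature, as used for the R scale."""
    return math.log(radius_of_curvature(p1, p2, p3))


# Three projected points lying on a unit circle (helix-like projection).
r_helix = radius_of_curvature((1.0, 0.0), (0.0, 1.0), (-1.0, 0.0))
# Three collinear projected points (fully extended projection).
r_extended = radius_of_curvature((0.0, 0.0), (1.0, 0.0), (2.0, 0.0))
```

For a pentapeptide with five projected Cα positions, a least-squares circle fit over all points would be a natural alternative to the three-point estimate shown here.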
The Ramachandran map expressing the V-angle distribution and R-radius of curvature (in ln scale) is shown in Figure 6.
The (ln R) dependence on the V -angle for structures representing low-energy conformations is shown in Figure 4. The approximation function found for this relation is as follows: The distribution of φ, ψ angles of structures that satisfy the above equation is shown in Figure 5. The ellipse path found based on this distribution is as follows: where A and B are long and short ellipse diagonals, respectively.

APPENDIX B
The sequence of amino acids in a polypeptide determines its structural form. This statement can also be understood as follows: the amount of information carried by an amino acid sequence is comparable to the amount of information necessary to predict its structure.
The amount (bit) of information carried by a particular amino acid can be calculated using Shannon's equation
\begin{equation}
I_i = -\log_2 p_i, \tag{B.1}
\end{equation}
where $p_i$ expresses the probability of the ith amino acid's presence in a sequence. Assuming all amino acids occur with the same probability (1/20), the amount of information is $\log_2 20 \approx 4.32$ bit.
The amount of information necessary to predict a particular structure (expressed by $\varphi_i$, $\psi_i$ dihedral angles) for the ith amino acid can also be calculated as follows (using the same Shannon's equation):
\begin{equation}
I^{\varphi\psi}_i = -\log_2 p^{\varphi\psi}_i, \tag{B.2}
\end{equation}
where $p^{\varphi\psi}_i$ expresses the probability of the ith amino acid to represent the φ, ψ dihedral angles. Assuming 1° as the step for exploring the Ramachandran map and assuming that the Ramachandran map is flat (all φ, ψ angles equally possible), the amount of information I is calculated for $p^{\varphi\psi}_i$ equal to 1/(359 * 359), which gives approximately 16.98 bit.
This simple comparison shows a large difference between the two amounts of information, which makes the situation highly nonequilibrated.
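The comparison can be reproduced numerically with a short sketch. The variable names are illustrative; the two probabilities (1/20 for an amino acid and 1/(359 * 359) for a point on a flat Ramachandran map with 1° steps) are the ones assumed in the text.

```python
import math

# Information carried by one amino acid when all 20 are equally probable.
bits_per_residue = -math.log2(1 / 20)

# Information needed to select one (phi, psi) point on a flat map with 1-degree steps.
bits_per_phi_psi = -math.log2(1 / (359 * 359))

# How many times more information the structure requires than the residue carries.
imbalance_ratio = bits_per_phi_psi / bits_per_residue
```

Under these assumptions a residue carries about 4.3 bit, whereas selecting its φ, ψ angles requires about 17 bit, roughly four times more.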
The value of p i is different from 1/20 in real proteins because the frequency of amino acids differs.
The value of $p^{\varphi\psi}_i$ also depends on the amino acid under consideration. The assumption of equal probability of φ, ψ angles cannot be accepted. Predicting particular φ, ψ angles is relatively easy for proline and most difficult for glycine. Prediction of particular φ, ψ angles is connected with the selection decision. This means selection of φ, ψ from among 359 * 359 possible solutions. Moreover, particular φ, ψ angles are not equally possible. The information entropy measuring the degree of uncertainty in φ, ψ angle selection (according to Shannon's equation) is calculated as follows:
\begin{equation}
SE_i = -\sum_{k=1}^{N} p_{i,k} \log_2 p_{i,k}, \tag{B.3}
\end{equation}
where index i denotes the amino acid under consideration, $p_{i,k}$ denotes the probability of occurrence of the particular φ, ψ angles of the kth grid point calculated for the ith amino acid, N denotes the number of grid points (depending on the step size for φ, ψ angles all over the Ramachandran map), and $SE_i$ expresses the mean value (quantity) of information (bit) necessary to select one solution from among the number that represents the complete event space (359 * 359 in our case). The mean value takes into account the different probabilities for different φ, ψ angles and also the dependence on the amino acid under consideration (ith). SE can be interpreted as a scale to measure the predictability characteristic for a particular amino acid. It was shown that the SE scale places Gly and Pro at opposite positions on the ranking (scoring) list of amino acids.
Table 4. Amount of information ($I_i$ (bit)) carried by a particular amino acid, calculated on the basis of the frequency, and amount of information ($SE^{\varphi_e\psi_e}_i$ (bit); φ_e, ψ_e denote φ, ψ angles belonging to the ellipse) necessary to predict the structure belonging to the ellipse path (early-stage folding conformational subspace) with 10° step of t-angle precision (see ellipse equation in "appendix A"). Detailed analysis of the data shown in this table can be found elsewhere [50].
The 10° * 10° step for φ, ψ angles precision prediction still needs a large amount of information to be equilibrated with the amount of information carried by a particular amino acid (in this case N is equal to 35 * 35). Analysis of the ellipse path from the point of view of SE calculation reveals that this limited conformational subspace (with 10° steps along the ellipse expressed as N as in (B.4)) satisfies the condition of balancing (Table 4) the amount of information carried by amino acid and the amount of information necessary for selection of the structure belonging to the ellipse path representing the limited conformational subspace with 10° precision:
\begin{equation}
SE^{\varphi_e\psi_e} = -\sum_{i=1}^{N} p_i \log_2 p_i, \tag{B.4}
\end{equation}
where $p_i$ denotes the probability value for a particular point on the ellipse (particular t-parameter), and N denotes the number of points selected (it is coupled with the t-parameter step size). The ellipse path presented in "appendix A" appeared to satisfy two important conditions. (i) Almost all structurally important forms of polypeptide are present in this conformational subspace; and (ii) the amount of information carried by the amino acid and the amount of information needed to predict a particular structural form belonging to the conformational subspace are equilibrated. Details on the information problem can be found elsewhere [50]. Figure 5a shows the relation between the φ, ψ angles of Ser distribution all over the Ramachandran map, with the ellipse path distinguished by a black line. The distribution of the φ, ψ angles of Ser after moving them toward the ellipse path is shown in Figure 5b. The overlapping of the probability profiles of all amino acids is shown in Figure 1b.