Statistical Analysis of Terminal Extensions of Protein β-Strand Pairs

The long-range interactions required for accurate prediction of the tertiary structures of β-sheet-containing proteins are still difficult to simulate. To remedy this problem and to facilitate β-sheet structure prediction, many computational methods have been developed. However, known efforts on β-sheets mainly focus on interresidue contacts or amino acid partners. In this study, to go one step further, we studied β-sheets at the strand level by performing a statistical analysis of the terminal extensions of paired β-strands. In most cases, the two paired β-strands have different lengths, and terminal extensions exist. The terminal extensions are the parts of the paired strands that extend beyond the common paired part. Nevertheless, we found that the best pairing requires a terminal alignment and that β-strands tend to pair so as to make bigger common parts. As a result, 96.97% of β-strand pairs have a ratio of at least 25% of the paired common part to the whole length. Also, 94.26% and 95.98% of β-strand pairs have a ratio of at least 40% of the paired common part to the length of the first and the second β-strand, respectively. Interstrand register prediction methods that search for interacting β-strands among several alternative offsets should comply with this rule to reduce the computational search space and improve algorithm performance.


Introduction
The issue of protein structure prediction is still extremely challenging in bioinformatics [1,2]. Usually, structural information for protein sequences with no detectable homology to a protein of known structure can be obtained by predicting the arrangement of their secondary structural elements [3]. The two predominant protein secondary structures are α-helices and β-sheets. However, a combination of early suitable α-helical model systems and sustained research has resulted in a detailed understanding of the α-helix, while comparatively little is known about the β-sheet [4]. Tertiary structures of β-sheet-containing proteins are especially difficult to simulate [3,5]. Unlike α-helices, β-sheets are more complex, resulting from a combination of two or more disjoint peptide segments called β-strands. Therefore, the β-sheet topology is very useful for elucidating protein folding pathways [6,7], for predicting tertiary structures [3,8–11], and even for designing new proteins [12–14].
As fundamental components, β-sheets are plentiful in protein domains. In a β-sheet, multiple β-strands are held together by hydrogen bonds and can be classified into parallel and antiparallel direction styles. Adjacent β-strands bring residues that are distant in sequence into close spatial contact with one another and constitute a specific mode of amino acid pairing interactions (like DNA base pairing) [1,15–17]. There is a growing recognition of the importance of the strand-to-strand interactions among β-sheets [18]. Several studies, including statistical studies examining frequencies of nearest-neighbor amino acids in β-sheets, found a significantly different preference for certain interstrand amino acid pairs at nonhydrogen-bonded and hydrogen-bonded sites [1,17,19,20]. Dou et al. [21] created a comprehensive database of interchain β-sheet (ICBS) interactions. We also developed the SheetsPair database [22] to compile both the interchain and the intrachain amino acid pairs.
Generally speaking, previous work on β-sheets mainly focused on interresidue contacts or amino acid partners [23–28]. Prediction of interresidue contacts in β-sheets is interesting, while ab initio structure prediction is also useful for understanding protein folding [29,30]. Our previous studies showed that interstrand amino acid pairs play a significant role in determining the parallel or antiparallel orientation of β-strands [15], and that the statistical results could possibly be used to predict β-strand orientation [16]. Cheng and Baldi [11] introduced the BETAPRO method to predict and assemble β-strands into a β-sheet, in which a single misprediction of one amino acid pairing in the first stage can be amplified by the next stages and result in a seriously wrong set of partner assignments between β-strands. However, those studies can be viewed as initial steps of β-sheet studies relative to predicting strand-level pairing [25]. In this paper, to go one step further, we investigate β-strand pairing at the strand level to explore the rules of how β-strands form a β-sheet.
Many results have shown the importance of statistical analysis in protein structure studies [15,16]. In particular, statistical information could provide a starting point for de novo computational design methods, which are now becoming successful for short, single-chain proteins [14], as well as for protein structure prediction methods and the understanding of protein folding mechanisms [31,32]. Fooks et al. [1] also indicated that such statistical analysis results would be useful for protein structure prediction. Therefore, we advocate using the tools of statistics and informatics to study β-sheets and generate new rules for algorithm development. In this study, we focused on the terminal extensions of paired β-strands.

Dataset.
All protein structure data used in this study were taken from a PISCES [33,34] dataset generated on May 16, 2009. In the dataset, the percentage identity cutoff is 25%, the resolution cutoff is 2.0 angstroms, and the R-factor cutoff is 0.25. Secondary structures were assigned from the experimentally determined tertiary structures by using the DSSP program. In addition to removing proteins containing disordered regions [35–37], all data were further preprocessed according to the following criteria: (i) protein chains containing no β-sheet were removed; (ii) protein chains with nonstandard three-letter residue names (such as DPN, EFC, ABA, C5C, PLP, etc.) were removed, since these indicate that the protein chains have covalently bound ligands or modified residues; (iii) protein chains with uncertain structures or incorrect data were removed. Since β-bulges tend to be isolated and rare [11], we did not consider β-bulges in this study, following several previous studies [1,3]. Note that in the special case of β-bulges, no amino acid pair is assigned. Finally, 2,315 protein chains were extracted, containing 19,214 β-strand pairs.
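Criteria (i) and (ii) can be sketched as a simple chain filter. This is a minimal illustration rather than the pipeline used in the study: the function name `keep_chain`, its two inputs, and the use of the DSSP state "E" (extended strand) as the marker of a β-sheet-containing chain are assumptions made for the example.

```python
# Hypothetical chain filter illustrating criteria (i) and (ii).
# Criterion (iii) (uncertain or incorrect data) is not modeled here.

# The 20 standard amino acids by three-letter residue name.
STANDARD_RESIDUES = frozenset({
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
})


def keep_chain(residue_names, dssp_states):
    """Return True if the chain passes criteria (i) and (ii).

    residue_names: three-letter residue names of the chain, in order.
    dssp_states:   DSSP secondary-structure string for the same chain.
    """
    # Assumption: a chain contains a beta-sheet if any residue has state 'E'.
    has_strand = "E" in dssp_states
    # Any name outside the standard set (e.g., ABA, PLP) rejects the chain.
    all_standard = all(name in STANDARD_RESIDUES for name in residue_names)
    return has_strand and all_standard
```

A chain such as `keep_chain(["ALA", "ABA"], "EE")` is rejected because ABA is not a standard residue name.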

The β-Sheet Structure.
The β-sheets, in which two or more β-strands are arranged in a specific conformation, are illustrated in Figure 1(a) by a protein example (PDB code 1HZT). Adjacent strands, or the so-called strand pairs, can run either in the same (parallel) or in the opposite (antiparallel) direction. In protein 1HZT, there are 3 β-sheets called A, B, and C, formed by 10 different β-strands numbered from 1 to 10, making 7 different β-strand pairs. The 10 β-strands can be named by the β-sheet each belongs to and the index number in the order of partnership. For example, the 3 β-strands forming β-sheet A can be called "A1," "A2," and "A3," while the 4 β-strands forming β-sheet B can be called "B1," "B2," "B3," and "B4." "A1-A2," "A2-A3," "B1-B2," "B2-B3," and "B3-B4" are all β-strand pairs. The sequences of the 10 β-strands with their initial and ending residue numbers are also given in Figure 1(b).

Different Lengths of Paired β-Strands.
For a β-strand pair, the terminal of one β-strand does not always align with the terminal of the other (Figure 2), producing "terminal extensions" beside the common paired part. Note that only amino acids in the common part form amino acid pairs.
Why do "terminal extensions" exist so widely in β-strand pairs? We first investigated the lengths of the two paired β-strands and then calculated the percentage of cases in which "terminal extensions" do or do not exist. The results are shown in Table 1.
As shown in Table 1, pairs in which the two β-strands have the same length account for only 29.53% of all samples. In the other 70.47% of samples, the lengths of the two paired β-strands are different.

Statistical Results of Variables.
We define the following variables.
(1) Let L1 and L2 represent the lengths of the two paired β-strands, respectively. The length of the β-strand with the smaller strand number (strand numbers can be obtained from the PDB database) is defined as L1, while the length of the other β-strand is defined as L2.
(2) Let PL stand for the length of the common part, which is often smaller than L1 and L2.
(3) Terminal extensions can be found in either of the two β-strands. We define the lengths of the two terminal extensions as E1 and E2, respectively. The length of the terminal extension of the β-strand with length L1 is defined as E1, and the other as E2.
(4) Let EL represent the whole length: EL = PL + E1 + E2. Then, the pairing ratio R can be calculated by
R = (PL / EL) × 100%.
The ratio of the common paired part to the length of each β-strand (Ri, i = 1, 2) can be calculated by
Ri = (PL / Li) × 100%, i = 1, 2.
A small percentage of β-strand pairs have no "terminal extensions"; for these, the R, R1, and R2 values are 100%.
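The variable definitions above can be sketched in a few lines of code. The names used here (`l1` and `l2` for the two strand lengths, `e1` and `e2` for the two terminal extensions, `r`, `r1`, and `r2` for the ratios) are labels chosen for this illustration, and the sketch assumes that each terminal extension is simply the strand length minus the common paired length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairStats:
    pl: int      # length of the common paired part
    e1: int      # terminal extension of the first strand
    e2: int      # terminal extension of the second strand
    el: int      # whole length of the pair
    r: float     # common part as a percentage of the whole length
    r1: float    # common part as a percentage of the first strand
    r2: float    # common part as a percentage of the second strand


def pair_stats(l1: int, l2: int, pl: int) -> PairStats:
    """Compute the pairing variables from two strand lengths and the common length."""
    if not 1 <= pl <= min(l1, l2):
        raise ValueError("pl must be between 1 and the shorter strand length")
    e1 = l1 - pl           # assumption: extension = strand length - common part
    e2 = l2 - pl
    el = pl + e1 + e2      # whole length
    return PairStats(pl, e1, e2, el,
                     100.0 * pl / el, 100.0 * pl / l1, 100.0 * pl / l2)
```

For example, `pair_stats(5, 7, 4)` gives extensions of 1 and 3 residues, a whole length of 8, and a pairing ratio of 50%, while a fully aligned pair such as `pair_stats(4, 4, 4)` gives 100% for all three ratios.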
We calculated PL, L1, L2, E1, E2, and EL for all β-strand pairs in the present dataset. Table 2 gives the range of these variables as well as their averages and standard deviations.
We also calculated R, R1, and R2 for all β-strand pairs in the present dataset. The distributions of these variables are shown in Figure 3.

Strands Tend to Align Their Terminals.
For the 70.47% of samples with different strand lengths, the differences are not big in most cases. Only a small percentage of samples (below 2.09%) have a length difference above 5. In these different-length cases, it is obvious that the strands cannot align both terminals (with both E1 = 0 and E2 = 0). They have two ways to choose from: either align at only one terminal, leaving one "terminal extension," or align at neither terminal, leaving two "terminal extensions." However, it can be seen from Table 1 that most β-strands tend to be in the former case. For example, for a length difference of 1, the former case accounts for 85.18% while the latter accounts for only 14.82%. This is consistent with the case of same-length strand pairs, in which β-strands tend to align their terminals with each other. This suggests that β-strands tend to align their terminals: in different-length strand pairs, they still retain one terminal alignment, although they cannot align both ends.
From Table 2, it can be seen that β-strands are not very long, ranging from 1 to 25 amino acids with an average length of about 4-5 amino acids. The averages and standard deviations of the lengths of the two paired β-strands (L1 and L2) are similar. The length of the common part PL has a range similar to that of the β-strand lengths. This indicates that although "terminal extensions" exist, the common paired part occupies most of each β-strand, while the "terminal extensions" occupy only a small part. The fact that the maximum value of EL is 29, only a little bigger than the maximum β-strand length, and the fact that on average each "terminal extension" has only about 1 amino acid (E1 = 1.05 and E2 = 1.03) also support this conclusion.
Figure 3 gives the percentage of samples for R, R1, and R2 in each range of their possible values (from 0% to 100%). It can be seen that the distributions of R1 and R2 are similar. More than half of the β-strand pairs have these two variables above 95% (in the range 95-100%). A big R1 or R2 means a big common part of the β-strands, or small "terminal extensions." Few β-strand pairs have small values of R, R1, and R2, which indicates that most β-strands do not pair by means of a small "common part" or big "terminal extensions." It can be concluded from these results that β-strands tend to pair with bigger common paired parts, leaving smaller "terminal extensions."
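The distinction between aligning both terminals, one terminal, or neither can be made concrete with a small sketch. The interval representation is an assumption of this example: each strand is given as an inclusive (start, end) position range on a shared register axis, and for an antiparallel pair one strand is taken to be already mapped onto that axis. The function name `aligned_terminals` is likewise hypothetical.

```python
def aligned_terminals(strand_a, strand_b):
    """Return (number of aligned terminals, common paired length).

    Each strand is an inclusive (start, end) interval on a shared register
    axis; this coordinate convention is an assumption of the sketch.
    """
    a_start, a_end = strand_a
    b_start, b_end = strand_b
    # Length of the overlap of the two intervals (the common paired part).
    pl = min(a_end, b_end) - max(a_start, b_start) + 1
    if pl < 1:
        raise ValueError("strands do not overlap, so they are not paired")
    # A terminal is aligned when both strands start (or end) at the same position.
    aligned = int(a_start == b_start) + int(a_end == b_end)
    return aligned, pl
```

Two strands of the same length can reach 2 aligned terminals. Strands of different lengths reach at most 1, which corresponds to the "former case" discussed above; 0 corresponds to the "latter case."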

Possible Reasons for β-Strand Extensions.
Why do "terminal extensions" exist so widely in β-strand pairs? The fact that the lengths of two paired β-strands are not the same in most cases, as shown in Table 1, is the first reason; by contrast, most same-length β-strand pairs (82.95%) tend to align their terminals with each other, leaving no "terminal extensions." A β-strand is led to pair with another by several kinds of potential forces. Steward and Thornton [3] indicated that a single β-strand was still able to recognize a noninteracting β-strand with greater accuracy than in the case of two random sequences. The potential forces include hydrogen bonds, van der Waals forces, electrostatic interactions, ionic bonds, hydrophobic effects, and so forth. Parisien and Major [38] revealed that among all the forces, the most important one was the construction of a hydrophobic face. It is conceivable that one residue of a β-strand prefers to pair with a residue of another strand so as to reach a stable state of hydrophobic effects. Optimizing such interactions may result in extensions, which could be the second reason, since more often than not the "terminal alignment" is not the optimal pairing style.
Figure 4: Cumulative percentages (CPs) of R, R1, and R2 calculated from the present dataset. The horizontal axis denotes the percentage of the common paired region relative to EL (for curve R) or to L1 and L2 (for curves R1 and R2). Points on the R curve denote the cumulative percentages of samples whose R equals or exceeds the corresponding abscissa value. Points on the R1 and R2 curves denote the cumulative percentages of samples whose R1 or R2 equals or exceeds the corresponding abscissa value, respectively.
A third possible reason could be the nucleation events that initiate β-sheet folding. Amino acids in the central part could pair first, with pairing then extending toward the terminals.
Another reason is the role of the nonpaired terminal amino acids in stabilizing the β-sheet structure. Several other studies have identified their key roles in modulating protein folding rates, stability, and folding mechanisms [39–43]. Therefore, the β-strand terminals could also be important factors in β-sheet formation.

Ratio Rule of Pairing Strand Alignment.
To quantify the common paired part of paired β-strands, we calculated the cumulative percentages of the variables R, R1, and R2 and depicted them in Figure 4.
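A cumulative percentage of this kind can be computed as the share of samples whose ratio equals or exceeds each threshold. The following is a minimal sketch; the function name `cumulative_percent` and the toy input values are illustrative only and are not taken from the dataset.

```python
def cumulative_percent(values, thresholds):
    """For each threshold, return the percentage of values that are >= it."""
    n = len(values)
    if n == 0:
        raise ValueError("at least one sample is required")
    return [100.0 * sum(v >= t for v in values) / n for t in thresholds]


# Toy ratios (in percent) for four hypothetical strand pairs.
toy_ratios = [100.0, 80.0, 50.0, 30.0]
print(cumulative_percent(toy_ratios, [25, 40, 90]))  # [100.0, 75.0, 25.0]
```

Applying the same function to the per-pair ratios of a real dataset would give curves that decrease as the threshold rises.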
From Figure 4, it can be seen that when R1 ≥ 40% and R2 ≥ 40%, the cumulative percentages reach 94.26% and 95.98%, respectively, while when R ≥ 40% the cumulative percentage is only 89.89%. When R ≥ 25%, the cumulative percentage reaches 96.97%. Therefore, a rule for the alignment of a β-strand pair can be stated as follows: R ≥ 25% and Ri ≥ 40%, i = 1, 2. Almost all samples (above 94%) obey this rule. In a β-strand alignment prediction algorithm, all possible pairings must be examined and scored, which is a time-consuming task. Kato et al. [44] stated that the prediction of planar β-sheet structures is NP-hard in the present state of our knowledge (http://en.wikipedia.org/wiki/NP-hard). However, the rule above could be used as a constraint on the relative positions in β-strand alignment to reduce the computational search space, which could help to develop high-speed β-strand topology prediction algorithms.
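One way such a constraint could prune a register search is sketched below. This is an illustration under stated assumptions, not an algorithm from the study: the function name `allowed_offsets`, the offset convention, and the plain sliding of one strand along the other are all hypothetical, and the thresholds default to the 25% and 40% values of the rule.

```python
def allowed_offsets(l1, l2, r_min=25, ri_min=40):
    """List offsets of strand 2 relative to strand 1 that satisfy the ratio rule.

    Offset d places the first residue of strand 2 at position d on the axis
    where strand 1 occupies positions 0 .. l1-1 (a hypothetical convention).
    """
    kept = []
    for d in range(-(l2 - 1), l1):          # every offset with >= 1 paired residue
        pl = min(l1, d + l2) - max(0, d)    # length of the common paired part
        el = l1 + l2 - pl                   # whole length covered by the pair
        # Integer comparisons avoid floating-point rounding at the thresholds.
        if (100 * pl >= r_min * el
                and 100 * pl >= ri_min * l1
                and 100 * pl >= ri_min * l2):
            kept.append(d)
    return kept
```

For two strands of 5 residues each, 9 offsets give at least one paired residue, and 7 of them (offsets -3 to 3) satisfy the rule. The reduction is larger when the lengths differ: for lengths 4 and 10, only 7 of 13 offsets remain.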

Conclusion
At the most straightforward level, full "identification" of a β-strand pair could consist of (i) finding the interacting partner β-strand(s), (ii) predicting the relative orientation (i.e., parallel or antiparallel), and (iii) shifting the relative positions of the two interacting β-strands [15,16]. In this study, we focused on the third aspect. The formation of protein structure and the protein folding mechanism are very complex, and the mechanisms of β-sheet formation are unclear [45]. However, simple rules could contribute to developing new algorithms for the full prediction of β-sheets and to understanding protein folding pathways in ongoing research.
In this study, to go one step further, we studied β-sheets at the strand level instead of the amino acid level. Statistical analyses of the terminal extensions of paired β-strands were performed, and a simple rule, "R ≥ 25% and Ri ≥ 40%, i = 1, 2," was derived. Steward and Thornton [3] developed an information theory approach to predict the relative offset positions by shifting one β-strand up to 10 residues either side of that observed. Such a rule could be used in similar studies. We believe that the conclusions presented in this study could contribute to predicting protein structures and to developing β-sheet prediction methods.

Conflict of Interests
The authors have declared that no conflict of interests exists.