Characterization and Prediction of Protein Flexibility Based on Structural Alphabets

Motivation. To assist efforts in determining and exploring the functional properties of proteins, it is desirable to characterize and predict protein flexibility. Results. In this study, the conformational entropy is used as an indicator of protein flexibility. We first explore whether conformational changes can capture protein flexibility. Well-defined decoy structures are converted into one-dimensional series of letters from a structural alphabet. Four different structure alphabets are investigated: the secondary structure in 3-class and 8-class, the PB structure alphabet (16-letter), and the DW structure alphabet (28-letter). The conformational entropy is then calculated from the structure alphabet letters. Some of the proteins show high correlation between the conformational entropy and the protein flexibility. We then predict the protein flexibility from the amino acid sequence alone. The local structures are predicted by the dual-layer model, and the conformational entropy of the predicted class distribution is then calculated. The results show that the conformational entropy is a good indicator of protein flexibility, but false positives remain a problem. The DW structure alphabet performs best, which suggests that more subtle local structures can be captured by a larger number of structure alphabet letters. Overall, this study provides a simple and efficient method for the characterization and prediction of protein flexibility.


Introduction
Proteins are dynamic molecules that are in constant motion. Their conformations depend on environmental factors such as temperature, pH, and interactions [1]. Some regions are more susceptible to change than others. Such motions play a critical role in many biological processes and applications, such as protein-ligand binding [2], virtual screening [3], antigen-antibody interactions [4], protein-DNA binding [5], structure-based drug discovery [6], and fold recognition [7,8].
Many studies have tried to predict protein flexibility using either sequence or structure information of proteins [9]. Sonavane et al. [10] analyzed the local sequence features and the distribution of B-factors in different regions of protein three-dimensional structures. Yuan et al. [11] adopted a support vector regression (SVR) approach with multiple sequence alignments as input to predict the B-factor distribution of a protein from its sequence. Schlessinger and Rost [12] found that flexible residues differ from regular and rigid residues in local features such as secondary structure, solvent accessibility, and amino acid preferences. They combined these local features with global evolutionary information for protein flexibility prediction. Several sequence-based B-factor prediction methods were compared by Radivojac et al. [13]. Different models have been proposed to predict the B-factor distribution from protein atomic coordinates. Normal mode analysis can identify the most mobile parts of the protein as well as their directions by focusing on the few Cα atoms that move the most [14,15]. The translation libration screw (TLS) model [16] simplifies the protein as a rigid body with movement along translation, libration, and screw axes. The Gaussian network model (GNM) [17] represents a protein as an elastic network of Cα atoms that fluctuate around their mean positions. Recently, Yang et al. [18] predicted the B-factor by combining local structure assembly variations with sequence-based and structure-based profiling. There are also many other methods for protein flexibility prediction [19–21].
All the above methods use the B-factors, or temperature factors, produced by X-ray crystallography to elucidate the flexibility of proteins. The B-factor reflects the degree of thermal motion and static disorder of an atom in a protein crystal structure [22]. However, experimentally determined B-factors are noisy. Many factors can affect the value of a B-factor, such as the overall resolution of the structure, crystal contacts, and, importantly, the particular refinement procedures [23]. B-values from different structures therefore cannot be reasonably compared [12]. Some researchers consider that the upper limit of accuracy for the prediction of B-factors is no more than 80% [11].
Protein structures are not static and rigid. The polypeptide backbones and especially the side chains are constantly moving due to thermal motion and the kinetic energy of the atoms (Brownian motion) [24]. A recent study [1] used the continuum prediction of secondary structures to identify regions undergoing conformational change. Other researchers have pointed out that continuous secondary structure assignment can capture protein flexibility [25]. Furthermore, the MolMovDB database [26] consists of structures that are experimentally determined to exhibit conformational flexibility, enabling a variety of protein motions. The Morph Server [27] in particular has been used by many scientists to analyze pairs of conformations and produce realistic animations.
The present work aims to explore whether the conformations predicted from protein sequences can characterize their flexibility. To achieve this goal, a simplified description of protein structure has to be provided first. The protein secondary structure offers only a summary of the general backbone conformation and of local interactions through hydrogen bonding. The DSSP program [28] provides 8-class secondary structures. However, most secondary structure prediction methods only predict 3-class states, with nearly 80% accuracy [29,30]. Secondary structures are a very crude description of protein backbone structures. Recently, many studies have tried to describe protein structures in a more refined manner. Toward this goal, many fragment libraries or structure alphabets (SA) have been presented, either in Cartesian coordinate space or in torsion angle space [31–33]. Camproux et al. first derived a 12-letter alphabet of fragments using a hidden Markov model [34] and then extended it to 27 letters using the Bayesian information criterion [35]. De Brevern et al. [36] proposed a 16-letter alphabet generated by a self-organizing map based on a dihedral angle similarity measure. The prediction accuracy of local three-dimensional structure has steadily increased by taking sequence information and secondary structure information into consideration [37]. A comprehensive evaluation of these and other structural alphabets was performed by Karchin et al. [38].
In this study, we first explore whether conformational variants can capture protein flexibility. The multiple conformations of proteins are taken from the Baker decoy sets [39]. Each three-dimensional conformation is represented by a one-dimensional series of letters from a structural alphabet. Four different structure alphabets are investigated: the secondary structure in 3-class and 8-class, the PB structure alphabet [37], and the DW structure alphabet [40]. The conformational entropy is used to quantify the flexibility. The results show that, for some proteins, the conformational entropy is highly correlated with the B-factor. We then predict the protein flexibility from the amino acid sequence alone. The structure alphabet letters of proteins are predicted using only sequence information, and the entropy of the predicted class distribution is used as an indicator of protein flexibility. The experiment is performed on a subset of the MolMovDB database [26]. The results indicate that the conformational entropy is a good indicator of protein flexibility.

Materials and Method
2.1. Dataset. Three datasets are used in this study for different experimental validations.
The first dataset is taken from the work of Bodén and Bailey [1] and is used for the prediction of protein flexibility. This dataset contains 171 nonredundant protein sequences, in which no pair of sequences has more than 20% sequence identity. All the proteins exhibit conformational flexibility according to the comprehensive database of macromolecular movements (MolMovDB) [26]. Each sequence in this dataset has been annotated with a list of residue positions that have more than one local structure according to the structure alphabets.
The second dataset is used to train the support vector machine that predicts the local structures of proteins. This dataset is a subset of the PDB database [41] obtained from the PISCES [42] web server. There is less than 25% sequence identity between any two proteins, and every protein has a resolution better than 2.5 Å. Structures with missing atoms or chain breaks are excluded. Proteins that are homologous to proteins from the first dataset are also excluded. The resulting dataset contains 928 protein chains.
The third dataset is used to test whether changes of local structures can characterize the protein flexibility. To achieve this goal, a variety of conformations for each protein must be provided. We use the Baker decoy sets [39], previously used for the evaluation of knowledge-based mean force potentials. This dataset consists of 41 single-domain proteins with varying degrees of secondary structure and lengths from 25 to 87 residues. Each protein comes with about 1400 decoy structures generated by the ab initio protein structure prediction method Rosetta [43].

Training and Test of Local Structures.
Many methods have been presented for the prediction of protein local structures. The dual-layer model developed in our previous study [44] is adopted here. The method is based on the observation that neighboring local structures are strongly correlated. The position specific score matrix (PSSM), generated by PSI-BLAST [45], is input to the first-layer classifier, whose output is further refined by a second-layer classifier. At each layer, a variety of classifiers can be used, such as support vector machines (SVM) [33], neural networks (NN) [46], or hidden Markov models (HMM). In this study, the SVM is selected as the classifier, since its performance is better than that of the other classifiers. Experimental results show that the dual-layer model provides an efficient method for protein local structure prediction.

Characterization of Protein Flexibilities by Conformational Changes.
The conformations of proteins are represented by local structures in the form of a structural alphabet; the set of all local structure types is referred to as the structure alphabet. Four different structure alphabets are investigated here: the secondary structure in 3-class and 8-class, the PB structure alphabet [37], and the DW structure alphabet [40]. A three-dimensional protein structure can be represented by a one-dimensional structure alphabet sequence according to a specific structure alphabet. Given a protein and its variable conformations, we can convert them into several structure alphabet sequences. The changes of local structures can then be used to characterize the protein flexibility. For example, consider a protein sequence $a_1, a_2, \ldots, a_L$. Its three-dimensional structures and conformations are labeled as structure alphabet sequences, from which we obtain a structure alphabet matrix $p_{11}, p_{12}, \ldots, p_{LN}$, where $p_{ij}$ is the probability of the $j$th structure alphabet letter among the conformations at amino acid position $i$, $L$ is the length of the protein sequence, and $N$ is the total number of letters in the structure alphabet. The conformational entropy is then used as an indicator of the protein flexibility:

$$E(i) = -\sum_{j=1}^{N} p_{ij} \log p_{ij},$$

where $E(i)$ is the conformational entropy of the protein at sequence position $i$. The correlation between the conformational entropies and the B-factors is calculated as follows:

$$\mathrm{Corr} = \frac{\sum_{i=1}^{L} \left(E(i) - \mathrm{Ave}(E)\right)\left(B(i) - \mathrm{Ave}(B)\right)}{\sqrt{\sum_{i=1}^{L} \left(E(i) - \mathrm{Ave}(E)\right)^{2} \sum_{i=1}^{L} \left(B(i) - \mathrm{Ave}(B)\right)^{2}}},$$

where $B(i)$ is the B-factor of the protein at sequence position $i$, and $\mathrm{Ave}(E)$ and $\mathrm{Ave}(B)$ are the average conformational entropy and the average B-factor of the protein.
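The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each conformation is given as a string of structure alphabet letters of equal length, the probability of a letter at a position is estimated as its frequency among the conformations, the natural logarithm is used, and the correlation is taken to be the Pearson coefficient. The function names and the toy data are hypothetical and do not come from the study.

```python
import math
from collections import Counter


def positional_entropies(sa_sequences):
    """Entropy at each position over equal-length structure alphabet strings."""
    if not sa_sequences:
        raise ValueError("need at least one conformation")
    length = len(sa_sequences[0])
    if any(len(seq) != length for seq in sa_sequences):
        raise ValueError("all conformations must have the same length")
    n_conf = len(sa_sequences)
    entropies = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sa_sequences)
        # Assumption: p_ij is the fraction of conformations showing letter j at position i.
        entropies.append(-sum((c / n_conf) * math.log(c / n_conf)
                              for c in counts.values()))
    return entropies


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy data (made up): four conformations of a four-residue protein in a
# 3-letter alphabet, and invented B-factors for the same four positions.
decoys = ["HHHC", "HHEC", "HCEC", "HHEC"]
b_factors = [10.0, 18.0, 25.0, 12.0]

entropy = positional_entropies(decoys)
corr = pearson(entropy, b_factors)
print(entropy, corr)
```

Positions whose letter never changes get zero entropy, so in this toy case only the two middle positions contribute to the correlation.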

Prediction of Protein Flexibilities by Local Structure Entropies.
Let the predicted local structure for a given residue be $P = (p_1, \ldots, p_N)$, where $p_j$ is the probability that the residue is in the $j$th local structure class, and $N$ is the number of local structure classes: 3 for the 3-class secondary structure alphabet, 8 for the 8-class secondary structure alphabet, 16 for the PB structure alphabet, and 28 for the DW structure alphabet. The conformational entropy of a residue is defined as

$$E = -\sum_{j=1}^{N} p_j \log p_j.$$

High entropy indicates relative disorder. Low entropy indicates relative order.
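As a small illustration of this definition, the following Python sketch computes the entropy of one residue's predicted class distribution. The function name is hypothetical, the natural logarithm is an assumption (the base of the logarithm is not stated here), and the input is renormalized defensively in case the classifier outputs do not sum exactly to 1.

```python
import math


def distribution_entropy(probs):
    """Entropy of a predicted class distribution for a single residue."""
    total = sum(probs)
    if total <= 0 or any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative and not all zero")
    # Classes with zero probability contribute nothing (the limit of p*log(p) is 0).
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


# Made-up 3-class predictions: one confident, one maximally uncertain.
confident = [0.98, 0.01, 0.01]
uncertain = [1 / 3, 1 / 3, 1 / 3]

print(distribution_entropy(confident))
print(distribution_entropy(uncertain))
```

With the natural logarithm, the maximum possible value is log N, reached for a uniform prediction, so entropies from alphabets of different sizes lie on different scales.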

Performance Metrics.
The following measures are used to evaluate the prediction of protein flexibilities: sensitivity, specificity, precision, and the Receiver Operator Characteristic (ROC) curve. The first three are defined as follows:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \quad \mathrm{Specificity} = \frac{TN}{TN + FP}, \quad \mathrm{Precision} = \frac{TP}{TP + FP},$$

where TP is the number of true positives (flexible residues correctly classified as flexible residues), FP is the number of false positives (rigid residues incorrectly classified as flexible residues), TN is the number of true negatives (rigid residues correctly classified as rigid residues), and FN is the number of false negatives (flexible residues incorrectly classified as rigid residues). The ROC curve plots the true positive rate as a function of the false positive rate for varying classification thresholds. A ROC score is the normalized area under the ROC curve. A score of 1 indicates perfect separation of positive samples from negative samples, whereas a score of 0 denotes that none of the samples selected by the algorithm is positive.
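The measures above can be computed as in the following Python sketch, given for illustration only. The function names and the toy labels and scores are hypothetical. The ROC score is computed here through the rank-based formulation (the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half), which equals the area under the ROC curve; this is a convenient substitute for integrating the curve and is not necessarily how the scores in this study were obtained.

```python
def confusion_counts(labels, predictions):
    """Return (TP, FP, TN, FN) for boolean labels and predictions (True = flexible)."""
    tp = sum(1 for l, p in zip(labels, predictions) if l and p)
    fp = sum(1 for l, p in zip(labels, predictions) if not l and p)
    tn = sum(1 for l, p in zip(labels, predictions) if not l and not p)
    fn = sum(1 for l, p in zip(labels, predictions) if l and not p)
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Safe division; returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def evaluate(labels, predictions):
    tp, fp, tn, fn = confusion_counts(labels, predictions)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


def roc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made up): six residues, entropy-like scores, threshold 0.45.
labels = [True, True, False, False, False, True]
scores = [0.9, 0.4, 0.5, 0.2, 0.1, 0.8]
predictions = [s > 0.45 for s in scores]

metrics = evaluate(labels, predictions)
auc = roc_score(labels, scores)
print(metrics, auc)
```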

Local Structure Prediction.
Four different structure alphabets are used in this study: the secondary structure in 3-class and 8-class, the PB structure alphabet [37], and the DW structure alphabet [40]. All of them describe the local structures of proteins.
The 3-class secondary structure provides a three-state description of backbone structures: helices, strands, and coils. The 8-class secondary structure provides a more detailed description [28]. However, this description of protein structures is still very crude [47].
Two other structure alphabets are investigated in this study: the PB structure alphabet, defined in torsion angle space, and the DW structure alphabet, defined in Cartesian coordinate space. The PB alphabet [37] is composed of 16 prototypes, each of which is 5 residues in length and represented by 8 dihedral angles. This structure alphabet remains valid even as the size of the databank grows [48]. The DW structure alphabet, developed in our previous study [40], contains 28 prototypes with lengths of 7 residues.
The dual-layer model is used to predict the local structures of proteins [44]. The experiment is performed on the second dataset. The Q-score, that is, the proportion of structure alphabet prototypes correctly predicted, is used to assess the prediction results. This score is equivalent to the $Q_3$ value for secondary structure prediction. The results after 5-fold cross-validation are shown in Table 1. The accuracy of secondary structure prediction is comparable with the current state-of-the-art method [29], while the performances of the other two structure alphabets are significantly better.
The single-layer model takes the position specific score matrix (PSSM) as input and outputs the probabilities of the structure alphabet letters. The dual-layer model adds a second classifier, which takes the output of the single-layer model as input and outputs the final prediction. In both models, support vector machines are used as the classifiers.

Results for the Characterization of Protein Flexibilities.
Since proteins are dynamic molecules, we can investigate whether the conformational changes can capture protein flexibilities. The protein structures are represented by structure alphabet sequences. The conformational entropy is used as an indicator of protein flexibility. The experiment is performed on the third dataset.
The initial results demonstrate that some of the proteins show high correlations between the conformational entropies and the B-factors, while other proteins show low and even negative correlations. After detailed analysis, we find that the correlations are influenced by the distribution of the decoy structures. A uniform distribution often leads to high correlation. The decoy structures are therefore first classified by their Root-Mean-Squared Deviation (RMSD) from the native structure. We then select decoy structures so that they are approximately uniformly distributed across the classes. Some of the proteins and their correlations are listed in Table 2, together with the number of decoy structures. As the number of letters increases, the correlations also increase.
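One possible way to carry out such a selection is sketched below in Python. The details are assumptions made purely for illustration and are not specified in the text: decoys are grouped into RMSD bins of fixed width, and the same number of decoys is drawn at random from every non-empty bin. The function name, the bin width, and the toy RMSD values are hypothetical.

```python
import random
from collections import defaultdict


def select_uniform_by_rmsd(rmsds, bin_width=1.0, seed=0):
    """Return sorted decoy indices with an equal count from each non-empty RMSD bin.

    Assumption: bins are [0, w), [w, 2w), ... and the per-bin count is the size
    of the smallest non-empty bin.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    bins = defaultdict(list)
    for idx, rmsd in enumerate(rmsds):
        bins[int(rmsd // bin_width)].append(idx)
    if not bins:
        return []
    per_bin = min(len(members) for members in bins.values())
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selected = []
    for key in sorted(bins):
        selected.extend(rng.sample(bins[key], per_bin))
    return sorted(selected)


# Toy RMSD values in angstroms (made up): 3 decoys in [0,1), 2 in [1,2), 5 in [2,3).
rmsds = [0.5, 0.7, 0.9, 1.2, 1.8, 2.1, 2.4, 2.6, 2.9, 2.2]
chosen = select_uniform_by_rmsd(rmsds, bin_width=1.0)
print(chosen)
```

A drawback of this simple scheme is that a single sparsely populated bin limits how many decoys are kept overall; merging sparse bins or capping the per-bin count are possible alternatives.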
According to the laws of thermodynamics, the native structure is the one that has the lowest energy. Since proteins are dynamic molecules in living organisms, their structures often fluctuate around the native state. The decoy sets used here are generated by the well-known Rosetta algorithm [43]. These sets contain many decoy structures whose energies are close to that of the native one. The conformational entropies are then derived from the decoy sets. Some of the conformational entropies show high correlation with the protein flexibilities. However, the decoy sets only approximate the true conformational ensembles; there are still some proteins that show low correlations between the entropies and the B-factors (data not shown). This experiment only tries to investigate whether conformational changes can capture protein flexibilities. If the true conformational ensembles could be obtained, a definite answer could be given. However, obtaining them is costly and labor-intensive.

Results for the Prediction of Protein Flexibilities.
The conformations in the animations of each protein are converted into structure alphabet letter sequences according to the specific structure alphabet. If a residue changes its structure alphabet letter among the animations, it is labeled as a flexible residue. Otherwise, it is labeled as a rigid residue.
During the prediction process, the protein local structures are first predicted from the amino acid sequence by the dual-layer model, and then the entropy function is applied to the predicted class distribution of each residue. Residues with entropy larger than a given threshold are predicted to be flexible residues. Otherwise, they are predicted to be rigid residues. Following the work of Bodén and Bailey [1], we use the mean entropy of all residues in our conformation variability dataset as the threshold.
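The labeling rule and the threshold rule described in this section can be sketched as follows. This Python fragment is illustrative only: the function names and the toy data are hypothetical, the conformations are assumed to be equal-length strings of structure alphabet letters, and the per-residue entropies are assumed to have been computed beforehand from the predicted class distributions.

```python
def label_flexible(sa_sequences):
    """True at positions where the structure alphabet letter differs among conformations."""
    if not sa_sequences:
        raise ValueError("need at least one conformation")
    length = len(sa_sequences[0])
    if any(len(seq) != length for seq in sa_sequences):
        raise ValueError("all conformations must have the same length")
    return [len({seq[i] for seq in sa_sequences}) > 1 for i in range(length)]


def mean_entropy_threshold(per_protein_entropies):
    """Mean entropy over all residues of all proteins in the dataset."""
    all_values = [e for protein in per_protein_entropies for e in protein]
    if not all_values:
        raise ValueError("no entropies given")
    return sum(all_values) / len(all_values)


def predict_flexible(entropies, threshold):
    """Predict flexible (True) where the entropy exceeds the threshold."""
    return [e > threshold for e in entropies]


# Toy data (made up): two conformations of one protein and
# precomputed per-residue entropies for two proteins.
true_labels = label_flexible(["HHEC", "HHCC"])
dataset_entropies = [[0.1, 0.2, 0.9, 0.3], [0.5, 0.4]]

threshold = mean_entropy_threshold(dataset_entropies)
predicted = predict_flexible(dataset_entropies[0], threshold)
print(true_labels, threshold, predicted)
```

Because the threshold is a dataset-wide mean, it has to be recomputed for each structure alphabet, since the entropy scale depends on the number of letters.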
The results for the four structure alphabets are shown in Table 3. The corresponding Receiver Operator Characteristic (ROC) curves are given in Figure 1. The different structure alphabets yield different numbers of positive (flexible) and negative (rigid) samples. As the number of letters in the structure alphabet increases, the number of positive samples increases and the prediction performance also increases, which suggests that more subtle local structures can be captured by a larger number of structure alphabet letters. In particular, the precision and ROC scores steadily increase. Overall, the DW structure alphabet achieves the best performance.
The results obtained here are similar to those of Bodén and Bailey [1]. The precisions in this study are higher than those of Bodén and Bailey (0.05 for Sec3 and 0.12 for Sec8), but the ROC scores are a little lower than those of Bodén and Bailey (0.61 for Sec3 and 0.64 for Sec8). This study differs from that of Bodén and Bailey in two main aspects. First, two additional structure alphabets (the PB and DW structure alphabets) are investigated here. Second, a decoy set is used to explore whether conformational change can capture protein flexibility.

Conclusion
In this study, we provide a simple and efficient method for the characterization and prediction of protein flexibility. We first validate that conformational change can capture protein flexibility and then predict protein flexibility from primary sequences. The results show that conformational entropy is a good indicator of protein flexibility. Four structure alphabets with different numbers of letters are investigated. Future work will aim at exploring other structure alphabets that can provide a detailed description of protein backbone structures and even of side-chain structures.