How Good Are Simplified Models for Protein Structure Prediction?

Protein structure prediction (PSP) has been one of the most challenging problems in computational biology for several decades. The challenge is largely due to the complexity of the all-atomic details and the unknown nature of the energy function. Researchers have therefore used simplified energy models that consider interaction potentials only between the amino acid monomers in contact on discrete lattices. The restricted nature of the lattices and the energy models poses a twofold concern regarding the assessment of the models. Can a native or a very close structure be obtained when structures are mapped to lattices? Can the contact based energy models on discrete lattices guide the search towards the native structures? In this paper, we use the protein chain lattice fitting (PCLF) problem to address the first concern; we developed a constraint-based local search algorithm for the PCLF problem for cubic and face-centered cubic lattices and found very close lattice fits for the native structures. For the second concern, we use a number of techniques to sample the conformation space and find correlations between energy functions and root mean square deviation (RMSD) distance of the lattice-based structures with the native structures. Our analysis reveals weakness of several contact based energy models used that are popular in PSP.


Introduction
Proteins are one of the most important organisms in a living cell, virtually participating in almost every process within the cell including carrying oxygen (by hemoglobin), signaling cells (by insulin), fighting infection (by antibodies), and performing metabolism (by enzymes). For proper functioning, a protein has to fold into a native three-dimensional structure, which is unique, stable, and kinetically accessible [1] in a given environment. However, not much is known about the process of folding. Also, the nature of the energy function is yet unknown. Misfolded proteins cause many critical diseases such as Alzheimer's disease, cystic fibrosis, and mad cow disease. Knowledge about the native structure is of paramount importance, specially for rational drug discovery and to understand the basics of life. Protein structure prediction (PSP) is therefore one of the most challenging problems in biology. Due to slowness and expensiveness of in vitro methods, computational methods are of great interest. Given a primary amino acid sequence of protein, the task in PSP is to find its three-dimensional native structure that has the minimum free energy.
In the absence of any known structure or templates, ab initio methods guided by a scoring function or energy function are used to predict structures. However, the complexity of searching for a native structure depends on the underlying model. The all-atomic details pose much complexity on the modeling and require huge computational time [4,5]. Therefore, reduced models are preferred. A general paradigm in PSP [6][7][8] is to generate simple decoys or candidate structures using a reduced or simplified model and then refine them by adding necessary backbone and side-chain atoms [9,10]. These reduced or simplified models are also used in investigating the protein folding process in detail [11,12] and also in the CASP competition by one of the best performing systems such as TASSER [6].

Advances in Bioinformatics
The simplified models are often based on contact based statistical energy models on discrete lattices (cubic or face centered cubic (FCC)). Contact based energy models [13][14][15][16] consider interaction energy only among the amino acid types that are in contact. Moreover, the discrete lattices restrict the backbone atoms only to valid lattice points. Such restricted behavior of both the energy models and the lattices poses a twofold concern for the researchers about the goodness of such reduced models. Firstly, how close we are able to model the native structures of the proteins using discrete lattices [17]? Close fits of the backbone of the native structure provide an estimate of the optimal achievable target on discrete lattices and definitely are an indication of the goodness of the lattices being used. Secondly, how well we can guide the search towards the native structures by using the contact based energy models [13][14][15][16] regardless of the particular lattice being used? An energy model provides effective search guidance when the structures having lower energy values also have more similarity with the native structures. This proximity or similarity with the native structure is measured by using root mean square deviation (RMSD). Lower RMSD values indicate that structures are more similar to the native one.
In this paper, we address the first concern mentioned above by using the protein chain lattice fitting (PCLF) problem. We developed a constraint-based local search algorithm for the PCLF problem for cubic and face-centered cubic lattices and found very close lattice fits for the native structures. Our algorithm starts with a greedy chain growth algorithm and in subsequent iterations improves by taking moves from the neighborhood generated by a set of operators. On a set of 1192 proteins from the PISCES [18] benchmark set, we achieve average RMSD distances of 1.87Å and 1.23Å, respectively, for cubic and FCC cubic lattices. By doing these, we reconfirm the rationale behind selecting these discrete lattices in PSP. For the second concern, we use a number of techniques to sample the conformation space and find correlations between energy functions and root mean square deviation (RMSD) distance of the lattice-based structures with the native structures. The sampling techniques include minimizing RMSD, minimizing the energy functions, and performing a random walk on the conformation space. Using one million samples for each protein and each sampling method, we performed Spearman rank correlation test and found that energy functions such as HP, Barrera, and MJ matrix have mostly negative correlation with the RMSD values. These analyses thus reveal weakness of these contact based energy models, even though they are popular in PSP.
The rest of the paper is organized as follows: a brief summary of the related work in the literature is presented in Section 2; the materials and methods are described in Section 3; experimental results and discussion are presented in Section 4; and the paper is concluded in Section 4 providing a summary of the work and possible future directions.

Related Work
To the best of our knowledge, no significant study has been performed that has evaluated the effectiveness of the simplified models in terms of the ability of the lattice models to represent native structures and in terms of ability of the energy models to guide the search. Researchers have tried to compare the effectiveness of the lattice models with other models [19] and their effectiveness in folding simulation [20]. However, in the latter case, the studies are conducted using simple two-dimensional lattices only. We address the problem of accuracy of lattices from a point of view of representation to show the accuracy level of the lattices up to which they can model the real proteins. We first explore the literature of the PCLF problem and then that of the energy models.
PCLF problem is proved to be NP-complete [21]. Several techniques have been applied to solve the problem such as exhaustive full enumeration [22], dynamic programming [23], chain growth algorithms [2,24], move based local search [25], and specialized force fields [26,27]. Recently Mann et al. [3] proposed LatFit, a tool for PCLF problem for both backbone and side-chain atoms, and achieved the stateof-the-art results. They also proposed another refinement algorithm [28] based on constraint programming techniques.
Contact based energy models [13][14][15][16] with discrete lattices have been used extensively in the literature of PSP [11]. A general trend is to optimize the energy function by using various search techniques such as genetic algorithms [29], constraint programming [30,31], simulated annealing [32,33], and memory based methods [34,35]. However, it is important to assess their performance on a wide variety of protein sequences to see if they really work for simplified models. In the literature, quasichemical approximation techniques to derive contact based energy functions like Miyajawa-Jernigan matrices [14] are criticized for neglecting the peptide bonding of the amino acids [36]. However, no comprehensive study was found in the literature to show the effectiveness of such energy functions. The ability of contact based energy methods to discriminate the native state from the decoys was investigated in [37,38] and a general negative answer was found. Later on, a new empirical energy model was presented in [16] that was much simpler than the Miyajawa-Jernigan matrices and was successfully able to discriminate the native state from the decoy sets. These energy functions are used by the researchers both in fold recognition [39] and in ab initio methods to guide the search [31].
In this paper, we use a constraint-based local search to produce state-of-the-art results for PCLF problem and thus show the effectiveness of cubic and FCC cubic lattices. Using a number of guided sampling techniques, we also perform an assessment of the effectiveness of different energy models. The analysis reveals weakness of those energy functions.

Materials and Methods
Proteins are polymers of amino acid monomers. There are 20 different amino acids. In a simplified model, all monomers have an equal size and all bonds are of equal length. Each monomer is modeled by a point in a three-dimensional lattice (lattice constraint). The given amino acid sequence fits into the lattice: pair of all consecutive amino acids in the sequence are also neighbors in the lattice (chain constraint) and two monomers cannot occupy the same point in the lattice (selfavoiding constraint). A simplified energy function is used in calculating the energy of a structure. Two lattice points , L are said to be in contact or neighbors of each other, if = + ⃗ V for some vector ⃗ V in the basis of . There exist a number of lattice models (see Figure 1). FCC lattice is preferred to cubic lattice since the former provides a higher degree of freedom for placing an amino acid and has a higher packing density [40]. The points on FCC lattice are generated by 12 basis vectors: , and ⃗ V 12 = (−1, 0, −1), and the points on a cubic lattice are generated by 6 basis vectors:

Constraint Programming Model for Protein Structures.
In our constraint programming model, we are given a sequence , where each element ∈ is an amino acid type. Each amino acid is associated with a point = ( , , ) ∈ Z 3 . The decision variables are the , , and coordinates of a point. For a sequence of length , the domain of the variables is the range The first point is assigned as (0, 0, 0), which is a valid point in the FCC lattice. The rest of the points follow the constraint, Here, ⃗ is the absolute vector between points ( +1 , +1 , +1 ) and ( , , ), and { ⃗ V 1 , . . . , ⃗ V 12 } are the basis vectors for FCC lattice. Thus all points satisfy the lattice constraint and chain constraint. The self-avoiding constraint is defined using the all-different constraint, all-different(∀ ). The all-different constraint is defined over a set of points and it is satisfied only if no two elements are the same in the set. We define sqrdist( , ) as the square of Euclidean distances between two points and . Now, contact( , ) = 1, if sqrdist( , ) = 2, and contact( , ) = 0, if sqrdist( , ) ̸ = 2.

Intramolecular Similarity
Measure. The choice of distant measure is very important for our experiments. Distance measures between two structures cRMSD are calculated by taking the square root of the average distance between corresponding atoms of two structures. However, in case of the the molecular structures sampled from molecular dynamics or other forms of sampling, often the structure drifts away from the origin and rotates in an arbitrary way.
Calculating cRMSD requires finding an optimal alignment of two structures first and then calculating RMSD. Moreover, in cases where we wish to find structures that are similar to each other in potential energy (free energy), cRMSD will find a structure with overall minimum average atomic displacement by treating all atoms equally. However, in cases like protein structure prediction, we cannot treat all atoms similarly since atoms on the outside of the protein can often move without affecting the potential energy, while atoms at the centers have more impact on the energy function even for slightest movements. For these reasons, intramolecular distance measures like dRMSD are developed to address the shortcoming of cRMSD based measures. For two given structures = 1 , . . . , and = 1 , . . . , , dRMSD is defined as follows: where function dist( , ) denotes the Euclidean distance between two points and .

Problem
Definition. Now, given the native structure of a protein in full atomic representation and the backbone of the given native structure = 1 , . . . , , in PCLF problem, the task is to find a structure in the lattice, = 1 , . . . , such that the distance between and is minimized. The backbone of a protein structure is defined by the -Carbon positions. In order to normalize, we consider neighborhood distance in the discrete lattices (1 in case of cubic and √ 2 in case of FCC) to be equal to 3.8Å, which is the average distance between two consecutive -Carbon atoms in real proteins. This distance is enough to avoid possible steric clashes after adding other atoms during refinement. In PCLF problem, Algorithm 1: chainGrowthInitialize().
.add( → V , ) (7) if .isEmpty() then  Table 1). The pseudocode for the search is given in Algorithm 3. At each iteration, we randomly select an operator. According to the selected operator type, points are selected randomly. We maintain a tabu list of recently used moves. We use two operators: jump move operator (see [41] for details) and pull move operator proposed in [42]. After the selection of the points, the substructures comprising those positions are reoptimized by allowing all possible valid orientations by jump move or pulled in all possible directions in the neighborhood by using pull moves. In case of jump moves, multiple points are selected depending on the parameter V . Initially, V is set to 1. For a pull move, a single point that is not in the tabu list is selected. After selecting the points, the neighborhood moves are generated using the selected operator. After generation, all the candidate moves are simulated. Simulation of a move temporarily calculates the changes in the heuristic functions without committing the move. After simulation, only the best candidate is selected. The selection is based on the dRMSD value only. However, we always include the current structure as a candidate and the search progresses monotonously in a nonincreasing manner. If the search is not able to find improvement in the global minimum for a number of steps determined by the parameter , we increase the V by one and the parameter is also multiplied by a . They are set to initial values whenever there is an improvement.

Energy Models.
The contact based energy functions are generally used along with lattices in the simplified models. We analyzed three different energy functions proposed in [13,14,16]. Formally, energy of a structure in the setting of simplified models described in Section 3 is defined by the following equation: Here, energy( , ) is defined by the particular energy interaction of the amino acid types, , in a given sequence , for a particular energy model. The first energy model, denoted by hp-basic in this paper, is the basic HP energy model proposed in [13]. It considers only the interaction between hydrophobic residues. In other words, there is an energy potential, −1, defined only for hydrophobic interactions. The other two energy models are elaborate in nature and 20×20 energy models considering all types of interaction between amino acids. They are denoted as mj [14] and bre [16] throughout this paper. Details of these energy matrices are given in [14,16].

Sampling Algorithm.
Sampling the conformation space is important for the second part of our experiment. Since we attempt to find the native structure, we first study the energy values of the structures that are visited when we minimize the dRMSD value. By minimizing dRMSD using our PCLF algorithm in Algorithm 3, we are able to generate sample structures that are within very close proximity of the native structures (experimental results show this). However, in reality the dRMSD values will be unknown and we have to minimize the energy values with the hope that lower energy values will lead to lower dRMSD structures. We therefore also generate sample protein structures by running a search algorithm that minimizes a given energy function. For this, we use Algorithm 3 but with dRMSD being replaced by the given energy function. Lastly, we also use a random walk algorithm that starts with a random initialization and in each iteration picks a random candidate from the randomly Algorithm 3: PCLFSearch(). generated neighbor structures. We do not use a sampling method that randomly generates all structures. For each lattice type, we therefore have 5 different search algorithms to generate samples.
(1) PCLF search: this algorithm is essentially the algorithm presented in Algorithm 3. The search is directly guided by dRSMD. At each iteration, dRSMD is minimized and we reported dRSMD values and values of the energy functions for all selected candidates. (2) Guided search, hp-basic: this algorithm is similar to that presented in Algorithm 3 except in Line 1 and Line 10. The search is guided by the basic HP energy model [13]. In each iteration, we reported values of all the energy functions and dRSMD values of the selected candidate structures. (3) Guided search, mj: this algorithm is similar to that presented in Algorithm 3 except in Line 1 and Line 10. The search is guided by MJ Matrix model proposed in [14]. In each iteration, we reported values of the energy function and dRSMD values of the selected candidate structures. (4) Guided search, bre: this algorithm is similar to that presented in Algorithm 3 except in Line 1 and Line 10. The search is guided by the empirical energy model proposed in [16]. In each iteration, we reported values of the energy function and dRSMD values of the selected candidate structures.
(5) Random walk: we ran another version using no guidance and selecting the candidates randomly from the list of the generated structures. This algorithm is essentially a random walk in the space of feasible structures. In each iteration, we reported values of all the energy functions and dRSMD values of the selected candidate structures.

Implementation.
We implemented our framework using C++ on top of the constraint-based local search system, Kangaroo [43]. The constraints are maintained by invariants in kangaroo. Invariants are special constructs that are defined by using mathematical operators over the variables. Simulation of moves, execution, and necessary propagation of constraint and function values are performed incrementally using Kangaroo.

Results and Discussion
We ran our experiments on the Gowonda clusters provided by Griffith University with nodes equipped with Intel Xeon CPU X5650 processors @2.67 GHz, QDR 4 x InfiniBand Interconnect. Experiments were run on benchmark protein sequences taken from the PISCES [18]. These proteins are originally used in [3]. The proteins were selected by enforcing 40% sequence identity cutoff, with chain length 50-300,factor ≤ 0.3, and resolution ≤ 1.5Å to derive a high-quality set of proteins. The resulting benchmark set contains 1192 proteins exhibiting a mean length of 160 ( = 64).

Accuracy of Lattice Fitting.
We ran our algorithm on both cubic and FCC lattices. For each protein sequence in the benchmark set, we ran our algorithm 5 times with a timeout of 1 hour. Average dRMSD values are reported in Table 1. We report average dRMSD values produced by the greedy chain growth algorithm in the "initial" column and the final average values in the "final" column. Since we consider backboneonly structures denoted by -Carbon atoms, we compare our results with those in the literature with backbone-only models. Average dRMSD values produced by the algorithm by Park and Levitt [2] and LatFit program [3] are reported as in [3]. The values reported in Table 1 show that our approach produces lattice fits with lower dRMSD values. Percentage improvements of our approach over other methods are also shown in Table 1 in column "Imp. " Improvements are defined as Imp. = 100 × (Average dRMSD of our approach-Average dRMSD of other approach)/Average dRMSD of other approach. The differences in the dRMSD value are statistically significant since the number of samples is large and we take average values of 5 runs. However, we performed statistical -test with 95% significance level to ensure the significance of these results. From the lower dRMSD values reported for both types of lattices, we conclude that with these types of discrete lattice, it is possible to generate structures that are within close proximity of the native structures. The state-of-the-art methods for structure prediction confirm this value for the backbone models and also for any other detailed models [8,9]. In other words, these discrete lattices provide realistic backbone for real protein structures.

Effectiveness of Energy Models in Search Guidance.
Ab initio search based PSP algorithms usually minimize a given energy function. The prior assumption is that if the energy model is effective, then the structures with lower energy values have lower dRSMD values too. In other words, minimizing the energy function would result in minimizing dRSMD value and the search will eventually lead to structures closer to the native structure. In that case, the values of the energy functions must show a strong positive correlation with dRSMD. We ran each of 5 variants of sampling algorithms mentioned in Section 3.6 and generated one million sample structures during the search for each type of lattice. After finishing each run, we test correlation between the dRSMD values and values of each of the energy functions reported for the candidate structures of each protein. We performed Spearman rank correlation test for each protein.
The summary of the results is given in Table 2. From the bold-faced values in Table 2, we see that the search when guided by the energy functions shows strong negative correlations (<-0.5) for most proteins, while a positive correlation was ideally desired. Typical plots of dRSMD against the energy function values are shown in Figure 2 for the protein 1A6M. For this protein, we see that, for each of the energy functions, dRSMD value does not decrease with the minimization of the energy values. Similar plots are also found for other protein sequences as well. Thus we conclude that the search when guided by the energy functions, that is, minimizing the energy value does not result in minimizing the dRSMD values.
In Table 2, we also notice that the correlation coefficients are negative or weakly positive (<0.5) for most of the proteins even when the sampling was performed by minimizing dRSMD. These observations are indicative only because in effect we will not minimize dRMSD. Weak positive correlations are also found for two of the energy functions for the samples generated by random walk. In order to test the significance of the weak positive correlations, we further analyze the quality of the structures that were generated by our sampling algorithms. Advances in Bioinformatics 7   Figure 3 shows the scatter plots of minimum dRSMD values found for each protein by each of the sampling algorithms against the minimum dRSMD value found by random walk on both types of lattices. It is clearly visible that the PCLF search produces structures with lower dRSMD (in the plot near the -axis) compared to other methods and thus covers the areas of the search spaces that are closer to the native structures. For the sampling methods that minimize energy functions, most proteins in the chart lie below = line meaning that the minimum dRSMD values found by them are better than the minimum dRSMD values found by the random walk. This further means that the random walk based sampling method does not generate structures that are closer to the native structures and so the weak positive correlations are not very significant. Thus a random walk instead of using energy functions as search guidance would not be useful. Overall these observations reveal the weakness of the energy functions used in this experiment; that is, minimizing them does not necessarily mean guide the search on the way to the native structures.

Conclusions
In this paper, we propose a constraint-based local search framework that produces state-of-the-art results for the protein chain lattice fitting (PCLF) problem for real proteins. This confirms the effectiveness of using discrete lattices in protein structure prediction (PSP). In addition to this, we also analyze several simplified energy functions and their effectiveness to find better structures in terms of dRMSD 8 Advances in Bioinformatics values. Our analysis revealed the weakness of several contact based energy models used in the literature of PSP. The algorithm that we use to find the lattice fits and sampling for the structures is stochastic in nature. In future, we wish to add side chains and secondary motifs and use the discrete lattices effectively for PSP guided by effective energy functions and the heuristics. We also wish to develop a web server providing the service of lattice fitting and extend our work to other lattices as well.

Disclosure
A preliminary version of this work is published earlier in AAAI 2013 conference as a short paper [44].