Ab Initio Protein Structure Prediction Using Pathway Models

Ab initio prediction is the challenging attempt to predict protein structures based only on sequence information and without using templates. It is often divided into two distinct sub-problems: (a) the scoring function that can distinguish native, or native-like structures, from non-native ones; and (b) the method of searching the conformational space. Currently, there is no reliable scoring function that can always drive a search to the native fold, and there is no general search method that can guarantee a significant sampling of near-natives. Pathway models combine the scoring function and the search. In this short review, we explore some of the ways pathway models are used in folding, in published works since 2001, and present a new pathway model, HMMSTR-CM, that uses a fragment library and a set of nucleation/propagation-based rules. The new method was used for ab initio predictions as part of CASP5. This work was presented at the Winter School in Bioinformatics, Bologna, Italy, 10–14 February 2003.


Introduction
Protein structure prediction methods have implicit underlying principles that fall into two categories: evolution and folding. Evolution-based methods seek to find conserved sequence patterns, while folding methods simulate the physical process of folding. A folding pathway is a time series of protein folding events. Most molecular simulation methods, including molecular dynamics (MD) and Monte Carlo (MC), create a pathway implicitly. Other methods enforce certain characteristics of the folding events during the simulation, including some genetic algorithms, neural nets, and a new rule-based approach.

Detailed molecular representations
The MD approach to folding draws its strength from the fundamental nature of its physics-based energy function. Unfortunately, unless simplified models can be used, long simulations are still far too costly to be practical. Head-Gordon and Crivelli have developed the global optimization methods called 'Stochastic Perturbation with Soft Constraints'. The atom-based energy function and novel hydrophobic solvation function of their MD approach is able to discriminate against misfolds. However, the method is still computationally expensive, and it needs improvement in β-strand and loop matching [5].
In Beveridge's protocol, they combined an AMBER united atom empirical energy functions, a GBSA (generalized born/solvent accessibility) for solvent dielectric polarization, Van der Waals and cavitation effects, and a multiple-copy MCSA (Monte Carlo simulated annealing) searching scheme, which is able to escape to some extent from meta-stable local minimum. The results show that the method is able to recover the structures of test cases within 6.0Å RMS [12].

Simplified models and lattice simulations
In Gibbs's ab initio method, the protein conformation is represented using backbone torsion angles and fixed side chains. An evolutionary Monte Carlo algorithm is developed to search through this restricted conformational space. The simple physiochemical force field based on hydrophilic, hydrophobic, steric and hydrogenbonding potentials is used to assess the energies. The 3D structures of polypeptide chains up to 38 residues have been accurately predicted [7].
Scheraga's group used the hierarchical approach for global optimization of an off-lattice simplified chain, with a modified united-residue (UNRES) force field and their conformational space annealing (CSA) global optimization procedure. Good results have been obtained for both a four-and a threehelix protein [17].
LINUS, developed by Rose's group, is a Monte Carlo program that emphasizes the role of steric interactions and conformational entropy. Simple scoring functions represent the hydrogen bonds and hydrophobic interactions [19].
Lattice-based studies represent proteins on a cubic or tetrahedral lattice, and this reduces the conformational space enormously, making even exhaustive simulations possible for short chains (e.g. 27 residues).
The recent face-centred cubic lattice model in Skolnick's group includes the interactions between hydrophobic residues, repulsive interactions between hydrophobic and polar residues, and orientation-dependent polar-polar interactions. Their replica exchange Monte Carlo method is able to reproduce a cooperative all-or-none folding transition and the cooperative formation of secondary structure upon the folding transition [14].
Skolnick also proposed a lattice-based parallel hyperbolic sampling (PHS) Monte Carlo algorithm through the logarithmic flattening of the local highenergy barriers by an inverse hyperbolic sine function, which can overcome the local minima trapping and speed up the 'thermalization' of the protein folding process (meaning the time spent to reach equilibrium). They applied the method to the side chain-only protein model and were able to identify much lower energy structures and explore a larger conformational space than the replicasampling MC method. They also pointed out that the minimum relative RMSD (mrRMSD) is more favourable than lowest-energy for prediction quality. The drawback of PHS is that for a relatively smooth energy landscape system it might be less efficient than other methods [24].

Fragment libraries
The hierarchical condensation of a polypeptide may be roughly modelled by simulations that draw from a fragment library. Each fragment is a preferred conformer for a segment of the chain, usually defined by sequence statistics or motif patterns. Fragment library simulations leapfrog the earliest steps in folding, that being the formation of local structure.
Levitt's group constructed proteins from different-sized fragment libraries (four to seven residues) using a simulated-annealing k -means clustering method. Their discrete approximation model is able to achieve 1Å accuracy with lower complexity for four-and five-residue fragments. However, the complexity for longer fragments still needs to be improved [11]. Their study demonstrates that it is sufficient to use fragments for protein structure simulations. This has relevance to the work of several other groups, including Karplus [10], Baker [1], Jones [9] and Bystroff [18], all of whom use the fragment libraries for simulations.
In general, fragment library simulations use knowledge-based potentials and a simplified side chain representation while swapping fragments drawn from a library. The first such program was Baker's ROSETTA algorithm [1], automated in Bystroff's I-sites/Rosetta server [3], which explores conformation space using MC simulated annealing and a Bayesian knowledge-based potential. Jones's FRAGFOLD starts with a library of common supersecondary units and also applies simulated annealing [9]. Karplus's Undertaker uses HMMs (and other sources) to build a fragment library and optimizes the 'cost of burial' [10]. Fragment library simulations have had perhaps the broadest success in ab initio prediction in the last three CASP meetings.

HMMSTR-CM: rule-based folding in 2D
HMMSTR-CM is a new algorithm based on HMM-STR [4] that was used for predicting contact maps in CASP5. The approach is not a simulation but a set of knowledge-based potentials and rules for building a protein contact maps. Nucleation/condensation-type folding pathways are encoded in the rule set. A contact map is a lowresolution, 2D representation of a protein's 3D Ab initio protein structure prediction using pathway models 399 structure. Contact maps have been used as a tool for protein structure prediction [6,13,15], particularly because the representations of 3D structure that they produce are data that can easily be mined [8,22,23]. Contact maps may be projected into 3D using existing algorithms [2,20,21].
For a given protein sequence, its contact potential map (Figure 1a) is calculated using HMMSTR, a hidden Markov model for local sequence structure correlations. A contact potential is the negative loglikelihood of a contact between a pair of Markov states, one at each of two positions in the sequence. The Markov states for each position in the sequence are assigned using the forward/backward algorithm [16]. In Figure 1a, a low contact potential is coloured red and a high contact potentials are blue. Secondary structures, which can be predicted directly by HMMSTR, can also be identified in the contact potential map, e.g. strong i to i + 4 contacts indicate predicted helix, and predicted β-strands tend to have low contact potentials with other strands. In Figure 1a, three helices and four (or five) β-strands can be identified.
HMMSTR-CM initially overpredicts contacts, with few false negatives. Thus, the accuracy of the ab initio approach depends on the accuracy of pruning false positives. A nucleation propagation folding pathway scheme is used to find the true contacts. Its success depends strongly on the choice of the initial nucleation site. The strategy of the   there are at the most six residues in contact with both i and j 4. β-pairing rule: a β-strand can be in contact with at the most two other β-strands 5. β-sheet rule: any two pairing strands are either parallel or antiparallel 6. Helix mutual contact rule: a residue cannot be in contact at the same time with the residues on the opposite sides of a helix 7. Helix rule: within a helix, only the contact between residue i and i + 4 is allowed 8. β-rule: no contact is allowed within any strand 9. Right-hand crossover rule: crossovers between parallel strands of the same sheet (paired or not) are right-handed (especially if the crossover contains a helix) 10. Helix crowding rule: if a helix can go to either side of a sheet, it picks the side with fewer crossovers 11. Strand burial rule: if a strand can pair with either of two other strands, it chooses the one that is more non-polar prediction is: (a) predict the secondary structure; (b) choose a folding nucleation site by assigning local contacts; (c) propagate from the nucleation site by assigning or removing contacts, based on physicality and propagation rules ( Table 1). The prediction is finished when all pairs are assigned either a contact or non-contact, and when none of the rules are violated.

Results from CASP5
Here we will discuss one example of HMMSTR-CM prediction of a CASP5 target. Summaries of prediction methods, including this one, can be found in a special edition of the journal Proteins, Structure, Function and Genetics this year [18], to be dedicated to the CASP5 prediction experiment. Target T0130 has 116 residues arranged in a three-layer α/β sandwich. The contact potential map is shown in Figure 1a. By choosing different nucleation sites, we found more than one way to derive a physically possible topology. In this case, we selected to start the pathway with β 2 α 2 β 3 . The following is the sequence of operations that built the prediction. This sequence of events is the predicted folding pathway: 1. Parallel β contacts were assigned between β 2 and β 3 . 2. Anti-parallel contacts were assigned to β 1 and β 2 . All other β contacts to β 2 were pruned. 3. There were two ways to make a right-handed crossover from β 3 to β 4 (Figure 1c-d). Since β 1 is more hydrophobic than β 3 , we paired β 1 with β 4 . All other β contacts to β 1 were pruned, and contacts between α 2 and α 3 were pruned since they are now on opposite sides of the sheet. 4. α 1 must be on the opposite side of the sheet from α 3, since α 3 extends across the sheet. Contacts were assigned between α 1 and α 2 .
The completed TOPS diagram and contact map accurately match the true structure ( Figure 1b). The prediction has 42% contact coverage and 29% accuracy. However, if we count near-misses (±one residue), the coverage is 75% and the accuracy is 57%. Note that the long-range contacts between the β 1 and β 4 were correctly predicted. Longrange contacts are difficult to predict using purely statistical methods.
Identification of the folding nucleation site is the critical step in this approach. Once the nucleation site is chosen, the subsequent contact assignments are often unambiguous. The choice of the nucleation site in T0130 was relatively easy. Only one of the three parallel βαβ units had a high score. The hairpin between β 1 and β 2 would also be a correct choice, but the selection of β 2 α 2 β 3 eliminated more of the potential incorrect folding pathways. This prediction turned out to be topologically correct. In other cases, the wrong structure was chosen for the nucleation site, and the algorithm failed. When the correct nucleation site was assigned retrospectively, the correct topology could be identified, but this has not yet been cross-validated.

Summary
Simulating the physical process of protein folding has taken many algorithmic forms, distinguished by the differing levels of detail in the representation of the model. Detailed all-atom representations continue to be popular, while simplified models have proved to be successful in blind predictions.
Pathways have recently been defined for a 2D contact map representation, and this approach shows potential for modelling the folding process without simulations.