On the Difference in Quality between Current Heuristic and Optimal Solutions to the Protein Structure Alignment Problem

The importance of pairwise protein structural comparison in biomedical research is fueling the search for algorithms capable of finding a more accurate structural match of two input proteins in a timely manner. In recent years, we have witnessed rapid advances in the development of methods for approximate and optimal solutions to the protein structure matching problem. Albeit slow, these methods can be extremely useful in assessing the accuracy of more efficient, heuristic algorithms. We utilize a recently developed approximation algorithm for protein structure matching to demonstrate that a deep search of the protein superposition space leads to increased alignment accuracy with respect to many well-established measures of alignment quality. The results of our study suggest that a large and important part of the protein superposition space remains unexplored by current techniques for protein structure alignment.


Introduction
Pairwise protein structure alignment is one of the most important problems in computational molecular biology. At the same time, protein structure alignment is a very difficult problem, due to the infinite number of possible ways to position a pair of proteins in three-dimensional space. Because of the enormous size of the search space, research into protein structure alignment has traditionally focused on the development of methods with better objective functions that explore a relatively small but representative set of the proteins' spatial superpositions.
In this paper, we take a different approach and study the benefits of searching proteins' superpositions in a more detailed manner. We demonstrate a significant increase in the alignment accuracy of several well-known distance-based alignment methods, obtained by utilizing the superpositions that rigorously optimize a very simple and intuitive alignment metric, defined as the largest number of residues from the input proteins that can be fit under a predefined distance cutoff.
The size of the gap between the accuracy of current heuristic solutions and optimal solutions, observed in this study, suggests that the protein structure alignment problem will likely remain a hot topic in years to come.

Materials and Methods
Our study is carried out using two protein structure alignment benchmarks: Sisyphus and FSSP. In both benchmarks, an in-house algorithm, MaxPairs [1], is applied to compute the superpositions that closely approximate the measure defined as the largest number of pairs of residues from the input proteins that can be fit under $d$ Ångströms. The MaxPairs algorithm is based on the approximation algorithm EPSILON-OPTIMAL [1], which is capable of finding a superposition of the input proteins that fits at least as many pairs of residues under the distance $d + \varepsilon$ as an optimal superposition fits under the distance $d$, for any accuracy threshold $\varepsilon > 0$. As an approximation algorithm, EPSILON-OPTIMAL suffers from high computational complexity: its run time is a high-degree polynomial in the lengths of the structures being compared. To circumvent the high computational cost, the present study utilizes MaxPairs, a heuristic version of EPSILON-OPTIMAL that searches through a relatively small subset of the space of all superpositions of the input proteins inspected by EPSILON-OPTIMAL. While still not practical, as demonstrated in [1], MaxPairs enjoys accuracy superior to that of some widely utilized alignment programs and, as such, is an indispensable tool for assessing the precision of more efficient and more popular algorithms. In the present study, we set the distance cutoff to $d = 3$ Å; the accuracy threshold $\varepsilon$ is fixed at a value below which the computation proves prohibitive with our computing infrastructure.
We evaluated the performance of three well-known methods for protein structure comparison, STRUCTAL [2][3][4], TM-align [5], and LOCK2 [6,7], before and after replacing their original superpositions with superpositions that optimize this measure. It is important to emphasize that our experiment is not designed to compare these three methods head-to-head, but rather to assess the extent of improvements in the accuracy of each method that can be made by exploring the search space in a more thorough manner.
In choosing the methods for our study, we only considered the availability of software and the simplicity of implementing the alignment scoring functions (see the Results section). An overview of the three algorithms is given below.
STRUCTAL. e STRUCTAL algorithm [2][3][4] employs iterative dynamic programming to balance the cRMS score with the lengths of aligned regions. In each iteration, the algorithm computes an optimal residue-residue correspondence (alignment) of the input proteins , … , ) and , … , ) and then �nds a superposition that minimizes cRMS of the aligned subchains , … , ) and , … , ). e cRMS score is given by (1) e alignment step in STRUCTAL is carried out using a dynamic programming routine, which implements the following recurrence formula: where , 20 e outputs of STRUCTAL are the subchains of and of , along with the rigidly transformed protein , denoted bŷ, and a residue-residue correspondence that maximizes the STRUCTAL score where , denotes the total number of gaps in the alignment. e STRUCTAL program used in our analysis was downloaded from http://csb.stanford.edu/levitt/Structal/.
TM-align. TM-align is another popular protein structure alignment program, widely used in many applications, in particular for assessing the quality of protein models generated by comparative modeling or ab initio techniques. The score matrix in TM-align is protein-length specific and is defined as
$$S(i,j) = \frac{1}{1 + \left( d_{ij} / d_0(L_{\min}) \right)^2},$$
where $d_{ij}$ is the distance between the $i$th residue of one protein and the $j$th residue of the other under the current superposition, $d_0(L_{\min}) = 1.24 \sqrt[3]{L_{\min} - 15} - 1.8$, and $L_{\min}$ is the length of the shorter structure [5]. In contrast to linear gap penalties employed by STRUCTAL, the gap penalties in TM-align are affine and are set to 0.6 for gap-opening and 0.0 for gap-extension [5]. An improved version of the algorithm, called Fr-TM-align, has been published [8]. The TM-align software, used in this study, was downloaded from http://zhanglab.ccmb.med.umich.edu/TM-align/.
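As a worked illustration of the length-specific scoring in TM-align, the sketch below evaluates the commonly cited distance scale $d_0(L_{\min}) = 1.24 \sqrt[3]{L_{\min} - 15} - 1.8$ and the resulting pair score. The function names are hypothetical, and the lower bound of 0.5 Å applied to $d_0$ for very short chains is an assumption of this sketch, added only to keep the formula well defined.

```python
def tm_d0(l_min, floor=0.5):
    """Length-dependent distance scale d0 (in Angstroms).

    The floor for short chains is an assumption of this sketch.
    """
    if l_min <= 15:
        return floor
    return max(floor, 1.24 * (l_min - 15) ** (1.0 / 3.0) - 1.8)

def tm_pair_score(distance, l_min):
    """Score 1 / (1 + (d/d0)^2) for one residue pair at the given distance."""
    return 1.0 / (1.0 + (distance / tm_d0(l_min)) ** 2)
```

For a shorter chain of 100 residues, `tm_d0(100)` is roughly 3.65 Å, so a residue pair separated by that distance contributes a score of 0.5, while a perfectly superimposed pair contributes 1.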
LOCK2. LOCK2 [6] is an improved version of the original LOCK program [7]. It incorporates secondary structure information into the alignment process. An initial superposition is obtained by comparing the vectors of secondary structure elements. An iterative procedure is then applied to minimize RMSD between aligned subchains of the input proteins, using the threshold distance of 3 Å for atomic superposition. Rigid body motions for RMSD minimization are realized using quaternion transformations [9,10]. The alignment returned by LOCK2 is a sequence of pairs of points $(a_{i_1}, b_{j_1}), \ldots, (a_{i_k}, b_{j_k})$, where $a_{i_l}$ and $b_{j_l}$ are each other's nearest neighbors. More specifically, for every $l = 1, \ldots, k$, the point $a_{i_l}$ is the closest point in protein $a$ to the point $b_{j_l}$ and vice versa. The final alignment is generated through a two-step process. First, for every atom $p$ from protein $a$, the algorithm finds the nearest atom from protein $b$ that is within 3 Å of $p$. In the second step, the algorithm selects the maximum number of aligned pairs in sequential order, by removing pairs that violate colinearity.
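The two-step construction of the final LOCK2 alignment can be sketched as follows. This is an illustrative reading of the description above, not the LOCK2 code: the function names are hypothetical, the inputs are assumed to be already superimposed coordinate lists, and the colinearity filter is implemented here as a longest increasing subsequence, which is one natural way to select the maximum number of pairs in sequential order.

```python
import math

def mutual_nearest_pairs(a, b, cutoff=3.0):
    """Pairs (i, j) such that a[i] and b[j] are each other's nearest neighbors within cutoff."""
    def nearest(p, pts):
        return min(range(len(pts)), key=lambda k: math.dist(p, pts[k]))

    pairs = []
    for i, p in enumerate(a):
        j = nearest(p, b)
        if math.dist(p, b[j]) <= cutoff and nearest(b[j], a) == i:
            pairs.append((i, j))
    return pairs

def sequential_subset(pairs):
    """Largest subset of pairs whose indices increase in both proteins."""
    pairs = sorted(pairs)
    if not pairs:
        return []
    best = [1] * len(pairs)   # best[x]: length of the longest chain ending at pairs[x]
    prev = [-1] * len(pairs)  # back-pointers for reconstructing the chain
    for x in range(len(pairs)):
        for y in range(x):
            if (pairs[y][0] < pairs[x][0] and pairs[y][1] < pairs[x][1]
                    and best[y] + 1 > best[x]):
                best[x], prev[x] = best[y] + 1, y
    end = max(range(len(pairs)), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(pairs[end])
        end = prev[end]
    chain.reverse()
    return chain
```

The quadratic-time chain search is adequate for proteins of typical length; a faster subsequence algorithm could be substituted without changing the result size.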

Sisyphus Benchmark.
The Sisyphus test [11] is frequently used to assess the accuracy of automated methods for protein structure comparison [1,12]. This sophisticated benchmark utilizes 125 alignments of structurally related proteins, created by experts in the field of protein structure analysis. The reference alignments can be downloaded from http://sisyphus.mrc-cpe.cam.ac.uk.
Figure 1: The procedure for creating method-specific alignments and alignments based on MaxPairs superpositions.
In the present study, we (like Rocha et al. [12]) utilize only a subset of the Sisyphus test set, containing 106 alignments between single-chain proteins. The two-step evaluation process is illustrated in Figure 1. In the first step, STRUCTAL, TM-align, and LOCK2 are run with default parameters to generate the method-specific alignments between proteins from the Sisyphus set. These alignments are then compared to the reference ("gold-standard") alignments to compute the percentage of correctly aligned residue pairs [1,12].
In the second step, the MaxPairs algorithm is run to compute the set of (near-)optimal superpositions, namely, the superpositions that rigorously maximize the number of pairs of atoms that can be fit under 3 Å. We used our own implementations of the STRUCTAL, TM-align, and LOCK2 alignment procedures to compute the optimal residue-residue correspondence (alignment) between the newly superimposed proteins. The percentage agreement with reference alignments is recorded again and compared to the agreement obtained in the first step.
The agreement with reference alignments in the Sisyphus test is defined as a function of the magnitude of the alignment error. More specifically, for the alignment tolerance shift $s$, the agreement is defined as $N_s / L_{\mathrm{ref}}$, where $N_s$ is the number of aligned residues that are shifted by no more than $s$ positions in the reference alignment and $L_{\mathrm{ref}}$ is the length of the reference alignment [12]. Perfect agreement corresponds to zero shift ($s = 0$).
The dashed lines in Figures 2, 3, and 4 track the performance of the original STRUCTAL, TM-align, and LOCK2 methods. The solid lines show the performance of the same methods when run on the superpositions that maximize the number of residues under 3 Å. As seen in these figures, there is a significant boost in the methods' accuracy resulting from the "fine-tooth comb" search of superposition space. More precisely, the new superpositions improve absolute agreement with the reference alignments for STRUCTAL, TM-align, and LOCK2 by 11%, 5%, and 5%, respectively, with a similar trend continuing for nonzero shift.
The increase in the number of correctly aligned residues, obtained by switching to MaxPairs superpositions, varies from one pair of structures to another (Figures 5, 6, and 7). For some pairs, the difference is striking. However, it should be emphasized that, in some of these cases, such a large difference might be due to the unavailability of information in the PDB files used by the methods in our study. For instance, the LOCK method is built to take advantage of the residues' secondary structure assignment. Hence, it is reasonable to assume that the lack of secondary structure information in the PDB file for one or both structures will often decrease the accuracy of the LOCK alignment of those structures. A more detailed analysis shows that, when MaxPairs superpositions are used, the number of residue pairs correctly aligned by STRUCTAL increases by more than 10 for 31 out of 106 test pairs. The corresponding numbers of test pairs for which the same magnitude of increase is observed for TM-align and LOCK are 14 and 13, respectively. For comparison, the original STRUCTAL superpositions have such an advantage in only 3 out of 106 test pairs. For TM-align and LOCK, the corresponding numbers are 5 and 4.
The value added by the deep search of superposition space makes some of the methods analyzed here comparable to the best methods evaluated to date in the Sisyphus test. A slight accuracy advantage of algorithms such as Matt [13], PPM [14], and ProtDeform [12] is due to the fact that these methods consider proteins as flexible, rather than rigid, objects. In other words, unlike STRUCTAL, TM-align, and LOCK2, which all utilize single transformations of input proteins to compute final alignments, the new generation of protein structure alignment methods considers sequences of different rigid transformations at different sites. It should be emphasized that the methods based on sequences of local transformations can themselves benefit from incorporating the "fine-tooth comb" search to detect fragments of local similarity. This would lead to further improvements in their overall accuracy, but the true extent of these improvements can only be assessed through a carefully designed study.
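The shift-tolerant agreement with a reference alignment, as defined earlier in this section, can be sketched in a few lines. The representation is an assumption of this sketch: each alignment is a dictionary mapping a residue index in the first protein to its partner in the second, and a reference pair counts as recovered when the test alignment places the same residue within the allowed shift of its reference partner. The function name is hypothetical.

```python
def agreement(test, reference, shift=0):
    """Fraction of reference pairs recovered by the test alignment within a given shift.

    Both alignments map a residue index in the first protein to its partner
    in the second protein (a representation assumed for this sketch).
    """
    if not reference:
        raise ValueError("reference alignment is empty")
    hits = sum(
        1
        for i, j_ref in reference.items()
        if i in test and abs(test[i] - j_ref) <= shift
    )
    return hits / len(reference)
```

Evaluating `agreement` for increasing values of `shift` yields a curve of the kind tracked in Figures 2, 3, and 4, with `shift=0` giving the strictest measure.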

FSSP Benchmark.
Our second benchmarking set utilizes 183 representative pairs of proteins, related at various levels according to the FSSP structural classification [15]. This test set consists of 55 family pairs, 68 superfamily pairs, and 60 fold pairs (see Supplementary Material available online at doi:10.1155/2012/459248).
In contrast to the Sisyphus benchmark, which compares alignments returned by automated methods to those generated by human experts, the alignment precision in the FSSP benchmark is assessed using a set of well-known alignment quality measures:
(i) NumPairs(d) represents the number of aligned pairs of residues in two proteins that are at distance ≤ $d$ Ångströms from each other. We note that, unlike the globally optimal measure approximated by MaxPairs, which represents the maximum number of pairs of residues in the superimposed structures that can be placed under $d$ Ångströms, NumPairs(d) represents the method-specific count of pairs of aligned residues at distance ≤ $d$.
(ii) Similarity Index, denoted by SI, is defined as $\mathrm{SI} = \mathrm{cRMS} \times \min\{L(a), L(b)\} / N$, where $N$ is the number of aligned residues in proteins $a$ and $b$, and $L(a)$ and $L(b)$ are the lengths of $a$ and $b$, respectively [16]. The cRMS score, used in the formula for SI, is computed based upon the method-specific alignments.
As seen in Table 1, a more detailed search of the superposition space increases both NumPairs and PSI scores for all three methods in our study. An improvement in SI scores is also seen for both STRUCTAL and LOCK2. It is interesting to note, though, that the original TM-align superpositions yield better SI scores than the optimal superpositions. The FSSP level-specific results of our benchmarking analysis are summarized in Tables 2, 3, and 4.
Figure 8 shows the alignment-independent PSI scores computed from superpositions generated by STRUCTAL, TM-align, and LOCK2. For reference, a near-optimal PSI score, averaged across the FSSP test set and computed by the MaxPairs algorithm, is also provided in this figure. The data used in Figure 8 show that (on average) STRUCTAL, TM-align, and LOCK fail to place 8%, 7%, and 11% of the pairs of residues at distance ≤ 3 Å, respectively. As expected, the best performance of these methods is observed at the FSSP family level (STRUCTAL fails to place 5%, TM-align: 5%, LOCK: 6%) and the worst at the FSSP fold level (STRUCTAL: 15%, TM-align: 12%, LOCK: 17%).
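The two alignment-dependent measures used in this benchmark, NumPairs(d) and the Similarity Index, can be computed directly from a pair of superimposed structures and an alignment. The sketch below assumes that each protein is a list of 3D coordinates (one point per residue), that the alignment is a list of index pairs, and that SI takes the form cRMS × min{L(a), L(b)} / N; the function names are hypothetical.

```python
import math

def num_pairs(a, b, alignment, d=3.0):
    """Number of aligned residue pairs (i, j) at distance <= d Angstroms."""
    return sum(1 for i, j in alignment if math.dist(a[i], b[j]) <= d)

def similarity_index(a, b, alignment):
    """SI = cRMS * min(len(a), len(b)) / N, with N the number of aligned pairs.

    Lower values indicate better alignments.
    """
    n = len(alignment)
    if n == 0:
        raise ValueError("alignment is empty")
    c_rms = math.sqrt(sum(math.dist(a[i], b[j]) ** 2 for i, j in alignment) / n)
    return c_rms * min(len(a), len(b)) / n
```

Because SI scales cRMS by the fraction of the shorter chain left unaligned, an alignment that covers more residues at the same cRMS receives a lower (better) SI.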

Illustrative Examples.
Several examples illustrating the advantage of the deep search of superposition space are given in Figures 9, 10, 11, 12, and 13. While the examples in Figures 9-13 are striking, it should be noted that they represent rather isolated cases. In fact (as the reader can conclude from Figures 5, 6, and 7), there are several examples where the output of heuristic methods compares favorably to that of MaxPairs (although the difference in quality is not as obvious as that shown in Figures 9-13). As emphasized before, in many instances, the inaccuracy of the alignment generated by heuristic methods is due to insufficient structural information stored in the PDB file relied upon by these methods.

Discussion
Recent years have witnessed advances in the development of methods for approximate and exact solutions to the protein structure alignment problem. One of the first such methods is Umeyama's algorithm for finding the transformation that gives the least mean squared error between two point patterns [17]. Since then, several algorithms have been published for finding a near-optimal solution to the structure alignment problem under distance constraints. The procedure by Akutsu, for example, returns a superposition of the input proteins that fits at least as many pairs of residues under a relaxed distance cutoff as an optimal alignment fits under the original cutoff, for every fixed accuracy parameter [18]. This algorithm runs on the order of $O(n^8)$, where $n$ denotes the protein length. An improved running time procedure for the same problem has also been published [19]. The EPSILON-OPTIMAL algorithm, used in the present study, is able to place at least as many pairs of residues under the distance $d + \varepsilon$ as an optimal superposition places under the distance $d$. The asymptotic cost of EPSILON-OPTIMAL is $O(n^4)$ for globular and $O(n^8)$ for nonglobular proteins [1]. Polynomial time approximation schemes (PTASs) have been designed for selected nonsequential protein structure alignment measures [20] as well as for the class of measures satisfying the so-called Lipschitz condition [21]. Moreover, methods exist that rigorously minimize proteins' intra-atomic distances, including the algorithm by Caprara et al., which is capable of approximating the "Contact Map Overlap" (CMO) measure with great accuracy [22]. Finally, algorithms for the absolute optimum, with respect to selected alignment metrics, have also been published [1,23], but they are computationally too expensive for everyday use.
Although inefficient for large scale analysis, the algorithms for exact solution are indispensable tools for assessing the accuracy of more commonly used heuristic methods. The present study utilizes a set of precomputed superpositions to evaluate the improvements in accuracy of three well-known protein structure alignment algorithms, obtained by the deep search of the superposition space. In the Sisyphus benchmark, these superpositions increase the accuracy of alignments generated by STRUCTAL, TM-align, and LOCK2 by 11%, 7%, and 6%, respectively. An improvement of similar magnitude is seen after allowing for alignment errors (residue shifts). In the FSSP benchmark, the new superpositions increase NumPairs and PSI scores for STRUCTAL, TM-align, and LOCK2 by ∼7%, ∼5%, and ∼13%, respectively. A particularly noticeable improvement is seen in the Similarity Index scores of alignments generated by LOCK2 (from 8.35 to 5.69). We emphasize that our analysis provides an estimate of the lower bound on the difference between optimal and heuristic solutions, since alignments generated by MaxPairs are not always optimal (in the strict sense).
Finally, it is reasonable to expect that a more thorough exploration of the superposition space, coupled with the fragment-based alignment techniques, can be used to further improve the precision of methods based on sequences of local transformations, such as Matt [13], PPM [14], and ProtDeform [12].

Conclusions
A typical distance-based protein structure alignment method explores the space of proteins' spatial superpositions, computing an optimal residue-residue correspondence (alignment) each time a new superposition is generated. Because of the large search space, current methods for protein structure alignment must trade precision for speed and explore only a small but representative set of superpositions.
We utilize an algorithm capable of finding an alignment of any specified accuracy to demonstrate a significant increase in the alignment quality of solutions generated by three popular protein structure alignment methods, obtained through the deep search of the superposition space. The large lower bound on the size of the gap between optimal and heuristic solutions, observed in this study, suggests that the protein structure alignment problem will likely remain an attractive research area throughout the next decade.