The importance of pairwise protein structural comparison in biomedical research is fueling the search for algorithms capable of finding a more accurate structural match between two input proteins in a timely manner. In recent years, we have witnessed rapid advances in the development of methods for approximate and optimal solutions to the protein structure matching problem. Albeit slow, these methods can be extremely useful in assessing the accuracy of more efficient, heuristic algorithms. We utilize a recently developed approximation algorithm for protein structure matching to demonstrate that a deep search of the protein superposition space leads to increased alignment accuracy with respect to many well-established measures of alignment quality. The results of our study suggest that a large and important part of the protein superposition space remains unexplored by current techniques for protein structure alignment.
Pairwise protein structure alignment is one of the most important problems in computational molecular biology. At the same time, protein structure alignment is a very difficult problem, due to the infinite number of possible ways to position a pair of proteins in three-dimensional space. Because of the enormous size of the search space, research into protein structure alignment has traditionally focused on the development of methods with better objective functions that explore a relatively small but representative set of the proteins’ spatial superpositions.
In this paper, we take a different approach and study the benefits of searching proteins’ superpositions in a more detailed manner. We demonstrate a significant increase in the alignment accuracy of several well-known distance-based alignment methods, obtained by utilizing the superpositions that rigorously optimize a very simple and intuitive alignment metric, defined as the largest number of residues from the input proteins that can be fit under a predefined distance cutoff.
The size of the gap between the accuracy of current heuristic solutions and optimal solutions observed in this study suggests that the protein structure alignment problem will likely remain an active research topic in the years to come.
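To make the metric concrete, the following minimal Python sketch counts the residue pairs that fall under a distance cutoff for one fixed superposition and one fixed residue correspondence. The metric described above is the largest such count over all superpositions; this sketch only evaluates a single candidate. The function name, the toy coordinates, and the input layout are hypothetical, and the 3 Å default is taken from the cutoff used later in this study.

```python
import math

def num_pairs_under_cutoff(coords_a, coords_b, alignment, cutoff=3.0):
    """Count aligned residue pairs whose distance is <= cutoff (angstroms).

    coords_a, coords_b: lists of (x, y, z) tuples, already superimposed.
    alignment: list of (i, j) index pairs into coords_a and coords_b.
    """
    count = 0
    for i, j in alignment:
        if math.dist(coords_a[i], coords_b[j]) <= cutoff:
            count += 1
    return count

# Toy data (made up for illustration, not taken from the benchmarks).
coords_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
coords_b = [(0.5, 0.0, 0.0), (3.8, 2.0, 0.0), (7.6, 5.0, 0.0)]
alignment = [(0, 0), (1, 1), (2, 2)]

n_close = num_pairs_under_cutoff(coords_a, coords_b, alignment)
n_tight = num_pairs_under_cutoff(coords_a, coords_b, alignment, cutoff=1.0)
```

In the toy data the three aligned pairs are 0.5 Å, 2.0 Å, and 5.0 Å apart, so two of them fall under the 3 Å cutoff and one under a 1 Å cutoff.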
Our study is carried out using two protein structure alignment benchmarks:
We evaluated the performance of three well-known methods for protein structure comparison, STRUCTAL [
It is important to emphasize that our experiment is not designed to compare these three methods head-to-head, but rather to assess the extent of improvement in the accuracy of each method that can be made by exploring the search space more thoroughly.
In choosing the methods for our study, we only considered the availability of software and the simplicity of implementing the alignment scoring functions (see the Results section). An overview of the three algorithms is given below.
The STRUCTAL algorithm [
The alignment step in STRUCTAL is carried out using a dynamic programming routine, which implements the following recurrence formula:
TM-align is another popular protein structure alignment program, widely used in many applications, in particular for assessing the quality of protein models generated by comparative modeling or ab initio techniques. The score matrix in TM-align is protein-length specific and is defined as
LOCK2 [
The alignment returned by LOCK2 is a sequence of pairs of points
The LOCK2 software can be downloaded from
The
In the present study, we (like Rocha et al. [
The procedure for creating method-specific alignments and alignments based on MaxPairs superpositions.
In the second step, the MaxPairs algorithm is run to compute the set of (near-)optimal superpositions, namely, the superpositions that rigorously maximize the number of pairs of atoms that can be fit under 3 Å. We used our own implementations of the STRUCTAL, TM-align, and LOCK2 alignment procedures to compute the optimal residue-residue correspondence (alignment) between the newly superimposed proteins. The percentage agreement with reference alignments is recorded again and compared to the agreement obtained in the first step.
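The alignment step for a fixed superposition can be illustrated with a generic dynamic program. The sketch below is not the STRUCTAL, TM-align, or LOCK2 procedure: it assumes an order-preserving (sequential) alignment with a linear gap penalty, and the method-specific scoring function is abstracted into a hypothetical `score` parameter. In the usage example the score is 1 for residues within 3 Å and 0 otherwise, with no gap penalty, so the total equals the number of sequentially aligned pairs under 3 Å.

```python
import math

def align_fixed_superposition(coords_a, coords_b, score, gap=0.0):
    """Best order-preserving alignment of two superimposed coordinate lists.

    score(p, q) -> float is a caller-supplied (hypothetical) pair score;
    gap is a linear penalty charged for every unaligned residue.
    Returns (total_score, list of aligned (i, j) index pairs).
    """
    n, m = len(coords_a), len(coords_b)
    # best[i][j]: best score for the first i residues of A and first j of B.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] - gap
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = best[i - 1][j - 1] + score(coords_a[i - 1], coords_b[j - 1])
            best[i][j] = max(match, best[i - 1][j] - gap, best[i][j - 1] - gap)
    # Traceback; gap moves are preferred on ties, so a pair is reported
    # only when the diagonal move is required to reach the best score.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j] - gap:
            i -= 1
        elif best[i][j] == best[i][j - 1] - gap:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
    pairs.reverse()
    return best[n][m], pairs

def within_3A(p, q):
    """Toy score: 1 if the two points are within 3 angstroms, else 0."""
    return 1.0 if math.dist(p, q) <= 3.0 else 0.0

# Toy coordinates (made up for illustration).
coords_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
coords_b = [(3.9, 0.5, 0.0), (7.5, -0.5, 0.0), (30.0, 0.0, 0.0)]
total, pairs = align_fixed_superposition(coords_a, coords_b, within_3A)
```

Passing the scoring function as an argument keeps the dynamic program independent of any particular method; a distance-dependent score and a nonzero gap penalty can be substituted without changing the recurrence.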
The agreement with reference alignments in the
The dashed lines in Figures
The accuracy of the STRUCTAL algorithm using original versus optimized superpositions in the
The agreement of TM-align alignments and reference alignments in the
The agreement of LOCK2 alignments and reference alignments in the
The increase in the number of correctly aligned residues, obtained by switching to MaxPairs superpositions, varies from one pair of structures to another (Figures
The increase in accuracy of STRUCTAL obtained on 106 pairs from the
The increase in accuracy of TM-align obtained on 106 pairs from the
The increase in accuracy of LOCK2 obtained on 106 pairs from the
A more detailed analysis shows that, when MaxPairs superpositions are used, the number of residue pairs correctly aligned by STRUCTAL increases by more than 10 for 31 out of 106 test pairs. For TM-align and LOCK2, an increase of the same magnitude is observed for 14 and 13 test pairs, respectively. For comparison, the original STRUCTAL superpositions have such an advantage in only 3 out of 106 test pairs. For TM-align and LOCK2, the corresponding numbers are 5 and 4.
The value added by the deep search of superposition space makes some of the methods analyzed here comparable to the best methods to date evaluated in the
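The tally described above can be sketched in a few lines of Python. The sketch assumes that an alignment is represented as a list of residue index pairs and that a pair counts as correctly aligned when it also appears in the reference alignment. Function names and all numbers in the example are hypothetical; none are taken from the benchmark.

```python
def correct_pairs(alignment, reference):
    """Number of aligned residue pairs that also occur in the reference."""
    return len(set(alignment) & set(reference))

def tally_large_gains(original_counts, optimized_counts, threshold=10):
    """Count test pairs where one superposition type leads by more than threshold.

    Returns (optimized_wins, original_wins).
    """
    optimized_wins = 0
    original_wins = 0
    for orig, opt in zip(original_counts, optimized_counts):
        if opt - orig > threshold:
            optimized_wins += 1
        elif orig - opt > threshold:
            original_wins += 1
    return optimized_wins, original_wins

# Toy alignment against a toy reference (made-up data).
reference = [(0, 0), (1, 1), (2, 2), (3, 3)]
alignment = [(0, 0), (1, 2), (2, 2), (3, 3)]
n_correct = correct_pairs(alignment, reference)

# Toy per-test-pair counts of correctly aligned residues (made-up data).
original_counts = [40, 55, 30, 62]
optimized_counts = [52, 54, 45, 50]
wins = tally_large_gains(original_counts, optimized_counts)
```

Here the gains are 12, -1, 15, and -12 residues, so the optimized superpositions lead by more than 10 on two test pairs and the original superpositions on one.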
Our second benchmarking set utilizes 183 representative pairs of proteins, related at various levels according to FSSP structural classification [
In contrast to
NumPairs(d) represents the number of aligned pairs of residues in two proteins that are at distance
As seen in Table
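Average (per-pair) accuracy of STRUCTAL, TM-align, and LOCK2 in the FSSP benchmark, for all structural levels combined. The best results are indicated in bold.

| Method | Superpositions | NumPairs(3) | PSI(3) | SI |
| --- | --- | --- | --- | --- |
| STRUCTAL | Original | 50.47 | 0.59 | 7.85 |
| STRUCTAL | Near-optimal | | | |
| TM-align | Original | 53.35 | 0.62 | |
| TM-align | Near-optimal | | | 5.95 |
| LOCK2 | Original | 51.75 | 0.60 | 8.35 |
| LOCK2 | Near-optimal | | | |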
The FSSP level-specific results of our benchmarking analysis are summarized in Tables
Average accuracy of the three methods in our study, computed on 60 pairs of proteins that share the same FSSP fold.
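
| Method | Superpositions | NumPairs(3) | PSI(3) | SI |
| --- | --- | --- | --- | --- |
| STRUCTAL | Original | 31.48 | 0.47 | 10.07 |
| STRUCTAL | Near-optimal | 36.68 | 0.54 | 9.63 |
| TM-align | Original | 35.98 | 0.52 | 7.60 |
| TM-align | Near-optimal | 39.58 | 0.57 | 7.76 |
| LOCK2 | Original | 34.82 | 0.50 | 12.56 |
| LOCK2 | Near-optimal | 42.47 | 0.61 | 7.25 |
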
Average accuracy computed on the set of 68 pairs of proteins that belong to the same FSSP superfamily.
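
| Method | Superpositions | NumPairs(3) | PSI(3) | SI |
| --- | --- | --- | --- | --- |
| STRUCTAL | Original | 47.71 | 0.58 | 8.41 |
| STRUCTAL | Near-optimal | 50.09 | 0.61 | 7.61 |
| TM-align | Original | 51.40 | 0.62 | 6.11 |
| TM-align | Near-optimal | 52.71 | 0.64 | 6.10 |
| LOCK2 | Original | 48.63 | 0.59 | 8.17 |
| LOCK2 | Near-optimal | 55.37 | 0.67 | 5.85 |
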
Average accuracy computed on the set of 55 pairs of structures from the same FSSP family.
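
| Method | Superpositions | NumPairs(3) | PSI(3) | SI |
| --- | --- | --- | --- | --- |
| STRUCTAL | Original | 74.6 | 0.74 | 4.76 |
| STRUCTAL | Near-optimal | 77.11 | 0.76 | 4.62 |
| TM-align | Original | 74.71 | 0.73 | 3.65 |
| TM-align | Near-optimal | 77.49 | 0.76 | 3.79 |
| LOCK2 | Original | 74.09 | 0.73 | 3.98 |
| LOCK2 | Near-optimal | 79.73 | 0.78 | 3.80 |
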
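
The level-specific averages can be related to the all-levels averages by a size-weighted mean over the three test sets (60 fold-level, 68 superfamily-level, and 55 family-level pairs, 183 in total). The short Python sketch below assumes that the combined averages are per-pair means over the pooled set; the function name is hypothetical. Applied to the NumPairs(3) values for the original STRUCTAL superpositions, it gives the value reported for all structural levels combined.

```python
def pooled_average(level_means, level_sizes):
    """Size-weighted mean of per-level averages (assumes per-pair averaging)."""
    total_pairs = sum(level_sizes)
    weighted_sum = sum(mean * size for mean, size in zip(level_means, level_sizes))
    return weighted_sum / total_pairs

# Number of test pairs at the fold, superfamily, and family levels.
level_sizes = [60, 68, 55]

# NumPairs(3) for the original STRUCTAL superpositions at each level.
structal_original = [31.48, 47.71, 74.6]

pooled = pooled_average(structal_original, level_sizes)
print(round(pooled, 2))  # 50.47
```

The same check applied to the original LOCK2 values (34.82, 48.63, 74.09) gives 51.75, which also agrees with the combined table.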
Figure
Alignment-independent PSI scores in the FSSP benchmark. The number in parentheses is the highest number of pairs of residues (averaged over each test set) that can be placed under 3 Å, given the superpositions generated by each method.
The data used in Figure
Several examples illustrating the advantage of the deep search of superposition space are given in Figures
Structural alignment of two cystatin-like folds:
Structural superposition of the
Structural superposition of
Structural superposition of two helical regions in the
Structural superposition of the
While examples in Figures
Recent years have witnessed advances in the development of methods for approximate and exact solutions to the protein structure alignment problem. One of the first such methods is Umeyama’s algorithm for finding the transformation that gives the least mean squared error between two point patterns [
Polynomial-time approximation schemes (PTASs) have been designed for selected non-sequential protein structure alignment measures [
Although inefficient for large-scale analysis, algorithms for exact solutions are indispensable tools for assessing the accuracy of more commonly used heuristic methods. The present study utilizes a set of precomputed superpositions to evaluate the improvements in accuracy of three well-known protein structure alignment algorithms, obtained by the deep search of the superposition space. In the
We emphasize that our analysis provides an estimate of the lower bound on the difference between optimal and heuristic solutions, since alignments generated by MaxPairs are not always optimal (in the strict sense).
Finally, it is reasonable to expect that a more thorough exploration of the superposition space, coupled with fragment-based alignment techniques, can be used to further improve the precision of methods based on sequences of local transformations, such as Matt [
A typical distance-based protein structure alignment method explores the space of proteins’ spatial superpositions, computing an optimal residue-residue correspondence (alignment) each time a new superposition is generated. Because of the large search space, current methods for protein structure alignment must trade precision for speed and explore only a small but representative set of superpositions.
We utilize an algorithm capable of finding an alignment of any specified accuracy to demonstrate a significant increase in the alignment quality of solutions generated by three popular protein structure alignment methods, obtained through the deep search of the superposition space. The large lower bound on the size of the gap between optimal and heuristic solutions observed in this study suggests that the protein structure alignment problem will likely remain an attractive research area throughout the next decade.
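The lower-bound argument can be written out in one line. The notation is introduced here only for illustration and does not appear elsewhere in the text: for a quality measure $Q$ that is to be maximized, let $Q_{\mathrm{opt}}$ denote the value of an optimal solution, $Q_{\mathrm{MP}}$ the value obtained with MaxPairs superpositions, and $Q_{\mathrm{heur}}$ the value obtained by a heuristic method.

```latex
Q_{\mathrm{opt}} \ge Q_{\mathrm{MP}}
\quad\Longrightarrow\quad
Q_{\mathrm{opt}} - Q_{\mathrm{heur}} \;\ge\; Q_{\mathrm{MP}} - Q_{\mathrm{heur}}
```

The observed difference on the right-hand side therefore cannot overstate the true gap on the left, under the assumption that the optimum is defined with respect to the same measure $Q$.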
A. Poleksic was supported, in part, by a Professional Development Assignment from the University of Northern Iowa.