Identify High-Quality Protein Structural Models by Enhanced K-Means

Background. One critical issue in protein three-dimensional structure prediction using either ab initio or comparative modeling involves identification of high-quality protein structural models from generated decoys. Currently, clustering algorithms are widely used to identify near-native models; however, their performance is dependent upon different conformational decoys, and, for some algorithms, the accuracy declines when the decoy population increases. Results. Here, we proposed two enhanced K-means clustering algorithms capable of robustly identifying high-quality protein structural models. The first one employs the clustering algorithm SPICKER to determine the initial centroids for basic K-means clustering (SK-means), whereas the other employs squared distance to optimize the initial centroids (K-means++). Our results showed that SK-means and K-means++ were more robust as compared with SPICKER alone, detecting 33 (59%) and 42 (75%) of 56 targets, respectively, with template modeling scores better than or equal to those of SPICKER. Conclusions. We observed that the classic K-means algorithm showed a similar performance to that of SPICKER, which is a widely used algorithm for protein-structure identification. Both SK-means and K-means++ demonstrated substantial improvements relative to results from SPICKER and classical K-means.


Background
A critical issue in protein three-dimensional (3D) structure prediction using either ab initio or comparative modeling involves identification of high-quality protein structural models from generated decoys [1][2][3][4]. According to the first principle of predicting protein folding, the native structure of the target sequence should be the conformation exhibiting minimal free energy [5]. According to this methodology, large-scale protein-candidate conformations are generated using ab initio or comparative methods [6][7][8][9][10]. Because accurate calculation of free energy remains unclear in theory [11][12][13], a protein-structure clustering algorithm is employed, and the structure located at the center of the largest cluster is considered the conformation exhibiting minimal free energy. In clustering algorithms, the 3D-structural similarity between two proteins is used as the distance metric. Currently, root mean square deviation (RMSD) and template modeling (TM)-scores [14] constitute the two most common metrics for determining 3D-structural similarity between candidates. Subsequent refinement steps are also performed based on the conformations detected by protein-structure clustering; however, the quality of the clustering algorithm directly affects the final results of protein prediction.
SPICKER is a simple, widely used, and efficient program used for identifying near-native folds. In this algorithm, clustering is performed in a one-step procedure using a shrunken, but representative, set of decoy conformations, with a pairwise RMSD cut-off determined by a self-adjusting iteration proposed by Zhang and Skolnick [15]. After benchmarking using a set of 1489 nonhomologous proteins representing all protein structures in the PDB ≥ 200 residues, Xu and Zhang [14] proposed a fast algorithm for population-based protein structural model analysis. Two new distance metrics, Dscore1 and Dscore2, based on the comparison of protein-distance matrices for describing the differences and similarities among models were developed [1]. Compared with existing methods using calculation times quadratic to the number of models, Dscore1-based clustering achieves linear-time complexity to obtain almost the same accuracy for near-native model selection.
Clusco [16] is a fast and easy-to-use program that allows high-throughput comparisons of protein models using different similarity measures (coordinate-based RMSD [cRMSD], distance-based RMSD [dRMSD], global distance test total score [GDT_TS] [17], TM-score, MaxSub [18], and contact map overlap) and clusters the comparison results using standard methods, such as K-means clustering or hierarchical agglomerative clustering. The application is highly optimized, is written in C/C++, and includes code allowing for parallel execution, which results in a significant speed increase relative to similar clustering and scoring algorithms. Berenger et al. [19] proposed a fast method that works on large decoy sets and is implemented in a software package called Durandal, which is consistently faster than other software packages in performing rapid and accurate clustering. In some cases, Durandal outperforms the speed of approximate methods by using the triangle inequality to accelerate accurate clustering without compromising the distance function.
However, most of these methods are data sensitive, with both different protein targets and different modeling algorithms potentially resulting in large differences in detecting the center of clusters [20,21]. One possible reason for this is that the free energy distribution varies greatly between decoy-generation algorithms, such as those relying on ab initio and comparative modeling. Identifying the near-native conformation is also a memory- and time-intensive task [22][23][24]. The K-means [25,26] clustering algorithm is popular and has been successfully employed in many different scientific fields due to its robust performance in several previous applications [27,28] and the relative simplicity of the algorithm. However, the efficacy of K-means clustering in protein-structure prediction has not been extensively studied.
In this paper, we propose two enhanced K-means clustering algorithms to identify near-native structures. The first employs SPICKER to determine the initial centroids for the basic K-means algorithm. The second employs squared distance to optimize the initial centroids.

Benchmark Data Sets.
To comprehensively evaluate the methodology, we applied the algorithms to two representative datasets. The first dataset is I-TASSER SPICKER Set-II (http://zhanglab.ccmb.med.umich.edu/decoys/decoy2.html), which is widely used for evaluating the performance of protein decoy clustering algorithms [29,30]. I-TASSER SPICKER Set-II contains the whole-set atomic structure decoys of 56 nonhomologous small proteins ranging from 47 to 118 residues, with an average of 80.88 residues. Each decoy set contains 439.20 conformations on average.
The second benchmark consists of CASP11 experimental targets whose decoys were generated by Zhang-Server and QUARK. We chose 12 hard and very hard targets from the 64 CASP11 targets published at http://zhanglab.ccmb.med.umich.edu/decoys/casp11/. Hard and very hard targets have lower similarity among the PDB models and more PDB models in the decoy. Targets without Zhang-Server and QUARK server results, and targets with a Zhang-Server TM-score less than 0.6, were removed from the dataset. Each decoy contains around 1200-1500 conformations, with an average of 1520.83 conformations. These proteins range from 68 to 204 residues, with an average of 135.90 residues.

Classical K-Means Algorithm and 3D Distance Metrics
The K-means algorithm is a typical distance-based clustering algorithm. It uses the Euclidean metric as the similarity measure: the closer two objects are, the greater their similarity. K-means considers a cluster to be composed of objects that are close to one another in distance; its final goal is therefore to find compact and independent clusters. The selection of the initial cluster centers has a great influence on the clustering results, because in the first step K-means randomly selects arbitrary objects as the initial cluster centers, each representing an initial cluster. In each iteration, the remaining objects are reassigned to the nearest cluster according to distance. An iteration finishes when all remaining objects have been assigned, and new cluster centers are then calculated. The algorithm terminates when the new cluster centers equal the previous ones or differ from them by less than a specified threshold. The Euclidean metric is defined as follows:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},$$
where $n$ is the number of corresponding atoms between two objects $x$ and $y$.
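The assign-and-update loop described above can be sketched in a few lines of Python. This is a minimal illustration on plain coordinate vectors, not the authors' implementation; the function names `euclidean` and `kmeans`, and the parameters `max_iter`, `tol`, and `seed`, are assumptions introduced here.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length coordinate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Classical K-means: random initial centers, then assign/update until stable."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: every point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: euclidean(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: every center moves to the mean of its members.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its center
        # Stop when no center moved by more than the threshold.
        shift = max(euclidean(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
```

Because the initial centers are drawn at random, a different `seed` can in general lead to a different final partition, which is the sensitivity that the enhanced algorithms in this paper aim to reduce.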

Root Mean Square Deviation and Template Modeling Score.
The similarity between two models is usually assessed by the root mean square deviation (RMSD) between equivalent atoms in the model and native structures after the optimal superimposition [31,32].
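As a small illustration of the RMSD calculation, the sketch below computes it for two lists of equivalent atom coordinates. It assumes the two structures have already been optimally superimposed (the superposition step itself is not shown), and the function name `rmsd` is a hypothetical helper introduced here.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) atom coordinates.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
value = rmsd(model, native)  # one atom matches, one is 2 Å away
```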
RMSD alone is not sufficient for globally estimating the similarity between two proteins, because the alignment coverage can differ greatly between approaches. A template with a 2 Å RMSD to native and 50% alignment coverage is not necessarily better for structure modeling than one with an RMSD of 3 Å and 80% alignment coverage. Although the template-aligned regions are better in the former, fewer residues are aligned, so the resulting full-length model might be of poorer quality. The Template Modeling Score (TM-score) function is a variation on the Levitt-Gerstein (LG) score [1,33], which was first used for sequence-independent structure alignments. TM-score is defined as follows:
$$\text{TM-score} = \max\left[\frac{1}{L_N} \sum_{i=1}^{L_T} \frac{1}{1 + (d_i/d_0)^2}\right],$$
where $L_N$ is the length of the native structure, $L_T$ is the length of the aligned residues to the template structure, $d_i$ is the distance between the $i$th pair of aligned residues, and $d_0$ is a scale to normalize the match difference.
Algorithm 1. Input: V is the distance matrix for protein pairs; $N$ is the number of proteins; $K$ is the specified number of the clusters; $CC_k$ is the center of the $k$th cluster; $CC'_k$ indicates the $k$th new cluster center. Output: $C_1 \cdots C_K$, where $C_k$ is the $k$th cluster.
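To make the coverage argument behind the TM-score concrete, the sketch below evaluates the TM-score sum for one fixed superposition. Several points are assumptions not stated in this text: the scale $d_0 = 1.24\,(L_N - 15)^{1/3} - 1.8$ is the form commonly used with the TM-score, the 0.5 Å floor on $d_0$ is a guard added here for short chains, and the maximization over superpositions is omitted. The function name `tm_score_fixed` is hypothetical.

```python
def tm_score_fixed(distances, native_length):
    """TM-score sum for one given superposition (no maximization performed).

    distances: distance d_i (in angstroms) for each aligned residue pair.
    native_length: L_N, the length of the native structure.
    """
    if native_length > 15:
        # Commonly used length-dependent scale; an assumption in this sketch.
        d0 = 1.24 * (native_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5
    d0 = max(d0, 0.5)  # guard against tiny or negative scales for short chains
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / native_length


# Simplified version of the example in the text for a 100-residue native
# structure, treating every aligned pair as lying exactly at the quoted distance.
half_coverage = tm_score_fixed([2.0] * 50, 100)   # 50% coverage at 2 Å
wide_coverage = tm_score_fixed([3.0] * 80, 100)   # 80% coverage at 3 Å
```

Under these simplifying assumptions the 80%-coverage template scores higher (roughly 0.48 versus 0.38), even though its aligned residues deviate more, which is the behavior that RMSD alone cannot capture.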
One of the key limitations of the K-means algorithm concerns the positioning of the initial cluster centers. As a heuristic algorithm, it is not guaranteed to converge to the global optimum, and the results may depend upon the initial cluster positions. In the classical K-means algorithm, the initial centers are randomly generated, and different initial positions can result in entirely different final cluster centers. SPICKER represents a simple and efficient strategy for identifying near-native folds by clustering protein structures generated during computer simulations. SPICKER performs this in a one-step procedure using a shrunken, but representative, set of decoy conformations, with the pairwise RMSD cut-off determined by self-adjusting iterations.
We propose the first enhanced K-means algorithm, SK-means, which integrates SPICKER with K-means, as shown in Algorithm 1. In the 1st line, Prepare_data() calculates the similarity of all proteins. In the 2nd line, startSpicker( , ) executes the SPICKER program and obtains the initial cluster centers. In the 6th line, the function DistributeToCluster(V, C , ) assigns each protein to the nearest cluster center C according to the distance matrix V. In the 10th line, the function CaculateNewCenter(C k ) calculates the new center for the current cluster C k . In the 19th line, Update() copies the new cluster centers to the current cluster centers. The flow chart of SK-means is depicted in Figure 1(a).
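The iteration that follows the seeding step can be sketched as below. This is not Algorithm 1 itself: the SPICKER run is replaced by a `seeds` argument (indices of the initial centers, which SPICKER would supply), and because only a pairwise distance matrix is available, each new center is taken to be the cluster medoid, which is an assumption of this sketch. The names `assign`, `medoid`, and `cluster_from_seeds` are hypothetical.

```python
def assign(dist, centers):
    """Assign every structure index to the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for i in range(len(dist)):
        nearest = min(range(len(centers)), key=lambda c: dist[i][centers[c]])
        clusters[nearest].append(i)
    return clusters


def medoid(dist, members):
    """Member with the smallest summed distance to the other members."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))


def cluster_from_seeds(dist, seeds, max_iter=100):
    """K-means-style loop on a distance matrix, starting from given centers."""
    centers = list(seeds)
    for _ in range(max_iter):
        clusters = assign(dist, centers)
        new_centers = [
            medoid(dist, members) if members else old
            for members, old in zip(clusters, centers)
        ]
        if new_centers == centers:  # centers unchanged: converged
            break
        centers = new_centers
    return centers, assign(dist, centers)


# Toy distance matrix built from six points on a line (illustrative only).
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
centers, clusters = cluster_from_seeds(dist, seeds=[0, 1])
```

Even with both seeds placed in the same group, the loop in this toy case recovers the two groups; in general the outcome still depends on the seeds, which is why their source matters.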

Initial Constraints Enhance the Classical K-Means Algorithm.
Another enhanced K-means algorithm, K-means++ [35], was applied to detect the near-native conformation. Rather than choosing all initial cluster centers uniformly at random from the data points being clustered, the K-means++ algorithm spreads the initial centers apart. Each subsequent cluster center is chosen from the remaining data points with probability proportional to the squared distance from that point to its closest existing cluster center. The flow chart of K-means++ is depicted in Figure 1.
Of the 56 targets, 33 (59%) detected by K-means++ obtained TM-scores better than or equal to those of SPICKER, and 42 (75%) detected by SK-means obtained TM-scores better than or equal to those of SPICKER. These results demonstrate that both enhanced K-means algorithms outperformed SPICKER in situations involving larger populations of conformation decoys.
Statistical significance indicates that the difference between the sample averages of two approaches most likely reflects a real difference in the population. For practical purposes, statistical significance suggests that the two larger populations from which we sample are actually different. The t-test (Student's t-test) is the most common form of statistical significance test. We performed equal-sample-size t-tests between each of the four methods (K-means++, K-means, SK-means, and SPICKER) and a random method; the results are given in Supplemental Information Table S1 in the Supplementary Material available online at https://doi.org/10.1155/2017/7294519. On the I-TASSER Set-II dataset, none of the four methods showed a statistically significant difference from the random method (first row of Table S1). However, when we considered only the data with decoy size less than 520 (second row of Table S1), K-means++ and SK-means showed greater significance than K-means and SPICKER. This indicates that SK-means and K-means++ are more likely than K-means and SPICKER to differ from the random method when the decoy size is less than 520.
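For reference, the equal-sample-size Student's t statistic can be computed as in the sketch below. The text does not state which t-test variant was used beyond equal sample sizes, so the pooled-variance form here is an assumption, as are the function name `equal_n_t_statistic` and the sample values, which are made up for illustration and are not data from this study.

```python
import math
from statistics import mean, variance


def equal_n_t_statistic(sample_a, sample_b):
    """Two-sample Student's t statistic for equal sample sizes.

    Returns (t, degrees_of_freedom). Assumes equal variances in both groups
    and at least some spread in the data (pooled variance must be non-zero).
    """
    n = len(sample_a)
    if n != len(sample_b) or n < 2:
        raise ValueError("samples must have the same size of at least 2")
    pooled_variance = (variance(sample_a) + variance(sample_b)) / 2.0
    standard_error = math.sqrt(pooled_variance * 2.0 / n)
    t = (mean(sample_a) - mean(sample_b)) / standard_error
    return t, 2 * n - 2


# Illustrative values only.
t, dof = equal_n_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

The statistic is then compared against the t distribution with `dof` degrees of freedom to obtain a p-value; that lookup is left out of this sketch.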
Figure 2 compares the TM-scores of K-means++ and SPICKER. The green bars are the TM-scores of SPICKER model1 from the Zhanglab website (http://zhanglab.ccmb.med.umich.edu/decoys/casp11/). The red and yellow bars are the TM-scores of model1 and of the best model (among model1-model5) of K-means++, respectively. Across the 12 CASP11 hard targets, 8 (66.7%) of the K-means++ model1 structures have a higher TM-score than SPICKER model1. On three targets (T0820, T0824, and T0857), K-means++ and SPICKER give very similar results (TM-score difference less than 0.01). K-means++ increases the average TM-score by 10.5%, from SPICKER's 0.38 to 0.42. K-means++ performs especially well on target T0837, with a TM-score of 0.69, which is 60.5% higher than SPICKER's TM-score of 0.43. Of the 12 K-means++ best models, 10 (83%) have a higher TM-score than SPICKER.
Figure 2: TM-scores of SPICKER model1, K-means++ model1, and the K-means++ best model (in model1-model5) for targets T0763, T0785, T0812, T0816, T0820, T0822, T0824, T0836, T0837, T0838, T0855, and T0857.
For SK-means, even though only 5 of the 12 model1 structures have a higher TM-score than SPICKER, the average TM-score of SK-means model1 (0.38) is the same as SPICKER's. The average TM-score of the SK-means best model (0.46), which equals the average TM-score of the K-means++ best model, is 28% higher than the average TM-score of SPICKER model1 (0.36).

Case Study on Three Targets.
The two enhanced K-means methods obtained results comparable with those of SPICKER on most targets and showed clear advantages on some targets rich in α-helices and β-strands. α-Helices and β-strands are the two most common secondary structure elements; they have been studied extensively and are very important for protein 3D structure prediction. After exploring some targets, we found that SK-means tends to perform better on β-strand targets, whereas K-means++ tends to perform better on α-helix targets. The SK-means method achieved higher TM-scores than SPICKER on some β-strand targets, such as 1af7, 1gpt, 1sro, and 1tig. We chose two targets (1sro and 1tig) and compared model1 identified by SK-means (red) and by SPICKER (green) with the native (blue) conformations in Figure 3. The black frames highlight the improvements of the SK-means algorithm relative to the SPICKER results. Figure 3(a) shows that, in the conformation identified by SPICKER (green) for 1sro, all three β-strands are shorter than those in the conformation identified by SK-means (red). For protein 1tig (Figure 3(b)), the three strand sections of the conformation identified by SK-means (red) are closer to the native (blue) conformation than those of the structure identified by SPICKER (green). These results suggest that the SK-means algorithm may perform better at identifying β-strands.
Figure 4(a) shows the distribution of TM-score and RMSD over the whole decoy as yellow points; points closer to the top left are better. The minimum RMSD, the maximum TM-score, and the model1 structures identified by K-means++ and by SPICKER are marked with different point shapes and colors. In this figure, model1 identified by K-means++ is closer to the native structure than model1 of SPICKER in terms of both TM-score and RMSD. Figure 4(b) shows that T0837 consists mainly of α-helices. The conformation identified by K-means++ (red) overlaps the native conformation (blue) in most α-helical regions. In the black frame, we mark an obvious difference between the model1 structure identified by SPICKER (green) and the native structure (blue): the green α-helix points in an entirely wrong direction. This supports the view that the K-means++ algorithm has advantages in identifying α-helices.

The Time and Space Complexity Analysis.
Because every iteration of the classical K-means algorithm requires calculating the distance between each protein and each cluster center, the time complexity of the classical K-means algorithm is $O(I \cdot K \cdot N)$, where $I$ is the number of iterations until the cluster centers converge, $K$ is the specified number of clusters, and $N$ is the number of proteins in the decoy. The space complexity of the classical K-means algorithm is $O(N + K)$.
SK-means combines SPICKER with the classical K-means algorithm. The time complexity of SPICKER is quadratic in the number of proteins $N$: its leading term is $N^2 \cdot L$, where $L$ is the length of the protein, and its remaining terms are linear in $N$. The time complexity of SK-means is therefore the sum of the SPICKER complexity and the $O(I \cdot K \cdot N)$ cost of the classical K-means iterations, where $I$ is the number of iterations and $K$ is the number of clusters. The space complexity of SPICKER has a leading term of $N^2$, and the space complexity of the SK-means algorithm is the larger of the space complexities of SPICKER and classical K-means, which is that of SPICKER. K-means++ combines an initial-center selection process with the classical K-means algorithm. The selection process determines each center by the max distance to all other proteins, which has time complexity $O(K \cdot N)$ and space complexity $O(N + K)$. The time complexity of K-means++ is therefore $O(K \cdot N + I \cdot K \cdot N)$, and its space complexity is $O(N + K)$.
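The linear per-center cost of the seeding step can be seen in the sketch below, which keeps one running nearest-center distance per protein so that each new center costs a single pass over the $N$ proteins. It follows the probabilistic K-means++ rule (probability proportional to squared distance to the closest existing center) rather than a strict maximum-distance rule; the function name `kmeanspp_seeds`, the `seed` parameter, and the toy distance matrix are assumptions introduced here.

```python
import random


def kmeanspp_seeds(dist, k, seed=0):
    """Pick k initial center indices from a pairwise distance matrix.

    Working state is one list of N squared distances plus up to k centers;
    each round is one O(N) pass, giving O(K * N) time for the seeding step.
    """
    rng = random.Random(seed)
    n = len(dist)
    centers = [rng.randrange(n)]  # first center: uniformly at random
    nearest_sq = [dist[i][centers[0]] ** 2 for i in range(n)]
    while len(centers) < k:
        if sum(nearest_sq) == 0:
            break  # every remaining point coincides with an existing center
        # Sample proportionally to squared distance; zero-weight points
        # (the existing centers) are never selected.
        chosen = rng.choices(range(n), weights=nearest_sq, k=1)[0]
        centers.append(chosen)
        for i in range(n):
            nearest_sq[i] = min(nearest_sq[i], dist[i][chosen] ** 2)
    return centers


# Toy distance matrix from six points on a line (illustrative only).
values = [0, 1, 2, 100, 101, 102]
dist = [[abs(a - b) for b in values] for a in values]
two_seeds = kmeanspp_seeds(dist, k=2)
all_seeds = kmeanspp_seeds(dist, k=6)
```

The resulting indices can then be handed to a standard assign-and-update loop as its starting centers.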
The time complexities of SK-means and K-means++ both have quadratic polynomial forms. The space complexities of SK-means and K-means++ have quadratic polynomial and linear forms, respectively.

Conclusions
Here, we developed two efficient methods for identifying high-quality protein structural models using enhanced K-means clustering algorithms (SK-means and K-means++). On the publicly available benchmark dataset (I-TASSER decoy Set-II), our results showed that SK-means and K-means++ were more robust than SPICKER at identifying conformational targets, with 59% and 75% of targets, respectively, exhibiting TM-scores better than or equal to those identified by SPICKER. On the CASP11 hard dataset, 8 (66.7%) of the 12 K-means++ model1 structures have a higher TM-score than SPICKER model1. The average TM-score of the SK-means best model (0.46), which equals the average TM-score of the K-means++ best model, is 28% higher than the average TM-score of SPICKER model1 (0.36). These findings demonstrate that the two methods achieved better results on populations of candidate decoy conformations, possibly due to our improvements in initializing the cluster centers, which reduce the element of randomness.

Data Access
Programs and the data employed in the experiments are available online at http://eie.usts.edu.cn/whj/SK-means/index.html.

Disclosure
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the paper.