patGPCR: A Multitemplate Approach for Improving 3D Structure Prediction of Transmembrane Helices of G-Protein-Coupled Receptors

The structures of the seven transmembrane helices of G-protein-coupled receptors are critically involved in many aspects of these receptors, such as receptor stability, ligand docking, and molecular function. Most of the previous multitemplate approaches have built a “super” template with very little merging of aligned fragments from different templates. Here, we present a parallelized multitemplate approach, patGPCR, to predict the 3D structures of transmembrane helices of G-protein-coupled receptors. patGPCR, which employs a bundle-packing related energy function that extends on the RosettaMem energy, parallelizes eight pipelines for transmembrane helix refinement and exchanges the optimized helix structures from multiple templates. We have investigated the performance of patGPCR on a test set containing eight determined G-protein-coupled receptors. The results indicate that patGPCR improves the TM RMSD of the predicted models by 33.64% on average against a single-template method. Compared with other homology approaches, the best models for five of the eight targets built by patGPCR had a lower TM RMSD than that obtained from SWISS-MODEL; patGPCR also showed lower average TM RMSD than single-template and multiple-template MODELLER.


Introduction
G-protein-coupled receptors (GPCRs) are among the most heavily investigated drug targets in the pharmaceutical industry [1] because activated GPCRs trigger a cascade of responses inside the cell. Although about 800 GPCRs in the human genome still await determining, the annual revenue in the market for human therapeutics based on the currently available receptors is in excess of $40 billion, and more than 50% of modern drugs are related to GPCRs [2]. On the other hand, it is still a very difficult problem to determine the conformation of GPCRs in vivo. The lipid environment in which the receptors are embedded blocks the two major techniques, nuclear magnetic resonance (NMR) spectroscopy and X-ray crystallography, that are used to determine protein structures.
It is really exciting that the Nobel Prize in Chemistry was awarded in 2012 to two researchers studying the structure of GPCRs.
Fortunately, as demonstrated by recent publications [3,4], in silico methods for deducing the three-dimensional structure of GPCRs have been increasingly successful. However, the development of computational approaches to predicting the structure of GPCRs remains a challenging task [5]. A lot of effort has been put into modeling the structures of the full-chain and the loop sections of GPCRs [6][7][8] and of membrane proteins [9,10], whereas comparatively little research has been done on building more accurate models of the transmembrane helix sections of GPCRs, because the transmembrane helical bundles are commonly regarded as a conserved domain that can be easily duplicated 2 Computational and Mathematical Methods in Medicine from templates. In fact, the accuracy of the models of the transmembrane helix sections is still far away from the native structures, and it cannot meet the requirements for subsequent full-chain prediction and the modeling of ligand docking. For the GPCR Dock 2010 assessment [3] targets CXCR4/CVX15, CXCR4/IT1t, and D3, the averages of the TM RMSD values (the root mean square deviation of the backbone of the transmembrane helices) of all models submitted by the participants were 3.56Å, 9.75Å, and 2.55Å, respectively, which are not high-resolution values (<2.0Å). How can one build high-resolution models of the conformations of full-length GPCRs without an accurate transmembrane helical bundle? This paper addresses this problem using a parallelized multitemplate homology approach.
The classification of methods for the prediction of the three-dimensional structure of GPCRs in silico into homology modeling, threading, and ab initio folding follows the classification of methods for protein structure prediction. The homology method builds models starting from one or more template structures with a high sequence similarity. When the sequence identity between target and template is more than 50%, near-native models tend to be generated. When it is less than 30%, the accuracy of the models decreases sharply. SWISS-MODEL [11], Sybyl [12], Prime [13], MODELLER [14], NEST [15], SEGMOD/ENCAD [16], 3D-JIGSAW [17], and Builder [18] are widely used, stable, reliable, and accurate systems for homology modeling. The threading method operates by "threading" (i.e., placing and aligning) each amino acid (or amino acid segment) in the target sequence into a position in the three-dimensional structure and evaluating how well the target fits the template. Zhang et al. [6] used TASSER to generate structure predictions for all 907 putative GPCRs in the human genome. Based on a benchmarked confidence score, approximately 820 of the predicted models should have the correct folds. Ab initio prediction of protein structure involves modeling the dihedral angles for each residue based on the minimum-free-energy principle, without the use of any experimentally solved structures. Baker's group [19,20] was successful in the recent CASP (the Critical Assessment of protein Structure Prediction) challenge and designed a membrane-environment-related energy function to guide membrane protein folding.
Homology approaches can be divided into two categories, namely, single-template and multiple-template, depending on the number of templates employed in modeling. Singletemplate homology approaches cannot always achieve the best results, owing to difficulties in detecting the best template, particularly when remote homology is detected [21]. Multitemplate approaches can effectively combine more reasonably aligned regions than the single-template approach Cheng [22] reported that a multitemplate combination algorithm improved the GDT-TS scores of the predicted models by 6.8% on average for 45 CASP7 comparative-modeling targets. Liu et al. [23] took into account the information represented by multiple templates and alignments at the threedimensional level by mixing and matching regions between different initial comparative models, and the multitemplate approach produced conformational models of higher quality than the individual starting predictions. MODELLER [14] modeled 3D conformations by optimally satisfying spatial restraints derived from the alignment and expressed as probability density functions for the features restrained. NEST [15] produced a model by taking a sequence alignment of a target to one or multiple template PDB files as input. 3D-JIGSAW [17] used a convergence of algorithms for comparative modeling which led to more reliable structures by superimposing multiple known structures from a protein family. In our opinion, previous multitemplate approaches have generally built a "super" template, which might ignore the flexibility of aligned structures from different templates. The longdistance homology information from different templates should be exchanged in each iteration rather than directive merging structures.
This paper proposes a parallelized multitemplate approach, inspired by our previous method pacBackbone [24,25], for the prediction of the three-dimensional structure of the transmembrane helices of GPCRs. The system proposed here is referred to as "patGPCR" (parallelized multitemplate approach to GPCR transmembrane helix structure prediction). Parallelization not only accelerates the running speed but also provides a novel and effective mechanism to exchange homology regions between templates softly. We have exploited our method to predict tertiary structures for all eight determined GPCRs published on the GPCR Network website (http://gpcr.scripps.edu/index.html). Compared with other homology approaches, the best models for five of the eight targets built by patGPCR had a lower TM RMSD than was obtained from SWISS-MODEL, patG-PCR showed lower average TM RMSD than single-template MODELLER, and patGPCR showed only one higher average TM RMSD target compared with multiple-template MOD-ELLER.

Parallelized Framework of patGPCR.
GPCRs share a similar structural topology, composed of seven transmembrane (TM) helices packed into a 7-TM helical bundle, with three intracellular (icls) and three extracellular loops (ecls) [26]. Thus, single helical refinement should be paralleled in independent threads. The parallelized framework for patGPCR is depicted in Figure 1. At the beginning of patGPCR, top 2-4 templates were examined for sequence identities using the Protein Data Bank (http://www.rcsb.org/pdb), which were used as starting template conformations. Eight parallelized pipelines which involve independent subprocedures (or threads) were used to randomly select starting conformations of the templates. Each pipeline containing TM refinement and loop refinement optimized an adjacent helix pair region. The first and the last pipelines optimized only one helix and one terminus. To use reasonable structural regions within different templates, TM helix crossing step and elite pool were introduced into the parallelized framework at the end of the pipelines. In the crossing step, each pipeline shared the best-so-far helix with other pipelines. If lower energy was obtained by helix crossing, the helix substitution was accepted; otherwise, it was rejected. Optimized conformations are conserved in the elite pool prepared for the next Output: conformation after TM refined is the Rosetta membrane energy function with S. 3 is the Rosetta score3 energy function. iteration. Thus, the crossing step and elite pool are two critical mechanisms of patGPCR to identify reasonable structures from multiple templates to pipelines and iterations.

Multiple
Templates for Eight GPCR Targets. patGPCR was evaluated by using blind prediction testing the set of the eight determined GPCRs published on the GPCR network. Amino acid sequence was the only input used for blind prediction, which is commonly employed in GPCRDock2008/2010 and CASP exercises. This is the largest set used for directly evaluating prediction results. patGPCR employed 2-4 templates (column 4 in Table 1) for each target in the test set. The top three or four templates were selected by standard protein BLAST based on default parameters. Most sequence similarities between the templates and targets were lower than 50% and average sequence similarity for the templates was 36% (column 6 in Table 1). In some cases, the templates have high-sequence identity. There are two reasons why we did not remove these cases from benchmark. First, we are interested in validating the modeling ability of patGPCR based on various templates with different sequence identities. Second, patGPCR, SWISS-MODEL, and MODELLER used the same templates in the comparing experiments, so the results we presented are fair for these methods, no matter high-or low-sequence identity the templates is.

TM Refinement
Protocol. The 7-TM helical bundle is the primary topology of GPCRs, which comprises approximately 75% of amino acids in the entire protein chain. The TM helix has been conserved throughout evolution [5]. However, in a recent GPCRDock2008/GPCRDock2010 study, average TM RMSDs of CXCR4/CVX15, CXCR4/IT1t, and D3 were 3.56Å, 9.75Å, and 2.55Å. The accuracy of predicting TM regions can be improved, and the correct TM position and orientation ensure that loop regions are properly oriented. TM refinement protocols employed using patGPCR are based on 7-TM geometrical topology reflecting the bundle structure using a set of geometrical parameters. Topologically, each TM helix is regarded as a rigid body, and relative positions of internal atoms remain fixed when moving or rotating the rigid body. TM refinement is divided into four stages. In the first stage, patGPCR was used to identify TM boundaries by averaging the results of six existing methods: TopPred [27], UniProt [28], TMpred [29], HMMTOP [30], TMHMM [31], and OCTOPUS [32]. In the second stage, translation of each rigid body along, or perpendicular to, the axis of the helix was used to optimize the relative positions of seven helices. In the third and fourth stages, spin angels and tilt angles were refined, respectively.

Loop Refinement
Protocol. patGPCR, which involves an entirely predictive approach for GPCRs, combines two Rosetta loop modeling protocols. Due to high flexibility in loop regions, ab initio methods typically involve calculation of possible loop conformations with the help of various energy functions and minimizations [33][34][35]. patGPCR paralleled two Rosetta loop modeling protocols, including CCD [36] and KIC [37] modeling with Rosetta energy function score3. Each pipeline was used to randomly choose a loop movement to refine the loop section.  () refines the loop regions using Rosetta KIC or CCD remodel protocol randomly. In the 17th line, function () exchanges the helixes among pipelines.

Energy Item for Evaluating the Helical Bundles.
Developing an appropriate energy function remains a challenge in predicting GPCR 3D structures. An accurate energy function can be used to distinguish near-native models from candidates, while an imprecise energy function may not recognize near-native models even if they have been sampled by using an accurate search algorithm. patGPCR improves the Rosetta membrane energy function [19] by including a novel energy item for evaluating rigid helix packing. After   includes an additional item is employed by the patGPCR TM refinement algorithm:

The Contribution of TM Refinement Protocol Employed by Single
Pipeline. The contributions of TM refinement protocols were investigated on the comparison with models submitted by GPCRDOCK2010 participants. GPCR-DOCK2010 exercise is a community-wide assessment for GPCR homology-modeling and docking. The participants submitted the best five candidates of targets CXCR4 and D3. We executed patGPCR to optimize the helical bundles starting from the submitted conformations downloaded from GPCRDOCK2010 official website. The TM RMSD of the best model after refining and top 5 models published on GPCRDOCK2010 are listed in Table 2. For CXCR4 (D3) target, the TM RMSD of the best model after refining is 2.165Å (0.93Å), which ranks 4th (1st) among 161 (168) submitted models.

The Contribution of Parallelized Multitemplate.
We examined the contributions of multiple-templates versus a single-template of patGPCR with the same settings. The number of templates used by the patGPCR ranged from two to four. Average of sequence identities between the templates and targets was 36%. Compared to multitemplate patGPCR, single-template patGPCR employed only one template and did not utilize the helix crossing step. Full-length RMSD (FL RMSD) of GPCR and TM RMSD comparisons are plotted in Figure 3; dots with higher RMSD values (>15Å) are not included. The average improvement in TM RMSD using the multiple-template patGPCR was 33.64% (raw TM RMSD increase 1.57Å, columns 4 and 8 in Table 1).
Multiple-template patGPCR yielded a higher number of low TM RMSD conformations than the single-template patGPCR in most cases (five out of eight targets), including CXCR4 (Figure 3(a)), D3 (Figure 3(d)), A2A (Figure 3(e)), KOR3 (Figure 3(g)), and Beta2 (Figure 3(h)). In other cases, the multitemplate approach showed similar performance with single-template patGPCR, including KOR1, HH1R, and S1P1 in Figures 3(b), 3(c), and 3(f). For KOR1, the narrow sampling space in the single template may have resulted from similar performance between multiple-template and singletemplate patGPCR. For HH1R and S1P1, both multipletemplate and single-template patGPCR appeared to show bottlenecks. One reason for this may be that the transmembrane refining protocol employed by multiple-template and single-template approaches failed in some cases. Further analysis is described in Section 4.3.

Comparison of patGPCR to Homology Approach SWISS-MODEL.
The SWISS-MODEL [16] is a widely used homology modeling approach which provides online prediction and manual template specification. We executed the SWISS-MODEL to predict eight targets using the same templates as used in patGPCR (Table 1). Three predictions that failed using the SWISS-MODEL are indicated as "×" in column 9 in Table 1.
For each decoy (set of conformations of the predicted result) predicted by patGPCR, the top 400 conformations in terms of TM RMSD were reserved, while the others were eliminated from the decoy. Thus, we simplified selection of the nearest native prediction from the decoys. We depicted the comparison of decoys from the SWISS-MODEL results using a box-and-whisker plot ( Figure 4). Prediction accuracy was expressed as the RMSD in angstroms, which was calculated after superimposing the corresponding coordinates of alpha-carbons determined by prediction and those of the native structure. To accurately determine transmembrane helices, the best model of five targets (CXCR4, KOR1, D3, A2A, and Beta2) yielded by patGPCR showed lower TM RMSD values than those determined using the SWISS-MODEL. patGPCR showed similar transmembrane accuracy on two targets (HH1R and S1P1); another target (KOR3), patGPCR, was inferior to the target identified using the SWISS-MODEL. In Figure 4, the accuracy of full-length chain and loop regions is shown. For four of the eight targets (KOR1, HH1R, KOR3, and Beta2), the best models predicted by patGPCR showed lower full-length RMSD values than those generated using the SWISS-MODEL. It is not surprising that only three targets (KOR1, KOR3, and Beta2) for patGPCR were more accurately predicted compared to the SWISS-MODEL regarding the accuracy of loop regions since patGPCR places emphasis on helix optimization, whereas relatively little emphasis is placed on loop optimization and  the loop regions were excluded in the crossing step. Next, statistical properties of the representative solutions, the top 400 conformations regarding TM RMSD, for each decoy set were investigated. For each result decoy, Figure 5 shows the 1-percentile average (x-axis) and standard deviation (error bars of symbols) for every test instance, calculated from 50fold bootstrap estimation using the BioShell package [38]. We choose the 1 percentile compared with the SWISS-MODEL results (y-axis) since only some of the decoys will be retained for further analysis, such as side chain refinement, ligand docking, and virtual screen. In Figure 5, patGPCR showed 14, 16, and 11 better results (below the line) for a total of 20 symbols on TM, FL, and loop RMSD values, respectively. In Figure 5, the lower deviations of the error bars indicate that patGPCR exhibits robust performance in most cases.

Comparison of patGPCR to Homology Approach MOD-ELLER. MODELLER [14]
is a well-known tool used for single-template and multiple-template homologies or comparative modeling of protein three-dimensional structures. We conducted single-template and multiple-template experiments using MODELLER on the same Linux system and hardware as was used for patGPCR. For a reasonable comparison, single-template MODELLER generated 1300 predicted conformations, which was similar to the average predicted number of 1358.75 conformations generated by patGPCR, for each target and each template. Multiple-template MOD-ELLER was also used to generate 1300 prediction conformations for each target. MODELLER also employed the same templates as patGPCR listed in Table 1. MODELLER aligning commands, including align2d() and salign(), which reportedly take into account structural information from the template when constructing an alignment, were used to align templates to the targets. MODELLER experimental results are tabulated in columns 10 and 11 in Table 1. The 10th column shows the average and standard deviation of TM RMSD of conformations generated using the single-template MODELLER, and the 11th column shows the average and standard deviation of TM RMSD of conformations generated by the multipletemplate MODELLER. For six out of eight targets, patG-PCR showed lower average TM RMSD than single-template MODELLER, and patGPCR showed only one higher average TM RMSD target compared with multiple-template MOD-ELLER. Multiple-template patGPCR yielded a larger number of lower TM RMSD conformations than the single-template patGPCR for most cases (five out of eight targets), including CXCR4, D3, A2A, KOR3, and Beta2. However, no target was improved by using multiple-template MODELLER. The average TM RMSD values for conformations generated by using multiple-template MODELLER were nearly two times higher than those generated using single-template MODELLER. The lower standard deviation in column 11 indicates that a narrow structural sampling strategy was adopted by multipletemplate MODELLER, which may have led to the multipletemplate MODELLER being trapped by an unaligned region.
Finally, a direct visual comparison of conformations yielded by native, patGPCR, SWISS-MODEL, and MOD-ELLER are depicted in Figure 6.   the complexity of patGPCR depends on that of a single pipeline repeating Algorithm 1 times. For a GPCR target with amino acid residues, the functions (), (), and () execute a movement of each residue in the helices, so these three functions have the same time complexity, ( ). Therefore, the complexity of the TM refinement contained in the three stages from the fourth to the 15th line of Algorithm 1 is 3 × ( 2 ). The function () exchanges the residues in each pair of helices and has complexity ( 2 ). The complexity of () is determined by that of the Rosetta CCD and KIC refinement protocols, which are both composed of two-layer iterations repeating and (Rosetta parameters). Moreover, we generally set the parameters , 2 (in the fourth line of Algorithm 1), 4 (in the 12th line of Algorithm 1), , and in a way that is linearly dependent on . In summary, the time complexity of the parallelized patGPCR is eight times of that of a single pipeline, which is × (3 × ( 2 ) + ( 2 ) + ( 2 )) = ( 3 ).

Complexity
The space complexity of patGPCR is mainly determined by the elite conformation pool shown in Figure 1, the set of residue numbers for the transmembrane regions [], and the set of residue numbers for the loop regions []. The pool is designed to store the coordinates of four backbone atoms of the elite conformations with the same size of template, so it requires 4 × 3 × × space, where is the number of available templates for the corresponding GPCR targets.

Why Does the patGPCR Approach Work in General?
The following factors may contribute to the improvement in transmembrane helix structural prediction. First, a multitemplate approach can effectively combine more reasonably aligned regions than a single-template approach. Singletemplate approaches use the top-ranked template and its alignment with the target protein to model the structure. These approaches cannot always achieve the best results, owing to difficulties in detecting the best template, particularly when remote homology is detected between targets and templates [39]. Because most of the sequence similarities among GPCRs are less than 50%, it is unreliable to attempt to obtain high-resolution models using only a single template. We believe that each template, even if it has lower sequence similarity (<30%), contains a reasonable structure for the targets overall. Multiple-template approaches can not only increase the alignment coverage but also broaden the conformation search space by extracting aligned fragments from different templates.
Second, the parallelized approach enables an effective schema for exchanging reasonable regions among multiple templates. Previous multitemplate approaches [22,40] have generally built a "super" template by methods that involve very little merging of aligned fragments from different templates, whereas the parallelized schema of patGPCR includes a soft resolution for introducing aligned regions and exchanges the reasonable helices between different pipelines at the end of an iteration. Therefore, the contribution of aligned regions from different templates will not only influence the beginning of the algorithm but also be preserved in subsequent iterations. Fortunately, because the helices that the GPCRs have in common are conserved, the transmembrane refinement protocol (see Section 2.3) will improve the accuracy of the aligned regions by moving the coordinates of the atoms toward lower free energy. This also helps the parallelized schema to broadcast the reasonable regions to the elite conformation pool.
Third, the additional energy item related to the helical bundles (see Section 2.6) can reasonably guide the helices toward assembling into a tighter bundle, although not perfectly. The combination of the Rosetta membrane score with based on the GPCR targets is effective in patGPCR, as shown in our experiments.

Limitations of patGPCR.
On the other hand, patGPCR can also be improved in some respects. First, owing to the inadequately known structures of GPCRs, we could only validate the strengths and weaknesses of patGPCR on eight determined cases published on the GPCR Network website. These eight targets are the largest test data set for GPCRs at present. Therefore, the question of whether the current results of our experiments are occasional instances or have some universality cannot be answered immediately. Second, patGPCR seemingly encountered a bottleneck in some cases (in Figures 3(c) and 3(f)); it was difficult for patGPCR to obtain an accuracy results than single-template approach. The bottleneck may have two reasons: (a) the energy term in (1) improves the Rosetta energy function, but it is still not good enough to distinguish between the near-native conformations and (b) in patGPCR, the helix is treated as a rigid body, unlike real helices, which have disorder, irregular regions, and flexibility. The coordinates of the atoms in the interior of the helix are likely to cause a loss of precision. Third, although patGPCR generates a higher quality of decoys in the transmembrane regions, the difficulty of building accurate models of loop regions with high flexibility is a consequence of the difficult issues of the GPCR structure prediction problem.

Ability of New Energy
Item. The ability of the new energy item was investigated for six GPCR targets, including CXCR4, KOR1, D3, A2A, KOR3, and Beta2, whose average TM RMSDs were lower than 5Å. The relatively lower average TM RMSD values gained by using both patGPCR and MODELLER indicate that the accuracy of these targets depend on the energy functions. In Table 3, the top quarter of conformations in terms of TM RMSD values from the decoy of multiple-template patGPCR were chosen to compare how many conformations could be predicted using MEM (Rosetta membrane score) and (new energy item). For four targets (KOR1, D3, A2A, and KOR3), identified more nearly native conformations than MEM. Only one target (CXCR4) was identified to have fewer conformations than MEM.

Conclusion
In this paper, we have presented a parallelized multitemplate approach, patGPCR, to predicting the 3D structure of the transmembrane helices of G-protein-coupled receptors. patGPCR, which employs a bundle-packing-related energy function that improves on the RosettaMem energy, parallelizes eight pipelines for the refinement of transmembrane helix structures and exchanges the optimized helix structures The four red cycles indicate that patGPCR (green) backbones in the TM region are closer to native (red) than those predicted using the SWISS-MODEL (blue) and MODELLER (cyan) in targets A2A (a) and CXCR4 (b). among multiple templates. We have investigated the performance of patGPCR on a test set containing eight determined GPCRs published on the GPCR Network website. The results indicate that our multitemplate algorithm improves the TM RMSD values of the predicted models by 33.64% on average against a single-template method. Compared with other ab initio and homology approaches, patGPCR yielded more predicted conformations with high-resolution structures of the transmembrane helices. The best models for five of the eight targets built by patGPCR had a lower TM RMSD than had the models obtained from SWISS-MODEL, and half of them had a lower full-length RMSD. For six out of eight targets, patGPCR showed lower average TM RMSD than single-template MODELLER, and patGPCR showed only one higher average TM RMSD target compared with multipletemplate MODELLER.