Identification of Hot Spots in Protein Structures Using Gaussian Network Model and Gaussian Naive Bayes

Residue fluctuations in protein structures have been shown to be highly associated with various protein functions. Gaussian network model (GNM), a simple representative coarse-grained model, was widely adopted to reveal function-related protein dynamics. We directly utilized the high frequency modes generated by GNM and further performed Gaussian Naive Bayes (GNB) to identify hot spot residues. Two coding schemes about the feature vectors were implemented with varying distance cutoffs for GNM and sliding window sizes for GNB based on tenfold cross validations: one by using only a single high mode and the other by combining multiple modes with the highest frequency. Our proposed methods outperformed the previous work that did not directly utilize the high frequency modes generated by GNM, with regard to overall performance evaluated using F1 measure. Moreover, we found that inclusion of more high frequency modes for a GNB classifier can significantly improve the sensitivity. The present study provided additional valuable insights into the relation between the hot spots and the residue fluctuations.


Introduction
Flexibility and dynamics play key roles for proteins in implementing various biological processes and functions [1,2]. Residue fluctuations or atomic motions, contributing to large-scale conformational changes of protein structures, are shown to be closely related to functions of native proteins [3][4][5].
Two methods, molecular dynamic (MD) simulation and normal mode analysis (NMA), are widely used to investigate the dynamic link between protein structures and functions. The main drawback of MD simulations is their computational cost [6,7]. Coarse-grained NMA, such as elastic network model (ENM) [7], has been increasingly used in recent years as a powerful tool to elucidate the structure-encoded dynamics of biomolecules [2]. The ENMs, including the isotropic Gaussian network model (GNM) [8,9] and the anisotropic network model [10], define spring-like interactions between residues that are within a certain cutoff distance. They simplify the computationally costly all-atom potentials into a quadratic function in the vicinity of the native state, which allows the decomposition of the motions into vibrational modes with different frequencies that are often known as normal modes. Being simple and efficient, ENM and GNM have been validated in numerous applications that resulted in reasonable agreement with a wealth of experimental data, including prediction of X-ray crystallographic B-factors for amino acids [9,11], identifications of hot spots [12][13][14], catalytic sites [15], core amino acids stabilizing rhodopsin [16] and important residues of HLA proteins [17], elucidation of the molecular mechanisms of motor-protein motions [18], and general conformational changes and functions [3,4,[19][20][21][22][23][24][25][26][27][28][29][30][31].
Previous studies have shown in many cases that the normal modes including the high frequency (fast) modes and the low frequency (slow) modes by the GNM are very useful for recognizing several specific types of protein functions. In particular, the highest frequency modes that reflect local events at the residue level can be utilized to identify core residues or binding sites [16,17,20,32], while the lowest frequency modes are usually responsible for the collective functional dynamics of the global protein motions [23,33].
In area of protein-protein interaction, several studies such as Ozbek et al. [12], Haliloglu et al. [13], and Demirel et al. [14] utilized GNM to identify hot spots that are defined as the residues contributing more than 2 kcal/mol to the binding energy. Their results suggested that hot spots are predefined in the dynamics of protein structures and forming the binding core of interfaces. However, the mean square distance fluctuations of residue pairs and the mean square fluctuations of residues calculated from the highest frequency modes by GNM, rather than the direct usage of the highest frequency modes themselves, were applied to detect the hot spots in the work by Ozbek et al. [12] and by Haliloglu et al. [13] and Demirel et al. [14], respectively.
In addition, several computational methods by utilizing machine learning tools have been developed to predict hot spots from protein sequences and structures [34][35][36][37]. The advantage of learning methods is the ability to result in higher quality by sufficiently integrating the extracted feature information from protein structures. In this paper, we follow the work by Ozbek et al. [12] but focus on the direct usage of the highest frequency modes to investigate the relation between the residue fluctuations and the hot spots. The top 20 highest frequency modes by GNM were used as an original feature set inputted into Gaussian Naive Bayes (GNB), as a representative of learning methods, to identify hot spots. The main purpose of this study is to examine whether the raw fast modes can be directly used to differentiate hot spots or nonhot spots and whether the utilization of learning methods can improve the identification quality of hot spots for unbound protein structures.

Dataset.
We used the dataset that was collected by Ozbek et al. [12]. This set was filtered with PISCES culling server [38] at the sequence identity of 25% and was originally composed of 33 unbound protein structures. We had to remove one protein with ID 1lrp from the dataset since its structure cannot be currently found in Protein Data Bank (PDB) [39]. Therefore, the final dataset had 32 unbound protein structures with a total of 4270 residues of which 171 are hot spot residues. The dataset including the detailed information about hot spot residues can be derived from Ozbek et al. [12].

Gaussian Network Model and Its
Applications to Identification of the Hot Spots. GNM describes each protein as an elastic network, where the springs connecting the nodes represent the bonded and nonbonded interactions between the pairs of residues located within a cutoff distance [8,9]. Assuming that the springs are harmonic and the residue fluctuations are isotropic and Gaussian, the network potential of nodes (residues) in a protein structure is where R and R 0 are instantaneous and original distance vectors between residues and , respectively, is the force constant assumed to be uniform for all network springs, and Γ = (Γ ) is the Kirchhoff connectivity matrix defined as where 0 is the distance between residues and and is given as a cutoff. Then, the mean correlation between residue fluctuations is calculated as where U is the orthogonal matrix of eigenvectors (u ), Λ is the diagonal matrix of eigenvalues ( ), is the Boltzmann constant, and is the absolute temperature.
To identify hot spot residues, Ozbek et al. [12] used the mean square distance fluctuations (MSDF), ⟨ΔR 2 ⟩, of residues and given as which were calculated using high frequency modes of GNM based on a cutoff of 6.5Å. The residues with relatively high MSDF value were considered functionally probable; see more details in Ozbek et al. [12]. In addition, both Haliloglu et al. [13] and Demirel et al. [14] similarly defined mean square fluctuation (or vibration) (MSF) of residues in the weighted average of several high frequency modes based on a cutoff of 7.0Å, to identify the hot spot residues. The MSF of residue weighed by a subset of modes 1 ≤ ≤ 2 is given as Then, one residue was predicted as a hot spot if the normalized MSF of the residue (i.e., the measure expressed in (5) divided by 3 / ) is larger than a given threshold. The main difference between the work by Haliloglu et al. [13] and that by Demirel et al. [14] is the different thresholds adopted. Haliloglu et al. [13] used a constant threshold of 0.005 while it was 6 −1 given by Demirel et al. [14] where is the number of residues in a protein sequence.

Gaussian Naive Bayes.
A Naive Bayes (NB) classifier calculates the probability of a given instance (example) belonging to a certain class [40]. Given an instance described by its feature vector ( 1 , . . . , ) and a class target , the conditional probability ( | ) can be expressed as a product of simpler probabilities using the Naive independence assumption according to Bayes' theorem: Here, the target may have two values where = 1 means a hot spot residue and = 0 represents non-hot spot residue. for one residue (one instance) is a feature vector with the same size for describing its characteristic using high frequency modes generated by GNM. For example, is equal to a vector composed of th component u for th residue in a sequence when only one high frequency mode u is used. If three high frequency modes, denoted by u 1 , u 2 , and u 3 , are taken into account, the vector will be (u 1 , u 2 , u 3 ) for residue in a protein sequence. Moreover, if a window size of 3 with respect to the residue is adopted, becomes Since ( ) is constant for a given instance, the following rule is adopted to classify the instance whose class is unknown:̂= where "arg" means a value of so that the above expression is maximized; that is, if ( = 1)∏ ( | = 1) is larger than Moreover, when the likelihood of the features (i.e., ( | )) is assumed to be Gaussian, a NB classifier is called Gaussian Naive Bayes (GNB). Due to its simplicity and being computationally fast compared to other more sophisticated methods, GNB has been widely applied to prediction problems in bioinformatics [41,42]. In this study, GNB was mainly used to train the models by inputting the highest frequency modes to identify hot spot residues.

Performance Evaluation.
In a classification task, the following quality indices, including sensitivity (also known as recall), specificity, precision, and the overall accuracy, were generally used to assess prediction performance: where true positives (TP) and true negatives (TN) correspond to correctly predicted hot spot residues and non-hot spot residues, respectively, false positives (FP) denote nonhot spot residues predicted as hot spot residues, and false negatives (FN) denote hot spot residues predicted as non-hot spot residues. Obviously, the dataset used in this study is extremely unbalanced with a very high proportion of non-hot spot residues. For this reason, the accuracy value is not a good choice to evaluate the overall performance of results. When a dataset includes 95% negative samples but 5% positive samples, a classifier may identify all of them as negative, resulting in 95% overall accuracy and 100% specificity. This is really shown as excellent performance, but it fails to identify the positive samples that we actually need pay close attention to. Moreover, two indices, sensitivity and precision, can both measure the classification correctness for positive samples. It is strongly expected that these two indices can synchronously reach high values, but there exists a trade-off between them in general. Therefore, we used 1 measure to evaluate the overall prediction performance: which can balance the sensitivity and the precision in case of the unbalanced dataset. The formula of the 1 measure can be changed to be 1 = 2/((1/sen) + (1/pre)) when both sen and pre are exactly larger than zero. Thus, 1 measure can be viewed as an increasing function of sen and pre. The minimum of 1 is 0 when sen = 0 or pre = 0, and the maximum of 1 is 1 when sen = 1 and pre = 1.

Identification of Hot Spots Using GNM and GNB.
The experimental performance on identification of hot spot residues is tested using -fold cross validation ( CV) on the dataset composed of 32 unbound protein structures. In the CV procedure, chains are randomly divided into subsets with the same numbers of sequences, and the test is repeated times. In each time, the − 1 subsets are used to build the model, and the remaining one subset is then tested by the prediction model.
In the present study, we performed tenfold cross validation (10CV) based on Gaussian Naive Bayes using the highest modes as features from GNM outputs in different ways. Then, we mainly implemented two schemes concerning feature coding for investigating the relations between the highest modes and the hot spot residues. Firstly, a classifier was modeled by directly using single one of the top 20 high frequency modes (i.e., the eigenvectors (u ) that correspond to the top 20 largest eigenvalues ( )). Meanwhile, a sliding window of the central residue with sizes ranging from 1 to 21 was utilized to examine the impact of the neighboring residues' fluctuations, and the computation of GNM was carried out by usage of multiple distance cutoffs ranging from 6.0 to 8.0 with a step size of 0.1. Secondly, we combined top modes with the highest frequency ( = 1, 2, 3, . . . , 20) and utilized similar scheme for the distance cutoff of GNM computation and the sliding window of the central residue to establish the models for identifying hot spot residues.

Identification of Hot Spot Residues Using Single One of the Highest Modes.
In this work, the overall performance was evaluated by the 1 measure in (9), which is able to balance the sensitivity and the precision. Table 1 lists twenty  Table 1: List of top 20 1 measures based on tenfold cross validations of Gaussian Naive Bayes when using single th highest mode ( = 1, 2, . . . , 20) inputted into the feature vector, where cutoff means the distance threshold for GNM computation that varies from 6.0 to 8.0 with step size of 0.1 and sw represents the size of the sliding window for the central residue that ranges from 1 to 21 with step size of 2. computational outcomes of the prediction performance that are ordered by 1 measure, where the feature vector for a GNB classifier was extracted from single one mode, that is, th highest mode ( = 1, 2, . . . , 20), the distance cutoff in GNM varied from 6.0 to 8.0 with the step size of 0.1, and the sliding window for one mode ranged from 1 to 21 with a step size of 2. As shown in Table 1, the highest performance was achieved by 1 measure of 0.1517 when the distance cutoff is 7.1Å and the size of the sliding window is 3 in case of the 8th highest mode. Moreover, top six 1 measures shown in Table 1 were from the same 8th highest mode, indicating that the best performance achieved may not belong to the first or second highest frequency mode. Even the 19th and the 13th highest modes can also result in relatively high 1 measures. From the aspect of cutoff, it has been shown that majority of the cutoff values shown in Table 1 are in or close to the [7.0, 7.3] interval.
Given the cutoff of 7.3Å in GNM, we plotted sensitivity, precision, and 1 measure for all of the top 20 high modes; see Figure 1. Three cases with sizes of the sliding windows equal to 1, 3, and 5 were examined. It is apparent that the 1 measures and the sensitivity values for the majority of the 20 modes can be improved when the size of the sliding window is from 1 to 3. However, there is no sufficient evidence to prove that larger size of the sliding window can further increase the 1 measure. On the other hand, the majority of the sensitivity values were improved when the window size was increased from 3 to 5, but no consistent trend can be found for precision values in three cases of the window sizes.

Identification of Hot Spot Residues by Combining the
Highest Modes. Furthermore, top modes ( = 1, 2, . . . , 20) with the highest frequency were combined to establish the GNB classifier and to investigate whether the prediction performance can be improved. For example, when is taken to be 10, top ten high modes (i.e., hm1, hm2, . . . , hm10) are together inputted into the feature vector of a GNB classifier. Meanwhile, the classification experiments were also performed on various cases in which the distance cutoff is from 6.0 to 8.0 with the step size of 0.1 and the size of the sliding window (sw) ranges from 1 to 21 with the step size of 2. Table 2 lists twenty outcomes of these computational experiments ordered by 1 measure. Among these results, the size of the sliding window is almost 1 except the case of the 10th highest 1 measure in which 9 high modes and the window size of 3 were used, suggesting that the fluctuation of the central residue may be sufficient to identify hot spot residues by a combination of multiple high frequency modes. Moreover, as shown in Table 2, the distance cutoff often belongs to the [7.1, 7.5] interval, and it seems that a larger value tends to result in higher sensitivity. For instance, the sensitivity value obtained by a combination of the top 10 high modes with cutoff of 7.4Å (i.e., the case of top 1 1 measure) is 0.2924, while the sensitivity values in the cases of top 4, 6, and 7 1 measures, which are achieved by the usage of the top 20, 19, and 20 high modes, respectively, are all larger than 0.41.
In Figure 2, we plotted the sensitivity, the precision, and the 1 measure against modes with the highest frequency that were combined as features for five cases denoted by   the distance cutoffs of 7.1Å, 7.2Å, 7.3Å, 7.4Å, and 7.5Å, respectively, where the sizes of the sliding window for all cases are 1. It can be seen from the figure that these three indices are consistently improved with the number of top high modes used up to 10. Especially for the case of sensitivity, its value is an increasing function of the number of modes with the highest frequency. It can be concluded that inclusion of more high frequency modes can improve the sensitivity, but the precision values become slightly decreased by adding more high frequency modes when the number of high modes combined is larger than 10. In the meantime, the 1 measure tends to be no longer enhanced.

Performance Comparison with Existing Methods.
In the present work, we directly inputted the high frequency modes to a GNB classifier for predicting hot spots when compared with the existing methods proposed by Ozbek et al. [12], Haliloglu et al. [13], and Demirel et al. [14].
Ozbek et al. [12] utilized the mean square distance fluctuations of residue pairs, which were computed at most based on five top high frequency modes, to identify hot spot residues. It may be not appropriate to directly compare our work with the results obtained by Ozbek et al. [12], since the datasets used and the test procedures are both slightly different. However, we reported here again part of outcomes from Table 1 in Ozbek et al. [12] for a comparison. The 1 measures were calculated on the reported sensitivity and precision values, as shown in Table 3. In addition, no results concerning the prediction quality of hot spot residues based on a nonredundant dataset were reported in Haliloglu et al. [13] and Demirel et al. [14], where only MSF profiles for a couple of protein cases were depicted and shown as figures. The usage of the number of high frequency modes is not consistent that three, four, or five fast modes may be adopted for different cases. Due to the lack of details and web servers, we here simulated their methods on the dataset in this work by computing the normalized MSF values weighted by one up to five high frequency modes using a cutoff of 7Å for GNM. A constant 0.005 and a varied value Table 3: Performance comparison of the proposed models with the work by Ozbek et al. [12] and the simulated methods proposed by Haliloglu et al. [13] and Demirel et al. [14], where hm1-means that a total of high frequency modes (hm1, hm2, . . . , hm ) are used together.
Reference GNM modes Cutoff sw sen spe pre acc 1 Ozbek et al. [ 6 −1 with respect to the sequence length were used to identify hot spot residues for the simulations of the methods by Haliloglu et al. [13] and Demirel et al. [14], respectively. The quality indices including sensitivity, specificity, precision, accuracy, and 1 measure for these simulations were then reported in Table 3. We also listed part of the best outcomes from this study in Table 3, two using single high mode and four by a combination of multiple high modes, which have been shown in Tables 1 and 2. On the whole, if evaluated by 1 measure or precision, all of the cases in Table 3 by this work outperformed the results by Ozbek et al. [12] and by the simulated methods of Haliloglu et al. [13] and Demirel et al. [14]. This suggests that the direct usage of the high frequency modes is efficient to identify hot spot residues. Besides, the improvement on 1 measure by combining multiple high frequency modes seems to be very slight when compared with the methods only using single high mode, while the sensitivity values in general tend to be improved a lot. This is in good agreement with the work by Ozbek et al. [12] and the simulation results of Haliloglu et al. [13] and Demirel et al. [14] as outlined in Table 3. Additionally, the specificity and accuracy values of the simulated method for Demirel et al. [14] are higher than those of other methods, but on the contrary the values of sensitivity, precision, and 1 measure are in general lower. The reason causing worse quality on 1 measure achieved by the simulation of Demirel et al. [14] is due to a larger threshold used when compared with the simulated method of Haliloglu et al. [13].
In addition, we also performed computational experiments using several common classifiers, including logistic regression, decision tree, -nearest neighbor, and support vector machine with default parameters, instead of GNB, where all of the machine learning methods were implemented in scikit-learn [43]. As a consequence, the results (data not shown in this paper) showed that GNB exhibited better performance than other classifiers. This is the reason why we finally adopted GNB as the base classifier for identification of the hot spot residues.

Conclusion
In this study, we followed previous work [12][13][14] focusing on the identifications of hot spots by using GNM but directly used the high frequency modes and further performed GNB classifier. The proposed methods outperformed the outcomes reported in Ozbek et al. [12] and the simulated results of Haliloglu et al. [13] and Demirel et al. [14] based on 1 measure to evaluate the overall performance. The results by this work suggested that the high frequency modes can be directly used to identify hot spot residues with reasonable performance. In case of the scheme using only single high frequency mode, the largest 1 measure may not be necessarily achieved by one of the top five high frequency modes. In our study, it was surprisingly gained by the 8th highest mode with the distance cutoff of 7.3 and the window size of 3. We further included more modes from total number of 20 high frequency modes when compared with the work 8 BioMed Research International by Ozbek et al. [12] in which at most five frequency modes are used. Of particular interest is the fact that inclusion of more high frequency modes can significantly improve the sensitivity value, but not the 1 measure and the precision in general.
The dataset used in this work is obviously unbalanced. There is a trade-off between the sensitivity and the precision. It is not easy for researchers to find a perfect way to determine the proper performance index to evaluate experimental results. Therefore, we finally reported multiple results as listed in Tables 1, 2, and 3, which were considered for choices associated with different purposes in practice. Overall, the present study provided additional valuable insight into the relation between hot spots and residue fluctuations.