Heat shock proteins (HSPs) are ubiquitous in living organisms. They are essential for cell growth and survival, and their main function is to control the folding and unfolding of proteins. According to molecular function and mass, HSPs are categorized into six families: HSP20 (small HSPs), HSP40 (J-proteins), HSP60, HSP70, HSP90, and HSP100. In this paper, an improved method for HSP prediction is proposed: the split amino acid composition (SAAC), the dipeptide composition (DC), the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were selected as features to predict HSP families with a support vector machine (SVM). To overcome the imbalanced data classification problem, the synthetic minority oversampling technique (SMOTE) was used to balance the dataset. With the balanced dataset, the overall accuracy in the jackknife test was 99.72% using the optimized combined feature SAAC+DC+CTF+PseACS, which was 4.81 percentage points higher than with the imbalanced dataset and the same combined feature. The sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC) for the HSP families in our predictive model were higher than those of existing methods. This improved method may be helpful for protein function prediction.
Heat shock proteins (HSPs) are ubiquitous in living organisms. They act as molecular chaperones by facilitating and maintaining proper protein structure and function [
The benchmark dataset was generated by Feng et al. [
The number of sequences in HSP families.
Family | Number of HSP samples |
---|---|
HSP20 | 357 |
HSP40 | 1279 |
HSP60 | 163 |
HSP70 | 283 |
HSP90 | 58 |
HSP100 | 85 |
Overall | 2225 |
The number of sequences in the independent dataset.
Families | HGNC dataset | RICE dataset (Wang et al.) | RICE dataset (Sarkar et al.) |
---|---|---|---|
HSP20 | 11 | 14 | — |
HSP40 | 49 | — | — |
HSP60 | 15 | 4 | — |
HSP70 | 17 | 7 | 24 |
HSP90 | 4 | 3 | — |
HSP100 | — | 3 | — |
Total | 96 | 31 | 24 |
The prediction model process is illustrated in Figure
The flowchart of the proposed method. SAAC: split amino acid composition; DC: dipeptide composition; CTF: conjoint triad feature; PseACS: pseudoaverage chemical shift; SMOTE: synthetic minority oversampling technique.
To predict HSPs, it is important to choose a suitable classifier and a reasonable set of feature parameters. In this paper, the split amino acid composition (SAAC), the dipeptide composition (DC) [
Split amino acid composition (SAAC) is a feature extraction method based on AAC. In SAAC, the protein sequence is split into various segments; then, the composition of each segment is counted separately [
With this method, we can get
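The segment-wise counting described for SAAC can be sketched as follows. This is a minimal illustration, not the authors' implementation: the number of segments (`n_segments=3`), the equal-length split, and the function names `aac` and `saac` are assumptions, since the exact segmentation is not recoverable from the text here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(segment):
    """Return the 20-dimensional amino acid composition of one segment."""
    n = len(segment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    # Non-standard residues (e.g. X) count toward the length but not toward any bin.
    return [segment.count(a) / n for a in AMINO_ACIDS]


def saac(sequence, n_segments=3):
    """Split the sequence into contiguous parts and concatenate their compositions.

    Assumption: equal-length segments; the paper may use fixed-length termini instead.
    """
    seq = sequence.upper()
    bounds = [round(i * len(seq) / n_segments) for i in range(n_segments + 1)]
    features = []
    for start, end in zip(bounds, bounds[1:]):
        features.extend(aac(seq[start:end]))
    return features


vec = saac("MKTAYIAKQRQISFVKSHFSRQ")  # hypothetical toy sequence
print(len(vec))  # 20 values per segment
```

With three segments the vector has 3 × 20 entries, and each segment's block sums to 1 when the segment contains only standard residues.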
Dipeptide composition (DC) is a discrete method using sequence neighbor information [
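As a rough illustration of how neighboring-residue information enters the DC feature, the sketch below computes the frequencies of the 400 possible adjacent residue pairs. The function name `dipeptide_composition` and the decision to skip pairs containing non-standard residues are assumptions made for this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400-dimensional frequency vector of adjacent residue pairs."""
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: pairs with non-standard residues are skipped
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


vec = dipeptide_composition("ACAC")  # pairs: AC, CA, AC
print(vec[DIPEPTIDES.index("AC")])
```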
The conjoint triad feature (CTF) representation was used by Shen et al. [
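A minimal sketch of a conjoint triad encoding is given below. It assumes the seven-class residue grouping commonly attributed to Shen et al. and a normalization of the form (f − min)/max; both should be checked against the cited work, and the function name `conjoint_triad` is introduced only for illustration.

```python
# Assumed seven-class grouping of residues (commonly cited from Shen et al.).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: idx for idx, group in enumerate(GROUPS) for aa in group}


def conjoint_triad(sequence):
    """Return a 343-dimensional vector of normalized class-triad frequencies."""
    # Assumption: non-standard residues are dropped before triads are formed.
    classes = [CLASS_OF[a] for a in sequence.upper() if a in CLASS_OF]
    counts = [0] * (7 ** 3)
    for a, b, c in zip(classes, classes[1:], classes[2:]):
        counts[a * 49 + b * 7 + c] += 1  # index of the triad (a, b, c)
    lo, hi = min(counts), max(counts)
    if hi == 0:
        return [0.0] * len(counts)
    return [(f - lo) / hi for f in counts]


vec = conjoint_triad("AGVAGV")  # every residue falls in class 0
print(len(vec), vec[0])
```

Grouping residues before counting keeps the feature at 7³ = 343 dimensions instead of the 8000 that raw tripeptides would require.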
Nuclear magnetic resonance (NMR) plays a unique role in studying the structure of proteins because it provides information on the dynamics of the internal motion of proteins on multiple time scales [
For a protein
After, we select
As shown in Table
The support vector machine (SVM) is a machine learning algorithm based on statistical learning theory. The basic idea of SVM is to transform the input data into a high-dimensional Hilbert space and then determine the optimal separating hyperplane [
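The idea of separating data in a high-dimensional space can be illustrated with the decision function of an already trained binary SVM. The sketch below assumes a radial basis function kernel; the support vectors, coefficients, bias, and `gamma` are made-up values, and none of them come from the paper. It shows only how a prediction is evaluated, not how the model is trained.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2), an implicit high-dimensional inner product."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svm_decision(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """Signed score: sum_i (alpha_i * y_i) * K(sv_i, x) + b."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias


# Hypothetical trained model: one support vector per class.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
dual_coefs = [-1.0, 1.0]  # alpha_i * y_i
score = svm_decision((1.8, 2.1), support_vectors, dual_coefs, bias=0.0)
label = 1 if score >= 0 else -1
print(label)
```

For the six HSP families, several such binary decision functions would typically be combined, for example in a one-versus-rest scheme; which scheme the authors used is not stated in this section.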
In statistical prediction, three cross-validation tests are commonly used to examine a predictor for its effectiveness in practical application: the
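The jackknife (leave-one-out) test used in this paper can be sketched as follows: each sample is held out once, the model is trained on the rest, and the held-out sample is predicted. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM; the classifier, the function names, and the toy data are all assumptions.

```python
def jackknife_accuracy(samples, labels, fit, predict):
    """Leave each sample out once, train on the rest, and score the held-out sample."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def fit_centroids(xs, ys):
    """Stand-in classifier (not an SVM): one mean vector per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def predict_centroid(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))


# Toy two-dimensional features for two families.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
y = ["HSP20", "HSP20", "HSP20", "HSP70", "HSP70", "HSP70"]
acc = jackknife_accuracy(X, y, fit_centroids, predict_centroid)
print(acc)
```

Because every sample is tested exactly once against a model that never saw it, the jackknife result is deterministic for a given dataset, at the cost of training one model per sample.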
In order to investigate the effectiveness of the predictive model, many characteristic parameters were selected to predict the HSPs [
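The per-family metrics reported in the tables can be computed from one-versus-rest confusion counts. The sketch below uses the standard definitions of sensitivity, specificity, accuracy, and the Matthews correlation coefficient, which are assumed to match the paper's formulas; the function names and the toy labels are illustrative only.

```python
import math


def family_metrics(y_true, y_pred, family):
    """One-vs-rest Sn, Sp, Acc, and MCC for a single family."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == family and p == family for t, p in pairs)
    fn = sum(t == family and p != family for t, p in pairs)
    fp = sum(t != family and p == family for t, p in pairs)
    tn = sum(t != family and p != family for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": tp / (tp + fn) if tp + fn else 0.0,
        "Sp": tn / (tn + fp) if tn + fp else 0.0,
        "Acc": (tp + tn) / len(pairs),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples assigned to the correct family (OA)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = ["HSP20", "HSP20", "HSP40", "HSP40", "HSP70", "HSP70"]
y_pred = ["HSP20", "HSP40", "HSP40", "HSP40", "HSP70", "HSP70"]
m = family_metrics(y_true, y_pred, "HSP20")
print(m, overall_accuracy(y_true, y_pred))
```

In this toy case HSP20 has one true positive, one false negative, no false positives, and four true negatives, which gives Sn = 0.5 and Sp = 1.0.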
The predictive results of individual features with the jackknife test by using SVM for HSP families.
Features | Metric | HSP20 | HSP40 | HSP60 | HSP70 | HSP90 | HSP100 | OA (%) |
---|---|---|---|---|---|---|---|---|
CTF | Sn (%) | 74.86 | 90.92 | 54.72 | 67.27 | 53.85 | 67.9 | 80.92 |
CTF | Sp (%) | 95.07 | 76.19 | 98.71 | 96.48 | 99.86 | 99.52 | |
CTF | MCC | 0.7 | 0.68 | 0.63 | 0.66 | 0.69 | 0.75 | |
CTF | Acc (%) | 91.79 | 84.68 | 95.5 | 92.75 | 98.76 | 98.35 | |
SAAC | Sn (%) | 81.07 | 97.53 | 58.49 | 75.9 | 57.69 | 74.07 | 87.25 |
SAAC | Sp (%) | 97.7 | 81.06 | 99.36 | 98.26 | 100 | 99.48 | |
SAAC | MCC | 0.81 | 0.81 | 0.7 | 0.78 | 0.76 | 0.78 | |
SAAC | Acc (%) | 95 | 90.55 | 96.38 | 95.41 | 98.99 | 98.53 | |
DC | Sn (%) | 90.96 | 96.66 | 68.55 | 84.89 | 63.46 | 77.78 | 90.69 |
DC | Sp (%) | 96.66 | 90.69 | 99.11 | 98.16 | 100 | 99.86 | |
DC | MCC | 0.85 | 0.88 | 0.75 | 0.84 | 0.79 | 0.86 | |
DC | Acc (%) | 95.73 | 94.13 | 96.88 | 96.47 | 99.13 | 99.04 | |
PseACS | Sn (%) | 92.37 | 95.46 | 75.47 | 87.41 | 67.31 | 83.95 | 91.38 |
PseACS | Sp (%) | 99.01 | 89.94 | 98.71 | 98.16 | 99.91 | 99.33 | |
PseACS | MCC | 0.92 | 0.86 | 0.77 | 0.86 | 0.79 | 0.83 | |
PseACS | Acc (%) | 97.94 | 93.12 | 97.02 | 96.79 | 99.13 | 98.76 | |
Figure
Prediction results of different combined features. Numbers denote features: 1 for DC, 2 for CTF, 3 for PseACS, and 4 for SAAC.
Table
The predictive results of HSPs by using the combined feature of SAAC+DC+CTF+PseACS with and without SMOTE.
Features | SMOTE (Y/N) | Metric | HSP20 | HSP40 | HSP60 | HSP70 | HSP90 | HSP100 | OA (%) |
---|---|---|---|---|---|---|---|---|---|
PseACS+DC+SAAC+CTF | Y | Sn (%) | 100 | 98.33 | 100 | 100 | 100 | 100 | 99.72 |
PseACS+DC+SAAC+CTF | Y | Sp (%) | 99.92 | 100 | 99.92 | 99.82 | 100 | 100 | |
PseACS+DC+SAAC+CTF | Y | MCC | 1 | 0.99 | 1 | 0.99 | 1 | 1 | |
PseACS+DC+SAAC+CTF | Y | Acc (%) | 99.93 | 99.72 | 99.93 | 99.85 | 100 | 100 | |
PseACS+DC+SAAC+CTF | N | Sn (%) | 94.35 | 98.89 | 81.13 | 90.29 | 75 | 91.36 | 94.91 |
PseACS+DC+SAAC+CTF | N | Sp (%) | 98.58 | 94.26 | 99.6 | 98.84 | 100 | 99.9 | |
PseACS+DC+SAAC+CTF | N | MCC | 0.92 | 0.94 | 0.87 | 0.90 | 0.86 | 0.94 | |
PseACS+DC+SAAC+CTF | N | Acc (%) | 97.89 | 96.93 | 98.26 | 97.75 | 99.4 | 99.59 | |
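The balancing step compared in the table above relies on SMOTE, which creates new minority-class samples by interpolating between a minority sample and one of its nearest minority neighbors. The following is a minimal sketch of that core idea, not the implementation used in the paper: the function name `smote`, the neighbor count `k`, the seed, and the toy data are assumptions.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from a list of minority feature vectors."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [x for x in minority if x is not base]
        others.sort(key=lambda x: sum((a - b) ** 2 for a, b in zip(base, x)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy minority-class features
new_samples = smote(minority, n_new=4, k=2)
print(len(new_samples))
```

Each synthetic point lies on a line segment between two real minority samples, so the oversampled class stays inside the region already occupied by that class rather than duplicating existing points.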
The predictive performance of our predictive model (SVM), Random Forest (RF) [
The predictive sensitivity, specificity, MCC, and accuracy of HSPs by using four algorithms.
The predictive overall accuracy of HSPs by using four algorithms.
Figure
A comparison of the proposed method for independent datasets.
In order to evaluate the performance of our predictive model, we made comparisons with existing methods. The method developed by Ahmad et al. did not provide any family-wise accuracy of HSPs, so we compared the effectiveness with iHSP-PseRAAAC, PredHSP, and ir-HSP. The results of the comparisons are shown in Table
The comparison of the predictive results between this paper and existing methods.
Method | Metric | HSP20 | HSP40 | HSP60 | HSP70 | HSP90 | HSP100 |
---|---|---|---|---|---|---|---|
iHSP-PseRAAAC (a) | Sn (%) | 87.68 | 95.31 | 66.87 | 79.15 | 51.72 | 69.41 |
iHSP-PseRAAAC (a) | Sp (%) | 96.36 | 84.87 | 98.93 | 86.54 | 99.89 | 99.84 |
iHSP-PseRAAAC (a) | MCC | 0.82 | 0.99 | 0.69 | 0.54 | 0.3 | 0.83 |
iHSP-PseRAAAC (a) | Acc (%) | — | — | — | — | — | — |
PredHSP (b) | Sn (%) | 92.16 | 96.09 | 79.75 | 91.17 | 72.41 | 82.35 |
PredHSP (b) | Sp (%) | 97.16 | 86.26 | 97.24 | 91.97 | 99.12 | 98.08 |
PredHSP (b) | MCC | 0.87 | 0.83 | 0.72 | 0.71 | 0.7 | 0.71 |
PredHSP (b) | Acc (%) | 96.36 | 91.91 | 95.96 | 91.87 | 98.43 | 97.48 |
ir-HSP (c) | Sn (%) | 94.63 | 97.45 | 67.92 | 88.49 | 75 | 88.89 |
ir-HSP (c) | Sp (%) | 96.61 | 95.13 | 98.86 | 98.84 | 99.76 | 99.57 |
ir-HSP (c) | MCC | 0.8718 | 0.9276 | 0.7307 | 0.8871 | 0.8112 | 0.8846 |
ir-HSP (c) | Acc (%) | 96.28 | 96.47 | 96.61 | 97.52 | 99.17 | 99.17 |
Our predictive model | Sn (%) | 100 | 98.33 | 100 | 100 | 100 | 100 |
Our predictive model | Sp (%) | 99.92 | 100 | 99.92 | 99.82 | 100 | 100 |
Our predictive model | MCC | 1 | 0.99 | 1 | 0.99 | 1 | 1 |
Our predictive model | Acc (%) | 99.93 | 99.72 | 99.93 | 99.85 | 100 | 100 |
(a) Feng et al. [
In this work, an optimized classifier for HSP family identification was developed. The model is based on the SVM machine learning algorithm, and SMOTE was used to address the imbalanced data classification problem. With the balanced dataset, the overall accuracy in the jackknife test was 99.72% using the optimized combined feature SAAC+DC+CTF+PseACS. This high overall accuracy indicates that our predictive model is a reliable tool for HSP family prediction. HSP expression is known to be associated with human diseases, and the HSP families have different functions. Therefore, our predictive model may benefit researchers by identifying HSP families quickly and effectively, which could in turn support the design of new drugs for treating diseases.
The data used to support the findings of this study are available from the supplementary materials.
The authors declare that there is no conflict of interest.
FM Li conceived the selection of feature parameters. XY Jing carried out the computation and wrote the manuscript. FM Li performed the results analysis. Both authors reviewed the manuscript.
This work was supported by the Natural Science Foundation of Inner Mongolia Autonomous Region of China (2019MS03015) and the National Natural Science Foundation of China (31360206).
The sequence names of HSP families.
The sequence names of the independent datasets.