The recognition of protein folds is an important step in the prediction of protein structure and function. Since Ding and Dubchak’s recognition of 27-class protein folds in 2001, prediction algorithms, feature parameters, and datasets for protein fold prediction have all been improved. However, the influence of interactions between predicted secondary structure segments and of motif information on protein folding has not been considered. Recognizing 27-class protein folds with segment interactions and motif information is therefore important. Based on the 27-class fold dataset built by Liu et al., amino acid composition, the interactions of secondary structure segments, motif frequency, and predicted secondary structure information were extracted. Using the Random Forest algorithm and an ensemble classification strategy, the 27-class protein folds and the corresponding structural classes were identified by independent test. The overall accuracies for the folds and the structural classes in the testing set reached 78.38% and 92.55%, respectively. When the training set and testing set were combined, the overall accuracy by 5-fold cross validation was 81.16%. For comparison with previous results, the method was also tested on Ding and Dubchak’s widely used dataset, where it obtained an improved overall accuracy of 70.24%.
With the completion of the Human Genome Project, the “post-genome era” has produced a large number of protein sequences, creating a need for high-throughput computational methods that structurally annotate the sequences coming from genomic data. One of these critical structures, the protein fold, reflects a key topological structure in proteins, as it covers three major aspects of protein structure: the units of secondary structure, the relative arrangement of structures, and the overall relationship of protein peptide chains [
A protein can only perform its physiological functions if it folds into its proper structure. Abnormal protein folding causes molecular aggregation, precipitation, or abnormal transport, resulting in different diseases. For example, the abnormally folded prion protein (PRNP) accumulates in the brain and causes neurodegenerative diseases such as scrapie, Creutzfeldt-Jakob disease, and mad cow disease; the aggregation of other misfolded proteins is associated with Parkinson’s disease and Huntington’s disease. Thus, the correct identification of protein folds can be valuable for studies on pathogenic mechanisms and drug design [
Proteins with similar sequences tend to fold into similar spatial structures. When proteins have close evolutionary relationships, the similarity-based methods can achieve very reliable predicted results [
Afterwards, based on the dataset built by Ding and Dubchak and the feature parameters, some researchers suggested algorithm improvements for the identification of folds. For example, Chinnasamy et al. [
Based on the dataset built by Ding and Dubchak, other previous researchers suggested the selection of feature parameters for the identification of protein folds. For example, Shamim et al. [
Based on the dataset built by Ding and Dubchak [
Other researchers have constructed new 27-class fold datasets and carried out corresponding studies. For example, based on Astral SCOP 1.71, with sequence identity below 40%, Shamim et al. [
In this paper, based on the dataset built by Liu et al. [
The dataset used in this paper was built by Liu et al. [
Datasets of 27-class protein folds.
Fold | Liu et al. training set | Liu et al. testing set | Ding and Dubchak training set | Ding and Dubchak testing set |
---|---|---|---|---|
All α | 174 | 169 | 54 | 61 |
(1) Globin-like | 14 | 14 | 13 | 6 |
(2) Cytochrome c | 10 | 10 | 7 | 9 |
(3) DNA-binding 3-helical bundle | 92 | 90 | 12 | 20 |
(4) 4-helical up-and-down bundle | 25 | 24 | 7 | 8 |
(5) 4-helical cytokines | 8 | 8 | 9 | 9 |
(6) Alpha; EF-hand | 25 | 23 | 6 | 9 |
All β | 260 | 254 | 109 | 117 |
(7) Immunoglobulin-like β-sandwich | 86 | 85 | 30 | 44 |
(8) Cupredoxins | 18 | 18 | 9 | 12 |
(9) Viral coat and capsid proteins | 24 | 24 | 16 | 13 |
(10) ConA-like lectins/glucanases | 18 | 17 | 7 | 6 |
(11) SH3-like barrel | 41 | 41 | 8 | 8 |
(12) OB-fold | 29 | 28 | 13 | 19 |
(13) Trefoil | 11 | 10 | 8 | 4 |
(14) Trypsin-like serine proteases | 17 | 16 | 9 | 4 |
(15) Lipocalins | 16 | 15 | 9 | 7 |
α/β | 341 | 337 | 115 | 143 |
(16) (TIM)-barrel | 93 | 92 | 29 | 48 |
(17) FAD (also NAD)-binding motif | 5 | 5 | 11 | 12 |
(18) Flavodoxin-like | 37 | 36 | 11 | 13 |
(19) NAD(P)-binding Rossmann-fold | 17 | 16 | 13 | 27 |
(20) P-loop-containing nucleotide | 74 | 73 | 10 | 12 |
(21) Thioredoxin-like | 37 | 36 | 9 | 8 |
(22) Ribonuclease H-like motif | 39 | 40 | 10 | 12 |
(23) Hydrolases | 33 | 33 | 11 | 7 |
(24) Periplasmic binding protein-like | 6 | 6 | 11 | 4 |
α + β | 181 | 179 | 33 | 62 |
(25) β-grasp | 39 | 39 | 7 | 8 |
(26) Ferredoxin-like | 101 | 99 | 13 | 27 |
(27) Small inhibitors, toxins, and lectins | 41 | 41 | 13 | 27 |
Overall | 956 | 939 | 311 | 383 |
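The grouping of folds in the table can be written as a simple lookup, which is one way to derive a structural class from a predicted fold. The sketch below is illustrative only: the class labels are assumed to be the four standard SCOP classes, and the names `CLASS_RANGES`, `structural_class`, and `class_accuracy` are hypothetical.

```python
# Fold index ranges per structural class, read off the grouping in the table.
# The class labels are an assumption of this sketch (standard SCOP class names).
CLASS_RANGES = {
    "all-alpha": range(1, 7),     # folds 1-6
    "all-beta": range(7, 16),     # folds 7-15
    "alpha/beta": range(16, 25),  # folds 16-24
    "alpha+beta": range(25, 28),  # folds 25-27
}


def structural_class(fold):
    """Return the structural class label for a fold index in 1..27."""
    for label, folds in CLASS_RANGES.items():
        if fold in folds:
            return label
    raise ValueError(f"fold index out of range: {fold}")


def class_accuracy(true_folds, predicted_folds):
    """Fraction of sequences whose predicted fold lies in the correct class."""
    if len(true_folds) != len(predicted_folds) or not true_folds:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        structural_class(t) == structural_class(p)
        for t, p in zip(true_folds, predicted_folds)
    )
    return hits / len(true_folds)
```

Under this mapping a fold prediction can be wrong while the structural class is still right, which is consistent with class-level accuracy being higher than fold-level accuracy.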
The dataset built in our group followed Ding and Dubchak’s description of the construction of the protein fold dataset in the literature [
The distributions of the 20 amino acid residues in protein sequences differ clearly between protein folds, and previous studies have shown that amino acid composition is associated with protein folding information [
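As a minimal sketch of the 20-dimensional amino acid composition feature, the function below counts each standard residue and divides by the number of standard residues found. Ignoring non-standard letters is an assumption of this sketch, and the names `AMINO_ACIDS` and `amino_acid_composition` are hypothetical.

```python
# The 20 standard residues, in one-letter code order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20 residue frequencies of a protein sequence.

    Letters outside the 20 standard residues are ignored (an assumption
    of this sketch), so the frequencies always sum to 1.
    """
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [c / total for c in counts]
```

For instance, `amino_acid_composition("AACD")` gives 0.5 for A, 0.25 for C, 0.25 for D, and 0 for the remaining residues.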
A motif is the conserved local region in a protein during evolution [
A previous study showed that predicted secondary structure information is a main feature parameter for the identification of multiclass protein folds [
For example, given the hydrophobicity values (Table
The physicochemical property values for 20 amino acid residues.
Code | H1 | H2 | PL | SASA |
---|---|---|---|---|
A | 0.62 | −0.5 | 8.1 | 1.181 |
C | 0.29 | −1 | 5.5 | 1.461 |
D | −0.9 | 3 | 13 | 1.587 |
E | −0.74 | 3 | 12.3 | 1.862 |
F | 1.19 | −2.5 | 5.2 | 2.228 |
G | 0.48 | 0 | 9 | 0.881 |
H | −0.4 | −0.5 | 10.4 | 2.025 |
I | 1.38 | −1.8 | 5.2 | 1.81 |
K | −1.5 | 3 | 11.3 | 2.258 |
L | 1.06 | −1.8 | 4.9 | 1.931 |
M | 0.64 | −1.3 | 5.7 | 2.034 |
N | −0.78 | 2 | 11.6 | 1.655 |
P | 0.12 | 0 | 8 | 1.468 |
Q | −0.85 | 0.2 | 10.5 | 1.932 |
R | −2.53 | 3 | 10.5 | 2.56 |
S | −0.18 | 0.3 | 9.2 | 1.298 |
T | −0.05 | −0.4 | 8.6 | 1.525 |
V | 1.08 | −1.5 | 5.9 | 1.645 |
W | 0.81 | −3.4 | 5.4 | 2.663 |
Y | 0.26 | −2.3 | 6.2 | 2.368 |
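The H1 column of the table can be applied to a residue string as in the sketch below. Summarizing a secondary structure segment by the mean property value of its residues is an assumption made here for illustration, and the names `H1`, `property_series`, and `segment_mean` are hypothetical.

```python
# H1 values copied from the table of physicochemical properties.
H1 = {
    "A": 0.62, "C": 0.29, "D": -0.9, "E": -0.74, "F": 1.19,
    "G": 0.48, "H": -0.4, "I": 1.38, "K": -1.5, "L": 1.06,
    "M": 0.64, "N": -0.78, "P": 0.12, "Q": -0.85, "R": -2.53,
    "S": -0.18, "T": -0.05, "V": 1.08, "W": 0.81, "Y": 0.26,
}


def property_series(residues, table=H1):
    """Map each residue to its property value (KeyError on unknown letters)."""
    return [table[r] for r in residues.upper()]


def segment_mean(residues, table=H1):
    """Mean property value over one segment (assumed segment summary)."""
    values = property_series(residues, table)
    if not values:
        raise ValueError("empty segment")
    return sum(values) / len(values)
```

The same two functions work for the H2, PL, and SASA columns if a dictionary with those values is passed as `table`.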
CC variables describe the average interactions between different types of secondary structure segments, which can be calculated according to the following equation:
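As a hedged illustration, the cross covariance commonly used in ACC-type transforms averages the product of mean-centred values from two series taken `lag` positions apart. Whether the authors use exactly this normalization is an assumption; the function name `cross_covariance` and its parameters are hypothetical.

```python
def cross_covariance(x, y, lag):
    """Cross covariance between two equal-length series at a given lag.

    Computes (1 / (n - lag)) * sum_j (x[j] - mean(x)) * (y[j + lag] - mean(y))
    for j = 0 .. n - lag - 1. This is the standard ACC-style form and is an
    assumption of this sketch, not necessarily the authors' exact equation.
    """
    n = len(x)
    if len(y) != n:
        raise ValueError("series must have equal length")
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < n")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    total = sum((x[j] - mean_x) * (y[j + lag] - mean_y) for j in range(n - lag))
    return total / (n - lag)
```

Here `x` and `y` would hold the values of two different properties along the ordered segments of one sequence; passing the same series twice gives the corresponding auto covariance.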
The numbers of sequences containing secondary structure segments. (a) and (b) are for training set and testing set, respectively.
As the protein fold is a description based on secondary structure, the formation of secondary structure along the sequence influences the folding of the protein. According to published studies [
Random Forest is an algorithm for classification developed by Breiman [
For the 27-class fold dataset, amino acid composition, motif frequency, predicted secondary structure information, and the interaction of secondary structure segments were extracted as feature parameters, with the combined feature vector as input parameters for the Random Forest algorithm. The overall accuracy of the testing set in the dataset measured up to 78.38% by independent test.
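The combined input vector can be sketched as a plain concatenation of the four feature groups. The group sizes below (20, 144, 126, and 6) are inferred from the cumulative dimensions reported for the feature sets (20, 164, 290, and 296); the concatenation order and the names `EXPECTED_DIMS` and `build_feature_vector` are assumptions of this sketch.

```python
# Per-group dimensions inferred from the cumulative totals 20, 164, 290, 296.
EXPECTED_DIMS = {
    "composition": 20,           # amino acid composition
    "segment_interaction": 144,  # interaction of secondary structure segments
    "motif": 126,                # motif frequency
    "secondary_structure": 6,    # predicted secondary structure information
}


def build_feature_vector(groups):
    """Concatenate the four feature groups into one 296-dimensional vector."""
    vector = []
    for name, dim in EXPECTED_DIMS.items():
        values = groups[name]
        if len(values) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(values)}")
        vector.extend(values)
    return vector
```

Checking each group's length before concatenating makes a mis-sized feature group fail early rather than silently shifting every later feature.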
For further comparison, identification results from the gradual addition of relevant feature parameters are listed (Table
Prediction accuracies of different parameters in the testing set (%).
Fold | A | A + ACC | A + ACC + M | A + ACC + M + P | A + ACC + M + P (5-fold cross validation) | The results of Liu et al. | Ding and Dubchak’s dataset |
---|---|---|---|---|---|---|---|
1 | 21.43 | 71.43 | 71.43 | 71.43 | 75.00 (0.0252) | 78.5 | 100.00 |
2 | 10.00 | 70.00 | 70.00 | 80.00 | 95.00 (0.0000) | 90.0 | 100.00 |
3 | 60.00 | 90.00 | 91.11 | 91.11 | 92.86 (0.0026) | 75.5 | 75.00 |
4 | 4.17 | 83.33 | 75.00 | 75.00 | 81.63 (0.0000) | 54.1 | 87.50 |
5 | 12.50 | 25.00 | 12.50 | 25.00 | 18.75 (0.0187) | 25.0 | 77.78 |
6 | 0.00 | 60.87 | 52.17 | 52.17 | 75.00 (0.0342) | 39.1 | 66.67 |
7 | 87.06 | 91.76 | 89.41 | 90.59 | 89.47 (0.0114) | 82.3 | 79.55 |
8 | 11.11 | 27.78 | 27.78 | 38.89 | 41.67 (0.0000) | 55.5 | 75.00 |
9 | 45.83 | 50.00 | 50.00 | 58.33 | 70.83 (0.0421) | 70.8 | 84.62 |
10 | 23.53 | 35.29 | 47.06 | 52.94 | 57.14 (0.0255) | 47.0 | 66.67 |
11 | 24.39 | 56.10 | 48.78 | 58.54 | 70.73 (0.0185) | 43.9 | 37.50 |
12 | 0.00 | 46.43 | 64.29 | 60.71 | 54.39 (0.0096) | 60.7 | 89.47 |
13 | 0.00 | 30.00 | 50.00 | 60.00 | 66.67 (0.0426) | 10.0 | 50.00 |
14 | 37.50 | 56.25 | 62.50 | 62.50 | 81.82 (0.0000) | 75.0 | 25.00 |
15 | 53.33 | 40.00 | 40.00 | 46.67 | 67.74 (0.0136) | 40.0 | 100.00 |
16 | 86.96 | 95.65 | 98.91 | 100.00 | 98.92 (0.0144) | 89.1 | 66.67 |
17 | 0.00 | 20.00 | 20.00 | 20.00 | 20.00 (0.0097) | 20.0 | 91.67 |
18 | 11.11 | 30.56 | 47.22 | 61.11 | 68.49 (0.0894) | 16.6 | 38.46 |
19 | 37.50 | 81.25 | 100.00 | 100.00 | 100.00 (0.0300) | 81.2 | 62.96 |
20 | 26.03 | 72.60 | 90.41 | 89.04 | 91.84 (0.0398) | 87.6 | 41.67 |
21 | 30.56 | 50.00 | 75.00 | 72.22 | 72.60 (0.0217) | 52.7 | 75.00 |
22 | 22.50 | 40.00 | 62.50 | 57.50 | 65.82 (0.0113) | 50.0 | 41.67 |
23 | 27.27 | 45.45 | 90.91 | 90.91 | 95.46 (0.0107) | 78.7 | 57.15 |
24 | 0.00 | 16.67 | 50.00 | 66.67 | 41.67 (0.0373) | 50.0 | 25.00 |
25 | 12.82 | 56.41 | 61.54 | 61.54 | 69.23 (0.0233) | 30.7 | 12.50 |
26 | 51.52 | 88.89 | 90.91 | 92.93 | 86.00 (0.0104) | 67.6 | 62.96 |
27 | 100.00 | 75.61 | 87.80 | 92.68 | 92.68 (0.0122) | 100.0 | 96.30 |
Overall | 43.66 | 68.80 | 76.25 | 78.38 | 81.16 | 66.5 | 70.24 |
Note: A means amino acid composition (20 dimensions), A + ACC means amino acid composition and the interaction of segments (164 dimensions), A + ACC + M means amino acid composition, the interaction of segments, and motif frequency (290 dimensions), and A + ACC + M + P means amino acid composition, the interaction of segments, motif frequency, and predicted secondary structure information (296 dimensions);
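The reported overall accuracy can be checked against the per-fold figures. The sketch below takes the testing-set sizes of the Liu et al. dataset and the per-fold accuracies of the A + ACC + M + P column, recovers the number of correctly identified sequences per fold by rounding, and divides by the testing-set total. The names `TEST_SIZES`, `FOLD_ACCURACY`, and `overall_accuracy` are hypothetical.

```python
# Testing-set sizes per fold (Liu et al. dataset) and per-fold accuracies (%)
# for the A + ACC + M + P feature set, both copied from the tables in the text.
TEST_SIZES = [14, 10, 90, 24, 8, 23, 85, 18, 24, 17, 41, 28, 10, 16, 15,
              92, 5, 36, 16, 73, 36, 40, 33, 6, 39, 99, 41]
FOLD_ACCURACY = [71.43, 80.00, 91.11, 75.00, 25.00, 52.17, 90.59, 38.89, 58.33,
                 52.94, 58.54, 60.71, 60.00, 62.50, 46.67, 100.00, 20.00, 61.11,
                 100.00, 89.04, 72.22, 57.50, 90.91, 66.67, 61.54, 92.93, 92.68]


def overall_accuracy(sizes, accuracies):
    """Return (correct count, overall accuracy in %) from per-fold figures."""
    # Per-fold accuracies are rounded percentages, so rounding recovers
    # the integer number of correctly identified sequences in each fold.
    correct = [round(n * acc / 100) for n, acc in zip(sizes, accuracies)]
    return sum(correct), 100 * sum(correct) / sum(sizes)


total_correct, overall = overall_accuracy(TEST_SIZES, FOLD_ACCURACY)
print(total_correct, round(overall, 2))  # prints: 736 78.38
```

The result, 736 of 939 testing sequences, agrees with the overall accuracy of 78.38% stated for the independent test.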
The architecture of the protein folds identification system.
When only one feature parameter, amino acid composition, was used, the overall accuracy was 43.66% (Table
To test the efficiency of our method, with the same feature parameters, classification strategy, and algorithm used above, the 27-class folds dataset built by Ding and Dubchak [
The previous identification results by an independent test from Ding and Dubchak’s dataset (%).
Author | Classifier | Accuracy |
---|---|---|
Ding and Dubchak | SVM (All-Versus-All) | 56.0 |
Chinnasamy et al. | Tree-Augmented Naive Bayesian Classifier | 58.2 |
Shen and Chou | OET-KNN | 62.1 |
Nanni | Fusion of classifiers | 61.1 |
Chen and Kurgan | PFRES | 68.4 |
Guo and Gao | GAOEC | 64.7 |
Damoulas and Girolami | Multiclass multikernel | 70.0 |
Zhang et al. | Increment of diversity | 61.1 |
Ghanty and Pal | Fusion of different classifiers | 68.6 |
Dong et al. | ACCFold | 70.1 |
Shen and Chou | PFP-FunDSeqE | 70.5 |
Yang et al. | MarFold | 71.7 |
Liu et al. | SVM | 69.8 |
This paper | Random Forest | 70.24 |
According to Shen and Chou’s [
Overall accuracies of structural class using different approaches in the testing set (%).
Dataset | Author | All α | All β | α/β | α + β | Overall |
---|---|---|---|---|---|---|
Liu and Hu | This paper | | | | | 92.55 |
Liu and Hu | Liu and Hu | 97.04 | 85.43 | 94.07 | 78.21 | 89.24 |
Ding and Dubchak | This paper | | | | | |
Ding and Dubchak | Liu and Hu | 86.89 | 88.03 | 83.22 | 59.68 | 81.46 |
Ding and Dubchak | Zhang et al. | | | | | 79.11 |
Ding and Dubchak | Chinnasamy et al. | | | | | 80.52 |
In early 2001, Ding and Dubchak built the 27-class fold dataset and began research on identifying 27-class protein folds with multiple feature groups. Researchers have since worked on improving the feature parameters, algorithms, classification strategies, and datasets for protein fold identification and have achieved good results. Building on this previous research, we combined sequence, structural, and functional information as input feature parameters for the Random Forest algorithm and obtained better identification results. The addition of segment interactions and motif information is therefore a valid and novel approach to recognizing 27-class protein folds.
Given the same dataset, the same sequence can be classified correctly or incorrectly depending on which feature parameters are used. The improved results of our approach may be explained as follows. First, to capture correlation at the level of secondary structure segments, we calculated the interaction information of secondary structure segments, which reflects the segment order and long-range correlation of the sequence and has a major influence on protein folding. Second, to capture the local conservation of the kernel structure in protein folds, we extracted motif information, including functional motifs and statistical motifs. Finally, the Random Forest algorithm is a convenient and efficient ensemble classifier whose final classification is decided by the votes of its decision trees.
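The voting step of a Random Forest can be sketched in a few lines: each decision tree proposes a fold label, and the label with the most votes becomes the final prediction. The tie-breaking rule (smallest label wins) and the name `majority_vote` are assumptions of this sketch rather than details given in the text.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees.

    Ties are broken by choosing the smallest label, which is an
    assumption of this sketch.
    """
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    counts = Counter(tree_predictions)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)
```

For example, if five trees predict folds 16, 16, 3, 7, and 16, the forest assigns fold 16.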
In future work, with the same parameters and classification strategy, other testing methods (jackknife test, 10-fold cross validation, etc.) will be used. Different physicochemical properties will also be analyzed for the recognition of protein folds. Moreover, Precision-Recall curves or ROC curves for individual folds could be presented.
The authors declare that there is no conflict of interests that would prejudice the impartiality of this scientific work.
This work was supported by the National Natural Science Foundation of China (30960090, 31260203), The “CHUN HUI” Plan of Ministry of Education, and Talent Development Foundation of Inner Mongolia.