Prediction of protein structural classes for low-similarity sequences is useful for understanding the fold patterns, regulation, functions, and interactions of proteins. It is well known that feature extraction is crucial to the prediction of protein structural class, and it mainly uses the protein primary sequence, the predicted secondary structure sequence, and the position-specific scoring matrix (PSSM). Currently, prediction based solely on the PSSM has played a key role in improving prediction accuracy. In this paper, we propose a novel method called CSP-SegPseP-SegACP that fuses the consensus sequence (CS), segmented PsePSSM, and segmented autocovariance transformation (ACT), all based on the PSSM. Three widely used low-similarity datasets (1189, 25PDB, and 640) are adopted in this paper. A 700-dimensional (700D) feature vector is constructed, and its dimension is reduced to 224D by principal component analysis (PCA). To verify the performance of our method, rigorous jackknife cross-validation tests are performed on the 1189, 25PDB, and 640 datasets. Comparison of our results with existing PSSM-based methods demonstrates that our method achieves favorable and competitive performance. This offers an important complement to other PSSM-based methods for the prediction of protein structural classes of low-similarity sequences.
Protein structural classes play a key role in protein science, simply because the biological function of a protein is essentially related to its tertiary structure, which in turn is determined by its amino acid sequence through the process of protein folding [
During the last two decades, a great number of statistical learning algorithms have been developed to tackle this problem. Protein structural class prediction is a typical pattern recognition problem, which is mainly performed in three steps. The first step is feature extraction, by which sequences of different lengths are converted into equal-length feature vectors. The methods include amino acid composition (AAC) [
Currently, feature extraction methods mainly use the protein primary sequence, the predicted secondary structure sequence, and the position-specific scoring matrix (PSSM). A position-specific scoring matrix can be obtained for a given query sequence by searching it against a database of proteins using PSI-BLAST [
In this paper, we extract a consensus sequence based on the PSSM, from which 40 global features are calculated. Then we propose two segmented feature extraction techniques based on the concepts of pseudo-position-specific scoring matrix (PsePSSM) and autocovariance transformation (ACT), both defined on the PSSM. PsePSSM was originally proposed by Shen and Chou to avoid complete loss of the sequence-order information [
In order to facilitate the comparison with the previous works, three popular benchmark datasets are used to evaluate the performance of our method: the
The compositions of three datasets adopted in this paper.
Dataset | All-α | All-β | α/β | α+β | Total
---|---|---|---|---|---
1189 | 223 | 294 | 334 | 241 | 1092
25PDB | 443 | 443 | 346 | 441 | 1673
640 | 138 | 154 | 177 | 171 | 640
To develop a powerful predictor of the protein structural class based on the position-specific scoring matrix (PSSM), the key is to effectively define feature vectors that formulate the statistical samples concerned. Here, we use a combination of the consensus sequence, segmented PsePSSM, and segmented autocovariance transformation.
To extract the evolutionary information, we use each protein sequence (query sequence) as a seed to search for and align homologous sequences from NCBI's NR database (
To extract global features, we adopt the method in [
Furthermore, we propose 20 composition moment features for CS, which have been applied for prediction of protein structural class mainly based on amino acid sequence [
In summary, we obtain 40 global features by combining 20 amino acid composition features with 20 composition moment features of CS-PSSM.
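As a rough illustration of these 40 global features, the sketch below derives a consensus sequence from an L×20 PSSM by taking the highest-scoring residue at each position and then computes 20 composition features plus 20 first-order composition-moment features. The consensus rule, the amino acid column order, and the moment normalization are assumptions; the paper's exact definitions may differ.

```python
import numpy as np

# Assumed PSI-BLAST column order for the 20 standard amino acids.
AA = "ARNDCQEGHILKMFPSTWYV"

def consensus_features(pssm):
    """40 global features from a consensus sequence of an L x 20 PSSM.

    Consensus residue = highest-scoring amino acid per position (one
    common construction). Returns 20 amino acid composition features
    followed by 20 first-order composition-moment features."""
    pssm = np.asarray(pssm, dtype=float)
    n = pssm.shape[0]
    cs = [AA[j] for j in pssm.argmax(axis=1)]    # consensus sequence
    comp = np.zeros(20)
    moment = np.zeros(20)
    for pos, aa in enumerate(cs, start=1):
        j = AA.index(aa)
        comp[j] += 1.0 / n                       # composition of residue j
        moment[j] += pos / (n * (n - 1))         # first-order position moment
    return np.concatenate([comp, moment])        # 40D feature vector
```

Note that the composition part sums to 1 by construction, while the moment part encodes where in the sequence each residue type tends to occur.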
To extract local features, we divide PSSM into
When
When
In the above-mentioned way, a total of 380 local features are extracted using segmented PsePSSM.
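The segmented extraction above can be sketched as follows, under the assumption that each segment contributes its 20 column means plus 20 PsePSSM-style lagged squared-difference features per lag. With 2 segments and lags 1-4 this yields 200 features, and with 3 segments and lags 1-2 it yields 180, consistent with the Seg2-PsePSSM (200D) and Seg3-PsePSSM (180D) dimensions reported later; the actual segment split and lag sets used in the paper are assumptions here.

```python
import numpy as np

def seg_psepssm(pssm, n_seg, lags):
    """Segmented PsePSSM features (a sketch; the split rule and lag
    sets are assumptions).

    The L x 20 PSSM is cut into n_seg roughly equal segments along the
    sequence. Each segment contributes its 20 column means plus, for
    each lag g, the 20 averaged squared differences between rows g
    positions apart (the sequence-order term of PsePSSM)."""
    pssm = np.asarray(pssm, dtype=float)
    feats = []
    for seg in np.array_split(pssm, n_seg, axis=0):
        feats.append(seg.mean(axis=0))              # 20 mean features
        m = seg.shape[0]
        for g in lags:
            diff = (seg[:m - g] - seg[g:]) ** 2     # rows g apart
            feats.append(diff.mean(axis=0))         # 20 features per lag
    return np.concatenate(feats)
```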
In order to further obtain local features, here the autocovariance transformation (ACT) is introduced to get the neighboring effects of the sequences. The same as the previous section, we divide PSSM into
When
When
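The segmented autocovariance transformation can be sketched in the same way: for each segment and each lag, the autocovariance of every PSSM column measures how strongly scores a fixed distance apart co-vary. With 2 segments × 4 lags × 20 columns this gives 160 features and with 3 segments × 2 lags × 20 columns it gives 120, matching the Seg2-ACPSSM (160D) and Seg3-ACPSSM (120D) counts under these assumed parameters.

```python
import numpy as np

def seg_ac_pssm(pssm, n_seg, lags):
    """Segmented autocovariance (ACT) features over a PSSM (a sketch;
    segmentation and lag choices are assumptions).

    For each segment, each lag g, and each of the 20 columns, computes
    the autocovariance between scores g positions apart, capturing the
    neighboring effects along the sequence."""
    pssm = np.asarray(pssm, dtype=float)
    feats = []
    for seg in np.array_split(pssm, n_seg, axis=0):
        m = seg.shape[0]
        mu = seg.mean(axis=0)                        # per-column mean
        for g in lags:
            ac = ((seg[:m - g] - mu) * (seg[g:] - mu)).sum(axis=0) / (m - g)
            feats.append(ac)                         # 20 features per lag
    return np.concatenate(feats)
```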
To extract more global and local information from PSSM, we propose a comprehensive method called CSP-SegPseP-SegACP by fusing the 40 CS-PSSM features, the 380 segmented PsePSSM features, and the 280 segmented ACT-PSSM features. Finally, each protein sequence is characterized by a 700-dimensional (700D) feature vector.
The dimension of our constructed feature vector is 700, which is a large input for the SVM. Such a high dimension leads to two problems: information redundancy or noise, and the curse of dimensionality. Hence, feature selection plays a key role in the classification task. Principal component analysis (PCA) [
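The 700D-to-224D reduction can be sketched with scikit-learn's PCA; the random matrix below is a hypothetical stand-in for the real matrix of per-protein feature vectors.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix: one 700D CSP-SegPseP-SegACP vector per protein.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 700))

pca = PCA(n_components=224)        # keep the 224 leading principal components
X_reduced = pca.fit_transform(X)   # projected features, shape (300, 224)
```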
Support vector machine (SVM) is a well-known machine learning algorithm for binary classification based on statistical learning theory; introduced by Vapnik in 1995, it is considered a state-of-the-art classification technique [
The basic idea of SVM is to find the separating hyperplane, based on support vector theory, that minimizes classification errors. It transforms the input samples into a higher-dimensional space using a kernel function to find the support vectors. Four basic kernel functions are commonly used by SVM: the linear, polynomial, sigmoid, and radial basis function (RBF) kernels. Here, we choose the RBF kernel due to its superiority in solving nonlinear problems [
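A minimal sketch of training an RBF-kernel SVM with a grid search over the penalty and kernel-width parameters, in the spirit of the procedure described below; the toy data and the exponent ranges of the grid are illustrative assumptions, not the paper's settings.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Toy 4-class data standing in for the reduced feature vectors.
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

# Grid search over C and gamma for an RBF-kernel SVM.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [2**k for k in range(-2, 5)],
                     "gamma": [2**k for k in range(-5, 2)]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```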
Independent dataset test, subsampling test, and jackknife test are three widely used cross-validation methods in statistical prediction. Among them, the jackknife test is deemed the most rigorous and objective because it yields a unique result for a given dataset [
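The jackknife test corresponds to leave-one-out cross-validation: each sample is held out once while the classifier is trained on all remaining samples. A minimal sketch with scikit-learn, using the iris dataset as a stand-in:

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Jackknife (leave-one-out): one fold per sample, so the result is
# unique for a given dataset and classifier.
acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=LeaveOneOut()).mean()
```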
To evaluate the performance of our method comprehensively, we report seven standard performance measures: sensitivity (Sens), specificity (Spec), F-measure, Matthews correlation coefficient (MCC), area under the ROC curve (AUC), overall accuracy (OA), and average accuracy (AA).
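For a multiclass problem these per-class measures are computed in one-vs-rest form; the sketch below shows one standard way to obtain Sens, Spec, and MCC for each structural class from a confusion matrix (the helper name and label encoding are illustrative).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

def per_class_metrics(y_true, y_pred, labels):
    """One-vs-rest sensitivity, specificity, and MCC for each class."""
    yt, yp = np.asarray(y_true), np.asarray(y_pred)
    cm = confusion_matrix(yt, yp, labels=labels)
    out = {}
    for i, lab in enumerate(labels):
        tp = cm[i, i]
        fn = cm[i].sum() - tp                 # class members missed
        fp = cm[:, i].sum() - tp              # others predicted as this class
        tn = cm.sum() - tp - fn - fp
        out[lab] = {
            "Sens": tp / (tp + fn),
            "Spec": tn / (tn + fp),
            "MCC": matthews_corrcoef(yt == lab, yp == lab),
        }
    return out
```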
In this study, a 700D feature vector is obtained and reduced to 224D by PCA to avoid the curse of dimensionality. The 224 features are then input into the SVM. The RBF kernel function, the grid search approach, and fifteenfold cross-validation on the 1189 dataset are used to find the best parameters of
The prediction accuracies of our method on the 1189, 25PDB, and 640 datasets.

Dataset | Structural class | Sens (%) | Spec (%) | F-measure | MCC | AUC
---|---|---|---|---|---|---
1189 | All-α | 84.8 | 95.6 | 0.84 | 0.80 | 0.90
 | All-β | 85.4 | 94.1 | 0.85 | 0.79 | 0.90
 | α/β | 85.0 | 90.0 | 0.82 | 0.74 | 0.88
 | α+β | 55.2 | 91.3 | 0.59 | 0.49 | 0.73
 | OA | 78.5 | | | |
 | AA | 77.6 | | | |
25PDB | All-α | 94.4 | 96.4 | 0.92 | 0.90 | 0.95
 | All-β | 91.9 | 97.2 | 0.92 | 0.89 | 0.95
 | α/β | 71.1 | 95.7 | 0.76 | 0.70 | 0.83
 | α+β | 92.5 | 95.2 | 0.90 | 0.86 | 0.94
 | OA | 88.4 | | | |
 | AA | 87.5 | | | |
640 | All-α | 83.3 | 96.8 | 0.86 | 0.82 | 0.90
 | All-β | 83.1 | 95.3 | 0.84 | 0.79 | 0.89
 | α/β | 83.0 | 89.4 | 0.79 | 0.70 | 0.86
 | α+β | 60.2 | 87.4 | 0.62 | 0.49 | 0.74
 | OA | 77.0 | | | |
 | AA | 77.4 | | | |
The flowchart of our proposed method.
The overall protein structural class prediction accuracy (OA) as well as the prediction accuracy for each structural class is achieved by combining the features from the three sequence representation models: consensus sequence-PSSM (CSP), segmented PsePSSM, and segmented autocovariance transformation-PSSM (ACP). The proposed prediction method (CSP-SegPseP-SegACP) is examined on the 1189, 25PDB, and 640 datasets by jackknife tests, and we report the Sens, Spec, F-measure, MCC, and AUC values.
To overcome the impact of information redundancy and the curse of dimensionality on the SVM, the dimension of our obtained feature vector is reduced from 700 to 224 by PCA. In this section, we report the accuracies of our method using all 700 features on the three datasets, and we still optimize the SVM parameters
Comparison of accuracies between our method with 224 features and the variant with all 700 features.
To investigate the contributions of the feature groups to the protein structural class prediction accuracy, we first evaluate each feature group individually on the 1189 dataset; the results are shown in Table
Performance comparison of our six feature groups on the 1189 dataset.
Dataset | Features | All-α (%) | All-β (%) | α/β (%) | α+β (%) | OA (%)
---|---|---|---|---|---|---
1189 | CSAAC-PSSM (20D) | 72.7 | 76.2 | 78.7 | 26.1 | 65.2
 | CSCM-PSSM (20D) | 69.1 | 76.9 | 82.0 | 29.9 | 66.5
 | Seg2-PsePSSM (200D) | 80.7 | 82.7 | 80.8 | 51.0 | 74.7
 | Seg3-PsePSSM (180D) | 79.8 | 80.6 | 81.4 | 48.1 | 73.5
 | Seg2-ACPSSM (160D) | 76.7 | 82.3 | 76.0 | 44.4 | 70.9
 | Seg3-ACPSSM (120D) | 69.1 | 77.6 | 78.4 | 38.6 | 67.5
The contribution of each feature group for the overall accuracy (%).
Combination of feature groups | Dimension | 1189 | 25PDB | 640 |
---|---|---|---|---|
CSAACP | 20 | 65.2 | 62.0 | 66.0 |
CSAACP + CSCMP (CSP) | 40 | 66.5 | 63.1 | 64.7 |
CSP + Seg2-PseP | 240 | 75.2 | 74.4 | 75.8 |
CSP + Seg2-PseP + Seg3-PseP | 420 | 76.2 | 87.7 | 74.5 |
CSP + SegPseP + Seg2-ACP | 680 | 76.1 | 87.9 | 75.0 |
CSP + SegPseP + Seg2-ACP + Seg3-ACP | 700 | 77.1 | 88.6 | 75.5 |
CSP + SegPseP + SegACP-PCA | 224 | 78.5 | 88.4 | 77.0 |
In this section, to demonstrate the superiority of our method, the CSP-SegPseP-SegACP is further compared with other recently reported prediction methods on the same datasets. We select the accuracy of each class and the overall accuracy as evaluation indexes; they are summarized in Table
Performance comparison of different methods on three datasets.
Dataset | Method | All-α (%) | All-β (%) | α/β (%) | α+β (%) | OA (%)
---|---|---|---|---|---|---
1189 | PSSM-S [ | 93.3 | 85.1 | 77.6 | 65.6 | 80.2
 | LCC-PSSM [ | 89.2 | 88.8 | 85.6 | 58.5 | 81.2
 | MBMGAC-PSSM [ | 79.8 | 85.0 | 84.7 | 50.6 | 76.3
 | RPSSM [ | 67.7 | 75.2 | 74.6 | 17.4 | 60.2
 | AADP-PSSM [ | 69.1 | 83.7 | 85.6 | 35.7 | 70.7
 | AATP [ | 72.7 | 85.4 | 82.9 | 42.7 | 72.6
 | MEDP [ | 85.2 | 84.0 | 84.3 | 45.2 | 75.8
 | PsePSSM [ | 82.0 | 82.3 | 84.1 | 44.0 | 74.4
 | AAC-PSSM-AC [ | 80.7 | 86.4 | 81.4 | 45.2 | 74.6
 | This paper | 84.8 | 85.4 | 85.0 | 55.2 | 78.5
25PDB | PSSM-S [ | 93.8 | 92.8 | 92.6 | 81.7 | 90.1
 | LCC-PSSM [ | 91.7 | 80.8 | 79.8 | 64.0 | 79.0
 | MBMGAC-PSSM [ | 86.7 | 81.5 | 79.5 | 61.7 | 77.2
 | RPSSM [ | 75.6 | 70.2 | 52.0 | 43.3 | 60.8
 | AADP-PSSM [ | 83.3 | 78.1 | 76.3 | 54.4 | 72.9
 | AATP [ | 81.9 | 74.7 | 75.1 | 55.8 | 71.7
 | MEDP [ | 87.8 | 78.3 | 76.0 | 57.4 | 74.8
 | AAC-PSSM-AC [ | 85.3 | 81.7 | 73.7 | 55.3 | 74.1
 | PsePSSM [ | 86.2 | 78.8 | 75.7 | 57.6 | 75.5
 | Xia et al. [ | 92.6 | 72.5 | 71.7 | 71.0 | 77.2
 | This paper | 94.4 | 91.9 | 71.1 | 92.5 | 88.4
640 | LCC-PSSM [ | 92.8 | 88.3 | 85.9 | 66.1 | 82.7
 | MBMGAC-PSSM [ | 86.2 | 83.1 | 85.3 | 63.2 | 79.1
 | MEDP [ | 84.8 | 75.3 | 86.4 | 53.8 | 74.7
 | PsePSSM [ | 73.9 | 76.6 | 85.3 | 51.5 | 71.7
 | This paper | 83.3 | 83.1 | 83.0 | 60.2 | 77.0
As listed in Table
In this paper, the main contribution is the construction of a 700D feature vector from three descriptors: consensus sequence- (CS-) PSSM, PsePSSM, and autocovariance transformation (ACT) based on the segmented PSSM. While CS-PSSM reflects global information, segmented PsePSSM and segmented ACT represent local sequence-order information. Then 224 features are selected by PCA. The SVM classifier and the jackknife test are employed to evaluate the method on three benchmark datasets (1189, 25PDB, and 640), with sequence similarity lower than 40%, 25%, and 25%, respectively. The experiments indicate that our approach can serve as a reliable tool and an excellent alternative for the accurate prediction of protein structural classes for low-similarity datasets. In future work we shall provide a publicly accessible web server for the method presented in this paper. The codes are written in MATLAB and can be downloaded from
The authors declare that there is no conflict of interests regarding the publication of this paper.
The authors would like to thank the anonymous reviewers for their helpful comments on our paper. This work was supported by the National Natural Science Foundation of China (nos. 61373174 and 11326201), the Fundamental Research Funds for the Central Universities (no. JB140703), and the Natural Science Basic Research Plan in Shaanxi Province of China (no. 2015JQ1010).