Many domains would benefit from reliable and efficient systems for automatic protein classification. An area of particular interest in recent studies is the exploration of new methods for extracting features from a protein that work well for specific problems. These methods, however, are not generalizable and have proven useful in only a few domains. Our goal is to evaluate several feature extraction approaches for representing proteins by testing them across multiple datasets. Different types of protein representations are evaluated: those starting from the position specific scoring matrix (PSSM) of the protein, those derived from the amino-acid sequence, two matrix representations, and features taken from the 3D tertiary structure of the protein. We also test new variants of protein descriptors. We develop our system experimentally by comparing and combining different descriptors taken from the protein representations. Each descriptor is used to train a separate support vector machine (SVM), and the results are combined by sum rule. Some stand-alone descriptors work well on some datasets but not on others. Through fusion, the different descriptors yield a system that performs well across all tested datasets, in some cases better than the state-of-the-art.
The explosion of protein sequences generated in the postgenomic era has not been followed by an equal increase in the knowledge of protein biological attributes, which are essential for basic research and drug development. Since manual classification of proteins by means of biological experiments is both time-consuming and costly, much effort has been applied to the problem of automating this process using various machine learning algorithms and computational tools for fast and effective classification of proteins given their sequence information [
In this work we are mainly interested in the second procedure, that is, in the definition of a discrete numerical representation for a protein. Since many different representations have been proposed in the literature, it would be valuable to investigate which of these are most useful for specific applications, such as subcellular localization and protein-protein interactions [
Two kinds of models are typically employed to represent protein samples: the sequential model and the discrete model. The most widely used sequential model is based on the entire amino-acid sequence of a protein, expressed by the sequence of its residues, with each one belonging to one of the 20 native amino-acid types:
More suitable for machine learning purposes are protein discrete models, which fall into two main classes. The first class includes the simple amino-acid composition (AAC) and approaches that are based on the AAC-discrete model, such as Chou’s pseudo-amino-acid composition (PseAAC) [
Before proceeding to the second class of representations based on protein discrete models, it should be noted that a number of different PseAAC methods have been developed for specific applications, such as for predicting certain biological attributes. Some examples include cellular automata image classification [
The second class of protein feature extraction methods is based on kernels. One of the first kernel-based methods (proposed for remote homology detection) is the Fisher kernel [
Aside from using AAC and protein properties for protein representation, several high performing features have also been derived from the position-specific scoring matrix (PSSM) [
The main drawback of the methods based on structural or sequential features is that they only focus on the local variation of the protein itself. For this reason, cellular interactions of proteins have been investigated, as in [
In this study our objective is to search for a general ensemble method that works well across different protein classification datasets. To accomplish our goal we focus on structural and sequential features. We are motivated to study protein classification methods that generalize well because such systems offer the potential of deepening our understanding of protein representation and of speeding up real world development in new areas involving protein classification. Such investigations also have the potential of promoting and laying the foundations for the development of more robust and powerful classification systems.
The present paper provides an in-depth look at the protein representations that have led to the evolution of some of our previous work in this area, described in [ ], [ ], and [ ].
In this work we explain and compare several state-of-the-art descriptors and some new variants starting from different types of protein representations: the PSSM, the amino-acid sequence, two matrix representations of the protein, and the 3D tertiary structure representations of the protein. We also develop a new ensemble (based on the above cited works) that performs well across multiple datasets, with our ensemble obtaining state-of-the-art performances on several datasets. For the sake of fairness, we use the same ensemble with the same set of parameters (i.e., the same weights in the weighted sum rule) across all tested datasets.
The remainder of this paper is organized as follows. In Section
Since several problems in the bioinformatics literature require the classification of proteins, a number of datasets are available for experiments, and recent research has focused on finding a compact and effective representation of proteins [
The classification system illustrated in Figure
Schema of the proposed method.
The combination of representation and descriptors is summarized in Table
Summarized description of the datasets (if available, the number of training and independent samples is given in the column “number of samples”). The column BKB reports whether the PDB files of the proteins, needed for extracting the backbone structure, can be obtained from the dataset.
Name | Short name | Number of samples | Number of classes | Protocol | BKB |
---|---|---|---|---|---|
Membrane subcellular | MEM | 3249 + 4333 | 8 | Independent training and testing sets | NO |
Human pairs | HU | 1882 | 2 | 10-fold cross validation | NO |
Protein fold | PF | 698 | 27 | Independent training and testing sets | YES |
GPCR | GP | 730 | 2 | 10-fold cross validation | NO |
GRAM | GR | 452 | 5 | 10-fold cross validation | NO |
Viral | VR | 112 | 4 | 10-fold cross validation | NO |
Cysteines | CY | 957 | 3 | 10-fold cross validation | YES |
SubCell | SC | 121 | 3 | 10-fold cross validation | YES |
DNA-binding proteins | DNA | 349 | 2 | 10-fold cross validation | YES |
Enzyme | ENZ | 1094 | 6 | 10-fold cross validation | YES |
GO dataset | GO | 168 | 4 | 10-fold cross validation | YES |
Human interaction | HI | 8161 | 2 | 10-fold cross validation | NO |
Submitochondria locations | SL | 317 | 3 | 10-fold cross validation | NO |
Virulent independent set 1 | VI1 | 2055 + 83 | 2 | Independent training and testing sets | NO |
Virulent independent set 2 | VI2 | 2055 + 284 | 2 | Independent training and testing sets | NO |
Adhesins | AD | 2055 + 1172 | 2 | Independent training and testing sets | NO |
Each descriptor is used to train a general purpose classifier. SVMs are used for the classification task due to their wide diffusion and high generalization ability. SVMs derive from the field of statistical learning theory [
The ensemble approaches based on the fusion of different descriptors are obtained by combining the pool of SVMs by weighted sum rule; this rule simply sums the scores produced by the pool of SVM classifiers, with a given weight applied to each SVM.
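As a minimal sketch of the weighted sum rule (the score matrices and weights below are hypothetical, not values from the paper), each classifier contributes a per-class score matrix that is weighted and summed before taking the argmax:

```python
import numpy as np

# Hypothetical scores: one row per test sample, one column per class,
# one matrix per SVM trained on a different descriptor.
scores_svm1 = np.array([[0.9, 0.1], [0.4, 0.6]])
scores_svm2 = np.array([[0.7, 0.3], [0.2, 0.8]])

def weighted_sum_rule(score_list, weights):
    """Combine classifier score matrices by weighted sum rule."""
    fused = sum(w * s for w, s in zip(weights, score_list))
    return fused.argmax(axis=1)  # predicted class per sample

pred = weighted_sum_rule([scores_svm1, scores_svm2], weights=[2, 1])
```

In practice the SVM outputs would be normalized to a common range before fusion, so that no classifier dominates merely because of its score scale.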
The most widely used representation for proteins is a sequential model of the amino-acid sequence:
The PSSM representation of a protein, first proposed in [
The PSSM representation considers the following parameters. Position: the index of each amino-acid residue in a sequence after multiple sequence alignment. Probe: a group of typical sequences of functionally related proteins already aligned by sequence or structural similarity. Profile: a matrix of 20 columns corresponding to the 20 amino acids. Consensus: the sequence of amino-acid residues most similar to all the alignment residues of probes at each position. The consensus sequence is generated by selecting the highest score in the profile at each position.
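The consensus-selection step described above can be sketched as a per-position argmax over the profile; the 7 × 20 profile below is randomly generated purely for illustration:

```python
import numpy as np

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 native residue types

# Toy profile: one row per sequence position, one column per amino acid.
rng = np.random.default_rng(0)
profile = rng.standard_normal((7, 20))

# Consensus: the residue with the highest profile score at each position.
consensus = "".join(AMINO_ACIDS[j] for j in profile.argmax(axis=1))
```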
A PSSM representation for a given protein of length
Small values of
In [
In the experiments reported below, 25 random physicochemical properties have been selected to create an ensemble (labelled SMR) of
In [
Then
Wavelets are versatile descriptors with a wide range of applications. First [
The 3D tertiary structure representation for proteins is based on the protein backbone (i.e., the sequence of its
As with the other matrix representations introduced above, DM is regarded as a grayscale image, which is used to extract texture descriptors, as illustrated in Figure
DM images extracted from 2 sample proteins of the DNA dataset.
In this section we describe the approaches used to extract descriptors from the different representations introduced above. Most of the descriptors extracted from the primary representation are based on substituting the letter encoding of an amino acid with its value for a fixed physicochemical property. To make the result independent of the selected property, 25 or 50 properties are chosen at random, and the resulting descriptors are used to train an ensemble of SVM classifiers.
Amino-acid composition is the simplest method for extracting features from a protein representation; it is based on computing the fraction of each amino acid:
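A minimal sketch of the AAC computation (the input sequence is a toy example):

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def aac(sequence):
    """Amino-acid composition: fraction of each residue type (20 values)."""
    counts = Counter(sequence)
    n = len(sequence)
    return [counts[a] / n for a in AMINO_ACIDS]

features = aac("ACDEFACD")  # toy 8-residue sequence
```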
The standard 2 grams descriptor is a vector of 20² = 400 values, each counting the number of occurrences of a given ordered pair of amino acids in a protein sequence. Consider
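A sketch of the 2 grams computation over the standard 20-letter alphabet (toy input sequence):

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs

def two_grams(sequence):
    """Count each ordered residue pair over the sequence (20^2 = 400 values)."""
    counts = {p: 0 for p in PAIRS}
    for i in range(len(sequence) - 1):
        counts[sequence[i:i + 2]] += 1
    return [counts[p] for p in PAIRS]

vec = two_grams("AACA")  # pairs: AA, AC, CA
```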
Quasiresidue Couple is a method for extracting features from the primary sequence of a protein [
The QRC model (of order
In our experiments, the
The autocovariance approach [
Given a protein
The AAIndexLoc is a descriptor proposed in [
In the experiments reported below, 25 random physicochemical properties have been selected to create an ensemble of 25
Global encoding is a descriptor proposed in [
The physicochemical 2 grams [
The N-gram descriptor is similar to the standard 2 grams descriptor but is obtained on different reduced amino-acid alphabets: A1 = G–I–V–F–Y–W–A–L–M–E–Q–R–K–P–N–D–H–S–T–C, A2 = LVIM–C–A–G–S–T–P–FY–W–E–D–N–Q–KR–H, A3 = LVIMC–AG–ST–P–FYW–EDNQ–KR–H, A4 = LVIMC–ASGTP–FYW–EDNQ–KRH, A5 = LVIMC–ASGTP–FYW–EDNQKRH.
Each protein is first translated according to the 5 alphabets. Then 2 gram representations are calculated for alphabets A1 and A2, and 3 gram representations are calculated for alphabets A3, A4, and A5. The five descriptors are
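A sketch of this procedure for alphabet A3 (8 groups, hence 8³ = 512 trigram counts); the input sequence is a toy example:

```python
from itertools import product

# Alphabet A3 from the text: 8 groups of amino acids.
A3 = ["LVIMC", "AG", "ST", "P", "FYW", "EDNQ", "KR", "H"]
GROUP = {aa: str(i) for i, g in enumerate(A3) for aa in g}

def ngram_counts(sequence, n=3, symbols="01234567"):
    """Translate to the reduced alphabet, then count n-grams (8^3 = 512 values)."""
    translated = "".join(GROUP[aa] for aa in sequence)
    keys = ["".join(k) for k in product(symbols, repeat=n)]
    counts = {k: 0 for k in keys}
    for i in range(len(translated) - n + 1):
        counts[translated[i:i + n]] += 1
    return [counts[k] for k in keys]

vec = ngram_counts("ALKSPH")  # translates to "106237": 4 trigrams
```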
Split amino-acid composition is a descriptor proposed by [
A sequence descriptor based on biorthogonal discrete wavelet is proposed in [
The vector
In the experiments reported below, 25 random physicochemical properties have been selected to create an ensemble of 25
This matrix descriptor was originally proposed in [
This descriptor [
The descriptors
In this work two variants of the single average descriptor are used: the one described above (labelled SA) and a version including matrix normalization using a sigmoid function by which each element of Mat is scaled to
The autocovariance matrix is a matrix descriptor proposed in [
AM can be calculated from an input matrix
This pseudo-PSSM approach (PP) is one of the most widely used matrix descriptors for proteins (see [
Given an input matrix
Singular value decomposition is a general purpose matrix factorization approach [
Given an input matrix
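Since the exact construction is truncated above, the sketch below shows one common way to obtain a fixed-length SVD descriptor, namely keeping the singular values of the protein matrix (the input values here are random placeholders):

```python
import numpy as np

# Toy "protein matrix" (e.g., a PSSM-like L x 20 matrix, hypothetical values).
rng = np.random.default_rng(42)
mat = rng.standard_normal((50, 20))

# The singular values summarise the matrix in a fixed-length way that does
# not depend on the protein length L (only on the smaller dimension).
singular_values = np.linalg.svd(mat, compute_uv=False)  # 20 values, descending
```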
DCT [
In this work the final DCT descriptor is obtained by retaining the first 400 coefficients.
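A sketch of this step, assuming the 400 retained coefficients are the top-left 20 × 20 low-frequency block of the 2D DCT (the paper may use a different coefficient ordering); the DCT-II is implemented directly so the snippet is self-contained:

```python
import numpy as np

def dct2_matrix(n):
    """Orthonormal DCT-II transform matrix (n x n)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0] *= np.sqrt(1.0 / n)
    c[1:] *= np.sqrt(2.0 / n)
    return c

def dct_descriptor(mat, size=20):
    """2D DCT of a protein matrix; keep the top-left size x size
    low-frequency block as a fixed-length (400) descriptor."""
    ch = dct2_matrix(mat.shape[0])
    cw = dct2_matrix(mat.shape[1])
    coeffs = ch @ mat @ cw.T
    return coeffs[:size, :size].ravel()

# Toy 30 x 20 "protein matrix" (e.g., a PSSM of a 30-residue protein).
mat = np.arange(600, dtype=float).reshape(30, 20)
desc = dct_descriptor(mat)
```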
The N-gram descriptor is usually extracted from the primary protein sequence (as already described in Section
Given an input matrix
A very interesting feature extraction approach for proteins is to treat a protein matrix representation as an image and to use well-known image texture descriptors for extracting features. In this work two high performing descriptors are evaluated: the local binary pattern histogram Fourier (LHF) descriptor [
First proposed by [
The LPQ operator [
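LHF and LPQ are more elaborate operators, but the underlying idea of binary-coding local neighbourhoods can be illustrated with a basic 8-neighbour LBP histogram; this simplified stand-in is not the exact descriptor used in the paper:

```python
import numpy as np

def lbp_histogram(img):
    """Basic 8-neighbour local binary pattern histogram (borders cropped)."""
    c = img[1:-1, 1:-1]                       # centre pixels
    code = np.zeros_like(c, dtype=np.uint8)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(shifts):
        # Neighbour plane aligned with the centre plane.
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code |= (nb >= c).astype(np.uint8) << bit
    hist = np.bincount(code.ravel(), minlength=256)
    return hist / hist.sum()                  # normalised 256-bin histogram
```

On a constant image every neighbour ties with its centre, so all pixels receive the all-ones code.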
This section reports the results of an experimental evaluation of the protein descriptors on sequence-based protein classification problems performed on several datasets.
The proposed approach has been evaluated on the 15 datasets listed below and according to the testing protocols suggested by the developers of the datasets. A brief summary description of each dataset and related testing protocol is reported in Table
Summary of the descriptors (short names are defined in Sections
| Protein representation | Descriptor | Size |
|---|---|---|
| AAS | AS | 20 |
| | 2G | 400 |
| | QRC | 1200 |
| | AC | 40 |
| | P2G | 800 |
| | AA | 65 |
| | GE | 480 |
| | NG | 400, 225, 512, 125, 64 |
| | SAC | 20 |
| | DW | 52 |
| PSSM/SMR | AB | 400 |
| | SAN | 400 |
| | SA | 400 |
| | AM | 300 |
| | PP | 320 |
| | SVD | Depends on the input representation |
| | DCT | 400 |
| | LHF_G | 176 |
| | LPQ_G | 512 |
| | LHF_L | 528 |
| | LPQ_L | 1536 |
| | BGR | 400 |
| | TGR | 8000 |
The testing protocol employed in the experiments depended on the datasets. In cases where the original dataset is not divided into training and testing sets, a 10-fold cross-validation was performed (results averaged on ten experiments); otherwise the subdivision of the training and testing sets was maintained.
Three performance indicators are used in the reported results: the classification accuracy, the area under the ROC curve (AUC), and the statistical rank. The accuracy is the ratio between the number of correctly classified samples and the total number of samples. The ROC curve is a plot of the sensitivity of a binary classifier versus its false positive rate (1 − specificity) as the discrimination threshold is varied. AUC [
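The two indicators can be sketched as follows; the AUC is computed here via its pairwise-ranking interpretation (the probability that a random positive is scored above a random negative), which is equivalent to the area under the ROC curve:

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative (ties count 1/2) -- equivalent to the ROC area."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```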
The statistical rank returns the relative position of a method with respect to the other tested methods. The average rank is the most stable indicator of average performance across different datasets and is calculated using Friedman's test (
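A sketch of the average-rank computation (ties are ignored for brevity; Friedman's test proper assigns mid-ranks to tied methods):

```python
def average_ranks(performance):
    """performance[d][m]: score of method m on dataset d (higher = better).
    Returns each method's rank averaged over datasets (1 = best)."""
    n_methods = len(performance[0])
    totals = [0.0] * n_methods
    for row in performance:
        order = sorted(range(n_methods), key=lambda m: -row[m])
        for rank, m in enumerate(order, start=1):
            totals[m] += rank
    return [t / len(performance) for t in totals]

# Two hypothetical methods evaluated on three hypothetical datasets.
ranks = average_ranks([[0.9, 0.8], [0.7, 0.75], [0.95, 0.6]])
```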
The first experiment is aimed at comparing all the descriptors detailed in Section
In Table
Comparison among the different feature extractors in terms of the statistical rank on the different datasets. The 2 best descriptors for each representation are in boldface.
| Protein representation | Descriptor | Rank |
|---|---|---|
| AAS | AS | 23.42 |
| | 2G | 27.25 |
| | QRC | 21.54 |
| | AC | |
| | P2G | 39.78 |
| | AA | |
| | GE | 30.24 |
| | NG | 27.85 |
| | SAC | 23.45 |
| | DW | 29.48 |
| PSSM | AB | 15.25 |
| | SAN | |
| | SA | 13.20 |
| | AM | 20.50 |
| | PP | |
| | SVD | 39.56 |
| | DCT | 28.56 |
| | LHF_G | 24.10 |
| | LPQ_G | 14.87 |
| | LHF_L | 31.81 |
| | LPQ_L | 26.72 |
| | BGR | 12.44 |
| | TGR | 15.68 |
| SMR | AB | 28.78 |
| | SAN | 24.80 |
| | SA | 24.82 |
| | AM | 40.52 |
| | PP | |
| | SVD | 29.20 |
| | DCT | 32.45 |
| | LHF_G | |
| | LPQ_G | 17.22 |
| | LHF_L | 26.24 |
| | LPQ_L | 31.24 |
| | BGR | 19.86 |
| | TGR | 23.24 |
| PR (ensemble of 25) | SVD | |
| | DCT | |
| | LHF_G | 41.25 |
| | LPQ_G | 38.38 |
| | LHF_L | 44.02 |
| | LPQ_L | 38.48 |
| WAVE (ensemble of 25) | SVD | 40.25 |
| | DCT | 47.00 |
| | LHF_G | |
| | LPQ_G | |
| | LHF_L | 41.10 |
| | LPQ_L | 40.20 |
The second experiment is aimed at comparing only the best descriptors found in Table
Comparison in terms of AUC in 2 class problems.
| Protein representation | Descriptor | DNA | HU | HI | GP | AD | VI1 | VI2 |
|---|---|---|---|---|---|---|---|---|
| AAS | AC | 92.6 | 71.8 | | 99.1 | 80.9 | | 76.5 |
| | AA | 90.6 | 68.3 | — | 98.8 | 78.9 | | 75.6 |
| PSSM | PP | | | 94.8 | | | 86.2 | |
| | SAN | | | | | | 87.3 | |
| SMR | PP | 92.9 | 73.8 | — | 99.5 | 79.8 | 88.5 | 76.0 |
| | LHF_G | 89.3 | 69.0 | — | 99.3 | 81.6 | 83.4 | 71.1 |
| PR | SVD | 79.6 | 74.2 | — | 98.0 | 72.3 | 59.1 | 73.3 |
| | DCT | 83.4 | 67.7 | — | 95.9 | 73.4 | 68.4 | 63.0 |
| WAVE | LPQ_G | 83.1 | 68.6 | — | 98.5 | 74.0 | 67.4 | 67.6 |
| | LHF_G | 77.7 | 68.6 | — | 97.8 | 68.9 | 65.1 | 60.8 |
Comparison in terms of AUC in multiclass problems.
| Protein representation | Descriptor | MEM | PF | ENZ | GR | VR | SL | CY | GO | SC |
|---|---|---|---|---|---|---|---|---|---|---|
| AAS | AC | 93.6 | 84.8 | 66.7 | 92.7 | | 93.2 | 78.4 | 70.0 | 67.6 |
| | AA | 90.4 | 84.2 | 63.7 | 92.6 | 72.2 | 91.1 | 76.5 | 69.5 | 65.5 |
| PSSM | PP | | | | 80.8 | | | | | |
| | SAN | 95.5 | | | | 72.0 | | | | |
| SMR | PP | 94.2 | 85.9 | 66.2 | | 76.9 | 92.2 | 78.7 | 69.0 | 66.2 |
| | LHF_G | | 87.6 | 65.6 | 91.3 | | 89.5 | 78.2 | 72.4 | 62.9 |
| PR | SVD | 94.4 | 83.5 | 59.4 | 80.8 | 76.0 | 85.4 | 73.5 | 59.7 | 60.3 |
| | DCT | 91.7 | 79.5 | 60.8 | 82.6 | 74.2 | 83.9 | 71.7 | 65.3 | 64.2 |
| WAVE | LPQ_G | 94.2 | 87.2 | 63.2 | 82.7 | 79.2 | 83.4 | 68.1 | 65.7 | 58.1 |
| | LHF_G | 92.7 | 86.2 | 61.5 | 80.3 | 80.6 | 81.0 | 66.6 | 65.2 | 57.0 |
The best results in the previous tables are almost always obtained with PSSM and AAS representations of proteins. Comparing the reported results of PR and WAVE with [
Comparisons with previous versions of WAVE and PR.
| Protein representation | Descriptor | HU | GP | AD |
|---|---|---|---|---|
| WAVE | Best in [ | 66.1 | 96.6 | 67.1 |
| PR | Best in [ | 62.8 | 87.8 | 57.5 |
| WAVE | LPQ_G | 68.6 | | 72.3 |
| PR | SVD | | 98.0 | |
The third experiment tests some ensemble approaches based on the fusion of some of the best descriptors, selected considering all the datasets, excluding HI. The ensembles tested in this experiment are obtained as the weighted fusion of the following methods, labelled in terms of representation (descriptor): FUS1: 2 × AAS(AC) + 2 × PSSM(SAN) + 4 × PSSM(PP) + PSSM(LHF_G) + PSSM(BGR) + PSSM(TGR) + SMR(PP) + SMR(BGR); FUS2: FUS1 + 2 × DM(LPQ_G).
The results of these two ensembles are compared in Tables
Comparison among ensembles and best stand-alone descriptors in terms of AUC in 2 class problems.
| Protein representation | DNA | HU | HI | GP | AD | VI1 | VI2 |
|---|---|---|---|---|---|---|---|
| PSSM(PP) | 95.5 | 81.2 | 94.8 | 99.8 | 87.7 | 86.2 | 87.2 |
| PSSM(SAN) | 95.2 | 76.4 | 95.7 | 99.7 | 82.7 | 87.3 | 85.7 |
| AAS(AC) | 92.6 | 71.8 | 95.9 | 99.1 | 80.9 | 90.0 | 76.4 |
| FUS1 | 97.2 | | | | | | |
| FUS2 | | — | — | — | — | — | — |
Comparison among ensembles and best stand-alone descriptors in terms of AUC in multiclass problems.
| Protein representation | MEM | PF | ENZ | GR | VR | SL | CY | GO | SC |
|---|---|---|---|---|---|---|---|---|---|
| PSSM(PP) | 96.8 | 93.1 | 78.0 | 80.8 | 81.8 | 95.7 | 79.4 | | 70.3 |
| PSSM(SAN) | 95.5 | 87.7 | 71.1 | | 72.0 | 94.1 | 81.8 | 78.6 | 73.4 |
| AAS(AC) | 93.6 | 84.8 | 66.7 | 92.7 | 81.8 | 93.2 | 78.4 | 70.0 | 67.6 |
| FUS1 | | 92.7 | | 92.3 | | | | 83.8 | 75.3 |
| FUS2 | — | | 80.1 | — | — | — | 84.3 | 82.8 | |
The most interesting result among those reported in Tables
The fourth experiment is aimed at comparing our ensembles FUS1 and FUS2 with the performance reported in the literature by other state-of-the-art approaches. Unfortunately, a fair comparison with other approaches is not always easy, for the following reasons. Several papers use self-collected datasets, and only in a few cases is the code for feature extraction available. Many works report results obtained on small datasets without a clear indication of the testing protocol used; therefore, it is difficult to know whether parameter optimization was performed on the entire dataset (thereby overfitting the results) or only on a training set. Overfitting is particularly dangerous in small datasets.
The comparison is much easier when considering large datasets (as with HI and MEM) or when an independent dataset separate from the training set is available (as in PF). So in the following tests we compare our results only when we are quite sure that the comparison is fair.
Tables
Comparison with the state-of-the-art using AUC as performance indicator.
| Methods | HU | PF | GP | GR | VR | DNA | ENZ | MEM | GO | SL | HI | AD | VI1 | VI2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [ | 77.0 | 87.0 | 83.4 |
| [ | 93.3 | 72.5 | 50.0 |
| [ | 72.5 | 99.7 | | 82.5 | 96.0 | 82.9 | 86.1 | 76.0 |
| [ | 98.2 |
| [ | 81.6 | 91.2 | 84.1 |
| [ | 95.9 | 79.4 | 96.8 | 93.8 | 98.0 | 87.1 | 87.9 |
| FUS1 | | 92.7 | | 92.3 | | | | | | | | | | |
Comparison with the state-of-the-art using accuracy as performance indicator.
| Methods | HU | PF | GP | GR | VR | DNA | ENZ | MEM | GO | SL | HI | AD | VI1 | VI2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [ | 56.50 |
| [ | 65.50 |
| [ | 58.18 |
| [ | 61.04 |
| [ | 70.0 |
| [ | 69.60 |
| [ | 91.6 |
| [ | 91.6 |
| [ | 84.1 |
| [ | 92.7 |
| [ | 92.6 |
| [ | 70.0 | 98.1 | 84.4 | 78.6 | 91.5 |
| [ | 56.2 | 94.1 | 59.4 | 85.8 | 93.1 | 85.5 | 81.7 |
| FUS1 | | 68.6 | | 87.9 | | | | 94.3 | 64.3 | | 93.9 | | | |
| FUS2 | 74.6 | | | 63.0 |
The results reported in Tables
Considering the dataset PF, which is one of the most widely used benchmarks, FUS1 compares very well with the other approaches where features are not extracted using 3D information (for a fair comparison). The performance of FUS1 is all the more valuable when considering that, unlike the older approaches, it is obtained without ad-hoc feature extractors (whose features are validated only on PF, with a high risk of overfitting).
The approaches compared on PF are described in [ ], [ ], [ ], and [ ].
Since the PF dataset aims at predicting the 3D structure of a protein, features extracted from 3D representations are highly useful as proven by the better performance obtained by FUS2 with respect to FUS1.
Given the results reported above, our proposed ensemble FUS1 should prove useful for practitioners and experts alike, since it can form the base for building systems that are optimized for particular problems (e.g., SVM optimization and physicochemical property selection). Obviously, it is very important that only the training data be used for physicochemical property selection; it is not fair to choose the physicochemical properties using the entire dataset. Moreover, when the ensemble is optimized for a given dataset, it is very important to consider that large descriptors work better when a large training set is available (because of the curse of dimensionality). As an example, we report below the performance of AAS(RC) and AAS(AC). AAS(RC) has high dimensionality and, accordingly, as seen in Table
Comparison between AAS(RC) and AAS(AC).
| Methods | HU | PF | GP | GR | VR | DNA | ENZ | MEM | GO | SL | HI | AD | VI1 | VI2 | CY | SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAS(RC) | 70.3 | | 98.9 | 90.0 | 69.0 | 86.2 | 64.5 | | 68.3 | 87.8 | | | 89.2 | 75.9 | 77.6 | 62.4 |
| AAS(AC) | | 84.8 | | | | | | 93.6 | | | 95.9 | 80.9 | | | | |
A similar behavior occurs with some other methods. In Table
Comparison among ensembles and best stand-alone descriptors in terms of AUC.
| Methods | HU | PF | GP | GR | VR | DNA | ENZ | MEM | GO | SL | HI | AD | VI1 | VI2 | CY | SC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSSM(PP) | | | 99.8 | 80.8 | | | | | | | 94.8 | | 86.2 | | 79.4 | 70.3 |
| PSSM(SAN) | 76.4 | 87.7 | 99.7 | | 72.0 | 95.2 | 71.1 | 95.5 | 78.6 | 94.1 | 95.7 | 82.7 | | 85.7 | | |
| PSSM(LPQ_G) | 72.0 | 89.5 | | 82.3 | 77.7 | 89.5 | 66.2 | 93.6 | 73.0 | 93.7 | | 86.8 | 82.3 | 83.9 | 70.3 | 61.6 |
It is clear from our experimental results that it is difficult to find an ensemble that performs the best across each of the datasets. Nonetheless, we have shown that among the several tested and proposed protein descriptors, it is always possible to find an ensemble that performs well in each type of dataset.
One goal in this work was to provide a survey of several state-of-the-art descriptors and some new variants starting from different protein representations. We compare the performance of these descriptors across several benchmark datasets. The results reported in this paper show that the best protein representation is PSSM, but AAS and SMR also work well. We found that no single descriptor is superior to all others across all tested datasets.
Another objective of this study was to search for a general ensemble method that could work well on different protein classification datasets. Accordingly, we performed several fusions to experimentally find a set of descriptors, based on different representations, that worked well across each of the tested datasets. A couple of representations, such as WAVE and PR, were not useful in fusion. Given the results of our experiments, we conclude that a wide survey of different texture descriptors should be performed, since different descriptors contain different information that might boost performance when combined.
Our major contribution is to propose an ensemble of descriptors/classifiers for sequence-based protein classification that not only works well across several datasets but also, in some cases, proves superior to the state-of-the-art. Unlike other papers that develop a web server, we share almost all the MATLAB code used in the proposed approaches. Our proposed ensemble can be considered a baseline system for developing an ad-hoc system for a given problem. Issues to consider when optimizing such a base system for a given dataset were also discussed. For instance, the size of the dataset seems to play a role in the choice of protein representation, with some descriptors showing stronger performance on large datasets. In particular, approaches that use a high dimensional representation (e.g., RC) require larger datasets in order to avoid the curse of dimensionality.
To further improve the performance of our methods, we plan, in the future, on testing more classification approaches. We are particularly interested in investigating ensembles made with AdaBoost and Rotation forest [
The authors declare that there is no conflict of interests regarding the publication of this paper.