Protein Remote Homology Detection Based on an Ensemble Learning Approach

Protein remote homology detection is one of the central problems in bioinformatics. Although several computational methods have been proposed, the problem is still far from being solved. In this paper, an ensemble classifier for protein remote homology detection, called SVM-Ensemble, is proposed with a weighted voting strategy. SVM-Ensemble combines three basic classifiers based on different feature spaces: Kmer, ACC, and SC-PseAAC. These features consider the characteristics of proteins from various perspectives, incorporating both the sequence composition and the sequence-order information along the protein sequences. Experimental results on a widely used benchmark dataset showed that SVM-Ensemble clearly improves the predictive performance for protein remote homology detection over its three basic classifiers. Moreover, it outperformed the other state-of-the-art methods included in the comparison.


Introduction
In computational biology, protein remote homology detection is the classification of proteins into structural and functional classes given their amino acid sequences, especially when sequence identities are low. It is a critical step for basic research and practical applications and can be applied to protein 3D structure and function prediction [1,2]. Although remote homologous proteins have similar structures and functions, they lack easily detectable sequence similarities, because protein structures are more conserved than protein sequences. When the protein sequence similarity is below 35% at the amino acid level, the alignment score usually falls into a twilight zone [3,4]. Therefore, computational approaches based only on protein sequence features often fail to detect remote homology. To improve the specificity and sensitivity of the detection, we proposed an ensemble learning method, which combines basic classifiers based on different feature spaces.
Up to now, many methods for protein remote homology detection have been proposed, which can be categorized into three groups [5]: pairwise alignment algorithms, generative models, and discriminative classifiers. Early computational approaches for protein remote homology detection are pairwise alignment methods, which detect sequence similarities between any two given protein sequences by using the Needleman-Wunsch global alignment algorithm [6,7] and the Smith-Waterman local alignment algorithm [8]. Later, methods such as BLAST [9] and FASTA [10] were proposed, which trade reduced accuracy for improved efficiency. PSI-BLAST [11] iteratively builds a probabilistic profile of a query sequence, so that a more sensitive sequence comparison score can be calculated [12]. After pairwise alignment methods, the predictive accuracy was significantly improved by using generative algorithms. Generative models are iteratively trained by using positive samples of a protein family or superfamily; for example, HHblits [13] generates a profile hidden Markov model (profile-HMM).
Currently, the discriminative methods achieve the state-of-the-art performance [16][17][18][19]. Unlike pairwise alignment algorithms and generative methods, the discriminative methods can easily embed various characteristics of protein sequences and learn from both positive and negative samples in a given benchmark dataset. A key feature of a discriminative method is that its input requires fixed-length feature vectors. Therefore, researchers have proposed various feature vectors for protein representation. Some methods are based on sequence information, physical and chemical properties of proteins [20][21][22], or secondary structure information [23,24], such as SVM-DR [25]. Some methods are based on kernel methods, such as SVM-Pairwise [5], SVM-LA [26], motif kernel [27], mismatch [28], SW-PSSM [29], and profile kernel [30].
Later, the performance of discriminative approaches is further improved by Top-n-gram, because it can transform protein profiles into pseudo protein sequences, which contain the evolutionary information [31][32][33].
Although many discriminative methods for protein remote homology detection have been proposed based on various feature extraction techniques, there has been no attempt to combine these methods using an ensemble learning method to improve predictive performance. An ensemble classifier [34,35] is built by combining a set of basic classifiers with a weighted voting strategy to give a final decision in classifying a query sample. Ensemble classifiers have achieved great success in many fields, including protein-protein interaction sites [36], protein fold pattern recognition [22,37], tRNA detection [38,39], microRNA identification [40][41][42][43][44], DNA binding protein identification [45], and eukaryotic protein subcellular location prediction [46], because they are able to learn a more expressive concept in classification compared to a single classifier and reduce the variance caused by a single classifier.
In this study, inspired by the success of ensemble classifiers in other fields, we proposed an ensemble classifier for protein remote homology detection, called SVM-Ensemble, which combines three state-of-the-art discriminative methods with a weighted voting strategy. The three basic classifiers SVM-Kmer, SVM-ACC, and SVM-SC-PseAAC were constructed with Kmer, auto-cross covariance (ACC), and series correlation pseudo amino acid composition (SC-PseAAC), respectively. Experimental results on a widely used benchmark dataset [5] showed that SVM-Ensemble clearly improves the predictive performance by combining various features. Moreover, SVM-Ensemble achieved an average ROC score of 0.945, outperforming the other state-of-the-art methods, indicating that it would be a useful computational tool for protein remote homology detection.

Benchmark Dataset.
A widely used superfamily benchmark [5] was used to evaluate the performance of our method for protein remote homology detection. The classification problem definition and benchmark dataset are available at http://noble.gs.washington.edu/proj/svm-pairwise/. The same dataset has been used in a number of earlier studies [26][47][48][49][50], allowing direct comparison with previously reported results.
The benchmark contains 54 families and 4352 proteins, which are derived from the SCOP database version 1.53; the similarity between any two sequences is below an E-value threshold of $10^{-25}$. Remote homology detection can be treated as a superfamily classification problem. For each family, the proteins within the family were regarded as positive test samples, and the proteins outside the family but within the same superfamily were taken as positive training samples. Negative samples were selected from outside of the fold and split into training and testing sets. This process was repeated until each family had been tested. This yielded 54 families with at least 10 positive training examples and 5 positive test examples.

Profile-Based Protein Representation.
Although some methods have achieved a certain degree of success by using amino acid sequence information only, their performance is not satisfactory. Recent studies demonstrated that methods operating on profile-based protein sequences show better performance, because a profile is richer than an individual sequence as far as the evolutionary information is concerned [50,53].
The frequency profile M for protein P with $L$ amino acids can be represented as

$$\mathbf{M} = \begin{bmatrix} m_{1,1} & m_{1,2} & \cdots & m_{1,L} \\ m_{2,1} & m_{2,2} & \cdots & m_{2,L} \\ \vdots & \vdots & \ddots & \vdots \\ m_{20,1} & m_{20,2} & \cdots & m_{20,L} \end{bmatrix} \tag{1}$$

where $m_{i,j}$ ($0 \le m_{i,j} \le 1$) is the target frequency, which reflects the probability of amino acid $i$ ($i = 1, 2, \ldots, 20$) occurring at the sequence position $j$ ($j = 1, 2, \ldots, L$) in protein P during evolutionary processes. For each column in M, the elements add up to 1. Each column can therefore be regarded as an independent multinomial distribution. The target frequency was calculated from the multiple sequence alignments generated by running PSI-BLAST [11] against the NCBI's NR with default parameters, except that the number of iterations was set at 10 in the current study. The details of how to build a frequency profile can be found in [50].
Given the frequency profile M for protein P, we can find the amino acid with maximum frequency in each column of M. These amino acids are combined to produce the profile-based protein representation. In a frequency profile M, the target frequencies reflect the probabilities of the corresponding amino acids appearing at the specific sequence positions. The higher the frequency is, the more likely the corresponding amino acid is to occur. Thus, the produced profile-based protein sequence contains the evolutionary information in the frequency profile. We convert the frequency profiles into a series of profile-based proteins, so that the existing sequence-based methods can be applied directly to these representations for further processing.
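The conversion from a frequency profile to a profile-based sequence can be sketched as follows. This is a minimal illustration, not the authors' implementation: the row order `AMINO_ACIDS`, the helper `toy_profile`, and the peak value 0.62 are assumptions made only for this example, and ties are broken by taking the first maximal row.

```python
# Assumed row order of the 20 amino acids in the profile (illustrative choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def profile_to_sequence(profile):
    """Return the profile-based sequence for a 20 x L frequency profile.

    profile[i][j] is the target frequency of amino acid i at position j.
    For every column, the amino acid with the maximum frequency is kept.
    """
    if len(profile) != len(AMINO_ACIDS):
        raise ValueError("profile must have one row per amino acid (20 rows)")
    length = len(profile[0])
    residues = []
    for j in range(length):
        # Row index with the largest frequency in column j (first one on ties).
        best_row = max(range(len(AMINO_ACIDS)), key=lambda i: profile[i][j])
        residues.append(AMINO_ACIDS[best_row])
    return "".join(residues)


def toy_profile(sequence, peak=0.62):
    """Hypothetical helper: build a profile whose columns peak at `sequence`."""
    rest = (1.0 - peak) / 19  # remaining mass spread evenly, column sums to 1
    return [[peak if aa == res else rest for res in sequence]
            for aa in AMINO_ACIDS]


profile = toy_profile("MKV")
print(profile_to_sequence(profile))
```

Because each column is reduced to a single residue, the output has the same length as the original protein and can be passed to any sequence-based feature extractor.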

Feature Vector Representations for Protein Sequences.
In this study, three kinds of features have been employed to construct the SVM-Ensemble predictor, including Kmer, auto-cross covariance (ACC), and series correlation pseudo amino acid composition (SC-PseAAC).
Suppose a protein sequence P with $L$ amino acid residues can be represented as

$$\mathbf{P} = R_1 R_2 R_3 \cdots R_L \tag{2}$$

where $R_i$ represents the amino acid residue at the sequence position $i$, such that $R_1$ represents the amino acid residue at the sequence position 1 and $R_2$ represents the amino acid residue at position 2 and so on. The three representation methods used can be described as follows.

Kmer.
Kmer [56] is the simplest approach to represent the proteins, in which the protein sequences are represented as the occurrence frequencies of $k$ neighboring amino acids.

Auto-Cross Covariance (ACC).
ACC transformation [60][61][62] is to build two signal sequences and then calculate the correlation between them. ACC results in two kinds of variables: autocovariance (AC) transformation and cross covariance (CC) transformation. The AC variable measures the correlation of the same property between two residues separated by a distance of lag along the sequence. The CC variable measures the correlation of two different properties between two residues separated by lag along the sequence.
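Autocovariance (AC) Transformation. Given a protein sequence P in (2), the AC variable can be calculated by

$$\mathrm{AC}(i, \mathrm{lag}) = \sum_{j=1}^{L-\mathrm{lag}} \frac{\left(S_i(R_j) - \bar{S}_i\right)\left(S_i(R_{j+\mathrm{lag}}) - \bar{S}_i\right)}{L - \mathrm{lag}} \tag{3}$$

where $i$ is a physicochemical index, $L$ is the length of the protein sequence, $S_i(R_j)$ means the numerical value of the physicochemical index $i$ for the amino acid $R_j$, and $\bar{S}_i$ is the average value for physicochemical index $i$ along the whole sequence:

$$\bar{S}_i = \frac{1}{L} \sum_{j=1}^{L} S_i(R_j) \tag{4}$$

In such a way, the length of the AC feature vector is $N$ * LAG, where $N$ is the number of physicochemical indices and LAG is the maximum of lag (lag = 1, 2, . . . , LAG).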
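Cross Covariance (CC) Transformation. Given a protein sequence P in (2), the CC variable can be calculated by

$$\mathrm{CC}(i_1, i_2, \mathrm{lag}) = \sum_{j=1}^{L-\mathrm{lag}} \frac{\left(S_{i_1}(R_j) - \bar{S}_{i_1}\right)\left(S_{i_2}(R_{j+\mathrm{lag}}) - \bar{S}_{i_2}\right)}{L - \mathrm{lag}} \tag{5}$$

where $i_1$, $i_2$ are two different physicochemical indices, $L$ is the length of the protein sequence, and $S_{i_1}(R_j)$, $S_{i_2}(R_{j+\mathrm{lag}})$ are the numerical values of the physicochemical indices $i_1$, $i_2$ for the amino acids $R_j$, $R_{j+\mathrm{lag}}$. $\bar{S}_{i_1}$, $\bar{S}_{i_2}$ are the average values for the physicochemical indices $i_1$, $i_2$ along the whole sequence, and they can be calculated by (4).
In such a way, the length of the CC feature vector is $N$ * ($N$ − 1) * LAG, where $N$ is the number of physicochemical indices and LAG is the maximum of lag (lag = 1, 2, . . . , LAG).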
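The AC and CC transformations can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, and `TOY_INDICES` holds placeholder numbers over a four-letter alphabet, not real AAindex values. Concatenating the AC and CC parts gives N * LAG + N * (N − 1) * LAG = N * N * LAG features.

```python
# Placeholder physicochemical indices (NOT AAindex values), one dict per index.
TOY_INDICES = [
    {"A": 0.62, "C": 0.29, "D": -0.90, "E": -0.74},
    {"A": -0.50, "C": -1.00, "D": 3.00, "E": 3.00},
    {"A": 15.0, "C": 47.0, "D": 59.0, "E": 73.0},
]


def _encode(sequence, index):
    """Map each residue to its numerical value for one physicochemical index."""
    return [index[res] for res in sequence]


def _covariance(x, y, lag):
    """Lagged covariance of two signals, averaged over the L - lag residue pairs."""
    length = len(x)
    mean_x = sum(x) / length
    mean_y = sum(y) / length
    total = sum((x[j] - mean_x) * (y[j + lag] - mean_y)
                for j in range(length - lag))
    return total / (length - lag)


def acc_features(sequence, indices, max_lag):
    """Return the AC features followed by the CC features."""
    if max_lag >= len(sequence):
        raise ValueError("max_lag must be smaller than the sequence length")
    signals = [_encode(sequence, idx) for idx in indices]
    lags = range(1, max_lag + 1)
    # AC: same index at both positions.
    ac = [_covariance(s, s, lag) for s in signals for lag in lags]
    # CC: every ordered pair of two different indices.
    cc = [_covariance(s1, s2, lag)
          for a, s1 in enumerate(signals)
          for b, s2 in enumerate(signals) if a != b
          for lag in lags]
    return ac + cc


features = acc_features("ACDEACDE", TOY_INDICES, max_lag=2)
print(len(features))  # 3 * 3 * 2 features
```

In practice the indices would cover all 20 amino acids and are usually normalized before the transformation; that step is omitted here to keep the sketch short.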
Therefore, the length of the ACC feature vector is $N$ * $N$ * LAG. In the current implementation, three physicochemical properties were employed, including hydrophobicity, hydrophilicity, and mass (see Table S1 in Supplementary file, available online at http://dx.doi.org/10.1155/2016/5813645) extracted from AAindex [57,63].

Series Correlation Pseudo Amino Acid Composition (SC-PseAAC).
SC-PseAAC [64] is an approach incorporating the contiguous local sequence-order information and the global sequence-order information into the feature vector of the protein sequence. Given a protein sequence P in (2), the SC-PseAAC [64] feature vector of P is defined as
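$$\mathbf{P} = \left[p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+3\lambda}\right]^{T}, \qquad p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{3\lambda} \theta_j}, & 1 \le u \le 20, \\[2ex] \dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{3\lambda} \theta_j}, & 21 \le u \le 20 + 3\lambda, \end{cases}$$

where $f_i$ ($i = 1, 2, \ldots, 20$) is the normalized occurrence frequency of the 20 native amino acids in the protein P; the parameter $\lambda$ is an integer, representing the highest counted rank (or tier) of the correlation along a protein sequence; $w$ is the weight factor ranging from 0 to 1; and $\theta_j$ is the $j$-tier sequence-correlation factor that reflects the sequence-order correlation between all of the most contiguous residues along a protein sequence, which is defined as

$$\theta_{3k-2} = \frac{1}{L-k} \sum_{i=1}^{L-k} H^{1}_{i,i+k}, \qquad \theta_{3k-1} = \frac{1}{L-k} \sum_{i=1}^{L-k} H^{2}_{i,i+k}, \qquad \theta_{3k} = \frac{1}{L-k} \sum_{i=1}^{L-k} M_{i,i+k}, \qquad k = 1, 2, \ldots, \lambda,$$

where $H^{1}_{i,j}$, $H^{2}_{i,j}$, and $M_{i,j}$ are the hydrophobicity, hydrophilicity, and mass correlation functions given by

$$H^{1}_{i,j} = \hat{h}_1(R_i)\,\hat{h}_1(R_j), \qquad H^{2}_{i,j} = \hat{h}_2(R_i)\,\hat{h}_2(R_j), \qquad M_{i,j} = \hat{m}(R_i)\,\hat{m}(R_j),$$

where $\hat{h}_1(R_i)$, $\hat{h}_2(R_i)$, and $\hat{m}(R_i)$ are the substituting values of hydrophobicity, hydrophilicity, and mass values for amino acid $R_i$. They are all subjected to a standard conversion as described by the following equation:

$$\hat{h}_1(R_i) = \frac{h_1(R_i) - \frac{1}{20}\sum_{k=1}^{20} h_1(R_k)}{\sqrt{\frac{1}{20}\sum_{u=1}^{20} \left[h_1(R_u) - \frac{1}{20}\sum_{k=1}^{20} h_1(R_k)\right]^2}},$$

with $\hat{h}_2(R_i)$ and $\hat{m}(R_i)$ obtained in the same way from $h_2(R_i)$ and $m(R_i)$, where we use $R_i$ ($i = 1, 2, \ldots, 20$) to represent the 20 native amino acids. The symbols $h_1(R_i)$, $h_2(R_i)$, and $m(R_i)$ represent the original hydrophobicity, hydrophilicity, and mass values (see Table S1 in Supplementary file) of the amino acid $R_i$. These aforementioned features can be generated by a web-server called Pse-in-One [56], which can be used to generate the desired feature vectors for protein/peptide and DNA/RNA sequences according to the needs of users' studies. It covers a total of 28 different modes, of which 14 are for DNA sequences, 6 are for RNA sequences, and 8 are for protein sequences.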
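A rough sketch of an SC-PseAAC-style feature vector is given below. It assumes the common series-correlation form: properties are standardized over the 20 amino acids, each correlation factor is the average product of a standardized property at residues separated by k positions, and the 20 + 3λ components are normalized by a shared denominator. The function names and `TOY_PROPERTIES` (placeholder numbers, not the Table S1 values) are hypothetical, and the exact ordering of the correlation factors in the original method may differ.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder property tables (NOT the real hydrophobicity, hydrophilicity,
# or mass values); each is just a permutation of 0..19 for illustration.
TOY_PROPERTIES = [
    {aa: float((k * mult) % 20) for k, aa in enumerate(AMINO_ACIDS)}
    for mult in (7, 3, 9)
]


def standardize(raw):
    """Standard conversion: zero mean and unit deviation over the 20 amino acids."""
    mean = sum(raw[aa] for aa in AMINO_ACIDS) / 20
    std = math.sqrt(sum((raw[aa] - mean) ** 2 for aa in AMINO_ACIDS) / 20)
    return {aa: (raw[aa] - mean) / std for aa in AMINO_ACIDS}


def sc_pseaac(sequence, raw_properties, lam, w):
    """Return 20 composition terms followed by len(raw_properties) * lam correlation terms."""
    length = len(sequence)
    if lam >= length:
        raise ValueError("lam must be smaller than the sequence length")
    props = [standardize(p) for p in raw_properties]
    freqs = [sequence.count(aa) / length for aa in AMINO_ACIDS]
    thetas = []
    for k in range(1, lam + 1):          # tier: residues k positions apart
        for prop in props:               # one factor per property and tier
            total = sum(prop[sequence[i]] * prop[sequence[i + k]]
                        for i in range(length - k))
            thetas.append(total / (length - k))
    denom = sum(freqs) + w * sum(thetas)  # shared normalizing denominator
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


vector = sc_pseaac("ACDEFGHIKLMNPQRSTVWYAC", TOY_PROPERTIES, lam=2, w=0.05)
print(len(vector))  # 20 + 3 * 2 components
```

With this normalization the components sum to 1, so the weight factor w controls how much of the vector is devoted to sequence-order information relative to plain composition.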

Support Vector Machine.
Support vector machine (SVM) is a supervised machine learning technique for classification tasks based on statistical learning theory [65,66]. Given a set of fixed-length training vectors with labels (positive and negative input samples), SVM can learn a linear decision boundary to discriminate the two classes. The result is a linear classification rule that can be used to classify new test samples. When the samples are not linearly separable, a kernel function can be used to map the samples to a high-dimensional feature space in which the optimal hyperplane can be found as the decision boundary. SVM has exhibited excellent performance in practice [54,58][67][68][69][70][71][72][73] and has a strong theoretical foundation in statistical learning.
In this study, the publicly available Gist SVM package (http://www.chibi.ubc.ca/gist/) was employed. The default SVM parameters of the Gist package were used, except that the kernel function was set to the radial basis function.
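For readers unfamiliar with the radial basis function, the following sketch shows one common parameterization of the kernel and the resulting kernel matrix. It is not taken from the Gist package: the function names and the width parameter `sigma` are assumptions for illustration, and Gist's own parameterization and defaults may differ.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Radial basis function kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def kernel_matrix(vectors, sigma=1.0):
    """Pairwise kernel values for a list of fixed-length feature vectors."""
    return [[rbf_kernel(u, v, sigma) for v in vectors] for u in vectors]


# Three toy feature vectors (hypothetical values).
toy_vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
for row in kernel_matrix(toy_vectors):
    print(["%.3f" % value for value in row])
```

The kernel equals 1 for identical vectors and decays toward 0 as the feature vectors move apart, which is what allows a linear separator in the induced feature space to act as a nonlinear boundary in the original one.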

Ensemble Classifier.
The ensemble classifier is able to learn a more expressive concept in classification compared to a single classifier and reduces the variance caused by a single classifier. Therefore, it was employed in many fields and achieved great success [36,37].
In this paper, we proposed a weighted voting strategy for protein remote homology detection, as shown in Figure 1. The ensemble framework of SVM-Ensemble was constructed by combining SVM-Kmer, SVM-ACC, and SVM-SC-PseAAC with weighted factors. The processing can be formulated as below.
Suppose the ensemble classifier is expressed by

$$C_i = C_i^{1} \oplus C_i^{2} \oplus C_i^{3} \tag{12}$$

where $C_i^{j}$ represents the $j$th basic SVM classifier on superfamily $S_i$ ($1 \le i \le 54$). That is, $C_1^{1}$ represents the classifier SVM-Kmer that operates on the superfamily $S_1$, $C_1^{2}$ represents the classifier SVM-ACC that operates on superfamily $S_1$, and $C_1^{3}$ represents the classifier SVM-SC-PseAAC that operates on superfamily $S_1$. $C_i$ is the average performance of the three basic classifiers on superfamily $S_i$ with the weighted voting strategy. In (12), the symbol ⊕ denotes the weighted voting operator.
The three basic classifiers can be combined by using the following equation:

$$C_i(\mathbf{P}, S_i) = \sum_{j=1}^{3} w_i^{j}\, C_i^{j}(\mathbf{P}, S_i)$$

where $C_i^{j}(\mathbf{P}, S_i)$ is the belief function or supporting degree for P belonging to $S_i$ predicted by the $j$th basic classifier and $w_i^{j}$ is the weight factor, which is set to the average ROC score of the $j$th basic classifier on superfamily $S_i$.

Figure 1: Flowchart to show how the ensemble classifier is formed by combining three basic classifiers on superfamily-level. The ensemble strategy is first employed on superfamily-level, and then the query protein P is predicted as belonging to the superfamily for which its score is the highest.
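The weighted voting step can be illustrated with a short sketch. It assumes the combined score is a plain weighted sum of the supporting degrees of the three basic classifiers, with weights equal to their average ROC scores; the function names and all numbers below are hypothetical and serve only to show the mechanics.

```python
def ensemble_score(supports, weights):
    """Weighted sum of the supporting degrees given by the basic classifiers."""
    if len(supports) != len(weights):
        raise ValueError("one weight is needed per basic classifier")
    return sum(w * s for w, s in zip(weights, supports))


def predict_superfamily(supports_by_superfamily, weights_by_superfamily):
    """Score every superfamily and return the best one with all scores."""
    scores = {
        name: ensemble_score(supports, weights_by_superfamily[name])
        for name, supports in supports_by_superfamily.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical supporting degrees from SVM-Kmer, SVM-ACC, and SVM-SC-PseAAC.
supports = {"S1": [0.9, 0.2, 0.4], "S2": [0.1, 0.3, 0.2]}
# Hypothetical average ROC scores used as weights for each superfamily.
weights = {"S1": [0.90, 0.80, 0.85], "S2": [0.90, 0.80, 0.85]}

best, scores = predict_superfamily(supports, weights)
print(best, scores)
```

Using ROC scores as weights lets the more reliable basic classifier for a given superfamily contribute more to the final decision, while the weaker classifiers can still tip the balance when they agree.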
Because the test sets have more negative than positive samples, simply measuring error rates will not give a good evaluation of the performance. In this case, the best way to evaluate the trade-off between specificity and sensitivity is to use the ROC score. The ROC score is the normalized area under a curve that plots true positives as a function of false positives for varying classification thresholds. A ROC score of 1 indicates a perfect separation of positive samples from negative samples, whereas a ROC score of 0.5 denotes random separation. The ROC50 score is the area under the ROC curve up to the first 50 false positives.
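Both scores can be computed from a ranked list of predictions, as in the sketch below. It is an illustration under assumptions rather than the evaluation code used in the study: the function name `roc_n` is hypothetical, ties in the scores are broken by input order, and the truncated area is normalized by the number of positives times the number of false positives actually counted.

```python
def roc_n(scores, labels, max_fp=None):
    """Normalized area under the ROC curve up to the first `max_fp` false positives.

    scores: discriminant values, higher means more likely positive.
    labels: 1 (or True) for positive samples, 0 (or False) for negative ones.
    max_fp=None gives the full ROC score; max_fp=50 gives ROC50.
    """
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both positive and negative samples are required")
    limit = n_neg if max_fp is None else min(max_fp, n_neg)
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = false_pos = area = 0
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # one column of the step curve per false positive
            if false_pos == limit:
                break
    return area / (n_pos * limit)


toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
print(roc_n(toy_scores, toy_labels))            # full ROC score
print(roc_n(toy_scores, toy_labels, max_fp=1))  # truncated after 1 false positive
```

Truncating at a small number of false positives focuses the evaluation on the top of the ranking, which matters when only the highest-scoring hits would be inspected in practice.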

The Influence of Parameters on the Predictive Performance of Basic Predictors.
Each basic predictor has several parameters that should be optimized. For more information on these parameters, please refer to Materials and Method. In this study, we optimized them by using grid search. The influence of these parameters on the performance is shown in Figure 2, and the optimized parameter values and their results are shown in Table 1, from which we can see that SVM-Kmer achieved the best performance, followed by SVM-SC-PseAAC.

Performance of Ensemble Classifier Based on Various Feature Combinations with Weighted Voting Strategy.
As discussed above, predictors based on different feature sets showed different performance. In this study, in order to further improve the performance of protein remote homology detection, we employed an ensemble learning approach to combine various predictors. The performance of the ensemble classifier with various feature combinations is shown in Table 2. The best performance (ROC = 0.943, ROC50 = 0.744) was achieved with the combination of all three basic predictors, which clearly outperformed each of the three basic predictors in terms of both ROC score and ROC50 score. These results are not surprising: the three basic predictors are based on different features, and their predictive results are complementary, so the performance can be improved by combining them with an ensemble learning method.

Feature Analysis for Discriminative Power.
To further study the discriminative power of features in the three basic predictors, we employed a feature extraction method, called principal component analysis (PCA) [79], to calculate the discriminative weight vectors in the feature space. The process of PCA for extracting significant features can be found in [32,80]. For each basic predictor, the top 10 most discriminative features in the feature space are shown in Table 3, from which we can see that, for the Kmer features, six of the most discriminative features contain the same amino acid, indicating the importance of this amino acid. For ACC features, the hydrophobicity ($h_1$) has an important impact on the feature discrimination. For SC-PseAAC features, one amino acid has the most discriminative power, and features with a small $\lambda$ value are more important. Both ACC and SC-PseAAC features with strong discriminative power incorporate the sequence-order effects. Taken together, these three kinds of features consider both sequence composition and sequence-order effects. Therefore, SVM-Ensemble can further improve the performance by combining them in an ensemble learning approach.

Comparison with Other Related Predictors.
Some state-of-the-art methods for protein remote homology detection were selected for comparison with the proposed SVM-Ensemble. SVM-Pairwise [5] represents each protein as a vector of pairwise similarities to all proteins in the training set. The kernel of SVM-LA [26] measures the similarity between a pair of proteins by taking into account all the optimal local alignment scores with gaps between all possible subsequences. Mismatch kernel [28] is calculated based on occurrences of ($k$, $m$)-patterns in the data. Monomer-dist [47] constructs the feature vectors by the occurrences of short oligomers. SVM-DR is based on the distance-pairs; PseAACIndex is based on the pseudo amino acid composition (PseAAC). disPseAAC constructs the feature vectors by combining the [...] The results of the comparison are shown in Table 4. The SVM-Ensemble achieved the best performance, indicating that combining different predictors via an ensemble learning approach is an effective strategy.

Note (Table 3): the subscript indexes in ACC features and SC-PseAAC features mean hydrophobicity ($h_1$), hydrophilicity ($h_2$), and mass ($m$).

Conclusions
In this study, we have proposed an ensemble classifier for protein remote homology detection, called SVM-Ensemble. It was constructed by combining three basic classifiers with a weighted voting strategy. Experimental results on a widely used benchmark dataset showed that our method achieved a ROC score of 0.943, which is clearly better than the three basic predictors SVM-Kmer, SVM-ACC, and SVM-SC-PseAAC. Compared with the other state-of-the-art methods, SVM-Ensemble achieved the best performance. Furthermore, analysis of the discriminative power of these features revealed some interesting patterns.

Competing Interests
The authors declare that they have no competing interests.

Authors' Contributions
Bingquan Liu conceived of the study, designed the experiments, and participated in drafting the paper and performing the statistical analysis. Junjie Chen participated in coding the experiments and drafting the paper. Dong Huang participated in performing the statistical analysis. All authors read and approved the final paper.