HMMBinder: DNA-Binding Protein Prediction Using HMM Profile Based Features
BioMed Research International

DNA-binding proteins often play an important role in various processes within the cell. Over the last decade, a wide range of classification algorithms and feature extraction techniques have been used to predict them computationally. In this paper, we propose a novel DNA-binding protein prediction method called HMMBinder. HMMBinder uses monogram and bigram features extracted from the HMM profiles of the protein sequences. To the best of our knowledge, this is the first application of HMM profile based features to the DNA-binding protein prediction problem. We applied Support Vector Machines (SVM) as the classification technique in HMMBinder. Our method was tested on standard benchmark datasets. We experimentally show that our method outperforms the state-of-the-art methods found in the literature.


Introduction
DNA-binding proteins play a vital role in various cellular processes. They are essential in transcriptional regulation, recombination, genome rearrangements, replication, repair, and DNA modification [1]. DNA-binding proteins are proteins that bind to DNA in both eukaryotes and prokaryotes while acting as activators or repressors. It has been observed that only 2-3% of prokaryotic proteins and 4-5% of eukaryotic proteins can bind to DNA [2,3]. A wide variety of experimental methods are used to identify DNA-binding proteins, such as in vitro methods [4,5] like filter binding assays, chromatin immunoprecipitation on microarrays (ChIP-chip), genetic analysis, and X-ray crystallography. However, these methods have proven to be expensive and time consuming. Therefore, there is a growing demand for a fast and cost-effective computational method to solve this problem.
Most of the computational methods used in the literature to predict DNA-binding proteins formulate the problem as a supervised learning problem. In practice, the number of known DNA-binding proteins is very small compared to the large number of non-DNA-binding and unknown proteins. DNA-binding protein prediction is often modeled as a binary classification problem: given a protein sequence as input, the task is to predict whether the protein is DNA-binding or not. Note that the challenge here is to select a proper dataset for training and testing that accounts for this imbalance. Many supervised learning algorithms have been used in the literature to solve the problem. Well-regarded examples include Artificial Neural Networks (ANN) [6], Support Vector Machines (SVM) [7,8], ensemble methods [9], the Naive Bayes classifier [10], Random Forest [11], Convolutional Neural Networks [12], Logistic Regression [13], and the AdaBoost classifier [5]. SVM is one of the best performing classifiers used for DNA-binding protein identification [7,8,14,15].
In this paper, we propose HMMBinder, a novel DNA-binding protein prediction tool using HMM profile based features of a protein sequence. Our method uses monogram and bigram features derived from the HMM profile, which prove more effective than PSSM or sequence based features. We use SVM as the classifier and standard benchmark datasets to test our method. Using the standard evaluation metrics, our method significantly improves over the state-of-the-art methods and the features used in the literature. We also developed a web server that is publicly available at http://brl.uiu.ac.bd/HMMBinder. The rest of the paper is organized following the general 5-step guideline suggested in [29] for protein attribute prediction. First, the benchmark datasets selected for this problem are described, followed by a description of the protein representation by extraction of features. Then we describe the classification algorithm that we selected for our approach, followed by the performance evaluation techniques deployed in this paper. Lastly, we describe the web server that we developed for this problem. The results section presents the details of the experimental results, followed by an analytical discussion. The paper concludes with a summary and an indication of future work.

Methods and Materials
In this section, we provide the details of the materials and methods of this paper. Figure 1 provides a system diagram of our proposed method. For the training phase, all the protein sequences are fed to HHBlits [30], a sequence-to-sequence alignment software, using the latest UniProt database. HHBlits produces an HMM file as output, which is then used by our feature extraction method to generate monogram and bigram features. The monogram and bigram features are concatenated and used as the training feature set to train the classifier. We use SVM with a linear kernel as the classification algorithm, and the trained model is stored for the testing phase. The testing phase is similar to the training phase; however, the labels for the test dataset are not given to the classifier. This stored model is also used for the web server implementation of HMMBinder.

Datasets.
Selection of benchmark datasets is essential in classification and prediction design. In this paper we use a popular benchmark dataset called benchmark1075 to train our model. We then test the performance using cross validation and on a separate independent test set known as the independent186 dataset. This section provides a brief overview of these two datasets. Both are widely used in the DNA-binding protein prediction literature [8,14,18,20,31].

Dataset Benchmark1075.
This dataset was first introduced in [14]. It consists of 1075 protein sequences, of which 525 are DNA-binding and 550 are non-DNA-binding. All the protein sequences were taken from PDB [32]. This dataset is one of the largest DNA-binding protein prediction datasets and is thus suitable for training purposes.

Dataset Independent186.
Lou et al. [17] constructed this independent dataset, which consists of 93 DNA-binding and 93 non-DNA-binding protein sequences. They used BLASTCLUST [33] on the benchmark dataset to remove the sequences that have more than 25% similarity.

Feature Extraction.
The training dataset $S$ used for a binary classification problem consists of two types of instances, positive and negative. Formally,

$$S = S^{+} \cup S^{-}$$

Next, the task is to represent each protein instance as a feature vector suitable for training:

$$P = [f_1, f_2, \ldots, f_n]$$

Here, a protein $P \in S$ is shown as a feature vector with dimension $n$. Most of the methods in the literature of DNA-binding protein prediction use either sequence and PSSM profile based features or structure based features. To the best of our knowledge, there has been no application of features using HMM profiles. In this paper, we have used HHBlits [30] to generate HMM profiles. HMM profiles are comparatively more effective [30,34] for remote homology detection. HMM profiles were generated using four iterations of HHBlits with a cutoff value set to 0.001 using the latest UniProt database [35]. The HMM profile produced by HHBlits is an $L \times 20$ matrix, where $L$ is the length of the protein sequence. The 20 values in each row are the substitution probabilities of each type of amino-acid residue at that position along the protein sequence. These values are first converted to linear probabilities using the following formula, where $N$ is the value stored in the profile:

$$p = 2^{-N/1000}$$

We generated two types of features, monogram and bigram, using the generated HMM profile matrix, noted here as $H$. We provide a brief description of monogram and bigram features extracted from the HMM profile matrix.
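The conversion to linear probabilities can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the HH-suite convention that each profile field stores the integer $-1000 \cdot \log_2(p)$ and that an asterisk marks a probability of zero. The function names are hypothetical.

```python
def hhm_value_to_probability(token):
    """Convert one profile field (a string) to a linear probability.

    Assumption: the field holds -1000 * log2(p) as an integer,
    and '*' stands for a probability of zero.
    """
    if token == "*":
        return 0.0
    return 2.0 ** (-int(token) / 1000.0)


def convert_row(tokens):
    """Convert all fields of one profile row."""
    return [hhm_value_to_probability(t) for t in tokens]


# Toy row with four fields (a real row has 20, one per amino acid).
row = convert_row(["0", "1000", "2000", "*"])
print(row)
```

A stored value of 1000 therefore maps to a probability of 0.5, and larger stored values map to smaller probabilities.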

Monogram Features.
Monogram features [36] are calculated by taking the normalized sum of the column-wise substitution probability values. The size of this feature group is 20 because there are 20 different amino acids. The feature can be defined formally as follows:

$$M(j) = \frac{1}{L} \sum_{i=1}^{L} h_{ij}$$

Note that the values of $j$ depend on the columns; that is, $1 \le j \le 20$. Here, $h_{ij}$ is the value in the $i$th row and $j$th column of the $L \times 20$ HMM profile matrix $H$. We denote monogram features as $M$, which is a vector of the form $M = [M(1), M(2), \ldots, M(20)]$.
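As a minimal sketch of the monogram computation, the function below averages each column of a profile given as a list of rows. Dividing by the sequence length is an assumption about what "normalized" means here, and `monogram_features` is a hypothetical name. The toy profile has two columns instead of 20 to keep the arithmetic visible.

```python
def monogram_features(profile):
    """Return one feature per column: the column sum divided by the
    number of rows (assumed normalization by sequence length)."""
    length = len(profile)
    n_cols = len(profile[0])
    return [sum(row[j] for row in profile) / length for j in range(n_cols)]


# Toy profile: 2 positions (rows) x 2 columns of linear probabilities.
toy_profile = [
    [0.5, 0.5],
    [1.0, 0.0],
]
mono = monogram_features(toy_profile)
print(mono)
```

For a real profile with 20 columns, the same function returns a vector of size 20.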

Bigram Features.
Bigram features have been successfully used in the literature for protein attribute prediction [37]. Bigram features are normalized bigrams taken for all pairs of columns. Hence the total number of features generated from this group is 400. Bigram features are generated using the following formula:

$$B(k, l) = \frac{1}{L} \sum_{i=1}^{L-1} h_{i,k} \, h_{i+1,l}$$

Here $k$ and $l$ denote the column pair for which the bigram is calculated and are in the ranges $1 \le k \le 20$ and $1 \le l \le 20$; $h_{i,k}$ is the value in the $i$th row and $k$th column of the $L \times 20$ HMM profile matrix. We denote this feature vector as $B$, where $B$ has the form $B = [B(1,1), B(1,2), \ldots, B(1,20), B(2,1), \ldots, B(20,20)]$.
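A minimal sketch of the bigram computation follows. It assumes that a bigram for a column pair multiplies the value of the first column at one position by the value of the second column at the next position, sums over consecutive positions, and divides by the sequence length. Both the normalization and the name `bigram_features` are assumptions made for illustration.

```python
def bigram_features(profile):
    """Return one feature per ordered column pair (k, l), in row-major
    order: (1,1), (1,2), ..., so 20 columns give 400 features."""
    length = len(profile)
    n_cols = len(profile[0])
    features = []
    for k in range(n_cols):
        for l in range(n_cols):
            # Pair column k at position i with column l at position i + 1.
            total = sum(
                profile[i][k] * profile[i + 1][l] for i in range(length - 1)
            )
            features.append(total / length)  # assumed normalization by L
    return features


# Toy profile: 3 positions x 2 columns, so 2 * 2 = 4 bigram features.
toy_profile = [
    [0.5, 0.5],
    [1.0, 0.0],
    [0.0, 1.0],
]
bi = bigram_features(toy_profile)
print(bi)
```

With 20 columns the nested loops produce the 400 values in the order listed in the text.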
We also generate Position Specific Scoring Matrix (PSSM) profiles for each of the protein sequences using PSI-BLAST [38]. PSSMs were generated using three iterations of PSI-BLAST on the nr database with a cutoff value of 0.001. PSSM profiles have a similar form to HMM profiles: a matrix of the same dimension in which each value denotes a substitution probability. We generate monogram and bigram features from PSSM files as well. These PSSM based monogram and bigram features are widely used in the literature [36,37,39-42]. Note that all the monogram features are vectors of size 20 and all the bigram features are vectors of size 400. We have also used a combination of the monogram and bigram features, which is a vector of size 420.

Support Vector Machine.
We have used Support Vector Machines (SVM) as our classification technique. SVM has been used successfully in protein attribute prediction in general [28,39,43] and in DNA-binding protein prediction in particular [7,8]. SVM is a maximum margin classifier that attempts to learn a hyperplane from the training samples that separates the positive and negative data points in a binary classification problem. The hyperplane that is selected is the one for which the separation width, or margin, is maximum, and the nature of the hyperplane depends on the kernel function used. SVM generally tries to optimize a multiplier function of the following form, where $x_i$ are the $m$ training samples, $y_i \in \{-1, +1\}$ are their labels, and $\alpha_i$ are the multipliers:

$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

The prediction of an SVM classifier is defined as follows:

$$F(x) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b \right)$$

Here the transformation of the data points by the function $K$ could be linear, polynomial, or any other kernel function. In this paper, we explored linear and radial basis function (RBF) kernels. The linear kernel is of the following form:

$$K(x_i, x_j) = (x_i \cdot x_j)^{d}$$

Here $d = 1$ for the linear kernels. RBF kernels follow the following definition:

$$K(x_i, x_j) = \exp\left( -\gamma \, \lVert x_i - x_j \rVert^{2} \right)$$

Often slack variables are used along with the maximum margin SVM classifier to allow generalization error depending on a parameter $C$.
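To make the roles of the kernel and the learned multipliers concrete, here is a minimal pure-Python sketch of the two kernels and of a kernel decision function of the usual form (a weighted sum of kernel values plus a bias, with the sign giving the class). It does not train anything: the support vectors, multipliers, and bias are hand-picked toy values, and all names and the `gamma` default are assumptions for illustration. In practice a library such as scikit-learn performs the optimization.

```python
import math


def linear_kernel(x, z):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * squared Euclidean distance); gamma is a free parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, labels, alphas, bias, kernel):
    """Weighted sum of kernel values between x and each support vector."""
    return sum(
        a * y * kernel(sv, x)
        for sv, y, a in zip(support_vectors, labels, alphas)
    ) + bias


def predict(x, support_vectors, labels, alphas, bias, kernel):
    """Return +1 or -1 depending on the sign of the decision value."""
    value = decision_value(x, support_vectors, labels, alphas, bias, kernel)
    return 1 if value >= 0 else -1


# Hand-picked toy model: one positive and one negative support vector.
svs = [[1.0, 1.0], [-1.0, -1.0]]
ys = [1, -1]
alphas = [0.25, 0.25]
bias = 0.0

print(decision_value([1.0, 1.0], svs, ys, alphas, bias, linear_kernel))
print(predict([2.0, 0.5], svs, ys, alphas, bias, linear_kernel))
print(predict([-0.5, -2.0], svs, ys, alphas, bias, rbf_kernel))
```

Swapping `linear_kernel` for `rbf_kernel` changes only how similarity to the support vectors is measured; the decision rule stays the same.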

Performance Evaluation.
A good number of effective evaluation metrics have been suggested for use in single valued and multivalued classification and prediction [29,44]. In the literature of DNA-binding protein prediction, we have found that the most widely used metrics are accuracy, sensitivity, specificity, MCC, auROC, and auPR values. In this section, we first provide a description of these evaluation metrics used in this paper.
The first measure, accuracy, is the ratio or percentage of correctly classified negative or positive instances among a given number of protein instances:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Here TP is the total number of true positives, or correctly classified positive samples, and TN is the number of correctly classified negative samples. FP is the number of negative instances incorrectly classified as positive, and FN is the number of positive instances incorrectly classified as negative. Sensitivity is the true positive rate, or the ratio of true positives to the total number of positive examples. Sensitivity is defined in the following equation:

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

Specificity, on the other hand, is the true negative rate and can be defined by the following equation:

$$\text{Specificity} = \frac{TN}{TN + FP}$$
Note that, for probabilistic outputs, all these metrics depend on the threshold set for the classifier. Two other metrics that do not depend on a threshold are the area under the receiver operating characteristic curve (auROC) and the area under the precision-recall curve (auPR). Both auROC and auPR have a maximum value of 1 for a perfect classifier. The ROC curve plots the true positive rate against the false positive rate at different threshold values, and the precision-recall curve plots precision against recall.
To reduce training bias, several sampling methods have been proposed in the literature [45] and are widely used for protein attribute prediction [29]. In this paper, we have used 10-fold cross validation and jack-knife tests, which are widely used in the literature of DNA-binding protein prediction [8,11,14,17].
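One way to see why auROC does not depend on a threshold is its pairwise interpretation: it equals the fraction of positive-negative pairs in which the positive instance receives the higher score, with ties counted as half. The sketch below uses that equivalence rather than integrating the curve explicitly. The function name and the scores are hypothetical, and auPR is not covered.

```python
def au_roc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for positive, 0 for negative. scores: higher means
    more likely positive. Runs in O(P * N), which is fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores: one positive (0.4) ranks below one negative (0.5).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = au_roc(labels, scores)
print(auc)
```

Three of the four positive-negative pairs are ordered correctly in this toy example, so no single threshold choice affects the result.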
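The two sampling schemes differ only in the number of folds: the jack-knife test holds out one sample at a time, which is the same as k-fold cross validation with k equal to the number of samples. The sketch below generates index splits only. It is an illustration with hypothetical function names; a real experiment would typically shuffle or stratify the samples first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Fold f holds out every sample whose index is congruent to f modulo k.
    """
    for fold in range(k):
        test = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test


def jackknife_indices(n_samples):
    """Leave-one-out splits: k-fold with one sample per fold."""
    return k_fold_indices(n_samples, n_samples)


# 10-fold split of a dataset with 1075 samples.
folds = list(k_fold_indices(1075, 10))
print([len(test) for _, test in folds])

# Jack-knife on a tiny dataset of 5 samples.
loo = list(jackknife_indices(5))
print([test for _, test in loo])
```

Each sample appears in exactly one test fold, so every protein is predicted once by a model that did not see it during training.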

Results and Discussion
In this section, we present the results of the experiments that were carried out in this study. All the methods were implemented in the Python 3.4 programming language. The scikit-learn library [46] for Python was used for implementing the machine learning algorithms. All experiments were conducted on computing services provided by CITS, United International University. The best results were found using the combination of HMM-Monogram and Bigram features, and 82.87% accuracy was achieved using SVM with a linear kernel. In each case of the SVM linear kernel, HMM based features achieved better accuracy than PSSM based features. Similar results can be seen for auROC, MCC, and sensitivity. Specificity, auROC, and auPR are slightly improved in the experiments using SVM with RBF kernels. We also show the ROC curves for each of these experiments in Figures 2, 3, and 4.

Comparison with Other Methods.
We have compared the performance of HMMBinder with several previous methods and tools used for DNA-binding protein prediction on the benchmark dataset benchmark1075. They are DNABinder [7], DNA-Prot [16], iDNA-Prot [11], iDNA-Prot|dis [14], DBP-Pred [17], iDNAPro-PseAAC [8], PseDNA-Pro [18], Kmer1 + ACC [19], and Local-DPP [20]. The results reported in this paper for these methods are taken from [8,20]. The comparisons were made in terms of accuracy, sensitivity, specificity, MCC, and auROC. To make a fair comparison with the other methods, we performed the jack-knife test as done in earlier studies, and the results are reported in Table 2.
The best values in Table 2 are shown in bold. The results show an improvement in accuracy of more than 7% over the previous best method, Local-DPP [20]. Similar improvements were found in the other metrics too. In particular, MCC increased by 22% compared to the previous best method.
We further tested the effectiveness of HMMBinder on the independent test set. These results are shown in Table 3. Here our results are not the best but are among the best. In terms of accuracy, our results are close to those of iDNAPro-PseAAC [8], whose results were significant on the benchmark dataset and similar to ours on the independent dataset. The specificity of HMMBinder was among the best, second only to DNA-Threader, which performed poorly in terms of accuracy. Considering the difficulty of the independent dataset, we believe that our method has not been overtrained on the benchmark dataset, that its performance is promising, and that it generalizes beyond the training data. Based on these results, we decided to build the web application on the model trained on the benchmark dataset.
Note that the results on the independent dataset are comparable to, but not better than, those of the state-of-the-art methods. The main focus of this research was to build a classifier based on HMM profiles instead of PSSM profile based features, and we experimentally showed the effectiveness of the HMM profile based features over PSSM. In the future, we aim to improve performance on the independent dataset.
Additionally, we would like to highlight two points. Firstly, the datasets that we used were filtered using BLASTCLUST. It is important to remove sequences with more than 25% similarity from the dataset before applying the training and testing methods. We used the dataset proposed by Lou et al. [17], a widely accepted standard independent test dataset from which the sequences with similarity of 25% or more to other sequences had been removed. We believe it would be interesting to see the effects of another heuristic, CLUSTALW [47]. Secondly, feature selection methods are gaining popularity for bioinformatics data and supervised machine learning. We believe that using sophisticated feature selection methods, such as maximum relevance minimum redundancy (mRMR) [48] and maximum relevance maximum distance (MRMD) [49], could improve the results further.

Web Server Implementation.
We have implemented a web based application based on the proposed method, which we call HMMBinder. It is readily available at http://brl.uiu.ac.bd/HMMBinder. The server was implemented using the PHP web programming language for the front end and a Python based prediction engine at the back end. The software requires as input an HMM profile, which can be generated by HHBlits. The features are extracted automatically by the Python program, and the predicted value from a trained model is shown in the web form. The web site contains a "read me" guide and the necessary information required to run the application.

Conclusion
In this paper, we have introduced HMMBinder, an HMM profile based method for the DNA-binding protein prediction problem. We have used monogram and bigram features extracted from the HMM profiles generated by HHBlits and an SVM classification algorithm to train our model on a standard benchmark dataset. Our method makes a considerable improvement over the other state-of-the-art methods on this dataset and performs comparably on the independent dataset. We have also established a web based application for our method that is trained on the benchmark dataset. In the future, we wish to extract more effective features and generate a larger dataset to train our model in order to improve the results on the independent dataset. We believe there is scope for improvement.