DNA-binding proteins often play an important role in various processes within the cell. Over the last decade, a wide range of classification algorithms and feature extraction techniques have been used to predict them. In this paper, we propose a novel DNA-binding protein prediction method called HMMBinder. HMMBinder uses monogram and bigram features extracted from the HMM profiles of the protein sequences. To the best of our knowledge, this is the first application of HMM profile based features to the DNA-binding protein prediction problem. We applied Support Vector Machines (SVM) as the classification technique in HMMBinder. Our method was tested on standard benchmark datasets. We experimentally show that our method outperforms the state-of-the-art methods found in the literature.
DNA-binding proteins play a vital role in various cellular processes. They are essential in transcriptional regulation, recombination, genome rearrangements, replication, repair, and DNA modification [
Most of the computational methods used in the literature to predict DNA-binding proteins formulate the task as a supervised learning problem. In practice, the number of known DNA-binding proteins is very small compared to the large number of non-DNA-binding and unknown proteins. DNA-binding protein prediction is often modeled as a binary classification problem: given a protein sequence as input, the task is to predict whether the protein is DNA-binding or not. The challenge here is to select a proper dataset for training and testing that accounts for this class imbalance. Many supervised learning algorithms have been used in the literature to solve the problem. Among them, Artificial Neural Networks (ANN) [
A large number of web based tools and methods have been developed for DNA-binding protein prediction and are available for use. Several of them are worth mentioning here: DNABinder [
In this paper, we propose HMMBinder, a novel DNA-binding protein prediction tool using HMM profile based features of a protein sequence. Our method uses monogram and bigram features derived from the HMM profile, which prove more effective than PSSM or sequence based features. We use SVM as the classifier and standard benchmark datasets to test our method. Under the standard evaluation metrics, our method improves over the state-of-the-art methods and the features used in the literature on the benchmark dataset. We also developed a web server that is publicly available at
The rest of the paper is organized following the general 5-step guideline suggested in [
In this section, we provide the details of the materials and methods of this paper. Figure
System diagram of HMMBinder.
Selection of benchmark datasets is essential in classification and prediction design. In this paper we use a popular benchmark dataset called
This dataset was first introduced in [
Lou et al. [
The training dataset
Here, a protein,
We generated two types of features, monogram and bigram, using the generated HMM profile matrix noted here as
Monogram features [
Note that values of
Bigram features have been successfully used in the literature for protein attribute prediction [
Here
We also generate Position Specific Scoring Matrix (PSSM) profiles for each of the protein sequences using PSI-BLAST [
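The monogram and bigram features described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the profile (HMM or PSSM) has already been parsed into an L x 20 matrix with one row per residue and one column per amino acid, and it assumes the commonly used definitions (column means for monograms, averaged products of adjacent rows for bigrams, both normalised by the sequence length L). The function names and the random toy profile are hypothetical.

```python
import numpy as np


def monogram(profile):
    """Column-wise mean of an L x 20 profile matrix (20 features)."""
    profile = np.asarray(profile, dtype=float)
    return profile.mean(axis=0)


def bigram(profile):
    """Averaged products of consecutive rows (20 x 20 = 400 features)."""
    profile = np.asarray(profile, dtype=float)
    length = profile.shape[0]
    # pair_sums[k, l] = sum_i profile[i, k] * profile[i + 1, l]
    pair_sums = profile[:-1].T @ profile[1:]
    # Normalising by L is an assumption made for this sketch.
    return (pair_sums / length).ravel()


def mono_bigram(profile):
    """Concatenate both feature groups (420 features)."""
    return np.concatenate([monogram(profile), bigram(profile)])


# Synthetic stand-in for a parsed profile of a 50-residue protein.
rng = np.random.default_rng(0)
toy_profile = rng.random((50, 20))
features = mono_bigram(toy_profile)
print(features.shape)  # (420,)
```

Because both feature groups have a fixed length regardless of L, proteins of different lengths map to vectors of the same dimension, which is what a classifier such as SVM requires.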
We have used Support Vector Machines (SVM) as our classification technique. SVM has been successfully used in protein attribute prediction in general [
The prediction of an SVM classifier is defined as follows:
Here the transformation of the data points by the function
Here
Often slack variables are used along with the maximum margin SVM classifier to allow generalization error depending on a parameter
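As a rough illustration of the kernel SVM described above, the sketch below trains an RBF-kernel classifier with scikit-learn and checks that its decision value equals the usual kernel expansion over the support vectors plus a bias term. The data are synthetic, and the values of `C` (the slack penalty in scikit-learn's parameterisation) and `gamma` are arbitrary choices for the example, not the settings used in this paper.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for profile-derived feature vectors (two classes).
X = np.vstack([rng.normal(0.0, 1.0, (40, 20)), rng.normal(1.0, 1.0, (40, 20))])
y = np.array([0] * 40 + [1] * 40)

# C penalises slack (margin violations); gamma sets the RBF kernel width.
gamma = 0.05
clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

# Kernel expansion: f(x) = sum_i (alpha_i * y_i) * K(x_i, x) + b,
# where the sum runs over support vectors only.
# dual_coef_ already stores the products alpha_i * y_i.
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)
manual = (clf.dual_coef_ @ K).ravel() + clf.intercept_[0]

print(np.allclose(manual, clf.decision_function(X)))
predicted = (manual > 0).astype(int)  # the sign of f(x) selects the class
```

Replacing `kernel="rbf"` with `kernel="linear"` gives the linear-kernel variant, in which case the kernel is a plain dot product.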
A good number of effective evaluation metrics have been suggested for use in single valued and multivalued classification and prediction [
Note that, for probabilistic outputs, all these metrics depend on the threshold set for the classifier. Two other metrics that do not depend on a threshold are the area under the receiver operating characteristic curve (auROC) and the area under the precision-recall curve (auPR). Both auROC and auPR have a maximum value of 1 for a perfect classifier. The ROC curve plots the true positive rate against the false positive rate at different threshold values, and the precision-recall curve plots precision against recall.
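The threshold-dependent metrics and auROC can be sketched in plain Python as follows. The sketch uses the standard textbook definitions, which are assumed to match the ones intended here; the function names, the toy labels and scores, and the default threshold of 0.5 are hypothetical. It also assumes that both classes are present in the labels.

```python
import math


def threshold_metrics(y_true, scores, threshold=0.5):
    """Accuracy, sensitivity, specificity and MCC at a fixed threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def au_roc(y_true, scores):
    """auROC as the probability that a positive outranks a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
metrics = threshold_metrics(y_true, scores)
auc = au_roc(y_true, scores)
print(metrics, auc)
```

Changing `threshold` alters the first four values but leaves `au_roc` unchanged, which is the distinction drawn in the paragraph above.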
To reduce the training bias, several sampling methods have been proposed in the literature [
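The two resampling schemes used in the experiments of this paper, 10-fold cross validation and the jack-knife (leave-one-out) test, can be sketched with scikit-learn as below. The data are synthetic, and the stratified, shuffled folds and the fixed random seed are assumptions made for this example rather than details taken from the paper.

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Synthetic, well-separated two-class data standing in for real features.
X = np.vstack([rng.normal(0.0, 1.0, (30, 20)), rng.normal(1.5, 1.0, (30, 20))])
y = np.array([0] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1.0)

schemes = {
    # Ten folds, each holding out about 10% of the samples.
    "10-fold": StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    # Jack-knife: every sample is held out exactly once.
    "jack-knife": LeaveOneOut(),
}

results = {}
for name, cv in schemes.items():
    # Each prediction comes from a model that never saw that sample.
    y_pred = cross_val_predict(clf, X, y, cv=cv)
    results[name] = (accuracy_score(y, y_pred), matthews_corrcoef(y, y_pred))
    print(name, results[name])
```

The jack-knife test trains one model per sample, so it is far more expensive than 10-fold cross validation, but it is deterministic for a given dataset.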
In this section, we present the results of the experiments that were carried out in this study. All the methods were implemented in the Python 3.4 programming language. The Scikit-learn library [
We ran a number of experiments to test the effectiveness of the HMM profile based features on the benchmark dataset. We extracted six groups of features for this experiment: PSSM-Monogram, PSSM-Bigram, PSSM-Mono + Bigram, HMM-Monogram, HMM-Bigram, and HMM-Mono + Bigram. Each of these feature sets was tested with SVM classifiers using linear and RBF kernels. We further tested the performance of these features using two ensemble classifiers: Random Forest and AdaBoost. All of these experiments used 10-fold cross validation. The results are reported in terms of accuracy, sensitivity, specificity, auPR, auROC, and MCC. Only the average of these values is reported in Table
Comparison of performances of different features and SVM kernels on the benchmark dataset using 10-fold cross validation.
Features | Accuracy | Sensitivity | Specificity | auPR | MCC | auROC |
---|---|---|---|---|---|---|
| | | | | | | |
HMM-Monogram | 76.77% | | 0.6976 | 0.6931 | 0.5367 | 0.8358 |
PSSM-Monogram | 74.74% | 0.6636 | 0.8362 | 0.8368 | 0.5040 | 0.8105 |
HMM-Bigram | 70.59% | 0.7071 | 0.7049 | 0.7060 | 0.4095 | 0.7511 |
PSSM-Bigram | 62.20% | 0.6454 | 0.5973 | 0.6025 | 0.2502 | 0.6703 |
HMM (Mono + Bi) | | 0.8150 | | | | |
PSSM (Mono + Bi) | 72.40% | 0.7364 | 0.7120 | 0.7136 | 0.4486 | 0.8028 |
| | | | | | | |
HMM-Monogram | | | 0.7559 | 0.7535 | | |
PSSM-Monogram | 73.71% | 0.6890 | 0.7880 | 0.7903 | 0.4771 | 0.8121 |
HMM-Bigram | 76.68% | 0.7052 | | 0.8253 | 0.5283 | 0.8318 |
PSSM-Bigram | 74.92% | 0.7490 | 0.7495 | 0.7516 | 0.4966 | 0.8166 |
HMM (Mono + Bi) | 77.43% | 0.7129 | 0.8324 | | 0.5440 | 0.8496 |
PSSM (Mono + Bi) | 72.40% | 0.7363 | 0.7120 | 0.7136 | 0.4486 | 0.8028 |
| | | | | | | |
HMM-Monogram | | | | | | 0.8243 |
PSSM-Monogram | 66.14% | 0.7290 | 0.5895 | 0.5862 | 0.3173 | 0.7332 |
HMM-Bigram | 72.19% | 0.7553 | 0.6903 | 0.6880 | 0.4400 | |
PSSM-Bigram | 71.00% | 0.7854 | 0.6300 | 0.6305 | 0.4174 | 0.7833 |
HMM (Mono + Bi) | 74.43% | | | 0.6931 | | 0.8218 |
PSSM (Mono + Bi) | 72.68% | 0.7909 | 0.6589 | 0.6645 | 0.4557 | 0.7698 |
| | | | | | | |
HMM-Monogram | 73.31% | 0.7013 | 0.7632 | 0.7603 | 0.4579 | 0.8026 |
PSSM-Monogram | 67.07% | 0.7654 | 0.5703 | 0.5737 | 0.3448 | 0.7157 |
HMM-Bigram | 73.97% | 0.7360 | 0.7432 | 0.7396 | 0.4762 | 0.8063 |
PSSM-Bigram | 70.53% | 0.7436 | 0.6647 | 0.6708 | 0.4116 | 0.7710 |
HMM (Mono + Bi) | | | | | | |
PSSM (Mono + Bi) | 70.07% | 0.7327 | 0.6666 | 0.6687 | 0.4005 | 0.7887 |
Using monogram features. Receiver operating characteristic curves for (a) SVM linear kernel classifier using HMM-Monogram features, (b) SVM linear kernel classifier using PSSM-Monogram features, (c) SVM RBF kernel classifier using HMM-Monogram features, and (d) SVM RBF kernel classifier using PSSM-Monogram features.
Using bigram features. Receiver operating characteristic curves for (a) SVM linear kernel classifier using HMM-Bigram features, (b) SVM linear kernel classifier using PSSM-Bigram features, (c) SVM RBF kernel classifier using HMM-Bigram features, and (d) SVM RBF kernel classifier using PSSM-Bigram features.
Using (Mono + Bi)gram features. Receiver operating characteristic curves for (a) SVM linear kernel classifier using HMM-Mono + Bigram features, (b) SVM linear kernel classifier using PSSM-Mono + Bigram features, (c) SVM RBF kernel classifier using HMM-Mono + Bigram features, and (d) SVM RBF kernel classifier using PSSM-Mono + Bigram features.
We have compared the performance of HMMBinder with several previous methods and tools used for DNA-binding protein prediction on the benchmark dataset
Comparison of performance of the proposed method with other state-of-the-art predictors using jack-knife test on the benchmark dataset.
Method | Accuracy | Sensitivity | Specificity | MCC | auROC |
---|---|---|---|---|---|
iDNAPro-PseAAC | 76.76% | 0.7562 | 0.7745 | 0.53 | 0.8392 |
DNABinder (dimension 21) | 73.95% | 0.6857 | 0.7909 | 0.48 | 0.8140 |
DNABinder (dimension 400) | 73.58% | 0.6647 | 0.8036 | 0.47 | 0.8150 |
DNA-Prot | 72.55% | 0.8267 | 0.5976 | 0.44 | 0.7890 |
iDNA-Prot | 75.40% | 0.8381 | 0.6473 | 0.50 | 0.7610 |
iDNA-Prot\|dis | 77.30% | 0.7940 | 0.7527 | 0.54 | 0.8310 |
PseDNA-Pro | 76.55% | 0.7961 | 0.7363 | 0.53 | — |
Kmer1 + ACC | 75.23% | 0.7676 | 0.7376 | 0.50 | 0.8280 |
Local-DPP | 79.20% | 0.8400 | 0.7450 | 0.59 | — |
HMMBinder | | | | | |
The best values in Table
We further tested the effectiveness of HMMBinder on the independent test set. These results are shown in Table
Comparison of performance of the proposed method with other state-of-the-art predictors on the independent dataset.
Method | Accuracy | Sensitivity | Specificity | MCC | auROC |
---|---|---|---|---|---|
iDNAPro-PseAAC | 69.89% | 0.7741 | 0.6237 | 0.402 | 0.7754 |
iDNA-Prot | 67.20% | 0.6770 | 0.6670 | 0.344 | — |
DNA-Prot | 61.80% | 0.6990 | 0.5380 | 0.240 | — |
DNABinder | 60.80% | 0.5700 | 0.6450 | 0.216 | 0.6070 |
DNABIND | 67.70% | 0.6670 | 0.6880 | 0.355 | 0.6940 |
DNA-Threader | 59.70% | 0.2370 | | 0.279 | — |
DBPPred | 76.90% | 0.7960 | 0.7420 | 0.538 | 0.7910 |
iDNA-Prot\|dis | 72.00% | 0.7950 | 0.6450 | 0.445 | |
Kmer1 + ACC | 70.96% | 0.8279 | 0.5913 | 0.431 | 0.7520 |
Local-DPP | | | 0.6560 | | — |
HMMBinder | 69.02% | 0.6153 | 0.7634 | 0.394 | 0.6324 |
Note that the results on the independent dataset are comparable to, but not better than, those of the state-of-the-art methods. The main focus of this research was to build a classifier based on HMM profile features instead of PSSM profile features, and we experimentally showed the effectiveness of the HMM profile based features over PSSM. In the future, we aim to improve performance on the independent dataset.
Additionally, we would like to highlight two points. Firstly, the datasets that we used were filtered using BLASTCLUST. It is important to remove sequences with more than 25% similarity from the dataset before applying the training and testing methods. We used the dataset proposed by Lou et al. [
We have implemented a web application based on the proposed method, which we also call HMMBinder. It is readily available at
In this paper, we have introduced HMMBinder, an HMM profile based method for the DNA-binding protein prediction problem. We have used monogram and bigram features extracted from the HMM profiles generated by HHBlits, together with an SVM classification algorithm, to train our model on a standard benchmark dataset. Our method makes a considerable improvement over the other state-of-the-art methods on this dataset and performs comparably on the independent dataset. We have also established a web application for our method that is trained on the benchmark dataset. In the future, we wish to extract more effective features and generate a larger dataset to train our model in order to improve the results on the independent dataset. We believe there is scope for improvement.
The authors declare that there are no conflicts of interest regarding the publication of this article.