DBP-iDWT: Improving DNA-Binding Proteins Prediction Using Multi-Perspective Evolutionary Profile and Discrete Wavelet Transform

DNA-binding proteins (DBPs) perform crucial biological activities, including DNA replication, recombination, and transcription. DBPs are closely associated with chronic diseases and are used in the manufacture of antibiotics and steroids. A series of predictors have been established to identify DBPs; however, there is still room to improve identification accuracy. This research designed a novel predictor to identify DBPs more accurately. Features are extracted from the sequences by F-PSSM (filtered position-specific scoring matrix), PSSM-DPC (position-specific scoring matrix-dipeptide composition), and R-PSSM (reduced position-specific scoring matrix). To eliminate noisy attributes, we applied DWT (discrete wavelet transform) to F-PSSM, PSSM-DPC, and R-PSSM and introduced three novel descriptors, namely, F-PSSM-DWT, PSSM-DPC-DWT, and R-PSSM-DWT. Four models were then trained using LiXGB (light eXtreme gradient boosting), XGB (eXtreme gradient boosting), ERT (extremely randomized trees), and Adaboost. LiXGB with R-PSSM-DWT attained 6.55% higher accuracy on the training dataset and 5.93% higher accuracy on the testing dataset than the best existing predictors. The results show that our novel predictor outperforms past studies. DBP-iDWT would be fruitful for establishing more effective therapeutic strategies for fatal disease treatment.


Introduction
DNA-binding proteins perform many crucial activities such as DNA translation, repair, and damage [1]. DBPs are directly encoded by about 2-5% of prokaryotic and 6-7% of eukaryotic genomes [2]. Several DBPs are responsible for gene transcription and replication, and some DBPs shape the DNA into a specific structure called chromatin [3]. Research on DBPs is significant for the treatment of diverse fatal diseases and for drug production. For instance, nuclear receptors are the key targets of tamoxifen and bicalutamide, which are used in cancer treatment. Similarly, glucocorticoid receptors are the target of dexamethasone, which is used to treat autoimmune and inflammatory diseases, allergies, and asthma [4][5][6]. Furthermore, inhibitor of DNA-binding (ID) proteins are closely related to tumor-associated processes including chemoresistance, tumorigenesis, and angiogenesis. In addition, ID proteins are directly implicated in lung, cervical, and prostate cancers [7].
The number of protein sequences in online databases is growing rapidly. A series of predictors have been developed for diverse biological problems, including iRNA-PseTNC [8], iACP-GAEnsC [9], cACP-2LFS [10], DP-BINDER [11], Deep-AntiFP [12], cACP [13], iAtbP-Hyb-EnC [14], iAFPs-EnC-GA [15], and cACP-DeepGram [16]. Predicting DBPs by computational approaches is therefore in high demand. Several predictors have been introduced using primary sequence information and structural features. Structure-based predictors produce good prediction results, but structural features are not available for all proteins. Some of the structure-based protocols are iDBPs [17], DBD-Hunter [18], and Seq(DNA) [19]. Sequence-based systems rely only on sequence information and are more convenient and easier to apply to large datasets. Therefore, many sequence-based systems have been adopted for the identification of DNA-binding proteins. Among these methods are DBP-DeepCNN [20], DNA-Prot [21], iDNA-Prot [22], iDNA-Prot|dis [23], Kmer1 + ACC [24], Local-DPP [25], DBPPred-PDSD [26], DPP-PseAAC [27], and StackDPPred [28]. In addition, Li et al. extracted features with a convolutional neural network (CNN) and Bi-LSTM [29]. Zhao et al. analyzed the protein features with six methods and performed classification with XGBoost [30]. Each computational method has contributed to enhancing the prediction of DBPs. However, more effort is needed to improve the prediction of DBPs. Considering this, a new method (DBP-iDWT) is established to identify DBPs accurately. The contributions of our research are as follows. In addition to LiXGB, the feature set is fed into three classification algorithms, namely, ERT, XGB, and Adaboost. The efficacy of each classifier was assessed with a ten-fold test, while the generalization capability was assessed on a testing set. LiXGB using R-PSSM-DWT achieved higher prediction performance than past methods. The flowchart of DBP-iDWT is depicted in Figure 1.
The rest of the manuscript comprises three parts. Section 2 details the datasets and methodologies; Section 3 illustrates the performance of the classifiers; and Section 4 presents the conclusion.

Selection of Datasets.
We selected two datasets from previous work [31]. One dataset (PDB14189) is employed for model training and the other is used as a testing dataset. PDB14189 was collected from the UniProt database [32]. To build a standard dataset, sequences with more than 25% similarity were removed using the CD-HIT toolkit. The final training dataset comprises 7129 DBPs and 7060 non-DBPs. The independent set was retrieved by the procedure explained in reference [33]. Similar sequences were removed with a cutoff value of 25%. The final testing dataset contains 1153 DBPs and 1119 non-DBPs.

Feature Descriptors.
In this work, the patterns are discovered with PSSM-DPC-DWT, F-PSSM-DWT, and R-PSSM-DWT. These approaches are elaborated in the following parts.

Position-specific Scoring Matrix (PSSM).
Recently, evolutionary features have been successfully implemented and have improved the prediction results of many predictors [1,20]. We also used the PSSM to formulate evolutionary patterns. Each sequence is searched against the NCBI database using the PSI-BLAST program to align homologous sequences [34].
The PSSM can be denoted as follows: $\mathrm{PSSM} = [P_1, P_2, \ldots, P_j, \ldots, P_{20}]^{T}$, where $T$ and $P_{i,j}$ indicate the transpose operator and the score of the $j$-th type of amino acid at the $i$-th position of the query sequence, respectively.

Filtered Position-specific Scoring Matrix (F-PSSM).
The PSSM transforms evolutionary patterns into numerical form. It may contain negative scores, which can cause different sequences to produce similar feature vectors. To overcome this problem, F-PSSM filters out all negative scores in a preprocessing step. The details of the dimension formulation are provided in [35].
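The filtering step can be illustrated with a minimal sketch. It assumes that "filtering" means replacing every negative score with zero; the exact procedure is the one given in [35] and may differ. The function name `filter_pssm` and the toy matrix are illustrative only.

```python
def filter_pssm(pssm):
    """Return a copy of the PSSM with negative scores replaced by 0.

    Assumption: 'filtering' is taken to mean clamping negatives to zero;
    see [35] for the exact procedure.
    """
    return [[max(0, score) for score in row] for row in pssm]


# Toy 3 x 4 matrix (a real PSSM has 20 columns, one per amino acid).
toy = [[-2, 3, 0, -1],
       [4, -5, 1, 2],
       [0, 0, -3, 6]]
filtered = filter_pssm(toy)
```

The input matrix is left unchanged, so the raw PSSM remains available for the other descriptors.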

Position-specific Scoring Matrix-Dipeptide Composition (PSSM-DPC).
The local sequence-order patterns contain informative features, which are explored by incorporating DPC into the PSSM. DPC calculates the frequency of pairs of consecutive amino acids and produces a 400-dimensional vector [36].
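As a rough illustration of how a 400-dimensional vector arises, the sketch below uses a commonly used PSSM-DPC formulation: entry (i, j) is the average, over consecutive positions k, of the product of the score of amino acid i at position k and the score of amino acid j at position k + 1. This formulation is an assumption here, and the definition in [36] should be taken as authoritative. The name `pssm_dpc` and the synthetic matrix are illustrative.

```python
def pssm_dpc(pssm):
    """Compute a dipeptide-composition vector from an L x 20 PSSM.

    Entry (i, j) averages pssm[k][i] * pssm[k + 1][j] over consecutive
    positions k, giving 20 x 20 = 400 values, flattened row by row.
    """
    length = len(pssm)
    if length < 2:
        raise ValueError("need at least two sequence positions")
    n_cols = len(pssm[0])
    features = []
    for i in range(n_cols):
        for j in range(n_cols):
            total = sum(pssm[k][i] * pssm[k + 1][j] for k in range(length - 1))
            features.append(total / (length - 1))
    return features


# Synthetic PSSM with 5 positions and 20 columns (not real alignment scores).
toy = [[(r * 7 + c * 3) % 5 for c in range(20)] for r in range(5)]
vec = pssm_dpc(toy)
```

Because the vector length depends only on the number of columns, proteins of any length map to the same fixed dimension.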

Reduced Position Specific Scoring Matrix (R-PSSM).
It is believed that several similarities exist among the 20 unique amino acids. Based on these similarities, researchers have categorized the residues into groups. Li et al. [37] suggested a grouping of the residues according to specific residue properties. Using the Li et al. rule, the L × 20 PSSM is converted into an L × 10 matrix. If $r_1 r_2 r_3 \ldots r_L$ is a given protein sequence, then its reduced PSSM (R-PSSM), denoted RP, is an L × 10 matrix.
We obtain a 110-dimensional feature vector from RP.

Discrete Wavelet Transform.
To retain only salient information, compression approaches such as DWT are applied in many research areas. DWT is used for signal compression and denoising [38,39]. DWT divides a signal into low-frequency and high-frequency components [40]. The low-frequency components are more informative than the high-frequency components [41]. The low-frequency components are further split into low and high levels to obtain discriminative patterns. DWT is computed in terms of a scale variable m and a translation variable n, where X(m, n) is the transform coefficient. The low- and high-frequency components of a signal f(t) are computed by applying a low-pass filter L and a high-pass filter H to the discrete signal. To retain only the important features and eliminate the less informative and noisy patterns, DWT is applied to F-PSSM, PSSM-DPC, and R-PSSM, splitting each into low and high frequencies up to two levels. Finally, the novel feature descriptors PSSM-DPC-DWT, R-PSSM-DWT, and F-PSSM-DWT are constructed. The dimension of each feature set is 512 after applying DWT. Figure 2 depicts a schematic view of the two-level DWT.
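The two-level decomposition can be sketched as follows. The Haar wavelet is used purely for illustration, since the text does not name the wavelet family. How the coefficients are assembled into the final 512-dimensional vector is also not specified here, so this sketch stops at the coefficients. The function names and the sample signal are hypothetical.

```python
import math


def haar_dwt(signal):
    """One level of the Haar DWT: return (approximation, detail)."""
    if len(signal) % 2:
        signal = list(signal) + [signal[-1]]  # pad odd-length input
    s = math.sqrt(2.0)
    # Low-pass output: scaled pairwise sums (low-frequency content).
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    # High-pass output: scaled pairwise differences (high-frequency content).
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def two_level_dwt(signal):
    """Split the signal once, then split the low-frequency part again."""
    a1, d1 = haar_dwt(signal)
    a2, d2 = haar_dwt(a1)
    return a2, d2, d1


# Illustrative signal, e.g. one column of a feature matrix.
column = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
a2, d2, d1 = two_level_dwt(column)
```

Only the low-frequency branch is decomposed a second time, which mirrors the statement that the low frequencies carry the more important information.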

Light eXtreme Gradient Boosting.
During the establishment of the predictor, model training is performed by a classifier. The Gradient Boosting Machine (GBM) classifier uses decision trees to construct a model, and the model performance is improved by minimizing a loss function [42]. Unlike GBM, eXtreme Gradient Boosting (XGB) employs an objective function that combines the loss function with a regularization term to control the model complexity. It also performs parallel computations to improve computational speed. Building on these benefits of XGB, Light eXtreme Gradient Boosting (LiXGB) was proposed [43]. LiXGB offers additional advantages, such as lower memory usage, higher efficiency, and faster model training, and it reduces the training time on large datasets. We tuned the hyperparameters max depth, estimator, eta, lambda, and alpha. The "eta" parameter sets the learning rate, "estimator" sets the number of trees constructed, "max depth" controls the tree depth, "alpha" applies regularization that shrinks the high-dimensional feature space, and "lambda" applies regularization that helps avoid overfitting. Other parameters were kept at their defaults.
These hyperparameters are also summarized in Table 1.

Proposed Model Validation Methodologies.
The model performance is examined by different validation approaches. The commonly used validation methods are k-fold cross-validation and the jackknife test [44][45][46][47]. However, the jackknife test is time-consuming and costly [48][49][50]. In 10-fold cross-validation, the training set is split into 10 folds. Nine folds are used for model training and one fold is used for model validation. This process is repeated 10 times so that each fold is used for testing exactly once. The final prediction is the average over all tested folds [51][52][53][54]. In the current work, performance is evaluated with the 10-fold test and five indices, i.e., specificity (Sp), F-measure, sensitivity (Sn), accuracy (Acc), and Matthews correlation coefficient (MCC) [55][56][57][58].
These indices are computed from the following quantities: $H^{+}$ denotes the number of DBPs, $H^{-}$ the number of non-DBPs, $H^{-}_{+}$ the number of non-DBPs that the model mistakenly predicted as DBPs, and $H^{+}_{-}$ the number of DBPs that the model classified as non-DBPs.
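The five indices can be sketched in code using the standard confusion-matrix definitions, mapped onto the four counts above. This mapping is an assumption: true positives are taken as the DBPs minus the missed DBPs, and true negatives as the non-DBPs minus the wrongly flagged non-DBPs. The function name, argument names, and example counts are illustrative and are not results from this study.

```python
import math


def evaluate(h_pos, h_neg, h_pos_missed, h_neg_wrong):
    """Compute Sn, Sp, Acc, F-measure, and MCC from the four counts.

    h_pos        : number of DBPs
    h_neg        : number of non-DBPs
    h_pos_missed : DBPs classified as non-DBPs (false negatives)
    h_neg_wrong  : non-DBPs predicted as DBPs (false positives)
    Assumes non-degenerate counts (no zero denominators for Sn, Sp, F).
    """
    tp = h_pos - h_pos_missed
    fn = h_pos_missed
    tn = h_neg - h_neg_wrong
    fp = h_neg_wrong
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (h_pos + h_neg)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sn / (precision + sn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "F": f_measure, "MCC": mcc}


# Made-up counts for illustration only.
scores = evaluate(h_pos=100, h_neg=100, h_pos_missed=10, h_neg_wrong=20)
```

In a 10-fold test, these indices would be computed per fold or on the pooled predictions and then averaged, as described above.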

Results and Discussion
In this part, we elaborate on the results obtained by the learning algorithms with the feature sets extracted from the training and testing sequences. The results are summarized in Table 2. The performance of each individual descriptor is analyzed by the 10-fold test and the assessment indices. On F-PSSM, the accuracies secured by LiXGB, XGB, ERT, and Adaboost are 76.60%, 74.57%, 75.18%, and 71.52%, respectively. Among all classifiers, LiXGB achieved the best accuracy. On PSSM-DPC, all classifiers improved the prediction results, with LiXGB, XGB, ERT, and Adaboost reaching accuracies of 83.62%, 81.63%, 79.56%, and 80.07%, respectively. Similarly, the classifiers also improved the performance on the R-PSSM descriptor on all evaluation parameters. LiXGB attained the highest accuracy (83.62%). These predictions indicate that LiXGB possesses higher learning power than XGB, ERT, and Adaboost.

Results of Feature Encoders after DWT.
The features extracted by the representative methods may contain noisy, redundant, or less informative features. To avoid such features, DWT is applied to F-PSSM, PSSM-DPC, and R-PSSM. DWT retains the informative patterns and improves the performance of the model. After applying DWT, we obtain F-PSSM-DWT, PSSM-DPC-DWT, and R-PSSM-DWT. Each feature set is fed into Adaboost, ERT, XGB, and LiXGB to examine the performance on these feature descriptors, and the results are summarized in Table 3. With the 10-fold test on F-PSSM-DWT, Adaboost, ERT, XGB, and LiXGB produced accuracies of 73.20%, 77.26%, 75.37%, and 79.40%, which are 1.68%, 2.08%, 0.80%, and 2.80% higher than with F-PSSM, respectively. Similarly, the classifiers also improved the performance with PSSM-DPC-DWT on all evaluation parameters. Furthermore, with R-PSSM-DWT, Adaboost, ERT, XGB, and LiXGB improved their accuracies by 2.16%, 3.49%, 1.98%, and 3.22% over R-PSSM.
These results demonstrate that all classifiers improve in performance after applying DWT. Among all feature descriptors, the best results are secured by R-PSSM-DWT.
LiXGB has consistently performed better than the other classifiers. With R-PSSM-DWT, LiXGB achieved 3.23%, 3.79%, and 4.61% higher accuracies than XGB, ERT, and Adaboost, respectively. It is concluded that the performance of LiXGB is superior to that of the other classifiers.
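The accuracy gains reported for the filtered descriptor can be checked with a short calculation. The sketch below uses only the F-PSSM accuracies and the post-DWT accuracies quoted in the text; the variable names are illustrative.

```python
# 10-fold accuracies (%) quoted in the text for F-PSSM before DWT.
f_pssm = {"Adaboost": 71.52, "ERT": 75.18, "XGB": 74.57, "LiXGB": 76.60}
# 10-fold accuracies (%) quoted in the text after applying DWT.
f_pssm_dwt = {"Adaboost": 73.20, "ERT": 77.26, "XGB": 75.37, "LiXGB": 79.40}

# Absolute gain in percentage points, rounded to two decimals.
gains = {name: round(f_pssm_dwt[name] - f_pssm[name], 2) for name in f_pssm}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```

The computed differences match the stated improvements of 1.68, 2.08, 0.80, and 2.80 percentage points for Adaboost, ERT, XGB, and LiXGB.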

Comparison with Existing Predictors Using Training Set.
Several methods have been implemented for the identification of DBPs. The proposed work is compared with past studies, including iDNA-Prot [22], iDNA-Prot|dis [23], TargetDBP [59], MsDBP [60], PDBP-CNN [29], and XGBoost [30], and the results are summarized in Table 4. Our proposed study improved the accuracy by 4.82%, the sensitivity by 10.58%, and the MCC by 0.09 over the best existing predictor (PDBP-CNN). Similarly, DBP-iDWT improved Acc by 5.42%, Sn by 2.49%, Sp by 8.65%, and MCC by 0.11 over the second-best study (XGBoost). In the same fashion, our predictor is superior to the other past studies on all four assessment parameters. These outcomes verify that DBP-iDWT can discriminate DBPs with high precision.

Comparison with Past Predictors Using Independent Set.
A method is considered effective if it generalizes well to new sequences. We therefore also evaluated the proposed work on a testing dataset. The results are compared with past studies such as PseDNA-Pro, iDNAPro-PseAAC, iDNAProt-ES, DPP-PseAAC, TargetDBP, MsDBP, and PDBP-Fusion, as noted in Table 5. Our predictor (DBP-iDWT) improved Acc by 5.06%, Sn by 17.06%, Sp by 8.22%, and MCC by 0.10 over PDBP-Fusion. Similarly, DBP-iDWT improved Acc by 6.14%, Sn by 14.02%, and MCC by 0.13 over TargetDBP. The proposed study also secured higher prediction results than the other past methods in Table 5.
This analysis confirms that incorporating DWT into R-PSSM in conjunction with LiXGB can identify DBPs more accurately. Past studies have reported that selecting the best features can improve model performance [61][62][63]. In this study, we also applied feature selection approaches, including mRMR and SVM-RFE; however, no improvement in model performance was observed.

Conclusion and Future Vision
DBPs play an active role in many biological functions and in drug design. We have designed a predictor that improves the prediction of DBPs with high precision. The global information, local features, sequence-order patterns, and correlated factors are explored by F-PSSM-DWT, R-PSSM-DWT, and PSSM-DPC-DWT. The models are trained with LiXGB, XGB, ERT, and Adaboost. It is concluded that R-PSSM-DWT with LiXGB attained better performance than the other predictors. The success of the proposed study is due to factors such as the use of effective descriptors, the application of a compression scheme, and an appropriate classifier.
DBP-iDWT will be effective for the identification of DBPs because its prediction power is higher than that of other predictors, and it can play an active role in drug development. DBP-iDWT would be fruitful for establishing more effective therapeutic strategies for fatal disease treatment. In addition, we will apply advanced deep learning frameworks [64][65][66][67] in our future work to further improve the prediction of DBPs.

Conflicts of Interest
The authors declare that there are no conflicts of interest.

Authors' Contributions
Farman Ali: Conceptualization, Methodology. Harish Kumar and Shruti Patil: data collection, writing-original draft preparation. Omar Barukab and Ajay B Gadicha: Visualization, performed experiments suggested by reviewers. Omar Alghushairy and Akram Y Sarhan: Code writing, editing, and reviewed the paper.