An Ensemble Learning-Based Method for Inferring Drug-Target Interactions Combining Protein Sequences and Drug Fingerprints

Identifying the interactions of the drug-target is central to the cognate areas including drug discovery and drug reposition. Although the high-throughput biotechnologies have made tremendous progress, the indispensable clinical trials remain to be expensive, laborious, and intricate. Therefore, a convenient and reliable computer-aided method has become the focus on inferring drug-target interactions (DTIs). In this research, we propose a novel computational model integrating a pyramid histogram of oriented gradients (PHOG), Position-Specific Scoring Matrix (PSSM), and rotation forest (RF) classifier for identifying DTIs. Specifically, protein primary sequences are first converted into PSSMs to describe the potential biological evolution information. After that, PHOG is employed to mine the highly representative features of PSSM from multiple pyramid levels, and the complete describers of drug-target pairs are generated by combining the molecular substructure fingerprints and PHOG features. Finally, we feed the complete describers into the RF classifier for effective prediction. The experiments of 5-fold Cross-Validations (CV) yield mean accuracies of 88.96%, 86.37%, 82.88%, and 76.92% on four golden standard data sets (enzyme, ion channel, G protein-coupled receptors (GPCRs), and nuclear receptor, respectively). Moreover, the paper also conducts the state-of-art light gradient boosting machine (LGBM) and support vector machine (SVM) to further verify the performance of the proposed model. The experimental outcomes substantiate that the established model is feasible and reliable to predict DTIs. There is an excellent prospect that our model is capable of predicting DTIs as an efficient tool on a large scale.


Introduction
The identification of interacting drug-target pairs is of cardinal significance in pharmaceutical science. Previous development of genomics, protein engineering, and molecular biology dynamically helps researchers in finding the potential therapeutic drugs and explaining the by-effect of a trial. In past decades, the Food and Drug Administration (FDA) declared that the demand for new drugs is hard to meet due to the adverse clinical outcomes of some candidate drugs [1]. Classifying DTIs remains to be a critical step for better developing and applying novel molecule-targeted drugs. Previously, researchers utilized clinical experiments as the main approach to discover DTIs. Nevertheless, the traditional experiments are still cumbersome, costly, and time-consuming. Meanwhile, it also has to confront the contingency and inefficiency of the results. Therefore, novel computer-aided drug development (CADD) methods need to be advanced for effectively avoiding these drawbacks [2].
With the progress of protein primary sequence detection technologies and spectral techniques in determination of the chemical composition structure of drugs, the public database has had an explosive growth in size. These databases which provide multiple download formats comprehensively construct a reliable data platform for researchers. Different kinds of databases, such as the Therapeutic Target Database (TTD) [3], DrugBank [4], ChEMBL [5], and KEGG [6], collect the information of the protein primary structure, drug molecular structure, and drug-target pairs with known interactions to assist establishing the prediction model of DTIs. In the past years, researchers have made many achievements in predicting DTIs by combining traditional computing methods and bioinformatics. The most widespread applications are based on molecular docking, genome, and pharmacophore [7]. Molecular docking simulation is utilized to detect the optimal binding position between drug molecules and targets based on energy matching. This method also requires complete three-dimensional (3D) substructures of proteins, but they are hard to explore by Nuclear Magnetic Resonance (NMR), electron microscopy, and X-ray crystallography [8].
Pharmacophores are a characteristic element of drug-active molecules that play a pivotal role in the prediction of DTIs [9]. Researches suggested that the pharmacophore method can effectively inspect the multitarget drug design and reduce the blindness of screening. The difficulty of matching molecular pharmacophores is determined by the number of pharmacophore characteristics. In addition, whether the molecule can match the pharmacophore is also related to the conformations of the molecules [10]. When the conformation changes, the molecule will not match the existing pharmacophore model. Therefore, the establishment of the pharmacophore model is still not comprehensive for further bioassay. At the same time, this method does not take 3D structures of targets into account, which declines the accuracy of the pharmacophore model [11]. In general, it is exceedingly urgent to develop more robust and universal methods for the prediction of DTIs without a ligand and 3D target structure.
Up to now, many learning-based models are developed to detect potential DTIs. For instance, Ding et al. [12] developed a fuzzy bipartite local model (FBLM) based on fuzzy least square support vector machine and multiple kernel learning (MKL) for predicting DTIs. Specifically, MKL is employed to fuse multiple kernels of drugs and targets, and FBLM is adopted to infer the unknown DTIs. Krisztian et al. [13] utilized the Modified Linear Regression (MOLIERE) to predict the potential DTIs based on asymmetric loss models (ALM). Ye et al. [14] proposed a new prediction framework based on Adversarial Bayesian Personalized Ranking (AdvB). More specially, the latent factor matrices of drugs and targets are trained by partial order relationships. Then, the scores of inner products of factors are trained to predict DTIs. Maryam et al. [15] developed an effective model named the Coupled Tensor-Matrix Completion (CTMC) to repurpose drug molecules by constructing drug-drug and target-target tensors. Pliakos and Vens [16] proposed to address DTI prediction as a multioutput prediction task by learning ensembles of multioutput biclustering trees (eBICT) on reconstructed networks. An et al. [17] combined Weighted Extreme Learning Machine (WELM) and Speed Up Robot Features (SURF) to predict DTIs. Laarhoven et al. [18] proposed the Kronecker Regularized Least Square-(Kron-RLS-) based predictive models, which employed the Kronecker product to fuse drug and target feature spaces. Gönen [19] proposed a joint Bayesian formulation of projecting drug compounds and target proteins into a unified subspace, and this formulation combines dimensionality reduction, matrix factorization, and binary classification for predicting drugtarget interaction networks. Zheng et al. [20] proposed a factor model, named Multiple Similarity Collaborative Matrix Factorization (MSCMF) which is an extension of weighted low-rank approximation for one-class collaborative filtering. In this model, drugs and proteins are projected onto low dimensional feature space, and the weights of low-rank matrix and similarity matrix are estimated by alternating least square method to predict DTIs.
In this study, we present a novel computational method which exploits protein primary sequence and molecular fingerprints of drug compounds. More specially, this model numerically characterizes different amino acids as PSSMs to carry biological evolution information. Then, the proposed model employs the PHOG approach to extract the 680dimensional local features of PSSM from different pyramid levels. Finally, the RF classifier is employed to effectively predict DTIs based on the fusion which contains PHOG descriptors of PSSMs and drug fingerprints. This experiment also evaluates the prediction performance by conducting 5-fold Cross-Validation (CV) on enzyme, ion channel, G proteincoupled receptors (GPCRs), and nuclear receptor data sets. For the sake of verifying the reliability of the model, we also carried out the state-of-the-art LGBM and SVM on benchmark data sets. The overall results of the experiments illustrate that the established model is practicable in providing accurate candidates for clinical experiments by predicting DTIs. Figure 1 depicts the workflow of the proposed model.

Materials and Methods
2.1. Data Sets. In this paper, entire experiments were performed on benchmark data sets, viz., enzyme, ion channel, GPCRs, and nuclear receptor. All data sets originate from the databases of DrugBank [4], SuperTarget [21], BRENDA [22], and KEGG BRITE [6]. The statistical quantities of existing drugs are 445, 233, 210, and 54, respectively. The numbers of known proteins are 664, 95, 204, and 26, respectively. The counts of the DTIs which have been proven are 2926, 635, 1476, and 90, respectively. The number of known DTIs which were regarded as positive sample data set is 5127. Table 1 fully lists the statistical amounts of drugs, target proteins, and DTIs.
In this section, the bipartite graph is employed to display the DTI network. The nodes of the graph denote drugs and proteins, the edges which connect the nodes denote the relationships between drugs and targets. The interacting drugtarget pairs are considered as positive samples; the others are regarded as negative samples in the sparse network. Taking the ion channel data set as an instance, there are 42840 (210 × 204) edges existing in the graph. The verified 1476 real drug-target interactions construct the positive sample set, and the residual 41364 (42840-1476) pairs represent the negative samples. It is obvious that there is a big quantity gap between positive samples and negative samples. For attaining sample balance, a downsampling algorithm is adopted in uncorrelated pairs to form the negative set which contains the same number of samples as the positive one. In consideration of the scale of the sparse network and the large ratio of 2 BioMed Research International differences, the possibility that the drug-target pairs with real interactions are collected in the negative data set can be ignored. Therefore, the sample quantities of four negative data sets are 2926, 1476, 635, and 90, respectively.

Drug Substructure Characterization.
In recent years, many physical and chemical properties are utilized to describe the drug compound information including geometry, topology, and quantum chemistry [23,24]. At present, the researchers demonstrate that molecular fingerprints can effectively characterize the drug substructure. The fingerprints of structural bonds represent the drugs as Boolean substructure vectors by separating the drug molecular structure into a variety of segments. Although the molecule is sliced into individual segments, it still retains the entire structure information of the drug [25,26]. These printers reduce the information loss and error accumulation in the process of description and screening. Specifically, the predefined dictionary which contains all substructures matches all fragments of the given drug molecule. If the fragment exists in the dictionary, the corresponding position in the carrier is set to 1; otherwise, it is set to 0. The complete fingerprint database provides an effective way to describe the molecular structure of drugs as binary fingerprint vectors. We utilized the chemical structure map from the PubChem system in the website https://pubchem.ncbi.nlm.nih.gov/, and the map contains 881 molecular substructures [27]. Hence, the feature describers of the drug molecular structure take the form of an 881-dimensional binary vector.

Position-Specific Scoring Matrix (PSSM).
In general, researchers took many physicochemical approaches to numerically describe target proteins [28]. The effective descriptors will differentially convert proteins to enhance the performance of the classifier. Within the experiment, the Position-Specific Scoring Matrix (PSSM) is utilized to represent the biological evolution of proteins [29], and this matrix contains the probability information of 20 amino acids at each position in the original protein sequence. In the practical process, the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) is employed to generate the corresponding PSSM for different sorts of amino acids. The matrix is as follows: where the PSSM is expressed as a matrix of L × 20 and L denotes the length of the amino acid. ℓ i,j denotes the evolutionary score that the i th residue mutates into the j th amino acid in the evolutionary process. The experiments also optimized the parameters of PSI-BLAST to obtain more reliable homologous sequences. In summary, parameter e which represents the noise of protein matching is assigned to 0.001, and the frequency of iterations is set to 3.    [30]. Meanwhile, this method has strong antinoise performance and antirotation ability [31]. Firstly, the given original image F is segmented into i × i spatial grids in the i th pyramid level. Then, the histogram of oriented gradient (HOG) vectors of each grid should be calculated. Herein, we adopted Sobel operators to detect the edges and reduce the noise of the image. The Sobel operators can be defined as follows:

Pyramid
Sobel y = where Sobel x and Sobel y represent the horizontal operator and the vertical operator individually [32]. Then, the firstorder differential Sobel operator is utilized to convolute the given image as follows: where G x denotes the convolution of picture F in the x-axis direction, where G y denotes the convolution of image F in the y-axis direction. After convolution, the image F is converted into I which can be obtained as follow: The gradient magnitude g and direction θ of pixels in grids can be obtained by the following formulas: where g x and g y can be computed as follows: where φ and ω represent the coordinate position of a pixel in the picture. The [0-360] orientation is divided into m regions, and the pixels are divided into m regions to count HOG by gradient direction. Then, the HOG eigenvectors which contain m values have to be normalized by the following formula: where V represents the HOG feature vector and ε is a small constant. Finally, HOG features of each spatial grid from all pyramid levels are concatenated to be PHOG feature descriptors. In this experiment, we set parameters L = 3 and m = 8.
The number of grids in four levels is 85 (1 + 2 × 2 + 4 × 4 + 8 × 8), and converted the PSSM into a 680 (85 × 8) dimensional vector. Figure 2 gives an example of merging HOG describers into PHOG describers.   [33]. Rotation forest works well on difference promotions and classifications of small sample data sets [34]. In particular, the RF classifier has outstanding performance on balancing the diversity and accuracy of the base classifier by rotating the subsets. Meanwhile, the model also preserves the efficiency, interpretability, and simplicity of the decision tree. In this paper, we employ RF to predict DTIs. The detailed process is shown as follows.
In practical terms, the data is randomly separated into K subsets containing disjoint features. Afterward, the bootstrap and Principal Component Analysis (PCA) method are applied in subsets to obtain rotation matrices with high diversity. Finally, these matrixes are fed into the corresponding base classifier, and the scores of each decision tree are counted. The matrix X of n × m is treated as a training feature set which contains m features of n samples, and T = ðt 1 , t 2 ,⋯,t n Þ T stores the corresponding labels of n samples. RF has L base classifiers D i . The detailed training process of the base classifier is as follows.
(I) After optimizing the model, the data set M is separated into K disjoint subsets at random, and each subset has C = m/k features (II) Let M i,j represent the j th subset of M, and X i,j is the corresponding feature set of M i,j . Then, calculate    (IV) These coefficients construct the sparse rotation matrix Z i as follows: In the process of classification, the possibility that sample x belongs to category y i is d i,j ðxZ a i Þ generated by base classifier D i . Subsequently, count the confidence degrees that x belongs to each class by mean combination as follows: The sample x will be distributed into the most possible class in accordance with the degree.

Evaluation Criteria.
Throughout the experiments, accuracy (Acc.), sensitivity (Sen.), precision (Pre.), specificity (Spec.), and Matthews correlation coefficient (MCC) comprehensively appraise the prediction performance. These criteria can be defined as follows: Sen: = TP TP + TN , ð15Þ Pre: = TP FP + TP , ð16Þ Spec: = TN TP + FP , ð17Þ where true positive (TP) represents the sum of interacting drug-target pairs with correct predictions, true negative (TN) reflects the aggregate of noninteracting drug-target pairs with correct predictions, false positive (FP) denotes the count of noninteracting drug-target pairs with incorrect classifications, and false negative (FN) represents the count of interacting drug-target pairs with incorrect classifications. Furthermore, receiver operating characteristic (ROC) curves are employed to depict results [35], and the area under the curve (AUC) is calculated to justify the prediction feasibility [36].

Parameter Discussion.
In this experiment, parameters K and L are relevant to the results of the model. The K value and L value represent the numbers of the feature subsets and decision trees of RF, respectively. We applied the grid search algorithm to get the optimum parameters [37]. The method indicates that the accuracy ascends with the growth of the L value. When K = 28 and L = 26, the model has the best performance. Hence, we set the K value and the L value as 28 and 26, respectively. Figure 3 shows the accuracy surface of the RF classifier influenced by parameters K and L.

Fivefold CV Results on Four Data
Sets. This section applied 5-fold CV on enzyme, ion channel, GPCR, and nuclear receptor data sets to obtain evaluation results for further verifying the reliability of our model. During the validation, the data set was broken into five subsets on average. Specifically, each subset took turns to be regarded as the testing part; the other four subsets merged into the training part in five repetitive experiments. Tables 2-5 list the results of validations on benchmark data sets. It is obvious that the model worked well on four golden standard data sets from Tables 2-5. In terms of the results yielded by the enzyme data set, the average accuracy, precision, sensitivity, specificity, and MCC are 88.96%, 89.76%, 87.92%, 90.01%, and 77.93% with standard deviations of 1.13%, 1.82%, 1.25%, 1.50%, and 2.29%, respectively. As for

Comparison between the Models with PHOG Descriptor and LPQ Descriptor.
For fairly evaluating the performance of the PHOG descriptor, we also conducted the experiments with local phase quantization (LPQ) which has a wide application prospect in spatial fuzzy image texture description processing for blurred-invariant property [38,39]. Table 6 displays the comparison between PHOG and LPQ with rotation forest. The summarized table clearly indicates that the model with PHOG descriptors has a performance promotion than the LPQ descriptors on four golden standard data sets. In particular, the precision, sensitivity, specificity, and MCC all improved in the ion channel and GPCR data sets. Figure 8 plots the ROC curves of the PHOG and LPQ models on four data sets with mean AUC values. As can be noted, the AUC values of the PHOG model are higher than the LPQ model. Especially in the GPCR and nuclear receptor data sets, AUC gaps attend to 1.81% and 2.75%, respectively. Hence, our model can effectively describe PSSM to identify potential interacting drug-target pairs.

Comparison with Other Classifiers.
At present, the classifiers which were used in predicting DTIs are mainly based on  7 BioMed Research International traditional machine learning methods. In this section, we adopted advanced SVM and LGBM to combine PHOG descriptors. In the rotation forest, we set parameters K = 28 and L = 26 by utilizing the grid search method. After parameter optimization, SVM embedded Gaussian kernel function with parameters C = 0:2 and Gamma = 40. Parameter C prevents SVM from over fitting, and Gamma determines the number of support vectors.
LGBM is based on a gradient boosting framework, and it is widely used in classification in industrial practice for it is time-saving and memoryconserving. After conducting grid method searching, the best results can be obtained by setting the number of leaves to 60, the learning rate to 0.05, and the number of training rounds to 40. Figure 9 gives the results of RF, SVM, and LGBM on the enzyme, ion channel, GPCR, and nuclear receptor data sets, and it clearly reports that RF has a better performance with the PHOG descriptor than the other classifiers on verifying interacting drug-target pairs. The mean accuracy of RF is 8.30%, 8.07%, 11.43%, and 9.56% higher than SVM on the four golden standard data sets. Compared with the LGBM algorithm, the accuracy of RF improved 3.90%, 4.34%, 1.65%, and 8.33%, respectively. Figures 10 and 11 depict the ROC curves of the benchmark data sets generated by the rates of true positive (TP) and false positive (FP). In addition, mean AUC values are also attached to each graph for more intuitively describing the effect of different classifiers. The reliability of predicting DTIs of the model is proportional to the value of AUC. It can be observed that RF has performance promotions of 9.27%, 9.63%, 12.52%, and 11.49% against SVM on the four benchmark data sets. The value gaps of AUC between RF and LGBM are 8.97%, 7.59%, 9.38%, and 11.48%, respectively, on the four data sets. Accordingly, RF is more competitive than the other models in predicting DTIs.
3.6. Comparison with Other Methods. To date, many researchers have innovatively provided effective solutions for the prediction of DTIs. In order to further validate the efficiency of our model, we selected such previous models as MLCLE [40], NetCBP [41], SIMCOMP [42], WNN-GIP [43], AM-PSSM [44], NetLapRLS [45], MSCMF [20], and Bigram-PSSM [46] to analyze the performance of the proposed model. Meanwhile, all of these models are under the 5-fold CV framework on benchmark data sets. The average AUC values of these methods obviously indicate that the effect of our model has a significant enhancement in prediction in Table 7. In terms of the enzyme, ion channel, and GPCR data sets, the growths of the AUC reached 0.0029, 0.0119, and 0.0320, respectively. With regard to the nuclear receptor data set, Bigram-PSSM has the best performance with an AUC improvement of 0.0204 than our model. The results illustrate that the model which embeds PHOG descriptors and rotation forest is competent to effectively identify DTIs.

Conclusion
In this paper, we fused the pyramid histogram of oriented gradients (PHOG), Position-Specific Scoring Matrix (PSSM),    9 BioMed Research International channel, G protein-coupled receptor (GPCR), and nuclear receptor data sets. The results obviously illustrate that the PHOG features can trace the local characteristics and assist the model to improve the accuracy even compared with LPQ. Meanwhile, the model is considered to be an extraordinarily suitable tool for providing candidates of drug discovery. In the subsequent work, we will experiment with more methods to further raise the feasibility of the prediction model.

Limitation and Future Work
Although the model shows an improved prediction ability than other models, we still noticed the singleness of the local feature, and the noise existing in features also has an adverse effect on forming describers. The main limitations of the model can be explained from two aspects. On the one side, the utilized feature extraction method is sensitive to local feature information. However, it is hard to excavate the global feature information of samples. On another side, the same number of unlabelled samples is randomly selected to be negative samples as the known interacting drug-target pairs; hence, the model wastes a large number of unselected negative samples. The feature studies will mainly focus on the processes of feature extraction and classification. The external edge features which have an excellent application prospect in the field of image tamper prevention will integrate the internal features to comprehensively describe bioinformation with less noise. Meanwhile, unsupervised learning models will be adopted to confront the waste of data sets, and it will make full use of high-throughput unbalanced data. These improvements will bring new challenges and opportunities to develop robust prediction tools for enhancing the model prediction accuracy.

Data Availability
The data are original, and the data source is restricted.

Conflicts of Interest
The author declares that there is no conflict of interest. LGBM-Enzyme (mean AUC = 0.8612)