DrugECs: An Ensemble System with Feature Subspaces for Accurate Drug-Target Interaction Prediction

Background Drug-target interaction is key in drug discovery, especially in the design of new lead compound. However, the work to find a new lead compound for a specific target is complicated and hard, and it always leads to many mistakes. Therefore computational techniques are commonly adopted in drug design, which can save time and costs to a significant extent. Results To address the issue, a new prediction system is proposed in this work to identify drug-target interaction. First, drug-target pairs are encoded with a fragment technique and the software “PaDEL-Descriptor.” The fragment technique is for encoding target proteins, which divides each protein sequence into several fragments in order and encodes each fragment with several physiochemical properties of amino acids. The software “PaDEL-Descriptor” creates encoding vectors for drug molecules. Second, the dataset of drug-target pairs is resampled and several overlapped subsets are obtained, which are then input into kNN (k-Nearest Neighbor) classifier to build an ensemble system. Conclusion Experimental results on the drug-target dataset showed that our method performs better and runs faster than the state-of-the-art predictors.


Introduction
The knowledge of biological targets is key to find new medications where the inventive process is drug design [1]. Most drugs binding to specific proteins activate or inhibit various functions with the change of proteins' biochemical and/or biophysical activities [1][2][3]. Before designing a drug, some properties of the drug are required to the rational drug design, such as binding affinity, bioavailability, metabolic half-life, and side effects. It is a hard work to obtain these properties accurately. Therefore more attentions are being focused on selecting candidate drugs whose physicochemical properties can be predicted more easier than those of the drug. The former can more likely lead to an approved drug with fast rapid and simplified processes. However, synthesizing a drug candidate involves three issues that should be addressed [1,2]. First, the way to finding drug effects to different people [3][4][5] and the biological interaction pathways of the drug effects in human beings [6] are the two important issues. Moreover, since drug discovery is very complicated, the number of new drug approvals is quite low per year. Computational methods are more and more commonly used to design drugs in complement with in vitro experiments, where the identification of drug-target interactions is important in finding potential candidate drugs.
Nowadays, many computational methods have been developed to identify drug-target interactions (DTIs). Chen et al. proposed for the first time a "GO and KEGG enrichment score" method to represent the certain category of drug molecules by constructing a benchmark database [7]. Rarey et al. presented an automatic method for docking organic ligands into protein binding sites, which combined an appropriate model of the physicochemical properties of the docked molecules with efficient methods for sampling the conformational space of ligands [8]. Cheng
to predict small-molecule druggability based solely on the crystal structure of target binding sites, which quantitatively estimated the maximal affinity achievable by a druglike molecule, where the calculated values are correlated with drug discovery outcomes [9]. Zhu et al. proposed a probabilistic model, called mixture aspect model (MAM), with an algorithm for estimating its parameters to mine the relationship of "chemical compound-gene" [10]. Moreover, Chen et al. proposed a prediction method based on Nearest Neighbor Algorithm and a novel metric which combined compound similarity and functional domain composition. It concluded that the combination of compound similarity and functional domain composition is very effective in the drugtarget interaction prediction [11]. Some methods combined information of chemical structure, genomic sequence, and 3D structure information to predict drug-target interaction networks [12,13]. Wang et al. first collected drug pharmacological and therapeutic effects, drug chemical structures, and protein genomic information to characterize the DTIs and then proposed a kernel-based method to predict DTIs by integrating multiple types of data [14]. Other methods developed machine learning methods focusing on HIV protease cleavage site prediction [15], identification of GPCR (G protein-coupled receptors) type [16], protein subcellular location prediction [17,18], membrane protein type prediction [19], and a series of relevant webserver predictors as summarized in a review [20].
Most recently, researchers proposed many machine learning methods to identify DTIs. Yuan et al. proposed an ensemble method that combined multiple well-known similaritybased methods to predict DTIs [21]. Ba-alawi et al. developed an efficient drug-target prediction method, called DASPfind, that used simple paths of particular lengths inferred from a graph to describe DTIs, similarities between drugs, and similarities between the protein targets of drugs [22]. Moreover, Nascimento et al. proposed a multiple kernel learning algorithm to investigate drug-target bipartite networks and automatically selected the more relevant kernels by returning weights indicating their importance in the drug-target prediction at hand [23]. Liu et al. proposed a neighborhood regularized logistic matrix factorization (NRLMF) for DTI predictions. The NRLMF method modelled the probability that a drug would interact with a target by logistic matrix factorization, where the properties of drugs and targets are represented by drug-specific and target-specific latent vectors, respectively [24]. Recently, Hao et al. proposed a dualnetwork integrated logistic matrix factorization (DNILMF) algorithm to predict potential DTIs, which encoded drugs and targets by inferring new drug/target profiles, constructing profile kernel matrix, diffusing drug profile kernel matrix with drug structure kernel matrix, and diffusing target profile kernel matrix with target sequence kernel matrix as well [25].
In this work a new prediction system is proposed to identify drug-target interactions. First, drug-target pairs are encoded with a fragment technique and the adopted software "PaDEL-Descriptor." The fragment technique is for encoding target proteins, which divides each protein sequence into several fragments in order and encodes each fragment with several physiochemical properties of amino acids. The software "PaDEL-Descriptor" creates encoding vectors for drug molecules. Second, the dataset of drug-target pairs is resampled and several overlapped subsets are obtained, which are then input into kNN ( -Nearest Neighbor) classifiers to build an ensemble system. Experimental results on the drug-target dataset showed that our method performs better and runs faster than the state-of-the-art predictors.

Datasets.
To compare with other methods, datasets in [13] were adopted for the drug-target predictions in this study. The datasets contained 4797 drug-target pairs with experimental information, of which 2719 are for enzymes, 1372 for ion channels, 630 for GPCRs, and 86 for nuclear receptors. These datasets are used as the positive ones in this work. To build the drug-target predictor, negative datasets were also selected as in [13]. The selection steps in the reference are (1) separating the pairs in the above positive dataset into single drugs and proteins, (2) recoupling these singles into pairs in a way that none of them occurs in the corresponding positive dataset, and (3) randomly picking the negative pairs thus until they reached the number, two times as many as the positive pairs. The nondrug-target interaction pairs are also divided in terms of protein target family. Finally, the negative datasets contain 9594 drug-target pairs, of which 5438 are for enzymes, 2744 for ion channels, 1240 for GPCRs, and 172 for nuclear receptors. In total, the positive-negative datasets contain 8157, 4116, 1860, and 258 pairs for enzymes, ion channels, GPCRs, and nuclear receptors, respectively. Table 1 lists the details of the four datasets.   AAindex1 dataset were used. After the use of principal component analysis (PCA), the number of principal components PCA that contribute 97.54% of instance variance is obtained, where PCA = 7 for example. Each principal component of amino acid property is a vector with dimensions 1 × 20, noted as AAP = {AAP 푚 }, = 1∼20. For target protein TP 푖 with sequence length Len, its sequence is divided into frag fragments (denoted as, e.g., frag = 10) with roughly the same numbers of amino acids from the N-terminal to C-terminal of the sequence, that is, INT(Len/ frag ). For th fragment, = 1∼ frag , 20 types of amino acids are counted; that is, AA 푗 푛 , = 1∼ AA , where AA is for 20 types of amino acids. The dot product between the property vector AAP 푙 , = 1∼ PCA , of th principal component and the amino acid composition AA 푛 푗 of the th fragment yields the score of amino acid property composition TP 푙,푛 푗 = AAP 푙 ⊙ AA 푛 푗 . Therefore, the encoding vector for the th fragment is denoted as TP 푗 = {TP 푙,푛 푗 }. Concatenating the ten fragments yields the final encoder vector TP 푖 = {TP 푙,푛 푗 }| , where AA means the number of amino acid types AA = 20, for the th whole protein sequence. The vector is then with dimensions 1×1400. The encoding scheme for target protein is shown in Figure 1.

Feature Encoding for Drug Molecule.
To encode drug molecule, PaDEL-Descriptor software was adopted. "PaDEL-Descriptor" can calculate molecular descriptors and fingerprints for small molecules. Molecular descriptors are the way that transformed the real bodies of molecules into numbers, which can be then evaluated by computer and play a fundamental role in chemistry and other fields. They are traditionally divided into two main categories: experimental measurements and theoretical molecular descriptors. The former category contains log , molar refractivity, dipole moment, polarizability, and physicochemical properties, while the latter one can be derived from a symbolic representation of molecules.
Currently the software creates 1D (i.e., atom count, bond count, chi cluster, and hybridization ratio), 2D (i.e., graph invariants), and 3D (i.e., size, steric, surface, and volume descriptors) molecular descriptors and 10 types of fingerprints [26]. In this work, 55 types of 1D and 2D descriptors were used after removing salt and detecting aromaticity information from a molecule. Among these descriptor types, atom type electrotopological state and autocorrelation are the most significant ones during the used descriptors. The atom type electrotopological state consists of 489 descriptors that combines both electronic and topological characteristics of the analyzed molecules [27]. For each atom type in a molecule, the descriptors are catenated and can be used in a group contribution manner. The type of autocorrelation consists of 346 descriptors that encode not only the structures of molecules but also numerical properties assigned to atoms, proposed by Moreau and Broto [28].
In this study, 1444 descriptors of 1D and 2D types are used to encode drug molecules. So the th drug candidate can be formulated as For the th pair of drug-target, DT 푖 , whose target is encoded by the principal components of AAindex1 property AAP, it can be catenated and formulated as a (1444 + 1400)-D input vector given by where PCA , frag , and AA denote the numbers of top principal components, protein sequence fragments, and amino acid types, respectively. Here PCA = 7, frag = 10, and AA = 20, for example. The corresponding target value 푖 of the instance 푖,AAP is 1 or 0, denoting whether the drug-target pair is in interaction or not. Actually, our method expects to learn the relationship between input matrix AAP and the corresponding target array , and it tries to make the outputs of classifier as close to the target array as possible, where AAP denotes the irrelevant AAindex1 property the targets are encoded by.

Ensemble Classifier by Subspace Seperation of Input
Instances. To build a classifier system for the drug-target prediction, the feature space of drug-target instances is separated into several subspaces. For encoding target protein, the use of each principal component of amino acid properties yields features Ft 푖 , = 1∼ PCA , with dimensions 20 × 10 = 200, while, for drug molecule, approximately 206 (≈1444/ PCA , PCA = 7) features Fd 푗 , = 1∼ PCA , are selected in order from the 1D and 2D feature descriptors. A drug-target pair is encoded by one principal component's subfeatures Ft 푖 and one subset of 1D and 2D descriptors Fd 푗 as dt = [Ft 푖 Fd 푗 ] , , = 1∼ PCA . There are totally 7 × 7 = 49 feature subsets when PCA = 7, each of which is with dimensions 200 + 206 = 406. The reason of using 7 subsets of 1D and 2D feature descriptors is for balancing the encoding features of target proteins and those of drugs. Pairs of drugtargets with one subset of 406 features are input into classifier system to identify drug-target interactions.
The classifier system adopted kNN algorithm to implement the drug-target prediction. In the case of classification, kNN classifies an object to the most common class among its -Nearest Neighbors by a majority vote, where is a small positive integer. It always assigns weights to the contributions of the neighbors of the object, that is, assigning a weight of 1/ , where is the distance between the object and the neighbor. The neighbors are taken from the training set of objects for which the class is known. where is a query instance.
Previous results showed that the majority vote with independent classifiers can often make a dramatic improvement [29,30]. Here a pair of drug-target is labelled as interacting if all of the classifiers identified it as positive class 1; otherwise it is identified as nondrug-target interaction. The flowchart of the ensemble system can be seen in Figure 2.

Drug-Target Interaction Prediction Evaluation.
In this work we adopted four evaluation measures to show the ability of our model objectively, criteria of recall (Rec), precision (Prec), -measure ( 1), Matthews correlation coefficient (MCC), and accuracy (Acc) [31][32][33]. They are defined as follows: where TP (true positive) is the number of correctly predicted drug-target pairs, FP (false positive) is the number of false positives (incorrectly overpredicted nondrug-target pairs), TN (true negative) is the number of correctly predicted nondrug-target pairs, and FN (false negative) is false negative, that is, incorrectly underpredicted drug-target pairs.

Results
In the paper, we adopted kNN algorithm to complete the drug-target interaction predictions. In the use of kNN, for the number of neighbors num 퐾 , we adopted the implementation of WEKA software which used cross-validation technique to find the best number of neighbors. It trained input instances to find which number of neighbors yields the best performance. The trained neighboring number num 퐾 was then applied to test the test dataset.  Figure 2: The flowchart of the ensemble system for the drug-target prediction. It illustrates the system for the 7 top principal components. The "PCA 1" denotes the features of protein targets created by the first top principal component, while the " 1 1∼206" means the first feature group of drugs from 1 to 206 created by the "PaDEL" software. Each feature group consists of the roughly same number of features. Therefore each instance is composed of one "PCA" feature group and one "PaDEL" feature group. In total, there are 49 combinations for the case of the 7 top principal components.
Moreover, We used the 7 PCAs in this work because the top 7 components account for the most of instance variance of the 544 amino acid properties, up to 97.54%. The 10 fragments of a protein sequence are adopted corresponding to the number of 1D and 2D descriptors. We used approximately the same numbers of features for drugs and target proteins due to the balance of their effects to classifier. That is to say, if we use 5 fragments of protein sequence, the number of PCAs is suggested to be set as 14 (then the total number of features for protein targets is 5 × 14 × 20 = 1400), 15 fragments are corresponding to 5 PCAs (then the number of features for targets is 15 × 5 × 20 = 1500), and 4 fragments are corresponding to 19 PCAs (1520 features). All numbers of features for targets are approximately the same as the number of features for drugs (totally 1440). For details, please refer to Table 4.

PCA Analysis of the AAindex1
Properties. In AAindex1 dataset, there are 544 amino acid properties. Most of them are highly correlated. In order to extract the main properties from the dataset, PCA technique was adopted. In this study, 7 top principal components are obtained first, which account for 97.54% variance of the properties. Therefore the 7 components are retained and the others are ignored. The other principal components account for only 2.46% variance. Table 2 shows the 7 principal components.

Performance on Different Top Principal Components of Amino Acid Properties.
Drug-target interactions can be commonly grouped in terms of target protein type: enzymes, ion channels, GPCRs, and nuclear receptors. Our proposed method is performed on the four individual types of drugtarget interactions. Instances in each interaction type are divided into 10 subsets with roughly the same number of instances, where one subset is used as test dataset ts and the remaining nine subsets are used as training dataset tr , by 10-fold cross-validation technique. The test subset is selected one by one and finally all of the instances are tested. Meanwhile, the features of each instance consist of one "PCA" feature group and one "PaDEL" feature group. Inputting the instances into kNN classifier forms a predictor to identify drug-target interactions.
For the case of GPCRs type, the instances encoded by different pairs of drug-target feature groups are input into kNN classifier. The prediction performance can be seen in Table 3. From Table 3, the individual classifier encoded by the third "PCA" group and the seventh descriptor group performs better than others, under the estimation of "Acc," and the classifiers with the seventh descriptor group perform the best. Moreover, it is interesting to show that the classifiers encoded by the sixth principal component perform better than those by other components.
Also, performance on different numbers of PCAs and the corresponding numbers of fragments is investigated and shown in Table 4. The number of fragments is changed with the number of PCAs since the number of the corresponding features for targets is approximately the same as that of features for drugs (totally 1444 features). Therefore, the instances encoded by the drug features and the target features are used to build drug-target prediction classifier. From Table 4, the  Table 5 shows the performance comparison of the ensemble system for the four protein target classes. From Table 5, it can be seen that the ensemble system tested on nuclear receptors class performs better than those on other classes. It yields an accuracy of 0.921 and a precision of 0.856 at a recall of 0.916.

Comparison with Other Methods.
On the same datasets, our proposed method, called "DrugECs," was compared with other methods, such as the work in [13], four web-servers, and a random predictor. The performance comparison in terms of the measure "accuracy" is shown in Table 6. Our method yields accuracies of 0.918, 0.882, 0.863, and 0.921 for classes of enzymes, ion channels, GPCRs, and nuclear receptors, respectively. Our method achieves accuracy improvements of 6.3% to 7.8% than the work [13] for the four drug-target classes. Moreover, our method performs better than the four web-servers: iEzy-Drug, iCDI-Drug, iGPCR-Drug, and iNR-Drug. In comparison with the random predictor, the accuracy of our proposed method is increased more than twofold. Furthermore, we also compared our method with other methods based on another dataset. In [11], Chen et al. proposed a prediction method based on Nearest Neighbor Algorithm and a novel metric combining compound similarity and functional domain composition. It was tested on the database similar to that of our method. The difference between the two datasets is in the fact that, in [11], the number of negative pairs in each target class was around 50 times as many as that of positive ones, while being only 2 times as that of positive ones in our dataset. We compared our method with the work based on the same dataset. Actually, since the dataset is extremely imbalanced, the final performance of drug-target interaction predictions is preferable to the class with large instances, the negative class in this study.
The MCC can be a more suitable measure than Acc in the evaluation of classifier performance. Figure 3 shows the performance comparison of our method and the work in [11]. From Figure 3, our method outperformed the work [11] for the target classes of GPCR and enzyme, and it also performed comparably to the work for the class of ion channel. For the   Table 7 shows the most correlated properties to the components of PCA. Data description of each AAindex1 amino acid property is also shown in Table 7.
In the most correlated AAindex1 properties, two are for hydrophobicity property (the first one and the third one) and two are for position-specific amino acid preferences in helices (the fourth one and the sixth one). Another one is for free energies of transferring amino acid side chains from vapor to cyclohexane that are linearly related to their respective surface areas, which is an experimental measure of their susceptibility to attraction by dispersion forces. The other two properties are the Kerr-constant increments, which can be used in the conformational analysis of peptides and proteins, and the correlation coefficient between the contact areas of residues and their spatial positions from the centroids of the best-fitting ellipsoids. These amino acid properties are important for encoding protein sequence in that they represent protein sequences by different environmental features. The encoding schema aims to apply various statistic features to recover real interactions among amino acid residues.
It is interesting to show the correlation coefficient of the top PCA component and AAindex1 property. Figure 4 illustrates the biggest correlation coefficient of each PCA component to AAindex1 properties and the variance of each PCA component accounted for. From Figure 4, the top PCA component has the biggest correlation coefficient. Moreover, the bigger variance the PCA component contains, the bigger correlation coefficient the component has. It suggests that the trend of the correlation coefficients is in accordance with that of the component variances for the AAindex1 dataset. That is to say, for the case of the first principal component, more properties in AAindex1 dataset can be projected into the component and therefore it is more probable for the    [34] for the iEzy-Drug predictor and its reported success rates; b see [35] for the iCDI-Drug predictor and its reported success rates; c see [36] for the iGPCR-Drug predictor and its reported success rates; d see [37] for the iNR-Drug predictor and its reported success rates. The first column denotes the top components in PCA calculation; the second column denotes the property accessions in AAindex1 dataset. component to have bigger correlation coefficient to these properties. Moreover, the paper proposed a method to identify drugtarget interactions only with the physicochemical properties of amino acids from AAindex1. The reason of using physicochemical properties alone is in that it can extend the proposed method to most cases of drug-target predictions. The model with more functional and structural features may lead to better performance than that with only physicochemical properties, but it limited the applications of the structurebased methods without the information of functional and structural features for drug-target predictions.

Conclusions
This paper proposed an ensemble system integrating kNN classifier with a novel feature encoding scheme to identify drug-target interactions. The features of physicochemical properties from AAindex1 for targets and those of descriptors for drugs are catenated for representing each drug-target pair. The feature space for targets and that for drugs are individually divided into PCA = 7 subspaces in this work. A pair of drug-target feature subspaces forms an input dataset and then is input into a kNN classifier. The total PCA × PCA = 49 feature subspaces lead to 49 individual kNN classifiers. The ensemble system performs the drugtarget interaction predictions. Experimental results on the commonly used drug-target dataset showed that our method performs better and runs faster than the state-of-the-art predictors.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.