SDTRLS: Predicting Drug-Target Interactions for Complex Diseases Based on Chemical Substructures

1 School of Information Science and Engineering, Central South University, Changsha, Hunan 410083, China
2 School of Computer and Information, Qiannan Normal University for Nationalities, Duyun, Guizhou 558000, China
3 Department of Mechanical Engineering and Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK, Canada
4 Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA


Introduction
The identification of target molecules associated with specific diseases is the basis of modern drug discovery and development [1][2][3]. Therefore, the identification of drug-target interactions (DTIs) is important for drug development. However, drug discovery is well known to be a costly and time-consuming process. According to statistical data from the US Food and Drug Administration, discovering a new drug costs approximately $1.8 billion and takes an average of 13 years [4]. Reducing this cost and time has therefore become a pressing issue. Over the past decades, as computing technology has advanced, many researchers and organizations have developed computational methods and tools [5][6][7][8][9][10][11][12][13] to predict potential DTIs on a large scale and to support drug repositioning.
Meanwhile, a large amount of DTI data has been generated with the rapid growth of public chemical and biological databases. For example, PubChem [14] is a freely available chemistry database. The DrugBank [15] database currently contains 7759 drug entities, 4104 target proteins, and 15,199 DTIs. The freely available online ChEMBL [16] database provides pharmaceutical chemists with a convenient platform for querying target bioactivity data for compounds or targets. In addition, various other resources, including TTD [17], KEGG [18], SIDER [19], STITCH [20], STRING [21], and BindingDB [22], have established the basis for DTI prediction.
Computational methods now make it possible to identify potential DTIs and repurpose existing drugs [23][24][25][26][27] quickly and inexpensively. These methods fall mainly into three categories: basic network-based models, machine learning-based models, and other similarity-based approaches [28].

Complexity
From the viewpoint of basic network-based models, Cheng et al. [29] developed a method to predict DTIs through network-based inference (NBI). Compared with drug-based similarity inference (DBSI) and target-based similarity inference (TBSI), NBI performs better because it makes full use of the known DTIs. Moreover, node- and edge-weighted NBI were developed by assigning weights to the nodes and edges of the drug-target network. Chen et al. developed Network-based Random Walk with Restart on the Heterogeneous network (NRWRH), which implements the random walk on a heterogeneous network (protein-protein similarity network, drug-drug similarity network, and known drug-target interaction network) [30]. It is an enhanced version of the traditional random walk that improves predictive performance by making full use of the data in the integrated heterogeneous network.
Some machine learning-based approaches were also developed to predict DTIs. Following Bleakley et al. [31] and Mordelet and Vert [32], Bleakley and Yamanishi [33] further proposed the bipartite local model (BLM) to predict DTIs, which used local support vector machine (SVM) classifiers with known DTIs and integrated chemical structure similarity and protein sequence similarity information. van Laarhoven et al. [34] achieved significant improvements with Gaussian Interaction Profile (GIP) kernels on drug-target networks. To address the problem of negative samples, Lan et al. [35] proposed a prediction method (PUDT) that classified unlabeled samples into reliable negative examples and likely negative examples based on the similarity of protein structure, and it achieved good results.
The matrix decomposition technique is also used for predicting DTIs, miRNA-disease associations [36,37], and so on. It maps the DTI matrix to a low-dimensional matrix in order to infer hidden interactions from the known interactions. Gönen [38] proposed a Bayesian model that combined dimensionality reduction, matrix factorization, and binary classification for predicting DTIs by integrating drug-drug chemical similarity and protein-protein sequence similarity. The Multiple Similarities Collaborative Matrix Factorization (MSCMF) [39] method projected drugs and targets into a common low-rank feature space and significantly improved the results by adjusting the weights of the similarity matrices of drugs and of targets. Ezzat et al. [40] developed a regularized matrix factorization method that uses additional similarity information to account for the fact that many of the nonoccurring edges in the interaction matrix are actually unknown or hidden cases. DrugE-Rank [12] is a machine learning-based model that combines the advantages of feature-based and similarity-based methods to improve prediction performance.
Although the above methods have achieved good results in predicting new DTIs for known drugs, it is also important to predict DTIs of failed drugs and new chemical entities. Thousands of drugs have failed in clinical phases, and the US National Center for Advancing Translational Sciences is paying US$20 million for research on repurposing 58 failed drugs [41,42], because drugs that failed for their initially targeted diseases may be effective in other diseases. Wu et al. [43] proposed an integrated network and chemoinformatics tool for systematic prediction of DTIs and drug repositioning, namely, SDTNBI (substructure-drug-target network-based inference), which predicted new DTIs of failed drugs and new chemical entities by integrating known DTIs and the chemical substructures of failed drugs or new chemical entities by means of resource diffusion. Their study assumed that chemical substructures play key roles in DTIs. This method achieved good prediction results for large-scale failed drugs and new chemical entities based on the chemical substructures shared between them and the known drugs.
In this study, we propose a method called SDTRLS (substructure-drug-target Kronecker product kernel regularized least squares) for large-scale DTI prediction and drug repositioning based on the chemical substructures of known drugs, failed drugs, and new chemical entities. Firstly, we compute the substructure similarity and then create Gaussian Interaction Profile (GIP) kernels for drug entities and target proteins based on known DTIs. The K-nearest neighbor (KNN) method is used to compute the initial interaction scores of a new chemical entity or failed drug that has no known DTIs. The substructure similarity and the GIP kernel of drugs are then integrated through similarity network fusion (SNF) [44]. SNF substantially outperforms single-type data analysis and establishes an integrative approach to predicting DTIs. Finally, the RLS-Kron [34] classifier is used to predict DTIs; it constructs a large kernel that directly relates drug-target pairs by combining the similarity kernels of drug entities and target proteins. To assess the performance of our method comprehensively, we compare it against current state-of-the-art algorithms with the same data and evaluation criteria. We use 10-fold cross validation and external validation to show the accuracy and robustness of our method. The computational results show that our proposed SDTRLS is comparable to the other five methods in terms of stability. In particular, on the G protein-coupled receptors (GPCRs) external validation dataset, the maximum and average AUC values were 0.842 and 0.826, respectively, which are superior to the 0.797 and 0.766 of the state-of-the-art SDTNBI method. To further confirm the prediction ability of SDTRLS, we perform an experimental analysis on some prediction results. In summary, we provide a new alternative method for DTI prediction for known drugs, failed drugs, and new chemical entities. It provides a basis for drug discovery, development, and personalized medical treatment in the future.

Materials
This study used five internal validation datasets and two external validation datasets. The internal datasets are used to validate the predictions of new DTIs of known drugs, and the external datasets are used to validate the predictions of all DTIs of new entities and failed drugs. The five internal datasets are G protein-coupled receptors (GPCRs), kinase superfamily (Kinases), ion channels (ICs), nuclear receptors (NRs), and Global. GPCRs and Kinases were downloaded from the ChEMBL database. ICs and NRs were collected from the ChEMBL and BindingDB databases. Global is a global network covering genome-wide targets, whose drugs all come from the DrugBank database. The two external datasets, ExGPCRs and ExKinases, were selected from the GPCRs and Kinases in the DrugBank database, respectively. The external validation predicts all DTIs for drugs, so it needs a basic dataset that includes drugs, targets, and known DTIs. GPCRs and Kinases are the basic datasets for ExGPCRs and ExKinases, respectively. The 17,111 known DTIs of GPCRs are the prior knowledge for the external validation of ExGPCRs in Table 1.
Table 1 shows that the 92 drugs of ExGPCRs and the 4741 drugs of GPCRs are independent of each other. However, the 46 targets of ExGPCRs are a subset of the 92 targets of GPCRs. The relationship of drugs and targets between Kinases and ExKinases is the same as that between GPCRs and ExGPCRs. These datasets can be downloaded from http://lmmd.ecust.edu.cn/methods/sdtnbi/#*. Table 1 contains some statistics of the five internal validation datasets and the two external datasets.

Chemical Substructure.
In this study, we used seven types of fingerprints to express the chemical substructures of each molecule. All substructure data were generated with the PaDEL-Descriptor software, including CDK Fingerprint, CDK Extended Fingerprint, CDK Graph Only Fingerprint, Substructure Fingerprint, Klekota-Roth Fingerprint, MACCS Fingerprint, and PubChem Fingerprint, abbreviated as CDK, CDKExt, Graph, FP4, KR, MACCS, and PubChem, respectively. For each type, the substructures of a molecule are represented by a multidimensional vector with values of 0 or 1. We only used the substructures that appear in the datasets.
Table 2 contains the overview of the seven substructures of dataset GPCRs, including the dimension of each chemical substructure.The dimensions of substructures were derived from the statistics result of the datasets that include all appearing substructure types.

Chemical Substructure Similarity.
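Let $S = \{s_1, s_2, \ldots, s_l\}$ be the set of all substructures for one of the seven chemical substructure types, where $l$ is the dimension of the chemical substructure. For example, the value of $l$ is 1024 in CDK and the value of $l$ is 153 in MACCS. Let $D = \{d_1, d_2, \ldots, d_n\}$ be the set of $n$ drugs, each represented for a given substructure type by a binary profile whose $k$th entry $s_k(d_i)$ is 1 if drug $d_i$ has $s_k$ and 0 otherwise. The substructure similarity of drugs $d_i$ and $d_j$ is computed by the weighted cosine correlation coefficient [27]:

$$S_{\mathrm{subsim}}(d_i, d_j) = \frac{\sum_{k=1}^{l} w_k\, s_k(d_i)\, s_k(d_j)}{\sqrt{\sum_{k=1}^{l} w_k\, s_k(d_i)^2}\,\sqrt{\sum_{k=1}^{l} w_k\, s_k(d_j)^2}} \tag{1}$$

where $w_k$ is the weight of the $k$th substructure ($s_k$), which can be calculated by the formula [27]

$$w_k = \exp\left(-\frac{f_k^2}{\sigma_f^2 h^2}\right) \tag{2}$$

where $f_k$ is the frequency of chemical substructure $s_k$ in the whole dataset, $\sigma_f$ is the standard deviation of $\{f_k\}_{k=1}^{l}$, and $h$ is a parameter (set to be 0.1 in this study). The basic rationale for introducing the weight when computing substructure similarity between drugs and new chemical entities is that substructures with fewer occurrences should carry more weight than substructures that appear frequently.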
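The weighted cosine similarity described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `substructure_similarity` and the fingerprint matrix `F` are hypothetical, the weight is taken to be a Gaussian function of the substructure frequency, and the weights are rescaled by a common factor (which leaves a cosine unchanged) to avoid floating-point underflow when `h` is small.

```python
import numpy as np

def substructure_similarity(F, h=0.1):
    """Weighted cosine similarity between binary fingerprint profiles.

    F is an (n_drugs, n_substructures) 0/1 matrix (hypothetical input).
    Rare substructures receive larger weights than frequent ones.
    """
    F = np.asarray(F, dtype=float)
    freq = F.sum(axis=0)                  # f_k: occurrences of substructure k
    std = freq.std()                      # standard deviation of {f_k}
    if std == 0:
        w = np.ones_like(freq)            # all frequencies equal: uniform weights
    else:
        expo = freq ** 2 / (std ** 2 * h ** 2)
        # Subtracting the minimum rescales every weight by the same constant,
        # which does not change the cosine but prevents underflow.
        w = np.exp(-(expo - expo.min()))
    weighted = F * w                      # w_k * s_k(d_i)
    num = weighted @ F.T                  # sum_k w_k s_k(d_i) s_k(d_j)
    norm = np.sqrt(np.einsum("ik,ik->i", weighted, F))
    denom = np.outer(norm, norm)
    # Pairs whose weighted norm is zero get similarity 0 (an assumption).
    return np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)
```

With the small default `h`, the weights of frequent substructures can become numerically zero, so a drug that contains only frequent substructures ends up with an all-zero row; a larger `h` softens this effect.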

Gaussian Interaction Profile Kernel.
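Let $T = \{t_1, t_2, \ldots, t_m\}$ be the set of $m$ targets. A drug-target network can be represented by a bipartite graph with an adjacency matrix $Y \in \mathbb{R}^{n \times m}$, where the value of $y_{ij}$ is 1 if $d_i$ and $t_j$ have a known DTI and 0 otherwise. The Gaussian Interaction Profile (GIP) kernel is constructed from the topology information of the known DTI network [10,34]. The kernel of drugs $d_i$ and $d_j$ can be formulated as

$$K_{\mathrm{GIP},d}(d_i, d_j) = \exp\left(-\gamma_d \left\| y(d_i) - y(d_j) \right\|^2\right) \tag{3}$$

where $y(d_i) = \{y_{i1}, y_{i2}, \ldots, y_{im}\}$ is the interaction profile of drug $d_i$, and $\gamma_d$ is a parameter that controls the bandwidth; we set the value to be 1 in this study. Similarly, the kernel of targets $t_i$ and $t_j$ can be calculated by

$$K_{\mathrm{GIP},t}(t_i, t_j) = \exp\left(-\gamma_t \left\| y(t_i) - y(t_j) \right\|^2\right) \tag{4}$$

where $y(t_i) = \{y_{1i}, y_{2i}, \ldots, y_{ni}\}$ is the interaction profile of target $t_i$.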
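As a small illustration of the GIP kernel, the following Python sketch computes the kernel for drugs from the rows of the interaction matrix and for targets from its columns. The function name `gip_kernel` is hypothetical, and the bandwidth is used directly as given (the text sets it to 1); any additional bandwidth normalization used in [34] is not included here.

```python
import numpy as np

def gip_kernel(profiles, gamma=1.0):
    """Gaussian Interaction Profile kernel between the rows of `profiles`."""
    P = np.asarray(profiles, dtype=float)
    sq_norms = (P ** 2).sum(axis=1)
    # Squared Euclidean distance between every pair of interaction profiles.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (P @ P.T)
    sq_dists = np.maximum(sq_dists, 0.0)   # guard against tiny negative values
    return np.exp(-gamma * sq_dists)

# Toy interaction matrix: 3 drugs (rows) x 3 targets (columns).
Y = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0]])
K_drug = gip_kernel(Y)       # drug kernel from row profiles
K_target = gip_kernel(Y.T)   # target kernel from column profiles
```

Drugs with identical interaction profiles receive a kernel value of 1, and the value decays exponentially with the number of targets on which two profiles disagree.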

Similarity Network Fusion.
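We have two similarity matrices for drugs (including known drugs and new chemical entities), namely, the substructure similarity $S_{\mathrm{subsim}} \in \mathbb{R}^{n \times n}$ and the GIP kernel $K_{\mathrm{GIP},d} \in \mathbb{R}^{n \times n}$. To construct a more comprehensive similarity kernel for drugs, we used the SNF method to fuse the two similarity kernels. Firstly, the row-normalized matrices $P^{(1)}$ and $P^{(2)}$ are calculated from the drug similarity matrices $S_{\mathrm{subsim}}$ and $K_{\mathrm{GIP},d}$, respectively. Secondly, according to the K-nearest neighbors (KNN) method, the resultant matrices $L^{(1)}$ and $L^{(2)}$ are obtained from $P^{(1)}$ and $P^{(2)}$ by the following equation [44]:

$$L^{(v)}(i, j) = \begin{cases} \dfrac{P^{(v)}(i, j)}{\sum_{d_k \in N(d_i)} P^{(v)}(i, k)}, & d_j \in N(d_i) \\ 0, & \text{otherwise} \end{cases} \qquad v = 1, 2 \tag{5}$$

where $N(d_i)$ is the set of the most similar neighbors of drug $d_i$.
In this study, we set the size of $N(d_i)$ to be 50. The main idea of SNF is iteratively updating the similarity matrices $P^{(1)}$ and $P^{(2)}$ [44].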
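The fusion step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it follows the general cross-diffusion scheme of SNF [44], in which each normalized matrix is repeatedly propagated through the sparse neighbor matrix of the other view. The function names, the iteration count `n_iter`, the renormalization after every step, and the final averaging and symmetrization are all assumptions made for illustration.

```python
import numpy as np

def row_normalize(W):
    """Scale every row of a nonnegative matrix so that it sums to 1."""
    W = np.asarray(W, dtype=float)
    return W / W.sum(axis=1, keepdims=True)

def knn_local(P, k):
    """Keep the k largest entries of each row and renormalize the row."""
    n = P.shape[0]
    L = np.zeros_like(P)
    idx = np.argsort(-P, axis=1)[:, :k]      # column indices of the top-k entries
    rows = np.arange(n)[:, None]
    L[rows, idx] = P[rows, idx]
    return L / L.sum(axis=1, keepdims=True)

def snf_fuse(S1, S2, k=50, n_iter=20):
    """Fuse two drug similarity matrices by iterative cross-diffusion."""
    P1, P2 = row_normalize(S1), row_normalize(S2)
    L1, L2 = knn_local(P1, k), knn_local(P2, k)
    for _ in range(n_iter):
        # Each view is updated from the current state of the other view.
        P1_next = L1 @ P2 @ L1.T
        P2_next = L2 @ P1 @ L2.T
        P1, P2 = row_normalize(P1_next), row_normalize(P2_next)
    fused = (P1 + P2) / 2.0
    return (fused + fused.T) / 2.0           # symmetrize the fused kernel
```

In this sketch the returned matrix plays the role of the fused drug kernel that is later combined with the target kernel; `k=50` mirrors the neighbor count reported in the text.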

Kron RLS.
Kronecker product kernels are widely used for prediction problems in other studies [45][46][47]. In this study, we also use a Kronecker product kernel to construct a larger kernel for the drug-target pairs. The prediction of DTIs is then based on the ranking of the pairs, which include known drugs and targets as well as new entities or failed drugs and targets. A higher rank implies a higher probability that an interaction exists. Based on the kernels of drugs and targets, the Kronecker product kernel of drug-target pairs is constructed as follows [34]:

$$K\big((d_i, t_j), (d_k, t_l)\big) = K_d(d_i, d_k)\, K_t(t_j, t_l) \tag{8}$$

where $K_d(d_i, d_k)$ is the $(i, k)$th element of the drug kernel $K_{\mathrm{final}}$, while $K_t(t_j, t_l)$ is the $(j, l)$th element of the target kernel $K_{\mathrm{GIP},t}$.
According to the Kronecker product kernel of formula (8), the predictions of DTIs for all drug-target pairs can be calculated as follows [34]:

$$\mathrm{vec}(\hat{Y}^T) = K (K + \sigma I)^{-1}\, \mathrm{vec}(Y^T) \tag{9}$$

where $\sigma$ is a regularization parameter. A higher value of $\sigma$ gives a smoother result. We get $\hat{Y} = Y$ when $\sigma = 0$, which shows no generalization [34]. We also use the eigendecompositions of the kernel matrices, following the study of van Laarhoven et al. [34]. The eigendecompositions of the matrices $K_d$ and $K_t$ are $K_d = V_d \Lambda_d V_d^T$ and $K_t = V_t \Lambda_t V_t^T$, in which $V_d$ and $V_t$ are the unitary matrices of eigenvectors, and $\Lambda_d$ and $\Lambda_t$ are the diagonal matrices of eigenvalues for drugs and targets, respectively. Since the eigenvalues (vectors) of a Kronecker product are the Kronecker products of the eigenvalues (vectors), the Kronecker product kernel of drug-target pairs can be formulated as follows [34]:

$$K = K_d \otimes K_t = V \Lambda V^T \tag{10}$$

in which

$$\Lambda = \Lambda_d \otimes \Lambda_t, \qquad V = V_d \otimes V_t. \tag{11}$$

KNN for New Chemical Entities.
New chemical entities or failed drugs have no known associations with targets, which makes it impossible for existing methods to predict further associations. In this study, we used the KNN method to estimate the interaction scores for new chemical entities or failed drugs from the similarity between them and known drugs. For example, we denote a new chemical entity or failed drug as $d_{\mathrm{new}}$, whose interaction score with target $t_j$ can be computed by the formula

$$y(d_{\mathrm{new}}, t_j) = \sum_{d_i \in N_{\mathrm{new}}} S_{\mathrm{subsim}}(d_{\mathrm{new}}, d_i)\, y_{ij} \tag{12}$$

where $S_{\mathrm{subsim}}(d_{\mathrm{new}}, d_i)$ is the $(\mathrm{new}, i)$th element of the chemical substructure similarity matrix $S_{\mathrm{subsim}} \in \mathbb{R}^{n \times n}$, and $y_{ij}$ is the $(i, j)$th element of $Y \in \mathbb{R}^{n \times m}$. $N_{\mathrm{new}}$ is the set of the top $K$ neighbors of $d_{\mathrm{new}}$ according to the $S_{\mathrm{subsim}}$ matrix. In this study, we set the value of $K$ to be 4.
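The two steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the KNN score is taken to be a similarity-weighted sum over the nearest known drugs, and the default `sigma=1.0` is an assumption because the text does not report the value used. The Kron RLS function exploits the eigendecompositions of the two kernels, so the full Kronecker product kernel is never formed.

```python
import numpy as np

def knn_initial_scores(sim_to_known, Y, k=4):
    """Initial interaction scores of a new drug from its k most similar known drugs.

    sim_to_known: similarities between the new drug and each known drug.
    Y: (n_known_drugs, n_targets) interaction matrix of the known drugs.
    """
    sim = np.asarray(sim_to_known, dtype=float)
    Y = np.asarray(Y, dtype=float)
    top = np.argsort(-sim)[:k]            # indices of the k nearest known drugs
    return sim[top] @ Y[top]              # similarity-weighted sum of their profiles

def kron_rls(K_d, K_t, Y, sigma=1.0):
    """Kronecker RLS scores for all drug-target pairs via eigendecomposition."""
    lam_d, V_d = np.linalg.eigh(K_d)      # K_d = V_d diag(lam_d) V_d^T
    lam_t, V_t = np.linalg.eigh(K_t)      # K_t = V_t diag(lam_t) V_t^T
    lam = np.outer(lam_d, lam_t)          # eigenvalues of the Kronecker product
    C = V_d.T @ np.asarray(Y, dtype=float) @ V_t
    # Apply the spectral filter lam / (lam + sigma) and transform back.
    return V_d @ (C * lam / (lam + sigma)) @ V_t.T
```

For a dataset with `n` drugs and `m` targets, this avoids building and inverting the `nm x nm` kernel, which is the practical reason for working with the eigendecompositions of the two smaller kernels.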

Benchmark Evaluation and Evaluation Indices.
To demonstrate the performance of our method, we adopt 10-fold cross validation and external validation. The 10-fold cross validation has been widely used in the prediction of DTIs [29,48,49] and in other interaction prediction tasks in bioinformatics. In the main experiment, the whole dataset is randomly divided into 10 groups; each group is used in turn as the testing set while the remaining 9 groups form the training set, so the process is repeated 10 times. Furthermore, the DTIs of new chemical entities and failed drugs are a very important part of this study. We use two external datasets (ExGPCRs and ExKinases) to evaluate the performance of our method by predicting all interactions of their drugs.
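The 10-fold protocol can be sketched in Python as follows, together with a rank-based computation of the area under the ROC curve. This is an illustrative sketch under stated assumptions, not the exact evaluation code of the study: the function names are hypothetical, the folds are drawn over all drug-target pairs, held-out entries are hidden by setting them to 0, and `predict` stands for any scoring function that maps a training matrix to a score matrix of the same shape.

```python
import numpy as np

def auc_score(labels, scores):
    """Area under the ROC curve from average ranks (ties handled)."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    # Average 1-based rank of each distinct score value.
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    n_pos = labels.sum()
    n_neg = labels.size - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def ten_fold_auc(Y, predict, n_folds=10, seed=0):
    """Mean AUC over folds of randomly partitioned drug-target pairs."""
    Y = np.asarray(Y, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(Y.size)
    aucs = []
    for fold in np.array_split(order, n_folds):
        train = Y.copy()
        train.flat[fold] = 0               # hide the held-out pairs
        scores = predict(train)
        truth = Y.flat[fold]
        if truth.min() == truth.max():
            continue                       # AUC is undefined for a single class
        aucs.append(auc_score(truth, scores.flat[fold]))
    return float(np.mean(aucs))
```

A predictor built from the kernels and the Kron RLS step would be passed as `predict`; in that case the kernels that depend on known DTIs should be recomputed from the training matrix inside each fold so that held-out interactions do not leak into the model.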
We use the AUC (area under the ROC curve) as the evaluation metric for SDTRLS, as was done for SDTNBI, and the values in Tables 3, 5, and 6 are presented in the format […].

Cross Validation.
Table 3 describes the performance evaluation index values in the 10-fold cross validation for the 5 datasets. On the GPCRs dataset, the minimum AUC of SDTRLS among the seven substructures reaches 0.979 and the average is 0.981, which indicates good prediction results. However, on the NRs dataset, the validation results of each substructure are relatively poor: the minimum value is 0.905, based on the Graph substructure, while the maximum value is 0.916. On the Kinases dataset, the validation results are also very stable, with maximum and minimum values of 0.973 and 0.969, respectively. On the ICs dataset, the validation results are good: the minimum AUC is 0.943 with the FP4 substructure, and the maximum is 0.956 with the CDK substructure. Similarly, on the Global dataset, the results are stable, although the AUC values of the 7 substructures are slightly lower, between 0.935 and 0.936. In general, the validation results on the GPCRs and Kinases datasets are better than those on the other three datasets. Moreover, the prediction performances of EWNBI, NWNBI, and NBI on the Kinases dataset are slightly better than that of SDTRLS, while SDTRLS has an obvious advantage on the ICs, NRs, and Global datasets. In addition, because the authors did not provide the data needed for the EWNBI method on three datasets (ICs, NRs, and Global) and its prediction results on the GPCRs and Kinases datasets are not good, we do not compute the AUC values of the EWNBI method on these three datasets. Overall, SDTRLS and SDTNBI provide more stable prediction results on the 5 datasets.

External Validation.
Table 4 describes the evaluation results of six methods on the two external datasets ExGPCRs and ExKinases; the basic datasets are GPCRs and Kinases, respectively. Overall, the external validation results of all prediction methods are worse than the 10-fold cross validation results because new chemical entities have no known DTIs. On the ExGPCRs dataset, the AUC values of SDTRLS on the 7 substructures are between 0.804 and 0.842. On the ExKinases dataset, the AUC values of SDTRLS on the 7 substructures are between 0.827 and 0.855. As can be seen from Table 4, the validation results of all approaches on ExKinases are better than those on ExGPCRs. On the ExKinases dataset, there are no obvious differences in AUC values among DBSI-R, SDTNBI, and SDTRLS. On ExGPCRs, SDTRLS demonstrates its excellent prediction power.

Comparison with Previous Methods.
The datasets used in this study are derived from those used for the SDTNBI method, which, as the state-of-the-art method, has more stable prediction performance than the other 4 methods. In this study, we therefore compare SDTRLS with SDTNBI by t-test statistical analysis, and with the other 5 methods in terms of the parameter-independent AUC value.
Table 5 shows the t-test results of SDTNBI and SDTRLS on the five datasets GPCRs, Kinases, ICs, NRs, and Global. We can see from Table 5 that the average AUC of our method on each dataset is greater than that of the SDTNBI method, especially on the GPCRs and Kinases datasets, where it rises from 0.928 to 0.981 and from 0.919 to 0.971, respectively. Moreover, there are significant differences (p < 0.05) in the comparison results on the GPCRs, Kinases, and NRs datasets; in particular, the difference is more significant (p < 0.01) on the GPCRs and Kinases datasets. In conclusion, our method is more stable than the SDTNBI method in terms of the 10-fold cross validation.
We also compare the prediction results with four other methods on the five datasets GPCRs, Kinases, ICs, NRs, and Global. The four competing methods are NBI, NWNBI, EWNBI, and DBSI-R [29]. NBI applies a mass diffusion-based method to obtain the predicted list by considering the bipartite graph. In the EWNBI method, the DTI network is weighted by the potency of the binding affinity or inhibitory activity of the interactions between drugs and targets. The theoretical basis of the NWNBI method is that a hub node is more difficult to influence. The DBSI method is based on the hypothesis that two similar drugs may have similar targets. Table 3 shows that the SDTRLS method is slightly better than the NBI, NWNBI, and EWNBI methods on GPCRs and much better than the DBSI-R method. In addition, the SDTRLS method is much better than the DBSI-R method while being comparable with the NBI, NWNBI, and EWNBI methods on Kinases. In general, the SDTRLS approach is comparable to these four methods according to the results of the 10-fold cross validation on the GPCRs and Kinases datasets.
Table 6 shows the t-test results of SDTNBI and SDTRLS on the two datasets ExGPCRs and ExKinases. From Table 6 we can see that our method greatly outperforms the SDTNBI method on ExGPCRs in terms of the average AUC and the t-test result (p < 0.01). In addition, the average AUC of our method is slightly lower than that of the SDTNBI method on ExKinases, which may be due to the sparsity of known DTIs in this dataset.
We also compare the prediction results of our method with those of the other four competing methods on the same datasets, ExGPCRs and ExKinases. We can see from Table 4 that the SDTRLS method outperforms the other four competing methods on the ExGPCRs dataset. In addition, the SDTRLS method is comparable with the other four competing methods on the ExKinases dataset.

Parameter Analysis for $N(d_i)$ and $K$.
In this section, we analyzed two parameters: the number of neighbors $N(d_i)$ for similarity network fusion and $K$ for new chemical entities. The parameter $h$ was set to be 1 according to previous study [27]. Moreover, since GIP is widely used in other studies [10,34,37,50,51], we also set the values of both $\gamma_d$ and $\gamma_t$ to be 1. All results were validated by external validation on the ExGPCRs dataset based on the MACCS and Graph substructures. Figure 1 shows the sensitivity of the prediction performance of SDTRLS to different numbers of neighbors $N(d_i)$ in similarity network fusion. SDTRLS had stable prediction performance over a wide range, from 10 to 100. The impact of the parameter $K$ for new chemical entities on the prediction performance of SDTRLS, in terms of the AUC value, is illustrated in Figure 2. SDTRLS was robust to different values of the parameter $K$.

Case Studies.
To further confirm the prediction ability of our method, we conduct an experimental analysis on the ExGPCRs dataset, whose known DTIs are not used as a priori knowledge when conducting external validation. The selected predictions are checked against the DrugBank, ChEMBL, and KEGG databases. Table 7 describes the confirmed results based on the ExGPCRs dataset. We select the top five predicted interactions of 5 drugs; the top-ranked predicted interaction of every drug is confirmed by searching the databases. Furthermore, 76% of all predicted DTIs (19 out of 25) are confirmed by the three databases, and 32% of the predicted DTIs (8 out of 25) are simultaneously confirmed by two databases. In particular, the predicted results for DB00209, DB00283, and DB00334 are all confirmed by several databases. In addition, we further examine the results marked as unknown in the prediction results by searching the relevant literature for related descriptions. For example, thiethylperazine (DB00372) is an antagonist of human Dopamine D3 (hDRD3 170) according to the description in Petsko and Ringe [52], which shows that the prediction result is meaningful; the remaining unknown DTIs deserve to be validated in the future. In general, this indicates that our method is effective in practical applications.

Conclusions
A systematic understanding of the interactions between chemical compounds and target proteins is very important for new drug design and development. In the past decades, to overcome the time-consuming nature of traditional biochemical methods, many computational approaches, such as machine learning and network inference, have been developed to predict DTIs. However, these methods mainly focused on new DTIs for known drugs and paid less attention to DTIs for new chemical entities. In addition, their prediction performances are not good enough. In this study, we have constructed the similarity kernel of approved drugs, failed drugs, and new chemical entities by weighting the chemical substructures. Then, GIP kernels were calculated for drugs and targets according to the known DTIs. For the new chemical entities or failed drugs, we used KNN to initialize the DTIs before calculating the GIP kernel. To construct a comprehensive similarity kernel for drugs, the SNF method was used to fuse the GIP kernel and the substructure similarity kernel. Finally, the scores of drug-target pairs were predicted by Kron RLS. We compared the prediction performance with other competing methods via 10-fold cross validation and external validation. However, there are still some limitations in this study. First, since the target set is specified within the current datasets, it may not be possible to predict the DTIs of targets beyond the datasets. Second, other similarity information of targets, such as the sequence and functional network [53][54][55][56], is not used when the similarity kernel of targets is constructed. In addition, the 3D structure of drugs may also need to be considered as important information. It is expected that additional information may improve prediction performance. In the future, more information using other methods such as ClusterViz [57] should be integrated to develop a more efficient prediction method. Nevertheless, this study provides an important basis for new drug development and drug repositioning and also plays an important role in the development of personalized medicine.

Figure 1: Robustness of SDTRLS with respect to the number of neighbors $N(d_i)$: the dotted line marks the default value and its prediction performance.

Figure 2: Robustness of SDTRLS with respect to the number of neighbors $K$: the dotted line marks the default value and its prediction performance.

Table 1: Drugs, targets, and DTIs in each dataset. The columns give the number of drugs, the number of targets, the number of known DTIs, and the sparsity, which is the proportion of known DTIs to all possible DTIs in the dataset.

Table 2: The dimensions of the seven substructure types on the GPCRs dataset.

$D = \{d_1, d_2, \ldots, d_n\}$ is the set of drugs, where $n$ is the number of drugs. For one chemical substructure type, drug $d_i$ can be represented by a profile (binary vector) of the substructure, that is, $s(d_i) = \{s_1(d_i), s_2(d_i), \ldots, s_l(d_i)\}$. If drug $d_i$ has $s_k$, the value of $s_k(d_i)$ is 1, otherwise 0. For a type of chemical substructure, the substructure similarity $S_{\mathrm{subsim}}(d_i, d_j)$ of drugs $d_i$ and $d_j$ can be computed by the weighted cosine correlation coefficient based on the substructure information.

Table 3: The performance of 10-fold cross validation on 5 datasets. 0 indicates that we did not compute the prediction performance because the required data were unavailable; * marks prediction results taken from previous studies.

Table 5: The t-test results of 10-fold cross validation on 5 datasets.

Table 6: The t-test results of the two external validations.

Table 7: Newly confirmed drug-target interactions based on the Graph substructure in the ExGPCRs dataset.