Drug-Target Interaction Prediction via Dual Laplacian Graph Regularized Matrix Completion

Drug-target interactions play an important role in biomedical drug discovery and development. However, determining them experimentally is expensive and time-consuming. Therefore, developing computational techniques for drug-target interaction prediction is urgent and of practical significance. In this work, we propose an effective computational model based on dual Laplacian graph regularized matrix completion, referred to as DLGRMC, to infer unknown drug-target interactions. Specifically, DLGRMC transforms the task of drug-target interaction prediction into a matrix completion problem, in which the potential interactions between drugs and targets are obtained from the prediction scores after the matrix completion procedure. In DLGRMC, the pairwise chemical structure similarities between drugs and the pairwise genomic sequence similarities between targets are fully exploited through a dual Laplacian graph regularization term; i.e., drugs with similar chemical structures are more likely to interact with similar targets, and targets with similar genomic sequences are more likely to interact with similar drugs. In addition, during the matrix completion process, a binary indicator matrix that marks the indices of the observed drug-target interactions is used to preserve the experimentally confirmed interactions. Furthermore, we develop an alternating iterative strategy based on the Augmented Lagrange Multiplier algorithm to solve the constrained matrix completion problem. We evaluate DLGRMC on five benchmark datasets, and the results show that DLGRMC outperforms several state-of-the-art approaches in terms of AUPR values and PR curves under 10-fold cross validation. Case studies also demonstrate that DLGRMC can successfully predict most of the experimentally validated drug-target interactions.


Introduction
Identifying potential drug-target interactions (DTIs) is a challenging and meaningful step in precision medicine and biomedical research [1][2][3][4][5][6][7][8]; it is also crucial to the drug discovery process. With predicted positive DTIs, one can find novel targets for existing drugs or identify targets for new drugs [9][10][11][12]. Although there are almost 30,000 human genes, fewer than 400 of them could be used as drug targets in the treatment of diseases [13]. Therefore, identifying more DTIs is an extremely valuable task that can bring major breakthroughs in biopharmaceutical and biomedical research.
The traditional and most reliable methods for DTI prediction are biochemical experiments, but these methods are very expensive and time-consuming. Thus, only a small number of DTIs have been validated experimentally. This motivates the development of computational methods for DTI prediction. In addition, various experimental data on drugs and genes, available from resources such as KEGG [14], DrugBank [15], and GenBank [16], also support the development of computational techniques to infer potential DTIs.
A wide variety of computational techniques for DTI prediction have been proposed, and these techniques often rely on machine learning models such as support vector machines (SVMs) [17][18][19][20], logistic regression [21,22], naive Bayesian classifiers [23], matrix factorization, kernel learning, and network inference. Bai et al. [18] applied a genetic algorithm to screen related compounds, and the drug-target pairs with strong binding capacity were then found with SVM and particle swarm optimization. Garcia-Sosa et al. [21,23] used logistic regression and naive Bayesian classifiers for the classification of compounds. In [24], experimentally validated targets are employed to train an SVM model and find potential proteins with similar structure. Matrix factorization based methods decompose the matrix that represents the drug-target network into multiple low-rank matrices. The decomposed matrices, which consist of latent features, are then used to infer drug-target interactions. Bayesian matrix factorization [25] and collaborative matrix factorization [26] are two typical methods. In [27], Ezzat et al. added a dual Laplacian graph regularization term to the matrix factorization model for learning a manifold on which the data are assumed to lie. Typical kernel learning methods include the pair kernel method [28], net Laplacian regularized least squares [29], and regularized least squares with the Kronecker product kernel [30]. Network inference methods usually formulate drug-target interaction prediction as a graph learning problem. Bleakley and Yamanishi [31] proposed a supervised inference method that predicts unknown drug-target interactions by constructing a bipartite graph; the bipartite local model first predicts target proteins of a given drug and then predicts drugs targeting a given protein. As an improved version of the bipartite local model, Mei et al. [32] handled new drug candidates through their neighbors' interaction profiles. By considering the drug-drug similarities and target-target similarities, Chen et al. [33] developed a network-based random walk with restart on the heterogeneous network to predict potential drug-target interactions. Emig et al. [13] introduced a network-based approach which integrates disease gene expression signatures and a molecular interaction network. In order to enhance the similarity measures to include nonstructural information, Shi et al. [34] introduced a new concept named "super-target" to handle the problem of possibly missing interactions. Unlike existing methods, which are based on single-view data, Zhang et al. [11] integrated the drug and target data from different views and proposed a multiview DTI prediction method based on clustering. Li and Cai [35] also extended the single-view low-rank representation model to multiview low-rank embedding for DTI prediction. In [36], Zhang et al. proposed a label propagation method with linear neighborhood information for predicting unobserved drug-target interactions; the drug-drug linear neighborhood similarities are used to rank the interaction scores. A brief review of DTI prediction can be found in [9].
BioMed Research International
Although many methods have been proposed for DTI prediction, the results are still far from satisfactory. The key issue is how to efficiently use the existing validated DTIs and exploit the useful information hidden among drugs or targets [37]. In most existing methods, the drug-drug similarities and target-target similarities play important roles [26-28, 31, 34, 38, 39]. Accordingly, different ways of calculating drug-drug similarities have been proposed, such as cosine similarity, Gaussian similarity, and Jaccard similarity. In this paper, we propose a dual Laplacian graph regularized matrix completion model (DLGRMC) for DTI prediction, in which the drug-drug and target-target similarities are used to construct similarity graphs that encode two assumptions: drugs with similar chemical structures are more likely to interact with similar targets, and targets with similar genomic sequences are more likely to interact with similar drugs. During the matrix completion process, the experimentally validated interactions are preserved by using a binary indicator matrix that records whether a validated interaction exists between a drug and a target. An alternating iterative strategy based on the Augmented Lagrange Multiplier algorithm is developed to solve the constrained matrix completion problem. Extensive experiments on five benchmark datasets are conducted to validate the efficacy of the proposed model. The architecture of our proposed method is shown in Figure 1.

Materials.
In order to evaluate the DTI prediction performance of the proposed DLGRMC, four small-scale benchmark datasets, which correspond to four different target protein types, and one large-scale dataset are used in our experiments: nuclear receptors (NRs), G protein-coupled receptors (GPCRs), ion channels (ICs), enzymes (Es) [40], and DrugBank (DB) [41]. The former four datasets are publicly available at http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/. DrugBank is a bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug-target information. The data used in this study were released on July 03, 2018 (version 5.1.1). The drug and target data were extracted from the DrugBank database website at http://www.drugbank.ca/. We only use the approved drug-target interactions in our experiments, which gives a total of 1936 drugs, 1609 targets, and 7019 approved drug-target interactions. The approved drug structures and approved target sequences were downloaded from https://www.drugbank.ca/releases/latest#structures and https://www.drugbank.ca/releases/latest#target-sequences, respectively. Table 1 summarizes the basic statistics of the datasets. In Table 1, we present three types of information for each dataset, i.e., the experimentally validated DTIs, the similarities between drugs, and the similarities between targets. Specifically, the validated DTIs are obtained from public databases including BRENDA [42], KEGG BRITE [43], DrugBank [44], and SuperTarget [45]. The drug similarities are calculated from the chemical structures of the compounds, which are derived from the DRUG and COMPOUND sections of the KEGG LIGAND database [43].
The chemical structure similarities between compounds are computed by using the SIMCOMP score [46], which provides a global similarity score based on the size of the common substructures between two compounds using a graph alignment algorithm. The similarity between two compounds $c$ and $c'$ is computed as $s_c(c, c') = |c \cap c'| / |c \cup c'|$. By applying this operation to all compound pairs, we can construct a drug similarity matrix. The target similarities are computed from the amino acid sequences of the target proteins, which are obtained from the KEGG GENES database [43]. The sequence similarities between proteins are computed by using a normalized version of the Smith-Waterman score [47]. The normalized Smith-Waterman score between two proteins $g$ and $g'$ is computed as $s_g(g, g') = SW(g, g') / \sqrt{SW(g, g)\, SW(g', g')}$, where $SW(\cdot, \cdot)$ denotes the original Smith-Waterman score. By applying this operation to all protein pairs, we can construct a target similarity matrix.
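As a rough illustration of the two normalizations described above, the following Python sketch computes a set-overlap similarity and rescales a matrix of raw alignment scores. It assumes that each compound is summarized as a set of substructure identifiers (SIMCOMP itself obtains the common substructures by graph alignment, which is not reproduced here) and that the raw Smith-Waterman scores are already available as a symmetric matrix. The function names `overlap_similarity` and `normalize_sw` and the toy values are hypothetical.

```python
import numpy as np


def overlap_similarity(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b| for two sets of substructure identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def normalize_sw(raw: np.ndarray) -> np.ndarray:
    """Rescale raw scores so that S[i, j] = raw[i, j] / sqrt(raw[i, i] * raw[j, j])."""
    self_scores = np.sqrt(np.diag(raw))
    return raw / np.outer(self_scores, self_scores)


# Toy data (hypothetical): two compounds and two proteins.
drug_sim = overlap_similarity({"ring", "amide", "ester"}, {"amide", "ester", "thiol"})
raw_sw = np.array([[100.0, 30.0], [30.0, 64.0]])
target_sim = normalize_sw(raw_sw)
```

After normalization every protein has similarity 1 with itself, so the drug and target similarity matrices live on comparable scales.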

Problem Formulation of DTI Prediction.
In this work, we use two sets $\mathcal{D} = \{d_i\}_{i=1}^{m}$ and $\mathcal{T} = \{t_j\}_{j=1}^{n}$ to denote drugs and targets, respectively. The experimentally validated DTIs are represented by a binary matrix $Y \in \{0, 1\}^{m \times n}$. If a drug $d_i$ has been experimentally validated to interact with a target $t_j$, then $Y_{ij} = 1$; otherwise, $Y_{ij} = 0$. The nonzero elements of $Y$ are called "known interactions" and can be regarded as positive observations, while the zero elements of $Y$ are called "unknown interactions" and can be regarded as negative observations. In addition, the drug similarities are denoted as $S_d \in \mathbb{R}^{m \times m}$, and the target similarities are represented as $S_t \in \mathbb{R}^{n \times n}$. The aim of DTI prediction is to uncover possible interactions among the negative observations by using prior information about drugs and targets. The candidate drug-target pairs are ranked by their predicted probabilities in descending order, and the top-ranked pairs are chosen as predicted interactions.

Matrix Completion.
Matrix completion aims to fill in the missing entries of a partially observed matrix $M$. One of the most widely used models of the matrix completion problem is to find the lowest-rank matrix $X$ that matches the matrix $M$, which we wish to recover, on all entries in the set $\Omega$ of observed entries. The basic mathematical formulation of this problem is as follows:
$$\min_{X} \ \mathrm{rank}(X) \quad \text{s.t.} \quad X_{ij} = M_{ij}, \ (i, j) \in \Omega. \tag{1}$$
Because problem (1) is nonconvex and cannot be solved efficiently, it is usually transformed into the following convex problem by relaxing the rank function into the nuclear norm:
$$\min_{X} \ \|X\|_{*} \quad \text{s.t.} \quad X_{ij} = M_{ij}, \ (i, j) \in \Omega, \tag{2}$$
where $\|\cdot\|_{*}$ is the nuclear norm, which is equal to the sum of the singular values of $X$. Problem (2) can be solved by using the singular value thresholding (SVT) algorithm [48].
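The nuclear norm relaxation in (2) can be illustrated with a minimal NumPy sketch of singular value thresholding. The iteration follows the commonly described SVT scheme (shrink the singular values of an auxiliary matrix, then take a step on the observed entries). The threshold `tau`, the step size `step`, the iteration count, and the function names are illustrative assumptions, not settings taken from this paper.

```python
import numpy as np


def svt_shrink(Z: np.ndarray, tau: float) -> np.ndarray:
    """Soft-threshold the singular values of Z by tau."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def svt_complete(M, mask, tau, step=1.2, n_iter=200):
    """Approximate a low-rank completion of M from the entries where mask is True."""
    mask = np.asarray(mask, dtype=float)
    Y = np.zeros_like(M, dtype=float)   # auxiliary (dual-like) variable
    X = np.zeros_like(M, dtype=float)
    for _ in range(n_iter):
        X = svt_shrink(Y, tau)          # low-rank estimate
        Y += step * mask * (M - X)      # correct only the observed entries
    return X
```

Larger values of `tau` favor lower-rank solutions at the cost of a less exact fit on the observed entries.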

Dual Laplacian Graph Regularized Matrix Completion (DLGRMC).
Supposing there are $m$ drugs and $n$ targets, if we use the matrix $M \in \mathbb{R}^{m \times n}$ to denote the drug-target interactions and denote $\Omega$ as the validated interaction set, then (2) can be directly used for potential DTI prediction. However, the drug-drug similarities and target-target similarities, which have been demonstrated to be useful in previous works, are not exploited by this standard matrix completion model. We expect that the two kinds of similarities benefit the matrix completion model and lead to better DTI prediction results. In this work, we present a new objective function that incorporates the drug-drug similarities and target-target similarities into the standard matrix completion framework for DTI prediction. We use a dual Laplacian graph regularization term to encode that drugs with similar chemical structures are more likely to interact with similar targets and targets with similar genomic sequences are more likely to interact with similar drugs. The optimization problem of DLGRMC can be formulated as follows:
$$\min_{X} \ \|X\|_{*} + \alpha \|X\|_F^2 + \beta \|A \circ (X - A)\|_F^2 + \frac{\gamma}{2} \Big( \sum_{i,j=1}^{m} \|x_i - x_j\|_2^2 \, S_d(i, j) + \sum_{i,j=1}^{n} \|x^i - x^j\|_2^2 \, S_t(i, j) \Big), \tag{3}$$
where $x_i$ and $x_j$ represent the $i$-th row and $j$-th row of $X$, respectively, and $x^i$ and $x^j$ represent the $i$-th column and $j$-th column of $X$, respectively. $\alpha$, $\beta$, and $\gamma$ are three regularization parameters, and "$\circ$" denotes the Hadamard product of two matrices. The Tikhonov regularization on $X$ is used to ensure the smoothness of $X$. The third term aims to ensure that the experimentally validated interactions are well preserved after the matrix completion. $A$ is an adjacency matrix with binary values which is defined to describe the validated DTIs; i.e., if a specific drug $d_i$ is confirmed to interact with a target $t_j$, the entry $A(i, j)$ is assigned 1 and otherwise 0. Thus, the adjacency matrix $A$ has size $m \times n$. Since $A$ takes 0-1 values, we use $A$ itself as the indicator matrix to indicate the indices of the observed DTIs. The fourth term, weighted by the parameter $\gamma$, encodes that drugs with similar chemical structures are more likely to be connected with similar targets and targets with similar genomic sequences are encouraged to interact with similar drugs. $S_d(i, j)$ represents the chemical structure similarity between drugs $d_i$ and $d_j$, and $S_t(i, j)$ represents the genomic sequence similarity between targets $t_i$ and $t_j$.

Optimization of DLGRMC.
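To solve the optimization problem in (3), we first transform it into the following form:
$$\min_{X} \ \|X\|_{*} + \alpha \|X\|_F^2 + \beta \|A \circ (X - A)\|_F^2 + \gamma \big( \mathrm{Tr}(X^{T} L_d X) + \mathrm{Tr}(X L_t X^{T}) \big), \tag{4}$$
where $L_d \in \mathbb{R}^{m \times m}$ is the drug Laplacian matrix with $L_d = D_d - S_d$, $D_d$ is the diagonal matrix with $D_d(i, i) = \sum_j S_d(i, j)$, $L_t \in \mathbb{R}^{n \times n}$ is the target Laplacian matrix with $L_t = D_t - S_t$, and $D_t$ is the diagonal matrix with $D_t(i, i) = \sum_j S_t(i, j)$. Since problem (4) contains the Hadamard product of two matrices, it is hard to tackle directly. Thus, we propose an alternating iterative algorithm to solve this problem based on the Augmented Lagrange Multiplier (ALM) algorithm [49][50][51][52]. We first introduce two auxiliary variables $Z$ and $W$ to make the objective function separable:
$$\min_{X, Z, W} \ \|Z\|_{*} + \alpha \|X\|_F^2 + \beta \|A \circ (W - A)\|_F^2 + \gamma \big( \mathrm{Tr}(X^{T} L_d X) + \mathrm{Tr}(X L_t X^{T}) \big) \quad \text{s.t.} \quad X = Z, \ X = W. \tag{5}$$
The corresponding augmented Lagrange function of (5) is
$$\mathcal{L} = \|Z\|_{*} + \alpha \|X\|_F^2 + \beta \|A \circ (W - A)\|_F^2 + \gamma \big( \mathrm{Tr}(X^{T} L_d X) + \mathrm{Tr}(X L_t X^{T}) \big) + \langle \Lambda_1, X - Z \rangle + \frac{\mu_1}{2} \|X - Z\|_F^2 + \langle \Lambda_2, X - W \rangle + \frac{\mu_2}{2} \|X - W\|_F^2, \tag{6}$$
where $\Lambda_1$ and $\Lambda_2$ are the Lagrange multipliers, $\mu_1 > 0$ and $\mu_2 > 0$ control the penalties for violating the linear constraints, and $\langle \cdot, \cdot \rangle$ represents the standard inner product of two matrices. Then the variables can be solved alternately.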
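The step from a pairwise graph regularizer to its trace form rests on a standard identity: for a symmetric similarity matrix S with Laplacian L = D - S, half of the weighted sum of squared row differences, 0.5 * sum_ij S(i, j) * ||x_i - x_j||^2, equals Tr(X^T L X). The sketch below checks this numerically on random data; the function names are hypothetical, and the column version is obtained by passing the transpose of X.

```python
import numpy as np


def laplacian(S: np.ndarray) -> np.ndarray:
    """Graph Laplacian L = D - S, with D the diagonal matrix of row sums of S."""
    return np.diag(S.sum(axis=1)) - S


def pairwise_penalty(X: np.ndarray, S: np.ndarray) -> float:
    """0.5 * sum_ij S[i, j] * ||X[i] - X[j]||^2, taken over the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return 0.5 * float(np.sum(S * np.sum(diff ** 2, axis=2)))


rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))                 # 5 drugs, 4 targets (toy sizes)
S_d = rng.random((5, 5)); S_d = (S_d + S_d.T) / 2
S_t = rng.random((4, 4)); S_t = (S_t + S_t.T) / 2

row_gap = pairwise_penalty(X, S_d) - np.trace(X.T @ laplacian(S_d) @ X)
col_gap = pairwise_penalty(X.T, S_t) - np.trace(X @ laplacian(S_t) @ X.T)
```

Both gaps vanish up to floating-point error, which is why the regularizer can be handled with matrix operations instead of explicit double sums.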

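Solving $Z$ with Other Variables Fixed. The variable $Z$ can be solved by the following problem with the other variables fixed:
$$Z = \arg\min_{Z} \ \|Z\|_{*} + \frac{\mu_1}{2} \Big\| Z - \Big( X + \frac{\Lambda_1}{\mu_1} \Big) \Big\|_F^2, \tag{7}$$
which can be solved by the singular value thresholding (SVT) operator [48].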

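Solving $W$ with Other Variables Fixed. When the other variables are fixed, $W$ can be solved by minimizing the following function:
$$\min_{W} \ \beta \|A \circ (W - A)\|_F^2 + \frac{\mu_2}{2} \Big\| W - \Big( X + \frac{\Lambda_2}{\mu_2} \Big) \Big\|_F^2. \tag{8}$$
Setting the derivative of (8) with respect to $W$ to zero and using properties of the Hadamard and Kronecker products, $W$ can be obtained as follows:
$$\mathrm{vec}(W) = P^{-1} \mathrm{vec}(Q), \tag{9}$$
where $P = 2\beta \, \mathrm{diag}(\mathrm{vec}(A)) + \mu_2 I$ and $Q = 2\beta (A \circ A) + \mu_2 X + \Lambda_2$. This is a simple linear system.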
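Because the coefficient matrix of this linear system is diagonal, the system can be solved entry by entry without forming it explicitly. The sketch below assumes the subproblem has the form min_W beta * ||A∘(W - A)||_F^2 + (mu / 2) * ||W - B||_F^2, where A is the binary indicator matrix and B is a hypothetical name for the matrix that collects the current estimate and the scaled multiplier. The function name `update_w` is likewise an assumption.

```python
import numpy as np


def update_w(A: np.ndarray, B: np.ndarray, beta: float, mu: float) -> np.ndarray:
    """Entrywise minimizer of beta*||A∘(W-A)||_F^2 + (mu/2)*||W-B||_F^2 for binary A."""
    # Stationarity: 2*beta*A∘(W - A) + mu*(W - B) = 0, using A∘A = A for binary A.
    return (2.0 * beta * A + mu * B) / (2.0 * beta * A + mu)


rng = np.random.default_rng(1)
A = (rng.random((4, 3)) < 0.5).astype(float)
B = rng.standard_normal((4, 3))
W = update_w(A, B, beta=2.0, mu=0.7)
```

On unobserved entries (A = 0) the update simply returns B, while on observed entries it pulls the value toward 1 with a strength controlled by `beta`.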

Updating Multipliers. We update the multipliers by
$$\Lambda_1 \leftarrow \Lambda_1 + \mu_1 (X - Z), \qquad \Lambda_2 \leftarrow \Lambda_2 + \mu_2 (X - W).$$
The variables $X$, $Z$, and $W$ are iteratively updated until convergence. The detailed steps for solving the proposed DLGRMC model are described in Algorithm 1. After we recover $X$, the predicted DTIs are obtained by sorting the completed entries of $X$ in descending order.
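The alternating scheme can be sketched end to end in NumPy under several loudly stated assumptions. The splitting uses two copies Z and W of the completed matrix X, with the nuclear norm placed on Z and the indicator term on W. The penalty `mu` is shared by both constraints and kept fixed. The update of X, which this section does not spell out, is obtained here by setting the gradient to zero and diagonalizing the two Laplacians, which is one possible way to solve the resulting Sylvester-type equation and not necessarily the authors' choice. All function names and default parameter values are hypothetical.

```python
import numpy as np


def svt(B, tau):
    """Singular value soft-thresholding."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def laplacian(S):
    """Laplacian of the symmetrized similarity matrix."""
    S = 0.5 * (S + S.T)
    return np.diag(S.sum(axis=1)) - S


def solve_x(R, U, a, V, b, gamma, c):
    """Solve (2*gamma*L_d + c*I) X + X (2*gamma*L_t) = R via eigendecompositions."""
    denom = 2.0 * gamma * (a[:, None] + b[None, :]) + c
    return U @ ((U.T @ R @ V) / denom) @ V.T


def dlgrmc_sketch(A, S_d, S_t, alpha=0.1, beta=1.0, gamma=0.1,
                  mu=1.0, n_iter=100, tol=1e-6):
    """Assumed ALM loop: update Z (SVT), W (entrywise), X (linear solve), multipliers."""
    A = np.asarray(A, dtype=float)
    a, U = np.linalg.eigh(laplacian(S_d))
    b, V = np.linalg.eigh(laplacian(S_t))
    c = 2.0 * alpha + 2.0 * mu
    X = A.copy()
    lam1 = np.zeros_like(A)
    lam2 = np.zeros_like(A)
    for _ in range(n_iter):
        Z = svt(X + lam1 / mu, 1.0 / mu)
        W = (2.0 * beta * A + mu * X + lam2) / (2.0 * beta * A + mu)
        X_new = solve_x(mu * (Z + W) - lam1 - lam2, U, a, V, b, gamma, c)
        lam1 += mu * (X_new - Z)
        lam2 += mu * (X_new - W)
        converged = np.linalg.norm(X_new - X) <= tol * max(1.0, np.linalg.norm(X))
        X = X_new
        if converged:
            break
    return X
```

A call such as `dlgrmc_sketch(A, S_d, S_t)` returns a real-valued score matrix of the same shape as `A`; the ranking of its entries, not their absolute values, is what matters for prediction.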

Evaluation Metrics.
To quantitatively evaluate the performance of our method, computational experiments were conducted on the above five benchmark datasets. Similar to previous studies [27,32,54], the area under the precision-recall curve (AUPR) [55] and precision-recall (PR) curves were employed as the main metrics for performance evaluation. AUPR penalizes false positives more heavily than the area under the ROC curve, which is desirable here since we do not want incorrect predictions to be recommended by the prediction algorithms [55]. Before evaluating the performance of our proposed method, we illustrate the imbalance between interacting and noninteracting drug-target pairs in the different datasets in Figure 2. As can be seen, the number of known drug-target interaction pairs is very small, which underlines the need for predicting new drug-target interactions.
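For concreteness, the area under the precision-recall curve can be estimated from a score vector as follows. This sketch uses the step-wise (average precision) estimate, which averages the precision at the rank of each positive; a trapezoidal estimate would give slightly different numbers, and tied scores are broken by input order. The function name `aupr` is hypothetical, and the evaluation code actually used in the experiments is not specified in the text.

```python
import numpy as np


def aupr(y_true, scores) -> float:
    """Average-precision estimate of the area under the precision-recall curve."""
    y = np.asarray(y_true, dtype=float).ravel()
    s = np.asarray(scores, dtype=float).ravel()
    n_pos = y.sum()
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    order = np.argsort(-s, kind="stable")      # rank pairs by descending score
    y = y[order]
    precision = np.cumsum(y) / np.arange(1, y.size + 1)
    return float(np.sum(precision * y) / n_pos)
```

With labels `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.1]`, the positives sit at ranks 1 and 3, so the estimate is (1/1 + 2/3) / 2.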

Experiments Settings.
In our experiments, six existing techniques, namely, bipartite local model with neighbor-based interaction-profile inferring (BLMNII) [32], weighted nearest neighbor profile (WNN) [54], collaborative matrix factorization (CMF) [26], graph regularized matrix factorization (GRMF) [27], neighborhood regularized logistic matrix factorization (NRLMF) [56], and label propagation with linear neighborhood information (LPLNI) [36], were compared with our proposed DLGRMC. We adopted 5 repetitions of 10-fold cross validation (CV) for each of the methods on the different datasets. In each repetition, the observed DTI indicator matrix was divided into 10 folds. Each fold was left out in turn as the test set while the remaining 9 folds were treated as the training set, and the final AUPR score was the average over the 5 repetitions. As can be seen from (3), there are three parameters that need to be tuned in our proposed DLGRMC model, i.e., $\alpha$, $\beta$, and $\gamma$. In our experiments, we chose them from {0.001, 0.01, 0.1, 1, 10, 100, 1000} by grid search, and the best results with the optimal parameters were reported. As to the Gaussian kernel function for calculating the drug chemical structure similarity, we set the number of nearest neighbors to 5 and the kernel width to 0.1. For the other methods, we set the parameters to the optimal values recommended in the corresponding references.
Similar to previous works [9,26,57], we conducted CV under three different settings as follows:
(i) $CV_1$: CV on drug-target pairs. Random entries in $Y$ (i.e., drug-target pairs) were selected for testing; this setting refers to DTI prediction for new (unknown) drug-target pairs.
(ii) $CV_2$: CV on drugs. Random rows in $Y$ (i.e., drugs) were blinded for testing; this setting refers to DTI prediction for new drugs.
(iii) $CV_3$: CV on targets. Random columns in $Y$ (i.e., targets) were blinded for testing; this setting refers to DTI prediction for new targets.
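The three settings differ only in what is held out: single entries, whole rows, or whole columns of the interaction matrix. A minimal sketch of the fold construction is given below, assuming rows index drugs and columns index targets. The function name `cv_test_masks` and the setting labels `"pairs"`, `"drugs"`, and `"targets"` are hypothetical.

```python
import numpy as np


def cv_test_masks(shape, setting, n_folds=10, seed=0):
    """Return one boolean test mask per fold for the chosen CV setting."""
    n_drugs, n_targets = shape
    rng = np.random.default_rng(seed)
    masks = []
    if setting == "pairs":        # hold out random entries
        for fold in np.array_split(rng.permutation(n_drugs * n_targets), n_folds):
            mask = np.zeros(n_drugs * n_targets, dtype=bool)
            mask[fold] = True
            masks.append(mask.reshape(n_drugs, n_targets))
    elif setting == "drugs":      # hold out whole rows
        for fold in np.array_split(rng.permutation(n_drugs), n_folds):
            mask = np.zeros((n_drugs, n_targets), dtype=bool)
            mask[fold, :] = True
            masks.append(mask)
    elif setting == "targets":    # hold out whole columns
        for fold in np.array_split(rng.permutation(n_targets), n_folds):
            mask = np.zeros((n_drugs, n_targets), dtype=bool)
            mask[:, fold] = True
            masks.append(mask)
    else:
        raise ValueError(f"unknown setting: {setting}")
    return masks
```

For a given fold, the training matrix can be formed as `np.where(mask, 0, Y)`, so that the held-out entries are treated as unknown during completion.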

Under $CV_1$, we used 90% of the elements in $Y$ as training data and the remaining 10% of the elements as test data in each round. Under $CV_2$, we used 90% of the rows in $Y$ as training data and the remaining 10% of the rows as test data in each round. Under $CV_3$, we used 90% of the columns in $Y$ as training data and the remaining 10% of the columns as test data in each round. Tables 2-4 show the AUPR values of the different methods on the different datasets under the different CV settings. As can be seen, our proposed DLGRMC performs better than the other methods on all of the datasets. Since drug discovery and development aim to serve the treatment of disease, and in order to assess the prediction of new targets with which the drugs interact, we plot the precision-recall (PR) curves of the results under $CV_3$ for all of the datasets. The plots are shown in Figure 3; these results also show the superiority of our proposed DLGRMC. We will release the related datasets, code, and figures of our algorithm for academic research with this paper.

Case Study.
In order to test the capacity of DLGRMC for potential DTI prediction, we randomly chose one drug from each dataset and reported the top 10 interactions predicted by the different methods under $CV_3$. The results are shown in Tables 5-9. As can be seen, our proposed DLGRMC successfully predicts more of the experimentally validated DTIs than the other methods, which indicates that DLGRMC is capable of predicting novel DTIs for drug development.
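The ranking step behind such a case study is simple to sketch: take the row of predicted scores for the chosen drug, sort it in descending order, and mark which of the top-ranked targets are already validated. The function name `top_k_targets`, the toy scores, and the target identifiers below are hypothetical and are not taken from the tables.

```python
import numpy as np


def top_k_targets(scores, known, drug_index, target_ids, k=10):
    """Return (target id, score, validated?) for the k highest-scoring targets of one drug."""
    row = np.asarray(scores, dtype=float)[drug_index]
    flags = np.asarray(known)[drug_index]
    order = np.argsort(-row, kind="stable")[:k]
    return [(target_ids[j], float(row[j]), bool(flags[j])) for j in order]


# Toy example with one drug and three targets.
toy_scores = np.array([[0.2, 0.9, 0.5]])
toy_known = np.array([[0, 1, 0]])
ranking = top_k_targets(toy_scores, toy_known, 0, ["t1", "t2", "t3"], k=2)
```

Here `ranking` lists target `t2` first (validated) and `t3` second (not validated), mirroring the check-mark and cross notation used in the case-study tables.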

Parameter Sensitivity Analysis.
As mentioned in Section 3.2, there are three parameters that need to be tuned for obtaining the best results. In this subsection, in order to analyse the effect of the parameters on the final prediction results, for each dataset, we show the AUPR values versus one of the parameters with the other two fixed. Figure 4 plots the AUPR values of DLGRMC with different parameters on the different datasets under $CV_3$. As can be seen, DLGRMC is more sensitive to $\gamma$ and $\beta$ than to $\alpha$, which demonstrates the importance of the Laplacian graph regularization and the preservation of the observed DTIs.

Table 5: The top 10 interacting targets of drug "D00094" in dataset NRs predicted by different methods ("√" denotes experimentally validated targets and "×" denotes nonvalidated targets).
Rank  BLM-NII  WNN  CMF  GRMF  NRLMF  DLGRMC
1  hsa5914 (√)  hsa190 (√)  hsa6096 (√)  hsa6257 (√)  hsa59…

Discussion
In this paper, we propose a drug-target interaction prediction model based on dual Laplacian graph regularized matrix completion. In detail, we transformed the task of drug-target interaction prediction into a matrix completion problem, in which the potential interactions between drugs and targets are obtained from the prediction scores after the matrix completion procedure. The novelties of our proposed method lie in two aspects. On the one hand, during the matrix completion, the pairwise chemical structure similarities between drugs and the genomic sequence similarities between targets are fully exploited by using a dual Laplacian graph regularization term. On the other hand, a binary indicator matrix that marks the indices of the observed drug-target interactions is used to preserve the experimentally confirmed interactions. We developed an alternating iterative strategy based on the Augmented Lagrange Multiplier algorithm to solve the constrained matrix completion problem. The experimental results validate the efficacy of the proposed method, and case studies demonstrate that the proposed method is able to predict potential novel drug-target interactions. However, the experimental results also show that there is still much room for improvement, since some interactions are missed in the case studies. In the present work, only one type of representation for drugs or targets is considered. In practice, each drug or target can have multiple representations. For example, a drug can be represented by its chemical structure or by its chemical response in different cells. A protein target can be represented by its sequence or by its gene expression values in different cells. In our future work, we aim to integrate these multiview representations for drug-target interaction prediction, and we believe that the prediction results can be improved by a large margin.
Table 8: The top 10 interacting targets of drug "D00002" in dataset Es predicted by different methods ("√" denotes experimentally validated targets and "×" denotes nonvalidated targets).