Prediction of Effective Drug Combinations by Chemical Interaction, Protein Interaction and Target Enrichment of KEGG Pathways

Drug combinatorial therapy could be more effective in treating some complex diseases than single agents due to better efficacy and reduced side effects. Although some drug combinations are being used, their underlying molecular mechanisms are still poorly understood. Therefore, it is of great interest to deduce a novel drug combination by their molecular mechanisms in a robust and rigorous way. This paper attempts to predict effective drug combinations by a combined consideration of: (1) chemical interaction between drugs, (2) protein interactions between drugs' targets, and (3) target enrichment of KEGG pathways. A benchmark dataset was constructed, consisting of 121 confirmed effective combinations and 605 random combinations. Each drug combination was represented by 465 features derived from the aforementioned three properties. Some feature selection techniques, including Minimum Redundancy Maximum Relevance and Incremental Feature Selection, were adopted to extract the key features. Random forest model was built with its performance evaluated by 5-fold cross-validation. As a result, 55 key features providing the best prediction result were selected. These important features may help to gain insights into the mechanisms of drug combinations, and the proposed prediction model could become a useful tool for screening possible drug combinations.


Introduction
During the past decade, much effort has been spent on drug discovery, but the rate of new drug approvals is rather low.One of the reasons is that many of the human diseases are so complex with multiple targets that it is very difficult to design a single drug to hit all the targets.Since single targeted drugs can not treat these diseases very effectively [1], employing multiple targeted drugs is a favorable way, by which multiple target genes/proteins can be modulated simultaneously.It is already evidenced that drug combinations can improve therapeutic efficacy in many cases [2].In addition, drug combinations may reduce toxicity and side effects that single targeted drugs may cause.Therefore, drug combinatorial therapy is considered to be effective in treating multifactorial complex diseases.
Drug combinations are becoming more and more popular nowadays, and they have been mainly discovered by experiments or clinical experience.On one hand, the molecular mechanisms of current drug combinations have not been clearly delineated; on the other, there are a myriad of possible drug combinations.Therefore, it is impractical to screen all possible combinations by conventional experiments or empirical rules.Computational methods may provide some valuable information and help to solve the problem.In recent years, some computational methods have been proposed to BioMed Research International predict drug combinations [3][4][5][6][7][8][9].However, these methods have not answered the question of which factors or features are more important for the determination of drug combinations, when it is essential to know which features and why they are able to distinguish good combinations from undesired ones.We propose a method here to identify the characteristic features of effective drug combinations, then analyze them and use them to predict novel combinations.
Drugs are combined according to their essential properties [10,11].In view of this, we considered the following three different kinds of properties: (1) chemical interactions between drugs in the combination [12], (2) protein interactions between the targets of drugs [13], and (3) target enrichment of KEGG pathways [14].These properties were encoded into numeric digits, by which each drug combination was represented by a numeric vector.Feature selection methods, including minimum redundancy maximum relevance [15] and incremental feature selection, were adopted to extract key features.Random forest [16] was adopted as the classification model with its performance evaluated by 5-fold cross-validation.As a result, 55 key features, including one feature from chemical interaction, two features from protein interaction, and others from target enrichment of pathways, were identified and deemed as the most important features for the determination of effective drug combinations.

Benchmark Dataset.
We retrieved all pairwise drug combinations from Zhao et al. 's study [8], which were parsed from FDA orange book [17], which lists approved drug products on the basis of safety and effectiveness by the Food and Drug Administration (FDA).The data in this book has been used as the object of study or reference in some studies [8,[18][19][20][21].If the target information of any drug in the combination was not available, the combination it was involved in was excluded.As a result, 121 drug combinations were retrieved.These combinations were termed as "positive combinations".Totally, 169 drugs were collected from the positive combinations, which were used to investigate drug combinations in this study.
There are 14,196 possible combinations among 169 drugs, where 121 combinations were solidly effective.For the other 14,075 combinations, their effects in treating diseases are not clear and which were assumed to be junk combinations.Among them we randomly selected 605 combinations as "negative combinations, " 5 times as many as the positive ones.The codes of positive and negative combinations can be found in Supplementary Material I (Supplementary Material available online at http://dx.doi.org/10.1155/2013/723780).

Drug Targets.
It has been shown that the targets of agents are an important factor for the formation of effective drug combinations [9].In this study, this information was also employed to construct classification features.The targets of 169 drugs were compiled from three drug target databases including KEGG (ftp://ftp.genome.jp/pub/kegg/medicus/drug/) [22], DrugBank [23], and Therapeutic Target Database (TTD) [24].For each drug, the union of the targets from the three databases was regarded as the final target set.The codes of 169 drugs and their targets were available in Supplementary Material II.

Chemical-Chemical and Protein-Protein Interactions.
It is based on the drugs and their targets to determine whether two drugs should be combined in usage.Thus, the interactions among drugs and among their targets are important for the determination of drug combinations.Here, the information of chemical-chemical interactions and protein-protein interactions were retrieved from Search Tool for Interactions of Chemicals (STITCH) [12] and Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) [13], respectively, as the resources of gaining the classification features.

Chemical-Chemical
Interactions.The information of chemical-chemical interactions was downloaded from STITCH (http://stitch.embl.de/,"chemical chemical.links.detailed.v3.0.tsv.gz")[12].Each interaction consists of two chemicals and five scores entitled "similarity, " "experimental, " "database, " "textmining, " and "combined score, " respectively.The score of "similarity" was obtained by combining open-source Chemistry Development Kit [25] to calculate chemical fingerprints and Tanimoto 2D chemical similarity scores [26,27] between each pair of chemicals.The score of "experiment" was calculated according to the chemical's activities from MeSH pharmacological actions and NCI60 screens.The score of "Database" was calculated by the chemical reactions contained in pathway databases.The score of "textmining" was computed based on a cooccurrence scheme and a natural language processing (NLP) approach [28,29].The score of "combined score" was obtained by combining all of the information that was used to calculate the aforementioned four scores.Thus, the interactivity of two chemicals was determined by the last score.Since a larger score means that the corresponding chemicals can interact with high likelihood, the score is called confidence score in this study.For any two compounds  1 and  2 , the confidence score of the interaction between them was denoted by   ( 1 ,  2 ).Particularly, if the interaction between  1 and  2 is not available in STITCH, the confidence score of the interaction was set to zero, that is,   ( 1 ,  2 ) = 0.

Protein-Protein Interactions.
The file containing the information of protein-protein interactions was retrieved from STRING (http://string.embl.de/)[13].The interactions in STRING include both physical and functional interactions.Like the chemical-chemical interaction in STITCH, each protein-protein interaction in STRING was labeled by a score integrating the information from experimental repositories, computational prediction methods, and public text collections [13].Since the value of the score indicates the likelihood of occurrence of the interaction, it is also termed as confidence score.Here, let   ( 1 ,  2 ) denote the interaction confidence score of the proteins  1 and  2 .If   ( 1 ,  2 ) > 0, we consider that proteins  1 and  2 are interactive proteins.
Likewise,   ( 1 ,  2 ) was set to zero if the interaction between  1 and  2 is not available in STRING.

Features of Drug Combinations.
One of the most important steps of constructing a classification model is to encode each term by its essential properties.The definition of various features is described as follows, which can be deemed as important for the determination of drug combinations.For clarity, each drug combination was denoted by D = ( 1 ,  2 ), where  1 and  2 are two drugs in the combination D, respectively.
We considered three aspects of drug combination: (1) chemical interaction between drugs, (2) protein interactions between drugs' targets, and (3) target enrichment of KEGG pathways.They reflect different levels of the drug-target relationship.The chemical interactions between drugs can indicate whether or not the drugs have antagonism.The protein interactions between drugs' targets and the KEGG enrichment scores of drugs' targets represent the biological functions that the drugs can perturb.

Chemical Interaction.
Two drugs forming a solid combination are more likely to have similar properties.Hence, the interactive chemicals defined in Section 2.3 can share similar biological functions [30,31] with high probability.Accordingly, the interaction confidence score of two drugs in the combination D, that is,   ( 1 ,  2 ), was taken as a feature.

Protein Interaction.
Since drugs take their effects by hitting some target proteins, the target proteins of two drugs are related to each other in a special way [9].In addition, the interactive proteins defined in Section 2.3 always share similar functions [32,33].Thus, it is a reasonable scheme using the information of the protein-protein interactions retrieved from STRING to indicate the special relationship between drug target proteins.For drug combination D = ( 1 ,  2 ), their targets were formulated as ( 1 ) = { 1  1 ,  2 1 , . . .,   1 } and ( 2 ) = { 1 2 ,  2 2 , . . .,   2 }, respectively.We defined the following two kinds of features to describe their relationship.
(1) Protein interactions between the target groups: for any protein   1 in ( 1 ) and any protein   2 in ( 2 ), their interaction confidence score can be obtained from STRING [13] (see Section 2.3).The maximum and mean values of these scores were formulated as follows: which were taken as features.
(2) Protein interactions inside the target groups: for drug   , we can obtain two values V 1  and V 2  , where V 1  and V 2  are the maximum value and mean value of interaction confidence scores between target proteins in (  ), respectively.Since there is no order in the information for a drug combination, ( 1 ,  2 ) and ( 2 ,  1 ) are equivalent.In view of this, we refined , and V 2 2 as follows: which were also taken as features in the study.

Target
Enrichment for KEGG Pathway.The target proteins of a drug are distributed in many pathways, that is, a single drug may belong to multiple pathways and modulate their functions.To partially account for this effect, we employed the pathways in KEGG [22] and KEGG enrichment score [14,34,35] to quantify the relation between drugs and pathways in KEGG.For drug   and KEGG pathway   , the KEGG enrichment score is defined as the −log 10 of the hypergeometric test  value of gene set   , which includes targets of drug   and their direct neighbors in STRING network.It can be calculated as follows: where  is the number of genes in human,  is the number of genes annotated to the KEGG pathway   ,  is the number of genes in gene set   , and  is the number of genes both in gene set   and in KEGG pathway   .The KEGG enrichment scores can measure the biological functions of the genes.( = 1, 2).Similar to the features of protein interactions, 458 features can be derived from these enrichment scores as follows: In summary, there were one feature from chemical interaction, six features from protein interaction, and 458 features from target enrichment, totally (1 + 6 + 458) = 465 features.Thus, each drug combination can be represented by a vector in a 465 D (dimensional) space, that is, each feature is deemed as a dimension.
2.5.Random Forest.Random forest, developed by Breiman [16], is an ensemble classifier integrating multiple decision trees.The procedure of constructing each decision tree is briefly described as follows.
(I) Let  be the number of training samples.We randomly take  samples from the training samples, but with replacement from the original data, to construct the decision tree, while the rest of the samples are used to evaluate the error of the tree by predicting their classes.
(II) Let  be the total number of features. is a positive integer that is much less than .When constructing the tree,  features are selected randomly from  features at each node, and the most optimized split on these  features is utilized to split the node.
(III) Each tree is fully grown without pruning.
For a query sample, each decision tree would make a prediction and the overall prediction is decided by voting.Weka 3.6.4[36] is a software collecting various state-ofart machine learning algorithms.Random forest is implemented by a classifier named RandomForest in Weka, which was adopted as the classification model and run with its default parameters in the study.In its default configuration, each random forest consists of 10 decision trees, and  in step (II) is set to For a query drug combination, each of 10 decision trees would give its prediction ("positive" or "negative").Then, the final predicted result is the class ("positive" or "negative") obtaining a majority vote.

Accuracy Measurement.
For a two-class classification problem, there are four entries in the confusion matrix: TP, TN, FP, and FN, where TP represents true positives, TN true negatives, FP false positives, and FN false negatives [37,38].Based on these values, the prediction accuracy (ACC), specificity (SP), sensitivity (SN), Matthews's correlation coefficient (MCC) [39], and Area Under ROC curve (AUC) score [40] are often used to evaluate the performance of the classification model.They can be calculated as follows: MCC is a measure of the quality of classifiers on the whole and is deemed to be a balanced measure even if the classes are of very different sizes.Thus, it has been widely used to evaluate the quality of classifiers proposed in many studies [14,37,[41][42][43][44]. AUC score is another measurement to evaluate the performance of the classification model on the whole other than MCC.It is the normalized area under the ROC curve, which is plotted in the coordinate system with sensitivity as Y-axis and 1 − specificity (calculated by FP/(FP + TN)) as Xaxis under various classification thresholds.In this study, we selected MCC to measure the performance of the method on the whole, while AUC score was also provided for reference.
2.7.5-Fold Cross-Validation.5-fold cross-validation is often used to evaluate the performance of various classification models [45].In 5-fold cross-validation, the original dataset is equally separated into five portions at random.Each portion is used as testing data in turn and the remaining 4 portions are used as training data.Thus, each datum is tested exactly once since each portion is tested exactly once during the procedure.In the study, 5-fold cross-validation was adopted to evaluate the model presented.

Minimum Redundancy Maximum Relevance (mRMR).
As described in Section 2.4, each drug combination was represented by various features.However, not all features contribute to the classification.In view of this, it is necessary to employ feature selection techniques to analyze these features and extract the useful ones.Minimum redundancy maximum relevance was proposed by Peng et al. [15], and it is deemed as an outstanding method for extracting important information from complicated systems [46][47][48][49], which was also adopted in the study.We could obtain two lists by mRMR program: MaxRel features list and mRMR features list, where the MaxRel features list sorts the features according to the criterion that features contributing more to the classification will have higher ranks, while mRMR features list is produced according to the criteria of both MaxRel and minimum redundancy, which ensures that a feature having minimum redundancy among the already selected features and giving the most contribution to the classification will tend to have a higher rank.The MaxRel features list and mRMR features list were formulated as follows: MaxRel where  represents the total number of features.For detailed description of mRMR method and its analysis, please refer to Peng et al. 's paper [15].

Incremental Feature Selection (IFS).
Based on mRMR features list   = [  1 ,   2 , . . .,    ], incremental feature selection was performed as follows: (I) construct  feature subsets, in a way that the th feature subset is defined as (II) for each  (1 ≤  ≤ ), execute RandomForest in Weka using features in    , respectively, evaluated by 5-fold cross-validation, thereby obtaining ACC, SP, SN, MCC, and AUC scores as described in Section 2.6; (III) plot an IFS curve with MCC value as its Y-axis and the superscript  of    as its X-axis.The first feature ("F1") is the interaction confidence score of  1 and  2 in the combination D = ( 1 ,  2 ) and the second feature ("F2") is the maximum confidence score between the targets of drug  1 and  2 , indicating that the interactions of drugs and their targets are key factors for the determination of drug combinations.The later one is partially consistent with the previous results [9].The remaining 8 features are related to the following seven pathways: (I) hsa04964 ("Proximal tubule bicarbonate reclamation"), (II) hsa00052 ("galactose metabolism"), (III) hsa04970 ("salivary secretion"), (IV) hsa00910 ("nitrogen metabolism"), (V) hsa05215 ("prostate cancer"), (VI) hsa05130 ("pathogenic Escherichia coli infection"), and (VII) hsa00520 ("amino sugar and nucleotide sugar metabolism"), where pathway hsa00910 ("nitrogen metabolism") involved two features, while the others involved one feature.2. These 55 features were deemed as the optimal features for the determination of drug combinations, composing the optimal feature set OS, that is, OS =  55  .In OS, three features were from chemical and protein interactions.In details, besides "F1" and "F2" in Section 3.1, "F3", with the rank of 25 in OS, is the mean value of confidence scores between the targets of drug  1 and the targets of  2 .The rest 52 features were related to 50 pathways (see Table 1 for details), where the pathway hsa04964 ("proximal tubule bicarbonate reclamation") and hsa05020 ("prion diseases") involved two features, respectively, while the other pathways involved exactly one feature.Among the 52 features, 36 were obtained by (5), while the rest 16 by (4) (cf.Table 1).It is clear that the features obtained by (5), measuring the difference of enrichment scores, were better discriminators than those obtained by ( 4), measuring the sum of enrichment scores.It is suggested that in a drug combination, the targets of two drugs should relate to each other in a special way.

Analysis of Optimal
Features.First, we find that there are 8 mRMR features among the top ten features in the MaxRel  list mentioned in Section 3.1.It is suggested that these 8 features are particularly good at distinguishing drug pairs.It is not surprising that the first feature is "F1" (Supplementary Material III), which is the confidence score of interaction between two drugs.The key assumption underlying most drug prediction algorithms is that similar drugs have a tendency to share similar targets [50].This has been observed due to chemical similarity [26,51].In addition, it has been proved that interactive chemicals are more likely to share similar biological functions [30,31].
The second optimal feature is the absolute difference in the value of two drugs' enrichment score in prostate cancer pathway ("abs(hsa05215 1-hsa05215 2), " refer to Supplementary Material III).The prostate cancer pathway is mainly characterized by key molecular changes in prostate cancer cells including cell cycle, carcinogen defenses, cell a: "+" and "−" in this column indicate that the feature is related to the pathways obtained by ( 4) and ( 5), respectively.For example, the feature in the first row with "−" was calculated as abs(hsa05215 1-hsa05215 2).
adhesion, migration and growth, and androgens [52], which are involved in numerous cancers.Therefore, lots of antineoplastic drugs are designed targeting genes in this pathway.
In the study of Wedel et al. [53], they proposed a triple drug combination including RAD001, AEE788, and VPA, which represented a stronger anticancer effect than any single drug.Notably, cyclin B, cdk1, 2, and 4 were reduced, since strong antitumor properties related to adhesion dynamics and cell growth became visible.Therefore, this triple drug combination might possess the potential in the treatment of advanced prostate cancer as well as other cancers [53].In addition, it has been reported that drug combination can extend life for men with prostate cancer [54].Furthermore, Danquah et al. have revealed that a treatment strategy with novel drug combination is a promising approach to treat androgen-independent prostate cancer [55].Overall, the genes in prostate cancer pathway may provide clues for antineoplastic drugs design and application of drug combinations.Drug combination approaches are especially applicable to cancer treatment.On one hand, most tumors depend on more than one signaling pathways for their growth, survival, invasion, and metastasis; on the other, multiple cell signaling pathways may control a single step of tumorigenesis.Thus, efficacious and durable responses in cancer may require a combined usage of conventional single-targeted agents [56].Moreover, cells may develop drug-resistant mutations to a single-targeted agent and most cancers have four to seven independent mutations [57].The chance of overcoming such resistance can be significantly increased by using agent or drug that inhibits multiple pathways or their combination [58][59][60].
The third critical feature for drug combination determination is the sum of enrichment score of two drugs in proximal tubule bicarbonate reclamation pathway ("hsa04964 1+hsa04964 2, " refer to Supplementary Material III).It has been reported that primary porcine proximal tubular cells play an important role in transepithelial drug transport in human kidney [61].Many genes in this pathway have been proved to be related to drug response, drug toxicity, and drug transport.CA4 (carbonic anhydrase IV) is a member of carbonic anhydrases (CAs) family, which is a group of universally expressed metalloenzymes related to multiple pathological and physiological processes, such as lipogenesis, gluconeogenesis, tumorigenicity, ureagenesis, and the virulence and growth of various pathogens [62].Apart from the already known roles of CA inhibitors (CAIs) as antiglaucoma and diuretics drugs, CAIs could also possess the potential to be novel anticancer, anti-infective, and antiobesity drugs [62].PCK1 (phosphoenolpyruvate carboxykinase 1) is a key control point during the regulation of gluconeogenesis.It has been shown that PCK1 is involved in the processes of small molecule biochemistry, carbohydrate metabolism, molecular transport, and response to drugs including 5-tert-butyl-3H-1,2-dithiole-3-thione (TBD), 3H-1,2-dithiole-3-thione (D3T), and its analogues 4-methyl-5-pyrazinyl-3H-1,2-dithiole-3-thione (OLT) [63].MDH1 (malate dehydrogenase 1) in this pathway has been reported to be relevant with drug toxicity [64,65].Another gene in this pathway worth mentioning is AQP1 (Aquaporin-1), which was highly expressed in endothelial cell membranes and involved in water transfer across or into these cells.It has been reported that AQP1 plays a role in response to acetazolamide [66] and drug transport [67,68].In addition, ATP1B1 (Na(+)-K(+) ATPaseB1) in this pathway has also been revealed to be related to drug response in breast cancer cell lines [69].
The fourth feature in the optimal feature set is "F2", which is the maximum confidence score between targets of two drugs in a drug combination.Since drugs sharing the same targets usually have similar pharmacology, they are likely to be replaceable with each other when combined with another drug for similar purposes [8].In general, drugs are combined according to the mechanisms of action, which is characterized by the properties of drugs including their pharmacology and targets [10,11].Therefore, drugs in a drug combination have a high tendency to target the same proteins or similar pharmacology [70].
Besides the top four optimal features, there are several other critical pathways in the optimal feature set.It has been shown that the steroid hormone biosynthesis pathway (hsa00140) can act as a target for endocrine-disrupting chemicals [71], and inhibitors of steroidal cytochrome P450 enzymes have the potential to be targets for drug development [72].Staphylococcus aureus infection pathway (hsa05150) has been shown to be related to drug resistance [73,74].A large amount of studies have shown that the hedgehog signaling pathway (hsa04340) has the potential to be a target for anticancer drug discovery [75].In addition, the inhibition of hedgehog signaling can enhance the delivery of chemotherapy in a mouse model of pancreatic cancer [76].Furthermore, hedgehog signaling can regulate the drug sensitivity by targeting ABC transporters in epithelial ovarian cancer (EOC) [77].It has been proposed that glycosaminoglycan degradation pathway (hsa00531) has significant therapeutic value in cancer [78].Because dysregulated glycosaminoglycan degradation plays an important role in tumorigenesis, targeting glycosaminoglycan-degrading enzymes is a promising anticancer strategy.Dysregulated expression of glycosaminoglycans is ubiquitous in cancer and has been shown to associate with clinical prognosis in several malignant neoplasms.Recently, research on the biological functions of these molecules in tumor angiogenesis, tumor metastasis, and cancer biology has facilitated the development of drugs targeting them.In addition, glycosaminoglycans are utilized as tumorspecific targeting vehicles and delivery for chemotherapeutics and toxins.Animal studies as well as clinical trials have shown the clinical relevance of glycosaminoglycan-based drugs and the utility of glycosaminoglycans as therapeutic targets [78].Another noteworthy pathway is carbohydrate digestion and absorption pathway (hsa04973).Genes in this pathway have been widely used as antidiabetic drugs target [79,80].

Conclusions
In this study, we analyzed molecular mechanisms of drug combinations by extracting certain kinds of features from each combination.After adopting Minimum Redundancy Maximum Relevance and Incremental Feature Selection as the feature selection techniques and random forest as the classification model, 55 optimal features were obtained, with which the classification model achieved the best performance.The results show that the chemical interaction between drugs in the combination and protein interactions between their targets are important for the determination of drug combinations.In addition, some KEGG pathways important for screening drug combinations are also highlighted.We hope that this contribution may help to screen new drug combinations.

Figure 1 :Figure 2 :
Figure 1: The IFS curve.The X-axis represents the number of features participating in the classification model.The Y-axis represents the Matthews's correlation coefficient (MCC) value evaluated by the classification model and 5-fold cross-validation.The highest MCC value of IFS is 0.6731 using 55 features.
As described in Section 2.8, we can obtain two feature lists: MaxRel features list and mRMR features list (available as Supplementary Material III).The ranks of features in MaxRel features list reflect their contribution to classification.Here, we investigated the first 10 features in this list (see the first table in Supplementary Material III for details).

Table 1 :
The 48 pathways related to features in the optimal feature set.