Prediction of Side Effects Using Comprehensive Similarity Measures

Identifying the potential side effects of drugs is crucial in clinical trials in the pharmaceutical industry. The existing side effect prediction methods mainly focus on the chemical and biological properties of drugs. This study proposes a method that uses diverse information such as drug-drug interactions from DrugBank, drug-drug interactions from network, single nucleotide polymorphisms, and side effect anatomical hierarchy as well as chemical structures, indications, and targets. The proposed method is based on the assumption that properties used in drug repositioning studies could be utilized to predict side effects because the phenotypic expression of a side effect is similar to that of the disease. The prediction results using the proposed method showed a 3.5% improvement in the area under the curve (AUC) over that obtained when only chemical, indication, and target features were used. The random forest model delivered outstanding results for all combinations of feature types. Finally, after identifying candidate side effects of drugs using the proposed method, the following four popular drugs were discussed: (1) dasatinib, (2) sitagliptin, (3) vorinostat, and (4) clonidine.


Introduction
Research and development efforts in the pharmaceutical industry have been low during the past few decades [1]. Most drug candidates, owing to their side effects, fail to receive the approval of the US Food and Drug Administration (FDA) for their commercialization. Identifying the undesirable offtarget activities of potential drugs, capable of causing side effects and, in turn, resulting in drug discovery failures, is a challenge in the drug-development process. While most of the serious side effects are identified during preclinical and clinical trials, some of them are reported during the postapproval monitoring. e uncertainty about the potential side effects of new drugs is a concern for not only pharmaceutical companies but also patients because they pose a health risk and can even cause death [2]. e existing computational methods used to predict the side effects of drugs assume that similar drugs have comparable properties in terms of chemical and biological characteristics, such as their structures and targets. Pauwels et al. predicted potential drug side effects using a sparse canonical correlation analysis model based on chemical structures, whereas Mizutani et al. developed a method based on chemical structures of drugs and target proteins [3,4]. It has been recognized that drugs with similar chemical structures exhibit comparable biological activities [5]. e screening of a significant number of chemical databases containing the structures of available chemicals is a pivotal process even in drug design studies that predict the properties of chemical compounds [6]. It is logical to expect that common drug targets that trigger similar therapeutic effects induce similar signaling cascades and, therefore, similar side effects.
Previous research on side effect prediction was extended to phenotypic traits after much focus on chemical and biological properties. Liu et al. integrated drug-phenotypic information, besides chemical and biological information, into the features for machine learning and demonstrated significant improvements in the prediction results [7]. Zheng et al. utilized therapeutic data in addition to drug substitutes, chemical structures, and targets. ese two studies were based on the idea that drugs with similar therapeutic effects may have comparable side effects [8].
However, a vast majority of previously conducted studies did not fully utilize the existing knowledge on drugs. is study focuses on the idea that the phenotypic expression of a side effect can be similar to that of a disease. Furthermore, because drugs are subject to complex influences such as metabolic transformations and other pharmacokinetic transformations while they are metabolized and physiologically distributed, their side effects cannot be simply predicted by their chemical properties [9].
us, it was assumed that similarities at a molecular level used in drug repositioning can be applied to the prediction of side effects; moreover, various similarity measurements could be helpful in improving the predictive capabilities of the model. Single nucleotide polymorphisms (SNPs) and drug-drug interactions (DDIs) that were not used by previous side effect prediction studies were utilized because both have been used in drug repositioning studies, where they have demonstrated outstanding results [10,11].
is study proposes a machine learning approach for the identification of potential drug side effects by leveraging various information resources on drug and side effect properties, such as (1) drug-drug interactions from DrugBank (DDIs-D), (2) drug-drug interactions from network (DDIs-N), (3) SNPs, (4) chemical structures, (5) indications, (6) targets, and (7) side effect anatomical hierarchy. A formulated set of seven features and diverse machine learning algorithms were adopted to develop a drug-side effect pair. e results showed that using new features proposed in this study in addition to chemical, indication, and target features improved the predictive capability of the side effect prediction model. Furthermore, optimizing the machine learning model to achieve maximal prediction performance resulted in the identification of unlabeled side effects of approved drugs. e following four popular drugs are discussed in the study: (1) dasatinib, (2) sitagliptin, (3) vorinostat, and (4) clonidine.

Materials and Methods
In this study, the side effects of drugs were predicted using diverse properties, such as (1) DDIs-D, (2) DDIs-N, (3) SNPs, (4) chemical structures, (5) indications, (6) targets, and (7) side effect anatomical hierarchy. Based on these properties, similarities used for feature values were calculated. Next, a dataset was developed in which a drug-side effect pair was a sample, and feature values were assigned based on the known associations from the training samples. Finally, machine learning algorithms were applied to these datasets, and the evaluation results were obtained. Figure 1 provides an overview of the system and a brief description of the method.

Measuring Similarities.
A variety of similarities were calculated for the following properties: (1) DDIs-D, (2) DDIs-N, (3) SNPs, (4) chemical structures, (5) indications, (6) targets, and (7) side effect anatomical hierarchy. ese were used to indicate that similar drugs for each feature have comparable side effects, as shown in Table 1. 2.1.1. Drug-Drug Interaction. DDI relationships were considered based on the assumption that similar drugs are prone to exhibit comparable reactions to other drugs [14,15]. e gold standard dataset was acquired from Ryu et al., who compiled DDIs from DrugBank [16,17].
is dataset comprised 192,284 DDIs with 1,710 different drugs and 86 interaction types. According to Ryu et al., one drug may "affect" or "be affected by" another drug. Interactions between two drugs were expressed using the following notation: (x, y, z), where x represents the other drug within the interaction, y denotes the interaction type, and z represents whether the drug "affects" or "is affected by" another drug in the interaction. For example, if drug A was affected by drug B with interaction type 1, the relationship is expressed as (B, 1, ←); if drug A affected drug C with interaction type 2, the relationship is expressed as (C, 2, ⟶ ). Accordingly, a set of elements of drug A was established . e same procedure was repeated for all the drugs. e DDIs-D similarity was evaluated considering related drugs and their interaction types.
Additionally, by considering network characteristics, the biological functions of DDIs were analyzed. Because proteins interact with other proteins, drug targets in the human protein interactome were considered. e human PPI network was retrieved from STRING (Search Tool for the Retrieval of Interacting Genes/Proteins); it included 11,754,195 pairs of 10,342 different proteins [18]. For each drug, the component genes of a target-target network were obtained using the shortest paths for targets on the PPI network. Subsequently, the gene ontology terms were obtained. e drug-drug similarity of DDIs-N was evaluated using these terms. e DDIs-N similarity represents the functional similarity between the two drugs.

Single Nucleotide Polymorphism.
It is widely accepted that two drugs with similar indications (i.e., diseases) are likely to exhibit comparable side effects. Furthermore, a previous study demonstrated that similarities in SNPs could be leveraged to identify drug repositioning [11]. us, genes that are affected by SNPs were used to consider the genetic traits of diseases that were affiliated with two drugs. First, disease-associated SNPs from the DisGeNET were collected [19]. Next, the SNPs were mapped to each drug based on indications. Finally, the SNPs were linked to other genes that were regulated by SNPs for each drug using data from Fagny et al. [12]. ese data contained expression quantitative trait loci linking SNPs to tissue-specific regulation of gene transcripts from Genotype-Tissue Expression. e SNP similarity was used to show that the more similar the genetic traits of diseases that were related to two drugs, the more comparable the side effects.  [20]. Second, the Comparative Toxicogenomics Database (CTD), consisting of curated and inferred chemical-disease associations, was used [21].
However, among the curated associations, only those that exhibited direct evidence of a chemical-disease association (marked "marker/mechanism" or "therapeutic") were included, and the inferred associations were excluded. As a result, 150,175 associations, spanning 1,510 drugs and 5,709 diseases, were processed from the CTD. ird, the repoDB contained 6,677 associations spanning 1,519 drugs and 1,229  , datasets for machine learning were created with different combinations of features, and diverse classification algorithms, including a random forest, XGBoost, logistic regression, and naive Bayesian model, were applied to predict the relationship between a side effect and a drug. (d) Stacking ensemble learning that incorporated all four classifiers as its base classifiers and used a neural network as its meta classifier was applied.
diseases [22]. Finally, this study comprised 197,547 associations spanning 2,144 drugs with DrugBank ID and 6,556 diseases with UMLS ID. e indication similarity was used to denote the phenotypic traits of drugs.

Targets and Chemical Structures.
In this study, 6,076 drug-target associations for 1,366 drugs were obtained from DrugBank. When drugs were related to the same protein, they underwent similar biological processes [8]. us, drugs that bind to the same target are associated with the same side effect. e biological characteristics of the drugs were considered based on the target similarity. e simplified molecular-input line-entry system (SMILES) refers to simplified chemical structures and encoded molecular graphs as a human readable string [23]. We obtained the chemical structures of 1,878 drugs in the SMILES format from PubChem and DrugBank [24]. Chemical properties of drugs play an important role. Furthermore, drugs with similar chemical structures traditionally exhibit comparable biological activities [5]. e Open Babel software was used to evaluate the similarity of all drug pairs based on the Tanimoto coefficient, which is the number of bits in common divided by the union of the bits using drug fingerprints [25].

Jaccard Coefficient.
e Jaccard coefficient was used to compute the similarities for all the properties mentioned earlier, except the chemical structures. It is equal to the number of traits in common divided by the union of the traits and is expressed as where SA and SB represent the sets of properties of drug A and drug B, respectively. Five similarities were obtained using the Jaccard coefficient for the following properties: (1) DDIs-D,

Side Effect Anatomical Hierarchy Similarity.
e "side effect anatomical hierarchy" similarity was evaluated using side effect anatomical hierarchy from Wadhaw et al.'s paper [13]. ey categorized each side effect manually according to its anatomical scheme, which is organized at three levels: organ, subsystem, and system levels, as depicted in Figure 2. e "side effect" level considers each side effect as an independent side effect. e "organ" level represents organ classes in which side effects are aggregated based on their anatomical schema. e "subsystem" and "system"' levels are aggregated in the same way. eir study implied that similar side effects are likely to be associated with similar anatomical levels. us, we named this as "side effect anatomical hierarchy" in our study. Finally, the anatomical hierarchies of 5,868 side effects were obtained.
Using equation (1), the anatomical hierarchies of side effects a (SE a ) and b (SE b ) for each level were calculated. S l (SE a , SE b ) denotes the similarity of two side effects in a specific level and l indicates the level.
In equation (2), n denotes the total number of levels, which is 3 in this case. Accordingly, we obtained the anatomical hierarchy similarity S side effect (SE a , SE b ) between two side effects.

Assign Similarity Values for Each Drug-Side Effect Sample.
In the study dataset, a drug-side effect pair is a sample. To build effective prediction models, only side effects that were related to at least five drugs were considered. In addition, a similarity value between the drug in the pair and other drugs that had the side effect in the pair was required. Positive samples consisting of the known drug-side effect associations from SIDER were generated using the drugs and side effects that met the above condition. In total, 107,878 drugside effect associations were collected from the SIDER database (reported up to September 2017) [26]. For each drug, the PubChem compound ID was mapped to the DrugBank ID using the annotations provided in biodb.jp [27].
Negative samples were randomly generated from the drugs and side effects used in the positive samples by excluding the known associations. An identical number of drug-side effect samples were collected for the negative and positive sets. en, a dataset comprising negative and positive sets was built. To avoid a negative sample bias, 100 negative sets were extracted using random sampling with replacement. We performed hold-out validation instead of cross validation because the values in each dataset were highly affected by the positive pairs of the training set. ese pairs were split into three subsets: training dataset, validation dataset, and test dataset. e ratio of the three sets was 8 :1 :1. Subsequently, datasets were built using the similarity values generated in the above step. First, for each drug-side effect sample, drugs that had known associations with the side effect in the training set were identified. en, the maximum similarity values for each feature were obtained from the drug and other drugs. In particular, if the drug had a known association with the side effect in the training set, the similarity of the drug itself was excluded. e feature values for the six types of similarities (viz., DDIs-D, DDIs-N, SNPs, indications, targets, and chemical structures) were assigned in the same manner as mentioned above. is step is illustrated in Figure 3.
is procedure was reversed for the side effect anatomical hierarchy feature. First, for a drug-side effect sample, side effects that were known to be associated with the drug in the training set were identified. en, the maximum similarity value between the side effect and other side effects was obtained. In particular, if the side effect had a known association with the drug in the training set, the similarity of the side effect itself was excluded.
Accordingly, each pair was represented by seven features.

Machine
Learning. Using datasets constructed in the above step, four base machine learning algorithms were applied: random forest (RF), logistic regression (LR), XGBoost (XGB), and naive Bayesian (NB) [28][29][30][31]. In the machine learning process, a validation dataset was used to optimize parameters for each algorithm. e test set was used to evaluate the prediction model. In addition, stacking ensemble learning incorporating all these four classifiers as its base classifiers and using a neural network as its meta classifier was applied [32,33]. e stacking ensemble learning is illustrated in Figure 4. e training, validation, and test datasets are shown, respectively, in blue, orange, and green colors in the base models. In the stacking step, the predictions from the validation dataset in the base step become the training dataset and are shown in orange color.

Performance Evaluation.
In this study, the area under the curve (AUC), precision, recall, F1 score, and specificity were evaluated. Each measurement was averaged across 100 validation datasets and test datasets. True positive (TP) and false positive (FP) indicate the numbers of correctly and falsely predicted positive drug-side effect pairs. True negative (TN) and false negative (FN) represent the numbers of correctly and falsely predicted negative drug-side effect pairs.

Confirmation of Drug Candidate Side Effects with FAERS and MedEffect. e average AUC was obtained for each algorithm across validation datasets and test datasets.
Considering all the average AUCs, the RF algorithm, which exhibited the best results in the validation and test datasets, was chosen to predict the candidate side effects of drugs, and a classifier was applied. e remaining set, which was not used earlier, was used to predict the candidate associations.
For validation purposes, the drug-side effect associations were extracted from the FDA Adverse Event Reporting System (FAERS), which uploads event lists four times a year (2015.4-2017.9) [34]. FAERS was compiled into 1,829,465 drug-side effect associations for 2,921 drugs and 5,737 side effects. e data from Health Canada's Adverse Drug Reaction Reporting System (MedEffect) was compiled into 2,466 drug-side effect associations for 55 drugs and 902 side effects [35]. FAERS and MedEffect are postmarketing surveillance systems.
ere were two types of associations: (1) the associations predicted using the method proposed in this study and (2) the actual associations from FAERS and MedEffect. Two contingency tables were created for FAERS and MedEffect with our predictions as depicted in Table 2. Fisher's exact test (expressed as equation (4)) was applied to check whether the candidate predictions from the proposed method significantly enhanced the existing side effect databases.
In this equation, the cells of the contingency table are denoted as x, y, z, and w; n represents the grand total and p represents the probability.

Results and Discussion
Compared with the use of only the target, indication, and chemical properties, the addition of more features improved  BioMed Research International the capability of predicting the candidate side effects of a drug. e similarities between drugs were evaluated using DDIs from DrugBank (DDIs-D), DDIs from network (DDI-N), SNPs, indications, and the target and chemical properties of drugs. In addition, the similarities between side effects based on side effect anatomical hierarchy were computed. A maximum of seven features for each drug-side effect pair were extracted.
For a drug and a side effect to be considered in the prediction process, a similarity value between that drug and other drugs that had the same side effect was required. Furthermore, only side effects associated with at least five drugs were considered. Accordingly, 638 drugs and 2,117 side effects were used in this study. e number of pairs used in each set is presented in Table 3. e datasets were built by combining a positive set with negative sets. e positive set consisted of known associations from SIDER, whereas the negative sets were randomly generated from the drugs and side effects used in the positive set. e associations included in the positive set were excluded from the negative sets. In total, 100 negative sets were randomly generated, and 100 sets were created by combining positive and negative sets. en, a set was split into three subsets: training, validation, and test datasets. e ratio of these sets was 8 : 1 : 1, as indicated in Table 3.

Performance Evaluation.
Four machine learning algorithms were adopted: NB, RF, LR, and XGB algorithms. First, the performance was tested for a case in which only chemical, indication, and target features (primarily utilized in previous studies) were used. Subsequently, other features were added incrementally. e impact of adding features incrementally is illustrated in Table 4. For each type of feature set, 100 hold-out validation runs were performed, and the resulting AUC scores were averaged.
In Table 4, the "Base" feature comprised the indications, targets, and chemical structures commonly used in previous studies. Next, other features were added to the "Base" feature, with "+" representing the addition of features. For example, "Base + SNPs" indicates that the SNP feature was added to indications, targets, and chemical structures. "SE-AH" denotes the side effect of the anatomical hierarchy feature. Accordingly, 16 different feature combinations were   x + z y + w n (�x + y + z + w) created to demonstrate the improved performance with the inclusion of additional features. e results presented in Table 4 indicate that the addition of features increased the AUCs for all machine learning algorithms. In the RF model, the AUC increased by approximately 3.5% when new features were considered along with chemical, indication, and target features. us, it can be concluded that the addition of new features improved the prediction capability. Table 5 represents various performance measurements when all features are used. e data indicate that the higher the AUC, the higher the other measurements. Table 6 illustrates the relative importance of features when all the features are used. Feature importance signifies the extent of the contribution made by each feature to the performance of the model. We used the varImp function of the caret R package to obtain feature importance [36]. It is evident that new features introduced into our prediction method contribute significantly to the improvement in the overall prediction performance. To elaborate further, we depicted the feature importance of RF in Figure 5. RF showed the best results in Table 4.
As shown in Figure 5, the DDI-D and SNP features play a significant role in the RF model. erefore, it can be concluded that the incorporation of new features into the method can substantially improve side effect predictions.

Candidate Predictions.
e proposed method in this study yielded the best results when the RF model with all the features (DDIs-D, DDIs-N, SNPs, side effect anatomical hierarchy, chemical structures, target, and indication) was used, as depicted in Table 4. erefore, the classifier that yielded the highest AUCs from the RF model was chosen to predict the candidate side effects of drugs. In the proposed model, 638 drugs and 2,117 side effects were used. An entire set was generated that comprised 1,350,646 pairs of drugs and side effects. e pairs used in the training, validation, and test sets were excluded from the entire pairs. e remaining 1,197,356 pairs were used to predict candidate side effects of drugs. Finally, the candidate predictions for     Table 7. e predictions of the proposed method also significantly enhance MedEffect (p � 0.001169 and odd ratio � 1.2266). Table 8 shows the contingency table for MedEffect and the predictions from the proposed method.

Comparison with Previous Studies.
We compared our results with those of two previous studies by Zhao et al. because they too treated a drug-side effect pair as a sample, like we did in this study [37,38]. Two other previous studies used the fingerprint, chemical structure, ATC code, literature association, and target as features to predict side effects and applied diverse machine learning algorithms such as RF, nearest neighbor, dagging, and support vector machine. Zhao et al. adopted similarity and network embedding for feature construction, respectively. ey used RF as their preferred method.
In this study, we showed the results of RF in comparison with the results obtained in previous studies [37,38]. ey adopted RF as their final model. e stacking model was proposed to highlight its slightly improved performance compared to RF. In the stacking ensemble model, we adopted a neural network as a meta classifier, and four models (RF, NB, XGB, and LR) were used as base models. erefore, we compared our study to previous studies with two proposed methods: stacking and RF.
As seen in Table 9, the two proposed stacking and RF methods yielded better results while predicting the side effects of drugs than the method of Zhao et al. e results show significant differences in three measurements (precision, recall, and F1). Because the F1 score considered both precision and recall, a good F1 score indicates low false positives and low false negatives. It means that our model was less likely to predict false side effects as positive side effects and actual side effects as negative side effects. erefore, the additional features proposed in this study can be more effective at ascertaining the side effects of drugs than only the features suggested in previous studies.

Case Studies.
Of the 188,568 candidate predictions, 12,648 associations were verified by FAERS and 377 associations were verified by MedEffect. To evaluate the practical benefits of the predictive classification models, diverse drugs were sampled according to the following criteria: (1) a recently approved drug with a target-based rational design, (2) a drug for long-term administration, whose side effects required more monitoring, (3) a multitargeted drug showing several differential indications, and (4) a long-established drug whose mechanism was imperfectly understood. ey were further examined via a comparison with academic literature and drug discovery material.
First, dasatinib, a small-molecule multikinase inhibitor used to treat cancer, was selected as a recently approved drug with a target-based rational drug design [39]. Consequently, the observational and interventional side effects clinical data on the drug were relatively abundant, and swift data updates for indication expansion were expected. Remarkably, among the 449 predicted side effects of dasatinib, 16 matched the reported data in MedEffect and 194 matched the reported data in FAERS. Accordingly, 43% of the predicted side effects of dasatinib were found in the clinical data. Moreover, literary evidence was found for some predicted side effects present in both MedEffect and FAERS. Angioedema is a serious adverse drug reaction that can be caused by dasatinib    [40]. According to a study by Reyes-Habito et al., dasatinib commonly causes skin reactions, such as pruritus, acne, and xerosis [41]. A symptom of "blood alkaline phosphatase increase" can also be caused by dasatinib [42]. Second, sitagliptin, a diabetes therapeutic acting as a selective dipeptidyl peptidase-4 inhibitor, was chosen as a good representative of a drug prescribed for long-term care [43]. Because long-term administration is needed, the side effect monitoring required here is more than that required for short-term administration drugs. Furthermore, sitagliptin also represents a significant new drug class for diabetes. It has a clear mechanism of action and a rational drug design. To develop combination-therapy drugs applicable to a metabolic syndrome caused by sitagliptin and other drugs, researchers need to have a comprehensive understanding of its side effects. Among the 236 predicted side effects of sitagliptin, 134 matched the data reported in FAERS; that is, 56% of the sitagliptin predictions were found in clinical data.
ird, vorinostat, a multitargeting drug was chosen for its widely varying indications. It is used to treat diseases ranging from HIV infection to diverse types of cancers, including nonsmall-cell lung, ovarian, breast, and pancreatic cancers. As vorinostat is a target-based anticancer drug, its side effects are more reasonably traceable than those of other cytotoxic agents. When considering the risk-to-benefit ratio, the side effects of anticancer drugs can be insignificant in comparison with those of other therapeutic drugs. However, recent cancer treatment tends to consider the quality of life as well as the survival rate. Among the 151 predicted side effects of vorinostat, 56 were identified in the FAERS clinical data and 1 was identified in MedEffect. erefore, 37% of the predictions matched the clinically reported side effects in FAERS, thus proving that the proposed method could be used to effectively identify new side effects from the complete list of drugs with warning black boxes.
Finally, clonidine was chosen as it has an unclear mechanism despite being a long-established drug. Although the available information of its on-target effects was inadequate and could not be used to explain its side effects, the accumulated records of both diverse indications and side effects were sufficient for comparison with the predictions from the proposed method. In total, 563 side effects for clonidine were identified.
To summarize, this study shows that, to a notable extent, predictions can be found in clinically reported data. However, some unprecedented side effects suggested by the predictions of the proposed model will have to be clinically validated in the near future.

Conclusions
In this study, we proposed a method to deduce the candidate side effects of drugs using the existing knowledge on drugs and side effects. Unlike previous studies that primarily focused on the biological, phenotypic, and chemical characteristics of drugs, in this study, we used an array of information on drugs. e focal point of this study is that properties used in drug repositioning studies can be utilized to predict side effects because the phenotypic expression of a side effect is similar to that of the disease.
By leveraging a variety of data such as SNPs, DDIs, and side effect anatomical hierarchy, the benefits of adopting these data as features were demonstrated; the overall performance was found to be 3.5% better than that obtained by only the target, chemical, and indication features. e limitation of the proposed method is that it could not identify the candidate side effects of drugs that did not have similarity values in all features. However, the rapid growth of potential information on drugs will ensure the availability of more drugs for further study.

Data Availability
e data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this paper.