Combination drugs that act on multiple targets simultaneously are promising candidates for combating complex diseases because of their improved efficacy and reduced side effects. However, exhaustive screening of all possible drug combinations is extremely time-consuming and impractical. Here, we present a novel Hadoop-based approach to predicting drug combinations that takes advantage of the MapReduce programming model to improve the scalability of the prediction algorithm. By integrating the gene expression data of multiple drugs, we implemented data preprocessing and the support vector machine and naïve Bayesian classifiers on Hadoop for the prediction of drug combinations. The experimental results suggest that our Hadoop-based model achieves much higher efficiency in the big data processing steps with satisfactory performance. We believe that our proposed approach can help accelerate the prediction of potentially effective drug combinations as the number of possible combinations grows exponentially. The source code and datasets are available upon request.
In the past few years, novel effective drugs have emerged slowly despite substantial investment in drug development. The pharmaceutical industry commonly develops novel drugs against a single target. However, the once-dominant paradigm of “mono drug mono target” in drug development is now being challenged by clinicians and pharmaceutical researchers, since a single drug is not always effective against complex diseases (such as cancer and diabetes), which may involve multiple biological pathways and complex pathological processes. Therefore, the drug combination, which consists of multiple drugs (the effective chemical molecules), is becoming a novel strategy to combat complex diseases [
It is impractical to screen all possible drug combinations experimentally since there will be an exponential explosion when the number of single drugs increases. Therefore, a great number of computational methods have been recently developed for prediction of drug combinations [
Although these existing methods can predict novel drug combinations or provide mechanistic insights into existing ones, their efficiency is limited when the combination space grows exponentially (e.g., when moving from pairwise to three-wise combinations). Therefore, it is necessary to develop prediction methods that scale with both data and computation. The Hadoop MapReduce system [
All the basic information about single drugs and effective drug combinations was extracted from the Drug Combination Database (DCDB) (
To encode the drug combinations, we focus on the possible effects of different drug combinations on the pathways in which they may be involved. The gene expression profiles of the 1309 small-molecule drugs or compounds were downloaded from the Broad Institute Connectivity Map Build 02 (
Because gene expression data are directly available only for single drugs, we first need to represent the features of pairwise (or higher-order) drug combinations. In this study, we applied two different strategies to define the combination feature, as described below.
(1) The first kind of representation defines the combination feature directly as a linear function of the single-drug features. For a drug
Obviously, this is a simple way to obtain the combination feature of any pairwise drug combination. However, this representation cannot convey the intricacy of drug combinations, given the complexity of human disease mechanisms.
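As a minimal sketch of this first representation, the following assumes that the linear function is a plain element-wise (optionally weighted) sum of the two single-drug expression profiles. The function name, the weights, and the toy expression values are hypothetical and are not taken from our implementation.

```python
from typing import List, Sequence


def combine_linear(profile_a: Sequence[float], profile_b: Sequence[float],
                   weight_a: float = 1.0, weight_b: float = 1.0) -> List[float]:
    """Element-wise weighted sum of two single-drug expression profiles.

    Assumption: both profiles list the same genes in the same order.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genes in the same order")
    return [weight_a * a + weight_b * b for a, b in zip(profile_a, profile_b)]


# Made-up expression values for three genes under two single drugs.
drug_a = [0.5, -1.0, 0.0]
drug_b = [0.25, 0.5, -2.0]
combo = combine_linear(drug_a, drug_b)
```

With equal weights the combination feature is simply the sum of the two profiles, which illustrates why this representation cannot express any interaction between the drugs.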
(2) As an alternative, we identify the frequent feature patterns of effective drug combinations and take them as the features of potentially effective drug combinations. Here, we assume that a pathway is affected if it contains genes whose expression levels change significantly under the effect of a single drug. We first performed the Student’s
This feature construction yields a high-dimensional feature space on a dataset with few samples. To avoid overfitting, we applied several feature selection methods to our dataset. For the first type of feature construction method mentioned above, we performed the minimum-redundancy maximum-relevance (mRMR) [
In the model building step, we employed two popular machine learning algorithms, support vector machine (SVM) and naïve Bayesian, to train classifiers for predicting effective drug combinations. In the SVM algorithm, the choice of kernel function and related parameters has a great effect on the performance of the trained classifier. In the training stage, we compared four types of kernel functions: linear, polynomial, Gaussian, and hyperbolic tangent (tanh). The SVM classifiers were implemented using the LibSVM package [
For a scalable implementation of our mining process, we used machine virtualization to build the Hadoop cluster. The master virtual machine had 4 Intel Core i3 processor cores and 4 GB RAM, and each of the two slave virtual machines had 2 Intel Core i3 processor cores and 2 GB RAM. The software environment included Hadoop 1.2.1, Hive 0.11.0, and RHadoop (an integration of R and Hadoop).
After building the scalable Hadoop cluster, we used the Hadoop Distributed File System (HDFS) to store the raw data, and we used Hive as the ETL tool for the relational data together with programs to process the local files.
The feature construction stage can be regarded as a series of similar, independent processes applied to different samples and features. In Hadoop, we implemented a chain mapper to parallelize these processes, including the gene expression preprocessing and the construction of the proposed drug combination features.
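The idea of chaining mappers can be illustrated with a small in-memory emulation in Python. This is only a sketch of the pattern, not the Hadoop job itself: the two mapper steps (a log2 transform and a thresholding step), the cutoff of 1.0, and the input values are hypothetical stand-ins for the preprocessing and feature construction steps.

```python
import math
from typing import Callable, Iterable, Iterator, List, Tuple

Record = Tuple[str, List[float]]  # (drug identifier, per-gene values)
Mapper = Callable[[str, List[float]], Iterator[Record]]


def log_ratio_mapper(drug: str, values: List[float]) -> Iterator[Record]:
    # Hypothetical preprocessing step: log2 of treatment/control ratios.
    yield drug, [math.log2(v) for v in values]


def threshold_mapper(drug: str, values: List[float]) -> Iterator[Record]:
    # Hypothetical feature step: flag genes as up (1), down (-1), or unchanged (0).
    yield drug, [1.0 if v >= 1.0 else -1.0 if v <= -1.0 else 0.0 for v in values]


def run_chain(records: Iterable[Record], mappers: List[Mapper]) -> List[Record]:
    """Pass every record through the mappers in order, as a chain of map tasks would."""
    current = list(records)
    for mapper in mappers:
        current = [out for key, value in current for out in mapper(key, value)]
    return current


ratios = [("drugA", [2.0, 0.5, 1.0]), ("drugB", [4.0, 1.0, 0.25])]
features = run_chain(ratios, [log_ratio_mapper, threshold_mapper])
```

Because each record is transformed independently of all others, the records can be distributed over map tasks without any coordination, which is what makes this stage straightforward to parallelize.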
The SVM algorithm itself is difficult to parallelize. Here, we parallelized only the grid search for the optimal parameters, which is time-consuming in the sequential implementation.
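A parallel grid search can be sketched as a map step that scores each parameter pair independently and a reduce step that keeps the best one. The sketch below uses a Python thread pool purely for illustration. The scoring function is a toy stand-in (an assumption) for training an SVM and returning its cross-validated accuracy, and the parameter grid is hypothetical.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple

Params = Tuple[float, float]  # (C, gamma)


def toy_cv_accuracy(params: Params) -> float:
    # Stand-in for "train an SVM with (C, gamma) and return its
    # cross-validated accuracy"; by construction it peaks at C=10, gamma=0.1.
    c, gamma = params
    return 1.0 / (1.0 + abs(c - 10.0) + abs(gamma - 0.1))


def grid_search(c_values: Sequence[float], gamma_values: Sequence[float],
                score: Callable[[Params], float],
                workers: int = 4) -> Tuple[Params, float]:
    grid = list(itertools.product(c_values, gamma_values))
    # "Map": every grid point is scored independently of the others.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(score, grid))
    # "Reduce": keep the best-scoring parameter pair.
    best_index = max(range(len(grid)), key=scores.__getitem__)
    return grid[best_index], scores[best_index]


best_params, best_score = grid_search(
    [0.1, 1.0, 10.0, 100.0], [0.01, 0.1, 1.0], toy_cv_accuracy)
```

On a cluster, the grid points would be distributed to map tasks rather than threads; the structure of the computation stays the same because no grid point depends on another.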
For the naïve Bayesian algorithm, the implementation of the scalable version using MapReduce is mainly composed of three steps (shown in Algorithm
Tenfold cross-validation and leave-one-out cross-validation tests were used to evaluate the classification performance. To assess the performance of the classification models, we used the accuracy (ACC), sensitivity (SN, also called recall), specificity (SP), and
The performance of the SVM-based prediction model is determined by the feature representation, the type of kernel function, and the parameters. Here, a tenfold cross-validation test was conducted to evaluate the model performance. We employed three feature representations, including the linear addition, Zhao’s frequent pattern [
Comparison of the accuracy of the prediction models based on SVM using various feature representations and kernel functions.
Feature representation     Linear  Polynomial  Gaussian  Tanh
Linear addition pattern    47.7%   47.7%       47.7%     53.0%
Zhao’s frequent pattern [  50.0%   55.1%       57.4%     56.2%
Our frequent pattern       62.2%   64.6%       69.1%     65.4%
In this section, we evaluated the prediction performance of our proposed frequent pattern with the Gaussian kernel in an independent test, which mimics a true prediction because the model trained on one dataset is tested on an unseen dataset. We randomly split the whole set of the 76 drug combinations into a training set and a testing set, with a ratio of about 4 : 1 between the numbers of training and testing samples. The split of the dataset and the independent test were repeated 10 times. The performance of the 10 runs and their average is presented in Table
The performance of the independent test using our definition of the frequent pattern and the Gaussian kernel.
Run      ACC    SN     SP     F-measure
1        67.7%  70.6%  64.3%  0.706
2        65.0%  54.5%  77.8%  0.632
3        60.9%  44.4%  71.4%  0.471
4        64.0%  66.7%  60.0%  0.690
5        68.2%  61.5%  77.8%  0.696
6        65.5%  41.7%  82.4%  0.500
7        77.8%  64.3%  92.3%  0.750
8        72.2%  76.9%  60.0%  0.800
9        72.0%  66.7%  80.0%  0.741
10       70.4%  66.7%  75.0%  0.714
Average  68.4%  61.4%  74.1%  0.670
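The measures in the table above follow the standard definitions based on the confusion counts, which can be sketched as below. The counts in the example are hypothetical: they are not reported in this paper and were chosen only because they reproduce the values of run 7. The last column of the table is consistent with the F-measure, that is, the harmonic mean of precision and sensitivity.

```python
from typing import Dict


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Standard two-class measures; assumes every denominator is non-zero."""
    sn = tp / (tp + fn)                     # sensitivity (recall)
    sp = tn / (tn + fp)                     # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)   # accuracy
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sn / (precision + sn)
    return {"ACC": acc, "SN": sn, "SP": sp, "F": f_measure}


# Hypothetical confusion counts consistent with run 7 of the table.
m = classification_metrics(tp=9, fn=5, tn=12, fp=1)
```

With these counts the function gives an accuracy of 21/27, a sensitivity of 9/14, a specificity of 12/13, and an F-measure of 0.75, matching the rounded values of that row.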
In the two-class classification task, the assignment of the negative samples (non-effective drug combinations) is imperfect, since an unknown pairwise drug combination (currently treated as non-effective) may prove to be an effective drug combination in the future. To avoid this problem, we constructed a one-class SVM classifier trained on a dataset containing only effective drug combinations. We used leave-one-out cross-validation to assess the accuracy of one-class SVM classifiers with different types of kernel functions. As shown in Table
The performance of the one-class SVM classifiers using different kernel functions.
Kernel  Linear  Polynomial  Gaussian  Tanh
ACC     46.1%   81.2%       88.2%     80.3%
In this section, we constructed a scalable version of the mining tool for identifying effective drug combinations and compared its efficiency with that of the conventional sequential implementation. The preprocessing steps (including microarray processing and the construction of single-drug and drug-combination features) were parallelized by a chain of mappers. The naïve Bayesian algorithm was implemented as a series of MapReduce jobs.
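To illustrate how naïve Bayesian training decomposes into map and reduce steps, the following in-memory Python sketch counts class and feature occurrences in a map step, sums them in a reduce step, and then classifies a new sample from the summed counts. It is not the job sequence used in our implementation: the binary features, the Laplace smoothing, the key layout, and the toy training samples are all assumptions made for the example.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Sample = Tuple[List[int], str]  # (binary feature vector, class label)
Key = Tuple                     # ("class", label) or ("feat", label, index, value)


def map_counts(sample: Sample) -> Iterator[Tuple[Key, int]]:
    # Map: emit one count for the class and one per (feature, value) pair.
    features, label = sample
    yield ("class", label), 1
    for index, value in enumerate(features):
        yield ("feat", label, index, value), 1


def reduce_counts(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, int]:
    # Reduce: sum the counts that share a key.
    totals: Dict[Key, int] = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def predict(features: List[int], totals: Dict[Key, int], labels: List[str]) -> str:
    # Classify by the largest log posterior, with Laplace smoothing
    # over the two possible values of each binary feature.
    n = sum(totals[("class", label)] for label in labels)
    best_label, best_score = labels[0], -math.inf
    for label in labels:
        class_count = totals[("class", label)]
        score = math.log(class_count / n)
        for index, value in enumerate(features):
            hits = totals.get(("feat", label, index, value), 0)
            score += math.log((hits + 1) / (class_count + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training samples.
training = [
    ([1, 1, 0], "effective"),
    ([1, 0, 0], "effective"),
    ([0, 0, 1], "non-effective"),
    ([0, 1, 1], "non-effective"),
]
pairs = [pair for sample in training for pair in map_counts(sample)]
totals = reduce_counts(pairs)
labels = ["effective", "non-effective"]
prediction = predict([1, 0, 0], totals, labels)
```

Only the counting is data-parallel; the classification step needs nothing more than the reduced counts, so the model itself stays small regardless of the number of training samples.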
The detailed comparison results of our scalable version and the sequential version in efficiency are listed in Table
Comparison of the average efficiency between the scalable and sequential versions.
Mining steps           Scalable version  Sequential version
Microarray processing  2 h 3 min         6 h 18 min
Feature construction   8 min 34 s        18 min 3 s
Naïve Bayesian         15 s              3 s
SVM grid search        27 min 6 s        1 h 11 min
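The timings in the table above can be read as speedup ratios (sequential time divided by scalable time). The short calculation below uses only the values from the table; the helper name is illustrative.

```python
def to_seconds(hours: int = 0, minutes: int = 0, seconds: int = 0) -> int:
    return hours * 3600 + minutes * 60 + seconds


# (scalable, sequential) running times in seconds, taken from the table.
timings = {
    "Microarray processing": (to_seconds(2, 3), to_seconds(6, 18)),
    "Feature construction": (to_seconds(0, 8, 34), to_seconds(0, 18, 3)),
    "Naive Bayesian": (to_seconds(0, 0, 15), to_seconds(0, 0, 3)),
    "SVM grid search": (to_seconds(0, 27, 6), to_seconds(1, 11)),
}

# A ratio above 1 means the scalable version was faster for that step.
speedup = {step: sequential / scalable
           for step, (scalable, sequential) in timings.items()}
```

This gives ratios of about 3.07 for microarray processing, 2.11 for feature construction, and 2.62 for the SVM grid search, whereas the ratio of 0.2 for the naïve Bayesian step shows that the scalable version was slower than the sequential one on this small dataset.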
In this study, we proposed a novel Hadoop-based approach to predicting drug combinations by implementing the support vector machine and naïve Bayesian classifiers with the MapReduce programming model, which improves the scalability of the prediction algorithm. We believe that our proposed model will be useful in the long run as combinations of more than two drugs, and hence a rapidly growing number of candidate combinations, are considered for combating complex diseases.
The authors declare that they have no conflicts of interest regarding the publication of this paper.
Yifan Sun and Yi Xiong contributed equally to this paper.