A Weighted Voting Classifier Based on Differential Evolution

Abstract. Ensemble learning employs multiple individual classifiers and combines their predictions, which can achieve better performance than a single classifier. Considering that different base classifiers make different contributions to the final classification result, this paper assigns greater weights to the classifiers with better performance and proposes a weighted voting approach based on differential evolution. After optimizing the weights of the base classifiers by differential evolution, the proposed method combines the results of the classifiers according to the weighted voting combination rule. Experimental results show that the proposed method not only improves classification accuracy but also has strong generalization ability and universality.


Introduction
Ensemble learning is a new direction of machine learning, which trains a number of specific classifiers and selects some of them for an ensemble. It has been shown that the combination of multiple classifiers can be more effective than any individual one [1].
From a technical point of view, ensemble learning is mainly implemented in two steps: training weak base classifiers and selectively combining the member classifiers into a stronger classifier. Usually the members of an ensemble are constructed in one of two ways. One is to apply a single learning algorithm, and the other is to use different learning algorithms over a dataset [2]. Then, the base classifiers are combined to form a decision classifier. Generally, to get a good ensemble, the base learners should be as accurate as possible and as diverse as possible. How to choose an ensemble of accurate and diverse base learners is therefore a focus of concern for many researchers [3].
In recent years, more and more researchers have become concerned with ensemble learning [4]. There are many effective ensemble methods, such as boosting [5], bagging [6], and stacking [7]. Boosting is a method of producing highly accurate prediction rules by combining many "weak" rules which may be only moderately accurate. There are many boosting algorithms; the main variation among them is their method of weighting training samples and hypotheses. AdaBoost is very popular and perhaps the most significant historically, as it was the first algorithm that could adapt to the weak learners. Bagging trains a number of base learners, each from a different bootstrap sample, by calling a base learning algorithm. A bootstrap sample is obtained by subsampling the training dataset with replacement, where the size of the sample is the same as that of the training dataset. In a typical implementation of stacking, a number of first-level individual learners are generated from the training dataset by employing different learning algorithms. Those individual learners are then combined by a second-level learner, which is called the metalearner.
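The bootstrap sampling at the heart of bagging can be illustrated in a few lines; the `bootstrap_sample` helper name is ours, not from the paper:

```python
import random

def bootstrap_sample(data, seed=None):
    """Draw a bootstrap sample: the same size as the training set,
    sampled with replacement, so some items repeat and others are left out."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in data]

sample = bootstrap_sample(list(range(100)), seed=1)
```

On average a bootstrap sample contains about 63% of the distinct training items; the rest can serve as an out-of-bag validation set.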
Among the most popular combination schemes, majority voting and weighted voting for classification are widely used. Simple majority voting is a decision rule that selects one of many alternatives based on the predicted classes with the most votes. Majority voting does not require any parameter tuning once the individual classifiers have been trained [8, 9]. In weighted voting, the weights of the votes should vary among the classifiers: the weight should be high for a classifier that performs well. So it is a crucial issue to select appropriate weights for the votes of the classifiers [2]. The weighting problem can be viewed as an optimization problem. Therefore, it can be solved by taking advantage of artificial intelligence techniques such as genetic algorithms (GA) and particle swarm optimization (PSO).
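The difference between the two rules can be shown with a small sketch (function names are ours, for illustration only):

```python
from collections import Counter

def majority_vote(predictions):
    """Each classifier casts one equal vote; the most common class wins."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Each classifier's vote counts with its weight; the class with the
    highest total weight wins."""
    scores = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

preds = ["A", "B", "B", "A", "A"]
print(majority_vote(preds))                              # "A": 3 votes to 2
print(weighted_vote(preds, [0.9, 0.8, 0.8, 0.1, 0.1]))   # "B": 1.6 to 1.1
```

With equal weights the two rules coincide; weighting lets two strong classifiers outvote three weak ones.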

Differential Evolution
The differential evolution algorithm was proposed by Storn and Price [10]. DE optimizes a problem by maintaining a population of candidate solutions, creating new candidate solutions by combining existing ones according to its simple formulae, and then keeping whichever candidate solution has the best score, or fitness, on the optimization problem at hand. The DE algorithm starts with an initial population of NP individuals X_{i,G}, i = 1, ..., NP, where the index i denotes the ith solution of the population at generation G. An individual is defined as a D-dimensional vector X_{i,G} = [x_{i,G}(1), x_{i,G}(2), ..., x_{i,G}(j), ..., x_{i,G}(D)]. There are three main operations of DE that are repeated until the stopping criterion is met. They are briefly described below.
Mutation. The mutation operation creates a donor vector V_{i,G} corresponding to each population member, or target vector, X_{i,G} in the current generation. The most frequently used mutation strategies are presented below [14]:

DE/rand/1: V_{i,G} = X_{r1,G} + F (X_{r2,G} − X_{r3,G})
DE/best/1: V_{i,G} = X_{best,G} + F (X_{r1,G} − X_{r2,G})
DE/current-to-best/1: V_{i,G} = X_{i,G} + F (X_{best,G} − X_{i,G}) + F (X_{r1,G} − X_{r2,G})
DE/best/2: V_{i,G} = X_{best,G} + F (X_{r1,G} − X_{r2,G}) + F (X_{r3,G} − X_{r4,G})
DE/rand/2: V_{i,G} = X_{r1,G} + F (X_{r2,G} − X_{r3,G}) + F (X_{r4,G} − X_{r5,G})

The indices r_k, k = 1, ..., 5, are random, mutually different integers generated within the range [1, NP] and also different from the index i. F is a mutation scaling factor within the range [0, 2], usually less than 1. The vector X_{best,G} is the individual with the best fitness in generation G.
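The five strategies above can be sketched with NumPy; `mutate` is a hypothetical helper name, and the index sampling follows the conventions just stated (r1...r5 mutually different and different from i):

```python
import numpy as np

def mutate(pop, i, best, F=0.5, strategy="rand/1"):
    """Create a donor vector for target i using a classical DE mutation strategy.
    pop: (NP, D) population array; best: index of the fittest individual."""
    NP = len(pop)
    # r1..r5: mutually different random indices, all different from i
    r = np.random.choice([k for k in range(NP) if k != i], size=5, replace=False)
    x = pop
    if strategy == "rand/1":
        return x[r[0]] + F * (x[r[1]] - x[r[2]])
    if strategy == "best/1":
        return x[best] + F * (x[r[0]] - x[r[1]])
    if strategy == "current-to-best/1":
        return x[i] + F * (x[best] - x[i]) + F * (x[r[0]] - x[r[1]])
    if strategy == "best/2":
        return x[best] + F * (x[r[0]] - x[r[1]]) + F * (x[r[2]] - x[r[3]])
    if strategy == "rand/2":
        return x[r[0]] + F * (x[r[1]] - x[r[2]]) + F * (x[r[3]] - x[r[4]])
    raise ValueError(f"unknown strategy: {strategy}")
```

Strategies built on X_{best,G} exploit the current best solution, while the rand variants favor exploration; this trade-off is why several strategies coexist.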
Selection. After reproduction of the trial individual U_{i,G}, the selection operation compares it to its corresponding target individual and decides whether the target or the trial individual survives to the next generation G + 1. The selection operation is described as

X_{i,G+1} = U_{i,G} if f(U_{i,G}) ≤ f(X_{i,G}), and X_{i,G+1} = X_{i,G} otherwise, (7)

where f(·) is the objective function to be optimized; this ensures that each member of the next generation is the fitter of the pair. From (7), we can see that if the trial individual U_{i,G} is better than the target individual X_{i,G}, namely, f(U_{i,G}) ≤ f(X_{i,G}), then it replaces the target individual in the next generation G + 1; otherwise the target individual is retained.
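Putting mutation, crossover, and greedy selection together gives a minimal DE loop. The sketch below is a generic DE/rand/1 with the standard binomial crossover (rate CR), minimizing a stand-in objective; it is not the paper's exact implementation:

```python
import numpy as np

def de_minimize(f, D=5, NP=20, F=0.5, CR=0.9, G_max=100, seed=0):
    """Minimal DE/rand/1/bin loop: mutation, binomial crossover, greedy selection."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5, 5, size=(NP, D))
    fit = np.array([f(x) for x in pop])
    for _ in range(G_max):
        for i in range(NP):
            # mutation: DE/rand/1 donor from three distinct random members
            r1, r2, r3 = rng.choice([k for k in range(NP) if k != i], 3, replace=False)
            donor = pop[r1] + F * (pop[r2] - pop[r3])
            # binomial crossover: take each gene from the donor with probability CR,
            # forcing at least one donor gene so the trial differs from the target
            cross = rng.random(D) < CR
            cross[rng.integers(D)] = True
            trial = np.where(cross, donor, pop[i])
            # selection as in (7): the fitter of target and trial survives
            f_trial = f(trial)
            if f_trial <= fit[i]:
                pop[i], fit[i] = trial, f_trial
    return pop[np.argmin(fit)], fit.min()

best_x, best_f = de_minimize(lambda x: np.sum(x ** 2))
```

On this 5-dimensional sphere function the loop drives the best fitness close to zero within the 100-generation budget.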
To improve optimization performance, DE algorithms are continually being developed, and many different strategies for performing crossover and mutation have been proposed [15-18].

DE-Based Model for Parameter Selection
In this section, we are concerned with parameter selection for the proposed DEWVote. The parameters to be optimized in DEWVote are the weights of the base classifiers in the ensemble. Different parameter settings have a heavy impact on the performance of DEWVote, so we use differential evolution to search for the optimal weights. DE starts with a random initial population of solution candidates that is then improved by the evolution operations. In general, we employ a predefined maximum number of iterations G_max as the stopping criterion of DE. The other control parameters of DE are the mutation scaling factor F ∈ (0, 1), the crossover rate CR ∈ (0, 1), and the population size NP.

Fitness Evaluation. DEWVote is trained with each individual (weight) vector, and the corresponding 10-fold cross-validation accuracy is then evaluated as the fitness function.
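A sketch of such a fitness function, assuming the base classifiers' predictions on held-out data have been precomputed (a single-split simplification of the paper's 10-fold evaluation; all names here are ours):

```python
def fitness(weights, base_preds, y_true, n_classes):
    """Fitness of a candidate weight vector: accuracy of the weighted-vote
    ensemble on held-out labels. base_preds[i][s] is the class index that
    base classifier i predicts for sample s."""
    correct = 0
    for s, y in enumerate(y_true):
        score = [0.0] * n_classes
        for i, w in enumerate(weights):
            score[base_preds[i][s]] += w  # delta = 1 only for the predicted class
        if max(range(n_classes), key=lambda j: score[j]) == y:
            correct += 1
    return correct / len(y_true)
```

DE then maximizes this accuracy over the weight vectors; each fitness call is cheap because the base classifiers are trained once and only the vote weights change.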
Given the number of categories C and L base classifiers to vote, the predicted category y(x) of weighted voting for each sample x is described as

y(x) = argmax_{j=1,...,C} Σ_{i=1}^{L} w_i Δ_{ij}, (8)

where Δ_{ij} is a binary variable: if the ith base classifier classifies sample x into the jth category, then Δ_{ij} = 1; otherwise, Δ_{ij} = 0. w_i is the weight of the ith base classifier in the ensemble, which is optimized by the DE algorithm in Algorithm 1.
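For a single sample, this weighted voting rule reduces to a weight-accumulating argmax (a sketch with hypothetical names):

```python
def dewvote_predict(votes, weights, n_classes):
    """Weighted voting for one sample: pick the class j maximizing
    sum_i w_i * delta_ij, where votes[i] is the class index predicted
    by base classifier i."""
    score = [0.0] * n_classes
    for cls, w in zip(votes, weights):
        score[cls] += w  # delta_ij = 1 only for the class classifier i predicted
    return max(range(n_classes), key=lambda j: score[j])

# three classifiers favor class 0 with total weight 0.7, beating class 1's 0.6
label = dewvote_predict([0, 1, 1, 0, 2], [0.5, 0.3, 0.3, 0.2, 0.1], n_classes=3)
```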
Then, the accuracy is defined as

Accuracy = N_c / N, (9)

where N_c is the number of correctly classified samples and N is the total number of samples. After obtaining the best individual by differential evolution, namely, the optimal weight vector (w_1, w_2, ..., w_i, ..., w_L), we generate the ensemble classifier to classify the test datasets using (8).

Experimental Results and Analysis
In this section, we present and discuss, in detail, the results obtained in the experiments carried out in this research. We run our experiments under the framework of Weka [19], using 15 datasets to test the performance of the proposed method. These datasets are from the UCI Machine Learning Repository [20]; information about them is summarized in Table 1. In the DE algorithm, the choice of parameters can have a large impact on optimization performance, and selecting parameters that yield good performance has been the subject of much research. For simplicity, we set the scaling factor F = 0.5, crossover rate CR = 0.9, population size NP = 20, and maximum iteration number G_max = 100.
C4.5 is an algorithm used to generate a decision tree, developed by Quinlan [21]. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy.
A Naive Bayes classifier [22] is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. Bayes' theorem provides a way of calculating the posterior probability. A Naive Bayes classifier assumes that the effect of the value of a predictor on a given class is independent of the values of the other predictors.
Bayes nets, or Bayesian networks [23], are graphical representations of probabilistic relationships among a set of random variables. A Bayesian network is an annotated directed acyclic graph (DAG) that encodes a joint probability distribution. The nodes of the graph correspond to the random variables, and the links correspond to direct influence from one variable to another.
k-NN is a type of instance-based learning, or lazy learning. In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of its single nearest neighbor.
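A minimal k-NN classifier following this description (Euclidean distance; the names are ours):

```python
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # sort training indices by squared Euclidean distance to x
    nearest = sorted(range(len(X_train)),
                     key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)))
    votes = [y_train[i] for i in nearest[:k]]
    return Counter(votes).most_common(1)[0][0]
```

Being lazy, the method does all its work at prediction time; no model is built in advance.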
To obtain a better measure of predictive accuracy, we compare these methods using 10-fold cross-validation. The cross-validation accuracy is the average of the ten estimates. In each fold, nine out of ten parts of the samples are selected as the training set, and the remaining one part is the testing set. This process is repeated 10 times so that every sample is used in both the training and testing sets. Table 2 shows the average accuracy values of the four single methods.
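The fold construction described here can be sketched as follows (a simple interleaved split, one of several valid ways to form the folds):

```python
def k_fold_indices(n, k=10):
    """Split sample indices 0..n-1 into k folds; each fold serves once as
    the test set while the other k-1 folds form the training set."""
    folds = [list(range(f, n, k)) for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test
```

Across the k iterations every sample appears exactly once in a test set, so the averaged accuracy uses each sample for evaluation exactly once.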
From Table 2, we can see that each method outperforms the other single methods on some datasets. Comparatively, C4.5 achieves higher accuracy than the other methods on 8 of the 15 datasets. It is also worth noting that these base classifiers are quite diverse.
To obtain a better measure of predictive accuracy, we also compare several ensemble methods using 10-fold cross-validation: bagging, AdaBoost, majority voting, and our DEWVote approach. In the DEWVote approach, we select five classifiers as base learners: C4.5, Naive Bayes, Bayes Nets, k-nearest neighbor (k-NN), and ZeroR [19]. ZeroR is the simplest classification method: it relies on the target and ignores all predictors, simply predicting the majority category (class). Although ZeroR has no predictive power, it is useful for establishing a baseline performance as a benchmark for the other classification methods. Majority voting uses the same base classifiers as our approach. A Naive Bayes classifier is employed as the base learning algorithm of bagging and AdaBoost; Naive Bayes classifiers are generated multiple times by each ensemble method's own mechanism, and the generated classifiers are then combined to form an ensemble.
We present the mean of the 10-fold cross-validation accuracies for the 15 datasets. The results of the ensembles are shown in Table 3. DEWVote achieves higher accuracy than the other ensemble methods except on the Diabetes and Segment-challenge datasets, where majority voting outperforms the other ensemble methods. It is of note that majority voting achieves higher accuracy than bagging and AdaBoost. Comparatively speaking, DEWVote and majority voting obtain better performance on the majority of the datasets. However, bagging and boosting obtain better performance than majority voting on the Vote dataset.

Conclusions
In this paper we give a novel approach that finds the optimal weights of base classifiers by differential evolution and present a weighted voting ensemble learning classifier. The proposed approach adopts an ensemble learning strategy and selects several base learners, which are as diverse as possible from each other, to form an ensemble classifier. The weight of each base learner is obtained by the differential evolution algorithm.
We have compared the performance with three classical ensemble methods, as well as with four base classifiers. Experimental results have confirmed that our approach consistently outperforms the previous approaches. DEWVote searches for the weights through iterative operations.

Algorithm 1:
Figure 1: The whole procedure of our proposed approach.

Table 1: Summary of datasets.