A Novel Margin-Based Measure for Directed Hill Climbing Ensemble Pruning

Ensemble pruning is a technique to increase ensemble accuracy and reduce its size by choosing a subset of ensemble members to form a subensemble for prediction. Many ensemble pruning algorithms based on a directed hill climbing search policy have recently been proposed. The key to the success of these algorithms is to construct an effective measure to supervise the search process. In this paper, we study the importance of individual classifiers with respect to an ensemble using the margin theory proposed by Schapire et al. and conclude that ensemble pruning via a directed hill climbing strategy should focus more on examples with small absolute margins as well as on classifiers that correctly classify more examples. Based on this principle, we propose a novel measure called the margin-based measure to explicitly evaluate the importance of individual classifiers. Our experiments show that using the proposed measure to prune an ensemble leads to significantly better accuracy results compared to other state-of-the-art measures.


Introduction
Ensembles of multiple learning machines have been a very popular research topic during the last decade in machine learning and data mining. The basic idea is to construct multiple classifiers from the original data and then aggregate their predictions when classifying examples with unknown classes. Theoretical and empirical results show that an ensemble has the potential to increase classification accuracy beyond the level reached by an individual classifier alone [1]. Dietterich stated "A necessary and sufficient condition for an ensemble of classifiers to be more accurate than any of its individual members is if the classifiers are accurate and diverse" [2].
Many approaches have been proposed to create ensemble members with both high accuracy and high diversity. They can be grouped into three main categories: (1) manipulating the data set [3,4], (2) manipulating the features [5][6][7][8], and (3) manipulating the algorithms [9]. Bagging [3] and boosting [4], the most widely used and successful ensemble learning methods, fall into the first category. Bagging learns individual classifiers on data sets obtained by randomly sampling from the original training set; through this random perturbation, the learned classifiers obtain high accuracy and sufficient diversity. Unlike bagging, boosting is an iterative learning process. In each iteration, boosting adjusts the distribution of the training set such that classifiers focus more on examples that are hard to classify correctly. The approaches that manipulate features build the individual classifiers on diverse feature spaces obtained by selecting subsets of the original features or by generating new features from them. For example, random forests [5,6] learn each tree on a feature subset obtained by randomly sampling from the original features, and COPEN [8] learns the base classifiers on new feature spaces mapped from the original feature space using pairwise constraints projection. The individual classifiers can also be built by manipulating algorithms: diverse classifiers are learned by adjusting the model structure or parameter settings. For example, the negative correlation method explicitly constrains the parameters of individual neural networks to be different by means of a regularization term [9].
Ensemble methods have been successfully applied to many fields such as remote sensing [10], time series prediction [11], and imbalanced learning problems [12]. However, an obvious problem of ensemble learning methods is that they tend to train a very large number of classifiers, which require large storage resources and considerable computational resources to calculate the outputs of the individual learners. Besides, it is not always true that the larger the ensemble, the better its performance. In fact, Zhou et al. [13] proved that the generalization performance of a subset of an ensemble may be even better than that of the ensemble consisting of all the given individual learners. These reasons motivated ensemble pruning, also called ensemble selection or ensemble thinning, which selects a subset of ensemble members to form a subensemble that consumes fewer resources and responds faster while achieving accuracy similar to or better than that of the original ensemble [14][15][16][17][18][19][20][21][22].
Given an ensemble with $M$ members, searching for the best subset of ensemble members by enumerating all subensemble candidates is computationally infeasible because the search space contains $2^M - 1$ nonempty subsets; the problem is NP-complete [23]. Several efficient methods that are based on a directed hill climbing search in the space of subsets report good predictive performance [15,16,18,[24][25][26][27]. These methods start with an empty (or full) initial ensemble and search the space of different ensembles by iteratively expanding (or contracting) the initial ensemble by a single model. The search is guided by an evaluation measure that is based on either the predictive performance or the diversity of the alternative subsets. The evaluation measure is the main component of a directed hill climbing algorithm and it differentiates the methods that fall into this category.
In this paper, we apply the concept of example margins proposed by Schapire et al. [28] to analyse the importance of individual classifiers with respect to an ensemble and conclude that ensemble pruning via a directed hill climbing strategy should focus more on examples with small absolute margins as well as on classifiers that correctly classify more examples. Based on the gained insight, a criterion called the margin-based measure is proposed to supervise the search process of ensemble pruning via a directed hill climbing strategy. Our experiments show that using the proposed measure to prune an ensemble leads to significantly better accuracy results compared to other state-of-the-art measures.
The paper is structured as follows. Section 2 briefly describes ensemble pruning via directed hill climbing search. Section 3 analyses the importance of individual classifiers using margin theory. Section 4 proposes a measure for evaluating the importance of individual classifiers. Section 5 reports the experimental settings and results, and we conclude this paper in Section 6.

Related Work
Directed hill climbing ensemble pruning (DHCEP) attempts to find the globally best subset of classifiers by taking local greedy decisions for changing the current subset [17,28,29]. An example of the search space for an ensemble of four models is presented in Figure 1.
The direction of search and the measure used for evaluating the search are two important parameters that differentiate one DHCEP method from another. The following sections discuss the different options for instantiating these parameters and the particular choices of existing methods.

Direction of Search.
Based on the direction of search we have two main categories of DHCEP methods: (a) forward selection and (b) backward elimination (see Figure 1).

In forward selection, the current classifier subset $S$ is initialized to the empty set. The algorithm then iteratively adds to $S$ the classifier $h \in H \setminus S$ that optimizes an evaluation function, where $H$ denotes the original ensemble. This function evaluates the addition of classifier $h$ to the current subset $S$ based on the pruning set (labeled data). In the past, this approach has been used in [14,25,26] and in reduce-error pruning methods [30,31].
In backward elimination, the current classifier subset $S$ is initialized to the complete ensemble $H$ and the algorithm continues by iteratively removing from $S$ the classifier $h \in S$ that optimizes the evaluation function. This function evaluates the removal of classifier $h$ from the current subset $S$ based on the pruning set. In the past, this approach has been used in the AID thinning and concurrency thinning algorithms [15].
In both cases, the traversal requires the evaluation of $M(M + 1)/2$ subsets, leading to a time complexity of $O(M^2 g(M, N))$, where the term $g(M, N)$ is the complexity of the evaluation function, which is linear with respect to $N$ (the size of the pruning set) and ranges from constant to quadratic with respect to $M$ (the size of $H$), as we will see in the following sections.
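The forward selection procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `forward_selection` and `vote_accuracy` are hypothetical, classifiers are represented only by their label predictions in {-1, +1} on the pruning set, and the toy measure (majority-vote accuracy after adding the candidate) merely stands in for whichever evaluation measure guides the search.

```python
from typing import Callable, List, Sequence

Predictions = Sequence[int]  # one classifier's labels on the pruning set


def forward_selection(
    preds: List[Predictions],
    labels: Sequence[int],
    measure: Callable[[Predictions, List[Predictions], Sequence[int]], float],
    size: int,
) -> List[int]:
    """Greedily pick up to `size` classifier indices, one per step."""
    selected: List[int] = []
    remaining = list(range(len(preds)))
    while remaining and len(selected) < size:
        current = [preds[j] for j in selected]
        # Score every remaining candidate against the current subensemble.
        best = max(remaining, key=lambda j: measure(preds[j], current, labels))
        selected.append(best)
        remaining.remove(best)
    return selected


def vote_accuracy(h, current, labels):
    """Toy measure: majority-vote accuracy after adding h (ties count as errors)."""
    members = list(current) + [h]
    correct = 0
    for i, y in enumerate(labels):
        vote = sum(m[i] for m in members)
        if vote * y > 0:
            correct += 1
    return correct / len(labels)
```

Selecting all $M$ members this way evaluates $M + (M - 1) + \dots + 1 = M(M + 1)/2$ candidates, which matches the subset count given above.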

Evaluation Measure.
Evaluation measures are the main component that differentiates DHCEP methods. They can be grouped into two major categories: those based on performance and those based on diversity.
The goal of performance-based measures is to find the model that maximizes the performance of the ensemble produced by adding (or removing) a model to (or from) the current ensemble. Their calculation depends on the method used for ensemble combination, which usually is voting. Accuracy was used as an evaluation measure by Margineantu and Dietterich [30] and by Fan et al. [25], while Caruana et al. [26] experimented with several measures, including accuracy, root mean squared error, mean cross-entropy, lift, precision/recall break-even point, precision/recall F-score, average precision, and ROC area. Another measure is benefit, which is based on a cost model and has been used by Fan et al. [25]. The calculation of performance-based metrics requires the decision of the ensemble on all examples of the pruning set. Therefore, the complexity of these measures is $O(|S| N)$. However, this complexity can be reduced to $O(N)$ if the predictions of the current ensemble are updated incrementally each time a classifier is added to (or removed from) it.
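The incremental update mentioned above can be illustrated with a short sketch. It assumes two-class labels in {-1, +1} and simple majority voting; the class name `VoteTracker` and its methods are hypothetical. The tracker caches the signed vote sum of the current subensemble for every pruning example, so scoring a candidate needs one pass over the examples instead of one pass per ensemble member.

```python
class VoteTracker:
    """Caches per-example vote sums of the current subensemble."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.votes = [0] * len(self.labels)  # signed vote sum per example
        self.size = 0

    def add(self, preds):
        """Add a classifier's predictions to the cached votes: O(N)."""
        for i, p in enumerate(preds):
            self.votes[i] += p
        self.size += 1

    def remove(self, preds):
        """Remove a previously added classifier: O(N)."""
        for i, p in enumerate(preds):
            self.votes[i] -= p
        self.size -= 1

    def accuracy_with(self, preds):
        """Majority-vote accuracy if `preds` were added (ties count as errors)."""
        hits = sum(
            1
            for v, p, y in zip(self.votes, preds, self.labels)
            if (v + p) * y > 0
        )
        return hits / len(self.labels)
```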
Ensemble diversity, that is, the difference among the individual learners, is a fundamental issue in ensemble methods.Intuitively, it is easy to understand that, in order to gain from a combination, individual learners must be different, and otherwise there would be no performance improvement if identical individual learners were combined.
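Let $h$ be a classifier and let $S$ be a subensemble. Partalas et al. [16,18,29] identify that the predictions of $h$ and $S$ on an instance $x_i$ with label $y_i$ can be categorized into four cases:

$$e_{tf}: h(x_i) = y_i \wedge S(x_i) \neq y_i, \qquad e_{ft}: h(x_i) \neq y_i \wedge S(x_i) = y_i,$$
$$e_{tt}: h(x_i) = y_i \wedge S(x_i) = y_i, \qquad e_{ff}: h(x_i) \neq y_i \wedge S(x_i) \neq y_i.$$

They concluded that considering the four cases is crucial for designing ensemble diversity measures. Many diversity measures are designed by considering some or all of the four cases, for example, complementariness [14] and concurrency [15]. The complementariness of $h$ with respect to $S$ and a pruning set $D_{pr}$ is calculated as

$$\mathrm{COM}(h, S) = \sum_{x_i \in D_{pr}} I\big(h(x_i) = y_i \wedge S(x_i) \neq y_i\big), \tag{1}$$

where $I(\text{true}) = 1$ and $I(\text{false}) = 0$. The complementariness is exactly the number of examples that are correctly classified by $h$ and incorrectly classified by $S$. The concurrency is defined as

$$\mathrm{CON}(h, S) = \sum_{x_i \in D_{pr}} \Big(2\, I\big(h(x_i) = y_i \wedge S(x_i) \neq y_i\big) + I\big(h(x_i) = y_i \wedge S(x_i) = y_i\big) - 2\, I\big(h(x_i) \neq y_i \wedge S(x_i) \neq y_i\big)\Big), \tag{2}$$

which is similar to the complementariness, with the difference that it considers two more cases and weights them. Unlike complementariness and concurrency, Partalas et al. [18] introduce a new metric called uncertainty weighted accuracy (UWA) that considers all four cases given above. UWA is defined as

$$\mathrm{UWA}(h, S) = \sum_{x_i \in D_{pr}} \Big(\mathrm{NT}_i\, I(e_{tf}) + \mathrm{NF}_i\, I(e_{tt}) - \mathrm{NF}_i\, I(e_{ft}) - \mathrm{NT}_i\, I(e_{ff})\Big), \tag{3}$$

where $\mathrm{NT}_i$ is the proportion of classifiers in the current ensemble $S$ which correctly predict $x_i$ and $\mathrm{NF}_i = 1 - \mathrm{NT}_i$ is the proportion of classifiers that incorrectly predict $x_i$. In addition to considering all four cases, UWA takes into account the strength of the decision of the current ensemble.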
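To make the three diversity measures concrete, the sketch below computes them on a pruning set. It is an illustration under stated assumptions: classifiers are given as lists of predicted labels, the subensemble decision is a plain majority vote, and the case weights used for concurrency (+2, +1, -2) and for UWA (the proportions of correct and incorrect members) follow the formulations commonly given in the cited literature [15,18] rather than anything verified against the authors' code. All function names are hypothetical.

```python
from collections import Counter


def majority(members, i):
    """Majority-vote label of the subensemble on example i."""
    return Counter(m[i] for m in members).most_common(1)[0][0]


def complementariness(h, members, labels):
    """Count examples where h is right and the subensemble is wrong."""
    return sum(
        1
        for i, y in enumerate(labels)
        if h[i] == y and majority(members, i) != y
    )


def concurrency(h, members, labels):
    """+2 if only h is right, +1 if both are right, -2 if both are wrong."""
    total = 0
    for i, y in enumerate(labels):
        h_ok, s_ok = h[i] == y, majority(members, i) == y
        if h_ok and not s_ok:
            total += 2
        elif h_ok and s_ok:
            total += 1
        elif not h_ok and not s_ok:
            total -= 2
    return total


def uwa(h, members, labels):
    """Uncertainty weighted accuracy: weights depend on the vote split."""
    total = 0.0
    for i, y in enumerate(labels):
        nt = sum(1 for m in members if m[i] == y) / len(members)
        nf = 1.0 - nt
        h_ok, s_ok = h[i] == y, majority(members, i) == y
        if h_ok and not s_ok:
            total += nt
        elif h_ok and s_ok:
            total += nf
        elif not h_ok and s_ok:
            total -= nf
        else:
            total -= nt
    return total
```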
In this paper, we design a new measure by considering the margin of examples for ensemble pruning via directed hill climbing. More details are discussed in the next sections.

Importance Assessment for Individual Classifiers
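As one of the best off-the-shelf algorithms, AdaBoost demonstrates a high generalization performance. To theoretically analyse this phenomenon, a concept called the margin of examples was proposed by Schapire et al. [28]. Let $D = \{(x_i, y_i) \mid i = 1, 2, \dots, N\}$ be the training set, where each example $x_i$ is associated with a label $y_i \in \{-1, +1\}$. Suppose that $H = \{h_t \mid t = 1, 2, \dots, M\}$ is an ensemble with $M$ classifiers and suppose that each member $h \in H$ maps each example $x_i \in D$ to a label $y$; namely, $h: x_i \to y \in \{-1, +1\}$. The margin of $x_i$ with respect to $H$ is defined as

$$\mathrm{margin}(x_i) = \frac{y_i \sum_{t=1}^{M} \alpha_t h_t(x_i)}{\sum_{t=1}^{M} \alpha_t}, \tag{4}$$

where $\alpha_t$ is the weight of the classifier $h_t$. Without loss of generality, normalizing $\alpha_t$, $t = 1, 2, \dots, M$, such that $\sum_t \alpha_t = 1$, (4) can be written as

$$\mathrm{margin}(x_i) = y_i \sum_{t=1}^{M} \alpha_t h_t(x_i). \tag{5}$$

From (5), the margin is a value in $[-1, 1]$; $x_i$ is on the border if $\mathrm{margin}(x_i) = 0$; the absolute value of the margin is the confidence of the ensemble prediction on $x_i$; and $\mathrm{margin}(x_i) > 0$ (or $\mathrm{margin}(x_i) < 0$) indicates that the ensemble correctly (or incorrectly) classifies $x_i$. Based on this concept, they proved that, for any $\delta > 0$ and $\theta > 0$, with probability at least $1 - \delta$ the generalization error is upper bounded by

$$\Pr_{D}\big[\mathrm{margin}(x) \le \theta\big] + O\!\left(\frac{1}{\sqrt{N}} \left(\frac{d \log^2 (N/d)}{\theta^2} + \log\frac{1}{\delta}\right)^{1/2}\right), \tag{6}$$

where $d$ is the complexity of the base classifier and $N$ is the size of the training set. To further explain the correctness of the margin theory, Gao and Zhou [32] proposed the $k$th margin theory. Specifically, for any $\delta > 0$, with probability at least $1 - \delta$ over the random choice of a training set with size $N \ge 5$, the generalization error is upper bounded by

$$\frac{2}{N} + \inf_{\theta \in (0, 1]} \left[\Pr_{D}\big[\mathrm{margin}(x) < \theta\big] + \frac{7\mu + 3\sqrt{3\mu}}{3N} + \sqrt{\frac{3\mu}{N} \Pr_{D}\big[\mathrm{margin}(x) < \theta\big]}\right], \tag{7}$$

where

$$\mu = \frac{8}{\theta^2} \ln N \ln\big(2|\mathcal{H}|\big) + \ln\frac{2|\mathcal{H}|}{\delta} \tag{8}$$

and $\mathcal{H}$ is the space of base classifiers. From (6) and (7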

Margin-Based Measure
In this section, we propose a heuristic measure for evaluating the importance of individual classifiers based on the insight gained in Section 3: ensemble pruning via a directed hill climbing strategy should focus more on examples with small absolute margins as well as on classifiers that correctly classify more examples. Several methods use a different approach to calculate diversity during the search.

Measure for Two-Class Problem.
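For simplicity of presentation, this section focuses on forward ensemble pruning: given an ensemble subset $S$ which is initialized to be empty, we iteratively add into $S$ the classifier $h \in H \setminus S$. Here, the symbols are the same as those in Section 3. Assuming that ensembles use simple majority voting to obtain their predictions, the margin of an example $x_i$ with respect to the ensemble $S$ is

$$\mathrm{margin}(x_i) = \frac{1}{|S|} \sum_{t=1}^{|S|} y_i h_t(x_i), \tag{9}$$

where $|S|$ is the size of the ensemble $S$. From (9), $1/|S|$ is the weight of each classifier $h_t$, $t = 1, 2, \dots, |S|$, and $y_i h_t(x_i)/|S|$ is the margin contribution of $h_t$ on the example $x_i$. Then the proposed measure, the margin-based measure (MM), of classifier $h$ with respect to ensemble $S$ and the pruning set $D_{pr}$ is defined as

$$\mathrm{MM}(h, S, D_{pr}) = \frac{1}{|D_{pr}|} \sum_{x_i \in D_{pr}} \mathrm{MM}(h, S, x_i),$$

where $\mathrm{MM}(h, S, x_i)$ is the margin-based measure of $h$ with respect to the subensemble $S$ and the current example $x_i$, defined as

$$\mathrm{MM}(h, S, x_i) = \frac{y_i h(x_i)/|S|}{|\mathrm{margin}(x_i)| + 1/|S|}, \tag{10}$$

where the constant $1/|S|$ prevents the denominator from being zero. Since $|S| \cdot |\mathrm{margin}(x_i)| = \big|\sum_{t=1}^{|S|} y_i h_t(x_i)\big|$, (10) can equivalently be written as

$$\mathrm{MM}(h, S, x_i) = \frac{y_i h(x_i)}{\big|\sum_{t=1}^{|S|} y_i h_t(x_i)\big| + 1}. \tag{11}$$

From (9) and (10), $y_i h(x_i)/|S|$ is exactly the margin contribution of $h$ on the example $x_i$ and $1/|\mathrm{margin}(x_i)|$ is the weight of $h$.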
The rationale of the proposed measure is as follows: (i) If the individual classifier $h$ correctly classifies the example $x_i$, $h$ increases the margin of $x_i$, and the corresponding increase is its margin contribution $y_i h(x_i)/|S| = 1/|S|$; thus $h$ helps $S$ to correctly classify $x_i$, namely, $\mathrm{MM}(h, S, x_i) \ge 0$ (refer to (10)). If $h$ incorrectly classifies $x_i$, the prediction of $h$ reduces the margin of $x_i$ and the reduction is exactly $|y_i h(x_i)|/|S| = 1/|S|$; thus $h$ is harmful to $S$ correctly classifying $x_i$, namely, $\mathrm{MM}(h, S, x_i) \le 0$ (refer to (10)). The time complexity of calculating (10) or (16) is $O(|S| N)$, which can be reduced to $O(N)$ by incrementally updating the margins of examples each time a classifier is added to or removed from the subensemble, where $N$ is the size of the pruning set. Therefore, the time complexity of ensemble pruning via a directed hill climbing strategy based on the proposed measure is not more than $O(M^2 N)$, where $M$ is the size of the original ensemble learned from the training set.
In this way, the proposed measure focuses more on correct classifiers and the examples lying near the boundaries, which coincides with the conclusions in Section 3.
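The following sketch illustrates how a margin-weighted score of this kind behaves on a tiny two-class example. It rests on an assumption: the per-example score is taken to be the margin contribution $y_i h(x_i)/|S|$ divided by $|\mathrm{margin}(x_i)| + 1/|S|$, which is one reading consistent with the description above (contribution weighted by the inverse absolute margin, with $1/|S|$ guarding against a zero denominator). The function names are hypothetical and the subensemble is assumed to be nonempty.

```python
def margin(members, labels, i):
    """Majority-voting margin of example i; labels and votes are in {-1, +1}."""
    return sum(labels[i] * m[i] for m in members) / len(members)


def mm_example(h, members, labels, i):
    """Assumed per-example score: margin contribution over guarded |margin|."""
    size = len(members)
    contribution = labels[i] * h[i] / size
    return contribution / (abs(margin(members, labels, i)) + 1.0 / size)


def mm(h, members, labels):
    """Average of the per-example scores over the pruning set."""
    n = len(labels)
    return sum(mm_example(h, members, labels, i) for i in range(n)) / n


# Four members; example 0 has margin 1, examples 1 and 2 have margin 0.
members = [[1, 1, -1], [1, -1, -1], [1, -1, 1], [1, 1, 1]]
labels = [1, 1, 1]
candidate = [1, 1, -1]
scores = [mm_example(candidate, members, labels, i) for i in range(3)]
```

In this example the candidate is correct on both example 0 (high margin) and example 1 (zero margin), but the second contributes five times as much to the score, and the error on the zero-margin example 2 is penalized just as heavily.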

Measure for Multiclass Problem.
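For multiclass classification problems, (11) should be extended so that the proposed measure defined by (10) can deal with the problem.
Let each member $h$ of $S$ map an example $x_i$ to a label $y$; namely, $h: x_i \to y \in [1, L]$, and let $V^{(x_i)}_l$ denote the number of votes on the $l$th label of example $x_i$ of an ensemble combined by majority voting; $V^{(x_i)}_{\max}$ is the number of majority votes on the example $x_i$; $V^{(x_i)}_{\mathrm{sec}}$ is the second largest number of votes on the example $x_i$; and $V^{(x_i)}_{h(x_i)}$ is the number of votes on label $h(x_i)$.
From [28], for multiclass problems, the margin of an example is defined as the difference between the number of correct votes and the maximum number of votes received by any incorrect label; namely,

$$\mathrm{margin}(x_i) = \frac{V^{(x_i)}_{y_i} - \max_{l \neq y_i} V^{(x_i)}_l}{|S|}. \tag{15}$$

Combining (11) and (15) results in (16), where $e_{tt}$ (or $e_{ft}$) is the set of examples that are correctly (or incorrectly) classified by the current classifier $h$ and correctly classified by the ensemble $S$; similarly, $e_{tf}$ (or $e_{ff}$) is the set of examples that are correctly (or incorrectly) classified by $h$ and incorrectly classified by $S$. In this way, $\mathrm{MM}(h, S, x_i)$ and thus $\mathrm{MM}(h, S, D_{pr})$ (the proposed measure) focus more on correct classifiers and the examples lying near the boundaries, which coincides with the conclusions in Section 3.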
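The multiclass margin described above (correct votes minus the largest number of votes for any incorrect label) can be computed directly from the vote counts. The sketch below is illustrative only; the function name is hypothetical, and dividing by the ensemble size so that the margin lies in [-1, 1] is an assumption made for consistency with the two-class majority-voting margin.

```python
from collections import Counter


def multiclass_margin(members, labels, i):
    """(correct votes - best incorrect votes) / ensemble size for example i."""
    votes = Counter(m[i] for m in members)
    correct = votes.get(labels[i], 0)
    # Largest vote count among the incorrect labels (0 if none received votes).
    wrong = max(
        (count for label, count in votes.items() if label != labels[i]),
        default=0,
    )
    return (correct - wrong) / len(members)


# Five members voting on a single example with labels in {1, 2, 3}.
members = [[2], [2], [3], [1], [2]]
```

With a true label of 2 the example receives three correct votes against at most one vote for any other label, so its margin is positive; with a true label of 3 the same votes give a negative margin.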

Discussion.
Unlike other measures where each classifier is independently evaluated, the proposed margin-based measure uses a more global evaluation. Indeed, this criterion involves instance margin values that result from a majority vote of the whole ensemble. Thus, the proposed measure is not based only on individual properties of ensemble members (e.g., the accuracy of individual learners); it also takes into account some form of complementarity of classifiers. From (11), our margin-based measure considers both the correctness of the predictions of the current classifier and the confidence of the prediction of the ensemble. Therefore, this measure deliberately favors classifiers with a better performance in classifying low margin samples. Thus, it is a boosting-like strategy which aims to increase the performance on low margin instances. Our selection strategy therefore leads to a subset of classifiers with a potentially improved capability to classify complex data in general and border data in particular. Consequently, it induces a selection of a subset of learners that are designed to handle minority classes well.
From (16), our measure considers the diversity between ensemble members. Therefore, the measure considers not only the correctness of classifiers, but also the diversity of ensemble members, which is why using the proposed measure to prune an ensemble leads to significantly better accuracy results.

Experiments
This section first introduces the experimental setting and the characteristics of the data sets used in this paper and then reports the comparison of measures for guiding ensemble pruning.

Data Sets and Experimental Setup.
We randomly selected 18 data sets from the UCI repository [33]. Each data set was randomly divided into three subsets of equal size: one of the subsets served as the training set, one as the testing set, and the other as the pruning set. Since the three roles can be assigned to the three subsets in $3! = 6$ ways, we conducted six trials for each partition of a data set. We repeated the experiments 50 times and thus conducted a total of 300 trials on each data set. The details of these data sets are summarized in Table 1, where #Insts, #Attrs, and #Cls are the size, attribute number, and class number of the corresponding data sets, respectively.
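The trial protocol above can be sketched as follows. This is a minimal illustration under the assumption that the six trials come from the six possible assignments of the training, pruning, and testing roles to the three subsets; the function name `six_trials` and the seed handling are hypothetical.

```python
import itertools
import random


def six_trials(n_examples, seed=0):
    """Yield the six role assignments for one random three-way partition."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    third = n_examples // 3
    parts = [indices[:third], indices[third:2 * third], indices[2 * third:]]
    # Each permutation assigns the roles (train, prune, test) differently.
    for train, prune, test in itertools.permutations(parts):
        yield {"train": train, "prune": prune, "test": test}


# 50 repetitions with different seeds give 50 * 6 = 300 trials per data set.
trials = [t for seed in range(50) for t in six_trials(9, seed)]
```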
We evaluated the performance of the proposed margin-based measure (MM) using forward ensemble selection, where complementariness (COM) [14], concurrency (CON) [15], and uncertainty weighted accuracy (UWA) [18] were used as the compared measures. In each trial, a bagging ensemble [3] with 200 base classifiers was trained, where the base classifier was J48, which is a Java implementation of C4.5 [34] from Weka [35]. For simplicity, we denote by MM, COM, CON, and UWA the corresponding pruning algorithms supervised by these measures, respectively.

Accuracy Performance versus the Size of Subensemble.
The goal of this experiment was to evaluate the performance of MM by comparing it with UWA, CON, and COM. The experimental results of the 18 tested data sets can be classified into three cases: (1) MM outperforms UWA, CON, and COM; (2) MM performs comparably to one or more of them and outperforms the others; and (3) MM is outperformed by one or more of them. The first case contains 13 data sets, the second case contains two data sets, and the last case contains three. Figures 3, 4, and 5 show the representative results from the three cases.
Figure 3 reports the accuracy curves of the four compared measures for six representative data sets that fall into the first case. Results in the figure are reported as average accuracy curves with regard to the number of classifiers, where the horizontal axis is the size of subensembles growing gradually from 5 to 200 with step 1 and the vertical axis is the average accuracy over 300 trials. For the purpose of clarity, the standard deviations are not shown in the figure. The accuracy curves for data sets "audiology," "autos," "car," "glass," "segment," and "wine" are reported in Figures 3(a), 3(b), 3(c), 3(d), 3(e), and 3(f), respectively. Figure 3(a) shows that, with the increase of the number of aggregated classifiers, the accuracy curves of subensembles selected by MM, UWA, CON, and COM increase rapidly, reach a maximum accuracy in the intermediate steps of aggregation that is higher than the accuracy of the whole original ensemble, and then drop until the accuracy is the same as that of the whole ensemble. The remaining five data sets, "autos," "car," "glass," "segment," and "wine" (shown in Figures 3(b), 3(c), 3(d), 3(e), and 3(f), resp.), have accuracy curves similar to that of "audiology."

Summary of Experimental Results.
Table 2 summarizes the accuracy of the 300 trials for each data set, where the value in each pair of parentheses is the rank of the compared method and the last row is the average rank. The rank of an algorithm is defined as follows: on one data set, the best performing algorithm gets the rank of 1.0, the second best one gets the rank of 2.0, and so on. In the case of ties, average ranks are assigned [36,37]. The experimental results in Section 5.2 empirically show that MM, UWA, CON, and COM generally reach maximum accuracy when the size of the subensembles is between 20 and 40 (using forward selection for ensemble pruning). Therefore, the subensembles formed by MM with 30 original ensemble members are compared with subensembles of the same size formed by UWA, CON, and COM.
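The ranking rule described above (rank 1.0 for the best method, with tied methods sharing the average of the positions they occupy) can be sketched as follows. The function name is hypothetical and the accuracies in the example are made-up numbers used only to show the tie handling.

```python
def average_ranks(scores):
    """Rank scores (higher is better); ties share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based positions pos+1..end+1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


# Illustrative accuracies of four methods on one data set (not from Table 2).
example = average_ranks([0.9, 0.8, 0.9, 0.7])
```

Here the two methods tied at 0.9 occupy positions 1 and 2 and therefore both receive rank 1.5; averaging such per-data-set ranks over all data sets gives the last row of a table like Table 2.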
As shown in Table 2, MM outperforms bagging on all the 18 data sets, which indicates that MM performs ensemble pruning effectively by achieving better predictive accuracies with small subensembles. Table 2 also shows that MM ranks first on 14 out of the 18 data sets and its average rank is 1.33, followed by CON with an average rank of 2.75, COM with an average rank of 2.91, UWA (3.17), and bagging (4.83).
As mentioned above, backward elimination is another directed hill climbing strategy for ensemble pruning. From the experimental results, we observe that the performance based on the backward elimination strategy is similar to that based on the forward selection strategy, and therefore we only present the mean accuracy and ranks of MM, UWA, CON, and COM with 30 base classifiers, and of bagging (the original ensemble). The corresponding results are given in Table 3. From the table, MM ranks first on 12 data sets and its average rank is 1.42, followed by CON with an average rank of 2.69, COM (2.83), UWA (3.22), and bagging (4.78).
Table 4 shows a summary of the comparisons among the methods, where the pruning methods with "-F" use forward selection to prune the ensemble and, similarly, the pruning methods with "-B" use backward elimination. The size of each subensemble selected by these ensemble pruning methods is 30. The entry $a_{i,j}$ displays the number of times the method of the column ($j$) has a better result than the method of the row ($i$). The number in parentheses shows how many of these differences are statistically significant using pairwise $t$-tests at the 95% confidence level. For example, MM-F has been better than CON-F with pruned trees in 16 of the 18 comparisons and worse in 2. The number in parentheses shows that, in 14 cases, the difference in favor of MM-F has been statistically significant; hence, the value in row 3, column 1 of the table is 16 (14).
Table 5 shows the ranking of the compared methods according to the significant differences between their performances using pairwise $t$-tests at the 95% confidence level. Here, we use all pairwise comparisons as summarized in Table 4. For example, the sum of the numbers in parentheses in the column corresponding to MM-F in Table 4 is 94, and the sum of the numbers in parentheses in the row corresponding to MM-F is 10. Their difference is used in Table 5 as the nondominance ranking of MM-F ($94 - 10 = 84$).
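The nondominance score described above (significant wins summed over a method's column minus significant losses summed over its row) can be sketched as follows. The function name is hypothetical and the 3 x 3 matrix is a made-up illustration, not the data of Table 4.

```python
def nondominance_scores(sig_wins):
    """sig_wins[i][j]: significant wins of column method j over row method i."""
    n = len(sig_wins)
    scores = []
    for k in range(n):
        wins = sum(sig_wins[i][k] for i in range(n))    # column sum
        losses = sum(sig_wins[k][j] for j in range(n))  # row sum
        scores.append(wins - losses)
    return scores


# Hypothetical counts for three methods (diagonal entries are unused zeros).
matrix = [
    [0, 2, 1],
    [10, 0, 4],
    [12, 6, 0],
]
scores = nondominance_scores(matrix)
```

Because every significant win of one method is a significant loss of another, the scores always sum to zero; sorting them in decreasing order gives the ranking.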
Tables 4 and 5 demonstrate the significant advantage of MM compared with the best benchmark classifier ensemble methods: CON, COM, and bagging.Besides, compared with ensemble pruning methods using backward elimination, the ones with forward selection show better performance.

Conclusion
In this paper, we analysed the importance of individual classifiers with respect to the whole ensemble using margin theory and concluded that ensemble pruning via a directed hill climbing strategy should focus more on correct classifiers and the examples lying near the boundary. Based on the derived general principles, we proposed a criterion called the margin-based measure to explicitly evaluate the importance of individual classifiers. Experimental comparisons on 18 UCI data sets showed that the proposed measure outperforms other state-of-the-art measures and the original ensemble. The proposed metric can be applied not only to ensemble pruning based on directed hill climbing search but also to other ensemble pruning methods. Therefore, more experiments will be conducted to evaluate the performance of the proposed measure.

Figure 1: The search space of DHCEP methods for an ensemble of 4 models.

Figure 2: Rules obtained from the margin theory for evaluating the importance of individual classifiers on examples.

(ii) From the discussion of Section 3, $|\mathrm{margin}(x_i)|$ reflects the confidence with which $S$ correctly (or incorrectly) classifies the example $x_i$. If $|\mathrm{margin}(x_i)|$ is very small (e.g., equal to 0), namely, $S$ correctly (or incorrectly) classifies $x_i$ with a low confidence, adding the classifier $h$ into $S$ may change the prediction of $S$ on the example and therefore $h$'s weight $1/|\mathrm{margin}(x_i)|$ is large. On the other hand, if $|\mathrm{margin}(x_i)|$ is very large (e.g., equal to 1), namely, $S$ correctly (or incorrectly) classifies $x_i$ with a high confidence, adding the classifier $h$ into $S$ cannot change the prediction of $S$ on the example and therefore $h$'s weight $1/|\mathrm{margin}(x_i)|$ is small.

Figure 3: Comparative results for six data sets in the first case.

Figure 4: Comparative results for two data sets in the second case.

Figure 5: Comparative results for two data sets in the third case.

Then the proposed measure, the margin-based measure (MM), of classifier $h$ with respect to ensemble $S$ and the pruning set $D_{pr}$ is defined as $\mathrm{MM}(h, S, D_{pr}) = \frac{1}{|D_{pr}|} \sum_{x_i \in D_{pr}} \mathrm{MM}(h, S, x_i)$. From (9), $1/|S|$ is the weight of each classifier $h_t$, $t = 1, 2, \dots, |S|$, and $y_i h_t(x_i)/|S|$ is the margin contribution of $h_t$ on the example $x_i$. In (16), $e_{tt}$ (or $e_{ft}$) is the set of examples that are correctly (or incorrectly) classified by the current classifier $h$ and correctly classified by the ensemble; similarly, $e_{tf}$ (or $e_{ff}$) is the set of examples that are correctly (or incorrectly) classified by $h$ and incorrectly classified by $S$.

Table 1: Characteristics of the 18 data sets used in experiments.

Table 2: The mean accuracy and ranking of MM, UWA, CON, COM, and bagging. Here, forward selection is used to prune the ensemble.

Table 3: The mean accuracy and ranking of MM, UWA, CON, COM, and bagging. Here, backward elimination is used to prune the ensemble.

Table 4: Summary of results.

Table 5: Rank of the methods using the significant differences from all pairwise comparisons.