Constructing Better Classifier Ensemble Based on Weighted Accuracy and Diversity Measure

A weighted accuracy and diversity (WAD) measure is presented, a novel measure for evaluating the quality of a classifier ensemble that assists in the ensemble selection task. The proposed measure is motivated by a commonly accepted hypothesis: the members of a robust classifier ensemble should be not only accurate but also different from one another. In fact, accuracy and diversity are mutually restraining factors; that is, an ensemble with high accuracy may have low diversity, and an overly diverse ensemble may negatively affect accuracy. This study proposes a method to find the balance between accuracy and diversity that enhances the predictive ability of an ensemble for unknown data. The quality of an ensemble is scored by the harmonic mean of accuracy and diversity, where two weight parameters are used to balance them. The measure is compared with two representative measures, Kappa-Error and GenDiv, and two threshold measures that consider only accuracy or only diversity, using two heuristic search algorithms, a genetic algorithm and a forward hill-climbing algorithm, in ensemble selection tasks performed on 15 UCI benchmark datasets. The empirical results demonstrate that the WAD measure is superior to the others in most cases.


Introduction
Unlike individual classification methods such as naïve Bayes [1], decision trees [2], and SVM [3], the central idea behind ensemble methods [4] is to use a set of base classifiers and combine their predictive capabilities in a single classification task. Through the combination of multiple base classifiers, a more accurate and stronger prediction can be obtained. Ensemble methods can also be understood by analogy with human decision making: people often consider diverse opinions before reaching a final decision, thus reducing the risk of making mistakes. In recent decades, many researchers have investigated ensemble technology, resulting in a number of outstanding algorithms proposed in the literature, such as bagging [5], AdaBoost [6], mixture-of-experts [7], and random forest [8]. Nevertheless, generic ensemble methods have two primary shortcomings: inefficiency and redundant classifiers. According to the survey results reported by Tsoumakas et al. [9], a large-scale ensemble learning task can easily create thousands of base classifiers, or even more. Such a large number of classifiers requires large memory and computational overhead, which in turn increases the training cost, storage demands, and prediction time. In addition, an ensemble with a large number of classifiers does not always generate better predictions, because an ensemble tends to contain redundant classifiers in addition to high-quality ones, and the former negatively affect the overall predictive performance.
The Scientific World Journal

Ensemble selection (also called ensemble pruning, ensemble thinning, or classifier selection) is regarded as an effective technique for addressing these two shortcomings. The goal of ensemble selection is to reduce the memory requirement and accelerate the classification process while preserving or improving the predictive ability [10]. As the name implies, ensemble selection refers to approaches that select a subset of optimal classifiers from the original ensemble prior to combining the predictions. Given an original ensemble with $M$ base classifiers, $E = \{h_1, h_2, \ldots, h_M\}$, and a validation (evaluation, pruning, or selection) dataset with $N$ samples, $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, the objective is to form an optimal subensemble $E' = \{h_1, h_2, \ldots, h_L\}$, where the size of the optimal subensemble, $L$, is less than or equal to the size of the original ensemble, $M$ ($L \le M$). Ensemble selection relies on two core elements, that is, the evaluation measure and the search method. The score calculated by the evaluation measure is the quality assessment used to guide the ensemble selection. The target of the evaluation can be an individual classifier or an ensemble, which gives two types of measure, that is, classifier based and ensemble based. The score from a classifier-based measure represents the quality of an individual classifier; an ensemble-based measure evaluates the quality of the whole ensemble. The goal of the search method is to find the classifiers with high quality scores according to the evaluation measure. Various ensemble selection approaches are examples of combining an evaluation measure and a search method [9]. For example, in ranking-based ensemble selection approaches [11,12], the classifiers in the ensemble are sorted in descending order of their quality scores, and the first $L$ (a user-defined number) classifiers are used.
Intuitively, a ranking-based approach is the combination of a classifier-based measure and a ranking search method. The advantage of these methods is the low search complexity, that is, $O(M)$, because a ranking search algorithm is applied. This approach may sometimes work well, but it is theoretically unsound, as illustrated by the classical example mentioned in [10]: "an ensemble of three identical classifiers with 95% accuracy is worse than an ensemble of three classifiers with 67% accuracy and least pairwise correlated error." Another representative instance is the optimization-based ensemble selection approach, which is constructed using an ensemble-based measure and an optimization search method [10,13,14,15]. It consists of an optimization process that searches for an optimal subensemble in a space of $2^M - 1$ candidates (the nonempty subsets). The evaluation measure in this case should be capable of evaluating the quality of the whole ensemble. Unlike the ranking-based methods, it needs the help of an optimization search algorithm (e.g., a genetic algorithm or a hill-climbing algorithm) to avoid the complexity of exhaustive search, that is, $O(2^M - 1)$.
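The quoted example can be checked numerically. The following is a minimal sketch, assuming 60 validation samples and the extreme case in which the three 67% classifiers err on disjoint thirds of the data; the sample count, the error layout, and the function name `majority_vote_accuracy` are illustrative choices, not taken from [10].

```python
def majority_vote_accuracy(oracle):
    """oracle[j][i] is 1 if classifier j is correct on sample i, else 0."""
    n_classifiers = len(oracle)
    n_samples = len(oracle[0])
    correct = 0
    for i in range(n_samples):
        votes = sum(row[i] for row in oracle)
        if 2 * votes > n_classifiers:  # a strict majority is correct
            correct += 1
    return correct / n_samples


N = 60  # hypothetical validation set size

# Three identical classifiers, each wrong on the same 3 of 60 samples (95%).
identical = [[0 if i < 3 else 1 for i in range(N)] for _ in range(3)]

# Three classifiers, each wrong on a different third of the samples (about 67%).
diverse = [[0 if i % 3 == j else 1 for i in range(N)] for j in range(3)]

identical_acc = majority_vote_accuracy(identical)
diverse_acc = majority_vote_accuracy(diverse)
print(identical_acc, diverse_acc)
```

In this constructed case the identical ensemble stays at 95%, whereas the diverse ensemble always has two correct votes out of three and therefore classifies every sample correctly.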
In this study, ensemble-based measures for optimization-based ensemble selection are emphasized. Two characteristics of ensemble measures are used in this method: (a) they assess the quality of an ensemble with multiple classifiers rather than the quality of an individual classifier, and (b) they usually work with heuristic search algorithms to perform the ensemble selection.
The latent question of ensemble-based evaluation measures is thus "what is a good classifier ensemble?" Many researchers have tried to answer this question in their ensemble selection tasks. One common strategy is to emphasize the ensemble accuracy, so that the subensemble with high accuracy on the validation dataset is retained. Margineantu and Dietterich [13] first claimed the feasibility of using ensemble accuracy for the ensemble selection task. Zhou et al. proposed the GASEN [14] and GASEN-b [15] selective ensemble learning algorithms, in which the ensemble selection procedures apply a genetic algorithm (GA) to search for the optimal subensemble according to the majority voting accuracy (MVA) on the validation dataset. In the ensemble selection experiments conducted by Fan et al. [16], two accuracy-class evaluation measures were used, that is, average accuracy and mean squared error. Caruana et al. [17] performed a similar trial, experimenting with several evaluation measures, including root mean squared error, precision/recall F-measure, and average precision. Although the experimental results from the above studies illustrated that subensembles selected by accuracy measures may provide some improvement over the original ensemble, there is no solid proof of a strong correlation between the ensemble accuracy on the validation data and the predictive performance on the test data. Moreover, several studies have shown that excessively high accuracy on the validation data may lead to overfitting.
Other scholars insist that the ensemble constructed by a set of diverse classifiers should survive. Such scholars prefer using diversity to represent the ensemble quality. Ruta and Gabrys [18] applied twelve widely known diversity measures, including the disagreement measure, entropy measure, and interrater agreement, in their experiments, to achieve better results than accuracy measures. Martínez-Muñoz and Suárez [12], Banfield et al. [19], and Partalas et al. [20] proposed four similar diversity measures, that is, concurrency, margin distance minimization, complementariness, and the uncertainty aware measure, for selecting subensembles through the greedy search algorithm, producing impressive results. Their results show that computing the degrees of diversity may be a good choice for the evaluation measure. Nevertheless, using diversity as the direct measure of ensemble quality is still a controversial issue. On the one hand, the above studies show promising predictive performance using diversity as the evaluation measure. On the other hand, the theoretical and experimental investigations from Tang et al. [21] concluded that diversity could not be explicitly used for constructing the ensemble, based on directed hill-climbing methods.
The analysis above indicates that neither accuracy nor diversity alone is sufficient to represent the ensemble quality. A well-known hypothesis in the ensemble learning community claims that an ensemble with high performance and generalization ability should be simultaneously accurate and diverse [4,22]. Hence, ensemble selection approaches should endeavor to generate such an ensemble. In other words, the evaluation measures need to assess ensemble quality by considering both accuracy and diversity. However, this is not an easy task because accuracy and diversity are mutually restraining factors: an ensemble with high accuracy may have reduced diversity, and diverse ensembles often negatively affect accuracy. A number of classifier-based evaluation measures motivated by this idea have been proposed in recent decades. In the most representative study [11], the authors proposed a measure that evaluates each individual classifier's contribution to the whole ensemble by integrating accuracy and diversity. For ensemble-based evaluation measures, however, few studies have explored assessing the quality by considering both accuracy and diversity.
Therefore, in this study, a new ensemble-based evaluation measure, the weighted accuracy and diversity (WAD) measure, is designed to meet this challenge. The proposed measure has three main features.
(1) It is designed to evaluate the quality of an ensemble and to work with heuristic search algorithms to conduct optimization-based ensemble selection.
(2) It assesses the ensemble quality by considering both accuracy and diversity. To be more precise, inspired by the F-measure [7] in information retrieval, the WAD measure combines accuracy and diversity by taking the harmonic mean of both measurements. Two weight factors are added that control the trade-off between accuracy and diversity.
(3) It can automatically trade off accuracy and diversity because these two weight parameters are learned by a linear programming approach.
Empirical results on 15 UCI datasets showed that ensemble selection via the WAD measure produces significantly better results.
The structure of this paper is as follows. Section 2 introduces the design of the new evaluation measure in detail. Section 3 reports the experimental tests of the proposed measure, including the corresponding process and final results. The conclusions and discussion are summarized in Section 4.

Method Design
The primary objective of this work is the design of a novel ensemble-based measure to assess the ensemble quality in the ensemble selection task. As mentioned in Section 1, it is important to consider both accuracy and diversity when assessing ensemble quality. To accomplish this, the method integrates the accuracy and diversity measurements in a composite score, formulating a mathematical function $\mathrm{WAD} = f(\mathrm{Acc}, \mathrm{Div})$, where $\mathrm{WAD}$, $\mathrm{Acc}$, and $\mathrm{Div}$ denote ensemble quality, accuracy, and diversity, respectively. Three main obstacles remain to be solved in the design of the measure.
(a) The method of calculating accuracy and diversity must be determined. The new measure is expected to integrate accuracy and diversity and, though a number of approaches exist that can calculate both terms, the definitions must be clear in preparation for the subsequent design. Section 2.1 gives the corresponding descriptions.
(b) The form of the new measure must be determined; that is, the function $\mathrm{WAD} = f(\mathrm{Acc}, \mathrm{Div})$ must be defined. Although several studies have tried to answer this question, no approach to date has yielded a reasonable composite form using both accuracy and diversity. The new measure tackles this problem by proposing the harmonic mean form to combine accuracy and diversity, as reported in Section 2.2.
(c) The method used to balance accuracy and diversity must be determined. In the form of the new measure, two weight parameters control the importance of accuracy and diversity, so the trade-off process is equivalent to a weight value assignment. In particular, the weights should be adjusted to the specific dataset. The new measure therefore employs a linear programming technique to estimate the weight values automatically, as described in Section 2.2.

Notations and Definitions.
The common notations and definitions summarized in the following are used in the remainder of the paper. Let $E = \{h_1, h_2, \ldots, h_M\}$ be an original ensemble containing $M$ trained base classifiers, where the classifiers are either homogeneous, that is, trained by the same base classification algorithm, or heterogeneous, that is, trained by different classification algorithms. Let $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ be a validation dataset with $N$ samples.
Accuracy straightforwardly refers to the correct rate. For an individual classifier, the accuracy on a certain dataset equals the number of correct predictions over the total number of samples in the dataset. For an ensemble, however, because the prediction is a collective decision from a set of classifiers, there are various types of accuracy, such as majority voting accuracy, average voting accuracy, and weighted majority voting accuracy. In this work, the most common approach, that is, the simple majority (plurality) voting accuracy, is used to assess the ensemble accuracy. The majority voting accuracy, summarized in Notation 2, is attractive for and adaptable to this task because it only needs to validate and collect statistics for the predictions that are chosen by the majority of the classifiers. Moreover, as one of the simplest and most intuitive ensemble fusion techniques, the majority voting technique is widely used among various ensemble methods, such as bagging [5] and random forest [8].
Notation 1 (correct/incorrect (1/0) output). This type of representation for prediction is well known as an oracle output, which only considers the correctness of the prediction. Oracle output is used in this study because "it incorporates no a priori knowledge of the data and makes no assumption on what the base classifier is" [21]. Hence, the oracle output provides a general model for the following computation of accuracy and diversity. The oracle output, $o_j(x_i)$, of the $j$th classifier on the $i$th sample, as shown in (1), is 1 if the sample $x_i$ is classified correctly by the base classifier $h_j$; otherwise the output is 0:

$$o_j(x_i) = \begin{cases} 1, & h_j(x_i) = y_i, \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$

Notation 2 (ensemble accuracy). Given an ensemble $E$ and a dataset $D$ with $N$ samples, denote by $m_i$ the number of classifiers from $E$ that correctly recognize $x_i$. The oracle output $o_E(x_i)$ of the ensemble using simple majority voting for the input sample $x_i$ can be expressed by (2) as follows:

$$o_E(x_i) = \begin{cases} 1, & m_i > M - m_i, \\ 0, & m_i < M - m_i, \\ \text{random choice from } \{0, 1\}, & m_i = M - m_i, \end{cases} \tag{2}$$

where $o_E(x_i) \in \{0, 1\}$; that is, 1 denotes that the ensemble prediction is correct because the number of correct predictions, $m_i$, is greater than the number of incorrect predictions, $M - m_i$, and 0 denotes the opposite case. When the number of correct predictions is equal to the number of incorrect predictions, the result is a random selection between 0 and 1. Based on the ensemble oracle output, the ensemble accuracy, $\mathrm{Acc}$, for the entire dataset $D$ can be determined using (3) as follows:

$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} o_E(x_i), \tag{3}$$

where the final result is the sum of the ensemble oracle outputs for all samples in the dataset over the size of the dataset. $\mathrm{Acc}$ varies between 0 and 1, where a higher score means that the ensemble is more accurate.

Table 1: Confusion matrix for two classifiers $h_j$ and $h_k$.

                       $h_k$ correct (√)    $h_k$ incorrect (×)
$h_j$ correct (√)      $N^{11}$             $N^{10}$
$h_j$ incorrect (×)    $N^{01}$             $N^{00}$
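Notations 1 and 2 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names, the toy labels and predictions, and the seeded `random.Random` used for tie-breaking are all assumptions made for the example.

```python
import random


def oracle_outputs(predictions, labels):
    """predictions[j][i] is the label predicted by classifier j for sample i.

    Returns the 0/1 oracle matrix of Notation 1.
    """
    return [[1 if p == y else 0 for p, y in zip(row, labels)]
            for row in predictions]


def ensemble_accuracy(oracle, rng=None):
    """Majority-voting accuracy of Notation 2, computed from oracle outputs."""
    rng = rng or random.Random(0)  # only used to break exact ties
    n_classifiers = len(oracle)
    n_samples = len(oracle[0])
    total = 0
    for i in range(n_samples):
        m_i = sum(oracle[j][i] for j in range(n_classifiers))
        if m_i > n_classifiers - m_i:
            total += 1
        elif m_i == n_classifiers - m_i:
            total += rng.randint(0, 1)  # tie: random choice between 0 and 1
    return total / n_samples


# Toy data: 3 classifiers, 4 samples (hypothetical values).
labels = [0, 1, 1, 0]
predictions = [
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
]
oracle = oracle_outputs(predictions, labels)
acc = ensemble_accuracy(oracle)
print(oracle, acc)
```

With an odd number of classifiers no tie can occur, so the toy result is deterministic: the first three samples each receive two correct votes and the last one only a single correct vote.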
In the research field of ensemble learning, it is well known that the base classifiers in the ensemble should be as diverse as possible [4,13,23]. If a classifier output in the ensemble makes errors, it would be an advantage to have additional output from other, different ensemble members. It is meaningless to combine a set of duplicate classifiers. Diversity measures the difference in the classifiers. Researchers have proposed a variety of diversity measures, such as the Kohavi-Wolpert variance [24], generalized diversity [24], and double-fault measure [25]. In this study, the disagreement measure is used, as illustrated in Notations 3 and 4, as proposed by Skalak [26]. This measure was selected because it is a widely accepted measure to evaluate the diversity, and it has been applied to many ensemble problems. For example, Ho [27] used it to assess the diversity in a decision forest problem, and Lu et al. [11] used it as the diversity measure to calculate the classifier contribution in their study.
Notation 3 (confusion matrix for two classifiers). Assume two classifiers, $h_j$ and $h_k$, and their oracle predictions on $D$, $o_j(x_i)$ and $o_k(x_i)$. Table 1 shows a 2 × 2 confusion matrix that records the statistics of four scenarios between two classifiers, where $N^{01}$ represents the number of samples incorrectly predicted by $h_j$ but correctly predicted by $h_k$, $N^{10}$ is the number of samples correctly predicted by $h_j$ but incorrectly predicted by $h_k$, and $N^{00}$ and $N^{11}$ are the numbers of samples incorrectly predicted by both $h_j$ and $h_k$ and correctly predicted by both $h_j$ and $h_k$, respectively.
Notation 4 (ensemble diversity). Given a dataset $D$ with $N$ samples, based on the confusion matrix for two classifiers as defined in Notation 3, $\mathrm{div}_{j,k}$ denotes the diversity of the pair of classifiers $h_j$ and $h_k$. The diversity of two classifiers is defined based on the intuition that two diverse classifiers disagree with each other or perform differently on the same data. The diversity is therefore the ratio between the number of cases of disagreement ($N^{10}$ and $N^{01}$) and the total number of all cases ($N^{00}$, $N^{11}$, $N^{10}$, and $N^{01}$) as follows:

$$\mathrm{div}_{j,k} = \frac{N^{10} + N^{01}}{N^{00} + N^{11} + N^{10} + N^{01}}. \tag{4}$$

Extending (4) to the entire ensemble $E$ with size $M$, $\mathrm{Div}$ denotes the ensemble diversity, which is $\mathrm{div}_{j,k}$ averaged over all pairs of classifiers using (5) as follows:

$$\mathrm{Div} = \frac{2}{M(M-1)} \sum_{j=1}^{M-1} \sum_{k=j+1}^{M} \mathrm{div}_{j,k}. \tag{5}$$

Because $N^{00} + N^{11} + N^{10} + N^{01} = N$ for any pair of classifiers in ensemble $E$, (5) can be further reduced as follows:

$$\mathrm{Div} = \frac{2}{N M (M-1)} \sum_{j=1}^{M-1} \sum_{k=j+1}^{M} \left( N^{10}_{j,k} + N^{01}_{j,k} \right), \tag{6}$$

where $N^{10}_{j,k}$ and $N^{01}_{j,k}$ are the counts $N^{10}$ and $N^{01}$ for the pair $h_j$ and $h_k$. $\mathrm{Div}$ varies between 0 and 1, where 0 indicates no difference and 1 indicates the highest possible diversity.
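The disagreement measure of Notations 3 and 4 can be illustrated as follows. This is a small sketch that works directly on 0/1 oracle outputs; the function names and the toy oracle matrix are assumptions made for the example.

```python
from itertools import combinations


def pair_disagreement(o_j, o_k):
    """Share of samples on which exactly one of two classifiers is correct."""
    n10 = sum(1 for a, b in zip(o_j, o_k) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(o_j, o_k) if a == 0 and b == 1)
    return (n10 + n01) / len(o_j)


def ensemble_diversity(oracle):
    """Pairwise disagreement averaged over all pairs of classifiers."""
    pairs = list(combinations(range(len(oracle)), 2))
    total = sum(pair_disagreement(oracle[j], oracle[k]) for j, k in pairs)
    return total / len(pairs)


# Toy oracle matrix: 3 classifiers, 4 samples (hypothetical values).
oracle = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
]
div = ensemble_diversity(oracle)
print(div)
```

For the toy matrix, the three pairs disagree on 2, 3, and 3 of the 4 samples, so the ensemble diversity is (0.5 + 0.75 + 0.75) / 3 = 2/3. An ensemble of identical classifiers yields 0.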

Weighted Accuracy and Diversity Measure.
As mentioned in Section 1, many studies have shown that ensemble quality is strongly correlated with accuracy and diversity. However, accuracy and diversity are not directly proportional to the ensemble quality. Excessively high accuracy may lead to overfitting; that is, the accuracy on the validation dataset increases, but worse predictions are obtained on unseen data [28]. Likewise, an ensemble that is too diverse tends to comprise multifarious base classifiers that may seriously reduce the overall ensemble performance [21]. In addition, accuracy and diversity are mutually restraining factors: putting highly accurate classifiers together may reduce their complementarity (diversity), and a highly diverse ensemble negatively affects accuracy. There is thus a balance to be achieved between accuracy and diversity that enhances the predictive ability of an ensemble for unknown data. To obtain this balance, the accuracy and diversity measurements are integrated into a composite form. In other words, once the accuracy and diversity of an ensemble have been evaluated, the question is whether they can be combined in a way that yields a more rational score. Inspired by the well-known evaluation method in information retrieval, that is, the F-measure or F-score [7], which considers both precision and recall, the WAD ensemble evaluation measure is developed. WAD is an acronym for weighted accuracy and diversity, and the measure scores the ensemble quality by computing the harmonic mean of the accuracy and diversity measurements. According to Sasaki [29], the harmonic mean can create a more reasonable score to balance two factors and is more intuitive than the arithmetic mean when computing a mean of ratios. In particular, unlike the form of the F-measure, two parameters, $w_1$ and $w_2$, representing the weights of accuracy and diversity, respectively, are assigned to balance the effects of the two factors. The composite form for measuring ensemble quality is therefore defined in Lemma 1 as follows.

Lemma 1.
Given an ensemble $E$ and a dataset $D$, let each classifier in $E$ predict all samples in $D$, collecting the results in Preds. The ensemble accuracy $\mathrm{Acc}$ and the ensemble diversity $\mathrm{Div}$ can be computed according to Notations 2 and 4, respectively. Denote the ensemble quality score as $\mathrm{WAD}$, whose form is given by (7) as follows:

$$\mathrm{WAD} = \frac{\mathrm{Acc} \cdot \mathrm{Div}}{w_2 \cdot \mathrm{Acc} + w_1 \cdot \mathrm{Div}}, \tag{7}$$

where $w_1$ and $w_2$ are two weight parameters that control the importance of accuracy and diversity, respectively. The sum of the two weight parameters equals 1. If the measure focuses more on accuracy, the value of $w_1$ should be greater than $w_2$. If the measure focuses more on diversity, $w_1$ should be less than $w_2$.
Given two weights, $w_1$ and $w_2$, associated with the accuracy and diversity measurements, $\mathrm{Acc}$ and $\mathrm{Div}$, respectively, the weighted harmonic mean (WHM) is defined by (8):

$$\mathrm{WHM} = \frac{w_1 + w_2}{\dfrac{w_1}{\mathrm{Acc}} + \dfrac{w_2}{\mathrm{Div}}} = \frac{\mathrm{Acc} \cdot \mathrm{Div}}{w_2 \cdot \mathrm{Acc} + w_1 \cdot \mathrm{Div}}, \tag{8}$$

where the second equality uses $w_1 + w_2 = 1$. The formula derived in (8) is exactly the same as the form of the WAD measure. The ensemble quality increases with the value of the WAD score. The WAD score varies between 0 and 1, where it reaches its best value at 1 and its worst value at 0.
When calculating the WAD measure, the ensemble accuracy $\mathrm{Acc}$ and the ensemble diversity $\mathrm{Div}$ can be computed using Notations 2 and 4. For the two weight parameters, $w_1$ and $w_2$, however, a method for determining suitable values is still needed. The rest of this subsection discusses the estimation of the weight parameters. The WAD measure employs the two parameters to balance accuracy and diversity. A straightforward approach is to preset the values manually. However, such a hard-coded approach is difficult to justify and lacks theoretical support because the ensembles are applied to different datasets and should therefore have dataset-specific optimal weight values. Ideally, the values should be adjusted to the dataset and estimated automatically from the data. In fact, the weight parameter estimation for WAD can be formulated as a constrained linear programming problem, as described by Lemma 2.
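The relation between the WAD form and the weighted harmonic mean can be checked numerically. The sketch below assumes that the WAD score is the weighted harmonic mean of Acc and Div with weights summing to 1, that is, Acc · Div divided by (weight of diversity · Acc + weight of accuracy · Div); the function names and the example values 0.8 and 0.4 are illustrative.

```python
def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean: sum of weights over sum of weight/value."""
    return sum(weights) / sum(w / v for v, w in zip(values, weights))


def wad_score(acc, div, w_acc, w_div):
    """WAD score under the assumed form; w_acc + w_div is expected to be 1."""
    if acc == 0 or div == 0:
        return 0.0  # the harmonic mean collapses to 0 if either factor is 0
    return (acc * div) / (w_div * acc + w_acc * div)


acc, div = 0.8, 0.4  # hypothetical ensemble accuracy and diversity

balanced = wad_score(acc, div, 0.5, 0.5)
accuracy_heavy = wad_score(acc, div, 0.9, 0.1)
reference = weighted_harmonic_mean([acc, div], [0.5, 0.5])
print(balanced, accuracy_heavy, reference)
```

With equal weights the score equals the plain harmonic mean of 0.8 and 0.4; shifting weight toward accuracy moves the score toward the accuracy value, which matches the role described for the two weights.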

Lemma 2.
Assume the current ensemble $E$ with $M$ classifiers and the predictions of each classifier in $E$ on the validation dataset $D$. The ensemble accuracy $\mathrm{Acc}$ has been computed using Notation 2, and the diversity $\mathrm{Div}$ has been computed using Notation 4. The estimation of the weight parameters ($w_1$ and $w_2$) can then be formulated as a linear programming problem, and the corresponding mathematical programming formulation is as follows:

$$\begin{aligned} \max_{w_1, w_2} \quad & \mathrm{WAD} = \frac{\mathrm{Acc} \cdot \mathrm{Div}}{w_2 \cdot \mathrm{Acc} + w_1 \cdot \mathrm{Div}} \\ \text{s.t.} \quad & w_1 + w_2 = 1, \\ & \frac{w_1}{w_2} \le \frac{\mathrm{Acc}}{\mathrm{Div}} \ \text{ if } \mathrm{Acc} > \mathrm{Div}, \qquad \frac{w_1}{w_2} \ge \frac{\mathrm{Acc}}{\mathrm{Div}} \ \text{ otherwise}, \\ & 0 \le w_1, w_2 \le 1. \end{aligned} \tag{9}$$

The objective of this problem is to maximize the WAD score, where $\mathrm{Acc}$ and $\mathrm{Div}$ are two constants. The objective function is subject to three constraints, that is, the equality $w_1 + w_2 = 1$, the ratio inequality ($w_1 / w_2 \le \mathrm{Acc} / \mathrm{Div}$ if $\mathrm{Acc} > \mathrm{Div}$, and $w_1 / w_2 \ge \mathrm{Acc} / \mathrm{Div}$ otherwise), and the bounds $0 \le w_1, w_2 \le 1$, which together specify a convex polytope over which to optimize. The second constraint is defined according to the intuition that accuracy and diversity are simultaneously required to be as large as possible. If the accuracy is greater than the diversity, the ratio between $w_1$ and $w_2$ is required to be less than or equal to the ratio between accuracy and diversity. If the accuracy is less than the diversity, the ratio between $w_1$ and $w_2$ should be greater than or equal to the ratio between accuracy and diversity.
The problem in (9) is a typical linear programming problem: because the accuracy and diversity are constants here, maximizing the score amounts to minimizing its denominator, which is linear in the two weights. It can be solved using the simplex algorithm [30], developed by Dantzig in 1947, which forms a feasible solution at a vertex of the polytope and then walks along the edges of the polytope to vertices with nondecreasing values of the objective function until an optimum is reached.
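The weight estimation of Lemma 2 can be sketched without a general simplex solver. The sketch below rests on two assumptions: the WAD score has the weighted harmonic mean form Acc · Div / (weight of diversity · Acc + weight of accuracy · Div), and the equality constraint is used to eliminate the diversity weight, which leaves a one-dimensional linear program whose optimum lies at an endpoint of the feasible interval. The function name `estimate_weights` is illustrative, and enumerating the two endpoints stands in for the simplex algorithm.

```python
def estimate_weights(acc, div):
    """Return (w_acc, w_div) for positive accuracy and diversity values."""
    if acc <= 0 or div <= 0:
        raise ValueError("acc and div must be positive")

    # With w_div = 1 - w_acc, the ratio constraint w_acc / w_div <= acc / div
    # becomes w_acc <= acc / (acc + div); the reverse inequality flips it.
    bound = acc / (acc + div)
    if acc > div:
        low, high = 0.0, bound
    else:
        low, high = bound, 1.0

    # Maximizing acc * div / (w_div * acc + w_acc * div) is the same as
    # minimizing the linear denominator; on an interval a linear function
    # attains its minimum at one of the two endpoints (vertices).
    def denominator(w_acc):
        return (1.0 - w_acc) * acc + w_acc * div

    w_acc = min((low, high), key=denominator)
    return w_acc, 1.0 - w_acc


w_acc, w_div = estimate_weights(0.8, 0.4)
score = (0.8 * 0.4) / (w_div * 0.8 + w_acc * 0.4)
print(w_acc, w_div, score)
```

Under this reading the optimum sits on the ratio bound, so the accuracy weight equals Acc / (Acc + Div). A general-purpose linear programming solver should reach the same vertex.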
The pseudocode of the WAD measure is presented in Pseudocode 1. To compute the WAD score of an ensemble, an original ensemble $E$ with $M$ base classifiers and a validation dataset $D$ with $N$ instances should be provided. The computation starts by collecting the predictions $h_j(x_i)$ of each classifier in the ensemble on each data sample in the validation dataset $D$. The results are recorded in Preds. The accuracy Acc of the ensemble is then computed from the prediction results Preds using the approach in Notation 2. Similarly, the diversity Div of the ensemble is computed from Preds according to Notation 4. Afterwards, the linear programming algorithm is used to estimate the weight parameter values $w_1$ and $w_2$. In the last step, the WAD score is calculated using (7).

    Input:
      (1) Original ensemble: E
      (2) Validation dataset: D
    Variables:
      (1) Classifier predictions: Preds
      (2) Ensemble accuracy: Acc
      (3) Ensemble diversity: Div
      (4) The weight parameters: w_1 and w_2
    Output:
      (1) WAD: ensemble quality score
    Begin
      (1) For each classifier h_j in ensemble E:
            for each sample x_i in validation dataset D:
              get the jth classifier's prediction h_j(x_i) on x_i and put it in Preds
            end for
          End For
      (2) Compute accuracy Acc of E according to Notation 2
      (3) Compute diversity Div of E according to Notation 4
      (4) Estimate w_1 and w_2 according to Lemma 2
      (5) Compute the score by WAD = (Acc · Div) / (w_2 · Acc + w_1 · Div)
    End;

    Pseudocode 1: The pseudocode to compute the WAD score.

Evaluation
In this section, the effectiveness of the WAD measure is investigated in ensemble selection tasks. Together with two existing representative ensemble evaluation measures and two threshold measures, the proposed measure was combined with two heuristic search algorithms to conduct ensemble selection on 15 UCI benchmark datasets. In the following subsections, the setting of the experiments is introduced, and the results of the comparison experiments are reported.

Experimental Settings.
The experimental datasets are taken from the UCI machine learning repository [31]. In the experiments, 15 different datasets are chosen for the evaluation. The characteristics of the various datasets are shown in Table 2. To avoid bias, the datasets are selected as follows: (a) four small-size datasets with fewer than 500 instances, that is, hepatitis, autos, heart-statlog, and ionosphere; (b) six medium-size datasets with 500-3,000 instances, that is, credit-a, diabetes, vehicle, car, cmc, and segment; (c) five large-size datasets with more than 3,000 instances, that is, kr-vs-kp, hypothyroid, waveform-5000, page-blocks, and nursery. In addition, the experimental datasets cover six binary-class problems and nine multiclass problems. Samples with missing values were removed from all datasets. The experimental workbench is WEKA [32], a popular suite of machine learning software written in Java and developed at the University of Waikato.
Initially, each dataset is divided into three disjoint parts, that is, the training set, the validating set, and the testing set, containing 40%, 40%, and 20% of the samples, respectively. The training set is used to produce the original ensemble, the validating set is used for ensemble selection, and the testing set is used to evaluate the selected ensemble. Proportionate stratified sampling is employed to guarantee the balance of the class distribution in the three sets. Based on the training set, the original ensemble is produced with 200 base classifiers generated using the bagging method [5]: 200 diverse datasets are randomly generated by drawing $n$ samples with replacement from the training set, where $n$ is the size of the original training set, and the corresponding 200 base classifiers are then trained with the unpruned J48 decision tree, a variant of C4.5 [2].
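The data preparation described above can be sketched with the Python standard library. This is an illustrative sketch, not the WEKA procedure used in the experiments: the function names, the seed handling, and the rounding of the per-class cut points are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.4, 0.4, 0.2), seed=0):
    """Split sample indices into train/validation/test parts class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    parts = [[] for _ in fractions]
    for indices in by_class.values():
        rng.shuffle(indices)
        cut1 = round(len(indices) * fractions[0])
        cut2 = cut1 + round(len(indices) * fractions[1])
        parts[0].extend(indices[:cut1])
        parts[1].extend(indices[cut1:cut2])
        parts[2].extend(indices[cut2:])  # the remainder forms the test part
    return parts


def bootstrap_indices(train_indices, rng):
    """Draw len(train_indices) indices with replacement (one bagging sample)."""
    return [rng.choice(train_indices) for _ in range(len(train_indices))]


# Toy dataset with two balanced classes (hypothetical labels).
labels = ["a"] * 50 + ["b"] * 50
train, valid, test = stratified_split(labels)

rng = random.Random(1)
bags = [bootstrap_indices(train, rng) for _ in range(200)]  # 200 bagging samples
print(len(train), len(valid), len(test), len(bags))
```

Each of the 200 bootstrap samples would then be used to train one base classifier; the training step itself is omitted here because it depends on the learning library.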
For the comparison, two existing representative ensemble evaluation measures are used, that is, the Kappa-Error Convex Hull Pruning measure [13] and the GenDiv [33] measure, because their objective is similar to the WAD measure objective. The former is a typical evaluation measure for ensemble selection, considering accuracy and diversity. Several studies [10,33] employed this method as an important comparison candidate. The latter is the latest representative measure that trades off accuracy and diversity. In addition to those two candidates, two additional threshold measures are used, Acc-Only and Div-Only. The first one takes only the accuracy into consideration, and the quality score is computed according to Notation 2. The second one only assesses the quality score by the diversity, according to Notation 4.
All five candidate evaluation measures are compared using two common heuristic search algorithms, that is, the genetic algorithm and the forward hill-climbing algorithm, to conduct ensemble selection on the validating set. The genetic algorithm, inspired by evolution and developed by John Holland [1] at the University of Michigan in the 1970s, can be used to yield useful solutions to optimization and search problems. To use a genetic algorithm, the solution for a specific problem should be encoded as a genome or chromosome. The genetic algorithm randomly generates a population of chromosomes and utilizes genetic operators such as mutation and crossover to evolve the population, producing more diverse chromosomes to find the best one. This search approach has been applied in many ensemble selection tasks [14,15,33]. The forward hill-climbing algorithm [34] belongs to a greedy search class of algorithms that focuses on adding or removing a specific classifier such that the improvement in the ensemble performance is maximal. The search starts from the single best classifier and, at each round, seeks a pair of classifiers that maximally increases the ensemble performance. As one of the most effective search algorithms, it is also widely used in ensemble selection tasks [13,16,17,35]. In this experiment, the evaluation measures serve as the objective or evaluation function in the search algorithms. The parameters of the search algorithms are set as follows.
(i) GA: the population size is 50, the crossover rate is 0.8, the mutation rate is 0.7, and the termination condition is no improvement for 100 iterations.
(ii) FHC: the direction is forward, and the termination condition is that it stops at convergence.
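The forward hill-climbing search can be sketched as follows. This is a minimal sketch, not the experimental implementation: it follows the description above by starting from the single best classifier and adding the best pair per round, and it stops when no pair improves the score. The function names and the toy oracle matrix are assumptions, and a plain majority-voting accuracy stands in for the evaluation measure; any measure, such as WAD, could be passed as `evaluate`.

```python
from itertools import combinations


def forward_hill_climbing(n_classifiers, evaluate):
    """Greedy forward selection; evaluate(subset) scores a list of indices."""
    # Start from the single classifier with the best score.
    selected = [max(range(n_classifiers), key=lambda j: evaluate([j]))]
    best_score = evaluate(selected)
    while True:
        remaining = [j for j in range(n_classifiers) if j not in selected]
        best_pair, best_pair_score = None, best_score
        for pair in combinations(remaining, 2):
            score = evaluate(selected + list(pair))
            if score > best_pair_score:
                best_pair, best_pair_score = pair, score
        if best_pair is None:  # convergence: no pair improves the score
            break
        selected.extend(best_pair)
        best_score = best_pair_score
    return selected, best_score


# Toy oracle matrix: 4 classifiers, 5 validation samples (hypothetical values).
oracle = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
]


def toy_majority_accuracy(subset):
    """Majority-voting accuracy of a subensemble; ties count as incorrect."""
    n_samples = len(oracle[0])
    hits = sum(1 for i in range(n_samples)
               if 2 * sum(oracle[j][i] for j in subset) > len(subset))
    return hits / n_samples


subset, score = forward_hill_climbing(len(oracle), toy_majority_accuracy)
print(subset, score)
```

Adding two classifiers per round keeps the subensemble size odd, which avoids voting ties in this toy setting. The exhaustive scan over pairs is quadratic in the number of remaining classifiers, so a practical implementation for 200 classifiers would probably cache predictions.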
Simple majority voting is used to combine the predictions of the selected ensemble on the test set. The size of the resulting ensemble and its classification correct rate of the test data using the combination method are recorded. The whole experiment is performed 10 times for each dataset, and the results are averaged. Tables 3 and 4 show the average size of the ensemble selected by all five evaluation measures, that is, WAD, Kappa-Error, GenDiv, Acc-Only and Div-Only, equipped with the two search algorithms, GA and FHC,for the 15 UCI datasets. The last column (No-Selection) of the table lists the size of the original ensemble, and the bottom row reports the average size across all datasets for each measure. The results show that the average size of ensemble selected via the WAD measure ranks in the exact middle of the pack. In the GA search case, the greatest reduction with respect to the original ensemble occurs for Acc-Only, where the average ensemble size is 10.6. The WAD case is in third place, where the size is 24.3, showing a reduction of approximately 12% from the original ensemble. A similar scenario occurs in the FHC search case, where the average ensemble size of WAD is 23.3, also showing an approximate reduction of 12%. The results of the selected ensemble size are shown to testify that, for a selected classifier in the ensemble, sufficient classifiers are more essential than less ones in an ensemble. According to a previous experiment [35], the selected ensemble size and predictive performance are not strongly correlated. Although ensemble selections via other measures such as Kappa-Error and Acc-Only exhibit a greater reduction in ensemble size, there are fewer than five classifiers left in several cases, indicating that such a situation may be unreasonable and unreliable. Breiman [5] and Opitz and Maclin [36] proposed that in most ensemble cases, most or all of the generalization can be gained in a wellconstructed ensemble with 25 base classifiers. 
The results in Tables 3 and 4 demonstrate that the WAD results (24.3 and 23.3) are close to this recommended size of about 25 base classifiers. Beyond the ensemble size, predictive quality is the more important target for the selected ensemble. The following experimental results validate that ensemble selection via the WAD measure can generate an ensemble not only with reasonable size but also with robust performance. Tables 5 and 6 summarize the predictive performance for the 15 datasets with all five candidate evaluation measures, that is, WAD, Kappa-Error, GenDiv, Acc-Only, and Div-Only. Table 5 reports the classification correct rate with ensemble selection using the GA search method, and Table 6 reports the classification correct rate using the FHC search method. The last column, No-Selection, in both tables indicates the performance of the original ensemble without any ensemble selection process. Each cell in these two tables records the mean and standard deviation of the 10 runs of the experiment. The bottom row gives the win/loss/tie summary, which is computed using a pairwise t-test at the 95% confidence level. To comprehensively probe the proposed measure, three comparisons are made based on the empirical results in Tables 5 and 6.
(1) WAD versus No-Selection. The ensemble selected via the WAD measure outperforms the original ensemble in the overwhelming majority of cases: WAD + GA achieved 13 significant wins and WAD + FHC achieved 12 significant wins among the 15 datasets. Furthermore, there is not a single case of significant loss. Although WAD does not win significantly on three datasets, that is, car, segment, and kr-vs-kp, it is still comparable to No-Selection. This comparison reveals that ensemble selection using the WAD measure can substantially improve the predictive performance compared to the original ensembles. It further shows that fewer classifiers can be employed to preserve or even improve predictive ability.
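The win/loss/tie summary rests on a paired comparison of the 10 runs obtained for two methods on the same dataset. The sketch below shows one way such a decision could be computed with a two-tailed paired t-test. The function name paired_t_outcome is hypothetical, and the two-tailed form with the critical value 2.262 (9 degrees of freedom, 95% confidence) is an assumption of this example, since the exact test configuration is not specified in the text.

```python
import math
import statistics

# Two-tailed critical value of Student's t for 9 degrees of freedom,
# i.e. 10 paired runs, at the 95% confidence level.
T_CRITICAL_DF9 = 2.262


def paired_t_outcome(scores_a, scores_b, t_critical=T_CRITICAL_DF9):
    """Return 'win', 'loss', or 'tie' for method A against method B.

    scores_a[k] and scores_b[k] are the correct rates of the two methods
    in run k. The default critical value is only valid for 10 paired runs.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        # Every run shows the same difference; decide by its sign.
        if mean_diff == 0:
            return "tie"
        return "win" if mean_diff > 0 else "loss"
    t_stat = mean_diff / (sd / math.sqrt(len(diffs)))
    if t_stat > t_critical:
        return "win"
    if t_stat < -t_critical:
        return "loss"
    return "tie"


if __name__ == "__main__":
    # Hypothetical correct rates over 10 runs, for illustration only.
    a = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.93, 0.90, 0.89, 0.92]
    b = [0.85, 0.86, 0.85, 0.87, 0.84, 0.86, 0.88, 0.85, 0.85, 0.86]
    print(paired_t_outcome(a, b))
```

Counting the outcomes of this function over all 15 datasets would yield one win/loss/tie triple per pair of methods.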

Ensemble Quality Evaluation.
Table 3: Average size of the ensemble selected by each measure with GA search.
Dataset        WAD  Kappa-Error  GenDiv  Acc-Only  Div-Only  No-Selection
Hepatitis      23   11           40      10        25        200
Autos          20   22           34      9         32        200
Heart-statlog  28   13           46      8         34        200
Ionosphere     30   10           34      13        45        200
Credit-a       25   5            34      15        56        200
Diabetes       27   12           34      12        125       200
Vehicle        22   3            27      14        111       200
Car            23   23           32      7         78        200
cmc            26   ...
Table 4: Average size of the ensemble selected by each measure with FHC search.
Dataset        WAD  Kappa-Error  GenDiv  Acc-Only  Div-Only  No-Selection
Hepatitis      27   5            31      9         59        200
Autos          21   12           36      14        43        200
Heart-statlog  20   23           56      11        54        200
Ionosphere     17   4            53      10        65        200
Credit-a       27   3            40      15        34        200
Diabetes       25   9            26      3         66        200
Vehicle        29   18           29      13        56        200
Car            18   11           19      9         69        200
cmc            22   18           43      2         45        200
Segment        24   10           109     4         43        200
kr-vs-kp       25   9            69      9         12        200
Hypothyroid    23   25           33      12        34        200
Waveform       15   10           30      27        78        200
Page-blocks    29   16           37      6         45        200
Nursery        28   7            ...
(2) WAD versus Acc-Only and Div-Only. Because the goal of the WAD measure is to balance accuracy and diversity, the comparison with the two threshold cases that consider either accuracy or diversity alone directly demonstrates the performance of the WAD measure with respect to them. To date, no sufficient evidence has been published to support that Acc-Only or Div-Only outperforms No-Selection. Both threshold methods produced poorer results than No-Selection on seven datasets in Tables 5 and 6. This result verifies the commonly accepted hypothesis that considering only accuracy or only diversity in ensemble selection is inadequate for producing good ensembles and may degrade the predictive performance. The WAD predictions are superior to those of Acc-Only and Div-Only in most cases. As shown in Tables 5 and 6, there is only one (6%) significant loss, and the average rate of significant wins is approximately 75%. In particular, in the cases of significant loss (20 in total) when comparing Acc-Only (9 cases) or Div-Only (11 cases) against No-Selection, WAD is still able to come out ahead.
This result shows that taking both accuracy and diversity into consideration helps improve the quality of the ensemble selection task.
(3) WAD versus Kappa-Error and GenDiv. This comparison is performed between WAD and two state-of-the-art evaluation measures, that is, Kappa-Error and GenDiv. WAD outperforms Kappa-Error and GenDiv on ten and nine of the fifteen datasets, respectively, under both search methods. The results in Tables 5 and 6 also show that the maximum number of significant losses is only two, which occurs against GenDiv.
In summary, under the same search algorithm, the performance of ensemble selection relies strongly on the evaluation measure. The experimental results clearly demonstrate that WAD outperforms the other evaluation measures by simultaneously considering both accuracy and diversity and by balancing their influence in assessing the ensemble quality. The measure is capable of computing a rational score to guide good ensemble selection. The comparison results with Acc-Only and Div-Only strongly support this. Furthermore, unlike Kappa-Error and GenDiv, WAD balances accuracy and diversity through weights that are learned automatically from the validation set. The learned parameters can therefore better represent the characteristics of the given datasets and maximally contribute to performance improvement.
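To make the idea of a single balanced score concrete, the following sketch combines majority voting accuracy and pairwise disagreement diversity, the two quantities adopted in this study, through a weighted harmonic mean. The specific weighted form (w_acc + w_div) / (w_acc / acc + w_div / div), the function names, and the fixed default weights are assumptions of this example. The text states only that a harmonic mean with two weight parameters is used and that the weights are learned from the validation set, which is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def majority_vote_accuracy(predictions, labels):
    """Correct rate of the ensemble under simple majority voting."""
    hits = 0
    for j, y in enumerate(labels):
        votes = Counter(clf[j] for clf in predictions)
        hits += votes.most_common(1)[0][0] == y
    return hits / len(labels)


def disagreement_diversity(predictions, labels):
    """Mean pairwise disagreement: the fraction of instances on which
    exactly one classifier of a pair is correct, averaged over pairs."""
    correct = [[p == y for p, y in zip(clf, labels)] for clf in predictions]
    pairs = list(combinations(correct, 2))
    if not pairs:
        return 0.0
    n = len(labels)
    return sum(sum(a != b for a, b in zip(ci, cj)) / n
               for ci, cj in pairs) / len(pairs)


def weighted_harmonic_score(acc, div, w_acc=0.5, w_div=0.5):
    """Weighted harmonic mean of accuracy and diversity (assumed form).

    The default weights are placeholders; in the study they are learned.
    """
    if acc == 0 or div == 0:
        return 0.0
    return (w_acc + w_div) / (w_acc / acc + w_div / div)


def ensemble_quality(predictions, labels, w_acc=0.5, w_div=0.5):
    acc = majority_vote_accuracy(predictions, labels)
    div = disagreement_diversity(predictions, labels)
    return weighted_harmonic_score(acc, div, w_acc, w_div)
```

Because the harmonic mean is dominated by the smaller of its two arguments, an ensemble scores well under this form only when neither accuracy nor diversity is close to zero, which matches the balancing behavior described above.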

Analysis of Four Representative Cases.
In this subsection, four representative datasets are selected according to the empirical results of Tables 5 and 6: (a) credit-a, the case in which WAD outperforms all other approaches; (b) page-blocks, the case in which WAD did not outperform both Acc-Only and Div-Only; (c) autos, the case in which Kappa-Error outperforms WAD; and (d) cmc, the case in which GenDiv outperforms WAD. Figure 1 shows the curves of the average correct rate for these four datasets with respect to the size of the original ensemble. The same experimental settings are used as in the previous experiments, but the size of the original ensemble is increased progressively (from 3 to 400). The first observation concerns the baseline case of No-Selection: as the ensemble size increases, the classification correct rate grows steadily until approximately 30 classifiers. With larger ensemble sizes (>30), the ensemble begins to overfit, and the improvement levels off toward a plateau. This phenomenon is consistent with the claim in the ensemble selection community that combining all classifiers of the original ensemble does not always give better performance [9,11,14,20,35]. The second observation concerns the five ensemble selection cases, that is, WAD, Kappa-Error, GenDiv, Acc-Only, and Div-Only. Performing ensemble selection on progressively larger original ensembles makes it easier to verify the generalization ability of the ensemble selection. Intuitively, a larger ensemble should provide more classifier candidates for constructing a better subensemble. At the same time, however, the chances of picking "bad" classifiers for the subensemble increase. Less robust ensemble selection measures therefore tend to produce unfavorable results in this situation, with the selected ensemble performing worse than the original ensemble. Figure 1 shows that the ensemble selection via WAD gave the best performance in almost all cases.
WAD not only retained its advantage on the dataset where it had outperformed the others in the previous experiment, as shown in Figure 1(a), but also recovered in the cases where other measures had outperformed it, as shown in Figures 1(b) and 1(d). This outcome shows that the WAD measure allows the corresponding ensemble selection to achieve high generalization ability. Only one exceptional case was found, in Figure 1(c), where Kappa-Error performed better than WAD when conducting the ensemble selection with GA search. In practice, it is unrealistic to expect a new measure to be superior to all others under all circumstances.
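The way a search algorithm builds a subensemble from a growing pool of candidates can be pictured with a generic greedy forward search. The sketch below is an illustration only: the function name forward_hill_climbing, the stopping rule (stop as soon as no candidate improves the score), and the toy evaluation function are assumptions of this example and may differ from the FHC configuration used in the experiments.

```python
def forward_hill_climbing(candidates, evaluate):
    """Greedy forward selection guided by an evaluation measure.

    candidates: list of classifiers (any objects).
    evaluate:   callable that maps a list of classifiers to a quality
                score, where a higher score is better.
    Returns the selected subensemble and its score.
    """
    selected = []
    remaining = list(candidates)
    best_score = float("-inf")
    while remaining:
        # Score every one-classifier extension of the current subensemble.
        scores = [evaluate(selected + [c]) for c in remaining]
        best_idx = max(range(len(remaining)), key=lambda i: scores[i])
        if scores[best_idx] <= best_score:
            break  # No extension improves the score, so stop.
        best_score = scores[best_idx]
        selected.append(remaining.pop(best_idx))
    return selected, best_score


if __name__ == "__main__":
    # Toy stand-in for a quality measure: candidates are numbers, and the
    # "quality" of a subset is how close its sum comes to 10.
    def toy_measure(subset):
        return -abs(sum(subset) - 10)

    print(forward_hill_climbing([7, 5, 3, 2], toy_measure))
```

Any of the five evaluation measures could be plugged in as the evaluate argument. A larger candidate pool gives the loop more options at each step, but it also gives a poorly calibrated measure more opportunities to admit a weak classifier.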

Conclusion and Future Works
This study introduces a novel and effective evaluation measure, the weighted accuracy and diversity (WAD) measure, for the ensemble selection task. The goal of the proposed measure is to assess the quality of the ensemble as a whole. Simultaneously considering and balancing accuracy and diversity is the best solution for ensemble quality evaluation. To achieve this goal, the proposed measure defines the final quality score for an ensemble as a combination of accuracy and diversity measurements. Inspired by the F-measure evaluation approach in information retrieval, the ensemble quality score is determined by computing the harmonic mean of accuracy and diversity. Additionally, two weight parameters are assigned to balance accuracy and diversity. Another feature of the proposed measure is that the values of the weight parameters are automatically learned from the data. Experimental comparisons on 15 UCI datasets indicate that ensemble selection via the WAD measure can produce an ensemble with a reasonable size and robust performance and that WAD performs better than three baseline cases, that is, No-Selection, Acc-Only, and Div-Only, and better than two existing measures, that is, Kappa-Error and GenDiv.
Several improvements to the current version of the measure are possible. First, to compute accuracy and diversity, the scope of this study is limited to two specific methods, majority voting accuracy and disagreement diversity. Whether other accuracy and diversity methods can be employed while still achieving favorable results remains an open question. Second, the balance between accuracy and diversity is still a controversial issue. In this paper, the problem is resolved by assigning values to the two weight parameters. Their values are fitted to the validation set using a linear programming technique. However, advance knowledge of the accuracy and diversity results is required to apply the technique. An interesting improvement would be to trade off accuracy and diversity while computing them. Third, in addition to accuracy and diversity, there may be other factors that can be used to help evaluate ensemble quality. If so, what are they, and what is their form? Future work will involve evaluating the current version of the WAD measure in other ensemble selection tasks with different datasets, original ensembles, and search algorithms, and optimizing the current version to find a better version of the WAD measure by considering these possible improvements.