Variance Reduction Trends on ‘Boosted’ Classifiers

Ensemble classification techniques such as bagging (Breiman, 1996a), boosting (Freund & Schapire, 1997) and arcing algorithms (Breiman, 1997) have received much attention in recent literature. Such techniques have been shown to lead to reduced classification error on unseen cases. Even when the ensemble is trained well beyond zero training set error, it continues to exhibit improved classification error on unseen cases. Despite many studies and conjectures, the reasons behind this improved performance and an understanding of the underlying probabilistic structures remain open and challenging problems. More recently, diagnostics such as edge and margin (Breiman, 1997; Freund & Schapire, 1997; Schapire et al., 1998) have been used to explain the improvements made when ensemble classifiers are built. This paper presents results from an empirical study performed on a set of representative datasets using the decision tree learner C4.5 (Quinlan, 1993). An exponential-like decay in the variance of the edge is observed as the number of boosting trials is increased, i.e. boosting appears to 'homogenise' the edge. Some initial theory is presented which indicates that a lack of correlation between the errors of individual classifiers is a key factor in this variance reduction.


Introduction
This paper is concerned with the classification problem, whereby a model builder (classifier) is presented with a training set comprising a series of n labelled training examples of the form (x_1, y_1), ..., (x_n, y_n), with y_i ∈ {1, ..., k}. The classifier's task is to use these training examples to produce a hypothesis, h(x), which is an estimate of the unknown relationship y = f(x). This 'hypothesis' then allows future prediction of y given new input values of x. A classifier built by combining individual h(x)'s to form a single classifier is known as an ensemble. Whilst there are many ensemble building methods in existence, this discussion focusses on the method of boosting, which is based on a weighted subsampling of the training examples.
Introduced by Freund and Schapire in 1997, boosting is recognised as being one of the most significant recent advances in classification (Freund & Schapire, 1997). Since its introduction, boosting has been the subject of many theoretical and empirical studies (Breiman, 1996b; Quinlan, 1996; Schapire et al., 1998). Empirical studies have shown that ensembles grown by repeatedly applying a learning algorithm over different randomly chosen subsamples of the data of size n improve generalisation error for unstable learners (i.e. methods where a small change in the input data leads to large changes in the learned classifier).

Current Explanations of the Boosting Mechanism
Ensemble classifiers and the reasons for their improved classification accuracy have provided a fertile ground for research, and significant gains may still be made if these reasons are addressed. In theory, as the combined classifier complexity increases, the gap between training and test set error should increase. However, this is not reflected in empirical studies. There is strong empirical support for the view that overfitting is less of a problem (or perhaps a different problem) when boosting and other resampling methods are used to improve a learner. Some authors have addressed this issue via bias and variance decompositions in an attempt to understand the stability of a learner (Breiman, 1997; Breiman, 1996b; Friedman, 1997).
Boosting is an iterative procedure which trains a classifier over the n weighted observations. Boosting begins with all training examples being weighted equally (i.e. 1/n). At the (m+1)-th iteration, examples which were classified incorrectly at the m-th iteration have their weight increased multiplicatively so that the total weight on incorrect observations is equal to 0.5 for the (m+1)-th iteration. Hence, the learning algorithm will be given more opportunity to explore areas of the training set which are more difficult to classify. Hypotheses from these parts of the space make fewer mistakes on these areas and play an important role in prediction when all hypotheses are combined via weighted voting. Weighted voting takes place by having each hypothesis assigned a voting weight which is a function of the error made by that particular hypothesis. Hypotheses which make fewer errors are given a higher voting weight when the ensemble is formed. Accuracy of the final hypothesis depends on the accuracy of all the hypotheses returned at each iteration, and the method exploits hypotheses that predict well in more difficult parts of the instance space. An advantage of boosting is that it does not require any background knowledge of the performance of the underlying classification algorithm.
The following comments and conclusions on boosting and ensemble classification have been made to date.
• Breiman (1996b) claims the main effect of the adaptive resampling when building ensembles is to reduce variance, where the reduction comes from adaptive resampling and not the specific form of the ensemble forming algorithm.
• A weighted algorithm in which the classifiers are built via weighted observations performs better than weighted resampling at each iteration, apparently due to removing the randomisation (Friedman, Hastie & Tibshirani, 1998).
• Successful ensemble classification is due to the non-overlap of errors (Dietterich, 1997), i.e. observations which are classified correctly by one hypothesis are classified incorrectly by others and vice versa.
• Margin and edge analysis are recent explanations (Breiman, 1997; Schapire et al., 1998). More detail on these measures and related studies is provided in the next section.
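The reweighting step described before this list can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function name `reweight` and the toy data are hypothetical, and the rescaling shown is simply one way to make the misclassified observations carry a total weight of 0.5 while keeping the weights summing to one.

```python
def reweight(weights, incorrect):
    """Rescale weights so misclassified observations carry total weight 0.5.

    weights   -- current observation weights (assumed to sum to 1)
    incorrect -- booleans, True where the current hypothesis erred
    Assumes the weighted error is strictly between 0 and 1.
    """
    err = sum(w for w, bad in zip(weights, incorrect) if bad)
    if not 0.0 < err < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    # Wrong cases are scaled by 0.5/err, right cases by 0.5/(1 - err).
    return [w * (0.5 / err if bad else 0.5 / (1.0 - err))
            for w, bad in zip(weights, incorrect)]


n = 5
w0 = [1.0 / n] * n                       # equal initial weights 1/n
wrong = [True, False, False, False, False]
w1 = reweight(w0, wrong)                 # weights for the next iteration
```

In this toy case the single misclassified observation moves from weight 0.2 to 0.5, and the four correct observations share the remaining 0.5.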

Edge and Margin Analysis
Recent explanations as to the success of boosting algorithms have their foundations in margin and edge analysis. These two measures are defined for the i-th training observation at trial m as follows. Assume we have a base learner which produces hypothesis h_m(x) at the m-th iteration, an error indicator function I_m(x_i) = I(h_m(x_i) ≠ y_i), and voting weights c = (c_1, ..., c_m) which are non-negative and sum to one.
• edge_i(m, c) = total weight assigned to all incorrect classes. The edge is defined formally in (Breiman, 1997) as
$$\mathrm{edge}_i(m, c) = \sum_{j=1}^{m} c_j I_j(x_i) \qquad (1)$$
• margin_i(m, c) = total weight assigned to the correct class minus the maximal weight assigned to any incorrect class.
For the two-class case margin_i(m, c) = 1 − 2 edge_i(m, c), and in general margin_i(m, c) ≥ 1 − 2 edge_i(m, c). Whilst more difficult to compute, the value of the margin is relatively simple to interpret. Margin values will always fall in the range [−1, 1], with high positive margins indicating confidence of correct classification. An example is classified incorrectly if it has a negative margin. The edge, on the other hand, cannot be used as an indicator variable for correct classification (except in the two-class case). Whilst the margin is a useful measure due to its interpretability, mathematically it is perhaps not as robust and tractable as the edge.
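The two measures can be made concrete with a small sketch. The function `edge_and_margin` and the toy predictions are hypothetical; the code follows the verbal definitions above (weight on all incorrect classes, and weight on the correct class minus the largest weight on any single incorrect class), assuming the voting weights are normalised to sum to one.

```python
def edge_and_margin(predictions, true_label, votes):
    """Edge and margin of one observation under normalised voting weights.

    predictions -- class predicted by each hypothesis h_1..h_m
    true_label  -- the correct class of the observation
    votes       -- non-negative voting weight of each hypothesis
    """
    total = sum(votes)
    c = [v / total for v in votes]           # normalise so weights sum to 1
    weight_by_class = {}
    for pred, cj in zip(predictions, c):
        weight_by_class[pred] = weight_by_class.get(pred, 0.0) + cj
    correct = weight_by_class.get(true_label, 0.0)
    wrong = [w for k, w in weight_by_class.items() if k != true_label]
    edge = sum(wrong)                         # weight on all incorrect classes
    margin = correct - max(wrong, default=0.0)
    return edge, margin


# Three classes, four hypotheses with equal votes.
edge, margin = edge_and_margin(["a", "a", "b", "c"], "a", [1, 1, 1, 1])
# Two classes only: here the margin equals 1 - 2 * edge.
edge2, margin2 = edge_and_margin(["a", "b", "a", "a"], "a", [1, 1, 1, 1])
```

In the three-class example the edge is 0.5 yet the margin is positive, which illustrates why the edge alone does not indicate correct classification outside the two-class case.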
Schapire et al. (1998) claim that boosting is successful because it works to increase low margins for difficult observations, hence increasing the confidence of correct classification. Similarly, Breiman (1997) claims that a lower average edge (or higher average margin) should lead to higher classification accuracy on unseen cases.
This study examines the variance and average of the edge values versus the number of boosting trials performed. Methodology for the study is discussed in detail in the next section.

Empirical Results and Initial Theory
In all experiments, the decision tree learner C4.5 (Quinlan, 1993) with default values and pruning was used as the base classifier, with a boosted ensemble being built from M = 50 iterations. Datasets used are a selection from the UCI Machine Learning Repository. The datasets were chosen to provide a representative mixture of dataset size and boosting performance previously reported (Quinlan, 1996; Schapire et al., 1998). 10-fold cross-validation was applied, whereby the original training data were shuffled randomly and split into 10 equal-sized partitions. Each of the ten partitions was used in turn as a test set for the ensemble generated using the remaining 90% of the original data as a training set.
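The cross-validation protocol can be sketched as below. This is an illustrative assumption about the mechanics only: the function name `ten_fold_splits`, the fixed seed and the example size of 200 are hypothetical and are not taken from the study.

```python
import random


def ten_fold_splits(n_examples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Partition sizes differ by at most one when n_examples % n_folds != 0.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


splits = list(ten_fold_splits(200))
```

Each observation appears in exactly one test partition, and each ensemble is trained on the remaining nine partitions.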
At each iteration, the values for edge were calculated for each observation in the 10 training sets (cross-validation folds). The average and variance of edge_i(m, c) over the training observations were then calculated at each iteration. The results of these trials for the colic, glass and letter datasets appear in graphical form below (these results are indicative of the average and variance trends for all datasets tested).
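A sketch of this calculation is given below. The function `edge_statistics` and the toy error matrix are hypothetical, and the variance divides by n, which is an assumption because the exact formula is not reproduced here.

```python
def edge_statistics(errors, votes):
    """Mean and variance of the edge over observations after each trial.

    errors -- errors[j][i] is 1 if hypothesis j misclassifies observation i
    votes  -- voting weight of hypothesis j (any positive numbers)
    Returns a list of (mean, variance) pairs, one per trial m = 1..M.
    The variance divides by n (an assumption, see the text above).
    """
    n = len(errors[0])
    weighted_errors = [0.0] * n      # running sum of votes[j] * errors[j][i]
    vote_total = 0.0
    stats = []
    for err_j, a_j in zip(errors, votes):
        vote_total += a_j
        weighted_errors = [s + a_j * e for s, e in zip(weighted_errors, err_j)]
        edges = [s / vote_total for s in weighted_errors]
        mean = sum(edges) / n
        var = sum((x - mean) ** 2 for x in edges) / n
        stats.append((mean, var))
    return stats


# Toy data: 3 hypotheses, 4 observations, non-overlapping errors.
errors = [[1, 0, 0, 0],
          [0, 1, 0, 0],
          [0, 0, 1, 0]]
stats = edge_statistics(errors, votes=[1.0, 1.0, 1.0])
```

With these non-overlapping toy errors the mean edge stays at 0.25 while the variance falls at every trial, which is the qualitative pattern discussed in this section.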
Note an apparent exponential decrease in variance, perhaps indicating an asymptote of zero or some small value. These results prompt the question "does boosting homogenise the edge?". The most dramatic variance decay is seen in boosting trials m ≤ 5, i.e. most of the 'hard' work appears to be done in the first few trials. This observation is consistent with several authors noting in earlier published empirical studies that little additional benefit is gained after 10 boosting trials when a relatively strong learner is used as the base learner. If the above variance reduction trends were truly exponential, replotting on a log scale would show a linear trend. This is not the case, however, and we may conclude that the decrease is not purely exponential.
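The log-scale check mentioned above can be expressed numerically: for a purely exponential decay, successive differences of the log-variance are constant. The sketch below uses synthetic sequences only (not the study's data), and the helper name `log_differences` is hypothetical.

```python
import math


def log_differences(variances):
    """Successive differences of log-variance; constant for exponential decay."""
    logs = [math.log(v) for v in variances]
    return [b - a for a, b in zip(logs, logs[1:])]


# Synthetic examples: a pure exponential and a 1/m decay.
pure = [0.2 * math.exp(-0.3 * m) for m in range(1, 8)]
hyperbolic = [0.2 / m for m in range(1, 8)]
d_pure = log_differences(pure)      # every entry is -0.3 (up to rounding)
d_hyp = log_differences(hyperbolic)  # entries drift towards zero
```

A sequence whose log-differences drift, as in the second example, would appear curved rather than linear on a log-scale plot.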
It appears that observations with low initial edges are 'sacrificed' for observations with high initial edges, i.e. observations which were initially classified correctly are classified incorrectly in later rounds in order to classify 'harder' observations correctly. This notion is consistent with margin
It has been suggested in recent studies that reduction in test set error may correlate with a reduction in the edge values. With the exception of the letter dataset, plotting the average edge versus test set error for all cross-validated folds showed no relationship with reduction in test set error. However, the glass and colic datasets are quite small, resulting in cross-validated test sets containing 10-20 observations only. It was also noted that boosting with unweighted votes, where the vote for each classifier was equal to 1/m, resulted in a similar 'exponential' decrease, perhaps alluding to a voting effect rather than the specific form of the algorithm.
Interestingly, an increase in the average edge is apparent as the number of boosting trials increases. Refer to Figure 4 below for results on the colic, glass and letter datasets. Again these trends are indicative of trends for all other datasets tested.

Developing Expressions for Var[edge_i(m, c)] and E[edge_i(m, c)]
An alternative expression for the variance of the edge is derived below. Firstly, the definition of edge follows that given in equation (1) (Breiman, 1997). Letting c_j = a_j / Σ_j a_j, and defining the unweighted error of the j-th hypothesis as e_j, the variance may be rewritten in terms of these quantities and then specialised to AdaBoost. The expression derived for Var[edge_i(m, c)] is dependent only on the voting weights, e_j and Cov[I_j(x_i), I_k(x_i)]. Intuitively, this result makes sense, since, at each iteration, the learner attempts to correctly predict observations that were predicted incorrectly at the previous iteration. For this to happen, the indicator variables for unweighted error should be negatively correlated or uncorrelated for pairwise iterations. If errors were positively correlated, voting could degrade performance, since individual hypotheses may consistently vote incorrectly on some observations and never be given the chance to explore different areas of the training set. Hence, from the expression derived, negative or zero covariance terms will result in non-increasing values for Var[edge_i(m, c)].
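For orientation, the standard variance expansion for a weighted sum of error indicators is written out below. It is offered as one form consistent with the description above, not necessarily the exact expression (2) of the paper. It assumes each indicator I_j(x_i) is treated as a 0/1 variable with mean e_j, so that its variance is e_j(1 − e_j), and it uses a_j for the unnormalised voting weights.

```latex
\begin{align*}
\mathrm{edge}_i(m, c) &= \sum_{j=1}^{m} c_j I_j(x_i),
  \qquad c_j = \frac{a_j}{\sum_{l=1}^{m} a_l}, \\
\mathrm{Var}[\mathrm{edge}_i(m, c)] &= \sum_{j=1}^{m} c_j^2\, e_j (1 - e_j)
  + 2 \sum_{j < k} c_j c_k\, \mathrm{Cov}[I_j(x_i), I_k(x_i)].
\end{align*}
```

The first sum depends only on the voting weights and individual error rates, while the second carries all of the dependence on how errors overlap between hypotheses.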

Now, Cov[I_j(x_i), I_k(x_i)] = E[I_j(x_i), I_k(x_i)] − e_j e_k, and a loose upper bound for Cov[I_j(x_i), I_k(x_i)] is given by min(e_j, e_k) − e_j e_k.
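The covariance term can be estimated directly from two vectors of 0/1 error indicators. The sketch below is illustrative: the function name and toy vectors are hypothetical, and the joint expectation written E[I_j(x_i), I_k(x_i)] in the text is computed as the mean of the product I_j I_k. Because that mean cannot exceed min(e_j, e_k), the quantity min(e_j, e_k) − e_j e_k caps the covariance from above. The lower limit shown is the standard Fréchet bound, added here as an assumption for comparison.

```python
def joint_error_stats(err_j, err_k):
    """Empirical e_j, e_k, E[I_j I_k] and covariance for two 0/1 error vectors."""
    n = len(err_j)
    e_j = sum(err_j) / n
    e_k = sum(err_k) / n
    joint = sum(a * b for a, b in zip(err_j, err_k)) / n   # E[I_j I_k]
    cov = joint - e_j * e_k
    return e_j, e_k, joint, cov


# Toy error indicators over 10 observations with no shared mistakes.
err_j = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
err_k = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
e_j, e_k, joint, cov = joint_error_stats(err_j, err_k)
upper = min(e_j, e_k) - e_j * e_k               # largest covariance possible
lower = max(0.0, e_j + e_k - 1.0) - e_j * e_k   # smallest covariance possible
```

When two hypotheses never err on the same observation, as in this toy case, the covariance sits at its lower limit of −e_j e_k.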
A simplified expression for the average edge is given by:
$$E[\mathrm{edge}_i(m, c)] = \sum_{j=1}^{m} c_j e_j$$
After algebraic manipulation it can be shown that the condition for the average edge to increase between the m-th and (m+1)-th trials is
$$e_{m+1} \ge E[\mathrm{edge}_i(m, c)]$$
That is, the average edge will increase between the m-th and (m+1)-th trials if the unweighted error of the (m+1)-th trial is greater than or equal to the average edge on the m-th trial. Generally the average edge increases but stays below a threshold of the maximum unweighted error of the hypotheses. The maximum unweighted error is an upper bound on the average edge. This may imply the following: Var[edge_i(m, c)] ↓ ⇒ TestError ↓, perhaps indicating that further improvements in test set error may be obtained by actively minimising the variance of the edge (or margin) values.
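The condition on the average edge can be checked with a short derivation. The symbols S_m and \bar{E}_m below are introduced here for illustration only, and the voting weights a_j are assumed to be strictly positive.

```latex
\begin{align*}
S_m &= \sum_{j=1}^{m} a_j, \qquad
\bar{E}_m = \frac{1}{S_m} \sum_{j=1}^{m} a_j e_j, \\
\bar{E}_{m+1} &= \frac{S_m \bar{E}_m + a_{m+1} e_{m+1}}{S_m + a_{m+1}}, \\
\bar{E}_{m+1} - \bar{E}_m &= \frac{a_{m+1}\,\bigl(e_{m+1} - \bar{E}_m\bigr)}{S_m + a_{m+1}}
  \;\ge\; 0 \iff e_{m+1} \ge \bar{E}_m .
\end{align*}
```

Since \bar{E}_m is a weighted average of e_1, ..., e_m, it can never exceed the largest of them, which gives the upper bound on the average edge noted above.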
A theoretical expression for Var[edge_i(m, c)] has been given in Section 4.1 in (2). To prove that this is a monotonic non-increasing function, it must be shown that Var[edge_i(m+1, c)] ≤ Var[edge_i(m, c)] for all m. Algebraically, it is quite straightforward to show that:
Now, C_m < 1 ∀m and hence A_m must be non-negative in the limit, as otherwise the LHS variance expression would fall below zero. But A_m = A_{m,1} + A_{m,2}, with A_{m,1} being negative in the limit and A_{m,2} being strictly positive. If A_m is non-negative in the limit, A_{m,2} must be less than -1 in the limit. Refer to Figure 7 below where it appears that this is the case, implying that A_m has a non-negative limit.
Alternatively, define δ_m as follows:
The values of δ_m are plotted below. Again, via algebraic manipulation it can be shown that:
Now, δ_{m,1} < 0 and δ_{m,2} < δ_{m,1}. It can be shown via mathematical induction that δ_{m,2} < 0. Hence in both cases above we seek the distribution of A_m to prove the non-increasing property of Var[edge_i(m, c)]. Since A_m is essentially a correlation term with a slightly positive value in the limit, this confirms the notion that the success of boosting is due to the lack of correlation between errors made by individual classifiers (or a non-overlap of errors between individual classifiers).

Empirical Trials of New Terms
Using the glass, colic and letter datasets, the values of A_m, A_{m,1}, A_{m,2}, C_m, δ_m, δ_{m,1} and δ_{m,2} were evaluated. As with previous empirical results, 10-fold cross-validation was applied to the same shuffled datasets, with each plotted point representing the result from one fold. M = 50 boosting trials were employed.
It can be seen in Figure that A_m appears to be zero or slightly positive in the limit, with a tighter scatter about the zero line as m increases. Figure shows A_{m,1} being negative in the limit but not exhibiting the same scatter decrease as
It can be seen in Figure that C_m is always negative with an apparent limit of 1. It can be seen in Figures and that δ_{m,1}, δ_{m,2} are always negative, both with an apparent limit of 0.
It can be noted that, in all the above figures, the variation in each term for the colic dataset is larger than the variation observed for the other 2 datasets plotted (i.e. glass, letter).

Interesting Empirical Observations on E[I_j(x_i), I_k(x_i)] and log β_j
For the glass, bands, colic and letter datasets, the values of log β_j were calculated for all j, and E[I_j(x_i), I_k(x_i)] for j, k ≤ 5, j ≠ k.
• the colic dataset had E[I_j(x_i), I_k(x_i)] > 0 for all j, k ≤ 5, i.e. there are observations which are predicted incorrectly in one round and also predicted incorrectly in subsequent rounds. Could this be a factor in the degradation of performance of a learner on colic data when boosting is applied?
• it appears that E[I_j(x_i), I_{j+1}(x_i)] = 0.
• log β_m shows no trending as m increases for glass and colic but shows variance reduction and possible cycles for letter. Additionally, the values of β_j are highly variable for the colic dataset. This is an interesting result as boosting degrades performance on the colic dataset.
• Σ_{j=1}^{m} log β_j is linear in m, suggesting that log β_j is constant.
• Since log β_j appears to be constant, (Σ log β_j)^{−2} closely follows an x^{−2} type curve and this normalising factor has a strong decaying effect.

General Forms of Voting Systems
Mathematical analysis of variance reduction may be simplified by considering general forms of voting systems. This may also allow us to partition the variance into components pertaining to the voting mechanism and those pertaining to the method of formation of a sequence of classifiers. In boosting, consecutive classifiers are formed via an adaptive procedure, but for bagging they are formed via a sequence of bootstrap replicates. Examples of possible schemes to consider are:
• all m classifiers make identical predictions at each iteration and hence have the same individual error rate e, with corr[I_j(x_i), I_k(x_i)] = 1 and the voting weight of the m-th classifier equal to 1/m; in this case Var[edge_i(m, c)] = e(1 − e), which is constant and independent of m. Therefore, if this type of voting scheme were employed, no reduction in the variance of the edge would occur.
• the m classifiers do not make identical predictions at each iteration but have the same individual error rate e, with corr[I_j(x_i), I_k(x_i)] = ρ for all j ≠ k. With a voting weight of 1/m for each classifier, the variance of the edge is
$$\mathrm{Var}[\mathrm{edge}_i(m, c)] = \frac{e(1 - e)}{m} + \frac{m - 1}{m}\,\rho\, e(1 - e)$$
The first term in the variance expression above may be considered to be the voting component and the second term involving ρ the component pertaining to the method of classifier formation. Figure 9 below shows the value of this variance with e fixed at 0.04 and ρ varying; −1/(m−1) ≤ ρ ≤ 1. The reason ρ has a lower limit of −1/(m−1) is given by Kendall & Stuart (1963). It may be possible to apply this lower limit on ρ in future work when trying to prove asymptotic limits on Var[edge_i(m, c)].
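The equal-error, equal-correlation scheme is simple enough to evaluate directly. For m classifiers with common error rate e, voting weight 1/m and common pairwise error correlation ρ, the variance of the edge works out to e(1 − e)/m + ((m − 1)/m) ρ e(1 − e). The function name below is hypothetical and the values of m and ρ are chosen only to illustrate the limiting cases.

```python
def equicorrelated_edge_variance(m, e, rho):
    """Variance of the edge for m equally weighted classifiers.

    Assumes every classifier has error rate e, voting weight 1/m, and
    pairwise error correlation rho, with -1/(m-1) <= rho <= 1.
    """
    base = e * (1.0 - e)
    if m > 1 and rho < -1.0 / (m - 1) - 1e-12:
        raise ValueError("rho below -1/(m-1) is not attainable")
    voting_term = base / m                       # shrinks as 1/m
    formation_term = (m - 1) / m * rho * base    # depends on correlation
    return voting_term + formation_term


e = 0.04
identical = equicorrelated_edge_variance(50, e, 1.0)        # rho = 1
uncorrelated = equicorrelated_edge_variance(50, e, 0.0)     # rho = 0
extreme = equicorrelated_edge_variance(50, e, -1.0 / 49)    # lower limit of rho
```

With ρ = 1 the variance stays at e(1 − e) regardless of m, with ρ = 0 it decays as 1/m, and at the lower limit ρ = −1/(m−1) it vanishes, which is why no smaller correlation is attainable.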
To check the degree of correlation between individual hypotheses for the letter dataset, the variance values obtained empirically are overlaid onto the variance trend graph above. The value of e is again 0.04, which is a close match to the values of e_j obtained empirically for the letter data. Refer to Figure 5 below, where we may conclude that corr[I_j(x_i), I_k(x_i)] for the letter data is in the range 0 ≤ corr[I_j(x_i), I_k(x_i)] ≤ 0.10. Clearly the assumptions of equal individual error rates and equal correlation between classifiers are too loose, but we can still gain an appreciation for the type of variance reduction occurring and the degree of correlation between hypotheses.

Conclusion
This study has presented some interesting results on the variance of the edge when a boosted ensemble is formed. Variance reduction trends are consistent across all datasets tested. The initial theory and associated empirical results presented confirm that a key factor in this reduction is lack of correlation between errors of individual classifiers. Some initial theoretical work presented in Section

Figure 10. Plot of calculated Var[edge_i(m, c)] for varying ρ versus number of combined classifiers (e = 0.04).

• It has been noted earlier that E[I_j(x_i), I_k(x_i)] ≤ min(e_j, e_k). This upper bound is now seen to be very loose when the exact values of E[I_j(x_i), I_k(x_i)] are calculated empirically, i.e. E[I_j(x_i), I_k(x_i)] << min(e_j, e_k) in this study.