Analysis of Generalization Ability for Different AdaBoost Variants Based on Classification and Regression Trees

As a machine learning method, AdaBoost is widely applied to data classification and object detection because of its robustness and efficiency. AdaBoost constructs a global and optimal combination of weak classifiers based on sample reweighting. It is known that this kind of combination improves the classification performance tremendously. As the popularity of AdaBoost has increased, many variants have been proposed to improve its performance, and many comparison and review studies of these variants have been published. Some researchers compared different AdaBoost variants by experiments in their own fields, and others reviewed various AdaBoost variants by introducing the basic ideas of these algorithms. However, there is a lack of mathematical analysis of the generalization abilities of different AdaBoost variants. In this paper, we analyze the generalization abilities of six AdaBoost variants in terms of classification margins. The six compared variants are Real AdaBoost, Gentle AdaBoost, Modest AdaBoost, Parameterized AdaBoost, Margin-pruning Boost, and Penalized AdaBoost. Finally, we use experiments to verify our analyses.


Introduction
In the last two decades, AdaBoost and its variants have been used in various fields, such as face detection [1,2], hand detection [3,4], and human detection [5]. AdaBoost was first introduced to the machine learning literature by Freund and Schapire [6], and it has achieved tremendous success in classification. AdaBoost has been shown to be efficient in increasing the classification margins of training data [7]. To describe AdaBoost mathematically, Schapire and Singer proposed Real AdaBoost, which is a generalized version of AdaBoost [8]. Real AdaBoost calculates its weak hypotheses by directly optimizing the upper bounds of training errors. Therefore, it converges faster than AdaBoost in training [9]. To improve the training speed of Real AdaBoost, Wu and Nagahashi devised Parameterized AdaBoost, which utilizes a new weight adjustment policy [10]. In 2000, Friedman et al. used additive logistic models to explain AdaBoost and proposed Gentle AdaBoost, which computes weak hypotheses by minimizing the least square errors [11]. Friedman et al. also showed that Gentle AdaBoost is more robust than AdaBoost and Real AdaBoost [11]. To reduce the generalization error of Gentle AdaBoost, A. Vezhnevets and V. Vezhnevets suggested Modest AdaBoost, which highlights weak classifiers that work well on difficult-to-classify instances [12]. Modest AdaBoost achieves better generalization errors than Gentle AdaBoost on some data sets [13]. However, its performance is unstable because the accuracy drops occasionally. For the same purpose, Wu and Nagahashi devised Margin-pruning Boost [14] and Penalized AdaBoost [15]. Margin-pruning Boost applies a weight reinitialization approach to reduce the influence of noise-like data, while Penalized AdaBoost improves Margin-pruning Boost by introducing an adaptive weight resetting policy. Moreover, it utilizes a margin distribution to penalize the misclassification of small-margin instances.
Freund created BrownBoost to reduce the influence of outliers in training [16]. LPBoost was introduced to optimize the minimal margin of training data by using linear programming [17]. However, a comparison showed that LPBoost overall performs worse than AdaBoost [18]. Similarly to Margin-pruning Boost, MadaBoost and SmoothBoost were devised to increase the robustness against malicious noise data [19,20]. Some AdaBoost variants such as AdaCost, AdaC1, AdaC2, AdaC3, CSB0, CSB1, CSB2, and RareBoost assign weights to positive training instances and negative training instances differently to obtain better performance on imbalanced data sets [21][22][23][24]. Others trade off the integrity of training data for faster training [25][26][27][28]. In 2004, AdaTree was proposed to speed up the training process. It selects weak classifiers in the same way as AdaBoost but combines them in a nonlinear manner [29]. FilterBoost and Regularized AdaBoost were proposed to solve the overfitting problem [30,31]. FilterBoost is based on a new logistic regression technique, whereas Regularized AdaBoost requires validation subsets to identify and correct the overfitting iteratively. FloatBoost and FM-AdaBoost filter out the less effective weak classifiers so that they can outperform AdaBoost when they have the same number of weak classifiers as AdaBoost [32,33]. Nevertheless, they require more training cycles than AdaBoost.
More recently, many novel AdaBoost variants have been devised to improve the generalization ability, such as SoftBoost, Interactive Boosting, ReweightBoost, Soft-LPBoost, and RobustBoost [34][35][36][37][38][39]. SoftBoost maximizes a soft margin instead of the hard margin used in AdaBoost [34,36]. Interactive Boosting gives weights to both features and training instances [37]. ReweightBoost builds a tree structure by reusing the selected weak classifiers. Nevertheless, it can only use stump decision trees as its weak classifiers [38]. Soft-LPBoost combines SoftBoost with LPBoost [39], while RobustBoost is an extension of BrownBoost [35]. All five of these AdaBoost variants can achieve better generalization errors than AdaBoost. However, they involve many complicated calculations, which may lead to a longer training time. In the last few years, a novel approach called SemiBoost has been developed rapidly. It combines supervised learning with semisupervised learning by using both labelled and unlabelled training instances [40]. The purpose of SemiBoost is to increase the generalization ability when the labelled training instances are insufficient [41]. In addition, AdaBoost.M1, Conservative.2 AdaBoost, and Aggressive AdaBoost are AdaBoost variants proposed for multiclass classification problems [42].
As many AdaBoost variants were proposed, many surveys and comparison studies of these variants were published. Miao and Heaton compared AdaBoost with Random Forest in ecosystem classification problems and showed that AdaBoost overall outperforms Random Forest [43]. Another comparison of AdaBoost and neural networks showed that an AdaBoost ensemble of trees performs better than an individual neural network in experiments using cross validation [44]. The research in [45] applied AdaBoost and SVM to Synthetic Aperture Radar Automatic Target Recognition Systems and found that AdaBoost is more robust than SVM. Ferreira briefly introduced many boosting algorithms and labelled them as "supervised learning" or "semisupervised learning" [46]. Seiffert et al. compared resampling boost algorithms with reweighting boost algorithms on imbalanced data sets and concluded that boosting by resampling generally outperforms boosting by reweighting [47]. A comparison of LPBoost and AdaBoost based on experimental results on the UCI repository was also conducted in [18]. Hegazy and Denzler evaluated AdaBoost and SoftBoost in generic object recognition and concluded that AdaBoost is more suitable for low-noise data sets while SoftBoost is more suitable for high-noise data sets [48]. Another study, which compares AdaBoost with AdaTree, was carried out by Drauschke and Forstner [49]. Its experimental results showed that AdaTree usually performs better than AdaBoost but is prone to overfitting due to its tree-like structure. Jurić-Kavelj and Petrović evaluated three AdaBoost variants (Real, Gentle, and Modest AdaBoost) based on experiments in leg detection and found that Modest AdaBoost cannot reduce the error rate as the number of iterations increases [50]. Sun et al. compared Discrete, Real, and Gentle AdaBoost by analyzing experimental results in license plate detection and reported that Gentle AdaBoost achieves better performance than the other two methods [51]. The comparison in [52] focused on weak classifiers of AdaBoost constructed by Bayes net, naive Bayes, and decision trees, and it showed that decision trees are the best. A review systematically introduced AdaBoost variants proposed between 1999 and 2012; nevertheless, it lacks a comparison between different AdaBoost variants [53].
In general, the above surveys and comparison studies either introduce the basic ideas of AdaBoost variants or compare different AdaBoost variants by experiments in a specific research field. In contrast to these studies, we compare the generalization abilities of six AdaBoost variants (Real AdaBoost, Gentle AdaBoost, Modest AdaBoost, Parameterized AdaBoost, Margin-pruning Boost, and Penalized AdaBoost) by analyzing the classification margins. The remainder of this paper is organized as follows. Section 2 explains the materials and methods. Section 3 shows the experimental results. Section 4 draws a conclusion.

Materials and Methods
This section describes the training data and weak classifiers used in our research. It also explains the basic ideas of the six compared AdaBoost variants and their generalization abilities in terms of the classification margins.

Training Data and Weak Classifiers.
Here we give a brief introduction to the training data. Let $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ be a training set, where $N$ is the number of instances. We let $y_i$ be 1 if $x_i$ is positive, or −1 if $x_i$ is negative. In this paper, we only discuss binary classification problems. We use CART as weak classifiers [54]. As shown in Figure 1, a CART is a decision tree whose leaves output the classification results and whose inner nodes split the tree to minimize its error rate. In Figure 1, $x_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,n}]$ is the feature vector of instance $x_i$ [54], and the inner nodes of the example tree test $x_{i,2} < 3.5$ and $x_{i,3} < 10$. For an instance $x_i$, its weak hypothesis $f_t(x_i)$ is determined by $j$, the index of the partition that $x_i$ belongs to.
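The way a CART maps an instance to a partition index can be sketched as follows. This is only an illustration: the two thresholds are the ones shown in Figure 1, but the tree layout (which test is the root and which branch leads to which leaf) and the function name `partition_index` are assumptions made for the example.

```python
def partition_index(x):
    """Return the index j of the leaf (partition) that feature vector x falls into.

    x is a 0-based Python list, so x[1] is the feature x_{i,2} and x[2] is x_{i,3}.
    The tree layout below is a hypothetical arrangement of the two tests in Figure 1.
    """
    if x[1] < 3.5:      # assumed root node: test on x_{i,2}
        return 0
    if x[2] < 10:       # assumed second inner node: test on x_{i,3}
        return 1
    return 2


# A tree with two inner nodes has three leaves, hence three partitions.
print(partition_index([0.0, 1.0, 50.0]))   # falls into the first leaf
print(partition_index([0.0, 4.0, 12.0]))   # falls into the last leaf
```

Each weak hypothesis then only has to store one output value per partition.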
To tune the parameter of Parameterized AdaBoost, the training and generalization errors on the Gamma Telescope data set, which includes nearly 20000 instances, were measured for the parameter values {0.1, 0.3, 0.5, 0.7, 0.9}. The value 0.5 showed the best overall performance, so the parameter is set to 0.5 [10].
The difference between Real and Parameterized AdaBoost is the weight updating policy. In Step (2)(c), Parameterized AdaBoost adds a parameter and an absolute-value term to emphasize the instances whose margins are near 0 [10].

Margin-Pruning Boost. Margin-pruning Boost was designed to decrease the influence of noise-like instances.
This approach is described as follows.
(2) Do the following tasks for $t = 1, 2, \ldots, T$.
(c) Update the weights of instances.
(d) Set a threshold for round $t+1$. For any instance $x_i$, if its weight $w_{i,t+1}$ is larger than this threshold, reset $w_{i,t+1} = 1$ and reset the sum of its weak hypotheses to 0.
Margin-pruning Boost restrains the weight increase of potential noise instances by resetting their weights and summed weak hypotheses [14]. Here, resetting the sum of weak hypotheses keeps the weights of noise-like instances small. In addition, the resetting means that the combination of these weak hypotheses (the current strong hypothesis) is set to 0 for these noise-like instances, because it cannot classify them correctly.
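The resetting in Step (2)(d) can be sketched as below. The threshold is passed in as a plain argument because its defining equation is not reproduced here; the function name and the list-based data layout are assumptions for illustration only.

```python
def margin_pruning_reset(weights, summed_hypotheses, threshold):
    """Reset instances whose weight exceeds the threshold.

    weights[i]            -- weight of instance i after the weight update
    summed_hypotheses[i]  -- running sum of weak hypotheses for instance i
    threshold             -- threshold for the next round (computed elsewhere)
    Returns new lists; the inputs are left unchanged.
    """
    new_weights, new_sums = [], []
    for w, f_sum in zip(weights, summed_hypotheses):
        if w > threshold:
            new_weights.append(1.0)   # weight is reset to 1
            new_sums.append(0.0)      # strong hypothesis for this instance is reset to 0
        else:
            new_weights.append(w)
            new_sums.append(f_sum)
    return new_weights, new_sums


w, f = margin_pruning_reset([0.5, 7.0, 2.0], [1.2, -3.0, -0.4], threshold=5.0)
print(w, f)
```

Only the second instance exceeds the threshold in this toy call, so only its weight and summed hypothesis are reset.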

Penalized AdaBoost.
Penalized AdaBoost is an extension of Margin-pruning Boost. It introduces a margin distribution to penalize the misclassification of small-margin instances. Before introducing Penalized AdaBoost, we first explain the classification margins. The classification margin of an instance $x_i$ is the difference between the prediction confidence of the weak hypotheses that provide a correct classification and that of the weak hypotheses that lead to misclassification [7]. It is in the range [−1, 1], and the instance $x_i$ is correctly classified if and only if its margin is positive [7]. The margin of instance $x_i$ is defined in [10] in terms of $f_t(x_i)$, the weak hypothesis for instance $x_i$ at round $t$, and $T$, the total number of iterations. Next we explain Penalized AdaBoost by the following algorithm.
(b) Calculate a margin feedback factor for each instance.
(e) For each instance $x_i$, if its weight $w_{i,t+1}$ is larger than the threshold of round $t+1$ and $\mathrm{Margin}_t(x_i) < 0$, reinitialize instance $x_i$. Then normalize the weights by letting $\sum_i w_{i,t+1} = 1$ [15].
Then we will analyze the generalization abilities of the six variants in the next section.
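As a rough illustration of the margin described above, the sketch below assumes the usual normalized form, $y_i \sum_t f_t(x_i) / \sum_t |f_t(x_i)|$, which lies in [−1, 1] and is positive exactly when the summed hypothesis has the same sign as the label. The precise expression used in this paper is the one given in [10]; the function name is hypothetical.

```python
def normalized_margin(y_i, hypotheses):
    """Margin of one instance under the assumed normalized definition.

    y_i         -- label of the instance, +1 or -1
    hypotheses  -- list of weak hypothesis values f_t(x_i), one per round
    """
    total_confidence = sum(abs(f) for f in hypotheses)
    if total_confidence == 0:
        return 0.0                      # no confidence accumulated yet
    return y_i * sum(hypotheses) / total_confidence


# Two rounds vote for the positive class, one votes against it.
print(normalized_margin(+1, [0.5, -0.2, 0.7]))
```

Correctly voting rounds add to the numerator and incorrectly voting rounds subtract from it, which matches the verbal description of the margin as a difference of prediction confidences.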

Generalization Ability Analysis.
In this section, we analyze the generalization abilities of the six AdaBoost variants by comparing their weak hypotheses and weight updating policies.

Real and Gentle AdaBoost.
The difference between Real and Gentle AdaBoost is how they calculate their weak hypotheses. Real AdaBoost computes the weak hypothesis by minimizing the upper bound of training error in each loop [8]. Gentle AdaBoost calculates its weak hypothesis by optimizing the weighted least square error iteratively [11].
Real AdaBoost tries to decrease the training error, whereas Gentle AdaBoost aims at reducing the variance of its weak hypotheses. Thus, in most cases, Real AdaBoost converges faster than Gentle AdaBoost in training, but Gentle AdaBoost is more stable than Real AdaBoost with respect to the generalization error.
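The contrast between the two weak hypotheses can be illustrated with the standard per-partition forms from [8] and [11]: half the log-ratio of class weights for Real AdaBoost, and the weighted mean label for Gentle AdaBoost. The function names and the small smoothing constant `eps` are assumptions added for the sketch.

```python
import math


def real_hypothesis(w_pos, w_neg, eps=1e-12):
    """Real AdaBoost output for one partition: 0.5 * ln(W+ / W-).

    eps is a hypothetical smoothing term that avoids division by zero.
    """
    return 0.5 * math.log((w_pos + eps) / (w_neg + eps))


def gentle_hypothesis(w_pos, w_neg, eps=1e-12):
    """Gentle AdaBoost output for one partition: weighted least-squares fit,
    i.e. the weighted mean label (W+ - W-) / (W+ + W-)."""
    return (w_pos - w_neg) / (w_pos + w_neg + eps)


# A nearly pure partition: the Real output grows without bound as the
# partition becomes purer, while the Gentle output stays within [-1, 1].
print(real_hypothesis(0.99, 0.01), gentle_hypothesis(0.99, 0.01))
```

The bounded Gentle output is one way to read the remark that Gentle AdaBoost is the more stable of the two.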

Real and Parameterized AdaBoost.
Comparing Step (2)(c) in Real AdaBoost with that in Parameterized AdaBoost, we find that the weight updating policies of the two variants are different. From (12), we know that the training error converges to 0 if and only if the margins of all training instances are increased to be positive. In Step (2)(c) of Real AdaBoost, instances with small margins obtain more weight so that they are more likely to be classified correctly in future iterations. However, because Real AdaBoost focuses on instances with small margins, it may increase the number of instances whose margins are near 0; that is, these instances change the sign of their margins back and forth in the boosting process. Here we call them "swinging instances." These swinging instances slow the convergence of the training error. To decrease the number of swinging instances, Parameterized AdaBoost introduces a parameter and an absolute-value term into the weight update to emphasize the instances whose margins are near 0 [10].
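The effect of the Real AdaBoost weight update on small-margin instances can be sketched as follows, assuming the standard exponential update of Schapire and Singer [8], in which each weight is multiplied by $\exp(-y_i f_t(x_i))$ and the weights are then normalized. Step (2)(c) of this paper may differ in detail, and the update of Parameterized AdaBoost is not reproduced here.

```python
import math


def real_adaboost_update(weights, labels, hypotheses):
    """One round of the assumed exponential weight update, followed by normalization."""
    updated = [w * math.exp(-y * f) for w, y, f in zip(weights, labels, hypotheses)]
    total = sum(updated)
    return [w / total for w in updated]


weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
hypotheses = [0.8, -0.1, -0.8, 0.1]   # instances 1 and 3 are narrowly misclassified
print(real_adaboost_update(weights, labels, hypotheses))
```

The narrowly misclassified instances end up with the largest weights, so the next weak classifier concentrates on them; if it overcorrects, their margins can flip sign again, which is the swinging behaviour described above.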

Gentle and Margin-Pruning Boost.
At each round of Gentle AdaBoost, the weights of misclassified instances are increased, whereas the weights of correctly classified instances are decreased. As a result, the weights of difficult-to-classify instances can become too large. If these instances are noise data or outliers, the performance of the final strong classifier will be degraded. To solve this problem, Margin-pruning Boost utilizes a threshold to filter instances whose weights are too large and then resets their weights to 1. Margin-pruning Boost reduces the influence of noise-like instances effectively in the early training phase by restraining the weight increase of filtered instances [14]. However, as the number of iterations increases, the weights of instances filtered by thresholding become smaller and smaller. In the late training phase, the weights of these filtered instances are probably smaller than 1. In that case, resetting their weights to 1 actually increases the influence of these instances. Thus, the performance of Margin-pruning Boost drops when the number of loops increases.
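The late-phase problem can be made concrete with a small numerical sketch. The weight values are made up for illustration, and "influence" is taken here to mean an instance's share of the total weight, which is an assumption of this example rather than a definition from the paper.

```python
def influence_share(weights, i):
    """Share of the total weight held by instance i."""
    return weights[i] / sum(weights)


def share_after_reset(weights, i):
    """Share held by instance i after its weight is reset to 1."""
    reset = list(weights)
    reset[i] = 1.0
    return influence_share(reset, i)


early = [8.0, 1.0, 1.0]   # hypothetical early phase: filtered weight is far above 1
late = [0.2, 0.1, 0.1]    # hypothetical late phase: all weights are below 1

print(influence_share(early, 0), share_after_reset(early, 0))
print(influence_share(late, 0), share_after_reset(late, 0))
```

In the early-phase example the reset shrinks the instance's share, while in the late-phase example the same reset enlarges it.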

Margin-Pruning Boost and Penalized AdaBoost.
Penalized AdaBoost is an improvement of Margin-pruning Boost. First, it introduces a margin feedback factor to assign higher prediction confidence to weak hypotheses that correctly classify small-margin instances. From (13), (14), and (15), we can see that $(1 - \overline{W}_{-}^{j})$ and $(1 - \overline{W}_{+}^{j})$ are proportional to the margins, and they are computed from the sum of margin feedback factors of misclassified instances. This means that misclassifying small-margin instances leads to small $(1 - \overline{W}_{-}^{j})$ and $(1 - \overline{W}_{+}^{j})$. Therefore, the prediction confidence of weak hypotheses that misclassify small-margin instances is reduced. Compared with Gentle AdaBoost and Margin-pruning Boost, Penalized AdaBoost highlights the more competent weak hypotheses. Therefore, it is more robust than the other two variants. Modest AdaBoost highlights the more competent weak hypotheses in some cases but downplays them in other cases. By contrast, Penalized AdaBoost attaches importance to these more competent weak hypotheses under all circumstances. Thus it is more stable than Modest AdaBoost.
Furthermore, Penalized AdaBoost solves the problem of Margin-pruning Boost by utilizing a more adaptive thresholding method. Like Margin-pruning Boost, Penalized AdaBoost uses thresholding to filter the large-weight instances. However, it only resets the weights of filtered instances with negative margins. This technique guarantees that the reset weights are always smaller than the original ones. Penalized AdaBoost does not completely exclude these noise-like instances because they are not definitely noise. Nevertheless, it keeps their weights small to reduce their influence on the final strong classifier. Thereby, it has better generalization ability than Margin-pruning Boost.
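The difference between the two filtering rules can be sketched by listing which instances each rule selects for resetting. Only the selection conditions stated in the text are modelled; the threshold is an input, the reset values themselves are not reproduced, and both function names are hypothetical.

```python
def margin_pruning_filter(weights, threshold):
    """Indices selected by Margin-pruning Boost: every instance with a large weight."""
    return [i for i, w in enumerate(weights) if w > threshold]


def penalized_filter(weights, margins, threshold):
    """Indices selected by Penalized AdaBoost: large weight AND negative margin."""
    return [i for i, (w, m) in enumerate(zip(weights, margins))
            if w > threshold and m < 0]


weights = [0.5, 3.0, 4.0]
margins = [0.2, 0.4, -0.3]
print(margin_pruning_filter(weights, 2.0))      # selects both heavy instances
print(penalized_filter(weights, margins, 2.0))  # keeps only the misclassified one
```

The heavy but correctly classified instance is left untouched by the second rule, which is the more adaptive behaviour described above.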

Margin Distribution Comparison.
In this section, we compare the generalization abilities of the six variants by analyzing their margin distributions. Li and Shen showed that optimizing the minimal margin of training data plays little role in improving the generalization ability [18]. However, enlarging the whole margin distribution to obtain a balance between the training error and complexity is crucial to the generalization ability [18]. Here we use three kinds of CART as weak classifiers to evaluate the six AdaBoost variants. The three kinds of CART are CART-1 (CART with one inner node), CART-2 (CART with two inner nodes), and CART-3 (CART with three inner nodes). To get the cumulative margin distributions, for each data set, we use 2/3 of its data to train the final strong classifiers. Figure 2 shows the cumulative margin distributions based on CART-1 on the data set German at iteration 200. Figure 3 shows the corresponding generalization errors on the same data set. In Figure 2, Penalized AdaBoost enlarges the whole margin distribution more than the other variants, so it achieves the best generalization error in Figure 3. Real AdaBoost, Gentle AdaBoost, Parameterized AdaBoost, and Margin-pruning Boost perform similarly on the margins, so their generalization errors are also similar when the number of iterations reaches 200. The margin curve of Modest AdaBoost in Figure 2 is not smooth. That may explain why its generalization error in Figure 3 does not change gradually.
Margin distributions based on CART-2 on the same data set at iteration 200 are shown in Figure 4. Figure 5 shows the corresponding generalization errors. From Figures 4 and 5, we notice that the generalization abilities of the six variants are consistent with their performance on the margins. We also evaluate margin distributions using CART-3 on the data set German. Margin curves at iterations 10, 100, and 1000 are shown in Figures 6, 7, and 8, respectively.
Comparing the three figures, we find that the margins are enlarged gradually as the number of iterations increases. Furthermore, we notice that Margin-pruning Boost outperforms Gentle AdaBoost in Figures 6 and 7. However, it performs worse than Gentle AdaBoost in Figure 8. This demonstrates that the performance of Margin-pruning Boost drops as the number of iterations increases. Unlike Margin-pruning Boost, Penalized AdaBoost outperforms the others in most cases. Thus it is the most robust and stable. Figure 9 shows the generalization errors of the six variants using CART-3 as their weak classifiers. In Figure 9, Margin-pruning Boost obtains lower generalization errors than Gentle AdaBoost before iteration 500. Unfortunately, it overfits severely after iteration 500. Figures 10 and 11 show margin distributions on other data sets. From these margin curves, we can conclude that Penalized AdaBoost generally outperforms the other five variants in enlarging the whole margin distribution. Real and Gentle AdaBoost perform very similarly, and Parameterized AdaBoost is slightly worse than Real AdaBoost when it uses CART-2 and CART-3. Margin-pruning Boost is better than Gentle AdaBoost if the number of iterations is small. The margin curves of Modest AdaBoost are not smooth, which may lead to unstable generalization errors.
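A cumulative margin distribution of the kind plotted in the margin figures can be computed as in the sketch below: for each threshold, it reports the fraction of training instances whose margin is at most that threshold. The function name and the toy margins are illustrative and are not taken from the experiments.

```python
def cumulative_margin_distribution(margins, thresholds):
    """Fraction of instances with margin <= each threshold."""
    n = len(margins)
    return [sum(1 for m in margins if m <= th) / n for th in thresholds]


margins = [-0.5, 0.1, 0.3, 0.8]        # hypothetical margins of four instances
thresholds = [-1.0, 0.0, 0.5, 1.0]
print(cumulative_margin_distribution(margins, thresholds))
```

A variant that enlarges the whole margin distribution shifts this curve to the right, so its value at any fixed threshold is lower.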

Experiments
In this section, we compare the six AdaBoost variants using 25 binary classification data sets from UCI [55]. For every data set, we used the Matlab AdaBoost Toolbox [54] and 3-fold cross validation. First we measure the generalization errors (estimated by the classification error on the test set) of the six variants based on CART-1. Table 1 summarizes the results of the six variants using CART-1 at iteration 200. Tables 2 and 3 show their generalization errors using CART-1 at iterations 500 and 800, respectively. We also compare the generalization errors of the six variants based on CART-2 and CART-3. Table 4 shows the comparison results using CART-2, and Table 5 compares the six variants using CART-3. In Tables 1, 2, 3, 4, and 5, RAB, GAB, MAB, PAAB, MPB, and PAB denote Real AdaBoost, Gentle AdaBoost, Modest AdaBoost, Parameterized AdaBoost, Margin-pruning Boost, and Penalized AdaBoost, respectively. The row VS.GAB shows, for each variant, the sum of its generalization errors minus that of Gentle AdaBoost. The bold values show the best performance, and No.Best is the number of best generalization errors. No.To.GAB denotes the number of data sets on which Gentle AdaBoost is outperformed by the variant. From Tables 1, 2, and 3, we can conclude that Real, Gentle, and Parameterized AdaBoost perform similarly when using CART-1 as weak classifiers. Modest AdaBoost performs worse than the other variants in most cases. Moreover, its error rates rarely change even when the number of loops increases. Comparing No.Best of Margin-pruning Boost in Tables 1, 2, and 3, we find that its performance drops as the number of iterations increases. We can also see from VS.GAB, No.Best, and No.To.GAB in Tables 1, 2, and 3 that Penalized AdaBoost generally outperforms the other variants.
Comparing Tables 1, 4, and 5, we can conclude that increasing the number of inner nodes of CART is important for reducing the generalization errors. In Tables 4 and 5, we can see that Gentle AdaBoost is slightly better than Real AdaBoost. However, the performance of Parameterized AdaBoost and Margin-pruning Boost drops sharply. This means the two variants are more suitable for CART-1. On the other hand, Modest AdaBoost using CART-2 or CART-3 performs better than that using CART-1. This suggests that Modest AdaBoost is suitable for CART with more inner nodes. From all tables, we notice that the performance of Gentle and Penalized AdaBoost is degraded neither by the number of inner nodes in CART nor by the number of iterations. Nevertheless, Penalized AdaBoost shows stronger robustness than Gentle AdaBoost.
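The three summary rows used in the tables (VS.GAB, No.Best, and No.To.GAB) can be computed from a table of per-data-set errors roughly as follows. The error values are invented purely to exercise the code and are not results from this paper; counting ties as "best" for every tied variant is an assumption of the sketch.

```python
def summarize(errors, baseline="GAB"):
    """errors maps a variant name to its list of generalization errors, one per data set."""
    n = len(errors[baseline])
    best = [min(errors[v][d] for v in errors) for d in range(n)]
    summary = {}
    for variant, errs in errors.items():
        summary[variant] = {
            # sum of this variant's errors minus the baseline's sum
            "VS.GAB": sum(errs) - sum(errors[baseline]),
            # number of data sets where this variant attains the lowest error
            "No.Best": sum(1 for d in range(n) if errs[d] == best[d]),
            # number of data sets where this variant beats the baseline
            "No.To.GAB": sum(1 for d in range(n) if errs[d] < errors[baseline][d]),
        }
    return summary


toy_errors = {                      # hypothetical numbers, three data sets
    "GAB": [0.10, 0.20, 0.30],
    "PAB": [0.08, 0.20, 0.25],
    "MAB": [0.15, 0.22, 0.24],
}
for name, row in summarize(toy_errors).items():
    print(name, row)
```

A negative VS.GAB together with a large No.To.GAB indicates a variant that improves on Gentle AdaBoost across data sets rather than on a single one.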

Conclusion
This paper analyzes the generalization abilities of six AdaBoost variants mathematically. The novel contributions of our work are listed as follows.
(1) There are many comparison studies of AdaBoost variants. However, we compare three recently proposed variants (Parameterized AdaBoost, Margin-pruning Boost, and Penalized AdaBoost) with three traditional variants (Real, Gentle, and Modest AdaBoost). This kind of comparison is new in machine learning studies.
(2) Unlike conventional comparison works that draw conclusions from experimental results, we analyze the generalization abilities of the six variants by comparing their classification margins.
(3) We design experiments to verify our analyses. The experimental results are consistent with our analyses.
In general, the analyses and comparison in this paper are useful for researchers who want to improve classification performance by switching to a new AdaBoost variant. In our current research, we focus on two-class classification problems. In our future work, we want to extend our analyses to multiclass classification problems. In addition, we will compare more kinds of weak classifiers, such as SVM and ANN, to find out which kind of weak classifier is suitable for which AdaBoost variant.

Figure 1: An example of classification and regression tree.

(c) Compute the weak hypothesis for every partition
, the factor $(1 - \overline{W}_{+}^{j})$ assigns higher prediction confidence to weak hypotheses that correctly classify small-margin instances. At the same time, the factor $(1 - \overline{W}_{-}^{j})$ reduces the prediction confidence for weak hypotheses misclassifying small-margin instances. In this case, Modest AdaBoost outperforms Gentle AdaBoost. However, if $W_{+}^{j}(1 - \overline{W}_{+}^{j})$ is smaller than $W_{-}^{j}(1 - \overline{W}_{-}^{j})$ in the case $W_{+}^{j} > W_{-}^{j}$, the sign of the weak hypothesis will be negative. This means the factor $(1 - \overline{W}_{-}^{j})$ reduces the prediction confidence for weak hypotheses that correctly classify small-margin instances. Meanwhile, $(1 - \overline{W}_{+}^{j})$ increases the prediction confidence of weak hypotheses misclassifying small-margin instances. In this case, Modest AdaBoost performs far worse than Gentle AdaBoost.
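The sign flip described above can be reproduced with a small sketch. It assumes the weak hypothesis takes the form given in the original Modest AdaBoost proposal [12], namely the positive class weight times its penalty factor minus the negative class weight times its penalty factor; the argument names and the numerical values are illustrative assumptions.

```python
def modest_hypothesis(w_pos, w_neg, w_pos_inv, w_neg_inv):
    """Assumed Modest AdaBoost output for one partition.

    w_pos, w_neg          -- class weights in the partition under the current distribution
    w_pos_inv, w_neg_inv  -- the corresponding weights under the inverted distribution
    """
    return w_pos * (1 - w_pos_inv) - w_neg * (1 - w_neg_inv)


# Both calls use a partition dominated by positive weight (0.6 > 0.4).
favourable = modest_hypothesis(0.6, 0.4, w_pos_inv=0.2, w_neg_inv=0.5)
unfavourable = modest_hypothesis(0.6, 0.4, w_pos_inv=0.9, w_neg_inv=0.1)
print(favourable, unfavourable)
```

In the second call the penalty factor on the positive side is so small that the output becomes negative even though the positive class holds more weight, which is the failure case discussed in the text.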