Online Boosting Algorithm Based on Two-Phase SVM Training

We describe and analyze a simple and effective two-step online boosting algorithm that allows us to utilize highly effective gradient descent-based methods developed for online SVM training without the need to fine-tune the kernel parameters, and we demonstrate its efficiency in several experiments. Our method is similar to AdaBoost in that it trains additional classifiers according to the weights provided by previously trained classifiers, but unlike AdaBoost, we utilize the hinge loss rather than the exponential loss and modify the algorithm for the online setting, allowing for a varying number of classifiers. We show that our theoretical convergence bounds are similar to those of earlier algorithms, while allowing for greater flexibility. Our approach may also easily incorporate additional nonlinearity in the form of Mercer kernels, although our experiments show that this is not necessary in most situations. The pre-training of the additional classifiers in our algorithm allows for greater accuracy while reducing the running times associated with the usual kernel-based approaches. We compare our algorithm to other online training algorithms and show that, in most cases with unknown kernel parameters, our algorithm outperforms them both in runtime and in convergence speed.


Online Training Methods.
In recent years, there has been increasing interest in online learning methods. Such methods are useful not only in settings where a limited number of samples is fed sequentially into a training system, but also for systems where the amount of training data is too large to fit into memory. In particular, several methods for online boosting and online support vector machine (SVM) training have been proposed. However, those methods have several limitations. Online boosting algorithms, such as [1] or [2], are usually limited to a fixed number of classifiers, while online SVM training methods employing Mercer kernels [3,4] may not converge well when an inappropriate kernel is chosen. Furthermore, kernel-based SVMs usually have significant storage and computational requirements due to the large number of kernel expansion terms. In this paper, we exploit the similarity between the mathematical descriptions of boosting and support vector machines to create a two-stage online boosting algorithm with a variable number of classifiers, which can exhibit greater flexibility than kernel-based SVMs while having smaller computational costs.
Our method uses Pegasos, a stochastic gradient descent (SGD) based SVM training method introduced in [4], to produce both a set of weak classifiers and the boosting weights for combining them into a strong classifier. Using this kind of training algorithm allows us to rely on the solid theoretical background and well-defined convergence rate of SGD algorithms, as well as on the increased accuracy of the classifier produced by boosting. We utilize a simplified variant of Pegasos with the parameter $k$ set to 1 in both stages. Our algorithm uses three parameters: $\lambda_H$ and $\lambda$, corresponding to the regularization parameters in the SGD algorithm, and an additional parameter $r$ that defines how long each additional classifier is trained and may be defined in several ways.
The resulting algorithm is simple and has low computational and storage costs, which allows it to be easily incorporated into real-time applications.

1.2. Related Work.
Our algorithm is closely related to algorithms used for SVM training and boosting. These algorithms, taken together, can be separated into several classes.

1.2.1. Support Vector Machines.
Support vector machines are a class of linear or kernel-based binary classifiers that attempt to maximize the minimal distance (margin) between each member of a class and the separating surface. Such maximization was shown to give near-optimal levels of generalization error [5]. In most cases, when dealing with kernels, the task of learning a support vector machine is cast as a constrained quadratic programming problem. Methods that deal with such a problem usually need access to all labeled samples at once and require about $O(m^2)$ operations, where $m$ is the number of samples. Several approaches to the solution exist, such as interior point methods [6] and decomposition methods such as SMO [7]. Recently, interest in gradient-based methods for solving the primal SVM problem has also risen sharply. These methods exhibit a convergence rate independent of the number of samples, which is particularly useful for large datasets, and need only one or several samples for a single iteration, which lends itself well to the online setting.
The main method used as a basis for our algorithm is Pegasos [4], a modified SGD method with an added projection step, although more generic algorithms such as NORMA [3] can easily be used in its place.

Boosting Algorithms.
Boosting is a meta-algorithm for supervised learning that combines several weak classifiers, that is, classifiers that can label examples only slightly better than random guessing, into a single strong classifier with arbitrary classification accuracy. One of the most successful and well-known examples is AdaBoost [8] and its variants, such as LPBoost. The convergence properties of AdaBoost have been carefully analyzed [9], and it has been used for problems such as text recognition, filtering, face recognition, and feature selection. Boosting algorithms that employ a linear combination of weak classifiers to form the confidence function of the strong classifier were shown to be closely related to the primal formulation of support vector machines [10]. As in the case of SVMs, many boosting methods were designed for the offline setting, where all of the training examples are given a priori, and they share the same set of problems when dealing with larger datasets. Recently, there have been several successful approaches [1,2] designed to shift boosting to the online setting; however, they assume that the number of weak classifiers in the boosting process stays the same, which limits the flexibility of these methods.

Outline of the Paper.
The remainder of this article is organized as follows. In Section 2, we give the description of our algorithm. In Section 3, we compare it to several existing online training methods in terms of computational cost and flexibility. Finally, we present our conclusion and further directions of research.

Algorithm Description
2.1. Overview of SVM and Pegasos Algorithm.
Support vector machines are a useful classification tool, developed by Cortes and Vapnik [11], that attempts to construct a hyperplane separating the data with the largest distance to the nearest point of any class (see Figure 1). The original formulation assumed the data to be linearly separable, with no noise. Later, a loss term was added to account for noisy data. The most widely used loss function for SVM training is the hinge loss:

$$L(f(x), y) = \max(0, \rho - y f(x)), \qquad (1)$$

where $\rho$ is called the margin parameter, $y$ is the label of a sample, and $f$ is the classification function of the SVM, sometimes also used as a confidence measure. For linear SVMs, $f$ is a simple inner product of the input and weight vectors, while for SVMs using the kernel trick for nonlinear classification,

$$f(x) = \sum_i \alpha_i k(x_i, x) + b, \qquad (2)$$

where $k(\cdot, \cdot)$ is a kernel function satisfying Mercer's condition and $b$ is a bias term. Several methods exist for solving both the primal and the dual formulations of the SVM optimization problem. In this article, we employ the primal estimated subgradient solver (Pegasos) method, as described in [4], with some modifications.
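The primal form of the SVM problem is a minimization problem with an added regularization term. Formally, for a reproducing kernel Hilbert space $H$ with kernel $k : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, and a set of vectors $X \subset \mathbb{R}^n$ with corresponding labels $y : X \to \{-1; 1\}$, find the function $f$ such that

$$f = \arg\min_{f \in H} \frac{\lambda}{2} \|f\|_H^2 + \frac{1}{|X|} \sum_{x \in X} L(f(x), y(x)). \qquad (3)$$

Algorithm 1: The Pegasos algorithm with weighted samples.

In the linear case, the function $f$ can be represented as $f(x) = \langle \alpha, x \rangle$, where $\langle \cdot, \cdot \rangle$ denotes an inner product, and the problem becomes

$$\alpha = \arg\min_{\alpha} \frac{\lambda}{2} \|\alpha\|^2 + \frac{1}{|X|} \sum_{x \in X} \max(0, \rho - y(x) \langle \alpha, x \rangle). \qquad (4)$$

In our work, we also incorporate a bias term $b$ by substituting $x$ with $x_e = \{x, 1\}$. In this case, $\alpha \in \mathbb{R}^{n+1}$. To simplify notation, we assume that all input vectors have been extended in such a way, and simply use $x$.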
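As a small illustration of the linear case with the bias extension, the following Python sketch evaluates the objective function and the hinge loss for one sample. It assumes the standard hinge-loss form max(0, ρ − y f(x)); the function names and the numeric values are hypothetical and chosen only for illustration.

```python
def extend_with_bias(x):
    """Return x_e = {x, 1}: the input with a constant 1 appended for the bias."""
    return list(x) + [1.0]


def decision_value(alpha, x_ext):
    """Linear objective f(x) = <alpha, x_e> on a bias-extended input."""
    return sum(a * v for a, v in zip(alpha, x_ext))


def hinge_loss(f_value, y, rho=1.0):
    """Hinge loss max(0, rho - y * f(x)) with margin parameter rho (assumed form)."""
    return max(0.0, rho - y * f_value)


alpha = [0.5, -0.25, 0.1]              # hypothetical weights; last element is the bias
x_ext = extend_with_bias([2.0, 4.0])   # [2.0, 4.0, 1.0]
f = decision_value(alpha, x_ext)
print(hinge_loss(f, +1), hinge_loss(f, -1))
```

A sample whose value of y f(x) is at least ρ contributes zero loss, which is what later allows confidently classified samples to be ignored during weak classifier training.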
In the online setting, only one of the sample-label pairs is available at a time. Several algorithms with well-established convergence bounds exist for this problem. Among them, methods using stochastic gradient descent (or, in the case of the hinge-loss function, subgradient descent) are the most prominent. In this paper, we choose the Pegasos algorithm for its rapid convergence and simplicity of implementation. Here, we only present the algorithm with our modifications for sample weighting; for a detailed analysis, please refer to the original article [4].
On each iteration, the Pegasos algorithm is given the iteration number $t$, the regularization coefficient $\lambda$ (regulating how "soft" the resulting margin is, i.e., whether priority is given to a low error rate over the training set or to a larger margin between classes), and a sample vector $x$ with weight $w$ and label $y$, and it updates the vector $\alpha$ as shown in Algorithm 1.
The only difference from [4] is the addition of the weighting term $w$, the significance of which will be explained below. The number of simultaneous input samples $k$, used in the original Pegasos algorithm, is taken to be 1, reducing Pegasos to an SGD algorithm with an additional projection step.
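The weighted update can be sketched in Python as follows. This is a minimal sketch and not the authors' listing: it follows the standard Pegasos step with k = 1 (learning rate 1/(λt), subgradient step, projection onto the ball of radius 1/√λ) and assumes that the sample weight w simply scales the loss subgradient. The function name and the default margin of 1 are assumptions.

```python
import math


def pegasos_step(alpha, x, y, w, t, lam, rho=1.0):
    """One weighted Pegasos update with k = 1 on a bias-extended sample.

    alpha: current weight vector, x: input vector, y: label in {-1, +1},
    w: non-negative sample weight (assumed to scale the loss subgradient),
    t: 1-based iteration number, lam: regularization coefficient.
    """
    eta = 1.0 / (lam * t)                          # decaying learning rate
    margin = y * sum(a * v for a, v in zip(alpha, x))
    shrink = 1.0 - eta * lam                       # regularization shrinkage
    new_alpha = [shrink * a for a in alpha]
    if margin < rho:                               # sample violates the margin
        new_alpha = [a + eta * w * y * v for a, v in zip(new_alpha, x)]
    # Projection step: keep the weights inside the ball of radius 1/sqrt(lam).
    norm = math.sqrt(sum(a * a for a in new_alpha))
    radius = 1.0 / math.sqrt(lam)
    if norm > radius:
        new_alpha = [a * radius / norm for a in new_alpha]
    return new_alpha
```

With w = 0 the sample leaves the weights unchanged apart from the regularization shrinkage, which is the property the boosting stage relies on.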

Overview of AdaBoost.
AdaBoost [8] is an algorithm that iteratively adds weighted weak classifiers $h_i$ to obtain a strong classifier $H$. Each additional classifier is selected from the pool of available classifiers to minimize the weighted error rate over the training samples. This error rate is also used to calculate the weight of $h_i$ in $H$ and to update the training sample weights so that the next classifier favors samples misclassified by the current strong classifier and ignores the ones classified correctly.
Like most learning algorithms of that time, AdaBoost was designed for offline (batch) training, with the error rate estimated over all available samples. There are, therefore, several difficulties involved in employing boosting as an online training algorithm, several of which were mentioned in [1]. The solution in [1] was to use a limited number of meta-classifiers, called selectors, each selecting the single weak classifier with the least estimated error from a pool, and to combine them into $H$. Both the weak classifiers and the selectors were updated on each iteration. The limited number of selectors and features allowed for a simplified online algorithm. That algorithm, however, had two potential problems: the limited number of selectors limited the flexibility of the classifier, and the effect of constant updates on the error rate of the weak classifiers was never addressed.
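In our work, we note that the confidence function of the strong classifier in AdaBoost can be expressed in the same form as the objective function $f$ of a linear SVM:

$$F(h) = \langle \beta, h \rangle,$$

where $\beta$ is the vector of boosting weights and $h$ is the vector of weak classifier outputs. Using this notation, the algorithm starts from a pool containing a single weak classifier ($T = 1$), and every newly added classifier is initialized with zero weights. Assuming a simplified cutoff criterion for weak classifier training, a single iteration of the algorithm takes the form illustrated in Algorithm 2.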
First, the input vector is classified by each of the weak classifiers already added to the pool, forming the vector $h$ of classifier outputs. This pool has $T_i$ classifiers in it, starting from a single classifier on the first iteration, and an additional bias expansion term, as described in Section 2.1.
Next, the objective function $F$ and the loss function $L$ for the strong classifier are calculated, given the boosting weights $\beta$. The loss function is then used as a weight for training the weights of the latest weak classifier, $\alpha_T$, with the Pegasos algorithm. It can be seen that, similarly to the AdaBoost family of boosting algorithms, our algorithm increases the weights of samples misclassified by the current strong classifier and decreases the weights of samples classified correctly. In fact, due to the use of the hinge-loss function for weight calculation, samples classified with high confidence by the already existing classifiers have no effect on the weights $\alpha_T$. This allows the creation of classifiers that compensate for the errors introduced by the previous ones, and eventual convergence to the true distribution of the samples.
[Algorithm 2. Input: iteration numbers $i$ and $i_w$, $x$, $y$, $\lambda$ and $\lambda_H$, vectors ...]
After the weak classifier is trained, the boosting weights are adjusted, using the vector $h$ as an input. The use of the Pegasos algorithm guarantees that, eventually, the weights converge arbitrarily close to the optimum described by (4).
As the last step of an iteration, we evaluate the cutoff criterion for adding a new classifier. If the preset threshold is reached (in this case, when the number of training iterations $i_w$ reaches $r$), the value of $T$ is increased, and a new classifier is initialized with zero weights.
Each iteration of the algorithm produces a strong classifier that can be used to get the class of a vector $x$:

$$H(x) = \mathrm{sign}(F(h(x))) = \mathrm{sign}(\langle \beta, h(x) \rangle).$$

There are several key differences between our algorithm and SVMs using Mercer kernels as described in [3,4]. The first and possibly most important one is that the number of weak classifiers $T$ is much smaller than the number of input samples $i$. While the algorithm described in [3] allows truncation of the kernel expansion coefficients, this is only applicable to SGD algorithms with a constant learning rate, and it results in an accuracy penalty. The second difference is the update of the vector $\beta$, which allows changes in the weights other than the exponential decay of [3]. As our experiments in Section 3 show, these differences allow our algorithm to achieve better accuracy while being less resource intensive. It is also important to note that while the parameters $\lambda$, $\lambda_H$, and $r$ somewhat affect the convergence rate, they are independent of the form of the class-separating surface. That is, unlike the kernel methods, whose convergence rate and resulting accuracy both depend heavily on the type and parameters of the selected kernel, our method converges similarly to AdaBoost, with the resulting accuracy depending mainly on the regularization parameters and the resulting number of weak classifiers $T$. This is shown in Section 3, where our algorithm runs with the same set of parameters on datasets with different variable distributions and often outperforms even the kernel-based algorithms with ideal kernel settings.
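The iteration described above (weak classifier outputs, loss-weighted training of the latest weak classifier, boosting weight update, cutoff check) can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' Algorithm 2: the class and attribute names are hypothetical, the Pegasos step uses the standard k = 1 form with the sample weight scaling the subgradient, the hinge loss uses a margin of 1, the boosting weights start at zero, and the vector h computed at the start of the iteration is reused for the boosting weight update.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def sign(v):
    return 1.0 if v >= 0.0 else -1.0


def pegasos_step(alpha, x, y, w, t, lam):
    """Weighted Pegasos update with k = 1 (assumed form, see the lead-in)."""
    eta = 1.0 / (lam * t)
    out = [(1.0 - eta * lam) * a for a in alpha]
    if y * dot(alpha, x) < 1.0:
        out = [a + eta * w * y * v for a, v in zip(out, x)]
    norm = math.sqrt(dot(out, out))
    radius = 1.0 / math.sqrt(lam)
    return [a * radius / norm for a in out] if norm > radius else out


class TwoPhaseBooster:
    def __init__(self, dim, lam=0.02, lam_h=0.03, r=150):
        self.lam, self.lam_h, self.r = lam, lam_h, r
        self.dim = dim + 1                    # inputs are extended with a bias element
        self.alphas = [[0.0] * self.dim]      # pool of weak classifiers, T = 1
        self.beta = [0.0, 0.0]                # T boosting weights plus a bias weight
        self.i = 0                            # global iteration counter
        self.i_w = 0                          # iterations spent on the latest weak classifier

    def _outputs(self, x_ext):
        """Vector h: outputs of all weak classifiers plus the bias element."""
        return [sign(dot(a, x_ext)) for a in self.alphas] + [1.0]

    def predict(self, x):
        """Strong classifier H(x) = sign(<beta, h(x)>)."""
        return sign(dot(self.beta, self._outputs(list(x) + [1.0])))

    def update(self, x, y):
        x_ext = list(x) + [1.0]
        self.i += 1
        self.i_w += 1
        h = self._outputs(x_ext)
        loss = max(0.0, 1.0 - y * dot(self.beta, h))   # hinge loss of the strong classifier
        # Phase 1: train only the latest weak classifier, weighted by the loss.
        self.alphas[-1] = pegasos_step(self.alphas[-1], x_ext, y, loss, self.i_w, self.lam)
        # Phase 2: adjust the boosting weights, using h as the input vector.
        self.beta = pegasos_step(self.beta, h, y, 1.0, self.i, self.lam_h)
        # Cutoff: after r iterations, add a new weak classifier with zero weights.
        if self.i_w >= self.r:
            self.alphas.append([0.0] * self.dim)
            self.beta.insert(-1, 0.0)         # new boosting weight, placed before the bias weight
            self.i_w = 0
```

Only the last weight vector in the pool changes during phase 1, so the cost of one iteration is T inner products for h plus two Pegasos steps.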

Discussion on the Proposed Method.
The process of convergence is illustrated in Figure 2 for the case of a two-dimensional dataset, where the data is classified according to whether it lies inside the unit circle about the origin. The parameters used for this illustration are $\lambda = 0.02$, $\lambda_H = 0.02$, and $r = 50$. As can be seen, the original separation is quite poor, since a linear classifier cannot converge with such a data distribution; however, as new weak classifiers are added, the separation achieved by the strong classifier becomes closer and closer to the true data distribution.
Each phase of our training algorithm tries to solve the primal formulation of the SVM (3) using the Pegasos algorithm, which has a convergence rate of $O(R^2/(\lambda\epsilon))$, where $R$ is a bound on the norm of the input vector and $\epsilon$ is the desired accuracy. It is easy to see that, for the second phase, $R = T$, where $T$ is the number of added classifiers, each producing an output $h_i \in \{-1; 1\}$, so the convergence rate slowly decreases as additional classifiers are added. To combat this, classifiers with low weights may be removed from the pool. This has the small additional effect of increasing the ability of the algorithm to adapt to a changing classification target.
The changing weights of the sample vectors in the first phase, as well as the increasing size of the classifier pool, can easily be recast as a moving classification target, as described in [3]. According to [3], movement of the target introduces an accuracy penalty that is approximately linear in the total distance traveled by the target, which suggests that in our case the convergence should be slower than that of an SVM training algorithm using kernel parameters appropriate for the sample distribution. This is an additional reason why we update only a single weak classifier rather than all of them, since otherwise the drift would be proportional to the number of classifiers added, significantly penalizing the convergence rate.
However, the bounds mentioned in [3] are not tight, and it is not clear how they are altered by the additional projection step in Pegasos. Given, in addition, that in most real-time applications the exact form of the data distribution is not known, we have decided to compare algorithm performance using experimental data.

Possible Extensions.
In this section, we discuss several possible extensions of our algorithm. For example, as can easily be noticed, while the parameter $T$ is much smaller than the number of kernel expansion terms, it still grows with additional samples, which may lead to a loss of effectiveness and to overfitting. There are several possible extensions that may help to avoid this. One way is to remove the classifiers $\alpha_i$ for which $|\beta_i|$ stayed below a small threshold for several iterations in a row. This will also allow the algorithm to adapt better to a changing distribution. The other way is to increase the parameter $r$ depending on $T$, or to choose a different cutoff algorithm altogether.
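The pruning idea can be sketched in Python as follows. The function name, the threshold value, and the patience parameter (the number of consecutive low-weight iterations before removal) are hypothetical choices; the sketch also assumes that the newest classifier is never removed, since it is still being trained.

```python
def prune_pool(alphas, beta, low_counts, threshold=1e-3, patience=5):
    """Drop weak classifiers whose boosting weight stayed small for `patience` iterations.

    alphas: list of T weak classifier weight vectors,
    beta: T boosting weights followed by one bias weight,
    low_counts: T counters of consecutive iterations with |beta_i| < threshold.
    Returns the pruned (alphas, beta, low_counts).
    """
    last = len(alphas) - 1
    kept_alphas, kept_beta, kept_counts = [], [], []
    for idx, (alpha, b, count) in enumerate(zip(alphas, beta, low_counts)):
        count = count + 1 if abs(b) < threshold else 0
        if count >= patience and idx != last:
            continue                          # remove this classifier from the pool
        kept_alphas.append(alpha)
        kept_beta.append(b)
        kept_counts.append(count)
    kept_beta.append(beta[-1])                # the bias weight is always kept
    return kept_alphas, kept_beta, kept_counts
```

Calling this once per iteration keeps the counters up to date, so a classifier is removed only after its weight has been negligible for the whole patience window.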
Also, to increase adaptability to changing input conditions, it is possible to change the calculation of the learning rate of the Pegasos algorithm, for example, by stopping its decay at a certain threshold. However, such experiments are outside the scope of this paper.
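One way to stop the decay of the learning rate is to clamp it from below, as in the short Python sketch here. The floor value `eta_min` is a hypothetical parameter; the unclamped rate 1/(λt) is the standard Pegasos schedule.

```python
def learning_rate(t, lam, eta_min=0.0):
    """Pegasos learning rate 1 / (lam * t), not allowed to decay below eta_min."""
    return max(1.0 / (lam * t), eta_min)
```

With `eta_min = 0` this reduces to the usual schedule; with a positive floor the update keeps a constant step size after roughly t = 1/(λ · eta_min) iterations.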

Experiments
3.1. Description.
We compare our algorithm to both Pegasos [4] and NORMA [3], implemented in MATLAB for both the linear and the kernel-based case. The experiments were run on an AMD Phenom X4 965, with only one core used for calculations. We perform several experiments, aiming to compare the generalization error and convergence rates over different datasets, as well as the ability of the algorithm to adapt to a distribution with changing parameters (flexibility).
We use several artificial datasets with known distributions and separation properties, and the Forest Covertype dataset (separating class 5 from the other classes), originally used in [12] and also used for the comparison of convergence speed in [4]. The artificial datasets are generated according to the following distributions.
(1) High-dimensional linearly separable data (Linear). A random hyperplane is created in 50-dimensional space. Data points are generated randomly on both the positive and the negative side of the hyperplane. Data points too close to the hyperplane are filtered out.
(3) Bayes-separable data (Bayesian). This dataset is generated as described in [3], that is, in such a way that the data is clearly separable using the ideal Bayesian classifier for the known class distributions.
(4) Bayes-separable data with moving distribution (Drifting). As in (3), but the parameters of the distribution are changed slightly on each iteration, simulating target movement. This experiment estimates the ability of the algorithms to adapt to gradual changes in the data distribution.
(5) Bayes-separable data with switching distribution (Switching). Once again, a dataset generated according to the description in [3], with the distribution changed drastically every 1000 iterations. This experiment shows the ability of the algorithms to completely relearn a distribution.
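As an illustration of the Linear dataset, the following Python sketch draws a random hyperplane and rejects points that fall too close to it. The sampling ranges, the offset range, the minimum distance, and the function name are assumptions made for this sketch; the text only specifies a random hyperplane in 50 dimensions and the filtering of near-boundary points.

```python
import random


def make_linear_dataset(n_samples, dim=50, min_distance=0.1, seed=0):
    """Generate linearly separable samples with a gap around a random hyperplane."""
    rng = random.Random(seed)
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(v * v for v in normal) ** 0.5
    normal = [v / norm for v in normal]        # unit normal of the hyperplane
    offset = rng.uniform(-0.5, 0.5)            # hypothetical offset range
    samples = []
    while len(samples) < n_samples:
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        distance = sum(n * v for n, v in zip(normal, x)) - offset  # signed distance
        if abs(distance) < min_distance:
            continue                           # too close to the hyperplane: filter out
        samples.append((x, 1 if distance > 0 else -1))
    return samples, normal, offset
```

The returned list of `(x, y)` pairs can be fed one pair at a time to an online learner, which matches the sequential setting used in the experiments.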
For each distribution, we measure the decrease of the estimated error rate over the training dataset (the estimated error being simply the number of misclassified training samples divided by the number of iterations) and the resulting error rate over the testing dataset (generated without noise in the case of a noisy distribution). For the datasets with a changing distribution, the distribution at the last iteration is used for testing.
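The estimated error rate defined above is a running ratio, which can be computed as in this short Python sketch. The function name is hypothetical, and the sketch assumes that each sample is counted as misclassified or not once, at the iteration in which it is presented.

```python
def estimated_error_curve(mistakes):
    """Running error estimate: misclassified samples so far / iterations so far.

    mistakes: iterable of booleans, True when the sample presented at that
    iteration was misclassified.
    """
    curve, errors = [], 0
    for i, wrong in enumerate(mistakes, start=1):
        errors += int(wrong)
        curve.append(errors / i)
    return curve
```

For example, the sequence wrong, right, right, wrong gives the curve 1, 1/2, 1/3, 1/2.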
For all experiments, the parameters of our algorithm were fixed, with $\lambda = 0.02$, $\lambda_H = 0.03$, and the cutoff parameter $r = 150$. Pegasos and NORMA used the parameter $\lambda = 0.02$ and either a linear kernel or a Gaussian RBF kernel with $\gamma = 0.01$, which is the same value of $\gamma$ that was used for generating the Bayes-separable datasets.

Experimental Results.
The graphs of the estimated error rate are shown in Figure 3, while the resulting error rates on the test datasets are shown in Table 1. It can be seen that for linearly separable problems our algorithm performs on par with the Pegasos algorithm, with a slight increase in the error rate, possibly due to overfitting. Compared with the kernel-based methods, however, our algorithm usually outperforms both NORMA and Pegasos unless the exact kernel parameters are used, and even then (see Figure 3(b)) our algorithm performs slightly better in the long run. It is interesting to note that, for the switching dataset, NORMA actually outperforms Pegasos by a considerable margin, indicating that the Pegasos algorithm is more sensitive to rapid changes in the classification target, most likely due to the rapid decay of the learning rate with time, while our algorithm was largely able to compensate, demonstrating the stability of the method under changing conditions.
For the Covertype dataset, linear classifiers work best and approach the error rates indicated in [12], with Pegasos and our algorithm giving virtually the same results. It is also important to note that, when compared to kernel-based SVM algorithms, our algorithm is much more efficient in terms of both computing and storage requirements, since the number of weak classifiers, each requiring only a single inner product calculation, is much lower than the number of kernel expansion terms produced by both NORMA and Pegasos for the same accuracy levels. For example, in the test shown in Figure 3(c), the resulting number of kernel expansion terms was over 5000 after 10000 iterations (for both NORMA and Pegasos), while the number of weak classifiers generated by our method was only 67; that is, both the memory and the computational requirements (per classification) were lower by a factor of around 75.

Conclusion and Future Work
We have shown how a combination of boosting and online SVM training creates an algorithm that outperforms standard training algorithms when the kernel parameters are not known and that, in general, allows the creation of more efficient classifiers than a simple kernel expansion. The drawbacks of our algorithm include the fact that it is not as efficient for linearly separable problems and that it inherits from Pegasos some of the sensitivity to rapid changes in the target function.
In the future, we plan to study the application of our algorithm to various image and signal processing tasks, in particular to the object tracking problem, in order to compare its effectiveness to methods based on various online AdaBoost modifications, like the ones described in [1].

Figure 1: Illustration for support vector machine.

Figure 2: Illustration of the convergence process: (a) the first linear classifier trained (i = 50, T = 1), (b) several additional weak classifiers added (i = 250, T = 5), (c) the data separation corresponding to the set of classifiers in (b), (d) the data separation achieved after convergence (i = 5000, T = 100). The background color shows the class generated by the classifier; the shape and color of the data points show the actual label y.
Here, $h$ is a vector that consists of the outputs provided by a set of $T$ weak classifiers, $h_i(x) \in \{-1; 1\}$. That means that an SVM training algorithm can be applied for training the weights $\beta$, with the same guarantees for the convergence rate and the generalization error bound shown in [5].

2.3. Resulting Algorithm.
Using the above similarity, we separate the boosting process into two phases: the training of each successive weak classifier, and the adjustment of the boosting weights that combine the weak classifiers into a strong one. Both phases can then be cast as a primal SVM problem, the same as (4), with the difference that in phase 1 (weak classifier training) each sample is weighted according to the result provided by the current strong classifier. To reduce the amount of calculation, we have chosen the loss function $L$ of the strong classifier $H$ as the weight, although other weighting solutions are possible. Also, unlike [1], we only train the last weak classifier of the set rather than all of them.

$x$: An input sample vector. A single vector is provided for each iteration of the algorithm. Sample vectors are assumed to be extended with an additional constant element representing the bias term.
$y$: The label provided for $x$, $y \in \{-1; 1\}$.
$f_t = f_t(x)$: Objective function of the $t$-th weak classifier.
$F = F(h)$: Objective function of the strong classifier.
$L$: A loss function for the combined strong classifier. In this work we use the hinge-loss function (see (1)).
$h_t$: Output value of a weak classifier, $h_t = \mathrm{sign}(f_t)$.
$\beta$: Boosting coefficients, used for combining several weak classifiers into a stronger one. With the bias term, the number of elements in $\beta$ is $T + 1$.
$H$: A combined strong classifier.

Table 1: Error rate over the testing dataset. O: our algorithm, PL: linear Pegasos, NL: linear NORMA, PG: Pegasos using a Gaussian RBF kernel with γ = 0.01, NG: NORMA with the same kernel.