We describe and analyze a simple and effective two-step online boosting algorithm that allows us to utilize highly effective gradient descent-based methods developed for online SVM training without the need to fine-tune the kernel parameters, and we demonstrate its efficiency in several experiments. Our method is similar to AdaBoost in that it trains additional classifiers according to the weights provided by previously trained classifiers, but unlike AdaBoost, we utilize hinge loss rather than exponential loss and modify the algorithm for the online setting, allowing for a varying number of classifiers. We show that our theoretical convergence bounds are similar to those of earlier algorithms, while allowing for greater flexibility. Our approach can also easily incorporate additional nonlinearity in the form of Mercer kernels, although our experiments show that this is not necessary in most situations. The pre-training of the additional classifiers in our algorithm allows for greater accuracy while reducing the runtimes associated with the usual kernel-based approaches. We compare our algorithm to other online training algorithms and show that, in most cases with unknown kernel parameters, our algorithm outperforms the others in both runtime and convergence speed.

In recent years, there has been increasing interest in online learning methods. Such methods are useful not only for settings where a limited number of samples is fed sequentially into a training system, but also for systems where the amount of training data is too large to fit into memory. In particular, several methods for online boosting and online support vector machine (SVM) training have been proposed. However, those methods have several limitations. The online boosting algorithms, such as [

Our method uses Pegasos, a stochastic gradient descent- (SGD-) based SVM training method introduced in [

The resulting algorithm is simple and has low computational and storage costs, which allow it to be easily incorporated into real-time applications.

Our algorithm is closely related to algorithms used for SVM training and boosting. These algorithms, taken together, can be separated into several classes:

Support vector machines are a class of linear or kernel-based binary classifiers that attempt to maximize the minimal distance (margin) between the members of each class and the separating surface. Such maximization was shown to give near-optimal levels of generalization error [

The main method used as a basis for our algorithm is Pegasos [

Boosting is a meta-algorithm for supervised learning that combines several weak classifiers, that is, classifiers that can label examples only slightly better than random guessing, into a single strong classifier with arbitrary classification accuracy. One of the most successful and well-known examples is AdaBoost [

The remainder of this article is organized as follows. In Section

Support vector machines are a useful classification tool, developed by Cortes and Vapnik [

Illustration for support vector machine.

Several methods exist for solving both primal and dual formulations of the SVM optimization problem. In this article, we employ the primal estimated subgradient solver (Pegasos) method, as described in [

The primal form of the SVM problem is a minimization problem with an added regularization term. Formally, for a reproducing kernel Hilbert space
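For reference, the standard regularized hinge-loss objective minimized by Pegasos-style solvers can be written as follows. This is the textbook form; the exact formulation and constants in the original article may differ.

```latex
\min_{w}\; \frac{\lambda}{2}\,\lVert w\rVert^{2}
  \;+\; \frac{1}{m}\sum_{i=1}^{m} \ell\bigl(w;(x_i,y_i)\bigr),
\qquad
\ell\bigl(w;(x,y)\bigr) = \max\bigl\{0,\; 1 - y\,\langle w, x\rangle\bigr\},
```

where $\lambda > 0$ is the regularization parameter and $(x_i, y_i)$, $y_i \in \{-1, +1\}$, are the training sample-label pairs.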

In the online setting, only one sample-label pair is available at a time. Several algorithms with well-established convergence bounds exist for this problem. Among them, methods using stochastic gradient descent (or, in the case of the hinge-loss function, subgradient descent) are the most prominent. In this paper, we choose the Pegasos algorithm for its rapid convergence and simplicity of implementation. Here, we present only the algorithm with our modifications for sample weighting; for a detailed analysis, please refer to the original article [

On each iteration, the Pegasos algorithm is given the iteration number

The only difference from [
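A single weighted Pegasos update can be sketched as follows. This is a minimal reconstruction of the standard Pegasos step with a per-sample weight multiplying the hinge subgradient; the function name and the exact placement of the weight are our assumptions, not taken verbatim from the paper.

```python
import numpy as np

def pegasos_step(w, x, y, t, lam, sample_weight=1.0):
    """One Pegasos (stochastic subgradient) update on sample (x, y).

    t is the 1-based iteration number, lam the regularization
    parameter, and sample_weight the weight assigned to the sample
    (sample_weight=1.0 recovers plain Pegasos).
    """
    eta = 1.0 / (lam * t)              # decaying learning rate
    margin = y * np.dot(w, x)          # margin before the update
    w = (1.0 - eta * lam) * w          # shrinkage from the regularizer
    if margin < 1.0:                   # hinge loss is active
        w = w + eta * sample_weight * y * x
    # optional projection onto the ball of radius 1/sqrt(lam),
    # as in the original Pegasos analysis
    norm = np.linalg.norm(w)
    radius = 1.0 / np.sqrt(lam)
    if norm > radius:
        w = w * (radius / norm)
    return w
```

With `sample_weight` set from the boosting weights, the same routine serves both phases of the algorithm described below.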

AdaBoost [

Like most learning algorithms of its time, AdaBoost was designed for offline (batch) training, with the error rate estimated over all available samples. There are, therefore, several difficulties involved in employing boosting as an online training algorithm, several of which were mentioned in [

In our work, we note that the expression for the confidence function of the strong classifier in AdaBoost
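The similarity referred to here can be made explicit. In our notation (the symbols below are ours), the strong classifier's confidence is linear in the vector of weak-classifier outputs:

```latex
F(x) \;=\; \sum_{j=1}^{N} \alpha_j\, h_j(x)
       \;=\; \bigl\langle \alpha,\; h(x) \bigr\rangle,
\qquad
h(x) = \bigl(h_1(x), \dots, h_N(x)\bigr),
```

which has the same form as a linear classifier $\langle w, x\rangle$ applied to the feature vector $h(x)$. The boosting weights $\alpha$ can therefore themselves be trained as a linear SVM over the weak-classifier outputs.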

Using the above similarity, we separate the boosting process into two phases: training each successive weak classifier, and adjusting the boosting weights that combine the weak classifiers into a strong one. Both phases can then be cast as a primal SVM problem, the same as (

Using this notation, the algorithm is initialized with the following data:

Input: Iteration numbers

Pegasos (

Pegasos (

First, the input vector is classified by each of the weak classifiers already added to the pool, forming the vector

Next, the objective function

After the weak classifier is trained, the boosting weights are adjusted using the vector

As the last step of an iteration, we calculate the cutoff parameter for adding a new classifier. If the preset threshold is reached (in this case, the number of training iterations
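The per-iteration steps above can be sketched end to end as follows. This is a minimal illustration under our own assumptions: the class and method names, the misclassification-based weighting rule, and the fixed iteration-count cutoff are ours, and weak classifiers are kept linear for simplicity.

```python
import numpy as np

class TwoPhaseOnlineBooster:
    """Sketch of the two-phase online boosting scheme described above.

    Both phases use Pegasos-style subgradient steps: phase 1 trains the
    newest weak classifier on the (weighted) sample, and phase 2 adjusts
    the boosting weights over the weak-classifier outputs.
    """

    def __init__(self, dim, lam=0.1, add_every=500):
        self.lam = lam
        self.add_every = add_every    # iterations before adding a classifier
        self.weak = [np.zeros(dim)]   # pool of linear weak classifiers
        self.alpha = np.zeros(1)      # boosting weights
        self.t = 0                    # iteration counter

    def _h(self, x):
        # Vector of weak-classifier outputs on input x.
        return np.sign([np.dot(w, x) for w in self.weak])

    def predict(self, x):
        # Strong classifier: sign of the boosting-weighted weak outputs.
        return np.sign(np.dot(self.alpha, self._h(x)))

    def _pegasos_step(self, w, x, y, weight=1.0):
        # One weighted Pegasos subgradient step on sample (x, y).
        eta = 1.0 / (self.lam * self.t)
        margin = y * np.dot(w, x)
        w = (1.0 - eta * self.lam) * w
        if margin < 1.0:
            w = w + eta * weight * y * x
        return w

    def partial_fit(self, x, y):
        self.t += 1
        h = self._h(x)
        # Phase 1: train the newest weak classifier; samples the current
        # strong classifier gets wrong receive a larger weight (assumed rule).
        weight = 1.0 if y * np.dot(self.alpha, h) >= 1.0 else 2.0
        self.weak[-1] = self._pegasos_step(self.weak[-1], x, y, weight)
        # Phase 2: adjust the boosting weights over the weak outputs.
        self.alpha = self._pegasos_step(self.alpha, self._h(x), y)
        # Cutoff: add a fresh weak classifier once the threshold is reached.
        if self.t % self.add_every == 0:
            self.weak.append(np.zeros(len(x)))
            self.alpha = np.append(self.alpha, 0.0)
```

Each call to `partial_fit` consumes one sample-label pair, and `predict` is available after every iteration, matching the online setting of the paper.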

Each iteration of the algorithm produces a strong classifier that can be used to obtain the class of a vector

It is also important to note that while parameters

The process of convergence is illustrated in Figure

Illustration of the convergence process: (a) the first linear classifier trained (

Each phase of our training algorithm attempts to solve the primal formulation of the SVM (

The changing weights of the sample vectors in the first phase, as well as the increasing size of the classifier pool, can easily be recast as a moving classification goal, described in [

However, the bounds mentioned in [

In this section, we discuss several possible extensions of our algorithm. For example, as can be easily noticed, while parameter

Also, to increase adaptability to changing input conditions, it is possible to change the calculation of the learning rate of the Pegasos algorithm, for example, stopping its decay on a certain threshold. However, such experiments are outside the scope of this paper.

We compare our algorithm to both Pegasos [

We use several artificial datasets with known distributions and separation properties, and a Forest Covertype dataset (separating class 5 from other classes), originally used in [

For each distribution, we measure the decrease of the estimated error rate over the training dataset (the estimated error being simply the number of misclassified training samples divided by the number of iterations), and the resulting error rate over the testing dataset (generated without noise in the case of noisy distributions). For the dataset with a changing distribution, the distribution at the last iteration is used for testing.
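The running error estimate defined above can be computed as follows; this is a trivial sketch, and the function name is ours.

```python
def estimated_error(predictions, labels):
    """Running online error estimate: at iteration t, the number of
    samples misclassified so far divided by t."""
    errors = 0
    curve = []
    for t, (p, y) in enumerate(zip(predictions, labels), start=1):
        errors += (p != y)
        curve.append(errors / t)
    return curve
```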

For all experiments, the parameters of our algorithm were fixed, with

The graphs for the estimated error rate are shown in Figure

Error rate over the testing dataset, O: our algorithm, PL: linear Pegasos, NL: linear NORMA, PG: Pegasos using Gaussian RBF kernel with

Dataset | O | PL | NL | PG | NG |
---|---|---|---|---|---|
Linear | 0.04 | 0.03 | 0.09 | 0.07 | 0.1 |
Linear + noise | 0.02 | 0.002 | 0.005 | 0.02 | 0.08 |
Bayesian | 0.03 | 0.17 | 0.22 | 0.08 | 0.15 |
Drifting | 0.04 | 0.28 | 0.35 | 0.15 | 0.18 |
Switching | 0.11 | 0.20 | 0.21 | 0.32 | 0.13 |
Covertype | 0.19 | 0.2 | 0.35 | 0.43 | 0.48 |

Experimental results. (a) Linearly separable dataset, linear NORMA and Pegasos; (b) exact Bayes-separable dataset, Pegasos and NORMA using an RBF kernel; (c) Bayes-separable dataset with drifting distribution parameters, Pegasos and NORMA using an RBF kernel, with our algorithm demonstrating remarkable adaptability to changing classification targets; (d) Bayes-separable dataset with distribution parameters switched every 1000 iterations, Pegasos and NORMA using an RBF kernel; (e) Covertype dataset, Pegasos and NORMA using an RBF kernel (do not converge); (f) Covertype dataset, linear Pegasos and NORMA.

For the Covertype dataset, linear classifiers work best and approach the error rates reported in [

It is also important to note that, compared to kernel-based SVM algorithms, our algorithm is much more efficient in terms of both computing and storage requirements, since the number of weak classifiers, each requiring only a single inner product calculation, is much lower than the number of kernel expansion terms produced by both NORMA and Pegasos for the same accuracy levels. For example, in the test shown in Figure

We have shown how a combination of boosting and online SVM training creates an algorithm that outperforms standard training algorithms when the kernel parameters are not known, and in general allows for creating more efficient classifiers than simple kernel expansion. The drawbacks of our algorithm include the fact that it is not as efficient for linearly separable problems, and that it inherits from Pegasos some sensitivity to rapid changes in the target function.

In the future, we plan to study the application of our algorithm to various image and signal processing tasks, in particular to the object tracking problem, to compare its effectiveness with methods based on various online AdaBoost modifications, such as those described in [