Single Directional SMO Algorithm for Least Squares Support Vector Machines

Working set selection is a major step in decomposition methods for training least squares support vector machines (LS-SVMs). In this paper, a new technique for selecting the working set in sequential minimal optimization (SMO) type decomposition methods is proposed. With the new method, a single direction is selected to drive the optimality condition to convergence. A simple asymptotic convergence proof for the new algorithm is given. Experimental comparisons demonstrate that the classification accuracy of the new method differs little from that of existing methods, while its training speed is faster.


Introduction
In a classification problem, we consider a set of training samples, that is, the input vectors $\{x_i\}_{i=1}^{N}$ along with the corresponding class labels $\{y_i\}_{i=1}^{N}$. Our task is to find a deterministic function that best represents the relation between input vectors and class labels. For classification or forecasting problems in machine learning, the support vector machine (SVM) has been adopted in many applications because of its high precision [1][2][3][4]. SVMs require the solution of a quadratic programming problem. Another successful method for machine learning is the least squares support vector machine (LS-SVM) [5]. Instead of solving a quadratic programming problem as in SVMs, LS-SVMs obtain the solution from a set of linear equations. Many algorithms have been proposed for training LS-SVMs: Suykens et al. proposed an iterative algorithm based on conjugate gradient (CG) algorithms [6]; Ferreira et al. presented a gradient system which can train the LS-SVM model effectively [7]; Chua introduced efficient computations for large least squares support vector machine classifiers [8]; Chu et al. improved the efficiency of the CG algorithm by using one reduced system of linear equations [9]; Keerthi and Shevade extended the sequential minimal optimization (SMO) algorithm to solve the linear equations in LS-SVMs, where the maximum violating pair (MVP) was selected as the working set [10]; based on the idea of the SMO algorithm, Bo et al. presented an improved method for working set selection by using functional gain (FG) [11]; Jian et al. designed a multiple kernel learning algorithm for LS-SVMs by convex programming [12]; and so on. These numerical algorithms are computationally attractive. Empirical comparisons show that the SMO algorithm is more efficient than the CG algorithm for large-scale datasets.
Fast SVM training with the SMO algorithm is an important goal for practitioners, and many proposals toward it have been given in the literature. Initially, Platt presented two heuristics that resulted in a somewhat cumbersome selection [13]. Later, Keerthi et al. introduced the concept of a violating pair to denote two coefficients that cause a violation of the KKT optimality conditions of the dual, and suggested always selecting the pair that violates them the most, that is, the maximum violating pair (MVP) [14]. Finally, Fan et al. proposed a second order selection that usually results in faster training than the MVP rule [15]. These improvements decrease the computational expense of the SMO algorithm, but some concrete updating patterns are still selected repeatedly during sequential minimal optimization. These repeated patterns are called training cycles. Barbero et al. studied their presence from a geometrical point of view [16]. They pointed out that the training cycles can be partially collapsed into a single updating vector that gives better directions toward the optimum. Exploiting training cycles in this way can reduce the number of iterations and kernel operations of the SMO algorithm.
Computational Intelligence and Neuroscience
Inspired by Barbero et al. [16], we present a single directional SMO algorithm for LS-SVMs, abbreviated as the SD-SMO algorithm. In the optimization procedure, an adaptive objective function is selected, and single directional steps are taken for the Lagrangian multipliers, which lessens the number of training cycles and further reduces the iterations and kernel operations of the SMO algorithm. Experiments show that the SD-SMO algorithm significantly reduces the training time for LS-SVMs, while its testing accuracy differs little from that of the traditional SMO algorithm.
The rest of this paper has the following structure. In the next section, LS-SVMs are briefly reviewed. In Section 3, SD-SMO algorithm for LS-SVMs is provided and the convergence of the improved algorithm is proved theoretically. Based on standard datasets, computational experiments describing the effectiveness of the improved algorithm are presented in Section 4. Finally, Section 5 is devoted to concluding remarks.

LS-SVM
In this section, we concisely review the basic principles of LS-SVMs. Given a training dataset of $N$ points $\{x_i, y_i\}_{i=1}^{N}$ with input data $x_i \in \mathbb{R}^n$ and output data $y_i \in \mathbb{R}$, we consider the following optimization problem in primal weight space:
$$\min_{w, b, e} J(w, e) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 \tag{1}$$
such that
$$y_i - \left(w^T \varphi(x_i) + b\right) = e_i, \quad i = 1, 2, \ldots, N, \tag{2}$$
where $\gamma$ is a regularization factor, $e_i$ is the difference between the desired output and the actual output, and $\varphi(\cdot)$ is a nonlinear function mapping the data points into a high-dimensional Hilbert space; in addition, the dot product in the high-dimensional space is equivalent to a positive-definite kernel function $k(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$.
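In primal weight space, a linear classifier in the new space takes the following form:
$$y(x) = \operatorname{sign}\left(w^T \varphi(x) + b\right). \tag{3}$$
The weight vector $w$ may be infinite dimensional; hence, using (1) to find the solution is impossible in general. To overcome this problem, we compute the model in the dual space instead of the primal space. Let $b = 0$; that is, the simple problem without a bias term is considered in this paper, as in the paper by Keerthi and Shevade [10]. The Lagrangian for the simple problem is
$$L(w, e; \alpha) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 + \sum_{i=1}^{N} \alpha_i \left(y_i - w^T \varphi(x_i) - e_i\right), \tag{4}$$
where the $\alpha_i$ are the Lagrangian multipliers. The KKT conditions for optimality are
$$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{N} \alpha_i \varphi(x_i), \qquad \frac{\partial L}{\partial e_i} = 0 \Rightarrow \alpha_i = \gamma e_i, \qquad \frac{\partial L}{\partial \alpha_i} = 0 \Rightarrow y_i - w^T \varphi(x_i) - e_i = 0. \tag{5}$$
After elimination of $w$ and $e$, we obtain the following linear system:
$$\left(K + \frac{1}{\gamma} I\right) \alpha = y, \tag{6}$$
where $y = [y_1, \ldots, y_N]^T$, $\alpha = [\alpha_1, \ldots, \alpha_N]^T$, $I$ is the identity matrix, and $K \in \mathbb{R}^{N \times N}$ is the kernel matrix with entries $K_{ij} = k(x_i, x_j)$. By solving the linear system (6), the $\alpha_i$ are obtained; hence, LS-SVM greatly simplifies the problem. The resulting LS-SVM model for function estimation is
$$y(x) = \sum_{i=1}^{N} \alpha_i k(x, x_i). \tag{7}$$
For the choice of the kernel function $k(\cdot, \cdot)$, there are several possibilities, for example, the linear kernel $k(x, x_i) = x_i^T x$, the polynomial kernel $k(x, x_i) = (x_i^T x + 1)^d$, and the radial basis function (RBF) kernel $k(x, x_i) = \exp\{-\|x - x_i\|_2^2 / 2\sigma^2\}$. In the sequel, we focus on the RBF LS-SVM. When solving large linear systems, iterative methods should be applied to (6), as introduced by Jiao et al. [17]. The speed of convergence depends on the condition number of the matrix in (6), which is influenced by the choice of $(\gamma, \sigma)$ in the case of the RBF LS-SVM. In the following section, we discuss the SMO-type algorithms and give the proof of convergence for the SD-SMO algorithm.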
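As an illustration of the linear system (6), the following Python sketch builds a Gaussian kernel matrix, solves $(K + I/\gamma)\alpha = y$ directly, and evaluates the resulting kernel expansion. It is a minimal sketch under stated assumptions, not the implementation used in this paper: the function names (`rbf_kernel`, `lssvm_fit`, `lssvm_predict`), the toy data, and the values of `gamma` and `sigma2` are hypothetical, and the system is assumed to take the standard bias-free form written above.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """Gaussian kernel matrix with entries exp(-||a - b||^2 / (2 * sigma2))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma2))

def lssvm_fit(X, y, gamma, sigma2):
    """Solve (K + I/gamma) alpha = y for the bias-free LS-SVM (assumed form of (6))."""
    K = rbf_kernel(X, X, sigma2)
    return np.linalg.solve(K + np.eye(len(y)) / gamma, y)

def lssvm_predict(X_train, alpha, X_new, sigma2):
    """Evaluate y(x) = sum_i alpha_i k(x, x_i) at the rows of X_new."""
    return rbf_kernel(X_new, X_train, sigma2) @ alpha

# Hypothetical toy data: two classes separated by the line x1 + x2 = 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + X[:, 1])
y[y == 0] = 1.0

gamma, sigma2 = 10.0, 1.0            # arbitrary illustrative values
alpha = lssvm_fit(X, y, gamma, sigma2)
labels = np.sign(lssvm_predict(X, alpha, X, sigma2))
```

A direct solve costs on the order of $N^3$ operations and needs the full kernel matrix in memory, which is why it is only practical for small problems and why iterative and decomposition methods are of interest for large ones.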

SMO and SD-SMO Algorithms for LS-SVM
For solving the LS-SVM problem, the matrix in (6) is usually fully dense and may be too large to be stored. Decomposition methods are designed to handle these difficulties; see Jiao et al. [17]. Unlike other optimization algorithms, which update the whole vector of Lagrangian multipliers $\alpha$ in each iteration, a decomposition algorithm modifies only a subset of $\alpha$ per iteration. We denote this subset as the working set $B$. The SMO algorithm was developed in [10] as a decomposition method to solve the dual problems arising in LS-SVM formulations. In each iteration, the SMO algorithm restricts $B$ to have only two elements. Because problem (4) has no bias term $b$, SMO can be simplified to optimize only one element of $\alpha$ per iteration. By substituting the KKT conditions (5) into the Lagrangian (4), the dual problem is to maximize the following objective function:
$$L(\alpha) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \tilde{K}(x_i, x_j) + \sum_{i=1}^{N} \alpha_i y_i, \tag{8}$$
where $\tilde{K}(x_i, x_j) = k(x_i, x_j) + \delta_{ij}/\gamma$, and $\delta_{ij} = 1$ if $i = j$ and 0 otherwise.
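The SMO algorithm for (8) is sketched in the following.
(1) Choose an initial point $\alpha^1$ and set $k = 1$.
(2) If the stop criterion is satisfied, stop. If not, find a one-element working set $B = \{i\} \subset \{1, \ldots, N\}$. Define $\bar{B} = \{1, \ldots, N\} \setminus B$ and $\alpha_B^k$ and $\alpha_{\bar{B}}^k$ to be the subvectors of $\alpha^k$ corresponding to $B$ and $\bar{B}$, respectively.
(3) Solve the following subproblem with the variable $\alpha_B$:
$$\max_{\alpha_B} \; -\frac{1}{2} \begin{bmatrix} \alpha_B^T & (\alpha_{\bar{B}}^k)^T \end{bmatrix} \begin{bmatrix} \tilde{K}_{BB} & \tilde{K}_{B\bar{B}} \\ \tilde{K}_{\bar{B}B} & \tilde{K}_{\bar{B}\bar{B}} \end{bmatrix} \begin{bmatrix} \alpha_B \\ \alpha_{\bar{B}}^k \end{bmatrix} + \begin{bmatrix} y_B^T & y_{\bar{B}}^T \end{bmatrix} \begin{bmatrix} \alpha_B \\ \alpha_{\bar{B}}^k \end{bmatrix}, \tag{9}$$
where $\begin{bmatrix} \tilde{K}_{BB} & \tilde{K}_{B\bar{B}} \\ \tilde{K}_{\bar{B}B} & \tilde{K}_{\bar{B}\bar{B}} \end{bmatrix}$ is a permutation of the matrix $\tilde{K}$.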
In order to find the working set $B$, we usually check whether the KKT conditions are violated. The KKT conditions for the dual problem (8) are $\partial L / \partial \alpha_i = 0$, where $L$ denotes the objective function in (8), which lead to $y_i - \sum_j \alpha_j \tilde{K}(x_i, x_j) = 0$, $i = 1, 2, \ldots, N$. If we define
$$f_i = y_i - \sum_{j=1}^{N} \alpha_j \tilde{K}(x_i, x_j), \tag{10}$$
then the KKT optimality condition is violated if there exists any index $i$ such that $f_i \neq 0$. The SMO algorithm for (8) converges to the optimum when $f_i \to 0$ for all $i$.
A simple illustration of this is shown in Figure 1.
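The one-element iteration described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code, and it rests on several assumptions: `K_tilde` stands for the matrix $K + I/\gamma$, the violation is measured by $f_i = y_i - \sum_j \alpha_j \tilde{K}(x_i, x_j)$, the iteration starts from $\alpha = 0$, and the working index is chosen as the one with the largest $|f_i|$ (a first order rule in the spirit of the maximum violating pair, which the text does not spell out). The function name and the small test matrix are hypothetical.

```python
import numpy as np

def smo_one_element(K_tilde, y, eps=1e-3, max_iter=100000):
    """Update one multiplier per iteration until max_i |f_i| <= eps."""
    y = np.asarray(y, dtype=float)
    alpha = np.zeros(len(y))
    f = y.copy()                       # with alpha = 0, f_i = y_i
    iters = 0
    while iters < max_iter:
        i = int(np.argmax(np.abs(f)))  # assumed rule: most violated KKT condition
        if abs(f[i]) <= eps:
            break
        t = f[i] / K_tilde[i, i]       # exact maximizer of the one-variable subproblem
        alpha[i] += t
        f -= t * K_tilde[:, i]         # K_tilde is symmetric, so column i = row i
        iters += 1
    return alpha, f, iters

# Hypothetical 3-point example: a symmetric positive definite K_tilde.
K_tilde = np.array([[2.0, 0.5, 0.2],
                    [0.5, 2.0, 0.3],
                    [0.2, 0.3, 2.0]])
y = np.array([1.0, -1.0, 1.0])
alpha, f, iters = smo_one_element(K_tilde, y)
```

Keeping the vector `f` up to date costs one kernel column per iteration, so the full matrix never has to be refactored; the price is that many iterations may be needed, which is the slow convergence discussed next.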
Since only one component is updated per iteration, the decomposition method can be quite costly and suffers from slow convergence. For this reason, many researchers have improved the SMO algorithm. For example, Chen et al. improved the SMO algorithm by using shrinking and caching techniques [18]; Barbero et al. presented a cycle-breaking acceleration of SVM training [16]; and Lin et al. provided a three-parameter sequential minimal optimization for support vector machines [19].
As mentioned by Barbero et al. [16], the SMO algorithm is not free of cycle-related problems. For every $i$ in the working set $B$, if $\alpha_i$ is optimized with a step $t$ ($t > 0$ or $t < 0$) in a single direction per iteration, the number of cycles in the SD-SMO algorithm will be reduced. We now detail the SD-SMO formulation in the LS-SVM training process. Define
$$f_i = y_i - \sum_{j=1}^{N} \alpha_j \tilde{K}(x_i, x_j). \tag{11}$$
Then, the KKT optimality condition is violated if there exists any index $i$ such that $f_i \neq 0$. The SD-SMO algorithm works by optimizing only one $\alpha_i$ at each iteration and keeping the others fixed; that is, $\alpha_i$ is adjusted by a sign-invariable step $t$ ($t > 0$ or $t < 0$) per iteration as follows:
$$\alpha_i^{k+1} = \alpha_i^k + t. \tag{12}$$
The update of $\alpha_i$ changes all the $f_j$ as
$$f_j^{k+1} = f_j^k - t \tilde{K}(x_i, x_j), \quad j = 1, 2, \ldots, N, \tag{13}$$
and, therefore, the value of the objective function will change. At each iteration we need to be sure that the sign of $f_i$ does not change, that is, if $f_i^k \geq 0$ (or $\leq 0$), then $f_i^{k+1} \geq 0$ (or $\leq 0$). As $k$ increases, $f_i \to 0^+$ (or $0^-$) with the sign kept invariant. A simple illustration of this is shown in Figure 2.
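To derive the optimal step and the termination condition of the iteration, we define $\psi(t)$ as the negative of the dual objective (8) evaluated after the update (12):
$$\psi(t) = \frac{1}{2} \sum_{j=1}^{N} \sum_{l=1}^{N} \alpha_j^{k+1} \alpha_l^{k+1} \tilde{K}(x_j, x_l) - \sum_{j=1}^{N} \alpha_j^{k+1} y_j. \tag{14}$$
Because $f_i \to 0$ as $k \to \infty$, $\psi(t) \leq \psi(0)$. Therefore, let $\Delta = -(\psi(t) - \psi(0))$, which can be written as
$$\Delta = t f_i - \frac{1}{2} t^2 \tilde{K}(x_i, x_i). \tag{15}$$
The optimal step is obtained by maximizing $\Delta$ as
$$t_{\mathrm{opt}} = \frac{f_i}{\tilde{K}(x_i, x_i)}, \tag{16}$$
and the optimal step $t_{\mathrm{opt}}$ induces the change of $f_j$ as
$$f_j^{k+1} = f_j^k - \frac{f_i^k}{\tilde{K}(x_i, x_i)} \tilde{K}(x_i, x_j). \tag{17}$$
Substituting (16) into (15) gives $\Delta = f_i^2 / 2\tilde{K}(x_i, x_i)$. Hence we can choose the index $i$ which has the maximum value of $f_i^2 / 2\tilde{K}(x_i, x_i)$ and update $\alpha_i$ by (12) and (16). Suppose $F(\alpha) = (f_1, f_2, \ldots, f_i, \ldots, f_N)$; then $\|F(\alpha)\|$ can be used as a termination criterion for the iterative algorithm as
$$\|F(\alpha)\| \leq \varepsilon, \tag{18}$$
where $\varepsilon$ is a positive constant. The flowchart of the SD-SMO algorithm is shown in Algorithm 2.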
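The step selection described above can be sketched in Python. The sketch assumes that $f_i = y_i - \sum_j \alpha_j \tilde{K}(x_i, x_j)$, that the step along coordinate $i$ is $f_i / \tilde{K}(x_i, x_i)$, and that the index with the largest value of $f_i^2 / 2\tilde{K}(x_i, x_i)$ is selected; the names `dual_objective`, `gain_step`, and `solve_by_gain` and the toy matrix are hypothetical. It does not reproduce Algorithm 2 and does not model any additional safeguard for the sign of $f_i$. It only checks numerically that the predicted gain equals the actual increase of the dual objective (8).

```python
import numpy as np

def dual_objective(K_tilde, y, alpha):
    """Dual objective (8): -0.5 * alpha^T K_tilde alpha + alpha^T y (maximized)."""
    return -0.5 * alpha @ K_tilde @ alpha + alpha @ y

def gain_step(K_tilde, alpha, f):
    """Move along the coordinate with the largest predicted gain f_i^2 / (2 K_ii)."""
    gains = f**2 / (2.0 * np.diag(K_tilde))
    i = int(np.argmax(gains))
    t_opt = f[i] / K_tilde[i, i]
    new_alpha = alpha.copy()
    new_alpha[i] += t_opt
    new_f = f - t_opt * K_tilde[:, i]   # K_tilde is symmetric
    return new_alpha, new_f, gains[i]

def solve_by_gain(K_tilde, y, eps=1e-3, max_iter=10000):
    """Repeat gain steps until ||F(alpha)|| <= eps."""
    alpha = np.zeros(len(y))
    f = np.asarray(y, dtype=float).copy()   # alpha = 0 gives f = y
    steps = 0
    while np.linalg.norm(f) > eps and steps < max_iter:
        alpha, f, _ = gain_step(K_tilde, alpha, f)
        steps += 1
    return alpha, f, steps

# Hypothetical 3-point example.
K_tilde = np.array([[2.0, 0.5, 0.2],
                    [0.5, 2.0, 0.3],
                    [0.2, 0.3, 2.0]])
y = np.array([1.0, -1.0, 1.0])

alpha0 = np.zeros(3)
alpha1, f1, predicted_gain = gain_step(K_tilde, alpha0, y.copy())
actual_gain = dual_objective(K_tilde, y, alpha1) - dual_objective(K_tilde, y, alpha0)

alpha, f, steps = solve_by_gain(K_tilde, y)
```

Because the predicted gain is available from `f` and the diagonal of `K_tilde` alone, the selection needs no extra kernel evaluations beyond the column used for the update.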

Numerical Experiments
Within the framework of Algorithm 2, we conduct experiments in this section to check whether SD-SMO is really faster than SMO. There are two techniques for working set selection in SMO-type decomposition methods for LS-SVM classifiers [20]: the first order SMO (FO-SMO) algorithm, which uses first order information to achieve fast convergence, and the second order SMO (SO-SMO) algorithm, which uses second order information. Two groups of experiments have been carried out in order to compare SD-SMO with these two algorithms. All methods are implemented in MATLAB and executed on a personal computer with an Intel(R) Core(TM) i3 2.53 GHz processor, 2.00 GB of memory, and the Windows 7 operating system. For all algorithms, the optimization process is terminated when the maximal violation of the KKT conditions is within $\varepsilon = 0.001$. For simplicity, we consider only the Gaussian kernel $k(x, x') = \exp\{-\|x - x'\|_2^2 / 2\sigma^2\}$ to construct the LS-SVM.

The Comparison of SD-SMO with First Order SMO.
In this section, we compare SD-SMO with first order SMO on four benchmark datasets to evaluate the performance of the proposed method. We compare the two methods in terms of computational cost, which is measured by the number of iterations. The examples introduced by Keerthi and Shevade [10] are used. The datasets used for this purpose are Banana, Image, Waveform, and Splice. For each dataset, the value of $\sigma^2$ is determined by five-fold cross validation on a small random subset.
In the first experiment, we vary $\gamma$ over a small range because extremely small and large values are usually of little interest. We try the following nine values: $\gamma = 2^i$, $i = -4, -3, \ldots, 3, 4$. In Table 1, the computational costs associated with the four datasets are given as functions of $\gamma$ at the point where the optimization process terminates.
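The earlier remark that the speed of convergence depends on the condition number of the matrix in (6) can be explored numerically over the same grid of nine values. The Python sketch below is only an illustration on synthetic data, assuming the matrix has the form $K + I/\gamma$ with a Gaussian kernel; the data, the choice $\sigma^2 = 1$, and the variable names are hypothetical and unrelated to the benchmark datasets.

```python
import numpy as np

# Hypothetical synthetic inputs; not one of the benchmark datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))

# Gaussian kernel matrix with sigma^2 = 1 (arbitrary illustrative value).
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dist / (2.0 * 1.0))

# The nine values 2^i, i = -4, ..., 4.
gammas = [2.0**i for i in range(-4, 5)]

# 2-norm condition number of K + I/gamma for each value.
conds = [np.linalg.cond(K + np.eye(len(X)) / g) for g in gammas]

for g, c in zip(gammas, conds):
    print(f"gamma = {g:8.4f}   cond(K + I/gamma) = {c:.3e}")
```

For a symmetric positive semidefinite $K$, the condition number of $K + I/\gamma$ equals $(\lambda_{\max} + 1/\gamma)/(\lambda_{\min} + 1/\gamma)$, which grows with $\gamma$; this is one plausible reason why an iterative solver may need more iterations at larger values of the parameter.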
As a basis for the comparisons, Table 1 shows the computational costs of the first order SMO and SD-SMO algorithms at different values of the parameter $\gamma$. For the first order SMO algorithm, the computational cost increases as $\gamma$ increases, whereas for the SD-SMO algorithm it does not; see, for instance, the computational cost of SD-SMO for the Banana and Waveform datasets. From Table 1, we can see that the number of iterations of the SD-SMO algorithm is much smaller than that of the first order SMO algorithm, especially for the Image dataset. In order to further show the performance of the SD-SMO algorithm, Tables 2 and 3 are given. The tables report the training time and the generalization performance of the first order SMO and SD-SMO algorithms for the four benchmark datasets. The generalization performance is measured by the classification accuracy on an independent test set for each dataset.
From Tables 2 and 3, we can see that the generalization capabilities of both methods are comparable, but the training time of the SD-SMO algorithm is shorter than that of the first order SMO algorithm. For instance, in the case of the Image dataset, the training time of the first order SMO algorithm with the best generalization performance is 41.6108 s, about ten times the cost of the SD-SMO algorithm. The classification accuracy for the Image dataset with the SD-SMO algorithm is 0.963, almost equal to that of the first order SMO algorithm. Consequently, for LS-SVMs the proposed SD-SMO algorithm is more efficient than the first order SMO algorithm at comparable accuracy.

The Comparison of SD-SMO with Second Order SMO.
To further explore the performance of the proposed method, we compare SD-SMO with second order SMO in a second set of experiments on the datasets Titanic, Heart, Breast Cancer, Thyroid, and Pima (available in [21]). We use the datasets provided in [21] to verify the good generalization properties of the proposed method. In Table 4, the number of iterations and the execution time per experiment are reported. The misclassification rates are also reported in Table 4.
It can be seen that SD-SMO is the better choice for the Breast Cancer, Pima, and Titanic datasets. The results in Table 4 show that the biggest improvement with SD-SMO occurs for Titanic. This is further evidence for the previous observation that SD-SMO outperforms second order SMO for large-scale problems.
The final set of experiments aims to ascertain how well the SMO algorithm scales to large-scale datasets when it uses the different working set selections. To test this, we use the datasets a8a and covtype.binary, available with several increasing numbers of patterns in [22].
In Figure 3, we plot the results for a8a with $\gamma = 2$, $\sigma^2 = 10$ and covtype.binary with $\gamma = 10$, $\sigma^2 = 10$, respectively. As can be seen, the number of iterations scales linearly with the training set size. Note that SD-SMO needs fewer iterations to converge, as expected, and the reduction is greater for covtype.binary because of its larger value of $\gamma$. The scaling is linear in both cases.

Conclusion
In this paper, a new algorithm, SD-SMO, is proposed. It can be used to select the working set for LS-SVM classifier training, and its asymptotic convergence is proved theoretically. Based on the SMO formulation, the path of one-sided convergence is used effectively in our method. The numbers of iterations and kernel operations in the SD-SMO algorithm are smaller than those of the traditional SMO algorithm, so the new algorithm converges faster. Simulation experiments have been carried out on four benchmark datasets. The empirical comparisons demonstrate that the SD-SMO algorithm is much more efficient in terms of computational time than first order and second order SMO, while there are no large differences in terms of accuracy.