Efficient Model Selection for Sparse Least-Square SVMs

The Forward Least-Squares Approximation (FLSA) SVM is a newly emerged Least-Square SVM (LS-SVM) whose solution is extremely sparse. The algorithm uses the number of support vectors as the regularization parameter and ensures the linear independence of the support vectors which span the solution. This paper proposes a variant of the FLSA-SVM, namely, the Reduced FLSA-SVM (RFLSA-SVM), which has reduced computational complexity and memory requirements. The strategy of “contexts inheritance” is introduced to improve the efficiency of tuning the regularization parameter for both the FLSA-SVM and the RFLSA-SVM algorithms. Experimental results on benchmark datasets show that, compared to the SVM and a number of its variants, the RFLSA-SVM solutions contain a reduced number of support vectors while maintaining competitive generalization ability. With respect to the time cost of tuning the regularization parameter, the RFLSA-SVM algorithm was empirically shown to be the fastest compared to the FLSA-SVM, the LS-SVM, and the SVM algorithms.


Introduction
As with the standard Support Vector Machine (SVM), the Least Squares Support Vector Machine (LS-SVM) optimizes the tradeoff between the model complexity and the squared error loss functional [1,2]. The optimization problem, in the dual form, can be solved by the sequential minimal optimization (SMO) algorithm [3]. Bo et al. proposed a novel strategy for working-set selection which improved the efficiency of the SMO implementation [4]. Meanwhile, the optimization problem, which is subject to equality constraints, can be transformed into a set of linear equations to which the conjugate gradient (CG) method can be applied [5]. Chu et al. further reduced the time complexity of training the LS-SVM using the CG method [6]. However, the solution of the LS-SVM is parameterized by a majority of the training samples, which is known as the nonsparseness problem of the LS-SVM. As the classification of a test sample primarily involves kernel evaluations between the test sample and the training samples contained in the solution, a nonsparse solution can cause a slow classification procedure.
A range of algorithms aiming at easing the nonsparseness of LS-SVM solutions is available. Suykens et al. proposed to prune the training samples with the minimal Lagrangian multipliers [7]. De Kruif and De Vries, on the other hand, proposed to remove the samples that introduce the least approximation error for the next iteration [8]. Zeng and Chen presented a pruning algorithm based on the SMO implementation which causes the least change to the dual objective function [9].
Another class of sparse LS-SVM algorithms views each column of the kernel matrix as the output of a specific "basis function" on the training samples. The basis functions are selected iteratively into the solution. Among them is the kernel matching pursuit algorithm, which adopts a squared error loss function [10] and was later extended to address large-scale problems [11]. Jiao et al. proposed to select, at each iteration, the basis function which minimizes the change in the Wolfe dual objective function of the LS-SVM. Also in this class is the Forward Least-Squares Approximation SVM (FLSA-SVM) algorithm, which uses the number of support vectors as the regularization parameter. The FLSA-SVM avoids the explicit formulation of the training cost by applying a function approximation technique based on the squared error loss [12]. The algorithm can also detect linear dependencies among the column vectors of the input Gramian matrix, ensuring that the training samples which span the solution are linearly independent.
Unfortunately, the exhaustive search for the optimal basis function at each iteration of the FLSA-SVM is computationally expensive. To tackle this problem, the Reduced FLSA-SVM (RFLSA-SVM) is proposed, in which a random selection of the basis functions is adopted. The RFLSA-SVM also has a lower memory requirement, since the input Gramian matrix for training is now rectangular, in contrast to the square one of the FLSA-SVM. Compared to the FLSA-SVM, the RFLSA-SVM risks an increased number of support vectors. Nevertheless, the paper shows empirically that the FLSA-SVM and the RFLSA-SVM variant both provide sparse solutions in comparison to the conventional LS-SVM and the standard SVM, as well as the other sparse SVM algorithms developed upon the idea of "basis functions." Further, the technique of "contexts inheritance" is proposed, which is another effort to reduce the time complexity of training the proposed RFLSA-SVM and the FLSA-SVM. Taking the RFLSA-SVM algorithm as an example, "contexts inheritance" takes advantage of the connection between any two RFLSA-SVMs whose kernel functions are identical but whose regularization parameter settings differ. The intermediate variables, which are by-products of training the RFLSA-SVM with a smaller regularization parameter, can be inherited as the starting point for training the RFLSA-SVM with a greater one. This property, referred to as "contexts inheritance", can be further utilized in the tuning of the regularization parameter for both the RFLSA-SVM and the FLSA-SVM.
The paper is organized as follows. Section 2 briefly reviews a sparse LS-SVM algorithm, namely, the Forward Least-Squares Approximation SVM (FLSA-SVM), followed by a description of the Reduced FLSA-SVM (RFLSA-SVM) variant in Section 3. In Section 4, the attribute of "contexts inheritance", which makes these two algorithms more computationally attractive, is described. Experimental results are given in Section 5 and concluding remarks in Section 6.

The Forward Least-Squares Approximation SVM [12]
Given a set of ℓ samples $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, \ell$, where $\mathbf{x}_i \in \mathbb{R}^N$ is the $i$th training pattern and $y_i \in \{-1, +1\}$ its class label, the Forward Least-Squares Approximation SVM (FLSA-SVM) seeks a classifier which is the solution to the optimization problem of (1)-(3), where $\varphi(\cdot)$ is the function mapping an input pattern into the feature space. The vector $\Gamma = (s_1, \ldots, s_m)$ is composed of the indices of the support vectors and $|\Gamma|$ is the cardinality of the set.
Introducing ℓ Lagrange multipliers $\alpha_i$ ($i = 1, \ldots, \ell$) for the equality constraints of (2) results in the linear system of (4), where $\mathbf{K} \in \mathbb{R}^{\ell \times \ell}$ is the kernel matrix. The constraint of (3) demands that only $m$ Lagrange multipliers $\alpha_i$ end up non-zero in the solution to (4). The resultant decision function of the FLSA-SVM classifier on a test sample $\mathbf{z}$ is given by (5), where $s_i$ ($i = 1, \ldots, m$) is the column index of the $i$th basis function whose weight $\alpha_{s_i}$ is non-zero. Each associated training sample $\mathbf{x}_{s_i}$ is known as a "support vector", which makes an actual contribution to the establishment of the decision function. Equation (5) suggests that the training of the FLSA-SVM is mathematically composed of two major procedures: the selection of each $\mathbf{d}_{s_i}$ and then the computation of the $m$-dimensional weight vector $\boldsymbol{\alpha}$.
The algorithm selects one basis function at each iteration to span the solution. The following describes how the basis function is selected.

The Selection of a Basis Function.
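At the end of the $k$th iteration of the algorithm, $k$ basis functions, whose indices in the dictionary matrix are $(s_1, \ldots, s_k)$, have been selected. They form a matrix $\Omega_k = (\mathbf{d}_{s_1}, \ldots, \mathbf{d}_{s_k}) \in \mathbb{R}^{\ell \times k}$, where $\mathbf{d}_{s_i} = d(\mathbf{x}_{s_i}, \cdot)$ for simplicity of notation. The objective of the $(k+1)$th iteration is to select the $(k+1)$th basis function $\mathbf{d}_{s_{k+1}}$ from the dictionary. $\mathbf{d}_{s_{k+1}}$ is identified by solving the optimization problem of (6), where $\Delta e_k(\mathbf{d}_j)$ is the perturbation in the loss function as a result of adding a basis function $\mathbf{d}_j$ to the matrix $\Omega_k$.
For the calculation of $\Delta e_k(\mathbf{d}_j)$, the algorithm defines a "residue matrix" $\mathbf{R}_k \in \mathbb{R}^{\ell \times \ell}$ in terms of the identity matrix $\mathbf{I}$ and the selected basis functions. The squared error loss after $k$ iterations can be shown to be $e_k = \mathbf{y}^\top \mathbf{R}_k \mathbf{y}$, and the decrease of this local approximation error due to the addition of $\mathbf{d}_j$ has been proven to be the quantity given by (9). $\mathbf{d}_{s_{k+1}}$ is eventually identified as the candidate which leads to the maximum of $\Delta e_k(\mathbf{d}_j)$.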
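As a sketch of what these quantities look like in the usual forward least-squares setting, the residue matrix can be taken as the orthogonal projector onto the complement of the span of the selected columns. This is an assumption made here for illustration and not necessarily the exact form of the numbered equations; $k$ denotes the iteration count, $\Omega_k$ the matrix of selected basis functions (assumed to have full column rank), and $e_k$, $\Delta e_k$ are illustrative names for the loss and its decrease when the candidate $\mathbf{d}_j$ is added.

```latex
\begin{align*}
\mathbf{R}_k &= \mathbf{I} - \Omega_k\left(\Omega_k^\top \Omega_k\right)^{-1}\Omega_k^\top,
  \qquad \mathbf{R}_0 = \mathbf{I},\\
e_k &= \mathbf{y}^\top \mathbf{R}_k \mathbf{y},\\
\Delta e_k(\mathbf{d}_j) &= e_k - e_{k+1}
  = \frac{\left(\mathbf{y}^\top \mathbf{R}_k \mathbf{d}_j\right)^2}
         {\mathbf{d}_j^\top \mathbf{R}_k \mathbf{d}_j},\\
\mathbf{R}_{k+1} &= \mathbf{R}_k
  - \frac{\mathbf{R}_k \mathbf{d}_j \mathbf{d}_j^\top \mathbf{R}_k}
         {\mathbf{d}_j^\top \mathbf{R}_k \mathbf{d}_j}.
\end{align*}
```

Under this form, $\mathbf{d}_j^\top \mathbf{R}_k \mathbf{d}_j = 0$ exactly when $\mathbf{d}_j$ lies in the span of the columns already selected, which is one way a linearly dependent candidate can be recognized.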
With the introduction of the residue matrices $(\mathbf{R}_1, \ldots, \mathbf{R}_k)$, starting with $\mathbf{R}_0 = \mathbf{I}$, any $\mathbf{d}_j$ that can be expressed as a linear combination of the previously selected basis functions $(\mathbf{d}_{s_1}, \ldots, \mathbf{d}_{s_k})$ can be detected and pruned from the pool of candidate basis functions. This also contributes to the sparseness of the FLSA-SVM solution.
After the identification of all the $m$ basis functions, the vector $(\alpha_{s_1}, \ldots, \alpha_{s_m})$, where $\alpha_{s_i}$ is the weight for $\mathbf{d}_{s_i}$, is then computed.
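The selection and pruning steps can be illustrated with a minimal Python sketch. It assumes the residue matrix acts as an orthogonal projector, so that each candidate and the target are simply deflated by the chosen column instead of storing an ℓ-by-ℓ matrix. The function name `forward_select`, its arguments, and the tolerance `tol` are hypothetical and not taken from the original implementation.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def forward_select(columns: Sequence[Sequence[float]], y: Sequence[float],
                   m: int, tol: float = 1e-10) -> Tuple[List[int], float]:
    """Greedily pick up to m columns; return their indices and the final loss."""
    # proj[j] holds the projected candidate (R_k d_j); res_y holds R_k y.
    proj = {j: list(c) for j, c in enumerate(columns)}
    res_y = list(y)
    selected: List[int] = []
    while len(selected) < m and proj:
        best_j, best_gain = -1, 0.0
        for j in sorted(proj):
            energy = dot(proj[j], proj[j])            # d_j^T R_k d_j
            if energy <= tol:                          # dependent on chosen columns
                del proj[j]                            # prune the candidate
                continue
            gain = dot(proj[j], res_y) ** 2 / energy   # decrease of the loss
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j < 0:                                 # no candidate reduces the loss
            break
        q = proj.pop(best_j)
        qq = dot(q, q)
        c = dot(q, res_y) / qq
        res_y = [r - c * qi for r, qi in zip(res_y, q)]
        for j in proj:                                 # deflate remaining candidates
            cj = dot(q, proj[j]) / qq
            proj[j] = [p - cj * qi for p, qi in zip(proj[j], q)]
        selected.append(best_j)
    return selected, dot(res_y, res_y)


# Toy dictionary: column 2 is a multiple of column 0 and gets pruned.
toy_columns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
picked, loss = forward_select(toy_columns, [3.0, 1.0, 0.0], m=3)
print(picked, loss)
```

In this toy run only two columns are returned even though three were requested, because the remaining candidates are either linearly dependent on the chosen ones or bring no decrease of the loss.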

The Computation of the Weight Vector 𝛼.
Other than selecting an extra basis function $\mathbf{d}_{s_{k+1}}$ to span the solution, the $(k+1)$th iteration in fact establishes a linear system whose solutions are the last $m - k$ elements of $\boldsymbol{\alpha}$, in which the vector $\mathbf{q}^{(k)}_{k+1} = \mathbf{R}_k \mathbf{d}_{s_{k+1}} \in \mathbb{R}^{\ell \times 1}$ and the vector $\mathbf{y}^{(k)} = \mathbf{R}_k \mathbf{y} \in \mathbb{R}^{\ell \times 1}$.
After $m$ iterations, a set of linear equations, represented by (11), is constructed. The weight vector $\boldsymbol{\alpha}$ can be computed by the back substitution procedure used in a typical Gaussian elimination process.
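Back substitution on an upper triangular system can be sketched as follows. The function name `back_substitute` and the small example system are hypothetical and serve only to illustrate the procedure; the coefficient matrix is assumed to have a non-zero diagonal.

```python
from typing import List, Sequence


def back_substitute(A: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve A alpha = b for an upper triangular A with non-zero diagonal."""
    n = len(b)
    alpha = [0.0] * n
    for i in range(n - 1, -1, -1):        # last unknown first
        if A[i][i] == 0.0:
            raise ValueError("zero pivot: the system is singular")
        known = sum(A[i][j] * alpha[j] for j in range(i + 1, n))
        alpha[i] = (b[i] - known) / A[i][i]
    return alpha


weights = back_substitute(
    [[2.0, 1.0, 1.0],
     [0.0, 4.0, 2.0],
     [0.0, 0.0, 5.0]],
    [7.0, 10.0, 10.0],
)
print(weights)  # [1.75, 1.5, 2.0]
```

Each unknown costs one inner product over the entries already solved, so the whole procedure is quadratic in the number of selected basis functions.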

The Reduced FLSA-SVM
At each iteration of the FLSA algorithm, a major share of the computational effort is spent on solving the optimization problem formulated by (6). Meanwhile, it is noted that, in the FLSA algorithm, the local approximation errors $\{\mathbf{y}^\top \mathbf{R}_k \mathbf{y},\ k = 1, \ldots, \mathrm{rank}(\mathbf{D})\}$ form a monotonically decreasing sequence, where $\mathrm{rank}(\mathbf{D})$ is the rank of the matrix $\mathbf{D}$. Thus the following proposition can be applied to the monotonically increasing sequence $\{\mathbf{y}^\top \mathbf{y} - \mathbf{y}^\top \mathbf{R}_k \mathbf{y},\ k = 1, \ldots, \mathrm{rank}(\mathbf{D})\}$.
Proposition 1 (see [11]). Assuming a uniform distribution of $z$, the maximum of a sample $\{z_1, \ldots, z_m\}$ has a quantile of at least $\eta^{1/m}$ with probability $1 - \eta$.
The proposition suggests that the probability of reaching a value that has a quantile of $\delta$ is $1 - \eta$ if $m = \lceil \log \eta / \log \delta \rceil$ basis functions are randomly chosen from the dictionary matrix. Consequently, given a training set of 10000 samples, in order to reach the best 2% of the values for the approximation (a quantile of $\delta = 0.98$) with a probability of 0.99 ($\eta = 0.01$), the number of randomly chosen basis functions required is $\lceil \log 0.01 / \log 0.98 \rceil = 228$.
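The sample size implied by Proposition 1 can be checked numerically. The sketch below uses hypothetical names: `eta` is the tolerated failure probability and `delta` the target quantile, so that the maximum of $m$ uniform draws falls below the quantile with probability `delta**m`.

```python
import math


def subset_size(eta: float, delta: float) -> int:
    """Smallest m with delta**m <= eta, i.e. ceil(log(eta) / log(delta))."""
    if not (0.0 < eta < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("eta and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(eta) / math.log(delta))


print(subset_size(0.01, 0.98))  # 228
print(subset_size(0.05, 0.95))  # 59
```

The result depends only on the two probabilities and not on the size of the training set, which is why a few hundred random candidates can suffice even for a dictionary with thousands of columns.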
It is thus proposed to select basis functions randomly from $\mathbf{D}$ without evaluating (9), which results in the Reduced FLSA-SVM (RFLSA-SVM), whose pseudocode is given in Algorithm 1. With the regularization parameter set to $m$, a random selection of $m$ basis functions, denoted as $\mathbf{D}_m$, which is a submatrix of the matrix $\mathbf{D}_\ell$ built on the entire training set, is used as the dictionary.
The RFLSA-SVM also differs from the FLSA-SVM with respect to the interpretation of the value of the regularization parameter. For the FLSA-SVM, the value of the parameter, provided it does not exceed the column rank of $\mathbf{D}$, is the actual number of support vectors. For the RFLSA-SVM whose dictionary $\mathbf{D}_m$ exhibits linear dependencies between column vectors, the number of basis functions available is lower than $m$. From the perspective of the input kernel matrix, the RFLSA-SVM is analogous to the Reduced SVM [13], although the latter is subject to inequality constraints. The RFLSA-SVM also bears resemblance to the PFSALS-SVM algorithm [14]. Both are in the framework of Least-Squares SVMs and iteratively select basis functions into the solution. Nonetheless, rather than converting the optimization problem to a linear system, the PFSALS-SVM algorithm addresses the dual form of the objective function.
In terms of a single round of training, the time complexity of the RFLSA-SVM is $O(m^2 \ell)$ while that of the FLSA-SVM is $O(m \ell^2)$. The memory required by the FLSA-SVM to store the dictionary matrix is $O(\ell^2)$, while for the RFLSA-SVM it is reduced to $O(m \ell)$. Thus the RFLSA-SVM method is more computationally attractive, requiring less storage space and less computational time. Encouragingly, the computational cost of the FLSA-SVM can be further reduced by employing the technique of "contexts inheritance", which is discussed in detail in the following.

INITIALIZATION:
(i) Generate a permutation of the integers between 1 and ℓ. The first $m$ elements form a vector $\Gamma = \{s_1, \ldots, s_m\}$, which holds the indices of the randomly sampled columns of the dictionary matrix $\mathbf{D}_0$.
(ii) The current residue vector $\mathbf{y}$ and the current dictionary $\mathbf{D}$, which is initially a matrix of evaluations of the candidate basis functions on the training data.
(iii) The matrix $\mathbf{A}$ and the vector $\mathbf{b}$ both start as empty. At each iteration, a row is appended to $\mathbf{A}$ and $\mathbf{b}$ grows by one element, which in the end forms a linear system.
(iv) A variable $k$, which is the pointer to the basis function currently investigated and also a count of the selected basis functions. At the start, $k = 0$.
The $k$ positive elements of $\Gamma$, represented by $\Lambda = \{\lambda_1, \ldots, \lambda_k\}$ in ascending order, are the indices of the $k$ selected basis functions. The $k$ columns of the matrix $\mathbf{A}$ whose indices are $\Lambda$, together with $\mathbf{b}$, form a linear system, on which the process of back substitution is performed for the solution.
Algorithm 1: The Reduced Forward Least-Squares Approximation SVM.

The Technique of Contexts Inheritance
Assume that an RFLSA-SVM whose solution contains $m$ basis functions $(\mathbf{d}_{s_1}, \ldots, \mathbf{d}_{s_m})$ has been trained. For simplicity of notation, define $\mathbf{q}_i = \mathbf{d}_{s_i}$, $i = 1, \ldots, m$. The weight vector $\boldsymbol{\alpha}_m$ for the selected basis functions $(\mathbf{q}_1, \ldots, \mathbf{q}_m)$ is obtained by solving the linear system of (11). Denoting its upper triangular coefficient matrix as $\mathbf{A}_m \in \mathbb{R}^{m \times m}$ and the target vector as $\mathbf{b}_m \in \mathbb{R}^{m \times 1}$, (11) can be written as $\mathbf{A}_m \boldsymbol{\alpha}_m = \mathbf{b}_m$. Now consider the training of the RFLSA-SVM whose solution is parameterized by $n$ ($> m$) support vectors. An extra $(n - m)$ basis functions are required to be randomly selected, and they are denoted as $(\mathbf{q}_{m+1}, \ldots, \mathbf{q}_n)$. Defining the linear system required to be built as $\mathbf{A}_n \boldsymbol{\alpha}_n = \mathbf{b}_n$, it can be derived that the coefficient matrix $\mathbf{A}_n$ is again upper triangular.
It can be seen that the upper triangular submatrix in the upper left corner of $\mathbf{A}_n$ is in fact $\mathbf{A}_m$. Denoting the upper right $m \times (n - m)$ submatrix as $\mathbf{B}$ and the bottom right $(n - m) \times (n - m)$ submatrix as $\mathbf{C}$, the matrix $\mathbf{A}_n$ is simplified into $\mathbf{A}_n = \begin{pmatrix} \mathbf{A}_m & \mathbf{B} \\ \mathbf{0} & \mathbf{C} \end{pmatrix}$. It is rather clear that the FLSA-SVM can also benefit from the technique of "contexts inheritance." The technique makes the tuning of the regularization parameter $m$ much faster for the RFLSA-SVM and the FLSA-SVM, as demonstrated by the experimental results in Section 5.
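The idea of growing the triangular system without rebuilding its upper left block can be sketched in Python. The sketch makes a simplifying assumption: it stores the projected columns rather than the ℓ-by-ℓ residue matrices. The class `InheritedSystem`, its attributes, and the toy data are hypothetical and are not the authors' implementation.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


class InheritedSystem:
    """Upper triangular system A alpha = b that is grown column by column."""

    def __init__(self, y: Sequence[float]) -> None:
        self.y = list(y)
        self.proj: List[List[float]] = []  # selected columns, earlier ones projected out
        self.A: List[List[float]] = []     # rows of the upper triangular matrix
        self.b: List[float] = []           # targets, one per selected column

    def extend(self, new_columns: Sequence[Sequence[float]], tol: float = 1e-10) -> None:
        for d in new_columns:
            q = list(d)
            for p in self.proj:                      # remove earlier directions
                c = dot(p, q) / dot(p, p)
                q = [qi - c * pi for qi, pi in zip(q, p)]
            if dot(q, q) <= tol:                     # linearly dependent: prune
                continue
            for row, p in zip(self.A, self.proj):    # existing rows gain one entry
                row.append(dot(p, d))
            self.A.append([0.0] * len(self.proj) + [dot(q, d)])  # new bottom row
            self.proj.append(q)
            self.b.append(dot(q, self.y))

    def solve(self) -> List[float]:
        n = len(self.b)
        alpha = [0.0] * n
        for i in range(n - 1, -1, -1):               # back substitution
            known = sum(self.A[i][j] * alpha[j] for j in range(i + 1, n))
            alpha[i] = (self.b[i] - known) / self.A[i][i]
        return alpha


system = InheritedSystem([3.0, 2.0, 1.0])
system.extend([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])    # train with m = 2
alpha_m = system.solve()
A_m = [row[:] for row in system.A]
system.extend([[1.0, 1.0, 1.0]])                     # grow to n = 3, reusing A_m and b_m
alpha_n = system.solve()
print(alpha_m, alpha_n)
```

After the second call to `extend`, the leading 2-by-2 block of `system.A` is unchanged. Only the new right-hand column and the new bottom row had to be computed, which mirrors the block structure of the enlarged coefficient matrix.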

Experimental Results
A set of experiments was performed to evaluate the performance of the proposed RFLSA-SVM algorithm. It was first applied to the two-spiral benchmark [15] to illustrate its generalization ability. The Gaussian kernel function of (14), with kernel parameter $\sigma$, was used throughout. The standard SVMs were implemented using LIBSVM [16].
The conjugate gradient implementation of the LS-SVM was run using the LS-SVMlab toolbox [17] and its sequential minimal optimization implementation using the software package of [4]. The FLSA-SVM and the RFLSA-SVM were implemented in our own C source code. All experiments were run on a Pentium 4 3.2 GHz processor under Windows XP with 2 GB of RAM.

Generalization Performance on the Two-Spiral Dataset.
The 2D "two-spiral" benchmark is known to be difficult for pattern recognition algorithms and poses great challenges for neural networks [18]. The training set consisted of 194 points of the $x$-$y$ plane, half of which had a target output of +1 and half −1. These training points were sampled from two intertwining spirals that go around the origin, as illustrated in Figure 1, where the two categories are marked, respectively, by "x" and "o." comparatively more wavy, the pattern has still been recognized successfully.
In conclusion, on the small but challenging "two-spiral" problem, the RFLSA-SVM achieved outstanding generalization performance when the number of support vectors was large enough. Given a smaller set of support vectors, in an effort to ease the nonsparseness of its solution, the RFLSA-SVM still managed acceptable generalization performance. Thus the RFLSA-SVM offers more flexibility in choosing the number of support vectors.

Generalization Performance on More Benchmark Problems.
The FLSA-SVM algorithm was applied to 4 binary problems: the Banana, Image, Splice, and Ringnorm datasets, which are all accessible at http://theoval.cmp.uea.ac.uk/matlab/#benchmarks/. The detailed information on the datasets is given in Table 1. Among all the realizations of each benchmark, the first one was used. FLSA-SVMs were compared with SVMs, LS-SVMs, the fast sparse approximation scheme for LS-SVM (FSALS-SVM), and its variant called PFSALS-SVM, both of which were proposed by Jiao et al. [14]. The parameter $\epsilon$ of the FSALS-SVMs and PFSALS-SVMs was uniformly set to 0.5, which was empirically shown to work well with most datasets [14]. Comparisons were also made against D-optimality orthogonal forward regression (D-OFR) [19], a technique for nonlinear function estimation that promises to yield sparse solutions. The parameters, which were the penalty constant and $\sigma$ in (14), were tuned by tenfold cross-validation (CV). The regularization parameter and the $\sigma$ in (14) were also tuned by tenfold CV.
Tables 2 and 3, respectively, present the best test correctness and the number of support vectors for the different SVM algorithms, with the best results highlighted in bold. It can be seen that the FLSA-SVM and the RFLSA-SVM achieved classification accuracy comparable to the standard SVM and the LS-SVM. The number of support vectors required by the RFLSA-SVM was much smaller than that of the LS-SVM, the SVM, the FSALS-SVM, and the PFSALS-SVM on the Banana and the Ringnorm benchmarks.
The test correctness of the RFLSA-SVM, as well as that of the FLSA-SVM, with the number of support vectors ranging from 100 to 400 on the Splice and the Image datasets is further reported in Table 4. It can be seen that, on the Splice dataset, the RFLSA-SVM parameterized by 240 support vectors already achieved an accuracy of 89.15%, which is over 99% of the best obtainable accuracy of 89.98%, which requires 680 support vectors. Similarly, on the Image dataset, the RFLSA-SVM parameterized by 200 support vectors already achieved an accuracy of 96.83%, which is over 98% of the best obtainable accuracy of 98.32%, which requires 380 support vectors.
These statistics show that, allowing a slight degradation of the classification accuracy, the sparseness of the RFLSA-SVM's solutions can be further enhanced.

Merits of the Contexts Inheritance Technique.
To demonstrate the merits of the "contexts inheritance" technique, the RFLSA-SVM was compared with the SVM, the LS-SVM, and the FLSA-SVM in terms of the time cost of tuning the regularization parameter. For each dataset, the kernel parameter $\sigma$ was fixed at the value which produced the best tenfold cross-validation accuracy and the regularization parameter was varied. For the SVM and the LS-SVM, the regularization term $C$ was set as $2^i$ where $i \in \{-10, -9, \ldots, 10\}$, providing 21 values to be examined. For the FLSA-SVM and the RFLSA-SVM, the regularization term $m$ was initially 1. The remaining 20 integer values of $m$, in training order, formed an arithmetic sequence with both the first term and the common difference being $\ell/20$. The last term of the sequence is equal to ℓ, where ℓ is the number of training samples. The technique of "contexts inheritance" was applied to the consecutive training of both the FLSA-SVM and the RFLSA-SVM.
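The two parameter grids can be written down in a few lines of Python. The sketch assumes that the exponent of the penalty term runs over the integers from −10 to 10, which is what yields 21 values. The function names and the training-set size of 400 are hypothetical choices for illustration.

```python
from typing import List


def penalty_grid(lo: int = -10, hi: int = 10) -> List[float]:
    """Penalty values 2**i for integer exponents lo..hi (21 values by default)."""
    return [2.0 ** i for i in range(lo, hi + 1)]


def support_vector_grid(n_train: int, steps: int = 20) -> List[int]:
    """Start at 1, then an arithmetic sequence with first term and step n_train/steps."""
    return [1] + [round(k * n_train / steps) for k in range(1, steps + 1)]


print(len(penalty_grid()))             # 21
print(support_vector_grid(400)[:4])    # [1, 20, 40, 60]
print(support_vector_grid(400)[-1])    # 400
```

Because the second grid is increasing, each setting can start from the intermediate results of the previous one, whereas every value in the first grid requires a separate round of training.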
Table 5 reports the time cost of the different algorithms on the Banana dataset. For the SVMs and the LS-SVM, each row entry in Table 5 gives the time cost of training with a different value of the regularization parameter $C$. For the RFLSA-SVM and the FLSA-SVM, each row entry indicates the time cost for the regularization parameter, denoted by $m$, to reach the current value from the previous one. Since $m$ also indicates the number of support vectors, each row entry is the time cost for a specific growth in the number of support vectors. For example, in the case of the FLSA-SVM, the time cost for the number of support vectors to grow from 1 to 20 was 0.1880 seconds, as given by the entry at the second row and the second column. This indicates that, if an FLSA-SVM with 20 support vectors is to be trained from scratch, the total time cost is 0.2510 (= 0.1880 + 0.0630) seconds, that is, the sum of the first two rows in the second column.
If the FLSA-SVM with 20 support vectors is trained upon the FLSA-SVM with 1 support vector, applying the technique of "contexts inheritance," the time cost is reduced to 0.1880 seconds.
For the RFLSA-SVM algorithm, the row entry starting with $m = 60$ is the training time required for an input dictionary matrix composed of 60 randomly selected basis functions. The row entry of $m = 60$ corresponds to the setting of ($\eta = 0.05$, $\delta = 0.95$) for Proposition 1, which is the number of random samples required to obtain the top 5% function approximation values with a probability of 0.95. It resulted in a time cost of 0.047 (= 0.000 + 0.015 + 0.016 + 0.016) seconds, which was the sum of the first 4 rows in the third column. Similarly, the row entry of $m = 240$ is the training time required for a dictionary matrix composed of 240 randomly selected basis functions, which corresponds to the setting of ($\eta = 0.01$, $\delta = 0.98$) for Proposition 1.
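The way the incremental row entries add up can be reproduced with a running sum. The sketch below uses only the Banana timings quoted in the text; the variable names are hypothetical.

```python
from itertools import accumulate

# Incremental costs in seconds, as quoted in the text for the Banana dataset.
flsa_increments = [0.0630, 0.1880]                # reach 1, then 20 support vectors
rflsa_increments = [0.000, 0.015, 0.016, 0.016]   # reach 1, 20, 40, then 60 columns

# Running totals: the cost of training each setting from scratch.
flsa_from_scratch = list(accumulate(flsa_increments))
rflsa_from_scratch = list(accumulate(rflsa_increments))

print(round(flsa_from_scratch[-1], 4))    # 0.251
print(round(rflsa_from_scratch[-1], 4))   # 0.047
```

With inheritance, moving to the next setting costs only the last increment; without it, the cost is the running total up to that row.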
In contrast to the RFLSA-SVM, the FLSA-SVM selects support vectors into the solution by solving an optimization problem rather than by random sampling of the training set. Thus, for the FLSA-SVM, these two rows correspond to the time cost of selecting 60 and 240 support vectors, respectively, to span the solution. The last row of Table 5 shows the training time cost when using the full dictionary matrix, which also applies to the SVM and the LS-SVM.
It can be seen that the time cost of tuning the regularization parameter, given a dictionary matrix of 60 columns, was only 0.047 seconds for the RFLSA-SVM and 0.626 seconds for the FLSA-SVM. These were much less than the 1.211 seconds required by the SVM, the 7.608 seconds required by the LS-SVM implemented with the CG method, and the 12.2263 seconds required by the LS-SVM implemented with SMO. Using a dictionary matrix of 240 columns, the RFLSA-SVM took 0.480 seconds, which is still much less than the time cost of the LS-SVM and the SVM.
Tables 6, 7, and 8 further report the time cost of the different algorithms on the Splice, Image, and Ringnorm datasets. The RFLSA-SVM also achieved the lowest time cost on these datasets, requiring 0.344, 0.140, and 2.140 seconds, respectively, for the setting of ($\eta = 0.05$, $\delta = 0.95$). For the setting of ($\eta = 0.01$, $\delta = 0.98$), the RFLSA-SVM took only 1.533, 2.046, and 18.703 seconds, respectively, on the three datasets, which makes it the fastest of the four algorithms.
For the FLSA-SVM algorithm, given a dictionary matrix of 60 columns, the training cost is 8.3280 seconds on the Splice dataset, making it the second fastest. On the Image dataset, the time cost of the FLSA-SVM using a dictionary matrix of 60 columns is 11.3750 seconds, which is lower than that of the LS-SVM implemented with CG and of the SVM.
In Table 7, at the row of $m = 1300$, the entries for the FLSA-SVM and the RFLSA-SVM are both 0.000. This is due to the fact that the column rank of the dictionary matrix built on the full training set is less than 1235. At each iteration of both the FLSA-SVM and the RFLSA-SVM, the basis functions that can be expressed as a linear combination of the previously selected ones are identified and then pruned. As a result, no candidate basis function is in fact available any more for any setting of $m > 1235$.

Conclusion
While maintaining generalization performance competitive with the SVM and the Least-Square SVM (LS-SVM), the proposed Reduced Forward Least-Squares Approximation (RFLSA) SVM uses only a random sample of the training samples, rather than all of them, as the candidates for support vectors during the training procedure. This strategy of random selection was shown to be statistically justified. Meanwhile, once an RFLSA-SVM whose solution is spanned by $m$ support vectors has been trained, the training of a second RFLSA-SVM with $n$ support vectors, where $n > m$, requires primarily the computation associated with the additional $(n - m)$ support vectors, by inheriting the intermediate variables from training the RFLSA-SVM with $m$ support vectors. This technique, referred to as "contexts inheritance," reduces the time cost of tuning the regularization parameter and makes RFLSA-SVMs more computationally attractive. The technique can also be applied to the FLSA-SVM algorithm.
The experiments confirmed that, for the RFLSA-SVM and the FLSA-SVM algorithms, the technique of contexts inheritance made the tuning of the regularization parameter much faster than for the SVM and the LS-SVM.
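Here $\mathbf{1}$ is an ℓ-by-ℓ matrix of ones and $\mathbf{D} \in \mathbb{R}^{\ell \times \ell}$, where $D_{ij} = K_{ij} + 1$. The $i$th column of $\mathbf{D}$ can be viewed as the output on the ℓ training samples of the function $d(\mathbf{x}_i, \cdot) = k(\mathbf{x}_i, \cdot) + 1$, which is parameterized by the training sample $\mathbf{x}_i$. $d(\mathbf{x}_i, \cdot)$ is often referred to as a "basis function" and $\mathbf{D}$ as a dictionary of basis functions.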
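A minimal Python sketch of the dictionary and of a classifier built from it is given below. Two assumptions are made: the Gaussian kernel is taken in the form $\exp(-\|\mathbf{x} - \mathbf{z}\|^2 / (2\sigma^2))$, whose scaling may differ from the one used in the paper, and the decision is taken as the sign of the weighted sum of basis functions. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def gaussian_kernel(x: Sequence[float], z: Sequence[float], sigma: float) -> float:
    # Assumed form exp(-||x - z||^2 / (2 sigma^2)); the paper's scaling may differ.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def dictionary(X: Sequence[Sequence[float]], sigma: float) -> List[List[float]]:
    """D[i][j] = k(x_i, x_j) + 1, one basis function per training sample."""
    return [[gaussian_kernel(xi, xj, sigma) + 1.0 for xj in X] for xi in X]


def decide(z: Sequence[float], support: Sequence[Sequence[float]],
           weights: Sequence[float], sigma: float) -> int:
    """Sign of the weighted sum of basis functions (assumed decision rule)."""
    score = sum(w * (gaussian_kernel(x, z, sigma) + 1.0)
                for x, w in zip(support, weights))
    return 1 if score >= 0.0 else -1


X = [[0.0], [1.0]]
D = dictionary(X, sigma=1.0)
print(D[0][0], D[0][1])
print(decide([0.0], X, [1.0, -0.2], sigma=1.0))
```

Adding 1 to every kernel value absorbs the bias term into the basis functions, so the classifier needs no separate offset.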
It is noted that the submatrix $\mathbf{C}$ is also an upper triangular matrix. For the target vectors, it follows that $\mathbf{b}_n^\top = [\mathbf{b}_m^\top, (\mathbf{q}^{(m)}_{m+1})^\top \mathbf{y}^{(m)}, \ldots, (\mathbf{q}^{(n-1)}_{n})^\top \mathbf{y}^{(n-1)}]$. Hence, in order to construct the linear system $\mathbf{A}_n \boldsymbol{\alpha}_n = \mathbf{b}_n$, the following intermediate variables, produced from training the FLSA-SVM with $m$ basis functions, can simply be inherited: (a) the coefficient matrix $\mathbf{A}_m$, (b) the first $m$ residue matrices $\mathbf{R}_i$, $i = 1, \ldots, m$, and (c) the target vector $\mathbf{b}_m$. The residue matrices $\mathbf{R}_i$, $i = 1, \ldots, m$, make the determination of the matrix $\mathbf{B}$ fast. Thus the determination of the matrix $\mathbf{A}_n$ is primarily reduced to the determination of the upper triangular matrix $\mathbf{C}$.

Figure 2: (a) the two-spiral pattern recognized by the RFLSA-SVM using 180 support vectors with $\sigma = 1$ in (14); (b) the two-spiral pattern recognized by the RFLSA-SVM using 193 support vectors with $\sigma = 0.5$.

Figure 2(b) depicts the performance of the RFLSA-SVM with the parameter settings of $m = 193$, $\sigma = 0.5$, which achieved a LOOCV accuracy of 96.91%. It can be seen that the two-spiral pattern has been recognized smoothly. Meanwhile, Figure 2(a) shows the performance of the RFLSA-SVM using 180 support vectors. Although the decision boundaries are

Table 3: Number of support vectors (best in bold).

Table 4: The value of $m$ versus test correctness (%) for FLSA-SVMs and RFLSA-SVMs on the Image and Splice datasets.

Table 5: Training time (in CPU seconds) of the FLSA-SVM, the RFLSA-SVM, the SVM, and the LS-SVM on the Banana dataset.

Table 6: Training time (in CPU seconds) of the FLSA-SVM, the RFLSA-SVM, the SVMs, and the LS-SVM on the Splice dataset.

Table 7: Training time (in CPU seconds) of the FLSA-SVM, the RFLSA-SVM, the SVMs, and the LS-SVM on the Image benchmark.