A Faster Gradient Ascent Learning Algorithm for Nonlinear SVM

We propose a refined gradient ascent method including heuristic parameters for solving the dual problem of nonlinear SVM. Aiming at a better tuning to the particular training sequence, the proposed refinement consists of the use of heuristically established weights in correcting the search direction at each step of the learning algorithm that evolves in the feature space. We propose three variants for computing the correcting weights, their effectiveness being analyzed on an experimental basis in the final part of the paper. The tests pointed out good convergence properties, and moreover, the proposed modified variants exhibited higher convergence rates as compared to Platt's SMO algorithm. The experimental analysis aimed to derive conclusions on the recognition rate as well as on the generalization capacities. The learning phase of the SVM involved linearly separable samples randomly generated from Gaussian repartitions and the WINE and WDBC datasets. The generalization capacities in the case of artificial data were evaluated by several tests performed on new linearly/nonlinearly separable data coming from the same classes. The tests pointed out high recognition rates (about 97%) on artificial datasets and even higher recognition rates in the case of the WDBC dataset.


Introduction
According to the theory of SVMs, while traditional techniques for pattern recognition are based on the attempt to optimize the performance in terms of the empirical risk, SVMs minimize the structural risk, that is, the probability of misclassifying yet-to-be-seen patterns for a fixed but unknown probability distribution of the data [1][2][3][4]. The most distinguished and attractive features of this classification paradigm are the ability to condense the information contained in the training set and the use of families of decision surfaces of relatively low Vapnik-Chervonenkis dimension.
SVM approaches to classification lead to convex optimization problems, typically quadratic problems in a number of variables equal to the number of examples, and these optimization problems become challenging when the number of data points exceeds a few thousand.
To make SVM more practical, several algorithms have been developed, such as Vapnik's chunking and Osuna's decompositions [1, 5]. They make the training of SVM possible by breaking the large QP problem into a series of smaller QP problems and optimizing only a subset of training data patterns at each step. Because the subset of training data patterns optimized at each step is called the working set, these approaches are referred to as working set methods.
Recently, a series of specializations have been proposed, for instance, reduced support vector machines (RSVM) [6] and smooth support vector machines (SSVM) [7], as well as parallel implementations of SVM training [8]. Also, methods have been proposed to solve least squares SVM formulations [7][8][9][10], along with software packages such as SVM light [11], mysvm [12], and many others [3, 11, [13][14][15]. It is worth mentioning that a series of developments aimed to improve the accuracy of the resulting SVM classifier by combining it with boosting-type techniques [16, 17].
Assume that the data are represented by a finite set of labeled examples S = {(x_i, y_i), x_i ∈ R^d, y_i ∈ {−1, 1}, 1 ≤ i ≤ N} coming from two pattern classes h_1, h_2, and φ : R^d → R^{d'} is a vector-valued function representing the filter that extracts information from the current input. The function φ is usually referred to as the feature extractor, and R^{d'} is thought of as the feature space. Briefly, from a mathematical point of view, the problem of determining the parameters of an optimal margin classifier reduces to the quadratic programming (QP) problem [3]:

minimize (1/2)‖w‖², subject to y_i (w^T φ(x_i) + b) ≥ 1, 1 ≤ i ≤ N.   (1)

If (w*, b*) is a solution of (1), then the SVM classifier corresponds to the decision rule f : R^d → {−1, 1}, f(x) = sign(w*^T φ(x) + b*). The parameter b cannot be explicitly computed by solving the SVM problem, a convenient choice of b being derived in terms of the support vectors. Usually, a suitable value of b should be selected such that

1 − min_{i, y_i = 1} w*^T φ(x_i) ≤ b* ≤ −1 − max_{i, y_i = −1} w*^T φ(x_i)

holds, for instance [3]. In our work, we prefer to use a value of the parameter b* computed on a heuristic basis, aiming to take into account the available information about the variability of the subsamples coming from the classes [18].
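For illustration, the admissible interval for the bias stated above can be computed directly from a candidate weight vector. The sketch below assumes the identity feature extractor (φ(x) = x) for simplicity; the function name is ours, not the paper's.

```python
def bias_interval(w, X, y):
    # Admissible range for the bias b of a maximal-margin classifier,
    # assuming the identity feature extractor phi(x) = x:
    #   1 - min_{i: y_i = +1} w . x_i  <=  b  <=  -1 - max_{i: y_i = -1} w . x_i
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    lower = 1 - min(dot(w, X[i]) for i in range(len(X)) if y[i] == 1)
    upper = -1 - max(dot(w, X[i]) for i in range(len(X)) if y[i] == -1)
    return lower, upper
```

For the 1-D sample x = 2, 3 (label +1) and x = −2, −3 (label −1) with w = (1), the interval is [−1, 1]; any b inside it, e.g., the midpoint 0, yields a valid decision rule.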
The performance of the resulting classifier is essentially determined by the quality of the feature extractor φ, the main problem becoming the design of a particular informational feature extractor. One way of overcoming this difficulty is the "kernel trick." Basically, the method consists in selecting a suitable kernel that, on one hand, "hides" the explicit expression of φ and, on the other hand, allows working in a feature space of possibly very high dimension without increasing the computational complexity [3]. Usually, we assume a particular functional expression for the kernel that "hides" both the dimension of the feature space and the explicit expression of the feature extractor φ.
If we assume that the kernel K : R^d × R^d → [0, ∞) is given by K(x, x') = φ(x)^T φ(x') for a certain feature extractor φ, then K is a positive semidefinite Mercer kernel [4]. Then, the problem of determining the parameters of an optimal margin classifier corresponds to solving the dual optimization problem (2), where the objective function is

Q(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j K(x_i, x_j).

In our work, we use exponential-type kernels (RBF) and develop a modified gradient ascent method for solving the QP problem (2). The refinement considered in our developments comes from the use of weights in determining the direction of the search displacement at each step in order to get a better tuning to the particular training sequence. We propose three attempts at determining the weights, partially heuristic, and their corresponding performance is experimentally analyzed in the final section of the paper. In our developments, we implemented an SVM classifier of SVM light type [11].
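The way a kernel "hides" φ can be made concrete with the quadratic kernel K(x, z) = (x^T z)², whose explicit 2-D feature map is known in closed form. This is only a didactic illustration of the kernel trick, not the RBF kernel used in the paper:

```python
import math

def poly2_kernel(x, z):
    # K(x, z) = (x . z)^2 -- computed without ever forming phi explicitly
    return sum(a * b for a, b in zip(x, z)) ** 2

def phi(x):
    # Explicit feature map of the 2-D quadratic kernel:
    # phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so K(x, z) = phi(x) . phi(z)
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)
```

Evaluating `poly2_kernel` costs a dot product in R^2, while the same inner product computed through `phi` lives in R^3; for higher-degree kernels the gap in dimension grows rapidly, which is exactly the computational advantage the text describes.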

Modified Gradient Ascent Method for Learning Nonlinear SVM
In [19], we proposed a modified learning rule of gradient ascent type for linear SVM that can be extended to the nonlinear case as follows. Note that the Hessian matrix of the objective function is negative semidefinite. For a given learning rate η > 0, if α_old is the current value of the parameter α, the updating rule of a gradient-type learning algorithm is α_new = α_old + η ∇Q(α_old) (6). However, given that the constraint Σ_{i=1}^{N} y_i α_i = 0 must hold, the updating rule (6) should be modified so as to ensure that the new parameter still belongs to the space of feasible solutions. Our method can be briefly described as follows. Assume that α_{i1}, α_{i2} are the components of the current parameter vector α_old selected for being updated, 1 ≤ i1, i2 ≤ N. One then has to pick a pair satisfying (8) for which Q(α_new) − Q(α_old) is maximized. The stopping condition for the search process is controlled by a given threshold ε > 0, and it holds when at least one of the conditions (9) is satisfied.
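A minimal sketch of the scheme described above, under our own naming (`dual_gradient`, `pair_update`, with the argument `w1` standing in for the correction weight ω_1): the gradient of the dual objective is evaluated through the kernel, and only a selected pair of components is displaced, with the second displacement chosen so that the equality constraint Σ y_i α_i = 0 is preserved. The exact correction rule (6)-(8) is not fully recoverable from the text, so this is an assumption-laden reconstruction, not the paper's algorithm verbatim.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    # Exponential RBF kernel used in the paper: K(x, z) = exp(-gamma * ||x - z||)
    return math.exp(-gamma * math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z))))

def dual_gradient(alpha, X, y, kernel):
    # dQ/dalpha_i = 1 - y_i * sum_j alpha_j y_j K(x_i, x_j)
    n = len(X)
    return [1.0 - y[i] * sum(alpha[j] * y[j] * kernel(X[i], X[j]) for j in range(n))
            for i in range(n)]

def pair_update(alpha, y, grad, i1, i2, eta=0.1, w1=1.0):
    # Displace alpha_{i1} along the (weighted) gradient and compensate with
    # alpha_{i2} so that sum_i y_i alpha_i stays invariant:
    #   y_{i1}*t + y_{i2}*(-y_{i1}*y_{i2}*t) = 0
    t = eta * w1 * grad[i1]
    new = list(alpha)
    new[i1] += t
    new[i2] -= y[i1] * y[i2] * t
    return new
```

A full implementation would additionally clip the updated pair so that the positivity constraints α_i ≥ 0 remain satisfied, as SMO-type methods do.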

Strategies for Determining the Influence Weight in Implementing the Gradient Ascent Search
The choice of the values of the learning rate η and the weight parameter ω_1 should be such as to assure good performance from both points of view, accuracy and efficiency. In our tests, we used η ∈ [10^{−4}, 0.8].
Our research focused on several ways to compute the weight ω_1, all of them being expressed in terms of the first- and second-order statistics computed in the feature space. Let μ̂_{φ,1}, μ̂_{φ,2}, Σ̂_{φ,1}, Σ̂_{φ,2} be the sample means and sample covariance matrices computed on the basis of the samples labeled by 1 and −1 in the feature space, where φ is a particular feature extractor. We denote by K the kernel generated by φ, that is, K(x, x') = φ(x)^T φ(x'). We assume that the first n examples come from the first class and the next N − n examples come from the second class.

Concerning the choice of the weight parameter ω_1, we have to take into account that its particular expression should be justified by evidence or by mathematical arguments, and moreover its value should be computable in the feature space without increasing the computational complexity. We propose three variants for the expression of the weight parameter ω_1, estimated exclusively from data in terms of first- and second-order sample statistics and denoted ω_1^(1), ω_1^(2), ω_1^(3), given by (11), (12), and (13), respectively. The expression (11) is mostly heuristic, justified by geometric reasons, while the significance of (12) and (13) is supported by standard arguments coming from mathematical statistics (in terms of eigenvalues of sample covariance matrices and Fisher information, resp.). Note that the weight coefficients (11), (12), and (13) can be evaluated in the feature space using exclusively the values of the kernel on the available sample. Indeed, by straightforward computation, the coefficient ω_1^(1) can be expressed in terms of the kernel K as in (15), where the squared norms ‖μ̂_{φ,1}‖² and ‖μ̂_{φ,2}‖² are evaluated as averages of kernel values over the corresponding subsamples. Usually, the kernels are normalized, that is, ‖φ(x)‖² = K(x, x) = 1 for any x ∈ R^d. In the case of a normalized kernel, straightforward computations yield the expression (20) of the weight coefficient ω_1^(2). The evaluation of ω_1^(3) can be carried out similarly, yielding (24). Note that the weight coefficient ω_1^(3) is the extension of the Fisher coefficient to multidimensional repartitions.
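The kernel-only evaluation of the feature-space mean statistics can be sketched as follows (function names are ours). The identity ‖μ̂_{φ,1} − μ̂_{φ,2}‖² = ‖μ̂_{φ,1}‖² + ‖μ̂_{φ,2}‖² − 2⟨μ̂_{φ,1}, μ̂_{φ,2}⟩ reduces everything to double sums of kernel values, which is the ingredient used by weight expressions such as (15):

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    return math.exp(-gamma * math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z))))

def mean_norm_sq(sample, kernel):
    # ||mu_hat_phi||^2 = (1/n^2) * sum_i sum_j K(x_i, x_j):
    # squared norm of the feature-space sample mean, via kernel values only
    n = len(sample)
    return sum(kernel(a, b) for a in sample for b in sample) / (n * n)

def mean_dot(s1, s2, kernel):
    # <mu_hat_phi1, mu_hat_phi2> = (1/(n1*n2)) * sum_i sum_j K(x_i, x'_j)
    return sum(kernel(a, b) for a in s1 for b in s2) / (len(s1) * len(s2))

def between_means_dist_sq(s1, s2, kernel):
    # ||mu_hat_phi1 - mu_hat_phi2||^2 in the feature space
    return (mean_norm_sq(s1, kernel) + mean_norm_sq(s2, kernel)
            - 2.0 * mean_dot(s1, s2, kernel))
```

All three quantities cost O(n²) kernel evaluations on the sample, so the weights add no asymptotic overhead to the training loop, consistent with the complexity requirement stated above.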
Note that the generated data used for training were linearly separable. The comparative analysis aimed to establish conclusions about the performance of the modified gradient ascent algorithm in the feature space using the weight coefficients ω_1^(i), i = 1, 2, 3, against Platt's SMO method and the standard gradient ascent algorithm in the initial space.
Also, we aimed to evaluate (i) the influence of different values of the parameter γ on the number of iterations needed to obtain significant accuracy, (ii) the dependency of the number of iterations required to obtain significant accuracy on the distance between the classes the samples come from and on the sample variability, and (iii) the influence of different values of the parameter γ on the class separability index.
The variability of a sample coming from a certain class can be expressed in many ways. To quantitatively express the variability within a sample, we considered the indicator given by the mean distance between the feature vectors representing examples coming from that class. If S' is the subset of S containing the examples coming from one of the labeled classes, then the measure of the variability of S' is given by (26), which can be expressed in terms of the kernel function as in [18]. The class separability index is evaluated by (27).

Test 1. We aimed to derive conclusions of the previously mentioned types in the case of data simulated from multidimensional normal classes N(μ_i, Σ_i), i = 1, 2. The degree of closeness between the resulting datasets S^(1), S^(2) of sizes N_1, N_2 can be evaluated in many ways. One way is to express it in terms of the Mahalanobis distance between the generative distributions. We also consider two model-free indices expressing the degree of closeness using only the datasets, given by the sample Mahalanobis distance and ‖μ̂_1 − μ̂_2‖, respectively, where μ̂_1, μ̂_2, Σ̂_1, Σ̂_2 are the sample means and sample covariance matrices corresponding to S^(1), S^(2). For instance, the conclusions derived on the basis of the samples S^(1), S^(2) generated from Gaussian repartitions, where N_1 = 45, N_2 = 45, d(h_1, h_2) = 6.6301, d̂_{S1,S2}(h_1, h_2) = 9.8424, and d̃_{S1,S2}(h_1, h_2) = 0.0776, are summarized in Figures 1-3 and Table 1.
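The variability index (26) can be evaluated through kernel values alone: for a normalized kernel, ‖φ(x) − φ(x')‖² = K(x, x) + K(x', x') − 2K(x, x') = 2 − 2K(x, x'). A minimal sketch under this assumption (function names are ours):

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    # Normalized kernel: K(x, x) = 1 for every x
    return math.exp(-gamma * math.sqrt(sum((a - b) ** 2 for a, b in zip(x, z))))

def variability(sample, kernel):
    # Mean pairwise distance between the feature-space images of the
    # examples; for a normalized kernel ||phi(x)-phi(x')||^2 = 2 - 2K(x,x')
    n = len(sample)
    total = sum(math.sqrt(max(0.0, 2.0 - 2.0 * kernel(sample[i], sample[j])))
                for i in range(n) for j in range(i + 1, n))
    return 2.0 * total / (n * (n - 1))
```

Since larger γ shrinks the kernel values toward 0, the pairwise feature-space distances, and hence this index, grow with γ, which is consistent with the increasing dependency on γ reported in Figure 1.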
The samples proved comparable values of the variability index for each γ ∈ (0, 1], and the variability indices depend increasingly on γ (Figure 1).
The results obtained in solving the QP problem (2) using Platt's SMO algorithm and the variants of the gradient algorithm using the weight parameters given by (15), (20), and (24), respectively, are summarized in Table 1 and shown in Figure 2. Each entry a/b of Table 1 represents a = the number of iterations and b = the resulting estimate of the maximum value of (4) corresponding to a particular value of γ, obtained by applying one of the proposed variants of the algorithms.
According to the experimental results, we can conclude that, in order to obtain the same accuracy, the variants of the gradient ascent algorithm using the weight parameter ω_1 given by (15), (20), and (24) require a far smaller number of iterations. Moreover, as γ increases, the number of iterations decreases dramatically as compared to Platt's SMO algorithm.

Figure 1: The dependency on γ of the variability index (26).
In order to evaluate the recognition rate of the resulting SVM classifier, we used linearly and nonlinearly separable datasets coming from the same distributions. In the case of this example, the mean recognition rate was around 94%.
Similar results were obtained when "closer" or "farther" normal distributions were used to generate the datasets.
Test 2. We aim to develop a comparative analysis of the performance of Platt's SMO and the variants of gradient ascent algorithms presented in Section 2 on the WINE dataset [20]. The data in the WINE dataset are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars, the analysis determining the quantities of 13 constituents found in each of the three types of wines. The dataset consists of labeled examples of sizes 59, 71, and 48 coming from the pairwise linearly separable classes S^(1), S^(2), S^(3), respectively. In processing the WINE dataset, we used the kernel K(x, x') = exp{−γ‖x − x'‖}, for different values of γ > 0.
In order to implement the previously mentioned algorithms, we used the variability and separability indices (26) and (27) to find a suitable criterion for planning a two-class classification of these three subsamples. Unfortunately, the three subsamples proved very close values of the interclass separability index (27) for all values of γ; therefore, from this point of view, all two-class classifications seemed almost equivalent. The variation with respect to γ of the interclass separability index (27) is presented in Figure 4.
Fortunately, the three subsamples proved quite different variability in the sense of the variability index (26) for all values of γ, enabling us to formulate the plan: discriminate first between S^(3) and S^(1) ∪ S^(2), and then discriminate between S^(1) and S^(2).
The results obtained in solving the QP problem (2) using Platt's SMO algorithm and the variants of the gradient algorithm using the weight parameters given by (15), (20), and (24), respectively, are summarized in Tables 2 and 3 and shown in Figures 5 and 6. Each entry a/b of the tables represents a = the number of iterations and b = the resulting estimate of the maximum value of (4) corresponding to a particular value of γ, obtained by applying one of the proposed variants of the algorithms.
Table 2: The comparative analysis in terms of the number of iterations and the values of γ in discriminating between S^(3) and S^(1) ∪ S^(2). Entries report the number of iterations/the estimation of the maximum value of (4) for SMO and for the variants of the gradient algorithm using the weight parameters (15), (20), and (24).

Table 3: The comparative analysis in terms of the number of iterations and the values of γ in discriminating between S^(1) and S^(2). Entries report the number of iterations/the estimation of the maximum value of (4) for SMO and for the variants of the gradient algorithm using the weight parameters (15), (20), and (24).

In the case of this example, the recognition rate was 100% for all values of γ. Given the missing information about the measured features and the relatively small sizes of the subsamples coming from the three categories of wines, we had no possibility to test the generalization capacities of the resulting classifiers either on simulated data or by splitting the subsamples into design and test data, respectively.

Test 3. We performed a series of tests on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset [20], aiming to develop a comparative analysis of the performance of the variants of gradient ascent algorithms presented in Section 2 and Platt's SMO. We used the kernel K(x, x') = exp{−γ‖x − x'‖}, for different values of γ > 0, and the weighting coefficients given by (15), (20), and (24).
The examples in the WDBC dataset are 30-dimensional vectors representing features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, describing characteristics of the cell nuclei present in the image; a confirmed diagnosis, either benign (B) or malignant (M), is supplied for each example. It is stated that the dataset is linearly separable and relevant for the design of classifiers having good generalization capacities. The sizes of the subsamples labeled by B and M are 357 and 212, respectively.
The variation with respect to γ of the values of the variability and separability indices (26) and (27) is presented in Figures 7 and 8. The separability index (27) proves quite sensitive to the variation of the values of the parameter γ, pointing out an increasing dependency on γ. Moreover, for γ ≥ 0.5, the values of the separability index (27) stabilize around 1.41. Using the variability index (26), the tests revealed that the variability of the M-class seems insensitive to variations of γ, while the variability index of the B-class depends increasingly on γ, the values stabilizing around 0.705 for γ ≥ 0.5.
In the first series of tests, all examples were used in both the design and test phases. The results are summarized in Table 4 and Figure 9. Concerning the use of the weight coefficients (15), (20), and (24), accurate estimates of the maximum value of (4) were computed in a far smaller number of iterations than with Platt's SMO algorithm. A slightly better performance resulted for all values of γ in the case of using (24), the mean value of ω_1^(3) being around 0.99, while the mean values of ω_1^(1) and ω_1^(2) were around 0.5 and 0.4, respectively. From the point of view of the performance in discriminating between the B-class and the M-class, the recognition rate was 100% for all values of γ.
Given that the size of the WDBC dataset is relatively large, we used it to develop a comparative analysis of the proposed variants from the point of view of generalization capacity. In order to derive a suitable strategy for splitting the available data into design and test datasets, we took into consideration the relative relevance of the examples with respect to the class from which they come. We computed a prototype (barycenter) for each class by averaging the examples belonging to it, and the relative relevance of each example is expressed in terms of the Euclidean distance to the corresponding prototype.
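A sketch of this relevance-based splitting; the function names and the particular split size below are illustrative, not the paper's exact experimental plan:

```python
import math

def barycenter(sample):
    # Class prototype: componentwise average of the examples
    n, d = len(sample), len(sample[0])
    return [sum(x[k] for x in sample) / n for k in range(d)]

def split_by_relevance(sample, n_design):
    # Rank examples by Euclidean distance to the class barycenter
    # (closer = more relevant) and put the n_design LEAST relevant
    # ones in the design set, as in the splitting strategy above.
    c = barycenter(sample)
    dist = lambda x: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))
    ranked = sorted(sample, key=dist)          # most relevant first
    return ranked[-n_design:], ranked[:-n_design]   # (design, test)
```

Applied per class, this reproduces splits of the kind reported later, where the most relevant examples are reserved for testing and the least relevant ones are used for design.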
In order to establish suitable partitions into design and test subsets for each class, we used several strategies, the differences between them being given by the sizes and the relative relevance of the examples included in each subset.

In a series of tests, we considered an experimental plan of this type. The results are summarized in Table 5 and Figure 10. Concerning the usefulness of the proposed weight coefficients, the variants (15), (20), and (24) proved almost equal efficiency, the number of iterations required to obtain an accurate estimate of the maximum value of (4) being far smaller than in the case of Platt's SMO algorithm. By submitting the test sample to the resulting classifiers, we obtained correct recognition rates in the range [94.01%, 95.07%], the maximum value 95.07% being obtained for γ ∈ [0.1, 0.15] and the weight coefficient (20).
Several tests were performed using the same strategy for different sizes of the design and test datasets.

Conclusions and Suggestions for Further Work
In the paper, we propose a modified gradient ascent method for solving the dual problem of nonlinear SVM. Basically, the refinement proposed here consists of using weight parameters to tune the direction of the search to the particular training sequence. The work was based on the use of variability and separability indices expressed in terms of the exponential RBF kernel. A part of the comparative analysis aimed to evaluate the dependency of the expected number of iterations required to obtain reasonable accuracy of the criterion function on the kernel parameter. The proposed variants of the gradient ascent learning algorithm are somewhat heuristically justified, in the sense that there is no mathematically founded proof of their convergence properties. Therefore, several tests were performed in order to derive conclusions on an experimental basis. The tests pointed out good convergence properties of the modified variants, and their convergence rates were significantly higher as compared to Platt's SMO algorithm. The experimental analysis aimed to derive conclusions on the recognition rate as well as on the generalization capacities. All linear classifiers proved almost equal recognition rates and generalization capacities, the difference being given by the number of iterations required for learning the separating hyperplanes.
The learning phase of the SVM involved linearly separable samples randomly generated from Gaussian repartitions and the WINE and WDBC datasets. In order to evaluate the generalization capacities, in the case of samples randomly generated from Gaussian classes, several tests were also performed on new linearly/nonlinearly separable data coming from the same classes. Since the WINE and WDBC datasets are both linearly separable and no information concerning the generative model is supplied, an additional strategy for splitting them into design and test samples was required. In the case of the tests performed on the WINE dataset, given its relatively small size, the performance was analyzed using all samples in both the design and test phases, while the size of the WDBC dataset, being significantly larger, allowed us to develop different experimental plans by splitting the available data into design and test samples of different sizes.
Given the optimality of SVMs from the point of view of generalization capacities, as expected, we obtained high recognition rates on new test data in most cases (around 97%). In the case of the WDBC dataset, higher recognition rates were obtained when the design dataset was enlarged to contain more of the less relevant examples. For instance, a 100% recognition rate resulted using γ = 0.15 and 66.44% of the examples in the design set.
The tests pointed out that the variation of the recognition rates also depends on the inner structure of the classes from which the learning data come, as well as on the interclass separability degree. Consequently, we consider the results encouraging; they entail future work toward extending these refinements to multiclass classification problems and to approaches in a fuzzy-based framework.

Figure 2: The dependency on γ of the number of iterations required by Platt's SMO and the variants of the gradient ascent algorithms.

Figure 3: The dependency on γ of the class separability index (27).

Figure 5: The dependency on γ of the number of iterations required by Platt's SMO and the variants of the gradient ascent algorithms in discriminating between S^(3) and S^(1) ∪ S^(2).

Figure 6: The dependency on γ of the number of iterations required by Platt's SMO and the variants of the gradient ascent algorithms in discriminating between S^(1) and S^(2).

Figure 7: The dependency on γ of the variability index (26).

Figure 8: The dependency on γ of the separability index (27).

Figure 10: The dependency on γ of the number of iterations required by Platt's SMO and the variants of the gradient ascent algorithms.

Table 1: The comparative analysis in terms of the number of iterations and the values of γ.

In order to develop a comparative analysis of the proposed variants of the modified gradient-like algorithm in solving the QP problem (2), we performed a long series of tests on simulated data coming from Gaussian repartitions and on the public databases WINE and Wisconsin Diagnostic Breast Cancer (WDBC) [20], all tests involving the feature extractor φ corresponding to the RBF kernel K(x, x') = exp{−γ‖x − x'‖}.

Table 4: The comparative analysis in terms of the number of iterations and the values of γ. Entries report the number of iterations/the estimation of the maximum value of (4).

Table 6 and Figure 11 present the results of a test that used the design and test datasets obtained by including, from each class, the most relevant 10 examples in the test set and the least relevant 30 examples in the design sample. This way, the learning phase was developed on a design dataset containing 189 and 116 examples coming from the B-class and the M-class, respectively. The test phase was performed on a dataset containing 168 and 96 examples coming from the B-class and the M-class, respectively, the resulting recognition rates being in the range [95.08%, 96.59%], where the maximum value 96.59% was obtained for γ = 0.1.