The support vector machine (SVM) has been applied very successfully in a variety of classification systems. We attempt to solve the primal programming problems of SVM by converting them into smooth unconstrained minimization problems. In this paper, a new twice continuously differentiable piecewise-smooth function is proposed to approximate the plus function, yielding a piecewise-smooth support vector machine (PWSSVM). The method can efficiently handle large-scale and high-dimensional problems. Theoretical analysis demonstrates its advantages in efficiency and precision over other smooth functions. PWSSVM is solved by a fast Newton-Armijo algorithm. Experimental results illustrate the training speed and classification performance of the approach.
1. Introduction
In the last several years, support vector machine (SVM) has become one of the most promising learning machines because of its high generalization performance and wide applicability for classification, forecasting, and estimation in small-sample cases [1–6]. In addition, SVMs have surpassed the performance of artificial neural networks in many areas such as text categorization, speech recognition, and bioinformatics [7–11].
Basically, the main idea behind SVM is the construction of an optimal hyperplane, which has been used widely in classification [5, 8–16]. SVM training can be formulated as an unconstrained optimization problem [17–21], but the objective function is nonsmooth. To overcome this disadvantage, Lee and Mangasarian used the integral of the sigmoid function to obtain a smooth SVM (SSVM) model in 2001 [17]. This is an important result for SVM, since many well-known algorithms can then be applied. In 2005, Yuan et al. proposed two polynomial functions, namely, a smooth quadratic polynomial function and a smooth fourth-order polynomial function, and obtained the QPSSVM and FPSSVM models [18]. Xiong et al. derived an important recursive formula and a new class of smoothing functions using the technique of interpolation functions [19]. In 2007, Yuan et al. used a three-order spline function to smooth the objective function of the unconstrained optimization problem of SVM and obtained the TSSVM model [21]. However, the efficiency or the precision of these algorithms remained limited.
A natural question is whether another smooth function can yield a more efficient smooth SVM than existing works. In this paper, we introduce a piecewise function to smooth SVM and obtain a novel piecewise-smooth support vector machine (PWSSVM). Theoretical analysis shows that the approximation accuracy of the piecewise-smooth function to the plus function is higher than that of the existing smooth functions. Using strong convexity and level-set arguments, we prove the global convergence of PWSSVM and derive an upper bound on the approximation error of its solution. The fast Newton-Armijo algorithm [22, 23] is employed to train PWSSVM. The new method is implemented in batches and can efficiently handle large-scale and high-dimensional problems. Numerical experiments confirm the theoretical results and demonstrate that PWSSVM is more effective than previous smooth support vector machine models.
The paper is organized as follows. In Section 2, we state the pattern classification problem and describe the PWSSVM. The approximation performance of smooth functions to the plus function is compared in Section 3. The convergence performance of PWSSVM is given in Section 4. The Newton-Armijo algorithm is applied to train PWSSVM in Section 5. Section 6 presents numerical comparisons. Finally, brief conclusions are drawn.
In this paper, unless otherwise stated, all vectors are column vectors. For a vector x in the n-dimensional real space Rn, the plus function x+ is defined by (x+)i = max(0, xi), i = 1, …, n. The scalar (inner) product of two vectors x, y ∈ Rn is denoted by x·y, and the p-norm is denoted by ∥·∥p. For a matrix A ∈ Rm×n, Ai is the ith row of A, which is a row vector in Rn. A column vector of ones of appropriate dimension is denoted by e. If φ is a real-valued function defined on Rn, the gradient of φ at x is denoted by ∇φ(x), which is a row vector in Rn, and the n×n Hessian matrix of φ at x is denoted by ∇²φ(x).
2. Piecewise-Smooth Support Vector Machine
In this paper, we consider a binary classification problem with m training samples in the n-dimensional real space Rn. The samples are represented by the m×n matrix A, and the membership of each point Ai in class +1 or −1 is specified by a given m×m diagonal matrix D with +1 or −1 along its diagonal. For this problem, the standard SVM with a linear kernel is given by the following quadratic program with parameter υ > 0:
(1) min_{(w,γ,y)∈R^{n+1+m}} υeᵀy + (1/2)wᵀw
s.t. D(Aw − eγ) + y ≥ e, y ≥ 0,
where e is a vector of ones, w is the normal to the bounding planes, and γ determines their location relative to the origin. The linear separating plane is defined as follows:
(2) P = {x | x ∈ Rn, wᵀx = γ}.
The first term in the objective function of (1) is the 1-norm of the slack variable y with weight υ. Replacing this term with the squared 2-norm of y, and adding (1/2)γᵀγ to the objective function, induces strong convexity but has little or no effect on the problem. The SVM model is then replaced by the following problem:
(3) min_{(w,γ,y)∈R^{n+1+m}} (υ/2)yᵀy + (1/2)(wᵀw + γᵀγ)
s.t. D(Aw − eγ) + y ≥ e, y ≥ 0.
Let y = (e − D(Aw − eγ))+, where (·)+ replaces the negative components of a vector by zeros. Then the SVM problem (3) can be converted into the following unconstrained optimization problem:
(4) min_{(w,γ)∈R^{n+1}} (υ/2)∥(e − D(Aw − eγ))+∥₂² + (1/2)(wᵀw + γᵀγ).
This is a strongly convex minimization problem, and it has a unique solution. Let x+ = max{0, x}. The function (x)+ is continuous but nonsmooth, so many optimization algorithms based on derivatives and gradients cannot solve problem (4) directly.
In 2001, Lee and Mangasarian [17] employed the integral of the sigmoid function p(x,k) to approximate the nondifferentiable function x+ as follows:
(5) p(x,k) = x + (1/k)log(1 + ε^{−kx}), k > 0,
where ε is the base of the natural logarithm and k is a smoothing parameter. This gives the SSVM model.
In 2005, Yuan et al. [18] presented two polynomial functions as follows:
(6) q(x,k) = x, x ≥ 1/k,
(k/4)x² + (1/2)x + 1/(4k), −1/k < x < 1/k,
0, x ≤ −1/k,
h(x,k) = x, x ≥ 1/k,
−(1/(16k))(kx + 1)³(kx − 3), −1/k < x < 1/k,
0, x ≤ −1/k,
with k > 0.
Using the above smooth functions to approximate the plus function x+, they obtained two smooth polynomial support vector machine models (FPSSVM and QPSSVM). The authors also showed that FPSSVM and QPSSVM were more effective than SSVM [18].
In 2007, Yuan et al. [21] presented a three-order spline function as follows:
(7) T(x,k) = 0, x < −1/k,
(k²/6)x³ + (k/2)x² + (1/2)x + 1/(6k), −1/k ≤ x < 0,
−(k²/6)x³ + (k/2)x² + (1/2)x + 1/(6k), 0 ≤ x ≤ 1/k,
x, x > 1/k.
They used the smooth function T(x,k) to approximate the plus function and obtained a new smooth SVM model, TSSVM. However, the efficiency or the precision of these algorithms remained limited.
In this paper, we propose a novel smooth function φ(x,k) with smoothing parameter k > 0 to approximate the plus function x+ as follows:
(8) φ(x,k) = 0, x < −1/(3k),
(3/2)k²(x + 1/(3k))³, −1/(3k) ≤ x < 0,
x + (3/2)k²(1/(3k) − x)³, 0 ≤ x ≤ 1/(3k),
x, x > 1/(3k).
The first- and second-order derivatives of φ(x,k) are
(9) ∇φ(x,k) = 0, x < −1/(3k),
(9/2)k²(x + 1/(3k))², −1/(3k) ≤ x < 0,
1 − (9/2)k²(1/(3k) − x)², 0 ≤ x ≤ 1/(3k),
1, x > 1/(3k),
∇²φ(x,k) = 0, x < −1/(3k),
9k²(x + 1/(3k)), −1/(3k) ≤ x < 0,
9k²(1/(3k) − x), 0 ≤ x ≤ 1/(3k),
0, x > 1/(3k).
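To make the definitions concrete, the smoothing function (8) and its derivatives (9) can be coded directly. The following NumPy sketch (an illustration, not the authors' MATLAB implementation) also checks numerically that φ(x,k) ≥ x+ and that the largest gap is φ(0,k) = 1/(18k):

```python
import numpy as np

def phi(x, k):
    # piecewise-smooth approximation (8) of the plus function, k > 0
    x = np.asarray(x, dtype=float)
    a = 1.0 / (3.0 * k)
    return np.where(x < -a, 0.0,
           np.where(x < 0.0, 1.5 * k**2 * (x + a)**3,
           np.where(x <= a, x + 1.5 * k**2 * (a - x)**3, x)))

def phi_grad(x, k):
    # first derivative from (9)
    x = np.asarray(x, dtype=float)
    a = 1.0 / (3.0 * k)
    return np.where(x < -a, 0.0,
           np.where(x < 0.0, 4.5 * k**2 * (x + a)**2,
           np.where(x <= a, 1.0 - 4.5 * k**2 * (a - x)**2, 1.0)))

def phi_hess(x, k):
    # second derivative from (9); it is continuous, so phi is C^2
    x = np.asarray(x, dtype=float)
    a = 1.0 / (3.0 * k)
    return np.where(x < -a, 0.0,
           np.where(x < 0.0, 9.0 * k**2 * (x + a),
           np.where(x <= a, 9.0 * k**2 * (a - x), 0.0)))

k = 10.0
xs = np.linspace(-1.0, 1.0, 20001)
plus = np.maximum(xs, 0.0)
assert np.all(phi(xs, k) >= plus)                          # phi(x,k) >= x_+
assert np.max(phi(xs, k) - plus) <= 1.0/(18.0*k) + 1e-12   # gap peaks at x = 0
```

The three pieces agree at the breakpoints ±1/(3k) and 0 up to the second derivative, which is the twice continuous differentiability used later by the Newton-Armijo method.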
The solution of the problem (3) can be obtained by solving the following smooth unconstrained optimization problem with the smoothing parameter k approaching infinity as
(10) min_{(w,γ)∈R^{n+1}} Φk(w,γ) = (υ/2)∥φ(e − D(Aw − eγ), k)∥₂² + (1/2)(wᵀw + γᵀγ).
Thus, we develop a new smooth approximation for problem (3).
3. Approximation Performance Analysis of Smooth Functions
In this section, we will compare the approximation performance of smooth functions to plus function.
Lemma 1 (see [<xref ref-type="bibr" rid="B17">17</xref>]).
Let p(x,k) be the integral of the sigmoid function defined in (5), and let x+ be the plus function. The following conclusions hold:
(1) p(x,k) is smooth of arbitrary order in x;
(2) p(x,k) ≥ x+;
(3) for ρ > 0 and |x| < ρ, p(x,k)² − x+² ≤ (log2/k)² + (2ρ/k)log2.
Lemma 2 (see [<xref ref-type="bibr" rid="B18">18</xref>]).
Let q(x,k) and h(x,k) be defined as in (6), and let x+ be the plus function. The following conclusions hold:
(1) q(x,k) is once continuously differentiable in x, and h(x,k) is twice continuously differentiable in x;
(2) q(x,k) ≥ x+ and h(x,k) ≥ x+;
(3) for any x ∈ R, q(x,k)² − x+² ≤ 1/(11k²) and h(x,k)² − x+² ≤ 1/(19k²).
Lemma 3 (see [<xref ref-type="bibr" rid="B19">19</xref>]).
Let T(x,k) be defined as in (7), and let x+ be the plus function. The following results are easily obtained:
(1) T(x,k) is twice continuously differentiable in x;
(2) T(x,k) ≥ x+;
(3) for any x ∈ R, T(x,k)² − x+² ≤ 1/(24k²).
Theorem 4.
The piecewise approximation function φ(x,k) defined in (8) has the following properties:
(1) φ(x,k) is twice continuously differentiable in x;
(2) for any x ∈ R, φ(x,k) ≥ x+;
(3) for any x ∈ R, φ(x,k)² − x+² ≤ 1/(216k²).
Proof.
(1) The result follows directly from formulas (8) and (9).
(2) We verify that φ(x,k) ≥ (x)+ ≥ 0. (i) The equation φ(x,k) = (x)+ holds for x ∈ (−∞, −1/(3k)) ∪ (1/(3k), ∞). (ii) Since φ(x,k) is monotonically increasing, for x ∈ [−1/(3k), 0) we have φ(x,k) − (x)+ = φ(x,k) ≥ φ(−1/(3k), k) = 0. (iii) For x ∈ [0, 1/(3k)], we have φ(x,k) − (x)+ = (3/2)k²(1/(3k) − x)³ ≥ 0. Hence φ(x,k) ≥ (x)+ ≥ 0.
(3) For x ∈ (−∞, −1/(3k)) ∪ (1/(3k), ∞), φ(x,k)² − x+² = 0, so the inequality in conclusion (3) holds trivially.
For −1/(3k) ≤ x ≤ 0, x+ = 0, so φ(x,k)² − x+² = φ(x,k)². Because φ(x,k) is positive, continuous, and increasing on this interval, we have φ(x,k)² ≤ φ(0,k)² = 1/(324k²) ≤ 1/(216k²).
For 0<x≤1/3k, let
(11) s(x) = φ(x,k)² − x+² = (x + (3/2)k²(1/(3k) − x)³)² − x² = (9/4)k⁴(x − 1/(3k))⁶ − 3k²x(x − 1/(3k))³.
Making the variable substitution t = kx (so that t ∈ (0, 1/3)), we obtain s(t) = (3/k²)[(3/4)(t − 1/3)⁶ − t(t − 1/3)³]. On (0, 1/3), the maximum of s(t) is attained at t ≈ 0.0605, so s(t) = φ(x,k)² − (x)+² ≤ s(0.0605) ≈ 0.0046/k² ≤ 1/(216k²).
In conclusion, we have φ(x,k)2-x+2≤1/216k2.
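The bound just proved can also be checked numerically. The following sketch (illustrative, not part of the original paper) evaluates φ(x,k)² − x+² on a fine grid for several k and confirms both the bound 1/(216k²) and the location t = kx ≈ 0.0605 of the maximizer:

```python
import numpy as np

def phi(x, k):
    # piecewise-smooth function (8)
    a = 1.0 / (3.0 * k)
    x = np.asarray(x, dtype=float)
    return np.where(x < -a, 0.0,
           np.where(x < 0.0, 1.5 * k**2 * (x + a)**3,
           np.where(x <= a, x + 1.5 * k**2 * (a - x)**3, x)))

for k in (1.0, 10.0, 100.0):
    xs = np.linspace(-1.0 / k, 1.0 / k, 400001)
    gap = phi(xs, k)**2 - np.maximum(xs, 0.0)**2
    # Theorem 4(3): the gap never exceeds 1/(216 k^2) ...
    assert np.all(gap <= 1.0 / (216.0 * k**2) + 1e-15)
    # ... and is maximized near t = kx = 0.0605, with value about 0.0046/k^2
    i = int(np.argmax(gap))
    assert abs(xs[i] * k - 0.0605) < 1e-3
    assert abs(gap[i] * k**2 - 0.0046) < 1e-3
print("bound 1/(216 k^2) verified for k = 1, 10, 100")
```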
Theorem 5.
Let ρ = 1/k with k > 0. Then the bounds of Lemmas 1–3 and Theorem 4 can be compared as follows.
If the smooth function is defined as (5), then by Lemma 1, we have
(12) p(x,k)² − x+² ≤ (log2/k)² + (2ρ/k)log2 = (log²2 + 2log2)(1/k²) ≈ 1.8667(1/k²).
If the smooth functions are defined as (6), by Lemma 2,
(13) q(x,k)² − x+² ≤ 1/(11k²) ≈ 0.0909(1/k²), h(x,k)² − x+² ≤ 1/(19k²) ≈ 0.0526(1/k²).
If the smooth function is defined as (7), by Lemma 3,
(14) T(x,k)² − x+² ≤ 1/(24k²) ≈ 0.0417(1/k²).
If the smooth function is defined as (8), by Theorem 4,
(15) φ(x,k)² − x+² ≤ 1/(216k²) ≈ 0.0046(1/k²).
Theorem 5 shows that the proposed piecewise-smooth function φ(x,k) achieves the best degree of approximation to the plus function x+ among the functions above. When k is fixed, the different smoothing capabilities of these functions are easy to compare. The comparison is shown in Figure 1, where the smoothing parameter is k = 10 and ρ = 1/k.
Comparison of approximation performance of smooth functions (k=10).
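The comparison of Theorem 5 can be reproduced numerically. The following sketch (an illustration; the function names p, q, h, T, phi mirror definitions (5)–(8)) computes the largest value of f(x)² − x+² for each smoothing function on |x| ≤ ρ = 1/k and checks it against the stated bounds:

```python
import numpy as np

k = 10.0
xs = np.linspace(-1.0 / k, 1.0 / k, 200001)   # Lemma 1 assumes |x| < rho = 1/k
plus = np.maximum(xs, 0.0)

def p(x):   # integral of the sigmoid (5), with natural logarithm
    return x + np.log1p(np.exp(-k * x)) / k

def q(x):   # quadratic polynomial in (6)
    return np.where(x >= 1/k, x,
           np.where(x > -1/k, (k*x + 1)**2 / (4*k), 0.0))

def h(x):   # fourth-order polynomial in (6)
    return np.where(x >= 1/k, x,
           np.where(x > -1/k, -(k*x + 1)**3 * (k*x - 3) / (16*k), 0.0))

def T(x):   # three-order spline (7)
    return np.where(x < -1/k, 0.0,
           np.where(x < 0, (k**2/6)*x**3 + (k/2)*x**2 + x/2 + 1/(6*k),
           np.where(x <= 1/k, -(k**2/6)*x**3 + (k/2)*x**2 + x/2 + 1/(6*k), x)))

def phi(x): # piecewise-smooth function (8)
    a = 1.0 / (3.0 * k)
    return np.where(x < -a, 0.0,
           np.where(x < 0, 1.5*k**2*(x + a)**3,
           np.where(x <= a, x + 1.5*k**2*(a - x)**3, x)))

# largest value of f(x)^2 - x_+^2, scaled by k^2, versus the bounds in (12)-(15)
sup = {f.__name__: float(np.max(f(xs)**2 - plus**2)) * k**2
       for f in (p, q, h, T, phi)}
bounds = {'p': np.log(2)**2 + 2*np.log(2), 'q': 1/11, 'h': 1/19,
          'T': 1/24, 'phi': 1/216}
for name in sup:
    assert sup[name] <= bounds[name] + 1e-9
# the ordering of Theorem 5: phi approximates best, p worst
assert sup['phi'] < sup['T'] < sup['h'] < sup['q'] < sup['p']
```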
4. Convergence Performance of PWSSVM
In this section, the convergence of PWSSVM is analyzed. Using strong convexity and level-set arguments, we prove that the solution of PWSSVM closely approximates the optimal solution of the original model (4) as k goes to infinity. Furthermore, a formula for an upper bound on the approximation error is deduced.
Theorem 6.
Let A ∈ Rm×n and b ∈ Rm×1. Define the real-valued functions on the n-dimensional real space Rn as follows:
(16) f(x) = (1/2)∥(Ax − b)+∥₂² + (1/2)∥x∥₂²,
g(x,k) = (1/2)∥φ(Ax − b, k)∥₂² + (1/2)∥x∥₂²,
where φ(·) is defined in (8), k>0. Then we have the following results:
(1) f(x) and g(x,k) are strongly convex functions;
(2) there exists a unique solution x* of min_{x∈Rn} f(x) and a unique solution xk* of min_{x∈Rn} g(x,k);
(3) for any k > 0, xk* and x* satisfy
(17) ∥xk* − x*∥₂² ≤ m/(432k²);
(4) lim_{k→∞} ∥xk* − x*∥ = 0.
Proof.
(1) For any k > 0, f(x) and g(x,k) are strongly convex because (1/2)∥x∥₂² is strongly convex and the remaining terms are convex.
(2) Let Lv(f(x)) and Lv(g(x,k)) be the level sets of f(x) and g(x,k), respectively. Since x+ ≤ φ(x,k), it is easy to obtain Lv(g(x,k)) ⊂ Lv(f(x)) ⊂ {x | ∥x∥₂² ≤ 2v}. Therefore, Lv(f(x)) and Lv(g(x,k)) are compact subsets of Rn. By the strong convexity of f(x) and g(x,k) for k > 0, there is a unique solution of min_{x∈Rn} f(x) and of min_{x∈Rn} g(x,k), respectively.
(3) By the first-order optimality conditions and the strong convexity of f(x) and g(x,k), we have
(18) f(xk*) − f(x*) ≥ ∇f(x*)(xk* − x*) + (1/2)∥xk* − x*∥₂² = (1/2)∥xk* − x*∥₂²,
g(x*,k) − g(xk*,k) ≥ ∇g(xk*,k)(x* − xk*) + (1/2)∥xk* − x*∥₂² = (1/2)∥xk* − x*∥₂².
Adding the two formulas above and noticing that φ(x,k) ≥ x+, we have
(19) ∥xk* − x*∥₂² ≤ f(xk*) − f(x*) + g(x*,k) − g(xk*,k) = (g(x*,k) − f(x*)) − (g(xk*,k) − f(xk*)) ≤ g(x*,k) − f(x*) = (1/2)∥φ(Ax* − b, k)∥₂² − (1/2)∥(Ax* − b)+∥₂².
Applying the third result of Theorem 4 to each of the m components, we obtain the conclusion ∥xk* − x*∥₂² ≤ m/(432k²).
(4) From ∥xk* − x*∥₂² ≤ m/(432k²), we have lim_{k→∞} xk* = x*.
5. The Newton-Armijo Algorithm for PWSSVM
Following the results of the previous sections, the objective function of problem (10) is twice continuously differentiable. To take advantage of this feature, we use the Newton-Armijo method to train PWSSVM, since it is faster than the BFGS algorithm [18, 19, 21]. The Newton-Armijo algorithm for problem (10) works as follows.
Step 1 (initialization). Choose a start point (w0, γ0) ∈ Rn+1, a tolerance τ > 0, and set i = 0.
Step 2. Compute the gradient gi = ∇Φk(wi, γi).
Step 3 (Newton direction). If ∥gi∥₂ ≤ τ, stop and accept (wi, γi). Otherwise, compute the Newton direction di ∈ Rn+1 from the following linear system:
(20) ∇²Φk(wi, γi) di = −(gi)ᵀ,
where “ᵀ” denotes transposition.
Step 4 (Armijo stepsize).
Choose a stepsize λi = max{1, 1/2, 1/4, …} such that
(21) Φk(wi, γi) − Φk((wi, γi) + λi di) ≥ −ρ λi gi di,
where ρ∈(0,1/2) and set
(22)(wi+1,γi+1)=(wi,γi)+λidi.
Step 5.
Replace i by i+1 and go to Step 2.
In our smooth approach we only need to solve the linear system (20) instead of a quadratic program. Because the objective function is strongly convex, it is not difficult to show that the Newton-Armijo algorithm for training PWSSVM converges globally to the unique solution [17, 23]. Hence, the choice of start point is not critical. In this paper, we always set (w0, γ0) = e, where e denotes a column vector of ones of appropriate dimension.
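For the linear case, the whole training procedure can be sketched compactly. The code below is a hypothetical NumPy reimplementation of Steps 1–5 (not the authors' MATLAB code); the toy dataset and the parameter values nu and k are illustrative assumptions:

```python
import numpy as np

def phi(x, k):
    a = 1.0 / (3.0 * k)
    return np.where(x < -a, 0.0,
           np.where(x < 0.0, 1.5 * k**2 * (x + a)**3,
           np.where(x <= a, x + 1.5 * k**2 * (a - x)**3, x)))

def dphi(x, k):
    a = 1.0 / (3.0 * k)
    return np.where(x < -a, 0.0,
           np.where(x < 0.0, 4.5 * k**2 * (x + a)**2,
           np.where(x <= a, 1.0 - 4.5 * k**2 * (a - x)**2, 1.0)))

def d2phi(x, k):
    a = 1.0 / (3.0 * k)
    return np.where(x < -a, 0.0,
           np.where(x < 0.0, 9.0 * k**2 * (x + a),
           np.where(x <= a, 9.0 * k**2 * (a - x), 0.0)))

def train_pwssvm(A, d, nu=10.0, k=100.0, tol=1e-8, rho=1e-4, max_iter=50):
    # Newton-Armijo (Steps 1-5) for the smooth objective (10), linear kernel.
    # d holds the class labels +1/-1 (the diagonal of D); z = (w, gamma).
    m, n = A.shape
    M = d[:, None] * np.hstack([A, -np.ones((m, 1))])   # D [A, -e]
    z = np.ones(n + 1)                                  # start point (w0, gamma0) = e

    def obj(z):
        r = 1.0 - M @ z
        return 0.5 * nu * np.sum(phi(r, k)**2) + 0.5 * (z @ z)

    for _ in range(max_iter):
        r = 1.0 - M @ z
        g = -nu * (M.T @ (phi(r, k) * dphi(r, k))) + z   # gradient of (10)
        if np.linalg.norm(g) <= tol:                     # Step 3 stopping rule
            break
        # Hessian of (10): nu * M^T diag(dphi^2 + phi*d2phi) M + I
        w_diag = dphi(r, k)**2 + phi(r, k) * d2phi(r, k)
        H = nu * (M.T @ (w_diag[:, None] * M)) + np.eye(n + 1)
        dz = np.linalg.solve(H, -g)                      # Newton direction (20)
        lam = 1.0                                        # Armijo stepsize (21)
        while obj(z + lam * dz) > obj(z) + rho * lam * (g @ dz) and lam > 1e-10:
            lam *= 0.5
        z = z + lam * dz
    return z[:-1], z[-1]                                 # w, gamma

# usage on an illustrative, well-separated toy problem
rng = np.random.default_rng(0)
A = np.vstack([rng.normal(2.0, 0.5, (100, 2)), rng.normal(-2.0, 0.5, (100, 2))])
d = np.hstack([np.ones(100), -np.ones(100)])
w, gamma = train_pwssvm(A, d)
acc = float(np.mean(np.sign(A @ w - gamma) == d))
print("training accuracy:", acc)
```

Because the objective is strongly convex, the Newton system (20) is always solvable and the Armijo backtracking guarantees descent at every iteration.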
The PWSSVM described above solves linear classification problems. In fact, the results of Section 2 can be extended to a nonlinear PWSSVM with the kernel technique, as in [17]. Furthermore, the Newton-Armijo algorithm can also solve the nonlinear PWSSVM successfully.
6. Numerical Experiments
The Newton-Armijo method cannot be applied to the QPSSVM model because that model lacks second-order derivatives. Moreover, the classification capability of FPSSVM is slightly better than that of QPSSVM [18–21], so we do not include QPSSVM in the comparison. To demonstrate the effectiveness and speed of PWSSVM, we numerically compare SSVM, FPSSVM, TSSVM, and PWSSVM. The four smooth SVMs are all trained by the fast Newton-Armijo algorithm. All experiments are run on a personal computer with a 3.0 GHz processor and a maximum of 1.99 GB of memory available. The programs for PWSSVM, FPSSVM, and TSSVM are written in MATLAB. The computer runs Windows 7 with MATLAB 7.0.1. The source code of SSVM, “ssvm.m,” is obtained from the authors’ website for the linear problem [24], and “lsvmk.m” for the nonlinear problem. In our experiments, all input data and program variables are kept in memory. For SSVM, TSSVM, FPSSVM, and PWSSVM, an optimality tolerance of 10⁻⁸ is used to determine when to terminate. A Gaussian kernel is used in all experiments.
The first experiment demonstrates the capability of PWSSVM on larger problems. Table 1 compares the training correctness, the testing correctness, and the training time of the four smooth SVMs on massive datasets. The datasets are created using Musicant's NDC Data Generator [25] with different sizes; the test samples are 5% of the training samples. The results show that PWSSVM has the highest training and testing accuracy. Furthermore, PWSSVM solves the problems more quickly than the other three smooth SVMs when the number of samples is relatively small.
PWSSVM compared with SSVM, FPSSVM, and TSSVM on NDC generated datasets of different sizes (C=10).
Trains/dimension | Algorithm | Train correctness (%) | Test correctness (%) | Time (s)
2,000,000/10 | SSVM | 90.86 | 91.25 | 278.97
2,000,000/10 | FPSSVM | 90.86 | 91.25 | 367.46
2,000,000/10 | TSSVM | 90.98 | 91.33 | 342.45
2,000,000/10 | PWSSVM | 91.34 | 91.75 | 339.64
2,000,000/20 | SSVM | 87.64 | 87.08 | 417.64
2,000,000/20 | FPSSVM | 87.88 | 88.05 | 449.28
2,000,000/20 | TSSVM | 87.89 | 88.05 | 446.45
2,000,000/20 | PWSSVM | 88.14 | 88.36 | 449.20
10,000/100 | SSVM | 94.26 | 93.60 | 11.17
10,000/100 | FPSSVM | 94.78 | 93.60 | 11.33
10,000/100 | TSSVM | 94.77 | 93.77 | 6.24
10,000/100 | PWSSVM | 94.85 | 95.52 | 4.26
10,000/1000 | SSVM | 96.67 | 86.20 | 56.22
10,000/1000 | FPSSVM | 96.69 | 86.14 | 66.52
10,000/1000 | TSSVM | 96.69 | 86.16 | 26.69
10,000/1000 | PWSSVM | 97.94 | 86.23 | 18.73
The second experiment demonstrates the effectiveness of PWSSVM on the “tried and true” checkerboard dataset [26]. The checkerboard is a highly nonlinearly separable but simple example that has often been used to show the effectiveness of nonlinear kernel methods [17]. The dataset is generated by uniformly discretizing the region [0,1]×[0,1] into 200² = 40,000 points and labeling the two classes, “White” and “Black,” on a 3×3 grid, as Figure 2 shows.
The figure of the checkerboard dataset.
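For readers who wish to reproduce the setup, a sketch of the checkerboard construction is given below. It is an assumption-laden illustration: the alternation count s is a parameter (the text mentions a 3×3 spacing, while the classic checkerboard of [26] uses 4×4), and the 1000-point training split mirrors the description in the experiment:

```python
import numpy as np

def checkerboard(n_grid=200, s=3):
    # n_grid x n_grid points uniformly discretizing [0,1]x[0,1];
    # labels alternate over an s x s checkerboard pattern
    # (s = 3 follows the text; the classic dataset of [26] uses s = 4).
    u = (np.arange(n_grid) + 0.5) / n_grid
    xx, yy = np.meshgrid(u, u)
    X = np.column_stack([xx.ravel(), yy.ravel()])
    cell = np.floor(X[:, 0] * s) + np.floor(X[:, 1] * s)
    d = np.where(cell % 2 == 0, 1.0, -1.0)   # "White" = +1, "Black" = -1
    return X, d

X, d = checkerboard()
assert X.shape == (200**2, 2)                # 200^2 = 40,000 points

# illustrative 1000-point training split; the rest would form the test set
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=1000, replace=False)
X_train, d_train = X[idx], d[idx]
X_test, d_test = np.delete(X, idx, axis=0), np.delete(d, idx)
```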
In the first trial of this experiment, the training set contains 1000 points randomly sampled from the checkerboard (for comparison, they are obtained from [17]), comprising 514 “white” samples and 486 “black” samples; the remaining 39,000 points form the testing set. The Gaussian kernel function K(x,y) = exp(−0.5∥x−y∥²) is used and C = 100. The total time for the 1000-point training set using PWSSVM with a Gaussian kernel is 4.01 s. The training accuracy of PWSSVM is 99.80%, and its test accuracy on the 39,000-point test set is 98.76%. TSSVM solves the same problem in 4.23 s, with training and test accuracies of 99.62% and 98.28%. FPSSVM and SSVM obtain a training accuracy of 99.60% in 4.35 s and 4.61 s, respectively; both achieve a test accuracy of 98.28%.
The remaining results are presented in Table 2. The training sets are randomly selected from the checkerboard with different sizes, and the remaining samples are used for testing. We compare the classification results of PWSSVM, TSSVM, FPSSVM, and SSVM with the same Gaussian kernel function. The results in Table 2 show that PWSSVM solves massive problems quickly, followed by TSSVM, FPSSVM, and SSVM in turn, and that PWSSVM obtains the highest training and test precision.
PWSSVM compared with SSVM, FPSSVM, and TSSVM on the checkerboard dataset with different sizes (C=100).
Training size | Algorithm | Train correctness (%) | Test correctness (%) | Time (s)
1000 | SSVM | 99.60 | 98.28 | 4.61
1000 | FPSSVM | 99.60 | 98.28 | 4.35
1000 | TSSVM | 99.62 | 98.28 | 4.23
1000 | PWSSVM | 99.80 | 98.76 | 4.01
2000 | SSVM | 98.84 | 98.54 | 11.97
2000 | FPSSVM | 99.22 | 98.59 | 9.63
2000 | TSSVM | 99.22 | 98.59 | 9.35
2000 | PWSSVM | 99.22 | 98.68 | 9.55
3000 | SSVM | 98.47 | 98.88 | 25.21
3000 | FPSSVM | 98.64 | 98.88 | 21.58
3000 | TSSVM | 98.68 | 98.90 | 17.21
3000 | PWSSVM | 98.85 | 98.94 | 17.36
5000 | SSVM | 99.34 | 99.51 | 47.09
5000 | FPSSVM | 99.48 | 99.51 | 39.29
5000 | TSSVM | 99.52 | 99.51 | 38.76
5000 | PWSSVM | 99.79 | 99.65 | 38.23
7. Conclusions
A novel PWSSVM is proposed in this paper. It only requires finding the unique minimizer of an unconstrained, differentiable, strongly convex function. The proposed method has advantages over existing smooth SVMs, such as good classification performance and lower training time. The numerical results show that PWSSVM has excellent generalization ability.
Acknowledgments
The authors would like to thank the anonymous reviewers for their valuable comments. This work was supported in part by the National Natural Science Foundation of China under Grants 61100165, 61100231, and 51205309 and the Natural Science Foundation of Shaanxi Province (2010JQ8004, 2012JQ8044).
References
Vapnik V. N.
Vapnik V. N.
Müller K. R., Smola A. J., Rätsch G., Schölkopf B., Kohlmorgen J., Vapnik V., “Using support vector machines for time series prediction.”
Schölkopf B., Burges J., Smola A.
Farooq T., Guergachi A., Krishnan S., “Knowledge-based Green's kernel for support vector regression.”
Zheng J., Lu B.-L., “A support vector machine classifier with automatic confidence and its application to gender classification.”
Zhu J. Y., Ren B., Zhang H. X., Deng Z. T., “Time series prediction via new support vector machines,” Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC '02), November 2002, pp. 364–366.
Ramana J., Gupta D., “LipocalinPred: a SVM-based method for prediction of lipocalins.”
Joachims T., “Text categorization with support vector machines: learning with many relevant features,” Proceedings of the 10th European Conference on Machine Learning (ECML '98), Springer, Heidelberg, Germany, 1998, pp. 137–142.
Jaume P. S., Darío M. I., Fernando D. M., “Support vector machines for continuous speech recognition,” Proceedings of the 14th European Signal Processing Conference (EUSIPCO '06), Florence, Italy, September 2006.
Bevilacqua V., Pannarale P., Abbrescia M., Cava C., “Comparison of data-merging methods with SVM attribute selection and classification in breast cancer gene expression.”
Spinosa E. J., Carvalho A. C., “Support vector machines for novel class detection in bioinformatics.”
Nurettin A., Cüneyt G.
Sun Y. F., Fan X. D., Li Y. D., “Identifying splicing sites in eukaryotic RNA: support vector machine approach.”
Lin H. J., Yeh J. P., “Optimal reduction of solutions for support vector machines.”
Christmann A., Hable R., “Consistency of support vector machines using additive kernels for additive models.”
Shao Y. H., Deng N. Y., “A coordinate descent margin based-twin support vector machine for classification.”
Lee Y.-J., Mangasarian O. L., “SSVM: a smooth support vector machine for classification.”
Yuan Y. B., Yan J., Xu C. X., “Polynomial smooth support vector machine (PSSVM).”
Yuan Y. B., Huang T. Z.
Xiong J. Z., Hu J. L., Yuan H. Q., Hu T. M., Li G. M., “Research on a new class of functions for smoothing support vector machines.”
Yuan Y., Fan W., Pu D., “Spline function smooth support vector machine for classification.”
Bertsekas D. P.
Xu C., Zhang J., “A survey of quasi-Newton equations and quasi-Newton methods for optimization.”
Musicant D. R., Mangasarian O. L., “LSVM: Lagrangian support vector machine,” 2000, http://www.cs.wisc.edu/dmi/svm/.
Musicant D. R., “NDC: normally distributed clustered datasets,” 1998, http://www.cs.wisc.edu/~musicant/data/ndc/.
Ho T. K., Kleinberg E. M., “Checkerboard dataset,” 1996, http://www.cs.wisc.edu/~musicant/data/ndc/.