A Simpler Approach to Coefficient Regularized Support Vector Machines Regression



1. Introduction
Recall the regression setting in learning theory. Let X be a compact subset of ℝ^n and Y ⊂ [−M, M] for some M > 0. Let ρ be an unknown probability distribution on Z := X × Y, and let z := {z_i}_{i=1}^m = {(x_i, y_i)}_{i=1}^m ∈ Z^m be a set of samples drawn independently according to ρ. Given the samples z, the regression problem aims to find a function f_z : X → ℝ such that f_z(x) is a satisfactory estimate of the output y when a new input x is given.
The classical SVMR proposed by Vapnik and his coworkers [2, 3] is given by the following regularization scheme:

f_{z,γ} := arg min_{f ∈ H_K} { E_z(f) + γ ‖f‖_K^2 },   (5)

where H_K denotes the reproducing kernel Hilbert space of a Mercer kernel K on X with norm ‖·‖_K and K_x := K(x, ·), E_z(f) := (1/m) ∑_{i=1}^m V(y_i, f(x_i)) is the empirical error with respect to z, V(y, t) := max(|y − t| − ε, 0) is the ε-insensitive loss, and γ > 0 is a regularization parameter. It is well known, see, for example, [4, Proposition 6.21], that the solution is of the form

f_{z,γ} = ∑_{i=1}^m α_i K_{x_i},

where the coefficients α = (α_1, …, α_m) are a solution of the associated dual optimization problem.

Remark 1. The equality constraint ∑_{i=1}^m α_i = 0 needed in [4, Proposition 6.21] is superfluous since we do not include an offset term b in the primal problem (5).
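To make the scheme concrete, here is a minimal numerical sketch (not from the paper) of scheme (5) in its representer form: the objective (1/m)∑_i V(y_i, (Kα)_i) + γ α^T K α is minimized over the coefficient vector α by crude subgradient descent. The Gaussian kernel, the toy data, the step size, and all parameter values are illustrative assumptions, and the non-smooth loss makes this a rough solver for illustration only.

```python
import numpy as np

def gaussian_kernel(X1, X2, width=0.5):
    """Gram matrix K(x, x') = exp(-|x - x'|^2 / (2 * width^2))."""
    d2 = (X1[:, None, :] - X2[None, :, :]) ** 2
    return np.exp(-d2.sum(axis=2) / (2.0 * width ** 2))

def eps_insensitive(y, t, eps=0.1):
    """V(y, t) = max(|y - t| - eps, 0), the epsilon-insensitive loss."""
    return np.maximum(np.abs(y - t) - eps, 0.0)

def svmr_rkhs(X, y, gamma=0.1, eps=0.1, steps=5000, lr=0.05):
    """Crude subgradient descent for scheme (5) in representer form:
    minimize (1/m) sum_i V(y_i, (K a)_i) + gamma * a^T K a over a in R^m."""
    m = len(y)
    K = gaussian_kernel(X, X)
    a = np.zeros(m)
    for _ in range(steps):
        r = K @ a - y                        # residuals f(x_i) - y_i
        s = np.sign(r) * (np.abs(r) > eps)   # subgradient of V w.r.t. f(x_i)
        grad = K @ s / m + 2.0 * gamma * (K @ a)
        a -= lr * grad
    return a, K

# Toy data: y = sin(2*pi*x) + noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(30)
a, K = svmr_rkhs(X, y)
print("empirical eps-insensitive error:", eps_insensitive(y, K @ a).mean())
```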
The mathematical analysis of algorithm (5) is well understood, via various techniques, in an extensive literature; see, for example, [5–7]. In this paper, we are interested in a different regularized SVMR algorithm: in our setting, the regularizer is not the RKHS norm but an ℓ^q-norm of the coefficients in the kernel ensemble.
Then, the SVMR with ℓ^q-coefficient regularization that we study in this paper takes the form

f_{z,λ} := arg min_{f ∈ H_{K,z}} { E_z(f) + λ Ω_z(f) },   (9)

where the sample-dependent hypothesis space and the coefficient-based regularizer are

H_{K,z} := { ∑_{i=1}^m α_i K_{x_i} : α_i ∈ ℝ },   Ω_z(f) := ∑_{i=1}^m |α_i|^q  for f = ∑_{i=1}^m α_i K_{x_i},

and λ > 0 is a regularization parameter.

Remark 3. The regularization parameter λ in (9) may be different from γ in scheme (5), but a relationship between λ and γ will be given in Section 3 as we derive the learning rate of algorithm (9).
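For comparison, here is a matching sketch for scheme (9), again an assumption-laden illustration rather than the paper's algorithm: the same empirical error, but with the coefficient penalty λ∑_i |α_i|^q in place of the RKHS norm. Kernel, data, and solver settings are arbitrary choices.

```python
import numpy as np

def gaussian_kernel(X1, X2, width=0.5):
    d2 = (X1[:, None, :] - X2[None, :, :]) ** 2
    return np.exp(-d2.sum(axis=2) / (2.0 * width ** 2))

def coef_regularized_svmr(X, y, lam=0.01, q=1.0, eps=0.1, steps=5000, lr=0.05):
    """Crude subgradient descent for scheme (9):
    minimize (1/m) sum_i V(y_i, (K a)_i) + lam * sum_i |a_i|^q over a in R^m."""
    m = len(y)
    K = gaussian_kernel(X, X)
    a = np.zeros(m)
    for _ in range(steps):
        r = K @ a - y
        s = np.sign(r) * (np.abs(r) > eps)               # subgradient of the loss part
        pen = q * np.sign(a) * np.abs(a) ** (q - 1) if q > 1 else np.sign(a)
        a -= lr * (K @ s / m + lam * pen)                # penalty subgradient for |a_i|^q
    return a

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(30)
a = coef_regularized_svmr(X, y, lam=0.01, q=1.0)
print("Omega_z(f) = sum |a_i|^q =", np.sum(np.abs(a)))
```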
Learning with coefficient-based regularization has attracted considerable attention in recent years, both in theoretical analysis and in applications. It was pointed out in [8] that taking q < 2, and especially the limit value q = 1, makes the minimization procedure in (9) promote sparsity of the solution; that is, it tends to produce a solution with only a few nonzero coefficients [9]. This phenomenon has also been observed for the LASSO algorithm [10] and in the compressed sensing literature [11].
However, it should be noted that there are essential differences between the learning schemes (9) and (5). On the one hand, the regularizer Ω_z(f) is not a Hilbert space norm, which causes a technical difficulty for the mathematical analysis. On the other hand, both the hypothesis space H_{K,z} and the regularizer Ω_z(f) depend on the samples z; this increases the flexibility and adaptivity of algorithm (9), but it also means that the standard error analysis methods for scheme (5) are no longer appropriate for scheme (9). To overcome these difficulties, [12] introduces a Banach space H of all functions of the form f = ∑_i α_i K_{u_i} with u_i ∈ X, equipped with a norm given by the infimum of ∑_i |α_i| over all such representations. An error analysis framework is established there, and a series of papers then investigated the performance of kernel learning schemes with coefficient regularization (see [13–16]).
In those works, an L_τ condition imposed on the marginal distribution ρ_X of ρ on X plays a critical role in the error analysis. A probability measure ρ_X on X is said to satisfy the L_τ condition if there exist some τ > 0 and c_τ > 0 such that

ρ_X({x ∈ X : |x − u| ≤ s}) ≤ c_τ s^τ,  ∀u ∈ X, s > 0.   (12)

In general, the index τ is hard to estimate. If X satisfies some regularity conditions (such as an interior cone condition) and ρ_X is the uniform distribution on X, then (12) holds with τ = n. This leads to a low convergence rate that depends on n, the dimension of the input space X, which is often large in learning problems.
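As a quick illustration of why τ typically equals the dimension, the following Monte Carlo check (not part of the paper's analysis, and assuming (12) takes the small-ball form written above) estimates the ρ_X-mass of a ball for the uniform distribution on the cube [0,1]^n; the radius, center, and sample sizes are arbitrary choices.

```python
import numpy as np

def ball_mass_uniform_cube(n, s, u=None, n_samples=200_000, seed=0):
    """Monte Carlo estimate of rho_X({x : |x - u| <= s}) for X = [0,1]^n, rho_X uniform."""
    rng = np.random.default_rng(seed)
    if u is None:
        u = np.full(n, 0.5)               # center of the cube
    x = rng.uniform(0.0, 1.0, size=(n_samples, n))
    return np.mean(np.linalg.norm(x - u, axis=1) <= s)

s = 0.3
for n in (1, 2, 5, 10):
    mass = ball_mass_uniform_cube(n, s)
    print(f"n={n:2d}: rho_X(ball of radius {s}) ~ {mass:.5f}   vs   s^n = {s**n:.2e}")
# The mass scales like s^n up to a dimension-dependent constant, so (12) holds with
# tau = n, which is large in high dimension and slows down rates that rely on it.
```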
In this paper, we succeed in removing the L_τ condition (12) and provide a simpler error analysis for scheme (9). The novelty of our analysis is a stepping-stone technique applied to bound the hypothesis error. As a result, we derive an explicit learning rate for (9) under very mild conditions.

2. Error Decomposition and Hypothesis Error
The main purpose of this paper is to provide a convergence analysis of the learning scheme (9). With respect to the ε-insensitive loss V, the prediction ability of a measurable function f is measured by the following generalization error:

E(f) := ∫_Z V(y, f(x)) dρ = ∫_X ∫_Y V(y, f(x)) dρ(y | x) dρ_X,

where ρ_X is the marginal distribution of ρ on X and ρ(· | x) is the conditional probability measure at x induced by ρ. Let f* be a minimizer of E(f) among all measurable functions on X. It was proved in [6] that |f*(x)| ≤ M + ε for almost every x ∈ X.
To make full use of this feature of the target function f*, one can introduce a projection operator, which has been used extensively in the error analysis of learning algorithms; see, for example, [17, 18].

Definition 4. The projection operator π = π_{M+ε} is defined on the space of measurable functions f : X → ℝ by

π(f)(x) := M + ε if f(x) > M + ε;  π(f)(x) := f(x) if |f(x)| ≤ M + ε;  π(f)(x) := −(M + ε) if f(x) < −(M + ε).

It is easy to see that V(y, π(f)(x)) ≤ V(y, f(x)), so

E(π(f)) ≤ E(f),   E_z(π(f)) ≤ E_z(f).   (15)

We thus take π(f_{z,λ}) instead of f_{z,λ} as our empirical target function and analyze the related learning rates.
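A minimal sketch of the projection: on predictions it amounts to clipping to [−(M+ε), M+ε], and the monotonicity property (15) can be checked empirically. The values of M and ε and the random test data below are assumptions of this illustration.

```python
import numpy as np

M, eps = 1.0, 0.1

def project(t, level=M + eps):
    """pi_{M+eps}(f)(x): clip the prediction to [-(M+eps), M+eps]."""
    return np.clip(t, -level, level)

def eps_insensitive(y, t):
    return np.maximum(np.abs(y - t) - eps, 0.0)

# Check V(y, pi(f)(x)) <= V(y, f(x)) on random y in [-M, M] and unbounded predictions.
rng = np.random.default_rng(0)
y = rng.uniform(-M, M, 10_000)
t = 5.0 * rng.standard_normal(10_000)      # raw, possibly out-of-range predictions
assert np.all(eps_insensitive(y, project(t)) <= eps_insensitive(y, t) + 1e-12)
print("projection never increases the eps-insensitive loss")
```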

2.1. Error Decomposition
The error decomposition is a useful approach to the error analysis of regularized learning schemes. With the sample-dependent hypothesis space H_{K,z}, [12] proposes a modified error decomposition with an extra hypothesis error term, by introducing a regularizing function

f_μ := arg min_{f ∈ H_K} { E(f) + μ ‖f‖_K^2 },  μ > 0.   (16)

We can conduct the error decomposition for scheme (9) with the same underlying idea as [12].
Proposition 5. Let f_{z,λ} and f_μ be defined by (9) and (16). Then

E(π(f_{z,λ})) − E(f*) ≤ S(z, λ, μ) + P(z, λ, μ) + D(μ).   (17)

Here,

S(z, λ, μ) := {E(π(f_{z,λ})) − E_z(π(f_{z,λ}))} + {E_z(f_μ) − E(f_μ)},
P(z, λ, μ) := {E_z(π(f_{z,λ})) + λ Ω_z(f_{z,λ})} − {E_z(f_μ) + μ ‖f_μ‖_K^2},
D(μ) := E(f_μ) − E(f*) + μ ‖f_μ‖_K^2 = inf_{f ∈ H_K} {E(f) − E(f*) + μ ‖f‖_K^2}.   (18)

Proof. A direct computation shows that

E(π(f_{z,λ})) − E(f*) ≤ E(π(f_{z,λ})) − E(f*) + λ Ω_z(f_{z,λ})
= {E(π(f_{z,λ})) − E_z(π(f_{z,λ}))} + {E_z(π(f_{z,λ})) + λ Ω_z(f_{z,λ}) − E_z(f_μ) − μ ‖f_μ‖_K^2} + {E_z(f_μ) − E(f_μ)} + {E(f_μ) − E(f*) + μ ‖f_μ‖_K^2}
= S(z, λ, μ) + P(z, λ, μ) + D(μ).   (19)

This proves the proposition.
S(z, λ, μ) is usually called the sample error; it will be estimated by a concentration inequality in the next section. D(μ) is independent of the sample and is often called the approximation error; the decay of D(μ) as μ → 0 characterizes the approximation ability of H_K. We will assume that, for some 0 < β ≤ 1 and c_β > 0,

D(μ) ≤ c_β μ^β,  ∀μ > 0.   (20)

Remark 6. Since E(f) − E(f*) ≤ ‖f − f*‖_{L^1_{ρ_X}}, D(μ) concerns the approximation of f* in L^1_{ρ_X} by functions from H_K. In fact, (20) is satisfied when f* lies in some interpolation space of the pair (L^1_{ρ_X}, H_K) (see, e.g., [19, 20]).
P(z, λ, μ) is called the hypothesis error, since the regularizing function f_μ may not belong to the hypothesis space H_{K,z}. The major contribution we make in this paper is a simpler estimate of P(z, λ, μ), obtained via a stepping stone between f_{z,λ} and f_{z,μ}.

2.2. Hypothesis Error Estimate
The solution f_{z,λ} of scheme (9) has a representation similar to that of f_{z,γ} in scheme (5), so it is reasonable to expect close relations between the two schemes; the latter may thus play a role in the analysis of the former.

Theorem 7. Let λ, μ > 0 and 1 ≤ q < ∞. Then

P(z, λ, μ) ≤ mλ (2mμ)^{−q}.   (21)

Proof. Let f_{z,μ} = ∑_{i=1}^m α_i K_{x_i} be the solution to (5) with regularization parameter μ. By the dual constraints (7), |α_i| ≤ 1/(2mμ) for each i, so Ω_z(f_{z,μ}) = ∑_{i=1}^m |α_i|^q ≤ m (2mμ)^{−q}. Since f_{z,μ} ∈ H_{K,z}, the definition of f_{z,λ} in (9) together with (15) gives

E_z(π(f_{z,λ})) + λ Ω_z(f_{z,λ}) ≤ E_z(f_{z,λ}) + λ Ω_z(f_{z,λ}) ≤ E_z(f_{z,μ}) + λ Ω_z(f_{z,μ}).

On the other hand, since f_μ ∈ H_K, the definition of f_{z,μ} as the minimizer of (5) yields E_z(f_{z,μ}) + μ ‖f_{z,μ}‖_K^2 ≤ E_z(f_μ) + μ ‖f_μ‖_K^2. Combining the two estimates,

P(z, λ, μ) ≤ λ Ω_z(f_{z,μ}) − μ ‖f_{z,μ}‖_K^2 ≤ mλ (2mμ)^{−q}.

This proves the theorem.
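The following numerical sketch (an illustration under assumed kernel, data, and solver choices, not the paper's argument) shows the stepping-stone comparison at work: since f_{z,μ} belongs to H_{K,z}, the penalized empirical objective of (9) evaluated at an approximation of f_{z,λ} should not exceed its value at f_{z,μ}.

```python
import numpy as np

def gaussian_kernel(X, width=0.5):
    d2 = (X[:, None, :] - X[None, :, :]) ** 2
    return np.exp(-d2.sum(axis=2) / (2.0 * width ** 2))

def V(y, t, eps=0.1):
    return np.maximum(np.abs(y - t) - eps, 0.0)

def subgrad_solve(K, y, penalty_grad, eps=0.1, steps=8000, lr=0.05):
    """Generic crude solver: minimize (1/m) sum_i V(y_i, (Ka)_i) + penalty over a."""
    m, a = len(y), np.zeros(len(y))
    for _ in range(steps):
        r = K @ a - y
        s = np.sign(r) * (np.abs(r) > eps)
        a -= lr * (K @ s / m + penalty_grad(a))
    return a

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(30)
K = gaussian_kernel(X)
lam, mu, q, M, eps = 0.01, 0.05, 1.0, 1.0, 0.1   # q = 1 penalty used throughout

# Approximate f_{z,lambda} (scheme (9)) and f_{z,mu} (scheme (5) with parameter mu).
a_lam = subgrad_solve(K, y, lambda a: lam * np.sign(a))      # lam * sum |a_i| penalty
a_mu  = subgrad_solve(K, y, lambda a: 2.0 * mu * (K @ a))    # mu * a^T K a penalty

def emp_err(a, clip=False):
    t = K @ a
    if clip:
        t = np.clip(t, -(M + eps), M + eps)                  # pi_{M+eps}
    return V(y, t, eps).mean()

lhs = emp_err(a_lam, clip=True) + lam * np.sum(np.abs(a_lam) ** q)
rhs = emp_err(a_mu) + lam * np.sum(np.abs(a_mu) ** q)
# For the exact minimizer of (9), lhs <= rhs, since f_{z,mu} is a feasible candidate.
print(f"E_z(pi(f_z,lambda)) + lam*Omega_z(f_z,lambda) = {lhs:.4f}")
print(f"E_z(f_z,mu)         + lam*Omega_z(f_z,mu)     = {rhs:.4f}")
```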
Remark 8. The stepping-stone method was first introduced in [21] for the error analysis of linear programming SVM classifiers. The technique was also used in [22, 23] to study ℓ^q coefficient regularized least squares regression with q ∈ [1, 2]. In this paper, we extend the index q to the larger range [1, +∞), which should help improve the understanding of these coefficient-based regularized algorithms.
Remark 9. Theorem 7 presents a simpler approach to estimating the hypothesis error. In contrast to the earlier literature (see, e.g., [13–16]), we conduct the estimation without imposing any assumptions on the input space X, the kernel K, or the marginal distribution ρ_X.

3. Sample Error and Learning Rate
This section is devoted to estimating the sample error S(z, λ, μ) and deriving the learning rate of algorithm (9).

3.1. Sample Error Estimate

We will adopt some results from the literature to estimate the sample error. To this end, we need some definitions and assumptions. For a measurable function g : Z → ℝ, denote Eg := ∫_Z g(z) dρ.
Assume that there exist an exponent θ with 0 < θ < 2 and a constant c_θ > 0 such that condition (26) holds. We now set out to bound the sample error. Write S(z, λ, μ) as

S(z, λ, μ) = S_1(z, μ) + S_2(z, λ),  where S_1(z, μ) := E_z(f_μ) − E(f_μ),  S_2(z, λ) := E(π(f_{z,λ})) − E_z(π(f_{z,λ})).

Applying [6, Proposition 4.1], we obtain the following estimate for S_1(z, μ).

Lemma 12. For any t > 0, under assumption (26), with confidence 1 − 2e^{−t} one has the corresponding bound on S_1(z, μ).

The estimation for S_2(z, λ) is based on the following concentration inequality (Lemma 13), which can be found in [25].
We may apply Lemma 13 to a set of functions F_R with R > 0, where each g ∈ F_R is induced by some f ∈ B_R.

Proposition 14. If assumptions (26) and (31) are satisfied, then for any t > 0, with confidence 1 − e^{−t}, the corresponding sample error bound holds for all f ∈ B_R.

Proof. Each function g ∈ F_R has the form g(z) = V(y, π(f)(x)) − V(y, f*(x)) for some f ∈ B_R. We can easily see that ‖g‖_∞ ≤ 2(M + ε) and Eg = E(π(f)) − E(f*) ≥ 0. Together with assumption (26), all the conditions in Lemma 13 hold, and we know that, for any t > 0, with confidence 1 − e^{−t}, the concentration estimate of Lemma 13 holds for every g ∈ F_R. Recall an elementary Young-type inequality for positive numbers; applying it, with a = [E(π(f)) − E(f*)]^θ and a suitable choice of the remaining quantities, to the first term of (44), we can derive the conclusion.
It remains to find a ball containing f_{z,λ} for all z ∈ Z^m.

Lemma 15. Let 1 ≤ q < ∞ and let f_{z,λ} be defined by (9). Then, for any z ∈ Z^m, one has a corresponding bound on ‖f_{z,λ}‖.

It is easy to check that (51) still holds for q = 1. Letting the remaining parameter tend to 0, we then get the assertion.
Remark 19. Another advantage of the coefficient-based regularization scheme is its flexibility in choosing the kernel. For instance, [26, 27] consider least squares regression with indefinite kernels and an ℓ^2-coefficient regularization, where the kernel is only required to be a continuous and uniformly bounded bivariate function on X × X. Extending the method of this paper to the indefinite kernel setting will be a very interesting topic for future work.
Let us end this paper by comparing our result with the learning rate presented in [7] in the special case ε = 0. To this end, we reformulate [7, Theorem 2.3] for ε = 0 as follows.