Learning Rates for $\ell^1$-Regularized Kernel Classifiers

We consider a family of classification algorithms generated from a regularization kernel scheme associated with an $\ell^1$-regularizer and a convex loss function. Our main purpose is to provide an explicit convergence rate for the excess misclassification error of the produced classifiers. The error decomposition includes the approximation error, the hypothesis error, and the sample error. We apply some novel techniques to estimate the hypothesis error and the sample error. Learning rates are eventually derived under some assumptions on the kernel, the input space, the marginal distribution, and the approximation error.


Introduction
Let $X$ be a compact subset of $\mathbb{R}^n$ and $Y = \{-1, 1\}$. Classification algorithms produce binary classifiers $\mathcal{C} : X \to Y$; such a classifier $\mathcal{C}$ assigns a label $\mathcal{C}(x) \in Y$ to each point $x \in X$. The prediction power of the classifier $\mathcal{C}$ is measured by its misclassification error. If $\rho$ is a probability distribution on $Z := X \times Y$, then the misclassification error of $\mathcal{C}$ is defined by
$$\mathcal{R}(\mathcal{C}) := \operatorname{Prob}\{\mathcal{C}(x) \neq y\} = \int_X P\left(y \neq \mathcal{C}(x) \mid x\right) d\rho_X.$$
Here $\rho_X$ is the marginal distribution on $X$ and $P(\cdot \mid x)$ is the conditional probability measure at $x$ induced by $\rho$. The classifier minimizing the misclassification error is called the Bayes rule $f_c$ and is given by
$$f_c(x) = \begin{cases} 1, & \text{if } P(y = 1 \mid x) \ge P(y = -1 \mid x), \\ -1, & \text{otherwise}. \end{cases}$$
The classifiers considered in this paper have the form $\operatorname{sgn}(f)$, defined as $\operatorname{sgn}(f)(x) = 1$ if $f(x) \ge 0$ and $\operatorname{sgn}(f)(x) = -1$ if $f(x) < 0$, induced by real-valued functions $f : X \to \mathbb{R}$. These functions are generated from a regularization scheme associated with a convex loss function (see [1]).

Definition 1. A continuous function $\phi : \mathbb{R} \to \mathbb{R}_+$ is called a classifying loss (function) if it is convex, differentiable at $0$ with $\phi'(0) < 0$, and $1$ is the smallest real number at which $\phi$ takes the value zero.
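The two quantities just introduced, the misclassification error and the Bayes rule, can be made concrete with a small simulation. The sketch below uses a toy one-dimensional model; the uniform marginal, the conditional probability $P(y = 1 \mid x) = x$, and the sample size are illustrative assumptions, not objects from the paper.

```python
import numpy as np

# Toy model (an illustrative assumption, not from the paper):
# X uniform on [0, 1] and P(y = 1 | x) = x, so the Bayes rule labels +1
# exactly where P(y = 1 | x) >= P(y = -1 | x), i.e. where x >= 1/2.
rng = np.random.default_rng(0)
m = 100_000
x = rng.uniform(0.0, 1.0, size=m)
y = np.where(rng.uniform(size=m) < x, 1, -1)

def misclassification_error(classifier, x, y):
    """Monte Carlo estimate of R(C) = Prob{C(x) != y}."""
    return np.mean(classifier(x) != y)

bayes = lambda x: np.where(x >= 0.5, 1, -1)       # Bayes rule f_c for this model
always_plus = lambda x: np.ones_like(x, dtype=int)  # naive constant classifier

print("Bayes rule error  :", misclassification_error(bayes, x, y))        # ~0.25
print("constant +1 error :", misclassification_error(always_plus, x, y))  # ~0.50
```

On this model the Bayes rule labels $+1$ exactly on $[1/2, 1]$, and its error $\int_0^1 \min\{x, 1-x\}\,dx = 1/4$ is approached by the Monte Carlo estimate, while the naive constant classifier stays near $1/2$.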
The following concept describes the increment behavior of $\phi$.
Definition 2. One says that $\phi$ has an increment exponent $p \ge 1$ if there exists some $c_p > 0$ such that
$$\left|\phi'_{\pm}(t)\right| \le c_p \left(|t| + 1\right)^{p-1}, \qquad \forall t \in \mathbb{R}, \tag{3}$$
where $\phi'_{\pm}$ denote the right and left derivatives of $\phi$.
It is easy to see that the hinge loss $\phi_h(t) = \max\{0, 1-t\}$, the least squares loss $\phi_{ls}(t) = (1-t)^2$, and the $r$-norm hinge loss $\phi_r(t) = (\max\{0, 1-t\})^r$ with $r \ge 1$ satisfy Definition 2 with increment exponents $1$, $2$, and $r$, respectively.
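As a sanity check on Definition 1 (rather than on the paper's analysis), the following sketch evaluates these three losses on a grid and verifies convexity, $\phi(1) = 0$, and $\phi'(0) < 0$ numerically; the grid, the tolerance, and the choice $r = 3$ are arbitrary.

```python
import numpy as np

# The three classifying losses mentioned in the text, written in their
# standard forms (the paper's own displays are assumed to coincide).
def hinge(t):              # phi_h, increment exponent 1
    return np.maximum(0.0, 1.0 - t)

def least_squares(t):      # phi_ls, increment exponent 2
    return (1.0 - t) ** 2

def r_norm_hinge(t, r=3):  # phi_r, increment exponent r (assumed form)
    return np.maximum(0.0, 1.0 - t) ** r

# Check the requirements of Definition 1 on a grid: convexity (nonnegative
# second differences), phi(1) = 0, and a negative derivative at 0.
t = np.linspace(-3.0, 3.0, 601)
for name, phi in [("hinge", hinge), ("least squares", least_squares),
                  ("3-norm hinge", r_norm_hinge)]:
    vals = phi(t)
    convex = np.all(np.diff(vals, 2) >= -1e-12)
    slope_at_0 = (phi(1e-6) - phi(-1e-6)) / 2e-6   # central difference at 0
    print(f"{name:14s} convex={convex}  phi(1)={phi(1.0):.1e}  phi'(0)~{slope_at_0:+.2f}")
```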
Given a sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m} \in Z^m$ drawn independently according to $\rho$, we call
$$\mathcal{E}_{\mathbf{z}}(f) := \frac{1}{m} \sum_{i=1}^{m} \phi\left(y_i f(x_i)\right)$$
the empirical error of $f$ with respect to $\mathbf{z}$. Regularized learning schemes are implemented by minimizing a penalized version of the empirical error over a set of functions, called a hypothesis space.
Definition 3. Given a classifying loss $\phi$ and a hypothesis space $\mathcal{H}$, a penalty functional $\Omega : \mathcal{H} \to \mathbb{R}_+$, called a regularizer, reflects the constraints imposed on functions from $\mathcal{H}$. The regularized classifier is then defined as $\operatorname{sgn}(f_{\mathbf{z}})$, where $f_{\mathbf{z}}$ is a minimizer of the following regularization scheme:
$$f_{\mathbf{z}} := \arg\min_{f \in \mathcal{H}} \left\{ \mathcal{E}_{\mathbf{z}}(f) + \lambda\, \Omega(f) \right\}. \tag{5}$$
Here $\lambda$ is a regularization parameter which may depend on the sample size $m$, $\lambda = \lambda(m)$, with $\lim_{m \to \infty} \lambda(m) = 0$.
Choosing different hypothesis spaces and regularizers in (5) leads to different regularization algorithms. These learning algorithms are often based on a kernel function $K : X \times X \to \mathbb{R}$ (see, e.g., [10]). One natural choice arises when $K$ is a Mercer kernel. Such a kernel is continuous, symmetric, and positive semidefinite on $X \times X$. The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with the Mercer kernel $K$ is defined [11] to be the completion of the linear span of the functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product
$$\left\langle K_x, K_u \right\rangle_K := K(x, u),$$
and the reproducing property is given by
$$f(x) = \left\langle f, K_x \right\rangle_K, \qquad \forall f \in \mathcal{H}_K,\ x \in X.$$
By setting $\mathcal{H} = \mathcal{H}_K$ and $\Omega(f) = \|f\|_K^2$, (5) becomes the classical regularized classification scheme
$$f_{\mathbf{z},\lambda} := \arg\min_{f \in \mathcal{H}_K} \left\{ \mathcal{E}_{\mathbf{z}}(f) + \lambda \|f\|_K^2 \right\}. \tag{8}$$
Its mathematical analysis has been well understood with various techniques in an extensive literature (see, e.g., [4, 5, 8, 12-15]). In this paper we consider a different regularization scheme in an RKHS for classification: in our setting, the regularizer is the $\ell^1$-norm of the coefficients in the kernel expansion over the sample points.
Given a sample $\mathbf{z}$, consider the sample-dependent hypothesis space
$$\mathcal{H}_{K,\mathbf{z}} := \left\{ f = \sum_{i=1}^{m} \alpha_i K_{x_i} : \alpha_i \in \mathbb{R} \right\}$$
and the regularizer
$$\Omega_{\mathbf{z}}(f) := \inf \left\{ \sum_{i=1}^{m} |\alpha_i| : f = \sum_{i=1}^{m} \alpha_i K_{x_i} \right\}, \qquad f \in \mathcal{H}_{K,\mathbf{z}}.$$
Then the $\ell^1$-regularized classification scheme is given as
$$f_{\mathbf{z},\lambda} := \arg\min_{f \in \mathcal{H}_{K,\mathbf{z}}} \left\{ \mathcal{E}_{\mathbf{z}}(f) + \lambda\, \Omega_{\mathbf{z}}(f) \right\}. \tag{10}$$
Algorithm (10) can be computed efficiently because it reduces to solving a convex optimization problem in the finite-dimensional space $\mathcal{H}_{K,\mathbf{z}}$, consisting of linear combinations of kernels centered on the training points.
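For the hinge loss, for instance, scheme (10) can be rewritten as a linear program over the expansion coefficients; the following minimal sketch solves it with scipy. The Gaussian kernel, the two-blob toy data, and the parameter values are illustrative assumptions rather than choices made in the paper, and other classifying losses can be handled by a general convex solver in the same finite-dimensional space.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch of scheme (10) with the hinge loss, written as a linear program:
#   min_alpha (1/m) sum_i max(0, 1 - y_i (K alpha)_i) + lam * ||alpha||_1.
def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def l1_kernel_classifier(X, y, lam=0.05, sigma=1.0):
    m = len(y)
    K = gaussian_kernel(X, X, sigma)
    # Variables: [alpha_plus (m), alpha_minus (m), slack xi (m)], all >= 0.
    c = np.concatenate([lam * np.ones(m), lam * np.ones(m), np.ones(m) / m])
    YK = y[:, None] * K                       # rows y_i * K(x_i, .)
    A_ub = np.hstack([-YK, YK, -np.eye(m)])   # -y_i (K alpha)_i - xi_i <= -1
    b_ub = -np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    alpha = res.x[:m] - res.x[m:2 * m]
    predict = lambda Xnew: np.where(gaussian_kernel(Xnew, X, sigma) @ alpha >= 0, 1, -1)
    return predict, alpha

# Toy usage: two Gaussian blobs (illustrative data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 0.7, (40, 2)), rng.normal(1.0, 0.7, (40, 2))])
y = np.concatenate([-np.ones(40), np.ones(40)])
predict, alpha = l1_kernel_classifier(X, y, lam=0.05)
print("training error        :", np.mean(predict(X) != y))
print("nonzero coefficients  :", np.sum(np.abs(alpha) > 1e-8), "of", len(alpha))
```

The printed count of nonzero coefficients illustrates the sparsity effect of the $\ell^1$ penalty discussed next.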
In the last ten years, learning with $\ell^1$-regularization has attracted much attention. The increasing interest is mainly driven by the progress on the Lasso algorithm [16-18] and compressive sensing [19, 20], in which the $\ell^1$-regularizer is able to yield a sparse representation of the resulting minimizer. Kernel methods formulate learning and estimation problems in an RKHS of functions expanded in terms of kernels. A series of papers has investigated the learning ability of coefficient-based regularized kernel regression methods (see, e.g., [21-25]). However, to the best of our knowledge, there are currently only a few results on classification in this coefficient-based kernel setting. For example, [26] studies the classification problem with the hinge loss $\phi_h$ and $\ell^1$ complexity regularization in a finite-dimensional hypothesis space spanned by a set of base functions. However, it does not assume a kernel setting, nor is the expansion assumed to be in terms of the sample points, so the problem of a data-dependent hypothesis space is not present there. Although [27] provided an error analysis for linear programming SVM classifiers by means of a stepping stone from the quadratic programming SVM to the linear programming SVM, there is no evidence that this method still works for other classifying losses.
In this paper we present an elaborate error analysis for algorithm (10). We use a modified error decomposition technique, first introduced in [28], that handles the approximation error, the hypothesis error, and the sample error, and we derive an explicit learning rate for classification scheme (10) under some assumptions.

Preliminaries
For a classifying loss $\phi$, we define the generalization error of $f : X \to \mathbb{R}$ as
$$\mathcal{E}(f) := \int_Z \phi\left(y f(x)\right) d\rho.$$
Let $f_\rho^\phi$ be a measurable function minimizing the generalization error:
$$f_\rho^\phi := \arg\min_{f} \mathcal{E}(f),$$
where the minimum is taken over all measurable functions. According to Theorem 3(c) in [12], we may always choose an $f_\rho^\phi$ satisfying $f_\rho^\phi(x) \in [-1, 1]$ for each $x \in X$. This choice will be made throughout the paper.
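For a fixed $x$ with $\eta = P(y = 1 \mid x)$, the value $f_\rho^\phi(x)$ minimizes the conditional risk $\eta\,\phi(t) + (1-\eta)\,\phi(-t)$ over $t$, and by the remark above the search can be restricted to $[-1, 1]$. The sketch below computes this minimizer numerically for the hinge and least squares losses; the particular $\eta$ values are arbitrary, and the closed forms quoted in the comments ($\operatorname{sgn}(2\eta - 1)$ and $2\eta - 1$) are standard facts rather than statements taken from this paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# For fixed eta = P(y = 1 | x), the target value f_rho^phi(x) minimizes
#   eta * phi(t) + (1 - eta) * phi(-t)   over t in [-1, 1].
# Standard facts: hinge -> sgn(2*eta - 1), least squares -> 2*eta - 1.
hinge = lambda t: max(0.0, 1.0 - t)
least_squares = lambda t: (1.0 - t) ** 2

def target_value(phi, eta):
    risk = lambda t: eta * phi(t) + (1.0 - eta) * phi(-t)
    return minimize_scalar(risk, bounds=(-1.0, 1.0), method="bounded").x

for eta in [0.2, 0.7, 0.9]:
    print(f"eta={eta:.1f}  hinge: {target_value(hinge, eta):+.3f}  "
          f"least squares: {target_value(least_squares, eta):+.3f}  "
          f"(2*eta-1 = {2 * eta - 1:+.2f})")
```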
Estimating the excess misclassification error for classification scheme (10) is our main purpose. The following comparison theorem (see [7, 8, 12]) describes the relationship between the excess misclassification error and the excess generalization error.
Proposition 5. If $\phi$ is a classifying loss, then, for any measurable function $f$,
$$\mathcal{R}(\operatorname{sgn}(f)) - \mathcal{R}(f_c) \le c_\phi \sqrt{\mathcal{E}(f) - \mathcal{E}\left(f_\rho^\phi\right)},$$
where $c_\phi$ is some constant depending on $\phi$.
Since $f_\rho^\phi(x) \in [-1, 1]$, we can improve the error estimates by replacing the values of $f$ by their projections onto $[-1, 1]$. The idea of the following projection operator was first introduced for this purpose in [29].

Definition 6. The projection operator $\pi$ is defined on the space of measurable functions $f : X \to \mathbb{R}$ as
$$\pi(f)(x) := \begin{cases} 1, & \text{if } f(x) > 1, \\ f(x), & \text{if } -1 \le f(x) \le 1, \\ -1, & \text{if } f(x) < -1. \end{cases}$$

The definition of a classifying loss implies that $\phi\left(y\,\pi(f)(x)\right) \le \phi\left(y f(x)\right)$, so
$$\mathcal{E}(\pi(f)) \le \mathcal{E}(f), \qquad \mathcal{E}_{\mathbf{z}}(\pi(f)) \le \mathcal{E}_{\mathbf{z}}(f).$$
It is trivial that $\operatorname{sgn}(\pi(f)) = \operatorname{sgn}(f)$. By Proposition 5,
$$\mathcal{R}\left(\operatorname{sgn}(f_{\mathbf{z},\lambda})\right) - \mathcal{R}(f_c) \le c_\phi \sqrt{\mathcal{E}\left(\pi(f_{\mathbf{z},\lambda})\right) - \mathcal{E}\left(f_\rho^\phi\right)}. \tag{17}$$
So it is sufficient for us to bound (13) by means of $\mathcal{E}(\pi(f_{\mathbf{z},\lambda})) - \mathcal{E}(f_\rho^\phi)$, which in turn can be estimated by an error decomposition technique. However, there are essential differences between algorithms (8) and (10). For example, the hypothesis space $\mathcal{H}_{K,\mathbf{z}}$ and the regularizer $\Omega_{\mathbf{z}}(f)$ in (10) depend on the sample $\mathbf{z}$. As a consequence, the standard error analysis methods for (8) (see, e.g., [8, 12, 13, 30]) can no longer be applied to (10). This difficulty was overcome in [28] by introducing a modified error decomposition with an extra hypothesis error term. In this paper we apply the same underlying idea to classification scheme (10). To this end, we need to consider a Banach space containing all of the possible hypothesis spaces $\mathcal{H}_{K,\mathbf{z}}$.

Definition 7. The Banach space $\mathcal{H}^0$ is defined as the set of functions on $X$ of the form
$$f = \sum_{i=1}^{\infty} \alpha_i K_{u_i}, \qquad \{u_i\} \subset X,\ \sum_{i=1}^{\infty} |\alpha_i| < \infty,$$
with the norm
$$\|f\| := \inf \left\{ \sum_{i=1}^{\infty} |\alpha_i| : f = \sum_{i=1}^{\infty} \alpha_i K_{u_i} \right\}.$$
Obviously, $\mathcal{H}_{K,\mathbf{z}} \subset \mathcal{H}^0$ for every sample $\mathbf{z}$. By the continuity of $K$ and the compactness of $X$, we have
$$\kappa := \sup_{x, u \in X} |K(x, u)| < \infty.$$
It implies that $\mathcal{H}^0$ is a subset of the continuous function space $C(X)$, and
$$\|f\|_\infty \le \kappa \|f\|, \qquad \forall f \in \mathcal{H}^0.$$
To formulate the error decomposition for scheme (10), we introduce a regularization function as
$$f_\lambda := \arg\min_{f \in \mathcal{H}^0} \left\{ \mathcal{E}(f) - \mathcal{E}\left(f_\rho^\phi\right) + \lambda \|f\| \right\}.$$

Proposition 8. Let $f_{\mathbf{z},\lambda}$ be defined by (10) and let $\lambda > 0$; then
$$\mathcal{E}\left(\pi(f_{\mathbf{z},\lambda})\right) - \mathcal{E}\left(f_\rho^\phi\right) \le \mathcal{D}(\lambda) + \mathcal{H}(\mathbf{z},\lambda) + \mathcal{S}(\mathbf{z},\lambda).$$
Here $\mathcal{D}(\lambda) := \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho^\phi) + \lambda \|f_\lambda\|$.

Proof.
$\mathcal{H}(\mathbf{z},\lambda)$ and $\mathcal{S}(\mathbf{z},\lambda)$ are called the hypothesis error and the sample error, and they will be estimated, respectively, in the next two sections. $\mathcal{D}(\lambda)$ is independent of the sample and is usually called the approximation error; it characterizes the approximation ability of the function space $\mathcal{H}^0$ with respect to the target function $f_\rho^\phi$. We will assume that, for some constants $0 < \beta \le 1$ and $c_\beta > 0$,
$$\mathcal{D}(\lambda) \le c_\beta \lambda^{\beta}, \qquad \forall \lambda > 0. \tag{28}$$
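Before turning to the individual error terms, note that the two elementary facts used above, $\mathcal{E}_{\mathbf{z}}(\pi(f)) \le \mathcal{E}_{\mathbf{z}}(f)$ and $\operatorname{sgn}(\pi(f)) = \operatorname{sgn}(f)$, are easy to check numerically; the sketch below does so for the hinge loss on random predictor values, all of which are illustrative.

```python
import numpy as np

# Numerical check of the projection operator pi of Definition 6:
# clipping a real-valued predictor to [-1, 1] never increases the empirical
# phi-risk and never changes the induced sign classifier.  Data are random.
rng = np.random.default_rng(0)
m = 1000
y = rng.choice([-1.0, 1.0], size=m)
f_vals = rng.normal(scale=3.0, size=m)        # values f(x_i) of some predictor

hinge = lambda t: np.maximum(0.0, 1.0 - t)
pi_f_vals = np.clip(f_vals, -1.0, 1.0)        # pi(f)(x_i)

emp_risk = lambda vals: np.mean(hinge(y * vals))
print("E_z(f)     =", emp_risk(f_vals))
print("E_z(pi(f)) =", emp_risk(pi_f_vals))    # never larger than E_z(f)
print("same signs :", np.all(np.sign(f_vals) == np.sign(pi_f_vals)))
```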

Estimating the Hypothesis Error
In this section we bound the hypothesis error $\mathcal{H}(\mathbf{z},\lambda)$ by a technique of scattered data interpolation, which was first used in the kernel regression context in [25]. To this end, we need some assumptions on the input space $X$, the marginal distribution $\rho_X$, and the kernel $K$.
Definition 9. A subset $X$ of $\mathbb{R}^n$ is said to satisfy an interior cone condition if there exist an angle $\theta \in (0, \pi/2)$, a radius $r_0 > 0$, and a unit vector $\xi(x)$ for every $x \in X$ such that the cone
$$C\left(x, \xi(x), \theta, r_0\right) := \left\{ x + t u : u \in \mathbb{R}^n,\ \|u\| = 1,\ \langle u, \xi(x) \rangle \ge \cos\theta,\ 0 \le t \le r_0 \right\}$$
is contained in $X$.

Estimating the Sample Error
In this section we focus on the sample error; its estimation is the major improvement we make in this paper in the error analysis of algorithm (10).
Definition 15. Let $\mathcal{F}$ be a class of functions on $X$ and let $\mathbf{x} := \{x_i\}_{i=1}^{m} \in X^m$. The $\ell^2$-metric $d_{2,\mathbf{x}}$ is defined on $\mathcal{F}$ by
$$d_{2,\mathbf{x}}(f, g) := \left( \frac{1}{m} \sum_{i=1}^{m} \left( f(x_i) - g(x_i) \right)^2 \right)^{1/2}.$$
For every $\varepsilon > 0$, the covering number of $\mathcal{F}$ with respect to $d_{2,\mathbf{x}}$ is
$$\mathcal{N}_{2,\mathbf{x}}(\mathcal{F}, \varepsilon) := \min \left\{ l \in \mathbb{N} : \exists\, f_1, \ldots, f_l \in \mathcal{F} \ \text{such that} \ \mathcal{F} = \bigcup_{j=1}^{l} \left\{ f \in \mathcal{F} : d_{2,\mathbf{x}}(f, f_j) \le \varepsilon \right\} \right\}.$$

The function sets in our situation are balls of $\mathcal{H}^0$ of the form $B_R := \{ f \in \mathcal{H}^0 : \|f\| \le R \}$. We need the $\ell^2$-empirical covering number of $B_1$, defined as
$$\mathcal{N}_2(B_1, \varepsilon) := \sup_{m \in \mathbb{N}} \sup_{\mathbf{x} \in X^m} \mathcal{N}_{2,\mathbf{x}}(B_1, \varepsilon).$$
According to a bound for the $\ell^2$-empirical covering number derived in [32], we know that if $K \in C^s(X \times X)$ for some $s > 0$, then
$$\log \mathcal{N}_2(B_1, \varepsilon) \le c_\mu \left( \frac{1}{\varepsilon} \right)^{\mu}, \qquad \forall \varepsilon > 0,$$
where $c_\mu$ is a constant independent of $\varepsilon > 0$ and $\mu \in (0, 2)$ is a power index defined by (51).

For a measurable function $g : Z \to \mathbb{R}$, denote $Eg := \int_Z g \, d\rho$. The following definition is a variance-expectation condition for the pair $(\phi, \rho)$, which is generally used to achieve tight bounds.

Definition 16. A variance power $\tau$ of the pair $(\phi, \rho)$ is a number in $[0, 1]$ such that, for any $f : X \to [-1, 1]$, there exists some constant $c_\tau > 0$ satisfying
$$E\left\{ \left( \phi\left(y f(x)\right) - \phi\left(y f_\rho^\phi(x)\right) \right)^2 \right\} \le c_\tau \left( E\left\{ \phi\left(y f(x)\right) - \phi\left(y f_\rho^\phi(x)\right) \right\} \right)^{\tau}. \tag{52}$$

Remark 17. It is easy to see that (52) always holds for $\tau = 0$ and $c_\tau = \phi^2(-1)$. A larger $\tau$ is possible when $\phi$ has strong convexity or $\rho$ satisfies some noise condition (see [1, 5]).
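Definition 15 can be made tangible with a small computation: sample a finite family of functions from the unit ball $B_1$ (each a short kernel expansion with $\ell^1$-normalized coefficients), evaluate them at sample points, and build a greedy $\varepsilon$-net under $d_{2,\mathbf{x}}$. The kernel, the sample sizes, and the $\varepsilon$ values below are illustrative, and the greedy count only upper-bounds the covering number of this finite family.

```python
import numpy as np

# Greedy upper bound on the empirical l2 covering number of a finite sample
# of functions from the unit ball B_1 of H^0 (Definition 15).
rng = np.random.default_rng(0)

def gaussian_kernel(a, b, sigma=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

m, n_funcs = 50, 200
x = rng.uniform(-1, 1, m)                      # evaluation points x_1, ..., x_m
centers = rng.uniform(-1, 1, (n_funcs, 5))     # each function uses 5 kernel centers
coeffs = rng.normal(size=(n_funcs, 5))
coeffs /= np.abs(coeffs).sum(axis=1, keepdims=True)   # l1-normalized, so ||f|| <= 1

# Rows of F: the vectors (f(x_1), ..., f(x_m)) for each sampled f in B_1.
F = np.stack([gaussian_kernel(x, c) @ a for c, a in zip(centers, coeffs)])

def covering_number(F, eps):
    """Greedy upper bound on N_{2,x}(F, eps) with respect to d_{2,x}."""
    remaining = list(range(len(F)))
    count = 0
    while remaining:
        center = F[remaining[0]]
        d = np.sqrt(((F[remaining] - center) ** 2).mean(axis=1))   # d_{2,x}
        remaining = [i for i, di in zip(remaining, d) if di > eps]
        count += 1
    return count

for eps in [0.4, 0.2, 0.1, 0.05]:
    print(f"eps={eps:5.2f}  greedy cover size <= {covering_number(F, eps)}")
```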
We are now in a position to bound the sample error. Write $\mathcal{S}(\mathbf{z},\lambda)$ as the sum of two terms $\mathcal{S}_1(\mathbf{z},\lambda)$ and $\mathcal{S}_2(\mathbf{z},\lambda)$. We will first bound $\mathcal{S}_2(\mathbf{z},\lambda)$, and to this end we need the following one-sided Bernstein inequality (see [33]).
Let $\xi$ be a random variable on a probability space $Z$ with mean $E\xi = \mu$ and variance $\sigma^2(\xi) = \sigma^2$. If $|\xi - \mu| \le M$ almost everywhere, then, for all $\varepsilon > 0$,
$$\operatorname{Prob}_{\mathbf{z} \in Z^m} \left\{ \frac{1}{m} \sum_{i=1}^{m} \xi(z_i) - \mu \ge \varepsilon \right\} \le \exp\left\{ - \frac{m \varepsilon^2}{2\left( \sigma^2 + \frac{1}{3} M \varepsilon \right)} \right\}.$$
Applying it yields the following bound for $\mathcal{S}_2(\mathbf{z},\lambda)$, in which $c_3$ is a constant independent of $m$, $\lambda$, or $\delta$.
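As a quick aside, the one-sided Bernstein inequality just stated can be checked by simulation; the uniform toy variable (so $\mu = 1/2$, $\sigma^2 = 1/12$, $M = 1/2$), the sample size, and the $\varepsilon$ values below are arbitrary choices.

```python
import numpy as np

# Simulation check of the one-sided Bernstein inequality for a bounded
# variable: xi uniform on [0, 1], so mu = 1/2, sigma^2 = 1/12, |xi - mu| <= 1/2.
rng = np.random.default_rng(0)
m, trials = 100, 50_000
mu, sigma2, M = 0.5, 1.0 / 12.0, 0.5

samples = rng.uniform(0.0, 1.0, size=(trials, m))
means = samples.mean(axis=1)

for eps in [0.03, 0.06, 0.09]:
    empirical = np.mean(means - mu >= eps)                      # observed tail frequency
    bernstein = np.exp(-m * eps ** 2 / (2.0 * (sigma2 + M * eps / 3.0)))
    print(f"eps={eps:.2f}  empirical tail={empirical:.4f}  Bernstein bound={bernstein:.4f}")
```

For these parameters the empirical tail frequency should stay below the Bernstein bound, as the inequality guarantees for the true tail probability.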
Proof. Denote $\xi_1 := \phi\left(y f_\lambda(x)\right) - \phi\left(y\, (\pi(f_\lambda))(x)\right)$ and $\xi_2 := \phi\left(y\, (\pi(f_\lambda))(x)\right) - \phi\left(y f_\rho^\phi(x)\right)$. Then $\mathcal{S}_2(\mathbf{z},\lambda)$ can be estimated through $\xi_1$ and $\xi_2$. By (22) and (26), we can bound $\xi_1$. We may assume $|f_\lambda(x)| > 1$, since otherwise $\xi_1 = 0$; then from (3) we can derive the required bounds on the range and variance of $\xi_1$. Applying the one-sided Bernstein inequality to $\xi_1$, we obtain, for any $t > 1$, with confidence $1 - e^{-t}$, a bound on its empirical mean. On the other hand, both $(\pi(f_\lambda))(x)$ and $f_\rho^\phi(x)$ are contained in $[-1, 1]$, and we know from (3) and (52) how to bound the variance of $\xi_2$. Applying the one-sided Bernstein inequality again, we obtain, with confidence $1 - e^{-t}$, a bound on the empirical mean of $\xi_2$, where in the second inequality we have used the elementary inequality (62). Since $E\xi_1 + E\xi_2 = \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho^\phi) \le \mathcal{D}(\lambda)$, combining the estimates above we get the stated bound under assumption (28). We then prove the proposition by setting $t = \log(8/\delta)$.

It is more difficult to bound $\mathcal{S}_1(\mathbf{z},\lambda)$, because it involves the sample $\mathbf{z}$ and thus runs over a set of functions. To obtain a better error estimate, an iteration technique is often used to shrink the radius of the ball containing $f_{\mathbf{z},\lambda}$ (see, e.g., [5, 12, 30, 32]); however, this process is rather tough and complicated. In this paper we avoid the prolix iteration by considering the following reweighted empirical process. Here $w_r(f) := (r + G(f))^{-1}$ for a threshold $r > 0$ and $G(f) := \mathcal{E}(\pi(f)) - \mathcal{E}(f_\rho^\phi) + \lambda \|f\|$. Unlike the classical weight function, $G(f)$ contains the regularization term $\lambda \|f\|$ and thus makes it possible to control the variances and $\|f\|$ simultaneously by the threshold $r$.
The following concentration inequality is a rescaled version of Theorem 2.3 in [34], where the case $B = 1$ is given.

Lemma 19. Assume that $z_1, \ldots, z_m$ are identically distributed according to $\rho$. Let $\mathcal{F}$ be a countable set of measurable functions from $Z$ to $[-B, B]$, and assume that all functions $g$ in $\mathcal{F}$ satisfy $Eg = 0$ and $\sigma^2(g) \le \sigma^2$ for some positive real number $\sigma^2$. Denote
$$W := \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} g(z_i).$$
Then, for all $t > 0$, one has, with probability at least $1 - e^{-t}$, a bound on $W$ in terms of $EW$, $\sigma^2$, $B$, $m$, and $t$.

This lemma allows us to take care of the deviation of the supremum of an empirical process from its expectation.

Proposition 20. Let $r \ge 0$.
If (3) and (52) are satisfied, then, for any $t > 0$, with confidence $1 - e^{-t}$, the reweighted empirical process admits an explicit bound.

Proof. Before presenting the proof, let us first introduce some additional notation, including the quantity $\Phi_r$ considered below. By (3) and (52), we can bound the range and variance of the functions involved; here, in the second inequality of (72), we have used the elementary inequality (62) again. So, applying Lemma 19 to $\Phi_r$, we get, with confidence $1 - e^{-t}$, a bound on $\Phi_r$ in terms of its expectation. So we can bound $\Phi_r$ through bounding its expectation. To this end, we need some preparations.

Definition 21. Let $(Z, \rho)$ be a probability space and let $\mathcal{F}$ be a class of measurable functions from $Z$ to $\mathbb{R}$. Set $\{z_i\}_{i=1}^{m}$ to be independent random variables distributed according to $\rho$ and $\{\varepsilon_i\}_{i=1}^{m}$ to be independent Rademacher random variables.
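Definition 21 sets up the ingredients of the standard Rademacher average $E \sup_{g \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \varepsilon_i g(z_i)$; assuming this is the quantity intended, the sketch below estimates it by Monte Carlo for a small illustrative function class (clipped linear functions), which is a stand-in and not a class used in the paper.

```python
import numpy as np

# Monte Carlo estimate of the Rademacher average
#   E sup_{g in F} (1/m) sum_i eps_i g(z_i)
# for a finite illustrative class of clipped linear functions.
rng = np.random.default_rng(0)

m = 100
z = rng.uniform(-1.0, 1.0, m)                          # sample points z_1, ..., z_m
slopes = np.linspace(-1.0, 1.0, 41)
G = np.clip(slopes[:, None] * z[None, :], -1.0, 1.0)   # g_j(z_i), values in [-1, 1]

def rademacher_average(G, n_draws=2000):
    m = G.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, m))   # Rademacher signs
    # For each draw: the supremum over the class of (1/m) sum_i eps_i g(z_i).
    sup_vals = np.max(eps @ G.T / m, axis=1)
    return sup_vals.mean()

print("estimated Rademacher average:", rademacher_average(G))  # decays roughly like 1/sqrt(m)
```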
The following lemma was given in [35]. Therefore, we obtain (80). Now we can give a bound for $\Phi_r$.

Deriving Learning Rates
We may now present the main results by combining the estimates obtained in the previous two sections. The following theorem gives the bounds for the excess generalization error.
Proof. Putting the estimates of the previous two sections into the error decomposition of Proposition 8, we obtain (99). By the choice of $\lambda$, we can easily check that the stated learning rate follows.

Theorem 26 together with (17) allows us to give an explicit learning rate for the misclassification error of scheme (10), where $\mathcal{C} := c_\phi \sqrt{\widetilde{C}}$ and $\widetilde{C}$ denotes the constant from Theorem 26.