Constructive Analysis for Least Squares Regression with Generalized K-Norm Regularization



Introduction
In learning theory, we are given a sample set $\mathbf{z} := \{z_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m$ drawn from a joint distribution $\rho$ on the sample space $Z := X \times Y$. Here, the input space $X$ is a compact metric space and $Y = \mathbb{R}$ for a regression problem. For a function $f$ obtained via some algorithm, a loss function $\ell(f(x), y)$ is defined to measure its performance on a sample point $(x, y)$; in regression, the least squares loss $\ell(f(x), y) = (f(x) - y)^2$ is the most widely used. We can then evaluate $f$ over the whole sample space by the generalization error
\[ \mathcal{E}(f) := \int_Z (f(x) - y)^2 \, d\rho. \tag{1} \]
From [1], we know the goal function minimizing the generalization error is the regression function
\[ f_\rho(x) := \int_Y y \, d\rho(y \mid x). \]
Since $\rho$ is unknown in practice, we have to find a function close to $f_\rho$ based on the sample. The famous empirical risk minimization (ERM) algorithm was introduced in [2, 3]. To avoid overfitting, a penalty term $\Omega(f)$ related to $f$ is added to this algorithm, a device usually called regularization. While the squared-norm regularization term is studied extensively in [4] and elsewhere, in this paper we consider the more general model
\[ f_{\mathbf{z},\lambda} := \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda \|f\|_K^q \Big\} \tag{2} \]
with some $q > 0$. In this algorithm, the minimization is restricted to a hypothesis space $\mathcal{H}_K$, which is a reproducing kernel Hilbert space (RKHS) on $X$. The RKHS [5] is defined as $\mathcal{H}_K := \overline{\mathrm{span}}\{K_x : x \in X\}$ with $K_x(y) = K(x, y)$, associated with a Mercer kernel $K : X \times X \to \mathbb{R}$, which is continuous, symmetric, and positive definite. Since $X$ is a compact metric space, $K$ is bounded, and we denote $\kappa := \sup_{x, y \in X} \sqrt{K(x, y)}$ in the following.
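As an illustration of the scheme (2), here is a minimal numerical sketch; the Gaussian kernel, the choice $q = 1.5$, and all parameter values are our own toy choices, not from the paper. It relies on the representer-type fact that the penalty is nondecreasing in $\|f\|_K$, so a minimizer can be sought in $\mathrm{span}\{K_{x_i}\}$, where $\|f\|_K^2 = c^\top G c$ for the Gram matrix $G$:

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(x1, x2, sigma=0.5):
    # Mercer kernel K(s, t) = exp(-(s - t)^2 / (2 sigma^2)) on X = [-1, 1].
    return np.exp(-(x1[:, None] - x2[None, :]) ** 2 / (2 * sigma**2))

def fit_q_norm_rls(x, y, lam=1e-3, q=1.5, sigma=0.5):
    """Minimize (1/m) sum_i (f(x_i) - y_i)^2 + lam * ||f||_K^q over H_K.

    The penalty is nondecreasing in ||f||_K, so a representer-type
    argument lets us search over f = sum_i c_i K(x_i, .), for which
    ||f||_K^2 = c^T G c with G the Gram matrix.
    """
    m = len(x)
    G = gaussian_kernel(x, x, sigma)

    def objective(c):
        residual = G @ c - y              # f(x_i) - y_i at the sample points
        norm_sq = max(c @ G @ c, 0.0)     # ||f||_K^2 (clip rounding noise)
        return residual @ residual / m + lam * norm_sq ** (q / 2)

    res = minimize(objective, np.zeros(m), method="L-BFGS-B")
    return res.x, G

# Usage: noisy samples from a smooth target.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=40)
c, G = fit_q_norm_rls(x, y)
print("training RMSE:", np.sqrt(np.mean((G @ c - y) ** 2)))
```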

Main Result
Though the uniform boundedness assumption was dropped in previous work [6], we still assume
\[ |y| \le M \quad \text{almost surely} \tag{3} \]
for some constant $M > 0$ throughout this paper for simplicity, since our analysis can be extended to the unbounded situation by choosing a different probability inequality.
For the hypothesis space, a polynomial decay condition on the covering numbers is imposed to control its capacity. To state this condition, we first recall the notion of covering number.

Definition 1. Let $(\mathcal{M}, d)$ be a pseudometric space and $S \subset \mathcal{M}$. For $\epsilon > 0$, the covering number $\mathcal{N}(S, \epsilon, d)$ of the set $S$ with respect to $d$ is defined to be the minimal number of balls of radius $\epsilon$ whose union covers $S$; that is,
\[ \mathcal{N}(S, \epsilon, d) := \min \Big\{ \ell \in \mathbb{N} : S \subset \bigcup_{j=1}^{\ell} B(s_j, \epsilon) \text{ for some } s_1, \ldots, s_\ell \in \mathcal{M} \Big\}, \]
where $B(s_j, \epsilon) = \{ s \in \mathcal{M} : d(s, s_j) \le \epsilon \}$. For example, $\mathcal{N}([0,1], \epsilon, |\cdot|) = \lceil 1/(2\epsilon) \rceil$. The capacity condition on $\mathcal{H}_K$ then reads: there exist constants $C_0 > 0$ and $0 < s < 2$ such that
\[ \ln \mathcal{N}\big( B_1, \epsilon, \|\cdot\|_\infty \big) \le C_0 \, \epsilon^{-s}, \qquad \forall \epsilon > 0, \tag{7} \]
where $B_1 := \{ f \in \mathcal{H}_K : \|f\|_K \le 1 \}$ is the unit ball of $\mathcal{H}_K$.
The integral operator $L_K : L^2_{\rho_X} \to L^2_{\rho_X}$ defined by
\[ (L_K f)(x) := \int_X K(x, t) f(t) \, d\rho_X(t) \]
is also important in learning theory and has been studied in [15]. In [1], the authors show that, for a Mercer kernel $K$, the associated $L_K$ is a compact operator with a nonincreasing sequence of positive eigenvalues $\{\mu_i\}_{i \ge 1}$ and corresponding orthonormal eigenfunctions $\{e_i\}_{i \ge 1}$ forming a basis of $L^2_{\rho_X}$. The induced fractional power operator
\[ L_K^r f := \sum_{i \ge 1} \mu_i^r a_i e_i \]
is then well defined for any $r > 0$ and any $f = \sum_{i \ge 1} a_i e_i \in L^2_{\rho_X}$. In the following, we will make use of this notion in our construction analysis.
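As an illustration (not from the paper), $L_K$ and its fractional powers can be approximated from a sample via the normalized Gram matrix; a minimal sketch, assuming the Gram matrix `G` from the earlier snippet:

```python
import numpy as np

def apply_LK_power(G, f_vals, r):
    """Empirical approximation of (L_K^r f) at the sample points.

    The eigenvalues of G/m approximate the eigenvalues mu_i of L_K, and
    in the corresponding eigenbasis L_K^r acts diagonally by mu_i^r
    (a Nystrom-type approximation; an illustrative sketch only).
    """
    m = G.shape[0]
    mu, U = np.linalg.eigh(G / m)   # mu ascending, columns of U orthonormal
    mu = np.clip(mu, 0.0, None)     # L_K is positive; clip rounding noise
    return U @ (mu**r * (U.T @ f_vals))
```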
We additionally introduce the projection (truncation) operator $\pi_M$ on the space of measurable functions $f : X \to \mathbb{R}$:
\[ \pi_M(f)(x) := \begin{cases} M, & f(x) > M, \\ f(x), & |f(x)| \le M, \\ -M, & f(x) < -M. \end{cases} \]
The main result is stated as follows; it will be proved in Section 6.
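In code, $\pi_M$ is simply a clip; a one-line illustrative sketch:

```python
import numpy as np

def project(f_vals, M):
    # pi_M clips function values to [-M, M]; since |y| <= M almost surely,
    # |f_rho| <= M as well, so truncating an estimator cannot increase its
    # distance to f_rho.
    return np.clip(f_vals, -M, M)
```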
Theorem 4. Assume (3) and (7) hold for the sample distribution and the hypothesis space $\mathcal{H}_K$, and that the regression function satisfies $f_\rho \in L_K^r(L^2_{\rho_X})$ for some $r > 0$; $f_{\mathbf{z},\lambda}$ is obtained from (2). Then, by choosing an appropriate $\lambda$ (an explicit expression can be found in the proof), with confidence $1 - \delta$ for any $0 < \delta < 1/2$, we have
\[ \big\| \pi_M(f_{\mathbf{z},\lambda}) - f_\rho \big\|_\rho^2 \le C_{q,r,s} \Big( \ln \frac{4}{\delta} \Big) m^{-\theta} \]
for some constant $C_{q,r,s}$ not depending on $m$ or $\delta$, where the exponent $\theta = \theta(q, r, s) > 0$ is made explicit in the proof.

Error Decomposition
Various error decomposition methods motivate our research, especially [7, 12, 14, 16, 17]. The general idea of error decomposition is to split the excess generalization error $\mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho) = \|f_{\mathbf{z},\lambda} - f_\rho\|_\rho^2$ (see [1] for details) into parts that can be bounded by concentration inequalities and by approximation analysis. In our setting, let $f_\lambda \in \mathcal{H}_K$ be a function to be determined, and let $\mathcal{E}_{\mathbf{z}}(f) := \frac{1}{m} \sum_{i=1}^m (f(x_i) - y_i)^2$ denote the empirical error. The excess error can then be decomposed as
\[ \mathcal{E}(\pi_M(f_{\mathbf{z},\lambda})) - \mathcal{E}(f_\rho) + \lambda \|f_{\mathbf{z},\lambda}\|_K^q \le S_1 + S_2 + \mathcal{D}(\lambda), \]
where
\begin{align*}
S_1 &:= \mathcal{E}(\pi_M(f_{\mathbf{z},\lambda})) - \mathcal{E}(f_\rho) - \big( \mathcal{E}_{\mathbf{z}}(\pi_M(f_{\mathbf{z},\lambda})) - \mathcal{E}_{\mathbf{z}}(f_\rho) \big), \\
S_2 &:= \mathcal{E}_{\mathbf{z}}(f_\lambda) - \mathcal{E}_{\mathbf{z}}(f_\rho) - \big( \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho) \big), \\
\mathcal{D}(\lambda) &:= \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho) + \lambda \|f_\lambda\|_K^q.
\end{align*}
The first and second terms $S_1$ and $S_2$ are called the sample error and will be studied in Section 5, while the third term $\mathcal{D}(\lambda)$ is the regularization error (or approximation error), which is the main work of this paper.
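For completeness, the decomposition can be verified by a short chain of (in)equalities, reconstructed here from the standard argument:
\begin{align*}
\mathcal{E}(\pi_M(f_{\mathbf{z},\lambda})) - \mathcal{E}(f_\rho) + \lambda \|f_{\mathbf{z},\lambda}\|_K^q
&= S_1 + \big\{ \mathcal{E}_{\mathbf{z}}(\pi_M(f_{\mathbf{z},\lambda})) - \mathcal{E}_{\mathbf{z}}(f_\rho) + \lambda \|f_{\mathbf{z},\lambda}\|_K^q \big\} \\
&\le S_1 + \big\{ \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\lambda}) - \mathcal{E}_{\mathbf{z}}(f_\rho) + \lambda \|f_{\mathbf{z},\lambda}\|_K^q \big\} \\
&\le S_1 + \big\{ \mathcal{E}_{\mathbf{z}}(f_\lambda) - \mathcal{E}_{\mathbf{z}}(f_\rho) + \lambda \|f_\lambda\|_K^q \big\} \\
&= S_1 + S_2 + \mathcal{D}(\lambda).
\end{align*}
The first inequality holds because truncation cannot increase the empirical error under (3), and the second because $f_{\mathbf{z},\lambda}$ minimizes the regularized empirical risk in (2).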
It is known that $f_\lambda$ can be chosen freely in $\mathcal{H}_K$, as long as it is close to $f_\rho$ in a suitable sense, and in previous work $f_\lambda$ is naturally chosen to be the minimizer of $\mathcal{D}(\lambda)$. However, we encounter difficulties if the minimizer does not exist or admits no explicit expression. In this paper, we construct the special function
\[ f_\lambda = \big( L_K^\alpha + \lambda^\beta I \big)^{-1} L_K^\alpha f_\rho \tag{15} \]
with some $\alpha, \beta > 0$ to handle this problem.
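A minimal sketch of this construction in the empirical eigenbasis, reusing the Gram-matrix approximation of $L_K$ from above (illustrative only, with $f_\rho$ evaluated at the sample points):

```python
import numpy as np

def construct_f_lambda(G, f_rho_vals, lam, alpha, beta):
    """Empirical version of the construction (15).

    In the eigenbasis of L_K (approximated by G/m as before), the operator
    (L_K^alpha + lam^beta I)^{-1} L_K^alpha acts diagonally: the i-th
    coefficient of f_rho is multiplied by mu_i^alpha / (mu_i^alpha + lam^beta).
    """
    m = G.shape[0]
    mu, U = np.linalg.eigh(G / m)
    mu = np.clip(mu, 0.0, None)
    shrink = mu**alpha / (mu**alpha + lam**beta)   # spectral filter in [0, 1)
    return U @ (shrink * (U.T @ f_rho_vals))
```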

Regularization Error
The main contribution of this section is the error analysis for the regularization error. The regularization error, also called the approximation error, has already been studied in [18]; here, however, we analyze it from a different viewpoint. From [1], we know that $L_K$ is a compact, self-adjoint, positive operator, so by the spectral theorem for compact operators we can control it through its eigenvalues. First, we introduce a useful lemma.
Lemma 5. Let $a, b, c, d > 0$ and $c < d$; one has
\[ \sup_{x > 0} \frac{a x^c}{b + x^d} = C_{c,d} \, a \, b^{c/d - 1}, \qquad \text{where } C_{c,d} := \frac{c}{d} \Big( \frac{c}{d - c} \Big)^{c/d - 1}. \]

Proof. By simply taking the derivative of the left-hand side with respect to $x$, we find that it attains its maximum at $x = (c/(d - c))^{1/d}\, b^{1/d}$; that is,
\[ \sup_{x > 0} \frac{a x^c}{b + x^d} = \frac{a \big( c/(d - c) \big)^{c/d} b^{c/d}}{b \, d/(d - c)} = \frac{c}{d} \Big( \frac{c}{d - c} \Big)^{c/d - 1} a \, b^{c/d - 1}. \]

We now apply this lemma to the regularization error of the function $f_\lambda$ constructed in (15). Since $f_\rho \in L_K^r(L^2_{\rho_X})$, write $f_\rho = L_K^r g_\rho$ with $g_\rho \in L^2_{\rho_X}$. From (15),
\[ f_\lambda - f_\rho = -\lambda^\beta \big( L_K^\alpha + \lambda^\beta I \big)^{-1} L_K^r g_\rho. \]
Recall that $\sup_{i \ge 1} \mu_i = \|L_K\| \le \kappa^2$; combining this with Lemma 5 (taken with $x = \mu_i$, $a = \lambda^\beta$, $b = \lambda^\beta$, $c = r$, $d = \alpha$), there holds
\[ \|f_\lambda - f_\rho\|_\rho^2 \le \Big( \sup_{i \ge 1} \frac{\lambda^\beta \mu_i^r}{\mu_i^\alpha + \lambda^\beta} \Big)^2 \|g_\rho\|_\rho^2 \le C_{r,\alpha}^2 \, \lambda^{2\beta r/\alpha} \|g_\rho\|_\rho^2. \tag{22} \]
For the term $\|f_\lambda\|_K$, we have the following inequality:
\[ \|f_\lambda\|_K = \big\| L_K^{-1/2} f_\lambda \big\|_\rho \le \sup_{i \ge 1} \frac{\mu_i^{\alpha + r - 1/2}}{\mu_i^\alpha + \lambda^\beta} \, \|g_\rho\|_\rho \le C_{\alpha + r - 1/2, \alpha} \, \lambda^{\beta (r - 1/2)/\alpha} \|g_\rho\|_\rho, \]
where the last step uses Lemma 5 with $c = \alpha + r - 1/2$ and $d = \alpha$ when $r < 1/2$ (for $r \ge 1/2$ the supremum is bounded by $\kappa^{2r - 1}$, and the term below becomes $O(\lambda)$). Both applications of Lemma 5 are legitimate once $\alpha > \max\{r, 1/2 - r\}$, which we are free to arrange. This means
\[ \lambda \|f_\lambda\|_K^q \le C' \lambda^{1 + q \beta (r - 1/2)/\alpha}. \tag{24} \]
To minimize the sum of the upper bounds (22) and (24) is the same as to maximize the power of $\lambda$. We can choose
\[ \frac{\beta}{\alpha} = \frac{2}{4r + q(1 - 2r)}, \]
which equalizes the two exponents and gives
\[ \mathcal{D}(\lambda) \le C \lambda^{4r/(4r + q(1 - 2r))}. \]
Then $\pi_M(f_\lambda)$ can also lead to the same result, except for the constants.
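The closed form in Lemma 5 is easy to verify numerically; a minimal check (all values are arbitrary choices satisfying $a, b, c, d > 0$ and $c < d$):

```python
import numpy as np

# Numerical check of Lemma 5: sup_{x>0} a x^c / (b + x^d) = C_{c,d} a b^{c/d - 1}.
a, b, c, d = 2.0, 0.3, 1.2, 2.5
x = np.linspace(1e-6, 50.0, 2_000_000)
lhs = np.max(a * x**c / (b + x**d))
C_cd = (c / d) * (c / (d - c)) ** (c / d - 1)
rhs = C_cd * a * b ** (c / d - 1)
print(lhs, rhs)   # the two values agree up to the grid resolution
```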
Remark 8. In the case $q = 2$, our result reduces to $\mathcal{D}(\lambda) \le C \lambda^{\min\{2r, 1\}}$, which is consistent with the classical one [4]. In fact, for a general $q \le 2$ of interest, the bound is better than $\lambda^{\min\{2r, 1\}}$, since the exponent is at least $2r$ when $q \le 2$, while $r \le 1/2$. In [7], the authors construct a function based on the generalized Fourier expansion of $f_\rho$ and derive $\mathcal{D}(\lambda) \le C_1 \lambda^{2r/(r + 2)}$ with some constant $C_1$ for any $0 < r \le 2$. That rate is always much less than $2r$ and cannot achieve $1$ when $1/2 < r \le 2$. On the other hand, our result is better than $2r$ whenever $0 < r < 1/2$.
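Under the exponent $\theta(q) = 4r/(4r + q(1 - 2r))$ obtained in Section 4, the comparison in Remark 8 reduces to a one-line computation:
\[ \theta(2) = \frac{4r}{4r + 2(1 - 2r)} = 2r, \qquad \frac{\partial \theta}{\partial q} = -\frac{4r(1 - 2r)}{\big(4r + q(1 - 2r)\big)^2} < 0 \quad \text{for } 0 < r < \tfrac{1}{2}, \]
so taking $q < 2$ strictly increases the exponent beyond $2r$, while $\theta(q) < 1$ whenever $r < 1/2$.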
Compared with [19], we obtain the same rate in the upper bound. There, the authors establish a connection between $\inf_{f \in \mathcal{H}_K} \{ \mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda \|f\|_K^{q_1} \}$ and $\inf_{f \in \mathcal{H}_K} \{ \mathcal{E}(f) - \mathcal{E}(f_\rho) + \lambda \|f\|_K^{q_2} \}$ for different exponents $q_1, q_2$. However, their analysis requires the minimizer to exist, while our method does not.
Now, we can obtain the sample error bound involving $f_\lambda$.
Proposition 10. Assume (3). For any $0 < \delta < 1$, with confidence $1 - \delta/2$, there holds
\[ S_2 \le \frac{1}{2} \mathcal{D}(\lambda) + \frac{5 C_M \ln(2/\delta)}{3m}. \]

Proof. Let
\[ \xi(z) := (f_\lambda(x) - y)^2 - (f_\rho(x) - y)^2, \qquad z = (x, y) \in Z; \]
we have $S_2 = \frac{1}{m} \sum_{i=1}^m \xi(z_i) - \mathbb{E}\xi$, with $\mathbb{E}\xi = \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho) = \|f_\lambda - f_\rho\|_\rho^2 \ge 0$, and both $|\xi| \le C_M$ and $\mathbb{E}(\xi^2) \le C_M \mathbb{E}\xi$ hold with $C_M = 16 M^2$. By the one-sided Bernstein inequality,
\[ \mathrm{Prob}\Big\{ \frac{1}{m} \sum_{i=1}^m \xi(z_i) - \mathbb{E}\xi > \varepsilon \Big\} \le \exp\Big( -\frac{m \varepsilon^2}{2 \big( C_M \mathbb{E}\xi + C_M \varepsilon / 3 \big)} \Big). \]
Setting the right-hand side equal to $\delta/2$, we can solve for $\varepsilon$ and obtain the following bound with confidence $1 - \delta/2$:
\[ S_2 \le \frac{2 C_M \ln(2/\delta)}{3m} + \sqrt{\frac{2 C_M \ln(2/\delta)}{m} \mathbb{E}\xi} \le \frac{1}{2} \mathbb{E}\xi + \frac{5 C_M \ln(2/\delta)}{3m} \le \frac{1}{2} \mathcal{D}(\lambda) + \frac{5 C_M \ln(2/\delta)}{3m}. \]
This proves the proposition.
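As a quick Monte Carlo sanity check of the one-sided Bernstein tail used above (the uniform distribution and all numbers below are toy choices, not from the paper):

```python
import numpy as np

# Check: Prob{ mean(xi) - E xi > eps } <= exp(-m eps^2 / (2 (sigma^2 + B eps / 3))).
rng = np.random.default_rng(1)
m, trials, eps = 200, 100_000, 0.1
B = 1.0                                       # |xi| <= B, E xi = 0 below
xi = rng.uniform(-1, 1, size=(trials, m))     # sigma^2 = 1/3
emp = np.mean(xi.mean(axis=1) > eps)
bound = np.exp(-m * eps**2 / (2 * (1/3 + B * eps / 3)))
print(f"empirical tail {emp:.4f} <= Bernstein bound {bound:.4f}")
```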
For the sample error term $S_1$, the estimate is more difficult since it involves the function $f_{\mathbf{z},\lambda}$, which varies as the sample size $m$ changes. So, we need a concentration inequality for a set of functions, as in [21]. By specializing one parameter there to $1$, the inequality becomes as follows.
Lemma 11. Let $\mathcal{F}$ be a set of measurable functions on $Z$, and let $B, c > 0$ be constants such that each function $f \in \mathcal{F}$ satisfies $\|f\|_\infty \le B$ and $\mathbb{E}(f^2) \le c \, \mathbb{E} f$. If, for some $a > 0$ and $0 < p < 2$,
\[ \ln \mathcal{N}(\mathcal{F}, \epsilon, \|\cdot\|_\infty) \le a \epsilon^{-p}, \qquad \forall \epsilon > 0, \]
then there exists a constant $c_p$ depending only on $p$ such that, for any $t > 0$, with probability at least $1 - e^{-t}$, there holds
\[ \mathbb{E} f - \frac{1}{m} \sum_{i=1}^m f(z_i) \le \frac{1}{2} \mathbb{E} f + c_p \eta + \frac{2 c t}{m} + \frac{18 B t}{m}, \qquad \forall f \in \mathcal{F}, \]
where
\[ \eta := \max\Big\{ c^{\frac{2 - p}{4 - 2p}} \Big( \frac{a}{m} \Big)^{\frac{2}{4 - 2p}}, \; B^{\frac{2 - p}{2 + p}} \Big( \frac{a}{m} \Big)^{\frac{2}{2 + p}} \Big\}. \]
This result will be used to estimate $S_1$. We apply the lemma to the function set
\[ \mathcal{G} := \big\{ (y - \pi_M(f)(x))^2 - (y - f_\rho(x))^2 : f \in B_R \big\}, \qquad B_R := \{ f \in \mathcal{H}_K : \|f\|_K \le R \}, \]
and have the following proposition.
Proposition 12. Let $\mathcal{G}$ be defined as above with some $R \ge 1$ satisfying $\|f_{\mathbf{z},\lambda}\|_K \le R$, whose expression will be given in the next section. Assume (3) and (7) hold. Then, with confidence $1 - \delta/2$, we have
\[ S_1 \le \frac{1}{2} \big( \mathcal{E}(\pi_M(f_{\mathbf{z},\lambda})) - \mathcal{E}(f_\rho) \big) + \tilde{c}_s \ln\frac{2}{\delta} \Big( R^{\frac{2s}{2 + s}} m^{-\frac{2}{2 + s}} + \frac{1}{m} \Big) \]
for some constant $\tilde{c}_s$ depending only on $s$.
Proof. Each $g \in \mathcal{G}$ satisfies $\|g\|_\infty \le 8 M^2$ and $\mathbb{E}(g^2) \le 16 M^2 \, \mathbb{E} g$, and by (7) the covering numbers of $\mathcal{G}$ inherit the polynomial bound with exponent $s$ and $a \propto R^s$. Applying Lemma 11 with $t = \ln(2/\delta)$ and inserting the resulting estimates in (45) proves the proposition.

Total Error
Combining the regularization and sample error bounds, we can prove the main result as follows.