Consistency Analysis of Spectral Regularization Algorithms

The analysis in the previous literature provides a deep understanding of the connection between learning theory and regularization. A large class of learning algorithms can be viewed as spectral regularization algorithms associated with different regularizations.

Example 1.2. The regularized least squares algorithm is given by
$$f_z = \arg\min_{f \in \mathcal{H}_K} \frac{1}{m}\sum_{i=1}^{m}\bigl(y_i - f(x_i)\bigr)^2 + \lambda \|f\|_K^2. \quad (1.9)$$
It has been well understood thanks to a large body of literature [4–11]. It is proved in [7] that
$$f_z = (T_x + \lambda I)^{-1} S_x^T \mathbf{y}, \quad (1.10)$$
which corresponds to algorithm (1.8) with the regularization
$$g_\lambda(\sigma) = (\sigma + \lambda)^{-1}. \quad (1.11)$$
In this case, we have $B = D = \gamma = \gamma_{\nu_0} = \alpha = 1$ and the qualification $\nu_0 = 1$.

Example 1.3. In regression learning, the coefficient regularization with the $\ell^2$ norm becomes
$$f_z = f_{\alpha_z}, \qquad \alpha_z = \arg\min_{\alpha \in \mathbb{R}^m} \frac{1}{m}\sum_{i=1}^{m}\bigl(y_i - f_\alpha(x_i)\bigr)^2 + \lambda m \sum_{i=1}^{m} \alpha_i^2.$$
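As a concrete illustration of these two examples, here is a minimal numerical sketch (our own, assuming a Gaussian Mercer kernel and synthetic data; none of the variable names come from the paper). At the sample level, (1.9)-(1.10) reduce to a linear system in the Gram matrix, and Example 1.3 to a closely related one.

```python
import numpy as np

def gram(x, width=0.2):
    """Gaussian (Mercer) kernel matrix K[i, j] = exp(-(x_i - x_j)^2 / (2 width^2))."""
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * width ** 2))

rng = np.random.default_rng(0)
m, lam = 200, 1e-2
x = rng.uniform(0.0, 1.0, m)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(m)
K = gram(x)

# Example 1.2: f_z = sum_j c_j K(., x_j); writing (1.10) in the basis {K_{x_i}}
# gives the kernel ridge regression system (K/m + lam I) c = y/m.
c_ls = np.linalg.solve(K / m + lam * np.eye(m), y / m)

# Example 1.3: f_z = sum_j alpha_j K(., x_j) with the l2 coefficient penalty lam*m*sum(alpha_i^2);
# setting the gradient of the empirical objective to zero gives (K^2/m + lam*m*I) alpha = K y / m.
alpha = np.linalg.solve(K @ K / m + lam * m * np.eye(m), K @ y / m)

def predict(coef, x_new, width=0.2):
    """Evaluate sum_j coef_j K(x_new, x_j)."""
    return np.exp(-(x_new[:, None] - x[None, :]) ** 2 / (2 * width ** 2)) @ coef

print(predict(c_ls, np.array([0.25])), predict(alpha, np.array([0.25])))
```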


Introduction
In this paper, we present a consistency analysis of spectral regularization algorithms in regression learning.
Let $(X, d)$ be a compact metric space and $\rho$ a probability distribution on $Z = X \times Y$ with $Y = \mathbb{R}$. Regression learning aims at estimating or approximating the regression function $f_\rho(x) = \int_Y y \, d\rho(y \mid x)$ from a set of samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m} \in Z^m$ drawn independently and identically according to $\rho$.
In learning theory, a reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel $K(x, y)$ is usually taken as the hypothesis space. Recall that a function $K : X \times X \to \mathbb{R}$ is called a Mercer kernel if it is continuous, symmetric, and positive semidefinite. The reproducing kernel Hilbert space $\mathcal{H}_K$ is defined to be the closure of the linear span of $\{K_x := K(\cdot, x) : x \in X\}$. The reproducing property takes the form
$$f(x) = \langle f, K_x \rangle_K, \qquad \forall f \in \mathcal{H}_K, \ \forall x \in X.$$
Our first contribution is to generalize the definition of regularization in [1] so that many more learning algorithms can be included in the scope of spectral algorithms.

Definition 1.1. We say that a family of continuous functions $g_\lambda : [0, \kappa^2] \to \mathbb{R}$, $\lambda \in (0, 1]$, is a regularization if the following conditions hold.
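For orientation, the following is a non-authoritative sketch of conditions of the kind Definition 1.1 refers to, following the standard filter framework of [1] and incorporating the exponent $\alpha$ discussed below; the exact form of the constants is our assumption rather than a quotation from the definition:
$$\sup_{0 < \sigma \le \kappa^2} |\sigma\, g_\lambda(\sigma)| \le D, \qquad \sup_{0 < \sigma \le \kappa^2} |g_\lambda(\sigma)| \le \frac{B}{\lambda^{\alpha}}, \qquad \sup_{0 < \sigma \le \kappa^2} |1 - \sigma\, g_\lambda(\sigma)| \le \gamma,$$
with the qualification of $g_\lambda$ being the maximal $\nu_0$ such that, for every $0 < \nu \le \nu_0$,
$$\sup_{0 < \sigma \le \kappa^2} |1 - \sigma\, g_\lambda(\sigma)|\, \sigma^{\nu} \le \gamma_{\nu}\, \lambda^{\alpha \nu}.$$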
Our definition of regularization is different from that in [1]. In fact, the definition given in [1] is the special case obtained by taking $\alpha = 1$ in (1.5) and (1.7). From this viewpoint, our assumption is milder and fits more general situations; for example, coefficient regularization algorithms correspond to spectral algorithms with $\alpha = 1/2$, and the relation between coefficient regularization algorithms and spectral algorithms has been explored in [2].

Let $S_x : \mathcal{H}_K \to \mathbb{R}^m$ be the sampling operator, $(S_x f)_i = f(x_i)$, and let $S_x^T : \mathbb{R}^m \to \mathcal{H}_K$, $S_x^T \mathbf{y} = \frac{1}{m}\sum_{i=1}^{m} y_i K_{x_i}$, be its adjoint with respect to the empirical inner product on $\mathbb{R}^m$. For simplicity, we use $T_x$ to stand for $S_x^T S_x$. The spectral regularization algorithm considered here is given by
$$f_z = g_\lambda(T_x)\, S_x^T \mathbf{y}. \quad (1.8)$$
The integral operator $L_K$ associated with the kernel $K$, from $L^2_{\rho_X}(X)$ to $L^2_{\rho_X}(X)$, is defined by
$$L_K f(x) = \int_X K(x, t) f(t)\, d\rho_X(t).$$
$L_K$ is a nonnegative self-adjoint compact operator [4]. If the domain of $L_K$ is restricted to $\mathcal{H}_K$, it is also a nonnegative self-adjoint compact operator from $\mathcal{H}_K$ to $\mathcal{H}_K$, with norm $\|L_K\|_{\mathcal{H}_K \to \mathcal{H}_K} \le \kappa^2$ [16]. In the sequel, we simply write $L_K$ instead of $L_K|_{\mathcal{H}_K \to \mathcal{H}_K}$ and assume that $|y| \le M$ almost surely. As usual, we use the error decomposition (1.19). The first term on the right-hand side of (1.19) is called the sample error, and the second one the approximation error. The sample error depends on the sampling, and the law of large numbers leads to its estimation; the approximation error is independent of the sampling, and its estimation relies mainly on the method of operator approximation.
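A minimal sketch of how (1.8) can be computed at the sample level (our own illustration with assumed names, not code from the paper): in the basis $\{K_{x_1}, \dots, K_{x_m}\}$, the operator $T_x = S_x^T S_x$ is represented by the matrix $K/m$ built from the Gram matrix, and $S_x^T \mathbf{y}$ by the coefficient vector $\mathbf{y}/m$, so applying $g_\lambda(T_x)$ amounts to applying $g_\lambda$ to the eigenvalues of $K/m$.

```python
import numpy as np

def gram(x, width=0.2):
    """Gaussian kernel Gram matrix on the sample points."""
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * width ** 2))

def spectral_estimator(K, y, g_lambda):
    """Coefficients c with f_z = sum_j c_j K(., x_j) for f_z = g_lambda(T_x) S_x^T y.
    T_x is represented by K/m in the basis {K_{x_i}} and S_x^T y by y/m."""
    m = len(y)
    sigma, V = np.linalg.eigh(K / m)               # spectrum of the empirical operator
    return V @ (g_lambda(sigma) * (V.T @ (y / m)))

# Tikhonov filter of Example 1.2; plugging it in recovers kernel ridge regression.
tikhonov = lambda lam: (lambda s: 1.0 / (s + lam))

rng = np.random.default_rng(1)
m, lam = 150, 1e-2
x = rng.uniform(0.0, 1.0, m)
y = np.cos(2 * np.pi * x) + 0.1 * rng.standard_normal(m)
K = gram(x)

c_spec = spectral_estimator(K, y, tikhonov(lam))
c_closed = np.linalg.solve(K / m + lam * np.eye(m), y / m)
print(np.allclose(c_spec, c_closed))               # True: (1.8) reduces to (1.10) here
```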
In order to derive the error bounds and learning rates, we have to restrict the class of admissible probability measures; such a restriction is usually called a prior condition. In the previous literature, prior conditions are usually described through the smoothness of the regression function $f_\rho$. We suppose the following prior condition:
$$f_{\mathcal{H}} = \phi(L_K)\, h_0 \quad \text{for some } h_0 \in L^2_{\rho_X}(X). \quad (1.21)$$
Here $\phi$, called the index function, is a continuous nondecreasing function defined on $[0, \kappa^2]$ with $\phi(0) = 0$. In the sequel, we require the qualification $\nu_0 > 1/2$ and the existence of $\mu_0 > 0$ covering $\phi$, which means that there is $c > 0$ such that the corresponding covering inequality holds. It is easy to see that, for any $\mu \ge \mu_0$, $\mu$ covers $\phi$. Furthermore, we require that $\phi(t)$ is operator monotone on $[0, \kappa^2]$; that is, there is a constant $c_\phi < \infty$ such that, for any pair $U, V$ of nonnegative self-adjoint operators on some Hilbert space with norm less than $\kappa^2$,
$$\|\phi(U) - \phi(V)\| \le c_\phi\, \phi(\|U - V\|).$$
It is proved that $\phi(t) = t^{\alpha}$ for $0 \le \alpha \le 1$ is operator monotone [8]. In [1], Bauer et al. consider a prior condition that requires $f_{\mathcal{H}}$ to belong to $\mathcal{H}_K$; this is somewhat restrictive. Our result shows that a satisfactory error bound is available under the more general prior condition (1.21), which is our second main contribution; from this viewpoint, our work is meaningful. The main result of this paper is the following theorem.

(1.27)
where $C_1$ is a constant independent of $\lambda$, $m$, and $\delta$.
This theorem establishes the consistency of the spectral algorithms, provides an error bound, and also leads to satisfactory learning rates once the index function $\phi$ is given explicitly.
This paper is organized as follows. In Section 2, we prove a basic lemma on the estimation of operator norms related to the regularization, together with two concentration inequalities for vector-valued random variables. In Section 3, we give the proof of Theorem 1.5. In Section 4, we derive learning rates for several specific regularizations.
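Before turning to the proofs, here is a small numerical illustration of a prior condition of type (1.21) with $\phi(t) = t^r$ (a toy sketch under assumptions of our own: $X = [0, 1]$ with the uniform marginal, a Gaussian kernel, and a quadrature-grid approximation of $L_K$; the names are hypothetical). Applying $\phi(L_K) = L_K^r$ to a rough $h_0 \in L^2$ produces a target $f_{\mathcal{H}}$ whose smoothness increases with $r$.

```python
import numpy as np

n = 400                                    # quadrature grid approximating L^2([0, 1], dx)
t = (np.arange(n) + 0.5) / n
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * 0.1 ** 2))

# Quadrature approximation of the integral operator: (L_K f)(s) ~ (1/n) sum_j K(s, t_j) f(t_j).
L = K / n
sigma, V = np.linalg.eigh(L)
sigma = np.clip(sigma, 0.0, None)          # clear tiny negative round-off eigenvalues

rng = np.random.default_rng(2)
h0 = rng.standard_normal(n)                # a rough element of L^2

def source(r):
    """f_H = phi(L_K) h0 with phi(t) = t^r, via the eigendecomposition of L."""
    return V @ (sigma ** r * (V.T @ h0))

def roughness(f):
    """Relative roughness: mean squared second difference normalized by the mean square of f."""
    return np.mean(np.diff(f, 2) ** 2) / np.mean(f ** 2)

for r in (0.25, 0.5, 1.0):
    print(r, roughness(source(r)))         # relative roughness decreases as r grows
```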

(2.11)

The estimates of operator norms mainly rely on the following classical argument from operator theory: let $A$ be a positive operator on a Hilbert space; for $f \in C([0, \|A\|])$, the operator $f(A)$ is self-adjoint by [17], and $\|f(A)\| \le \sup_{0 \le \sigma \le \|A\|} |f(\sigma)|$.

Here $\xi$ is the random variable on $(X, \rho_X)$ given by $\xi(x) = \langle \cdot, K_x \rangle_K K_x$. For $x \in X$ and $f \in \mathcal{H}_K$, the reproducing property ensures that
$$\xi(x) f = \langle f, K_x \rangle_K K_x = f(x) K_x. \quad (2.16)$$
Hence $E\xi = L_K$, and thereby, according to (2.15), there holds $\sigma^2(\xi) \le E\|\xi\|_{HS}^2 \le \kappa^4$. Inequality (2.14) then follows from (2.12) and the fact that $\|\xi\|_{HS} \le \kappa^2$ almost surely.
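The concentration phenomenon behind these estimates can be seen numerically in a finite-rank setting (a toy sketch under assumptions of our own: a kernel $K(x, t) = \Phi(x)^{\top}\Phi(t)$ built from a few fixed feature functions, so that operators on $\mathcal{H}_K$ become small matrices and the Hilbert-Schmidt norm becomes the Frobenius norm). The empirical mean of $\xi$ then fluctuates around its expectation $L_K$ at roughly the $1/\sqrt{m}$ scale.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6                                            # finite-rank kernel: K(x, t) = Phi(x) . Phi(t)

def phi(x):
    """Feature map Phi: X = [0, 1] -> R^d (a few fixed cosine features)."""
    k = np.arange(1, d + 1)
    return np.cos(np.pi * np.outer(x, k))        # shape (len(x), d)

# "Population" operator L_K = E[Phi(x) Phi(x)^T], approximated on a very fine grid.
grid = (np.arange(100_000) + 0.5) / 100_000
P = phi(grid)
L_K = P.T @ P / len(grid)

for m in (100, 400, 1600, 6400):
    x = rng.uniform(0.0, 1.0, m)
    Pm = phi(x)
    T_x = Pm.T @ Pm / m                          # empirical mean of xi(x) = Phi(x) Phi(x)^T
    err = np.linalg.norm(T_x - L_K, "fro")       # Hilbert-Schmidt norm of the deviation
    print(m, err, err * np.sqrt(m))              # the last column stays roughly of one size
```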

(2.18)

Proof. Define $\varsigma = (f_\lambda(x) - y) K_x$, so that $\varsigma$ is a random variable from $Z$ to $\mathcal{H}_K$. Combining the reproducing property with the Cauchy-Schwarz inequality, we get

(2.19)
Since $L_K^{1/2}$ is an isometric isomorphism from $(\mathcal{H}_K, \|\cdot\|_{\rho_X})$ onto $(\mathcal{H}_K, \|\cdot\|_K)$ (see [16]), we obtain

(2.20)
where the last inequality follows from (2.4). Since $|y| \le M$ almost surely, there holds

(2.21)
By (2.3) and the identity $L_K f_\rho = L_K f_{\mathcal{H}}$, we obtain the following error estimate, in which the last inequality follows from (2.2).
Let us now focus on the estimation of the sample error and consider the decomposition (3.3). The idea is to bound each of its terms separately in $\mathcal{H}_K$. We start with the first term of (3.3):

(3.4)
According to (1.4) and (1.5), we derive the bound (3.5). Now we are in a position to bound (3.4).

Suppose that $m \ge 2\log(4/\delta)$. Then, by Lemmas 2.3 and 2.4, with confidence $1 - \delta$, the following inequalities hold simultaneously. Combining (1.6) and (3.5) with the operator monotonicity of $\phi(t)$ and $t^{1/2}$, we obtain

(3.9)
In order to bound $\|J_3\|_K$, we rewrite $J_3$ in the following form:

(3.10)
In the same way, we have that

(3.11)

Thus, we can get the bound for $\|I_1\|_K$ by combining (3.8), (3.9), and (3.11). What is left is to estimate $\|I_2\|_K$ and $\|I_3\|_K$; for these we can employ the same method used in the estimation of $\|I_1\|_K$, which yields (3.12).

Learning Rates
The significance of this paper lies in two facts: first, we generalize the definition of regularization and thereby enrich the content of spectral regularization algorithms; second, the analysis of this paper can be carried out under the very general prior condition (1.21). Thus, our results can be applied to many different kinds of regularization, such as regularized least squares learning, coefficient regularization learning, accelerated Landweber iteration, and spectral cutoff. In this section, we choose a suitable index function and apply Theorem 1.5 to some specific algorithms mentioned in Section 1.
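For concreteness, the filter functions $g_\lambda$ usually associated with the algorithms just listed can be written down as follows (a hedged sketch: the Tikhonov filter is the one stated in Example 1.2, while the plain Landweber and spectral cutoff forms are standard textbook choices whose constants are our assumption, not quoted from this paper).

```python
import numpy as np

def tikhonov(lam):
    """Regularized least squares (Example 1.2): g_lambda(sigma) = 1 / (sigma + lambda)."""
    return lambda s: 1.0 / (s + lam)

def landweber(lam, step=1.0):
    """Plain Landweber iteration with t ~ 1/lambda steps:
    g_lambda(sigma) = step * sum_{i < t} (1 - step * sigma)^i, a truncated geometric series."""
    t = max(int(np.ceil(1.0 / lam)), 1)
    def g(s):
        out = np.zeros_like(s, dtype=float)
        for i in range(t):                        # explicit sum keeps the formula transparent
            out += step * (1.0 - step * s) ** i
        return out
    return g

def spectral_cutoff(lam):
    """Truncated spectral decomposition: keep 1/sigma above the threshold lambda."""
    return lambda s: np.where(s >= lam, 1.0 / np.maximum(s, lam), 0.0)

# Sanity check on a grid of spectral values sigma in (0, 1]: sigma * g_lambda(sigma)
# stays bounded for each filter, as one expects a regularization to require.
sigma = np.linspace(1e-4, 1.0, 1000)
filters = [("tikhonov", tikhonov(0.05)), ("landweber", landweber(0.05)), ("cutoff", spectral_cutoff(0.05))]
for name, g in filters:
    print(name, np.max(np.abs(sigma * g(sigma))))
```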

Least Squares Regularization
In this case, the regularization is $g_\lambda(\sigma) = 1/(\sigma + \lambda)$, $\lambda \in (0, 1]$, with $B = D = \gamma = \gamma_{\nu_0} = \alpha = 1$. The qualification of this algorithm is $\nu_0 = 1$. Suppose $\phi(t) = t^r$ with $0 < r \le 1$; this means that $f_{\mathcal{H}} = L_K^r h_0$ with $h_0 \in L^2_{\rho_X}(X)$. Thus we have $\mu_0 = r$ covering $\phi(t)$.
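A toy experiment in this setting (our own sketch; the synthetic regression function, the Gaussian kernel, and the illustrative choice $\lambda = m^{-1/2}$ are assumptions made only for the demonstration, not the parameter choice prescribed by the theory): the empirical $L^2$ error of the regularized least squares estimator decreases as the sample size $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(4)
width = 0.2
f_rho = lambda x: np.sin(2 * np.pi * x)            # synthetic regression function

def kernel(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width ** 2))

grid = (np.arange(2000) + 0.5) / 2000              # fine grid to approximate the L^2 norm

for m in (50, 200, 800, 3200):
    lam = m ** -0.5                                # illustrative choice, not the paper's rule
    x = rng.uniform(0.0, 1.0, m)
    y = f_rho(x) + 0.1 * rng.standard_normal(m)
    K = kernel(x, x)
    c = np.linalg.solve(K / m + lam * np.eye(m), y / m)   # coefficients of f_z, cf. (1.10)
    f_z = kernel(grid, x) @ c
    err = np.sqrt(np.mean((f_z - f_rho(grid)) ** 2))      # empirical ||f_z - f_rho||
    print(m, err)
```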