We investigate the consistency of spectral regularization algorithms in regression learning. We generalize the usual definition of the regularization function so that a wider class of learning algorithms falls within the scope of spectral regularization. Under a more general prior condition, and using refined error decompositions together with operator norm estimates, we prove satisfactory error bounds and learning rates.
1. Introduction
In this paper, we study the consistency of spectral regularization algorithms in regression learning.
Let $(X,d)$ be a compact metric space and $\rho$ a probability distribution on $Z=X\times Y$ with $Y=\mathbb{R}$. Regression learning aims at estimating or approximating the regression function
$$f_\rho(x)=\int_Y y\,d\rho(y\mid x)$$
through a set of samples $\mathbf z=\{(x_i,y_i)\}_{i=1}^m\in Z^m$ drawn independently and identically according to $\rho$ from $Z$.
In learning theory, a reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel $K(x,y)$ is usually taken as the hypothesis space. Recall that a function $K:X\times X\to\mathbb{R}$ is called a Mercer kernel if it is continuous, symmetric, and positive semidefinite. The reproducing kernel Hilbert space $\mathcal H_K$ is defined to be the closure of the linear span of $\{K_x:=K(\cdot,x):x\in X\}$. The reproducing property takes the form
$$f(x)=\langle f,K_x\rangle_K,\qquad\forall f\in\mathcal H_K,\ \forall x\in X.$$
For the Mercer kernel $K$, we denote
$$\kappa=\max_{x\in X}\sqrt{K(x,x)}.$$
Our first contribution is to generalize the definition of regularization in [1] such that many more learning algorithms can be included in the scope of spectral algorithms.
Definition 1.1.
We say that a family of continuous functions $g_\lambda:[0,\kappa^2]\to\mathbb{R}$, $\lambda\in(0,1]$, is a regularization if the following conditions hold.
There exists a constant $D$ such that
$$\sup_{0<\sigma\le\kappa^2}|\sigma g_\lambda(\sigma)|\le D.\qquad(1.4)$$
There exist constants $B>0$ and $0<\alpha\le1$ such that
$$\sup_{0<\sigma\le\kappa^2}|g_\lambda(\sigma)|\le B\lambda^{-\alpha}.\qquad(1.5)$$
There exists a constant $\gamma$ such that
$$\sup_{0<\sigma\le\kappa^2}|1-g_\lambda(\sigma)\sigma|\le\gamma.\qquad(1.6)$$
The qualification $\nu_0$ of the regularization $g_\lambda$ is the maximal $\nu$ such that
$$\sup_{0<\sigma\le\kappa^2}|1-g_\lambda(\sigma)\sigma|\,\sigma^{\nu}\le\gamma_\nu\lambda^{\alpha\nu},\qquad(1.7)$$
where $\gamma_\nu$ does not depend on $\lambda$.
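The three conditions are straightforward to check numerically for a candidate regularization. The following minimal sketch (not part of the original analysis) estimates the constants $D$, $B$, and $\gamma$ on a grid of $\sigma$ for the Tikhonov choice $g_\lambda(\sigma)=1/(\sigma+\lambda)$ with $\alpha=1$, taking $\kappa=1$ purely as an illustrative assumption; in this case all three constants equal $1$, as stated in Example 1.2 below.

```python
import numpy as np

# Minimal sketch: numerically estimate the constants of Definition 1.1 on a sigma grid.
# Illustrative assumptions: g_lambda(sigma) = 1/(sigma + lambda) (Tikhonov), kappa = 1, alpha = 1.

kappa = 1.0
sigmas = np.linspace(1e-6, kappa**2, 10_000)      # grid approximating sup over (0, kappa^2]
lambdas = np.logspace(-4, 0, 20)                  # lambda in (0, 1]

def g(sigma, lam):
    return 1.0 / (sigma + lam)

for lam in lambdas:
    gs = g(sigmas, lam)
    D_hat = np.max(np.abs(sigmas * gs))           # condition (1.4): sup |sigma g(sigma)| <= D
    B_hat = np.max(np.abs(gs)) * lam              # condition (1.5): sup |g(sigma)| <= B / lambda^alpha, alpha = 1
    gamma_hat = np.max(np.abs(1.0 - gs * sigmas)) # condition (1.6): sup |1 - g(sigma) sigma| <= gamma
    assert max(D_hat, B_hat, gamma_hat) <= 1.0 + 1e-8
```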
Our definition of regularization differs from that in [1]. In fact, the definition given in [1] is the special case obtained by taking $\alpha=1$ in (1.5) and (1.7). From this viewpoint, our assumption is milder and fits more general situations; for example, coefficient regularization algorithms correspond to spectral algorithms with $\alpha=1/2$, and the relation between coefficient regularization algorithms and spectral algorithms was explored in [2].
Let $\mathbf x=\{x_i\}_{i=1}^m$ and $\mathbf y=\{y_i\}_{i=1}^m$. The sample operator $S_{\mathbf x}:\mathcal H_K\to\mathbb R^m$ is defined by $S_{\mathbf x}f=\{f(x_i)\}_{i=1}^m$. The adjoint of $S_{\mathbf x}$ with respect to the inner product $\langle c,d\rangle=\frac1m\sum_{i=1}^m c_id_i$ on $\mathbb R^m$ is $S_{\mathbf x}^{T}c=\frac1m\sum_{i=1}^m c_iK_{x_i}$. For simplicity, we write $T_{\mathbf x}$ for $S_{\mathbf x}^{T}S_{\mathbf x}$.
The spectral regularization algorithm considered here is given by
$$f_{\mathbf z}^{\lambda}=g_\lambda(T_{\mathbf x})S_{\mathbf x}^{T}\mathbf y.\qquad(1.8)$$
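For computation, the estimator (1.8) never needs to be formed abstractly in $\mathcal H_K$: since $T_{\mathbf x}S_{\mathbf x}^{T}=S_{\mathbf x}^{T}(S_{\mathbf x}S_{\mathbf x}^{T})$ and $S_{\mathbf x}S_{\mathbf x}^{T}$ acts on $\mathbb R^m$ as the matrix $K/m$ with $K_{ij}=K(x_i,x_j)$, one has $f_{\mathbf z}^{\lambda}=S_{\mathbf x}^{T}g_\lambda(K/m)\mathbf y$, that is, $f_{\mathbf z}^{\lambda}(x)=\frac1m\sum_i[g_\lambda(K/m)\mathbf y]_iK(x,x_i)$. The sketch below illustrates this with a Gaussian kernel and synthetic one-dimensional data, both of which are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

# Minimal sketch of f_z^lambda = g_lambda(T_x) S_x^T y (equation (1.8)).
# Illustrative assumptions: Gaussian kernel, synthetic 1-d data, specific parameter values.
# The estimator is evaluated through the spectrum of the m x m kernel matrix K/m:
#     f_z^lambda(t) = (1/m) * sum_i [g_lambda(K/m) y]_i K(t, x_i).

def kernel(a, b, width=0.5):
    # Gaussian kernel matrix between 1-d sample vectors a and b (illustrative choice).
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width ** 2))

def spectral_estimator(x, y, g_lambda, lam):
    m = len(x)
    evals, evecs = np.linalg.eigh(kernel(x, x) / m)     # spectrum of S_x S_x^T = K/m
    coeffs = evecs @ (g_lambda(np.maximum(evals, 0.0), lam) * (evecs.T @ y)) / m
    return lambda t: kernel(t, x) @ coeffs              # f(t) = sum_i coeffs_i K(t, x_i)

# Example: Tikhonov regularization g_lambda(sigma) = 1/(sigma + lambda) (Example 1.2 below).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)
f = spectral_estimator(x, y, lambda s, lam: 1.0 / (s + lam), lam=0.01)
print(f(np.array([0.25, 0.5, 0.75])))
```

Any of the regularizations in the examples below can be plugged in through the function `g_lambda`.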
The regularization $g_\lambda$, $\lambda\in(0,1]$, in (1.8) was proposed originally to solve ill-posed inverse problems. The relation between learning theory and regularization of linear ill-posed problems has been well discussed in a series of articles; see [1, 3] and the references therein. The analysis in the previous literature provides a deep understanding of the connection between learning theory and regularization.
A large class of learning algorithms can be considered as spectral regularization algorithms in accordance with different regularizations.
Example 1.2.
The regularized least square algorithm is given as
$$f_{\mathbf z}^{\lambda}=\arg\min_{f\in\mathcal H_K}\frac1m\sum_{i=1}^m(y_i-f(x_i))^2+\lambda\|f\|_K^2.$$
This algorithm has been thoroughly studied; see, for example, [4–11]. It is proved in [7] that
$$f_{\mathbf z}^{\lambda}=(T_{\mathbf x}+\lambda I)^{-1}S_{\mathbf x}^{T}\mathbf y,$$
which corresponds to algorithm (1.8) with the regularization
$$g_\lambda(\sigma)=(\sigma+\lambda)^{-1}.$$
In this case, we have $B=D=\gamma=\gamma_{\nu_0}=\alpha=1$ and the qualification $\nu_0=1$.
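As a quick consistency check on illustrative data (a sketch, not part of the paper's analysis), the spectral coefficient vector $\frac1m g_\lambda(K/m)\mathbf y$ produced by (1.8) with this $g_\lambda$ coincides with the familiar kernel ridge solution $(K+m\lambda I)^{-1}\mathbf y$ of the minimization problem above.

```python
import numpy as np

# Consistency check on illustrative data: with g_lambda(sigma) = 1/(sigma + lambda),
# the spectral coefficients (1/m) * g_lambda(K/m) y equal the usual kernel ridge
# coefficients (K + m*lambda*I)^{-1} y.  Gaussian kernel and synthetic data assumed.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.5 ** 2))
m, lam = len(x), 0.01
c_spectral = np.linalg.solve(K / m + lam * np.eye(m), y) / m
c_ridge = np.linalg.solve(K + m * lam * np.eye(m), y)
print(np.max(np.abs(c_spectral - c_ridge)))   # agreement up to rounding error
```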
Example 1.3.
In regression learning, coefficient regularization with the $\ell^2$ norm takes the form
$$f_{\mathbf z}^{\lambda}=f_{\alpha^{\mathbf z}},\qquad \alpha^{\mathbf z}=\arg\min_{\alpha\in\mathbb R^m}\frac1m\sum_{i=1}^m(y_i-f_\alpha(x_i))^2+\lambda m\sum_{i=1}^m\alpha_i^2,$$
where
$$f_\alpha=\sum_{i=1}^m\alpha_iK_{x_i},\qquad\forall\alpha\in\mathbb R^m.$$
Coefficient regularization was first introduced by Vapnik [12] to design linear programming support vector machines. The consistency of this algorithm has been studied in [2, 13, 14]. In [2], it is proved that the sample error has $O(1/m)$ decay, even for nonpositive semidefinite kernels, and
$$f_{\mathbf z}^{\lambda}=(\lambda I+T_{\mathbf x}^{2})^{-1}T_{\mathbf x}S_{\mathbf x}^{T}\mathbf y.$$
Thus, it corresponds to algorithm (1.8) with the regularization
$$g_\lambda(\sigma)=\frac{\sigma}{\sigma^2+\lambda}.$$
In this case, we have $B=D=\gamma=\gamma_{\nu_0}=1$, the qualification $\nu_0=2$, and $\alpha=1/2$.
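Solving the minimization in this example directly through its normal equations gives $\alpha^{\mathbf z}=(K^2+\lambda m^2I)^{-1}K\mathbf y$; the sketch below (illustrative data and a Gaussian kernel, both assumptions of the sketch) confirms numerically that this agrees with the spectral form $\frac1m g_\lambda(K/m)\mathbf y$ for $g_\lambda(\sigma)=\sigma/(\sigma^2+\lambda)$.

```python
import numpy as np

# Sketch on illustrative data: the l2 coefficient regularization of Example 1.3 solved
# directly via its normal equations, alpha = (K^2 + lambda*m^2*I)^{-1} K y, compared with
# the spectral form (1/m) g_lambda(K/m) y for g_lambda(sigma) = sigma/(sigma^2 + lambda).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.5 ** 2))   # Gaussian kernel (assumed)
m, lam = len(x), 1e-3

alpha_direct = np.linalg.solve(K @ K + lam * m ** 2 * np.eye(m), K @ y)

evals, evecs = np.linalg.eigh(K / m)
g = evals / (evals ** 2 + lam)                     # g_lambda(sigma) = sigma/(sigma^2 + lambda)
alpha_spectral = evecs @ (g * (evecs.T @ y)) / m
print(np.max(np.abs(alpha_direct - alpha_spectral)))   # agreement up to rounding error
```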
Example 1.4.
Landweber iteration is defined by $g_\lambda(\sigma)=\sum_{i=0}^{\lfloor1/\lambda\rfloor-1}(1-\sigma)^i$, where $\lfloor a\rfloor=\max\{m\in\mathbb Z:m\le a\}$. This corresponds to the gradient descent algorithm of Yao et al. [15] with constant step-size. In this case, any $\nu\in[0,+\infty)$ can be considered as a qualification of this method, with $\gamma_\nu=1$ if $0<\nu\le1$ and $\gamma_\nu=\nu^{\nu}$ otherwise.
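Equivalently, the Landweber estimator can be computed by running $t=\lfloor1/\lambda\rfloor$ gradient descent steps $f_{k+1}=f_k+S_{\mathbf x}^{T}\mathbf y-T_{\mathbf x}f_k$ with $f_0=0$, which in the coefficient representation $f_k=\sum_i c_iK_{x_i}$ reads $c\leftarrow c+(\mathbf y-Kc)/m$. The sketch below (Gaussian kernel and synthetic data are assumptions of the sketch; they keep $\kappa=1$, so the unit step-size is admissible) compares this iteration with the closed spectral form.

```python
import numpy as np

# Sketch on illustrative data: Landweber iteration as gradient descent with unit step,
#   f_{k+1} = f_k + S_x^T y - T_x f_k,  f_0 = 0,  run for t = floor(1/lambda) steps.
# In coefficient form this is c <- c + (y - K c)/m.  Gaussian kernel assumed (kappa = 1).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.5 ** 2))
m, lam = len(x), 0.01
t = int(np.floor(1.0 / lam))

c = np.zeros(m)
for _ in range(t):
    c = c + (y - K @ c) / m                     # one Landweber / gradient-descent step

# Same estimator via g_lambda(sigma) = sum_{i<t} (1 - sigma)^i = (1 - (1 - sigma)^t)/sigma.
evals, evecs = np.linalg.eigh(K / m)
g = np.empty_like(evals)
small = np.abs(evals) < 1e-12
g[~small] = (1.0 - (1.0 - evals[~small]) ** t) / evals[~small]
g[small] = float(t)                             # limiting value of g_lambda at sigma = 0
c_spectral = evecs @ (g * (evecs.T @ y)) / m
print(np.max(np.abs(c - c_spectral)))           # agreement up to rounding error
```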
Let $f_{\mathcal H}^{+}$ be the projection of $f_\rho$ onto $\overline{\mathcal H_K}$, where $\overline{\mathcal H_K}$ denotes the closure of $\mathcal H_K$ in $L^2_{\rho_X}(X)$. The generalization error of $f_{\mathbf z}^{\lambda}$ is
$$\mathcal E(f_{\mathbf z}^{\lambda})=\int_Z(f_{\mathbf z}^{\lambda}(x)-y)^2\,d\rho=\int_X(f_{\mathbf z}^{\lambda}(x)-f_{\mathcal H}^{+}(x))^2\,d\rho_X+\int_X(f_{\mathcal H}^{+}(x)-f_\rho(x))^2\,d\rho_X+\sigma^2,$$
where $\rho_X$ is the marginal distribution of $\rho$ on $X$ and $\sigma^2$ is the variance of the random variable $y-f_\rho(x)$. Hence the goodness of the approximation $f_{\mathbf z}^{\lambda}$ is measured by $\|f_{\mathbf z}^{\lambda}-f_{\mathcal H}^{+}\|_{\rho_X}$, where the $L^2$ norm is defined as
$$\|f\|_{\rho_X}=\Bigl(\int_X|f(x)|^2\,d\rho_X\Bigr)^{1/2},\qquad\forall f\in L^2_{\rho_X}(X).$$
The integral operator $L_K$ associated with the kernel $K$, from $L^2_{\rho_X}(X)$ to $L^2_{\rho_X}(X)$, is defined by
$$L_Kf(x)=\int_XK(x,t)f(t)\,d\rho_X(t),\qquad\forall f\in L^2_{\rho_X}(X).$$
$L_K$ is a nonnegative self-adjoint compact operator [4]. If the domain of $L_K$ is restricted to $\mathcal H_K$, it is also a nonnegative self-adjoint compact operator from $\mathcal H_K$ to $\mathcal H_K$, with norm $\|L_K\|_{\mathcal H_K\to\mathcal H_K}\le\kappa^2$ [16]. In the sequel, we simply write $\|L_K\|$ instead of $\|L_K\|_{\mathcal H_K\to\mathcal H_K}$ and assume that $|y|\le M$ almost surely.
As usual, we use the following error decomposition:
$$\|f_{\mathbf z}^{\lambda}-f_{\mathcal H}^{+}\|_{\rho_X}\le\|f_{\mathbf z}^{\lambda}-f_\lambda\|_{\rho_X}+\|f_\lambda-f_{\mathcal H}^{+}\|_{\rho_X},\qquad(1.19)$$
where
$$f_\lambda=g_\lambda(L_K)L_Kf_{\mathcal H}^{+}.$$
The first term on the right-hand side of (1.19) is called the sample error, and the second one the approximation error. The sample error depends on the sampling and is estimated by means of the law of large numbers; the approximation error is independent of the sampling and is estimated mainly through operator approximation methods.
In order to deduce error bounds and learning rates, we have to restrict the class of possible probability measures; such a restriction is usually called a prior condition. In the previous literature, prior conditions are usually described through the smoothness of the regression function $f_\rho$. We suppose the following prior condition:
$$f_{\mathcal H}^{+}=\varphi(L_K)h_0,\qquad h_0\in L^2_{\rho_X}(X),\quad\|h_0\|_{\rho_X}\le R.\qquad(1.21)$$
Here, $\varphi$, called the index function, is a continuous nondecreasing function defined on $[0,\kappa^2]$ with $\varphi(0)=0$.
In the sequel, we require the qualification $\nu_0>1/2$ and the existence of $\mu_0>0$ covering $\varphi$, which means that there is $c>0$ such that
$$c\,\frac{\lambda^{\mu_0}}{\varphi(\lambda)}\le\inf_{\lambda\le\sigma\le\kappa^2}\frac{\sigma^{\mu_0}}{\varphi(\sigma)},\qquad0<\lambda\le\kappa^2.$$
It is easy to see that, for any μ≥μ0, μ covers φ.
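For a concrete illustration (this is the index function used later in Section 4), take $\varphi(t)=t^{r}$ with $0<r\le1$. Then $\mu_0=r$ covers $\varphi$ with $c=1$, since
$$\inf_{\lambda\le\sigma\le\kappa^2}\frac{\sigma^{r}}{\varphi(\sigma)}=\inf_{\lambda\le\sigma\le\kappa^2}\frac{\sigma^{r}}{\sigma^{r}}=1=\frac{\lambda^{r}}{\varphi(\lambda)},\qquad0<\lambda\le\kappa^2,$$
and, in accordance with the remark above, any $\mu\ge r$ also covers $\varphi$ because $\sigma^{\mu}/\varphi(\sigma)=\sigma^{\mu-r}\ge\lambda^{\mu-r}=\lambda^{\mu}/\varphi(\lambda)$ for $\lambda\le\sigma$.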
Furthermore, we require that $\varphi(t)$ is operator monotone on $[0,\kappa^2]$, that is, there is a constant $c_\varphi<\infty$ such that, for any pair $U,V$ of nonnegative self-adjoint operators on some Hilbert space with norms bounded by $\kappa^2$, it holds that
$$\|\varphi(U)-\varphi(V)\|\le c_\varphi\varphi(\|U-V\|),$$
and there is $d_\varphi>0$ such that
$$d_\varphi\lambda\varphi(\lambda)\le\sigma\varphi(\sigma),\qquad0<\lambda<\sigma\le\kappa^2.$$
It is proved in [8] that $\varphi(t)=t^{r}$ for $0\le r\le1$ is operator monotone.
In [1], Bauer et al. consider the following prior condition:
$$f_{\mathcal H}^{+}\in\Omega_{\varphi,R},\qquad\Omega_{\varphi,R}=\{f\in\mathcal H_K:\ f=\varphi(L_K)v,\ \|v\|_K\le R\}.$$
This condition is somewhat restrictive, since it requires that $f_{\mathcal H}^{+}$ belong to $\mathcal H_K$. Our result shows that a satisfactory error bound is still available under the more general prior condition (1.21); this is our second main contribution. The main result of this paper is the following theorem.
Theorem 1.5.
Suppose that the index function $\varphi$, covered by $\mu_0>0$, is operator monotone on $[0,\kappa^2]$, that the qualification satisfies $\nu_0>\max\{1/2,\mu_0\}$, and that $m\ge2\log(4/\delta)$ for some $0<\delta<1$. Then, with confidence $1-\delta$, there holds
$$\|f_{\mathbf z}^{\lambda}-f_{\mathcal H}^{+}\|_{\rho_X}\le C_1\Bigl\{\bigl(1+\lambda^{-\alpha/2}\zeta^{1/2}\bigr)\bigl(\varphi(\lambda)\lambda^{(\alpha-1)\mu_0}+\varphi(\zeta)+\lambda^{-\alpha/2}\eta\bigr)+\bigl[\lambda^{\alpha-1}(\varphi(\lambda))^{1/\mu_0}\bigr]^{\min\{\mu_0,\nu_0-1/2\}}\Bigr\},$$
where
$$\zeta=2\kappa^2\sqrt{\frac{2\log(4/\delta)}{m}},\qquad
\eta=\varphi(\lambda)\lambda^{-\mu_0+\min\{\alpha(\mu_0-1/2),0\}}m^{-1}\log\frac4\delta+\bigl(1+\varphi(\lambda)\lambda^{(\alpha-1)\mu_0}\bigr)m^{-1/2}\log\frac4\delta+\lambda^{\min\{\mu_0,\nu_0-1/2\}(\alpha-1)+\alpha/2}(\varphi(\lambda))^{\min\{(2\nu_0-1)/2\mu_0,1\}},$$
and C1 is a constant independent of λ,m,δ.
This theorem establishes the consistency of the spectral algorithms, gives an error bound, and leads to satisfactory learning rates once $\varphi$ is specified explicitly.
This paper is organized as follows. In Section 2, we prove a basic lemma on the estimation of operator norms related to the regularization and state two concentration inequalities for vector-valued random variables. In Section 3, we give the proof of Theorem 1.5. In Section 4, we derive learning rates for several specific regularizations.
2. Some Lemmas
We simply write $\gamma_0$ instead of $\gamma_{\nu_0}$ in (1.7) for the qualification $\nu_0$. To estimate the error $\|f_{\mathbf z}^{\lambda}-f_{\mathcal H}^{+}\|_{\rho_X}$, we need the following lemma to bound the norms of some operators.
Lemma 2.1.
Let φ be an index function and ν0>max{1/2,μ0}. Then, the following inequalities hold true:
$$\sup_{0<\sigma\le\kappa^2}|1-g_\lambda(\sigma)\sigma|\,\sigma^{s}\le\gamma^{1-s/\nu_0}\gamma_0^{\,s/\nu_0}\lambda^{\alpha s},\qquad\forall\,0<s\le\nu_0,\qquad(2.1)$$
$$\sup_{0<\sigma\le\kappa^2}|1-g_\lambda(\sigma)\sigma|\,\varphi^{s}(\sigma)\le\alpha_s\,\varphi^{s}(\lambda)\,\lambda^{(\alpha-1)\mu_0 s},\qquad\forall\,0<s\le\frac{\nu_0}{\mu_0},\qquad(2.2)$$
$$\sup_{0<\sigma\le\kappa^2}|1-g_\lambda(\sigma)\sigma|\,\varphi(\sigma)\,\sigma^{1/2}\le\beta_1\,\lambda^{\min\{\mu_0,\nu_0-1/2\}(\alpha-1)+\alpha/2}\,(\varphi(\lambda))^{\min\{(2\nu_0-1)/2\mu_0,\,1\}},\qquad(2.3)$$
$$\sup_{0<\sigma\le\kappa^2}|g_\lambda(\sigma)\,\sigma^{1/2}\varphi(\sigma)|\le\beta_2\,\varphi(\lambda)\,\lambda^{\min\{\alpha(\mu_0-1/2),\,0\}-\mu_0}.\qquad(2.4)$$
Here, $\alpha_s,\beta_1,\beta_2$ are constants independent of $\lambda$, depending only on $\nu_0,\mu_0,\gamma,\gamma_0,c,B,D,\kappa$, and $\varphi(\kappa^2)$.
Proof.
By (1.6) and (1.7), for any 0<s≤ν0, we have
$$\sup_{0<\sigma\le\kappa^2}|1-g_\lambda(\sigma)\sigma|\,\sigma^{s}\le\sup_{0<\sigma\le\kappa^2}\bigl[|1-g_\lambda(\sigma)\sigma|\,\sigma^{\nu_0}\bigr]^{s/\nu_0}\,|1-g_\lambda(\sigma)\sigma|^{1-s/\nu_0}\le\gamma^{1-s/\nu_0}\gamma_0^{\,s/\nu_0}\lambda^{\alpha s}.$$
Since $\mu_0 s\le\nu_0$ and $\varphi$ is covered by $\mu_0$, by (2.1) and (1.6) we get
$$\begin{aligned}
\sup_{0<\sigma\le\kappa^2}|1-g_\lambda(\sigma)\sigma|\,\varphi^{s}(\sigma)
&=\max\Bigl\{\sup_{0<\sigma<\lambda}|1-g_\lambda(\sigma)\sigma|\,\varphi^{s}(\sigma),\ \sup_{\lambda\le\sigma\le\kappa^2}|1-g_\lambda(\sigma)\sigma|\,\sigma^{\mu_0 s}\,\frac{\varphi^{s}(\sigma)}{\sigma^{\mu_0 s}}\Bigr\}\\
&\le\max\Bigl\{\gamma\,\varphi^{s}(\lambda),\ \frac{1}{c^{s}}\,\varphi^{s}(\lambda)\,\gamma^{1-\mu_0 s/\nu_0}\gamma_0^{\,\mu_0 s/\nu_0}\,\lambda^{(\alpha-1)\mu_0 s}\Bigr\}\\
&\le\max\Bigl\{\gamma,\ \frac{1}{c^{s}}\,\gamma^{1-\mu_0 s/\nu_0}\gamma_0^{\,\mu_0 s/\nu_0}\Bigr\}\,\varphi^{s}(\lambda)\,\lambda^{(\alpha-1)\mu_0 s}
\doteq\alpha_s\,\varphi^{s}(\lambda)\,\lambda^{(\alpha-1)\mu_0 s}.
\end{aligned}$$
In order to prove the third inequality, let $\tau=\min\{2\nu_0/(2\nu_0-1),\,\nu_0/\mu_0\}$, so that $\tau(1-1/2\nu_0)=(1/\mu_0)\min\{\mu_0,\nu_0-1/2\}$. By (2.2), we have
$$\begin{aligned}
\sup_{0<\sigma\le\kappa^2}|1-g_\lambda(\sigma)\sigma|^{1-1/2\nu_0}\varphi(\sigma)
&=\sup_{0<\sigma\le\kappa^2}\bigl[|1-g_\lambda(\sigma)\sigma|\,\varphi^{\tau}(\sigma)\bigr]^{1-1/2\nu_0}(\varphi(\sigma))^{1-\tau(1-1/2\nu_0)}\\
&\le(\varphi(\kappa^2))^{1-\tau(1-1/2\nu_0)}\,\alpha_\tau^{1-1/2\nu_0}\,\lambda^{\min\{\mu_0,\nu_0-1/2\}(\alpha-1)}\,(\varphi(\lambda))^{\min\{(2\nu_0-1)/2\mu_0,\,1\}}.
\end{aligned}$$
Thus,
$$\begin{aligned}
\sup_{0<\sigma\le\kappa^2}|1-g_\lambda(\sigma)\sigma|\,\varphi(\sigma)\,\sigma^{1/2}
&=\sup_{0<\sigma\le\kappa^2}\bigl[|1-g_\lambda(\sigma)\sigma|\,\sigma^{\nu_0}\bigr]^{1/2\nu_0}\,|1-g_\lambda(\sigma)\sigma|^{1-1/2\nu_0}\varphi(\sigma)\\
&\le\beta_1\,\lambda^{\min\{\mu_0,\nu_0-1/2\}(\alpha-1)+\alpha/2}\,(\varphi(\lambda))^{\min\{(2\nu_0-1)/2\mu_0,\,1\}},
\end{aligned}$$
where β1 is a constant only dependent on ν0,μ0,γ,γ0,c,φ(κ2).
If $0<\mu_0\le1/2$, we have
$$\begin{aligned}
\sup_{0<\sigma\le\kappa^2}|g_\lambda(\sigma)\,\sigma^{1/2}\varphi(\sigma)|
&=\max\Bigl\{\sup_{0<\sigma<\lambda}|g_\lambda(\sigma)\,\sigma^{1/2}\varphi(\sigma)|,\ \sup_{\lambda\le\sigma\le\kappa^2}\Bigl|g_\lambda(\sigma)\,\sigma^{\mu_0+1/2}\,\frac{\varphi(\sigma)}{\sigma^{\mu_0}}\Bigr|\Bigr\}\\
&\le\max\Bigl\{\sup_{0<\sigma<\lambda}|g_\lambda(\sigma)\sigma|^{1/2}\,|g_\lambda(\sigma)|^{1/2}\,\varphi(\lambda),\ \frac{\varphi(\lambda)}{c\,\lambda^{\mu_0}}\,D^{\mu_0+1/2}B^{1/2-\mu_0}\lambda^{-(1/2-\mu_0)\alpha}\Bigr\}\\
&\le\max\Bigl\{\sqrt{BD},\ c^{-1}D^{\mu_0+1/2}B^{1/2-\mu_0}\Bigr\}\,\varphi(\lambda)\,\lambda^{\alpha(\mu_0-1/2)-\mu_0}.
\end{aligned}$$
A similar computation shows that, for $\mu_0\ge1/2$,
$$\sup_{0<\sigma\le\kappa^2}|g_\lambda(\sigma)\,\sigma^{1/2}\varphi(\sigma)|\le\max\bigl\{\sqrt{BD},\ c^{-1}D\kappa^{2\mu_0-1}\bigr\}\,\varphi(\lambda)\,\lambda^{-\mu_0}.$$
Thus, the last inequality holds, and we complete the proof.
By taking $s=1/2$ in (2.1), we have
$$\sup_{0<\sigma\le\kappa^2}|1-g_\lambda(\sigma)\sigma|\,\sigma^{1/2}\le\gamma^{1-1/2\nu_0}\gamma_0^{\,1/2\nu_0}\lambda^{\alpha/2}.$$
Our operator norm estimates mainly rely on the following classical argument from operator theory: let $A$ be a positive operator on a Hilbert space and let $f\in C[0,\|A\|]$; then $f(A)$ is self-adjoint by [17, Proposition 4.4.7], and $\sigma(f(A))=\{f(t):t\in\sigma(A)\}$ by [17, Theorem 4.4.8], where $\sigma(A)$ denotes the spectrum of $A$. Consequently, $\|f(A)\|\le\|f\|_\infty$.
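The following tiny numerical illustration (not from the paper) checks this spectral-calculus bound on a random positive semidefinite matrix: $f(A)$ is formed through the eigendecomposition and its operator norm is compared with the supremum of $|f|$ on $[0,\|A\|]$.

```python
import numpy as np

# Illustration of ||f(A)|| <= ||f||_inf for a positive semidefinite A: apply f through
# the eigendecomposition and compare with the sup of |f| over [0, ||A||].
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T                                            # positive semidefinite
evals, evecs = np.linalg.eigh(A)
f = lambda t: np.sin(3.0 * t) + 0.5                    # any continuous real function
fA = evecs @ np.diag(f(evals)) @ evecs.T
grid = np.concatenate([np.linspace(0.0, evals.max(), 10_000), evals])
print(np.linalg.norm(fA, 2), "<=", np.max(np.abs(f(grid))))
```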
The following probability inequality concerning random variables with values in a Hilbert space is proved in [18].
Lemma 2.2.
Let $H$ be a Hilbert space and $\xi$ a random variable on $(Z,\rho)$ with values in $H$. Assume $\|\xi\|\le\widetilde M<\infty$ almost surely, and denote $\sigma^2(\xi)=E(\|\xi\|^2)$. Let $\{z_i\}_{i=1}^m$ be independent samples drawn according to $\rho$. For any $0<\delta<1$, with confidence $1-\delta$, there holds
$$\Bigl\|\frac1m\sum_{i=1}^m[\xi(z_i)-E(\xi)]\Bigr\|\le\frac{2\widetilde M\log(2/\delta)}{m}+\sqrt{\frac{2\sigma^2(\xi)\log(2/\delta)}{m}}.\qquad(2.12)$$
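As a purely illustrative Monte Carlo sanity check (with $H=\mathbb R^d$ and uniform coordinates, assumptions made only for this sketch), one can verify that the deviation of the empirical mean exceeds the bound of Lemma 2.2 far less often than the prescribed $\delta$.

```python
import numpy as np

# Monte Carlo check (illustrative): H = R^d, xi with independent Uniform(-1, 1) coordinates,
# so ||xi|| <= sqrt(d) =: M_tilde and E||xi||^2 = d/3.  The empirical mean should exceed the
# bound of Lemma 2.2 with probability well below delta.
rng = np.random.default_rng(0)
d, m, delta, trials = 5, 200, 0.05, 2000
M_tilde, sigma2 = np.sqrt(d), d / 3.0
bound = 2 * M_tilde * np.log(2 / delta) / m + np.sqrt(2 * sigma2 * np.log(2 / delta) / m)

failures = 0
for _ in range(trials):
    xi = rng.uniform(-1.0, 1.0, size=(m, d))
    failures += np.linalg.norm(xi.mean(axis=0)) > bound   # E(xi) = 0 here
print(f"empirical failure rate: {failures / trials:.4f}  (delta = {delta})")
```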
Let $HS(\mathcal H_K)$ be the class of all Hilbert–Schmidt operators on $\mathcal H_K$. It forms a Hilbert space with inner product
$$\langle T,S\rangle_{HS}:=\sum_{i=1}^{\infty}\langle Te_i,Se_i\rangle_K,$$
where $\{e_i\}_{i\ge1}$ is an orthonormal basis of $\mathcal H_K$; this definition does not depend on the choice of the basis.
Lemma 2.3.
Let $\mathbf x=\{x_i\}_{i=1}^m$ be a sample drawn i.i.d. from $(X,\rho_X)$. With confidence $1-\delta$, we have
$$\|L_K-S_{\mathbf x}^{T}S_{\mathbf x}\|\le\kappa^2\Bigl(\frac{2\log(2/\delta)}{m}+\sqrt{\frac{2\log(2/\delta)}{m}}\Bigr).\qquad(2.14)$$
Proof.
Observe that SxTSx=(1/m)∑i=1mKxi〈·,Kxi〉K. Denote SxTSx=(1/m)∑i=1mξ(xi). Here ξ is the random variable on (X,ρX) given by ξ(x)=Kx〈·,Kx〉K.
Consider
$$\langle\xi(x),\xi(x)\rangle_{HS}=\sum_{i=1}^{\infty}\bigl\langle K_x\langle e_i,K_x\rangle_K,\,K_x\langle e_i,K_x\rangle_K\bigr\rangle_K=\sum_{i=1}^{\infty}\langle e_i,K_x\rangle_K^{2}\,K(x,x)\le\kappa^4.\qquad(2.15)$$
For $x\in X$ and $f\in\mathcal H_K$, the reproducing property ensures that
$$\xi(x)(f)=\langle f,K_x\rangle_KK_x=f(x)K_x.$$
Hence, E(ξ)=LK, and thereby
$$L_K-S_{\mathbf x}^{T}S_{\mathbf x}=E\xi-\frac1m\sum_{i=1}^m\xi(x_i).$$
According to (2.15), there holds σ2(ξ)=E∥ξ∥HS2≤κ4. Inequality (2.14) then follows from (2.12) and the fact that ∥LK-SxTSx∥≤∥LK-SxTSx∥HS.
Lemma 2.4.
Under the assumptions of Lemma 2.1, let $\mathbf z=\{z_i\}_{i=1}^m$ be a sample drawn i.i.d. from $(Z,\rho)$. Then, with confidence $1-\delta$, we have
$$\|S_{\mathbf x}^{T}\mathbf y-T_{\mathbf x}f_\lambda\|_K\le\frac{2\kappa\bigl(M+\kappa\beta_2R\varphi(\lambda)\lambda^{-\mu_0+\min\{\alpha(\mu_0-1/2),0\}}\bigr)\log(2/\delta)}{m}+\kappa\bigl(\alpha_1\lambda^{(\alpha-1)\mu_0}\varphi(\lambda)R+c_\rho\bigr)\sqrt{\frac{2\log(2/\delta)}{m}}+\beta_1R\lambda^{\min\{\mu_0,\nu_0-1/2\}(\alpha-1)+\alpha/2}(\varphi(\lambda))^{\min\{(2\nu_0-1)/2\mu_0,1\}},$$
where $c_\rho^2:=\int_Z(y-f_\rho(x))^2\,d\rho+\|f_\rho-f_{\mathcal H}^{+}\|_{\rho_X}^2$.
Proof.
Define $\varsigma=(f_\lambda(x)-y)K_x$, so that $\varsigma$ is a random variable from $Z$ to $\mathcal H_K$. Combining the reproducing property with the Cauchy–Schwarz inequality, we get
$$\|f_\lambda\|_\infty=\sup_{x\in X}|\langle f_\lambda,K_x\rangle_K|\le\kappa\|f_\lambda\|_K.$$
Since $L_K^{1/2}$ is an isometric isomorphism from $(\overline{\mathcal H_K},\|\cdot\|_{\rho_X})$ onto $(\mathcal H_K,\|\cdot\|_K)$ (see [16]), we obtain
$$\|f_\lambda\|_K=\bigl\|g_\lambda(L_K)L_K^{1/2}\varphi(L_K)\,L_K^{1/2}h_0\bigr\|_K\le\bigl\|g_\lambda(L_K)L_K^{1/2}\varphi(L_K)\bigr\|\,\|h_0\|_{\rho_X}\le\sup_{0<t\le\kappa^2}|g_\lambda(t)\,t^{1/2}\varphi(t)|\cdot R\le\beta_2\,\varphi(\lambda)\,\lambda^{\min\{\alpha(\mu_0-1/2),0\}-\mu_0}R,$$
where the last inequality follows from (2.4).
Since $|y|\le M$ almost surely, there holds
$$\|\varsigma\|_K^2=\langle(f_\lambda(x)-y)K_x,(f_\lambda(x)-y)K_x\rangle_K\le\kappa^2\bigl(M+\kappa\beta_2\varphi(\lambda)\lambda^{\min\{\alpha(\mu_0-1/2),0\}-\mu_0}R\bigr)^2.$$
By (2.3) and $L_Kf_\rho=L_Kf_{\mathcal H}^{+}$, we get
$$\begin{aligned}
\|E\varsigma\|_K&=\|L_K(f_\rho-f_\lambda)\|_K=\|L_K(f_{\mathcal H}^{+}-f_\lambda)\|_K=\bigl\|L_K\varphi(L_K)(I-g_\lambda(L_K)L_K)h_0\bigr\|_K\\
&\le\bigl\|(I-g_\lambda(L_K)L_K)\varphi(L_K)L_K^{1/2}\bigr\|\,\bigl\|L_K^{1/2}h_0\bigr\|_K\le\beta_1R\,\lambda^{\min\{\mu_0,\nu_0-1/2\}(\alpha-1)+\alpha/2}\,(\varphi(\lambda))^{\min\{(2\nu_0-1)/2\mu_0,1\}},\\[4pt]
E\|\varsigma\|_K^2&=E\bigl[(y-f_\lambda(x))^2K(x,x)\bigr]\le\kappa^2E(y-f_\lambda(x))^2\\
&\le\kappa^2\Bigl[\int_Z(y-f_\rho(x))^2\,d\rho+\int_X(f_\rho(x)-f_{\mathcal H}^{+}(x))^2\,d\rho_X+\int_X(f_{\mathcal H}^{+}(x)-f_\lambda(x))^2\,d\rho_X\Bigr]\\
&\le\kappa^2\Bigl(\alpha_1^2\lambda^{2(\alpha-1)\mu_0}\varphi^2(\lambda)R^2+\int_Z(y-f_\rho(x))^2\,d\rho+\|f_\rho-f_{\mathcal H}^{+}\|_{\rho_X}^2\Bigr),
\end{aligned}$$
where, in the last step, we used the result of Proposition 3.1 in Section 3, and $c_\rho$ is as defined in the statement of the lemma. Applying Lemma 2.2, there holds
$$\Bigl\|\frac1m\sum_{i=1}^m[\varsigma(z_i)-E\varsigma]\Bigr\|_K\le\frac{2\kappa\bigl(M+\kappa\beta_2R\varphi(\lambda)\lambda^{-\mu_0+\min\{\alpha(\mu_0-1/2),0\}}\bigr)\log(2/\delta)}{m}+\kappa\bigl(\alpha_1\lambda^{(\alpha-1)\mu_0}\varphi(\lambda)R+c_\rho\bigr)\sqrt{\frac{2\log(2/\delta)}{m}}.$$
The desired bound then follows from the inequality
$$\|S_{\mathbf x}^{T}\mathbf y-T_{\mathbf x}f_\lambda\|_K\le\Bigl\|\frac1m\sum_{i=1}^m[\varsigma(z_i)-E\varsigma]\Bigr\|_K+\|E\varsigma\|_K.$$
This completes the proof of Lemma 2.4.
3. Error Analysis
Proposition 3.1.
Let $\varphi$ be an index function covered by $\mu_0>0$ and let $\nu_0>\max\{1/2,\mu_0\}$. Then, under the prior condition (1.21), there holds
$$\|f_\lambda-f_{\mathcal H}^{+}\|_{\rho_X}\le\alpha_1\lambda^{(\alpha-1)\mu_0}\varphi(\lambda)R.$$
Proof.
From the definition of fλ and fℋ+, we have
$$f_\lambda-f_{\mathcal H}^{+}=g_\lambda(L_K)L_Kf_{\mathcal H}^{+}-f_{\mathcal H}^{+}=(g_\lambda(L_K)L_K-I)\varphi(L_K)h_0.$$
So the following error estimate holds:
$$\|f_\lambda-f_{\mathcal H}^{+}\|_{\rho_X}\le\|(g_\lambda(L_K)L_K-I)\varphi(L_K)\|\,\|h_0\|_{\rho_X}\le\sup_{0<\sigma\le\kappa^2}|(g_\lambda(\sigma)\sigma-1)\varphi(\sigma)|\cdot\|h_0\|_{\rho_X}\le\alpha_1\lambda^{(\alpha-1)\mu_0}\varphi(\lambda)R,$$
where the last inequality follows from (2.2).
Let us focus on the estimation of sample error.
Consider
$$\begin{aligned}
\|f_{\mathbf z}^{\lambda}-f_\lambda\|_{\rho_X}&=\bigl\|L_K^{1/2}(f_{\mathbf z}^{\lambda}-f_\lambda)\bigr\|_K\\
&\le\bigl\|L_K^{1/2}(g_\lambda(T_{\mathbf x})T_{\mathbf x}-I)f_\lambda-(g_\lambda(T_{\mathbf x})T_{\mathbf x}-I)L_K^{1/2}f_\lambda\bigr\|_K
+\bigl\|(g_\lambda(T_{\mathbf x})T_{\mathbf x}-I)g_\lambda(L_K)L_K\varphi(L_K)L_K^{1/2}h_0\bigr\|_K\\
&\quad+\bigl\|L_K^{1/2}g_\lambda(T_{\mathbf x})(S_{\mathbf x}^{T}\mathbf y-T_{\mathbf x}f_\lambda)\bigr\|_K\\
&:=\|I_1\|_K+\|I_2\|_K+\|I_3\|_K.\qquad(3.3)
\end{aligned}$$
The idea is to bound each term separately in $\mathcal H_K$. We start by dealing with the first term of (3.3).
Consider
$$\begin{aligned}
I_1&=(L_K^{1/2}-T_{\mathbf x}^{1/2})(g_\lambda(T_{\mathbf x})T_{\mathbf x}-I)(\varphi(L_K)-\varphi(T_{\mathbf x}))g_\lambda(L_K)L_K^{1/2}\,L_K^{1/2}h_0\\
&\quad+(L_K^{1/2}-T_{\mathbf x}^{1/2})(g_\lambda(T_{\mathbf x})T_{\mathbf x}-I)\varphi(T_{\mathbf x})g_\lambda(L_K)L_K^{1/2}\,L_K^{1/2}h_0\\
&\quad+(g_\lambda(T_{\mathbf x})T_{\mathbf x}-I)(T_{\mathbf x}^{1/2}-L_K^{1/2})\varphi(L_K)g_\lambda(L_K)L_K^{1/2}\,L_K^{1/2}h_0\\
&=J_1+J_2+J_3.\qquad(3.4)
\end{aligned}$$
According to (1.4) and (1.5), we derive the following bound:
$$\|g_\lambda(L_K)L_K^{1/2}\|\le\sup_{0<\sigma\le\kappa^2}|g_\lambda(\sigma)\sigma^{1/2}|=\sup_{0<\sigma\le\kappa^2}\sqrt{|g_\lambda(\sigma)\sigma|\,|g_\lambda(\sigma)|}\le\lambda^{-\alpha/2}\sqrt{DB}.\qquad(3.5)$$
Now we are in a position to bound (3.4).
Suppose that $m\ge2\log(4/\delta)$; then
$$\kappa^2\Bigl(\frac{2\log(4/\delta)}{m}+\sqrt{\frac{2\log(4/\delta)}{m}}\Bigr)\le2\kappa^2\sqrt{\frac{2\log(4/\delta)}{m}}:=\zeta,$$
and
$$\begin{aligned}
&\frac{2\kappa\bigl(M+\kappa\beta_2R\varphi(\lambda)\lambda^{-\mu_0+\min\{\alpha(\mu_0-1/2),0\}}\bigr)\log(4/\delta)}{m}
+\kappa\bigl(\alpha_1\lambda^{(\alpha-1)\mu_0}\varphi(\lambda)R+c_\rho\bigr)\sqrt{\frac{2\log(4/\delta)}{m}}\\
&\qquad+\beta_1R\,\lambda^{\min\{\mu_0,\nu_0-1/2\}(\alpha-1)+\alpha/2}\,(\varphi(\lambda))^{\min\{(2\nu_0-1)/2\mu_0,1\}}\\
&\quad\le\frac{2\kappa^2\beta_2R\varphi(\lambda)\lambda^{-\mu_0+\min\{\alpha(\mu_0-1/2),0\}}\log(4/\delta)}{m}
+\kappa\bigl(M+\alpha_1R\varphi(\lambda)\lambda^{(\alpha-1)\mu_0}+c_\rho\bigr)\sqrt{\frac{2\log(4/\delta)}{m}}\\
&\qquad+\beta_1R\,\lambda^{\min\{\mu_0,\nu_0-1/2\}(\alpha-1)+\alpha/2}\,(\varphi(\lambda))^{\min\{(2\nu_0-1)/2\mu_0,1\}}\\
&\quad\le C_0\Bigl(\varphi(\lambda)\lambda^{-\mu_0+\min\{\alpha(\mu_0-1/2),0\}}m^{-1}\log\frac4\delta
+\bigl(1+\varphi(\lambda)\lambda^{(\alpha-1)\mu_0}\bigr)m^{-1/2}\log\frac4\delta\\
&\qquad+\lambda^{\min\{\mu_0,\nu_0-1/2\}(\alpha-1)+\alpha/2}(\varphi(\lambda))^{\min\{(2\nu_0-1)/2\mu_0,1\}}\Bigr):=C_0\eta.
\end{aligned}$$
By Lemmas 2.3 and 2.4 (each applied with $\delta/2$ in place of $\delta$), with confidence $1-\delta$ the following inequalities hold simultaneously:
$$\|L_K-S_{\mathbf x}^{T}S_{\mathbf x}\|\le\zeta,\qquad\|S_{\mathbf x}^{T}\mathbf y-T_{\mathbf x}f_\lambda\|_K\le C_0\eta.$$
Combining (1.6) and (3.5) with the operator monotonicity of $\varphi(t)$ and of $t^{1/2}$, we obtain
$$\begin{aligned}
\|J_1\|_K&\le\|L_K^{1/2}-T_{\mathbf x}^{1/2}\|\cdot\|g_\lambda(T_{\mathbf x})T_{\mathbf x}-I\|\cdot\|\varphi(L_K)-\varphi(T_{\mathbf x})\|\cdot\|g_\lambda(L_K)L_K^{1/2}\|\cdot\|L_K^{1/2}h_0\|_K\\
&\le c_\varphi\gamma\sqrt{DB}\,R\,\lambda^{-\alpha/2}\,\|L_K-T_{\mathbf x}\|^{1/2}\,\varphi(\|L_K-T_{\mathbf x}\|)
\le c_\varphi\gamma\sqrt{DB}\,R\,\lambda^{-\alpha/2}\,\zeta^{1/2}\varphi(\zeta).\qquad(3.8)
\end{aligned}$$
By Lemma 2.1 and (3.5),
$$\begin{aligned}
\|J_2\|_K&\le\|L_K^{1/2}-T_{\mathbf x}^{1/2}\|\cdot\|(g_\lambda(T_{\mathbf x})T_{\mathbf x}-I)\varphi(T_{\mathbf x})\|\cdot\|g_\lambda(L_K)L_K^{1/2}\|\cdot\|L_K^{1/2}h_0\|_K\\
&\le R\sqrt{DB}\,\|L_K-T_{\mathbf x}\|^{1/2}\,\alpha_1\lambda^{(\alpha-1)\mu_0-\alpha/2}\varphi(\lambda)
\le\alpha_1R\sqrt{DB}\,\zeta^{1/2}\varphi(\lambda)\lambda^{(\alpha-1)\mu_0-\alpha/2}.\qquad(3.9)
\end{aligned}$$
For the purpose of bounding $\|J_3\|_K$, we rewrite $J_3$ in the following form:
$$\begin{aligned}
J_3&=(g_\lambda(T_{\mathbf x})T_{\mathbf x}-I)T_{\mathbf x}^{1/2}(\varphi(L_K)-\varphi(T_{\mathbf x}))g_\lambda(L_K)L_K^{1/2}\,L_K^{1/2}h_0
+(g_\lambda(T_{\mathbf x})T_{\mathbf x}-I)T_{\mathbf x}^{1/2}\varphi(T_{\mathbf x})g_\lambda(L_K)L_K^{1/2}\,L_K^{1/2}h_0\\
&\quad-(g_\lambda(T_{\mathbf x})T_{\mathbf x}-I)(\varphi(L_K)-\varphi(T_{\mathbf x}))g_\lambda(L_K)L_K\,L_K^{1/2}h_0
-(g_\lambda(T_{\mathbf x})T_{\mathbf x}-I)\varphi(T_{\mathbf x})g_\lambda(L_K)L_K\,L_K^{1/2}h_0.\qquad(3.10)
\end{aligned}$$
In the same way (using again the operator monotonicity of $\varphi$, which brings in the constant $c_\varphi$), we have
$$\|J_3\|_K\le\gamma^{1-1/2\nu_0}\gamma_0^{\,1/2\nu_0}c_\varphi\sqrt{DB}\,R\,\varphi(\zeta)
+\beta_1\lambda^{\min\{\mu_0,\nu_0-1/2\}(\alpha-1)}(\varphi(\lambda))^{\min\{(2\nu_0-1)/2\mu_0,1\}}\sqrt{DB}\,R
+c_\varphi\gamma RD\,\varphi(\zeta)+RD\alpha_1\varphi(\lambda)\lambda^{(\alpha-1)\mu_0}.\qquad(3.11)$$
Thus, the bound for $\|I_1\|_K$ follows by combining (3.8), (3.9), and (3.11). What is left is to estimate $\|I_2\|_K$ and $\|I_3\|_K$; we can proceed in the same way as in the estimation of $\|I_1\|_K$.
Consider
$$\begin{aligned}
\|I_2\|_K&\le\bigl\|(g_\lambda(T_{\mathbf x})T_{\mathbf x}-I)(\varphi(L_K)-\varphi(T_{\mathbf x}))g_\lambda(L_K)L_K\,L_K^{1/2}h_0\bigr\|_K
+\bigl\|(g_\lambda(T_{\mathbf x})T_{\mathbf x}-I)\varphi(T_{\mathbf x})g_\lambda(L_K)L_K\,L_K^{1/2}h_0\bigr\|_K\\
&\le c_\varphi\gamma RD\,\varphi(\zeta)+\alpha_1RD\,\varphi(\lambda)\lambda^{(\alpha-1)\mu_0},\qquad(3.12)\\[4pt]
\|I_3\|_K&\le\bigl\|(L_K^{1/2}-T_{\mathbf x}^{1/2})g_\lambda(T_{\mathbf x})(S_{\mathbf x}^{T}\mathbf y-T_{\mathbf x}f_\lambda)\bigr\|_K
+\bigl\|T_{\mathbf x}^{1/2}g_\lambda(T_{\mathbf x})(S_{\mathbf x}^{T}\mathbf y-T_{\mathbf x}f_\lambda)\bigr\|_K\\
&\le C_0B\lambda^{-\alpha}\zeta^{1/2}\eta+C_0\sqrt{DB}\,\eta\,\lambda^{-\alpha/2}.\qquad(3.13)
\end{aligned}$$
Finally, combining (3.8)–(3.13) with Proposition 3.1, we obtain Theorem 1.5.
4. Learning Rates
The significance of this paper lies in two facts: firstly, we generalize the definition of regularization and thereby enlarge the class of spectral regularization algorithms; secondly, our analysis is carried out under the very general prior condition (1.21). Thus, our results apply to many different kinds of regularization, such as regularized least square learning, coefficient regularization learning, (accelerated) Landweber iteration, and spectral cut-off. In this section, we choose a suitable index function and apply Theorem 1.5 to the specific algorithms mentioned in Section 1.
4.1. Least Square Regularization
In this case, the regularization is $g_\lambda(\sigma)=1/(\sigma+\lambda)$, $\lambda\in(0,1]$, with $B=D=\gamma=\gamma_0=\alpha=1$. The qualification of this algorithm is $\nu_0=1$. Suppose $\varphi(t)=t^{r}$ with $0<r\le1$; that is, $f_{\mathcal H}^{+}=L_K^{r}h_0$ with $h_0\in L^2_{\rho_X}(X)$. Thus, $\mu_0=r$ covers $\varphi(t)$.
Using the result of Theorem 1.5, we obtain the following corollary.
Corollary 4.1.
Under the assumptions of Theorem 1.5, we have the following.
For $0<r\le1/2$, with confidence $1-\delta$, there holds
$$\|f_{\mathbf z}^{\lambda}-f_{\mathcal H}^{+}\|_{\rho_X}\le O\Bigl(\bigl(\lambda^{r}+m^{-r/2}+\lambda^{-1}m^{-3/4}\bigr)\bigl(1+\lambda^{-1/2}m^{-1/4}\bigr)\Bigl(\log\frac4\delta\Bigr)^{5/4}\Bigr).$$
By taking $\lambda=m^{-1/2}$, we obtain the learning rate
$$\|f_{\mathbf z}^{\lambda}-f_{\mathcal H}^{+}\|_{\rho_X}\le O\Bigl(m^{-r/2}\Bigl(\log\frac4\delta\Bigr)^{5/4}\Bigr).$$
For $1/2\le r<1$, with confidence $1-\delta$, there holds
$$\|f_{\mathbf z}^{\lambda}-f_{\mathcal H}^{+}\|_{\rho_X}\le O\Bigl(\bigl(\lambda^{1/2}+\lambda^{-1/2}m^{-1/2}+\lambda^{-1}m^{-3/4}+m^{-1/4}\bigr)\Bigl(\log\frac4\delta\Bigr)^{5/4}\Bigr).$$
By taking $\lambda=m^{-1/2}$, we obtain the learning rate
$$\|f_{\mathbf z}^{\lambda}-f_{\mathcal H}^{+}\|_{\rho_X}\le O\Bigl(m^{-1/4}\Bigl(\log\frac4\delta\Bigr)^{5/4}\Bigr).$$
4.2. Coefficient Regularization with the l2 Norm
In this case, the regularization is $g_\lambda(\sigma)=\sigma/(\sigma^2+\lambda)$, $\lambda\in(0,1]$, with $B=D=\gamma=\gamma_0=1$ and $\alpha=1/2$. The qualification is $\nu_0=2$. We again consider the index function $\varphi(t)=t^{r}$ with $0<r\le1$ and $\mu_0=r$.
Corollary 4.2.
Under the assumptions of Theorem 1.5, we have the following.
For $0<r\le1/2$, with confidence $1-\delta$, there holds
$$\|f_{\mathbf z}^{\lambda}-f_{\mathcal H}^{+}\|_{\rho_X}\le O\Bigl(\bigl(1+\lambda^{-1/4}m^{-1/4}\bigr)\bigl(\lambda^{r/2}+m^{-r/2}+m^{-1/2}\lambda^{-1/4}+m^{-1}\lambda^{(r-1)/2}\bigr)\Bigl(\log\frac4\delta\Bigr)^{5/4}\Bigr).$$
By taking $\lambda=m^{-1}$, we obtain the learning rate
$$\|f_{\mathbf z}^{\lambda}-f_{\mathcal H}^{+}\|_{\rho_X}\le O\Bigl(m^{-r/2}\Bigl(\log\frac4\delta\Bigr)^{5/4}\Bigr).$$
For $1/2\le r\le1$, with confidence $1-\delta$, there holds
$$\|f_{\mathbf z}^{\lambda}-f_{\mathcal H}^{+}\|_{\rho_X}\le O\Bigl(\bigl(1+\lambda^{-1/4}m^{-1/4}\bigr)\bigl(\lambda^{r/2}+m^{-r/2}+m^{-1/2}\lambda^{-1/4}\bigr)\Bigl(\log\frac4\delta\Bigr)^{5/4}\Bigr).$$
By taking $\lambda=m^{-2/(2r+1)}$, we obtain the learning rate
$$\|f_{\mathbf z}^{\lambda}-f_{\mathcal H}^{+}\|_{\rho_X}\le O\Bigl(m^{-r/(2r+1)}\Bigl(\log\frac4\delta\Bigr)^{5/4}\Bigr).$$
For coefficient regularization, the learning rates derived from Theorem 1.5 are almost the same as those in Corollary 5.2 of [2]. For least square regularization, the learning rates in Corollary 4.1 are weaker: the integral operator analysis in [8] gives the learning rate $O(m^{-3r/(4(1+r))})$ for $0<r\le1/2$, and the leave-one-out analysis in [11] gives the rate $O(m^{-r/(1+2r)})$.
Our analysis is influenced by both the prior condition and the regularization. Under the weaker prior condition (1.21), some techniques of the error analysis in [1] are inapplicable; we therefore adopt a more involved error decomposition and a refined analysis to obtain the error bounds and learning rates.
Acknowledgment
This work is supported by the Natural Science Foundation of China (Grant no. 11071276).
References
[1] F. Bauer, S. Pereverzev, and L. Rosasco, "On regularization algorithms in learning theory," Journal of Complexity, vol. 23, no. 1, pp. 52–72, 2007.
[2] H. Sun and Q. Wu, "Least square regression with indefinite kernels and coefficient regularization," Applied and Computational Harmonic Analysis, vol. 30, no. 1, pp. 96–109, 2011.
[3] L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri, "Spectral algorithms for supervised learning," Neural Computation, vol. 20, no. 7, pp. 1873–1897, 2008.
[4] F. Cucker and S. Smale, "On the mathematical foundations of learning," Bulletin of the American Mathematical Society, vol. 39, no. 1, pp. 1–49, 2002.
[5] T. Evgeniou, M. Pontil, and T. Poggio, "Regularization networks and support vector machines," Advances in Computational Mathematics, vol. 13, no. 1, pp. 1–50, 2000.
[6] T. Poggio and S. Smale, "The mathematics of learning: dealing with data," Notices of the American Mathematical Society, vol. 50, no. 5, pp. 537–544, 2003.
[7] S. Smale and D.-X. Zhou, "Learning theory estimates via integral operators and their approximations," Constructive Approximation, vol. 26, no. 2, pp. 153–172, 2007.
[8] H. Sun and Q. Wu, "A note on application of integral operator in learning theory," Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 416–421, 2009.
[9] H. Sun and Q. Wu, "Regularized least square regression with dependent samples," Advances in Computational Mathematics, vol. 32, no. 2, pp. 175–189, 2010.
[10] Q. Wu, Y. Ying, and D.-X. Zhou, "Learning rates of least-square regularized regression," Foundations of Computational Mathematics, vol. 6, no. 2, pp. 171–192, 2006.
[11] T. Zhang, "Leave-one-out bounds for kernel methods," Neural Computation, vol. 15, pp. 1397–1437, 2003.
[12] V. N. Vapnik, Statistical Learning Theory, Wiley-Interscience, New York, NY, USA, 1998.
[13] Q. Guo and H. W. Sun, "Asymptotic convergence of coefficient regularization based on weakly dependent samples," vol. 24, no. 1, pp. 99–103, 2010.
[14] Q. Wu and D.-X. Zhou, "Learning with sample dependent hypothesis spaces," Computers & Mathematics with Applications, vol. 56, no. 11, pp. 2896–2907, 2008.
[15] Y. Yao, L. Rosasco, and A. Caponnetto, "On early stopping in gradient descent learning," Constructive Approximation, vol. 26, no. 2, pp. 289–315, 2007.
[16] H. Sun and Q. Wu, "Application of integral operator for regularized least-square regression," Mathematical and Computer Modelling, vol. 49, no. 1-2, pp. 276–285, 2009.
[17] R. V. Kadison and J. R. Ringrose, Fundamentals of the Theory of Operator Algebras, Vol. I: Elementary Theory, Academic Press, San Diego, Calif, USA, 1983.
[18] I. Pinelis, "Optimum bounds for the distributions of martingales in Banach spaces," The Annals of Probability, vol. 22, no. 4, pp. 1679–1706, 1994.