Abstract and Applied Analysis, Volume 2013, Article ID 134727, doi:10.1155/2013/134727. Hindawi Publishing Corporation. ISSN 1085-3375 (print), 1687-0409 (online).

Research Article

Coefficient-Based Regression with Non-Identical Unbounded Sampling

Jia Cai

School of Mathematics and Computational Science, Guangdong University of Business Studies, Guangzhou, Guangdong 510320, China

Received 18 January 2013; Accepted 15 April 2013; Published 3 June 2013

Academic Editor: Qiang Wu

Copyright © 2013 Jia Cai. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We investigate a coefficient-based least squares regression problem with indefinite kernels from non-identical unbounded sampling processes; that is, the samples are drawn independently but not identically, and the sampling outputs are unbounded. The kernel is not required to be symmetric or positive semi-definite, which creates additional difficulty in the error analysis. By introducing a suitable reproducing kernel Hilbert space (RKHS) and a suitable intermediate integral operator, and by means of a novel technique for estimating the sample error, satisfactory learning rates are derived.

1. Introduction and Preliminary

We study coefficient-based least squares regression with indefinite kernels from non-identical unbounded sampling processes. In our setting, functions are defined on a compact subset $X$ of $\mathbb{R}^n$ and take values in $Y=\mathbb{R}$. Let $\rho$ be a Borel probability measure on $Z=X\times Y$. A sample $\mathbf{z}=\{z_t=(x_t,y_t)\}_{t=1}^T\in Z^T$ is drawn independently from different Borel probability measures $\rho^{(t)}$ ($t=1,\dots,T$) satisfying $\rho^{(t)}(\cdot\mid x)=\rho(\cdot\mid x)$. Let $\rho_X^{(t)}$ denote the marginal distribution of $\rho^{(t)}$ on $X$ and $\rho_X$ the marginal distribution of $\rho$ on $X$. We assume that the sequence $\{\rho_X^{(t)}\}$ converges exponentially fast in the dual of the Hölder space $C^s(X)$. Here the Hölder space $C^s(X)$ ($0\le s\le1$) is the space of all continuous functions on $X$ for which the norm
$$\|f\|_{C^s(X)}=\|f\|_\infty+|f|_{C^s(X)} \qquad (1)$$
is finite, where
$$|f|_{C^s(X)}:=\sup_{x\ne y\in X}\frac{|f(x)-f(y)|}{(d(x,y))^s}. \qquad (2)$$
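To make the Hölder seminorm in (2) concrete, here is a small numerical sketch (our illustration, not from the paper) that approximates $|f|_{C^s(X)}$ on a finite grid, for the illustrative choices $f(x)=\sqrt x$ on $X=[0,1]$ and $s=1/2$; $\sqrt x$ is Hölder continuous of order $1/2$ with seminorm $1$.

```python
import math

def holder_seminorm(f, points, s):
    """Discrete approximation of |f|_{C^s}: the sup over sample pairs of
    |f(x) - f(y)| / d(x, y)^s, with d the usual distance on the line."""
    best = 0.0
    for i, x in enumerate(points):
        for y in points[i + 1:]:
            best = max(best, abs(f(x) - f(y)) / abs(x - y) ** s)
    return best

grid = [i / 200 for i in range(201)]        # a grid for X = [0, 1]
sn = holder_seminorm(math.sqrt, grid, 0.5)  # ≈ 1.0, attained at pairs (0, x)
print(round(sn, 6))
```

The pair $(0,x)$ attains the supremum exactly, so the discrete value already equals the true seminorm here; for a generic $f$ the grid value only approximates it from below.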

Definition 1.

Let $0\le s\le1$. We say that the sequence $\{\rho_X^{(t)}\}$ converges exponentially fast in $(C^s(X))^*$ to a probability measure $\rho_X$ on $X$ (or converges exponentially, for short) if there exist $C_1>0$ and $0<\alpha<1$ such that
$$\big\|\rho_X^{(t)}-\rho_X\big\|_{(C^s(X))^*}\le C_1\alpha^t,\qquad\forall t\in\mathbb{N}. \qquad (3)$$

By the definition of the dual space $(C^s(X))^*$, the decay condition (3) can be expressed as
$$\Big|\int_Xf(x)\,d\rho_X^{(t)}-\int_Xf(x)\,d\rho_X\Big|\le C_1\alpha^t\|f\|_{C^s(X)},\qquad\forall f\in C^s(X),\ t\in\mathbb{N}. \qquad (4)$$
The regression function $f_\rho:X\to Y$ is given by
$$f_\rho(x)=\int_Yy\,d\rho(y\mid x),\qquad x\in X, \qquad (5)$$
where $\rho(y\mid x)$ is the conditional distribution of $y$ at $x\in X$. Since $\rho$ is unknown, $f_\rho$ cannot be obtained directly. The aim of the regression problem is to learn a good approximation of $f_\rho$ from the sample $\mathbf z$. This is an ill-posed problem, and a regularization scheme is needed.

The classical learning algorithm is a regularization scheme in a reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel $K:X\times X\to\mathbb{R}$, that is, a continuous, symmetric, and positive semi-definite (p.s.d.) function. The RKHS $\mathcal H_K$ is the completion of the linear span of $\{K_x=K(\cdot,x):x\in X\}$ with respect to the inner product $\langle K_x,K_y\rangle_K=K(x,y)$. Define $\kappa=\sup_{x,v\in X}|K(x,v)|<\infty$; then the regularized regression problem is
$$f_{\mathbf z}=\arg\min_{f\in\mathcal H_K}\Big\{\frac1T\sum_{t=1}^T\big(f(x_t)-y_t\big)^2+\lambda\|f\|_K^2\Big\}. \qquad (6)$$
This scheme has been thoroughly studied in the literature. Here we consider an indefinite kernel scheme in a hypothesis space $\mathcal H_{K,\mathbf z}$ depending on the sample $\mathbf z$, defined by
$$\mathcal H_{K,\mathbf z}=\Big\{\sum_{i=1}^T\beta_iK_{x_i}:\beta_i\in\mathbb{R}\Big\}. \qquad (7)$$

The regularization penalty is now imposed on the coefficients of $f$. An indefinite kernel $K$ is only required to be continuous and bounded; it need not be symmetric or p.s.d. Define $\tilde K(u,v)=\int_XK(u,x)K(v,x)\,d\rho_X(x)$; then $\tilde K$ is a Mercer kernel. For more background on learning with indefinite kernels, see [6, 8, 9]. For $x\in X$, define $L_{K,\rho_X}f(x)=\int_XK(x,u)f(u)\,d\rho_X(u)$ and $L_{K,\rho_X^{(t)}}f(x)=\int_XK(x,u)f(u)\,d\rho_X^{(t)}(u)$. Since $X$ is compact and $K$ is continuous, $L_{K,\rho_X}$ and its adjoint $L_{K,\rho_X}^*$ are both compact operators, and $L_{\tilde K,\rho_X}=L_{K,\rho_X}L_{K,\rho_X}^*$, $L_{\tilde K,\rho_X^{(t)}}=L_{K,\rho_X^{(t)}}L_{K,\rho_X^{(t)}}^*$. The learning algorithm we are interested in takes the form
$$f_{\mathbf z,\lambda}=\arg\min_{f\in\mathcal H_{K,\mathbf z}}\Big\{\frac1T\sum_{i=1}^T\big(f(x_i)-y_i\big)^2+\lambda\,\Omega_{\mathbf z}(f)\Big\},\qquad\lambda>0. \qquad (8)$$

We use the coefficient-based regularizer
$$\Omega_{\mathbf z}(f)=\|f\|_{\mathbf z}^2:=T\sum_{i=1}^T\beta_i^2\qquad\text{for }f=\sum_{i=1}^T\beta_iK_{x_i}. \qquad (9)$$

Then $f_{\mathbf z,\lambda}=f_{\beta^{\mathbf z,\lambda}}=\sum_{i=1}^T\beta_i^{\mathbf z,\lambda}K_{x_i}$. Using the integral operator technique from [4], Sun and Wu [9] gave a capacity-independent estimate of the convergence rate of $\|f_{\mathbf z,\lambda}-f_\rho\|_\rho^2$, where $\|f\|_\rho=(\int_X|f(x)|^2\,d\rho_X(x))^{1/2}$. Shi [10] investigated the error analysis in a data-dependent hypothesis space for general kernels. Sun and Guo [11] conducted an error analysis for Mercer kernels with uniformly bounded non-i.i.d. sampling. In this paper, we study learning algorithm (8) under non-identical unbounded sampling with indefinite kernels.
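Scheme (8)–(9) is a finite-dimensional least squares problem in $\beta$: writing $A_{ij}=K(x_i,x_j)$ (no symmetry needed), minimizing $(1/T)\|A\beta-\mathbf y\|^2+\lambda T\|\beta\|_2^2$ gives the normal equations $(A^\top A+\lambda T^2I)\beta=A^\top\mathbf y$. A hedged sketch with a non-symmetric kernel of our own choosing (not from the paper):

```python
import numpy as np

def coeff_regression(X, y, lam, kernel):
    """Coefficient-based scheme (8)-(9): f(x) = sum_j beta_j K(x, x_j).
    Normal equations: (A^T A + lam * T^2 * I) beta = A^T y, A_ij = K(x_i, x_j)."""
    T = len(X)
    A = kernel(X[:, None], X[None, :])  # A need not be symmetric or p.s.d.
    beta = np.linalg.solve(A.T @ A + lam * T**2 * np.eye(T), A.T @ y)
    return A, beta

# An illustrative non-symmetric (hence indefinite) kernel.
kernel = lambda u, v: np.exp(-(u - v) ** 2) * (1.0 + 0.5 * np.sin(2 * u))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, 25)
y = np.cos(2 * X) + 0.1 * rng.standard_normal(25)
A, beta = coeff_regression(X, y, lam=1e-4, kernel=kernel)

# Gradient of (1/T)||A b - y||^2 + lam*T*||b||^2 vanishes at the solution.
grad = (2 / 25) * A.T @ (A @ beta - y) + 2 * 1e-4 * 25 * beta
print(np.max(np.abs(grad)) < 1e-8)
```

Since only the coefficient vector is penalized, the solve involves $A^\top A$, which is p.s.d. even when $A$ is not; this is the computational counterpart of passing from $K$ to the Mercer kernel $\tilde K$.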

If $K$ is a Mercer kernel, from [3] we know that $\mathcal H_K$ is the range of $L_K^{1/2}$. For an indefinite kernel $K$, recall that $L_{\tilde K,\rho_X}=L_{K,\rho_X}L_{K,\rho_X}^*$; we shall use the polar decomposition of compact operators (see [12]).

Lemma 2.

Let $H$ be a separable Hilbert space and $T$ a compact operator on $H$; then $T$ can be factored as
$$T=\Gamma A, \qquad (10)$$
where $A=(T^*T)^{1/2}$ and $\Gamma$ is a partial isometry on $H$ such that $\Gamma^*\Gamma$ is the orthogonal projection onto $\overline{A(H)}$.

We immediately have the following proposition.

Proposition 3.

Consider $\mathcal H_{\tilde K}$ as a subspace of $L^2_{\rho_X}$; then $L_K^*=UL_{\tilde K}^{1/2}$ and $L_K=L_{\tilde K}^{1/2}U^*$, where $U$ is a partial isometry on $L^2_{\rho_X}$ such that $U^*U$ is the orthogonal projection onto $\overline{\mathcal H_{\tilde K}}$.

We use the RKHS $\mathcal H_{\tilde K}$ to approximate $f_\rho$; hence define
$$f_{\lambda,\rho_X}=\arg\min_{f\in\mathcal H_{\tilde K}}\Big\{\int_Z\big(f(x)-f_\rho(x)\big)^2\,d\rho(x,y)+\lambda\|f\|_{\tilde K}^2\Big\}. \qquad (11)$$
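For later reference, (11) admits the closed form $f_{\lambda,\rho_X}=(\lambda I+L_{\tilde K,\rho_X})^{-1}L_{\tilde K,\rho_X}f_\rho$; the following one-line derivation is our addition, using the standard fact that $L_{\tilde K,\rho_X}$ represents the $\rho_X$-inner product on $\mathcal H_{\tilde K}$:

```latex
% Optimality condition for (11): for all g in H_{\tilde K},
%   \langle f - f_\rho,\, g\rangle_\rho + \lambda \langle f,\, g\rangle_{\tilde K} = 0
%   \iff \langle L_{\tilde K,\rho_X}(f - f_\rho) + \lambda f,\; g\rangle_{\tilde K} = 0 .
% Hence
(\lambda I + L_{\tilde K,\rho_X})\, f_{\lambda,\rho_X} = L_{\tilde K,\rho_X} f_\rho
\quad\Longrightarrow\quad
f_{\lambda,\rho_X} = (\lambda I + L_{\tilde K,\rho_X})^{-1} L_{\tilde K,\rho_X}\, f_\rho .
```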

In order to estimate $f_{\mathbf z,\lambda}-f_\rho$, we construct
$$f_{\lambda,\bar\rho_X^{(T)}}=\big(\lambda I+L_{\tilde K,\bar\rho_X^{(T)}}\big)^{-1}L_{\tilde K,\bar\rho_X^{(T)}}f_\rho, \qquad (12)$$
where $\bar\rho_X^{(T)}=(1/T)\sum_{t=1}^T\rho_X^{(t)}$. Then we can decompose the error into three parts:
$$f_{\mathbf z,\lambda}-f_\rho=\big(f_{\mathbf z,\lambda}-f_{\lambda,\bar\rho_X^{(T)}}\big)+\big(f_{\lambda,\bar\rho_X^{(T)}}-f_{\lambda,\rho_X}\big)+\big(f_{\lambda,\rho_X}-f_\rho\big). \qquad (13)$$

We conduct the error analysis in several steps. Our main contribution is the sample error estimate: the main difficulty there is the non-identical unbounded sampling, which we overcome by introducing a suitable intermediate operator.

2. Key Analysis and Main Results

In order to carry out the error analysis, we assume that the kernel $\tilde K$ satisfies the following condition [1, 11].

Definition 4.

We say that the Mercer kernel $\tilde K$ satisfies the kernel condition of order $s$ if $\tilde K\in C^s(X\times X)$ and, for some constant $\kappa_s>0$ and all $u,v\in X$,
$$\big\|\tilde K_u-\tilde K_v\big\|_{\tilde K}\le\kappa_s\big(d(u,v)\big)^s. \qquad (14)$$

Since the sample $\mathbf z$ is drawn from unbounded sampling processes, we also assume the following moment hypothesis [13].

Moment Hypothesis. There exist constants $M>0$ and $C_2>0$ such that
$$\int_Y|y|^l\,d\rho(y\mid x)\le C_2\,l!\,M^l,\qquad\forall l\in\mathbb{N},\ \forall x\in X. \qquad (15)$$
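The moment hypothesis covers Gaussian noise: for $\varepsilon\sim N(0,\sigma^2)$ one has $\mathbb E|\varepsilon|^l=\sigma^l2^{l/2}\Gamma((l+1)/2)/\sqrt\pi\le C_2\,l!\,M^l$ with, e.g., $C_2=1$ and $M=\sigma$. A small numerical sketch of this check (our illustration, with arbitrary $\sigma$):

```python
import math

def abs_moment_gaussian(l, sigma):
    """E|Z|^l for Z ~ N(0, sigma^2): sigma^l * 2^(l/2) * Gamma((l+1)/2) / sqrt(pi)."""
    return sigma**l * 2 ** (l / 2) * math.gamma((l + 1) / 2) / math.sqrt(math.pi)

sigma, C2, M = 0.7, 1.0, 0.7
# Verify the moment hypothesis (15) for l = 1, ..., 24 with C2 = 1, M = sigma.
ok = all(abs_moment_gaussian(l, sigma) <= C2 * math.factorial(l) * M**l
         for l in range(1, 25))
print(ok)
```

The factor $\sigma^l$ cancels on both sides, so the check reduces to $2^{l/2}\Gamma((l+1)/2)/\sqrt\pi\le l!$, which holds for all $l\ge1$.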

There is a large literature on error analysis for learning algorithm (6); see, for example, [4, 5, 14–17]. Most of these results, however, are obtained under the standard assumption that $|y|\le M$ almost surely for some constant $M$, which excludes Gaussian noise. The moment hypothesis (15) is a natural generalization of the condition $|y|\le M$. Wang and Zhou [13] considered the error analysis of algorithm (6) under condition (15). Our main results give learning rates for algorithm (8) under conditions (3) and (14), stated in terms of the approximation ability of $\mathcal H_{\tilde K}$ with respect to $f_\rho$.

Now we can state our general results on learning rates for algorithm (8).

Theorem 5.

Assume the moment hypothesis (15), that each $\rho_X^{(t)}$ satisfies condition (3), that $\tilde K$ satisfies condition (14), and that $f_\rho\in L_{\tilde K,\rho_X}^r(L^2_{\rho_X})$ for some $1/2<r\le3/2$. Take $\lambda=T^{-\theta}$ with $0<\theta<1/3$; then
$$\mathbb E\|f_{\mathbf z,\lambda}-f_\rho\|_\rho\le C_\kappa''\,T^{-\min\{1/2-(3/2)\theta,\,(r-1/2)\theta\}}, \qquad (16)$$
where $C_\kappa''$ is a constant depending on $\kappa$, $s$, and $\alpha$, but not on $T$; it is given explicitly in Section 3.

Remark 6.

If we take $\lambda=T^{-1/(2(r+1))}$, the resulting rate is $O(T^{-(2r-1)/(4(r+1))})$. The proof of Theorem 5 is given in Section 3, where the error is decomposed into three parts. In [11], the authors consider coefficient-based regression with Mercer kernels under uniformly bounded non-i.i.d. sampling; the best rate obtained there is of order $O(T^{-2r/(1+2r)})$.
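The choice $\lambda=T^{-1/(2(r+1))}$ comes from balancing the two exponents in (16): $1/2-(3/2)\theta=(r-1/2)\theta$ gives $\theta^*=1/(2(r+1))$ and rate exponent $(2r-1)/(4(r+1))$. A quick grid-search sanity check of this arithmetic (our sketch, for the sample value $r=1$):

```python
def rate_exponent(theta, r):
    # Exponent of T^{-...} in (16): min{1/2 - (3/2)theta, (r - 1/2)theta}.
    return min(0.5 - 1.5 * theta, (r - 0.5) * theta)

r = 1.0
thetas = [i / 10000 for i in range(1, 3334)]   # grid over 0 < theta < 1/3
best = max(thetas, key=lambda t: rate_exponent(t, r))

print(abs(best - 1 / (2 * (r + 1))) < 1e-3)                            # theta* = 0.25
print(abs(rate_exponent(best, r) - (2 * r - 1) / (4 * (r + 1))) < 1e-3)  # exponent 0.125
```

The maximizer sits at the intersection of an increasing and a decreasing linear function of $\theta$, which is exactly the balancing condition above.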

When the samples are drawn i.i.d. from the measure $\rho$, we have the following result.

Theorem 7.

Assume the moment hypothesis (15), that $\tilde K$ satisfies condition (14), and that $f_\rho\in L_{\tilde K,\rho_X}^r(L^2_{\rho_X})$. If $0<r\le1$, take $\lambda=T^{-1/(2r+3)}$; then
$$\mathbb E\|f_{\mathbf z,\lambda}-f_\rho\|_\rho\le\tilde C_\kappa T^{-r/(2r+3)}, \qquad (17)$$
where $\tilde C_\kappa$ is a constant depending on $\kappa$, $s$, and $\alpha$ but not on $T$. If $r>1$, take $\lambda=T^{-1/5}$; then
$$\mathbb E\|f_{\mathbf z,\lambda}-f_\rho\|_\rho\le\tilde C_\kappa T^{-1/5}. \qquad (18)$$

Here we obtain the same learning rate as in [9], but under a relaxed condition on the sampling output.

3. Error Analysis

In this section we carry out the error analysis in several steps.

3.1. Regularization Error Estimation

In this subsection we bound the regularization error $f_{\lambda,\rho_X}-f_\rho$. This error has been studied extensively in the learning theory literature ([4, 18] and the references therein), so we omit the proof and quote the result directly.

Proposition 8.

Assume $f_\rho=L_{\tilde K,\rho_X}^r(g_\rho)$ for some $g_\rho\in L^2_{\rho_X}$ and $r>0$. Then the approximation error satisfies
$$\|f_{\lambda,\rho_X}-f_\rho\|_\rho\le C_q\lambda^q, \qquad (19)$$
where $C_q=(1+\kappa^{2r-2})\|L_{\tilde K,\rho_X}^{-r}f_\rho\|_\rho$ and $q=\min\{r,1\}$. Moreover, when $1/2<r\le3/2$,
$$\|f_{\lambda,\rho_X}-f_\rho\|_{\tilde K}\le C_r\lambda^{r-1/2}, \qquad (20)$$
where $C_r=\|L_{\tilde K,\rho_X}^{-r}f_\rho\|_\rho$.

3.2. Estimate for the Measure Error

This subsection is devoted to the term $\|f_{\lambda,\bar\rho_X^{(T)}}-f_{\lambda,\rho_X}\|_{\tilde K}$ caused by the difference of measures, which we call the measure error. The ideas of the proof are from [11]. Before giving the result, we state a crucial lemma.

Lemma 9.

Assume $\tilde K$ satisfies condition (14); then
$$\big\|L_{\tilde K,\bar\rho_X^{(T)}}-L_{\tilde K,\rho_X}\big\|\le(\kappa+\kappa_s)^2\big\|\rho_X-\bar\rho_X^{(T)}\big\|_{(C^s(X))^*}. \qquad (21)$$

Proof.

For any $h\in C^s(X)$, the reproducing property of $\tilde K$ gives
$$\big\|(L_{\tilde K,\bar\rho_X^{(T)}}-L_{\tilde K,\rho_X})h\big\|_{\tilde K}^2=\int_Xh(u)\Big\{\int_Xh(v)\tilde K(u,v)\,d\big(\rho_X-\bar\rho_X^{(T)}\big)(v)\Big\}\,d\big(\rho_X-\bar\rho_X^{(T)}\big)(u)\le\big\|\rho_X-\bar\rho_X^{(T)}\big\|_{(C^s(X))^*}\|g\|_{C^s(X)}=\big\|\rho_X-\bar\rho_X^{(T)}\big\|_{(C^s(X))^*}\big(\|g\|_\infty+|g|_{C^s(X)}\big), \qquad (22)$$
where $g(u)=h(u)\int_Xh(v)\tilde K(u,v)\,d(\rho_X-\bar\rho_X^{(T)})(v)$, $u\in X$. We estimate $\|g\|_\infty$ and $|g|_{C^s(X)}$ separately. For $\|g\|_\infty$, it is easy to see that
$$\|g\|_\infty\le\|h\|_\infty\sup_{u\in X}\Big|\int_Xh(v)\tilde K(u,v)\,d\big(\rho_X-\bar\rho_X^{(T)}\big)(v)\Big|. \qquad (23)$$
The estimation of $|g|_{C^s(X)}$ is more involved:
$$|g|_{C^s(X)}\le|h|_{C^s(X)}\sup_{u\in X}\Big|\int_Xh(v)\tilde K(u,v)\,d\big(\rho_X-\bar\rho_X^{(T)}\big)(v)\Big|+\|h\|_\infty\Big|\int_Xh(v)\tilde K(\cdot,v)\,d\big(\rho_X-\bar\rho_X^{(T)}\big)(v)\Big|_{C^s(X)}. \qquad (24)$$
For the first factor,
$$\sup_{u\in X}\Big|\int_Xh(v)\tilde K(u,v)\,d\big(\rho_X-\bar\rho_X^{(T)}\big)(v)\Big|\le\big\|\rho_X-\bar\rho_X^{(T)}\big\|_{(C^s(X))^*}\sup_{u\in X}\big\|h(\cdot)\tilde K(u,\cdot)\big\|_{C^s(X)}\le\big\|\rho_X-\bar\rho_X^{(T)}\big\|_{(C^s(X))^*}\big\{\kappa^2\|h\|_\infty+\kappa^2|h|_{C^s(X)}+\kappa\kappa_s\|h\|_\infty\big\}.$$
For the second factor,
$$\Big|\int_Xh(v)\tilde K(\cdot,v)\,d\big(\rho_X-\bar\rho_X^{(T)}\big)(v)\Big|_{C^s(X)}=\sup_{u_1\ne u_2\in X}\Big|\int_Xh(v)\frac{\tilde K(u_1,v)-\tilde K(u_2,v)}{d(u_1,u_2)^s}\,d\big(\rho_X-\bar\rho_X^{(T)}\big)(v)\Big|\le\big\|\rho_X-\bar\rho_X^{(T)}\big\|_{(C^s(X))^*}\sup_{u_1\ne u_2\in X}\Big\|h(\cdot)\frac{\tilde K(u_1,\cdot)-\tilde K(u_2,\cdot)}{d(u_1,u_2)^s}\Big\|_{C^s(X)}.$$
Since
$$\sup_{u_1\ne u_2\in X}\Big\|h(\cdot)\frac{\tilde K(u_1,\cdot)-\tilde K(u_2,\cdot)}{d(u_1,u_2)^s}\Big\|_{C^s(X)}\le\big(\|h\|_\infty+|h|_{C^s(X)}\big)\sup_{u_1\ne u_2\in X}\sup_{v\in X}\frac{|\tilde K(u_1,v)-\tilde K(u_2,v)|}{d(u_1,u_2)^s}+\|h\|_\infty\sup_{u_1\ne u_2\in X}\sup_{v_1\ne v_2\in X}\frac{|\tilde K(u_1,v_2)-\tilde K(u_2,v_2)-\tilde K(u_1,v_1)+\tilde K(u_2,v_1)|}{d(u_1,u_2)^s\,d(v_1,v_2)^s}\le\|h\|_{C^s(X)}\kappa\kappa_s+\|h\|_\infty\kappa_s^2, \qquad (25)$$
we obtain
$$\Big|\int_Xh(v)\tilde K(\cdot,v)\,d\big(\rho_X-\bar\rho_X^{(T)}\big)(v)\Big|_{C^s(X)}\le\big\|\rho_X-\bar\rho_X^{(T)}\big\|_{(C^s(X))^*}\|h\|_{C^s(X)}\big(\kappa\kappa_s+\kappa_s^2\big). \qquad (26)$$
Therefore
$$|g|_{C^s(X)}\le\big\|\rho_X-\bar\rho_X^{(T)}\big\|_{(C^s(X))^*}\Big[|h|_{C^s(X)}\big\{\kappa^2\|h\|_{C^s(X)}+\kappa\kappa_s\|h\|_\infty\big\}+\|h\|_\infty\|h\|_{C^s(X)}\big(\kappa\kappa_s+\kappa_s^2\big)\Big]. \qquad (27)$$
Combining the estimates of $\|g\|_\infty$ and $|g|_{C^s(X)}$, we get
$$\big\|(L_{\tilde K,\bar\rho_X^{(T)}}-L_{\tilde K,\rho_X})h\big\|_{\tilde K}^2\le\|h\|_{C^s(X)}^2\big\|\rho_X-\bar\rho_X^{(T)}\big\|_{(C^s(X))^*}^2(\kappa+\kappa_s)^2. \qquad (28)$$
When condition (14) is satisfied, it was proved in [1] that $\mathcal H_{\tilde K}$ is included in $C^s(X)$ with the inclusion bounded:
$$\|f\|_{C^s(X)}\le(\kappa+\kappa_s)\|f\|_{\tilde K},\qquad\forall f\in\mathcal H_{\tilde K}. \qquad (29)$$
Then, for every $h\in\mathcal H_{\tilde K}$,
$$\big\|(L_{\tilde K,\bar\rho_X^{(T)}}-L_{\tilde K,\rho_X})h\big\|_{\tilde K}\le\|h\|_{\tilde K}(\kappa+\kappa_s)^2\big\|\rho_X-\bar\rho_X^{(T)}\big\|_{(C^s(X))^*}. \qquad (30)$$
This completes the proof.

Proposition 10.

Assume $f_\rho\in L_{\tilde K,\rho_X}^r(L^2_{\rho_X})$ for some $1/2<r\le3/2$ and that $\tilde K$ satisfies condition (14). Then the measure error satisfies
$$\big\|f_{\lambda,\bar\rho_X^{(T)}}-f_{\lambda,\rho_X}\big\|_{\tilde K}\le\frac{C_3\,\lambda^{r-3/2}}{T}, \qquad (31)$$
where $C_3=C_1(\kappa+\kappa_s)^2C_r\,\alpha/(1-\alpha)$.

Proof.

From (11), a simple calculation shows that $f_{\lambda,\rho_X}=(\lambda I+L_{\tilde K,\rho_X})^{-1}L_{\tilde K,\rho_X}f_\rho$. Recalling (12), we can see that
$$f_{\lambda,\bar\rho_X^{(T)}}-f_{\lambda,\rho_X}=\big(\lambda I+L_{\tilde K,\bar\rho_X^{(T)}}\big)^{-1}\big\{(L_{\tilde K,\bar\rho_X^{(T)}}-L_{\tilde K,\rho_X})f_\rho+(L_{\tilde K,\rho_X}-L_{\tilde K,\bar\rho_X^{(T)}})f_{\lambda,\rho_X}\big\}=\big(\lambda I+L_{\tilde K,\bar\rho_X^{(T)}}\big)^{-1}\big(L_{\tilde K,\bar\rho_X^{(T)}}-L_{\tilde K,\rho_X}\big)\big(f_\rho-f_{\lambda,\rho_X}\big),$$
so that
$$\big\|f_{\lambda,\bar\rho_X^{(T)}}-f_{\lambda,\rho_X}\big\|_{\tilde K}\le\frac1\lambda\big\|\big(L_{\tilde K,\bar\rho_X^{(T)}}-L_{\tilde K,\rho_X}\big)\big(f_\rho-f_{\lambda,\rho_X}\big)\big\|_{\tilde K}. \qquad (32)$$
Applying Lemma 9 with $h=f_\rho-f_{\lambda,\rho_X}$, we get
$$\big\|f_{\lambda,\bar\rho_X^{(T)}}-f_{\lambda,\rho_X}\big\|_{\tilde K}\le\frac1\lambda(\kappa+\kappa_s)^2\big\|\rho_X-\bar\rho_X^{(T)}\big\|_{(C^s(X))^*}\|f_{\lambda,\rho_X}-f_\rho\|_{\tilde K}. \qquad (33)$$
By the definition of $\bar\rho_X^{(T)}$ and condition (3),
$$\big\|\rho_X-\bar\rho_X^{(T)}\big\|_{(C^s(X))^*}\le\frac1T\sum_{t=1}^T\big\|\rho_X^{(t)}-\rho_X\big\|_{(C^s(X))^*}\le\frac{C_1\alpha}{T(1-\alpha)}. \qquad (34)$$

This in connection with Proposition 8 yields the conclusion.

3.3. Sample Error Estimation

In this subsection we estimate the term $f_{\mathbf z,\lambda}-f_{\lambda,\bar\rho_X^{(T)}}$. First we fix some notation. Let $C(X)$ be the space of bounded continuous functions on $X$ with the supremum norm $\|\cdot\|_\infty$. Define the sampling operator $S_{\mathbf x}:C(X)\to\mathbb R^T$ by $S_{\mathbf x}(f)=(f(x_1),\dots,f(x_T))$ [18]. For $\beta=(\beta_1,\dots,\beta_T)\in\mathbb R^T$, let $U_{\mathbf x}$ and $\hat U_{\mathbf x}$ be the operators from $\mathbb R^T$ to $C(X)$ defined by
$$U_{\mathbf x}\beta=\frac1T\sum_{i=1}^T\beta_iK(\cdot,x_i),\qquad\hat U_{\mathbf x}\beta=\frac1T\sum_{i=1}^T\beta_iK(x_i,\cdot). \qquad (35)$$

It is easy to see that both $U_{\mathbf x}$ and $\hat U_{\mathbf x}$ are bounded operators. Recalling the definition of $f_{\mathbf z,\lambda}$, we have
$$\beta^{\mathbf z,\lambda}=\arg\min_{\beta\in\mathbb R^T}\Big\{\frac1T\sum_{i=1}^T\big(f_\beta(x_i)-y_i\big)^2+\lambda T\sum_{i=1}^T\beta_i^2\Big\}. \qquad (36)$$
Setting the gradient of this objective to zero, we immediately obtain
$$\beta^{\mathbf z,\lambda}=\frac1T\big(\lambda I+S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}\big)^{-1}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y. \qquad (37)$$
Hence $f_{\mathbf z,\lambda}=U_{\mathbf x}\big(\lambda I+S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}\big)^{-1}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y$. Employing the method of [9], we decompose the sample error into two parts:
$$\begin{aligned}f_{\mathbf z,\lambda}-f_{\lambda,\bar\rho_X^{(T)}}&=U_{\mathbf x}\big(\lambda I+S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}\big)^{-1}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y-\big(\lambda I+L_{\tilde K,\bar\rho_X^{(T)}}\big)^{-1}L_{\tilde K,\bar\rho_X^{(T)}}f_\rho\\&=\Big\{U_{\mathbf x}\big(\lambda I+S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}\big)^{-1}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y-\big(\lambda I+L_{\tilde K,\bar\rho_X^{(T)}}\big)^{-1}U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y\Big\}\\&\quad+\Big\{\big(\lambda I+L_{\tilde K,\bar\rho_X^{(T)}}\big)^{-1}U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y-\big(\lambda I+L_{\tilde K,\bar\rho_X^{(T)}}\big)^{-1}L_{\tilde K,\bar\rho_X^{(T)}}f_\rho\Big\}\\&=:\mathcal I+\mathcal{II}.\end{aligned} \qquad (38)$$
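As a consistency check, formula (37) can be verified numerically against the direct normal equations of (36): in matrix form $S_{\mathbf x}U_{\mathbf x}=A/T$ and $S_{\mathbf x}\hat U_{\mathbf x}=A^\top/T$ with $A_{ij}=K(x_i,x_j)$. A sketch with arbitrary data and an illustrative non-symmetric kernel (our choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
T, lam = 20, 1e-3
X = rng.uniform(-1, 1, T)
y = np.sin(2 * X) + 0.1 * rng.standard_normal(T)

# Non-symmetric kernel matrix A_ij = K(x_i, x_j) (illustrative choice).
A = np.exp(-(X[:, None] - X[None, :]) ** 2) * (1 + 0.3 * X[:, None])

SU = A / T        # matrix of S_x U_x  : beta -> ((U_x beta)(x_t))_t
SUhat = A.T / T   # matrix of S_x U^_x : beta -> ((U^_x beta)(x_t))_t

# Formula (37): beta = (1/T) (lam I + S U^ S U)^{-1} S U^ y.
beta37 = np.linalg.solve(lam * np.eye(T) + SUhat @ SU, SUhat @ y) / T

# Direct normal equations of (36): (A^T A + lam T^2 I) beta = A^T y.
beta_direct = np.linalg.solve(A.T @ A + lam * T**2 * np.eye(T), A.T @ y)

print(np.allclose(beta37, beta_direct))
```

The two solves agree because $(\lambda I+A^\top A/T^2)^{-1}=T^2(\lambda T^2I+A^\top A)^{-1}$.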

Now we state our estimate for the sample error. The estimates are more involved because the sample is drawn from non-identical unbounded sampling processes. We overcome this difficulty by introducing a stepping-stone integral operator $L_{\hat K,\bar\rho_X^{(T)}}$, defined below, which plays an intermediate role in the estimates.

Theorem 11.

Let $f_{\mathbf z,\lambda}$ be given by (8), assume the moment hypothesis (15), and suppose the marginal distributions $\rho_X^{(t)}$, $t\in\mathbb N$, satisfy condition (3). Then
$$\mathbb E\big\|f_{\mathbf z,\lambda}-f_{\lambda,\bar\rho_X^{(T)}}\big\|_\rho\le\frac{C_\kappa}{\lambda^{3/2}T^{1/2}}, \qquad (39)$$
where $C_4=\alpha C_1\kappa M\big(\kappa+2|K|_{C^s(X\times X)}\big)\big(\kappa\sqrt{2C_2}+C_2\big)$ and $C_\kappa=4\sqrt{3C_2}\,M\kappa^2+4\sqrt{5C_2}\,M\kappa^3+C_4/(1-\alpha)$.

Proof.

We estimate $\mathcal I$ and $\mathcal{II}$ separately. For $\mathcal I$,
$$\begin{aligned}\mathcal I&=\big(\lambda I+L_{\tilde K,\bar\rho_X^{(T)}}\big)^{-1}\big(\lambda I+L_{\tilde K,\bar\rho_X^{(T)}}\big)U_{\mathbf x}\big(\lambda I+S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}\big)^{-1}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y\\&\quad-\big(\lambda I+L_{\tilde K,\bar\rho_X^{(T)}}\big)^{-1}U_{\mathbf x}\big(\lambda I+S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}\big)\big(\lambda I+S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}\big)^{-1}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y\\&=\big(\lambda I+L_{\tilde K,\bar\rho_X^{(T)}}\big)^{-1}\big(L_{\tilde K,\bar\rho_X^{(T)}}U_{\mathbf x}-U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}\big)\big(\lambda I+S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}\big)^{-1}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y\\&=\big(\lambda I+L_{\tilde K,\bar\rho_X^{(T)}}\big)^{-1}\big(L_{\tilde K,\bar\rho_X^{(T)}}U_{\mathbf x}-U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}\big)\,T\beta^{\mathbf z,\lambda},\end{aligned} \qquad (40)$$
where the last step uses (37). Then
$$\|\mathcal I\|_\rho\le\lambda^{-1}T\,\big\|L_{\hat K,\bar\rho_X^{(T)}}U_{\mathbf x}-U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}\big\|\,\|\beta^{\mathbf z,\lambda}\|_2+\lambda^{-1}T\,\big\|L_{\tilde K,\bar\rho_X^{(T)}}U_{\mathbf x}-L_{\hat K,\bar\rho_X^{(T)}}U_{\mathbf x}\big\|\,\|\beta^{\mathbf z,\lambda}\|_2, \qquad (41)$$
where $\hat K(x,u)=\int_XK(x,v)K(u,v)\,d\bar\rho_X^{(T)}(v)$ and $\|\beta^{\mathbf z,\lambda}\|_2$ is the $\ell^2$ norm on $\mathbb R^T$. Comparing the value of the objective in (36) at $\beta^{\mathbf z,\lambda}$ with its value at $\beta=0$ gives
$$\lambda T\|\beta^{\mathbf z,\lambda}\|_2^2\le\frac1T\sum_{i=1}^Ty_i^2. \qquad (42)$$
Combining (41), (42), the Cauchy–Schwarz inequality, and $\mathbb E\,\frac1T\sum_{i=1}^Ty_i^2\le2C_2M^2$ (by (15) with $l=2$), we obtain
$$\mathbb E\|\mathcal I\|_\rho\le\lambda^{-3/2}\sqrt{2C_2T}\,M\Big\{\big(\mathbb E\big\|L_{\hat K,\bar\rho_X^{(T)}}U_{\mathbf x}-U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}\big\|^2\big)^{1/2}+\big(\mathbb E\big\|L_{\tilde K,\bar\rho_X^{(T)}}U_{\mathbf x}-L_{\hat K,\bar\rho_X^{(T)}}U_{\mathbf x}\big\|^2\big)^{1/2}\Big\}. \qquad (43)$$
By the definition of $U_{\mathbf x}$, $\|U_{\mathbf x}\beta\|_\rho\le(\kappa/\sqrt T)\|\beta\|_2$ for any $\beta\in\mathbb R^T$, so $\|U_{\mathbf x}\|\le\kappa/\sqrt T$. Moreover, for $\|\beta\|_2\le1$,
$$\big(L_{\tilde K,\bar\rho_X^{(T)}}U_{\mathbf x}-L_{\hat K,\bar\rho_X^{(T)}}U_{\mathbf x}\big)\beta(x)=\frac1T\sum_{i=1}^T\beta_i\int_X\big(\tilde K(x,u)-\hat K(x,u)\big)K(u,x_i)\,d\bar\rho_X^{(T)}(u),$$
so, by the Cauchy–Schwarz inequality, $\sum_{i=1}^T|\beta_i|\le\sqrt T$, $\|K(x,\cdot)K(u,\cdot)\|_{C^s(X)}\le\kappa(\kappa+2|K|_{C^s(X\times X)})$, and condition (3),
$$\big\|L_{\tilde K,\bar\rho_X^{(T)}}U_{\mathbf x}-L_{\hat K,\bar\rho_X^{(T)}}U_{\mathbf x}\big\|\le\frac{\kappa}{\sqrt T}\sup_{x,u\in X}\Big|\int_XK(x,v)K(u,v)\,d\big(\bar\rho_X^{(T)}-\rho_X\big)(v)\Big|\le\frac{\kappa}{\sqrt T}\big\|\bar\rho_X^{(T)}-\rho_X\big\|_{(C^s(X))^*}\sup_{x,u\in X}\|K(x,\cdot)K(u,\cdot)\|_{C^s(X)}\le\frac{\alpha C_1\kappa^2\big(\kappa+2|K|_{C^s(X\times X)}\big)}{T^{3/2}(1-\alpha)}. \qquad (44)$$
Hence
$$\big(\mathbb E\big\|L_{\tilde K,\bar\rho_X^{(T)}}U_{\mathbf x}-L_{\hat K,\bar\rho_X^{(T)}}U_{\mathbf x}\big\|^2\big)^{1/2}\le\frac{\alpha C_1\kappa^2\big(\kappa+2|K|_{C^s(X\times X)}\big)}{T^{3/2}(1-\alpha)}. \qquad (45)$$
For the term $L_{\hat K,\bar\rho_X^{(T)}}U_{\mathbf x}-U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}$, let
$$\eta_{t,j,l}(x)=K(x_t,x_l)K(x_t,x_j)K(x,x_j)-\int_X\int_XK(x,v)K(u,v)K(u,x_l)\,d\rho_X^{(j)}(v)\,d\rho_X^{(t)}(u) \qquad (46)$$
and $\xi_l(x)=(1/T^2)\sum_{t,j=1}^T\eta_{t,j,l}(x)$. Then $|\eta_{t,j,l}|\le2\kappa^3$ and
$$\big\|L_{\hat K,\bar\rho_X^{(T)}}U_{\mathbf x}-U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}\big\|=\sup_{\|\beta\|_2\le1}\Big\|\frac1T\sum_{l=1}^T\beta_l\xi_l\Big\|_\rho\le\frac1{T^3}\Big(\int_X\sum_{l=1}^T\Big(\sum_{t,j=1}^T\eta_{t,j,l}(x)\Big)^2d\rho_X\Big)^{1/2}=\frac1{T^3}\Big(\int_X\sum_{t,j,w,\tau,l=1}^T\eta_{t,j,l}(x)\eta_{w,\tau,l}(x)\,d\rho_X\Big)^{1/2}. \qquad (47)$$
Arguing as in the proof of Lemma 4.1 in [9], when the indices $t,j,w,\tau,l$ are pairwise different, $\mathbb E_{\mathbf x}\big(\eta_{t,j,l}(x)\eta_{w,\tau,l}(x)\big)=0$. Since the number of index tuples that are not pairwise different is $T^5-T(T-1)(T-2)(T-3)(T-4)\le10T^4$, it follows that
$$\big(\mathbb E\big\|L_{\hat K,\bar\rho_X^{(T)}}U_{\mathbf x}-U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}S_{\mathbf x}U_{\mathbf x}\big\|^2\big)^{1/2}\le\frac1{T^3}\Big(\mathbb E\sum_{t,j,w,\tau,l=1}^T\eta_{t,j,l}(x)\eta_{w,\tau,l}(x)\Big)^{1/2}\le\frac{2\kappa^3}{T^3}\big(10T^4\big)^{1/2}\le\frac{2\sqrt{10}\,\kappa^3}{T}. \qquad (48)$$
This together with (43) and (45) yields
$$\mathbb E\|\mathcal I\|_\rho\le\sqrt{2C_2}\,M\Big(\frac{2\sqrt{10}\,\kappa^3}{\lambda^{3/2}T^{1/2}}+\frac{\alpha C_1\kappa^2\big(\kappa+2|K|_{C^s(X\times X)}\big)}{\lambda^{3/2}T(1-\alpha)}\Big). \qquad (49)$$

The term $\mathcal{II}$ is treated similarly. Recall that
$$\mathcal{II}=\big(\lambda I+L_{\tilde K,\bar\rho_X^{(T)}}\big)^{-1}\big(U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y-L_{\tilde K,\bar\rho_X^{(T)}}f_\rho\big)=\big(\lambda I+L_{\tilde K,\bar\rho_X^{(T)}}\big)^{-1}\big\{U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y-L_{\hat K,\bar\rho_X^{(T)}}f_\rho+L_{\hat K,\bar\rho_X^{(T)}}f_\rho-L_{\tilde K,\bar\rho_X^{(T)}}f_\rho\big\}. \qquad (50)$$
Hence
$$\mathbb E\|\mathcal{II}\|_\rho\le\lambda^{-1}\big(\mathbb E\big\|U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y-L_{\hat K,\bar\rho_X^{(T)}}f_\rho\big\|_\rho+\mathbb E\big\|L_{\hat K,\bar\rho_X^{(T)}}f_\rho-L_{\tilde K,\bar\rho_X^{(T)}}f_\rho\big\|_\rho\big). \qquad (51)$$

Define $\eta_{tj}(x)=y_tK(x_t,x_j)K(x,x_j)$ and
$$\xi_{tj}(x)=\eta_{tj}(x)-\int_X\int_XK(x,u)K(v,u)f_\rho(v)\,d\rho_X^{(j)}(u)\,d\rho_X^{(t)}(v),\qquad t,j=1,\dots,T. \qquad (52)$$
Then $U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y-L_{\hat K,\bar\rho_X^{(T)}}f_\rho=(1/T^2)\sum_{t,j=1}^T\xi_{tj}$, and therefore
$$\mathbb E\big\|U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y-L_{\hat K,\bar\rho_X^{(T)}}f_\rho\big\|_\rho^2\le T^{-4}\sum_{t,j,w,\tau=1}^T\mathbb E_{\mathbf z}\,\xi_{tj}(x)\xi_{w\tau}(x). \qquad (53)$$
If $t,j,w,\tau$ are pairwise distinct, then $\mathbb E_{\mathbf z}\big(\xi_{tj}(x)\xi_{w\tau}(x)\big)=0$. If $t=j$ (or $w=\tau$), the expectation need not vanish, since in general
$$\mathbb E_{\mathbf z}\eta_{tt}=\int_XK(v,v)K(x,v)f_\rho(v)\,d\rho_X^{(t)}(v)\ne\int_X\int_XK(x,u)K(v,u)f_\rho(v)\,d\rho_X^{(t)}(u)\,d\rho_X^{(t)}(v). \qquad (54)$$
By the Cauchy–Schwarz inequality, for any $t,j,w,\tau=1,\dots,T$,
$$\mathbb E_{\mathbf z}\,\xi_{tj}(x)\xi_{w\tau}(x)\le\big(\mathbb E_{\mathbf z}\xi_{tj}^2(x)\big)^{1/2}\big(\mathbb E_{\mathbf z}\xi_{w\tau}^2(x)\big)^{1/2}\le\max\big\{\mathbb E_{\mathbf z}\xi_{tj}^2(x),\,\mathbb E_{\mathbf z}\xi_{w\tau}^2(x)\big\}, \qquad (55)$$
so it suffices to bound $\mathbb E_{\mathbf z}\xi_{tj}^2$. A direct computation together with (15) shows, for $t\ne j$,
$$\mathbb E_{\mathbf z}\xi_{tj}^2\le\mathbb E_{\mathbf z}\eta_{tj}^2\le\int_X\int_XK^2(x,u)K^2(v,u)\int_Yy^2\,d\rho(y\mid v)\,d\rho_X^{(j)}(u)\,d\rho_X^{(t)}(v)\le2C_2M^2\kappa^4, \qquad (56)$$
and, by the same method,
$$\mathbb E_{\mathbf z}\xi_{tt}^2=\mathbb E_{\mathbf z}\eta_{tt}^2-2\,\mathbb E_{\mathbf z}\eta_{tt}\int_X\int_XK(x,u)K(v,u)f_\rho(v)\,d\rho_X^{(t)}(u)\,d\rho_X^{(t)}(v)+\Big(\int_X\int_XK(x,u)K(v,u)f_\rho(v)\,d\rho_X^{(t)}(u)\,d\rho_X^{(t)}(v)\Big)^2\le8C_2M^2\kappa^4. \qquad (57)$$
Since the number of quadruples $(t,j,w,\tau)$ that are not pairwise distinct is $T^4-T(T-1)(T-2)(T-3)\le6T^3$, the above bounds give
$$T^{-4}\sum_{t,j,w,\tau=1}^T\mathbb E_{\mathbf z}\,\xi_{tj}(x)\xi_{w\tau}(x)\le8\,T^{-4}\big(T^4-T(T-1)(T-2)(T-3)\big)C_2M^2\kappa^4\le\frac{48\,C_2M^2\kappa^4}{T}, \qquad (58)$$
and hence
$$\mathbb E\big\|U_{\mathbf x}S_{\mathbf x}\hat U_{\mathbf x}\mathbf y-L_{\hat K,\bar\rho_X^{(T)}}f_\rho\big\|_\rho\le\frac{4\sqrt{3C_2}\,M\kappa^2}{\sqrt T}. \qquad (59)$$
For the remaining term in (51), note that $\|f_\rho\|_\infty\le C_2M$ by (15) with $l=1$; thus
$$\mathbb E\big\|L_{\hat K,\bar\rho_X^{(T)}}f_\rho-L_{\tilde K,\bar\rho_X^{(T)}}f_\rho\big\|_\rho\le\|f_\rho\|_\infty\sup_{x,u\in X}\Big|\int_XK(x,v)K(u,v)\,d\big(\bar\rho_X^{(T)}-\rho_X\big)(v)\Big|\le C_2M\,\big\|\bar\rho_X^{(T)}-\rho_X\big\|_{(C^s(X))^*}\sup_{x,u\in X}\|K(x,\cdot)K(u,\cdot)\|_{C^s(X)}\le\frac{\alpha C_1\kappa\big(\kappa+2|K|_{C^s(X\times X)}\big)C_2M}{T(1-\alpha)}. \qquad (60)$$
Then
$$\mathbb E\|\mathcal{II}\|_\rho\le\frac{4\sqrt{3C_2}\,M\kappa^2}{\lambda\sqrt T}+\frac{\alpha C_1\kappa\big(\kappa+2|K|_{C^s(X\times X)}\big)C_2M}{\lambda T(1-\alpha)}. \qquad (61)$$
This together with (49) yields the conclusion.

Now we are in a position to give the proofs of Theorems 5 and 7.

Proof of Theorem 5.

Theorem 11 ensures that
$$\mathbb E\big\|f_{\mathbf z,\lambda}-f_{\lambda,\bar\rho_X^{(T)}}\big\|_\rho\le\frac{C_\kappa}{\lambda^{3/2}T^{1/2}}. \qquad (62)$$

For $1/2<r\le3/2$, Proposition 10 gives
$$\big\|f_{\lambda,\bar\rho_X^{(T)}}-f_{\lambda,\rho_X}\big\|_{\tilde K}\le\frac{C_3\,\lambda^{r-3/2}}{T}, \qquad (63)$$
and Proposition 8 gives
$$\|f_{\lambda,\rho_X}-f_\rho\|_{\tilde K}\le C_r\lambda^{r-1/2}, \qquad (64)$$
while
$$\|f\|_\rho\le\kappa\|f\|_{\tilde K},\qquad\forall f\in\mathcal H_{\tilde K}. \qquad (65)$$

Combining all the bounds and taking $\lambda=T^{-\theta}$ with $0<\theta<1/3$, we obtain the conclusion of Theorem 5 with $C_\kappa''=(C_3+C_r)\kappa+C_\kappa$.

Proof of Theorem 7.

When the samples are drawn i.i.d. from the measure $\rho$, we have $f_{\lambda,\bar\rho_X^{(T)}}=f_{\lambda,\rho_X}$. Hence
$$\mathbb E\|f_{\mathbf z,\lambda}-f_\rho\|_\rho\le\mathbb E\|f_{\mathbf z,\lambda}-f_{\lambda,\rho_X}\|_\rho+\|f_{\lambda,\rho_X}-f_\rho\|_\rho\le\frac{C_\kappa}{\lambda^{3/2}T^{1/2}}+C_q\lambda^q. \qquad (66)$$
Let $\lambda=T^{-\theta}$; then
$$\mathbb E\|f_{\mathbf z,\lambda}-f_\rho\|_\rho\le\tilde C_\kappa T^{-\min\{1/2-(3/2)\theta,\,q\theta\}}. \qquad (67)$$

The conclusion follows by balancing the two exponents, distinguishing the cases $r\le1$ (where $q=r$) and $r>1$ (where $q=1$).

Acknowledgments

The author would like to thank Professor Hongwei Sun for useful discussions which have helped to improve the presentation of the paper. The work described in this paper is supported partially by National Natural Science Foundation of China (Grant no. 11001247) and Doctor Grants of Guangdong University of Business Studies (Grant no. 11BS11001).

References

1. S. Smale and D.-X. Zhou, "Online learning with Markov sampling," Analysis and Applications, vol. 7, no. 1, pp. 87–113, 2009.
2. N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society, vol. 68, pp. 337–404, 1950.
3. F. Cucker and D.-X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge University Press, Cambridge, UK, 2007.
4. S. Smale and D.-X. Zhou, "Learning theory estimates via integral operators and their approximations," Constructive Approximation, vol. 26, no. 2, pp. 153–172, 2007.
5. Q. Wu, Y. Ying, and D.-X. Zhou, "Learning rates of least-square regularized regression," Foundations of Computational Mathematics, vol. 6, no. 2, pp. 171–192, 2006.
6. H. Sun and Q. Wu, "Indefinite kernel network with dependent sampling," Analysis and Applications, accepted.
7. Q. Wu and D.-X. Zhou, "Learning with sample dependent hypothesis spaces," Computers and Mathematics with Applications, vol. 56, no. 11, pp. 2896–2907, 2008.
8. Q. Wu, "Regularization networks with indefinite kernels," Journal of Approximation Theory, vol. 166, pp. 1–18, 2013.
9. H. Sun and Q. Wu, "Least square regression with indefinite kernels and coefficient regularization," Applied and Computational Harmonic Analysis, vol. 30, no. 1, pp. 96–109, 2011.
10. L. Shi, "Learning theory estimate for coefficient-based regularized regression," Applied and Computational Harmonic Analysis, vol. 34, no. 2, pp. 252–265, 2013.
11. H. Sun and Q. Guo, "Coefficient regularized regression with non-iid sampling," International Journal of Computer Mathematics, vol. 88, no. 15, pp. 3113–3124, 2011.
12. J. B. Conway, A Course in Operator Theory, American Mathematical Society, Providence, RI, 2000.
13. C. Wang and D.-X. Zhou, "Optimal learning rates for least squares regularized regression with unbounded sampling," Journal of Complexity, vol. 27, no. 1, pp. 55–67, 2011.
14. A. Caponnetto and E. De Vito, "Optimal rates for the regularized least-squares algorithm," Foundations of Computational Mathematics, vol. 7, no. 3, pp. 331–368, 2007.
15. E. De Vito, A. Caponnetto, and L. Rosasco, "Model selection for regularized least-squares algorithm in learning theory," Foundations of Computational Mathematics, vol. 5, no. 1, pp. 59–85, 2005.
16. S. Mendelson and J. Neeman, "Regularization in kernel learning," The Annals of Statistics, vol. 38, no. 1, pp. 526–565, 2010.
17. I. Steinwart, D. Hush, and C. Scovel, "Optimal rates for regularized least-squares regression," in Proceedings of the 22nd Annual Conference on Learning Theory, pp. 79–93, 2009.
18. S. Smale and D.-X. Zhou, "Shannon sampling II: connections to learning theory," Applied and Computational Harmonic Analysis, vol. 19, no. 3, pp. 285–302, 2005.
19. D.-X. Zhou, "Capacity of reproducing kernel spaces in learning theory," IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1743–1752, 2003.