1. Introduction
Kernel methods have been extensively utilized in various learning tasks, and their generalization performance has been investigated from the viewpoint of approximation theory [1, 2]. Among these methods, a family can be viewed as a coefficient-based regularized framework in data-dependent hypothesis spaces; see, for example, [3–8]. Given samples {(xi,yi)}i=1n, the solution of these kernel methods has the expression ∑i=1nαiK(xi,·), where αi∈ℝ and K is a Mercer kernel. The aim of these coefficient-based algorithms is to find a set of coefficients {αi} with good predictive performance.

Inspired by the greedy approximation methods in [9–12], we propose a sparse greedy algorithm for regression. Greedy approximation has two advantages over regularization methods: first, sparsity is controlled directly by the greedy approximation algorithm rather than by a regularization parameter; second, greedy approximation does not change the objective function, whereas regularized methods usually modify it by adding a sparse regularization term [13].

Before introducing the greedy algorithm, we recall some preliminary background on regression. Let the input space 𝒳⊂ℝd be a compact subset and let 𝒴=[-M,M] for some constant M>0. In the regression model, the learner receives a sample set z={(xi,yi)}i=1n, where the (xi,yi)∈𝒵:=𝒳×𝒴, 1≤i≤n, are drawn independently from an unknown distribution ρ on 𝒵. The goal of learning is to pick a function f:𝒳→𝒴 with the expected error
(1.1)ℰ(f)=∫𝒵(f(x)-y)2 dρ
as small as possible. Note that the regression function
(1.2)fρ(x)=∫𝒴ydρ(y∣x), x∈𝒳,
is the minimizer of ℰ(f), where ρ(·∣x) is the conditional probability measure at x induced by ρ.

The empirical error is defined as
(1.3)ℰz(f)=1n∑i=1n(f(xi)-yi)2.

We call a symmetric, positive semidefinite, continuous function K:𝒳×𝒳→ℝ a Mercer kernel. The reproducing kernel Hilbert space (RKHS) ℋK is defined as the closure of the linear span of the set of functions {Kx:=K(x,·):x∈𝒳} with the inner product 〈·,·〉K determined by 〈Kx,Kx′〉K=K(x,x′). For all x∈𝒳 and f∈ℋK, the reproducing property gives 〈Kx,f〉K=f(x). The continuity of K and the compactness of 𝒳 imply κ:=supx∈𝒳√K(x,x)<∞.
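As a concrete illustration (not part of the formal development), the Gaussian kernel is a standard example of a Mercer kernel; the following sketch numerically checks the symmetry and positive semidefiniteness of its kernel matrix on a small sample and evaluates κ for this particular choice.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """A standard example of a Mercer kernel on a compact subset of R^d."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 2))          # sample from a compact input space
G = np.array([[gaussian_kernel(x, xp) for xp in X] for x in X])

assert np.allclose(G, G.T)                     # symmetry of the kernel matrix
assert np.min(np.linalg.eigvalsh(G)) > -1e-10  # positive semidefiniteness (up to rounding)
kappa = np.sqrt(np.max(np.diag(G)))            # kappa = sup_x sqrt(K(x,x)) = 1 for this kernel
```

For the Gaussian kernel K(x,x)=1 for every x, so κ=1 here.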

In contrast to the coefficient-based regularized methods [3–6], in this paper we use the idea of sequential greedy approximation to realize sparse learning. Denote ℋ^={h^i}i=12n, where h^2i-1=Kxi and h^2i=-Kxi. The hypothesis space (depending on z) is defined as
(1.4)CO2n(ℋ^)={f:f(x)=∑i=12nαih^i(x),αi≥0,∑i=12nαi≤1}.
For any hypothesis function space 𝒢 and β>0, we denote β𝒢={f:f=βg,g∈𝒢}.

Since |fρ(x)|≤M by the definition of fρ, it is natural to restrict the approximating functions to [-M,M]. The projection operator below has been widely used in the error analysis of learning algorithms (see, e.g., [2, 14]).

Definition 1.1.
The projection operator π=πM is defined on the space of measurable functions f:𝒳→ℝ as
(1.5)π(f)(x)={M,if f(x)>M;-M,if f(x)<-M;f(x),otherwise.
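Pointwise, Definition 1.1 is simply truncation to [-M,M]; a minimal sketch:

```python
import numpy as np

def project(f_values, M):
    """Projection operator pi_M of Definition 1.1, applied pointwise:
    values above M are clipped to M, values below -M are clipped to -M."""
    return np.clip(f_values, -M, M)
```

For example, `project(np.array([3.0, -0.5, -7.0]), 2.0)` returns `[2.0, -0.5, -2.0]`.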

The kernel-based greedy algorithm can be summarized as follows. Let t be a stopping time and let β be a positive constant. Set f^β0=0. Then, for τ=1,2,…,t, define
(1.6)h^τ,α^τ,β^τ=arg minh∈ℋ^,0≤α≤1,0≤β′≤βℰz((1-α)f^βτ-1+αβ′h),f^βτ=(1-α^τ)f^βτ-1+α^τβ^τh^τ.
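A direct, if naive, rendering of iteration (1.6) replaces the exact inner minimization by a grid search over α and β′ together with an exhaustive search over the 2n dictionary elements; the sketch below is purely illustrative and makes no claim about an efficient implementation.

```python
import numpy as np

def greedy_kernel_regression(X, y, K, beta, t, grid=11):
    """Sketch of algorithm (1.6): sequential greedy minimization of the
    empirical error E_z over the dictionary H^ = {+K_{x_i}, -K_{x_i}}."""
    n = len(y)
    Kmat = np.array([[K(xi, xj) for xj in X] for xi in X])
    H = np.vstack([Kmat, -Kmat])            # each row: a dictionary element on the sample
    f = np.zeros(n)                          # f^0_beta = 0
    alphas = np.linspace(0.0, 1.0, grid)     # 0 <= alpha <= 1
    betas = np.linspace(0.0, beta, grid)     # 0 <= beta' <= beta
    for _ in range(t):
        best, best_err = f, np.inf
        for h in H:                          # exhaustive search over h in H^
            for a in alphas:
                for b in betas:
                    err = np.mean(((1 - a) * f + a * b * h - y) ** 2)
                    if err < best_err:
                        best_err, best = err, (1 - a) * f + a * b * h
        f = best
    return f
```

Since α=0 always reproduces the current iterate, the empirical error ℰz(f^βτ) is non-increasing in τ.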
In contrast to the regularized algorithms in [6, 12, 14–18], the above learning algorithm aims to realize efficient learning by greedy approximation, and the study of its generalization performance can enrich the learning theory of kernel-based regression. In the remainder of this paper, we focus on establishing the convergence rate of π(f^βt) to the regression function fρ under suitable parameter choices. The theoretical result relies on weaker conditions than the previous error analyses for the kernel-based regularization framework in [4, 5].

2. Main Result
Define a data-free basis function set
(2.1)ℋ={hi:h2i-1=Kui,h2i=-Kui, ui∈𝒳,i=1,2,…},CO(ℋ)={f:f(x)=∑i=1∞αihi(x),αi≥0,∑i=1∞αi≤1}.

To investigate the approximation of π(f^βt) to fρ, we introduce a data-independent function
(2.2)fβ*=arg minf∈βCO(ℋ)ℰ(f).

Observe that
(2.3)ℰ(π(f^βt))-ℰ(fρ) ≤{ℰ(π(f^βt))-ℰz(π(f^βt))+ℰz(fβ*)-ℰ(fβ*)}+{ℰz(π(f^βt))-ℰz(fβ*)} +{ℰ(fβ*)-ℰ(fρ)}.
Here, the three terms on the right-hand side are called the sample error, the hypothesis error, and the approximation error, respectively.

To estimate the sample error, we need a complexity measure of the hypothesis function space ℋK. For this purpose, we introduce some definitions of covering numbers.

Definition 2.1.
Let (𝒰,d) be a pseudometric space and denote a subset S⊂𝒰. For every ϵ>0, the covering number 𝒩(S,ϵ,d) of S with respect to ϵ,d is defined as the minimal number of balls of radius ϵ whose union covers S, that is,
(2.4)𝒩(S,ϵ,d)=min{l∈ℕ:S⊂⋃j=1lB(sj,ϵ) for some {sj}j=1l⊂𝒰},
where B(sj,ϵ)={s∈𝒰:d(s,sj)≤ϵ} is a ball in 𝒰.
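For a small finite set, the covering number of Definition 2.1 can be computed by brute force; the sketch below restricts the ball centres to S itself (Definition 2.1 allows centres anywhere in 𝒰, so this gives an upper bound, which is exact in the tests below).

```python
import itertools

def covering_number(S, eps, d):
    """Brute-force N(S, eps, d) for a small finite set S: the smallest l
    such that l closed balls of radius eps, centred at points of S, cover S.
    Restricting centres to S upper-bounds the covering number of Definition 2.1."""
    m = len(S)
    for l in range(1, m + 1):
        for centres in itertools.combinations(range(m), l):
            # check that every point of S lies in some ball B(s_j, eps)
            if all(any(d(S[i], S[j]) <= eps for j in centres) for i in range(m)):
                return l
    return m
```

For instance, four equally spaced points {0,1,2,3} on the line need two balls of radius 1 but four balls of radius 1/2.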

The empirical covering number with ℓ2 metric is defined as below.

Definition 2.2.
Let ℱ be a set of functions on 𝒳, u=(ui)i=1k∈𝒳k, and ℱ|u={(f(ui))i=1k:f∈ℱ}⊂ℝk. Set 𝒩2,u(ℱ,ϵ)=𝒩(ℱ|u,ϵ,d2). The ℓ2 empirical covering number of ℱ is defined by
(2.5)𝒩2(ℱ,ϵ)=supk∈ℕsupu∈𝒳k𝒩2,u(ℱ,ϵ), ϵ>0,
where the ℓ2 metric d2 is given by
(2.6)d2(a,b)=(1k∑i=1k|ai-bi|2)1/2, ∀a=(ai)i=1k∈ℝk, b=(bi)i=1k∈ℝk.

Denote by ℬr the ball of radius r>0 in ℋK, that is, ℬr={f∈ℋK:∥f∥ℋK≤r}. We need the following capacity assumption on ℋK, which has been used in [5, 6, 18].

Assumption 2.3.
There exist an exponent p∈(0,2) and a constant cp,K>0 such that
(2.7)log𝒩2(ℬ1,ϵ)≤cp,Kϵ-p.

We now formulate the generalization error bounds for π(f^βt). The result follows from Propositions 3.2–3.5 in the next section.

Theorem 2.4.
Under Assumption 2.3, for any 0<δ<1, the following inequality holds with confidence 1-δ:
(2.8)ℰ(π(f^βt))-ℰ(fρ)≤4(ℰ(fβ*)-ℰ(fρ))+32β2/t+(4(3M+κ2β)2/n)log(2/δ)+1280M2(cp,K(4Mκβ)p)2/(2+p)log(2/δ)n-2/(2+p).

From this result, there exists a constant C independent of n, t, and δ such that, with confidence 1-δ,
(2.9)ℰ(π(f^βt))-ℰ(fρ)≤4(ℰ(fβ*)-ℰ(fρ))+Cmax{β2/t,β2/n,(βp/n)2/(2+p)}log(2/δ).
In particular, if fρ∈β~CO(ℋ) for some fixed constant β~ and t≥n, then ℰ(π(f^β~t))-ℰ(fρ)→0 with decay rate O(n-2/(2+p)). This learning rate is satisfactory as p→0.

Here, the estimate of the hypothesis error is simple and does not require the strict conditions on ρ and 𝒳 imposed in [3–5] for learning with data-dependent hypothesis spaces.

If, in addition, the approximation error decays as β increases, we can obtain explicit learning rates under suitable parameter selection.

Corollary 2.5.
Assume that the RKHS ℋK satisfies (2.7) and that ℰ(fβ*)-ℰ(fρ)≤cγβ-γ for some γ>0. Choose β=np/(4(2+p)). For any 0<δ<1 and t=n, one has
(2.10)ℰ(π(f^βt))-ℰ(fρ)≤Cn-min{(4+p)/(4+2p),pγ/(8+4p)}log(2/δ)
with confidence 1-δ. Here C is a constant independent of n,δ.
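The exponent in Corollary 2.5 can be evaluated numerically; the short sketch below (purely illustrative) shows that for small p the second branch pγ/(8+4p) dominates and vanishes, consistent with the remark following the corollary.

```python
def rate_exponent(p, gamma):
    """Exponent theta in the rate n^{-theta} of Corollary 2.5:
    theta = min{(4+p)/(4+2p), p*gamma/(8+4p)}, for p in (0,2), gamma > 0."""
    return min((4 + p) / (4 + 2 * p), p * gamma / (8 + 4 * p))
```

For example, rate_exponent(0.5, 2.0) = min(0.9, 0.1) = 0.1, and shrinking p to 0.1 makes the exponent smaller still.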

Observe that the learning rate depends closely on the approximation condition between fρ and fβ*. This means that only when the target function can be well described by functions from the hypothesis space can the learning algorithm achieve good generalization performance. In fact, similar approximation assumptions have been extensively studied in the error analysis of learning theory; see, for example, [1, 2, 4, 17].

From Corollary 2.5, even when the kernel K∈C∞, so that the exponent p>0 can be taken arbitrarily small, one can see that the resulting learning rate is quite low. Future research may improve this estimate by introducing new analysis techniques.

3. Proof of Theorem 2.4
In this section, we provide the proof of Theorem 2.4 based on the upper bound estimates of sample error and hypothesis error. Denote
(3.1)S1={ℰz(fβ*)-ℰz(fρ)}-{ℰ(fβ*)-ℰ(fρ)},S2={ℰ(π(f^βt))-ℰ(fρ)}-{ℰz(π(f^βt))-ℰz(fρ)}.
We can observe that the sample error
(3.2)ℰ(π(f^βt))-ℰz(π(f^βt))+ℰz(fβ*)-ℰ(fβ*)=S1+S2.

Here S1 can be bounded by applying the following one-sided Bernstein-type probability inequality; see, for example, [1, 2, 14].

Lemma 3.1.
Let ξ be a random variable on a probability space Z with mean Eξ and variance σ2(ξ)=σ2. If |ξ(z)-Eξ|≤B for almost all z∈Z, then for all ε>0,
(3.3)Probz∈Zn{1n∑i=1nξ(zi)-Eξ≥ε}≤exp{-nε22(σ2+Bε/3)}.
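A Monte Carlo sanity check of (3.3) is straightforward; the sketch below (illustrative only, with ξ uniform on [-1,1] as an assumed example) compares the empirical deviation probability against the Bernstein bound.

```python
import numpy as np

# Monte Carlo check of the one-sided Bernstein inequality (3.3)
# for xi ~ Uniform(-1, 1): E xi = 0, sigma^2 = 1/3, |xi - E xi| <= B = 1.
rng = np.random.default_rng(0)
n, eps, trials = 100, 0.2, 20000
sigma2, B = 1.0 / 3.0, 1.0

means = rng.uniform(-1, 1, size=(trials, n)).mean(axis=1)
empirical = np.mean(means >= eps)                              # observed tail probability
bound = np.exp(-n * eps ** 2 / (2 * (sigma2 + B * eps / 3)))   # right-hand side of (3.3)

assert empirical <= bound  # the deviation probability stays within the Bernstein bound
```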

Proposition 3.2.
For any 0<δ<1, with confidence 1-δ, one has
(3.4)S1≤ℰ(fβ*)-ℰ(fρ)+(2(3M+κ2β)2/n)log(1/δ).

Proof.
Following the definition of S1, we have S1=(1/n)∑i=1nξ(zi)-Eξ, where the random variable ξ(z)=(y-fβ*(x))2-(y-fρ(x))2 for z=(x,y).

From the definition of fβ*, we know ∥fβ*∥K≤κβ and ∥fβ*∥∞≤κ∥fβ*∥K≤κ2β. Then
(3.5)|ξ(z)|=|(fβ*(x)-fρ(x))((fβ*(x)-y)+(fρ(x)-y))|≤(3M+κ2β)2:=c1
and |ξ-Eξ|≤2c1. Moreover,
(3.6)σ2≤Eξ2=∫𝒵(fβ*(x)-fρ(x))2((fβ*(x)-y)+(fρ(x)-y))2 dρ≤c1{ℰ(fβ*)-ℰ(fρ)}.

Applying Lemma 3.1 with B=2c1 and σ2=c1{ℰ(fβ*)-ℰ(fρ)}, we get
(3.7)(1/n)∑i=1nξ(zi)-Eξ≤t
with confidence at least 1-exp{-nt2/(2c1(ℰ(fβ*)-ℰ(fρ)+(2/3)t))}. Setting -nt2/(2c1(ℰ(fβ*)-ℰ(fρ)+(2/3)t))=log(δ), we derive the solution
(3.8)t*=((2c1/3)log(1/δ)+(((2c1/3)log(1/δ))2+2c1nlog(1/δ)(ℰ(fβ*)-ℰ(fρ)))1/2)/n≤(2c1/n)log(1/δ)+ℰ(fβ*)-ℰ(fρ).
Thus, with confidence 1-δ, we have
(3.9)(1/n)∑i=1nξ(zi)-Eξ≤(2c1/n)log(1/δ)+ℰ(fβ*)-ℰ(fρ).
This completes the proof.

To establish a uniform upper bound on S2, we introduce a concentration inequality established in [18].

Lemma 3.3.
Assume that there are constants B,c>0 and α∈[0,1] such that ∥f∥∞≤B and Ef2≤c(Ef)α for every f∈ℱ. If for some a>0 and p∈(0,2),
(3.10)log(𝒩2(ℱ,ϵ))≤aϵ-p, ∀ϵ>0,
then there exists a constant cp′ depending only on p such that for any t>0, with probability at least 1-e-t, there holds
(3.11)Ef-(1/n)∑i=1nf(zi)≤(1/2)η1-α(Ef)α+cp′η+2(ct/n)1/(2-α)+18Bt/n, ∀f∈ℱ,
where
(3.12)η:=max{c(2-p)/(4-2α+pα)(a/n)2/(4-2α+pα),B(2-p)/(2+p)(a/n)2/(2+p)}.

Proposition 3.4.
Under Assumption 2.3, for any 0<δ<1, one has with confidence at least 1-δ:
(3.13)S2≤(1/2){ℰ(π(f^βt))-ℰ(fρ)}+640M2(cp,K(4Mκβ)p)2/(2+p)log(1/δ)n-2/(2+p).

Proof.
From the definition of f^βt, we have ∥f^βt∥K≤κβ. Denote
(3.14)ℱκβ={g(z)=(y-π(f)(x))2-(y-fρ(x))2:f∈ℬκβ}.
We can see that Eg=ℰ(π(f))-ℰ(fρ) and (1/n)∑i=1ng(zi)=ℰz(π(f))-ℰz(fρ). Since ∥π(f)∥∞≤M and |fρ(x)|≤M, we have
(3.15)|g(z)|=|(π(f)(x)-fρ(x))((π(f)(x)-y)+(fρ(x)-y))|≤8M2,Eg2=∫𝒵(π(f)(x)-fρ(x))2((π(f)(x)-y)+(fρ(x)-y))2 dρ≤16M2Eg.
For g1,g2∈ℱκβ, we have
(3.16)|g1(z)-g2(z)|=|(y-π(f1)(x))2-(y-π(f2)(x))2|≤4M|π(f1)(x)-π(f2)(x)|≤4M|f1(x)-f2(x)|.
Then, from Assumption 2.3,
(3.17)𝒩2,z(ℱκβ,ϵ)≤𝒩2,x(ℬκβ,ϵ/(4M))≤𝒩2,x(ℬ1,ϵ/(4Mκβ))≤cp,K(4Mκβ)pϵ-p.

Applying Lemma 3.3 with α=1, B=c=16M2, a=cp,K(4Mκβ)p, and t=log(1/δ), for any δ∈(0,1) and all g∈ℱκβ,
(3.18)Eg-(1/n)∑i=1ng(zi)≤(1/2)Eg+cp′(16M2)(2-p)/(2+p)(cp,K(4Mκβ)p/n)2/(2+p)+320M2log(1/δ)/n≤(1/2)Eg+640M2(cp,K(4Mκβ)p)2/(2+p)log(1/δ)n-2/(2+p)
holds with confidence 1-δ. This completes the proof.

In contrast to the previous studies of the regularized framework [3–5], we estimate the hypothesis error ℰz(f^βt)-ℰz(fβ*) based on Theorem 4.2 of [11] for sequential greedy approximation.

Proposition 3.5.
For a fixed sample z, one has
(3.19)ℰz(f^βt)-ℰz(fβ*)≤16β2/t.

The desired result in Theorem 2.4 can be derived directly by combining Propositions 3.2–3.5.