Abstract and Applied Analysis, ISSN 1085-3375 (print), 1687-0409 (online), Hindawi Publishing Corporation, Volume 2013, Article ID 715683, doi:10.1155/2013/715683. Research Article: Density Problem and Approximation Error in Learning Theory. Ding-Xuan Zhou, Department of Mathematics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong (cityu.edu.hk). Academic Editor: Yiming Ying. Received 3 May 2013; Accepted 5 August 2013; Published 7 October 2013. Copyright © 2013 Ding-Xuan Zhou. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We study the density problem and approximation error of reproducing kernel Hilbert spaces for the purpose of learning theory. For a Mercer kernel K on a compact metric space (X, d), a characterization for the generated reproducing kernel Hilbert space (RKHS) ℋ_K to be dense in C(X) is given. As a corollary, we show that the density is always true for convolution type kernels. Some estimates for the rate of convergence of interpolation schemes are presented for general Mercer kernels. These are then used to establish, for convolution type kernels, a quantitative analysis of the approximation error in learning theory. Finally, we show by the example of Gaussian kernels with varying variances that the approximation error can be improved when we adaptively change the value of the parameter of the kernel. This confirms the method of choosing varying parameters, which is often used in applications of learning theory.

1. Introduction

Learning theory investigates how to find function relations or data structures from random samples. For the regression problem, one usually has some experience and would expect that the (underlying) unknown function lies in some set ℋ of functions called the hypothesis space. Then one tries to find a good approximation in ℋ of the underlying function f (under a certain metric). The best approximation in ℋ is called the target function f_ℋ. However, f is unknown. What we have in hand is a set of random samples {(x_i, y_i)}_{i=1}^m. These samples are not given by f exactly (in general, f(x_i) ≠ y_i); they are controlled by this underlying function f together with noise or some other uncertainties. The most important model studied in learning theory is to assume that the uncertainty is represented by a Borel probability measure ρ on X × Y and that the underlying function f: X → Y is the regression function of ρ, given by (1) f_ρ(x) = ∫_Y y dρ(y|x), x ∈ X. Here, ρ(y|x) is the conditional probability measure at x. Then, the samples {(x_i, y_i)}_{i=1}^m are drawn independently and identically distributed according to the probability measure ρ. For the classification problem, Y = {1, −1} and sign(f_ρ) is the optimal classifier.

Based on the samples, one can find a function from the hypothesis space that best fits the data z := {(x_i, y_i)}_{i=1}^m (with respect to a certain loss functional). This function is called the empirical target function f_z. When the number of samples is large enough, f_z is a good approximation of the target function f_ℋ with certain confidence. This problem has been extensively investigated and well developed in the literature of statistical learning theory.

What is less understood is the approximation of the underlying desired function f by the target function f_ℋ. For example, if one takes ℋ to be a polynomial space of some fixed degree, then f can be well approximated by functions from ℋ only when f is itself close to a polynomial in ℋ.

In kernel machine learning, such as support vector machines, one often uses reproducing kernel Hilbert spaces or their balls as hypothesis spaces. Here, we take (X, d(·,·)) to be a compact metric space and Y = ℝ.

Definition 1.

Let K: X × X → ℝ be continuous, symmetric, and positive semidefinite; that is, for any finite set of distinct points {x_1, …, x_ℓ} ⊂ X, the matrix (K(x_i, x_j))_{i,j=1}^ℓ is positive semidefinite. Such a kernel is called a Mercer kernel. It is called positive definite if the matrix (K(x_i, x_j))_{i,j=1}^ℓ is always positive definite.
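As a quick numerical sanity check (not part of the paper's argument), the positive semidefiniteness condition of Definition 1 can be inspected for a concrete kernel through the eigenvalues of a Gram matrix; the Gaussian kernel and the node set below are illustrative choices.

```python
# Illustrative check: the Gram matrix of a Mercer kernel is positive
# semidefinite at any finite set of distinct points; Gaussian kernels
# are in fact positive definite.
import numpy as np

def gram_matrix(kernel, points):
    """Gram matrix (K(x_i, x_j))_{i,j} for a list of points."""
    return np.array([[kernel(x, y) for y in points] for x in points])

def is_positive_semidefinite(A, tol=1e-10):
    """All eigenvalues of the symmetric matrix A are >= -tol."""
    return bool(np.min(np.linalg.eigvalsh(A)) >= -tol)

sigma = 0.5
gauss = lambda x, y: np.exp(-abs(x - y) ** 2 / sigma ** 2)

pts = [0.0, 0.3, 0.7, 1.0]
A = gram_matrix(gauss, pts)
assert is_positive_semidefinite(A)        # Mercer: PSD at any node set
assert np.min(np.linalg.eigvalsh(A)) > 0  # strictly positive definite here
```

This is only a finite-sample test, of course; the Mercer property requires it for every finite node set.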

The reproducing kernel Hilbert space (RKHS) ℋ_K associated with a Mercer kernel K is defined to be the completion of the linear span of the set of functions {K_x := K(x, ·): x ∈ X} with the inner product ⟨·,·⟩_K satisfying (2) ‖Σ_{i=1}^ℓ c_i K_{x_i}‖_K² = ⟨Σ_{i=1}^ℓ c_i K_{x_i}, Σ_{j=1}^ℓ c_j K_{x_j}⟩_K = Σ_{i,j=1}^ℓ c_i K(x_i, x_j) c_j. The reproducing kernel property is given by (3) ⟨K_x, g⟩_K = g(x), x ∈ X, g ∈ ℋ_K. This space can be embedded into C(X), the space of continuous functions on X.
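A numerical sketch of (2) and (3) (with an illustrative Gaussian kernel and arbitrary coefficients, not from the paper): the RKHS norm of g = Σ_i c_i K_{x_i} is the quadratic form c^T A c, and Cauchy–Schwarz applied to (3) gives |g(x)| ≤ ‖g‖_K K(x, x)^{1/2}, which is why ℋ_K embeds into C(X).

```python
import numpy as np

sigma = 0.5
K = lambda x, y: np.exp(-abs(x - y) ** 2 / sigma ** 2)

nodes = np.array([0.0, 0.4, 0.9])
c = np.array([1.0, -2.0, 0.5])
A = np.array([[K(s, t) for t in nodes] for s in nodes])

norm_g = np.sqrt(c @ A @ c)  # ||g||_K from the quadratic form in (2)
g = lambda x: sum(ci * K(x, xi) for ci, xi in zip(c, nodes))

# Cauchy-Schwarz on (3): g(x) = <K_x, g>_K, so |g(x)| <= ||K_x||_K ||g||_K
for x in np.linspace(0.0, 1.0, 21):
    assert abs(g(x)) <= norm_g * np.sqrt(K(x, x)) + 1e-9
```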

In kernel machine learning, one often takes ℋ_K or its balls as the hypothesis space. Then, one needs to know whether the desired function f can be approximated by functions from the RKHS.

The first purpose of this paper is to study the density of the reproducing kernel Hilbert spaces in C(X) (or in L²(X) when X is a subset of the Euclidean space ℝⁿ). This will be done in Section 2, where some characterizations will be provided. Let us mention a simple example with detailed proof given in Section 6.

Example 2.

Let X = [0, 1] and let K be a Mercer kernel given by (4) K(x, y) = Σ_{j=0}^∞ a_j (x·y)^j, where a_j ≥ 0 for each j and Σ_{j=0}^∞ a_j < ∞. Set J := {j ∈ ℤ₊: a_j > 0}. Then, ℋ_K is dense in C(X) if and only if (5) a_0 > 0 and Σ_{j∈J∖{0}} 1/j = +∞.

When the density holds, we want to study the convergence rate of the approximation by functions from balls of the RKHS as the radius tends to infinity. The quantity (6) I(f, R) := inf_{‖g‖_K ≤ R} ‖f − g‖ is called the approximation error in learning theory. Some estimates have been presented by Smale and Zhou [6] for the L² norm and many kernels. The second purpose of this paper is to investigate the convergence rate of the approximation error with the uniform norm as well as the L² norm. Estimates will be given in Section 4, based on the analysis in Section 3 for interpolation schemes associated with general Mercer kernels. With this analysis, we can understand the approximation error with respect to the marginal probability distribution induced by ρ. Let us provide an example of Gaussian kernels to illustrate the idea. Notice that when the parameter σ of the kernel is allowed to change with R, the rate of the approximation error may be improved. This confirms the method of adaptively choosing the parameter of the kernel, which is used in many applications.

Example 3.

Let (7) K_σ(x, y) = exp{−|x − y|²/σ²}, x, y ∈ X = [0, 1]ⁿ. There exist positive constants A and B such that, for each f ∈ H^s(ℝⁿ) and R ≥ A‖f‖_{L²}, there holds (8) inf_{‖g‖_{K_σ} ≤ R} ‖f − g‖_{L²(X)} ≤ B (log R)^{−s/4} when σ is fixed; while when σ may change with R, there holds (9) inf_{‖g‖_{K_{σ_R}} ≤ R} ‖f − g‖_{L²(X)} ≤ B (log R)^{−s}.

2. Density and Positive Definiteness

The density problem of reproducing kernel Hilbert spaces in C(X) was raised to the author by Poggio et al. It can be stated as follows.

Given a Mercer kernel K on a compact metric space (X, d(·,·)), when is the RKHS ℋ_K dense in C(X)?

By means of the dual space of C(X), we can give a general characterization. This is only a simple observation, but it does provide useful information. For example, we will show that the density always holds for convolution type kernels. Also, for dot product type kernels, we can give a complete characterization of the density, which will be given in Section 6.

Recall the Riesz Representation Theorem, asserting that the dual space of C(X) can be represented by the set of Borel measures on X. For a Borel measure μ on X, we define the integral operator L_{K,μ} associated with the kernel as (10) L_{K,μ}(f)(x) := ∫_X K(x, y) f(y) dμ(y), x ∈ X. This is a compact operator on L²_μ(X) if μ is a positive measure.

Theorem 4.

Let K be a Mercer kernel on a compact metric space (X, d). Then, the following statements are equivalent.

(1) ℋ_K is dense in C(X).

(2) For any nontrivial positive Borel measure μ, ℋ_K is dense in L²_μ(X).

(3) For any nontrivial positive Borel measure μ, L_{K,μ} has no eigenvalue zero in L²_μ(X).

(4) For any nontrivial Borel measure μ, as a function in C(X), (11) ∫_X K(·, y) dμ(y) ≠ 0.

Proof.

(1) ⇒ (2). This follows from the fact that C(X) is dense in L²_μ(X).

(2) ⇒ (3). Suppose that ℋ_K is dense in L²_μ(X), but L_{K,μ} has an eigenvalue zero in L²_μ(X). Then, there exists a nontrivial function f ∈ L²_μ(X) such that L_{K,μ}(f) = 0; that is, (12) L_{K,μ}(f)(x) = ∫_X K(x, y) f(y) dμ(y) = ∫_X K_x(y) f(y) dμ(y) = 0. The identity holds as functions in L²_μ(X). If the support of μ is X, then this identity would imply that f is orthogonal to each K_x with x ∈ X. When the support of μ is not X, things are more complicated. Here, the support of μ, denoted supp μ, is defined to be the smallest closed subset F of X satisfying μ(X∖F) = 0.

The property of the RKHS enables us to prove the general case. As the function L_{K,μ}(f) is continuous, we know from (12) that, for each x in supp μ, (13) ∫_{supp μ} K_x(y) f(y) dμ(y) = ∫_X K_x(y) f(y) dμ(y) = 0. This means that, for each x in supp μ, f ⊥ K_x in L²_μ(supp μ), where μ has been restricted onto supp μ. When we restrict K onto supp μ × supp μ, the new kernel K̃ is again a Mercer kernel. Moreover, ℋ_{K̃} = ℋ_K|_{supp μ}. It follows that span{K_x|_{supp μ}: x ∈ supp μ} is dense in ℋ_{K̃} = ℋ_K|_{supp μ}. The latter is dense in L²_μ(supp μ) by statement (2) applied to the restriction of μ. Therefore, f is orthogonal to a dense subset of L²_μ(supp μ); hence, as a function in L²_μ(X), f is zero. This is a contradiction.

(3) ⇒ (4). Every nontrivial Borel measure μ can be uniquely decomposed as the difference of two mutually singular positive Borel measures, μ = μ⁺ − μ⁻; that is, there exists a Borel set A ⊆ X such that μ⁺(A) = μ⁺(X) and μ⁻(A) = 0. With this decomposition, (14) ∫_X K(x, y) dμ(y) = ∫_X K(x, y){χ_A(y) − χ_{X∖A}(y)} d|μ|(y) = L_{K,|μ|}(χ_A − χ_{X∖A})(x). Here, χ_A is the characteristic function of the set A, and |μ| = μ⁺ + μ⁻ is the absolute value of μ. As |μ| is a nontrivial positive Borel measure and χ_A − χ_{X∖A} is a nontrivial function in L²_{|μ|}(X), statement (3) implies that, as a function in L²_{|μ|}(X), ∫_X K(x, y) dμ(y) ≠ 0. Since this function lies in C(X), it is nonzero as a function in C(X).

The last implication, (4) ⇒ (1), follows directly from the Riesz Representation Theorem.

The proof of Theorem 4 also yields a characterization for the density of the RKHS in  Lμ2(X).

Corollary 5.

Let K be a Mercer kernel on a compact metric space (X, d) and μ a positive Borel measure on X. Then, ℋ_K is dense in L²_μ(X) if and only if L_{K,μ} has no eigenvalue zero in L²_μ(X).

The necessity has been verified in the proof of Theorem 4, while the sufficiency follows from the observation that an L²_μ(X) function f lying in the orthogonal complement of span{K_x: x ∈ X} gives an eigenfunction of L_{K,μ} with eigenvalue zero: (15) ⟨K_x, f⟩_{L²_μ(X)} = ∫_X K_x(y) f(y) dμ(y) = L_{K,μ}(f)(x) = 0.

Theorem 4 enables us to conclude that the density always holds for convolution type kernels K(x, y) = k(x − y) with k ∈ L²(ℝⁿ). The density for some convolution type kernels has been verified by Steinwart [10]. The author observed the density as a corollary of Theorem 4 when k̂(ξ) is strictly positive. Charlie Micchelli pointed out to the author that, for a convolution type kernel, the RKHS is always dense in C(X). So, the density problem is solved for these kernels.

Corollary 6.

Let K(x, y) = k(x − y) be a nontrivial convolution type Mercer kernel on ℝⁿ with k ∈ L²(ℝⁿ). Then, for any compact subset X of ℝⁿ, ℋ_K on X is dense in C(X).

Proof.

It is well known that K is a Mercer kernel if and only if k is continuous and k̂(ξ) ≥ 0 almost everywhere. We apply the equivalent statement (4) of Theorem 4 to prove our statement.

Let μ be a Borel measure on X such that (16) ∫_X K(x, y) dμ(y) = 0, x ∈ X. Then, the inverse Fourier transform yields (17) (2π)^{−n} ∫_X ∫_{ℝⁿ} k̂(ξ) e^{iξ·(x−y)} dξ dμ(y) = (2π)^{−n} ∫_{ℝⁿ} k̂(ξ) μ̂(ξ) e^{iξ·x} dξ = 0, x ∈ X. Here, μ̂(ξ) = ∫ e^{−iξ·y} dμ(y) is the Fourier transform of the Borel measure μ, which is an entire function.

Taking the integral on X with respect to the measure μ, we have (18) (2π)^{−n} ∫_{ℝⁿ} k̂(ξ) μ̂(ξ) ∫_X e^{iξ·x} dμ(x) dξ = (2π)^{−n} ∫_{ℝⁿ} k̂(ξ) |μ̂(ξ)|² dξ = 0. Since k̂ ≥ 0, the integrand vanishes almost everywhere. For a nontrivial Borel measure μ supported on X, μ̂(ξ) vanishes only on a set of measure zero; hence k̂(ξ) = 0 almost everywhere, which gives k = 0 and contradicts the nontriviality of K. Therefore, we must have μ = 0. This proves the density by Theorem 4.

After the first version of the paper was finished, I learned that Micchelli et al. [11] proved the density for a class of convolution type kernels k(x − y) with k being the Fourier transform of a finite Borel measure. Note that a large family of convolution type reproducing kernels is given by radial basis functions; see, for example, [12].

Now we can state a trivial fact that the positive definiteness is a necessary condition for the density.

Corollary 7.

Let K be a Mercer kernel on a compact metric space (X, d). If ℋ_K is dense in C(X), then K is positive definite.

Proof.

Suppose to the contrary that ℋ_K is dense in C(X), but there exists a finite set of distinct points {x_i}_{i=1}^ℓ ⊂ X such that the matrix A_x := (K(x_i, x_j))_{i,j=1}^ℓ is not positive definite. By the Mercer kernel property, A_x is positive semidefinite. So it is singular, and we can find a nonzero vector c := (c_1, …, c_ℓ)^T satisfying A_x c = 0. It follows that c^T A_x c = 0; that is, (19) ‖Σ_{i=1}^ℓ c_i K_{x_i}‖_K² = ⟨Σ_{i=1}^ℓ c_i K_{x_i}, Σ_{j=1}^ℓ c_j K_{x_j}⟩_K = Σ_{i,j=1}^ℓ c_i c_j K(x_i, x_j) = 0. Thus, (20) Σ_{i=1}^ℓ c_i K_{x_i} = 0.

Now, we define a nontrivial Borel measure μ supported on {x_1, …, x_ℓ} as (21) μ({x_i}) = c_i, i = 1, …, ℓ. Then, for x ∈ X, (22) ∫_X K(x, y) dμ(y) = Σ_{i=1}^ℓ K(x, x_i) c_i = Σ_{i=1}^ℓ c_i K_{x_i}(x) = 0. This contradicts the equivalent statement (4) of the density in Theorem 4.

Because of the necessity given in Corollary 7, one would expect that positive definiteness is also sufficient for the density. Steve Smale convinced the author that this is not the case in general. This motivates us to present a constructive example of a C^∞ kernel. Denote by ‖g‖_{W^m} := Σ_{j=0}^m ‖g^{(j)}‖_{L^∞} the norm in the Sobolev space W^m.

Example 8.

Let X = [0, 1]. For every m ∈ ℕ and every i ∈ {0, 1, …, m}, choose a real-valued C^∞ function ψ_{i,m}(x) on [0, 1] such that (23) ψ_{i,m}(x) = x^i for x ∈ [0, 1]∖(1/(m+1), 1/m), and ∫_0^1 ψ_{i,m}(x) dx = 0. Define K on [0, 1] × [0, 1] by (24) K(x, y) = Σ_{m=1}^∞ 2^{−m} (Σ_{i=0}^m ψ_{i,m}(x) ψ_{i,m}(y)) / (Σ_{i=0}^m ‖ψ_{i,m}‖²_{W^m}), x, y ∈ [0, 1]. Then, K is a C^∞ Mercer kernel on [0, 1]. It is positive definite, but the constant function 1 is not in the closure of ℋ_K in C(X). Hence, ℋ_K is not dense in C(X).

Proof.

The series in (24) converges in W^m for every m. Hence, K is C^∞ and is a Mercer kernel on [0, 1].

To prove the positive definiteness, we let {x_i}_{i=1}^ℓ ⊂ [0, 1] be a finite set of distinct points and (c_i)_{i=1}^ℓ a nonzero vector. Choose m ≥ ℓ − 1 such that (25) 1/m < min{x_j: x_j > 0, j ∈ {1, …, ℓ}}. Then, for each j ∈ {1, …, ℓ}, either x_j = 0 or x_j > 1/m. Hence, (26) x_j ∉ (1/(m+1), 1/m), j = 1, …, ℓ. By the construction of ψ_{i,m}, there holds (27) ψ_{i,m}(x_j) = x_j^i, i = 0, 1, …, m, j = 1, …, ℓ. Then, (28) Σ_{i,j=1}^ℓ c_i c_j K(x_i, x_j) ≥ (2^{−m} / Σ_{i=0}^m ‖ψ_{i,m}‖²_{W^m}) Σ_{i=0}^m [Σ_{j=1}^ℓ c_j ψ_{i,m}(x_j)]² ≥ (2^{−m} / Σ_{i=0}^m ‖ψ_{i,m}‖²_{W^m}) Σ_{i=0}^{ℓ−1} [Σ_{j=1}^ℓ c_j x_j^i]². Now, the determinant of the matrix (x_j^i)_{i=0,1,…,ℓ−1; j=1,…,ℓ} is a Vandermonde determinant and is nonzero. Since (c_j)_{j=1}^ℓ is a nonzero vector, we know that Σ_{j=1}^ℓ c_j x_j^i ≠ 0 for some i ∈ {0, 1, …, ℓ−1}. It follows that Σ_{i,j=1}^ℓ c_i c_j K(x_i, x_j) > 0. Thus, K is positive definite.

We now prove that 1, the constant function taking the value 1 everywhere, is not in the closure of ℋ_K in C(X). In fact, the uniformly convergent series (24) and the vanishing integral property of ψ_{i,m} imply that (29) ∫_0^1 K(x, y) dy = ∫_0^1 K_x(y) dy = 0, x ∈ X. Since span{K_x: x ∈ X} is dense in ℋ_K and ℋ_K is embedded in C(X), we know that (30) ∫_0^1 f(y) dy = 0 for every f ∈ ℋ_K. If 1 could be uniformly approximated by a sequence {f_m} in ℋ_K, then (31) 1 = ∫_0^1 1 dy = lim_{m→∞} ∫_0^1 f_m(y) dy = 0, which would be a contradiction. Therefore, ℋ_K is not dense in C(X).

Combining the previous discussion, we know that positive definiteness is a necessary condition for the density of the RKHS in C(X) but is not a sufficient one.

3. Interpolation Schemes for Reproducing Kernel Spaces

The study of approximation by reproducing kernel Hilbert spaces has a long history; see, for example, [13, 14]. Here, we want to investigate the rate of approximation as the RKHS norm of the approximant becomes large.

In the following sections, we consider the approximation error for the purpose of learning theory. The basic tool for constructing approximants is a set of nodal functions used in [6, 15, 16].

Definition 9.

We say that {u_i(x) := u_{i,x}(x)}_{i=1}^ℓ is the set of nodal functions associated with the nodes x := {x_1, …, x_ℓ} ⊂ X if u_i ∈ span{K_{x_j}}_{j=1}^ℓ and (32) u_i(x_j) = δ_{ij} = 1 if j = i, and 0 otherwise.

The nodal functions have some nice minimization properties; see [6, 16].

The nodal functions {u_i}_{i=1}^ℓ associated with x exist if and only if the Gramian matrix A_x := (K(x_i, x_j))_{i,j=1}^ℓ is nonsingular. In this case, the nodal functions are uniquely given by (33) u_i(x) = Σ_{j=1}^ℓ (A_x^{−1})_{i,j} K_{x_j}(x), i = 1, …, ℓ.

Remark 10.

When the RKHS has finite dimension m, then, for any ℓ ≤ m, we can find nodal functions {u_j}_{j=1}^ℓ associated with some subset x = {x_1, …, x_ℓ} ⊂ X, while for ℓ > m, no such nodal functions exist. When dim ℋ_K = ∞, then, for any ℓ, we can find a subset x = {x_1, …, x_ℓ} ⊂ X which possesses a set of nodal functions.

The nodal functions are used to construct an interpolation scheme: (34) I_x(f)(x) = Σ_{i=1}^ℓ f(x_i) u_{i,x}(x), x ∈ X, f ∈ C(X). It satisfies I_x(f)(x_i) = f(x_i) for i = 1, …, ℓ. Interpolation schemes have been applied to the approximation by radial basis functions in a vast literature.
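Formulas (33) and (34) translate directly into a small computation; the sketch below (Gaussian kernel and nodes chosen purely for illustration) builds the nodal functions from the inverse Gram matrix and checks u_i(x_j) = δ_{ij} and the interpolation property.

```python
import numpy as np

sigma = 0.4
K = lambda x, y: np.exp(-abs(x - y) ** 2 / sigma ** 2)

nodes = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
A = np.array([[K(s, t) for t in nodes] for s in nodes])  # Gram matrix A_x
A_inv = np.linalg.inv(A)

def u(i, x):
    """Nodal function u_i(x) = sum_j (A_x^{-1})_{ij} K_{x_j}(x), as in (33)."""
    return sum(A_inv[i, j] * K(nodes[j], x) for j in range(len(nodes)))

# u_i(x_j) = delta_ij
for i in range(len(nodes)):
    for j in range(len(nodes)):
        assert abs(u(i, nodes[j]) - (1.0 if i == j else 0.0)) < 1e-8

# Interpolation scheme (34): I_x(f)(x) = sum_i f(x_i) u_i(x)
f = lambda x: np.sin(2 * np.pi * x)
I = lambda x: sum(f(nodes[i]) * u(i, x) for i in range(len(nodes)))
for xj in nodes:
    assert abs(I(xj) - f(xj)) < 1e-8
```

Note that for very flat kernels or very dense nodes the Gram matrix becomes ill conditioned, so the direct inverse above is only suitable for small illustrative node sets.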

The error I_x(f) − f for f ∈ ℋ_K will be estimated by means of a power function.

Definition 11.

Let K be a Mercer kernel on a compact metric space (X, d) and x = {x_1, …, x_ℓ} ⊂ X. The power function ε_K is defined on x as (35) ε_K(x) := sup_{x∈X} inf_{w∈ℝ^ℓ} {K(x, x) − 2 Σ_{i=1}^ℓ w_i K(x, x_i) + Σ_{i,j=1}^ℓ w_i K(x_i, x_j) w_j}^{1/2}.

We know that ε_K(x) → 0 when d_x := max_{x∈X} min_{i=1,…,ℓ} d(x, x_i) → 0. If K is Lipschitz-s on X, (36) |K(x, y) − K(x, t)| ≤ C (d(y, t))^s, then (37) ε_K(x) ≤ {2C d_x^s}^{1/2}. Moreover, higher order regularity of K implies faster convergence of ε_K(x); for details, see [16].
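For a fixed x, the inner minimization in (35) is a quadratic problem whose minimizer is w = A_x^{−1}(K(x, x_i))_{i=1}^ℓ, so the minimum value is K(x, x) − k_x^T A_x^{−1} k_x. The sketch below (illustrative Gaussian kernel on [0, 1]) evaluates the power function on a grid and confirms that it decreases as the nodes become denser.

```python
import numpy as np

sigma = 0.3
K = lambda x, y: np.exp(-abs(x - y) ** 2 / sigma ** 2)

def power_function_sup(nodes, grid):
    """sup over the grid of the inner minimum in (35); for fixed x the
    minimizing w is A_x^{-1} (K(x, x_i))_i, giving K(x,x) - k_x^T A_x^{-1} k_x."""
    A = np.array([[K(s, t) for t in nodes] for s in nodes])
    A_inv = np.linalg.inv(A)
    worst = 0.0
    for x in grid:
        kx = np.array([K(x, xi) for xi in nodes])
        val = K(x, x) - kx @ A_inv @ kx
        worst = max(worst, max(val, 0.0))  # clip tiny negative rounding errors
    return np.sqrt(worst)

grid = np.linspace(0, 1, 201)
eps_coarse = power_function_sup(np.linspace(0, 1, 5), grid)
eps_fine = power_function_sup(np.linspace(0, 1, 9), grid)
assert eps_fine < eps_coarse  # denser nodes => smaller power function
```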

The error of the interpolation scheme for functions from RKHS can be estimated as follows.

Theorem 12.

Let K be a Mercer kernel and A_x nonsingular for a finite set x = {x_1, …, x_ℓ} ⊂ X. Define the interpolation scheme associated with x as in (34). Then, for f ∈ ℋ_K, there hold (38) ‖I_x(f) − f‖_{C(X)} ≤ ‖f‖_K ε_K(x), (39) ‖I_x(f)‖_K ≤ ‖f‖_K.

Proof.

Let x ∈ X. We apply the reproducing property (3) of the function f in (40) I_x(f)(x) − f(x) = Σ_{i=1}^ℓ f(x_i) u_i(x) − f(x). Then, (41) I_x(f)(x) − f(x) = Σ_{i=1}^ℓ u_i(x) ⟨K_{x_i}, f⟩_K − ⟨K_x, f⟩_K = ⟨Σ_{i=1}^ℓ u_i(x) K_{x_i} − K_x, f⟩_K. By the Schwarz inequality in ℋ_K, (42) |I_x(f)(x) − f(x)| ≤ ‖Σ_{i=1}^ℓ u_i(x) K_{x_i} − K_x‖_K ‖f‖_K. As ⟨K_s, K_t⟩_K = K(s, t), we have (43) ‖Σ_{i=1}^ℓ u_i(x) K_{x_i} − K_x‖_K² = K(x, x) − 2 Σ_{i=1}^ℓ u_i(x) K(x, x_i) + Σ_{i,j=1}^ℓ u_i(x) K(x_i, x_j) u_j(x). However, the quadratic function (44) Q((w_i)_{i=1}^ℓ) := K(x, x) − 2 Σ_{i=1}^ℓ w_i K(x, x_i) + Σ_{i,j=1}^ℓ w_i K(x_i, x_j) w_j over ℝ^ℓ takes its minimum value at (u_i(x))_{i=1}^ℓ. Therefore, (45) ‖Σ_{i=1}^ℓ u_i(x) K_{x_i} − K_x‖_K ≤ ε_K(x). It follows that (46) |I_x(f)(x) − f(x)| ≤ ‖f‖_K ε_K(x). This proves (38).

As I_x(f) ∈ ℋ_K and I_x(f)(x_i) = f(x_i) for i = 1, …, ℓ, we know that (47) I_x(f)(x_i) − f(x_i) = ⟨K_{x_i}, I_x(f) − f⟩_K = 0, i = 1, …, ℓ. This means that I_x(f) − f is orthogonal to span{K_{x_i}}_{i=1}^ℓ. Hence, I_x(f) is the orthogonal projection of f onto span{K_{x_i}}_{i=1}^ℓ. Thus, ‖I_x(f)‖_K ≤ ‖f‖_K.
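The projection argument gives ‖I_x(f)‖_K² = (f|_x)^T A_x^{−1} f|_x ≤ ‖f‖_K². This can be observed numerically for f in the span of a few kernel sections (an illustrative setting; the centers, nodes, and coefficients below are arbitrary):

```python
import numpy as np

sigma = 0.5
K = lambda x, y: np.exp(-abs(x - y) ** 2 / sigma ** 2)
gram = lambda pts: np.array([[K(s, t) for t in pts] for s in pts])

nodes = np.array([0.0, 0.2, 0.5, 0.8, 1.0])  # interpolation nodes x
z = np.array([0.1, 0.45, 0.9])               # centers defining f
c = np.array([1.5, -1.0, 2.0])
f = lambda x: sum(cj * K(zj, x) for cj, zj in zip(c, z))

norm_f_sq = c @ gram(z) @ c                         # ||f||_K^2 via (2)
A = gram(nodes)
fvals = np.array([f(xi) for xi in nodes])
norm_interp_sq = fvals @ np.linalg.solve(A, fvals)  # ||I_x(f)||_K^2 = f|_x^T A^{-1} f|_x

# I_x(f) is the orthogonal projection of f onto span{K_{x_i}}, so its
# RKHS norm cannot exceed ||f||_K (inequality (39)).
assert norm_interp_sq <= norm_f_sq + 1e-10
```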

The regularity of the kernel in connection with Theorem 12 yields the rate of convergence of the interpolation scheme. As an example, from the estimate for  εK(x)  given in [16, Proposition 2], we have the following.

Corollary 13.

Let X = [0, 1], s ∈ ℕ, and let K(x, y) be a C^s Mercer kernel such that A_x is nonsingular for x = {j/N}_{j=0}^{N−1}. Then, for f ∈ ℋ_K, there holds (48) ‖I_x(f) − f‖_{C(X)} ≤ C_{s,K} ‖f‖_K N^{−s}, where C_{s,K} is an explicit constant, depending on s and on ‖∂^s K/∂y^s‖, given in [16, Proposition 2].

For convolution type kernels, the power function can be estimated in terms of the Fourier transform of the kernel function. This is of particular interest when the kernel function is analytic. Let us provide the details.

Assume that k is a symmetric function in L²(ℝⁿ) and k̂(ξ) > 0 almost everywhere on ℝⁿ. Consider the Mercer kernel (49) K(x, y) = k(x − y), x, y ∈ [0, 1]ⁿ. For N ∈ ℕ, we define the following function to measure the regularity: (50) λ_k(N) := n (1 + 1/2^N)^{n−1} max_{1≤j≤n} {(2π)^{−n} ∫_{ξ∈[−N/2, N/2]^n} k̂(ξ) (|ξ_j|/N)^N dξ} + (1 + (N 2^N)^n)² (2π)^{−n} ∫_{ξ∉[−N/2, N/2]^n} k̂(ξ) dξ.

Remark 14.

This function involves two parts. The first part integrates over ξ ∈ [−N/2, N/2]^n, where (|ξ_j|/N)^N ≤ 2^{−N}; hence, it decays exponentially fast as N becomes large. The second part integrates over ξ ∉ [−N/2, N/2]^n, where |ξ| is large. Then, the decay of k̂ (which is equivalent to the regularity of k) yields the fast decay of the second part.

The power function ε_K(x) can be bounded by λ_k(N) on the regular nodes: (51) x := {α/N: α ∈ {0, 1, …, N−1}^n}.

Proposition 15.

For the convolution type kernel (49) and x given by (51), one has (52) ε_K(x) ≤ λ_k(N). In particular, if (53) k̂(ξ) ≤ C_0 e^{−λ|ξ|}, ξ ∈ ℝⁿ, for some constants C_0 > 0 and λ > 4 + 2n ln 4, then there holds (54) ε_K(x) ≤ λ_k(N) ≤ 4 C_0 (max{1/e^λ, 4ⁿ/e^{λ/2}})^{N/2}.

Proof.

Choose {w_α := w_{α,x}(x)}_{α} as the Lagrange interpolation polynomials on the nodes x_α := α/N; this is a vector in ℝ^{N^n} for each x ∈ X. Then, ε_K(x) is bounded by sup_{x∈X} Q_N(x), where (55) Q_N(x) := k(0) − 2 Σ_α w_{α,x}(x) k(x − x_α) + Σ_{α,β} w_{α,x}(x) k(x_α − x_β) w_{β,x}(x). In the proof of Theorem 2 in [15], we bounded this quantity by λ_k(N) for each x ∈ [0, 1]ⁿ. Therefore, ε_K(x) ≤ λ_k(N).

The estimate for λ_k(N) in the second part was verified in the proof of Theorem 3 in [15].

For the Gaussian kernels (56) K(x, y) = exp{−|x − y|²/σ²}, x, y ∈ [0, 1]ⁿ, it was proved in [16, Example 4] that, for N ≥ 80n log 2/σ², there holds (57) ε_K(x) ≤ λ_k(N) ≤ 2e (1/(16n))^{N/2} + (4/(σ√π)) 2^{−nN}.

4. Approximation Error in Learning Theory

Now, we can estimate the approximation error in learning theory by means of the interpolation scheme (34).

Consider the convolution type kernel (49) on X = [0, 1]ⁿ. We denote (58) Λ_k(r) := {inf_{ξ∈[−rπ, rπ]^n} k̂(ξ)}^{−1/2}, r > 0. The approximation error (6) can be realized as follows.

Theorem 16.

Let k ∈ L²(ℝⁿ) be a symmetric function with k̂(ξ) > 0, and let the kernel on X = [0, 1]ⁿ be K(x, y) = k(x − y). For f ∈ L²(ℝⁿ) and M ∈ ℕ, we define f_M ∈ L²(ℝⁿ) by (59) f̂_M(ξ) = f̂(ξ) if ξ ∈ [−Mπ, Mπ]^n, and f̂_M(ξ) = 0 otherwise. Then, for N ∈ ℕ with M ≤ N and x = {0, 1/N, …, (N−1)/N}^n, one has:

(i) ‖I_x(f_M)‖_K ≤ ‖f‖_{L²} Λ_k(N);

(ii) ‖f_M − I_x(f_M)‖_{C(X)} ≤ ‖f‖_{L²} Λ_k(M) ε_K(x) ≤ ‖f‖_{L²} Λ_k(M) λ_k(N);

(iii) ‖f − f_M‖²_{L²(X)} ≤ (2π)^{−n} ∫_{ξ∉[−Mπ, Mπ]^n} |f̂(ξ)|² dξ → 0 (as M → ∞).

Proof.

(i) For i, j ∈ X_N := {0, 1, …, N−1}^n and x_i = i/N, expression (33) gives (60) ⟨u_i, u_j⟩_K = Σ_{s,t∈X_N} (A_x^{−1})_{is} (A_x^{−1})_{jt} ⟨K_{x_s}, K_{x_t}⟩_K = Σ_{s,t∈X_N} (A_x^{−1})_{is} (A_x^{−1})_{jt} (A_x)_{ts} = (A_x^{−1})_{ij}. Then, for g ∈ C(X), we have (61) ‖I_x(g)‖_K² = ‖Σ_{i∈X_N} g(x_i) u_i‖_K² = Σ_{i,j∈X_N} g(x_i) g(x_j) ⟨u_i, u_j⟩_K = (g|_x)^T A_x^{−1} g|_x, where g|_x is the vector (g(x_i))_{i∈X_N} ∈ ℝ^{N^n}. It follows that (62) ‖I_x(g)‖_K² = ⟨g|_x, A_x^{−1} g|_x⟩ ≤ ‖A_x^{−1}‖_2 ‖g|_x‖_2² = ‖A_x^{−1}‖_2 Σ_{i∈X_N} |g(x_i)|², where ‖A_x^{−1}‖_2 denotes the (operator) norm of the matrix A_x^{−1} on (ℝ^{N^n}, ‖·‖_2).

We apply the previous analysis to the function f_M, which satisfies (63) Σ_{j∈X_N} |f_M(x_j)|² = Σ_{j∈X_N} |(2π)^{−n} ∫_{[−Nπ, Nπ]^n} f̂_M(ξ) e^{i(j/N)·ξ} dξ|² = Σ_{j∈X_N} |(2π)^{−n} ∫_{[−π, π]^n} f̂_M(Nξ) N^n e^{ij·ξ} dξ|² ≤ (2π)^{−n} ∫_{[−π, π]^n} |f̂_M(Nξ) N^n|² dξ = N^n ‖f_M‖²_{L²} ≤ N^n ‖f‖²_{L²}. Then, (64) ‖I_x(f_M)‖_K² ≤ ‖A_x^{−1}‖_2 N^n ‖f‖²_{L²}.

Now, we need to estimate the norm ‖A_x^{−1}‖_2. For convolution type kernels, such an estimate was given in [15, Theorem 2] by means of methods from the radial basis function literature, for example, [17, 21–24]. We have (65) ‖A_x^{−1}‖_2 ≤ N^{−n} (Λ_k(N))². Therefore, (66) ‖I_x(f_M)‖_K ≤ ‖f‖_{L²} Λ_k(N). This proves the statement in (i).

(ii) Let x ∈ X. Then (67) f_M(x) − I_x(f_M)(x) = (2π)^{−n} ∫_{[−Mπ, Mπ]^n} f̂(ξ) {e^{ix·ξ} − Σ_{j∈X_N} u_{j,x}(x) e^{i x_j·ξ}} dξ. By the Schwarz inequality, (68) |f_M(x) − I_x(f_M)(x)| ≤ {(2π)^{−n} ∫_{[−Mπ, Mπ]^n} (|f̂(ξ)|²/k̂(ξ)) dξ}^{1/2} × {(2π)^{−n} ∫_{ℝⁿ} k̂(ξ) |e^{ix·ξ} − Σ_{j∈X_N} u_{j,x}(x) e^{i x_j·ξ}|² dξ}^{1/2}. The first factor is bounded by ‖f‖_{L²} Λ_k(M). The second factor equals (69) {k(0) − 2 Σ_{j∈X_N} u_{j,x}(x) k(x − x_j) + Σ_{i,j∈X_N} u_{i,x}(x) k(x_i − x_j) u_{j,x}(x)}^{1/2}, which can be bounded by ε_K(x), as shown in the proof of Theorem 12. Therefore, by (52), (70) ‖f_M − I_x(f_M)‖_{C(X)} ≤ ‖f‖_{L²} Λ_k(M) ε_K(x) ≤ ‖f‖_{L²} Λ_k(M) λ_k(N).

(iii) By the Plancherel formula, (71) ‖f − f_M‖²_{L²(ℝⁿ)} = (2π)^{−n} ∫_{ξ∉[−Mπ, Mπ]^n} |f̂(ξ)|² dξ. This proves all the statements in Theorem 16.

Theorem 16 provides quantitative estimates for the approximation error: (72) ‖f − I_x(f_M)‖_{L²(X)} ≤ {(2π)^{−n} ∫_{ξ∉[−Mπ, Mπ]^n} |f̂(ξ)|² dξ}^{1/2} + ‖f‖_{L²} Λ_k(M) λ_k(N) with (73) ‖I_x(f_M)‖_K ≤ ‖f‖_{L²} Λ_k(N). Choose N = N(M) ≥ M such that Λ_k(M) λ_k(N) → 0 as M → +∞; then ‖f − I_x(f_M)‖_{L²(X)} → 0, and the RKHS norm of I_x(f_M) is controlled by the asymptotic behavior of Λ_k(N).

Denote by Λ_k^{−1} the inverse function of Λ_k: (74) Λ_k^{−1}(R) := max{r > 0: Λ_k(r) ≤ R} = max{r > 0: k̂(ξ) ≥ R^{−2} for all ξ ∈ [−rπ, rπ]^n}. Then, our estimate for the approximation error can be given as follows.

Corollary 17.

Let X = [0, 1]ⁿ and f ∈ H^s(ℝⁿ). Then, for R > ‖f‖_{L²}, (75) inf_{‖g‖_K ≤ R} ‖f − g‖_{L²(X)} ≤ inf_{0 < M ≤ N_R} {‖f‖_{H^s} (Mπ)^{−s} + ‖f‖_{L²} Λ_k(M) λ_k(N_R)}, where N_R := [Λ_k^{−1}(R/‖f‖_{L²})]. If s > n/2, then (76) inf_{‖g‖_K ≤ R} ‖f − g‖_{C(X)} ≤ inf_{0 < M ≤ N_R} {c_{n,s} ‖f‖_{H^s} M^{n/2 − s} + ‖f‖_{L²} Λ_k(M) λ_k(N_R)}, where the constant c_{n,s} depends only on n and s − n/2. In particular, when (77) C_1 exp{−λ_1 |ξ|^d} ≤ k̂(ξ) ≤ C_0 exp{−λ|ξ|}, ξ ∈ ℝⁿ, for some C_0, C_1, d, λ_1 > 0 and λ > 4 + 2n log 4, one has (78) I(f, R) = inf_{‖g‖_K ≤ R} ‖f − g‖_{L²(X)} ≤ (2^{d(d+1)} n^{d/2} π^d λ_1)^{s/(d(d+1))} (π^{−s} ‖f‖_{H^s} + 2^{s+2} (C_0/√C_1) ‖f‖_{L²}) {log R + (1/2) log C_1 − log ‖f‖_{L²}}^{−s/(d(d+1))}, provided that, with the function G(r) := (1/(√n π)) ((2/λ_1) log(r√C_1/‖f‖_{L²}))^{1/d}, R satisfies (79) G(R) ≥ (16λ_1 n^{d/2} π^d/(−log max{1/e^λ, 4ⁿ/e^{λ/2}}))^{d+1} and log G(R)/G(R) ≤ (−log max{1/e^λ, 4ⁿ/e^{λ/2}}) (d+1)/(2^{d+4} s).

Proof.

The first part is a direct consequence of Theorem 16 when we choose N to be NR, the integer part of Λk-1(R/fL2).

To see the second part, we note that (77) in connection with Proposition 15 implies, with Λ := max{1/e^λ, 4ⁿ/e^{λ/2}}, (80) λ_k(N) ≤ 4C_0 exp{(N log Λ)/2}, Λ_k(r) ≤ {C_1 exp{−λ_1 (√n rπ)^d}}^{−1/2} = (1/√C_1) exp{(λ_1/2) (√n π)^d r^d}. Then, Λ_k^{−1}(R/‖f‖_{L²}) ≥ G(R).

For R ≥ (‖f‖_{L²}/√C_1) exp{(λ_1/2)(√n π)^d}, we can choose M such that (81) (1/2){G(R)}^{1/(d+1)} ≤ M ≤ {G(R)}^{1/(d+1)}. Choose N such that (82) M^{d+1} ≤ 2N ≤ 2M^{d+1}. Then, M ≤ N ≤ M^{d+1} ≤ G(R), and by Theorem 16, since M^{d(d+1)} ≤ G(R)^d, (83) ‖I_x(f_M)‖_K ≤ ‖f‖_{L²} Λ_k(N) ≤ (‖f‖_{L²}/√C_1) exp{(λ_1/2)(√n π)^d M^{d(d+1)}} ≤ R, and Λ_k(M) λ_k(N) ≤ (4C_0/√C_1) exp{(λ_1/2)(√n π)^d M^d + (N log Λ)/2} ≤ (4C_0/√C_1) exp{N((log Λ)/2 + λ_1 (√n π)^d N^{−1/(d+1)})}. When (84) N^{1/(d+1)} ≥ 4λ_1 (√n π)^d/(−log Λ) and log N/N ≤ (−log Λ)(d+1)/(4s), there holds (85) Λ_k(M) λ_k(N) ≤ (4C_0/√C_1) exp{(N log Λ)/4} ≤ (4C_0/√C_1) N^{−s/(d+1)}. Hence, (86) ‖f − I_x(f_M)‖_{L²(X)} ≤ ‖f‖_{H^s} (Mπ)^{−s} + (4C_0/√C_1) ‖f‖_{L²} 2^s M^{−s} ≤ (π^{−s} ‖f‖_{H^s} + 2^{s+2} (C_0/√C_1) ‖f‖_{L²}) 2^s (G(R))^{−s/(d+1)}. When R satisfies (79), we know that (87) N^{1/(d+1)} ≥ M/2^{1/(d+1)} ≥ (1/4){G(R)}^{1/(d+1)} ≥ 4λ_1 (√n π)^d/(−log Λ) and log N/N ≤ 2 log(M^{d+1})/M^{d+1} ≤ 2^{d+2} log G(R)/G(R) ≤ (−log Λ)(d+1)/(4s). Hence, (84) holds true. This proves our statements.

For the Gaussian kernels, we have the following.

Proposition 18.

Let (88) K(x, y) = exp{−|x − y|²/σ²}, x, y ∈ X = [0, 1]ⁿ. Denote C_{σ,n} := σ²nπ²/(4 min{log(4√n), n log 2}) and C_{σ,n,s} := (σ√n π)^{s/2} (C_{σ,n}^{s/2} + (σ√π)^{−n/2} (2e + 4/(σ√π))). If f ∈ H^s(ℝⁿ), then one has (89) I(f, R) ≤ C_{σ,n,s} (‖f‖_{H^s} + ‖f‖_{L²}) {log R + (n/2) log(σ√π) − log ‖f‖_{L²}}^{−s/4}, and when s > n/2, (90) inf_{‖g‖_K ≤ R} ‖f − g‖_{C(X)} ≤ C_{σ,n,s−n/2} (‖f‖_{H^s} + ‖f‖_{L²}) {log R + (n/2) log(σ√π) − log ‖f‖_{L²}}^{n/8 − s/4}, for R satisfying (91) R > ‖f‖_{L²} (σ√π)^{−n/2} exp{σ²nπ² (max{C_{σ,n}, 80n log 2/σ²} + 1)²/8} and (92) (σ²nπ²/(32 C_{σ,n}²) · log((σ√π)^{n/2} R/‖f‖_{L²}))^{1/2} ≥ (s/2) (log(2√2/(σ√n π)) + (1/2) log log((σ√π)^{n/2} R/‖f‖_{L²})).

Proof.

The Fourier transform of k(x) = exp{−|x|²/σ²} is (93) k̂(ξ) = (σ√π)ⁿ exp{−σ²|ξ|²/4}. Then, (94) Λ_k(r) = (σ√π)^{−n/2} exp{σ²nr²π²/8}.
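Formulas (93) and (94) can be cross-checked numerically in dimension n = 1: the minimum of k̂ over [−rπ, rπ] sits at the endpoints, and the resulting Λ_k(r) matches the closed form (σ and r below are arbitrary illustrative values).

```python
import numpy as np

sigma, r = 0.7, 2.0
# (93) for n = 1: Fourier transform of exp(-x^2/sigma^2)
k_hat = lambda xi: sigma * np.sqrt(np.pi) * np.exp(-sigma ** 2 * xi ** 2 / 4)

# inf over [-r*pi, r*pi] is attained at the endpoints since k_hat decreases in |xi|
xi_grid = np.linspace(-r * np.pi, r * np.pi, 100001)
Lambda_numeric = np.min(k_hat(xi_grid)) ** (-0.5)

# (94) for n = 1: closed form for Lambda_k(r)
Lambda_closed = (sigma * np.sqrt(np.pi)) ** (-0.5) \
    * np.exp(sigma ** 2 * r ** 2 * np.pi ** 2 / 8)

assert abs(Lambda_numeric - Lambda_closed) / Lambda_closed < 1e-6
```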

For (95) R ≥ ‖f‖_{L²} Λ_k(max{C_{σ,n}, 80n log 2/σ²} + 1), we can take N ∈ ℕ with N ≥ max{C_{σ,n}, 80n log 2/σ²} such that (96) (1/2) Λ_k^{−1}(R/‖f‖_{L²}) ≤ N ≤ Λ_k^{−1}(R/‖f‖_{L²}). Here, Λ_k^{−1} is the inverse function of Λ_k: (97) Λ_k^{−1}(r) = (2√2/(σ√n π)) (log{(σ√π)^{n/2} r})^{1/2}. Then, ‖f‖_{L²} Λ_k(N) ≤ R. Let M ≤ N. By Theorem 16, ‖I_x(f_M)‖_K ≤ R.

By Corollary 17 and (57), (98) ‖f − I_x(f_M)‖_{L²(X)} ≤ ‖f‖_{H^s} (Mπ)^{−s} + ‖f‖_{L²} Λ_k(M) (2e + 4/(σ√π)) exp{−N C_0}, where C_0 := min{log(4√n), n log 2}. Choose M such that (99) (1/2) √(N/C_{σ,n}) ≤ M ≤ √(N/C_{σ,n}). With this choice, σ²nM²π²/8 ≤ C_0 N/2. Therefore, (100) ‖f − I_x(f_M)‖_{L²(X)} ≤ ‖f‖_{H^s} (4C_{σ,n}/(π²N))^{s/2} + ‖f‖_{L²} (σ√π)^{−n/2} (2e + 4/(σ√π)) exp{−(C_0/2) N} ≤ C_{σ,n,s} (‖f‖_{H^s} + ‖f‖_{L²}) max{{Λ_k^{−1}(R/‖f‖_{L²})}^{−s/2}, exp{−(C_0/4) Λ_k^{−1}(R/‖f‖_{L²})}}, where (101) C_{σ,n,s} = C_{σ,n}^{s/2} + (σ√π)^{−n/2} (2e + 4/(σ√π)).

When (102) (C_0/4) Λ_k^{−1}(R/‖f‖_{L²}) ≥ (s/2) log Λ_k^{−1}(R/‖f‖_{L²}), there holds (103) ‖f − I_x(f_M)‖_{L²(X)} ≤ C_{σ,n,s} (‖f‖_{H^s} + ‖f‖_{L²}) {Λ_k^{−1}(R/‖f‖_{L²})}^{−s/2}. This yields the first estimate.

When s>n/2, the same method gives the error with the uniform norm.

5. Learning with Varying Kernels

Proposition 18 in the last section shows that, for a fixed Gaussian kernel, the approximation error I(f, R) behaves as (104) I(f, R) ≤ C (log R)^{−s/4} for functions f in H^s.

In this section, we consider learning with varying kernels. Such a method is used in many applications where we have to choose suitable parameters for the reproducing kernel; for example, in [7], Gaussian kernels with different parameters in different directions are considered. Here, we study the case when the variance parameter is the same in all directions. Our analysis shows that the approximation error may be improved when the kernel changes with the RKHS norm R of the empirical target function.

Proposition 19.

Let (105) K_σ(x, y) = exp{−|x − y|²/σ²}, x, y ∈ X = [0, 1]ⁿ. There exist positive constants A_{n,s} and B_{n,s}, depending only on n and s, such that, for each f ∈ H^s(ℝⁿ) and R ≥ A_{n,s} ‖f‖_{L²}, one can find some σ = σ_R satisfying (106) inf_{‖g‖_{K_{σ_R}} ≤ R} ‖f − g‖_{L²(X)} ≤ B_{n,s} (log R)^{−s}.

Proof.

Take (107) σ = (80n log 2/N)^{1/2} and (N/(4nπ)) ((1/5) min{2 + log n/(2 log 2), n})^{1/2} ≤ M ≤ (N/(2nπ)) ((1/5) min{2 + log n/(2 log 2), n})^{1/2}, where N depends on R. Denote C_n := nπ²/(4 min{log(4√n), n log 2}) and C_0 := min{log(4√n), n log 2}. As in the proof of Proposition 18, we have (108) ‖f − I_x(f_M)‖_{L²(X)} ≤ ‖f‖_{H^s} (Nπ)^{−s} (320 n C_n log 2)^{s/2} + N^{n/2} ‖f‖_{L²} (80nπ log 2)^{−n/4} (2e + 4√N/√(80nπ log 2)) exp{−(C_0/2) N}.

When N is large enough, with a constant C_{n,s} depending on n and s, this yields (109) ‖f − I_x(f_M)‖_{L²(X)} ≤ C_{n,s} (‖f‖_{H^s} + ‖f‖_{L²}) N^{−s}.

Finally, we determine N by requiring (110) ‖f‖_{L²} Λ_k(N) = ‖f‖_{L²} N^{n/4} (80nπ log 2)^{−n/4} exp{10n²π²N log 2} ≤ R ≤ ‖f‖_{L²} Λ_k(N + 1). There is a constant A_{n,s} > 0 depending only on n and s such that, for R ≥ A_{n,s} ‖f‖_{L²}, an integer N satisfying all the previous requirements and (111) ‖f‖_{L²} (N + 1)^{n/4} (80nπ log 2)^{−n/4} exp{10n²π² (N + 1) log 2} ≥ R exists. This makes all the estimates valid. It follows that (112) R ≤ ‖f‖_{L²} Λ_k(N + 1) ≤ ‖f‖_{L²} (80nπ log 2)^{−n/4} exp{20n²π² (N + 1) log 2}. Hence, (113) N + 1 ≥ (1/(20n²π² log 2)) log{(R/‖f‖_{L²}) (80nπ log 2)^{n/4}}. Therefore, there holds ‖I_x(f_M)‖_{K_σ} ≤ R and (114) ‖f − I_x(f_M)‖_{L²(X)} ≤ 2^s C_{n,s} (‖f‖_{H^s} + ‖f‖_{L²}) (20n²π² log 2)^s {log R + (n/4) log(80nπ log 2) − log ‖f‖_{L²}}^{−s}. This verifies our claim for the approximation error in L²(X).

Let us mention the following problem concerning learning with Gaussian kernels with changing variances.

Problem 20.

What is the optimal rate of convergence of (115) sup_{‖f‖_{H^s}=1} inf{‖f − g‖_{L²(X)}: ‖g‖_{K_σ} ≤ R for some σ > 0} as R tends to infinity?

6. Dot Product Kernels

In this section, we illustrate our results by the family of dot product type kernels. These kernels take the form (116) K(x, y) = Σ_{j=0}^∞ a_j (x·y)^j, x, y ∈ ℝⁿ. When Σ_{j=0}^∞ |a_j| R^{2j} < ∞ for some R > 0, the kernel K is a Mercer kernel on X := {x ∈ ℝⁿ: |x| ≤ R} if and only if a_j ≥ 0 for each j ≥ 0. Here, we characterize the density for this family. Denote x^α := Π_{i=1}^n x_i^{α_i} and the multinomial coefficient (|α| α) := (α_1 + ⋯ + α_n)!/(α_1! ⋯ α_n!) for x = (x_1, …, x_n) ∈ ℝⁿ and α = (α_1, …, α_n) ∈ ℤ₊ⁿ.

Corollary 21.

Let R > 0, X := [0, R]ⁿ, and let the kernel K be given by (116), where a_j ≥ 0 for each j ∈ ℤ₊ and Σ_{j=0}^∞ a_j R^{2j} < ∞. Set J := {α ∈ ℤ₊ⁿ: a_{|α|} > 0}. Then, ℋ_K is dense in C(X) if and only if span{x^α: α ∈ J} is dense in C(X). Thus, the density depends only on the location of the positive coefficients in (116). In particular, when n = 1, ℋ_K is dense in C[0, R] if and only if (117) a_0 > 0 and Σ_{j∈J∖{0}} 1/j = +∞.

Proof.

Note that (118) K_x(y) = K(x, y) = a_0 + Σ_{j=1}^∞ a_j Σ_{|α|=j} (|α| α) x^α y^α = Σ_{α∈J} a_{|α|} (|α| α) x^α y^α.

Sufficiency. Suppose that span{x^α: α ∈ J} is dense in C(X), but ℋ_K is not dense in C(X). Then, by Theorem 4, there exists a nontrivial Borel measure μ on X such that (119) ∫_X K(x, y) dμ(y) = 0, x ∈ X. Integrating with respect to μ and using (118), we have (120) 0 = ∫_X ∫_X K(x, y) dμ(x) dμ(y) = Σ_{α∈J} a_{|α|} (|α| α) [∫_X x^α dμ(x)]². Since a_{|α|} > 0 for each α ∈ J, there holds (121) ∫_X x^α dμ(x) = 0, α ∈ J. That is, μ annihilates each x^α for α ∈ J. But span{x^α: α ∈ J} is dense in C(X); hence μ annihilates all functions in C(X), which is a contradiction.

Necessity. If span{x^α: α ∈ J} is not dense in C(X), then there exists a nontrivial Borel measure μ annihilating each x^α; that is, ∫_X x^α dμ(x) = 0 for each α ∈ J. Then, (118) tells us that, for each x ∈ X, (122) ∫_X K(x, y) dμ(y) = Σ_{α∈J} a_{|α|} (|α| α) x^α ∫_X y^α dμ(y) = 0. This, in connection with Theorem 4, implies that ℋ_K is not dense in C(X). This proves the first statement of Corollary 21.

The second statement follows from the classical Müntz Theorem in approximation theory: for a strictly increasing sequence of nonnegative numbers λ_0 < λ_1 < ⋯, span{x^{λ_j}: j ∈ ℤ₊} is dense in C[0, R] if and only if λ_0 = 0 and Σ_{j=1}^∞ 1/λ_j = +∞.

The conclusion in Example 2 follows directly from Corollary 21.

By Corollary 21, we can provide more examples of dot product positive definite kernels whose corresponding RKHS is not dense. The following is such an example. However, compared with Example 8, it is not constructive, in the sense that no function outside the closure of ℋ_K is explicitly given.

Example 22.

Let X = [0, 1] and define (123) K(x, y) = 1 + Σ_{k=1}^∞ 2^{−k} Σ_{m=0}^{k−1} (x·y)^{2^k + m}. Then, K is a positive definite Mercer kernel on X, but ℋ_K is not dense in C(X).

Proof.

Observe that the kernel K satisfies the assumptions of Corollary 21 with a_0 = 1 > 0 and (124) J∖{0} = ∪_{k=1}^∞ {2^k, 2^k + 1, …, 2^k + k − 1}. Since Σ_{j∈J∖{0}} 1/j = Σ_{k=1}^∞ Σ_{i=0}^{k−1} 1/(2^k + i) ≤ Σ_{k=1}^∞ k/2^k < +∞, Corollary 21 tells us that ℋ_K is not dense in C(X).
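The convergence of Σ_{j∈J∖{0}} 1/j here is elementary arithmetic and can be confirmed directly: each block {2^k, …, 2^k + k − 1} contributes at most k/2^k, and Σ_k k/2^k = 2.

```python
# Partial sums of 1/j over the exponent set of Example 22, dominated
# term-by-term by k/2^k (each block has k terms, each at most 1/2^k).
partial = sum(1.0 / (2 ** k + m) for k in range(1, 60) for m in range(k))
bound = sum(k / 2.0 ** k for k in range(1, 60))
assert partial <= bound
assert bound < 2.0000001  # sum_{k>=1} k/2^k = 2
```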

What is left is to show that the Mercer kernel K is positive definite. Suppose to the contrary that there exist a finite set of distinct points I ⊂ X and a nonzero vector c = (c_s)_{s∈I} such that (125) Σ_{s,t∈I} c_s c_t K(s, t) = 0. Denote (126) K̃(x, y) = Σ_{k=1}^∞ 2^{−k} Σ_{m=0}^{k−1} (x·y)^{2^k + m}. Then, (127) Σ_{s,t∈I} c_s c_t K(s, t) = (Σ_{s∈I} c_s)² + Σ_{s,t∈I∖{0}} c_s c_t K̃(s, t) = 0. Hence, Σ_{s∈I} c_s = 0, which implies that I∖{0} is nonempty and (c_s)_{s∈I∖{0}} is a nonzero vector. Also, (128) 0 = Σ_{s,t∈I∖{0}} c_s c_t K̃(s, t) = Σ_{k=1}^∞ 2^{−k} Σ_{m=0}^{k−1} (Σ_{s∈I∖{0}} s^{2^k + m} c_s)². It follows that (129) Σ_{s∈I∖{0}} s^{2^k + m} c_s = 0 for every k ∈ ℕ and m = 0, 1, …, k − 1. Set ℓ := #(I∖{0}), the number of elements in the set I∖{0}, and choose an integer k ≥ ℓ. Then, we know that the linear system (130) Σ_{s∈I∖{0}} s^{2^k + m} x_s = 0, m = 0, 1, …, ℓ − 1, has a nonzero solution (c_s)_{s∈I∖{0}}. Hence, the ℓ × ℓ matrix (s^{2^k + m})_{s∈I∖{0}, m=0,1,…,ℓ−1} is singular. So, there exists a nonzero vector (d_m)_{m=0}^{ℓ−1} such that (131) Σ_{m=0}^{ℓ−1} s^{2^k + m} d_m = 0, s ∈ I∖{0}. As each element s in I∖{0} is nonzero, dividing by s^{2^k} gives (132) Σ_{m=0}^{ℓ−1} s^m d_m = 0, s ∈ I∖{0}. However, the determinant of the matrix (s^m)_{s∈I∖{0}, m=0,1,…,ℓ−1} is a Vandermonde determinant and is nonzero, so the linear system with this coefficient matrix has only the zero solution. This is a contradiction. Therefore, the Mercer kernel K is positive definite.

An alternative simpler proof for the positive definiteness of the kernel in Example 22 can be given by means of the recent results in [25, 26].

After characterizing the density, we can then apply our analysis in Section 3 and provide some estimates for the convergence rate of the approximation error under the assumption that all the coefficients $a_j$ in (116) are strictly positive. We will not provide details here but only show the application of the interpolation scheme (34) to polynomials.

If $f(x)=\sum_{|\alpha|\le M}c_\alpha\binom{|\alpha|}{\alpha}x^{\alpha}$, then
(133) $I_{\mathbf{x}}(f)(x)-f(x)=\sum_{|\alpha|\le M}c_\alpha\binom{|\alpha|}{\alpha}\bigl\{\sum_{j}u_j(x)x_j^{\alpha}-x^{\alpha}\bigr\}$.
It follows from the Schwarz inequality that
(134) $|I_{\mathbf{x}}(f)(x)-f(x)|^{2}\le\bigl\{\sum_{|\alpha|\le M}|c_\alpha|^{2}\binom{|\alpha|}{\alpha}a_{|\alpha|}^{-1}\bigr\}\times\bigl\{\sum_{|\alpha|\le M}a_{|\alpha|}\binom{|\alpha|}{\alpha}\bigl|\sum_{j}u_j(x)x_j^{\alpha}-x^{\alpha}\bigr|^{2}\bigr\}$.
The first term can be bounded by
(135) $\bigl\{\sum_{|\alpha|\le M}\binom{|\alpha|}{\alpha}|c_\alpha|^{2}\bigr\}\bigl(\min_{j=0,1,\ldots,M}a_j\bigr)^{-1}$,
while the second is bounded by
(136) $\sum_{\alpha\in\mathbb{Z}_+^{n}}a_{|\alpha|}\binom{|\alpha|}{\alpha}\bigl\{\sum_{i,j}u_i(x)u_j(x)x_i^{\alpha}x_j^{\alpha}-2\sum_{j}u_j(x)x_j^{\alpha}x^{\alpha}+x^{\alpha}\cdot x^{\alpha}\bigr\}=\varepsilon_K(x)$.
Thus, the approximation error can be given in terms of the regularity of the kernel $K$. The regularity of the approximated function yields the rate of approximation of $f$ by polynomials $f_M$, while the asymptotic behavior of the coefficients $a_j$ in (116) provides the control of the RKHS norm of $I_{\mathbf{x}}(f_M)$.
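To make the quantity $\varepsilon_K(x)$ in (136) concrete, here is a numerical sketch for the dot-product kernel with $a_k=1/k!$, that is, $K(x,y)=e^{x\cdot y}$, so that all $a_j>0$. The summation over $\alpha$ in (136) collapses to kernel values: $\varepsilon_K(x)=\sum_{i,j}u_i(x)u_j(x)K(x_i,x_j)-2\sum_j u_j(x)K(x_j,x)+K(x,x)$. As an assumption standing in for the scheme (34), which is not reproduced in this section, the weights $u_j(x)$ below are the standard kernel-interpolation weights $u(x)=G^{-1}\kappa(x)$; the nodes are arbitrary illustrative choices.

```python
import numpy as np

# Sketch of eps_K(x) from (136) for the kernel K(x, y) = exp(x . y),
# i.e. the dot-product kernel with coefficients a_k = 1/k! (all positive).
# Assumption: u_j(x) are the standard kernel-interpolation weights
# u(x) = G^{-1} kappa(x), used here in place of the scheme (34).

nodes = np.array([[-0.8, 0.1], [0.3, -0.7], [0.5, 0.6],
                  [-0.2, -0.3], [0.9, 0.4]])             # nodes x_j in R^2

def K(x, y):
    return np.exp(np.dot(x, y))

G = np.array([[K(s, t) for t in nodes] for s in nodes])  # Gram matrix K(x_i, x_j)

def eps_K(x):
    """sum_{i,j} u_i u_j K(x_i, x_j) - 2 sum_j u_j K(x_j, x) + K(x, x)."""
    kappa = np.array([K(t, x) for t in nodes])
    u = np.linalg.solve(G, kappa)
    return u @ G @ u - 2.0 * (u @ kappa) + K(x, x)

# eps_K vanishes at the interpolation nodes and is nonnegative in between.
at_node = eps_K(nodes[0])
elsewhere = eps_K(np.array([0.25, -0.4]))
print(f"eps_K at a node:       {at_node:.3e}")
print(f"eps_K at a test point: {elsewhere:.3e}")
```

With this choice of weights, $\varepsilon_K$ is the squared power function of kernel interpolation: it is nonnegative, vanishes at the nodes, and measures how well the kernel sections at the nodes span the section at $x$.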

Acknowledgments

The author would like to thank Charlie Micchelli for proving Corollary 6 in the general form, Allan Pinkus for clarifying Example 2, Tommy Poggio for raising the density problem, Steve Smale for suggestions on positive definiteness and approximation error in learning theory, and Grace Wahba for knowledge on earlier work of approximation by reproducing kernel Hilbert spaces. The work described in this paper is partially supported by a Grant from the Research Grants Council of Hong Kong (Project no. CityU 104710).

References

1. V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, NY, USA, 1998.
2. F. Cucker and S. Smale, "On the mathematical foundations of learning," Bulletin of the American Mathematical Society, vol. 39, no. 1, pp. 1–49, 2002.
3. T. Evgeniou, M. Pontil, and T. Poggio, "Regularization networks and support vector machines," Advances in Computational Mathematics, vol. 13, no. 1, pp. 1–50, 2000.
4. G. Wahba, Spline Models for Observational Data, Society for Industrial and Applied Mathematics (SIAM), 1990.
5. N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society, vol. 68, pp. 337–404, 1950.
6. S. Smale and D.-X. Zhou, "Estimating the approximation error in learning theory," Analysis and Applications, vol. 1, no. 1, pp. 17–41, 2003.
7. O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, "Choosing multiple parameters for support vector machines," Machine Learning, vol. 46, pp. 131–159, 2002.
8. T. Poggio, S. Mukherjee, R. Rifkin, A. Rakhlin, and A. Verri, in Uncertainty in Geometric Computations, J. Winkler and M. Niranjan, Eds., pp. 131–141, Kluwer, 2002.
9. P. Malliavin, Integration and Probability, Springer, 1995.
10. I. Steinwart, "On the influence of the kernel on the consistency of support vector machines," Journal of Machine Learning Research, vol. 2, pp. 67–93, 2002.
11. C. A. Micchelli, Y. Xu, and P. Ye, "Cucker Smale learning theory in Besov spaces," in Advances in Learning Theory: Methods, Models and Applications, J. Suykens, G. Horvath, S. Basu, C. A. Micchelli, and J. Vandewalle, Eds., pp. 47–68, IOS Press, Amsterdam, The Netherlands, 2003.
12. C. A. Micchelli, "Interpolation of scattered data: distance matrices and conditionally positive definite functions," Constructive Approximation, vol. 2, no. 1, pp. 11–22, 1986.
13. F. Girosi and T. Poggio, "Networks and the best approximation property," Biological Cybernetics, vol. 63, no. 3, pp. 169–176, 1990.
14. G. Wahba, "Practical approximate solutions to linear operator equations when the data are noisy," SIAM Journal on Numerical Analysis, vol. 14, no. 4, pp. 651–667, 1977.
15. D.-X. Zhou, "Capacity of reproducing kernel spaces in learning theory," IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1743–1752, 2003.
16. D.-X. Zhou, "The covering number in learning theory," Journal of Complexity, vol. 18, no. 3, pp. 739–767, 2002.
17. R. Schaback, "Reconstruction of multivariate functions from scattered data," monograph manuscript, 1997.
18. M. D. Buhmann and M. J. D. Powell, "Radial basis function interpolation on an infinite regular grid," in Algorithms for Approximation, II, pp. 146–169, Chapman and Hall, London, UK, 1990.
19. K. Jetter, J. Stöckler, and J. D. Ward, "Error estimates for scattered data interpolation on spheres," Mathematics of Computation, vol. 68, no. 226, pp. 733–747, 1999.
20. Z. M. Wu and R. Schaback, "Local error estimates for radial basis function interpolation of scattered data," IMA Journal of Numerical Analysis, vol. 13, no. 1, pp. 13–27, 1993.
21. K. Ball, "Eigenvalues of Euclidean distance matrices," Journal of Approximation Theory, vol. 68, no. 1, pp. 74–82, 1992.
22. F. J. Narcowich and J. D. Ward, "Norms of inverses and condition numbers for matrices associated with scattered data," Journal of Approximation Theory, vol. 64, no. 1, pp. 69–94, 1991.
23. F. J. Narcowich and J. D. Ward, "Norm estimates for the inverses of a general class of scattered-data radial-function interpolation matrices," Journal of Approximation Theory, vol. 69, no. 1, pp. 84–109, 1992.
24. R. Schaback, "Lower bounds for norms of inverses of interpolation matrices for radial basis functions," Journal of Approximation Theory, vol. 79, no. 2, pp. 287–306, 1994.
25. F. Y. Lu and H. W. Sun, "Positive definite dot product kernels in learning theory," Advances in Computational Mathematics, vol. 22, no. 2, pp. 181–198, 2005.
26. A. Pinkus, "Strictly positive definite functions on a real inner product space," Advances in Computational Mathematics, vol. 20, no. 4, pp. 263–271, 2004.
27. A. J. Smola, B. Schölkopf, and K.-R. Müller, "The connection between regularization operators and support vector kernels," Neural Networks, vol. 11, pp. 637–649, 1998.
28. D.-X. Zhou, "Conditionally reproducing kernel spaces in learning theory," preprint.
29. W. Dahmen and C. A. Micchelli, "Some remarks on ridge functions," Approximation Theory and Its Applications, vol. 3, pp. 139–143, 1987.
30. G. G. Lorentz, Approximation of Functions, Holt, Rinehart and Winston, 1966.