Density Problem and Approximation Error in Learning Theory



Introduction
Learning theory investigates how to find function relations or data structures from random samples. For the regression problem, one usually has some experience and would expect that the (underlying) unknown function lies in some set of functions $\mathcal{H}$ called the hypothesis space. Then one tries to find a good approximation in $\mathcal{H}$ of the underlying function $f$ (under a certain metric). The best approximation in $\mathcal{H}$ is called the target function $f_{\mathcal{H}}$. However, $f$ is unknown. What we have in hand is a set of random samples $\{(x_i, y_i)\}_{i=1}^{\ell}$. These samples are not given by $f$ exactly ($f(x_i) \neq y_i$); they are controlled by this underlying function $f$ with noise or some other uncertainties ($f(x_i) \approx y_i$). The most important model studied in learning theory [1] assumes that the uncertainty is represented by a Borel probability measure $\rho$ on $X \times Y$ and that the underlying function $f \colon X \to Y$ is the regression function $f_\rho$ of $\rho$ given by
\[
f_\rho(x) = \int_Y y \, d\rho(y \mid x), \qquad x \in X. \tag{1}
\]
Here, $\rho(y \mid x)$ is the conditional probability measure at $x$. The samples $\{(x_i, y_i)\}_{i=1}^{\ell}$ are then drawn independently and identically distributed according to the probability measure $\rho$. For the classification problem, $Y = \{1, -1\}$ and $\operatorname{sign}(f_\rho)$ is the optimal classifier.
Based on the samples, one can find a function from the hypothesis space $\mathcal{H}$ that best fits the data $\mathbf{z} := \{(x_i, y_i)\}_{i=1}^{\ell}$ (with respect to a certain loss functional). This function is called the empirical target function $f_{\mathbf{z}}$. When the number $\ell$ of samples is large enough, $f_{\mathbf{z}}$ is a good approximation of the target function $f_{\mathcal{H}}$ with certain confidence. This problem has been extensively investigated and is well developed in the literature of statistical learning theory; see, for example, [1–4].
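To make the setting concrete, the following minimal sketch computes an empirical target function for the least-squares loss, using regularized kernel least squares as a stand-in for minimization over a ball of an RKHS; the Gaussian width sigma, the regularization parameter lam, and the sample function are illustrative choices, not quantities fixed by the paper.

```python
import numpy as np

def gaussian_kernel(s, t, sigma=0.5):
    """Gaussian Mercer kernel K(s, t) = exp(-|s - t|^2 / sigma^2)."""
    return np.exp(-np.subtract.outer(s, t) ** 2 / sigma ** 2)

# Noisy samples (x_i, y_i) controlled by an unknown f, here f(x) = sin(2 pi x).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)

# Regularized empirical target: minimize sum_i (g(x_i) - y_i)^2 + lam ||g||_K^2.
# By the representer theorem, the minimizer is g = sum_i c_i K_{x_i} with
# (K + lam I) c = y, where K is the kernel matrix on the sample points.
lam = 1e-3
K = gaussian_kernel(x, x)
c = np.linalg.solve(K + lam * np.eye(len(x)), y)

# Evaluate the empirical target function on a grid and compare with f.
t = np.linspace(0.0, 1.0, 200)
f_z = gaussian_kernel(t, x) @ c
print("sup-norm distance to f on the grid:", np.abs(f_z - np.sin(2 * np.pi * t)).max())
```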
What is less understood is the approximation of the underlying desired function $f$ by the target function $f_{\mathcal{H}}$. For example, if one takes $\mathcal{H}$ to be a polynomial space of some fixed degree, then $f$ can be approximated well by functions from $\mathcal{H}$ only when $f$ is itself a polynomial in $\mathcal{H}$.
In kernel machine learning such as support vector machines, one often uses reproducing kernel Hilbert spaces or their balls as hypothesis spaces. Here, we take $(X, d(\cdot, \cdot))$ to be a compact metric space and $Y = \mathbb{R}$.

Definition 1. Let $K \colon X \times X \to \mathbb{R}$ be continuous, symmetric, and positive semidefinite; that is, for any finite set of distinct points $\{x_1, \ldots, x_\ell\} \subset X$, the matrix $(K(x_i, x_j))_{i,j=1}^{\ell}$ is positive semidefinite. Such a kernel is called a Mercer kernel. It is called positive definite if the matrix $(K(x_i, x_j))_{i,j=1}^{\ell}$ is always positive definite.

The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with a Mercer kernel $K$ is the completion of $\operatorname{span}\{K_x := K(x, \cdot) : x \in X\}$ under the inner product determined by $\langle K_x, K_y \rangle_K = K(x, y)$. It satisfies the reproducing property
\[
f(x) = \langle f, K_x \rangle_K, \qquad f \in \mathcal{H}_K, \; x \in X. \tag{3}
\]
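As a quick numerical illustration of Definition 1 (added here; not part of the original text), one can sample points and inspect the eigenvalues of the kernel matrix. The Gaussian kernel below is a standard example of a positive definite Mercer kernel, while the linear kernel on the line is only positive semidefinite.

```python
import numpy as np

def kernel_matrix(kernel, points):
    """Gram matrix (K(x_i, x_j))_{i,j=1}^l for a list of points."""
    return np.array([[kernel(s, t) for t in points] for s in points])

gaussian = lambda s, t: np.exp(-(s - t) ** 2 / 0.1)   # positive definite
linear = lambda s, t: s * t                            # positive semidefinite only

pts = np.linspace(0.0, 1.0, 6)
for name, k in [("gaussian", gaussian), ("linear", linear)]:
    eigs = np.linalg.eigvalsh(kernel_matrix(k, pts))
    print(name, "smallest eigenvalue:", eigs.min())
# The Gaussian matrix has strictly positive eigenvalues; the linear kernel
# yields a rank-one matrix, which is singular as soon as there are >= 2 points.
```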
In kernel machine learning, one often takes $\mathcal{H}_K$ or its balls as the hypothesis space. Then, one needs to know whether the desired function can be approximated well by functions from the RKHS.
The first purpose of this paper is to study the density of reproducing kernel Hilbert spaces in $C(X)$ (or in $L^2(X)$ when $X$ is a subset of the Euclidean space $\mathbb{R}^n$). This will be done in Section 2, where some characterizations will be provided. Let us mention a simple example, with detailed proof given in Section 6.
Example 2. Let $X = [0, 1]$ and let $K$ be a Mercer kernel given by
\[
K(x, y) = \sum_{k=0}^{+\infty} a_k (xy)^k,
\]
where $a_k \geq 0$ for each $k$ and $\sum_{k=0}^{+\infty} a_k < \infty$. Set $J := \{k \in \mathbb{Z}_+ : a_k > 0\}$. Then, $\mathcal{H}_K$ is dense in $C(X)$ if and only if $0 \in J$ and $\sum_{k \in J \setminus \{0\}} 1/k = +\infty$.

When the density holds, we want to study the convergence rate of the approximation by functions from balls of the RKHS as the radius tends to infinity. The quantity
\[
I(f, R) := \inf_{\|g\|_K \leq R} \|f - g\|
\]
is called the approximation error in learning theory. Some estimates have been presented by Smale and Zhou [6] for the $L^2$ norm and many kernels. The second purpose of this paper is to investigate the convergence rate of the approximation error with the uniform norm as well as the $L^2$ norm.
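The density criterion in Example 2 is a Müntz-type condition on the exponent set $J$. The following sketch (our illustration, not from the original text) compares least-squares approximation of $f(x) = x$ by the monomials $\{x^k : k \in J\}$ for an exponent set whose reciprocal sum diverges against one whose reciprocal sum converges; the particular sets and grid size are arbitrary choices.

```python
import numpy as np

def lsq_error(exponents, n_grid=2000):
    """Discrete L^2([0,1]) distance from f(x) = x to span{x^k : k in exponents}."""
    t = np.linspace(0.0, 1.0, n_grid)
    A = np.stack([t ** k for k in exponents], axis=1)
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    return np.sqrt(np.mean((A @ coef - t) ** 2))

even_J = [0] + [2 * j for j in range(1, 9)]       # sum over J\{0} of 1/k diverges
lacunary_J = [0] + [2 ** j for j in range(1, 9)]  # sum over J\{0} of 1/k converges
print("divergent case, L2 error :", lsq_error(even_J))
print("convergent case, L2 error:", lsq_error(lacunary_J))
# The first error is several times smaller and keeps shrinking as exponents
# are added; the lacunary case stalls near a positive limit (Muntz's theorem).
```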
Estimates will be given in Section 4, based on the analysis in Section 3 of interpolation schemes associated with general Mercer kernels. With this analysis, we can understand the approximation error with respect to the marginal probability distribution induced by $\rho$. Let us provide an example of Gaussian kernels to illustrate the idea. Notice that when the parameter $\sigma$ of the kernel is allowed to change with $R$, the rate of the approximation error may be improved. This confirms the method of adaptively choosing the parameter of the kernel, which is used in many applications (see, e.g., [7]).
There exist positive constants $C$ and $c$ such that, for each $f \in H^s(\mathbb{R}^n)$ and $R \geq \|f\|$

Density and Positive Definiteness
The density problem of reproducing kernel Hilbert spaces in $C(X)$ was raised to the author by Poggio et al.; see [8]. It can be stated as follows.
Given a Mercer kernel $K$ on a compact metric space $(X, d(\cdot, \cdot))$, when is the RKHS $\mathcal{H}_K$ dense in $C(X)$?
By means of the dual space of $C(X)$, we can give a general characterization. This is only a simple observation, but it does provide useful information. For example, we will show that the density always holds for convolution type kernels. Also, for dot product type kernels, we can give a complete and clean characterization of the density, which will be given in Section 6.
Recall the Riesz Representation Theorem, asserting that the dual space of $C(X)$ can be represented by the set of Borel measures on $X$. For a Borel measure $\mu$ on $X$, we define the integral operator $L_{K,\mu}$ associated with the kernel $K$ as
\[
L_{K,\mu}(f)(x) := \int_X K(x, y) f(y) \, d\mu(y), \qquad x \in X. \tag{10}
\]
This is a compact operator on $L^2_\mu(X)$ if $\mu$ is a positive measure.
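The characterizations below hinge on the spectrum of $L_{K,\mu}$. As a rough numerical illustration (our addition, with the uniform measure on $[0,1]$ and a rank-deficient kernel chosen for contrast), one can discretize (10) by quadrature and inspect the eigenvalues of the resulting matrix.

```python
import numpy as np

def discretized_operator(kernel, n=200):
    """Midpoint-rule discretization of L_{K,mu} on L^2([0,1], dx):
    (L_{K,mu} f)(x_i) ~ (1/n) * sum_j K(x_i, x_j) f(x_j)."""
    x = (np.arange(n) + 0.5) / n
    return kernel(x[:, None], x[None, :]) / n

gaussian = lambda s, t: np.exp(-(s - t) ** 2 / 0.1)
rank_two = lambda s, t: 1.0 + s * t   # range is span{1, x}

for name, k in [("gaussian", gaussian), ("rank-two", rank_two)]:
    eigs = np.sort(np.linalg.eigvalsh(discretized_operator(k)))[::-1]
    print(name, "largest eigenvalues:", np.round(eigs[:4], 4))
# The rank-two kernel has only two nonzero eigenvalues: zero is an eigenvalue
# of L_{K,mu}, so H_K cannot be dense in L^2([0,1]) for this kernel.
```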

Theorem 4.
Let $K$ be a Mercer kernel on a compact metric space $(X, d)$. Then, the following statements are equivalent.

(1) $\mathcal{H}_K$ is dense in $C(X)$.

(2) For every positive Borel measure $\mu$ on $X$, $\mathcal{H}_K$ is dense in $L^2_\mu(X)$.

(3) For every positive Borel measure $\mu$ on $X$, $L_{K,\mu}$ has no eigenvalue zero in $L^2_\mu(X)$.

(4) For every nontrivial Borel measure $\mu$ on $X$, the function $\int_X K(x, y) \, d\mu(y)$ is nonzero as a function in $C(X)$.

Proof. (1) $\Rightarrow$ (2). This follows from the fact that $C(X)$ is dense in $L^2_\mu(X)$. See, for example, [9].

(2) $\Rightarrow$ (3). Suppose that $\mathcal{H}_K$ is dense in $L^2_\mu(X)$ but $L_{K,\mu}$ has an eigenvalue zero in $L^2_\mu(X)$. Then, there exists a nontrivial function $f \in L^2_\mu(X)$ such that $L_{K,\mu}(f) = 0$; that is,
\[
L_{K,\mu}(f)(x) = \int_X K(x, y) f(y) \, d\mu(y) = \int_X K_x(y) f(y) \, d\mu(y) = 0. \tag{12}
\]
The identity holds as functions in $L^2_\mu(X)$. If the support of $\mu$ is $X$, then this identity would imply that $f$ is orthogonal to each $K_x$ with $x \in X$. When the support of $\mu$ is not $X$, things are more complicated. Here, the support of $\mu$, denoted as $\operatorname{supp}\mu$, is defined to be the smallest closed subset $F$ of $X$ satisfying $\mu(X \setminus F) = 0$. The property of the RKHS enables us to prove the general case. As the function $L_{K,\mu}(f)$ is continuous, we know from (12) that, for each $x$ in $\operatorname{supp}\mu$,
\[
\int_{\operatorname{supp}\mu} K_x(y) f(y) \, d\mu(y) = \int_X K_x(y) f(y) \, d\mu(y) = 0. \tag{13}
\]
This means that, for each $x$ in $\operatorname{supp}\mu$, $f \perp K_x$ in $L^2_\mu(\operatorname{supp}\mu)$, where $\mu$ has been restricted onto $\operatorname{supp}\mu$. When we restrict $K$ onto $\operatorname{supp}\mu \times \operatorname{supp}\mu$, the new kernel $\widetilde{K}$ is again a Mercer kernel. Moreover, by the definition of the RKHS, $\mathcal{H}_{\widetilde{K}} = \mathcal{H}_K|_{\operatorname{supp}\mu}$. It follows that $\operatorname{span}\{K_x|_{\operatorname{supp}\mu} : x \in \operatorname{supp}\mu\}$ is dense in $\mathcal{H}_{\widetilde{K}} = \mathcal{H}_K|_{\operatorname{supp}\mu}$, and the latter is dense in $L^2_\mu(\operatorname{supp}\mu)$. Therefore, $f$ is orthogonal to $L^2_\mu(\operatorname{supp}\mu)$; hence, as a function in $L^2_\mu(X)$, $f$ is zero. This is a contradiction.

(3) $\Rightarrow$ (4). Every nontrivial Borel measure $\mu$ can be uniquely decomposed as the difference of two mutually singular positive Borel measures: $\mu = \mu_+ - \mu_-$; that is, there exists a Borel set $A \subset X$ such that $\mu_+(A) = \mu_+(X)$ and $\mu_-(A) = 0$. With this decomposition,
\[
\int_X K(x, y) \, d\mu(y) = \int_X K(x, y) \left\{\chi_A(y) - \chi_{X \setminus A}(y)\right\} d|\mu|(y) = L_{K,|\mu|}\left(\chi_A - \chi_{X \setminus A}\right)(x). \tag{14}
\]
Here, $\chi_A$ is the characteristic function of the set $A$, and $|\mu| = \mu_+ + \mu_-$ is the absolute value of $\mu$. As $|\mu|$ is a nontrivial positive Borel measure and $\chi_A - \chi_{X \setminus A}$ is a nontrivial function in $L^2_{|\mu|}(X)$, statement (3) implies that, as a function in $L^2_{|\mu|}(X)$, $\int_X K(x, y) \, d\mu(y) \neq 0$. Since this function lies in $C(X)$, it is nonzero as a function in $C(X)$.

The last implication (4) $\Rightarrow$ (1) follows directly from the Riesz Representation Theorem.

The proof of Theorem 4 also yields a characterization of the density of the RKHS in $L^2_\mu(X)$.

Corollary 5. Let $K$ be a Mercer kernel on a compact metric space $(X, d)$ and $\mu$ a positive Borel measure on $X$. Then, $\mathcal{H}_K$ is dense in $L^2_\mu(X)$ if and only if $L_{K,\mu}$ has no eigenvalue zero in $L^2_\mu(X)$.

The necessity has been verified in the proof of Theorem 4, while the sufficiency follows from the observation that an $L^2_\mu(X)$ function $f$ lying in the orthogonal complement of $\operatorname{span}\{K_x : x \in X\}$ gives an eigenfunction of $L_{K,\mu}$ with eigenvalue zero:
\[
\langle K_x, f \rangle_{L^2_\mu(X)} = \int_X K_x(y) f(y) \, d\mu(y) = L_{K,\mu}(f)(x) = 0. \tag{15}
\]

Theorem 4 enables us to conclude that the density always holds for convolution type kernels $K(x, y) = k(x - y)$ with $k \in L^2(\mathbb{R}^n)$. The density for some convolution type kernels has been verified by Steinwart [10]. The author observed the density as a corollary of Theorem 4 when $\hat{k}(\xi)$ is strictly positive. Charlie Micchelli pointed out to the author that, for a convolution type kernel, the RKHS is always dense in $C(X)$. So, the density problem is solved for these kernels.

Corollary 6. Let $K(x, y) = k(x - y)$ be a nontrivial convolution type Mercer kernel on $\mathbb{R}^n$ with $k \in L^2(\mathbb{R}^n)$. Then, for any compact subset $X$ of $\mathbb{R}^n$, $\mathcal{H}_K$ on $X$ is dense in $C(X)$.
Proof. It is well known that $K$ is a Mercer kernel if and only if $k$ is continuous and $\hat{k}(\xi) \geq 0$ almost everywhere. We apply the equivalent statement (4) of Theorem 4 to prove our statement.
Let $\mu$ be a Borel measure on $X$ such that
\[
\int_X K(x, y) \, d\mu(y) = 0, \qquad \forall x \in X. \tag{16}
\]
Then, the inverse Fourier transform yields
\[
\int_X \int_{\mathbb{R}^n} \hat{k}(\xi) e^{i\xi\cdot(x-y)} \, d\xi \, d\mu(y) = \int_{\mathbb{R}^n} \hat{k}(\xi) \hat{\mu}(\xi) e^{i\xi\cdot x} \, d\xi = 0, \qquad \forall x \in X. \tag{17}
\]
Here, $\hat{\mu}(\xi) = \int_X e^{-i\xi\cdot y} \, d\mu(y)$ is the Fourier transform of the Borel measure $\mu$, which is an entire function.
Taking the integral on $X$ with respect to the measure $\mu$, we have
\[
\int_{\mathbb{R}^n} \hat{k}(\xi) \hat{\mu}(\xi) \int_X e^{i\xi\cdot x} \, d\mu(x) \, d\xi = \int_{\mathbb{R}^n} \hat{k}(\xi) \left|\hat{\mu}(\xi)\right|^2 d\xi = 0. \tag{18}
\]
For a nontrivial Borel measure $\mu$ supported on $X$, $\hat{\mu}(\xi)$ vanishes only on a set of measure zero. Hence, $\hat{k}(\xi) = 0$ almost everywhere, which gives $k = 0$ and contradicts the assumption that $K$ is nontrivial. Therefore, we must have $\mu = 0$. This proves the density by Theorem 4.
After the first version of the paper was finished, I learned that Micchelli et al. [11] proved the density for a class of convolution type kernels $k(x - y)$ with $k$ being the Fourier transform of a finite Borel measure. Note that a large family of convolution type reproducing kernels is given by radial basis functions; see, for example, [12]. Now we can state a simple fact: positive definiteness is a necessary condition for the density.

Corollary 7.
Let $K$ be a Mercer kernel on a compact metric space $(X, d)$. If $\mathcal{H}_K$ is dense in $C(X)$, then $K$ is positive definite.
Proof. Suppose to the contrary that $\mathcal{H}_K$ is dense in $C(X)$, but there exists a finite set of distinct points $\{x_i\}_{i=1}^{\ell} \subset X$ such that the matrix $A_{\mathbf{x}} := (K(x_i, x_j))_{i,j=1}^{\ell}$ is not positive definite. By the Mercer kernel property, $A_{\mathbf{x}}$ is positive semidefinite, so it is singular, and we can find a nonzero vector $c := (c_1, \ldots, c_\ell)^T \in \mathbb{R}^{\ell}$ satisfying $A_{\mathbf{x}} c = 0$. It follows that $c^T A_{\mathbf{x}} c = 0$; that is,
\[
\Bigl\| \sum_{i=1}^{\ell} c_i K_{x_i} \Bigr\|_K^2 = \Bigl\langle \sum_{i=1}^{\ell} c_i K_{x_i}, \sum_{j=1}^{\ell} c_j K_{x_j} \Bigr\rangle_K = c^T A_{\mathbf{x}} c = 0.
\]
Thus, $\sum_{i=1}^{\ell} c_i K_{x_i} = 0$. Now, we define a nontrivial Borel measure $\mu$ supported on $\{x_1, \ldots, x_\ell\}$ as $\mu := \sum_{i=1}^{\ell} c_i \delta_{x_i}$. Then, for $x \in X$,
\[
\int_X K(x, y) \, d\mu(y) = \sum_{i=1}^{\ell} c_i K(x, x_i) = \Bigl(\sum_{i=1}^{\ell} c_i K_{x_i}\Bigr)(x) = 0.
\]
This contradicts the equivalent statement (4) of the density in Theorem 4.
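The construction in this proof can be seen concretely in a small example (our illustration, not from the paper): the linear kernel $K(x, y) = xy$ is a Mercer kernel that is not positive definite, any two points give a singular Gram matrix, and the resulting signed measure annihilates every $K_x$.

```python
import numpy as np

# Linear kernel K(x, y) = x * y: positive semidefinite but not positive definite.
x_nodes = np.array([0.5, 1.0])
A = np.outer(x_nodes, x_nodes)        # Gram matrix, rank one, hence singular

# Null vector c with A c = 0, i.e. sum_i c_i K_{x_i} = 0 in H_K.
_, _, Vt = np.linalg.svd(A)
c = Vt[-1]
print("A c =", A @ c)                 # ~ (0, 0)

# The signed measure mu = sum_i c_i delta_{x_i} kills every K_x:
t = np.linspace(0.0, 1.0, 5)
print("integral of K(x, .) dmu:", sum(ci * (t * xi) for ci, xi in zip(c, x_nodes)))
```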
Because of the necessity given in Corollary 7, one would expect that positive definiteness is also sufficient for the density. Steve Smale convinced the author that this is not the case in general. This motivates us to present a constructive example, with the kernel $K$ on $[0, 1]$ defined by the series (24). Then, $K$ is a $C^\infty$ Mercer kernel on $[0, 1]$. It is positive definite, but the constant function 1 is not in the closure of $\mathcal{H}_K$ in $C(X)$. Hence, $\mathcal{H}_K$ is not dense in $C(X)$.
Proof. The series in (24) converges in $C^s$ for any $s \in \mathbb{N}$. Hence, $K$ is $C^\infty$ and is a Mercer kernel on $[0, 1]$.
We now prove that 1, the constant function taking the value 1 everywhere, is not in the closure of $\mathcal{H}_K$ in $C(X)$. In fact, the uniformly convergent series (24) and the vanishing property of $K_x$ yield a vanishing property shared by all functions in $\operatorname{span}\{K_x : x \in X\}$. Since $\operatorname{span}\{K_x : x \in X\}$ is dense in $\mathcal{H}_K$ and $\mathcal{H}_K$ is embedded in $C(X)$, the same property holds for every function in $\mathcal{H}_K$. If 1 could be uniformly approximated by a sequence $\{f_n\}$ in $\mathcal{H}_K$, this vanishing property would be violated in the limit, which would be a contradiction. Therefore, $\mathcal{H}_K$ is not dense in $C(X)$.
Combining the previous discussion, we see that positive definiteness is a natural necessary condition for the density of the RKHS in $C(X)$, but it is not sufficient.

Interpolation Schemes for Reproducing Kernel Spaces
The study of approximation by reproducing kernel Hilbert spaces has a long history; see, for example, [13, 14]. Here, we want to investigate the rate of approximation as the RKHS norm of the approximant becomes large.
In the following sections, we consider the approximation error for the purpose of learning theory. The basic tool for constructing approximants is a set of nodal functions used in [6, 15, 16].

Definition 9. We say that $\{u_i(x) := u_{i,\mathbf{x}}(x)\}_{i=1}^{\ell}$ is the set of nodal functions associated with the nodes $\mathbf{x} = \{x_1, \ldots, x_\ell\} \subset X$ if each $u_i$ lies in $\operatorname{span}\{K_{x_j}\}_{j=1}^{\ell}$ and $u_i(x_j) = \delta_{ij}$ for $i, j = 1, \ldots, \ell$. The associated interpolation scheme is
\[
I_{\mathbf{x}}(f) := \sum_{i=1}^{\ell} f(x_i) u_i. \tag{34}
\]
The nodal functions have some nice minimization properties; see [6, 16].
The error $I_{\mathbf{x}}(f) - f$ for $f \in \mathcal{H}_K$ will be estimated by means of a power function.
Definition 11. Let $K$ be a Mercer kernel on a compact metric space $(X, d)$ and $\mathbf{x} = \{x_1, \ldots, x_\ell\} \subset X$. The power function $\varepsilon_K$ is defined on $\mathbf{x}$ as the worst-case $\mathcal{H}_K$-distance from $K_x$ to the span of the kernel functions at the nodes:
\[
\varepsilon_K(\mathbf{x}) := \sup_{x \in X} \, \operatorname{dist}_{\mathcal{H}_K}\bigl(K_x, \operatorname{span}\{K_{x_j}\}_{j=1}^{\ell}\bigr).
\]
We know that $\varepsilon_K(\mathbf{x}) \to 0$ when the nodes $\mathbf{x}$ become dense in $X$. Moreover, higher order regularity of $K$ implies faster convergence of $\varepsilon_K(\mathbf{x})$. For details, see [16].
The error of the interpolation scheme for functions from the RKHS can be estimated as follows.
Theorem 12. Let $f \in \mathcal{H}_K$. Then
\[
\left\|I_{\mathbf{x}}(f) - f\right\|_{\infty} \leq \varepsilon_K(\mathbf{x}) \, \|f\|_K. \tag{38}
\]

Proof. Let $x \in X$. We apply the reproducing property (3) to the function $f$: since $I_{\mathbf{x}}(f)(x) = \sum_{i=1}^{\ell} f(x_i) u_i(x)$ and $f(x_i) = \langle f, K_{x_i} \rangle_K$,
\[
I_{\mathbf{x}}(f)(x) - f(x) = \Bigl\langle f, \sum_{i=1}^{\ell} u_i(x) K_{x_i} - K_x \Bigr\rangle_K.
\]
It follows that
\[
\bigl|I_{\mathbf{x}}(f)(x) - f(x)\bigr| \leq \|f\|_K \Bigl\| \sum_{i=1}^{\ell} u_i(x) K_{x_i} - K_x \Bigr\|_K \leq \varepsilon_K(\mathbf{x}) \, \|f\|_K,
\]
where we used that $\sum_{i=1}^{\ell} u_i(x) K_{x_i}$ is the orthogonal projection of $K_x$ onto $\operatorname{span}\{K_{x_j}\}_{j=1}^{\ell}$. This proves (38).
As $I_{\mathbf{x}}(f) \in \mathcal{H}_K$ and $I_{\mathbf{x}}(f)(x_i) = f(x_i)$ for $i = 1, \ldots, \ell$, we know that
\[
I_{\mathbf{x}}(f)(x_i) - f(x_i) = \left\langle K_{x_i}, I_{\mathbf{x}}(f) - f \right\rangle_K = 0, \qquad i = 1, \ldots, \ell.
\]
This means that $I_{\mathbf{x}}(f) - f$ is orthogonal to $\operatorname{span}\{K_{x_i}\}_{i=1}^{\ell}$. Hence, $I_{\mathbf{x}}(f)$ is the orthogonal projection of $f$ onto $\operatorname{span}\{K_{x_i}\}_{i=1}^{\ell}$. Thus, $\|I_{\mathbf{x}}(f)\|_K \leq \|f\|_K$.

The regularity of the kernel, in connection with Theorem 12, yields the rate of convergence of the interpolation scheme; see, as an example, the estimate for $\varepsilon_K(\mathbf{x})$ given in [16]. For convolution type kernels, the power function can be estimated in terms of the Fourier transform of the kernel function. This is of particular interest when the kernel function is analytic. Let us provide the details.
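The following compact sketch of Definition 9 and Theorem 12 is our own numerical illustration: the nodal-function values come from the inverse Gram matrix, the interpolant is the orthogonal projection onto $\operatorname{span}\{K_{x_i}\}$, and the pointwise power function has the closed form $K(x,x) - k(x)^T A^{-1} k(x)$; the Gaussian width $0.2$ and the test function are arbitrary choices.

```python
import numpy as np

sigma = 0.2
kern = lambda s, t: np.exp(-np.subtract.outer(s, t) ** 2 / sigma ** 2)

nodes = np.linspace(0.0, 1.0, 9)
A = kern(nodes, nodes)                  # Gram matrix (K(x_i, x_j))
A_inv = np.linalg.inv(A)

t = np.linspace(0.0, 1.0, 400)
U = kern(t, nodes) @ A_inv              # column i is the nodal function u_i,
                                        # satisfying u_i(x_j) = delta_{ij}

# A test function lying in H_K: f = K_{0.3} - 0.5 * K_{0.8}.
f = lambda s: kern(s, np.array([0.3, 0.8])) @ np.array([1.0, -0.5])

interp = U @ f(nodes)                   # interpolation scheme I_x(f)

# Pointwise squared power function: K(x, x) - k(x)^T A^{-1} k(x), with K(x, x) = 1.
B = kern(t, nodes)
p2 = 1.0 - np.einsum("ij,jk,ik->i", B, A_inv, B)
print("sup |I_x(f) - f|  :", np.abs(interp - f(t)).max())
print("sup power function:", np.sqrt(np.clip(p2, 0.0, None)).max())
```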
The estimate for $\varepsilon_K(\mathbf{x})$ in the second part was verified in the proof of Theorem 3 in [16].
For the Gaussian kernels, it was proved in [16, Example 4] that, for $\ell \geq 80 \log 2 / \sigma^2$, the corresponding bound for the power function holds.

Approximation Error in Learning Theory
Now, we can estimate the approximation error in learning theory by means of the interpolation scheme (34).
(ii) Let $x \in X$. Splitting the error at $x$ into two terms and applying the Schwarz inequality, the first term is bounded by $\|f\|_{L^2}$ times $\Lambda_K$, while the second term can be bounded by $\varepsilon_K(\mathbf{x})$, as shown in the proof of Theorem 12. Therefore, (52) follows.

(iii) By the Plancherel formula, the last statement follows. This proves all the statements in Theorem 16.
Denote by $\Lambda_K^{-1}$ the inverse function of $\Lambda_K$. Then, our estimate for the approximation error can be given as follows.
For the Gaussian kernels, we have the following. This yields the first estimate.
When $s > n/2$, the same method gives the error with the uniform norm.

Learning with Varying Kernels
Proposition 18 in the last section shows that, for a fixed Gaussian kernel, the approximation error $I(f, R)$ decays only at a logarithmic rate in $R$.
In this section, we consider learning with varying kernels. Such a method is used in many applications where we have to choose suitable parameters for the reproducing kernel. For example, in [7], Gaussian kernels with different parameters in different directions are considered. Here, we study the case where the variance parameter is the same in all directions. Our analysis shows that the approximation error may be improved when the kernel changes with the RKHS norm $R$ of the empirical target function.
Can the rate be improved by letting the kernel parameter $\sigma$ depend on $R$ as $R$ tends to infinity?
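The tradeoff behind adaptively choosing the width can be seen in a small experiment (our illustration, not from the paper): for a fixed target function and a fixed node set, the error of the Gaussian-kernel interpolation scheme varies strongly with $\sigma$, so tying $\sigma$ to the budget $R$ can outperform any fixed choice.

```python
import numpy as np

def interp_error(sigma, n_nodes=12):
    """Sup-norm error of Gaussian-kernel interpolation of the Runge function."""
    kern = lambda s, t: np.exp(-np.subtract.outer(s, t) ** 2 / sigma ** 2)
    nodes = np.linspace(-1.0, 1.0, n_nodes)
    f = lambda s: 1.0 / (1.0 + 25.0 * s ** 2)
    t = np.linspace(-1.0, 1.0, 600)
    coef = np.linalg.solve(kern(nodes, nodes), f(nodes))
    return np.abs(kern(t, nodes) @ coef - f(t)).max()

for sigma in [0.05, 0.2, 0.5, 1.0]:
    print(f"sigma = {sigma:4}: sup-norm error = {interp_error(sigma):.3e}")
# Very small sigma interpolates poorly between the nodes, while very large
# sigma makes the Gram matrix ill-conditioned; an intermediate width wins.
```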

Dot Product Kernels
In this section, we illustrate our results by the family of dot product type kernels. These kernels take the form
\[
K(x, y) = \sum_{k=0}^{+\infty} a_k (x \cdot y)^k.
\]
When $\sum_{k=0}^{+\infty} |a_k| R^{2k} < \infty$ for some $R > 0$, the kernel $K$ is a Mercer kernel on $X := \{x \in \mathbb{R}^n : |x| \leq R\}$ if and only if $a_k \geq 0$ for each $k \geq 0$; see [25–28]. Here, we characterize the density for this family as in [29]. The computation in (122), in connection with Theorem 4, implies that $\mathcal{H}_K$ is not dense in $C(X)$; this proves the first statement of Corollary 21.
Thus, the approximation error can be given in terms of the regularity of the kernel $K$. The regularity of the approximated function yields the rate of approximation by polynomials, while the asymptotic behavior of the coefficients $a_k$ in (116) provides the control of the RKHS norm of the interpolant.
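To illustrate the Mercer characterization $a_k \geq 0$ numerically (an aside we add here, with made-up coefficient sequences), compare the kernel matrix of a dot product kernel with nonnegative coefficients against one with a single negative coefficient.

```python
import numpy as np

def dot_product_kernel(coeffs, X):
    """K(x, y) = sum_k a_k (x . y)^k evaluated on the rows of X."""
    G = X @ X.T
    return sum(a * G ** k for k, a in enumerate(coeffs))

# Points on the unit circle, including an antipodal pair.
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])

good = [1.0, 0.5, 0.25, 0.125]    # all a_k >= 0: a Mercer kernel
bad = [1.0, -0.5, 0.25, 0.125]    # one negative coefficient

for name, coeffs in [("a_k >= 0", good), ("a_1 < 0", bad)]:
    eigs = np.linalg.eigvalsh(dot_product_kernel(coeffs, X))
    print(name, "smallest eigenvalue:", round(float(eigs.min()), 4))
# With a_1 < 0, the antipodal pair (x, -x) already carries negative energy,
# so the kernel matrix fails to be positive semidefinite.
```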
