
We study the density problem and approximation error of reproducing kernel Hilbert spaces for the purpose of learning theory. For a Mercer kernel

Learning theory investigates how to find function relations or data structures from random samples. For the regression problem, one usually has some prior experience and expects the underlying (unknown) function to lie in some set of functions

Based on the samples, one can find a function from the hypothesis space

What is less understood is the approximation of the underlying desired function

In kernel machine learning such as support vector machines, one often uses reproducing kernel Hilbert spaces or their balls as hypothesis spaces. Here, we take
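Since the formulas above are omitted in this version, a minimal numerical sketch may help fix ideas: a typical element of an RKHS hypothesis space is a finite kernel expansion f = Σᵢ cᵢ K(·, xᵢ), whose squared RKHS norm is cᵀKc with K the Gram matrix at the centers. The Gaussian kernel and the specific centers and coefficients below are illustrative choices, not taken from the paper.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Mercer kernel K(x, y) = exp(-|x - y|^2 / sigma^2) on the real line."""
    return np.exp(-np.subtract.outer(x, y) ** 2 / sigma ** 2)

# Centers and coefficients defining f = sum_i c_i K(., x_i); illustrative values.
centers = np.array([0.0, 0.5, 1.0])
coeffs = np.array([1.0, -2.0, 1.5])

def f(t):
    """Evaluate the kernel expansion at the points t."""
    return gaussian_kernel(np.atleast_1d(t), centers) @ coeffs

# For a finite expansion the RKHS norm is ||f||_K^2 = c^T K c,
# with K the Gram matrix at the centers.
gram = gaussian_kernel(centers, centers)
rkhs_norm = np.sqrt(coeffs @ gram @ coeffs)
print(f(0.25), rkhs_norm)
```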

Let

The

In kernel machine learning, one often takes

The first purpose of this paper is to study the density of the reproducing kernel Hilbert spaces in

Let

When the density holds, we want to study the convergence rate of the approximation by functions from balls of the RKHS as the radius tends to infinity. The quantity

Let

The density problem of reproducing kernel Hilbert spaces in

Given a Mercer kernel

By means of the dual space of

Recall the Riesz Representation Theorem, which asserts that the dual space of the space of continuous functions on a compact metric space can be identified with the space of finite signed Borel measures.

Let

For any nontrivial positive Borel measure

For any nontrivial positive Borel measure

For any nontrivial Borel measure

(1)

(2)

The property of the RKHS enables us to prove the general case. As the function

(3)

The last implication (4)

The proof of Theorem

Let

The necessity has been verified in the proof of Theorem

Theorem

Let

It is well known that

Let

Taking the integral on

After the first version of the paper was finished, I learned that Micchelli et al. [

We can now state the elementary fact that positive definiteness is a necessary condition for the density.
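Positive definiteness means that every Gram matrix (K(xᵢ, xⱼ)) built from finitely many distinct points is positive semi-definite. A quick numerical check of this property, using the Gaussian kernel as an illustrative example (the paper's own kernels are in the omitted formulas):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.subtract.outer(x, y) ** 2 / sigma ** 2)

# Positive definiteness: for any points x_1, ..., x_m, the Gram matrix
# (K(x_i, x_j)) has no negative eigenvalues.
rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=8)
gram = gaussian_kernel(points, points)

eigenvalues = np.linalg.eigvalsh(gram)
print(eigenvalues.min())  # nonnegative (up to rounding) for a positive definite kernel
```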

Let

Suppose to the contrary that

Now, we define a nontrivial Borel measure

Because of the necessity given in Corollary

Let

The series in (

To prove the positive definiteness, we let

We now prove that

Combining the previous discussion, we see that positive definiteness is a natural necessary condition for the density of the RKHS in

The study of approximation by reproducing kernel Hilbert spaces has a long history; see, for example, [

In the following sections, we consider the approximation error for the purpose of learning theory. The basic tool for constructing approximants is a set of nodal functions used in [

We say that

The nodal functions have some nice minimization properties; see [

In [

When the RKHS has finite dimension

The nodal functions are used to construct an interpolation scheme:
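The formula of the scheme is omitted above; for orientation, the standard nodal-function interpolant reproduces the data at the nodes, and evaluating it reduces to a linear solve with the Gram matrix. A sketch with a Gaussian kernel and a target function that are both illustrative choices, not the paper's examples:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=0.5):
    return np.exp(-np.subtract.outer(x, y) ** 2 / sigma ** 2)

# Interpolation nodes and sample values of a target function (illustrative).
nodes = np.linspace(-1.0, 1.0, 6)
values = np.sin(np.pi * nodes)

# Solving with the Gram matrix at the nodes yields the coefficients of the
# interpolant s = sum_j c_j K(., x_j); equivalently, s = sum_j values_j u_j
# for the cardinal (nodal) functions u_j with u_j(x_i) = delta_ij.
gram = gaussian_kernel(nodes, nodes)
coeffs = np.linalg.solve(gram, values)

def interpolant(t):
    return gaussian_kernel(np.atleast_1d(t), nodes) @ coeffs

# The scheme reproduces the data at the nodes (up to rounding).
print(np.max(np.abs(interpolant(nodes) - values)))
```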

The error

Let

We know that

The error of the interpolation scheme for functions from RKHS can be estimated as follows.

Let

Let

As

The regularity of the kernel in connection with Theorem

Let

For convolution type kernels, the power function can be estimated in terms of the Fourier transform of the kernel function. This is of particular interest when the kernel function is analytic. Let us provide the details.

Assume that

This function involves two parts. The first part is

The power function
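In the standard RKHS interpolation literature, the power function at a point x admits the closed form P_X(x)² = K(x, x) − k_X(x)ᵀA⁻¹k_X(x); it bounds the pointwise interpolation error for unit-norm RKHS functions. A numerical sketch under that standard definition (the paper's exact formulation is in the omitted formulas), with a Gaussian kernel as an illustrative choice:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.subtract.outer(x, y) ** 2 / sigma ** 2)

# Power function of interpolation at nodes X = {x_1, ..., x_m}:
#   P_X(x)^2 = K(x, x) - k_X(x)^T A^{-1} k_X(x),
# where A = (K(x_i, x_j)) and k_X(x) = (K(x, x_i)).  It vanishes at the nodes.
nodes = np.linspace(-1.0, 1.0, 6)
A = gaussian_kernel(nodes, nodes)

def power_function(t):
    kx = gaussian_kernel(np.atleast_1d(t), nodes)             # shape (m, 6)
    quad = np.einsum('ij,ij->i', kx, np.linalg.solve(A, kx.T).T)
    return np.sqrt(np.maximum(1.0 - quad, 0.0))               # K(t, t) = 1 here

print(power_function(nodes).max(), power_function(np.array([0.17]))[0])
```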

For the convolution type kernel (

Choose

The estimate for

For the Gaussian kernels
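The Gaussian function k(t) = exp(−t²/σ²) has the explicit Fourier transform σ√π·exp(−σ²ξ²/4) under the convention k̂(ξ) = ∫ k(t)e^{−iξt} dt (the paper's normalization may differ), so the transform decays faster than any polynomial. A quadrature check of the closed form:

```python
import numpy as np

# Closed form: the Fourier transform of k(t) = exp(-t^2 / sigma^2) is
#   k_hat(xi) = sigma * sqrt(pi) * exp(-sigma^2 * xi^2 / 4).
# Check it by numerical quadrature on a wide, fine grid.
sigma = 1.3
t = np.linspace(-30.0, 30.0, 200001)
dt = t[1] - t[0]
k = np.exp(-t ** 2 / sigma ** 2)

errors = []
for xi in (0.0, 0.7, 2.0):
    numeric = np.sum(k * np.exp(-1j * xi * t)) * dt
    exact = sigma * np.sqrt(np.pi) * np.exp(-sigma ** 2 * xi ** 2 / 4)
    errors.append(abs(numeric - exact))
print(errors)
```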

Now, we can estimate the approximation error in learning theory by means of the interpolation scheme (

Consider the convolution type kernel (

Let

(i) For

We apply the previous analysis to the function

Now, we need to estimate the norm

(ii) Let

(iii) By the Plancherel formula,
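With the convention f̂(ξ) = ∫ f(t)e^{−iξt} dt, the Plancherel formula reads ∫ |f(t)|² dt = (2π)⁻¹ ∫ |f̂(ξ)|² dξ. A numerical sanity check with a Gaussian, an illustrative choice rather than the function treated in the omitted proof:

```python
import numpy as np

# Plancherel:  int |f(t)|^2 dt = (1 / (2*pi)) * int |f_hat(xi)|^2 d(xi),
# checked for f(t) = exp(-t^2 / sigma^2), whose transform is
# f_hat(xi) = sigma * sqrt(pi) * exp(-sigma^2 * xi^2 / 4).
sigma = 1.3
grid = np.linspace(-30.0, 30.0, 200001)
step = grid[1] - grid[0]

f = np.exp(-grid ** 2 / sigma ** 2)
lhs = np.sum(f ** 2) * step

f_hat = sigma * np.sqrt(np.pi) * np.exp(-sigma ** 2 * grid ** 2 / 4)
rhs = np.sum(f_hat ** 2) * step / (2 * np.pi)

print(lhs, rhs)
```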

Theorem

Denote by

Let

The first part is a direct consequence of Theorem

To see the second part, we note that (

For

For the Gaussian kernels, we have the following.

Let

The Fourier transform of

For

By Corollary

When

When

Proposition

In this section, we consider learning with varying kernels. Such methods are used in many applications where suitable parameters of the reproducing kernel must be chosen. For example, in [

Let

Take

When

Finally, we determine

Let us mention the following problem concerning learning with Gaussian kernels with changing variances.

What is the optimal rate of convergence of

In this section, we illustrate our results with the family of dot product type kernels. These kernels take the form
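Dot product type kernels are power series in the inner product, K(x, y) = Σₖ aₖ(x·y)ᵏ with nonnegative coefficients. As an illustrative choice (not necessarily the paper's example), aₖ = 1/k! gives the exponential kernel exp(x·y), and truncating the series shows how quickly it converges:

```python
import numpy as np
from math import factorial

def dot_product_kernel(x, y, degree=20):
    """Truncated series K(x, y) = sum_{k<=degree} a_k (x . y)^k with the
    illustrative choice a_k = 1/k!, whose full sum is exp(x . y)."""
    s = float(np.dot(x, y))
    return sum(s ** k / factorial(k) for k in range(degree + 1))

x = np.array([0.6, 0.3])
y = np.array([0.2, -0.5])
approx = dot_product_kernel(x, y)
print(abs(approx - np.exp(np.dot(x, y))))  # truncation error of the series
```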

Let

Note that

The second statement follows from the classical Müntz Theorem in approximation theory (see [

The conclusion in Example

By Corollary

Let

Observe that the assumption in Corollary

What is left is to show that the Mercer kernel

An alternative, simpler proof of the positive definiteness of the kernel in Example

After characterizing the density, we can then apply our analysis in Section

If

The author would like to thank Charlie Micchelli for proving Corollary