Kernel selection is a central issue in kernel methods of machine learning. In this paper, we investigate regularized learning schemes based on kernel design methods. Our ideal kernel is derived from a simple iterative procedure using large-scale unlabeled data in a semisupervised framework. Compared with most existing approaches, our algorithm avoids the multiple optimizations involved in learning kernels, and its computation is as efficient as that of standard single-kernel algorithms. Moreover, large amounts of information associated with the input space can be exploited, so that generalization ability is improved accordingly. We provide theoretical support for the least square case in our setting; these advantages are also demonstrated by a simulation experiment and a real data analysis.
1. Introduction
Kernel-based methods have proved to be powerful for a wide range of data analysis problems. Since the support vector machine (SVM) was initially proposed by Vapnik [1], many other kernel-based methods have been developed, such as kernel PCA, kernel Fisher discriminant, and kernel CCA. In many cases, the performance of kernel methods depends greatly on the choice of kernel function (for the importance of specifying an appropriate kernel, see Chapter 13 of [2]). To choose an appropriate kernel, many kernel learning algorithms have been proposed in recent years, such as [3–5]. Among them, two kinds of candidate kernel sets are used: the first involves parameter selection within a candidate kernel collection such as the Gaussian kernels [6], $G_\sigma(x,y)=\exp(-\sigma\|x-y\|^2)$; the second mainly refers to linear combinations of certain prespecified kernels, an approach also called "multiple kernel learning" [4]. Recall that Lanckriet et al. [5] proposed a positive semidefinite program to search for the best linear combination automatically for SVM; however, this approach is time-consuming and only feasible for small samples. Sonnenburg et al. [7] relaxed this optimization problem to a semi-infinite linear program, which is capable of coping with a large number of kernels and samples. However, these multiple kernel learning algorithms sometimes perform no better in SVM than the traditional unweighted kernel $K=\sum_k K_k$, and Cortes [8] asked, "can learning kernels help performance?". Recently, Kloft and Blanchard [9] introduced a multiple kernel learning approach with $l_q$-norm ($q\ge 1$) constraint, which has been shown to be effective in both theory and practice [10, 11]. Essentially, the $l_q$-norm approach minimizes the empirical risk over the kernel candidate set $\{K=\sum_{k=1}^{M}\theta_k K_k : \|\theta\|_{l_q}\le 1,\ \theta\ge 0\}$. Kloft and Blanchard [9] provided an excess generalization error bound for $l_q$-norm multiple kernel learning utilizing local Rademacher complexity.
Although these kernel learning algorithms provide more flexibility than one-kernel approaches, they bring additional, more complex computational problems induced by multiple kernel learning. In addition, the above kernel learning algorithms are considered only under fully supervised settings. In practice, however, labeled instances are often difficult, expensive, or time-consuming to obtain, as they require the efforts of experienced human annotators, while unlabeled data may be relatively easy to collect. In the machine learning literature, semisupervised learning addresses this problem by using large amounts of unlabeled data, together with the labeled data, to build better learners.
In this paper, we pursue kernel learning algorithms under the semisupervised learning framework. To this end, we construct a sequence of candidate kernels using an iterative procedure; our regularized learning algorithms then operate on the corresponding RKHSs, which leads to a classical convex optimization program on the training data. Finally, we use the test data to select the optimal kernel function and regularization parameter. It is worth noting that the proposed method consists of the two-step estimation stated above. In the first step, we use a large amount of unlabeled data to explore the underlying data structure. The optimization problem involved in the second step is as efficient as classical single-kernel approaches. More importantly, we provide theoretical support for our approach and demonstrate its effectiveness by experiments.
The rest of the paper is organized as follows. In Section 2 we introduce basic notation and our two-step kernel learning estimation. Section 3 presents the main theoretical results for the proposed approach, obtained mainly by means of advanced concentration inequalities. Section 4 contains proof details such as the error decomposition and the approximation error. We report a simulation and a real data experiment in Section 5. Some proofs are relegated to the Appendix.
2. The Proposed Algorithm
We first describe the notation used in this paper. Suppose that our algorithm produces a learner $f: X\to Y$ from a compact metric space $X$ to the output space $Y\subseteq\mathbb{R}$. Such a learner $f$ yields for each point $x$ the value $f(x)\in Y$, a prediction made at $x$. The goodness of estimation is usually assessed by a specified loss function $\nu:\mathbb{R}^2\to\mathbb{R}_+$. The most commonly used loss function is the least square one, $\nu(f(x),y)=(f(x)-y)^2$. Let $(x,y)$ be the random variable on $X\times Y$ with probability distribution $\rho$. Within the statistical learning framework, the target function can be formulated as a minimizer of the following functional optimization:
(1) $f^* = \arg\min_{f\in\mathcal{F}} \int_{X\times Y} \nu(f(x),y)\,d\rho(x,y)$.
In particular, for the least square loss, the minimizer has the explicit expression
(2) $f_\rho(x) = \int_Y y\,d\rho(y\mid x), \quad x\in X$,
where $\rho(y\mid x)$ is the conditional probability measure at $x$ induced by $\rho$. Under the fully supervised setting, given available samples $S=\{(x_i,y_i)\}_{i=1}^m$, the main goal of learning is to design an efficient algorithm producing a learner $f_S$ that approximates the regression function $f^*$ well on the whole space. The popular regularized learning algorithms within an RKHS can be stated as
(3) $\inf_{f\in H_K} \frac{1}{m}\sum_{i=1}^{m}\nu(f(x_i),y_i) + \lambda\|f\|_K^2$,
where $H_K$ is a specified RKHS and $0<\lambda\le 1$ is the regularization parameter, balancing the empirical error against the functional complexity in $H_K$. Note that $\lambda$ may depend on the sample size and satisfies $\lim_{m\to\infty}\lambda(m)=0$.
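For the least square loss, program (3) admits the familiar closed-form solution via the representer theorem. The following is a minimal sketch in NumPy; the Gaussian kernel and all function names are our own choices, not notation from the paper:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X[i] - Z[j]||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

def krr_fit(X, y, lam, kernel=gaussian_kernel):
    """Solve (3) for the least square loss: the representer theorem gives
    f(x) = sum_i alpha_i K(x, x_i) with alpha = (G + m*lam*I)^{-1} y."""
    m = len(X)
    G = kernel(X, X)
    alpha = np.linalg.solve(G + m * lam * np.eye(m), y)
    return alpha

def krr_predict(alpha, X_train, X_new, kernel=gaussian_kernel):
    """Evaluate the fitted learner at new points."""
    return kernel(X_new, X_train) @ alpha
```

With a small λ, the fitted function nearly interpolates smooth training data; as λ grows, the penalty $\lambda\|f\|_K^2$ shrinks the solution toward zero.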
In the semisupervised learning framework, the first $m$ samples are labeled as above, followed by $n$ unlabeled samples $\tilde{x}=\{x_{m+1},\dots,x_{m+n}\}$. Denote by $K_0$ a weak kernel taken as the original kernel. Compared to standard kernels, a weak kernel here means one whose complexity is very large or which is less smooth. A learner with a weak kernel usually leads to overfitting, while it can approximate more complicated functions well and hence reduce the estimation bias of the learner. Selecting an appropriate kernel thus requires trading off the functional complexity of the various $H_K$. Motivated by this observation, we propose an iterative procedure as our first step for constructing candidate kernels. At the $k$th step, the next candidate kernel is derived as follows:
(4) $K_{k+1}(x,u) = \frac{1}{m+n}\sum_{i=1}^{m+n} K_k(x,x_i)K_k(u,x_i), \quad x,u\in X$.
The labeled samples are divided into training data $D=\{(x_i,y_i)\}_{i=1}^{l}$ and test data $T=S\setminus D$; we then run our regularized learning algorithm in the associated $H_{K_k}$:
(5) $f_{z,\lambda}^{(k)} := \arg\min_{f\in H_{K_k}} \frac{1}{l}\sum_{i=1}^{l}\nu(f(x_i),y_i) + \lambda\|f\|_{K_k}^2$.
Given the total number $N$ of iteration steps, we minimize the test error of $f_{z,\lambda}^{(k)}$ over $k$ and $\lambda$:
(6) $(k^*,\lambda^*) = \arg\min_{k\in\{1,\dots,N\},\,\lambda\in(0,1]} \frac{1}{m-l}\sum_{(x_i,y_i)\in T}\bigl(f_{z,\lambda}^{(k)}(x_i)-y_i\bigr)^2$.
We take $f_{z,\lambda^*}^{(k^*)}$ as our final learner in the semisupervised setting. Note that we use the least square loss instead of $\nu$ in the final step, since its nice mathematical properties allow its solution to be computed or approximated easily.
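The two-step procedure above can be sketched in a few lines of NumPy. This is our own minimal illustration, not the paper's implementation; function names and the grid of λ values are our assumptions:

```python
import numpy as np

def iterate_kernels(gram0, n_steps):
    """Step (4) on the pooled sample: if G_k is the Gram matrix of K_k on
    all m+n points, then the Gram matrix of K_{k+1} is G_k @ G_k / (m+n)."""
    grams = [gram0]
    for _ in range(n_steps):
        G = grams[-1]
        grams.append(G @ G / G.shape[0])
    return grams

def select_model(grams, train_idx, test_idx, y_train, y_test, lambdas):
    """Steps (5)-(6): fit regularized least squares in each H_{K_k}, then
    pick the (k, lambda) pair with smallest squared error on held-out data."""
    best = (None, None, np.inf)
    for k, G in enumerate(grams):
        G_tr = G[np.ix_(train_idx, train_idx)]   # kernel on training points
        G_te = G[np.ix_(test_idx, train_idx)]    # kernel: test vs. training
        l = len(train_idx)
        for lam in lambdas:
            alpha = np.linalg.solve(G_tr + l * lam * np.eye(l), y_train)
            mse = np.mean((G_te @ alpha - y_test) ** 2)
            if mse < best[2]:
                best = (k, lam, mse)
    return best
```

Only the Gram matrix of $K_0$ on the pooled labeled and unlabeled sample is needed; the unlabeled points enter solely through the kernel update.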
Our motivation for designing the kernel as in (4) is based on the following fact. By the Mercer theorem [12], any given kernel $K_k$ defined on a compact set can be expressed as $K_k(x,u)=\sum_{j=1}^{\infty}\lambda_j e_j(x)e_j(u)$, where $(\lambda_j,e_j)$ are the eigenpairs of the integral operator $L_{K_k}$ defined in (19) below. In general, the problem of selecting a kernel corresponds to a suitable choice of the parameters $\lambda_j$, since the eigenvalues $\lambda_j$ are closely related to the functional complexity of $H_{K_k}$ [13]. In our case, we use an iterative procedure to select an appropriate kernel. To be precise, we define a new candidate kernel by $K_{k+1}(x,u) := \int_X K_k(x,t)K_k(u,t)\,d\rho_X(t)$. Based on the observation $K_{k+1}(x,u)=\sum_{j=1}^{\infty}\lambda_j^2 e_j(x)e_j(u)$, it suffices to find an appropriate iteration step. Since $\rho_X$ is often unknown, we use instead the empirical estimator of $K_{k+1}$ defined in (4) as our candidate kernel. Furthermore, in view of the slow rate of order $1/m$ at which the empirical kernel in (4) converges to its population counterpart, the large amount of unlabeled data guarantees a smaller error generated by random sampling.
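The eigenvalue squaring can be checked directly on the sample, since update (4) restricted to the sample points reads $G_{k+1}=G_kG_k/(m+n)$ in matrix form. The following numerical check is our own illustration (kernel and sample are arbitrary choices):

```python
import numpy as np

# Update (4) on the sample points is G_{k+1} = G @ G / n in matrix form,
# so the eigenvalues of the normalized Gram matrix G/n (which approximate
# those of the integral operator L_K) are squared at each step.
rng = np.random.default_rng(0)
x = rng.uniform(-0.5, 0.5, size=(50, 1))
G = np.exp(-np.abs(x - x.T))          # Gram matrix of a (weak) kernel K_0
n = G.shape[0]

G_next = G @ G / n                    # one step of the empirical update (4)

ev0 = np.linalg.eigvalsh(G / n)       # spectrum before the update
ev1 = np.linalg.eigvalsh(G_next / n)  # spectrum after: the squared values
```

Squaring shrinks the small eigenvalues much faster than the large ones, so each iteration smooths the hypothesis space.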
It is seen from the proposed program above that this method avoids multiple optimizations in the process of learning kernels, and its computation is as efficient as that of standard single-kernel algorithms up to some constant. Moreover, large amounts of information associated with the input space are made full use of, so that some intrinsic data structure may be exploited.
3. Main Results
To highlight our idea with more refined theoretical results, in what follows we are primarily concerned with the least square setting, since the regularized least square algorithm has a closed-form solution. First of all, by the law of large numbers, with high probability we can replace the first step (4) with the following iterative procedure:
(7) $K_{k+1}(x,u) = \int_X K_k(x,t)K_k(u,t)\,d\rho_X(t), \quad x,u\in X$,
where $\rho_X$ is the marginal distribution induced by $\rho$. We denote by $f_{z,\lambda}^{(N)}$ the learner derived from (5) at the $N$th iteration. For notational simplicity, we write $f_z=f_{z,\lambda}^{(N)}$. In this paper, we focus on the generalization error of the proposed algorithm, that is,
(8) $\|f_z - f_\rho\|_{L^2_{\rho_X}}^2$.
A small value of $\|f_z-f_\rho\|_{L^2_{\rho_X}}^2$ implies a good prediction ability of $f_z$. Different from the classical literature under fixed-kernel settings, such as [13, 14], the main goal of this paper is to indicate theoretically some specific advantages over fixed-kernel approaches.
To simplify the theoretical analysis, we assume that the conditional distribution $\rho(\cdot\mid x)$ is supported on $[-M,M]$; it follows that $|f_\rho(x)|\le M$ almost everywhere. To this end, we introduce the projection operator as follows.
Definition 1.
Define the projection operator $\pi_M$ on measurable functions $f:X\to\mathbb{R}$ as
(9) $\pi_M(f)(x) = \begin{cases} M, & \text{if } f(x) > M,\\ f(x), & \text{if } -M\le f(x)\le M,\\ -M, & \text{if } f(x) < -M.\end{cases}$
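Pointwise, $\pi_M$ is simply truncation at $\pm M$; in NumPy this is a one-liner (our own illustration):

```python
import numpy as np

def project(f_values, M):
    """The projection pi_M of Definition 1, applied pointwise to an array
    of function values: clip everything into the interval [-M, M]."""
    return np.clip(f_values, -M, M)
```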
Note that the error bound between the projections of $f_z$ and $f_\rho$ can be expressed as
(10) $\|\pi_M(f_z) - f_\rho\|_{L^2_{\rho_X}}^2 = \mathcal{E}(\pi_M(f_z)) - \mathcal{E}(f_\rho)$,
where $\mathcal{E}(f)=\int_Z (f(x)-y)^2\,d\rho$ denotes the population error of the function $f$.
Since the regression function $f_\rho$ may not lie in $H_K$, the approximation error between $H_K$ and $f_\rho$ is needed. To account for the error induced by sampling and the approximation error, we introduce the empirical error
(11) $\mathcal{E}_z(f) = \frac{1}{m}\sum_{i=1}^{m}(f(x_i)-y_i)^2$,
and an approximation error associated with the joint distribution $\rho$:
(12) $D(\lambda) = \|f_\lambda - f_\rho\|_{L^2_{\rho_X}}^2 + \lambda\|f_\lambda\|_{K_N}^2$,
where $f_\lambda$ is called the regularizing function, given as
(13) $f_\lambda = \arg\min_{f\in H_{K_N}} \|f - f_\rho\|_{L^2_{\rho_X}}^2 + \lambda\|f\|_{K_N}^2$.
Remark 2.
In the learning theory literature, one usually assumes that there exist $c_\beta>0$ and $0<\beta\le 1$ such that $D(\lambda)\le c_\beta\lambda^\beta$. In fact, $\beta=1$ holds if and only if $f_\rho\in H_{K_N}$ [15–17]. Strictly speaking, $D(\lambda)$ is formally studied in approximation theory.
To obtain convergence rates for (10), we decompose the term $\mathcal{E}(\pi_M(f_z))-\mathcal{E}(f_\rho)$ into two parts, the approximation error and the sample error; see [14, 15].
Proposition 3.
Let $f_z$ be defined by (5). Then the following inequality holds:
(14) $\mathcal{E}(\pi_M(f_z)) - \mathcal{E}(f_\rho) \le S(z,\lambda) + D(\lambda)$,
where
(15) $S(z,\lambda) = \mathcal{E}(\pi_M(f_z)) - \mathcal{E}_z(\pi_M(f_z)) + \mathcal{E}_z(f_\lambda) - \mathcal{E}(f_\lambda)$.
Proposition 3 shows that $\mathcal{E}(\pi_M(f_z))-\mathcal{E}(f_\rho)$ is bounded by $S(z,\lambda)+D(\lambda)$. We usually call $S(z,\lambda)$ the sample error, since this quantity mainly involves random sampling and the complexity of $H_K$.
Bounding the sample error $S(z,\lambda)$ is a standard technique in learning theory [13, 15, 18]. To this end, we introduce the notion of covering number to measure the complexity of $H_K$.
Definition 4.
Let $(\mathcal{M},d)$ be a pseudometric space and $S\subset\mathcal{M}$. For any $\epsilon>0$, the covering number $N(S,\epsilon,d)$ of $S$ with respect to $\epsilon$ and $d$ is the minimal number of balls of radius $\epsilon$ needed to cover $S$:
(16) $N(S,\epsilon,d) = \min\bigl\{l\in\mathbb{N} : S\subset\bigcup_{j=1}^{l}B(s_j,\epsilon)\ \text{for some sequence}\ \{s_j\}_{j=1}^{l}\subset\mathcal{M}\bigr\}$.
Recall that a kernel function is called a Mercer kernel if it is symmetric, positive definite, and continuous. Several properties of Mercer kernels have been well established and can be found in [12, 15]. Suppose that $\kappa := \sup_{x\in X}\sqrt{K_0(x,x)} < \infty$.
Assumption 5.
Suppose that the Mercer kernel $K_N$ has polynomial complexity with exponent $s>0$:
(17) $\log N(B_1,\eta) \le C_N\Bigl(\frac{1}{\eta}\Bigr)^{s}, \quad \forall\eta>0$,
where $B_1$ is the unit ball of $H_K$ and $C_N$ is some constant. For the Sobolev space $H^h$ on $\mathbb{R}^p$ of order $h$, it is known [12] that $s=2p/h$.
On the other hand, to quantify the approximation error $D(\lambda)$ and characterize the regularity of $f_\rho$, we introduce the fractional integral operator associated with $K$. Recall that the standard inner product on $L^2_{\rho_X}$ is defined as
(18) $\langle f,g\rangle_{\rho_X} = \int_X f(x)g(x)\,d\rho_X$.
We can then define the integral operator $L_K$ on $L^2_{\rho_X}(X)$:
(19) $L_K f(x) = \int_X K(x,t)f(t)\,d\rho_X(t)$.
It has been verified in [16] that $L_K$ is a compact, self-adjoint, and positive definite operator from $L^2_{\rho_X}(X)$ to $L^2_{\rho_X}(X)$, so its fractional powers are well defined. Moreover, it is easy to check that $L_{K_{k+1}}=L_{K_k}^2$ and $L_{K_N}=L_{K_0}^{2^N}$. Lemma 12 below will show that if a kernel that is weak with respect to the true function is used for learning, the approximation ability cannot be improved even if the true function is sufficiently smooth. This is why we propose the iterative procedure (4) for updating kernels.
With these preparations, we can state the main results depending on the capacity of HKN and the smoothness of target function as follows.
Theorem 6.
Let $f_z$ be defined by (5) and let Assumption 5 hold. If $L_{K_0}^{-r}(f_\rho)\in L^2_{\rho_X}(X)$ for some $r>0$, then, for any $0<\delta<1$ and $m>1360\log(1/\delta)/\tilde c^{(1+s)/s}$, with probability at least $1-\delta$,
(20) $S(z,\lambda) \le \frac{7\bigl(\kappa^N\lambda^{\min\{(r-N)/(2N),\,0\}}\|L_{K_0}^{-r}f_\rho\|_{L^2_{\rho_X}} + 3M\bigr)^2\log(2/\delta)}{3m} + \frac{1}{2}\lambda^{\min\{r/N,\,2\}}\|L_{K_0}^{-r}f_\rho\|_{L^2_{\rho_X}}^2 + \frac{1}{2}\bigl(\mathcal{E}(\pi_M(f_z)) - \mathcal{E}(f_\rho)\bigr) + \tilde c\,M^{s/(1+s)}\Bigl(\frac{1}{\lambda}\Bigr)^{s/(2(1+s))}\Bigl(\frac{1}{m}\Bigr)^{1/(1+s)}$,
where the constant $\tilde c$ is given in Proposition 11.
From Proposition 3, the term $(1/2)(\mathcal{E}(\pi_M(f_z))-\mathcal{E}(f_\rho))$ in $S(z,\lambda)$ can be absorbed by studying the equivalent sample error $\tilde S(z,\lambda) = S(z,\lambda) - (1/2)(\mathcal{E}(\pi_M(f_z))-\mathcal{E}(f_\rho))$. The following corollary provides an asymptotically optimal convergence rate for $f_z$. The proof can be found in the Appendix.
Corollary 7.
Let $f_z$ be defined by (5), and let Assumption 5 hold. If $L_{K_0}^{-r}f_\rho\in L^2_{\rho_X}$, then, when $1<N<r<2N$, for any $0<\delta<1$ and $m>1360\log(1/\delta)/\tilde c^{(1+s)/s}$, with probability at least $1-\delta$,
(21) $\tilde S(z,\lambda) = \log\frac{2}{\delta}\,O\Bigl(\Bigl(\frac{1}{m}\Bigr)^{2r/(2r(1+s)+sN)}\Bigr)$,
where $\lambda=(1/m)^{2N/(2r(1+s)+sN)}$. In particular, for $K_N\in C^\infty(X)$, we have
(22) $\tilde S(z,\lambda) = \log\frac{2}{\delta}\,O\Bigl(\Bigl(\frac{1}{m}\Bigr)^{1-\epsilon}\Bigr)$,
where $\epsilon$ is an arbitrary positive number.
It is seen from Corollary 7 that the ideal choice of the regularization parameter $\lambda$ depends on the two quantities $r$ and $s$, which are often unknown in advance. In practice, cross-validation is one of the commonly used alternatives. It is worth noting that our approach selects the ideal kernel and the regularization parameter simultaneously, which differs significantly from classical fixed-kernel methods.
Next, we compare our rate with existing results. Recently, sharp learning rates were established by advanced empirical process techniques in [14]. Note that we use $K_0$ in place of the kernel $K$ appearing in algorithm (3), whose covering number has polynomial decay exponent $p$. To be precise, an upper bound for the sample error in [14] was given as
(23) $S(z,\lambda) \le 2D(\lambda) + 28\kappa\sqrt{\frac{D(\lambda)}{3m\lambda}}\log\frac{2}{\delta} + \frac{724M^2}{m}\log\frac{2}{\delta} + 2C_1\Bigl(\frac{1}{m}\Bigr)^{2/(2+p)}\Bigl(\frac{1}{\lambda}\Bigr)^{p/(2+p)}$.
From formula (A.2) in the Appendix, we obtain $H_{K_N} = L_{K_N}^{1/2}(L^2_{\rho_X}) = L_{K_0}^{2^{N-1}}(L^2_{\rho_X}) = L_{K_0}^{2^{N-1}-1/2}(H_{K_0})$. By the equivalence between the covering number and the spectrum of $L_K$ (see Theorem 10 of [13]), a sufficiently large $N$ ensures that $s\ll p$. Additionally, when $1<N<r<2N$, the factor $\lambda^{\min\{(r-N)/(2N),0\}}$ in Theorem 6 and the ratio $D(\lambda)/\lambda$ above are both constants, and hence $\lambda^{\min\{r/N,2\}}=\lambda^{r/N}<\lambda=cD(\lambda)$ for some constant $c$, since $0<\lambda<1$. In summary, our derived sample error is sharper than that in [14]. Thus, if the regression function is sufficiently smooth, that is, if it can be approximated well by $H_{K_N}$, the corresponding learning rate of Theorem 6 outperforms that in [14]. This shows that if we know the smoothness $r$ of the target function, we can choose a proper kernel $K_N$ (which requires $N>r/2$) to improve the sample error effectively. This provides an excellent theoretical basis for choosing the kernel function in real problems. Of course, since real samples contain noise, the chosen kernel also has to be smoother than the target function.
4. Error Analysis
The sample error $S(z,\lambda)$ is analyzed by empirical process techniques. Early studies of the sample error mainly applied the McDiarmid inequality, without invoking any notion of "space complexity." However, the McDiarmid inequality cannot capture the variance information of the random variables. Later, the VC dimension was introduced into the literature, and related concepts such as covering numbers, combined with Bernstein-type probability inequalities, significantly reduce the sample error; see [19] for a detailed overview. To bound the sample error, we split it into two parts:
(24) $S(z,\lambda) = \bigl[\mathcal{E}_z(f_\lambda) - \mathcal{E}_z(f_\rho)\bigr] - \bigl[\mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho)\bigr] + \bigl[\mathcal{E}(\pi_M(f_z)) - \mathcal{E}(f_\rho)\bigr] - \bigl[\mathcal{E}_z(\pi_M(f_z)) - \mathcal{E}_z(f_\rho)\bigr] := S_1(z,\lambda) + S_2(z,\lambda)$.
Note that $S_1(z,\lambda)$ does not involve any functional complexity and can be estimated easily by the following one-sided Bernstein probability inequality.
Lemma 8.
Let $\xi$ be a random variable on the probability space $Z$ such that $|\xi - E(\xi)|\le M_\xi$ for some constant $M_\xi$, and denote its variance by $\sigma^2$. Then, for any $0<\delta<1$, with probability at least $1-\delta$,
(25) $\frac{1}{m}\sum_{i=1}^{m}\xi(z_i) - E(\xi) \le \frac{2M_\xi\log(1/\delta)}{3m} + \sqrt{\frac{2\sigma^2\log(1/\delta)}{m}}$.
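As a sanity check of Lemma 8, the bound can be verified by Monte Carlo simulation for a simple bounded variable; this illustration is our own and not part of the paper's analysis:

```python
import numpy as np

# Monte Carlo check of the one-sided Bernstein bound (25) for
# xi ~ Uniform[0, 1]: E(xi) = 1/2, variance 1/12, |xi - E(xi)| <= 1/2.
rng = np.random.default_rng(1)
m, delta, trials = 200, 0.05, 2000
M_xi, mean, var = 0.5, 0.5, 1.0 / 12.0

bound = (2 * M_xi * np.log(1 / delta) / (3 * m)
         + np.sqrt(2 * var * np.log(1 / delta) / m))

# Fraction of trials in which the empirical-mean deviation exceeds the bound;
# Lemma 8 guarantees this happens with probability at most delta.
samples = rng.uniform(0, 1, size=(trials, m))
violation_rate = np.mean(samples.mean(axis=1) - mean > bound)
```

In practice the observed violation rate is far below $\delta$, reflecting the slack of the Bernstein bound for well-behaved distributions.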
Proposition 9.
Define the random variable $\xi_1(z) = (f_\lambda(x)-y)^2 - (f_\rho(x)-y)^2$. For any $0<\delta<1$, with probability at least $1-\delta/2$,
(26) $S_1(z,\lambda) = \frac{1}{m}\sum_{i=1}^{m}\xi_1(z_i) - E(\xi_1) \le \frac{7(\|f_\lambda\|_\infty + 3M)^2\log(2/\delta)}{3m} + \frac{1}{2}\|f_\lambda - f_\rho\|_\rho^2$.
Proof.
Notice that
(27) $\xi_1(z) = (f_\lambda(x) - f_\rho(x))(f_\lambda(x) + f_\rho(x) - 2y)$.
Since $|f_\rho(x)|\le M$ almost everywhere,
(28) $|\xi_1| \le c := (\|f_\lambda\|_\infty + M)(\|f_\lambda\|_\infty + 3M)$.
This yields $|\xi_1 - E(\xi_1)| \le M_{\xi_1} := 2c$.
Additionally, $E(\xi_1^2)$ satisfies
(29) $E\bigl[(f_\lambda(x)-f_\rho(x))^2(f_\lambda(x)+f_\rho(x)-2y)^2\bigr] \le (\|f_\lambda\|_\infty + 3M)^2\|f_\lambda - f_\rho\|_\rho^2$,
which implies that $\sigma^2(\xi_1) \le E(\xi_1^2) \le (\|f_\lambda\|_\infty + 3M)^2\|f_\lambda - f_\rho\|_\rho^2$.
By Lemma 8 and the basic inequality $\sqrt{ab}\le (a+b)/2$ ($a,b\ge 0$), we obtain
(30) $\frac{1}{m}\sum_{i=1}^{m}\xi_1(z_i) - E(\xi_1) \le \frac{7(\|f_\lambda\|_\infty + 3M)^2\log(2/\delta)}{3m} + \frac{1}{2}\|f_\lambda - f_\rho\|_\rho^2$
with probability at least $1-\delta/2$.
Bounding the sample error $S_2(z,\lambda)$ is more involved, since the estimator $f_z$ varies with the random sample. To handle it, an advanced uniform concentration inequality is required [14].
Lemma 10.
Let $G$ be a set of functions on $Z$ such that, for every $g\in G$, $|g-E(g)|\le B$ almost everywhere and $E(g^2)\le c_\rho E(g)$ for some constants $B$ and $c_\rho$. Then, for any positive $\varepsilon$ and $0<\alpha\le 1$,
(31) $\mathrm{Prob}_{z\in Z^m}\Bigl\{\sup_{g\in G}\frac{E(g) - (1/m)\sum_{i=1}^{m}g(z_i)}{\sqrt{E(g)+\varepsilon}} \ge 4\alpha\sqrt{\varepsilon}\Bigr\} \le N(G,\alpha\varepsilon)\exp\Bigl(-\frac{\alpha^2 m\varepsilon}{2c_\rho + (2/3)B}\Bigr)$.
Proposition 11.
Let $f_z$ be defined by (5) and suppose that Assumption 5 holds. If $m>1360\log(1/\delta)/\tilde c^{(1+s)/s}$, then, for any $0<\delta<1$, with probability at least $1-\delta$,
(32) $S_2(z,\lambda) \le \frac{1}{2}\bigl(\mathcal{E}(\pi_M(f_z)) - \mathcal{E}(f_\rho)\bigr) + \tilde c\,M^{s/(1+s)}\Bigl(\frac{1}{\lambda}\Bigr)^{s/(2(1+s))}\Bigl(\frac{1}{m}\Bigr)^{1/(1+s)}$,
where $\tilde c = (1360\,C_N 16^s M^{2+s})^{1/(1+s)}$.
Proof.
Introduce the function set $F_R$ defined via the ball $B_R$ of $H_{K_N}$:
(33) $F_R = \bigl\{(\pi_M(f)(x)-y)^2 - (f_\rho(x)-y)^2 : f\in B_R\bigr\}$.
Each function in $F_R$ can be written as $g(z) = (\pi_M(f)(x)-y)^2 - (f_\rho(x)-y)^2$ with $f\in B_R$. It follows that $E(g) = \mathcal{E}(\pi_M(f)) - \mathcal{E}(f_\rho)\ge 0$, $(1/m)\sum_{i=1}^{m}g(z_i) = \mathcal{E}_z(\pi_M(f)) - \mathcal{E}_z(f_\rho)$, and
(34) $g(z) = (\pi_M(f)(x) - f_\rho(x))(\pi_M(f)(x) + f_\rho(x) - 2y)$.
Since $\|\pi_M(f)\|_\infty\le M$ and $|f_\rho|\le M$ almost everywhere, we have
(35) $|g(z)| \le 8M^2$,
which implies
(36) $|g - E(g)| \le B := 16M^2$.
Additionally, notice that
(37) $E(g^2) = \int_X (\pi_M(f)(x)-f_\rho(x))^2(\pi_M(f)(x)+f_\rho(x)-2y)^2 \le 16M^2\|\pi_M(f)-f_\rho\|_\rho^2 = 16M^2 E(g)$.
By Lemma 10 with $B = c_\rho = 16M^2$ and $\alpha = 1/4$, we obtain
(38) $\frac{\mathcal{E}(\pi_M(f)) - \mathcal{E}(f_\rho) - (\mathcal{E}_z(\pi_M(f)) - \mathcal{E}_z(f_\rho))}{\sqrt{\mathcal{E}(\pi_M(f)) - \mathcal{E}(f_\rho) + \varepsilon}} \le \sqrt{\varepsilon}$,
with probability at least
(39) $1 - N\bigl(F_R,\tfrac{1}{4}\varepsilon\bigr)\exp\Bigl(-\frac{m\varepsilon}{16(2c_\rho + (2/3)B)}\Bigr) \ge 1 - N\bigl(F_R,\tfrac{1}{4}\varepsilon\bigr)\exp\Bigl(-\frac{m\varepsilon}{680M^2}\Bigr)$.
It remains to estimate the covering number $N(F_R,\tfrac{1}{4}\varepsilon)$.
For arbitrary $g_1,g_2\in F_R$, we obtain
(40) $|g_1(z) - g_2(z)| \le |f_1(x) - f_2(x)|\,|\pi_M(f_1)(x) + \pi_M(f_2)(x) - 2y|$.
Since $|\pi_M(f)(x)|\le M$ for any $x\in X$ and $\pi_M$ is a contraction, we have
(41) $|g_1(z) - g_2(z)| \le 4M|f_1(x) - f_2(x)|, \quad \forall f_1,f_2\in B_R$.
Hence
(42) $N\bigl(F_R,\tfrac{1}{4}\varepsilon\bigr) \le N\Bigl(B_R,\frac{\varepsilon}{16M}\Bigr) \le N\Bigl(B_1,\frac{\varepsilon}{16MR}\Bigr)$.
By Assumption 5, set
(43) $C_N(16MR)^s\Bigl(\frac{1}{\varepsilon}\Bigr)^{s} - \frac{m\varepsilon}{680M^2} = \log\delta$,
and write $b = C_N(16MR)^s$ and $a = m/(680M^2)$. Then (43) can be rewritten as
(44) $\varepsilon^{1+s} - \frac{\log(1/\delta)}{a}\varepsilon^{s} - \frac{b}{a} = 0$.
By Lemma 7 of [16],
(45) $\varepsilon \le \max\Bigl\{\frac{2\log(1/\delta)}{a}, \Bigl(\frac{2b}{a}\Bigr)^{1/(1+s)}\Bigr\}$.
Substituting this into (38) and noticing that $\sqrt{\mathcal{E}(\pi_M(f)) - \mathcal{E}(f_\rho) + \varepsilon}\,\sqrt{\varepsilon} \le \frac{1}{2}\bigl(\mathcal{E}(\pi_M(f)) - \mathcal{E}(f_\rho)\bigr) + \varepsilon$, we see that $\mathcal{E}(\pi_M(f)) - \mathcal{E}(f_\rho) - (\mathcal{E}_z(\pi_M(f)) - \mathcal{E}_z(f_\rho))$ can be bounded by
(46) $\frac{1}{2}\bigl(\mathcal{E}(\pi_M(f)) - \mathcal{E}(f_\rho)\bigr) + \max\Bigl\{\frac{1360\log(1/\delta)}{m},\ \tilde c\,R^{s/(1+s)}\Bigl(\frac{1}{m}\Bigr)^{1/(1+s)}\Bigr\}$.
By Lemma 4.1 of [14], for each $z\in Z^m$ we have
(47) $\|f_z\|_{K_N} \le \frac{M}{\sqrt{\lambda}}$.
Hence, if
(48) $m > \frac{1360\log(1/\delta)}{\tilde c^{(1+s)/s}}$,
then
(49) $\frac{1360\log(1/\delta)}{m} \le \tilde c\,R^{s/(1+s)}\Bigl(\frac{1}{m}\Bigr)^{1/(1+s)}$.
The proof of Proposition 11 is completed by taking $R = M/\sqrt{\lambda}$.
By Proposition 9, it remains to bound the two quantities $\|f_\lambda\|_\infty$ and $\|f_\lambda - f_\rho\|_\rho^2$ involved in $S_1(z,\lambda)$.
Lemma 12.
Let $f_\lambda$ be defined by (13). If $L_{K_0}^{-r}f_\rho\in L^2_{\rho_X}$, then
(50) $\|f_\lambda\|_{K_N} \le \lambda^{\min\{(r-N)/(2N),\,0\}}\|L_{K_0}^{-r}f_\rho\|_{L^2_{\rho_X}}, \qquad \|f_\lambda - f_\rho\|_{L^2_{\rho_X}} \le \lambda^{\min\{r/(2N),\,1\}}\|L_{K_0}^{-r}f_\rho\|_{L^2_{\rho_X}}$.
The estimate of $f_\lambda$ extends Lemma 4.3 in [18], with $K_N$ in place of $K_0$. Now we discuss the second quantity. In the classical algorithm (3), when $r>1$, increasing the smoothness of $f_\rho$ does not improve the error $\|f_\lambda - f_\rho\|_{L^2_{\rho_X}}$; this is called the "saturation" phenomenon in the inverse problems literature. For the algorithm we study, saturation occurs only when $r>2N$. This shows a specific advantage of using $K_N$ instead of the original $K_0$ from the perspective of approximation theory.
Proof of Lemma 12.
According to [17], $f_\lambda = (\lambda I + L_{K_N})^{-1}L_{K_N}f_\rho$. Notice that
(51) $L_{K_N} = L_{K_0}^{2^N}$,
which yields
(52) $f_\lambda = (\lambda I + L_{K_0}^{2^N})^{-1}L_{K_0}^{2^N}f_\rho = (\lambda I + L_{K_0}^{2^N})^{-1}L_{K_0}^{2^N}L_{K_0}^{r}L_{K_0}^{-r}f_\rho = \sum_{k=1}^{\infty}\frac{\lambda_k^{2^N+r}}{\lambda + \lambda_k^{2^N}}\langle L_{K_0}^{-r}f_\rho, e_k\rangle_{L^2_{\rho_X}} e_k$,
where $\{\lambda_k,e_k\}_k$ is the spectrum of the integral operator $L_{K_0}$. Thus
(53) $\|f_\lambda\|_{K_N}^2 = \sum_{k=1}^{\infty}\frac{\lambda_k^{2^N+2r}}{(\lambda + \lambda_k^{2^N})^2}\langle L_{K_0}^{-r}f_\rho, e_k\rangle_{L^2_{\rho_X}}^2 \le \lambda^{\min\{(r-N)/N,\,0\}}\|L_{K_0}^{-r}f_\rho\|_{L^2_{\rho_X}}^2$.
On the other hand, noting the fact that $f_\lambda - f_\rho = -\lambda(\lambda I + L_{K_N})^{-1}f_\rho$ and the assumption $L_{K_0}^{-r}f_\rho\in L^2_{\rho_X}$, we have
(54) $\|f_\lambda - f_\rho\|_{L^2_{\rho_X}} = \lambda\bigl\|(\lambda I + L_{K_0}^{2^N})^{-1}L_{K_0}^{r}L_{K_0}^{-r}f_\rho\bigr\|_{L^2_{\rho_X}} = \lambda\Bigl\|\sum_{k=1}^{\infty}\frac{\alpha_k\lambda_k^{r}}{\lambda_k^{2^N}+\lambda}e_k\Bigr\|_{L^2_{\rho_X}} \le \lambda^{\min\{r/(2N),\,1\}}\|L_{K_0}^{-r}f_\rho\|_{L^2_{\rho_X}}$,
where $\alpha_k = \langle L_{K_0}^{-r}f_\rho, e_k\rangle_{L^2_{\rho_X}}$, so that $\sum_k\alpha_k^2 = \|L_{K_0}^{-r}f_\rho\|_{L^2_{\rho_X}}^2$. This completes the proof of Lemma 12.
Combining Propositions 9 and 11, Lemma 12, and the fact that $\|f_\lambda\|_\infty \le \kappa^N\|f_\lambda\|_{K_N}$, Theorem 6 follows easily.
5. Numerical Experiments
5.1. Simulated Example
Although this paper focuses mainly on theoretical analysis, we carry out some experiments to show the method's efficiency in practice. We consider a simulated example in which the true regression function is an additive model:
(55) $f^*(x_i) = 5f_1(x_{i1}) + 3f_2(x_{i2}) + 4f_3(x_{i3}) + 6f_4(x_{i4})$,
with $f_1(u)=2\exp(-0.1u)$, $f_2(u)=(2u-1)^2$, $f_3(u)=\sin(\pi u)/(2-\sin(\pi u))$, and $f_4(u)=0.2\sin(\pi u)+0.1\cos(\pi u)+0.3\sin^2(\pi u)+0.5\cos^3(\pi u)+0.5\sin^3(\pi u)$. We first generate the $x_{ij}$ independently from $U(-0.5,0.5)$ and then generate $y_i=f^*(x_i)+\epsilon_i$ with $\epsilon_i\sim N(0,0.1)$.
For this example, three scenarios are considered, with $(m=200,n=30)$, $(m=400,n=100)$, and $(m=500,n=120)$, and each scenario is repeated 50 times. We use the widely used Gaussian kernel $K_\sigma(x,u)=\exp(-\|x-u\|^2/\sigma^2)$, where the parameter $\sigma$ is specified by 10-fold cross-validation on each data set. Besides, as mentioned before, we start with a weak kernel and search for a better one iteratively. A standard weak kernel is $K_{\mathrm{weak}}(x,y)=e^{-\lambda\|x-y\|}$, where $\lambda$ is an adjustable parameter, which we set to $\lambda=0.1$.
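The data-generating step of this simulation and the first kernel update (4) can be sketched as follows; the random seed, array layout, and the reading of the garbled trigonometric terms in (55) as powers ($\sin^2$, $\cos^3$, $\sin^3$) are our own assumptions:

```python
import numpy as np

# Data from the additive model (55), with the weak kernel
# K_weak(x, y) = exp(-0.1 * ||x - y||) taken as K_0.
rng = np.random.default_rng(42)

def f_star(X):
    f1 = 2 * np.exp(-0.1 * X[:, 0])
    f2 = (2 * X[:, 1] - 1) ** 2
    f3 = np.sin(np.pi * X[:, 2]) / (2 - np.sin(np.pi * X[:, 2]))
    u = np.pi * X[:, 3]
    f4 = (0.2 * np.sin(u) + 0.1 * np.cos(u) + 0.3 * np.sin(u) ** 2
          + 0.5 * np.cos(u) ** 3 + 0.5 * np.sin(u) ** 3)
    return 5 * f1 + 3 * f2 + 4 * f3 + 6 * f4

m, n = 200, 30
X = rng.uniform(-0.5, 0.5, size=(m + n, 4))     # labeled + unlabeled inputs
y = f_star(X[:m]) + rng.normal(0, np.sqrt(0.1), size=m)  # labels with noise

dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
G0 = np.exp(-0.1 * dist)        # weak-kernel Gram matrix on all m+n points
G1 = G0 @ G0 / (m + n)          # one step of the kernel update (4)
```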
The performance of the various methods is measured by the MSE, the relative mean squared error of each kernel-based regression. The averaged performance measures are summarized in Table 1, where SKM denotes the single kernel method with the Gaussian kernel, UKM denotes the proposed method without using any unlabeled data, and SEKM denotes the proposed method using all the unlabeled data.
Table 1: Performance obtained using various kernel-based methods.

Method | (m = 200, n = 30) | (m = 400, n = 100) | (m = 500, n = 120)
SKM    | 0.090 ± 0.032     | 0.093 ± 0.022      | 0.095 ± 0.018
UKM    | 0.085 ± 0.052     | 0.082 ± 0.037      | 0.080 ± 0.041
SEKM   | 0.080 ± 0.062     | 0.077 ± 0.042      | 0.075 ± 0.036
From Table 1 we find that the proposed kernel learning method yields better prediction accuracy on these data sets than a single kernel. A likely explanation is that the true function is rather complicated, and the Gaussian kernel has a limited learning ability in this case. Starting from a weak kernel means that the hypothesis space is much larger than that induced by the Gaussian kernel, so the true function can be learned well by our algorithm. Moreover, the last row of Table 1 shows that using the unlabeled data further reduces the prediction error, as expected from the theory.
5.2. Real Example
The proposed method is also applied to a real example, the Boston housing data, which is publicly available. The Boston housing data concerns the median value of owner-occupied homes in each of the 506 census tracts in the Boston Standard Metropolitan Statistical Area in 1970. It consists of 13 variables, including per capita crime rate by town (CRIM), proportion of residential land zoned for lots over 25,000 square feet (ZN), proportion of nonretail business acres per town (INDUS), Charles River dummy variable (CHAS), nitric oxides concentration (NOX), average number of rooms per dwelling (RM), proportion of owner-occupied units built prior to 1940 (AGE), weighted distances to five Boston employment centers (DIS), index of accessibility to radial highways (RAD), full-value property-tax rate per $10000 (TAX), pupil-teacher ratio by town (PTRATIO), the proportion of blacks by town (B), and lower status of the population (LSTAT), which may affect the housing price.
In our analysis, all variables are standardized. To compute the averaged prediction error, the data set is randomly split into training data and a test set of 30 observations. To compare the performance of our method with that of a single kernel method, we split the training data under three scenarios, $(m=300,n=30)$, $(m=350,n=40)$, and $(m=426,n=50)$, and each scenario is repeated 50 times. The parameter $\sigma$ is again specified by 10-fold cross-validation on each data set. The prediction performance of the single kernel method versus the proposed method is summarized in Table 2.
Table 2: Best prediction accuracy on the test set for the Boston housing data.

Method | (m = 300, n = 30) | (m = 350, n = 40) | (m = 426, n = 50)
SKM    | 1.774 ± 0.0931    | 1.712 ± 0.0835    | 1.675 ± 0.0733
UKM    | 1.785 ± 0.851     | 1.684 ± 0.737     | 1.633 ± 0.841
SEKM   | 1.764 ± 0.851     | 1.705 ± 0.752     | 1.625 ± 0.804
As shown in Table 2, the proposed method improves prediction accuracy in all but one of the six comparisons with the single kernel method. This result is acceptable in practice: the underlying rule of this real data set is unknown, and a uniformly perfect performance across settings can hardly be guaranteed. Overall, the proposed method is a simple but efficient kernel learning method within the family of kernel methods.
6. Conclusions and Discussions
This paper has discussed kernel learning problems in the semisupervised setting. Our candidate kernel sequence is generated by a simple iterative procedure using large amounts of unlabeled data. Under mild assumptions on the target function, we have shown that a kernel can be matched theoretically so as to reduce efficiently the sample error incurred by one-kernel learning. This also shows that, in our case, learning the kernel function outperforms traditional kernel-based learning algorithms. Moreover, a simulation example and a real data experiment were conducted to demonstrate the effectiveness of the proposed method.
We note that the complexity of the function space in this paper is described by the covering number, which is a straightforward concept but not a perfect choice theoretically. Given the way the kernel functions are constructed in the text, Assumption 5 could be replaced by an assumption on the asymptotic behavior of the eigenvalues of the integral operator $L_{K_0}$. Based on the relationships among covering numbers, entropy, and Rademacher complexity, better theoretical results might be achieved; this will be our subsequent work. We attempt to explore intrinsic structure of the input space by selecting an appropriate kernel; there may well be other, more effective ways to explore such underlying structure.
Appendix
The integral operator $L_K$ has the following properties, which were proved in [15].
(1) $L_K$ is a positive, self-adjoint, and compact operator from $L^2_{\rho_X}(X)$ to $L^2_{\rho_X}(X)$. Consequently, by the classical spectral theorem, its eigenfunctions $e_1,e_2,\dots$ form an orthogonal basis of $L^2_{\rho_X}$, and the corresponding eigenvalues $\lambda_1,\lambda_2,\dots$ are either finitely many or monotonically decreasing with $\lim_{i\to\infty}\lambda_i=0$.
(2) For each $\eta>0$, $L_K^\eta: L^2_{\rho_X}(X)\to L^2_{\rho_X}(X)$ is defined as
(A.1) $L_K^\eta f(x) = \sum_{k=1}^{\infty}\lambda_k^\eta\langle f, e_k\rangle e_k, \quad \forall f\in L^2_{\rho_X}(X)$.
Denote $\Gamma=\{i : \lambda_i>0\}$; then $\{\sqrt{\lambda_i}\,e_i : i\in\Gamma\}$ forms an orthogonal basis of $H_K$. Furthermore, $L_K^{1/2}$ is an isometric isomorphism between $L^2_{\rho_X}$ and $H_K$. In particular, the following holds:
(A.2) $\|f\|_{L^2_{\rho_X}} = \|L_K^{1/2}f\|_K$.
Proof of Corollary 7.
Notice that when $1<N<r<2N$, the factor $\lambda^{\min\{(r-N)/(2N),\,0\}}$ in the conclusion of Theorem 6 is a constant and $\lambda^{\min\{r/N,\,2\}}=\lambda^{r/N}$. Taking
(A.3) $\lambda^{r/N} = \Bigl(\frac{1}{\lambda}\Bigr)^{s/(2(1+s))}\Bigl(\frac{1}{m}\Bigr)^{1/(1+s)}$,
we obtain $\lambda = (1/m)^{2N/(2r(1+s)+sN)}$. Thus
(A.4) $\tilde S(z,\lambda) = \log\frac{2}{\delta}\,O\Bigl(\Bigl(\frac{1}{m}\Bigr)^{2r/(2r(1+s)+sN)}\Bigr)$.
In addition, if $K_N\in C^\infty(X)$, the corresponding $s$ approaches $0$, by the classical conclusion described in [14]. This completes the proof of Corollary 7.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The second author’s research is supported partially by National Natural Science Foundation of China (no. 11301421) and Fundamental Research Funds for the Central Universities of China (Grant nos. JBK141111, 14TD0046, and JBK151134).
References
[1] V. N. Vapnik.
[2] B. Schölkopf and A. Smola.
[3] O. Chapelle, J. Weston, and B. Schölkopf, "Cluster kernels for semi-supervised learning," in Proceedings of the 16th Annual Neural Information Processing Systems Conference (NIPS '02), New York, NY, USA, December 2002.
[4] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, "lp-norm multiple kernel learning."
[5] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan, "Learning the kernel matrix with semi-definite programming."
[6] Q. Wu, Y. M. Ying, and D. X. Zhou, "Multi-kernel regularized classifiers."
[7] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, "Large scale multiple kernel learning."
[8] C. Cortes, "Invited talk: can learning kernels help performance?" in Proceedings of the 26th Annual ICML, New York, NY, USA, 2009.
[9] M. Kloft and G. Blanchard, "The local Rademacher complexity of lp-norm multiple kernel learning."
[10] S.-G. Lv and J.-D. Zhu, "Error bounds for lp-norm multiple kernel learning with least square loss."
[11] S. Lv and F. Y. Zhou, "Optimal learning rates of lp-type multiple kernel learning under general conditions."
[12] N. Aronszajn, "Theory of reproducing kernels."
[13] I. Steinwart, D. Hush, and C. Scovel, "Optimal rates for regularized least squares regression," in Proceedings of the 22nd Annual Conference on Learning Theory (COLT '09), pp. 79–93, Montreal, Canada, June 2009.
[14] Q. Wu, Y. M. Ying, and D.-X. Zhou, "Learning rates of least-square regularized regression."
[15] F. Cucker and D.-X. Zhou.
[16] F. Cucker and S. Smale, "On the mathematical foundations of learning."
[17] S. Smale and D.-X. Zhou, "Estimating the approximation error in learning theory."
[18] H. W. Sun and Q. Wu, "Regularized least square regression with dependent samples."
[19] U. von Luxburg and B. Schölkopf, "Statistical learning theory: models, concepts, and results."