An Efficient Kernel Learning Algorithm for Semisupervised Regression Problems

Kernel selection is a central issue in kernel methods of machine learning. In this paper, we investigate regularized learning schemes based on kernel design methods. Our ideal kernel is derived from a simple iterative procedure that uses large-scale unlabeled data in a semisupervised framework. Compared with most existing approaches, our algorithm avoids solving multiple optimization problems in the process of learning kernels, and its computation is as efficient as that of standard single-kernel algorithms. Moreover, large amounts of information associated with the input space can be exploited, so generalization ability is improved accordingly. We provide theoretical support for the least square case in our setting; these advantages are also illustrated by a simulation experiment and a real data analysis.


Introduction
Kernel-based methods have proved to be powerful for a wide range of data analysis problems. Since the support vector machine (SVM) was initially proposed by Vapnik [1], many other kernel-based methods have been developed, such as kernel PCA, kernel Fisher discriminant analysis, and kernel CCA. In many cases, the performance of kernel methods depends greatly on the choice of kernel function (for the importance of specifying an appropriate kernel, see Chapter 13 of [2]). To choose an appropriate kernel, many kernel learning algorithms have been proposed in recent years, such as [3-5]. Among them, two kinds of candidate kernel sets are common: the first involves parameter selection from a candidate collection such as the Gaussian kernels [6], that is, K_σ(x, y) = exp(−σ‖x − y‖²); the other mainly refers to linear combinations of certain prespecified kernels. The latter is also called "multiple kernel learning" [4]. Recall that Lanckriet et al. [5] proposed a positive semidefinite program to search for the best linear combination automatically for SVM; however, this approach is time-consuming and only feasible for small-sample cases. Sonnenburg et al.
[7] relaxed this optimization problem to a semi-infinite linear program, which is capable of coping with a large spectrum of kernels and samples. However, these multiple kernel learning algorithms sometimes perform no better in SVM than the traditional unweighted kernel K = ∑_j K_j, and Cortes [8] asked "can learning kernels help performance?". Recently, Kloft and Blanchard [9] introduced a multiple kernel learning approach with ℓ_p norm (p ≥ 1), which has been shown to be effective in both theory and practice [10, 11]. Essentially, ℓ_p-norm multiple kernel learning is a kind of empirical risk minimization with the kernel candidate set {K = ∑_{j=1}^M θ_j K_j : ‖θ‖_p ≤ 1, θ_j ≥ 0}. Kloft and Blanchard [9] provided an excess generalization error bound utilizing the local Rademacher complexity of ℓ_p-norm multiple kernel learning. Although these kernel learning algorithms provide more flexibility than one-kernel approaches, multiple kernel learning additionally induces more complex computational problems. In addition, the above kernel learning algorithms are considered only under fully supervised learning settings. In practice, however, labeled instances are often difficult, expensive, or time-consuming to obtain, as they require the efforts of experienced human annotators, while unlabeled data may be relatively easy to collect. In the machine learning literature, semisupervised learning addresses this problem by using large amounts of unlabeled data, together with the labeled data, to build better learners.
In this paper, we pursue kernel learning algorithms under the semisupervised learning framework. To this end, we find a sequence of candidate kernels using an iterative procedure; our regularized learning algorithm then operates on the corresponding RKHS, which leads to a classical convex optimization program on the training data. Finally, we use the test data to select the optimal kernel function and regularization parameter. It is worth noting that the proposed method consists of the two-step estimation stated above. In the first step, we use large amounts of unlabeled data to explore the underlying data structure. The optimization problem involved in the second step is as efficient as classical single-kernel approaches. More importantly, we provide theoretical support for our approach and demonstrate the effectiveness of the proposed method by experiments.
The rest of the paper is organized as follows. In Section 2 we introduce some basic notation and our two-step estimation for kernel learning. In Section 3 we present the main theoretical results for the proposed approach, which are obtained mainly by using advanced concentration inequalities. Section 4 contains proof details such as the error decomposition and the approximation error. We implement a simulation and a real data experiment in Section 5. Some proofs are relegated to the Appendix.

The Proposed Algorithm
We first describe the notation used in this paper. Suppose that our algorithm produces a learner f : X → Y from a compact metric space X to the output space Y ⊆ R. Such a learner f yields for each point x the value f(x) ∈ Y, which is a prediction made for x. The goodness of the estimation is usually assessed by a specified loss function V : R² → R₊. The most commonly used loss function is the least square one, that is, V(f(x), y) = (f(x) − y)². Let (x, y) be the random variable on X × Y with probability distribution ρ. Within the statistical learning framework, the target function can be formulated as a minimizer of the following functional optimization:

f_ρ = arg min_f ∫_{X×Y} V(f(x), y) dρ.

In particular, in the case of the least square loss, we obtain the explicit solution

f_ρ(x) = ∫_Y y dρ(y | x),

where ρ(· | x) is the conditional probability measure at x induced by ρ. Under the fully supervised learning setting, based on available samples z = {(x_i, y_i)}_{i=1}^n, the main goal of learning is to design an efficient learning algorithm that produces a learner f_z capable of approximating the regression function f_ρ well on the whole space. The popular regularized learning algorithms within an RKHS can be stated as

f_{z,λ} = arg inf_{f ∈ H_K} { (1/n) ∑_{i=1}^n V(f(x_i), y_i) + λ‖f‖²_K },

where H_K is a specified RKHS and 0 < λ ≤ 1 is the regularization parameter, controlling the trade-off between the empirical error and the functional complexity of H_K. Note that λ may depend on the sample size; it satisfies lim_{n→∞} λ(n) = 0.
In the semisupervised learning framework, the first n samples are labeled as above, followed by m unlabeled samples x̄ = {x_{n+1}, . . . , x_{n+m}}. Denote by K_0 a weak kernel used as the original kernel. Compared to standard kernels, a weak kernel here means one whose complexity is very large or which is less smooth. A learner with a weak kernel usually leads to overfitting, while it can approximate more complicated functions well and hence reduce the estimation bias of the learner. Hence, selecting an appropriate kernel requires trading off the functional complexity of the various H_K. Motivated by this observation, we propose an iterative procedure as the first step for constructing candidate kernels. At the t-th step, the next candidate kernel is derived as follows:

K_{t+1}(x, y) = (1/m) ∑_{i=n+1}^{n+m} K_t(x, x_i) K_t(x_i, y).    (4)

The labeled samples z are divided into training data z₁ = {(x_i, y_i)}_{i=1}^{n₁} and test data z₂ = z \ z₁; then we establish our regularized learning algorithm based on the associated H_{K_t}:

f_{z₁,λ,t} = arg inf_{f ∈ H_{K_t}} { (1/n₁) ∑_{i=1}^{n₁} V(f(x_i), y_i) + λ‖f‖²_{K_t} }.

Given the total number T of iteration steps, we minimize the test error over (t, λ):

(t*, λ*) = arg min_{0 ≤ t ≤ T, λ} ∑_{(x_i, y_i) ∈ z₂} (f_{z₁,λ,t}(x_i) − y_i)².

Thus, we take f_{z₁,λ*,t*} as our final learner in the semisupervised setting. Note that we use the least square loss instead of V in this final step, since its solution can be computed or approximated easily thanks to the nice mathematical properties of the least square loss.
Our motivation for designing the kernel as in (4) is based on the following fact. By the Mercer theorem [12], any given kernel K_t defined on a compact set can be expressed as K_t(x, y) = ∑_{i=1}^∞ λ_i φ_i(x) φ_i(y), where (λ_i, φ_i) are the eigenpairs of the integral operator L_{K_t}, which will be defined in (19) below. In general, the problem of selecting a kernel corresponds to a suitable choice of the parameters λ_i, since the eigenvalues λ_i are closely related to the functional complexity of H_{K_t} [13]. In our case, we use an iterative procedure to select an appropriate kernel. To be precise, we define a new candidate kernel by K_{t+1}(x, y) := ∫_X K_t(x, u) K_t(u, y) dρ_X(u). Based on the observation that K_{t+1}(x, y) = ∑_{i=1}^∞ λ_i² φ_i(x) φ_i(y), it suffices to find an appropriate iteration step. Since ρ_X is often unknown, we instead use the empirical estimator of K_{t+1} defined in (4) as our candidate kernel. Furthermore, in view of the slow rate of order 1/√m at which the empirical kernel in (4) converges to its population counterpart, large amounts of unlabeled data guarantee a small error generated by random sampling.
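On the sample level this eigenvalue squaring is easy to verify: if G is the Gram matrix of K_t on the m unlabeled points, the update (4) restricted to those points is G²/m, whose spectrum is the squared (rescaled) spectrum of G. A quick numerical check (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(50, 2))                      # unlabeled sample
d2 = ((U[:, None] - U[None, :]) ** 2).sum(-1)
G = np.exp(-d2)                                   # Gram matrix of K_0 on the sample

G_next = G @ G / len(U)                           # eq. (4) restricted to the sample
ev0 = np.sort(np.linalg.eigvalsh(G))[::-1]
ev1 = np.sort(np.linalg.eigvalsh(G_next))[::-1]

# eigenvalues of the updated Gram matrix are the squares of the originals
# (up to the 1/m normalization); relative to the top eigenvalue the
# spectrum decays faster, shrinking the effective complexity of H_{K_t}
assert np.allclose(ev1, ev0 ** 2 / len(U))
```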
It is seen from the proposed program above that this method avoids multiple optimization problems in the process of learning kernels, and its computation is as efficient as standard single-kernel algorithms up to a constant. Moreover, large amounts of information associated with the input space are fully used, so that some intrinsic data structure may be exploited.

Main Results
To highlight our idea with more refined theoretical results, in what follows we are primarily concerned with the least square setting, since the regularized least square algorithm has a closed-form solution. First, by the law of large numbers, with high probability we can replace the first step (4) with the following iterative procedure:

K_{t+1}(x, y) = ∫_X K_t(x, u) K_t(u, y) dρ_X(u),    (5)

where ρ_X is the marginal distribution induced by ρ. We denote by f_{z,t} the learner derived from (5) at the t-th iteration. For notational simplicity, we write f_z = f_{z,t}. In this paper, we focus on the generalization error of the proposed algorithm, that is, E(f_z) − E(f_ρ). A small value of this quantity implies a good prediction ability of f_z. Different from the classical literature under fixed-kernel settings, such as [13, 14], the main goal of this paper is to indicate theoretically some specific advantages over fixed-kernel approaches.
To simplify the theoretical analysis, we assume that the conditional distribution ρ(· | x) is supported on [−M, M], from which it follows that |f_ρ(x)| ≤ M almost everywhere. We therefore introduce the following projection operator.
Definition 1. Define the projection operator π_M on a measurable function f : X → R as

π_M(f)(x) = M if f(x) > M;  f(x) if |f(x)| ≤ M;  −M if f(x) < −M.

Note that the error between the projection of f_z and f_ρ can be expressed as

E(π_M(f_z)) − E(f_ρ) = ‖π_M(f_z) − f_ρ‖²_{L²_{ρ_X}},

where E(f) = ∫_{X×Y} (f(x) − y)² dρ denotes the population error of the function f.
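In code the projection is simply a clipping of predictions to [−M, M] (our illustration):

```python
import numpy as np

def project(preds, M):
    # pi_M: clip each prediction into [-M, M]; since the conditional
    # distribution lives on [-M, M], clipping can only reduce the error
    return np.clip(preds, -M, M)
```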
Since the regression function f_ρ may not lie in H_K, the approximation error between H_K and f_ρ is needed. Considering the error induced by sampling together with the approximation error, we introduce the empirical error

E_z(f) = (1/n) ∑_{i=1}^n (f(x_i) − y_i)²,

and we also introduce an approximation error associated with the joint distribution ρ:

D(λ) = inf_{f ∈ H_K} { E(f) − E(f_ρ) + λ‖f‖²_K },

whose minimizer f_λ is called the regularization function, given as

f_λ = arg inf_{f ∈ H_K} { E(f) − E(f_ρ) + λ‖f‖²_K }.

Remark 2. In the learning theory literature, one usually assumes that there exist C_β > 0 and 0 < β ≤ 1 such that D(λ) ≤ C_β λ^β. In fact, β = 1 means f_ρ ∈ H_K and vice versa [15-17]. Strictly speaking, D(λ) is formally discussed in approximation theory.
To obtain convergence rates for (10), we decompose the term E(π_M(f_z)) − E(f_ρ) into two parts, the approximation error and the sample error; see [14, 15].

Proposition 3. Let f_z be defined by (5); then the following inequality holds:

E(π_M(f_z)) − E(f_ρ) ≤ S(z, λ) + D(λ),

where

S(z, λ) = [E(π_M(f_z)) − E_z(π_M(f_z))] + [E_z(f_λ) − E(f_λ)].

Proposition 3 shows that E(π_M(f_z)) − E(f_ρ) is bounded by S(z, λ) + D(λ). We usually call S(z, λ) the sample error, since this quantity mainly involves the random sampling and the complexity of H_K.
Bounding the sample error S(z, λ) is a standard technique in learning theory [13, 15, 18]. To this end, we introduce the notion of covering number to measure the complexity of H_K.

Definition 4. Let (M, d) be a pseudometric space with metric d and let S ⊂ M. For any ε > 0, the covering number N(S, ε, d) of S with respect to ε and d is the minimal number of balls of radius ε in (M, d) needed to cover S.

Recall that a kernel function is called a Mercer kernel if it is symmetric, positive definite, and continuous. Several properties of Mercer kernels are well established and can be found in [12, 15]. Suppose that κ := sup_{x∈X} √K_0(x, x) < ∞.
Assumption 5. Suppose that the Mercer kernel K_t has polynomial complexity with exponent s > 0, that is,

log N(B_1, ε) ≤ C_s ε^{−s},

where B_1 is the unit ball of H_{K_t} and C_s is some constant. For the Sobolev space H^h on R^d with order h, it is known from [12] that s = 2d/h.
On the other hand, to quantify the approximation error D(λ) and characterize the regularity of f_ρ, we introduce the notion of fractional integral operator associated with ρ. Recall that the standard inner product on L²_{ρ_X}(X) is defined as ⟨f, g⟩_{ρ_X} = ∫_X f(x) g(x) dρ_X(x). Then we can define an integral operator L_{K_t} on L²_{ρ_X}(X):

L_{K_t}(f)(x) = ∫_X K_t(x, u) f(u) dρ_X(u).

It has been verified in [16] that L_{K_t} is a compact, self-adjoint, and positive definite operator from L²_{ρ_X}(X) to L²_{ρ_X}(X), so its fractional powers are well defined. Moreover, it is easy to check that L_{K_{t+1}} = L²_{K_t} and hence L_{K_t} = L_{K_0}^{2^t}. Lemma 12 below will show that if a kernel that is weak with respect to the true function is used for learning, the approximation ability cannot be improved even if the true function is sufficiently smooth. This is why we propose the iterative procedure (4) for updating kernels.
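The identity L_{K_{t+1}} = L²_{K_t} follows directly from the population kernel K_{t+1}(x, y) = ∫_X K_t(x, v) K_t(v, y) dρ_X(v), the symmetry of K_t, and Fubini's theorem; a one-line verification:

```latex
(L_{K_{t+1}}f)(x)
 = \int_X K_{t+1}(x,u)\,f(u)\,d\rho_X(u)
 = \int_X \Big(\int_X K_t(x,v)\,K_t(v,u)\,d\rho_X(v)\Big) f(u)\,d\rho_X(u)
 = \int_X K_t(x,v)\,(L_{K_t}f)(v)\,d\rho_X(v)
 = (L_{K_t}^{2}f)(x).
```

Iterating this relation t times gives L_{K_t} = L_{K_0}^{2^t}.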
With these preparations, we can state the main results, which depend on the capacity of H_{K_t} and the smoothness of the target function.

Theorem 6. Let f_z be defined by (5) and let Assumption 5 hold. If f_ρ = L_{K_0}^r(g_ρ) for some g_ρ ∈ L²_{ρ_X}(X) (r > 0), then, for any 0 < δ < 1 and n > (1360 log(1/δ)/c_s)^{(1+s)/s}, the following holds with probability at least 1 − δ, where the constant c_s is given by Proposition 11.
From Proposition 3, we can deduce that the term (1/2)(E(π_M(f_z)) − E(f_ρ)) in S(z, λ) can be ignored by studying the equivalent sample error S̃(z, λ) = S(z, λ) − (1/2)(E(π_M(f_z)) − E(f_ρ)). The following corollary provides an asymptotically optimal convergence rate for f_z. The proof can be found in the Appendix.
It is seen from Corollary 7 that the ideal choice of the regularization parameter λ depends on the two quantities r and s, which are often unknown in advance. Alternatively, cross-validation is one of the commonly used tools in practice. It is worth noting that our approach selects the ideal kernel and the regularization parameter simultaneously, which differs significantly from classical fixed-kernel methods.
Next, we compare our rate with existing results. Recently, sharp learning rates were established via advanced empirical process techniques in [14]. Note that we use K_0 in place of the kernel K appearing in algorithm (3), whose covering number has polynomial decay index s_0.
To be precise, an upper bound on the sample error was given in [14]. From formula (A.2) in the Appendix, we obtain the corresponding covering number decay index s_0 for H_{K_0}. Following the equivalence between the covering number and the spectrum of L_{K_t} (see Theorem 10 of [13]), we see that a sufficiently large t ensures that s ≪ s_0. Additionally, when 1 < s < s_0 < 2, the factor λ^{min{(s−s_0)/(2s),0}} in Theorem 6 and the factor D(λ)/λ above are both constants, and hence our sample error term is strictly smaller than that of [14], since 0 < λ < 1. In summary, our derived sample error is sharper than that in [14]. Thus, if the regression function is sufficiently smooth, so that it can also be approximated well by H_{K_t}, we conclude that the corresponding learning rate of Theorem 6 outperforms that of [14]. This shows that if we know the smoothness (r) of the target function, we can choose a proper kernel K_t (which requires r > s/2) to improve the sample error effectively. This provides an excellent theoretical basis for choosing kernel functions in real problems. Of course, since real samples are noisy, the chosen kernel also has to be smoother than the target function.

Error Analysis
The sample error S(z, λ) is analyzed via empirical process techniques. Early studies of the sample error mainly applied the McDiarmid inequality without considering any notion of "space complexity." However, the McDiarmid inequality cannot exploit variance information about the random variables. Later, the VC dimension was introduced into the literature, and other notions such as covering numbers, combined with Bernstein-type probability inequalities, significantly reduce the sample error; see [19] for a detailed overview. To bound the sample error, we split it into two parts again:

S(z, λ) = S₁(z, λ) + S₂(z, λ),  S₁(z, λ) = E_z(f_λ) − E(f_λ),  S₂(z, λ) = E(π_M(f_z)) − E_z(π_M(f_z)).

Note that S₁(z, λ) does not involve any functional complexity, and hence can be estimated easily by the following one-sided Bernstein probability inequality.

Lemma 8. Let ξ be a random variable on the probability space Z with |ξ − E(ξ)| ≤ M_ξ almost everywhere, and denote its variance by σ². Then, for any 0 < δ < 1, with probability at least 1 − δ,

(1/n) ∑_{i=1}^n ξ(z_i) − E(ξ) ≤ (2M_ξ log(1/δ))/(3n) + √(2σ² log(1/δ)/n).
Bounding the sample error S₂(z, λ) is more involved, since the estimator f_z varies with the random sample. To handle it, an advanced uniform concentration inequality is required [14].
According to the conclusion of Proposition 9, two important quantities involved in S₁(z, λ), ‖f_λ‖_∞ and ‖f_λ − f_ρ‖_{L²_{ρ_X}}, need to be bounded.

Lemma 12. Let f_λ be defined as in (13). If f_ρ = L_{K_0}^r g_ρ with g_ρ ∈ L²_{ρ_X}, the following holds:

The estimation of ‖f_λ‖_∞ extends Lemma 4.3 in [18], with K_t in place of K_0. Now we discuss the second quantity. In the classical algorithm (3), when r > 1, increasing the smoothness of f_ρ cannot improve the error ‖f_λ − f_ρ‖_{L²_{ρ_X}}; this is called the "saturation" phenomenon in the inverse problems literature. For the algorithm we study, however, saturation occurs only when r > 2. This shows a specific advantage of using K_t instead of the original K_0 from the perspective of approximation theory.
Proof of Lemma 12. According to [17], f_λ = (λI + L_{K_t})^{−1} L_{K_t} f_ρ. Noticing that f_λ − f_ρ = −λ(λI + L_{K_t})^{−1} f_ρ and applying the source condition on f_ρ yields the claimed bound. This completes the proof of Lemma 12.
For the above example, different scenarios are considered, with sample-size pairs (200, 30), (400, 100), and (500, 120), and each scenario is repeated 50 times. We use the widely used Gaussian kernel K_σ(x, y) = exp(−‖x − y‖²/σ²), where the parameter σ is specified by 10-fold cross-validation on each data set. Besides, as mentioned before, we start with a weak kernel and search for a better one iteratively. A standard weak kernel is defined as K_weak(x, y) = e^{−c‖x−y‖}, where c is an adjustable parameter, which we specify as c = 0.1.
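The two kernels used in the experiment can be written down directly (a small sketch, vectorized over sample matrices; function names are ours):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # K_sigma(x, y) = exp(-||x - y||^2 / sigma^2); sigma is chosen by
    # 10-fold cross-validation in the experiments
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def weak_kernel(X1, X2, c=0.1):
    # K_weak(x, y) = exp(-c * ||x - y||); a less smooth Laplacian-type
    # kernel, used as the starting kernel K_0 of the iteration
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-c * np.sqrt(d2))
```

Both kernels equal 1 on the diagonal, but the weak kernel's non-differentiability at x = y makes its RKHS substantially richer, matching the role of K_0 described above.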
The performance of the various methods is measured by the MSE, the relative mean squared error of each kernel-based regression. The averaged performance measures are summarized in Table 1. Note that SKM denotes the single kernel method with the Gaussian kernel, UKM denotes the proposed method without using any unlabeled data, and SEKM denotes the proposed method using all the unlabeled data.
From Table 1 we find that the proposed kernel learning method generates better prediction accuracy on these data sets than a single kernel. Probably the true function is complicated, in which case the Gaussian kernel has limited learning ability; starting from a weak kernel implies a hypothesis space much larger than that induced by the Gaussian kernel, so that more complicated functions can be approximated well.

In our analysis, all the variables are standardized. To compute the averaged prediction error, each data set is randomly split into two parts: training data and testing data with 30 observations. To show the performance of our method compared with a single kernel method, we split the training data under three scenarios, with sample-size pairs (300, 30), (350, 40), and (426, 50), and each scenario is repeated 50 times. Besides, the parameter σ is again specified by 10-fold cross-validation on each data set. The prediction performance of the single kernel method versus the proposed method is summarized in Table 2.
As shown in the table, the proposed method gives a notable improvement in prediction accuracy, except that one of the six results in Table 2 performs worse than the single kernel method. This result may be acceptable, since the underlying rule of this real data set is unknown, and it is hard to guarantee perfect performance in every setting. Overall, the proposed method is a simple but efficient member of the family of kernel learning methods.

Conclusions and Discussions
This paper discussed kernel learning problems within the semisupervised learning setting. Our candidate kernel sequence is generated by a simple iterative procedure using large amounts of unlabeled data. Under mild assumptions on the target function, it is shown that we can choose a kernel so as to effectively improve the sample error relative to one-kernel learning. This also shows that, in our setting, learning the kernel function outperforms traditional kernel-based learning algorithms. Moreover, a simulation example and a real data experiment were implemented to show the effectiveness of the proposed method. We note that the complexity of the function space in this paper is described by the covering number, which is a straightforward notion but not a perfect choice theoretically. Combining this with the way the kernel sequence is formed, we could replace Assumption 5 with an assumption on the asymptotic behavior of the eigenvalues of the integral operator L_{K_0}; based on the relationships among entropy numbers and the Rademacher complexity, better theoretical results may be achievable. This will be our future work. We have attempted to explore intrinsic structure of the input space by selecting an appropriate kernel; there may be other, more effective ways to explore such underlying structures.

Lemma 10.
Let F be a set of functions on Z. Suppose that there exist constants M and B such that every ξ ∈ F satisfies |ξ − E(ξ)| ≤ M almost everywhere and E(ξ²) ≤ B E(ξ). Then, for any ε > 0 and 0 < α ≤ 1,
where {λ_i, φ_i}_i is the spectrum of the integral operator L_{K_0}; thus the first bound follows. On the other hand, noting the fact that f_λ − f_ρ = −λ(λI + L_{K_t})^{−1} f_ρ and the assumption f_ρ = L_{K_0}^r g_ρ ∈ L²_{ρ_X}, we can bound ‖f_λ − f_ρ‖²_{L²_{ρ_X}}.

Table 1 :
Performance obtained using various kernel-based methods.

Table 2 :
Best test-set accuracy for the Boston housing data.