A Multikernel-Like Learning Algorithm Based on Data Probability Distribution



Introduction
(a) Data Spaces and Data Distributions. Let X represent the data and Ω the data space. In mathematics, the data X can be regarded as a random variable/vector/matrix defined on the data space Ω. There can be different kinds of data on the same data space. For example, the data space Ω can be the one consisting of all images of 512 × 512 pixels, while the data X1 represents all face images of 512 × 512 pixels and the data X2 represents all landscape images of 512 × 512 pixels; both of them are defined on the data space Ω but subject to different probability distributions.
If the data is regarded as a random variable, the samples of the data can then be regarded as concrete realizations of the random variable. In machine learning, the samples of data can be exploited to estimate the probability distribution of the data (the probabilistic modeling of data). There is a great deal of research on the probabilistic modeling of data [1][2][3].
(b) Classification/Labels and Classifiers/Label Functions. The classification of the data may differ when the data are used in different applications. For example, let the data be face images. In the application of identity recognition, the face images of the same person are all grouped into the same class, even though the expressions and postures of these face images may be different. However, in the application of expression recognition, the face images of the same expression are all grouped into the same class, even though these face images belong to different persons.
The classifier of data is a machine indicating the class to which a data point belongs. The classifiers of data are trained with the data samples [4,5]. The classes of data are also called the labels of data and the classifiers of data are also called the label functions of data. In this paper, we adopt the terminology of labels and label functions of data.
(c) Kernel Tricks in Machine Learning. The applications of kernel tricks in machine learning can be roughly divided into two categories: the transformation of data spaces and the construction of label functions. In the transformation of data spaces, the kernel functions are used to transform data spaces into other spaces where the data can be linearly separated.

Mathematical Problems in Engineering
The famous Kernel PCA [6] and kernel Fisher discriminant (KFD) [7] belong to this category. In the construction of label functions, the kernel functions are used to serve as the basic functions of the label functions. The famous manifold regularization learning [8,9] belongs to this category. In this paper we address the problems involved in the construction of label functions.
In the construction of label functions, the label function is expressed as f(x) = Σ_{i=1}^{l+u} α_i K(x, x_i), where K(u, v) represents a kernel function, {x_1, …, x_l} the labeled samples, and {x_{l+1}, …, x_{l+u}} the unlabeled samples. The coefficients α⃗ = [α_1 ⋯ α_{l+u}]^T can be derived by solving the following learning problem:

f* = arg min_{f ∈ H_S} Σ_{i=1}^{l} c(y_i, f(x_i)) + γ ‖f‖²_K.    (1)

In the above equation, H_S = span{K(·, x_i) | i = 1, …, l+u} represents the solution space of the learning problem; {y_1, …, y_l} represents the labels of the labeled samples; K = [K(x_i, x_j)] represents the kernel matrix; and c(y_i, f(x_i)) represents the cost function. We want to find a label function f(x) which makes the cost as small as possible.
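As a concrete illustration of this kernel-expansion framework, the following minimal sketch solves the regularized least-squares version of the learning problem in the span of the kernel functions. The helper names, the Gaussian kernel choice, and the square-error cost are our own illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def rls_label_function(X, y_labeled, n_labeled, gamma=0.1, sigma=1.0):
    """Minimize sum_{i<=l} (y_i - f(x_i))^2 + gamma ||f||_K^2 over
    f = sum_i alpha_i K(., x_i); the first n_labeled rows of X are labeled.
    Setting the gradient to zero gives (J^T J K + gamma I) alpha = J^T y."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    J = np.eye(n)[:n_labeled]             # selection matrix for labeled rows
    alpha = np.linalg.solve(J.T @ J @ K + gamma * np.eye(n), J.T @ y_labeled)
    f = lambda x: gaussian_kernel(np.atleast_2d(x), X, sigma) @ alpha
    return alpha, f
```

With two well-separated clusters and one labeled point per cluster, the resulting f(x) takes the sign of the nearest labeled cluster.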
There are two kinds of properties of the data. The first kind is the natural properties of the data; the probability distributions of data are examples. The second kind is the semantic properties of the data; the data labels are examples. It is clear that, in the framework of the learning problem shown in (1), the semantic properties hidden in the data labels are fully utilized, while the natural properties hidden in the data probability distributions seem not to be deeply exploited. At present, the usual way to learn information other than the data labels is to add various regularization terms to the cost function. For example, in manifold regularization learning [8][9][10][11], a manifold regularization term is added to the cost function:

f* = arg min_{f ∈ H_S} Σ_{i=1}^{l} c(y_i, f(x_i)) + γ_A ‖f‖²_K + γ_I ‖f‖²_I,    (2)

where ‖f‖²_I = f⃗^T L f⃗ is the so-called manifold regularization term, in which f⃗ = [f(x_1) ⋯ f(x_{l+u})]^T and L is the Laplacian matrix reflecting the adjacency relations of the data samples [12].
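The manifold regularization term above can be made concrete with a small sketch: build a symmetrized k-nearest-neighbour adjacency, form the unnormalized graph Laplacian L = D − W, and evaluate f⃗^T L f⃗. The 0/1 adjacency and the choice k are illustrative assumptions; the paper's cited construction [12] may weight edges differently.

```python
import numpy as np

def knn_laplacian(X, k=2):
    """Unnormalized graph Laplacian L = D - W from a symmetrized
    k-nearest-neighbour 0/1 adjacency matrix W."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]   # skip the point itself
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)                  # symmetrize the adjacency
    return np.diag(W.sum(1)) - W

def manifold_penalty(L, fvec):
    """||f||_I^2 = f^T L f = (1/2) * sum_ij W_ij (f_i - f_j)^2."""
    return fvec @ L @ fvec
```

The penalty vanishes for a constant label vector and grows when adjacent samples receive different labels, which is exactly the smoothness prior the regularizer encodes.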
However, the addition of too many regularization terms to the cost function makes the learning problem complicated and difficult to solve. In this paper, rather than adding regularization terms, an alternative way to exploit the data probability distribution in the learning problem is proposed. For convenience of description, let us denote the kernel function as K(u, v | θ), where θ represents the parameter of the kernel function. In the proposed algorithm, the parameters of the basic functions K(u, x_i | θ_i) are adjusted based on the data distribution sample by sample; that is, θ_i = g(p(x_i)), where p(x) is the probability distribution of the data, i = 1, …, l+u. These basic functions are then used to span the solution space of the learning problem: H_D = span{K(·, x_i | θ_i) | i = 1, …, l+u}. It is clear that the probability distribution of the data is integrated into the basic functions.
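The per-sample parameterization θ_i = g(p(x_i)) can be sketched as follows for a Gaussian kernel, where each column of the design matrix uses its own bandwidth. The specific map sigma_i = c / p_i is a hypothetical choice for illustration (a concrete scheme is discussed in the experiments section); the key structural point shown is that the resulting matrix is generally not symmetric, unlike a single-kernel Gram matrix.

```python
import numpy as np

def per_sample_bandwidths(p, c=1.0):
    """theta_i = g(p(x_i)); here the hypothetical choice sigma_i = c / p_i,
    so high-density samples get narrow kernels."""
    return c / np.asarray(p, float)

def design_matrix(X, sigma):
    """D[i, j] = K(x_i, x_j | sigma_j): column j uses its own bandwidth,
    so D is in general NOT symmetric."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma[None, :] ** 2))
```

When all bandwidths coincide, the construction collapses back to the ordinary symmetric Gram matrix of a single kernel.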
According to the theory of kernel functions, if θ_i ≠ θ_j, then K(u, v | θ_i) and K(u, v | θ_j) are two different kernel functions and will generate two different RKHS. Now let H̄_i denote the RKHS generated by the kernel function K(u, v | θ_i), i = 1, …, l+u; then, theoretically speaking, the solution space H_D can be regarded as a subspace of the direct sum space H̄_1 ⊕ ⋯ ⊕ H̄_{l+u}. Therefore, the proposed algorithm can be regarded as a kind of multikernel learning algorithm, but it is quite different from the commonly used multikernel learning algorithms [13][14][15][16]. Therefore, we call the proposed algorithm the multikernel-like learning algorithm based on data probability distribution, referred to as the MKDPD algorithm.
How to label the new coming data (the out-of-sample data) is a key topic in machine learning [12,17,18]. There are two extreme algorithms. One algorithm uses the original label function f_old to label the new coming data x; that is, the labels of the new coming data are given by f_old(x). This algorithm is best in efficiency, but worst in accuracy. The other algorithm regards the new coming data as unlabeled samples and mixes them with the original samples to retrain a new label function f_new; the labels of the new coming data are then given by f_new(x). This latter algorithm is best in accuracy, but worst in efficiency. Various algorithms for labeling new coming data are trade-offs between these two extremes. The proposed MKDPD algorithm involves two learning stages. In the first stage, the basic functions of the label function are trained. In the second stage, the weights of the basic functions in the label function are solved for. Accordingly, a new algorithm for labeling new coming data is proposed: the new coming data are exploited in the first stage to retrain the basic functions, while the weights of the basic functions remain unchanged and are combined with the retrained basic functions to label the new coming data. The proposed labeling algorithm achieves a better trade-off between computational efficiency and accuracy.
The rest of the paper is arranged as follows: in Section 2, the literature related to our work is reviewed briefly. In Section 3, the main theories of kernel functions and RKHS are introduced. In Section 4, the MKDPD algorithm is proposed. In Section 5, an MKDPD-based algorithm for labeling new coming data is proposed. In Section 6, experimental results on toy and real-world data are presented to show the performance of the MKDPD algorithms. In Section 7, some conclusions are presented for reference.

Related Works
Learning from the given data is the central process of machine learning, so making full use of the given samples is the key to successful learning. In general, supervised learning assumes sufficient labeled samples, and such algorithms are suitable for classification problems, such as the representative linear discriminant analysis (LDA) [19] and KFD [7]. In practice, however, a large number of samples are unlabeled and only a small number are labeled. In this case, supervised learning algorithms cannot effectively make use of the information hidden in the unlabeled samples. To tackle this issue, semisupervised learning [18] was introduced, and a wide range of semisupervised learning algorithms have been proposed and widely applied in many areas of machine learning.
In recent years, the study of semisupervised learning has not been limited to the simple introduction of unlabeled data. Many researchers pay attention to exploiting the intrinsic geometry of the data and introduce kernel learning into semisupervised methods. For example, manifold regularization (MR) learning proposed by Belkin et al. [8] exploits the underlying data structure by adding a manifold regularization term to a general-purpose learner. Following it, a series of algorithms were proposed. Sindhwani et al. [20] proposed a linear MR (LMR) algorithm, in which a global linear mapping between the samples and their labels is constructed for labeling novel samples. Inspired by Gaussian fields and harmonic functions (GFHF) [21], local and global consistency (LGC) [22], and LMR [20], Nie et al. [23] extended the LMR algorithm to the flexible manifold embedding algorithm (FMA). FMA relaxes the hard linear constraint in LMR by adding a flexible regression residue. Geng et al. [10] proposed ensemble manifold regularization (EMR) to deal with the aforementioned problems by learning an optimal graph Laplacian from a set of given candidate graph Laplacians. Under a sparsity assumption, Fan et al. [24] replaced the manifold regularizer with a sparse regularizer in the MR framework. Luo et al. [11] applied the MR framework to the problem of multilabel image classification by learning a discriminative subspace.
Introducing the kernel trick into semisupervised learning methods is an important advance in machine learning. The kernel function [17] is either used to map input samples into a high-dimensional kernel space to nonlinearize the learning problem, or used to span a RKHS for learning the label function. Taking MR learning as an example, the label function (classifier function) in MR is a linear combination of a single kernel function evaluated at the labeled and unlabeled samples, and the performance of MR algorithms strongly depends on this label function.
The theory of RKHS plays an important role in kernel methods, and RKHS has found a wide range of applications, such as minimum variance unbiased estimation of regression coefficients, least squares estimation of random variables, detection of signals in Gaussian noise, and problems in optimal approximation [25]. Some recently proposed RKHS-based learning algorithms find applications in online learning for classification or regression [26][27][28], while others find applications in the classification of hyperspectral images [29,30]. Gurram and Kwon [29] obtained the weights of SVM separating hyperplanes by combining both local spectral and spatial information. Gu et al. [30] introduced the concept of Multiple-Kernel Hilbert Space (MKHS) to analyze spectral unmixing problems, and the resulting algorithm performs well on nonlinear problems.
In theory, a RKHS can be generated from certain specific functions called kernel functions, such as the Gaussian, Laplacian, and polynomial kernels [31]. Modifying kernel functions is one way of improving the performance of kernel methods; for example, Wu and Amari [32] extended the conformal transformation of kernel functions to improve the performance of Support Vector Machine classifiers, Gurram and Kwon [33] defined a new inner product to warp the RKHS structure to reflect the intrinsic geometry of the given samples, and the works [33,34] obtained the best kernel parameters by calculating the derivatives of objective functions. The application of multiple kernels is a hot topic in kernel methods, and multikernel learning (MKL) [35] is a successful method, which enhances the interpretability of the classifier through a combination of base kernels and improves the performance of kernel methods. MKL algorithms have been widely investigated [13][14][15][16][35][36][37] and reviews of MKL algorithms can be found in [13,14]. MKL offers a feasible scheme for combining multiple kernels, but the high computational cost raised by the optimization procedure is a serious limitation when it is used to process large-scale data and a large number of kernels. Therefore, one main research direction in MKL is how to solve the MKL problem effectively. Many MKL algorithms have been proposed; for example, SimpleMKL [36], proposed by Rakotomamonjy et al., is one of the state-of-the-art algorithms for the MKL problem and solves it with a simple subgradient descent method. However, the MKL task is still challenging because it must, on the one hand, learn an optimal combination of multiple kernels and determine the optimal classifier in each iteration and, on the other hand, make sure that the two optimization procedures are feasible.

RKHS and Its Application to Machine Learning

Note that in H = (F(Ω), ⟨·,·⟩), Ω is a data space, F(Ω) is a linear space of functions defined on the data space Ω, and ⟨·,·⟩ is an inner product defined on F(Ω). According to the theory of RKHS, a RKHS can be generated from a kernel function. A kernel function K(u, v) is a symmetric and positive definite function defined on Ω × Ω.
A kernel function K(u, v) can be used to generate a RKHS such that the kernel function is the reproducing kernel of the RKHS. The generating procedure is as follows.
First, a linear space can be generated from the kernel function K(u, v): F_K(Ω) = {Σ_{i=1}^{n} α_i K(·, u_i) | n ∈ N, α_i ∈ R, u_i ∈ Ω}, where N is the set of all positive integers.
Second, an inner product can be defined on F_K(Ω) by ⟨Σ_i α_i K(·, u_i), Σ_j β_j K(·, v_j)⟩_K = Σ_i Σ_j α_i β_j K(u_i, v_j), which makes H_K = (F_K(Ω), ⟨·,·⟩_K) an inner product space. It is worth noting that for all f ∈ F_K(Ω), ⟨f, K(·, u)⟩_K = f(u). That is to say, the functions in the inner product space H_K can be reproduced with the kernel function K(u, v).
Third, the inner product space H_K can be completed if it is not complete. The completion of H_K, denoted by H̄_K, is then a RKHS and the kernel function K(u, v) is the reproducing kernel of H̄_K. By the way, it can be seen from the completion that the inner product space H_K is dense in the RKHS H̄_K.

Solution Spaces of Machine Learning Problems.
In practice, it is impossible to take the space H̄_K as the solution space of the learning problem because it is infinite-dimensional. It is reasonable to require that the solution space be both finite-dimensional and sample-dependent. Thus, for the given samples {x_1, …, x_l, x_{l+1}, …, x_{l+u}}, a linear space can be generated as follows:

F_S(Ω) = span{K(·, x_i) | i = 1, …, l+u}.    (5)

It is clear that F_S(Ω) is both finite-dimensional and sample-dependent. Furthermore, F_S(Ω) is exactly a subspace of F_K(Ω) and therefore H_S = (F_S(Ω), ⟨·,·⟩_K) is a subspace of H_K. Since H_S is finite-dimensional, it is complete; that is, H_S is a Hilbert space. However, H_S is no longer a RKHS.
For all functions f, g ∈ F_S(Ω), since f(x) = Σ_{i=1}^{l+u} α_i K(x, x_i) and g(x) = Σ_{i=1}^{l+u} β_i K(x, x_i), then, according to (6), we have ⟨f, g⟩_K = α⃗^T K β⃗, where α⃗ = [α_1 ⋯ α_{l+u}]^T, β⃗ = [β_1 ⋯ β_{l+u}]^T, and K = [K(x_i, x_j)] is the kernel matrix. Since the matrix K is symmetric and positive definite, the inner product on F_S(Ω) can be defined by itself, not necessarily inherited from ⟨·,·⟩_K. In fact, for all f, g ∈ F_S(Ω), the inner product can be defined as ⟨f, g⟩_S = α⃗^T K β⃗. It can be easily proven that the definition of ⟨·,·⟩_S meets the requirements of an inner product and therefore H_S = (F_S(Ω), ⟨·,·⟩_S) is an inner product space. Again, since F_S(Ω) is finite-dimensional, H_S is complete. In machine learning, it is the space H_S that is taken as the solution space of the learning problem.
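The matrix form of the inner product above can be sketched and sanity-checked in a few lines. A useful consequence worth verifying: with ⟨f, g⟩ = α^T K β, pairing f with the basis element K(·, x_j) (coefficient vector e_j) returns (Kα)_j, i.e., the value f(x_j), so the reproducing property survives at the sample points. The helper names are our own.

```python
import numpy as np

def gram(X, sigma=1.0):
    """Symmetric Gaussian Gram matrix K[i, j] = K(x_i, x_j)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def inner_product(alpha, beta, K):
    """<f, g> = alpha^T K beta for f = sum_i alpha_i K(., x_i) and
    g = sum_i beta_i K(., x_i)."""
    return float(alpha @ K @ beta)
```

Pairing with a single basis function recovers the function value at that sample, as the test below checks.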

A Multikernel-Like Learning Algorithm Based on Data Probability Distribution (MKDPD)
4.1. Motivation. As shown in (5), the space of label functions is F_S(Ω) = span{K(·, x_i) | i = 1, …, l+u}. This means that the functions {K(·, x_i) | i = 1, …, l+u} play the role of basic functions of F_S(Ω). Obviously, these basic functions depend only on the locations of the given samples and seem too simple to adapt to various probability distributions of data. Take the Gaussian kernel function K(u, v) = exp(−‖u − v‖²/2σ²) as an example: the basic functions {K(·, x_i) | i = 1, …, l+u} generated from the Gaussian kernel function are identical with each other, differing only in their locations in the data space. A basic function can be derived from another basic function only by translation in the data space Ω. In fact, if i ≠ j, then K(u, x_i) = K(u − x_i + x_j, x_j). Furthermore, since f(u) = Σ_{i=1}^{l+u} α_i K(u, x_i), then for all u ∈ Ω with p(u) ≠ 0, f(u) should give the label of u. This requires sup(p) ⊆ ∪_{i=1}^{l+u} sup(K(·, x_i)), where sup(p) is the support of p(u) and sup(K(·, x_i)) is the support of K(·, x_i). If the relation sup(p) ⊆ ∪_{i=1}^{l+u} sup(K(·, x_i)) were not true, there would be u ∈ Ω such that p(u) ≠ 0 but f(u) = Σ_{i=1}^{l+u} α_i K(u, x_i) = 0; that is, f(u) could not give the label of u. However, the union ∪_{i=1}^{l+u} sup(K(·, x_i)) depends on the locations of the given data samples {x_1, …, x_{l+u}}, not on the data probability distribution p(u). In practice, kernel functions are often compactly supported and the data are not evenly distributed over the data space. In these cases, the label function f(u) will be overfitted in the areas where too many data samples are collected, underfitted in the areas where too few data samples are collected, or not fitted at all in the areas that the union ∪_{i=1}^{l+u} sup(K(·, x_i)) fails to cover.
Based on the above considerations, a learning algorithm based on the data probability distribution is proposed in this paper. In the proposed algorithm, the basic functions are set to K(·, x_i | θ_i) with θ_i = g(p(x_i)), so that the union of their supports depends not only on the locations of the given samples, but also on the data probability distribution. With these basic functions, we can span a linear space F_D(Ω) as follows: F_D(Ω) = span{K(·, x_i | θ_i) | i = 1, …, l+u}.

Construction of Solution Spaces.
It is clear that F_D(Ω) is a finite-dimensional linear space. Further, in order to define an inner product on F_D(Ω), we first need to define a symmetric and positive definite matrix: G_D = D^T D + εI, where ε > 0, I ∈ R^{(l+u)×(l+u)} is the unit matrix, and D = [K(x_i, x_j | θ_j)]. Note that, since K(x_i, x_j | θ_j) ≠ K(x_j, x_i | θ_i) for i ≠ j, the matrix D is not symmetric and positive definite. However, G_D is symmetric and positive definite and can be used to define an inner product ⟨·,·⟩_D on F_D(Ω): for all f, g ∈ F_D(Ω), ⟨f, g⟩_D = α⃗^T G_D β⃗, where α⃗ and β⃗ are the coefficient vectors of f and g. It can be easily proven that ⟨·,·⟩_D meets the requirements of an inner product and therefore H_D = (F_D(Ω), ⟨·,·⟩_D) is an inner product space. Furthermore, since F_D(Ω) is finite-dimensional, H_D is complete; that is, H_D is a Hilbert space. However, it is worth noting that H_D is neither a RKHS nor a subspace of a RKHS. Recall that although H_S is not a RKHS, H_S is a subspace of H_K.
In the proposed algorithm, H_D is taken as the solution space of the learning problem:

f* = arg min_{f ∈ H_D} Σ_{i=1}^{l} c(y_i, f(x_i)) + γ_A ‖f‖²_D + γ_I f⃗^T L f⃗,    (19)

where f⃗ = [f(x_1) ⋯ f(x_{l+u})]^T. Below we explain the rationality of the definition of ⟨·,·⟩_D: (1) If ⟨·,·⟩_D is an inner product on F_D(Ω), according to the linearity of the inner product, for all f, g ∈ F_D(Ω), we have

⟨f, g⟩_D = Σ_{i=1}^{l+u} Σ_{j=1}^{l+u} α_i β_j ⟨K(·, x_i | θ_i), K(·, x_j | θ_j)⟩_D.    (20)

Usually, the inner product of a functional space is defined as the integral of the product of functions; approximating the integral by a summation over the samples gives

⟨K(·, x_i | θ_i), K(·, x_j | θ_j)⟩_D ≈ Σ_{k=1}^{l+u} K(x_k, x_i | θ_i) K(x_k, x_j | θ_j) = (D^T D)_{ij}.    (21)

Substituting (21) into (20) gives ⟨f, g⟩_D ≈ α⃗^T D^T D β⃗. However, the matrix D^T D is only positive semidefinite and cannot be used to define an inner product. This problem can be easily solved by adding a regularization term εI, where I ∈ R^{(l+u)×(l+u)} is the unit matrix and ε is the regularization parameter: G_D = D^T D + εI. The matrix G_D is now symmetric and positive definite and can be used to define an inner product on F_D(Ω): ⟨f, g⟩_D = α⃗^T G_D β⃗. The regularization parameter ε also alleviates the ill-conditioning of the matrix D^T D and reduces the error stemming from the substitution of the integral with the summation in (21).
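The regularization step above can be sketched directly: even when D is rank-deficient (so D^T D is singular), adding εI yields a symmetric positive definite matrix suitable for defining an inner product. The helper name is our own.

```python
import numpy as np

def regularized_gram(D, eps=1e-3):
    """G_D = D^T D + eps*I: symmetric and positive definite even though the
    per-sample-bandwidth matrix D itself may be neither symmetric nor
    full-rank."""
    n = D.shape[1]
    return D.T @ D + eps * np.eye(n)
```

A wide matrix D (more columns than rows) makes D^T D singular, yet the regularized version remains strictly positive definite.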
Combining (23) and (25) gives the following result: if the parameter θ of the kernel function K(u, x_i | θ) is not adjusted sample by sample (that is, θ_1 = ⋯ = θ_{l+u}), then H_D = H_S.

Analytic Solutions to Two-Class Learning Problems.
In the proposed algorithm, the Hilbert space H_D is taken as the solution space of learning problems. Then for all f ∈ H_D, f(x) = Σ_{i=1}^{l+u} α_i K(x, x_i | θ_i), ‖f‖²_D = α⃗^T G_D α⃗, and f⃗ = D α⃗. Based on the above results, the problem shown in (19) can be simplified as follows:

α⃗* = arg min_{α⃗} Σ_{i=1}^{l} c(y_i, (D α⃗)_i) + γ_A α⃗^T G_D α⃗ + γ_I α⃗^T D^T L D α⃗.

Furthermore, if the cost function c(y, f(x)) is set to be the square-error function, that is, c(y, f(x)) = (y − f(x))², we have

α⃗* = M^{-1} b⃗,

where J is the selection matrix, b⃗ = D^T J^T y⃗, and M = D^T J^T J D + γ_A G_D + γ_I D^T L D. Note that the matrix M is symmetric and positive definite.
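The closed-form square-error solution can be sketched in a few lines of linear algebra. This is a minimal illustration under our own notational assumptions (matrix names follow the derivation above; the data, bandwidths, and regularization weights are toy choices).

```python
import numpy as np

def mkdpd_two_class(D, L, y, n_labeled, gamma_A=0.1, gamma_I=0.1, eps=1e-3):
    """Closed-form coefficients for the square-error cost:
    alpha = M^{-1} D^T J^T y, with
    M = D^T J^T J D + gamma_A * (D^T D + eps*I) + gamma_I * D^T L D."""
    n = D.shape[0]
    J = np.eye(n)[:n_labeled]             # selection matrix for labeled rows
    G = D.T @ D + eps * np.eye(n)         # regularized Gram matrix G_D
    M = D.T @ J.T @ J @ D + gamma_A * G + gamma_I * D.T @ L @ D
    return np.linalg.solve(M, D.T @ J.T @ y)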

Analytic Solutions to Multiclass Learning Problems.
In principle, the deduction shown above is also suitable for multiclass learning problems. In fact, for the data sample x_i, its label y_i can take different values to indicate the different classes to which the data sample x_i belongs. However, in practice, the different values of y_i may be too close to each other to facilitate the optimization calculation. Therefore, for multiclass problems, we adopt another way to indicate the data labels.
For the data sample x_i, let its label y⃗_i be a c-dimensional vector, where c is the number of classes. If the data sample x_i belongs to the kth class, then the kth component of y⃗_i is set to 1 while the other components of y⃗_i are set to zero, k = 1, …, c. Furthermore, a label function f_k(x) is set to describe the probability that the data x belongs to the kth class. Based on these notations, the multiclass problem can be expressed as follows:

A* = arg min_A Σ_{i=1}^{l} ‖y⃗_i − (D A)_i‖² + γ_A tr(A^T G_D A) + γ_I tr(A^T D^T L D A),

where A = [α⃗_1 ⋯ α⃗_c] collects the coefficient vectors of the label functions f_1(x), …, f_c(x). Again, the matrix J is a selection matrix picking out the rows corresponding to the labeled samples. At last, substituting (32), (33), and (34) into (30) gives the following result:

A* = M^{-1} D^T J^T Y,

where Y = [y⃗_1 ⋯ y⃗_l]^T and M is the same matrix as in the two-class case.
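The one-hot label encoding and the argmax decision rule described above can be sketched as follows; the helper names are our own, and the coefficient matrix A is assumed to have been obtained from the closed-form solution.

```python
import numpy as np

def one_hot(labels, n_classes):
    """y_i is a c-dimensional indicator vector: 1 in the class position,
    0 elsewhere."""
    Y = np.zeros((len(labels), n_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def predict_class(D, A):
    """f_k(x_i) = (D A)[i, k]; the predicted label is the argmax over the
    c per-class label functions."""
    return (D @ A).argmax(1)
```

With D = I the prediction simply reads back the largest component of each row of A, which makes the decision rule easy to verify.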

The Framework of Multikernel-Like Learning Algorithms.
In the algorithm proposed in this paper, the label function f(x) is set to be f(x) = Σ_{i=1}^{l+u} α_i K(x, x_i | θ_i), where the parameter θ_i is adjusted according to the data probability distribution p(x) at the data sample x_i. In general, the data probability distribution is not uniform and therefore the parameters θ_i will differ sample by sample. As a result, the functions K(u, v | θ_1), …, K(u, v | θ_{l+u}) are different kernel functions and will produce different RKHS. In this sense, the algorithm proposed in this paper can be regarded as a kind of multikernel learning algorithm, but it is quite different from the commonly used multikernel learning algorithms.
In the commonly used multikernel learning algorithms, the multikernel function is a linear combination of multiple basic kernel functions: K_MK(u, v) = Σ_{k=1}^{m} μ_k K_k(u, v), where the functions K_1, …, K_m are called the basic kernel functions and the function K_MK(u, v) is called the multikernel function. Since the basic kernel functions are symmetric and positive definite, it can be easily proven that the multikernel function is also symmetric and positive definite. Therefore the label function f_MK(x) based on the multikernel function can be expressed as

f_MK(x) = Σ_{i=1}^{l+u} α_i K_MK(x, x_i) = Σ_{i=1}^{l+u} Σ_{k=1}^{m} α_i μ_k K_k(x, x_i),    (36)

where the coefficients α_1, …, α_{l+u} and μ_1, …, μ_m are determined through machine learning.
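The conventional MKL combination can be sketched in one line: a nonnegative combination of symmetric positive (semi)definite Gram matrices is again symmetric positive semidefinite, which is the property the text relies on. The helper name is our own.

```python
import numpy as np

def combined_kernel(kernels, mu):
    """K_MK = sum_k mu_k K_k: a nonnegative combination of PSD Gram
    matrices is again symmetric PSD."""
    return sum(m * K for m, K in zip(mu, kernels))
```

Combining two Gaussian Gram matrices with different bandwidths preserves symmetry and positive semidefiniteness, as checked below.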
If we follow the ideas of the commonly used multikernel learning algorithms and regard the functions K(u, v | θ_1), …, K(u, v | θ_{l+u}) as the basic kernel functions, then the multikernel function becomes K_MK(u, v) = Σ_{j=1}^{l+u} μ_j K(u, v | θ_j). Thus, according to (36), the label function f_MK(x) based on the multikernel function becomes

f_MK(x) = Σ_{i=1}^{l+u} Σ_{j=1}^{l+u} α_i μ_j K(x, x_i | θ_j).    (37)

It can be seen from (37) that, no matter how the coefficients μ_1, …, μ_{l+u} are adjusted, it is in general impossible to make f_MK(x) coincide with f(x). From the perspective of solution spaces, in the solution space of f_MK(x), there are l+u functions K(·, x_i | θ_1), …, K(·, x_i | θ_{l+u}) around the data sample x_i, while in the solution space of f(x), there is only one function K(·, x_i | θ_i) around the data sample x_i. Therefore the algorithm proposed in this paper is quite different from the commonly used multikernel learning algorithms.
Nevertheless, the algorithm proposed in this paper still belongs to the realm of multikernel learning. As stated above, the functions K(·, x_1 | θ_1), …, K(·, x_{l+u} | θ_{l+u}) are different kernel functions and can produce different RKHS: H̄_1, …, H̄_{l+u}. Now let S_i = {λ K(·, x_i | θ_i) | λ ∈ R}; as stated in Section 3.1, S_i is then a 1-dimensional subspace of H̄_i. Furthermore, the direct sum of these subspaces turns out to be S_1 ⊕ ⋯ ⊕ S_{l+u} = F_D(Ω). Obviously, the direct sum of these subspaces spans the solution space H_D.
Because our algorithm is different from the commonly used multikernel learning algorithms, yet still involves multiple kernel functions, we call it a multikernel-like algorithm.

An MKDPD-Based Algorithm for Labeling New Coming Data

When new coming data arrive, the relearning methods mix them with the original samples and retrain the label function, while the unlearning methods simply apply the original label function. Obviously, in terms of accuracy, the relearning methods perform best, while the unlearning methods perform worst. In terms of efficiency, the unlearning methods perform best, while the relearning methods perform worst. For years researchers have hovered between these two extreme methods, trying to find trade-offs between accuracy and efficiency.

An MKDPD-Based Algorithm for Labeling New Coming Data.

As stated in Section 4, there are two learning stages in the MKDPD algorithm. In the first stage, the MKDPD algorithm adjusts the parameters of the kernel functions according to the data probability distribution. In the second stage, the MKDPD algorithm determines the coefficients of the label function. Therefore, in the framework of the MKDPD algorithm, there are at least three ways to label new coming data: (1) The MKDPD-based relearning method:

f_new(x) = Σ_i α_i^new K(x, x_i | θ_i^new), i = 1, …, l+u+n, where n is the number of new coming data. Obviously, the MKDPD-based relearning method can achieve the best accuracy but performs worst in efficiency, because it has to recalculate both the parameters θ_new and the coefficients α_new.
(2) The MKDPD-based unlearning method: f_old(x) = Σ_i α_i^old K(x, x_i | θ_i^old). Obviously, the MKDPD-based unlearning method can achieve the best efficiency but performs worst in accuracy, because it calculates neither the new parameters θ_new nor the new coefficients α_new.

(3) The MKDPD-based semilearning method: f_semi(x) = Σ_i α_i^old K(x, x_i | θ_i^new). The MKDPD-based semilearning method regards the new coming data as unlabeled samples and mixes them with the original samples to retrain the new parameters θ_new of the kernel functions. However, the coefficients α_old remain unchanged and are combined with the retrained kernel functions to label the new coming data. The MKDPD-based semilearning method takes full advantage of the two learning stages of the MKDPD algorithm and achieves a better trade-off between computational accuracy and efficiency.
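The semilearning evaluation can be sketched as follows: the coefficients stay fixed, and only the per-sample bandwidths used by the basis functions are swapped for retrained ones. The function name is our own, and the retrained bandwidths are assumed to have been re-estimated elsewhere from the mixture of original and new data.

```python
import numpy as np

def semilearn_label(centers, alpha_old, sigma_new, x):
    """MKDPD-based semilearning sketch: keep the old coefficients alpha_old,
    but evaluate the Gaussian basis functions K(x, x_i | sigma_i_new) with
    bandwidths retrained on the combined old + new data."""
    x = np.atleast_2d(x)
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2 * sigma_new[None, :] ** 2))
    return Phi @ alpha_old
```

Because no linear system is solved at labeling time, the cost per new point is one kernel-row evaluation, which is where the efficiency gain over full relearning comes from.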

Experimental Data and Experimental Settings

We test our algorithm in the framework of manifold regularization and therefore the experimental data are downloaded from the website of manifold regularization (http://manifold.cs.uchicago.edu/manifoldregularization/manifold.html). There are a total of 400 data points collected from two half-moons: 200 points from one half-moon and another 200 points from the other. From each half-moon, we randomly take 1 point as a labeled sample and 99 points as unlabeled samples. The remaining 200 points are taken as new coming data for labeling.
In order to alleviate the effect of random sampling on the objectivity of the experimental results, the random sampling is repeated N times and each sampling produces an experimental result. The average of the N experimental results is taken as the final result. N is set to 30, 50, 70, and 90, respectively (Table 1).

Error Analysis. Table 1 lists the error rates of the various algorithms for labeling new coming data. Not surprisingly, the order of the error rates is E_MKDPD^re ≤ E_MKDPD^semi ≤ E_MKDPD^un. This order coincides with the amount of information exploited by these algorithms from the new coming data.
In addition, the error rate of f_MKDPD^un is lower than that of f_MR^un, because the basic functions K(u, x_i | θ_i) of f_MKDPD^un exploit not only the locations of the samples but also the probabilities of the data at the samples, while the basic functions K(u, x_i) of f_MR^un exploit only the locations of the samples. Figure 1(a) shows how the error rates of the various algorithms change with the number of new coming data. Again, the error rate of f_MKDPD^semi lies between those of f_MKDPD^un and f_MKDPD^re. Since both f_MKDPD^un and f_MR^un use the original parameters θ_old and α_old to label the new coming data, their runtimes do not change no matter how many new coming data arrive. In the proposed algorithm f_MKDPD^semi, although the parameters θ are retrained from θ_old to θ_new according to the new coming data, the runtime also remains almost unchanged as the number of new coming data grows. This means that f_MKDPD^semi achieves a gain in accuracy without a significant increase in runtime.

Adjustment of Parameters of Kernel Functions.
In the proposed MKDPD algorithm, the basic functions of the label function f(x) are set to be K(u, x_i | θ_i), where θ_i = g(p(x_i)), i = 1, …, l+u. The schemes for adjusting the parameters θ_i are open; one can adopt various schemes according to specific applications. No matter how the parameters θ_i are adjusted, the structure of the analytic solutions shown in (19) will not change in the framework of the proposed MKDPD algorithm. In the experiments presented in this paper, the scheme for adjusting the parameters is based on the following considerations: (1) The parameters θ_i should be adjusted so as to make sup(p) ⊆ ∪_{i=1}^{l+u} sup(K(·, x_i | θ_i)). In this way, for all x ∈ Ω with p(x) ≠ 0, the label function f(x) can give the label of x.
(2) If the value of p(x_i) is large, the number of samples gathering in the neighborhood of x_i will be large too, because such samples are more likely to be collected. In order to prevent overfitting in that area, it is reasonable to adjust the parameter θ_i to reduce the scope of the support sup(K(·, x_i | θ_i)). Conversely, if the value of p(x_i) is small, the number of samples will be small too, because such samples are less likely to be collected. In order to prevent underfitting in that area, it is reasonable to adjust the parameter to expand the scope of the support sup(K(·, x_i | θ_i)). This means that the probability p(x_i) should be inversely proportional to the scope of the support.
(3) In the following experiments, we adopt Gaussian kernel functions K(u, x_i | σ_i) = exp(−‖u − x_i‖²/2σ_i²). The Gaussian kernel function can be regarded as a compactly supported function, the sample x_i is its center, and 3σ_i is often regarded as its effective radius. Therefore, the parameter σ_i is proportional to the scope of the support sup(K(·, x_i | σ_i)), or inversely proportional to the probability p(x_i); that is, σ_i = c_i/p(x_i), where c_i is an adjustable parameter. In our experiments, the probability p(x_i) of the sample x_i is set to p(x_i) = p̃(x_i)/Σ_j p̃(x_j), where p̃(x_i) = n(x_i)/(l+u) and n(x_i) is the number of neighbors of x_i.
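The neighbor-count density estimate and the inverse-density bandwidth rule above can be sketched as follows; the fixed neighborhood radius is an illustrative assumption.

```python
import numpy as np

def neighbor_density(X, radius):
    """p(x_i) proportional to the number of samples within `radius` of x_i,
    normalized to sum to 1 (the neighbor-count estimate described above)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    counts = (d2 <= radius ** 2).sum(1) - 1     # exclude the point itself
    p = counts / len(X)
    return p / p.sum()

def bandwidths(p, c=1.0):
    """sigma_i = c / p(x_i): dense regions get narrow kernels, sparse
    regions get wide ones."""
    return c / np.asarray(p, float)
```

On a dense cluster plus a sparse pair, the sparse points receive larger bandwidths, widening their kernels' supports exactly as consideration (2) prescribes.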

Experiments on Synthetic Dataset (Two Moons Dataset).
The synthetic dataset is the Two Moons Dataset, which has already been used in the experiments in Section 5.
The Two Moons dataset contains 400 data points nonuniformly collected from two half-moons: 200 points from one half-moon and the other 200 points from the other. We randomly take 100 points from each half-moon as samples, of which 1 is labeled and the others are unlabeled. The remaining 200 points in the Two Moons dataset are taken as the test data (i.e., the new coming data in Section 5).
Figure 2 shows the experimental results of the proposed MKDPD algorithm and the MR algorithm. In Figures 2(a … each digit as the samples and the remaining images as the test data (new coming data).
The MNIST dataset (http://yann.lecun.com/exdb/mnist/) is another popular handwritten digits dataset, which contains a training set of 60,000 images and a test set of 10,000 images. Each image in MNIST is of size 28 × 28 pixels and can be converted into a 784-dimensional vector. For each digit, we select 400 images from the training set as the samples, and all data in the test set are used as the test data (new coming data).

The Two-Class Experiments.
We randomly select two different digits from the ten digits to construct a binary classification problem; there is thus a total of 45 classification problems. For each binary classification problem, one sample is taken as the labeled sample for each digit and the remaining samples are taken as unlabeled samples. To avoid randomness, the labeled samples are randomly selected ten times and each selection produces an experimental result. The average of the ten experimental results is presented as the final result.
The experimental results of the proposed MKDPD algorithm and the MR algorithm are presented in Figure 4. In Figures 4(a), 4(b), 4(c), and 4(d), the x-axis indexes the 45 binary classification problems and the y-axis represents the error rates; the error rates on the unlabeled samples are shown in Figures 4(a) and 4(c), and the error rates on the test data in Figures 4(b) and 4(d). As can be seen, the error rates of the proposed MKDPD algorithm are lower than those of the MR algorithm. Furthermore, the averages over the 45 binary classification problems are listed in Table 3, from which it can be seen that the proposed MKDPD algorithm outperforms the MR algorithm. In Figures 4(e), 4(f), 4(g), and 4(h), the x-axis represents the error rates of labeling the unlabeled samples and the y-axis the error rates of labeling the test data. For a good learning algorithm, these two error rates should be close to each other; that is, the scatter points in Figures 4(e), 4(f), 4(g), and 4(h) should lie close to the diagonal line. Again, in this respect, the proposed MKDPD algorithm performs better than the MR algorithm.

The Multiclass Experiments.
In the multiclass experiments, there are 10 classes, and each class corresponds to a digit. The labeled samples of each digit are randomly selected from its samples 10 times, each selection producing an experimental result. The average of the 10 experimental results is taken as the final result and listed in the first and second columns of Table 5, where the number of labeled samples is set to 1, 3, and 5, respectively. It can be seen from Table 5 that the proposed MKDPD algorithm outperforms the MR algorithm.

Recognition of Spoken Letters
6.4.1. ISOLET Dataset. ISOLET is a dataset of spoken letters and can be downloaded from the UCI machine learning repository. ISOLET contains the utterances of 150 speakers, each of whom spoke the 26 English letters twice, so each speaker has 52 utterances. In the experiment, two subsets of ISOLET, named ISOLET1 and ISOLET5, respectively, are directly downloaded from (http://manifold.cs.uchicago.edu/manifold regularization/manifold.html). Each subset contains the utterances of 30 speakers. We take the data in ISOLET1 as the samples and the data in ISOLET5 as the test data (new coming data).

The Two-Class Experiments.
In the two-class experiments, the utterances of the first 13 English letters are classified as one class and the utterances of the last 13 English letters as another class. We take the 52 utterances of one speaker from ISOLET1 as the labeled samples and the utterances of the other speakers in ISOLET1 as the unlabeled samples. Since there are 30 speakers in ISOLET1, we can construct 30 two-class experiments this way; the 30 experimental results are presented in Figure 5. As can be seen from Figure 5, the proposed MKDPD algorithm outperforms the MR algorithm. The averages of the 30 experimental results are listed in Table 4, which also shows that the proposed MKDPD algorithm performs better than the MR algorithm.

The Multiclass Experiments.
In the multiclass experiments, the utterances of each English letter are classified as a class, so there are 26 classes. We randomly select m speakers from ISOLET1 and take their utterances as the labeled samples; the utterances of the other speakers in ISOLET1 are then taken as the unlabeled samples. In order to alleviate the effect of randomness, the selection of m speakers is performed 10 times, each selection producing an experimental result. The averages of the 10 experimental results are taken as the final results and listed in the third column of Table 5, from which it can be observed that the proposed MKDPD algorithm outperforms the MR algorithm.

Recognition of Face Images
In the face dataset, each person is photographed under 4 expressions, 24 illuminations, and 13 poses. In the experiment, we select the face images of 8 persons under 3 poses and 24 illuminations as the experimental dataset. Each image is cropped and resized to 32 × 32 pixels. For each person, 50% of his face images are taken as the samples, while his other face images are taken as the test data (new coming data).

The Two-Class Experiments.
We select the face images of two persons to construct a binary classification problem and then there are a total of 28 binary classification problems.
For each binary classification problem, m samples of each person are taken as the labeled samples and the remaining samples as the unlabeled samples. To avoid randomness, the m samples are randomly taken 10 times, each time producing an experimental result. The average of the 10 experimental results is presented as the final result (see Figures 6 and 7), where the number of labeled samples is set to m = 3, 5, and 7, respectively. The average of the 28 binary classification results is listed in Table 4. It can be seen that, compared with the MR algorithm, the proposed MKDPD algorithm achieves about 1%∼5% improvements.

The Multiclass Experiments.
In the multiclass experiments, we take the face images of each person as one class, so there are 8 classes. For each person, m samples are taken as the labeled samples and the remaining samples as the unlabeled samples. Again, the m labeled samples are randomly taken 10 times, each time producing an experimental result. The average of the 10 experimental results is presented as the final result and listed in the last four columns of Table 5, where the number of labeled samples is set to m = 3, 5, and 7, respectively. As can be seen, the proposed MKDPD algorithm achieves about 1%∼12% improvements over the MR algorithm.

Conclusion
In machine learning, one variable of a kernel function is often anchored on each given sample, thus deriving a number of basic functions of the label function. The weights of the basic functions in the label function are then trained by exploiting the labels of the labeled samples. The basic functions derived this way are the same in shape and differ only in their positions in the data space. Obviously, such basic functions are too simple to adapt to changes in the data distribution. For example, if the given samples are distributed unevenly, then in an area where too many samples are given, there will be too many basic functions, which may overlap too much; while in an area where too few samples are given, there will be too few basic functions, which may overlap too little or not at all. In the MKDPD algorithm proposed in this paper, we adjust the basic functions according to the probabilities of the data at the given samples. If the probability at a sample is large, then the number of samples in its vicinity will also be large, and we can shrink the support of the basic function located at that sample to avoid overlapping too much with other basic functions. Likewise, if the probability at a sample is small, the number of samples in its vicinity will also be small, and we can expand the support of the basic function located at that sample to avoid overlapping too little with other basic functions. The experimental results justify the proposed MKDPD algorithm.
From the perspective of data classification applications, the aim of machine learning is to label the new coming data. Usually, there are three methods: unlearning, relearning, and semilearning. In the MKDPD algorithm proposed in this paper, there are two learning processes: learning the basic functions and learning the weights of the basic functions. On this basis, we propose a semilearning method: it regards the new coming data as unlabeled samples and mixes them with the original samples to relearn the basic functions, but still uses the original weights to combine the new basic functions to label the new coming data.
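The semilearning idea can be sketched as follows under our own simplifications (regularized least squares for the weights, a neighbor-count density estimate for the bandwidths; neither is claimed to be the authors' exact formulation). The weights α are fit once on the original samples; when new data arrive, only the per-sample bandwidths are re-estimated from the enlarged sample set, while the original weights are reused.

```python
import numpy as np

def gram(X, centers, sigmas):
    """K[i, j] = exp(-||X_i - c_j||^2 / (2 sigma_j^2)): one basic
    function per center, each with its own bandwidth."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigmas[None, :] ** 2))

def bandwidths(S, r, sigma0):
    """sigma_i inversely proportional to a neighbor-count estimate of p(x_i)."""
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    p = (d2 <= r * r).sum(1) / len(S)  # counts include the point itself, so p > 0
    return sigma0 / p

def fit_weights(X, y, sigmas, lam=1e-3):
    """Regularized least squares for the combination weights alpha."""
    K = gram(X, X, sigmas)
    return np.linalg.solve(K.T @ K + lam * np.eye(len(X)), K.T @ y)

def semilearn_labels(X_old, y_old, X_new, r=1.0, sigma0=0.5, lam=1e-3):
    """Semilearning sketch: relearn the basic functions (bandwidths) with
    old + new data, but reuse the original weights to label the new data."""
    alpha = fit_weights(X_old, y_old, bandwidths(X_old, r, sigma0), lam)
    # Re-estimate the bandwidths of the original centers on the mixed set.
    sig_mixed = bandwidths(np.vstack([X_old, X_new]), r, sigma0)[: len(X_old)]
    return np.sign(gram(X_new, X_old, sig_mixed) @ alpha)
```

Only the bandwidth step touches the new data, which is why (as reported for Figure 1(b)) the semilearning variant avoids the cost of refitting the weights from scratch.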
Spaces. For the convenience of description, let K(x, v | θ) denote the kernel function, where θ represents the parameter of the kernel function. Thus, for the given data samples {x_1, ..., x_{l+u}} and data probability distribution p(x), the basic functions of the solution space generated from the kernel function K(x, v | θ) are expressed as K(x, x_i | θ_i), where θ_i = θ(p(x_i)), i = 1, ..., l + u.

Figure 1(b) shows the runtime of the various algorithms in labeling the new coming data. It can be seen from Figure 1(b) that the runtime of f^re_MKDPD increases exponentially with the number of new coming data, while the runtimes of f^un_MKDPD, f^semi_MKDPD, and f^un_MR remain almost unchanged.

Figure 1:
Figure 1: (a) shows the error rates of the four algorithms (f^un_MR, f^un_MKDPD, f^semi_MKDPD, and f^re_MKDPD) on the testing set, where the number of new coming samples is 50, 100, 150, and 200, respectively. (b) reports the computation time of the four algorithms as the number of new coming samples increases (from 50 to 200).

Figure 3:
Figure 3: Error rates (%) of MKDPD and MR on the unlabeled and testing samples of the Two Moons set, with the regularization parameter varied from 0 to 0.2.

Figure 5:
Figure 5: Two-class experiments: (a) error rates (%) of MKDPD and MR on the unlabeled samples of the ISOLET set, where the x-axis indexes the 30 binary classification problems; (b) error rates (%) of MKDPD and MR on the test samples of the ISOLET set.

Figure 6:
Figure 6: Two-class experiments: (a) error rates (%) of MKDPD and MR on the unlabeled samples of the Yale-B set; (b) error rates (%) of MKDPD and MR on the test samples of the Yale-B set. The x-axis of each figure indexes the 28 binary classification problems, and the value in brackets is the number of labeled samples in each class.

Figure 7:
Figure 7: Two-class experiments: (a) error rates (%) of MKDPD and MR on the unlabeled samples of the CMU-PIE set; (b) error rates (%) of MKDPD and MR on the test samples of the CMU-PIE set. The x-axis of each figure indexes the 28 binary classification problems, and the value in brackets is the number of labeled samples in each class.
Based on these new data samples, the relearning method minimizes the training objective anew to obtain the new coefficients {α^new_1, ..., α^new_{l+u+n}} of the label function f; the label of the new coming data x^new_j, j = 1, ..., n, is then given by evaluating f(x^new_j) with the kernels K(x^new_j, x_i).

Table 1:
Average error rates (%) of the four algorithms on the testing samples with different numbers of tests.

Table 3:
Average error rates (%) of the two-class experiments on the USPS, MNIST, and ISOLET datasets.


Table 4:
Average error rates (%) of the two-class experiments on the Yale-B and CMU-PIE datasets with different numbers of labeled samples.

Table 5:
Error rates (%) of the multiclass experiments on five datasets (USPS, MNIST, ISOLET, Yale-B, and CMU-PIE), where the value in brackets is the number of labeled samples in each class.