Improving Localized Multiple Kernel Learning via Radius-Margin Bound

Xiaoming Wang


Introduction
Over the past decade, kernel methods [1] have attracted considerable attention in the machine learning community and have been widely applied. A kernel characterizes the similarity between two samples [2]. The performance of a kernel-based algorithm often depends strongly on the choice of kernel, and an unsuitable kernel generally leads to poor performance. It is therefore critical to choose a suitable kernel for a kernel-based algorithm.
Recent research on kernel methods has highlighted the need to learn a suitable kernel matrix or function from the training data. A generic technique for doing so is multiple kernel learning (MKL) [3]. Given a set of predefined base kernel functions, MKL seeks a combination of them according to a criterion that maximizes a generalization performance measure or minimizes an error bound. Practical problems frequently involve multiple heterogeneous data sources [4], and MKL is well matched to this setting. Many studies [5][6][7][8][9][10][11][12][13] have shown that MKL can generally find a suitable combination of base kernel functions and thus usually achieves better performance than a single kernel. The idea of MKL has been applied to many kernel-based algorithms, for example, the support vector machine (SVM), a powerful machine learning method based on Vapnik's statistical learning theory [14]. In this paper, we focus only on SVM-based MKL.
Localized multiple kernel learning (LMKL) [15][16][17] is an attractive MKL method that combines multiple heterogeneous attributes according to their discriminative ability for each individual instance. Other MKL methods generally try to learn a global combination over the whole input space [2,5,6,18,19], whereas LMKL assumes that a sample-specific local combination better reflects the distinctive characteristics of each instance. This is the key difference between LMKL and other MKL methods. LMKL consists of an SVM learning problem and a parametric gating model. The gating model assigns local weights to the predefined base kernels. A two-step alternating optimization method is employed to train the two components. In contrast with other MKL methods, LMKL generally yields fewer support vectors while achieving statistically similar accuracy. The idea of LMKL has been extended to other kernel-based methods and successfully used in some practical applications [20][21][22][23].
However, LMKL learns the kernel function (essentially the parameters of the gating model) only by maximizing the margin which is embodied in single-kernel-based SVM.
A key fact is that the generalization performance of SVM depends not only on the separating margin but also on the radius of the smallest ball that encloses the data [24][25][26][27][28]. A standard (single-kernel) SVM need not exploit the radius, because the radius of the minimum enclosing ball (MEB) is fixed once the kernel and its parameters are selected. In the context of LMKL, however, the radius is not fixed but is a function of the parameters of the gating model.
Several attempts have recently been made to incorporate the radius into SVM-based MKL [29]. However, most of these works target SVM with $\ell_2$ soft margin ($\ell_2$-SVM), because the $\ell_2$-SVM problem can be transformed into a hard-margin SVM, for which the radius-margin bound holds and can be used to conduct model selection [25]. Unfortunately, for SVM with $\ell_1$ soft margin ($\ell_1$-SVM), the radius-margin bound does not hold, as there is no way to reduce the $\ell_1$-SVM formulation to a hard-margin SVM. Consequently, one cannot directly use the radius-margin bound in LMKL, whose formulation is rooted in $\ell_1$-SVM. However, Chung et al. [27] investigated several heuristic bounds for SVM and developed a modified radius-margin bound for model selection in $\ell_1$-SVM. Their experimental results indicated its effectiveness.
Inspired by the work of Chung et al. [27] and aiming to address this drawback of LMKL, we propose an improved version of LMKL, named ILMKL. A notable characteristic of the proposed method is that it takes into account both the separating margin and the radius of the MEB; that is, it integrates the margin and radius information to measure the goodness of the kernel and to learn the parameters of the gating model. A key insight of our work is that learning the parameters of the gating model in LMKL is similar to conducting model selection. As in LMKL, the problem of the proposed method can be solved efficiently in a coupled manner by the two-step alternating optimization method. Moreover, the proposed method treats the regularization parameter as an extra parameter that can be learned automatically, so it can be tuned jointly with the parameters of the gating model while the kernel function is learned. This improves the computational efficiency of the proposed method to some extent, since it avoids the time-consuming cross-validation process. Comprehensive experiments are conducted, and the results demonstrate the efficiency and effectiveness of the proposed method.
The rest of the paper is organized as follows. Section 2 reviews the related work. In Section 3, we first present the formulation of the proposed method and then detail how to solve the optimization problem; after that, we discuss the proposed method and outline the algorithm. Section 4 reports the experimental results, and conclusions are drawn in Section 5.

Preliminaries
In this paper, we consider a training dataset of $n$ samples, $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, where each sample $\mathbf{x}_i \in \mathbb{R}^d$ has a corresponding label $y_i \in \{1, -1\}$. Here, $i = 1, \ldots, n$ and $d$ is the dimension of the sample space. For convenience, we denote by $\mathcal{I}$ the set of all indices; that is, $\mathcal{I} = \{1, \ldots, n\}$.

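Radius-Margin Bound for SVM.
SVM embodies the structural risk minimization principle, which is related to the probability of incorrectly classifying an unknown sample. Geometrically, the key idea of SVM is to construct a separating hyperplane in the data space by applying the maximal margin principle between two classes of samples [14]. In the nonlinear case, $\ell_1$-SVM defines the following optimization problem:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i\left(\langle \mathbf{w}, \Phi(\mathbf{x}_i)\rangle + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i \in \mathcal{I}, \tag{1}$$

where $\Phi(\mathbf{x}) : \mathbb{R}^d \to \mathcal{H}$, $\langle \cdot, \cdot \rangle$ is the inner product of two vectors, $\xi_i$ represents the training error, and $C$ is the regularization parameter that balances the training error against the regularization term $\|\mathbf{w}\|^2$. Problem (1) can be solved efficiently by transforming it into its corresponding dual problem [30], which is formulated as

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C, \;\; i \in \mathcal{I}. \tag{2}$$

Here, $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle$ is called the kernel function. Suppose that $\boldsymbol{\alpha}^* = [\alpha_1^*, \ldots, \alpha_n^*]^{\top}$ solves the above optimization problem and $b^*$ is the optimal threshold, which can be computed using the KKT conditions of (1); the decision function of SVM is then formulated as

$$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{n} \alpha_i^* y_i k(\mathbf{x}_i, \mathbf{x}) + b^*\right). \tag{3}$$

To obtain good performance in practical applications, it is very important to choose suitable hyperparameters for SVM, namely the regularization parameter $C$ and the parameters of the kernel function $k(\cdot, \cdot)$. This is the so-called model selection [24]. One can set these hyperparameters empirically, but this is hard work because suitable values cannot be known in advance across the full range of practical applications. Many works have tried to find a good criterion for learning the hyperparameters automatically [24][25][26]. In [27], Chung et al. proposed the following bound for $\ell_1$-SVM:

$$\left(R^2 + \frac{\Delta}{C}\right)\left(\|\mathbf{w}\|^2 + 2C\sum_{i=1}^{n}\xi_i\right), \tag{4}$$

where $\Delta$ is a parameter that can generally be set as $\Delta = 1$ and $R$ refers to the radius of the MEB. This radius-margin bound is differentiable and was successfully used to conduct model selection for $\ell_1$-SVM in [27].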
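As a small illustration of how such a bound is evaluated once an SVM and an MEB problem have been solved, the sketch below assumes the modified radius-margin bound of [27] takes the form $(R^2 + \Delta/C)(\|\mathbf{w}\|^2 + 2C\sum_i \xi_i)$. The function and argument names are illustrative and do not come from the original work.

```python
def modified_radius_margin_bound(r2, w_norm2, slacks, c, delta=1.0):
    """Evaluate (R^2 + delta/C) * (||w||^2 + 2*C*sum(xi)).

    r2      -- squared MEB radius R^2 (assumed already computed)
    w_norm2 -- squared norm ||w||^2 of the SVM weight vector
    slacks  -- iterable of slack variables xi_i
    c       -- regularization parameter C (must be positive)
    delta   -- the Delta parameter, 1.0 by default as suggested in the text
    """
    if c <= 0:
        raise ValueError("C must be positive")
    return (r2 + delta / c) * (w_norm2 + 2.0 * c * sum(slacks))


# Hypothetical values, chosen only to make the arithmetic easy to follow.
bound_value = modified_radius_margin_bound(
    r2=1.0, w_norm2=4.0, slacks=[0.0, 0.5], c=2.0
)
```

With these inputs the two factors are $1 + 1/2$ and $4 + 2 \cdot 2 \cdot 0.5$, so a larger radius or a larger training error both increase the bound.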

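Localized Multiple Kernel Learning.
In the context of MKL, we assume that there exist $P$ different mappings $\Phi_m(\mathbf{x}) : \mathbb{R}^d \to \mathcal{H}_m$ ($m = 1, \ldots, P$), where the $m$th mapping is associated with the base kernel $k_m(\cdot, \cdot)$ of the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_m$.
As an MKL method, LMKL is based on $\ell_1$-SVM and defines its optimization problem as follows:

$$\min_{\mathbf{w}_m, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\tilde{\mathbf{w}}\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i\left(\sum_{m=1}^{P} \eta_m(\mathbf{x}_i)\langle \mathbf{w}_m, \Phi_m(\mathbf{x}_i)\rangle + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i \in \mathcal{I}, \tag{5}$$

where $\tilde{\mathbf{w}} = [\mathbf{w}_1; \ldots; \mathbf{w}_P]$, $C$ is the regularization parameter that balances the training error against the regularization term $\|\tilde{\mathbf{w}}\|^2$, and $\xi_i$ represents the training error. Here, $\mathbf{w}_m \in \mathcal{H}_m$, $\Phi_m(\mathbf{x}) \in \mathcal{H}_m$, and $\eta_m(\mathbf{x})$ is a gating function defined up to a set of parameters that need to be learned from the training data. Using duality, the dual formulation of the primal problem (5) is

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j k_{\eta}(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C, \;\; i \in \mathcal{I}, \tag{6}$$

where the locally combined kernel function is defined as

$$k_{\eta}(\mathbf{x}_i, \mathbf{x}_j) = \sum_{m=1}^{P} \eta_m(\mathbf{x}_i)\, k_m(\mathbf{x}_i, \mathbf{x}_j)\, \eta_m(\mathbf{x}_j). \tag{7}$$

If the gating model $\eta_m(\mathbf{x})$ is constant (not a function of $\mathbf{x}$), LMKL finds a fixed combination over the whole input space and is similar to the original MKL formulation. The main advantage of LMKL is that it can achieve statistically similar accuracy while storing fewer support vectors than the original MKL.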
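To make the locally combined kernel concrete, the following minimal sketch builds its kernel matrix from precomputed base kernel matrices and gating values, assuming the usual LMKL form $k_\eta(\mathbf{x}_i, \mathbf{x}_j) = \sum_m \eta_m(\mathbf{x}_i) k_m(\mathbf{x}_i, \mathbf{x}_j) \eta_m(\mathbf{x}_j)$. The function name and the toy numbers are illustrative only.

```python
def locally_combined_kernel(base_kernels, gates):
    """Return the n x n matrix of k_eta(x_i, x_j).

    base_kernels[m][i][j] holds k_m(x_i, x_j) for base kernel m.
    gates[i][m] holds the gating value eta_m(x_i) for sample i.
    """
    n = len(gates)
    p = len(base_kernels)
    return [
        [
            sum(gates[i][m] * base_kernels[m][i][j] * gates[j][m] for m in range(p))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Two samples, two base kernels (toy values).
base_kernels = [
    [[1.0, 0.5], [0.5, 1.0]],  # base kernel 1
    [[2.0, 0.0], [0.0, 2.0]],  # base kernel 2
]
# Sample 0 uses only kernel 1; sample 1 weights both kernels equally.
gates = [[1.0, 0.0], [0.5, 0.5]]
combined = locally_combined_kernel(base_kernels, gates)
```

Because each base kernel is weighted on both sides by the gating values of the two samples involved, the combined matrix stays symmetric, and constant gating values reduce it to a fixed weighted sum of the base kernels.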

The Proposed LMKL Framework
In this section, we first present the primal optimization problem of the proposed method ILMKL and then detail how to solve it, and finally some preliminary discussion on the proposed method is given and the algorithm is outlined.

Primal Optimization Model of the Proposed Method.
In the context of LMKL, the radius of the MEB is not fixed but is a function of the parameters of the gating model. Nevertheless, LMKL learns the parameters of the gating model using only the separating margin, and thus ignores the fact that the generalization performance of SVM depends not only on the separating margin but also on the radius. The purpose of learning the parameters of the gating model is, in essence, to yield an appropriate kernel matrix for good performance. In our opinion, this process is similar to model selection, by which SVM chooses appropriate parameters to achieve good performance. Therefore, following the basic idea of the work in [27], we define the primal optimization problem of ILMKL as problem (9). As in Section 2.2, $\tilde{\mathbf{w}} = [\mathbf{w}_1; \ldots; \mathbf{w}_P]$ and $\eta_m(\mathbf{x})$ is a gating function. As in [15], we employ the softmax gating model, determined by the parameters $\mathbf{k}_m$ ($m = 1, \ldots, P$), which can be expressed as

$$\eta_m(\mathbf{x}) = \frac{\exp\left(\langle \mathbf{k}_m, \tilde{\mathbf{x}} \rangle\right)}{\sum_{h=1}^{P} \exp\left(\langle \mathbf{k}_h, \tilde{\mathbf{x}} \rangle\right)}, \tag{10}$$

where $x_{ij}$ is the $j$th feature of the $i$th sample, $\tilde{\mathbf{x}} = [1, \mathbf{x}^{\top}]^{\top}$, $\mathbf{k}_m$ ($m = 1, \ldots, P$) are the parameters of the gating model associated with the $m$th kernel, and the softmax guarantees nonnegativity. As pointed out in [17], more complex gating models can be used. In contrast with LMKL (5), ILMKL has two notable characteristics. First, it takes into consideration both the radius and the margin. Second, it treats the regularization parameter $C$ as a variable that is learned automatically while the parameters of the gating model are learned. To sum up, the key insight of our method is that, in the context of LMKL, learning the parameters of the gating model is similar to conducting model selection for SVM.
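The softmax gating model can be sketched in a few lines. The code below assumes the gating score for kernel $m$ is the inner product of a parameter vector with the augmented sample $[1, \mathbf{x}]$, so the first entry of each parameter vector acts as a bias; the names `softmax_gating` and `gate_params` are illustrative.

```python
import math


def softmax_gating(x, gate_params):
    """Return [eta_1(x), ..., eta_P(x)] for one sample x.

    gate_params[m] is the parameter vector of kernel m, laid out as
    [bias, weight_1, ..., weight_d] (an assumed layout).
    """
    x_aug = [1.0] + list(x)  # augmented sample [1, x]
    scores = [sum(k * v for k, v in zip(k_m, x_aug)) for k_m in gate_params]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two kernels, two features: kernel 1 is favored when the first feature is
# large, kernel 2 when it is small (toy parameters).
gate_params = [[0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
eta_right = softmax_gating([2.0, 0.0], gate_params)
eta_center = softmax_gating([0.0, 3.0], gate_params)
```

The gating values are nonnegative and sum to one for every sample, and they change smoothly with the sample, which is what lets the kernel weights vary across the input space.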

Training with Alternating Optimization.
In general, it is very difficult to solve problem (9) directly. In LMKL, a two-step alternating optimization method is employed to find the parameters of the gating model and the discriminant function. ILMKL uses the same strategy.
The first step fixes $\{\mathbf{k}_m, C\}$ and solves (9) with respect to $\{\mathbf{w}_m, b, \boldsymbol{\xi}, \mathbf{a}, R\}$; the second step updates $\{\mathbf{k}_m, C\}$ by a gradient-descent method. The objective value obtained for a fixed $\{\mathbf{k}_m, C\}$ is an upper bound for (9), and $\{\mathbf{k}_m, C\}$ are updated according to the current solution. With a proper step size selection procedure, the objective value obtained at the next iteration cannot be greater than the current one, so the objective value of (9) never increases as the iterations progress. Note that this procedure does not guarantee convergence to the global optimum, so the initial values of $\{\mathbf{k}_m, C\}$ may affect the solution quality.
In this subsection, we discuss how to solve problem (9) when $\{\mathbf{k}_m, C\}$ is fixed. The following two subsections discuss how to optimize $\mathbf{k}_m$ and $C$, respectively.
For a fixed $\{\mathbf{k}_m, C\}$, the minimization in (9) takes the form of a product of two minimization problems (11). With the two factors denoted as in (12), problem (9) can be expressed, for a fixed $\{\mathbf{k}_m, C\}$, as problem (13). Note that if the gating model parameters and the regularization parameter are fixed, the optimization problem (13) becomes convex, and its solution can be found through the dual problem (14), in which the locally combined kernel function $k_{\eta}(\mathbf{x}_i, \mathbf{x}_j)$ is defined as in (7). This formulation amounts to solving, respectively, a canonical SVM dual problem and a canonical support vector domain description (SVDD) [31] dual problem with the kernel matrix $\mathbf{K}_{\eta} = (k_{\eta}(\mathbf{x}_i, \mathbf{x}_j))_{n \times n}$, which should be positive semidefinite.
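For the radius part of this step, the sketch below estimates the squared MEB radius from a kernel matrix. It assumes the hard-margin SVDD dual, $\max_{\boldsymbol{\beta}} \sum_i \beta_i K_{ii} - \sum_{i,j} \beta_i \beta_j K_{ij}$ subject to $\sum_i \beta_i = 1$ and $\beta_i \ge 0$, and solves it with a plain Frank-Wolfe iteration rather than a dedicated SVDD solver. Both choices are simplifications made for illustration and are not taken from the paper.

```python
def meb_radius_squared(kernel, iterations=2000):
    """Approximate R^2 of the minimum enclosing ball in feature space.

    kernel is an n x n positive semidefinite matrix (list of lists).
    Returns (r2, beta), where beta are the dual weights on the simplex.
    """
    n = len(kernel)
    beta = [1.0 / n] * n  # start at the center of the simplex
    for t in range(iterations):
        k_beta = [sum(kernel[i][j] * beta[j] for j in range(n)) for i in range(n)]
        # Gradient of the concave dual objective with respect to beta.
        grad = [kernel[i][i] - 2.0 * k_beta[i] for i in range(n)]
        best = max(range(n), key=lambda i: grad[i])  # best simplex vertex
        step = 2.0 / (t + 2.0)  # standard Frank-Wolfe step size
        beta = [(1.0 - step) * b for b in beta]
        beta[best] += step
    k_beta = [sum(kernel[i][j] * beta[j] for j in range(n)) for i in range(n)]
    r2 = sum(beta[i] * kernel[i][i] for i in range(n)) - sum(
        beta[i] * k_beta[i] for i in range(n)
    )
    return r2, beta


# Linear kernel for the points (-1, 0), (1, 0), (0, 0.5): the smallest
# enclosing circle is centered at the origin with radius 1.
points = [(-1.0, 0.0), (1.0, 0.0), (0.0, 0.5)]
kernel = [[a[0] * b[0] + a[1] * b[1] for b in points] for a in points]
radius_sq, weights = meb_radius_squared(kernel)
```

In ILMKL the same computation would be run on the locally combined kernel matrix, so the resulting radius changes whenever the gating parameters change.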
Finally, once the gating model $\eta_m(\mathbf{x})$ has been learned and problem (14) is solved, the resulting discriminant function of the proposed method ILMKL can be expressed as follows:

$$f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i y_i k_{\eta}(\mathbf{x}_i, \mathbf{x}) + b.$$

Optimizing the Parameters of the Gating Model.
To optimize the parameters $\mathbf{k}_m$ of the gating model $\eta_m(\mathbf{x})$ by a gradient-descent method, one needs the derivatives of the primal objective with respect to $\mathbf{k}_m$. We now discuss how to calculate them.
Starting from the definitions above, we can calculate the derivatives of $k_{\eta}(\mathbf{x}_i, \mathbf{x}_j)$ with respect to the gating model parameters $\mathbf{k}_m$. From these, the derivatives of $\mathcal{I}(\mathbf{k}_m, C)$ with respect to the parameters $\mathbf{k}_m$ of the gating model $\eta_m(\mathbf{x})$ follow and are given by (20).
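The derivative of the locally combined kernel with respect to the gating parameters can be checked numerically. The sketch below assumes softmax gating on the augmented sample $\tilde{\mathbf{x}} = [1, \mathbf{x}]$ and the combined kernel $k_\eta = \sum_m \eta_m(\mathbf{x}_i) k_m \eta_m(\mathbf{x}_j)$; under those assumptions the chain rule gives $\partial k_\eta / \partial \mathbf{k}_h = \sum_m \eta_m(\mathbf{x}_i) k_m \eta_m(\mathbf{x}_j) [(\delta_{mh} - \eta_h(\mathbf{x}_i))\tilde{\mathbf{x}}_i + (\delta_{mh} - \eta_h(\mathbf{x}_j))\tilde{\mathbf{x}}_j]$. All function names and numbers are illustrative.

```python
import math


def gating(x, gate_params):
    """Softmax gating values eta_1..eta_P for one sample x."""
    x_aug = [1.0] + list(x)  # prepend 1 so the first parameter is a bias
    scores = [sum(k * v for k, v in zip(k_m, x_aug)) for k_m in gate_params]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combined_kernel(xi, xj, base_values, gate_params):
    """k_eta(x_i, x_j); base_values[m] holds k_m(x_i, x_j)."""
    eta_i = gating(xi, gate_params)
    eta_j = gating(xj, gate_params)
    return sum(a * k * b for a, k, b in zip(eta_i, base_values, eta_j))


def combined_kernel_grad(xi, xj, base_values, gate_params, h):
    """Analytic gradient of k_eta(x_i, x_j) with respect to gate_params[h]."""
    eta_i = gating(xi, gate_params)
    eta_j = gating(xj, gate_params)
    xi_aug = [1.0] + list(xi)
    xj_aug = [1.0] + list(xj)
    grad = [0.0] * len(xi_aug)
    for m, k_m in enumerate(base_values):
        weight = eta_i[m] * k_m * eta_j[m]
        delta = 1.0 if m == h else 0.0
        for d in range(len(grad)):
            grad[d] += weight * (
                (delta - eta_i[h]) * xi_aug[d] + (delta - eta_j[h]) * xj_aug[d]
            )
    return grad


# Toy inputs: two samples with two features, two base kernel values.
xi, xj = [0.5, -1.0], [1.5, 0.2]
base_values = [0.8, 2.0]
gate_params = [[0.1, 0.3, -0.2], [-0.4, 0.2, 0.5]]
grad = combined_kernel_grad(xi, xj, base_values, gate_params, h=0)
```

Comparing `grad` against central finite differences of `combined_kernel` is a cheap way to validate the analytic expression before it is used inside a gradient-descent loop.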

Optimizing the Regularization Parameter.
In our method, the regularization parameter $C$ is treated as a variable that is learned together with the gating model. As with the gating model parameters, we employ a gradient-descent method to optimize $C$, so the derivative of (14) with respect to $C$ is needed. The derivative of $\mathcal{I}(\mathbf{k}_m, C)$ with respect to $C$ is given by (21). To evaluate it, one first needs $\partial \mathcal{I}_1(\mathbf{k}_m, C)/\partial C$ and $\partial \mathcal{I}_2(\mathbf{k}_m, C)/\partial C$, which are computed separately.

Discussion.
It should be noted that $C > 0$ must hold throughout the procedure. During the iterations, however, this condition may be violated. To deal with this problem, following [27], we use $\ln(C)$ instead of $C$ in the solving procedure, because $\ln(C)$ can be any real number when $C > 0$. The positivity constraint on $C$ is thus avoided. The above partial derivatives must then be rewritten: by the chain rule, (21) is modified to $\partial \mathcal{I}(\mathbf{k}_m, C)/\partial(\ln C) = C\, \partial \mathcal{I}(\mathbf{k}_m, C)/\partial C$ (23). Finally, according to the above discussion, we outline the complete algorithm of ILMKL in Algorithm 1.

Algorithm 1: ILMKL.
(1) Initialize $\ln(C_{\text{init}})$ and initialize $\mathbf{k}_m$ to small random numbers for $m = 1, \ldots, P$;
(2) while stopping criterion not met do
(3) Compute $C$ with $C = \exp(\ln(C))$;
(4) Calculate $k_{\eta}(\mathbf{x}_i, \mathbf{x}_j)$ with the gating model according to (7) when fixing $\mathbf{k}_m$;
(5) Compute $\mathcal{I}(\mathbf{k}_m, C)$ by using a canonical SVM solver and a canonical SVDD solver with $k_{\eta}(\mathbf{x}_i, \mathbf{x}_j)$ according to (14);
(6) Compute $\partial \mathcal{I}(\mathbf{k}_m, C)/\partial \mathbf{k}_m$ for $m = 1, \ldots, P$ with (20);
(7) Compute $\partial \mathcal{I}(\mathbf{k}_m, C)/\partial(\ln(C))$ with (23);
(8) Update $\mathbf{k}_m$ and $\ln(C)$ by the gradient-descent method;
(9) end while
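The effect of optimizing $\ln(C)$ instead of $C$ can be illustrated on a toy problem. In the sketch below, the quadratic objective $(C - 3)^2$ is a hypothetical stand-in for $\mathcal{I}(\mathbf{k}_m, C)$, which in the actual method would come from the SVM and SVDD solvers; the function name, step size, and iteration count are likewise assumptions.

```python
import math


def descend_log_c(grad_wrt_c, c_init=1.0, step=0.05, iterations=200):
    """Gradient descent on u = ln(C), given a callable for dI/dC.

    Since C = exp(u), the chain rule gives dI/du = C * dI/dC, and C stays
    strictly positive whatever value u takes.
    """
    log_c = math.log(c_init)
    for _ in range(iterations):
        c = math.exp(log_c)
        log_c -= step * c * grad_wrt_c(c)  # update in log space
    return math.exp(log_c)


# Toy objective I(C) = (C - 3)^2 with derivative 2 * (C - 3).
learned_c = descend_log_c(lambda c: 2.0 * (c - 3.0))
```

Starting from $\ln(C_{\text{init}}) = 0$, as suggested in [27], the iterate moves toward the minimizer without any explicit check that $C$ remains positive.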

Experiments
In this section, we report the experimental results. In the first experiment, we investigate the influence of parameters on the performance of ILMKL. In the second experiment, we further explore learning the regularization parameter $C$ from different initial values on a synthetic dataset. In the third experiment, we evaluate the proposed method on several UCI datasets and compare it with traditional MKL methods.

Parameter Influence on Performance of the Proposed Method.
In the training procedure of LMKL, the regularization parameter $C$ must be predefined. In our method ILMKL, $C$ is tuned automatically while the parameters of the gating model are learned. However, we first need to set the parameter $\Delta$ and the initial value of $C$ (denoted by $C_{\text{init}}$). In [27], the authors suggested $\Delta = 1$ and $\ln(C_{\text{init}}) = 0$. In this subsection, we investigate the influence of these parameters on the final learned regularization parameter $C$ and on the classification performance of the proposed method.
We used the Sonar dataset from the UCI repository [32]. In the experiment, 50% of the samples were randomly selected for training and the rest were used for testing. The data were preprocessed as follows: first, the mean and the standard deviation of each feature were computed from the training data; then, the training examples were normalized to zero mean and unit variance by subtracting the mean and dividing by the standard deviation; finally, the testing examples were preprocessed in the same way using the training mean and standard deviation. The base kernels include one linear kernel and one polynomial kernel of degree two. All kernel matrices are calculated and normalized to unit trace before training.
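The preprocessing described above can be sketched as follows. The code assumes the population standard deviation (dividing by the number of training samples) and replaces a zero standard deviation by one to avoid division by zero; neither detail is specified in the text, and all names are illustrative.

```python
import math


def fit_standardizer(train_rows):
    """Compute per-feature mean and standard deviation from training data."""
    n = len(train_rows)
    d = len(train_rows[0])
    means = [sum(row[j] for row in train_rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in train_rows) / n) or 1.0
        for j in range(d)
    ]
    return means, stds


def standardize(rows, means, stds):
    """Apply training statistics to any set of rows (training or testing)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def unit_trace(kernel):
    """Scale a kernel matrix so that its trace equals one."""
    trace = sum(kernel[i][i] for i in range(len(kernel)))
    return [[v / trace for v in row] for row in kernel]


# Toy data: two training samples and one test sample with two features.
train_rows = [[1.0, 10.0], [3.0, 30.0]]
test_rows = [[2.0, 40.0]]
means, stds = fit_standardizer(train_rows)
train_std = standardize(train_rows, means, stds)
test_std = standardize(test_rows, means, stds)

# Linear kernel on the standardized training data, normalized to unit trace.
linear = [[sum(a * b for a, b in zip(u, v)) for v in train_std] for u in train_std]
linear_unit = unit_trace(linear)
```

Computing the statistics from the training split only, and reusing them for the test split, keeps information from the test samples out of the training procedure.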
Figure 1 shows the experimental results under different values of $\Delta$. Figure 1(a) shows that the value of $\Delta$ influences the final value of the learned $C$. The reason, in our opinion, is that the algorithm may fall into a local minimum, since the gradient-descent method cannot guarantee finding the global minimum. Nevertheless, the final values of the learned $C$ under different $\Delta$ are on the whole close to each other. Moreover, according to Figure 1(b), the classification accuracies under different $\Delta$ show almost no difference. Figure 2 shows the experimental results under different initial values of $C$. As in Figure 1(a), Figure 2(a) shows that the initial value $C_{\text{init}}$ also affects the final learned $C$. However, according to Figure 2(b), the initial value scarcely influences the classification accuracy.
Therefore, it can be concluded that the proposed method is effective at learning a suitable $C$ in the SVM-based LMKL scenario under different values of $\Delta$ and different initial values $C_{\text{init}}$.

Experimental Results on Synthetic Dataset.
In the above experiments, the final learned regularization parameter $C$ is always much larger than its initial value $C_{\text{init}}$, so $C$ almost always increases during learning. In the following experiments, we show that the final learned $C$ can in fact be either larger or smaller than the initial value $C_{\text{init}}$.
Following [15], we create a synthetic dataset consisting of two classes, each containing 200 samples. The samples come from four Gaussian components (two for each class), each with its own mean vector and covariance matrix. The samples in class 1 come from the first two components (denoted by red ×) and the others belong to class 2 (denoted by blue +). Here, we adopt the same base kernels as in Section 4.1, that is, one linear kernel and one polynomial kernel of degree two. Before training, all kernel matrices are computed and normalized to unit trace.
Figure 3 illustrates the experimental results of LMKL under different values of the regularization parameter $C$. It can easily be seen that the value of $C$ severely impacts the result of LMKL. Obviously, the experimental result illustrated in Figure 3(c) is better. Here, the regularization parameter $C$ is set as $C = 10$. The experimental results of the proposed method are illustrated in Figure 4. We obtain almost the same result under different initial values $C_{\text{init}}$. Moreover, the final learned $C$ is sometimes smaller and sometimes larger than the initial value. Note that the final values of the learned $C$ differ under different initial values $C_{\text{init}}$. The reason, as pointed out in Section 4.1, is that the gradient-descent method is employed to optimize $C$. However, the final learned regularization parameters are close to each other. The experiments further validate that our method can effectively learn the regularization parameter $C$ automatically. This is an advantage over traditional LMKL.

Experimental Results on UCI Datasets.
In this subsection, we compare the performance of SimpleMKL [2], LMKL [15], and the proposed method on several UCI datasets [32].
In the experiment, we use 50% of each dataset as the training set and the rest as the test set. As in Section 4.1, the data were normalized to zero mean and unit standard deviation. The base kernels include seven Gaussian kernels with widths $\{3^{-3}, 3^{-2}, 3^{-1}, 3^{0}, 3^{1}, 3^{2}, 3^{3}\}$ and four polynomial kernels with degrees one to four. Before training, all kernel matrices are calculated and normalized to unit trace. Each experiment is repeated 50 times, and the mean accuracy and standard deviation are computed. SimpleMKL and LMKL use cross-validation to choose the regularization parameter $C$ from the set $\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}\}$. For our method ILMKL, cross-validation is not needed to select $C$, because an appropriate value is learned.
Table 1 reports the classification accuracies of several SVM-based MKL methods on the selected datasets. From Table 1, LMKL has performance comparable to SimpleMKL. On the whole, however, ILMKL shows a clear improvement in classification performance over SimpleMKL and LMKL. These experimental results indicate that the generalization performance of SVM-based MKL can be improved when the radius of the MEB is taken into account, which is the idea embodied in ILMKL.
For a rigorous comparison, we further conducted paired two-tailed $t$-tests [33] on these methods. In the $t$-test, the $p$ value is the probability, under the hypothesis that the two sets come from distributions with equal means, of observing a difference at least as large as the one measured. The smaller the $p$ value, the more significant the difference between the two mean values. Generally, 0.05 is viewed as a typical value; that is, a difference is considered statistically significant when the $p$ value is smaller than 0.05. Table 2 reports the results of the $t$-tests. For example, the $p$ value of the $t$-test comparing LMKL and SimpleMKL on the Ionosphere dataset is 0.0862 (>0.05), meaning that SimpleMKL does not perform significantly better than LMKL on this dataset at the 0.05 significance level, although SimpleMKL has better classification accuracy according to Table 1. However, ILMKL performs significantly better than SimpleMKL on this dataset at the 0.05 significance level, since the $p$ value of the $t$-test is 0.0296 (<0.05). From Table 2, ILMKL on the whole shows a significant improvement in generalization performance over SimpleMKL and LMKL.
Finally, we investigated the support vector percentages of the methods on the selected datasets, which are reported in Table 3. Generally, fewer support vectors mean less test time. From Table 3, LMKL tends to have more support vectors in contrast with SimpleMKL. The proposed method ILMKL has on the whole similar support vector percentages to LMKL. So, our method inherits the advantage of LMKL that it stores fewer support vectors but can achieve statistically similar accuracy results compared with other MKL methods.

Conclusions
In this paper, following the work in [27], we presented a novel LMKL method. Unlike traditional LMKL, our method takes into consideration both the radius and the margin when learning the parameters of the gating model. As a result, our method can achieve better accuracy. At the same time, our method can automatically tune the regularization parameter $C$ while learning the parameters of the gating model. This improves the computational efficiency of our method by avoiding the time-consuming cross-validation needed to find a suitable regularization parameter. Comprehensive experiments are conducted on several toy and benchmark datasets, and the results demonstrate the efficiency and effectiveness of the proposed method.

Figure 1: The experimental results of ILMKL under different $\Delta$ on the Sonar dataset.

Figure 2: The experimental results of ILMKL under different initial values of $C$ on the Sonar dataset.

Figure 3: The experimental results of LMKL on the synthetic dataset. Panels show the obtained separating hyperplane under $C_{\text{svm}} = 10^{-1}$, $10^{0}$, $10^{1}$, $10^{2}$, and $10^{3}$.

Figure 4: The experimental results of ILMKL on the synthetic dataset.

Table 1: Classification accuracy (mean ± standard deviation) on the selected datasets.

Table 2: $p$ values of the $t$-tests on the selected datasets.

Table 3: Comparison of support vector rates on the selected datasets.