Distance Based Multiple Kernel ELM: A Fast Multiple Kernel Learning Approach

We propose a distance based multiple kernel extreme learning machine (DBMK-ELM), a two-stage multiple kernel learning approach with high efficiency. Specifically, DBMK-ELM first projects multiple kernels into a new space, in which new instances are reconstructed based on the distance between sample labels. Subsequently, an ℓ2-norm regularized least squares problem, in which the normal vector corresponds to the kernel weights of a new kernel, is solved on these new instances. The new kernel is then used to train and test the extreme learning machine (ELM). Extensive experimental results demonstrate the superior performance of the proposed DBMK-ELM in terms of accuracy and computational cost.


Introduction
Classification and regression are two major problems targeted by most machine learning, pattern recognition, and data mining methods. For example, one may use a classifier in a fingerprint identification system [1] or use regression to predict stock prices [2]. As a structure that can handle both classification and regression problems, single hidden layer feedforward networks (SLFNs) have been extensively studied. Many methods have been proposed to train SLFNs, such as the back propagation algorithm [3] and SVM [4]. However, these training methods may have two main drawbacks: local extrema and long training time.
Recently, Huang et al. proposed the extreme learning machine (ELM) to train SLFNs extremely fast [5,6]. It has been shown that ELM can overcome the main drawbacks of previous SLFN training methods. Specifically, ELM can learn a globally optimal solution for the SLFN parameters, which contributes to high performance in both classification and regression problems, and it does so in an extremely short time because it only needs to train the output weights between the hidden layer and the output layer via the least squares method [7,8]. Another attractive feature of ELM is that it establishes a unified model for solving both classification and regression problems [8]. Given these advantages, numerous valuable applications based on ELM have been proposed, such as [9,10]. Meanwhile, many researchers have advanced ELM, proposing online sequential ELM [11], voting based ELM [12], weighted ELM [13], sparse ELM [14], and kernel based ELM [8]. Among these, the kernel based ELM removes the dependence on the number of hidden nodes, so it is widely applicable and performs well. So far, most studies of kernel based ELM focus on using a single kernel. However, since the descriptive ability of a single kernel is weaker than that of multiple kernels in most cases, using multiple kernels in kernel based ELM may give better results. Moreover, multiple kernels can handle the fusion of information from multiple sources, which can further improve classification and regression performance in some cases.
Multiple kernel learning (MKL) is a family of machine learning methods that enable classifiers and regressors to use information from multiple kernels. Given a set of base kernels, the goal of MKL is to construct a new kernel that is better suited to the problem at hand by learning an appropriate combination of the base kernels. Typically, MKL methods fall into two categories: one-stage approaches and two-stage approaches. A one-stage approach learns the combination coefficients of the base kernels and the parameters of the classifier jointly by solving a joint optimization objective function. After [15] pioneered this kind of approach, which attracted great attention, much follow-up work has been proposed, including [16][17][18][19], to name just a few. In contrast, a two-stage approach, such as [20,21], first constructs a new kernel by finding a suitable combination of base kernels and then uses this combination in a classifier or regressor.
In previous work, some researchers introduced multiple kernels into ELM, for example [22,23], and obtained satisfactory results. Although these efforts were pioneering, all of them either fail to achieve an extreme learning speed or do not apply to both classification and regression. In this paper, we propose a novel multiple kernel based ELM, named distance based multiple kernel extreme learning machine (DBMK-ELM), which is a fast two-stage multiple kernel learning approach and can be adapted to both classification and regression. The rest of the paper is organized as follows. Section 2 briefly introduces ELM. Section 3 presents the proposed DBMK-ELM. Section 4 evaluates the performance of DBMK-ELM via extensive experiments. Finally, Section 5 concludes the paper.

Related Work
Our proposed method extends ELM, specifically, kernel based ELM. In this section, we briefly introduce ELM and kernel based ELM.

Extreme Learning Machine.
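The extreme learning machine is an efficient training method for single hidden layer feedforward networks (SLFNs). Since it was proposed by Huang et al. [6], ELM has been widely used in numerous areas. Its main advantages, such as fast training and high performance, are attributed to the fact that ELM randomly generates the weights between the input layer and the hidden layer and uses a least squares method to learn the remaining weights. For instance, consider an SLFN as illustrated in Figure 1, which has $L$ hidden layer nodes and one output layer node. Its output function is

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\boldsymbol{\beta}, \tag{1}$$

where $\boldsymbol{\beta}$ is the vector of weights between the hidden layer and the output layer and $\mathbf{h}(\mathbf{x}) = [G(\mathbf{a}_1, b_1, \mathbf{x}), \ldots, G(\mathbf{a}_L, b_L, \mathbf{x})]$ is the hidden layer output for input $\mathbf{x}$, with randomly generated hidden node parameters $(\mathbf{a}_j, b_j)$. ELM minimizes the training error and the norm of the output weights at the same time:

$$\min_{\boldsymbol{\beta},\,\boldsymbol{\xi}} \; \frac{1}{2}\|\boldsymbol{\beta}\|^2 + \frac{C}{2}\sum_{i=1}^{n}\xi_i^2 \quad \text{s.t.} \quad \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} = t_i - \xi_i, \quad i = 1, \ldots, n, \tag{2}$$

where $\{\mathbf{x}_i, t_i\}_{i=1}^{n}$ are the training samples, $\xi_i$ is the training error of the $i$th sample, and $C$ is a trade-off parameter. According to the KKT theorem, $\boldsymbol{\beta}$ can be calculated from

$$\boldsymbol{\beta} = \mathbf{H}^{\top}\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^{\top}\right)^{-1}\mathbf{T} \tag{3}$$

or

$$\boldsymbol{\beta} = \left(\frac{\mathbf{I}}{C} + \mathbf{H}^{\top}\mathbf{H}\right)^{-1}\mathbf{H}^{\top}\mathbf{T}, \tag{4}$$

where $\mathbf{H} = [\mathbf{h}(\mathbf{x}_1)^{\top}, \ldots, \mathbf{h}(\mathbf{x}_n)^{\top}]^{\top}$ and $\mathbf{T} = [t_1, \ldots, t_n]^{\top}$.

In kernel based ELM [8], the explicit feature mapping is replaced by a kernel function $K(\cdot,\cdot)$ through the kernel matrix

$$\boldsymbol{\Omega} = \mathbf{H}\mathbf{H}^{\top}, \qquad \Omega_{ij} = \mathbf{h}(\mathbf{x}_i)\cdot\mathbf{h}(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j). \tag{5}$$

Therefore, the ELM output function can be written as follows:

$$f(\mathbf{x}) = \left[K(\mathbf{x}, \mathbf{x}_1), \ldots, K(\mathbf{x}, \mathbf{x}_n)\right]\left(\frac{\mathbf{I}}{C} + \boldsymbol{\Omega}\right)^{-1}\mathbf{T}. \tag{6}$$

In this case, users do not have to know $\mathbf{h}(\mathbf{x})$ or set the number of hidden nodes $L$, that is, the dimension of the ELM feature space.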
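As a rough illustration of the kernel based ELM output function, the following minimal NumPy sketch fits and evaluates a kernel ELM on a toy one-dimensional regression task. The function names, the Gaussian parameterization `exp(-gamma * ||u - v||^2)`, and the toy data are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and B.
    The form exp(-gamma * ||a - b||^2) is an assumed parameterization."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_elm_fit(K_train, T, C=1.0):
    """Solve (I/C + K) alpha = T, so that f(x) = k(x)^T alpha."""
    n = K_train.shape[0]
    return np.linalg.solve(np.eye(n) / C + K_train, T)

def kernel_elm_predict(K_test_train, alpha):
    """Each row of K_test_train holds [K(x, x_1), ..., K(x, x_n)] for one test point."""
    return K_test_train @ alpha

# Toy data (hypothetical): fit a sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 1))
T = np.sin(3.0 * X[:, 0])

K = gaussian_kernel(X, X, gamma=5.0)
alpha = kernel_elm_fit(K, T, C=1e3)
pred = kernel_elm_predict(K, alpha)
train_mse = float(np.mean((pred - T) ** 2))
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit matrix inverse, which is usually both faster and numerically safer.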

Distance Based Multiple Kernel ELM
Since the goal of multiple kernel learning is to construct a new kernel that is more suitable for the problem at hand, the nature of a "good" kernel must be considered. Generally, a kernel can be seen as a measure of similarity: each entry in a kernel matrix represents the similarity of the two corresponding samples. From this point of view, a "good" kernel reflects the true similarity of sample pairs. In other words, if two sample pairs are similarly alike, their corresponding entries in a "good" kernel will also be close. To this end, we propose the distance based multiple kernel ELM. It measures the similarity of sample pairs by their "label distance", which is defined in the following part, and uses this information to construct a "good" kernel. DBMK-ELM is a two-stage multiple kernel learning method. In the first stage, it learns a new kernel. In the second stage, it uses the new kernel in kernel based ELM.

Label Distance.
We first define the "label distance". As a significant part of the training samples, the label contains class information in the classification case and dependent variable information in the regression case, which can be used to measure the true similarity between samples. Because labels have different meanings in classification and regression, we discuss the label distance in these two cases separately.
In the classification case, the label is the class that a sample belongs to, and samples can be seen as similar when they are in the same class. In other words, if two samples have the same label, they can be seen as similar; if they have different labels, they can be seen as different. However, in this case it is difficult to quantify how different two samples are, because the difference between classes is not clear. Therefore, the label distance is defined as 0 if two samples have the same label and as 1 if they have different labels. Formally, we define the label distance $D(y_i, y_j)$ as follows:

$$D(y_i, y_j) = \begin{cases} 0, & y_i = y_j, \\ 1, & y_i \neq y_j. \end{cases} \tag{7}$$

In the regression case, the label is the value of the dependent variable. Typically, the similarity of samples is directly reflected in the difference between their dependent variable values, for example, in pollution prediction, housing number prediction, and stock price prediction, to name just a few. Thus, the label distance can be defined as the distance between two values of the dependent variable. Admittedly, various measures can be used for this distance, but the Euclidean distance is used in this paper. In the regression case, we formally define the label distance $D(y_i, y_j)$ as follows:

$$D(y_i, y_j) = \|y_i - y_j\|_2. \tag{8}$$
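The two definitions of label distance can be sketched in a few lines of NumPy. The function name `label_distance_matrix` and the `task` argument are hypothetical; for scalar regression labels, the Euclidean distance reduces to the absolute difference.

```python
import numpy as np

def label_distance_matrix(y, task):
    """Pairwise label distances for all sample pairs (hypothetical helper).

    classification: 0 for equal labels, 1 for different labels.
    regression:     Euclidean distance between scalar labels, i.e. |y_i - y_j|.
    """
    y = np.asarray(y)
    if task == "classification":
        return (y[:, None] != y[None, :]).astype(float)
    if task == "regression":
        y = y.astype(float)
        return np.abs(y[:, None] - y[None, :])
    raise ValueError("task must be 'classification' or 'regression'")

D_cls = label_distance_matrix([0, 1, 0], "classification")
D_reg = label_distance_matrix([0.5, -1.0], "regression")
```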

Multiple Kernel Learning Based on Distance.
With the label distance defined, the distance information can be used to guide the learning of the new kernel. In this subsection, we show how DBMK-ELM performs multiple kernel learning based on distance. As discussed, the new optimal kernel can be seen as a linear combination of base kernels. Therefore, the goal of distance based multiple kernel learning (DBMK) is to learn combination coefficients of the base kernels such that each entry of the resulting kernel agrees with the label distance of the corresponding sample pair. If the values of the same entry in each base kernel, which correspond to the same sample pair, are seen as input features and the label distance of that sample pair is seen as the output value, DBMK becomes a regression problem in which the parameters of the regressor are the combination coefficients to be learned. To this end, DBMK learns the combination coefficients in two steps. First, it reconstructs a new sample space, named K-space, in which each sample corresponds to a sample pair in the original sample space, based on the base kernels and the label distances of the sample pairs. Second, it solves a regression problem in this new space to find the combination coefficients of the kernels. After these two steps, DBMK-ELM obtains the new optimal kernel by combining the base kernels using the learned combination coefficients.
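For machine learning problems, including classification and regression, suppose that the training samples $(\mathbf{x}, y)$ are drawn from a distribution $P$ over $\mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^d \times \mathbb{R}$ and that $p$ base kernels, which must satisfy the positive semidefinite condition, are generated by a set of kernel functions $\{k_1(\cdot,\cdot), \ldots, k_p(\cdot,\cdot)\}$ and denoted as $\mathbf{K}_1, \ldots, \mathbf{K}_p$. Each sample pair $(\mathbf{x}_i, \mathbf{x}_j)$ in the original space then yields one sample in the K-space, whose input features are the corresponding entries of the base kernels and whose output label is the label distance of the pair:

$$\mathbf{z}_{ij} = \left[k_1(\mathbf{x}_i, \mathbf{x}_j), \ldots, k_p(\mathbf{x}_i, \mathbf{x}_j)\right]^{\top}, \qquad t_{ij} = D(y_i, y_j), \tag{9}$$

where $D$ denotes the label distance defined above.

In the K-space, DBMK-ELM learns the combination coefficients by solving a regression problem. In this problem, the training samples are $(\mathbf{z}, t)$, that is, all samples in the K-space, in which the input feature vector is $\mathbf{z}$ and the output label is $t$. Although many methods can be used to solve the regression problem, DBMK-ELM applies an ℓ2-norm regularized least squares linear regression, which is similar to ELM output weight learning, for the sake of training speed. This method minimizes the training error and the norm of the combination coefficients at the same time. If the K-space has $N$ samples, indexed as $(\mathbf{z}_i, t_i)$, $i = 1, \ldots, N$, the optimization objective can be formalized as follows:

$$\min_{\boldsymbol{\mu},\,\boldsymbol{\xi}} \; \frac{1}{2}\|\boldsymbol{\mu}\|^2 + \frac{C}{2}\sum_{i=1}^{N}\xi_i^2 \quad \text{s.t.} \quad \boldsymbol{\mu}^{\top}\mathbf{z}_i = t_i - \xi_i, \quad i = 1, \ldots, N, \tag{10}$$

where $\boldsymbol{\mu}$ is the vector of combination coefficients that DBMK-ELM needs to learn, $(\mathbf{z}_i, t_i)$ corresponds to the $i$th sample in the K-space, $\xi_i$ represents the training error generated by the $i$th sample, and $C$ is a trade-off parameter.
Using the KKT theorem, solving (10) is equivalent to solving its dual optimization problem:

$$L = \frac{1}{2}\|\boldsymbol{\mu}\|^2 + \frac{C}{2}\sum_{i=1}^{N}\xi_i^2 - \sum_{i=1}^{N}\alpha_i\left(\boldsymbol{\mu}^{\top}\mathbf{z}_i - t_i + \xi_i\right), \tag{11}$$

where $\alpha_i$ is the Lagrange multiplier corresponding to the $i$th sample. In this case, the KKT optimality conditions can be written as follows:

$$\frac{\partial L}{\partial \boldsymbol{\mu}} = 0 \;\Rightarrow\; \boldsymbol{\mu} = \mathbf{Z}^{\top}\boldsymbol{\alpha}, \qquad \frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = C\xi_i, \qquad \frac{\partial L}{\partial \alpha_i} = 0 \;\Rightarrow\; \boldsymbol{\mu}^{\top}\mathbf{z}_i - t_i + \xi_i = 0, \tag{12}$$

where $\mathbf{Z} = [\mathbf{z}_1, \ldots, \mathbf{z}_N]^{\top}$ and $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_N]^{\top}$. According to (12), we have

$$\boldsymbol{\mu} = \mathbf{Z}^{\top}\left(\frac{\mathbf{I}}{C} + \mathbf{Z}\mathbf{Z}^{\top}\right)^{-1}\mathbf{T} \tag{13}$$

and equally

$$\boldsymbol{\mu} = \left(\frac{\mathbf{I}}{C} + \mathbf{Z}^{\top}\mathbf{Z}\right)^{-1}\mathbf{Z}^{\top}\mathbf{T}, \tag{14}$$

where $\mathbf{T} = [t_1, \ldots, t_N]^{\top}$ in both equations above. Equation (13) requires solving an $N \times N$ system, whereas (14) requires only a $p \times p$ system. We can therefore use (13) to calculate $\boldsymbol{\mu}$ when the number of training samples in the K-space is not huge and use (14) in the opposite case to speed up the computation. Finally, DBMK-ELM obtains the new optimal kernel by combining the base kernels according to $\boldsymbol{\mu}$ as follows:

$$\mathbf{K}_{\text{new}} = \sum_{k=1}^{p}\mu_k\mathbf{K}_k. \tag{15}$$

Similarly, for any sample pair $(\mathbf{x}_i, \mathbf{x}_j)$, the new optimal kernel function can be written as follows:

$$k_{\text{new}}(\mathbf{x}_i, \mathbf{x}_j) = \sum_{k=1}^{p}\mu_k k_k(\mathbf{x}_i, \mathbf{x}_j). \tag{16}$$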
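The K-space construction and the two closed-form solutions for the combination coefficients can be sketched as follows. This is a minimal NumPy illustration under assumptions: the helper names, the three toy base kernels, and the value of the trade-off parameter are hypothetical, and the example only checks that the two solution forms agree numerically.

```python
import numpy as np

def build_k_space(kernels, D):
    """Turn p base kernel matrices (each n x n) and an n x n label distance
    matrix into K-space samples: Z has shape (n*n, p), T has shape (n*n,)."""
    Z = np.stack([K.ravel() for K in kernels], axis=1)
    T = D.ravel()
    return Z, T

def weights_n_by_n(Z, T, C):
    """Form with an N x N linear system: mu = Z^T (I/C + Z Z^T)^{-1} T."""
    N = Z.shape[0]
    return Z.T @ np.linalg.solve(np.eye(N) / C + Z @ Z.T, T)

def weights_p_by_p(Z, T, C):
    """Form with a p x p linear system: mu = (I/C + Z^T Z)^{-1} Z^T T."""
    p = Z.shape[1]
    return np.linalg.solve(np.eye(p) / C + Z.T @ Z, Z.T @ T)

# Toy classification data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([0, 0, 1, 1, 0, 1])
D = (y[:, None] != y[None, :]).astype(float)  # label distance: 0 same, 1 different

inner = X @ X.T
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
kernels = [inner, (inner + 1.0) ** 2, np.exp(-0.5 * sq)]

Z, T = build_k_space(kernels, D)
mu_a = weights_n_by_n(Z, T, C=10.0)
mu_b = weights_p_by_p(Z, T, C=10.0)
K_new = sum(m * K for m, K in zip(mu_b, kernels))  # weighted sum of base kernels
```

Because the K-space has one sample per original sample pair, the p-by-p form is usually the cheaper one whenever the number of base kernels is much smaller than the number of pairs.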

Multiple Kernel Extreme Learning Machine.
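In the second stage, DBMK-ELM uses the learned new kernel, which contains the information of multiple kernels, in the kernel based ELM for both training and testing. Therefore, the output function of DBMK-ELM can be written as follows:

$$f(\mathbf{x}) = \left[k_{\text{new}}(\mathbf{x}, \mathbf{x}_1), \ldots, k_{\text{new}}(\mathbf{x}, \mathbf{x}_n)\right]\left(\frac{\mathbf{I}}{C} + \mathbf{K}_{\text{new}}\right)^{-1}\mathbf{Y}, \tag{17}$$

where $k_{\text{new}}$ and $\mathbf{K}_{\text{new}}$ are the learned kernel function and the learned kernel matrix on the $n$ training samples, $C$ is the ELM trade-off parameter, and $\mathbf{Y} = [y_1, \ldots, y_n]^{\top}$. At this point, DBMK-ELM can deal with classification and regression problems at a fast speed while benefiting from multiple kernels. For a classification or regression problem, DBMK-ELM first learns the combination coefficients of the pregenerated base kernels using (13) or (14). Then, it calculates a new optimal kernel by (15). Finally, the result of the problem is obtained by (17). The algorithm of DBMK-ELM is summarized in Algorithm 1.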
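The three steps just described (learn the coefficients, combine the base kernels, apply kernel based ELM) can be sketched end to end as below. This is an illustrative reading of the procedure rather than the authors' implementation: the function names, the toy regression data, the three base kernels, and the parameter values `C1` (ELM training) and `C2` (kernel learning) are all assumptions.

```python
import numpy as np

def learn_kernel_weights(train_kernels, D, C2=1.0):
    """Stage 1: regularized least squares in K-space (p x p system)."""
    Z = np.stack([K.ravel() for K in train_kernels], axis=1)  # (n*n, p)
    T = D.ravel()                                             # label distances
    p = Z.shape[1]
    return np.linalg.solve(np.eye(p) / C2 + Z.T @ Z, Z.T @ T)

def combine(kernels, mu):
    """Weighted sum of kernel matrices with coefficients mu."""
    return sum(m * K for m, K in zip(mu, kernels))

def dbmk_elm_fit(train_kernels, y, D, C1=1.0, C2=1.0):
    """Learn mu, build the new kernel, then solve the kernel ELM system."""
    mu = learn_kernel_weights(train_kernels, D, C2)
    K_new = combine(train_kernels, mu)
    n = K_new.shape[0]
    alpha = np.linalg.solve(np.eye(n) / C1 + K_new, y)
    return mu, alpha

def dbmk_elm_predict(test_train_kernels, mu, alpha):
    """Stage 2 at test time: rows are test points, columns are training points."""
    return combine(test_train_kernels, mu) @ alpha

def base_kernels(A, B):
    """Three hypothetical base kernels: linear, polynomial (degree 2), Gaussian."""
    inner = A @ B.T
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return [inner, (inner + 1.0) ** 2, np.exp(-0.5 * sq)]

# Toy regression data (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))
y = X[:, 0] + 0.1 * rng.normal(size=30)
tr, te = np.arange(20), np.arange(20, 30)

D = np.abs(y[tr][:, None] - y[tr][None, :])  # regression label distance
mu, alpha = dbmk_elm_fit(base_kernels(X[tr], X[tr]), y[tr], D, C1=10.0, C2=10.0)
pred = dbmk_elm_predict(base_kernels(X[te], X[tr]), mu, alpha)
```

Note that the learned coefficients are not constrained to be nonnegative in this sketch, so the combined kernel is not guaranteed to be positive semidefinite.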

Connected to TS-MKL.
A previous successful multiple kernel learning approach, TS-MKL (two-stage multiple kernel learning) [21], also uses label information to learn the kernel. It assigns +1 to sample pairs with the same label and −1 to pairs with different labels to construct its target labels. This is very similar to the DBMK-ELM label distance in the classification case. The experimental results in [21] show that this method can find a good kernel that achieves state-of-the-art classification performance. However, TS-MKL can only solve classification problems. DBMK-ELM does not just consider the difference between classes but uses the label distance to measure the similarity of a sample pair. In this way, DBMK-ELM adapts not only to classification problems but also to regression problems.

Experiments
In this section, we first compare DBMK-ELM with several methods on both classification and regression benchmarks using pregenerated kernels. To verify the multiple kernel learning performance of DBMK-ELM, we compare it with SimpleMKL [16] and the unweighted sum of kernels (UW). We also compare DBMK-ELM with the basic kernel based ELM [8], for which the best of the base kernels is used. Then, we compare the classification accuracy of DBMK-ELM and ELM on multiple kernel classification benchmarks, in which different kernels are generated from different channels, to demonstrate the multisource data fusion ability of DBMK-ELM. In addition, we compare DBMK-ELM with state-of-the-art multiple kernel extreme learning machines, namely, ℓ1-MK-ELM and the radius-incorporated MK-ELM (R-MK-ELM) [23], on the classification benchmarks mentioned above, since these two methods only suit classification cases. Finally, we conduct a parameter sensitivity test for DBMK-ELM.
Three multiple kernel classification benchmarks from bioinformatics data sets are selected in our experiments. The first is the original plant data set of TargetP [26]. The others are PsortPos and PsortNeg [27], both of which address the bacterial protein location problem. Table 3 shows the number of training samples, testing samples, kernels, and classes in these data sets.
For each data set, we randomly select two-thirds of the data samples as training data and the rest as testing data.We repeat this procedure 20 times for each data set and obtain 20 partitions of original data.All algorithms in the experiment are evaluated on each partition and the averaged results are reported for each benchmark.
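The repeated random partitioning can be sketched as follows. The function name, the seed, and the example size of 150 samples are assumptions for illustration; only the two-thirds split and the 20 repetitions come from the text.

```python
import numpy as np

def random_partitions(n_samples, n_repeats=20, train_fraction=2 / 3, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per repetition."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_fraction * n_samples))
    partitions = []
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)          # fresh shuffle for every repetition
        partitions.append((perm[:n_train], perm[n_train:]))
    return partitions

partitions = random_partitions(150)
```

Every algorithm would then be trained and tested on the same list of partitions, so that per-partition results can be compared in a paired fashion.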

Parameters Setting and Evaluation Criteria.
For both classification and regression benchmark data sets, we generate 23 kernels on the full feature vector: 20 Gaussian kernels (based on the squared distance $\|\mathbf{x}_i - \mathbf{x}_j\|^2$) with the kernel parameter taking values in $\{2^{-10}, 2^{-9}, \ldots, 2^{9}\}$, and 3 polynomial kernels of degrees 1, 2, and 3. For the kernel based ELM [8], we test all 23 kernels generated above and report the best result among them according to the testing accuracy. For all algorithms, the regularization parameter $C$ is selected from $\{10^{-1}, 10^{0}, \ldots, 10^{3}\}$ via 3-fold cross validation on the training data. We use accuracy and computational efficiency as the performance evaluation criteria. Accuracy means the classification accuracy rate on the testing data for classification problems and the mean square error (MSE) on the testing data for regression problems. In addition, for regression problems, sample labels are normalized to $[-1, 1]$. In all cases, the computational efficiency is evaluated by the training time.
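A bank of 23 base kernels of this kind can be generated as sketched below. The exact kernel formulas are assumptions, since the text does not fully specify them: the Gaussian kernels are written here as `exp(-g * ||a - b||^2)` with `g` ranging over the 20 powers of two, and the polynomial kernels as `(a . b + 1)^d`.

```python
import numpy as np

def squared_distances(A, B):
    """Pairwise squared Euclidean distances between rows of A and rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.maximum(sq, 0.0)  # clip tiny negative values caused by rounding

def kernel_bank(A, B):
    """20 Gaussian kernels plus 3 polynomial kernels (assumed parameterizations)."""
    sq = squared_distances(A, B)
    gaussians = [np.exp(-g * sq) for g in 2.0 ** np.arange(-10, 10)]
    inner = A @ B.T
    polynomials = [(inner + 1.0) ** d for d in (1, 2, 3)]
    return gaussians + polynomials

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
bank = kernel_bank(X, X)
```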
The reported results for each benchmark include the mean value and the standard deviation of each criterion over the 20 partitions. To measure the statistical significance of the accuracy improvement, we further use the paired Student's t-test, in which the p value is the probability that the two compared sets come from distributions with an equal mean. Typically, if the p value is less than 0.05, the compared sets are considered to have a statistically significant difference.
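For reference, the paired t statistic over the 20 partitions has the standard form below. The symbols $a_k$ and $b_k$ (the accuracies of the two compared methods on partition $k$) are notation introduced here for illustration, and a two-sided test is assumed.

```latex
d_k = a_k - b_k, \quad k = 1, \ldots, 20, \qquad
\bar{d} = \frac{1}{20}\sum_{k=1}^{20} d_k, \qquad
s_d^2 = \frac{1}{19}\sum_{k=1}^{20}\left(d_k - \bar{d}\right)^2, \qquad
t = \frac{\bar{d}}{s_d / \sqrt{20}}
```

Under the null hypothesis of equal means, $t$ follows a Student's t distribution with 19 degrees of freedom, from which the p value is read.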

Classification Performance.
The classification accuracy of the different methods is shown in Table 4. Each entry in Table 4 consists of two parts: the first is the mean ± standard deviation and the second is the p value calculated by the paired Student's t-test. Bold values in Table 4 mark the highest accuracy and those with no significant difference from the highest one. We also show the classification training time in Table 5, presented as the mean ± standard deviation.
As can be seen from Table 4, DBMK-ELM achieves the highest classification accuracy or has no significant difference from the best method. Meanwhile, the results in Table 5 show that the time cost of this approach is significantly lower than that of SimpleMKL, ℓ1-MK-ELM, and R-MK-ELM.

Regression Performance.
The regression accuracy and the regression training time of the different methods are shown in Tables 6 and 7, respectively. In this case, DBMK-ELM achieves significantly higher regression accuracy than the other methods. In terms of time cost, the situation is similar to the classification problem; that is, DBMK-ELM requires dramatically less training time than SimpleMKL.

Multiple Kernel Classification Benchmark Performance.
The classification accuracy of DBMK-ELM and the other methods on the multiple kernel classification benchmarks is shown in Table 8, and the corresponding training time is shown in Table 9. From the results, we can see that DBMK-ELM is significantly better than ELM. This means that DBMK-ELM is able to perform multisource data fusion, thereby improving the performance of ELM. DBMK-ELM is also better than the state-of-the-art multiple kernel extreme learning machines in this case in terms of both classification accuracy and training time.
Parameter Sensitivity Test.
In the proposed DBMK-ELM, two regularization parameters need to be set. For clarity, we use $C_1$ and $C_2$ to denote the regularization parameters in ELM training and in multiple kernel learning, respectively. We choose the classification data set ionosphere and the regression data set yacht to test parameter sensitivity. For each data set, we vary $C_1$ and $C_2$ over a wide range; the results are shown in Figures 2 and 3, respectively. As can be seen from the results, the performance of DBMK-ELM is not sensitive to $C_1$ and $C_2$ over a wide range.

Discussion.
The experimental results have illustrated that DBMK-ELM can achieve high accuracy with a fast learning speed. However, two issues need to be discussed.
(1) Why is the learning speed of DBMK-ELM much slower than that of basic ELM in some cases? The main reason is that there are many samples to learn in the K-space, which is constructed in the multiple kernel learning step. Specifically, if there are $n$ original samples, there will be $n^2$ corresponding new samples. Therefore, the training time difference between DBMK-ELM and basic ELM grows as the number of training samples increases. It may be possible to reduce this gap by using sampling techniques in the K-space.
(2) In which cases should we use DBMK-ELM? DBMK-ELM obtains more accurate results and a faster learning speed than the traditional multiple kernel learning method SimpleMKL. Although it compares favorably with other multiple kernel learning methods, DBMK-ELM costs more time than the basic ELM method, so a trade-off between accuracy and time cost is needed. The experimental results show that, in most cases, DBMK-ELM significantly improves testing accuracy over basic ELM in regression and multisource data fusion problems. In the classification case where kernels are generated from one data source, however, DBMK-ELM shows no significant difference from basic ELM in testing accuracy. Therefore, a preferable choice is to apply DBMK-ELM to regression and multisource data fusion problems and to use basic ELM for classification problems with kernels generated from a single data source.
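The sampling idea mentioned under the first issue could look like the sketch below, which draws a uniform random subset of sample pairs instead of using all of them. This is only one possible reading: the paper does not specify a sampling scheme, and the function name, the uniform sampling without replacement, and the toy sizes are assumptions.

```python
import numpy as np

def sample_k_space(kernels, D, n_pairs, seed=0):
    """Build K-space samples from a uniform random subset of the n*n sample pairs
    (hypothetical scheme; the sampling strategy is not specified in the text)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    flat = rng.choice(n * n, size=n_pairs, replace=False)
    i, j = np.unravel_index(flat, (n, n))          # row and column of each chosen pair
    Z = np.stack([K[i, j] for K in kernels], axis=1)
    T = D[i, j]
    return Z, T

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20)
D = (y[:, None] != y[None, :]).astype(float)
inner = X @ X.T
kernels = [inner, (inner + 1.0) ** 2]

Z_sub, T_sub = sample_k_space(kernels, D, n_pairs=50)  # 50 of the 400 pairs
```

The subsampled `Z_sub` and `T_sub` would then replace the full K-space samples in the regularized least squares step, trading some accuracy of the fit for a smaller problem.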

Conclusion
In this paper, we have proposed DBMK-ELM, a new multiple kernel based ELM that extends the basic kernel based ELM. The proposed multiple kernel learning method unifies classification and regression problems. Moreover, DBMK-ELM is able to learn from multiple kernels at an extremely fast speed. Experimental results show that DBMK-ELM achieves a significant performance improvement in terms of accuracy and time cost in both classification and regression problems. In the future, we will consider how to define a better distance among different classes and how to extend DBMK-ELM to semisupervised learning problems.
To recap the two-stage procedure: in the first stage, DBMK-ELM finds the combination coefficients of the pregenerated base kernels based on the training samples. It first projects the original base kernels into a new space and reconstructs new instances based on the distance of the training samples. Then, it transforms the multiple kernel learning problem into a binary classification problem or a regression problem and solves it using the least squares method. Finally, it constructs the new kernel from the base kernels based on the learned combination coefficients. In the second stage, DBMK-ELM adopts the new kernel in kernel based ELM. Experimental results demonstrate the following advantages of our proposed DBMK-ELM: (1) the training time of DBMK-ELM is extremely short compared with traditional MKL methods; (2) DBMK-ELM can fully use multiple source information and outperform previous MKL methods in terms of classification and regression accuracy; (3) DBMK-ELM can improve the robustness and the accuracy of basic kernel based ELM in both classification and regression cases.

Figure 2: Classification case: performances of DBMK-ELM with different parameters on the ionosphere data set.

Figure 3: Regression case: performances of DBMK-ELM with different parameters on the yacht data set.

Table 1 :
Summary of the classification problems data sets.
Algorithm 1 (final step): calculate the output o using (17) for each sample (x, y) in the testing set (X_tst, Y_tst).

Table 2 :
Summary of the regression problems data sets.

Table 3 :
Summary of the multiple kernel classification benchmark data sets.
Table 6 reports the regression accuracy with the same representation as Table 4, and Table 7 reports the regression training time.

Table 8 :
Multiple kernel classification case: classification accuracy (%). Boldface means no statistical difference from the best one (p value ≥ 0.05).
Parameter sensitivity test settings: we use 10 different values of $C_1$ and 10 different values of $C_2$ from $\{10^{-1}, 10^{0}, \ldots, 10^{7}, 10^{8}\}$. For each $(C_1, C_2)$ pair, we repeat the experiment 20 times on each data set to obtain the average accuracy. The results for the classification case and the regression case are shown in Figures 2 and 3, respectively.

Table 9 :
Multiple kernel classification case: classification training time (s).