An Extreme Learning Machine Based on the Mixed Kernel Function of Triangular Kernel and Generalized Hermite Dirichlet Kernel

Because the performance of an extreme learning machine (ELM) correlates strongly with its kernel function, this paper proposes a novel extreme learning machine based on a generalized triangular Hermite kernel function. First, the generalized triangular Hermite kernel function is constructed as the product of the triangular kernel and the generalized Hermite Dirichlet kernel, and it is proved to be a valid kernel function for the extreme learning machine. Then, the learning methodology of the extreme learning machine based on the proposed kernel function is presented. The biggest advantage of the proposed kernel is that its kernel parameter takes values only in the natural numbers, which greatly shortens the computational time of parameter optimization while retaining more of the structure information of the sample data. Experiments were performed on a number of binary classification, multiclass classification, and regression datasets from the UCI benchmark repository. The experimental results demonstrate that the proposed method outperforms other extreme learning machines with different kernels in robustness and generalization performance. Furthermore, the learning speed of the proposed method is faster than that of support vector machine (SVM) methods.


Introduction
The kernel extreme learning machine (KELM) was proposed by Huang et al. in 2010 by applying kernel functions to the ELM algorithm [1,2], where the random hidden layer feature mapping in ELM is replaced by the kernel mapping [3]. It effectively mitigates the poor generalization performance and instability caused by the stochastic nature of the hidden layer output matrix and greatly reduces computational complexity. In KELM, optimization of the number of hidden layer nodes is avoided and the least squares optimal solution can be obtained. Compared with SVM [4] and basic ELM [5,6], it provides more stable and better generalization performance. Hence, KELM has been widely applied in classification and regression problems [7,8].
It is well known that the learning ability and generalization performance of ELM mainly depend on the kernel function, and different kernel functions, or the same kernel function with different parameters, influence the generalization performance differently. Besides, the time required to search for the optimal kernel parameters differs among kernel functions. Normally, the selection and optimization of kernel parameters are tedious and time-consuming [9,10]. Liu et al. [11] pointed out that the common Gaussian and polynomial kernel functions are very sensitive to changes of the kernel parameters, so the kernel parameters must be searched over a large range with a small step size, which leads to high computational complexity. To solve this issue, a series of SVM kernel functions based on generalized orthogonal polynomials has been constructed in recent years [12-16]. Ozer et al. [12] introduced a set of Chebyshev kernel functions derived from the generalized Chebyshev polynomials. Their test results show that the generalized Chebyshev kernel generally approaches the minimum number of support vectors for classification. Furthermore, Zhang et al. constructed a series of novel SVM kernel functions, such as the Laguerre kernel function from the generalized Laguerre polynomial and the Hermite kernel function from the generalized Hermite polynomial [13-15]. Those algorithms can shorten the time of parameter optimization; however, the parameters of the weight function are handled so simply that the influence of the structure information of the sample data on the generalization performance is neglected. Tian and Wang [17] verified that the Gaussian Hermite kernel function achieves the highest classification accuracy in binary classification problems compared with the other orthogonal polynomial kernel functions above; however, its training efficiency and robustness are relatively lower.
Based on the above analysis and on Hermite orthogonal polynomials, a mixed kernel function called the Generalized Triangular Hermite kernel function is constructed as the product of the triangular kernel and the generalized Hermite Dirichlet kernel. This kernel function has only one parameter chosen from a small range of integers; thus the parameter optimization is greatly facilitated, and more structure information of the sample data is retained. It is proved in theory that the Generalized Triangular Hermite kernel is an admissible kernel function for the extreme learning machine. The effectiveness of the proposed method for regression, binary classification, and multiclass classification problems is demonstrated by numerical experiments on a number of real-world datasets from the UCI benchmark repository, with results compared against SVM and other extreme learning machines with different kernels.

Introduction to Kernel Extreme Learning Machine.
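Given a training set $\aleph = \{(x_i, t_i) \mid x_i \in R^n, t_i \in R^m, i = 1, \ldots, N\}$ and a hidden node output function $G(a_i, b_i, x)$ with $L$ hidden nodes, the ELM algorithm can be written as follows.
Step 1. Randomly assign the hidden node parameters $(a_i, b_i)$, $i = 1, \ldots, L$.
Step 2. Calculate the hidden layer output matrix $H$ (ensure $H$ to be full rank):
$$H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix} = \begin{bmatrix} G(a_1, b_1, x_1) & \cdots & G(a_L, b_L, x_1) \\ \vdots & & \vdots \\ G(a_1, b_1, x_N) & \cdots & G(a_L, b_L, x_N) \end{bmatrix}.$$
Step 3. Calculate the output weight vector $\beta$:
$$\beta = H^T \left( \frac{I}{C} + H H^T \right)^{-1} T,$$
where $C$ is the regularization coefficient and $T = [t_1, \ldots, t_N]^T$. The output function of ELM is
$$f(x) = h(x) \beta = h(x) H^T \left( \frac{I}{C} + H H^T \right)^{-1} T.$$
If the hidden layer feature mapping $h(x)$ is unknown, a kernel matrix can be defined to replace $H H^T$ by using Mercer's condition. Thus, the KELM algorithm is generated as follows:
$$\Omega_{ELM} = H H^T, \quad (\Omega_{ELM})_{i,j} = h(x_i) \cdot h(x_j) = K(x_i, x_j).$$
Finally, the output function of KELM is defined as
$$f(x) = [K(x, x_1), \ldots, K(x, x_N)] \left( \frac{I}{C} + \Omega_{ELM} \right)^{-1} T.$$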
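The KELM training and prediction steps reduce to solving one regularized linear system, $(I/C + \Omega_{ELM}) \alpha = T$, and then evaluating $f(x) = [K(x, x_1), \ldots, K(x, x_N)] \alpha$. The following is a minimal sketch of that computation, not the authors' implementation. The function names, the Gaussian kernel used as a placeholder, the toy data, and the value C = 100 are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gram matrix with entries exp(-||x - z||^2 / (2 sigma^2))."""
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Z**2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kelm_fit(X, T, kernel, C=100.0):
    """Return alpha = (I/C + Omega)^{-1} T with Omega[i, j] = K(x_i, x_j)."""
    omega = kernel(X, X)
    return np.linalg.solve(np.eye(len(X)) / C + omega, T)

def kelm_predict(X_new, X_train, alpha, kernel):
    """Evaluate f(x) = [K(x, x_1), ..., K(x, x_N)] alpha for each row of X_new."""
    return kernel(X_new, X_train) @ alpha

# Hypothetical toy problem: XOR-like labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 2))
T = np.sign(X[:, 0] * X[:, 1])[:, None]
alpha = kelm_fit(X, T, gaussian_kernel, C=100.0)
scores = kelm_predict(X, X, alpha, gaussian_kernel)
```

Any function that returns a Gram matrix for two sample sets can be passed as `kernel`, so a different kernel can be swapped in without changing the solver.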
In addition to the above kernel functions, a new kernel function can also be constructed according to the properties of kernel functions.
Property 1. Assume that $K_1(x, z)$ and $K_2(x, z)$ are valid kernel functions on $X \times X$; then the kernel functions $K(x, z) = K_1(x, z) + K_2(x, z)$ and $K(x, z) = K_1(x, z) \times K_2(x, z)$ are also valid on $X \times X$.
Theorem 1. Let $f : X \to R$ be an integrable bounded continuous function. Then the necessary and sufficient condition for the translation invariant function $K(x, z) = f(x - z)$ to be a kernel function is that $f(0) > 0$ and that its Fourier transform is nonnegative:
$$F[f](\omega) = (2\pi)^{-n/2} \int_{X} \exp(-i \langle \omega, x \rangle) f(x) \, dx \ge 0.$$
Like SVM, a function is an allowed KELM kernel function as long as it satisfies Mercer's condition.
Mercer Theorem [18]. Assume that, for $X \subset R^n$, $K(x, z)$ is a continuous symmetric real-valued function on $X \times X$ such that the following integral is nonnegative for every $g \in L_2(X)$:
$$\iint_{X \times X} K(x, z) g(x) g(z) \, dx \, dz \ge 0.$$
Then $K(x, z)$ must be a valid kernel function.
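Property 1 can be illustrated numerically: a valid kernel yields a positive semidefinite Gram matrix on any finite sample, so the sum and the entrywise product of two such Gram matrices should have no negative eigenvalues beyond round-off. The sketch below checks this on random data. The choice of the two kernels, the sample, and the tolerance are assumptions made for illustration, and a finite-sample check of this kind is a sanity test rather than a proof.

```python
import numpy as np

def min_eigenvalue(K):
    """Smallest eigenvalue of the symmetrized matrix K."""
    return float(np.linalg.eigvalsh((K + K.T) / 2.0).min())

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)

K1 = np.exp(-sq_dist / 2.0)      # Gaussian kernel, sigma = 1
K2 = (X @ X.T + 1.0) ** 2        # polynomial kernel of degree 2
K_sum = K1 + K2                  # sum rule of Property 1
K_prod = K1 * K2                 # product rule (entrywise product of Gram matrices)

tol = 1e-9 * np.trace(K_sum)     # allowance for floating-point round-off
```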

Triangular Hermite Kernel Extreme Learning Machine

Construction of Triangular Kernel Function.
The Laplace kernel function $K(x, z) = \exp(-\|x - z\|/\sigma)$ is also a radial basis function. Its classification performance is nearly equivalent to that of the Gaussian kernel $K_{RBF}(x, z) = \exp(-\|x - z\|^2/2\sigma^2)$, but it is less sensitive to changes of the parameter $\sigma$. The Laplace kernel can be used as an alternative when using the Gaussian kernel becomes too expensive. When $\|x - z\|/\sigma \to 0$, by the Taylor formula, the Laplace kernel can be simplified as
$$K_{mac}(x, z) = 1 - \frac{\|x - z\|}{\sigma}. \quad (7)$$
Figure 1 shows the function curves of (7) and $K_{RBF}(x, z)$ for $\sigma = 1, 2$, where $x \in [-7, 7]$ and $z = 0$.
As seen from Figure 1, the two types of kernel functions are quite different, and $K_{mac}(x, z)$ is relatively less sensitive to changes of the parameter $\sigma$. Its function curve has a typical triangular shape; thus, it is called the triangular kernel function [19].
Given a sample set $X = \{x_i\}_{i=1}^{N}$, where $\bar{x}$ is the sample mean and $N$ is the number of training samples, in order to further simplify (7), let $\sigma_0$ [16] be twice the maximum distance from any sample point to the sample mean, so that $\|x - z\| \le \sigma_0$ holds with probability one. Thus, the simplified triangular kernel function is obtained:
$$K_{Tri}(x, z) = 1 - \frac{\|x - z\|}{\sigma_0}, \quad (8)$$
where $\sigma_0 = 2 \max_{1 \le i \le N} \|x_i - \bar{x}\|$. Since (8) is a translation invariant function, $K_{Tri}(x, z)$ can be proved to be a valid KELM kernel function according to Theorem 1, as follows.
Proof. Let $K(x) = f(x) = 1 - \|x\|/\sigma_0$. From functional analysis, $f(x)$ is an integrable bounded continuous function on $R^n$ with $f(0) = 1 > 0$, and its Fourier transform is nonnegative. The proof is completed.
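The triangular kernel needs no parameter search, because $\sigma_0$ is computed directly from the training samples as twice the largest distance to the sample mean. A minimal sketch of that computation follows. The function names and the random sample are assumptions made for illustration.

```python
import numpy as np

def sigma0_from_samples(X):
    """sigma_0 = 2 * max_i ||x_i - mean(X)||, computed from the training samples."""
    center = X.mean(axis=0)
    return 2.0 * float(np.max(np.linalg.norm(X - center, axis=1)))

def triangular_kernel(X, Z, sigma0):
    """Matrix of K_Tri(x, z) = 1 - ||x - z|| / sigma_0 for all rows x of X and z of Z."""
    dist = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
    return 1.0 - dist / sigma0

# Hypothetical sample of 50 points in 4 dimensions.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
sigma0 = sigma0_from_samples(X)
K = triangular_kernel(X, X, sigma0)
```

By the triangle inequality, any two training samples are at most $\sigma_0$ apart, so the kernel values on the training set stay within $[0, 1]$.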
References [12-16] constructed the weighting function as $K_{Gau}(x, z) = \exp(-\|x - z\|^2/n)$, where $n$ denotes the dimension of the vectors $x$ and $z$. Because the parameter $\sigma$ of the Gaussian kernel $K_{RBF}$ is directly set to $\sqrt{n/2}$, the data structure information is lost although the parameter optimization is simplified. The setting of the parameter $\sigma_0$ in the triangular kernel function proposed in this paper makes up for this shortcoming.
To show the difference between $K_{Gau}(x, z)$ and $K_{Tri}(x, z)$ more intuitively, Figures 2 and 3 plot the two kernel functions. In Figures 2 and 3, both vectors $x$ and $z$ are one-dimensional ($n = 1$). When $z$ takes a fixed value, the graph of $K_{Gau}$ is invariably constant, but the graph of $K_{Tri}$ changes as the value interval of $x$ changes. It is well known that choosing a kernel function amounts to choosing a criterion for measuring similarity and the degree of similarity [18].
Consequently, for the same point $(x, z)$, the similarity in different intervals should be different: the value of $K_{Tri}(x, z)$ increases as the interval expands, while the value of $K_{Gau}(x, z)$ remains unchanged across the four intervals. Therefore, the parameter $\sigma_0$ of the kernel function $K_{Tri}(x, z)$ is set so as to retain more distance similarity information of the sample data; in addition, its computational cost is very low.
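The interval dependence described above can be reproduced with a few lines of arithmetic. In the sketch below, the pair $(x, z) = (0.5, -0.5)$ and the four symmetric intervals are hypothetical choices, not the ones plotted in Figures 2 and 3. The value of $K_{Gau}$ depends only on the pair, whereas $K_{Tri}$ changes because $\sigma_0$ is recomputed from the samples of each interval.

```python
import numpy as np

x, z = 0.5, -0.5

# K_Gau with n = 1 does not depend on where the samples lie.
k_gau = float(np.exp(-(x - z) ** 2 / 1.0))

tri_values = []
for half_width in (1.0, 2.0, 4.0, 8.0):
    # One-dimensional samples spread evenly over [-half_width, half_width].
    samples = np.linspace(-half_width, half_width, 201)
    sigma0 = 2.0 * np.max(np.abs(samples - samples.mean()))
    tri_values.append(1.0 - abs(x - z) / sigma0)
```

Under these assumptions the triangular kernel value for the fixed pair grows from 0.5 toward 1 as the interval widens, since the same distance becomes small relative to the spread of the data.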

Construction of Generalized Hermite Dirichlet Kernel.
The Hermite polynomial [16] is an orthogonal polynomial with respect to the weighting function $e^{-x^2}$ on the interval $(-\infty, +\infty)$, which is defined as
$$H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n} e^{-x^2}.$$
It satisfies the orthogonal relationship
$$\int_{-\infty}^{+\infty} e^{-x^2} H_m(x) H_n(x) \, dx = \begin{cases} 0, & m \ne n, \\ 2^n n! \sqrt{\pi}, & m = n. \end{cases}$$
It has the recursive relationship
$$H_0(x) = 1, \quad H_1(x) = 2x, \quad H_{n+1}(x) = 2x H_n(x) - 2n H_{n-1}(x).$$
Owing to the orthogonality, variability, and universal function approximation capability of the Hermite polynomial, a general Hermite kernel function can be constructed as a good alternative to other common kernel functions (Gaussian kernel, polynomial kernel, etc.). For this purpose, the scalar variable $x$ is replaced by the row vector $x$, and $H_{n+1}$ is substituted correspondingly:
$$H_0(x) = 1, \quad H_1(x) = 2x, \quad H_{n+1}(x) = 2x H_n^T(x) - 2n H_{n-1}(x), \quad n = 1, 2, \ldots,$$
where $H_n^T(x)$ is the transpose of $H_n(x)$. Therefore, for vector input, the generalized Hermite polynomials are
$$H_0(x) = 1, \quad H_1(x) = 2x, \quad H_2(x) = 4xx^T - 2, \quad H_3(x) = 8(xx^T)x - 12x, \quad \ldots$$
By using the generalized Hermite polynomials, this paper defines the generalized $n$th order Hermite Dirichlet kernel [12], denoted $K_{Hem}(x, z)$. Assuming that each element is independent from the others, $K_{Hem}(x, z)$ can be verified to satisfy the Mercer Theorem. Therefore, $K_{Hem}(x, z)$ is a valid KELM kernel, and its kernel parameter can only be a natural number, which greatly simplifies the selection and optimization of kernel parameters.
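The vector recursion can be sketched in code as follows, with even-order polynomials evaluating to scalars and odd-order ones to row vectors. The kernel function in this sketch assumes the form $K_{Hem}(x, z) = \sum_{i=0}^{n} H_i(x) H_i^T(z)$, by analogy with the generalized Chebyshev kernel of [12]. That form is an assumption rather than a formula taken from this text, and any normalization or weighting used by the authors is not reproduced. The function names are hypothetical.

```python
import numpy as np

def generalized_hermite(x, order):
    """Return [H_0(x), ..., H_order(x)] for a 1-D array x treated as a row vector.

    Even orders are scalars and odd orders are vectors of the same length as x.
    """
    x = np.asarray(x, dtype=float)
    polys = [1.0]
    if order >= 1:
        polys.append(2.0 * x)
    for n in range(1, order):
        prev, prev2 = polys[n], polys[n - 1]
        if np.ndim(prev) == 1:
            # H_n is a vector: 2 x H_n^T(x) is an inner product (a scalar).
            lead = 2.0 * float(x @ prev)
        else:
            # H_n is a scalar: 2 x H_n^T(x) is the vector x scaled by H_n.
            lead = 2.0 * x * prev
        polys.append(lead - 2.0 * n * prev2)
    return polys

def hermite_kernel(x, z, order):
    """Assumed form: sum over i of H_i(x) H_i(z)^T for i = 0..order."""
    hx = generalized_hermite(x, order)
    hz = generalized_hermite(z, order)
    return float(sum(np.dot(a, b) for a, b in zip(hx, hz)))

# Example: polynomials of the two-dimensional row vector x = (1, 2).
h = generalized_hermite([1.0, 2.0], 3)
```

For one-dimensional input the recursion reduces to the scalar Hermite polynomials, for example $H_3(x) = 8x^3 - 12x$, which gives a quick way to check an implementation.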

Generalized Triangular Hermite Kernel Extreme Learning Machine.
According to Property 1, a new KELM kernel called the Generalized Triangular Hermite kernel is constructed as the product of the triangular kernel and the generalized Hermite Dirichlet kernel:
$$K_{Tri-H}(x, z) = K_{Tri}(x, z) \cdot K_{Hem}(x, z).$$
The Generalized Triangular Hermite kernel combines the advantages of the triangular kernel and the generalized Hermite Dirichlet kernel: it retains more distance similarity information of the sample data, and its order parameter takes only natural-number values. Accordingly, it can greatly shorten the time of parameter optimization. Although it has two kernel parameters, both can be determined quickly and easily, which greatly reduces the parameter optimization cost. The Generalized Triangular Hermite kernel functions up to the 3rd order are listed in Table 1.
Figures 4 and 5 show the Generalized Triangular Hermite kernel output up to the 3rd order at two different vertical-axis scales, where $x$ varies within the range $[-2, 2]$ and $z$ is fixed at a constant value. Figure 4 shows the kernel function, while Figure 5 shows the value, where the 0th and 1st orders correspond to the left vertical axis, and the 2nd and 3rd orders correspond to the right vertical axis.
Finally, the kernel $K_{Tri-H}(x, z)$ is introduced into the KELM algorithm; as a result, the Generalized Triangular Hermite kernel extreme learning machine algorithm is obtained.
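One way such an algorithm might be assembled is sketched below: compute $\sigma_0$ from the training samples, build the Gram matrix of the product kernel, and solve the regularized KELM system. This is a sketch under stated assumptions, not the authors' listing. It assumes $K_{Hem}(x, z) = \sum_{i=0}^{n} H_i(x) H_i^T(z)$ without normalization, and the function names, the value C = 100, and the one-dimensional toy regression data are hypothetical.

```python
import numpy as np

def hermite_features(X, order):
    """Generalized Hermite polynomials H_0..H_order evaluated for every row of X."""
    feats = [np.ones(len(X))]
    if order >= 1:
        feats.append(2.0 * X)
    for n in range(1, order):
        prev, prev2 = feats[n], feats[n - 1]
        if prev.ndim == 2:
            # Vector-valued H_n: 2 x H_n^T is one scalar per sample.
            lead = 2.0 * np.sum(X * prev, axis=1)
        else:
            # Scalar-valued H_n: 2 x H_n is one vector per sample.
            lead = 2.0 * X * prev[:, None]
        feats.append(lead - 2.0 * n * prev2)
    return feats

def tri_h_kernel(X, Z, order, sigma0):
    """Gram matrix of the product K_Tri * K_Hem (assumed unnormalized sum form)."""
    dist = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)
    k_tri = 1.0 - dist / sigma0
    k_hem = np.zeros((len(X), len(Z)))
    for hx, hz in zip(hermite_features(X, order), hermite_features(Z, order)):
        k_hem += np.outer(hx, hz) if hx.ndim == 1 else hx @ hz.T
    return k_tri * k_hem

def fit(X, T, order, C=100.0):
    """Compute sigma_0 from the data and solve (I/C + Omega) alpha = T."""
    sigma0 = 2.0 * float(np.max(np.linalg.norm(X - X.mean(axis=0), axis=1)))
    omega = tri_h_kernel(X, X, order, sigma0)
    alpha = np.linalg.solve(np.eye(len(X)) / C + omega, T)
    return alpha, sigma0

def predict(X_new, X_train, alpha, order, sigma0):
    """Evaluate the KELM output function on new samples."""
    return tri_h_kernel(X_new, X_train, order, sigma0) @ alpha

# Hypothetical one-dimensional regression problem.
X = np.linspace(-1.0, 1.0, 30)[:, None]
T = np.sin(3.0 * X)
alpha, sigma0 = fit(X, T, order=2)
fitted = predict(X, X, alpha, order=2, sigma0=sigma0)
```

Only `order` has to be searched, over a handful of small natural numbers, since `sigma0` follows from the training samples.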

Experiments and Analysis
To test the performance of the Generalized Triangular Hermite kernel extreme learning machine (Tri-H KELM) algorithm, this section compares it with various other algorithms in terms of testing accuracy, training time, and regression determination coefficient on the bispiral dataset and on real-world benchmark regression, binary classification, and multiclass classification datasets. Table 2 lists the algorithms used in the experiments and the ranges of the corresponding kernel parameter values; the algorithms also include the Gaussian kernel (Gauss), polynomial kernel (Poly), and Gaussian Hermite kernel (Gau-H) [15] extreme learning machine algorithms, and the triangular Hermite kernel support vector machine algorithm (Tri-H SVM). In multiclass simulations, we use the LIBSVM toolbox and train an SVM for each class separately in a one-versus-all fashion. The SVM cost parameter value is 100 in each corresponding experiment. The better test results are given in boldface in Tables 2-9.
Table 1: List of the Generalized Triangular Hermite kernel function up to the 3rd order.

Classification Performance Comparison on Banana Dataset.
The Banana dataset is a well-known binary classification dataset from the UCI benchmark repository that is used in many pattern recognition tests. Each vector of the dataset has 2 features. The training set consists of 800 randomly selected samples (400 for the (+1) class and 400 for the (−1) class); meanwhile, 400 samples are randomly chosen as a testing set, which contains 200 samples of each class. At each trial of simulation, a fixed regularization coefficient is assumed. The simulation results, namely the maximum testing accuracy and the corresponding kernel parameter optimization time, are given in Table 3.
Figures 6-9 show the boundaries of the different KELM algorithms on the benchmark Banana dataset.
As seen from Table 3 and Figures 6-9, Tri-H KELM achieves the highest testing accuracy on the Banana dataset compared with Gauss, Poly, and Gau-H KELM. Furthermore, during parameter optimization, the kernel parameter of the two Hermite KELM algorithms takes values only from 0 to 3, which greatly shortens the optimization time.

Classification Performance Comparison on UCI Benchmark Datasets.
To verify the performance of the different algorithms extensively, a wide range of datasets was tested in simulations, including 5 binary classification cases and 6 multiclass classification cases. Most of the datasets are taken from the UCI Machine Learning Repository. The specifications of the binary classification and multiclass classification cases are listed in Tables 4 and 5. Tri-H KELM is compared with Gauss, Poly, and Gau-H KELM and with Tri-H SVM; the experiment results, including the maximum testing accuracy with the corresponding kernel parameter value and the training time, are given in Tables 6 and 7.
For Tri-H KELM, the parameter selection time is short, because the best classification performance is achieved with given kernel parameters. A further study reveals that all KELM algorithms have better scalability and run at a much faster learning speed than the traditional SVM.

Regression Performance Comparison on UCI Benchmark Datasets.
This subsection selects 5 regression cases from the UCI Machine Learning Repository; the data are described in Table 8. Simulations of the different algorithms were carried out on all the regression datasets. The regression performance of ELM was evaluated by the coefficient of determination $R^2$, which lies within the interval [0, 1]; the closer it is to 1, the better the regression fit. The simulation results ($R^2$ and the training time) are given in Table 9.
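For reference, the coefficient of determination is commonly computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses that common definition as an assumption; the exact formula used in the experiments is not given here. Under this definition the value can fall below 0 for a model worse than predicting the mean, so the [0, 1] range quoted above may correspond to a different variant, such as the squared correlation coefficient. The function name and the numbers are hypothetical.

```python
import numpy as np

def coefficient_of_determination(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (common definition, assumed here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Hypothetical targets and predictions.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
r2 = coefficient_of_determination(y, y_hat)
```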
As Table 9 shows, in comparison with the other KELM algorithms, Tri-H KELM obtains the maximum coefficient of determination on most regression datasets, where each reported value is the maximum over the kernel parameter. It therefore has better generalization performance for regression problems. Note that Tri-H KELM achieves regression performance similar to that of SVM at a much faster learning speed.

Conclusion
In this work, a novel extreme learning machine based on the mixed kernel function of the triangular kernel and the generalized Hermite Dirichlet kernel (Tri-H KELM) has been put forward, which introduces the triangular Hermite kernel function into the kernel extreme learning machine algorithm. Because the presented kernel has only one parameter chosen from a small set of integers, the parameter optimization is greatly facilitated. Besides, more structure information of the sample data is retained in the proposed kernel. Numerical experiments have been performed with different algorithms (Tri-H SVM and Gauss, Poly, Gau-H, and Tri-H KELM) on the bispiral benchmark dataset and a number of real-world benchmark datasets for regression, binary classification, and multiclass classification, and the results of Tri-H KELM have been compared with those of Tri-H SVM and Gauss, Poly, and Gau-H KELM. The proposed approach shows generalization and robustness performance comparable to the other methods considered, at a much faster learning speed than Tri-H SVM, which indicates its usefulness and effectiveness. Future work will study Tri-H ELM in practical applications.

Figure 6: The boundaries of Gauss KELM algorithms in Banana dataset.

Figure 7: The boundaries of Poly KELM algorithms in Banana dataset.

Figure 8: The boundaries of Gau-H KELM algorithms in Banana dataset.

Figure 9: The boundaries of Tri-H KELM algorithms in Banana dataset.

Table 2: The various algorithms and the range of their kernel parameters.

Table 3: The maximum testing rate (%) and the corresponding kernel parameter value.

Table 4: Specifications of binary classification cases.

Table 5: Specifications of multiclass classification cases.

Table 8: Specifications of regression cases.
The experiment results indicate that Tri-H KELM performs better than the other KELM algorithms and Tri-H SVM on average for multiclass classification cases. Its biggest advantage is that the parameter selection time is short.