Mexican Hat Wavelet Kernel ELM for Multiclass Classification

Kernel extreme learning machine (KELM) is a novel feedforward neural network, which is widely used in classification problems. To some extent, it solves the existing problems of the invalid nodes and the large computational complexity in ELM. However, the traditional KELM classifier usually has a low test accuracy when it faces multiclass classification problems. In order to solve the above problem, a new classifier, Mexican Hat wavelet KELM classifier, is proposed in this paper. The proposed classifier successfully improves the training accuracy and reduces the training time in the multiclass classification problems. Moreover, the validity of the Mexican Hat wavelet as a kernel function of ELM is rigorously proved. Experimental results on different data sets show that the performance of the proposed classifier is significantly superior to the compared classifiers.


Introduction
Extreme learning machine (ELM), which was proposed by Huang et al. [1] in 2004, is a single-hidden-layer feedforward neural network model. In this model, input weights and hidden layer biases are initialized randomly, and output weights are obtained by using the Moore-Penrose generalized inverse of the hidden layer output matrix. Compared with the conventional BP neural networks, ELM has faster learning speed, higher testing accuracy, and lower computational complexity. Therefore, ELM is widely used in sales forecasting [2], image quality assessment [3], power loss analysis [4], and so on. In 2006, Huang et al. [5] proposed the incremental extreme learning machine (I-ELM), which continuously increases the number of hidden layer nodes to improve the training accuracy. Subsequently, Li [6] combined I-ELM with the convex optimization learning method and proposed ECI-ELM in 2014, which reduced the training time of I-ELM. This improvement overcame the weakness of randomly selecting weights in I-ELM and eventually improved the training accuracy. At the same time, Wang and Zhang [7] introduced the Gram-Schmidt orthogonalization method into I-ELM and reduced the training time of I-ELM to a large degree. But, in general, I-ELM and its variants only improve the training accuracy. Their numbers of hidden layer nodes are very likely to exceed the number of samples. Thus, I-ELM greatly increases the training time. From another perspective, in order to achieve a higher training accuracy, Rong et al. [8] used statistical methods to measure the relevance of hidden nodes of ELM and proposed P-ELM in 2008. Then, in 2010, Miche et al. [9] proposed OP-ELM, which is an improvement of P-ELM. In addition, Akusok et al. [10] proposed a high-performance ELM model in 2015, which provides a solid ground for tackling numerous Big Data challenges. However, none of these methods has changed the random selection of input weights.
In addition, the linear weighted mapping method in original ELM is not replaced at all. Therefore, both ELM and its variants have some inevitable problems. (A) Because of the random selection of input weights, some hidden nodes may be given input weights that are very close to 0; such nodes are commonly called dead nodes. These nodes have a minimal effect on the network and eventually degrade the output accuracy. (B) As the number of samples increases, the number of hidden nodes also becomes large. Thus, some high dimensional dot product operations appear in the training process, which increases the computational complexity and the training time. This problem is commonly called dimension explosion. (C) For nonlinear samples, the linear weighted mapping method often has an inevitable error, which reduces the training accuracy.
In order to solve the above problems, Huang et al. [11] proposed the kernel extreme learning machine (KELM) in 2012, which utilizes a kernel function to replace the linear weighted mapping method. Initially, the kernel function they selected was a Gauss function. Although [11] solves the problems of dead nodes and dimension explosion to some extent, the performance of the traditional kernel function on multiclass classification problems is still not very good. From [12,13], we know that wavelet functions, which have a strong fitting capability, can be used in SVM and ELM. Therefore, in this paper, we propose a Mexican Hat wavelet kernel ELM (MHW-KELM) classifier, which effectively solves the problems in the conventional classifier. Compared with the traditional KELM, the MHW-KELM classifier achieves better results in dealing with multiclass classification problems, because the new kernel function improves the training accuracy.
The basic principle of ELM and some theorems are shown in Section 2 of this paper. In Section 3, the Mexican Hat wavelet kernel ELM is proposed, and its validity is also proved. Performance evaluation is presented in Section 4. Conclusion is given in Section 5. . . .
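If the output weights are $\beta$, then according to the proof given by Huang et al. [1], the smaller the norm of $\beta$ is, the better the generalization performance of ELM is. Therefore, the output weights can be obtained by finding the least square solution of the problem

$$\text{Minimize: } L = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|\xi_i\|^2 \qquad \text{Subject to: } h(x_i)\beta = t_i^T - \xi_i^T,\quad i = 1, 2, \ldots, N, \tag{2}$$

where $h(x_i)$ is the $i$th output vector of the hidden layer, $t_i$ is the $i$th label vector, $\xi_i$ is the error between the $i$th network output vector and the label vector, and $C$ is the penalty factor. According to KKT theory, the above problem can be transformed into a Lagrange function

$$L = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|\xi_i\|^2 - \sum_{i=1}^{N}\sum_{j=1}^{m}\alpha_{i,j}\left(h(x_i)\beta_j - t_{i,j} + \xi_{i,j}\right), \tag{3}$$

where $\beta_j$ is the $j$th column of $\beta$, $m$ is the number of output nodes, and each of the Lagrange multipliers $\alpha_i$ corresponds to a sample $x_i$. By calculating the partial derivatives of (3), we can get the following set of equations:

$$\frac{\partial L}{\partial \beta_j} = 0 \;\Rightarrow\; \beta = H^T\alpha, \tag{4a}$$
$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = C\xi_i, \tag{4b}$$
$$\frac{\partial L}{\partial \alpha_i} = 0 \;\Rightarrow\; h(x_i)\beta - t_i^T + \xi_i^T = 0, \tag{4c}$$

where $\alpha = [\alpha_1, \ldots, \alpha_N]^T$, $H$ is the hidden layer output matrix, and $T = [t_1, \ldots, t_N]^T$. The least square solution of $\beta$ can be obtained by solving the three equations in (4a), (4b), and (4c). The solution is

$$\beta = H^T\left(\frac{I}{C} + HH^T\right)^{-1}T, \tag{5}$$

and the output function of ELM is

$$f(x) = h(x)\beta = h(x)H^T\left(\frac{I}{C} + HH^T\right)^{-1}T. \tag{6}$$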
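The regularized least square solution described above can be illustrated with a short sketch. It assumes the standard form $\beta = H^T(I/C + HH^T)^{-1}T$, a sigmoid activation, and Gaussian random input weights. The function names `elm_fit` and `elm_predict` are hypothetical and are not taken from the paper.

```python
import numpy as np

def elm_fit(X, T, n_hidden=20, C=1.0, seed=0):
    """Regularized ELM sketch: beta = H^T (I/C + H H^T)^{-1} T (assumed form)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random input weights
    b = rng.standard_normal(n_hidden)                # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))           # hidden layer output matrix (N x L)
    N = X.shape[0]
    # Solve (I/C + H H^T) alpha = T instead of forming the inverse explicitly.
    alpha = np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    beta = H.T @ alpha                               # output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Network output f(x) = h(x) beta for each row of X."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Solving the linear system rather than inverting the matrix is a numerical choice of this sketch. The intermediate vector `alpha` plays the role of the Lagrange multipliers, so the residual on the training set equals `alpha / C`.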

Translation-Invariant Kernel Theorem.
The kernel function method is often used in SVM as a replacement for the dot product. According to the Mercer theorem (see [14]), by introducing the kernel function $K(x, x')$, we can replace the calculation of the dot product in ELM. In order to reduce the computational complexity of the high dimensional dot product, it is necessary to ensure that $K(x, x')$ depends only on the relative position of the two input samples (see (7)):

$$K(x, x') = K(x - x'). \tag{7}$$

The kernel functions which satisfy (7) are called translation-invariant kernel functions. In fact, it is difficult to prove directly that a translation-invariant kernel function satisfies the Mercer theorem. Fortunately, for translation-invariant kernel functions, the following theorem provides a necessary and sufficient condition for being an admissible support vector kernel.

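Theorem 1. A translation-invariant kernel $K(x, x') = K(x - x')$ is an admissible support vector kernel, if and only if the Fourier transform

$$F[K](\omega) = (2\pi)^{-d/2}\int_{\mathbb{R}^d}\exp(-j(\omega \cdot x))\,K(x)\,dx \tag{8}$$

is nonnegative.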
Computational Intelligence and Neuroscience
The kernel function selection method of ELM is the same as that of SVM. Therefore, the above theorem can also be used to determine whether a function is an admissible ELM kernel. The commonly used kernel functions are the Gauss kernel function and the polynomial kernel function. Of these two functions, only the Gauss kernel function is translation-invariant. The expressions of the two kernel functions can be given as

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right), \qquad K(x, x') = (x \cdot x' + 1)^{p}. \tag{9}$$

In (9), $\sigma$ is the Gauss kernel width and $p$ is an adjustable polynomial power exponent.
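A quick numerical check can make the translation-invariance property concrete. The sketch below assumes the common forms $\exp(-\|x - x'\|^2/(2\sigma^2))$ for the Gauss kernel and $(x \cdot x' + 1)^p$ for the polynomial kernel. These exact forms, and the names `gauss_kernel` and `poly_kernel`, are assumptions made for illustration.

```python
import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    """Assumed Gauss kernel form: depends only on the difference x - y."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def poly_kernel(x, y, p=2):
    """Assumed polynomial kernel form: depends on the dot product of x and y."""
    return (np.dot(x, y) + 1.0) ** p

x = np.array([0.5, -1.0])
y = np.array([1.5, 2.0])
shift = np.array([3.0, 3.0])

# Shift both inputs by the same vector and compare kernel values.
gauss_same = bool(np.isclose(gauss_kernel(x, y), gauss_kernel(x + shift, y + shift)))
poly_same = bool(np.isclose(poly_kernel(x, y), poly_kernel(x + shift, y + shift)))
print(gauss_same, poly_same)
```

Under these assumed forms the Gauss kernel value is unchanged by the common shift, while the polynomial kernel value changes, because the dot product depends on the absolute positions of the samples.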

Kernel ELM.
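In the original ELM model, the linear weighted hidden output function $h(x)$ is usually not a suitable mapping for nonlinear samples. In order to solve this problem, we can replace $h(x)$ and $H$ in (6) with a kernel function $K(u, v)$. The result is

$$f(x) = \begin{bmatrix} K(x, x_1) \\ \vdots \\ K(x, x_N) \end{bmatrix}^T\left(\frac{I}{C} + \Omega_{ELM}\right)^{-1}T, \tag{10}$$

where $\Omega_{ELM}$ is the kernel function matrix of $H$ (see (11)):

$$\Omega_{ELM} = HH^T, \qquad \Omega_{ELM\,i,j} = h(x_i) \cdot h(x_j) = K(x_i, x_j). \tag{11}$$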
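The kernel substitution described above can be sketched as follows. The sketch assumes the standard KELM output $f(x) = [K(x, x_1), \ldots, K(x, x_N)](I/C + \Omega_{ELM})^{-1}T$ and uses a Gauss kernel of assumed form as the default. The names `gauss_gram`, `kelm_fit`, and `kelm_predict` are hypothetical.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    """Pairwise Gauss kernel values between the rows of A and the rows of B."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def kelm_fit(X, T, C=1.0, gram=gauss_gram):
    """Return alpha = (I/C + Omega)^{-1} T, where Omega[i, j] = K(x_i, x_j)."""
    omega = gram(X, X)
    return np.linalg.solve(np.eye(len(X)) / C + omega, T)

def kelm_predict(X_new, X_train, alpha, gram=gauss_gram):
    """Evaluate f(x) = [K(x, x_1), ..., K(x, x_N)] alpha for each row of X_new."""
    return gram(X_new, X_train) @ alpha
```

No hidden layer matrix appears in this sketch: only kernel evaluations between samples are needed, so the size of the linear system is fixed by the number of training samples rather than by a number of hidden nodes.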

Mexican Hat Wavelet Kernel Function.
In this part, the Mexican Hat wavelet kernel function is proposed. It is also proved that Mexican Hat wavelet function is an admissible ELM kernel.
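Theorem 2 (see [12]). Let $\psi(x)$ be a mother wavelet. Let $a$ and $c$ denote the dilation and translation, respectively, with $x, a, c \in \mathbb{R}$. If $x, x' \in \mathbb{R}^d$, then the dot product wavelet kernel is

$$K(x, x') = \prod_{i=1}^{d}\psi\left(\frac{x_i - c_i}{a}\right)\psi\left(\frac{x'_i - c'_i}{a}\right). \tag{12}$$

If it satisfies the translation-invariant kernel theorem, the following translation-invariant kernel function can be obtained:

$$K(x, x') = \prod_{i=1}^{d}\psi\left(\frac{x_i - x'_i}{a}\right). \tag{13}$$

The proof of Theorem 2 is given in [12] and is not repeated in this paper. We use the Mexican Hat wavelet as the mother wavelet (see (14)):

$$\psi(x) = (1 - x^2)\exp\left(-\frac{x^2}{2}\right). \tag{14}$$

Then, the Mexican Hat wavelet kernel function is derived (see (15)):

$$K(x, x') = \prod_{i=1}^{d}\left(1 - \frac{(x_i - x'_i)^2}{a^2}\right)\exp\left(-\frac{(x_i - x'_i)^2}{2a^2}\right). \tag{15}$$

In this paper, it is also proved that the Mexican Hat wavelet kernel satisfies the translation-invariant kernel theorem. In other words, it is also an admissible ELM kernel.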
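The Mexican Hat wavelet kernel can be evaluated for whole sample matrices in a few lines. The sketch below assumes the unnormalized mother wavelet $\psi(u) = (1 - u^2)\exp(-u^2/2)$ and the product form $K(x, x') = \prod_i \psi((x_i - x'_i)/a)$. The function name `mexican_hat_gram` is hypothetical.

```python
import numpy as np

def mexican_hat_gram(A, B, a=1.0):
    """Pairwise Mexican Hat wavelet kernel between rows of A and rows of B.

    Assumed form: K(x, x') = prod_i (1 - u_i^2) * exp(-u_i^2 / 2),
    with u_i = (x_i - x'_i) / a and dilation parameter a > 0.
    """
    diff = A[:, None, :] - B[None, :, :]   # shape (n_A, n_B, d)
    u2 = (diff / a) ** 2
    return np.prod((1.0 - u2) * np.exp(-u2 / 2.0), axis=2)

# Three one-dimensional samples at 0, 1, and 2.
pts = np.array([[0.0], [1.0], [2.0]])
K = mexican_hat_gram(pts, pts, a=1.0)
```

Unlike the Gauss kernel, this kernel takes negative values when a coordinate difference exceeds $a$, so the Gram matrix has mixed signs even though it remains positive semidefinite.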

Lemma 3. As a kind of translation-invariant kernel function, Mexican Hat wavelet is an admissible ELM kernel.
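Proof. Firstly, it should be proved that the Fourier transform of the Mexican Hat wavelet kernel is nonnegative (see (16)):

$$F[K](\omega) = (2\pi)^{-d/2}\int_{\mathbb{R}^d}\exp(-j(\omega \cdot x))\,K(x)\,dx \ge 0. \tag{16}$$

Substituting the Mexican Hat wavelet kernel (15) with $x' = 0$ and using its product form, the transform separates into one-dimensional integrals:

$$F[K](\omega) = (2\pi)^{-d/2}\prod_{i=1}^{d}\int_{-\infty}^{\infty}\left(1 - \frac{x_i^2}{a^2}\right)\exp\left(-\frac{x_i^2}{2a^2}\right)\exp(-j\omega_i x_i)\,dx_i. \tag{17}$$

Let $I_i$ denote the $i$th integral term in (17). It can be written as the difference of two integrals:

$$I_i = \int_{-\infty}^{\infty}\exp\left(-\frac{x_i^2}{2a^2}\right)\exp(-j\omega_i x_i)\,dx_i - \int_{-\infty}^{\infty}\frac{x_i^2}{a^2}\exp\left(-\frac{x_i^2}{2a^2}\right)\exp(-j\omega_i x_i)\,dx_i.$$

According to the translation invariance of the integral, and by using the partial integration method, the two integrals are $\sqrt{2\pi}\,a\exp(-a^2\omega_i^2/2)$ and $\sqrt{2\pi}\,a(1 - a^2\omega_i^2)\exp(-a^2\omega_i^2/2)$, respectively. The answer is

$$I_i = \sqrt{2\pi}\,a^3\omega_i^2\exp\left(-\frac{a^2\omega_i^2}{2}\right). \tag{22}$$

Then, substituting (22) into (17), we can obtain the Fourier transform

$$F[K](\omega) = \prod_{i=1}^{d}a^3\omega_i^2\exp\left(-\frac{a^2\omega_i^2}{2}\right). \tag{23}$$

From (23), it is known that if $a \ge 0$, then $F[K](\omega) \ge 0$. Therefore, according to the translation-invariant kernel theorem, the Mexican Hat wavelet kernel is an admissible ELM kernel.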
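The nonnegativity claim in the proof can be checked numerically in one dimension. The sketch below assumes the kernel $K(x) = (1 - x^2/a^2)\exp(-x^2/(2a^2))$ and the transform convention $F[K](\omega) = (2\pi)^{-1/2}\int \exp(-j\omega x)K(x)\,dx$. Under these assumptions a direct calculation gives the closed form $a^3\omega^2\exp(-a^2\omega^2/2)$, which the code compares against a plain Riemann sum. Both function names are hypothetical.

```python
import numpy as np

def mh_fourier_numeric(omega, a=1.0, half_width=40.0, n=200001):
    """Riemann-sum approximation of the 1-D Fourier transform of the kernel."""
    x = np.linspace(-half_width * a, half_width * a, n)
    dx = x[1] - x[0]
    k = (1.0 - (x / a) ** 2) * np.exp(-((x / a) ** 2) / 2.0)
    # K is an even function, so the imaginary (sine) part of the transform vanishes.
    return np.sum(k * np.cos(omega * x)) * dx / np.sqrt(2.0 * np.pi)

def mh_fourier_closed(omega, a=1.0):
    """Closed form under the assumptions above: a^3 * omega^2 * exp(-(a*omega)^2 / 2)."""
    return a ** 3 * omega ** 2 * np.exp(-((a * omega) ** 2) / 2.0)

for w in (0.0, 0.5, 1.0, 2.0):
    print(w, mh_fourier_numeric(w), mh_fourier_closed(w))
```

The closed form is a product of a square and an exponential, so it cannot be negative for any frequency, and it vanishes at $\omega = 0$ because the wavelet has zero mean.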

MHW-KELM Classifier.
We have already proved that Mexican Hat wavelet is an admissible ELM kernel. So, we can substitute (15) into (10) and construct MHW-KELM classifier. For a binary classification problem, the output function of the new classifier is . . .
Besides, this classifier can also be used for the multiclass classification problems. And the output function is . . .
Equation (25) means that the classification result is given by the index of the maximum value in the output vector. In addition, we can combine the nonnegative constant parameter $a$ of the Mexican Hat wavelet and the penalty factor $C$ into an individual and use an evolutionary algorithm such as PSO [17,18] to find the best values of these parameters. Next, we will analyze the performance of the proposed classifier.
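The classifier construction described above can be sketched end to end. The sketch assumes one-hot label vectors, the Mexican Hat kernel form $\prod_i (1 - u_i^2)\exp(-u_i^2/2)$ with $u_i = (x_i - x'_i)/a$, and the KELM solution $(I/C + \Omega_{ELM})^{-1}T$. The class `MHWKELM` and the function `grid_search` are hypothetical names. A plain grid search over $(a, C)$ is used here as a simple stand-in for PSO, and the toy data are synthetic.

```python
import numpy as np

def mexican_hat_gram(A, B, a=1.0):
    """Assumed kernel: prod_i (1 - u_i^2) * exp(-u_i^2 / 2), u_i = (x_i - x'_i) / a."""
    u2 = ((A[:, None, :] - B[None, :, :]) / a) ** 2
    return np.prod((1.0 - u2) * np.exp(-u2 / 2.0), axis=2)

class MHWKELM:
    """Illustrative sketch only; not the authors' MATLAB implementation."""

    def __init__(self, a=1.0, C=1.0):
        self.a, self.C = a, C

    def fit(self, X, labels):
        self.X_ = np.asarray(X, dtype=float)
        labels = np.asarray(labels)
        self.classes_ = np.unique(labels)
        # One-hot label matrix T with one column per class.
        T = (labels[:, None] == self.classes_[None, :]).astype(float)
        omega = mexican_hat_gram(self.X_, self.X_, self.a)
        # Solve (I/C + Omega) alpha = T rather than inverting explicitly.
        self.alpha_ = np.linalg.solve(np.eye(len(self.X_)) / self.C + omega, T)
        return self

    def decision_function(self, X):
        K = mexican_hat_gram(np.asarray(X, dtype=float), self.X_, self.a)
        return K @ self.alpha_

    def predict(self, X):
        # Multiclass rule: index of the largest component of the output vector.
        return self.classes_[np.argmax(self.decision_function(X), axis=1)]

def grid_search(X_tr, y_tr, X_val, y_val, a_grid, C_grid):
    """Plain grid search over (a, C); a simple stand-in for PSO."""
    y_val = np.asarray(y_val)
    best = (None, None, -1.0)
    for a in a_grid:
        for C in C_grid:
            acc = float(np.mean(MHWKELM(a, C).fit(X_tr, y_tr).predict(X_val) == y_val))
            if acc > best[2]:
                best = (a, C, acc)
    return best

# Toy data: three well-separated 2-D clusters (synthetic, for illustration only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])
y = np.repeat(np.arange(3), 20)
model = MHWKELM(a=1.0, C=10.0).fit(X, y)
train_acc = float(np.mean(model.predict(X) == y))
```

With one-hot targets, the two-class case is handled by the same argmax rule, which is a design choice of this sketch rather than something stated in the text. In practice the validation data passed to `grid_search` should be held out from the training data.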

Performance Evaluation
This section analyzes the performance of MHW-KELM and compares it with the traditional Gauss-KELM, Poly-KELM, original ELM, and BP classifier. All these algorithms are run in MATLAB R2014a. The operating environment is a Core-i7 2.6 GHz CPU with 8 GB RAM. We choose the scaled conjugate gradient (SCG) algorithm to optimize the BP neural network, which is faster than the normal BP neural network. In order to get excellent performance, the numbers of hidden nodes of original ELM and BP are selected as 100% and 30% of the number of training samples, respectively. The data sets used in the experiment are from the UCI database [19]. They are Abalone, Auto MPG, Bank, Car Evaluation, Wine, Wine Quality, Iris, Glass, Image, Yeast, Zoo, and Letter. The basic features of these 12 data sets are shown in Table 1.
Then, we use the 12 data sets given in Table 1 to test the running time and training accuracy of the 5 algorithms. Each data set is tested by each algorithm 100 times. Each time, the training samples are randomly selected from the total samples. In order to conduct a rigorous comparison, a paired Student's t-test is performed, and the resulting $p$ values are reported. For each data set, the data in bold type is the best accuracy or the best running time ($p$ value = 1.00), while the data in italic type indicates that there is no statistical difference between this result and the best accuracy, or that it is very close to the best time ($p$ value ≥ 0.05). Plotting the running times from all tables as a line graph gives Figure 1. In Figure 1, the horizontal coordinate corresponds to the number of training samples: 50, 100, 200, 1000, and 2000. Without loss of generality, we select five data sets, Zoo, Image, Auto MPG, Car Evaluation, and Abalone, as representatives of the different numbers of samples. The vertical coordinate shows the mean running time on each data set. Moreover, the running times of MHW-KELM and Gauss-KELM are very close, so we only draw the running time of MHW-KELM. Four lines are drawn with different styles.
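The paired comparison mentioned above can be sketched with the standard library alone. The function below computes the paired Student's t statistic for two lists of accuracies obtained on the same random splits. The name `paired_t_statistic` and the accuracy values are made up for illustration, and the sketch assumes the paired differences are not all identical.

```python
import math

def paired_t_statistic(acc_a, acc_b):
    """Return (t, degrees_of_freedom) for paired accuracy samples."""
    diffs = [x - y for x, y in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the paired differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n), n - 1

# Made-up accuracies of two classifiers on the same four random splits.
acc_a = [0.90, 0.80, 0.85, 0.95]
acc_b = [0.80, 0.75, 0.80, 0.85]
t_value, dof = paired_t_statistic(acc_a, acc_b)
print(t_value, dof)
```

With 100 repetitions there are 99 degrees of freedom, and an absolute t statistic above roughly 1.98 corresponds to a two-sided $p$ value below 0.05. Converting the statistic to an exact $p$ value requires the Student t distribution, which this sketch leaves out.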
From all tables and Figure 1, it is clear that when the number of training samples is larger than 1000, MHW-KELM shows an obvious advantage in running time compared to the other algorithms. For the data sets whose number of training samples is more than 1000, such as Abalone, Bank, Car Evaluation, Wine Quality, Yeast, and Letter, the running time of MHW-KELM and Gauss-KELM is less than that of the other algorithms. That means the translation-invariant kernels are superior to the other kernels in running time. Therefore, it can be concluded that the choice of a translation-invariant kernel function can effectively shorten the running time when the training size is large enough. From Tables 2-13, it can be seen that the classification performance of MHW-KELM is better than that of the other algorithms when the number of categories is more than 4. The results of the paired Student's t-test show that the performance of MHW-KELM is significantly different ($p$ value ≤ 0.05) from that of the original ELM and SCG-BP on all data sets, and it is also different from that of Gauss-KELM and Poly-KELM on Auto MPG, Car Evaluation, Wine Quality, and Image. These four data sets have one thing in common: their numbers of categories are all more than 4.

Besides, when the category number is less than 4, such as Abalone, Bank, Wine, Iris, Yeast, and Letter, MHW-KELM still has a similar performance to Gauss-KELM or Poly-KELM. Therefore, MHW-KELM is an excellent classifier in multiclass classification problems, which is better than traditional kernel ELM. That means the Mexican Hat wavelet function is a better ELM kernel than the Gaussian function.

Conclusion
In this paper, we propose a classifier, the Mexican Hat wavelet kernel ELM classifier, which can be applied to the multiclass classification problem. Besides, its validity as an admissible ELM kernel is also proved. This classifier solves the inevitable problems in original ELM by replacing the linear weighted mapping method with Mexican Hat wavelet. The experimental results show that the training time of MHW-KELM classifier is much less than that of original ELM, which solves the problem of the dimension explosion in original ELM. Meanwhile, the training accuracy of this classifier is superior to the traditional Gauss-KELM and original ELM in dealing with the multiclass classification problems.
In future work, in order to reduce the impact of inequality of the training data on the performance, we plan to utilize the boosting weighted ELM proposed by Li et al. [20] to optimize the proposed classifier. In addition, from the experimental results of this paper, it can be seen that a single kernel function cannot meet the requirements of all data sets. So, we are prepared to combine multiple kernel functions to construct mixed kernel ELM, in order to suit different situations.