Robust Matching Pursuit Extreme Learning Machines

Extreme learning machine (ELM) is a popular learning algorithm for single hidden layer feedforward networks (SLFNs). It was originally inspired by biological learning and has attracted considerable attention because it adapts to various tasks with fast learning and low computational cost. As an effective sparse representation method, orthogonal matching pursuit (OMP) can be embedded into ELM to overcome the singularity problem and improve stability. OMP usually recovers a sparse vector by minimizing a least squares (LS) loss, which is efficient for Gaussian distributed data but may suffer performance deterioration in the presence of non-Gaussian data. To address this problem, a robust matching pursuit method based on a novel kernel risk-sensitive loss (KRSLMP for short) is first proposed in this paper. The KRSLMP is then applied to ELM to solve for the sparse output weight vector, and the resulting method, named KRSLMP-ELM, is developed for SLFN learning. Experimental results on synthetic and real-world data sets confirm the effectiveness and superiority of the proposed method.


Introduction
Extreme learning machine [1] is a kind of single hidden layer feedforward network (SLFN) [2]. In the past decade, ELM has become popular in the machine learning and pattern recognition communities for its fast adaptability and good generalization performance [3]. In general, ELM has the following advantages: (i) it can estimate the unknown mathematical model embedded in a mass of training samples and can be implemented efficiently in parallel for both training and testing; (ii) it uses randomly generated input weights and hidden biases that are not tuned during the training phase, and therefore the output weights can be obtained analytically by solving a standard least squares (LS) problem. Thus, extremely fast learning and low computational cost can be achieved, especially for big data applications. In view of these advantages, ELM has been widely applied in many areas, such as face recognition [4], series compensated transmission line protection [5], time series analysis [6], and nonlinear model identification [7].
However, ELM still has several drawbacks. First, ELM encounters the problem of irrelevant variables when handling real-world data sets [8]. Second, choosing a proper number of hidden nodes is an open problem for all ELM algorithms. An ELM network with too few hidden nodes may not model the input data accurately, whereas a network with too many hidden nodes tends to overfit [9]. Moreover, when the number of hidden nodes exceeds the number of input samples, ELM may suffer from the singularity problem [4]. Third, the original ELM learns the model with an $\ell_2$-norm based loss function, which is very vulnerable to noise. It is well known that the $\ell_2$-norm magnifies the adverse effect of outliers associated with large deviations [10]. The presence of non-Gaussian noises or outliers in the training data may thus lead to an unreliable model with degraded performance.
To overcome the first and second limitations, several methods have been proposed in the regularization framework [9, 11-13]. Furthermore, orthogonal matching pursuit (OMP) is a plain and efficient iterative algorithm that selects, at each iteration, the dictionary atom best correlated with the current residual [14]. As such, OMP has been embedded into ELM (OMP-ELM) to overcome the singularity problem, which leads to a more stable solution than the original ELM [15]. Most of the existing methods learn the model with an $\ell_2$-norm based loss function, which may perform poorly in the presence of outliers or non-Gaussian noises (which exist in many real-world situations) [16-18]. To combat non-Gaussian noises or outliers and improve the generalization ability, the regularized correntropy criterion was used to replace the $\ell_2$-norm based loss function in the original ELM model, yielding the ELM-RCC [16]. In [19], ELM with an $\ell_1$-norm based loss function (ORELM) was proposed to achieve robust performance.
The kernel risk-sensitive loss (KRSL) is a nonlinear similarity measure first proposed in [20], which can achieve more satisfactory robust performance. The KRSL is based on the original structure of the risk-sensitive loss and is defined in the reproducing kernel Hilbert space (RKHS) [21, 22]:
$$L_\lambda(X, Y) = \frac{1}{\lambda}\,\mathrm{E}\left[\exp\left(\lambda\left(1 - \kappa_\sigma(X - Y)\right)\right)\right], \tag{1}$$
where $\mathrm{E}[\cdot]$ denotes the mathematical expectation, $\kappa_\sigma(\cdot)$ is the Gaussian kernel with bandwidth $\sigma$, and $\lambda$ is the risk-sensitive parameter. In this paper, we propose a KRSL based matching pursuit (KRSLMP) method. The KRSLMP is then embedded into ELM to construct a robust and sparse ELM model. The rest of the paper is structured as follows. In Section 2, we sketch the related work, including similarity measures in kernel space, the kernel risk-sensitive loss, the ELM model, and the orthogonal matching pursuit algorithm. In Section 3, we develop the KRSLMP-ELM. In Section 4, experiments on regression problems with synthetic and real-world data sets are conducted to verify the effectiveness of the proposed algorithm. The sensitivity of the KRSLMP-ELM to its free parameters is also analyzed. Finally, the conclusion is given in Section 5.

Preliminaries and Related Works
For convenience of presentation, the following notations used in this paper are introduced. Vectors and matrices are represented with boldface lowercase letters and boldface capital letters, respectively. For any vector $\mathbf{x}$, we use $\mathbf{x}(i)$ to denote its $i$th entry. The notation $\mathbf{x}|_I$ denotes the subvector of $\mathbf{x} \in \mathbb{R}^n$ with entries indexed by the set $I \subset \Omega = \{1, 2, \ldots, n\}$. The complementary set of $I$ is denoted as $I^c = \Omega - I$.

Similarity Measures in Kernel Space.
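Let $X$ and $Y$ be two random variables; the correntropy between $X$ and $Y$ is defined by [17, 23]
$$V(X, Y) = \mathrm{E}\left[\kappa_\sigma(X - Y)\right] = \int \kappa_\sigma(x - y)\, dF_{XY}(x, y), \tag{2}$$
where $F_{XY}(x, y)$ is the joint distribution function of $(X, Y)$. The Gaussian kernel with bandwidth $\sigma$ is given by
$$\kappa_\sigma(x - y) = \exp\left(-\frac{(x - y)^2}{2\sigma^2}\right). \tag{3}$$
Correntropy $V(X, Y)$ is a local correlation measure in the kernel space $\mathcal{H}$. According to Mercer's theorem [24], it can be expressed in terms of the inner product as
$$V(X, Y) = \mathrm{E}\left[\left\langle \varphi(X), \varphi(Y) \right\rangle_{\mathcal{H}}\right], \tag{4}$$
where $\varphi(\cdot)$ is the nonlinear mapping induced by the kernel. It applies the kernel trick, which nonlinearly maps the original space to a higher dimensional feature space. It can be shown that correntropy is directly related to the probability of how similar two random variables are in a neighborhood of the joint space controlled by the kernel bandwidth $\sigma$ [17, 25, 26].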
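As an illustration of the definition above, the following minimal sketch estimates correntropy from samples by replacing the expectation with a sample mean. The function names `gaussian_kernel` and `empirical_correntropy`, the use of the unnormalized Gaussian kernel, and the toy data are assumptions made for this example only and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel(e, sigma):
    """Unnormalized Gaussian kernel exp(-e^2 / (2 sigma^2)) (assumed form)."""
    e = np.asarray(e, dtype=float)
    return np.exp(-e**2 / (2.0 * sigma**2))

def empirical_correntropy(x, y, sigma):
    """Sample-mean estimate of E[kappa_sigma(X - Y)]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean(gaussian_kernel(x - y, sigma)))

# Hypothetical toy data: identical samples versus one gross outlier.
x = np.array([0.0, 1.0, 2.0, 3.0])
y_clean = x.copy()
y_outlier = x.copy()
y_outlier[-1] += 100.0

v_clean = empirical_correntropy(x, y_clean, sigma=1.0)
v_outlier = empirical_correntropy(x, y_outlier, sigma=1.0)
print(v_clean, v_outlier)
```

Because the kernel value of the corrupted pair is practically zero, the outlier lowers the estimate by at most one sample's share, which is the local behavior described in the text.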

Kernel Risk-Sensitive Loss.
Similarity measures in kernel space can extract higher-order statistics of data, which can significantly improve the learning performance in non-Gaussian environments [21]. The optimization problem can be formulated by maximizing the correntropy criterion (MCC) or, equivalently, minimizing the correntropic loss (C-Loss) [27, 28] between the output estimate and the target response. However, the C-Loss performance surface can be highly nonconvex: it has steep slopes around the optimal solution but is extremely flat far from it. This may lead to slow convergence and poor performance. Choosing a large kernel bandwidth may alleviate this problem, but robustness to outliers decreases significantly as the kernel bandwidth increases [29]. To achieve a satisfying performance surface, the KRSL was proposed in [20].
The KRSL is defined by
$$L_\lambda(X, Y) = \frac{1}{\lambda}\,\mathrm{E}\left[\exp\left(\lambda\left(1 - \kappa_\sigma(X - Y)\right)\right)\right] = \frac{1}{\lambda}\int \exp\left(\lambda\left(1 - \kappa_\sigma(x - y)\right)\right) dF_{XY}(x, y),$$
which can also be expressed in a traditional risk-sensitive loss form as [30]
$$L_\lambda(X, Y) = \frac{1}{\lambda}\,\mathrm{E}\left[\exp\left(\lambda \cdot \frac{1}{2}\left\|\varphi(X) - \varphi(Y)\right\|_{\mathcal{H}}^2\right)\right], \tag{5}$$
where $\lambda > 0$ is the risk-sensitive parameter that controls the shape of the performance surface.
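When only samples are available, the expectation in the KRSL is replaced by a sample mean. The sketch below is a hedged illustration of that sample-average loss; the function name `krsl`, the unnormalized Gaussian kernel, and the toy error vector are assumptions for this example, not the authors' code.

```python
import numpy as np

def krsl(x, y, sigma, lam):
    """Sample-average KRSL: mean_j exp(lam * (1 - kappa_sigma(x_j - y_j))) / lam."""
    e = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    kappa = np.exp(-e**2 / (2.0 * sigma**2))  # assumed unnormalized Gaussian kernel
    return float(np.mean(np.exp(lam * (1.0 - kappa))) / lam)

# Hypothetical errors; the last entry plays the role of an outlier.
errors = np.array([0.0, 0.1, -0.2, 50.0])
zeros = np.zeros_like(errors)

loss_krsl = krsl(errors, zeros, sigma=1.0, lam=2.0)
loss_mse = float(np.mean(errors**2))
print(loss_krsl, loss_mse)
```

Each sample contributes at most exp(lam)/lam to the KRSL, so a single large error cannot dominate the loss the way it dominates the mean squared error.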
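In practice, the joint distribution function of $X$ and $Y$ is usually unknown and only a finite number of samples $\{(x_j, y_j)\}_{j=1}^{M}$ are available. The KRSL can thus be estimated by
$$\hat{L}_\lambda(X, Y) = \frac{1}{M\lambda}\sum_{j=1}^{M} \exp\left(\lambda\left(1 - \kappa_\sigma(x_j - y_j)\right)\right). \tag{6}$$
For an ELM network with $L$ hidden nodes and training samples $\{(\mathbf{x}_j, y_j)\}_{j=1}^{M}$, the network model is expressed as
$$y_j = \sum_{i=1}^{L} \beta_i f\left(\mathbf{a}_i \cdot \mathbf{x}_j + b_i\right), \quad j = 1, 2, \ldots, M, \tag{7}$$
where $L$ is the number of hidden nodes, $\beta_i$ is the weight connecting the $i$th hidden node and the output node, $f$ is the activation function (in this work, $f$ is a sigmoid function unless otherwise stated), $\mathbf{a}_i$ denotes the weight vector that connects the $i$th hidden node and the input nodes, and $b_i$ represents the randomly chosen bias of the $i$th hidden node. Equation (7) can be written compactly in matrix notation as
$$\mathbf{H}\boldsymbol{\beta} = \mathbf{y}, \tag{8}$$
where
$$\mathbf{H} = \begin{bmatrix} f(\mathbf{a}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & f(\mathbf{a}_L \cdot \mathbf{x}_1 + b_L) \\ \vdots & \ddots & \vdots \\ f(\mathbf{a}_1 \cdot \mathbf{x}_M + b_1) & \cdots & f(\mathbf{a}_L \cdot \mathbf{x}_M + b_L) \end{bmatrix}_{M \times L}, \quad \boldsymbol{\beta} = [\beta_1, \ldots, \beta_L]^T, \quad \mathbf{y} = [y_1, \ldots, y_M]^T, \tag{9}$$
and $\boldsymbol{\beta}$ is the minimal norm least squares solution of (8). The parameter $\boldsymbol{\beta}$ can be obtained by
$$\boldsymbol{\beta} = \mathbf{H}^{\dagger}\mathbf{y}, \tag{10}$$
where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalized inverse of the hidden layer output matrix $\mathbf{H}$.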
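The ELM training step described above (random hidden layer, output weights from the Moore-Penrose generalized inverse) can be sketched as follows. This is a minimal illustration under stated assumptions: the names `elm_fit` and `elm_predict`, the uniform range used for the random weights and biases, and the sine target are hypothetical choices, not details from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, y, n_hidden, rng):
    """Draw random input weights and biases, then solve for the output weights."""
    n_features = X.shape[1]
    A = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))  # input weights (assumed range)
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # hidden biases (assumed range)
    H = sigmoid(X @ A + b)                                   # hidden layer output matrix, M x L
    beta = np.linalg.pinv(H) @ y                             # minimal norm least squares solution
    return A, b, beta

def elm_predict(X, A, b, beta):
    return sigmoid(X @ A + b) @ beta

rng = np.random.default_rng(0)
X_train = rng.uniform(-10.0, 10.0, size=(200, 1))
y_train = np.sin(X_train[:, 0])  # hypothetical target, for illustration only

A, b, beta = elm_fit(X_train, y_train, n_hidden=30, rng=rng)
train_rmse = float(np.sqrt(np.mean((elm_predict(X_train, A, b, beta) - y_train) ** 2)))
print(beta.shape, train_rmse)
```

Only `beta` is learned; `A` and `b` stay fixed after they are drawn, which is what makes training a single linear solve.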

Orthogonal Matching Pursuit.
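Matching pursuit is one of the effective methods for sparse representation [14, 32, 33]. In general, a sparse representation problem can be formulated as
$$\mathbf{y} = \mathbf{A}\mathbf{x} + \boldsymbol{\varepsilon}, \tag{11}$$
where $\mathbf{A} \in \mathbb{R}^{m \times n}$ ($m < n$) denotes the measurement matrix, $\mathbf{x}$ is the sparse vector, and $\boldsymbol{\varepsilon} \in \mathbb{R}^m$ represents the noise vector. The main purpose is to recover the sparse vector $\mathbf{x}$ from the observation $\mathbf{y}$ and the measurement matrix $\mathbf{A}$. The OMP uses the $\ell_0$-norm constrained least squares model
$$\min_{\mathbf{x}} \left\|\mathbf{y} - \mathbf{A}\mathbf{x}\right\|_2^2 \quad \text{s.t.} \quad \left\|\mathbf{x}\right\|_0 \le K, \tag{12}$$
where $\|\mathbf{x}\|_0$ counts the number of nonzero coordinates of $\mathbf{x}$ and $K$ is the sparsity level.
In the following, we briefly describe the OMP method. First, we initialize the residual $\mathbf{r}_0 = \mathbf{y}$, the index set $\Lambda_0 = \emptyset$, and the iteration counter $t = 1$. At each iteration, the OMP algorithm selects the column of the measurement matrix $\mathbf{A}$ that is most correlated with the residual:
$$j_t = \arg\max_{i} \left|\left\langle \mathbf{r}_{t-1}, \mathbf{a}_i \right\rangle\right|,$$
where $\mathbf{r}_{t-1}$ denotes the residual in the $(t-1)$th iteration and $\mathbf{a}_i$ is the $i$th column of $\mathbf{A}$. The selected index is then collected into the index set, $\Lambda_t = \Lambda_{t-1} \cup \{j_t\}$. We can solve an LS problem to obtain a new estimate $\mathbf{x}_t$ supported in $\Lambda_t$:
$$\mathbf{x}_t = \arg\min_{\mathbf{x}:\ \mathrm{supp}(\mathbf{x}) \subseteq \Lambda_t} \left\|\mathbf{y} - \mathbf{A}\mathbf{x}\right\|_2^2,$$
where $\mathrm{supp}(\mathbf{x})$ denotes the support set of $\mathbf{x}$. Then one can update the residual
$$\mathbf{r}_t = \mathbf{y} - \mathbf{A}\mathbf{x}_t.$$
If the stopping criterion is satisfied, we output $\mathbf{x}_t$ as the estimate of $\mathbf{x}$.
From (8) and (11), we can see that the ELM network model has the same form as the sparse representation problem. Thus, one can take advantage of the OMP algorithm to select the best hidden nodes of the ELM network. The OMP estimates the sparse vector by using the $\ell_2$-norm based criterion, which performs well when the errors are Gaussian distributed. However, the presence of non-Gaussian noise may give rise to performance degradation.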
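The OMP steps described above (select the most correlated column, augment the index set, solve a least squares problem on the support, update the residual) can be sketched as below. The function name `omp`, the residual-norm stopping tolerance, and the small deterministic dictionary are assumptions chosen for illustration.

```python
import numpy as np

def omp(A, y, sparsity, tol=1e-10):
    """Greedy orthogonal matching pursuit with at most `sparsity` selected columns."""
    n = A.shape[1]
    residual = y.astype(float).copy()   # r_0 = y
    support = []                        # index set, initially empty
    x = np.zeros(n)
    for _ in range(sparsity):
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0          # never reselect a column
        support.append(int(np.argmax(correlations)))
        # Least squares estimate restricted to the current support.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        x = np.zeros(n)
        x[support] = coef
        residual = y - A @ x
        if np.linalg.norm(residual) < tol:       # assumed stopping criterion
            break
    return x, support

# Hypothetical 4 x 6 dictionary with unit-norm columns.
s = 1.0 / np.sqrt(2.0)
A = np.array([
    [1.0, 0.0, 0.0, 0.0, s,   0.0],
    [0.0, 1.0, 0.0, 0.0, s,   0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, s],
    [0.0, 0.0, 0.0, 1.0, 0.0, s],
])
x_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0, 0.0])
y = A @ x_true

x_hat, support = omp(A, y, sparsity=2)
print(support, x_hat)
```

In this noiseless toy case the two active columns are picked in order of their correlation with the residual, and the restricted least squares step recovers the coefficients.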

Kernel Risk-Sensitive Loss Based Matching Pursuit Extreme Learning Machine
To address the aforementioned issue, we propose a robust kernel risk-sensitive loss based orthogonal matching pursuit extreme learning machine algorithm (KRSLMP-ELM) in this section. In the KRSLMP-ELM, we initialize the residual $\mathbf{r}_0$ as $\mathbf{y}$ and the initial index set as $\Lambda_0 = \emptyset$. Then, similar to OMP, a column of $\mathbf{H}$ most correlated with the residual is selected and the index set is augmented at each iteration. We then obtain a new estimate $\boldsymbol{\beta}_t$ by solving the KRSL minimization problem (21), in which the KRSL between $\mathbf{y}$ and $\mathbf{H}\boldsymbol{\beta}$ is minimized over vectors $\boldsymbol{\beta}$ supported in $\Lambda_t$. We utilize the half-quadratic (HQ) theory [34] to construct the optimization algorithm. Considering that the measurements may include both large and small noise, we can use HQ optimization to estimate the importance of different samples. Severely corrupted samples are assigned small weights in the learning procedure to decrease the impact of large noise. Thus, the performance of KRSLMP-ELM can be significantly improved.
Inspired by the HQ theory, (21) can be solved by an alternating technique that updates the sample weights $\mathbf{w}^{(t+1)}$ with the output weights fixed and then updates the output weights $\boldsymbol{\beta}^{(t+1)}$ with the sample weights fixed, where $t$ denotes the iteration number. In the proposed algorithm, the bandwidth is adaptively chosen during the iteration. In order to make the scheme robust to outliers, we calculate the value of $\sigma$ as follows.
Denote the training error as $e(i) = \|\mathbf{y}(i) - (\mathbf{H}\boldsymbol{\beta})(i)\|_2^2$, $i = 1, 2, \ldots, M$. We then reorder the errors in ascending order and denote the reordered sequence as $e_\sigma$. Let $k = \lfloor \eta M \rfloor$, where the scalar $\eta \in (0, 1]$ and $\lfloor \cdot \rfloor$ outputs the largest integer not exceeding its argument. We can select $e_\sigma(k)$ as the bandwidth in accordance with the proportion of outliers. The experimental results obtained with different bandwidths are discussed in the experiment section. A closed-form update for the optimization problem in (21) can then be derived, in which $\boldsymbol{\beta}^{(t+1)}|_{\Lambda^c} = \mathbf{0}$ and $\mathbf{I}$ denotes the identity matrix. Since the importance of each measurement is used to adaptively update the output weight vector in the KRSLMP-ELM, we update the residual as
$$\mathbf{r}_t = \sqrt{\mathrm{diag}\left(\mathbf{w}^{(t)}\right)}\left(\mathbf{y} - \mathbf{H}\boldsymbol{\beta}^{(t)}\right).$$
Note that the sparsity level $K$ has to be assigned in advance in the KRSLMP-ELM. The sparsity $K$ directly determines the number of active hidden nodes used in ELM, because more hidden nodes than necessary are generated. To obtain the best sparsity level $K$, namely, the best number of hidden nodes used in ELM, we utilize the root mean square error (RMSE) as the criterion:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2},$$
where $N$ is the number of samples, $y_i$ denotes the target response, and $\hat{y}_i$ the corresponding output estimated by the KRSLMP-ELM.
For each candidate sparsity level $K$, the corresponding RMSE is first calculated. Then the $K$ coefficients associated with the minimum RMSE value are selected.
The iteration is repeated until achieving the stopping criterion.The KRSLMP-ELM is summarized in Algorithm 1.
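The overall procedure can be sketched in code, with the caveat that several details are assumptions rather than the paper's formulas. The sketch uses the weight form that follows from differentiating the KRSL (weight proportional to exp(lam(1 - kappa)) times kappa) and treats the selected squared error as the squared bandwidth. It also assumes a ridge-regularized weighted least squares update with a hypothetical parameter `reg`, and a fixed number of inner iterations `n_hq`. The names `hq_weights` and `krslmp` are also hypothetical.

```python
import numpy as np

def hq_weights(residual, lam, eta):
    """Sample weights from the current errors (assumed HQ weight form)."""
    sq_err = residual**2
    k = max(int(np.floor(eta * sq_err.size)), 1)
    sigma2 = max(np.sort(sq_err)[k - 1], 1e-12)   # assumption: selected error used as sigma^2
    kappa = np.exp(-sq_err / (2.0 * sigma2))
    return np.exp(lam * (1.0 - kappa)) * kappa

def krslmp(H, y, sparsity, lam=1.0, eta=0.9, reg=1e-6, n_hq=10):
    """Matching pursuit with a reweighted least squares step on the support."""
    M, L = H.shape
    support, beta = [], np.zeros(L)
    residual = y.astype(float).copy()             # r_0 = y
    for _ in range(sparsity):
        corr = np.abs(H.T @ residual)
        if support:
            corr[support] = 0.0
        support.append(int(np.argmax(corr)))      # most correlated column of H
        Hs = H[:, support]
        w = np.ones(M)
        for _ in range(n_hq):                     # alternate output weights / sample weights
            G = Hs.T @ (w[:, None] * Hs) + reg * np.eye(len(support))
            coef = np.linalg.solve(G, Hs.T @ (w * y))
            w = hq_weights(y - Hs @ coef, lam, eta)
        beta = np.zeros(L)
        beta[support] = coef
        residual = np.sqrt(w) * (y - H @ beta)    # weighted residual update
    return beta, support

# Hypothetical data: sparse linear model with a few gross outliers.
rng = np.random.default_rng(0)
H = rng.standard_normal((100, 20))
beta_true = np.zeros(20)
beta_true[[2, 7, 11]] = [4.0, -3.0, 5.0]
y = H @ beta_true + 0.01 * rng.standard_normal(100)
y[:5] += 50.0

beta_hat, support = krslmp(H, y, sparsity=3)
w_demo = hq_weights(np.array([0.1, -0.1, 0.2, 10.0]), lam=1.0, eta=0.75)
print(sorted(support), w_demo)
```

With `lam=1.0` the weight decreases monotonically with the error magnitude, so the sample with error 10.0 in `w_demo` receives a weight close to zero while the small-error samples keep weights near one.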

Experimental Results
To validate the effectiveness of the proposed KRSLMP-ELM algorithm, experiments on two synthetic data sets and seven benchmark data sets are conducted in this section. The performance of the new method is compared with five state-of-the-art algorithms, namely, ELM, RELM, ELM-RCC, OMP-ELM, and ORELM. The sigmoid function $f(x) = 1/(1 + e^{-x})$ is used as the activation function for all methods.
Func. This synthetic data set is generated by […], where $\boldsymbol{\varepsilon}$ is a zero-mean Gaussian distributed noise vector with standard deviation 0.[…]
Algorithm 1: KRSLMP-ELM.
  Initialize the residual $\mathbf{r}_0 = \mathbf{y}$ and the index set $\Lambda_0 = \emptyset$.
  for each iteration $t$ do
    Find a column of $\mathbf{H}$ most correlated with the residual.
    Solve the KRSLMP minimization problem iteratively; the solution is denoted as $(\mathbf{w}^{(t)}, \boldsymbol{\beta}^{(t)})$.
    Update the residual $\mathbf{r}_t = \sqrt{\mathrm{diag}(\mathbf{w}^{(t)})}\,(\mathbf{y} - \mathbf{H}\boldsymbol{\beta}^{(t)})$.
  end for
Parameters used in the six methods for the experiments on the two synthetic data sets are summarized in Table 1, where $L$, $C$, $K$, and $\lambda$ represent the number of hidden layer nodes, regularization parameter, sparsity level, and risk-sensitive parameter in KRSLMP-ELM. We set $\eta = 0.9$ in the Sinc synthetic data set experiment and $\eta = 1$ in the Func synthetic data set experiment. To keep the comparison legible in the Sinc function approximation problem, only the estimation results of the original ELM, ORELM, ELM-RCC, and KRSLMP-ELM are illustrated in Figure 1. In Figure 2, we plot the squared training errors obtained by the KRSLMP-ELM, ELM-RCC, ORELM, and the original ELM, respectively. As shown in these figures, the KRSLMP-ELM achieves the best approximation performance. The testing RMSEs of the six algorithms are presented in Table 2. The results indicate that the KRSLMP-ELM is more robust than the other five methods.
Further, we perform another experiment to compare the performance of KRSLMP-ELM with that of the original ELM under different outlier levels. We consider the Sinc function approximation problem and set the inner noise as a zero-mean Gaussian distributed noise with standard deviation 0.1, while the outliers noise is zero-mean Gaussian with standard deviation ranging between 0.1 and 10. We run 100 trials for each outliers noise level and show the RMSE results in Figure 3. One can see that the performance of the original ELM degrades severely as the outliers become stronger, whereas the performance of the KRSLMP-ELM is much less affected by outliers.

Benchmark Data Sets.
In this subsection, seven benchmark regression data sets from the UCI machine learning repository [36] are tested to support the superiority of the proposed method. Detailed specifications of the data sets are shown in Table 3. It should be pointed out that the training and testing samples are randomly chosen in each data set and all the features are normalized into [0, 1]. The parameters of each method are all chosen by fivefold cross-validation and are given in Table 4. For all algorithms, 100 independent trials are conducted and the average results are reported. The training and testing RMSEs of all algorithms and their standard deviations are listed in Table 5. As highlighted in boldface, the KRSLMP-ELM achieves the best performance on most regression data sets.

Sensitivity of Parameters.
We analyze the sensitivity of the parameters $L$, $C$, $K$, and $\lambda$ of KRSLMP-ELM in this subsection. For illustration, we use the regression results obtained on the Servo data set as an example. The sensitivity of each parameter is tested by fixing the remaining parameters to the values given in Table 4. Then, the testing RMSEs are recorded as the criterion for performance comparison.
The results of the regression performance are demonstrated in Figure 4.

Conclusion
In this paper, a robust matching pursuit based ELM algorithm, called the kernel risk-sensitive loss based matching pursuit extreme learning machine (KRSLMP-ELM), has been developed. The kernel risk-sensitive loss (KRSL) is a nonlinear similarity measure defined in kernel space, and it can achieve better performance than the conventional MSE criterion when dealing with non-Gaussian and nonlinear problems.
Incorporating the KRSL into the existing orthogonal matching pursuit algorithm, we developed an improved KRSLMP-ELM algorithm, which is more robust than the OMP-ELM method.Comparisons with several existing state-of-the-art algorithms have also been provided to validate the superiority of the proposed KRSLMP-ELM algorithm.

Synthetic Data Sets.
In this subsection, experiments on two synthetic regression data sets for the nonlinear function approximation problem are carried out. Descriptions of the two data sets are as follows.
Sinc. The synthetic data set is generated by $y_i = [\ldots] \cdot \mathrm{Sinc}(x_i) + \rho_i$, where $\rho_i$ contains two mutually independent noises, the inner noise $A_i$ and the outliers noise $B_i$. Specifically, $\rho_i$ is defined as $\rho_i = (1 - a_i)A_i + a_i B_i$, where $a_i$ is binary distributed with the probability masses $\Pr\{a_i = 1\} = c$ and $\Pr\{a_i = 0\} = 1 - c$ ($0 \le c \le 1$). $A_i$ and $B_i$ are independent of $a_i$. In this experiment, $c$ is set at 0.1. The outlier $B_i$ is generated by using a zero-mean Gaussian distributed noise with standard deviation 4.0. For the inner noise $A_i$, two different noises are tested: (a) a uniform distribution over $[-1.0, 1.0]$ and (b) a sine wave noise $\sin(\omega)$, with $\omega$ uniformly distributed over $[0, 2\pi]$. We uniformly generate the input data $x_i$ from $[-10.0, 10.0]$, where 200 data points are used for training and another 200 clean data points, which are not contaminated by any noise, are used for testing.
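The noise mixture described for the Sinc data set can be simulated as in the following sketch. Several choices are assumptions: the function names `sinc` and `make_sinc_data` are hypothetical, the scale factor of the Sinc term is left as a hypothetical `amplitude` parameter because its value is not given in the text, and Sinc is taken to be sin(x)/x.

```python
import numpy as np

def sinc(x):
    """Assumed unnormalized sinc, sin(x)/x with value 1 at x = 0."""
    return np.sinc(np.asarray(x, dtype=float) / np.pi)

def make_sinc_data(n, rng, amplitude=1.0, c=0.1, outlier_std=4.0, inner="uniform"):
    """Noisy training set: y = amplitude * sinc(x) + (1 - a) * A + a * B."""
    x = rng.uniform(-10.0, 10.0, size=n)
    if inner == "uniform":
        A = rng.uniform(-1.0, 1.0, size=n)                    # inner noise (a)
    else:
        A = np.sin(rng.uniform(0.0, 2.0 * np.pi, size=n))     # inner noise (b)
    B = rng.normal(0.0, outlier_std, size=n)                  # outliers noise
    a = rng.random(n) < c                                     # Pr{a = 1} = c
    rho = np.where(a, B, A)
    return x, amplitude * sinc(x) + rho

rng = np.random.default_rng(0)
x_train, y_train = make_sinc_data(200, rng)
x_test = rng.uniform(-10.0, 10.0, size=200)
y_test = sinc(x_test)   # clean targets, no noise
print(x_train.shape, y_test.shape)
```

Passing `inner="sine"` (or any value other than `"uniform"`) switches to the sine wave inner noise of case (b).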
[Figure 3: RMSE of the original ELM and the KRSLMP-ELM versus the standard deviation of outliers noise.]
The empirical KRSL in (6) defines a distance between the vectors $X = [x_1, x_2, \ldots, x_M]^T$ and $Y = [y_1, y_2, \ldots, y_M]^T$.

Extreme Learning Machine.
Extreme learning machine (ELM) was proposed by Huang et al. for training single hidden layer feedforward neural networks (SLFNs) [2, 31]. The input weights and biases are initialized randomly in ELM and remain unchanged during training. The network learning thus becomes optimizing the output weights, which can be formulated as solving a linear equation. Let $\{(\mathbf{x}_j, y_j)\}_{j=1}^{M}$ be $M$ training samples, where the input $\mathbf{x}_j \in \mathbb{R}^n$ and the corresponding desired output $y_j \in \mathbb{R}$; the relationship between $\mathbf{x}_j$ and $y_j$ is assumed to follow the network model. The network model of ELM with $L$ hidden neurons is the one expressed in (7).

Table 1: Parameter settings of four algorithms in function fitting.

Table 2: Testing RMSEs of six methods.

Table 3: Specification of the data sets.

Table 4: Parameter settings of six methods.

Table 5: Training and testing RMSEs for different data sets.