Smoothing L0 Regularization for Extreme Learning Machine

Extreme learning machine (ELM) has been put forward for single hidden layer feedforward networks. Because of its powerful modeling ability and the little human intervention it requires, the ELM algorithm has been widely used in both regression and classification experiments. However, in order to achieve the required accuracy, it needs many more hidden nodes than conventional neural networks typically do. This paper considers a new efficient learning algorithm for ELM with smoothing L0 regularization. The novel algorithm updates the weights in the direction along which the overall square error is reduced the most, and it can therefore sparsify the network structure very efficiently. The numerical experiments show that the ELM algorithm with smoothing L0 regularization needs fewer hidden nodes but achieves better generalization performance than the original ELM and the ELM with L1 regularization algorithms.


Introduction
Recently, the studies and applications of artificial intelligence have seen a new surge. Artificial neural networks, also known as neural networks, are a widely used artificial intelligence learning method that can autonomously learn the characteristics of data, thereby avoiding the process of manual feature selection.
Among them, the most common network type is the feedforward neural network, such as the perceptron, back-propagation (BP), and radial basis function (RBF) networks. Generally, it includes an input layer, several hidden layers, and an output layer. The neurons are arranged in layers, with each neuron connected only to the preceding layer: it takes the output of the previous layer and delivers its own output to the next layer.
There is no feedback between the layers. As a result of these good characteristics, feedforward networks have been widely used in various fields [1][2][3][4]. However, a drawback of the feedforward neural network is its inefficiency, especially when dealing with complex data.
Later, ELM was proposed by Huang et al. [5][6][7]. Unlike conventional neural networks, ELM is a new type of single hidden layer feedforward network (SLFN), whose input weights and hidden layer thresholds can be assigned arbitrarily provided that the activation function of the hidden layer is infinitely differentiable. The output weights can then be determined by the Moore-Penrose generalized inverse. In some simulations, the learning of the ELM algorithm can be completed in seconds [8]. It is extremely fast and has better generalization performance than traditional gradient-based learning algorithms.
However, in order to guarantee the accuracy of regression and classification experiments, the ELM method needs a huge number of hidden nodes. This leads to a particularly complex network structure and inevitably increases the model size and testing time. Therefore, in order to improve the predictive power and generalization performance of ELM, choosing the appropriate number of hidden layer nodes is a hot topic in ELM research.
Many scholars have done a lot of research on the hidden layer structure optimization of ELM and achieved many remarkable results. Rong et al. [9] proposed a fast pruned ELM (P-ELM) and successfully applied it to the classification problem.
This learning mechanism achieves the effect of "pruning" the network structure by deleting hidden layer nodes irrelevant to the class labels. Then, Miche et al. [10] proposed an optimally pruned extreme learning machine (OP-ELM) and extended the pruning method. The incremental extreme learning machine (I-ELM) was put forward by Huang et al. [11], and it can increase the number of hidden layer nodes adaptively. The efficiency of learning is improved, and the network performance is optimized by adding nodes during the training process. In [12], Yu and Deng proposed a series of new efficient algorithms for training ELM, which exploit both the structure of SLFNs and the gradient information over all training epochs.
In conclusion, the number of neurons is an important factor that determines the structure of the network. There are two methods to adjust the size of the network: one is the growing method, and the other is the pruning method. For the growing method, a common strategy is to begin with a smaller network and then add new hidden nodes during the process of training the network [13,14]. The pruning method starts with a large network structure and then prunes unimportant weights or nodes [15][16][17][18][19].
In this paper, based on the regularization method, we propose a new efficient algorithm to train ELM. This novel algorithm updates the weights in the direction along which the overall square error is reduced the most, while also enforcing sparsity of the weights effectively. Our strategy is to combine the L0 regularization term with the standard error function. On the basis of regularization theory, as p ⟶ 0, the Lp regularization method tends to produce sparser results. However, the L0 regularization term is not differentiable at the origin, and minimizing it is an NP-hard problem [24]. So, L0 regularization cannot be used in an optimization algorithm directly [25]. There are also other sparse strategies for optimizing the structure of neural networks [26,27]. Inspired by the work on smoothing L1/2 regularization for feedforward neural networks [28,29] and other optimization strategies, this paper draws on smoothing techniques to overcome the nondifferentiability at the origin and the resulting oscillation phenomenon; the related details are shown in the next section. The main contributions are as follows: (1) It is shown how the smoothing L0 regularization is used to train ELM; it can discriminate important weights from unnecessary weights and drives the unnecessary weights to zero, thus effectively simplifying the structure of the network. (2) An ideal approximate solution is obtained by using smoothing approximation techniques. The drawbacks of nonsmoothness and nondifferentiability of the ordinary L0 regularization term can thus be addressed, which effectively prevents oscillation during training.
(3) Numerical simulations show that the sparsity effect and the generalization performance of ELMSL0 are better than those of the original ELM and ELML1. For example, the results in Tables 1-3 clearly show that the novel ELMSL0 algorithm uses fewer neurons but has higher testing accuracy for most of the data sets. The remainder of this paper is organized as follows. Section 2 describes the network structure of the traditional ELM algorithm. Section 3 presents the ELM algorithm with smoothing L0 regularization (ELMSL0). Section 4 shows how the smoothing L0 regularization helps the gradient-based method produce sparse results. The performance of the ELMSL0 algorithm is compared with the ELM and ELM with L1 regularization (ELML1) algorithms on regression and classification applications in Section 5. Finally, conclusions are given in Section 6.

Extreme Learning Machine (ELM)
Next, we give a description of the traditional ELM. The numbers of neurons in the input, hidden, and output layers are n, L, and m, respectively (see Figure 1). The ELM algorithm randomly generates the connection weights between the input layer and the hidden layer and the thresholds of the hidden layer neurons, and these need not be adjusted during training. As long as the number of hidden layer neurons is set, the unique optimal solution can be obtained. The input matrix X and output matrix Y of the training set with Q samples are, respectively,

X = [x_1, x_2, . . . , x_Q] ∈ R^(n×Q),  Y = [y_1, y_2, . . . , y_Q] ∈ R^(m×Q). (1)

On the basis of the structure of ELM, the actual output T of the network is

T = [t_1, t_2, . . . , t_Q] ∈ R^(m×Q),  t_j = [t_1j, t_2j, . . . , t_mj]^T, (2)

and the normative ELM with L hidden nodes and activation function g(x) expresses t_j as

t_j = Σ_{i=1}^{L} β_i g(w_i · x_j + b_i),  j = 1, 2, . . . , Q, (4)

where w_i = [w_i1, w_i2, . . . , w_in]^T is the weight vector connecting the ith hidden node and the input nodes, β_i = [β_i1, β_i2, . . . , β_im]^T is the weight vector connecting the ith hidden node and the output nodes, b_i is the threshold of the ith hidden node, and w_i · x_j denotes the inner product of w_i and x_j.
Equation (4) can be written compactly as Hβ = T^T, where H is called the hidden layer output matrix of the ELM:

H =
⎡ g(w_1 · x_1 + b_1)  ⋯  g(w_L · x_1 + b_L) ⎤
⎢          ⋮           ⋱           ⋮          ⎥
⎣ g(w_1 · x_Q + b_1)  ⋯  g(w_L · x_Q + b_L) ⎦ (Q×L),

β = [β_1, β_2, . . . , β_L]^T ∈ R^(L×m), and T is defined in (2). By using the Moore-Penrose generalized inverse H^† of H, the least-squares solution of Hβ = T^T can be written as

β̂ = H^† T^T.
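For readers who prefer code, the following is a minimal NumPy sketch of this training procedure, assuming a sigmoid activation function and input weights and thresholds drawn uniformly from [−1, 1]; the function names and the initialization range are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def elm_train(X, Y, L, rng=None):
    """Fit a basic ELM: random hidden-layer parameters, least-squares output weights.

    X: (Q, n) training inputs, Y: (Q, m) targets, L: number of hidden nodes.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(L, n))      # random input weights w_i
    b = rng.uniform(-1.0, 1.0, size=L)           # random hidden thresholds b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))     # hidden layer output matrix (sigmoid g)
    beta = np.linalg.pinv(H) @ Y                 # Moore-Penrose solution of H beta = T^T
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```

Only the output weights beta are learned; the hidden-layer parameters stay at their random values, which is why training reduces to a single pseudoinverse computation.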

Extreme Learning Machine with Smoothing L0 Regularization (ELMSL0)
In the general case, the weights with large absolute values play a more important role in training. In order to prune the network effectively, we first need to identify the unimportant weights and then remove them. Therefore, the aim of the network training is to find appropriate weights β that minimize ‖Hβ − Y^T‖ as well as ‖β‖_p:

min_β ‖Hβ − Y^T‖_2^2 + λ‖β‖_p^p,

where λ is the regularization coefficient that balances the training accuracy and the complexity of the network. ‖β‖_p^p is called the Lp regularization term, and it shows different properties for different values of p. L2 regularization can effectively prevent the weights from growing too large, but it is not sparse. L1 regularization can generate sparse solutions. L0 regularization produces sparser solutions than L1 regularization, so we focus on L0 regularization in this paper. Here, ‖β‖_0^0 is the L0 regularization term, with the L0 norm defined by

‖β‖_0^0 = #{i : β_i ≠ 0}

for β = (β_1, β_2, . . . , β_n)^T; that is, ‖β‖_0^0 equals the number of nonzero elements of the vector β. However, according to combinatorial optimization theory, minimizing ‖Hβ − Y^T‖_2^2 + λ‖β‖_0^0 is an NP-hard problem. In order to overcome this drawback, the following continuous function H_ρ(β) is employed to approximate the L0 regularization near the origin:

H_ρ(β) = Σ_i h_ρ(β_i).

Here, h_ρ(·) is continuously differentiable on R, and ρ is a positive number that controls how closely H_ρ(β) approximates the L0 regularization. A representative choice for h_ρ(β) is

h_ρ(β) = 1 − exp(−β²/(2ρ²)),  with derivative  h_ρ′(β) = (β/ρ²) exp(−β²/(2ρ²)).

To sum up, the error function with the smoothing L0 regularization has the following form:

E(β) = Σ_{k=1}^{Q} ‖(Hβ − Y^T)_{k*}‖² + λ Σ_i h_ρ(β_i),

where (Hβ − Y^T)_{k*} represents the kth row of the matrix (Hβ − Y^T). We use the gradient descent method to minimize the error function, and the gradient of the error function with respect to β_i is

E_{β_i}(β) = 2 H_{*i}^T (Hβ − Y^T) + λ h_ρ′(β_i),  i = 1, 2, . . . , L,

where H_{*i} denotes the ith column of H. For any initial value β^0, the batch gradient method with the smoothing L0 regularization term updates the weights by

β^{k+1} = β^k − η ∇E(β^k),  k = 0, 1, 2, . . . ,

where η > 0 is the learning rate.
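As a concrete illustration, the following NumPy sketch implements this batch update under the assumption that β is stored as an L × m matrix and the smoothing penalty is applied elementwise; it is a minimal sketch of the update rule, not the authors' reference implementation.

```python
import numpy as np

def h_rho(beta, rho):
    """Smoothing approximation of the L0 penalty: 1 - exp(-beta^2 / (2 rho^2)), elementwise."""
    return 1.0 - np.exp(-beta**2 / (2.0 * rho**2))

def h_rho_prime(beta, rho):
    """Derivative of h_rho with respect to beta, applied elementwise."""
    return (beta / rho**2) * np.exp(-beta**2 / (2.0 * rho**2))

def elmsl0_step(beta, H, Y, lam, eta, rho):
    """One batch gradient step on E(beta) = ||H beta - Y||^2 + lam * sum h_rho(beta).

    H: (Q, L) hidden layer output matrix, Y: (Q, m) targets, beta: (L, m) output weights.
    """
    grad = 2.0 * H.T @ (H @ beta - Y) + lam * h_rho_prime(beta, rho)
    return beta - eta * grad
```

Starting from the Moore-Penrose solution or from any initial β^0, iterating elmsl0_step drives the unimportant output weights toward zero while keeping the squared error small.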

Description of Sparsity
Regularized sparse models play an increasingly important role in machine learning and image processing. They remove a large number of redundant variables and retain only the explanatory variables that are most relevant to the response variables, which simplifies the model while keeping the most important information in the data sets and effectively solves many practical problems. Next, we show how the ELMSL0 algorithm differentiates between important and unimportant weights; among them, the weights with large absolute values are more important. The curves of h_ρ(β) and h_ρ′(β) for different ρ are shown in Figures 2 and 3. Notice that the only minimum point of h_ρ(β) is β = 0. From the expression of h_ρ′(β), for an adequately small ρ, there exists a positive number β_0 such that h_ρ′(β) ≈ 0 when |β| > β_0 and h_ρ′(β) is quite large when 0 < |β| < β_0. Therefore, when the ELMSL0 algorithm is used to train the network, the weights whose absolute values are greater than β_0 are not easily affected by the regularization term, while the unimportant weights, whose absolute values are less than β_0, are driven to zero during training. This makes clear why the smoothing L0 regularization term can help the gradient-based method achieve sparse results.
In light of the above discussion, we would like more weights to fall into the interval (−β_0, β_0) so as to achieve sparser results. One choice to make this happen is to set very small initial weights, but this leads to slow convergence at the beginning of the training procedure because the gradient of the square error function will be very small [30]. The weight decay method is another choice, in which the size of the network weights is reduced compulsorily during the training process. As shown in Figure 3, if the parameter ρ is set too small, the training becomes unstable and the performance of the algorithm is affected. Therefore, the parameter ρ should be set to a decreasing sequence with a lower bound.
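To make the last point concrete, one possible schedule for ρ is sketched below; the initial value and decay factor are hypothetical choices, while the lower bound 0.06 matches the value used later for the SinC regression experiment.

```python
def rho_schedule(epoch, rho0=1.0, decay=0.95, rho_min=0.06):
    """Shrink rho each epoch but never below rho_min, so that h_rho' does not blow up
    for small weights late in training. rho0 and decay are illustrative values."""
    return max(rho0 * decay ** epoch, rho_min)
```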

Simulation Results
In order to substantiate the reliability of the proposed ELMSL0 algorithm, we conduct experiments on both regression and classification applications. In Section 5.1, the ELM, ELML1, and ELMSL0 algorithms are used to approximate the SinC function and a multidimensional Gabor function. Several real-world classification data sets are used to test the performance of the three algorithms in Section 5.2.

Function Regression Problem.
The SinC function is defined by

y(x) = sin(x)/x for x ≠ 0, and y(x) = 1 for x = 0,

and it has a training set (x_i, y_i) and a testing set (x_i, y_i), each with 5000 data points, where x_i is uniformly distributed on the interval (−10, 10). To make the regression problem more realistic, uniform noise distributed in [−0.2, 0.2] is added to all the training samples, while the testing samples remain noise-free. There are 50 hidden nodes, and the activation function is the RBF for the ELM, ELML1, and ELMSL0 algorithms. We choose the learning rate η = 0.2, the regularization coefficient λ = 0.05, and h_ρ(β) = 1 − exp(−β²/(2ρ²)), where ρ is set to be a decreasing sequence. As demonstrated by Figure 3, h_ρ′(β) may become very large when the parameter ρ is too small, which may cause instability during the training procedure. Thus, we set a lower bound of ρ = 0.06.

Figures 4-6 exhibit the prediction results. Figure 4 shows the prediction values of the ELM algorithm for the SinC function, Figure 5 shows those of the ELML1 algorithm, and Figure 6 shows those of the ELMSL0 algorithm. It is obvious that the ELMSL0 algorithm has better prediction performance than the other two algorithms. The root mean square error (RMSE) is usually used as the error function for evaluating the performance of an algorithm; the smaller the RMSE, the more accurately the algorithm describes the experimental data. The RMSE is calculated by

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (f(x_i) − y_i)² ),

where f(x_i) denotes the predicted data and y_i denotes the actual data. We run 50 experiments for each of the three algorithms and report the average training and testing RMSE values in Table 1. Comparing the numbers of hidden nodes of the three algorithms, the ELMSL0 algorithm needs fewer hidden nodes yet achieves the highest accuracy on the testing sets.

Next, we consider using the three algorithms to approximate a multidimensional Gabor function:

Z(x, y) = (1/(2π(0.5)²)) exp(−(x² + y²)/(2(0.5)²)) cos(2π(x + y)). (17)

We select 2601 samples from an evenly spaced 51 × 51 grid on −0.5 ≤ x ≤ 0.5 and −0.5 ≤ y ≤ 0.5 as training samples, and 2601 testing samples are selected similarly. In order to prevent overfitting during training, noise evenly distributed in [−0.4, 0.4] is added to the training samples, and there is no noise in the testing samples. The original ELM algorithm has 100 hidden nodes, and the RBF is selected as the activation function. Then, we use the ELML1 and ELMSL0 algorithms to prune the network, respectively, choosing the learning rate η = 0.2, the regularization coefficient λ = 0.05, and ρ = 0.06 to approximate the Gabor function (Figure 7). We perform 50 experiments for each of the three algorithms. As shown in Figures 8-10, it is clear that the ELMSL0 algorithm has better prediction performance than the conventional ELM and ELML1 algorithms. Table 2 gives the average training and testing RMSE values and the number of hidden nodes required by the three algorithms; the accuracy of the ELMSL0 algorithm on the testing sets is higher than that of the other algorithms.
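The following is a rough sketch of how the SinC training and testing data described above might be generated and how the RMSE is computed; the random seed and the particular NumPy routines are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sinc(x):
    """SinC target: sin(x)/x for x != 0 and 1 at x = 0."""
    x = np.asarray(x, dtype=float)
    out = np.ones_like(x)
    nz = x != 0
    out[nz] = np.sin(x[nz]) / x[nz]
    return out

# 5000 training inputs uniform on (-10, 10); targets corrupted by uniform noise in [-0.2, 0.2]
x_train = rng.uniform(-10, 10, 5000)
y_train = sinc(x_train) + rng.uniform(-0.2, 0.2, 5000)

# 5000 noise-free testing samples
x_test = rng.uniform(-10, 10, 5000)
y_test = sinc(x_test)

def rmse(y_pred, y_true):
    """Root mean square error between predicted data f(x_i) and actual data y_i."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))
```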

Real-World Classification Problems.
In this section, we compare the generalization performance of the ELM, ELML1, and ELMSL0 algorithms on some real-world classification problems, which include seven binary classification problems and seven multiclass classification problems. Tables 4 and 5 describe these data sets, including the number of training and testing samples and the number of attributes of each classification data set. To explain Tables 4 and 5 clearly, we take the Diabetes and Iris data sets as examples. Each Diabetes sample belongs to either the positive or the negative class. The data, from the "Pima Indians Diabetes Database", were created by the Applied Physics Laboratory and consist of 768 women over the age of 21 who come from Phoenix, Arizona. The Iris data include four features: calyx length, calyx width, petal length, and petal width. It contains three classes: Setosa, Versicolor, and Virginica.
The number of hidden nodes differs for each data set, and the activation function is the sigmoid function for all three algorithms. We choose the learning rate η = 0.02, the regularization coefficient λ = 0.03, and h_ρ(β) = 1 − exp(−β²/(2ρ²)), where ρ is a decreasing sequence with a lower bound of 0.08. We run 50 experiments with each data set, and Table 3 shows the average training and testing accuracy. It can be found that the ELMSL0 algorithm requires fewer hidden nodes without affecting the testing accuracy. So, the ELMSL0 algorithm not only produces sparse results to prune the network but also has better generalization performance than the other two algorithms on most of the classification data sets.

Table 5: Multiclass classification data sets.

Data set    Training data  Testing data  Attributes  Classes
Iris        100            50            4           3
Wine        120            58            13          3
Glass       150            64            9           6
Olitos      84             36            25          4
E.coli      236            100           8           8
Seeds       147            63            7           3
Wholesale   308            132           7           3

It is well known that the regularization coefficient and the number of hidden nodes affect the accuracy of the algorithms. Therefore, several experiments are needed to select appropriate regularization coefficients. A binary Sonar data set and a multiclass Iris data set are selected here. We take the Sonar signal classification data set from the UCI machine learning repository as an example; it is a typical benchmark problem in the field of neural networks. All samples are divided into two categories: sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. Here, we set the number of hidden nodes to 500 in ELM and prune the network with the ELML1 and ELMSL0 algorithms. Figure 11 shows the testing accuracy of the two algorithms with different regularization coefficients, and Figure 12 shows the number of hidden nodes required by the two algorithms as the regularization coefficient increases. From Figures 11 and 12, the ELMSL0 algorithm requires fewer hidden nodes but achieves better generalization performance than the ELML1 algorithm across different regularization coefficients.
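The "number of hidden nodes" reported for the pruned networks can be understood as the count of nodes whose outgoing weight vectors were not driven to zero. A small, self-contained sketch of such a count is given below; the tolerance and the toy weight matrix are illustrative values, not taken from the paper.

```python
import numpy as np

def active_hidden_nodes(beta, tol=1e-4):
    """Count hidden nodes whose outgoing weight vector has not been driven to (near) zero.

    beta: (L, m) output-weight matrix; a node counts as pruned when every entry of its
    row is below tol in absolute value. The threshold tol is an illustrative choice.
    """
    return int(np.sum(np.max(np.abs(beta), axis=1) > tol))

# Toy example: 5 hidden nodes, 3 outputs; the second and third rows have been driven to (near) zero.
beta = np.array([[ 0.8, -0.1,  0.0],
                 [ 0.0,  0.0,  0.0],
                 [ 1e-6, 0.0,  0.0],
                 [-0.3,  0.4,  0.2],
                 [ 0.0,  0.9, -0.5]])
print(active_hidden_nodes(beta))  # prints 3
```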

Conclusions
In this paper, we use a smoothing function to approximate L0 regularization and propose a pruning method with smoothing L0 regularization (ELMSL0) for training and pruning the extreme learning machine. It is also shown that the ELMSL0 algorithm can produce sparse results to prune networks effectively. Both the regression and classification problems show that the ELMSL0 algorithm has better generalization performance and a simpler network structure. In the future, we will consider applying intelligent algorithms to the ELM algorithm to find the most appropriate weights and thresholds.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.