Pruning Multilayered ELM Using Cholesky Factorization and Givens Rotation Transformation

Extreme learning machine is originally proposed for the learning of the single hidden layer feedforward neural network to overcome the challenges faced by the backpropagation (BP) learning algorithm and its variants. Recent studies show that ELM can be extended to the multilayered feedforward neural network in which the hidden node could be a subnetwork of nodes or a combination of other hidden nodes. Although the ELM algorithm with multiple hidden layers shows stronger nonlinear expression ability and stability in both theoretical and experimental results than the ELM algorithm with the single hidden layer, with the deepening of the network structure, the problem of parameter optimization is also highlighted, which usually requires more time for model selection and increases the computational complexity. +is paper uses Cholesky factorization strategy and Givens rotation transformation to choose the hidden nodes of MELM and obtains the number of nodes more suitable for the network. First, the initial network has a large number of hidden nodes and then uses the idea of ridge regression to prune the nodes. Finally, a complete neural network can be obtained. +erefore, the ELM algorithm eliminates the need to manually set nodes and achieves complete automation. By using information from the previous generation’s connection weight matrix, it can be evitable to re-calculate the weight matrix in the network simplification process. As in the matrix factorization methods, the Cholesky factorization factor is calculated by Givens rotation transform to achieve the fast decreasing update of the current connection weight matrix, thus ensuring the numerical stability and high efficiency of the pruning process. Empirical studies on several commonly used classification benchmark problems and the real datasets collected from coal industry show that compared with the traditional ELM algorithm, the pruning multilayered ELM algorithm proposed in this paper can find the optimal number of hidden nodes automatically and has better generalization performance.


Introduction
Extreme learning machine (ELM) was first proposed by Huang and has attracted extensive attentions for its extremely fast learning speed, least human intervention, and easy implementation [1][2][3]. In the ELM algorithm, the input weights and hidden biases are randomly generated from any continuous probability distribution, and then, the output weights can be solved using the generalized Moore-Penrose inverse. Compared with the BP neural network, this algorithm has a good performance network in regression [4][5][6], classification [7][8][9], feature learning [10][11][12], and cluster tasks [13][14][15]. Different from conventional gradient-based neural network learning algorithms, which are sensitive to the combination of parameters and easy to trap in local optimum, ELM not only has a faster training speed but also has a smaller training error. e learning theories of ELM show that it can maintain the universal approximation and classification capability of SLFNs even if it works with randomly generated hidden nodes. However, because the essence of the extreme learning machine is the empirical risk minimization model and the randomness of the input weights and hidden biases, the extreme learning machine requires more hidden nodes than the BP neural network. Inspired by the structural risk minimization (SRM) principle of statistical learning theory and the weighted least square method, Deng et al. [16] proposed a regularized extreme learning machine (RELM). On the basis of regularization, Zhou et al. [17] used Cholesky factorization to solve the regularized extreme learning machine and recursive least squares method for online learning. Chou's method implements online training by adding new training samples one by one and introduces the forgetting factor to reduce the influence of old training samples, so as to highlight the role of new training samples.
Although RELM has better robustness, it does not solve how to obtain the most suitable network structure like ELM. When there are too few hidden nodes in the neural network, the phenomenon of under fitting will appear, and when there are too many hidden nodes, there will be over fitting.
erefore, suitable hidden nodes are particularly important for neural networks. Rong et al. [18] proposed a pruning extreme learning machine. is method first defines a large amount of hidden nodes for the network, then prunes the label according to the correlation degree of each node, and finally obtains a suitable network. Martinez-Martinez et al. [19] used regularization to prune the hidden nodes. e initial stage of this method is the same as the pruning extreme learning machine, which is to first define a large number of hidden nodes for the network. e difference is that the method proposed by Martinez uses ridge regression to prune the network structure, thereby pruning redundant nodes.
When the input of the extreme learning machine contains a lot of noise, the fitting effect is often not ideal. erefore, Miche et al. [20] proposed an optimal pruning extreme learning machine (OP-ELM) based on the correlation of input variables. is method uses the leave-one-out criterion to determine whether the node is left or not, which can effectively reduce the sensitivity to noise and obtain a more concise network model. Miche et al. [21] proposed a new and improved ELM (TROP-ELM) on the basis of the optimal pruning extreme learning machine. TROP-ELM adds L1 and L2 penalty terms to solve the OP-ELM instability in the process of solving the residual sum of squares. Indeed, the first step in the improvement process performs a L1 penalty on the output layer to rank the hidden nodes by implementing a LARS and then followed sequentially by a L2 penalty applied on the L1 penalized production from the LARS, thus avoiding numerical instabilities and efficient pruning of the network accordingly.
In recent years, due to the successful application of deep learning in many practical problems, academic research on ELM has shifted from single-layer networks to multilayer networks. e multilayered extreme learning machine (MELM) [22] is a new framework that attempts to integrate the characteristics of deep neural networks with a multilayer structure into ELM, but it also brings huge hidden layers of MELM that need to be predefined manually. erefore, the exploration of the neural network structure is still an important issue. Moreover, the Moore-Penrose generalized inverse matrix is involved in the process of using the least squares method to solve the output matrix. If the matrix is not full rank, it will often cause numerical calculation errors.
Aiming at the above open problem of MELM, this paper proposes a pruning multilayer ELM algorithm framework (P-MELM). e algorithm adds the Tikhonov regularization factor in the process of calculating the output weights to achieve efficient pruning of nodes and then uses fast matrix decomposition strategy to automatically learn the optimal number of nodes. Since a lot of operations are involved in the pruning process of hidden nodes, this paper proposes a more stable numerical operation method, which uses Cholesky factorization of the covariance matrix and Givens rotation transformation to calculate pseudo inverse. is method can avoid large-scale matrix inversion, so compared with the direct calculation of the least square solution, this method needs less computing time and memory. e chapters of this article are arranged as follows. Section 1 reviews the improvements made by previous generations to ELM. Section 2 mainly introduces MELM. Section 3 proposes and introduces our algorithm P-MELM. Section 4 establishes a model and compares it with other algorithms. Section 5 summarizes the full text.

Multilayered ELM (MELM)
is the selected activation function. First, using the principle of random sampling, a value is randomly selected within the range of [0-1] to become the input weight W 1,L � [w 1,1 , w 1,2 , . . . , w 1,L ] T ∈ R L×n of the first hidden layer and the threshold B 1,L � [b 11 , b 12 , . . . , b 1L ] T ∈ R L×N of the first hidden layer. e standard mathematical formula for MELM is as follows: (1) By using the ELM algorithm to write formula (1) in the matrix form, we can get where As described in RMELM [23], the output matrix of the first hidden layer in MELM is defined as H 1,L ∈ R N×L , whose elements (h 1,j ) i � G(w 1,j x i + b 1j ) (i � 1, 2, . . . , N and j � 1, 2, . . . , L) may be explained as the output of the jth node in the first hidden layer in reference to x i , where h 1,j � g(w 1,j x 1 + b 1j ) · · · g(w 1,j Unlike the conventional learning algorithm for SLFNs, the MELM approach is thus to generate randomly the weights w 1,j and the bias b 1j in the first hidden layer and determine analytically the connection weights β 1,L ∈ R L×m between the first hidden layer and the output layer using Moore-Penrose generalized inverse, which is identical to finding the minimum norm least square solution of the linear system, and therefore, representation (2) can be written as where H + where W 2,L represents the weights' matrix that links the first hidden layer to the second hidden layer. e matrices B 2,L and H 2,L , respectively, denote the bias and the expected output of the second hidden layer. Moreover, the output matrix H 2,L is made on the basis of the multilayer theory and can be rewritten as where (β 1,L ) + is the MP pseudoinverse of the matrix β 1,L , obtained using the approach discussed before, namely, (β 1,L ) + � (β T 1,L β 1,L ) − 1 β 1,L if β T 1,L β 1,L is nonsingular; otherwise, (β 1,L ) + � β T 1,L (β 1,L β T 1,L ) − 1 if β 1,L β T 1,L is nonsingular. e augmented matrix is then defined as W 2HE,L � B 2,L W 2,L and computed as follows: where H + 2E,L represents the pseudoinverse of the matrix H 2E,L � 1 H 1,L T , 1 � [1, 1, . . . , 1] T ∈ R N , and g − 1 (x) represents the inverse function of g(x). e calculation of H + 2E,L is the same as the previous description of H + 1,L . Subsequently, we can calculate the actual output matrix of the second hidden layer as erefore, β 2,L can be updated as follows: e calculating method of H 2,L proceeds in the form discussed before. If MELM has only two hidden layers, the final output of MELM can be expressed as When the number of hidden layers of MELM is greater than 2, as indicated in (5)-(10), an iterative solution scheme is adopted to implement the calculation process. en, all the related parameters of these hidden layers can be obtained after the (M-2)-iteration operation. To make the final actual hidden layer output close to the expected hidden layer output, in the training stage, the MELM designs a novel parameter-setting step for the newly added hidden layer, as described in (8), which is the key step to guarantee the stability and feasibility of the MELM model.

Pruning Multilayered ELM (P-MELM)
is newly modified approach is in the form of a L2 regularization penalty applied within the MELM. First, the regularization method can constrain the norm of the output weights to avoid the possible numerical instabilities and prevent the model from overfitting. Second, the regularization term makes the coefficient matrix of linear equations positive definite and thus uses a fast matrix factorization strategy to recursively update the output weights in the calculation of the pruning criterion, which provides more accurate and reliable computation results. e pruning multilayered ELM algorithm based on the Cholesky factorization and Givens rotation transformation are deduced in this section, which avoids the complexities of matrix inversion.

e Solutions of P-MELM by Cholesky Factorization and Givens Rotation Transformation. Previous research has
shown that due to the randomly generated parameters of the hidden nodes, MELM typically requires more hidden nodes than conventional gradient descent methods to provide desired performance levels. is essence of MELM clearly brings faster learning speed, least human intervention, and easy implementation. However, it suffers from performance problems connected with too many hidden nodes chosen at random, that is, the lack of sparsity and adjustment ability in the hidden layers. Furthermore, another common risk is that the network will produce some redundant nodes that add little or nothing to the existing efforts to respond to the model training. As a result of all these trends, the initial output matrix may not be a full column rank or even illconditioned; then, the exact analytical solution of the output weights can hardly be found, which directly affects the generalization capability and stability of the network. Aiming at this problem, it is proposed here to modify the pruning strategy for the selection of the optimal number of nodes by taking heuristic techniques to tentatively prune out the redundant nodes of the model that will decrease the computational complexity of providing an accurate ranking of the hidden layer nodes and raise the efficiency of the network learning and the adaptability to real problem resolving. In the following, the most well-known algorithms used to perform regularization are reviewed, using a L 2 penalty in the MELM, to regularize the network.
It is assumed that the initial value of hidden nodes is set as L � L max , the output matrix of the first hidden layer is H 1,L , and the connection weights' matrix between the first hidden layer and the output layer is β 1,L . rough studying the training dataset, the optimal number of hidden nodes and hidden layers are determined, which use the P-MELM prediction model for a framework and employ the Cholesky factorization method and Givens rotation transformation to recursively calculate the factorization factor U L . As the number of hidden nodes decreases, the new connection weights' matrix β 1,L+1 can be obtained rapidly according to the parameter information of the existing network. More specifically, if we reduce the total number of hidden nodes from L to L − 1, the first hidden layer output matrix of P-MELM will change from H 1,L ∈ R N×L to H 1,L−1 ∈ R N×(L− 1) , which can be intuitively represented as the following: where . According to (12), we can get where I L ∈ R L×L and I L−1 ∈ R (L− 1)×(L− 1) denote the identity matrices of order L and (L-1), respectively, Due to the tight coupling that exists between A L−1 and A L , as shown in (13) above, the Cholesky factorization U L−1 of matrix A L−1 can be analytically and easily determined based on the Cholesky factorization theorem [24]. e recursive calculation of matrix factorizations is implemented as follows. First, the matrices A L and U T L are partitioned into four submatrices, and their corresponding block matrices have the same size. For convenience, the matrix U T L can thus be rewritten as where V T is an (L − 1)-dimensional row vector, W T is an (L − 1) × (L − 1) square matrix, and v is a scalar. Substituting formula (14) to (13), A L−1 can be solved as Next, the rank one update algorithm [25] is used to calculate the Cholesky factorization factor, so U T L−1 can be obtained from the equation of the form: Here, J(k) is the Givens rotation transformation matrix, which can be expressed in the matrix form as where and k � 1, 2, . . . , L − 1. It is worth noting that, in (16), the matrix V and W change with the rank one update algorithm proceeding, so the calculation process of (17) does not need to explicitly record the matrix J(k). Meanwhile, as shown in (12), B L can be represented as If the number of hidden nodes decreases, the connection weights' matrix β 1,L−1 between the first hidden layer and the output layer can be directly and analytically determined by solving the following equations: en, substitute formula (15) into (19) and multiply U −1 L−1 of both sides, and the problem seeking for β 1,L−1 can thus be formulated as the following: where and it can be shown that F L−1 is associated with U L−1 and B L−1 , which are indeed obtained by solving the linear equation system. Accordingly, the optimal output weight matrix β 1,L−1 can be calculated simultaneously using the corresponding entries of matrices U L−1 and F L−1 . erefore, if the number of hidden nodes decreases successively, β 1,L−1 can be rapidly calculated by simple matrix arithmetic operation on the basis of computing β 1,L . In this case, if β 1,L−1 is calculated by using the method shown in (5), it must be recalculated in the way of finding the inverse of the higher-order matrix, and it cannot be directly solved on the basis of computing β 1,L . erefore, the P-MELM algorithm can guarantee the learning accuracy and further improve the training speed. We have summarized the steps of the P-MELM algorithm as follows:

Decremental Learning
Step 1: first, we limit the number of nodes. e maximum number of nodes is L max , the minimum number of nodes is L min , and the iteration stopping criterion is ξ L . en, we set the initial number of hidden nodes as L � L max and calculate A L and B L .
Step 2: calculate the Cholesky factorization U L of A L according to (13), and calculate F L using the matrices U L and B L according to (21).
Step 3: calculate β 1,L with the matrices U L and F L according to (20). en, build a P-MELM model according to β 1,L . e network structure of the model has M hidden layers, and each hidden layer has L hidden layer nodes.
Step 4: the empirical risk and structural risk of the P-MELM algorithm are solved, and the sum of these two parts is carried out: Step 5: let L � L − 1, calculate U L and B L on the basis of U L+1 and B L+1 according to the principle shown in (14)- (18), and then, go to Step 3. Start from L � L max − 1 to determine whether (23) is satisfied: where R 1, max is the resulting maximum value of R 1,L , · · · , R 1,L max . If the conditions of (23) are met, then modeling iteration is completed, the optimal number value of hidden nodes is confirmed as L, and the corresponding prediction model of P-MELM is built. Otherwise, if L > L min , continue to decrease L until the condition L � L min is satisfied.
In the P-MELM algorithm flow, first, the neural network has a larger number of hidden nodes L max , and then, the number of nodes is gradually reduced until R 1,L changes significantly. If the number of hidden nodes continues to decrease at this time, the generalization ability and fitting effect of P-P-MELM ELM will decrease along with the progress, and a large number of effective nodes will also be lost in the network. erefore, when R 1,L changes significantly, P-P-MELM ELM has the optimal number of hidden nodes.

Proposed P-MELM. P-MELM shows an obvious superiority in faster computations and numerical stability compared with the conventional MELM by introducing Cholesky factorization and Givens rotation transformation.
is technique uses the basic arithmetic operation on matrix elements to construct weights matrix and avoids the complicated matrix inverse. In addition, when the hidden node is reduced one by one, the dynamically changing network structure needs to retrain the model by the existing training data, and this will lead to significantly longer model training times. Faced with this situation, a better alternative is to fully utilize information gained by the process of P-MELM training and implement the decremental updating of network parameters recursively.
First, train P-MELM with the dataset {X, T} and then, use trained P-MELM to provide the prediction of the verifying data Z. e chosen neural network architecture consisted of M hidden layers, one input layer, and output layer; meanwhile, each of the hidden layers contains L hidden nodes.
en, the output of the standard MELM classifier with respect to the input x can be modeled as If the number of training samples is N >> L, we set the initial value of hidden nodes as L max , the minimum number of hidden nodes as L min , the initial number of hidden layers as 1, and the maximum number of hidden layers as M max . ξ L is expressed as the best accuracy of neural network learning.  1,1 , . . . , w 1,L , b 11 , . . . , b 1L , x 1 , . . . , x N ).
(3) Calculate the matrix A L and B L according to (13) and (18). (4) Cholesky factorization of A L according to (14) to get U L . (5) According to (21), calculate F L by using the matrices U L and B L . (6) According to (20), calculate β 1,L by using the matrices U L and F L . en, build a P-MELM model according to β 1,L . e network structure of the model has M hidden layers, and each hidden layer has L hidden layer nodes.   (24). erefore, the predicted output of test instance Z can be represented as t M,L (Z).
If we set M max � 1 in the P-MELM approach, P-MELM becomes the pruning extreme learning machine (P-ELM). e flow chart of the P-MELM algorithm is shown in Figure 1. It can be seen from Figure 1 that when the number of hidden layer nodes in the P-MELM algorithm gradually decreases, P-MELM does not need to retrain the network and recalculate the weights, but updates the weights on the basis of the original network. MELM needs to recalculate the output weight every time once the number of hidden nodes is reduced. erefore, when the number of hidden layer nodes decreases gradually, P-MELM is simpler and takes less time than MELM.

Algorithm Verification
In order to verify the stability of the P-MELM algorithm and P-ELM algorithm proposed in this paper, we have carried out experiments in MATLAB 2016a environment and carried out 100 experiments on each algorithm. Our experiment is divided into two parts. e first part is a simple benchmark classification [26], and the second part is to classify coal by using the spectral information of coal [27]. In addition, in order to comprehensively evaluate the performance of the P-ELM algorithm and P-MELM algorithm, the initial value of the number of nodes in a single hidden layer is set as 100 in the experiment, and the step size is set as 1 to delete redundant nodes. e number of layers of the multilayer extreme learning machine is set from 2 to 10.

Data Acquisition and Processing.
We download two commonly used datasets of Diabetic Retinopathy Debrecen and image segmentation from the UCI Machine Learning Library to verify the reliability of the algorithm proposed in simple benchmark classification. In addition, when faced with extremely time-consuming and complex classification problems, researchers now tend to use machine learning methods to solve complex problems. erefore, this paper also uses the spectral information of coal to classify coal to verify the reliability of P-ELM and P-MELM. e collected coal mine information is shown in Table 1.
We use the SVC-HR-1024 spectrometer produced by American Spectra Vista Company as the experimental instrument in the spectrum experiment. e spectral bandwidth of this instrument is 1024, and the spectral range is 350-2500 nm. In the experiment, firstly, the sample is cleaned. e spectrometer probe is 380 mm away from the surface of the ore sample, and as far as possible, it is set perpendicular to the sample surface. Secondly, the experimenters should reduce the walking process and wear dark clothes during the experiment to reduce the interference of environmental factors on the spectrum experiment. en, the spectrometer was used to conduct 6 spectroscopic experiments on each sample of coal. Finally, summarize all the experimental data, and calculate the average of the six spectral experimental data of the same coal mine as the final spectral data. Figure 2 shows the process of data collection and processing, during which whiteboard measurement and calibration are carried out every 10 minutes. Because the original spectral data included 1024 bands, some of which are overlapped bands, it is necessary to remove the overlapped bands from the data, and finally, 973 useful bands are obtained.

Evaluation of Testing Accuracy.
We use P-ELM and P-MELM to model the classification problem. Table 2 shows the parameters of P-ELM and P-MELM. Figures 3 and 4, respectively, show the accuracy of data classification by P-ELM and P-MELM under different hidden nodes. From Figures 3  and 4, we can see that P-ELM and P-MELM can obtain good learning accuracy and have a relatively accurate classification effect. If the redundant hidden nodes in the neural network are pruned, the testing accuracy of P-ELM and P-MELM enhance or remain unchanged, and this phenomenon is especially evident in Diabetic Retinopathy Debrecen, image segmentation, and coal spectral datasets. More importantly, the classification accuracy surfaces of P-MELM keep relatively smooth, but the curves of P-ELM have some oscillations in Diabetic Retinopathy Debrecen and image segmentation cases when the redundant hidden nodes are pruned. By comparing the algorithm process of P-ELM and P-MELM, it can be seen that P-MELM changes from a single hidden layer to multiple hidden layers on the basis of P-ELM, and the calculation method of the parameters in the P-MELM algorithm is different. So, P-MELM can eliminate the fluctuations caused by P-ELM to some extent. As a consequence, the P-MELM algorithm is more accurate in classification than the P-ELM algorithm, and the fitting effect is better.

Evaluation of Average Training Time.
In Section 4.3, we compare the training time of MELM and P-MELM under the same parameters. e time required for MELM and P-MELM algorithms is shown in Figure 5. From Figure 5, we can see that whether it is coal mine classification, diabetic retinopathy classification, or image segmentation classification, and the training time required for P-MELM is much less than the training time required for MELM. is is because when the number of hidden nodes changes, MELM needs to retrain the network, and P-MELM does not need to recalculate the network parameters, but only updates the weights in the model. erefore, under the same network structure, P-MELM has a faster convergence speed than MELM. Furthermore, the training time curves for several cases shown in Figure 4 also indicate that the spent learning time is still monotonically decreasing with the learning steps, which is consistent with MELM's conclusions. Compared with P-MELM, MELM needs more time to retrain the network and recalculate the output weights. However, with the increase of hidden layer nodes, the time difference of training time will gradually decrease.

Comparison of Testing Accuracy by Different Methods.
e classification accuracy of the testing data is selected as the performance evaluation criterion for the algorithms investigated, namely, ELM, MELM, P-ELM, P-MELM, ELM-AE, and SELM.
e best learning accuracy that six algorithms can achieve can be observed from Table 3. On the basis of these     Mathematical Problems in Engineering experimental results, we can draw the following conclusions. If the initial network structure is fixed, P-ELM and P-MELM always get higher testing accuracy than ELM and MELM, respectively. is means that the P-ELM and P-MELM techniques can get more stable networks in most cases. What is more, a higher classification accuracy indicates a better classification algorithm performance. As observed from Table 3, it is obvious that the proposed P-MELM technique achieves the highest classification accuracy among all classification datasets, relative to the original ELM, MELM, and P-ELM algorithms. Finally, these results further prove that the P-ELM and P-MELM methods have a better classification effect than the ELM method and the training speed can meet the requirements.

Conclusions and Discussion
Based on MELM, this paper introduces the L2 penalty factor between the hidden layer and the output layer, uses the regularization parameter C to measure the structural risk and empirical risk of the MELM model, and proposes the P-MELM algorithm. P-MELM uses the Cholesky factorization method to effectively reduce the amount of calculation for a single solution of the output weights' matrix, and when pruning hidden nodes, P-MELM can also update the original connection weights without the need for network recalculation, so P-MELM greatly reduces the amount of calculation and improves the calculation efficiency. P-MELM introduces the L2 regularization term into MELM to reduce the sensitivity of nodes. Starting from the initial large number of hidden nodes, by minimizing its prediction error and regularization control item, redundant nodes are rationally pruned to obtain a more concise network. Finally, data simulation indicates that P-MELM features are fast, precise, and reliable.
Data Availability e datasets are obtained from references [26,27]. All data is available. e datasets are obtained from the University of California, Irvine, Machine Learning Repository, Coal Classification Method Based on Visible-Infrared Spectroscopy, and An Improved Multilayer Extreme Learning Machine. All data is available.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments
is work was supported in part by the National Key Research and Development Program of China under Grant