A Multiple Hidden Layers Extreme Learning Machine Method and Its Application



Introduction
At present, artificial neural networks are widely applied in many research fields, such as pattern recognition, signal processing, and short-term prediction. Among them, the single-hidden-layer feedforward neural network (SLFN) is the most widely used type [1,2]. Because the parameters of traditional feedforward neural networks are usually determined by gradient-based error backpropagation algorithms, the training and testing process is time-consuming and easily falls into a local optimum. Many algorithms have been proposed to improve the speed and precision of SLFNs, such as the backpropagation (BP) algorithm and its improved variants [3,4]. Owing to the limitations of BP algorithms, however, the generalization ability of the resulting networks is unsatisfactory and overfitting occurs easily. In 1989, Lowe proposed the RBF neural network [5] and indicated that the parameters of SLFNs can also be randomly selected. In 1992, Pao Y. H. et al. proposed the random vector functional link network (RVFL) [6,7] and showed that only the output weights need to be calculated during the training process.
In 2004, Huang G. B. proposed the extreme learning machine (ELM), which reduces the training time of the network and improves its generalization performance [8][9][10]. Traditional neural network learning algorithms (such as BP) randomly set all the training parameters, update them with an iterative algorithm, and easily converge to a local optimal solution. The ELM, in contrast, only needs to randomly set the weights and biases of the hidden neurons; the output weights are then determined by the Moore-Penrose pseudoinverse under the least-squares criterion. In recent years, various ELM variants have been proposed to achieve better performance, such as the deep ELM with kernel based on the multilayer extreme learning machine (DELM) [11]; the two-hidden-layer extreme learning machine (TELM) [12]; a four-layered feedforward neural network [13]; the online sequential extreme learning machine [14,15]; the multiple kernel extreme learning machine (MK-ELM) [16]; and the two-stage extreme learning machine [17], as well as methods that use noise detection to improve classifier accuracy [18,19].
First, consider the DELM with kernel based on the ELM-AE algorithm presented in [11], which uses the ELM autoencoder (ELM-AE) [20][21][22] as the learning algorithm in each layer. The DELM has a multilayer network structure divided into two parts: the first part uses the ELM-AE to deep-learn the original data and obtain the most representative new data; the second part calculates the network parameters by using the kernel ELM algorithm with a three-layer structure (the output of the first part, a hidden layer, and an output layer). The MELM, by contrast, needs no such data processing (such as extracting representative data from the original data); instead, it brings the actual output of each hidden layer closer to its expected output by calculating step by step through the multiple hidden layers.
Next, a two-hidden-layer feedforward network (TLFN) was proposed by Huang in 2003 [23].This article demonstrates that the TLFNs could learn arbitrary  training samples with a very small training error by employing 2√( + 3) hidden neurons [24].But the changing process of the TLFNs structure is very complicated.First the TLFN has a three-layer network with  output neurons, then adds 2 neurons (two parts) to the hidden layer aiming at making the original output layer transform into the second hidden layer with  hidden neurons, and finally adds a output layer to the structure.Eventually the final network structure has one input layer, two hidden layers, and one output layer.But the MELM has a relatively simple stable network structure and the simple calculation process, and the MELM is a timesaving algorithm compared with TLFNs.
In this paper, we propose a multiple hidden layers extreme learning machine algorithm (MELM). The MELM adds hidden layers to the original ELM network structure, randomly initializes the weights between the input layer and the first hidden layer as well as the bias of the first hidden layer, calculates the parameters of the remaining hidden layers by making the actual output of each hidden layer approach its expected output, and finally uses the least-squares method to calculate the output weights of the network. In the following sections, we report experiments based on these ideas. The MELM results on regression problems and some popular classification problems show advantages in average accuracy over other ELM variants. Our experiments also study the effect of the number of hidden layer neurons, the choice of activation function, and the number of hidden layers on the same problems.
The rest of this paper is organized as follows. Section 2 reviews the original ELM; Section 3 presents the method and framework of the two-hidden-layer ELM; Section 4 presents the proposed MELM technique, the multihidden-layer ELM; Section 5 reports and analyzes experimental results; Section 6 presents the conclusions.

Extreme Learning Machine
The extreme learning machine (ELM) proposed by Huang G. B. aims at avoiding the time-consuming iterative training process and improving the generalization performance [8][9][10][25]. As a single-hidden-layer feedforward neural network (SLFN), the ELM structure includes an input layer, a hidden layer, and an output layer. Unlike traditional neural network learning algorithms (such as the BP algorithm), which randomly set all the network training parameters and easily generate a local optimal solution, the ELM only sets the number of hidden neurons of the network, randomizes the weights between the input layer and the hidden layer as well as the bias of the hidden neurons, calculates the hidden layer output matrix, and finally obtains the weights between the hidden layer and the output layer by using the Moore-Penrose pseudoinverse under the least-squares criterion. Because the ELM has a simple network structure and a concise parameter computation process, it has the advantage of fast learning speed. The original structure of the ELM is shown in Figure 1.
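Figure 1 shows the extreme learning machine network structure, which includes $n$ input layer neurons, $l$ hidden layer neurons, and $m$ output layer neurons. First, consider the training samples $\{X, Y\} = \{x_i, y_i\}$ $(i = 1, 2, \ldots, Q)$, with an input feature matrix $X = [x_1\ x_2\ \cdots\ x_Q]$ and a desired output matrix $Y = [y_1\ y_2\ \cdots\ y_Q]$ comprised of the training samples, where the matrix $X$ and the matrix $Y$ can be expressed as follows:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1Q} \\ x_{21} & x_{22} & \cdots & x_{2Q} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nQ} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1Q} \\ y_{21} & y_{22} & \cdots & y_{2Q} \\ \vdots & \vdots & & \vdots \\ y_{m1} & y_{m2} & \cdots & y_{mQ} \end{bmatrix}, \tag{1}$$

where the parameters $n$ and $m$ are the dimensions of the input matrix and the output matrix.

Then the ELM randomly sets the weights between the input layer and the hidden layer:

$$w = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & & \vdots \\ w_{l1} & w_{l2} & \cdots & w_{ln} \end{bmatrix}, \tag{2}$$

where $w_{ij}$ represents the weight between the $j$th input layer neuron and the $i$th hidden layer neuron. Third, the ELM assumes the weights between the hidden layer and the output layer, which can be expressed as follows:

$$\beta = \begin{bmatrix} \beta_{11} & \beta_{12} & \cdots & \beta_{1m} \\ \beta_{21} & \beta_{22} & \cdots & \beta_{2m} \\ \vdots & \vdots & & \vdots \\ \beta_{l1} & \beta_{l2} & \cdots & \beta_{lm} \end{bmatrix}, \tag{3}$$

where $\beta_{jk}$ represents the weight between the $j$th hidden layer neuron and the $k$th output layer neuron. Fourth, the ELM randomly sets the bias of the hidden layer neurons:

$$b = [b_1\ b_2\ \cdots\ b_l]^{T}. \tag{4}$$

Fifth, the ELM chooses the network activation function $g(x)$.

According to Figure 1, the output matrix $T$ can be expressed as follows:

$$T = [t_1\ t_2\ \cdots\ t_Q]_{m \times Q}. \tag{5}$$

Each column vector of the output matrix $T$ is as follows:

$$t_j = \begin{bmatrix} t_{1j} \\ t_{2j} \\ \vdots \\ t_{mj} \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{l} \beta_{i1}\, g(w_i x_j + b_i) \\ \sum_{i=1}^{l} \beta_{i2}\, g(w_i x_j + b_i) \\ \vdots \\ \sum_{i=1}^{l} \beta_{im}\, g(w_i x_j + b_i) \end{bmatrix}, \quad j = 1, 2, \ldots, Q, \tag{6}$$

where $w_i$ denotes the $i$th row of $w$. Sixth, considering formulae (5) and (6), we get

$$H\beta = T', \tag{7}$$

where $T'$ is the transpose of $T$ and $H$ is the output of the hidden layer. In order to obtain the unique solution with minimum error, we use the least-squares method to calculate the weight matrix $\beta$ [8,9]. To improve the generalization ability of the network and make the results more stable, we add a regularization term to $\beta$ [26]. When the number of hidden layer neurons is less than the number of training samples, $\beta$ can be expressed as

$$\beta = \left(\frac{I}{\lambda} + H^{T}H\right)^{-1} H^{T} T'. \tag{8}$$

When the number of hidden layer neurons is more than the number of training samples, $\beta$ can be expressed as

$$\beta = H^{T}\left(\frac{I}{\lambda} + HH^{T}\right)^{-1} T'. \tag{9}$$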
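The ELM training procedure described above can be sketched in a few lines. The following is a minimal illustration in Python with NumPy (the experiments in this paper were run in MATLAB), not the authors' implementation. It assumes samples are stored as rows, so `H` has shape `(Q, l)`; the uniform initialization range `[-1, 1]`, the value `lam=1e6` for the regularization coefficient, and the function names `output_weights`, `elm_fit`, and `elm_predict` are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def output_weights(H, T, lam):
    """Regularized least-squares output weights, following formulae (8) and (9)."""
    Q, l = H.shape
    if l <= Q:
        # Fewer hidden neurons than samples: solve an l x l system.
        return np.linalg.solve(np.eye(l) / lam + H.T @ H, H.T @ T)
    # More hidden neurons than samples: solve a Q x Q system instead.
    return H.T @ np.linalg.solve(np.eye(Q) / lam + H @ H.T, T)


def elm_fit(X, T, n_hidden, lam=1e6, seed=None):
    """X: (Q, n) inputs, T: (Q, m) targets. Returns (W, b, beta)."""
    rng = np.random.default_rng(seed)
    # Random input weights and hidden biases (assumed uniform on [-1, 1]).
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W + b)  # hidden layer output matrix
    return W, b, output_weights(H, T, lam)


def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta


# Toy regression problem (synthetic data, for illustration only).
X = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
T = np.sin(X)
W, b, beta = elm_fit(X, T, n_hidden=30, seed=0)
train_rmse = np.sqrt(np.mean((elm_predict(X, W, b, beta) - T) ** 2))
print(f"training RMSE: {train_rmse:.4f}")
```

Both branches of `output_weights` give the same result in exact arithmetic; the choice only determines whether an $l \times l$ or a $Q \times Q$ linear system is solved.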

Two-Hidden-Layer ELM
In 2016, B. Y. Qu and B. F. Lang proposed the two-hidden-layer extreme learning machine (TELM) [12]. The TELM tries to make the actual hidden layer outputs approach the expected hidden layer outputs by adding a parameter setting step for the second hidden layer. In this way, the TELM finds a better way to map the relationship between input and output signals. The network structure of the TELM includes one input layer, two hidden layers, and one output layer, and each hidden layer has $l$ hidden neurons. The activation function of the network is denoted by $g(x)$.
The focus of the TELM algorithm is the process of calculating and updating the weights between the first hidden layer and the second hidden layer as well as the bias of the hidden layer and the output weights between the second hidden layer and the output layer.The workflow of the TELM architecture is depicted in Figure 2.
The TELM first treats the two hidden layers as one hidden layer, so the output of the hidden layer can be expressed as $H = g(WX + B)$, with the weights $W$ and the bias $B$ of the first hidden layer randomly initialized. Next, the output weight matrix $\beta$ between the second hidden layer and the output layer can be obtained by using

$$\beta = H^{+} T. \tag{10}$$

Now the TELM separates the two hidden layers merged previously, so the network has two hidden layers. According to the workflow of Figure 2, the output of the second hidden layer can be obtained as follows:

$$g(W_1 H + B_1) = H_1, \tag{11}$$

where $W_1$ is the weight matrix between the first hidden layer and the second hidden layer, $H$ is the output matrix of the first hidden layer, $B_1$ is the bias of the second hidden layer, and $H_1$ is the expected output of the second hidden layer. The expected output of the second hidden layer can be obtained by calculating

$$H_1 = T\beta^{+}, \tag{12}$$

where $\beta^{+}$ is the generalized inverse of the matrix $\beta$. Now the TELM defines the matrix $W_{HE} = [B_1\ W_1]$, so the parameters of the second hidden layer can be easily obtained by using formula (12) and the inverse function of the activation function:

$$W_{HE} = g^{-1}(H_1)\, H_E^{+},$$

where $H_E^{+}$ is the generalized inverse of $H_E = [\mathbf{1}\ H]^{T}$, and $\mathbf{1}$ denotes a one-column vector of size $Q$ whose elements are the scalar unit 1. The notation $g^{-1}(x)$ indicates the inverse of the activation function $g(x)$.

After selecting an appropriate activation function $g(x)$ and calculating $W_{HE}$, the TELM updates the actual output of the second hidden layer as follows:

$$H_2 = g(W_{HE} H_E).$$

The weight matrix $\beta$ between the second hidden layer and the output layer is then updated as follows:

$$\beta_{new} = H_2^{+} T,$$

where $H_2^{+}$ is the generalized inverse of $H_2$, so the actual output of the TELM network can be expressed as

$$f(x) = H_2 \beta_{new}.$$

To sum up, the TELM algorithm process can be expressed as follows.
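(1) Assume the training sample dataset is $\{X, T\} = \{x_i, t_i\}$ $(i = 1, 2, \ldots, Q)$, where the matrix $X$ is the input samples and the matrix $T$ is the labeled samples. Each hidden layer has $l$ hidden neurons with the activation function $g(x)$.
(2) Randomly generate the weights $W$ between the input layer and the first hidden layer and the bias $B$ of the first hidden neurons.
(3) Calculate the output of the first hidden layer $H = g(WX + B)$.
(4) Obtain the weights between the second hidden layer and the output layer $\beta = H^{+}T$.
(5) Calculate the expected output of the second hidden layer $H_1 = T\beta^{+}$.
(6) According to formulae (12)-(14) and the algorithm steps (4, 5), calculate the weights $W_1$ between the first hidden layer and the second hidden layer and the bias $B_1$ of the second hidden neurons: $W_{HE} = g^{-1}(H_1)H_E^{+}$.
(7) Obtain and update the actual output of the second hidden layer $H_2 = g(W_{HE}H_E)$.
(8) Update the weight matrix between the second hidden layer and the output layer $\beta_{new} = H_2^{+}T$.
(9) Calculate the output of the network $f(x) = H_2\beta_{new}$.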
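The TELM steps can be sketched as follows. This is a minimal Python/NumPy illustration under stated assumptions, not the reference implementation: samples are stored as rows (so the products appear transposed relative to the formulae), the activation is taken to be $g(x) = (1 - e^{-x})/(1 + e^{-x})$, which equals $\tanh(x/2)$ and has a closed-form inverse, and the expected hidden output is rescaled into $[-0.9, 0.9]$ whenever any entry lies outside $(-1, 1)$ so that the inverse activation stays finite. The helper names (`g`, `g_inv`, `rescale`, `with_ones`, `telm_fit`, `telm_predict`) are hypothetical.

```python
import numpy as np


def g(z):
    """(1 - exp(-z)) / (1 + exp(-z)), written as the equivalent tanh(z / 2)."""
    return np.tanh(z / 2.0)


def g_inv(h):
    """Inverse of g on (-1, 1)."""
    return 2.0 * np.arctanh(h)


def rescale(H, limit=0.9):
    """Map H linearly into [-limit, limit] if it leaves (-1, 1) (assumed rule)."""
    lo, hi = H.min(), H.max()
    if -1.0 < lo and hi < 1.0:
        return H
    if hi == lo:
        return np.clip(H, -limit, limit)
    return (H - lo) / (hi - lo) * 2.0 * limit - limit


def with_ones(H):
    """Prepend a column of ones (row-sample counterpart of H_E = [1 H]^T)."""
    return np.hstack([np.ones((H.shape[0], 1)), H])


def telm_fit(X, T, n_hidden, seed=None):
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = g(X @ W + b)                          # first hidden layer output
    beta = np.linalg.pinv(H) @ T              # output weights, beta = H^+ T
    H1 = rescale(T @ np.linalg.pinv(beta))    # expected second hidden output
    HE = with_ones(H)
    W_HE = np.linalg.pinv(HE) @ g_inv(H1)     # first row: bias B1, rest: W1
    H2 = g(HE @ W_HE)                         # actual second hidden output
    beta_new = np.linalg.pinv(H2) @ T         # updated output weights
    return W, b, W_HE, beta_new


def telm_predict(X, W, b, W_HE, beta_new):
    H = g(X @ W + b)
    return g(with_ones(H) @ W_HE) @ beta_new


# Toy regression problem (synthetic data, for illustration only).
X = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
T = np.sin(X)
params = telm_fit(X, T, n_hidden=20, seed=0)
pred = telm_predict(X, *params)
print("training RMSE:", np.sqrt(np.mean((pred - T) ** 2)))
```

Stacking the bias and the weights into one matrix lets a single pseudoinverse recover both the second-layer weights and the second-layer bias.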

Multihidden-Layer ELM
Scholars have proposed many improvements to the ELM algorithm and structure and have obtained notable gains. These advantages motivate us to explore better ideas behind the ELM. Based on the above work, we adjust the structure of the ELM neural network and propose an algorithm named the multiple hidden layers extreme learning machine (MELM). The structure of the MELM (taking the three-hidden-layer ELM as an example) is illustrated in Figure 3, and the workflow of the three-hidden-layer ELM is illustrated in Figure 4.
Here we take the three-hidden-layer ELM as an example and analyze the MELM algorithm below. First, give the training samples $\{X, T\} = \{x_i, t_i\}$ $(i = 1, 2, 3, \ldots, Q)$ and the three-hidden-layer network structure (each of the three hidden layers has $l$ hidden neurons) with the activation function $g(x)$. The structure of the three-hidden-layer ELM has an input layer, three hidden layers, and an output layer. Following the TELM algorithm, we first treat the three hidden layers as two hidden layers (the first hidden layer remains the first hidden layer, while the second and third hidden layers are merged into one hidden layer), so that the structure of the network is the same as the TELM network described above. We can therefore obtain the weight matrix $\beta_{new}$ between the second hidden layer and the output layer. According to the number of actual samples, we can use formula (8) or formula (9) to calculate the weights $\beta$, which improves the generalization ability of the network.
Then the MELM separates the three hidden layers merged previously, so the MELM structure has three hidden layers. The expected output of the third hidden layer can be expressed as follows:

$$H_3 = T\beta_{new}^{+}, \tag{18}$$

where $\beta_{new}^{+}$ is the generalized inverse of the weight matrix $\beta_{new}$. Third, the MELM defines the matrix $W_{HE1} = [B_2\ W_2]$, so the parameters of the third hidden layer can be easily obtained by calculating formula (18) and the formula

$$W_{HE1} = g^{-1}(H_3)\, H_{E1}^{+}, \tag{19}$$

where $H_2$ is the actual output of the second hidden layer, $W_2$ is the weights between the second hidden layer and the third hidden layer, $B_2$ is the bias of the third hidden neurons, $H_{E1}^{+}$ is the generalized inverse of $H_{E1} = [\mathbf{1}\ H_2]^{T}$, and $\mathbf{1}$ denotes a one-column vector of size $Q$ whose elements are the scalar unit 1. The notation $g^{-1}(x)$ indicates the inverse of the activation function $g(x)$.

In order to test the performance of the proposed MELM algorithm, we adopt different activation functions for the regression and classification experiments. Generally, we adopt the logistic sigmoid function $g(x) = 1/(1 + e^{-x})$. The actual output of the third hidden layer is calculated as follows:

$$H_4 = g(W_{HE1} H_{E1}). \tag{20}$$

Finally, the output weight matrix $\beta_{new}$ between the third hidden layer and the output layer is calculated as follows. When the number of hidden layer neurons is less than the number of training samples, $\beta_{new}$ can be expressed as

$$\beta_{new} = \left(\frac{I}{\lambda} + H_4^{T}H_4\right)^{-1} H_4^{T} T. \tag{21}$$

When the number of hidden layer neurons is more than the number of training samples, $\beta_{new}$ can be expressed as

$$\beta_{new} = H_4^{T}\left(\frac{I}{\lambda} + H_4H_4^{T}\right)^{-1} T. \tag{22}$$

The actual output of the three-hidden-layer ELM network can be expressed as follows:

$$f(x) = H_4 \beta_{new}. \tag{23}$$

To ensure that the actual output of the final hidden layer approaches the expected hidden output during training, the procedure optimizes the network structure parameters layer by layer, starting from the second hidden layer.
The above is the parameter calculation process of the three-hidden-layer ELM network, but the purpose of this paper is to calculate the parameters of the multiple hidden layers ELM network and the final output of the MELM network structure. The calculation process of the MELM can be described as a loop. For a four-hidden-layer ELM network, we recalculate formula (18) to formula (22), obtain and record the parameters of each hidden layer, and finally calculate the final output of the MELM network. If the number of hidden layers increases further, the calculation process is repeated in the same way. The calculation process of the MELM network can be described as follows.
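(1) Assume the training sample dataset is $\{X, T\} = \{x_i, t_i\}$ $(i = 1, 2, 3, \ldots, Q)$, where the matrix $X$ is the input samples and the matrix $T$ is the labeled samples. Each hidden layer has $l$ hidden neurons with the activation function $g(x)$.
(2) Randomly initialize the weights $W$ between the input layer and the first hidden layer as well as the bias $B$ of the first hidden neurons.
(3) Calculate the equation $H = g(WX + B)$.
(4) Calculate the weights between the hidden layers and the output layer $\beta = (I/\lambda + H^{T}H)^{-1}H^{T}T$ or $\beta = H^{T}(I/\lambda + HH^{T})^{-1}T$.
(5) Calculate the expected output of the second hidden layer $H_1 = T\beta^{+}$.
(6) According to formulae (12)-(14) and the algorithm steps (4, 5), calculate the weights $W_1$ between the first hidden layer and the second hidden layer and the bias $B_1$ of the second hidden neurons: $W_{HE} = g^{-1}(H_1)H_E^{+}$.
(7) Obtain and update the actual output of the second hidden layer $H_2 = g(W_{HE}H_E)$.
(8) Update the weight matrix $\beta$ between the hidden layer and the output layer: $\beta_{new} = (I/\lambda + H_2^{T}H_2)^{-1}H_2^{T}T$ or $\beta_{new} = H_2^{T}(I/\lambda + H_2H_2^{T})^{-1}T$.
(9) If the number of hidden layers is three, calculate the parameters by repeating the above operations from step (5) to step (9). Now $\beta_{new}$ is expressed as $\beta_{new} = (I/\lambda + H_4^{T}H_4)^{-1}H_4^{T}T$ or $\beta_{new} = H_4^{T}(I/\lambda + H_4H_4^{T})^{-1}T$, where $H_4$ is the actual output of the third hidden layer.
(10) Calculate the output $f(x) = H_2\beta_{new}$, with $H_2$ replaced by the actual output of the last hidden layer when the network has more than two hidden layers.

If the number $K$ of hidden layers is more than three, steps (5) to (9) are repeated $(K - 1)$ times. All the $H$ matrices ($H_1$, $H_2$) must be normalized to the range from −0.9 to 0.9 when the maximum of the matrix is more than 1 and the minimum of the matrix is less than −1.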
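The layer-by-layer loop above can be sketched as follows. This is a minimal Python/NumPy illustration under stated assumptions, not the authors' code: samples are stored as rows, the activation is $g(x) = (1 - e^{-x})/(1 + e^{-x}) = \tanh(x/2)$, the regularization coefficient `lam=1e6` is an arbitrary choice, and the normalization rule is simplified to "rescale into $[-0.9, 0.9]$ whenever any entry leaves $(-1, 1)$". The names `normalize`, `output_weights`, `melm_fit`, and `melm_predict` are hypothetical.

```python
import numpy as np


def g(z):
    """(1 - exp(-z)) / (1 + exp(-z)), written as the equivalent tanh(z / 2)."""
    return np.tanh(z / 2.0)


def g_inv(h):
    return 2.0 * np.arctanh(h)


def normalize(H, limit=0.9):
    """Rescale into [-limit, limit] if any entry leaves (-1, 1) (assumed rule)."""
    lo, hi = H.min(), H.max()
    if -1.0 < lo and hi < 1.0:
        return H
    if hi == lo:
        return np.clip(H, -limit, limit)
    return (H - lo) / (hi - lo) * 2.0 * limit - limit


def output_weights(H, T, lam):
    """Regularized output weights; the branch depends on neurons vs. samples."""
    Q, l = H.shape
    if l <= Q:
        return np.linalg.solve(np.eye(l) / lam + H.T @ H, H.T @ T)
    return H.T @ np.linalg.solve(np.eye(Q) / lam + H @ H.T, T)


def with_ones(H):
    return np.hstack([np.ones((H.shape[0], 1)), H])


def melm_fit(X, T, n_hidden, n_layers=3, lam=1e6, seed=None):
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = g(X @ W + b)                       # first hidden layer (random)
    beta = output_weights(H, T, lam)
    hidden = []                            # one stacked [bias; weights] per layer
    for _ in range(n_layers - 1):
        H_expected = normalize(T @ np.linalg.pinv(beta))
        HE = with_ones(H)
        W_HE = np.linalg.pinv(HE) @ g_inv(H_expected)
        H = g(HE @ W_HE)                   # actual output of the new layer
        beta = output_weights(H, T, lam)   # updated output weights
        hidden.append(W_HE)
    return W, b, hidden, beta


def melm_predict(X, W, b, hidden, beta):
    H = g(X @ W + b)
    for W_HE in hidden:
        H = g(with_ones(H) @ W_HE)
    return H @ beta


# Toy regression problem (synthetic data, for illustration only).
X = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
T = np.sin(X)
W, b, hidden, beta = melm_fit(X, T, n_hidden=20, n_layers=3, seed=0)
pred = melm_predict(X, W, b, hidden, beta)
print("training RMSE:", np.sqrt(np.mean((pred - T) ** 2)))
```

Setting `n_layers=2` reduces the loop to a single pass, which corresponds to the two-hidden-layer case with regularized output weights.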

Application
In order to verify the actual effect of the algorithm proposed in this paper, we conducted experiments divided into three parts: regression problems, classification problems, and an application in the mineral selection industry. All the experiments are conducted in the MATLAB 2010b computational environment running on a computer with a 2.302 GHz i3 CPU.

Regression Problems.
To test the performance on regression problems, several widely used functions are taken from [27]. We use these functions to generate datasets in which sufficient training samples are randomly selected and the remaining samples are used for testing, and the activation function is selected as the hyperbolic tangent function $g(x) = (1 - e^{-x})/(1 + e^{-x})$. The dimension of each function is set as a positive integer. The functions $f_2(x)$ and $f_3(x)$ are complex multimodal functions. Each function experiment has 900 training samples and 100 testing samples. The evaluation criterion of the experiments is the average accuracy, namely, the root mean square error (RMSE) in the regression problems. We first conduct experiments to verify the average accuracy of the network structure with different numbers of hidden neurons and hidden layers. Then we can observe the optimal neural network structure, with a specific number of hidden layers and hidden neurons, compared with the results of the TELM on regression problems.
In Table 1, we give the average RMSE of the above three functions for the two-hidden-layer structure, the MLELM, and the three-hidden-layer structure in regression problems. We also conducted a series of experiments with four-hidden-layer, five-hidden-layer, and eight-hidden-layer structures, but the three-hidden-layer structure achieved the best results for the three functions $f_1(x)$, $f_2(x)$, and $f_3(x)$, so we only give the RMSE of the three-hidden-layer structure. The table includes the testing RMSE and the training RMSE of the three algorithm structures. We can see that the MELM (three-hidden-layer) has the better results.
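The evaluation protocol described here (a random 900/100 split and the RMSE criterion) can be sketched as follows in Python with NumPy. The target used below is a synthetic stand-in chosen only for illustration, not one of the benchmark functions of the paper, and the helper names `rmse` and `random_split` are hypothetical.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def random_split(X, y, n_train, seed=None):
    """Randomly pick n_train samples for training; the rest are for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    train, test = order[:n_train], order[n_train:]
    return X[train], y[train], X[test], y[test]


# 1000 synthetic samples split into 900 training and 100 testing samples.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(1000, 2))
y = np.sum(X ** 2, axis=1)  # stand-in target, not a function from the paper
X_train, y_train, X_test, y_test = random_split(X, y, n_train=900, seed=0)

# Baseline: predict the training mean for every test sample.
baseline = np.full_like(y_test, y_train.mean())
print("baseline testing RMSE:", rmse(y_test, baseline))
```

Averaging the RMSE over several random splits and random initializations gives the kind of average accuracy figure reported in the tables.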

Classification Problems.
To test the performance of the proposed MELM algorithm on classification datasets, we use several datasets (such as Mice Protein, svmguide4, Vowel, AutoUnix, and Iris) collected from the University of California [28] and the LIBSVM website [29]. The training data and testing data of each experiment are randomly selected from the original datasets. Information on these datasets is given in Table 2.
The classification performance criterion is the average classification accuracy on the testing data. Figure 5 shows the average testing classification accuracy of the ELM, TELM, MLELM, and MELM algorithms on (a) Mice Protein, (b) svmguide4, (c) Vowel, (d) AutoUnix, and (e) Iris. The different datasets (a, b, c, d, and e) have an important effect on the classification accuracy of the four algorithms (ELM, TELM, MLELM, and MELM). From the trend of the average testing accuracy of the four algorithms on the same datasets, however, we can see that the MELM algorithm can select the optimal number of hidden layers for the network structure to adapt to the specific dataset and thereby obtain better results. Accordingly, the Mice Protein and Forest type mapping datasets use the three-hidden-layer ELM algorithm, and the svmguide4 dataset uses the five-hidden-layer ELM algorithm.
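For classification, an ELM-type network is commonly trained against one-hot target vectors, and the predicted class is the output neuron with the largest response. The sketch below illustrates that encoding and the accuracy criterion in Python with NumPy; it is an assumption about the setup rather than a description taken from the experiments, and the helper names `one_hot` and `accuracy` and the toy labels are hypothetical.

```python
import numpy as np


def one_hot(labels):
    """Encode class labels as one-hot rows; also return the class order."""
    labels = np.asarray(labels).ravel()
    classes, index = np.unique(labels, return_inverse=True)
    return np.eye(len(classes))[index.ravel()], classes


def accuracy(scores, labels, classes):
    """Fraction of samples whose largest output matches the true label."""
    predicted = classes[np.argmax(scores, axis=1)]
    return float(np.mean(predicted == np.asarray(labels).ravel()))


# Toy labels, for illustration only.
labels = np.array(["setosa", "versicolor", "setosa", "virginica"])
T, classes = one_hot(labels)

# Stand-in network outputs: the targets plus a constant offset.
scores = T + 0.1
print("accuracy:", accuracy(scores, labels, classes))
```

The matrix `T` produced by `one_hot` plays the role of the labeled sample matrix in the training steps, with one column per class.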

The Application of Ores Selection Industry.
In this part, we use actual datasets to analyze and test the model we have established and to evaluate the performance of the MELM algorithm. Our datasets come from the AnQian Mining Group, namely, the mining areas of Ya-ba-ling and Xi-da-bei. The samples we selected are hematite (50 samples), magnetite (90 samples), granite (57 samples), phyllite (15 samples), and chlorite (32 samples). For the precise classification of ores and the prediction of the total iron content of iron ores, chemical analysis is usually used. However, analyzing the total iron content is a long and complex process, and the color of the solution sometimes does not change significantly at the end. Because near infrared spectroscopy is extensively used for chemical composition analysis, we can infer the structure of an unknown substance from the position and shape of the peaks in its absorption spectrum. We can therefore use this approach to obtain correct information about the ores with little environmental influence on data acquisition. An HR1024 SVC spectrometer is used to measure the spectral data of each sample, and the total iron content of the hematite and magnetite is obtained from the chemical analysis center of Northeastern University of China. Our experimental datasets are the near infrared spectra of the ores, that is, the absorption rates of the different ores across the near infrared spectrum. The wavelength range of the spectrum is 300 nm-2500 nm, so the datasets we obtained have very high dimensionality (973 dimensions). We use principal component analysis to reduce the dimensions of the datasets.
(A) Here we use the algorithm proposed in this paper to classify the five kinds of ores and compare the results with those of the ELM and TELM algorithms. The activation function of the algorithm in classification problems is selected as the sigmoid function $g(x) = 1/(1 + e^{-x})$.
Figure 6 shows the average classification accuracy of the ELM, TELM, MLELM, and MELM (three-hidden-layer ELM) algorithms for different numbers of hidden neurons. After conducting a series of experiments with different numbers of hidden layers in the MELM structure, we select the three-hidden-layer ELM structure, which obtains the optimal classification results within the MELM algorithm. Figure 6(a) shows the average training accuracy, where the MELM achieves a higher result than the others for the same number of hidden neurons. Figure 6(b) shows the average testing accuracy, where the MELM achieves a higher result when the number of hidden neurons is between 20 and 60. In practice, the number of neurons in each hidden layer can be chosen accordingly when building the model.
(B) Here we use the method proposed in this paper to predict the total iron content of the hematite and magnetite and compare the results with those of the ELM and TELM algorithms. The predictions use the optimal number of hidden layers and the optimal number of hidden layer neurons. The activation function of the algorithm in regression problems is selected as the hyperbolic tangent function $g(x) = (1 - e^{-x})/(1 + e^{-x})$.
Figures 7 and 8 show, respectively, the total iron content of the hematite and magnetite predicted by the ELM, TELM, MLELM, and MELM algorithms (for the MELM, the eight-hidden-layer structure is the optimal structure in Figure 7, and the six-hidden-layer structure is the optimal structure in Figure 8). The optimal results of each algorithm are obtained by repeated experiments with different numbers of hidden layer neurons. In the prediction for the hematite, the MELM has the better results, with 100 hidden neurons in each hidden layer. In the prediction for the magnetite, the MELM has the better results, with 20 hidden neurons in each hidden layer. The MELM algorithm therefore performs strongly on the ore selection problems: by adjusting the number of hidden layers and the number of hidden layer neurons, it can adapt to the problem and improve the regression analysis ability of the system.
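The dimensionality reduction step can be sketched with a principal component analysis computed from the singular value decomposition. The example below is a minimal Python/NumPy illustration: the random matrix is a synthetic stand-in with the same shape as the spectral data (244 samples in total, 973 dimensions), and the number of retained components, `n_components=10`, is an assumption because the text does not state how many components were kept. The function name `pca_reduce` is hypothetical.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the leading principal components.

    Returns the projected data, the component directions, the feature means,
    and the fraction of total variance retained.
    """
    mean = X.mean(axis=0)
    Xc = X - mean                                   # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                  # principal directions
    retained = float(np.sum(S[:n_components] ** 2) / np.sum(S ** 2))
    return Xc @ components.T, components, mean, retained


# Synthetic stand-in for the spectra: 244 samples, 973 dimensions.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(244, 973))
Z, components, mean, retained = pca_reduce(spectra, n_components=10)
print(Z.shape, f"variance retained: {retained:.3f}")
```

New samples are reduced with the same `mean` and `components`, that is, `(x - mean) @ components.T`, so that training and testing data share one projection.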

Conclusion
The MELM proposed in this paper addresses two problems that exist in the training process of the original ELM. The first is the instability of a single network, which also affects the generalization performance. The second is that the output weights are calculated as the product of the generalized inverse of the hidden layer output and the actual system output, while the randomly selected parameters of the hidden layer neurons can lead to a singular or ill-conditioned matrix. At the same time, the MELM network structure improves the average accuracy of training and testing compared to the ELM and TELM network structures. The MELM algorithm inherits the characteristic of the traditional ELM of randomly initializing the weights and bias (between the input layer and the first hidden layer), adopts part of the TELM algorithm, and uses the inverse activation function to calculate the weights and bias of the hidden layers (except the first hidden layer).
We then make the actual hidden layer output approximate the expected hidden layer output and use the parameters obtained above to calculate the actual output. In the function regression problems, this algorithm reduces the mean square error. In the dataset classification problems, the average accuracy of the multiclass classification is significantly higher than that of the ELM and TELM network structures. In such cases, the MELM is able to improve the performance of the network structure.

Figure 1: The structure of the ELM.

Figure 2: The workflow of the TELM.

Figure 3: The structure of the three-hidden-layer ELM.

Figure 4: The workflow of the three-hidden-layer ELM.

Figure 7: The total iron content of the hematite.

Figure 8: The total iron content of the magnetite.

Table 1: The RMSE of the three function examples.

Table 2: The three datasets for the classification.