Particle Swarm Optimization Based Selective Ensemble of Online Sequential Extreme Learning Machine

1 School of Information Science and Engineering, Ocean University of China, 238 Songling Road, Qingdao 266100, China
2 School of Mechanical and Electrical Engineering, China Jiliang University, 258 Xueyuan Street, Xiasha High-Edu Park, Hangzhou 310018, China
3 Department of Mechanical and Industrial Engineering and the Iowa Informatics Initiative, 3131 Seamans Center, The University of Iowa, Iowa City, IA 52242-1527, USA
4 Arcada University of Applied Sciences, 00550 Helsinki, Finland


Introduction
Feedforward neural networks have been among the most prevalent models for data processing in the past decades [1,2]. However, their slow learning speed limits their applications. Recently, an original algorithm designed for single hidden layer feedforward neural networks (SLFNs), named extreme learning machine (ELM), was proposed by Huang et al. [3]. ELM is a tuning-free algorithm: it randomly selects the input weights and biases of the hidden nodes instead of learning these parameters, and the output weights of the network are then determined analytically. ELM proves to be several orders of magnitude faster than traditional learning algorithms and obtains better generalization performance as well. It makes fast and accurate data analytics possible and has been applied in many fields [4-6].
However, the algorithms mentioned above need all the training data to be available before building the model, which is referred to as batch learning. In many industrial applications, it is very common that the training data can only be obtained one by one or chunk by chunk. If a batch learning algorithm is rerun each time new training data become available, the learning process becomes very time consuming. Hence online learning is necessary for many real-world applications.
An online sequential extreme learning machine (OS-ELM) was then proposed by Liang et al. [7]. OS-ELM can learn sequential training observations online in chunks of arbitrary length (one by one or chunk by chunk). Newly arrived training observations are learned to update the model of the SLFNs, and as soon as the learning procedure for them is completed, the data are discarded. Moreover, the algorithm needs no prior knowledge about the number of observations that will be presented. OS-ELM is therefore an elegant online learning algorithm that handles both RBF and additive nodes in the same framework and can be applied to both classification and function regression problems. OS-ELM proves to be a very fast and accurate online sequential learning algorithm [8-10], providing better generalization performance at higher speed than other online learning algorithms such as GAP-RBF, GGAP-RBF, SGBP, RAN, RANEKF, and MRAN.
However, due to the random generation of the parameters of the hidden nodes, the generalization performance of OS-ELM, like that of ELM, sometimes cannot be guaranteed. Some ensemble based methods have been applied to ELM to improve its accuracy [11-13]. Ensemble learning is a learning scheme in which a collection of a finite number of learners is trained for the same task [14,15]. It has been demonstrated that the generalization ability of a learner can be significantly improved by ensembling a set of learners. In [16] a simple ensemble of OS-ELMs, that is, EOS-ELM, was investigated. However, Zhou et al. [17] proved that selective ensemble is a better choice. We apply this idea to OS-ELM. First, a novel selective ensemble algorithm, termed PSOSEN, is proposed. PSOSEN adopts particle swarm optimization (PSO) [18] to select the individual OS-ELMs that form the ensemble. Benefiting from the fast speed of PSO, PSOSEN is designed to be an accurate and fast selective ensemble algorithm. It should be noted that PSOSEN is a general selective ensemble algorithm suitable for any learning algorithm.
Different from batch learning, online learning algorithms need to perform learning continually, so the complexity of the learning algorithm must be taken into account. Obviously, performing selective ensemble learning at every step is not a good choice for online learning. We therefore design an adaptive selective ensemble framework for OS-ELM. A set of OS-ELMs is trained online, and the root mean square error (RMSE) is calculated at each step and compared with a preset threshold $\theta$. If the RMSE is larger than the threshold, the model is considered inaccurate, PSOSEN is performed, and a selective ensemble $S$ is obtained. Otherwise, the model is considered relatively accurate and no selection is carried out. The output of the system is then calculated as the average of the individuals in the ensemble set, and each individual OS-ELM is updated recursively.
UCI data sets [19], which contain both regression and classification data, are used to verify the feasibility of the proposed algorithm. Comparisons of three aspects, namely RMSE, standard deviation, and running time, are presented among OS-ELM, EOS-ELM, and selective ensembles of OS-ELM (SEOS-ELM) built with both GASEN and PSOSEN. The results convincingly show that PSOSEN achieves better generalization accuracy at fast learning speed.
The rest of the paper is organized as follows. In Section 2, previous work including ELM and OS-ELM is reviewed. The novel selective ensemble based on particle swarm optimization is presented in Section 3. An adaptive selective ensemble framework is designed for OS-ELM in Section 4. Experiments and comparison results are presented in Section 5. In Section 6, further discussion of PSOSEN is provided. We draw the conclusions of the paper in Section 7.

Review of Related Work
In this section, both the basic ELM algorithm and its online version, OS-ELM, are briefly reviewed as background for our work.

Extreme Learning Machine (ELM)
The ELM algorithm is derived from single hidden layer feedforward neural networks (SLFNs). Unlike traditional SLFNs, ELM assigns the parameters of the hidden nodes randomly without any iterative tuning. Moreover, all the parameters of the hidden nodes in ELM are independent of each other. Hence ELM can be seen as a generalized SLFN.
Given $N$ training samples $(\mathbf{x}_i, \mathbf{t}_i) \in \mathbb{R}^{n} \times \mathbb{R}^{m}$, where $\mathbf{x}_i$ is an input vector of $n$ dimensions and $\mathbf{t}_i$ is a target vector of $m$ dimensions, SLFNs with $\tilde{N}$ hidden nodes, each with output function $G(\mathbf{a}_j, b_j, \mathbf{x})$, are mathematically modeled as

$$\sum_{j=1}^{\tilde{N}} \beta_j G(\mathbf{a}_j, b_j, \mathbf{x}_i) = \mathbf{t}_i, \quad i = 1, \dots, N, \tag{1}$$

where $(\mathbf{a}_j, b_j)$ are the parameters of the $j$th hidden node and $\beta_j$ is the weight vector connecting the $j$th hidden node to the output nodes. For simplicity, (1) can be written equivalently as

$$H \beta = T, \tag{2}$$

where

$$H = \begin{bmatrix} G(\mathbf{a}_1, b_1, \mathbf{x}_1) & \cdots & G(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_1) \\ \vdots & \ddots & \vdots \\ G(\mathbf{a}_1, b_1, \mathbf{x}_N) & \cdots & G(\mathbf{a}_{\tilde{N}}, b_{\tilde{N}}, \mathbf{x}_N) \end{bmatrix}_{N \times \tilde{N}} \tag{3}$$

is called the hidden layer output matrix of the neural network; the $j$th column of $H$ is the output of the $j$th hidden node with respect to the inputs $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$.
In ELM, $H$ can easily be obtained once the training set is available and the parameters $(\mathbf{a}_j, b_j)$ have been randomly assigned. ELM then reduces to a linear system, and the output weights $\beta$ are calculated as

$$\hat{\beta} = H^{\dagger} T, \tag{4}$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $H$.
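To make the batch computation concrete, the following is a minimal NumPy sketch of (1)-(4) for sigmoid additive nodes. The paper's experiments were run in MATLAB, so this Python port is purely illustrative; the [−1, 1] initialization range follows the experimental setup described in Section 5.

```python
# Minimal batch-ELM sketch (illustrative; not the authors' MATLAB code).
import numpy as np

def elm_train(X, T, n_hidden, rng=np.random.default_rng(0)):
    """Randomly assign (a_j, b_j), build H, and solve beta = pinv(H) @ T as in (4)."""
    A = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # input weights a_j
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))                   # hidden layer output matrix H
    beta = np.linalg.pinv(H) @ T                             # Moore-Penrose solution
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta
```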
The ELM algorithm can be summarized in the three steps of Algorithm 1.

Algorithm 1 (ELM)
(1) Randomly assign the hidden node parameters $(\mathbf{a}_j, b_j)$, $j = 1, \dots, \tilde{N}$.
(2) Calculate the hidden layer output matrix $H$.
(3) Calculate the output weight $\hat{\beta} = H^{\dagger} T$, where $H^{\dagger}$ is the Moore-Penrose generalized inverse of the hidden layer output matrix $H$.

OS-ELM
In many industrial applications, it is impossible to have all the training data available before the learning process. It is common that the training observations are sequentially fed to the learning algorithm; that is, the observations arrive one by one or chunk by chunk. In this case, the batch ELM algorithm is no longer applicable. Hence, a fast and accurate online sequential extreme learning machine was proposed to deal with online learning.
The output weight $\hat{\beta}$ obtained from (4) is actually a least-squares solution of (2). Given $\operatorname{rank}(H) = \tilde{N}$, the number of hidden nodes, $H^{\dagger}$ can be written as

$$H^{\dagger} = (H^{T} H)^{-1} H^{T}, \tag{5}$$

which is also called the left pseudoinverse of $H$, since it satisfies $H^{\dagger} H = I_{\tilde{N}}$. If $H^{T} H$ tends to be singular, a smaller network size $\tilde{N}$ and a larger number of initialization data $N_0$ should be chosen in the initialization step of OS-ELM. Substituting (5) into (4), we get

$$\hat{\beta} = (H^{T} H)^{-1} H^{T} T, \tag{6}$$

which is the least-squares solution of (2). The OS-ELM algorithm can then be derived by a recursive implementation of the least-squares solution (6).
There are two main steps in OS-ELM, an initialization step and an update step. In the initialization step, the number of training data $N_0$ should be equal to or larger than the network size $\tilde{N}$. In the update step, the learning model is updated by recursive least squares (RLS), and only the newly arrived single or chunk training observations are learned, which are discarded as soon as the learning step is completed.
In general, the two steps of the OS-ELM algorithm are as follows.
(a) Initialization step: batch ELM is used to initialize the learning system with a small chunk of initial training data $\mathcal{N}_0 = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N_0}$.
(1) Assign random input weights $\mathbf{a}_i$ and biases $b_i$ (for additive hidden nodes) or centers $\mathbf{a}_i$ and impact factors $b_i$ (for RBF hidden nodes), $i = 1, \dots, \tilde{N}$.
(2) Calculate the initial hidden layer output matrix $H_0$.
(3) Calculate the initial output weight

$$\beta^{(0)} = P_0 H_0^{T} T_0, \quad \text{where } P_0 = (H_0^{T} H_0)^{-1} \text{ and } T_0 = [\mathbf{t}_1, \dots, \mathbf{t}_{N_0}]^{T}. \tag{7}$$

(b) Sequential learning step is as follows.
The $(k+1)$th chunk of new observations can be expressed as

$$\mathcal{N}_{k+1} = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i = (\sum_{j=0}^{k} N_j) + 1}^{\sum_{j=0}^{k+1} N_j}, \tag{8}$$

where $N_{k+1}$ denotes the number of observations in the newly arrived $(k+1)$th chunk. Computing the partial hidden layer output matrix $H_{k+1}$ for this chunk, the RLS update of the model is

$$P_{k+1} = P_k - P_k H_{k+1}^{T} \left( I + H_{k+1} P_k H_{k+1}^{T} \right)^{-1} H_{k+1} P_k, \tag{9}$$

$$\beta^{(k+1)} = \beta^{(k)} + P_{k+1} H_{k+1}^{T} \left( T_{k+1} - H_{k+1} \beta^{(k)} \right). \tag{10}$$
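A compact sketch of the two OS-ELM steps, directly implementing (7), (9), and (10), may clarify the recursion. This is our own illustrative NumPy rendering, not the authors' code; the hidden layer matrices $H_0$ and $H_{k+1}$ are assumed to be computed as in the ELM sketch above.

```python
# OS-ELM recursion sketch: initialization (7) and RLS update (9)-(10).
import numpy as np

def oselm_init(H0, T0):
    """Initialization step: P0 = (H0^T H0)^{-1}, beta0 = P0 H0^T T0."""
    P = np.linalg.inv(H0.T @ H0)
    beta = P @ H0.T @ T0
    return P, beta

def oselm_update(P, beta, Hk, Tk):
    """Sequential step for one newly arrived chunk (Hk, Tk); the chunk can be
    discarded after this call, as required by online learning."""
    K = np.linalg.inv(np.eye(Hk.shape[0]) + Hk @ P @ Hk.T)
    P = P - P @ Hk.T @ K @ Hk @ P                 # covariance update (9)
    beta = beta + P @ Hk.T @ (Tk - Hk @ beta)     # output weight update (10)
    return P, beta
```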

Particle Swarm Optimization Selective Ensemble
In this section, a novel selective ensemble method referred to as particle swarm optimization selective ensemble (PSOSEN) is proposed. PSOSEN adopts particle swarm optimization to select the good learners and combine their predictions. The detailed procedure of the PSOSEN algorithm is introduced below.
A remarkable advantage of PSOSEN is its speed compared with other selective ensemble algorithms. Another popular selective ensemble method, GASEN, is based on the genetic algorithm. Compared with GASEN, PSOSEN converges to the optimal solution faster because it omits the crossover and mutation operations used in GASEN. GASEN is in fact rather complicated, requiring encoding, decoding, and other genetic operations. For instance, GASEN works only with binary encoding, whereas PSOSEN handles arbitrary real-valued weights, updated from the particles' current positions and velocity vectors in the corresponding hyperspace. PSOSEN also needs little parameter adjustment and is thus easy to implement. Although it uses a simpler mechanism, PSOSEN is still capable of high prediction accuracy and reaches the optimum earlier than GASEN. Furthermore, PSO is less influenced by changes in problem dimensionality or modality than GA, and it proves robust in most situations [20].
As selective ensemble is usually more time consuming than the original algorithm, a faster optimization method is preferable. For this purpose, PSOSEN is an appropriate choice for searching for the optimal ensemble of ELM models efficiently.
Zhou et al. [17] have demonstrated that ensembling many of the available learners may be better than ensembling all of them, for both regression and classification. The detailed proof of this conclusion is not repeated in this paper. An important problem for selective ensemble, however, is how to select the good learners from a set of available learners.
The proposed selective ensemble algorithm selects the good learners as follows. PSOSEN is based on a heuristic: each learner is assigned a weight that characterizes the fitness of including this learner in the ensemble. A learner whose weight is larger than a preset threshold $\lambda$ is then selected to join the ensemble.
We explain the principle of PSOSEN in the context of regression. Let $w_i$ denote the weight of the $i$th of $M$ component learners. The weights satisfy

$$0 \le w_i \le 1, \qquad \sum_{i=1}^{M} w_i = 1, \tag{11}$$

and form the weight vector

$$\mathbf{w} = (w_1, w_2, \dots, w_M)^{T}. \tag{12}$$

Suppose the input $\mathbf{x} \in \mathbb{R}^{n}$ is drawn from the distribution $p(\mathbf{x})$, the true output of $\mathbf{x}$ is $d(\mathbf{x})$, and the actual output of the $i$th learner is $f_i(\mathbf{x})$. Then the output of the simple weighted ensemble on $\mathbf{x}$ is

$$\hat{f}(\mathbf{x}) = \sum_{i=1}^{M} w_i f_i(\mathbf{x}). \tag{13}$$

The generalization error $E_i(\mathbf{x})$ of the $i$th learner and the generalization error $\hat{E}(\mathbf{x})$ of the ensemble on $\mathbf{x}$ are, respectively,

$$E_i(\mathbf{x}) = \left( f_i(\mathbf{x}) - d(\mathbf{x}) \right)^2, \qquad \hat{E}(\mathbf{x}) = \left( \hat{f}(\mathbf{x}) - d(\mathbf{x}) \right)^2. \tag{14}$$

The generalization error $E_i$ of the $i$th learner and that of the ensemble $\hat{E}$ on $p(\mathbf{x})$ are, respectively,

$$E_i = \int E_i(\mathbf{x}) \, p(\mathbf{x}) \, d\mathbf{x}, \qquad \hat{E} = \int \hat{E}(\mathbf{x}) \, p(\mathbf{x}) \, d\mathbf{x}. \tag{15}$$

We then define the correlation between the $i$th and the $j$th component learners as

$$C_{ij} = \int \left( f_i(\mathbf{x}) - d(\mathbf{x}) \right) \left( f_j(\mathbf{x}) - d(\mathbf{x}) \right) p(\mathbf{x}) \, d\mathbf{x}. \tag{16}$$

Obviously, $C_{ij}$ satisfies

$$C_{ii} = E_i, \qquad C_{ij} = C_{ji}. \tag{17}$$

Combining the definitions above, we get

$$\hat{E} = \sum_{i=1}^{M} \sum_{j=1}^{M} w_i w_j C_{ij}. \tag{18}$$

To minimize the generalization error of the ensemble, according to (18) the optimum weight vector can be obtained as

$$\mathbf{w}_{\text{opt}} = \arg \min_{\mathbf{w}} \left( \sum_{i=1}^{M} \sum_{j=1}^{M} w_i w_j C_{ij} \right). \tag{19}$$

The $k$th component of $\mathbf{w}_{\text{opt}}$, that is, $w_{\text{opt} \cdot k}$, can be solved by the method of Lagrange multipliers:

$$\frac{\partial \left( \sum_{i=1}^{M} \sum_{j=1}^{M} w_i w_j C_{ij} - 2\mu \left( \sum_{i=1}^{M} w_i - 1 \right) \right)}{\partial w_{\text{opt} \cdot k}} = 0, \tag{20}$$

which simplifies to

$$\sum_{j=1}^{M} w_{\text{opt} \cdot j} C_{kj} = \mu. \tag{21}$$

Taking the constraint (11) into account, we get

$$w_{\text{opt} \cdot k} = \frac{\sum_{j=1}^{M} C_{kj}^{-1}}{\sum_{i=1}^{M} \sum_{j=1}^{M} C_{ij}^{-1}}, \tag{22}$$

where $C_{kj}^{-1}$ denotes the $(k,j)$ element of $C^{-1}$. Equation (22) gives a direct solution for $\mathbf{w}_{\text{opt}}$, but this solution seldom works well in real-world applications: some learners are quite similar in performance, so when a number of learners are available the correlation matrix $C$ may be singular or ill-conditioned.
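For illustration, the closed-form solution (22) can be computed as below. The ridge term is our own guard against the ill-conditioning of $C$ noted above and is not part of the derivation.

```python
# Direct solution of (22): w_opt,k = sum_j [C^{-1}]_{kj} / sum_{ij} [C^{-1}]_{ij}.
import numpy as np

def optimal_weights(C, ridge=1e-8):
    M = C.shape[0]
    C_inv = np.linalg.inv(C + ridge * np.eye(M))  # ridge guards an ill-conditioned C
    return C_inv.sum(axis=1) / C_inv.sum()
```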
Although we cannot obtain the optimum weights of the learners directly, we can approximate them. Equation (19) can be viewed as an optimization problem, and since particle swarm optimization has proved to be a powerful optimization tool, PSOSEN is proposed to solve it. The basic PSO algorithm is shown in Figure 1.
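The sketch below summarizes the canonical global-best PSO loop that PSOSEN builds on. The inertia and acceleration coefficients are common defaults, not values reported in the paper.

```python
# Canonical global-best PSO over [0, 1]^dim (parameter values are illustrative).
import numpy as np

def pso_minimize(loss, dim, n_particles=30, iters=100,
                 w=0.7, c1=1.5, c2=1.5, rng=np.random.default_rng(0)):
    x = rng.uniform(0.0, 1.0, (n_particles, dim))      # particle positions
    v = np.zeros((n_particles, dim))                   # particle velocities
    pbest = x.copy()                                   # personal best positions
    pbest_val = np.array([loss(p) for p in x])
    g = pbest[pbest_val.argmin()].copy()               # global best position
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, 0.0, 1.0)                   # keep weights in [0, 1]
        vals = np.array([loss(p) for p in x])
        improved = vals < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]
        g = pbest[pbest_val.argmin()].copy()
    return g
```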
PSOSEN first randomly assigns a weight to each of the available learners. It then employs the particle swarm optimization algorithm to evolve those weights so that they characterize the fitness of the learners for joining the ensemble. Finally, the learners whose weights are larger than the preset threshold $\lambda$ are selected to form the ensemble. Note that if all the evolved weights are larger than $\lambda$, then all the learners are selected to join the ensemble.
PSOSEN can be applied to both regression and classification problems because the weight-evolving process is used only to select the component learners. In particular, the outputs of the ensemble for regression are combined by simple averaging instead of weighted averaging, since previous work [17] showed that using the weights both for selecting the component learners and for combining the outputs tends to cause overfitting.
In the process of generating the population, the goodness of each individual is evaluated on validation data bootstrap sampled from the training data set. We use $\hat{E}^{V}$ to denote the generalization error of the ensemble corresponding to individual $\mathbf{w}$ on the validation data $V$. Obviously, $\hat{E}^{V}$ describes the goodness of $\mathbf{w}$: the smaller $\hat{E}^{V}$ is, the better $\mathbf{w}$ is. PSOSEN therefore adopts $f(\mathbf{w}) = 1 / \hat{E}^{V}$ as the fitness function. The PSOSEN algorithm can be summarized as follows: bootstrap samples $S_1, S_2, \dots, S_T$ are generated from the original training data set; a component learner $N_t$ is trained from each $S_t$; and a selective ensemble $N^{*}$ is built from $N_1, N_2, \dots, N_T$ by evolving the weights with PSO and selecting the learners whose weights exceed the threshold $\lambda$. The output is the average output of the selected learners for regression, or the class label receiving the most votes for classification (see Algorithm 2).
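Putting the pieces together, a hedged end-to-end sketch of PSOSEN's selection step is given below. It reuses pso_minimize from the previous sketch and minimizes the validation error, which is equivalent to maximizing the fitness $f(\mathbf{w}) = 1/\hat{E}^{V}$; the threshold value and all names are illustrative.

```python
# PSOSEN selection sketch: evolve weights with PSO, keep learners above lambda.
import numpy as np

def psosen_select(val_preds, val_target, threshold=0.5):
    """val_preds has shape (M, n_val): row i holds learner i's outputs on the
    bootstrap-sampled validation set. Returns indices of the selected learners."""
    def ensemble_error(wts):
        wts = wts / (wts.sum() + 1e-12)            # normalize to satisfy (11)
        pred = wts @ val_preds
        return np.mean((pred - val_target) ** 2)   # validation error E-hat^V
    w_evolved = pso_minimize(ensemble_error, dim=val_preds.shape[0])
    selected = np.flatnonzero(w_evolved > threshold)
    if selected.size == 0:                         # degenerate case: keep all
        selected = np.arange(val_preds.shape[0])
    return selected

# Regression output: simple (unweighted) average over the selected learners,
# per the discussion above on avoiding weighted combination.
def psosen_predict(preds, selected):
    return preds[selected].mean(axis=0)
```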

Particle Swarm Optimization Based Selective Ensemble of Online Sequential Extreme Learning Machine
In this section, PSOSEN is applied to the original OS-ELM to improve its generalization performance. In order to reduce the complexity and employ PSOSEN flexibly, an adaptive framework is designed. The flowchart of the framework is shown in Figure 2.
Online learning is necessary in many industrial applications where training data can only be obtained sequentially. Although OS-ELM is a useful online learning algorithm, its generalization performance may not be good, owing to the random generation of the input parameters. An ensemble method has been investigated for OS-ELM, namely, the EOS-ELM algorithm [16]. However, it is only a simple ensemble method, which just averages all of the $M$ individual OS-ELMs. In this section, selective ensemble, which is superior to simple ensemble, is applied to OS-ELM, using the novel selective ensemble method proposed in Section 3. Since performing PSOSEN at every step is time consuming, we design an adaptive framework that decides whether to perform PSOSEN or simple ensemble, so that accuracy and complexity are balanced well. The framework for the new algorithm is as follows.
First, $M$ individual OS-ELMs are initialized. The number of hidden nodes is the same for each OS-ELM, while the input weights and biases of each OS-ELM are randomly generated.
Second, the RMSE is calculated:

$$e = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left( \mathbf{y}_i - \mathbf{y} \right)^2}, \tag{23}$$

where $\mathbf{y}$ is the expected output and $\mathbf{y}_i$ is the actual output of the $i$th individual OS-ELM. The RMSE is compared with a preset threshold $\theta$. If $e$ is larger than $\theta$, which means the simple ensemble is not accurate, PSOSEN is performed and a selective ensemble $S$ is obtained. If $e$ is smaller than $\theta$, which indicates that the simple ensemble is relatively accurate, no selection is carried out.
Third, the output of the system is calculated as the average output of the individuals in the ensemble set $S$:

$$f(\mathbf{x}) = \frac{1}{|S|} \sum_{i \in S} H_i \beta_{i,k}, \tag{24}$$

where $H_i$ is the hidden layer output matrix of the $i$th OS-ELM and $\beta_{i,k}$ is the output weight calculated by the $i$th OS-ELM at step $k$. Finally, each OS-ELM is updated recursively according to the update equations (9) and (10) presented in Section 2.
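One step of the adaptive framework can be sketched as follows, combining the RMSE test (23), the PSOSEN selection from Section 3, and the simple average (24). This is an illustrative composition under our own naming, not the authors' implementation.

```python
# Adaptive framework sketch: select only when the simple ensemble is inaccurate.
import numpy as np

def adaptive_ensemble_step(outputs, target, theta, val_preds, val_target):
    """outputs has shape (M,): the M individual OS-ELM predictions at this step."""
    e = np.sqrt(np.mean((outputs - target) ** 2))  # RMSE over individuals, (23)
    if e > theta:                                  # inaccurate: run PSOSEN
        S = psosen_select(val_preds, val_target)
    else:                                          # accurate enough: keep all
        S = np.arange(outputs.shape[0])
    return outputs[S].mean()                       # simple average over S, (24)
```

The choice of $\theta$ directly trades accuracy against the cost of running PSO, which matches the adjustable complexity of SEOS-ELM noted in Section 5.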

Performance Evaluation of PSOSEN Based OS-ELM
In this section, a series of experiments was conducted to evaluate the performance of the proposed algorithm. OS-ELM, EOS-ELM, and GASEN based OS-ELM are compared with the new algorithm. All the experiments were carried out in the MATLAB R2012b environment on a desktop with a 3.40 GHz CPU and 8 GB RAM.

Model Selection.
For OS-ELM, the number of hidden nodes is the only parameter that needs to be determined, and cross-validation is usually used to choose it. Fifty trials of simulations were performed for the regression and classification problems, respectively, and the number of hidden nodes was then determined by the validation error. For EOS-ELM, SEOS-ELM (GASEN), and SEOS-ELM (PSOSEN), another parameter needs to be determined, namely, the number of networks in the ensemble. This parameter was varied from 5 to 30 with an interval of 5, and the optimal value was selected according to the RMSE for regression or the testing accuracy for classification, together with the standard deviation. For a given problem, the number of OS-ELMs is selected based on the lowest standard deviation and an RMSE or accuracy comparable with OS-ELM. Table 1 is an example of selecting the optimal number of networks for SEOS-ELM (PSOSEN) with RBF hidden nodes on the New-thyroid dataset. As illustrated by Table 1, the lowest standard deviation occurs when the number of OS-ELMs is 20, while the prediction accuracy of SEOS-ELM is better than that of OS-ELM. Hence we set the number of networks to 20 for the New-thyroid dataset. The numbers of OS-ELMs for the other datasets were determined in the same way.
In the experiments, OS-ELM, EOS-ELM, and SEOS-ELM (GASEN) were compared with SEOS-ELM (PSOSEN). General information on the benchmark datasets used in our evaluations is listed in Table 2. Both regression and classification problems are included.
For OS-ELM, the input weights and biases (for the additive activation function) or the centers (for the RBF activation function) were all generated from the range [−1, 1]. For regression problems, all the inputs and outputs were normalized into the range [0, 1], while for classification problems they were normalized into the range [−1, 1].
The benchmark datasets studied in the experiments are from the UCI Machine Learning Repository, except the California Housing dataset, which is from the StatLib Repository. In addition, a time-series problem, Mackey-Glass, from UCI was also adopted to test our algorithms.

Algorithm Evaluation.
To verify the superiority of the proposed algorithm, the RMSE for regression problems and the testing accuracy for classification problems are computed. The initial dataset size is very small, equal to the number of hidden nodes, to guarantee that the model works. All the remaining data are then sent to the model in a one-by-one learning mode. The evaluation results are presented in Tables 3, 4, 5, and 6, corresponding to models with sigmoid hidden nodes and RBF hidden nodes for both regression and classification problems. Each result is an average over 50 trials, and in every trial the training and testing samples were randomly drawn from the dataset under consideration.
From the comparison results in the four tables, we can easily find that EOS-ELM, SEOS-ELM (GASEN), and SEOS-ELM (PSOSEN) are more time consuming than OS-ELM, but they still keep a relatively fast speed most of the time. It should be noted that the complexity of SEOS-ELM is adjustable, depending on the threshold $\theta$.
More importantly, EOS-ELM, SEOS-ELM (GASEN), and SEOS-ELM (PSOSEN) all attain lower testing deviations and more accurate regression or classification results than OS-ELM, which shows the advantage of ensemble learning. In addition, both SEOS-ELM (GASEN) and SEOS-ELM (PSOSEN) are more accurate than EOS-ELM, which verifies that selective ensemble is better than the simple ensemble method.
In the comparison between SEOS-ELM (GASEN) and SEOS-ELM (PSOSEN), it can be observed that the two selective ensemble algorithms achieve comparable accuracy. The advantage of the new algorithm is that it is more computationally efficient. This verifies that PSOSEN is a fast and accurate selective ensemble algorithm.
As an online learning algorithm, online learning ability is another important evaluation criterion. To illustrate the online learning ability of the proposed algorithm, a simulated regression dataset is adopted. The dataset was generated from the function $y = x^2 + 3x + 2$, comprising 4500 training data and 1000 testing data; this function was chosen arbitrarily, simply to simulate a regression problem. Figures 3 and 4 explicitly depict the training accuracy of SEOS-ELM (PSOSEN), EOS-ELM, and OS-ELM with respect to the number of training data during learning. It can be observed that with an increasing number of training samples, the RMSE values of the three methods decline significantly: as online learning progresses, the training models are continuously updated and corrected, so the more training data the system learns, the more precise the model is. Whether the hidden nodes are sigmoid or RBF, SEOS-ELM always obtains a smaller RMSE than EOS-ELM and OS-ELM, which indicates that SEOS-ELM is considerably more accurate than the other methods. Moreover, the smaller testing deviations of SEOS-ELM in Tables 3 to 6 also confirm the stability of SEOS-ELM.
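For reproducibility of this illustration, the simulated stream can be generated as below; the input range and the absence of noise are our assumptions, since the paper does not state them.

```python
# Simulated regression stream y = x^2 + 3x + 2 (4500 training, 1000 testing).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-5.0, 5.0, 5500)   # input range is an assumption
y = x**2 + 3*x + 2
x_train, y_train = x[:4500], y[:4500]
x_test, y_test = x[4500:], y[4500:]
```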

Discussion
In the experiments, PSOSEN showed higher accuracy than the original OS-ELM and the simple ensemble of OS-ELM, which verifies the feasibility of the selective ensemble method. In addition, compared with GASEN, PSOSEN showed comparable accuracy but a much faster learning speed. Taking both complexity and accuracy into consideration, PSOSEN is a good choice for selective ensemble. Experiments on the online version of ELM have demonstrated these advantages. It should be noted that, as a general selective ensemble method, PSOSEN is applicable to any learning algorithm, whether batch or online. Applying PSOSEN to other learning algorithms is therefore of interest for future work.
The experiments also showed that although ensemble learning, whether simple or selective, attains higher accuracy, it is more time consuming than the original learning algorithm, and selective ensemble is slower than simple ensemble. As a selective ensemble method, PSOSEN is likewise slower than the original learning algorithm and the simple ensemble. Selective ensemble is thus a trade-off between complexity and accuracy. In the future, new selective ensemble methods should be designed to further improve the speed of the algorithm.

Conclusion
In this paper, PSOSEN is proposed as a novel selective ensemble algorithm. Benefiting from the fast speed of PSO, PSOSEN performs selective ensemble learning both accurately and efficiently. An adaptive framework is further designed to apply PSOSEN to OS-ELM so that accuracy and complexity are well balanced. Experiments on UCI datasets, covering both regression and classification problems, show that the proposed algorithm achieves better generalization performance than OS-ELM, EOS-ELM, and GASEN based OS-ELM, while keeping a relatively fast learning speed.

Figure 2: Flowchart of the proposed framework.

Figure 3: RMSE with respect to the number of training samples for sigmoid hidden nodes.

Figure 4: RMSE with respect to the number of training samples for RBF hidden nodes.

Table 1: Network selection for the New-thyroid dataset.

Table 2: Specification of the benchmark datasets.

Table 3: Comparison of algorithms for regression problems with sigmoid hidden nodes.

Table 4: Comparison of algorithms for classification problems with sigmoid hidden nodes.

Table 5: Comparison of algorithms for regression problems with RBF hidden nodes.

Table 6: Comparison of algorithms for classification problems with RBF hidden nodes. (Columns in Tables 3-6: Datasets; Algorithm; Number of nodes; Number of networks; Training time (s); RMSE or accuracy; Testing deviation.)
