Wavelet Network: Online Sequential Extreme Learning Machine for Nonlinear Dynamic Systems Identification

A single hidden layer feedforward neural network (SLFN) with online sequential extreme learning machine (OSELM) algorithm has been introduced and applied in many regression problems successfully. However, using SLFN with OSELM as black-box for nonlinear system identification may lead to building models for the identified plant with inconsistency responses from control perspective. The reason can refer to the random initialization procedure of the SLFN hidden node parameters with OSELM algorithm. In this paper, a single hidden layer feedforward wavelet network (WN) is introduced with OSELM for nonlinear system identification aimed at getting better generalization performances by reducing the effect of a random initialization procedure.


Introduction
In literature, Huang et al. presented an extreme learning machine (ELM) algorithm with a single hidden layer feedforward neural network (SLFN) in [1] and has taken many researchers' interest because it offers significant advantages such as providing better generalization performance, fast learning speed, ease of implementation, and minimal human intervention [2].The main principle of using ELM algorithm is the random initialization procedure of the SLFN input weights, biases, and activation function parameters which are either additive or radial basis functions (RBF) and the analytical determining of the output weights in a single step using least square solution.
However, the ELM algorithm is based on fixed network structure, which means if the hidden nodes activation function (e.g., RBF's centers and impact factors) parameters are initialized, then they will not be tuned during the learning phase.Therefore, random initialization of SLFN hidden node parameters may have effect on the modeling performances [3], and to improve the SLFN it requires high complexity performance and this may lead to ill condition, which means that an ELM may not be robust enough to capture variations in data [4].
Wavelet network (WN) has been applied successfully in many applications related to system identification and blackbox modelling for nonlinear dynamic systems with different batch training algorithms [5][6][7].The main advantages of WN are the capability of analytical initialization procedure for the hidden nodes (wavelet activation function) parameters [8].Moreover, wavelets decomposition properties for localization in both time and frequency domains allow better generalization in nonlinear system identification problems [7].
The similarity between the wavelet decomposing theory and single hidden layer feedforward neural networks (SLFNs) inspired Cao et al. in [9,10] to propose a structure of the composite function wavelet neural networks (CFWNN) to be used with ELM for different applications.The initialization of the wavelet function parameters, namely, translation and dilation, is done using the input data that takes into account 2 Advances in Artificial Intelligence the domain of input space.The results on several benchmark real-world data sets showed that the proposed CFWNN method can achieve better performances.
Latterly, a new WN structure of dual wavelet activation functions in the hidden layer nodes has been introduced by Javed et al. in [3] with ELM algorithm and called summation wavelet extreme learning machine (SW-ELM).The proposed SW-ELM showed good accuracy and generalization performances by reducing the impact of a random initialization procedure where wavelets and other parameters of hidden nodes are adjusted a priori to learning.
For many industrial applications where online sequential learning algorithms are preferred over batch learning algorithms, an online sequential extreme learning machine (OSELM) algorithm has been introduced for SLFN (NN-OSELM) with additive and RBF hidden nodes functions and showed better generalization performance and fast training capability over the other well-known sequential learning algorithms [11,12].However, the NN-OSELM has not been yet verified in nonlinear systems identification problems for control applications.The authors in [13] stated that if the RBF network is trained from random initial weights for each subset, it could converge to a different minimum that corresponds to weights   different from the one corresponding to  0 .This random initialization procedure may cause unacceptable learning behaviour in system identification applications and may lead to a different open loop response of the identified system regardless of the modeling accuracy.
To overcome these differences, the RBF can be replaced by wavelet function based on a fact that wavelets are capable of controlling the order of approximation and the regularity by some of their key mathematical properties and explicit analytical initialization form [13,14].In this regard, the authors in [15] introduced a feedforward WN with OSELM (WN-OSELM) algorithm for nonlinear system identification where the wavelet activation function parameters initialization played a big role to ensure fast and better learning performance over NN-OSELM.However, the proposed WN-OSELM method was based on fixed numbers of the input features and the hidden nodes which is not optimal in any case.
In this paper, feedforward WN framework is introduced with OSELM (WN-OSELM) to limit the impact of random initialization of the SLFN hidden nodes parameters by using density function with recursive algorithm [16] and the input weights and biases using Nguyen Widrow approach [17].Moreover, the optimal input features and hidden nodes are selected using sequential forward search approach (SQFS) [18] and final prediction criterion (FPEC) [19], respectively.The simulations will be carried out on three nonlinear systems and it will take in account the optimum number of hidden nodes and the input features for both WN and NN to ensure the generalization property.

Preliminaries
In this section, a brief review of the traditional wavelet neural networks and the NN-OSELM is presented.( The wavelet function parameters   and   are called translation and dilation (scale) parameters, respectively.The translation parameter determines the central position of the wavelet, whereas the dilation parameter controls the waves spread.These parameters can be defined from the data set available [14].

NN-OSELM.
For single-layer feedforward neural networks with hidden nodes governed by additive sigmoid or RBF activation functions and based on ELM algorithm theories [20], an online sequential extreme learning machine (OSELM) is developed by [12] in a unified way to deal with industrial applications that the training data comes one by one or chunk by chunk.To explain the principles of the OSELM, suppose a SLFN of  hidden nodes and RBF activation function can be expressed as below, where Φ(⋅) is a Gaussian type radial basis function and   is the center vector for neuron  and   is the output weights.Now, for a number of  input/output training samples that equals the number of SLFN hidden nodes (i.e.,  = ), if the input weights, biases, and hidden node parameters are randomly assigned and independent from the training data sets, an analytic and simple way to calculate the estimated output weights ω is applying least square solutions on the cost function as follows: where  is the SLFN hidden layer output and  the real output.Here, the estimated output weights can be determined by inverting  with a single step where a zero training error can be realized.However, when the number of training samples is greater than the number of hidden nodes ( > ), the estimated weights ω can be considered by using a pseudoinverse of  to give a small nonzero training error || > 0, where  † is pseudoinverse of  and the solution given by ( 5) can be rewritten as below [21], Based on the above, the learning procedure of the OSELM algorithm consists of two phases, namely, the initial phase and the sequential learning phase.In initial phase, suppose a chunk of initial training samples (  0 ,   0 ) ∈ R  0 :  0 = 1, 2, . . .,  0 , where  0 ⊆  is reached such that  ≤  0 ; then the estimated weights can be found by where . . .
while   is the applied input vector to the  input nodes   = (  1 ,   2 , . . .,    ).Now, for the sequential phase, suppose another chunk of data is reached . .,  0 +  1 , and  ≤  1 ; then the minimizing problem becomes where and then where To express ω1 in terms of ω0 , the detailed derivation formula can be found in [12], and it is described as For the subsequenced chunks of data samples, the recursive least square solutions are applicable, and the previous arguments can be generalized for  +1 .Suppose a data set (  +1 ,   +1 ),  +1 =   + 1,   + 2, . . .,   +  +1 , where  ≥ 0; then a recursive algorithm for updating  +1 can be written as follows: and ( 14) will be Finally, the OSELM algorithm for SLFN can be summarized in the following steps: (1) Initialize the network parameters (input weights, biases, and hidden nodes parameters) randomly.
(6) Go to Step (4) until all training data finish.

Nonlinear System Identification Using WN-OSELM
Many of nonlinear systems can be represented using the nonlinear autoregressive with exogenous inputs (NARX) model.Taking single-input-single-output systems as an example, this can be expressed by the following nonlinear difference equation: =  ( ( − 1) , . . .,  ( −   ) ,  () , . . .,  ( −   )) where  is unknown nonlinear mapping functions,   is the number of lagged feedback predictor sequences () (output),   is the number of lagged feedforward predictors () (input), and () is the fitting residual assumed to be bounded and uncorrelated with the inputs.Several methods can be applied to realize the NARX model including polynomials, neural networks, and wavelet networks [22].In this work, the proposed wavelet network with NARX called series-parallel structure is shown in Figure 1, where the real plant outputs are assumed to be available and can be used for prediction so that the stability of the network is guaranteed [23].
On the other hand, since the type of activation functions has an excessive impact on the network efficacy [24], the activation function of the WN hidden nodes is selected to be Morlet wavelet function as follows: Morlet wavelet function has been introduced by the authors in [3, 9, 10] and showed better generalizing capability which is proved statistically in different applications, noting that the type of the activation function that has been used with WN-OSELM in [15] was first derivative of Gaussian function.
It is important to state that the OSELM algorithm for WN is similar to SLFN introduced in Section 2.2.The only difference is the initialization procedure of the input weights and the wavelet activation function parameters which is determined from the available data set.It is preferred to yield attention of the WN parameters to deliver a fine initial point to the OSELM algorithm for the training phase.Assigning wavelet activation function parameters, namely dilation and translation, is a critical issue; wavelet functions with a small dilation value are low frequency filters and may make some wavelets too local, whereas increasing dilation factors the wavelet behaves as high frequency filter and in both cases the components of the gradient of the cost function could be having an effect on speed of convergence and the accuracy of learning algorithm [25].In this paper, the initializing of wavelet activation function parameters is done using density function and recursive algorithm [16] which is based on the initial input data that is available.Consider WN in the form of (1) with  = 1, . . .,  input nodes, and  = 1, . . .,  hidden nodes, and suppose chunks of input/output data set (  ,   ) ∈ R  are available for training; then the input domain space can be defined by [  ,   ] which holds the input item   in all observed samples.Now, consider the dilation parameter  1 defined as the center of the parallelepiped; then  1 = 0.2(  −   ).To initialize  1 , a point   is selected between   and   so it divides interval [  ,   ] into two parts.To find the point   , the estimated density functions from the available input output observations [, ] are considered such that the point   could be the center of gravity of domain [  ,   ].In this method, to find the point   , the "density" functions estimated from noisy input/output observations [, ()] are used such that the point   is the center of gravity of [  ,   ], where Here, the assumptions that   = (  +   )/2 and   1 =   are followed, the same as in [3].Recursively, for each subinterval the same procedure applied for all  input nodes to get a form of ( 12 ,  12 ), . . ., (  ,   ) matrix.However, the number of data samples required to fill up appropriate matrix should be greater than the number of hidden nodes [12].On the other hand, the WN input layer weights and biases are initialized using Nguyen Widrow approach that is explained in detail in [3].This algorithm creates initial weight and bias values so that the active region of each neuron in the input layer is distributed evenly across the input space.Therefore, the training phase works faster because each area of the input space has neurons that covers the active region of the data set, and this is the main advantage of Nguyen Widrow approach over purely randomizing weights and biases [17].
The initialization procedure of Nguyen Widrow approach [3] is as follows: (1) Initialize small (random) input weights  old in between [0.5, −0.5]: {the weights from input nodes  to hidden nodes }.

Simulation and Examples
The simulations are based on comparisons between WN-OSELM and NN-OSELM that applied to three nonlinear dynamic systems, namely, the magnetic levitation (MagLev), continuous stirred tank reactor (CSTR), and robot arm system.The comparison of this method to some conventional nonlinear dynamic systems identification methods will not be conducted here because a comparison between NN-OSELM and some conventional methods was covered in [12].All the input output data sets and the simulations are carried out using MATLAB 7.11 environment running in an ordinary PC with 2.27 GHZ CPU.Moreover, the collected data sets of the plants are detailed in Table 1, where the first two rows of the table show the ranges of the amplitudes and widths of the input pulses.
The third and fourth rows indicate the outputs value each with corresponding units and the sampling time for each one.However, the halves of the total of each plant sample are used for training and the other half is used for testing and validation.
In addition, Table 2 defines the architecture of the WN and NN models with each plant, where the activation function of WN is Morlet wavelet function and for NN selected to be a sigmoid function [12].For the NRAX structure, the input features and the number of the hidden nodes are different for each model because we seek here the best performance of both models by selecting the finest delayed input/output and the optimal hidden nodes numbers that give better performance to the models.The first two rows in Table 2 show the number of delayed inputs and delayed outputs (regressors) and both were determined using a heuristic approach called sequential forward search (SQFS) [18].
Moreover, the third row states the optimal number of hidden layer nodes based on final prediction criterion (FPEC) [19].The last row of Table 2 yields the initial chunk of the training data samples which should be greater than the number of hidden nodes and has been determined by adding number 50 over the hidden nodes number.Note that any integer value equal to or larger than 0 can be used, just to make sure that the number of hidden nodes is equal to or less than the presented data samples in the initial stage only.
In order to compare the model output with the plant actual output, the relative error index [7] can be used to measure the performance of the identified model which is defined as where () and ŷ() are the output measurements over the data set and associated for one-step-ahead predictions, respectively. test refers to the length of the validation data set and () refers to the mean of the measured output over the validation data set and can be determined by Note that  is called also normalized error because the value of ∑  test =1 |() − ()| 2 is considered as the standard deviation of the measured output and it is always constant for a given test data set; therefore the best model has minimum  with validation data test.However, to determine  in percentage form the formula below is applicable for performance test, Furthermore, by comparing the PI for the simulated models output over the validation data set, the performance of the model can be evaluated where the best model has maximum PI with validation data.

Example 1: Magnetic Levitation
System.The equation of motion for this system stated in [26] is where () is the displacement of the magnet ball, () is the current that controls the electromagnet flux,  is the mass of the magnet ball, and  is the gravitational constant.The factor  is a tacky friction parameter (damping constant), and  is a field power constant.These parameters can be found in [26].The magnetic levitation system data was collected at a sampling interval of 0.01 seconds to form the input and output time series data set shown in the Figure 2.
The simulations of WN and NN on the system are carried out and the results of the models are shown in Figure 3.The differences is notable and the PI performances of WN reached 97.50% which means the capability of capturing the plant dynamics is much more than NN that its PI reached 87.20%.
On the other hand, the root mean square error (RMSE) curves of the training phase for both models are shown in Figure 4.It is clear that the convergence capability of the WN over NN is much better in convergence than NN.Note that this system has fast dynamic response, yet WN was capable of extracting plant dynamics correctly due to the capability of the wavelet localization property by means of both translation and dilation parameters.

Example 2: Continuous Stirred Tank Reactor (CSTR).
The dynamics of the CSTR system [27] can be described by where () is the fluid level, () is the concentration result at the output,  1 () is the flow rate of first liquid feed source with concentration  1 = 24.9, and  2 () is the flow rate of the second feed with concentrations  2 = 0.1.The CSRT data sets are obtained preparatory for both training and testing as shown in Figure 5, where each sample time  is considered as 0.2 seconds.
However, the simulation result displayed WN much better than NN as in Figure 6, where the PI was 91.66% for  WN while NN showed 80.13%.Note that NN suffers from sharp discontinuity ends, while WNs with decompositions properties showed were able to overcome the sharp discontinuities with better responses.
The learning evaluation in terms of RMSE for both NN and WN models of the CSTR can be shown in Figure 7.It is clear that the capability of WN is superior to that of NN in terms of fast convergence.
where () is the angle of the arm and  is the DC motor torque.To obtain input and output data sets, torque  levels are applied to get different arm angles of total 8000 samples and Figure 8 shows 4000 samples for the training data set for learning process.
However, the simulation results for robot arm system in Figure 9 showed that PI for WN still better than NN, where PI with WN reached 89.79% while NN was 79.37%.
On the other hand, the training RMSE curves are shown in Figure 10.For robot arm system, it is obvious that the NN curve declines gradually with nonsmooth convergence, while the WN curve was stable from the beginning and much better with RMSE value.
From all the three examples, it is clear that OSELM with WN has a jumping effect of convergence and much smoothness over NN and this can be explained in this way; since the sequential implementation of the least squares solution is applied here, then all the convergence results of recursive least squares (RLS) solutions can be applied also [12], and this means that the distribution of eigenvalues of  has a variable impact on the speed of convergence.In NN case, eliminating eigenvalue spread problem cannot be implemented due to lack of knowledge of  due to random initialization of  parameters which is not in the case of WN that the way of choosing its initial parameters is known.Therefore, a priori  chosen initial parameters depending on the available data set will be very useful to enhance the OSELM convergence performance.Note that, in case of NN, the best of 50 trails (this is stated in [13]) are selected for the comparison process.Table 3 shows the elapsed time in seconds of the learning process for both WN and NN.In all simulation tests the WN showed faster learning process over NN, and this can be explained by the proper initializing of the WN parameters which enabled to converge faster even when WN both input and hidden node numbers was larger than NN as in case of robot arm system.However, it important to mention that the initialization process time of the WN parameters is higher than the NN initialization process time due to the recursive method of WN and random process of NN.

Conclusions
Wavelet network is proposed with OSELM to be used as nonlinear system identification scheme for control purposes in many industrial applications where sequential learning algorithms do not require retraining whenever new data is received.The initialization of wavelet network parameters is done using density function and recursive algorithm.The numbers of input and hidden nodes are selected using SQFS and FPEC, respectively; this is different from the approach introduced in [15] in which the numbers of inputs and hidden nodes were fixed.The simulations are carried out on three nonlinear dynamic systems and compared with SLFN-OSELM.From the results, it was found that the initialization procedure of the networks parameters plays a big role in the OSELM performance and consequently in the generated model performance.In other words, there was instability in the training phase of NN because of random initialization of the network parameters, while the training curve of WN was smooth and stable due to analytical initialization method of the wavelet mother function parameters.Moreover, the identified models by WN showed a better modeling accuracy over NN.The results overall showed the superiority of WN over NN in modelling performance and fast learning ability which realize that this work provides a very short learning time and better accuracy in many tests.In other words, the proposed can provide solutions for real-time varying system identification in some critical applications.

1 Figure 2 :Figure 3 :
Figure 2: The input and output data set of magnetic levitation system.

Figure 4 :
Figure 4: The RMSE curve of the NN and WN for MagLev.

Figure 6 :
Figure 6: The WN and NN outputs for CSTR.

Figure 7 :
Figure 7: The RMSE of the NN and WN for CSTR.

4 Figure 8 :
Figure 8: The input output data set of robot arm.

Figure 9 :
Figure 9: WN and NN outputs for robot arm.

Figure 10 :
Figure 10: The RMSE of the NN and WN for robot arm.

Table 1 :
The data sets for the nonlinear plants.

Table 2 :
Parameters of the models structures for the plants.

Table 3 :
The WN and NN learning times and PI for the three systems.