Sequential Extreme Learning Machine with Generalized Regularization and Adaptive Forgetting Factor for Time-Varying System Prediction

Many real-world applications are of a time-varying nature, and an online learning algorithm is preferred for tracking the real-time changes of a time-varying system. Online sequential extreme learning machine (OSELM) is an excellent online learning algorithm, and several improved OSELM algorithms incorporating a forgetting mechanism have been developed to model and predict time-varying systems. However, the existing algorithms suffer from a potential risk of instability due to the intrinsic ill-posed problem; besides, their adaptive tracking ability for complex time-varying systems is still rather weak. To overcome these two problems, this paper proposes a novel OSELM algorithm with generalized regularization and adaptive forgetting factor (AFGR-OSELM). In the AFGR-OSELM, a new generalized regularization approach is employed to replace the traditional exponential forgetting regularization so that the algorithm has a constant regularization effect; consequently, the potential ill-posed problem is completely avoided and persistent stability is guaranteed. Moreover, the AFGR-OSELM adopts an adaptive scheme to adjust the forgetting factor dynamically and automatically during the online learning process so as to better track the dynamic changes of the time-varying system and reduce the adverse effects of outdated data in time; thus it tends to provide desirable prediction results in a time-varying environment. Detailed performance comparisons of AFGR-OSELM with other representative algorithms are carried out on artificial and real-world data sets. The experimental results show that the proposed AFGR-OSELM achieves higher prediction accuracy with better stability than its counterparts for predicting time-varying systems.


Introduction
In many real-world applications such as financial data analysis, industrial process monitoring, weather forecasting, and customer behavior prediction, the samples arrive successively in the form of a data stream [1]. Moreover, these systems are usually nonstationary and exhibit time-varying behaviors; that is, the underlying distribution or the trend of the data changes over time [2]. In such situations, an online learning algorithm is a more appropriate choice than a batch learning algorithm, since the online learner can be incrementally updated to acquire new knowledge or to suit changing patterns in an evolving data stream, and it does not require retraining whenever a new sample is received [3][4][5].
Among the existing online learning algorithms, the online sequential extreme learning machine (OSELM) [6] is well known for its efficient and powerful learning ability. OSELM originates from the batch extreme learning machine (ELM) [7] algorithm, which has been demonstrated to be extremely fast with good generalization performance. As a sequential implementation of ELM, OSELM is capable of learning the training data one-by-one or chunk-by-chunk with fixed or varying chunk length, and its output weights are continually updated by a recursive least squares (RLS) algorithm. Compared with other popular online learning algorithms, OSELM can provide better generalization performance at a much faster learning speed. Owing to these advantages, OSELM has been successfully applied in the field of system modeling and prediction, such as online nonlinear system identification [8, 9], ship roll motion prediction [10], consumer sentiment prediction [11], and time series prediction [12][13][14][15]. Despite being an excellent online learning algorithm, OSELM may still suffer from instability due to a potential ill-conditioned matrix inversion, and its stability and generalization performance can be greatly degraded once the autocorrelation matrix of the hidden layer output matrix is singular or ill-conditioned. Regularization is an effective way to cope with the ill-posed problem, and a regularized OSELM (R-OSELM) based on biobjective optimization with Tikhonov regularization was proposed in [16]. The R-OSELM successfully overcomes the potential ill-posed problem and tends to provide good generalization performance and stability, and it has become a practical online modeling method in real applications.
In a time-varying environment, data evolve over time; accordingly, the recent samples are more relevant to the actual target concept, and the out-of-date samples may be useless or even introduce noise [17]. Therefore, we should increase the importance of recent samples relative to distant ones and reduce the adverse effects of outdated data. To deal with the time-varying scenario, some improved OSELM algorithms incorporating a sliding window strategy have been proposed, such as FOS-ELM [18], OS-ELMK [19], FORELM, and FOKELM [20]. By using the sliding window strategy, these algorithms can discard the outdated data in the learning process so as to reduce their adverse effects on subsequent online learning and prediction. Nevertheless, since an extra decremental learning procedure is required to remove the oldest sample from the sliding window whenever a new sample is added, the sliding window-based OSELM algorithms are more complex and require more computation time than the original OSELM algorithm. Moreover, in these algorithms all the data samples within the sliding window are still treated equally.
Apart from the sliding window strategy, the forgetting factor (FF) method is another, more efficient and flexible, means to reflect the variations of the observed data in a time-varying environment by assigning a different weight to each data sample according to its time order, and it has been widely applied in adaptive filtering theory [21]. Analogously, the concept of the FF has also been introduced into OSELM for dynamic system estimation and prediction [22][23][24]. In [22], a low complexity adaptive forgetting factor OSELM (LAFF-OSELM) algorithm was proposed for time-varying nonlinear system identification, and it was found that the proposed algorithm can deal with the variation effectively through its adaptive forgetting mechanism. In [23], the authors presented a new variable forgetting factor OSELM using the directional FF method (DFF-OSELM), which achieved superior prediction performance in industrial applications compared to the OSELM algorithm. However, in both LAFF-OSELM and DFF-OSELM the intrinsic ill-posed problem is not considered, so they may encounter potential instability. To provide a more stable and reliable modeling approach for time-varying systems, the FF method and the regularization technique were simultaneously incorporated into OSELM, yielding the FR-OSELM algorithms [24]. With the help of the FF, the FR-OSELM maintains a good tracking ability for the time-varying system; meanwhile, the stability of the algorithm is also enhanced by the regularization approach. Nevertheless, since the FR-OSELM adopts a special exponential forgetting regularization, its regularization effect fades gradually, up to nullification, as time passes, and ultimately the FR-OSELM may still encounter the same or an even more serious instability problem as OSELM due to the probable inversion of an ill-conditioned matrix. That is to say, the FR-OSELM is stable and workable only within a limited period.
In this paper, a novel OSELM with generalized regularization and adaptive forgetting factor (AFGR-OSELM) is proposed for the modeling and prediction of time-varying systems. Our contributions are as follows. (i) Different from the previous FR-OSELM algorithms, which use a special exponential forgetting regularization, a more generalized ℓ2 regularization is employed in our AFGR-OSELM. The built-in generalized regularization approach gives the AFGR-OSELM a constant regularization effect that does not fade during the whole learning process; as a result, the potential ill-posed problem is completely overcome and persistent stability is maintained in all online learning stages. (ii) Moreover, to better track the dynamic behaviors of a complex time-varying system, a new adaptive FF method under the condition of generalized regularization is derived and incorporated into the AFGR-OSELM algorithm. In the sequential learning process of the AFGR-OSELM, the FF is adaptively tuned in a recursive way to suit the dynamic changes of the time-varying system in time; therefore, a desirable prediction performance of the algorithm can be anticipated. (iii) The effectiveness and practicability of the AFGR-OSELM algorithm are evaluated on both artificial and real-world data sets, and the results show that the proposed algorithm obtains superior prediction accuracy with better stability compared with other representative models.
The rest of the paper is organized as follows. The related works, including OSELM, R-OSELM, and FR-OSELM, are revisited in Section 2. Section 3 presents the details of the proposed AFGR-OSELM. In Section 4, the performance of the proposed AFGR-OSELM is demonstrated by six illustrative examples. Finally, the conclusions and future work are given in Section 5.

OSELM and Related Algorithms
In this section, we provide a brief review of the related works, including OSELM, R-OSELM, and FR-OSELM. For simplicity, all of these OSELM-related algorithms are considered for regression with a single output.

OSELM.
OSELM [6] is a sequential implementation of the batch ELM algorithm [7], which was originally developed from the study of single-hidden-layer feedforward networks (SLFNs). For N arbitrary distinct samples (x_j, t_j) ∈ R^d × R, an SLFN with L hidden nodes is mathematically modeled as

f(x_j) = Σ_{i=1}^{L} β_i h_i(x_j) = Σ_{i=1}^{L} β_i G(a_i, b_i, x_j),  j = 1, ..., N,

where a_i is the weight vector connecting the ith hidden node and the input nodes, b_i is the threshold of the ith hidden node, and β_i is the weight connecting the ith hidden node and the output node; h_i(x_j) = G(a_i, b_i, x_j) denotes the output function of the ith hidden node with respect to the input x_j.
SLFNs can approximate these N samples with zero error, which means that there exist (a_i, b_i) and β_i such that

Σ_{i=1}^{L} β_i G(a_i, b_i, x_j) = t_j,  j = 1, ..., N.

The above N equations can be written compactly as

Hβ = T,

where H is called the hidden layer output matrix of the network; the jth row of H is the output vector of the hidden layer with respect to input x_j, and the ith column of H is the ith hidden node's output vector with respect to inputs x_1, x_2, ..., x_N. It has been mathematically proved in [7] that SLFNs with random hidden nodes have the universal approximation capability; the hidden nodes can be randomly generated independent of the training data and remain fixed, so the hidden layer output matrix H is a constant matrix. Thus, training an SLFN is simply equivalent to finding a least squares solution β̂ of the linear system Hβ = T:

‖Hβ̂ − T‖ = min_β ‖Hβ − T‖,

where ‖·‖ is a norm in Euclidean space. The ELM adopts the smallest-norm least squares solution of the above linear system as the output weights; that is,

β̂ = H†T, (6)

where H† is the Moore–Penrose generalized inverse of the matrix H. If HᵀH is nonsingular, then (6) can be further written as

β̂ = (HᵀH)⁻¹HᵀT. (7)

Furthermore, to suit the online learning scenario, an online version of ELM named OSELM was developed to learn the training samples successively and incrementally. The learning procedure of OSELM consists of an initialization phase and a subsequent sequential learning phase, and the one-by-one OSELM is summarized as follows.
In the initialization phase, given an initial training set Ω_{k−1} = {(x_j, t_j) | j = 1, ..., k − 1}, according to (7) the initial output weights are given by

β_{k−1} = P_{k−1} H_{k−1}ᵀ T_{k−1}, (8)

where P_{k−1} = (H_{k−1}ᵀ H_{k−1})⁻¹, T_{k−1} = [t_1, ..., t_{k−1}]ᵀ, and H_{k−1} is the hidden layer output matrix of the initial training set. In the sequential learning phase, the RLS algorithm is used to update the output weights in a recursive way. Suppose now that we receive another sample (x_k, t_k); the corresponding partial hidden layer output matrix (here a row vector) is calculated as

h_k = [G(a_1, b_1, x_k), ..., G(a_L, b_L, x_k)],

and then the output weight update equations are determined by

P_k = P_{k−1} − P_{k−1} h_kᵀ h_k P_{k−1} / (1 + h_k P_{k−1} h_kᵀ),
β_k = β_{k−1} + P_k h_kᵀ (t_k − h_k β_{k−1}). (9)

As seen from (9), the output weights of OSELM are recursively updated based on the intermediate results of the last iteration and the newly arrived data, which can be discarded immediately as soon as they have been learnt, so the computation overhead and the memory requirement of the algorithm are greatly reduced. The above one-by-one OSELM algorithm can be easily extended to the chunk-by-chunk type.
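The one-by-one RLS update described above can be sketched in a few lines. The following NumPy snippet is an illustrative sketch of the recursion, not the authors' code; the variable names and toy shapes are assumptions.

```python
import numpy as np

def oselm_update(P, beta, h, t):
    """One step of the OSELM recursive least squares update.
    P    : (L, L) running inverse autocorrelation matrix (H^T H)^(-1)
    beta : (L,)   current output weights
    h    : (L,)   hidden layer output vector of the new sample
    t    : float  target of the new sample
    """
    Ph = P @ h
    P = P - np.outer(Ph, Ph) / (1.0 + h @ Ph)  # Sherman-Morrison rank-1 update
    beta = beta + P @ h * (t - h @ beta)       # correct by the a priori error
    return P, beta
```

After processing all samples, `beta` coincides with the batch least squares solution over everything seen so far, which is the defining property of the RLS recursion.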

R-OSELM.
Though the OSELM algorithm seems perfect in theory, it still has deficiencies when directly applied to real-world applications. As seen from (7) and (8), the derivation of OSELM is based on the assumption that the autocorrelation matrix HᵀH is nonsingular; however, this assumption often does not hold in practical applications. Once the singular or ill-posed problem occurs, the generalization performance of OSELM deteriorates significantly. In order to overcome this problem, Huynh and Won [16] proposed a regularized OSELM (R-OSELM) using Tikhonov regularization. The learning procedure of the R-OSELM is almost the same as that of the OSELM; a regularization term is simply added to the autocorrelation matrix HᵀH to avoid the singular or ill-posed problem and thereby improve the stability of the algorithm. The R-OSELM can be briefly retold as follows.
For the initial training set Ω_{k−1} = {(x_j, t_j) | j = 1, ..., k − 1}, the initial output weights can be written as

β_{k−1} = P_{k−1} H_{k−1}ᵀ T_{k−1},  P_{k−1} = (δI + H_{k−1}ᵀ H_{k−1})⁻¹, (10)

where δ is a positive real value called the regularization parameter and I is an identity matrix with the same dimensions as H_{k−1}ᵀ H_{k−1}.
Next, in the sequential learning process, whenever a new sample (x_k, t_k) arrives, the hidden layer output vector h_k is computed, and the current output weights are recursively updated as

P_k = P_{k−1} − P_{k−1} h_kᵀ h_k P_{k−1} / (1 + h_k P_{k−1} h_kᵀ), (11)
β_k = β_{k−1} + P_k h_kᵀ (t_k − h_k β_{k−1}). (12)

Comparing (11) and (12) with (9), although the finally obtained output weight update equations of R-OSELM look the same as those of OSELM, we should keep in mind that in the sequential learning procedure of R-OSELM the regularization term δI is embedded in the recursive formulas through the iterative computations (see (11)). Besides, if the regularization parameter δ equals zero, then the R-OSELM degenerates to the original OSELM.
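In code, the only difference from plain OSELM is the initialization: the regularization term is folded into P once at the start and then carried through the unchanged recursion. A minimal NumPy sketch under assumed names (`delta` denotes the regularization parameter):

```python
import numpy as np

def roselm_init(H0, t0, delta):
    """R-OSELM initialization: Tikhonov-regularized batch solve.
    delta : small positive regularization parameter.
    The subsequent sequential updates are identical to plain OSELM;
    the delta*I term stays embedded in P through the recursion."""
    L = H0.shape[1]
    P = np.linalg.inv(delta * np.eye(L) + H0.T @ H0)
    beta = P @ H0.T @ t0
    return P, beta
```

Running the ordinary rank-one RLS updates from this starting point reproduces the ridge solution (δI + HᵀH)⁻¹HᵀT over all data seen so far.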

FR-OSELM.
For a time-varying system, the system behaviors often evolve over time; correspondingly, the new samples are more effective in reflecting the changing trends of the system than the old ones. To better express the timeliness of the samples and further improve the prediction performance for the time-varying system, the FF method as well as the regularization technique is introduced into the OSELM, resulting in the FR-OSELM algorithm [24]. Different from the OSELM and the R-OSELM, which treat all the training samples equally, the FR-OSELM assigns a different weight to each sample according to its time order. That is, the recent samples are assigned higher weights while the old samples are assigned lower weights, so as to represent their different contributions to the learning model.
The implementation of FR-OSELM can be briefly stated as follows. Suppose we have obtained the initial output weights β_{k−1} using (10), and a new training sample (x_k, t_k) arrives; then the corresponding β_k is originally expressed as

β_k = (A_{k−1} + h_kᵀ h_k)⁻¹ (A_{k−1} β_{k−1} + h_kᵀ t_k). (13)

Considering the different effectiveness of the newly arrived sample and the old samples to the model, (13) can be rewritten as

β_k = (λ A_{k−1} + h_kᵀ h_k)⁻¹ (λ A_{k−1} β_{k−1} + h_kᵀ t_k), (14)

where λ ∈ (0, 1] is called the forgetting factor (FF), which is used to weaken the influence of the old samples and indirectly enhance the effect of the latest one, so as to better depict the current state of the time-varying system. Let

A_k = λ A_{k−1} + h_kᵀ h_k (15)

and invert (15) on both sides; we have

P_k = A_k⁻¹ = (λ P_{k−1}⁻¹ + h_kᵀ h_k)⁻¹. (16)

Applying the Sherman–Morrison formula [25] to (16) yields

P_k = (1/λ) [P_{k−1} − P_{k−1} h_kᵀ h_k P_{k−1} / (λ + h_k P_{k−1} h_kᵀ)]. (17)

Substituting (15) into (14), we obtain

β_k = β_{k−1} + P_k h_kᵀ (t_k − h_k β_{k−1}). (18)

Equations (17) and (18) give the recursive formulas for updating the output weights of FR-OSELM. When λ = 1, the FR-OSELM degenerates to the general R-OSELM. When λ = 1 and δ = 0, the FR-OSELM degenerates to the original OSELM.
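The forgetting-factor recursion differs from the R-OSELM step only in the λ-weighted denominator and a final 1/λ rescaling. The following NumPy sketch illustrates this under the notation above; it is an assumed implementation, not the authors' code.

```python
import numpy as np

def froselm_update(P, beta, h, t, lam):
    """One FR-OSELM step: forgetting-factor RLS, lam in (0, 1].
    Old samples are exponentially down-weighted; lam = 1 recovers
    the R-OSELM update."""
    Ph = P @ h
    P = (P - np.outer(Ph, Ph) / (lam + h @ Ph)) / lam
    beta = beta + P @ h * (t - h @ beta)
    return P, beta
```

One can verify that after k steps `beta` equals A_k⁻¹ b_k, where A_k and b_k are the exponentially weighted correlation matrix and cross-correlation vector.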

Proposed AFGR-OSELM
In this section, we give the detailed implementation of the proposed OSELM with generalized regularization and adaptive forgetting factor (AFGR-OSELM). We first theoretically analyze the inherent defect of the FR-OSELM; then an improved FGR-OSELM algorithm with generalized regularization is presented to solve this issue; next, a new adaptive FF method under the condition of generalized regularization is derived to equip the FGR-OSELM for better tracking of the time-varying system; finally, the AFGR-OSELM is proposed and summarized.

The "Conflict" of Regularization and Forgetting Factor in FR-OSELM.
The FR-OSELM is developed from the OSELM by incorporating the regularization technique in conjunction with the forgetting mechanism for time-varying system prediction. With the help of the FF, the FR-OSELM is effective in tracking the dynamic changes of the time-varying system and tends to provide good prediction results in a time-varying environment. On the other hand, however, due to the impact of the FF, the regularization effect of the FR-OSELM is also forgotten gradually as time goes on; as a result, the FR-OSELM may, in the long run, encounter an ill-posed problem similar to that of the OSELM. Further theoretical analysis is as follows.
As described in Section 2.3, the derivation of the FR-OSELM is based on the intuitive analysis that, in a time-varying environment, the new and the old samples contribute differently to the learning model and should be weighted discriminatively. In theory, the FR-OSELM algorithm is equivalent to minimizing the following least squares cost function with FF and regularization term:

J_FR(β_k) = Σ_{j=1}^{k} λ^{k−j} (t_j − h_j β_k)² + δ λ^k ‖β_k‖², (19)

where λ and δ are the FF and the regularization parameter, respectively. Applying the RLS method [21] to solve (19), we obtain the recursive solution of β_k:

P_k = (1/λ) [P_{k−1} − P_{k−1} h_kᵀ h_k P_{k−1} / (λ + h_k P_{k−1} h_kᵀ)],
β_k = β_{k−1} + P_k h_kᵀ (t_k − h_k β_{k−1}), (20)

which is completely identical to the results in FR-OSELM (see (17) and (18)).
From (19), we can find that the FR-OSELM equivalently uses a special exponential forgetting regularization term δλ^k‖β_k‖² in the cost function. This special scheme makes the deduced recursion formulas of FR-OSELM (see (20)) as concise as those of the original OSELM (see (9)) without extra computational burden; in other words, the computational complexity of the FR-OSELM algorithm at each iteration is the same as that of the original OSELM, that is, O(L²), where L is the number of hidden nodes. On the other hand, however, it also brings a very unfavorable side effect. We have known that, in a time-varying environment, λ is often assigned a value slightly less than 1 for better tracking; in this context, λ^k is monotonically decreasing and tends to zero for large k, which means that the regularization effect of δλ^k‖β_k‖² fades gradually with time and eventually becomes completely disabled. In conclusion, the FR-OSELM has only a short-term regularization effect within a limited period, and it may still encounter the ill-posed problem in the long term and run unstably.
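The fading effect is easy to reproduce numerically. In the assumed toy setup below, the hidden outputs are confined to a one-dimensional subspace, so the weighted correlation matrix is rank-deficient; an initial regularizer that is forgotten along with the data (the FR-OSELM situation) leaves an almost singular matrix after a few hundred steps, whereas a constant regularizer added outside the forgetting recursion keeps it well conditioned:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, reg = 0.95, 1e-4
A_fading = reg * np.eye(2)   # regularizer subject to forgetting (FR-OSELM style)
A_data = np.zeros((2, 2))    # raw exponentially weighted data correlation
for k in range(400):
    h = np.array([1.0, 1.0]) * rng.normal()  # rank-deficient direction
    A_fading = lam * A_fading + np.outer(h, h)
    A_data = lam * A_data + np.outer(h, h)
cond_fading = np.linalg.cond(A_fading)
cond_constant = np.linalg.cond(A_data + reg * np.eye(2))  # constant-reg style
print(cond_fading, cond_constant)  # the fading variant is many orders worse
```

The specific dimensions and parameter values are illustrative; the qualitative gap in condition number is what matters.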

An Improved FR-OSELM with Generalized Regularization (FGR-OSELM).
In order to effectively overcome the abovementioned problem of the FR-OSELM and further improve the stability and practicality of the algorithm, we introduce an improvement of FR-OSELM, named FGR-OSELM, by using a generalized regularization approach. Different from the FR-OSELM, which uses the special exponential forgetting regularization term δλ^k‖β_k‖² in the cost function, the FGR-OSELM adopts a more generalized ℓ2 regularization term δ‖β_k‖² instead. That is, the cost function of FGR-OSELM is written as

J_FGR(β_k) = Σ_{j=1}^{k} λ^{k−j} (t_j − h_j β_k)² + δ ‖β_k‖². (21)

It is clearly seen from (21) that the new generalized regularization scheme uses a constant δ as the regularization coefficient, which is completely independent of the FF; thus its regularization effect remains constant and no longer fades with time, which ensures persistent stability of the FGR-OSELM in all online learning stages.
Next, we derive a recursive solution of (21) in the sequential learning scenario. Because the cost function (21) is equipped with a generalized regularization term, the derivation of FGR-OSELM is quite different from that of the FR-OSELM algorithm. The optimal output weights for minimizing (21) can be obtained by setting the first-order partial derivative of J_FGR(β_k) with respect to β_k to zero; this yields (22), with the quantities defined in (23) and (24). Defining (25), then (23) becomes (26). Applying the Sherman–Morrison–Woodbury formula and the Sherman–Morrison formula [25] to (25) and (26), respectively, we get (27) and (28), which can be respectively rewritten as (29) and (30). Define g*_k and g_k as in (31); we can then rewrite (29) and (30) as (32) and (33). Also, (32) can be transformed and rewritten as (34), from which g_k can also be expressed as (35). From (22) and (24), we have (36) and (37); substituting (23) into (37), we obtain (38). Equations (31), (33), (32), (34), and (38) give the whole set of recursive formulas for updating the output weights of FGR-OSELM. When λ = 1, the FGR-OSELM degenerates to the general R-OSELM. When λ = 1 and δ = 0, the FGR-OSELM degenerates to the original OSELM.
Since the FGR-OSELM adopts a generalized regularization term in the cost function, the obtained recursive formulas are more complex than those of the FR-OSELM (see (20)). Moreover, in (31) we need to calculate the inverse of a matrix of dimension L (L is the number of hidden nodes). Fortunately, the computational complexity of (31) can be greatly reduced with an approximation approach [26]. In the FGR-OSELM, the regularization parameter δ is used only to avoid the ill-posed problem, and an inappropriately large value of δ may equivalently introduce extra noise and negatively affect model specification; hence, in practice, δ is usually set to a small value on the level of 10⁻⁴ or even smaller. Additionally, in a time-varying environment, the FF λ is often chosen as a positive constant slightly less than 1; then (1 − λ)/λ is a very small value, so we can approximate ((1 − λ)/λ)² ≈ 0, and it follows that (39) holds, from which we further get (40). Therefore, (31) can be approximately expressed as (41). Now, the computational complexity of updating g*_k is much reduced by avoiding the matrix inversion.
Similar to the R-OSELM, the second term of (21) can be extended to other types of norms instead of the ℓ2-norm. In the case of a weighted-norm complexity measure, a symmetric positive definite matrix Γ is added to the cost function and adopted as a new machine complexity measure [16]. Using the same derivation steps as above, the learning formulas for the online sequential learning process under the weighted-norm complexity measure can be obtained; they are essentially the same as those of the ℓ2-norm except that the matrix Γ appears in some parts of the learning formulas. In general, the symmetric positive definite matrix Γ is often defined as the identity matrix [9, 20, 24], in which case the weighted norm degenerates to the ℓ2-norm.

Adaptive Forgetting Scheme for FGR-OSELM.
The FF plays an important part in adapting the FR-OSELM [24] and analogous models [27, 28] to the system dynamics and thus greatly improves the performance of these models on time-varying systems. However, in practice it is very difficult or even impractical to compute an appropriate predetermined value for the FF without a priori knowledge [29]. Moreover, an a priori selection of the FF may not guarantee global adaptation to the system dynamics in a complex time-varying environment. Hence, a variable FF (VFF) strategy is a more attractive choice in real-world applications. Up to now, many VFF-RLS algorithms have been proposed and successfully applied in adaptive signal processing [29][30][31], but most of these algorithms are derived on the basis of domain knowledge specific to signal processing, such as the signal-to-noise ratio, which is not applicable in general applications. Comparatively, a more universal RLS algorithm with adaptive FF (AFF-RLS) was presented in [21], in which the FF is adaptively tuned using a gradient-based method without the need for much a priori knowledge. At present, the AFF-RLS has been successfully applied to the OSELM for time-varying nonlinear system identification [22] and fault prediction [24], and it has shown good tracking performance in time-varying environments. However, the AFF-RLS algorithm was originally derived in the situation of the special exponential forgetting regularization, which is not applicable in the generalized regularization scenario. In this section, a new adaptive FF method under the condition of generalized regularization is derived to accommodate our FGR-OSELM.
With the concept of adaptive forgetting, we wish the FF in FGR-OSELM to be updated recursively so as to adjust itself to the time-varying environment. In this case, the objective is to calculate the FF λ that minimizes the mean square of the a priori estimation error [21, 22]; that is, the optimized cost function is written as (42), where E[·] is the expectation operator and e_k = t_k − h_k β_{k−1} is the a priori estimation error. Differentiating the cost function J′(λ) with respect to λ yields (43), where the vector ψ_k = ∂β_k/∂λ denotes the gradient of the output weights β_k with respect to λ (44). Let S_k = ∂P_k/∂λ denote the derivative of the inverse correlation matrix P_k with respect to λ (45); combining (36), (38), and (45) into (44), the update equation of ψ_k can be expressed as (46). To compute ψ_k, we now only need to obtain S_k, and a recursion for updating S_k can be derived as follows. Substituting (36) into (34), we get (47). The derivative of (47) with respect to λ is obtained as (48). By exchanging and collecting terms, (48) can be expressed as (49). Applying the Sherman–Morrison formula [25] to the last term of (49), we have (50). Let S*_k denote the derivative of P*_k with respect to λ; combined with (33), we get (51). The term ∂g*_k/∂λ in (51) can be obtained by differentiating (41) with respect to λ, followed by some calculations, as (52). Substituting (52) into (51), we finally obtain (53); then S_k can be calculated with (50), and next ψ_k can be calculated with (46). Provided that an estimate of the scalar gradient ∇_λ J(λ) in (43) is available, in a similar manner as [21] we may adaptively compute λ using the recursion (54). Using the instantaneous estimate −e_k h_k ψ_{k−1} for ∇_λ J(λ) on the basis of (43), the complete expression for updating the FF is given as

λ_k = [λ_{k−1} + η e_k h_k ψ_{k−1}]_{λ−}^{λ+}, (55)

where η is a small, positive learning-rate parameter and the bracket denotes a truncation operator with upper bound λ+ and lower bound λ−, respectively. In practice, λ+ may be set slightly less than 1 to ensure that some degree of forgetting is always present, while the
setting of λ− is problem specific, and it is advisable not to set λ− to a value that is too small, as this may produce numerical instability [21, 32].
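The truncated gradient step itself is a one-liner. The following hedged sketch shows only this clipping update; the recursions that maintain the gradient vector ψ (via S_k) are omitted, and all parameter names and default bounds are chosen for illustration.

```python
import numpy as np

def update_ff(lam_prev, eta, e_k, h_k, psi_prev, lam_min=0.8, lam_max=0.999):
    """Truncated adaptive-FF step:
    lam_k = clip(lam_{k-1} + eta * e_k * (h_k . psi_{k-1}), lam_min, lam_max),
    where e_k is the a priori error and psi approximates d(beta)/d(lam).
    The recursion maintaining psi (through S_k) is not shown here."""
    lam = lam_prev + eta * e_k * float(h_k @ psi_prev)
    return min(max(lam, lam_min), lam_max)
```

When the recent a priori errors correlate with the gradient direction, λ is pushed down (faster forgetting); otherwise it drifts back toward the upper bound.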
Summary of the AFGR-OSELM Algorithm.
Incorporating the generalized regularization approach and the adaptive forgetting scheme into the OSELM simultaneously, the new AFGR-OSELM is obtained. Similar to the original OSELM algorithm, the proposed AFGR-OSELM algorithm consists of an initial batch learning phase and a subsequent sequential learning phase, which are summarized as follows.

(1) Initialization Phase. Given an initial training set: (a) randomly generate the hidden node parameters (a_i, b_i), i = 1, ..., L; (b) calculate the initial hidden layer output matrix; (c) calculate the initial output weights β_{k−1} according to (10).

(2) Sequential Learning Phase. First set the initial values for S_{k−1}, ψ_{k−1}, λ_{k−1}, λ+, λ−, and η; then, for each sequentially arriving observation (x_k, t_k): (a) calculate the hidden layer output vector h_k; (b) update g*_k and g_k recursively; (c) calculate the current output weights β_k; (d) calculate the current FF λ_k and update S_k, ψ_k recursively; (e) set k = k + 1 and go to Step (2).

Computational Complexity Analysis.
In the sequential learning process of the AFGR-OSELM algorithm, when a new observation arrives, the most costly steps are the calculations of the current output weights and the forgetting factor with (60) and (61), respectively. Assuming that arithmetic with individual elements has complexity O(1), the complexity of addition or subtraction of two n × n matrices is O(n²), and the complexity of multiplying two n × n matrices is O(n³) [25]. In (60), since P_k, P*_k, g*_k are L × L matrices and g_k, h_kᵀ, β_k are L × 1 vectors, the computational complexity of updating β_k can be estimated as O(L³). Similarly, in (61), P_k, P*_k, g*_k, S_k, S*_k are L × L matrices and g_k, h_kᵀ, ψ_k, β_k are L × 1 vectors; then the computational complexity required to calculate λ_k and update S_k, ψ_k is also O(L³). Summarizing the above analysis, it can be concluded that the total computational complexity of the AFGR-OSELM algorithm for online learning is O(L³), which is a little higher than the complexity of the original OSELM (O(L²)).
Since the computational complexity of the AFGR-OSELM algorithm is mainly determined by matrix multiplication, by using Strassen's matrix multiplication method [25] the number of multiplications can be reduced to O(L^2.807); that is, the computational complexity of the AFGR-OSELM algorithm can likewise be reduced to O(L^2.807). Moreover, with a proper approximation approach [22], the computational complexity of AFGR-OSELM may be further reduced, which we leave as future work.

Simulation Experiments and Performance Evaluation
In this section, the performance of the proposed AFGR-OSELM is evaluated on a simulated time-varying system, a parameter-varying chaotic time series system, two stock price data sets, and two real-world industrial data sets with dynamic behaviors. The experimental results of the proposed algorithm are also compared with those of the OSELM [6], R-OSELM [16], FR-OSELM [24], and DFF-OSELM [23]. All five algorithms use the same sigmoidal additive activation function G(a, b, x) = 1/(1 + exp(−(a · x + b))), where the input weights a and the biases b are randomly selected from the range [−1, 1]. In our experiments, all the models are evaluated for predicting time-varying systems in an online scenario; that is, after the initialization phase the samples are fed to the predictors incrementally one-by-one, while the predictors are updated continually during the process of online prediction. For each case, the reported experimental results are the average values of 30 independent runs on the online data, and the prediction performance is measured by the root mean squared error (RMSE), defined as

RMSE = sqrt( (1/n) Σ_{j=1}^{n} (t_j − t̂_j)² ),

where t_j and t̂_j are the real value and the predicted value, respectively, and n is the number of testing samples. The experiments were carried out in the MATLAB R2010b environment running on an ordinary PC with a 3.4 GHz CPU and 4 GB of RAM.
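For reference, the RMSE metric used throughout the experiments can be computed as follows (a trivial sketch; the function name is an assumption):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between real and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```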

Prediction of a Simulative Time-Varying System.
In the first experiment, an artificial data set that exhibits a gradual evolution in each sample is considered to simulate the dynamic behaviors of a time-varying system. The data set has a total of 2200 samples; for each sample, the input consists of four random variables that are normally distributed with mean zero and standard deviation 0.1, and the output is the inner product of the input vector and the corresponding coefficient vector. In order to construct a time-varying system, the coefficients change continuously over time [1]. Specifically, we first fix the initial coefficient vector, and subsequently these values evolve over time, where k indicates the index of the sample and i indicates the component of the coefficient vector, i = 1, 2, 3, 4. To make the prediction task realistic, normally distributed white noise with a level of 0.01 is added to the output.
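A generator in the spirit of this data set can be sketched as below. The paper's exact initial coefficient vector and evolution law are given by its own equations; the unit initial coefficients and the random-walk drift used here are purely illustrative assumptions.

```python
import numpy as np

def make_timevarying_data(n=2200, d=4, noise=0.01, drift=0.001, seed=0):
    """Drifting linear system: y_k = x_k . w_k + white noise,
    with the coefficient vector w_k evolving a little at every step.
    The initial coefficients and the drift law are assumptions, not
    the paper's exact values."""
    rng = np.random.default_rng(seed)
    w = np.ones(d)                         # assumed initial coefficient vector
    X = rng.normal(0.0, 0.1, size=(n, d))  # inputs ~ N(0, 0.1) as in the text
    y = np.empty(n)
    for k in range(n):
        y[k] = X[k] @ w + rng.normal(0.0, noise)
        w = w + drift * rng.normal(size=d)  # coefficients evolve over time
    return X, y
```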
All five algorithms, OSELM, R-OSELM, FR-OSELM, DFF-OSELM, and AFGR-OSELM, are run to model and predict the simulated time-varying system in an online mode. Since the hidden node number L is an important parameter that may greatly affect the learning accuracy and generalization performance of the neural network, for each algorithm we set L = 20, L = 50, and L = 100 in turn for the experiments, and the corresponding numbers of training data for initialization are taken as 50, 100, and 200, respectively. The other parameters of each algorithm are set as follows: for R-OSELM, δ = 10⁻⁸; for FR-OSELM, δ = 10⁻⁸ and λ = 0.98; for DFF-OSELM, all the parameters are set as in [23], and an extra regularization parameter δ = 10⁻⁸ is added to stabilize the algorithm; for our AFGR-OSELM, δ = 10⁻⁸, η = 0.1, λ+ = 0.999, λ− = 0.8, the initial value of λ is set to 0.995, and S and ψ are simply initialized to the identity matrix and the zero vector, respectively.
Table 1 presents the means and standard deviations (SD) of the prediction RMSE over 30 independent trials of the five models for performing different prediction steps on the simulated time-varying system. We can see from Table 1 that the performance of the OSELM is very unstable and highly sensitive to the hidden node number L: when L = 20, the OSELM is able to provide effective prediction results, while in the cases of L = 50 and L = 100 the OSELM is likely to produce prediction RMSE far beyond any reasonable range, which are denoted as "×" in the table (similarly hereinafter). The reason is that in the sequential learning process of OSELM the autocorrelation matrix H^T H may be singular or ill-conditioned at certain instances; the recursive calculation of (H^T H)^-1 is then meaningless, and the algorithm is apt to produce unreliable results. By using the regularization technique, the R-OSELM successfully avoids the potential ill-posed problem, so it obtains stable and effective prediction results in all situations. Although the FR-OSELM adopts a similar regularization technique to overcome the ill-posed problem, its regularization effect decays gradually over time due to the forgetting factor, and the potential ill-posed problem may still occur in the long run; hence the FR-OSELM is workable only within short prediction horizons and fails for longer-term prediction in the cases of L = 50 and L = 100. In contrast, our AFGR-OSELM maintains persistent stability by using a generalized regularization approach and achieves stable and reliable performance throughout. Moreover, with the help of the adaptive forgetting mechanism, the AFGR-OSELM model can effectively track the dynamic changes of the time-varying system and discard the negative influence of outdated samples in time, so it obtains the lowest prediction RMSE among the five models in almost all cases.
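The stabilizing effect of regularization on the recursive update can be illustrated with a small sketch: a ridge term λI initializes the inverse autocorrelation matrix so that it is always well defined, and each new sample is then absorbed with a rank-one (Sherman-Morrison) update. This is a generic regularized recursive least-squares step on the hidden-layer activations, in the spirit of R-OSELM, not the paper's exact formulation.

```python
import numpy as np

class RegularizedSequentialLinear:
    """Sketch of a regularized recursive update on hidden activations h(x);
    the ridge term lam*I keeps the autocorrelation matrix invertible."""

    def __init__(self, n_hidden, lam=1e-8):
        self.P = np.eye(n_hidden) / lam   # P = (lam*I)^-1 before any data
        self.beta = np.zeros(n_hidden)    # output weights

    def update(self, h, t):
        """Absorb one sample with hidden activation h and target t."""
        Ph = self.P @ h
        self.P -= np.outer(Ph, Ph) / (1.0 + h @ Ph)   # Sherman-Morrison step
        self.beta += self.P @ h * (t - h @ self.beta)  # innovation correction

    def predict(self, h):
        return h @ self.beta
```

Without the λI term, P would have to be built from (H^T H)^-1 directly, which is exactly the quantity that becomes meaningless when H^T H is singular; with it, every recursive step stays well posed.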
To intuitively compare the stability and prediction accuracy of the five OSELM-related models, typical absolute error (the difference between the real value and the predicted value) graphs of each model for performing 2000-step online prediction in the case of L = 100 are shown in Figure 1. Since the prediction errors of the FR-OSELM become very large and completely meaningless once the prediction step exceeds a certain limit, we present only the reasonable partial results within the first 300 steps. Figure 1(a) clearly shows a few large peaks at certain instances, revealing the instability of the OSELM algorithm. By comparison, the R-OSELM is a more robust algorithm, and the corresponding error graph (Figure 1(b)) is quite stable without large fluctuations. Nevertheless, the R-OSELM is essentially not a targeted method for tracking the dynamic behaviors of a time-varying system, so its prediction errors increase gradually with time. Figure 1(c) shows that the FR-OSELM is workable only within short prediction horizons and then fails rapidly once the prediction step exceeds 270; these experimental results are completely consistent with the theoretical analysis given in Section 3.1. Comparing Figures 1(b), 1(d), and 1(e), we can see that the DFF-OSELM and the AFGR-OSELM, equipped with forgetting mechanisms, have better tracking abilities and achieve smaller prediction errors than the R-OSELM without forgetting; we can also see that the AFGR-OSELM behaves better than the DFF-OSELM. In short, our AFGR-OSELM provides more stable and accurate results than the state-of-the-art models for predicting the time-varying system.

Prediction of the Logistic System with Time-Varying Parameter.
In actual engineering, system parameters may change over time; consequently, the corresponding system exhibits complex, changing characteristics as time goes on. Predicting a system with a time-varying parameter is therefore a practical and important task. Here we take the Logistic system with a time-varying parameter as an example to verify the performance of the AFGR-OSELM, considering the Logistic map whose control parameter changes gradually over time [33]. In this simulation, a Logistic chaotic time series of length 4000 is generated by (65) with initial value x(1) = 0.512 and initial parameter value 3.6. To reduce the transient effect, the first 1000 values are discarded and the last 3000 values are kept for the experiment. To make the prediction task realistic, Gaussian white noise with a level of 0.01 is added to the generated time series. The embedding dimension and time delay for phase space reconstruction are chosen as 4 and 1, respectively.
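The series generation and the phase space reconstruction (embedding dimension 4, delay 1) can be sketched as follows. The exact parameter schedule of the map is given by (65) in the paper and is not reproduced here, so the linear ramp of the control parameter from 3.6 to 4.0 used below is only an illustrative assumption.

```python
import numpy as np

def logistic_series(n=4000, x0=0.512, mu0=3.6, mu1=4.0):
    """Parameter-varying Logistic map x(k+1) = mu(k) * x(k) * (1 - x(k)).
    The paper's parameter schedule is not reproduced; a linear ramp of the
    control parameter from mu0 to mu1 stands in for it."""
    mu = np.linspace(mu0, mu1, n)
    x = np.empty(n)
    x[0] = x0
    for k in range(n - 1):
        x[k + 1] = mu[k] * x[k] * (1.0 - x[k])
    return x[1000:]          # discard the transient, keep the last 3000 values

def delay_embed(series, dim=4, delay=1):
    """Phase-space reconstruction: each input is dim delayed values and the
    target is the next value of the series."""
    span = dim * delay
    X = np.array([series[i:i + span:delay] for i in range(len(series) - span)])
    y = series[span:]
    return X, y
```

The same `delay_embed` routine covers the later stock price experiments as well, where the embedding dimension is 15 instead of 4.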
With the generated Logistic chaotic time series, the performance of the AFGR-OSELM for different prediction steps is evaluated and compared with those of the OSELM, R-OSELM, FR-OSELM, and DFF-OSELM. In this simulation, the experimental design and the parameter settings for each model are the same as those in the first simulation in Section 4.1, except that one tuning parameter of the AFGR-OSELM takes a different value (0.05). The means and SD of the prediction RMSE over 30 independent trials of the five models are given in Table 2. As shown in Table 2, on the whole, the prediction behaviors of the five models in this simulation are basically the same as those in the first simulation. In more detail, the OSELM and the FR-OSELM are quite unstable due to the intrinsic ill-posed problem, and the two algorithms are workable only in specific situations. In contrast, the stability of the R-OSELM and the AFGR-OSELM is greatly improved by the nonfading regularization approach, and they can always provide reliable prediction results. Moreover, with the help of the adaptive forgetting mechanism, the AFGR-OSELM is capable of depicting the dynamic characteristics of the parameter-varying chaotic system more accurately; thus it achieves better prediction performance than the R-OSELM. In addition, we notice that the DFF-OSELM does not behave well in this simulation; a possible reason is that the default parameters recommended by [23] are not suitable for this task, which may imply that the DFF-OSELM is problem-specific.
Similarly, for an intuitive comparison, we present the typical absolute error graphs of the five models for performing 2000-step online prediction in the case of L = 100, and the results are shown in Figure 2. For the reason stated above, only partial results of the FR-OSELM are shown. The experimental results and the corresponding analyses behind Figure 2 are basically the same as those in Figure 1, which have been expounded before, so we do not repeat them here.

Stock Price Prediction.
Stock price prediction is an important and challenging task in the field of economics. Since many of the factors that affect stock prices change with time, stock price data often exhibit time-varying characteristics. Two stock price data sets are used here for the experiment (obtained from https://www.pmel.noaa.gov/tao/drupal/disdel/), each containing 3000 stock prices at sequential time points. For the two stock price time series, the embedding dimension and time delay for phase space reconstruction are chosen as 15 and 1, respectively [17].
All five algorithms, OSELM, R-OSELM, FR-OSELM, DFF-OSELM, and AFGR-OSELM, are applied to model and predict the stock prices in an online mode. In this simulation, the experimental design and the parameter settings for each algorithm are the same as those in the first simulation in Section 4.1, except that a different value of the regularization parameter (10^-4) is used in the four regularized models, R-OSELM, FR-OSELM, DFF-OSELM, and AFGR-OSELM, for better stability.
Tables 3 and 4 present the means and SD of the prediction RMSE over 30 independent trials of the five models for performing different prediction steps on the two stock price data sets. From Tables 3 and 4, we can see that the OSELM is very unstable, and its means and SD of the prediction RMSE are much larger than those of the other models. Similarly, since its regularization effect fades gradually with time, the FR-OSELM is workable only within short prediction horizons and fails for longer-term prediction. In contrast, the proposed AFGR-OSELM is very stable thanks to the generalized regularization method and can always provide reliable prediction results. In addition, compared with the R-OSELM and the DFF-OSELM, our AFGR-OSELM achieves better prediction performance in almost all cases.

Prediction of Industrial Systems with Time-Varying Behaviors.
To demonstrate the effectiveness and practicability of the AFGR-OSELM in real applications, two real world industrial data sets are also employed and tested. Most industrial processes exhibit some kind of time-varying behavior, so these industrial data sets are crucial for evaluating the proposed methodologies. One is the Debutanizer column data set with a total of 2394 samples; the other is the Sulfur Recovery Unit (SRU) data set with a total of 10081 samples. The two data sets can be obtained from [34], where more details of them can be found. These two data sets are well known and have been widely used in the literature to evaluate algorithms for dynamic system modeling and prediction. Tests are carried out by comparing the AFGR-OSELM with the OSELM, R-OSELM, FR-OSELM, and DFF-OSELM on the Debutanizer column and SRU data sets. For the two data sets, each model uses the same parameters. On the basis of the suggestions from [23], in our simulations the number of hidden nodes is set to 20 and the number of training data for initialization is set to 100 uniformly for all five models. The other parameter settings for R-OSELM, FR-OSELM, DFF-OSELM, and AFGR-OSELM are the same as those in the first simulation in Section 4.1, except that a different value of the regularization parameter (10^-4) is used in all four models for better stability and a smaller forgetting factor (0.9) is assigned in the FR-OSELM for better performance. According to the size of each data set, different online prediction steps are designed: 200, 500, 1000, 1500, and 2294 for the Debutanizer column, and 2000, 4000, 6000, 8000, and 9981 for the SRU.
Table 5 presents the means and SD of the prediction RMSE over 30 independent trials of the five models for performing different prediction steps on the Debutanizer column; the typical absolute error graphs of all the models for performing 2294-step online prediction are shown in Figure 3. From Table 5 we can see that the OSELM and the R-OSELM achieve similar performance, indicating that the ill-posed problem is not serious in this situation. Although the FR-OSELM obtains good performance within the first 1000 prediction steps with the help of the forgetting mechanism, its prediction RMSE is the worst when the prediction step is 1500 or 2294. The reason is the intrinsic instability of the algorithm itself, which can be clearly verified by the corresponding absolute error graph in Figure 3(c), where several exceptional peaks occur after 1000 steps. From this example we may conclude that the FR-OSELM is worse than the OSELM in stability and may produce unreliable results even in a well-conditioned environment. By comparison, the DFF-OSELM and the AFGR-OSELM, equipped with adaptive forgetting schemes, achieve much better performance, and our AFGR-OSELM works best in both prediction accuracy and stability.
Table 6 shows the means and SD of the prediction RMSE over 30 independent trials of the five models for performing different prediction steps on the SRU, and the typical absolute error graphs of all the models for performing 9981-step online prediction are shown in Figure 4. From Table 6 and Figure 4, it is clear that the OSELM is very unstable, and its means and SD of the prediction RMSE are much larger than those of the other models. Similarly, since its regularization effect fades gradually with time, we can see from Figure 4(c) that the FR-OSELM is workable only within the first 170 prediction steps and fails rapidly after that; the subsequent prediction RMSE are therefore very large and completely invalid, represented as "×" in Table 6. In contrast, our AFGR-OSELM is quite stable and always provides reliable results at all stages, and its prediction RMSE are much smaller than those of the R-OSELM and the DFF-OSELM. From the two real world industrial data sets, we can conclude that the AFGR-OSELM is also very effective in providing stable and superior performance for predicting real systems with time-varying behaviors.
Finally, we report the processing times (CPU time) of the five OSELM-related models for performing online prediction with different prediction steps on the Debutanizer column and the SRU, shown in Tables 7 and 8, respectively. Since the computational complexities of the OSELM, R-OSELM, and FR-OSELM are identical, their prediction times are almost the same in the same cases. By comparison, there is an increase in prediction time for the DFF-OSELM and the AFGR-OSELM because extra calculations are required to tune the forgetting factor recursively during online prediction. Although the AFGR-OSELM is the least efficient of the five models, its stability and prediction accuracy are the best, so the extra cost is worthwhile.

Conclusions and Future Work
In this paper, a new AFGR-OSELM algorithm incorporating generalized regularization and an adaptive forgetting factor has been developed for modeling and predicting time-varying systems in an online mode. By employing a generalized ℓ2 regularization term instead of the traditional exponential forgetting regularization term in the cost function, the AFGR-OSELM maintains a constant regularization effect without fading throughout the learning process, and persistent stability of the algorithm is guaranteed. Moreover, with the help of the adaptive forgetting factor, the AFGR-OSELM can closely track the dynamic changes of the time-varying system and promptly reduce the adverse effects of outdated samples; thus it is capable of producing desirable prediction results in time-varying environments. The effectiveness and practicability of the AFGR-OSELM are evaluated and compared with four other representative OSELM-related models on artificial and real world data sets. The experimental results indicate that the AFGR-OSELM behaves much better than its counterparts in terms of stability and prediction accuracy for time-varying system prediction.
Although the proposed method has shown satisfactory results, there is still room for improvement. As demonstrated in the theoretical analyses and comparative experiments above, the AFGR-OSELM is superior to the other models in stability and prediction accuracy, yet its computation procedure is more complex and accordingly requires more computation time. The computational complexity of the AFGR-OSELM may be further reduced with a proper approximation approach; similar work has been done in [22], where a low-complexity adaptive forgetting factor method was developed for the original OSELM. As future work, we will consider this issue further and try to provide a more efficient AFGR-OSELM.
"×" denotes nullification owing to the too large RMSE.
*The values have been multiplied by 100.