Short-Term Wind Speed Forecasting Based on Ensemble Online Sequential Extreme Learning Machine and Bayesian Optimization



Introduction
Wind energy has grown substantially over the past two decades [1] and has become one of the primary renewable energy sources. However, wind energy is highly variable, which affects the stable operation of the grid. Wind speed prediction can enhance wind farm operations and reduce the influence of wind energy on the grid. As the installed capacity of wind energy increases year by year [2], the industry needs more accurate wind speed prediction, making this subject an essential topic in energy research. Over the past decade, scholars have proposed many wind speed prediction methods. These methods are divided into four categories: (1) physical methods, (2) statistical methods, (3) artificial intelligence methods, and (4) hybrid methods. The physical methods use fluid dynamics principles to establish numerical weather prediction (NWP) models. These methods require extensive computation and are not suitable for short-term wind speed prediction [3]. Statistical methods analyze the patterns in historical data and establish linear prediction models. Representative methods include autoregressive (AR) [4], autoregressive moving average (ARMA) [5], autoregressive integrated moving average (ARIMA) [6], and pattern sequence similarity (PSF) [7]. Because these methods cannot characterize the nonlinear relationships in wind data, they do not produce high-precision prediction results.
Artificial intelligence methods are good at modeling nonlinear relationships. Among the AI models, the most widely used are artificial neural networks (ANNs) [8] and support vector machines (SVMs) [9]. However, ANNs have multilayer structures that contain many parameters to adjust. The SVM is sensitive to its parameters and requires heavy computation on large data sets. The extreme learning machine (ELM) is a simple neural network [10]. Compared to ANNs, ELM has a single hidden layer and therefore fewer network parameters. Compared to SVM, ELM is more efficient. Consequently, ELM is an excellent predictor [11]. For instance, Liu et al. [12] used the ELM to forecast the high-frequency sublayers obtained by the VMD-SSA. Fu et al. [13] proposed a hybrid approach based on dominant ingredient chaotic analysis and the ELM. However, in these papers, ELM is trained offline and cannot support real-time learning. To address this issue, the online sequential extreme learning machine (OSELM) was introduced. Zhang et al. [14] proposed an online sequential outlier robust extreme learning machine (OSORELM) for short-term wind speed prediction. Tian et al. [15] proposed an adaptive OSELM to further improve ELM's prediction ability.
Ensemble learning, such as Bagging [16] and Boosting [17], combines multiple weak predictors to complete the forecasting. Bagging can reduce the prediction variance and improve the stability of the base predictors. Zontul et al. [18] proposed a Bagging-based decision tree algorithm for wind speed prediction. Emeksiz and Demir [19] used the Bagging algorithm to estimate wind speed. Boosting can effectively enhance the performance of a weak predictor. Peng et al. [17] used an AdaBoost neural network to address low prediction accuracy. Liu et al. [20] proposed a model that combines the AdaBoost algorithm with multilayer perceptron (MLP) neural networks.
Besides ensemble learning, hybrid methods can improve the prediction robustness and accuracy of a single model. In a hybrid model, signal decomposition algorithms are employed to reduce the prediction complexity. The representative algorithms are wavelet decomposition (WD), wavelet packet decomposition (WPD), empirical mode decomposition (EMD), and ensemble empirical mode decomposition (EEMD). For instance, Fei and He [21] proposed a hybrid prediction method that combined WD and the relevance vector machine. Liu et al. [22] presented a novel approach based on WPD and convolutional long short-term memory (ConvLSTM) networks. Zhang et al. [23] developed a model combining EMD, ANN, and SVM. Tian et al. [24] proposed a prediction approach using EEMD and ELM. However, the above decomposition methods have shortcomings. For instance, the wavelet-based approaches do not support adaptive processing, EMD cannot avoid mode mixing, and EEMD can add extra white noise to the wind data. A novel signal processing method, variational mode decomposition (VMD), was proposed to overcome these obstacles. It breaks down the original wind speed time series into a set of band-limited sublayer modes named intrinsic mode functions (IMFs). These IMFs are more stationary and therefore easier to predict. For instance, Zhang et al. [25] presented a hybrid model of the ANN, VMD, and Lorenz disturbance. The results demonstrated the stable prediction performance of the proposed model. Gendeel et al. [26] presented an ANN prediction model with VMD. The comparison results indicated that the proposed model obtained significant improvements in forecasting accuracy.
Feature selection methods can improve the computational efficiency of the hybrid models. The typical filter-based approaches are the partial autocorrelation function (PACF) and information theory methods. Sun et al. [27] applied PACF to identify the correlation between the decomposed components of EEMD. Memarzadeh and Keynia [28] used mutual information (MI) for feature selection. Huang et al. [29] used conditional mutual information (CMI) to analyze the correlation between the input features. Compared with the filter methods, the metaheuristic optimization-based wrapper approach can produce better accuracy. Sun et al. [27] used the binary-value gravitational search algorithm (BGSA) to improve the regression performance. Liu et al. [30] used the binary-coded genetic algorithm (BGA) for feature selection. Recently, the binary bat algorithm (BBA) has been proposed for feature selection. Compared with other metaheuristic algorithms, BBA has fewer parameters to adjust and can obtain better accuracy. Naik et al. [31] used the BBA to identify the relevant subset of features for machine-learning tasks. Xie et al. [32] applied BBA to realize test-cost-sensitive attribute reductions. Liu et al. [33] used BBA to effectively remove redundant features for image steganalysis. Since BBA is superior to PACF, it is employed for feature selection in this paper.
Besides feature selection, metaheuristic optimization algorithms can be used to seek the optimal parameters of the prediction models and thus improve the predictors' performance on the datasets [34]. Among the metaheuristic algorithms, the genetic algorithm (GA) [35] and particle swarm optimization (PSO) [36] have been widely used in wind speed prediction. Although they are suitable for optimizing the model parameters, they require heavy computation and are vulnerable to improper parameter initialization. In the past few years, Bayesian optimization (BO) has emerged as a powerful tool for fine-tuning hyperparameters. Specifically, BO is capable of optimizing expensive black-box objective functions. Compared with the evolutionary computation methods, BO can achieve desirable results with fewer iterations. For instance, Cho et al. [37] used BO to fine-tune deep neural networks. The experimental results indicated that BO is a robust solution compared to the existing solutions. Muhuri and Biswas [38] used BO to optimize task scheduling. Their approach obtained optimal schedules without violating the constraints. The experimental results indicated that BO is sample-efficient and can significantly outperform existing optimizers.
This paper proposes a novel approach for short-term wind speed prediction that addresses the above issues. The proposed model combines VMD, BBA, OSELM, BO, and Bagging. The contributions of the paper are as follows:
(1) VMD is utilized to preprocess the original wind time series into more stationary sublayers for prediction. Compared to EMD and its variants, the proposed approach is more robust to data noise.
(2) BBA is employed to complete the feature selection. Compared to PACF, BBA can achieve better prediction accuracy.
(3) BO-optimized OSELM, referred to as BO-OSELM, is used to forecast the low-frequency sublayers of VMD. Compared to ELM, OSELM provides the capability of online learning. Besides, BO is used to optimize the structure of OSELM.
(4) Bagging-based ensemble OSELM, referred to as Bagging-OSELM, is employed to forecast the high-frequency sublayers of VMD. The Bagging-OSELM shows better stability and accuracy than OSELM and AdaBoost-OSELM.
The remaining part of the paper proceeds as follows: Section 2 introduces the proposed hybrid model, Section 3 presents the experimental results and discussion, and Section 4 draws the conclusions.

The Proposed Hybrid Model
In this section, the proposed hybrid model, referred to as VMD-BBA-EnsOSELM, is presented. This approach combines VMD, BBA, BO, Bagging, and OSELM. The architecture of the proposed model is shown in Figure 1.
The process of the proposed method is as follows: (1) VMD is utilized to decompose the denoised original data set into stationary sublayers. (2) The BBA feature selection method is applied to retain the critical features of the sublayers produced by VMD. The past twenty data points of the wind speed are chosen as the candidate feature set, and BBA determines the most relevant features among these candidates. (3) OSELM is adopted to forecast the low-frequency sublayers obtained by VMD, and BO optimizes the parameters of OSELM. (4) Bagging-OSELM is adopted to forecast the high-frequency sublayers obtained by VMD.
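Step (2) can be illustrated with a short sketch of how a candidate feature set of twenty lagged values might be built and then filtered by a binary mask. The function names `make_lagged_features` and `apply_feature_mask` are hypothetical and do not come from the paper; the mask stands in for whatever subset BBA would return.

```python
import numpy as np

def make_lagged_features(series, max_lag=20):
    """Build a supervised data set from a 1-D series.

    Row t of X holds the max_lag values that precede the target y[t],
    ordered from oldest to newest.
    """
    series = np.asarray(series, dtype=float)
    if series.ndim != 1 or len(series) <= max_lag:
        raise ValueError("series must be 1-D and longer than max_lag")
    n_rows = len(series) - max_lag
    # Column i is the series shifted by i steps, truncated to n_rows.
    X = np.stack([series[i:i + n_rows] for i in range(max_lag)], axis=1)
    y = series[max_lag:]
    return X, y

def apply_feature_mask(X, mask):
    """Keep only the lag columns flagged by a 0/1 mask (for example, from BBA)."""
    return X[:, np.asarray(mask, dtype=bool)]
```

Each VMD sublayer would get its own lag matrix and its own mask, since the relevant lags may differ from one sublayer to the next.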

Variational Mode Decomposition.
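The VMD algorithm was developed to overcome the limitations of EMD [39]. It decomposes an original signal $x(t)$ into IMFs. In the literature, it has shown significant advantages in time-series forecasting [40] and fault diagnosis [41]. The core principle of VMD is to obtain the IMFs by solving the following constrained optimization problem:

$$\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\} \quad \text{s.t.} \quad \sum_{k=1}^{K} u_k(t) = x(t), \tag{1}$$

where $\{u_k\} = \{u_1, u_2, \ldots, u_K\}$ denotes the IMFs; $\{\omega_k\} = \{\omega_1, \omega_2, \ldots, \omega_K\}$ denotes the central frequency of each IMF in the Fourier frequency domain; $\delta(t)$ represents the Dirac function; $*$ denotes convolution; and $j$ is the imaginary unit. The constraint conditions are (1) the original signal equals the sum of all the IMFs, and (2) the sum of the modal bandwidths is minimal. Moreover, a Lagrange multiplier is introduced as

$$L(\{u_k\},\{\omega_k\},\lambda) = \alpha \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| x(t) - \sum_{k=1}^{K} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, x(t) - \sum_{k=1}^{K} u_k(t) \right\rangle, \tag{2}$$

where $\alpha$ denotes a penalty factor that guarantees the decomposition precision, and $\lambda$ is a Lagrangian multiplier that enforces the constraint strictly. The optimal solution to the above optimization problem is obtained iteratively as follows:

$$\hat{u}_k^{n+1}(\omega) = \frac{\hat{x}(\omega) - \sum_{i \neq k} \hat{u}_i(\omega) + \hat{\lambda}(\omega)/2}{1 + 2\alpha(\omega - \omega_k)^2}, \tag{3}$$

$$\omega_k^{n+1} = \frac{\int_0^{\infty} \omega \left| \hat{u}_k^{n+1}(\omega) \right|^2 d\omega}{\int_0^{\infty} \left| \hat{u}_k^{n+1}(\omega) \right|^2 d\omega}, \tag{4}$$

where $u_k(t)$ is an IMF, $\hat{u}_k^{n+1}(\omega)$ is the Fourier transform of $u_k(t)$ after the update, a hat denotes the Fourier transform of the corresponding quantity, and $n$ denotes the number of iterations used to solve the problem.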

Binary Bat Algorithm.
The bat algorithm [42] is a metaheuristic algorithm inspired by the echolocation that bats use to detect prey. In each iteration, a bat $b_i$ adjusts its loudness $A_i$ and its rate of pulse emission $r_i$ according to the distance to the prey. Firstly, each bat $b_i$ is initialized with the position $x_i$, the velocity $v_i$, and the frequency $f_i$. Then, for each iteration $t$, the bat $b_i$ is updated according to the following equations:

$$f_i = f_{\min} + (f_{\max} - f_{\min})\,\beta, \tag{5}$$

$$v_i^j(t) = v_i^j(t-1) + \left[ \hat{x}^j - x_i^j(t-1) \right] f_i, \tag{6}$$

$$x_i^j(t) = x_i^j(t-1) + v_i^j(t), \tag{7}$$

$$A_i(t+1) = \alpha A_i(t), \tag{8}$$

$$r_i(t+1) = r_i(0) \left[ 1 - \exp(-\gamma t) \right], \tag{9}$$

where $\beta$ denotes a randomly generated number; $f_{\min}$ and $f_{\max}$ bound the frequency; $x_i^j(t)$ denotes the value of decision variable $j$ for bat $i$ at time step $t$; $\hat{x}^j$ represents the current global best solution for decision variable $j$; and $\alpha$ and $\gamma$ are user-specified constants (Algorithm 1).
For feature selection, a binary version of the bat algorithm is used, in which each component $x_i^j$ of a bat's new position is restricted to a binary value that indicates whether candidate feature $j$ is selected.
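The binary restriction can be sketched as follows. This sketch assumes the sigmoid transfer rule that is common in the binary bat literature: each velocity component is squashed to a probability and compared with a uniform random number. The helper names (`bat_velocity_step`, `binary_position`) and the frequency bounds `f_min` and `f_max` are illustrative and are not taken from the paper.

```python
import numpy as np

def sigmoid(v):
    """Logistic transfer function that maps a velocity to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def bat_velocity_step(x, v, x_best, f_min, f_max, rng):
    """One frequency and velocity update for a single bat.

    x, v, x_best are arrays over the decision variables; the frequency is
    drawn between f_min and f_max with a uniform random beta.
    """
    beta = rng.random()
    freq = f_min + (f_max - f_min) * beta
    return v + (x_best - x) * freq

def binary_position(velocity, rng):
    """Turn real-valued velocities into a 0/1 feature mask.

    Assumption: bit j is set when sigmoid(v_j) exceeds a uniform draw.
    """
    prob = sigmoid(np.asarray(velocity, dtype=float))
    return (prob > rng.random(prob.shape)).astype(int)
```

With this rule, a strongly positive velocity almost always selects the feature and a strongly negative one almost always drops it, so the swarm's attraction toward the best mask is preserved in binary space.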

Online Sequential Extreme Learning Machine.
ELM is a feedforward network with a single hidden layer. The mathematical expression of ELM is

$$f(x) = G(x)\,\beta,$$

where $\beta$ is the output weight vector between the single hidden layer and the output layer, and $G(x)$ is the hidden layer output matrix. The optimal solution of $\beta$ can be obtained by

$$\beta^* = G^{\dagger} T,$$

where $G^{\dagger}$ is the Moore-Penrose inverse of $G$, and $T$ is the training-target matrix. OSELM is an online learning variant of ELM [43]. The algorithm is divided into two phases: the initialization phase and the online learning phase. In the initialization phase, given an initial training data set, the hidden layer output matrix $G_0$ and the output weight vector $\beta_0^*$ are calculated from this initial block. Then, the online learning phase starts, and the algorithm learns the data block by block. In the $k$th iteration, a new batch of observations with its training-target matrix is given, and the output weight vector $\beta_k^*$ is updated recursively from the previous weights.

Bayesian Optimization.
Consider the global optimization of an objective function $f$, where $f$ is an expensive black-box function and $X$ is the design space of $f(x)$. Besides, $f$ can be evaluated at any point in $X$. A sequential exploration process is then carried out: at iteration $n$, a location $x_{n+1}$ is selected at which $f$ is evaluated, and $y_{n+1}$ is observed. After $N$ evaluations, the exploration process terminates, and a final location $x^*$ is returned as the best optimization result. In wind speed forecasting, the black-box function $f$ is a wind speed prediction model with hyperparameters $x$ and a prediction error $y = f(x)$ on a validation data set. Such an $f$ is nonconvex and expensive to evaluate. Bayesian optimization [44] uses all previous observations of the objective function to make the sequential exploration process efficient. It can be described as a sequential model-based optimization method.
Initially, a probabilistic surrogate model is specified to represent the prior belief about the objective function, and the posterior belief is then updated as $f$ is evaluated sequentially. The posterior belief represents what is known about $f$ given the observations of the objective function. The typical probabilistic surrogate models include Gaussian process regression, sparse pseudo-input Gaussian process, sparse spectrum Gaussian process, random forest, and gradient boosting decision tree. An acquisition function $\alpha_n : X \mapsto \mathbb{R}$, which incorporates the posterior belief model, is used to explore the design space $X$. It balances exploration and exploitation when choosing the next evaluation of $f$. As a utility function, it returns the utility estimate of each candidate point for the next evaluation of $f$, and the point $x_{n+1}$ with the maximum utility is selected. The main acquisition functions are the PI (probability of improvement), the EI (expected improvement), and the UCB (upper confidence bound). Bayesian optimization has been demonstrated as a powerful tool for optimal design problems, such as industrial control [45], robotics [46], and chemical experiments [47]. In this paper, a novel Bayesian optimization algorithm, named DART-EI Bayesian optimization, is proposed for the wind speed forecasting models. The process of the algorithm is described in Algorithm 2. In this process, the probabilistic surrogate model is DART (dropouts meet multiple additive regression trees) [48], and the acquisition function is the EI. In each iteration of the Bayesian optimization process, the next query point $x_{n+1}$ is the candidate that maximizes the EI, $\alpha_{\mathrm{EI}_n}(x)$, which is computed from $y_{\mathrm{optimal}}$, the best value observed so far, and $m(x)$, the prediction mean of DART.
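As an illustration of how an EI acquisition selects the next query point, the sketch below uses the closed-form EI for minimization under a Gaussian predictive distribution. This is an assumption: the text does not state how the DART surrogate supplies predictive uncertainty, so `predict` here is a hypothetical callable that returns a mean and a standard deviation for a candidate.

```python
import math

def expected_improvement(mean, std, y_best):
    """Closed-form EI for minimization, assuming a Gaussian prediction.

    mean, std: surrogate prediction at a candidate point.
    y_best: best (lowest) objective value observed so far.
    """
    if std <= 0.0:
        # No uncertainty: improvement is deterministic.
        return max(y_best - mean, 0.0)
    z = (y_best - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (y_best - mean) * cdf + std * pdf

def next_query(candidates, predict, y_best):
    """Return the candidate with the largest EI; predict(x) -> (mean, std)."""
    return max(candidates,
               key=lambda x: expected_improvement(*predict(x), y_best))
```

The first EI term rewards candidates whose predicted error is already below the incumbent (exploitation), while the second rewards candidates with high predictive uncertainty (exploration).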
Since the performance of the OSELM model can be impacted by the number of hidden neurons, in this paper, BO is utilized to achieve the optimal performance of OSELM.
The objective function of BO is defined as the 4-fold cross-validation result for OSELM. The input variable of the objective function is the number of hidden neurons, which is a hyperparameter of OSELM. The output variable of the objective function is the mean absolute percent error of the 4-fold cross-validation. In other words, the objective function maps the number of hidden neurons to $CV_4$, the 4-fold cross-validation loss on the training data set. The acquisition function is also critical because it governs the balance between exploration and exploitation in BO. In this paper, EI is employed as the acquisition function.
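The sketch below shows one way the OSELM phases and the BO objective could be realized. It follows the standard recursive least-squares form of OSELM from the literature rather than the paper's own listing. The sigmoid activation, the small `ridge` term added for numerical stability, the block size, and the contiguous fold layout are all assumptions, and the names `OSELM` and `cv_mape` are illustrative.

```python
import numpy as np

class OSELM:
    """Online sequential ELM with a sigmoid hidden layer (illustrative)."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Random input weights and biases stay fixed after construction.
        self.W = rng.uniform(-1.0, 1.0, size=(n_inputs, n_hidden))
        self.b = rng.uniform(-1.0, 1.0, size=n_hidden)
        self.P = None
        self.beta = None

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def init_fit(self, X0, y0, ridge=1e-3):
        """Initialization phase: solve for beta on the first block."""
        G0 = self._hidden(X0)
        self.P = np.linalg.inv(G0.T @ G0 + ridge * np.eye(G0.shape[1]))
        self.beta = self.P @ G0.T @ y0
        return self

    def partial_fit(self, X, y):
        """Online phase: recursive least-squares update for one new block."""
        G = self._hidden(X)
        K = np.linalg.inv(np.eye(len(X)) + G @ self.P @ G.T)
        self.P = self.P - self.P @ G.T @ K @ G @ self.P
        self.beta = self.beta + self.P @ G.T @ (y - G @ self.beta)
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

def cv_mape(n_hidden, X, y, n_folds=4, block=50):
    """BO objective: mean MAPE (%) over contiguous validation folds."""
    idx = np.arange(len(y))
    scores = []
    for val_idx in np.array_split(idx, n_folds):
        tr = np.setdiff1d(idx, val_idx)
        first = max(block, n_hidden)  # size of the initialization block
        model = OSELM(X.shape[1], n_hidden).init_fit(X[tr[:first]], y[tr[:first]])
        for s in range(first, len(tr), block):
            model.partial_fit(X[tr[s:s + block]], y[tr[s:s + block]])
        pred = model.predict(X[val_idx])
        scores.append(100.0 * np.mean(np.abs((pred - y[val_idx]) / y[val_idx])))
    return float(np.mean(scores))
```

A Bayesian optimizer would call `cv_mape` for integer values of `n_hidden` inside the search range and keep the value with the lowest score. Because the update is exact recursive least squares, training block by block yields the same weights as solving the regularized problem once on all data.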

Bagging.
Bagging is an efficient ensemble learning algorithm [49]. It can significantly improve the performance of the base learner. In this paper, Bagging-OSELM is introduced to predict the high-frequency sublayers of VMD. Initially, the bootstrap sampling method is used to draw two hundred sample data sets $D_1, D_2, \ldots, D_{200}$ from the given training data set $D$. Then, an OSELM $R_i$ is constructed for each data set $D_i$, and the final ensemble model $R$ is built by averaging the prediction values from $R_1, R_2, \ldots, R_{200}$. The detailed Bagging-OSELM algorithm is described in Algorithm 3.

The Performance Evaluation Metrics.
In this paper, the performance of the involved models is evaluated by the mean absolute error (MAE), the mean absolute percent error (MAPE), and the root mean square error (RMSE). The smaller the evaluation metrics, the better the model performs. The MAE, MAPE, and RMSE are defined as

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|, \qquad \mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \times 100\%, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2}$$

Mathematical Problems in Engineering 5
Here, $\hat{y}_i$ and $y_i$ denote the predicted and observed values at time $i$, respectively, and $N$ represents the number of data points. Besides, the improvement percentage indices $P_{\mathrm{MAE}}$, $P_{\mathrm{MAPE}}$, and $P_{\mathrm{RMSE}}$ are used to compare the performance of two models; each index reports the percentage by which one model reduces the corresponding error metric relative to the other.
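A minimal sketch of these metrics is given below. The three error measures follow their standard definitions. The improvement index is written as the relative reduction of a metric with respect to a baseline model, which is an assumed form, since the exact expression is not reproduced here. All function names are illustrative.

```python
import numpy as np

def _as_arrays(y_true, y_pred):
    return np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)

def mae(y_true, y_pred):
    y_true, y_pred = _as_arrays(y_true, y_pred)
    return float(np.mean(np.abs(y_pred - y_true)))

def mape(y_true, y_pred):
    """Mean absolute percent error in percent; y_true must be nonzero."""
    y_true, y_pred = _as_arrays(y_true, y_pred)
    return float(100.0 * np.mean(np.abs((y_pred - y_true) / y_true)))

def rmse(y_true, y_pred):
    y_true, y_pred = _as_arrays(y_true, y_pred)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def improvement_pct(baseline_error, proposed_error):
    """Assumed form: percentage reduction of an error metric vs. a baseline."""
    return 100.0 * (baseline_error - proposed_error) / baseline_error
```

For example, a baseline RMSE of 2.0 and a proposed-model RMSE of 1.5 would give an improvement of 25%.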

Pearson's Test.
Pearson's test can evaluate the prediction capability of the involved models. In Pearson's test, the correlation coefficient is calculated to describe the degree of association between the observed data and the predicted data. If the correlation coefficient is 0, the observed and the predicted values are not linearly correlated. If the coefficient is 1, the observed and the predicted values are perfectly correlated. The larger the Pearson correlation coefficient is, the better the model is. Pearson's correlation coefficient is defined as

$$r = \frac{\sum_{i=1}^{N} (Y_i - Y_m)(\hat{Y}_i - \hat{Y}_m)}{\sqrt{\sum_{i=1}^{N} (Y_i - Y_m)^2 \, \sum_{i=1}^{N} (\hat{Y}_i - \hat{Y}_m)^2}},$$

where $Y_i$ is the actual data, $\hat{Y}_i$ is the forecasting data, $Y_m$ and $\hat{Y}_m$ are the means of the actual data and the forecasting data, respectively, and $N$ denotes the number of data points.
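The coefficient can be computed directly from its definition, as in the following sketch. The function name `pearson_r` is illustrative; NumPy's `np.corrcoef` gives the same value and is used here only as a cross-check.

```python
import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson correlation between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    dt = y_true - y_true.mean()  # deviations of the observations
    dp = y_pred - y_pred.mean()  # deviations of the predictions
    return float(np.sum(dt * dp) / np.sqrt(np.sum(dt ** 2) * np.sum(dp ** 2)))

# Cross-check against NumPy on a small made-up example.
obs = [3.1, 4.0, 5.2, 4.8, 6.0]
pred = [3.0, 4.2, 5.0, 5.1, 5.7]
assert abs(pearson_r(obs, pred) - np.corrcoef(obs, pred)[0, 1]) < 1e-12
```

Note that a high correlation only shows that the predictions track the shape of the observations; a constant bias would not lower it, which is why it is reported alongside MAE, MAPE, and RMSE.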

Wind Speed Data Description.
In this paper, two wind speed time series are used to evaluate the proposed model. These data were collected from the 135-m research towers of the NREL (National Renewable Energy Laboratory) from January 2012 to August 2012. The descriptive statistics of the data are given in Table 1. Each data set contains 1800 points sampled at 10-min intervals. Each original data set is divided into a training data set and a test data set. The training data set includes points 1-1700, and the test data set contains points 1701-1800. The two wind time series are depicted in Figures 2 and 3, respectively.
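The split described above is chronological, which can be sketched as follows. The function name `chronological_split` is hypothetical, and the synthetic array only stands in for a real 1800-point series.

```python
import numpy as np

def chronological_split(series, n_train=1700):
    """Split a time series into train and test parts without shuffling."""
    series = np.asarray(series, dtype=float)
    if not 0 < n_train < len(series):
        raise ValueError("n_train must lie strictly inside the series")
    return series[:n_train], series[n_train:]

# Placeholder data with the same length as each data set in the text.
demo = np.arange(1800)
train, test = chronological_split(demo)
```

Keeping the test block strictly after the training block avoids leaking future observations into the model, which a random split of a time series would do.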

Parameter Settings.
In this paper, two kinds of wind speed prediction models are implemented: the single models and the hybrid models. The single models include the GPR model, the LSSVR model, the LSTM model, and the OSELM model. In the GPR model, the kernel function is rational quadratic. In the LSSVR model, the kernel function is RBF, and gamma is 0.01. In the LSTM model, the number of neurons is 40. In the OSELM models, the number of hidden neurons is 10. In the BBA-based models, the maximum time lag is 20 for selecting relevant input features. In the EMD-BBA-OSELM model, the number of EMD trials is set to 100. In the EEMD-BBA-OSELM model, the number of EEMD trials is set to 100, and the standard deviation of the Gaussian noise is 0.05. In the VMD-based models (VMD-BBA-OSELM and VMD-BBA-EnsOSELM), the number of modes for VMD decomposition is 10. In the PSO-OSELM model and the VMD-BBA-EnsOSELM model, the number of hidden neurons of OSELM is selected using PSO and BO, respectively. The search range is [10, 200].

Algorithm 1: Bat Algorithm (f)
Input: Target function $f(x)$, $x = (x_1, \ldots, x_n)$
Initialize the bat population $x_i$ with the velocity $v_i$, the pulse frequency $f_i$, the pulse rate $r_i$, and the loudness $A_i$, $i = 1, 2, \ldots, m$.
For each bat $b_i$, do
    Employ equations (5)-(7) to produce new solutions.
    If rand > $r_i$, then
        Choose one candidate solution from the optimal solutions.
    If rand > $A_i$ and $f(x_i) < f(\hat{x})$, then
        Accept the newly proposed solutions.
        Update $r_i$ and $A_i$ by equations (8) and (9).
Return the current best $\hat{x}$.

Tables 6 and 7.

The Comparisons and Analysis.
From the above section, it can be seen that the prediction results for all the wind speed series follow similar patterns. The comparison and discussion of the prediction results are as follows:

Experimental Results

The Sensitivity Analysis.
The proposed method depends on the number of decomposition modes of VMD, which has to be configured in advance. In this section, several cases are examined to assess the sensitivity to the number of modes. The proposed model performs 1-step predictions for wind time series 1 with various numbers of modes. The forecasting results are shown in Table 8. From Table 8, it is concluded that the prediction errors of the proposed model decrease as the number of decomposition modes increases. For instance, when the number of modes grows from 4 to 5, the RMSE index is reduced by 6.62%; when the number of modes grows from 5 to 6, the MAPE index is reduced by 19.90%; when the number of modes grows from 6 to 7, the MAPE index is reduced by 9.19%; and when the number of modes grows from 7 to 8, the MAE index is reduced by 19.23%.

Conclusion
Short-term wind speed forecasting is significant to wind energy development, and it is widely applied to turbine regulation, electricity market clearing, and preload sharing. This paper has presented a novel hybrid forecasting method based on VMD, BBA, BO, Bagging, and OSELM. In the proposed VMD-BBA-EnsOSELM model, VMD is used to decompose the original wind time series into stationary subseries. BBA is used to complete the feature selection. BO-OSELM and Bagging-OSELM are utilized to complete the wind speed prediction. Two experiments are conducted on the NREL datasets to verify the superiority of the proposed method. Twelve models are compared with the proposed method: the GPR model, the LSSVR model, the LSTM model, the OSELM model, the AdaBoost-OSELM model, the Bagging-OSELM model, the BBA-OSELM model, the BO-OSELM model, the PSO-OSELM model, the EMD-BBA-OSELM model, the EEMD-BBA-OSELM model, and the VMD-BBA-OSELM model. The experimental results and Pearson's test indicate that (1) BBA is suitable for feature selection; (2) Bagging can be better than AdaBoost for enhancing the prediction capability of OSELM; (3) BO can be superior to PSO for effectively improving the accuracy of a hybrid wind prediction model; and (4) the proposed method achieves the best prediction performance among the involved models. In conclusion, the proposed model fully utilizes the virtues of VMD, BBA, BO, Bagging, and OSELM, and it is suitable for short-term wind speed forecasting. Future research will focus on extending the proposed model to multistep wind speed prediction.

Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.