Short-Term Load Forecasting with Improved CEEMDAN and GWO-Based Multiple Kernel ELM

Short-term load forecasting (STLF) is an essential and challenging task for power- or energy-providing companies. Recent research has demonstrated that a framework called "decomposition and ensemble" is very powerful for energy forecasting. To improve the effectiveness of STLF, this paper proposes a novel approach integrating the improved complete ensemble empirical mode decomposition with adaptive noise (ICEEMDAN), grey wolf optimization (GWO), and multiple kernel extreme learning machine (MKELM), namely, ICEEMDAN-GWO-MKELM, for STLF, following this framework. The proposed ICEEMDAN-GWO-MKELM consists of three stages. First, the complex raw load data are decomposed into a couple of relatively simple components by ICEEMDAN. Second, MKELM is used to forecast each decomposed component individually. Specifically, we use GWO to optimize both the weight and the parameters of every single kernel in the extreme learning machine to improve the forecasting ability. Finally, the results of all the components are aggregated as the final forecasting result. Extensive experiments reveal that the ICEEMDAN-GWO-MKELM can outperform several state-of-the-art forecasting approaches in terms of several evaluation criteria, showing that the ICEEMDAN-GWO-MKELM is very effective for STLF.


Introduction
Accurate load forecasting plays a significant role for participants in the electricity industry because it enables safe and reliable automated management of a smart grid. According to the length of the period involved, load forecasting tasks can be divided into three groups: long-term forecasting, mid-term forecasting, and short-term forecasting. Among them, short-term load forecasting (STLF) has become the hottest research topic in load forecasting because it can not only increase scheduling efficiency but also reduce operating costs [1,2].
In recent decades, time series-based models, such as random walk, moving average (MA), autoregressive integrated MA (ARIMA), ARIMA with explanatory variable (ARIMAX), and generalized autoregressive conditional heteroskedasticity (GARCH), have been widely used in load forecasting [3,4]. Lee and Ko embedded a lifting scheme into ARIMA models to improve the forecasting ability, and the simulation results verified the effectiveness for STLF [3]. Cui and Peng introduced temperature into ARIMA to propose an improved ARIMAX that can deal with mutating data structures [4]. However, because these time series-based models are usually built on the assumption that the load data are linear and stationary, which is not always met by practical load data, their forecasting accuracy is limited. In fact, recent research has demonstrated that load data are usually nonlinear and nonstationary. Therefore, it is necessary to use other models instead of time series-based models to improve load forecasting accuracy. Artificial intelligence (AI) models are considered capable of capturing the intrinsic features of complicated signals. Therefore, they have become more and more popular in energy forecasting. Typical AI models include support-vector regression (SVR) and its extension least-squares SVR (LSSVR) [5,6], artificial neural network (ANN) [7][8][9], extreme learning machine (ELM) [10], sparse Bayesian learning (SBL) [11,12], deep learning [13] (stacked denoising autoencoders (SDAs) [14], deep belief network (DBN) [15], convolutional neural network (CNN) [16], and long short-term memory (LSTM) [17]), and nature-inspired optimization algorithms [18,19]. For example, Chen et al. put forward a new SVR model that used the temperature before demand response as an additional input variable for STLF [6]. Kulkarni et al. proposed a spiking neural network to forecast short-term load with consideration of weather variables [8]. Yeom and Kwak used an ELM with knowledge representation for STLF, and the experimental results indicated good performance of the approach [10]. Han et al. presented a time-dependency CNN and a cycle-based LSTM for STLF by mapping the load data to pixels and rearranging them into a 2-D image, and extensive experiments demonstrated that the proposed models were superior to the compared models in terms of computational complexity [13]. Some other research aims to forecast load data accurately using hybrid models [20,21].
As far as energy time series forecasting is concerned, recent studies have shown that a framework called "decomposition and ensemble" is able to improve the forecasting performance significantly.
The main idea of this framework is to decompose the raw energy data series into several simpler components, then handle each component individually, and finally integrate the results from all components as the final forecasting result. This framework is a typical form of the "divide and conquer" strategy that is widely used in energy price forecasting [22][23][24], wind speed forecasting [25,26], load forecasting [27,28], biosignal processing [29,30], fault diagnosis [31], image processing [32][33][34][35], and so on. There are many types of decomposition methods that can be applied to decomposing energy time series. Among them, the members of the family of empirical mode decomposition (EMD), i.e., ensemble EMD (EEMD), complete EEMD (CEEMD), CEEMD with adaptive noise (CEEMDAN), and the improved CEEMDAN (ICEEMDAN), have become very popular in signal decomposition [36][37][38][39]. With them, the complicated raw energy time series can be decomposed into several components, some of high frequency and some of low frequency, making it easier to forecast each component individually than to forecast the raw energy time series directly. Existing research has revealed that ICEEMDAN is better than the other decomposition methods in the EMD family [39].
Theoretically, any AI regression method can be applied to forecasting each component. ELM has shown great power in both classification and regression tasks in recent years [40]. In particular, ELM is able to achieve satisfactory results in time series forecasting [41][42][43]. When the kernel trick is introduced into ELM, it can improve the performance of STLF further [44,45]. However, the existing kernel ELM usually uses one kernel or a simple combination of several kernels to construct the kernel matrix as the input of ELM, and the parameters in the single kernel or the weights of the several kernels are optimized by algorithms or specified by users. The number and types of single kernels in such an ELM may limit its forecasting effectiveness. An ideal way is to construct the kernel matrix using multiple kernels built from different types of base kernels, with both the parameters and the weights optimized adaptively. Some nature-inspired algorithms, such as ant colony optimization [46,47], particle swarm optimization (PSO) [48,49], and grey wolf optimization (GWO) [50,51], have the potential to optimize the parameters and the weights of the involved kernels simultaneously. Recent research has demonstrated that GWO outperforms some other nature-inspired algorithms in numerical optimization [52,53].
Inspired by the power of ICEEMDAN in signal decomposition, GWO in numerical optimization, and multiple kernel ELM (MKELM) in regression and prediction, this paper proposes a novel approach that integrates ICEEMDAN, GWO, and MKELM, called ICEEMDAN-GWO-MKELM, for STLF. Specifically, the proposed ICEEMDAN-GWO-MKELM is composed of three stages. First, the raw load series is decomposed into a couple of components, some showing high-frequency characteristics and others low-frequency ones. Second, an independent MKELM model is built on each component. To improve the representation ability of the kernel matrix, GWO is applied to optimizing both the parameters and the weight of each base kernel simultaneously. Finally, the predicted results of all the individual components are simply accumulated as the final forecasting result. The novelty of this paper lies in the following two aspects: (1) it is the first time that the combination of ICEEMDAN, GWO, and ELM is used for energy forecasting, especially for STLF; (2) a novel multiple kernel learning (MKL) framework optimizing both the parameters and the weight of every single kernel with GWO is proposed. We conduct extensive experiments to compare the proposed ICEEMDAN-GWO-MKELM with many state-of-the-art approaches, and the results demonstrate that the ICEEMDAN-GWO-MKELM is able to outperform the compared approaches in most cases. The remaining part of this paper is structured as follows. Section 2 briefly describes ICEEMDAN, GWO, ELM, and MKL. The proposed ICEEMDAN-GWO-MKELM is methodologically formulated in detail in Section 3. To evaluate the proposed method, we conduct extensive experiments, and the results are analyzed and discussed in Section 4. Finally, the paper is concluded in Section 5.

The Improved Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (ICEEMDAN).
The ICEEMDAN is a signal decomposition algorithm proposed by Colominas et al. in 2014 on the basis of CEEMDAN [39]. It is a new member of the EMD family. The EMD is a signal decomposition technique that can decompose a data series into a set of components with different frequencies, including intrinsic mode functions (IMFs) and a residue [36]. Each IMF must satisfy the following two conditions: (1) over the whole data span, the number of extrema (maxima and minima) and the number of zero crossings are equal or differ at most by one; (2) the local mean, defined as the average of the upper and lower envelopes, is zero at any point.
The EMD algorithm is described as follows:

Step 1: connect the local maxima as well as the local minima of the raw data by two cubic splines to form one upper envelope and one lower envelope, respectively.

Step 2: average the upper envelope and the lower envelope, and subtract the mean from the original signal to obtain a new sequence.

Step 3: examine whether the new sequence from Step 2 meets the two conditions listed above; if not, take the new sequence as the raw data and go to Step 1; otherwise, take the new sequence as one IMF and subtract it from the original sequence to obtain one residue (the local mean).

Step 4: examine whether the residue from Step 3 is a monotonic function or its value is less than a predetermined threshold; if not, take the residue as the raw data and go to Step 1; otherwise, end the decomposition and output all IMFs and one residue.

However, EMD is prone to mode mixing during the decomposition process [37]. The main manifestation of this problem is that one IMF contains signals of widely different scales. The main reason is as follows: in the case of an abnormal event, the extreme points in the signal are unevenly distributed, which affects the shape of the envelopes. The IMF thus contains both the intrinsic mode of the original signal and the intrinsic mode of the adjacent time scale introduced by the anomalous event.
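To make the sifting procedure described above concrete, the following is a minimal Python sketch of Steps 1-3 for extracting a single IMF. It is a simplified illustration, not the implementation used in the paper: the envelope endpoints are simply pinned to the signal's boundary values, and a fixed mean-energy tolerance stands in for an unspecified stopping rule.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def envelope_mean(x):
    # Steps 1-2: cubic-spline envelopes through local maxima/minima, then their average
    idx = np.arange(len(x))
    maxima = np.where((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]))[0] + 1
    minima = np.where((x[1:-1] < x[:-2]) & (x[1:-1] < x[2:]))[0] + 1
    if len(maxima) < 2 or len(minima) < 2:
        return None  # too few extrema: treat x as a (near-)monotonic residue
    # pin the endpoints to the signal values (a simple boundary treatment)
    upper = CubicSpline(np.r_[0, maxima, len(x) - 1], np.r_[x[0], x[maxima], x[-1]])(idx)
    lower = CubicSpline(np.r_[0, minima, len(x) - 1], np.r_[x[0], x[minima], x[-1]])(idx)
    return 0.5 * (upper + lower)

def sift_one_imf(x, max_sift=20, tol=1e-3):
    # Step 3: keep subtracting the envelope mean until it is close to zero
    h = np.asarray(x, dtype=float).copy()
    for _ in range(max_sift):
        m = envelope_mean(h)
        if m is None or np.mean(m ** 2) < tol * np.mean(h ** 2):
            break
        h = h - m
    return h  # one IMF; the corresponding residue is x - h

# Step 4 would repeat sift_one_imf on successive residues until one is monotonic.
```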
In order to overcome the shortcomings of EMD, Wu and Huang put forward the ensemble EMD (EEMD) method, which is an integrated, straightforward, and adaptive decomposition method for nonlinear and nonstationary time series [37]. Like EMD, EEMD decomposes the original data into a couple of IMFs and one residue, and each decomposed component has the same length as the original data. In EEMD, a number of different Gaussian white noises are first added to the original signal, as shown in the following equation:

$$x^{(i)} = x + w^{(i)}, \quad i = 1, 2, \ldots, N,$$

where $x$ denotes the original series, $w^{(i)}$ is the $i$-th Gaussian white noise to be added, and $x^{(i)}$ is the corresponding processed signal. Then, each processed signal is subjected to EMD processing to get IMFs and one residue, respectively. At last, the IMFs and residues from the different processed signals are averaged correspondingly to obtain the final decomposition results:

$$c_k = \frac{1}{N} \sum_{i=1}^{N} c_k^{(i)}, \quad k = 1, 2, \ldots, K,$$

where $c_k$ is the $k$-th final decomposition result, $c_k^{(i)}$ is the $k$-th decomposition result from $x^{(i)}$, $K$ is the final number of IMFs after averaging, determined by the length $M$ of the original time series as $K = \log_2 M - 1$, and $N$ is the realization number of the decomposition. The added noise changes the distribution of extreme points in the original signal so that the processed signals are continuously distributed on different time scales, which reduces the probability of mode mixing.
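In practice, EEMD is rarely hand-rolled; for instance, the open-source PyEMD package (installed as EMD-signal) exposes it directly. The snippet below is a usage sketch on a synthetic series; the trials and noise_width arguments correspond to the realization number $N$ and the white-noise strength, and their names follow PyEMD's documented API (treat them as assumptions if your version differs).

```python
import numpy as np
from PyEMD import EEMD  # third-party package: pip install EMD-signal

rng = np.random.default_rng(0)
t = np.linspace(0, 6, 1500)
# synthetic "load" with a slow trend, a fast cycle, and noise
signal = 10 + 0.5 * t + np.sin(2 * np.pi * 2 * t) + 0.2 * rng.standard_normal(t.size)

eemd = EEMD(trials=100, noise_width=0.05)  # N realizations, noise strength
imfs = eemd.eemd(signal)                   # rows are the averaged IMFs
residue = signal - imfs.sum(axis=0)        # what remains after all IMFs
```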
EEMD can reduce mode mixing to a certain extent. However, because of the added white noise, the error is not completely eliminated after a finite number of averaging calculations, which affects the accuracy of the reconstructed sequence and hence of the prediction. Although the error can be reduced by increasing the number of realizations to average, the amount of calculation and the calculation time increase greatly. Motivated by this phenomenon, Torres et al. designed CEEMDAN [38].
In CEEMDAN, Gaussian white noise is added at each decomposition stage to obtain one IMF and the corresponding residue. Given the operator $E_k(\cdot)$ that produces the $k$-th IMF obtained by EMD, CEEMDAN can be briefly described as follows: (1) like EEMD, CEEMDAN decomposes the noise-added copies of the original data and obtains the first IMF $c_1$ and the first residue $r_1 = x - c_1$; (2) the $k$-th IMF ($k \ge 2$) and residue are obtained by

$$c_k = \frac{1}{N}\sum_{i=1}^{N} E_1\left[r_{k-1} + p_{k-1}E_{k-1}\left(w^{(i)}\right)\right], \qquad r_k = r_{k-1} - c_k,$$

where $w^{(i)}$ is the $i$-th Gaussian white noise to be added, $E_k[w^{(i)}]$ is the $k$-th IMF obtained by decomposing $w^{(i)}$ with EMD, and $p_k$ is the signal-to-noise ratio between the additional noise and the signal at stage $k$. The algorithm terminates when the residue $r_k$ satisfies the iteration termination condition, and finally CEEMDAN represents the signal as

$$x = \sum_{k=1}^{K} c_k + R,$$

where $R$ is the residue obtained by CEEMDAN, $x$ is the original signal, and $K$ is the number of IMFs. However, CEEMDAN still suffers from two main problems: (1) residual noise contained in the modes and (2) spurious modes. The improved CEEMDAN (ICEEMDAN) was proposed to solve these problems. Let $M(\cdot)$ be the operator that generates the local mean and $E_k(\cdot)$ be the operator that produces the $k$-th IMF by EMD [39]. The specific decomposition process is as follows:

Step 1: add $E_1[w^{(i)}]$ to the original signal $x$: $x^{(i)} = x + p_0 E_1(w^{(i)})$ $(i = 1, 2, \ldots, N)$, where $w^{(i)}$ is the $i$-th white noise to be added, $p_0$ is the signal-to-noise ratio, and $N$ is the number of noise realizations.
Step 2: calculate the local means of the $x^{(i)}$ by EMD and obtain the first residue $r_1 = \frac{1}{N}\sum_{i=1}^{N} M(x^{(i)})$; then, obtain the first IMF $c_1 = x - r_1$.
Step 3: recursively compute the $k$-th residue $r_k = \frac{1}{N}\sum_{i=1}^{N} M\left(r_{k-1} + p_{k-1}E_k\left(w^{(i)}\right)\right)$ and obtain the $k$-th IMF as $c_k = r_{k-1} - r_k$ $(k \ge 2)$.

The results show that the residual noise problem in the IMFs is greatly reduced, and the averaging problem caused by the different numbers of IMFs generated by EEMD across realizations is also solved.
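PyEMD also ships a CEEMDAN class implementing the Torres et al. algorithm, which can stand in for ICEEMDAN in a quick experiment; a published ICEEMDAN implementation would be called in exactly the same way. This is a usage sketch under that assumption, with epsilon and trials playing the roles of the noise strength $p_0$ and the realization number $N$ (again, parameter names follow PyEMD's documented API).

```python
import numpy as np
from PyEMD import CEEMDAN  # pip install EMD-signal

rng = np.random.default_rng(1)
t = np.linspace(0, 6, 1500)
signal = 10 + 0.5 * t + np.sin(2 * np.pi * 2 * t) + 0.2 * rng.standard_normal(t.size)

decomposer = CEEMDAN(trials=100, epsilon=0.2)  # realization number, noise strength
c_imfs = decomposer(signal)                    # high- to low-frequency components
residue = signal - c_imfs.sum(axis=0)          # the final residue R
print(c_imfs.shape)                            # (number_of_IMFs, len(signal))
```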

Grey Wolf Optimization (GWO).
Grey wolf optimization (GWO) is a swarm intelligence optimization algorithm proposed by Mirjalili et al. in 2014 [50]. It is derived from simulating the predation behavior of a grey wolf pack and achieves optimization through the pack's tracking, encircling, pursuing, and attacking of prey. The advantages of the GWO algorithm include its simple principle, few parameters to adjust, easy implementation, and strong global search ability. The wolves are divided into four levels: α represents the dominant wolf of the pack at the first level; β represents the subordinate wolf at the second level, which helps α make decisions; δ represents the wolves that follow the instructions of α and β; and ω represents the wolves at the lowest level. In GWO, the pursuit is performed by α, β, and δ, and the ω wolves follow the first three to track and encircle the prey and finally complete the predation task. Suppose that the number of grey wolves is $N$ and the dimension of the search space is $d$; then the position of the $i$-th grey wolf can be expressed as $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$. According to the fitness function of a specific optimization problem, the optimal individual is recorded as α, the individuals ranked second and third are recorded as β and δ, and the remaining individuals are recorded as ω. The position of the prey corresponds to the global optimal solution of the optimization problem. Three definitions used in GWO are given as follows.
Definition 1. Distance between grey wolf and prey:

$$\vec{D} = \left|\vec{C} \cdot \vec{X}_p(t) - \vec{X}(t)\right|,$$

where $\vec{X}_p(t)$ indicates the position of the prey at iteration $t$, $\vec{X}(t)$ indicates the position of a grey wolf individual at iteration $t$, and $\vec{C}$ is a swing factor determined by

$$\vec{C} = 2\vec{r}_1,$$

where $\vec{r}_1$ is a random vector with components between 0 and 1.

Definition 2. Encircling prey.
In nature, grey wolves always hunt prey by encircling it. The mathematical model is given as follows:

$$\vec{X}(t+1) = \vec{X}_p(t) - \vec{A} \cdot \vec{D},$$

where $\vec{A}$ is a convergence factor determined by

$$\vec{A} = 2\vec{a} \cdot \vec{r}_2 - \vec{a},$$

where $\vec{r}_2$ is a random vector with components between 0 and 1 and $\vec{a}$ decreases linearly from 2 to 0 as the iteration number increases.

Definition 3.
Hunting and capturing prey. An issue with Definition 2 is that the position of the prey (the optimal solution) is unknown in practical optimization problems. So, to simulate hunting, the three wolves α, β, and δ are assumed to be closest to the prey and to have the clearest knowledge of its position; the closer a wolf is, the better it knows the prey's position. Their positions are used to estimate the prey's location and to lead the remaining ω wolves to update their own positions. The mathematical expression of hunting is as follows:

$$\vec{D}_\alpha = \left|\vec{C}_1 \cdot \vec{X}_\alpha - \vec{X}\right|, \quad \vec{D}_\beta = \left|\vec{C}_2 \cdot \vec{X}_\beta - \vec{X}\right|, \quad \vec{D}_\delta = \left|\vec{C}_3 \cdot \vec{X}_\delta - \vec{X}\right|, \quad (10)-(12)$$

$$\vec{X}_1 = \vec{X}_\alpha - \vec{A}_1 \cdot \vec{D}_\alpha, \quad \vec{X}_2 = \vec{X}_\beta - \vec{A}_2 \cdot \vec{D}_\beta, \quad \vec{X}_3 = \vec{X}_\delta - \vec{A}_3 \cdot \vec{D}_\delta, \quad (13)-(15)$$

$$\vec{X}(t+1) = \frac{\vec{X}_1 + \vec{X}_2 + \vec{X}_3}{3}. \quad (16)$$

During the hunt, each wolf first calculates its distances to α, β, and δ using equations (10)-(15) and then comprehensively determines the direction in which it moves toward the prey using equation (16).
Finally, the wolves (search agents) finish hunting when they capture the prey, and the algorithm terminates. The main idea of GWO can be described on the basis of the above definitions: randomly generate a population of grey wolves in the problem space; evaluate each individual wolf based on its distance to the prey according to Definition 1, nominate the α, β, and δ wolves, and then update the location of each wolf by Definition 3; repeat the evaluation and position updates until the wolves capture the prey [50].
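The loop just described fits in a few dozen lines of Python. The sketch below is a compact reading of equations (10)-(16), with one small simplification: the α, β, and δ wolves are kept as the best positions seen so far (a mildly elitist variant) rather than re-chosen from the current pack only.

```python
import numpy as np

def gwo(fitness, dim, lb, ub, n_wolves=30, max_iter=100, seed=0):
    # grey wolf optimizer for minimization, following equations (10)-(16)
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(n_wolves, dim))
    leaders = sorted(((fitness(x), x.copy()) for x in X), key=lambda p: p[0])[:3]
    for t in range(max_iter):
        a = 2.0 * (1.0 - t / max_iter)            # decreases linearly from 2 to 0
        for i in range(n_wolves):
            moved = np.zeros(dim)
            for _, leader in leaders:             # alpha, beta, delta in turn
                r1, r2 = rng.random(dim), rng.random(dim)
                A = 2.0 * a * r1 - a              # convergence factor
                C = 2.0 * r2                      # swing factor
                D = np.abs(C * leader - X[i])     # distance to this leader
                moved += leader - A * D
            X[i] = np.clip(moved / 3.0, lb, ub)   # equation (16), kept in bounds
        pool = leaders + [(fitness(x), x.copy()) for x in X]
        leaders = sorted(pool, key=lambda p: p[0])[:3]
    return leaders[0]                             # (best fitness, alpha position)

# toy usage: minimize the 5-dimensional sphere function
best_fit, best_pos = gwo(lambda x: float(np.sum(x ** 2)), dim=5, lb=-10.0, ub=10.0)
print(best_fit)
```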

Extreme Learning Machine (ELM).
The ELM is a novel single-hidden-layer feedforward neural network (SLFN) [54]. This machine randomly initializes the connecting weights and biases, and only the number of hidden-layer nodes needs to be determined by the user. ELM obtains the unique optimal output weights in a single step of calculation and thus trains very fast. ELM has been proved to perform well in both regression and classification problems. Given $N$ samples $(x_j, t_j)$, an ELM regression model with $\tilde{N}$ hidden-layer nodes and activation function $G$ can be expressed as

$$\sum_{i=1}^{\tilde{N}} \beta_i G(a_i \cdot x_j + b_i) = o_j, \quad j = 1, 2, \ldots, N,$$

where $a_i = [a_{i1}, a_{i2}, \ldots, a_{in}]^T$ is the vector of weights between the input nodes and the $i$-th hidden-layer node, $\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ is the vector of weights between the $i$-th hidden-layer node and the output nodes, $b_i$ is the bias of the $i$-th hidden-layer node, $a_i \cdot x_j$ is the inner product of $a_i$ and $x_j$, and $o_j$ is the output for input $x_j$. When the model fits the $N$ samples exactly, we get

$$\sum_{i=1}^{\tilde{N}} \beta_i G(a_i \cdot x_j + b_i) = t_j, \quad j = 1, 2, \ldots, N,$$

which can be written compactly as

$$H\beta = T,$$

where $H$ is the hidden-layer output matrix. The output weights can be obtained by solving the linear system

$$\min_{\beta}\|H\beta - T\|,$$

and the solution is

$$\hat{\beta} = H^{\dagger} T,$$

where $H^{\dagger}$ is the Moore-Penrose pseudoinverse of the hidden-layer output matrix $H$. It is proved that $\hat{\beta}$ is the unique minimum-norm least-squares solution of the system [54]. When the number of hidden-layer nodes equals the number of samples, the network can approximate the samples exactly; in practice, however, the number of hidden nodes is usually less than the number of training samples, so the data samples may suffer from multicollinearity. When computing the pseudoinverse as $H^{\dagger} = H^T(HH^T)^{-1}$, multicollinearity may make $HH^T$ singular. Moreover, each time a model is trained by ELM, the randomly initialized hidden layer yields a different $H$ and hence a different output weight vector $\beta$. These factors make the output of ELM prone to random fluctuations, so the model's stability and generalization ability are not ideal.
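The one-step training procedure is short enough to show in full. Below is a minimal NumPy sketch of an ELM regressor with a sigmoid activation; the uniform weight initialization on $[-1, 1]$ is a common convention assumed here, not a detail taken from the paper.

```python
import numpy as np

def elm_fit(X, T, n_hidden=100, seed=0):
    # randomly draw input weights a_i and biases b_i, then solve beta = pinv(H) @ T
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid activation G(a_i . x_j + b_i)
    beta = np.linalg.pinv(H) @ T             # Moore-Penrose least-squares solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# toy usage: fit y = sum(x) on random data
rng = np.random.default_rng(1)
X = rng.random((200, 8)); y = X.sum(axis=1)
W, b, beta = elm_fit(X, y, n_hidden=50)
print(np.abs(elm_predict(X, W, b, beta) - y).mean())
```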

Multiple Kernel Learning (MKL).
To further enhance the generalization ability and stability of ELM, Huang et al. introduced the kernel function into ELM by comparing the principles of ELM and the support-vector machine (SVM) and proposed the kernel extreme learning machine (KELM) [40]. The objective of ELM is to minimize not only the sum of the training errors but also the norm of the output weights, which can be presented as

$$\min_{\beta}\; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|\xi_i\|^2 \quad \text{s.t. } h(x_i)\beta = t_i^T - \xi_i^T,\; i = 1, \ldots, N,$$

where $\xi_i = [\xi_{i1}, \xi_{i2}, \ldots, \xi_{im}]^T$ is the error of the $m$ output nodes for input $x_i$, $C$ is the penalty coefficient used to weigh the ratio between structural risk and empirical risk, $\beta$ is the output weight matrix between the hidden layer and the output layer, $t_i$ is the true value with respect to $x_i$, $N$ is the number of samples, and $h(x_i)$ is the output vector of the hidden layer for input $x_i$. According to the KKT theorem, this problem is equivalent to the dual optimization problem

$$L = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|\xi_i\|^2 - \sum_{i=1}^{N}\sum_{j=1}^{m}\alpha_{ij}\left(h(x_i)\beta_j - t_{ij} + \xi_{ij}\right),$$

where $\alpha_{ij}$ is a Lagrange multiplier. The KKT conditions are

$$\frac{\partial L}{\partial \beta} = 0 \Rightarrow \beta = H^T\alpha, \qquad \frac{\partial L}{\partial \xi_i} = 0 \Rightarrow \alpha_i = C\xi_i, \qquad \frac{\partial L}{\partial \alpha_i} = 0 \Rightarrow h(x_i)\beta - t_i^T + \xi_i^T = 0.$$

Substituting the first two conditions into the third yields

$$\alpha = \left(\frac{I}{C} + HH^T\right)^{-1} T,$$

where $I$ is an identity matrix, and $HH^T$ can be generated by mapping the input samples through a kernel function. Combining this with $\beta = H^T\alpha$, we get

$$\beta = H^T\left(\frac{I}{C} + HH^T\right)^{-1} T.$$

The output function is

$$f(x) = h(x)H^T\left(\frac{I}{C} + HH^T\right)^{-1} T.$$

We can define the kernel matrix as

$$\Omega_{KELM} = HH^T, \qquad \Omega_{KELM}(i, j) = h(x_i)\cdot h(x_j) = K(x_i, x_j),$$

where $h(x)$ is the hidden-layer output function and $K$ is a kernel function, whose forms include the RBF kernel, the linear kernel, and the polynomial kernel. The kernel matrix $\Omega_{KELM}$ replaces the random matrix $HH^T$ in the output function and uses the kernel function to map all inputs of the $n$-dimensional space into a high-dimensional space. Once the kernel parameters are set, the kernel matrix $\Omega_{KELM}$ is fixed. Now, the output function can be rewritten as [40]

$$f(x) = \left[K(x, x_1), \ldots, K(x, x_N)\right]\left(\frac{I}{C} + \Omega_{KELM}\right)^{-1} T.$$

In the kernel-based ELM algorithm, as long as a function satisfies Mercer's condition, it can be used as the kernel function, such as the radial basis function (RBF) kernel and the polynomial kernel. Each kernel function usually has its own suitable application fields, and a single kernel function often cannot maximize the representation ability. Therefore, this paper presents a new kernel ELM framework that combines different types of kernels, called multiple kernel ELM (MKELM). MKELM replaces the single kernel function in KELM with a weighted combination of different kernel functions. The popular single kernels used in machine learning are as follows.
RBF kernel function:

$$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right).$$

Polynomial kernel function:

$$K(x, y) = (x \cdot y + c)^d.$$

Linear kernel function:

$$K(x, y) = x \cdot y.$$

Wave kernel function (a common three-parameter wavelet-type form):

$$K(x, y) = \prod_{j=1}^{n} \cos\left(\frac{\omega\left(x_j - y_j\right)}{a}\right)\exp\left(-\frac{\left(x_j - y_j\right)^2}{2b^2}\right).$$

So, the combined kernel function can be formulated as

$$K_{comb} = \sum_{i=1}^{n_1} a_i K_{RBF}^{(i)} + \sum_{i=1}^{n_2} b_i K_{poly}^{(i)} + \sum_{i=1}^{n_3} c_i K_{lin}^{(i)} + \sum_{i=1}^{n_4} d_i K_{wave}^{(i)},$$

where $n_1$, $n_2$, $n_3$, and $n_4$ represent the numbers of the four kernel types in order, $a_i$ represents the weight of the corresponding single RBF kernel, and the same holds for $b_i$, $c_i$, and $d_i$. The coefficients satisfy

$$\sum_{i=1}^{n_1} a_i + \sum_{i=1}^{n_2} b_i + \sum_{i=1}^{n_3} c_i + \sum_{i=1}^{n_4} d_i = 1, \qquad a_i, b_i, c_i, d_i \geq 0.$$

There are 1, 2, 0, and 3 parameters to be optimized for a single RBF kernel, a single polynomial kernel, a single linear kernel, and a single wave kernel, respectively. So, in the combined kernel, the total number of values to optimize is the sum of the kernel parameters, $n_1 + 2n_2 + 3n_4$, and the kernel weights, $n_1 + n_2 + n_3 + n_4$, which is $2n_1 + 3n_2 + n_3 + 4n_4$.
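The four base kernels and their weighted combination translate directly into NumPy. In the sketch below, the wave kernel uses the same assumed three-parameter wavelet-style form as above, since the paper's exact expression is not reproduced here; everything else follows the standard definitions.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def poly_kernel(X, Y, c, d):
    return (X @ Y.T + c) ** d

def linear_kernel(X, Y):
    return X @ Y.T

def wave_kernel(X, Y, omega, a, b):
    # assumed three-parameter wavelet-style form (see the caveat above)
    diff = X[:, None, :] - Y[None, :, :]
    return (np.cos(omega * diff / a) * np.exp(-diff ** 2 / (2.0 * b ** 2))).prod(axis=-1)

def combined_kernel(X, Y, rbfs=(), polys=(), lins=(), waves=()):
    # weighted sum of base kernels; all weights together are assumed to sum to 1
    K = np.zeros((len(X), len(Y)))
    for w, sigma in rbfs:
        K += w * rbf_kernel(X, Y, sigma)
    for w, c, d in polys:
        K += w * poly_kernel(X, Y, c, d)
    for w in lins:
        K += w * linear_kernel(X, Y)
    for w, omega, a, b in waves:
        K += w * wave_kernel(X, Y, omega, a, b)
    return K

# example: 0.6*RBF(sigma=1) + 0.3*polynomial(c=1, d=2) + 0.1*linear
X = np.random.default_rng(2).random((5, 3))
K = combined_kernel(X, X, rbfs=[(0.6, 1.0)], polys=[(0.3, 1.0, 2)], lins=[0.1])
print(K.shape)  # (5, 5)
```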

GWO-Based Multiple Kernel Extreme Learning Machine.
This paper uses GWO to optimize the weights and parameters for MKELM, namely, GWO-MKELM, which can be described as follows:

Step 1: define the fitness function according to the root mean square error (RMSE):

$$fitness(p) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \phi(x_i, p)\right)^2},$$

where $N$ denotes the number of training samples, $y_i$ denotes the actual value for input $x_i$, $p$ is the vector of the parameters and weights of the proposed model that need to be optimized, and $\phi(x_i, p)$ is the prediction of the model for $x_i$ given $p$.
Step 2: set the parameters for running GWO, including the maximum number of iterations, the population size, and the upper and lower boundaries for the parameters of the four types of kernel functions and the regularization parameter. Randomly initialize the position of each wolf between the upper and lower boundaries, and set the fitness values of the α, β, and δ wolves to infinity. Set the iteration counter $t = 1$. Initialize $a$, $\vec{A}$, and $\vec{C}$.
Step 3: for each wolf, if its fitness is less than that of the α wolf, replace the α wolf with it; if its fitness is between those of the α and β wolves, replace the β wolf with it; if its fitness is between those of the β and δ wolves, replace the δ wolf with it.

Step 4: update the position of each wolf according to equation (16), update $a$, $\vec{A}$, and $\vec{C}$, and set $t = t + 1$.
Step 5: judge whether $t$ is larger than the maximum number of iterations; if not, go to Step 3; if so, stop the iteration and output the position of the α wolf, $p_\alpha$, as the optimized parameters and weights for MKELM.
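To see how the pieces connect, here is a hedged sketch of the fitness function from Step 1: decode one wolf's position into kernel weights, kernel parameters, and the penalty coefficient $C$, train a KELM with the combined kernel, and return the validation RMSE. For brevity it combines RBF kernels only; the full model would decode the polynomial, linear, and wave kernels in the same way.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mkelm_fitness(position, X_tr, y_tr, X_va, y_va, n_rbf=5):
    # decode one wolf's position: n_rbf kernel weights, n_rbf sigmas, then C
    w = np.abs(position[:n_rbf]) + 1e-12
    w = w / w.sum()                                    # weights sum to 1
    sigmas = np.abs(position[n_rbf:2 * n_rbf]) + 1e-6  # keep kernel widths positive
    C = float(np.abs(position[2 * n_rbf])) + 1e-6      # regularization parameter
    # KELM with a combined kernel: f(x) = K(x, X_tr) (I/C + Omega)^(-1) y
    Omega = sum(wi * rbf_kernel(X_tr, X_tr, s) for wi, s in zip(w, sigmas))
    alpha = np.linalg.solve(np.eye(len(X_tr)) / C + Omega, y_tr)
    K_va = sum(wi * rbf_kernel(X_va, X_tr, s) for wi, s in zip(w, sigmas))
    return float(np.sqrt(np.mean((y_va - K_va @ alpha) ** 2)))  # RMSE fitness
```

Handing `lambda p: mkelm_fitness(p, X_tr, y_tr, X_va, y_va)` to the earlier gwo sketch with `dim = 2 * n_rbf + 1` then reproduces the loop of Steps 2-5.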

The Proposed ICEEMDAN-GWO-MKELM Approach.

Because the electricity load time series is hard to predict owing to its nonlinear and nonstationary features, this paper adopts a forecasting framework called "decomposition and ensemble." The framework decomposes the original time series into several relatively simple components and predicts these components separately, and the results are aggregated as the final predicted value. It is worth noting that the components obtained by the decomposition effectively preserve the characteristics of the original time series at different levels and can be processed directly by relatively simple methods, so it is feasible to forecast the load from the decomposed components.
e proposed ICEEMDAN-GWO-MKELM consists of three stages, namely, decomposition, individual forecasting, and ensemble, following the very popular framework of "decomposition and ensemble." e details of the approach are illustrated in Figure 1.
In the first stage, i.e., decomposition, the scheme uses ICEEMDAN to decompose the raw load data into a set of components ($n - 1$ IMFs and one residue). The decomposed components show simpler characteristics than the raw data, making it easier to forecast the fluctuations of the components than to forecast the raw data directly. After that, an individual MKELM model is built on each component. To improve the forecasting accuracy, GWO is applied to optimizing both the parameters and the weights adaptively. In the last stage, the forecasting results of all the components are simply summed as the final result.
From the figure, it can also be seen that the proposed ICEEMDAN-GWO-MKELM is a typical "divide and conquer" strategy. That is to say, the tough and challenging task of STLF is now divided into subtasks of forecasting the decomposed components individually. Because the decomposed components show relatively simple features compared with the raw load data, the forecasting accuracy on such components might be significantly improved.
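The three stages reduce to a short driver loop. The sketch below is schematic: decompose and fit_predict are injected callables (for instance, the CEEMDAN wrapper and the GWO-MKELM fitness machinery sketched earlier), and the per-component forecasts are summed to form the final output.

```python
import numpy as np

def make_windows(series, lag=48):
    # previous `lag` points -> next point (horizon 1)
    X = np.stack([series[i:i + lag] for i in range(len(series) - lag)])
    return X, series[lag:]

def decompose_forecast_ensemble(load, decompose, fit_predict, lag=48, train_frac=0.8):
    # Stage 1: decompose the raw series into IMFs plus a residue;
    # Stage 2: forecast each component with its own model;
    # Stage 3: sum the component forecasts into the final result
    final = None
    for comp in decompose(load):
        X, y = make_windows(comp, lag)
        split = int(train_frac * len(X))
        pred = fit_predict(X[:split], y[:split], X[split:])
        final = pred if final is None else final + pred
    return final
```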

Data Description.
The data of electricity load demand can be accessed from the Australian Energy Market Operator (AEMO) [55]. Specifically, we use the half-hour demand data of 2014 from New South Wales (NSW), Tasmania (TAS), Queensland (QLD), Victoria (VIC), and South Australia (SA) to test and verify the proposed method. For each region, to reflect seasonal differences, four months were chosen: three 31-day months (January, July, and October) and one 30-day month (April). The statistics of the used datasets are shown in Table 1. We use the first 80% of the observations as training data and the rest as testing data for each month. The size of a dataset for a 31-day month is 1488, so the numbers of training and testing samples are 1190 and 298, respectively. Likewise, the size for a 30-day month is 1440, so the numbers of the two parts are 1152 and 288.
We use the previous 48 data points to predict the next one, so the horizon is 1 and the lag is 48, which can be formulated as

$$\hat{x}_{t+h} = f\left(x_t, x_{t-1}, \ldots, x_{t-l+1}\right),$$

where $\hat{x}_{t+h}$ is the prediction at horizon $h$ from time $t$, $x_t$ is the actual data at time $t$, and $l$ is the lag.
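Concretely, each series is turned into supervised pairs by a sliding window of length 48. A tiny sketch of this construction:

```python
import numpy as np

# build (input, target) pairs from one series with lag l = 48 and horizon h = 1
series = np.arange(200, dtype=float)   # stand-in for one decomposed component
lag = 48
X = np.stack([series[i:i + lag] for i in range(len(series) - lag)])
y = series[lag:]                       # each target is the point right after its window
print(X.shape, y.shape)                # (152, 48) (152,)
```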

Evaluation Criteria.
We use several widely used indices to evaluate the proposed approach: the mean absolute error (MAE), the mean absolute percentage error (MAPE), and the root mean squared error (RMSE), as defined in equations (41)-(43), respectively:

$$MAE = \frac{1}{N}\sum_{t=1}^{N}\left|x_t - \hat{x}_t\right|, \quad (41)$$

$$MAPE = \frac{1}{N}\sum_{t=1}^{N}\left|\frac{x_t - \hat{x}_t}{x_t}\right|, \quad (42)$$

$$RMSE = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(x_t - \hat{x}_t\right)^2}, \quad (43)$$

where $N$ is the number of testing data points and $x_t$ and $\hat{x}_t$ are the true value and the predicted value at time $t$. Obviously, the lower the three criteria, the better the forecasting method. Meanwhile, the Nemenyi test is also conducted to show whether the proposed method is significantly superior to the other methods [56]. The basic idea of the Nemenyi test is to determine whether algorithms behave similarly according to their average performance over different datasets. First, on each dataset, the algorithms are sorted from best to worst according to the evaluation criterion, and the rank of an algorithm is its position on that dataset, with 1 being the best. Tied algorithms share the same rank, computed as $k + (n - 1)/2$, where $k$ is the first tied position and $n$ is the number of algorithms tied at that position. Then, the ranks of each algorithm over all datasets are averaged to obtain its average rank. Under a given level of significance, the critical difference (CD) of the average rank difference can be calculated by

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}},$$

where $q_\alpha$ is the critical value at significance level $\alpha$, $k$ is the number of algorithms, and $N$ is the number of datasets. If the average rank difference between two algorithms exceeds the CD, the hypothesis that the two algorithms perform equally is rejected at significance level $\alpha$. Here, we take a significance level of 0.05 and divide all prediction methods into two parts: one without decomposition and the other with decomposition. For each part, the Nemenyi test is conducted with RMSE, MAPE, and MAE as the performance criteria, so a total of 6 tests are performed.
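The three criteria and the CD threshold are a few lines each. In the sketch below, the q_alpha value must be looked up in a Studentized range table; 2.949 is the commonly tabulated value for comparing 7 algorithms at alpha = 0.05, used here purely for illustration.

```python
import numpy as np

def mae(actual, pred):
    return float(np.mean(np.abs(actual - pred)))

def mape(actual, pred):
    return float(np.mean(np.abs((actual - pred) / actual)))

def rmse(actual, pred):
    return float(np.sqrt(np.mean((actual - pred) ** 2)))

def nemenyi_cd(k, n_datasets, q_alpha):
    # CD = q_alpha * sqrt(k(k+1) / (6N)); q_alpha from the Studentized range table
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))

# example: 7 single models compared over 20 datasets (5 regions x 4 months)
print(nemenyi_cd(k=7, n_datasets=20, q_alpha=2.949))
```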

Experimental Settings.
First of all, some state-of-the-art prediction models are introduced as comparative approaches to verify the validity of the proposed ICEEMDAN-GWO-MKELM. According to whether and how the original data are decomposed, the benchmark methods can be divided into three parts. The first part consists of the methods without decomposition: SVR, ANN, random forest (RF), deep belief network (DBN) [15], ELM, KELM, and GWO-MKELM. The other two parts are the corresponding ensemble models built on the components decomposed by EMD and by ICEEMDAN, respectively, as illustrated in Figure 2. Also, for GWO, the parameters are shown in Table 2.
We apply min-max normalization to preprocess the data before they are fed into the models, as formulated by the following equation:

$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}},$$

where $x_{norm}$ is the normalized value, $x$ is the original value, and $x_{min}$ and $x_{max}$ are the corresponding minimum and maximum of each data dimension. With this normalization, every value is mapped into the range $[0, 1]$.
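A leakage-safe way to apply this is to compute the minimum and maximum on the training split only and reuse them for the test split and for de-normalizing predictions; whether the paper does exactly this is not stated, so treat it as an assumed convention:

```python
import numpy as np

def minmax_fit(X_train):
    # statistics taken from the training split only, to avoid information leakage
    return X_train.min(axis=0), X_train.max(axis=0)

def minmax_transform(X, x_min, x_max):
    return (X - x_min) / (x_max - x_min)

def minmax_inverse(X_norm, x_min, x_max):
    # map predictions back to the original load scale
    return X_norm * (x_max - x_min) + x_min
```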
We conduct all the experiments by MATLAB R2016b on 64-bit Windows 10, and the main hardware includes a 3.6 GHz CPU as well as 32 GB RAM.

Results and Analysis.
To validate the proposed ICEEMDAN-GWO-MKELM, we conduct two groups of experiments: single models, which perform STLF on the raw load data, and ensemble models, which perform STLF on the decomposed components.

Single Models.
The results of MAPE, RMSE, and MAE obtained by the single models on the different datasets are listed in Table 3, where we use bold and italics to mark the best and the worst results, respectively.
From Table 3, we can find that ELM and KELM obtain the worst results in 43 and 17 out of 60 cases, respectively. This reveals that the kernel trick can improve the forecasting results of ELM to some extent. We can also see that none of the other models obtains the worst results, indicating that the nonkernel ELM and the ELM with a single RBF kernel (KELM) underperform the other models. This may be because both the forecasting ability of plain ELM and the representation ability of a single RBF kernel are still insufficient for STLF. However, when we use the multiple kernel ELM optimized with GWO (GWO-MKELM) for STLF, it achieves the best results in 29 out of 60 cases, which indicates that GWO-MKELM outperforms the other single models significantly. The possible reason is that GWO-MKELM can optimize the weights and parameters of the multiple kernels for ELM to improve the representation ability of the kernels. SVR and DBN achieve the best results 22 and 10 times, ranking second and third, respectively. ANN and RF obtain similar results in most cases.
We further use the Nemenyi test with RMSE, MAPE, and MAE to compare the models, as shown in Figures 3-5, respectively. Regarding Figure 3, the proposed GWO-MKELM is ranked first, with a value of 2, followed by DBN and SVR with 2.2 and 2.4, respectively. The ELM is ranked last, with a value of 6.6. Therefore, the results show that GWO-MKELM is the best single model in terms of the Nemenyi test of RMSE. When we look at the Nemenyi tests of MAPE and MAE, we can see that both tests are almost the same. Specifically, SVR is ranked first with a value of 2, followed by GWO-MKELM and DBN with values of 2.1 and 2.15 in both tests, respectively. Once again, the ELM is ranked last in both tests.
All the experimental results with single models show that no single model can always outperform the others, although GWO-MKELM achieves the best results the most times. Therefore, to improve the accuracy of STLF, a promising way is to adopt the "decomposition and ensemble" framework.

Ensemble Models.
To demonstrate the performance of ICEEMDAN, we utilize EMD as a comparison decomposition method. We still apply the seven forecasting approaches used in the single models to conduct the individual forecasting for both decomposition methods, so there are a total of 14 ensemble forecasting models. The results of the ensemble models are listed in Table 4. From the table, it can be seen that the ensemble models with EMD and ICEEMDAN obtain the worst results 51 and 9 times out of 60, respectively, while all the best results of the 60 cases are achieved by the models associated with ICEEMDAN. Among the ensemble models with EMD, EMD-ELM and EMD-KELM obtain the worst results 23 and 28 times, respectively. The possible reason is that nonkernel ELM and ELM with a single RBF kernel lack good forecasting ability, so it is necessary to use multiple kernels to improve this ability. If we further investigate each individual forecasting method, we can find that the results with ICEEMDAN usually outperform those with EMD, confirming that the components decomposed by ICEEMDAN are more powerful than those by EMD for STLF. The best results of all cases are associated with ICEEMDAN, among which GWO-MKELM achieves the best results in 57 out of 60 cases, while the remaining three best results are achieved by SVR. Regarding the individual forecasting models, DBN and RF obtain neither the best nor the worst results, and their performance is at an intermediate level. ANN obtains the worst results 3 times. Note that although KELM with a single RBF kernel cannot improve the forecasting ability of ELM, the proposed GWO-MKELM can significantly outperform both ELM and KELM, showing that the proposed multiple kernel learning framework and the GWO-MKELM are very effective for STLF. When we look at the best result of each evaluation criterion, we can find that the proposed ICEEMDAN-GWO-MKELM is obviously advantageous over the compared methods in most cases. Take the NSW data in January as an example: the ICEEMDAN-GWO-MKELM achieves the best MAPE of 0.0037, which is far below the second best result of 0.0076 achieved by the EMD-SVR and the ICEEMDAN-SVR. Likewise, the best MAE by the ICEEMDAN-GWO-MKELM is 30.2777, less than half of the second best MAE by the ICEEMDAN-SVR. The MAE results of some other methods are an order of magnitude larger than that of the ICEEMDAN-GWO-MKELM. The results on RMSE show a similar trend.
Again, we use the Nemenyi test to compare all the involved ensemble models. The results regarding RMSE, MAPE, and MAE can be found in Figures 6-8, respectively. In all the figures, the proposed ICEEMDAN-GWO-MKELM is ranked first with a fixed value of 1.05, followed by ICEEMDAN-SVR. In addition, EMD-KELM is still ranked last in all cases, showing that it has the worst ability for STLF when compared with the other ensemble models.
From the above analysis, we can see that (1) ensemble models outperform single models for STLF; (2) regarding the decomposition in ensemble models, ICEEMDAN is superior to EMD; (3) GWO is very useful for optimizing the weights and parameters in MKELM simultaneously; and (4) the proposed ICEEMDAN-GWO-MKELM is very effective for STLF.

The Influence of the ICEEMDAN's Parameter Settings.

The influence of the realization number of ICEEMDAN is shown in Figure 9. It can be seen that when the realization number is less than 75, all the evaluation criteria are poor, while when it is equal to or greater than 75, all the results are rather stable. Especially, when it equals 1000, all the evaluation criteria obtain their best values. Therefore, to balance the computation cost and effectiveness, a value equal to or greater than 75 is reasonable for the realization number of the ICEEMDAN.
For the white noise strength, when the value is less than 0.09, the results are rather bad. However, when the value falls in the range of [0.09, 0.4], the results show obvious stability, as shown in Figure 10. More specifically, the proposed approach achieves the best results when the white noise strength equals 0.2.

The Influence of the Lag Order.

The lag order determines the length of the input data. A large lag order means more computation resources, while a small one may result in bad forecasting results. We test lag orders in the set {12, 24, 36, 48, 72, 96, 120, 144}, and the results are shown in Figure 11.
We can see that when the lag order is equal to or less than 48, the results are very close, while beyond that, the results become worse and worse as the lag order increases. Because the experiments use half-hour load data, one day contains exactly 48 data points, so 48 is a reasonable lag order to balance computation resources and forecasting results.

The Influence of GWO-MKELM's Parameter Settings.

In the proposed approach, the combined kernel is built from four types of kernels, among which the number of RBF kernels is usually the largest. Here, we explore the influence of different numbers of RBF kernels, namely {5, 10, 20, 30, 40, 50, 60, 80, 100, 150, 200}, as shown in Figure 12. The proposed approach achieves similar results when the number of RBF kernels is less than 60, and after that, the results become worse as the number of RBF kernels increases. The ideal value for this parameter is in the range [5, 50]. The iteration number and the number of search agents (population size) are two key parameters for GWO, whose impacts are shown in Figures 13 and 14, respectively.
From Figure 13, we can see that as the number of iterations varies from 5 to 15, the experimental results become better and better. When the iteration number equals 20, the results become slightly worse, and then the results remain stable as the iteration number increases further. Likewise, the number of search agents has a relatively stable impact on the forecasting results. The optimal number of search agents is 50, and for other numbers of search agents, the forecasting results are very close, as shown in Figure 14.

Running Time.
We report the running time of the methods associated with ICEEMDAN in Table 5. It can be seen that ELM and DBN take the least time for training and testing, respectively. The GWO-MKELM takes the most time to train the forecasting model and test the samples because it has to perform matrix operations many times when evaluating each individual in the wolf pack. Fortunately, the training process can be conducted offline, and it needs to be performed only once. With the optimal weights and parameters optimized by GWO, the running time for testing the 298 samples is 0.5156 seconds. In other words, it takes about 0.0017 seconds to test one sample, which is acceptable in real-world applications.

Conclusions
STLF plays an important role in smart grids. Because of the nonstationarity and nonlinearity of the raw load data, STLF is a challenging task. To address this issue, we propose a novel approach for STLF based on the well-known "decomposition and ensemble" framework. We first present a novel multiple kernel learning approach that uses GWO to optimize the weights and the parameters of every single kernel in the combined kernel for ELM simultaneously. After that, a new approach integrating ICEEMDAN, GWO, and multiple kernel ELM, namely, ICEEMDAN-GWO-MKELM, is proposed for STLF. We also compare the proposed approach with some state-of-the-art forecasting methods on publicly accessible datasets containing half-hour load data of 4 months from 5 regions in Australia. From the experimental results, we can draw the following conclusions: (1) ensemble models significantly outperform the corresponding single models because the decomposed components can better express the intrinsic nature of the load data than the raw signal; (2) ICEEMDAN is superior to EMD for the decomposition of the load data. One limitation of the proposed ICEEMDAN-GWO-MKELM is that it has to evaluate each wolf, and many kernel matrices must be computed. Hence, more time is needed during the training phase. Fortunately, the training process can be performed offline. Therefore, once the forecasting model is built, the running time for testing is acceptable.
In the future, we will apply the ICEEMDAN-GWO-MKELM to forecasting other energy time series, such as electricity prices, natural gas prices, and wind speed.

Data Availability
The data used to support the findings of this study are included within the article.

Supplementary Materials
All the experimental data used in this manuscript are stored in an Excel file and are included in the "Optional Supplementary Materials." All the data were accessed and downloaded via the Australian Energy Market Operator (2014).