Deep-Learning Prediction Model with Serial Two-Level Decomposition Based on Bayesian Optimization

Power load prediction is significant for a sustainable power system and is the key to the energy system's economic operation. An accurate prediction of the power load can provide a reliable basis for power system planning. However, it is challenging to predict the power load with a single model, especially for multistep prediction, because the time series load data contain multiple periods. This paper presents a deep hybrid model with a serial two-level decomposition structure. First, the power load data are decomposed into components; then, a gated recurrent unit (GRU) network, with Bayesian-optimized parameters, is used as the subpredictor for each component. Last, the predictions of the different components are fused to obtain the final prediction. The power load data of American Electric Power (AEP) were used to verify the proposed predictor. The results showed that the proposed prediction method can effectively improve the accuracy of power load prediction.


Introduction
With the rapid development of society, electric power is applied in all aspects of production and life. To ensure normal production and living needs, power enterprises always produce more energy than needed. However, because electric power cannot be stored at scale, excess energy wastes resources, and excessive operation also affects the safety of power equipment [1][2][3]. Therefore, power load prediction is of considerable significance to power enterprises. The benefits of load prediction include effective planning of the annual power supply, reduction of power waste and costs, and development of operation plans. An accurate prediction can provide a reliable decision basis for operation and ensure the power system's sustainable development.
However, in reality, many factors make power load prediction challenging. The power load is a very complex nonlinear time series, which makes accurate prediction very difficult. For example, the weather affects the cost of power [4]; other factors, such as differences in regional development levels and unpredictable natural disasters [5], also cause various changes in the power load. The power load data generally contain the following four components: (1) Trend component: it reflects the main trend of the power load data, which is either upward or downward. The trend component is the basic level of the power load over a long time: if the power load increases, the component trends upward, and if the power load decreases, it trends downward. (2) Daily period component: the power load data have distinct period characteristics within a day, that is, high power load in the daytime and low power load at night. (3) Annual period component: the power load data have another period component within a year, and the power load changes across the months. (4) Residual component: this part is what remains after removing the trend and periodic components from the original data; it contains complex nonlinear data and noise.
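The additive structure described above can be illustrated with a synthetic series; the amplitudes and shapes below are hypothetical, chosen only to show the four components, and are not the AEP data:

```python
import numpy as np

hours = np.arange(3 * 8760)                       # three years of hourly points
trend = 15000 + 0.01 * hours                      # slow upward trend component
daily = 2000 * np.sin(2 * np.pi * hours / 24)     # daily period (24-hour cycle)
annual = 3000 * np.cos(2 * np.pi * hours / 8760)  # annual period (8760-hour cycle)
rng = np.random.default_rng(0)
residual = rng.normal(0, 500, hours.size)         # nonlinear remainder and noise
load = trend + daily + annual + residual          # observed power load series
```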
Figure 1(a) shows the power load data of American Electric Power (AEP) from January 1, 2017, to January 1, 2020. The horizontal coordinate indicates that the sampling interval is one hour. Figure 1(b) shows the trend component separated from the original data. Figure 1(c1) shows the overall daily period component, while Figure 1(c2) shows an excerpt of the July 2018 daily period component; we can see that the daily period component has a distinct period. Figure 1(d) shows the annual period component. The power load is high in winter and summer, while it is low in spring and autumn.
In recent years, an accurate prediction of time series has become the focus of researchers. Time series usually have nonlinear, nonstationary, and complex period characteristics [6,7]. Existing methods for predicting time series include statistical prediction, machine learning, and combined prediction [8][9][10].
The statistical prediction methods are usually based on mathematical models [11,12]. The established prediction models often include the regression analysis method [13], the gray model method [14,15], the support vector machine (SVM) [16,17], the autoregressive integrated moving average (ARIMA) [18], and the artificial neural network (ANN) [19]. These methods often find it difficult to obtain accurate predictions when dealing with complex nonlinear data.
Unlike the methods described earlier, deep learning does not require prior knowledge and has stronger learning and prediction abilities. For example, Tang et al. [20] presented a multilayer bidirectional recurrent neural network based on LSTM and GRU; Gao et al. [21] used the GRU to build models for short-term load prediction. Guo et al. [22] presented an integrated deep-learning method that combines multiple LSTM networks, exploiting LSTM's nonlinear modes and similar-day methods to capture large-span cycles. Kollia and Kollias [23] proposed deep convolutional-recurrent neural networks that process time series or two-dimensional information to improve prediction accuracy. Yin et al. [24] proposed a three-state energy model covering three states, the generator, the power load, and the closed status, together with a scalable deep-learning approach for real-time economic power generation scheduling and control based on the three-state energy of the future smart grid. Zhang et al. [25] presented a prediction model based on the restricted Boltzmann machine and the Elman network. He et al. [26] proposed a deep belief network (DBN) embedded in a parametric copula model. Although the accuracy of these deep-learning methods is improved compared with the traditional methods, it is still difficult for them to learn sufficient feature representations from nonstationary data.
Based on the above research, the latest studies combine decomposition methods with deep learning to achieve better prediction results. These methods decompose the original data into components, use different prediction methods to predict the decomposed components, and fuse the predicted results. For example, the seasonal-trend decomposition procedure based on loess (STL) can obtain the trend, seasonal, and residual components of complex data [27] and has been used in a hybrid prediction in the authors' former research on weather forecasting [8,28]. Another decomposition method is wavelet decomposition: Wang et al. [29] decomposed the original time series and constructed predictors for the different subsignals. Li et al. [30] proposed an extreme learning machine (ELM) combined with variational mode decomposition (VMD). Guo et al. [31] proposed decomposing the original sequence by empirical mode decomposition (EMD) and selecting different models (including AR, MA, and ARMA) based on the characteristics of the subcomponents. The authors have used EMD to decompose PM2.5 time series data to obtain more accurate forecasts [32,33]. Compared with VMD and EMD, STL decomposition guarantees a known number of components (three: trend, seasonal, and residual) and is particularly suitable for sequential data with periodicity.
Comparisons of statistical and machine learning methods show that prediction accuracy can be improved by decomposing the data into multiple components and modeling each separately with a predictor. Moreover, it is also found that the hyperparameters have a significant impact on prediction performance, so hyperparameter optimization methods have been used. For example, [34][35][36] decomposed the original data with a wavelet algorithm and predicted the components with a particle swarm optimization (PSO) neural network. Another optimization method, the fruit fly optimization algorithm (FOA), has been used to select parameters for the generalized regression neural network (GRNN) [37]. In contrast, He et al. [38] used a Bayesian optimization algorithm based on a Parzen estimator to optimize the hyperparameters of a quantile regression forest (QRF) predictor. FOA and PSO are population-based optimization algorithms, which are not well suited to model hyperparameter tuning because they need enough initial sample points and their search efficiency is low. For the training process of deep-learning models, we instead need to sample as few points as possible to improve efficiency. Therefore, the Bayesian optimization algorithm is widely used in deep-learning models, as it can obtain the global optimum with the fewest sampling points.
This paper uses a serial two-level decomposition structure to improve the prediction performance in view of the multiple-period complexity of the power load data. Furthermore, the Bayesian optimization algorithm is applied to optimize the hyperparameters of the model. These two points constitute the main contributions of this paper. The rest of the paper is arranged as follows. Section 2 introduces the proposed serial two-level decomposition and the prediction model in detail. Section 3 gives the experimental results of the SMBO optimization algorithm and compares them with other experiments. Section 4 summarizes the conclusions.

Serial Two-Level Decomposition Optimization Model

The model consists of decomposition, prediction, and fusion processes. The prediction model's framework is shown in Figure 2: the two-level decomposition structure is applied first, and the data are decomposed into four components. In the training phase, four GRUs are trained on these four components. In the prediction phase, the GRUs predict the four components. Finally, the results of the submodels are fused to obtain the final prediction.

Serial Two-Level Decomposition.
The time series data of the original electric load are decomposed at two levels. Figure 3 shows the detail of the decomposition node of Figure 2. After the first-level decomposition, three components (trend, period, and residual) are obtained. However, the residual data still contain trend and periodic information. Therefore, the residual obtained by the first decomposition is decomposed again. Similarly, the second decomposition yields three new sets of data: a new trend, a new period, and a new residual.
After the first-level decomposition of the original electric power load data, trend TD_t, period PD_t, and residual RD_t are obtained. The second decomposition of the residual RD_t is carried out to obtain trend TY_t, period PY_t, and residual RY_t. Finally, the components with the same characteristics are combined to obtain the final three decomposition results: trend T_t, period P_t, and residual R_t.

First-Level Decomposition.
Power load data P_t are a discrete-time series with a length of N, that is, t = 1, 2, 3, ..., N, so the three sets of data (trend, period, and residual) can be represented as

P_t = TD_t + PD_t + RD_t,

where PD_t, RD_t, and TD_t are the period component, the residual component, and the trend component, respectively. The detailed decomposition steps are as follows: (1) In the first-level decomposition, the power load of a day has a potential period, so the decomposition period is set to 1 day. For the hourly data used here, the number of points per period is 24, and Num_d = [N/24] is used to calculate the number of periods, where [a] denotes rounding the input a. (2) Utilizing the average regression method, the trend component TD_t, which expresses the overall trend of the time series, is extracted from the original data P_t.
(3) The following two steps obtain the period component of the original data P_t: (a) calculate the initial periodic data XD_t by removing the trend from the original data, XD_t = P_t − TD_t; (b) since Num_d × 24 and N may not be equal, instead of selecting all the data, select points 1 to Num_d × 24 from XD_t, superpose the points that share the same time of day, and divide by Num_d to obtain one periodic curve; this curve is duplicated Num_d times so that the periodic component PD_t with the same N points is obtained.
(4) Subtracting the period and trend data from the raw data gives the residual RD_t: RD_t = P_t − PD_t − TD_t.

Second-Level Decomposition.

Considering that the first-level decomposition does not completely decompose the original data P_t, and the residual data RD_t still contain rich periodic and trend information, the second-level decomposition of the residual RD_t obtained from the first level is carried out in a similar manner. We thereby obtain a second set of data representing the annual trend, the annual period, and the residual:

RD_t = TY_t + PY_t + RY_t,

where PY_t, RY_t, and TY_t are the period component per year, the residual component, and the trend component, respectively.
The detailed decomposition steps are as follows: (1) The period of the second decomposition is set to 1 year, that is, 8760 hours. The number of periods is then calculated as Num_y = [N/8760], where [a] denotes rounding a. (2) Utilizing the average regression method, the trend component TY_t, which expresses the overall trend of the residual RD_t, is extracted. (3) The annual period component PY_t is obtained in the same way as the daily period component in the first-level decomposition, using a period of 8760 points. (4) Subtracting the period and trend data from the first-level residual gives the residual RY_t: RY_t = RD_t − PY_t − TY_t.
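The serial two-level decomposition described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: a centered moving average stands in for the paper's "average regression method", and the number of whole periods is taken by integer division.

```python
import numpy as np

def decompose(x, period):
    """One decomposition level: trend, periodic curve, and residual.

    A sketch of the steps in the text: extract a trend, detrend, average
    the whole periods point by point, tile the averaged curve back to
    length n, and take what is left as the residual.
    """
    n = x.size
    # Trend: centered moving average over one period (stand-in for the
    # paper's average regression method).
    trend = np.convolve(x, np.ones(period) / period, mode="same")
    # Periodic component: detrend, average complete periods, tile to length n.
    detrended = x - trend
    num = n // period                                  # number of whole periods
    curve = detrended[: num * period].reshape(num, period).mean(axis=0)
    seasonal = np.tile(curve, n // period + 1)[:n]
    residual = x - trend - seasonal                    # exact additive remainder
    return trend, seasonal, residual

def two_level(load):
    """Serial two-level decomposition: daily (24 h) then annual (8760 h).
    P_t -> TD, PD, RD; RD -> TY, PY, RY; final trend T = TD + TY."""
    TD, PD, RD = decompose(load, 24)
    TY, PY, RY = decompose(RD, 8760)
    return TD + TY, PD, PY, RY   # trend, daily period, annual period, residual

# Example with a synthetic two-year hourly series:
t = np.arange(2 * 8760)
synthetic = (100 + 0.001 * t
             + 10 * np.sin(2 * np.pi * t / 24)
             + 20 * np.sin(2 * np.pi * t / 8760))
T, PD, PY, RY = two_level(synthetic)
```

By construction the four returned components sum back exactly to the input series, mirroring the additive model P_t = T_t + PD_t + PY_t + RY_t.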

Subpredictor.
After the two-level decomposition, five components are obtained: residual RY_t, trend TD_t, trend TY_t, period PD_t, and period PY_t. Trends TD_t and TY_t both represent the linear trend of the data, so they are combined into a single trend component. Therefore, four groups of subdata are finally obtained, and four GRU networks are trained as predictors, one for each group. The GRU network is a development of the LSTM model: it simplifies the model structure, reduces the number of network parameters to be trained, and inherits LSTM's ability to handle long-term dependencies. Hence, the GRU is a good model structure for prediction. The GRU cell consists of two parts, the update gate and the reset gate, and is structured as shown in Figure 4. The update gate adjusts the information transmitted from the previous moment to the current moment: the smaller its value, the less information is passed from the last moment to the present one. The reset gate adjusts the degree to which information from the previous moment is ignored: the larger the reset gate value, the less information is ignored, so the new input can be fused with more stored information.
The forward propagation of input data in each GRU cell is given by

z_t = σ(W_z x_t + U_z h_{t−1} + b_z),
r_t = σ(W_r x_t + U_r h_{t−1} + b_r),
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h),
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t,

where x_t is the input; z_t, r_t, h̃_t, and h_t represent the update gate, the reset gate, the candidate state of the hidden node at the t-th time point, and the output state of the hidden node at the t-th time point, respectively; U and W represent the weights in the model; b represents the bias; ⊙ represents element-wise multiplication; and σ and tanh are the activation functions used in the cell:

σ(x) = 1/(1 + e^{−x}),  tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}).

According to the structure of the above GRU cell and the relationship between the input and output data, a GRU network is built as shown in Figure 5. The network includes multiple GRU units, and the number of network layers is 2. As shown in Figure 5, X_t (t = 1, 2, ..., N) is the input of the GRU network, Y_{m+t} (t = 1, 2, 3, ..., N) is the output, and m is the number of GRU cells in each layer.
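A single GRU step can be sketched directly in NumPy. This follows the gate semantics described in the text (a smaller update gate z passes less of the previous state h_{t−1}); the weight names, sizes, and random initialization are illustrative assumptions, not the paper's values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, p):
    """One GRU forward step with update gate z and reset gate r."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])              # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])              # reset gate
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                              # new hidden state

# Hypothetical sizes: 1 input feature, 8 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 1, 8
p = {}
for g in ("z", "r", "h"):
    p["W" + g] = rng.normal(size=(n_hid, n_in)) * 0.5
    p["U" + g] = rng.normal(size=(n_hid, n_hid)) * 0.5
    p["b" + g] = np.zeros(n_hid)

h = np.zeros(n_hid)
for x_val in np.sin(np.arange(24) * 2 * np.pi / 24):   # one day of hourly inputs
    h = gru_cell(np.array([x_val]), h, p)
```

Because h_t is a gated mix of the previous state and a tanh-bounded candidate, every hidden activation stays inside (−1, 1).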

Sequence Model-Based Optimization (SMBO).
Before training the deep-learning model, we need to initialize the hyperparameters of the model, which can improve the model's prediction performance. For the two subpredictors of the trend component and the daily period component, the traditional parameter selection method achieves good prediction with the network's initialization parameters. However, the annual period component and the residual data are complicated, so for these we use one of the Bayesian optimization methods, the SMBO algorithm [39].
SMBO needs an objective function and then updates the posterior distribution of the objective function over the parameter space. Here, the objective function L(k) is chosen as the root mean square error (RMSE):

L(k) = sqrt((1/num) Σ_{i=1}^{num} (y(k_i) − y_i)²),

where num is the number of input hyperparameter groups k, y(k_i) is the prediction result obtained by the model using hyperparameter combination k_i, and y_i is the real value. The SMBO algorithm then seeks

k* = argmin_{k ∈ K} L(k),

where k* is the best parameter combination determined by the SMBO algorithm, k is a set of input hyperparameters, and K is the multidimensional hyperparameter space. The update of the parameter space includes two steps: the Gaussian process (GP) step and the hyperparameter selection step. In the Gaussian process step, the algorithm models and fits the objective function and obtains the posterior distribution corresponding to the input k; in the hyperparameter selection step, "development" and "exploration" are combined to find the optimal parameter at minimum cost. "Exploration" refers to finding appropriate parameters in the unsampled hyperparameter space, which often leads to the global optimal parameter combination. "Development" searches near the last set of the hyperparameter space according to the posterior probability distribution. The objective function L(k) is assumed to follow a Gaussian process,

L(k) ∼ GP(μ(k), O(k, k′)),

where μ(k) is the mean of L(k), O(k, k′) is the covariance matrix of L(k), and A ∼ B denotes that A follows the distribution B. During the parameter search of the SMBO algorithm, the covariance matrix of the Gaussian process changes with the number of iterations. If the hyperparameter group entered in step i + 1 is k_{i+1}, the covariance matrix can be expressed as

O_{i+1} = [ O_i        o^T
            o          o(k_{i+1}, k_{i+1}) ],

where o = [o(k_{i+1}, k_1), o(k_{i+1}, k_2), ..., o(k_{i+1}, k_i)]. Then, the posterior probability of the objective function L(k) can be obtained:

P(L | D_{i+1}, k) ∼ N(μ_{i+1}(k), σ²_{i+1}(k)),

where D is the observation data, P(L | D_{i+1}, k) is the probability of the objective function L given the step-(i + 1) data D and the parameter group k, and N is the normal distribution. The next step is to find the best parameter through hyperparameter selection after the posterior probability is obtained. The upper confidence bound (UCB) acquisition function is used for "development" in this paper:

k_{i+1} = argmax_{k ∈ K} H(k | D_i),

where H(k | D_i) is the UCB acquisition function, which combines the posterior mean and the posterior standard deviation weighted by a constant ζ_{i+1}, and k_{i+1} is the hyperparameter group selected in step i + 1. The SMBO procedure is summarized in Algorithm 1.
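The GP-plus-acquisition loop can be sketched with scikit-learn's Gaussian process regressor. This is a toy one-dimensional illustration, not the paper's setup: the objective below stands in for the validation RMSE L(k) of a trained subpredictor, and the candidate grid, ζ, and iteration budget are all hypothetical. Since L is minimized, the sketch picks the lowest confidence bound μ − ζσ rather than the highest.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def L(k):
    """Hypothetical stand-in for the RMSE of a model trained with 'k'."""
    return (k - 0.3) ** 2 + 0.05 * np.sin(8 * k)

candidates = np.linspace(0.0, 1.0, 201).reshape(-1, 1)  # discretized space K
K_obs = [0.0, 0.5, 1.0]                                 # initial samples D
y_obs = [L(k) for k in K_obs]

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
for i in range(10):
    # GP step: fit the surrogate and get posterior mean/std over K.
    gp.fit(np.array(K_obs).reshape(-1, 1), np.array(y_obs))
    mu, sigma = gp.predict(candidates, return_std=True)
    # Selection step: confidence-bound acquisition, trading off
    # "development" (low mu) against "exploration" (high sigma).
    zeta = 1.0
    k_next = float(candidates[np.argmin(mu - zeta * sigma), 0])
    K_obs.append(k_next)
    y_obs.append(L(k_next))

k_best = K_obs[int(np.argmin(y_obs))]   # analogue of k* = argmin L(k)
```

Each iteration re-fits the surrogate on the enlarged data set D, which is the covariance-matrix growth described in the text.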

Experiment Results and Discussion
In this study, the electric power load data are from American Electric Power (AEP) and include 26,280 points from January 1, 2017, to January 1, 2020. In the experiment, the first 70% of the data were selected as the training set and the remaining 30% as the test set. The data are first decomposed, and each subcomponent is then normalized separately. In the training process, the trend T_t and period PD_t are put directly into different subpredictors for training, while the period PY_t and residual RY_t first undergo a hyperparameter search by the SMBO algorithm and are then put into their subpredictors for training. In the testing process, the components are put directly into the trained subpredictors, and the results from the different subpredictors are fused to obtain the final prediction. The power load prediction can then be used to plan the precise supply of power and to develop operation plans. The overall system flow diagram is shown in Figure 6.

Experimental Setup.
The predictor uses Keras to build the learning model. All models were trained and tested on a PC server with an Intel Core i7 2.21 GHz CPU and 32 GB RAM. In deep learning, many hyperparameters need to be set (for example, the number of network layers, the weight initialization, and the learning rate). The GRU network structure is set to two layers.
For the complex components, such as the annual period component and the residual component, this paper uses the SMBO algorithm to find the optimal values of some hyperparameters of the network. The remaining parameters use the Keras default initialization, and the model parameters are obtained by optimizing the predetermined loss function.
For components such as the daily period component and the trend component, the GRU model uses the Nadam optimization algorithm, and all parameters use the Keras default values. The activation functions in the network are tanh and ReLU. The learning and prediction steps are set to 24; that is, the model uses the previous day's power load to predict the next day's power load. Predicting one day in advance helps the related departments get a general idea of the next day's power load and make appropriate plans based on it.
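The day-ahead setup above amounts to building input/output windows of 24 hourly points each. A minimal sketch, assuming non-overlapping day-by-day windows (one of several possible constructions; the paper does not specify the stride):

```python
import numpy as np

def make_windows(series, steps=24):
    """Build (X, Y) pairs where each sample uses one day (24 hourly points)
    to predict the next day."""
    X, Y = [], []
    for i in range(0, series.size - 2 * steps + 1, steps):
        X.append(series[i : i + steps])          # previous day's load
        Y.append(series[i + steps : i + 2 * steps])  # next day's load (target)
    return np.array(X), np.array(Y)

# Hypothetical normalized component covering 10 days:
comp = np.sin(np.arange(240) * 2 * np.pi / 24)
X, Y = make_windows(comp)
# 9 day -> next-day training pairs, each of length 24
```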
In this study, five indexes are used to evaluate the performance of the model: root mean square error (RMSE), normalized root mean square error (NRMSE), mean absolute error (MAE), symmetric mean absolute percentage error (SMAPE), and the Pearson correlation coefficient (R). The smaller the first four indicators, the more accurate the model prediction; the larger the fifth indicator (R), the better the fit between the observed and predicted values. These five indicators are calculated as

RMSE = sqrt((1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²),
NRMSE = RMSE / ȳ,
MAE = (1/n) Σ_{i=1}^{n} |ŷ_i − y_i|,
SMAPE = (1/n) Σ_{i=1}^{n} 2|ŷ_i − y_i| / (|y_i| + |ŷ_i|),
R = Σ_{i=1}^{n} (y_i − ȳ)(ŷ_i − μ_ŷ) / sqrt(Σ_{i=1}^{n} (y_i − ȳ)² · Σ_{i=1}^{n} (ŷ_i − μ_ŷ)²),

where n is the number of samples, y_i is the ground-truth value of the power load, ȳ is the average of the ground-truth values, ŷ_i is the predicted value, and μ_ŷ is the average of the predictions.
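The five indexes are straightforward to compute. The sketch below uses one common convention for NRMSE (normalization by the mean of the ground truth) and for SMAPE (a fraction in [0, 2]); the paper's exact variants may differ:

```python
import numpy as np

def metrics(y_true, y_pred):
    """RMSE, NRMSE, MAE, SMAPE, and Pearson R for two equal-length arrays."""
    err = y_pred - y_true
    rmse = np.sqrt(np.mean(err ** 2))
    nrmse = rmse / np.mean(y_true)                 # normalized by the mean load
    mae = np.mean(np.abs(err))
    smape = np.mean(2 * np.abs(err) / (np.abs(y_true) + np.abs(y_pred)))
    r = np.corrcoef(y_true, y_pred)[0, 1]          # Pearson correlation coefficient
    return rmse, nrmse, mae, smape, r

# Example: a perfect prediction gives zero errors and R = 1.
y_true = np.array([10.0, 12.0, 15.0, 11.0])
rmse, nrmse, mae, smape, r = metrics(y_true, y_true)
```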

Hyperparameter Selection Based on Bayesian Optimization.

Table 1 shows the hyperparameter space of the SMBO algorithm. The selected hyperparameters include the number of neurons in the first layer, the batch size, and the optimizer. All the hyperparameter groups are tested in the model for 100 epochs, and the optimal group of hyperparameters is finally obtained for the subsequent network training. Table 2 shows the comparison of the prediction performance with and without Bayesian parameter optimization for the annual period component PY_t and the residual component RY_t. According to Table 2, the Bayesian-optimized models achieve better prediction performance on both components. The comparison between the predicted power load for December 12 to 23, 2019, and the real power load is shown in Figure 7. We can see that the weekend power load is lower, while the weekday power load is higher; within one day, the morning and afternoon loads are higher, and the noon load is lower. Therefore, it is reasonable that the proposed method decomposes the original power load data twice, taking both the daily and the annual periodicity into account.

Algorithm 1: Sequential model-based optimization (SMBO).
Input: L(k), the root mean square error of the proposed model; T, the number of hyperparameter groups to select; H, the UCB acquisition function; X, the input data; S, the proposed model; k, the input hyperparameter group.
Output: the optimal hyperparameter group k*.
  D ← InitSamples(L(k), X)
  for i ← |D| to T do
    (model the objective function and calculate the posterior probability)
    P(L | D, k) ← FitModel(S, D)
    (select a parameter group using the UCB acquisition function)
    k_i ← argmax_{k ∈ K} H(k | P(L | D, k))
    (train the network with hyperparameter group k_i to get the error)
    y_i ← L(k_i)
    (update the data set)
    D ← D ∪ {(k_i, y_i)}
  end for
  k* ← argmin_{k ∈ K} L(k)
  return k*

Comparison of Prediction Results with Different Models.
In this experiment, we compare the performance of the proposed method with seven models, i.e., the recurrent neural network (RNN) [40], long short-term memory (LSTM) [41], GRU [42], STL-RNN (RNN based on STL), STL-LSTM (LSTM based on STL) [43], STL-GRU (GRU based on STL) [8], and wavelet-LSTM (W-LSTM) [44]. The hourly power load data used are from February 6, 2019, to December 31, 2019. The partial prediction results of each model are shown in Figure 8; it can be seen from the figure that the proposed model performs best. Figures 9 and 10 show the comparison results of the five indicators. Among the single models, the results show that the GRU network has the best prediction performance. For example, compared with the RMSE of RNN and LSTM, the RMSE of GRU is reduced by 11.1% and 3.7%, respectively. As to the decomposition methods, compared with STL-RNN and STL-LSTM, the RMSE of STL-GRU is decreased by 12.3% and 3.1%, respectively. This validates the choice of GRU as the subpredictor in this paper.
Furthermore, we find that the serial two-level decomposition is rational and that the proposed model works best: it obtains the lowest RMSE (676.6433), MAE (486.0197), and SMAPE (0.0328), the highest R (0.9575), and the second-lowest NRMSE (0.0572). We believe that the original data contain nonlinear information; after two serial decompositions, the complex periodic information and trend information are predicted separately, which fits the data better, and their combination yields better overall performance. The deep-learning prediction models proposed in this paper can be combined with parameter estimation algorithms [45][46][47][48][49][50][51], such as the iterative algorithms [52][53][54][55][56][57] and the recursive algorithms [58][59][60][61][62][63][64], to study new modeling and prediction approaches for different engineering application problems [65][66][67][68][69], such as system modeling, information processing, and transportation communication systems.

Conclusions
More accurate power load prediction can help power generation and power operation companies better control their operation status, facilitate market regulation, save costs, and prevent pollution.
This study uses a serial two-level decomposition structure to decompose the electric power load time series according to its different periods, which reduces the complex nonlinear relationships of the original electric power load data. The overall trend component indicates that the electric power load data change slowly, which can be understood as the electric power load remaining at a certain level for a long time. The daily period component indicates the daily variation, higher in the daytime and slightly lower at night; over one year, the load is higher in winter and summer and lower in spring and autumn, all of which correspond to the actual use of electricity.
After decomposing the raw power load sequence, GRUs are used to build the component prediction models, and the predictions from the subpredictors are fused to obtain a more accurate prediction. After the two-level decomposition of the data, the trend information and the multiple-period information in the original complex time series are separated into subsequences, and different prediction models are built for the subsequences based on their characteristics to obtain the final prediction. The prediction methods proposed in this paper can also be applied in other studies [70][71][72][73][74][75][76] for different purposes. In future research, to further improve the model's performance, new network structures will be adopted, and other decomposition or combination methods will be tried. The model proposed in this study can be applied not only to power prediction but also to other data that contain multiple period information.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.