A Novel Data-Driven Method for Medium-Term Power Consumption Forecasting Based on Transformer-LightGBM

With the widespread use of new energy sources and the Internet of Things, the power market landscape has become complex. In particular, new energy is more stochastic and volatile, which makes forecasts on longer time scales prone to inaccuracy and affects electricity trading. This study proposes a new method for predicting medium-term load series data based on transformer-lightGBM. The method first preprocesses electricity market data, including missing value processing, outlier processing, overall analysis, and correlation analysis, to extract features strongly correlated with medium-term electricity consumption. Then, a transformer neural network is used to learn the complex patterns and dynamic time scales of the load series data and to predict the day-ahead market series. Finally, lightGBM combines power characteristics and time characteristics to forecast power consumption. The effectiveness of the proposed method is verified on the ISO-NE dataset. Experimental results indicate that the proposed method achieves more accurate predictions than LSTM-based methods.


Introduction
Load forecasts are based on the known demand for electricity and take into account political, economic, climatic, and other relevant factors to forecast future electricity demand. The Internet of Things brings more extensive data access to the smart grid, enriching the data available for prediction. From an economic point of view, a load forecast is essentially a forecast of electricity market demand [1]. In recent years, renewable energy has grown considerably under the target incentives and policy support of governments. Owing to the high randomness of renewable energy, forecasts on longer time scales tend to be inaccurate, which may lead to more conservative outcomes in medium- and long-term power trading and is not conducive to the full consumption of renewable energy [2]. Therefore, many countries use medium- and long-term contracts in the electricity market to lock in power generation and secure revenue. They then introduce electricity spot markets to promote the direct participation of renewable energy in competitive market transactions [3]. The electricity spot market consists of day-ahead demand, composed primarily of fixed and price-sensitive demand bids, and real-time demand, defined as the sum of nondistributed load assets, station-served load assets, and nonmetered load assets. The accuracy of load forecasting has a large impact on electricity generators and operators. A forecast that is too low may result in reduced revenue from electricity sales; a forecast that is too high may result in poor utilization of new or even existing generation capacity. The medium-term load forecast has a forecast period of 1 month to 1 year and is used to prepare long-term operational plans for reservoir scheduling, unit maintenance, exchange plans, and fuel plans.
The main methods of medium-term load forecasting include dynamic averaging based on extrapolation of time-series trends, exponential smoothing, growth rate methods, grey forecasting, Markov forecasting, and growth curve methods [4]. In medium-term load forecasting, single-factor regression analysis and elasticity coefficient methods are used to take into account unrelated factors, and forecasting methods for multiple correlated factors include multiple regression analysis, clustering forecasting, decision trees, and econometric methods. Machine learning techniques have developed rapidly; they can establish complex nonlinear relationships by learning trends in historical data. The following studies highlight recent advances in forecasting future energy demand. Guan et al. [5] proposed a method for constructing the kernel function of a Gaussian process model for long-term probabilistic load prediction. Son et al. [6] used LSTM for medium- and long-term load forecasting. Omaji et al. [7] proposed a model to predict the hourly load for the following month based on historical hourly data and temperature data, with improved entropy mutual information used in the data preprocessing. Dong et al. [8] proposed a new selective sequence learning method to transform the multiyear long-duration prediction problem into a sequence prediction problem with multiple time steps. In addition, a neural network prediction model with singular spectrum analysis was proposed to improve the decomposition-based prediction accuracy for nonsmooth, nonlinear medium-term load sequences. Further, Liu and Zhao [9] proposed a medium-term forecasting method considering economic and meteorological factors. First, the monthly historical electricity consumption is decomposed into long-term trend, cyclical, seasonal, and irregular components using the seasonal decomposition method.
Then, based on electricity, meteorological, and economic data, a support vector machine is used to forecast each component separately, and the component forecasts are combined into a comprehensive forecast of total monthly electricity consumption. Zhang et al. [10] proposed a medium- and long-term forecasting model that takes into account the coordination relationship and lag effect of each influencing factor. The strongly correlated factors affecting the load change are obtained through correlation matrix screening. After the characteristic decomposition part is obtained, the delayed effect test is used to determine the number of lag periods, and the effect of data noise is removed using principal component analysis. He et al. [11] presented a probability density method with a continuous conditional quantile function to predict the medium-term electricity load for a given day and introduced the concept of electricity load density. Khatereh et al. [12] proposed a model for predicting solar energy with an HMM to find the energy variation at a specified time over consecutive days. Shang et al. [13] used WNN and generalized regression to predict weather factors, and Elman neural networks with the cuckoo search to optimally predict wind speed. Finally, a variance analysis model was presented to combine the forecast results of weather factors and wind speed data. Shobana Devi et al. [14] presented an integrated prediction method to improve wind power prediction performance. An LSTM model with an enhanced forget gate (LSTM-EFG), with parameters optimized by the cuckoo search optimization algorithm, was used to predict the subseries data extracted by ensemble empirical mode decomposition. Xia et al. [15] improved the GRU-RNN structure and used improved training methods to increase robustness. Yang et al. [16] proposed a multitask prediction framework with BDL to quantify the uncertainty among different groups. Mohsen et al.
[17], using customers' consumption records at different times, proposed a reliable procedure to check consumption changes. Florian et al. [18] derived speed prediction curves taking into account individual driving style characteristics and real-time traffic data to obtain EV consumption predictions. Alonso et al. [19] proposed a multilevel poly-time series clustering method that summarizes each time series with several representative features, such as quantile autocovariances and simple and partial autocorrelations. Qian et al. [20] proposed a combination of simulation and transfer learning to improve the prediction accuracy of thermal and cooling loads when historical data are affected by changes. Tanveer and Zhang [21] introduced two novel deeply supervised machine learning models, a fitted stochastic feature expansion Gaussian kernel regression model and a nonparametric KNN-based model, for demand forecasting for buildings and utility companies. Electricity demand was forecasted in the medium-to-long term by analyzing customers' electricity consumption patterns to stabilize the supply [6]. Nasir et al. [22] presented a hybrid framework of SVM, GRU, and CNN. Grey wolf optimization and earthworm optimization were used to optimize the hyperparameters of the SVM and CNN-GRU. Omaji et al. [7] presented a method to forecast hourly loads one month in advance using hourly load and temperature data. An improved entropic mutual information feature selection method was used for data preprocessing, and CRBM was used for load forecasting, while consumer behavior was clustered using adaptive k-means. The transformer is a model proposed in 2017 [23]. It is based entirely on the attention mechanism and discards the CNN and RNN structures to solve the long-range dependency problem of RNNs and their variants. It retains information over longer distances and supports parallelized computing.
Its ideas overturned the earlier assumption that sequence modeling requires RNNs. It has therefore been widely used in various areas of natural language processing [24,25], but it has seen less use in load forecasting tasks.
Motivated by the above, we propose a novel data-driven method for medium-term power consumption forecasting based on transformer-lightGBM, which aims to further improve the accuracy of electricity market forecasting for long time series. The key contributions of this study are as follows:
(i) A data-driven method is adopted to analyze the data of the New England electricity market, including missing value processing, outlier processing, holistic analysis, and correlation analysis, and features strongly correlated with medium-term electricity consumption are extracted for forecasting.
(ii) The transformer is used to capture the actual changes and fluctuating trends of the load. For power consumption forecasting, the transformer-lightGBM method combines a multihead self-attention mechanism with temporal modeling capability. The transformer model can extract long-term temporal relationships owing to its multihead attention structure, and lightGBM is an ensemble learning framework that integrates feature sequences with efficient boosting capabilities.
(iii) We derive monthly and quarterly power consumption from hourly demand forecasts. The proposed method allows multitimescale forecasting by attending to the selected features and the feature forecast results, which improves the accuracy of power consumption forecasts in the context of renewable energy.

PCC.
The Pearson correlation coefficient (PCC) is a characteristic quantity used to measure the linear relationship between two continuous, normally distributed random variables, as shown in the following equation:

$$\rho_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E\left[(X - E[X])(Y - E[Y])\right]}{\sigma_X \sigma_Y}$$

In the above formula, $X$ and $Y$ are random variables, $\rho_{XY}$ is the PCC, $E$ is the expected value, cov is the covariance (estimated in practice by the sample covariance $\operatorname{cov}(x, y)$), and $\sigma_X$ and $\sigma_Y$ are the standard deviations. The PCC takes values between −1 and 1. As the linear relationship between the two variables strengthens, the PCC tends toward 1 or −1. When one variable increases as the other increases, the PCC is greater than 0. If one variable increases while the other decreases, the PCC is less than 0. If the PCC equals 0, there is no linear correlation.
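As an illustration only, the PCC described above could be computed as in the following minimal Python sketch. The function name `pearson` and the toy values are assumptions introduced here; they are not taken from the ISO-NE data or from the original implementation.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of centered products; the 1/(n-1) factors cancel in the ratio.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Toy values, made up for illustration (not ISO-NE data).
real_time = [10.0, 12.0, 15.0, 14.0, 18.0]
day_ahead = [11.0, 12.5, 14.0, 14.5, 17.0]
mirrored = [-v for v in real_time]

r_pos = pearson(real_time, day_ahead)  # close to +1: strong positive relation
r_neg = pearson(real_time, mirrored)   # -1: perfectly opposite movement
```

Features whose absolute PCC with the target is high would be the natural candidates to keep; any selection threshold would be a modeling choice that the text does not specify.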

Transformer Model.
The transformer model builds on the attention mechanism and discards gated recurrent models such as LSTM, using neither RNNs nor CNNs. It uses stacked self-attention and point-wise fully connected layers in both the encoder and the decoder. The structure of the transformer is shown in Figure 1.
The attention mechanism is inspired by human visual processing: after visual input, not all information is processed; instead, attention is focused on a specific part. The transformer is a Seq2Seq model with an encoder and a decoder, but it is not an RNN. It is based on the mechanisms of attention and self-attention.
Attention involves giving more weight to key information. Here, we use $X = [x_1, x_2, \ldots, x_N]$ to represent the input data, where each $x_i$ is an input column vector. Given a query vector $q$, the attention mechanism calculates its correlation with each input vector. $T \in [1, N]$ represents the position of the selected input vector. The definition is shown in the following equation:

$$\alpha_i = p(T = i \mid X, q) = \operatorname{softmax}\big(s(x_i, q)\big) = \frac{\exp\big(s(x_i, q)\big)}{\sum_{j=1}^{N} \exp\big(s(x_j, q)\big)}$$

where $s$ is the score function, and $\alpha_i$ is the attention distribution. In this study, the scaled dot-product commonly used in self-attention models is used as the score function, as shown in the following equation:

$$s(x_i, q) = \frac{x_i^{\top} q}{\sqrt{d}}$$

where $d$ is the dimension of $x$. Self-attention introduces Q (query vectors to match others), K (key vectors to be matched), and V (information vectors to be extracted), and uses the scaled dot-product as the function to generate dynamic weights, which also allows it to process sequences of different lengths.

The positional encoding layer appears only after the embedding on the encoder side and the decoder side and before the first block. Without it, the transformer model does not work properly. Position encoding is a component of the transformer framework that compensates for the fact that the attention mechanism itself cannot capture positional information.
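The text does not state which positional encoding is used. As a hedged illustration, the sketch below assumes the fixed sinusoidal encoding of the original transformer paper [23]; the function name `positional_encoding` and the sizes are hypothetical.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal position codes: one list of d_model floats per position."""
    if d_model % 2 != 0:
        raise ValueError("d_model is assumed to be even in this sketch")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):
            # Wavelengths grow geometrically with the dimension index.
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table

# Hypothetical sizes: 24 hourly positions, embedding width 8.
pe = positional_encoding(num_positions=24, d_model=8)
```

Each row would be added element-wise to the embedding of the corresponding time step, so that two identical load values at different hours yield different inputs to the attention layers.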
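In summary, multihead self-attention can be expressed as shown in the following equation:

$$\operatorname{MultiHead}(Q, K, V) = \operatorname{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$$

where

$$\mathrm{head}_i = \operatorname{Attention}\big(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\big), \qquad \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

Here, $h$ is the number of heads, $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learned projection matrices, and $d_k$ is the dimension of the projected key vectors, following the formulation in [23].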
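To make the mechanism concrete, the following NumPy sketch implements scaled dot-product attention and a multihead self-attention layer following the standard formulation in [23]. It is an illustration under stated assumptions: the shapes, the random weights, and the function names are hypothetical, and masking, biases, and training are omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention; returns (output, weights)."""
    d_k = q.shape[-1]
    weights = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d_k))
    return weights @ v, weights

def multihead_self_attention(x, w_q, w_k, w_v, w_o):
    """x: (N, d_model); w_q, w_k, w_v: (h, d_model, d_k); w_o: (h * d_k, d_model)."""
    q = x @ w_q  # (h, N, d_k): x is broadcast over the head axis
    k = x @ w_k
    v = x @ w_v
    heads, weights = attention(q, k, v)            # heads: (h, N, d_k)
    concat = np.concatenate(list(heads), axis=-1)  # (N, h * d_k)
    return concat @ w_o, weights

# Hypothetical sizes: 6 time steps, model width 8, 2 heads of width 4.
rng = np.random.default_rng(0)
N, d_model, h, d_k = 6, 8, 2, 4
x = rng.normal(size=(N, d_model))
w_q = rng.normal(size=(h, d_model, d_k))
w_k = rng.normal(size=(h, d_model, d_k))
w_v = rng.normal(size=(h, d_model, d_k))
w_o = rng.normal(size=(h * d_k, d_model))

out, weights = multihead_self_attention(x, w_q, w_k, w_v, w_o)
```

Each row of `weights[i]` is the attention distribution of one time step over all time steps for head `i`, which is how distant hours of the load series can influence each other directly.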

LightGBM Model.
LightGBM improves the performance of the GBDT by adopting histogram-based search for the optimal split point and a leaf-wise decision tree growth method with a depth limit. From the perspective of sample reduction, the method eliminates most samples with small weights during training and calculates the information gain only on the remaining sample data [26][27][28]. The histogram optimization algorithm converts feature values into bin values before training; that is, it builds a piecewise function for each feature, assigns every sample's value for that feature to a segment (bin), and thereby converts the feature from continuous to discrete values. An intuitive example is shown in Figure 2.
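The discretization step can be sketched as follows. This is a simplified illustration that assumes equal-width bins; LightGBM's actual binning strategy is more elaborate, and the helper names `make_bin_edges` and `to_bin` are hypothetical.

```python
def make_bin_edges(values, num_bins):
    """Upper edges of equal-width bins over the observed range (simplifying assumption)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + width * (i + 1) for i in range(num_bins - 1)]

def to_bin(value, edges):
    """Index of the first bin whose upper edge is >= value."""
    for idx, edge in enumerate(edges):
        if value <= edge:
            return idx
    return len(edges)  # last bin collects everything above the final edge

# Toy continuous feature, made up for illustration.
feature = [0.2, 1.7, 3.1, 4.9, 5.0, 7.4, 8.8, 10.0]
edges = make_bin_edges(feature, num_bins=4)
binned = [to_bin(v, edges) for v in feature]
```

After this step, only small integer bin indices need to be stored and scanned, which is the source of the memory and speed savings attributed to the histogram algorithm.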
There are many advantages to using the histogram algorithm. First, it reduces memory consumption because only the discretized feature values need to be stored. Second, the computational cost is dramatically reduced: the histogram algorithm only needs to evaluate k candidate split points, where k is the number of bins and can be considered a constant. In this way, the time complexity is reduced from O (data × feature) to O (k × feature). Computing a histogram difference only requires traversing the k buckets of the histogram. LightGBM can therefore construct the histogram of one leaf and obtain the histogram of its sibling leaf at a fraction of the cost by subtracting it from the histogram of the parent node (calculated in the previous round), doubling its speed, as shown in Figure 3.
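The histogram difference can be illustrated with a short sketch. The bin indices, gradients, and split below are made-up values, and the helper names are hypothetical; the point is only that the sibling histogram is obtained from k subtractions rather than another pass over the samples.

```python
def build_histogram(bins, gradients, num_bins):
    """Per-bin [sum of gradients, sample count] for one feature."""
    hist = [[0.0, 0] for _ in range(num_bins)]
    for b, g in zip(bins, gradients):
        hist[b][0] += g
        hist[b][1] += 1
    return hist

def subtract_histograms(parent, child):
    """Sibling histogram = parent - child, touching only the k buckets."""
    return [[p[0] - c[0], p[1] - c[1]] for p, c in zip(parent, child)]

# Toy data: bin index and gradient per sample (illustrative values only).
bins = [0, 0, 1, 1, 1, 2, 3, 3]
grads = [0.5, -1.0, 0.25, 0.75, -0.5, 1.5, -0.25, 1.0]
in_left = [True, False, True, True, False, False, True, False]  # hypothetical split

parent = build_histogram(bins, grads, 4)
left = build_histogram(
    [b for b, m in zip(bins, in_left) if m],
    [g for g, m in zip(grads, in_left) if m],
    4,
)
right_by_subtraction = subtract_histograms(parent, left)
right_direct = build_histogram(
    [b for b, m in zip(bins, in_left) if not m],
    [g for g, m in zip(grads, in_left) if not m],
    4,
)
```

In practice the histogram would be built directly only for the smaller child, so the larger child comes almost for free.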
LightGBM adds a maximum depth limit to the leaf-wise method to ensure high efficiency while preventing overfitting, as shown in Figure 4.
In the information gain calculated on the sampled data, $P$ is the set of extracted samples, $Q$ is the set of randomly extracted samples, $a$ and $b$ are the extraction ratios, $S = P \cup Q$, $S_l$ is the set of data less than $v$, $S_h$ is the set of data greater than $v$, and $g_k$ is the negative gradient of $X_k$.

The ISO-NE electricity market data are from [29]. Meteorological data are from [30]. The data span January 1, 2016, to March 31, 2021, with 46009 data records. Taking the 2020 data as an example, the seasonal decomposition of the real-time demand series is analyzed, and the original, trend, seasonal, and residual series are shown in Figure 5. The real-time power demand density distribution is shown in Figure 6. The data are still volatile, and large fluctuations make overfitting more likely, so we process the data to make them relatively stable. Here, we choose a logarithmic transformation to stabilize the data, as shown in Figure 7. After the logarithmic transformation, the data distribution is more uniform, and the differences in magnitude are reduced. Such labels are effective for model training. Following the data-driven method for medium-term power consumption forecasting, we analyze the ISO-NE hourly dataset as a whole. The overall analysis is shown in Figure 8. The data correlation analysis is shown in Figure 9. The correlation analysis shows that the highest correlation coefficient with real-time demand is 0.97 for day-ahead cleared demand, followed by 0.45 for day-ahead LMP and 0.45 for per hour. We therefore use the sequence features of day-ahead cleared demand, day-ahead LMP, and per hour as the main features.
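The logarithmic transformation of the demand label can be sketched as below. This is a minimal illustration that assumes the natural logarithm and strictly positive demand; the text does not state the base or any offset, and the numbers are made up rather than taken from the ISO-NE data.

```python
import math

def log_transform(values):
    """Natural-log transform; assumes every value is strictly positive."""
    return [math.log(v) for v in values]

def inverse_log_transform(values):
    """Map model outputs on the log scale back to the original units."""
    return [math.exp(v) for v in values]

# Toy hourly demand values in MW (illustrative only).
demand_mw = [9800.0, 11250.0, 14900.0, 21300.0, 12400.0]

log_demand = log_transform(demand_mw)
restored = inverse_log_transform(log_demand)

range_raw = max(demand_mw) - min(demand_mw)
range_log = max(log_demand) - min(log_demand)
```

A model trained on `log_demand` predicts on the log scale, so its outputs need to pass through `inverse_log_transform` before errors are computed in the original units.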

Medium-Term Power Consumption Forecasting Based on Transformer-LightGBM.
In this study, we consider the deep relationship between real-time market demand and day-ahead market demand and design a deep-learning-based multivariate model to predict power consumption in the next month and the next quarter. The transformer network extracts the sequence features of the power spot market, and lightGBM gives the final prediction. The overall research framework is shown in Figure 10, and the processing schematic diagram is shown in Figure 11. The framework consists of three steps: data preprocessing, feature extraction based on the transformer, and load prediction based on lightGBM.
In this study, the transformer-lightGBM parameter settings are listed in Table 1. A cross-validated grid search method is used to optimize the parameters of the estimator for lightGBM.
Processing of the real-time demand data and related data is listed in Table 2.
The method constructed in this study offers two main improvements in dataset feature extraction and processing:
(i) The prediction of long time series of real-time loads in the power market is transformed into a supervised learning problem; that is, data of length T are used to predict the results for time points T + 744 and T + 2208. After predictions with satisfactory accuracy are obtained, the daily and monthly electricity consumption is calculated from the real-time loads.
(ii) The transformer-lightGBM method effectively combines the variable-timing processing capability of the multihead self-attention mechanism with the ensemble learning model to improve algorithmic accuracy, speed, and generalization.
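The supervised reformulation in item (i) can be sketched as a sliding window over the hourly series. The function name `make_supervised`, the lookback length, and the stand-in series are assumptions for illustration; only the horizons 744 and 2208 come from the text.

```python
def make_supervised(series, lookback, horizon):
    """Split a series into (input window, target window) training pairs."""
    pairs = []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        x = series[start:start + lookback]
        y = series[start + lookback:start + lookback + horizon]
        pairs.append((x, y))
    return pairs

MONTH_HORIZON = 744     # = 31 * 24 hourly points, as in the text
QUARTER_HORIZON = 2208  # = 92 * 24 hourly points, as in the text

# Small stand-in for the hourly demand series, with a short hypothetical
# lookback and horizon so the result is easy to inspect.
series = list(range(20))
pairs = make_supervised(series, lookback=5, horizon=3)
first_x, first_y = pairs[0]
last_x, last_y = pairs[-1]
```

With real data, `horizon` would be set to `MONTH_HORIZON` or `QUARTER_HORIZON`, and each target window would hold the hourly values to be predicted.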

Experimental Results and Analysis
The experimental environment is as follows: Python 3.7, TensorFlow 2.3 (GPU), an NVIDIA GeForce 940MX graphics card, a 64-bit Intel i5 processor, and 12 GB RAM. The datasets for experiment 1 and experiment 2 are annual hourly data from ISO-NE, spanning 1916 days from January 2016 to March 2021.
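In this experiment, the parameter settings for the comparison methods are listed in Table 3. The APE in equation (8), the MAPE in equation (9), the RMSE in equation (10), and the MAE in equation (11) are used:

$$\mathrm{APE}_i = \left| \frac{P_i^{\mathrm{pre}} - P_i^{\mathrm{real}}}{P_i^{\mathrm{real}}} \right| \times 100\% \tag{8}$$

$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{P_i^{\mathrm{pre}} - P_i^{\mathrm{real}}}{P_i^{\mathrm{real}}} \right| \times 100\% \tag{9}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( P_i^{\mathrm{pre}} - P_i^{\mathrm{real}} \right)^2} \tag{10}$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| P_i^{\mathrm{pre}} - P_i^{\mathrm{real}} \right| \tag{11}$$

where $P_i^{\mathrm{pre}}$ is the predicted value, $P_i^{\mathrm{real}}$ is the actual value, and $N$ is the total number of samples. The results are listed in Table 4. Figure 12 shows the comparison of the real-time demand for each method from March 1, 2021, to March 31, 2021.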
Figure 12 shows that the presented approach fits the actual demand more closely in the forecast curve. In particular, the presented method can capture occasional fluctuations, whereas the other compared methods deviate in the form of sudden changes or peaks. The experiment 1 results show that the presented method improves the prediction accuracy of real-time demand for the coming month.
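For reference, the APE, MAPE, RMSE, and MAE used in the evaluation can be computed as in the following sketch, which assumes their standard definitions. The predicted and actual values are made-up numbers, not results from the experiments.

```python
import math

def ape(pred, real):
    """Absolute percentage error per sample, in percent."""
    return [abs(p - r) / abs(r) * 100.0 for p, r in zip(pred, real)]

def mape(pred, real):
    """Mean absolute percentage error, in percent."""
    errors = ape(pred, real)
    return sum(errors) / len(errors)

def rmse(pred, real):
    """Root mean squared error, in the units of the data."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, real)) / len(real))

def mae(pred, real):
    """Mean absolute error, in the units of the data."""
    return sum(abs(p - r) for p, r in zip(pred, real)) / len(real)

# Toy values for illustration only.
predicted = [102.0, 198.0, 310.0, 400.0]
actual = [100.0, 200.0, 300.0, 400.0]

scores = {
    "MAPE": mape(predicted, actual),
    "RMSE": rmse(predicted, actual),
    "MAE": mae(predicted, actual),
}
```

MAPE is scale-free, while RMSE and MAE are in the units of the load; RMSE penalizes large individual errors more heavily than MAE.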
Based on the predicted real-time demand at 744 hourly points in the future, we can draw the daily total power consumption curve for the next month (March 1, 2021, to March 31, 2021), as shown in Figure 13.
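Deriving daily totals from the hourly forecast amounts to summing consecutive blocks of 24 values. The sketch below assumes that each hourly value represents the average demand over that hour, so the sum of 24 values approximates daily energy; the function name `daily_totals` and the constant forecast are hypothetical.

```python
def daily_totals(hourly, hours_per_day=24):
    """Sum consecutive blocks of hourly values into one total per day."""
    if len(hourly) % hours_per_day != 0:
        raise ValueError("series length must be a whole number of days")
    return [
        sum(hourly[i:i + hours_per_day])
        for i in range(0, len(hourly), hours_per_day)
    ]

# Stand-in forecast: 744 hourly points, i.e. 31 days, all equal to 1.0.
hourly_forecast = [1.0] * 744
daily = daily_totals(hourly_forecast)
```

The same function applied to a 2208-point forecast yields 92 daily totals, and summing the daily totals within each calendar month gives monthly consumption.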

Experiment 2: Power Consumption Quarterly Forecast.
In this experiment, real-time electricity demand is forecasted for 2208 points through March 2021. The training set runs from January 1, 2016, to September 30, 2020, the validation set runs from October 1, 2020, to December 31, 2020, and the test set runs from January 1, 2021, to March 31, 2021. The prediction results of each model are listed in Table 5. Figure 14 shows the comparison of the demand from January 1 to March 31, 2021. It shows that the presented method fits the actual demand more closely in the forecast curve. The experiment 2 results show that the method proposed in this study improves the prediction accuracy of power consumption for the next quarter.
Based on the predicted real-time power demand at 2208 hourly points in the future, we can draw the daily total power consumption curve for the next three months (January 1, 2021, to March 31, 2021), as shown in Figure 15.

Conclusion
In this study, we have presented a novel data-driven approach to forecast medium-term power consumption with transformer-lightGBM, which is summarized as follows: (1) A data-driven approach was adopted to analyze the ISO-NE data, and monthly and quarterly power consumption was derived from hourly demand forecasts. The proposed method allows multitimescale forecasting by attending to selected features. (2) A novel method for medium-term power consumption forecasting based on transformer-lightGBM was designed and improved. We used a model architecture that produces monthly and quarterly forecasts.