A Short-Term Photovoltaic Power Generation Forecast Method Based on LSTM

The intermittence and fluctuation of photovoltaic power generation seriously affect the output power reliability, efficiency, and fault detection of photovoltaic power grids. Precise forecasting of photovoltaic power generation is a critical way to address these limitations. Current forecasting methods generally adopt meteorological data and historical continuous photovoltaic power generation as inputs, but they do not take historical periodic photovoltaic power generation into account, which makes them inadequate at learning time correlation. Therefore, to further exploit time correlation and improve prediction accuracy, an LSTM-FC deep learning algorithm composed of long short-term memory (LSTM) and fully connected (FC) layers is proposed. The double-branch input of the model enables it to consider not only the impact of meteorological data on power generation but also time continuity and periodic dependence, thereby improving the prediction accuracy. In this paper, meteorological data, historical continuous data, and historical periodic data are used as experimental data, and these three types of data are combined into different input forms to evaluate and compare LSTM-FC with other baseline models, including support vector machines (SVM), gradient boosting decision tree (GBDT), generalized regression neural network (GRNN), feedforward neural network (FFNN), and LSTM. The simulation results show that the models with meteorological data, continuous data, and periodic data as input are more accurate than those with other input forms, and that LSTM-FC is the most accurate among these models, with a root mean square error (RMSE) 11.79% lower than that of SVM.


Introduction
The gradual depletion and pollution of resource-based energy have become a focus of attention for the whole of society, while renewable energy represented by solar energy is safe, clean, and abundant and plays an important role in long-term energy strategy [1,2]. According to 2019 statistics from the International Renewable Energy Agency (IRENA), China's installed photovoltaic (PV) power generation capacity has reached 205.493 GW [3]. PV power stations are based on solar energy. However, because the output of PV power generation is highly fluctuant and intermittent, once PV power generation reaches a certain threshold it brings a large impact and challenges to the power system and reduces efficiency and reliability. The purpose of this paper is to propose a deep learning algorithm that accurately predicts PV power generation, so as to reduce the impact of the fluctuation and intermittence of PV power generation on the PV system, improve operation efficiency, generate a stable and reliable power supply, and help operators adjust the operation mode and implement decisions.
At present, many modern machine learning models, including some deep learning models, have been applied successfully in fields related to PV power generation prediction [4]. In addition, several hybrid-model-based methods have been proposed to improve prediction accuracy [5][6][7]. It can be concluded from [8][9][10][11] that traditional machine learning methods often require a lot of feature engineering before training and testing, which makes the preliminary work relatively heavy. Moreover, these methods consider few relevant factors, for example only meteorological data. In [12][13][14][15][16][17][18], some deep learning models, including LSTM, GRNN, and FFNN, have also been applied to PV power generation. However, they only consider some meteorological data and historical continuous data and ignore historical periodic data, so there is still room for improvement in accuracy.
To compensate for these shortcomings, our contribution is to use real public historical PV power generation data and meteorological information to forecast PV power generation for the next 1 hour with LSTM-FC. LSTM models have performed well in temporal modelling, while fully connected layers are usually suitable for feature mapping. The LSTM-FC structure has two branches, a main input and an auxiliary input: it uses LSTM layers to capture time continuity and periodicity, and it uses FC layers to capture the contribution of meteorological information. The two branches therefore play complementary roles, which can further improve the accuracy of PV power generation prediction. The rest of this paper is organized as follows. The second section introduces the work related to this paper. The third section introduces the model structure and evaluation metrics. In the fourth section, continuous data, periodic data, and meteorological data are used as experimental data to obtain the prediction results of LSTM-FC and the baseline models, and their errors are compared using RMSE and other metrics. Finally, the summary and future prospects are given.

Related Work
Current machine learning methods, including some deep learning methods, are widely used in the short-term forecasting of PV power generation. For example, Shi et al. developed a method based on SVM to predict PV power generation under different weather conditions [10]. They built four models according to different weather conditions and then used historical data and weather forecasts as inputs to predict PV power generation. The authors obtained promising results when testing their technique to forecast the power output of a PV station in China. However, the data sets in their paper are mostly sunny or rainy, so it is troublesome to design different models for regions with more weather types.
Wang et al. proposed a short-term forecasting method for PV power based on GBDT [11]. They used historical weather data and PV power data to train the model. The simulation results show that this method is superior to SVM and the autoregressive moving average model (ARMA). However, this method does not use periodic PV power data and has few features. Ramsami and Oree proposed a prediction method based on a hybrid model [15]. First, stepwise regression (SR) was used to select important features as input, and then FFNN, GRNN, and multiple linear regression (MLR) were used for prediction. Simulation results showed that this method could achieve the same or even better accuracy with a small number of features. However, this method only considers the average value of the meteorological data from two days earlier and does not apply SR to time-related features. In [16], Cheng et al. proposed a PV power forecast for PV systems based on a double-level neural network; the input of the double-level neural network is the calculated or predicted value, and the network learns from the existing measured values to correct the theoretical value and obtain the prediction result. This idea could be combined with LSTM to further improve the prediction accuracy.
Huang et al. proposed a short-term solar irradiance prediction method based on LSTM-MLP [17]. In this method, the historical continuous irradiance and the meteorological data of the predicted day are taken as the input of the LSTM, the output of the LSTM is then combined with the meteorological data as the input of the MLP, and the final result is obtained through the MLP. The simulation results show that the model performs better than the baseline model. In [18,19], the authors also use prediction methods based on LSTM or on hybrids of LSTM with other deep learning models, and the simulation results again show that these methods improve prediction accuracy. However, they only considered the correlation within continuous data and did not take periodic data as input, so there was still room for the prediction accuracy to rise.
In summary, the aforementioned methods only consider the impact of historical continuous PV power generation and meteorological data on power generation, and they do not consider the time-periodic dependence. The LSTM-FC model captures not only the effect of meteorological data on PV power generation but also the time continuity and periodic dependence. Next, we look at the model structure of LSTM-FC and why it has these characteristics.

The Proposed Model
3.1. Model Structure of LSTM-FC. LSTM handles the problems of vanishing and exploding gradients well and is well suited to capturing time correlation, while FC layers are better at learning the mapping relationship between features. We can take advantage of the complementary characteristics of the two networks to further improve the accuracy of the prediction. Below, we introduce the LSTM-FC network structure.
Figure 1 shows the model structure of the LSTM-FC proposed in this paper. It contains two components. The first component consists of an input layer, two LSTM layers, and an output layer, and it accepts two input types, A and B. For input A, the historical PV power generation at the preceding continuous and periodic time steps is fed into LSTM layer1 and then passes through the LSTM layers and the output layer to give the output of the first component; here n is the length of the time series (number of samples) and p is the period interval. Input B contains the same data as input A plus the meteorological data. The second component consists of an input layer, three fully connected layers, and an output layer. Its input is the meteorological data at time t, which is passed to FC layer1 and then through the remaining hidden layers and the output layer to give the output of the second component. The final prediction is a weighted sum of the outputs of the two components. The operations in Figure 1 use the following notation: x_main_input represents the main input (input A or input B), y_LSTM represents the LSTM layer, h_LSTM represents the output of the LSTM layer, and x_aux_input represents the auxiliary input, which is the meteorological data at time t. o_main represents the output of the first component, and o_aux represents the output of the second component. The final output is obtained by fusing o_main and o_aux through the parameter matrix W, which is learned by the model itself.
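One way to write these operations that is consistent with the notation above is sketched here. The exact form is an assumption rather than the authors' stated equations: f_main and f_aux are hypothetical names for the output mappings of the two components, and the weighted sum is written as W applied to the stacked outputs.

```latex
\begin{aligned}
h_{LSTM} &= y_{LSTM}\left(x_{main\_input}\right) \\
o_{main} &= f_{main}\left(h_{LSTM}\right) \\
o_{aux}  &= f_{aux}\left(x_{aux\_input}\right) \\
\hat{y}  &= W \begin{bmatrix} o_{main} \\ o_{aux} \end{bmatrix}
\end{aligned}
```

Under this reading, the single-branch variants simply omit the auxiliary terms, so the prediction depends on o_main alone.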
As shown in Figure 1, the models can be divided into Model-A, Model-B, Model-AC, and Model-BC according to their inputs. In the following sections, we discuss how to choose appropriate lengths for the continuous data sequence and the periodic data sequence and then select the best of these models.

Evaluation Function of Model Accuracy.
To better distinguish the performance of the models and to compare the proposed model with the baseline models on a common basis, some evaluation indicators need to be set. The evaluation functions used to verify the model performance are presented below, including RMSE, normalized root mean square error (nRMSE), mean absolute error (MAE), and the coefficient of determination (R^2).
Y is the predicted value obtained by the model, and Y′ is the expected true value. μ_Y′ is the mean of the expected values. Each evaluation index has its own purpose. For PV power generation, RMSE, nRMSE, and MAE reflect the degree of dispersion between the predicted value and the true value, but in some cases R^2 is more informative than any of them and better explains the performance of the model.
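The four indicators can be sketched in a few lines of Python using their standard textbook definitions. One point is an assumption: nRMSE is normalized here by the mean of the true values (μ_Y′), which is one common convention; the paper's exact normalization is not stated in the text.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def nrmse(y_true, y_pred):
    """RMSE divided by the mean of the true values (assumed normalization)."""
    mean_true = sum(y_true) / len(y_true)
    return rmse(y_true, y_pred) / mean_true


def mae(y_true, y_pred):
    """Mean absolute error."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

RMSE, nRMSE, and MAE are error measures (lower is better), while R^2 approaches 1 as the predictions explain more of the variance in the true values.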

Experimental Data and Environment.
The experimental data is real application data from the Jinan PV operation and maintenance platform. For data confidentiality reasons, only the PV power generation data and meteorological data from 2018-1-1 to 2018-12-31 are used. The meteorological data includes temperature, humidity, weather, wind direction, and wind speed. Since the PV system does not generate electricity during certain periods, the time range considered in this article is 7:00 to 19:00 each day. Table 1 shows the structure of the data. To meet the requirements of training, the data must be preprocessed, including data cleaning and data normalization. The specific steps are as follows. Firstly, weather and wind direction need to be encoded into numeric data that computers can process. After encoding, for example, the number corresponding to locally cloudy days is 15 and the number corresponding to east wind is 3.
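A minimal sketch of these two preprocessing steps is given below. The integer codes it produces are arbitrary (assigned in order of first appearance) and are not the paper's actual mapping, and min-max scaling is assumed because the text only says "normalization"; the function names are hypothetical.

```python
def encode_categories(values):
    """Map each distinct label to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for label in values:
        if label not in codes:
            codes[label] = len(codes)
        encoded.append(codes[label])
    return encoded, codes


def min_max_normalize(values):
    """Scale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

The mapping returned by `encode_categories` should be fitted on the training data and reused unchanged on the test data so that the same label always receives the same code.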
Secondly, the outliers need to be processed to reduce their impact on the accuracy of the prediction algorithm. Currently, the main outlier detection algorithms include One-ClassSVM, EllipticEnvelope, IsolationForest, and LOF [20]. Here, each of these four methods is used to test the data, and LOF is found to be the most suitable. Figure 2 shows the outliers found by the LOF method; the x-axis is the sample index, and the y-axis is the hourly PV power generation. Each outlier is replaced with the PV power generated at the same time on the day before or the day after. The processed data is shown in Figure 3, which shows that the processed data is normal.
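The replacement step can be sketched as follows, taking the outlier indices as given (for example, from an LOF detector). Several details are assumptions: the series is hourly with 13 samples per day (7:00 to 19:00), the previous day is preferred over the following day, and a neighbor that is itself flagged is skipped. The function name is hypothetical.

```python
def replace_outliers(power, outlier_indices, period=13):
    """Replace flagged samples with the value at the same hour on an adjacent day.

    power: hourly PV power values, `period` samples per day.
    outlier_indices: positions flagged by an outlier detector.
    """
    flagged = set(outlier_indices)
    cleaned = list(power)
    for i in sorted(flagged):
        prev_i, next_i = i - period, i + period
        if prev_i >= 0 and prev_i not in flagged:
            cleaned[i] = power[prev_i]       # same hour, day before
        elif next_i < len(power) and next_i not in flagged:
            cleaned[i] = power[next_i]       # same hour, day after
        # Otherwise no clean neighbor exists and the value is left as is.
    return cleaned
```

Reading replacement values from the original `power` list rather than from `cleaned` avoids propagating one substituted value into another.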
Thirdly, we analyze the features and calculate the correlation between them and PV power generation. The Pearson correlation coefficient measures whether two data sets lie on a line, that is, the strength of the linear relationship between two variables. Its formula is shown in formula (3). To determine whether the meteorological data should be taken as input features, the correlation between the meteorological data and the PV power generation data is calculated in this paper. Table 2 shows the Pearson correlation coefficient between the five meteorological variables and PV power generation. x is the meteorological data, y is PV power generation, and u is the average value.
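For reference, the standard Pearson coefficient can be computed as in the sketch below, where `ux` and `uy` play the role of the averages u; the function name is chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    ux = sum(x) / n
    uy = sum(y) / n
    cov = sum((a - ux) * (b - uy) for a, b in zip(x, y))
    var_x = sum((a - ux) ** 2 for a in x)
    var_y = sum((b - uy) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

The result lies in [-1, 1]: values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate almost none.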
It can be observed that there is almost no correlation between the weather variable and PV power generation, whereas the other variables show a certain correlation with it, with temperature and humidity showing the highest.
Therefore, all of the variables other than weather can be used as input variables. Figure 4 shows the continuity of PV power: the change over continuous time is relatively smooth. As shown in Figure 5, PV power clearly reaches its peak at noon every day and is lower in the morning and evening, so the curve shows an obvious periodicity. Next, we analyze the Pearson correlation coefficient between the historical continuous PV power generation, the periodic PV power generation, and the PV power generation on the forecast day. Figure 6 shows how well the PV power at times t − 1 and t − 2 fits the PV power at time t, indicating a strong dependence between them. Figure 7 shows how well the PV power at times t − 1 * p and t − 2 * p fits the PV power at time t, and it again indicates a strong dependence. We only analyzed the correlation for lags of up to 2 hours (or periods), but longer lags are still relevant. Therefore, we conduct experiments later to determine the optimal lag.
After the above process is completed, we obtain a data set with 13,792 training samples and 3,412 test samples. As shown in Table 3, all the data are integrated into one table, and the inputs are constructed according to the different input requirements when the data is read. Since not all feature dimensions can be displayed here, the table shows only the meteorological data and a small part of the continuous and periodic data, in which power is the PV power generation at time t, c_i is the PV power generation at time t − i, and p_i is the PV power generation at time t − i * p. Summary statistics for some features of the training data are shown in Table 4, including the mean, standard deviation, minimum, and maximum.
In this experiment, the machine runs Windows 10 with 6 GB of memory and a GTX 1660 Ti graphics card. The software environment includes CUDA 9.0, cuDNN 7.4.1, scikit-learn 0.22.2, Python 3.6, and Keras 2.1.6, and the corresponding Keras back end is TensorFlow 1.11.0.

Forecasting Test.
In Figure 1, we defined four different models according to the input types and the use of the double branches, namely, Model-A, Model-B, Model-AC, and Model-BC, and each model contains continuous and periodic PV power. Therefore, the length of the continuous data series and the length of the periodic data series should be determined first. These two kinds of data form the input variables of Model-A. In addition to these inputs, meteorological data is added to Model-B, Model-AC, and Model-BC. Then, these models are compared and evaluated to verify the influence of the different input types on prediction accuracy. Finally, the best model is selected according to the evaluation results. We first train repeatedly to determine the optimal lengths of the continuous PV power and the periodic PV power, respectively. Considering that the effective generating time per day in the data is 7:00 to 19:00, the maximum length of the continuous data is set to the previous 9 hours. Considering the time consumption and the weakening of long time series dependence, the maximum sequence length of the periodic data is also 9.
After many training sessions, as shown in Figure 8(a), RMSE decreases as the length of the continuous input sequence increases in steps of 1 hour. The prediction accuracy is highest when the lagging time is 9, but the decline is no longer significant once the lagging time exceeds 4. As shown in Figure 8(b), RMSE decreases as the length of the periodic input sequence increases, and the prediction accuracy is highest when the lagging time is 9, where p is 13. Based on this analysis, the length finally selected for the continuous data is 4, and the length of the periodic data is 9.
After the above experiments, we will now select a continuous data sequence of length 4 and a periodic data sequence of length 9 as the inputs of Model-A, Model-B, Model-AC, and Model-BC. In this paper, these two kinds of data are combined into data with dimension 13 as the input of Model-A. At the same time, we add meteorological data to form the input of the other three models as required and then verify and evaluate each model.
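The 13-dimensional input described above can be assembled as in the following sketch, which takes c_i as the power at t − i and p_i as the power at t − i * p with p = 13. The function name, argument names, and the ordering of features inside the vector are assumptions made for illustration.

```python
def build_sample(power, t, n_cont=4, n_per=9, period=13):
    """Return (features, target) for predicting power[t].

    features = [c_1..c_n_cont, p_1..p_n_per], where
    c_i = power[t - i] and p_i = power[t - i * period].
    """
    earliest = t - max(n_cont, n_per * period)
    if earliest < 0 or t >= len(power):
        raise IndexError("not enough history to build a sample at this t")
    continuous = [power[t - i] for i in range(1, n_cont + 1)]
    periodic = [power[t - i * period] for i in range(1, n_per + 1)]
    return continuous + periodic, power[t]


def build_dataset(power, n_cont=4, n_per=9, period=13):
    """Build all samples that have a complete history."""
    start = max(n_cont, n_per * period)
    return [build_sample(power, t, n_cont, n_per, period)
            for t in range(start, len(power))]
```

With the default lengths, the first usable target is at index 117 (9 periods of 13 samples), so the earliest part of the series serves only as history.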
As shown in Table 5, the worst is Model-A, with an nRMSE of 0.2209, and the best is Model-BC, with an nRMSE of 0.1743. Model-A, Model-B, Model-AC, and Model-BC all perform better than models that consider only continuous data or only periodic data. Based on this comparison, we choose Model-BC as the LSTM-FC model. Next, we describe the parameters of the finalized model. The performance of the model is affected by several factors. For example, if the number of network layers is too small, the accuracy may not be high; if the number of network layers is too large, overfitting may result. Therefore, suitable parameters should be selected after repeated training. Based on the feedback from many training runs, the final network parameters in this paper are shown in Table 6, mainly including 2 LSTM layers, 4 FC layers, and the Adam optimizer; the lengths of the continuous and periodic variables determined by the above tests are 4 and 9, respectively.

Result Analysis.
The above experimental results show that Model-BC has the highest accuracy. In what follows, we use LSTM-FC to refer to Model-BC. We evaluate LSTM-FC on the test set, and the prediction results obtained by inputting the test set into our model are shown in Figure 9, where the solid blue line is the true value and the orange line is the predicted value. As can be seen from the locally enlarged figure, the expected value and the predicted value closely match, and the error between them is very small.
In Table 7, (M + C) means that continuous data and meteorological data are taken as input, and (M + C + P) means that periodic data is added as well. We use the SVM, GBDT, FFNN, GRNN, and LSTM models commonly used in related work to test the different inputs, and the results show that the error of the models with periodic data as input is smaller than the error of the models without periodic data. This verifies the idea of this paper that using periodic data as input can improve the accuracy, and with the same input, the error of the proposed LSTM-FC model is the smallest. The following results take (M + C + P) as input. Compared with SVM, GBDT, FFNN, GRNN, and LSTM, the RMSE of LSTM-FC improves by 11.79%, 7.77%, 9.1%, 11.33%, and 9.16%, respectively. Figure 10 shows the errors of each model on the test set, and it can be seen that the errors of LSTM-FC are smaller than those of the other models for most samples. From the above discussion, it can be concluded that the performance of the LSTM-FC model in PV power generation prediction is superior to that of the other models.
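Percentages of this kind are commonly computed as the relative reduction in RMSE with respect to the baseline. The sketch below assumes that definition, which the text does not state explicitly; the numbers in the usage note are illustrative and are not taken from Table 7.

```python
def relative_improvement(baseline_rmse, model_rmse):
    """Percentage reduction in RMSE relative to a baseline model."""
    if baseline_rmse <= 0:
        raise ValueError("baseline RMSE must be positive")
    return (baseline_rmse - model_rmse) / baseline_rmse * 100.0
```

For instance, a hypothetical baseline RMSE of 4.0 and a model RMSE of 3.0 give `relative_improvement(4.0, 3.0) == 25.0`; a negative result would mean the model is worse than the baseline.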

Conclusions
This paper first describes the benefits of accurate PV power prediction; that is, it can improve the operation efficiency of PV power stations, provide a stable and reliable power supply, and so on. Then, we discuss the research on some current machine learning and deep learning methods for PV power generation prediction. Most previous PV power prediction studies considered the influence of time continuity and meteorological data on PV power, but they either did not capture the periodic dependence or used models that are not as good as LSTM at capturing time correlation. Therefore, a PV power generation prediction method based on LSTM-FC with double branches is proposed. It uses LSTM layers to better capture the temporal correlation of PV power generation, uses FC layers to capture the mapping relationship between meteorological data and PV power generation, and then weights the outputs of the two branches to obtain the final prediction result. The simulation results show that the accuracy with (M + C + P) as input is higher than with (M + C), and that the LSTM-FC model is the most accurate of the models compared under the same conditions, with an nRMSE of 0.1743 and an RMSE of 2.5605.
Using the results of this research, the predicted value of PV power generation can support decision-making for power dispatch, which improves output efficiency and largely addresses the problem of power output reliability caused by the volatility and intermittency of PV power generation. In addition, we can set an outlier confidence interval. When the error between the actual power generation and the predicted power generation exceeds this interval, it indicates that the power station may have a fault, which makes fault detection feasible. In subsequent studies, we will consider solar irradiance, average power, panel temperature, and rainfall as input features to obtain a better model. Moreover, since the training time of the deep learning model is slightly longer than that of the other machine learning models, we will consider how to further reduce the time consumption and make training faster in the future.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.