Using Deep Learning to Predict Complex Systems: A Case Study in Wind Farm Generation

Making every component of an electrical system work in unison is becoming more challenging as the share of renewable energies grows, since their electrical output is difficult to determine beforehand. In Spain, the daily electricity market opens with a 12-hour lead time, where the supply and demand expected for the following 24 hours are presented. When estimating the generation, energy sources like nuclear are highly stable, while peaking power plants can be run as necessary. Renewable energies, however, which should eventually replace peakers insofar as possible, are reliant on meteorological conditions. In this paper we propose using different deep-learning techniques and architectures to solve the problem of predicting wind generation in order to participate in the daily market, by making predictions 12 to 36 hours in advance. We develop and compare various estimators based on feedforward, convolutional, and recurrent neural networks. These estimators were trained and validated with data from a wind farm located on the island of Tenerife. We show that the best candidates for each type are more precise than the reference estimator and the polynomial regression currently used at the wind farm. We also conduct a sensitivity analysis to determine which estimator type is most robust to perturbations. An analysis of our findings shows that the most accurate and robust estimators are those based on feedforward neural networks with a SELU activation function and convolutional neural networks.


Introduction
A region's electricity grid consists of a series of components that have to work together to achieve a balance between generation and demand, while at the same time ensuring the security of the electricity supply and providing a certain level of quality and service. The structure of the power system may be divided into four key activities: generation, transport, distribution, and marketing. The electricity supply process starts at power plants, where the electricity is generated. Depending on the type of facility, different types of primary energy sources are used to drive a turbine or motor, thus converting the primary energy into mechanical energy. The turbine is connected to a generator, which turns the mechanical energy into electrical energy. The process of supplying electricity continues via the transport network, which links the various production plants to consumption centres. This process takes place at high voltages to lower the currents and thus the losses. The distribution process comes next, in which the electricity is relayed from the substations of the transport network to the various consumption points. These substations reduce the voltage from that of the transport network to values that are suitable for use by consumers. The electricity supply process concludes with the marketing activity, in which the electricity is sold to consumers based on their contracted power.
In Spain, Law 54/1997 went into effect in 1998. This law is notable because, as in the rest of Europe, it deregulated the generation and marketing activities, while continuing to regulate the transport and distribution aspects. Ever since, two primary operators have been charged with managing the technical and economic aspects of Spain's electricity market. One is Red Eléctrica de España (REE) and the other is Operador del Mercado Ibérico de Energía (OMIE), which are required to coordinate their efforts. The latter is charged with handling the bids for selling and buying energy. With this power system model, the price of energy became defined by the matching processes that started to be used in the various market sessions: daily, intraday, and ancillary services. The daily session, or daily market, takes place at 12:00, during which the bids for the 24 hours following the session close are placed. This is the main market and therefore the session in which much of the energy is negotiated.
The intraday markets are convened over the course of the previous day and the delivery day. Once the new offers are matched, they are added to the daily schedule to yield what is known as the final hourly schedule (PHF in Spanish). Less energy is traded in these markets since their time horizons are gradually reduced. They are designed to accommodate potential changes to trading forecasts. Table 1 shows the different time spans for the six intraday sessions, which are only open to those buyers or sellers that have taken part in the daily sessions.
Finally, the ancillary services are used when needed to resolve the imbalance between demand and generation, regulate the frequency/power, and control the voltage in the transport network. Their purpose, then, is to guarantee the balance, security, quality, and reliability of the electrical system.
For each hour, the producers and consumers that want to produce or consume electricity must place a bid in the various markets depending on their needs. For each hour of each session, the sale bids received are arranged from the lowest to the highest price and the purchase bids from the highest to the lowest price, with the lowest price being 0 and the highest being 180.3 €/MWh. Graphically, the result would be two aggregate curves, where the x-axis is the energy and the y-axis is the price. The matching method is "marginalist," meaning that the matching price for that hour and session is set at the intersection of the two aggregate curves. Any sale bids below that value and any purchase bids above it will be sold and bought, respectively, at that price. In other words, all of the power contracted will be sold at that price.
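The matching process described above can be illustrated with a minimal sketch. The function name `clear_market`, the bid representation, and the rule that the matching price equals the price of the last sale bid needed to cover the matched demand are assumptions made for illustration; the operator's actual algorithm handles many additional conditions.

```python
def clear_market(sale_bids, purchase_bids):
    """Return (price, energy) at the intersection of the aggregate curves.

    Each bid is a (price_eur_per_mwh, energy_mwh) tuple. Hypothetical
    simplification: the matching price is the price of the last sale bid used.
    """
    # Aggregate supply curve: cheapest sale bids first.
    supply = sorted(sale_bids, key=lambda b: b[0])
    # Aggregate demand curve: highest purchase bids first.
    demand = sorted(purchase_bids, key=lambda b: b[0], reverse=True)

    matched = 0.0
    price = None
    i = j = 0
    s_left = supply[0][1] if supply else 0.0
    d_left = demand[0][1] if demand else 0.0
    # Keep matching while the next sale bid is no more expensive than the
    # next purchase bid, i.e. until the two curves cross.
    while i < len(supply) and j < len(demand) and supply[i][0] <= demand[j][0]:
        traded = min(s_left, d_left)
        matched += traded
        price = supply[i][0]
        s_left -= traded
        d_left -= traded
        if s_left == 0:
            i += 1
            s_left = supply[i][1] if i < len(supply) else 0.0
        if d_left == 0:
            j += 1
            d_left = demand[j][1] if j < len(demand) else 0.0
    return price, matched


if __name__ == "__main__":
    # Five plants and three consumers for one hour (made-up figures).
    plants = [(0, 100), (0, 50), (40, 80), (60, 70), (90, 60)]
    consumers = [(180.3, 150), (180.3, 60), (50, 40)]
    print(clear_market(plants, consumers))
```

In this made-up example the two plants bidding at 0 €/MWh are fully matched, the plant bidding at 40 €/MWh sets the price, and the two most expensive plants are left out.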
To illustrate the matching process, Figure 1 shows an example for a case with five power plants and six large consumers placing a bid in the market for hour H.
Since most of the demand is not manageable, it offers to buy at the maximum of 180.3 €/MWh. But it is worth asking what criteria producers use to craft their sale offers to cover this demand. Nuclear and renewable plants tend to sell at 0 €/MWh to ensure that all of the energy they produce is consumed. This is due to their technical limitations, such as the inability to halt production in the case of nuclear and the inability to store primary energy in the case of renewables. The difference between the total system demand and the energy produced by the above technologies and hydropower (in which the water flow can be regulated by using reservoirs) is known as the thermal gap. This difference is the energy demand that must be met using thermal technologies (such as gas and coal), the variable costs of which are higher than for renewables and nuclear. Therefore, the competition in the electricity market lies between the thermal generating plants, since these plants determine where the demand and supply curves intersect during the market matching process, and therefore the final sale price.
As Figure 1 shows, a lower demand entails lower prices by requiring fewer plants to be in operation and excluding the most expensive plants from the matching process. An increase in renewable production can result in a sharp drop in the matching price, leading to cases in which all of the demand is covered by the production priced at 0 €/MWh, as has already happened on numerous occasions. As the proportion of renewables in the energy mix grows, the average price in the electricity market drops. Specifically, the average energy price in Spain's electricity market in 2016 (48.4 €/MWh) fell by 23% with respect to 2015 to the lowest price since 2010. This was due primarily to the large response by hydro and wind power to cover the demand in the first few months of the year.
Reducing the use of thermal technologies requires incorporating more wind power into the energy mix. To do so, producers have to know how many units to supply to the daily market. A wind farm's generation must therefore be forecast one day in advance. In this paper we present a series of regression estimators that rely on deep-learning techniques to predict the generation of a wind farm based on an estimate of atmospheric conditions. These predictions are intended for the daily market, meaning they must offer sufficiently accurate results 12 to 36 hours in advance.

Materials and Methods
Below we describe the estimators implemented, the data used, and the procedures employed to train, evaluate, and compare the estimators.

Data Sets.
The Institute of Technology and Renewable Energies (ITER) is the agency that runs the largest wind farm on the island of Tenerife. It also provided us with the data for this study. The ITER runs the MADE farm, which has a rated power of 4800 kW, supplied by eight MADE AE-46 generators. A weather forecast for the following 48 hours in one-hour periods is generated twice a day. Once a day the wind speed for the following 12 to 36 hours is forecast and a polynomial regression is carried out that is used to estimate the generation for each hour of said interval. This estimate is sent to the OMIE to be used in the daily market.
The ITER gave us a data set of hourly samples containing the results of the Numerical Weather Prediction (NWP) for different meteorological variables, the generation forecast made by the ITER using a polynomial regression, the actual wind generation measured, and a free-text field describing problems involving the operation of the generators. Table 2 summarises the features in each sample of the data set provided by the ITER and shows which were used as inputs and outputs for the estimators.
To prepare the data, we used certain feature engineering techniques. The timestamp was broken down into day of year and time of day, each of which was represented using the pair of values for their sine and cosine in an effort to capture their periodic nature and their effects on the local daily wind cycles.
The wind direction was also encoded using this method for the same reason. The following shows an example of this using the wind direction $\theta$:
$$\theta \rightarrow (\sin\theta,\ \cos\theta) \tag{1}$$
The text for the issues was manually converted into the fraction of generators not in service at a given time, since some of the samples for the total farm output, which were used to adjust the predictor's output, were obtained when some of the generators were out of service. When the validation and test sets were configured, however, only those samples taken when all of the generators were in service during the period measured were used.
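The sine and cosine encoding applied to the timestamp components and the wind direction can be sketched as follows. The helper name `encode_cyclic` and the choice of periods (24 hours, 365 days, 360 degrees) are illustrative assumptions; the source does not state the units used.

```python
import math


def encode_cyclic(value, period):
    """Map a periodic quantity onto the unit circle as a (sin, cos) pair."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


if __name__ == "__main__":
    # Assumed units: wind direction in degrees, hour of day in 0..23.
    print(encode_cyclic(359.0, 360.0))  # almost identical to 1 degree
    print(encode_cyclic(1.0, 360.0))
    print(encode_cyclic(23, 24))        # close to the encoding of hour 0
```

The benefit of this representation is that values at either end of the cycle, such as 359° and 1°, end up close together in the input space, which a raw numeric value would not achieve.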
All of the inputs were normalised by min-max scaling between 0 and 1, the goal being to achieve maximum efficiency during training. For the training, we had data sampled each hour from January 2014 to April 2016, which were randomly divided into three sets: 60% for the training set, 20% for the validation set, and the remaining 20% for the test set. The data were stored in TensorFlow TFRecords files for efficiency purposes for use with the TensorFlow framework [1], which was the tool used to develop, train, and evaluate the various predictors.
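A minimal sketch of the two preparation steps just described, min-max scaling and the random 60/20/20 split, is given below in plain Python. The function names and the fixed random seed are assumptions for illustration; the authors' pipeline was built on TensorFlow and is not reproduced here.

```python
import random


def min_max_scale(column):
    """Scale a list of numbers linearly so that min -> 0 and max -> 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def random_split(samples, train=0.6, validation=0.2, seed=0):
    """Shuffle the samples and split them into train/validation/test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


if __name__ == "__main__":
    print(min_max_scale([2.0, 4.0, 6.0]))
    train_set, val_set, test_set = random_split(range(100))
    print(len(train_set), len(val_set), len(test_set))
```

In practice the scaling bounds would normally be taken from the training set only and reused for the other two sets; the source does not say which convention was followed.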

Feedforward Neural Networks.
The first architecture evaluated was the Feedforward Neural Network (FNN). In an FNN [2] every neuron in one layer receives as its input all of the outputs from the neurons in the previous layer. The output $y_j^l$ of the $j$th neuron in the $l$th layer can be expressed as indicated in the following:
$$y_j^l = f^l\Big(\sum_i w_{ij}^l\, y_i^{l-1} + b_j^l\Big) \tag{2}$$
where $w_{ij}^l$ is used to denote the weight for the output of the $i$th neuron in the $(l-1)$th layer as the input to the $j$th neuron in the $l$th layer, $b_j^l$ is the bias of the $j$th neuron in the $l$th layer, and $f^l$ is the activation function of the neurons in the $l$th layer. The most common activation function is the sigmoid function, which is expressed as shown in the following:
$$f^l(x) = \frac{1}{1 + e^{-x}} \tag{3}$$

In a regression problem like the one at hand, the sigmoid activation function is used in every layer except the last one, the output layer, which uses the identity function $f^L(x) = x$ as the activation function.
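The forward pass of such a network, with sigmoid hidden layers and an identity output layer, can be sketched in a few lines of plain Python. The names `dense_layer` and `forward` and the nested-list weight layout are illustrative assumptions, not the authors' TensorFlow code.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def identity(x):
    return x


def dense_layer(inputs, weights, biases, activation):
    """Fully connected layer; weights[i][j] links input i to neuron j."""
    outputs = []
    for j, bias in enumerate(biases):
        z = sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + bias
        outputs.append(activation(z))
    return outputs


def forward(inputs, layers):
    """Apply a list of (weights, biases, activation) layers in order."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = dense_layer(activations, weights, biases, activation)
    return activations


if __name__ == "__main__":
    hidden = ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], sigmoid)
    output = ([[1.0], [1.0]], [0.5], identity)  # regression output layer
    print(forward([0.3, 0.7], [hidden, output]))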
The learning was done using standard deep learning techniques, like minibatch gradient descent [3] and the Adagrad (adaptive gradient algorithm) optimiser [4]. The latter adapts the step size for each feature individually, which reduces the need to tune the learning rate manually. The adjusted model was validated every 5000 trained batches. Training stopped when the validation Mean Square Error (MSE) had not decreased for three consecutive validations. To compare the accuracy of the different predictors, however, we use the Mean Absolute Error (MAE) [5] and the Mean Absolute Scaled Error (MASE) [6], which are common measures of forecast error in time series analysis.
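The validation schedule and stopping criterion can be sketched as a small training loop. The callables `train_batch` and `evaluate_mse` are hypothetical placeholders for the real training and validation steps, and comparing each validation MSE against the best value seen so far is one possible reading of the criterion, not a detail confirmed by the source.

```python
def train_with_early_stopping(train_batch, evaluate_mse,
                              eval_every=5000, patience=3,
                              max_batches=1_000_000):
    """Train until the validation MSE fails to improve `patience` times in a row.

    train_batch:  callable that runs one minibatch update (placeholder).
    evaluate_mse: callable that returns the current validation MSE (placeholder).
    """
    best_mse = float("inf")
    stale = 0          # consecutive validations without improvement
    history = []
    for batch in range(1, max_batches + 1):
        train_batch()
        if batch % eval_every == 0:
            mse = evaluate_mse()
            history.append(mse)
            if mse < best_mse:
                best_mse, stale = mse, 0
            else:
                stale += 1
                if stale >= patience:
                    break
    return best_mse, history


if __name__ == "__main__":
    fake_mse = iter([5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.0])
    best, seen = train_with_early_stopping(lambda: None,
                                           lambda: next(fake_mse),
                                           eval_every=1)
    print(best, seen)
```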
To avoid overfitting the models when training them, the cost function includes a 2-norm term that regularises all the weights and biases of the entire model. In some specific cases we used dropout [7] to check its effects when attempting to further generalise the trained models, but it did not help to improve the results. When dropout is used during training, only a selection of neurons chosen with probability $p_{\text{keep}}$ can be activated. The following shows the generalisation of the output expression for a neuron when dropout is used: where ∼ (0, 1). (4)
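As a rough illustration of dropout during training, the sketch below keeps each activation with probability `p_keep` by comparing a uniform random draw with that probability. Rescaling the surviving activations by `1 / p_keep` ("inverted dropout") is a common convention assumed here; it is not necessarily the formulation used by the authors.

```python
import random


def dropout(activations, p_keep, rng):
    """Zero each activation with probability 1 - p_keep (training time only).

    Assumption: kept activations are rescaled by 1 / p_keep so that the
    expected value of each output is unchanged.
    """
    if not 0.0 < p_keep <= 1.0:
        raise ValueError("p_keep must be in (0, 1]")
    out = []
    for a in activations:
        keep = rng.random() < p_keep  # uniform draw in [0, 1)
        out.append(a / p_keep if keep else 0.0)
    return out


if __name__ == "__main__":
    rng = random.Random(42)
    print(dropout([0.2, 0.4, 0.6, 0.8], p_keep=0.5, rng=rng))
```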

ReLU Activation Function.
When training using minibatch gradient descent, the backward propagation undergoes a phenomenon called vanishing gradient [8], which considerably hampers the training of networks with a large number of layers. In these cases the ReLU activation function is very practical because its gradient does not saturate for positive inputs, which mitigates the vanishing gradient problem. Moreover, in regression problems it has the advantage of not being limited to outputs between 0 and 1, and it favours sparsity of the solution in the hidden neurons. The following shows the expression for the activation function for the $l$th layer of a neural network with $L$ layers:
$$f^l(x) = \begin{cases} \max(0, x) & \text{if } l < L \\ x & \text{if } l = L \end{cases} \tag{5}$$
where $x$ is used to indicate the input to the activation function, that is, the weighted sum of the inputs to the neuron, as shown in (2). The FNNs with the ReLU activation function that we trained use this function for the neurons in every layer except the output layer.

SELU Activation Function.
Even using the ReLU activation function, truly deep FNNs are difficult to train, which hampers their ability to handle high-level abstract relationships in the input samples. The Scaled Exponential Linear Unit (SELU) activation function induces self-normalising properties that make the neuron activations automatically converge toward an average of 0 and a variance of 1 [9]. This property propagates throughout the network even in the presence of noise and perturbations. This allows training networks with more layers and the use of strong regularisation, and it makes the training more robust.
The following shows the expression for this type of activation function:
$$f(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha e^{x} - \alpha & \text{if } x \le 0 \end{cases} \tag{6}$$
where $x$ is used to indicate the input to the activation function. Klambauer et al. [9] justify why $\alpha$ and $\lambda$ must have the values shown in (7) in order to ensure that the neuron activations converge automatically toward an average of 0 and a variance of 1.
$$\alpha \approx 1.6733, \qquad \lambda \approx 1.0507 \tag{7}$$
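For reference, the ReLU and SELU activation functions can be written directly in Python. The constant names are illustrative; the numerical values of α and λ are those derived by Klambauer et al. [9].

```python
import math

# Constants from Klambauer et al.; the names are chosen for this sketch.
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805


def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return max(0.0, x)


def selu(x):
    """Scaled exponential linear unit."""
    if x > 0:
        return SELU_LAMBDA * x
    return SELU_LAMBDA * SELU_ALPHA * (math.exp(x) - 1.0)


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(x, relu(x), selu(x))
```

Unlike ReLU, SELU produces negative outputs for negative inputs, bounded below by −λα (about −1.758), which is what allows the activations to keep a mean close to zero.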

Convolutional Neural Networks.
It is possible for the atmospheric conditions in previous hours to contain information that can be used to improve the forecast at any given time. This information was introduced into the predictors based on FNNs by preparing samples that contained the weather forecast features for every hour $t$ and for the $N - 1$ previous hours, with the input layer for said models being suitably expanded.
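Building such expanded samples amounts to sliding a window over the hourly forecast features. The sketch below assumes the features are stored as one list per hour in chronological order with no gaps; the function name `make_windows` is hypothetical.

```python
def make_windows(hourly_features, targets, window=6):
    """Pair the features of hours t-window+1 .. t with the target at hour t.

    hourly_features: list of per-hour feature lists, in chronological order.
    targets:         list of measured generation values, one per hour.
    """
    samples = []
    for t in range(window - 1, len(hourly_features)):
        flat = []
        for row in hourly_features[t - window + 1:t + 1]:
            flat.extend(row)  # concatenate oldest hour first
        samples.append((flat, targets[t]))
    return samples


if __name__ == "__main__":
    features = [[h, h * 10] for h in range(8)]  # two made-up features per hour
    for inputs, target in make_windows(features, list(range(8)), window=3):
        print(inputs, "->", target)
```

With a window of 6 and F features per hour, each sample has 6·F inputs, which is why the input layer of the FNN has to be enlarged accordingly.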
In order to check the results when the neural network is forced to exploit the time-local correlation between the features by forcing a connection pattern between adjacent neurons in each layer, we implemented some predictors using models based on Convolutional Neural Networks (CNNs). CNNs are biologically inspired variants of FNNs used primarily in computer vision problems [10], although their ability to exploit spatially local correlation in images can also be used in time-series forecasting.
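In these models, the output of each neuron $y_{j,k}^l$ is not generated based on the output of every neuron in the previous layer, as shown in (2); rather, it is generated from a subset of spatially adjacent neurons. In addition, to improve the learning efficiency, every neuron in the same layer shares the same weights and bias, meaning the layer can be expressed in terms of a filter that is convolved with the output of the previous layer. The following shows the output $y_{j,k}^l$ of the $j$th neuron of the $l$th convolutional layer:
$$y_{j,k}^l = f^l\big((W_k^l * y^{l-1})_j + b_k^l\big) \tag{8}$$
where $(W_k^l * y^{l-1})_j$ is the $j$th element resulting from the convolution of the filter defined by $W_k^l$ with the output of the previous layer $y^{l-1}$, $f^l$ is the activation function for the $l$th convolutional layer, and $k \in [0 \cdots K]$ indicates that it is the output of the $k$th channel of the layer. In each convolutional layer, different filters can be applied to the output of the previous layer to generate different representations or channels, thus yielding a fuller representation of the data.
In order to apply a CNN to the problem of time-series forecasting, we arranged the samples in such a way that the time series of each feature is an input channel to the network and thus to the first convolutional layer.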
Figure 2 shows a general diagram of the CNN developed in this paper to forecast time series. Each convolutional layer with a ReLU activation function is followed by a max-pooling layer, which partitions the input into a set of nonoverlapping ranges and, for each range, outputs the maximum value. After several alternating convolutional and max-pooling layers there is a feedforward layer (as described in (2)) to yield the output of the entire network.
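One output channel of a convolutional layer applied to a multi-channel time series, followed by nonoverlapping max-pooling, can be sketched as below. The function names and data layout are assumptions for illustration. As in most deep-learning frameworks, the filter is applied without flipping (strictly a cross-correlation) and without padding.

```python
def relu(x):
    return max(0.0, x)


def conv1d(channels, filt, bias, activation):
    """One output channel of a 1-D convolution over time, without padding.

    channels[c][t] is the value of input channel c at time step t;
    filt[c][k] is the filter coefficient for channel c at offset k.
    """
    width = len(filt[0])
    steps = len(channels[0]) - width + 1
    out = []
    for t in range(steps):
        z = bias
        for c, series in enumerate(channels):
            for k in range(width):
                z += filt[c][k] * series[t + k]
        out.append(activation(z))
    return out


def max_pool(series, size):
    """Maximum over consecutive nonoverlapping ranges of `size` elements."""
    return [max(series[i:i + size])
            for i in range(0, len(series) - size + 1, size)]


if __name__ == "__main__":
    wind_speed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # made-up single channel
    print(conv1d([wind_speed], [[-1.0, 0.0, 1.0]], 0.0, relu))
    print(max_pool([1, 5, 2, 4, 3, 0], 3))
```

Because the same filter coefficients are reused at every time step, the number of parameters depends on the filter width and the number of channels, not on the length of the time series.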

Recurrent Neural Networks.
As we have shown, FNNs and CNNs can use historical data series but they lack the memory to store information over the long term. They also cannot use the information contained in the output of the network at previous instants. Recurrent Neural Networks (RNNs) [11] solve this problem by making the output $y_t$ at timestamp $t$ depend on previous computations through a hidden state $s_t$ that acts as a memory for the network, as shown in the following:
$$s_t = f(U x_t + W s_{t-1}), \qquad y_t = g(s_t) \tag{9}$$
where $x_t$ is the input to the network, $f()$ is the state activation function, $U$ and $W$ are the weights by which the inputs and the state for the previous instant are multiplied, respectively, to generate the new state $s_t$, and $g()$ is the function that generates the network's output based on the state. On occasion this function $g()$ will be the identity function $g(x) = x$, but it can also be a feedforward layer like the one described in (2). Figure 3 shows the RNN we used, unfolded into a full network. By unfolded we simply mean that we write out the network for the complete sequence of inputs and take the output at $t$ as the network's prediction for fitting. Instead of the basic RNN cell explained previously, we used two more advanced cells: long short-term memory (LSTM) [11] and Gated Recurrent Unit (GRU) [12] cells.
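The idea of unfolding a basic recurrent cell over an input sequence and taking the output at the last step as the prediction can be sketched as follows. For brevity the state is a single scalar and tanh is assumed as the state activation; the parameter names `u` and `w` are illustrative.

```python
import math


def rnn_predict(inputs, u, w, s0=0.0, output=lambda s: s):
    """Run a scalar recurrent cell over `inputs` and return the final output.

    u: weight applied to the current input (illustrative name).
    w: weight applied to the previous state (illustrative name).
    """
    state = s0
    for x in inputs:
        # The new state mixes the current input with the previous state.
        state = math.tanh(u * x + w * state)
    # The output function defaults to the identity.
    return output(state)


if __name__ == "__main__":
    print(rnn_predict([0.2, 0.5, 0.9], u=1.0, w=0.5))
    print(rnn_predict([0.9, 0.5, 0.2], u=1.0, w=0.5))  # order matters
```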

LSTM recurrent neural networks are capable of learning and remembering over long input sequences and tend to work very well for time-series forecasting problems [13]. As (10) shows, the output $y_t$ depends on the state $s_t$ of the LSTM cell through the activation function $\sigma_y()$ (which is generally $\tanh()$). The output gate $o_t$ controls the extent to which the state $s_t$ is used to compute the output $y_t$ by means of the Hadamard product ($\circ$):
$$y_t = o_t \circ \sigma_y(s_t), \qquad s_t = f_t \circ s_{t-1} + i_t \circ \tilde{s}_t \tag{10}$$
The state $s_t$ depends on the state of the previous instant $s_{t-1}$ and on the candidate for the new value of the state $\tilde{s}_t$. The input gate $i_t$ controls the extent to which $\tilde{s}_t$ flows into the memory and the forget gate $f_t$ controls the extent to which $s_{t-1}$ remains in memory. The $i_t$, $f_t$, and $o_t$ gates and the candidate for the new value of the state of the cell $\tilde{s}_t$ can be interpreted as the outputs of conventional artificial neurons whose inputs are the input to the cell $x_t$ at $t$ and the output of the cell $y_{t-1}$ at $t - 1$. The activation function for the gates is the sigmoid function, while for the candidate $\tilde{s}_t$ it is $\tanh()$.
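A single step of an LSTM cell with scalar input, output, and state can be sketched as below. The parameter dictionary layout and the key names are hypothetical; real cells operate on vectors and use weight matrices.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def lstm_step(x, y_prev, s_prev, params):
    """One scalar LSTM step; returns the new output and the new state.

    params maps "i", "f", "o", "c" (input, forget, and output gates, and the
    state candidate) to a tuple (w_x, w_y, b). The layout is an assumption.
    """
    def unit(name, activation):
        w_x, w_y, b = params[name]
        return activation(w_x * x + w_y * y_prev + b)

    i = unit("i", sigmoid)        # how much of the candidate enters memory
    f = unit("f", sigmoid)        # how much of the old state is retained
    o = unit("o", sigmoid)        # how much of the state reaches the output
    candidate = unit("c", math.tanh)
    s = f * s_prev + i * candidate
    y = o * math.tanh(s)
    return y, s


if __name__ == "__main__":
    zeros = {k: (0.0, 0.0, 0.0) for k in "ifoc"}
    print(lstm_step(1.0, 0.0, 2.0, zeros))
```

With all parameters at zero every gate equals 0.5 and the candidate is 0, so the state is simply halved at each step; training moves the gates away from this neutral setting.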
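GRU recurrent neural networks use a simpler cell, with two gates instead of three and with fewer parameters, meaning they can generally be trained with fewer samples. Chung et al. [14] show experimentally that the GRU outperforms the LSTM for simple networks, but could not conclude that the GRU is better in other cases. The following shows the equations that govern the behaviour of these cells:
$$\begin{aligned} z_t &= \sigma_g(W_z x_t + U_z y_{t-1} + b_z) \\ r_t &= \sigma_g(W_r x_t + U_r y_{t-1} + b_r) \\ \tilde{y}_t &= \sigma_y\big(W_y x_t + U_y (r_t \circ y_{t-1}) + b_y\big) \\ y_t &= (1 - z_t) \circ y_{t-1} + z_t \circ \tilde{y}_t \end{aligned} \tag{11}$$
where $\tilde{y}_t$ is the cell's output candidate, $z_t$ is the update gate, which controls the extent to which $y_{t-1}$ or $\tilde{y}_t$ is used to compute the output, and $r_t$ is the reset gate, which controls the extent to which $y_{t-1}$ flows into the cell's output candidate $\tilde{y}_t$. In GRU the activation functions $\sigma_g()$ and $\sigma_y()$ are the sigmoid function and $\tanh()$, respectively.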
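The behaviour of the update and reset gates can be illustrated with a scalar GRU step. The parameter layout is hypothetical, and the convention that the update gate weights the candidate (rather than the previous output) follows a common formulation that may differ from the one the authors used.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gru_step(x, y_prev, params):
    """One scalar GRU step; returns the new output.

    params maps "z" (update gate), "r" (reset gate), and "c" (output
    candidate) to a tuple (w_x, w_y, b). The layout is an assumption.
    """
    w_x, w_y, b = params["z"]
    z = sigmoid(w_x * x + w_y * y_prev + b)
    w_x, w_y, b = params["r"]
    r = sigmoid(w_x * x + w_y * y_prev + b)
    w_x, w_y, b = params["c"]
    # The reset gate limits how much of the previous output feeds the candidate.
    candidate = math.tanh(w_x * x + w_y * (r * y_prev) + b)
    # The update gate blends the previous output with the candidate.
    return (1.0 - z) * y_prev + z * candidate


if __name__ == "__main__":
    zeros = {k: (0.0, 0.0, 0.0) for k in "zrc"}
    print(gru_step(1.0, 0.8, zeros))
```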

Sensitivity to Disturbances.
An important aspect of this paper is the analysis of the behaviour of our models in the presence of input disturbances. In the following equation we show the expression for the input $x_i$ assuming that it undergoes a small incremental change $\Delta x_i$:
$$x_i = x_i^* + \Delta x_i \tag{12}$$
where $x_i^*$ is the $i$th value of the input sample without disturbance. It is important to note that the input $x_i$ to the model is the $i$th input to the first layer. Similarly, $y$ is the output of the last layer.
A perturbation $\Delta x_i$ in the input $x_i$ induces a disturbance in the output $y$ of the neural network. When there is no perturbation in any of the model's inputs, the output of the neural network is $y^*$. In order to determine if the model is robust against perturbations in the $i$th input, the sensitivity $S_i$ has to be calculated [15]. We show its expression in the following:
$$S_i = \frac{\Delta y / y^*}{\delta_i} \tag{13}$$
where $\Delta y$ is the corresponding change in the value of the output variable $y$ and $\delta_i = \Delta x_i / x_i^*$ is the input perturbation ratio.
If the sensitivity $S_i$ is lower than 1.0, the network attenuates the input disturbances; if it is equal to 1.0, the network neither attenuates nor amplifies them; and if it is greater than 1.0, the network amplifies them.
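The sensitivity calculation can be sketched for any model exposed as a Python callable. The function name, the use of absolute values, and the toy linear models in the example are assumptions for illustration; the sketch also assumes that the unperturbed input and output are nonzero.

```python
def sensitivity(model, inputs, index, ratio):
    """Relative output change divided by the relative input perturbation.

    model:  callable mapping a list of inputs to a single output value.
    index:  position of the input to perturb.
    ratio:  input perturbation ratio, e.g. 0.1 for a 10% change.
    """
    baseline = model(inputs)
    perturbed = list(inputs)
    perturbed[index] = inputs[index] * (1.0 + ratio)
    delta_y = model(perturbed) - baseline
    return abs(delta_y / baseline) / abs(ratio)


if __name__ == "__main__":
    attenuating = lambda x: 2.0 * x[0] + 10.0  # offset dampens relative change
    proportional = lambda x: 3.0 * x[0]
    print(sensitivity(attenuating, [5.0], 0, 0.1))   # below 1.0
    print(sensitivity(proportional, [5.0], 0, 0.1))  # about 1.0
```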

Results and Discussion
As noted earlier, the wind farm currently uses a polynomial regression to predict the farm's generation, as required to participate in the daily market. To give an idea of its accuracy, Table 3 shows the MAE and MASE for some estimators using the historical data available. The MASE indicates the absolute error relative to the error in the one-hour naive forecast reference estimator. Therefore, a MASE greater than 1.0 indicates the predictor works worse than the reference estimator, while a MASE lower than 1.0 indicates that it works better.
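The two error measures can be sketched as follows. The function names are illustrative, and the scaling error is computed here on the same series being evaluated, which is a simplifying assumption; the source does not state which data were used for the naive reference.

```python
def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def naive_forecast_mae(actual, lag=1):
    """MAE of the naive forecast that repeats the value observed `lag` steps ago."""
    errors = [abs(actual[t] - actual[t - lag]) for t in range(lag, len(actual))]
    return sum(errors) / len(errors)


def mase(actual, predicted, lag=1):
    """MAE of the predictor scaled by the MAE of the naive forecast."""
    return mae(actual, predicted) / naive_forecast_mae(actual, lag)


if __name__ == "__main__":
    measured = [10.0, 12.0, 11.0, 15.0]   # made-up hourly generation
    forecast = [11.0, 11.0, 13.0, 14.0]
    print(mae(measured, forecast), mase(measured, forecast))
```

Setting `lag=24` in `naive_forecast_mae` gives the error of a naive estimator that reuses the generation measured 24 hours earlier.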
As Table 3 shows, the polynomial estimator is a little over two times worse than the one-hour naive forecast reference estimator; however, this naive estimator would never be able to be used because the prediction has to be made and sent to the grid operator 24 hours in advance.
To obtain a more realistic comparison, the polynomial regression was compared with another naive estimator that uses the actual generation measured 24 or 48 hours earlier for its prediction at a given time. This naive estimator could be used at the farm, though Table 3 shows that the polynomial regression is considerably better than this second naive estimator.

Feedforward Neural Networks.
Figures 4, 5, and 6 show how the MASE evolved for the validation data set over the course of the iterations for networks with different numbers of hidden layers and neurons and with ReLU, sigmoid, and SELU activation functions, respectively. In every case the final result is similar and slightly better than for the polynomial regression, whose MASE is 2.11. Figure 4 shows that a ReLU activation function yields good results with around 20 neurons in total across the hidden layers. If this size is increased, the number of overfitting cases rises gradually and the MASE is not reduced by either adding more layers or making the layers larger.
Figure 5 shows the results for FNNs with a sigmoid activation function. In this case we clearly see that two hidden layers yield better results than one, but after that no further improvements are obtained by expanding the network. The FNNs with a sigmoid activation function need three or four times more steps than a ReLU network to converge. In fact, of all the network types studied, they required the highest number of steps to converge. This problem grows with the number of layers due to the vanishing gradient problem.
Figure 6 shows the same curves but for FNNs with a SELU activation function. The improvement resulting from increasing the size of the first layer has a bound that can be overcome by increasing the number of layers. In general, it does not yield better results than the ReLU activation function for our problem, but this is to be expected since the benefits of this function are evident when used in problems that require a large number of layers (Klambauer et al. [9]).
We then trained estimators similar to the above, but with samples that contained the weather forecast features for every hour $t$ and for the previous 5 hours. As before, Figures 7, 8, and 9 show how the MASE evolved for this new data set. The final result is similar in every case, but better than for the polynomial regression and for the previous FNNs. This shows that these estimators are capable of making good use of time information.
Figures 8 and 9 also exhibit behaviour similar to Figures 5 and 6, respectively, converging to solutions with a smaller MASE. It is interesting to note that although the networks with a SELU activation function behave similarly to those with the ReLU function, the former exhibit fewer overfitting problems as the size of the network grows. In other words, with the SELU activation function, increasing the number of layers does not improve the results but the estimators are trained equally well.

Convolutional Neural Networks.
We trained a CNN-based estimator using the architecture shown in Figure 2 in order to compare its performance with that of the previous FNNs and to try to improve the results obtained by the latter. The size of the convolution filters was set at five, the ReLU activation function was selected for the convolutional layers, as is usual for these networks, and the size of the max-pooling window was set at three. Figure 10 shows how the MASE evolved for the validation data set over the course of the iterations for networks with different numbers of channels at the output of each convolution, between 8 and 32. The MASE for networks with 16 to 32 channels is very similar to that obtained for FNNs, but the CNNs converge much faster, in approximately half the time. A further advantage is that the training for this type of network can be sped up considerably by using graphics processing units (GPUs), although this possibility was not explored for this paper.
Figure 11 uses an image with 256 grey levels to show a representation of the coefficients of the filters trained for the first convolutional layer of the CNN with eight channels. Each filter has one row per input feature and shows how the filter uses said feature to contribute to the layer's output.

Recurrent Neural Networks.
A similar procedure was used to train the RNNs with LSTM and GRU cells of various sizes for the output $y_t$. Figures 12 and 13 show the trend in the MASE when training the RNN LSTM and GRU, respectively. Both RNN types exhibit an error that is very slightly larger than that of the FNNs and CNNs. The number of steps needed to complete the training is also very similar, though in reality each step consumes much more time. Although not shown in the graph, the amount of time needed by the RNNs was approximately five times that needed by the equivalent FNNs.
It should be noted that no large differences are evident between the RNN LSTM and the RNN GRU. It is also surprising how well both types of RNNs work, even with size 1 cells, which store considerably less information in their state than larger cells.
In light of these results, the problem of predicting wind generation for the daily market is better resolved by providing as input a time series with the forecast for previous hours and using estimators based on FNNs with a ReLU or SELU activation function, or CNNs. Specifically, the latter can be trained with a lower number of iterations and, presumably, in less time with the use of GPUs. As concerns ReLU FNNs and SELU FNNs, it is easier to avoid overfitting in the latter, though, if trained correctly, both are equally accurate.
Sensitivity to Disturbances.
Finally, we selected the best models of each type in order to analyse their behaviour in response to disturbances in the input. In every case, we used models trained with 6 h time series. As discussed in Materials and Methods, we ran the models with and without perturbations in the inputs in order to calculate their sensitivity. The perturbations were applied to the inputs corresponding to the forecasts of wind speed and direction for the time when the generation is to be predicted, because these are the inputs with the greatest influence on the model's output.
In this paper the input perturbation ratio and the sensitivity values obtained are shown in percentages. Sensitivities below 100% indicate that the estimators are capable of attenuating the perturbations, while sensitivities in excess of 100% indicate that they are amplified.
Figure 14 compares the average sensitivity of the best estimators for various network classes and sizes for different perturbation levels in wind speed. The figure shows that all of the estimators attenuate the input perturbations, reducing their influence on the output. In every case, the best performance is exhibited by the estimators based on RNNs, followed by SELU FNNs, sigmoid FNNs, and CNNs.
Figure 15 shows the average sensitivity of the same models for different perturbation ratio values in the wind direction input. In this case the results are much more dissimilar, most likely because the relationship between wind direction and power output is highly nonlinear and much more complicated to model adequately during training. As Figure 15 shows, only the estimators based on SELU FNNs, CNNs, and RNNs with cells of size 1 are capable of attenuating this type of perturbation. In light of these results and considering those obtained previously involving the accuracy of the different estimator types evaluated, the problem of predicting wind power generation for the daily market is better resolved by using estimators based on SELU FNNs or CNNs.

Conclusions
In this paper we considered the problem of predicting wind power generation in order to take part in the daily market that regulates the supply and demand in the Spanish electric system. We used deep-learning techniques to develop different predictors based on neural networks that were trained using data provided by the MADE wind farm, operated by the ITER on the island of Tenerife.
The predictors evaluated are based on feedforward neural networks of varying sizes and with different activation functions, convolutional neural networks, and recurrent neural networks. The conditions were the same as those employed for the farm with the polynomial model now in use, namely, relying on the weather forecast at least 24 hours in advance to output a predicted generation for the farm. The methodology was checked by training and validating the model with samples taken every hour between January 2014 and April 2016. The results were adequate, yielding better results than the 1-hour reference naive forecast estimator and the polynomial model used at the wind farm. Specifically, the use of time series for the input samples proved to be the best way to minimise the error. Moreover, of the different types of neural networks evaluated, the CNNs and FNNs with the ReLU or SELU activation function were shown to be the most accurate, although the differences between the best candidates from the various network types were not significant. The traditional sigmoid FNNs are on a par with the other types trained, though they converge much more slowly during training.
Finally, we conducted a sensitivity analysis of the models, which revealed that trained neural networks are able to attenuate some input disturbances. For disturbances in the wind speed input, the best candidates from every network type were able to attenuate the disturbances, though this is much more difficult to achieve with perturbations in the wind direction input, which even caused some network types to amplify the perturbations. In this case, the CNNs, SELU FNNs, and the various RNN types exhibit the best performance. Taking all the results into consideration, the best neural network estimators, from the standpoint of offering the lowest absolute error and being the least sensitive to perturbations, are those based on SELU FNNs and CNNs.

Figure 3: RNN unfolded into a full network.

Figure 5: Trend in MASE for different FNNs with a sigmoid activation function.

Figure 6: Trend in MASE for different FNNs with SELU activation function.

Figure 7: Trend in MASE for different FNNs with ReLU activation function and 6 h historical input data.

Figure 8: Trend in MASE for different FNNs with sigmoid activation function and 6 h historical input data.

Figure 9: Trend in MASE for different FNNs with SELU activation function and 6 h historical input data.

Figure 11: Coefficients of the filters in the first convolutional layer with 8 channels.

Figure 14: Trend in sensitivity for different disturbances in the wind speed.

Figure 15: Trend in sensitivity for different disturbances in wind direction.

Table 1: Hours of operation for intraday market sessions.

Table 2: Features in data examples.

Figure 2: General diagram of the CNNs used.

Table 3: Accuracy of current estimators.

Figure 4: Trend in MASE for different FNNs with ReLU activation function.