A Parsimonious Bootstrap Method to Model Natural Inflow Energy Series

Fernando Luiz Cyrino Oliveira, Pedro Guilherme Costa Ferreira, and Reinaldo Castro Souza 1 Department of Industrial Engineering, Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Rua Marquês de São Vicente 225, Gávea, 22451-900 Rio de Janeiro, RJ, Brazil 2 Brazilian Institute of Economics (IBRE-FGV), Rua Barão de Itambi 60, Botafogo, 22231-000 Rio de Janeiro, RJ, Brazil 3 Department of Electrical Engineering and Postgraduate Metrology for Quality and Innovation Programme, Pontifical Catholic University of Rio de Janeiro (PUC-Rio), Rua Marquês de São Vicente 225, Gávea, 22451-900 Rio de Janeiro, RJ, Brazil


Introduction
The Brazilian electrical energy production and transmission system is unique in the world, due to its continental dimension and the strong participation of hydro plants (about 80% of the energy produced in the country comes from hydro plants with various owners).The national grid, also known as National Interconnected System (NIS from now on) embodies generation utilities of the following geographical regions: Southeast/Midwest, South, Northeast, and North.Only 3.4% of the generated energy in Brazil is outside the pool of the NIS; they are formed by isolated generators located, mainly, in the Amazonas region [1].
It is well known that generation systems with strong hydro share are highly dependent on the hydrological regimes.Therefore, the energy operational planning consists of determining what would be the optimal amount of hydro and thermal generation for the planning horizon, that meets the projected demand and takes into account the constraints of the generation plants as well as the electrical constraints of the system [2].
As a consequence of this strong hydrological dependence, the uncertainty associated with the energy planning in Brazil requires the use of robust and reliable stochastic modeling of the hydrological time series.Hence, the need of generating 2 Mathematical Problems in Engineering syntheticseries from such stochastic models is required in order to produce the optimal operating costs and, at the same time, to increase the reliability of the system as a whole [3].
The energy planning in Brazil is carried out via a chain of mathematical model sand corresponding computer programs, aiming at producing the optimal operation and expansion programs [4].
The medium range operation computer program used in the country is the so-called NEWAVE.This software produces for each month of the planning horizon (that may vary from 5 to 10 years, on a monthly basis) the optimal share of hydro and thermal resources that minimize the expected operational costs of the system.The hydro plants are represented in an aggregate manner and the estimation of the optimal operational policy is carried out via a Stochastic Dynamic Dual Programming (SDDP from now on) [5].
To produce the optimal policy, NEWAVE considers a range of scenarios of energy inflows, all of them obtained by stochastic simulations having as the base models the Periodic Autoregressive structure, PAR() models, one for each month of the year.The problem formulation is based on equivalent energy systems, that is, the four geographical subsystems that constitute the NIS.In this approach, the hydro logical series are transformed into Natural Inflow Energy (NIE, for short) [6].
This way, the stochastic models employed to generate hydrological scenarios to be used by NEWAVE should preserve, as close as possible, the main features of the historical records, in order to guarantee that the produced optimal policy is reliable and satisfactory.In this paper, it is shown that the current approach to set the order and the parameters estimates of the PAR() structures is, somehow, limited and may be improved.Indeed, the existing approach tends to identify less parsimonious AR structures and the corresponding parameters are estimated by the classical method of moments.The proposed improvement described in this paper uses a nonparametric technique, known as Bootstrap, which allows, among other facilities, the estimation of more reliable confidence intervals for the model parameters.The dataset used in this paper corresponds to four monthly NIE time series related to the four regions of the country (subsystems Southeast/Midwest, South, Northeast, and North), covering the period ranging from January 1931 to December 2010.
The remaining part of the paper is organized as follows.Section 2 presents a short overview of the existing approaches to model NIE series, including the current approach used in the Brazilian Electrical System (BES), which is, in a sense, the main criticism of the paper.It is then followed by Section 3 where the proposed use of Bootstrap as an improvement tool to the existing model is thoroughly described.Finally, in the last section, the main results are presented and discussed, followed by suggestions for future researches on the subject.

Modeling NIE Series: A Review including the Model in Current Use in Brazil
According to [7], the time series models applied to hydrological series are an important tool for the planning and the operation of hydrothermal systems.There are a great number of papers in the specialized literature dealing with this problem for all possible types of time series observations, that is, yearly, monthly, weekly, daily, and so forth.Among the models cited in the literature, the following are the most used: (i) Concerning the use of the Box & Jenkins family of models, in particular the AR structures, they can be considered as the pioneers to model hydrological series.Indeed [8] proposed an AR(1) to model inflows and [9] used a similar structure to model monthly rainfall in Australian rivers covering the period from 1859 to 1952.
After these pioneer proposals, the following works on the area were related to improvements in their applications to hydrology problems as, for instance, in the models identification structure and parameters estimations.Some important papers using AR() and ARMA(, ) structures to model daily, monthly, and yearly hydrological series were published, such as [7,[10][11][12][13][14][15][16].In the case of the two contributions by Beard mentioned above, it is addressed the question related to the inflow simulations for a particular inflow station and the extension of this result to more than one inflow station that may have a correlation structure among them.
More recently, [17] uses the AR() and ARMA(, ) structures to model the average monthly inflows of Goksu river, located in the Southern region of Turkey.The interesting aspect dealt by the author is the use of three possible ways to arrive at the model structure identification: via the Autocorrelation Function, Minimum Residual Variance, and the Akaike Information Criterion.Such analysis showed that it should be considered more than one approach to select the model order, differently of what is used in the BES that uses only one structural identification method.
Concerning the periodic class of models, the emphasis has been put on the PAR() and PARMA(, ) structures.The PAR() structure fits a single AR() to each period of the series and, similarly, a PARMA(, ) structure fits an ARMA(, ) structure to each period of the series.
The extension from PAR() into PARMA(, ) structures is not trivial, as pointed out by [18].Besides, it is also mentioned that the PAR() structures have produced rather good fit to hydrological series and this would not encourage the use of more elaborated structures, such as PARMA(, ).However, one can find in the literature a considerable number of papers where the adopted structure was the more elaborated PARMA(, ) as, for instance, the works by [16,[19][20][21][22]. Particularly, in [22] the author presents two possible ways to make the parameters estimation: via Method of Moments (MOM) and via Maximum Likelihood (ML) and concluded that the latter produces better results than the former.On the other hand, [19] proposes an alternative way to estimate the model parameters of a PARMA(, ) model through a more parsimonious formulation and apply the approach to a monthly inflow series of Frazer river (British Columbia) and Salt river (Arizona).
In Brazil, the paper by [23] is the main reference concerning the production of weekly forecasts based on ARMA(, ) models, considering periodic and nonperiodic parameters.They use the software PREVIVAZ [24], developed by CEPEL (Brazilian Electrical Research Center) [25] and used by ONS, the Brazilian National System Operator [1], on the nationwide operation planning.
The model used in the BES to produce the NIE scenarios is the Periodic Autoregressive PAR() structure.These scenarios are used in a SDDP, that produces the Cost to Go Functions [26,27].As a result of this approach adopted officially in Brazil, a large number of papers have been devoted to the subject, discussing and proposing alternatives to improve this statistical procedure.
Although the use of PAR() models has been widely used in Brazil, it is important to bear in mind that they were first proposed by [8] and have been discussed ever since in the international literature; see, for instance [28,29], that have made substantial contributions to the implementation of such models.Also it is important to mention the contributions of other authors on the discussion concerning the best way to estimate the model parameters (Method of Moment, Maximum Likelihood, and Bayesian Inference Methods, among others) [16,22,30].
Therefore, this paper aims to propose methodological advances in the PAR() model regarding the structure identification and the model parameters estimation.

The PAR(𝑝) Model.
Assuming that a time series is composed of  years and  seasonal periods of observations (in this work,  = 12 months) and that  , is a realization of this series in year  and month  ( = 1, 2, . . .;  = 1, 2, . . ., ), the Periodic Autoregressive model, PAR(), where  is the model order ( =  1 ,  2 , . . .,   ), obtained by assuming a simple AR structure for each seasonal period.The analytical expression for the PAR() model is given by : expected mean for seasonal period .
: th autoregressive coefficient for seasonal period .
: independent noise series with zero mean and variance  2   for seasonal period .
Denoting    as the correlation between  , and  −, , and considering that ( 1) is an autoregressive model of the Box & Jenkins family, the parametric estimation can be conducted by solving the Yule-Walker system [31].For any period  we have Denoting by    the th autoregressive parameter of a process of order ,    is the last parameter in this process.The Yule-Walker equations to each period  can be rewritten as follows: If the model orders were previously known, the parameters could be obtained directly by solving the equation system (3).However, as one has to first identify the order of the model, the system (3) can be solved sequentially for each value of   , obtaining the values for the parameters     ,1 ⋅ ⋅ ⋅     ,  for each model order   .At this stage, it is important to discuss some questions regarding the model identification and parameters estimation.Concerning the identification of the model order, it is usually used the 95% confidence interval for   , given by the asymptotic result 1.96/, where  is the number of years of the time series.Such asymptotic interval was proposed by [32,33].In [34], the authors defined a Quasi White Noise process for series with this asymptotic approximation.This study was further developed by [35] for Periodic Autoregressive structures.
In practical terms, there are various ways to identify the Box & Jenkins model structure.One possible way consists of incrementing the value of   , starting from 1 until a value "+1, " for which the parameter   +1,+1 is no more significant.In this case, the last value of   for which     ,  is significant will be the model order; that is,   = .The model parameters become   ,1 ⋅ ⋅ ⋅   , .This criterion will be referred to here as "left to right" identification procedure (LR from now on).
The second possibility, that is, "right to left" identification procedure (RL from now on), as opposed to what happens with the LR procedure is such that the order of the model is defined as the value  * such that, for all  >  * ,    is not significant.As an example, if the Partial Autocorrelation Function (PACF) shows that the lag 6 is significant and for all lags greater than 6 is not significant, the identified order will be 6, even if intermediate values for the parameters    are nonsignificant.In actual fact, this is the way the model orders are identified in the current BES implemented in the NEWAVE computer system.

Method
Structural identification Parameters estimation MOM-RL PAFC asymptotic confidence interval.Criterion: for a period , the procedure looks for the largest order  so that all    estimations for  >  are not significant.

MOM (point estimates)
BMOM-RL PACF Bootstrap confidence interval for each lag.
Criterion: the same as that of MOM-RL.

MOM (point estimates)
MOM-LR PAFC asymptotic confidence interval.Criterion: for a period m, the procedure looks for the largest order i so that all    estimations for  <  are significant.This criterion does not admit nonsignificant intermediary lags.

PBMOM
PACF Bootstrap confidence interval for each lag.Criterion: the same as that of MOM-LR.

MOM (Bootstrap confidence intervals)
A third way to identify the PAR() model order was proposed recently by [35].The authors use the resampling technique known as Bootstrap to arrive at the model order identification.It was observed that the proposed Bootstrap based approach produces results very close to that of the LR method.
Another possibility, mentioned in the appropriate literature, is the minimization of some information criteria to identify the model order.The most cited criteria are the Akaike Information Criterion [36] and Schwarrz Criterion [37].The advantage of this approach is the equilibrium between the fitted model and the model parameter parsimony.In the specific case of using these criteria for time series, the estimated parameters of any model will not be used if they do not contribute to the good fitting of the model to the series.According to the recent literature, this is the most adequate structural identification procedure of the Box & Jenkins models; see [38].In [39], it is emphasized the importance of this procedure for periodic models.
Finally, with respect to the parameters estimation, it is worth mentioning that the technique proposed in this paper is somehow similar to the one adopted in the BES; that is, it used the Method of Moments (MOM), where the sampling moments are equated to the theoretical moments obtained directly from the model analytical specification.
Once the model structure has been identified, the next step consists of obtaining estimates for the model parameters.The recommended approach to obtain these estimates is the Maximum Likelihood (ML) method.In the case of pure AR structures, like PAR(), one can also use the Method of Moment estimate procedure, as it produces similar results to ML estimation, [29].
Having the model identified and the parameters estimated, the next step consists of the application of a set of goodness of fit tests to check if the model has produced residuals that resemble a white noise process, as well as check on possible underfitting or overfitting of the model structure.
As mentioned in the introductory section, the goal of this paper is to propose improvements in the current stochastic approach used in BES, regarding the identification and estimation phases of the PAR() model.This way, it will be possible not only to estimate parsimonious models, as the LR identification does [39,40], but also to fit the confidence intervals to the model parameters.As a result of that, only significant parameters will be estimated by the models.Therefore, the main objective here is to raise the discussion about this subject and propose a new way to produce confidence intervals for the parameters of the PAR() models, as well as improve on the structural identification.
Finally, four models were tested in order to evaluate the performance of the proposed approach.For each one of them, it was carried out both parameters estimates and structural order identification using different schemes, as shown in Table 1.The last one, called Parsimonious Bootstrap Method of Moments (PBMOM), is the main contribution of this paper.

The Bootstrap Technique
Bootstrap is a computationally intensive statistical technique that allows the evaluation of the variability of estimators based on a unique sample.The technique is recommended for problems where the conventional statistical procedures are not straight forwardly implemented.It is recommended for situations with small and/or large samples, provided the results obtained are close to those obtained via the usual asymptotic analysis in large samples.
Operationally, the technique consists of drawing with replacement the elements of a finite random sample, producing the so-called Bootstrap sample of the same size of the original sample.This way one can obtain a set of Bootstrap samples, enough to generate the Bootstrap distribution of the statistics of interest.Thus, the set of Bootstrap observations corresponds to an estimate of the true sampling distribution of the statistics of interest.As shown in [41], as the original sample size tends to infinity the Bootstrap distribution of the statistics converges to the true distribution of these statistics.
Let  = ( 1 ,  2 , . . .,  −1 ,   ) be the original finite sample of size  obtained from an unknown probability model described by its distribution function  and  = () particular statistics of interest.Denote  *  ,  = 1, 2, . . ., , as the th Bootstrap sample of size  obtained from the original sample  by means of a drawing with replacement procedure.For each sample one can obtain the corresponding Bootstrap estimate of the statistics of interest:  *  = ( *  ).
The mean, variance, and standard error of the Bootstrap estimator of  is defined by As shown in [41], where  1 and  2 are constants that depend on the population distribution , but do not depend on  and .The result shown in (5) illustrates that the uncertainty of the Bootstrap estimators depends, mainly, on the original sample size, no matter how big is the number of Bootstrap samples one takes (i.e., the value of ).In fact, in the present application, it was not necessary to estimate explicitly this expression.Some of the published methods to build confidence intervals using Bootstrap are Nonstudentized Pivotal, Bootstrapt, Percentile, Bias Corrected Percentile, Bias Corrected and Accelerated Percentile, Test-Inversion, and Studentized Test-Inversion.For details, see [41][42][43].In this paper, it will be used the Percentile method, which is based on the percentiles of the estimated Bootstrap distribution and is adequate for parametric and nonparametric simulations, according to [42].
Denoting by Ĝ the distribution function of θ * , then the (1 − 2) percentile confidence interval is defined by the  and the (1 − ) percentiles of Ĝ.Hence, the lower bound value of the interval is given by Ĝ−1 () while the upper bound value is Ĝ−1 (1 − ).By considering Ĝ−1 () = θ * () , then the confidence interval of the Bootstrap distribution can be written as It is important to emphasize that this result is theoretical and it refers to the case where the number of samples is infinite.According to [44], on the Bootstrap case, this number is finite and is given by ( being the original sample size) Therefore, according to the previous equation, for a sample size equal to 10, it would be possible to obtain  = 92378 Bootstrap samples.In real terms,  should be big enough to allow the calculation of the statistics of interest.On the other hand, the Bootstrap sample size  does not necessarily need to be obtained as a result of (7).This was the case in the present application.These statistics are ordered in ascending order and, as such, the percentiles of interests are estimated.Let θ * ()  the Bootstrap samples.For instance, in the case of  = 10.000 and  = 0.025, then lower bound value is the point in the position 250 and the upper bound value is the point in the position 9750.According to [34], this sample size is large enough to guarantee the minimum variance estimators, as shown in (5).Hence, the approximate (1 − 2) interval is given by The percentiles are computed from the empirical distribution using the Bootstrap samples.Notwithstanding, according to [42], supported by the Central Limit Theorem, considering very large Bootstrap samples ( → ∞), the Bootstrap histogram resembles a Gaussian distribution, justifying (8).With these two intervals estimated, one can check their significance in the following way: if the zero value is within the interval [ θ%,inf , θ%,sup ], then this parameter is nonsignificant and its value is zero.Otherwise, the parameter is statically different from zero.

Results
Before presenting the results, it is important to call the attention of the reader to the correct understanding of the tables and figures containing the results of the analysis.In Tables 2, 4, 6, and 8 are shown the estimated parameters and the identified orders for each one of the 12 months for both methods: BMOM-RL and MOM-RL.These procedures may result in high model orders, increasing, therefore, the risk of nonparsimonious models.The nonsignificant parameters, according to the Bootstrap significance level (that is used in the BES to generate the simulated series), are marked by bold.
As a last point related to Tables 2, 4, 6, and 8, on the right hand side of them are produced the differences, expressed in percentages, between the estimated values from the BMOM-RL and the MOM-LR methods.
With respect to Tables 3, 5, 7, and 9, they show the results obtained with both PBMOM and MOM-LR.Besides producing the results coming from the two methods, there is also the comparison with the more parsimonious identification method (MOM-LR) to show the importance of the reduction of the autoregressive order of the PAR() models, specially concerning the convergence of the estimated parameters from the methods PBMOM and MOM-LR.This result, in a sense, points out the robustness of the PBMOM method.
In building the results for the BMOM-RL and PBMOM approaches,  = 10, 000 Bootstrap samples were generated.In this way, once the model order is obtained following the procedure described above, the lower and upper bounds percentiles were calculated, as well as the Bootstrap averages for each one of the lags corresponding to the autoregressive orders of the models.
Moving now to the presentation of the results by subsystems, one can see from Table 2 that, in general, when the identification is not carried out by the most parsimonious procedure, the difference between the estimated parameters by the two approaches was around 20% superior in average.
Besides that, when one looks at the parameters estimated by the BMOM-RL, that allows for the estimation of confidence intervals, it can be noticed that the significant parameters obtained by the MOM-RL method (the one used in the BES) should have been classified as nonsignificant, once the value zero lies within the confidence interval.Such limitation could only be observed by the use of the Bootstrap estimation procedure.
As a counterpart to the identification criterion shown in Table 2, in Table 3 are presented the results obtained when the identification and the estimation of the parameters are carried out using the PBMOM technique.The average difference between the estimated parameters from the PBMOM and the MOM-LR methods is less than 1%; a clear cut indication of the consistency of the method.Moreover, the recommended method, if adopted, makes the estimation of nonsignificant parameters in the models impossible.
Considering that the PBMOM produced more parsimonious formulations, such models were fitted and used to generate scenarios, which, in a sense, reproduced the historical data for all subsystems.Mean, variance, distributional form, correlation, and negative sequence tests were performed [45].
In all of them the null hypothesis of similarity with the historical observations was accepted at 95% level.As an example, Figure 1 shows that, for the Southeast/Midwest subsystem, the average of the generated scenarios and the historical series average almost coincide.
Based on the scenarios obtained by using the proposed model, the performance of these series was compared with the corresponding historical ones using Stochastic Dynamic Programming, as shown in [26].The total expected operating costs for both sets of series were compared.This expected cost (for the whole system) is shown in Table 10.It can be observed that the use of synthetic series (generated by PBMOM) allows for a cost nearly 7% lower than the one obtained with the historical series.Thus, one can state that, using this approach to produce synthetic series, it is possible to obtain more economical operation policies.

Final Remarks
This paper presents a new approach for the stages of structural identification and parametric estimation in order to simulate synthetic scenarios of NIE via PAR() using the Bootstrap technique.In particular, the Bootstrap technique was essential in the estimation of the confidence interval for the PACF coefficients, leading to models far more parsimonious than the traditional asymptotic approach to set such intervals used previously.Besides this, synthetic scenarios were produced and these series were generated using the available monthly data for the four Brazilian subsystems to fit the PAR() PBMOM structures.The obtained results using the PBMOM produced not only more parsimonious model orders but also adherent stochastic scenarios and, in average terms, lead to a better use of water resources in the energy operation planning.
As a consequence of these findings, one could state that the proposed approach in this paper is quite promising, specially taking into account that it has been used for the highly complex Brazilian hydrothermal energy system.Therefore, the method could also be recommended to be used for any other large scale hydrothermal energy system.

Table 9 :
Comparison between the parameters estimated: PBMOMand MOM-LR-subsystem North.

Table 3 :
Comparison between the parameters estimated: PBMOM and MOM-LR-subsystem Southeast/Midwest.

Table 4 :
Comparison between the parameters estimated: BMOM-RL and MOM-RL-subsystem South.

Table 5 :
Comparison between the parameters estimated: PBMOM and MOM-LR-subsystem South.

Table 6 :
Comparison between the parameters estimated: BMOM-RL and MOM-RL-subsystem Northeast.

Table 7 :
Comparison between the parameters estimated: PBMOM and MOM-LR-subsystem Northeast.

Table 8 :
Comparison between the parameters estimated: BMOM-RL and MOM-RL-subsystem North.

Table 10 :
Expected total operating cost (R$) for the NIS.