Statistical Uncertainty Estimation Using Random Forests and Its Application to Drought Forecast

Drought is part of natural climate variability and ranks the first natural disaster in the world. Drought forecasting plays an important role in mitigating impacts on agriculture and water resources. In this study, a drought forecast model based on the random forest method is proposed to predict the time series of monthly standardized precipitation index SPI . We demonstrate model application by four stations in the Haihe river basin, China. The random-forestRFbased forecast model has consistently shown better predictive skills than the ARIMA model for both long and short drought forecasting. The confidence intervals derived from the proposed model generally have good coverage, but still tend to be conservative to predict some extreme drought events.


Introduction
Drought is part of the natural variability of climate and its recurrence is inevitable and random. It affects nearly everywhere across all climate regions, though its features differ from region to region. Although specific definitions of drought depend on differences in disciplinary perspectives e.g., meteorology, hydrology, agriculture, and socioeconomy , a typical definition of it originates from a deficiency of precipitation over a prolonged period, resulting in water supply shortage for some activity, group, or environmental sector. Drought reduces crop, livestock, and forest production and can result in widespread famine and death. Drought ranks the first amongst all natural hazards in terms of the number of people directly affected 1-3 , and it is threatening nearly 50% of the most populated areas 4 .

Mathematical Problems in Engineering
Drought forecasting plays an important role in taking contingency actions in advance of drought to mitigate its risk and impacts. A variety of forecasting methods have been developed to predict drought occurrence. Statistical run theory is the first attempt to predict drought likelihood 5-10 . Other stochastic models such as Markov chain 11, 12 , renewal process 13, 14 , and Poisson process 15, 16 also have been long suggested to characterise and predict drought events. Time series model is another widely approach for drought forecasting. For example, Chung and Salas 17 applied a low-order discrete autoregressive moving average model to estimate drought occurrence probabilities and the risk of dependent hydrological processes. Mishra and Desai 18 used seasonal and nonseasonal autoregressive integrated moving average ARIMA models to forecast the standardized precipitation index SPI in the Kansabati river basin in India. They showed that the predicted SPI agreed with observations with one-to two-month lead time but the prediction performance decreased with increasing lead time. Durdu 19 showed that the method of Mishra and Desai 18 can be used for drought forecast with reasonably accuracy upto two months in the Buyuk Menderes river basin in Turkey. Moreover, many other forecasting methods have been developed to address nonlinearity and nonstationarity of time series of drought events. Kim et al. 20 evaluated the application of the nonparametric kernel smoothing to estimate return periods of drought in arid regions. Kim et al. 21 proposed a conjunction model to forecast drought based on wavelet transformation and neural networks. Mishra and Desai 22 showed that a feed-forward recursive neural network provided better prediction of SPI than the ARIMA model with one-month lead time.
Drought forecasting remains challenging and is subject to great uncertainty partly due to anticipating some parts of hydrologic cycle e.g., rainfall, soil moisture, and groundwater level . It is, therefore, desired to make single-value predictions as well as reliable uncertainty measures, especially for long lead times. Uncertainty estimation of drought forecast can largely improve decision making on water resources management. All researches mentioned in the previous paragraph only focus on predicting the mean of drought stage and did not include any analysis of uncertainty. Consideration of forecast uncertainty has been shown to provide valuable information to decision makers who need to understand the variability of upcoming drought status. For example, Carbone and Dow 23 resampled historical monthly temperature and precipitation and derived an ensemble forecast of the Palmer drought severity index PDSI with weighted resamples. Hwang and Carbone 24 applied the resampling strategy suggested by Carbone and Dow 23 to the residuals of a predictive model of drought indices and generate drought ensemble forecasts. They found the predictive model based on nonparametric autoregressive models had good forecast capability of SPI with up to 3-month lead time in terms of mean forecast. They also demonstrated that variability of the forecast ensemble members shared probability density functions that were similar to those of the observations. The objective of this research is to introduce a statistical method, called Random Forest RF , to predict drought events together with an appropriate estimation on uncertainty measures. The RF method does not restrict a particular relationship within the drought index series or make any distributional assumptions on the model error. An ensemble of equally probable realisations of drought indices is generated by the proposed model and differences amongst the realisations can be used as a measure of uncertainty. Figure 1 illustrates how the RF model is superior over the ARIMA model by an example. This paper is organised as follows. Section 2 describes the study region and data used in this study. The predictive model based on the RF method is introduced in Section 3, followed by the study results reported in Section 4. Summary and further research recommendations are presented in Section 5.

Study Region and Data
Flowing through many big cities including Beijing and Tianjin, the Haihe river basin is the largest and the most important water system in northern China. However, the water usage of the Haihe river basin has a tremendous conflict between water supply and demand. The water shortages in this region have especially intensified due to the rapid development of the economy and the explosion of the city population during recent years and cause many severe environmental and ecological problems, such as drought, drying up of river systems, and degradation of lakes and wetlands 25, 26 . The monthly precipitation observations in the period of 1966-2004 recorded from four stations in the Haihe river basin including Beijing, Shijiazhuang, Tangshan, and Tianjin were used in this study Figure 2 .
The SPI 27 is the drought index used in this study and it measures the severity of drought over different time scales. It is a dimensionless index and is only defined by historical rainfall observations. In contrast to the PDSI, the SPI is not adversely affected by topography, but can provide indication or identification of drought a few months sooner 28 . The SPI indicates the severity of drought in a large scale and provides little details for a particular drought event. The critical drought duration is generally characterized by a stochastic process, such as a second-order Markov chain 29 . The 3-month and 12-month, SPI, denoted by SPI 3 and SPI 12 , respectively, were considered in the present work. In particular, forecasting using SPI 3 with a one-month lead time and SPI 12 with a six-month lead time were made for short-term and long-term drought forecasting, respectively. Forecasting models were developed based on the data from 1966 to 1995, and predictions one month ahead for SPI 3 or six month ahead for SPI 12 were made from 1996 to 2004. SPI based on a month other than 3 and 12 months can be forecasted.

Meteorological stations River
Subbasin boundary Haihe river basin boundary

Model
Let SPI m represent the SPI of a specific month m, and the predictive model of SPI m can be written as where l is the forecast lead time and d is the order of nonparametric autoregressive model. In this model, SPI is predicted from the SPI of previous months. If the function f is a linear function, the model defined by 3.1 is an autoregressive model 30 . No specific functional form of f is assumed in the proposed model and this will reduce the risk of model misspecification. We will apply the RF method to estimate f for all months. In the rest of this section, a brief description of RF will be presented.

Regression Trees
Regression trees, often described in graphical and biological terms, use a tree structure to predict new data from training data. In contrast to linear regression, which is a global model and applies a single predictive formula holding over the entire data space, regression trees partition space into smaller regions that have the most homogeneous collection of outcomes. The partitioning is repeated on each derived sub-data space in a recursive manner, called recursive partitioning, and an optimality criterion is used to guide each partition. This process is often likened to the way a tree divides into smaller branches called nodes and eventually Mathematical Problems in Engineering 5 to a single leaf called terminal node , hence the name of the method. The terminal nodes would comprise a collection of closely matched previous observations found from recursive partitioning, and the fitted response is the average of the observations for that node. Figure 3 provides a simple example of regress tree to predict SPI and circles denote nodes and boxes denote terminal nodes. The basic regression-tree-growing strategy involves in at least three fundamental questions: splitting rules, terminal nodes, and terminal node assignment.

Splitting Rules
On what criteria are splits to be made? Creating splits is similar to variable selection in regression. For regression problems, the variable and the location of a split are chosen by the sum of squared error between the observations and the mean of the observations within each node. For example, the root node SPI m−1 was split to two branches and a threshold value 2.1 was used as a splitting rule shown as Figure 3.

Terminal Nodes
When to stop a tree from growing? The simplest terminal node is the node where all training data are from a single class or a single value of response. This choice may make a tree grow so large that it fits all training data perfectly. This tends to overfit training data and has potentially poor predictions on independent test data. If a tree is too small, the relationship between predictors and predicants may not be extracted completely. Choosing an appropriate tree size is thus of great importance. A common approach is to grow an overly large tree so that a minimum node size is reached and prune the tree back to the optimal size by an independent test data or cross-validation. The example regression tree in Figure 3 has four terminal codes.

Terminal Node Assignment
Which value is to be assigned to the terminal nodes? For regression problems, values at the terminal node are generally assigned by the mean of the observations in that node. The assignment rules can be modified to reflect costs of explanatory variables and misspecification, if necessary. The values for terminal codes in Figure 3 are −1, 1.5, 0.2, and 0.5, respectively.
More technical aspects regarding regression trees can be found in Hastie et al. 31 , Chapter 9, for example.

Random Forest
A simple regression tree has the disadvantage of being sensitive to training data used to build the tree, especially when the size of training data is small. Little change in training data may result in very different trees and predictions 32 . Ensemble trees are one solution to improve robustness and reliability of regression trees by "randomly" growing a collection of trees from bootstrap samples i.e., training data randomly drawn with replacement and aggregating predictions. RF 32-34 is one of the most popular ensemble tress methods and has been extensively applied in medicine, neuroscience, bioinformatics e.g., 35-37 , and 6 Mathematical Problems in Engineering  The confidence interval of SPI j * m at a nominal level of α e.g. 0.95 is defined by 3.2 From the algorithm described above, only two parameters p the number of predictors randomly selected at each node and B the number of ensemble trees are required to specify to implement the RF method. In this study, we choose p d/3, where d is the number of total predictors i.e., the total number of previous SPI used in prediction and B is chosen to be 500. An RF R package available at http://cran.r-project.org/web/packages/randomForest is used in this study.

Results
In order to demonstrate model performance, the ARIMA model is considered as a baseline and is compared with the model based on RF. As described in Section 2, the number of the previous SPI used in the predictive model is determined by the best fitted ARIMA model selected from the Akaike information criterion AIC . The performance difference between the RF-based model and the best fitted ARIMA reflects the validity of the assumption of the ARIMA model, such as linearity and stationarity. Three well-known error statistics were calculated to measure the difference between the observed and predicted SPI series, including bias, mean absolute error MAE , and root mean-squared error RMSE and they are defined by where EST i and OBS i denote the ith estimated and observed values, respectively. Furthermore, according to the weather classification of McKee et al. 27 , dry days are defined by the days with SPI values falling below −1. We thus considered another two error statistics based on dry days: RMSE at dry days and the proportion of dry days detected i.e., SPI predictions less than −1 at dry days . These two additional error statistics are intended to evaluate whether the RF is a more effective and efficient drought prediction tool than the ARIMA. All error statistics of one month ahead SPI 3 prediction and six month ahead SPI 12 prediction for each stations are presented in Table 1. We have the following findings. 1 In general, the RF performed consistently better than the ARIMA. In particular, all error statistics from the RF were smaller than those from the ARIMA, except that the biases of SPI 3 predictions obtained from both methods for Beijing were almost equal. 2 RMSE at dry days was greater than overall RMSE for all predictions, indicating the difficulty of predicting drought 8 Mathematical Problems in Engineering events accurately. The difference between RMSE and RMSE at dry days of the RF-based model was smaller than that of the ARIMA for each prediction. This suggests that the RFbased model is more robust in predicting dry events. 3 The RF-based model is even more robust for longer term prediction. The longer-term drought forecast typically involves more uncertainty and thus is more challenging to predict. The ARIMA indeed lost the predictive capability for SPI 12 prediction. For example, none of dry days in Tianjin indicated by SPI 12 was forecasted by the ARIMA. Instead, the accuracy of SPI prediction based on the RF was less affected by a longer lead time. In particular, at three out of four stations except for Shijiazhuang the RF led to comparable and even smaller prediction errors indicated by five error statistics. For a graphical illustration, the SPI 3 and SPI 12 predictions by the RF method together with the associated 95% confidence intervals are shown in Figures 4 and 5, respectively. Most predictions of SPI agreed with observations very well. For example, from Figure 4, a few extreme drought events with SPI 3 < −1.5 for Shijiazhuang were well forecasted by the RF method but not ARIMA. However, a number of extreme drought events identified by SPI < −1.5 still fell outside of the 95% confidence interval. It is evident that the confidence intervals for the one month ahead SPI 3 prediction were more narrow than that for the six month ahead SPI 12 , because more uncertainty is expected for the forecast with a longer lead time.

Conclusion
In this study, a drought forecast model based on RF is proposed to predict SPI from the SPI in previous months. Unlike traditional time series models such as the ARIMA, the forecast model is built on a nonparametric framework and is more flexible to capture the underlying relationship. The RF-based model has another advantage of generating ensemble of drought forecast rather than a mean prediction. A confidence interval based on the ensemble forecast can be served as a measure of forecast uncertainty. The performance of the proposed forecast model has been demonstrated by its applications to four stations in the Haihe river basin, China. Compared with the ARIMA, the RF-based predictive model is more reliable and efficient for both short-and long-term drought forecasting. The 95% confidence interval derived from the ensemble forecast covers nearly all observations reasonably well, though a few extreme drought events are identified outside the specified range. Further potential improvement of the drought forecast skill may be made by introducing useful climate indices and the outputs from climate models to the RF-based predictive model.