Statistical Downscaling of Temperature with the Random Forest Model

The issues in downscaling the outputs of a global climate model (GCM) to a regional scale appropriate for hydrological impact studies are investigated using the random forest (RF) model, which has been shown to be superior for large-dataset analysis and variable importance evaluation. The RF is proposed for downscaling daily mean temperature in the Pearl River basin in southern China. Four downscaling models were developed and validated using the observed temperature series from 61 national stations and large-scale predictor variables derived from the National Center for Environmental Prediction–National Center for Atmospheric Research reanalysis dataset. The proposed RF downscaling model was compared to multiple linear regression, artificial neural network, and support vector machine models. Principal component analysis (PCA) and partial correlation analysis (PAR) were used in the predictor selection for the other models for a comprehensive study. It was shown that the model efficiency of the RF model was higher than that of the other models according to five selected criteria. By evaluating predictor importance, the RF could choose the best predictor combination without using PCA and PAR. The results indicate that the RF is a feasible tool for the statistical downscaling of temperature.


1. Introduction
Global climate models (GCMs) are considered the most credible tools for the projection of future global climate change [1]. However, there is a general mismatch between the spatial and temporal resolution of GCM output and the needs of regional-scale climate change impact studies. Various techniques have been developed to downscale GCM outputs to finer scales. These methods are broadly divided into dynamic (physical) and statistical (empirical) downscaling [2,3]. Because dynamic downscaling is complex to model and computationally demanding, statistical downscaling techniques, which are simpler and easier to implement, have been widely used in climate change studies.
Statistical downscaling techniques can be divided into three categories: weather typing, weather generators, and regression-based methods. Various models have been developed and applied in the downscaling of temperature, such as linear regression [4,5], canonical correlation analysis (CCA) [6,7], artificial neural networks (ANN) [8], and support vector machines (SVM) [9]. Several comparison studies have been made in the past. Schoof and Pryor [10] demonstrated that ANN models give better estimates than multiple linear regression (MLR) models for daily temperature downscaling at Indianapolis. Kostopoulou et al. [11] demonstrated that MLR and CCA are superior to ANN in the simulation of minimum and maximum temperatures over Greece. Duhan and Pandey [12] compared MLR, ANN, and least-squares support vector machine (LS-SVM) models for downscaling the temperature of the Tons River basin in India and demonstrated that LS-SVM models perform better than ANN and MLR models. These comparison studies indicate that none of the aforementioned methods can assure an accurate estimate of temperature under all situations.
Predictor selection is critical for developing a statistical downscaling model. Suitable predictors should be informative, and the relationship between the predictors and predictands should be stationary [13]. Informative predictors can be identified using statistical measures, such as the Pearson, Spearman, and Kendall correlation analyses [9], CCA [14], maximum covariance analysis (MCA) [15], partial correlation (PAR) [16-18], and principal component analysis (PCA) [19,20]. Interactive model-fitting approaches are also used in predictor selection [21]. However, some limitations arise in application, such as the limited ability of traditional correlation analysis to interpret nonstationary and nonlinear relationships [22]. Therefore, a precise statistical downscaling method with an inbuilt predictor selection mechanism would be helpful for researchers studying climate change impacts. The random forest (RF) model is an ensemble machine learning technique based on a combination of classification or regression trees and statistical learning theory [23]. RF models have been applied successfully in various fields, such as risk analysis [24], groundwater studies [25], remote sensing analysis [26], and flood hazard assessment [27], and show particular advantages in land cover classification [28-31]. RF models have two important advantages. The first is the ability to handle large datasets with correlated conditional variables: the RF is precise in prediction, nonparametric, and robust in the presence of outliers, noise, and overfitting [23,32]. The second is the inbuilt variable importance evaluation: by permuting each variable randomly, its effect on the prediction results can be measured and its importance evaluated [33]. Based on this body of knowledge, the RF should, in theory, be highly applicable to downscaling and able to handle multivariable and nonlinear issues. Eccel et al. [34] adopted the RF along with four linear and nonlinear models in the postprocessing of two numerical weather prediction models for the prediction of minimum temperatures in an alpine region. However, the RF only served as one of the comparative models in that study. The advantages and disadvantages of the RF and its applicability to statistical downscaling have not been studied in detail.
The purpose of this study is to fully investigate the applicability of the RF for the statistical downscaling of temperature. The predictors are 26 large-scale variables derived from the National Center for Environmental Prediction (NCEP) reanalysis daily dataset, and the predictands are the observed temperatures at 61 national standard stations located in the Pearl River basin. The RF is used to capture the complex relationship between the selected NCEP predictors and the observed daily mean temperatures at these stations. A comparison study was conducted with MLR, ANN, and SVM models, for which the PCA and PAR methods were used in predictor selection.

2. Study Area and Data Description

2.1. Study Area. The Pearl River basin (Figure 1) is located in tropical and subtropical climate zones [35,36]. The annual temperature is 14-22 °C, and the annual precipitation is 1200-2200 mm, gradually decreasing from east to west; the regional distribution differs significantly, and the interannual variability is large. Precipitation is concentrated during April-September [36], accounting for 72-88% of the annual total [35]. The Pearl River basin is rich in water resources: the water resource of the entire basin is 4700 cubic meters per capita, 1.7 times the national per capita rate. However, the spatial and temporal distribution of water resources is uneven, and floods, waterlogging, droughts, and saltwater intrusion are frequent natural disasters. The basin is a highly developed region with a prominent position in the economic development of China.

2.2. Data Description
2.2.1. Temperature Data. The observed meteorological data used in this study are the daily mean temperatures at 61 national standard stations in the Pearl River basin. A continuous data series for the period 1961-2005 was selected for the study. These observations were obtained from the National Climate Center, which is in charge of monitoring, collecting, compiling, and releasing high-quality hydrological data in China. The observed mean temperature and standard deviation at the meteorological stations are shown in Figure 2. Over this period, the mean daily temperature increases from north to south, whereas the standard deviation shows the opposite trend.

2.2.2. Large-Scale Atmospheric Variables. The NCEP reanalysis daily data were downloaded from http://www.cdc.noaa.gov; 26 large-scale atmospheric variables at a resolution of 2.5° × 2.5° were derived from the dataset over the period 1961-2005. The details of the predictors are shown in Table 1. The NCEP data were interpolated to each station using the bilinear interpolation method, which has been widely used in statistical downscaling [37-39].
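The interpolation step can be sketched as follows. This is a minimal illustration, not the study's actual code; the grid layout, coordinates, and function names are hypothetical, with the 2.5-degree spacing taken from the NCEP grid described above.

```python
# Minimal sketch of bilinear interpolation of a gridded field to a
# station location. Grid values and coordinates are hypothetical.

def bilinear(x0, y0, dx, dy, grid, x, y):
    """Interpolate grid[j][i] (rows = y, cols = x) to the point (x, y).

    (x0, y0) is the coordinate of grid[0][0]; dx and dy are the grid
    spacings (e.g. 2.5 degrees for the NCEP reanalysis grid).
    """
    i = int((x - x0) // dx)          # column index of the lower-left node
    j = int((y - y0) // dy)          # row index of the lower-left node
    tx = (x - (x0 + i * dx)) / dx    # fractional position inside the cell
    ty = (y - (y0 + j * dy)) / dy
    # Weighted average of the four surrounding grid nodes.
    return ((1 - tx) * (1 - ty) * grid[j][i]
            + tx * (1 - ty) * grid[j][i + 1]
            + (1 - tx) * ty * grid[j + 1][i]
            + tx * ty * grid[j + 1][i + 1])

# Example: one 2.5-degree cell around a station at (111.25 E, 21.25 N).
field = [[10.0, 12.0],
         [14.0, 16.0]]
print(bilinear(110.0, 20.0, 2.5, 2.5, field, 111.25, 21.25))  # midpoint -> 13.0
```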

3. Downscaling by Using the Random Forest Method

As shown in Figure 3, the RF is composed of a set of classification and regression trees (CARTs). The accuracy of the RF prediction depends on the strength of the individual CARTs [23]. Each CART consists of a root node, internal nodes, and leaves. Each internal node is associated with a test function that splits the incoming data. For regression trees, splitting follows a squared-residuals minimization algorithm, which requires that the expected sum of variances in the two resulting nodes be minimized:

argmin over {x_j ≤ x_j^s, j = 1, 2, ..., p} of [P_L · var(Y_L) + P_R · var(Y_R)]    (1)

Here, P_L and P_R are the fractions of samples falling into the left and right child nodes, var(Y_L) and var(Y_R) are the variances of the response vectors of the corresponding left and right child nodes, and x_j ≤ x_j^s, j = 1, 2, ..., p, are the candidate splitting questions.
Compared to traditional CART methods that use the whole dataset, the RF trains each individual CART on a bootstrap resample of the total dataset. Instead of using all the features, the RF uses a random selection of features to split each node: the best split is chosen among a randomly selected subset of Ntry input variables at each node. Each tree is then grown to the maximum size without pruning. In this way, Ntree CARTs are grown, and the final output is the average of the predictions of those trees.
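The squared-residuals split search can be sketched as an exhaustive scan over candidate (feature, threshold) pairs, keeping the pair that minimizes the weighted within-node variance of equation (1). This is an illustrative stand-alone sketch, not the randomForest implementation; in the RF only a random subset of Ntry features would be scanned at each node.

```python
# Sketch of the squared-residuals split criterion for one regression-tree
# node: try every (feature, threshold) pair and keep the one minimizing
# P_L * var(Y_L) + P_R * var(Y_R). Purely illustrative.

def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_split(X, y):
    """X: list of samples (lists of feature values); y: responses."""
    n, p = len(X), len(X[0])
    best = (None, None, float("inf"))  # (feature j, threshold s, score)
    for j in range(p):
        for s in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= s]
            right = [y[i] for i in range(n) if X[i][j] > s]
            if not left or not right:
                continue
            # Weighted within-node variance after the candidate split.
            score = ((len(left) / n) * variance(left)
                     + (len(right) / n) * variance(right))
            if score < best[2]:
                best = (j, s, score)
    return best

# A response that depends only on feature 0: the search should split on it.
X = [[1.0, 5.0], [2.0, 1.0], [8.0, 4.0], [9.0, 2.0]]
y = [0.0, 0.0, 10.0, 10.0]
j, s, score = best_split(X, y)
print(j, s, score)  # feature 0 separates y perfectly, so the score is 0
```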

3.1. Importance of Variables
The RF performs an inbuilt cross-validation in parallel with the training process by using out-of-bag (OOB) samples, that is, the samples not chosen during the bootstrap resampling. In the regression mode, the total learning error is obtained by averaging the prediction error of each individual tree over its OOB samples:

E_OOB = (1/n) Σ_{i=1}^{n} [Y_i − Ŷ(X_i)]²    (2)

where n is the total number of OOB samples, Ŷ(X_i) is the RF output corresponding to a given input sample X_i, and Y_i is the observed output.
The RF provides two methods of evaluating the importance of each variable [27]. The first evaluates the importance of a variable based on how much poorer the prediction becomes if that variable is permuted randomly. The prediction errors on the OOB samples of each tree (termed E_OOB1) are calculated during the training procedure. At the same time, each input variable in the OOB samples is permuted one at a time, and the modified datasets are also predicted by the tree (termed E_OOB2). At the end of the training procedure, the importance of each variable is obtained by averaging the differences between E_OOB1 and E_OOB2 and normalizing by their standard deviation. The second method is based on the node impurity criterion. As illustrated in (1), we can calculate how much a split decreases the node impurity. For regression trees, the decrease in node impurity can be calculated as the difference between the residual sum of squares (RSS) before and after the split. The importance of a variable can then be rapidly calculated by summing the decreases in node impurity for that variable over all trees [40].
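The permutation-based measure can be sketched as follows. A fixed linear rule stands in for a trained tree, and the dataset is treated as that tree's OOB set; both are simplifying assumptions of this sketch, not the study's setup.

```python
# Sketch of permutation importance: predict the OOB samples once as-is
# (E_OOB1), then with one input variable shuffled (E_OOB2); a large rise
# in error marks an important variable. A fixed linear rule stands in
# for a trained tree here -- purely illustrative.
import random

def mse(model, X, y):
    return sum((model(row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)

def permutation_importance(model, X, y, j, trials=200, seed=0):
    rng = random.Random(seed)
    e_oob1 = mse(model, X, y)                  # error on intact OOB samples
    diffs = []
    for _ in range(trials):
        col = [row[j] for row in X]
        rng.shuffle(col)                       # permute variable j only
        Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        diffs.append(mse(model, Xp, y) - e_oob1)
    return sum(diffs) / trials                 # mean rise in OOB error

# Stand-in "tree": responds to feature 0, ignores feature 1.
model = lambda row: 3.0 * row[0]
X = [[float(i), float(i % 3)] for i in range(20)]
y = [model(row) for row in X]
imp0 = permutation_importance(model, X, y, 0)
imp1 = permutation_importance(model, X, y, 1)
print(imp0 > imp1)  # permuting the used variable hurts far more -> True
```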

3.2. Model Implementation and Validation
The RF is utilized to simulate the nonlinear relationship between the NCEP predictors and the observed temperatures at the 61 stations. In this study, the RF algorithm is implemented in the R programming language with the package "randomForest," which has built-in functions to measure variable importance. As illustrated previously, Ntree and Ntry are two sensitive parameters in RF models. Ntry is set to the square root of the total number of variables [41], while Ntree influences the convergence of the RF and can be determined through the OOB error. For model development, the daily mean temperature series from the national standard stations and the large-scale atmospheric variables of the NCEP data are divided into two datasets: the first 31 years (1961-1991) are used to calibrate the regression model, while the remaining 14 years (1992-2005) are used to validate it. To test the performance of the proposed model, two data-driven models commonly used in statistical downscaling, ANN and SVM, are selected for comparison. The three-layer backpropagation artificial neural network (BP-ANN) and the LS-SVM are adopted, both of which have been applied successfully in downscaling temperature [12]. The MATLAB functions "newff" and "tunelssvm" are employed to obtain the model parameters, and multiple linear regression analysis is likewise implemented in MATLAB. The PAR [18] and PCA [19,20] methods are used in predictor selection for the MLR, ANN, and SVM models.
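The experimental setup described above can be sketched as below: the square-root rule for Ntry and the year-based calibration/validation split. The dates are generated synthetically; only the period boundaries come from the study.

```python
# Sketch of the setup: split the daily record into the 1961-1991
# calibration period and the 1992-2005 validation period, and set Ntry
# by the square-root rule for the 26 NCEP predictors.
import math
from datetime import date, timedelta

n_predictors = 26
ntry = int(math.sqrt(n_predictors))   # square-root rule -> 5 variables per split

start = date(1961, 1, 1)
days = [start + timedelta(days=k)
        for k in range((date(2005, 12, 31) - start).days + 1)]
calibration = [d for d in days if d.year <= 1991]
validation = [d for d in days if d.year >= 1992]
print(ntry, len({d.year for d in calibration}), len({d.year for d in validation}))
# -> 5 31 14
```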

3.3. Model Performance
Five statistical criteria are used to evaluate model performance: the Nash-Sutcliffe efficiency coefficient (Nash), the root-mean-square error (RMSE), the mean absolute error (MAE), the correlation coefficient (R), and the bias (Bias):

Nash = 1 − Σ (Y_obs,i − Y_sim,i)² / Σ (Y_obs,i − Ȳ_obs)²
RMSE = √[(1/n) Σ (Y_obs,i − Y_sim,i)²]
MAE = (1/n) Σ |Y_obs,i − Y_sim,i|
R = Σ (Y_obs,i − Ȳ_obs)(Y_sim,i − Ȳ_sim) / √[Σ (Y_obs,i − Ȳ_obs)² · Σ (Y_sim,i − Ȳ_sim)²]
Bias = (1/n) Σ (Y_sim,i − Y_obs,i)

where the sums run over i = 1, ..., n, with n the number of samples. Here, Y_obs is the vector of the observed predictands, Ȳ_obs is the mean of the observed predictands, Y_sim is the vector of the simulated predictands, and Ȳ_sim is the mean of the simulated predictands. In general, higher Nash and R indicate better model efficiency, whereas smaller values of RMSE, MAE, and Bias indicate higher accuracy in the model prediction.
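The five criteria named above (Nash, RMSE, MAE, R, and Bias) can be computed as in the sketch below; the sign convention for Bias (simulated minus observed) is an assumption of this sketch.

```python
# Sketch of the five evaluation criteria. The Bias sign convention
# (simulated minus observed) is an assumption, not stated in the study.
import math

def criteria(obs, sim):
    n = len(obs)
    mo, ms = sum(obs) / n, sum(sim) / n
    err = [s - o for o, s in zip(obs, sim)]
    nash = 1 - sum(e * e for e in err) / sum((o - mo) ** 2 for o in obs)
    rmse = math.sqrt(sum(e * e for e in err) / n)
    mae = sum(abs(e) for e in err) / n
    r = (sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
         / math.sqrt(sum((o - mo) ** 2 for o in obs)
                     * sum((s - ms) ** 2 for s in sim)))
    bias = sum(err) / n
    return nash, rmse, mae, r, bias

obs = [14.0, 18.0, 22.0, 20.0, 16.0]
print(criteria(obs, obs))  # a perfect simulation -> (1.0, 0.0, 0.0, 1.0, 0.0)
```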

4. Result Analysis and Discussion
4.1. The Choice of Predictors. One of the most important steps in the development of downscaling models is the choice of appropriate predictors [42]. The RF can evaluate the relative contribution of each predictor to the downscaling results by combining the RSS decreases over all trees, which makes choosing predictors for the RF convenient. Taking two of the stations as examples, Figure 4 shows the relative importance of the predictors for the predictands at Guangzhou and Nanning Stations, with the predictors numbered as in Table 1. For a comprehensive investigation of predictor importance across the Pearl River basin stations, the rank of the relative importance of each predictor at each station (the most important predictor is ranked 1) is calculated and shown in Figure 5. According to Figures 4 and 5, predictor 26, the mean temperature at 2 m height, is the most important predictor for all RF models at the 61 stations. Similarly, predictor 25, the surface specific humidity, ranks second in importance at all stations. In contrast, predictor 5, the surface vorticity, is the least important predictor at all stations. The ranks of the other predictors vary among the stations, which may be caused by their meteorological and geographical differences.
Because the OOB samples provide an unbiased estimate of RF model performance, the rationality of including each factor can be tested using the mean squared errors on the OOB samples (denoted E_OOB). The predictors of each station are screened using the ranks of relative importance evaluated in the previous step. The MSE of the OOB samples at each station is then calculated as predictors are added to the combination one by one and plotted in Figure 6.
The results show that E_OOB generally decreases, but at a diminishing rate, as more predictors are included. This indicates that the RF successfully avoids overfitting. However, the improvement in model performance from including more predictors is slight once the number of predictors exceeds a certain value, which is seven in this study. Since sufficient computational resources were available, all predictors were used in the RF modeling to downscale the temperature at these stations.

4.2. Comparative Study
The performance of the RF model is compared with that of the MLR, ANN, and SVM models. Two predictor selection methods, PAR and PCA, are applied for the three comparative models.
The partial correlations of the 26 predictors with the predictands are shown in Figure 7. The ranks of the predictors' partial correlations vary among the stations. In general, predictor 1, the mean sea level pressure, has the largest partial correlation at most stations, whereas predictors 5 and 8, the surface vorticity and 500 hPa airflow strength, have the smallest partial correlations at the majority of the stations.
The partial correlation coefficients are used to decide which variables are included in the input combination. Based on former studies [18,43,44], a combination of the seven predictors with the highest partial correlation coefficients is selected and applied in the MLR, ANN, and SVM models. These results are marked as MLR-par, ANN-par, and SVM-par.
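The idea behind this screening can be illustrated with the first-order partial correlation formula, which removes the influence of a single control variable; the full PAR controls for more variables, so this single-control sketch is a deliberate simplification with synthetic data.

```python
# Sketch of first-order partial correlation: the correlation between a
# predictor x and the predictand y after removing the influence of a
# third variable z. A simplification of the full PAR; data are synthetic.
import math

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = math.sqrt(sum((ai - ma) ** 2 for ai in a)
                    * sum((bi - mb) ** 2 for bi in b))
    return num / den

def partial_corr(x, y, z):
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# y = x + z with x orthogonal to z: once z is controlled for, x explains
# all of the remaining variation in y.
x = [1.0, -1.0, 1.0, -1.0]
z = [1.0, 1.0, -1.0, -1.0]
y = [xi + zi for xi, zi in zip(x, z)]
print(pearson(x, y), partial_corr(x, y, z))  # 0.707... rises to 1.0
```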
PCA, which has been commonly used by previous researchers, is also included in the comparative study. Before PCA, the predictors are standardized by subtracting the mean from the original values and dividing by the standard deviation of the original variables. The PCA method is then applied to the standardized NCEP predictor variables to extract orthogonal principal components (PCs). The retained PCs preserve more than 90% of the variance at each station. The PCs are then used in the MLR, ANN, and SVM modeling, and these results are marked as MLR-pca, ANN-pca, and SVM-pca.
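The standardization and variance-retention steps can be sketched for the two-predictor case, where the 2 × 2 covariance matrix has a closed-form eigendecomposition; the actual study applies PCA to all 26 predictors, and the data here are synthetic.

```python
# Sketch of the PCA preprocessing: standardize the predictors, then keep
# leading components until >= 90% of the variance is preserved. Two
# predictors are used so the 2 x 2 covariance matrix can be
# eigendecomposed in closed form. Data are synthetic.
import math

def standardize(col):
    n = len(col)
    m = sum(col) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in col) / n)
    return [(v - m) / sd for v in col]

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]   # nearly proportional to x1
z1, z2 = standardize(x1), standardize(x2)
n = len(z1)

# Covariance matrix of standardized data is [[1, c], [c, 1]].
c = sum(a * b for a, b in zip(z1, z2)) / n
# Its eigenvalues are 1 + |c| and 1 - |c|.
lam1, lam2 = 1 + abs(c), 1 - abs(c)
explained = lam1 / (lam1 + lam2)
print(explained >= 0.90)  # highly correlated predictors: one PC suffices
```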
The calibration and validation results of the RF and the comparative models are summarized in Table 2, and the average Nash values in the calibration and validation periods for all models are plotted in Figure 8.
The results show that the RF is superior to the comparative models in both the calibration and validation periods. The average values of Nash, RMSE, MAE, R, and Bias for the RF over the 61 stations are 0.98, 0.80, 0.58, 0.99, and 0.00 in the calibration period and 0.94, 1.46, 1.12, 0.97, and 0.21 in the validation period, respectively. All of these criteria are superior to those of the comparative models. Figure 8 also shows that the RF has higher precision and greater stability than the other models in the study area.
The results of the two predictor selection methods are also compared for the MLR, ANN, and SVM models; the PAR is superior to PCA in all three. For the ANN models, the average R values are similar in both the calibration and validation periods; however, the average Nash, RMSE, MAE, and Bias of the results using PAR are superior to those using PCA, with improvements of 0.01, 0.12, 0.08, and 0 in the calibration period and 0.02, 0.20, 0.15, and 0.28 in the validation period, respectively. For the SVM models, the increases in average Nash and R are 0.03 and 0.01 in the calibration period and 0.04 and 0.02 in the validation period, while the decreases in average RMSE, MAE, and Bias are 0.21, 0.16, and 0.01 in the calibration period and 0.33, 0.25, and 0.28 in the validation period, respectively. Similarly, the PAR is superior to PCA for the MLR on most of the criteria, with increases in Nash and R and decreases in RMSE and MAE.
The spatial distributions of model precision for the RF and the comparative models are shown in Figure 9, with Nash selected as the evaluation criterion. For the comparative results, the PAR was used in the MLR, ANN, and SVM modeling. The RF is superior to the comparative models at most of the individual stations. In addition, the precision of the RF at stations located in the plain region is higher than that in the mountainous regions, which indicates its limited applicability in complex terrain.

5. Summary and Conclusions
Statistical downscaling models are effective in bridging the mismatch between large-scale climate models and local-scale hydrological responses. In this study, a statistical model based on the random forest method was proposed and applied to the Pearl River basin. The objective was to investigate whether the RF approach could successfully simulate the complicated relationship between the predictors and predictands. The daily mean temperature observations from 61 stations in the Pearl River basin and the NCEP reanalysis daily data from 1961 to 2005 were selected in order to compare the results of the RF model with those of the MLR, ANN, and SVM models. The following summarizes the discussion points and conclusions derived from this analysis:
(1) The RF model successfully simulated the relationship between the predictors and predictands and performed better than the MLR, ANN, and SVM models. According to five statistical criteria, the RF showed the highest model efficiency in both the calibration and validation periods. In addition, the model efficiency of the RF increased continuously as more predictors were considered; in this study, all 26 NCEP predictors were included in the RF modeling. By taking full advantage of the information in the predictors while avoiding the influence of noise, the RF outperformed the other models.
(2) The built-in variable importance evaluation and the OOB samples in the RF made predictor selection convenient. The variable importance evaluation ranked the importance of each predictor for the prediction of the predictands, and the OOB samples gave an unbiased estimate of the model efficiency; both were helpful in the predictor selection. Although all predictors were considered in this study, these features will be most valuable when the RF is used in more complex downscaling problems, such as precipitation downscaling.
(3) The spatial distribution of model precision at the individual stations for the RF and the comparative models was also discussed. Although the RF was superior to the comparative models at most of the stations, there is still room for improvement in the prediction accuracy in mountainous areas.
As this study mainly discussed the development and comparison of temperature downscaling methods, projections of future climate in the Pearl River basin were not further explored. We will pursue further research on the application of the RF to additional meteorological elements, such as precipitation, which will help clarify the strengths and limitations of the RF method. Furthermore, surface parameters such as elevation, slope, and vegetation also influence the distribution of these meteorological elements, especially in complex terrain [45]; we will further explore including these parameters as model inputs.

Figure 2: Observed mean temperature and standard deviation at the meteorological stations.

Figure 3: Structure of a random forest model.

Figure 4: Relative significance of the predictors at Guangzhou Station and Nanning Station.

Figure 7: Rank of partial correlation of predictors in the 61 stations.

Figure 8: Average Nash in calibration and validation period for all models.

Figure 9: The spatial distribution of Nash for different models.

Table 2: Performance assessment for predictands in calibration and validation.