Statistical Learning-Based Spatial Downscaling Models for Precipitation Distribution

Climate Modeling Laboratory, School of Mathematics, Shandong University, Jinan 250100, China
MOE Key Laboratory of Environmental Change and Natural Disaster, Beijing Normal University, Beijing 100875, China
Wolfson College, Oxford University, Oxford OX2 6UD, UK
Institute of Biomedical and Environmental Science & Technology, University of Bedfordshire, Luton LU1 3JU, UK
School of Life Sciences, Shanxi University, Taiyuan 030006, China
University of Chittagong, Chittagong 4331, Bangladesh


Introduction
Global warming is significantly influencing the environment, hydrology, and ecosystems. Continued warming in the 21st century will significantly affect precipitation and monsoons and lead to the intensification of extreme rainstorm and drought events [1][2][3]. South Asia is a well-known summer monsoon region. The South Asian monsoon is mainly caused by the seasonal movement of the pressure and wind belts, together with the influence of thermal differences between land and ocean and of topographic factors. About 80% of precipitation in South Asia is closely linked with monsoons [4][5][6]. More than one billion people rely on monsoonal rainfall for agricultural production, hydroelectric generation, and other basic needs [7]. In particular, Bangladesh is located in one of the largest deltas in the world, with a dense network of main rivers and their tributaries, which makes it a flood-prone country. Because of its reliance on rain-fed agriculture, Bangladesh is extremely sensitive and highly vulnerable to climate change. Since Bangladesh has only a few sparsely distributed precipitation monitoring stations, it is very important to generate high spatial resolution precipitation data to mitigate climate change impacts. However, only very limited downscaling research has been carried out in Bangladesh so far. Observed precipitation data in Bangladesh were downscaled by using multilinear regression as the core part of the downscaling algorithm [8,9]. Since these studies ignored the nonlinear relation between the large-scale and small-scale dynamics, the downscaling accuracy obtained is unstable. Simulated precipitation data from ensemble climate models in the Coupled Model Intercomparison Project Phase 5 (CMIP5) were downscaled by using the method of model output statistics [10,11], but this method can only be applied to simulated climate data.
Generally, for any country with few and sparse precipitation monitoring stations, downscaling is the key technique for generating high spatial resolution precipitation data. Downscaling can be divided into dynamical downscaling and statistical downscaling. Dynamical downscaling mainly depends on the physical principles governing the climate system and on high-resolution regional climate models, while statistical downscaling is based on the statistical relation between local variables and large-scale variables [11]. Whereas dynamical downscaling relies on local-scale models or regional climate models, traditional statistical downscaling uses a multilinear regression model to establish the correlation between local variables and large-scale variables. Since Earth's climate is a complex, multidimensional, multiscale system with different physical processes acting on different temporal and spatial scales, such regression-based statistical downscaling cannot reveal the complex nonlinear relationships between local variables and large-scale variables [12,13].
Compared with traditional statistical techniques, advanced statistical learning techniques have shown excellent performance in solving problems with complex nonlinear correlations between variables [14]. Statistical learning techniques can map the predictor(s) to the predictand by relying only on the existing relationship between the two rather than on an explicit function [15]. The main statistical learning techniques include the following. (a) Support vector machine (SVM) uses a kernel function to map features to a high-dimensional space for classification and regression; its main advantage is that it can effectively solve small-sample, nonlinear, and high-dimensional regression problems. (b) Random forest (RF) is an ensemble learning method based on bagging, which handles classification and regression problems well. (c) Gradient boosting regressor (GBR) is an ensemble learning model based on boosting, which reduces the loss by fitting the residuals to obtain high prediction accuracy. Compared with other statistical learning techniques (e.g., neural networks), SVM requires only a small number of samples, and RF and GBR can limit overfitting [13]. In this study, based on SVM, RF, and GBR, we therefore propose a new downscaling approach to produce a finer spatial resolution precipitation map. To demonstrate the efficiency and accuracy of our models compared with traditional multilinear regression (MLR) downscaling models, we carry out a downscaling analysis of daily observed precipitation data from 34 monitoring sites in Bangladesh. Moreover, based on the resulting high spatial resolution precipitation distribution, we analyze the patterns and trends of Bangladesh's precipitation from 1989 to 2018.

Downscaling Methods
Based on three statistical learning algorithms, namely support vector machine (SVM), random forest (RF) regression, and gradient boosting regressor (GBR), we propose an efficient downscaling approach to produce high spatial resolution precipitation data, especially for countries with few and sparse precipitation monitoring stations.

Three Known Statistical Learning Algorithms.
Support vector machine (SVM) maps complex data features into a high-dimensional space by using a nonlinear mapping and separates the data with an optimal linear hyperplane [16][17][18]. Given n training data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the SVM seeks a regression function $f(x) = \langle \omega, \Phi(x) \rangle + b$ such that $f(x_i)$ deviates from the actual value $y_i$ by at most $\varepsilon$, where $\Phi$ is a kernel-induced function mapping the input data to a high-dimensional space, and the parameters $\omega$ and $b$ are the weight term and bias term, respectively. The basic algorithm for finding $f(x)$ is to minimize the regression risk

$$R_{\mathrm{reg}}(f) = C \sum_{i=1}^{n} \Gamma\big(f(x_i) - y_i\big) + \frac{1}{2}\|\omega\|^2,$$

where $\Gamma(\cdot)$ is a cost function, and the parameter $C$ balances the prediction error against the model complexity to avoid overfitting the training data.
Random forest (RF) uses bagging (bootstrap aggregation) and decorrelation to combine a series of small decision trees into a single procedure for better regression prediction [19]. RF overcomes the tendency of a single decision tree to overfit the training data and can handle data with a few missing values. Each new node in a decision tree of RF is generated by using one predictor from a randomly chosen subset of m predictors out of all available predictors. For each tree, the bootstrap resampling technique randomly selects k samples from the N original training samples as its training set, and the remaining N-k samples (i.e., out-of-bag samples) are used for validation. Each decision tree is thus trained with only m predictors and k training samples, and different decision trees are generated from different randomly chosen predictors and training samples. To reduce the variance of the predictions of individual decision trees, the RF prediction is the average of the predictions from all decision trees (the so-called aggregation procedure). The prediction accuracy and computing efficiency of RF models are mainly affected by the number of decision trees and by the number of predictors and training samples used in each decision tree [20].
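The bagging procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: each "tree" is reduced to a one-split stump (so the random subset of m predictors is drawn once per tree, which coincides with once per node), the bootstrap draws N indices with replacement so that k is the number of distinct in-bag samples, and the longitude/latitude/elevation values and the target are made up. The names `fit_stump`, `fit_forest`, `predict`, and `oob_rmse` are hypothetical.

```python
import math
import random
from statistics import mean


def fit_stump(X, y, features):
    """Fit a one-split regression tree using only the allowed feature indices."""
    best = None
    for j in features:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not right:  # threshold at the maximum value gives no split
                continue
            lm, rm = mean(left), mean(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no usable split: fall back to a constant prediction
        const = mean(y)
        return lambda row: const
    _, j, t, lm, rm = best
    return lambda row: lm if row[j] <= t else rm


def fit_forest(X, y, n_trees=25, m=2, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and m random predictors."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    trees, oob_sets = [], []
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        features = rng.sample(range(p), m)             # random predictor subset
        trees.append(fit_stump([X[i] for i in in_bag], [y[i] for i in in_bag], features))
        oob_sets.append(set(range(n)) - set(in_bag))   # out-of-bag samples
    return trees, oob_sets


def predict(trees, row):
    """Aggregate: the forest prediction is the average over all trees."""
    return mean(tree(row) for tree in trees)


def oob_rmse(trees, oob_sets, X, y):
    """RMSE where each sample is predicted only by trees that never saw it."""
    sq = []
    for i, (row, yi) in enumerate(zip(X, y)):
        preds = [tree(row) for tree, oob in zip(trees, oob_sets) if i in oob]
        if preds:
            sq.append((mean(preds) - yi) ** 2)
    if not sq:
        raise ValueError("no sample was out-of-bag for any tree")
    return math.sqrt(sum(sq) / len(sq))


# Made-up predictors (longitude, latitude, elevation) and a made-up target.
data_rng = random.Random(1)
X = [[data_rng.uniform(88.0, 92.7), data_rng.uniform(20.6, 26.6),
      data_rng.uniform(0.0, 100.0)] for _ in range(40)]
y = [5.0 + 2.0 * (lon - 88.0) + 0.02 * elev for lon, lat, elev in X]

trees, oob_sets = fit_forest(X, y)
print(predict(trees, X[0]), oob_rmse(trees, oob_sets, X, y))
```

Because every stump predicts a mean of training targets and the forest averages the stumps, the aggregated prediction always stays within the range of the observed targets, which is one reason averaging reduces variance.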
Gradient boosting regressor (GBR) is an ensemble regression tree model that starts from a simple regression tree and repeatedly adds new regression trees [21]. The GBR is a weighted sum of regression trees:

$$F(x) = \sum_{m=1}^{M} \gamma_m h_m(x),$$

where $h_m(x)$ is the m-th regression tree, added to boost prediction accuracy, $\gamma_m$ is its weight, and $M$ is the number of trees. The core procedure in GBR is to continuously reduce the loss by searching for the optimal parameters of each new regression tree so that it fits the negative gradient of the loss of the existing ensemble regression tree model, which for the squared error is its residual error. In detail, $F(x)$ in GBR is estimated through an iterative procedure by using the following formula:

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x).$$
During each iteration, a new regression tree $h_m(x)$ is constructed to minimize the residual error by using a gradient descent method. The output of GBR can achieve better generalization performance than a single regression tree [22]. The idea behind GBR is very different from that of RF. RF builds all regression trees in parallel, and its output is the average of the predictions from all decision trees, while GBR builds regression trees sequentially, and its output is the sum of the predictions from all regression trees.
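The sequential construction can be illustrated with a short Python sketch. It assumes a squared-error loss (so the negative gradient is simply the residual), one-split stumps as the regression trees, and a constant shrinkage factor `learning_rate` in place of individually optimized tree weights; these choices, the function names, and the toy data are illustrative assumptions rather than the configuration used in this study.

```python
from statistics import mean


def fit_stump(xs, rs):
    """Least-squares one-split stump on a single predictor."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the maximum so both sides are non-empty
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def fit_gbr(xs, ys, n_stages=50, learning_rate=0.1):
    """Build the ensemble stage by stage, each stump fitting the current residuals."""
    f0 = mean(ys)                    # initial constant model
    current = [f0] * len(xs)         # predictions of the ensemble so far
    stumps = []
    for _ in range(n_stages):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual.
        residuals = [y - f for y, f in zip(ys, current)]
        h = fit_stump(xs, residuals)
        stumps.append(h)
        current = [f + learning_rate * h(x) for f, x in zip(current, xs)]
    return lambda x: f0 + learning_rate * sum(h(x) for h in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy nonlinear data (made up for illustration).
xs = [float(i) for i in range(12)]
ys = [x ** 2 for x in xs]

one_stage = fit_gbr(xs, ys, n_stages=1)
many_stages = fit_gbr(xs, ys, n_stages=50)
print(mse(one_stage, xs, ys), mse(many_stages, xs, ys))
```

On the training data, each added stage lowers the squared error, in contrast to RF, where trees are built independently and averaged.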

Statistical Learning-Based Downscaling Technique.
The widely used statistical downscaling techniques are usually based on traditional multiple linear regression (MLR), which cannot effectively deal with the instability of downscaling time series or with collinearity between downscaling factors, and this significantly limits any improvement in downscaling performance. In this study, based on GBR, RF, and SVM, we propose an efficient downscaling method to produce high spatial resolution precipitation, where daily station-level precipitation data and longitude/latitude/altitude are used as the input of the GBR/SVM/RF models. The output is the downscaled precipitation product. Our downscaling models can largely make up for the deficiencies of the MLR downscaling approach.
To validate our downscaling method, and given that the available observed precipitation data set is small, we used 5-fold cross-validation [23] to avoid overfitting while using as much data as possible in model training. All data were divided into five subsets; each subset was used in turn as the test set while the remaining four subsets were used as the training set, and the average of the five test errors was taken as the result. The coefficient of determination (R²), mean absolute error (MAE), and root mean square error (RMSE) were used to assess the performance of the different downscaling models. To demonstrate the accuracy and efficiency of our models compared with traditional MLR downscaling models, we carried out a downscaling analysis of daily observed precipitation data from Bangladesh.
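The validation scheme and the three metrics can be written out as follows. This is a minimal sketch using only the Python standard library; the helper names (`k_fold_indices`, `cross_validate`, `fit_line`) are hypothetical, the shuffling seed is arbitrary, and a simple least-squares line on made-up data stands in for the downscaling models.

```python
import math
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, fit, k=5, seed=0):
    """Return the fold-averaged (R2, MAE, RMSE) measured on the held-out folds."""
    scores = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        y_true = [y[i] for i in test_idx]
        y_pred = [model(X[i]) for i in test_idx]
        scores.append((r2_score(y_true, y_pred), mae(y_true, y_pred), rmse(y_true, y_pred)))
    return tuple(sum(s[j] for s in scores) / k for j in range(3))


def fit_line(xs, ys):
    """Hypothetical stand-in model: ordinary least-squares line on one predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (v - my) for x, v in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)


# Made-up, exactly linear data, so the stand-in model should fit almost perfectly.
xs = [float(i) for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
print(cross_validate(xs, ys, fit_line))
```

Any model exposing the same `fit(X, y) -> predictor` shape could be passed to `cross_validate`, which keeps the comparison between models on identical folds.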

Study Area and Data
Bangladesh is located on the deltas of large rivers flowing from the Himalayas, so its topography is extremely flat (Figure 1). Traditionally, it is divided into seven regions (Figure 2). High humidity, warm temperature, and wide seasonal variability in precipitation are the main climate characteristics of Bangladesh. This climate is mainly caused by geographic location, the north-south continental atmospheric pressure gradient, and fluctuations in terrestrial and sea surface temperature [24]. Because of the very high precipitation in monsoon seasons and the flat, low delta plain with a dense river network (Figure 1), floods and related disasters take place frequently [25]. Since the economy is agriculture-based, a high spatial resolution precipitation map can play a key role in Bangladesh's flood control, drought resistance, and water resource management. Since there are few and sparse precipitation monitoring stations in Bangladesh, it is necessary to conduct a downscaling analysis of observed precipitation data in Bangladesh. To this end, the daily precipitation data in Bangladesh were obtained from 34 monitoring stations (Figure 3) of the Bangladesh Meteorological Department, and the longitude, latitude, and elevation data of Bangladesh were extracted from Google Earth [26]. Based on the statistical learning-based downscaling models in Section 2.2, we can produce high spatial resolution precipitation in Bangladesh.

Optimal Statistical Learning-Based Spatial Downscaling Models.
Based on daily precipitation data during 1989-2018 and longitude/latitude/elevation data in Bangladesh, we used our statistical learning-based downscaling models to produce higher spatial resolution precipitation data. Table 1 provides the validation results of our models and a traditional MLR downscaling model under 5-fold cross-validation. Our downscaling models outperformed the traditional MLR downscaling model. In terms of R², the downscaled data from GBR and RF showed good consistency with the original observation data. In the validation analysis, the GBR downscaling model produced the highest R² (0.98) and the lowest RMSE (9.63) and MAE (7.24). Figure 4 shows the correlation between the downscaled products and the observed precipitation. The GBR downscaling model yielded the highest performance, followed by RF, and the SVM downscaling model ranked last among the three. In terms of spatial distribution, our downscaling models were better than the traditional MLR model (Figure 5). The spatial distribution maps of downscaled precipitation produced by GBR and RF are in high agreement with observations. The downscaled precipitation produced by SVM revealed only coarse spatial distribution characteristics: the precipitation gradually increased from western to central regions.
In summary, when our downscaling models (GBR, RF, and SVM) were used to simulate the relationship between terrain variables and observed precipitation data in Bangladesh, the GBR downscaling model clearly performed best compared with the RF model, the SVM model, and the traditional MLR model.

Spatial Variation Analysis of Downscaled Precipitation over Bangladesh.
To analyze the seasonal variation of precipitation in Bangladesh, we used our GBR downscaling model to produce the mean seasonal precipitation distribution during 1989-2018 (Figure 5). Bangladesh has significantly high precipitation during the monsoon season and low precipitation during the remaining three seasons (Figure 6). In the winter season, precipitation is significantly lower and close to spatially uniform; in the premonsoon season, the highest precipitation occurs in the middle region; in the monsoon season, higher precipitation occurs in the southwestern and southeastern regions; in the postmonsoon season, the precipitation distribution is particularly uneven and has high spatial variability, with relatively dry conditions in the northwestern and central regions.
Using the precipitation downscaled by our GBR model, we identified differences among the seven regions of Bangladesh (Figure 7). The eastern region showed the highest fluctuation, followed by the southeastern region. The F-statistic in the analysis of variance (ANOVA) exceeds the critical value, showing that these regional differences are statistically significant.
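For reference, the one-way ANOVA F-statistic is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it with the Python standard library on three small made-up groups; the study itself compares seven regions, and the function name `one_way_anova_f` is hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)                                   # number of groups
    n = sum(len(g) for g in groups)                   # total number of values
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up values for three groups (illustration only).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f_stat, df_b, df_w)
```

The resulting F value is then compared with the tabulated critical value of the F distribution for the same two degrees of freedom at the chosen significance level.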
Based on the Mann-Kendall trend test and Sen's slope test (Table 2), the eastern, southwestern, southern, and southeastern regions showed upward trends during 1989-2018, but these trends were not significant. The remaining three regions showed downward trends, of which only one was statistically significant. Among all seven regions, the northern region showed the strongest downward trend (−13.38 mm/year), while the southeastern region showed the strongest upward trend (4.24 mm/year).
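Both trend statistics are simple to compute. The following sketch, using only the Python standard library, implements the basic Mann-Kendall test with a normal approximation and Sen's slope as the median of all pairwise slopes. It assumes equally spaced time steps and omits the tie correction to the variance, so it is an illustration rather than the exact procedure behind Table 2; the function names and the example series are made up.

```python
import math
from statistics import NormalDist, median


def mann_kendall(series):
    """Return (S, Z, two-sided p-value) for the Mann-Kendall trend test."""
    n = len(series)
    # S counts increasing pairs minus decreasing pairs over all i < j.
    s = sum((series[j] > series[i]) - (series[j] < series[i])
            for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0  # variance without tie correction
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return s, z, p_value


def sens_slope(series):
    """Median of the slopes between all pairs of points (unit time spacing)."""
    n = len(series)
    slopes = [(series[j] - series[i]) / (j - i)
              for i in range(n - 1) for j in range(i + 1, n)]
    return median(slopes)


# Made-up annual series with a steady increase of one unit per step.
annual = [float(v) for v in range(1, 11)]
s, z, p = mann_kendall(annual)
print(s, z, p, sens_slope(annual))
```

A positive S with a small p-value indicates a significant upward trend, and Sen's slope gives its magnitude per time step, which corresponds to mm/year for annual precipitation totals.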

Conclusions
For an agriculture-based country like Bangladesh, water resources contribute the most to agricultural planning. Precipitation plays a more important role in agricultural development than other climatic and environmental variables. It can influence flood disaster management, drought resistance, long-term planning of land and water resources, and different kinds of infrastructure. Therefore, producing high spatial resolution precipitation data is crucial for analyzing climate change impacts, especially for countries with few and sparse precipitation monitoring stations. Downscaling is an effective technique for solving this issue. The widely used statistical downscaling techniques are usually based on traditional MLR, which cannot effectively deal with the instability of downscaling time series or with collinearity between downscaling factors, and this significantly limits any improvement in downscaling performance. In this study, based on GBR, RF, and SVM, we propose an efficient downscaling approach to produce high spatial resolution precipitation from daily station-level precipitation data and longitude/latitude/altitude data. To demonstrate the efficiency and accuracy of our models compared with traditional MLR downscaling models, we carried out a downscaling analysis of daily observed precipitation data from 34 monitoring sites in Bangladesh. Our downscaling models have clear advantages over traditional multilinear regression (MLR) downscaling models. The GBR-based downscaling model had the best performance among all four downscaling models. Therefore, we suggest that GBR-based downscaling models be used in place of traditional MLR downscaling models to produce more accurate high-resolution precipitation maps for mitigating the impacts of climate disasters, especially in South Asian countries with few and sparse precipitation monitoring stations.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Disclosure
Yichen Wu and Zhihua Zhang are the co-first authors.

Conflicts of Interest
The authors declare that they have no conflicts of interest.