A Climate Downscaling Deep Learning Model considering the Multiscale Spatial Correlations and Chaos of Meteorological Events

Climate downscaling is a way to provide finer resolution data at local scales, which has been widely used in meteorological research. -e twomain approaches for climate downscaling are dynamical and statistical.-e traditional dynamical downscaling methods are quite timeand resource-consuming based on general circulation models (GCMs). Recently, more and more researchers construct a statistical deep learning model for climate downscaling motivated by the single-image superresolution (SISR) process in computer vision (CV). -is is an approach that uses historical climate observations to learn a low-resolution to high-resolution mapping and produces great enhancements in terms of efficiency and effectiveness. -erefore, it has provided an appreciable new insight and successful downscaling solution to multiple climate phenomena. However, most existing models only make a simple analogy between climate downscaling and SISR and ignore the underlying dynamical mechanisms, which leads to the overaveraged downscaling results lacking crucial physical details. In this paper, we incorporate the a priori meteorological knowledge into a deep learning formalization for climate downscaling. More specifically, we consider the multiscale spatial correlations and the chaos in multiple climate events. Depending on two characteristics, we build up a two-stage deep learning model containing a stepwise reconstruction process and ensemble inference, which is named climate downscaling network (CDN). It can extract more local/ remote spatial dependencies and provide more comprehensive captures of extreme conditions. We evaluate our model based on two datasets: climate science dataset (CSD) and benchmark image dataset (BID). -e results demonstrate that our model shows the effectiveness and superiority in downscaling daily precipitation data from 2.5 degrees to 0.5 degrees over Asia and Europe. In addition, our model exhibits better performance than the other traditional approaches and state-of-the-art deep learning models.


Introduction
Climate variations are influencing society's well-being all over the world, such as global warming, extreme storm, precipitation, and sea-level rising [1][2][3]. A lot of industrial and agricultural products have been severely affected by these climate changes with increasing intensity, duration, and frequency [4]. In recent years, climate researchers are looking for more reliable climate simulation methods, which can provide a more detailed evolution process of meteorological events, for more effective disaster mitigation and decision-making [5].
General circulation models (GCMs) are the traditional tools for simulating and investigating the various climate events [6,7]. ey are usually developed on the basis of complicated physical dynamical principles and executed on large-scale supercomputers. ese GCMs usually take a large number of physical variables into account and model the entire climate evolution processes into specific differential equations. ese variables are treated as initial fields and have specific spatial resolutions, respectively, so that the meteorological researchers can investigate and forecast the potential trends that rely on these physical patterns with various resolutions under model integration. However, when simulating and studying the meteorological evolutions in small-scale areas, one may encounter the issues of too coarse spatial resolutions in GCMs, which limit revealing critical detailed physical processes, such as precipitations in key river basins and extreme typhoons. is severely hinders the assessment of the impact of climate events on these relevant key regions [8].
In this situation, climate downscaling has always been the hotspot in meteorological research, which aims to reconstruct more detailed and finer resolution meteorological data for small/medium scale regions. Technically, climate downscaling is used to enlarge the resolutions and elaborate the detailed contents for physical variables in climate events.
ere are two classic approaches to solve this problem: dynamical downscaling and statistical downscaling. Dynamical downscaling accounts for developing high-resolution GCMs or regional climate numerical models (RCMs) to simulate local meteorological processes. It usually includes embedding subgrid of key regions into GCM or using the output of GCM as the boundary conditions in RCMs for high-resolution projections. is is an advantage that dynamical downscaling has clear physical causalities and usually covers any regions even the entire Earth. However, limited to construction principles of GCMs or RCMs, developing and integrating the dynamical downscaling may need a large amount of calculation and time resources; in addition, modifying and configuring GCMs into a certain resolution are not convenient.
Statistical downscaling is a data-driven method and learns the projection relationship from low-resolution (LR) to high-resolution (HR) climate patterns via historical meteorological big data. A large number of existing studies have suggested that statistical downscaling has more efficient computation and almost equally or even more accurate downscaling results [9,10], which is a reliable alternative. erefore, climate researchers are paying more attention to it recently.
Statistical downscaling has been implemented by a variety of numerical calculation methods in wide scope; one of the most important ideas is to use regression to simulate the mapping projection including linear and nonlinear ways. Traditional linear regression makes preliminary progress in terms of downscaling of sea temperature and precipitation in early stage [11]. Subsequently, many studies focus on the nonlinear evolution in climate events and extend linear to nonlinear (such as artificial neural network (ANN)) approach based on machine learning to reduce downscaling errors. ese nonlinear methods, including quantile regression neural networks [12], Bayesian model averaging [13], and expanded downscaling [14], can capture a more general relationship between LR and HR, so they achieve better results. e above-mentioned statistical methods are efficient in climate downscaling. However, limited to simplified formalization in mathematics, the existing statistical downscaling can only learn the most superficial correlations contained in meteorological data but overlooks potential physical details, which causes the unreasonable downscaling results, such as overaveraging results or even describing the extreme events' failures [15]. In meteorological research, the extreme behaviors in the Earth system should be paid more attention to than normal events [1,8], such as typhoons, unusual floods, and strong El Niño events [16], because extreme events may cause more severe damages. erefore, there is a special requirement for climate downscaling: reconstructing more accurate climate data for reflecting the truly detailed scenarios, instead of simple interpolation.
With the development of computing equipment, deep learning methods, as the extension of ANN, provide an appreciable new insight in many frontier areas, which inspire climate researchers to explore suitable alternatives in the field of deep learning computer vision (CV) and effectively transfer these ideas to climate downscaling. More specifically, single-image superresolution (SISR) problems in CV are to raise the image resolution for providing clearer and more comfortable pictures for human eyes. In terms of such purpose, SISR problem is quite similar to climate downscaling: they both learn a projection relationship between low-resolution (LR) and high-resolution (HR) data. SR is based on image data; climate downscaling relies on climate observed or simulated data. But the meteorological data has a natural property; that is, they are usually available on grid points. erefore, we can make an analogy between SR and climate downscaling: treating the grid points in a climate data as pixels in an image and then modeling climate downscaling as enlarging resolution in SR via deep learning CV techniques [15].
Recently, a large number of established benchmark image databases and SR competitions have significantly promoted the development of SR problem in CV, so that we can absorb these successful experiences and apply them to climate downscaling. e first successful deep learning model was established by Dong et al. [17], who proposed a three-layer SRCNN and achieved superior performance than conventional methods. en the SISR models are constructed larger scale, deeper and deeper by introducing residual learning [18] and dense connection [19], such as 20 layers in VDSR [20] and DCRN [21]. In addition, more and more CNN-based models improved the pioneering works and achieved better results, such as applying subpixel convolution [22], generative adversarial network [23], and Charbonnier penalty function [24] in the model structures to enhance model expression and ease model training. In general, deep learning methods are relatively easier to construct than dynamical downscaling and they can make up the shortcomings of execution time and computing resources. Furthermore, many latest deep learning technologies are also convenient to use in SR, which are more and more conducive to improving performance, such as dense connection and residual learning. ese studies imply that deep learning can be a reliable tool in climate investigation, and have great potential in climate downscaling.
However, the relationship between climate downscaling and SR is not exactly the simple analogy. SR requires the modulated images to improve viewing comfort, including higher resolutions, adaptive contrast, and edges. On the contrary, climate downscaling requires absolutely true physical details and dynamic features that cannot be described well at low resolution, which may even generate "extra" spatial information. To tackle this challenge, we can absorb some general guidelines in the SR model and then combine them with specific concepts of meteorological knowledge to build up a more comprehensive formalization for climate downscaling problem, which can increase the model interpretability and improve downscaling skills.
In this paper, we incorporate the a priori meteorological knowledge into the deep learning modeling process for climate downscaling. e multiscale spatial correlations and the chaos are the two key concepts in multiple climate events. We take both into account and novelly formulate climate downscaling as a two-stage process. In the first stage, the same LR climate data will be reconstructed to multiple same-size HR data considering multiple spatial scale correlations, which will generate many HR candidates. In the second stage, we will evaluate the reconstruction uncertainties of these HR candidates and then infer the optimal probability distribution of the results, which can provide the optimal mean and corresponding variance at the same time. Based on such formalization, we further implement a more detailed model named climate downscaling network (CDN), which implements the above-mentioned two stages.
To the best of our knowledge, it is the first time to integrate meteorological knowledge into deep learning in climate downscaling. We train and evaluate our proposed model on the real-world climate science dataset (CSD) and benchmark image dataset (BID), respectively, on the metrics of SSIM and PSNR. We carry out comparative experiments to determine a better model structure which can make a trade-off between performance and speed, and then we design several different magnification experiments with our two datasets and conduct extensive experiments to evaluate the performance of our model. From CSD experiments, our model shows the effectiveness and superiority in downscaling daily precipitation data from 2.5 degrees to 0.5 degrees over Asia and Europe. In addition, we also discover that what our model learns is not a fixed resolution mapping from LR to HR but the enlarged size, which means that our model is not limited by the size of input or output and has good extensibility for transfer learning [25]. As for BID experiments, our model has excellent performance, which outperforms than the other traditional approaches and stateof-the-art models, especially at the small magnifications.

Formalization.
Climate events can be treated as very complicated hydrodynamic phenomena, where different coupled physical variables develop under deterministic dynamic mechanisms. Climate downscaling is to enhance the resolution of a certain physical variable in the same area and provide more detailed forecast results. In other words, climate downscaling is to supplement the data on the finer grid points by extracting and learning the underlying physical laws of the original data.
Spatial correlation plays an important role in climate downscaling, which is an important internal characteristic. Increasing evidence suggests the remote correlations produce climate variabilities. Rong et al. [26] found that the positive sea surface temperature anomalies (SSTA) in the northern tropical Atlantic (NTA) can trigger the eastward Kelvin wave which induces low-level anticyclone over the western North Pacific during boreal summer. Rueda et al. [27] found that the waves have great contributions to extreme sea-level fluctuations, for both local and remote windstorms. In general, climate events in the Earth system are not isolated; because multiples atmospheric circulations or ocean waves building up a bridge span the ocean (or land) and atmosphere, the coupled interactions connect all climate variables spatially together regardless of adjacency correlations or remote correlations.
However, such spatial correlations are not fully utilized in the climate downscaling problem. Traditional regression approaches and CNN filters usually extract local information in continuous spaces to interpolate extra grid points, which is confined to adjacency neighborhood, and tend to average the downscaling results; the remote meteorological spatial correlations are usually ignored, which makes the results lacking credible dynamic causality and interpretation.
From the perspective of climate dynamics, the interpolated details are mainly determined by local scales information or the so-called adjacency correlations; the large-scale information may also contain the potential trends, such as remote waves or wind forcing. So if multiscale correlations can be taken into account in the downscaling model, the performance will be greatly improved. In addition, the chaos of the dynamic system is inevitable; meteorological researchers are paying more attention to describing the uncertainties in climate forecasting, leading to a turning from determinant single-value predictions to probabilistic predictions, such as ensemble prediction [28,29]. erefore, we will take good advantage of probabilistic predictions and provide more comprehensive downscaling.
On the basis of these two pieces of a priori meteorological knowledge, we formulate a 2-stage process for climate downscaling. e first stage (stage 1) is to reconstruct HR climate data considering different multiscale spatial correlations and obtain several HR candidates. e second stage (stage 2) is to infer the optimal probabilistic distributions for downscaling results via the obtained candidates from stage 1, providing the optimal mean and corresponding variance at the same time (see Figure 1).

Stage 1.
In general, small-scale spatial correlation can reflect more adjacency information; large-scale spatial correlation contains more remote interactions. erefore, to cover as many different spatial correlations as possible, we design a structure containing N path to reconstruct HR data, where every path extracts and learns different scale spatial correlations. Finally, we can obtain N HR candidates.
In every reconstruction process of these N paths, we propose the stepwise reconstruction process to gradually enlarge the climate data. Classic methods usually use the upsampling or bicubic interpolation to enlarge the data by integer multiples directly, but these operations are done in one step and always limited in integer multiples, such as from an original size 40 × 60 to a required size 80 × 120 (2× magnification), which may fill too much content at one time and cause too large space skips. In fact, climate downscaling may have a more flexible magnification size according to requirement, which is usually not an integer-multiple amplification, such as from an original size 40 × 60 to a required size 50 × 80, in which aspect traditional methods have limitations. erefore, in every reconstruction path, we divide the entire process into gradually enlarged subprocesses, which can fully extract the spatial correlations of different scales.
More specifically, suppose the original climate data size is i × j and the required data size is i ′ × j ′ in a 2D downscaling scenario (channel size is 1 for expression simplification): where Δh and Δw are the resolution increments, respectively, for each edge of original climate data. According to equation (1), we divide Δh and Δw into M steps and enlarge the data size with (n h (m) , n w (m) ) in each step as depicted in n h where y (m) and y (m+1) are the input and output data of each step, respectively, and g is the magnification operator related to enlarged sizes (n h (m) , n w (m) ). For example, the input size for y (m) is i (m) × j (m) and the output size for y (m+1) is (i (m) + n h (m) ) × (j (m) + n w (m) ). is process is described in Figure 2, where (a) and (e), (b) and (f ), and (c) and (g) show the processes for M � 1, 2, 3, respectively; (d) represents the process of dividing M into infinite magnifications, which is similar to calculus. We design a rule to deduce the enlarged sizes (n h (m) , n w (m) ) to guarantee them as equal as possible during these M steps as in equation (3). In such a process, ⌊Δh/M⌋ or ⌈Δh/M⌉ represents every enlarged size n h (m) and the same for n w (m) .
For a certain climate downscaling problem, Δh and Δw are fixed values; the magnification factors (n h (m) , n w (m) ) and step size M are negatively related to each other; for example, when M is very small, (n h (m) , n w (m) ) are relatively large, but when M is very large, (n h (m) , n w (m) ) are relatively small. In addition, the smaller (n h (m) , n w (m) ) means the size change of climate data is not obvious, which leads the climate data to tend to project more on to the adjacency spatial correlations and learn more local information to fill the grid point of expanded resolution and vice versa. As we use N different paths to reconstruct the HR data and set as many different decomposed steps M as possible in every path where (n h (m) , n w (m) ) are totally different, we can fully demonstrate the adjacency and remote spatial interactions in the whole process. Figure 3 displays the detailed stepwise reconstruction structure in each path.

Stage 2.
In meteorological research, the forecasts are more willing to be expressed in the form of a probability distribution. On the one hand, this is a more rigorous choice to measure the uncertainties of forecast results. On the other hand, climate events often imply potential extreme conditions, which cannot be ignored. erefore, a large number of studies use ensemble forecasts to describe climate evolutions.
Taking the ensemble idea into account, we build up a probabilistic distribution for obtained downscaling candidates from stage 1 as where y and σ 2 are the optimal mean and variance for the downscaling results, [y 1 , y 2 , . . . , y N ] represents the N HR candidates from stage 1, and F represents the downscaling ensemble system. Data-driven methods, especially deep learning, show excellent performance in discovering the intricate nonlinear characteristics. erefore, we construct a detailed deep learning architecture to approximate g and F depicted in the following sections. is novel architecture is composed of two corresponding stages. e first stage is to reconstruct downscaling data from the perspective of multiscale spatial correlation, and the second stage is to approximate the mean and variance for the obtained candidates and provide the optimal result.

Data Collection and Application Scenario.
We use a deep learning network to implement the magnification operator g and downscaling ensemble system F. When the deep learning network structure is determined, the training dataset quality directly affects the downscaling performance. From the perspective of meteorology, multiresolution datasets are easily available. One of the ways is to collect realworld observation data on the same area and time period with different resolutions as a training set; another is to use simulation data by adjusting the resolution setting of a certain numerical climate model. No matter in which way, the LR climate data can be treated as the input of the deep learning model, and the corresponding HR climate data can be applied to supervised learning.
However, one may also encounter the situation that there is a single source for a certain physical variable, which only provides one fixed resolution climate data. It is difficult to obtain adequate data. In this situation, we can draw on the idea of SISR problem to augment the dataset in three ways.
(1) Scaling: randomly downscale the original climate data between [0.5, 1]. (2) Rotation: randomly rotate image by 90°, 180°, or 270°. (3) Flipping: flip images horizontally or vertically with a probability of 0.5. e expended dataset can be used as HR data, then generating the LR training set using the bicubic downsampling method. It is worth noting that what the model actually learns via such LR-HR data pair is more similar to the inverse process of the downsampling kernel (such as the inverse of the bicubic kernel) [30].
erefore, it is a better choice to use the real-world or simulated LR-HR climate data to train the network. Furthermore, our model is very expandable. What the model learns is not a fixed resolution mapping from LR to HR, but the enlarged size, which can be applied to any

Original size
Stepwise reconstruction subprocesses Downscaling size

Climate downscaling process
Stepwise reconstruction subprocesses

Methodology
Based on the proposed architecture, we design a more detailed network for providing improved climate downscaling, named climate downscaling network (CDN). e structures and intentions of the two-stage structure are interpreted in the following sections.

stage 1:
Stepwise Reconstruction Process. According to our proposed formalization, we build up a deep learning model which contains N different paths, and every path enlarges the LR data in M steps gradually. It will reconstruct and generate N HR candidates. e network structure is designed as in Figure 4. e input is the original climate data or climate data after some preprocess operations, such as min-max scale, standardization, and normalization. After the climate data passed into network, it is assigned to N different paths, respectively. In each path, the original climate data will be expended in M steps gradually to the required size, which is called the stepwise reconstruction process as stair-like green blocks in Figure 4.
In this structure, the path number N and corresponding step number M are crucial for model structure and downscaling performance, which will influence the change sizes (n h (m) , n w (m) ) directly. Suppose the climate data sizes for HR and LR are i ′ × j ′ and i × j, respectively; we apply three effective strategies for determining the path number N and step number M in every path. ① Nature-strategy (NS): the step size in the first path is 1, and the step size between adjacent paths is increased by 1 from 1 to min((i ′ − i), (j ′ − j)). ② Even (Odd)-strategy (ES or OS): the step size in the first path is 1 (2), and the step size between adjacent levels is increased by 2 from 1 (2) to min((i ′ − i), (j ′ − j)). ③ Prime-strategy (PS): choose all prime numbers between 3 and min((i ′ − i), (j ′ − j)) as the step size for every path. An example is shown in Figure 5. In the following experiments, we will verify the pros and cons of these three strategies. e most import module in the stepwise reconstruction process is the green block in Figure 4, which is used to enlarge the feature maps by (n h (m) , n w (m) ). As our proposed formalization, change size (n h (m) , n w (m) ) may be small. erefore, we propose a special dense block to configurablely enlarge the size of feature map by (n h (m) , n w (m) ) described in Figure 6. In this block, the input feature map is firstly expended by c convolution layers as χ for feature extraction; then the channel-expended feature map χ is connected to a k-stair-like structure, which has the following characteristics: (1) Every stair contains one convolution layer and one final dilated transformed convolution layer. (2) e first stair uses feature map χ as input. e other stairs concatenate the last stair's input and the output of the convolutional layer and then use it as the input of this layer.
After all stairs enlarge the feature maps' size by the dilated transformed convolution layer, we concatenate the output from every stair together and shrink the channel size to the original input channel size by continuous d convolution layers. e example shown in Figure 6 exhibits one structure with c � 1, d � 3, and k � 3 stairs.
Such dense block fully consists of convolutional layers and can be considered to contain 3 different effects: feature extraction, mapping, and shrinking. e first c convolution layers represent the feature extraction.
e k-stair-like structure represents the mapping, which can discover the mapper between coarse resolution (LR) and fine resolution (HR). e final d convolution layers represent the shrinking, which can compress and filter the redundant features after multiple mappings. We design such structure by dense connection, where the feature representations at lower levels are shared and connected directly with higher levels and thus can increase the nonlinearity of the network to learn complex mappings at the finer levels.
In this dense block, the key point is the dilated transformed convolution layer, which is used to expand the size of feature maps efficiently in each stair. We improve the original dilated transformed convolution layer according to (n h (m) , n w (m) ). For example, when expanding data size i (m) × j (m) to (i (m) + n h (m) ) × (j (m) + n w (m) ), the contributions of (n h (m) , n w (m) ) are twofold: firstly, (n h (m) , n w (m) ) are used to define the padding size around original feature maps, which means padding zeros around the feature maps to make the size expand as required. e left-up-right-bottom padding sizes are n h (m) /2 , n w (m) /2 , n h (m) /2 , n w (m) /2 , respectively, shown as dotted rectangles around the light blue rectangles in Figure 7; secondly, (n h (m) , n w (m) ) control the dilated rate (dr) for the kernel of dilated transformed convolution layer, where the relationship between dr and (n h (m) , n w (m) ) is described in the following equation: e difference between dilated and common convolution lies in that the dilated rate dr means inserting (dr − 1) zeros between two elements in convolution kernel, as grey rectangles inserted into the dark blue kernel in Figure 7(b). Because the convolution kernel is a skipped structure, the dilated kernels are concerned not only with continuous spatial information (when dr � 2) but also with remote grid points with discontinuous remote spatial correlations (when dr > 2).
is is also related to the receptive field in deep  Figure 4: Our proposed two-stage deep learning model. e upper row represents the stepwise reconstruction model, the input is the original data, and the output is the HR candidates' collection set. e lower row represents the ensemble inference model, which is a symmetrical encoder-decoder structure, the input is the HR candidates set, and the output is the final downscaling result.

Strategy Path number
Step size Data size change  learning, which is determined by the real size of convolution kernel. e smaller dr is, the smaller the convolution kernel is, the more the network pays attention to the adjacency information, and vice versa. In our network architecture, we reconstruct the HR data into multiple paths, and we divide the entire enlarge process into various steps, so that the network covers as many multiscale spatial correlations as possible.

stage 2: Ensemble Inference for Uncertainty Estimation.
After obtaining N reconstruction candidates, we will estimate the optimal probabilistic distribution. Mathematically, when solving the optimal downscaling result y * for a new observed sample x * , it allows us to infer the probabilistic distribution p(y * | x * ) via Bayesian formula as where D is the original dataset and θ is the optimal trainable parameters in the downscaling system. However, the posterior distribution p(θ | D) usually cannot be evaluated analytically. Variational inference is often used to approximate it, which uses a relatively simple distribution q(θ) to simulate p(θ | D) by minimizing the Kullback-Leibler (KL) divergence in equation (7). is process is shown in Figure 8.
Dropout variational inference is one of the useful techniques to approximate the posterior distribution. As proved by Gal et al. [31], the posterior distribution p(θ | D) is approximately equivalent to the dropout mask distribution q ′ (θ), which can be viewed as a Bernoulli distribution related to dropout rate z. At the meanwhile, the results can be described by sampling T times in testing, which is a stochastic process. erefore, equation (6) can be regarded as follows: Via equation (8), we can estimate the approximate mean and variance for the downscaling results. As for stage 2, we design an encoder-decoder structure to implement it as shown in Figure 4 (lower row). is structure is also fully convolutional. e input of this network is the concatenation of N candidates obtained from stage 1. During the feature maps propagation forward in this network, the width and height are never changed; we only use the convolution filters to expand and reduce the channel size. e encoder and decoder are designed to be symmetrical. In the encoder, the channel size is gradually increased, such that we use [16,32,64] filters in Figure 4 (lower row). On the contrary, the channel size is gradually decreased in the decoder, such as [64, 32,16]. In encoder and decoder, all layers are adopted dense connection, where the input of every layer is the outputs' concatenation of the previous layers. Such structure reuses the extracted features and strengthens the feature propagation.
Furthermore, a trainable spatial dropout layer [32] is applied to connect encoder and decoder, which is to drop out some redundant information points spatially. is layer provides uncertainties for the whole network and allows the approximation of p(y * | x * , D) from the stochastic dropout sampling. On the other hand, we can take advantage of all our training dataset, because the vast majority of samples are neural events which contain little extreme information for the regression; determinant single-value traditional methods tend to be overaveraged downscaling naturally (neutralization), but the probability distribution is convenient to describe the true situation.

Loss Function and Metrics.
Our goal is to approximate the stepwise magnification operator g and the downscaling ensemble system F for generating an HR data y, which is close to the ground truth y. According to our formalization, we design two different loss functions for these two different stages, respectively, and combine them together, which can ensure the spatial consistency of reconstructed HR data.
For stage 1, we use N paths to reconstruct N different HR candidates by stepwise subprocesses. erefore, we Mathematical Problems in Engineering optimize every candidate y n in every path by ℓ 2 norm, which is described in equation (9): For stage 2, we combine ℓ 2 norm and ℓ 1 norm together, where ℓ 2 norm is to measure the smoothness of the reconstructed HR data and ℓ 1 norm is to measure the peak distribution of the reconstructed HR data, described in equation e total loss is as follows: After training the model, we evaluate the reconstructed HR data in a validation set with two commonly used quality metrics: PSNR and SSIM [33], shown as equations PSNR � 10 · log 10 MAX 2 MSE , In PSNR, MAX is the maximum value among grid points or pixels, such as MAX � 255 in image evaluation (the pixel range is [0, 255] in an image). In SSIM, l, c, and s represent bright, contrast, and structure comparison, respectively. μ y and μ y are the averages of all grid point in y and y, σ y and σ y are the standard deviations of all grid point in y and y, and σ yy is the covariance between y and y. We set α � β � c � 1, and c 1 , c 2 , and c 3 are all small constants which are to prevent denominator from 0.

Climate Science Dataset and Benchmark Image Dataset.
For fully measuring the performance of our proposed model, we collect two types of datasets and carry out extensive experiments on them, respectively: ① Climate science dataset (CSD): this dataset is composed of real-world observed daily precipitation data with different 3 resolutions covering the globe [34,35], which are 0.5°, 1°, and 2.5°, respectively (0.5°only for Europe). ② Benchmark image dataset (BID): this dataset is composed of several different datasets. e training and validation set is from DIV2K [36], containing 800 images for training and 100 images for validation, respectively. e testing set is from 5 datasets: SET5 [37], SET14 [38], BSDS100 [39], URBAN100 [40], and MANGA109 [41]. Among these datasets, SET5, SET14, and BSDS100 consist of natural scenes; URBAN100 contains challenging urban scenes images with details in different frequency bands; and MANGA109 is a dataset of Japanese manga.
For CSD, we crop the daily precipitation data covering Europe and Asia from original data, respectively; the resolution and size are shown in Table 1. Europe covers (150°W to 65°E, 30°N to 80°N), and Asia covers (70°E to 140°E, 0°to 60°N). We make pairs of these data at the same area under different resolutions and treat such pairs as an individual subdataset for training and testing so that we obtain 4 different subsets: 2.5°⟶ 1°Europe, 2.5°⟶ 0.5°Europe, 1°⟶ 0.5°Europe, and 2.5°⟶ 1°A sia. In every subset, we select 85% randomly as the training set, 5% as the validation set, and 10% as the testing set. is dataset is with one channel and preprocessed to [0, 1] by max-min scale as where x is the original data, x norm is the scaled data, and x max and x min are the maximum and minimum value in original data x. In addition, in the following experiments, we choose lower resolution data as input and higher resolution data in the same area as output of the model. e PSNR and SSIM indices are calculated for the reconstructed HR data directly, where MAX � 1 for PSNR because the data is scaled to [0, 1]. For BID, we design experiments with 2× and 3× magnifications. Due to the characteristics of climate downscaling, meteorological researchers tend to choose climate data with a resolution close to the requirement for downscaling, so the magnifications of climate data are often not We run our model on a GPU server (Intel i7 8700k and dual NVIDIA GTX 1080Tis). e experiments are divided into two parts. Firstly, we investigate the optimal model structure by determining the hyperparameters. Secondly, we measure and investigate the performance by conducting comparison experiments on the above-mentioned two datasets, respectively.

Model Analysis.
e structure of a deep learning model affects the downscaling performance directly. Our proposed model has many adjustable modules, and we design a lot of comparative experiments to find the optimal combination.

Network Structure.
To determine the path number N and corresponding step size M in each path, we evaluate our model with three proposed strategies. We train and test our model on BID with 2× magnifications, where we set c � 1, d � 3, and k � 3 in the dense block of stage 1 and set encoder (decoder) depth of stage 2 at 3. Under such settings, we dynamically adjust the strategies to evaluate the model performance. e results are summarized in Table 2. From the results, the PS has obvious better performance than others. Specifically, when building the whole model with NS, the results will obtain the largest downscaling errors. It may be due to the fact that the enlargement process is too finely divided, even containing overmuch continuous (+1, +1) in many paths.
is will lead to an overly dense network structure, so that too many redundant dense blocks may exist in this process, which leads to overfitting. In addition, the OS or ES also exhibits a little bit overfitting.
is demonstrates that the subprocesses should not be divided too much, and it will also lead to an increase in the number of trainable parameters and extend the training time.
On the other hand, the division of subprocesses can be designed adaptively, such as Tolerance2-Strategy (T2S) or Tolerance3-Strategy (T3S), where the difference of step size between adjacent paths is 2 or 3. e results are also described in Table 2. e T2S has downscaling results similar to PS, which outperforms than T3S. T3S divides sparser paths in the entire model, which may lead to some big skips in the dilated rate and discontinuous spatial correlations.
is implies that the model is unable to learn enough detailed features with sparse division of stepwise reconstruction subprocesses.

Network
Depth. In our proposed dense block, by reducing the stair depth k to 1 in our proposed model, the block falls back to a network structure similar to FSRCNN [42]. On the contrary, when increasing the depth k of dense block, the block is similar to the structure of LapSRN [24] with the dense connections attached. Increasing the depth k will definitely bring more trainable parameters of the model, causing the considerable extension of training time and downscaling speed, but also improves the model performance. To find the trade-offs between performance and speed, we train the model with different block structure combinations, which represents the different total network depth. We use PS in stage 1 and set encoder (decoder) depth of Stage 2 at 3 and then dynamically adjust the depth of the dense block. e parameter combinations of dense block and trade-offs between performance and speed are shown in Table 3. From the results, we can conclude that deeper model will perform better than shallow. As the network gradually deepens, the performance improvement is not very obvious. erefore, we choose k � c � d � 3 in each dense block of stage 1 for the following experiments to balance the efficiency and performance.
For stage 2, we also train the proposed model by changing the depth of symmetrical encoder and decoder (with PS and k � c � d � 3 in the dense block). e results are shown in Table 4. It is the same as stage 1, when the network deepens, the performance will increase. We choose encoder (decoder) depth at 5 because we do not observe significant performance gains by using more convolutional layers.

Climate Downscaling Performance on CSD.
We apply our model to CSD dataset. e CSD dataset contains 5 subdatasets as different regions and resolutions; we divide the entire dataset into 4 groups of LR-HR training pairs. e downscaling of CSD can be considered as 2× (1 to 0.5), 2.5× (2.5 to 1), and 5× (2.5 to 0.5) according to the groups. e LR and HR climate are treated as the input and output of model, respectively. e results are described in Table 5.
From Table 5, we can detect that the restored HR is very close to the real observation in general, where SSIM and PSNR are both maintained at a high level. As the magnification increases, SSIM and PSNR will decrease slightly.
is is because the LR data (20 × 38) especially in 5× experiments is too sparse for the required size (100 × 190) and cannot provide enough spatial information for more accurate results. In addition, at the same magnification such as in 2.5× experiments, although the input data of the model are different, such that the input sizes are 20 × 38 and 24 × 28 for Europe and Asia, respectively, the downscaling results are almost the same. is implies that our model is almost not limited by the input size under the same magnification. In addition, the sizes of the input and output have significant impacts on the training time. e larger the size is, the more the time it takes to train the model. erefore, under the premise of ensuring model performance, it is important to choose a suitable input and output size.
Furthermore, we choose real-world observation data in the validation set to exhibit the downscaling results as shown in  In these figures, in the left three columns: the second column (b) describes the original HR climate data, the first column (a) shows the results of bicubic interpolation, and the third column (c) shows the output of our proposed model. e rightmost column (d) in these figures describes the downscaling variance of our model. By comparing the downscaling results, we can see that the bicubic interpolation tends to underestimate and especially overestimate the downscaling results, such as excessively expanding the region of the large value area (red areas in the figures) compared to ground truth; this phenomenon is particularly common. Technically, climate downscaling is to insert more grid points into original data. Traditional approaches are similar to heat diffusion process, which tends to smoothly transport energy between grid points. is makes the results more average, ignoring the real physical mechanism. However, our model considers the different multiscale spatial correlations at the same time in the modeling process; it pays more attention to extract potential features in the discontinuous remote spaces. erefore, our model can achieve better performance.
From the perspective of variance ((d) in these figures), the largest variance always appears at the large value area.
is indicates that, in areas where the events are relatively strong, the uncertainties are high. As the magnification increases, the uncertainty gradually increases and becomes more and more obvious. It may be that the LR data provides too little information, which leads to increased uncertainties. Overall, it still maintains at a very low level. From these realworld downscaling results, although the climate downscaling has become more and more accurate in recent years, it is still unavoidable that uncertainties will interfere with results.
erefore, quantitative evaluation of the uncertainties is very important in meteorological research.
We measure the performance of the above-mentioned models with PNSR and SSIM. We train these models with 2× and 3× to detect the differences on performance. e results are shown in Table 6. Note that all results of other methods given in this section are generated by the officially released models or papers.
As shown in Table 6, in general, our proposed model is roughly comparable with or even outperforms other models especially in 2× experiments, which indicates the superiority of the stepwise reconstruction and ensemble inference. At    present, the existing deep learning models are designed deeper and deeper, and they make full use of dense connection and residual learning in the model structures. However, these models are usually one step to enlarge the feature maps directly (such as from 30 × 30 to 60 × 60 directly), which may cause too large feature size changes and incomplete feature extractions. On the other hand, these models only contain the integer-multiple magnification block, so sometimes they cannot adjust any enlargement size such as 2.5× magnification. Due to the stepwise reconstruction process, our model tends to use as many magnification blocks as possible, which enlarge the data gradually and make full use of spatial correlations. erefore, it can extract more spatial dependencies and achieve better results.
In addition, our model performs better on 2× experiments than 3× experiments in general. is may be because the LR images have closer resolutions with HR images in 2× experiments.
It is worth noting that the SSIM indicator obtained by our model is better, which usually ranks higher than PSNR. According to the mathematical definitions of PSNR and SSIM, PSNR tends to describe the ratio of the noise information (MSE) between two images, while SSIM thoroughly measures the structures of the two images under multiple aspects, such as luminance and contrast, many values of which are calculated by correlation coefficients of these two images based on equation (13). It shows that SSIM pays more attention to the trend and spatial consistency of  erefore, this implies that the downscaling results of our model will slightly overestimate or underestimate the real situation due to the interference of uncertainties, but it can exhibit more comprehensive spatial characteristics, which is more important in meteorological research.

Discussions
In which aspect does our proposed model consider the multiple spatial remote correlations of climate events? In order to take good advantage of this intrinsic characteristic of climate phenomena, we associate the concept of receptive    field in deep learning with spatial remote correlation and design a stepwise reconstruction process to enlarge climate data gradually. More specifically, in the proposed dense blocks, we adaptively design different dilated rate in dilated transformed convolution layers to capture multiscale spatial correlations. In detail, the real kernel size in dilated transformed convolution layers kernel real � kernel + (kernel − 1)(dr − 1), which means inserting (dr − 1) zeros between every two elements of the original kernel. So when dr ≥ 2, the kernel size is enlarged, and the receptive fields are also expanded. e space filtered by convolution kernels is not continuous, and there are lots of skips in space, so that broader remote spatial correlations are considered. Figure 13 shows some examples when kernel � 3.
In addition, in the stepwise subprocesses, we divide the entire enlargement process into many small processes that gradually enlarge the feature maps. e stepwise reconstruction process can cover as many different dilated rate dr as possible. erefore, in the entire enlarging process with a specific strategy (such as Prime-strategy), the model usually contains a large dilated rate and small dilated rate at the same time, which can make every grid point participate in the reconstruction process with multiscale spatial correlations.
Such operation ensures that our proposed model can extract more comprehensive multiscale spatial information and achieve better results.
In the implementation detail, there are no pooling layers existing in the proposed model. Although pooling layers are the most common structure to enlarge the receptive field, it may reduce the size of the data, which is contrary to the downscaling purpose and loss information. erefore, we design special dilated transformed convolution layers to dynamically adjust the receptive field while increasing the climate data resolution. e chaos of the dynamic system is always a severe obstacle for climate forecasting.
is is the direction for further climate research to try to capture the uncertainties. We design a key structure of a trainable spatial dropout layer, which can increase the randomness of the model, so that our model can infer the probability distribution of downscaling results and describe the optimal mean and variance at the same time.

Conclusions
In this paper, we propose a novel formalization for climate downscaling considering the multiscale spatial correlations and chaos in climate events and incorporate these pieces of a priori meteorological knowledge into the deep learning modeling process. Our formalization contains 2 stages. In the first stage, we creatively formulate climate downscaling into a parallel N-path process, and each path contains a stepwise reconstruction process to gradually enlarge the data considering the different scale spatial correlations. e same LR climate data will be reconstructed to N same-sized HR data. stage 1 will generate N HR candidates. In the second stage, we absorb the thought of ensemble prediction and  Figure 13: Some examples of the relationships between the real kernel size and dilated rate when the original kernel size is 3. evaluate the reconstruction uncertainties of these HR candidates by inferencing the optimal probability distribution of the results, which can provide the optimal mean and corresponding variance at the same time.
During the implementation, we take advantage of deep learning technologies on SISR problem and build up a more detailed model named climate downscaling network (CDN). According to our proposed stepwise reconstruction and ensemble inference, we build up a model with two corresponding stages. stage 1 consists of an N-path process, and each path divides the entire process into different steps to gradually enlarge the data. ese paths will dynamically design the enlargement step size. en every path will generate an HR candidate, respectively. In stage 1, the most basic module is the dense block proposed by us. We design such blocks by making an analogy between the concept of receptive field in deep learning and multiscale spatial correlations of climate events. is block is a stair-like structure with dense connections. stage 2 is composed of a fully convolution network, which is used to infer the optimal probability distributions. A classic symmetrical encoderdecoder structure is used in stage 2 with dense connections, where the uncertainty of the network is controlled by a trainable spatial dropout layer between encoder and decoder. According to the proposed model, we design loss functions for these two stages, respectively, and combine them together as the final loss. SSIM and PSNR are set as the evaluation metrics.
We collect and make two different datasets to train and evaluate our proposed model: climate science dataset (CSD) and benchmark image dataset (BID). CSD contains the realworld observed daily precipitation data with different 3 resolutions covering the globe, which are 0.5°(Europe only), 1°, and 2.5°, respectively. BID is composed of 6 different image datasets, where DIV2K is for training and validation, and SET5, SET14, BSDS100, URBAN100, and MANGA109 are for testing.
After building up the deep learning model and collecting the dataset, we design two series of comparison experiments to determine the optimal network structure and depth. Firstly, we confirm and evaluate our proposed 3 different path strategies with 2 other adaptive strategies and find that the paths and step sizes should not be set too dense, which leads to heavy overfitting. e Prime-strategy (PS) is a better structure for the model. en, to find the trade-off between the speed and performance, we gradually change the depth of the dense block in stage 1 and the depth of encoderdecoder in stage 2. In terms of the SSIM and PSNR, we determine the optimal model structure.
en, we measure and investigate the performance of our model on the above-mentioned two datasets, respectively. In CSD, we use a real-world observed daily precipitation dataset covering Europe and Asia to train our model. We use different LR-HR pairs' dataset to conduct climate downscaling. Our proposed model effectively learns the mapper projection from LR to HR. e results demonstrate the efficiency and superiority of our model in downscaling daily precipitation data from 2.5 degrees to 0.5 degrees over Asia and Europe. Although as the magnification increases, the model performance will decrease slightly, it still maintains at a relatively high level. In addition, the sizes of input and output will not affect the performance of the model. We further carry out downscaling simulations on several realworld observation data; the results show that the large uncertainties are always located in the areas where the events are relatively strong, which indicates that the chaos of the dynamic system is always a severe obstacle for climate forecasting. Despite this, our model progressively downscales the original climate data with low variances.
In BID, we compare our model with 8 other state-of-theart models in SSIM and PSNR. Sufficient experiments demonstrate that it is reasonable to use our proposed formalization which provides a reliable forecast up to at least 2 and 3 magnifications. Furthermore, our model is not limited in the magnification size compared to other models, tapping its great potentials in exploring detailed dynamic mechanisms for various climate events. What the model learns is not a fixed resolution mapping from LR to HR but the enlarged size, which can be applied to any resolution enlargement and convenient for transfer learning.
Accurate climate forecasting not only depends on models and experience but also requires an overall grasp of the laws of nature. us, we will use this approach to investigate the climate predictability. In the future, we will explore dynamical mechanism by downscaling more different physical variables (such as sea surface level and thermocline depth). Furthermore, through absorbing more effective meteorological knowledge, we will build a more targeted model structure to improve the model performance and reduce the computational complexity.

Conflicts of Interest
e authors declare that there are no conflicts of interest regarding the publication of this paper.