Precipitation is the main driver of hydrologic models; missing precipitation data can therefore cause a hydrologic model to fail. Although the interpolation of missing precipitation data is recognized as an important research topic, only a few methods follow a regression approach. In this study, missing daily precipitation data were estimated using five kernel functions: Epanechnikov, Quartic, Triweight, Tricube, and Cosine. The study also compares estimates obtained by K-nearest neighbor (KNN) regression with those from the five kernel functions, both statistically and in terms of streamflow simulated with the Soil and Water Assessment Tool (SWAT) hydrologic model. The results show that the kernel approaches interpolate precipitation data with higher quality than the KNN regression approach, in terms of both statistical data assessment and hydrologic modeling performance.
1. Introduction
Precipitation data are a key input for hydrologic models that estimate rainfall-runoff mechanisms [1]. A hydrologic model can fail to run when its precipitation input is a noncontinuous time series, so estimating missing precipitation data is an important and challenging task for hydrologic modeling. Many hydrologic models require interpolation of missing precipitation data [2], completion of meteorological data series [3], or imputation of meteorological data [4]. To estimate missing precipitation, researchers should consider spatiotemporal variations in precipitation (rainfall and snowfall) values and the related physical processes. However, accounting for spatiotemporal variation and physical processes can be difficult if there is a lack of equipment for measuring precipitation. Thus, statistical approaches have emerged as widely used methods for filling in missing precipitation data [5].
Many studies have investigated filling in missing streamflow data with statistical approaches [5], but few have addressed the interpolation of incomplete precipitation and temperature data [6–10]. Recently, artificial neural networks (ANNs) [11], a more advanced statistical approach, have been proposed for estimating missing precipitation data [12]. ANNs can learn from training data to reconstruct a nonlinear relationship and obtain values for missing data. Pisoni et al. [13] investigated the interpolation of missing data for sea surface temperature (SST) satellite images using the ANN method; they found that the ANN approach was more accurate than the interpolation system suggested by Seze and Desbois (1987). Nevertheless, ANNs remain disputed because their neuron systems cannot provide clear relationships between data [14].
The American Society of Civil Engineers (ASCE) Task Committee [15] noted that, although the performance of ANNs for estimating missing precipitation data has already been verified, an alternative solution is needed for cases in which the available data are insufficient, because ANNs rely on high data quality and quantity. Additionally, ANNs have other limitations, such as a lack of physical concepts and relations and a dependence on the experience and preferences of those using, studying, and training the networks [15–17]. Since ANNs are regarded as black-box models [18], it is difficult to use this method for realizing more linear relationships, even though ANNs can achieve convergence for almost any problem [17]. Thus, for real mechanisms in hydrologic models, in which linear relationships exist between series of weather inputs, the solution is less explicit [19].
Generally, regression or distance-weighted methods are most commonly used for estimating missing precipitation for hydrologic modeling [20]. Daly et al. [21] also proposed a variety of regression models to incorporate spatial variation in weather data. However, Creutin et al. [22] found that, even though simple linear regression or interpolation approaches show satisfactory serial correlation for daily or monthly streamflow, precipitation patterns do not show proper correlation when these approaches are used. Furthermore, if a regression method is used to estimate missing precipitation and refine a precipitation time series, a small data sample may not follow the normal distribution assumed by the basic theory of linear regression.
Another approach to estimating missing precipitation data from neighboring data is based on distance weighting. Xia et al. [23] used the closest station to reconstruct missing precipitation data through geometrical distance weighting; Willmott et al. [24] used arithmetic averaging of neighboring data to fill in missing precipitation; and Teegavarapu and Chandramouli [25] used an inverse distance weighting method on neighboring data to estimate missing precipitation data. Smith [26], Simanton and Osborn [27], and Salas [28] suggested that traditional weighting and data-driven methods, namely, distance-based weighting methods, be used to estimate missing precipitation data. Distance weighting approaches have also been combined with linear regression and the median distribution of regression [29, 30]. Young [31] and Filippini et al. [32] suggested spatially interpolating the correlation to define a weight for each station.
Estimation of missing precipitation data is also possible when data are available for the same location. Linacre (1992) investigated the interpolation of missing precipitation data by using the mean value of a data series at the same location, and Lowry [33] suggested simple interpolation between available data series. Acock and Pachepsky [34] used data from several days before and after the missing precipitation data points to estimate the incomplete precipitation data. K-nearest neighbor (knn) regression is a basic method for estimating missing precipitation data that considers the vicinity of the missing value. However, the method has weaknesses when the data contain outliers or when a nonlinear trend exists around the missing data. Whereas knn regression rests on the fundamental assumption of a normal distribution, which is statistically unsound for such data, the kernel method uses a weighted mean value, which can overcome this weakness. By weighting neighboring data with a kernel function, the kernel method can overcome the weakness of knn regression even when the data show a nonlinear trend.
The objective of this study was to reconstruct daily precipitation data by using five different kernel functions (Epanechnikov, Quartic, Triweight, Tricube, and Cosine) to estimate missing precipitation data. This study also compares estimates obtained by knn regression with those from the five kernel functions, including their performance in simulating streamflow using the Soil and Water Assessment Tool (SWAT) hydrologic model. The remainder of this paper is organized as follows. Section 2 provides a description of the study area and the hydrologic model. In Section 3, the methodology of the five different kernel methods is presented. Section 4 presents the results of the interpolation of the missing daily precipitation data and the hydrologic model simulation. Finally, conclusions are presented in Section 5.
2. Study Area and Hydrologic Model
The Imha watershed (Figure 1) was selected as the test bed for this study. The Imha watershed is a tributary of the Nakdong River basin and is located in the upper part of the Nakdong River basin in South Korea. It is characterized by mountainous terrain; approximately 79.8% of the total area of 1,361 km² is mountainous. Slopes of 40% to 60% cover 655 km², reported as 33% of the total watershed area. The elevation of the Imha watershed ranges from 80 to 1,215 m. The average annual precipitation, minimum temperature, maximum temperature, humidity, and wind speed for the Imha watershed are 1,050 mm, 7°C, 18.8°C, 65%, and 1.6 m/s, respectively (Water Management Information System (WAMIS), http://www.wamis.go.kr/). Since the climate in this area is warm, there is no precipitation in the form of snow; all precipitation consists of rainfall. For this evaluation of precipitation interpolation and hydrologic model performance, precipitation and streamflow gauges were selected as shown in Figure 1, and precipitation and streamflow data were sourced from WAMIS (http://www.wamis.go.kr/).
Study basin locations including rain and stream gauges (left figure: map of South Korea; right figure: Imha watershed).
This study selected the SWAT model for analysis. SWAT has a GIS extension, ArcSWAT, which allows the use of various GIS-based datasets to model the geomorphology of a given basin. The SWAT model was developed through research by the Agricultural Research Service (ARS) of the United States Department of Agriculture (USDA). Major data inputs for SWAT include temperature (maximum and minimum), daily precipitation, solar radiation, relative humidity, wind speed, and geospatial data representing soil types, land cover, and elevation. A watershed is divided into smaller subbasins, which are further divided into smaller units known as hydrologic response units (HRUs). Each HRU is characterized by uniform land use and soil type. SWAT can be used to accurately predict hydrologic patterns for extended periods of time [35]. Canopy interception is implicit in the curve number (CN) method and explicit in the Green-Ampt method. Infiltration is most accurately accounted for using the CN method in SWAT. The Green-Ampt method is an alternative way to account for infiltration. However, the Green-Ampt method has not been shown to increase accuracy over the CN method; thus, the CN method was used in this study.
3. Methodology
This study used five kernel functions, Epanechnikov, Quartic, Triweight, Tricube, and Cosine, as weights to predict missing values. The Tricube kernel assigns a large weight around the target point. Although the Tricube weight is similar to the Triweight weight, it decreases less rapidly with distance from the target point. The Quartic kernel assigns the next highest weight around the target point, and its weight decreases at a rate similar to that of the Triweight kernel. Both the Epanechnikov and Cosine kernels assign comparatively small weights to the nearest neighboring values. A brief description of the five kernel functions and their application to reconstructing missing values is presented in the following; the kernel functions are described further in Appendix A.
3.1. Epanechnikov
The Epanechnikov kernel is the most often used kernel function. The Epanechnikov kernel assigns zero weight to observations that are a distance of four, six, and eight away from the reference point. These values correspond to the choice of the interval width, which is often called the choice of smoothing parameter or bandwidth selection. The main characteristic of the Epanechnikov kernel is that its estimate remains smooth even when a neighboring value is far from the target value, namely, the missing value in this research. A brief description is given by the following:

$$K(x) = \frac{3}{4}\left(1 - x^{2}\right), \tag{1}$$

where $K(x)$ is the kernel function and $x$ is the independent variable, that is, the position of a surrounding nearest value relative to the target value.
3.2. Quartic
The second kernel function used in this research was the Quartic kernel, whose weight is more sensitive to the distance from the missing value. Since the applied weight differs greatly between near and far data points, the estimate is influenced more by the surrounding data. The kernel is a fourth-order equation, which is more sensitive to distance than a second-order equation. It is described by the following:

$$K(x) = \frac{15}{16}\left(1 - x^{2}\right)^{2}. \tag{2}$$
3.3. Triweight
The third kernel function used in this research was the Triweight kernel, which is a sixth-order equation. It is the most sensitive to distance because a sixth-order equation estimates the missing value based on the difference in distance with a weighted function, as shown by the following:

$$K(x) = \frac{35}{32}\left(1 - x^{2}\right)^{3}. \tag{3}$$
3.4. Tricube
The fourth kernel function used in this research was the Tricube kernel, which uses absolute values. Since it uses absolute values, it presents a smoother pattern for the nearest values than the Triweight kernel. However, as the values move further away from the nearest values, it shows a steep trend. The Tricube kernel has the most sensitivity in terms of weighted distance because it is a ninth-order equation, as shown in the following:

$$K(x) = \frac{70}{81}\left(1 - |x|^{3}\right)^{3}. \tag{4}$$
3.5. Cosine
The fifth kernel function used in this research was the Cosine kernel function. It is a widely applied kernel function in various fields because it has a constant curvature. Its shape is similar to that of the Epanechnikov kernel, even though it uses a cosine function, as shown in the following:

$$K(x) = \frac{\pi}{4}\cos\left(\frac{\pi}{2}x\right). \tag{5}$$
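As a minimal sketch, the five kernels in (1)-(5) can be written as plain Python functions. The restriction to $|x| \le 1$ (zero weight outside that interval) is the usual convention for these kernels and is an assumption here, as are the names `KERNELS` and `integrate`. The numerical integral is only a sanity check that each normalizing constant (3/4, 15/16, 35/32, 70/81, π/4) makes the kernel integrate to one.

```python
import math

# Kernel functions (1)-(5). Each is taken to be zero outside |x| <= 1,
# which is the usual convention and an assumption in this sketch.
def epanechnikov(x):
    return 0.75 * (1 - x**2) if abs(x) <= 1 else 0.0

def quartic(x):
    return 15 / 16 * (1 - x**2) ** 2 if abs(x) <= 1 else 0.0

def triweight(x):
    return 35 / 32 * (1 - x**2) ** 3 if abs(x) <= 1 else 0.0

def tricube(x):
    return 70 / 81 * (1 - abs(x) ** 3) ** 3 if abs(x) <= 1 else 0.0

def cosine(x):
    return math.pi / 4 * math.cos(math.pi / 2 * x) if abs(x) <= 1 else 0.0

KERNELS = {
    "Epanechnikov": epanechnikov,
    "Quartic": quartic,
    "Triweight": triweight,
    "Tricube": tricube,
    "Cosine": cosine,
}

def integrate(kernel, n=20000):
    """Midpoint-rule approximation of the integral of kernel over [-1, 1]."""
    h = 2.0 / n
    return sum(kernel(-1 + (i + 0.5) * h) for i in range(n)) * h

if __name__ == "__main__":
    for name, kernel in KERNELS.items():
        print(f"{name:13s} K(0) = {kernel(0):.4f}  integral = {integrate(kernel):.6f}")
```

All five functions are symmetric about zero, which is the property the estimation step relies on when it applies the same weight to neighbors on both sides of the missing value.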
3.6. Calculation of the Missing Value
After a kernel function has been used to calculate the weight of each neighboring value, the missing value is estimated using the following:

$$M = \frac{1}{P}\sum_{i=1}^{P} x_{i} \cdot K(u_{i}), \qquad u_{i} = \frac{N_{i}}{0.5P + 1}, \qquad N_{i} = -\frac{P}{2}, \ldots, \frac{P}{2}, \tag{6}$$

where $M$ is the missing value, $P$ is the number of nearest neighbors, $x_{i}$ is the $i$th neighboring value, and $u_{i}$ is the scaled position of the $N_{i}$th nearest value, which corresponds to $x_{i}$ (positive means the right side and negative means the left side; $N_{i} = 0$, the missing value itself, is not used). The kernel function should be symmetric about zero. If, for example, the four nearest neighbors are used to estimate the missing value, two neighboring values are taken from the right side and two from the left side. The specific equation for this example is shown in the following, and an example calculation is described in Appendix B:

$$M = \frac{1}{4}\left[K\left(-\tfrac{2}{3}\right) \cdot x_{1} + K\left(-\tfrac{1}{3}\right) \cdot x_{2} + K\left(\tfrac{1}{3}\right) \cdot x_{3} + K\left(\tfrac{2}{3}\right) \cdot x_{4}\right]. \tag{7}$$
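The estimator in (6) can be sketched in a few lines of Python. The function name `estimate_missing` and its argument layout are hypothetical choices for illustration; the data are the four neighbors of 2010-02-12 from the Appendix B example, and only the Epanechnikov kernel (1) is included to keep the block short.

```python
def epanechnikov(x):
    """Equation (1), taken as zero outside |x| <= 1 (an assumption)."""
    return 0.75 * (1 - x**2) if abs(x) <= 1 else 0.0

def estimate_missing(left, right, kernel):
    """Kernel-weighted estimate of a missing value, following equation (6).

    left  -- values before the gap, ordered from farthest to nearest
    right -- values after the gap, ordered from nearest to farthest
    """
    if len(left) != len(right):
        raise ValueError("expected the same number of neighbors on each side")
    half = len(left)
    p = 2 * half                      # P, the number of nearest neighbors
    scale = 0.5 * p + 1               # denominator of u_i
    total = 0.0
    for position, value in zip(range(-half, 0), left):
        total += value * kernel(position / scale)
    for position, value in zip(range(1, half + 1), right):
        total += value * kernel(position / scale)
    return total / p                  # (6) divides by P, not by the sum of weights

if __name__ == "__main__":
    # Neighbors of 2010-02-12 from Appendix B: 02-10, 02-11 | 02-13, 02-14
    estimate = estimate_missing([15.0, 17.2], [9.1, 0.0], epanechnikov)
    print(f"4-NN Epanechnikov estimate: {estimate:.2f}")
```

With unrounded weights the result agrees with the Appendix B value to two decimals (about 5.95, against an observed value of 6). Note that (6) divides by P rather than by the sum of the kernel weights, so the weights are not renormalized.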
3.7. Statistic Tests
A normality test is required to choose an appropriate test for evaluating the infilling methods. The Shapiro-Wilk [36] normality test was used with nineteen samples to determine whether the average difference is normally distributed. The test statistic is as shown in the following:

$$W = \frac{\left(\sum_{i=1}^{n} a_{i} y_{i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i} - \bar{y}\right)^{2}}, \tag{8}$$

where $y_{i}$ is the $i$th order statistic, namely, the $i$th smallest value in the sample, $\bar{y}$ is the mean of the $y_{i}$, and $a_{i}$ is a constant determined by the ordered data. The null hypothesis of the Shapiro-Wilk normality test is that the sample is normally distributed; if the significance probability is less than 5%, the null hypothesis is rejected, meaning that the sample does not follow a normal distribution. Since the significance probability for every group (Table 1) is below 5%, the null hypothesis is rejected. This study therefore uses nonparametric tests for the subsequent analysis.
Results of normality test with Shapiro-Wilk method for each K-nearest neighborhood. DF represents degree of freedom and P value means significance probability.
The Friedman test [37], a k-sample test that can detect differences between paired values, was selected as the nonparametric test. This method evaluates a small sample for differences by ranking the values. The null hypothesis of the Friedman test is that there is no average difference between the groups; if the significance probability is less than 5%, the null hypothesis is rejected, and it is concluded that an average difference exists between the groups. A brief description of the Friedman test is given in the following:

$$Q = \frac{SS_{t}}{SS_{e}}, \tag{9}$$

where $SS_{t}$ and $SS_{e}$ are the treatment sum of squares and the error sum of squares, respectively.
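The ratio in (9) can be illustrated with a short Python sketch. It assumes the textbook rank-based definitions of the two sums of squares, $SS_t = n\sum_j(\bar{r}_j - \bar{r})^2$ and $SS_e = \frac{1}{n(k-1)}\sum_i\sum_j(r_{ij} - \bar{r})^2$, for $n$ blocks and $k$ methods; the paper does not spell these out. The function names and the small error table are hypothetical and are not data from this study.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def friedman_q(table):
    """Q = SS_t / SS_e for a table with one row per block, one column per method."""
    n, k = len(table), len(table[0])
    ranked = [average_ranks(row) for row in table]
    grand_mean = (k + 1) / 2                      # mean of the ranks 1..k
    column_means = [sum(row[j] for row in ranked) / n for j in range(k)]
    ss_t = n * sum((m - grand_mean) ** 2 for m in column_means)
    ss_e = sum((r - grand_mean) ** 2 for row in ranked for r in row) / (n * (k - 1))
    return ss_t / ss_e

if __name__ == "__main__":
    # Hypothetical absolute errors: rows are missing days, columns are methods.
    errors = [
        [1.0, 2.0, 3.0],
        [1.5, 2.5, 4.0],
        [0.5, 3.0, 2.0],
        [1.0, 2.0, 5.0],
    ]
    print(f"Q = {friedman_q(errors):.2f}")
```

For a significance probability, Q is usually compared with a chi-square distribution with k - 1 degrees of freedom, which is a reasonable approximation only when the number of blocks is not too small.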
The null hypothesis was rejected in this instance because the significance probability was less than 5% in every case, and this study concluded that the interpolation methods differ on average, which is why each method is considered distinct, even though five of them are kernel methods. For example, with four reference points, the methods ordered from the largest to the smallest average rank are knn-regression, Tricube, Quartic, Cosine, Triweight, and Epanechnikov (Table 2). With six reference points, knn-regression, Tricube, Triweight, Quartic, Cosine, and Epanechnikov were ranked as shown in Table 2. With eight reference points, the average ranks of knn-regression, Triweight, Quartic, Cosine, and Epanechnikov are likewise shown in Table 2. As shown in Table 2, knn-regression has the largest average rank and Epanechnikov has the smallest average rank in all of the reference point cases. This result supports the dissimilarity of these methods.
Chi-square (χ²) test with the Friedman method for finding differences among the six infilling methods. SD represents standard deviation. P value means significance probability.
To determine which methods are dissimilar to the others, this study performed the Wilcoxon signed rank test [38]. The basic feature of the Wilcoxon signed rank test is that data samples that come from the same population are paired, and it is detailed in the following:

$$W = \sum_{i=1}^{N}\left[\operatorname{sgn}\left(y_{2,i} - y_{1,i}\right) \cdot R_{i}\right], \tag{10}$$

where $N$ is the sample size, $y_{2,i}$ is the $i$th value of the second data series, $y_{1,i}$ is the $i$th value of the first data series, and $R_{i}$ is the rank of $|y_{2,i} - y_{1,i}|$. If the significance probability of $W$ is less than 5%, a different mechanism underlies the sample data or method. Table 3 shows that the significance probability for knn-regression is less than 5% in all cases. Accordingly, knn-regression is dissimilar to all of the other methods. Although the five kernel methods for data interpolation are similar or dissimilar to each other depending on the number of reference points, all of the kernel methods can be distinguished from knn-regression using the Wilcoxon signed rank test.
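The statistic in (10) is straightforward to compute by hand, as the following sketch shows. It assumes the common conventions of dropping zero differences and giving tied absolute differences their average rank; the function names and the two short series are hypothetical illustrations, not data from this study. Only the statistic is computed; converting it to a significance probability requires the reference distribution of W, which is not implemented here.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def signed_rank_w(first, second):
    """W from equation (10): sum of sign(y2 - y1) times the rank of |y2 - y1|."""
    # Zero differences carry no sign and are dropped (a common convention).
    diffs = [b - a for a, b in zip(first, second) if b != a]
    ranks = average_ranks([abs(d) for d in diffs])
    return sum((1 if d > 0 else -1) * r for d, r in zip(diffs, ranks))

if __name__ == "__main__":
    # Hypothetical paired values from two interpolation methods.
    method_a = [1.0, 2.0, 3.0, 4.0]
    method_b = [1.5, 1.0, 3.2, 6.0]
    print("W =", signed_rank_w(method_a, method_b))
```

A value of W near zero indicates that positive and negative differences balance out, whereas a large positive or negative W indicates a systematic shift between the two paired series.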
Chi-square (χ²) test with the Wilcoxon signed rank method between regression and the five different kernel methods.
Since Epanechnikov has the smallest average rank for all reference points in Table 2, which signifies a small difference between the observed value and the interpolated value, the Epanechnikov method gives the best interpolation among the studied methods. Figure 2 shows that data filled in by knn-regression differ substantially from the observations at both four and six reference points. At four reference points, both the average and the median differences for the kernel methods are close to zero, meaning that the interpolated data are similar to the observed data. On the other hand, more than 75% of the data interpolated by knn-regression exhibit a nonzero difference. At six reference points in Figure 2, the median value for knn-regression is far from zero. At eight reference points, knn-regression is close to zero for both the average and median values; however, it is difficult to conclude that this is an ideal method because outlying maximum values affect the average and median values.
Box plots for the difference between actual precipitation and interpolated precipitation (panels: 4-NN, 6-NN, and 8-NN). The y-axis represents mm per day.
This study on precipitation data interpolation also evaluated the interpolated data through simulation with the SWAT hydrologic model. In SWAT, surface runoff is estimated from excess precipitation, after abstractions and infiltration, through the Soil Conservation Service Curve Number (SCS-CN) method. The Green-Ampt (GA) infiltration method is another way to calculate surface runoff in SWAT. A study shows that both methods give reasonable results, with no significant advantage observed in using one over the other. However, the GA method appears to have more limitations in modeling seasonal variability than the SCS-CN method does. Hence, the SCS-CN method is used for the infiltration factor in this study. An SCS curve number based simulation needs information that is updated at each time step as the soil water content changes. The excess rainfall equation in the SCS-CN method was derived from the historical relationship between the curve number and the hydrologic mechanism over more than 20 years. Throughout the surface runoff calculation, infiltration should be updated over time according to the soil type. Other abstractions, such as evapotranspiration and soil and snow evaporation, are calculated by the Penman-Monteith method and meteorological statistics. Finally, the kinematic storage model is used to compute groundwater storage and seepage. The resulting flow in SWAT is routed from the HRUs to the watershed outlet. Figure 3 shows the calibration of the model simulation as the initial step, and the specific parameters are described in Table 4. After the calibration of the SWAT model, the six interpolated precipitation datasets, each with three different reference ranges (a total of eighteen interpolated precipitation datasets), were used to assess the performance of interpolated precipitation data for hydrologic model simulation. Streamflow simulations were done for three years, from 2008 to 2010.
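For orientation, the excess rainfall relation of the SCS-CN method can be sketched as follows. This is the commonly published form with the initial abstraction set to 0.2 times the retention parameter, in millimeters; the paper does not state the equation, so the form, the function name `scs_cn_runoff`, and the example curve number are assumptions rather than values calibrated for the Imha watershed.

```python
def scs_cn_runoff(precip_mm, curve_number):
    """Daily surface runoff depth (mm) from the standard SCS-CN relation.

    Assumes the usual initial abstraction Ia = 0.2 * S, where
    S = 25400 / CN - 254 is the retention parameter in mm.
    """
    if not 0 < curve_number <= 100:
        raise ValueError("curve number must be in (0, 100]")
    retention = 25400.0 / curve_number - 254.0
    initial_abstraction = 0.2 * retention
    if precip_mm <= initial_abstraction:
        return 0.0                      # all rainfall is abstracted
    return (precip_mm - initial_abstraction) ** 2 / (precip_mm + 0.8 * retention)

if __name__ == "__main__":
    # Hypothetical example: 50 mm of rain on a surface with CN = 80.
    print(f"runoff = {scs_cn_runoff(50.0, 80.0):.2f} mm")
```

The relation is nonlinear in precipitation, which is one reason errors in interpolated daily precipitation do not translate proportionally into errors in simulated streamflow.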
To evaluate model performance with the different interpolated precipitation datasets, this study used ENS (Nash-Sutcliffe coefficient), R-square (coefficient of determination), and RMSE (root mean square error). Table 5 and Figure 4 show that the simulations driven by knn-regression exhibit low SWAT performance for streamflow estimation, with averages of 0.54 for ENS, 0.74 for R-square, and 23.78 m3/s for RMSE. All of the kernel functions, on the other hand, exhibit good performance for hydrologic simulations with interpolated precipitation data (Table 5 and Figure 4). The averages of ENS, R-square, and RMSE are, respectively, (1) 0.83, 0.86, and 14.03 m3/s for Epanechnikov; (2) 0.84, 0.88, and 13.03 m3/s for Quartic; (3) 0.93, 0.93, and 9.30 m3/s for Triweight; (4) 0.94, 0.95, and 8.13 m3/s for Tricube; and (5) 0.93, 0.94, and 9.00 m3/s for Cosine.
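The three performance measures can be computed as in the sketch below. It assumes the standard definitions (ENS as one minus the ratio of squared error to the variance of the observations, R-square as the squared Pearson correlation, and RMSE as the root of the mean squared error); the paper does not give the formulas, and the function names and flow values are hypothetical.

```python
import math

def nash_sutcliffe(observed, simulated):
    """ENS = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - squared_error / spread

def r_squared(observed, simulated):
    """Squared Pearson correlation between observed and simulated values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_sim = sum(simulated) / n
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(observed, simulated))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_sim = sum((s - mean_sim) ** 2 for s in simulated)
    return cov**2 / (var_obs * var_sim)

def rmse(observed, simulated):
    """Root mean square error, in the units of the inputs."""
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    return math.sqrt(squared_error / len(observed))

if __name__ == "__main__":
    # Hypothetical daily flows in m3/s; the simulation is biased high by 1.
    obs = [1.0, 2.0, 3.0, 4.0]
    sim = [2.0, 3.0, 4.0, 5.0]
    print(f"ENS = {nash_sutcliffe(obs, sim):.2f}, "
          f"R2 = {r_squared(obs, sim):.2f}, RMSE = {rmse(obs, sim):.2f}")
```

The example illustrates why the measures are reported together: a constant bias leaves R-square at 1 while lowering ENS and raising RMSE, which is consistent with knn-regression in Table 5 showing a larger gap between ENS and R-square than the kernel methods.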
Details of SWAT parameters related to the runoff mechanism for the Imha watershed.

| Parameter | Description | Selected value |
| --- | --- | --- |
| ESCO | Soil evaporation compensation factor | 0.9500 |
| EPCO | Plant water uptake compensation factor | 1.0000 |
| EVLAI | Leaf area index at which no evaporation occurs from water surface [m2/m2] | 3.0000 |
| FFCB | Initial soil water storage expressed as a fraction of field capacity water content | 0.0000 |
| IEVENT | Rainfall/runoff code: 0 = daily rainfall/CN | 0.0000 |
| ICRK | Crack flow code: 1 = model crack flow in soil | 0.0000 |
| SURLAG | Surface runoff lag time [days] | 4.0000 |
| ADJ_PKR | Peak rate adjustment factor for sediment routing in the subbasin (tributary channels) | 0.0000 |
| PRF | Peak rate adjustment factor for sediment routing in the main channel | 1.0000 |
| SPCON | Linear parameter for calculating the maximum amount of sediment that can be reentrained during channel sediment routing | 0.0001 |
| SPEXP | Exponent parameter for calculating sediment reentrained in channel sediment routing | 1.0000 |
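Details of simulation results with six different precipitation infilling methods in the Imha watershed. RMSE is in m3/s.

| Method | 4-NN ENS | 4-NN R2 | 4-NN RMSE | 6-NN ENS | 6-NN R2 | 6-NN RMSE | 8-NN ENS | 8-NN R2 | 8-NN RMSE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ep | 0.80 | 0.83 | 15.32 | 0.91 | 0.92 | 10.60 | 0.78 | 0.82 | 16.16 |
| Qu | 0.73 | 0.78 | 17.83 | 0.91 | 0.93 | 10.48 | 0.88 | 0.92 | 11.80 |
| Tw | 0.91 | 0.91 | 10.56 | 0.93 | 0.94 | 9.25 | 0.95 | 0.95 | 8.10 |
| Tc | 0.95 | 0.95 | 7.72 | 0.93 | 0.94 | 9.03 | 0.95 | 0.95 | 7.64 |
| Co | 0.93 | 0.94 | 8.83 | 0.95 | 0.95 | 7.72 | 0.91 | 0.93 | 10.44 |
| Reg | 0.69 | 0.80 | 19.14 | 0.21 | 0.65 | 30.71 | 0.71 | 0.73 | 21.48 |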
Calibrated model result using original precipitation input. x-axis represents time in days and y-axis represents flow in cubic meters per second.
Scatter plots for SWAT simulation (panels: 4-NN, 6-NN, and 8-NN; EP, QU, TW, TC, CO, and KNN represent Epanechnikov, Quartic, Triweight, Tricube, Cosine, and KNN-regression, resp.).
5. Conclusions
Five different kernel functions were applied to the Imha watershed to evaluate the performance of each weighted method for estimating missing precipitation data and the use of interpolated data for hydrologic simulations was assessed. The following conclusions can be drawn from this research.
To estimate missing precipitation data points, exploratory procedures should consider the spatiotemporal variations of precipitation. Because it is difficult to account for these variations, statistical methods are commonly used for estimating missing precipitation data.
Although ANNs are an advanced approach for estimating missing data, their mechanisms are unclear because the neuron system is ultimately a black-box model. Thus, regression methods are widely used for estimating missing data, even though they are limited in that a small sample may not follow the normal distribution that regression assumes.
When kernel functions are used as a weighting method, the estimated missing data satisfy the normal distribution, which is more statistically sound. Also, because they rely on a weighted mean value, kernel methods can overcome the weakness of knn-regression when the data have outliers and/or a nonlinear trend around the missing data points.
This study assessed the five kernel functions, Epanechnikov, Quartic, Triweight, Tricube, and Cosine, as a weight for predicting missing values. In comparison with the knn-regression method, this study demonstrates that the kernel approaches provide higher quality interpolated precipitation data than the knn-regression approach. In addition, the kernel function results better conform to statistical standards.
Furthermore, higher quality of interpolated precipitation data results in better performance for hydrologic simulations, as exemplified in this study. All of the statistical analyses of the streamflow simulations showed that the simulations using the interpolated precipitation data from the kernel functions provide better results than using knn-regression.
Use of kernel distribution is a more effective method than regression when the precipitation data have an upward or downward trend. However, if the precipitation data have a nonlinear trend, it is difficult to effectively reconstruct the missing values. For further research, a time series analysis or a random walk model using a stochastic process are possible methods by which to estimate missing data where there is a nonlinear trend.
Appendices
A. Kernel Functions
Kernel density estimation is an unsupervised learning procedure, which historically precedes kernel regression. It also leads naturally to a simple family of procedures for nonparametric classification.
A.1. Kernel Density Estimation
Suppose we have a random sample $x_{1}, x_{2}, \ldots, x_{N}$ drawn from a probability density $f_{X}(x)$ and we wish to estimate $f_{X}$ at a point $x_{0}$. For simplicity we assume for now that $x \in \mathbb{R}$ (real value). A natural local estimate has the form

$$\hat{f}_{X}(x_{0}) = \frac{\#\,x_{i} \in \mathcal{N}(x_{0})}{N\lambda}, \tag{A.1}$$

where $\#\,x_{i} \in \mathcal{N}(x_{0})$ is the number of $x_{i}$ that fall in $\mathcal{N}(x_{0})$, and $\mathcal{N}(x_{0})$ is a small metric neighborhood around $x_{0}$ of width $\lambda$. This estimate is bumpy, and the smooth Parzen estimate is preferred,

$$\hat{f}_{X}(x_{0}) = \frac{1}{N\lambda}\sum_{i=1}^{N} K_{\lambda}(x_{0}, x_{i}), \tag{A.2}$$

because it counts observations close to $x_{0}$ with weights that decrease with distance from $x_{0}$. In this case a popular choice for $K_{\lambda}$ is the Gaussian kernel $K_{\lambda}(x_{0}, x) = \phi(|x - x_{0}|/\lambda)$. Letting $\phi_{\lambda}$ denote the Gaussian density with mean zero and standard deviation $\lambda$, (A.2) has the form

$$\hat{f}_{X}(x) = \frac{1}{N}\sum_{i=1}^{N} \phi_{\lambda}(x - x_{i}) = \left(\hat{F} \star \phi_{\lambda}\right)(x), \tag{A.3}$$

the convolution of the sample empirical distribution $\hat{F}$ with $\phi_{\lambda}$. The distribution $\hat{F}(x)$ puts mass $1/N$ at each of the observed $x_{i}$ and is jumpy; in $\hat{f}_{X}(x)$ we have smoothed $\hat{F}$ by adding independent Gaussian noise to each observation $x_{i}$.
The Parzen density estimate is the equivalent of the local average, and improvements have been proposed along the lines of local regression (on the log scale for densities). We will not pursue these here. In $\mathbb{R}^{p}$ the natural generalization of the Gaussian density estimate amounts to using the Gaussian product kernel in (A.3),

$$\hat{f}_{X}(x_{0}) = \frac{1}{N\left(2\lambda^{2}\pi\right)^{p/2}}\sum_{i=1}^{N} e^{-\frac{1}{2}\left(\lVert x_{i} - x_{0}\rVert/\lambda\right)^{2}}. \tag{A.4}$$
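A one-dimensional version of the Gaussian Parzen estimate in (A.3) fits in a few lines. This is only a sketch: the function names and the three sample points are hypothetical, and the bandwidth $\lambda$ is passed in rather than selected from data.

```python
import math

def gaussian_density(z, lam):
    """Gaussian density with mean zero and standard deviation lam, evaluated at z."""
    return math.exp(-0.5 * (z / lam) ** 2) / (lam * math.sqrt(2 * math.pi))

def parzen_estimate(x0, sample, lam):
    """Equation (A.3): average of Gaussian bumps centered on the observations."""
    return sum(gaussian_density(x0 - xi, lam) for xi in sample) / len(sample)

if __name__ == "__main__":
    sample = [0.0, 1.0, 4.0]          # hypothetical observations
    for x0 in (0.5, 2.5, 4.0):
        print(f"f_hat({x0}) = {parzen_estimate(x0, sample, lam=0.5):.4f}")
```

Because each bump is itself a density and the estimate averages N of them, the result integrates to one for any bandwidth; a larger $\lambda$ gives a smoother but more biased estimate.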
A.2. Kernel Density Classification
One can use nonparametric density estimates for classification in a straightforward fashion using Bayes' theorem. Suppose for a $J$ class problem we fit nonparametric density estimates $\hat{f}_{j}(X)$, $j = 1, \ldots, J$, separately in each of the classes, and we also have estimates of the class priors $\hat{\pi}_{j}$ (usually the sample proportions). Then

$$\widehat{\Pr}(G = j \mid X = x_{0}) = \frac{\hat{\pi}_{j}\hat{f}_{j}(x_{0})}{\sum_{k=1}^{J}\hat{\pi}_{k}\hat{f}_{k}(x_{0})}. \tag{A.5}$$

In regions where the data are sparse for both classes, the Gaussian kernel density estimates, which use metric kernels, are low and of poor quality (high variance). The local logistic regression method uses the tricube kernel with k-NN bandwidth; this effectively widens the kernel in such a region and makes use of the local linear assumption to smooth out the estimate (on the logit scale).
If classification is the ultimate goal, then learning the separate class densities well may be unnecessary and can in fact be misleading. In learning the separate densities from data, one might decide to settle for a rougher, high-variance fit to capture these features, which are irrelevant for the purposes of estimating the posterior probabilities. In fact, if classification is the ultimate goal, then we need only to estimate the posterior well near the decision boundary (for two classes, this is the set $\{x \mid \Pr(G = 1 \mid X = x) = 1/2\}$).
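The posterior in (A.5) can be illustrated by combining per-class Gaussian Parzen estimates with sample-proportion priors. The sketch below is one-dimensional, and the function names, class labels, sample values, and bandwidth are all hypothetical.

```python
import math

def parzen_estimate(x0, sample, lam):
    """One-dimensional Gaussian Parzen density estimate, as in (A.3)."""
    norm = lam * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x0 - xi) / lam) ** 2) for xi in sample) / (len(sample) * norm)

def class_posteriors(x0, samples_by_class, lam):
    """Equation (A.5): prior times class density, normalized over the classes."""
    total = sum(len(sample) for sample in samples_by_class.values())
    weighted = {
        label: (len(sample) / total) * parzen_estimate(x0, sample, lam)
        for label, sample in samples_by_class.items()
    }
    evidence = sum(weighted.values())
    return {label: value / evidence for label, value in weighted.items()}

if __name__ == "__main__":
    # Hypothetical two-class training data, mirrored around x = 2.
    data = {"A": [0.0, 0.5, 1.0], "B": [3.0, 3.5, 4.0]}
    for x0 in (0.8, 2.0, 3.2):
        posteriors = class_posteriors(x0, data, lam=0.5)
        print(x0, {label: round(p, 3) for label, p in posteriors.items()})
```

At the midpoint between the two mirrored classes the posterior is 1/2 for each, which is the decision boundary mentioned above; far from all training points both weighted densities approach zero, and the ratio becomes numerically unreliable, reflecting the poor quality of density estimates in sparse regions.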
B. Procedures of Missing Precipitation
This appendix shows an example calculation of the kernel-weighted mean and of the weight applied in each situation. Because the kernel functions are all symmetric, the same weight is used on both sides for a given day distance. Table 6 shows the weights for the 1st, 2nd, 3rd, and 4th day distances. For example, to estimate the missing precipitation for 2010-02-12 (actual value is 6), see the following three-step procedure with the 4-NN Epanechnikov kernel (Table 7).
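Weighted values depending on day distance with each KNN.

| Kernel | 4-NN 1st | 4-NN 2nd | 6-NN 1st | 6-NN 2nd | 6-NN 3rd | 8-NN 1st | 8-NN 2nd | 8-NN 3rd | 8-NN 4th |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ep | 0.667 | 0.417 | 0.703 | 0.563 | 0.328 | 0.720 | 0.630 | 0.480 | 0.270 |
| Qu | 0.741 | 0.289 | 0.824 | 0.527 | 0.179 | 0.864 | 0.662 | 0.384 | 0.122 |
| Tw | 0.768 | 0.188 | 0.901 | 0.461 | 0.092 | 0.968 | 0.648 | 0.287 | 0.051 |
| Tc | 0.772 | 0.301 | 0.824 | 0.579 | 0.167 | 0.844 | 0.709 | 0.416 | 0.100 |
| Co | 0.680 | 0.393 | 0.726 | 0.555 | 0.301 | 0.747 | 0.635 | 0.462 | 0.243 |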
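The weights in Table 6 follow from evaluating each kernel at $u = d/(0.5P + 1)$ for day distance $d = 1, \ldots, P/2$, as defined in (6). The short script below regenerates them as a cross-check; the names `KERNELS` and `day_weights` are hypothetical.

```python
import math

# Kernels (1)-(5); only |u| < 1 occurs here, so no support check is needed.
KERNELS = {
    "Ep": lambda u: 0.75 * (1 - u**2),
    "Qu": lambda u: 15 / 16 * (1 - u**2) ** 2,
    "Tw": lambda u: 35 / 32 * (1 - u**2) ** 3,
    "Tc": lambda u: 70 / 81 * (1 - abs(u) ** 3) ** 3,
    "Co": lambda u: math.pi / 4 * math.cos(math.pi / 2 * u),
}

def day_weights(kernel, p):
    """Weights for day distances 1..P/2, using u = distance / (0.5 * P + 1)."""
    scale = 0.5 * p + 1
    return [kernel(distance / scale) for distance in range(1, p // 2 + 1)]

if __name__ == "__main__":
    for name, kernel in KERNELS.items():
        cells = ["  ".join(f"{w:.3f}" for w in day_weights(kernel, p)) for p in (4, 6, 8)]
        print(f"{name} | " + " | ".join(cells))
```

Values that fall exactly on a rounding boundary (for example 0.5625 for the 6-NN Epanechnikov 2nd distance) may print with a last digit that differs from the table, depending on the rounding rule used.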
Example calculation to interpolate missing precipitation.

| Date (Step 1) | Prec. (Step 2) | Weight (Step 2) | Prec.·Weight (Step 3) | Estimation (Step 3) |
| --- | --- | --- | --- | --- |
| 2010-02-10 | 15 | 0.417 | 6.255 | |
| 2010-02-11 | 17.2 | 0.667 | 11.472 | |
| 2010-02-12 | 6 | - | - | 5.949 |
| 2010-02-13 | 9.1 | 0.667 | 6.070 | |
| 2010-02-14 | 0 | 0.417 | 0 | |
Step 1. Select the date of the target value to be interpolated.
Step 2. Determine the precipitation on the Kth nearest days and the kernel weight for each.
Step 3. Calculate the weighted average to estimate the missing value.
The rest of the kernel methods for estimating missing precipitation are described in Table 8.
Calculating missing precipitation for 2010-02-12 (actual value is 6) with six different methods.
This section shows how to calculate missing precipitation with the kernel-weighted mean function by using specific values. The sample was drawn from the daily data from 2008 to 2010, with values selected at random with a probability of 0.02. After the data were selected, their locations in the series were set. Zhang et al. [39] reported that kernel-based nonparametric multiple imputation performs better than general linear regression when the sample data are small or limited.
Table 9 shows the kernel weighting procedure for each function. We used data from Feb. 10, 2010 to Feb. 14, 2010 to estimate the missing data for Feb. 12, 2010. For the farthest data points, the Epanechnikov kernel gives the highest weight, 0.417, whereas the Triweight kernel gives the lowest weight, 0.188. For the nearest values, the Tricube kernel gives the highest weight and the Epanechnikov kernel gives the lowest. Generally, Tricube, with its high weight, overestimates the missing precipitation.
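Sample calculation with specific numbers.

Epanechnikov

| Date | 2.10. | 2.11. | 2.12. | 2.13. | 2.14. |
|---|---|---|---|---|---|
| Prec. | 15.0 | 17.2 | 6.0 | 9.1 | 0.0 |
| Ep. weight | 0.417 | 0.667 | - | 0.667 | 0.417 |
| Prec.·weight | 6.26 | 11.47 | - | 6.07 | 0.00 |
| Estimation | | | 5.95 | | |

Quartic

| Date | 2.10. | 2.11. | 2.12. | 2.13. | 2.14. |
|---|---|---|---|---|---|
| Prec. | 15.0 | 17.2 | 6.0 | 9.1 | 0.0 |
| Qu. weight | 0.289 | 0.741 | - | 0.741 | 0.289 |
| Prec.·weight | 4.34 | 12.75 | - | 6.74 | 0.00 |
| Estimation | | | 5.96 | | |

Triweight

| Date | 2.10. | 2.11. | 2.12. | 2.13. | 2.14. |
|---|---|---|---|---|---|
| Prec. | 15.0 | 17.2 | 6.0 | 9.1 | 0.0 |
| Tw. weight | 0.188 | 0.768 | - | 0.768 | 0.188 |
| Prec.·weight | 2.82 | 13.21 | - | 6.99 | 0.00 |
| Estimation | | | 5.75 | | |

Tricube

| Date | 2.10. | 2.11. | 2.12. | 2.13. | 2.14. |
|---|---|---|---|---|---|
| Prec. | 15.0 | 17.2 | 6.0 | 9.1 | 0.0 |
| Tc. weight | 0.301 | 0.772 | - | 0.772 | 0.301 |
| Prec.·weight | 4.52 | 13.28 | - | 7.03 | 0.00 |
| Estimation | | | 6.20 | | |

Cosine

| Date | 2.10. | 2.11. | 2.12. | 2.13. | 2.14. |
|---|---|---|---|---|---|
| Prec. | 15.0 | 17.2 | 6.0 | 9.1 | 0.0 |
| Co. weight | 0.393 | 0.680 | - | 0.680 | 0.393 |
| Prec.·weight | 5.90 | 11.70 | - | 6.19 | 0.00 |
| Estimation | | | 5.94 | | |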
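The sample calculation for all five kernels can be reproduced with a short Python sketch. It uses the standard textbook forms of the kernels and rests on two assumptions inferred from the tabulated values rather than taken from this section: a bandwidth of 3 days, and division of the weighted sum by the number of neighbors. The names `KERNELS`, `kernel_weight`, `kernel_estimate`, and `NEIGHBORS` are hypothetical.

```python
import math

# Standard kernel functions, valid for |u| <= 1 (treated as zero outside).
KERNELS = {
    "Epanechnikov": lambda u: 0.75 * (1 - u**2),
    "Quartic": lambda u: 15 / 16 * (1 - u**2) ** 2,
    "Triweight": lambda u: 35 / 32 * (1 - u**2) ** 3,
    "Tricube": lambda u: 70 / 81 * (1 - abs(u) ** 3) ** 3,
    "Cosine": lambda u: math.pi / 4 * math.cos(math.pi * u / 2),
}


def kernel_weight(name, day_offset, bandwidth=3):
    """Weight for a neighbor `day_offset` days from the target date.

    Assumption: offsets are scaled by a bandwidth of 3 days, which
    matches the weights in the sample calculation.
    """
    u = day_offset / bandwidth
    if abs(u) > 1:
        return 0.0
    return KERNELS[name](u)


def kernel_estimate(name, neighbors, bandwidth=3):
    """Estimate the missing value from {day_offset: precipitation}.

    Assumption: the weighted sum is divided by the number of neighbors,
    as the tabulated estimates imply, not by the sum of the weights.
    """
    total = sum(
        kernel_weight(name, offset, bandwidth) * prec
        for offset, prec in neighbors.items()
    )
    return total / len(neighbors)


# Precipitation on the two days before and after 2010-02-12.
NEIGHBORS = {-2: 15.0, -1: 17.2, 1: 9.1, 2: 0.0}
```

Calling `kernel_estimate(name, NEIGHBORS)` for each key of `KERNELS` and rounding to two decimals gives the estimates listed in the sample calculation; small differences in the third decimal arise because the table multiplies by rounded weights.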
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
[1] K. Kang and V. Merwade, "Development and application of a storage-release based distributed hydrologic model using GIS," vol. 403, no. 1-2, pp. 1–13, 2011. doi:10.1016/j.jhydrol.2011.03.048
[2] A. J. Abebe, D. P. Solomatine, and R. G. W. Venneker, "Application of adaptive fuzzy rule-based models for reconstruction of missing precipitation events," vol. 45, no. 3, pp. 425–436, 2000. doi:10.1080/02626660009492339
[3] P. Ramos-Calzado, J. Gómez-Camacho, F. Pérez-Bernal, and M. F. Pita-López, "A novel approach to precipitation series completion in climatological datasets: application to Andalusia," vol. 28, no. 11, pp. 1525–1534, 2008. doi:10.1002/joc.1657
[4] T. Schneider, "Analysis of incomplete climate data: estimation of mean values and covariance matrices and imputation of missing values," vol. 14, no. 5, pp. 853–871, 2001.
[5] S. K. Regonda, D.-J. Seo, B. Lawrence, J. D. Brown, and J. Demargne, "Short-term ensemble stream forecasting using operationally-produced single-valued streamflow forecasts - a Hydrologic Model Output Statistics (HMOS) approach," vol. 497, pp. 80–96, 2013. doi:10.1016/j.jhydrol.2013.05.028
[6] P. Coulibaly and N. D. Evora, "Comparison of neural network methods for infilling missing daily weather records," vol. 341, no. 1-2, pp. 27–41, 2007. doi:10.1016/j.jhydrol.2007.04.020
[7] H. A. El Sharif and R. S. V. Teegavarapu, "Evaluation of spatial interpolation methods for missing precipitation data: preservation of spatial statistics," in Proceedings of the World Environmental and Water Resources Congress, pp. 3822–3832, May 2012. doi:10.1061/9780784412312.384
[8] A. Bárdossy and G. Pegram, "Infilling missing precipitation records - a comparison of a new copula-based method with other techniques," vol. 519, pp. 1162–1170, 2014. doi:10.1016/j.jhydrol.2014.08.025
[9] K. Schamm, M. Ziese, A. Becker, P. Finger, A. Meyer-Christoffer, U. Schneider, M. Schröder, and P. Stender, "Global gridded precipitation over land: a description of the new GPCC first guess daily product," vol. 6, no. 1, pp. 49–60, 2014. doi:10.5194/essd-6-49-2014
[10] R. S. V. Teegavarapu, "Statistical corrections of spatially interpolated missing precipitation data estimates," vol. 28, no. 11, pp. 3789–3808, 2014. doi:10.1002/hyp.9906
[11] Y. Da and G. Xiurun, "An improved PSO-based ANN with simulated annealing technique," vol. 63, pp. 527–533, 2005. doi:10.1016/j.neucom.2004.07.002
[12] A. di Piazza, F. L. Conti, L. V. Noto, F. Viola, and G. La Loggia, "Comparative analysis of different techniques for spatial interpolation of rainfall data to create a serially complete monthly time series of precipitation for Sicily, Italy," vol. 13, no. 3, pp. 396–408, 2011. doi:10.1016/j.jag.2011.01.005
[13] E. Pisoni, F. Pastor, and M. Volta, "Artificial Neural Networks to reconstruct incomplete satellite data: application to the Mediterranean Sea Surface Temperature," vol. 15, no. 1, pp. 61–70, 2008. doi:10.5194/npg-15-61-2008
[14] V. Sharma, S. Rai, and A. Dev, "A comprehensive study of artificial neural networks," vol. 2, no. 10, pp. 278–284, 2012.
[15] ASCE Task Committee, "Artificial neural networks in hydrology. II: hydrologic applications," vol. 5, no. 2, pp. 124–137, 2000. doi:10.1061/(ASCE)1084-0699(2000)5:2(124)
[16] ASCE Task Committee, "Artificial neural networks in hydrology. I: preliminary concepts," vol. 5, no. 2, pp. 115–123, 2000. doi:10.1061/(ASCE)1084-0699(2000)5:2(115)
[17] D. E. Rumelhart, B. Widrow, and M. A. Lehr, "The basic ideas in neural networks," vol. 37, no. 3, pp. 87–92, 1994. doi:10.1145/175247.175256
[18] J. Amorocho and W. E. Hart, "A critique of current methods in hydrologic systems investigation," vol. 45, no. 2, pp. 307–321, 1964.
[19] A. W. Minns and M. J. Hall, "Artificial neural networks as rainfall-runoff models," vol. 41, no. 3, pp. 399–417, 1996. doi:10.1080/02626669609491511
[20] K. Kang and V. Merwade, "The effect of spatially uniform and non-uniform precipitation bias correction methods on improving NEXRAD rainfall accuracy for distributed hydrologic modeling," vol. 45, no. 1, pp. 23–42, 2014. doi:10.2166/nh.2013.194
[21] C. Daly, W. P. Gibson, G. H. Taylor, G. L. Johnson, and P. Pasteris, "A knowledge-based approach to the statistical mapping of climate," vol. 22, no. 2, pp. 99–113, 2002. doi:10.3354/cr022099
[22] J. D. Creutin, H. Andrieu, and D. Faure, "Use of a weather radar for the hydrology of a mountainous area. Part II: radar measurement validation," vol. 193, no. 1–4, pp. 26–44, 1997. doi:10.1016/s0022-1694(96)03203-9
[23] Y. L. Xia, P. Fabian, A. Stohl, and M. Winterhalter, "Forest climatology: estimation of missing values for Bavaria, Germany," vol. 96, no. 1–3, pp. 131–144, 1999. doi:10.1016/s0168-1923(99)00056-8
[24] C. J. Willmott, S. M. Robeson, and J. J. Feddema, "Estimating continental and terrestrial precipitation averages from rain-gauge networks," vol. 14, no. 4, pp. 403–414, 1994. doi:10.1002/joc.3370140405
[25] R. S. V. Teegavarapu and V. Chandramouli, "Improved weighting methods, deterministic and stochastic data-driven models for estimation of missing precipitation records," vol. 312, no. 1–4, pp. 191–206, 2005. doi:10.1016/j.jhydrol.2005.02.015
[26] J. A. Smith, "Precipitation," D. R. Maidment, Ed., chapter 3, McGraw Hill, New York, NY, USA, 1993.
[27] J. R. Simanton and H. B. Osborn, "Reciprocal-distance estimate of point rainfall," vol. 106, no. 7, pp. 1242–1246, 1980.
[28] J. D.-J. Salas, "Analysis and modeling of hydrological time series," D. R. Maidment, Ed., chapter 19, pp. 19.1–19.72, McGraw-Hill, New York, NY, USA, 1993.
[29] S. J. Jeffrey, J. O. Carter, K. B. Moodie, and A. R. Beswick, "Using spatial interpolation to construct a comprehensive archive of Australian climate data," vol. 16, no. 4, pp. 309–330, 2001. doi:10.1016/s1364-8152(01)00008-1
[30] M. Franklin, V. R. Kotamarthi, M. L. Stein, and D. R. Cook, "Generating data ensembles over a model grid from sparse climate point measurements," vol. 125, article 012019, 2008. doi:10.1088/1742-6596/125/1/012019
[31] K. C. Young, "A three-way model for interpolating for monthly precipitation values," vol. 120, no. 11, pp. 2561–2569, 1992. doi:10.1175/1520-0493(1992)120<2561:atwmfi>2.0.co;2
[32] F. Filippini, G. Galliani, and L. Pomi, "The estimation of missing meteorological data in a network of automatic stations," vol. 4, pp. 283–291, 1994.
[33] W. P. Lowry, Secretariat of the World Meteorological Organization, Geneva, Switzerland, 1972.
[34] M. C. Acock and Y. A. Pachepsky, "Estimating missing weather data for agricultural simulations using group method of data handling," vol. 39, no. 7, pp. 1176–1184, 2000. doi:10.1175/1520-0450(2000)039<1176:emwdfa>2.0.co;2
[35] S. L. Neitsch, J. G. Arnold, J. R. Kiniry, J. R. Williams, and K. W. King, Texas Water Resource Institute, College Station, Tex, USA, 2005.
[36] S. S. Shapiro and M. B. Wilk, "An analysis of variance test for normality: complete samples," vol. 52, no. 3-4, pp. 591–611, 1965. doi:10.1093/biomet/52.3-4.591
[37] M. Friedman, "The use of ranks to avoid the assumption of normality implicit in the analysis of variance," vol. 32, no. 200, pp. 675–701, 1937. doi:10.1080/01621459.1937.10503522
[38] F. Wilcoxon, "Individual Comparisons by Ranking Methods," vol. 1, no. 6, pp. 80–83, 1945. doi:10.2307/3001968
[39] S. Zhang, Z. Jin, X. Zhu, and J. Zhang, "Missing data analysis: a kernel-based multi-imputation approach," vol. 5300 of Lecture Notes in Computer Science, pp. 122–142, Springer, Berlin, Germany, 2009. doi:10.1007/978-3-642-00212-0_7