Reservoirs are important for household water supply and for the national economy. This paper proposes a time-series forecasting model that estimates missing values and then applies variable selection to forecast a reservoir’s water level. The study collected data from Taiwan’s Shimen Reservoir together with daily atmospheric data from 2008 to 2015. The two datasets were merged by date into one integrated research dataset. The proposed time-series forecasting model has three foci. First, the study applies five imputation methods to estimate the missing values and compares them with directly deleting the records that contain missing values. Second, it ranks the variables via factor analysis and then sequentially deletes the least important variables. Finally, it uses a Random Forest to build the forecasting model of the reservoir’s water level and compares its forecasting error with that of the listed methods. The experimental results indicate that the Random Forest forecasting model, with variable selection or with the full variable set, has better forecasting performance than the listed models. In addition, the experiments show that the proposed variable selection helps the five forecast methods used here improve their forecasting capability.
Shimen Reservoir is located between Taoyuan City and Hsinchu County in Taiwan. It provides irrigation, hydroelectricity, water supply, flood control, tourism, and so on. The reservoir is therefore vital to the area’s livelihood, agriculture, flood control, and economic development. Thus, the authorities should plan and manage water resources comprehensively via accurate forecasting.
Previous studies of reservoir water levels have identified three important problems:
There are few studies of reservoir water levels: related studies [
Only a few variables have been used in reservoir water level forecasting. The literature shows only a few related studies of forecasting [
No imputation method has been used in datasets of reservoir water level: previous studies of water level forecasting in hydrological fields have shown that the collected data are continuous and long-term, but most of them did not explain how they dealt with missing values caused by human error or mechanical failure.
To address these problems, this study collected data on Taiwan’s Shimen Reservoir and the corresponding daily atmospheric data. The two datasets were concatenated into a single dataset based on the date. Next, this study imputed the missing values and selected the better imputation method before building the forecast models. We then evaluated the variables based on different models.
This paper includes five sections: Section
This section introduces the machine learning forecast methods, imputation techniques, and variable selection.
Radial Basis Function Networks were proposed by Broomhead and Lowe in 1988 [
Kstar is an instance-based classifier that differs from other instance-based learners in that it uses an entropy-based distance function [
The
A Random Forest can be applied for classification, regression, and unsupervised learning [
Random Tree is an ensemble learning algorithm that generates many individual learners and employs a bagging idea to produce a random set of data in the construction of a decision tree [
The daily atmospheric data may have missing values due to human error or machine failure. Many previous studies have shown that statistical bias occurs when missing values are deleted directly. Thus, imputing data can significantly improve the quality of the dataset; otherwise, biased results may cause poor performance in the subsequent models [
The variable selection method identifies the key variables that actually influence the forecasting target from among several variables. It then deletes the unimportant variables to improve the model’s efficiency. It can help with high-dimensional and complex problems. Previous studies in several fields have shown that variable selection can improve the forecasting efficiency of machine learning methods [
Variable selection is an important technique in data preprocessing. It removes irrelevant data and improves the accuracy and comprehensibility of the results [
Filter models utilize statistical techniques such as principal component analysis (PCA), factor analysis (FA), independent component analysis, and discriminant analysis in the investigation of other indirect performance measures. These are mostly based on distance and information measures [
PCA transforms a set of feature columns in the dataset into a projection of the feature space with lower dimensionality. FA is a generalization of PCA; the main difference between PCA and FA is that FA allows noise to have nonspherical shape while transforming the data. The main goal of both PCA and FA is to transform the coordinate system such that correlation between system variables is minimized [
There are several methods to decide how many factors have to be extracted. The most widely used method for determining the number of factors is using eigenvalues greater than one [
Reservoirs are important for domestic use as well as for national defense and economic development. Thus, reservoir water levels should be forecast over a long period of time, and water resources should be planned and managed comprehensively to achieve high cost-effectiveness. This paper proposes a time-series forecasting model based on the imputation of missing values and variable selection. First, the proposed model uses five imputation methods (i.e., median of nearby points, series mean, mean of nearby points, linear, and regression imputation) to estimate the missing values and compares them with a deletion strategy. Second, to identify the key variables that influence the daily water level, the proposed method ranks the importance of the atmospheric variables via factor analysis and then sequentially removes the unimportant variables. Finally, the proposed model uses a Random Forest machine learning method to build a forecasting model of the reservoir water level and compares it with other methods. The proposed model can be partitioned into four parts: data preprocessing, imputation and feature selection, model building, and accuracy evaluation. The procedure is shown in Figure
The procedure of the proposed model.
To make the proposed model easier to follow, this section partitions it into four steps.
The related water level and atmospheric data were collected from the reservoir administration website and the Taoyuan weather station. The two collected datasets are concatenated into an integrated dataset based on the date. There are nine independent variables and one dependent variable in the integrated dataset. The variables are defined in the integrated dataset and are listed in Table
Description of variables in the research dataset.
Output  Shimen Reservoir daily discharge release 
Input  Shimen Reservoir daily inflow discharge 
Temperature  Daily temperature in Daxi, Taoyuan 
Rainfall  The previous day Shimen Reservoir accumulated rainfall 
Pressure  Daily barometric pressure in Daxi, Taoyuan 
Relative Humidity  Daily relative humidity in Daxi, Taoyuan 
Wind Speed  Daily wind speed in Daxi, Taoyuan 
Direction  Daily wind direction in Daxi, Taoyuan 
Rainfall_Dasi  Daily rainfall in Daxi, Taoyuan 
After Step
In order to rescale all numeric values in the range
The six normalized datasets are partitioned into 66% training data and 34% testing data. We also employed a 10-fold cross-validation approach to identify the imputed dataset that has the best prediction performance.
We applied five forecast methods, namely Random Forest, RBF Network, Kstar, IBK (KNN), and Random Tree, and assessed them with five evaluation indices: correlation coefficient (CC), root mean squared error (RMSE), mean absolute error (MAE), relative absolute error (RAE), and root relative squared error (RRSE). The normalized dataset with the smaller error values across the indices identifies the better-fitting imputation method. Section
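The normalization and the two partitioning schemes in this step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the target range [0, 1], the order-preserving 66/34 split, and the shuffled fold assignment are all assumptions, and the function names are hypothetical.

```python
import random

def min_max_normalize(column):
    """Rescale a list of numbers to [0, 1] (assumed target range)."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: map every value to 0.0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def percentage_split(rows, train_fraction=0.66):
    """Split rows, in their given order, into training and testing parts."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # assumed: folds drawn at random
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]
```

Each row appears in exactly one test fold, so every record is used for testing once and for training in the remaining folds.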
Based on Step
The imputed integrated datasets are partitioned into 66% training datasets and 34% testing datasets.
Factor analysis ranked the importance of the variables.
The variable ranking of factor analysis was used to iteratively delete the least important variable. The remaining variables were studied with Random Forest, RBF Network, Kstar, KNN, and Random Tree until the RMSE can no longer improve.
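The sequential deletion described in these sub-steps can be sketched as a simple backward-elimination loop. The sketch assumes the variables are already ranked from most to least important, and that a caller-supplied function `evaluate_rmse` (hypothetical) trains a forecast model on a given variable subset and returns its RMSE.

```python
def backward_eliminate(ranked_vars, evaluate_rmse):
    """Drop the least important variable while the RMSE keeps improving.

    ranked_vars: variable names ordered from most to least important.
    evaluate_rmse: callable taking a list of variable names and returning
                   the RMSE of a model built on them (hypothetical hook).
    """
    current = list(ranked_vars)
    best_rmse = evaluate_rmse(current)   # start from the full variable set
    while len(current) > 1:
        candidate = current[:-1]         # remove the lowest-ranked variable
        rmse = evaluate_rmse(candidate)
        if rmse >= best_rmse:            # no further improvement: stop
            break
        current, best_rmse = candidate, rmse
    return current, best_rmse
```

In practice `evaluate_rmse` would wrap one of the five forecast methods; keeping it as a callback lets the same loop be reused for each of them.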
Based on the previous Step
Concurrently, we used five evaluation indices (CC, RMSE, RRSE, MAE, and RAE) to determine which forecast method is a good forecasting model.
To verify the performance of the reservoir water level forecasting model, this step uses the best imputed dataset with the selected variables to evaluate the proposed method. It then compares the results with the listed methods. This study used CC, RMSE, MAE, RAE, and RRSE to evaluate the forecast model. The five criteria indices are listed as equations (
This section verifies the performance of the proposed forecast model and compares the results with the listed methods. To determine which imputation method performs best on the collected dataset, this study used daily atmospheric data from the monitoring station and reservoir data from the website of the Water Resources Agency for Taiwan’s Shimen Reservoir. This work also compares the proposed model with the listed models with and without variable selection.
The research dataset consisted of two historical datasets: one was collected from the website of the Taiwan Water Resources Agency and the other from the Dasi monitoring station nearest to the Shimen Reservoir. Both datasets cover January 1, 2008, to October 31, 2015. They were concatenated into an integrated dataset based on the date. The integrated dataset has nine independent variables, one dependent variable, and 2,854 daily records. The study forecasts the water level of the reservoir, so the water level is the dependent variable. The independent variables are Reservoir_OUT, Reservoir_IN, Temperature, Rainfall, Pressure, Relative Humidity, Wind Speed, Direction, and Rainfall_Dasi. The detailed data types are shown in Table
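For orientation, the five indices can be computed as in the sketch below. It uses the commonly cited definitions, in which RAE and RRSE are measured relative to a predictor that always outputs the mean of the actual values; this is an assumption, since the paper's own equations are not reproduced in this text. The function name is hypothetical, and the actual values are assumed not to be constant.

```python
import math

def evaluation_indices(actual, predicted):
    """Return CC, MAE, RMSE, RAE, and RRSE for two equal-length sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    pairs = list(zip(actual, predicted))

    abs_err = sum(abs(p - a) for a, p in pairs)         # total absolute error
    sq_err = sum((p - a) ** 2 for a, p in pairs)        # total squared error
    abs_dev = sum(abs(a - mean_a) for a in actual)      # error of mean predictor
    sq_dev = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in pairs)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    denom = math.sqrt(sq_dev * var_p)

    return {
        "CC": cov / denom if denom > 0 else float("nan"),
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": abs_err / abs_dev,
        "RRSE": math.sqrt(sq_err / sq_dev),
    }
```

Under these definitions, RAE and RRSE values near 1 mean a model is no better than always predicting the mean, while a larger CC and smaller values of the four error indices indicate a better fit.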
The partial collected data.
Date  Rainfall  Input  Output  Rainfall_Dasi  Temperature  Wind Speed  Direction  Pressure  Relative Humidity  Water level 

2008/1/1  0.1  83.6  0  10.2  4.7  65  1001.5  56  244.09  
2008/1/2  0.1  96.08  286.24  0  10.4  6.3  38  1000.7  59  243.93 
2008/1/3  0  82.72  82.17  0  14.5  4.5  50  997.7  67  243.81 
2008/1/4  0  133.32  262.22  0  15.3  3.4  40  996.5  82  243.78 
2008/1/5  0  125.6  305.94  0  16  2.9  46  996.3  77  243.55 
2008/1/6  0.3  98.74  192.33  0  16.7  1.2  170  995.4  83  243.32 
2008/1/7  0  116.6  192.76  0  18.4  1.9  46  994.9  77  243.34 
2008/1/8  0  93.12  109.73  0  19.9  1.3  311  992.9  78  243.33 
2008/1/9  0  107.57  123.98  0  19.9  2.2  11  992.2  77  243.23 
2008/1/10  0  65.15  276.74  0  19.6  1.6  357  991.6  80  243 
2008/1/11  0  55.64  249.09  0  21.5  1.3  185  990.2  71  242.78 
2008/1/12  0  91.67  191.81  0  19.8  4.2  37  992.1  75  242.74 
2008/1/13  0.9  107.34  182.22  1.5  14.2  7.1  39  996.2  85  242.53 
2008/1/14  5.2  80.09  146.62  1  12.7  7.1  36  997.7  85  242.49 
2008/1/15  4  85.77  243.82  0  13.5  7.1  35  998.4  86  242.38 
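The date-based integration of the two sources can be sketched as below. The sketch assumes each source is a list of dict records with a `Date` field in the `YYYY/M/D` form shown in the table, and it keeps every date found in either source, leaving absent fields as `None` so that they can be imputed or deleted later. The function name and this outer-join behaviour are assumptions, not details stated in the paper.

```python
from datetime import datetime

def merge_by_date(reservoir_rows, weather_rows, columns):
    """Combine two lists of dict records into one record per date.

    Fields missing for a date are set to None (assumed behaviour).
    """
    by_date = {}
    for row in list(reservoir_rows) + list(weather_rows):
        day = datetime.strptime(row["Date"], "%Y/%m/%d").date()
        by_date.setdefault(day, {}).update(row)
    return [
        {"Date": day.isoformat(), **{c: by_date[day].get(c) for c in columns}}
        for day in sorted(by_date)   # chronological order
    ]
```

Parsing the date before sorting avoids the pitfall that strings such as `2008/1/10` sort before `2008/1/2`.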
Based on the computational step in Section
To handle the missing values in the dataset, this study applies the series mean, regression, mean of nearby points, linear, and median of nearby points imputation methods to estimate missing values. It also directly deletes the records with missing values, and then determines which of the six treatments performs best. After normalizing the six processed datasets, we used two approaches to evaluate them: percentage split (66% training data and 34% testing data) and 10-fold cross-validation. Under both approaches, Random Forest, RBF Network, Kstar, KNN, and Random Tree forecast the water level, and the six treatments are compared on five evaluation indices: CC, RMSE, MAE, RAE, and RRSE. The results are shown in Tables
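Four of the treatments named above, plus the deletion strategy, can be sketched for a single series as follows. Missing entries are represented by `None`. The neighbourhood size `span` and the rule of using whatever valid neighbours are available are assumptions; regression imputation is left out because the text does not specify its form. All function names are hypothetical.

```python
from statistics import mean, median

def series_mean(values):
    """Replace each missing value with the mean of all observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def _neighbors(values, i, span):
    """Up to `span` nearest observed values on each side of position i."""
    below = [v for v in reversed(values[:i]) if v is not None][:span]
    above = [v for v in values[i + 1:] if v is not None][:span]
    return below + above

def nearby_points(values, span=2, reducer=mean):
    """Mean (or median, via reducer=median) of nearby observed points."""
    return [reducer(_neighbors(values, i, span)) if v is None else v
            for i, v in enumerate(values)]

def linear_interpolation(values):
    """Fill gaps on the straight line between surrounding observed values."""
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result  # leading or trailing gaps are left as None

def delete_missing(rows):
    """Deletion strategy: keep only rows with no missing field."""
    return [row for row in rows if all(v is not None for v in row)]
```

For example, `nearby_points(series, span=2, reducer=median)` gives the median-of-nearby-points variant, while the default arguments give the mean-of-nearby-points variant.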
Variable selection and model building uses the results from above. The best imputation method is the mean of nearby points. Therefore, this study will use the imputed dataset of the mean of nearby points to select the variable and build the model. For variable selection, this study utilizes factor analysis to rank the importance of independent variables for an improved imputed dataset. Table
Next, we determine the key variables and build a forecast model. This study takes the factor-analysis ordering of the variables and iteratively deletes the least important variable, rebuilding the proposed forecasting model each time, until the minimal RMSE is reached. First, all independent variables are used to build the water level forecasting model. Second, the least important variables are removed one by one, and the remaining variables are used to build the forecast model until the minimal RMSE is reached. After these iterative experiments, the optimal forecast model is achieved when Wind Speed and Direction are deleted. The key remaining variables are Reservoir_IN, Temperature, Reservoir_OUT, Pressure, Rainfall, Rainfall_Dasi, and Relative Humidity. As a rule of thumb, we recommend interpreting only factor loadings with an absolute value greater than 0.5, which corresponds to around 25% of the variance explained [
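The 25% figure in the rule of thumb can be checked directly. Assuming standardized variables and orthogonal factors (an assumption; the text does not state the rotation used), the squared loading $\ell_{ij}$ of variable $i$ on factor $j$ is the share of that variable's variance explained by the factor:

```latex
\ell_{ij}^{2} = (0.5)^{2} = 0.25
\quad\Longrightarrow\quad
\text{about } 25\% \text{ of the variance of variable } i \text{ is explained by factor } j .
```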
Model comparison: this study compares the proposed forecast model (using Random Forest) with the Random Forest, RBF Network, Kstar, IBK (KNN), and Random Tree forecast models (Tables
The results of the listed models with the five imputation methods under percentage split (66% training data and 34% testing data) before variable selection.
Methods  Index  RBF Network  Kstar  Random Forest  IBK  Random Tree
Before variable selection
Delete the rows with missing data  CC  0.085  0.546  0.728  0.288  0.534
MAE  0.183  0.133  0.111  0.186  0.145
RMSE  0.227  0.200  0.157  0.271  0.226
RAE  0.998  0.727  0.607  1.015  0.789
RRSE  0.997  0.879  0.690  1.193  0.995
Series mean  CC  0.052  0.557  0.739  0.198  0.563
MAE  0.174  0.123  0.102  0.191  0.126
RMSE  0.222  0.189  […]  0.283  0.202
RAE  1.001  0.705  0.587  1.098  0.722
RRSE  0.999  0.850  0.679  1.276  0.908
Linear  CC  0.054  0.565  0.734  0.200  0.512
MAE  0.175  0.121  0.101  0.189  0.138
RMSE  0.222  0.188  0.152  0.281  0.218
RAE  1.000  0.690  0.575  1.082  0.787
RRSE  0.999  0.844  0.684  1.264  0.980
Median of nearby points  CC  0.054  0.571  0.737  0.227  0.559
MAE  0.175  0.120  0.101  0.188  0.126
RMSE  0.222  0.186  0.152  0.277  0.202
RAE  1.000  0.689  0.577  1.074  0.719
RRSE  0.999  0.838  0.681  1.244  0.907
Mean of nearby points  CC  0.053  0.572  […]  0.232  0.512
MAE  0.175  0.121  […]  0.186  0.132
RMSE  0.222  0.186  […]  0.275  0.217
RAE  1.000  0.690  […]  1.062  0.756
RRSE  0.999  0.837  […]  1.235  0.975
Regression  CC  0.052  0.564  0.739  0.200  0.509
MAE  0.174  0.121  0.102  0.191  0.133
RMSE  0.222  0.188  0.151  0.283  0.216
RAE  1.001  0.695  0.586  1.096  0.762
RRSE  0.999  0.845  0.680  1.275  0.974
The results of the listed models with the five imputation methods under 10-fold cross-validation before variable selection.
Methods  Index  RBF Network  Kstar  Random Forest  IBK  Random Tree
Before variable selection
Delete the rows with missing data  CC  0.041  0.590  0.737  0.246  0.505
MAE  0.184  0.126  0.109  0.195  0.143
RMSE  0.227  0.188  0.154  0.281  0.225
RAE  1.000  0.682  0.592  1.059  0.775
RRSE  0.999  0.825  0.678  1.235  0.986
Series mean  CC  0.038  0.612  0.755  0.237  0.574
MAE  0.171  0.113  0.098  0.181  0.125
RMSE  0.217  0.175  […]  0.270  0.202
RAE  1.001  0.660  0.575  1.058  0.731
RRSE  0.999  0.802  0.658  1.241  0.929
Linear  CC  0.042  0.615  0.753  0.243  0.551
MAE  0.173  0.112  0.098  0.181  0.127
RMSE  0.218  0.175  0.144  0.269  0.207
RAE  1.000  0.649  0.568  1.057  0.736
RRSE  0.999  0.800  0.660  1.233  0.948
Median of nearby points  CC  0.043  0.614  0.752  0.251  0.535
MAE  0.173  0.113  0.098  0.180  0.131
RMSE  0.218  0.175  0.144  0.268  0.211
RAE  1.000  0.653  0.568  1.041  0.757
RRSE  0.999  0.801  0.661  1.227  0.967
Mean of nearby points  CC  0.043  0.613  […]  0.250  0.558
MAE  0.173  0.113  […]  0.179  0.125
RMSE  0.218  0.175  0.144  0.268  0.205
RAE  1.000  0.654  […]  1.039  0.725
RRSE  0.999  0.802  […]  1.226  0.937
Regression  CC  0.038  0.618  0.754  0.240  0.522
MAE  0.171  0.112  0.098  0.181  0.133
RMSE  0.217  0.174  […]  0.270  0.214
RAE  1.001  0.653  0.574  1.055  0.778
RRSE  0.999  0.798  0.659  1.239  0.983
The results of variable selection.
Variable  Factor 1  Factor 2  Factor 3
Input  […]  .072  .164
Output  […]  .037  .071
Rainfall  […]  .064  .503
Temperature  .092  […]  −.149
Pressure  −.254  −[…]  −.182
Wind Speed  .096  −[…]  .052
Direction  .041  […]  −.067
Rainfall_Dasi  .290  .068  […]
Relative Humidity  .025  −.196  […]
The results of comparing the forecasting models under percentage split (66% training data and 34% testing data) after variable selection.
Methods  Index  RBF Network  Kstar  Random Forest  IBK  Random Tree
After variable selection
Delete the rows with missing data  CC  0.033  0.638  0.729  0.251  0.545
MAE  0.182  0.121  0.111  0.199  0.135
RMSE  0.229  0.176  0.156  0.287  0.212
RAE  0.992  0.657  0.602  1.083  0.736
RRSE  1.007  0.775  0.688  1.262  0.935
Series mean  CC  0.107  0.661  0.739  0.242  0.551
MAE  0.172  0.107  0.101  0.179  0.129
RMSE  0.221  0.167  0.151  0.268  0.205
RAE  0.988  0.615  0.579  1.027  0.740
RRSE  0.995  0.753  0.678  1.208  0.923
Linear  CC  0.105  0.666  0.735  0.258  0.596
MAE  0.173  0.106  0.100  0.175  0.120
RMSE  0.221  0.166  0.151  0.266  0.196
RAE  0.987  0.606  0.572  1.002  0.683
RRSE  0.995  0.748  0.681  1.198  0.883
Median of nearby points  CC  0.106  0.666  0.740  0.264  0.553
MAE  0.173  0.107  0.100  0.177  0.127
RMSE  0.221  0.166  0.151  0.266  0.207
RAE  0.987  0.611  0.571  1.013  0.723
RRSE  0.995  0.747  0.677  1.195  0.932
Mean of nearby points  CC  0.1059  0.667  […]  0.249  0.540
MAE  0.173  0.107  […]  0.179  0.129
RMSE  0.221  0.166  […]  0.268  0.214
RAE  0.987  0.611  […]  1.025  0.735
RRSE  0.995  0.747  […]  1.207  0.962
Regression  CC  0.107  0.663  0.739  0.242  0.559
MAE  0.172  0.106  0.101  0.179  0.126
RMSE  0.221  0.167  0.151  0.268  0.200
RAE  0.987  0.610  0.581  1.027  0.723
RRSE  0.994  0.752  0.678  1.207  0.900
The results of comparing the forecasting models under 10-fold cross-validation after variable selection.
Methods  Index  RBF Network  Kstar  Random Forest  IBK  Random Tree
After variable selection
Delete the rows with missing data  CC  0.103  0.665  0.737  0.233  0.529
MAE  0.181  0.115  0.108  0.193  0.143
RMSE  0.226  0.171  0.154  0.282  0.223
RAE  0.984  0.627  0.589  1.047  0.774
RRSE  0.994  0.749  0.677  1.238  0.977
Series mean  CC  0.081  0.688  0.751  0.295  0.547
MAE  0.169  0.103  0.098  0.170  0.131
RMSE  0.217  0.158  0.144  0.260  0.209
RAE  0.988  0.600  0.571  0.990  0.767
RRSE  0.996  0.727  0.661  1.193  0.960
Linear  CC  0.081  0.692  0.750  0.286  0.551
MAE  0.171  0.102  0.098  0.169  0.128
RMSE  0.218  0.158  0.145  0.261  0.207
RAE  0.988  0.590  0.566  0.981  0.740
RRSE  0.996  0.723  0.662  1.196  0.948
Median of nearby points  CC  0.083  0.692  0.752  0.305  0.555
MAE  0.171  0.102  […]  0.169  0.126
RMSE  0.218  0.158  0.144  0.259  0.208
RAE  0.987  0.593  […]  0.980  0.732
RRSE  0.996  0.722  0.660  1.186  0.951
Mean of nearby points  CC  0.082  0.694  […]  0.276  0.537
MAE  0.171  0.102  […]  0.171  0.129
RMSE  0.218  0.157  […]  0.263  0.210
RAE  0.988  0.593  0.564  0.993  0.747
RRSE  0.996  0.721  […]  1.204  0.960
Regression  CC  0.081  0.690  […]  0.295  0.572
MAE  0.169  0.102  0.098  0.169  0.126
RMSE  0.217  0.158  […]  0.259  0.204
RAE  0.988  0.595  0.571  0.989  0.735
RRSE  0.996  0.725  […]  1.193  0.938
After variable selection and model building, some key findings can be highlighted:
Imputation: after the two collected datasets were concatenated into an integrated dataset, there are missing values in the integrated dataset due to human error or mechanical failure. Tables
Variable selection: this study uses factor analysis to rank the ordering of variables and then sequentially deletes the least important variables until the forecasting performance no longer improves. Tables
Forecasting model: this study proposed a time-series forecasting model based on estimating missing values and variable selection to forecast the water level in the reservoir. The experimental results indicate that the Random Forest forecasting model, with variable selection or with the full variable set, has better forecasting performance than the listed models on the five evaluation indices. The proposed time-series forecasting model is feasible for forecasting water levels in Shimen Reservoir.
This study proposed a time-series forecasting model for water level forecasting in Taiwan’s Shimen Reservoir. The experiments showed that the mean of nearby points imputation method has the best performance. The experimental results indicate that the Random Forest forecasting model, with variable selection or with the full variable set, has better forecasting performance than the listed models on the five evaluation indices. The key variables are Reservoir_IN, Temperature, Reservoir_OUT, Pressure, Rainfall, Rainfall_Dasi, and Relative Humidity. This shows that the proposed time-series forecasting model is feasible for forecasting water levels in Shimen Reservoir. Future work will address the following issues:
The reservoir’s utility includes irrigation, domestic water supply, and electricity generation. The key variables identified here could improve forecasting in these fields.
We might apply the proposed timeseries forecasting model based on imputation and variable selection to forecast the water level of lakes, salt water bodies, reservoirs, and so on.
In addition, the experiments show that the proposed variable selection helps the five forecast methods used here improve their forecasting capability.
The authors declare that they have no conflicts of interest.