A Time-Series Water Level Forecasting Model Based on Imputation and Variable Selection Method

Reservoirs are important for households and affect the national economy. This paper proposes a time-series forecasting model that first imputes missing values and then selects variables to forecast a reservoir's water level. The study collected data from Taiwan's Shimen Reservoir together with daily atmospheric data from 2008 to 2015. The two datasets are joined by date into one integrated research dataset. The proposed model has three parts. First, the study applies five imputation methods to estimate the missing values and compares them with directly deleting the records that contain missing values. Second, it ranks the variables via factor analysis and then sequentially deletes the unimportant variables. Finally, it uses a Random Forest to build the forecasting model of the reservoir's water level and compares its forecasting error with that of the listed methods. The experimental results indicate that the Random Forest forecasting model, both with variable selection and with the full set of variables, has better forecasting performance than the listed models. In addition, the experiment shows that the proposed variable selection can help the five forecast methods used here improve their forecasting capability.


Introduction
Shimen Reservoir is located between Taoyuan City and Hsinchu County in Taiwan. The reservoir supports irrigation, hydroelectricity, water supply, flood control, tourism, and so on. It is therefore very important to the area's livelihood, agriculture, flood protection, and economic development. Thus, the authorities should plan and manage water resources comprehensively via accurate forecasting.
Previous studies of reservoir water levels reveal three important problems:
(1) Few studies address reservoir water levels. Related studies [1][2][3][4] in the hydrological field use machine learning methods to forecast water levels, but they focus on flood-stage water levels in pumping stations, reservoirs, lakes, basins, and so on. Most of these flood-stage forecasts rely on data about typhoons, specific climate conditions, seasonal rainfall, or water levels.
(2) Only a few variables have been used in reservoir water level forecasting. The few related forecasting studies [5,6] used water level as the dependent variable, and the independent variables were limited to rainfall, water level, and time lags of these two variables. With so few independent variables, it is difficult to determine the key variable set for the reservoir water level.
(3) No imputation method has been applied to reservoir water level datasets. Previous studies of water level forecasting in hydrological fields describe the collected data as continuous and long-term, but most of them did not explain how they dealt with missing values caused by human error or mechanical failure.
To address these problems, this study collected data on Taiwan's Shimen Reservoir and the corresponding daily atmospheric data. The two datasets were joined by date into a single dataset. Next, this study imputed the missing values and selected the better imputation method for building the forecast models. We then evaluated the variables under different models.

Computational Intelligence and Neuroscience
This paper includes five sections: Section 2 reviews related work; Section 3 presents the research methodology and introduces the concepts, imputation methods, variable selection, and forecasting model; Section 4 verifies the proposed model and compares it with the listed models; Section 5 concludes.

Related Work
This section introduces the machine learning forecast methods, imputation techniques, and variable selection.

Machine Learning Forecast (Regression)
RBF Network.
Radial Basis Function (RBF) Networks were proposed by Broomhead and Lowe in 1988 [7]. An RBF Network is a simple supervised feed-forward network that avoids iterative training and is trained on the data in a single stage [8]. It is a type of ANN applied to supervised learning problems such as regression, classification, and time-series prediction [9]. The RBF Network consists of three layers: an input layer, a hidden layer, and an output layer. The input layer is the set of source nodes, the second layer is a hidden layer of high dimension, and the output layer gives the response of the network to the activation patterns applied to the input layer [10]. The advantages of the RBF approach are the (partial) linearity in the parameters and the availability of fast and efficient training methods [11]. The use of radial basis functions results from a number of different concepts including function approximation, noisy interpolation, density estimation, and optimal classification theory [12].

Kstar.
Kstar is an instance-based classifier that differs from other instance-based learners in that it uses an entropy-based distance function [13]. The lazy family of data mining classifiers, which includes Kstar, supports incremental learning; these classifiers take less time for training and more time for predicting [14]. Kstar provides a consistent approach to handling symbolic attributes, real-valued attributes, and missing values [15]. It also uses the entropy-based distance function for instance-based regression: the predicted value of a test instance is derived from the values of the training instances that are similar to it [16].

KNN.
The k-Nearest-Neighbor (kNN) classifier offers a good classification accuracy rate for activity classification [17]. The kNN algorithm is based on the notion that similar instances behave similarly; thus, a new input instance is predicted from the most similar stored neighboring instances [18].
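The neighbor-based prediction described above can be illustrated with a minimal sketch. It assumes Euclidean distance and an unweighted mean of the k nearest targets; the function name `knn_predict` and the toy values are hypothetical and are not taken from the study.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict a numeric target as the mean of the k nearest training targets."""
    # Euclidean distance from the query to every stored training instance
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Toy data (illustrative only): two input variables and a water level target
train_X = [(0.0, 1.0), (0.1, 1.1), (0.9, 2.0), (1.0, 2.1)]
train_y = [10.0, 11.0, 20.0, 21.0]
prediction = knn_predict(train_X, train_y, (0.05, 1.05), k=2)
```

With k = 2, the query is averaged over its two closest stored instances; a larger k smooths the prediction at the cost of local detail.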

Random Forest.
A Random Forest can be applied to classification, regression, and unsupervised learning [19]. It is an ensemble learning method similar to bagging, in which each base learner is a decision tree. The forest obtains one output from every tree and combines them into the forecast, by voting for classification and by averaging for regression. Random Forest is simple and easily parallelized [20].
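The ensemble idea can be sketched in a few lines. The sketch below is a deliberately simplified stand-in, not a full Random Forest: it bags one-split regression trees (stumps) on bootstrap samples and averages their outputs, and it omits the random feature subsets that a Random Forest also uses. All function names and the toy data are hypothetical.

```python
import random

def fit_stump(X, y):
    """Fit a one-split regression tree by minimizing the squared error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [t for row, t in zip(X, y) if row[j] <= threshold]
            right = [t for row, t in zip(X, y) if row[j] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold, lm, rm)
    if best is None:  # no valid split (all rows identical): predict the mean
        mean = sum(y) / len(y)
        return (0, float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, row):
    j, threshold, left_mean, right_mean = stump
    return left_mean if row[j] <= threshold else right_mean

def fit_bagged(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return trees

def predict_bagged(trees, row):
    """Regression output: the average of all tree predictions."""
    return sum(predict_stump(t, row) for t in trees) / len(trees)

# Toy step-shaped target (illustrative only)
X = [(i / 10,) for i in range(10)]
y = [1.0 if i < 5 else 3.0 for i in range(10)]
forest = fit_bagged(X, y, n_trees=25, seed=0)
low = predict_bagged(forest, (0.0,))
high = predict_bagged(forest, (0.9,))
```

Because every tree is trained independently of the others, the loop in `fit_bagged` could be distributed across workers, which is the sense in which such ensembles are easily parallelized.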

Random Tree.
Random Tree is an ensemble learning algorithm that generates many individual learners and employs a bagging idea to produce a random set of data in the construction of a decision tree [21]. Random Tree classifiers can deal with regression and classification problems. Random trees can be generated efficiently and can be combined into large sets of random trees. This generally leads to accurate models [22]. The Random Tree classifier takes the input feature vector and classifies it with every tree in the forest. It then outputs the class label that received the majority of the votes [23].

Imputation.
The daily atmospheric data may have missing values due to human error or machine failure. Many previous studies have shown that statistical bias occurs when records with missing values are directly deleted. Imputing the data can therefore significantly improve the quality of the dataset, whereas biased results may cause poor performance in the models built afterwards [24]. Single imputation methods have several advantages, such as a wider scope than multiple imputation methods. Sometimes it is more important to find the missing values than to estimate the parameters [25]. The median of nearby points method orders the nearby values and selects their median to replace the missing value. Its advantage is that the replacement is actually a real value in the data [26]. The series mean method replaces a missing value with the mean of the whole variable. The regression imputation method uses simple linear regression to estimate the missing values and replace them. The mean of nearby points method replaces a missing value with the mean of the nearby values, where the number of nearby values is set with a "span of nearby points" option [27]. Linear imputation is most readily applicable to continuous explanatory variables [28].
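A minimal sketch of four of these imputation methods follows (regression imputation is omitted for brevity). It assumes that missing values are marked as `None` and that the span counts observed values on each side of the gap; the function names and the toy series are hypothetical, not the study's implementation.

```python
from statistics import mean, median

def series_mean(values):
    """Replace each missing value (None) with the mean of all observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def nearby(values, i, span):
    """Collect up to `span` observed values on each side of position i."""
    before = [v for v in values[:i] if v is not None][-span:]
    after = [v for v in values[i + 1:] if v is not None][:span]
    return before + after

def nearby_points(values, span=2, reducer=mean):
    """Replace missing values with the mean (or median) of nearby observed values."""
    return [
        reducer(nearby(values, i, span)) if v is None else v
        for i, v in enumerate(values)
    ]

def linear(values):
    """Linearly interpolate between the last and the next observed values."""
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for i, v in enumerate(values):
        if v is None:
            lo = max((k for k in known if k < i), default=None)
            hi = min((k for k in known if k > i), default=None)
            if lo is None or hi is None:
                continue  # no observed value on one side: leave the gap
            weight = (i - lo) / (hi - lo)
            out[i] = values[lo] + weight * (values[hi] - values[lo])
    return out

# Toy daily series with one missing day (illustrative values only)
levels = [240.0, 241.0, None, 243.0, 250.0]
by_series_mean = series_mean(levels)
by_nearby_mean = nearby_points(levels, span=1)
by_nearby_median = nearby_points(levels, span=2, reducer=median)
by_linear = linear(levels)
```

In this toy series the series mean is pulled upward by the distant value 250.0, while the nearby-point and linear methods stay close to the local neighbors of the gap.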

Variable Selection.
The variable selection method identifies, from several variables, the key variables that actually influence the forecasting target. It then deletes the unimportant variables to improve the model's efficiency. It can help with high-dimensional and complex problems. Previous studies in several fields have shown that variable selection can improve the forecasting efficiency of machine learning methods [29][30][31][32].
Variable selection is an important technique in data preprocessing. It removes irrelevant data and improves the accuracy and comprehensibility of the results [33]. Variable selection methods can be categorized into three classes: filter, wrapper, and embedded. Filter methods use statistical methods to select variables; they have better generalization ability and lower computational demands. Wrapper methods use classifiers to identify the best subset of components. Embedded methods have a deeper interaction between variable selection and construction of the classifier [34].
Filter models utilize statistical techniques such as principal component analysis (PCA), factor analysis (FA), independent component analysis, and discriminant analysis in the investigation of other indirect performance measures. These are mostly based on distance and information measures [35].
PCA transforms a set of feature columns in the dataset into a projection of the feature space with lower dimensionality. FA is a generalization of PCA; the main difference between PCA and FA is that FA allows noise to have nonspherical shape while transforming the data. The main goal of both PCA and FA is to transform the coordinate system such that correlation between system variables is minimized [36].
There are several methods to decide how many factors have to be extracted. The most widely used method for determining the number of factors is using eigenvalues greater than one [10].
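As a small worked illustration of the eigenvalue-greater-than-one rule, consider two standardized variables with correlation $r$. The value $r = 0.6$ below is hypothetical and is not taken from the dataset.

```latex
R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix},
\qquad
\det(R - \lambda I) = (1-\lambda)^2 - r^2 = 0
\;\Longrightarrow\;
\lambda_1 = 1 + r,\quad \lambda_2 = 1 - r .

% With r = 0.6 (assumed for illustration):
\lambda_1 = 1.6 > 1 \ (\text{retained}), \qquad
\lambda_2 = 0.4 < 1 \ (\text{dropped}), \qquad
\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{1.6}{2} = 0.8 .
```

Under this rule one factor would be kept in the example, and it would account for 80% of the total variance of the two variables.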

Proposed Model
Reservoirs are important domestically as well as for national defense and economic development. Thus, reservoir water levels should be forecast over a long period of time, and water resources should be planned and managed comprehensively to be cost-effective. This paper proposes a time-series forecasting model based on the imputation of missing values and variable selection. First, the proposed model uses five imputation methods (i.e., median of nearby points, series mean, mean of nearby points, linear, and regression imputation) to estimate the missing values and compares them with a delete strategy that removes the records with missing values. Second, to identify the key variables that influence the daily water level, the proposed method ranks the importance of the atmospheric variables via factor analysis and then sequentially removes the unimportant variables. Finally, the proposed model uses the Random Forest machine learning method to build a forecasting model of the reservoir water level and compares it with other methods. The proposed model can be partitioned into four parts: data preprocessing, imputation and feature selection, model building, and accuracy evaluation. The procedure is shown in Figure 1.

Computational Step.
To make the proposed model easier to understand, this section partitions it into four steps.
Step 1 (data preprocessing). The related water level and atmospheric data were collected from the reservoir administration website and the Taoyuan weather station. The two collected datasets are concatenated into an integrated dataset based on the date. There are nine independent variables and one dependent variable in the integrated dataset. The variables are defined in the integrated dataset and are listed in Table 1.
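The date-based integration in Step 1 can be sketched as follows. This is an illustration under assumptions: the field names (`date`, `water_level`, `rainfall`) are hypothetical, and the sketch keeps every reservoir date and marks absent atmospheric readings as `None` so that a later imputation step can treat them, which may differ from how the study built its dataset.

```python
def join_by_date(reservoir_rows, weather_rows, weather_fields):
    """Attach the atmospheric fields to each reservoir record sharing its date."""
    weather_by_date = {row["date"]: row for row in weather_rows}
    merged = []
    for row in reservoir_rows:
        weather = weather_by_date.get(row["date"], {})
        record = dict(row)
        for field in weather_fields:
            record[field] = weather.get(field)  # None marks a missing value
        merged.append(record)
    return merged

# Toy records (illustrative values only)
reservoir = [
    {"date": "2008-01-01", "water_level": 240.1},
    {"date": "2008-01-02", "water_level": 240.3},
]
weather = [{"date": "2008-01-01", "rainfall": 0.0}]
integrated = join_by_date(reservoir, weather, ["rainfall"])
```

Keeping the unmatched date in the output, rather than dropping it, leaves the decision between imputing and deleting to the next step.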
Step 2 (imputation). After Step 1, we found some variables and records with missing values in the integrated dataset due to mechanical measurement failure or human operation error. Previous studies showed that deleting missing values directly affects the results. To identify the imputation method that best fits the data, this paper used five imputation methods to estimate the missing values and compared them with the alternative of applying no imputation and directly deleting the records with missing values. The five imputation methods were the median of nearby points, series mean, mean of nearby points, linear imputation, and regression imputation. This gave six processed datasets after the missing values were handled. The problem is then how to identify the imputation method that is the better fit for the integrated dataset.
To determine this, the following steps were followed: (i) To rescale all numeric values into the range [0, 1], this step normalized each variable by dividing its values by the variable's maximal value, for the five imputed datasets and for the dataset with missing values deleted. All independent and dependent variables are positive values.
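A minimal sketch of this max-based rescaling, assuming each variable is held as a list of positive numbers (the function name `scale_by_max` and the toy column are hypothetical):

```python
def scale_by_max(column):
    """Divide every value by the column maximum; positive inputs map into (0, 1]."""
    peak = max(column)
    return [value / peak for value in column]

# Toy column (illustrative values only)
rainfall = [2.0, 4.0, 8.0]
scaled = scale_by_max(rainfall)
```

Dividing by the maximum keeps the ratios between observations intact, and it only yields values inside [0, 1] when the inputs are not negative, which matches the positivity noted in the step.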
(ii) The six normalized datasets are partitioned into 66% training datasets and 34% testing datasets. We also employed a 10-fold cross-validation approach to identify the imputation dataset that has best prediction performance.
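The two partitioning schemes in this step can be sketched on row indices. The sketch assumes an order-preserving 66%/34% split and interleaved folds; the study does not state whether rows were shuffled, so both choices, along with the function names, are assumptions.

```python
def percentage_split(n_rows, train_fraction=0.66):
    """Return (train, test) index lists; earlier rows go to the training part."""
    cut = round(n_rows * train_fraction)
    return list(range(cut)), list(range(cut, n_rows))

def k_fold_indices(n_rows, k=10):
    """Return k (train, test) index pairs; each row is held out exactly once."""
    folds = []
    for fold in range(k):
        test = list(range(fold, n_rows, k))
        held_out = set(test)
        train = [i for i in range(n_rows) if i not in held_out]
        folds.append((train, test))
    return folds

train_idx, test_idx = percentage_split(100)
folds = k_fold_indices(100, k=10)
```

In the cross-validation case, each evaluation index would be averaged over the ten held-out folds before the six datasets are compared.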
(iii) We applied five forecast methods, namely, Random Forest, RBF Network, Kstar, IBK (KNN), and Random Tree, and assessed them with five evaluation indices: correlation coefficient (CC), root mean squared error (RMSE), mean absolute error (MAE), relative absolute error (RAE), and root relative squared error (RRSE). The normalized dataset with the smaller error values (and the higher CC) across the indices identifies the better-fitting imputation method. Section 4 shows that the better imputation method is the mean of nearby points.
Step 3 (variable selection and model building). Step 2 determined the better imputation method (i.e., mean of nearby points). The next problem is to determine the key variables that influence the reservoir water level. Therefore, this step used factor analysis to rank the importance of the variables in order to build the best forecast model. This step proceeds as follows:
(i) The imputed integrated dataset is partitioned into a 66% training dataset and a 34% testing dataset.
(ii) Factor analysis ranks the importance of the variables.
(iii) The variable ranking from factor analysis is used to iteratively delete the least important variable. The remaining variables are evaluated with Random Forest, RBF Network, Kstar, KNN, and Random Tree until the RMSE can no longer be improved.
(iv) Based on (iii), the key variables are the set for which the lowest RMSE is achieved.
(v) Concurrently, we used the five evaluation indices (CC, RMSE, RRSE, MAE, and RAE) to determine which forecast method gives a good forecasting model.
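The iterative deletion in Step 3 can be sketched as a simple loop. The sketch assumes the variables arrive already ranked from most to least important and that a caller-supplied function, here the hypothetical `evaluate_rmse`, trains a forecast model on a given variable subset and returns its test RMSE. The variable names and RMSE values are made up for illustration and are not results from the study.

```python
def backward_elimination(ranked_variables, evaluate_rmse):
    """Drop the least important variable while the RMSE keeps improving.

    ranked_variables: variable names ordered from most to least important.
    evaluate_rmse: hypothetical callback mapping a variable list to an RMSE.
    """
    selected = list(ranked_variables)
    best_rmse = evaluate_rmse(selected)
    while len(selected) > 1:
        candidate = selected[:-1]  # remove the lowest-ranked remaining variable
        rmse = evaluate_rmse(candidate)
        if rmse >= best_rmse:
            break  # no further improvement: keep the current set
        selected, best_rmse = candidate, rmse
    return selected, best_rmse

# Illustrative stand-in for model training: RMSE keyed by the number of variables
ranked = [f"var{i}" for i in range(1, 10)]
illustrative_rmse = {9: 0.30, 8: 0.25, 7: 0.20, 6: 0.22}
selected, rmse = backward_elimination(ranked, lambda vs: illustrative_rmse[len(vs)])
```

The loop stops at the first deletion that fails to lower the RMSE, so it follows the fixed ranking and never searches over all variable subsets.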
Step 4 (evaluation and comparison). To verify the performance of the reservoir water level forecasting model, this step uses the superior imputed dataset with different selected variables to evaluate the proposed method. It then compares the results with the listed methods. This study used CC, RMSE, MAE, RAE, and RRSE to evaluate the forecast model. The five criteria indices are given in equations (1)-(5).
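Root Mean Squared Error (RMSE)
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{1}$$
where $y_i$ is the actual observation value of the data, $\hat{y}_i$ is the forecast value of the model, and $n$ is the sample number.

Correlation Coefficient (CC)
$$\mathrm{CC} = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 \sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^2}} \tag{2}$$
where $y_i$ and $\hat{y}_i$ are the observed and predicted values, respectively; $\bar{y}$ and $\bar{\hat{y}}$ are the means of the observed and predicted values.

Root Relative Squared Error (RRSE)
$$\mathrm{RRSE} = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}} \tag{3}$$
where $\hat{y}_i$ is the predicted value, $y_i$ is the actual value, and $\bar{y}$ is the mean of the actual values.

Mean Absolute Error (MAE)
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{4}$$
Here, $n$ is the number of observations, $\hat{y}_i$ is the forecast value at time $i$, and $y_i$ is the actual value at time $i$.

Relative Absolute Error (RAE)
$$\mathrm{RAE} = \frac{\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|}{\sum_{i=1}^{n}\left|y_i - \bar{y}\right|} \tag{5}$$
where $y_i$ is the actual observation value of the data, $\bar{y}$ is the mean value of the $y_i$, $\hat{y}_i$ is the forecast value of the model, and $n$ is the sample number.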
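The five evaluation indices named in this section can be computed with a short sketch that follows their standard definitions. The function name `evaluation_indices` and the toy values are hypothetical; the sketch assumes the actual and predicted values are not constant, since CC, RAE, and RRSE would otherwise divide by zero.

```python
import math

def evaluation_indices(actual, predicted):
    """Return CC, RMSE, MAE, RAE, and RRSE for paired actual/predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    abs_err = sum(abs(a - p) for a, p in zip(actual, predicted))
    sq_dev_a = sum((a - mean_a) ** 2 for a in actual)
    abs_dev_a = sum(abs(a - mean_a) for a in actual)
    sq_dev_p = sum((p - mean_p) ** 2 for p in predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    return {
        "CC": cov / math.sqrt(sq_dev_a * sq_dev_p),
        "RMSE": math.sqrt(sq_err / n),
        "MAE": abs_err / n,
        "RAE": abs_err / abs_dev_a,      # error relative to predicting the mean
        "RRSE": math.sqrt(sq_err / sq_dev_a),
    }

# Toy example (illustrative only): every forecast is 0.5 too high
scores = evaluation_indices([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

The toy example shows why CC alone is not sufficient: a constant offset leaves CC at 1 while RMSE and MAE both report the 0.5 error.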

Experimental Results
This section verifies the performance of the proposed forecast model and compares the results with the listed methods. To determine which imputation method performs best, this study uses the daily data collected from the monitoring station and from the website of the Water Resources Agency for Taiwan's Shimen Reservoir. This work also compares the proposed model with the listed models with and without variable selection.

Forecast and Comparison.
Following the computational steps in Section 3, this section uses the collected real-world dataset to illustrate the proposed model and compare it with the listed methods, as detailed below.
(1) To handle the missing values in the dataset, this study applies the series mean, regression, mean of nearby points, linear, and median of nearby points imputation methods to estimate the missing values, and it also directly deletes the records with missing values, to determine which strategy has the best performance. The results are reported in Tables 3 and 4, and we can see that the mean of nearby points method outperforms the other methods in CC, MAE, RAE, RRSE, and RMSE. Therefore, the mean of nearby points imputation method is the better choice for forecasting the Shimen Reservoir water level.
(2) Variable selection and model building use the results from above. The best imputation method is the mean of nearby points. Therefore, this study uses the dataset imputed by the mean of nearby points to select the variables and build the model. For variable selection, this study uses factor analysis to rank the importance of the independent variables in this imputed dataset. As a rule of thumb, only factor loadings with an absolute value greater than 0.5, which explains around 25% of the variance, should be interpreted [37]. We can see that the loadings of the two deleted variables are smaller than 0.5, as seen in Table 5. In forecasting performance, the Random Forest model, both with variable selection and with the full set of variables, is better than the listed models.

Findings.
After variable selection and model building, some key findings can be highlighted: (1) Imputation: after the two collected datasets were joined into an integrated dataset, the integrated dataset contained missing values due to human error or mechanical failure. Tables 3 and 4 show the accuracy of five machine learning forecast models on the integrated dataset processed with the median of nearby points, series mean, mean of nearby points, linear, and regression imputation methods and with the delete strategy. The results show that the integrated dataset imputed with the mean of nearby points has the better forecasting performance.
(2) Variable selection: this study uses factor analysis to rank the variables and then sequentially deletes the least important variables until the forecasting performance no longer improves. Tables 6 and 7 report the results under the five evaluation indices. The proposed time-series forecasting model is feasible for forecasting water levels in Shimen Reservoir.

Conclusion
This study proposed a time-series forecasting model for water level forecasting in Taiwan's Shimen Reservoir. The experiments showed that the mean of nearby points imputation method has the best performance. The experimental results indicate that the Random Forest forecasting model, both with variable selection and with the full set of variables, has better forecasting performance than the listed models under the five evaluation indices. The key variables are Reservoir IN, Temperature, Reservoir OUT, Pressure, Rainfall, Rainfall Dasi, and Relative Humidity. This shows that the proposed time-series forecasting model is feasible for forecasting water levels in Shimen Reservoir. Future work will address the following issues: (1) The reservoir's utility includes irrigation, domestic water supply, and electricity generation. The key variables identified here could improve forecasting in these fields.
(2) We might apply the proposed time-series forecasting model based on imputation and variable selection to forecast the water level of lakes, salt water bodies, reservoirs, and so on.
In addition, this experiment shows that the proposed variable selection can help the five forecast methods used here improve their forecasting capability.

Conflicts of Interest
The authors declare that they have no conflicts of interest.