Accurate rainfall estimation is important for the effective use of water resources and the optimal planning of water structures. For this purpose, models were developed to estimate rainfall in Isparta using the data-mining process. Input combinations with 1, 2, 3, and 4 parameters were tried using the rainfall values of the Senirkent, Uluborlu, Eğirdir, and Yalvaç stations in Isparta. Among the models developed with various data-mining algorithms, multilinear regression was determined to be the most appropriate. The input parameters of the multilinear regression model were the monthly rainfall values of the Senirkent, Uluborlu, and Eğirdir stations. The relative error of this model was calculated as 0.7%. The results show that the data-mining process can be used to estimate missing rainfall values.
Meteorological events continually affect human life. Because these phenomena cannot be controlled and have important consequences for human life, accurate estimation and analysis of meteorological variables are very important. Precipitation, which generates flow, is an important parameter. Extreme rainfall in a short time causes significant events such as floods, whereas insufficient rainfall over a long period causes drought. Thus, rainfall estimation is very important in terms of its effects on human life, water resources, and water usage areas. However, rainfall is affected by geographical and regional variations and features, and is therefore very difficult to estimate. Many recent studies have applied artificial intelligence methods to rainfall estimation [
One of the aims of collecting data from many sources and storing it in databases is to convert raw data into information. This process of converting data into information is called the data-mining (DM) process. In recent years, the use of the DM process in the field of hydrology has been increasing, and studies using the DM process have been performed in many areas [
The aim of this study is to evaluate the use of the data-mining process to estimate the rainfall of Isparta in Turkey. The study uses rainfall data from the Senirkent, Uluborlu, Eğirdir, and Yalvaç stations in Isparta.
Knowledge discovery is a process that extracts implicit, potentially useful or previously unknown information from the data. The knowledge discovery process is described in Figure
Knowledge discovery process.
In the knowledge discovery process shown in the figure, data coming from a variety of sources is integrated into a single data store called the target data. The data is then preprocessed and transformed into a standard format. The data-mining algorithms process the data and output patterns or rules. Those patterns and rules are then interpreted as new or useful knowledge or information.
The ultimate goal of the knowledge discovery and data-mining process is to find the patterns hidden in huge sets of data and interpret them as useful knowledge and information. As the process diagram shows, data-mining is a central part of the knowledge discovery process.
Data-mining is defined as “the process of extracting previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions” [
Data-mining can also be defined as the computer-aided process that digs through and analyzes enormous sets of data and then extracts knowledge or information from them. By its simplest definition, data-mining automates the detection of relevant patterns in a database [
Knowledge discovery in databases (KDD) has emerged as a new technology through the fast development and broad application of information and database technologies. The KDD process is defined as an iterative sequence of four steps: defining the problem, data preprocessing (data preparation), data-mining, and post data-mining.
The goals of a knowledge discovery project must be identified. The goals must be verified as actionable. For example, if the goals are met, a business organization can then put the newly discovered knowledge to use. The data to be used must also be identified clearly.
Data preparation comprises those techniques concerned with analyzing raw data so as to yield quality data, mainly including data collecting, data integration, data transformation, data cleaning, data reduction, and data discretization.
Given the cleaned data, intelligent methods are applied in order to extract data patterns. Patterns of interest are searched for, including classification rules or trees, regression, clustering, sequence modeling, dependency, and so forth.
Post data-mining consists of pattern evaluation, deploying the model, maintenance, and knowledge presentation.
The KDD process is iterative. For example, while cleaning and preparing data, it might be discovered that data from a certain source is unusable, or that data from a previously unidentified source is required to be merged with the other data under consideration. Often, the first time through, the data-mining step will reveal that additional data cleaning is required [
In this study, the data used to develop the rainfall estimation models are the monthly rainfall data of the Isparta, Senirkent, Uluborlu, Eğirdir, and Yalvaç stations. Isparta is located in the Lakes Region in the north of the Mediterranean Region, between 30°20′ and 31°33′ east longitudes and 37°18′ and 38°30′ north latitudes. Isparta has a surface area of 8933 km² and an average altitude of 1050 m. The average annual total rainfall of Isparta is 440.3 kg/m². Most of the rainfall (72.69%) occurs in the winter and spring months. The summer and autumn months are quite dry (29.31% of total rainfall). In winter, precipitation in the region usually falls as rain and occasionally as snow; in the spring and summer months, it falls in the form of rainstorms. The study region and the locations of the rain gauges are shown in Figure
Locations of rain gauges in Isparta.
The monthly rainfall data for 1964–2005 used in this study were obtained from the Turkish State Meteorological Service. Various rainfall estimation models were developed for Isparta using the rainfall values of the Senirkent, Uluborlu, Eğirdir, and Yalvaç stations as input parameters. The data were checked for missing values, and mean values were substituted for any that were missing. The training dataset, covering 1964–1996, was used to develop the models. The trained models were then run on the testing dataset, covering 1997–2005.
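The preparation steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that missing months are stored as `None`, that "mean values" refers to the mean of the observed values in the same series, and that records are `(year, month, value)` tuples. The function names and the sample numbers are hypothetical.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def split_by_year(records, last_training_year=1996):
    """Split (year, month, value) records chronologically into train and test."""
    train = [r for r in records if r[0] <= last_training_year]
    test = [r for r in records if r[0] > last_training_year]
    return train, test


# Synthetic illustration only; these are not measurements from the study.
series = [30.0, None, 50.0, 40.0]
filled = fill_missing_with_mean(series)

# One placeholder record per month of 1964-2005.
records = [(year, month, 0.0)
           for year in range(1964, 2006)
           for month in range(1, 13)]
train, test = split_by_year(records)
print(len(train), len(test))
```

Splitting by year rather than at random keeps the test period strictly after the training period, which matches the 1964–1996 and 1997–2005 division described in the text.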
In the model assessment stage, after a set of models had been built using different algorithms, the models were evaluated in terms of accuracy. There are a few popular criteria for evaluating the quality of a model. The criteria chosen were the coefficient of determination (
The root mean square error represents the error of the model and is defined as
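The two evaluation criteria can be sketched in code under their standard definitions. The sketch assumes that RMSE is the square root of the mean squared difference between observed and estimated values, and that the coefficient of determination is the squared Pearson correlation between them. The latter is an assumption, since another common definition is one minus the ratio of residual to total sum of squares. The function names and sample values are illustrative only.

```python
import math


def rmse(observed, estimated):
    """Root mean square error between two equal-length sequences."""
    n = len(observed)
    squared = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    return math.sqrt(squared / n)


def r_squared(observed, estimated):
    """Coefficient of determination as squared Pearson correlation (assumed)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_e = sum(estimated) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(observed, estimated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    return cov ** 2 / (var_o * var_e)


# Synthetic values for illustration; not data from the study.
observed = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 37.0]
print(rmse(observed, estimated), r_squared(observed, estimated))
```

A lower RMSE and a coefficient of determination closer to 1 both indicate a better fit, which is how the models in the tables are compared.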
For rainfall estimation, the Decision Table, KStar, Multilinear Regression, M5’Rules, Multilayer Perceptron, RBF Network, Random Subspace, and Simple Linear Regression algorithms were used in this study. Fifteen models were developed using different input combinations of the rainfall values of the Senirkent, Uluborlu, Eğirdir, and Yalvaç stations to estimate the rainfall of the Isparta station. These models, with 1, 2, 3, and 4 input parameters, are given in Tables
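The count of fifteen models follows from taking every non-empty subset of the four input stations: 4 single stations, 6 pairs, 4 triples, and 1 set of all four. A short sketch of this enumeration is given below; the function name is illustrative.

```python
from itertools import combinations

STATIONS = ["Senirkent", "Uluborlu", "Eğirdir", "Yalvaç"]


def input_combinations(stations):
    """Return every non-empty combination of stations, smallest first."""
    combos = []
    for size in range(1, len(stations) + 1):
        combos.extend(combinations(stations, size))
    return combos


combos = input_combinations(STATIONS)
counts = {size: sum(1 for c in combos if len(c) == size)
          for size in range(1, 5)}
print(len(combos))  # 15
print(counts)       # {1: 4, 2: 6, 3: 4, 4: 1}
```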
The performance criteria of the models having 1-input parameter.
Input parameters | Eğirdir | | Senirkent | | Uluborlu | | Yalvaç | |
---|---|---|---|---|---|---|---|---|
Models | R² | RMSE | R² | RMSE | R² | RMSE | R² | RMSE |
Decision Table | 0.254 | 141.5 | 0.695 | 57.90 | 0.638 | 68.62 | 0.531 | 89.10 |
KStar | 0.686 | 59.60 | 0.641 | 68.14 | 0.648 | 66.82 | 0.543 | 86.70 |
Multilinear Regression | 0.671 | 62.49 | 0.745 | 48.44 | 0.717 | 53.63 | 0.616 | 72.84 |
M5’Rules | 0.671 | 62.49 | 0.745 | 48.44 | 0.717 | 53.63 | 0.616 | 72.84 |
Multilayer Perceptron | 0.711 | 54.89 | 0.649 | 66.58 | 0.653 | 65.81 | 0.578 | 80.06 |
RBF Network | 0.533 | 88.67 | 0.641 | 68.13 | 0.672 | 62.28 | 0.495 | 95.81 |
Random Subspace | 0.617 | 72.71 | 0.634 | 69.56 | 0.590 | 77.77 | 0.492 | 96.43 |
Simple Linear Regression | 0.671 | 62.49 | 0.745 | 48.44 | 0.717 | 53.63 | 0.616 | 72.84 |
The performance criteria of the models having 2-input parameters.
Input parameters | Eğirdir-Uluborlu | | Eğirdir-Yalvaç | | Eğirdir-Senirkent | | Senirkent-Uluborlu | | Senirkent-Yalvaç | | Uluborlu-Yalvaç | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Models | R² | RMSE | R² | RMSE | R² | RMSE | R² | RMSE | R² | RMSE | R² | RMSE |
Decision Table | 0.638 | 68.62 | 0.254 | 141.52 | 0.695 | 57.90 | 0.695 | 57.90 | 0.695 | 57.90 | 0.638 | 68.62 |
KStar | 0.765 | 44.52 | 0.732 | 50.83 | 0.751 | 47.21 | 0.727 | 51.81 | 0.668 | 62.93 | 0.684 | 60.07 |
Multilinear Regression | 0.807 | 36.65 | 0.743 | 48.80 | 0.792 | 39.40 | 0.765 | 44.60 | 0.745 | 48.44 | 0.717 | 53.63 |
M5’Rules | 0.807 | 36.65 | 0.743 | 48.80 | 0.792 | 39.40 | 0.765 | 44.60 | 0.745 | 48.44 | 0.717 | 53.63 |
Multilayer Perceptron | 0.796 | 38.64 | 0.743 | 48.69 | 0.746 | 48.29 | 0.678 | 61.08 | 0.662 | 64.12 | 0.670 | 62.64 |
RBF Network | 0.663 | 63.87 | 0.550 | 85.41 | 0.568 | 81.96 | 0.647 | 67.02 | 0.556 | 84.19 | 0.567 | 82.21 |
Random Subspace | 0.782 | 41.45 | 0.620 | 72.05 | 0.725 | 52.27 | 0.695 | 57.93 | 0.610 | 74.12 | 0.638 | 68.65 |
Simple Linear Regression | 0.717 | 53.63 | 0.671 | 62.49 | 0.745 | 48.44 | 0.745 | 48.44 | 0.745 | 48.44 | 0.717 | 53.63 |
The performance criteria of the models having 3-input parameters.
Input parameters | Senirkent-Uluborlu-Eğirdir | | Senirkent-Uluborlu-Yalvaç | | Senirkent-Yalvaç-Eğirdir | | Uluborlu-Yalvaç-Eğirdir | |
---|---|---|---|---|---|---|---|---|
Models | R² | RMSE | R² | RMSE | R² | RMSE | R² | RMSE |
Decision Table | 0.695 | 57.90 | 0.695 | 57.90 | 0.695 | 57.90 | 0.638 | 68.62 |
KStar | 0.771 | 43.54 | 0.693 | 58.20 | 0.745 | 48.33 | 0.771 | 43.43 |
Multilinear Regression | 0.813 | 35.43 | 0.765 | 44.60 | 0.792 | 39.40 | 0.798 | 38.38 |
M5’Rules | 0.808 | 36.43 | 0.765 | 44.60 | 0.792 | 39.40 | 0.711 | 54.89 |
Multilayer Perceptron | 0.774 | 42.83 | 0.726 | 51.98 | 0.772 | 43.33 | 0.797 | 38.55 |
RBF Network | 0.622 | 71.67 | 0.560 | 83.48 | 0.583 | 79.23 | 0.574 | 80.90 |
Random Subspace | 0.760 | 45.62 | 0.680 | 60.83 | 0.714 | 54.31 | 0.757 | 46.12 |
Simple Linear Regression | 0.745 | 48.44 | 0.745 | 48.44 | 0.745 | 48.44 | 0.717 | 53.63 |
The performance criteria of the models having 4-input parameters.
Models | R² | RMSE |
---|---|---|
Decision Table | 0.695 | 57.90 |
KStar | 0.761 | 45.33 |
Multilinear Regression | 0.806 | 36.89 |
M5’Rules | 0.766 | 44.35 |
Multilayer Perceptron | 0.774 | 42.91 |
RBF Network | 0.573 | 80.95 |
Random Subspace | 0.757 | 46.17 |
Simple Linear Regression | 0.745 | 48.44 |
First, the relationships between the rainfall data of the Isparta station and those of the other stations (Senirkent, Uluborlu, Eğirdir, and Yalvaç) were investigated using statistical analyses. In order of their effect on the Isparta station, the variables were ranked as Senirkent, Uluborlu, Eğirdir, and Yalvaç. The performance criteria of the models developed with 1-input parameters are given in Table
Examining the models given in Table
As seen from Table
It was shown that the
It was shown that the
Comparison plot for MLR model.
Time series for MLR model.
For the Isparta region, the developed MLR model gave the best results in estimating rainfall. Because the MLR models were developed for the Isparta region, they cannot be used to estimate the rainfall of another region. For a different region, the models need to be re-established or calibrated according to the data of that region. In the future, when more data are obtained, the developed models will need to be revised. Other methods may give better results than MLR when more data are added or when a model is developed for a different region.
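Recalibrating a multilinear regression model for a new region amounts to refitting its coefficients on that region's data. The sketch below shows an ordinary least-squares fit with three inputs, solved through the normal equations in plain Python. It is an illustration under stated assumptions, not the tool used in the study: the function names are hypothetical, and the input values are synthetic, generated from known coefficients so that the fit can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; inputs may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(inputs, target):
    """Least-squares fit of target = b0 + b1*x1 + ... + bk*xk."""
    rows = [[1.0] + list(x) for x in inputs]  # leading 1.0 is the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, x):
    """Evaluate the fitted model for one set of input values."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], x))


# Synthetic monthly values for three input stations; not data from the study.
inputs = [(10.0, 20.0, 15.0), (40.0, 35.0, 50.0), (5.0, 8.0, 2.0),
          (60.0, 55.0, 70.0), (25.0, 30.0, 20.0), (0.0, 3.0, 1.0)]
# Target generated from known coefficients so the fit can be verified.
target = [2.0 + 0.5 * a + 0.3 * b + 0.2 * c for a, b, c in inputs]

coefficients = fit_mlr(inputs, target)
print([round(c, 6) for c in coefficients])
```

With real station data the fit would not be exact, and the resulting coefficients would apply only to the region whose records were used to estimate them.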
Rainfall, which is an important factor in the use of water resources, is a difficult variable to estimate. In this study, the data-mining process was used to estimate the monthly rainfall values of Isparta. The monthly rainfall data of the Senirkent, Uluborlu, Eğirdir, and Yalvaç stations were used to develop rainfall estimation models. When the developed models were compared with measured values, the multilinear regression model gave more appropriate results than the other models developed in this study. The input parameters of the best model were the rainfall values of the Senirkent, Uluborlu, and Eğirdir stations. Consequently, it was shown that the data-mining process, which produces a solution more quickly than traditional methods, can be used to complete missing data in rainfall estimation.