Floods are among the most hazardous natural disasters, and their management relies heavily on precise forecasts. These forecasts are provided by physical models based on differential equations. However, these models depend on unreliable inputs such as measurements and parameter estimates, which causes undesirable inaccuracies. Thus, a data-mining analysis of the physical model and its precision, based on features that characterize distinct situations, appears helpful for adjusting the physical model. An application of the fuzzy GUHA method to flood peak prediction is presented. Measured water flow rate data from a flood prediction system were used to mine fuzzy association rules expressed in natural language. The provided data were first extended by generating artificial variables (features). The resulting variables were then translated into fuzzy GUHA tables with the help of Evaluative Linguistic Expressions in order to mine associations. The associations found were interpreted as fuzzy IF-THEN rules and used jointly with the Perception-based Logical Deduction inference method to predict the expected time shift of flow rate peaks forecasted by the given physical model. Results obtained from this adjusted model were statistically evaluated, and the improvement in forecasting accuracy was confirmed.
Disaster management is becoming an increasingly important task. Among natural disasters, floods are among the most hazardous and, moreover, among the most frequent in Central Europe. Researchers invest enormous effort in investigating flood models that would help to forecast floods and thus provide disaster management with reliable decision support that could be used to prevent further deaths and material losses.
One such long-term research effort, focusing on disaster management and especially on modeling and forecasting floods, gave rise to FLOREON, a system for emergent flood prediction [
Therefore, it seems appropriate to analyze the performance of the system in a way that gives at least a rough idea of the conditions under which the system works well, the conditions under which it is imprecise, and the conditions under which we are able to correct the forecast. Given the sources of the imprecision, a data-mining technique that involves fuzziness may provide promising results and is worth attempting. In this investigation, we address the problem outlined above with the help of the fuzzy GUHA method, that is, a specific variant of the association mining technique that allows the concepts of fuzzy logic in a broad sense to be used [
The data being analyzed come from measurements of the water flow rate of the Odra River in Ostrava, Czech Republic. Measuring stations provide the flow rate [m³/s] on an hourly basis. The goal is to forecast the future flow rate. This is done by the so-called Math-1D model [
The Math-1D model is a differential-equation-based model of the flow rate. To produce flow rate forecasts, it uses information about precipitation (past and forecasted), soil type, river bank shape, and other parameters. Although it is a well-established and empirically examined physical model, it is not sufficiently reliable. The reason lies not in the model itself but in the fact that most of the parameters and input data are highly imprecise. For example, the soil type is provided by an expert hydrologist but, due to natural limitations, without a deeper geological analysis; moreover, the same soil type is assumed for the whole course of the river.
Given these limitations, the Math-1D forecasts depend mainly on the measured past precipitation and flow rates and on the forecasted precipitation. Thus, the provided forecast, though often reliable, may at times be highly imprecise. The imprecision may be viewed from two perspectives: vertical and horizontal. Vertical imprecision means overestimation or, even worse, underestimation of the flow rate at the culminating peak. For our investigation, the second kind, horizontal imprecision, is crucial: we focus on precision in terms of time, that is, on the question of whether and under which conditions the model forecasts the peak discharge earlier or later than it occurs, and how large the time shift of the peak is.
The vertical as well as horizontal imprecision may be significant. As one can see from an exemplary forecast in Figure
Particular example of real measured values (in black) and the Math-1D model simulation (in gray). Flow rate values [m3/s] on the vertical axis are measured on a particular measuring station placed on the Odra River starting from time −119 up to 0 (horizontal axis, time [
Our task is to analyze and forecast the peak shift on the horizontal (time) axis. In other words, the task is to build a model that, based on the flow rate measurements and the past performance of the Math-1D model, provides disaster management with valuable information about the possible horizontal imprecision of the Math-1D model and, in addition, with an estimate of the peak shift. This estimate could be used to correct the forecasts.
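To make the notion of horizontal imprecision concrete, the following minimal sketch computes a peak shift from two hourly series. The function names, the toy data, and the sign convention (forecasted peak time minus measured peak time, so that a positive value means the model places the peak too late) are assumptions made for illustration only.

```python
from typing import Sequence


def peak_time(flow: Sequence[float], times: Sequence[int]) -> int:
    """Return the time stamp at which the flow rate attains its (first) maximum."""
    if not flow or len(flow) != len(times):
        raise ValueError("flow and times must be non-empty and of equal length")
    idx = max(range(len(flow)), key=lambda i: flow[i])
    return times[idx]


def peak_shift(forecast: Sequence[float],
               measured: Sequence[float],
               times: Sequence[int]) -> int:
    """Forecasted peak time minus measured peak time (assumed sign convention)."""
    return peak_time(forecast, times) - peak_time(measured, times)


# Hypothetical hourly flow rates [m3/s]; not taken from the analyzed data set.
times = list(range(8))
measured = [10, 14, 25, 40, 33, 21, 15, 12]
forecast = [10, 12, 18, 27, 36, 42, 30, 20]

shift = peak_shift(forecast, measured, times)
print(f"peak shift: {shift} h")
```

In this toy example the forecasted peak occurs two hours after the measured one, so the shift is positive.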
In this section, we introduce the fundamental theoretical background used in our investigation. As there is no space to introduce all the theoretical concepts in detail, we provide only a brief introduction and refer readers to further sources [
One of the main constituents of systems of fuzzy/linguistic IF-THEN rules is
A simple form of evaluative expressions keeps the following structure:
Linguistic hedges and their abbreviations.
Narrowing effect | Widening effect |
---|---|
Very (Ve) | More or less (ML) |
Significantly (Si) | Roughly (Ro) |
Extremely (Ex) | Quite roughly (QR) |
— | Very roughly (VR) |
Shapes of extensions (fuzzy sets) of evaluative linguistic expressions. DEE denotes the defuzzified values obtained using the
Evaluative expressions of the form (
Examples of evaluative predications are “temperature is very high,” “price is low,” and so forth. The model of the meaning of evaluative expressions and predications makes distinction between
The intension of an evaluative predication “
Given an intension (
Evaluative predications occur in conditional clauses of natural language of the form
Fuzzy/linguistic IF-THEN rules are gathered in a
The
We also need to consider a linguistic phenomenon of topic-focus articulation (cf. [
To be able to state relationships among evaluative expressions, for example, when one expression “covers” another, we need an ordering relation defined on the set of them. Let us start with the ordering on the set of linguistic hedges. We may define the ordering
We extend the theory of evaluative linguistic expressions by the following
Based on
In other words, evaluative expressions of the same type are ordered according to their specificity which is given by the hedges appearing in the expressions. If we are given two evaluative predications with an atomic expression of a different type, we cannot order them by
Finally, we define the ordering
It should be noted that usually the
In this case, the ordering
Based on the ordering
Let
Once one or more antecedents
Suppose that
In many applications, the inferred output fuzzy set
In this paper, we employ the so-called linguistic associations mining [
Standard GUHA table.
|
|
|
| |
---|---|---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The original GUHA allowed only boolean attributes to be involved; see [
The goal of the GUHA method is to search for linguistic associations of the form
Classical GUHA fourfold table.
|
Not |
|
---|---|---|
|
|
|
Not |
|
|
Symbol
The relationship between the antecedent and the consequent is described by so-called
For example, let us consider Table
Depending on the chosen confidence and support degree, the GUHA method could generate, for example, the following linguistic association:
Example of GUHA table.
|
|
|
|
|
|
|
---|---|---|---|---|---|---|
|
1 | 0 | 1 | 0 | 1 | 0 |
|
0 | 1 | 0 | 1 | 0 | 1 |
|
0 | 1 | 1 | 0 | 0 | 1 |
|
1 | 0 | 1 | 0 | 1 | 0 |
|
0 | 1 | 0 | 1 | 0 | 1 |
|
|
|
|
|
|
|
|
0 | 0 | 0 | 1 | 0 | 1 |
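As an illustration of how an association is assessed on a boolean GUHA table, the sketch below counts the four frequencies of a fourfold table and derives support and confidence from them. The formulas used, support = a/(a+b+c+d) and confidence = a/(a+b), are the usual association-mining definitions and are assumed here; the function names and the two columns are hypothetical and are not taken from the example table.

```python
from typing import Sequence, Tuple


def fourfold(antecedent: Sequence[int],
             consequent: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count the fourfold-table frequencies (a, b, c, d) of two boolean columns."""
    a = b = c = d = 0
    for x, y in zip(antecedent, consequent):
        if x and y:
            a += 1          # antecedent and consequent both hold
        elif x:
            b += 1          # antecedent holds, consequent does not
        elif y:
            c += 1          # consequent holds, antecedent does not
        else:
            d += 1          # neither holds
    return a, b, c, d


def support(a: int, b: int, c: int, d: int) -> float:
    """Fraction of all objects satisfying both antecedent and consequent."""
    return a / (a + b + c + d)


def confidence(a: int, b: int, c: int, d: int) -> float:
    """Fraction of objects satisfying the antecedent that also satisfy the consequent."""
    return a / (a + b) if (a + b) else 0.0


# Hypothetical boolean columns for eight objects.
antecedent = [1, 1, 1, 0, 0, 1, 0, 1]
consequent = [1, 1, 0, 0, 1, 1, 0, 0]

table = fourfold(antecedent, consequent)
print(table, support(*table), confidence(*table))
```

An association would be reported only if both measures reach the thresholds chosen by the analyst.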
According to [
In many situations, including our situation, it is better to define fuzzy sets on the numerical variables and use the fuzzy variant of the GUHA method [
In the fuzzy variant of the method, the attributes are not boolean but rather vague. The minimum (resp., maximum) of a particular attribute becomes
For example, instead of defining a boolean variable
Example of a part of a fuzzy GUHA table with variable BMI and one canonical adjective Sm.
|
|
|
|
|
|
|
|
---|---|---|---|---|---|---|---|
|
0.5 | 0.6 | 0.7 | 0.8 |
|
1 | 1 |
|
0.8 | 0.9 | 1 | 1 |
|
1 | 1 |
|
0 | 0 | 0 | 0.1 |
|
0.3 | 0.4 |
|
0 | 0 | 0 | 0 |
|
0 | 0 |
|
0.6 | 0.9 | 1 | 1 |
|
1 | 1 |
|
|
|
|
|
|
|
|
|
0 | 0.5 | 0.8 | 0.9 |
|
1 | 1 |
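The following sketch shows how a single numeric value can be turned into a row of membership degrees, one per hedged expression. It is only a stand-in: the linear shape, the cut-off values, and the context bounds are assumptions chosen for illustration and are not the extensions of evaluative linguistic expressions used in this paper. The only property it is meant to reproduce is that narrowing hedges yield smaller degrees and widening hedges larger ones.

```python
from typing import Dict

# Hypothetical cut-offs on the normalized context [0, 1]; a larger cut-off
# means a wider (less specific) expression. "" denotes the empty hedge.
HEDGE_CUTOFF: Dict[str, float] = {
    "Ex": 0.15, "Si": 0.20, "Ve": 0.25, "": 0.35,
    "ML": 0.40, "Ro": 0.45, "QR": 0.50, "VR": 0.55,
}


def small(x: float, lo: float, hi: float, cutoff: float) -> float:
    """Stand-in membership degree of x in 'small' within the context [lo, hi].

    The degree is 1 at lo and decreases linearly to 0 at the normalized
    position `cutoff`.
    """
    if hi <= lo or cutoff <= 0:
        raise ValueError("require lo < hi and cutoff > 0")
    u = min(max((x - lo) / (hi - lo), 0.0), 1.0)   # normalize into [0, 1]
    return max(0.0, 1.0 - u / cutoff)


def fuzzy_row(x: float, lo: float, hi: float) -> Dict[str, float]:
    """Membership degrees of x in every hedged variant of 'small'."""
    return {f"{hedge}Sm": small(x, lo, hi, cut)
            for hedge, cut in HEDGE_CUTOFF.items()}


# Hypothetical context and value.
row = fuzzy_row(18.0, lo=15.0, hi=40.0)
print({name: round(degree, 2) for name, degree in row.items()})
```

Because the cut-offs increase from the narrowing to the widening hedges, the degrees in each row are non-decreasing from left to right, as in the table above.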
In this way we treat every quantitative variable so that the final fuzzy GUHA table will look similarly to Table
Example of fuzzy GUHA table (compared with Table
|
|
|
|
|
|
|
|
|
---|---|---|---|---|---|---|---|---|
|
0.5 |
|
0.7 |
|
0 | 1 |
|
0 |
|
0.8 |
|
1 |
|
0 | 0.4 |
|
0 |
|
0 |
|
0 |
|
0.1 | 0 |
|
0.4 |
|
0 |
|
0 |
|
0.4 | 0 |
|
0.3 |
|
0.6 |
|
1 |
|
0 | 1 |
|
0 |
|
|
|
|
|
|
|
|
|
|
0 |
|
0.8 |
|
0 | 0.5 |
|
0 |
The fourfold table analogous to Table
By using fuzzy sets, we generally get more precise results, and, more importantly, we avoid undesirable threshold effects [
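A fuzzy counterpart of the fourfold table can be sketched by replacing counts with sums of membership degrees. The sketch below assumes the product t-norm for conjunction and the standard negation 1 − x; the choice of t-norm, the function names, and the data are assumptions for illustration.

```python
from typing import Sequence, Tuple


def fuzzy_fourfold(antecedent: Sequence[float],
                   consequent: Sequence[float]) -> Tuple[float, float, float, float]:
    """Fuzzy fourfold table (a, b, c, d) using the product t-norm (an assumption)."""
    a = b = c = d = 0.0
    for x, y in zip(antecedent, consequent):
        a += x * y                    # antecedent and consequent
        b += x * (1.0 - y)            # antecedent and not consequent
        c += (1.0 - x) * y            # not antecedent and consequent
        d += (1.0 - x) * (1.0 - y)    # neither
    return a, b, c, d


def fuzzy_support(a: float, b: float, c: float, d: float) -> float:
    """Share of the total mass that satisfies antecedent and consequent."""
    return a / (a + b + c + d)


def fuzzy_confidence(a: float, b: float, c: float, d: float) -> float:
    """Share of the antecedent mass that also satisfies the consequent."""
    return a / (a + b) if (a + b) > 0 else 0.0


# Hypothetical membership degrees of five objects.
antecedent = [0.5, 0.8, 0.0, 0.0, 0.6]
consequent = [1.0, 0.4, 0.0, 0.0, 1.0]

a, b, c, d = fuzzy_fourfold(antecedent, consequent)
print(round(fuzzy_support(a, b, c, d), 3), round(fuzzy_confidence(a, b, c, d), 3))
```

With the product t-norm the four entries always sum to the number of objects, so an object with partial membership contributes to several cells at once instead of being forced into one by a threshold.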
A confirmed association as
This approach has been found very efficient and reasonable, for example, for the identification of the so-called Fuzzy Rule Base Ensemble [
In Section
As mentioned in the introduction, we are provided only with the data from the measuring stations and from the Math-1D model implemented in the FLOREON system. Unlike the Math-1D model, we are provided neither with the measured precipitation, nor with its forecasts, nor with other physical attributes or their estimates. The reason is that this is the domain of the physical Math-1D model, and our task is not to build another, competing physical model but to concentrate on the analysis of the existing one. However, in order to apply the (fuzzy) GUHA method, we need to generate several features (artificial variables) and investigate which of those variables influence the performance of the model.
For the purpose of this investigation, we were provided with the data set collected from different
So, we can introduce the following two sets:
Indeed, the values
The aim is to analyze associations between input variables that were at disposal at the time
To evaluate the quality of the results, the data were split into a training set and a testing set in the ratio 2:1, that is, 38 simulations for training and 19 simulations for testing.
For each simulation
All the statistics listed above were computed for each of the following data
Analogously, the same statistics have been utilized also for
Finally, the time point of the forecasted peak,
was also added as a feature. In total, 205 new features were generated.
From the pool of features, a regression method [
denoting the peak shift, was modelled with the linear regression of all the generated features. After that, statistical significance of all the regression coefficients was tested and only features with
In this way, we ended the feature selection with the following three features:
All computed features, which were found statistically significant, as described in the previous subsection, are viewed as quantitative variables. In order to use them in mining linguistic associations, we had to convert them into fuzzy attributes. More specifically, we generated all the possible linguistic expressions (see Section
The above introduced variable
Part of the resulting fuzzy GUHA table, which contained 84 columns (63 for antecedent attributes and 21 for consequent attributes), is shown in Table
Example of a part of the fuzzy GUHA table for peak shift of PS of the peak forecasted by the Math-1D model. Objects
|
|
|
PSExSm |
|
PSExBi | |
---|---|---|---|---|---|---|
|
0.97 |
|
0.62 | 0.45 |
|
0 |
|
0 |
|
0.2 | 0 |
|
0.58 |
|
0.75 |
|
0.97 | 0.66 |
|
0 |
|
|
|
|
|
| |
|
0.66 |
|
0.74 | 0.69 |
|
0 |
Upon the choice of the multitudinal implicational quantifier and the degree of confidence, they describe the situations in which disaster management may expect a time shift of the water flow rate peak, which is essential for precise warnings, evacuation of people, and other preparatory works that may reduce the material costs of the approaching disaster. Connected to the PbLD inference mechanism, they may be used directly to forecast the time shift of the peak originally forecasted by the Math-1D model and, thus, to correct and refine the forecast of the physical model.
Examples of fuzzy rules found by the fuzzy GUHA method.
Rule |
IF part | THEN part | ||
---|---|---|---|---|
|
|
|
PS | |
|
Sm | VR Sm | Ve Bi | QR Sm |
|
VR Sm | Ve Sm | Ve Bi | Ro Sm |
|
Ve Sm | –- | Ex Bi | ML Sm |
|
|
|
|
|
|
VR Me | QR Sm | VR Bi | Ro Me |
The prediction model was evaluated on a testing dataset, that is, on data previously hidden during the whole data-mining procedure. The testing dataset consists of 19 simulations, each simulation containing hourly flow rates for five days in the past and two days of predictions for the future.
On the testing simulations, the prediction accuracy of the time of the culminating peak was compared between the original Math-1D model and the Math-1D model adjusted with the GUHA association rules.
For each testing simulation
A comparison of peak forecast errors
Model | Min | 1st quart. | Mean | Stdev | Median | 3rd quart. | Max |
---|---|---|---|---|---|---|---|
Original | | 0.31 | 0.603 | 0.52 | 0.46 | 0.88 | 1.92 |
Adjusted | | | −0.205 | 0.65 | | 0.07 | 1.02 |
Briefly, on the testing dataset the original model expects the flood peaks approximately half a day later than they occur in reality. After the adjustments made by our GUHA model, the estimates become more accurate. More precisely, the error of the original (Math-1D) model is on average 0.603 days (with standard deviation 0.521), whereas the error of the adjusted model is on average −0.205 days (with standard deviation 0.65).
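The kind of summary reported above can be reproduced with a few lines of code. The sketch below assumes that the error is the forecasted peak time minus the actual peak time (in days) and that the adjusted forecast is obtained by subtracting the predicted shift from the original forecast; both conventions, the function name, and all numbers are assumptions for illustration and are not the evaluated data.

```python
import statistics
from typing import Dict, Sequence


def error_summary(errors: Sequence[float]) -> Dict[str, float]:
    """Descriptive statistics of peak-time errors (in days)."""
    q1, median, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    return {
        "min": min(errors),
        "q1": q1,
        "mean": statistics.mean(errors),
        "stdev": statistics.stdev(errors),   # sample standard deviation
        "median": median,
        "q3": q3,
        "max": max(errors),
    }


# Hypothetical peak times (days) for five simulations.
forecast_peak = [2.0, 3.5, 1.0, 4.0, 2.5]      # physical model
actual_peak = [1.5, 3.0, 1.0, 3.0, 2.0]        # measured
predicted_shift = [0.5, 0.25, 0.0, 0.75, 0.5]  # output of the rule-based model

original_errors = [f - a for f, a in zip(forecast_peak, actual_peak)]
adjusted_errors = [f - s - a
                   for f, s, a in zip(forecast_peak, predicted_shift, actual_peak)]

summary_original = error_summary(original_errors)
summary_adjusted = error_summary(adjusted_errors)
print(summary_original)
print(summary_adjusted)
```

A mean close to zero indicates an unbiased estimate of the peak time, while the standard deviation describes how much individual simulations scatter around that mean.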
A bias of the original model towards positive values was also confirmed by the one-sample Wilcoxon signed-rank test [
Statistical tests of hypotheses of
Model | Wilcoxon test: V | Wilcoxon test: p-value | One-sample t-test: t | One-sample t-test: df | One-sample t-test: p-value |
---|---|---|---|---|---|
Original | 166 | 0.0004871 | 5.042 | 18 | 0.00008486 |
Adjusted | 61 | 0.1776 | | 18 | 0.1866 |
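The two test statistics can be sketched as follows. The one-sample t statistic is computed from the reported means and standard deviations with n = 19 testing simulations and a hypothesized mean of zero; the Wilcoxon statistic is computed as the sum of ranks of the positive errors, which is one common convention and is assumed here. The function names and the toy error vector are hypothetical.

```python
import math
from typing import Sequence


def t_statistic(mean: float, stdev: float, n: int, mu0: float = 0.0) -> float:
    """One-sample t statistic for H0: the true mean equals mu0."""
    return (mean - mu0) / (stdev / math.sqrt(n))


def wilcoxon_v(errors: Sequence[float]) -> float:
    """Sum of ranks of positive values (zeros dropped, ties get average ranks)."""
    ordered = sorted((e for e in errors if e != 0), key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1        # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    return sum(r for r, e in zip(ranks, ordered) if e > 0)


# Means and standard deviations as reported for the 19 testing simulations.
t_original = t_statistic(0.603, 0.521, 19)
t_adjusted = t_statistic(-0.205, 0.65, 19)
print(round(t_original, 2), round(t_adjusted, 2))

# Hypothetical errors, used only to illustrate the rank computation.
v_toy = wilcoxon_v([0.4, -0.1, 0.3, 0.2, -0.5])
print(v_toy)
```

From the rounded summary values the statistic for the original model comes out at about 5.04, in line with the tabulated 5.042, and the statistic for the adjusted model at about −1.37. The degrees of freedom are n − 1 = 18 in both cases.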
In this paper, we attempted to adjust a physical model of the water flow rate during floods with the help of linguistic associations mining. As any physical model based on differential equations (the Math-1D model, in our case) is highly dependent on many unreliable parameters, it seems reasonable to perform a real-data analysis that indicates when and under which conditions the model (in terms of the culminating water flow rate peak) lags behind or, conversely, runs too far ahead.
We approached the task with the help of the fuzzy GUHA method, which automatically generates linguistic associations. The provided data were first extended by creating artificial variables describing various features of the data. The resulting variables were then translated into a fuzzy GUHA table using the so-called Evaluative Linguistic Expressions. This table was used to mine associations that may be directly interpreted as fuzzy IF-THEN rules. Such an interpretation is beneficial not only because the rules are interpretable but also because they can be used jointly with the Perception-based Logical Deduction inference method to predict the expected time shift of the flood peaks originally forecasted by the physical model. Results obtained from this adjusted model were statistically evaluated in order to confirm the improvement in forecasting accuracy.
Let us note that the data-mining analysis as well as the experimental evaluation was performed only on a single measuring station, Svinov, on the Odra River. Indeed, as the physical model depends on many imprecise and estimated parameters that may differ along the river, each station would require its own analysis. However, as the number of stations in the whole region is rather low (9 stations placed on four main rivers), such an approach is feasible. Thus, the promising results open the way for a further and deeper analysis that could enhance disaster management through more accurate physical models with forecasts adjusted by fuzzy IF-THEN rules. On the other hand, the lack of past data that could be analyzed is a serious complication. The high number of previous floods is unfortunately not accompanied by a sufficiently high number of precise data. As we have mentioned, there was, for example, a problem of zero water flow rates being measured even during massive floods, due to uncalibrated measuring stations or other unspecified reasons. This lack of reliable data may significantly complicate the situation.
As the first step of future research, we plan to extend our investigation by using the measured past precipitation and possibly also the forecasted future precipitation, which are already available to the Math-1D model but were not available for the data analysis presented in this paper.
This work was supported by the European Regional Development Fund in the IT4Innovations Centre of Excellence Project (CZ.1.05/1.1.00/02.0070).