Droughts increasingly affect large areas of the globe, making careful and rational management of water resources ever more pressing. Since most of the world’s unfrozen freshwater reserves are stored in aquifers, the ability to predict spring discharges is crucial. An approach based on water balance is often extremely complicated or ineffective, and data-driven approaches are a promising alternative. Recently, many hydraulic engineering problems have been addressed by means of advanced models derived from artificial intelligence studies. In this comparative study, three machine learning algorithms were used for spring discharge forecasting: M5P regression tree, random forest, and support vector regression. The spring of Rasiglia Alzabove, Umbria, Central Italy, was selected as a case study. The machine learning models provided very encouraging results. M5P provides good short-term predictions of monthly average flow rates (e.g., in predicting average discharge of the spring after 1 month,
In recent years, long and frequent droughts have affected many countries in the world. These events require an ever more careful and rational management of water resources. Most of the globe’s unfrozen freshwater reserves are stored in aquifers. Groundwater is generally a renewable resource that shows good quality and resilience to fluctuations. Thus, if properly managed, groundwater could ensure long-term supply in order to meet increasing water demand.
For this purpose, it is crucial to be able to predict the flow rates provided by springs. Springs represent the transition from groundwater to surface water and reflect the dynamics of the aquifer and of the whole flow system behind it. Moreover, springs influence the water bodies into which they discharge. The importance of springs in groundwater research is highlighted in some significant contributions [
A spring hydrograph is the consequence of several processes governing the transformation of precipitation in the spring recharge area into the single output discharge at the spring. A water balance states that the change rate in water stored in the feeding aquifer is balanced by the rate at which water flows into and out of the aquifer. A quantitative water balance generally has to take the following terms into account: precipitation, infiltration, surface runoff, evapotranspiration, groundwater recharge, soil moisture deficit, spring discharge, lateral inflow to the aquifer, leakage between the aquifer and the underlying aquitard, well pumpage from the aquifer, and change of the storage in the aquifer.
In many cases, the evaluation of the terms of the water balance is very complicated. The complexity of the problem arises from many factors: hydrologic, hydrographic, and hydrogeological features, geologic and geomorphologic characteristics, land use, land cover, water withdrawals, and climatic conditions.
Estimating future spring discharges with a model based on the balance equations would be even more complicated. Therefore, simplified approaches are frequently pursued for practical purposes.
Many authors have addressed the problem of correlating the spring discharges to the rainfall through different approaches. Zhang et al. [
Recently, many researchers have investigated the feasibility of addressing hydraulic engineering issues by means of advanced models derived from artificial intelligence studies. Regression tree models, ensemble methods, and support vector machines have been increasingly used in solving water engineering problems.
Dibike et al. [
Tree models or ensemble methods were implemented to forecast flood events [
The aim of this study is to assess the ability of a machine learning approach to predict the average monthly discharge of a spring when the forecast horizon does not exceed a few months and only a few years of monthly flow rate measurements and rainfall data are available. To this end, regression tree, random forest, and support vector regression were used to build forecasting models and to perform a comparative study. The proposed approach was tested by means of experimental data obtained from the spring of Rasiglia Alzabove, Umbria, Central Italy. Time series data are available from the Regional Agency for Environmental Protection (
A regression tree (RT) model (Figure
Typical architecture of a simple regression tree. LMs: linear models.
During the growth of a regression tree model, the input data domain is recursively divided into subdomains. The predictions are made in each of them by means of multivariable linear regression models. At the first step of the iterative algorithm, all data are allocated into two branches, considering all the possible splits on every field. Subsequently, the development of the regression tree continues by splitting each branch into smaller partitions as the tree expands. At each stage, the procedure identifies the subdivision into two distinct partitions that minimizes the sum of the squared deviations from the mean. This sum can be considered a measure of the “impurity” at a node, that is, a quantification of the predictive capability of the node. The algorithm continues until the lowest impurity level is obtained or until a stopping rule is met. Usually, a stopping rule is related to the threshold for the minimum impurity variation provided by new splits, the minimum number of units in each node, or the maximum tree depth.
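The split search described above can be illustrated with a minimal sketch. It is not the M5P implementation used in the study: it handles a single input field, uses the plain sum of squared deviations as impurity, and the names `sum_squared_dev`, `best_split`, and `min_units` are hypothetical.

```python
from statistics import fmean


def sum_squared_dev(values):
    """Sum of squared deviations from the mean (the node 'impurity')."""
    if not values:
        return 0.0
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y, min_units=1):
    """Return (threshold, impurity) for the binary split on one field x
    that minimizes the total impurity of the two child nodes.

    min_units is a simple stopping rule (assumed here): each child node
    must contain at least that many units.
    """
    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for i in range(min_units, len(pairs) - min_units + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal field values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        impurity = sum_squared_dev(left) + sum_squared_dev(right)
        if impurity < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (threshold, impurity)
    return best
```

Growing a full tree would repeat this search on every field and then recurse into each child node until a stopping rule is met.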
The algorithm here used is commonly known as M5P and is based on Quinlan’s M5 algorithm [
The split process at each node is carried out based on the following function of the LSD:
A regression tree might suffer from overfitting when the model structure is fully developed. Overfitting occurs when a machine learning model has become too attuned to the data on which it was trained and therefore loses its applicability to any other dataset. Overfitting thus reduces the tree's ability to make predictions when the model is applied to new data. To minimize this risk, a
If the results of different regression trees are combined into a single prediction, an
A random forest (RF) [
Regression tree ensemble: a typical random forest.
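The general idea of a tree ensemble can be sketched as follows, assuming the usual random forest ingredients: each tree is grown on a bootstrap sample with a random subset of candidate fields, and the forest prediction is the average of the tree predictions. For brevity each "tree" is a depth-1 stump; `fit_stump`, `fit_forest`, `predict_forest`, and their parameters are hypothetical names, and this is not the code used in the study.

```python
import random
from statistics import fmean


def fit_stump(rows, targets, features):
    """Fit a depth-1 regression tree using only the candidate features."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in features:
        # Skip the smallest value so both children are never empty.
        for thr in sorted({r[f] for r in rows})[1:]:
            left = [t for r, t in zip(rows, targets) if r[f] < thr]
            right = [t for r, t in zip(rows, targets) if r[f] >= thr]
            lm, rm = fmean(left), fmean(right)
            sse = (sum((t - lm) ** 2 for t in left)
                   + sum((t - rm) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    if best is None:  # every candidate feature is constant in this sample
        m = fmean(targets)
        return lambda row: m
    _, f, thr, lm, rm = best
    return lambda row: lm if row[f] < thr else rm


def fit_forest(rows, targets, n_trees=25, n_features=1, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = rng.sample(range(d), n_features)     # random feature subset
        trees.append(fit_stump([rows[i] for i in idx],
                               [targets[i] for i in idx], feats))
    return trees


def predict_forest(trees, row):
    """Average the predictions of the individual trees."""
    return fmean([tree(row) for tree in trees])
```

Averaging over trees grown on different samples is what lowers the correlation between individual trees while retaining their strength.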
The forest error rate is affected by two factors: the correlation between any two trees and the strength of each single tree. The forest error rate rises if the correlation increases. Moreover, the forest error rate decreases if the strength of the individual trees increases. If
Support vector machine algorithms take a different approach [
Example of support vector regression: smaller than
Therefore, given a linear function in the form
The constant
The optimization problem stated in (
The partial derivatives of
The evaluation of
In order to make the SVR algorithm nonlinear, the training patterns
Typical architecture of the nonlinear SVR algorithm.
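As a rough illustration of this architecture, the sketch below evaluates a kernel expansion of the standard form f(x) = Σ c_i K(x_i, x) + b with a Gaussian RBF kernel, together with the ε-insensitive loss commonly used in SVR. The parametrization of the kernel through `gamma`, and the names `rbf_kernel`, `svr_predict`, `dual_coefs`, and `eps_insensitive_loss`, are assumptions made for this example; training (solving the optimization problem) is not shown.

```python
import math


def rbf_kernel(u, v, gamma):
    """Gaussian RBF kernel exp(-gamma * ||u - v||^2) (gamma is assumed)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Evaluate f(x) = sum_i coef_i * K(x_i, x) + bias.

    dual_coefs holds one coefficient per support vector (the difference
    of the two Lagrange multipliers in the usual dual formulation).
    """
    return bias + sum(c * rbf_kernel(sv, x, gamma)
                      for sv, c in zip(support_vectors, dual_coefs))


def eps_insensitive_loss(y_true, y_pred, eps):
    """Deviations smaller than eps cost nothing; larger ones cost linearly."""
    return max(0.0, abs(y_true - y_pred) - eps)
```

For instance, `svr_predict((0.0,), [(0.0,), (1.0,)], [1.0, -1.0], 0.5, gamma=1.0)` combines a kernel value of 1 for the coincident support vector with exp(-1) for the one at unit distance.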
It follows that condition (
In the nonlinear case, the optimization problem requires finding the flattest function in the feature space, not in the input space. The expansion of
In this research, a radial basis function (RBF) was selected as kernel. The RBF has the form
In particular, in the case study, the parameters assume the following values:
The described algorithms were implemented in purpose-written MATLAB code. The search for the optimal structure of the models was conducted by means of a trial-and-error iteration procedure. The holdout method was used for validation during training. It involves setting aside a part of the training data and using it to evaluate the predictions of the model trained on the rest of the data. The resulting estimation error indicates how the model performs on unseen data, that is, on the validation set.
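A minimal Python sketch of the holdout idea is given below. It is not the MATLAB code used in the study, and it makes two assumptions that the text does not state: the samples are in chronological order, so the most recent ones are held out, and the validation error is measured as a mean absolute error. The names `holdout_split`, `holdout_error`, `fit`, and `validation_fraction` are hypothetical.

```python
def holdout_split(samples, validation_fraction=0.3):
    """Split an ordered sequence into (training part, held-out part).

    The last samples are held out, so the model is validated on data
    that follow the training period (an assumption of this sketch).
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    n_val = max(1, round(len(samples) * validation_fraction))
    return samples[:-n_val], samples[-n_val:]


def holdout_error(fit, inputs, targets, validation_fraction=0.3):
    """Train with fit(x_train, y_train), which must return a predictor
    callable, and return the mean absolute error on the held-out part."""
    x_tr, x_val = holdout_split(inputs, validation_fraction)
    y_tr, y_val = holdout_split(targets, validation_fraction)
    model = fit(x_tr, y_tr)
    return sum(abs(model(x) - y) for x, y in zip(x_val, y_val)) / len(y_val)
```

In a trial-and-error search, `holdout_error` would be evaluated for each candidate model structure and the structure with the lowest validation error retained.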
The Menotre River Valley, near the town of Rasiglia (Umbria Region), hosts the Capo Vena spring (elevation of 670 m a.s.l. and average discharge of 700 l/s), the Alzabove spring (elevation of 650 m a.s.l. and average discharge of 250 l/s), and the minor spring of Verchiano Aqueduct (elevation of 650 m a.s.l. and average discharge of 45 l/s) [
Carbonate deposits of the Umbria-Marche sequence characterize the area [
The complexes of “Marne del Sentino-Rosso Ammonitico-Marne ad Aptici Fm.,” “Marne a Fucoidi,” and “Scaglia Cinerea-Marnoso Arenacea” have an aquitard function and divide the groundwater circulation of the calcareous complexes. Groundwater of the Alzabove spring originates from the contact between Maiolica Fm. (lower Cretaceous) and Marne a Fucoidi Fm. (middle Cretaceous), along a 500 m long contact on the right side of the Menotre River (Figure
Geological and hydrogeological map (a), section of the Alzabove spring area (b), and rain gauge location (c). Key to the legend: (1) the talus and alluvial deposit complex (Holocene-Pleistocene) has high permeability and constitutes the local aquifer; (2) the lacustrine deposit complex (Holocene-Pliocene) has an aquitard function; (3) terrigenous complexes (marls, scaly clays, and sandstones) (Miocene) have an aquitard function; (4) the Scaglia calcarea complex (Scaglia Rossa and Scaglia Bianca Fm.) (Eocene-Cretaceous) has high permeability and high storage capacity and constitutes the regional aquifer; (5) the Marne a Fucoidi complex (Lower Cretaceous) has an aquitard function; (6) the Maiolica complex (Lower Cretaceous-Jurassic) has high permeability and high storage capacity and constitutes the regional aquifer; (7) the calcareous siliceous marly complex (Marne del Sentino-Rosso Ammonitico-Marne ad Aptici Fm.) (Upper Jurassic) has an aquitard function; (8) the Corniola-Calcare Massiccio basal complex (Lower Jurassic) has high permeability and high storage capacity and constitutes the regional aquifer; (9) fault; (10) folds: (a) anticline and (b) syncline; (11) thrust; (12) springs: (a) Capo Vena, (b) Verchiano, and (c) Alzabove; (13) groundwater level: the numbers indicate the water level elevation above sea level; (14) river; (15) section trace; (16) village.
In this context, the Alzabove spring can be considered an overflow spring (Figure
The chemical analysis shows that the Alzabove spring is characterized by low Mg contents, typical of Maiolica Fm. In contrast, groundwater circulation of the Capo Vena spring is affected by dolomitic horizons with sulphates. Therefore, the groundwater path of the Alzabove spring is shallower and less influenced by karst phenomena and tectonics than that of the Capo Vena reservoir, which includes the basal portion of the Umbria-Marche sequence (lower Lias).
The identification of the hydrogeological model of the Alzabove spring confirms that it is a suitable case for training the proposed algorithms.
The effectiveness of forecasting algorithms was evaluated by the following criteria: the coefficient of determination
The coefficient of determination,
The mean absolute error measures how much the predictions are close to the observed values. It is evaluated by
RMSE is the sample standard deviation of the differences between experimental and predicted values. It is given by
Finally, RAE normalizes the total absolute error dividing it by the total absolute error of the simple predictor. Its definition is
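The four criteria can be computed with a short sketch such as the following. The formulas are the commonly used definitions and are assumptions here: the coefficient of determination is taken as one minus the ratio of residual to total sum of squares, RMSE divides by the number of samples, and the "simple predictor" in RAE is assumed to be the mean of the observed values. The function name `evaluate` is hypothetical.

```python
import math


def evaluate(observed, predicted):
    """Return R2, MAE, RMSE, and RAE for paired observed/predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    abs_err = [abs(p - o) for o, p in zip(observed, predicted)]
    sq_err = [(p - o) ** 2 for o, p in zip(observed, predicted)]
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        # 1 minus residual sum of squares over total sum of squares
        "R2": 1.0 - sum(sq_err) / ss_tot,
        # average magnitude of the errors, in the units of the data
        "MAE": sum(abs_err) / n,
        # root of the mean squared error, in the units of the data
        "RMSE": math.sqrt(sum(sq_err) / n),
        # total absolute error relative to always predicting the mean
        "RAE": sum(abs_err) / sum(abs(o - mean_obs) for o in observed),
    }
```

With observed values `[1, 2, 3, 4]` and predictions `[1, 2, 3, 5]`, a single error of one unit is spread over four samples, so MAE is 0.25 and RAE is 25%.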
The input data of the different models are the past monthly average flow rates,
Time series of rainfall and spring discharge.
The available time series is not very long, but this is not a limitation as one of the primary objectives of this study is to evaluate the predictive capabilities of the considered models when a few years of experimental observations are available.
In Figure
Cross-correlogram of cumulative monthly rainfall and average monthly discharge.
Preliminary analysis showed that better results are obtained if the input vector has the same number of flow rate and cumulative rainfall data. Thus, each vector of the input matrix to the three different models is composed as follows:
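One way such input vectors might be assembled from the two time series is sketched below. The exact composition used in the study is not reproduced here: the sketch assumes that each vector holds the last `n_lags` monthly average flow rates followed by the last `n_lags` cumulative monthly rainfall values, and that the target is the average flow rate `horizon` months ahead. The names `build_dataset`, `n_lags`, and `horizon` are hypothetical.

```python
def build_dataset(discharge, rainfall, n_lags, horizon):
    """Pair lagged inputs with the discharge `horizon` months ahead.

    Each input vector contains n_lags flow rates followed by n_lags
    rainfall values, all ending at month t; the target is the flow
    rate of month t + horizon.
    """
    if len(discharge) != len(rainfall):
        raise ValueError("series must have the same length")
    inputs, targets = [], []
    for t in range(n_lags - 1, len(discharge) - horizon):
        window = slice(t - n_lags + 1, t + 1)
        inputs.append(list(discharge[window]) + list(rainfall[window]))
        targets.append(discharge[t + horizon])
    return inputs, targets
```

Calling the function once per forecast horizon yields a separate training set for each model, which matches the practice of building a distinct model for each horizon.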
Different models were built to predict the monthly average discharge of the spring after 1 month,
Table
Comparative analysis of M5P, RF, and SVR by means of
Forecast horizon | Input | Model | R² | MAE (m3/s) | RMSE (m3/s) | RAE
---|---|---|---|---|---|---
1 month | 4-month input | M5P | 0.991 | 0.0124 | 0.0156 | 14.97%
1 month | 4-month input | RF | 0.926 | 0.0309 | 0.0446 | 37.29%
1 month | 4-month input | SVR | 0.97 | 0.0196 | 0.0299 | 23.67%
1 month | 6-month input | M5P | 0.987 | 0.013 | 0.018 | 15.67%
1 month | 6-month input | RF | 0.963 | 0.0261 | 0.035 | 31.50%
1 month | 6-month input | SVR | 0.976 | 0.0191 | 0.0291 | 22.97%
1 month | 8-month input | M5P | 0.889 | 0.0214 | 0.0312 | 41.24%
1 month | 8-month input | RF | 0.823 | 0.0297 | 0.0377 | 57.26%
1 month | 8-month input | SVR | 0.86 | 0.0275 | 0.0348 | 52.96%
2 months | 4-month input | M5P | 0.962 | 0.0272 | 0.0309 | 32.50%
2 months | 4-month input | RF | 0.972 | 0.0322 | 0.0391 | 38.53%
2 months | 4-month input | SVR | 0.933 | 0.03 | 0.0402 | 35.91%
2 months | 6-month input | M5P | 0.976 | 0.0207 | 0.026 | 24.73%
2 months | 6-month input | RF | 0.972 | 0.0333 | 0.0389 | 39.81%
2 months | 6-month input | SVR | 0.95 | 0.028 | 0.0369 | 33.55%
2 months | 8-month input | M5P | 0.675 | 0.0484 | 0.0623 | 75.87%
2 months | 8-month input | RF | 0.834 | 0.0398 | 0.0491 | 62.24%
2 months | 8-month input | SVR | 0.84 | 0.0381 | 0.0489 | 59.61%
3 months | 4-month input | M5P | 0.859 | 0.0487 | 0.0544 | 56.43%
3 months | 4-month input | RF | 0.964 | 0.0373 | 0.0435 | 43.12%
3 months | 4-month input | SVR | 0.791 | 0.0507 | 0.0637 | 58.69%
3 months | 6-month input | M5P | 0.921 | 0.0349 | 0.0405 | 40.40%
3 months | 6-month input | RF | 0.96 | 0.0388 | 0.0475 | 44.88%
3 months | 6-month input | SVR | 0.838 | 0.0464 | 0.0589 | 53.65%
3 months | 8-month input | M5P | 0.586 | 0.048 | 0.0591 | 84.85%
3 months | 8-month input | RF | 0.855 | 0.0409 | 0.0496 | 72.37%
3 months | 8-month input | SVR | 0.389 | 0.0638 | 0.0682 | 112.87%
4 months | 4-month input | M5P | 0.755 | 0.0544 | 0.0612 | 71.83%
4 months | 4-month input | RF | 0.936 | 0.0359 | 0.0393 | 47.41%
4 months | 4-month input | SVR | 0.731 | 0.0528 | 0.0621 | 69.83%
4 months | 6-month input | M5P | 0.831 | 0.0449 | 0.051 | 59.39%
4 months | 6-month input | RF | 0.945 | 0.0373 | 0.0441 | 49.25%
4 months | 6-month input | SVR | 0.857 | 0.0415 | 0.05 | 54.88%
4 months | 8-month input | M5P | 0.709 | 0.0379 | 0.0475 | 70.49%
4 months | 8-month input | RF | 0.934 | 0.0277 | 0.0351 | 51.44%
4 months | 8-month input | SVR | 0.797 | 0.0314 | 0.0406 | 58.39%
Comparison between predicted and observed discharges (m3/s), 4-month input.
Comparison between predicted and observed discharges (m3/s), 6-month input.
Comparison between predicted and observed discharges (m3/s), 8-month input.
Comparison between predicted and observed discharges (m3/s), 4-month input.
Comparison between predicted and observed discharges (m3/s), 6-month input.
Comparison between predicted and observed discharges (m3/s), 8-month input.
Comparison between predicted and observed discharges (m3/s), 4-month input.
Comparison between predicted and observed discharges (m3/s), 6-month input.
Comparison between predicted and observed discharges (m3/s), 8-month input.
Comparison between predicted and observed discharges (m3/s), 4-month input.
Comparison between predicted and observed discharges (m3/s), 6-month input.
Comparison between predicted and observed discharges (m3/s), 8-month input.
With regard to the average discharge of the following month,
Regarding the forecast of the average flow rate after two months,
Considering an 8-month input (Figure
All models show the most significant errors for
Analyzing the average flow rate after 3 months,
If, on the other hand, an 8-month input is considered (Figure
Again, all the models show the most significant errors for
With regard to the average discharge after 4 months,
The error of the models is fairly evenly distributed over the entire flow rate range considered for testing.
It can also be noted that the M5P model provides less accurate predictions as the forecast horizon lengthens, and SVR behaves similarly. The accuracy of the RF model, instead, decreases less as the forecast horizon lengthens. While M5P provides very good short-term forecasts, RF is able to provide fairly accurate predictions of the average flow rates that will be available after some months.
All the considered models tend to be less effective if the input data cover a period appreciably longer than the actual response time of the aquifer to rainfall on the basin; therefore, it is appropriate to estimate this time by means of a cross-correlation analysis.
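A cross-correlation analysis of this kind can be sketched as follows, assuming two monthly series of equal length and taking the lag with the highest Pearson correlation as a rough estimate of the response time. The names `cross_correlation`, `best_lag`, and `max_lag` are hypothetical, and this is an illustration rather than the procedure used to produce the cross-correlogram above.

```python
import math


def cross_correlation(rain, flow, lag):
    """Pearson correlation between rain[t] and flow[t + lag] (lag >= 0)."""
    x = rain[: len(rain) - lag]
    y = flow[lag:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def best_lag(rain, flow, max_lag):
    """Lag (in months) at which the cross-correlation is highest."""
    return max(range(max_lag + 1),
               key=lambda k: cross_correlation(rain, flow, k))
```

The sketch assumes that neither series is constant over the compared window, since the correlation is undefined in that case.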
The ability to forecast spring discharges is essential for careful management and accurate planning of water resources. In many cases, predicting the future flow rate on the basis of the basin water balance is very complicated or impossible. Machine learning models represent a very interesting alternative. These models can be built using only past discharges and cumulative rainfall.
Three different machine learning algorithms were used and compared in this study: M5P, random forest, and support vector regression. The spring of Rasiglia Alzabove, Umbria, Central Italy, was chosen as a case study.
The considered models provided encouraging results even though the time series available for training is rather limited. M5P provides very good short-term predictions of monthly average flow rates, while RF is able to provide accurate medium-term forecasts.
As the forecast horizon lengthens, the models generally lead to less accurate predictions. Moreover, the effectiveness of the models depends significantly on the length of the period covered by the input data. This length should be estimated approximately by means of a cross-correlation analysis, in order to evaluate the actual aquifer response time.
Bias in SVR algorithm
Cost of error in SVR algorithm
Number of sample units in the generic node
Cumulative monthly rainfall of
Portion of units assigned to the left child node
Portion of units assigned to the right child node
Predicted monthly average flow rate after
Monthly average flow rate of
Least-squared deviation, within variance for the generic node
Generic split during tree model growing
Generic node of a tree model
Left node generated by the generic split
Right node generated by split
Variable in SVR algorithm
Value of the target variable for the
Mean value of the target variable in the generic node
Function used to make the SVR algorithm nonlinear
Variable in SVR algorithm
Variable in SVR algorithm
Deviation parameter in SVR algorithm
Function of the least-squared deviation
Variable in SVR algorithm
Variable in SVR algorithm
Slack variable in SVR algorithm
Slack variable in SVR algorithm.
The data used to support the findings of this study were provided by the Umbria Regional Agency for Environmental Protection. They are freely available online (
The authors declare that they have no conflicts of interest.