Comparing the Selected Transfer Functions and Local Optimization Methods for Neural Network Flood Runoff Forecast

The presented paper aims to analyze the influence of the selection of transfer function and training algorithms on neural network flood runoff forecast. Nine of the most significant flood events, caused by extreme rainfall, were selected from 10 years of measurement on a small headwater catchment in the Czech Republic, and flood runoff forecast was investigated using an extensive set of multilayer perceptrons with one hidden layer of neurons. The analyzed artificial neural network models with 11 different activation functions in the hidden layer were trained using 7 local optimization algorithms. The results show that the Levenberg-Marquardt algorithm was superior to the remaining tested local optimization methods. When comparing the 11 nonlinear transfer functions used in hidden layer neurons, the RootSig function was superior to the rest of the analyzed activation functions.


Introduction
In the past three decades, implementations of various models based on artificial neural networks (ANN) have been intensively explored in hydrological engineering. General reviews of ANN modeling strategies and applications, with emphasis on the modeling of hydrological processes, are presented in [1][2][3]. They confirm that the multilayer perceptron (MLP) [4,5] belongs to the most frequently studied ANN models in hydrological modeling [6][7][8][9].
The MLP forms a nonlinear data driven model. According to its architecture, it is a fully connected feed-forward network, which organizes the processing units (neurons) into layers and allows interconnections only between neurons in two consecutive layers. As proved by [10], the MLP is a universal function approximator. This important property has been widely confirmed by many hydrological studies [11][12][13][14].
Despite the positive research results of a large number of studies on MLP runoff forecasting, there is a need for clear methodological recommendations on MLP transfer function selection [15][22][23][24], combined with the assessment of training methods and the implementation of new training methods [8,18,19,25].
The main aims of the presented paper are (1) to analyze the hourly flood runoff forecast on a small headwater catchment with MLP-ANN models, which are based on 12 different MLP transfer functions following the work of [15,24], (2) to compare 7 local optimization algorithms [5,17,19], and finally (3) to evaluate the MLP performance with 4 selected model evaluation measures [26,27].

Material and Methods
The tested runoff prediction with the MLP-ANN models uses a set of rainfall-runoff data. The MLP-ANN implementation for runoff forecast generally consists of data preprocessing, model architecture selection, MLP training, and model validation. In this section, we give a very brief description of the MLP-ANN model architecture, the tested optimization schemes, and the datasets.

[Table 1: The transfer functions [15], defined in terms of the neuron's activation and the neuron output.]

The studied MLP models [18][28][29][30][31] had in total three layers of neurons: the input layer, the hidden layer, and the output layer. As proved by Hornik et al. [10], this type of artificial neural network with a sufficiently large number of neurons in the second layer can approximate any measurable functional relationship with the desired precision. The implemented MLP-ANN models had the general form

$$y_f = v_0 + \sum_{j=1}^{N_{hd}} v_j \, f\left(w_{0j} + \sum_{i=1}^{N_{in}} w_{ji} x_i\right),$$

where $y_f$ is the network output, that is, the flood runoff forecast for a given time interval, $x_i$ is the network input for input layer neuron $i$, $N_{in}$ is the number of MLP inputs, $w_{ji}$ is the weight of input $i$ to hidden layer neuron $j$, $f()$ is the activation function, identical for all hidden layer neurons, $N_{hd}$ is the number of hidden neurons, $v_j$ is the weight for the output from hidden neuron $j$, and $w_{0j}$, $v_0$ are neuron biases [2-4, 18, 25, 31].
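The one-hidden-layer network described above (inputs weighted into hidden neurons, one shared activation function, and a biased weighted sum of the hidden outputs) can be sketched in a few lines. This is only an illustrative sketch, not the PONS2train code; the function name, argument names, and the use of NumPy are assumptions made here.

```python
import numpy as np

def mlp_forecast(x, W, w0, v, v0, f=np.tanh):
    """Output of a one-hidden-layer MLP for one input vector.

    x  : inputs, shape (n_in,)
    W  : input-to-hidden weights, shape (n_hd, n_in)
    w0 : hidden neuron biases, shape (n_hd,)
    v  : hidden-to-output weights, shape (n_hd,)
    v0 : output neuron bias (scalar)
    f  : activation shared by all hidden neurons (tanh is an assumption)
    """
    hidden = f(W @ x + w0)   # activations of the hidden layer
    return v0 + v @ hidden   # linear output neuron

# Hypothetical example: 4 inputs and 3 hidden neurons with random weights.
rng = np.random.default_rng(0)
x = rng.random(4)
W = rng.normal(size=(3, 4))
w0 = rng.normal(size=3)
v = rng.normal(size=3)
forecast = mlp_forecast(x, W, w0, v, v0=0.1)
```

With all input weights and hidden biases set to zero and `f = tanh`, the hidden outputs vanish and the forecast reduces to the output bias, which is a convenient sanity check.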

MLP-ANN Transfer Functions. The type of activation function together with the network architecture influences the generalization of a neural network. Imrie et al. [32] empirically confirmed that the bounding of the transfer function influences the ANN generalization and the simulation of hydrological extremes during runoff forecast. Following the work of [15], we implemented 12 different types of transfer functions, and 11 of them were tested in the hidden neuron layer of the analyzed MLP-ANN models. Table 1 provides their list.
The type of activation function combined with a specific type of training method influences the average performance of the learning algorithm and the computing time [15,24]. For example, Bishop [4] pointed out that the use of the hyperbolic tangent function speeds up the training process compared to the use of the logistic sigmoid.
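To make the notion of bounded transfer functions concrete, the sketch below codes a few sigmoid-shaped activations. The formulas are the commonly published forms of the logistic, hyperbolic tangent, complementary log-log, log-log, and RootSig functions; they are assumptions here and may differ in detail from the definitions in Table 1, which is not reproduced in this text.

```python
import numpy as np

def logistic(a):
    # Logistic sigmoid, bounded in (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def hyperbolic_tangent(a):
    # Hyperbolic tangent, bounded in (-1, 1).
    return np.tanh(a)

def cloglog(a):
    # Complementary log-log (assumed form), bounded in (0, 1).
    return 1.0 - np.exp(-np.exp(a))

def loglog(a):
    # Log-log (assumed form), bounded in (0, 1).
    return np.exp(-np.exp(-a))

def rootsig(a):
    # RootSig (assumed form), bounded in (-1, 1).
    return a / (1.0 + np.sqrt(1.0 + a * a))

# Evaluate every function on the same grid of neuron activations.
a = np.linspace(-4.0, 4.0, 9)
outputs = {f.__name__: f(a) for f in
           (logistic, hyperbolic_tangent, cloglog, loglog, rootsig)}
```

All five functions are monotonically increasing and bounded, but they differ in symmetry and in how fast they saturate, which is one reason why training behavior can vary with the choice of activation.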

MLP-ANN Local Optimization Methods. We selected 7 gradient based local optimization methods. Table 2 shows their list together with their references. All MLP-ANN optimization was performed using the batch learning mode [4].
All tested gradient local search methods (except BP regul) minimized the error function represented as the sum of squared residuals,

$$E = \sum_{t} res[t]^2, \qquad res[t] = Q_{obs}[t] - Q_{com}[t],$$

where the residuals $res[t]$ were defined as differences between the observed $Q_{obs}[t]$ and computed $Q_{com}[t]$ flood runoff. The two first order local training methods are represented by the standard backpropagation and backpropagation with a regularization term. Both backpropagation methods implement the following modification: constant learning rate and momentum parameter. The BP regul used the regularization term, which penalizes the size of estimated weights, and the error function is defined as

$$E_{reg} = \beta \sum_{t} res[t]^2 + \alpha \sum_{i=1}^{N_w} w_i^2,$$

where $N_w$ is the total number of MLP-ANN weights $w_i$. The hyperparameters $\alpha$ and $\beta$ were constant within the standard backpropagation with the regularization term [4,16].
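The two error functions and the momentum update used by the first order methods can be sketched as follows. This is a minimal sketch, assuming a weight-decay penalty with two constant hyperparameters (named `alpha` and `beta` here for illustration) and the classical momentum rule; the function names are hypothetical.

```python
import numpy as np

def sse(q_obs, q_com):
    # Sum of squared residuals between observed and computed runoff.
    res = np.asarray(q_obs, dtype=float) - np.asarray(q_com, dtype=float)
    return float(np.sum(res ** 2))

def regularized_error(q_obs, q_com, weights, alpha, beta):
    # Data misfit plus a penalty on the size of the weights (weight decay).
    w = np.asarray(weights, dtype=float)
    return beta * sse(q_obs, q_com) + alpha * float(np.sum(w ** 2))

def momentum_step(w, grad, prev_step, learning_rate, momentum):
    # Batch gradient step with constant learning rate and momentum.
    step = -learning_rate * grad + momentum * prev_step
    return w + step, step

# Toy numbers, purely illustrative.
e_plain = sse([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
e_reg = regularized_error([1.0, 2.0, 3.0], [1.0, 1.0, 1.0],
                          weights=[1.0, 2.0], alpha=0.1, beta=1.0)
w_new, step = momentum_step(np.array([0.5, -0.5]), np.array([1.0, -1.0]),
                            np.zeros(2), learning_rate=0.1, momentum=0.9)
```

The penalty grows with the squared size of the weights, so larger `alpha` pulls the solution toward smaller weights at the cost of a larger data misfit.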
The scaled conjugate gradient methods are combined with a safe line search based on golden section search and bracketing of the minimum [33,34]. The implementation enables restarting during the iterative search, based on the recommendations of [21,35]. Restarting is controlled by a prescribed number of iterations or by the gradient norm. The implementation of the scaled conjugate gradient uses four different updating schemes, described in detail in [19,36].
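The golden section search mentioned above shrinks a bracketing interval around the minimum of a one-dimensional function, reusing one function value per iteration. The sketch below is a generic textbook version and assumes the bracket `[a, b]` already contains a single minimum; it is not taken from the paper's implementation.

```python
import math

def golden_section_search(phi, a, b, tol=1e-6):
    """Minimize a unimodal function phi on the bracket [a, b]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0   # 1 / golden ratio
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = phi(c), phi(d)
    while abs(b - a) > tol:
        if fc < fd:
            # Minimum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = phi(c)
        else:
            # Minimum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = phi(d)
    return 0.5 * (a + b)

# Example: step length minimizing a quadratic along a search direction.
step_length = golden_section_search(lambda s: (s - 2.0) ** 2, 0.0, 5.0)
```

In a conjugate gradient iteration, `phi` would be the error function evaluated along the current search direction, and the returned value the accepted step length.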
All gradient based methods apply the standard backpropagation algorithm to estimate the derivatives of the objective function with respect to the weights [37]. The Levenberg-Marquardt methods approximate the Hessian matrix using first order derivatives, neglecting the terms with second order derivatives [4,17].
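The Hessian approximation described above leads to the familiar damped Gauss-Newton update. The following sketch shows a single Levenberg-Marquardt step for a generic least squares problem; the function name, the damping parameter `mu`, and the toy linear model are assumptions used only for illustration.

```python
import numpy as np

def lm_step(jacobian, residuals, mu):
    """One Levenberg-Marquardt weight update.

    jacobian  : derivatives of the residuals w.r.t. the weights, shape (n, n_w)
    residuals : residual vector, shape (n,)
    mu        : damping parameter (mu = 0 gives the Gauss-Newton step)
    """
    J = np.asarray(jacobian, dtype=float)
    r = np.asarray(residuals, dtype=float)
    H = J.T @ J                      # Hessian approximation, second order terms dropped
    g = J.T @ r                      # gradient direction of the sum of squares
    return -np.linalg.solve(H + mu * np.eye(H.shape[0]), g)

# Toy check on a linear model q = w0 + w1 * x, starting from zero weights.
x = np.array([0.0, 1.0, 2.0])
q_obs = np.array([1.0, 3.0, 5.0])
w = np.zeros(2)
res = q_obs - (w[0] + w[1] * x)                # residuals: observed minus computed
J = np.column_stack([-np.ones_like(x), -x])    # d(res)/d(w0), d(res)/d(w1)
w_next = w + lm_step(J, res, mu=0.0)
```

For a model that is linear in its weights, the undamped step reaches the least squares solution in one iteration; for an MLP the step is repeated and `mu` is adapted between iterations.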

The MLP-ANN Performance.
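We based the evaluation of MLP-ANN model simulations of training, testing, and validation datasets on the following statistics [26,27,38]: the mean absolute error (MAE)

$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| Q_{obs}[t] - Q_{com}[t] \right|,$$

the Nash Sutcliffe efficiency (NS)

$$\mathrm{NS} = 1 - \frac{\sum_{t=1}^{n} \left( Q_{obs}[t] - Q_{com}[t] \right)^2}{\sum_{t=1}^{n} \left( Q_{obs}[t] - \bar{Q}_{obs} \right)^2},$$

the fourth root mean quadrupled error (R4MS4E)

$$\mathrm{R4MS4E} = \left( \frac{1}{n} \sum_{t=1}^{n} \left( Q_{obs}[t] - Q_{com}[t] \right)^4 \right)^{1/4},$$

and the persistency index (PI)

$$\mathrm{PI} = 1 - \frac{\sum_{t=1}^{n} \left( Q_{obs}[t] - Q_{com}[t] \right)^2}{\sum_{t=1}^{n} \left( Q_{obs}[t] - Q_{obs}[t - \mathrm{LAG}] \right)^2},$$

where $n$ represents the total number of time intervals to be predicted, $Q_{obs}[t]$ and $Q_{com}[t]$ are the observed and computed flood runoff, $\bar{Q}_{obs}$ is the average of the observed flood runoff, and LAG is the time shift describing the last observed flood runoff $Q_{obs}[t - \mathrm{LAG}]$.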
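The four measures named in this paper (MAE, NS, R4MS4E, and PI) can be computed directly from the observed and computed runoff series. The sketch below uses their usual definitions from the hydrological literature; the function names are hypothetical, and the PI here is evaluated only on time steps for which a lagged observation exists.

```python
import numpy as np

def mae(q_obs, q_com):
    return float(np.mean(np.abs(q_obs - q_com)))

def nash_sutcliffe(q_obs, q_com):
    # 1 means a perfect fit; 0 means no better than the observed mean.
    return float(1.0 - np.sum((q_obs - q_com) ** 2)
                 / np.sum((q_obs - np.mean(q_obs)) ** 2))

def r4ms4e(q_obs, q_com):
    # Fourth-power error emphasizes misfits at peak flows.
    return float(np.mean((q_obs - q_com) ** 4) ** 0.25)

def persistency_index(q_obs, q_com, lag=1):
    # Compares the model with the naive forecast "last observed value".
    num = np.sum((q_obs[lag:] - q_com[lag:]) ** 2)
    den = np.sum((q_obs[lag:] - q_obs[:-lag]) ** 2)
    return float(1.0 - num / den)

# Small made-up series, for illustration only.
q_obs = np.array([1.0, 2.0, 4.0, 7.0])
q_com = np.array([1.0, 2.0, 3.0, 7.0])
scores = {
    "MAE": mae(q_obs, q_com),
    "NS": nash_sutcliffe(q_obs, q_com),
    "R4MS4E": r4ms4e(q_obs, q_com),
    "PI": persistency_index(q_obs, q_com, lag=1),
}
```

A PI above zero means the model beats the persistence forecast, which is why counts of models with PI > 0 serve as a success criterion in the results.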

The PONS2train. The tested MLP-ANN models were implemented using the PONS2train software application. The PONS2train is software written in the C++ programming language, whose main goal is to test MLP models with different architectures. The software application uses the LAPACK, BLAS, and ARMADILLO C++ linear algebra libraries [39][40][41]. The application is freely distributed upon request to the authors.
The PONS2train has additional features: the weight initialization can be performed using two methods. The first one follows the work of Nguyen and Widrow [42], while the second one uses random initialization coming from the uniform distribution.
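The two initialization options can be illustrated with a short sketch. The Nguyen-Widrow variant below follows a commonly cited textbook formulation (random directions rescaled to a norm of 0.7 times the number of hidden neurons raised to 1/number of inputs) and assumes inputs scaled to roughly [-1, 1]; the exact variant used in PONS2train is not specified in this text, and the bound of the uniform initialization is likewise an assumption.

```python
import numpy as np

def nguyen_widrow_init(n_in, n_hd, rng):
    # Scale factor of the commonly cited Nguyen-Widrow rule.
    beta = 0.7 * n_hd ** (1.0 / n_in)
    W = rng.uniform(-1.0, 1.0, size=(n_hd, n_in))
    # Rescale each hidden neuron's weight vector to length beta.
    W *= beta / np.linalg.norm(W, axis=1, keepdims=True)
    w0 = rng.uniform(-beta, beta, size=n_hd)   # hidden biases
    return W, w0

def uniform_init(n_in, n_hd, rng, bound=0.5):
    # Plain random initialization from a uniform distribution.
    W = rng.uniform(-bound, bound, size=(n_hd, n_in))
    w0 = rng.uniform(-bound, bound, size=n_hd)
    return W, w0

rng = np.random.default_rng(1)
W_nw, w0_nw = nguyen_widrow_init(n_in=4, n_hd=5, rng=rng)
W_un, w0_un = uniform_init(n_in=4, n_hd=5, rng=rng)
```

Fixing the seed of the random generator makes the initial weight vectors reproducible, so the same starting points can be reused across training algorithms.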
Giustolisi and Laucelli [25] extensively studied eight methods for improving the MLP performance and generalization. One of them, early stopping, is incorporated in the designed application. Following the recommendations of Stäger and Agarwal [43], the PONS2train also controls neuron saturation in order to avoid it.
The important PONS2train implementation feature is the multirun and ensemble simulation. Its software design also enables further multimodel or hybrid MLP extensions [29,44].
The software design also allows the comparative analysis of MLP's architectures with or without bias neurons in layers. The PONS2train also enables the comparison of MLP trained on shuffled and unshuffled dataset. The shuffling of data patterns follows the random permutation algorithm of Durstenfeld [45].
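The Durstenfeld algorithm is the in-place form of the Fisher-Yates shuffle: it walks the sequence from the end and swaps each element with a randomly chosen element at or before its position. A minimal sketch, with a hypothetical function name and the standard library random generator as assumptions:

```python
import random

def durstenfeld_shuffle(patterns, rng=None):
    """Return a uniformly random permutation of the given data patterns."""
    if rng is None:
        rng = random.Random()
    shuffled = list(patterns)              # work on a copy
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)              # random index in 0..i, inclusive
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled

# Shuffle the indices of ten hypothetical input-output pairs.
order = durstenfeld_shuffle(range(10), rng=random.Random(42))
```

Shuffling indices rather than the data themselves keeps each input vector paired with its output when the permutation is applied to both arrays.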
The MLP datasets are scaled using two methods. Both methods scale the analyzed datasets into an interval with lower bound 0 and an arbitrarily chosen upper bound not exceeding 1. The nonlinear scaling obtains the transformed data from the original data using an exponential transformation governed by a control parameter. The second scaling method is a linear one.

The Dataset Description.
We explored the MLP-ANN models using the rainfall and runoff time series data obtained from 10 years of monitoring in the Modrava catchment (0.17 km²). The experimental watershed was established in 1998 in the upper parts of the Bohemian Forest National Park. The basin belongs to a set of testbeds designed to monitor the hydrological behavior of headwater forested catchments. A description of the watershed is given by Pavlasek et al. [46].
The forest cover is a clearing with young artificially planted forest combined with an undergrowth of herbs (mainly Calamagrostis villosa, Avenella flexuosa, Scirpus sylvaticus, and Vaccinium myrtillus) and bryophytes (Polytrichastrum formosum, Dicranum scoparium, and Sphagnum girgensohnii). A small part of the catchment (less than 10%) is covered by 40-year-old forest. The bark beetle calamity removed the original forest cover. Catchment bedrock is formed by granite, migmatite, and paragneiss covered by Haplic Podzols with depths of up to 0.9 m. The mean runoff coefficient is 0.2, and the mean daily runoff is 1.2 mm.

The nine most significant rainfall-runoff events, observed in an hourly time step, were selected from the 10-year measurement. The flood runoff prediction was analyzed via the proposed MLP-ANN models. The characteristics of the flood events are described in Table 3. All flood events were complemented with the periods of the 5 preceding days. The rainfall-runoff events were divided into nonoverlapping training, testing, and validation datasets.
The division of flood events into the datasets was made with respect to the similarity of empirical distribution functions of training, testing, and validation datasets and to their independence. The empirical distribution functions were estimated using the quantile estimation method, which was specifically developed for the description of hydrological time series (for detailed information see [47]). The selected quantiles of all datasets are shown in Table 4. The quantiles show that the distinctions of the information in training, testing, and validation datasets are not significant.

Results and Discussion
We tested MLP-ANN models with 4 MLP architectures, which differ in the number of hidden layer neurons hd = 3, 4, 5, 6. For each MLP architecture, we prepared 11 types of MLP-ANN models according to the type of hidden layer activation function (AF) (see Table 1). Each of them was trained with 7 training algorithms (TA) (see Table 2).
All MLP-ANN datasets consisted of all available pairs of four inputs and one output. The inputs were one previous runoff interval and three previous rainfall intervals. The training dataset contained 1270 input-output pairs, the testing dataset 1221 pairs, and the validation dataset 1423 pairs.
Although there are suitable methodologies for the selection of the proper input vector for an MLP model, for example, [48][49][50], we based our flood forecast on a small number of previous rainfall intervals and one previous runoff, mainly due to the fast hydrological response of the analyzed watershed. The datasets were transformed using the nonlinear exponential transformation. Each training algorithm was repeated 150 times. The random initialization of network weights was performed by the method of [42]. Each optimization multirun used the same set of 150 mutually different initial random weight vectors, in order to ensure that the comparison of the performances of the optimization algorithms was based on the same random weight initializations.

The Benchmark Model. The flood forecast was also simulated using a benchmark model based on a simple linear model (SLMB). The SLMB parameters were calculated using ordinary least squares. Table 5 shows the results obtained from the simulation of the SLMB benchmark model. Since the benchmark model provides a single simulation and one value for each of the tested model comparison measures, we compared the results of the SLMB with the results of the best selected single MLP-ANN models. In the model ensemble, we found MLP-ANN models that were superior to the SLMB.
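A linear benchmark fitted by ordinary least squares can be sketched as below. The exact inputs of the SLMB are not listed in this text, so the sketch assumes a design matrix `X` with one column per input plus an intercept; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_linear_benchmark(X, y):
    # Ordinary least squares with an intercept column.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear_benchmark(coef, X):
    # coef[0] is the intercept, coef[1:] are the input coefficients.
    return coef[0] + np.asarray(X) @ coef[1:]

# Synthetic, noise-free data used only to show the calls.
rng = np.random.default_rng(3)
X = rng.random((20, 4))
true_coef = np.array([2.0, 1.0, -1.0, 0.5, 3.0])
y = true_coef[0] + X @ true_coef[1:]
coef = fit_linear_benchmark(X, y)
y_hat = predict_linear_benchmark(coef, X)
```

Because the fit has a closed-form solution, the benchmark yields a single simulation, in contrast to the MLP ensembles obtained from repeated random initializations.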
For example, the model performance based on the PI index shows that all MLP-ANN variants provided models that were superior to the SLMB (see the results in Table 6). The highest difference between the best PI value of the ANN and the PI of the SLMB was obtained for the MLP-ANN trained using the LM algorithm on the training dataset (PI ANN − PI SLMB = 0.53). The LM and PER training algorithms provided models with the highest values of PI on the testing and validation datasets (PI ANN − PI SLMB = 0.41 and PI ANN − PI SLMB = 0.32, respectively).
These conclusions are in agreement with the values of the remaining model performance measures (MAE, NS, and R4MS4E; see Table 7). The LM and BP regul were superior in terms of differences from the SLMB according to the MAE and R4MS4E. The LM and PER were superior to the SLMB for NS values on the training, testing, and validation datasets. Similar results can be found when comparing the results of the SLMB with the best MLP-ANN models organized in terms of different transfer functions. The highest differences of PI values were on the training dataset for the MLP-ANN with the LL transfer function (PI ANN − PI SLMB = 0.48), on the testing dataset for the RS transfer function (PI ANN − PI SLMB = 0.41), and on the validation dataset for the LL transfer function (PI ANN − PI SLMB = 0.31). These were calculated for MLP-ANN with transfer functions that were successful in more than 10% of simulations on the validation dataset.

The results for the tested training algorithms are shown in Tables 6 and 7. All training computations controlled the neuron's saturation using the method of Stäger and Agarwal [43]. The parameters of the TA (i.e., number of epochs, learning rate, etc.) were selected in such a way that the number of MLP-ANN evaluations was similar in all tested TA. Table 6 shows the results of the persistency index, which was used as the main reference index, since the PI compares the model with the last observed information [38]. The best TA according to the number of successful models with PI > 0 was the PER (the scaled conjugate gradient method with the Perry updating formula). The highest number of successfully trained models was found for the MLP with hd = 6 (see ntrained = 1181, ntest = 838, and nval = 468 in Table 6).
When comparing the performance of TA according to the best single value of PI (see columns PI train, PI test, and PI val in Table 6) and the average performance of best MLP-ANN models on PI (see columns mPI train, mPI test, and mPI val in Table 6), the Levenberg-Marquardt algorithm was mostly superior compared to all remaining TA, except for three cases, when the PER and BP regul were better on validation datasets for MLP with hd = 3, 6 on best single value of PI and for average of mPI val for hd = 6. Table 7 displays the results of best models for remaining statistical measures of MLP-ANN models trained on tested TA. Only three algorithms were superior at least for one architecture of MLP and on one dataset. They are LM, PER, and BP regul. Again, the LM was mostly superior compared to the other tested TA. The differences between results of LM and PER and BP regul were very small.
The best values of NS were in agreement with the values of PI (see, e.g., the PER on MLP with hd = 3). The BP regul was better in terms of the size of residuals for MAE test on MLP-ANN models with hd = 3, 5. Also, when comparing the simulation of peak flow in terms of R4MS4E, the BP regul was better on MLP with hd = 3, 6 for the validation dataset.
Our findings are in agreement with the runoff forecast results of Piotrowski and Napiorkowski [18], who compared the Levenberg-Marquardt approach even with more robust global optimization schemes and found that the LM provides results comparable with MLP trained using the selected evolutionary computation methods.

The Transfer Functions.
The results of PI, MAE, NS, and R4MS4E are shown in Tables 8 and 9. The PI has again served as a reference. We trained the MLP with all AF listed in Table 1. Tables 8 and 9 show the results of AF for MLP-ANN models, which were successful in more than 10% of simulations on validation dataset.
When comparing the absolute values of number of MLP-ANN models with PI > 0, the models with two AF (RS and CLm) were superior compared to MLP models with remaining 9 AFs. The MLP with RS provided the larger number of better models in terms of PI value on 8 datasets, while the MLP with CLm transfer function was successful on 4 datasets.
RS was also the most successful AF on the training dataset for MLPs with hd = 4, 5, 6 (note that for hd = 3 the differences in PI between RS and CLm are almost insignificant). The LL also provided good results on the training dataset (for all tested values of hd) and on validation data for hd = 4, 5.
The mean performances based on arithmetical means of PI values of the best models showed that three AFs were superior to the remaining 8 AFs (see mPI train, mPI test, and mPI val in Table 8). They were the CL, HT, and RS MLP-ANN models. Their differences in PI were again very small. Table 9 shows the averages of MAE, NS, and R4MS4E on the set of tested models. The results point out that the RS transfer function provided, in summary, superior values compared to the rest of the tested AF. The CLm, HT, and LS activation functions were better on some datasets in terms of mean values of the tested statistical measures, but the differences from the RS MLP-ANN models were again negligible.
When reflecting the results of da S. Gomes et al. [15], who recommended the CL, CLm, and LL functions on MLP ANN models, we point out the ability of the MLP models with RS to improve the flood runoff forecast.
Our findings on the selection of suitable AF on MLP ANN models recommend that different AF should be tested during the implementation of MLP models for flood runoff forecast.

Conclusions
During the extensive computational test, we trained in total 46200 models of the multilayer perceptron with one hidden layer. The main aim of the computational exercise was the evaluation of the impacts of the transfer function selection and the test of selected local optimization schemes on flood runoff forecast.
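As a quick check, the total number of trained models follows from the experimental design reported in the results: 4 architectures, 11 activation functions, 7 training algorithms, and 150 repetitions of each training run with different initial weights:

$$4 \times 11 \times 7 \times 150 = 46200.$$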
Using the rainfall runoff data of nine of the most significant flood events, we analyzed the short term runoff forecast on small watershed with fast hydrological response. The developed MLP ANN models were able to predict flood runoff using the records of past rainfall and runoff from the basin.
When comparing the tested MLP ANN models with benchmark simple linear model, the developed MLP models were superior in terms of values of model performance measures compared to the SLMB.
The PONS2train software application was developed for the purposes of the evaluation of MLP-ANN models with different architectures and for providing the simulations of neural network flood forecast.
When analyzing the 7 different gradient oriented optimization schemes, we found that the Levenberg-Marquardt algorithm was superior to the tested set of scaled conjugate gradient methods and the two first order local optimization schemes.
When analyzing the 11 different transfer functions used in hidden neurons, we found that, according to the values of the four model performance measures, the RootSig function was the most promising activation function in terms of flood runoff forecast.