Influence of Hydrological Model Selection on Simulation of Moderate and Extreme Flow Events : A Case Study of the Blue Nile Basin

Five hydrological models were applied based on data from the Blue Nile Basin. Optimal parameters of each model were obtained by automatic calibration. Model performance was tested under both moderate and extreme flow conditions. Extreme events for the model performance evaluation were extracted based on seven criteria. Apart from graphical techniques, there were nine statistical “goodness-of-fit” metrics used to judge the model performance. It was found that whereas the influence of model selection may be minimal in the simulation of normal flow events, it can lead to large underand/or overestimations of extreme events. Besides, the selection of the best model for extreme events may be influenced by the choice of the statistical “goodness-of-fit” measures as well as the criteria for extraction of high and low flows. It was noted that the use of overall water-balance-based objective function not only is suitable for moderate flow conditions but also influences the models to perform better for high flows than low flows. Thus, the choice of a particular model is recommended to be made on a case by case basis with respect to the objectives of the modeling as well as the results from evaluation of the intermodel differences.


Introduction
According to the Intergovernmental Panel on Climate Change, IPCC [1], the continued increase in the greenhouse gas emissions will lead to the warming and changes in the various components of the climate system.With respect to the hydrological system, these changes are reflected in the increased frequency and severity of risk-based water disasters such as drought and floods which inflict severe damage on public life and property in many parts of the world on a yearly basis.The Nile Basin in which the study area is located is not exceptional to such hydrological disasters.Although the Ethiopian highlands experience rainfall in excess of 1000 mm annually, extreme low rainfall conditions also tend to occur in the Blue Nile Basin (BNB).Thus, both floods and drought occur in the study area.Some typical examples of the occurrences of hydrometeorological extremes in the BNB include the daily rainfall of 88 mm (which was the highest since late 1890s) in Khartoum on the 31st July 1920 [2], the severe floods in the months of August and September in 1988 in Sudan [3], the devastating drought of the early 1980s in the Ethiopian highlands [3], and drought conditions over the 1970s and 1980s which led to the decline in annual rainfall by 30% [4].
In support of the investigations of impacts of climate variability and change on hydrological extremes, hydrological models may be applied.The ability of such models to capture the extreme high and low flow events is important for planning and management of risk-based water resources applications.Because the study area is dominated by cultivated area based on rain-fed agriculture, normal hydrological events (e.g., annual mean flows) which may be indicative of the mean net rainfall totals are vital for the management of agricultural practices.Distributed models (from which simulated runoff can be obtained at any point of interest within the catchment) require more detailed spatial inputs to reflect the spatiotemporal variability in the runoff.Conceptual models (which are simpler in their structures than the fully distributed process-based models) are designed mostly to simulate lumped runoff at the catchment outlet (as aimed at in this study) based on catchment-wide averaged inputs.

Advances in Meteorology
Some examples of rainfall-runoff models include Nedbør-Afstrømnings Model (NAM) (Danish Hydraulic Institute DHI [5][6][7]), Hydrologiska Byråns Vattenavdelning (HBV) [8], Probability Distribution Model (PDM) [9], Precipitation Runoff Modeling System (PRMS) [10], SIMHYD, that is, the simplified version of HYDROLOG [11], TANK model [12], Sacramento (SAC) model [13], Australian Water Balance Model (AWBM) [14], and Identification of Unit Hydrographs and Component Flows from Rainfall, Evaporation and Stream (IHACRES) flow data model [15][16][17].For the study area, several rainfall-runoff models have been applied to model the hydrological regimes of the Blue Nile.Examples of the models applied include Soil and Water Assessment Tool (SWAT) by [18][19][20][21], HBV by [22,23], Hydrologic Engineering Centre-Hydrological Modeling System (HEC-HMS) by [24], and PRMS by [25,26].Most of the above studies applied only one hydrological or rainfall-runoff model.Because of the tendency of the models to differ among themselves in terms of their structural complexity and set of parameters for calibration, the use of only one model in hydrological investigations leads to lack of insight about the influence of the model selection on the modeled results [27].According to [27], the influence of the selection of a hydrologic modeling approach which is seldom investigated is critical for impact assessment, for example, of climate variability and change on the water resources of the BNB.The use of modeled results for data scarce region like the BNB to aid water resources planning and management decisions requires clear assessment of the possible uncertainties and their communication.Water management decisions taken amidst uncertainty of the scientific supporting information may lead to unnecessarily lavish expenditure of the limited economic resources, for example, for risk-based applications.To judge the confidence in the results of modeling amidst data limitation and quality problem, evaluation of a number of models in particular study is of paramount importance to the scientific community especially those involved in impact investigations, for example, of climate variability and change on hydrology.
This study is therefore aimed at exploring the influence of model selection on the simulation of (1) daily moderate and extreme flow events, (2) temporal changes in the flows from (1), and (3) hydrological extreme quantiles as a simultaneous function of aggregation levels and return periods.

Study Area, Data, and Selected Models
2.1.Study Area and Data.The BNB (Figure 1) with the flow outlet at Khartoum has a drainage area of about 325,000 km 2 which extends in both Ethiopia and Sudan in Africa.The Blue Nile with a total length of about 1,460 km flows into and out of Lake Tana and emanates from the Ethiopian highlands based on the two main tributaries including the Dinder and Rahad rivers.From Lake Tana to El Diem which is at the Ethiopian-Sudanese border, the length of the Blue Nile is about 940 km.The basin can receive annual rainfall up to about 2000 mm, though in some years it can reduce to less than 1000 mm.The climate of the basin is characterized by seasonal migration of the intertropical convergence zone.Daily rainfall data at 14 locations (Figure 1) in and around the BNB were obtained online from the Global Historical Climatology Network (GHCN) [28,29] via the link http://www.ncdc.noaa.gov/oa/climate/ghcn-daily/(accessed: 11th June, 2014).Gridded (0.3 ∘ × 0.3 ∘ ) Climate Forecast System Reanalysis (CFSR) rainfall data from the National Centers for Environmental Prediction (NCEP) were also obtained via the web link http://cfs.ncep.noaa.gov/cfsr/(accessed: 3rd February, 2016).The rainfall data obtained from the GHCN were mainly from 1965 to 1990.The percentage of the missing rainfall records from the GHCN was minimal (Table 1).From 1991 to 2000, the rainfall data from the GHCN were augmented by the daily CFSR data.The missing rainfall records were in-filled using the inverse distance weighted interpolation technique as applied before for the rainfall of the Nile Basin by [30].Consider Ω  as the missing rainfall intensity at the meteorological station , let  be the distance between  and another station in the neighborhood being used for interpolation, and take  as the power parameter.Using  other neighboring rainfall stations, Ω  for a particular period was estimated using For a realistic value of the interpolated record,  was set to 2. This was to obtain optimal weights to the points both far and near the station with the missing record.
Daily flow series from 1965 to 2000 observed at Khartoum was adopted from a study by [23].Daily flow data at El Diem and Ribb observed from 1980 to 2000 were obtained from personal sources.Minimum and maximum temperatures were also downloaded from the data sources similar to those of the rainfall series.

Rainfall-Runoff Models.
Five internationally well-known models including IHACRES, AWBM, TANK, SAC, and SIMHYD were obtained from the "eWater Toolkit" of the Cooperative Research Centre for Catchment Hydrology in Australia via the link http://www.toolkit.net.au/(accessed: 25th August, 2015).These models were selected because (1) they are freely available online, (2) by modeling the runoff in rather lumped than distributed way, they conform with the aim of the study, and (3) their robustness for hydrological modeling has been demonstrated for several climatic zones by several studies, a few examples of which include [31][32][33][34][35][36][37].
For the details on the structures and rainfall-runoff generation processes which these models can simulate, the reader is referred to [15][16][17]38] for IHACRES and to [39] for AWBM, TANK, SAC, and SIMHYD.However, some brief description on each model as well as the general structural details can be obtained from Appendices A.1 to A.5.The model parameters are also presented in Appendices B.1 to B.2.

Methodology
3.1.Rainfall-Runoff Modeling.All the selected five models use catchment-averaged rainfall and potential evapotranspiration (PET) as inputs.Based on the PET and the soil water storages, the actual evapotranspiration is calculated by the models.The PET was computed using the FAO Penman-Monteith method [40] considering minimum and maximum temperature.
First and foremost, before applying the models to the entire BNB to simulate the long-term runoff at the Khartoum flow outlet, it was deemed important to investigate the influence of spatial extent of the catchment on the capacity of the models to simulate the runoff.Daily river observed from 1980 to 2000 at Khartoum, El Diem, and Ribb with the catchment areas of 325,000 km 2 , 176,000 km 2 , and 1,592 km 2 , respectively, were used.Based on the rainfall stations upstream of each flow station, Thiessen polygons were constructed.The average of the observed daily rainfall in each polygon was merged with the mean of the gridded CFSR rainfall data (within the same polygon) so as to cover the period similar to that of the flows.The blended rainfall data from all the constructed polygons were jointly used to compute the catchment-wide average rainfall for the model inputs.
To limit the influence of subjectivity in the manual changing of model parameters, automatic calibration was done for all the models.A number of optimizers exist for automatic calibration including the uniform random search, Rosenbrock multistart search, Rosenbrock single start search, multistart pattern search, single start pattern search, generic algorithm [41,42], and shuffled complex evolution (SCE) [43][44][45].The SCE was adopted in this study because of two reasons.Firstly, it has been demonstrated to be efficient and effective in the calibration of a number of rainfall-runoff models (see, e.g., [44,[46][47][48]).Secondly, the SCE has also been widely applied for automatic rainfall-runoff calibrations (see, e.g., [49][50][51][52]).According to [6], the SCE method combines four optimization strategies to simultaneously evolve a number of potential solutions towards the region of the global optimum of the objective function.These SCE strategies include the simplex method, competitive evolution, controlled random search, and complex shuffling.Note should be taken that the adoption of the SCE was possible for the models incorporated within the Rainfall Runoff Library (RRL) of the "eWater Toolkit," that is, AWBM, TANK, SAC, and SIMHYD.For IHACRES which was obtained as a stand-alone model, the calibration process is automated in two steps.Firstly, the linear module calibration which is controlled by fixed transfer function is done using cross-correlation to calculate the delay between rainfall and stream flow.Secondly, the nonlinear module calibration is used to perform a grid search through the parameter space to obtain the best parameter set.Further detail on this calibration procedure for the IHACRES can be found in [38].For the calibration of all the models, the optimization objective function was in terms of the wellknown Nash-Sutcliffe Efficiency (NSE) [53].
After assessing the influence of spatial variation on the adequacy of the modeling results, the models were rebuilt using the long-term model inputs considering the entire BNB (i.e., with the flow outlet at Khartoum).To obtain catchmentwide rainfall average, Thiessen polygon was constructed using all the 14 rainfall stations.For each model, the automatic calibration was done based on daily series from 01/01/1965 to 31/12/1990.The validation of the models was done using daily data from 01/01/1991 to 31/12/2000.The "goodness-offit" between the observed and modeled flows was assessed using the NSE for both calibration and validation periods.

Comparison of Observed and Modeled Flows
3.2.1."Goodness-of-Fit" of the Entire Time Series.Firstly, the "goodness-of-fit" between the observed and modeled flows considering the entire or full-time series (covering both the calibration and validation periods) was, again, assessed using the NSE.Secondly, to reduce/increase the sensitivity of the NSE to high/low flows, the logarithmic values of observed and modeled flows were to be used.However, in this study, because some modeled flows were zero in value for all the models (except TANK), the use of logarithmic values to compute NSE was not feasible.Eventually, two other forms of statistical "goodness-of-fit" metrics which consider the absolute relative deviations between the observed and modeled flows as presented by [54] were used.Consider   as the modeled flow,   the observed flow,   the mean of observed flow, and  the sample size.The relative efficiency (  ) and index of agreement (  ) were computed using If the assessment of the models is with respect to high flows, values of the power   greater than 1 can be used.However, in this study,   = 1 was used to obtain a balance between high flows and low flows given that the full-time series was being considered.
Graphically, comparison of the simulated and observed cumulative flows was also made to investigate the water balance closing capacity of each model.Using the annual mean flow, scatter plot points of observed versus modeled series was also made.This was after Box-Cox (BC) transformation procedure (as will be described in Section 3.2.2).

"Goodness-of-Fit" of Flow Extremes.
Using the entire or full-time series (covering both the calibration and validation periods), flow extremes were extracted based on a number of criteria for the model performance evaluation.Firstly, the forecast accuracy metrics  FC and  FC relevant for high flows and low flows, respectively, as proposed by [55] were computed using where   is number of low flow events lower than one-third of the mean low flow observed;   is number of high flow events greater than one-third of the mean peak flow observed.
In this study, the mean peak high flow was computed as the average of the events above the median flow.Similarly, the mean low flow was taken as the average of the hydrological events less than the median flow.According to [55] To investigate the effect of adjustment of extreme flow extraction threshold on the model performance,  FC ,  FC ,  AB , and  MSE were also computed by setting   (or   ) from (3) and  of (4) in terms of (1) the number of events higher (or lower) than the 95th (5th) percentile of observed flows and (2) the number of annual maxima and minima flows.
Graphically, plots of the Box-Cox (BC) transformed observed versus modeled series were made for the minimum as well as maximum flows in each year.To give similar weights to the maximum and minimum flows () so as to obtain homoscedastic model residuals, the parameter () of the BC [56] transformation (see (5)) was set to 0.25.Consider BC () =   − 1  . (5)

Analyses of Temporal
Changes.The differences in the changes between observed and modeled flows were investigated using the maximum, mean, and minimum flow in each year.The considered changes in flow were in terms of the variability and subtrends.Subtrends are short-durational changes in trend direction over unknown periods within the series [56].The identification and assessment of the temporal subtrends and investigation of variability are vital to ascertain the possibility of any intervention of, for example, climate fluctuations on the hydrology.The subtrend analysis was done using the cumulative rank difference (CRD) technique [56].To graphically reveal the hidden short-durational changes (e.g., jumps in the mean, subtrends, etc.) within the annual mean flows and the minimum as well as maximum flow in each year, CRD plots were used.To construct the CRD plots, the following steps were taken: (i) Considering V as the number of times a data point is exceeded,  as the number of times a data point appears within the given sample and  as the sample size, the difference () between the exceedance and nonexceedance counts of the data points is computed using To determine V or , each data point is counted as if it was not considered before [57].Considering the imaginary series (3,2,7,4,3,4),  = 6 and for  = 1, 2, . . ., , V = (3, 5, 0, 1, 3, 1),  = ( (ii) The cumulative sum () of  from ( 6) is given by For the series in Step (i),   = (2, 7, 2, 0, 2, 0).It can be checked that, at  = ,   is always zero.
(iii) The CRD plot was made in terms of the graph of   against the time unit of the series.
The short-durational changes from the CRD plots can be identified based on the graphical guidelines illustrated in Figure 2 which was generated based on synthetic series  of  = 200.In the CRD plot, the  = 0 line is taken as the reference (i.e., the case when the data points within the series are of the same value).The values above or below this reference are considered to characterize subtrends in the series.If the series has no trend, the scatter points cross the reference several times (see case (a) and (1)).When the series has a positive/negative trend, most of (if not all) the scatter points take the form of a curve above/below the reference (see case (b) and (2)).It is possible that different changes can occur within the same series.For instance, when the series has its first/second half characterized by positive/negative trend, two curves are formed such that the first one is above the reference and the other is below the  = 0 line (see case (c) and (3)).For a step upward/downward jump in the mean of the series (assuming there is no trend in both parts of the subseries before and after the step jump), the scatter points form lines which meet at a point (call it the vertex) above/below the reference (see case (d) and ( 4)).For an upward/downward jump, the slope of the first line is positive/negative, while that for the second one is negative/positive.For further details on the use of the CRD plots to identify changes in the series and how the significance of the visualized overall trend can be tested, the reader is referred to [57].Variability was investigated using the nonparametric anomaly indicator method (NAIM) [58].NAIM relies on nonparametric rescaling of the data followed by temporal convolution or aggregation of the series.According to World Meteorological Organization (WMO) [59], aggregation of series allows study of the general or summarized and representative behavior of the data.To implement the NAIM to compute anomalies in the series, the following was done: (i) A block length (  ) based on a sensitivity analysis was selected.This was done while ensuring that the periods of high or low anomalies were as nearly independent as possible.To capture the decadal oscillations (if any), sensitivity analysis was done using   in the range from 2 to 10 years.Finally   = 10 years was adopted because it gave a more representative smoothing of the series than for other values such as   = 2, 5, and 7 years.
(ii) Each value of  from (6) was transformed using   = −1 ×   for 1 ≤  ≤ ; this was to make the temporal variation of the rescaled series match that of the given data while circumventing the influence of possible outliers (if any) on the variability analysis.(iii) Using   and rescaled series  from Steps (i) and (ii), respectively, temporal aggregation was done in an overlapping way using (8) to obtain the mean () of the subseries in each time slice ; consider where  =  + (  − 1) and 1 ≤  ≤ ( + 1 −   ).(iv) The anomaly (%) in each slice  was calculated as a ratio of 100  to ( − 1).Actually, ( − 1) is equal to the possible maximum absolute value of  from Step (ii).Thus, the possible maximum anomaly is 100%.(v) Finally, the anomaly (%) was plotted against the midpoints of the time slices.From the plot, the long-term mean of the series is represented by the zero percent anomaly.The fluctuation of the anomalies above or below the reference was considered to characterize the variability in the series.
If the   of 10 years selected in Step (i) is moved in an overlapping way as described by (8), the resultant time slices based on the data considering Khartoum flow outlet, that is, from 1965 to 2000, would be 1965-1974, 1966-1975, . . . , 1991-2000.The mid-points of the time slices (as required in Step (v)) against which the anomalies from Step (iv) are plotted would be 1970, 1971, . . ., 1996.
For validity of the NAIM results, the significance of the anomalies can be tested to verify the null hypothesis ( 0 ) ( that the variability in the series is caused by only natural randomness; that is, there is no persistence in the temporal variation.Nonparametric bootstrapping [60]  To evaluate the performance of the models in the simulation of the observed temporal variability, the following steps of the nonparametric skill score (SC, %) were taken: (i) The zero percent anomaly was taken as the reference for natural randomness in the data.The symbols "+" or "−" were assigned to a particular time slice if its anomaly was greater or less than zero, respectively.If the anomaly was zero, " * " was assigned.
(ii) Again, the symbol "+" was assigned to a particular time slice if its anomaly was greater than or equal to the upper limit of the 95% CI; otherwise, the symbol "−" was assigned.
(iii) Furthermore, the symbol "+" was assigned to a particular time slice if its anomaly was less than or equal to the lower limit of the 95% CI; otherwise, "−" was assigned.
(iv) Rewards and penalties were awarded for the scores assigned to the anomalies from the models.For a given model, if its modeled flow correctly obtained the anomaly sign (i.e., + or −) similar to that of the observed flow in a particular time slice, a score of +1 was given to the model; otherwise a penalty of −1 was assigned.This procedure was repeated separately for each of the criteria from Steps (i) to (iii).
(v) The number of criteria under which the scores were assigned was determined and denoted as ; in this case  = 3 because there were three aspects of accuracy tested, that is, in Steps (i) and (iii).Consider that ℎ 1 , ℎ 2 , . .., and ℎ  denote the sum of scores for the first, second, . .., and the th criterion, and if  1 ,  2 , . .., and   are the sample sizes of events for which scores were assigned using the first, second, . .., and the th criterion, generally the balance () between the total of the rewards and penalties would be given by  = (ℎ 1 / 1 ) + (ℎ 2 / 2 ) + ⋅ ⋅ ⋅ + (ℎ  /  ).
(vi) The SC (%) for a particular model was calculated using The best model performance is given by SC of 100%.The worst model is that with SC of zero.
3.2.4.Amplitude-Duration-Frequency Relationships.Amplitude-Duration-Frequency (ADF) relationships combine Extreme Value Distribution (EVD) over a range of aggregation levels.ADF relationships can be considered as one of the most important tools for risk-based water engineering and management.ADF relationships are used for design, operation, and/or management of water supply projects (e.g., dikes, dams, and irrigation systems) [61] or urban drainage facilities such as sewer conduits [62].ADF relationships constructed for flow and rainfall are called QDF and IDF, respectively.In this study, QDF relationships were constructed for both observed and simulated flows.The first step in the construction of QDF relationships is the selection of aggregation levels, that is, durational intervals over which flows are averaged.To cover the relevant water resources management or water engineering applications as agriculture, irrigation, hydropower, domestic water supply, pollution control, and so forth, aggregation levels may be selected in the range of 1-90 days for high flows and the range of 1-365 days for low flows [63].
Equation (8) with the term   taken as the aggregation level and  as the original (i.e., unrescaled) flow is used in averaging of the series.For analyses of low flows from nonephemeral rivers (i.e., when  > 0), the aggregation of  is done after transformation of  by (1/).For low flow analyses, mostly Weibull or Fréchet distributions are used.However, the transformation of the series using (1/) makes the low flows to follow the Generalized Pareto Distribution GPD [64] or exponential instead of Weibull or Fréchet distribution.This transformation allows both the low flows and high flows to be analyzed in a similar way.Consider () as the cumulative distribution function of the GPD, and assume that (1/) low flows exhibit normal tail in the exponential quantile plot (i.e., − ln{1 − ()} in abscissa and  in ordinate).As shown by [65], to calibrate exponential distribution (with scale () and location or threshold (  ) parameters) to the series ( = 1/) above a specified threshold   , This equation can be transferred towards a distribution for  (using   = 1/  ) as follows: Advances in Meteorology For values lower than   , it can be noted that (13) matches the Fréchet distribution () = exp(− − /) with  = 1.The next step after the temporal averaging of the series is the selection of independent hydrological extremes for each aggregation level.In line with frequency analysis, the requirement that the data should be independent and identically distributed can be achieved through extraction, from the full series, of either Annual Maxima Series (AMS) or the Partial Duration Series (PDS)/Peak Over Threshold (POT).The main advantage of the AMS in which the time slice is chosen to be one hydrological year is that it produces extremes with stronger independence compared to those of the POT method.However, it has the disadvantages that the sample is rather small, and because the second highest event in each year is not considered, some useful information for the definition of the extreme value region is lost.In the POT method, all events above a certain truncation (threshold) level are extracted, thereby leading to a more reasonable sample size for extreme value analysis than that of the AMS.Thus, the POT method intuitively provides a more consistent definition of the extreme value domain than the AMS approach [66].Eventually, the POT method of extracting independent hydrological extremes was adopted in this study.The POT events were selected using the independence criteria based on the flow threshold and the time between successive flow extremes as presented by [67].
The step that follows the POT extraction is fitting of the EVD to the independent extreme events.It is known that the POT events as used in this study follow the GPD (see ( 14) and ( 15)) which can be valid for values of  above the threshold   .Consider Three classes of the GPD can be identified based on the shape parameter , that is, normal ( = 0), heavy ( > 0), and light ( < 0) tails.When  ≥ 0, the upper GPD tail goes up to infinite values and for  < 0, the GPD has a final right-end point.The GPD for  = 0 equals the exponential distribution.The parameters of the EVD were estimated using the weighted linear regression (WLR) technique based on the quantile incremental properties of the GPDs [68].The WLR technique was adopted because it was recently demonstrated by [66] to be ostensibly more robust to capture EVD tails than other well-known parameter estimation methods such as the method of moments, maximum likelihood, and -moment approach.To compute the parameters of the GPD (see ( 14)), Pareto quantile plot (i.e., − ln{1 − ()} in abscissa and ln() in ordinate) is used.In this plot, the GPD (see ( 14)) appears as a line and the slope of this line especially in the tail approximates to .Eventually, for the GPD (see ( 14)),  is computed using (16), and  can be estimated by least square weighted linear regression based on the Hill weights [69] using (17).To compute the parameters of the exponential distribution (see (15)), exponential quantile plot (i.e., −ln{1 − ()} in abscissa, and  in ordinate) is used.In this plot, the exponential distribution takes the form of a line whose slope is equal to  which can be estimated again based on the Hill weights [69] using (18).Consider  as the number of POT events above the threshold   and The optimal   is obtained as the value of  for which the mean squared error (MSE) is minimal.For the GPD (see ( 14)) and exponential distribution (see (15)), the MSE of the WLR in the Pareto and exponential quantile plot can be given by ( 19) and (20), respectively.Consider The second last step of QDF construction comprises the estimation of the flow quantiles.Considering  as the data record length in years and  as the rank of the POT events ( = 1 for the highest), the relationship between the -year flow   and the return period  based on calibrated GPD and empirical data can be calculated using ( 21) and ( 22), respectively.Consider Empirically,   is obtained as the flow value corresponding to  computed using (22).Based on (22), it is noticeable that the empirical   can only be estimated for  values not exceeding .For  greater than , the theoretical quantiles can be estimated for the GPD (see ( 14)) and exponential distribution (see (15)) using ( 23) and (24), respectively.Extrapolation requires the choice of  values to be made.In this study, to minimize the possible uncertainty boost in the extreme value analyses due to finite sample size,  was ensured not to exceed 100 years (i.e., less than three times the data record length).
Besides, the range of  from 5 to 100 can be used in planning and designing of multipurpose risk-based water engineering applications such as hydraulic structures along river systems (bridges, culverts, etc.). around 100 years may be used for medium-sized flood protection systems, flood plain development, and so forth [62].In the construction of the QDF relationships, for each aggregation level the theoretical quantiles   are computed for all the selected return periods.Consider Figure 3 shows examples of calibrated exponential distribution or normal tailed GPD as regression lines in exponential quantile plots for both high flows and low flows.It is noticeable that the linear behavior of the quantiles is obtained towards the tail of the distribution of events.The regression lines can be seen to be extrapolated up to  of 100 years for both high flows (Figure 3 For low flows, back transformation is applied to the (1/) transformed -year events to obtain the actual quantiles.
Finally, the QDF relationships comprise the compiled values of   for all the aggregation levels as well as the different selected return periods.This makes possible the estimation of cumulative volumes of water during drought or flood periods at various aggregation levels or return periods.

Effect of Catchment Size on Simulation of Runoff.
Figure 4 shows the performance of the models in simulating runoff for catchments of different drainage areas using hydrometeorological inputs from 1980 to 2000.It is noticeable that the flows of Ribb (Figures 4(k)-4(o)) are less smooth than those of El Diem (Figures 4(f)-4(j)) and Khartoum (Figures 4(a)-4(e)).Normally, the time taken by the water to reach the catchment outlet when the area or size is large tends to be long.The increasing delay in the catchment response as the catchment size increases leads to low runoff volume from rainfall following the large losses of rain water by infiltration, evaporation, and so forth.Given that the flow is cumulative in volume from upstream to downstream, the changes in flow due to the fluctuations in rainfall-runoff tend to be more smoothened as the catchment size increases.However, regardless of the catchment size, it is noticeable that the observed hydrographs are well reproduced by those of the modeled flows.The statistical evidence of the agreement between the observed and modeled flows for all the catchments is provided in Table 2.The NSE for each model fell in the range of 0.70-0.81regardless of the catchment size.Similarly, the correlation between the observed and modeled flows was between 0.8 and 0.9.This  indicated the adequacy of the models to simulate long-term runoff from the BNB.Eventually, to evaluate the performance of the models in reproducing the variability and quantiles of the flows considering the entire BNB, all the models were applied to simulate runoff at Khartoum and the modeling results are presented in Section 4.1.2.Table 3 shows the statistical "goodness-of-fit" from the different models considering calibration and validation periods separately.The NSE values for calibration were higher than those for validation.With respect to the calibration results, the validation NSE for AWBM, IHACRES, SAC, SIMHYD, and TANK dropped by 20.9, 11.0, 5.4, 20.0, and 10.1%, respectively.The average drop of NSE by 13.5% (considering all the models) could be indicative of the difficulty of the models to capture the runoff generation dynamics due to the possible change in catchment behavior.As stated in Section 3.1, the model validation was from 1991 to 2000.The land management policy in Ethiopia (where the upper part of the BNB is located) from 1991 to 2000 was different from that of the 1975-1990 period.This difference in policy was due to the changes in the government or political regimes in 1975 and 1991 [70].For instance, from 1975 onwards, the landlord-tenant relationship was abolished by the "landto-the-tillers" policy; however, by 1991, the government decreed state ownership of land [70].The combination of such governmental policy changes and other factors such as widespread poverty associated with frenzied population growth promotes anthropogenic influences on the hydrology through overgrazing, deforestation, significant expansion of urbanized areas, and so forth.Such anthropogenic factors affect hydrology by altering the amount of infiltration into the soil, velocity of the overland runoff, the rate and amount of evaporation, and so forth [23], thereby modifying the catchment response to the rainfall input.The capacity of the models to capture such changes in catchment behavior tends to differ as seen in the differences in the NSE drop.However, because the NSE values for all the models were higher than 0.6 for the validation, the simulated flows were considered adequate for the comparison of the model performance in capturing the variability of the flows.Figure 6: Performance of the models in terms of (a) the cumulative flow and (b) observed versus simulated annual mean flow.For the labels of the axes in chart (b), "BC" appearing before "()" shows that the flow (m 3 /s) values plotted were obtained after applying Box-Cox (BC) transformation using (5).simulated flows combined over both calibration and validation periods.It is seen that whereas the AWBM overestimated the observed cumulative flows, underestimations were obtained for the rest of the other models (Figure 6(a)).Furthermore, the deviations between the observed and simulated flows from the various models can be confirmed by most of the scatter points of the AWBM (and the rest of the models) falling above (below) the bisector (Figure 6(b)).The performances of IHACRES, SAC, SIMHYD, and TANK are shown to be comparable (Figures 6(a) and 6(b)).The average biases in reproducing observed cumulative flows for AWBM, IHACRES, SAC, SIMHYD, and TANK were 17.68, −6.86, −10.18, −12.56, and −10.37%, respectively.One probable cause of these model biases indicating the deficiency in capturing the observed water balance closure could be the overall structural differences among the models in capturing the dynamics of the runoff generation in the study area.Another reason for the model biases would be the low quality of the meteorological model inputs.However, as already seen in Table 1, the quality of the data at each station was satisfactory for rainfall-runoff modeling.The performance of the rainfall from these stations, say, to explain the temporal variation in the runoff would be good if they are to be used in a case-specific way, for example, each station for a particular subbasin where it is located.However, in the implementation of Thiessen polygons for the computation of catchment-wide average rainfall, all the stations are considered so long as they fall within the polygons irrespective of whether they are located over the catchment or not.Although all the 14 rainfall stations from Table 1 were used to construct Thiessen polygons, it is noticeable from Figure 1 that stations 3, 5-6, and 10-14 are outside the BNB boundary.With respect to the physical reality, rainwater from these stations, 3, 5-6, and 10-14, cannot contribute directly to the overland runoff of the BNB.The consideration of data from stations located outside the catchment as done in this study (following the limitation of stations with data) lowers the accuracy of the temporal variation of the lumped catchment-wide rainfall to capture the variation in the observed overland flow; thus it becomes a possible reason for the mismatch between measured and modeled flows.

Comparison of the Observed and Modeled
Figure 7 shows the statistical measures of agreement between observed and simulated flows.It is noticeable that all the metrics including   ,   , and NSE for each model were above 0.5.Furthermore, considering each statistical metric, the performances of the models are evidently comparable.This shows that, without giving particular focus to either flooding or drought conditions, the influence of model selection on the simulation of the runoff of the study area may be minimal.Nonetheless, based on which model obtained the highest statistical "goodness-of-fit," the best performance with respect to   and   was realized from TANK, followed by IHACRES.The lowest values of   and   were obtained by AWBM and SIMHYD, respectively.However, with respect to NSE, the best performing model was AWBM, followed by TANK.The lowest NSE was obtained by SIMHYD.It can be seen that the value of   was lower than those of   and NSE for each model.For the AWBM, the NSE was greater than both   and   .On the contrary, the rest of the models exhibited higher value of   than for the NSE.This shows that the judgment of the model performance depends on the selected statistical "goodness-of-fit" measure.It is therefore recommended that a number of statistical metrics be used in evaluating model performance in rainfall-runoff modeling. .For the annual minima flow,  FC was zero for SAC because all the modeled minimum flow in each year was zero for the entire calibration and validation periods.To supplement the evaluation using  FC and  FC , results based on  AB and  MSE are presented in Table 4.The overall best model (i.e., with the lowest  AB ) to capture both high and low flow conditions was AWBM.It is noticeable that the magnitude of a particular statistical measure for evaluating the model performance tended to differ with the variation in the criteria for extracting the flow extremes from the full-time series.This shows that, for the intercomparison of the performance of the models with respect to extreme conditions, high and low flow events should be extracted based on a number of criteria.In this study, seven criteria were considered.The results of the use of six criteria are presented in Figure 8.The last (i.e., the seventh) criterion which entailed extracting events which are nearly independent and identically distributed was applicable to both high and low flows and is very vital for applications of extreme value analyses as seen in the construction of QDF relationships (see Sections 3.2.4 and 4.2.3).

"Goodness-of-Fit" of Flow Extremes
Figure 9 shows the graphical model performance in terms of the Box-Cox transformed observed and modeled flows.For an ideal model (which is said to be unbiased), the scatter points in Figure 9 would all fall along the bisector (i.e., 45 ∘ line in the Cartesian plane).Consistent with the graphical observation made on Figure 5(d overestimated the maximum flow in each year.As seen from Table 4, this overestimation by SIMHYD equaled  AB of 62.57%.It is shown that the TANK model underestimated most of the annual maximum flow events.Figure 9(b) shows that the only realistic estimations of the minimum flow in each year were obtained by the AWBM.All the other models including IHACRES, SAC, SIMHYD, and TANK performed poorly in capturing the observed annual minimum flow.The worst performance for the annual minima flow was by SAC.This was for the reason that SAC produced zero flows during every dry period in each year; thus, the scatter points of its annual minima flows appear as a horizontal line (Figure 9(b)) with the constant value of −4 after applying the Box-Cox transformation using (5) with the parameter  set to 0.25 as mentioned before.Figure 9: Model performance evaluation for (a) maximum flow in a year and (b) minimum flow in a year.For the labels of the axes, "BC" appearing before "()" shows that the flow (m 3 /s) values plotted were obtained after applying Box-Cox (BC) transformation using (5).annual minima are presented in Table 5.It is shown that the observed maximum flow in each year was generally above the reference from 1970 to 1978 and again from 1991 to mid 1990s (Figure 10(a)).The annual maximum flow is shown to be below the reference from the late 1970s to early 1990s.This decrease in the flow was significant at the level of 5%.Visually, it is evident that all the models captured quite well the temporal variability in the observed annual maxima.The temporal variability of the annual maxima is comparable with that of the mean annual flows (Figure 10(b)).

Analyses of Temporal Changes
For the annual minima series, it is shown in Figure 10(c) and Table 5 that the flow events of 1970-1983 (1984-1996) were above (below) the reference.The anomalies in annual minima flow before and after 1984 were positive and negative, respectively.Nonetheless, the observed annual minima flow exhibited a steadily significant decrease from 1970 to mid 1990s.Statistically, the skill scores of the models in capturing the anomalies of variability in the annual maxima were 66.7% (AWBM), 63.0% (IHACRES), 63.0% (SAC), 70.4% (SIMHYD), and 66.7% (TANK).For annual mean flow, skill scores of 55.6% (AWBM), 59.1% (IHACRES), 55.6% (SAC), 62.9% (SIMHYD), and 70.4% (TANK) were obtained.Generally, the models performed poorly in capturing the variability in the annual minima flow with the skill scores of 18.5% (AWBM), 14.8% (IHACRES), 33.3% (SIMHYD), and 0% (TANK and SAC).According to [71], low flows are often poorly reproduced by most rainfall-runoff models which are tailored to capture flooding conditions.Besides, the overall water-balance-based objective function (as applied for calibration in this study) seemingly influences the capacity of the models to perform better in capturing the temporal changes in observed high flows compared to low flows.One option (which can still remain food for thought) would be for the model developers to revise their model structures to capture both low flows and high flows in an acceptably The symbols "+" and "−" denote anomaly greater and less than zero, respectively.The symbol " * " indicates anomaly of zero percent."Obs."denotes observed change.The cells with superscript symbol "#" are for anomalies significant at the level of 5%.Each anomaly was placed at the center of the 10-year time slice, for example, 1965-1974, 1966-1975, . . ., 1991-2000, obtained based on the NAIM procedure as described in Section 3.2.3.
simultaneous way.Secondly, for optimization of the numerical performance schemes during calibration, the use of waterbalance-based objective function should be combined with other criteria which seek to strike a balance in the model performance for both high flows and low flows jointly.Some of the criteria which can be combined together include the adequacy of the models to capture the overall shape of the hydrograph, peak high flows, low flows, and temporal changes (variability and subtrends) in high flows and low flows.
For short-durational changes in trend direction, the annual maxima and mean flow (Figures 10(d) and 10(e)) exhibited both negative and positive subtrends.This result is consistent with that of [23] that found decreasing and increasing short-durational trends over the periods of 1965-1983 and 1984-2000, respectively, in the Blue Nile flow at Khartoum.Like for the NAIM results, all the models again exhibited close agreement in capturing the temporal cumulative effect of the variations or subtrends in the annual maxima and mean flows.The correlation coefficients between the CRD curves for observed and simulated annual maxima flows were 0.86 (AWBM), 0.85 (IHACRES), 0.86 (SAC), 0.93 (SIMHYD), and 0.82 (TANK).For the annual mean flows, correlation values of 0.87 (AWBM), 0.86 (IHACRES), 0.86 (SAC), 0.89 (SIMHYD), and 0.88 (TANK) were obtained.For the annual mean flow, there was more discrepancy between the observed and modeled flows with respect to decreasing than increasing subtrends.Nevertheless, some slight discrepancy between the temporal subtrends of observed and modeled flows from all the models was exhibited from 1991 to 1999.This discrepancy could be an indication of the slight change in catchment behavior with respect to the rainfallrunoff generation dynamics.The arguable evidence of this change in the catchment behavior was already presented in Advances in Meteorology Section 4.2.1.Consistent with the NAIM result in Figure 10(c), it is noticeable from Figure 10(f) that the entire CRD curve fell below the reference, thereby indicating a dominantly decreasing trend in the observed annual minimum flows [23].The models again performed poorly in capturing the subtrends in the annual minima flows.The correlation coefficients between the CRD curves for observed and simulated annual minima flows were 0.28 (AWBM), −0.06 (IHACRES), 0.00 (SAC), 0.49 (SIMHYD), and −0.41 (TANK).Based on the obtained results, the best and consistent model performance to reproduce the observed cumulative temporal variation for annual maxima and mean flows was SIMHYD.

Amplitude-Duration-Frequency
Relationships.Figures 11 and 12 show the QDF relationships for extreme high flows and low flows, respectively, compiled from the quantiles estimated based on exponential distribution for return period  of 5, 25, and 100 years and aggregation levels of 1, 3, 5, 7, 10, 30, 60, and 90 days (for high flows) and 1, 10, 30, 90, 150, 180, 240, and 365 days (for low flows).The slope of each -year curve on the QDF relationships is positive (for high flows) and negative (for low flows).It is shown that SIMHYD (Figure 11(d)) overestimated the high flow quantiles for aggregation level of one day.However, for higher aggregation levels, the biases in the estimation of quantiles reduced considerably.For aggregation levels higher than 10 days, IHACRES (Figure 11(b)) underestimated the quantiles for  = 100 years.This might be due to the inadequacy of the model to reproduce higher quantiles in the tail of the distribution of extreme event, thereby leading to uncertainty in the EVD calibration for extrapolation.For AWBM (Figure 11 40 years were −0.55% (AWBM), 5.30% (IHACRES), 6.35% (SAC), 79.56% (SIMHYD), and −6.28% (TANK).For low flows, the corresponding biases of −53.51% (AWBM), −99.98% (IHACRES), −100% (SAC), −99.99% (SIMHYD), and −79.46% (TANK) were obtained.One reason for the biases in the models could be due to the observation errors in the flows.For instance, in the flooding conditions when the flow exceeds the river banks, observation errors become large as a result of the bias from the rating curve extrapolation.On the other hand, the difference in the biases indicates the influence of model selection on the simulation of extreme events.If the results from models are to be used to support decision for risk-based management, the model with the least bias would be selected.

Conclusions
In this study, the influence of hydrological model selection on the rainfall-runoff simulation was demonstrated using five well-known models including AWBM, IHACRES, Sacramento (SAC), SIMHYD, and TANK applied based on hydrometeorological data from the BNB.Optimal parameters of all the selected models were obtained through automatic calibration using Nash-Sutcliffe Efficiency as the objective function.Each model was tested by evaluating its capacity in capturing moderate and extreme hydrological events and the temporal changes in the flows.The temporal changes in the flows were assessed in terms of variability and subtrends (short-durational changes in trends within the series).Variability was computed using the nonparametric anomaly indicator method and sub-trends were analyzed based on the cumulative rank difference technique.The performance of the models were evaluated using nine statistical "goodnessof-fit" measures including Nash-Sutcliffe Efficiency, relative deviation efficiency, low flow forecast accuracy, index of agreement, high flow forecast accuracy, model bias, root mean squared error, correlation, and skill scores.The model performance was evaluated based on high flows, low flows, and the overall shape of the hydrograph.To evaluate the performance of the models in reproducing quantiles as a simultaneous function of different return periods and aggregation levels, Amplitude-Duration-Frequency (ADF) relationships constructed using observed and modeled hydrological extremes were compared.
Without giving particular focus to either flooding or drought conditions, the performance of all the models in simulating the runoff of the study area was found acceptable.With respect to considering the overall shape of the hydrograph, the best performance in terms of relative efficiency and index of agreement was realized from TANK, followed by IHACRES.However, using the Nash-Sutcliffe Efficiency, AWBM was the best performing model.The deficiency of the models in capturing the observed water balance closure evaluated using bias ranged from −6.86% (IHACRES) to 17.68% (AWBM).Thus, the best model to reproduce observed cumulative flow was IHACRES.On an important note, the choice of the "goodness-of-fit" measures in performance evaluation was found to influence the selection of a particular model.Therefore, caution should be taken to use a number of measures in intercomparison of hydrological models.
The skill scores in reproducing temporal variability in observed annual maximum flows varied from 63.0% (IHACRES) to 70.4% (SIMHYD).Correspondingly, for the minimum flows in each year, the best and worst performances were with skill score of 14.8% (IHACRES) and 0% (SAC), respectively.For the annual mean flow, the lowest and the highest skill scores were 55.6% (AWBM and SAC) and 70.4% (TANK), respectively.Whereas the changes in observed annual minimum flows were dominated by a negative trend, the simulated flows from all the models exhibited no trend.Eventually, the correlation between the temporal subtrends from observed and simulated annual minima flows ranged from 0.0 (SAC) to 0.49 (SIMHYD).The observed annual maxima as well as annual mean flow exhibited both negative and positive subtrends over the periods of 1965-1984 and 1985-2000, respectively.These observed positive and negative subtrends were adequately reproduced by the simulated runoffs from all the models.Coefficients of the correlation between observed and modeled annual mean flows were above 0.85 for all the models.For maximum flows in each year, correlation varied from 0.82 (TANK) to 0.93 (SIMHYD).Therefore, the best performing model to reproduce the observed temporal variability at decadal time scale was SIMHYD (for annual maxima and minima flow) and TANK (for annual mean flow).For short-durational changes in trend directions, the best and consistent model performance to reproduce the observed cumulative temporal variation for annual maxima and mean and minima flows was SIMHYD.
For the ADF relationships, the model bias in reproducing the daily aggregation observed high flow quantiles for return periods between 1 and 40 years ranged from −0.55% (AWBM) to 79.56% (SIMHYD).Correspondingly, the biases for low flows were from −53.51% (AWBM) to −100% (SAC).Generally, for the simulation of flooding and drought conditions, it was found that the use of the overall water-balance-based objective function influences the capacity of the models to perform better in capturing the changes as well as quantiles of observed high flows compared to low flows.This might not be surprising because low flows are often poorly reproduced by most rainfall-runoff models which are tailored to capture flooding conditions [71].Whereas the need to revise concepts on model structures to simultaneously capture both low flows and high flows acceptably may still remain food for thought, it is recommended that the use of water-balancebased objective function be combined with other criteria for optimization of the numerical performance schemes.Such other criteria may include the overall shape of the hydrograph, peak high flows, low flows, variability, and subtrends in low and high flows.Furthermore, it was generally found that the choice of the criteria for extracting extreme events from the full series for the purpose of intermodel comparison influences the model performance.Thus, caution must be taken to test model performance with respect to extreme events extracted based on a number of criteria.In this study, a total of seven criteria were considered including the extraction of the flow extremes based on (1) events greater than one-third of the observed peak high flow mean, (2) events less than one-third of the observed low flow mean, (3) high flow events greater than the 95th percentile of the observed flow, (4) low flows less than the 5th percentile of the observed flow, (5) maximum flow in each year, (6) minimum flow in each year, and (7) events which are nearly independent and identically distributed (e.g., those used for construction of the ADF relationships).Considering all the above criteria in a combined way, the overall best model (i.e., with the best "goodness-of-fits") to satisfactorily capture both high and low flow conditions jointly was the AWBM.Because the performance of a particular model may differ from one criterion to another, an attempt to interrelate the performance indicators across all various criteria should be augmented by the expert judgment of the modeler.
It can be remarked that prudence must be exercised in the choice of a particular model which can be made on a case by case basis in line with the objectives of the hydrological modeling study.Evaluation of the intermodel differences is vital in the choice of which hydrological models to apply for impact investigations, for example, of climate variability and change on water resources.Importantly, whereas the influence of model selection may be minimal in the simulation of normal flow events, it must be considered carefully for high flows or low flows to minimize under-and/or overestimation of hydrological extremes.The selection of a particular model for simulating extreme flow events is, in turn, influenced by the choice of (1) the "goodness-of-fit" measures for evaluating the model performance and (2) the criteria for extracting extreme events for the model performance evaluation or intended application.Furthermore, note should be taken that in the SAC model are controlled by mainly 16 parameters of which 5 define the size of soil moisture stores, 3 calculate the rate of lateral outflows, 3 estimate the percolation water from the upper to the lower soil moisture stores, 2 compute direct runoff, and 3 determine losses in the system (see Figure 13).
A.2.The AWBM Structure (Adopted from the RRL [39]).The AWBM for the catchment water balance takes rainfall to three surface water stores based on their moisture contents.Each surface water store is considered independently of the others.Evaporation is subtracted from each of the stores.
The moisture in excess of the storage capacity becomes either the surface runoff or recharge into the groundwater.The baseflow and the surface runoff are routed separately and later combined into the total flow at the outlet of the catchment.In total, there are generally 8 parameters which control the rainfall-runoff generation by the AWBM (see Figure 14).generated from the effective rainfall.Using unit hydrograph concept, the catchment is taken as linear reservoirs in series and/or parallel.This concept is premised on the assumed linear relationship between effective rainfall and stream flow.The total flow at the catchment outlet comprises the baseflow and quick flow in a combined way.The rainfall-runoff generation processes in the TANK model are controlled by 11 parameters (see Figure 15).A.4.The SIMHYD Structure (Adopted from the RRL [39]).In the SIMHYD, rainfall is lost by evaporation and infiltration.The fraction that infiltrates is transformed into interflow, groundwater store, and soil moisture store.Interflow and baseflow are each assumed to follow the soil wetness in a linear way.The infiltration excess becomes the surface runoff.The surface runoff and interflow form the quick flow which can be combined with the baseflow to yield the stream flow.The SIMHYD model has 9 model parameters for calibration (see Figure 16).
A.5.The TANK Model Structure (Adopted from the RRL [14]).The TANK model has four tanks arranged vertically in series.The rainfall is fed into the top tank.The rainfall is lost by evaporation from the top tank downwards in a sequential way.The outlets at the sides of the tanks generate different components of the total runoff, that is, surface runoff, intermediate runoff, subbase runoff, and baseflow from the first through to the fourth tank, respectively.The TANK model has 18 parameters for calibration (see Figure 17).

B. Parameters of Selected
Rainfall-Runoff Models

Figure 1 :
Figure 1: Location of the BNB.The labels of the flow and rainfall stations are consistent with those in Table 1.The Digital Elevation Model (DEM) used as the background map was obtained online from the International Centre for Tropical Agriculture, CIAT-CSI SRTM website, http://srtm.csi.cgiar.org/(accessed: 20th October, 2010).

Figure 2 :
Figure 2: The plots for (a)-(d) synthetic series,  of  = 200, and (1)-(4) the cumulative effects of the temporal variations for the corresponding series.

Figure 3 :
Figure 3: POT events in exponential quantile plots for (a) high flows and (b) low flows of 1-day aggregation level.The regression line is the fitted EVD.
Figure3shows examples of calibrated exponential distribution or normal tailed GPD as regression lines in exponential quantile plots for both high flows and low flows.It is noticeable that the linear behavior of the quantiles is obtained towards the tail of the distribution of events.The regression lines can be seen to be extrapolated up to  of 100 years for both high flows (Figure3(a)) and low flows (Figure3(b)).For low flows, back transformation is applied to the (1/) transformed -year events to obtain the actual quantiles.
Figure 5 shows the observed and simulated daily flows from 1965 to 2000.The calibration and validation periods are shown by the number of days in the ranges of 1-9496 and 9497-13149, respectively.It is visually noticeable that the performance of each model (Figures 5(a)-5(e)) is acceptable.However, the tendency of the SIMHYD (Figure 5(d)) to overestimate the high peak flows is also evident.The model parameters following the SCE technique of calibration are listed in Appendices B.1 and B.2.

Figure 7 :
Figure 7: Performance of the models in terms of the forecast accuracy and NSE.
Figure 8 shows the capacity of the models to reproduce extreme events.The performances of the models are comparable for capturing high flow extremes (Figures 8(a), 8(c), and 8(e)), though SIMHYD tended to overestimate the high flows.Generally, the values of  FC (Figures 8(b), 8(d), and 8(f)) were greater than those of  FC .Except for the annual minima flow (Figure 8(f)), the values of  FC from the various models are also fairly comparable (Figures 8(b) and 8(d))

Figure 8 :
Figure 8: Model performance considering (a) flow events greater than one-third of the observed peak high flow mean, (b) flows less than one-third of the observed low flow mean, (c) flows greater than the 95th percentile of the observed flow, (d) flows less than the 5th percentile of the observed flow, (e) maximum flow in each year, and (f) minimum flow in each year.

Figure 10
shows the temporal anomalies in the observed and simulated flows.The details of the temporal evolution of the anomalies characterizing the variability in the annual maxima and

Figure 10 :
Figure 10: Temporal variation in terms of (a)-(c) NAIM results and (d)-(f) CRD results for (a, d) maximum flow in each year, (b, e) mean annual flow, and (c, f) minimum flow in each year.The charts (a)-(c) and (d)-(f) share the same legend at the bottom of (c) and (f), respectively.

Figure 11 :
Figure 11: QDF relationships for observed peak high flows and modeled flows from (a) AWBM, (b) IHACRES, (c) SAC, (d) SIMHYD, and (e) TANK.Markers and lines are for observed and modeled flows, respectively.In the legend, for example, 5 denotes -year curve for  = 5 years.

Figure 12 :
Figure 12: QDF relationships for observed extreme low flows and modeled flows from (a) AWBM, (b) IHACRES, (c) SAC, (d) SIMHYD, and (e) TANK.Markers and lines are for observed and modeled flows, respectively.In the legend, for example, 25 denotes -year curve for  = 25 years.

A. 3 .
The IHACRES Model Structure.The IHACRES model is characterized by the catchment storage from which rainfall is lost through evaporation, and the remaining fraction contributes to the effective rainfall.Quick flow and slow flow are

Table 1 :
Meteorological and hydrological stations.
, the best or perfect model performance is given by  FC or  FC equal to zero.However, it is noticeable that, even for the worst models with the all values of   equal to zero, the values of  FC or  FC also become zero.Eventually, caution must be taken to use the metrics  FC and  FC alongside other statistical "goodness-of-fit" measures.In this study, the model average bias ( AB ) and root mean squared error ( MSE ) were selected to supplement the evaluations using  FC and  FC . AB and  MSE were computed using (4) by setting  equal to   or   .For an unbiased model,  AB is ideally equal to zero.
by Monte Carlo simulations is used to test  0 .Consider  MC as the number of Monte Carlo simulations; the variability bounds in the form of (100 − )% Confidence Interval (CI) were constructed by (1) applying the NAIM using   to the original series, (2) randomly shuffling the original full-time series, (3) applying the NAIM using   to the shuffled series, (4) repeating Steps (2)-(3)  MC times to obtain  MC sets of anomaly values, (5) ranking, for each set, the anomaly values from the highest to the lowest, and (6) taking the upper and lower limits of the (100 − )% CI as the [0.005 × % ×  MC ]th and [{1 − (0.005 × %)} ×  MC ]th anomaly values, respectively. 0 is accepted if the anomaly values from Step (1) fall within the CI.If the upper/lower CI limit is upcrossed/downcrossed,  0 is rejected.In this study, % and  MC were set to 5% and 1000, respectively.

Table 2 :
Statistical evaluation of the model performance.
Corr.: correlation between observed flow and simulated flow.

Table 3 :
Statistical measures of the agreement between the observed and simulated flows.

Table 4 :
Statistical measures of model performance for flow extremes.

Table 5 :
Temporal anomalies in the annual maxima and annual minima flows.