On the Investigation of Monthly River Flow Generation Complexity Using the Applicability of Machine Learning Models

Streamflow is affected by several sources of nonstationarity, which motivates the development of machine learning (ML) models as a reliable methodology for understanding the actual mechanism of streamflow. The current research was devoted to generating monthly streamflows from annual streamflow. Three different ML models were applied for this purpose: Multiple Additive Regression Trees (MART), Group Method of Data Handling (GMDH), and Gene Expression Programming (GEP). The models were developed based on the annual streamflow and a monthly time index of three rivers (Upper Zab, Lower Zab, and Diyala) located in the northern region of Iraq. The modeling results indicated a promising simulation for generating monthly streamflow time series from annual streamflow time series. The MART model outperformed the GMDH and GEP models for the Upper Zab River (R² = 0.84, 0.64, and 0.47, respectively), the Lower Zab River (R² = 0.75, 0.46, and 0.40), and the Diyala River (R² = 0.78, 0.42, and 0.5). The corresponding RMSE values were 113, 169, and 208 m³/s for the Upper Zab River, 95, 149, and 0.5 m³/s for the Lower Zab River, and 73, 118, and 109 m³/s for the Diyala River. The results demonstrate the possibility of changing the timescale in generating streamflow data.


Introduction
The hydrological processes are associated with several elements such as evaporation, evapotranspiration, precipitation, runoff, river flow, infiltration, and groundwater. In nature, the hydrological cycle is characterized by high stochasticity, nonstationarity, and nonlinearity [1], and thus studying hydrological processes is one of the significant topics in water resources engineering. Several models of hydrological cycle processes have been introduced in the literature and have proven their capacity [2][3][4]. Among the components of the hydrological cycle, streamflow is a very important process and has received major interest from hydrologists and computer scientists [1]. The establishment of accurate and reliable models (for forecasting, prediction, or optimization) at long timescales, such as yearly, seasonal, or monthly, is very important for reliable water resources management and planning [5]. In addition, at short timescales such as days, hours, and minutes, streamflow recording is essential for flood warning and monitoring in order to mitigate flood effects on structures and human well-being [6]. Data-driven streamflow models are regression-based, with the relationships between model inputs and output defined directly [7,8]. With the advances in computer-aided modeling, ML models such as fuzzy logic, neural networks, nature-inspired algorithms, support vector machines, decision trees, and optimizers have been successfully implemented for modeling streamflow patterns. These models can help in detecting the nonlinear and dynamic patterns of the time series [9][10][11][12]. However, a number of problems are associated with most ML-based techniques due to their inherent limitations [13]. ML-based models need prior information on the stochastic behavior of the problem being addressed (e.g., hydrological or climatological processes or water quality data) [14,15].
Hence, it is essential to configure the learning process reliably so that the important information can be extracted from the chronological streamflow data. In addition, a number of internal model parameters must be optimized [16,17]. Over time, many hybrid models have also been implemented, such as fractionally autoregressive integrated moving average (FARIMA) and self-exciting threshold autoregressive (SETAR) models combined with GEP, MARS, and MLR [18]. Similarly, other authors used autoregressive conditional heteroscedasticity (ARCH) to hybridize GEP and MARS models [19]. In particular, conventional ML-based models need numerous trial-and-error runs to determine the optimum architecture. For example, hydrological models based on neural networks require choosing the number of hidden layers, the type of transfer function, and the number of neurons in each hidden layer [20]. Hybrid models are among the more recent models that have started to be used extensively in hydrological science [21]. Likewise, fuzzy models are traditional models that struggle with complex problems and large rule sets [22], and the same goes for the MLR model when dealing with multiple outputs and complexity [23]. These limitations of the existing ML-based streamflow forecasting models have motivated the search for more sophisticated ML-based modeling techniques.
Streamflow forecasting plays an essential role in helping researchers and engineers better understand river patterns, which in turn helps in designing more sustainable and efficient infrastructure and management projects. Streamflow data are important yet present various issues such as missing data, noncontinuous data, nonlinearity, and extreme events [24,25]. Researchers have devised various techniques and tools to overcome these issues, yet more work is needed to grasp the full scope of such data in terms of seasonality, point source pollution, and sudden changes due to heavy rain or other calamities. Disaggregating streamflow can serve as an essential procedure for reservoir operation and river basin management in general [26,27].
This topic has received extensive attention from hydrology scholars. Stedinger and Vogel [28] developed a simple class of disaggregation models that can reproduce the covariance matrix of streamflow and provide a reasonable approximation to the lead times that should be imposed for the disaggregation approach. More recently, with advanced computer models, the disaggregation procedure has been investigated by several scholars. A stochastic model was proposed to disaggregate streamflow at multiple sites while preserving their temporal and spatial dependencies [29]. A nonparametric model integrated with a genetic algorithm was used to simulate seasonal streamflow disaggregation [30]. Monthly streamflow was disaggregated to the daily scale using a simple stochastic approach, as conducted in [31]. Various other studies have addressed streamflow disaggregation [32][33][34][35]. All of the reported research demonstrates the value of studying streamflow disaggregation. However, the implementation of ML models for streamflow disaggregation is limited and needs to be investigated. ML models such as Multiple Additive Regression Trees (MART), Group Method of Data Handling (GMDH), and Gene Expression Programming (GEP) are yet to be explored for generating monthly streamflow time series from annual streamflow time series.
No established research in the literature has yet tested those models for this purpose. The main objective of the current research is to investigate the feasibility of the MART, GMDH, and GEP models for generating monthly streamflow time series from annual streamflow time series. The proposed models represent three different types of ML models. The MART model is one of the most popular decision tree models; it boosts weak learners, which results in a strong learning process and better generalization [36]. The GMDH model was chosen to represent self-learning models, and the GEP model was applied as an evolutionary model. The proposed models were evaluated statistically against each other and analyzed based on their predictive capacity. The study aims to demonstrate the possibility of changing the timescale in generating streamflow. This is the first application of the GMDH, GEP, and MART models to generate monthly streamflow data from annual streamflow data without using the method of fragments, which is usually used to disaggregate annual streamflow to monthly streamflow.

Study Area and Data. The Upper Zab, Lower Zab, and Diyala Rivers are major tributaries of the Tigris River in Iraq and were selected as the case study in this research. The Tigris River is one of the largest rivers in the Middle East. It is about 1718 km long and flows through Turkey, Syria, and Iraq. The major share of the river, about 85% (reported as 253,000 km), lies within Iraq. Together with the Euphrates River, the Tigris River is the main natural source of the fresh water required for diverse water uses in Iraq. The headwaters of the Upper Zab River are located in Turkey, while the headwaters of the Lower Zab and Diyala Rivers are located in Iran [37]. Figure 1 shows the location of the Upper Zab, Lower Zab, and Diyala Rivers in Iraq. Table 1 summarizes the morphological and flow data characteristics of the Upper Zab, Lower Zab, and Diyala Rivers upstream of the Bekhme, Dokan, and Derbindi-Khan flow gauging stations, respectively. The climate of the basin is predominantly semiarid. The temperature in the basin varies from a maximum of 45°C in summer to a minimum of 10°C in winter. The mean monthly discharge and the standard deviation of the Tigris River flow at Baghdad station are 411.35 m³/s and 234.52 m³/s, respectively. Monthly flow data for the period 1932-2004 were selected because there were no missing data during this period. The first 70% of the data were used for training the models, while the remaining 30% were used to validate the models.
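The chronological 70/30 partition described above can be illustrated with a minimal sketch. It assumes the split is taken over the monthly records in time order; the text does not state whether the authors split by month or by whole year. The function name `chronological_split` and the `(year, month)` labels are hypothetical stand-ins, not the paper's data.

```python
def chronological_split(records, train_fraction=0.7):
    """Split time-ordered records into a leading training part and a trailing validation part."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    cut = int(len(records) * train_fraction)  # index of the first validation record
    return records[:cut], records[cut:]


# Hypothetical stand-in for the monthly record: (year, month) labels for 1932-2004.
records = [(year, month) for year in range(1932, 2005) for month in range(1, 13)]
train, valid = chronological_split(records)

print(len(records), len(train), len(valid))
print("training ends at", train[-1], "and validation starts at", valid[0])
```

Keeping the records in time order, rather than shuffling them, means the validation period lies entirely after the training period.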

Introduction to the Gene Expression Programming (GEP) Model. GEP was invented by Ferreira as an extension of traditional genetic programming. Programs are represented as linear strings (chromosomes) of fixed length, which are then expressed as nonlinear structures of different sizes and shapes [38]. In GEP, expressions are generated automatically by encoding each expression as a tree whose internal nodes represent functions and whose leaves (terminals) represent constants and variables. The generated candidates are evaluated by a fitness function. Each gene consists of two parts: a head, which may contain functions as well as terminals (variables and constants), and a tail, which contains terminals only [39]. Five steps are used to develop the GEP model: (i) selecting the set of predictor variables to be used in the individual programs; (ii) selecting the specific functions and arithmetic operations; (iii) choosing the fitness measure; (iv) selecting the appropriate head length, number of genes, and linking function; and (v) selecting the genetic operators, including the inversion rate and mutation rate [40]. Further details on GEP can be found in [41]. Figure 2 shows the flowchart of the gene expression programming algorithm.
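To make the head/tail encoding concrete, the following is a minimal sketch of how a single gene can be decoded into an expression tree and evaluated. It assumes the breadth-first (Karva) reading order used in GEP and the usual tail-length rule t = h(n - 1) + 1, where h is the head length and n is the largest function arity. The symbols `Q` (annual flow), `T` (time index), and `c` (a constant), the example gene, and the protected division are illustrative assumptions, not the expressions evolved in the study.

```python
import operator

# Function set: symbol -> (arity, implementation). Any other symbol is a terminal.
FUNCTIONS = {
    "+": (2, operator.add),
    "-": (2, operator.sub),
    "*": (2, operator.mul),
    "/": (2, lambda a, b: a / b if b != 0 else 1.0),  # protected division (assumption)
}


def decode_karva(gene):
    """Build an expression tree from a gene read breadth-first; a node is [symbol, children]."""
    root = [gene[0], []]
    level = [root]
    pos = 1  # next unread position in the gene
    while level:
        next_level = []
        for node in level:
            arity = FUNCTIONS[node[0]][0] if node[0] in FUNCTIONS else 0
            for _ in range(arity):
                child = [gene[pos], []]
                pos += 1
                node[1].append(child)
                next_level.append(child)
        level = next_level
    return root


def evaluate(node, env):
    """Evaluate a decoded tree, looking terminals up in env."""
    symbol, children = node
    if symbol in FUNCTIONS:
        _, fn = FUNCTIONS[symbol]
        return fn(*(evaluate(child, env) for child in children))
    return env[symbol]


HEAD_LENGTH = 3
MAX_ARITY = 2
TAIL_LENGTH = HEAD_LENGTH * (MAX_ARITY - 1) + 1  # guarantees enough terminals

gene = "*Q/" + "TcQT"  # head + tail; the trailing "QT" ends up non-coding
assert len(gene) == HEAD_LENGTH + TAIL_LENGTH

tree = decode_karva(gene)  # encodes Q * (T / c)
value = evaluate(tree, {"Q": 120.0, "T": 6.0, "c": 12.0})
print(value)
```

Because the tail holds only terminals, every fixed-length gene decodes to a valid expression, which is what lets mutation and inversion operate on the linear string without producing malformed trees.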

Introduction to the Multiple Additive Regression Trees (MART) Model. MART, as described by Derrig and Francis [42], was developed to increase the accuracy of traditional decision tree models. The researchers reported that models developed using MART are more accurate than other known modeling methodologies. The model can handle categorical and continuous inputs and target variables. The model is more stable due to the use of the Huber M-regression loss function in its algorithm. The MART algorithm starts by fitting the inputs to a first tree; the residuals of the first tree are then passed to the next tree to reduce the error [43]. This procedure is repeated through a series of successive trees. The final result is obtained by adding the weighted contribution of each tree. The MART algorithm can be expressed as [36]

$$\text{Target} = S + C_1 T_1(N) + C_2 T_2(N) + \cdots + C_n T_n(N),$$

where $S$ is the mean value of the target variable, $N$ is the vector of pseudoresidual values, $T_i$ is the $i$-th tree fitted to the pseudoresiduals, and $C_1, C_2, \ldots, C_n$ are the coefficients of the tree node predictions. Figure 3 shows a simple MART structure and Figure 4 shows the flowchart of the random trees algorithm.

Introduction to the Group Method of Data Handling (GMDH). GMDH was developed to solve problems of prediction, complex systems, and optimization by using a nonlinear regression algorithm. The GMDH structure is classified as a self-organizing polynomial neural network method [44]; it is a specific type of supervised artificial neural network. The GMDH algorithm uses the concept of natural selection to control the network size, complexity, and accuracy [45]. The GMDH model starts by selecting the set of functions that show the highest prediction accuracy on previously unseen data. In the GMDH model, layers of neurons are created using one or more inputs. The connections between neurons in the network are self-selected during the training phase, and the number of layers and neurons in the network is determined automatically. The GMDH solutions are subsets of functions called partial models [46]. The best model is reached by gradually increasing the number of partial models. The GMDH algorithm uses a quadratic equation in two variables to develop the model, where $Y = [y_1, y_2, \ldots, y_n]^T$ and $A = [a_1, a_2, a_3, a_4, a_5]$, $m$ represents the number of variables, $(x_1, x_2, x_6)$ are vectors of input variables, and $(a_1, a_2, \ldots, a_6)$ are vectors of parameters. Figure 5 shows the structure of GMDH and Figure 6 shows the flowchart of the GMDH algorithm. More details on GMDH can be found in [44].
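As an illustration of a single GMDH partial model, the sketch below fits a two-variable quadratic polynomial by ordinary least squares. It assumes the commonly used form y = a0 + a1·x1 + a2·x2 + a3·x1·x2 + a4·x1² + a5·x2²; the coefficient indexing and ordering here are assumptions and may differ from those in the original equations. The data are synthetic, and the helper names (`quad_features`, `solve_linear`, `fit_partial_model`, `predict`) are hypothetical.

```python
def quad_features(x1, x2):
    """Feature vector of the two-variable quadratic polynomial."""
    return [1.0, x1, x2, x1 * x2, x1 * x1, x2 * x2]


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_partial_model(samples, targets):
    """Least-squares coefficients of one quadratic neuron via the normal equations."""
    rows = [quad_features(x1, x2) for x1, x2 in samples]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear(XtX, Xty)


def predict(coeffs, x1, x2):
    return sum(a * f for a, f in zip(coeffs, quad_features(x1, x2)))


# Synthetic check: recover known coefficients from noise-free data.
true_coeffs = [2.0, 0.5, -1.0, 0.25, 0.1, 0.3]
grid = [0.5, 1.0, 1.5, 2.0]
samples = [(x1, x2) for x1 in grid for x2 in grid]
targets = [predict(true_coeffs, x1, x2) for x1, x2 in samples]

fitted = fit_partial_model(samples, targets)
print([round(a, 6) for a in fitted])
```

A full GMDH network would fit one such neuron for each pair of inputs, keep those that score best on held-out data, and feed their outputs to the next layer; only the single-neuron fit is sketched here.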

Performance Evaluation. In this research, two performance metrics were selected to evaluate the proposed models: the coefficient of determination (R²) and the root mean square error (RMSE) [47]. In these metrics, $Q_{io}$ and $Q_{ip}$ are the observed and generated streamflow values, respectively, $\bar{Q}_o$ is the mean of the observed streamflow, and $n$ is the number of data records. The best models are those with a low RMSE and an R² close to 1.
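The two metrics can be sketched as follows. The original equations are not preserved in this text, so the code assumes the common definitions that use only the symbols listed above: RMSE as the square root of the mean squared difference between observed and generated values, and R² as one minus the ratio of the residual sum of squares to the total sum of squares about the observed mean. Some hydrological studies instead define R² as the squared Pearson correlation, so this form is an assumption. The sample values are illustrative only.

```python
import math


def rmse(observed, generated):
    """Root mean square error between observed and generated flows."""
    n = len(observed)
    return math.sqrt(sum((o - g) ** 2 for o, g in zip(observed, generated)) / n)


def r_squared(observed, generated):
    """Coefficient of determination: 1 - SS_res / SS_tot about the observed mean."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - g) ** 2 for o, g in zip(observed, generated))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Illustrative flows in m^3/s (not taken from the study).
observed = [100.0, 200.0, 300.0, 400.0]
generated = [110.0, 190.0, 310.0, 390.0]

print(rmse(observed, generated))
print(r_squared(observed, generated))
```

RMSE is expressed in the units of the flow (m³/s here), while R² is dimensionless, which is why the two are reported together.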

Modeling Results and Discussion
In this study, the three ML models were applied to develop the best models for generating monthly streamflow from annual streamflow. The models were developed using the monthly streamflow as the target variable and the annual streamflow and a monthly time index as predictor variables. The time index represents the position of the month within the year, and its values range from 1 (January) to 12 (December). Selecting the best model for predicting the monthly streamflow of the three rivers requires choosing the best settings for the MART, GMDH, and GEP models. The best MART model requires selecting the best settings for three parameters: the number of trees in the series, the depth of the individual trees, and the minimum node size for splitting. These values are 600, 5, and 10 for the Upper Zab River, 800, 5, and 10 for the Lower Zab River, and 300, 5, and 10 for the Diyala River. The best GMDH model requires selecting the best settings for four parameters: the maximum number of network layers, the maximum polynomial order, the number of neurons per layer, and the type of network layer connections.
The optimum GMDH settings for the three rivers in this study are 20 for the maximum number of network layers, 16 for the maximum polynomial order, the "same number of neurons as inputs" option for the number of neurons per layer, and "previous layer and original input variables" for the type of network layer connections. There are five major steps for GEP modeling in this study: (1) selecting the set of functions to be used, here five basic mathematical functions (+, −, ×, ÷, and power); (2) selecting the [...] where $Q_m$ is the monthly flow, $Q_a$ is the annual flow, and $T$ is the time index (1, 2, 3, ..., 12). In these functions, the monthly streamflow changes with the value of the time index. The performance of the proposed models was evaluated utilizing the two statistical [...] Table 3, the Upper Zab River simulation shows that the MART model outperforms the GMDH and GEP models. For the Diyala River, the RMSE values of the MART, GMDH, and GEP models are 66, 120, and 121 m³/s in the training phase and 73, 118, and 109 m³/s in the validation phase, respectively. Both the R² and RMSE results confirm the accuracy of the MART model in disaggregating annual flows to monthly streamflow in comparison with the GMDH and GEP models. The quality of the proposed models was also assessed by comparing three statistical parameters of the time series: the maximum flow, the standard deviation, and the mean. Table 4 presents the results for these parameters. According to Table 4, the performance of the MART model was superior to that of the GMDH and GEP models.
In the validation phase, the maximum monthly streamflow values for the observed data and the MART, GMDH, and GEP models were, respectively, 1631, 1486, 1435, and 885 m³/s for the Upper Zab River, 1569, 1215, 769, and 588 m³/s for the Lower Zab River, and 864, 769, 626, and 570 m³/s for the Diyala River. Comparing the standard deviation and the mean of the observed monthly streamflow with those of the applied models (Table 4) likewise shows that the MART model performs better than the GMDH and GEP models. The monthly streamflow predicted over the validation period was assessed using scatter plots, as illustrated in Figures 7(a)-7(c). The plots in Figure 7 demonstrate a closer relationship between the observed and generated monthly streamflow for the MART model than for the other models. The efficiency of the applied models was also evaluated by comparing statistical parameters for each month. The models' capability to handle streamflow data decreases as the stochasticity of the data increases; however, the results indicate that MART is more capable of predicting such data. The maximum flow, mean flow, and standard deviation for each month were compared with those of the observed monthly streamflow in the validation phase by plotting them against the months; see Figure 8. The results in Figure 8 also demonstrate that the MART model is more accurate than the GMDH and GEP models in generating monthly streamflow from annual streamflow. Across these statistical parameters, the MART results agree closely with the observed data for the three studied rivers (Figure 8).
It is apparent that rivers have differing hydrological characteristics and that model behavior and performance can change greatly as a result. As the figure shows, each river presents different seasonality and deviation over the time period, which causes the models to generate more error. Figure 9 compares the observed monthly streamflow with that generated by the MART model during the validation phase for the three rivers in this study and shows how the monthly flows were generated from the annual flow data. The results in Figure 9 show the proximity of the observed and generated monthly streamflow, which is further evidence that the MART model is able to produce monthly streamflow time series from annual streamflow time series data.
The results indicated the efficiency of the MART model in generating monthly flow data from annual flow data. This is due to the MART model's structure, which enables robust models to be built with a limited number of inputs (here, only the annual flow and the time index).
The results also showed the weakness of evolutionary and self-learning models in creating robust models with a limited number of inputs.
This is the first application of the MART model to generate monthly streamflow from annual streamflow. The results showed the importance of using the time index to improve the accuracy of generating monthly streamflow from annual streamflow. The results of this paper encourage the development of new models for generating monthly streamflow data in place of the method of fragments, which is usually used for this purpose.
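The additive, boosted tree structure credited above for the MART results can be illustrated with a small sketch that uses the same two predictors, annual flow and time index. It is a simplified stand-in and not the configuration used in the study: it boosts depth-1 trees (stumps) under a squared-error loss with a fixed learning rate, whereas the text reports trees of depth 5, several hundred trees, and a Huber loss. The synthetic seasonal data and all function names are hypothetical.

```python
import math


def fit_stump(X, residuals):
    """Find the single split (feature, threshold) that minimizes squared error."""
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[f] <= thr]
            right = [r for row, r in zip(X, residuals) if row[f] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    return best[1:]  # (feature, threshold, left mean, right mean)


def stump_predict(stump, row):
    f, thr, lm, rm = stump
    return lm if row[f] <= thr else rm


def fit_boosted(X, y, n_trees=200, learning_rate=0.1):
    """Start from the target mean, then fit each stump to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [t - p for t, p in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [p + learning_rate * stump_predict(stump, row) for p, row in zip(pred, X)]
    return base, learning_rate, stumps


def boosted_predict(model, row):
    base, lr, stumps = model
    return base + lr * sum(stump_predict(s, row) for s in stumps)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic data: monthly flow = annual flow x an assumed seasonal factor.
X, y = [], []
for qa in [200.0, 300.0, 400.0, 250.0, 350.0]:
    for t in range(1, 13):
        seasonal = 1.0 + 0.8 * math.sin(2.0 * math.pi * (t - 1) / 12.0)
        X.append((qa, float(t)))
        y.append(qa * seasonal)

model = fit_boosted(X, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [boosted_predict(model, row) for row in X])
print(baseline_mse, boosted_mse)
```

Because each stump is fitted to what the previous trees left unexplained, the time index can carry the seasonal shape while the annual flow scales it, which is one plausible reading of why two inputs suffice for this kind of model.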

Conclusions
In this study, three different ML models (MART, GMDH, and GEP) were used to generate monthly streamflow time series from annual streamflow time series. The model inputs included only the annual streamflow and a monthly time index. The results showed that the MART model is superior to the GMDH and GEP models in producing monthly streamflow time series from annual streamflow time series data. The results indicated that the structure of the MART model is better suited to this generation task than the structure of polynomial neural networks or evolutionary models. The MART model outperformed the GMDH and GEP models for the Upper Zab (R² = 0.84, 0.64, and 0.47, respectively), Lower Zab (R² = 0.75, 0.46, and 0.40), and Diyala (R² = 0.78, 0.42, and 0.5) Rivers.
The accuracy of the MART model is related to its specific architecture, which includes a number of trees grown in series together with the boosting technique, which helped to improve the prediction function. The results demonstrated the possibility of changing the timescale in generating streamflow. Applying the MART model is easier than applying the method of fragments, which is usually used to disaggregate annual streamflow to monthly streamflow.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.