A Multivariate and Multistage Medium- and Long-Term Streamflow Prediction Based on an Ensemble of Signal Decomposition Techniques with a Deep Learning Network

The accuracy and consistency of streamflow prediction play a significant role in several applications involving the management of hydrological resources, such as power generation, water supply, and flood mitigation. However, the nonlinear dynamics of the climatic factors hinder the development of efficient prediction models. Therefore, to enhance the reliability and accuracy of streamflow prediction, this paper developed a three-stage hybrid model, namely, IVL (ICEEMDAN-VMD-LSTM), which integrates improved complete ensemble empirical mode decomposition with additive noise (ICEEMDAN), variational mode decomposition (VMD), and a long short-term memory (LSTM) neural network. Monthly data series of streamflow, temperature, and precipitation in the Swat River Watershed, Pakistan, from January 1971 to December 2015 were used as a case study. Firstly, correlation analysis and the two-stage decomposition approach were employed to select suitable inputs for the proposed model. ICEEMDAN was employed as the first decomposition stage to decompose the three data series into intrinsic mode functions (IMFs) and a residual component. In the second decomposition stage, the high-frequency component (IMF1) was decomposed by VMD. Afterward, all the components obtained through the correlation analysis and the two-stage decomposition approach were predicted using the LSTM network. Finally, the predicted results of all components were aggregated to form an ensemble prediction for the original monthly streamflow series. The results showed that the proposed model outperformed the other developed models on several evaluation benchmarks, demonstrating the applicability of the proposed IVL model for monthly streamflow prediction.


Introduction
The accuracy of the streamflow prediction technique is crucial for efficient management and planning of hydroresources. However, the involvement of nonlinear processes, such as evaporation, topography, anthropic activities, and rainfall, poses a challenge for efficient streamflow prediction [1]. Streamflow prediction can be categorized into short-term prediction (e.g., daily or hourly), medium-term prediction (e.g., seasonal, monthly, and weekly), and long-term prediction (e.g., annual) [2].
Process-driven models (PDMs) and data-driven models (DDMs) represent the two general categories of streamflow prediction models. PDMs consider the physical processes of the water cycle [3], whereas DDMs are based on artificial intelligence (AI) methods and avoid considering the physical mechanisms of the watershed. In other words, these AI-based models are more user-friendly than PDMs [4]. The development of PDMs is very complex, and these models are sensitive to several factors. These factors include the effects of the watershed's underlying conditions on the accuracy and integrity of data, the intricacy of the rainfall-streamflow process, the spatial-temporal variation of climatological data, and the limited knowledge of streamflow patterns in the watersheds. Most of these models require a large quantity of data for training and testing, which makes them computationally complex. As a result, researchers have attempted to develop alternative approaches to predict streamflow with reasonable accuracy and comparative ease. DDMs can be regarded as black boxes that try to establish relations between the input and output variables with limited information on the underlying hydrological process [5]. DDMs have a simpler architecture than PDMs since they require fewer data. These models can circumvent the influence on model performance of the uncertainties that arise from complex hydrological processes, and they also offer good prediction results [6]. DDMs are becoming popular with the advent and advancement of AI. These models are more suitable for streamflow forecasting than PDMs, particularly when limited knowledge of the hydrological process is available [7]. DDMs can be viewed as a promising solution to the challenges of uncertainty and sensitivity inherent in PDMs [8,9].
Machine learning models (MLMs) are extensively employed to study the nonlinear dynamics of hydrological variables [10-12]. Neural networks [13], support vector machines (SVM) [14], and random forests are the most popular MLMs for prediction [15]. MLMs are feasible for the prediction of streamflow, temperature, and precipitation variables on a large scale [16,17]. Recent studies have demonstrated the superior performance of deep learning (DL) approaches for streamflow prediction [18-21]. The LSTM network can be employed to model streamflow-precipitation variables because of its ability to learn long-term dependencies between inputs and outputs [22]. Therefore, LSTM has been successfully applied in numerous streamflow-precipitation studies [23,24].
MLMs coupled with decomposition techniques are employed to enhance the performance of standalone models and to obtain more accurate predictions [25,26]. Decomposition techniques have been effectively applied to decompose streamflow time series and to improve the performance of MLMs [27]. ICEEMDAN is the latest version of complete ensemble empirical mode decomposition with additive noise (CEEMDAN) and decomposes the signal into subcomponents with less noise [28]. VMD is another advanced decomposition technique with outstanding frequency search performance and sampling properties [29]. The selection of input variables in machine learning (ML) based DDMs (ML-DDMs) for streamflow prediction is of vital importance. Different combinations of inputs are applied to predict the target values of the streamflow. Streamflow prediction can be performed by taking the observed streamflow time series as the only input to predict the target streamflow [30,31]. The streamflow, precipitation, and temperature variables can also be applied together as inputs to predict the target streamflow [32,33].
This paper developed five standalone MLMs, including a radial basis function neural network (RBF), support vector regression (SVR), random forest regression (RFR), a gated recurrent unit neural network (GRU), and LSTM, to determine the model with the best prediction performance. The monthly streamflow, temperature, and precipitation series were selected as the input variables for model development. Different statistical metrics were employed to assess the performance of the models in the training and testing periods. The performance of the LSTM network was superior to that of its standalone counterparts. The standalone LSTM network was therefore selected, and its prediction performance was enhanced further by the development of two-stage hybrid models (ICEEMDAN-LSTM and VMD-LSTM). The two-stage hybrid models yielded better results than the standalone LSTM network. Two-stage hybrid models for streamflow prediction can be extended to three-stage hybrid models to improve their performance [34]. Therefore, considering the superior decomposition properties of the ICEEMDAN and VMD techniques and the better prediction capability of LSTM compared with the other MLMs, this paper proposed a three-stage hybrid model, IVL, for streamflow prediction. Experimental results showed that the proposed model was superior to the two-stage hybrid and standalone models in terms of several performance measures. Specifically, the main objectives of this study were the following:
(1) The development of a three-stage hybrid model coupling a two-stage decomposition approach with a DL model
(2) The applicability of the proposed model for streamflow prediction by considering streamflow, temperature, and precipitation as input variables
(3) Verification of the performance of the proposed model against two-stage and standalone models by comparing results
The remainder of this paper is arranged as follows. Section 2 introduces the decomposition and DL approaches, the statistical metrics for performance evaluation, the methodology, and the study area. Section 3 presents all the results along with their discussion, and Section 4 summarizes the conclusions of this study.

Improved Complete Ensemble Empirical Mode Decomposition with Additive Noise.
ICEEMDAN was proposed to resolve the issues of spurious modes and frequency aliasing faced by the other EMD-based techniques [28]. By adding white noise, ICEEMDAN realizes frequency continuity among adjacent scales, which weakens the frequency aliasing effect [35]. The calculation methodology of ICEEMDAN is given as follows:
(i) Add a specific amount of white noise to the original signal $x$:
$$x^{(i)} = x + \beta_0 E_1\left(w^{(i)}\right), \quad i = 1, \ldots, N, \tag{1}$$
where $i$ is the index of the added noise realization, $x^{(i)}$ denotes the signal to be decomposed, $w^{(i)}$ represents the white noise, $E_1(w^{(i)})$ is the first EMD component of the white noise, and $\beta_0$ scales the amount of added noise.
(ii) Afterward, the first residue ($R_1$) can be obtained as
$$R_1 = \frac{1}{N}\sum_{i=1}^{N} M\left(x^{(i)}\right), \tag{2}$$
where $M(\cdot)$ represents the local mean of the envelope that fulfills the sifting threshold of an IMF.
(iii) The first IMF can be obtained after the decomposition of the $N$ signals as
$$\mathrm{IMF}_1 = x - R_1. \tag{3}$$
(iv) The second residue and mode are calculated as
$$R_2 = \frac{1}{N}\sum_{i=1}^{N} M\left(R_1 + \beta_1 E_2\left(w^{(i)}\right)\right), \tag{4}$$
$$\mathrm{IMF}_2 = R_1 - R_2. \tag{5}$$
(v) Calculate the $k$th residue and mode:
$$R_k = \frac{1}{N}\sum_{i=1}^{N} M\left(R_{k-1} + \beta_{k-1} E_k\left(w^{(i)}\right)\right), \tag{6}$$
$$\mathrm{IMF}_k = R_{k-1} - R_k. \tag{7}$$
(vi) Repeat step (v) for the next value of $k$.

Variational Mode Decomposition.
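This study utilized VMD to construct a two-stage hybrid model and to verify its applicability for streamflow prediction. The benefit of the VMD technique is the absence of residual noise during the decomposition process. Equations (8)-(11) describe the main steps of the VMD technique [29]. VMD is posed as a constrained optimization problem that minimizes the sum of the spectral bandwidths of all modes:
$$\min_{\{u_k\},\{\omega_k\}} \sum_{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \quad \text{subject to} \quad \sum_{k} u_k(t) = f(t), \tag{8}$$
where $\{u_k\} = \{u_1, u_2, \ldots, u_K\}$ and $\{\omega_k\} = \{\omega_1, \omega_2, \ldots, \omega_K\}$ denote the set of modes and their centre frequencies, respectively, $f(t)$ is the signal to be decomposed, $\delta(t)$ is the Dirac delta function, and $*$ denotes convolution. A Lagrangian multiplier $\lambda(t)$ and a quadratic penalty term with weight $\alpha$ are introduced to convert the above optimization problem into the following unconstrained problem:
$$L\left(\{u_k\},\{\omega_k\},\lambda\right) = \alpha \sum_{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| f(t) - \sum_{k} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, f(t) - \sum_{k} u_k(t) \right\rangle. \tag{9}$$
The alternating direction method of multipliers is used to solve (9). The two stages of the solution are as follows:
(i) $u_k$ minimization:
$$\hat{u}_k^{n+1}(\omega) = \frac{\hat{f}(\omega) - \sum_{i \neq k} \hat{u}_i(\omega) + \hat{\lambda}(\omega)/2}{1 + 2\alpha\left(\omega - \omega_k\right)^2}, \tag{10}$$
(ii) $\omega_k$ minimization:
$$\omega_k^{n+1} = \frac{\int_0^{\infty} \omega \left| \hat{u}_k^{n+1}(\omega) \right|^2 d\omega}{\int_0^{\infty} \left| \hat{u}_k^{n+1}(\omega) \right|^2 d\omega}, \tag{11}$$
where $n$ denotes the iteration number, and $\hat{f}(\omega)$, $\hat{u}_i(\omega)$, and $\hat{\lambda}(\omega)$ are the Fourier transforms of $f(t)$, $u_i(t)$, and $\lambda(t)$, respectively. The detailed decomposition process of the VMD technique can be found in [29].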
Compared to the ICEEMDAN technique, VMD is an adaptive signal decomposition technique that avoids the presence of residual modes. These advanced features make the decomposition process of VMD superior to that of the other decomposition techniques. The present study carried out an additional decomposition of the IMF1 component by combining VMD with ICEEMDAN to further resolve the low-frequency patterns. This enables the DL model to perform the streamflow prediction more accurately with fine-scale decomposition components.
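The two alternating VMD updates (the $u_k$ step and the $\omega_k$ step) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used in this study: the two-tone test signal, the number of modes (two), the initial centre frequencies, and the fixed iteration count are all hypothetical, and the Lagrangian multiplier is held at zero. The bandwidth constraint of 2000 is the value reported later in the parameter settings.

```python
import numpy as np

def vmd_mode_update(f_hat, other_modes_hat, lam_hat, omega, omega_k, alpha):
    """Wiener-filter style update of one mode's spectrum (the u_k step)."""
    return (f_hat - other_modes_hat + lam_hat / 2.0) / (
        1.0 + 2.0 * alpha * (omega - omega_k) ** 2
    )

def vmd_center_update(u_hat, omega):
    """Centre frequency as the power-weighted mean frequency (the omega_k step)."""
    power = np.abs(u_hat) ** 2
    return float(np.sum(omega * power) / np.sum(power))

# Hypothetical test signal: two tones at 0.05 and 0.20 cycles per sample.
n_samples = 400
t = np.arange(n_samples)
signal = np.cos(2 * np.pi * 0.05 * t) + 0.5 * np.cos(2 * np.pi * 0.20 * t)

f_hat = np.fft.rfft(signal)          # one-sided spectrum
omega = np.fft.rfftfreq(n_samples)   # frequencies in cycles per sample, 0 to 0.5

alpha = 2000.0                       # bandwidth constraint
n_modes = 2                          # assumed number of modes for this toy signal
center_freqs = np.array([0.10, 0.30])  # assumed initial centre frequencies
u_hat = np.zeros((n_modes, omega.size), dtype=complex)
lam_hat = np.zeros(omega.size, dtype=complex)  # multiplier kept at zero

for _ in range(30):                  # fixed iteration count, an assumption
    for k in range(n_modes):
        others = u_hat.sum(axis=0) - u_hat[k]
        u_hat[k] = vmd_mode_update(f_hat, others, lam_hat, omega,
                                   center_freqs[k], alpha)
        center_freqs[k] = vmd_center_update(u_hat[k], omega)

print(center_freqs)  # each centre frequency should settle near one of the tones
```

A full implementation would also mirror-extend the signal, update the multiplier when it is active, and stop on the convergence tolerance rather than after a fixed number of iterations.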

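Long Short-Term Memory Neural Network.
LSTM is an advanced version of the recurrent neural network (RNN), specially designed to address the issues of vanishing and exploding gradients inherent in RNNs [36]. LSTM can preserve long-term dependencies through its unique architecture, gates, and the cell state [23]. The LSTM network takes the input $X_t$ at time step $t$ and the hidden state $h_{t-1}$ and updates its hidden state as follows [37]:
Input gate:
$$i_t = \sigma\left(P_i X_t + Q_i h_{t-1} + b_i\right), \tag{12}$$
Forget gate:
$$f_t = \sigma\left(P_f X_t + Q_f h_{t-1} + b_f\right), \tag{13}$$
Output gate:
$$o_t = \sigma\left(P_o X_t + Q_o h_{t-1} + b_o\right), \tag{14}$$
Cell input:
$$\tilde{c}_t = \tanh\left(P_c X_t + Q_c h_{t-1} + b_c\right), \tag{15}$$
Cell state:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \tag{16}$$
Hidden state:
$$h_t = o_t \odot \tanh\left(c_t\right), \tag{17}$$
where $P_s$ and $Q_s$ denote the network weights, $b_s$ are bias vectors, $\sigma$ is the sigmoid function, $\tanh$ is the hyperbolic tangent function, and $\odot$ denotes element-wise multiplication [37].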
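To make the gate mechanics concrete, the following NumPy sketch performs the standard LSTM update for one time step, using the same roles for the input weights P, recurrent weights Q, and biases b as in the description above. The dictionary layout, the layer sizes, and the random weights are assumptions chosen only for illustration; the models in this study were built with Keras rather than with hand-written cells.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P, Q, b):
    """One LSTM time step. P, Q, b are dicts keyed by gate: 'i', 'f', 'o', 'c'."""
    i_t = sigmoid(P["i"] @ x_t + Q["i"] @ h_prev + b["i"])      # input gate
    f_t = sigmoid(P["f"] @ x_t + Q["f"] @ h_prev + b["f"])      # forget gate
    o_t = sigmoid(P["o"] @ x_t + Q["o"] @ h_prev + b["o"])      # output gate
    c_tilde = np.tanh(P["c"] @ x_t + Q["c"] @ h_prev + b["c"])  # candidate cell input
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state
    h_t = o_t * np.tanh(c_t)                                    # new hidden state
    return h_t, c_t

# Hypothetical sizes: 3 inputs (e.g., streamflow, temperature, precipitation), 4 hidden units.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
P = {g: rng.normal(scale=0.1, size=(n_hidden, n_in)) for g in "ifoc"}
Q = {g: rng.normal(scale=0.1, size=(n_hidden, n_hidden)) for g in "ifoc"}
b = {g: np.zeros(n_hidden) for g in "ifoc"}

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(12, n_in)):  # run over a 12-step toy sequence
    h, c = lstm_step(x_t, h, c, P, Q, b)
```

Because the forget gate multiplies the previous cell state rather than passing it through a squashing nonlinearity, information can persist across many time steps, which is the property exploited for monthly series with annual periodicity.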

Statistical Metrics.
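Statistical metrics were employed to evaluate the performance of the proposed and other predictive models. The commonly used statistical metrics in the field of hydrology include mean absolute error (MAE), root mean square error (RMSE), Nash-Sutcliffe coefficient of efficiency (NSCE), and mean absolute percentage error (MAPE). The following equations were used to define these metrics:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| Ob_i - Pr_i \right|, \tag{18}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( Ob_i - Pr_i \right)^2}, \tag{19}$$
$$\mathrm{NSCE} = 1 - \frac{\sum_{i=1}^{n} \left( Ob_i - Pr_i \right)^2}{\sum_{i=1}^{n} \left( Ob_i - \overline{Ob} \right)^2}, \tag{20}$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left| \frac{Ob_i - Pr_i}{Ob_i} \right|. \tag{21}$$
In (18)-(21), $Ob_i$ and $Pr_i$ denote the observed and predicted values of streamflow, respectively, $\overline{Ob}$ is the mean of the observed values, and $n$ represents the number of data points.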
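As a quick illustration of how the four metrics behave, the sketch below computes them with NumPy using their standard definitions. The function names and the four-point toy data are hypothetical and are not taken from the study's dataset; MAPE is expressed in percent, which is an assumption about the reporting convention.

```python
import numpy as np

def mae(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.mean(np.abs(obs - pred)))

def rmse(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def nsce(obs, pred):
    # 1 means a perfect fit; 0 means no better than predicting the observed mean.
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2))

def mape(obs, pred):
    # In percent; undefined when any observed value is zero.
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(100.0 * np.mean(np.abs((obs - pred) / obs)))

# Hypothetical observed and predicted flows.
obs = np.array([10.0, 20.0, 30.0, 40.0])
pred = np.array([12.0, 18.0, 33.0, 37.0])

print(mae(obs, pred), rmse(obs, pred), nsce(obs, pred), mape(obs, pred))
```

MAE and RMSE carry the units of streamflow, whereas NSCE and MAPE are dimensionless, so the latter two are easier to compare across stations with different flow magnitudes.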

ICEEMDAN-VMD-LSTM-Based Hybrid Modelling.
This paper proposed a hybrid model, IVL, based on ICEEMDAN, VMD, and the LSTM network to predict monthly streamflow. The systematic sequence of the proposed model is explained as follows:
Step 1. To select suitable input variables for the IVL model, the correlation analysis and the ICEEMDAN approach were applied to the streamflow, temperature, and precipitation time series.
Step 2. The highest frequency component obtained from ICEEMDAN was further decomposed by VMD into subcomponents.
Step 3. The components obtained as a result of the ICEEMDAN-VMD technique and the correlation analysis were applied to the LSTM network to construct the prediction model.
Step 4. The predicted results of Step 3 were reconstructed to finalize the prediction.
Step 5. The performance of the proposed model was evaluated by applying several evaluation benchmarks, including the two-stage hybrid models, standalone models, and statistical metrics. The hybrid models included the VMD-LSTM and ICEEMDAN-LSTM models, whereas the RBF, SVR, RFR, GRU, and LSTM models were established as standalone models. Figure 1 shows the flowchart of the proposed methodology.
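The decompose, predict, and aggregate sequence of the steps above can be sketched structurally as follows. The stand-ins are deliberately simple and are all assumptions: a moving-average split replaces both ICEEMDAN and VMD, and a persistence forecast (repeating the last value) replaces the LSTM. The sketch only shows how components from two decomposition stages are pooled and how the component forecasts are summed into one ensemble prediction.

```python
import numpy as np

def moving_average_split(x, window):
    """Stand-in decomposition: returns [detail, trend] with detail + trend == x."""
    kernel = np.ones(window) / window
    trend = np.convolve(x, kernel, mode="same")
    return [x - trend, trend]

def two_stage_components(x, first_stage, second_stage):
    """Decompose x, then decompose its first (highest frequency) component again."""
    components = first_stage(x)
    subcomponents = second_stage(components[0])
    return subcomponents + components[1:]

def ensemble_forecast(components, predict_next):
    """Predict each component separately and sum the component predictions."""
    return float(sum(predict_next(c) for c in components))

# Hypothetical monthly series with an annual cycle and some noise.
rng = np.random.default_rng(1)
months = np.arange(120)
x = 50 + 20 * np.sin(2 * np.pi * months / 12) + rng.normal(scale=3, size=months.size)

components = two_stage_components(
    x,
    first_stage=lambda s: moving_average_split(s, 12),  # stand-in for ICEEMDAN
    second_stage=lambda s: moving_average_split(s, 3),  # stand-in for VMD
)
persistence = lambda c: c[-1]                           # stand-in for the LSTM
forecast = ensemble_forecast(components, persistence)
```

The design relies on the decomposition being additive: because the components sum back to the original series, summing their individual forecasts yields a forecast on the original scale.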

Dataset and Study Area.
The monthly streamflow, temperature, and precipitation data were selected in this study to predict one-month ahead streamflow at Chakdara station in the Swat River Watershed. The monthly data from January 1971 to December 2015 were taken, which corresponds to a sample size of 540 values for each of the streamflow, temperature, and precipitation datasets. The datasets were divided into the training dataset (70% of the total data) and the testing dataset (30% of the total data). The detailed description of the selection of the input variables for different models is provided in Table 1. Figure 2 provides the pairwise relations between streamflow, temperature, and precipitation through a pairplot. The data were collected from the Water and Power Development Authority (WAPDA), Pakistan, and the Pakistan Meteorological Department (PMD).
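The sample size and the 70/30 division can be checked with a few lines of NumPy. The sketch assumes a chronological split (earliest months for training, latest months for testing), which is the usual choice for time series but is an assumption here; the variable names are hypothetical.

```python
import numpy as np

# Monthly time stamps from January 1971 to December 2015 (end bound is exclusive).
months = np.arange("1971-01", "2016-01", dtype="datetime64[M]")
n_months = months.size

# Chronological 70/30 split; the ordering of the split is an assumption.
n_train = int(round(0.7 * n_months))
train_months, test_months = months[:n_train], months[n_train:]

print(n_months, train_months.size, test_months.size)
print(train_months[0], train_months[-1], test_months[0], test_months[-1])
```

Under this assumption, the first 378 months form the training period and the remaining 162 months form the testing period.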
The Swat River Watershed is situated in the Khyber Pakhtunkhwa Province, Pakistan. Figure 3 illustrates the location of the Swat River Watershed in Pakistan. The perennial Swat River originates in the mountains of Swat Kohistan at the confluence of the Utar and Ushu tributaries. After flowing through the Kalam valley and the Swat area, the Swat River passes through the Malakand district and joins the Kabul River. The Swat River Watershed is mostly hilly, with elevations ranging from 360 m to 4,500 m. The glaciers lie above 4,000 m, and vegetation is visible between 1,800 m and 3,400 m [38]. Precipitation occurs mostly in winter and summer. The high precipitation in the summer monsoon season sometimes results in flooding events. The Swat River is vital for the economy of the Swat valley. It irrigates the districts of Swat, Malakand, and Peshawar and replenishes springs and water wells. The Swat River provides a natural habitat for flora and fauna in the region and attracts thousands of tourists. The hydropower stations on the Swat River provide electricity to the national grid of Pakistan.

Results and Discussion
3.1. Decomposition Analysis.
Firstly, ICEEMDAN was applied to decompose the three data series (streamflow, temperature, and precipitation) into several components, as demonstrated in Figure 4. ICEEMDAN decomposed the streamflow and temperature signals into seven IMFs (IMF1-IMF7) and a residual component (Residual), whereas the decomposition of the precipitation series yielded nine IMFs (IMF1-IMF9) and a residual component (Residual). The decomposed components (IMFs and Residual) provide information on the high- to low-frequency components present within the three input data series. The first decomposed component (IMF1) of each of the three data series obtained through the ICEEMDAN preprocessing technique was further decomposed by VMD because of its high oscillatory fluctuations. Determining the number of intrinsic modes is an important step in the VMD process, since the modes must represent the data series acceptably for an accurate approximation model [39]. Different methods have been employed for the mode determination of VMD, including the correlation analysis [40], the centre frequency method [41], and the EMD process [42]. This study applied the correlation analysis to the decomposed components, obtained through the decomposition of the observed streamflow, temperature, and precipitation series by the ICEEMDAN technique, for mode determination, as presented in Figure 5. Figure 5 shows that the numbers of modes for the decomposition of the IMF1 component by VMD were found to be eight, eight, and ten, respectively, for the streamflow, temperature, and precipitation series. The decomposition of the IMF1 component of streamflow, temperature, and precipitation is depicted in Figure 6.

Table 1: Selection of the input variables for different models (columns: Models, Input variables, Target variable).

Selection of Models Input Variables.
This study employed both the decomposition techniques and the correlation analysis to select suitable input variables for the development of all DL models. The ACF and CCF values of the three time series were calculated with a 95% confidence level to extract relevant input variables for model development. The ACF and PACF analyses for the streamflow time series are shown in Figures 7(a) and 7(b), respectively. It is evident from Figure 7(a) that a significant correlation exists at the 1st, 11th, and 12th lags; therefore, these three lag values were selected as inputs. Figure 8(a) illustrates that a significant CCF between the streamflow and temperature series is present at the 1st, 10th, 11th, and 12th lags. Therefore, these four values were also chosen as model inputs. The 3rd and 4th lag values of the streamflow and precipitation series were chosen as inputs because of their significant correlation, as shown in Figure 8(b). Table 1 demonstrates the selection of input variables for the development of different models to predict the target variable of one-month ahead streamflow. For the standalone RBF, SVR, RFR, GRU, and LSTM models, the input variables were the observed time series of streamflow ($Q_t$), temperature ($T_t$), and precipitation ($P_t$) and the components obtained through the correlation analysis of these three data series [...] (TF1-TF8), and PIMF1 (PF1-PF10)) to the observed time series of streamflow, temperature, and precipitation.
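The lag selection described above translates into a supervised-learning design matrix in which each row holds the lagged values used to predict the streamflow of one month. The sketch below builds such a matrix with NumPy for the reported lags (1, 11, and 12 for streamflow; 1, 10, 11, and 12 for temperature; 3 and 4 for precipitation). The helper name, the column naming, and the synthetic series are assumptions, and the alignment (inputs at t minus lag, target at t) is one plausible reading of one-month ahead prediction. The significance bound of about 1.96 divided by the square root of n is the usual large-sample 95% limit for correlations of a white-noise series.

```python
import numpy as np

def lagged_matrix(series_by_name, lags_by_name, target):
    """Stack lagged copies of several series as columns, aligned with the target."""
    max_lag = max(max(lags) for lags in lags_by_name.values())
    n = len(target)
    columns, names = [], []
    for name, lags in lags_by_name.items():
        x = np.asarray(series_by_name[name], float)
        for lag in lags:
            columns.append(x[max_lag - lag : n - lag])  # value at time t - lag
            names.append(f"{name}(t-{lag})")
    X = np.column_stack(columns)
    y = np.asarray(target, float)[max_lag:]              # value at time t
    return X, y, names

# Synthetic stand-in series of length 540 (not the study's data).
t = np.arange(540, dtype=float)
series = {"Q": t, "T": 1000 + t, "P": 2000 + t}
lags = {"Q": [1, 11, 12], "T": [1, 10, 11, 12], "P": [3, 4]}

X, y, names = lagged_matrix(series, lags, target=series["Q"])

# Approximate 95% significance bound for a correlation estimated from n points.
conf_bound = 1.96 / np.sqrt(len(t))
```

The first 12 months are lost because the largest lag is 12, so the 540-month record yields 528 usable samples with nine input columns.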

Models Structure and Parameter Selection.
All the analyses were performed using MATLAB R2015a on a computer with an Intel(R) Core i7-10510U CPU @ 3.70 GHz and 16 GB RAM, running a 64-bit Windows 10 operating system. Moreover, the Python 3.6 programming language was used in the PyCharm integrated development environment, with the NumPy and Pandas packages, to implement all MLMs. The Scikit-learn and Keras modules, the latter with the Google TensorFlow backend, were also employed to develop the MLMs.
For the ICEEMDAN technique, the standard deviation of the added noise was set to 0.2, the number of realizations was 500, and the maximum number of sifting iterations was set to 5000. For the VMD technique, a moderate bandwidth constraint of 2000 was used, and the Lagrangian multiplier was effectively shut off. The centre frequencies of all modes were initialized from a uniform distribution. Moreover, no DC part was imposed during the decomposition process, while the tolerance parameter was taken as 1E-7. More details on the parameter selection of ICEEMDAN and VMD can be found in [28,29]. The network consists of two hidden layers with 128, 64, or 32 nodes in each layer, and a dropout value of 0.2 was used to avoid overfitting. Adam was selected as the optimizer for all the models, and 1000 epochs were used for training the models.
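Because the streamflow, temperature, and precipitation datasets differ in dimension, normalization of the whole data is necessary to achieve the best performance of the models. The normalization was performed through the sklearn preprocessing module by employing the MinMaxScaler function to transform the data to the range between zero and one. The formula for normalization is
$$x_{\mathrm{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \tag{22}$$
where $x$ is an original value, $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the corresponding data series, and $x_{\mathrm{norm}}$ is the normalized value.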
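The min-max transformation and its inverse can be written directly in NumPy, as sketched below; with its default feature range of zero to one, scikit-learn's MinMaxScaler applies the same column-wise mapping. The function names and the small three-column array (standing in for streamflow, temperature, and precipitation) are hypothetical.

```python
import numpy as np

def min_max_scale(x):
    """Scale each column to [0, 1]; also return the minima and maxima for inversion."""
    x = np.asarray(x, float)
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return (x - x_min) / (x_max - x_min), x_min, x_max

def min_max_inverse(x_scaled, x_min, x_max):
    """Map scaled values back to the original units."""
    return x_scaled * (x_max - x_min) + x_min

# Hypothetical rows of (streamflow, temperature, precipitation) in different units.
data = np.array([
    [120.0,  5.0,  80.0],
    [340.0, 18.0,  20.0],
    [560.0, 27.0, 150.0],
    [210.0, 12.0,  60.0],
])

scaled, col_min, col_max = min_max_scale(data)
restored = min_max_inverse(scaled, col_min, col_max)
```

The inverse step matters in practice because the predictions are produced on the normalized scale and must be mapped back to streamflow units before error metrics are computed.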

Prediction Outcomes.
To verify the performance of the IVL model, the predicted results of the IVL model were compared with those of the VMD-LSTM, ICEEMDAN-LSTM, LSTM, GRU, RFR, SVR, and RBF models during the training and testing periods. Tables 2 and 3 present the results of the statistical metrics for the performance evaluation of the models in the training and testing periods. The performance of the hybrid models was far better than that of the standalone MLMs, where no decomposition of input variables was involved. Moreover, the better results of LSTM, with lower error values of the statistical metrics than the other MLMs, also established the viability of the LSTM network for predicting streamflow during the training and testing periods.
It is evident from Table 2 that the integrated IVL model yielded better accuracy and the lowest error compared to the two-stage hybrid and standalone models. Conversely, the RBF model showed the lowest effectiveness and efficiency among the standalone, two-stage, and three-stage hybrid models. During the training period, the IVL model [...]

[Figure: Correlation map of streamflow components by ICEEMDAN (axes: Observed, IMF1-IMF7, Residual); a further correlation map has axes Observed, IMF1-IMF9, Residual.]

Table 3 also illustrates the superior results of the IVL model compared to the VMD-LSTM, ICEEMDAN-LSTM, LSTM, GRU, RFR, SVR, and RBF models in terms of MAE, RMSE, and MAPE during the testing period. It is also observable that the two-stage hybrid models reduced the errors with higher efficiency than the standalone models during the testing period. Furthermore, the VMD-LSTM model showed better results than the ICEEMDAN-LSTM model during the testing period. The streamflow prediction results for all models in the training and testing periods are shown in Figures 9 and 10. It is evident from the figures that the standalone models were inferior to the hybrid models in capturing the extreme values of streamflow. The three-stage hybrid IVL model was the most efficient in predicting the peak values during the training and testing periods. The standalone models were comparatively easy to develop; however, they showed a lower accuracy in predicting the streamflow compared to the three hybrid models. The hybrid models were complex to construct; however, they showed a better capability of capturing the intricate nonlinear relation between the input and the output parameters with more accuracy.
Therefore, the hybrid models are able to meet the requirements of medium- and long-term streamflow prediction. Figures 11 and 12 illustrate the scatter plots, whereas Figures 13 and 14 present the boxplots of all models, to provide a graphical comparison of model performance during the training and testing periods. The scatter plots show the degree of dispersion and correlation between the observed and predicted values.

From Figures 11 and 12, it is evident that the scatter points of the hybrid models were nearer to the 1:1 line than those of the standalone MLMs. This provides evidence of the better accuracy delivered by the hybrid models compared with the individual MLMs. The IVL model showed the most concentrated scatter points around the regression line, with the lowest error and the highest value of $R^2$, while the RBF model had the most dispersed scatter points around the regression line. Figures 13 and 14 illustrate that the median lay toward the bottom of the box for all models during the training and testing periods, indicating that all the distributions were skewed to the right. The LSTM model revealed a better distribution of predicted data than the RBF, SVR, and GRU models during the training and testing periods. However, the boxplots of the hybrid models were [...]

According to the results discussed so far in Tables 2-3 and Figures 9-14, the IVL model proved to be the superior model for streamflow prediction when the streamflow, temperature, and precipitation variables were considered. Moreover, the results also revealed the feasibility of the ICEEMDAN and VMD approaches for improving the performance of the ML-DDMs. The three-stage hybrid prediction model enhanced the performance of the two-stage hybrid prediction models. The VMD-LSTM hybrid model presented better results than the ICEEMDAN-LSTM hybrid model, which indicates the superiority of the VMD technique over the ICEEMDAN technique. The standalone DL models (LSTM and GRU) showed better results than the standalone RFR, SVR, and RBF models, which highlights the advantages of the DL models over the other MLMs, whereas the RFR ensemble model revealed better results than the SVR and RBF models. The performance of the SVR model was also better than that of the standalone RBF model. Regardless of the different performances shown by the developed models, the results showed that all the models are feasible for streamflow prediction.
For brevity, the authors considered only the three-stage hybrid model integrating ICEEMDAN, VMD, and the LSTM network for streamflow prediction. However, practically all the developed standalone models can be extended further to two-stage and three-stage hybrid models. This shows that the ML-DDMs allow ease of extension and integration to form hybrid prediction models, which is an advantage of the ML-DDMs over the PDMs. The IVL model is also feasible for predicting different factors in the field of hydrology and meteorology, which signifies another advantage of the ML-DDMs (black-box models) compared to the PDMs (white-box models). The black-box models require only the input variables to predict the output variables, whereas in-depth consideration of the physical process is necessary for the white-box models. Accurate prediction is indispensable for the effective management of hydroresources and for the timely mitigation of extreme events and natural disasters. The proposed model can be applied to develop an early warning system for protection against flood damage, such as that caused by the flood event of 2010 in the Swat River Watershed [38]. The IVL model could also be applied to other types of time series. The prediction of wind speed, solar radiation, pollution emissions, and climate change trends is also a feasible option with the proposed model.
Despite the strong performance of the IVL model in predicting the monthly streamflow, this study has some limitations. This study considered streamflow prediction on a monthly basis; however, there is a need to investigate streamflow prediction also on a daily, weekly, and annual basis for efficient management of the watershed, reservoir operation and planning, and water allocation and supply.
Furthermore, this study employed streamflow, temperature, and precipitation variables for the streamflow prediction and did not consider important streamflow components (groundwater flow, surface, and subsurface components), infiltration, evapotranspiration, and human-made aspects. Nevertheless, the consideration of the above-mentioned components is necessary for more accurate streamflow prediction. Therefore, our future study will investigate streamflow prediction for other watersheds in Pakistan by considering different time scales, streamflow and associated components, and efficient input variable selection techniques.

Conclusions
In this study, a two-stage hybrid decomposition model was developed by integrating the ICEEMDAN and VMD techniques. Subsequently, the LSTM model was coupled with the hybrid scheme, ultimately forming a three-stage hybrid model, IVL (ICEEMDAN-VMD-LSTM), to predict monthly streamflow in the Swat River Watershed, Pakistan. The input variables for model development were selected from monthly time series data of streamflow, temperature, and precipitation by employing correlation functions and the decomposition techniques. The datasets were split into the training (70% of the total dataset) and testing (30% of the total dataset) periods. Statistical metrics, including MAE, RMSE, NSCE, MAPE, and $R^2$, were employed to evaluate the performance of the established models. The decompositions of the streamflow, temperature, and precipitation time series were performed using the ICEEMDAN technique, which resulted in the improved performance of the standalone LSTM model. Consequently, [...] The proposed model can be employed to support water and environmental monitoring tasks; hence, it provides stakeholders with an efficient means to respond to warnings and upcoming extreme events. It will eventually help to support the strategic planning, operation, and sustainable management of water resources.
Data Availability
The streamflow, temperature, and precipitation data of the Swat River, Pakistan, used to support the findings of this study are included in this article. The data are also available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Advances in Meteorology