Improved Rainfall Prediction through Nonlinear Autoregressive Network with Exogenous Variables: A Case Study in Andes High Mountain Region

Dirección de Investigación (DIUC), Universidad de Cuenca, Campus Central, Av. 12 de Abril s/n y Loja 010203, Cuenca, Ecuador
Departamento de Química Aplicada y Sistemas de Producción, Facultad de Ciencias Químicas, Universidad de Cuenca, Campus Central, Av. 12 de Abril s/n y Loja 010203, Cuenca, Ecuador
Facultad de Ingeniería, Universidad de Cuenca, Av. 12 de Abril s/n y Loja 010203, Cuenca, Ecuador
Departamento de Ingeniería Civil, Universidad de Cuenca, Av. 12 de Abril s/n y Loja 010203, Cuenca, Ecuador
Carrera de Ingeniería Ambiental, Facultad de Ciencias Químicas, Universidad de Cuenca, Campus Central, Av. 12 de Abril s/n y Loja 010203, Cuenca, Ecuador


Introduction
Precipitation is one of the main components of the hydrological cycle [1,2]. It is one of the most important variables associated with atmospheric circulation in meteorological studies [3]. Moreover, it is the main source of recharge in water balance studies from local to regional scales [4]. However, its forecast has become a significant challenge owing to the complexity of the atmospheric processes that produce rainfall [5]. This reality is evident in high mountain regions [6] with high spatiotemporal variability, like the Andes mountain range. In particular, the water originating in the Andean high mountains is used for diverse purposes (e.g., human consumption, agriculture, livestock grazing, industry, and recreation) in downstream cities [7][8][9] (e.g., Mérida in Venezuela, Bogotá in Colombia, and Quito in Ecuador). An accurate prediction of precipitation (temporal and spatial) can help decision-makers assess both flood and drought situations in advance [10,11], and it could support the management of hydrological extremes and diminish the impacts on the population.
In different studies, a variety of methods have been used to forecast rainfall. Some of them have shown accurate results, like the multiple linear regression and k-nearest-neighbors methods, as in [12,13]. Other approaches have used the outputs of global climate models to improve seasonal and subseasonal forecasting [14][15][16] and probabilistic forecasts for uncertainty quantification [17][18][19]. Other studies have applied artificial intelligence approaches to rainfall prediction, such as artificial neural networks (ANNs) [5,[10][11][12][20][21][22][23][24], support vector machines (SVMs) [12,13,24,25], logistic regression [26], and random forest [27,28]. Also, other studies have investigated the optimum selection of predictors to improve forecast accuracy [29]. The NARX model is a novel approach in many applications of environmental-variable prediction, for example, the water level of wetland systems [30], groundwater levels [31][32][33], urban drainage frameworks [34], stream flow and inflow to reservoirs [35,36], water level in urban floods [37][38][39], and rainfall forecasting [40,41]. The unpredictable and chaotic nature of rainfall behavior makes its prediction a critical challenge. The NARX model is suitable for dealing with time series with such characteristics [42], since the output value of a variable is a nonlinear function of the past values of the same variable and of other exogenous variables. Nonlinear mapping is convenient in chaotic environments, and it is superior to conventional autoregressive models. It is also superior to a plain ANN, since NARX uses its memory to make predictions. NARX presents an improvement opportunity due to its performance in the prediction of time series with a seasonal component.
In Andean countries like Ecuador, some of the methods mentioned above have also been used to predict precipitation. For example, in [43], continuous transfer function models were used to predict rainfall on the coast of Ecuador. Also, Mendoza et al. [44] used sea surface temperature (SST) as a predictor and canonical correlation as a method to predict rainfall on the coast and in the Andes of Ecuador. These studies highlight the complex relationship between predictors and rainfall in the Andes. However, new methods like NARX could be applied in the Ecuadorian Andes to improve the performance of precipitation forecasting.
This study first explores the performance of the most commonly used linear models for rainfall forecasting. It then compares them with nonlinear models to verify their performance in a mountain basin located in southern Ecuador. Finally, NARX models are used to forecast rainfall several steps ahead. The results obtained can have a significant impact on mountain systems because they could be used to obtain improved performance in high mountain areas.

Study Area.
The study area is composed of the high mountain subbasins Tomebamba and Machángara, located in the southern Ecuadorian Andes (Figure 1). The two subbasins are essential for supplying water to the city of Cuenca, considered the third most important city in Ecuador [45]. Moreover, the two subbasins belong to the Paute hydrographic system, which, together with others, provides water for energy generation. The subbasins' altitude varies between 2440 and 4400 m.a.s.l., which in turn allows for a great variety of vegetation. For example, in the high areas there are patches of Polylepis reticulata, tussock grasses, ground rosettes, and cushion plants, while in the low areas there are agricultural land and pastures, as well as urban zones [9,46,47].
Two meteorological stations were taken into account for the analysis: Labrado and Chirimachay. These stations are located in the upper parts of the two subbasins at 3300 m.a.s.l. (Figure 1). According to Celleri et al. [48], there are two types of precipitation regime within the study area: bimodal type I in the lower parts of the subbasins and bimodal type II in the middle and upper zones. Both regimes have two precipitation peaks, in April and October; however, the bimodal type II regime presents a less pronounced dry season from June to August (Figure 2).

Precipitation Data.
Rainfall data from the Labrado (Machángara subbasin) and Chirimachay (Tomebamba subbasin) stations were provided by the Ecuadorian National Institute of Meteorology and Hydrology (INAMHI, http://www.serviciometeorologico.gob.ec). The study period is 1964-2015. Table 1 shows a statistical summary of the rainfall data.
Simple Linear Model.
The simple linear model relates the output y to a single predictor x as

y = β0 + β1x + ε,

where β0 is the bias (offset), β1 is the coefficient (slope) of variable x, and ε is the error or random noise.

Multivariate Linear Model.
In general, the multivariate linear model supposes that p distinct predictors are available and that the output is a weighted linear combination of the set X of predictor variables x. The formula of the multivariate linear model is as follows:

y = β0 + β1x1 + β2x2 + · · · + βpxp + ε,

where xj represents the jth predictor and βj represents the magnitude of the relationship between the jth predictor and the output y. The absolute value of the coefficient βj defines the degree of influence of the predictor over the output [73].
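As a minimal illustration (with synthetic data and illustrative coefficients, not values from this study), such a model can be fitted by ordinary least squares with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix: n samples of p predictors
# (e.g., lagged climate indices); shapes are illustrative only.
n, p = 120, 3
X = rng.normal(size=(n, p))
beta_true = np.array([0.8, -0.4, 0.1])
y = 2.0 + X @ beta_true + rng.normal(scale=0.1, size=n)

# Fit y = beta_0 + sum_j beta_j * x_j by ordinary least squares.
A = np.column_stack([np.ones(n), X])          # prepend intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # [beta_0, beta_1, ..., beta_p]

print(coef.round(2))  # close to [2.0, 0.8, -0.4, 0.1]
```

The fitted vector contains the bias β0 followed by the slopes βj, whose absolute values rank the predictors' influence as described above.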

Linear Model Regularization and Selection.
As the availability of variables increases, the probability of overfitting also increases [74]. Overfitting is an error that occurs when a model fits too closely to a limited set of data points, decreasing its predictive power. To prevent this drawback, ridge and lasso regressions are commonly used as regularization techniques [73]. The objective of applying these techniques is to obtain parsimonious models. In the multivariate linear model, the parameters are adjusted by minimizing the cost function (the residual sum of squares, RSS) through least squares. The RSS formula is as follows:

RSS = Σi (yi − β0 − Σj βj xij)²,

where the outer sum runs over the n samples i = 1, . . . , n, the inner sum over the predictors j = 1, . . . , p, and xij is the value of the ith sample for the jth predictor. Ridge regression adds a penalty term to the cost function equal to the sum of the squares of the coefficient magnitudes. As a result, this method keeps all the predictors but with the lowest coefficient magnitudes. In the lasso regression approach, the penalty term is equal to the sum of the absolute values of the coefficients, and some of them can even become zero. This is the reason why lasso is considered a feature selection method. Ridge regression and lasso minimize, respectively, the quantities

RSS + λ Σj βj²  and  RSS + λ Σj |βj|,

where λ ≥ 0 is a tuning parameter. The impact of the regularization on the estimated coefficients is controlled through the λ that accompanies the penalty term. In an extreme case, when λ = 0, the penalty term has no effect. Conversely, as λ ⟶ ∞, the influence of the regularization penalty increases [73].
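A minimal scikit-learn sketch of the two penalties (synthetic data; `alpha` is scikit-learn's name for λ) illustrates the qualitative difference: ridge shrinks all coefficients, while lasso can set irrelevant ones exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
# Only the first two predictors matter; the other eight are noise.
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.2, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha plays the role of lambda
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge keeps every coefficient nonzero (merely shrunk);
# lasso drives the coefficients of the noise predictors to zero.
print(int(np.sum(np.abs(ridge.coef_) > 1e-6)))  # 10
print(int(np.sum(np.abs(lasso.coef_) > 1e-6)))
```

This feature-selection behavior is exactly why the text treats lasso as a variable selection method while ridge retains all predictors.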

Training and Testing
Datasets. The data of the 27 synoptic climatic indices were lagged 12 months with respect to the rainfall data. This decision was made for a practical reason: to predict 12 months of rainfall, information on the exogenous variables from 24 to 12 months earlier is necessary. From there, the information was further lagged from 1 to 24 months, so 675 time series corresponding to the climatic indices were used as predictors. To incorporate the sense of autoregression into the linear models, the rainfall signal itself was used as a predictor in the models that considered lagged exogenous predictors. Although some predictors were correlated with each other (e.g., a Spearman index of 0.94 between TNA and NTA), none were omitted, because only perfectly correlated signals (i.e., genuinely redundant ones) provide no additional information [75]. The original dataset was divided into two subsets: the first from January 1964 to December 2014 and the second from January 2015 to December 2015.
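The lagging scheme described above can be sketched with pandas; the series, column names, and sizes below are synthetic placeholders, not the study's data.

```python
import numpy as np
import pandas as pd

# Illustrative monthly series, 1964-2014 (612 months); column names
# are hypothetical stand-ins for the rainfall and one climate index.
idx = pd.date_range("1964-01", periods=612, freq="MS")
df = pd.DataFrame({
    "rainfall": np.random.default_rng(1).gamma(2.0, 40.0, size=612),
    "nino12": np.random.default_rng(2).normal(size=612),
}, index=idx)

def add_lags(frame, cols, max_lag):
    """Append columns col_k holding each predictor lagged k months."""
    out = frame.copy()
    for col in cols:
        for k in range(1, max_lag + 1):
            out[f"{col}_{k}"] = out[col].shift(k)
    return out.dropna()  # drop the first max_lag incomplete rows

lagged = add_lags(df, ["rainfall", "nino12"], max_lag=24)
print(lagged.shape)  # (588, 50): 612 - 24 rows, 2 original + 48 lag columns
```

Applying the same construction to all 27 indices (lags 0 to 24) yields the 675 predictor series mentioned in the text (27 × 25 = 675).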
The Min-max normalization process was applied to the first subset, which produced the parameters used to normalize the second subset as well. The first subset was randomly divided into a training subset (80%, to find the best coefficients) and a test subset (20%, to assess the model). With this approach, fifty multivariate linear models to estimate the rainfall were fitted, namely, twenty-five for each method (ridge and lasso): one with no lags and 24 with lags ranging from 1 to 24. Cross-validation was used to define the best value of λ and fit the models [73].
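A sketch of this pipeline in scikit-learn (synthetic data; the alpha grids are illustrative): the Min-max scaler is fitted on the first portion only and its parameters reused, and cross-validation selects λ for each method.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(600, 20))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=600)

# 80%/20% random split, mirroring the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training portion only, then reuse its parameters.
scaler = MinMaxScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Cross-validation chooses the best lambda (called alpha in scikit-learn).
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_train_s, y_train)
lasso = LassoCV(alphas=np.logspace(-3, 1, 9), cv=5).fit(X_train_s, y_train)

print(round(ridge.score(X_test_s, y_test), 2))
print(round(lasso.score(X_test_s, y_test), 2))
```

Fitting a scaler on the training data and reusing it on the held-out data avoids leaking information from the evaluation period into the model.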

Random Forest.
Random forest (RF) [76] is based on the idea of improving the prediction of a model by assembling the results of an ensemble of decision trees [77]. Each random tree is based on an independent random vector of sample values with the same distribution. RF has become popular in hydrological and climatic applications (e.g., [27,28]) due to its high performance, its efficient training on big and high-dimensional datasets, and its estimation of the importance of the different features that describe the instances. Several hyperparameters can be optimized in the construction of a random forest model. Among them, different values for the number of decision trees, their maximum depth, and the number of features used in the construction of the trees were explored in a grid search with 6-fold cross-validation [78]. For this, 85% of the data were used as the training subset. The remaining independent 15% was used for testing the performance of the resulting model. Finally, the model was used for the prediction of rainfall in 2015 at the two study stations.
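This grid-search procedure might be sketched as follows with scikit-learn (synthetic data; the hyperparameter grids are illustrative, not those of the study).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 8))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=400)

# 85% for grid-searched training, 15% held out, mirroring the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15,
                                          random_state=0)

grid = {
    "n_estimators": [100, 300],      # number of decision trees
    "max_depth": [5, 10, None],      # maximum tree depth
    "max_features": ["sqrt", 1.0],   # features tried at each split
}
search = GridSearchCV(RandomForestRegressor(random_state=0), grid, cv=6)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.best_estimator_.score(X_te, y_te), 2))
```

The best estimator returned by the search is then applied to the held-out 15% and, in the study's setting, to the 2015 forecast period.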
Support Vector Machines.
Support vector machine (SVM) [79] raises the dimensionality of the data to a vector space where it is possible to construct a linear regression model. The linear regression is performed based on representative data points that make up a so-called support vector. The representation of the data in a high dimension is done using kernel functions; here, the Gaussian radial basis function (RBF) was used. Two hyperparameters must be optimized: the regularization (parameter C) and the spread of influence of the support vector (parameter gamma). The optimization of the hyperparameters was done in the same way as for RF, i.e., 85% of the data for training with 6-fold cross-validation and 15% for testing. The resulting model was used for predicting the rainfall in 2015 at the two study stations. Many studies in climate and meteorology have used SVM [12,13,24,25] and RF for prediction, so we use them here as baselines against which to compare the performance of NARX.
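An analogous sketch for the SVM with an RBF kernel, tuning C and gamma by 6-fold cross-validation over synthetic data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + rng.normal(scale=0.05, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15,
                                          random_state=0)

# Tune the regularization (C) and the kernel spread (gamma) by 6-fold CV.
grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVR(kernel="rbf"), grid, cv=6)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.best_estimator_.score(X_te, y_te), 2))
```

Larger gamma values localize the RBF kernel's influence around each support vector, while larger C values penalize training errors more strongly; the grid search balances the two against overfitting.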

Recurrent Neural Nets with Exogenous Inputs NARX Model.
The nonlinear autoregressive network with exogenous inputs (NARX) is a dynamic recurrent neural network (RNN) with feedback connections that enclose several layers of the network; i.e., the output is considered as another input of the network. Figure 3 depicts the NARX model architecture.
Its memory ability is useful for the prediction of nonlinear time series. Besides, unlike classic artificial neural networks, NARX gains degrees of freedom by incorporating valuable information from exogenous inputs [32]. There are two different architectures of NARX: the series-parallel architecture (called open-loop) and the parallel architecture (called close-loop), presented by the following equations, respectively:

ŷ(t + 1) = F[y(t), y(t − 1), . . . , y(t − n_y), X(t + 1), X(t), . . . , X(t − n_x)],

ŷ(t + 1) = F[ŷ(t), ŷ(t − 1), . . . , ŷ(t − n_y), X(t + 1), X(t), . . . , X(t − n_x)],

where F(·) is the mapping function and ŷ(t + 1) is the predicted output of NARX for the time t + 1, determined at the instant t. The terms y(t), y(t − 1), . . . , y(t − n_y) are the true past values of the time series, also known as the ground truth, and ŷ(t), ŷ(t − 1), . . . , ŷ(t − n_y) are the past predicted outputs generated by the NARX. The true values of the exogenous inputs are X(t + 1), X(t), . . . , X(t − n_x). The numbers of input and output delays are defined by n_x and n_y, respectively. The series-parallel architecture is used in the training process because the real output is available, leading to a conventional feed-forward representation, while the parallel architecture makes predictions by feeding back the estimated output instead of the actual output. The mapping function F(·) (which is initially unknown) is fitted as the training process progresses. The multilayer perceptron (MLP) architecture is used to represent this approximation, since it is a robust structure capable of learning any kind of continuous nonlinear mapping. A classic MLP contains three basic layers: input, hidden, and output layers. Moreover, it has elements such as neurons, activation functions, and connection weights. As the number of hidden neurons increases, the model can approximate more complex functions. However, the selection of the number of hidden neurons depends on the case study being addressed. In general, just one hidden layer with a number of neurons in the range [0.5p, 2p] is commonly used [80].
The sigmoid-linear transfer function combination can provide an efficient mathematical representation of the output as a function of the input signal [32].

NARX Model Architecture.
In the NARX models, the 27 synoptic climate indices and the rainfall signal itself were used as inputs to the nets, and rainfall was the output. Again, none of the predictors was excluded. To identify the optimum NARX architecture, and given the massive number of possible net settings, the standard trial-and-error method was used to select the number of hidden nodes and the number of lags. The number of input layer neurons depended on the number of lags used, so architectures with 2, 3, 4, 5, 6, 9, and 12 lags were tested. The output layer had just one neuron. The net had one hidden layer with 10, 20, 30, 40, or 50 neurons. The neurons in the hidden layer used a sigmoid transfer function, whereas the output neuron used a linear transfer function [32]. During the training process, the first subset was randomly divided into a training sample, a cross-validation sample, and a test sample (70%, 15%, and 15%, respectively). The connection weights were initialized randomly and tuned using the Levenberg-Marquardt algorithm [80], which is one of the most widely used functions for time series network prediction and training [81]. It is essential to mention that the training was performed using the series-parallel architecture, and the test sample was evaluated with that same architecture. The test sample is randomly selected, so the sense of temporality is lost. Once the NARX was fitted in the series-parallel configuration, it was converted to the parallel architecture, and the test close-loop set (second subset) was evaluated accordingly. This model can perform forecasting for several time steps ahead; hence, the predicted outputs (at previous steps) constitute a real time series, as does the test close-loop subset.
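The open-loop/close-loop procedure can be emulated in a simplified sketch (not the study's Levenberg-Marquardt setup): a feed-forward net is trained on lagged true outputs plus an exogenous input, then forecasts 12 steps ahead by feeding its own predictions back. All series, sizes, and lag counts below are synthetic placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic monthly series with a seasonal cycle plus one exogenous input,
# already on a Min-max-like [0, 1] scale.
rng = np.random.default_rng(11)
T = 480
t = np.arange(T)
x_exog = np.sin(2 * np.pi * t / 12) + 0.1 * rng.normal(size=T)
y = 0.5 + 0.4 * np.sin(2 * np.pi * t / 12 - 1.0) + 0.05 * rng.normal(size=T)

n_lags = 3  # n_x = n_y = 3 for simplicity

def make_features(y_hist, x_hist):
    """Concatenate the last n_lags output values and exogenous values."""
    return np.concatenate([y_hist[-n_lags:], x_hist[-n_lags:]])

# Open-loop (series-parallel) training: true past outputs as inputs.
F = np.array([make_features(y[:i], x_exog[:i]) for i in range(n_lags, T - 12)])
target = y[n_lags:T - 12]
net = MLPRegressor(hidden_layer_sizes=(20,), activation="logistic",
                   max_iter=2000, random_state=0).fit(F, target)

# Close-loop (parallel) forecasting: feed predictions back for 12 steps.
y_hist = list(y[:T - 12])
preds = []
for i in range(T - 12, T):
    f = make_features(np.array(y_hist), x_exog[:i])
    p = net.predict(f.reshape(1, -1))[0]
    preds.append(p)
    y_hist.append(p)  # recursion: the predicted output becomes an input

print(np.round(preds[:3], 2))
```

The recursion in the final loop is what makes errors compound with lead time, which is why, as reported later, the forecasts degrade after the first few months ahead.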

Nash-Sutcliffe Efficiency Coefficient (NSE).
The NSE is widely used to evaluate the performance of hydrological models. NSE is often considered more informative than other metrics, such as the coefficient of determination. However, it is susceptible to extreme values, since it sums the squared differences between observed and predicted values [82]. This index is defined by equation (5):

NSE = 1 − Σi (Oi − Pi)² / Σi (Oi − Ō)², (5)

where Oi and Pi are the observed and predicted values in each period, respectively, and Ō is the average of the observed values.

Kling-Gupta Efficiency (KGE).
The KGE is a performance measure based on three equally weighted components: variability, linear correlation, and bias ratio between predicted and observed data. This index is defined by equation (6):

KGE = 1 − sqrt((cc − 1)² + (a − 1)² + (β − 1)²), (6)

where a is the variability (the ratio of the standard deviation of the predicted values to that of the observed values), cc is the linear correlation between predicted and observed values, and β is the ratio of the average of the predicted values to the average of the observed values.

Determination of Bias Percentage (PBIAS).
The PBIAS determines whether there is a tendency in the values predicted by the model (i.e., whether they are higher or lower than the observed values). A positive PBIAS indicates that the model underestimates the predicted variable, while a negative one indicates that the variable is overestimated. The optimal value is a PBIAS equal to zero. This index is defined as

PBIAS = 100 × Σi (Oi − Pi) / Σi Oi.

Root Mean Square Error (RMSE).
The root mean square error measures the difference between forecasted values and observations. The RMSE is always positive, and values close to zero indicate a perfect fit; the RMSE is also sensitive to outliers. It is defined as

RMSE = sqrt((1/N) Σi (Oi − Pi)²),

where N is the length of the time series. NSE, KGE, and PBIAS can classify the goodness of fit into four categories [83,84], as shown in Table 2.
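The four metrics above can be implemented directly from their definitions; the functions below follow this section's formulas, with a small synthetic observed series used as a check.

```python
import numpy as np

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 indicates a perfect fit."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, pred):
    """Kling-Gupta efficiency from correlation, variability, and bias ratios."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    cc = np.corrcoef(obs, pred)[0, 1]
    a = pred.std() / obs.std()      # variability ratio
    b = pred.mean() / obs.mean()    # bias ratio
    return 1.0 - np.sqrt((cc - 1) ** 2 + (a - 1) ** 2 + (b - 1) ** 2)

def pbias(obs, pred):
    """Percent bias: positive means the model underestimates."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return 100.0 * np.sum(obs - pred) / np.sum(obs)

def rmse(obs, pred):
    """Root mean square error: 0 indicates a perfect fit."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

obs = np.array([100.0, 80.0, 120.0, 90.0])
print(nse(obs, obs), kge(obs, obs), pbias(obs, obs), rmse(obs, obs))
# A perfect prediction gives NSE = 1, KGE = 1, PBIAS = 0, RMSE = 0
```

Note the PBIAS sign convention matches the text: predictions uniformly 10% below the observations yield a PBIAS of +10 (underestimation).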
These metrics evaluate the goodness of fit of the models on the training subset and the forecasting performance on the testing subset.

Multivariate Linear Model.
To obtain a multivariate linear model for predicting rainfall for Labrado and Chirimachay, the fitted models obtained for each λ that produced the best performance were selected. The NSE metric applied to the training and testing sets with the selected models, for both ridge regression and lasso, is depicted in Figure 4.
For both ridge regression and lasso, a tendency is evident: as the number of lags increases, the NSE in the training set grows as well. However, for the Labrado station (Figure 4(a)), the performance grew with the number of lags up to 18, where the performance in the test set fell. The best-fitting models, for both ridge and lasso, were obtained with around 16 lags. In the models with 16 lags, the performance of both methods was the same, while the fit was better for the ridge method. Despite the use of regularization methods, overfitting arose in models that used 18 lags or more, where the fit clearly increases but the performance decreases. At the Chirimachay station (Figure 4(b)), the behavior was similar to Labrado. The best models for both ridge and lasso had around 18 lags. In these models, ridge was better than lasso in both fit and performance. Again, the overfitting problem arose from the model with 19 lags. The top five predictors are shown in Appendix 1. These predictors were ranked by the absolute value of βj in each fitted model.
In Appendix 1, it can be seen that, for the Labrado station, the predictor Niño 1 + 2 lagged 1 period (Niño 1 + 2_1) appears as the most influential index in the linear models defined by the lasso method. For the ridge method, by contrast, Niño 1 + 2 appears in the first six lagged models; hurricane activity lagged 7 periods (HURR_7) becomes the most influential in the models with seven to eleven lags, and in the remaining models the most influential predictor is the rainfall signal itself with a lag of 12 months. For the Chirimachay station, Niño 1 + 2 was the most crucial variable because it always appears as the first predictor of the model obtained by lasso. Meanwhile, for ridge regression, the rainfall itself (with a lag of one month) is the most influential predictor in the models that considered up to eleven lags; from there on (models with lags from twelve to twenty-four months), the rainfall with lags of 11 and 12 months became the most important predictor. Other influential predictor indexes are the North Pacific Oscillation lagged 1 period (NP_1) and Sahel rainfall lagged 3 or 5 periods (SAHELRAIN_3 and SAHELRAIN_5), which appear in the majority of the models for both lasso and ridge regression, regardless of the station. It is worth noting that when we refer to the predictors, for example, Niño 1 + 2 lagged 1 period, the true lag is 13 periods, since we initially induced a lag of 12 periods in the predictors with respect to the rainfall variable in the initial database.
This naturally does not apply to the delayed rainfall variables. ENSO indexes have a strong influence on Ecuadorian rainfall [43]: the authors found that significant predictors for rainfall come mainly from the tropical Pacific sea surface temperature, especially from ENSO events. This statement matches our finding on the Niño 1 + 2 and North Pacific Oscillation indexes. Overall, the models obtained for both Labrado and Chirimachay were "unsatisfactory" according to Table 2, implying the need to move to nonlinear models to improve the performance.

Random Forest Model.
The RF model with 12 lags had the best performance at the Labrado station. However, the testing shows (Table 3) low values of NSE and KGE; thus, these are classified as "unsatisfactory". In contrast, PBIAS has a "very good" value, and the RMSE is relatively high. In the test close-loop, the values of NSE, KGE, and RMSE are poor; only PBIAS is better relative to the test subset. The Chirimachay station also presents its best RF model with 12 lags, similar to Labrado. Thus, in the testing, NSE and KGE show "unsatisfactory" values, and PBIAS is "very good". In the test close-loop subset, the metrics NSE, KGE, and RMSE show low performance, and only PBIAS is "very good". The RF models show poor forecasting results (Figure 5); NSE and KGE have low, unsatisfactory values. The model has difficulty in forecasting high and low extreme values; the difference between observed and forecasted values is most evident in January and July for both stations. In general, RF shows a low forecasting performance, although it is better than the LM.
Also, Chen et al. [85] found RF performing better than LM for drought forecasting, although extreme events were not accurately predicted. In concordance with [85], the RF models are not accurate at the rainfall peaks. Rainfall at the Labrado and Chirimachay stations shows two high peaks per year, in April and October (Figure 2). For this reason, low monthly forecasting performance is observed in both the wet and dry seasons.

Support Vector Machine Model.
The SVM model with 3 lags presents the best performance for the Labrado station; however, in the test the statistics are weak: the NSE and KGE show "unsatisfactory" values, PBIAS is "very good," and the RMSE is relatively high. In the test close-loop, the metrics for evaluating the model worsen; only the RMSE shows a slight improvement. Therefore, the number of lags does not have a strong influence on the model: SVM models with 3, 6, 9, and 12 lags have very similar values of NSE, KGE, and PBIAS, and the differences between numbers of lags are not significant. At the Chirimachay station, the SVM model with 12 lags exhibits the best performance. However, in the test subset (Table 4), the low values of NSE and KGE are "unsatisfactory," PBIAS is "very good," and the RMSE is relatively high. The SVM models for Labrado and Chirimachay are similar; in other words, the test close-loop shows lower values than the test. These values are poor and classified as "unsatisfactory"; also, both models have "very good" PBIAS in the test and test close-loop. Figure 6 shows the SVM model forecasts vs. observed data. The model has difficulty forecasting rainfall peaks, mainly in January and September, where significant discrepancies are observed at both stations. However, the Chirimachay station displays a lower performance. Although SVM is one of the most accurate rainfall prediction models [12], it fails to predict extreme rainfall [12]; for this reason, SVM does not show good forecasting for Labrado and Chirimachay, both of which show several low and high peaks in rainfall (Figure 6).

Recurrent Neural Nets with Exogenous Inputs (NARX).
Different numbers of hidden neurons were evaluated for their ability to generate accurate model outputs. A hidden layer formed by 50 neurons with a sigmoid transfer function and a single output neuron with a linear function provided the most effective network architecture. For the Labrado station, 3 lags in the input were necessary to produce the best model, and 6 lags for Chirimachay. Table 5 shows the performance of the best models.
According to Table 5, for the Labrado basin, the goodness of fit (evaluated on the train, cross-validation, and test samples) is "satisfactory" according to the NSE coefficient, "good" according to KGE, and "very good" according to PBIAS. Likewise, the performance (evaluated on the test close-loop) is "satisfactory" for both NSE and KGE and "good" for PBIAS. For the Chirimachay station, the goodness of fit was also "satisfactory" for both NSE and KGE and "very good" following PBIAS. Finally, the performance was "satisfactory" according to NSE, KGE, and PBIAS. As an overall evaluation, the model fitted for Labrado was "good," and the model for Chirimachay was "satisfactory." Figure 7 shows the predicted versus real rainfall on the training subset for Labrado (a) and Chirimachay (b). Besides, the predicted and real rainfall in the test close-loop are presented as time series in (c) for Labrado and (d) for Chirimachay. Figure 7 shows that, on the training subset, the model reached a perfect fit for some data points (points over the line), but for other data points the predictions were weak (Figures 7(a) and 7(b)). The Labrado station shows problems mainly at rainfall peaks, where both overestimation and underestimation are present (Figure 7(c)). In contrast, the Chirimachay station shows better forecasting than Labrado: the tendency is captured well from months 1 to 8, while from months 9 to 12 the forecasting is poor (Figure 7(d)).
On the test close-loop data points (second dataset), the predictions were "very good" for the first six months; after that, the performance decreases. This makes sense because the forward predictions are based on predictions from previous steps; therefore, previous forecast errors affect subsequent predictions. All prediction models depend on long-term data from the past [81]. This is consistent with our study, which works with a long time series and in which the NARX models produce good forecasts, although the prediction quality decreases as the lead time increases. The overall results show a satisfactory performance of the NARX model. Moreover, at a lower time scale and with local exogenous variables, the NARX model has shown excellent performance in other studies as well [40,41]. Also, studies such as [81,86] show the suitability of the NARX model for obtaining good predictions on problems involving long-term dependencies. Nevertheless, those studies use few predictors; hence, their performance could improve by including more exogenous variables, as in our study.
Therefore, NARX models are presented as a good alternative for the prediction of some hydrometeorological variables at various temporal scales, because they take advantage of the information in relevant exogenous variables.

Conclusions
In this study, linear and nonlinear approaches were evaluated for rainfall forecasting. For the linear models, the climatic indices Niño 1 + 2, Sahel rainfall, hurricane activity, and North Pacific Oscillation, as well as the delayed rainfall signal itself, were the most influential in the prediction of rainfall. Even for models with predictors delayed by 12 months or more, the rainfall signal delayed by 12 months was the most significant. This reflects the seasonality present in the rainfall signal, with similar values expected for the same month every year. However, the linear models did not capture the rainfall dynamics well, and hence the evaluations were "unsatisfactory." The linear models reached their best performance with predictors lagged around 16 or 18 periods, plus the initial delay of 12 months between the predictors and the rainfall variable. The RF and SVM showed better forecasting than the LM at both stations, but these models are inaccurate at the low and high peaks of rainfall. The SVM exhibits a better performance than RF; nevertheless, in general, both display poor forecasting performance at Labrado and Chirimachay.
Finally, the nonlinear autoregressive network with exogenous inputs was also evaluated. Considering that the behavior of rainfall is chaotic and highly nonlinear, the NARX networks had a better performance, yielding models considered "good" for the Labrado station and "satisfactory" for the Chirimachay station. The best NARX networks reported had 50 neurons in the hidden layer, with delays that were multiples of 3 periods (3 for Labrado and 6 for Chirimachay). In the prediction of the 12 months ahead, the models almost always follow the real changes in the rainfall signal. For the first 6 months, the models are very accurate in the prediction. Nevertheless, as the prediction extends further ahead, the performance decreases. The forecasting shows several problems in predicting peaks; this was more evident at the Labrado station. These results could be attributed to the complex behavior of weather in Andean regions, which has several external and local influences affecting the dynamics of rainfall processes.
Thus, future challenges in rainfall prediction must focus on capturing these peaks and identifying the triggers of precipitation events. These improvements in rainfall prediction are essential for generating plans for suitable decision-making in water resources management.