Forecasting Water Demand in Residential, Commercial, and Industrial Zones in Bogotá, Colombia, Using Least-Squares Support Vector Machines

The Colombian capital, Bogotá, has undergone massive growth in a short period of time. Naturally, this growth has increased the city’s water demand. The prediction of this demand will help understand and analyze consumption behavior, thereby allowing for effective management of the urban water cycle. This paper uses the Least-Squares Support Vector Machines (LS-SVM) model for forecasting residential, industrial, and commercial water demand in the city of Bogotá. The parameters involved in this study include the following: monthly water demand, number of users, and total water consumption bills (price) for the three studied uses. Results provide evidence of the model’s accuracy, producing $R^2$ values between 0.8 and 0.98, with an error percentage under 12%.


Introduction
Bogotá, the capital of the South American nation of Colombia, has undergone significant population growth in recent years, at a rate of 1.31% annually (National Administrative Department of Statistics). Such a growth rate has increased the consumption of natural resources to ensure normal urban development. Of these resources, the most widely used one is water, due to the high demand for water in domestic, commercial, service, and industrial activities. An understanding of water consumption behavior, in addition to the ability to predict its use, would help public entities effectively plan and manage the development of cities, especially in light of the importance of water in a city's development. By understanding and predicting water use, positive environmental effects can be generated [1]. Furthermore, doing so would contribute to urban water cycle management in terms of strategies for the optimal development of drinking water infrastructure, water consumption control, the environment, and sanitation [2].
Two important components must be accounted for when looking at water demand forecasting and water consumption behavior: on one hand, timing and, on the other, factors that influence consumption. The first component is measured using three different time scales: short, medium, and long. Short-scale measurements refer to hourly, daily, and weekly amounts of water and its use, which generally support the real-time control, operation, and management of the system [3]. Medium-scale measurements refer to months and seasons, focusing on the variability of water consumption [4]. Long-scale measurements refer to time periods greater than a year. Medium- and long-scale measurements are usually combined for financial, capacity, and expansion planning [5][6][7].
With regard to the most relevant elements that influence demand, a number of factors merit mention. These include climate (temperature, precipitation, evaporation, and humidity), demographic factors (population and population density), study-site infrastructure and system (productivity, technology, etc.), and political and administrative factors (application programs, education, and cost) [8][9][10]. These factors determine water consumption, which means they should be considered input data for forecasting. For a short time scale, factors generally studied include climate (maximum and minimum temperature), sunshine hours, price, water use, working day, greenery coverage, building size, population density, and type of use. For medium scale, the main factors are season, greenery coverage, price, building size, water use, population density, and type of use. As for long scale, pertinent factors include climate or season, price, water use, per capita demand, population density, and type of use [5].
According to Kofinas et al. [11], models for water demand forecasting can be divided into two types: deterministic and stochastic. The former are used for short and medium scales; primary approaches include simple, multiple, semilog, and log-log regressions [11][12][13][14]. The latter are most often used for long-scale forecasting and include autoregressive, moving average, autoregressive moving average, autoregressive integrated moving average, Artificial Neural Network (ANN), fuzzy logic, and support vector machine models [15][16][17][18][19].
This paper forecasts water demand for residential, industrial, and commercial uses in the city of Bogotá using Least-Squares Support Vector Machines (LS-SVM). The data analyzed herein were recorded on a monthly basis from 2004 to 2014; these data include the number of users, price (total billed), and consumption.
In the next section, readers are presented with a literature review on water demand forecasting using support vector machine (SVM), support vector regression (SVR), and LS-SVM. In the following section, water use is studied as a function of land use; also, the model, along with its variables, is discussed. In the final section, results and conclusions are presented.

Literature Review
This section presents a concise review of research that has relied on SVM, SVR, and LS-SVM models, especially with regard to water demand forecasting.
Chen and Zhang [20] used an LS-SVM model to calculate hourly demands; the authors found that the LS-SVM model performed better than an artificial Feedforward Neural Network Backpropagation (FNN-BP), mainly due to the fact that LS-SVM is based on structural risk minimization, which considers both the empirical risk and the confidence interval, so that minimizing the risk leads to a more accurate prediction. Ji et al. [21] compared four different algorithms, namely teaching-learning-based optimization (TLBO), ameliorated TLBO (ATLBO), grid search, and particle swarm optimization (PSO), to adjust the hyperparameters of LS-SVM for forecasting hourly water demand in Yanqia in China; the researchers' results showed that the model with ATLBO was more accurate than those tuned with the other algorithms. Shabri and Samsudin [22] applied Empirical Mode Decomposition (EMD) to the intrinsic mode function in the ANN and LS-SVM; they found that the EMD combined with LS-SVM performed best (root-mean-square error (RMSE) = 2.0 and Mean Absolute Error (MAE) = 1.64) for monthly water demand forecasting, followed by LS-SVM (RMSE = 2.1 and MAE = 1.81) and EMD combined with ANN (RMSE = 2.71 and MAE = 1.86).
Msiza et al. [23,24] compared Artificial Neural Network (ANN) to SVR, finding that the ANN model performed better (with an error rate of 2.95%), though, despite a greater error rate (5.46%), the SVM model satisfactorily described demand behavior. Herrera et al. [17] predicted hourly water demand in southeast Spain using climate factors such as temperature, wind speed, precipitation, and atmospheric pressure. The authors used the following models: ANN, Projection Pursuit Regression, Multivariate Adaptive Regression Splines, Random Forests, Weighted Pattern-Based Model for water demand forecasting, and SVR. The authors found that SVR presented less error, proposing it as the most appropriate model for the forecast of hourly demands. Bai et al. [25] proposed a variable-structure support vector regression (VS-SVR) and compared it with LS-SVM. They found that VS-SVR was more accurate and improved the forecasting performance on account of being a dynamic model. Brentan et al. [26] compared an SVR with a hybrid model (SVR with a Fourier Time Series) for hourly water demand forecasting. Their results showed that SVR described a pattern yet had problems with extreme values, whereas the hybrid model performed better in terms of pattern description and reduced errors considerably.
During the same year Wu and Wang [27] assessed the performance of a support vector machine for annual data, finding relative errors of 0.91%, 1.86%, and 0.93% for the dataset evaluated. This, according to the authors, demonstrated that the SVM was highly accurate for demand forecasting. Liu and Chang [28] used SVM for forecasting monthly water demand in Shanxi, China; the authors' results showed that SVM prediction proved better when the sample size was relatively small. Chen [29] and Yang et al. [30] optimized SVM with a Genetic Algorithm (GA-SVM) to determine training parameters for SVM. For purposes of evaluation, the authors compared their optimized model to ANN and Grey Model (GM), finding that GA-SVM performed well relative to ANN and GM. Another study, done by Sampathirao et al. [31], included a comparison of three different approaches: ARIMA; Box-Cox transformation, ARMA Errors, Trends, and Seasonality (BATS); and SVM. Based on their study, the authors concluded that the BATS model was best at predicting water demand, with a mean square error (MSE) of 0.0043 and an RMSE of 0.058. Zhang et al. [32] optimized the SVM with an asexual genetic algorithm (AGG). This optimization was compared to a Backpropagation Neural Network (BPNN) for forecasting water demand, with results indicating SVM was more accurate for this problem.

only represents 23.49% of the city's total physical space, but it contains 99.92% of the city's population.

Pipe network distribution
Currently, the city uses a socioeconomic stratification system in accordance with domestic laws regarding public services. This system classifies housing into six brackets relative to its surroundings and construction materials. Under this system, the poorest sector of society has its services subsidized by those living in richer sectors; those living in "higher" strata pay more for the same services received by those living in lower strata [33]. With regard to land use, a local resolution organizes territory into areas and provides guidelines on the needs and projects to be carried out.

Data.
For forecasting water demand in Bogotá, we rely on the number of users per type of use and stratum per month ($U$) as well as the value billed for monthly consumption per use ($B$). These variables were selected after we identified direct correlations (0.86 and 0.92, p value < 0.05) and a statistically significant effect on monthly water consumption ($W$). Climate and greenery coverage were considered negligible given the weather conditions (average annual temperature 13.1 °C, maximum annual temperature 19.3 °C, minimum annual temperature 7.84 °C, and average monthly precipitation 66.32 mm). Price was not a factor for users, especially R1, R2, and R3, due to the nature of the subsidies. We obtained values through the Single Information System website (www.sui.gov.co/SUIAuth/logon.jsp). The relationship between these factors and demand can be expressed by
$$W_i = f(U_i, B_i), \tag{1}$$
where $i$ runs over the land uses and the strata for residential uses. The water demand estimate was based on records from January 2004 to December 2014. Ghiassi et al. [1] proposed an 80/20 ratio (80% training data to 20% test data) for optimal model calibration. Therefore, training data were taken from January 2004 to December 2012 (108 data points) and test data from January 2013 to December 2014 (24 data points). In Figure 2, we see the behavior of water consumption, total number of users, and total billed from 2004 to 2014.
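The chronological split described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the consumption values are placeholders rather than the SUI records used in the study.

```python
from datetime import date

def monthly_index(start_year, end_year):
    """Return the first day of every month from start_year to end_year, inclusive."""
    return [date(y, m, 1) for y in range(start_year, end_year + 1) for m in range(1, 13)]

def chronological_split(months, values, first_test_month):
    """Split a monthly series so that every test month follows every training month."""
    train = [(m, v) for m, v in zip(months, values) if m < first_test_month]
    test = [(m, v) for m, v in zip(months, values) if m >= first_test_month]
    return train, test

months = monthly_index(2004, 2014)
# Placeholder values only; the real series would come from the SUI records.
consumption = [0.0] * len(months)

train, test = chronological_split(months, consumption, date(2013, 1, 1))
print(len(train), len(test))  # 108 24
```

Splitting by date rather than at random keeps the test period strictly after the training period, which matches how a forecast would be used in practice.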
For this paper, only R (R1, R2, R3, R4, R5, and R6), Ind., and Com. land uses were evaluated due to the lack of information for special and official land uses. To note, multiusers are included within the R strata.
R use demands more water than other uses, which is likely attributable to the number of residential users. The strata with the most users are 2, 3, and 4, mainly owing to two causes: Bogotá's growth, which has concentrated the highest number of users in these strata (particularly 2 and 3), and the nature of the subsidies. That is, stratum 2 is subsidized and located in urban areas with good infrastructure, and strata 3 and 4, while not subsidized, do not pay higher rates to subsidize strata 1 and 2 and have good infrastructure. Together, these causes underscore the desirability of these areas for residents. The Wilcoxon test showed that strata 5 and 6, stratum 4 and Com. use, and stratum 1 and Ind. use presented similar consumption despite their differences in terms of number of users. For price (amount paid/total billed) by consumers, only strata 5 and 6 displayed significant relationships (see Figure 3).

LS-SVM Estimator.
SVM is a classification and regression method with origins in statistical learning theory [34]. When it is applied to a regression, SVM can be summarized as follows: given a set of data (represented in an $n$-dimensional space) that cannot be fitted linearly, the data are mapped onto a space with a larger dimension in order to obtain a linear regression [35].
The LS-SVM model was first proposed by Suykens and Vandewalle, and it differs from the SVM in how the hyperplane is found: SVM employs the principle of structural risk minimization, whereas LS-SVM solves a linear system in the dual space under a least-squares cost function. The LS-SVM training process involves the selection of kernel parameters and the regularization of the cost function using cross-validation or Bayesian techniques [36,37].
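The LS-SVM structure can be understood as follows: if there is a set of data $\{x_k, y_k\}_{k=1}^{N}$, then the following nonlinear function is estimated:
$$y(x) = w^{T}\varphi(x) + b, \tag{2}$$
where $x \in \mathbb{R}^{n}$ represents system inputs, $y \in \mathbb{R}$ outputs, and $\varphi(\cdot): \mathbb{R}^{n} \to \mathbb{R}^{n_h}$ the mapping to a higher-dimensional (possibly infinite-dimensional) feature space. As previously mentioned, the traditional SVM solves the problem of finding the best hyperplane using the principle of structural risk minimization, and the LS-SVM optimizes the problem with the following equation:
$$\min_{w,b,e} J(w, e) = \frac{1}{2} w^{T} w + \frac{\gamma}{2} \sum_{k=1}^{N} e_k^{2} \quad \text{subject to} \quad y_k = w^{T}\varphi(x_k) + b + e_k, \quad k = 1, \ldots, N. \tag{3}$$
By applying Lagrange multipliers to (3), the following equation is obtained:
$$L(w, b, e; \alpha) = J(w, e) - \sum_{k=1}^{N} \alpha_k \left( w^{T}\varphi(x_k) + b + e_k - y_k \right), \tag{4}$$
where $\gamma$ is the regularization parameter by which the complexity of the model is balanced and $\alpha_k$ are the Lagrange multipliers, with optimality conditions obtained from the following set of equations:
$$\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{k=1}^{N} \alpha_k \varphi(x_k), \quad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{k=1}^{N} \alpha_k = 0, \quad \frac{\partial L}{\partial e_k} = 0 \Rightarrow \alpha_k = \gamma e_k, \quad \frac{\partial L}{\partial \alpha_k} = 0 \Rightarrow w^{T}\varphi(x_k) + b + e_k - y_k = 0. \tag{5}$$
By eliminating $w$ and $e_k$, the following linear system is obtained:
$$\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}, \tag{6}$$
with $\mathbf{1} = [1, 1, \ldots, 1]^{T}$, $y = [y_1, \ldots, y_N]^{T}$, $\alpha = [\alpha_1, \ldots, \alpha_N]^{T}$, $I$ the identity matrix, and $\Omega_{kl} = \varphi(x_k)^{T}\varphi(x_l)$ for $k, l = 1, \ldots, N$.
The model for regressions is
$$y(x) = \sum_{k=1}^{N} \alpha_k K(x, x_k) + b, \tag{7}$$
where $\alpha$ and $b$ are the solution to (6) and $K(x, x_k)$ is defined as the kernel function, which is the value of the inner product of two vectors $x_k$ and $x_l$ in the feature space of $\varphi(x_k)$ and $\varphi(x_l)$, that is, $K(x_k, x_l) = \varphi(x_k)^{T}\varphi(x_l)$. For this study, the radial basis function (RBF) was selected as the kernel function because it offers good performance under general assumptions of smoothness; for that reason, this function is especially useful when no additional knowledge about the data is available. The RBF is expressed as follows [38]:
$$K(x, x_k) = \exp\left( -\frac{\lVert x - x_k \rVert^{2}}{2\sigma^{2}} \right), \tag{8}$$
where $\sigma^{2}$ is the bandwidth of the RBF kernel. Figure 4 shows the structure of the LS-SVM, where $x_1, x_2, \ldots, x_n$ are the input data and $y(x)$ the output.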
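To make the training step concrete, the following is a minimal pure-Python sketch of LS-SVM regression: it builds the bordered kernel system, solves it for the bias and the Lagrange multipliers, and evaluates the kernel expansion. It is an illustration under stated assumptions, not the implementation used in the study. The kernel is assumed to take the form exp(-||a - b||^2 / (2 * sigma2)), the function names are hypothetical, the toy data and the values of gamma and sigma2 are arbitrary, and a real application would scale the inputs (number of users, total billed) and tune both hyperparameters by cross-validation.

```python
import math

def rbf_kernel(a, b, sigma2):
    """Assumed RBF form: K(a, b) = exp(-||a - b||^2 / (2 * sigma2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma2))

def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def lssvm_fit(X, y, gamma, sigma2):
    """Build and solve the (N+1) x (N+1) LS-SVM system; return (b, alpha)."""
    n = len(X)
    A = [[0.0] + [1.0] * n]  # first row: [0, 1, ..., 1]
    for k in range(n):
        row = [1.0] + [rbf_kernel(X[k], X[l], sigma2) for l in range(n)]
        row[k + 1] += 1.0 / gamma  # adds I / gamma to the kernel matrix
        A.append(row)
    solution = solve_linear_system(A, [0.0] + list(y))
    return solution[0], solution[1:]

def lssvm_predict(x, X, b, alpha, sigma2):
    """Evaluate the kernel expansion sum_k alpha_k K(x, x_k) + b."""
    return sum(a * rbf_kernel(x, xk, sigma2) for a, xk in zip(alpha, X)) + b

# Toy one-dimensional data (not from the study).
X = [[0.0], [0.25], [0.5], [0.75], [1.0]]
y = [1.0, 1.5, 2.0, 2.5, 3.0]
gamma, sigma2 = 100.0, 0.5

b, alpha = lssvm_fit(X, y, gamma, sigma2)
fitted = [lssvm_predict(x, X, b, alpha, sigma2) for x in X]
print(b, fitted)
```

Because training reduces to one linear solve, the multipliers sum to zero and each training residual equals alpha_k / gamma, which gives two quick sanity checks on any implementation.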

Evaluation Model.
To verify the accuracy of this model, a comparison was made with an artificial Feedforward Neural Network Backpropagation (FNN-BP) model with a learning rate of 0.01; the number of hidden layers was varied from 1 to 10 until the lowest mean square error (MSE) was found. For this process, the same number of data points was used for calibration and evaluation.
After obtaining the demand results, three different statistical metrics were used: (i) RMSE; (ii) the absolute average of relative errors (AARE); and (iii) the coefficient of determination ($R^2$). Mohamed and Al-Mualla used these three metrics for monthly forecasting [39,40].
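The three metrics can be computed as in the sketch below. The text does not spell out the formulas, so common definitions are assumed here: RMSE in the units of the data, AARE as the mean absolute relative error in percent, and the coefficient of determination as one minus the ratio of residual to total sum of squares. The sample values are invented for illustration, and the percentages reported in the paper may rely on a normalization not shown here.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error, in the same units as the data."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def aare(observed, predicted):
    """Absolute average of relative errors, in percent (assumed definition)."""
    n = len(observed)
    return 100.0 * sum(abs(o - p) / abs(o) for o, p in zip(observed, predicted)) / n

def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Invented monthly values, for illustration only.
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(rmse(observed, predicted), aare(observed, predicted), r_squared(observed, predicted))
```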
Finally, to evaluate the significance of the factors (number of users and total billed) for each stratum with regard to validation, a Kruskal-Wallis test was performed.
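For readers unfamiliar with the test, the sketch below computes the Kruskal-Wallis H statistic with average ranks and the usual tie correction. How the groups were formed in this study is not described, so the three groups here are hypothetical, as are the function names.

```python
def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic, corrected for ties."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1.0 - sum(t ** 3 - t for t in counts.values()) / float(n ** 3 - n)
    if correction == 0.0:
        raise ValueError("all observations are identical")
    return h / correction

# Hypothetical groups, for illustration only.
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h = kruskal_wallis_h(groups)
print(h)
```

With three groups, H is compared against a chi-square distribution with 2 degrees of freedom, whose 5% critical value is about 5.99; for samples this small the chi-square approximation is only rough.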

Results
Table 1 shows the performance of both LS-SVM and FNN-BP models in terms of predicting water demand for all land uses according to the aforementioned statistical metrics employed in Mohamed and Al-Mualla [39,40].
In the LS-SVM model, most results for RMSE values were under 2% (except for stratum 6); more information will be provided below. This indicates that the distribution of errors between the model and the reported data was very low. As for the model's efficiency (measured as accurate prediction of the reported data), AARE values were under 8% for the LS-SVM model (except for strata 4 and 6), which reflects its forecasting ability. Finally, the coefficient of determination showed values over 0.9 (except for stratum 6), which means that there is a strong relationship between the reported and calculated data, another sign of the model's utility.
The FNN-BP model, despite its relatively inferior performance, which is reflected in higher RMSE and AARE and lower $R^2$ values, predicted the behavior of the demand for each type of use. Nonetheless, the forecast for strata 1, 3, 4, 5, and 6 had AARE greater than 8%, and $R^2$ was low in strata 3 and 6.
The LS-SVM performance is evidence that this model is suitable for predicting water consumption at long and short scales, as presented by Chen and Zhang [20] at an hourly scale and Hwang et al. [36] at a daily scale.
See Figure 5 for the differences between the reported data and the calculated data in terms of use.
For stratum 1, the model presented greater underestimates because the economic value of the service is not an important factor in consumption. Instead, the number of users is significant (p value < 0.05). Therefore, this consumption depends on increases in the number of users and the city's economic growth.
As for strata 2, 3, and 5, both factors affected the result; however, forecasting for stratum 5 presented the most variability of these three strata. In line with these findings, the number of users represented a greater challenge for the FNN-BP, presenting the highest RMSE (12.58%) and AARE (15.85%) values. For LS-SVM, this stratum (R5) also produced higher values than the other two strata (R2 and R3), with RMSE and AARE values of 7.65% and 7.8%, respectively.
For strata 4 and 6, the Kruskal-Wallis test showed that the most important factor turned out to be the number of users; for this factor, the LS-SVM had the lowest efficiencies, mostly in stratum 6, with values of RMSE (40%), AARE (23%), and $R^2$ (0.8). This is mainly because of underestimation of values obtained in the months of January 2013 and January 2014. During these months, sudden changes in consumption were observed: December showed the highest consumption ($1.6 \times 10^{6}$ m$^3$), whereas January showed the lowest ($0.2 \times 10^{6}$ m$^3$), demonstrating that the model may be sensitive to substantial changes when consumption in other months does not represent this trend (see Figure 6). Regardless, all other months displayed good predictions, as seen in Figure 6.
For Ind. and Com. use, factors that may influence water consumption were not taken into account, such as economics, production, or seasons. Water consumption for these uses is controlled by governmental entities, which means only two factors (number of users and total billed) were considered. The Kruskal-Wallis test showed that both factors were relevant. For Ind. use, the model performed well, with differences under 10%. As for Com. use, lower RMSE and AARE values were observed, clearly identifying the pattern of demand for this use. It is important to note that Com. use tended to exhibit less consumption between the months of December and January, though this dip was not as significant in magnitude for Com. use as it was for stratum 6 (see Figure 7).

Conclusions
This article presents LS-SVM as a method for forecasting water demand and explores the effectiveness of this model for long time scales.
The LS-SVM model proves superior to the FNN-BP model in terms of accurately calculating water demand, as evidenced by RMSE mostly under 1% and coefficient of determination over 0.9, except for stratum 6. This exception is primarily attributable to overestimated values relative to the real demands for the months of January 2013 and January 2014.
As for water consumption, residential use demands more water than other uses due to the number of users. Strata 2, 3, and 4 have the most users. The Wilcoxon test shows that strata 5 and 6, stratum 4 and commercial use, and stratum 1 and industrial use present similar consumption despite their differences in terms of number of users. For price (amount paid) by consumers, only strata 5 and 6 display significant relationships.
It is important to note that Bogotá's demand is best understood in groups. The first group is affected by both the price and the number of users, which applies to strata 2, 3, and 5 as well as Ind. and Com. use. The second group is affected only by the number of users, as in the case of strata 1, 4, and 6. Therefore, the number of users is the most important factor in water consumption, essentially dictating water consumption behavior in Bogotá.
The LS-SVM proves least accurate for strata 4 and 6, indicating that the model relies heavily on the number of factors influencing water demand. Thus, including more factors should result in improved forecasting.
Finally, forecasting the behavior of water demand is clearly shown to be an effective tool for city planning and management, for it helps identify the need for administrative decisions in order to regulate the consumption of different strata and uses.

Figure 3: Boxplot of water consumption, number of users, and billed consumption for Bogotá's land.

Figure 5: Differences (%) between values predicted by LS-SVM and reported data for user and strata.

Figure 6: Reported and forecasted water consumption for residential use (LS-SVM and FNN-BP models).

Figure 7: Reported and forecasted data for commercial and industrial use (LS-SVM and FNN-BP models).

Table 1: Statistical performance of LS-SVM and FNN-BP models.