Correlation Analysis of Water Demand and Predictive Variables for Short-Term Forecasting Models

Operational and economic aspects of water distribution make water demand forecasting paramount for the management of water distribution systems (WDSs). However, water demand introduces high levels of uncertainty in WDS hydraulic models. As a result, there is growing interest in developing accurate methodologies for water demand forecasting. Several mathematical models can serve this purpose. One crucial aspect is the use of suitable predictive variables. The most commonly used predictive variables involve weather and social aspects. To improve knowledge of the interrelation between water demand and various predictive variables, this study applies three algorithms, namely, classical Principal Component Analysis (PCA) and two powerful machine learning algorithms, Self-Organizing Maps (SOMs) and Random Forest (RF). We show that the latter two algorithms help corroborate the results found by PCA, while also unveiling features hidden to PCA, owing to their ability to cope with nonlinearities. This paper presents a correlation study of three district metered areas (DMAs) from Franca, a Brazilian city, exploring weather and social variables to improve the knowledge of residential demand for water. For the three DMAs, temperature, relative humidity, and hour of the day appear to be the most important predictive variables to build an accurate regression model.


Introduction
The main objective of water distribution systems (WDSs) is to supply water to consumers with adequate quantity and quality. For water utilities, using accurate water demand estimation has the advantage of allowing better operation and management of their systems. Among the benefits associated with suitable water demand forecasting, leakage identification, optimal operation of pumps and valves, and the possibility of improving planning and design of network expansions must be highlighted. These engineering aspects represent a key step forward in the improvement of WDS operation efficiency, which ultimately will lead to the provision of quality water supply [1].
Water demand forecasting models can be roughly divided into long- and short-term models, whose approaches depend on the time horizon used for scheduling further predictions.
Long-term water demand forecasting is useful to define rehabilitation and expansion strategies and to evaluate water source capacity [2]. In turn, short-term water demand forecasting models can help to suitably define the operation and management of water systems, with the aim of supplying water to customers with maximum efficiency [1,3]. As a result, when working with water demand predictive models, it is central to suitably establish the set of variables involved in the model. Donkor et al. [4] present a review of several studies of water demand forecasting including various time horizons. That review is sensitive to the differing nature of the predictive variables used. For long-term forecasting models it is proposed to use population density, size of the buildings, water price, and weather variables such as air temperature and relative humidity [5][6][7]. However, for short-term water demand approaches, several studies in the literature [8][9][10] include previous demand data, weather variables (such as rain or wind speed), and calendar variables, such as weekday, hour of the day, and presence of holidays.
Autoregressive integrated moving average (ARIMA)-based models [11] have traditionally been considered for understanding and modeling urban water demand [12]. With the improvements brought by data mining tools and machine learning techniques, a number of data analysis models have been considered more recently. For instance, several authors [13][14][15] have applied artificial neural network (ANN) architectures to both long- and short-term demand forecasting. The use of other machine learning tools has also increased in recent years. As an example, [9] performed a comprehensive comparison of various predictive methods for hourly water demand forecasting, suggesting support vector regression (SVR) as one of the models that achieve the best results. However, off-line predictive models are likely to develop growing bias if they are not updated as new data arrive. Models can also become rapidly obsolete when abrupt changes occur in the forecasting framework. These are the models known as intervened [16], and they are a consequence of unexpected changes in the scenario in which the demand is computed. For example, opening and closing valves, extreme variation of weather conditions, appearance of new leaks, and celebration of social events, among others, may change the end-user response regarding water demand.
Despite short-term water demand forecasting models being crucial to improve water system operation and management, there is a lack of recent studies focused on correlation analyses between water demand and the usual (weather, calendar, and hydraulic) predictive variables. This is the main objective of the present paper. In-depth knowledge of those correlations will greatly improve those crucial aspects for the water supply industry.
On mathematical grounds, a general objective for regression methods is to map the input data into a convenient output space to optimally approach predictions through a set of independent variables. As this set might be formed by a large group of inputs, it is often suitable to count on ways of synthesizing the input space with minimum loss of information [17,18]. For water demand forecasting problems, three main groups of variables make up the input space: weather, social, and economic variables.
Despite several studies proposing various methodologies to forecast water demand, few investigations present in-depth correlation analyses able to give deeper insight into the causality principle of water demand. Among these studies, weather variables are explored as one of the main components that influence water demand. In this case, the impact of air temperature, relative humidity, and amount of rain may be highlighted [19][20][21]. Depending on the forecasting horizon, the water price and the consumer size (residential houses, businesses, industries, etc.) are used to estimate water demand as well [22]. Furthermore, large-size WDSs are usually divided into DMAs, and the correlation among these DMAs is not usually exploited in the models found in the literature. In this paper, we claim that this aspect can be used to refine water demand forecasting.
In this regard, Coomes et al. [23] reinforce pioneering studies of weather influences on water demand [20] and show the effects of weather variables. Time-series analysis, specifically autoregressive models, can be highlighted as a classical approach for short-term water demand forecasting [19,24,25]. However, subsequent developments in machine learning theory tackle the main flaw of autoregressive models (being constrained to model only linear relationships) by considering nonlinear modeling of water demand. These approaches result in substantial advances for predictive models since, regarding water demand, seasonality, dynamic-featuring, and state-dependent models cannot be built by considering linear relationships alone [16]. In this line, neural networks and various statistical learning methods have been widely applied to estimate the future demand with the advantage of using nonlinear regression [10,26-29].
This work presents three methodologies to analyze water demand in close connection with various predictive variables. Firstly, the classical algorithm of correlation evaluation, Principal Component Analysis (PCA), is considered. Then, Self-Organizing Maps (SOMs) and Random Forest (RF) algorithms are also proposed to evaluate data correlations because of their ability to treat nonlinear relationships among the variables, in contrast to PCA. The three approaches are then used to deal with real world data corresponding to three district metered areas (DMAs) of Franca, a Brazilian city, in an attempt to verify potential existing correlations among water demand and some social and weather variables.
The rest of the paper is organized as follows. The following section provides the methodological aspects. The methodology is then applied to the case study, which is first suitably described; this section also includes the main results of the investigation. Finally, Conclusions and References close the paper.

Correlation Analysis Algorithms
Correlation analysis is important to better understand various interdependences among the variables in a problem. Correlation also allows the construction of mathematical models of many phenomena. By increasing the knowledge of the set of variables that describe a given problem, the capacity to forecast also increases. This opens the possibility of performing better action policies of any related operation linked to the phenomenon under study.
The techniques and algorithms to evaluate the correlation degree among a set of predictive variables and the variable(s) they try to explain are varied. These techniques and algorithms, according to their mathematical nature, can be applied to a wide variety of situations. Among these algorithms, PCA is a classical approach using linear transformations to find data correlation. However, with the increase of data mining techniques, correlation assessment of the analyzed data is frequently not necessary, since neural networks and other machine learning methods are able to treat the data without previous assumptions as those involved in PCA. Among some other significant and powerful machine learning techniques, the SOMs and the RF algorithms can be highlighted because of their respective ability to process large databases. Next, we concisely present these algorithms and provide the necessary elements for their application in the problem we investigate in this paper.

Principal Component Analysis (PCA).
PCA aims to reduce the dimensionality of a dataset and to identify the degree of superposition of the variables. In general terms, PCA applies an orthogonal linear transformation to the input space that brings out correlations between variables.
As the covariance matrix is real and symmetric, spectral theory guarantees real eigenvalues and orthonormal eigenvectors for this matrix. From the set of sorted eigenvalues, $\lambda_1, \lambda_2, \dots, \lambda_n$, and their associated orthonormal eigenvectors $\mathbf{e}_i = [e_{i1}, \dots, e_{in}]$, new variables, named principal components, $\mathrm{PC}_i$, can be written:

$$\mathrm{PC}_i = \mathbf{e}_i^{T}\mathbf{x} = e_{i1}x_1 + e_{i2}x_2 + \dots + e_{in}x_n .$$

The aim behind PCA is to create a component order explaining the variance of the dataset. This is based on the values of the corresponding eigenvalues. Once the eigenvalues are ordered, the principal component $\mathrm{PC}_i$ explains the variance of the dataset proportionally to the loading $\lambda_i / \sum_{j} \lambda_j$. When the variables are represented on a two-dimensional plane, the similarity between the vector directions graphically points to the similarity between variables.
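The eigen-decomposition described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the function name `pca`, the standardization step, and the synthetic demand, temperature, and humidity series are assumptions made for the example and are not taken from the case study data.

```python
import numpy as np


def pca(X):
    """Return eigenvalues, eigenvectors, explained-variance ratios and scores."""
    # Standardize so variables with different units are comparable (assumption).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Covariance matrix of the standardized data (real and symmetric).
    C = np.cov(Z, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]          # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()        # lambda_i / sum_j lambda_j
    scores = Z @ eigvecs                       # PC_i = e_i^T x for every sample
    return eigvals, eigvecs, explained, scores


# Synthetic illustration only: demand follows temperature,
# while humidity moves opposite to temperature.
rng = np.random.default_rng(0)
temperature = rng.normal(25.0, 5.0, 500)
humidity = 90.0 - 1.5 * temperature + rng.normal(0.0, 3.0, 500)
demand = 10.0 + 0.8 * temperature + rng.normal(0.0, 2.0, 500)
X = np.column_stack([demand, temperature, humidity])

eigvals, eigvecs, explained, scores = pca(X)
print("explained variance ratio:", np.round(explained, 3))
print("PC1 loadings (demand, temperature, humidity):", np.round(eigvecs[:, 0], 3))
```

In this synthetic setting the first component should carry demand and temperature with the same sign and humidity with the opposite sign, which is the kind of pattern a loading plot makes visible.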

Self-Organizing Maps (SOMs).
SOMs are unsupervised learning methods inspired by the behavior of the brain when excited by external signals [30]. Motivated by this idea, SOMs are processing tools able to find patterns in a dataset. A widely used application lies in visualizing high-dimensional datasets in reduced dimensions (usually 2D) while maintaining the topological correlations. This makes SOMs highly useful for correlation analyses among several variables.
A SOM represents the input space by a mesh of points, the so-called neurons. A neuron is represented by a vector with $n$ components, known as synaptic weights. The synaptic weights change at each iteration to get the best rendering of the input space. The process of adjusting the mesh is the training stage, and it is driven by competitive learning.
The process starts by creating a mesh of neurons, where each neuron is described by its synaptic weight vector, $\mathbf{w}_j = [w_{j1}, \dots, w_{jn}]$. Each step of the learning process is made of three different stages: competition, cooperation, and synaptic update. The competition stage is responsible for identifying the region most activated by an input $\mathbf{x}$. This region is defined as a neighborhood of the most reactive neuron, the so-called winning neuron. The winning neuron is identified by the similarity between the input and the neuron itself. This measure is usually computed by the minimal Euclidean distance. So, the winning neuron, $i(\mathbf{x})$, can be written as

$$i(\mathbf{x}) = \arg\min_{j} \lVert \mathbf{x} - \mathbf{w}_j \rVert .$$

Once the winning neuron is identified, the competition stage stops and the cooperation stage begins. In this stage, the activated region is defined and, in particular, how much these neurons are activated is also decided. The cooperation stage determines the influence of the winning neuron in the neighborhood. Finally, in the update stage the synaptic weights of the activated region are modified. Following the biological inspiration, the activation decays according to the distance to the winning neuron. The activation power can be written as a monotonic decay, for example, as a Gaussian function,

$$h_{j,i(\mathbf{x})}(t) = \exp\left(-\frac{d_{j,i}^{2}}{2\sigma^{2}(t)}\right),$$

where $h_{j,i(\mathbf{x})}$ is the neighborhood topology function, centered in the winning neuron $i(\mathbf{x})$, containing a set of neurons excited by the winner, and $\sigma(t)$ is the size of the neighborhood in iteration $t$ and is defined by an exponential decay function at each time step. The distance $d_{j,i}$ can be written as the Euclidean norm of the difference between two vectors:

$$d_{j,i} = \lVert \mathbf{r}_j - \mathbf{r}_i \rVert ,$$

where $\mathbf{r}_j$ is the position of an excited neuron and $\mathbf{r}_i$ is the position of the winning neuron. The winning neuron determines a region or neighborhood of influence. The closer a neuron is to the winner, the larger the change of position of this neuron is.
The neighborhood defined before is shrunk through a number of iterations attempting to achieve several objectives: improving the process stability, leading the map to the final arrangement of neurons, and making the model better mimic the brain behavior. This reduction process has the disadvantage of reducing the winning neuron power. According to [31], a usual representation of this decay is written as

$$\sigma(t) = \sigma_0 \exp\left(-\frac{t}{\tau}\right),$$

where $\sigma_0$ is the size of the initial neighborhood, $t$ is the current iteration, and $\tau$ is a time constant, usually defined by a correlation between the maximum number of iterations, $t_{\max}$, and the initial topology size.
The synaptic weights are updated after the neighborhood activation is defined. Each weight (neural position in the topological space of the data) is then updated according to the corresponding increment, $\Delta\mathbf{w}_j$, defined as

$$\Delta\mathbf{w}_j = \eta_0\, h_{j,i(\mathbf{x})}(t)\, (\mathbf{x} - \mathbf{w}_j),$$

where $\eta_0$ is the initial learning rate. Finally, the new neural position is written as

$$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) + \Delta\mathbf{w}_j .$$

The learning process finishes when the mesh updates are less than a predefined threshold value or when the maximum number of iterations is reached. At this stage, it is expected that this mesh is a good two-dimensional representation of the input space, while preserving the topological relationships of data. Each variable can be represented by its neuron position allowing, by comparison of maps, qualitative inference over the correlation of variables.
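The three stages of the learning process (competition, cooperation, and synaptic update) can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not the implementation used in this work: the grid size, the constant learning rate `eta0`, the choice `tau = t_max / log(sigma0)`, the stopping rule (maximum number of iterations only), and the synthetic two-cluster data are all illustrative.

```python
import math
import random


def train_som(data, rows=4, cols=4, t_max=500, eta0=0.5, seed=0):
    """Train a small rectangular SOM; returns {grid position: weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Random initial synaptic weights in [0, 1) (inputs assumed scaled to [0, 1]).
    weights = {pos: [rng.random() for _ in range(dim)] for pos in grid}
    sigma0 = max(rows, cols) / 2.0           # initial neighborhood size (> 1 assumed)
    tau = t_max / math.log(sigma0)           # time constant (one possible choice)

    for t in range(t_max):
        x = rng.choice(data)
        # Competition: the winner minimizes the Euclidean distance to x.
        winner = min(grid, key=lambda pos: math.dist(x, weights[pos]))
        # Cooperation: Gaussian neighborhood whose size decays exponentially.
        sigma = sigma0 * math.exp(-t / tau)
        for pos in grid:
            d = math.dist(pos, winner)       # distance measured on the map lattice
            h = math.exp(-d * d / (2.0 * sigma * sigma))
            # Synaptic update: w_j <- w_j + eta0 * h * (x - w_j)
            weights[pos] = [w + eta0 * h * (xi - w)
                            for w, xi in zip(weights[pos], x)]
    return weights


def quantization_error(data, weights):
    """Mean distance between each sample and its winning neuron."""
    return sum(min(math.dist(x, w) for w in weights.values())
               for x in data) / len(data)


# Synthetic two-cluster data in the unit square (illustration only).
rng = random.Random(1)
data = ([[rng.gauss(0.2, 0.03), rng.gauss(0.2, 0.03)] for _ in range(100)]
        + [[rng.gauss(0.8, 0.03), rng.gauss(0.8, 0.03)] for _ in range(100)])
som = train_som(data)
print("quantization error:", round(quantization_error(data, som), 3))
```

The quantization error is only a convenience check that the mesh has moved towards the data; the qualitative map comparison discussed in this paper would instead inspect the distances between neighboring neurons for each variable.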

Random Forest (RF).
Before introducing Random Forest models, it is necessary to introduce a decision tree (DT) for regression. A DT [32] is based on a recursive partition over the range of the input space (also called instance space). It can be used for either classification or regression depending on whether the response variable is discrete or continuous, respectively. The decision tree consists of a set of nodes containing the status of the dataset partition and edges connecting them in a way in which they form a hierarchical sequence of logical rules.
A DT starts at a special node called the "root", which has no incoming edges. Following a previously defined sequence of logical rules, the instance space is iteratively broken down from the root node into smaller instance subspaces. This is done by drawing outgoing edges from the root node to other nodes and from these nodes to further ones. In tree building, all nodes except the root have exactly one incoming edge. In the case of a DT for regression analysis, the Sum of Squared Error (SSE) is used to define each split of the tree. The SSE computes the error between the predicted value, taking as predictor the mean per node (instance subspace), and the observed values. Comparing all these errors allows choosing the candidate node to be split at each iteration as the one having the lowest SSE. A stop criterion (tree depth or a certain SSE threshold) provides the final partition. Eventually, a subspace partition is produced and its predictors are in special nodes called leaves or terminal nodes. These are characterized by having incoming but no outgoing edges.
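The role of the SSE in defining a split can be illustrated with a single predictor. The sketch below, in plain Python, scans the candidate thresholds of one variable and keeps the one whose two child nodes have the lowest combined SSE; this follows the common CART-style search and is an assumption about the details, as are the function names and the hypothetical hourly demand values.

```python
def sse(values):
    """Sum of squared errors around the mean (the node's predictor)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total SSE) of the best binary split of y on x."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))                    # no split: SSE of the parent node
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                         # cannot split between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical demand values: low at night, high during the day.
hour = [1, 3, 5, 7, 9, 11, 13, 15]
demand = [10.0, 11.0, 10.5, 30.0, 31.0, 29.5, 30.5, 30.0]
threshold, total = best_split(hour, demand)
print(threshold, round(total, 3))  # 6.0 1.8
```

A full regression tree would apply this search recursively to each resulting subspace until the stop criterion is met.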
A Random Forest (RF) is an ensemble of tree-based models. RF algorithms can be used for classification when the base models are classification trees, or for regression when the base models are regression trees. The algorithm is based on a bootstrap aggregation (or bagging) of tree models [33].
Given the response, $Y = (y_1, \dots, y_n)$, from the corresponding training set, $X = (x_1, \dots, x_n)$, a bagging tree is constructed by selecting $B$ samples (sampling with replacement) from $(X, Y)$ and training a DT for each sample. Finally, in the case of regression, the bagging tree is computed by averaging all the resulting single trees. RFs use a variation of the bagging tree method by forcing each split to consider only a subset of the predictors (see Algorithm 1). This makes RFs computationally efficient compared to bagging trees for large datasets. As a general rule, for a $p$-dimensional problem a subset of $\sqrt{p}$ variables is selected to build single regression trees. These trees are combined in a further ensemble to improve sample tree variability. Other benefits of tree ensembles in RF are to avoid sources of bias in model outcomes and to help reduce overfitting. RFs have proven to be outstanding predictive models in regression (and classification) tasks.
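The bagging idea with a random subset of predictors can be sketched as follows. To keep the example short, each tree is reduced to a single split (a regression stump), so the predictor subset drawn per tree is also the subset considered at its only split; a real RF grows deeper trees and redraws the subset at every split. All function names and the synthetic data are assumptions made for illustration.

```python
import math
import random


def fit_stump(X, y, features):
    """Depth-1 regression tree: best SSE split among the given feature indices."""
    def sse(v):
        m = sum(v) / len(v)
        return sum((a - m) ** 2 for a in v)

    mean_y = sum(y) / len(y)
    best = (sse(y), None, None, mean_y, mean_y)   # (sse, feature, thr, left, right)
    for f in features:
        # Every unique value except the smallest is a valid threshold.
        for thr in sorted({row[f] for row in X})[1:]:
            left = [t for row, t in zip(X, y) if row[f] < thr]
            right = [t for row, t in zip(X, y) if row[f] >= thr]
            total = sse(left) + sse(right)
            if total < best[0]:
                best = (total, f, thr, sum(left) / len(left), sum(right) / len(right))
    return best[1:]


def predict_stump(stump, row):
    f, thr, left, right = stump
    if f is None:                                 # no useful split was found
        return left
    return left if row[f] < thr else right


def fit_forest(X, y, n_trees=30, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = max(1, round(math.sqrt(p)))               # sqrt(p) predictors per split
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]    # bootstrap sample
        features = rng.sample(range(p), m)            # random predictor subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest


def predict_forest(forest, row):
    """Regression ensemble: average of the single-tree predictions."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


# Synthetic data: columns are hour, temperature, and two noise variables.
rng = random.Random(1)
X = [[rng.randrange(24), rng.uniform(15, 35), rng.random(), rng.random()]
     for _ in range(150)]
y = [30.0 if 6 <= row[0] < 22 else 10.0 for row in X]
forest = fit_forest(X, y)
day = predict_forest(forest, [12, 25.0, 0.5, 0.5])
night = predict_forest(forest, [2, 25.0, 0.5, 0.5])
print(round(day, 2), round(night, 2))
```

Because only some of the stumps see the hour variable, the averaged prediction is pulled towards the overall mean; deeper trees would separate the daytime and night-time levels more sharply.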

Case Study
This work analyzes water demand records along with weather data of a Brazilian city. This case study corresponds to the WDS of Franca, a city with 318,000 inhabitants, one of the most important cities in São Paulo State (Brazil). Franca's WDS is divided into DMAs (see Figure 1). This work uses water demand data of three of its DMAs.
The three studied DMAs are typical residential areas in Brazil, encompassing family customers and small businesses. In Figure 1, Tks are storage tanks and the hatched DMAs are used in this study. Table 1 presents the number of connexions and the mean demand for each DMA.
The available demand data were measured every 20 minutes for the various DMAs. The weather data were obtained from the meteorological station of the University of Franca (Unifran), and the weather database comprises air temperature (°C), relative humidity (%), wind speed (m/s), wind direction (°), dew point temperature (°C), and atmospheric pressure (hPa). All these measurements were taken on an hourly basis. To match the measurement frequency of water demand, weather data are linearly interpolated. Table 2 presents a brief statistical description of the weather database.
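The alignment of hourly weather readings with the 20-minute demand records can be sketched as below. The function name and the three sample temperatures are hypothetical; the sketch assumes regularly spaced hourly readings with no gaps.

```python
def interpolate_hourly(values, step_minutes=20):
    """Linearly interpolate hourly readings onto a finer regular time grid."""
    per_hour = 60 // step_minutes
    out = []
    for a, b in zip(values, values[1:]):
        for k in range(per_hour):
            out.append(a + (b - a) * k / per_hour)
    out.append(values[-1])                   # keep the final hourly reading
    return out


# Hypothetical hourly air temperatures in degrees Celsius.
hourly = [21.0, 24.0, 22.5]
result = interpolate_hourly(hourly)
print(result)  # [21.0, 22.0, 23.0, 24.0, 23.5, 23.0, 22.5]
```

Linear interpolation is a reasonable choice for slowly varying quantities such as temperature or pressure; for a circular variable such as wind direction, a direct linear interpolation across the 0°/360° boundary would need special handling.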
Social behavior is introduced in the model using calendar variables such as the hour of the day, the day of the week, the day of the month, and the month of the year.
Applying PCA to the data corresponding to each DMA studied in this work, the loading plots in Figures 2-4 are obtained. A high positive correlation between water demand and hour of the day can be observed. With the exception of one DMA, temperature also has a positive correlation with water demand. Relative humidity presents a negative correlation with water demand for all DMAs, as shown by the opposite direction of its loading vector.
A strong correlation among hour of the day, water demand, and temperature can be observed for the three DMAs. A negative correlation between water demand and relative humidity can also be observed. This mainly happens in Azevedo's DMA. Secondary correlations, such as wind direction with month of the year or dew point and rain with atmospheric pressure, may also be highlighted. These secondary correlations help validate the results of the other methodologies, since they can also be observed in the SOM and RF results.
The application of SOMs to identify correlations among variables can be useful, since SOMs synthesize the topological space of inputs by projecting it onto a two-dimensional space. This projection makes data distribution and clustering easier to visualize. For the DMAs previously analyzed by PCA, Figures 5-7 present the respective maps. The maps are based on the final distribution of neurons and the color of the maps represents the distance between neurons. Light colors represent short distances between neurons, while dark colors represent large distances. The fact that two inputs have a similar distribution of neurons, that is to say, a similar color distribution in the maps, helps identify qualitative correlations among data.
The distribution trends observed for temperature, hour of the day, and water demand are indicated by the reduction of the number of neurons in the positive diagonal of the maps for the Airport DMA. The SOM analysis corroborates the correlations among water demand, hour of the day, and temperature. Also, the correlation between humidity and the dew point can be observed. These variables are also correlated with water demand, even if this correlation is not as clear as the one with the hour of the day.
Some secondary correlations appear more clearly in the SOM analysis than in PCA. That is the case of humidity and dew point temperature. From the physical point of view, this correlation is meaningful and indicates that the SOM interpretation provides a sound correlation analysis. However, although the correlations obtained by SOMs make full physical sense, the lack of a quantitative analysis can limit the application of this kind of analysis.
To obtain deeper knowledge of the variables without prior assumptions about their relationships, the RF algorithm is applied. When used to evaluate the importance of each variable, the RF algorithm is run as many times as there are variables, removing one variable in turn and evaluating the resulting change in the quality of the regression. Using this evaluation, the RF analysis ranks the variables by priority. Figure 8 shows the scores of the variables for each DMA standardized in the interval [0, 1].
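The remove-one-variable-at-a-time procedure can be sketched independently of the regression model. In the sketch below an ordinary least-squares fit is swapped in for the RF purely to keep the example short and deterministic; it is not the model used in this work. NumPy is assumed to be available, and the function names and the synthetic data are illustrative.

```python
import numpy as np


def r2_least_squares(X, y):
    """In-sample R^2 of an ordinary least-squares fit (stand-in for the RF)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - (residual @ residual) / ((y - y.mean()) @ (y - y.mean()))


def drop_column_importance(X, y, score=r2_least_squares):
    """Score lost when each column is removed, rescaled to the interval [0, 1]."""
    full = score(X, y)
    loss = np.array([full - score(np.delete(X, j, axis=1), y)
                     for j in range(X.shape[1])])
    return (loss - loss.min()) / (loss.max() - loss.min())


# Synthetic data: demand depends strongly on temperature, weakly on humidity,
# and not at all on rain.
rng = np.random.default_rng(0)
n = 400
temperature = rng.normal(25.0, 5.0, n)
humidity = rng.normal(60.0, 10.0, n)
rain = rng.normal(2.0, 1.0, n)
demand = 2.0 * temperature - 0.3 * humidity + rng.normal(0.0, 1.0, n)
X = np.column_stack([temperature, humidity, rain])

importance = drop_column_importance(X, demand)
print(dict(zip(["temperature", "humidity", "rain"], np.round(importance, 3))))
```

Any regressor can be plugged in through the `score` argument, for example a function that fits a random forest and returns a held-out score, which would be closer to the procedure described above.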
The hour of the day appears as the most important variable for short-term water demand forecasting. The second most important variable for the Airport and Leporace DMAs is the temperature, while for the Azevedo DMA it is the month of the year. For this DMA, humidity is the most important weather variable. For the three DMAs, rain is the least important variable, corroborating the previous results obtained by SOM and PCA.
Although a quantitative analysis of the variables can be easily performed with RFs, the secondary correlations cannot be easily identified by the mathematical approach of RFs. That is to say, once the correlations have been determined by a regression process, the secondary correlations are disregarded.
SOMs and RFs appear as efficient alternatives to capture complex relationships when compared with PCA. Both provide a clear set of correlations between water demand, social variables, and weather inputs. In this study, RF corroborates the influence of the hour of the day, temperature, and relative humidity. Other lower correlations are observed between rain and water demand for all the DMAs in the case study.
All in all, the qualitative analysis performed by SOMs can help identify important correlations, without the assumption of a linear correlation among the variables. However, qualitative analyses require personal interpretation of the maps, and this can lead to faulty correlations. In this sense, a quantitative method like the one provided by an RF is very useful. Quantitative methods can give the magnitude of the correlations and, when combined with qualitative analyses, the correlation identification process can be more powerful.
The correlation analysis of weather and social variables to improve water demand forecasting models is exploited in this work with different computational tools. Applying the methodologies to various DMAs makes it possible to evaluate the correlation properties at different spatial levels, in DMAs of different sizes. Furthermore, the hour of the day, a predictive variable highlighted as very important by the three methodologies, corroborates the typical approach used to consider temporal trends of water demand in forecasting processes.

Conclusions
Water demand forecasting models help decision-making processes dealing with various issues in water resources planning and management. Several studies propose using soft computing techniques to model water demand for different time horizons. However, these approaches are not exploited enough regarding the understanding of correlations between predictive variables and water demand. In this sense, assessing social and weather variables is a useful approach to improve the accuracy of regression models on water demand. This work presents three techniques to evaluate the correlation among water demand, social variables, and weather information.
A classical tool, such as PCA, can find the main correlations: temperature, hour of the day, and water demand. However, PCA is unable to find deeper correlations, as they are of a nonlinear nature. SOM analyses process the input space and make further visual analytics easier. However, the qualitative nature of these analyses can affect the final results. RF algorithms are able to evaluate the influence of a predictive variable through comparisons of regression models. Using this approach, RF algorithms quantify the influence of each variable on the quality of the regression model. This paper also presents a water demand analysis for various related DMAs with different consumer features. A key point of this work is the space-time analysis of the correlation among water demand and the studied input variables. The obtained results point to the good generalization capacity of the presented tools based on SOMs and RF algorithms.
Last but not least, it is worth mentioning that accurate water demand models help improve urban water system operation, as the degree of uncertainty in water demand is reduced. In this regard, the operation of pumps and valves then might be approached under better hydraulic conditions. Consequently, better knowledge of short-term future water demands may directly translate into several improvements on water, energy, and economic resources.