Genetic Programming and Standardization in Water Temperature Modelling

1 Instituto de Ingenieŕıa, UNAM, Ciudad Universitaria, Edificio 5, Cub. 403, 04510 México, DF, Mexico 2 Facultad de Ingenieŕıa, UNAM, Ciudad Universitaria, 04510 México, DF, Mexico 3 Department of Hydraulic Engineering, Maritime and Environmental, Universidad Politécnica de Catalunya, C/ Jordi Girona1-3, 08034 Barcelona, Spain 4 Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Ciudad Universitaria, 04510 México, DF, Mexico


Introduction
Evolutionary computing has been widely used in hydraulics and hydrology, for example, the studies ofSavic et al. [1], Madsen et al. [2], and Dorado et al. [3], related to rainfallrunoff processes, modeling of an urban aquifer as was discussed by Hong and Rosen [4], or the modifications of genetic programming algorithms attempting to get an agreement with the problem dimension in natural and compounded channels as applied by Keijzer and Babovic [5], Harris et al. [6], and Keijzer et al. [7].On the other hand, water temperature is an important parameter to consider because of the changes it can experience due to human activities.In the last three decades diverse studies about weather changes have been made, related to the increase of extreme events such as floods and droughts (e.g., Lehner et al. [8]) the increasing air and water temperatures (e.g., Seguí [9]; Webb and Nobilis [10]), ice melting, and greenhouse effect (e.g., Greve [11]), with all their consequences in the surrounding ecosystems (e.g., Schindler [12], Álvarez Cobelas et al. [13]).
The motivation to work with models that allow the representation of water temperature behavior year after year is because each time a possible abnormal increase in this parameter occurs, the consequences and implications for the physical and chemical properties of water with their corresponding effects in aquatic life are numerous.Some models have been applied to maximum water temperatures by means of nonlinear relationships between air temperature and water temperature (Caissie et al. [14]), but there are other important variables involved in water temperature variation during a given period of time.In order to preserve the ecological balance it is very important to have a continuous inspection of water quality in that portion of the river.Freshwater organisms are mostly ectotherms and are therefore largely influenced by water temperature.Some of the expected consequences of a water temperature increase are life-cycle changes (Hellawell [15]; Winfield and Nelson [16]), shifts in the distribution of species with the arrival of allochthonous species (Walther et al. [17]), and the expansion of epidemic diseases (Harvell et al. [18]) as a possible result.Also, aquatic flora and fauna depend on dissolved oxygen to survive, and this water quality parameter is a function of water temperature as well.

Study Site
The field data used in this study were taken from the lower Ebro River, Spain.This river has a basin of 85 000 km 2 and an average year inflow of 17 000 hm 3 in natural regime.Three dams are located along the river (Figure 1) which change the water temperature regime (Val [19]): Mequinenza (1534 hm 3 ), Ribarroja (210 hm 3 ), and Flix (11 hm 3 ).Five kilometers downstream the Flix dam, the water is taken from the river with cooling purposes in the Nuclear Central Asc ó.
Water is returned to the river with a higher temperature and flows downstream to Miravet.In this zone several meteorological gauge stations were installed, including some measuring water temperature (Figure 1).These data were applied in studies made by Val [19] and Prats et al. [20].
Besides, an important effort has been recently made to obtain equations to predict water temperature associated to meteorological variables that are easily measured, centered at the Ribarroja station (Arganis et al. [21,22]).

Methodology
3.1.Evolutionary Algorithms.Evolutionary algorithms, also known as Evolutionary Computation (EC), the optimization tool used in this work, use computational models of evolutionary processes in the design and implementation of computer-based problem solving.A general definition and classification of these evolutionary techniques is given in Bäck [23].He defines an EA as a search and optimization algorithm, inspired by the process of natural evolution, which maintains a population of structures that evolve according to the rules of selection and other operators such as recombination and mutation.Here, the structure of all evolution-based algorithms is shown in Algorithm 1.
In a similar way to that of natural evolution and heredity, these algorithms work on a population of N individuals

P(t) = {x t
1 , . . ., x t N }, representing search points in the space of potential solutions of a given problem.How well each individual x t i adapts each generation t to the problem under investigation is provided by a quality measure called the "fitness".The population evolves, generation by generation, towards better regions of the search space by means of genetic processes, such as selection, recombination, and mutation.The selection process uses the fitness measure to choose individuals of the previous generation (P(t − 1)) to be reproduced, favoring those of higher quality.The recombination operation promotes the exchange of genetic information between parent individuals, thereby producing descendants.The mutation operation alters the genetic information by introducing some changes into the population.The evaluation process is repeated until a predefined termination criterion is met, or alternatively, until a maximum number of generation (iterations) is reached.This artificial evolution process is the foundation of the evolution-based algorithms used in this work, genetic programming.

Genetic Programming Algorithm.
A typical genetic programming algorithm consists of a set of functions, which can involve arithmetic operators (+, −, * , /, . ..), transcendental functions (sin, cos, tan,. . ., ln, exp,. . .), even relational operators (>, <, =) or conditional operators (IF), and a terminal  set with variables and constants (x 1 , x 2 , x 3 , . . .x n ).An initial population is randomly created with a number of parsetree individuals composed of nodes (operators plus variables, and constants) previously defined according to the problem domain (an example of GP individual is given in Figure 2).An objective function must be defined to evaluate the fitness of each individual (in this case each individual will be a resultant model or program of the random combination of nodes).Selection, crossover, and mutation operators are then applied to the best individuals, and a new population is created.The whole process is repeated until the given generation number is reached (Cramer [24], Koza [25]).

Brief Description of the Physical Phenomena and Their
Related Variables.The water from a river is in a constant heat exchange with its surroundings: the atmosphere and the river bed.This process may reach equilibrium so that the heat lost by the water equals that which is absorbed.Normally, the water temperature increases throughout the river in a natural state as the altitude decreases.To this spatial variation a double temporal variation is superimposed.In a river reach temperature varies following both a daily and an annual cycle.
In the study performed by Val [19], an analysis of five kilometers of the Ebro River was performed, in a section downstream of the Flix hydroelectric center; in this reach, hourly temperature measurement data are available for different sections.It was observed that during the summer a 9 • C difference may exist between the Flix Central site and the temperature before the dams.Additionally, downstream of Flix Central, the water temperature recovers, trying to reach thermal equilibrium with its surroundings.To estimate the heat that is absorbed by the river water as it progresses naturally through a certain reach, and its corresponding temperature variation, an energy balance is established between the caloric energy received and the caloric energy emitted by the water along that reach.This can be done based on the thermic balance presented by Edinger et al. [26].This balance can be expressed as where A is the total caloric power absorbed by water as it moves along a river reach, by square meter of free surface, measured in W/m 2 .This is the result of the balance of the different heat inputs and outputs for water as it moves along the reach.H sn is the net (incident minus reflected) total shortwave solar radiation (direct plus diffused) that is absorbed by the water by square meter of free surface, measured in W/m 2 .This is a function of the incident solar radiation r s and the reflected r r , which is proportional to r s , and this proportionality is given by the constant α which is also known as the albedo.H an is the net longwave atmospheric radiation (incident minus reflected) absorbed by water by free surface square meter, measured in W/m 2 .H b is the longwave radiation emitted by water by free surface square meter, measured in W/m 2 , determined as a function of average surface water temperature, T w .H e is the heat lost through evaporation by free surface square meter, measured in W/m 2 , determined as a function of the wind velocity, v v , the vapor pressure of saturation, and the air's vapor pressure.Relative humidity h r is also a variable that affects the water-atmosphere heat exchange.H c is the sensitive heat interchanged by conduction between the atmosphere and water by free surface square meter, measured in W/m 2 , dependent upon the air temperature T a and that of the water T w .S is the heat exchanged with the substrate (river bed) by square meter of river reach.
The heat stored by a water mass as it moves along a river stretch of longitude L is estimated by A = 4187(ΔTQC e ρ/LB), where A is the caloric power absorbed by water, in (W/m 2 ), ΔT is the water temperature increment, in ( • C), Q is the circulating flow, in (m 3 /s), C e is the specific heat of water in (Kcal/ • CKg), ρ is the density of water, in (Kg/m 3 ), L is the longitude of the studied reach, in (m), and B w is the effective width of the river, in (m).
On the other hand, through an analysis of the historical behavior of the time variation of water temperature during consecutive years, similar results were observed, both in the cyclical variation and in the tendency to increase or decrease.This leads to an expectation of a correlation between the temperature variation in year i and the temperature of previous years.
This background described led to the choice of the measured variables which were used in the prediction model.
Additionally, when physical variables are used to be fitted by means of genetic programming, several questions about the dimensionality of the problem could be made.But this problem can be solved considering the possible existence of dimension in the obtained constants of the calculated model.New physical interpretations of the related variables can be done by analyzing the model terms.
Twelve independent variables, one dependent variable, and a vector of real constants were selected.Thus, in the nonstandardized case the terminal set is where h r98 , h r99 , and h r2000 are the hourly average relative humidity values recorded in the years 1998 to 2000, in decimals, T a98 , T a99 , and T a2000 are average air temperature values from years 1998, 1999, and 2000, in • C, v v98 , v v99 , and v v2000 are the average wind speeds from years 1998, 1999, and 2000, in m/s, r s98 , r s99 , and r s2000 are average solar radiations from years 1998, 1999, and 2000, in W/m 2 , T w2000 is the hourly average water temperature measured from year 2000, in • C, and b is a real constant vector.
Tests were made with one hour, daily and weekly averaged water temperatures.
In the standardized case all the last variables are dimensionless.

Objective Function.
The objective function considered in this problem was defined as the minimization of the mean square error between calculated and measured data:   where T w measured data, T w1i calculated data, and i counter from 1 to data number n.The genetic programming algorithm was implemented in MATLAB (The MathWorks [27]).

Standardization.
The variables were standardized by subtracting the mean and dividing by the standard deviation: where Z is the standardized variable, dimensionless; T w is the variable before standardization; T w is the mean value of T w , with the same units as T w (the arithmetic average was used); σ Tw is the standard deviation of T w , with the same units as T w .
Variables with large variances tend to have a larger effect on the resulting model than those with small variances that can be also relevant.Standardized variables can then be advantageous in that their means are zero and their second moments (variances) are one.River.Data consist of 10-minute averages of measurements taken every minute.Water temperatures were measured just downstream of the hydroelectric power plant of Flix.The meteorological variables were measured at the measuring station located on the Ribarroja Dam.The hourly average was calculated for all the variables and taken as input data: relative humidity (h r ), air temperature (T a ), wind speed (v v ), and solar radiation (r s ) as independent variables and water temperature (T w ) as the dependent variable.The first experiment was carried out with the original data, and the second one with the standardized ones.GP parameter settings for both experiments are shown in Table 1 3.7.Model Linearity.In order to validate the applicability of the method, the correlation coefficient between measured and calculated data was obtained: where Cov(T w , T w1 ) is the covariance between the variables T w and T w1 ; σ Tw , σ Tw1 are the standard deviation of T w and T w1 , respectively.

Results and Discussion
4.1.One-Hour Average Data.The genetic programming algorithm tendency is to produce relatively simple models.The equations produced in both experiments were T w12000 z = 0.4732T a98 + 0.6409T a99 + 0.0321h r98 + 0.2316h r99 − 0.2366r s98 , respectively, where T w12000 is the hourly average water temperature value estimated in 2000, in • C, and the prefix z indicates a standardized variable.
In order to get T w2000 values, an inverse standardization process should be performed: For forecasting purposes, mean and standard deviations were estimated as follows: where T w2000 is the estimator of mean water temperature in 2000, in The mean (μ r ) and the standard deviation (σ r ) of the residuals, calculated as the difference between measured and calculated water temperatures, appear in Table 3.The measured and calculated data, including their differences for both experiments, are shown in Figures 3 and 4.
In Figures 5 and 6, the measured and calculated data with ( 6) and ( 8) are plotted against the identity function to obtain the correlation coefficient (r Tw,Tw1 ), checking the linearity in the fitting.Results given by Figures 3-6 show an improvement in calculated data when standardization is applied; residuals are slightly reduced, fluctuations become softer, and this is verified by the correlation coefficient.The mean square error is reduced in about 30%, and there is less data dispersion (standard deviation of residuals decreases 17%).

Daily Average Data.
In this case, the equations obtained without and with standardization were as follows: where T w12000 is the daily average water temperature value estimated in 2000, in In (13), data from 2000 are estimated according to ( 9) and ( 10), but considering daily measurements.The mean square errors (MSEs) obtained by using ( 11) and ( 13) are detailed on Table 4.The mean (μ r ) and the standard deviation of residuals (σ r ) of this experiment appear in Table 5.
Water temperature variations against time and the obtained differences are plotted on Figures 7 and 8. Figures 9 and 10 show a comparison between measured and calculated daily average water temperatures with respect to the identity function.Results for daily analyses report a reduction of nearly 40% in mean square error with the equation obtained using standardized data.In this case the standard deviation of residuals is also smaller (12% lower than using nonstandardization). Figure 11 shows an example of the performance of the best individual in each generation when the genetic programming algorithm was applied.

Weekly Average Data.
In this last experiment, the equations obtained without and with standardization were T w12000z = 0.2962T a98 + 0.6819T a99 + 0.6397T a2000 − 0.2668r s98 − 0.3215r s99 − 0.3852r s98 T a2000 + 0.3852r s98 r s99 − 0.0928, respectively, where T w12000 is the weekly average water temperature value estimated in 2000, in    weekly average solar radiation values of 1998 and 1999, in W/m 2 ; the prefix z indicates standardized variable.
Equation ( 15) must be nonstandardized to get the average weekly temperature approach: Mean square errors and statistics of residuals appear in Tables 6 and 7. Figures 12,13,14 and 15 show the behavior of water temperature in this weekly analysis.
The results obtained for the weekly analysis show a reduction of 52% in the mean square error when data are previously standardized, and of about 31% reduction in the standard deviation of residuals.The correlation coefficient is also close to one.

Getting the Daily Water Temperature for the Year 2004
The climatic daily data measured from 2002 to 2003 in Flix and Miravet stations were taken to estimate water temperature in the year 2004, in order to check the accuracy    of models given by ( 11), (12), and (13).The climatic data for 2004 needed by the models were assumed as the average of the years 2002 and 2003.The average water temperature for 2004 was assumed 15 • C with a standard deviation of 5 • C. A mean square error of 49.549 and a correlation coefficient of 0.6744 were obtained by applying (11) as it is shown in Figures 16 and 17, with an important variation in the residuals; by contrast, with (13) that demands standardized data, a mean square error of 13.027 and a correlation Advances in Civil Engineering    coefficient of 0.7445 were obtained; the residuals took values between −7 • C and almost 8 • C (Figures 18 and 19).Therefore, this last model had a better performance in daily data, in this year.
With both equations very big residuals for water temperature were obtained for some days of the estimated year.

Conclusions
Different models which allow the estimation of water temperature in the Ebro River in a given year were obtained, taking into account climatic variables measured in the same year, but also considering their variability in two previous years, in an attempt to explain the possible evolution of the water temperature behavior.The GP algorithm considered as input hourly, daily, and weekly average measured data without and with standardization, in order to analyze the resulting equations when the shape of the input data varies from one form to another.
Intrinsically, measured data of water temperature and climatic variables have more oscillations in hourly average data than in daily or weekly average data.Particularly, in the experiment using hourly data, the GP algorithm amplifies the water temperature oscillations, probably because in the actual physical process, the oscillations of the climatic variables are filtered.Nevertheless, by using standardized data, mean square errors were lower than those without standardization, and a lower dispersion in data could be obtained.Similar situations occurred in the case of daily data.
According to the mean square errors, the standard deviation of residuals, and the correlation coefficient, when weekly data were considered, GP algorithms produced models more capable to follow the behavior of water temperature.This was particularly true for those models obtained with standardized data.
Therefore equations such as those obtained herein can be used as a first approximation to predict changes in water temperature when changes occur in climatic variables such as air temperature, wind speed, relative humidity, and solar radiation, all of which affect the water temperature as well as the physical and chemical water conditions, including the flora and fauna of a river.
When the models for daily data were applied in another year, lower correlations between measured and predicted data were obtained, particularly with the model that does not take into account standardized variables.
According to these results, it is feasible to obtain some improvements in generating water temperature models by means of genetic programming, when the standardization process is incorporated.
Results also show limits on the models developed herein; the models produced oscillations in the water temperatures that do not correspond to the measured data; the results of forecasting from 2004 are only fair.That is probably due to the fact that some variables included in the physical phenomena are eliminated, and the filtering that occurs in nature is not reproduced; nevertheless, these results are considered useful as a first-order explanation of a complex process.However future work is suggested to compare the proposed method with physically based ones.

Figure 1 :
Figure 1: Station locations at the Ebro River, Spain.

Figure 2 :
Figure 2: A mathematical expression represented hierarchically by its parse tree.

Figure 3 :
Figure 3: Water temperature values and residuals, experiment without standardization (hourly average values).

Figure 4 :
Figure 4: Water temperature values and residuals, experiment with standardization (hourly average values).

3. 6 .
Input Data.Meteorological and water temperature data were taken in gauging stations installed in the Ebro

Figure 7 :
Figure 7: Water temperature values and residuals, experiment without standardization (daily average values).

Figure 8 :
Figure 8: Water temperature values and residuals, experiment with standardization (daily average values).

Figure 11 :
Figure 11: Convergence of a genetic programming run.

Figure 12 :
Figure 12: Water temperature values and residuals, experiment without standardization (weekly average values).

Figure 13 :
Figure 13: Water temperature values and residuals, experiment with standardization (weekly average values).

Figure 16 :
Figure 16: Measured and predicted water temperature for 2004, model without standardization (daily average values).

Table 2 :
Mean square error values.

Table 3 :
Statistics of residuals.

Table 4 :
Mean square error values.Daily average data.

Table 5 :
Statistics of residuals.Daily average data.
Advances in Civil Engineering deviation of water temperature in 1999, in • C. Table2shows the mean square error (MSE) obtained with both models.
• C; T w98 is the water temperature in 1998, in • C; T w99 is the water temperature in 1999, in • C; σ w2000 is the estimator of standard deviation of water temperature in 2000, in • C; σ w98 is the standard deviation of water temperature in 1998, in • C; σ w99 represents the standard

Table 6 :
Mean square error values.Weekly average data.

Table 7 :
Statistics of the residuals.Weekly average data.