Application of Multiple Linear Regression Models and Artificial Neural Networks on the Surface Ozone Forecast in the Greater Athens Area, Greece

1 Department of Mechanical Engineering, Technological Education Institute of Piraeus, 250 Thivon and P. Ralli, 12244 Athens, Greece
2 Laboratory of Climatology and Atmospheric Environment, Faculty of Geology and Geoenvironment, University of Athens, University Campus, Panepistimiopolis GR 157 84, Athens, Greece
3 Laboratory of Environmental Technology, Department of Electronic Computer Systems Engineering, Technological Education Institute of Piraeus, 250 Thivon and P. Ralli, 12244 Athens, Greece
4 General Department of Mathematics, Technological Education Institute of Piraeus, 250 Thivon and P. Ralli, 12244 Athens, Greece


Introduction
Air quality has emerged as a major factor affecting the quality of life in urban areas, especially in densely populated and industrialized areas. Air pollution control is necessary in order to prevent the worsening of air quality in the long run. At the same time, short-term forecasting of air quality is required in order to take preventive and clear action during episodes of atmospheric air pollution [1]. Increasing surface ozone concentrations have, in recent decades, become a major concern worldwide due to their adverse effects on human health [2][3][4][5][6].
Long-term human exposure to ozone causes chest pains, persistent coughing, respiratory irritation, impaired lung function, a sensation of dry throat, worsening of existing respiratory diseases such as asthma, severe inflammation of the lungs, and even irreversible lung damage. All of these result in premature aging and chronic respiratory diseases. They mainly affect certain sensitive groups in the population, such as children, asthmatics, and elderly people [7][8][9][10][11]; it is therefore clear that ozone forecasting is very important for public health.
In the past, there have been many attempts by scientists to forecast ozone concentrations, as well as the concentrations of other pollutants, in both urban and nonurban areas [12][13][14][15][16][17][18][19][20][21][22][23][24]. Specifically, Ziomas et al. [16] presented analytical models relating maximum ozone concentrations in the Athens area to various meteorological variables. For this purpose, 54 pollution periods during the years 1987-1990 were selected and analysed. The evaluation of the developed relationships showed a rather promising degree of success. Spellman [18] developed ANN models to forecast surface ozone concentration at five different locations in London during the warm period of the year. In every case, the evaluation of the results indicated that the developed ANN models outperformed the regression models. Gómez-Sanchis et al. [19] developed a prognostic model in order to predict tropospheric ozone by using ANN models. For this purpose, ambient ozone concentrations were estimated using surface meteorological variables and vehicle emission variables as predictors. Corani [20] applied feedforward neural networks, pruned neural networks, and lazy learning in order to predict, at 9 a.m., the current-day concentrations of ozone and PM10 in Milan, Italy.
Heo and Kim [21] described a method of forecasting daily maximum ozone concentrations at four monitoring sites in Seoul, Korea. The forecasting tools developed were fuzzy expert and neural network systems. The hourly data for air pollutants and meteorological variables, obtained both at the surface and at the high-elevation (500 hPa) stations of Seoul City for the period 1989-1999, were analyzed. Two types of forecast models were developed. The first model, Part I, uses a fuzzy expert system and forecasts the possibility of high ozone levels (equal to or above 80 ppb) occurring on the next day. The second model, Part II, uses a neural network system to forecast the daily maximum concentration of ozone on the following day. The forecasting system includes a correction function so that the existing model can be updated whenever a new ozone episode appears. The accuracy of the forecasting system has been improved continuously through verification and augmentation.
Furthermore, Elkamel et al. [22] developed and trained ANN models to forecast the concentration of surface ozone in Kuwait, using meteorological parameters along with the concentrations of other pollutants. The developed ANN models were then compared with linear and nonlinear regression models, and the ANN models proved superior in every case. In addition, Slini et al. [25] used meteorological parameters, such as air temperature and wind speed, along with the surface ozone concentration of the previous day to forecast the surface ozone concentration for the next 24 hours by applying multiple regression analysis. They concluded that cooperation between statisticians and environmentalists leads to rather significant outcomes in air quality forecasting. Finally, Moustris et al. [23] developed ANN models for predicting air quality one, two, and three days ahead in many different areas within the Greater Athens Area (GAA). They used meteorological data, air pollution data, and air quality indices. The results were encouraging and showed that ANN models could give more reliable predictions concerning the future quality of the ambient environment.
The objective of this study is to develop multiple linear regression (MLR) models and compare them against a forecasting model based on the Artificial Neural Network (ANN) approach, in order to forecast ozone concentration levels for the next 24 hours within the GAA.

Data and Methodology
2.1. Area and Data. Multiple linear regression models and ANN models were used in the present study to forecast the maximum daily surface ozone concentration of the forthcoming 24 hours within the GAA. The study was based on maximum daily values of surface ozone concentrations (μg/m³) that were recorded over a five-year period (2001-2005). Hourly surface ozone concentration data and meteorological variables were acquired from the network of the Hellenic Ministry of Environment, Energy and Climate Change (HMEECC). The meteorological variables of air temperature (°C) and wind speed (m/s) were obtained from eight different stations of the HMEECC monitoring network. Since surface ozone levels can vary considerably within the GAA [16], eight suburban and five urban monitoring stations were selected for the analysis. A detailed description of the HMEECC network is found in relevant publications (e.g., [26]).
In the present study, the data set consists of five years of daily observations of maximum mean hourly ozone concentrations (μg/m³) from the thirteen stations of the HMEECC monitoring network, together with maximum hourly air temperature (°C) and mean wind speed (m/s) from the eight meteorological observation stations of the HMEECC for the same time period. According to the scientific literature [16,25,27], the choice of the appropriate meteorological data is based on their availability and their impact on shaping the surface ozone concentrations. Air temperature plays an important role, because an increase in air temperature is linked with a rise in the surface ozone concentration. Additionally, another basic meteorological parameter, which determines the horizontal transport and dispersion of air pollutants, is the mean wind speed.
The final constructed data set, from 2001 to 2005, was divided into two subsets. The first covers the four-year period from 2001 to 2004, and the second the one-year period of 2005. The data from the first subset were used for training the applied models. The second subset was not used for training but only for the evaluation of the models' predictive ability.
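The chronological split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_by_year` and the four sample records are hypothetical, and only the year boundaries (2001-2004 for training, 2005 for evaluation) come from the text.

```python
from datetime import date


def split_by_year(records, last_training_year=2004):
    """Split (date, value) records into training and evaluation subsets by calendar year."""
    train = [r for r in records if r[0].year <= last_training_year]
    test = [r for r in records if r[0].year > last_training_year]
    return train, test


# Hypothetical records: (observation date, daily maximum ozone in ug/m3).
records = [
    (date(2001, 7, 1), 120.0),
    (date(2004, 12, 31), 60.0),
    (date(2005, 1, 1), 55.0),
    (date(2005, 8, 15), 140.0),
]
train, test = split_by_year(records)
```

Splitting by date rather than at random keeps the evaluation year entirely unseen during training, which matches the forecasting setting.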

2.2. Methodology. The first step followed in the present analysis was the adjustment of the daily observations of maximum mean hourly ozone values to the requirements of multiple linear regression analysis. Because this method requires the data to be normally distributed, the natural logarithm transformation may be applied to the ozone values [28] in order to fulfill this prerequisite. The distributions of the original (Figure 1(a)) and transformed (Figure 1(b)) ozone values, indicated by bars, are compared to the normal distribution. It is obvious that the transformed ozone values fit the normal distribution.
A variety of statistical methods have been utilized in order to develop techniques that enable qualitative or quantitative short-term forecasts. The most common alternative is to employ a multivariate statistical approach that is widely used in operational ozone forecasting and research-oriented statistical modelling (e.g., [14,25]). Multiple (multivariate) linear regression analysis is the most popular of these techniques, and it has the general form:

$$y = c + \sum_{i=1}^{n} a_i x_i + \varepsilon, \tag{1}$$

where $c$ is the intercept term, $a_i$, $i = 1, 2, \ldots, n$, are the regression coefficients, $y$ is the response variable (surface ozone concentration), $x_i$, $i = 1, 2, \ldots, n$, are the independent predictor variables, and $\varepsilon$ is a residual error. When the regression equation is used in predictive mode, $\varepsilon$ (the difference between actual and predicted values not accounted for by the model) is omitted because its expected value is zero [29].
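As an illustration of how the coefficients of a multiple linear regression of this general form can be estimated, the sketch below solves the ordinary least-squares normal equations in plain Python. The helper names (`solve_linear_system`, `fit_mlr`) and the noise-free sample data are assumptions made for the example; the paper does not state which software or estimation routine was used.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][k] * z[k] for k in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def fit_mlr(X, y):
    """Return [c, a_1, ..., a_n] minimising the squared residuals of y = c + sum(a_i * x_i)."""
    rows = [[1.0] + list(x) for x in X]  # the leading 1 carries the intercept c
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve_linear_system(XtX, Xty)


# Hypothetical noise-free data generated from y = 2 + 0.5*x1 - 1.5*x2.
X = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
y = [2 + 0.5 * x1 - 1.5 * x2 for x1, x2 in X]
coefficients = fit_mlr(X, y)  # approximately [2.0, 0.5, -1.5]
```

With real, noisy observations the recovered coefficients are estimates, and the residual term of the general form no longer vanishes.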
In multiple regression analysis, an important assumption is that the explanatory variables are independent of each other [30]. However, in some applications of regression, the explanatory variables are related to each other. This is called the multicollinearity problem [31]. Nonmulticollinearity means that, in any multiple regression analysis, there should be no correlation between the independent variables. To check this, a multicollinearity index known as the variance inflation factor (VIF) is used [30]. The VIF index is calculated by the following equation:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}, \quad i = 1, 2, \ldots, n, \tag{2}$$

where $n$ is the number of predictor variables and $R_i^2$ is the square of the multiple correlation coefficient of the $i$th variable with the remaining $(n − 1)$ variables. According to Hossain et al. [30], if 0 < VIF < 5, there is no evidence of a multicollinearity problem; if 5 ≤ VIF ≤ 10, there is a moderate multicollinearity problem; and if VIF > 10, there is a serious multicollinearity problem among the variables.
ANN models are an attempt to imitate and simulate the function of neurons in the human brain through mathematical functions. In the late 1940s, Hebb [32,33] proposed one of the world's first theories about the mechanism of neurons in the human brain and their abilities, such as the ability to learn. The field of ANN began to grow when McCulloch and Pitts [34] created the first ANN model. In 1958, Rosenblatt [35] invented the model of artificial perception-cognition (the Perceptron model), which generated more interest in the scientific community in ANN models and their ability to solve at least some simple problems.
The simplest ANN model is composed of a single neuron. In this case, of course, the term "network" is a misnomer, because there are no other artificial neurons connected to each other to form a network. The only connections are those between the inputs $x_1, x_2, x_3, \ldots, x_n$ and the artificial neuron. A Perceptron model, which is essentially a single artificial neuron, is very limited. Such a network, for example, can only represent flat surfaces. This limitation no longer exists when more than one artificial neuron is used, which has led to the multilayer perceptron (MLP).
An MLP consists of an input layer containing the artificial neurons that correspond to the input data (inputs). After the input layer there are one or more hidden layers, each with one or more artificial neurons. Each artificial neuron in each hidden layer is connected to, and exchanges information with, all the neurons of both the previous and the next layer. Finally comes the output layer, which contains the "target" artificial neurons. Since data flow within the artificial neural network from one layer to the next without any return path, ANN models of this kind are defined as feedforward ANN models. The structure of a feedforward MLP ANN can be found in relevant publications (e.g., [23]).
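The flow of data through a feedforward MLP with one hidden layer can be sketched as below. This is an illustrative forward pass only: the `forward` function, the tanh activation, and all weight values are assumptions chosen for the example, not the trained network of this study.

```python
import math


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Propagate inputs through one hidden layer (tanh units) to a single linear output."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Hypothetical network: three inputs, four hidden neurons, one output.
hidden_weights = [
    [0.2, -0.1, 0.05],
    [-0.3, 0.4, 0.1],
    [0.1, 0.1, -0.2],
    [0.05, -0.25, 0.3],
]
hidden_biases = [0.0, 0.1, -0.1, 0.05]
output_weights = [0.5, -0.4, 0.3, 0.2]
output_bias = 0.1

prediction = forward([1.0, 0.5, -0.5], hidden_weights, hidden_biases,
                     output_weights, output_bias)
```

Each hidden neuron receives every input, and the output neuron receives every hidden activation, so information moves strictly from layer to layer with no return path.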
A weighting factor known as a "synaptic weight" corresponds to each connection between neurons. The synaptic weight of an input is a number which, when multiplied by the input, gives the weighted input. These weighted inputs are then added together and, if their sum exceeds a preset threshold value, the neuron is activated and gives a response. As the ANN is trained, the values of these synaptic weights change continuously. After the result reaches the output layer, it is compared with the target value and the error is calculated. The error is then propagated backwards through the network, and the values of the synaptic weights are modified accordingly. When the error is within acceptable limits, the values of the synaptic weights are frozen. The ANN model has now been trained properly and is able to replicate and apply its "knowledge" or "experience" to new data and information that were previously unknown to it. This training algorithm is known as the error backpropagation algorithm.
For the evaluation of the results and the predictive performance of the developed models, appropriate statistical indices were used, namely the coefficient of determination (R²), the mean bias error (MBE), the root mean square error (RMSE), and the index of agreement (IA) [29,[36][37][38][39][40]. The RMSE is a commonly used measure of the differences between the values predicted by a model and the observed values. It was used as a single measure that indicates the ability of the model to predict, and it has the same units as the predicted value. The RMSE is never negative, and a value of zero is ideal. The MBE provides information on the long-term performance. A low MBE is desirable, and ideally a zero value of MBE should be obtained. A positive value gives the average amount of overestimation in the calculated values, and a negative value the average amount of underestimation.
The coefficient of determination is used for statistical models whose main purpose is the forecast of future outcomes on the basis of other related information. It is the proportion of the variability in a dataset that is accounted for by the statistical model, and it provides a measure of how well future outcomes are likely to be predicted by the model. The coefficient values range from zero to one (0 ≤ R² ≤ 1). The closer the value is to one, the better and more accurate the prediction. The index of agreement is a dimensionless measure with values between zero and one (0 ≤ IA ≤ 1). When IA = 0, there is no agreement between prediction and observation, while IA = 1 denotes perfect agreement between prediction and observation [37].
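The four evaluation indices can be computed as in the sketch below. The formulas are assumptions, since the text does not spell them out: R² is taken as the squared Pearson correlation between observed and predicted values (which lies between zero and one, as stated), MBE as the mean of predicted minus observed, and IA as Willmott's index of agreement. The function name and the sample values are hypothetical.

```python
import math


def evaluation_indices(observed, predicted):
    """Return R2, MBE, RMSE and IA for paired observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    errors = [p - o for o, p in zip(observed, predicted)]

    mbe = sum(errors) / n  # positive values indicate overestimation
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # Squared Pearson correlation between observations and predictions.
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    r2 = cov * cov / (var_obs * var_pred)

    # Willmott's index of agreement.
    potential_error = sum(
        (abs(p - mean_obs) + abs(o - mean_obs)) ** 2
        for o, p in zip(observed, predicted)
    )
    ia = 1.0 - sum(e * e for e in errors) / potential_error

    return {"R2": r2, "MBE": mbe, "RMSE": rmse, "IA": ia}


# Hypothetical observed and predicted daily maxima (ug/m3).
indices = evaluation_indices([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 41.0])
```

For these sample values the predictions are on average 1 μg/m³ too high, so the MBE is positive, while the RMSE also reflects the individual over- and underestimates.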

Results and Discussion
Figure 2 depicts the annual variation of the mean maximum monthly temperature for the urban and suburban examined areas, respectively.
According to Figure 2, there is a permanent difference of about 2 °C between urban and suburban areas. The result is expected and is consistent with the urban heat island (UHI) effect, resulting in higher temperatures within the urban areas. During the warm period of the year (May to September), the mean maximum air temperature is about 29.6 °C for the examined urban areas and 27.5 °C for the suburban areas. During the cold period of the year (October to April), the mean maximum air temperature is about 16.8 °C for the examined urban areas and 14.7 °C for the suburban areas. A similar brief analysis was carried out for the wind speed. Figure 3 presents the annual variation of the mean monthly wind speed for the examined urban and suburban areas.
The wind speed is fairly low for both urban and suburban areas within the GAA. There is a difference in wind speed of about 0.3 m/s between urban and suburban areas. The mean annual wind speed is about 2.0 m/s for the examined urban areas and 2.4 m/s for the suburban areas. During the warm period of the year (May to September), the mean wind speed is about 2.1 m/s for the examined urban areas and 2.4 m/s for the suburban areas. Finally, during the cold period of the year (October to April), the mean wind speed is about 2.0 m/s for the examined urban areas and 2.3 m/s for the suburban areas.
The characteristics and the differences of the daily maximum surface ozone concentration in suburban and urban areas were classified and analyzed separately for the examined period 2001 to 2005. Figure 4 shows the annual variation of the mean maximum monthly ozone concentrations. There is a permanent difference of about 15 μg/m³ between urban and suburban areas. The mean maximum annual ozone concentration is about 89 μg/m³ for the urban areas and 105 μg/m³ for the suburban areas. During the warm period of the year, the ozone concentration is about 115 μg/m³ for the urban areas and 132 μg/m³ for the suburban areas. Finally, during the cold period of the year, the ozone concentration is about 70 μg/m³ for the urban areas and 84 μg/m³ for the suburban areas.
According to the HMEECC, the ozone concentration threshold (maximum hourly value during the day) for public awareness is 180 μg/m³ and the alert threshold is 240 μg/m³. For both the urban and suburban examined areas within the GAA, the percentage of days exceeding these thresholds was analyzed. On an annual basis, the results showed that 12.3% of the days exceed the public awareness threshold in at least one of the examined urban areas; 95.5% of these days occur during the warm period of the year and only 4.5% during the cold period. On the warm period exceedance days, the mean maximum daily temperature is about 31.9 °C and the mean daily wind speed is about 1.7 m/s. In the suburban areas, 15.8% of the days of the whole year exceed the public awareness threshold. Of these days, 85.5% occur during the warm period of the year and the remaining 14.5% during the cold period. On the warm period exceedance days in suburban areas, the mean maximum daily temperature is 30.0 °C and the mean daily wind speed is about 1.9 m/s, against 21.5 °C and 1.9 m/s, respectively, on the cold period exceedance days.
Concerning the alert threshold, on an annual basis 1.6% of the days exceed the threshold in at least one of the examined urban areas. All of these exceedance days (100%) occur during the warm period of the year. Furthermore, on these days the mean maximum daily temperature is about 33.6 °C and the mean daily wind speed is about 1.6 m/s. For the suburban areas, 4.1% of the days of the whole year exceed the alert threshold. Of these days, 86.7% occur during the warm period of the year and the remaining 13.3% during the cold period. On the warm period alert exceedance days in suburban areas, the mean maximum daily temperature is 31.3 °C and the mean daily wind speed is about 1.7 m/s, against 23.7 °C and 1.5 m/s, respectively, for the cold period. In general, the observed high air temperatures and low wind speeds promote air pollution and especially high ozone concentrations.
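The exceedance percentages discussed above amount to counting the days whose maximum hourly ozone value lies above a threshold. A minimal sketch follows; the two threshold values are those quoted in the text, while the function name and the eight sample values are hypothetical, and "exceed" is assumed to mean strictly greater than the threshold.

```python
INFORMATION_THRESHOLD = 180.0  # ug/m3, public awareness threshold
ALERT_THRESHOLD = 240.0        # ug/m3, alert threshold


def exceedance_percentage(daily_maxima, threshold):
    """Percentage of days whose maximum hourly ozone value is above the threshold."""
    exceedances = sum(1 for value in daily_maxima if value > threshold)
    return 100.0 * exceedances / len(daily_maxima)


# Hypothetical daily maxima (ug/m3) for eight days.
daily_maxima = [95.0, 185.0, 250.0, 120.0, 181.0, 60.0, 240.0, 130.0]
information_share = exceedance_percentage(daily_maxima, INFORMATION_THRESHOLD)
alert_share = exceedance_percentage(daily_maxima, ALERT_THRESHOLD)
```

In this sample, four of the eight days lie above 180 μg/m³ and one lies above 240 μg/m³; a value exactly equal to a threshold is not counted under the stated assumption.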
To forecast the maximum hourly surface ozone concentration of the next 24 hours within the GAA, prognostic models were developed using multiple linear regression and ANN models. For this purpose, the daily values of the maximum hourly air temperature (°C) and the mean daily wind speed (m/s) within the GAA, as well as the daily values of the maximum hourly ozone concentrations (μg/m³) of the previous day, were used.
Essentially, three independent variables were chosen, while the dependent variable is the concentration of surface ozone 24 hours ahead (Table 1). Specifically, the independent variable $x_1$ is the natural logarithm of the maximum observed daily ozone concentration of the previous day over the thirteen examined areas. The independent variable $x_2$ is the maximum observed daily air temperature of the previous day over the eight examined meteorological stations. The independent variable $x_3$ is the mean daily observed wind speed of the previous day over the eight examined meteorological stations. Finally, the dependent variable $y$ is the natural logarithm of the maximum daily surface ozone concentration for the next 24 hours, that is, the maximum ozone concentration over the thirteen examined areas within the GAA one day ahead.
The ozone concentrations were transformed using their natural logarithm in order to satisfy the requirement of multiple linear regression analysis, namely, that the ozone values follow the normal distribution. Thus, the dependent variable is now the natural logarithm of the surface ozone concentration 24 hours ahead. The normality of both the ozone concentration values and their natural logarithms was checked by applying the Kolmogorov-Smirnov test (statistic D), and the results are shown in Figure 1. Based on the values of the Kolmogorov-Smirnov statistic D, the original ozone concentrations (D = 0.08184) do not follow the normal distribution (Figure 1(a)), whereas the natural logarithms of the ozone concentrations (D = 0.03722) seem to follow the normal distribution (Figure 1(b)).
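The effect of the logarithmic transformation on the Kolmogorov-Smirnov statistic D can be illustrated with synthetic data. The sketch below is an assumption-laden example, not a reproduction of the paper's test: the sample is drawn from a log-normal distribution purely for illustration, the reference normal distribution uses the sample's own mean and standard deviation, and the function names are hypothetical. Because the parameters are estimated from the sample, standard Kolmogorov-Smirnov critical values do not strictly apply to this D.

```python
import math
import random
import statistics


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic_normal(values):
    """Kolmogorov-Smirnov D between a sample and a normal fitted to its mean and std."""
    data = sorted(values)
    n = len(data)
    mean = statistics.mean(data)
    std = statistics.stdev(data)
    d = 0.0
    for i, x in enumerate(data):
        cdf = normal_cdf(x, mean, std)
        # Largest gap between the empirical step function and the fitted CDF.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


random.seed(0)
# Hypothetical right-skewed "concentrations", log-normal by construction.
raw = [random.lognormvariate(4.5, 0.5) for _ in range(2000)]
d_raw = ks_statistic_normal(raw)
d_log = ks_statistic_normal([math.log(v) for v in raw])
```

For skewed data of this kind, D for the log-transformed values is expected to be smaller than D for the raw values, which is the pattern reported in Figure 1.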
In general terms, in every case there are three independent variables and one dependent variable. Table 1 presents the variables of the applied forecasting models, which are the same for each developed model, as well as their abbreviations.
The nonmulticollinearity between the three independent variables was investigated by applying the multicollinearity index VIF. The results are presented in Table 2. According to Table 2, the condition of nonmulticollinearity between the three independent variables $x_1$, $x_2$, and $x_3$ is fully satisfied.
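The VIF value reported in Table 2 can be checked from the reported multiple correlation coefficient, assuming the usual definition VIF = 1/(1 − R²) and the interpretation bands attributed to Hossain et al. [30] in the methodology. The function names below are hypothetical.

```python
def variance_inflation_factor(multiple_r):
    """VIF from a multiple correlation coefficient R, using VIF = 1 / (1 - R^2)."""
    return 1.0 / (1.0 - multiple_r ** 2)


def multicollinearity_level(vif):
    """Classify a VIF value: below 5 none, 5 to 10 moderate, above 10 serious."""
    if vif < 5:
        return "none"
    if vif <= 10:
        return "moderate"
    return "serious"


vif = variance_inflation_factor(0.028)  # coefficient reported in Table 2
level = multicollinearity_level(vif)
```

With R = 0.028 the squared coefficient is 0.000784, so the VIF is about 1.0008, which rounds to the reported value of 1.0 and falls in the band with no evidence of multicollinearity.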
The estimated multiple linear regression (MLR) model was

$$y = 1.4271 + 0.6562\,x_1 + 0.0101\,x_2 + 0.0076\,x_3. \tag{3}$$

Table 1: Variables of the developed models.

| Variable | Type | Description |
| --- | --- | --- |
| $x_1$ | Independent | The natural logarithm of the maximum daily ozone concentration of the previous day |
| $x_2$ | Independent | The maximum daily air temperature of the previous day |
| $x_3$ | Independent | The mean daily wind speed of the previous day |
| $y$ | Dependent | The natural logarithm of the maximum daily ozone concentration 24 hours ahead |

Table 2: Nonmulticollinearity test results between the three independent variables.

| Combination of the independent variables | Multiple correlation coefficient | VIF |
| --- | --- | --- |
| $x_1$, $x_2$, $x_3$ | 0.028 | 1.0 |

As far as the ANN forecasting model is concerned, a multilayer perceptron neural network with the architecture of a time lagged recurrent network (TLRN) was developed and trained. Most real-world data contain information in their time structure, that is, in how the data change with time. However, most neural networks are purely static classifiers. The TLRN neural architecture structure is the evolution of technology in linear time series prediction, system identification, and temporal pattern classification [41,42].
The developed ANN model consists of one hidden layer with four artificial neurons. Both the type and the architecture of the developed ANN prognostic model were chosen by trial and error. In order to train the ANN model, the independent variables $x_1$, $x_2$, and $x_3$ (Table 1) were used as input data, so that the results of the developed ANN model are comparable with those of the developed MLR model (3). The ANN output is the dependent variable $y$, the natural logarithm of the maximum daily surface ozone concentration for the next 24 hours within the GAA. In other words, the inputs and the output of the ANN model are, respectively, the independent variables and the dependent variable of the developed MLR forecasting model.
As mentioned before, the data set was divided into two subsets. The first included data from the period 2001-2004 and was used for the ANN training, and the second covered the year 2005, which was completely unknown to the model and was used for the evaluation of the forecasting ability of the developed ANN. The same procedure was followed for all developed MLR models.
Table 3 presents the values of the evaluation statistical indices for both the MLR and the ANN model. It should be noted that these indices describe the agreement between the observed and the predicted ozone concentrations for the validation year 2005, and that they were obtained after back-transforming the predicted natural logarithms of the surface ozone concentration values of each model.
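As a worked illustration of the back-transformation, the sketch below applies the estimated MLR equation, y = 1.4271 + 0.6562 x1 + 0.0101 x2 + 0.0076 x3, and then exponentiates the predicted logarithm to return to concentration units. The coefficients are those reported in the text; the function name and the sample input values are hypothetical.

```python
import math


def predict_next_day_ozone(prev_ozone, prev_max_temp, prev_mean_wind):
    """Forecast next-day maximum ozone (ug/m3) from previous-day observations.

    prev_ozone:     previous-day maximum ozone concentration (ug/m3)
    prev_max_temp:  previous-day maximum air temperature (deg C)
    prev_mean_wind: previous-day mean wind speed (m/s)
    """
    log_prediction = (
        1.4271
        + 0.6562 * math.log(prev_ozone)  # x1 is the natural logarithm of ozone
        + 0.0101 * prev_max_temp         # x2
        + 0.0076 * prev_mean_wind        # x3
    )
    # The model predicts ln(ozone); exponentiate to recover the concentration.
    return math.exp(log_prediction)


# Hypothetical previous-day conditions: 120 ug/m3, 30 deg C, 2 m/s.
forecast = predict_next_day_ozone(120.0, 30.0, 2.0)
```

For these inputs the predicted logarithm is about 4.887, which corresponds to a forecast of roughly 132 to 133 μg/m³.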
Figure 5 shows the observed values of the maximum daily ozone concentration against the corresponding values predicted by the models within the GAA for the year 2005. According to Table 3 and Figure 5, the predictions of all the developed models are at a very satisfactory level (P < 0.01). In conclusion, forecasting the daily maximum surface ozone concentration 24 hours ahead is more reliable with the ANN approach than with MLR models. The prediction of the polynomial regression model is very close to the ANN prediction.

Conclusions
The objective of this study was to examine the one-day forecast of the daily maximum surface ozone concentration within the GAA by developing predictive models based on multiple regression analysis and comparing them against the artificial neural network approach. The analysis indicated that the coefficient of determination (MLR: 0.653, ANN: 0.666) and the index of agreement (MLR: 0.887, ANN: 0.892) between the observed and the predicted ozone concentrations for the year 2005 are statistically significant (P < 0.01) for both the MLR and the ANN models. The forecasting ability of the ANN models thus shows a slight advantage over that of the MLR models. Furthermore, the predictive ability of the constructed ANN model could be improved by using several other parameters, such as nitrogen oxides concentration, the intensity of solar radiation, and the sunshine duration, depending on the availability of those data over space and time.

Figure 1: Normality test results for ozone (O3) concentration values (a) and the natural logarithm transformation (ln O3) of ozone concentrations (b).

Figure 2: Annual variation of mean maximum monthly temperature for the urban and suburban areas within the GAA. Period 2001-2005.

Figure 3: Annual variation of mean monthly wind speed for the urban and suburban areas within the GAA. Period 2001-2005.

Figure 4: Annual variation of mean maximum monthly ozone concentrations for the urban and suburban areas within the GAA. Period 2001-2005.

Figure 5: Observed (blue line) and predicted (red line) ozone concentration values 24 hours ahead for MLR (a) and ANN (b) model, year 2005.

Table 1: Variables of the developed models.

Table 3: Performance statistics for the validation of the developed models.