Neural Models for Imputation of Missing Ozone Data in Air-Quality Datasets

Ozone is among the pollutants with the most harmful effects on human health and, more generally, on the biosphere. Many data-acquisition networks record ozone levels in both urban and background areas. These data are often incomplete or corrupted, so imputing the missing values is a priority in order to obtain complete datasets and to reduce the uncertainty and vagueness that complicate their analysis. In the present paper, multiple-regression techniques and Artificial Neural Network models are applied to approximate the missing ozone values from five explanatory variables containing air-quality information. To compare the different imputation methods, real-life data from six data-acquisition stations in the region of Castilla y León (Spain) are gathered in different ways and then analyzed. The results obtained by these techniques and models in estimating the missing values are compared, and the possible causes of the observed behavior are analyzed.


Introduction and Related Work
Ozone (O3) is an odorless, colorless, and highly reactive gas composed of three oxygen atoms. It is formed both in the Earth's upper atmosphere (stratospheric ozone) and at ground level (tropospheric ozone). It can be "good" or "bad" for people's health and for the environment, depending on its concentration and its location in the atmosphere [1].
Stratospheric O3 is formed naturally through the interaction of solar ultraviolet (UV) radiation with molecular oxygen (O2). Ground-level or "bad" ozone is not emitted directly into the air. In the 1950s, hydrocarbons and nitrogen oxides (NOx) were identified as the two key chemical precursors of photochemical smog and its concomitant high concentrations of O3 and other photochemical oxidants [2]. The majority of ground-level O3 is formed from the photochemical oxidation of Volatile Organic Compounds (VOCs) in the presence of NO and other NOx. Significant sources of VOCs are chemical plants, gasoline pumps, oil-based paints, autobody shops, and print shops. NOx result primarily from high-temperature combustion, and their most significant sources are power plants, industrial furnaces and boilers, and motor vehicles [3].
1.1. Importance of Ozone.
Exposure to O3 can cause damage in different ways. In the stratosphere, reduced O3 levels resulting from O3 layer depletion mean less protection from the sun's rays and more exposure to ultraviolet B (UVB, shortwave) radiation at the Earth's surface [4]. The effects on human health of O3 layer depletion, which increases the amount of UVB reaching the Earth's surface, have been widely analyzed. UVB causes nonmelanoma skin cancer and plays a major role in the development of malignant melanoma. In addition, UVB has been linked to the development of certain cataracts and to negative effects in patients with asthma and other chronic respiratory diseases. With respect to ground-level O3 and its effects on human health, breathing O3 can trigger a variety of health problems. People with asthma and other chronic respiratory diseases are a large and growing segment of the population and are known to be especially susceptible to the effects of O3 exposure. On days with high O3 levels, people with asthma tend to experience increased respiratory symptoms [3]. O3 layer depletion also has negative effects on plant development, on marine ecosystems (such as a direct reduction in phytoplankton production), on materials such as biopolymers, and so forth. Tropospheric O3 does not provide the protective function that it fulfills in the stratosphere and is highly reactive. Its strong oxidizing capacity, when its levels rise above the natural background, can cause adverse effects on materials (derived from its corrosive effects), vegetation, and ecosystems.
The present work focuses on tropospheric O3, which poses a risk to air quality [3]. Given the increase in O3 levels in the troposphere, it is currently considered one of the most important atmospheric pollutants.

Ozone Level Monitoring.
Around the world there are numerous data-acquisition networks for measuring the levels of O3 and other pollutants. They consist of many stations in different locations, where different sensors measure the corresponding magnitudes. These stations acquire data at periodic intervals (periods between ten and fifteen minutes are the most frequent), but missing or corrupted data frequently appear. In Europe, data are considered corrupted when they do not meet Council Decision 97/101/EC of January 27, 1997 [5], which establishes a reciprocal exchange of information and data from networks and individual stations measuring ambient air pollution within the Member States. Some of these networks provide information about the validity of the data, indicating through codes whether a value is correct, could not be acquired, or is corrupt; on other occasions no such information is provided even though data are missing. Some reasons for such failures have been pinpointed [6], namely, a damaged cable, the loss of proper electrical grounding, half-melted frost or snow on the dome, communications failure, and so forth. Some of these causes are temporary and may disappear spontaneously, but others require the intervention of a maintenance task force, and therefore errors persist for different periods of time. The absence of valid data may also be due to reasons such as mishandling of samples, low signal-to-noise ratio, measurement error, nonresponse, or a deleted aberrant value [7]. This is a problem for the analysis of the information coming from the measurement networks, and the imputation of these missing data [8] is necessary. Any of the variables acquired in network stations may suffer from the absence of data. If many variables are missing or corrupted in the same record, the whole sample must be withdrawn when some models are applied [9] for subsequent tasks such as control, classification, or forecast. Alternatively, if data for the same pollutant are missing in several adjacent rows, removing that variable may also be a solution. In conclusion, a complete set of data is necessary to perform a reliable study and to apply models that cannot deal with missing data.

Missing Values and Related Work.
The standard classification of the missing-data phenomenon [10] includes different situations:
(i) Missing Completely At Random (MCAR), when the probability of an instance (case) having a missing value for a variable does not depend on either the known values or the missing data.
(ii) Missing At Random (MAR), when the probability of an instance having a missing value for a variable may depend on the known values but not on the value of the missing data itself.
(iii) Not Missing At Random (NMAR), when the probability of an instance having a missing value for a variable could depend on the value of that variable.
As previous authors have pointed out, the complexity varies between these patterns of missing data [11]. Usually, in the case of air-quality data, missing values are associated with MAR or MCAR. The circumstances that may interfere with the acquisition of the data are many and not easily predictable [12].
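As an illustration only, and not part of the experiments reported here, the three missing-data mechanisms can be simulated on a toy record set. The sketch below is a minimal example under stated assumptions: the function name `make_masks`, the use of NO2 as the observed variable, the 0.5 threshold, and the missing rate are all hypothetical choices made for the example, and values are assumed to be normalized to [0, 1].

```python
import random

def make_masks(o3, no2, rate=0.2, seed=0):
    """Return three boolean lists (True = the O3 value is treated as missing).

    All thresholds and rates are illustrative assumptions.
    """
    rng = random.Random(seed)
    # MCAR: every record has the same probability of being missing.
    mcar = [rng.random() < rate for _ in o3]
    # MAR: the probability depends only on an observed variable (here NO2).
    mar = [rng.random() < (2 * rate if n > 0.5 else 0.0) for n in no2]
    # NMAR: the probability depends on the (unobserved) O3 value itself.
    nmar = [rng.random() < (2 * rate if v > 0.5 else 0.0) for v in o3]
    return mcar, mar, nmar

# Toy usage with synthetic, normalized values.
o3 = [i / 99 for i in range(100)]
no2 = [1 - v for v in o3]
mcar, mar, nmar = make_masks(o3, no2)
print(sum(mcar), sum(mar), sum(nmar))
```

In practice the NMAR case cannot be identified from the observed data alone, which is one reason why air-quality gaps are usually treated as MAR or MCAR.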
To solve the missing-data problem, a wide variety of methods have been applied up to now [8,10,13]. These imputation methods (IMs) are usually classified as follows:
(i) Single imputation (SI): the method fills in one value for each missing one [12].
(ii) Multiple imputation (MI): multiple simulated values are generated at the same time [14].
Univariate and multivariate imputation methods differ in how the missing values of the variable under study are approximated: from the remaining values of the very same variable (univariate) or from the values of the other variables (multivariate) [12].
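The distinction can be made concrete with a small sketch. It is only an illustration under assumptions: linear interpolation stands in for a univariate method and a 1-nearest-neighbour lookup stands in for a multivariate method; neither is claimed to be the method evaluated in this paper, and the function names are hypothetical. Missing values are represented by `None`, and at least one observed value is assumed.

```python
def impute_univariate(series):
    """Fill None gaps by linear interpolation within the same series."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for i, v in enumerate(series):
        if v is None:
            left = max((k for k in known if k < i), default=None)
            right = min((k for k in known if k > i), default=None)
            if left is None:            # gap at the start: copy next value
                filled[i] = series[right]
            elif right is None:         # gap at the end: copy previous value
                filled[i] = series[left]
            else:
                w = (i - left) / (right - left)
                filled[i] = series[left] + w * (series[right] - series[left])
    return filled

def impute_multivariate(target, predictors):
    """Fill None in target from the complete record with the closest predictors."""
    complete = [i for i, v in enumerate(target) if v is not None]
    filled = list(target)
    for i, v in enumerate(target):
        if v is None:
            nearest = min(
                complete,
                key=lambda j: sum((a - b) ** 2
                                  for a, b in zip(predictors[i], predictors[j])),
            )
            filled[i] = target[nearest]
    return filled

print(impute_univariate([1.0, None, 3.0]))
print(impute_multivariate([10.0, None, 30.0], [[0.0], [0.9], [1.0]]))
```

The univariate function never looks at other pollutants, so it degrades quickly on long gaps; the multivariate one ignores time order and relies entirely on the other variables of the same record.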
With the aim of reducing the complexity of other applied MI methods [11], the present paper focuses on single, multivariate imputation of the O3 magnitude in air-pollution datasets. To do so, multiple-regression (linear and nonlinear) techniques together with Artificial Neural Networks (ANN) are applied to real-life datasets obtained from public air-quality networks.
Up to now, different Artificial-Intelligence (AI) techniques have been applied for imputation of missing data. In [7] imputation methods based on six different techniques are compared: K-Nearest Neighbors (KNN), Fuzzy K-Means (FKM), Singular Value Decomposition, Bayesian Principal Component Analysis (bPCA) and Multiple Imputations by Chained Equations. These methods are applied to four datasets split into two groups of various sizes: small datasets (Iris and E. coli) and large datasets (breast cancers 1 and 2). bPCA and FKM appeared to be the most robust imputation methods in the tested conditions.
In [15] the accuracy of different imputation methods is evaluated: MissForest (MF) and Multiple Imputation based on Expectation-Maximization (MIEM), along with two other imputation methods: Sequential Hot-Deck and Multiple Imputation based on Logistic Regression (MILR). The models are applied to fourteen binary datasets, with missing-data rates between 5% and 50%. The results from 10-fold Cross-Validation (CV) show that the performance of the imputation methods varies substantially between different classifiers and at different rates of missing values.
Although many imputation methods have been proposed up to now, scant attention has been paid to validating ANN for such a task, taking advantage of their regression capability [16]. Among these previous studies, ANN have been applied for the estimation of lost values in [17], where the main goal is identifying Learning Disabilities (LD) in children at early stages. In [18], the authors proposed an SI approach relying on a Multilayer Perceptron (MLP) whose training is conducted with different learning rules, and an MI approach based on the combination of MLP and KNN. Twenty-four real and simulated datasets from the UCI repository, the Promise repository, and mldata.org were exposed to a perturbation experiment with random generation of a monotone missing-data pattern.
In [19] six different types of ANN are proposed as IM: MLP and its variations (the Time-Lagged Feedforward Network (TLFN)), the Generalized Radial-Basis-Function (GRBF) network, the Recurrent Neural Network (RNN), and its variations (the Time Delay Recurrent Neural Network (TDRNN)). Additionally, the Counterpropagation Fuzzy-Neural Network (CFNN) along with different optimization methods is applied for infilling missing daily total precipitation and extreme temperature series from 15 weather stations. The standard MLP and TLFN appear to provide the most accurate reconstruction of missing precipitation and daily extreme temperature records, with values of the R correlation coefficient between the observed and the reconstructed daily series close to 1.
In [20] a novel nonparametric algorithm named Generalized regression neural network Ensemble for Multiple Imputation (GEMI) is proposed. Additionally, an SI version of this approach (GESI) is proposed. The algorithms were tested on 98 synthetic and real-world datasets. All simulation results show the advantages of GEMI as compared with conventional algorithms. GEMI has heavy memory storage requirements but outperformed other SI algorithms.
In [21] fifteen real and simulated datasets are exposed to a perturbation experiment, based on the random generation of missing values. Several architectures and learning algorithms for the MLP are tested and compared with three classic imputation procedures: mean/mode imputation, regression, and hot-deck [22].
In [23] a methodology based on a Gaussian Mixture Model (GMM) and an Extreme Learning Machine (ELM) is developed and tested on some datasets from the UCI Machine Learning Repository and the LIACC regression repository. The GMM is used to model the data distribution, which is adapted to handle missing values, while the ELM enables devising a Multiple Imputation strategy for final estimation. The combination of GMM and ELM is shown to be superior in almost all tested cases to the method based on conditional mean imputation.
In [24] an SI approach relying on an MLP and an MI approach based on the combination of MLP and KNN are proposed. The models are applied to 18 real and simulated datasets from domains such as biology, medicine, chemistry, electronics, social surveys, census, and business. For datasets with only quantitative variables, the MIMLP model provided the best results, with IMLP being the best method for datasets with categorical variables.
In [25] a two-stage hybrid model for filling the missing values using fuzzy c-means clustering and MLP is proposed. It is applied to a Wine dataset with 1% to 5% of generated missing values, and the accuracy of the model is checked using the Mean Absolute Percentage Error (MAPE). The MAPE obtained for stage 2 (MLP regression on the dataset obtained by applying fuzzy c-means in stage 1) is 4.95% for 1% missing-value records and 8.36% for 5% missing-value records.
In the case of air-quality data, few imputation methods have been proposed up to now. In [13], an important set of SI methods (Listwise, Unconditional Mean, Modified Median, Principal Component-based, and Expectation-Maximization (EM) (Regularized-EM)) and MI methods are applied to three datasets with the most important pollutant variables (NO, NO2, NOx, CO, O3, PM10, and PM2.5) and a percentage of missing data between 3.85% and 23.52%, depending on the year. Missing data of the eight variables are imputed in order to assess the effectiveness of the methods applied. In general, MI tends to yield more scattered values than its counterparts, mainly when the variables have many voids and correlate poorly with the other variables, like CO with 43.5% of missing data in 2006.
In [11] some methods for the imputation of missing air-quality data are compared: SI methods (linear, spline, and nearest-neighbor interpolations), MI methods (regression-based imputation, multivariate nearest neighbor, Self-Organizing Maps (SOM), and Multilayer Backpropagation (MLBP) nets), and hybrids of the aforementioned. The dataset comprises the most common pollutants (NOx, NO2, O3, PM10, SO2, and CO concentrations), all hourly averaged, together with four meteorological parameters. The performance of the univariate interpolation methods was limited; in general, they were able to fill only very short gaps of contiguous missing data. The general performance of the applied imputation methods was fairly good for the pollutants (NOx, NO2, O3, PM10, SO2, and CO), which are the most important variables in terms of air-quality modelling, but not so good for the meteorological variables. The results suggested that SOM and MLBP are the methods of choice for air-quality data imputation and that even better results can be achieved by using MI.

Main Contributions. The main contributions of this work are as follows:
(i) An in-depth study of the real-life human health protection task in the Spanish region of Castilla y León.
(ii) Multisensor analysis of O3 data.

(iii) Experimental evaluation of the proposed approach based on multiple-regression techniques together with ANN models.
To the best of the authors' knowledge, this is the first approach to O3 imputation based on both MLP and Radial-Basis-Function Networks.
The rest of this paper is organized as follows. Section 2 presents the techniques and models applied. Section 3 details the real-life case study that is addressed in the present work, while Section 4 describes the experiments and results. Finally, Section 5 sets out the main conclusions and future work.

Regression Techniques and ANN Models
In order to fill missing or corrupted values of O3 in high-dimensional datasets with air-quality information, two regression techniques and two ANN models have been applied in the present study. The techniques applied as imputation methods are described in this section.

Regression Techniques.
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable [26].
The general purpose of multiple regression [27] is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable.

Multiple Linear Regression.
Multiple linear regression (MLR) attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data [28]. Every value of the independent variable $x$ is associated with a value of the dependent variable $y$. The population regression line for $p$ explanatory variables $x_1, x_2, \ldots, x_p$ is defined to be
$$\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p.$$
This line describes how the mean response $\mu_y$ changes with the explanatory variables. The observed values for $y$ vary about their means $\mu_y$ and are assumed to have the same standard deviation $\sigma$. The fitted values $b_0, b_1, \ldots, b_p$ estimate the parameters $\beta_0, \beta_1, \ldots, \beta_p$ of the population regression line.
Since the observed values for $y$ vary about their means $\mu_y$, the multiple-regression model includes a term for this variation. The model is expressed as DATA = FIT + RESIDUAL, where the "FIT" term represents the expression $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$. The "RESIDUAL" term represents the deviations of the observed values $y$ from their means $\mu_y$, which are normally distributed with mean 0 and variance $\sigma^2$.
The notation for the model deviations is $\varepsilon$.
Formally, the model for multiple linear regression, given $n$ observations, is [28]
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i, \quad i = 1, 2, \ldots, n. \tag{3}$$
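The fitted coefficients of the MLR model are usually obtained by least squares. The following is a minimal sketch, assuming ordinary least squares solved through the normal equations with plain Gauss-Jordan elimination; the function names (`solve`, `fit_mlr`, `predict_mlr`) and the toy data are hypothetical, and no claim is made that this matches the software used in the experiments.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_mlr(X, y):
    """Return [b0, b1, ..., bp] minimising the sum of squared residuals."""
    rows = [[1.0] + list(x) for x in X]          # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def predict_mlr(b, x):
    """Evaluate b0 + b1*x1 + ... + bp*xp for one record."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))

# Toy usage: noise-free data generated from y = 1 + 2*x1 + 3*x2.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 4, 6, 8]
b = fit_mlr(X, y)
print(b, predict_mlr(b, [3, 2]))
```

For imputation, the model would be fitted on complete records and `predict_mlr` applied to the records where only the target variable is missing.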

Multiple Nonlinear Regression.
A Multiple Nonlinear Regression (MN-LR) is a form of regression analysis in which observational data are modelled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables [29]. The data are fitted by a method of successive approximations.
The parameters can enter the model through an exponential, trigonometric, power, or any other nonlinear function. To determine the nonlinear parameter estimates, an iterative algorithm is typically used. The model can be written as
$$y = f(x, \beta) + \varepsilon,$$
where $f$ is the nonlinear function of the independent variables $x$, $\beta$ represents the nonlinear parameter estimates to be computed, $y$ is the dependent or criterion variable, and $\varepsilon$ represents the error terms.
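To illustrate what "successive approximations" means in practice, the sketch below fits a two-parameter exponential model by Gauss-Newton iterations. It is only an example under assumptions: the model form y = a * exp(b * x), the log-linear starting point (which requires y > 0), the step-halving safeguard, and the function names are all choices made for this illustration, not details taken from the study.

```python
import math

def sse(xs, ys, a, b):
    """Sum of squared errors of the model a * exp(b * x)."""
    return sum((y - a * math.exp(b * x)) ** 2 for x, y in zip(xs, ys))

def fit_exponential(xs, ys, iterations=20):
    """Gauss-Newton fit of y = a * exp(b * x); returns (a, b)."""
    n = len(xs)
    # Starting point: least squares on log(y) = log(a) + b * x (assumes y > 0).
    ly = [math.log(y) for y in ys]
    mx, my = sum(xs) / n, sum(ly) / n
    b = (sum((x - mx) * (l - my) for x, l in zip(xs, ly))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    for _ in range(iterations):
        e = [math.exp(b * x) for x in xs]
        r = [y - a * ei for y, ei in zip(ys, e)]       # residuals
        ja = e                                         # d f / d a
        jb = [a * x * ei for x, ei in zip(xs, e)]      # d f / d b
        # Normal equations (J^T J) delta = J^T r, solved in closed form (2x2).
        saa = sum(u * u for u in ja)
        sab = sum(u * v for u, v in zip(ja, jb))
        sbb = sum(v * v for v in jb)
        ga = sum(u * ri for u, ri in zip(ja, r))
        gb = sum(v * ri for v, ri in zip(jb, r))
        det = saa * sbb - sab * sab
        if det == 0:
            break
        da = (ga * sbb - sab * gb) / det
        db = (saa * gb - sab * ga) / det
        # Halve the step until the error does not increase.
        step = 1.0
        while step > 1e-8 and sse(xs, ys, a + step * da, b + step * db) > sse(xs, ys, a, b):
            step /= 2
        if step <= 1e-8:
            break
        a, b = a + step * da, b + step * db
    return a, b

xs = [0, 1, 2, 3, 4]
ys = [2 * math.exp(0.5 * x) for x in xs]
print(fit_exponential(xs, ys))
```

Each iteration solves a linear least-squares problem built from the current parameter estimates, which is why nonlinear regression is typically slower than MLR on the same data.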

Artificial Neural Networks.
Artificial Neural Networks (ANN), also known as Artificial Neural Systems (ANS), connectionist systems, adaptive networks, and distributed and parallel processing systems, are simplified models of natural neural systems. The following definition, given by Hecht-Nielsen in 1989 [30], formalizes the concept of ANN: an ANN is a distributed, parallel-processing computer system consisting of a set of elementary processing units equipped with a small local memory and interconnected in a network through connections with associated weights. Each processing unit has one or more input connections and a single output connection that branches into as many collateral connections as desired. All processing associated with an elementary unit is local; that is, it depends only on the values of the unit's input signals and on its internal state.

Multilayer Perceptron (MLP).
The MLP consists of a system of simple interconnected neurons, or nodes. The nodes are connected by weights, and their output signals are a function of the sum of the inputs to the node, modified by a simple nonlinear transfer, or activation, function. The architecture consists of several layers of neurons; the input layer serves to pass the input vector to the network. The terms "input vectors" and "output vectors" refer to the inputs and outputs of the MLP and can be represented as single vectors [31]. An MLP may have one or more hidden layers and, finally, an output layer. MLPs are fully connected, with each node connected to every node in the next and previous layers.
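The forward computation described above can be sketched in a few lines. This is an illustration under assumptions: a single hidden layer, a hyperbolic tangent activation, and a linear output node are chosen here for simplicity, and the function name and weights are hypothetical; training (the adjustment of the weights) is not shown.

```python
import math

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of an MLP with one tanh hidden layer and a linear output."""
    # Each hidden node applies the activation to the weighted sum of its inputs.
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(w_hidden, b_hidden)
    ]
    # The output node combines the hidden activations linearly.
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# Toy usage: five normalized explanatory values, two hidden nodes.
x = [0.2, 0.4, 0.1, 0.7, 0.5]
w_hidden = [[0.1, -0.2, 0.3, 0.0, 0.5], [-0.4, 0.2, 0.1, 0.3, -0.1]]
print(mlp_forward(x, w_hidden, [0.0, 0.1], [0.6, -0.3], 0.05))
```

With five explanatory pollutants as inputs and the O3 value as the single output, this is the shape of network that would serve as a regression-based imputation model.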

Radial-Basis-Function Networks (RBFN).
In an RBFN [33], each unit in the hidden layer has its own centroid and, for each input vector $x = (x_1, x_2, \ldots, x_r)$, it computes the distance between $x$ and its centroid. The output of the unit is calculated as a nonlinear function of this distance.
Assuming that there are $r$ input nodes and $m$ output nodes, the overall response function, without considering nonlinearity in an output node, has the following form [34]:
$$y(x) = \sum_{i=1}^{n} w_i K\!\left(\frac{x - z_i}{\sigma_i}\right) = \sum_{i=1}^{n} w_i\, g\!\left(\frac{\lVert x - z_i \rVert}{\sigma_i}\right),$$
where $n \in \mathbb{N}$ is the number of units in the hidden layer, $w_i \in \mathbb{R}^m$ is the vector of weights linking the $i$th hidden-layer unit to the output nodes, $x$ is an input vector, $K$ is a radially symmetric kernel function of a unit in the hidden layer, $z_i$ and $\sigma_i$ are the centroid and smoothing factor of the $i$th kernel node, respectively, and $g: [0, \infty) \to \mathbb{R}$ is a function called the activation function, which characterizes the kernel shape.
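The response function can be evaluated directly once centroids, smoothing factors, and weights are known. The sketch below is a minimal example, assuming a Gaussian activation g(u) = exp(-u^2), which is a common but here hypothetical choice; the function name and the toy parameters are likewise assumptions, and the procedure for choosing centroids and weights is not shown.

```python
import math

def rbfn_response(x, centroids, sigmas, weights, g=lambda u: math.exp(-u * u)):
    """Sum over hidden units of w_i * g(||x - z_i|| / sigma_i); returns m outputs."""
    m = len(weights[0])
    out = [0.0] * m
    for z, s, w in zip(centroids, sigmas, weights):
        dist = math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))
        k = g(dist / s)                 # activation of this hidden unit
        for j in range(m):
            out[j] += w[j] * k
    return out

# Toy usage: two hidden units, two inputs, one output.
centroids = [[0.0, 0.0], [1.0, 1.0]]
print(rbfn_response([1.0, 1.0], centroids, [0.5, 0.5], [[2.0], [3.0]]))
```

Because each unit responds mainly to inputs near its centroid, the output is dominated by the hidden units closest to the input vector.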

Case Study
In the present study, data from air-quality stations in Castilla y León (CyL) are analyzed. CyL is a Spanish region located in the north-center of the Iberian Peninsula. It is composed of nine provinces; it is the largest region of Spain, with a total area of 94,226 square kilometers, and the sixth most populous, with 2,435,797 inhabitants. The Gross Domestic Product (GDP) of CyL represents 5.3% of the country's GDP [35]. The climate in CyL approaches what is known as the continental ocean, characterized by cold winters and hot summers with short spring and autumn periods.
The CyL region provides a wide network of stations [36] for the acquisition of air-quality data. These data are publicly available under the Open Data Initiative of the Spanish Government [37].
Stations from this network have some interesting characteristics:
(1) Stations are classified into types: urban, background, and oriented to vegetation protection [36].
(2) These stations collect the fundamental air-quality pollutants, among them O3, which is the target pollutant of this study. Daily average data [38] for each pollutant are provided at each location.
(3) The data contain empty or corrupted values in all variables in some rows, in a percentage reasonable enough to be estimated.
In the present study, pollutant data recorded at six different stations of the CyL network are analyzed. Daily data averages from 2000 to 2008 have been selected. For some periods within the selected time window, data are not available for all the variables and, thus, the whole example is rejected for the study. Three of the stations are located in city centers and labeled as urban stations; these stations are oriented to the protection of human health. The other three are background stations and are also oriented to the protection of human health. These stations measure a greater number of pollutants than the other type of station; these pollutants are the most important ones in terms of air quality, and many of them are not collected at the stations for vegetation protection. This fact is important for the determination of the missing O3 values, as this gas is especially harmful to human health.
The three background stations are as follows:
(1) Burgos. "Fuentes Blancas" station. Geographical coordinates: 42.33611, −3.63611; 929 masl.
Figure 1 shows the location of the six selected stations studied in the present paper.
The pollutants gathered in the above-mentioned stations and analyzed in the present study are as follows:
(1) Ozone (O3), µg/m3, secondary pollutant. See Section 1.
Carbon monoxide (CO) is an odorless, colorless gas formed by the incomplete combustion of fuels. When people are exposed to CO gas, the CO molecules displace the oxygen in their bodies and lead to poisoning [39].
From the standpoint of health protection, exposure limits for long and short durations have been set for nitrogen dioxide [39].
(5) Particulate matter (PM10), µg/m3, primary pollutant. These particles remain stable in the air for long periods of time without falling to the ground and can be moved significant distances by the wind. It is defined by the ISO as follows: "particles which pass through a size-selective inlet with a 50% efficiency cut-off at 10 µm aerodynamic diameter. PM10 corresponds to the 'thoracic convention' as defined in ISO 7708:1995, Clause 6" [40].
(6) Sulphur dioxide (SO2), µg/m3, primary pollutant. It is a gas with a suffocating smell, like that of burnt matches. SO2 is produced by volcanoes and in various industrial processes. In the food industry, it is also used to protect wine from oxygen and bacteria [39].
Primary pollutants are injected into the atmosphere directly. Secondary pollutants are formed in the atmosphere through chemical and photochemical reactions from the primary pollutants [36].
All data from these six variables were normalized for the study. Moreover, the variables are only weakly correlated with one another. Table 1 shows the correlation matrix of the six pollutants of the case study.
It is worth mentioning that O3 is the most independent pollutant, as its correlation coefficients with the rest of the variables are close to zero.
There are a total of 13,526 samples, as one sample per day (daily average) was collected for the twelve months of every year between 2000 and 2008 at the six stations analyzed in this study. Missing or corrupted data appear in all the variables in some rows, which are omitted from the study.
Table 2 shows the percentage of missing or corrupted data present in each variable in the whole dataset.
All the samples with at least one missing or corrupted value were removed from the dataset.
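The two preparation steps mentioned in this section, removal of incomplete samples and normalization, can be sketched as follows. The normalization method is not specified in the text, so min-max scaling to [0, 1] is an assumption made purely for this illustration; the function names and the toy rows are hypothetical, and missing values are represented by `None`.

```python
def drop_incomplete(rows):
    """Keep only the records in which every variable has a value."""
    return [r for r in rows if all(v is not None for v in r)]

def min_max_normalize(rows):
    """Scale every column to [0, 1]; constant columns are mapped to 0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(r, lo, hi)]
        for r in rows
    ]

# Toy usage: three records with two variables; the second is incomplete.
rows = [[1.0, 10.0], [None, 20.0], [3.0, 30.0]]
complete = drop_incomplete(rows)
print(min_max_normalize(complete))
```

Removing incomplete records first ensures that the column minima and maxima are computed only from valid measurements.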

Experiments, Results, and Discussion
The main target of this paper is to fill missing O3 values in air-pollution datasets. To do so, several imputation methods are comprehensively compared as described below.

Experimental Settings.
The imputation methods described in Section 2 are applied to different datasets, all of them with the six variables described in Section 3: [...]
For the three datasets, both statistical and neural imputation methods were applied, and the performance is calculated through n-fold Cross-Validation (CV). The main idea behind CV is to split the data, normally many times, for estimating the [...] 3-11. The Mean and the STD of the execution time (in seconds) are also presented in Tables 3-11 for the 10 folds.
For MLP and RBFN different network topologies have been applied: combinations of 10, 20, and 30 neurons in the hidden layer.Additionally, in the case of MLP, the model is trained 10 times with the same combination of parameters to reduce the effect of randomness and get more statistically significant results.
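The evaluation protocol, the mean and standard deviation of the MSE and of the execution time over the folds, can be sketched as below. This is a minimal illustration under assumptions: the folds are interleaved without shuffling, the number of records is at least k, timing covers both fitting and prediction, and the `fit`/`predict` callables and the mean-value baseline are hypothetical stand-ins for the imputation models.

```python
import statistics
import time

def k_fold_mse(X, y, fit, predict, k=10):
    """Return (MSE mean, MSE STD, time mean, time STD) over k folds."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]   # interleaved, no shuffling
    errors, times = [], []
    for test_idx in folds:
        held = set(test_idx)
        train_idx = [i for i in range(n) if i not in held]
        start = time.perf_counter()
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [predict(model, X[i]) for i in test_idx]
        times.append(time.perf_counter() - start)
        errors.append(sum((p - y[i]) ** 2 for p, i in zip(preds, test_idx))
                      / len(test_idx))
    return (statistics.mean(errors), statistics.stdev(errors),
            statistics.mean(times), statistics.stdev(times))

# Hypothetical baseline: always predict the mean of the training targets.
def mean_fit(X, y):
    return sum(y) / len(y)

def mean_predict(model, x):
    return model

X = [[i / 19] for i in range(20)]
y = [0.5 + 0.1 * x[0] for x in X]
print(k_fold_mse(X, y, mean_fit, mean_predict, k=10))
```

Any of the models in Section 2 could be plugged in through `fit` and `predict`; repeating the whole procedure several times, as done for MLP, would further reduce the effect of random weight initialization.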

Results from the Whole Dataset.
In this section, results in terms of MSE and execution time when applying MLR, MN-LR, RBFN, and MLP to the Whole Dataset (WD) are presented.
In Tables 3 and 4, it can be observed that the MSE Mean values for the determination of O3 are very similar for the three applied methods (MLR, MN-LR, and RBFN). In the case of RBFN, slightly lower values of MSE are obtained, with the lowest one being obtained with 10 neurons in the hidden layer. Regarding execution times, the MN-LR method turns out to be the slowest and RBFN the quickest. The high values of STD for the runtime in the case of RBFN are due to the fact that it varies greatly from one fold to another.
As can be seen in Table 5, the LM, SCG, and BR training algorithms present the lowest values of MSE Mean in all cases (10, 30, and 50 neurons), and these values are very close to those shown in Tables 3 and 4. The lowest value of MSE was obtained with the LM learning algorithm and 50 neurons. The learning algorithm that attained the worst results (in terms of MSE) is GDX. With respect to execution time, the SCG algorithm attained the best results, LM and BR were the second best, and TB was the slowest of the five algorithms. As expected, the training algorithms take more time when 50 neurons are defined in the hidden layer, with TB being the most affected.

Results from the Season Dataset.
In Tables 6-8 results of applying MLR, MN-LR, RBFN, and MLP to subsets with data from the four seasons of the year (spring, summer, autumn, and winter) are presented.
In Tables 6 and 7 the three methods present similar values of MSE Mean, and the lowest MSE Mean is achieved by the RBFN with 50 neurons in the hidden layer for the summer season. The MSE Mean values are higher than those observed for the WD. The season of the year with the lowest values of MSE Mean is the summer. One reason may be that there are few variations in pollution conditions during summer time. This is due to the small variation in weather conditions during summer as well as low industrial activity and traffic in urban areas due to vacation time. Furthermore, correlation coefficients for more than 20 pollutants analyzed in [42] are higher for measurements in the summer than for measurements over all days combined. The season of the year with the worst MSE results is the autumn in the case of the two regression techniques and RBFN, although the differences between the three seasons (spring, summer, and autumn) are not significant. In terms of execution time, it is shown once again that MN-LR is the slowest method, while RBFN is the quickest one, returning very similar results for the four seasons of the year.
In Table 8, similarly to Table 5, the training algorithms that achieve the best results in terms of MSE Mean are LM, SCG, and BR. LM achieved the best value of MSE Mean in 10 of the 12 cases shown in Table 8, being exceeded by BR by a minimal margin for the winter and spring seasons with a configuration of 10 neurons. GDX records the worst MSE values in the 12 cases shown in Table 8. Again, the best MSE Mean is obtained for the summer season, reducing the MSE Mean in comparison with the values registered by RBFN. The season of the year with the worst MSE results is the spring, although the difference between spring, autumn, and winter is minimal.
In terms of execution time, the SCG algorithm is the fastest one (in terms of mean execution time), with slight variations (low STD). LM and BR perform very well in terms of runtime, with very similar results. Finally, the TB algorithm is the slowest one in the 12 cases shown in Table 8. The execution time of this algorithm is very sensitive to the increase in the number of neurons in the hidden layer. It is worth mentioning that the best execution time has been obtained by SCG for the summer season, the same season for which the best value of MSE is achieved.
Figure 2 shows the boxplot for the results shown in Table 8. Each box represents the MSE Mean values for the whole dataset (four seasons), for a certain number of neurons and a training algorithm.
In Figure 2 it can be observed that the LM and SCG training algorithms outperform the other algorithms and that the TB algorithm achieved the worst results. It is also worth mentioning that, in general terms, increasing the number of neurons in the hidden layer causes an increase in the MSE due to the loss of generalization capability of the models (especially in the TB and GDX algorithms). The difference between the 25th and 75th percentiles is also higher in the case of algorithms achieving poor results in the Season Dataset, especially for the TB training algorithm.
As in Sections 4.2 and 4.3, MLR, MN-LR, and RBFN again achieve similar results in terms of MSE, but on this occasion the MSE is higher than in Sections 4.2 and 4.3. In turn, the urban stations obtain a better MSE than the background stations; this indicates that pollution levels are more constant at the urban stations than at those farthest from the center. In terms of execution time, MN-LR is again the slowest method. The RBFN shows a more efficient response than the regression methods.
For MLP, as with the other datasets (Sections 4.2 and 4.3), the training algorithms which achieved the lowest MSE Mean values are LM, SCG, and BR. These values are similar to those obtained by RBFN (Table 10) and lower than the values associated with the regression techniques in Table 9. According to the station type, generally speaking, lower MSE values were obtained for the "urban" stations than for the "background" stations. The lowest MSE value was obtained for "urban" stations with 50 neurons and the LM algorithm. The only training algorithm which returns higher values of MSE Mean for the "urban" stations than for the "background" stations is GDX. This happened for the three different numbers of neurons, while the other four algorithms get lower MSE values for the "urban" stations for all the numbers of neurons in the hidden layer. The lower MSE indicates that the missing O3 values are easier to estimate for the "urban" stations; this is due to fewer variations in the pollution values in these stations.
In terms of execution time, the SCG algorithm is the quickest one in the six cases shown in Table 11, followed by LM, with no big difference depending on the number of neurons in the hidden layer (it is only a little faster with 10 neurons). The slowest training algorithm is again TB in the six cases, which is consistent with the results in Tables 5 and 8.
4.5. Discussion. The two applied regression techniques (MLR and MN-LR) obtained similar values of MSE in most cases, in terms of both Mean and STD. However, MN-LR obtained poor results in terms of execution time (Tables 3, 6, and 9), even worse than the slowest training algorithms for MLP (GDX and TB in Tables 5, 8, and 11).
For the ANN models (RBFN and MLP), different numbers of neurons in the hidden layer were compared. For the sake of brevity, only the results for 10, 30, and 50 neurons (Tables 4, 5, 7, 8, 10, and 11) have been included in the present paper. RBFN achieves the best execution times, outperforming the fastest algorithm for MLP (SCG in Tables 5, 8, and 11). In the case of MLP, the results vary depending on the training algorithm applied, with the best results (in terms of MSE) obtained when learning through the LM and SCG algorithms. The SCG algorithm is additionally the fastest one. GDX has been identified as the algorithm with the worst error, as can be seen in Tables 5, 8, and 11. No significant improvement is observed in the estimation of missing values, according to MSE, when increasing the number of neurons in the hidden layer. On the contrary, the selection of the training algorithm has been identified as a key factor when applying MLP. An increase in the number of neurons in RBFN does not considerably affect the results in terms of MSE and execution time (see Tables 4, 7, and 10). MLP achieves a better MSE value than RBFN if the training algorithm is properly selected.
Taking into account the different datasets, the lowest MSE for the Season Dataset is obtained for the summer season when applying the LM training algorithm with 50 neurons, with no big differences among the other three seasons of the year. For the spring, autumn, and winter seasons the best MSE also corresponds to the LM algorithm combined with 50 neurons. In terms of execution time, the fastest experiments were RBFN with 10 and 30 neurons for the summer season, SCG with 50 neurons for the spring season, RBFN with 50 neurons for the autumn season, and RBFN with 30 neurons for the winter season. The best results in terms of MSE, obtained for the "urban" stations in the Station Type Dataset and for the summer season in the Season Dataset, are accompanied by the lowest execution times. It must be mentioned that good results have been obtained, in terms of MSE, when applying the four imputation methods to the WD. This fact indicates that there are no great variations either between the seasons of the year or between the analyzed types of station ("urban" and "background").

Conclusions
In the present work, several different imputation methods are proposed for dealing with missing O3 values in multidimensional real-life datasets with air-quality information. To do this, two multiple-regression techniques (linear and nonlinear) and two ANN models (RBFN and MLP) with different training algorithms and different numbers of neurons in the hidden layer have been compared. As a validation scheme, 10-fold cross-validation has been applied to the different datasets. The imputation task has been carried out first on the complete dataset and then on different datasets, where the original data are split according to two criteria: the season and the station type. The following conclusions are worth mentioning:
(1) MLR and MN-LR attained very similar results in terms of MSE and execution time. These results are slightly worse than those obtained by the two ANN models (RBFN and MLP). The lowest value of MSE has been obtained for the WD (applying the MN-LR technique) and the highest one for the SD (also applying the MN-LR technique).
(2) In the case of RBFN, slight differences have been obtained when varying the number of neurons in the hidden layer, in terms of both the MSE and the execution time. The best results have been obtained for the WD (with 10 neurons in the hidden layer) and the worst for the SD (with 50 neurons in the hidden layer), as happened for MLR and MN-LR.
(3) In the case of MLP, the best results are achieved when using the LM training algorithm and 50 neurons in the hidden layer. As in the previous case (RBFN), the best results are obtained for the WD and the worst results for the SD, with small differences between the results in the three datasets. These are the best results from the whole experimentation in the present paper. The results obtained by MLP improve on those obtained by RBFN only when applying the LM training algorithm.
(4) The CV technique guarantees reliability of the results when dealing with large datasets.
As future work, the application of additional artificial-intelligence models for the imputation of O3 and other pollutants is proposed, comparing the results with those obtained in the present study.

Figure 1: Location of the six selected stations in CyL, by Google Maps.

(1) The Whole Dataset (WD), comprising the 13,526 samples: results for this dataset are shown in Section 4.2.
(2) The Season Dataset (SD): samples in WD are split into four subsets according to the four seasons of the year: spring (3,453 samples), summer (3,349 samples), autumn (3,295 samples), and winter (3,429 samples). Results for this dataset are shown in Section 4.3.
(3) The Type station Dataset (TD): samples in WD are split into two subsets according to the type of the station where the data come from: "urban" (6,763 samples) or "background" (6,763 samples). Results for this dataset are shown in Section 4.4.
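The partitioning described in this list can be sketched in a few lines of Python. This is only an illustrative sketch: the helper `split_by`, the record layout, and the field names `season` and `station_type` are assumptions introduced here, and the toy records are not real measurements. Only the sample counts are taken from the text, and the sketch checks that each partition covers the 13,526 samples of the WD exactly once.

```python
from collections import defaultdict

# Sample counts as reported for each partition of the Whole Dataset (WD).
WD_SIZE = 13_526
SEASON_SIZES = {"spring": 3_453, "summer": 3_349, "autumn": 3_295, "winter": 3_429}
STATION_TYPE_SIZES = {"urban": 6_763, "background": 6_763}

def split_by(records, field):
    """Group records (dicts) into subsets keyed by the value of `field`."""
    subsets = defaultdict(list)
    for record in records:
        subsets[record[field]].append(record)
    return dict(subsets)

# Both partitions must cover the WD exactly once.
assert sum(SEASON_SIZES.values()) == WD_SIZE
assert sum(STATION_TYPE_SIZES.values()) == WD_SIZE

# Hypothetical toy records; the field names are illustrative assumptions.
toy_wd = [
    {"season": "summer", "station_type": "urban"},
    {"season": "summer", "station_type": "background"},
    {"season": "winter", "station_type": "urban"},
]
season_subsets = split_by(toy_wd, "season")
type_subsets = split_by(toy_wd, "station_type")
print({name: len(rows) for name, rows in season_subsets.items()})
```

Because every sample belongs to exactly one season and one station type, the SD and the TD are two different partitions of the same WD rather than independent datasets.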

Figure 2: Boxplot for the MLP applied to the Season Dataset (SD).

4.4. Results from the Station Type Dataset. Tables 9, 10, and 11 show the results of applying the four techniques to two different subsets, according to the station type: urban or background (see Section 4.1 for further details).

Table 1: Correlation matrix of the six variables in the dataset.

Table 2: Percentage of missing and corrupted data for each one of the analyzed variables.

Table 3: Linear regression and nonlinear regression results for the WD.

Table 4: Radial-basis function network results for the WD.
sample. The value of the k parameter (number of data partitions) was 10 for all the experiments in the present study. It means that, in each fold, 90% of the data are used for training and 10% for validation. In the case of neural models, the training process is repeated ten times (once for each fold). In the case of MLP, training is also repeated for each training algorithm (see Section 2.2). For all the experiments, the Mean and the Standard Deviation (STD) of the Mean Square Error (MSE) over the ten folds are presented in Tables
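The 10-fold scheme described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function names are hypothetical, a single-predictor least-squares line stands in for the regression and ANN models, the data are synthetic, and the sample standard deviation is assumed for the STD.

```python
import random
import statistics

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (stand-in for the real models)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def cross_validated_mse(xs, ys, k=10, seed=0):
    """Return (Mean, STD) of the validation MSE over the k folds."""
    fold_mse = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]  # ~90% for k=10
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (a * xs[i] + b)) ** 2 for i in held_out]  # ~10%
        fold_mse.append(statistics.fmean(errors))
    # Assumption: sample standard deviation across folds.
    return statistics.fmean(fold_mse), statistics.stdev(fold_mse)

# Synthetic data only (not the measured data): y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(0, 1) for _ in range(500)]
ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]
mse_mean, mse_std = cross_validated_mse(xs, ys)
print(f"MSE Mean={mse_mean:.4f}, STD={mse_std:.4f}")
```

Each sample is used for validation exactly once, so the reported Mean and STD summarize ten separately trained models rather than a single train/validation split.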

Table 5: Multilayer perceptron results for the WD.

Table 6: Linear regression and nonlinear regression results for the Season Dataset.

Table 7: Radial-basis function network results for the Season Dataset.

Table 8: Multilayer perceptron results for the Season Dataset.

Table 9: Linear regression and nonlinear regression results for the Type Dataset.

Table 10: Radial-basis function network results for the Station Type Dataset.

Table 11: Multilayer perceptron results for the Station Type Dataset.