Short-Term Prediction of Air Pollution in Macau Using Support Vector Machines

Forecasting air pollution has become a popular and important topic in recent years because of the health impact of air pollution. An early warning system is needed that provides forecasts and allows medical practitioners and the local government to issue health alerts to local inhabitants. Meteorological and pollution data collected daily at monitoring stations in Macau are used in this study to build a forecasting system. Support vector machines (SVMs), a novel type of machine learning technique based on statistical learning theory, can be used for regression and time series prediction. SVMs are capable of good generalization, but the performance of an SVM model often hinges on the appropriate choice of kernel.


Introduction
Air pollution is often the result of economic development and population growth. It is particularly common in developing cities, especially those in China and India. Many epidemiologic studies [1][2][3] have reported that air pollution is often associated with adverse effects on human respiratory health, particularly in susceptible individuals. For example, ozone has been reported to cause airway inflammation and to heighten the airway response to inhaled allergens. It may increase the risk of asthma developing among children who take part in outdoor sports [4]. The WHO [5] reported that these health problems may in turn increase the burden on health care systems in the long run and be detrimental to the economy. To reduce the burden on health care caused by diseases related to atmospheric pollutants, an early warning system is necessary. The success of an early warning system, which provides forecasts and alerts local inhabitants, depends on the reliability and availability of up-to-date meteorological information and pollution data. For instance, medical practitioners can advise patients to minimize outdoor activities on days of high pollution and smog, depending on the prediction of the early warning system.
The meteorological and pollutant data of Macau are used as a case study for testing the forecasting model in a representative developing city. Macau, located on the southern coast of China with a land area of merely 26.8 square miles, comprises three land zones: the Macau peninsula, Taipa, and Coloane (Figure 1). The Macau peninsula has the characteristics of a hybridized, urbanized area; Taipa is mainly residential; Coloane has a power station and is largely undeveloped, with the largest green areas. The population density reached 20,493 people per square mile in 2008 [6], one of the highest in the world. The resident population of Macau is projected to increase at an average annual rate of 1.9%, from 513,000 in 2006 to 829,000 in 2031 [7]. The number of vehicles reached 188,668 by the end of 2009, more than triple the number in 1999 [8]. Since 1995, the Macau Government has regulated the import of lead-free petroleum products in line with other developed countries, and the sulfur content of gasoline must be lower than 0.05% by weight. Furthermore, in 2004, the sulfur content of petroleum products used in the power station was also regulated. Implementation of these policies reduces the local emission of pollutants. In addition, construction and infrastructure projects have been transforming the landscape.
Monitoring and forecasting of ambient air pollutant levels involve a variety of approaches, for example, on-site measurement, computational fluid dynamics (CFD) simulation, and computational intelligence. The artificial neural network (ANN) method is regarded as cost-effective and has been employed by environmental researchers to construct prediction models for a variety of cities [9][10][11][12]. The practical application of these models, however, suffers from several drawbacks, for example, local minima, overfitting, poor generalization, and the need to determine the appropriate network architecture. Support vector machines (SVMs), developed by Vapnik [13], can provide an effective novel approach to improve the generalization performance of neural networks and achieve global solutions simultaneously. SVMs can overcome most drawbacks of ANNs and have been reported to show promising results [14][15][16]. However, the performance of the resulting SVM often hinges on the appropriate choice of kernel. Several kernels are commonly used in SVM regression. Therefore, another aim of this study is to determine which kernel is most suitable for air pollution prediction.

Meteorological Data
The meteorological information and pollutant data measured at the Taipa Grande automatic meteorological station (see Figure 1) from 2003 to 2006 were selected as the experimental data set and were extracted from Macau Government's centre. Since the land area of Macau is relatively small, the data obtained at Taipa Grande (at an elevation of approximately 150 m above sea level) may be considered representative of the entire region of Macau. The meteorological stations record pollutant data, such as nitrogen dioxide (NO₂), sulfur dioxide (SO₂), suspended particulate matter (SPM), and ozone (O₃), and climatic data, such as temperature, humidity, rainfall, wind direction, wind speed, and precipitation. The daily average value of the air pollutant and meteorological data is considered a more representative measure and is adopted in this study.
In addition, the recorded levels of SPM, SO₂, NO₂, and O₃ in January and July 2006 were selected as special cases in this study. These two months were chosen because January and July represent the winter and summer seasons in Macau, respectively. January is typified by dry weather and a dominant northeastern wind, whereas July is typified by humid, hot weather and a prevailing southeastern wind. The temperature in these two seasons in Macau may range from 10 to 30 °C. In winter, the dominant northeastern wind may blow airborne industrial pollutants from mainland China through Macau. In contrast, the prevailing southeastern wind from the sea in summer usually carries pollutants away.
Moreover, these meteorological data are closely associated with the presence and dispersion of pollutants. In order to discern the relationship between meteorological data and pollutants, an unadjusted crude method of bivariate analysis 2. The results illustrate some critical problems in using a limited amount of data to determine the relationship between bivariate independent variables. For instance, atmospheric pressure appeared to have a positive correlation with SPM, NO₂, and SO₂ over a period of one year, whereas the same parameters showed no relationship over a three-year period. To limit the computation required by the regression model, only parameters with a Pearson correlation coefficient greater than 0.5 were selected as model inputs. An exception was made for the parameters related to O₃, for which a Pearson correlation coefficient greater than or equal to 0.4 was used.
Apart from the physical significance of some meteorological variables, such as sunshine rate, to the production of O₃, this exception was necessary to avoid having too few inputs in the model, which might then fail to account for the fluctuation of O₃ levels. The input variables available to the regression model are summarized in Table 1.
Table 2 shows the Pearson correlation coefficients between pollutants in different time series. The three time series of SPM, SO₂, and O₃ are at, or even fall below, the significance level of 0.5 after lag 2. Hence, to improve the accuracy of the models and limit their computation, only the air pollutant and meteorological data of the current day and the previous day were used in this study to predict the air pollutant level of the following day.
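The lag analysis described above can be sketched as follows. This is a minimal illustration, assuming that "lag k" means correlating a pollutant's daily series with the same series shifted by k days; the function names and the toy series are hypothetical and are not taken from the study.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def lagged_correlation(series, lag):
    """Correlation between day d and day d - lag of the same series."""
    return pearson(series[lag:], series[:len(series) - lag])

def useful_lags(series, max_lag, threshold=0.5):
    """Lags whose correlation exceeds the threshold (0.5 in the text)."""
    return [lag for lag in range(1, max_lag + 1)
            if lagged_correlation(series, lag) > threshold]

# Toy series (hypothetical values): alternates every day, so only
# even lags correlate positively with the current day.
toy = [1.0, -1.0] * 5
selected = useful_lags(toy, max_lag=3)
```

Applied to a real daily pollutant series, `useful_lags` would return the lags worth keeping as inputs under the 0.5 criterion.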
On the other hand, wind direction W is only available in the form of general directions, such as N, NE, and E. Therefore, the Pearson correlation coefficient cannot be used to find the relationship between air pollutants and wind direction. However, as mentioned above, wind direction is closely associated with the presence and dispersion of pollutants. Hence, wind direction is also selected as an available input variable in this study. The wind direction is separated into 16 discrete directions {N, NNE, NE, ENE, E, ...}. After applying corresponding analysis, it was found that only 7 of the 16 wind directions are related to pollutant levels, namely, {N, NNE, E, ESE, SE, NW, NNW}. To represent these directions, a Boolean variable W_i ∈ {0, 1} was used for each of them, where i = 1 to 7, rather than a single number W ∈ {1, 2, ..., 7}, so that no artificial ordering is imposed on the directions. In addition, since most air pollutants are soluble, rainfall might be expected to have a critical impact on the output. However, when rainfall was included in the modelling, its influence on the accuracy of the models was found to be very low. Hence, rainfall was not selected as an input variable in this study.
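The Boolean encoding of wind direction can be illustrated with a short sketch. The seven directions come from the text; the function name and the choice to map any other direction to all zeros are assumptions made here for illustration.

```python
# The seven directions reported as related to pollutant levels.
RELEVANT_DIRECTIONS = ["N", "NNE", "E", "ESE", "SE", "NW", "NNW"]

def encode_wind(direction):
    """Return the flags W_1..W_7 for one day's wind direction.

    Assumption: a direction outside the seven relevant ones is
    encoded as all zeros.
    """
    return [1 if direction == d else 0 for d in RELEVANT_DIRECTIONS]

east = encode_wind("E")    # flag W_3 is set
other = encode_wind("SW")  # not among the seven: all flags are zero
```

Using one flag per direction avoids implying, for example, that NNW (index 7) is "larger" than N (index 1), which a single integer code would suggest to the regression model.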

Methodology
3.1. Support Vector Machines.
Support vector machines (SVMs) are known as an excellent tool for classification and regression problems [17][18][19], producing good generalization. The basic principle of SVM is to apply a linear model in a high-dimensional feature space, reached through a nonlinear mapping of the input vector, so that nonlinear class boundaries in the input space become linear in the feature space. Details of the working concept of SVM can be found in [13].
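For reference, the ε-insensitive support vector regression problem is usually stated as below. This is the standard textbook formulation rather than anything specific to this study; the symbols C (regularization constant), ε (tube width), ξ_t and ξ_t^* (slack variables), α_t and α_t^* (dual coefficients), and N (number of training samples) are notation assumed here.

```latex
\[
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad
  & \frac{1}{2}\lVert w\rVert^{2} + C\sum_{t=1}^{N}\left(\xi_t + \xi_t^{*}\right) \\
\text{subject to} \quad
  & y_t - w^{T}\Phi(X_t) - b \le \varepsilon + \xi_t, \\
  & w^{T}\Phi(X_t) + b - y_t \le \varepsilon + \xi_t^{*}, \\
  & \xi_t,\ \xi_t^{*} \ge 0, \qquad t = 1,\dots,N .
\end{aligned}
\]
\[
f(X) = \sum_{t=1}^{N}\left(\alpha_t - \alpha_t^{*}\right) k(X, X_t) + b
\]
```

Only training samples with a nonzero coefficient α_t − α_t^* (the support vectors) contribute to the prediction f(X), and the mapping Φ enters only through the kernel k.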

Kernel Selection.
Kernel selection is a crucial issue for support vector machines. A kernel introduces nonlinearity into the SVM problem by implicitly mapping the input data, X, into a Hilbert space via a function Φ, where the data may then be linearly separable. Since SVM only requires inner products of the nonlinearly mapped features Φ(X), a kernel is an efficient way to compute such an inner product and provides the same scalar output k(X, X_t) = Φ(X)^T Φ(X_t), where k is a predefined kernel and X_t is a support vector. Different kernels accommodate different nonlinear mappings, and the performance of the resulting SVM often hinges on the appropriate choice of kernel [20]. Several kernels are commonly used in SVM regression. Five of them, the Linear, Polynomial, Radial Basis Function (RBF), Sigmoid, and Wavelet kernels, were used in this study to build SVM models for comparison. In general, these kernel functions are listed as follows, where X, X_t ∈ R^m:
In (5), ϕ can be any mother wavelet. In this study, the Morlet function was selected.
These data were divided into three groups for three experiments, as shown in Table 3.
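The five kernels named in this section can be sketched in their commonly used forms, as below. The parameter names (`gamma`, `coef0`, `degree`, `scale`) follow widespread convention and are assumptions; the exact parameterization used in the study is not recoverable from the text. The Morlet mother wavelet is taken here as cos(1.75u)·exp(−u²/2), one frequently used form.

```python
import math

def dot(x, z):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, gamma=1.0, coef0=1.0, degree=3):
    return (gamma * dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, z) + coef0)

def morlet(u):
    """Morlet mother wavelet (assumed form)."""
    return math.cos(1.75 * u) * math.exp(-u * u / 2.0)

def wavelet_kernel(x, z, scale=1.0):
    """Product of the mother wavelet over all m coordinates."""
    value = 1.0
    for a, b in zip(x, z):
        value *= morlet((a - b) / scale)
    return value

x, x_t = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(x, x_t)
k_poly = polynomial_kernel(x, x_t, gamma=1.0, coef0=1.0, degree=2)
```

The Polynomial and Sigmoid kernels carry more free parameters than the Linear and RBF kernels, which makes them harder to tune.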

Data Normalization.
Prior to modelling, it is necessary to normalize all selected features into the same range to avoid domination by any feature with large values. This normalization leads to more stable and accurate predictions. The features in the training data and test data were normalized by subtracting and dividing by the feature means, that is,
\[
x_i' = \frac{x_i - \bar{x}_i}{\bar{x}_i},
\]
where \bar{x}_i is the mean of the ith parameter of x.
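A minimal sketch of this mean-based normalization follows. It assumes that the means are computed on the training data and reused for the test data, and that no feature mean is zero; the text does not state either point, and the function names and numbers are hypothetical.

```python
def feature_means(rows):
    """Column-wise means of a list of equal-length feature rows."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]

def normalize(rows, means):
    """Apply (x_i - mean_i) / mean_i to every feature of every row.

    Assumption: every mean is nonzero.
    """
    return [[(value - m) / m for value, m in zip(row, means)]
            for row in rows]

train = [[10.0, 2.0], [30.0, 6.0]]   # toy values, not study data
test = [[20.0, 8.0]]

means = feature_means(train)          # computed on training data only
train_scaled = normalize(train, means)
test_scaled = normalize(test, means)  # reuses the training means
```

Reusing the training means for the test set keeps information about the test period out of the model inputs.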

Modeling and Data Representation.
As mentioned in Section 2, the air pollutant and meteorological data of the previous day and the current day were used in this study to predict the air pollutant level of the following day. In order to apply SVM to pollutant level forecasting, the representation of a pollutant level is defined as a pair (x, y). Generally, the following features for a specific pollutant P ∈ {SPM, SO₂, NO₂, O₃} are chosen for the representation of x: (i) pollutant level at previous day:
For example, if pollutant P = NO₂, then according to Table 1, its correlated pollutants are CorrP = {SPM, SO₂} and its correlated meteorological parameters are CorrM = {W, T, Hum}, denoting the levels of SPM, SO₂, wind direction, temperature, and humidity at the previous day and the current day, respectively. The representation of x is then defined as
Finally, the output y = P(d + 1) is the corresponding level of pollutant P (i.e., the predicted pollutant level) on the following day. This set of training data (x, y) is then passed to the SVM models. The concept is illustrated in Figure 3. For simplicity, the SVM models were named according to the kernel used. Subsequently, the five kinds of models for each pollutant in this study were the Linear model, Polynomial model, RBF model, Sigmoid model, and Wavelet model. With four pollutants P ∈ {SPM, SO₂, NO₂, O₃}, five modelling methods, and three different experiments, 60 trained models were developed in total.
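The construction of (x, y) pairs from daily records can be sketched as follows. The ordering of features inside x, the record layout (one dictionary per day), and the function name are assumptions made for illustration; only the use of day d − 1 and day d as inputs and day d + 1 as output comes from the text.

```python
def build_pairs(records, target, inputs):
    """Build (x, y) pairs from a list of daily records ordered by date.

    For each day d, x holds the previous-day and current-day values of
    the target pollutant and of every correlated input, and y is the
    target pollutant level on day d + 1.
    """
    pairs = []
    for d in range(1, len(records) - 1):
        previous, current, following = records[d - 1], records[d], records[d + 1]
        x = []
        for name in [target] + inputs:
            x.extend([previous[name], current[name]])
        pairs.append((x, following[target]))
    return pairs

# Toy records (hypothetical values), one dictionary per day.
records = [
    {"NO2": 1.0, "SPM": 10.0},
    {"NO2": 2.0, "SPM": 20.0},
    {"NO2": 3.0, "SPM": 30.0},
    {"NO2": 4.0, "SPM": 40.0},
]
pairs = build_pairs(records, target="NO2", inputs=["SPM"])
```

In practice the wind direction flags and the remaining correlated parameters would be appended to x in the same way, after normalization.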

Experiment Environment.
Modelling was performed on the MATLAB 2007a platform, where the LIBSVM toolbox [21] and the SVM Matlab toolbox [22] were employed to construct the models. The SVM hyperparameters (c and g) and the options of the different kernels were optimized.

Error Measures.
In order to compare the accuracy of the models effectively, four error measures were used in this study: mean absolute error (MAE), root mean squared error (RMSE), complementary Willmott's index of agreement (CWIA), and relative error (RE). The RE is needed because, in a warning system, attention is usually focused on levels exceeding a particular dangerous level. The success of a forecasting system may be defined as whether the predicted value falls within an accepted error range relative to the true value [23]. In the following formulas, P_i and O_i represent the predicted and observed levels on the ith day, respectively; O_max and O_min represent the maximum and minimum observed levels within each test set; and n is the number of data points in the test set.
The poor results of the Polynomial model and Sigmoid model may be caused by their use of more hyperparameters, which are difficult to optimize.
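Three of the error measures named in this section can be sketched in their standard forms. The CWIA is taken here as one minus Willmott's index of agreement, which is an assumption about the study's definition; the RE is omitted because its exact definition is not recoverable from the text. The sample values are hypothetical.

```python
import math

def mae(predicted, observed):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / n

def rmse(predicted, observed):
    """Root mean squared error."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

def cwia(predicted, observed):
    """Complementary Willmott's index of agreement, assumed to be 1 - d.

    Lower is better; 0 means perfect agreement. Undefined when every
    predicted and observed value equals the observed mean.
    """
    mean_obs = sum(observed) / len(observed)
    numerator = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    denominator = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2
                      for p, o in zip(predicted, observed))
    return numerator / denominator

predicted = [2.0, 4.0, 6.0]   # toy values, not study data
observed = [1.0, 4.0, 7.0]
errors = (mae(predicted, observed),
          rmse(predicted, observed),
          cwia(predicted, observed))
```

All three measures are zero for a perfect forecast, so lower values indicate a better model.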

Matching of Predicted and Observed Pollution Levels.
Example plots of the predicted and observed levels of NO₂ in the 1-year experiment, SPM in the winter experiment, and O₃ in the summer experiment are shown in Figures 4 to 6, respectively. Although there was some lagging and underestimation, the levels predicted by the Linear model and RBF model (Figure 4) followed the trend of the observed NO₂ level well in the 1-year experiment. However, the other three models failed to follow the observed level at all. In the winter experiment (Figure 5), the Linear model performed the best, while the other four models, especially the Sigmoid model, failed to follow the trend of the observed level. In the summer experiment (Figure 6), the Linear model, RBF model, and Polynomial model performed well. However, the Polynomial model could not match the peaks of the observed levels. Both the Sigmoid model and the Wavelet model produced poor predictions compared with the other three models. It is clear that the Linear model and RBF model performed the best and that their predictions were the closest to the observed levels, in both the 1-year experiment and the seasonal experiments.

Conclusion
Using observed meteorological and pollutant data, SVM models for forecasting daily ambient air pollutant levels were constructed. The predictions of the Linear model and RBF model showed a relatively good fit to the observed test set of over one year of data, particularly for SO₂ and NO₂.
In the seasonal experiments, the Linear model and RBF model also outperformed the other three tested models, although some lagging and underestimation by these two models occurred in the winter experiment. Among the five models studied, it was evident that using the Linear kernel or the RBF kernel in an SVM model for air pollutant forecasting in Macau produced superior results with relatively lower errors. It is believed that an SVM model with a Linear or RBF kernel can also perform well for air pollutant forecasting in other similar developing cities, or even for other time series prediction in similar situations.
Although the Linear model and RBF model outperformed the other three tested models, both still underestimate high pollutant levels. Solving this problem to improve the accuracy of the prediction model is left for future work. Some literature [24] has attempted to integrate the discrete wavelet transform (DWT) with SVM for higher accuracy. Hence, we will attempt to integrate other machine learning methods, for example, the genetic algorithm (GA), with SVM to improve the accuracy and efficiency of the model in the future.

Figure 1: The three land zones of Macau and the location of the stations.

Figure 3: Data representation for pollutant predictive model.

Figure 4: The predicted and observed levels of NO₂ using different kernel models in 1-year experiment.

Figure 5: The predicted and observed levels of SPM using different kernel models in winter experiment (2006).

Figure 6: The predicted and observed levels of O₃ using different kernel models in summer experiment (2006).

Table 2: Pearson correlation coefficients between pollutants in different time series.

Table 3: The training data and test data in each experiment.

Table 4: The tested results of SPM, SO₂, NO₂, and O₃ in different kernel models in 1-year experiment.

Table 5: The tested results of SPM, SO₂, NO₂, and O₃ in different kernel models in winter experiment.
Table 4 presents the results of the SVM models under different kernels in the 1-year experiment for SPM, SO₂, NO₂, and O₃. The bolded values indicate the best performance among the five tested models. The Linear model and RBF model produced satisfactorily low errors for all pollutants, and the results of these two models were comparable. The results of the Polynomial model and Wavelet model were poor, with errors 3-18% higher than those of the Linear model and RBF model. The Sigmoid model produced the highest errors. The predictions in the seasonal experiments showed the same pattern as in the 1-year experiment (see Table 5 (winter experiment) and Table 6 (summer experiment)). These results show that the Linear model and RBF model generalized much better than the other three models.

Table 6: The tested results of SPM, SO₂, NO₂, and O₃ in different kernel models in summer experiment.