Air Pollution Concentration Forecast Method Based on the Deep Ensemble Neural Network

The global environment has become more polluted due to the rapid development of industrial technology. However, existing machine learning methods for air quality prediction fail to analyze the reasons for changes in air pollution concentration because most of them focus on model selection. Since recent deep learning frameworks are very flexible, a model may become deep and complex in order to fit the dataset. Therefore, a single deep neural network model may overfit when it has a large number of weights. Besides, stochastic gradient descent (SGD) applies the same learning rate to all parameters, which can result in a locally optimal solution. In this paper, the Pearson correlation coefficient is used to analyze the inherent correlation between PM2.5 and other auxiliary data such as meteorological data, season data, and time stamp data, and the result is used to cluster the data for better performance. The extracted features are used to build a deep ensemble network (EN) model which combines the recurrent neural network (RNN), long short-term memory (LSTM) network, and gated recurrent unit (GRU) network to predict the PM2.5 concentration of the next hour. The weight of each submodel changes with its accuracy on the validation set, so the ensemble has generalization ability. Adaptive moment estimation (Adam), an algorithm for stochastic optimization, is used to optimize the weights instead of SGD. In order to compare the overall performance of different algorithms, the mean absolute error (MAE) and mean absolute percentage error (MAPE) are used as accuracy metrics in the experiments of this study. The experimental results show that the proposed method achieves MAE = 6.19 and MAPE = 16.20% and outperforms the comparative models.


Introduction
In recent years, the rapid development of industry has been accompanied by air pollution, which causes the death of 7 million people every year and attracts great attention worldwide [1,2]. Among air pollutants, PM2.5 can traverse the nasal passages during inhalation and reach the throat and even the lungs [3], posing a great threat to the human body. In 2018, Heft-Neal et al. [4] suggested that PM2.5 concentration above minimum exposure levels was responsible for 22% of infant deaths in the 30 studied countries and led to 449,000 additional infant deaths in 2015, an estimate more than three times higher than existing estimates that attribute infant deaths to poor air quality for these countries. Therefore, the control and prevention of air pollution have become significant issues. To achieve this goal, obtaining real-time air pollution concentration is necessary [5]. Moreover, sensors have been used in a wide range of applications [6,7] and collect extensive air quality data. As attention to air pollution has increased, many researchers have focused on it, and there are many relevant research studies. The main machine learning methods applied to air pollution are as follows: artificial neural network (ANN), ensemble learning, support vector machine (SVM), and other hybrid models [8]. However, these existing machine learning methods for air quality prediction do not analyze the reasons for changes in air pollution concentration because most of them focus on model selection. Furthermore, since recent deep learning frameworks are very flexible, a model may become deep and complex in order to fit the dataset. Therefore, a single deep neural network model may overfit when it has a large number of weights.
This paper analyzes the inherent relation of PM2.5 with other meteorological data (i.e., dew point, humidity, atmospheric pressure, temperature, wind direction, accumulated wind speed, precipitation, and accumulated precipitation), season data, and time stamp data. By analyzing the correlation between PM2.5 and other auxiliary data (an hour before), extracted air pollution characteristics are used to cluster the dataset and build a deep ensemble network (EN) model to predict PM2.5 concentration. The input of the model is the PM2.5 concentration and auxiliary data of the previous eight hours while the output is the PM2.5 concentration of the next hour. The adaptive moment estimation (Adam) algorithm is used to replace stochastic gradient descent (SGD) to update weights to get higher accuracy. For the validation of the proposed method, hourly PM2.5 concentration and meteorological data at 3 stations in Shanghai from 01 January 2010 to 31 December 2015 are collected. The mean absolute error (MAE) and mean absolute percentage error (MAPE) are used as accuracy metrics to compare the overall performance of each algorithm.
The contributions of this study are summarized as follows:
(i) This study proposes an ensemble model based on RNN, LSTM, and GRU to predict the PM2.5 concentration of the next hour
(ii) This study proposes a cluster method based on wind direction to improve prediction performance
(iii) Wind direction is shown to be related to PM2.5 concentration because the wind can carry or take away PM2.5
The remainder of this paper is organized as follows. Section 2 reviews the literature on air quality prediction. Section 3 analyzes the inherent correlation between PM2.5 concentration and other auxiliary data and describes the data preprocessing. Section 4 introduces the Adam algorithm and the proposed EN. The experimental results are shown in Section 5. Section 6 is the conclusion, including contributions and future work.

Literature Reviews
In this section, the advantages and disadvantages of existing machine learning models for air quality prediction are discussed.

Traditional Machine Learning Methods and Neural Networks with Simple Structure. Traditional machine learning methods and neural networks with simple structure have been applied to PM2.5 prediction. In 2015, Lary et al. [9] proposed a model based on machine learning to estimate PM2.5 concentration. They collected hourly PM2.5 data from 55 countries to verify the performance of the proposed model. Though the method obtained certain results, it could not predict future PM2.5 concentration. Hooyberghs et al. [10] proposed a method that combines ANN and big data to predict air quality. For the validation of the proposed model, pollutant data, traffic data, and weather data were collected by smartphone sensors. The results showed that the ANN model is well suited to air pollution prediction. Moreover, some variations of ANN were proposed to predict air quality. For example, the recurrent neural network (RNN) was used by Prakash et al. in 2011 to forecast the 1 h ahead concentration and the daily mean and daily maximum concentrations of various pollutants, and the experimental results demonstrated the practicability of the method [11]. Hybrid models are another variation of ANN. In 2011, Feng et al. [12] proposed a hybrid model that combines SVM with a back-propagation neural network (BPNN) to forecast ozone concentration. SVM was used to classify the data into its corresponding categories, and a genetic algorithm- (GA-) optimized BPNN was employed to build the prediction model. They collected data including temperature, humidity, wind speed, and ultraviolet radiation from March 2009 to July 2009 to validate the accuracy of the proposed method, and the results showed that the model had a strong prediction capability and could be used to predict the ozone concentration of Beijing. Considering the long-term dependencies and spatial correlations of air pollution, Li et al. [13] proposed an extended model of the long short-term memory (LSTM) network to extract the inherent features of air pollution and predict air quality. For the validation of the method, models including the spatiotemporal deep learning (STDL) model, the time delay neural network (TDNN) model, the autoregressive moving average (ARMA) model, the support vector regression (SVR) model, and the traditional LSTM network [14] were used as comparison algorithms, and the results demonstrated the superiority of the proposed method.

Complex Deep Neural Networks.
In recent years, deep learning has promoted the development of PM2.5 prediction. More and more complex deep networks are applied in this field to obtain better fitting results. In 2018, Huang and Kuo [15] analyzed the source pie of PM2.5 and proposed a deep neural network model that combines the convolution neural network (CNN) and LSTM network. CNN is a weight sharing network, which is good at capturing local features [16,17]. The innovation of this method was introducing the convolution layer to extract spatial dependencies of PM2.5 and long short-term memory to extract temporal dependencies. The experimental results were compared with SVM, random forest (RD), decision tree (DT), NN, and LSTM algorithms and showed that PM2.5 concentration prediction models based on deep neural networks (e.g., NN, RNN, CNN, and LSTM) are better than the models based on traditional machine learning methods (e.g., SVM, RD, and DT). Considering the spatiotemporal dependence of PM2.5, Xie et al. [18] proposed in 2019 a CGRU model based on CNN and gated recurrent unit (GRU) to predict PM2.5 concentration in the next six hours. CNN is used to extract spatial correlation features, and GRU further extracts long-term correlation features. Experimental results showed that the proposed model was better than the traditional time series models (including LSTM, GRU, and ARIMA). GRU is a variant of LSTM, and its network structure is simpler than that of LSTM [19,20]. In 2019, Tao et al. [21] proposed the CBGRU model based on 1D convnets and bidirectional GRU. On the basis of the bidirectional gated recurrent unit (BGRU), this method added a convolution layer and a pooling layer, which can extract the PM2.5 features more easily. In 2020, Xayasouk et al. [22] developed two models, the LSTM model and the deep autoencoder (DAE) model, to predict the particle concentration in the next hour. The experimental results showed that LSTM was better than DAE in predicting the particle concentration. In 2020, Kaya and Oguducu [23] proposed a new air quality prediction model based on deep learning, namely, deep flexible sequence. They used hourly data from Istanbul, Turkey, from 2014 to 2018 to predict air pollution 4, 12, and 24 hours ahead. This model is a hybrid and flexible deep model, which includes long short-term memory and convolutional neural network (CLSTM). On this basis, Li et al. [24] developed a deep CNN-LSTM method based on attention (ACLSTM), which includes the one-dimensional CNN, LSTM network, and attention network for urban PM2.5 concentration prediction. However, the attention layer is applied between the hidden layer and the output layer, so it cannot explain the correlation between predictors and pollutants. Considering that monitoring stations are rare in a vast area, Ma et al. [25] proposed a deep spatiotemporal prediction method based on bidirectional LSTM and inverse distance weighting which can predict the PM2.5 concentration in areas without monitoring stations. Qi et al. [26] proposed a deep air learning method which provided novel ideas for interpolation, prediction, and feature analysis.
Wireless Communications and Mobile Computing

Ensemble Neural Networks. However, a complex and deep network structure reduces generalization ability, which leads to poor performance on other datasets. Hornik et al. [27] proved that a single-layer ANN could approximate functions of any complexity. However, finding an appropriate network configuration is an NP-hard problem, which influences the generalization ability of the network. To solve this problem, Hansen and Salamon [28] proposed an ensemble neural network as a simple and feasible method. In this method, each NN in the system is trained separately and the predictions of the NNs are combined into the final result.

Data Analysis
This paper collected hourly PM2.5 concentration and meteorological data at 3 stations in Shanghai from 01 January 2010 to 31 December 2015 from the UCI database [29]. The Pearson correlation coefficient is used to analyze the inherent correlation between PM2.5 and the auxiliary data of the previous hour. The extracted features are used to choose the appropriate activation functions and train the EN, which combines the RNN, LSTM, and GRU networks to predict PM2.5 concentration. Before training the models, data preprocessing is necessary. Section 3.1 analyzes the inherent correlation of PM2.5 and other auxiliary data, and Section 3.2 illustrates the data preprocessing.

Analyzing the Inherent Correlation of PM2.5 and Other Auxiliary Data. First of all, this study analyzes the spatial-temporal characteristics of three monitoring stations in Shanghai. They are located in Jingan, at the American consulate, and in Xuhui, as shown in Figure 1. The autocorrelation function, shown as Equation (1), is applied to measure the temporal correlation of each station, and more details are given in Reference [30]. In Table 1, the correlation values between the PM2.5 concentration of one hour and that of the following hour are above 0.95 at all three stations, which demonstrates that PM2.5 has a strong correlation in time.
$$R(\tau) = \frac{E\left[(X_t - \mu)(X_{t+\tau} - \mu)\right]}{\sigma^2} \tag{1}$$
where $X_t$ denotes the PM2.5 concentration of each hour, $\mu$ is the expectation of $X_t$, $\tau$ is the time delay, $\sigma$ is the standard deviation, and $E$ is the expectation function. Existing studies have shown that meteorological factors play a significant role in air pollutant concentration [31,32]. Therefore, it is necessary to find out the relationship between meteorological factors and PM2.5 concentration. The Pearson correlation coefficients of PM2.5 and other auxiliary data are shown in Table 1. The PM2.5 concentration data are those of the next hour, and the auxiliary data are those of the previous hour. DEWP stands for dew point, HUMI stands for humidity, PRES stands for atmospheric pressure, TEMP stands for temperature, CV stands for no wind, NE stands for northeast wind, SE stands for southeast wind, SW stands for southwest wind, NW stands for northwest wind, Iws stands for accumulated wind speed, and Iprec stands for accumulated precipitation. As is shown in Table 1, except for HUMI, precipitation, Iprec, and spring, whose values are below 0.10, the values of the other auxiliary data are above 0.10. With respect to season, summer and autumn show a negative correlation with PM2.5 concentration while winter has a positive correlation with PM2.5 concentration. With respect to wind direction, there is a negative correlation between PM2.5 concentration and east wind while west wind has a positive correlation with PM2.5 concentration. The Pearson correlation coefficient is shown as Equation (2), and more details are given in Reference [30].
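$$\rho_{X,Y} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y} \tag{2}$$
where $\sigma_X$ is the standard deviation of $X$ and $\sigma_Y$ is the standard deviation of $Y$. $\mu_X$ is the expectation of $X$, and $\mu_Y$ is the expectation of $Y$.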
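The correlation analysis above can be illustrated with a minimal sketch in plain Python. It assumes population (divide-by-n) moments, and it approximates the lagged correlation of Equation (1) by applying the Pearson coefficient to a series and its shifted copy, which is a simplification rather than the exact estimator used in the paper. The function names and the sample readings are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mu_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mu_y) ** 2 for b in y) / n)
    return cov / (sigma_x * sigma_y)

def lagged_correlation(series, lag=1):
    """Correlation between a series and itself shifted by `lag` >= 1 steps."""
    return pearson(series[:-lag], series[lag:])

# Hypothetical hourly PM2.5 readings (not from the paper's dataset).
pm25 = [35.0, 38.0, 42.0, 40.0, 37.0, 45.0, 50.0, 48.0]
print(lagged_correlation(pm25, lag=1))
```

The same `pearson` helper could be applied to an auxiliary series of the previous hour and the PM2.5 series of the next hour, for example `pearson(temp[:-1], pm25[1:])` for a hypothetical `temp` list.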
To observe the correlation between PM2.5 concentration and meteorological data intuitively, violin plots are shown in Figure 2. The abscissa is the interval of the meteorological data, and the ordinate stands for the PM2.5 concentration in the interval (e.g., the temperature value ranges from -3 to 41, which is divided into 5 intervals as abscissa). The wider the violin is, the more data points there are. The division of intervals varies with the meteorological data. Owing to space limitations, this part only lists the violin plots of Jingan. Based on the correlation coefficient between wind directions and PM2.5 concentration, the dataset is divided into two parts: (1) west wind and no wind and (2) east wind. The two datasets are used to train and test the NN model. Six groups of controlled experiments are set to demonstrate the performance of the proposed method. Each group generates two models: (1) the NN model with the full dataset and (2) the NN model with the cluster method. In these groups, the number of dense layers of NN is 2 and the number of neurons is set from a candidate set of {5, 10, 15, 20, 25, 30}.  Group 6) which shows that the accuracy can be improved by the cluster method based on wind directions. When the wind comes from the west, the PM2.5 concentration is higher, for Shanghai is located in the eastern coastal area of China, with the inland to its west. The west wind carries inland pollution, and no wind is not conducive to air circulation. On the contrary, the PM2.5 concentration is lower under east wind, for the east wind carries air from the ocean. The results illustrate that the cluster method based on wind directions can extract the data features effectively and get better predicted results.
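The wind-direction split described above can be sketched as follows. This is only an illustration: the record layout, the field name `wind`, and the sample values are assumptions, while the grouping of no wind (CV) with the west winds follows the two parts named in the text.

```python
# Wind codes follow the abbreviations used in the text.
WEST_OR_CALM = {"SW", "NW", "CV"}
EAST = {"NE", "SE"}

def split_by_wind(records):
    """Split records into (west wind or no wind, east wind) subsets."""
    west_or_calm, east = [], []
    for record in records:
        direction = record["wind"]  # hypothetical field name
        if direction in WEST_OR_CALM:
            west_or_calm.append(record)
        elif direction in EAST:
            east.append(record)
        else:
            raise ValueError(f"unknown wind direction: {direction!r}")
    return west_or_calm, east

# Hypothetical records, not taken from the Shanghai dataset.
records = [
    {"wind": "NW", "pm25": 80.0},
    {"wind": "SE", "pm25": 25.0},
    {"wind": "CV", "pm25": 60.0},
    {"wind": "NE", "pm25": 30.0},
]
west_or_calm, east = split_by_wind(records)
print(len(west_or_calm), len(east))
```

Each subset would then be used to train and test its own model, in the same way as the full dataset.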

Data Preprocessing. Hourly PM2.5 concentration, season data, and meteorological data at 3 stations in Shanghai from 01 January 2010 to 31 December 2015 are collected to test the performance of the proposed method. In the data preprocessing stage, records with abnormal or missing values are deleted first. Second, the wind direction data are converted into binary codes to enhance the prediction performance. Wind direction index data have 5 unique categorical values, and each wind direction index is converted to a 5-dimensional vector (e.g., northwest is assigned as [0, 0, 0, 0, 1]). Similarly, season index data have 4 unique categorical values, and each season index is converted to a 4-dimensional vector (e.g., spring is assigned as [1, 0, 0, 0]). Third, the data are processed by the min-max normalization method and scaled to the range from 0 to 1 for a better training effect. The formula is shown as Equation (3), and more details are given in Reference [33].
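$$z_n = \frac{x_n - \min_{1 \le i \le N} x_i}{\max_{1 \le i \le N} x_i - \min_{1 \le i \le N} x_i} \tag{3}$$
where $x_n$ denotes the raw value of the n-th record, N is the number of records, and $z_n$ is the normalized value, ranging from 0 to 1.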
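The encoding and normalization steps can be sketched in a few lines of plain Python. The text only fixes the vectors for northwest and spring; the order of the remaining categories below is an assumption, as are the helper names.

```python
# Category orders: NW last and spring first match the examples in the text;
# the positions of the other categories are assumed for illustration.
WIND_CATEGORIES = ["CV", "NE", "SE", "SW", "NW"]
SEASON_CATEGORIES = ["spring", "summer", "autumn", "winter"]

def one_hot(value, categories):
    """Encode a categorical value as a binary vector."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

def min_max_normalize(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant column")
    return [(v - lo) / (hi - lo) for v in values]

print(one_hot("NW", WIND_CATEGORIES))
print(one_hot("spring", SEASON_CATEGORIES))
print(min_max_normalize([10.0, 20.0, 30.0]))
```

In practice the minimum and maximum would presumably be taken from the training portion of the data and reused for the other portions, although the text does not say which convention is used.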

The Proposed Ensemble Network Model for Prediction with Adam
The proposed EN model can be divided into three submodels: RNN, LSTM, and GRU. Three network models are used

Neural Network.
In this stage, historical and auxiliary data are used to train each network model in the EN model separately. The NN model is employed to analyze the interaction effects of input parameters and get the prediction. It is divided into three layers: input layer, hidden layer, and output layer. Besides, the rectified linear unit (RELU) function is applied as an activation function after the output layer to produce a nonlinear prediction. Adam, an algorithm for stochastic optimization, is used to optimize the weights instead of SGD. For the training of NNs, the PM2.5 concentration data and auxiliary data of the previous hour at the three stations are used as the inputs of NNs while the PM2.5 concentration data of the next hour at the three stations are used as the outputs of NNs. Fully connected layers are used as the hidden layer to analyze the inherent correlation of parameters. The historical data are applied to train the models, and the weights of the models are optimized by the Adam algorithm.
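The pairing of past inputs with a next-hour target can be sketched with a small windowing helper. The function name and the toy data are hypothetical; a lookback of 1 corresponds to the NN setup described above, and a lookback of 8 to the eight-hour input that the paper uses for the recurrent submodels.

```python
def make_windows(rows, targets, lookback):
    """Pair the `lookback` rows before hour t with the target at hour t."""
    inputs, outputs = [], []
    for t in range(lookback, len(rows)):
        inputs.append(rows[t - lookback:t])
        outputs.append(targets[t])
    return inputs, outputs

# Toy data: one feature per hour, with the target equal to the hour index.
rows = [[float(i)] for i in range(10)]
targets = [float(i) for i in range(10)]

x_nn, y_nn = make_windows(rows, targets, lookback=1)    # previous hour only
x_rnn, y_rnn = make_windows(rows, targets, lookback=8)  # previous eight hours
print(len(x_nn), len(x_rnn))
```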

Recurrent Neural Network.
Besides NN models, other deep learning methods (e.g., RNN and LSTM models) are applied to prediction problems. The inputs of the RNN, LSTM, and GRU models differ from those of NN models: each neuron of the NN's input layer receives a single sequence element, while each neuron of the input layer in RNN, LSTM, and GRU receives a vector encoded from the past sequence elements. As noted above, PM2.5 concentration data have a strong correlation in time. RNN, LSTM, and GRU are applied to predict PM2.5 concentration because they handle time series problems better than NN models. LSTM, an extended model of RNN, differs from RNN in learning long-term dependencies, because RNN suffers from vanishing gradients. GRU, a variant of LSTM, differs from LSTM in internal structure: LSTM has three gates and GRU has only two. In the training stage of the RNN, LSTM, and GRU models, eight hours of PM2.5 concentration data along with meteorological data are used as input, unlike the NN model.

Combining State of the Ensemble Network. In Figure 3, each submodel in EN can predict the PM2.5 concentration independently and all the results are integrated to produce the final prediction results. In this research, the number of hidden layer neurons ranged from 5 to 30, increasing by 5 at a time. The weighted average method is applied to integrate all the prediction results of the submodels. In order to obtain the weight of each submodel, 10% of the dataset is selected as the validation set. The accuracy of each submodel is passed through Softmax to get its weight. The Softmax formula is shown in Equation (4), and more details are given in [34].
$$w_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} \tag{4}$$
where $z_1, z_2, \cdots, z_n$ denote the accuracy of the submodels on the validation set, $n$ is the number of submodels, and $w_1, w_2, \cdots, w_n$ are the weights of the submodels. $e$ denotes the base of the natural exponential. The final result of the proposed model can be computed as
$$\hat{y} = \sum_{i=1}^{n} w_i \hat{y}_i \tag{5}$$
where $\hat{y}_i$ is the prediction of the i-th submodel.   Group 6). The results illustrate that the combination of different models is effective for predicting PM2.5 concentration and decreases the error of overfitting.
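The Softmax weighting and the weighted average can be sketched as below. The paper does not spell out how the validation accuracy score is computed, so the scores here are hypothetical values where higher means better; the function names and the sample predictions are also assumptions.

```python
import math

def softmax(scores):
    """Map scores to positive weights that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(submodel_predictions, weights):
    """Weighted average of the submodel predictions for one time step."""
    return sum(w * p for w, p in zip(weights, submodel_predictions))

# Hypothetical validation scores for RNN, LSTM, and GRU (higher is better).
validation_scores = [0.80, 0.85, 0.83]
weights = softmax(validation_scores)

# Hypothetical next-hour PM2.5 predictions from the three submodels.
prediction = ensemble_predict([41.0, 38.5, 39.2], weights)
print(weights, prediction)
```

Because Softmax produces positive weights that sum to 1, the combined prediction always lies between the smallest and largest submodel prediction.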

Adam Optimization. This paper adopts Adam [35], an optimization algorithm, to replace the traditional SGD. It updates the weights of the neural network iteratively based on the dataset. The Adam optimization algorithm is an extension of the SGD algorithm and has recently been widely used in deep learning applications. Adam differs from traditional SGD in that SGD keeps a single learning rate for all weights, and this learning rate does not change during training, while Adam maintains an independent adaptive learning rate for each parameter. The Adam algorithm is introduced in detail next.
Assume that $f$ is the objective function and $\theta$ denotes the parameters to be optimized. $g_t$ stands for the gradient, which can be expressed as Equation (6), and more details of the formulas of this section are given in [35].
$$g_t = \nabla_\theta f_t(\theta_{t-1}) \tag{6}$$
where $f_1(\theta), f_2(\theta), \cdots, f_t(\theta)$ stand for the function values of time steps 1 to $t$. $m_t$ and $v_t$ stand for the exponential moving averages of the gradient (i.e., biased first moment estimate) and the squared gradient (i.e., biased second raw moment estimate), respectively, which are employed to update the weights, and their formulas are shown as
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{7}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{8}$$
where $\beta_1$ and $\beta_2$, ranging from 0 to 1, control the exponential decay rates for the moment estimates.
In the beginning, $m_t$ and $v_t$ are near 0, for they are initialized as 0 and the decay rates are close to 1. To counteract this bias at the beginning, the bias-corrected estimates $\hat{m}_t$ and $\hat{v}_t$ are computed as
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \tag{9}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{10}$$
The final update formula of the parameters is shown as
$$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{11}$$
where $\alpha$ denotes the stepsize (i.e., learning rate) and is initialized to 0.001 (i.e., the default value), and $\epsilon$ is a small constant that prevents division by zero, as in [35]. This section sets four groups of controlled experiments, and NN, RNN, LSTM, and GRU are applied to predict the PM2.5 concentration by using SGD, Adam, and Nadam separately. The number of neurons of the dense layers in all these models is set to 15, and the performance of each algorithm in different networks is shown in Table 4. Compared with Adam and Nadam, SGD has the worst performance because its fixed learning rate makes the global optimum difficult to find. Except for the MAE of NN, Adam has better performance than Nadam [36], which shows that the Adam algorithm is effective in optimizing the PM2.5 prediction model.
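The update rule can be sketched in plain Python as follows. Only the learning rate of 0.001 is stated in the text; the values of `beta1`, `beta2`, and `eps` are the defaults suggested in [35], and the function name and the toy objective are assumptions made for illustration.

```python
import math

def adam_minimize(grad, theta, steps, alpha=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates on parameters `theta` using gradient function `grad`."""
    theta = list(theta)       # work on a copy
    m = [0.0] * len(theta)    # first moment estimate
    v = [0.0] * len(theta)    # second raw moment estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            m_hat = m[i] / (1.0 - beta1 ** t)   # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
def toy_grad(theta):
    return [2.0 * (theta[0] - 3.0)]

one_step = adam_minimize(toy_grad, [0.0], steps=1)
converged = adam_minimize(toy_grad, [0.0], steps=2000, alpha=0.1)
print(one_step, converged)
```

After bias correction the first step has a size close to `alpha` regardless of the gradient magnitude, which is one way to see that the effective step is normalized per parameter.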

Experimental Results and Analysis
This section uses hourly PM2.5 concentration and meteorological data at 3 stations in Shanghai from 01 January 2010 to 31 December 2015 to evaluate the proposed model. All the models including NN, RNN, LSTM, GRU, BGRU, CGRU, CBGRU, CLSTM, ACLSTM, and EN are trained on the Keras framework with the TensorFlow backend. The learning rate is set to 0.001, and the number of epochs is set to 200. RELU is applied as the activation function for each layer of the network, and Adam is used as the optimization algorithm to optimize the weights. For the validation of the proposed method, MAE and MAPE are used as accuracy metrics to compare the overall performance of each model. MAE is an absolute value, and MAPE is a percentage; smaller values of both indicate better performance. The two metrics are given in Equations (12) and (13), respectively, and more details are given in Reference [30].
$$\mathrm{MAE} = \frac{1}{N} \sum_{n=1}^{N} \left| o_n - p_n \right| \tag{12}$$
$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{n=1}^{N} \left| \frac{o_n - p_n}{o_n} \right| \tag{13}$$
where $o_n$ is the n-th observed value, $p_n$ denotes the n-th predicted value, and N is the number of records. The values of the two metrics (MAE, MAPE) are calculated for the proposed EN as well as for NN, RNN, LSTM, GRU, BGRU, CGRU, CBGRU, CLSTM, and ACLSTM.   Figure 4 shows the comparison between observed values and predicted values of the EN model. In terms of the overall trend, the predicted data fit the observed data well. This also indicates that the proposed model can effectively predict the PM2.5 concentration in the next hour.
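The two metrics can be computed with a short plain-Python sketch. The function names and sample values are hypothetical, and the MAPE helper assumes that no observed value is zero.

```python
def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def mape(observed, predicted):
    """Mean absolute percentage error, in percent; observed values must be nonzero."""
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical observed and predicted PM2.5 values.
observed = [10.0, 20.0]
predicted = [12.0, 16.0]
print(mae(observed, predicted), mape(observed, predicted))
```

For these sample values the absolute errors are 2 and 4, and each relative error is 20%.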

Conclusions and Future Work
Because of the flexibility of the network framework, many complex deep learning networks have been developed for air quality prediction. As far as we know, there is no uniform dataset in current air quality prediction research. Researchers collect datasets from different regions to train their networks. Although these complex deep networks can fit the data they use well, they lack generalization ability. Therefore, this paper proposes an EN model to predict air pollution concentration from historical PM2.5 concentration, meteorological, and time stamp data. Considering that the submodels (RNN, LSTM, and GRU) each perform quite well, each submodel of the EN model is trained separately to obtain its accuracy, and the final model is obtained by a weighted average method. The weights of the submodels are flexible because they are derived from the accuracy on the validation set, so the model can perform stably on different datasets. In addition, as far as we know, ensembles of different networks have rarely been used in this field. Adam is adopted to optimize the weights instead of SGD because it can adjust the learning rate adaptively for efficient training. A case study of the prediction of PM2.5 concentration in Shanghai, People's Republic of China, is given in this research, and the dataset is divided into three parts: training data, validation data, and testing data. Training data are used to train the submodels of EN separately, validation data are applied to obtain the weight of each submodel, and testing data are adopted to compute MAE and MAPE for performance evaluation. The experimental evaluation is performed for EN, as well as other algorithms including NN, RNN, LSTM, GRU, BGRU, CGRU, CBGRU, CLSTM, and ACLSTM. The experimental results demonstrate that the proposed method outperforms the other algorithms.
Several findings of this paper are as follows:
(i) Compared with the single model, the EN model has better generalization ability and predictive ability as validated by MAE and MAPE
(ii) Wind direction has a significant impact on PM2.5 concentration, for wind can carry or take away PM2.5
(iii) Compared with SGD, the Adam algorithm avoids the local optimum effectively
As an extension of this study, the prediction performance could be enhanced by adding human activity data, because human activity is one of the main reasons for environmental deterioration, especially during holidays. Furthermore, this paper found that wind direction has a significant influence on PM2.5 concentration because the wind brings or takes away PM2.5. However, it is uncertain whether this holds in areas with high mountains. Therefore, embedding the influence of topographical factors can become a research direction in the future. Limited by the lack of human activity and topographical data, this paper only analyzes the impact of meteorological data on PM2.5.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.