A Long-Term Prediction Model of Beijing Haze Episodes Using Time Series Analysis

Rapid industrial development has led to intermittent outbreaks of pm2.5, or haze, in developing countries, causing serious environmental problems, especially in big cities such as Beijing and New Delhi. We investigate the factors and mechanisms of haze change and present a long-term prediction model of Beijing haze episodes using time series analysis. We construct a dynamic structural measurement model of daily haze increment and reduce the model to a vector autoregressive model. Case studies on 886 continuous days indicate that our model performs very well on next day's Air Quality Index (AQI) prediction, and in severely polluted cases (AQI ≥ 300) the accuracy rate of AQI prediction reaches 87.8%. The one-week prediction experiment shows that our model is highly sensitive to sudden haze bursts and dissipations, which results in stable accuracy for AQI predictions 3–7 days ahead.


Introduction
Industry in developing countries is mainly concentrated around big cities, accompanied by large populations, consumption, and pollution. Together with Tianjin city and Hebei province, Northern China has become one of the most prosperous and polluted areas on Earth. By 2013, the transient population of Beijing was 37.5 million, and intermittent outbreaks of air pollution have greatly impacted every citizen's life through physiological diseases [1,2], depression, and poor visibility in traffic [3,4]. The main component of haze is pm2.5 (particulate matter less than 2.5 μm in aerodynamic diameter), and the concentration of pollution is described with the Air Quality Index (AQI, the concentration of pm2.5). The Chinese Government began to monitor and record pm2.5 concentrations for major cities in 2013 [5]. According to the report of Quan et al. [6], the AQI reached 600 in Beijing during the haze event in January 2013. In recent years, more and more papers have addressed the haze episodes in Northern China and their consequences [7][8][9][10][11]. Researchers have pointed out that, over the coming years, haze episodes will continue to break out frequently in Northern China [12].
This paper presents an AQI prediction model of Beijing based on time series analysis. We collected Beijing's AQI data for 29 consecutive months starting in 2013 and constructed a dynamic structural prediction model. Statistical methods are used to obtain the maximum likelihood estimation of the prediction model. Both short-term and long-term experiments are carried out to test the accuracy and robustness of our model.
The remainder of this paper is organized as follows. In Section 2, we introduce recent related work. Section 3 presents our prediction model and proves it to be a vector autoregressive model. Experiments and evaluations are reported in Section 4. We conclude the paper in Section 5 with future work.

Related Work
Generally, pm2.5, or haze, is produced mainly by anthropogenic factors [13][14][15][16] and eliminated by natural diffusion. Several days after emission, secondary pm2.5 is produced through photochemical reactions among pollutants that have not diffused. Secondary pm2.5 is the principal component in most severe haze episodes in China [17]. A typical way of predicting haze is to use pollutant emission data (CO, SO2, and NO) in the simulation [5,18]. Huang et al. [14] analyzed the chemical composition of pm2.5 and used chemical mass balance to identify the emission sources. Other, more complex models have been proposed to introduce atmospheric features, chemical components, and transport factors [15]. More commonly, however, pollutant emission data simply increase or decrease synchronously with AQI. Sun [19] took population, car ownership, and GDP into consideration and proposed a statistical index system of average annual haze episode days. The study found that although most factors contribute to predicting pm2.5, the annual average of NO is negatively correlated with average severely polluted days. Reference [12] established a cubic exponential smoothing model by introducing dust emission into haze prediction. Liang et al. pointed out that there are various distribution and transmission patterns of pm2.5 [20]. In addition, Wang et al. mentioned that government control policy should be considered in model simulations [9].
Many studies use a backpropagation neural network as the simulation model [19,21]. Statistical time series analysis is rarely used in haze prediction, so long-term haze prediction is difficult for current methods to accomplish [22]. Multiple linear regression models also perform well on daily-scale prediction [23,24]. However, the test data of existing studies are limited; for example, [21] tested the prediction accuracy on only 3 days. Besides, Zhang et al. pointed out that pm2.5 accumulation in previous days significantly affects the present daily pm2.5 concentration, which should also be considered in the modeling process [22].
Considering the above points, this paper presents a new AQI prediction model that integrates natural, human, and self-evolution factors.

The Parameters and Architecture of the Prediction Model.
The change of daily pm2.5 concentration depends on two factors: the daily overall production of pm2.5 by human activities, $p_t$, and the daily overall natural diffusion or natural accumulation of pm2.5, $d_t$. The production of haze depends heavily on the government's control policies toward the emission of industry fuels, $i_t$. The diffusion of haze mainly depends on the airflow, $w_t$. Besides, complex chemical changes can occur between pm2.5 and other pollutants; thus, the previous day's pm2.5 concentration also affects the AQI. This effect can be seen as the evolution result of the previous day's pm2.5 and is represented by $e_t$. The net growth $p_t - d_t$ can be observed directly.
The production term $p_t$ is generated by a semimanual method. It is mainly related to daily human activities, and we calculate it from AQI sequences of no fewer than five consecutive sunny and windless days. Special circumstances are also considered. In winter, $p_t$ is larger because the heating system is on. The car usage restrictions and temporary stoppage of factories during Beijing APEC 2014 are also taken into consideration.
The natural term $d_t$ is then calculated as $p_t - (p_t - d_t)$, that is, by subtracting the observed net growth from $p_t$. Sometimes, $d_t$ is greater than zero, which means pm2.5 accumulates because of nonhuman factors.
Thus, the daily net growth of pm2.5, $p_t - d_t$, is a function of the evolution result $e_t$, the industry control index $i_t$, and the forecast of wind power $w_t$. Treating this problem as a dynamic structural model, our model can be described as
$$p_t - d_t = \beta_0 + \beta_1 e_t + \beta_2 w_t + \beta_3 i_t + \beta_4 (p_{t-1} - d_{t-1}) + v_t. \tag{1}$$
Parameters $\beta_1$, $\beta_2$, and $\beta_3$, respectively, represent the effect caused by the pm2.5 of the previous day, the wind power, and the industry control index. The net growth of the previous day's pm2.5 partly affects the present day's pm2.5 and partly affects the next day's pm2.5. The parameter $\beta_4$ represents this "partial adjustment." The disturbance $v_t$ represents other factors which affect the present day's pm2.5.
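As a minimal sketch of how a linear net-growth relation of this kind could be evaluated, the following Python function adds one term per factor described above. The function name, the argument names, the intercept, and every numeric value are hypothetical and serve only as an illustration; they are not estimates from the paper.

```python
def predict_net_growth(beta, evolution, wind, control, prev_growth, disturbance=0.0):
    """Evaluate a linear net-growth model with one term per factor.

    beta is (b0, b1, b2, b3, b4); b0 is an assumed intercept, b1..b3 weight
    the evolution result, the wind power, and the industry control index,
    and b4 weights the previous day's net growth (partial adjustment).
    """
    b0, b1, b2, b3, b4 = beta
    return (b0 + b1 * evolution + b2 * wind + b3 * control
            + b4 * prev_growth + disturbance)


# Hypothetical coefficients and inputs, for illustration only.
beta = (1.0, 0.5, -2.0, -1.0, 0.25)
growth = predict_net_growth(beta, evolution=10.0, wind=2.0,
                            control=2.0, prev_growth=8.0)
```

With these made-up numbers, negative coefficients on wind and control reduce the predicted net growth, which mirrors the qualitative roles the text assigns to diffusion and emission control.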

Complexity Reduction of the Prediction Model.
To facilitate the research and modeling process, we prove that this model can be reduced to a vector autoregressive model.

Proposition 1. Formula (1) is a vector autoregressive model.
Proof. Assume that the disturbance $v_t$ in formula (1) is serially correlated. The autocorrelation is
$$v_t = \rho v_{t-1} + u_t, \tag{2}$$
in which $u_t$ is white noise. Here, we apply the Cochrane-Orcutt iteration to rewrite formula (2):
$$(1 - \rho L)\, v_t = u_t,$$
where $L$ is the lag operator ($L x_t \equiv x_{t-1}$), which maps the current value of a time series to its value in the previous period.
The next step is to find the most suitable value of $\rho$ through successive iteration. Specifically, this method uses the residuals to estimate the unknown $\rho$.
Assume that we use previous days' AQI to predict the present day's AQI. Multiplying both sides of formula (1) by $(1 - \rho L)$ gives the expansion
$$p_t - d_t = (1 - \rho)\beta_0 + \beta_1 e_t - \beta_1 \rho\, e_{t-1} + \beta_2 w_t - \beta_2 \rho\, w_{t-1} + \beta_3 i_t - \beta_3 \rho\, i_{t-1} + (\beta_4 + \rho)(p_{t-1} - d_{t-1}) - \beta_4 \rho\,(p_{t-2} - d_{t-2}) + u_t, \tag{4}$$
where $p_t - d_t$ is the daily net growth and $e_t$, $w_t$, and $i_t$ denote the evolution result, the wind power, and the industry control index. Many assumptions are neglected in this substitution. However, ordinary least squares (OLS) estimation should not be used for formula (4), because OLS can only summarize the relationship between daily pm2.5 production and the policy control index, the accumulation of history pm2.5, and the wind power. The net growth of the previous day's pm2.5 is only one reason for the correlation among these variables.
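The successive iteration described above can be sketched in Python as follows. This is a generic Cochrane-Orcutt procedure on synthetic data, assuming a single regressor and an AR(1) disturbance; the function name `cochrane_orcutt`, the simulated series, and the true values 1.0, 2.0, and 0.6 are hypothetical and are not taken from the paper.

```python
import numpy as np


def cochrane_orcutt(y, X, n_iter=50, tol=1e-8):
    """Estimate beta and rho for y = X @ beta + v with v_t = rho * v_{t-1} + u_t."""
    rho = 0.0
    for _ in range(n_iter):
        # Quasi-difference both sides, i.e. apply (1 - rho L) to the regression.
        y_star = y[1:] - rho * y[:-1]
        X_star = X[1:] - rho * X[:-1]
        beta, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
        # Residuals of the original (untransformed) equation.
        resid = y - X @ beta
        # Re-estimate rho by regressing the residual on its own lag.
        rho_new = (resid[1:] @ resid[:-1]) / (resid[:-1] @ resid[:-1])
        converged = abs(rho_new - rho) < tol
        rho = rho_new
        if converged:
            break
    return beta, rho


# Synthetic example: intercept 1.0, slope 2.0, AR(1) disturbance with rho = 0.6.
rng = np.random.default_rng(0)
n, true_rho = 2000, 0.6
x = rng.normal(size=n)
u = rng.normal(size=n)
v = np.zeros(n)
for t in range(1, n):
    v[t] = true_rho * v[t - 1] + u[t]
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + v

beta_hat, rho_hat = cochrane_orcutt(y, X)
```

Because the constant column is quasi-differenced together with the regressors, the first element of `beta_hat` estimates the intercept of the original equation rather than the transformed one.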
The government can make policies to control the pm2.5 production of industry so as to obtain a "satisfying" daily production of pm2.5; that is, the policy control index is an endogenous variable. The policy control index depends on the present day's and previous days' accumulation of history pm2.5, the wind power, the daily production of pm2.5, and the daily diffusion of pm2.5, together with a disturbance term that represents the influence brought about by other policies.
The net growths of previous days' pm2.5 and the policy control index also affect the daily accumulation of pm2.5, together with a disturbance term that represents other factors that influence the daily accumulation of pm2.5.
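Collecting the equations for all five variables, the system can be written in the structural form
$$B_0 y_t = k + B_1 y_{t-1} + B_2 y_{t-2} + \cdots + B_p y_{t-p} + u_t, \tag{7}$$
where $y_t$ stacks the five variables and $u_t$ stacks the disturbances. In $B_0$, the parameters in the 1st, 2nd, 3rd, 4th, and 5th rows each relate one of the five variables to the other variables. Every $B_j$ is a 5 × 5 matrix. Premultiplying formula (7) by $B_0^{-1}$ (the inverse matrix of $B_0$) gives
$$y_t = c + \Phi_1 y_{t-1} + \Phi_2 y_{t-2} + \cdots + \Phi_p y_{t-p} + \varepsilon_t,$$
in which
$$c = B_0^{-1} k, \qquad \Phi_j = B_0^{-1} B_j, \qquad \varepsilon_t = B_0^{-1} u_t.$$
This is the standard form of a vector autoregressive model. So it is proved that our prediction model (formula (1)) is in fact a vector autoregressive model.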
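The premultiplication step in the proof can be illustrated numerically. The sketch below assumes a hypothetical system with two variables and one lag instead of the paper's five variables, and the matrices `B0`, `B1`, and `k` are made-up values; it converts the structural form to the reduced form and then iterates the reduced form to produce a multi-day forecast with future disturbances set to zero.

```python
import numpy as np

# Hypothetical structural system: B0 @ y_t = k + B1 @ y_{t-1} + u_t
B0 = np.array([[1.0, 0.0],
               [-0.5, 1.0]])   # unit lower triangular
B1 = np.array([[0.6, 0.1],
               [0.2, 0.3]])
k = np.array([1.0, 0.5])

# Premultiply by the inverse of B0 (solve avoids forming the inverse explicitly).
c = np.linalg.solve(B0, k)
Phi1 = np.linalg.solve(B0, B1)


def forecast(y_last, steps):
    """Iterate y_t = c + Phi1 @ y_{t-1}, with future disturbances set to zero."""
    y = np.asarray(y_last, dtype=float)
    path = []
    for _ in range(steps):
        y = c + Phi1 @ y
        path.append(y)
    return np.array(path)


y0 = np.array([2.0, 1.0])
path = forecast(y0, steps=7)   # one row per forecast day
```

Each row of `path` is the forecast for one further day ahead, which is the same recursion a one-week prediction would use once the reduced-form coefficients are known.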
The regression parameters of our haze prediction model can be obtained as follows. Let
$$x_t = (1,\; y_{t-1}',\; y_{t-2}',\; \ldots,\; y_{t-p}')', \qquad -\Gamma = [\,k \;\; B_1 \;\; B_2 \;\; \cdots \;\; B_p\,].$$
The dynamic structural system (formula (7)) is then
$$B_0 y_t = -\Gamma x_t + u_t. \tag{12}$$
Assume that the disturbance terms are neither serially correlated nor correlated with each other, which means that $D = E(u_t u_t')$ is a diagonal matrix. Formula (12) can be written as
$$y_t = \Pi' x_t + \varepsilon_t,$$
in which
$$\Pi' = -B_0^{-1} \Gamma, \qquad \varepsilon_t = B_0^{-1} u_t.$$
Let Ω be the variance-covariance matrix of $\varepsilon_t$:
$$\Omega = E(\varepsilon_t \varepsilon_t') = B_0^{-1} D\, (B_0^{-1})'.$$
Suppose $B_0$ is a lower triangular matrix in which all main diagonal elements are equal to 1, and $D$ is a diagonal matrix. The parameters $(B_0, \Gamma, D)$ can be obtained through full-information maximum likelihood estimation. The maximum likelihood estimate of Ω can be obtained from the variance-covariance matrix of the regression residuals. Finally, $B_0^{-1}$ and $D$ are calculated through triangular decomposition of Ω; thus, Γ can be evaluated.
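The triangular decomposition mentioned in the last step can be sketched with NumPy. The example assumes a small, made-up 3 × 3 covariance matrix `omega` rather than the paper's estimated 5 × 5 matrix, and the helper name `triangular_factorization` is hypothetical. It factors a positive definite matrix as A D A' with A unit lower triangular and D diagonal, so that A plays the role of the inverse of the unit lower triangular structural matrix.

```python
import numpy as np


def triangular_factorization(omega):
    """Return (A, D) with omega = A @ D @ A.T, A unit lower triangular, D diagonal."""
    P = np.linalg.cholesky(omega)   # omega = P @ P.T with P lower triangular
    s = np.diag(P)                  # positive diagonal entries of P
    A = P / s                       # divide column j by P[j, j] -> unit diagonal
    D = np.diag(s ** 2)
    return A, D


# Made-up symmetric positive definite covariance matrix, for illustration only.
omega = np.array([[4.0, 2.0, 0.4],
                  [2.0, 3.0, 0.5],
                  [0.4, 0.5, 2.0]])

A, D = triangular_factorization(omega)
B0 = np.linalg.inv(A)               # unit lower triangular structural matrix
```

Going through the Cholesky factor is one convenient route; it requires the residual covariance estimate to be positive definite, which should be checked on real data.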
In summary, the prediction model of Beijing AQI considers factors including industry emission and policy control, together with the chemical changes of previous days' pollution accumulation and the diffusion conditions. The model also takes the correlations between these factors into consideration and introduces time series haze features into the dynamic structural model. The policy control index is simulated from the record of 4 severe haze episodes during this period. The diffusion is evaluated from the weather record of daily wind power.

Model Evaluations
We collected the daily AQI and daily weather information from 28 Oct. 2013 to 31 Mar. 2016. This complete sequence is used to test the accuracy of the prediction model. The next day's AQI prediction experiment (Section 4.1) and long-term AQI prediction experiment (Section 4.2) are both implemented. The next day's AQI prediction is evaluated from two perspectives: the accuracy of daily prediction and the accuracy of statistical analysis.

Next Day's AQI Prediction.
The next day's weather forecast information is used in the next day's AQI prediction. The observed and predicted daily mean AQI in Beijing are illustrated in Figure 1. The simulation result shows that the predicted values match the observed values very well over the whole sequence of 886 days. Occasionally the prediction deviates severely from the observed value; for example, on 19 Feb. 2014, the observed AQI was 89, while our model gave a prediction of 209, an offset of 135%. In fact, on the afternoon of 19 Feb., the wind in Beijing suddenly changed from northeasterly to southwesterly, and by 19:00 the AQI had already reached 170, which can be interpreted as our model successfully forecasting a severe haze outbreak several hours in advance; over the following 7 days, the average daily AQI of Beijing was 305. The occasional occurrence of this "foreseeing" phenomenon is caused by the coarse (daily) time granularity, and it is marked with red ellipses in Figure 1. These marks indicate that our model can "foresee" sharp changes in both outbreaks and diffusions. Most haze outbreaks and diffusions are accurately simulated; some are foreseen, but none is predicted late.
Figure 2(a) is a scatter diagram of daily AQI, including both observed and predicted values. Most points lie close to the line $y = x$ (the red line). However, some points line up along the bottom, which means the observed AQI exceeds 200 while the predicted value is less than 50. There are 15 such outliers in total, 7 of which "foresee" haze diffusion, while the other 8 bug points cannot be well interpreted. All 15 points are checked and listed in Table 1, where "✓" means a "foreseeing" phenomenon and "?" represents a bug point. Figure 2(b) is a scatter diagram of annual AQI (the sum of daily AQI in a given year). Our data cover only 2 months of 2013 and 3 months of 2016, so in this diagram these 2 points lie in the lower left corner.
The pie chart in Figure 3 shows the distribution of prediction accuracy. The deviation between the predicted and observed AQI is obtained through the following formula:
$$\text{deviation} = \frac{\left|\mathrm{AQI}_{\text{predicted}} - \mathrm{AQI}_{\text{observed}}\right|}{\mathrm{AQI}_{\text{observed}}} \times 100\%$$
Figure 3 shows that 55% of the predictions match the observed values very well (<20% deviation). The purple part is mainly caused by the "foreseeing" phenomenon. Most samples of the red part come from less-polluted days. For example, on 12 Jan. 2016, the AQI prediction is 40 while the observed AQI is 29, which makes a deviation of 37.9%. In fact, statistics also indicate that our model performs better in worse air conditions (Figure 4). A sample is correctly predicted if its deviation is less than 20% or the predicted air quality level matches the observed level.
From 26 Dec. 2015 to 31 Mar. 2016, we predict the AQI for the next 7 days and check the accuracy of the $n$-day predictions. Figure 5 shows the accuracy of long-term prediction in the 91 days' experiment. The accuracy stays stable for the next 3, 4, 5, 6, and 7 days' AQI prediction, which indicates that our model is robust on the task of long-term prediction. The next day's prediction accuracy surprisingly reaches 79.1%, which is far better than in the experiment in Section 4.1. The reason is that, during the 91 days, 6 haze episodes struck Beijing. These frequent episodes contributed a lot to the overall performance because our model is very sensitive to sudden changes of AQI, including outbreaks and diffusions (Section 4.1; Figure 4). Figures 6 and 7 show several haze episodes during the 91 days. Both figures show a pm2.5 change process of more than 2 weeks. Figure 6 also shows a "foreseeing" phenomenon caused by coarse time granularity, marked by a red ellipse.
Figure 5: The accuracy of long-term AQI prediction.
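The deviation measure and the correctness rule used in this evaluation can be sketched in Python as follows. The relative-deviation form is inferred from the two worked examples in the text (209 vs. 89 and 40 vs. 29). The level boundaries in `LEVEL_BOUNDS` are an assumption based on the common six-level AQI scale, since the text does not list them, and all function names are hypothetical.

```python
from bisect import bisect_left

# Assumed upper bounds of AQI levels: 0-50, 51-100, 101-150, 151-200, 201-300, >300.
LEVEL_BOUNDS = [50, 100, 150, 200, 300]


def deviation(predicted, observed):
    """Relative deviation of the prediction from the observation, in percent."""
    return abs(predicted - observed) / observed * 100.0


def aqi_level(aqi):
    """Index of the air quality level (0 = best) under the assumed boundaries."""
    return bisect_left(LEVEL_BOUNDS, aqi)


def is_correct(predicted, observed, threshold=20.0):
    """Correct if the deviation is below the threshold or the levels match."""
    return (deviation(predicted, observed) < threshold
            or aqi_level(predicted) == aqi_level(observed))


def accuracy(pairs):
    """Share of (predicted, observed) pairs that count as correct."""
    return sum(is_correct(p, o) for p, o in pairs) / len(pairs)


# The two examples quoted in the text.
dev_feb = deviation(209, 89)
dev_jan = deviation(40, 29)
```

Under these assumptions, the 12 Jan. 2016 sample has a deviation above 20% but still counts as correct because both values fall in the same level, which illustrates why the level-matching clause matters on less-polluted days.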

Conclusion and Future Work
This paper presented a dynamic structural model to predict Beijing's daily AQI. The model integrates natural, human, and self-evolution factors into a time series model. The dynamic structural measurement model of daily haze increment is proven to be a vector autoregressive model. Experiments reflected two highlights of this model. First,