A Decomposition-Ensemble Approach with Denoising Strategy for PM 2.5 Concentration Forecasting



Introduction
PM 2.5 refers to particulate matter with an aerodynamic diameter of less than or equal to 2.5 microns in the atmosphere, also known as particulate matter that can enter the lungs [1][2][3]. Although PM 2.5 is only a small fraction of the Earth's atmospheric composition, it has an important impact on air quality [4]. It exerts a negative influence on society, for example by increasing the risk of disease and impeding economic development [5][6][7][8][9][10]. Moreover, air pollution has increased with the development of industry and the growing number of fuel-powered cars (Maji et al. 2018). Accurate forecasting of PM 2.5 concentration thus has important practical value: it enables people to make informed decisions, reduces economic losses, and benefits public health. It is therefore necessary to predict PM 2.5 concentration efficiently and accurately.
At present, many scholars have conducted extensive research in this field. These studies can be roughly divided into four categories: time series models [11][12][13][14], econometric models [15][16][17], artificial intelligence (AI) models [18][19][20], and hybrid models [21,22]. In particular, time series approaches, as traditional forecasting methods, are often used to predict PM 2.5 concentration. For example, the moving average (MA), autoregressive (AR), autoregressive moving average (ARMA), and autoregressive integrated moving average (ARIMA) models are often used for predicting PM 2.5 and PM 10 concentrations [14,23]. Econometric models are also frequently used for PM 2.5 forecasting because of their better interpretability [16].
However, owing to the impact of multiple related factors, PM 2.5 concentration data exhibit nonstationary and nonlinear features. Thus, artificial intelligence methods, which can capture and approximate such complicated data features, have been extensively used to forecast PM 2.5 [18]. A variety of machine learning techniques have been applied to train and model the data patterns in this field [24]. Furthermore, deep learning can learn the nonlinear mapping underlying PM 2.5 concentration, which may yield good predictive performance.
In addition, considering the limitations of any single model, some scholars have proposed hybrid models that combine the advantages of different models to obtain more stable and accurate results [20,25,26]. Hybrid approaches involve different means of combining forecasting models. In general, they first extract different components through decomposition and then combine the components with a forecasting model [21,[27][28][29]; some scholars have also added parameter optimization algorithms [30,31].
As mentioned above, although many scholars have adopted different forecasting approaches or proposed different hybrid models, most existing combined models do not comprehensively consider data noise processing, data feature capture, and forecasting techniques together. Regarding data noise processing, many previous studies applied different denoising methods to the original data, such as singular spectrum analysis [32], the Fourier transform, and the wavelet transform [33][34][35][36][37][38]. Among them, wavelet analysis can process the original data more conveniently and efficiently; its advantage is that the noise is almost completely suppressed while the characteristic peaks of the original signal are well preserved [39,40]; Dimitriou and Kassomenos 2014. Regarding data feature capture, the common approaches include the EMD family and VMD. VMD has been widely applied to decompose time series in different fields, such as wind speed prediction and power load prediction [41][42][43][44], and has achieved relatively good results. Moreover, compared with EMD, VMD can avoid the endpoint divergence problem of EMD and better extract the components of PM 2.5 concentration data. Regarding forecasting techniques, machine learning approaches are widely incorporated into prediction for complex systems, especially PM 2.5 [30]. KELM has the advantages of few training parameters, fast learning speed, and strong generalization ability, and it can obtain particularly good forecasting results on nonlinear data sets [45].
Considering the noise introduced by the way PM 2.5 concentration data are collected and the denoising characteristics of the wavelet transform, we first apply wavelet denoising and then VMD decomposition, owing to the complexity of the time series data. Finally, considering the validity of KELM's data fitting, KELM is selected as the final prediction model. Therefore, this study proposes a novel hybrid forecasting method, WAV-VMD-KELM, which improves forecasting accuracy by systematically considering noise processing, data feature capture, and forecasting technique. First, the wavelet transform is used to denoise the data; then, variational mode decomposition (VMD) is adopted to decompose the data; next, the kernel extreme learning machine (KELM) is employed to predict the decomposed components; finally, the component forecasts are aggregated. In this way, a new hybrid denoising-decomposition-ensemble algorithm is proposed based on comprehensive consideration of the causes of noise and the nonlinearity and nonstationarity of PM 2.5 concentration. The two main contributions of this study are as follows: (1) focusing on the problems of noise processing, data feature capture, and forecasting technique selection for PM 2.5 concentration, a novel hybrid forecasting approach, i.e., WAV-VMD-KELM, is proposed, improving forecasting accuracy through denoising, decomposition, individual forecasting, and ensemble of the results; (2) a novel decomposition-ensemble approach with a denoising strategy is applied to forecasting hourly PM 2.5 concentration data in Xi'an. The experimental results show that the new hybrid approach achieves better forecasting performance than the benchmarks and can significantly improve the forecasting of PM 2.5. The rest of this paper is organized as follows. Section 2 introduces the approaches used in this research.
Section 3 presents several experiments, analyses, and a discussion of the forecasting results. Finally, Section 4 presents the conclusions of this study.

Methodology
Section 2.1 gives an overview of the proposed decomposition-ensemble approach with denoising strategy, and Sections 2.2-2.4, respectively, describe the related techniques: wavelet denoising, EMD, VMD, and KELM.

Framework. As shown in Figure 1, the entire process includes wavelet denoising, multiscale analysis, PM 2.5 concentration forecasting, and evaluation of approaches, summarized as follows:

(1) Wavelet Denoising. For the original PM 2.5 concentration time series, an effective denoising approach, wavelet denoising, is adopted to process the nonstationary time series; Section 2.2 gives the detailed procedure.

(2) Multiscale Analysis. VMD is applied to decompose the denoised sequence into several modes, whose low and high frequencies reveal the different characteristics hidden in the PM 2.5 concentration time series.

Wavelet Denoising. The wavelet denoising method was first proposed in [34]. It is a nonlinear denoising method that is approximately optimal in the sense of minimum mean square error. Figure 2 shows the wavelet denoising procedure, and each step is described in detail below. Wavelet denoising is based on wavelet decomposition, so some concepts of wavelet decomposition are introduced first.
The continuous wavelet transform shifts a basic wavelet $\varphi(t)$ by a displacement $\tau$ and scales it by a factor $\alpha$, then takes the inner product with the signal to be analyzed $X(t)$:

$$WT_X(\alpha, \tau) = \frac{1}{\sqrt{\alpha}} \int X(t)\, \varphi^{*}\!\left(\frac{t-\tau}{\alpha}\right) dt,$$

where $\alpha > 0$ is the scale factor, which stretches or compresses the basic wavelet $\varphi(t)$, and $\tau$ reflects the displacement, which can be positive or negative; both $\alpha$ and $\tau$ are continuous variables.
As mentioned above, the noisy time series can be expressed as

$$y(t) = x(t) + \varepsilon(t),$$

where $x(t)$ is the real signal and $\varepsilon(t)$ is white noise. Taking the wavelet transform of both sides gives

$$WT_y(\alpha, \tau) = WT_x(\alpha, \tau) + WT_\varepsilon(\alpha, \tau),$$

since, by the properties of the wavelet transform, the transform of the measured signal equals the sum of the transforms of its constituent signals. After an orthogonal wavelet transform, the correlation within the signal $y(t)$ can be removed to the greatest extent, and most of the energy is concentrated in a small number of wavelet coefficients with relatively large amplitude. By contrast, the noise $\varepsilon(t)$ is spread across all time positions at every scale, with small amplitude. Based on this principle, the wavelet coefficients attributable to noise are shrunk at each scale, and the signal is then reconstructed from the processed coefficients, thereby suppressing the noise [37].
So, the threshold denoising process of one-dimensional signals can be divided into three steps: selection of the appropriate wavelet transform, threshold processing of wavelet coefficients, and wavelet reconstruction.
(1) Selection of the appropriate wavelet transform and decomposition of the signal. Select a wavelet, determine the decomposition level N, and perform an N-layer wavelet decomposition of the signal. In general, the wavelet basis function should be chosen by comprehensively considering support length, vanishing moments, symmetry, regularity, and similarity. Each wavelet basis has its own characteristics in signal processing, and no single basis achieves the optimal denoising effect for all kinds of signals. The Daubechies (dbN) and Symlets (symN) wavelets are two families often used in speech denoising. The number of decomposition layers is also an important choice: the larger it is, the more clearly the different characteristics of noise and signal emerge, which facilitates their separation; on the other hand, more decomposition layers increase the distortion of the reconstructed signal, which affects the final denoising effect to some extent. Therefore, a suitable decomposition scale is selected after comprehensive consideration in applications. (2) Threshold processing of wavelet coefficients. An important factor that directly affects the denoising result is the choice of threshold; different thresholds yield different denoising effects. The threshold function is a rule for correcting the wavelet coefficients, and different threshold functions reflect different strategies for handling them.
There are two common types of threshold functions, hard and soft, and there is also a Garrote function between them. The hard threshold function is superior to the soft threshold method in the sense of mean square error. In this study, the hard threshold is selected to denoise the data: when the absolute value of a wavelet coefficient is less than the given threshold, it is set to 0; if it is greater than the threshold, it is kept unchanged. The mathematical expression is

$$\hat{w} = \begin{cases} w, & |w| \geq \lambda, \\ 0, & |w| < \lambda, \end{cases}$$

with threshold value

$$\lambda = s \sqrt{2 \ln N},$$

where $s$ is the standard deviation of the noise and $N$ is the length of the data. (3) Wavelet reconstruction. The processed wavelet coefficients are reconstructed: based on the low-frequency coefficients of the Nth decomposition layer and the thresholded high-frequency coefficients of layers 1 to N, the signal is reconstructed by the inverse wavelet transform, yielding an estimate of the original signal.
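The three steps above (decompose, hard-threshold the detail coefficients, reconstruct) can be sketched with a single-level Haar transform and the universal threshold. This is a minimal pure-NumPy illustration, not the db6/3-layer MATLAB setup used in the experiments later in the paper; the Haar basis is chosen only because its transform fits in a few lines.

```python
import numpy as np

def haar_hard_denoise(x, threshold):
    """One-level Haar wavelet decomposition, hard thresholding of the
    detail (high-frequency) coefficients, and reconstruction.
    x must have even length."""
    s = 1.0 / np.sqrt(2.0)
    approx = (x[0::2] + x[1::2]) * s          # low-frequency coefficients
    detail = (x[0::2] - x[1::2]) * s          # high-frequency coefficients
    # Hard threshold: keep coefficients with |w| >= threshold, zero the rest.
    detail = np.where(np.abs(detail) >= threshold, detail, 0.0)
    y = np.empty_like(x, dtype=float)
    y[0::2] = (approx + detail) * s           # inverse Haar transform
    y[1::2] = (approx - detail) * s
    return y

def universal_threshold(noise_std, n):
    """Donoho's universal threshold: lambda = s * sqrt(2 ln N)."""
    return noise_std * np.sqrt(2.0 * np.log(n))
```

With threshold 0 the transform round-trips exactly; with a large threshold each sample pair collapses to its mean, illustrating how hard thresholding flattens small fluctuations while leaving large coefficients untouched.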

Variational Mode Decomposition. EMD is a commonly used decomposition method and serves as the comparison baseline for VMD in this paper. EMD was proposed in [46]. The main purpose of the algorithm is to decompose signals into characteristic modes. Its advantage is that it does not use any predefined function as the basis but adaptively generates intrinsic mode functions from the analyzed signal. It can be used to analyze nonlinear and nonstationary signal sequences with high signal-to-noise ratio and good time-frequency localization.
VMD is a novel nonrecursive and adaptive signal decomposition method that tolerates much more sampling noise than popular decomposition methods such as empirical mode decomposition. The main goal of VMD is to decompose a time series into a discrete set of band-limited modes $u_k$, where each mode $u_k$ is considered compact around a center pulsation $\omega_k$ determined during the decomposition.
For example, the time series $f(t)$ is decomposed into a set of modes $u_k$ around center pulsations $\omega_k$ according to the following constrained variational problem [39,40]:

$$\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\} \quad \text{subject to} \quad \sum_{k} u_k(t) = f(t),$$

where $K$ is the number of modes, and $\delta$ and $*$ denote the Dirac distribution and the convolution operator. $\{u_k\}$ and $\{\omega_k\}$ represent the set of modes $u_1, u_2, \ldots, u_K$ and the set of center pulsations, respectively. The constrained variational problem can be converted into an unconstrained one by introducing a quadratic penalty term and Lagrange multipliers $\lambda$:

$$L(\{u_k\},\{\omega_k\},\lambda) = \alpha \sum_{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| f(t) - \sum_{k} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, f(t) - \sum_{k} u_k(t) \right\rangle,$$

where $\alpha$ is a balance parameter, $\lambda$ denotes the Lagrange multipliers, and $\| f(t) - \sum_k u_k(t) \|_2^2$ is the quadratic penalty term that accelerates convergence. The problem is solved as a sequence of suboptimizations, yielding the following updates for $u_k$, $\omega_k$, and $\lambda$:

$$\hat{u}_k^{n+1}(\omega) = \frac{\hat{f}(\omega) - \sum_{i \neq k} \hat{u}_i(\omega) + \hat{\lambda}^n(\omega)/2}{1 + 2\alpha (\omega - \omega_k)^2},$$

$$\omega_k^{n+1} = \frac{\int_0^{\infty} \omega \, |\hat{u}_k^{n+1}(\omega)|^2 \, d\omega}{\int_0^{\infty} |\hat{u}_k^{n+1}(\omega)|^2 \, d\omega},$$

$$\hat{\lambda}^{n+1}(\omega) = \hat{\lambda}^n(\omega) + \tau \left( \hat{f}(\omega) - \sum_{k} \hat{u}_k^{n+1}(\omega) \right),$$

where $\hat{f}(\omega)$, $\hat{u}_i(\omega)$, and $\hat{\lambda}(\omega)$ denote the Fourier transforms of $f(t)$, $u_i(t)$, and $\lambda(t)$, and $n$ is the iteration number. The number of modes $K$ must be chosen before applying VMD; there is no theory on the optimal selection of $K$.
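As a rough illustration of these update equations, the NumPy sketch below runs the Fourier-domain mode update and the power-weighted center-frequency update with $\tau = 0$ (no Lagrangian update) on the full FFT spectrum. This is a didactic simplification under stated assumptions: production VMD implementations additionally mirror-extend the signal and work on the analytic (one-sided) spectrum.

```python
import numpy as np

def vmd_sketch(f, K=2, alpha=2000.0, n_iter=200):
    """Minimal VMD iteration: returns modes (time domain) and center
    frequencies (in cycles per sample). tau = 0, no mirror extension."""
    N = len(f)
    freqs = np.fft.fftfreq(N)                 # cycles per sample
    f_hat = np.fft.fft(f)
    u_hat = np.zeros((K, N), dtype=complex)
    omega = np.linspace(0.0, 0.25, K)         # spread initial center freqs
    pos = freqs > 0                           # positive-frequency half
    for _ in range(n_iter):
        for k in range(K):
            others = u_hat.sum(axis=0) - u_hat[k]
            # Wiener-filter-like mode update concentrated around omega_k
            u_hat[k] = (f_hat - others) / (1.0 + 2.0 * alpha * (freqs - omega[k])**2)
            power = np.abs(u_hat[k, pos])**2
            # center frequency: power-weighted mean of positive frequencies
            omega[k] = np.sum(freqs[pos] * power) / (np.sum(power) + 1e-300)
    modes = np.real(np.fft.ifft(u_hat, axis=1))
    return modes, omega
```

Applied to a sum of two sinusoids, the recovered center frequencies converge near the true tones, which is the behavior the update equations are designed to produce.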

Kernel Extreme Learning Machine. For a single-hidden-layer neural network with $L$ hidden nodes and $N$ arbitrary samples $(X_i, t_i)$, the network output can be expressed as

$$\sum_{i=1}^{L} \beta_i \, g(W_i \cdot X_j + b_i) = o_j, \quad j = 1, 2, \ldots, N,$$

where $g(\cdot)$ is the activation function, $W_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T$ is the input weight vector, $\beta_i$ is the output weight, $b_i$ is the bias of the $i$th hidden unit, and $W_i \cdot X_j$ denotes the inner product of $W_i$ and $X_j$. The objective of single-hidden-layer network learning is to minimize the output error,

$$\sum_{j=1}^{N} \| o_j - t_j \| \to 0;$$

that is, there exist $\beta_i$, $W_i$, and $b_i$ such that

$$\sum_{i=1}^{L} \beta_i \, g(W_i \cdot X_j + b_i) = t_j, \quad j = 1, 2, \ldots, N,$$

which can be written in matrix form as

$$H\beta = T,$$

where $H$ is the hidden-layer output matrix, $\beta$ is the output weight vector, and $T$ is the expected output.
To train the single-hidden-layer network, we need $\hat{W}_i$, $\hat{b}_i$, and $\hat{\beta}$, $i = 1, 2, \ldots, L$, such that

$$\| H(\hat{W}_i, \hat{b}_i)\hat{\beta} - T \| = \min_{W_i, b_i, \beta} \| H(W_i, b_i)\beta - T \|,$$

which is equivalent to minimizing the loss function

$$E = \sum_{j=1}^{N} \left\| \sum_{i=1}^{L} \beta_i \, g(W_i \cdot X_j + b_i) - t_j \right\|^2.$$

Traditional gradient-based algorithms can solve such problems, but they must adjust all parameters iteratively [45]. In the ELM algorithm, once the input weights $W_i$ and hidden-layer biases $b_i$ are randomly determined, the hidden-layer output matrix $H$ is uniquely determined. Training the network then reduces to solving the linear system $H\beta = T$, whose solution is

$$\hat{\beta} = H^{+} T,$$

where $H^{+}$ is the Moore-Penrose generalized inverse of $H$; it can be proved that this solution has minimal norm and is unique. Kernel functions have strong nonlinear mapping ability, which can overcome the curse of dimensionality: for linearly inseparable problems, a kernel maps the data to a high-dimensional space where they become linearly separable [45]. In KELM, the hidden-layer feature map $h(x)$ remains unknown and is replaced by a corresponding kernel function $K(u, v)$; the number of hidden nodes $L$ also does not need to be set. The detailed structure of KELM is shown in Figure 3.
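The closing idea, replacing the unknown feature map $h(x)$ by a kernel, can be sketched in a few lines: with a kernel matrix $K$ and a regularization constant $C$, the KELM output weights solve $(K + I/C)\beta = T$, and prediction reduces to kernel evaluations against the training set. The paper's experiments ran in MATLAB; this is a NumPy sketch, and the RBF kernel with the particular `gamma` and `C` values below are illustrative choices, not the paper's settings.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-gamma * d2)

class KELM:
    """Kernel extreme learning machine: beta = (K + I/C)^{-1} T."""
    def __init__(self, gamma=1.0, C=1e4):
        self.gamma, self.C = gamma, C

    def fit(self, X, T):
        self.X = X
        K = rbf_kernel(X, X, self.gamma)
        # Regularized linear solve replaces the Moore-Penrose inverse of H.
        self.beta = np.linalg.solve(K + np.eye(len(X)) / self.C, T)
        return self

    def predict(self, Xnew):
        return rbf_kernel(Xnew, self.X, self.gamma) @ self.beta
```

Note that, as the text says, no hidden-layer size or random weights appear anywhere: the kernel matrix plays the role of $HH^T$.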

Experimental Results and Analysis
The PM 2.5 concentration data used in this research are hourly data collected from Xi'an, Shaanxi province, obtained from http://www.cnemc.cn/. These data serve as our experimental data, and two popular evaluation criteria are used to verify the forecasting results of the hybrid approach. In this section, we verify the effectiveness of our hybrid approach through a PM 2.5 forecasting experiment in Xi'an. The experiments were performed in MATLAB R2018a on Windows 10 (64-bit) with a 2.00 GHz AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx.

Data Description.
Xi'an, located in western China, is a famous tourist city with a long history of civilization and an important city for One Belt One Road. However, the large amounts of automobile and industrial exhaust in Xi'an have caused serious air pollution: haze persists, and air quality has become an increasingly serious concern. Air quality not only affects tourists' travel experience but also has a potential impact on people's health. Hourly PM 2.5 data were collected from January 1, 2019, to December 31, 2019, as illustrated in Figure 4. Before prediction, some basic descriptive statistics were used for a preliminary exploration of the data. Table 1 lists the mean, standard deviation (std.), and min-max of the PM 2.5 concentration data. As depicted in Table 1 and Figure 4, the PM 2.5 concentration time series is complex, with nonlinear and nonstationary characteristics reflected in the extreme values and the continuity of the data. More importantly, we found some values less than 3 that differed from adjacent values by a factor of 10; such values were replaced by the mean of the preceding and following values. The data from 0:00 January 1, 2019, to 24:00 December 24, 2019, were selected as the in-sample training set, and the remaining data from 0:00 December 25, 2019, to December 31, 2019, were taken as the out-of-sample forecast set.
In this study, a common time series forecasting scheme is used: history observations {x_{t-1}, x_{t-2}, . . ., x_{t-p}} serve as inputs to calculate the predicted value x_{t+h-1}, where h and p denote the forecasting horizon and the lag order, respectively. Figure 5 details how the data are divided under this scheme. In this study, the lag order is p = 12 and the forecasting horizon is h = 1.
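The sliding-window construction described above can be sketched as follows; `p` and `h` match the text's lag order and forecasting horizon, and the ordering of lags within a row is an implementation choice.

```python
import numpy as np

def make_lagged_dataset(series, p=12, h=1):
    """Build input rows [x_{t-1}, ..., x_{t-p}] and targets x_{t+h-1}
    for every index t with a full history. Returns (X, y)."""
    x = np.asarray(series, dtype=float)
    X, y = [], []
    for t in range(p, len(x) - h + 1):
        X.append(x[t - p:t][::-1])   # most recent lag first
        y.append(x[t + h - 1])
    return np.array(X), np.array(y)
```

Splitting `X` and `y` chronologically then yields the in-sample and out-of-sample sets used in the experiments.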

Evaluation Criteria. We use two commonly used evaluation criteria, the mean absolute error (MAE) and the root mean square error (RMSE), to evaluate the forecasting performance. MAE reflects the overall level of forecasting error, while RMSE measures the deviation between the actual values and the predicted results:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_{\mathrm{act}(i)} - y_{\mathrm{per}(i)} \right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_{\mathrm{act}(i)} - y_{\mathrm{per}(i)} \right)^2},$$

where $N$ is the number of testing samples, $y_{\mathrm{act}(i)}$ is the $i$th observed value, and $y_{\mathrm{per}(i)}$ is the $i$th forecast value.
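The two criteria translate directly into code; this NumPy version mirrors the definitions above.

```python
import numpy as np

def mae(y_act, y_per):
    """Mean absolute error."""
    y_act, y_per = np.asarray(y_act, float), np.asarray(y_per, float)
    return np.mean(np.abs(y_act - y_per))

def rmse(y_act, y_per):
    """Root mean square error."""
    y_act, y_per = np.asarray(y_act, float), np.asarray(y_per, float)
    return np.sqrt(np.mean((y_act - y_per)**2))
```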

Benchmarking Models.
In the field of air pollution forecasting, SVR, BP, ELM, and KELM are popular artificial intelligence forecasting techniques. In addition, these single forecasting techniques are often combined with denoising and decomposition-ensemble methods to improve forecasting performance. The forecasting results of the four single models are listed in Table 2. Compared with the other three artificial intelligence (AI) models, KELM achieves the best accuracy under both evaluation criteria. At the MAE and RMSE level, one interesting conclusion can be drawn: KELM is clearly the best of all the AI models, the possible reason being that KELM can better adapt to the complex time series of PM 2.5 concentration.

Performance Comparison of Denoising Forecasting Models. In the actual signal acquisition process, the collected signal is inevitably disturbed by noise from the environment and other factors, mainly originating from the data sources. First, wavelet denoising is applied to the data; then the single approaches mentioned above are used as individual forecasting tools on the denoised data and compared with each other.
(1) Denoising results. In the first step of the proposed method, wavelet denoising is employed to remove the white noise generated by the collection process from the PM 2.5 data. Figure 6 contrasts the denoising results with the original data. It can be clearly seen that, after wavelet denoising, the data become smoother where sudden changes occur. This smoothing helps restore the true state of the data and also aids the models' forecasting. The wavelet denoising parameters are as follows: wavelet basis function db6; decomposition level 3; hard thresholding with threshold 4.88. The RMSE between the denoised data and the original data is 5.88.
(2) Forecasting results. In the second step, the single models, i.e., SVR, BP, ELM, and KELM, are employed as individual forecasting tools. To keep the comparison consistent, the parameters are the same as before. Table 3 shows the comparison of the four denoising-forecasting combinations in terms of MAE and RMSE. These results verify the proposed wavelet denoising for PM 2.5 concentration forecasting in terms of accuracy, and two important findings emerge. First, wavelet denoising clearly improves all single models, which confirms that it is effective for this kind of sensor data: PM 2.5 data are mainly collected through sensors, which may produce noise due to the equipment, and the wavelet denoising designed in this study can effectively remove such noise, greatly improving the results. Second, in terms of both evaluation criteria, KELM's forecasts are the best, probably because the KELM approach can better adapt to such nonstationary time series data.

Performance Comparison of Denoising-Decomposition-Ensemble Models.
The seven denoising-decomposition-ensemble approaches use the denoising-forecasting techniques above as individual tools, and these approaches are then compared with each other. In particular, EMD and VMD are selected as the two decomposition methods to compare the effectiveness of decomposition.
(1) Decomposition results. In the first step, EMD and VMD are used to decompose the denoised PM 2.5 concentration data. Figures 7 and 8 partially show the decomposition results of EMD and VMD. In Figure 7, the IMFs are listed from the highest to the lowest frequency, and the last component is the residue. It is evident that the complex PM 2.5 concentration series (see also Figure 5) can be divided via EMD into a number of simple components, which further helps to make modelling easier. IMFs 1-3 behave like a random walk between 0 and 100, IMFs 4-6 reveal regular periodic features with different cycles, and IMF 7 and the residue show smooth central tendencies. Based on these components, simple models can characterize the data, which helps to enhance forecasting accuracy. Likewise, in Figure 8 the components are ordered from the highest to the lowest frequency, and it can be seen that the complex PM 2.5 concentration data can be divided via VMD into simple components on which simple models can obtain better results.
(2) Forecasting results. In this step, the denoising-decomposition-ensemble models, WAV-EMD-BP, WAV-EMD-SVR, WAV-EMD-ELM, WAV-EMD-KELM, WAV-VMD-BP, WAV-VMD-SVR, and WAV-VMD-ELM, are used as forecasting tools to model the extracted components. The parameters are the same as before. Table 4 displays the comparison of the seven hybrid models in terms of MAE and RMSE. An important conclusion is that the decomposition technique, and VMD in particular, is validated for PM 2.5 concentration forecasting in terms of accuracy. In terms of MAE and RMSE, three important findings can be obtained.
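The decomposition-ensemble step itself can be sketched as: forecast each component independently, then sum the component forecasts. In this illustration each mode is fit with a simple least-squares AR model as a stand-in for the paper's per-component learners (KELM and the others); the AR order `p` is an illustrative choice.

```python
import numpy as np

def ar_one_step(component, p=2):
    """Fit an AR(p) model by least squares and return the
    one-step-ahead forecast of the component."""
    x = np.asarray(component, dtype=float)
    n = len(x)
    # Row for time t holds [x_{t-1}, ..., x_{t-p}]; targets are x_t.
    X = np.column_stack([x[p - 1 - i:n - 1 - i] for i in range(p)])
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef @ x[::-1][:p])          # [x_{n-1}, ..., x_{n-p}]

def ensemble_forecast(components, p=2):
    """Ensemble step: sum of per-component one-step forecasts."""
    return sum(ar_one_step(c, p) for c in components)
```

Because each extracted mode is far more regular than the raw series (a pure sinusoid, for instance, follows an exact AR(2) recursion), the per-component models fit easily, which is the mechanism by which decomposition improves accuracy.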
First, decomposition improves forecasting accuracy, which proves that decomposition-ensemble technology is effective for such time series: the sequence is divided into different components, each with a certain regularity, so the same forecasting model achieves better results than on the undecomposed data. Second, VMD outperforms EMD, mainly because VMD avoids the endpoint divergence problem of EMD and can better capture the characteristics of PM 2.5 concentration data. Third, the novel hybrid approach proposed here achieves the best results in this experiment.

Discussion.
The effectiveness of the proposed model is discussed in this section.
The experimental results show that wavelet denoising is effective in processing the data collected by the sensor, that forecasting after EMD decomposition is better than forecasting without decomposition, and that VMD decomposition is better than EMD.
Besides, compared with the WAV-EMD-KELM approach, the WAV-VMD-KELM approach has lower forecasting errors in terms of MAE and RMSE, which demonstrates the superiority of VMD over EMD. Figures 9 and 10 show clearly that the developed hybrid approach has the best forecasting results, which confirms the effectiveness of the proposed combination and its advantageous forecasting performance. To conveniently show the forecasting capability of the WAV-VMD-KELM approach, Figure 11 shows its forecasting curve for the PM 2.5 concentration. As shown in Figure 11 and Table 4, the developed hybrid forecasting approach achieves the best forecasting results among the approaches considered in this research.
In summary, the WAV-VMD-KELM approach is clearly better than the comparison approaches, with smaller errors than all other approaches considered in this research.

Conclusions
In recent years, many alternative approaches have become available for PM 2.5 concentration forecasting, and these works have contributed to improved forecasting performance to a certain degree. Yet there remain some deficiencies in previous work; for instance, the noise produced by equipment is often neglected. Thus, to address the problems mentioned above, a new hybrid forecasting method incorporating wavelet denoising was developed. Based on the experimental results and analysis, we reach the following conclusions: (1) in these experiments, WAV-VMD-KELM achieves the best performance among the comparison approaches; (2) as a decomposition method, VMD has better predictive ability than EMD, as it is effective in capturing the various features hidden in the original datasets; (3) wavelet denoising can improve forecasting accuracy, which indicates the necessity of removing noise, especially when the data are obtained by sensors. Overall, with more accurate predictions, the proposed hybrid approach outperforms the alternative approaches, offering a novel and feasible method for PM 2.5 concentration forecasting. Furthermore, this new and viable option can also be applied to many other complex areas, such as tourism demand forecasting, wind speed forecasting, economic growth forecasting, product sales forecasting, and traffic flow forecasting. Nevertheless, some limitations remain. First, this research considers only the PM 2.5 time series and does not include other influencing factors. Another shortcoming is that this paper forecasts the PM 2.5 time series with single-objective optimization only, without trying multiobjective versions.
Hence, future research on PM 2.5 time series forecasting should highlight the following aspects: the use of other possible factors (meteorological factors in particular) and the application of multiobjective optimization algorithms. In addition, it would also be practicable to establish deep learning-enabled approaches, which are more effective and have the potential to be another significant research direction for future work [12,47].

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this study.