Air Pollutant Concentration Forecasting Using Long Short-Term Memory Based on Wavelet Transform and Information Gain: A Case Study of Beijing

Air pollutant concentration forecasting is an effective way to protect public health by warning of harmful air contaminants. In this study, a hybrid prediction model was established using information gain, the wavelet decomposition transform, and an LSTM neural network, and applied to the daily concentration prediction of atmospheric pollutants (PM2.5, PM10, SO2, NO2, O3, and CO) in Beijing. First, the collected raw data are filtered by feature selection based on information gain, yielding a set of factors strongly correlated with the prediction target. Then, the historical time series of daily air pollutant concentrations is decomposed into different frequencies by a wavelet decomposition transform and recombined into a high-dimensional training data set. Finally, the LSTM prediction model is trained on the high-dimensional data set, and the parameters are adjusted through repeated tests to obtain the optimal prediction model. The data used in this study were the concentrations of six air pollutants in Beijing from 1/1/2014 to 31/12/2016, and the atmospheric pollutant concentration data of Beijing between 1/1/2017 and 31/12/2017 were used to test the predictive ability of the model. The results show that the MAPE of the model's predictions is 7.45%. The hybrid prediction model therefore has high application value for atmospheric pollutant concentration prediction, owing to its high prediction accuracy and stability.


Introduction
The rapid development of urbanization and industrialization has brought enormous economic gains but has also put pressure on resources, energy, and the environment. The air pollution caused by this rapid development has become an important issue that restricts social and economic development and affects human health [1]. The six major atmospheric pollutants (PM2.5, PM10, SO2, NO2, O3, and CO) are harmful to human health. When the concentration of these pollutants exceeds the standard, they damage the human respiratory system and may cause headache, dyspnea, and heart attack, seriously affecting human health and thus restricting social development [2]. Therefore, it is vital to monitor and predict pollutant concentrations in order to effectively avoid the health threats caused by excessive pollutant levels. At the same time, the prediction of significant air pollution concentrations can serve as a policy tool for environmental protection departments to regulate social and economic activities such as transportation, industry, and urban construction under extreme air pollution conditions [3]. Therefore, to support environmental management decisions and avoid serious incidents caused by air pollution, it is urgent to establish a precise and stable pollutant concentration prediction model that can predict future air pollutant concentrations, helping the government publish air pollutant control measures and carry out public health protection work.
At present, research on the prediction of atmospheric pollutant concentrations mainly applies two kinds of methods: deterministic models and computational models. Deterministic models do not require a large amount of historical data, but they require complete knowledge of the pollution sources, the emission quantities, and the main chemical reactions and spatiotemporal physical transformation processes of the exhaust gases [4]. Computational models usually require a large amount of historical measurement data under various meteorological conditions, and the relationship between historical pollutant data and predicted variables is established by regression and neural network methods [5]. Sánchez et al. proposed a combination of three different methods, including an Elman neural network, an autoregressive integrated moving average model, and a hybrid of the two, applied to predicting the SO2 concentration at a monitoring station near a coal-fired power plant [6].
The results show that the hybrid method can still obtain excellent prediction results when the particle concentration is high. Wu et al., taking Beijing as an example, used a three-layer FFNN and a recurrent Elman network to design an air quality prediction model for the next-day PM10 concentration [7]. The optimal model was then selected based on test performance metrics and learning time. It was shown that the three-layer FFNN with the one-step secant (OSS) training algorithm is superior to the Elman network with the gradient descent with adaptive learning rate (GDX) training algorithm. As researchers continued to explore neural network algorithms, Das and Padhy constructed a new ANN model to predict PM10 concentrations in 23 EU countries [8]. This model reduced the average error of the results to less than 13% during testing. Subsequently, Elangasinghe et al. [9] extracted key information from daily available meteorological parameters and seasonal emission patterns and established a physics-based ANN air pollution prediction tool. The neural network model predicts better than a linear regression model based on the same input parameters, and it can fully capture the temporal variation of air pollutant concentration in a specific scene. However, these models usually share a common defect: the ability to predict the concentration of particulate matter such as PM2.5 and PM10 decreases at very high concentrations [5]. This flaw can mislead environmental control decisions and seriously affect human health. Therefore, more adequate experiments and sophisticated modeling techniques are needed to capture sudden changes in particle concentration. Taking this into account, Zhou et al., working with a low data dimension, used a hybrid EEMD-GRNN model based on data preprocessing and analysis to predict the PM2.5 concentration one day ahead [10].
The model can quickly and accurately predict the PM2.5 concentration for the next day. Wang et al. proposed an air pollutant prediction model based on a hybrid artificial neural network and a hybrid support vector machine [11]. By modifying the error term of the traditional methods, the artificial neural network and support vector machine effectively improved the prediction accuracy. It has been noted that previous studies have often addressed single-contaminant concentration predictions, ignoring the possible nonlinear correlations between different atmospheric contaminants [12]. Lv et al. established an empirical regression model for the prediction of PM2.5 and O3 concentrations in three large Chinese cities (Beijing, Nanjing, and Guangzhou) in 2016 [13]. The predictive model is an empirical nonlinear regression model designed for automated data retrieval and prediction platforms.
The traditional neural network model cannot meet the requirements of high-precision, multioutput air quality prediction, so researchers have improved prediction accuracy by improving the structure of the input variables. Ni et al.'s results show that selecting historical data such as the previous day's PM2.5, PM10, temperature, wind direction, and wind speed to train the model is crucial for improving prediction accuracy [14]. On the other hand, Liu et al. proposed a new collaborative prediction model, using the support vector regression (SVR) method to predict the Chinese urban air quality index (AQI) [15]. Their experimental results show that when there are strong interactions and correlations between the air quality characteristic attributes and the air quality index, the MAPE value of the multicity, multidimensional regression model is reduced. Therefore, this study uses each of the six major atmospheric pollutants in turn as the output variable, with the remaining five as input variables, to explore the interactive prediction ability among pollutants. With respect to the dynamic characteristics of the air pollution index, a recurrent neural network (RNN) can effectively handle the adverse effects of the spatial and temporal evolution of the air pollution index. RNN is a deep learning method that can use memory units between networks to process sequences in the input so that it can learn time series [16]. RNN techniques have been proposed to solve the problem of time-series prediction, but studies have shown that the typical RNN model cannot capture the long-term dependence of the input sequence. To solve this problem, this paper uses the long short-term memory neural network (LSTM NN), a special RNN structure. LSTM can learn time series over long spans and automatically determine the optimal time lag in prediction.
In recent years, LSTM has been successfully applied to image classification, natural language processing, human motion recognition, robot intelligence development, and oil price forecasting [17][18][19]. Therefore, based on the ability of LSTM to analyze and predict spatiotemporal data, this study applies it to the prediction of air pollution and obtains good performance.
This research focuses on two aspects: (1) developing an LSTM atmospheric pollutant concentration prediction model based on deep learning; (2) optimizing the input indicators through feature selection and data dimension processing to improve the prediction accuracy of the LSTM model. Taking Beijing as an example, the prediction of six major atmospheric pollutants (PM2.5, PM10, SO2, NO2, O3, and CO) was taken as the research object, and the stability and accuracy of the model were analyzed.

Computational Intelligence and Neuroscience

Long Short-Term Memory.
As a currently popular recurrent neural network algorithm, the LSTM neural network was first proposed by Hochreiter and Schmidhuber; it improves the ability to memorize both long-term (static) and short-term (cyclic) dynamic features of time series [20]. Like the traditional recurrent neural network model, this model handles temporal data by mining the cyclical connections between neurons and the intrinsic connections within time-series data. Unlike the traditional recurrent neural network model, however, it has a unique neuron structure called a "memory unit." The hidden layer of an LSTM network constructed from this structure can store information over any length of time and obtain a more accurate time-series model [21]. The memory cell structure of the LSTM network is shown in Figure 1. The memory unit module is composed of three "gate" structures (the input gate, forgetting gate, and output gate) and one loop unit. The core idea is to control the switching of each "gate" through a nonlinear function to protect and control the state of the memory unit, thereby controlling the increase and decrease of information [22]. The key to the LSTM network is therefore the long-term storage of data information through the state of the memory unit. In general, the three "gates" output values between 0 and 1 through the sigmoid function to determine how much information can enter the memory unit.
Assume that at time t, the input of a memory unit module is x_t, the output is h_t, and the unit state is C_t. The input gate, forgetting gate, output gate, input conversion, unit state update, and hidden-layer output of the memory unit module are then given by the following equations:

i_t = σ(W_ix x_t + W_im h_(t-1) + b_i), (1)
f_t = σ(W_fx x_t + W_fm h_(t-1) + b_f), (2)
o_t = σ(W_ox x_t + W_om h_(t-1) + b_o), (3)
C′_t = tanh(W_cx x_t + W_cm h_(t-1) + b_c), (4)
C_t = f_t ⊙ C_(t-1) + i_t ⊙ C′_t, (5)
h_t = o_t ⊙ tanh(C_t). (6)

In these formulas, σ is the sigmoid function; tanh is the hyperbolic tangent function; ⊙ denotes element-wise multiplication; i_t, f_t, o_t, and C′_t are the outputs of the input gate, forgetting gate, output gate, and input conversion, respectively; W_ix, W_fx, W_ox, and W_cx and W_im, W_fm, W_om, and W_cm are the weight matrices of the input gate, forgetting gate, output gate, and input conversion corresponding to x_t and h_(t-1), respectively; b_i, b_f, b_o, and b_c are the offset vectors of the input gate, forgetting gate, output gate, and input conversion, respectively [22].
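The gating mechanism described above can be sketched as a single forward step of a memory cell. The following minimal Python illustration uses scalar weights for readability; the weight values, dictionary layout, and function names are invented for illustration and are not the parameters of the trained model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One forward step of a scalar LSTM memory cell (illustrative)."""
    i_t = sigmoid(W["ix"] * x_t + W["im"] * h_prev + b["i"])        # input gate
    f_t = sigmoid(W["fx"] * x_t + W["fm"] * h_prev + b["f"])        # forgetting gate
    o_t = sigmoid(W["ox"] * x_t + W["om"] * h_prev + b["o"])        # output gate
    c_tilde = math.tanh(W["cx"] * x_t + W["cm"] * h_prev + b["c"])  # input conversion
    c_t = f_t * c_prev + i_t * c_tilde   # unit state update
    h_t = o_t * math.tanh(c_t)           # hidden-layer output
    return h_t, c_t

# Example: all weights 0.5, zero biases, zero initial state (illustrative values)
W = {k: 0.5 for k in ("ix", "im", "fx", "fm", "ox", "om", "cx", "cm")}
b = {k: 0.0 for k in ("i", "f", "o", "c")}
h, c = lstm_cell_step(1.0, 0.0, 0.0, W, b)
```

With a zero initial state, the forgetting gate has no effect and the new unit state is simply the gated input conversion, which the output gate then scales into the hidden output.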

Wavelet Transform.
The wavelet transform achieves localized analysis by adjusting the window, decomposing the input signal into a low-frequency signal that reflects the true trend of the signal data and a randomly disturbed high-frequency signal [23]. The contaminant concentration data are decomposed by the wavelet transform into a group of sequences composed of different components. These subsequences have a more stable variance and fewer singular value points than the original data, so they can express the original signal information more effectively and accurately, improving prediction accuracy. Therefore, when using the LSTM model to predict the temporal data of atmospheric pollutants, the input vector can be structurally transformed to raise the one-dimensional data to a higher dimension [24]. The air pollutant concentration time-series data y_1, y_2, ..., y_n can be considered a signal sequence. Since wavelet analysis applies to such nonlinear and nonstationary time-series data, this method can be used to analyze and extract the information characteristics of the atmospheric pollutant concentration time series at different frequencies [25]. Let the scaling function of the wavelet transform be φ(t) and the parent wavelet function be ψ(t).
Then

φ_(j,k)(t) = 2^(j/2) φ(2^j t − k), (7)
ψ_(j,k)(t) = 2^(j/2) ψ(2^j t − k), (8)

where j and k are the scale and translation parameters, respectively. Using the basis functions of formulas (7) and (8), the signal y(t) can be expressed as

y(t) = Σ_k c_(j0)(k) φ_(j0,k)(t) + Σ_(j≥j0) Σ_k d_j(k) ψ_(j,k)(t), (9)

where c_(j0)(k) and d_j(k) are the approximation coefficients and detail coefficients, respectively, and the pollutant concentration data can be decomposed into m steps by the wavelet transform:

y(t) = a_m(t) + b_m(t) + b_(m−1)(t) + · · · + b_1(t). (10)

In formula (10), a is the approximation information representing the trend of the original information, and b is high-frequency information indicating small signal fluctuations, that is, the noise portion of the original information. The low-frequency approximation information and the high-frequency information obtained by the wavelet decomposition together constitute a new set of input vectors.
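To illustrate how a single decomposition level splits a series into a low-frequency approximation and a high-frequency detail, the following sketch implements one level of the Haar wavelet (db1, the simplest member of the Daubechies family used later in this paper); the concentration values are invented placeholders:

```python
import math

def haar_decompose(signal):
    """One level of Haar (db1) wavelet decomposition.
    Returns the low-frequency approximation (trend) and the
    high-frequency detail (fluctuation) subsequences."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def haar_reconstruct(approx, detail):
    """Inverse transform: recombine the subsequences into the original series."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out

# A short pollutant-concentration series (illustrative values only)
series = [35.0, 40.0, 120.0, 110.0, 60.0, 55.0]
a, d = haar_decompose(series)
restored = haar_reconstruct(a, d)
```

The approximation sequence smooths adjacent pairs while the detail sequence isolates their differences; recombining the two recovers the original series exactly, which is the property that lets the decomposed subsequences serve as a lossless high-dimensional input representation.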

Information Gain.
Feature selection is usually performed after quantifying the importance of each feature, and how that importance is quantified is the most significant difference between the various methods [26]. The chi-square test quantifies the correlation between features and categories: the stronger the association, the higher the feature score and the more likely the feature is to be retained. In information gain, the measure of importance is how much information a feature brings to the classification system [27]. The more information it brings, the more important it is.
Suppose a variable X has n possible values x_1, x_2, ..., x_n with probabilities P_1, P_2, ..., P_n. The entropy of X is then defined as

H(X) = −Σ_(i=1)^n P_i log P_i. (11)

For the classification system, the category C is a variable whose possible values are C_1, C_2, ..., C_n, occurring with probabilities P(C_1), P(C_2), ..., P(C_n), where n is the total number of categories. The entropy of the system can then be expressed as

H(C) = −Σ_(i=1)^n P(C_i) log P(C_i). (12)

To distinguish a value t from the feature itself, this paper uses T to represent the feature; the conditional entropy of the system given T is then

H(C|T) = P(t) H(C|t) + P(t̄) H(C|t̄), (13)

where each term can be expanded as

H(C|t) = −Σ_(i=1)^n P(C_i|t) log P(C_i|t). (14)

Therefore, the information gain that the feature T brings to the system can be written as the difference between the original entropy of the system and the conditional entropy after fixing the feature T:

IG(T) = H(C) − H(C|T). (15)
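The entropy and information-gain definitions above can be sketched directly in Python. The example data below (a discretised "high PM10" feature against a "high/low PM2.5" class) are invented for illustration and are not the study's data:

```python
import math
from collections import Counter

def entropy(labels):
    """H(C) = -sum p_i * log2(p_i) over the empirical class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(T) = H(C) - H(C|T): how much knowing the discretised feature
    reduces uncertainty about the class."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [c for f, c in zip(feature_values, labels) if f == v]
        cond += (len(subset) / n) * entropy(subset)  # P(t) * H(C|t)
    return entropy(labels) - cond

# Illustrative example: the feature fully determines the class,
# so the gain equals the full class entropy of 1 bit.
pm10_high  = [1, 1, 1, 0, 0, 0]
pm25_class = ["high", "high", "high", "low", "low", "low"]
ig = information_gain(pm10_high, pm25_class)
```

A feature uncorrelated with the class would instead leave the conditional entropy unchanged, giving a gain near zero; ranking candidate inputs by this score is the selection step used in the study.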

LSTM Forecasting Model.
The flow of the atmospheric pollutant concentration prediction model based on LSTM is shown in the figure below. The input variables consist of three parts: the eigenvector group obtained by information gain, the high-frequency and low-frequency information vector group obtained after wavelet decomposition, and the historical data group. In the design of the network structure, after repeated experimental debugging, the complete LSTM model was finally determined, as shown in Figure 2. The entire LSTM neural network contains N LSTM hidden layers, and each layer contains 256 nodes. The detailed steps for running the model are as follows: (i) Step 1: form the time-series data set AP_1, AP_2, ..., AP_n from the air pollutant concentration data, use information gain to select the input index characteristics of the different prediction targets in air pollutant concentration prediction, and obtain the significant factor data set I′_1, I′_2, ..., I′_t. (ii) Step 2: apply the wavelet decomposition transform to the significant factor data set to form a high-dimensional input set, train the LSTM model on this high-dimensional input information, and adjust the parameters by trial and error to obtain the prediction model f(X_i).
(v) Step 5: using the prediction model obtained from the above training and the stage-(t+1) input vector X_(t+1) obtained in Step 1, the predicted value f(X′_(t+1)) of the atmospheric pollutant concentration at stage t+1 can be obtained.

Research Object and Exploratory Data Analysis
The object of this study is Beijing, which is located in the northern part of the North China Plain and consists of 16 functional areas in 6 districts and 10 suburbs, with a total area of about 16,000 square kilometers. When using deep learning methods to build the model, the selection and normalization of features are essential for model performance. Through exploratory analysis, it was found that there are apparent abnormal fluctuations in the raw pollutant concentration data with seasonal changes, as shown in Figure 3. Due to the Beijing Huilongguan fire accident on June 5, 2015, the CO concentration suddenly increased to 6.8 mg/m3, and such abnormal fluctuations seriously affect the predictive ability of the model. Therefore, this study introduces the wavelet decomposition transform: the original data are transformed to obtain low-frequency and high-frequency data subsequences. These subsequences have more stable variance and fewer singular value points than the original data, which makes the input vector smooth. The concentration of O3 increased significantly in the same months across the four-year period and has become the main factor of air pollution in Beijing. Under the influence of environmental management policies such as Beijing's industrial migration, the concentrations of PM10 and PM2.5 have decreased year by year. However, due to the increasing number of private cars in Beijing, the concentration of nitrogen oxides has increased significantly, becoming a new major air pollution factor.
To study the mutual prediction ability among the six major atmospheric pollutants in Beijing, this study uses a total of 1095 daily records from January 1, 2014, to December 31, 2016, as the training data set for the prediction model, and the 365 daily records from January 1 to December 31, 2017, as the test data set.
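The chronological split and the normalization mentioned above can be sketched as follows; the daily values are synthetic placeholders, and the function names are invented for illustration:

```python
def min_max_normalize(values):
    """Scale a series to [0, 1]; in practice the scaling constants should be
    computed from the training period only, then reused on the test period."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def split_train_test(series, n_train=1095):
    """First 1095 daily records (2014-2016) for training and the remaining
    365 records (2017) for testing, matching the study's design."""
    return series[:n_train], series[n_train:]

# Illustrative: four years of daily records (placeholder values)
daily = [float(i % 200) for i in range(1460)]
train, test = split_train_test(daily)
norm_train = min_max_normalize(train)
```

Splitting strictly by date, rather than randomly, prevents future observations from leaking into the training set of a time-series model.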

Result of Information Gain.
This study uses the nonpredicted targets as input vectors. The purpose is to cope with prediction environments strongly influenced by complex, uncertain factors, in which only specific pollution indicators can be obtained: the available indicators are used to predict one another, yielding a prediction model with higher applicability. To improve the accuracy of model prediction, this study further selects the input variables by information gain, determines the degree of correlation with each pollutant concentration, and screens the significant indicators. Using each of the six atmospheric pollutants (PM10, PM2.5, NO2, SO2, O3, and CO) in the original data set as the target in turn, correlation exploration and feature selection are completed through information gain, and the entropy value of each input variable for the prediction target is obtained and sorted, as shown in Table 2. Table 3 lists, in order, the three most important inputs for predicting each pollutant. It can be seen from the table that PM10 and PM2.5 have a strong influence on the concentration prediction of SO2, NO2, O3, and CO when used as input variables, and they are the two pollutants with the most substantial influence on the results in atmospheric pollutant concentration prediction. This may be because their sources are correlated, deriving from the burning of fossil fuels such as automobile exhaust. The research model does not consider the complex and uncertain prediction environment. Therefore, in order to improve the accuracy of the prediction model, only the feature vectors acquired by the IG are used as input variables for model training.

Result of Wavelet Decomposition.
When applied to time-series prediction, LSTM shows good prediction performance, efficiently representing the high-dimensional nonlinear relationship between the input vectors and the predicted targets. Appropriate high-dimensional input vectors can describe the information features more effectively and accurately and better express the meaning of the data. Therefore, the model's predictive ability depends to no small extent on the choice of input vectors in the model design. In this study, when using the LSTM model to predict pollutant concentrations, the input variables can be structurally transformed into a new set of input variables so that the prediction results are more accurate and stable. The data are upgraded from one-dimensional to high-dimensional by wavelet decomposition, which represents the trend of data changes more fully, thereby improving prediction accuracy. In this study, wavelet decomposition is based on the Daubechies (DB) wavelet basis function [4]. The Daubechies wavelet has low-pass and high-pass filtering properties, which makes it suitable for feature extraction. Because of its inherent orthogonality, the Daubechies wavelet is widely used and shows good performance in the analysis of time-series data.
Using Matlab, the low-frequency approximation information and high-frequency information obtained by the wavelet decomposition transform were used to form a new prediction data set for each of the six atmospheric pollutants (PM10, PM2.5, NO2, SO2, O3, and CO), serving as another new input vector group for the LSTM model. The result of the transformation is shown in Figure 4, which displays the high-frequency information group and the low-frequency information group. The wavelet decomposition set generates high-dimensional input vectors from the concentration time-series data of the three input feature variables, which effectively increases the information carried by the data, and the prediction stability of the prediction model is significantly improved.

Determination of the Best Parameters of LSTM Model.
To ensure that the hybrid model obtains the best experimental results, the best parameters of the LSTM model should be determined before the experiment starts, so as to reduce the influence of parameter factors on the experimental results. There are three major parameters in the LSTM model: the number of time steps L of each layer, the size of the hidden unit (the same hidden unit size is used for each layer), and the batch size during training. The learning rate (Lr) and the maximum number of epochs must also be set. When selecting the best prediction model, the number of frames in each sample is set to L; one parameter is varied while the other parameters are fixed, and finally the best prediction model is found. The model parameters are shown in Table 4.

Result of Hybrid Model.
To investigate the performance of the LSTM atmospheric pollutant concentration prediction model, this study provided four sets of input variables for training the prediction model: the original Beijing atmospheric pollutant concentration data set, the characteristic variable set, the high-dimensional data set, and the high-dimensional characteristic variable set. To verify the high prediction accuracy of the hybrid LSTM model established in this study, six predictions of atmospheric pollutant concentration using different feature selection methods were compared. The prediction model in the experiment was implemented in Matlab 2017a and in Python 2.7 on an Ubuntu system. The minimum MAPE was selected as the target for the selection of the relevant parameters in the model. Mean absolute percentage error (MAPE) is an important indicator for measuring prediction accuracy in statistics [28].
In this paper, the MAPE index is also used to measure the error of the prediction algorithm and to compare it with other algorithms. MAPE considers not only the error between the predicted value and the real value but also the ratio between the error and the real value. MAPE is calculated as

e_MAPE = (1/N) Σ_(t=1)^N (|L_a^t − L_f^t| / L_a^t) × 100%, (16)

where e_MAPE represents the prediction error measured by the MAPE index; N represents the total number of prediction time points; L_a^t represents the actual value; and L_f^t represents the predicted value. At the beginning of the experiment, an LSTM model with four hidden layers was selected. The original data of each of the six atmospheric pollutants were used in turn as the prediction target, with the remaining five used as independent variables. When PM10 is the dependent variable, the MAPE is as low as 7.54%, and when PM2.5 is the dependent variable, the MAPE reaches 17.25%. The standard deviation of the MAPE across the prediction results is substantial. These results show that the prediction stability of the LSTM model alone cannot be guaranteed when different independent variables are used. Therefore, in order to improve the stability and accuracy of the LSTM prediction, auxiliary vectors are added to the prediction model to enhance its learning efficiency.
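The MAPE metric used throughout the experiments can be sketched as follows; the actual and predicted values below are invented placeholders, not the study's measurements:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent:
    (1/N) * sum(|actual - predicted| / actual) * 100."""
    assert len(actual) == len(predicted) and len(actual) > 0
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

# Illustrative concentrations: per-point errors of 10%, 10%, and 0%
actual    = [50.0, 80.0, 120.0]
predicted = [45.0, 88.0, 120.0]
err = mape(actual, predicted)
```

Because each error is divided by the actual value, MAPE weights relative rather than absolute deviations; note that it is undefined when an actual value is zero, which matters for near-zero pollutant readings.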
Through the feature selection of information gain, only the feature variable set consisting of the three input variables with the highest correlation is used to train the LSTM model. The average MAPE of the prediction results is reduced from 12.62% to 10.75%, and the prediction accuracy of the model is improved. The CO prediction result improves, with a MAPE of 4.35%, but the PM2.5 prediction result is 17.85%, so the stability of the prediction model is still not guaranteed. To raise the data dimension, the three sets of high-frequency information and one set of low-frequency information obtained by wavelet decomposition are used to form a high-dimensional data set. The subsequence data transformed by the wavelet decomposition have a more stable variance and fewer singular value points than the original data and can express the original signal information more effectively and accurately. After the LSTM model was fitted using only G2 as training data, the average MAPE was 10.65%, 0.05% lower than with the G2-assisted prediction alone. However, the variance of the model's MAPE across the different prediction targets is significantly smaller, and the stability is significantly improved. Therefore, it can be concluded that time-series data processed by wavelet decomposition yield more stable prediction performance.
On this basis, this study sets up a new training data set in which the original atmospheric pollutant data are feature-selected by information gain and dimensionally enhanced by the wavelet transform, aiming to smooth the training data set of the deep learning model and enhance the learning ability of the LSTM model. The average MAPE of the prediction results is reduced to 7.45% when the LSTM model is trained with the high-dimensional characteristic data set. Figure 5 shows that when the input vector is processed using the wavelet transform and IG, the prediction accuracy is higher and the stability is excellent. The evaluation of the prediction results for each type of air pollutant under the different models is shown in Table 5 and Figure 6. The actual and predicted concentrations of the various atmospheric pollutants in 2017 are compared in Figure 5.
LSTM, as a high-dimensional nonlinear learning algorithm, has achieved excellent results in the prediction of atmospheric pollutant concentration time-series data. However, due to the incomplete representation provided by one-dimensional time-series data, the generalization ability of the prediction model is restricted to some extent. The hybrid prediction model proposed in this study uses the wavelet transform to decompose the time-series data of the various pollutant concentrations, constructs new high-dimensional feature vectors to express the relevant information of the different pollutants at different frequencies, and better displays the data characteristics.
To evaluate the forecasting performance of the hybrid model more comprehensively, the experiment compares the model with some state-of-the-art time-series prediction models, including machine learning and deep learning methods. The SVR, RNN, and GRU models were selected as comparison models, and all of them were combined with the IG and wavelet processing. The training and test sets were the same as for the hybrid model. A total of three groups of comparative experiments were performed, with the final results shown in Table 6. From the results, it can be seen that the prediction model proposed in this paper improves the prediction accuracy and is more effective than the other prediction models.

The Stability of the Hybrid Models.
To measure and evaluate the stability of the hybrid models, the experiment collected data from different regions near Beijing for verification, including Tianjin and Shijiazhuang. Shijiazhuang is located in the middle of the North China Plain, and its level of air pollution is more serious than Beijing's. The data of the three regions are reforecasted to evaluate the stability of the hybrid models. The forecasting procedure and model parameters for the three regions are the same as those used for Beijing, which removes the need for duplicate figures. The final prediction results are shown in Table 7.
As shown in Table 7, the MAPE evaluation indicators of the three regions are relatively reasonable, and the average values are close to those of Beijing. In particular, the average value in Tianjin is 7.21%, the closest to Beijing's 7.45%, and its level of air pollution is similar to Beijing's. This shows that the hybrid model has high stability and is suitable for air pollution concentration prediction.

Conclusions
Severe air pollution has a significant impact on human health, flora and fauna, and the environment. Dangerous air pollution is prone to cause respiratory diseases and physiological dysfunction, severely harming human health. Therefore, scientific and accurate prediction of atmospheric pollutant concentrations has important practical significance: it can provide prediction data and a basis for environmental protection agencies, reduce the impact of air pollution on people's health, and guide people's work and life. In this paper, a hybrid LSTM model is established based on wavelet decomposition, information gain (IG), and the long short-term memory neural network (LSTM) to predict the future concentrations of the six major atmospheric pollutants in Beijing. The study is summarized as follows: (1) Using the information gain method to select the input variables helps improve the accuracy of the LSTM neural network's predictions of the six air pollutant concentrations, and using wavelet decomposition to convert the characteristic input variables into a high-dimensional data set effectively enhances the stability and accuracy of the predictive model. (2) The hybrid prediction model applies wavelet decomposition to the atmospheric pollutant concentration time-series data, and the low-frequency and high-frequency data obtained after decomposition are used simultaneously as input variables, increasing the data dimension. The information carried by the pollutant concentration time-series data at different frequencies is thereby better described. (3) The experimental results show that the hybrid prediction model significantly improves the prediction accuracy of pollutant concentrations and increases stability. In particular, at burst data points, the hybrid prediction model predicts more accurately.
Therefore, the high-dimensional input variables composed of the low-frequency and high-frequency information obtained by the wavelet transform express the information of the pollutant concentration time-series data more accurately. (4) This study uses historical data to calculate the characteristic entropy of the other pollutant concentrations when each pollutant is used in turn as the prediction target. From the experimental results, it can be observed that the characteristic entropy of the PM10 and PM2.5 concentrations is significant for the concentration prediction of most major atmospheric pollutants, indicating that when incomplete combustion of fossil fuels leads to increases in SO2, NO2, and other pollutants, the concentrations of PM10 and PM2.5 in the air are also severely affected. (5) The LSTM neural network method in the model obtains more accurate predictions of atmospheric pollutant concentrations. This prediction model is applied to the prediction of the six air pollutant concentrations in Beijing; compared with the actual data, the average predicted MAPE is as low as 7.45%. Compared with mechanistic models, which are complex and computationally expensive, it is more suitable for prediction environments with strong, complex uncertainty factors. Therefore, the hybrid prediction model has strong applicability and high application value in predicting atmospheric pollutant concentrations. (6) By predicting air pollutant concentrations in three different regions, it can be seen that the hybrid model is stable and its forecast of air pollution concentrations is reliable.
In the control experiment, the MAPE of the three other regions is close to that of Beijing, indicating that the hybrid model can still obtain good prediction results with data of different characteristics, and the model has stable prediction performance.

Data Availability
Raw data used to support the results of this study are included in the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.