Longer Time Span Air Pollution Prediction: The Attention and Autoencoder Hybrid Learning Model

College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 200234, China; Shanghai Engineering Research Center of Intelligent Education and Bigdata, Shanghai Normal University, Shanghai 200234, China; Institute of Artificial Intelligence on Education, Shanghai Normal University, Shanghai 200234, China; College of Transportation Engineering, Tongji University, Shanghai 200082, China; Department of Electronic and Electrical Engineering, Brunel University London, Uxbridge, UB8 3PH, UK


Introduction
Air pollution has become increasingly severe as global industrialization has matured. Studies have pointed out that air pollution might be one of the key factors leading to noncommunicable diseases and shorter life expectancy [1]. Countries around the world have tackled this problem with various measures, and people have changed their activities to cope with declining air quality. In this fight, forecasting air quality is an essential and effective way to help.
Thanks to the progress in sensor and recording devices, voluminous information about atmospheric conditions and pollutant concentrations has been gathered in recent years. These data have in turn boosted studies on the influence of pollutants (e.g., PM2.5, PM10, and AOI) on the public and on society [2]. Among these studies, foreseeing the changing trend of air quality is one of the most challenging. Facing the colossal raw data, prediction approaches naturally deploy statistics-based methods. For example, ARMA (Autoregressive Moving Average) [3] uses one-dimensional data to predict air pollution. Even when multiple variables are added, this kind of method remains a linear model and thus can hardly be applied to long-time-span sequential prediction, which is a nonlinear problem.
Nonlinear models excel at extracting deep features from very large datasets, especially when the raw data are more precise. For example, neural networks (NNs) like RNN, LSTM [4,5], and GRU [6,7] have been used for various nonlinear problems, including image processing, natural language processing, and automatic driving [8]. Bidirectional LSTM methods have been used to predict air contamination [9]. Current studies on air quality prediction are devoted to improving accuracy while keeping the prediction period within 12 hours; some even limit it to 1 hour to seek the utmost accuracy. Despite the high accuracy, short-time prediction has little practical value, given that fighting air pollution needs a longer time to prepare.
In this paper, we design a novel attention and autoencoder (A&A) learning model to predict the change of air pollutants over periods longer than 12 hours. This approach is composed of an encoder-decoder structure and LSTM models that contain the attention mechanism. The cooperation between these models is expected to enable longer-span sequential prediction.
Targeting the temporal characteristic of the pollutant changing trend, this study specifically incorporates the attention mechanism to emphasize the "time" feature through the following techniques:
(1) Add a time factor in the attention mechanism
(2) Consider the decoder hidden states when outputting prediction results
(3) Use the window algorithm to keep the concerned history data stable
The remainder of the paper is organized as follows. Section 2 classifies and reviews related works. Section 3 describes the architecture of the A&A learning approach. Section 4 presents the experimental process. Section 5 draws certain conclusions and points to directions for further work.
The block diagram of our research is depicted in Figure 1, and detailed research steps can be found in Section 4. The brief steps are listed as follows:
(1) First, after choosing the city and the year to investigate, we obtain two datasets, one of air pollution and one of atmosphere
(2) Then, we preprocess the data
(a) Using mean value interpolation to fill in the missing data
(b) Adjusting both datasets to the same record frequency
(c) Merging the two datasets, splitting the train/test datasets, and segmenting the dataset into multiple batches
(3) After that, the processed data are sent to A&A learning for prediction
(4) Finally, we evaluate the predicted values through evaluation metrics and get the result
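The preprocessing in step (2) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, missing values are filled with the mean of the observed values, and the three-hourly series is brought to hourly frequency by simple repetition; the exact interpolation scheme used in the study may differ.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed entries (assumed scheme)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def upsample_by_repeat(values, factor):
    """Turn one record per `factor` hours into hourly records by repetition."""
    return [v for v in values for _ in range(factor)]

def merge_hourly(pollution, atmosphere):
    """Pair hourly records from both sources into one feature row per hour."""
    n = min(len(pollution), len(atmosphere))
    return [[pollution[i], atmosphere[i]] for i in range(n)]

# Toy data: hourly PM2.5 with gaps, and a temperature recorded every 3 hours.
pm25_hourly = fill_missing_with_mean([30.0, None, 36.0, 42.0, None, 52.0])
temp_hourly = upsample_by_repeat([5.0, 7.0], factor=3)
rows = merge_hourly(pm25_hourly, temp_hourly)

# Chronological train/test split with a 0.7 training share.
split = int(len(rows) * 0.7)
train_rows, test_rows = rows[:split], rows[split:]
```

Splitting chronologically rather than at random keeps the test period strictly after the training period, which matters for a forecasting task.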

General Classification.
Previous approaches to forecasting air pollution can be classified into theory-based methods and statistics-based ones:
(1) Theory-based methods focus on chemical or physical factors that could affect the variation tendencies of pollutants and construct numerical models to simulate these dynamic processes.
(2) Statistics-based methods analyze large volumes of data about past air quality and atmospheric conditions (e.g., PM2.5, PM10, and AOI) and extract potential relationships among them to forecast the variation tendencies of pollutants.

Theory-Based Methods.
Two theory-based air quality models have been widely accepted across the industry: the community multiscale air quality (CMAQ) [10] modeling system and the comprehensive air quality model with extensions (CAMx) [11]. Both were developed based on the assumption of "one atmosphere." They involve conversion and influences among different types of air contaminants and simulate multiple pollutants at multiple scales [12]. CMAQ targets a majority of air contaminants and estimates the overall air quality covering multiple regions. Its universal applicability, however, comes with certain innate deficiencies, like errors caused by manually set parameters and perturbed mass conservation induced by inconsistent meteorological fields [13].
CAMx is a publicly available photochemical grid model that was developed and coded during the late 1990s using modern and modular coding practices [14]. It has proved effective in simulating various atmospheres and different scales of particulate pollutants covering cities or even crossing continents. Yet problems still exist in the accuracy and comprehensiveness of compound pollutant simulation. For instance, the particulate matter comprehensive air quality model with extensions (PMCAMx) predicts even higher values of O3 and PM than CMAQ does [15]. Although theory-based models work efficiently when they are fed with accurate data, the heavy computation and the inevitable small errors caused by the models do influence their performance. In addition, it is hard to collect abundant data for them in practice.

Statistics-Based Methods.
Meanwhile, the growth of computing power has brought statistics-based machine-learning models back to life. Machine-learning approaches have been widely used for classification and regression, thanks to their ability to extract potential relationships among a vast number of features and to respond reasonably without a predesigned program. They are traditionally divided into three categories, supervised, unsupervised, and semisupervised learning, based on the amount of labels used during the training period [16].
Some researchers have adopted certain machine-learning techniques to predict air pollution, like SVM algorithms [17,18] and the semiexperimental regression model [19], and obtained satisfying results on small-scale data. However, these techniques can barely forecast long-period trends.
Artificial NNs, a branch of machine learning, have proved promising for time-dependent forecasting, for example, RNN [20], LSTM [21], and GRU [22]. An artificial NN approach constructs a model by simulating human neural networks. With multiple neurons depicting the profound connections among data, nonlinear activations, and the backpropagation algorithm [23], artificial NNs can classify or predict more practically and are apt at solving nonlinear, random, and irregular problems.

NN-Based Methods.
In the field of air quality prediction, it is believed that accuracy is partially inversely proportional to time span. This belief has kept most researchers who deployed statistics-based methods focusing on the accuracy of prediction while shortening the time limit to one hour [24] or one day (with only one value representing the pollution level each day) [25], which is far from meeting practical needs. Our A&A learning approach, however, puts real-life activities first by extending the forecast time to 12 or 24 hours, or potentially even 48 hours. This approach extracts the nonlinear spatial and temporal features of the change in air pollutant concentration from previous data. Its main purpose is not to push the accuracy to its maximum but to predict the tendency of the air pollution indexes over a longer period while retaining a relatively high accuracy.
Numerous barriers could impede accurate air quality forecasting. First and foremost, air pollutant concentration is influenced by certain human activities, like letting off fireworks or driving cars, and by meteorological factors such as wind speed and sea level pressure [25,26]. To tackle this complex situation, the proposed A&A approach deploys LSTM and GRU, which can transform numeric air pollution information into characteristic vectors and clarify deeper regularities among these features by capturing the temporal information of fifteen dimensions of data. These two methods can contribute to the prediction of PM2.5 distribution for a longer time span.
Prediction for a longer time span in this paper is a typical sequence-to-sequence problem [27]: the observation period and the prediction period are not equal in length. This separates the proposed model into an encoder and a decoder; thus, the model can conduct feature extracting and forecasting separately. The core of both is LSTM. The encoder, fed with the original data, finishes the extraction task, compresses the distribution of air pollutant features during the observed period into a cell state, and sends it to the decoder. Once the decoder gets the cell state, prediction can be initialized based on it. Thus, the inequality between the two time spans no longer poses a problem.
Another critical issue is that pure RNN, LSTM, and GRU models could cause the vanishing gradient problem. Due to the "tanh" activation frequently reused in these models, gradients drop dramatically along with time. That means previous information could vanish during a long sequence prediction. To solve this problem, we adopt an attention-theory-based [28] algorithm to quantify to what extent the previous pollutant information can influence the current prediction. Thus, the concentration of PM2.5 can be forecasted by the matrix output from the concatenated encoder and decoder, and the model's ability to seek long-term regularity among the historical data can be significantly improved.
To sum up, our research emphasizes the following critical tasks:
(1) Extract correlated features by a sequence-to-sequence framework and transform them into vectors
(2) Quantify the influences between the observed period and the forecasting period
(3) Develop a time-related attention algorithm to extract the regularity of historical data and quantify its weights

Architecture of the A&A Learning Approach
3.1. Overview of A&A Learning. The proposed A&A learning approach uses the improved attention mechanism to build the encoder-decoder structure with the LSTM model. This architecture can tackle particular problems within the NNs while strengthening the extracting ability.
First, the encoder-decoder structure can avoid the conflict between overfitting and linear prediction. We conducted multiple experiments that used pure LSTM to forecast air quality. Although it could output the regularity of the variation of PM2.5, the fixed vector size made its performance unstable. As the number of training epochs grew, the error became significantly higher. If we enlarged the vector size, the overfitting problem became more severe; if we reduced it, the result tended to be more linear, hindering deeper extraction of information about the air quality. To alleviate the conflict, we adopt the sequence-to-sequence structure to retain more historical data for later prediction.
Second, the improved attention mechanism can strengthen the effect of "time" in the prediction results. The combination of LSTM and sequence-to-sequence models alone is not enough, and the improved attention mechanism helps the approach to obtain results based on different periods of time.
Our approach intends to highlight the constraint of "time" through the following three ways:

Multiply a Time Factor according to the Time Gap between the Previous Data and the Current Ones.
Although the traditional attention mechanism is powerful, it evenly allocates the impact of each observed point on the predicted point. In the air quality prediction task, however, the impact of pollutants on the air quality at the predicted time changes with the time span between the observed time and the predicted time; that is, the longer the time gap between the observed time and the predicted time, the smaller the impact. Therefore, time-related decay factors and numerical scoring mechanisms must be incorporated into the attention mechanism of this study.
In practice, the improved attention mechanism adds a time factor to the attention queries. The time factor is calculated from the difference between the observed time and the predicted time and helps to modify the hidden states.

Take the Decoder Sequence into the Attention Queries' Calculation.
The traditional attention mechanism only considers the encoder's hidden states and the current hidden state; thus, the current predicted values cannot impact later predictions, which is against common sense. Therefore, the improved attention mechanism involves the hidden states of the decoder so that the influence of each prediction propagates forward.

Use a Fixed Window Size to Simulate the Time Step.
When the hidden states of the decoder are added to the attention values, the size of the attention values would grow as the prediction proceeds. Given the decaying impact, a window is "framed" on the sequence of the hidden states. The windowing algorithm discards the oldest hidden state when a new hidden state from the decoder joins; thus, the total amount of the attention values remains stable.
To make the procedure clear, a block diagram of the A&A learning approach is given in Figure 2. Predicting a value takes six steps.
Step 1. First, a batch of features with fifteen dimensions is sent to train the encoder.
Step 2. Then, the cell state of the encoder and the last row of fifteen features are transferred to the decoder through route 3, and all hidden states are recorded and sent to the improved attention mechanism through route 2.
Step 3. Using the last row of fifteen features as the input, together with the cell state transferred from the encoder, the decoder outputs the current hidden state, which is sent to the attention mechanism through route 4.
Step 4. Next, we multiply the hidden states by the time factor to emphasize those with a closer time span, add the hidden state produced by the decoder one step earlier, and drop the oldest one. After that, the hidden states, now of stable size, are sent to the attention mechanism through route 5.
Step 5. We then calculate the result from the current hidden state and the hidden states of stable size and send it to the decoder through route 6.
Step 6. Finally, after a fully connected layer, the predicted value is sent out through route 7, and the current hidden state is sent to the improved attention mechanism, where it is used to update the hidden states as in Step 4 before the next round of forecasting.
The overall structure of the A&A learning approach is shown in Figure 3 to illustrate this process in detail.
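The window update in Step 4 can be illustrated with a bounded queue. The sketch below is a simplified illustration rather than the model's actual code: hidden states are represented by string labels, and the helper names `init_window` and `slide` are hypothetical.

```python
from collections import deque

def init_window(encoder_states):
    """Start with the O encoder hidden states; maxlen fixes the window size at O."""
    return deque(encoder_states, maxlen=len(encoder_states))

def slide(window, decoder_state):
    """Add the newest decoder state; the bounded deque discards the oldest entry."""
    window.append(decoder_state)
    return window

# Four encoder states (O = 4), then two decoder states join one at a time.
window = init_window(["e1", "e2", "e3", "e4"])
slide(window, "d1")
slide(window, "d2")
print(list(window))  # ['e3', 'e4', 'd1', 'd2']
```

Because the queue length never changes, the attention mechanism always scores the same number of hidden states, whatever the prediction step.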

Recurrent NNs.
Artificial NNs can be divided into feedforward NNs and recurrent NNs [29]. Backpropagation NNs, the typical feedforward NNs, contain no links among the neurons of the same layer, which means that no interactions between historical and current data exist. This could lead to the model's insensitivity to the variation of air pollutant concentration.
Recurrent NNs, however, can connect previous and current data. Each node in the network is a computing unit that incorporates both the hidden state from the previous units and the current input data and outputs a relevant result. This mechanism helps the model to look back at previous information during prediction.
In the RNN formulation, "f" represents the activation function, for which "tanh" and "relu" are frequently used, and "U," "V," and "W" are the shared weights of the RNN. These shared weights are the critical feature that enables the RNN to capture relationships among diverse units, but they can also cause gradient vanishing after a long-sequence calculation.
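For reference, a common textbook form of the recurrent update that uses the symbols named above is given below. The symbols x_t (input at step t), h_t (hidden state), and o_t (output) are assumptions introduced here for illustration, and the assignment of U, W, and V to the input, recurrent, and output paths follows the usual convention; it may differ from the paper's own equations.

```latex
h_t = f\left(U x_t + W h_{t-1}\right), \qquad o_t = V h_t
```

Since the same W multiplies the hidden state at every step, gradients propagated through many steps contain repeated products involving W and the derivative of f, which is how the vanishing-gradient behaviour arises.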

Long Short-Term Memory (LSTM).
In the LSTM equations, [a, b] means concatenating matrices a and b; "W_*" and "b_*" represent the weight and the bias of the various equations, respectively; the subscripts "i," "f," "o," and "c" indicate that the parameters are related to the input gate, the forget gate, the output gate, and the cell state, respectively; "C'_t" is the alternative information for the cell state update; "σ" is the sigmoid function; and "f" is the tanh activation. First, the forget gate drops the redundant part of the cell state, according to both the previous hidden state and the current input, to simulate the loss of historical data as time steps forward. Second, the input gate controls the update of the information for the cell state. Third, after the cell state is updated according to the previous cell state C_{t−1} and the supplementary cell state C'_t, the output gate chooses the information for the hidden state that will affect the calculation of the following LSTM cells.
Although LSTM is better suited than plain RNNs to long-sequence problems, in this study attention needs to be paid to the abrupt changes and high-value parts of the data, because air pollution concentration does not change dramatically at high frequency.
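The three stages described in this subsection correspond to the standard LSTM cell, which can be written with the symbols defined above as follows. This is the common textbook form, given as an assumption about what the original equations contained; x_t, h_t, f_t, i_t, and o_t (input, hidden state, and the forget, input, and output gate activations) and the element-wise product \odot are notation introduced here. Note that f_t denotes the forget gate while f alone denotes the tanh activation.

```latex
\begin{aligned}
f_t  &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t  &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
C'_t &= f\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
C_t  &= f_t \odot C_{t-1} + i_t \odot C'_t \\
o_t  &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
h_t  &= o_t \odot f\left(C_t\right)
\end{aligned}
```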

Gated Recurrent Unit.
A Gated Recurrent Unit (GRU) is a simplified version of LSTM and has been widely applied. It uses the update gate (z_t), which is derived from the forget gate and the input gate in LSTM, and the reset gate (r_t), and it merges the hidden state with the output. The merged state, the hidden state, is updated according to both the previous hidden state h_{t−1} and the supplementary state h'_{t−1}, and their weights depend on the result of the update gate.
In the GRU equations, "W_r," "W_z," and "W" represent the weights for the respective steps of the calculation. The GRU is as capable of producing excellent results as LSTM [30].
3.5. Sequence-to-Sequence Structure. As mentioned before, prediction for air quality requires alignment between the historical data and the result matrix; the sequence-to-sequence structure, also called the encoder-decoder structure, can solve this problem by using two RNN models. This structure separates the prediction process into two parts based on the observed period and the forecasting period. During the observed period, one RNN model is used to extract information from the historical data of each time step and condense it into a hidden state. This hidden state is then transmitted to the other RNN model used in the forecasting period, initializing the following prediction with the latest preserved data. The current simulated value is calculated after a fully connected network. The current predicted value is then taken as the input of the next prediction, and the procedure is iterated until the final sequence is obtained. This state-sharing strategy balances the nonaligned time spans between the input data and the output data, allowing a longer sequence of data to be analyzed.
In this study, we replace the RNN models with LSTM networks for better extracting results and nonlinear simulating ability.
Although it is claimed that LSTM can help to relieve the vanishing gradient caused by the reuse of the "sigmoid" function [31], the conflict between overfitting and losing historical information generated by the fixed vector size of the cell state still exists. When the size of the vector grows, the overfitting problem becomes more severe, while restraining the overfitting leads to a shortage of historical information for the following prediction. Extra measures should be deployed to avoid this.
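The iterated decoding described in this subsection, where each predicted value becomes the input of the next step, can be sketched as follows. The encoder and decoder here are deliberately trivial stand-ins (a mean and a weighted blend) rather than LSTM networks, and the names `encode`, `decode_step`, and `forecast` are hypothetical; only the control flow is the point.

```python
def encode(history):
    """Toy encoder: compress the observed series into one state (its mean)."""
    return sum(history) / len(history)

def decode_step(state, last_value):
    """Toy decoder step: blend the carried state with the previous value."""
    new_state = 0.5 * state + 0.5 * last_value
    prediction = new_state  # stands in for the fully connected output layer
    return new_state, prediction

def forecast(history, horizon):
    """Predict `horizon` steps, feeding each prediction back as the next input."""
    state = encode(history)
    last_value = history[-1]
    predictions = []
    for _ in range(horizon):
        state, last_value = decode_step(state, last_value)
        predictions.append(last_value)
    return predictions

# 48 observed hours in, 24 predicted hours out: the two lengths need not match.
predicted = forecast([35.0] * 48, horizon=24)
```

The loop also makes the source of accumulated error visible: any error in one prediction is carried into every later step through `last_value` and `state`.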

Attention Mechanism.
The attention mechanism, initially developed for image processing, has proved powerful in multiple fields of deep learning in recent years. Vaswani et al. [32] indicated that the attention mechanism performed outstandingly in sequential tasks, like translation in NLP (Natural Language Processing). Other studies [33,34] demonstrated that combining the sequence-to-sequence structure and the attention mechanism could obtain effective results in air quality prediction.
Targeting the LSTM's dilemma between overfitting and losing previous data, the attention mechanism designates an energy function and an alignment vector to quantify the similarity between the previous hidden state of the decoder h^d_{t−1} and all the previous encoder hidden states H^E at time "t." Assume the time span of the observed period in the sequence-to-sequence structure is denoted as O and that of the predicted period as P; the preserved encoder hidden states are then H^E = {h^e_1, h^e_2, h^e_3, ..., h^e_{O−1}, h^e_O}. The variable "p_t" is a pollutant vector representing the output of the attention mechanism. It quantifies the contributions of all the information to the current forecasting output through a weighted summation. The weights are calculated through the energy function and the SoftMax function [35]. The parameters of the decoder LSTM with the attention mechanism at time "t" are updated accordingly, with F representing a fully connected layer. The energy function E = {e_1, e_2, e_3, ..., e_{O−1}, e_O} represents the correlation between the current hidden state and the previous ones. It is also needed in the calculation of the alignment vector at time "t" (a_t).
Multiple energy functions can be used by the attention mechanism. Among them, the Bahdanau attention [36] and cosine similarity attention are the most relevant to this study. In both formulations, "j" and "k" represent the encoder time steps, ranging from 1 to the time span of the observed period O.
Thanks to the attention mechanism and the encoder-decoder structure, the proposed approach can look back at abundant information when simulating the current concentration of PM2.5. The windowed attention queries also ensure that the approach pays closer attention to the time span between the current hidden state and the previous ones without altering the core task of the traditional attention mechanism, namely fixed alignment. In addition, the time decay factor further strengthens the A&A learning approach's focus on time. The difference can be seen in Figures 4 and 5, which depict a typical attention-mechanism-based algorithm and the improved attention-mechanism-based one, respectively.
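To make the two energy functions concrete, the sketch below implements a cosine-similarity score and an additive (Bahdanau-style) score in plain Python, followed by the SoftMax weighting and the weighted summation that yields the vector p_t. It follows the widely used textbook forms, which are assumed here and may differ in detail from the paper's equations; the weight matrices `w_q`, `w_k` and the vector `v` are hypothetical parameters, and the vectors are assumed to be nonzero.

```python
import math

def cosine_energy(query, key):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(q * k for q, k in zip(query, key))
    norm = math.sqrt(sum(q * q for q in query)) * math.sqrt(sum(k * k for k in key))
    return dot / norm

def additive_energy(query, key, w_q, w_k, v):
    """Bahdanau-style score: v . tanh(W_q query + W_k key)."""
    hidden = [
        math.tanh(sum(a * q for a, q in zip(row_q, query)) +
                  sum(b * k for b, k in zip(row_k, key)))
        for row_q, row_k in zip(w_q, w_k)
    ]
    return sum(vi * hi for vi, hi in zip(v, hidden))

def softmax(scores):
    """Numerically stable SoftMax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, energy):
    """Return the alignment weights and the weighted sum of the keys."""
    energies = [energy(query, k) for k in keys]
    weights = softmax(energies)
    dim = len(keys[0])
    context = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [1.0, 0.0]
weights, p_t = attend(decoder_state, encoder_states, cosine_energy)
```

In this toy case the first encoder state points in the same direction as the decoder state, so it receives the largest weight.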

The A&A Learning.
As the A&A learning approach incorporates the time factor, a time decay factor "D" is added to the calculation of the energy function in the improved attention mechanism. Additionally, the retrospective hidden states include not only the hidden states of the encoder but also those of the decoder. Thanks to the window, whose time span matches the length of the observed period O, the size of the retrospective hidden states remains stable, and the information for both the observed period and the predicted period can be analyzed and extracted. The A&A learning approach forecasts the concentration of the specific air pollution index PM2.5 as follows.
The formulation covers the LSTM in the sequence-to-sequence structure and the improved attention mechanism (the Bahdanau attention is taken as an example).
Assume that the retrospective hidden states are H^R = {h^r_1, h^r_2, h^r_3, ..., h^r_{O−1}, h^r_O}. At time "t," H^R is composed of the hidden states of the encoder whose time steps range from "t−1" to "O" ("O" refers to the length of the observed period) and those of the decoder from time step "1" to "t−2"; thus, the series of all related hidden states used in the calculation can be written as H^R = {h^e_{t−1}, ..., h^e_O, h^d_1, ..., h^d_{t−2}}. The forecasted value at time "t" is formed from the pollutant vector and the current hidden state of the decoder.
The energy function and the alignment vector remain unchanged.
To sum up, the A&A learning approach inherits the strength of the attention mechanism to extract more historical information and puts emphasis on time with the help of the windowed hidden states and the time decay factor.
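The effect of the time decay factor on the alignment weights can be illustrated with a small sketch. The exponential form of the decay and the `rate` parameter are assumptions made for illustration only, since the exact definition of "D" is not reproduced in this text; the sketch also assumes nonnegative energies, so that scaling by a factor below 1 always lowers a score.

```python
import math

def softmax(scores):
    """Numerically stable SoftMax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def time_decayed_weights(energies, gaps, rate=0.05):
    """Scale each energy by exp(-rate * gap) before SoftMax (hypothetical form of D)."""
    decayed = [e * math.exp(-rate * g) for e, g in zip(energies, gaps)]
    return softmax(decayed)

# Three hidden states with equal raw energy, observed 3, 2, and 1 steps ago.
energies = [1.0, 1.0, 1.0]
gaps = [3, 2, 1]
weights = time_decayed_weights(energies, gaps)
```

With equal raw energies, the most recent hidden state ends up with the largest weight, which is the behaviour the time factor is meant to produce.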

Experiments
To evaluate the performance of the A&A learning approach, we conducted certain experiments on the datasets of real atmosphere and pollutant data.

Experiment Setup and Data Collection.
All experiments were conducted on a PC server with an AMD Ryzen 7 3700X 8-core CPU at 3.6 GHz, a GeForce RTX 2080 SUPER GPU with 8 GB of memory, and 32 GB of RAM.
We utilized two datasets for the experiments: the atmosphere dataset with four dimensions of features (e.g., air temperature and dew point temperature), originated from the U.S. National Climate Data Center, and the air pollutant dataset with fifteen dimensions of features (e.g., PM2.5, PM10, and AOI), gleaned from the China Meteorological Bureau. Both datasets contained data from 2017 to 2018.
We used a crawling tool to collect data every one or three hours. Since this process might be interrupted by unexpected events, like hardware or power failures, a small missing rate, lower than 5%, was allowed in our datasets. We adopted the mean imputation method to fill in the missing data, as air quality usually does not change dramatically in a short period [37].

Feature Selection.
We conducted a correlation analysis on the air quality indexes to achieve two goals. First, the irrelevant features were identified and eliminated for the sake of the convergence of the model. Second, a specific index that held strong relations with all other indexes was determined and chosen as the object for later experiments. This is because including multiple indexes in the model would incur higher errors, and one specific index related to all others is enough to reflect the performance of the model. The correlation result of the indexes in 2018 (the most recent year in our datasets) is shown in Figure 6. The results indicated that features related to ozone were redundant, since no others correlated well with them, and that the index PM2.5 could serve as the specific index due to its high relevance to the others. Thus, the task of the proposed A&A learning approach in the following experiments was to predict the concentration of PM2.5 for the next twelve or twenty-four hours. A typical PM2.5 concentration variation in Shanghai ranging from 2018.01.01 to 2018.02.28 is shown in Figure 7.
We conducted sensitivity tests on the hyperparameter of node size of LSTM (or GRU) to show that our model does not heavily depend on specific parameter combinations. We also ran the model over various time spans to validate its predictive power.

Evaluation Metrics.
The error between the predicted value of a model and the observed value can be used to indicate the model's performance. To quantify the errors, we adopted several indicators. The mean absolute error (MAE) is a frequently used indicator that reflects the numeric gap between the ground truth and the predicted value. As the concentration of the pollutant could be lower than 20 or higher than 200, the mean absolute percentage error (MAPE) was also chosen to depict the relative errors. Finally, a correlation analysis (COR) was conducted on the predicted values and the real values to quantify the model's ability to extract the tendency.
In these metrics, "ŷ_i" represents the predicted values, "y_i" the observed values, and "n" the time span of the predicted period.
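The three indicators can be computed as in the sketch below. MAE and MAPE follow their standard definitions; for COR, the Pearson correlation coefficient is assumed here, since the exact correlation measure is not spelled out in this text. The function names are illustrative.

```python
import math

def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def mape(observed, predicted):
    """Mean absolute percentage error, in percent; assumes no observed value is zero."""
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(observed, predicted)) / len(observed)

def cor(observed, predicted):
    """Pearson correlation coefficient (assumed definition of COR)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    spread_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (spread_o * spread_p)

observed = [20.0, 40.0, 80.0]
predicted = [25.0, 35.0, 90.0]
print(mae(observed, predicted), mape(observed, predicted), cor(observed, predicted))
```

The example shows why MAPE complements MAE: an absolute error of 5 is a 25% error at a concentration of 20 but only 12.5% at 40.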

Data Preprocessing.
Data preprocessing is one of the most critical factors that could influence the training of a model. The amount, reliability, and appropriateness of the dataset, and even the interpretation of the data, could affect the ultimate results. The atmospheric dataset needed to be split into multiple parts with the same observed period and predicted period and then be fed into the minibatch training and the encoder-decoder structure. The segmentation used during the training period is illustrated in Figure 8, with the time span of the observed period being 3, that of the predicted period 2, and the step of the window 3. The green blocks of this illustration are values we can observe, and the orange blocks are values we are going to predict.
In the experiments, we separated each dataset into an observation part (denoted as "T_O") and a prediction part ("T_P"). A typical data unit could then be formed with a shape of [T_O + T_P, features], where "features" represents the number of factors used in the prediction. The time step T_S, representing the gap between two adjacent data units, was determined, and the data units could be concatenated with a shape of [batch size, T_O + T_P, features] by the minibatch algorithm during the training period.
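The segmentation into data units can be sketched as a sliding window over the time-ordered rows. The function name `segment` and the toy series are illustrative; the parameters mirror T_O, T_P, and T_S as defined above, using the Figure 8 setting of 3, 2, and 3.

```python
def segment(series, t_o, t_p, t_s):
    """Cut `series` (a list of feature rows) into (observed, target) units.

    Each unit spans t_o + t_p consecutive rows; consecutive units start
    t_s rows apart. Incomplete units at the end are dropped.
    """
    unit = t_o + t_p
    units = []
    for start in range(0, len(series) - unit + 1, t_s):
        window = series[start:start + unit]
        units.append((window[:t_o], window[t_o:]))
    return units

# Eleven hourly rows with a single feature each.
series = [[float(t)] for t in range(11)]
units = segment(series, t_o=3, t_p=2, t_s=3)
for observed_part, target_part in units:
    print(observed_part, "->", target_part)
```

With T_S smaller than T_O + T_P the units overlap, which yields more training samples from the same record at the cost of correlated batches.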
We then adopted Z-score normalization [38] to eliminate the effects of different dimensions, thus accelerating the model's convergence. The normalization is described as x' = (x − μ)/σ, where "x'" stands for the transformed data and "x" for the untransformed data; μ and σ represent the mean and the standard deviation of the feature, respectively. The training set contained the data of the whole year of 2017 and the data of 2018 before the unstable predicting period. The time span of the tests was set as a constant number of 360 hours.
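A minimal sketch of Z-score normalization for a single feature column follows. Two choices here are assumptions rather than details taken from the study: the population standard deviation is used, and the statistics are fitted on the training data and then reused for the test data.

```python
from statistics import mean, pstdev

def zscore_fit(column):
    """Return the mean and population standard deviation of one feature column."""
    return mean(column), pstdev(column)

def zscore_apply(column, mu, sigma):
    """Apply x' = (x - mu) / sigma to every value in the column."""
    return [(x - mu) / sigma for x in column]

train = [10.0, 20.0, 30.0]
test = [25.0, 40.0]

mu, sigma = zscore_fit(train)                 # statistics come from training data only
train_norm = zscore_apply(train, mu, sigma)
test_norm = zscore_apply(test, mu, sigma)     # reuse the same mu and sigma
```

Reusing the training statistics keeps information from the test period out of the model inputs, and predictions can be mapped back to concentrations with x = x' * sigma + mu.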

Comparison Results.
The results of the general comparison between the A&A learning approach, the LSTM, the GRU, the pure Decoder-Encoder model, and the traditional Attention Mechanism-based Decoder-Encoder model on the datasets of Shanghai (SH) and Wulumuqi (WLMQ) from 2017.01.01 to 2018.12.31 are shown in this section. The first experiment, conducted on the data of Shanghai, set the training epochs to 50, the time step of data separation to 48, the node size to 64, the time span of the observation to 48 hours, the prediction time to 24 hours, and the split rate of training to 0.7. The results are listed in Table 1. The segmentation of [23,72] stands for a time span of 72 hours in total and a predicted time span of 24 hours, which also indicates that the observed time span takes up 48 hours. The second experiment changed the node size to 32 and kept the other parameters the same. The results are listed in Table 2. To test our model's generalization, we conducted a third experiment on the dataset of Wulumuqi, with a node size of 64; the results are listed in Table 3. The fourth experiment changed the node size to 32 and kept the other parameters the same as those in the third one; the results are listed in Table 4.
To clarify the performance of LSTM, GRU, SEQ2SEQ, Attention, and A&A learning, a series of line charts depicting the predicted results of the first and third experiments is shown in Figures 9-18. Compared with A&A learning, the other models share a common weakness: their predictions are periodic, and it is hard for them to capture the variation trend. Thus, we can draw the conclusion that A&A learning performs better than the other models on both the Shanghai and Wulumuqi datasets in all four experiments. As shown above, the accuracy of the proposed model increased alongside the node size.
This phenomenon reflects A&A learning's requirement of a larger node size to retain more previous information for an accurate result. Among these five models, the COR of A&A learning clearly ranks much higher than that of any other model, which is also reflected in Figures 13 and 18 by the high consistency between the predicted values and the real ones. Meanwhile, when we compared the results of all five models, the stable performance of A&A learning under different pollution conditions demonstrated the generalization and effectiveness of our model.
To investigate the ultimate forecasting ability of these models, we then conducted a series of experiments with the data segmentations of [11,34] and [48,144], where the parameters were not specifically adjusted to favor our model. With various predicted spans, we can determine how far ahead A&A learning can predict. The results are listed in Table 5 (in the Dataset column, W represents Wulumuqi and S Shanghai; for the evaluation metrics MAE, MAPE, and COR, the former score is the result with an LSTM node size of 32 and the latter one that with a node size of 128).
When the time spans varied from [23,72] to [11,34] and [48,144], the results remained broadly the same. With a node size of 128, A&A learning maintains its good performance. However, with a node size of 32, A&A learning still ranked first, which implied that higher prediction accuracy of A&A learning depends on a specific node size for its inner model, and the longer the time span for prediction is, the larger the node size required.
In Table 5, although the A&A learning approach did not take first place in all the results, its results differ only slightly from the best one. Because each predicted value is used as the input for the next value to be predicted, the accumulative errors grow unavoidably.
With this approach, our prime target is to capture the variation tendency, which is reflected by the COR result. It is therefore possible for us to forecast the variation trend rather than focus on a specific value at one moment.
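The trade-off described above, where errors accumulate as predictions are fed back as inputs while the correlation with the real trend stays high, can be illustrated with a small sketch. It is not the A&A model: `true_step` and `model_step` are hypothetical one-step rules chosen only for illustration, and COR is assumed here to be the Pearson correlation coefficient.

```python
import math

def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def cor(actual, predicted):
    """Pearson correlation coefficient (assumed definition of COR)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

def recursive_forecast(step, last_observed, horizon):
    """Roll a one-step rule forward, feeding each output back as the next input."""
    values, current = [], last_observed
    for _ in range(horizon):
        current = step(current)
        values.append(current)
    return values

# Hypothetical dynamics: the "model" has a slightly wrong coefficient.
def true_step(x):
    return 0.90 * x + 5.0

def model_step(x):
    return 0.92 * x + 5.0

start = 80.0  # hypothetical last observed PM2.5 value
truth = recursive_forecast(true_step, start, 12)
preds = recursive_forecast(model_step, start, 12)
abs_errors = [abs(t - p) for t, p in zip(truth, preds)]

print("absolute error per step:", [round(e, 2) for e in abs_errors])
print("MAE:", round(mae(truth, preds), 2), "COR:", round(cor(truth, preds), 4))
```

In this toy setting the absolute error grows at every step, yet the predicted curve still follows the shape of the real one, which is the property COR is meant to capture.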
Though LSTM and GRU performed outstandingly, their simple network structures made their results fluctuate around a certain line, as illustrated in Figures 9 and 10. If the training epochs were increased, the simple network would prevent them from extracting further information, and the evaluation indicators of LSTM and GRU would rise dramatically. This result inspired us to pretrain the A&A learning to limit the variation scope of PM2.5 for better performance, which will be one of our future concerns.

Limitations.
Despite higher errors, the tendency of PM2.5 was captured by the A&A learning approach. Unlike previous studies, the approach focuses on the changing tendency of PM2.5 within the following 24 or even 48 hours. We prefer to forecast for a longer time span with an acceptable sacrifice of accuracy because it is more valuable for real-life activities.
Nevertheless, the accuracy of the prediction would be improved if the following three limitations were lifted.
First, the data of the two datasets were not collected at the same frequency: the atmosphere dataset was recorded every three hours, while the pollution dataset was recorded every hour. We therefore used an interpolation algorithm to fill in the missing data, which probably contributed to the errors.
Second, errors could have accumulated when the input features (the PM2.5 features in these experiments) were produced. The original PM2.5 data we collected were not sufficient for predicting 12, 24, or even 48 hours ahead. All the input features were calculated from the previously predicted ones, so the errors could easily accumulate and greatly damage the correlation between the predicted values and the real ones. Although we deployed windowed time steps and the decoder hidden state that contained the attention mechanism to adjust the accumulative errors, the real atmosphere and pollution status could be better simulated in an alternative way, for example, by incorporating related theories into the model.
Third, as the goal of the A&A learning approach is to balance errors and forecast time spans, a specified loss function should have been deployed to fulfill this purpose, which would have helped to smooth the convergence of the model.
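The first limitation can be made concrete with a minimal sketch of upsampling three-hourly readings to an hourly grid. The source does not name the interpolation algorithm, so linear interpolation is assumed here; the function `upsample_linear` and the sample values are hypothetical.

```python
def upsample_linear(values, step=3):
    """Linearly interpolate readings taken every `step` hours to hourly values."""
    hourly = []
    for left, right in zip(values, values[1:]):
        for k in range(step):
            # k = 0 reproduces the recorded value; k > 0 fills the gap.
            hourly.append(left + (right - left) * k / step)
    hourly.append(values[-1])  # keep the final recorded reading
    return hourly

# Hypothetical three-hourly temperature readings.
temperature_3h = [12.0, 15.0, 21.0]
temperature_1h = upsample_linear(temperature_3h)
print(temperature_1h)
```

Any change that occurs between two recorded readings, such as a short gust or a sudden temperature drop, cannot be recovered by this kind of filling, which is one way the interpolated inputs may add error to the hourly predictions.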

Discussion
Predicting air pollution indexes such as PM2.5 is of utmost importance for a healthy living environment. The A&A learning attaches adequate importance to time factors, and we reformed the input structure of the attention mechanism. In comparison with other attention-based methods, the performance of A&A learning on the Shanghai dataset proves better than all methods [39] when the segmentation is [23] and [72]; the result is shown in Table 6.
On the other hand, the practical value of models for the air pollution prediction task can be hindered by a limited forecast time span. In previous studies that mainly focused on accuracy [41][42][43], the time span was restricted to one hour, and the result was bound to be accurate, since air pollution values rarely change dramatically within one hour without specific human activity.
Thus, the correlation coefficient between the predicted and real values will remain high even if we simply take the current value as the predicted one. For this reason, our research focuses on prolonging the time span while keeping the error at an acceptable level. The result of A&A learning shown in Section 5 was better than that of other commonly used models, and we hope that more research will appear in this field.
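The point about one-hour forecasts can be checked with a naive persistence baseline, which predicts that the value `lag` hours ahead equals the current value. The sketch below uses a synthetic, smooth daily cycle rather than the paper's data; the series and the helper names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def persistence_cor(series, lag):
    """Correlation between real values and a 'current value' forecast made lag hours earlier."""
    actual = series[lag:]
    predicted = series[:-lag]
    return pearson(actual, predicted)

# Synthetic hourly series with a 24-hour cycle (an assumption, not measured data).
series = [50.0 + 20.0 * math.sin(2 * math.pi * h / 24) for h in range(96)]

cor_1h = persistence_cor(series, 1)
cor_12h = persistence_cor(series, 12)
print("persistence COR at 1 h:", round(cor_1h, 3))
print("persistence COR at 12 h:", round(cor_12h, 3))
```

On such a slowly varying series, the one-hour persistence forecast is already highly correlated with the real values, whereas at a 12-hour lag it is not. A high correlation at a one-hour span therefore says little about a model's forecasting ability.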

Conclusions
In this paper, we propose a novel A&A learning approach to predict air pollution over a longer period. Because air pollution prediction is strongly related to the factor of "time," we add a time factor to simulate the "decay" effect of time on prediction. We also include the decoder hidden states, which capture the variation trend of the historical and predicted data. Moreover, to keep the total number of hidden states stable, we propose a window method. As shown in the experiments, the A&A learning approach can predict air quality over a longer period while being more alert to recent changes and retaining good accuracy. Compared with other models, the A&A learning performs in a stable and robust way under different circumstances, although it must rely on adjusted parameters to maintain higher accuracy. The following conclusions were drawn: (1) A&A learning can prolong the time span of air pollution forecasts to over 48 hours. (2) A&A learning performs well on both datasets, with high and low levels of pollution. (3) Air pollution forecasting is a time series prediction problem, and models should place more emphasis on time factors.
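As a rough illustration of how a time factor and a window can interact in an attention step, the sketch below keeps only the most recent hidden-state scores and penalizes older ones before a softmax. This is not the formulation used in the A&A model: the linear age penalty, the function `time_decayed_attention`, and the parameters `decay_rate` and `window` are all assumptions made for illustration.

```python
import math

def time_decayed_attention(scores, decay_rate=0.1, window=4):
    """Softmax over the last `window` scores, with older positions penalized.

    scores: raw alignment scores ordered from oldest to newest.
    decay_rate: hypothetical penalty subtracted per step of age.
    window: hypothetical number of most recent states to keep.
    """
    recent = scores[-window:]  # the window keeps the number of states fixed
    n = len(recent)
    # Age 0 is the newest state; older states receive a larger penalty.
    adjusted = [s - decay_rate * (n - 1 - i) for i, s in enumerate(recent)]
    peak = max(adjusted)  # subtract the maximum for numerical stability
    exps = [math.exp(a - peak) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]

# With equal raw scores, the weights depend only on recency.
weights = time_decayed_attention([1.0] * 6, decay_rate=0.5, window=4)
print([round(w, 3) for w in weights])
```

With equal raw scores, the newest state receives the largest weight, which mirrors the idea of being more alert to recent changes while the window bounds how many states are attended to.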
For future study, we are convinced that A&A learning can support longer air pollution prediction when combined with traditional methods, for example, by using CAMx to restrict the accumulative errors of the input features.
Meanwhile, this method could contribute to longer predictions in the stock market or other time-related tasks [44]. We are convinced that it would help tasks that place great emphasis on variation tendency to obtain predictions over a longer time span while limiting the accumulative errors to acceptable levels. Also, when the traditional attention mechanism is used, the performance of A&A learning will be better than that of traditional machine-learning methods such as KNN and NNs on other problems [45].
Future study will focus more on the practical value of NNs, and thus prolonging the prediction time span would be of great help. We also expect future study to consider more features for higher accuracy. We intend to add factors of human activities, such as fireworks and traffic restrictions on private cars, to the A&A learning approach to improve the prediction accuracy or even identify the major contributors to air pollution. Further, given the accumulated prediction results, the A&A learning has room to improve. For example, if the more accurate results of professional models, such as weather forecasting models, were deployed as the input of the decoder, the accuracy of the A&A learning would be improved.

Conflicts of Interest
The authors declare that they have no known conflicts of interest or personal relationships that could have appeared to influence the work reported in this paper.