Air Quality Prediction Model Based on Spatiotemporal Data Analysis and Metalearning

With the continuous improvement of people’s quality of life, air quality has become a topic of daily concern. Accurate prediction of air quality in a variety of complex situations is the key to a rapid response by local governments. This paper studies two problems: (1) how to predict the air quality of any monitoring station based on the existing weather and environmental data while considering the spatiotemporal correlation among monitoring stations and (2) how to maintain the accuracy and stability of the forecast even when the available data is severely insufficient. A prediction model combining Long Short-Term Memory networks (LSTM) and the Graph Attention (GAT) mechanism is proposed to solve the first problem. A metalearning algorithm for the prediction model is proposed to solve the second problem. LSTM is used to characterize the temporal correlation of historical data, and GAT is used to characterize the spatial correlation among all the monitoring stations in the target city. In the case of insufficient training data, the proposed metalearning algorithm can be used to transfer knowledge from other cities with abundant training data. Tests on public datasets show that the proposed model has obvious advantages in accuracy compared with baseline models. Combined with the metalearning algorithm, it performs much better in the case of insufficient training data.


Introduction
Because of increasingly serious air pollution all over the world, air quality has become one of the issues of greatest public concern. In many countries, air quality has become a key indicator for measuring the happiness index of residents. In order to achieve real-time monitoring of air quality, almost all countries have deployed a large number of air quality monitoring stations in major cities. In addition, more and more portable monitoring devices are participating in air quality monitoring [1,2]. Although many monitoring methods have been applied, it is still extremely challenging to make accurate predictions of air quality. Especially in the case of insufficient monitoring data or poor data quality, it is more difficult to maintain the accuracy and stability of the prediction model. Air quality is affected by a variety of complex factors [3][4][5][6], including meteorological factors, industrial factors, fuel factors, traffic factors, and other human activity factors.
The development of related monitoring equipment has made the collection of air quality data more and more comprehensive [7]. With the application of a series of spatiotemporal prediction models [8], air quality prediction has made considerable progress. Time correlation refers to the impact of historical monitoring data on future data, and spatial correlation refers to the mutual influence among adjacent monitoring stations. Most of the existing research works [3,4] focus on establishing prediction models based on time correlation, while there are obvious shortcomings in the study of spatial correlation. The reason is that the diffusion of air pollutants is affected by various factors such as geographical location, wind direction, wind speed, air pressure, and air humidity. The impact of each factor on the correlation between different regions is difficult to model accurately. In this paper, we propose a spatiotemporal model for air quality prediction. The proposed model combines Long Short-Term Memory networks (LSTM) and the Graph ATtention (GAT) mechanism [9], where LSTM is used to capture the correlation in the time domain and GAT is used to model the spatial correlation among different regions.
In recent years, many deep learning models [3,4,10,11,12,13] have achieved good results in air quality prediction. However, the accuracy of these prediction models depends heavily on the sufficiency of training data. In reality, a lack of sufficient training data is the most common situation. Air quality monitoring in developing countries mainly depends on monitoring stations deployed by the government, and there are few or no such stations in small cities and towns. Insufficient training data makes it difficult for the existing prediction models to achieve accurate results in these small cities and towns. Even in large cities, government monitoring stations are very sparse. Although some unofficial monitoring devices can provide supplementary data, the data collected by these simple devices are often of poor quality, with large amounts of dirty data and missing values. Therefore, making accurate predictions based on insufficient training data is a realistic and challenging problem. Transfer learning (metalearning) [14] is currently the most effective method to solve this problem. Some transfer learning models [15][16][17] have been proposed to predict air quality with insufficient data. However, these methods require a strong similarity between the source domain and the target domain. Different cities and towns (especially large cities and small towns) differ greatly in pollution levels, climate, pollutant diffusion conditions, and density of monitoring sites. This makes it difficult for the existing transfer learning technology to transfer the knowledge acquired in large cities to the air quality prediction of small and medium cities. To meet these challenges, based on the proposed prediction model, we give a metalearning algorithm for knowledge transfer among cities with huge differences.
The main contributions of this paper are as follows:
(i) Proposing a spatiotemporal model by combining LSTM and GAT for accurate air quality prediction
(ii) Designing a metalearning algorithm for the proposed model, which can transfer knowledge among different cities and make an accurate prediction in case of insufficient training data
(iii) Verifying the advantages of the proposed model and metalearning algorithm in the aspect of prediction accuracy through a large number of experiments
The rest of this paper is organized as follows. Section 2 introduces some related research works in the area of air quality prediction, transfer learning, and metalearning; Section 3 gives the definition of the problems; Section 4 introduces the proposed prediction model and metalearning algorithm; Section 5 shows the experimental results that demonstrate the effectiveness of the proposed model and metalearning algorithm; and Section 6 summarizes the whole paper.

Related Works
This section will briefly present the related research works in the area of air quality prediction, transfer learning, and metalearning.

Air Quality Prediction.
The machine learning models for air quality prediction can be divided into two categories: basic learning models and deep learning models. Basic learning models include linear regression, support vector regression, random forest, and LightGBM. Land Use Regression (LUR) [5,6] makes air quality predictions through a linear regression model that takes into account multiple factors such as regional population level, traffic conditions, and land use conditions. LUR does not consider the complicated spatiotemporal correlation of air pollution data, so its prediction accuracy is poor. Later, the basic time series model, the autoregressive integrated moving average model (ARIMA) [18], was applied to time series forecasting with strong periodicity. However, it does not perform well under complex weather conditions. Random forest [19], LightGBM [20], and deep learning methods have since become widely used in air pollution prediction. To further improve the accuracy of prediction, Zheng et al. [21] proposed U-Air, which uses a spatial classifier based on an artificial neural network (ANN) and a temporal classifier based on the linear-chain conditional random field (CRF) to capture temporal and spatial characteristics. Convolutional neural networks (CNN) are used to process data with Euclidean structure. For example, they are very effective in the field of image recognition. However, it is impractical to use CNN directly to capture the spatial relationships in the sparse graph structure formed by monitoring stations. The ConvLSTM model proposed in [22] combines CNN and LSTM to characterize the spatiotemporal relationship between monitoring stations, but it is still applicable only to spatial relationships in Euclidean space. The emergence of Graph Convolutional Networks (GCN) [23,24] has made up for the deficiencies of CNN. GCN is widely used on traffic data, where it makes full use of the traffic network.
GAT [9] is proposed on the basis of GCN. It uses an attention mechanism and is good at capturing dynamic relationships between nodes. The ST-GAT model proposed by Zhang et al. [25] can capture the dynamic dependencies in the traffic network, giving better traffic speed prediction results than existing models.

Transfer Learning and Meta-Learning.
To improve the practicality of air quality prediction models, the obstacles caused by insufficient data must be resolved. Transfer learning can be divided into three categories according to the differences between source and target domains and tasks, namely, inductive transfer learning, transductive transfer learning, and unsupervised transfer learning [14]. In recent years, transfer learning combined with deep neural networks (DNN) has been widely used. With the help of the VGG model proposed in the image field [26], fast and accurate model training can be achieved with only a small amount of training data. Unlike image data, air quality data are more complex in spatial and temporal distribution. Hu et al. [27] proposed a DNN-based sharing model that fused multisource wind speed data together to solve the problem of insufficient wind farm data. However, this model does not provide a solution to the knowledge transfer of spatially related data. Metalearning [28][29][30] can quickly initialize the model by learning knowledge in multiple different learning tasks in order to widely adapt to a variety of situations. Reference [29] first proposes the concept of metalearning, also known as learning to learn. The goal is to train a metalearning model on multiple learning tasks, so as to use a small number of training samples to solve new learning tasks. A model-agnostic metalearning algorithm, MAML, is proposed in [28]. MAML deals with the situation of insufficient training data by transferring data and models among multiple learning tasks. Each update step consists of multitask pretraining, model migration, target task training, and model parameter synchronization. Unlike previous metalearning methods, MAML uses gradients to update the model and does not introduce additional parameters.
Reference [27] proposes a MAML-based spatiotemporal prediction model, which is used for urban traffic prediction and water quality prediction by transferring knowledge among multiple cities.

Summary of Related Works.
The related works above show that the existing air quality prediction methods rarely consider the spatial relationship between multiple monitoring stations. The few spatiotemporal prediction models lack the ability to dynamically model spatial correlation based on weather and other related factors. The only methods that can dynamically model spatial correlation do not consider how to deal with insufficient training data. Some existing methods in the area of transfer learning and metalearning can handle the insufficient-training-data situation to a certain extent by transferring knowledge from other source domains, but these methods lack the ability to adapt to air quality spatiotemporal prediction models and cannot be directly applied to the scenarios targeted in this article. For this reason, this paper proposes a spatiotemporal model for air quality prediction and a metalearning algorithm for this model. The prediction model can dynamically and accurately model the temporal and spatial correlation in air quality prediction. The metalearning algorithm is used to establish a more accurate prediction model in the case of insufficient training data. As far as we know, this is the first time that metalearning has been used for air quality prediction.

Problem Formulation
This paper addresses two problems: a prediction problem and a transfer learning problem. The prediction problem is how to build a prediction model for the target pollutant in a city with sufficient training data. The transfer learning problem is how to build a prediction model in a target city with insufficient training data, given source cities with sufficient data. The symbols used in this paper are given in Table 1.

Prediction Problem.
Suppose that there is a set of urban monitoring stations $S = \{s_1, s_2, \dots, s_n\}$ in the target city. We use a fixed time interval while counting historical data and making predictions. The prediction problem is building a model to predict the concentration of a certain pollutant sampled by a specified monitoring station in the future. The target air pollutant can be one of PM2.5, PM10, SO2, NO2, O3, CO, and AQI (which can be regarded as a comprehensive pollutant). Suppose that the current time is $t$. The input of the prediction model contains (1) a specified monitoring station, (2) the historical monitoring data of the target pollutant sampled from time $t-k$ to time $t$, (3) the historical weather information from time $t-k$ to time $t$, and (4) the weather forecast information from time $t+1$ to time $t+l$. The output of the prediction model is the predicted value of the target pollutant sampled by the specified station from time $t+1$ to time $t+l$. In practical applications, we usually set $k+1 = 2l$.
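The input and output windows described above can be illustrated with a minimal Python sketch. It assumes a single station's readings are held in a plain list ordered by time; the function name `make_windows` and the toy values are hypothetical and are not taken from the paper's code.

```python
def make_windows(series, k, l):
    """Split one station's series into (history, target) pairs.

    history covers times t-k..t (k+1 values); target covers t+1..t+l (l values).
    """
    samples = []
    # t is the index of the "current time"; it needs k earlier and l later points.
    for t in range(k, len(series) - l):
        history = series[t - k : t + 1]
        target = series[t + 1 : t + l + 1]
        samples.append((history, target))
    return samples


# Toy series standing in for hourly pollutant readings (hypothetical values).
readings = list(range(10))
pairs = make_windows(readings, k=3, l=2)  # satisfies k + 1 = 2l
first_history, first_target = pairs[0]
```

With `k=3` and `l=2`, each sample pairs four past readings with the two readings that follow them; weather and forecast features would be sliced with the same indices.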
For a monitoring station $s \in S$, define the historical data vector of $s$ as the sequence of values of the target pollutant sampled by $s$ from time $t-k$ to time $t$. The weather information used by the prediction model includes temperature, humidity, pressure, wind direction, and wind speed. The historical weather dataset of the target city is expressed as the sequence of weather records from time $t-k$ to time $t$, where $w^i$ denotes the weather of the target city at time $i$. Given the target monitoring station $s \in S$, our goal is to predict the concentration of the target pollutant sampled by $s$ from time $t+1$ to time $t+l$, which can be expressed as a vector $y_s = (y_s^{t+1}, y_s^{t+2}, \dots, y_s^{t+l})$. Suppose that $f_\theta$ with parameter $\theta$ is the model we build, and let $\hat{y}_s$, the prediction of $y_s$, be the output of $f_\theta$ on the inputs listed above. Let $D_{tc}$ be the training dataset of the target city. $D_{tc}$ contains the historical monitoring data of all monitoring stations, the historical weather data, and the historical weather forecast data collected from the target city over a period of time. The prediction problem can be formally defined as how to build an accurate prediction model $f_\theta$ based on $D_{tc}$.

Transfer Learning Problem.
In addition to constructing the prediction model, another important issue to be solved in this paper is how to make accurate predictions when there is little training data. In this case, we transfer knowledge from the source cities with sufficient training data to the target city with insufficient data. Suppose that we have $m$ source cities with sufficient training data. Let $D_{sc}^1, D_{sc}^2, \dots, D_{sc}^m$ be the training datasets collected from the source cities, respectively. Let $D_{tc}$ be the insufficient training dataset collected from the target city. The transfer learning problem can be formally defined as how to build an accurate prediction model $f_\theta$ for the target city based on $D_{sc}^1, D_{sc}^2, \dots, D_{sc}^m$ and $D_{tc}$.

Monitoring Station Graph.
In order to measure the mutual influence among different monitoring stations, we initially model all monitoring stations as a directed graph $G_r$. Monitoring stations are represented by the nodes (vertices) in $G_r$. Given two nodes $s_i$ and $s_j$, there are directed edges $\langle s_i, s_j \rangle$ and $\langle s_j, s_i \rangle$ if the Euclidean distance between $s_i$ and $s_j$ is less than or equal to $r$. Here $r$ is the influence radius of monitoring stations, i.e., the maximum range affected by the pollutant in the diffusion process. As shown in Figure 1, by setting $r = 20$ km, we get the graph among 34 monitoring stations located in Beijing, China. The weights of the edges in $G_r$ are calculated by the GAT mechanism and change over time.
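A minimal Python sketch of the graph construction rule is given below. It assumes station locations are already available as planar coordinates in kilometres, so that the Euclidean distance applies directly; the station ids, coordinates, and function names are hypothetical.

```python
import math


def build_station_graph(coords, r):
    """Return the directed edge set of G_r as a set of (source, target) pairs.

    coords maps a station id to planar (x, y) coordinates in km (an assumption;
    real station locations would first need projecting from latitude/longitude).
    """
    edges = set()
    ids = sorted(coords)
    for a in ids:
        for b in ids:
            if a == b:
                continue
            if math.dist(coords[a], coords[b]) <= r:
                edges.add((a, b))  # the reverse edge is added when the loop reaches (b, a)
    return edges


def neighbors(edges, s):
    """Stations with an edge pointing into s."""
    return sorted(a for (a, b) in edges if b == s)


# Hypothetical coordinates: s1 and s2 are 13 km apart, s3 is far from both.
stations = {"s1": (0.0, 0.0), "s2": (12.0, 5.0), "s3": (40.0, 0.0)}
graph = build_station_graph(stations, r=20.0)
```

Only the graph structure is fixed here; the edge weights are left to the attention mechanism, as the text states.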
Hereinafter, we use the term "node" to refer to a monitoring station and define the set $N_r(s)$ as the set of nodes that have an edge pointing to $s$ in $G_r$:

$$N_r(s) = \{\, s' \in S \mid \langle s', s \rangle \text{ is an edge of } G_r \,\}.$$

Air Quality Prediction Model.
To solve the prediction problem, we propose a spatiotemporal prediction model (referred to as GAT-LSTM), as shown in Figure 2. The model is a recurrent neural network incorporating the graph attention mechanism and has an encoder-decoder structure. The encoder is used to embed historical data, and the decoder is used to generate the predicted values in the future. It uses LSTM to model the time correlation of each node's data and GAT to model the spatial correlation among nodes.

Table 1: Symbols used in this paper, including the historical data vector of station $s$, the historical data vectors of all stations, the historical weather of the target city at time $i$, and the historical weather dataset.

In the encoding phase, let $h_s^i$ and $c_s^i$ be the output vector and the cell state vector (gray lines in Figure 2) of the LSTM unit of node $s$ at time $i$, respectively. Unlike the traditional approach of passing $h_s^i$ and $c_s^i$ directly to the next LSTM unit, we pass $h_s^i$ to GAT to find the spatial correlation among different nodes. Let $z_s^i$ be the output vector of GAT for node $s$ at time $i$. In the end, $z_s^i$ and $c_s^i$ are passed to the next LSTM unit. The structure of the LSTM unit is shown in Figure 3.
In the decoding phase, $l$ LSTM units are used to generate $(\hat{y}_s^{t+1}, \hat{y}_s^{t+2}, \dots, \hat{y}_s^{t+l})$, i.e., the predicted concentration of the target pollutant sampled by node $s$. For time $j$ ($t+1 \le j \le t+l$), the input of the corresponding LSTM unit is $(\hat{y}_s^{j-1}, w^j)$, where $\hat{y}_s^{j-1}$ is the prediction of the previous moment and $w^j$ is the weather forecast data at time $j$. As in the encoding phase, we pass the LSTM's output vector $h_s^j$ to GAT to generate the vector $z_s^j$. In addition to being passed to the next LSTM unit, $z_s^j$ is also passed to a Feedforward Neural Network (FNN) to generate the output $\hat{y}_s^j$.
Based on the monitoring station graph $G_r$, we use a GAT to model the spatial relationship among different nodes. In GAT, each node uses the attention mechanism [31] to collect information from neighbor nodes (weighting and summing the feature vectors of neighbor nodes) and uses the collected information to update its own feature vector. Unlike GCN, the weight of an edge in GAT is calculated based on the similarity of the feature vectors of the two corresponding nodes and changes dynamically with the nodes' data. GAT is therefore sensitive to changes in the spatial correlation among nodes caused by weather factors such as wind speed and wind direction.
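The neighbor aggregation described above can be sketched in plain Python. The scoring function used here, $v \cdot \tanh(U_1 h_{s'} + U_2 h_s)$, is an assumed additive-attention form chosen for illustration and may differ from the paper's exact formula. The softmax normalization and the weighted sum follow the text. All helper names and numeric values are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, x):
    return [dot(row, x) for row in M]


def attention_aggregate(h, neighbor_ids, s, v, U1, U2):
    """Return (weights, z_s) for node s.

    The score v . tanh(U1 h_s' + U2 h_s) is an assumed additive form.
    """
    scores = []
    for sp in neighbor_ids:
        mixed = [a + b for a, b in zip(matvec(U1, h[sp]), matvec(U2, h[s]))]
        scores.append(dot(v, [math.tanh(m) for m in mixed]))
    # Softmax over the neighbors (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(a - top) for a in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the neighbors' LSTM output vectors.
    dim = len(h[s])
    z = [sum(w * h[sp][d] for w, sp in zip(weights, neighbor_ids)) for d in range(dim)]
    return weights, z


# Hypothetical 2-dimensional LSTM outputs for three nodes.
h = {1: [0.2, -0.1], 2: [0.5, 0.3], 3: [-0.4, 0.8]}
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, z1 = attention_aggregate(h, [2, 3], 1, v=[1.0, 1.0], U1=identity, U2=identity)
```

Because the weights are recomputed from the current LSTM outputs at every time step, the effective edge weights change as the node data change, which is the property the text contrasts with GCN.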
The GAT mechanism is illustrated in Figure 4. At any time, the input of GAT is the output vectors of all nodes' LSTM units, i.e., $\{h_{s_1}, h_{s_2}, \dots, h_{s_n}\}$. The output of GAT is $\{z_{s_1}, z_{s_2}, \dots, z_{s_n}\}$. Each $z_s$ is passed to the next LSTM unit on the corresponding node as a hidden state vector. To get $z_s$, for each $s' \in N_r(s)$, GAT first calculates the similarity score $\alpha_{s',s}$ between nodes $s'$ and $s$ from $h_{s'}$ and $h_s$, where the vector $v$ and the matrices $U_1$ and $U_2$ are the parameters that need to be learned. Then, $\hat{\alpha}_{s',s}$ is calculated by normalizing all the $\alpha_{s',s}$ through the softmax layer:

$$\hat{\alpha}_{s',s} = \frac{\exp(\alpha_{s',s})}{\sum_{s'' \in N_r(s)} \exp(\alpha_{s'',s})}.$$

$\hat{\alpha}_{s',s}$ can be seen as the weight of the edge $\langle s', s \rangle$ in $G_r$. Finally, $z_s$ is calculated by a weighted summation of all its neighbors' $h$, i.e.,

$$z_s = \sum_{s' \in N_r(s)} \hat{\alpha}_{s',s}\, h_{s'}.$$

Metalearning Algorithm.
To solve the transfer learning problem, we propose a metalearning algorithm named MetaGAT-LSTM (given by Algorithm 1) for training the GAT-LSTM model in the target city with insufficient training data. The algorithm builds an accurate prediction model by transferring knowledge from source cities with sufficient data. It uses a modified version of Model-Agnostic Meta-Learning (MAML) [28] as the parameter learning method. Let $T = \{D_{sc}^1, D_{sc}^2, \dots, D_{sc}^m\}$ be the set of datasets from source cities. Define the distribution over $T$ as $P(T)$, which assigns each dataset $D \in T$ a probability of being chosen. Let $f_\theta$ with parameter $\theta$ be the prediction model at the beginning of each training iteration. At first, with respect to $P(T)$, we sample $k$ datasets $D_1, D_2, \dots, D_k$ from $T$ with replacement (Line 3 in Algorithm 1). Then, we get the next training batches $B_1, B_2, \dots, B_k$ from $D_1, D_2, \dots, D_k$, respectively (Line 4 in Algorithm 1) and get the next training batch $B_0$ from $D_{tc}$ (Line 5 in Algorithm 1). The model's parameter is updated in two steps. In the first step (Line 6~10 in Algorithm 1), for each $B_i$ ($1 \le i \le k$), we get the first-step adapted parameter $\theta_i$ by

$$\theta_i = \theta - \beta \nabla L_{B_i}(f_\theta),$$

Wireless Communications and Mobile Computing
where $L_{B_i}(f_\theta)$ is the loss of the original model $f_\theta$ on training batch $B_i$, $\nabla L_{B_i}(f_\theta)$ is the gradient of $L_{B_i}(f_\theta)$, and $\beta$ is the learning rate in the first step. In the second step (Line 11 in Algorithm 1), we get the second-step adapted parameter by

$$\theta \leftarrow \theta - \gamma \nabla_\theta \sum_{i=1}^{k} L_{B_0}(f_{\theta_i}),$$

where $L_{B_0}(f_{\theta_i})$ is the loss of the first-step adapted model $f_{\theta_i}$ on training batch $B_0$ and $\gamma$ is the learning rate in the second step.

Figure 4: The GAT mechanism for node 1 (suppose that nodes 2~6 are node 1's neighbors in $G_r$).

The cities covered by the datasets differ in geographic coordinates, city size, population density, etc., resulting in very different air quality distributions. For example, the air pollution situation in Beijing and Tianjin is much more serious than that in Shenzhen and Guangzhou. Each dataset contains the air quality data (from all monitoring stations), weather data, and weather forecast data collected from a city within one year. The sampling period is one hour. Taking the dataset of Beijing as an example, the air quality data contains the concentration of six major pollutants (PM2.5, PM10, SO2, NO2, O3, CO) and AQI sampled by 36 monitoring stations within one year. The weather data contains basic weather, temperature, humidity, air pressure, wind speed, and wind direction collected within one year. Weather forecast data contains the forecast values of the above weather indexes published by the Beijing Meteorological Bureau. There are missing and dirty values in these datasets. In order to exploit the data as much as possible, we fill in missing values with the mean over a period of time and delete the tuples with too many consecutive missing values. Table 2 shows the details of these datasets.
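The missing-value handling mentioned above can be sketched as follows. The text does not give the length of the averaging period or the threshold for "too many consecutive missing values", so the `window` and `max_gap` parameters below are illustrative assumptions, as are the function name and the sample readings.

```python
def fill_missing(values, window=3, max_gap=2):
    """Fill None entries with the mean of observed values within `window` steps.

    Returns None when a run of more than `max_gap` consecutive missing values
    is found, signalling that the record should be dropped. Both thresholds
    are illustrative assumptions.
    """
    run = 0
    for v in values:
        run = run + 1 if v is None else 0
        if run > max_gap:
            return None
    filled = list(values)
    for i, v in enumerate(values):
        if v is None:
            lo, hi = max(0, i - window), min(len(values), i + window + 1)
            observed = [x for x in values[lo:hi] if x is not None]
            # Leave the gap in place if nothing was observed nearby.
            filled[i] = sum(observed) / len(observed) if observed else None
    return filled


# Hypothetical hourly readings with gaps.
hourly = [40.0, None, 50.0, 60.0, None, None, 30.0]
cleaned = fill_missing(hourly, window=1, max_gap=2)
dropped = fill_missing([40.0, None, None, None, 30.0], window=1, max_gap=2)
```

Means are taken over the original observations only, so a filled value never feeds into the estimate for a neighboring gap.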

Experiment Setting.
There are two groups of experiments. In the first group, we compare GAT-LSTM with the most effective method to date and with benchmark models. For each dataset, we divide it into a training set and a test set, then train these models on the same training set and evaluate their effectiveness on the same test set. The models used for comparison include the following:
(i) ARIMA: Auto Regressive Integrated Moving Average. ARIMA is the most common statistical model used for time series forecasting
(ii) LSTM [32]: LSTM can learn the time dependence in time series. Compared with RNN, it can deal with longer time series and obtain better results
(iii) ST-DNN [33]: A spatiotemporal model combining Convolutional Neural Networks (CNN) and LSTM for air quality prediction. ST-DNN is the most effective method to date

In the second group, we set one city as the target city and the other cities as source cities. Then, we delete most of the data from the target city and keep only a small part for training. The proposed metalearning algorithm is used to build a prediction model by transferring knowledge from source cities with sufficient data. We compare the proposed metalearning algorithm MetaGAT-LSTM with the following transfer learning methods:
(i) Fine-Tuning: First, use the data of a single city to pretrain the GAT-LSTM model and then fine-tune the model on the target city, which is called single-source domain fine-tuning (Single-FT). Second, use the data from multiple source cities to pretrain the GAT-LSTM model and then fine-tune it on the target city, which is called multisource domain fine-tuning (Multi-FT)
(ii) MAML [28]: Use data from all cities to jointly train the model for the target city. MAML is implemented based on the metalearning method

The target pollutant is AQI (which can be seen as a single pollutant). We use Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and ACCuracy (ACC) to evaluate the models. In their definitions, $T$ is the test set, $y$ is a sample's label (the true monitoring data in the future), $f(X)$ is the predicted value of $y$, and $\|\cdot\|_1$ and $\|\cdot\|_2$ are the L1 and L2 norms, respectively.

Algorithm 1: MetaGAT-LSTM.
Input: $T = \{D_{sc}^1, D_{sc}^2, \dots, D_{sc}^m\}$: the set of training datasets from source cities; $D_{tc}$: the dataset from the target city; $P(T)$: the distribution over $T$; $\beta$, $\gamma$: learning rates
Output: $f_\theta$: the GAT-LSTM model for the target city
1. Randomly initialize $\theta$
2. While not done do:
3. Sample $k$ datasets $D_1, D_2, \dots, D_k$ from $T$ with replacement w.r.t. $P(T)$
4. Get next training batches $B_1, B_2, \dots, B_k$ from $D_1, D_2, \dots, D_k$, respectively
5. Get next training batch $B_0$ from $D_{tc}$
6. For each $B_i$ ($1 \le i \le k$) do
7-9. Calculate the first-step adapted parameter $\theta_i = \theta - \beta \nabla L_{B_i}(f_\theta)$
10. End for
11. Calculate the second-step adapted parameter $\theta \leftarrow \theta - \gamma \nabla_\theta \sum_{i=1}^{k} L_{B_0}(f_{\theta_i})$
12. End while
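The two-step parameter update of the metalearning algorithm can be illustrated with a toy Python sketch. It replaces GAT-LSTM with a one-parameter model $f_\theta(x) = \theta x$ and a squared-error loss so that all gradients can be written by hand. It also assumes that the second-step gradient is summed over the $k$ source batches and differentiated through the first step. The batches, learning rates, and function names are hypothetical, and dataset sampling from $P(T)$ is omitted.

```python
def loss(theta, batch):
    """Mean squared error of the toy model f_theta(x) = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in batch) / len(batch)


def grad(theta, batch):
    """Derivative of the loss with respect to theta."""
    return sum(2 * x * (theta * x - y) for x, y in batch) / len(batch)


def curvature(batch):
    """Second derivative of the loss, constant for this toy model."""
    return sum(2 * x * x for x, _ in batch) / len(batch)


def meta_step(theta, source_batches, target_batch, beta, gamma):
    """One iteration: adapt on each source batch, then update theta on the target batch."""
    meta_grad = 0.0
    for batch in source_batches:
        theta_i = theta - beta * grad(theta, batch)  # first step
        # Chain rule through the first step: d theta_i / d theta = 1 - beta * curvature.
        meta_grad += (1 - beta * curvature(batch)) * grad(theta_i, target_batch)
    return theta - gamma * meta_grad  # second step


# Hypothetical (x, y) batches: two source cities and one small target batch.
source_batches = [[(1.0, 2.1), (2.0, 3.9)], [(1.0, 1.8), (3.0, 6.3)]]
target_batch = [(1.0, 2.0), (2.0, 4.0)]
theta = 0.0
before = loss(theta, target_batch)
for _ in range(50):
    theta = meta_step(theta, source_batches, target_batch, beta=0.05, gamma=0.05)
after = loss(theta, target_batch)
```

In a real implementation the hand-written derivatives would be replaced by automatic differentiation over the GAT-LSTM parameters; the sketch only shows how the target batch $B_0$ enters the second step.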
In GAT-LSTM, the dimension of the GAT's output vector, the LSTM's output vector, and cell state vector are all set to 128. While training, we use dropout [34] and batch normalization [35] to strengthen the training effect. The batch size is set to 64. The number of epochs is set to 3.

Experiment Results.
At first, we need to find an appropriate influence radius $r$ for building the directed graph $G_r$ in GAT-LSTM, so we compare the performance of GAT-LSTM with different $r$. Table 3 and Figure 5 give the comparison results on the dataset from Beijing. They show that when $r < 20$ km, the accuracy of GAT-LSTM increases as $r$ increases. The reason is that when $r$ is within a reasonable range, a larger $r$ allows the model to consider more spatial correlation, thereby providing a more accurate prediction. The accuracy reaches its peak at $r = 20$ km. When $r > 20$ km, the accuracy decreases slightly as $r$ increases, because too large an $r$ causes the model to incorrectly estimate the correlation among some remote monitoring sites based on data similarity. Thus, we set $r = 20$ km in the following experiments.
In the first group of experiments, we use the dataset from Beijing to evaluate all the prediction models. We divide the dataset into a training set (70%), a validation set (20%), and a test set (10%). Each model takes data from the past 48 hours as input ($k + 1 = 48$), then outputs prediction values for the next $l$ hours. Table 4 shows the experiment results with different $l$. The best results are marked in bold. It can be seen that the traditional linear model ARIMA does not perform well under the influence of multiple complex factors. LSTM's performance is acceptable for short-term prediction and drops quickly as $l$ increases. Spatial correlation plays an important role in air quality prediction. By using CNN to extract spatial correlation among monitoring stations, ST-DNN performs much better than ARIMA and LSTM. However, the spatial correlation built by ST-DNN cannot change dynamically with the weather, which reduces its predictive power. By using GAT to dynamically model spatial correlation, GAT-LSTM gives the best performance in all cases. The performance of all models declines as $l$ increases, but the decline rate of GAT-LSTM is lower than that of the other three, which shows that it is suitable for long-term prediction.
In the second group of experiments, we execute two experiments by taking Beijing and Shenzhen as the target cities, respectively. We delete most of the data from the target city and keep only a small part for training. With different sizes of the training dataset (in the target city), the results of comparing MetaGAT-LSTM with other transfer learning methods are given in Tables 5 and 6. It can be seen that all the methods perform better with a larger training dataset. Table 5 shows that Single-FT from Tianjin is better than that from the other two cities. Table 6 shows that Single-FT from Guangzhou is better than that from the other two cities.
The climate and geographical location cause the similarity of air conditions in Tianjin and Beijing, as well as the similarity of air conditions in Guangzhou and Shenzhen. This is supported by Figure 6, in which the AQI distribution of the four cities from 2014/5/1 to 2015/4/30 is given. The more similar the two cities' datasets are, the better Single-FT performs. Multi-FT enriches the training samples by using the data from all source cities. It is better than Single-FT in some cases. However, because it simply mixes all datasets, it may cause negative transfer and perform even worse than Single-FT in some cases. Both MAML and MetaGAT-LSTM are better than the fine-tuning methods. MetaGAT-LSTM outperforms MAML in all cases by more rationally integrating data from all cities for joint training.

Conclusions
In this paper, we propose a spatiotemporal model, GAT-LSTM, that combines LSTM and GAT for air quality prediction, and we design a metalearning algorithm for GAT-LSTM for transfer learning. By more accurately modeling the temporal and spatial correlation of pollutants at all monitoring stations, GAT-LSTM performs better than the up-to-date air quality prediction models. In the case of insufficient training data from the target city, the proposed metalearning algorithm for GAT-LSTM can effectively transfer knowledge from source cities with sufficient data and jointly train an accurate prediction model. A number of comparative experiments show the effectiveness of the proposed prediction model and metalearning algorithm. In the future, we may consider more factors related to air quality to improve prediction accuracy. It would also be reasonable to apply the proposed model and metalearning algorithm to other fields.

Data Availability
The source code implemented in this article can be obtained from a GitHub repository (https://github.com/123scarecrow/paperCode), which also includes data analysis code, data preprocessing code, and training data generation code. The data used in the experiments comes from the website: http://research.microsoft.com/apps/pubs/?id=246398.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.