PM2.5 Concentration Forecasting in Industrial Parks Based on Attention Mechanism Spatiotemporal Graph Convolutional Networks

Industrial parks are one of the main sources of air pollution. Forecasting PM2.5, the main pollutant in industrial parks, is of great significance to the health of park workers and to environmental governance, and it can improve decision-making in environmental management. Most existing PM2.5 concentration forecasting methods cannot model the dynamic temporal and spatial correlations of PM2.5 concentration. To improve forecast accuracy in an industrial park environment, this paper proposes a deep learning model, a spatiotemporal graph convolutional network based on the attention mechanism (STAM-STGCN). When constructing the adjacency matrix, we use not only the Euclidean distance between sites but also the impact of wind fields and of pollution sources near the nodes. In the model, we first use the spatiotemporal attention mechanism to capture the dynamic spatiotemporal correlations in PM2.5 data. In the spatiotemporal convolution module, we use graph convolution to capture spatial features and standard convolution to describe temporal features. Finally, the output module adjusts the output shape of the data to produce the final forecast. The mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE) are used as performance evaluation metrics, and the Dongmingnan Industrial Park atmospheric dataset is used to verify the effectiveness of the proposed algorithm. The experimental results show that STAM-STGCN captures the spatial-temporal characteristics of PM2.5 concentration data more fully; compared with the best-performing comparison model, the RMSE is improved by about 24.2%, the MAE by about 35.8%, and the MAPE by about 34.6%.


Introduction
PM2.5 refers to particulate matter in the atmosphere with a diameter less than or equal to 2.5 microns, also known as fine particulate matter or particulate matter that can enter the lungs [1]. PM2.5 concentration indicates the content of such particles per cubic meter of air; the higher the value, the more serious the air pollution. The main sources are industrial fuels, dust, motor vehicle exhaust, photochemical smog, and other pollutants [2]. Although fine particulate matter is only a small component of the earth's atmosphere, it has an important impact on air quality and visibility. Compared with coarser atmospheric particulate matter, fine particulate matter has a small particle size, is rich in toxic and harmful substances, and stays in the atmosphere for a long time, so it has a greater impact on human health and on the quality of the atmospheric environment. The duration of exposure to high PM2.5 concentrations is highly correlated with mortality [3]. With the rapid development of our country's economy, industrialization and urbanization have accelerated, and air pollution problems caused mainly by PM2.5 have become increasingly prominent [4], with increasingly serious impacts on people's production and life. Because industrial parks are a main source of air pollution, forecasting their PM2.5 concentration is of great significance to human health and environmental governance.
PM2.5 concentration has nonlinear characteristics in time and space [5], and its formation mechanism and process are very complicated [6]. This complexity makes PM2.5 concentration difficult to forecast, so models that can fully exploit these complex features have become a hot topic in PM2.5 forecasting. Atmospheric data in industrial parks are closely related to time and are typical time series data; forecasting the PM2.5 concentration in industrial parks is therefore essentially a time series forecasting task. Many mature time series forecasting methods exist, including support vector regression (SVR) [7], the autoregressive moving average model (ARMA) [8], and the BP neural network [9]. However, as the amount and complexity of time series data grow, these traditional methods can extract only relatively simple linear features and struggle to extract more complex nonlinear features; their training time is long and their forecast accuracy is limited, so they cannot meet actual demand.
In recent years, with the rise of deep learning in various fields, many excellent algorithms have been developed, including the recurrent neural network (RNN) [10] and long short-term memory (LSTM) networks [11]. These algorithms focus on modelling the temporal correlation of data while ignoring its spatial correlation. In PM2.5 concentration forecasting, to account for the spatial dependence between monitoring stations, researchers began to combine spatial learning with time series learning [12], which makes full use of the spatial and temporal dependence of data [13]. Literature [14] proposed a hybrid CNN-LSTM model to forecast PM2.5 concentration; the experimental results show that its accuracy exceeds that of the LSTM model, which proves the effectiveness of the hybrid CNN-LSTM model. Literature [15] combined a convolutional neural network and a bidirectional gated recurrent unit to achieve multistep wind speed forecasting. Literature [16] constructed a CNN-LSTM hybrid model to forecast the ozone concentration in Beijing. Literature [17] combined an attention-based bidirectional gated recurrent neural network and a convolutional neural network for emotion classification and conducted experiments on four public datasets; the results show that the proposed model outperforms existing models. However, the regular convolution operations of CNN-based models [18] are only suitable for grid-structured data. To solve this problem, [19] constructed a site topology map based on the spatial distribution of the sites to describe their spatial relationship and extracted the spatial relationship between monitoring stations through a graph convolutional neural network. [20] uses graph convolution as the spatial convolution operation and one-dimensional convolution in time to form a spatiotemporal convolution block; the experimental results show that the model effectively captures the spatiotemporal correlations of data and achieves good results in traffic forecasting. Literature [21] proposed a novel attention-based spatial-temporal graph convolutional network model for traffic flow forecasting; the results show that the spatiotemporal attention mechanism effectively captures the dynamic spatial-temporal correlations in traffic data. To capture the spatial correlation of data more fully, [13] proposed a dynamic directed spatiotemporal graph convolutional network, which uses a directed graph time series to describe the topological relationship between vertices and replaces the traditional Euclidean distance with the wind field diffusion distance to describe the proximity between vertices; the experimental results show that the forecast ability of this model is better than that of the comparison models. These studies show that graph convolutional neural networks can model irregular graph-structured data and are well suited to extracting spatial correlation.
The PM2.5 concentration at each monitoring station in the industrial park has more complex nonlinear characteristics under the influence of various pollution sources, and the influence of external factors on the PM2.5 concentration is difficult to express. In summary, the challenges of PM2.5 concentration forecasting in industrial parks mainly include the following three points. First, how can the temporal and spatial characteristics of the PM2.5 concentration data be fully used and dynamically captured? Second, how can other data, such as wind direction and wind speed, be used to make effective forecasts? Finally, which network structure should be adopted to meet these two requirements? Existing forecast methods, for example, [14,20], cannot simultaneously model the temporal and spatial characteristics and the dynamic correlations of PM2.5 concentration data. To solve this problem, we propose a spatiotemporal graph convolutional network based on the attention mechanism (STAM-STGCN), which forecasts the PM2.5 concentration of all monitoring stations in the industrial park jointly and captures the dynamic spatial-temporal characteristics of the data more effectively. The main contributions of this paper are summarized as follows:
(1) We use the number of pollution sources and the dynamic wind field information to jointly construct an adjacency matrix that defines the spatial correlation between the stations, so that the spatial relationship is better described.
(2) We model the dynamic spatiotemporal correlations of PM2.5 data through the spatiotemporal attention mechanism. Spatial attention models the dynamic spatial correlation of monitoring stations at different locations in the industrial park, and temporal attention captures the dynamic temporal correlation between different times.
(3) We combine graph convolution and temporal convolution to construct a spatiotemporal convolution module for extracting the spatiotemporal correlations of data.
This paper is divided into five sections, and the rest is organized as follows. Section 2 introduces the source of the dataset, the data preprocessing, and the data analysis. Section 3 describes the construction of the graph data and the network structure of the forecast model in detail. Section 4 presents and analyses the experimental results. Finally, Section 5 summarizes the paper and discusses future work.

Dataset
2.1. Data Sources. The dataset used in this paper comes from the real atmospheric data of Dongmingnan Industrial Park; a panoramic view of the park is shown in Figure 1. The data are collected mainly by IoT sensing equipment that monitors the smoke and the toxic and harmful gases emitted by the industrial park, as shown in Figure 2. The sensing equipment is distributed on the boundary of the park, on the boundaries of enterprises, inside enterprises, in sensitive areas, and at mobile monitoring points according to the layout principle of points, lines, and surfaces. Through the atmospheric monitoring gateway device, the data collected by the sensing equipment are uploaded to the database over 4G or a wired network every 30 seconds. We selected 9 monitoring stations with relatively complete data; the time span is from 0:00 on August 25, 2020 to 0:00 on February 2, 2021. The information of the dataset is shown in Table 1.

2.2. Data Preprocessing. The data preprocessing in this paper consists of the following three steps; the flow chart is shown in Figure 3.
(1) Missing values are filled with the value from the monitoring station that has the largest correlation coefficient with the station in question.
(2) The atmospheric monitoring stations collect data every 30 seconds, but because of network delays and other reasons, the interval of the data actually stored in the database is not strictly 30 seconds, which makes the data irregular. We therefore resample the data to a 10-minute interval to ensure the regularity of the dataset.
(3) The data are standardized with the z-score method to speed up the training process.

2.3. Data Analysis. We visualize part of the PM2.5 concentration data, as shown in Figure 4. The figure shows that the highest PM2.5 concentration occurs in the morning; the concentration gradually drops to its lowest level in the afternoon and then gradually rises at night until the next morning, so the changes are basically cyclical. There is a strong correlation between different monitoring stations, although their values differ. For example, the 7# monitoring station has the lowest PM2.5 concentration because it is located at the edge of Dongmingnan Industrial Park and there is no pollution source nearby. The pollution sources around a monitoring station greatly affect the concentration of PM2.5, but this effect is not static: the influence of pollution sources on PM2.5 concentration is dynamic, the uneven distribution of pollution sources in space affects the spatial correlation between sites, and this spatial correlation is also dynamic. Therefore, the forecast framework needs an effective method to capture the dynamic temporal and spatial dependence of PM2.5 concentration simultaneously.
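Steps (2) and (3) of the preprocessing can be illustrated with a minimal sketch. It assumes that resampling means averaging all readings that fall into the same 10-minute bin and that the z-score uses the population standard deviation; the paper does not state either detail, and the function names `resample_mean` and `z_score` are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def resample_mean(readings, minutes=10):
    """Average irregular (timestamp, value) readings into fixed-width bins.

    Assumption: each bin is represented by the mean of its readings.
    """
    width = minutes * 60
    bins = {}
    for ts, value in readings:
        # Floor each timestamp to the start of its bin within the day.
        seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        start = midnight + timedelta(seconds=seconds - seconds % width)
        bins.setdefault(start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(bins.items())]


def z_score(values):
    """Standardize values to zero mean and unit population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant series carries no variation; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

In practice the same mean and standard deviation would be stored so that forecasts can be transformed back to concentration units.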

Graph Data Construction and Forecast Model
In this section, according to the characteristics of the industrial park, we first introduce the construction idea of the graph of the industrial park monitoring station and the construction method of graph data and then introduce our deep learning forecast model.

Graph Data Construction.
In the industrial park scenario, it is impractical to construct grid data, so we model the distribution of the monitoring stations as a graph that characterizes the spatial correlation between them. We abstract the spatial distribution of $N$ monitoring stations at a given time as a graph $G = (V, E, A)$; under the influence of the wind field, this graph is directed. Here $V$ is the finite set of monitoring stations, $E$ is the edge set, and $A \in \mathbb{R}^{N \times N}$ is the weighted adjacency matrix of the graph. Each monitoring station on the graph records the PM2.5 concentration at the same sampling frequency, which yields the graph sequence data shown in Figure 5.
The PM2.5 concentration forecast problem is essentially a time series forecast problem, that is, using historical PM2.5 concentration data of $m$ continuous time steps to forecast the PM2.5 concentration of the next $n$ time steps. We use the sliding window method to divide the data, as shown in Figure 6, where $X_t$ represents the graph data composed of all monitoring stations at time $t$.
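The sliding window division can be sketched as follows. This is a minimal illustration, assuming a stride of one time step between consecutive windows; the function name `sliding_windows` is hypothetical, and each element of `series` stands for one graph snapshot $X_t$.

```python
def sliding_windows(series, m, n):
    """Split a time-ordered sequence into (observation, prediction) pairs.

    Each pair holds m consecutive snapshots followed by the next n snapshots.
    """
    samples = []
    for start in range(len(series) - m - n + 1):
        observation = series[start:start + m]
        prediction = series[start + m:start + m + n]
        samples.append((observation, prediction))
    return samples
```

A sequence shorter than `m + n` yields no samples, so the caller does not need a separate length check.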
A graph can represent the spatial relationship between geospatial data. When we forecast the PM2.5 concentration, we need to consider the spatial relationship between monitoring stations. In general, the Euclidean distance between sites is used to represent their spatial association; this value can be understood as the difficulty of interaction between sites. However, PM2.5 cannot diffuse completely freely, because it is affected by wind. Therefore, in the industrial park scenario, we need to consider the impact of wind [13]. The number of pollution sources around a monitoring station also largely affects the PM2.5 concentration, and the uneven distribution of pollution sources in space affects the correlation between the stations, so we need to consider the pollution sources around the stations as well.
To obtain the influence of wind on the monitoring stations, we introduce the Gaussian diffusion model, a standard model for the problem of wind field diffusion. In the basic formula of the model, Equation (1), $C_0$ represents the concentration of air pollutants; $x$ and $y$ represent the downwind distance and the horizontal distance between the point of interest and the centerline, respectively; $z$ represents the height of the pollution source; $u$ is the horizontal wind speed; $\sigma_y$ and $\sigma_z$ represent the standard deviations of horizontal and vertical dispersion, respectively; and $Q$ is the source strength, that is, the amount of pollutants discharged per unit time. When only horizontal diffusion is considered, the formula can be simplified to Equation (2) [22].
In Equation (2), $A$ and $B$ are the starting point and the end point; $\mathrm{cost}(E_{AB})$ describes the difficulty of air pollutant diffusion from point $A$ to point $B$; $E_{AB}$ is the edge between the two points; $D_A$ and $D_B$ are the wind direction azimuths at the two points; $D_M$ is the azimuth of $E_{AB}$; $L_{AB}$ is the length of $E_{AB}$, that is, the distance between point $A$ and point $B$ in kilometres; and $F$ is a function that calculates the absolute value of the azimuth difference. Because the geographical extent of the industrial park is limited, the wind direction at all monitoring stations at the same time can be regarded as the same, so the formula can be simplified to Equation (3), in which the constant term can be omitted.
To measure the impact of pollution sources on monitoring stations, we set a distance $S$ and take the number of pollution sources within a radius of $S$ around a monitoring station as the number of pollution sources affecting that station; $S$ defaults to 0.2 km. We cannot know the types of pollution sources around a monitoring station, only their approximate number. To reduce the error caused by the uncertain source types, we do not use the number of pollution sources directly; instead, we divide the numbers into ranges and obtain the influence coefficient $c$ of the number of pollution sources on the monitoring station. Table 2 shows our division of influence coefficients, where $r$ is a variable representing the span of each range and defaults to 3. For example, if there are six pollution sources around monitoring station 1, then its influence coefficient is 3, that is, $c_1 = 3$.
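The counting and bucketing described above can be sketched as follows. Table 2 is not reproduced here, so the exact range boundaries are an assumption: the sketch uses `count // r + 1`, which was chosen only because it reproduces the worked example (six sources with $r = 3$ give a coefficient of 3). The planar kilometre coordinates and both function names are hypothetical.

```python
import math


def count_sources_within(station, sources, s_km=0.2):
    """Count pollution sources within s_km of a station.

    Assumption: station and sources are (x, y) positions in kilometres
    on a local planar grid.
    """
    sx, sy = station
    return sum(1 for x, y in sources if math.hypot(x - sx, y - sy) <= s_km)


def influence_coefficient(num_sources, r=3):
    """Map a source count to a coarse influence coefficient.

    Assumption: counts 0..r-1 map to 1, r..2r-1 map to 2, and so on.
    """
    if num_sources < 0:
        raise ValueError("num_sources must be non-negative")
    return num_sources // r + 1
```

If Table 2 uses different boundaries, only the last line of `influence_coefficient` would need to change.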
Based on the above analysis, we build the weighted adjacency matrix from two quantities: $c_i$, the influence coefficient of the number of pollution sources on monitoring station $i$, and $\mathrm{cost}(E_{ij})$, the difficulty of air pollutant diffusion from station $i$ to station $j$. Because the value of $\mathrm{cost}(E_{ij})$ is relatively large, we normalize it with the min-max normalization method, so that the data fall within a smaller range and the data distribution is more reasonable.
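The min-max normalization of the diffusion costs can be sketched as below. The sketch assumes that the minimum and maximum are taken over all entries of the cost matrix rather than per row, which the text does not specify; `min_max_normalize` is a hypothetical name.

```python
def min_max_normalize(matrix):
    """Rescale all entries of a 2-D list to the range [0, 1]."""
    flat = [value for row in matrix for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # All costs are equal, so there is no spread to rescale.
        return [[0.0 for _ in row] for row in matrix]
    return [[(value - lo) / (hi - lo) for value in row] for row in matrix]
```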

Proposed Forecasting Model.
We use $x_t^i$ to represent the PM2.5 concentration of node $i$ at time $t$; $X_t = (x_t^1, x_t^2, \ldots, x_t^N)^T \in \mathbb{R}^{N \times 1}$ represents the PM2.5 concentration of all nodes at time $t$; $\chi = (X_1, X_2, \ldots, X_m)^T \in \mathbb{R}^{N \times 1 \times m}$ denotes the PM2.5 concentration of all nodes over $m$ time slices; $\hat{y}^i = (\hat{y}^i_{m+1}, \hat{y}^i_{m+2}, \ldots, \hat{y}^i_{m+n})$ represents the forecasted value of node $i$ in the next $n$ time slices; $\hat{Y} = (\hat{y}^1, \hat{y}^2, \ldots, \hat{y}^N)^T \in \mathbb{R}^{N \times n}$ represents the forecasted value of all nodes in the next $n$ time slices; and $Y = (y^1, y^2, \ldots, y^N)^T \in \mathbb{R}^{N \times n}$ represents the true value of all nodes in the next $n$ time slices. The PM2.5 concentration forecast can be expressed as follows:

$$\hat{Y} = f(\chi; A),$$

where $f$ indicates the forecast method and $A$ is the adjacency matrix. That is, the PM2.5 concentration data of all monitoring stations in the past $m$ moments are used to forecast the PM2.5 concentration of all monitoring stations in the future $n$ moments. Figure 7 shows the overall framework of the proposed STAM-STGCN model, which is composed of the spatiotemporal attention block, the spatiotemporal convolution block, and the output block. The details of each part are as follows.

Spatiotemporal Attention Block.
This block contains two kinds of attention, namely, spatial attention and temporal attention, which are used to capture the dynamic spatiotemporal correlation of data.
In the time dimension, PM2.5 concentrations in different time periods are correlated, but the correlation is different in different situations. We use the temporal attention mechanism to adaptively assign different weights to the data to capture this dynamic correlation.
Here, $V_e$, $b_e$, $U_1$, $U_2$, and $U_3$ are the learnable parameters of the temporal attention, $\chi = (X_1, X_2, \ldots, X_m)^T \in \mathbb{R}^{N \times 1 \times m}$ is the input of the model, $\sigma$ is the activation function, and $E$ is the temporal attention matrix, which we normalize with the softmax function. We take the Hadamard product of the normalized temporal attention matrix and the input $\chi$ to obtain the dynamically adjusted input $\chi'$.
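The softmax normalization of an attention matrix can be sketched as below. This shows only the normalization step, not the computation of the raw scores, and it assumes that the softmax is applied along each row; `softmax_rows` is a hypothetical name.

```python
import math


def softmax_rows(scores):
    """Apply a numerically stable softmax to each row of a 2-D list."""
    normalized = []
    for row in scores:
        peak = max(row)
        # Subtracting the row maximum avoids overflow in exp().
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        normalized.append([e / total for e in exps])
    return normalized
```

After this step every row sums to 1, so each entry can be read as the relative weight one time slice assigns to another.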
In the spatial dimension, we use the spatial attention mechanism to adaptively capture the dynamic correlation between nodes.
Here, $V_S$, $b_S$, $W_1$, $W_2$, and $W_3$ are the learnable parameters of the spatial attention and $\sigma$ is the activation function. The spatial attention matrix $S \in \mathbb{R}^{N \times N}$ is dynamically calculated from the input, and $S_{ij}$ represents the correlation strength between node $i$ and node $j$. We use the softmax function to ensure that the attention weights between nodes sum to 1; then, we combine the adjacency matrix $A$ with the spatial attention matrix $S$ to dynamically adjust the weights between nodes, which gives the dynamically adjusted adjacency matrix $A'$ and $\chi''$. $\chi''$ is the output of the spatial attention module, which is the input of the spatiotemporal convolution block.

Figure 6: Sliding window method (observation window $X_{t-m+1}, \ldots, X_{t-1}, X_t$; prediction window $X_{t+1}, \ldots, X_{t+n}$).

Spatiotemporal Convolution Block.
Following [20], the spatiotemporal convolutional block is composed of two temporal convolution layers and one spatial graph convolution layer. In this module, the input data is first convolved in the time dimension, and the result is then passed through the graph convolution. The output of the graph convolution is passed through an activation function, and a second time-dimension convolution is then performed. The calculation process is as follows:

$$O = \Gamma_1 *_\tau \mathrm{ReLU}\left(\theta *_g \left(\Gamma_0 *_\tau \chi''\right)\right),$$

where $\Gamma_0$ and $\Gamma_1$ are the upper and lower temporal kernels within the block, respectively; $\theta$ is the convolution kernel of the graph convolution; ReLU denotes the rectified linear unit function; $*_\tau$ denotes the time-dimension convolution; $*_g$ denotes the graph convolution; $\chi''$ is the input; and $O$ is the output.
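The order of operations in the spatiotemporal convolution block (temporal convolution, graph convolution, ReLU, temporal convolution) can be traced with a deliberately simplified sketch. It assumes a single feature channel, a scalar graph kernel `theta`, a one-hop graph convolution that multiplies by the adjacency matrix, and unpadded temporal convolutions without gating; the actual layers in the model are richer than this, and all function names are hypothetical.

```python
def temporal_conv(x, kernel):
    """Unpadded 1-D convolution along the time axis of every node.

    x is a list of N rows, each holding T values for one node.
    """
    k = len(kernel)
    return [
        [sum(kernel[j] * row[t + j] for j in range(k))
         for t in range(len(row) - k + 1)]
        for row in x
    ]


def graph_conv(x, adj, theta):
    """One-hop graph convolution: node i aggregates adj[i][j] * x[j]."""
    n, steps = len(x), len(x[0])
    return [
        [theta * sum(adj[i][j] * x[j][t] for j in range(n))
         for t in range(steps)]
        for i in range(n)
    ]


def relu(x):
    """Element-wise rectified linear unit."""
    return [[max(0.0, value) for value in row] for row in x]


def st_conv_block(x, adj, gamma0, theta, gamma1):
    """Temporal conv -> graph conv -> ReLU -> temporal conv."""
    hidden = temporal_conv(x, gamma0)
    hidden = relu(graph_conv(hidden, adj, theta))
    return temporal_conv(hidden, gamma1)
```

With unpadded kernels of lengths $k_0$ and $k_1$, the time dimension shrinks from $T$ to $T - (k_0 - 1) - (k_1 - 1)$, which is why the output block still has a temporal dimension to merge.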

Output Block.
The output block consists of a temporal convolution layer and a fully connected layer. First, the temporal convolution layer merges the temporal dimensions output by the spatiotemporal convolution block to obtain $O'$; the final forecast result is then obtained through the fully connected layer:

$$\hat{Y} = W O' + b,$$

where $W$ is a weight vector and $b$ is a bias; both $W$ and $b$ are learnable parameters.

Experiment and Evaluation
4.1. Performance Evaluation Method. To measure the effectiveness of our proposed method and to compare it conveniently with the comparison models, we use three metrics: the mean absolute error (MAE), the root mean squared error (RMSE), and the mean absolute percentage error (MAPE). MAE, RMSE, and MAPE are calculated by Equations (10)-(12):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \tag{10}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \tag{11}$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|, \tag{12}$$

where $y_i$ represents the true value, $\hat{y}_i$ represents the forecasted value, and $n$ represents the number of samples. RMSE measures the deviation between the forecasted value and the true value; it is one of the most common evaluation metrics and is numerically equal to the square root of the mean squared error.

Wireless Communications and Mobile Computing

In the experimental process, the first 80% of the entire dataset is used as the training set, the next 10% as the validation set, and the last 10% as the test set. We implemented STAM-STGCN in the PyTorch framework [23]. The parameters of the model are shown in Table 3.
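The three evaluation metrics can be computed as in the following sketch, which uses the standard definitions of MAE, RMSE, and MAPE. The function names are hypothetical, and `mape` assumes that no true value is zero.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (true values must be nonzero)."""
    ratios = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * ratios / len(y_true)
```

Because the data are z-score standardized before training, the metrics are normally computed after transforming forecasts back to concentration units; that inverse transform is not shown here.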
We use the mean squared error as the cost function of our model. Training the model is the process of continuously optimizing the parameters to minimize the cost function, which for STAM-STGCN can be expressed as

$$\min_{\theta} \frac{1}{M}\sum_{i=1}^{M}\left(h_\theta(x_i) - y_i\right)^2, \qquad \hat{y}_i = h_\theta(x_i),$$

where $\theta$ denotes the parameters of STAM-STGCN, which are continuously updated during training; $\hat{y}_i$ is the forecasted value of PM2.5 concentration; $y_i$ is the true value; $h_\theta(x_i)$ is the value forecasted with the parameters $\theta$ for the input $x_i$; and $M$ is the number of training samples.
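The chronological 80%/10%/10% split and the mean squared error cost can be sketched together as below. The sketch assumes that the split is made on the time-ordered samples without shuffling, as the words "first" and "last" in the experimental description suggest; `chronological_split` and `mse` are hypothetical names.

```python
def chronological_split(samples, train=0.8, val=0.1):
    """Split time-ordered samples into train, validation, and test parts."""
    total = len(samples)
    train_end = int(total * train)
    val_end = train_end + int(total * val)
    return samples[:train_end], samples[train_end:val_end], samples[val_end:]


def mse(y_true, y_pred):
    """Mean squared error over M samples."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

Keeping the order intact means the test set contains only data recorded after every training sample, which avoids leaking future observations into training.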

Results and Analysis.
To verify the advantages of the proposed model, we selected four time series forecasting models for comparison, namely, HA, LSTM, GRU, and STGCN. Each model uses the same dataset as the proposed method; the experimental results are shown in Table 4. To show the advantages of the proposed model more intuitively, a histogram of the evaluation metrics of each model is given in Figure 8. The experimental results in Table 4 show that the proposed method has a better forecast effect than the comparison models. We can also see that the forecast results of traditional time series analysis methods are not ideal, which shows that the ability of these methods to deal with complex spatiotemporal data is limited. In contrast, methods based on deep learning usually obtain better forecast results than traditional time series analysis methods. Among the deep learning methods, both the STGCN model and our proposed model consider spatial-temporal correlations, and their results are better than those of traditional deep learning models such as LSTM and GRU; this shows that, in the industrial park scenario, considering the spatial-temporal correlations of monitoring stations is useful for forecasting PM2.5. Furthermore, the results of our proposed model are better than those of the STGCN model, which shows that the attention mechanism is effective in capturing the dynamic spatial-temporal correlations of data.
In order to further compare the forecast capabilities of each model, we drew a line graph of the forecasted and true values of each model on the test set, as shown in Figure 9. It can be seen from the figure that our model shows the best fit between the forecasted value and the observed value; among them, the green line in the figure represents the forecast of our model, and the purple line represents the true value.
To compare the performance of the models, we plot the loss during the training process, as shown in Figure 10. The figure shows that our model converges faster and is relatively stable after convergence. However, the improvement in performance comes with a small increase in complexity compared with the comparison models; this increase is within a controllable range and is acceptable relative to the improvement in performance. In the future, we will work to reduce this complexity.

Conclusion and Future Work
In this research, we propose a deep learning model, STAM-STGCN, to forecast PM2.5 concentration in industrial parks. We construct the PM2.5 concentration data of all monitoring stations in the industrial park as graph time series data and use temporal convolution and graph convolution, combined with the spatiotemporal attention mechanism, to capture the dynamic spatiotemporal characteristics of PM2.5 concentration simultaneously. When constructing the adjacency matrix, we also consider the wind field information and the number of pollution sources around the monitoring stations and formulate an information fusion strategy to represent the adjacency matrix. Our model is verified on the real atmospheric dataset of Dongmingnan Industrial Park. The experimental results show that, compared with the best-performing comparison model, the RMSE of our model is improved by about 24.2%, the MAE by about 35.8%, and the MAPE by about 34.6%; the forecast accuracy of the model is better than that of the comparison models. In industrial parks, the PM2.5 concentration is also affected by many other factors, such as weather conditions and wind speed. In the future, we will consider these influencing factors to further improve the forecasting accuracy. Because STAM-STGCN is a general spatial-temporal forecasting framework for graph-structured data in industrial park scenarios, it has a certain universality and can also be applied to the forecast of PM10 and other air pollutants in industrial parks.

Data Availability
The data used to support the findings of this study are included within the article.