A DBN-Based Deep Neural Network Model with Multitask Learning for Online Air Quality Prediction

To avoid the adverse effects of severe air pollution on human health, we need accurate real-time air quality prediction. In this paper, to improve the prediction accuracy of air pollutant concentrations, a deep neural network model with multitask learning (MTL-DBN-DNN), pretrained by a deep belief network (DBN), is proposed for forecasting of nonlinear systems and tested on the forecast of air quality time series. The MTL-DBN-DNN model can solve several related prediction tasks at the same time by using shared information contained in the training data of different tasks. In the model, the DBN is used to learn feature representations. Each unit in the output layer is connected to only a subset of units in the last hidden layer of the DBN. Such a connection avoids the problem that a fully connected network must juggle the learning of every task while being trained, so that the trained network cannot reach optimal prediction accuracy for each task. A sliding window is used to take the recent data and dynamically adjust the parameters of the MTL-DBN-DNN model. The MTL-DBN-DNN model is evaluated with a dataset from Microsoft Research. Comparison with multiple baseline models shows that the proposed MTL-DBN-DNN achieves state-of-the-art performance on air pollutant concentration forecasting.


Introduction
Air pollution is becoming increasingly serious. To protect human health and the environment, accurate real-time air quality prediction is sorely needed.
There are nonlinear and complex interactions among the variables of air quality prediction data. Artificial neural networks can be used as a nonlinear system to express complex nonlinear maps, so they have been frequently applied to real-time air quality forecasting (e.g., [1][2][3][4][5]).
Deep networks have significantly greater representational power than shallow networks [6]. To solve several difficulties of training deep networks, Hinton et al. proposed the deep belief network (DBN) in [7]. A DBN is trained via a greedy layer-wise training method and automatically extracts deep hierarchical abstract feature representations of the input data [8,9]. Deep belief networks can be used for time series forecasting (e.g., [10][11][12][13][14][15]). For these reasons, the prediction model proposed in this paper is based on a deep neural network pretrained by a deep belief network.
Multitask learning can improve learning for one task by using the information contained in the training data of other related tasks [16]. Multitask deep neural networks have already been applied successfully to many real problems, such as multilabel learning [17], compound selectivity prediction [18], traffic flow prediction [19], speech recognition [20], categorical emotion recognition [21], and natural language processing [22]. Collobert and Weston demonstrated that a unified neural network architecture, trained jointly on related tasks, provides more accurate prediction results than a network trained only on a single task [22].
Current air quality prediction studies mainly focus on one kind of air pollutant and perform single-task forecasting. The most studied problem is PM2.5 concentration prediction. However, there are correlations between some of the air pollutants we predict, so the different prediction tasks are related to some degree. For example, SO2 and NO2 are related because they may come from the same pollution sources. Studies have shown that sulfate (SO₄²⁻) is a major PM constituent in the atmosphere [23]. In 2016, a discovery revealed that the aqueous oxidation of SO2 by NO2, under specific atmospheric conditions, is key to efficient sulfate formation, and that this chemical reaction led to the 1952 London "Killer" Fog [24]. A study published in the US journal Science Advances also discovered that fine water particles in the air acted as a reactor, trapping sulfur dioxide (SO2) molecules and interacting with nitrogen dioxide (NO2) to form sulfate [25]. Therefore, we can regard the concentration forecasting of these three kinds of pollutants (PM2.5, SO2, and NO2) as related tasks. Figure 1 shows some of the historical monitoring data for the concentrations of the three kinds of pollutants at a target station (Dongcheng Dongsi air-quality-monitor-station) selected in this study. The three kinds of pollutants show almost the same concentration trend, which supports treating their concentration forecasts as related tasks.
In this paper, based on the powerful representational ability of the DBN and the advantage of multitask learning in allowing knowledge transfer, a deep neural network model with multitask learning capabilities (MTL-DBN-DNN), pretrained by a deep belief network (DBN), is proposed for forecasting of nonlinear systems and tested on the forecast of air quality time series. The DBN is used to learn feature representations, and several related tasks are solved simultaneously by using shared representations.
For multitask learning, a deep neural network with local connections is used in the study. Such a connection avoids the problem that a fully connected network must juggle the learning of every task while being trained, so that the trained network cannot reach optimal prediction accuracy for each task. The locally connected architecture can learn both the commonalities and the differences of multiple tasks.
In order to get a better prediction of future concentrations, the sliding window [26,27] is used to take the recent data and dynamically adjust the parameters of the prediction model.
The rest of the paper is organized as follows. Section 2 presents the background knowledge of multitask learning, deep belief networks, and DBN-DNN and describes the DBN-DNN model with multitask learning (MTL-DBN-DNN). In Section 3, the proposed MTL-DBN-DNN model is applied to the case study of the real-time forecasting of air pollutant concentrations, and the results and analysis are shown. Finally, Section 4 presents the conclusions of the paper.

MultiTask Learning.
Multitask learning can improve learning for one task by using the information contained in the training data of other related tasks. Multitask learning learns tasks in parallel, and "what is learned for each task can help other tasks be learned better" [16].
Several related problems are solved at the same time by using a shared representation. Related learning tasks can share the information contained in their input data sets to a certain extent. Multitask learning exploits commonalities among different learning tasks, and this exploitation allows knowledge transfer among them. A neural network with multitask learning capabilities differs from a simple neural network with multiple outputs as follows: in the multitask case, the input feature vector is made up of the features of each task, and the hidden layers are shared by multiple tasks. Multitask learning is often adopted when training data is very limited for the target task domain [28].

Deep Belief Networks and DBN-DNN.
Deep belief networks (DBNs) [29] are probabilistic generative models built by stacking many layers of restricted Boltzmann machines (RBMs), each of which contains a layer of visible units and a layer of hidden units. A DBN can be trained to extract a deep hierarchical representation of the input data using greedy layer-wise procedures. After a layer of RBM has been trained, the representations of the previous hidden layer are used as inputs for the next hidden layer. A schematic representation of a DBN is shown in Figure 2, where $h^{(1)}$ and $h^{(2)}$ are the state vectors of the hidden layers, $v$ is the state vector of the visible layer, $W^{(1)}$ and $W^{(2)}$ are the matrices of symmetrical weights, $b^{(1)}$ and $b^{(2)}$ are the bias vectors of the hidden layers, and $b^{(0)}$ is the bias vector of the visible layer. The first hidden layer is computed from the visible layer as

$$h^{(1)} = \sigma\left(b^{(1)} + v^{T} W^{(1)}\right).$$

The weights from the trained DBN can be used as the initial weights of a DNN [8,30]; all of the weights are then fine-tuned by applying backpropagation or other discriminative algorithms to improve the performance of the whole network. When a DBN is used to initialize the parameters of a DNN, the resulting network is called a DBN-DNN [31].
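The greedy layer-wise procedure can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes Bernoulli RBMs in every layer and one step of contrastive divergence (CD-1) as the training rule, neither of which is stated in this section, and the function names (`hidden_probs`, `cd1_update`, `pretrain_stack`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_probs(v, W, b_hid):
    """Hidden activation sigma(b + v^T W) for a batch of visible rows v."""
    return sigmoid(b_hid + v @ W)

def cd1_update(v0, W, b_vis, b_hid, lr, rng):
    """One CD-1 step for a Bernoulli RBM (assumed rule); updates in place."""
    p_h0 = hidden_probs(v0, W, b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sample hidden states
    p_v1 = sigmoid(b_vis + h0 @ W.T)                    # reconstruct visibles
    p_h1 = hidden_probs(p_v1, W, b_hid)
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)

def pretrain_stack(data, layer_sizes, epochs, lr, seed=0):
    """Train RBMs one layer at a time; each layer's hidden activations
    become the next layer's input. Returns [(W, b_hid), ...]."""
    rng = np.random.default_rng(seed)
    params, layer_input = [], data
    for n_hid in layer_sizes:
        n_vis = layer_input.shape[1]
        W = 0.01 * rng.standard_normal((n_vis, n_hid))
        b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
        for _ in range(epochs):
            cd1_update(layer_input, W, b_vis, b_hid, lr, rng)
        params.append((W, b_hid))
        layer_input = hidden_probs(layer_input, W, b_hid)
    return params

# Toy usage on random data scaled to [0, 1] (illustrative sizes only).
toy_data = np.random.default_rng(1).random((20, 6))
params = pretrain_stack(toy_data, layer_sizes=[5, 4], epochs=3, lr=0.1)
```

The returned weights and hidden biases are what would initialize the corresponding layers of the DNN before discriminative fine-tuning.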

DBN-Based Deep Neural Network Model with MultiTask Learning (MTL-DBN-DNN).
In this section, a DBN-based multitask deep neural network prediction model is proposed to solve multiple related tasks simultaneously by using shared information contained in the training data of different tasks.
The DBN-DNN prediction model with multitask learning is constructed from a DBN and an output layer with multiple units. The deep belief network is used to extract better feature representations, and several related tasks are solved simultaneously by using shared representations. The sigmoid function is used as the activation function of the output layer.
Each unit in the output layer is connected to only a subset of units in the last hidden layer of the DBN. Let N be the number of related tasks to be processed, and let α be the size of the subset (that is, the ratio of the number of nodes in the subset to the number of nodes in the entire last hidden layer); then 1/(N−1) > α > 1/N. At the locally connected layer, each output node has a portion of hidden nodes that are connected only to it; let β be the relative size of this portion, then 0 < β < 1/N. Two adjacent subsets share a specified number of common units.
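One way to picture the locally connected output layer is as a 0/1 mask applied to an otherwise dense weight matrix. The sketch below is only an illustration under stated assumptions: the contiguous, evenly spaced window layout and the names `local_connection_mask` and `masked_output` are hypothetical, and the subset size of 40 units is an arbitrary value that satisfies the bounds above for three tasks and a 90-unit last hidden layer.

```python
import numpy as np

def local_connection_mask(n_hidden, n_tasks, subset_size):
    """Return an (n_hidden, n_tasks) 0/1 mask; column k marks the hidden
    units wired to output unit k. Windows are contiguous and evenly
    spaced so that adjacent tasks overlap (an assumed layout)."""
    if not n_hidden / n_tasks < subset_size < n_hidden / (n_tasks - 1):
        raise ValueError("need H/N < subset_size < H/(N-1)")
    mask = np.zeros((n_hidden, n_tasks))
    stride = (n_hidden - subset_size) / (n_tasks - 1)
    for k in range(n_tasks):
        start = int(round(k * stride))
        mask[start:start + subset_size, k] = 1.0
    return mask

def masked_output(h_last, weights, bias, mask):
    """Sigmoid output layer in which masked-out weights contribute nothing."""
    z = h_last @ (weights * mask) + bias
    return 1.0 / (1.0 + np.exp(-z))

# Three tasks on a 90-unit last hidden layer; 40 units per output is assumed.
mask = local_connection_mask(n_hidden=90, n_tasks=3, subset_size=40)
connections_per_unit = mask.sum(axis=1)
shared_units = int((connections_per_unit > 1).sum())
unique_per_task = (mask * (connections_per_unit == 1)[:, None]).sum(axis=0)
```

During fine-tuning, multiplying the weight gradient by the same mask would keep the removed connections at zero, so each output unit keeps some hidden units to itself while sharing others with its neighbor.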
The MTL-DBN-DNN model is learned with unsupervised DBN pretraining followed by backpropagation fine-tuning. The architecture of the MTL-DBN-DNN model is shown in Figure 3.
Remark. First, pretraining and fine-tuning ensure that the information in the weights comes from modeling the input data [32]. In other words, the network memorizes the information of the training data via the weights. The network needs to learn not only the commonalities of multiple tasks but also their differences. A locally connected network allows a subset of hidden units to be unique to one of the tasks, and unique units can better model the task-specific information. Therefore, fully connected networks do not learn the information contained in the training data of multiple tasks better than locally connected networks. Second, fully connected networks need to juggle (i.e., balance) the learning of each task while being trained, so that the trained networks cannot reach optimal prediction accuracy for each task. For these two reasons, the last (fully connected) layer is replaced by a locally connected layer, and each unit in the output layer is connected to only a subset of units in the previous layer. Two adjacent subsets share a specified number of common units.
Input. As long as a feature is statistically relevant to one of the tasks, the feature is used as an input variable to the model.
When the MTL-DBN-DNN model is used for time series forecasting, the parameters of the model can be dynamically adjusted according to the recent monitoring data taken by the sliding window to achieve online forecasting.
The Setting of the Structures and Parameters. The architecture and parameters of the MTL-DBN-DNN can be set according to the practical guide for training RBMs in technical report [33].

Data Set.
In this study, we used a data set that was collected in the Urban Air project (Urban Computing Team, Microsoft Research) over a period of one year (from 1 May 2014 to 30 April 2015) [34]. There are missing values in the data, so the data was preprocessed in this study. We chose the Dongcheng Dongsi air-quality-monitor-station, located in Beijing, as the target station. The hourly concentrations of PM2.5, NO2, and SO2 at the station were predicted 12 hours in advance.

Feature Set.
According to some research results, we let the factors that may be relevant to the concentration forecasting of three kinds of air pollutants make up a set of candidate features.
Traffic emission is one of the sources of air pollutants. The traffic flow on weekdays differs from that on weekends. During the morning peak hours and the afternoon rush hours, traffic density is notably increased. In this paper, the hour of day and the day of week were used to represent the traffic flow data, which is not easy to obtain.
Anthropogenic activities that lead to air pollution differ at different times of the year. The day of year (DAY) [3] was used as a representation of the different times of a year; it is calculated from h and T, where h represents the ordinal number of the day in the year and T is the number of days in that year.
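The calendar-based inputs described above can be derived directly from a timestamp. The sketch below is a hypothetical helper (`calendar_features` is not a name from the paper) that returns the hour of day, the day of week, and the two quantities the DAY feature depends on, namely the ordinal day h and the year length T. It deliberately stops short of the DAY formula itself, which is not reproduced in this text.

```python
import calendar
from datetime import datetime

def calendar_features(ts):
    """Extract time-based inputs from a datetime.

    day_of_week follows Python's convention (Monday = 0, Sunday = 6);
    the encoding used in the original study is not specified here.
    """
    days_in_year = 366 if calendar.isleap(ts.year) else 365
    return {
        "hour_of_day": ts.hour,
        "day_of_week": ts.weekday(),
        "day_ordinal": ts.timetuple().tm_yday,  # h: ordinal day in the year
        "days_in_year": days_in_year,           # T: number of days in the year
    }

# Example timestamp: 07:00 on 30 November 2014.
feats = calendar_features(datetime(2014, 11, 30, 7))
```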
Regional transport of atmospheric pollutants may be an important factor that affects the concentrations of air pollutants. Three transport corridors are tracked by 24 h backward trajectories of air masses in the Jing-Jin-Ji area [3,35], and they are presented in Figure 4. According to the current wind direction and the transport corridors of air masses, we selected a nearby city located in the upwind direction of Beijing. Then we used the monitoring data of the concentrations of six kinds of air pollutants from a station located in that city to represent the current pollutant concentrations of the selected nearby city.
Candidate features include meteorological data from the target station whose three kinds of air pollutant concentrations will be predicted (including weather, temperature, pressure, humidity, wind speed, and wind direction), the concentrations of six kinds of air pollutants at the present moment from the target station and the selected nearby city (including PM2.5, PM10, SO2, NO2, CO, and O3), the hour of day, the day of week, and the day of year. Weather has 17 different conditions: sunny, cloudy, overcast, rainy, sprinkle, moderate rain, heavy rain, rain storm, thunder storm, freezing rain, snowy, light snow, moderate snow, heavy snow, foggy, sand storm, and dusty. All feature numbers are presented in Table 1.

Table 1, features 13 to 21 (number, description):
13 The current PM2.5 concentration of the selected nearby station (μg/m3)
14 The current PM10 concentration of the selected nearby station (μg/m3)
15 The current NO2 concentration of the selected nearby station (μg/m3)
16 The current CO concentration of the selected nearby station (mg/m3)
17 The current O3 concentration of the selected nearby station (μg/m3)
18 The current SO2 concentration of the selected nearby station (μg/m3)
19 The day of year
20 The day of week
21 The hour of day

Evaluation Metrics.
In this study, four performance indicators, namely mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and accuracy (Acc) [34], were used to assess the performance of the models. In their definitions, N is the number of time points, and O_i and P_i represent the observed and predicted values, respectively.
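As a hedged illustration of these indicators, the sketch below uses the textbook definitions of MAE, RMSE, and MAPE. The form of Acc, one minus the summed absolute error divided by the summed observations, is an assumption about the definition in [34] and should be checked against that reference. MAPE is returned as a fraction rather than a percentage, and the helper name `evaluate` is hypothetical.

```python
import numpy as np

def evaluate(observed, predicted):
    """Compute MAE, RMSE, MAPE, and an assumed form of Acc.

    observed  : O_i, the measured concentrations
    predicted : P_i, the model outputs for the same time points
    """
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    abs_err = np.abs(o - p)
    return {
        "MAE": abs_err.mean(),
        "RMSE": np.sqrt(((o - p) ** 2).mean()),
        "MAPE": (abs_err / o).mean(),           # undefined if any O_i is zero
        "Acc": 1.0 - abs_err.sum() / o.sum(),   # assumed definition, see lead-in
    }

# Toy usage with made-up numbers (not results from the paper).
scores = evaluate([10.0, 20.0, 40.0], [12.0, 18.0, 44.0])
```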

Experiment Setup.
A new data element arrives each hour. Each data element, together with the features that determine the element, constitutes a training sample (x, y1, y2, y3), where y1, y2, and y3 represent the PM2.5, NO2, and SO2 concentrations, respectively, and x is a set of features made up of the factors that may be relevant to the concentration forecasting of the three kinds of pollutants.

Setting the Parameters of Sliding Window (Window Size, Step Size, Horizon). In the study, the concentrations of PM2.5, NO2, and SO2 were predicted 12 hours in advance, so the horizon was set to 12. The window size was equal to 1220; that is, the sliding window always contained 1220 elements. The step size was set to 1. After the current concentration was monitored, the sliding window moved one step forward, the prediction model was trained with the 1220 training samples corresponding to the elements contained in the sliding window, and then the trained model was used to predict the responses of the target instances.
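The window bookkeeping can be sketched as an index generator. This is an interpretation rather than the authors' code: it assumes that each hourly element is paired with the features observed `horizon` hours earlier, and the name `sliding_windows` and its return layout are hypothetical. The default arguments are the window size, step size, and horizon stated above.

```python
def sliding_windows(n_obs, window_size=1220, step=1, horizon=12):
    """Yield (feature_idx, target_idx, query_idx) for each window position.

    target_idx  : the `window_size` most recent hourly elements (labels).
    feature_idx : the hours whose features are paired with those labels,
                  i.e. each label's hour minus `horizon` (assumed pairing).
    query_idx   : the current hour t; its features are used to predict
                  the concentrations at hour t + horizon.
    """
    first_t = window_size + horizon - 1  # earliest hour with a full window
    for t in range(first_t, n_obs, step):
        target_idx = range(t - window_size + 1, t + 1)
        feature_idx = range(t - window_size + 1 - horizon, t + 1 - horizon)
        yield feature_idx, target_idx, t

# Toy usage with a tiny window so the indices are easy to read.
windows = list(sliding_windows(n_obs=10, window_size=3, step=1, horizon=2))
first_features, first_targets, first_query = windows[0]
```

At each position the model would be refit on the rows selected by `feature_idx` and `target_idx` and then queried with the features at `query_idx`, which is what makes the forecast online rather than static.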
Selecting Features Relevant to Each Task. The experimental procedures are as follows.
(1) After the continuous variables were discretized, the features were evaluated and sorted for each task according to the minimal-redundancy-maximal-relevance (mRMR) criterion. First, the continuous variables were discretized, and the discretized response variable became a class label with numerical significance. In this paper, continuous variables were divided into 20 levels. MIToolbox, a mutual information package by Adam Pocock, was used to evaluate the importance of the features according to the mRMR criterion.
(2) The dataset was divided into a training set and a test set. For each task, we used random forest to test the feature subsets from top-1 to top-n according to the feature importance ranking and then selected the first n features corresponding to the minimum value of the MAE as the optimal feature subset. The curves of MAE are depicted in Figure 5. Table 2 shows the selected features relevant to each task.
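The ranking step can be illustrated with a small NumPy stand-in. It does not use MIToolbox, and two details are assumptions rather than facts from the paper: the equal-width binning rule and the difference form of mRMR (relevance minus mean redundancy). All function names are hypothetical.

```python
import numpy as np

def discretize(x, levels=20):
    """Equal-width binning into integer codes 0..levels-1 (assumed rule)."""
    edges = np.linspace(x.min(), x.max(), levels + 1)[1:-1]
    return np.digitize(x, edges)

def mutual_info(a, b):
    """Mutual information (in nats) between two integer-coded variables."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)       # joint histogram
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def mrmr_rank(X, y, levels=20):
    """Greedy mRMR ordering of the columns of X with respect to y."""
    n_feat = X.shape[1]
    Xd = np.column_stack([discretize(X[:, j], levels) for j in range(n_feat)])
    yd = discretize(y, levels)
    relevance = np.array([mutual_info(Xd[:, j], yd) for j in range(n_feat)])
    selected = [int(relevance.argmax())]
    remaining = [j for j in range(n_feat) if j not in selected]
    while remaining:
        scores = [
            relevance[j]
            - np.mean([mutual_info(Xd[:, j], Xd[:, s]) for s in selected])
            for j in remaining
        ]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: column 0 equals the response, column 1 is noise,
# column 2 is a noisy copy of the response (synthetic data).
rng = np.random.default_rng(0)
y_toy = rng.normal(size=500)
X_toy = np.column_stack([y_toy, rng.normal(size=500),
                         y_toy + 0.5 * rng.normal(size=500)])
ranking = mrmr_rank(X_toy, y_toy)
```

Given such a ranking, step (2) would fit a regressor on the top-1, top-2, and so on up to top-n features and keep the prefix with the lowest test MAE.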
In order to verify whether multitask learning and online forecasting can each improve the DBN-DNN forecasting accuracy, and to assess the capability of the proposed MTL-DBN-DNN to predict air pollutant concentrations, we compared the proposed model (1) with four baseline models (2-5): (1) DBN-DNN model with multitask learning using the online forecasting method (OL-MTL-DBN-DNN); (2) DBN-DNN model using the online forecasting method (OL-DBN-DNN); (3) DBN-DNN model; (4) Air-Quality-Prediction-Hackathon-Winning-Model (Winning-Model); (5) FFA model [34].
For the single-task prediction model, the input of the model is the set of selected features relevant to that single task. For the multitask prediction model, as long as a feature is relevant to one of the tasks, the feature is used as an input variable to the model.
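In other words, the multitask input is the union of the per-task feature subsets. A minimal sketch follows; the feature numbers are placeholders chosen for illustration and are not the contents of Table 2, and `union_of_task_features` is a hypothetical helper.

```python
def union_of_task_features(task_features):
    """Return the sorted union of the per-task feature indices."""
    merged = set()
    for features in task_features.values():
        merged.update(features)
    return sorted(merged)

# Hypothetical per-task subsets (NOT the selections reported in Table 2).
task_features = {
    "PM2.5": [1, 3, 13, 19],
    "NO2": [2, 3, 15, 21],
    "SO2": [3, 18, 19, 20],
}
multitask_inputs = union_of_task_features(task_features)
```

The length of this union plays the role of G, the number of input variables of the multitask network.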

Remark.
For the first two models (MTL-DBN-DNN and DBN-DNN), we used the online forecasting method. To distinguish them from static forecasting models, the models using the online forecasting method are denoted by OL-MTL-DBN-DNN and OL-DBN-DNN, respectively.
For the first three models above, we used the same DBN architecture and parameters. According to the practical guide for training RBMs in technical report [33] and the dataset used in the study, we set the architecture and parameters of the deep neural network as follows. The deep neural network consisted of a DBN with layers of size G-100-100-100-90 and a top output layer, where G is the number of input variables. The DBN was constructed by stacking four RBMs, and a Gaussian-Bernoulli RBM was used as the first layer. In the pretraining stage, the learning rate was set to 0.00001, and the number of training epochs was set to 50. In the fine-tuning stage, we used 10 iterations, and grid search was used to find a suitable learning rate. For the OL-MTL-DBN-DNN model, the output layer contained three units and simultaneously output the predicted concentrations of three kinds of pollutants. Each unit in the output layer was connected to only a subset of units in the last hidden layer of the DBN.
For Winning-Model, time back was set to 4. Since the dataset used in this study was released by the authors of [34], the experimental results given in the original paper for the FFA model were quoted for comparison.
Because the first two models above use the online forecasting method, the training set changes over time. For the sake of fair comparison, we selected the original 1220 elements contained in the window before the sliding window begins to slide forward and used the samples corresponding to these elements as the training samples of the static prediction models (DBN-DNN and Winning-Model). The four models were used to predict the concentrations of three kinds of pollutants in the same period. The experimental results of hourly concentration forecasting for a 12-h horizon are shown in Table 3, where the best results are marked in italics.

Figure 6: The prediction performances of different models for a 12-h horizon. In the pictures, time is measured along the horizontal axis and the concentrations of three kinds of air pollutants (PM2.5, NO2, SO2) are measured along the vertical axis.

Results and Discussions.
Table 3 shows that the best results are obtained by using the OL-MTL-DBN-DNN method for concentration forecasting. The three error criteria (MAE, RMSE, and MAPE) of OL-MTL-DBN-DNN are lower than those of the baseline models, and its accuracy is significantly higher than that of the baseline models. The prediction performance of OL-DBN-DNN is better than that of DBN-DNN, which shows that the online forecasting method can improve the prediction performance. The performance of OL-MTL-DBN-DNN surpasses that of OL-DBN-DNN, which shows that multitask learning is an effective approach to improving the forecasting accuracy of air pollutant concentrations and demonstrates that it is necessary to share the information contained in the training data of the three prediction tasks. It is worth mentioning that learning tasks in parallel to get the forecast results is more efficient than training a model separately for each task.
The experimental results show that the OL-MTL-DBN-DNN model proposed in this paper achieves better prediction performance than the Air-Quality-Prediction-Hackathon-Winning-Model and the FFA model, and the prediction accuracy is greatly improved. When the prediction horizon is set to 12 hours, some prediction results of three models are presented in Figure 6.
Figure 6 shows that the predicted concentrations and the observed concentrations match very well when OL-MTL-DBN-DNN is used. The advantage of OL-MTL-DBN-DNN is more obvious when it is used to predict sudden changes and high peaks of concentrations.

Conclusion
In this paper, a deep neural network model with multitask learning (MTL-DBN-DNN), pretrained by a deep belief network (DBN), is proposed for forecasting of nonlinear systems and tested on the forecast of air quality time series.
The MTL-DBN-DNN model can fulfill several prediction tasks at the same time by using shared information. In the model, each unit in the output layer is connected to only a subset of units in the last hidden layer of the DBN. Two adjacent subsets share a specified number of common units. Such a connection avoids the problem that a fully connected network must juggle the learning of every task while being trained, so that the trained network cannot reach optimal prediction accuracy for each task. The locally connected architecture can learn both the commonalities and the differences of multiple tasks. PM2.5, SO2, and NO2 are chemically related and show almost the same concentration trend, so we apply the proposed model to the case study on the concentration forecasting of these three kinds of air pollutants 12 hours in advance. Comparison with multiple baseline models shows that our MTL-DBN-DNN model has a stronger capability of predicting air pollutant concentrations. Therefore, by combining the advantages of deep learning, multitask learning, and online forecasting, the MTL-DBN-DNN model is able to provide accurate real-time concentration predictions of air pollutants.

Figure 1: The observed data from 7:00 on November 30, 2014, to 22:00 on January 10, 2015. In the figure, time is measured along the horizontal axis and the concentrations of three kinds of air pollutants (PM2.5, NO2, and SO2) are measured along the vertical axis. There are some missing values in the data sets. Dongcheng Dongsi is the target air-quality-monitor-station selected in this study.

Figure 3: The schematic representation of the DBN-DNN model with multitask learning.

Figure 5: MAE vs. different numbers of selected features on three tasks.

When predicting PM2.5 concentrations, compared with Winning-Model, the MAE and RMSE of OL-MTL-DBN-DNN are reduced by about 5.11 and 4.34, respectively, and the accuracy of OL-MTL-DBN-DNN is improved by about 13%. These positive results demonstrate that our MTL-DBN-DNN model is promising in real-time air pollutant concentration forecasting.

Table 1: The 21 elements in the candidate feature set.

Table 2: Selected features relevant to each task.

Table 3: Comparison among different models.