Hierarchical Bayesian Spatio-Temporal Modeling for PM 10 Prediction

. Over the past few years, hierarchical Bayesian models have been extensively used for modeling the joint spatial and temporal dependence of big spatio-temporal data which commonly involves a large number of missing observations. This article represented, assessed, and compared some recently proposed Bayesian and non-Bayesian models for predicting the daily average particulate matter with a diameter of less than 10 (PM 10 ) measured in Qatar during the years 2016 – 2019. The disaggregating technique with a Markov chain Monte Carlo method with Gibbs sampler are used to handle the missing data. Based on the obtained results, we conclude that the Gaussian predictive processes with autoregressive terms of the latent underlying space-time process model is the best, compared with the Bayesian Gaussian processes and non-Bayesian generalized additive models.


Introduction
Many of the environmental data contain different scales of variability over space and time. For example, scientists from environmental and public health sciences are typically interested in modeling the evolving process of the air pollution during the time over specified locations. Such a stochastic process is often high-dimensional, large, and complicated with nonstationary structures, so the traditional statistical methods are hampered by the need of advanced statistical techniques to specify the spatio-temporal dependency. This can be quite practical with modern computers with highlevel computational programming. The spatio-temporal modeling of PM 10 and PM 2:5 (particulate matter with diameters of less than 10 and 2.5 micrometers, respectively) is rapidly becoming an important component of most air quality studies [1][2][3]. Particulate matter (also called particle pollution) is a mixture of solid particles and liquid droplets found in the air as a result of dust, soot, dirt, smoke caused by road transportation, and complex chemical reactions in the atmosphere such as sulfur dioxide and nitrogen oxides. Exposure to particle pollution is a public health hazard and can cause acute and chronic heart and lung diseases [4]. The larger the values of particulate matters (PM) are, the more harm on short and long terms of public health is. World Health Organization's (WHO) air quality guidelines recommend that the annual and 24-hour mean concentrations should not, respectively, exceed 20 and 50 microgram per cubic metre (μg/m 3 ) for PM 10 , and 10 and 25 μg/m 3 for PM 2.5 . Countries with fast-developed infrastructure such as Qatar usually suffer from relatively high levels of PM air pollutant. In this regard, the WHO classified the air quality in Qatar as poor and unsafe. In this paper, we focus our attention on modeling the PM 10 in Qatar because the PM 2.5 data is inaccessible. The most recent data indicates that country's annual average 24 hr PM 10 concentration levels were ranged from 126.69 μg/m 3 to 184.55 μg/m 3 , which exceeds the recommended maximum of 50 μg/m 3 [4][5][6].
Several authors have developed spatio-temporal models for analyzing the ambient of air pollution. Research in this field is back in dates to Cressie [7] and Goodall and Mardia [8]. Cressie et al. [9] compared the performance of Markovrandom field with the familiar geestatistical approach in prediction the PM 10 concentrations around the city of Pittsburgh, United States of America. The authors did not model the joint temporal and spatial structure of observations. They first modeled the purely spatial structure of observations for one particular day, then they incorporated the temporal component in their final statistical modeling. Sun et al. [10] developed a spatial forecasting distribution for unmeasured daily log PM 10 average concentration given from ten locations in Vancouver, Canada. At each monitoring site, Sun et al. [10] showed that the autoregressive model of order one (AR(1)) described quite well the daily log PM 10 average values. The authors did not consider the Bayesian models in their approach. Golam Kibria et al. [11] proposed a multivariate spatial prediction methodology in a Bayesian approach that is suited for spatio-temporal data observed at a small set of ambient monitoring stations at successive time points. They demonstrated the usefulness of their approach by mapping PM 2.5 at monitoring sites with different start-up times in the city of Philadelphia, USA. Golam Kibria et al. [11] did not compare the performance of their model with other non-Bayesian models that are commonly used in literature. Shaddick and Wakefield [12] and Sahu and Mardia [13] used short-term spatio-temporal predictive analysis for modeling the PM 2.5 and PM 10 concentration levels. Zidek et al. [3] presented predictive distributions on nonmonitored PM 10 values in Vancouver, Canada. Smith et al. [14] proposed a spatio-temporal model to predict the weekly averages of particulate matter concentrations within three southeastern states in the USA. Sahu et al. [15] modeled the PM 2.5 by combining the rural background and the urban areas into one process. Cocchi et al. [1] developed a hierarchical Bayesian model for the daily average PM 10 values. Pollice and Lasinio [2] developed a Bayesian-based kriging method for estimating the daily average PM 10 concentration levels. Wikle et al. [16] provided an excellent review of classical and Bayesian approaches for analyzing spatio-temporal data. Although Taylor et al. [6], Ahmadi et al. [5], and others have studied the relationship between some environmental conditions and particulate matter levels assessing the air quality based on particulate matter levels in different locations in Qatar, they did not study the spatiotemporal variability of PM 10 . The main objective of this research is to develop space-time models for daily PM 10 air pollution levels in Qatar for the four years, 2016-2019 comparing the hierarchical Bayesian approach with other spatiotemporal recent methods. We develop spatial interpolation and forecasting model using iterative Markov chain Monte Carlo (MCMC) computation setup which is an effective method for modeling a data with large number of missing values [17]. To the best of our knowledge, this will be the first study in Qatar, and we hope that this research will be helpful to protect the environment and public health in Qatar.
The rest of this article is organized as follows: Section 2 provides a brief review of two-stage hierarchical Bayesian models that have been used for modeling spatio-temporal data. A numerical example is given in Section 3 to demonstrate that the Bayesian approach accurately predicts the daily average PM 10 values with a large number of missing values comparing with non-Bayesian models. Finally, a conclusion is given in Section 4.

Hierarchical Spatio-Temporal Model
When data is collected at different points in space and time, we should use a model that can, simultaneously, describe the dependency structure coming from the three sources of variations: time variation, space variation, and joint variability between time and space. Such a model is called a spacetime model (or spatio-temporal model, where spatio refers to space and temporal refers to time).
In this article, we develop hierarchical models to predict the daily PM 10 concentration levels which vary over time and locations. The PM 10 concentration levels do not often follow the normal distribution. Thus, we usually model these values on the square-root scale or we use the logtransformation to stabilize the variance and enforce normality and to stabilize the variance [18]. We consider the squareroot scale to alleviate the departure from normality in our research data.
Let ℓ and t denote the two units of time where ℓ = 1, ⋯, r represents the longer unit (e.g., year), and t = 1, ⋯, T ℓ represents the shorter unit (e.g., day). Let Z ℓ ðs, tÞ denote the observed value of the PM 10 concentration, after any necessary transformation, at a given location s and over a given discrete time t. We assume that the spatial location s is a two-dimensional vector describing the latitude-longitude (or equivalently northing and easting coordinates), and the time unit is typically hour, day, month, or year. We also assume that the Z ℓ ðs, tÞ is observed at n monitoring sites denoted by s i , i = 1, ⋯, n and at time points denoted by two indices ℓ and t so that the total number of observations is denoted by N = n∑ r ℓ=1 T ℓ . In this article, we denote all the missing data by z ⋆ , whereas all the observed data will be denoted by z.
The first stage of the hierarchy assumes that the observed values Z ℓt , where Z ℓt = ðZ ℓ ðs 1 , tÞ,⋯,Z ℓ ðs n , tÞÞ′, can be decomposed into a true (latent) spatio-temporal process Y ℓt = ðY ℓ ðs 1 , tÞ,⋯,Y ℓ ðs n , tÞÞ ′ with an error term ε ℓt = ðε ℓ ðs 1 , tÞ,⋯,ε ℓ ðs n , tÞÞ ′ . More specifically, the data (or measurement error) model in the first stage of the hierarchy is The error term ε ℓt is assumed to be a Gaussian white noise process with mean zero and constant variance σ 2 ε , which is often called the nugget effect absorbing microscale variability.
The second stage assumes that the true process Y ℓt has a systematic mean μ ℓt and a spatio-temporal error term. The mean can be modeled based on the past values of the unobserved variable or/and based on some relevant covariates. Typically, Y ℓt can be specified in the following formula: where η ℓt = ðη ℓ ðs 1 , tÞ,⋯,η ℓ ðs n , tÞÞ′ is an spatio-temporal residual random intercept assumed to follow N ð0, CÞ, where C = σ 2 η H η , σ 2 η is the site invariant spatial variance, and H η is the spatial correlation matrix. In this article, we consider 2 Journal of Applied Mathematics four modelings for fη ℓt g. The first one assumes that the random effects, η ℓ ðs i , tÞ = 0, for all locations s i and times t. This implies that the model in (1)-(2) will be the simple regression model. The other models for fη ℓt g are assumed to be Gaussian process, independent over the time, which is specified in Sections 2.1, 2.2, 2.3, and 2.4.

Matérn Spatio-Temporal Covariance Function.
For spatio-temporal modeling, we usually assume that the random effects process is a weakly stationary Gaussian with a zero mean and a valid isotropic covariance function. A valid covariance function implied that the covariance matrix is positive definite, and isotropic means that the separation vector between the two locations only depends on the distance and not on the direction. The class of the spatiotemporal covariance functions can be separable, productsum, metric, and sum-metric [19]. In this paper, we use the separable covariance model which is simply the product of the pure spatial covariance function, C s ðhÞ, by the pure temporal one, C t ðuÞ, given by where h = ks − s ′ k is the separating spatial distance, and u = jt − t′j is the temporal distance for any pair of points ðs, tÞ × ðs′, t′Þ in the spatial and temporal study domain. The Matérn family provides a general choice of covariance functions. For each time t, the Matérn covariance is given by: where K κ ð·Þ is the modified Bessel function of second kind of order κ which is a parameter controlling the smoothness of the realized random field [20], ΓðκÞ is the standard gamma function, and ϕ is a parameter which controls the decay rate in the correlation as the distance h increases. Popular special cases of Matérn model are (i) κ = 0:5 leads to exponential covariance function CðhÞ = σ 2 η exp ð−ϕhÞ and (ii) Gaussian model, (2), where ℓ = 1, ⋯, r, t = 1, ⋯, T ℓ , X ℓt is the n × ðk + 1Þ design matrix of spatially and temporally varying k -covariates and β = ðβ 0 , β 1 ,⋯,β k Þ′ is the k + 1-dimensional vector of regression coefficients. Thus, the Gaussian process (GP) two-stage model can be written as We assume that the random effect process fηð·Þg is independent from the white noise process fεð·Þg. Note that some of the covariates may vary spatially and not temporally or vice versa.
One advantage of using Bayesian model is that we can use it to handle any missing data. This can be done by using the Markov chain Monte Carlo, MCMC, computation where the missing data is simulated from the N ðY ℓt , σ 2 ε Þ distribution defined by (5) at each MCMC iteration using Gibbs sampling. The Gibbs sampler requires that the full conditional distributions of the parameters θ = ðβ, σ 2 ε , σ 2 η , ϕ, κÞ are given in a closed form.
The logarithm of the joint posterior distribution of the missing data and the parameters in this case is given by: where pðθÞ is the prior distribution [21] and we refer the readers to Bakar and Sahu ([21], Appendix A) to obtain the Gibbs sampling using the full conditional distributions of θ. (1) and (2) can be written in the hierarchical form (e.g., [18] model) as follows:

Autoregressive Model Specification. The autoregressive process (AR) model in Equations
where −1 < ρ < 1 is parameter of the first-order autoregressive model and μ ℓ = ρY ℓt−1 + X ℓt β. Note that if ρ = 0, we get the GP model given by (5). Also, note that the autoregressive model requires to specify an independent spatial model with initial values of Y ℓ0 for each ℓ = 1, ⋯, r, with mean μ ℓ and covariance σ 2 ℓ H 0 obtained from the Matérn covariance function in (4) with the same set of parameters. In this case, logarithm of the joint posterior distribution of the missing data and the parameters in this case is given by: 3

Journal of Applied Mathematics
where C ⋆ ðϕ, κÞ is the n × m crosscorrelation matrix between η t and η ⋆ t having elements ½C ⋆ ij = Cðks i − s ⋆ j kÞ, i = 1, ⋯, n and j = 1, ⋯, m, and H η ⋆ ðϕ, κÞ is the m × m correlation matrix of η ⋆ t so that ½H η ⋆ ij = ½Cðks ⋆ k − s ⋆ j k ; ϕ, κÞ, for k, j = 1, ⋯, m. Clearly, the process fη ℓt g shows nonstationary structure with variance function that is given by The advantage of using the nonstationary model in (10) is the flexibility in theη t surface which is based on m ≪ n linear functions of η ⋆ t . When m is very small compared with n, this will lead to reduce the computational burden, especially for big data which is usually the case of the spatio-temporal data. Moreover, using nonstationary models usually provides more accurate results in prediction the nonstationary PM 10 process. We specify the hierarchical Gaussian predictive processes (GPP) as follows: whereη ℓt is given in (10). The process fη ⋆ ℓt g, at the S ⋆ m knots, can be modeled according to the autoregressive model where η ⋆ ℓt~N m ð0, Σ η ⋆ Þ for ℓ = 1, ⋯, r, t = 1, ⋯, T ℓ and Σ η ⋆ = σ 2 η H η ⋆ . We assume that the initial conditions η ⋆ ℓ0 has normal distribution with mean zero and covariance matrix Σ 0 = σ 2 ℓ H 0 , and both Σ η ⋆ , Σ 0 can be obtained from the Matérn covariance function defined in (4), where Σ η ⋆ is an m × m matrix much lower dimensional than Σ η of dimension n × n.
In this case, logarithm of the joint posterior distribution of the missing data and the parameters in this case is given by: where pðθÞ is the prior distribution for the parameter θ = ðβ, ρ, σ 2 ε , σ 2 η ⋆ , ϕ, κ, σ 2 ℓ Þ and the Gibbs sampling procedure of these parameters can be obtained from the full conditional distributions required provided in the Appendix.

Numerical Example
3.1. Data Description. The data used in this article is obtained from three different sources. The first one was collected by the air pointer and the meteorological station and is managed by the Environmental Science Center at Qatar University over the years 2016-2019. It has hourly pollutant including PM 10 measured in μg/m 3 , temperature, Temp, measured in degree Celsius, and relative humidity, RH, with several missing values.
First, we sort the data into an ascending order by date, and then, we impute the missing values of the PM 10 by averaging the two nonmissing values before and after this missing value. When two or more successive missing values exist, we impute them by the corresponding monthly average. We do the same for the hourly missing values of temperature and relative humidity. After that, we aggregate the hourly data into daily data for which we fit our models. The amended data from this resource is irregular time series which started on the fourth of January 2016 and ended on the twenty-eighth of August 2019. Table 1 provides a summary statistics of this data. Results suggest that the PM 10 levels increased during the time, and the majority of missing observations (with more than 52%) have been occurred during the year 2018.
The second source of the data was the regular monthly observations obtained from nine meteorological stations over the years 2016-2019. The total number of observations is 432 where the stations are located in the following sites: Qatar University, Abu Samra, Al Khor, Al Ruwais, Al Wakrah, Doha Airport, Dukhan, Mukenis-Al Karanaah, and Umm Said. Figure 1 shows the map locations of 9 meteorological sites. Although this data provides the monthly average values of temperature in degree Celsius and relative humidity (RH), it does not include measures for any type of pollutant. Table 2 summarizes the descriptive statistics of this data. We use this data to interpolate the daily temperature and relative humidity by disaggregating technique at each location. We also use our own portable devices to collect daily PM 10 values, as a third source of data, over the period of time from November 5th to December 31th of 2020 close to these sites. We use these devices to collect and simulate more daily PM 10 concentrations over the aforementioned locations. In particular, we recorded the average of the PM 10 taken randomly in the morning and evening of each day over this period. Then, we use this collected data to simulate a new data in each location. Finally, we merge the simulated data with the previous one and obtained the data that we use in this article.
The final data has 1037 observations (see Figure 2) which has three variables (PM 10 , temperature, and relative humidity) measured over irregular time over the years 2016-2020 in nine spatial locations.
The top panel in Figure 3 shows the distribution of the daily average PM 10 concentrations over the years 2016-2020, whereas the bottom one in the same figure shows these averages by the nine sites. Clearly that the distribution of the PM 10 concentrations, in all locations, are right skewed with very high extreme values. Thus, to stabilize the variance and reduce the departure from normality, we transform the original scale to square root scale.

Parameter Estimates, Model Validations, Prediction, and
Comparison Results. The main objective of this research is to develop a hierarchical Bayesian model that can be used to select and validate the best model which fits the daily average of the PM 10 air pollutant levels over different locations in Qatar. We consider the two covariates, temperature and relative humidity, for spatial interpolation of the PM 10 concentration at a new location and any time.
First, we use the MCMC algorithm with Gibbs sampler, based on 5,000 iterations discarding the first 1,000 values, to impute the missing data z ⋆ as explained in Appendix, and then, we utilize this data to predict the values of Zðs ′ , tÞ at new location s ′ ∉ fs 1 ,⋯,s n g. The posterior predictive distribution is where θ = ðρ, σ 2 ε , σ 2 η , ϕÞ ′ are the model parameters and η ⋆ = ðη ⋆ 1 ,⋯,η ⋆ n Þ ′ (see [17]). We fit the Bayesian GP, AR, and GPP models described in Section 2 using the spTimer package [21,23] and compare these models with the non-Bayesian generalized additive model (GAM) using the R package mgcv [24][25][26]. We use the crossvalidation method to evaluate the predictive performance of these models.
Here, data at locations ðs 1 ,⋯,s m Þ where m < n are used to fit the model while data at other sites ðs m+1 ,⋯,s n Þ are used     Journal of Applied Mathematics to assess the model using the mean squared error (MSE), mean absolute error (MAE), mean absolute prediction error (MAPE), relative bias (rBIAS), and relative mean separation (rMSEP) defined, respectively, as follows: where N is the total number of nonmissing observations,ẑ i is the posterior predictive value of z i , and z, z − z i are the arithmetic means of z i ,ẑ i for i = 1, ⋯, N. Specifically, we use the data from Abu-Samra and Doha airport stations with 2 × 1037 = 2074 observations for validation purposes. The data from the remaining six stations with 7 × 1037 = 7259 total number of observations are used for model fitting (see Figure 1). The parameter estimates (posterior) for the Bayesian spatio-temporal models are given in Table 3. The 95% credible intervals for the parameters suggest that most of regression parameters for the AR and GP models are statistically significant, whereas the GPP model suggest that the variables are significant. In all models, the nugget effect σ 2 ε is small. On the other hand, the spatial decay parameter ϕ = 0:0033 for GP model and ϕ = 0:001 for GPP model suggesting that the effective ranges will approximately be 909 and 3000 kilometers, respectively. These two values are unusual associated with very large spatial variance values (σ 2 η ) suggesting that the PM 10 concentrations in Qatar are not significantly different over the space.
After fitting the three models, we perform the crossvalidation and select the best one. Table 4 summarized the validation statistics for the GAM, GP, and AR models. Clearly, the MSE, MAE, MAPE, rBIAS, and rMSEP are smaller for the Bayesian spatio-temporal methods providing better predictive performance compared to the non-Bayesian additive model. For example, the Bayesian AP, GP, and GPP models reduced the MSE by about 63.7%, 66.4%, and 79.6%, respectively, compared with the non-Bayesian GAM model. We conclude from this table that the Bayesian spatio-temporal GPP model is the best model that can be used to predict the PM 10 concentrations. Figure 4 shows the time series plot of the true and fitted values for Qatar University location using the GPP model. We clearly see that the fitted values are very close to the true values. To further demonstrate the usefulness of the spatio-temporal GPP model for predicting the linear trend surfaces of the PM 10 values with their standard values, we illustrate a prediction map for the 29 th of March 2018 and 15 May of 2019 over a one-kilometer square grid (see Figure 5). Obviously, the graphs show that the GPP model correctly represented the PM 10 concentration level over the space and time.

Conclusion
The potential of applying nonstationary hierarchical Bayesian spatio-temporal models in PM 10 prediction with a large number of missing values is presented in this paper. The predicting model is developed by comparing the Gaussian predictive processes (GPP) with Gaussian processes (GP), autoregressive (AR) Bayesian models, and non-Bayesian generalized additive model (GAM) models using the datasets from the state of Qatar. The numerical results show that the GPP model outperforms other alternative models providing forecasting with good accuracy and interpretability. We applied the disaggregating technique and simulated a daily spatio-temporal PM 10 data using the available and collected data. Then, we used the Markov chain Monte Carlo with Gibbs sampler to impute the missing data in real collected data. We believe that our statistical data analysis approach will give similar results for future available real data. In many applications, the support vector machine (SVM) algorithm has shown a superior forecasting performance compared with several evolutionary algorithms. This could be an interesting possible extension to this article where further research can be done by comparing the performance of the SVM algorithms with Bayesian models in predicting the daily average PM 10 concentration levels.