Time-Series Regression Model for Prediction of Mean Daily Global Solar Radiation in Al-Ain, UAE

The availability of a short-term forecast weather model for a particular country or region is essential for operation planning of energy systems. This paper presents the first step by a group of researchers at UAE University to establish a weather model for the UAE using weather data for at least 10 years and employing various models such as classical empirical models, artificial neural network (ANN) models, and time-series regression models with autoregressive integrated moving-average (ARIMA). This work uses time-series regression with ARIMA modeling to establish a model for the mean daily and monthly global solar radiation (GSR) for the city of Al-Ain, United Arab Emirates. Time-series analysis of solar radiation has been shown to yield accurate average long-term prediction performance of solar radiation in Al-Ain. The model was built using data for 10 years (1995–2004) and was validated using data of three years (2005–2007), yielding coefficients of determination (R²) of 92.6% and 99.98% for mean daily and monthly GSR data, respectively. The low corresponding values of mean bias error (MBE), mean absolute bias error (MABE), mean absolute percentage error (MAPE), and root-mean-square error (RMSE) confirm the adequacy of the obtained model for long-term prediction of GSR data in Al-Ain, UAE.


Introduction
The Gulf Cooperation Council (GCC) countries, which include the United Arab Emirates (UAE), are seeking to make better use of their abundant sustainable energy sources, specifically solar energy. New sources of clean energy are essential to help reduce carbon dioxide levels, which are increasing at an alarming rate, in order to protect the endangered ozone layer and avoid future climate change. The potential of solar energy harvesting in the UAE is significant, with an average of 3568 annual sunshine hours (i.e., 9.7 h/day), which corresponds to an average annual solar radiation of approximately 2285 kWh/m², that is, 6.3 kWh/m² per day [1].
Many researchers worldwide have developed models to predict long-term daily and monthly average solar radiation in their region using different combinations of measured weather parameters. The availability of a solar radiation model in a particular region is very useful in estimating the amount of power that can be generated from a particular solar energy system. Short-term forecasts of solar radiation are also useful in planning the operation of energy systems as well as in agricultural studies, for example, for crop modeling and for hydrology, meteorology, and soil physics.
Techniques used for modeling the mean daily global solar radiation include empirical regression, neural network, and time-series ARIMA models. The use of neural network (NN) or hybrid regression-NN models was mainly due to the nonlinearity in weather data.
Numerous authors developed empirical regression models to predict the monthly average daily GSR in their region using various parameters. The mean daily sunshine duration was the most commonly used and available parameter. The most popular model was the linear model by Angström-Prescott [8,19], which establishes a linear relationship between global radiation and sunshine duration with knowledge of extraterrestrial solar radiation and the theoretical maximum daily solar hours. Many studies with empirical regression models were done for diverse regions around the world. Menges et al. [17] reviewed 50 global radiation empirical models available in the literature for computing the monthly average daily global radiation on a horizontal surface. They tested the models on data recorded in Konya, Turkey, for comparison of model accuracy. The number of weather parameters varied between models. The diverse regression models used include linear, logarithmic, quadratic, third-order polynomial, logarithmic-linear, exponential, and power models relating the normalized GSR to normalized sunshine hours. Other models included in the work of Menges et al. used direct regression models involving various weather parameters in addition to geographical data (altitude, latitude) and other weather parameters such as precipitation and cloud cover. Şahin [20] presented a novel method for estimating the solar irradiation and sunshine duration by incorporating the atmospheric effects due to extraterrestrial solar irradiation and length of day. In this work, the author compares the model with Angström's equation and reports advantages, as the method does not use the least-squares method and has no procedural restrictions or assumptions. Ulgen and Hepbasli [24] developed two empirical correlations to estimate the monthly average daily GSR on a horizontal surface for Izmir, Turkey. The developed models resemble Angström-type equations and were compared with 25 models previously reported in the literature on the basis of statistical error tests (MBE, RMSE, MPE, and R²), with favourable results.
The main limitation of using many weather parameters is the difficulty of obtaining the data, owing to the high expense and limited availability of recording equipment. Some of the regression models are more accurate for monthly data than for daily data, and most published work only shows monthly data comparison results because the daily mean data are less accurate.
Other authors worked on prediction models based on ANN techniques, most specifically multilayer perceptron (MLP) and radial-basis function (RBF) methods [25–37]. The advantage of the ANN models is their ability to handle large amounts of data as well as the ability to handle random data without worry of incomplete, inaccurate, or noise-contaminated data. Behrang et al. [28] compared MLP and RBF ANN techniques for daily GSR modeling based on 6 proposed combinations. Their MLP-ANN models outperformed the best conventional GSR prediction (CGSRP) model, yielding a 5.21% MAPE error compared to a 10.02% MAPE error for the best CGSRP model. Another study by Benghanem et al. [29] uses ANN models for estimating and modeling daily GSR data in Al-Madinah, Saudi Arabia. In this work, the input data included the mean daily air temperature, relative humidity, sunshine duration, and day of year. The authors compared the obtained models with conventional empirical regression models, showing promising improvements. Numerous other researchers have worked on modeling GSR in their region using ANN techniques. Mohandes et al. [33] have applied ANN techniques to predict the mean monthly GSR using weather data from 41 stations in Saudi Arabia. Data from 31 stations was used in training the network, and the remaining data was used for testing. Input variables to the network included latitude, longitude, altitude, and sunshine duration. Their RBF model is compared with the MLP ANN technique and other classical empirical models such as Angström-type equations. In this work, data samples were not large enough to obtain a credible comparison.
Another group of investigators [38–43] employed time-series regression modeling for the deterministic component and Box-Jenkins ARIMA modeling [44,45] for the stochastic residual component. Boland [38] and Zeroual et al. [39] use a combination of a regression model with Fourier series for the deterministic part and ARMA modeling for the residual stochastic part. Sulaiman et al. [40] and Zaharim et al. [41] use the ARMA Box-Jenkins method to model GSR data in Malaysia. The authors model the data using nonseasonal autoregressive models where the model adequacy is checked using the Ljung-Box statistic for diagnostic data. However, only short-term data was used for testing the model. Reikard [42] employs a combination of logarithmic regression and ARIMA modeling to predict solar radiation at high resolution, compares his models with other forecast methods such as ANN, and considers the 24-hour daily seasonality that is not taken into account in most modeling approaches.
This work uses classical empirical regression techniques and time-series techniques to predict the monthly average daily GSR data in Al-Ain, UAE. In this time-series regression technique, the deterministic component was modeled by decomposing it into a multiple linear regression component as a function of the available weather variables plus a cyclical component accounting for the annual periodicity and a linear trend. The residual error is studied using ARIMA modeling techniques until it resembles a quasinormal noise error with zero mean and constant variance. The stationary form of the resulting time series and the white-noise-type residual error provide extra confidence in the long-term prediction accuracy of the estimated model.
The authors are doing extensive work on the analysis of weather data in three UAE cities, namely, Al-Ain, Abu Dhabi, and Sharjah. The prediction techniques employed range from classical one-parameter-based regression techniques (sunshine data) and MLP and RBF ANN techniques to, lately, time-series regression techniques with ARIMA modeling. The current work on solar radiation data in Al-Ain City, UAE, will be correlated with other ongoing studies by the authors to come up with solar radiation prediction models for the UAE cities of Abu Dhabi and Sharjah. The final objective is to come up with a good national weather model capable of predicting the mean monthly GSR for the whole UAE within an acceptable prediction error.

Methodology
This paper discusses an ARIMA-based time-series regression technique used for predicting the monthly average GSR in the city of Al-Ain, UAE.
The weather database is provided by the National Center of Meteorology and Seismology (NCMS) in Abu Dhabi, UAE. The weather data provided for years 1995-2007 include the mean air temperature (T in °C), mean wind speed (W in knots), daily sunshine hours (SSH), and percent relative humidity (RH), in addition to the mean daily GSR in kWh/m². The data is divided into two groups: one data group from 1995-2004 for the prediction model and a second group (2005-2007) for testing and validating the resulting regression model. The data is first examined for any missing values and outliers. The missing values are replaced by the average of values from the same week. Long arrays of missing data (a month, for example) are replaced with the corresponding average of the same days over the remaining model years. The total number of missing data in the SSH or GSR columns does not exceed 3% of the total data points. Three leap-year days were also removed (February 29 for years 1996, 2000, and 2004) to ensure uniformity of month comparison over the ten-year model period.
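The data-cleaning steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: the record layout `(date, value)` with `None` marking a missing value, the use of ISO calendar weeks for "same week", and all function names are hypothetical.

```python
from datetime import date, timedelta


def drop_leap_days(records):
    """Remove February 29 entries so that every year has 365 days."""
    return [(d, v) for d, v in records if not (d.month == 2 and d.day == 29)]


def fill_short_gaps(records):
    """Replace a missing value (None) by the mean of the observed values in
    the same ISO week (assumed meaning of "same week"). A gap is left as
    None when the whole week is missing."""
    by_week = {}
    for d, v in records:
        if v is not None:
            key = tuple(d.isocalendar())[:2]  # (ISO year, ISO week)
            by_week.setdefault(key, []).append(v)
    filled = []
    for d, v in records:
        if v is None:
            week_values = by_week.get(tuple(d.isocalendar())[:2])
            if week_values:
                v = sum(week_values) / len(week_values)
        filled.append((d, v))
    return filled


def fill_long_gaps(records):
    """Replace remaining gaps by the mean of the same calendar day
    (month, day) taken over the years in which it was observed."""
    by_day = {}
    for d, v in records:
        if v is not None:
            by_day.setdefault((d.month, d.day), []).append(v)
    filled = []
    for d, v in records:
        if v is None:
            day_values = by_day.get((d.month, d.day))
            if day_values:
                v = sum(day_values) / len(day_values)
        filled.append((d, v))
    return filled


if __name__ == "__main__":
    # Made-up week of values around the 1996 leap day, one value missing.
    start = date(1996, 2, 26)
    week = [(start + timedelta(days=i), float(i)) for i in range(7)]
    week[2] = (week[2][0], None)
    for d, v in fill_long_gaps(fill_short_gaps(drop_leap_days(week))):
        print(d.isoformat(), v)
```

Applying the weekly fill before the long-gap fill mirrors the order in which the two rules are stated in the text; the source does not say how a gap spanning part of a week was treated.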
The time-series regression model to predict the mean daily global solar radiation (GSR) makes use of the four weather parameters (T, W, SSH, RH) measured daily over a period of 10 years, that is, 1995-2004. The highest mean monthly temperatures occur during the period of May-September, with monthly temperatures around 45 °C. The highest sunshine hours occur during the months of May and June, with an average of 11.5 sunshine hours daily. During these two months the mean daily GSR values are the highest at 8 kWh/m². March and August experience the highest wind speeds, while the humidity rates reach as high as 65% in December-January and as low as 30% during May.
The regression model relating the GSR data to the four predictor weather variables is obtained using the data of years 1995-2004. The partial least-squares multivariate regression technique is first used to correlate GSR with the four aforementioned predictor weather variables. The use of all four predictor variables yields the highest correlation, with a coefficient of determination R² = 0.782.
The model equation (1) is obtained using SPSS [46]. The regression model of (1) yields R² = 0.782 and MSE = 0.425. The next task is to examine the GSR residual component obtained after subtracting the regression component, that is, $\mathrm{GSR}_{\mathrm{residue1}}(t)$, for trends and/or seasonality and decompose it into the following:

\[ \mathrm{GSR}_{\mathrm{residue1}}(t) = \mathrm{GSR}_{\mathrm{trend}}(t) + \mathrm{GSR}_{\mathrm{seasonal}}(t) + \mathrm{GSR}_{\mathrm{residue2}}(t). \]

The trend equation is found by performing a linear polynomial regression fit. The mathematical model for the seasonal component, $\mathrm{GSR}_{\mathrm{seasonal}}(t)$, is found using the FFT algorithm (N = 3650 data points) and is expressed in Fourier series form, where the Fourier coefficients $a_k$ and $b_k$ are computed from the FFT of $\mathrm{GSR}_{\mathrm{seasonal}}(t)$, that is, y[k], k = 1, 2, ..., N, and where NP = 365 days is the period of the data set.
It is worth pointing out that nonzero values of the $a_k$ and $b_k$ coefficients exist only for k = 10, 20, 30, ..., 3650 (every tenth FFT bin, since N/NP = 10), thus yielding a total of 182 nonzero Fourier coefficients that can be used to model the seasonal component of the residual error $\mathrm{GSR}_{\mathrm{residue1}}(t)$.
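As an illustration of why only every tenth FFT bin can be nonzero, the following sketch computes the cosine and sine coefficients of the annual harmonics by direct projection rather than by an FFT. It is a stand-in under stated assumptions, not the authors' computation: the function names, the harmonic index m (harmonic m of the 365-day period corresponds to bin 10m of a 3650-point transform), and the synthetic test series are all hypothetical.

```python
import math


def annual_harmonics(y, period=365, n_harmonics=3):
    """Return [(a_m, b_m)] for m = 1..n_harmonics such that
    y[t] ~ mean + sum_m a_m*cos(2*pi*m*t/period) + b_m*sin(2*pi*m*t/period).

    Assumes len(y) is a whole number of periods and n_harmonics < period/2,
    so that the sampled sinusoids are mutually orthogonal."""
    n = len(y)
    if n % period != 0:
        raise ValueError("series length must be a multiple of the period")
    coeffs = []
    for m in range(1, n_harmonics + 1):
        w = 2.0 * math.pi * m / period
        a = 2.0 / n * sum(v * math.cos(w * t) for t, v in enumerate(y))
        b = 2.0 / n * sum(v * math.sin(w * t) for t, v in enumerate(y))
        coeffs.append((a, b))
    return coeffs


def fft_bin(m, n, period=365):
    """Index of the n-point transform bin that holds harmonic m of the period."""
    return m * n // period


if __name__ == "__main__":
    # Synthetic 10-year daily series with a known annual cycle (not GSR data).
    n = 3650
    y = [5.0 + 2.0 * math.cos(2 * math.pi * t / 365)
         - 0.5 * math.sin(2 * math.pi * 2 * t / 365) for t in range(n)]
    for m, (a, b) in enumerate(annual_harmonics(y), start=1):
        print(f"harmonic {m} (bin {fft_bin(m, n)}): a={a:.3f}, b={b:.3f}")
```

A series that repeats exactly every 365 samples has energy only at multiples of N/NP = 10 bins, which is the pattern reported in the text.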
The overall decomposition leads to the following regression model:

\[ \mathrm{GSR}(t) = \mathrm{GSR}_{\mathrm{regression}}(t) + \mathrm{GSR}_{\mathrm{trend}}(t) + \mathrm{GSR}_{\mathrm{seasonal}}(t) + \mathrm{GSR}_{\mathrm{residue2}}(t), \]

where the multivariable linear regression, trend, and seasonal components are described in (4), (6), and (7), respectively. The normality test of the residual term $\mathrm{GSR}_{\mathrm{residue2}}(t)$ in Minitab [47] yields a near-zero mean (mean = −1.013E−06) and a standard deviation of 0.393. The residual term $\mathrm{GSR}_{\mathrm{residue2}}(t)$ exhibits quasinormal distribution behavior between the 1st and 95th percentiles, with a computed skewness of 0.09 and kurtosis of 3.17. The normality plot is slightly skewed to the right of zero mean, and the curve is sharper than the normal distribution curve.
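For readers who want to reproduce this kind of normality summary, a minimal sketch of the moment-based skewness and kurtosis follows. It assumes the plain (biased) moment estimators and the convention in which a normal distribution has kurtosis 3; the source does not state which estimator its software applied, so values from other packages may differ slightly. The function name is hypothetical.

```python
import math


def moment_summary(x):
    """Return (mean, std, skewness, kurtosis) using population moments.

    Kurtosis here is NOT excess kurtosis: a normal distribution gives 3."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    m2 = sum(d ** 2 for d in dev) / n
    m3 = sum(d ** 3 for d in dev) / n
    m4 = sum(d ** 4 for d in dev) / n
    return mean, math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2


if __name__ == "__main__":
    # Small made-up sample, only to show the call.
    print(moment_summary([-2.0, -1.0, 0.0, 1.0, 2.0]))
```

Under this convention, a positive skewness indicates a longer right tail and a kurtosis above 3 indicates a sharper peak than the normal curve, which is how the values 0.09 and 3.17 are read in the text.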

Time-Series ARMA Modelling for the Residual Term (Stochastic Component)
The Box-Jenkins models [44] are only applicable to stationary time series. The identification of an appropriate Box-Jenkins model for a particular time series would first require a check for stationarity. If the residual term exhibits a normal distribution behavior with zero mean and constant variance, then it resembles white noise error and there is no need for further ARIMA modeling.
The behavior of the Auto-Correlation Function (ACF) and Partial Auto-Correlation Function (PACF) plots can help identify the ARMA model that best describes the resulting stationary time series. Table 1 summarizes the ARMA model selection criteria [45]. In general, if the ACF of the time series values either cuts off or dies down fairly quickly, then the time series values should be considered stationary. On the other hand, if the ACF dies down extremely slowly, then the time series values may be considered nonstationary. If the model is adequate, then these plots should show all spikes within the 95% Confidence Interval (CI) bounds (±1.96/√N), where N is the sample size. If the series is nonstationary, one could try to apply differencing or a log-transformation and then check if these make the series stationary. A stationary time series would have a quasinormal distribution with zero mean and constant variance.
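The ACF check against the ±1.96/√N bounds described above can be sketched as follows. This is a minimal illustration, assuming the usual sample autocorrelation estimator (lagged products divided by the total sum of squares); the function names are hypothetical and the demonstration series is artificial.

```python
import math


def sample_acf(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag of the series x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]


def lags_outside_bounds(acf, n):
    """Lags whose autocorrelation exceeds the 95% bounds +/-1.96/sqrt(n)."""
    bound = 1.96 / math.sqrt(n)
    return [k for k, r in enumerate(acf, start=1) if abs(r) > bound]


if __name__ == "__main__":
    # Strongly alternating series: every lag falls outside the bounds.
    x = [1.0, -1.0] * 50
    acf = sample_acf(x, max_lag=4)
    print([round(r, 2) for r in acf], lags_outside_bounds(acf, len(x)))
```

For a residual series that behaves like white noise, roughly one lag in twenty is still expected to fall outside the 95% bounds by chance.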
Figures 1(a) and 1(b) show, respectively, the ACF and PACF plots for the residual component as obtained from Minitab. Note that the ACF decays in an oscillating form after a few lags within the CI bounds, thus implying a fairly stationary time series. Differencing the residual component made the ACF and PACF more unstable, with further deterioration of their behavior, thus implying that differencing is not appropriate for this data. The PACF plot cuts off quickly after 2 lags, indicating that an AR (2) or higher-order model could be adequate. The time-series analysis of the residual stochastic component of the mean daily GSR data is conducted using SPSS and Minitab. The nature of the ARMA model used is first described, followed by an explanation of the model selection process with the diagnostic measures used to validate the selected model.

Table 1 entries for the MA and ARMA models:
MA (q). ACF: spikes decay to zero after lag q. PACF: spikes decay towards zero; coefficients may oscillate.
ARMA (p, q). ACF: spikes decay (either direct or oscillatory) to zero beginning after lag q. PACF: spikes decay (either direct or oscillatory) to zero beginning after lag p.
The nonseasonal autoregressive moving-average model of order (p, q), that is, ARMA (p, q), is described by the equation [45]

\[ y_t = \delta + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \cdots - \theta_q \varepsilon_{t-q}, \]

where $\phi_1, \phi_2, \ldots, \phi_p$ are unknown autoregressive model coefficients that are estimated from sample data. The constant parameter $\delta$ is proportional to the mean value $\mu$, which is zero in our case. $\theta_1, \theta_2, \ldots, \theta_q$ are unknown moving average model coefficients that are estimated from the sample data. The terms $\varepsilon_t, \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots, \varepsilon_{t-q}$ are statistically independent random shocks that are assumed to be randomly selected from a normal distribution with zero mean and constant variance.
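To make the recursion concrete, the sketch below generates an ARMA (p, q) series from a given sequence of shocks. It assumes the Box-Jenkins sign convention in which the moving-average terms enter with a minus sign, and it starts the recursion with zero pre-sample values; the function name and the coefficient values in the example are hypothetical and are not the estimates reported in the paper's tables.

```python
import random


def arma_series(shocks, phi, theta, delta=0.0):
    """Generate y_t = delta + sum_i phi_i*y_{t-i} + e_t - sum_j theta_j*e_{t-j}.

    Pre-sample values of y and e are taken as zero (an assumption)."""
    y = []
    for t, e in enumerate(shocks):
        value = delta + e
        for i, ph in enumerate(phi, start=1):
            if t - i >= 0:
                value += ph * y[t - i]
        for j, th in enumerate(theta, start=1):
            if t - j >= 0:
                value -= th * shocks[t - j]
        y.append(value)
    return y


if __name__ == "__main__":
    random.seed(0)
    shocks = [random.gauss(0.0, 1.0) for _ in range(10)]
    # Illustrative ARMA(2, 1) coefficients, chosen arbitrarily.
    print(arma_series(shocks, phi=[0.5, 0.2], theta=[0.3]))
```

Feeding a single unit shock followed by zeros returns the impulse response, which is a quick way to check the sign convention of a fitted model.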
The ARMA (p, q) model parameters ($\phi_i$, i = 1, 2, ..., p and $\theta_j$, j = 1, 2, ..., q) described in (7) are estimated in SPSS or Minitab using least-squares methods with known values of $y_t$ representing the residual component data. Based on the results seen in the ACF/PACF curves of Figure 1, an ARMA model is executed in SPSS for p = 0, 1, 2, 3, 4, 5 and q = 0, 1, 2, 3, 4, 5 to select the parsimonious model that best describes the data. The estimated parameters of the selected ARMA (p, q) models should have t-values higher than 2.0 in order to be judged significantly different from zero at the 5% level. Moreover, the coefficients should not be strongly correlated with each other in order to yield a parsimonious ARMA model that still passes the diagnostic checks. A parsimonious model is desirable because including irrelevant lags in the model increases the coefficient standard errors and therefore reduces the t-statistics. Models that incorporate large numbers of lags tend not to forecast well, as they fit data-specific features, explaining much of the noise or random features in the data. Therefore, model coefficients with P-values higher than 0.05 are insignificant and should be eliminated to avoid overfitting, as they have little effect on the prediction model [45].
The differencing of the time series led to deterioration of the ACF/PACF behavior, and thus no differencing was made (d = 0), so ARIMA (p, 0, q) models will be considered, which we refer to as ARMA (p, q) models. Different models can be obtained for various combinations of AR and MA terms, individually and collectively. The best model is selected using the following diagnostics.

(a) Low Akaike Information Criteria (AIC)/Schwarz-Bayesian Information Criteria (SBC). AIC and SBC are given in SPSS by

\[ \mathrm{AIC} = -2\log L + 2m, \qquad \mathrm{SBC} = -2\log L + m\log n, \]

where m = p + q, n = N − d, with N being the number of sample data points and d the order of differencing (here d = 0), and L is the likelihood function. Since −2 log L is approximately equal to n(1 + log 2π) + n log σ², where σ² is the model mean-square error (MSE), AIC can be written as AIC = n(1 + log 2π) + n log σ² + 2m, and because the first term in this equation is a constant, it is usually omitted when comparing models. Moreover, after dropping the constant and dividing by n, SBC can also be written as SBC = log σ² + (m log n)/n. SBC selects the more parsimonious model and is better than AIC for large samples. The best model should have the lowest AIC/SBC value and the least MSE. Once the optimal ARMA (p, q) model for the residual GSR time series is selected, a white-noise test is needed to check whether any significant spikes at one or more lags in the ACF/PACF correlograms could have occurred just by chance. These tests indicate whether there is any correlation in the time series or whether the abnormal spikes encountered in the ACF and PACF of the residual error are consistent with a set of random, identically distributed variables.
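The criteria in item (a) can be evaluated directly from a model's mean-square error, as in the sketch below. It uses the approximation −2 log L ≈ n(1 + log 2π) + n log σ² quoted in the text; the function name and the MSE values in the example are placeholders chosen for illustration, not results from the paper.

```python
import math


def aic_sbc(mse, n, p, q):
    """Return (AIC, SBC) for an ARMA(p, q) fit with mean-square error mse
    on n observations, using -2 log L ~ n(1 + log 2*pi) + n log(mse)."""
    m = p + q
    neg2_log_l = n * (1.0 + math.log(2.0 * math.pi)) + n * math.log(mse)
    return neg2_log_l + 2 * m, neg2_log_l + m * math.log(n)


if __name__ == "__main__":
    n = 3650
    # Placeholder MSE values for two candidate orders (not from the paper).
    for (p, q), mse in {(2, 1): 0.150, (4, 3): 0.149}.items():
        aic, sbc = aic_sbc(mse, n, p, q)
        print(f"ARMA({p},{q}): AIC={aic:.1f}  SBC={sbc:.1f}")
```

Because log n exceeds 2 once n is larger than about 7, SBC charges more per extra coefficient than AIC does, which is why it favours the more parsimonious model for long series such as this one.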
The Ljung-Box Q-statistic can be used to check if the residuals from the ARMA (p, q) model behave as a white-noise process [45,48]. Ljung and Box [48] used the Q statistic

\[ Q = n(n+2) \sum_{k=1}^{K} \frac{r_k^2}{n-k}, \]

which yields a more accurate variance of ACF [variance becomes (n−K)/n² instead of 1/n] compared to the statistic defined earlier by Box and Pierce [49]. K is the degrees of freedom, representing the maximum number of lags considered; n = N − d, with N being the number of data points and d the degree of differencing (no differencing is assumed in this work, so d = 0); and $r_k$ is the sample ACF at lag k.
Under the null hypothesis that all autocorrelations $r_k$ = 0, the Q statistic is compared to critical values from the chi-square (χ²) distribution with K degrees of freedom. If the model is correctly specified, the residuals should be uncorrelated, Q should be small, and consequently the probability value should be large. A white noise process would ideally have Q = 0. If Q exceeds the critical value $\chi^2_{\mathrm{DF},\alpha}$ at the specified degrees of freedom (DF) and significance level (α), then the null hypothesis can be rejected.
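A minimal sketch of the Ljung-Box statistic follows, assuming the standard form Q = n(n + 2) Σ r_k²/(n − k). The function name is hypothetical, and the critical value must be supplied by the reader from a chi-square table; the example uses 18.307, the tabulated 5% point for 10 degrees of freedom.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [v - mean for v in residuals]
    c0 = sum(d * d for d in dev)
    total = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


def looks_like_white_noise(residuals, max_lag, critical_value):
    """True when Q does not exceed the supplied chi-square critical value,
    i.e. the null hypothesis of no autocorrelation is not rejected."""
    return ljung_box_q(residuals, max_lag) <= critical_value


if __name__ == "__main__":
    # Alternating series: strongly autocorrelated, so the test rejects.
    x = [1.0, -1.0] * 50
    print(ljung_box_q(x, 10), looks_like_white_noise(x, 10, 18.307))
```

The text takes the degrees of freedom equal to the number of lags K; some references subtract the number of fitted ARMA coefficients when the test is applied to model residuals, so the critical value should match whichever convention is adopted.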

(d) Check the Normality Test of the Residual Error of the Selected ARMA Model. The residual error should resemble white noise with zero mean and constant variance.
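Another method to ensure that the selected ARMA model yields a stationary residual error is by checking the unit-root rule for stationarity.
Let B be the lag operator in (7), so that $B y_t = y_{t-1}$. Then (7) can be written in the form (assume zero mean; δ = 0)

\[ \phi(B)\, y_t = \theta(B)\, \varepsilon_t, \]

where

\[ \phi(B) = 1 - \phi_1 B - \phi_2 B^2 - \cdots - \phi_p B^p, \qquad \theta(B) = 1 - \theta_1 B - \theta_2 B^2 - \cdots - \theta_q B^q, \]

and $\varepsilon_t$ denotes the random shock at time t. The stationarity of the time-series sequence $y_t$ requires that all the roots of the polynomial φ(B) lie outside the unit circle [45].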
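For an AR (2) part, the unit-root rule reduces to solving a quadratic, as sketched below. The polynomial is assumed to be written as 1 − φ₁B − φ₂B², the function names are hypothetical, and the coefficient values in the example are arbitrary rather than the estimates reported for the selected model.

```python
import cmath


def ar2_roots(phi1, phi2):
    """Roots in B of 1 - phi1*B - phi2*B**2 = 0 (one root if phi2 == 0)."""
    if phi2 == 0:
        return [complex(1.0 / phi1)] if phi1 != 0 else []
    # Rewrite as a*B**2 + b*B + c = 0 with a = -phi2, b = -phi1, c = 1.
    a, b, c = -phi2, -phi1, 1.0
    disc = cmath.sqrt(b * b - 4.0 * a * c)
    return [(-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)]


def is_stationary(roots):
    """Stationary when every root lies strictly outside the unit circle."""
    return all(abs(r) > 1.0 for r in roots)


if __name__ == "__main__":
    for phi1, phi2 in [(0.5, 0.3), (0.5, 0.6)]:
        roots = ar2_roots(phi1, phi2)
        moduli = [round(abs(r), 3) for r in roots]
        print(phi1, phi2, moduli, is_stationary(roots))
```

The same check applied to θ(B) tests invertibility of the moving-average part; for higher orders a general polynomial root finder would be needed in place of the quadratic formula.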
Several ARMA models were analyzed based on the above criteria, and two models emerged as the best, namely, ARMA (2, 1) and ARMA (4, 3). Tables 2 and 3 show the estimated parameters for the ARMA (2, 1) and ARMA (4, 3) models, respectively, as obtained from SPSS.
The SPSS output shown in Tables 2 and 3 indicates that all estimated model coefficients have P-values less than 0.05 (α = 5%). This implies that all the coefficients of the two selected Box-Jenkins ARMA models are significant, since the null hypothesis H0: φ = 0 (AR) or θ = 0 (MA) can be rejected for the preset significance level α (chosen as 0.05). Other models tested resulted in coefficients with P-values greater than 0.05, and thus these coefficients will have little effect on the model description (overfitting). Table 4 shows the estimated coefficients for a few of the ARMA models studied to show the overfitting behavior. Note that model ARMA (4, 4) is overfitted from ARMA (4, 3). The same behavior can be noted from models ARMA (2, 2) and ARMA (3, 1), which are overfitted from the ARMA (2, 1) model.
The ACF and PACF plots of the residual error for the ARMA (2, 1) and ARMA (4, 3) models are displayed in Figures 2(a) and 2(b) and Figures 3(a) and 3(b), respectively.The plots vary within the 95% CI bounds.The spikes in the ACF and PACF at lag 7 are due to random events and thus cannot be explained.However, since it is a 95% confidence interval, one can expect this to happen once in every twenty lags and so we will not be concerned with this.
Ljung-Box test results for the selected ARMA (2, 1) and ARMA (4, 3) Models obtained using SPSS and JMulTi [50] are shown in Tables 5 and 6, respectively.The Ljung-Box Q-statistic values correspond to P-values greater than α = 0.05 thus indicating that the test is not significant and residuals appear to be uncorrelated.This implies that for both models, the GSR residual resembles random white noise.Both models will, therefore, be useful to forecast the GSR weather data, with ARMA (2, 1) being the more parsimonious and thus will be used hereafter.Model ARMA (2, 1) also has the lowest BIC value as witnessed from Table 4 and hence is the most suitable model for long-term prediction of GSR data.
For the selected ARMA (2, 1) model, the AR and MA roots of the polynomials φ(B) and θ(B) in (12), computed using the JMulTi software, are shown in Table 7. Note that all roots lie outside the unit circle, which implies that the ARMA (2, 1) model will yield a stationary residual error. Once the parsimonious ARMA (2, 1) model is selected and checked for stationary residual error, the GSR regression model involving the ARMA model becomes

\[ \mathrm{GSR}(t) = \mathrm{GSR}_{\mathrm{regression}}(t) + \mathrm{GSR}_{\mathrm{trend}}(t) + \mathrm{GSR}_{\mathrm{seasonal}}(t) + \mathrm{GSR}_{\mathrm{ARMA}(2,1)}(t) + \mathrm{WN}(t), \]

where WN(t) is the final residual error resembling white noise with zero mean and constant variance. Table 8 shows the result of running Levene's two-tailed variance test on the regression model with and without ARMA modeling and the measured data for years 1995-2004. This variance ratio test uses the following hypotheses. H0: the variances are identical. Ha: at least one of the variances is different from another.
As the computed P-values in Table 8 are greater than the significance level α = 0.05, one cannot reject the null hypothesis H0. The risk of rejecting the null hypothesis H0 in Levene's test while it is true is 55.62% and 43.96% for the regression and regression with ARMA cases, respectively. Levene's test is mostly used in samples with normal distribution.

Comparison of the Time-Series Regression Model (with and without ARMA Model) with Measured Data for Years 1995-2004
The "cftool" MATLAB GUI toolbox [51] is used to study the correlation between the regression model without ARMA modeling (6) and with ARMA modeling (13) and the measured data set for years 1995-2004, as depicted in Figures 4(a) and 4(b). The predicted time-series regression model with and without ARMA modeling yields coefficients of determination of 94.2% and 92.4%, respectively. The statistical error parameters for both models are given in Table 9. Note that the ARMA model helps decrease the RMSE, MABE, and MAPE errors slightly and increase R², and hence will do a better forecasting job.
The mean daily GSR data comparison between the measured data and the regression model for the years 1995-2004 in Al-Ain is shown in Figure 5.
The resulting mean monthly model data were computed and are shown in Figure 6.
Note the excellent agreement between the regression model and measured data for the period 1995-2004. Table 10 shows the mean monthly statistical error parameters for comparison of regression models with and without ARMA modeling.

2.3. Validation of Time-Series Regression Model
The resulting time-series regression model is validated with the measured test data set for years 2005-2007 in Al-Ain, UAE, in order [...] Artificial Neural Networks (ANN) obtained by the coauthor [26] using the same model and test data sets. Figure 8 shows a better prediction performance for the Al-Ain test data. Note from Table 12 the very high R² value (near 100%) obtained for monthly mean data validation for both time-series regression models. The low error parameters (RMSE, MAPE, MBE, and MABE) provide a clear indication of the potential of these techniques for long-term GSR data prediction.

Definition of the Statistical Error Parameters
The formulas used to compute the statistical error parameters in MATLAB are given in [1]. These parameters attest to the accuracy of the models used for predicting the mean daily global solar radiation.

Conclusion
The solar radiation data taken in Al-Ain, UAE, are analyzed using time-series regression with a Box-Jenkins ARMA model. Based on the behavior of the ACF and PACF plots, the nonseasonal ARMA (2, 1) and ARMA (4, 3) models were identified as the two most adequate models, with model ARMA (2, 1) being the most parsimonious and best suited for long-term GSR prediction, as it yielded the lowest BIC value. The model adequacy was confirmed through the Ljung-Box Q test and a normality test. The model was validated by testing with GSR data for years 2005-2007, yielding coefficients of determination of 92.59% and 99.98% for mean daily and monthly GSR data comparison, respectively. The corresponding values of RMSE, MBE, and MAPE were also small, confirming the adequacy of the regression-ARMA model for GSR data prediction in the city of Al-Ain, UAE. Work is underway to establish a model for the city of Abu Dhabi, UAE, using GSR measured data for the same time period in order to come up later with a more generalized GSR weather model for the UAE.

Diagnostics (b) and (c) of the ARMA model selection:
(b) Plot of Residual Autocorrelation Function (ACF). The appropriate ARMA model, once fitted, should have a residual error whose ACF plot varies within the 95% CI bounds (±1.96/√N), where N is the number of observations upon which the model is based.
(c) Nonsignificance of Autocorrelations of Residuals via Portmanteau Tests (Q-Tests Based on Chi-Square Statistics) Such As Box-Pierce or Ljung-Box Tests (White Noise Tests).

Figure 7: Mean daily GSR for regression model with the test data set for years 2005-2007 for Al-Ain city.

Table 1: Behavior of ACF and PACF for each of the general nonseasonal models.

Table 2: SPSS least square estimation of the model parameters for ARMA (2, 1) model of residual GSR component.

Table 3: SPSS least square estimation of the model parameters for ARMA (4, 3) model of residual GSR component.

Table 7: Roots and moduli for selected ARMA models.

Table 8: Levene's two-tailed variance test (α = 0.05).

Table 9: Statistical error data for the mean daily GSR model for regression and regression-ARMA models versus measured model data for years 1995-2004.

Table 10: Statistical error data for the mean monthly GSR model for regression and regression-ARMA models versus measured model data for years 1995-2004.

Table 11: Statistical error data for the mean daily GSR error for regression and regression-ARMA models versus measured test data for years 2005-2007.

Table 12: Statistical error data for the mean monthly GSR error for regression models versus measured test data for years 2005-2007.