Evaluation of Temperature-Based Empirical Models and Machine Learning Techniques to Estimate Daily Global Solar Radiation at Biratnagar Airport, Nepal

Global solar radiation (GSR) is a critical variable for designing photovoltaic cells, solar furnaces, solar collectors, and other passive solar applications. In Nepal, the high initial and maintenance costs of the instruments required to measure GSR have restricted their deployment across the country. The current study compares six different temperature-based empirical models, artificial neural networks (ANN), and five other machine learning (ML) models for estimating daily GSR utilizing readily available meteorological data at Biratnagar Airport. Amongst the temperature-based models, the model developed by Fan et al. performs better than the rest with an R² of 0.7498 and RMSE of 2.0162 MJ m⁻² d⁻¹. A feed-forward multilayer perceptron (MLP) is utilized to model daily GSR using extraterrestrial solar radiation, sunshine duration, maximum and minimum ambient temperature, precipitation, and relative humidity as inputs. ANN3 performs better than other ANN models with an R² of 0.8446 and RMSE of 1.4595 MJ m⁻² d⁻¹. Likewise, stepwise linear regression performs better than other ML models with an R² of 0.8870 and RMSE of 1.5143 MJ m⁻² d⁻¹. Thus, the model developed by Fan et al. is recommended to estimate daily GSR in regions where only ambient temperature data are available. Similarly, the more robust ANN3 and stepwise linear regression models are recommended to estimate daily GSR in regions where data on sunshine duration, maximum and minimum ambient temperature, precipitation, and relative humidity are available.


Introduction
Some of the critical global issues currently encountered by human civilization include global warming and environmental pollution, instigated particularly by the excessive use of fossil fuels like petroleum products and natural gas and of traditional fuels like timber and firewood [1]. Nepal, being a developing nation where 60% of the entire population is involved in agriculture, has a disproportionate dependence on traditional fuels [2]. However, clean and perpetual solar energy has been gaining more and more attention from the government as well as the private sector in recent years. Global solar radiation (GSR) data are one of the critical variables in applications relating to hydrology, meteorology, agriculture, and renewable energy. In the renewable energy sector, GSR is important for predicting the capacity and efficiency of solar energy devices such as photovoltaic cells, solar furnaces, and solar collectors. However, unlike routine meteorological parameters, daily GSR data are not readily available in many locations all over the world [3]. This is particularly relevant for developing countries like Nepal. The lack of measured GSR data has led to the development of several methods to estimate GSR, namely, neural networks [4,5], empirical models [6,7], stochastic algorithms [8], and satellite-based methods [9]. Despite the current development of new methods and technologies, the empirical method utilizing meteorological data is preferred because of the cost and technical constraints imposed on new methods and technologies [3,6,7,10].
According to the Department of Hydrology and Meteorology (DHM), only 284 meteorological stations are currently in operation in Nepal. Out of these, only 64 meteorological stations are equipped with a pyranometer to measure daily GSR, while only 34 meteorological stations have the necessary infrastructure to measure daily sunshine duration [11]. Daily GSR data are therefore not readily available for most locations in the country. Thus, various other meteorological parameters can be used instead for the estimation of daily GSR.
As of 2019, more than 294 empirical models are available for the estimation of GSR employing readily available meteorological data [7]. Some of the major categories of empirical models include sunshine-based models [12], temperature-based models [13], cloudiness-based models [14], and complex models. Angstrom pioneered the estimation of GSR employing a linear empirical model, which was later modified by Prescott [15]. The simplicity of the Angstrom-Prescott (A-P) model and the strong correlation of sunshine duration with GSR are the reasons for its extensive application all over the world [12,16,17]. After the development of the A-P model, one or more meteorological parameters have been incorporated in the original model to improve the estimation [15]. Sunshine-based models perform more efficiently than models based on other meteorological parameters since the sunshine duration is strongly correlated with the GSR [18-20]. However, the high initial investment and high maintenance cost of the instrument are constraints to the widespread application of sunshine-based models. Therefore, developing empirical models that utilize readily available meteorological parameters such as ambient air temperature, relative humidity, and precipitation is widely preferred. The ambient temperature range is the most readily available meteorological parameter. One of the simplest temperature-based models, with mean monthly maximum and minimum temperature as inputs, was proposed by Hargreaves and Samani [21]. After the introduction of the H-S model, several modifications have been developed by incorporating other meteorological parameters to improve the model performance. Hassan et al. modified the H-S model by introducing a precipitation term, and the resulting model performed better than two of the most effective sunshine-based models from the literature [13]. Jahani et al. recently developed two new accurate polynomial models that outperformed several temperature-based models from the literature [3].
Although all of the temperature-based models were derived empirically, the variation in ambient temperature was assumed to be largely dependent on the solar radiation arriving on the Earth's surface [22].
Although empirical models are widely analyzed and evaluated, the performance of these models is found to vary according to the geographical location and local climate [6]. Lately, several machine learning (ML) models have been employed to estimate GSR at several locations [4,5]. The ability to generalize, to optimize computation time, and to resolve problems that are difficult to represent by an explicit algorithm are some of the biggest advantages of ML models [23,24]. The main ML models currently in practice include artificial neural network (ANN), support vector machine (SVM), genetic programming (GP), random forests (RF), and adaptive neuro-fuzzy inference system (ANFIS). Some of the predominantly applied ANN models include the radial basis function network (RBFN) and the multilayer perceptron (MLP). Behrang et al. [4] concluded that MLP was more accurate than RBFN for the estimation of GSR in Iran. Belaid and Mellit [25] applied SVM with different input combinations and concluded that it required fewer input parameters to provide better accuracy than ANN. The current study presents a comparison between the temperature-based empirical models and ML models. The most common type of feed-forward network, i.e., MLP, is employed in the current study to estimate the GSR at Biratnagar Airport, located in the Eastern Terai Belt of Nepal. Hence, the objectives of the study include the following:
(1) The performance analysis of six different temperature-based empirical models to estimate daily GSR
(2) The application of ANN and other ML models to estimate daily GSR
(3) The comparative analysis of the aforementioned models to recommend the best model for the estimation of GSR

Study Location and Data.
Nepal is situated between 26.2°N and 30.54°N latitude in the temperate zone. Nepal experiences 300 days of annual sunshine with an annual average solar radiation of 5 kWh/m²/day [26]. Fourteen months of daily data on various meteorological parameters for Biratnagar Airport (26.4840°N latitude, 87.2662°E longitude, and 236 m altitude) were obtained from DHM. The average annual temperature in Biratnagar is 24.3°C with an average annual rainfall of 1898 mm [27].
A CMP6 pyranometer is employed to measure the daily GSR on a horizontal surface. The pyranometer consists of a blackened thermopile that absorbs the solar radiation and converts it into heat. The thermopile generates a voltage output, which is then calibrated to indicate the GSR. A data logger is utilized to record the measured daily GSR. A Campbell-Stokes sunshine recorder is employed for the measurement of sunshine duration. Mercury-filled and alcohol-filled meteorological thermometers are used to measure wet-bulb and dry-bulb temperature for normal ambient and low ambient temperature, respectively. Similarly, relative humidity is computed from the dry-bulb and wet-bulb temperatures.
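The paper does not state which psychrometric relation is used to derive relative humidity from the two thermometer readings. As a minimal sketch, the following assumes the FAO-56 formulation (saturation vapour pressure from a Tetens-type equation, and a psychrometer coefficient of 0.0008 for a naturally ventilated instrument); the function names, the default pressure of 101.3 kPa, and the coefficient are illustrative assumptions, not values taken from the study.

```python
import math


def saturation_vapour_pressure(temp_c: float) -> float:
    """Saturation vapour pressure in kPa at temp_c (deg C), FAO-56 form."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def relative_humidity(dry_bulb_c: float,
                      wet_bulb_c: float,
                      pressure_kpa: float = 101.3,
                      psychrometer_coeff: float = 0.0008) -> float:
    """Relative humidity (%) from dry-bulb and wet-bulb temperatures.

    psychrometer_coeff is an assumption: FAO-56 lists 0.000662 for
    ventilated and 0.0008 for naturally ventilated psychrometers.
    """
    gamma = psychrometer_coeff * pressure_kpa  # psychrometric constant, kPa/degC
    actual = (saturation_vapour_pressure(wet_bulb_c)
              - gamma * (dry_bulb_c - wet_bulb_c))
    rh = 100.0 * actual / saturation_vapour_pressure(dry_bulb_c)
    return max(0.0, min(100.0, rh))  # clamp to the physical range


if __name__ == "__main__":
    print(round(relative_humidity(30.0, 20.0), 1))
```

When the wet-bulb and dry-bulb readings coincide the air is saturated and the function returns 100%; the larger the wet-bulb depression, the lower the computed humidity.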

Temperature-Based Empirical Models.
Several models have correlated ambient air temperature with GSR empirically. Two of the most widely employed empirical models for the estimation of GSR using ambient air temperature data only are the Hargreaves and Samani (H-S) model and the Bristow and Campbell (B-C) model. Thus, these two models, along with four other recently developed, highly accurate models, are selected in the present study.
Hargreaves and Samani [21] proposed a simple model that employs the mean monthly maximum and minimum temperature as inputs for the estimation of daily GSR, in which ΔT = T_max − T_min and c_1 is the empirical constant.
Chen and Li [28] developed and analyzed the performance of more than 20 different temperature-based empirical models. Two of the top-performing models, which take the ambient temperature range and the mean monthly maximum and minimum temperature as inputs, are selected for the current study. The first adds a constant term to the original H-S model and uses an exponent of "1" (abbreviated as Chen and Li (model 1)), where a and c_1 are the empirical constants. The second is a multiple regression model that takes the mean monthly maximum and minimum temperature as inputs (abbreviated as Chen and Li (model 2)), where a, c_1, c_2, and c_3 are the empirical constants.
Bristow and Campbell [29] developed a model in which the GSR is exponentially related to the ambient temperature range, where c_1, c_2, and c_3 are the empirical constants.
Jahani et al. [3] proposed a model with a polynomial correlation between GSR and the ambient temperature range, where a, c_1, c_2, and c_3 are the empirical constants.
Fan et al. [6] modified the Jahani model by using a different exponent in the ambient temperature range term and by incorporating an additional average temperature term to improve the model performance, where T_a = (T_max + T_min)/2, and a, c_1, c_2, c_3, and c_4 are the empirical constants.
The extraterrestrial solar radiation (R_a) is computed following [30], and the angle of declination is determined following [31]. The day length and the sunset hour angle are also calculated.
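The quantities named above can be sketched in code. The block below is an illustration under stated assumptions, not the study's implementation: the extraterrestrial radiation, declination, sunset hour angle, and day length use the widely cited FAO-56 expressions, which may differ in detail from the forms given in [30] and [31], and the H-S and B-C models are written in their commonly published forms, R_s = c_1 R_a ΔT^0.5 and R_s = c_1 R_a [1 − exp(−c_2 ΔT^c_3)]. All function names are hypothetical. The single H-S coefficient is fitted by closed-form least squares, the calibration method the paper reports using.

```python
import math

SOLAR_CONSTANT = 0.0820  # MJ m^-2 min^-1, FAO-56 value (assumption)


def declination(day_of_year: int) -> float:
    """Solar declination in radians (FAO-56 form)."""
    return 0.409 * math.sin(2.0 * math.pi * day_of_year / 365.0 - 1.39)


def sunset_hour_angle(lat_rad: float, decl: float) -> float:
    """Sunset hour angle in radians."""
    return math.acos(-math.tan(lat_rad) * math.tan(decl))


def day_length(lat_deg: float, day_of_year: int) -> float:
    """Maximum possible sunshine duration in hours."""
    ws = sunset_hour_angle(math.radians(lat_deg), declination(day_of_year))
    return 24.0 / math.pi * ws


def extraterrestrial_radiation(lat_deg: float, day_of_year: int) -> float:
    """Daily extraterrestrial radiation R_a in MJ m^-2 d^-1."""
    phi = math.radians(lat_deg)
    decl = declination(day_of_year)
    ws = sunset_hour_angle(phi, decl)
    # Inverse relative Earth-Sun distance
    dr = 1.0 + 0.033 * math.cos(2.0 * math.pi * day_of_year / 365.0)
    return (24.0 * 60.0 / math.pi) * SOLAR_CONSTANT * dr * (
        ws * math.sin(phi) * math.sin(decl)
        + math.cos(phi) * math.cos(decl) * math.sin(ws)
    )


def hargreaves_samani(ra: float, t_max: float, t_min: float, c1: float) -> float:
    """H-S estimate of daily GSR; requires t_max >= t_min."""
    return c1 * ra * math.sqrt(t_max - t_min)


def bristow_campbell(ra: float, t_max: float, t_min: float,
                     c1: float, c2: float, c3: float) -> float:
    """B-C estimate of daily GSR."""
    return c1 * ra * (1.0 - math.exp(-c2 * (t_max - t_min) ** c3))


def fit_hargreaves_samani(ra, t_max, t_min, measured) -> float:
    """Least-squares c1 for R_s = c1 * x, with x = R_a * sqrt(dT)."""
    x = [r * math.sqrt(tx - tn) for r, tx, tn in zip(ra, t_max, t_min)]
    return sum(xi * yi for xi, yi in zip(x, measured)) / sum(xi * xi for xi in x)


if __name__ == "__main__":
    # Biratnagar Airport latitude from the text; day 172 is near the June solstice.
    print(round(extraterrestrial_radiation(26.4840, 172), 2))
```

The multi-coefficient models (Chen and Li, Jahani et al., Fan et al.) would need a general linear or nonlinear least-squares solver rather than the one-line closed form used here for H-S.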

Machine Learning Models.
The multilayer perceptron (MLP) model is employed in the current study among the several ANN topologies available in the literature. MLP is particularly useful for modeling complex problems. Figure 1 illustrates the structure of the MLP, which consists of an input layer, a hidden layer, and an output layer. Input signals are multiplied by a set of weights as they are sent to the output layer through the hidden layer. In a typical MLP with one hidden layer [32], a bias θ is added in the hidden layer, and a nonlinear activation function f (typically a sigmoid) is employed to calculate the output of the neurons [33].
Support vector machine (SVM) is a powerful supervised learning technique with excellent generalization ability, because of which it is extensively utilized for solving problems regarding pattern recognition, classification, regression, and prediction [34]. SVM can be used with various kernel functions to implement its regression learner. The Gaussian kernel is popular for SVM classification and regression, since very complex boundaries and relations can be established with a Gaussian-based SVM. Medium Gaussian and fine Gaussian SVMs are distinguished by the width of the Gaussian function being used. Fine Gaussian uses a very narrow Gaussian function with a very low standard deviation. Because of the resulting tightly fitted boundaries, fine Gaussian is susceptible to overfitting.
The linear regression learner performs a multivariate linear regression on a set of input data. Stepwise linear regression, on the other hand, utilizes only highly correlated variables for linear regression, removing redundant, weakly correlated variables.
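The forward pass of a single-hidden-layer MLP described above can be sketched as follows. This is a minimal illustration, assuming a sigmoid hidden layer and a linear output neuron (a common choice for regression; the paper does not state the output activation). The function names and the weight layout are hypothetical and do not reproduce the MATLAB network used in the study.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mlp_forward(inputs: Sequence[float],
                hidden_weights: List[Sequence[float]],
                hidden_biases: Sequence[float],
                output_weights: Sequence[float],
                output_bias: float) -> float:
    """One forward pass: inputs -> sigmoid hidden layer -> linear output.

    hidden_weights[j][i] is the weight from input i to hidden neuron j,
    and hidden_biases[j] is the bias (theta) of hidden neuron j.
    """
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + bias)
        for row, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


if __name__ == "__main__":
    # Two inputs, two hidden neurons, arbitrary illustrative weights.
    y = mlp_forward([0.5, -1.0],
                    [[0.2, 0.4], [-0.3, 0.1]],
                    [0.0, 0.1],
                    [1.0, -1.0],
                    0.05)
    print(y)
```

Training, for example with the Levenberg-Marquardt algorithm, adjusts the weights and biases so that this output matches the measured GSR; only the prediction step is shown here.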

Model Training and Testing.
The fourteen-month dataset is divided into two parts: the initial 85% of the data is utilized for the development of the ML models or the calibration of the empirical models, and the remaining 15% is employed for the model assessment. The present study employs extraterrestrial solar radiation (R_a), sunshine duration, maximum ambient temperature (T_max), minimum ambient temperature (T_min), precipitation, and relative humidity as inputs for the development of the ML models. The performance of the neural network is analyzed by varying the number of neurons in the hidden layer and recording the respective statistical indicators. Neural Net Fitting [35], a built-in application in MATLAB, is employed to design and train the neural network. The most well-known feed-forward network, i.e., MLP, is employed in the current study to model the GSR. The Levenberg-Marquardt backpropagation algorithm is utilized to train the network. Training terminates when generalization stops improving, as indicated by an increase in the mean square error and a corresponding decrease in R². Regression Learner [36], also a built-in application in MATLAB, is employed to analyze and evaluate the performance of linear regression, stepwise linear regression, medium Gaussian SVM, Matérn 5/2 Gaussian process regression (GPR), and exponential GPR.
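The 85%/15% division described above takes the initial portion of the record for model development and the remainder for assessment. A minimal sketch of such a chronological split is given below; the function name and the rounding rule at the cut point are assumptions, since the paper does not state how the boundary day is chosen.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(records: Sequence[T],
                        train_fraction: float = 0.85) -> Tuple[List[T], List[T]]:
    """Split time-ordered records into an initial part and the remainder.

    No shuffling is done, so the assessment set is strictly later in
    time than the development set.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    cut = round(len(records) * train_fraction)
    return list(records[:cut]), list(records[cut:])


if __name__ == "__main__":
    days = list(range(1, 21))  # stand-in for 20 daily records
    development, assessment = chronological_split(days)
    print(len(development), len(assessment))
```

Keeping the order intact means the assessment data are unseen in time as well as in value, which matches the "unseen data" wording used later in the paper.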

Statistical Indicators.
Four statistical indicators, namely, coefficient of determination (R²), adjusted R², mean square error (MSE), and root mean square error (RMSE), are utilized to evaluate the performance of the various models. In these indicators, X_i and Y_i represent the measured and predicted values, X̄ and Ȳ represent the average measured and average estimated values, n is the number of data points, and k is the number of independent regressors. The RMSE provides an indication of the short-term performance of the model; a lower RMSE corresponds to better performance. R² indicates the proportion of the variance of the dependent variable that is explained by the independent variables.
Figure 2 illustrates the correlation between the measured and estimated values of daily GSR for the calibration dataset. The estimated GSR is reasonably correlated with the measured GSR for all models; the corresponding statistical indicators are given in Table 1. The least-squares method is employed to fit the empirical coefficients of the temperature-based models, and the coefficients obtained for all temperature-based models are listed in Table 2.
Figures 3 and 4 illustrate the correlation between daily GSR estimated by ANN and measured daily GSR for 6 and 10 neurons in the hidden layer, respectively. Table 3 provides a summary of the statistical indicators for the ANN models. For the training set, the model with 10 neurons in the hidden layer (abbreviated as ANN5) performs better than the other ANN models with an R² of 0.8485 and RMSE of 1.4967 MJ m⁻² d⁻¹. Likewise, ANN3 ranks second among the ANN models in the training set with an R² of 0.8341 and RMSE of 1.6823 MJ m⁻² d⁻¹. A comparison of the statistical indicators shows that R² and RMSE sometimes follow different trends in the training set, and the same is observed in the validation set.
Although ANN5 is the best model in terms of R², it exhibits a comparatively higher RMSE of 1.8221 MJ m⁻² d⁻¹. In the model development, the ANN models account for greater variance in the training data than the temperature-based empirical models.
Figure 5 illustrates the correlation between daily GSR estimated by various ML models and measured daily GSR. In order to prevent overfitting, 5-fold cross-validation is performed during the model development. The statistical indicators for these models are given in Table 4.
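The 5-fold cross-validation mentioned above partitions the development data into five folds, each of which serves once as the validation set while the other four are used for fitting. The sketch below only generates the index partition. It uses contiguous, unshuffled folds for simplicity, which is an assumption; the partitioning used by the MATLAB Regression Learner app may differ. The function name is hypothetical.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return k (training_indices, validation_indices) pairs.

    Folds are contiguous and differ in size by at most one sample.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        training = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((training, validation))
        start += size
    return folds


if __name__ == "__main__":
    for training, validation in k_fold_indices(12, k=5):
        print(len(training), validation)
```

A cross-validated indicator is then the average of the indicator computed on each validation fold after fitting on the corresponding training indices. A model whose score drops sharply under this procedure is fitting noise in the training data.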

Training and Validation of Other ML Models.
Medium Gaussian SVM performs better than the other models with an R² of 0.79 and RMSE of 1.867 MJ m⁻² d⁻¹. Similarly, GPR Matérn 5/2 ranks second with an R² of 0.79 and RMSE of 1.870 MJ m⁻² d⁻¹. The performance of GPR exponential degrades extensively with cross-validation, which might be attributed to overfitting of the dataset. In contrast, cross-validation has little or no effect on the performance of the linear regression and stepwise linear regression models. In the model development, most of these ML models outperform the temperature-based empirical models. The ANN5 model, however, accounts for greater variance in the training data than these ML models.

Performance Comparison of Empirical and ML Models.
Model assessment is carried out on the 15% unseen data after the model development. RMSE and R² of the empirical and ML models on the test data are illustrated in Figures 6 and 7, respectively. All of the temperature-based empirical models perform reasonably well on the test data. Amongst the empirical models, the model developed by Fan et al. outperforms the other models with an R² of 0.7498 and RMSE of 2.0162 MJ m⁻² d⁻¹. A comparison of statistical indicators shows that R² and RMSE sometimes follow a different trend.
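The four indicators used in this comparison can be sketched as follows. The paper defines them with both the average measured and the average estimated value, which suggests the squared-correlation form of R²; that form is assumed here, together with the usual adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1). The function names are hypothetical, and the exact expressions in the original may differ.

```python
import math
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def r_squared(measured: Sequence[float], estimated: Sequence[float]) -> float:
    """Squared Pearson correlation between measured and estimated values."""
    mx, my = _mean(measured), _mean(estimated)
    cov = sum((x - mx) * (y - my) for x, y in zip(measured, estimated))
    var_x = sum((x - mx) ** 2 for x in measured)
    var_y = sum((y - my) ** 2 for y in estimated)
    return cov * cov / (var_x * var_y)


def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjust R^2 for n data points and k independent regressors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def mse(measured: Sequence[float], estimated: Sequence[float]) -> float:
    """Mean square error."""
    return _mean([(x - y) ** 2 for x, y in zip(measured, estimated)])


def rmse(measured: Sequence[float], estimated: Sequence[float]) -> float:
    """Root mean square error, in the units of the data."""
    return math.sqrt(mse(measured, estimated))


if __name__ == "__main__":
    measured = [1.0, 2.0, 3.0, 4.0]
    estimated = [2.0, 3.0, 4.0, 5.0]  # constant offset of +1
    print(r_squared(measured, estimated), rmse(measured, estimated))
```

Under this assumed definition, a constant bias leaves R² at 1 while the RMSE equals the size of the bias. This is one way the two indicators can rank models differently, as noted in the text.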

Conclusion
The present study analyzes and evaluates six different temperature-based empirical models, ANN, and five other ML models to estimate daily GSR at Biratnagar Airport. Initially, six different temperature-based empirical models with ambient temperature range and daily average temperature as inputs are calibrated. In the model assessment, the model developed by Fan et al. performs better than the other temperature-based models with an R² of 0.7498 and RMSE of 2.0162 MJ m⁻² d⁻¹. The performance of ANN with a higher number of neurons in the hidden layer degrades substantially in the model assessment because a large number of neurons tends to overfit the dataset. In the model assessment, ANN3 performs better than the other ANN models with an R² of 0.8446 and RMSE of 1.4595 MJ m⁻² d⁻¹. Similarly, ANN1 ranks second with an R² of 0.8134 and RMSE of 1.4663 MJ m⁻² d⁻¹. Five different ML models available in the MATLAB Regression Learner are analyzed and evaluated to determine the best performing ML model. The performance of medium Gaussian SVM, GPR Matérn 5/2, and GPR exponential degrades substantially in the model assessment because of overfitting of the training data during the model development. Concerning R², stepwise linear regression performs better than the other ML models with an R² of 0.8870 and RMSE of 1.5143 MJ m⁻² d⁻¹. Likewise, concerning RMSE, the linear regression learner performs better than the other ML models with an R² of 0.8102 and RMSE of 1.4765 MJ m⁻² d⁻¹.
Considering the generalization capability of temperature-based empirical models, the model proposed by Fan et al. is recommended to estimate daily GSR in regions where only ambient temperature data are available. For regions where data on sunshine duration, maximum and minimum ambient temperature, precipitation, and relative humidity are available, the more robust ANN3 and stepwise linear regression models are recommended to estimate daily GSR.
average GSR data measured per second × sunshine duration × 3600. Supplementary 3: the Sunshine Duration file contains daily sunshine duration data in hours. Supplementary 4: the T_max and T_min file contains daily maximum and minimum ambient temperature data in °C. Supplementary 5: the Rainfall file contains daily rainfall data in millimeters (mm). Supplementary 6: the RH file contains relative humidity data computed a couple of times a day. (Supplementary Materials)