Effective Estimation of Hourly Global Solar Radiation Using Machine Learning Algorithms
International Journal of Photoenergy

The precise estimation of solar radiation is of great importance in solar energy applications with respect to installation and capacity. In estimate modelling on selected target locations, various computer-based and experimental methods and techniques are employed. In the present study, the Multilayer Feed-Forward Neural Network (MFFNN), K-Nearest Neighbors (K-NN), a Library for Support Vector Machines (LibSVM), and M5 rules algorithms, which are among the Machine Learning (ML) algorithms, were used to estimate the hourly average solar radiation of two geographic locations on the same latitude. The input variables that had the most impact on solar radiation were identified and grouped as a result of 29 different applications that were developed by using 6 different feature selection methods with Waikato Environment for Knowledge Analysis (WEKA) software. Estimation models were developed by using the selected data groups and all input variables for each target location. The results show that the estimations developed with the feature selection method were more successful for target locations, and the radiation potentials were similar. The performance of the estimation models was evaluated by comparing each model with different statistical indicators and with previous studies. According to the RMSE, MAE, R², and SMAPE statistical scales, the results of the most successful estimation models that were developed with MFFNN


Introduction
Energy demand, which is an effective parameter in the development of countries, is increasing rapidly with industry, technological advances, and growing population. Not every country has adequate energy resources to meet this need, and the rapid increase in energy consumption forces countries to turn to alternative sources in energy supply. For this reason, countries prefer renewable energy sources such as solar, wind, hydro, bio, hydrogen, geothermal, and tidal energy to meet their energy needs instead of conventional energy sources [1]. Solar energy, which plays an increasingly critical role in electricity generation, has become one of the promising renewable energy sources attracting the attention of countries because it is clean, unlimited, and sustainable compared to fossil fuels. As a result, investments in solar energy for electricity generation have been increasing rapidly in recent years, driven by technological advances in solar energy, global climate change, dependence on other countries, and other environmental factors. In this context, photovoltaics (PV), as one of the application areas of solar energy, is intensively applied to produce electricity [2,3].
PV, which is used reliably in electricity generation, has been growing rapidly in the world for more than 40 years, and the installed capacity of PV power plants has reached 480 GW [4]. Before designing and modelling a PV system in a selected geographical area, solar radiation (SR) data must be measured as the most important input value, because the feasibility of the design in terms of investment is evaluated according to this data. This value is not only necessary in PV designs but is also the most important parameter in many scientific and engineering works on solar energy practices [5]. For this reason, obtaining long-term measured data for the selected geographic area is the most accurate approach. However, measuring the SR everywhere is often not possible, as it requires costly, long, and precise processes. In addition, radiation values cannot be measured accurately in most countries because measurements can only be made in certain areas. For this reason, experimental, statistical, and Artificial Intelligence (AI)-based estimation methods were developed to calculate the value of SR worldwide [6][7][8]. ML algorithms, which are a subfield of AI, are among the most common methods used in estimation studies.
Many studies have been conducted in recent years based on ML algorithms in different geographic areas of the world to estimate SR. In these studies, algorithms including Artificial Neural Network (ANN), Support Vector Machines (SVMs)/Support Vector Regression (SVR), K-NN, Linear/Nonlinear Regression, M5, and Random Forests have been used frequently [9]. However, estimation models were developed in these studies by selecting a specific geographical area of a country or different geographical locations in the country [10]. Before the development of estimation models for a selected geographic location, it must be decided whether hourly, daily, or monthly average global radiation values falling onto a horizontal surface will be used [11]. Notton et al. [12] recommended that the monthly average data should be used if preliminary modelling or draft design is required, and the daily average data can be used if a more comprehensive design is to be established. However, they also indicated that it is necessary to use hourly average or shorter-scale data in more precise and result-oriented designs. Zhang et al. [13] explained that the estimation processes of studies with hourly data are more difficult and complex than those with daily and monthly data. For this reason, estimation models made with hourly data are less common. After determining the input data according to the type of work that will be carried out, the SR values of the target area can be estimated by using one [14], multiple [15], or hybrid [16] ML algorithms. Different solutions were sought for Global Solar Radiation (GSR) estimation problems in developed models by making changes to the functional structure and architecture of a single ML algorithm, by comparing multiple algorithms, or by combining two or more AI methods.
It is possible to classify the planned studies in which the ML method is used in GSR estimation in three different categories according to the measurement time intervals of SR: Monthly Average Global Solar Radiation (MAGSR), Daily Average Global Solar Radiation (DAGSR), and Hourly Average Global Solar Radiation (HAGSR). HAGSR- [17,18], DAGSR- [19][20][21][22], and MAGSR- [23][24][25][26] based estimation models were developed by using one single ML algorithm, and it was noticed that the ANN algorithm was used frequently compared to other algorithms because of its flexible structure and accuracy. On the other hand, studies on the methods in which multiple ML algorithms can be analyzed and used together at the same time are increasing rapidly.
In these types of studies, a clear idea can be obtained of the effectiveness of each ML algorithm on the dataset used, and the most successful models can be compared and evaluated. In this context, Pang et al. [27] estimated GSR comparatively by using ANN and Recurrent Neural Network (RNN) ML algorithms at 10-, 30-, and 60-minute time intervals. Li et al. [28] developed estimation models with the help of seven-year measured hourly data with the Multivariate Adaptive Regression Spline (MARS) ML algorithm to estimate HAGSR and compared their results obtained in Hong Kong with the ANN and logistic regression algorithms. They reported that ANN achieved superior performance compared to the other algorithms. Khosravi et al. [29] developed the most successful estimation models to estimate HAGSR for two different network groups on the Iranian island of Abu Musa by using MFFNN, Radial Basis Function Neural Network (RBFNN), SVR, Fuzzy Inference System (FIS), and Adaptive Neuro-Fuzzy Inference System (ANFIS) ML algorithms. The first network was planned with five inputs, and the second network was planned with one single input, and it was reported that the SVR reached higher estimation accuracy than the other algorithms on both networks. Lotfinejad et al. [30] investigated the DAGSR of different cities of Iran by using Bat Neural Network (BNN), Generalized Regression Neural Network (GRNN), and Neuro-Fuzzy (NF) algorithms. They reported that the models developed with the recommended BNN algorithm performed better than the other algorithms. Meenal and Selvakumar [31] examined a comparative DAGSR estimation model among SVM, ANN, and experimental models by identifying the most suitable input variables from nine input data from four different cities of India and showed that SVM was more successful than the other algorithms. Loutfi et al. [32] developed ten different HAGSR estimation models in the city of Fes, Morocco, with the help of nine different input variables covering the five years from 2010 to 2014 with Multilayer Perceptron (MLP) and Neural Autoregressive with Exogenous Inputs (NARX) algorithms. Among the models developed, they contended that the most successful estimation model was the model developed with NARX. Lazzaroni et al. [33] compared the GSR estimation models that were developed for hourly, daily, and monthly time scales with the SVR and Extreme Learning Machine (ELM) ML algorithms against the K-NN algorithm by using three-year hourly data in Milan, Italy. Long et al. [34] investigated the estimation of DAGSR by using ML-based ANN, K-NN, SVM, and Multivariate Linear Regression (MLR) algorithms and made a comparative analysis of data-driven algorithms. Ozgoren et al. [35] compared the estimation models developed with the ANN and Multi-Nonlinear Regression (MNLR) algorithms to estimate MAGSR in 31 cities of Turkey using five-year input data collected between 2002 and 2006. Moghaddamnia et al. [36] estimated the DAGSR from different meteorological parameters of Britain's Brue Basin by using the Local Linear Regression (LLR), NARX, MLP, Elman Network, and ANFIS ML algorithms.
In the present study, the purpose was to comparatively analyze the HAGSR of two geographical provinces located on the same latitude of the Mediterranean Region by using four different ML algorithms: MFFNN, K-NN, M5 rules, and the SVR-based LibSVM library. Another purpose was to use the WEKA software program to determine the features of input data that have the most impact on SR. For this purpose, the best features were determined in five groups by developing twenty-nine different applications with the help of six feature selection functions. The eventual models of ML algorithms that were used in the study were developed according to the output groups of feature selection functions. HAGSR estimation models were evaluated among themselves and also on the basis of the algorithm that was used, and the results were then compared with similar studies. In addition, unlike previous studies, the present study developed estimation models and evaluated their performance by using the classical SVR algorithm and LibSVM software, which are similar to each other. The framework outlining how the data mining processes and four different ML algorithms are used in this study to evaluate the solar radiation potential of two provinces in the same latitude is shown in Figure 1. The WEKA software was used in data mining processes such as data preprocessing and feature selection, and Matlab R2017b software was used in the modelling studies developed with the ML algorithms for SR estimation. The rest of the study is organized as follows. The provinces for which the models were developed and the meteorological and categorical dataset used in the study are defined in Section 2. 
The details of the feature selection processes that were used to determine the most appropriate input data groups, the methodologies of the MFFNN, SVR-LibSVM, K-NN, and M5 rules algorithms, the architectural and functional structures of the developed models, and the methods applied are also explained in this section. Section 3 includes the results and comparative analyses of the estimation models developed with the ML algorithms used for each input data group. The HAGSR estimation performance of the two provinces, which are located in the same latitude, was evaluated with multiple statistical error methods and was also compared with previous similar studies. The results of this study and its contribution to the literature are summarized in Section 4.

Materials and Methods
In this section, the evident features of the two selected provinces and the editing of data to be used in ML models are explained. Then, the selection procedures of the most effective input groups are mentioned using feature selection processes. The input data were determined in five different groups at the end of the selection process, and the development processes of the best ML models were explained for each group. In addition, the structural characteristics of the ML algorithms that were used in the comparative estimation of HAGSR and the statistical scales that were used in evaluating model performance are discussed in detail.

Study Area and Preparation of Database.
The provinces of Kahramanmaras and Isparta were selected as the study areas by considering the climatic characteristics, elevation, other geographical characteristics, and in particular latitude. The selected provinces are located in the Mediterranean Region and have a high solar power potential with an average annual sunshine time of 2956 hours and an average annual amount of solar energy of 1390 kWh/m² [37]. The location of the selected provinces in the Mediterranean Region and the latitude coordinates of meteorological measurement stations are given in Figure 2.
Radiation data is the most important parameter used in solar energy-based systems. However, radiation cannot be measured at every measuring station across the country; instead it is measured at a limited number of measuring stations. SR was measured for certain locations by the Turkish State Meteorological Service (TSMS), which is a government agency with a large network of stations in Turkey. In the study, the data collected for the target provinces consisted of meteorological data measured by TSMS between 2002 and 2006. The data used were hourly averages of measurements taken every 5 minutes at the measuring stations and comprised Hourly Pressure (P), Hourly Sunshine Duration (HSD), Hourly Humidity (H), Hourly Temperature (T), Hourly Wind Speed (WS), and Hourly Solar Irradiance (HSI). 3D plots showing the change of SR for both monthly and seasonal measurements of yearly intervals of Kahramanmaras and Isparta are given in Figures 3 and 4. The annual distributions of the SR values measured in these charts are given in detail on an hourly basis. The specific characteristics of geographical and meteorological data of the target provinces are given in Table 1.
Data preprocessing, which improves the quality of the raw data to be used in the study, is one of the most important processes and has a direct positive effect on the performance of all computer science-related algorithms [38]. Since ML algorithms are generally data-focused structures, operations such as cleaning, scaling, reduction, and normalization have significant effects on the accuracy of the estimation [39]. In the present study, four categorical variables were included in the meteorological dataset: year of measurement (year), month of the year (month), day of the month (day), and hour of the day (hour). Geographical data were not used since the effect of the latitude was evaluated. Measurement time intervals of the other meteorological data were determined according to HSI measurement time intervals. The data between 06:00-17:00 hours for January and February; 06:00-18:00 hours for March, April, October, November, and December; and 07:00-19:00 hours between May and September were selected. Factors such as measurement time differences between years, variability of the measured time ranges of each month, and the switch between winter time and summer time were effective in selecting the time ranges. Any missing data was calculated by taking the arithmetic average of the data in the same time frames of the previous and following years, and the data that were calculated in this way constituted approximately 4% of all data. A total of 23442 SR data were obtained for each province. After the raw data were arranged, they were scaled with min-max normalization. The normalization formula applied is given in equation (1). In this formula, each input value (x_i) was normalized (X_n) linearly to the range between 0 and 1 by finding the minimum (x_min) and maximum (x_max) values of the raw dataset [40].
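The two preprocessing steps described above, gap filling from neighbouring years and min-max scaling, can be illustrated with a minimal Python sketch. It assumes the standard min-max form X_n = (x_i - x_min) / (x_max - x_min); the function names, the handling of a constant column, and the sample values are hypothetical and are not taken from the study's code or data.

```python
def min_max_normalize(values):
    """Scale values linearly to [0, 1]: X_n = (x_i - x_min) / (x_max - x_min)."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # Assumption: a constant column is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


def fill_from_adjacent_years(previous_year_value, following_year_value):
    """Estimate a missing reading as the arithmetic mean of the readings
    taken in the same time frame of the previous and following years."""
    return (previous_year_value + following_year_value) / 2.0


# Illustrative hourly temperatures (not TSMS measurements).
temperatures = [5.0, 12.5, 20.0, 35.0]
print(min_max_normalize(temperatures))
print(fill_from_adjacent_years(400.0, 500.0))
```

Normalizing each variable with its own minimum and maximum keeps variables with large numeric ranges, such as pressure, from dominating distance-based learners such as K-NN.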
In data selection, since different estimation results are obtained each time when a certain year range is used in training the model and the remaining years are used to test the model, it was ensured that all data at hand were randomly allocated hourly with a specifically coded program, instead of determining year-based training and test data. In this way, it was aimed that the output results of the estimation models were not affected by the data selection by providing a homogeneous distribution in the input data according to years. The number and basic characteristics of the training and test data, which were determined hourly for each province, are given in Table 2.
output, some others may have negative effects, and some have no effect. For this reason, determining the most effective data features on output prior to the modelling process will decrease the dimensionality of the data employed in this process, facilitate the interpretation of the estimation, and shorten the modelling process while increasing the estimation accuracy [10,41]. The methods and techniques used in the selection of the features affecting SR the most as well as the methods and applications used in the current study are compared with similar ML-based studies in Table 3. In commonly used feature selection methods, the features that have the greatest effect on the SR data are found and a new input dataset is determined. Unlike in previous studies, different input data groups were created in this study by evaluating the different input parameters affecting SR data with multiple applications that were developed by the selection methods applied.
In this study, the open-source WEKA program was used to select the features that most affect SR. This program was developed by Waikato University by using the Java programming language. Two feature selection methods based on the wrapper and filter approaches were used to select features that most influenced the SR data in the program. While the filter approach uses simple, fast, and scalable methods, the wrapper approach processes the data by using classification-based techniques. The relations between different features selected in each application and the classifier models were evaluated in this study in the selection processes [43].
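To make the filter idea concrete, the following Python sketch ranks input variables by the absolute Pearson correlation with the target, without training any learner. This is a deliberately simplified stand-in and is not one of the WEKA evaluators used in the study; the function names and the sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no linear information about the target.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Filter-style ranking: score each feature independently of any model."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative normalized values (not measured data).
features = {
    "HSD": [0.0, 0.2, 0.6, 1.0, 0.8],
    "H": [0.9, 0.8, 0.5, 0.3, 0.4],
    "WS": [0.3, 0.1, 0.4, 0.2, 0.3],
}
target = [0.0, 0.25, 0.6, 0.95, 0.8]
for name, score in rank_features(features, target):
    print(f"{name}: {score:.3f}")
```

A wrapper approach would instead train and validate a learner on each candidate subset, which is slower but accounts for interactions between features and the chosen model.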
Instead of processing the data with one single feature selection method, it was aimed to evaluate the effect of input variables on SR by developing multiple applications in each function by using six different feature selection functions based on filter and wrapper approaches. Three of these work as wrapper approach-based functions, and the other three work as filter-based functions.
Classifier Subset Evaluator
In the selection process of the most effective input variables, 29 different applications were developed by using a total of six different feature selection functions. A 10-fold cross-validation method was used in all applications. The screenshot of the application developed by using the CfsSE feature selection function with the BF search method is given in Table 4. It is seen in the table that the year, H, and HSD data were most effective on SR, and the other data had no effect.
Six different data groups, one for each feature selection function, were created with the variables that most affected the SR. The input variables that affected the SR the most according to the selection functions for the provinces for which the models were developed are given in Tables 5 and 6. In processes where more than one selection was applied, the most effective features were determined by evaluating the number of applications and the impact totals of the selected variables. Consequently, a feature was not included in the data group if its impact level on SR was negative, neutral, or very low. Some feature variables calculated for selection functions and the number of inputs were similar in selection processes. The results of the CSE and WSE feature selection functions in Isparta and the results of the CSE and RAE feature selection functions in the data of Kahramanmaras were similar. For each province, the final data groups and feature numbers created to be used in estimation models to be developed with ML algorithms as a result of feature selection processes are given in Table 7.

2.3. MFFNN Algorithm.
ANN is an ML algorithm developed based on nerve cells specific to humans. This structure is known as a computer-modelled version of the biological and intellectual structure of the brain and is used frequently in solving problems such as nonlinear estimations that cannot be calculated by classical calculation methods, time series problems, pattern recognition, and classification [44]. For the past 50 years, many neural network architectures have been developed based on Feed-Forward and Recurrent Networks to be used for various purposes and in a number of fields. Not every architectural structure reaches the same level of success on given input data [45]. For this reason, the MFFNN ML algorithm, which is based on the Feed-Forward architectural structure, is suitable for the available data structure, and exhibits high performance, was used. 
Since MFFNN works with the backpropagation learning algorithm to minimize error, it has greatly increased learning success [25]. The Matlab R2017b software program was used in the development and modelling of this network. The architectural structure and working principles of the MFFNN that was used in the modelling studies are given in Figure 5.
GR1-GR5 represent the input data groups selected at the end of the feature selection process, and GR6 represents all input data that did not undergo any selection process. The architectural structure of the neural network was created in three layers, and each hidden-layer size from 1 to 50 neurons was trained for five iterations. No significant increase in performance was detected with more than 50 neurons, and the working time became considerably longer. In the developed MFFNN models, each input data (X_j) connected to neurons between layers was multiplied by a weight value (W_ij), the bias (b_i) was added, and the net input values (N_i) were calculated. The formula of the net input is given in equation (2).

$$N_i = \sum_{j} W_{ij} X_j + b_i \qquad (2)$$

The net input is activated with a transfer function once it is calculated [46]. A hyperbolic tangent sigmoid transfer function (Tansig) was used between the input layer and the hidden layer and between the hidden and the output layer. By using Tansig, net-input values are scaled in the -1 to +1 range. When determining the transfer function, the logistic sigmoid (Logsig) or Tansig function was found to be suitable in the hidden layer, while Tansig or Linear (Purelin) functions were suitable in the output layer. Choosing a function other than these significantly reduced the performance. The formula for the Tansig transfer function is given in equation (3).

$$f(N_i) = \frac{2}{1 + e^{-2 N_i}} - 1 \qquad (3)$$

The Levenberg-Marquardt Backpropagation (Trainlm) training function was used in the MFFNN. Other training functions such as Trainbr (Bayesian Regularization Backpropagation) and Traincgb (Conjugate Gradient Backpropagation) were also tested. However, since the best performance was provided with Trainlm, this training function was selected.
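The net-input and Tansig steps described above can be sketched as a forward pass through a three-layer network in Python. This is only an illustration of the arithmetic, assuming Tansig in both the hidden and output layers; it does not reproduce the Matlab models or the Trainlm training, and the weights, biases, and function names are hypothetical.

```python
import math


def tansig(n):
    """Hyperbolic tangent sigmoid: 2 / (1 + exp(-2n)) - 1, which equals tanh(n)."""
    return 2.0 / (1.0 + math.exp(-2.0 * n)) - 1.0


def layer_forward(inputs, weights, biases):
    """For each neuron i: N_i = sum_j W_ij * X_j + b_i, followed by Tansig."""
    outputs = []
    for weight_row, bias in zip(weights, biases):
        net_input = sum(w * x for w, x in zip(weight_row, inputs)) + bias
        outputs.append(tansig(net_input))
    return outputs


def mffnn_predict(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer -> single output neuron."""
    hidden = layer_forward(x, hidden_w, hidden_b)
    return layer_forward(hidden, out_w, out_b)[0]


# Hypothetical weights for 3 inputs, 2 hidden neurons, and 1 output.
hidden_w = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]]
hidden_b = [0.0, 0.1]
out_w = [[0.5, -0.5]]
out_b = [0.05]
print(mffnn_predict([0.2, 0.5, 0.8], hidden_w, hidden_b, out_w, out_b))
```

Because Tansig bounds every activation between -1 and +1, targets normalized to the 0 to 1 range fall inside the reachable output range of the network.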
2.4. SVR Algorithm.
SVM is known as the ML algorithm that was developed by Vapnik and commonly used in classification problems. The smallest subsets of training data are used to find the best prediction model between two classes with SVM [47]. However, since it was not adequate in multiclass estimation problems, the SVM-based SVR method was developed. SVR uses a technique based on regression problems and based on calculating a linear regression function in a multidimensional feature set [48]. The architectural structure of the SVR used in modelling studies is given in Figure 6.
The gaps between the data are kept wide in the SVR algorithm, ideal locations are found, and errors are minimized. In a dataset with a certain number of elements, $\{(x_a, y_a),\ a = 1, 2, 3, \dots, M\}$, $x_a \in \mathbb{R}^d$ represents the input vector, $y_a \in \mathbb{R}$ represents the corresponding output, and M represents the total number of elements [49]. The formula of the SVR linear function is given in equation (4).

$$f(x) = W \cdot \varphi(x) + b \qquad (4)$$

Here, $\varphi(x)$ represents the nonlinear mapping function which converts multidimensional data structures into a two-dimensional chart, W represents the weight vector, and b represents bias. The error function is given in equation (5). The constant C and the ε values are determined by the user and are defined as the estimation accuracy of the training data.
The equation that minimizes the error function is given in equation (6). $\alpha_a^{*}$ and $\alpha_a$ are the Lagrange multipliers, and the training vectors whose multipliers have a value other than zero are referred to as the support vectors. These are known as the critical values for SVR algorithms [50]. The $K(x_a, x)$ structure is called the kernel function and converts the data it receives as input into a usable form. Different types of kernel functions are used in SVR. Three different types of kernels, i.e., Polynomial (POL), Normalized Polynomial (NOR-P), and Gaussian Radial Basis Functions (RBF), were used in the models that were developed with SVR, and formulas for these functions are given in equations (7)-(9), respectively.
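As an illustration of the three kernel families named above, the Python sketch below uses common textbook forms. These are assumptions: the exact expressions and parameter values of equations (7)-(9) are not reproduced in this section, so the `+ 1.0` offset, the `degree` and `gamma` defaults, and the function names are hypothetical choices.

```python
import math


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, degree=2):
    # Assumed form with lower-order terms: K(u, v) = (u . v + 1)^degree.
    return (dot(u, v) + 1.0) ** degree


def normalized_polynomial_kernel(u, v, degree=2):
    # Assumed form: K(u, v) / sqrt(K(u, u) * K(v, v)), so that K(u, u) = 1.
    numerator = polynomial_kernel(u, v, degree)
    denominator = math.sqrt(polynomial_kernel(u, u, degree) * polynomial_kernel(v, v, degree))
    return numerator / denominator


def rbf_kernel(u, v, gamma=0.5):
    # Assumed form: K(u, v) = exp(-gamma * ||u - v||^2).
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


x_a = [1.0, 2.0]
x = [3.0, 4.0]
print(polynomial_kernel(x_a, x))
print(normalized_polynomial_kernel(x_a, x))
print(rbf_kernel(x_a, x))
```

All three functions return a similarity score for a pair of input vectors; the normalized polynomial and RBF forms are bounded above by 1 for identical inputs, which keeps kernel values on a comparable scale.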
In addition to the classic SVR algorithm, estimation models were developed in the study with LibSVM, which is an SVM/SVR-based software library supporting single-class SVMs, two-class and multiclass SVMs, and SVRs [51]. LibSVM was preferred because it is used quite frequently in academic studies but is not widely used in SR prediction studies. Two different SVR types (Epsilon-SVR (E-SVR) and Nu-SVR) were used in the estimation models that were developed with LibSVM, and RBF was used as the kernel. All prediction models were developed with Matlab R2017b software using the LibSVM library interface plugin.

KNN Algorithm.
This algorithm is widely preferred in classification problems. However, a regression-based method was used in the present study. KNN is a nonparametric, lazy ML-based learning algorithm and estimates by searching for the closest neighbors in the training dataset. KNN's nonparametric equation is given in equation (10), where N_k(x) denotes the K nearest neighbors of the data point x. In the formula, the y_i value represents the target output for each x_i training data.
With KNN, each new data point to be estimated is compared with its K nearest neighbors among the previous data. The distance between the new data value and all values in the training dataset is calculated, and then the nearest K training data values are determined. The average of the target output values of these neighbors is taken as the estimate [52,53]. The Euclidean function was used for the calculation of the distance. The formula for the Euclidean function is given in equation (11). Care should be taken in choosing the K value; small values should be used since the model tends to overfit if the selected value is too high [54]. In the present study, the K value was taken as 1, 2, 3, 4, 6, and 10, and six different KNN models were developed for each data group. The model was deemed to overfit with a K value of more than 10. The linear nearest neighbor (LinearNN), which is a brute-force search algorithm, was also used in the study. With this structure, the distance between each pair of points in the dataset was computed.
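The KNN regression procedure described above (Euclidean distance to every training point, selection of the K nearest, and averaging of their targets) can be sketched in Python as follows. The brute-force search mirrors the linear nearest-neighbor idea; the function names and sample values are hypothetical and the sketch is not the implementation used in the study.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_regress(train_x, train_y, query, k):
    """Average the targets of the k training points closest to the query.
    Brute-force search: the distance to every training point is computed."""
    distances = sorted(
        (euclidean(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / k


# Illustrative one-feature training set (not measured data).
train_x = [[0.0], [1.0], [2.0], [3.0]]
train_y = [0.0, 0.1, 0.4, 0.9]
for k in (1, 2, 4):
    print(k, knn_regress(train_x, train_y, [1.2], k))
```

Running the loop with several K values, as the study did with K = 1, 2, 3, 4, 6, and 10, shows how the estimate moves from the single closest target toward the mean of the whole training set as K grows.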
2.6. Rule-Based M5 Algorithm.
The M5 algorithm was developed by Quinlan as an advanced version of the Classification and Regression Tree (CART) [55]. It is based on a binary decision tree structure that develops a relation between dependent and independent variables and creates a linear regression model on each leaf to estimate the value of the samples reaching that leaf. The algorithm is established on two structures, which are the decision tree and the linear regression. The best leaf is determined as the rule in the M5 algorithm, and the tree is built in two stages, dividing and pruning. In the dividing operation, the dataset at hand is divided into subsets to create a decision tree. It is also ensured in this process that numerical features are constantly estimated on each node by using a linear regression function in leaf nodes [56]. Standard deviation is used to measure the error in the relevant node, and the feature whose division reduces this error the most is selected. The division ends if there is little change in the values of the samples that reach a node or if the number of samples decreases too much [57]. The Standard Deviation Reduction (SDR) formula is given in equation (12), where T is defined as the set of feature values reaching the node, T_i is the feature values taken from the divided node, and std is the standard deviation [57].
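The dividing step can be illustrated with a short Python sketch of the Standard Deviation Reduction criterion. It assumes the commonly cited form SDR = std(T) - sum_i (|T_i| / |T|) * std(T_i) and uses the population standard deviation; the threshold search, the function names, and the sample values are hypothetical and do not reproduce the M5 rules implementation used in the study.

```python
from statistics import pstdev


def sdr(parent, subsets):
    """Standard Deviation Reduction of splitting `parent` into `subsets`."""
    total = len(parent)
    weighted = sum(len(subset) / total * pstdev(subset) for subset in subsets)
    return pstdev(parent) - weighted


def best_split(samples):
    """samples: list of (feature_value, target) pairs.
    Returns the (threshold, sdr) pair with the largest reduction."""
    ordered = sorted(samples)
    targets = [target for _, target in ordered]
    best = (None, 0.0)
    for i in range(1, len(ordered)):
        if ordered[i - 1][0] == ordered[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (ordered[i - 1][0] + ordered[i][0]) / 2.0
        gain = sdr(targets, [targets[:i], targets[i:]])
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Illustrative (feature, target) pairs with two clearly separated regimes.
samples = [(1, 0.10), (2, 0.12), (3, 0.11), (8, 0.90), (9, 0.95), (10, 0.92)]
print(best_split(samples))
```

A split is worthwhile when the targets inside each branch vary much less than the targets of the parent node, which is exactly what a large SDR value indicates.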
A rule-based type of the M5 algorithm was used in the present study. In this method, which is also known as M5 rules, a series of M5 trees are created where the best leaf (rule) is stored, and the samples covered by the best rule in each cycle are removed from the training dataset before creating the next tree. While the M5 algorithm creates one single decision tree, M5 rules create a complete tree in each cycle. M5 rules develop a series of rules based on the M5 algorithm by using the Partial and Regression Tree (PART) algorithm [58].

Figure 6: SVR's architectural structure.
Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Symmetric Mean Absolute Percentage Error (SMAPE) are the error measurement statistics used in the study. Two different statistical analysis methods, Correlation Coefficient (R) and Coefficient of Determination (R²), were also used. The formulas used for the statistical scales are given in equations (13)-(18), respectively. In the formulas, $O_i$, $P_i$, $\bar{O}$, and $\bar{P}$ are the measured, estimated, and measurement and estimation averages, respectively.
The percentage errors are used widely to compare the estimation performance of various datasets. MAPE, which is a scale-independent estimation error calculation method, becomes undefined or extremely large when the measured values are zero or close to zero [59]. The SMAPE percentage scale was used to overcome this problem since the measurement and estimation results had values that were zero or quite close to zero.
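The error statistics named above can be sketched in Python as follows. The definitions are assumptions based on common usage, since equations (13)-(18) are not reproduced in this section: SMAPE is taken as the mean of |P_i - O_i| / ((|O_i| + |P_i|) / 2) in percent, pairs where both values are zero are counted as zero error, and R² is computed as 1 - SS_res / SS_tot. Function names and sample values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root Mean Square Error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean Absolute Error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def smape(observed, predicted):
    """Symmetric MAPE in percent (assumed form, see lead-in)."""
    total = 0.0
    for o, p in zip(observed, predicted):
        denominator = (abs(o) + abs(p)) / 2.0
        if denominator > 0:
            # Pairs with O_i = P_i = 0 contribute no error instead of 0/0.
            total += abs(p - o) / denominator
    return 100.0 * total / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed form)."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Illustrative normalized radiation values, including a zero reading.
observed = [0.0, 0.5, 1.0]
predicted = [0.0, 0.4, 0.9]
print(rmse(observed, predicted), mae(observed, predicted))
print(smape(observed, predicted), r_squared(observed, predicted))
```

The zero reading in the sample shows why a symmetric denominator helps: a plain MAPE would divide by the observed value of zero, whereas this form stays finite.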

Results and Discussion
Although SR estimation studies have used input data from many areas and locations, it is not common to evaluate SR on the same latitude and at locations that have similar geographical characteristics. Based on the effect of latitude on sunshine duration and the incidence angle of solar rays, Darhmaoui and Lahjouji [60] calculated that the annual solar radiation values were at similar levels at the same latitude points of a geographical area, with a strong relationship between optimum tilt angle and target latitude value. Ahlgren et al. [61] emphasized the relationship between annual yield and latitude because there was a directly proportional relationship between latitude and direct normal radiation where parabolic trough collectors were located. For this reason, places that had the same latitude coordinates were selected on the target area, and ML algorithms were employed for high-accuracy GSR estimation. The estimated results of the data groups were compared by using statistical error measurement and analysis methods including SMAPE, MAE, RMSE, and R² to evaluate the training and testing estimation performance of the developed models. The closer the error measures (SMAPE, MAE, and RMSE) between the measured and estimated values are to 0, and the closer the analysis measure (R²) is to 1, the higher the estimation accuracy of the developed models [40]. The flowchart of the HAGSR estimation processes of both provinces is given in Figure 7.
Different features were used for each data group to estimate SR with the MFFNN algorithm, by employing the data groups GR1-GR6. During training, five iterations were run for each hidden-layer size between 1 and 50 neurons, so 250 different models were developed for each input group and 1500 different models in total for each province. The network performance plots, over 0 to 1000 epochs, of the models that reached the best estimation results in the training process for both provinces are shown in Figure 8. The training, validation, and testing SR estimates of each neural network model were evaluated statistically, and the results of the most successful MFFNN estimation models are given in Table 8.
As seen in Table 8, the most successful estimation models were obtained by using the GR3 input data for Isparta and the GR2 input data for Kahramanmaras. The most successful MFFNN model for Isparta had 48 hidden-layer neurons and was found in the first iteration, whereas the most successful model for Kahramanmaras had 40 hidden-layer neurons and was found in the second iteration. The training performance of the most successful models for Isparta and Kahramanmaras was MSE = 0.0025 and 0.0023 and R = 0.9768 and 0.9845, respectively. When the best estimates were compared with the measured values in the test data, the R², MAE, and SMAPE values were 0.9488, 0.0352, and 7.77%, respectively, for Isparta and 0.9656, 0.0341, and 7.79%, respectively, for Kahramanmaras. The estimation performance for the two target areas was evaluated with different measures, and both the training and the test results were very similar.
Boxplots of the measured and estimated values for the selected provinces are given in Figure 9. These plots show the average statistical error between the test data and the estimated values for each province. Scatter plots of the measured and estimated values for the most successful model developed for each data group are given in Figure 10. Both sets of plots show that a high level of correlation was achieved for the GR2, GR3, and GR6 data groups in Isparta and for the GR2, GR3, GR5, and GR6 data groups in Kahramanmaras.
Another ML algorithm employed in estimating HAGSR is SVR. The accuracy of the SVR-based estimation models was found to be quite low. Therefore, it was decided that the classic SVR estimation results should be compared with those of LibSVM, which is another SVR-based method. LibSVM was preferred because it is a well-known method in the academic literature. In both methods, numerous models were created to determine the most suitable combinations of the user-defined parameters C (complexity or cost parameter), epsilon (error parameter), and Nu (parameter used in Nu-SVR instead of epsilon), and the most successful estimation models were developed from these combinations. The performance results of the 18 different models that were developed for each province by using the POL, NOR-P, and RBF kernel functions with SVR are shown in Figure 11. According to R², the estimates ranged from 0.6786 to 0.8596 for Kahramanmaras and from 0.5273 to 0.7969 for Isparta. A total of 12 different estimation models were developed with LibSVM for the data groups in each province. The statistical results of the SR estimation models that were developed by using the RBF kernel function for two different regression-based SVR algorithms are given in Table 9.

Figure 12: Scatter plots of the most successful models that were developed according to SVR and LibSVM for (a) Isparta and for (b) Kahramanmaras.

The models that were developed with Nu-SVR were more successful than those developed with E-SVR. The model that was developed with the GR2 data group had the most successful estimation [...] 0.0752, 0.0573, 12.11%, and 0.8995, respectively. Comparative scatter plots of the most successful models developed with the two SVR methods for the selected provinces are given in Figure 12.
As seen in Figure 12, the best SR estimation results of the models developed with LibSVM were more successful than those of the classic SVR. For this reason, it was decided to use the LibSVM estimation results in the comparative evaluation of the ML algorithms.
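
For reference, the RBF kernel used with the SVR models above measures the similarity of two input vectors as K(x, x') = exp(-gamma * ||x - x'||^2). The sketch below evaluates it in plain Python. The function name, the gamma value, and the sample vectors are hypothetical; the kernel parameters used in the study are not stated in this section.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Hypothetical normalized inputs (e.g. month, hour, T, H, HSD scaled to [0, 1]).
x1 = [0.5, 0.50, 0.62, 0.40, 0.9]
x2 = [0.5, 0.54, 0.60, 0.45, 0.8]
similarity = rbf_kernel(x1, x2, gamma=1.0)
```

The kernel equals 1 for identical inputs and decays toward 0 as the inputs move apart; a larger gamma makes that decay faster, which is why gamma is usually tuned together with C, epsilon, or Nu.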
A total of 36 different estimation models were developed for each province with the KNN ML algorithm, by using the selected input data with six different K-neighbor values between 1 and 10. The estimation performance results of the two most successful models that were developed in each data group with user-defined K parameters are given in Table 10. It was determined that the K [...] Figure 13. The SMAPE values show that the SR estimations for the two provinces are very close.

Six different rule-based estimation models were developed for each province with the M5 rules algorithm by using the selected data groups of the target cities. The estimation performance of the developed models is given in [...] estimated by using the GR5 data group, which is unlike the other ML algorithms employed in the study. Isparta, on the other hand, was best estimated by using the GR3 data group, as with the other algorithms. According to the RMSE, MAE, R², and SMAPE statistical scales, values of 0.0610, 0.0418, 0.9506, and 9.01%, respectively, were obtained for the best model for Kahramanmaras. Similarly, 0.650, 0.0441, 0.9254, and 8.42% were obtained for the province of Isparta. Scatter plots of the most successful models developed for the target cities are given in Figure 14. The plots show that the data distributions and performance metrics of the target provinces were very close to each other.

Figure 15: MAE performance comparisons of the best models developed by using ML algorithms according to data groups for (a) Kahramanmaras and (b) Isparta.

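
As an illustration of the KNN estimation described earlier in this section, the sketch below estimates a target value as the unweighted mean of the K nearest training samples and compares a few K values. The Euclidean distance, the unweighted average, the function name, and the sample data are assumptions made for illustration; the distance function and the K values used in the study are not given here.

```python
import math

def knn_estimate(train_x, train_y, query, k):
    """Estimate the target at `query` as the mean target of its k nearest
    training points (Euclidean distance, unweighted average)."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Pair each distance with its target and sort by distance.
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical normalized training samples: (hour, HSD) -> solar radiation.
train_x = [(0.25, 0.1), (0.40, 0.6), (0.50, 0.9), (0.60, 0.8), (0.75, 0.2)]
train_y = [0.10, 0.55, 0.90, 0.80, 0.20]

# Compare a few K values for the same query point.
estimates = {k: knn_estimate(train_x, train_y, (0.5, 0.85), k) for k in (1, 3, 5)}
```

With K = 1 the estimate copies the single closest sample, while larger K values average over more neighbors and smooth the estimate, which is why several K values are typically tried for each data group.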
Apart from the trial runs carried out with all ML algorithms in the target provinces to increase estimation accuracy and to select the most successful models in each data group, 3000, 72, 12, and 24 different estimation models were developed with the MFFNN, KNN, M5 rules, and SVR-based LibSVM algorithms, respectively. In Kahramanmaras, the models that were developed with the GR2 (month, hour, T, H, and HSD) and GR5 (month, hour, T, H, WS, and HSD) data groups, which were determined at the end of the feature selection process, were the most successful in all studies. In Isparta, however, the models that were developed with the GR3 (year, month, day, hour, P, T, and HSD) data group showed higher performance. The statistical comparisons of the best-performing models for each ML algorithm used in the SR estimations of both provinces are given in Table 12. Based on the statistical scales employed in the study, the MFFNN algorithm estimated SR more accurately than the other algorithms in both provinces. The KNN and M5 rules algorithms achieved similar estimation results for each province, and the lowest performance values were found for the SVR models developed with LibSVM. According to SMAPE, the SR estimation results achieved in Kahramanmaras and Isparta were 7.79% and 7.77%, respectively, with the MFFNN algorithm; 8.84% and 8.88%, respectively, with the KNN algorithm; and 12.14% and 12.11%, respectively, with the LibSVM algorithm. The closeness of the SMAPE results for the two provinces is associated with the similarity of their latitude and of some of their geographical characteristics.
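
The model counts reported above follow from the search settings described for each algorithm. The short Python check below reproduces that arithmetic. The variable names are illustrative, and the split of the 12 LibSVM models per province into two SVR types per data group is an assumption inferred from the text, not a stated fact.

```python
PROVINCES = 2      # Kahramanmaras and Isparta
DATA_GROUPS = 6    # GR1-GR6

# MFFNN: 5 iterations for each hidden-layer size from 1 to 50, per data group.
mffnn_per_group = 5 * 50
mffnn_total = mffnn_per_group * DATA_GROUPS * PROVINCES

# KNN: 6 K values per data group.
knn_total = 6 * DATA_GROUPS * PROVINCES

# M5 rules: one model per data group.
m5_total = 1 * DATA_GROUPS * PROVINCES

# LibSVM: 12 models per province, assumed here to be two SVR types
# (E-SVR and Nu-SVR) for each data group.
libsvm_total = 2 * DATA_GROUPS * PROVINCES

totals = {
    "MFFNN": mffnn_total,
    "KNN": knn_total,
    "M5 rules": m5_total,
    "LibSVM": libsvm_total,
}
print(totals)  # {'MFFNN': 3000, 'KNN': 72, 'M5 rules': 12, 'LibSVM': 24}
```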
In the HAGSR estimation studies, the final performance of the estimation models that were developed with the GR6 data group, which uses all the available input data, was lower than that of the models developed with the GR2, GR3, and GR5 data groups, which were created at the end of the feature selection process. MAE and SMAPE performance plots for the four ML algorithms and all data groups used for the target provinces are given in Figures 15 and 16. It is clearly seen that feature selection contributes positively to the performance of the developed estimation models. The HAGSR estimation models that were developed for Kahramanmaras and Isparta estimated the solar energy source of the target areas quite well in general. However, the GR1 and GR4 data groups had the lowest estimation performance in both provinces. As a result, it was concluded that the CfsSE and CorrAE feature selection functions, applied to the meteorological and categorical input datasets, were inadequate in determining the best input data. The most successful feature selection functions were ClassAE and WSE for Kahramanmaras and CSE and WSE for Isparta. The SR estimations of the best models developed with the four ML algorithms, using the 7 inputs determined with the CSE and WSE feature selection functions for Isparta, are compared with the real measurements in Figure 17. Similarly, the comparisons of the best models developed with the 5 inputs from the ClassAE feature selection function and the 6 inputs from the WSE feature selection function for Kahramanmaras are given in Figure 18. Five days of hourly input data, July 5-9, 2004, selected randomly from the test data, were used for the comparisons. It is seen that the HAGSR estimation models that were developed for Kahramanmaras are slightly more successful in estimating SR than those for Isparta, where the test data time periods were selected randomly for each day.

Figure 17: Comparative performance plot of ML algorithm models measured and estimated by using five-day input data from July 2004 in Isparta according to HAGSR results.

Figure 18: Comparative performance plot of ML algorithm models measured and estimated by using five-day input data from July 2004 in Kahramanmaras according to HAGSR results.
The HAGSR estimation models developed with ML algorithms in the literature are compared with the most successful model developed in this study in Table 13. As in this study, the most successful models in previous studies were commonly based on neural networks. The accuracy of the proposed estimation model is better than, or similar to, that of previous studies.

Conclusion
In the present study, a comparative evaluation was made by developing models based on four different ML algorithms (MFFNN, KNN, SVR-based LibSVM, and M5 rules) to predict the HAGSR of the provinces of Kahramanmaras and Isparta, which are located on the same latitude in the Mediterranean Region. The most suitable input features were determined for each feature selection function by using meteorological and categorical input data and by developing 29 different applications based on six different feature selection functions with WEKA, and the input data were arranged in five different selection groups (GR1-GR5). Adding the GR6 data group, which contains all input data, gave six different input datasets for modelling. The most successful estimation models were developed with the MFFNN algorithm in Kahramanmaras and Isparta by using the GR2 and GR3 data groups, respectively. Whereas month, hour, T, H, and HSD were the most effective features for the estimation models in Kahramanmaras, year, month, day, hour, P, T, and HSD were the most effective in Isparta. HSD is clearly the most effective input for SR in all selected data groups. The results show that, with the data groups created at the end of the selection process, the predictive accuracy of the models increased, modelling time decreased, and the models became easier to interpret.
Across the data groups, the performance of the KNN and M5 rules models was quite similar in each province. In Kahramanmaras, the estimation model developed with the KNN algorithm for the GR2 data group reached R² = 0.9511, compared with R² = 0.9506 for the M5 rules algorithm. In Isparta, the estimation model developed with the KNN algorithm for the GR3 data group reached R² = 0.9261, compared with R² = 0.9254 for the M5 rules algorithm. The lowest performance was obtained for the GR1 and GR4 data groups in each province.
The best SR estimation performance in the two provinces was achieved with the MFFNN algorithm. In statistical terms, very close values were obtained in Kahramanmaras and Isparta. The MAE of the most successful MFFNN model was 0.0341 for Kahramanmaras and 0.0352 for Isparta. Similarly, the SMAPE of the most successful model was 7.79% for Kahramanmaras and 7.77% for Isparta. Although the other ML algorithms used in the study performed less well, they also produced similar results for the two provinces. These results show that the two cities, which are very far from each other, have similar SR estimation potentials and that latitude and geographical characteristics have significant effects on these similarities. In the present study, the HAGSR potential of both cities was estimated successfully, with accuracy better than, or similar to, that of the previous studies compared in Table 13. In future studies, different parts of Turkey and the world should be evaluated in terms of the performance of various ML algorithms and time intervals.

Data Availability
The data used to support the findings of this study are available upon request from the corresponding author or from the Turkey General Directorate of Meteorology Meteorological Data Information Sales and Presentation System (MEVBIS) website: https://mevbis.mgm.gov.tr/mevbis/ui/index.html#/Workspace.