Bandwidth Selection in Geographically Weighted Regression Models via Information Complexity Criteria

The geographically weighted regression (GWR) model is a local spatial regression technique used to determine and map spatial variations in the relationships between variables. In the GWR model, the bandwidth is very important because it can change the parameter estimates and affect the model performance. In this study, we applied the information complexity (ICOMP) type criteria to the selection of a fixed bandwidth for the first time in the literature. The ICOMP-type criteria use a complexity measure that captures how the parameters in the model relate to each other. A simulation study and a real dataset example were conducted. The simulation results demonstrate that GWR models built with bandwidths selected by ICOMP-type criteria show superior performance. In addition, when the bandwidth is selected according to the ICOMP-type criteria and the GWR model is fitted to the actual total fertility rate data, the spatial distribution of the total fertility rate estimates is quite compatible with the distribution of the actual total fertility rate. According to the results, ICOMP-type criteria can be used effectively instead of the classical criteria in the literature for bandwidth selection in the GWR model.


Introduction
Spatial regression analysis has become one of the important fields of statistics in recent years. In studies where space is important, classical statistical methods are insufficient to explain statistical change and to draw statistical inferences. Therefore, spatial statistical methods have come into use in this field. Spatial statistical methods include spatial models that contain spatial information and consider the effects of locations on observations. The geographically weighted regression (GWR) method is a local spatial regression technique used to model relationships across geography [1]. Using regression, GWR can generate estimates for other points of known location from reference points whose locations and features are known. Unlike the classical regression model, the coefficients are not fixed in the GWR model, and each spatial point has its own coefficients [2]. In the GWR approach, to estimate the parameters at a regression point, a neighborhood weight is determined for every reference point adjacent to it. This neighborhood weight is usually determined by a kernel function using the Euclidean distance. Bandwidth is the distance metric or number of neighbors used for each local regression equation, and it is the most important parameter to consider in the GWR approach because changing it can change the coefficient estimates [3]. As the bandwidth increases, the weights decay more slowly with distance, so distant reference points contribute more and the local variation of the parameters decreases. The regression equations thus turn into a general equation rather than local ones. As the bandwidth decreases, the weights concentrate on nearby reference points and the local variation of the parameters increases. In this case, however, the equation may not give correct results because few reference points are taken into account [4]. Studies to improve the accuracy of GWR approaches generally focus on the calibration of bandwidths [5][6][7].
These approaches mainly focus on finding the appropriate bandwidth values for the datasets and are not concerned with the spatial datasets, altitude, or temporal stationarity. The bandwidth value can be a fixed value or an adaptive value for the whole dataset, depending on the distribution density of the locations in the data [8]. Cross-validation (CV), generalized cross-validation (GCV), the Akaike information criterion (AIC), and the Bayesian information criterion (BIC) are among the methods used to find the optimal bandwidth value in the GWR method [9,10]. Guo [4] constructed a forest plot with a clustered spatial model of tree positions to investigate the effects of different kernel functions and different bandwidths on the model performance and coefficient estimates of the GWR method. Cho et al. [6] selected GWR bandwidths using CV and the smallest spatial error Lagrange multiplier test statistic as calibration methods. Yacim and Boshoff [11] defined five different GWR models by choosing different kernels and bandwidths in a residential data sample and compared the model performances. Koc and Akın [12] selected the bandwidth with the CV criterion and applied GWR models with different kernel functions at fixed bandwidth. Yuan et al. [13] examined the effects of AIC and of different bandwidths on GWR results. Hu et al. [14] introduced a two-dimensional bandwidth matrix for parameter estimation in the GWR model. Punzo et al. [15] examined local differences in the effects of the main sociodemographic, economic, and institutional determinants of land consumption by using the bandwidth-adjusted AIC in the GWR method. It is clear that bandwidth selection has a strong influence on the descriptive and predictive power of GWR models. In the literature, CV and AIC are generally used for bandwidth selection.
In this study, we propose to use the information complexity (ICOMP) criteria proposed by Bozdogan [16,17] for the selection of bandwidth in the GWR model. ICOMP uses an information-theoretic measure of general model complexity based on Van Emden's generalized covariance complexity and the Kullback-Leibler distance [18,19]. The purpose of ICOMP-type criteria is to achieve the optimal balance between the complexity and the fit of a model. ICOMP aims to establish this balance through a complexity measure that captures how the parameters in the model are related to each other. Therefore, although it is based on the Akaike information criterion, unlike AIC it penalizes the covariance complexity of the model instead of directly penalizing the number of independent parameters. Because of this theoretical basis, the use of ICOMP-type criteria for bandwidth selection in the GWR model will increase confidence in the selected bandwidth and will bring a new perspective to the literature. The study is organized as follows. In Section 2, we present the model and methods: we first explain geographically weighted regression and then apply the ICOMP-type criteria to the regression model. In Section 3, we present the simulation results on bandwidth selection using the ICOMP-type criteria. In Section 4, a real dataset example on the total fertility rate is presented. Section 5 gives the conclusions obtained from this study.

Materials and Methods
In this section, we introduce the geographically weighted regression method and the information complexity criteria. First, the geographically weighted regression model is a local spatial regression technique that produces predicted values at other points from reference points whose positions and properties are known [1]. Unlike the linear regression model, the coefficients in the GWR model are not constant; a separate set of coefficients is estimated for each spatial point [2]. The GWR model is given by
$$y_i = \alpha_0(u_i, v_i) + \sum_{k=1}^{m} \alpha_k(u_i, v_i)\, x_{ik} + \varepsilon_i. \tag{1}$$
In equation (1), $(u_i, v_i)$ are the latitude and longitude coordinates of the $i$th location in space, $y_i$ is the dependent variable, $x_{ik}$ $(k = 1, 2, \ldots, m)$ are the independent variables, $\alpha_k$ are the coefficients of the GWR regression model (with $\alpha_0$ the intercept), and $\varepsilon_i$ is the error at the $i$th location, which is assumed to be an independent and identically distributed normal random variable with mean zero and constant variance $\sigma^2$.
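The weighted least squares method provides the basis for estimating the GWR parameters. The parameter estimates of the GWR model are obtained as
$$\hat{\alpha}(u_i, v_i) = \left(X^{T} W(u_i, v_i)\, X\right)^{-1} X^{T} W(u_i, v_i)\, y,$$
where $X$ is the matrix of the $m$ independent variables and $W(u_i, v_i)$ is the diagonal weight matrix shown below [1]:
$$W(u_i, v_i) = \operatorname{diag}\left(w_{i1}, w_{i2}, \ldots, w_{in}\right),$$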
where $w_{ij}$ is the neighborhood weight between regression point $i$ and reference point $j$. Kernel functions used to calculate $w_{ij}$ include the global model, box-car, exponential, Gaussian, bi-square, and tricube kernels [20]. The Gaussian kernel function is commonly used [21,22], and $w_{ij}$ is then calculated as in the following equation:
$$w_{ij} = \exp\left(-\frac{1}{2}\left(\frac{d_{ij}}{bw}\right)^{2}\right),$$
where $bw$ is the bandwidth value and $d_{ij}$ is the distance between regression point $i$ and reference point $j$. $d_{ij}$ is usually the Euclidean distance and is calculated as
$$d_{ij} = \sqrt{(u_i - u_j)^2 + (v_i - v_j)^2},$$
where $u$ and $v$ are the point coordinates.
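The kernel weighting and local estimation described above can be sketched in Python as follows. This is a minimal illustration, assuming a single predictor with an intercept (so that the weighted least squares solution has a closed form) and a Gaussian kernel written as exp(-0.5 (d_ij / bw)^2); the function names, toy coordinates, and toy data are hypothetical and are not taken from this study.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two (u, v) coordinate pairs."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def gaussian_weights(coords, i, bw):
    """Gaussian kernel weights w_ij of every reference point j for regression point i."""
    return [math.exp(-0.5 * (euclidean(coords[i], c) / bw) ** 2) for c in coords]

def local_wls(x, y, w):
    """Weighted least squares for y = a0 + a1 * x (one predictor, closed form)."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx ** 2
    a1 = (sw * swxy - swx * swy) / det
    a0 = (swy - a1 * swx) / sw
    return a0, a1

# Hypothetical toy data: the slope is 1 in the west and 3 in the east.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (8.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0, 3.0, 6.0, 9.0]

for bw in (1.0, 100.0):
    w = gaussian_weights(coords, 0, bw)
    a0, a1 = local_wls(x, y, w)
    print(f"bw={bw}: local slope at the first point = {a1:.3f}")
```

With the small bandwidth, the estimate at the first point is driven almost entirely by its western neighbors; with the very large bandwidth, the weights are nearly uniform and the local fit approaches the global regression.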
The value of the bandwidth parameter $bw$ can be constant for the whole GWR model, or it can vary according to the point density at each location. The optimal bandwidth value can be determined by cross-validation (CV), generalized cross-validation (GCV), the Akaike information criterion (AIC), and the corrected Akaike information criterion (AICc) [23]. The cross-validation criterion is given as follows [24]:
$$CV(bw) = \sum_{i=1}^{n} \left(y_i - \hat{y}_{\neq i}(bw)\right)^{2},$$
where $\hat{y}_{\neq i}(bw)$ is the fitted value of $y_i$ obtained by omitting the $i$th point from the calibration process [1]. The generalized cross-validation criterion for GWR is defined as [25]
$$GCV(bw) = \frac{n \sum_{i=1}^{n} \left(y_i - \hat{y}_i(bw)\right)^{2}}{\left(n - \operatorname{tr}(S)\right)^{2}},$$
where $\hat{y}_i(bw)$ is the fitted value of $y_i$ using a bandwidth of $bw$, $n$ is the sample size, and $\operatorname{tr}(S)$ denotes the trace of the hat matrix [1]. The Akaike information criterion and the corrected Akaike information criterion for GWR are defined as
$$AIC = 2n \log(\hat{\sigma}) + n \log(2\pi) + n + \operatorname{tr}(S),$$
$$AICc = 2n \log(\hat{\sigma}) + n \log(2\pi) + n\, \frac{n + \operatorname{tr}(S)}{n - 2 - \operatorname{tr}(S)},$$
where $n$ is the sample size, $\hat{\sigma}$ is the estimated standard deviation of the error term, and $\operatorname{tr}(S)$ denotes the trace of the hat matrix [26].
Second, the information complexity (ICOMP) criterion is a measure developed by Bozdogan [27] on the basis of AIC. Unlike the AIC-based information criteria, ICOMP approximates the sum of two Kullback and Leibler [28] distances, which measure the model's lack of fit and the model complexity, in a single criterion function using an entropic measure of the estimated covariance matrix of the model parameters. Thus, the concept of model complexity takes into account not only the number of free parameters in the model but also the interdependence of the parameter estimates.
Therefore, ICOMP can provide a general model selection criterion by taking into account the relational structure between the parameter estimates in the selected model [29]. ICOMP-type criteria provide the most appropriate balance between the complexity of a model and its goodness-of-fit [30]. ICOMP criteria can be defined in several formulations, which are given in [31]. In these formulations, $n$ is the sample size, $L(M)$ is the maximized likelihood function, $k$ is the number of variables, $C$ is a real-valued complexity measure, and $\hat{\Sigma}_{model} = \widehat{\operatorname{Cov}}(M)$ is the estimated covariance matrix of the parameter vector of the model [29].
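To make the idea of covariance complexity concrete, the sketch below computes Bozdogan's first-order maximal entropic complexity, C1(Σ) = (s/2) log(tr(Σ)/s) - (1/2) log|Σ|, and the general form ICOMP = -2 log L(M) + 2 C1(Σ̂_model). It is a hedged illustration only: it assumes a full-rank, positive definite covariance matrix (so s equals its dimension), the function names and example matrices are hypothetical, and the specific ICOMP-type variants used in this study (such as ICOMP-PEULN), which add further penalty terms, are not reproduced here.

```python
import math

def log_det_spd(a):
    """Log-determinant of a symmetric positive definite matrix by Gaussian elimination."""
    m = [row[:] for row in a]
    n = len(m)
    total = 0.0
    for k in range(n):
        pivot = m[k][k]
        if pivot <= 0:
            raise ValueError("matrix is not positive definite")
        total += math.log(pivot)
        for r in range(k + 1, n):
            f = m[r][k] / pivot
            for c in range(k, n):
                m[r][c] -= f * m[k][c]
    return total

def c1_complexity(cov):
    """C1 complexity; zero when all variances are equal and all covariances vanish."""
    s = len(cov)  # assumption: cov has full rank
    trace = sum(cov[i][i] for i in range(s))
    return 0.5 * s * math.log(trace / s) - 0.5 * log_det_spd(cov)

def icomp(log_lik, cov):
    """General form: lack of fit (-2 log L) plus twice the covariance complexity."""
    return -2.0 * log_lik + 2.0 * c1_complexity(cov)

# Hypothetical covariance matrices of two parameter estimates.
uncorrelated = [[2.0, 0.0], [0.0, 2.0]]
correlated = [[1.0, 0.9], [0.9, 1.0]]

print("C1, uncorrelated estimates:", c1_complexity(uncorrelated))
print("C1, correlated estimates:  ", c1_complexity(correlated))
```

At the same likelihood, the model whose parameter estimates are strongly correlated receives the larger penalty, which is the behavior that distinguishes ICOMP-type criteria from a penalty based only on the number of parameters.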

Simulation Study
A simulation study was conducted to assess the performance of the information complexity criteria in bandwidth selection. In the simulation design, the covariates $X_1 \sim N(0, 1)$, $X_2 \sim N(2, 4)$, and $X_3 \sim N(5, 8)$ and the error term $\varepsilon \sim N(0, 1)$ were generated, and lon and lat denote the spatial coordinates of the locations. The sample size is $n = 300$, and the kernel function is taken as Gaussian. Table 1 gives the optimal bandwidth values selected by the different methods.
The ICOMP-type information criteria selected a bandwidth value of 2811.487. Among them, ICOMP-PEULN performed best, with the lowest information criterion value in the selection of bandwidth. The GWR models were fitted with fixed bandwidths and are named after the selection criterion: GWR-ICOMP-type for the bandwidth selected by the ICOMP-type criteria, GWR-CV for CV, GWR-GCV for GCV, GWR-AIC for AIC, and GWR-AICc for AICc. Performance evaluations of the models are given in Table 2.
In Table 2, the GWR-ICOMP-type model performed best, with the highest adjusted $R^2$ value of 0.9901 and the lowest information criterion value, AIC = 169.0912.
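As a rough illustration of how fixed bandwidths can be compared under the classical criteria (CV, GCV, AIC, AICc), the sketch below runs a small grid search. It assumes a single predictor with an intercept, a Gaussian kernel of the form exp(-0.5 (d / bw)^2), σ̂² = RSS / n, and the commonly used GWR form AICc = 2n log(σ̂) + n log(2π) + n (n + tr(S)) / (n - 2 - tr(S)). The grid of locations, the data-generating process, and the candidate bandwidths are hypothetical and are not the simulation design of this study; ICOMP-type criteria are not included.

```python
import math
import random

def gaussian_w(coords, i, bw):
    """Gaussian kernel weights of all reference points for regression point i."""
    ui, vi = coords[i]
    return [math.exp(-0.5 * (math.hypot(ui - u, vi - v) / bw) ** 2) for u, v in coords]

def fit_at(i, x, y, w):
    """Return (fitted value at i, S_ii) for the local fit y = a0 + a1 * x with weights w."""
    sw = sum(w)
    swx = sum(wj * xj for wj, xj in zip(w, x))
    swxx = sum(wj * xj * xj for wj, xj in zip(w, x))
    det = sw * swxx - swx ** 2
    # Row i of the hat matrix: s_ij = [1, x_i] (X'WX)^{-1} [1, x_j]' w_j
    row = [wj * ((swxx - swx * xj) + x[i] * (sw * xj - swx)) / det for wj, xj in zip(w, x)]
    return sum(sj * yj for sj, yj in zip(row, y)), row[i]

def criteria(coords, x, y, bw):
    """CV, GCV, AIC, AICc and tr(S) of a fixed-bandwidth GWR fit."""
    n = len(y)
    rss = cv = tr_s = 0.0
    for i in range(n):
        w = gaussian_w(coords, i, bw)
        yhat, s_ii = fit_at(i, x, y, w)
        rss += (y[i] - yhat) ** 2
        tr_s += s_ii
        w[i] = 0.0  # leave point i out for the CV score
        yhat_loo, _ = fit_at(i, x, y, w)
        cv += (y[i] - yhat_loo) ** 2
    sigma = math.sqrt(rss / n)  # assumption: sigma-hat^2 = RSS / n
    base = 2 * n * math.log(sigma) + n * math.log(2 * math.pi)
    return {
        "CV": cv,
        "GCV": n * rss / (n - tr_s) ** 2,
        "AIC": base + n + tr_s,
        "AICc": base + n * (n + tr_s) / (n - 2 - tr_s),
        "trS": tr_s,
    }

# Hypothetical data: 30 grid locations, slope drifting from west to east.
rng = random.Random(0)
coords = [(float(u), float(v)) for u in range(6) for v in range(5)]
x = [rng.uniform(0, 10) for _ in coords]
y = [(1 + 0.3 * u) * xi + rng.gauss(0, 1) for (u, _), xi in zip(coords, x)]

candidates = [1.5, 3.0, 6.0, 12.0]
scores = {bw: criteria(coords, x, y, bw) for bw in candidates}
best_cv = min(candidates, key=lambda bw: scores[bw]["CV"])
best_aicc = min(candidates, key=lambda bw: scores[bw]["AICc"])
print("bandwidth chosen by CV:", best_cv, "| by AICc:", best_aicc)
```

A coarse grid keeps the example short; in practice the search is usually carried out with a one-dimensional optimizer over a continuous range of bandwidths.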

Real Data Application
In this study, total fertility rate data for the 81 provinces of Turkey in 2020 were used. The total fertility rate refers to the average number of children a woman in the 15-49 age group can have. The data are available at [32]. A set of six continuous variables is used in this study: the dependent variable (Y) is the total fertility rate, and the independent variables are gross domestic product (GDP per city) (x1), mean age of mother by province (2009-2020) (x2), number of illiterate people (x3), number of women with higher education (x4), and unemployment rate (% of GDP) (x5). Furthermore, the coordinates (longitude, latitude) of the 81 cities in Turkey (see [33]) are used to fit the GWR. The spatial distribution of the total fertility rate in Turkey for 2020 is shown in Figure 1.
The province with the highest total fertility rate in Turkey is Şanlıurfa with 3.71, shown in red on the map. This province is followed by Şırnak with 3.22 and by Ağrı and Siirt with 2.88. The province with the lowest total fertility rate is Karabük with 1.29, shown in dark blue on the map. This province is followed by Zonguldak and Kütahya with 1.31. First, the factors affecting total fertility rates in Turkey are modeled with the multiple regression method. Then, the GWR model is used to determine whether location has an effect. Table 3 provides the coefficients of the multiple regression model. According to the multiple regression model, gross domestic product, mean age of mother by province, number of women with higher education, and unemployment rate had a significant effect on total fertility rates. Testing the goodness-of-fit of the GWR model is important for experimental analysis. This test is called the global test of nonstationarity [1].
According to the global nonstationarity test result in Table 4, the GWR model is appropriate for the total fertility rate data. Therefore, we can apply the GWR method to the total fertility rate data. The selection of optimal bandwidths in the GWR model for the total fertility rate data is given in Table 5. The ICOMP-PEULN criterion, which has the lowest information criterion value among the ICOMP-type criteria, chose the optimal bandwidth. GWR models were built with the variables found significant in the multiple regression model and with the selected fixed bandwidth values. These models are the GWR-CV model with a bandwidth value of 1.6477 for the CV score, the GWR-GCV model with a bandwidth value of 1.6526 for the GCV score, the GWR-AIC model with a bandwidth value of 16.0042, the GWR-AICc model with a bandwidth value of 14.9964, and the GWR-ICOMP-type model with a bandwidth value of 0.4111. Model performances are given in Table 6. In the tables, bold values indicate the values that give the best model performance, with the highest adjusted $R^2$ and the lowest information criteria. In Table 6, the GWR-ICOMP-type model gives the best result, with the highest adjusted $R^2$ and the lowest model information criteria values. Spatial distributions of the total fertility rate estimates are shown in Figure 2.
Figure 2 gives the spatial distribution of the total fertility rate estimates of the GWR model after the bandwidth is selected with the ICOMP-type criteria. The distribution of the total fertility rate estimates in Figure 2 is quite compatible with the distribution of the actual total fertility rate shown in Figure 1.

Conclusions
The selection of bandwidth in the GWR model is very important for the efficiency and accuracy of the model, and it can be considered a model selection problem. When the bandwidth is large, more data points are included in each regression. Thus, the variance will be small, while the bias will be large. If a small bandwidth is used, the regression is confined to a local area and the parameter estimates depend on observations close to the regression point. Therefore, the variance of the parameter estimates will increase, but the bias will be small and more anomalies can be discovered. In this study, we applied ICOMP-type criteria to the selection of bandwidth for the first time in the literature. We found in the simulation design that the bandwidths selected with the ICOMP-type criteria increase the model performance and prediction accuracy of the GWR model. When GWR models with different bandwidths are fitted to the actual total fertility rate data, the GWR-ICOMP-type model performs best, with the highest $R^2$ and the lowest information criteria. In addition, among the ICOMP-type criteria, the ICOMP-PEULN criterion chose the most suitable model, with the smallest information criterion value at fixed bandwidth. The spatial distribution of the total fertility rate estimates in the GWR-ICOMP-type model is quite compatible with the actual total fertility rate distribution. As a result, the use of ICOMP-type criteria in the selection of bandwidth can improve the GWR model in terms of prediction accuracy and model performance.

Data Availability
The data used to support this study are available from https://data.tuik.gov.tr/.

Conflicts of Interest
The author declares that there are no conflicts of interest.