Pedestrian Crash Exposure Analysis Using Alternative Geographically Weighted Regression Models

In order to develop a sustainable, safe, and dynamic transportation system, proper attention must be paid to the safety of pedestrians. The purpose of this study is to analyze the surrogate measures related to pedestrian crash exposure in urban roads, including the use of sociodemographic characteristics, land use, and geometric characteristics of the network. This study develops pedestrian exposure models using geographical spatial models including geographically weighted regression (GWR), geographically weighted Poisson regression (GWPR), and geographically weighted Gaussian regression (GWGR). In general, the results of the GWPR model show that the presence of a bus station, population density, type of residential use, average number of lanes, number of traffic control cameras, and sidewalk width are negatively associated with increasing the number of crashes. In this study, in order to identify traffic analysis zones (TAZs) based on the observed and predicted crash data, spatial distance-based methods using GWPR outputs have been used. This study shows the dispersion and density of pedestrian crashes without possessing the volume of pedestrians. Comparison of the performance of GWPR and Poisson models shows a significant spatial heterogeneity in the analysis.


Introduction
Pedestrians are known as vulnerable road users, and the severity of pedestrian injuries in motor vehicle crashes is relatively high. Today, ensuring the safe movement of pedestrians is one of the most challenging concerns for transportation engineers. In general, in urban accidents, drivers and passengers have the largest share of the comprehensive cost of traffic accidents (94%) [1]. In order to develop a sustainable, safe, and dynamic transportation system, proper attention must be paid to the safety of pedestrians.
The proportion of pedestrian casualties in the world has increased by an average of 11 percent to 14 percent over the past decade, so addressing pedestrian safety and raising awareness about safe pedestrianization is an important issue.
This study was conducted with the aim of addressing the safety of pedestrians, identifying the extent of exposure to pedestrian crashes in urban areas, and identifying accident-prone areas. Obviously, the number of people walking on the streets (i.e., pedestrian trips), together with the factors that cause the presence of more pedestrians on the streets, is one of the best measures of pedestrian exposure [2-5].
However, continuous measurement of pedestrian travel that accounts for all the effective variables is difficult, because it requires significant resources and because many factors play a role in generating pedestrian trips. The purpose of this study is to investigate the available criteria and select the most effective measures in order to predict the variable of pedestrian exposure (pedestrian trips) and identify areas prone to pedestrian crashes. In other words, the purpose of this study is to analyze the surrogate measures related to pedestrian exposure in urban roads, including the use of sociodemographic characteristics, land use, and geometric characteristics of the network. The three-step process in the study involves the development of exposure models using geographical spatial models including geographically weighted regression (GWR), geographically weighted Poisson regression (GWPR), and geographically weighted Gaussian regression (GWGR). Exposure models in this study are compared with the model developed by Lee et al. [5], which was estimated using the Tobit method and generalized linear models (GLMs) and predicted pedestrian travel. In their study, it was suggested that the effect of the geographical location of exposure variables for pedestrian crashes be investigated in future studies. In the current study, the effect of spatial exposure variables based on their geographical location has also been investigated. Then, in order to identify the best exposure model among the GWR, GWPR, and GWGR models, the Akaike Information Criterion (AIC) and P value were used after validating the models to predict pedestrian crashes. This method can be described as a pedestrian safety analysis on urban roads (microlevel) with macrolevel data. Also in this study, in order to identify traffic analysis zones (TAZs) based on the observed and predicted crash data, spatial distance-based methods using GWPR outputs have been used.
The city of Tehran in Iran and its urban areas have been considered in this study. Although the two-step process (i.e., first identifying pedestrian collision variables and second predicting crashes and identifying high-risk areas) has a relatively larger modeling error than the one-step model (pedestrian crash prediction only), analyzing the volume of pedestrians and crashes together can still lead to a better understanding of safety [5].
This study addresses pedestrian safety in the study area by identifying the best model for dealing with pedestrian crashes on major urban roads (first- and second-degree arteries and collector streets) as well as creating a safety analysis process for regions where pedestrian crash data are not available. This process can help transportation authorities create safer paths for pedestrians by implementing appropriate safety interventions.

Literature Review
Pedestrian safety is a growing concern, and so extensive studies have been conducted to ensure pedestrian safety. Researchers have tried to identify the factors contributing to pedestrian fatalities as well as identifying the urban areas with the highest risk of crashes for pedestrians by developing spatial relationships. For this purpose, in this section, past studies on the development of exposure models and spatial analysis of crashes have been reviewed.
Lee and Abdel-Aty [6] conducted a comprehensive study of pedestrian crashes at intersections in Florida. This study followed the Keall [7] method to use pedestrian personal travel data to create a rational model of pedestrian exposure to crashes. In the proposed exposure measure, different walking patterns were reflected by different age groups of pedestrians. Miranda-Moreno et al. [8] analyzed two important relationships between land development and pedestrianization: (a) between land use and pedestrian activities and (b) between risk exposure (pedestrian and vehicle activities) and pedestrian crash frequency. The authors concluded that the land use pattern affects the level of pedestrian activities with a direct impact on pedestrian safety. Ukkusuri et al. [9] developed a pedestrian count crash frequency model for New York City using a negative binomial model and a stochastic parameter. This model found that the ratio of illiterate population, business areas, school areas, functional characteristics of the intersection, type of access control on the roads, and the number of lanes had a positive effect on pedestrian crashes.
In order to select the variables used in the proposed exposure model, several previous studies have been reviewed. Previous researchers have shown that pedestrian volume is a significant measure of exposure that has a positive effect on the occurrence of vehicle-pedestrian collisions [2-5, 10]. Another significant measure of exposure is the effect of land use patterns, which has long been studied by researchers [5, 11-13]. Wier et al. [13] found that the number of pedestrian crashes is relatively higher in commercial and residential areas. There are many studies that describe the impact of demographic and socioeconomic characteristics on pedestrian safety (e.g., [12]).
Although previous researchers have attempted to explain pedestrian exposure, there are very few studies that specifically identify the exact causes of this criterion. Another issue is the reliability of pedestrian volume data. There have been many cases where pedestrian volume data were not available or were not sufficiently accurate to perform a safety analysis. A reliable process for identifying surrogate measures is needed to express the pedestrian exposure criteria in such cases. Apart from these issues, the use of the negative binomial (NB) or even zero-inflated negative binomial (ZINB) model in microlevel safety analysis has been questioned by many authors [5, 14-16]. However, some authors have confirmed it in macrolevel analysis [17].
In general, spatial prediction of crashes using localized parameters gives more accurate predictions than the methods in the Highway Safety Manual [18], which use global parameters. In addition, traffic exposure criteria (such as AADT and length of segment) are considered as predictors of crash frequency and have been widely used by transportation professionals to predict the occurrence of crashes at a particular site. Therefore, in the field of safety performance functions (SPFs), understanding the different spatial relationships between the main factors of exposure and the frequency of crashes has significant potential for the development of localized SPFs that can potentially provide more accurate crash predictions at separate sites [19].
In the current study, in order to identify high-risk TAZs based on observed and predicted crash data, two methods have been adopted: (1) frequency-based methods and (2) distance-based models. The first group measures the severity of point events based on the density of an area. These methods include kernel density estimation (KDE). The second group measures the spatial dependence of point events based on the distance of points from each other. This group includes methods such as nearest neighbor distances, K-functions, and Moran's I [20,21]. Hadayeghi et al. [22] presented crash prediction models for 463 TAZs in Toronto using traditional (global) NB and GWPR regression models. The results showed that GWPR models were able to partially deal with spatial dependence as well as spatial heterogeneity resulting from these factors and TAZs. Xu and Huang [23] modeled the total crash frequency as a function of road length density, population density, average household income, and percentage of road sections with different speed limits and showed that, because of the instability of the crash location, the GWPR model has acceptable accuracy compared to the NB model with random parameters. A parametric GWPR model was also developed to estimate some parameters globally and some locally [24]. Similarly, a study by Rhee et al. [25] investigated traffic accidents with spatial correlation and spatial relevance using advanced spatial modeling methods. The results showed that the statistical performance of GWR was superior in the correlation coefficient of localization.

Methods
In this study, in the first step, which is the identification of exposure variables, several statistical methods have been used to identify these variables and the crash frequency is examined based on different modeling methods. In the second step, crash prediction models are presented at the TAZ level using surrogate variables. The following is a brief description of the modeling techniques used in these two steps.

Models Used to Identify Exposure Variables
Generalized Linear Models.
Generalized linear models (GLMs) are a general class of statistical models that include many common models as special cases. A typical GLM is as follows:

$$Y_i = \beta x_i + \varepsilon_i$$

In this equation, $Y_i$ is the response given by the linear predictor $\beta x_i$ and $\varepsilon_i$ is the error term. In the generalized linear model, $Y$ is assumed to be independently distributed, with a distribution drawn from a family that includes such cases as the normal, Poisson, gamma, and binomial distributions [26]. The GLM is a flexible generalization of ordinary linear regression that allows the use of response variables that have error distribution models other than the normal distribution.

Tobit Model.
In this study, in order to eliminate any negative prediction of pedestrian crashes, the Tobit model was used to identify the measure of exposure. The Tobit model is a statistical model used to describe the relationship between a censored dependent variable $y_i$ and an independent variable (or vector) $x_i$. The Tobit model is as follows:

$$y_i^* = \beta x_i + \varepsilon_i, \quad i = 1, 2, \ldots, N,$$

$$y_i = \begin{cases} y_i^*, & \text{if } y_i^* > 0, \\ 0, & \text{if } y_i^* \le 0. \end{cases}$$

In this relation, $y_i^*$ is a latent variable that is observed only if it is positive. Also, $N$ is the number of observations, $y_i$ is the dependent variable, $x_i$ is a vector of explanatory variables, $\beta$ is a vector of estimable parameters, and $\varepsilon_i$ is a normally and independently distributed error term with a mean of zero and a variance of $\sigma^2$ [27].
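As a rough illustration of how censoring at zero enters estimation, the following minimal sketch evaluates the log-likelihood of a Tobit model that is left-censored at zero, using only the Python standard library. The function names (`tobit_loglik`, `tobit_predict`) and the toy inputs are hypothetical and are not taken from the study; the prediction helper simply floors the linear predictor at zero, which is one simple reading of "no negative predictions" rather than the study's documented procedure.

```python
import math

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(beta, sigma, X, y):
    """Log-likelihood of a Tobit model left-censored at zero.

    beta: list of coefficients, sigma: error standard deviation,
    X: list of rows of explanatory variables, y: observed responses (>= 0).
    """
    total = 0.0
    for row, yi in zip(X, y):
        mu = sum(b * x for b, x in zip(beta, row))
        if yi > 0:
            # Uncensored observation: density of y_i under N(mu, sigma^2).
            total += math.log(normal_pdf((yi - mu) / sigma) / sigma)
        else:
            # Censored observation: P(y*_i <= 0) = Phi(-mu / sigma).
            total += math.log(normal_cdf(-mu / sigma))
    return total

def tobit_predict(beta, row):
    """Linear predictor floored at zero (assumed simplification)."""
    return max(0.0, sum(b * x for b, x in zip(beta, row)))
```

In practice the parameters would be found by maximizing `tobit_loglik` with a numerical optimizer; that step is omitted here.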
Variable Importance for Exposure Model Using Random Forest.
Important explanatory variables can be determined using a random forest exposure model. The first step in this process is to fit a random forest to the data. During the fitting process, an out-of-bag error (a measure of random forest prediction error) is recorded for each data point and averaged over the forest. In order to measure the importance of the $j$th attribute after training, the values of the $j$th attribute are permuted among the training data and the out-of-bag error is re-estimated on this perturbed dataset [28].
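The permutation step described above can be sketched as follows. This is a simplified illustration, not the study's implementation: `model` is any callable standing in for a trained forest, the error is a plain mean squared error on the supplied data rather than a per-tree out-of-bag error, and all names are hypothetical.

```python
import random

def mean_squared_error(model, X, y):
    """Average squared prediction error of `model` on rows X."""
    return sum((model(row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)

def permutation_importance(model, X, y, j, n_repeats=20, seed=0):
    """Average increase in error after shuffling column j of X."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, X, y)
    increases = []
    for _ in range(n_repeats):
        # Shuffle only attribute j, leaving the other attributes intact.
        column = [row[j] for row in X]
        rng.shuffle(column)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        increases.append(mean_squared_error(model, X_perm, y) - baseline)
    return sum(increases) / n_repeats
```

An attribute that the model never uses leaves the error unchanged when shuffled, so its importance is zero; attributes the model relies on produce a positive increase.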

Network KDE.
As mentioned earlier, many recent studies have used the network KDE method developed by Okabe et al. [29] to examine the spatial correlation of point events in a road network. In this study, the network KDE method has been used to estimate the density of road sections in the Tehran road network. This method is based on the study method of Okabe and Sugihara [30].
In this study, network KDE was performed on 1000-meter sections, similar to those proposed by Xie and Yan [20] and Nie et al. [31]. Also, following previous studies [32,33], three bandwidth values of 100, 200, and 500 meters have been considered in order to achieve more accurate KDE results. See Okabe and Sugihara [30] for more details on the computational process.
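To make the role of the bandwidth concrete, the sketch below computes a kernel density along a single straight road, with crash positions measured in meters from the start of the road. It is a deliberately reduced illustration: a full network KDE must also handle how the kernel is split at intersections, which is not attempted here. The quartic kernel is an assumption, since the kernel function is not stated in the text, and the function names are hypothetical.

```python
def quartic_kernel(u):
    """Quartic (biweight) kernel; zero outside |u| <= 1."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0

def linear_kde(crash_positions, location, bandwidth):
    """Crash density (events per meter) at `location` along one road.

    crash_positions and location are distances in meters from the road
    start; bandwidth is the kernel search radius in meters.
    """
    return sum(quartic_kernel((location - p) / bandwidth)
               for p in crash_positions) / bandwidth
```

Evaluating `linear_kde` at the same location with bandwidths of 100, 200, and 500 meters shows the usual trade-off: small bandwidths give sharp local peaks, while large ones smooth the density over longer stretches of road.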

Geographically Weighted Regression Models.
Geographically weighted regression (GWR) is an exploratory method that has been adopted in the relevant literature mainly to deal with spatial variables. Past studies on the relationship between urban form and pedestrian behavior have mainly used global regression models. However, since the present study includes urban areas with the main arterial functional class and collectors, the behavioral characteristics of pedestrians may be different in each one. Therefore, it is likely that the relationship between urban form and walking varies across the study area.
In this study, a Gaussian GWR model is used to evaluate the relationship between exposure variables (land use and street characteristics around houses as independent variables). To determine which model is appropriate, a comparison was made among Gaussian GWR, geographically weighted regression, and geographically weighted Poisson regression. A model with lower AIC values is a more appropriate model [34-36].
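A GWPR model is also used in this study. In a GWPR model, the frequency of crashes is predicted by a set of explanatory variables in which the parameters are allowed to change in space. This model can be written as follows [37]:

$$\ln(\lambda_i) = \beta_0(u_i, v_i) + \sum_{k} \beta_k(u_i, v_i)\, x_{ik}$$

In this relation, $\lambda_i$ is the expected crash frequency in region $i$, $x_{ik}$ is the $k$th explanatory variable in region $i$, and $(u_i, v_i)$ specifies the coordinates of region $i$. It should be noted that, in GWPR, $\beta_k(u_i, v_i)$ is a function of the coordinates of the center of region $i$.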
In this study, GWR4.0 software was used to identify high-risk points in which the chances of determining the walking exposure variable increase because changes in independent variables are given in the first step of modeling.
The results of GWR, GWPR, and GWGR were mapped in ArcGIS, version 10.2, to visualize spatial relationships. Also, even if there is a discontinuity in the study area, the optimal bandwidth has been selected based on several experiments to ensure that the blank spaces are outside the crash points of the study area.
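The idea of locally varying parameters can be illustrated with a small sketch of a Gaussian GWR fit at one regression point, using an adaptive bisquare kernel in which the bandwidth is the distance to a chosen number of nearest neighbors. This is an assumption-laden toy, not the GWR4 procedure: it handles a single explanatory variable in closed form, assumes distinct zone coordinates, and uses hypothetical function names. A Poisson version would replace the weighted least squares step with iteratively reweighted least squares.

```python
import math

def adaptive_bisquare_weights(coords, i, n_neighbors):
    """Bisquare weights for regression point i.

    The bandwidth is the distance from point i to its n_neighbors-th
    nearest observation, so it adapts to the local density of zones.
    """
    ui, vi = coords[i]
    dists = [math.hypot(u - ui, v - vi) for u, v in coords]
    bandwidth = sorted(dists)[n_neighbors]  # index 0 is the point itself
    return [(1.0 - (d / bandwidth) ** 2) ** 2 if d < bandwidth else 0.0
            for d in dists]

def local_fit(coords, x, y, i, n_neighbors):
    """Weighted least squares for y = b0 + b1 * x, local to point i."""
    w = adaptive_bisquare_weights(coords, i, n_neighbors)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar)
              for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1
```

Calling `local_fit` once per zone centroid yields one coefficient pair per zone, which is what makes it possible to map the coefficients and inspect how a relationship changes across the study area.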

Measures of Goodness of Fit.
To evaluate and compare the performance of the models, three statistics were used to measure the accuracy of the estimates. First, we used AIC, for which the lower the value, the better the model [38]. The AIC is measured as follows:

$$\mathrm{AIC} = D + 2k$$

where $D$ represents the model deviance and $k$ is the number of parameters. In GWPR, due to the nonparametric framework of the model, the number of parameters is meaningless. Therefore, an effective number of parameters must be considered, which can be written as follows [39]:

$$k = \mathrm{trace}(S)$$

where $S$ is the hat matrix. In addition to AIC, we also used mean absolute error (MAE) and root mean square error (RMSE) to compare model performance. Lower MAE and RMSE values indicate better model performance. Finally, Moran's I statistic was used to validate the models. Statistically, Moran's I is a measure of spatial correlation. In this study, the Moran test was used to examine whether the residuals of city-wide crash predictions were spatially related to neighboring TAZs. A negative (positive) value of Moran's I indicates a negative (positive) spatial correlation at the overall level.
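The measures named in this subsection are simple to compute once predictions and residuals are available. The sketch below assumes the common definitions: AIC as deviance plus twice the (effective) number of parameters, and global Moran's I with a user-supplied spatial weight matrix. The function names and the form of the weight matrix (for example, 1 for neighboring zones and 0 otherwise) are illustrative assumptions.

```python
import math

def aic(deviance, k):
    """AIC from model deviance and (effective) number of parameters."""
    return deviance + 2.0 * k

def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted))
                     / len(observed))

def morans_i(values, weights):
    """Global Moran's I of `values` (e.g., model residuals per zone).

    weights[i][j] is the spatial weight between zones i and j,
    with zeros on the diagonal.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den
```

Values of Moran's I near zero for the residuals suggest that the model has absorbed most of the spatial structure, whereas clearly positive values indicate that neighboring zones still share unexplained variation.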

Data Preparation.
The main source of data for this study is Tehran Municipality. The data used in this study are shown in Table 1. In this study, the zones considered for model development are the TAZs in which the exposure variables are measured, of which there are 560 in Tehran. Of course, the whole city could instead be divided into equal units to define the analysis zones, but because pedestrian crashes are not homogeneously distributed, many of these identical zones would have zero observed crashes. The TAZ characteristics selected for the crash analysis include all the variables in Table 1. All items selected as exposure variables are items that affect the frequency of crashes. The crash data variable (CR) shows the total number of pedestrian crashes in Tehran. The density of speed cameras (TCC) indicates a risk factor at their installation site, as these devices are typically installed in locations where drivers need more focus and are at greater risk of road crashes [32]. A bus station (BS) can be a risk factor, as a large number of pedestrians get on and off in one place and some of them tend to cross the street [5]. Schools (SC) are among the most important attractors of pedestrians, so the presence of schools in TAZs during the hours of the day is a risk factor for pedestrian crashes [5]. According to previous studies, the presence of a pedestrian bridge (PB) improves pedestrian safety in conflicts with vehicles [5]. Also, the presence of intersections in any zone increases the risk of pedestrian collisions. It is obvious that population density (TP0) in each zone and the density of vulnerable users (TP1 and TP2) increase the risk of pedestrian collisions [5,32,33].
This study was conducted in two steps: (1) identifying the variables of pedestrian exposure and (2) investigating the spatial-geographical relationship between the variables and the spatial crash prediction at the TAZ level. GIS and SPSS software were used to extract and process data for the first step and GWR4 for the second step. The integration of the database with all the information collected in TAZs is done with the help of standard tools in GIS that allow spatial search, layer addition, and spatial operations based on topological relationships. Table 1 shows the pedestrian crash dataset of Tehran, which includes 1231 observed cases. Descriptive variables were also classified into three categories: "demographic and socioeconomic," "land use," and "traffic and geometric". Out of 25 variables collected, 15 variables are listed in Table 1 based on the results of the first step in the study. The road network in Tehran, including arterial roads and collectors, has been used for analysis. Data were collected from various sources. Figure 1 shows the study area and the status of existing crashes. Figure 2 shows the KDE crash density function based on crash point and crash density per kilometer in three bandwidths of 100, 200, and 500 meters.
The correlation between the descriptive variables used in the Tobit model (the model selected based on the results of the first step of the study) was investigated before the modeling process. Pairs of variables with correlation coefficients higher than 0.6 are not included in the models simultaneously [5]. In the modeling process, the explanatory variables with the lowest correlation values were included in the model first, and variables with relatively lower correlation values were preferred (Tables 2 and 3).
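The screening rule described above can be sketched as a pairwise Pearson correlation check against the 0.6 threshold. The function names are hypothetical, and comparing the absolute value of the coefficient with the threshold is an assumption, since the text only states "higher than 0.6". The variable codes in any example call (such as BS, SC, or TCC) follow Table 1, but sample values would be invented for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def correlated_pairs(columns, threshold=0.6):
    """Pairs of variable names whose |r| exceeds the threshold.

    columns maps a variable name to its list of per-zone values.
    """
    names = sorted(columns)
    return [(p, q) for i, p in enumerate(names) for q in names[i + 1:]
            if abs(pearson(columns[p], columns[q])) > threshold]
```

Each pair returned by `correlated_pairs` flags two variables that should not enter the same model together; which member of the pair to keep remains a modeling decision.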

Results and Discussion
A total of six exposure models have been developed in this study (Table 4). Because two different modeling methods are compared (GLM vs. Tobit), it is not appropriate to use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) to select the best model. Therefore, to compare the models, the mean absolute deviation (MAD) and the root mean square error (RMSE) for each model have been used [5]. Table 4 shows that the Tobit model using all variables performs best with the lowest MAD and RMSE values. The Tobit model also sets any negative pedestrian crash prediction obtained from the exposure variables equal to zero because the lower limit is set at zero.
Between 2017 and 2019, 1228 pedestrian crashes were reported in Tehran, in which 44 people died and 1184 were injured. A total of 4979 schools, 4831 bus stops, 927 pedestrian bridges, 801 lighted intersections, and 5386 traffic control cameras located in 560 different TAZs were included in this analysis (Table 1), with an average number of variables ranging from 0 to 80 in all TAZs.
This initial benchmark shows that the difference between the scatterings of exposure variables at the level of TAZs is significant. The results of the GWR, GWGR, and GWPR models are shown in Tables 5-7, respectively. Table 8 compares the GWR, GWGR, and GWPR models. According to the results shown in these tables, the GWPR model has a higher accuracy in predicting crashes based on exposure variables. In Figure 3, based on crash predictive models and using ArcGIS software, a pedestrian crash map of Tehran has been produced. Table 9 shows the ANOVA values for the GWPR model. Tables 5-9 show the results of the GWR, GWGR, and GWPR models with adaptive bisquare kernel for predicting crashes. Among the kernel functions, the lowest AICc value was obtained with the adaptive bisquare kernel. Notably, we found that the results were largely consistent with the adaptive bisquare kernel. Also, the Lagrange coefficient values for the GWPR model and the global model were 0.12 and 0.15, respectively, both of which are less than the critical Lagrange value (3.48). The study by Hezaveh et al. [1] also confirms the adaptive bisquare kernel in relation to the Gaussian adaptive for urban TAZs. The comparison of AIC, AICc, deviance, MAE, and RMSE presented in Table 8 shows that the GWPR model is more appropriate than the global model. The value of Moran's I (0.031) indicates that, in the GWPR model, the residuals are not spatially correlated. In addition, VIF values (mean = 1.8; maximum = 3.9) indicate that the local multicollinearity issue is not critical in this study. In the study by Lee et al. [5], it is proposed to investigate the local effect of pedestrian crash exposure variables. The statistical results of the second step models in the current study show that, in the GWPR model, all variables had a local effect. Figure 3 shows the spatial effect of the estimated parameter on crashes.
This figure shows only those coefficients that have a significant effect, and the small coefficients are shown in white. It is noteworthy that the estimated coefficients in common fixed models are in the range of similar values in spatial models [1], which shows that the estimated parameters in global models (i.e., fixed models) represent the average of the factors affecting the dependent variable. VIF values stay between 1.12 and 3.9 (a VIF between 5 and 10 is considered critical and means a complete spatial correlation between the independent variables). This could be due to excessive scatter in the exposure variables. According to studies [1,23,40], the location of VIF in this limit can also justify the geographical distribution of Poisson in crash prediction, but this issue can be explored in the future. Traffic parameters such as intersections and speed cameras have a significant impact on pedestrian crashes; in TAZs where the speed camera density is higher, fewer crashes will occur because the estimated coefficient of this variable is negative. The model predicts that crashes decrease as population density increases. The negative association between population density and crash frequency is consistent with previous studies [41]. This could be due to the fact that, in residential areas without commercial and recreational land uses, the low speed of vehicles and the presence of speed bumps, as well as distracting effects, reduce the crash density. In general, the results of the GWPR model show that the presence of a bus station, population density, type of residential land use, average number of lanes, number of traffic control cameras, and sidewalk width have a negative effect on increasing the number of crashes.
In the GWR model, the number of motorcycles, residential land use, recreational land use, the average number of lanes, and the number of speed cameras in TAZs had a negative relation with increasing pedestrian crashes. Finally, in the GWGR model, the number of bus stops, population density, residential land use density, average number of lanes, and the number of speed cameras were negatively associated with increasing crashes. It should be noted that, in the three mentioned models, a significant relationship between dependent variable and independent variables has been obtained, which has been confirmed by previous studies (e.g., [13,40,42]).
One of the explanations for the negative sign of the bus station in urban areas can be the reduction of the volume of motor vehicles around the residential area, which reduces traffic congestion and ultimately the exposure of other residents to motor traffic [1,5]. On the other hand, poor design of a multimode network can negatively affect the safety of nonmotorized users and public transportation. The difference between the signs of the estimated coefficients in several different models requires more details in future studies. We may expect older people to suffer more severe injuries due to vulnerability [1]. Conversely, older people travel less than other groups [43-45]. As a result, in this study, the percentage of elderly people compared to other age groups was negatively associated with the increase in crash density in the GWPR model. The negative sign of motorcycle users on the increase of pedestrian crashes may arise because an increase in motorcycle travel reduces the number of pedestrian trips. In this study, the variables of number of schools, number of intersections, and pedestrian bridges have been shown with a positive sign; the impact of intersections on pedestrian crashes is not significant but is positive, which is confirmed by the study by Lee et al. [5]. Due to the presence of parents and children on the school routes, there is more walking activity in TAZs with more schools. Households with less than two vehicles (0 or 1 vehicle) are another important source of pedestrian activity. Car ownership is directly related to household income levels, which reflects the socioeconomic impact on pedestrian activity [5]. It is obvious that family members without their own vehicle meet their transportation needs through public transportation or walking.
On the other hand, the amount of car ownership in TAZs also has a significant impact on increasing pedestrian crash exposure. Of course, the crash may occur with vehicles that are from outside the TAZ and collide with a pedestrian while traveling through different TAZs. In this study, the TC variable is the vehicle ownership variable in the study area, considering the two factors, and the model results show a positive effect on pedestrian exposure. As mentioned earlier, in this study, the average width of the sidewalk was negatively associated with pedestrian exposure and did not cause crashes, which is consistent with previous studies (e.g., [5,45,46]). These findings are also consistent with studies of human factors that show that some groups (e.g., low income, low education, and young urban road users) are more prone to abnormal behaviors [1,47,48]. According to the modeling results, the effect of slope and average width of the route on crashes is positive; the reason could be that pedestrians on wider roads have to travel farther to cross the street, so they are exposed to passing vehicles for longer. Also, a medium slope has a positive effect on pedestrian crashes compared to zero-slope roads because vehicles are more difficult to control in adverse weather conditions.

Conclusion
In this study, a systematic approach has been developed that uses pedestrian surrogate measures based on exposure information. In the first step, which is the identification of exposure variables, several statistical methods have been used to identify these variables and the frequency of crashes is investigated based on different modeling methods. In the second step, crash prediction models were presented at the zone level using surrogate variables. Three models, GWR, GWGR, and GWPR, have been used to spatially predict the crash frequency based on exposure variables, and the results of the study showed that the GWPR model makes more accurate predictions. In addition, identifying effective criteria such as the presence of schools, car and motorcycle ownership, bus stations, sidewalk width, pedestrian bridges, type of intersection control and the presence of midroad refuges, population density, type of land use, width of roads, average number of routes, average road slopes, and number of speed cameras in dealing with pedestrians is important in this study. This study emphasizes that, when providing safety measures for pedestrians, traffic calming should be improved in areas with a high density of schools and around schools near intersections, and sidewalks should be widened in areas with more bus stations, because in areas where the bus is the main mode of transportation, there is a tendency to walk and consequently pedestrians are exposed to crashes. The proposed two-step method in this study involves two consecutive modeling processes. The first model identifies the exposure variables in pedestrian crashes and the second model estimates the number of pedestrian crashes using three spatial models: GWR, GWGR, and GWPR.
However, this approach is limited because the result can be affected by the errors accumulated in the first stage due to the existence of uncontrollable confounding variables as well as information biases. It is possible to solve the problem by adopting a simultaneous modeling approach. This study has shown the dispersion and density of pedestrian crashes without possessing the volume of pedestrians; thus, by taking safety measures in places prone to pedestrian crashes, social costs and casualties can be decreased. In this study, Poisson regression was used to evaluate the relationship between sociological variables and crashes at the zone level. Comparison of the performance of GWPR and Poisson models shows a significant spatial heterogeneity in the analysis. The increase in residential density in urban areas has been associated with a decrease in speed and therefore has led to a reduction in crash frequency. On the other hand, increasing travel time and consequently increasing traffic exposure affect the social costs of crashes. Identifying crash-prone traffic zones can be a useful element in developing policies to support mitigation measures related to pedestrian exposure to traffic. We expect that, in future studies, geographically weighted negative binomial models and the empirical Bayesian geographic model will be used to identify pedestrian exposure variables.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.