Modeling the Frequency of Cyclists' Red-Light Running Behavior Using Bayesian PG Model and PLN Model

1 Jiangsu Key Laboratory of Urban ITS, Jiangsu Collaborative Innovation Center of Modern Urban Traffic Technologies, School of Transportation, Southeast University, Si Pai Lou No. 2, Nanjing 210096, China
2 School of Highway, Chang'an University, Middle of Nanerhuan Road, Xi'an 710064, China
3 Hualan Design & Consulting Group, Hua Dong Lu No. 39, Nanning 530011, China
4 Guilin University of Electronic Technology, Jinjilu No. 1, Guilin 541004, China


Introduction
In recent years, the bicycle has been widely used as an important traffic mode, especially for commuting and recreational trips [1, 2]. Bicycles provide users with convenient, flexible, and affordable mobility, constituting an important supplement to the urban transit system. The bicycle has also been recognized as an environmentally friendly mode of transport [3-6]. In China, bicycle use has increased significantly during the past several decades, mainly due to the increasing congestion in most large cities. A study in 2010 showed that the average bicycle modal share for urban trips was 38% in China [7]. Because bicycles produce no pollutant emissions, are low carbon, and generate little noise, the government is showing an interest in promoting them [3-6, 8].
Despite these obvious advantages, the use of bicycles has also raised concerns regarding safety. In 2011, 8,776 cyclists were killed and 35,552 were seriously injured in road accidents [9]. The fatalities and injuries of cyclists accounted for 82.3% and 84.6%, respectively, of the nonmotorized traffic fatalities and injuries in China in 2011 [9]. Accident analysis reveals that approximately 43% of fatal crashes involving bicycles result from violations of traffic rules [9]. As one of the most overt illegal behaviors, red-light running at signalized intersections is very common in China. According to a previous study, e-bikes and bicycles contributed to 22.4% of the incidents, in which red-light running was found to be the predominant factor [10].
Several previous studies have investigated the red-light running behavior of cyclists [11-15]. For example, a recent study by Wu et al. [11] observed 451 two-wheelers facing a red light in China and found that 56% of them crossed the intersection against the red light. A cross-sectional survey by Bacchieri et al. [12] in Brazil showed that the red-light infringement rate reached 38.4% among male commuter cyclists. By contrast, the proportion of cyclists running red lights was low in Australia [13, 14], where observational studies reported infringement rates of 7% to 9%. Some studies focus on the factors that influence cyclists' red-light running violations [11, 13, 16-20]. Wu et al. [11] developed a binary logit model to identify the significant factors that affect the likelihood of two-wheeled riders running a red light. The main factor was age, with young and middle-aged riders being more likely than older riders to run a red light. A further study by Zhang and Wu [16] in 2013 showed that sunshields installed at the intersection could reduce the red-light running behavior of cyclists and e-bike riders. Riders were 1.376 times more likely to run a red light at intersections without sunshields than at intersections with them. Studies of cyclists in Australia [11, 18] found that the three main factors for cyclists' infringement were travel direction, the presence of other road users, and the volume of cross traffic. Cyclists turning left were 28.3 times more likely to run a red light than cyclists who continued straight through the intersection. These studies also found that males are more likely to offend than females, that older cyclists are less likely to infringe than younger cyclists, and that cyclists are more likely to infringe at red lights if they had not previously been involved in a bicycle-vehicle crash while riding.
Previous research on red-light violations leads to a basic conclusion: individual characteristics, such as gender and age, are important factors affecting red-light running behavior. Other factors, such as the presence of other road users, group size, and traffic volume, also affect crossing behavior. Until recently, however, little documentation has been available on modeling cyclists' red-light running frequency in an aggregate way. Previous studies are limited in their capacity to explore whether cyclists' red-light running frequency can be modeled, what factors may affect the frequency, and what kind of model suits cyclists' red-light running counts. Research is needed to better understand these issues.
The primary objectives of this research are (1) to develop models of cyclists' red-light running frequency within the framework of an advanced Bayesian statistical approach and (2) to validate the reliability of the developed models. The study focuses on signalized intersections where red-light running by cyclists often constitutes a safety concern.

Data Collection
A field survey was designed to obtain cyclists' red-light running frequency and the possible influential factors, including road geometric design, environmental conditions, and traffic conditions. Field data collection was conducted at twenty-five approaches at seventeen signalized intersections in the city of Nanjing, China. As of 2012, Nanjing was one of the largest cities in East China, with a population of 8.16 million and an area of 6,597 square kilometers.
The sites were carefully selected so that their geometric design and traffic control features represent the most common situations in major cities in China. More specifically, (a) there had to be a reasonably high number of bicycles during the observation period for the data extraction effort to be efficient and (b) each intersection had to have pedestrian signals, so that red-light running behavior could be judged.
Field data collection was conducted only during weekday peak periods, under fine weather conditions, and when traffic police were not present. Two synchronized video cameras were set up in the field. One camera (camera A) was placed beside the crosswalks to film the cyclists' whole crossing process, and the other camera (camera B) was set up on top of a roadside building to observe the traffic volume and bicycle volume. The cameras were carefully placed so that the cyclists were unaware that they were being observed. A total of 47.5 hours of data was recorded at the selected sites.
The recorded videos were then reviewed in the lab for data reduction. A single trained graduate student reviewed all the videos to ensure that consistent criteria were applied for identifying the crossing behaviors at different sites. From camera A, information on cyclists' crossing behavior, including red-light running and non-red-light running behavior, was extracted. From camera B, information on traffic conditions, such as bicycle flow volume and conflict traffic volume, was extracted. Since e-bikes run much faster than conventional bicycles, the type of bicycle (e-bike or conventional bicycle) was recorded. Bicycle speed and vehicle speed were also extracted from camera B. The speed of bicycles was estimated using the VideoStudio software. VideoStudio can process the video files frame by frame at a rate of 25 frames per second, so that the observer can determine the speed of a bicycle by comparing its locations in different frames. Vehicle speed was estimated using the same method. The road geometric and environmental conditions at the selected sites, such as lane width, roadway width, and pedestrian signal type, were also recorded by the investigators during the survey.
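The frame-by-frame speed estimate described above can be sketched as follows. This is a minimal illustration, not the procedure implemented in VideoStudio: the function name, the use of a known ground distance between two reference points, and the example numbers are all assumptions; only the 25 frames per second rate comes from the text.

```python
FRAME_RATE = 25.0  # frames per second, as stated for the video review


def speed_kmh(distance_m, start_frame, end_frame, frame_rate=FRAME_RATE):
    """Average speed (km/h) over a known ground distance between two frames."""
    frames = end_frame - start_frame
    if frames <= 0:
        raise ValueError("end_frame must be later than start_frame")
    elapsed_s = frames / frame_rate      # elapsed time in seconds
    return distance_m / elapsed_s * 3.6  # convert m/s to km/h


# Hypothetical example: a bicycle covers 10 m between frames 100 and 150 (2 s)
print(speed_kmh(10.0, 100, 150))
```

With these hypothetical inputs the bicycle travels at 5 m/s, that is, 18 km/h.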
Note that, with respect to the traffic signal, three types of crossing behavior were classified as "red-light running": (1) cyclists who cross the intersection during the red signal; (2) cyclists who begin to cross when the signal is green but do not finish during the green signal; (3) cyclists who cross part of the intersection during the red signal and then continue crossing during the green signal.
In total, 2,961 red-light running behaviors were identified at the selected signalized intersections. The original data, collected in 1-minute intervals, were then aggregated into 5-minute intervals, resulting in a sample size of 570.
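The aggregation step can be illustrated with a short sketch. The function name and the example counts are hypothetical, and dropping a trailing partial block is an assumption made here so that every interval covers the same time span. The last line checks that 47.5 hours of video corresponds to 570 five-minute intervals, which agrees with the reported sample size.

```python
def aggregate_counts(minute_counts, width=5):
    """Sum consecutive 1-minute counts into non-overlapping blocks of `width` minutes.

    A trailing partial block is dropped (an assumption of this sketch).
    """
    n_blocks = len(minute_counts) // width
    return [sum(minute_counts[k * width:(k + 1) * width]) for k in range(n_blocks)]


# Hypothetical 1-minute counts with many zeros
minute_counts = [0, 2, 0, 0, 1, 0, 0, 3, 0, 0, 1, 0]
five_minute_counts = aggregate_counts(minute_counts)
print(five_minute_counts)  # [3, 3]

# 47.5 hours of video split into 5-minute intervals
print(int(47.5 * 60 / 5))  # 570
```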

Methodology
The Poisson distribution assumes that the mean equals the variance. However, this assumption does not suit this case because the variance of the red-light running counts is often greater than the mean. To deal with the overdispersion arising from unobserved or unmeasured heterogeneity, it is assumed that

$$\lambda_i = \mu_i \exp(\varepsilon_i),$$

where $\mu_i$ is the expected amount of red-light running for the $i$th time period and $\exp(\varepsilon_i)$ represents a multiplicative random effect to model possible overdispersion in cyclists' red-light running counts.

The PG model is obtained by the following assumption:

$$\exp(\varepsilon_i) \mid \kappa \sim \mathrm{Gamma}(\kappa, \kappa),$$

where $\kappa$ is the inverse dispersion parameter. The dispersion parameter is usually given as $\alpha = 1/\kappa$. For the PG model, the mean and variance are given as follows [21]:

$$E(Y_i) = \mu_i, \qquad \mathrm{Var}(Y_i) = \mu_i + \frac{\mu_i^2}{\kappa}.$$

Similarly, the PLN model is obtained by the following assumption:

$$\varepsilon_i \mid \sigma_\varepsilon^2 \sim N(0, \sigma_\varepsilon^2),$$

where $\sigma_\varepsilon^2$ represents the extra-Poisson variance. For the PLN model, the mean and variance are given as follows [21]:

$$E(Y_i) = \mu_i \exp(0.5\,\sigma_\varepsilon^2), \qquad \mathrm{Var}(Y_i) = E(Y_i) + [E(Y_i)]^2 \left(\exp(\sigma_\varepsilon^2) - 1\right).$$

If such prior information is available, it should be used to formulate the so-called informative priors. In contrast, uninformative (vague) priors are usually used to reflect the lack of prior information. Since prior information about the PG model and PLN model parameters is not available, uninformative prior distributions are used; they are specified in terms of $0_p$, a $p \times 1$ vector of zeros, and $I_p$, the $p \times p$ identity matrix.
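To make the overdispersion concrete, the sketch below evaluates the standard mean and variance of a Poisson count whose rate carries a multiplicative gamma random effect with mean 1 (PG) or a lognormal random effect with zero log-scale mean (PLN). The names `mu`, `kappa`, and `sigma2` are this sketch's labels for the expected count, the inverse dispersion parameter, and the extra-Poisson variance, and the numerical values are hypothetical rather than estimates from the study.

```python
import math


def pg_moments(mu, kappa):
    """Mean and variance of a Poisson-gamma count (inverse dispersion kappa)."""
    return mu, mu + mu ** 2 / kappa


def pln_moments(mu, sigma2):
    """Mean and variance of a Poisson-lognormal count (log-scale variance sigma2)."""
    mean = mu * math.exp(0.5 * sigma2)
    var = mean + mean ** 2 * (math.exp(sigma2) - 1.0)
    return mean, var


# Hypothetical parameter values, for illustration only
pg_mean, pg_var = pg_moments(5.0, 2.0)
pln_mean, pln_var = pln_moments(5.0, 0.5)
print(pg_mean, pg_var)   # 5.0 17.5
print(pln_var > pln_mean)  # True: variance exceeds the mean
```

In both cases the variance exceeds the mean whenever the random effect has positive variance, and it collapses to the Poisson case as `kappa` grows large or `sigma2` goes to zero.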

Model Comparison.
The deviance information criterion (DIC) was used for model comparison. Among the candidate models, the one with the lowest DIC is considered the best. DIC can be calculated by [22]

$$\mathrm{DIC} = \bar{D} + p_D, \qquad p_D = \bar{D} - \hat{D},$$

where $\bar{D}$ is the posterior mean of the unstandardized deviance of the model, $\hat{D}$ is the deviance evaluated at the point estimate of the model's parameters, and $p_D$ is the effective number of parameters in the model.
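As a small illustration of the DIC arithmetic, the sketch below takes a list of deviance values from MCMC draws and the deviance evaluated at the point estimate of the parameters. The function name and the numbers are hypothetical; in practice software such as WinBUGS reports these quantities directly.

```python
def dic(deviance_draws, deviance_at_estimate):
    """Return (DIC, p_D) from posterior deviance draws and the plug-in deviance."""
    d_bar = sum(deviance_draws) / len(deviance_draws)  # posterior mean deviance
    p_d = d_bar - deviance_at_estimate                 # effective number of parameters
    return d_bar + p_d, p_d


# Hypothetical deviance draws, for illustration only
draws = [1052.0, 1058.0, 1055.0, 1051.0, 1054.0]
value, p_d = dic(draws, 1046.0)
print(value, p_d)  # 1062.0 8.0
```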

Data Description.
The dependent variable of the model was the number of cyclists' red-light running events in five minutes, with a sample size of 570. The data in 1-minute intervals were not used because they contained too many zeros.

Model Specification.
The models were specified using the software package WinBUGS. MCMC sampling techniques were used to approximate the posterior distributions (mean and standard deviation) of the model parameters. Two Markov chains for each parameter in the models were run for 20,000 iterations, 10,000 of which were discarded as a burn-in sample. Monitoring convergence is important since it ensures that the posterior distribution has been found. Convergence was monitored in two ways. First, convergence of the two chains was assessed using the Brooks-Gelman-Rubin (BGR) statistic; a BGR value less than 1.2 indicates convergence [23]. Second, convergence was monitored by visual inspection of the MCMC trace plots.
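The idea behind the convergence check can be sketched with the classic variance-based Gelman-Rubin potential scale reduction factor for a single parameter. This is an assumption made for illustration: WinBUGS computes its own variant of the diagnostic, and the chains below are hypothetical toy values, not output from the study.

```python
from statistics import mean, variance


def potential_scale_reduction(chains):
    """Variance-based Gelman-Rubin statistic for equal-length chains of one parameter."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length (>= 2)")
    within = mean([variance(c) for c in chains])        # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])   # B: between-chain variance
    pooled = (n - 1) / n * within + between / n         # pooled variance estimate
    return (pooled / within) ** 0.5


# Two hypothetical post-burn-in chains for one coefficient
chains = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]
r_hat = potential_scale_reduction(chains)
print(r_hat < 1.2)  # True: below the 1.2 threshold cited in the text
```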
The results of the model specification are presented in Table 2. Parameters in the final models are significant at the 95% level (i.e., the 95% credible intervals do not include values whose sign differs from that of the posterior mean). Generally, the two probability distributions provide similar parameter estimates. In the final equations of the two models, $RF_{PG}$ and $RF_{PLN}$ represent the expected amount of cyclists' red-light running during a 5-minute time period, $x_1$ represents the bicycle flow volume, $x_2$ represents the conflict traffic volume, $x_3$ represents the pedestrian signal type (1: countdown signal; 0: flashing signal), $x_4$ represents the average speed of vehicles, and $x_5$ represents the proportion of e-bikes.
As observed from Table 2, the DIC value for the PG model is 1097, whereas, for the PLN model with the same explanatory variables, the DIC value is 1063. A difference of more than 10 in the DIC value is generally taken to rule out the model with the higher DIC [22]. As the drop in the DIC value is 34, the analysis of the DIC suggests that the PLN model outperforms the PG model.
A positive coefficient indicates that cyclists' red-light running frequency increases as the corresponding variable increases, whereas a negative coefficient indicates that the frequency decreases as the corresponding variable increases. The coefficient for bicycle flow volume was approximately 1.63, which means that the cyclists' red-light running count increases more rapidly than the bicycle flow volume. The proportion of e-bikes was highly significant with a positive sign in both models, indicating that an increase in the proportion of e-bikes among bicycles also increases the total red-light running frequency. Furthermore, the coefficient associated with pedestrian signal type (1: countdown signal; 0: flashing signal) was positive, ranging between 0.514 and 0.564. This means that the expected red-light running frequency under a countdown signal is about 67% to 76% higher than under a flashing signal (exp(0.514) − 1 ≈ 0.67; exp(0.564) − 1 ≈ 0.76). Equivalently, replacing a countdown signal with a flashing signal corresponds to a reduction of about 40% to 43% (1 − exp(−0.514) ≈ 0.40; 1 − exp(−0.564) ≈ 0.43). The coefficients for conflict traffic volume and vehicle speed are significantly negative, implying that the amount of red-light running decreases as conflict traffic volume and vehicle speed increase.
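The interpretation of an indicator coefficient can be checked numerically. The sketch assumes a log-link count model, which is consistent with the exp(·) transformation used in the text, so that switching a 0/1 indicator from 0 to 1 multiplies the expected count by exp(β) and switching it back multiplies by exp(−β). The function names are hypothetical; the two coefficient values are those reported for the pedestrian signal type.

```python
import math


def percent_change_on(beta):
    """Percent change in the expected count when the indicator goes from 0 to 1."""
    return (math.exp(beta) - 1.0) * 100.0


def percent_change_off(beta):
    """Percent change in the expected count when the indicator goes from 1 to 0."""
    return (math.exp(-beta) - 1.0) * 100.0


for beta in (0.514, 0.564):
    # countdown vs flashing (increase), then flashing vs countdown (decrease)
    print(beta, round(percent_change_on(beta), 1), round(percent_change_off(beta), 1))
```

The two directions are not symmetric in percentage terms: an increase of roughly 67% to 76% in one direction corresponds to a decrease of roughly 40% to 43% in the other.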
In the Bayesian $p$-value, $D(\cdot)$ is the discrepancy statistic and $\theta$ denotes the model parameters. A model is considered suspect if the observed value has a tail area probability close to 0 or 1 [24]. Three discrepancy statistics, $D_1$, $D_2$, and $D_3$, were selected to check for potential failings of the model, each computed with $y$ taken as either the simulated data $y^{\mathrm{rep}}$ or the observed data.
The Bayesian $p$ values for the $D_1$, $D_2$, and $D_3$ discrepancy statistics are reported in Table 2 and lie between about 0.5 and 0.6. Because these values are far from 0 and 1, they suggest that the model fits the observed red-light running counts well, in the sense that discrepancies computed from the simulated data are comparable to those computed from the observed data. Therefore, cyclists' red-light running frequency replicated using the developed models is likely to be close to the frequency observed on site.
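A posterior predictive $p$ value of this kind is typically estimated as the share of MCMC draws in which the discrepancy of the replicated data is at least as large as that of the observed data. The sketch below shows that counting step only; the function name and the discrepancy values are hypothetical, and the specific discrepancy statistics used in the study are not reproduced here.

```python
def bayes_p_value(replicated, observed):
    """Share of draws whose replicated discrepancy is at least the observed one."""
    if not replicated or len(replicated) != len(observed):
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for d_rep, d_obs in zip(replicated, observed) if d_rep >= d_obs)
    return hits / len(replicated)


# Hypothetical discrepancies, one pair per posterior draw
d_rep = [10.2, 8.7, 12.1, 9.5]
d_obs = [9.8, 9.1, 11.0, 10.3]
print(bayes_p_value(d_rep, d_obs))  # 0.5
```

Values near 0.5 indicate that the replicated and observed discrepancies are similar, whereas values near 0 or 1 flag a possible misfit.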

Conclusion and Discussions
This study evaluated the application of the PG model and the PLN model, developed using Bayesian statistical techniques, for modeling the frequency of cyclists' red-light running behavior at signalized intersections. Data were collected at seventeen signalized intersections in the city of Nanjing. In total, 2,961 red-light running behaviors were observed at the selected sites. The amount of cyclists' red-light running was modeled as events that occur randomly in a given interval of time (i.e., 5 minutes) under the assumption of a Poisson distribution. Overdispersion of the count data was accounted for by adding a multiplicative random effect term with a gamma or lognormal distribution to the original Poisson regression model, resulting in the PG model and the PLN model, respectively. Within the Bayesian framework, the model specification results demonstrated that the two models can fit the observed data; however, a comparison of the DIC values showed that the PLN model outperformed the PG model. The developed models were validated using the Bayesian $p$ value with three discrepancy statistics. The Bayesian $p$ values are far from 0 and 1, which indicates the reliability of the models.
The predictive model of cyclists' red-light running frequency developed in this study can be used by researchers or agencies to estimate the expected red-light running frequency given information such as bicycle volume, conflict traffic volume, and traffic control. In addition, the research results can help guide policies and countermeasures aimed at reducing red-light running by cyclists. For example, since a flashing signal is associated with a lower red-light running frequency, it could be installed at signalized intersections instead of a countdown signal. A high proportion of e-bikes was also found to increase the red-light running frequency at signalized intersections; license management of e-bikes might be established by law, with the aim of strengthening the enforcement of e-bike rules for road safety [8, 11, 25].
There are several limitations in the present study. The data used for model specification had a small sample size of 570. Data need to be collected at more signalized intersections with heterogeneous traffic, geometric, and traffic control characteristics. More variables should also be added to the prediction models. Further research should therefore expand the use of state-of-the-art models, such as random parameter Bayesian models, to account for the heterogeneity. In addition, the survey was conducted during fine weather. A further study could be conducted under different weather conditions to evaluate their impact on the red-light running frequency of cyclists; other factors, such as pavement markings [26] and traffic conflicts [27], could also be considered in further study.

3.1. Poisson-Gamma (PG) Model and Poisson-Lognormal (PLN) Model.
In this study, the Poisson-gamma (PG) model and the Poisson-lognormal (PLN) model were used to fit the cyclists' red-light running frequency observed during a particular time interval. Let $Y_i$ represent the amount of cyclists' red-light running for the $i$th specific time period. It is assumed that the frequencies are independent and that

$$Y_i \mid \lambda_i \sim \mathrm{Poisson}(\lambda_i), \quad i = 1, 2, 3, \ldots, n.$$
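For reference, the Poisson assumption can be written out explicitly. The block below states the standard Poisson probability mass function and its moments, using $\lambda_i$ as this sketch's symbol for the Poisson rate of the $i$th period; it shows why the plain Poisson model forces the mean and variance of a count to coincide.

```latex
\[
P(Y_i = y \mid \lambda_i) = \frac{\lambda_i^{y}\, e^{-\lambda_i}}{y!},
\quad y = 0, 1, 2, \ldots,
\qquad
E(Y_i \mid \lambda_i) = \mathrm{Var}(Y_i \mid \lambda_i) = \lambda_i .
\]
```

Any extra variation in the observed counts therefore has to enter through the rate itself, which is the role of the random effect in the PG and PLN models.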

4.2. Model Validation.
The Bayesian posterior $p$ values were used to assess the goodness-of-fit of the model. Only the PLN model was considered for the validation procedure, as it had the lowest DIC, which suggested that it provides a better fit to the data set. This procedure first generates a replicated data set (simulation data set) based on the postulated model and then compares the simulation data set with the observed data set through the discrepancy statistics. The probability that the simulated data set $y^{\mathrm{rep}}$ could be more extreme than the observed one $y$ is measured by

$$\text{Bayes } p\text{-value} = P\left[ D(y^{\mathrm{rep}}, \theta) \ge D(y, \theta) \mid y \right].$$

3.2. Bayesian Estimation and Prior Distributions.
The PG model and PLN model are estimated in a full Bayesian context via MCMC (Markov Chain Monte Carlo) simulation. To obtain the posterior distribution of the model parameters $\beta$, $\kappa$ (under the PG model), and $\sigma_\varepsilon^2$ (under the PLN model), prior distributions of these parameters must first be specified. Prior distributions are meant to reflect, to some extent, prior knowledge about the parameters of interest.

Table 1: Descriptive statistics of explanatory variables.

Table 2: Summary of model estimation results.