Evaluation, Classification, and Influential Factors Analysis of Traffic Congestion in Chinese Cities Using the Online Map Data

This study proposes a new method to describe, compare, and classify the traffic congestion states of 23 Chinese cities using online map data, and further reveals the factors that influence them. First, real-time traffic congestion information is obtained from the AutoNavi online map at 15-minute intervals. Next, a new index, named the congestion ratio, is introduced to characterize the overall congestion pattern of each city based on the online map data. A cluster analysis based on the temporal distribution of the congestion ratio then identifies groups of cities with similar traffic congestion states. The cities are categorized into four groups according to the severity of traffic congestion: severely congested, less severely congested, amble, and smooth cities. Lastly, multiple linear regression models are developed to identify the primary factors that affect the congestion ratio. The results show that per capita road area, car ownership, and vehicle miles traveled (VMT) significantly influence the congestion ratio. Sensitivity analyses are also carried out to identify more effective policy measures for mitigating traffic congestion in urban areas.


Introduction
Accurate description and classification of the traffic congestion states of an urban transportation network improve our understanding of how the transportation network of an entire country or region performs. Further analysis of the primary factors causing traffic congestion can then be used to ease congestion efficiently through planning, operation, and management.
Over the past 50 years, many studies have focused on distinguishing the congestion state of individual road links. Researchers have developed a large number of traffic flow models (see, e.g., reviews [1][2][3]), including the well-known Macroscopic Fundamental Diagram (MFD) [4,5] and Kerner's three-phase traffic theory [6][7][8][9][10] (see the papers of Rehborn et al. [11,12] on the development of Kerner's theory), which are used to distinguish and describe congestion patterns on road links. In practice, the US HCM 2000 applies the level of service (LOS), using letters A through F, with A being the best and F being the worst [13]. A standard similar to the HCM has been established in China. These standards are mainly based on proximity to other vehicles, travel speed, volume-to-capacity ratio, and so forth. Although these road-link-based congestion evaluation methods are detailed and accurate, they can hardly be used to evaluate traffic congestion across a whole city, let alone to compare congestion between different cities.
Traditional regional congestion measures were mostly based on travel survey data, including travel delay and travel time, such as the National Household Travel Survey (NHTS) in the US and the Person Travel (PT) survey in the Tokyo metropolitan area. In recent years, with the popularity of "Big Data," more accurate regional congestion measures have become possible. INRIX provided detailed traffic speed data every 800 feet (250 meters) across 4 million miles of roads in 40 countries. Powered by INRIX's data, the Texas Transportation Institute annually developed a measure of traffic congestion for many urbanized areas [14]. However, in most developing countries, including China, real-time traffic data is either not published due to restrictions or not collectable owing to a lack of devices.
Previous research has examined several factors related to congestion. Generally, there are three ways to analyze the factors that influence urban traffic status. First, some researchers built spatial interaction models and solved them mathematically. Tsekeris and Geroliminis [15] applied the Macroscopic Fundamental Diagram (MFD) in a monocentric city model to analyze the relationship between land use and traffic congestion. The findings reinforced the "compact city" hypothesis by favoring a larger mixed-use core area compared to the peripheral area. Safirova [16] developed a fully closed general equilibrium model of a monocentric city to analyze the effects of telecommuting on the urban economy and traffic congestion. The study concluded that, under certain conditions, telecommuting increased the city size and the average transport costs of commuters. Gubins and Verhoef [17] considered a monocentric city model with a traffic bottleneck. Instead of building the standard monocentric model of urban land use, Anas and Xu [18] built a fully closed computable general equilibrium model with endogenous traffic congestion and then discussed the relationship between congestion, land use, and job dispersion. These models rest on several assumptions, such as a simplified geometrical city shape and hypothetical traveler behaviors, whereas an urban transportation system is highly complex. As a result, mismatches between reality and the theoretical models are inevitable.
Statistical analysis is the second approach used in this field. Sarzynski et al. [19] examined the relationship between 7 distinct aspects of land use in 1990 and 3 measures of transport congestion in 2000, using data from a nationally representative sample of 50 of the 100 largest US urban areas as of 1990. They revealed that most measures of land use in 1990 were significantly related to traffic congestion levels in 2000. Scharank et al. [14] used real data and theoretical assumptions to illustrate that higher public transportation ridership, more roadways in critical corridors, and better traffic flow conditions can relieve congestion. Cervero [20] employed a path model to causally sort out the links between freeway investments and traffic increases, using data for 24 California freeway projects across 15 years. The research confirmed the presence of induced travel, that is, new travel spurred by roadway investments. Nasri and Zhang [21] examined the changes in vehicle miles traveled (VMT) to analyze the effectiveness of transit-oriented development (TOD) in reducing driving, using the cases of Washington DC and Baltimore. Although the research did not directly discuss traffic congestion, it pointed out how VMT, which significantly affects congestion, was influenced by TOD.
The third approach is the case study. Sim et al. [22] used Singapore as an example to illustrate how integrating land use and transport planning reduced work-related travel. Findings from related surveys showed that decentralizing commercial activities had great potential to reduce work travel and reliance on the car, and thereby to alleviate traffic congestion. The studies above were generally based on data from developed countries, while quantitative factor analyses for developing countries, including China, are limited.
Because published data is lacking in many developing countries such as China, this study uses the integrated online map data provided by AutoNavi to describe the road congestion state (Section 2). Next, a novel method to evaluate the congestion state of a city is proposed using the online map data, and traffic congestion in Chinese cities is classified (Section 3). Moreover, the factors influencing traffic congestion in Chinese cities are discussed based on the online map data (Section 4). The major findings and limitations of this study are given in Section 5.

Description of Traffic Congestion Using the Online Map Data
This research selects 23 cities in China and takes their urban zones as the research objects. These 23 cities are Beijing, Changchun, Shenyang, Shanghai, Guangzhou, Hangzhou, Dalian, Tianjin, Qingdao, Chengdu, Chongqing, Wuhan, Changsha, Nanchang, Wuxi, Changzhou, Xiamen, Ningbo, Xi'an, Kunming, Xining, Fuzhou, and Shenzhen. The reason for choosing these cities is twofold. First, they represent the major large cities in China. Second, traffic congestion has become one of the most serious urban problems in most of them.

Access to the Map Image Information on the Web.
Map images on the web are composed of different layers: the bottom layer displays basic geographic information; the middle layer shows names, comments, and other information about different venues; the upper layer, which is used in this study, contains traffic information depicted in red, yellow, and green. Using MATLAB 7.11.0 (R2010b), we download all the blocks of the upper layer and splice them according to their naming rules to generate a map containing the real-time network running state of a given area. This process is shown in Figures 1 and 2.
The congestion ratio collected from AutoNavi corresponds to residents' travel behaviors. It is an appropriate substitute for evaluating and comparing traffic congestion between different cities.
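As a rough illustration of how a spliced traffic layer could be turned into a single number, the sketch below counts coloured pixels. It is not the authors' procedure: the colour thresholds, the function names, and the pixel-share definition of the ratio are all assumptions made for this example. The only element taken from the text is that both red (congested) and yellow (amble) states count as congestion.

```python
# Hypothetical sketch: estimate a congestion ratio from traffic-layer pixels.
# The colour thresholds and the pixel-share definition are assumptions.

def classify_pixel(rgb):
    """Label one (R, G, B) pixel as 'red', 'yellow', 'green', or None."""
    r, g, b = rgb
    if r > 180 and g < 100 and b < 100:
        return "red"      # congested
    if r > 180 and g > 150 and b < 100:
        return "yellow"   # amble (slow-moving)
    if g > 150 and r < 120 and b < 120:
        return "green"    # smooth
    return None           # background or any other colour


def congestion_ratio(pixels):
    """Share of coloured road pixels that are red or yellow."""
    counts = {"red": 0, "yellow": 0, "green": 0}
    for rgb in pixels:
        label = classify_pixel(rgb)
        if label is not None:
            counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no traffic information in this image
    return (counts["red"] + counts["yellow"]) / total


# Five made-up pixels: one red, one yellow, two green, one white background.
sample = [(230, 40, 40), (240, 200, 50), (60, 200, 70), (60, 200, 70), (255, 255, 255)]
print(congestion_ratio(sample))
```

A length-weighted ratio over road links would need the underlying network geometry, which a pixel count like this one does not capture.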

Evaluation and Classification of Congestion Using Cluster Analysis
Travel characteristics on weekdays differ significantly from those on weekends. This results in a significant difference in urban traffic conditions between weekdays and weekends, as shown in Figure 3. Therefore, this study divides the structured data into two groups: the weekday group and the weekend group. For each group, the congestion ratio during each 15-minute period is chosen as the variable for the following cluster analysis.

Data Preprocessing.
As discussed in the previous section, the target variables are determined as the summation of the congestion ratio during each time period. The original data must be preprocessed for the cluster analysis. The objective matrix is shown below. Each row represents a city, and each column represents a 15-minute time period. Because one day is divided into 96 periods, the objective matrix has 23 rows and 96 columns:
$$X = \begin{pmatrix} X_{1,1} & \cdots & X_{1,96} \\ \vdots & \ddots & \vdots \\ X_{23,1} & \cdots & X_{23,96} \end{pmatrix}, \qquad X_{\text{city},t} = \frac{1}{\text{day}_t} \sum_{d=1}^{\text{day}_t} x_{\text{city},t,d},$$
where $x_{\text{city},t,d}$ represents the congestion ratio at the $t$th period on the $d$th day and $\text{day}_t$ represents the number of days in our dataset that contain data for the $t$th period. As a result, $X_{\text{city},t}$ is the average congestion ratio of a given city in the $t$th period.
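The averaging step above can be sketched in a few lines. The record layout `(city, day, period, ratio)`, the function name, and the sample values are assumptions made for illustration; only the 96 periods per day and the per-period averaging come from the text.

```python
from collections import defaultdict

PERIODS_PER_DAY = 96  # 24 hours split into 15-minute periods


def average_profiles(observations):
    """Build {city: [mean ratio for periods 0..95]} from raw records.

    `observations` is an iterable of (city, day, period, ratio) tuples
    (a hypothetical layout). Periods without data are left as None.
    """
    sums = defaultdict(lambda: [0.0] * PERIODS_PER_DAY)
    days = defaultdict(lambda: [0] * PERIODS_PER_DAY)
    for city, _day, period, ratio in observations:
        sums[city][period] += ratio
        days[city][period] += 1   # number of days with data for this period
    return {
        city: [s / n if n else None for s, n in zip(sums[city], days[city])]
        for city in sums
    }


# Made-up records: two days for Beijing and one day for Wuxi, period 0 only.
records = [
    ("Beijing", 1, 0, 0.125),
    ("Beijing", 2, 0, 0.375),
    ("Wuxi", 1, 0, 0.02),
]
profiles = average_profiles(records)
print(profiles["Beijing"][0], profiles["Wuxi"][0])
```

Each value of the returned dictionary is one row of the objective matrix; running the function once on weekday records and once on weekend records gives the two groups used in the cluster analysis.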

Cluster Methodology.
Because it is hard to estimate the number of cluster groups accurately in advance when classifying city congestion, this study performs hierarchical clustering with the median clustering agglomerative method (HMC) instead of the k-means method. Since all the variables have the same physical meaning, there is no need to standardize them. The HMC model has several basic parameters.
The Euclidean distance between sample city $i$ and sample city $j$ is denoted as $d_{ij}$:
$$d_{ij} = \sqrt{\sum_{t=1}^{96} \left( X_{i,t} - X_{j,t} \right)^2},$$
where $X_{i,t}$ is the average congestion ratio of city $i$ in the $t$th period. In the HMC model, the distance between cluster $G_p$ and cluster $G_q$, denoted as $D_G(p,q)$, is defined as the average of the distances between each sample city $i$ in cluster $G_p$ and each sample city $j$ in cluster $G_q$:
$$D_G(p,q) = \frac{1}{n_p n_q} \sum_{i \in G_p} \sum_{j \in G_q} d_{ij},$$
where $n_p$ and $n_q$ are the numbers of cities in $G_p$ and $G_q$. The minimum value of $D_G(p,q)$ is denoted as $D_0$, which is the distance between the two closest clusters $G_p$ and $G_q$:
$$D_0 = \min_{p \neq q} D_G(p,q).$$
The HMC model contains 4 steps.
Step 1. Each sample city forms its own cluster; that is, the number of clusters equals the number of sample cities.
Step 2. Calculate the distance $D_G(p,q)$ between every pair of clusters and find $D_0$.
Step 3. Combine $G_p$ and $G_q$ into a new cluster $G_r$, with $G_r = G_p \cup G_q$.
Step 4. Iterate Steps 2 and 3 until all sample cities are in one cluster.
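The four steps can be sketched as follows. The sketch uses the cluster distance exactly as it is written above (the mean of all pairwise Euclidean distances), which is the rule usually called average linkage; whether the SAS run used this rule or a median variant cannot be checked from the text, so treat the code as an illustration. Function names and the toy profiles are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(profiles, cluster_p, cluster_q):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j])
                for i in cluster_p for j in cluster_q)
    return total / (len(cluster_p) * len(cluster_q))


def agglomerate(profiles):
    """Return the merge history [(cluster_p, cluster_q, distance), ...]."""
    clusters = [(name,) for name in profiles]      # Step 1: one city per cluster
    history = []
    while len(clusters) > 1:                       # Step 4: repeat until one cluster
        best = None
        for a in range(len(clusters)):             # Step 2: find the closest pair
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(profiles, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d0, a, b = best
        history.append((clusters[a], clusters[b], d0))
        merged = clusters[a] + clusters[b]         # Step 3: merge the pair
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history


# Toy profiles with two periods each (made-up values, not the study data).
toy = {"A": [0.0, 0.0], "B": [0.0, 1.0], "C": [5.0, 5.0]}
for left, right, dist in agglomerate(toy):
    print(left, right, round(dist, 3))
```

The merge history is what a cluster tree (dendrogram) visualizes: cutting it after a given number of merges yields the corresponding number of groups.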
After the hierarchical clustering results are obtained, the number of clusters must be determined. However, there is no fully satisfactory rule for determining the exact number of clusters in hierarchical cluster analysis. Milligan and Cooper [24] examined 30 procedures for determining the number of clusters. Their results showed that the pseudo $F$ statistic, the pseudo $t^2$ statistic developed by Duda and Hart [25], and the cubic clustering criterion (CCC) perform better.
The clustering is conducted in SAS 9.2, and the number of clusters is decided using four statistics: a local peak of the CCC, a large difference between the $R^2$ statistic at the current step and that at the next cluster fusion, a large pseudo $F$ statistic, and a small pseudo $t^2$ statistic combined with a large pseudo $t^2$ statistic at the next cluster iteration.
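To make one of these criteria concrete, the sketch below computes the pseudo $F$ statistic for a given partition, using the standard definition as the ratio of the between-cluster mean square to the within-cluster mean square. The study itself relied on SAS output; the function names and the sample points here are hypothetical.

```python
def sum_of_squares(points):
    """Sum of squared distances from each point to the group centroid."""
    dims = len(points[0])
    centroid = [sum(p[k] for p in points) / len(points) for k in range(dims)]
    return sum((p[k] - centroid[k]) ** 2 for p in points for k in range(dims))


def pseudo_f(clusters):
    """Pseudo F for a partition given as a list of clusters (lists of points)."""
    n = sum(len(c) for c in clusters)
    g = len(clusters)
    if g < 2 or n <= g:
        raise ValueError("pseudo F needs at least 2 clusters and more points than clusters")
    total = sum_of_squares([p for c in clusters for p in c])
    within = sum(sum_of_squares(c) for c in clusters)
    if within == 0:
        return float("inf")  # all points coincide with their cluster centroids
    return ((total - within) / (g - 1)) / (within / (n - g))


# Two tight, well-separated groups (made-up points) give a large pseudo F.
partition = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
print(pseudo_f(partition))
```

In practice the statistic is evaluated for each candidate number of clusters along the merge history, and a relatively large value marks a good stopping point.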

Cluster Analysis Results for the Weekday Group.
The cluster analysis statistical indexes, namely, the $R^2$ value, the CCC value, the pseudo $F$ statistic, and the pseudo $t^2$ statistic, indicate that the cities should be divided into four clusters, as shown in Table 1. The cluster tree is shown in Figure 5; the congestion characteristics for each group are shown in Figure 6.
The first category is severely congested cities. This group, represented by Beijing, has very lasting and extensive congestion.
The second category is less severely congested cities. Represented by Guangzhou and Wuhan, this group of cities shares a relatively flat morning peak, an obvious peak at noon, and a long-lasting and highly congested evening peak. This group also includes Fuzhou, Xi'an, and Xining. These cities should further examine the congested time periods and road sections and take targeted action on them before the jams get worse.
The third category is amble cities. Shanghai, Changchun, Dalian, Ningbo, Shenyang, Chengdu, Changsha, Shenzhen, Kunming, Hangzhou, Tianjin, Qingdao, Nanchang, and Chongqing belong to this group, in which the morning and evening peaks are obvious but overall traffic conditions are moderate. Congestion is limited to parts of the road network and to relatively short time periods. This is the most common traffic condition for large and medium-sized cities in China at present.
The fourth category is smooth cities. Represented by Wuxi, Changzhou, and Xiamen, cities in this category have smooth traffic conditions throughout the day.

Cluster Analysis Results for the Weekend Group.
The cluster analysis statistical indexes indicate that it is proper to divide the cities into five categories, as shown in Table 2. Figure 7 shows that the fifth category contains only Dalian. The reason for Dalian's heterogeneity is that its morning peak is much earlier than that of other cities: the peak starts about 2 hours earlier and also ends 2 hours earlier. Considering the obvious similarities between Dalian and the amble cities in congestion duration and peak value, we classify Dalian as an amble city.
Generally speaking, traffic on weekends is significantly smoother than on weekdays, and the rush-hour peak value is also significantly lower. The morning peak on weekends starts about 2 hours later than on weekdays. The congestion characteristics for each weekend group are shown in Figure 8.
The first category is severely congested cities. This group, represented by Beijing, still experiences very lasting and extensive congestion, as on weekdays.
The second category is less severely congested cities. This group includes Guangzhou, Wuhan, Fuzhou, and Xi'an.
The third category is amble cities. Most of the cities, including Shanghai, Changchun, Dalian, Ningbo, Shenyang, Chengdu, Changsha, Shenzhen, Kunming, Hangzhou, Tianjin, Qingdao, Xining, Nanchang, and Chongqing, belong to this group. The congestion ratio in these cities is below 15% most of the time.
The fourth category is smooth cities. This category is represented by Wuxi, Changzhou, and Xiamen. Cities in this category have very smooth traffic conditions throughout the weekend.

Primary Factors Influencing Congestion and Sensitivity Analyses
Methodology and Data.
This study uses multiple linear regression to identify the primary factors influencing traffic congestion:
$$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n,$$
where the dependent variable $y$ is the congestion level. It is calculated as the 15th percentile of the summation of the congestion ratio, which serves as the evaluation index for congestion. In addition, considering the heterogeneity between weekdays and weekends, two models are developed, one for weekdays and one for weekends. Figure 9 shows the 15th percentile of the congestion ratio for the selected 23 cities.
The independent variables $x_1, x_2, \ldots, x_n$ are the primary factors influencing congestion, and $\beta_1, \beta_2, \ldots, \beta_n$ are the corresponding coefficients. We hold that the emergence of congestion is related to the supply of network facilities and the volume of vehicles in use. The supply of facilities can be measured by per capita urban road area, and the volume of vehicles in use is jointly determined by vehicle ownership and vehicle usage. Vehicle usage can be further represented by the average daily vehicle miles traveled (VMT) (or vehicle kilometers traveled (VKT)), which is associated with residents' driving preferences and further related to the urban scale. According to the analysis above, the following model is proposed:
$$y = \alpha + \beta_r x_r + \beta_v x_v + \beta_m x_m,$$
where $\alpha$ is the estimable intercept term, which equals zero; $x_r$ is the reciprocal of per capita road area; $x_v$ represents the vehicles per 1,000 people; $x_m$ is the daily VMT; and $\beta_r$, $\beta_v$, and $\beta_m$ are the estimable coefficients of $x_r$, $x_v$, and $x_m$. When the supply of infrastructure, expressed as per capita road area, is very large, $x_r$ tends to 0. Likewise, when there are no vehicles in the city, $x_v$ and $x_m$ equal zero. In this situation, with plenty of road supply and no traffic demand, there should be no congestion at all; that is, the congestion ratio $y$ is zero. Therefore, the intercept term $\alpha$ is fixed at 0 in the regression model.
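A regression with the intercept fixed at zero can be fitted by solving the normal equations without a constant column. The following sketch does this with plain Python; the helper names and every number in the example are made up for illustration and are not the study's data or estimates.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_through_origin(rows, y):
    """Least-squares coefficients of y = b1*x1 + ... + bk*xk (no intercept)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical inputs for five cities: per capita road area, vehicles per
# 1,000 people, daily VMT, and an observed congestion level.
road_area = [8.0, 12.0, 15.0, 10.0, 20.0]
vehicles = [200.0, 150.0, 100.0, 250.0, 80.0]
vmt = [30.0, 22.0, 18.0, 40.0, 15.0]
level = [0.22, 0.15, 0.11, 0.24, 0.08]

rows = [[1.0 / a, v, m] for a, v, m in zip(road_area, vehicles, vmt)]  # x_r, x_v, x_m
beta_r, beta_v, beta_m = fit_through_origin(rows, level)
print(beta_r, beta_v, beta_m)
```

Because the road-area regressor enters as a reciprocal, a positive `beta_r` means that more road area per person lowers the predicted congestion level.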
The per capita road area and vehicles per 1,000 people of the 23 cities are taken from [26]. Because daily VMT is difficult to obtain directly, we launched a nationwide online survey on vehicle usage in February 2014 [27] to discover the driving patterns in different cities in China. A total of 4,000 questionnaires were issued, and 2,130 valid responses were returned, covering 29 provinces. Among these provinces, Shanghai, Guangdong, and Beijing returned the most questionnaires, with 293, 289, and 232 valid samples, respectively. The distribution of valid returns is shown in Figure 10. The values of daily VMT are acquired through this survey.
Table 3 shows that the regression model is significant for both the weekday and weekend groups at the 85% confidence level. Specifically, all the coefficients, $\beta_r$, $\beta_v$, and $\beta_m$, are positive, which indicates that the reciprocal of per capita road area, the vehicles per 1,000 people, and VMT are all positively correlated with the congestion ratio. A smaller per capita road area, more vehicles per 1,000 people, and more VMT lead to a higher congestion ratio. This conforms to the common-sense expectation that traffic would be smoother if road supply were increased and travel demand were reduced.
A possible multicollinearity issue in this regression model needs further discussion. According to Downs [28], an increase in road supply may promote vehicle use and result in more serious congestion, which would introduce multicollinearity into the model. In fact, however, car ownership depends mainly on the level of economic development rather than on infrastructure supply. Stares and Liu [29] reviewed auto ownership research from the 1970s onward and concluded that the number of vehicles per 1,000 people is related to many factors, such as economy, culture, and geography, but that per capita income is still the most important factor influencing car ownership.

Single Factor Sensitivity Analysis.
Let the variables per capita road area and vehicles per 1,000 people be the uncertain factors, with variation ranges of ±10% and ±20%. The resulting rates of change of the congestion ratio are given in Table 4. This study takes the mean values of the three explanatory variables (the reciprocal of per capita road area, vehicles per 1,000 people, and daily VMT), $x_r$, $x_v$, and $x_m$, as the base values.
For both weekdays and weekends, changes in the two factors, per capita road area and vehicles per 1,000 people, have obvious impacts on the congestion ratio. Currently, cities in China have taken a variety of approaches to ease congestion. Megacities such as Shanghai, Beijing, and Guangzhou mainly use auctions or lotteries for private car license plates to restrict the growth of car ownership, while medium-sized and small cities largely improve their infrastructure to ease traffic pressure. The sensitivity analysis results suggest that constructing more roads in the main urban area might have a more significant effect on controlling congestion than restricting car ownership. In large cities where dense road networks have already been built, however, building more roads is fairly difficult.

Multifactor Sensitivity Analysis.
The multifactor sensitivity analysis also uses per capita road area and vehicles per 1,000 people as the changing variables. The results are presented in Table 5.
The multifactor sensitivity analysis shows that, under the assumption that daily VMT stays constant, if per capita road area and vehicles per 1,000 people both increase by 20%, the congestion ratios on weekdays and weekends decrease by only 1.0% and 0.7%, respectively. Thus, if governments only intensify the construction of road infrastructure and do not take action to guide the growth of car ownership, congestion can hardly be alleviated. On the other hand, if per capita road area is increased and car ownership is controlled by the government, congestion can be reduced significantly. In that case, the congestion ratios on weekdays and weekends are reduced by 5.6% and 5.16%, respectively.
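Both sensitivity analyses amount to re-evaluating the fitted model with scaled inputs and reporting the relative change in the predicted congestion ratio. The sketch below shows that calculation for a grid of ±10% and ±20% changes. The coefficients and base values are placeholders chosen for readability, not the estimates in Table 3, so the printed percentages will not match Tables 4 and 5.

```python
def predicted_ratio(coeffs, x_r, x_v, x_m):
    """Model with zero intercept: y = b_r*x_r + b_v*x_v + b_m*x_m."""
    b_r, b_v, b_m = coeffs
    return b_r * x_r + b_v * x_v + b_m * x_m


def relative_change(coeffs, base, road_change=0.0, vehicle_change=0.0):
    """Relative change in y when road area and vehicle ownership are scaled.

    `base` holds the base values (x_r, x_v, x_m). Because x_r is the
    reciprocal of per capita road area, a road-area change of +c divides
    x_r by (1 + c). Daily VMT (x_m) is held constant.
    """
    x_r, x_v, x_m = base
    y0 = predicted_ratio(coeffs, x_r, x_v, x_m)
    y1 = predicted_ratio(coeffs, x_r / (1 + road_change),
                         x_v * (1 + vehicle_change), x_m)
    return (y1 - y0) / y0


# Placeholder coefficients and base values (assumptions, not study results).
coeffs = (1.0, 0.0005, 0.002)
base = (0.1, 100.0, 25.0)

steps = (-0.2, -0.1, 0.0, 0.1, 0.2)
for road in steps:          # rows: change in per capita road area
    row = [relative_change(coeffs, base, road, veh) for veh in steps]
    print(f"{road:+.0%}", " ".join(f"{c:+.1%}" for c in row))
```

The row and column with a zero change reproduce the single-factor analysis, while the remaining cells correspond to the multifactor case in which both variables move together.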
It must be added, however, that guiding and controlling the growth of car ownership is only a means of easing traffic congestion rather than an end in itself. To alleviate congestion, the key still lies in reducing the use of cars. Controlling car ownership is just one way to reduce car travel. Other strategies, such as building a well-developed public transit system or nonmotorized transportation, could also play a positive role in easing traffic congestion.

Conclusions
One of the main contributions of this study is that it provides a feasible way for researchers, especially those in countries with limited standard traffic data, to analyze traffic congestion using online map data. Through comparison with the results of the travel survey in Beijing, this research confirms that the congestion ratio, derived from the congestion status shown as different colors on online maps, is a reliable index for describing and comparing the congestion states of different cities.
Based on the real-time congestion ratio data, a cluster analysis is carried out for the selected 23 Chinese cities. The result indicates that the selected cities can be divided into four categories according to their congestion patterns. Furthermore, through multiple linear regression analysis, this study finds that per capita road area, car ownership per thousand people, and daily VMT significantly affect the congestion ratio. Sensitivity analyses of the influential factors indicate that congestion is reduced more efficiently when per capita road area is increased and vehicles per 1,000 people are restricted at the same time.
This research also has some limitations. First, the congestion status data is obtained only from AutoNavi; comparison with other online maps, such as Baidu, is needed to improve the quality of the data. Second, the congested and amble states are both treated as the congestion state in this research; these two states could be distinguished in future work. Moreover, in the regression model, the daily VMT is obtained from questionnaires, and a larger sample size could increase the estimation accuracy. Lastly, a more detailed look at urban shape and household activity within the cities, together with a classification system for predicting congestion levels, is needed.

Figure 4: Travel temporal distribution from Beijing travel survey 2010.

Figure 9: 15th percentile of the congestion ratios for 23 sample cities.

Figure 10: The distribution of valid returns in nationwide vehicle usage survey.

Table 1: HMC model statistics for weekdays.

Table 2: HMC model statistics for weekends.

Table 3: Multiple regression analysis statistics.

Table 4: Single factor sensitivity analysis result on weekdays and weekends.

Table 5: Multifactor sensitivity analysis on weekdays and weekends.