Heterogenous Trip Distance-Based Route Choice Behavior Analysis Using Real-World Large-Scale Taxi Trajectory Data

Most early research on route choice behavior relied on data collected from stated preference surveys or small-scale experiments. This manuscript focused on understanding commuters' route choice behavior based on the massive amount of trajectory data collected from occupied taxicabs. The underlying assumption was that the travel behavior of occupied taxi drivers is no different from that of well-experienced commuters. To this end, the DBSCAN algorithm and the Akaike information criterion (AIC) were first used to classify trips into different categories based on trip length. Next, a total of 9 explanatory variables were defined to describe the route choice behavior, and a path size (PS) logit model was then built, which avoided the invalid assumption of independence of irrelevant alternatives (IIA) in the commonly used multinomial logit (MNL) model. Taxi trajectory data from over 11,000 taxicabs in Xi'an, China, with 40 million trajectory records each day, were used in the case study. The results confirmed that commuters' route choice behaviors are heterogenous for trips of varying distances and that considering such heterogeneity in the modeling process explains commuters' route choice behaviors better than the traditional MNL model.


Introduction
Analysis of route choice behavior provides theoretical support for route guidance and traffic assignment. Most early research studies on route choice behavior were based on data collected from stated preference (SP) surveys or through small-scale experiments that were usually limited in data size or number of participants. In the modeling process, discrete choice models, especially logit models, were commonly used. The differences among these models were mainly reflected in the data sets, explanatory variables, or model structures. For example, McFadden and Reid applied logit models to travel behavior analysis [1]. After that, based on the hypothesis that the random term of the route utility function follows the Gumbel distribution, Dial constructed a discrete multinomial logit (MNL) model for multimode selection [2,3]. To address the independence of irrelevant alternatives (IIA) issue of the MNL model, various modified models were proposed, such as the C-logit model and the PS-logit model [4,5], which add a correction term to the utility function to characterize the interactions among different routes. In addition, according to the generalized extreme value (GEV) theorem proposed by McFadden, some researchers proposed the cross-nested logit (CNL) and paired combinatorial logit (PCL) models [6,7] to avoid the IIA assumption of the MNL model. In general, these early research studies on route choice behavior lacked real-world data and were restricted by algorithmic complexity, and the number of explanatory variables used was usually limited as well.
With the rapid advancement of information and communication technologies (ICT), GPS technology has made significant progress, and the data collected by GPS devices have been widely used in transportation research, such as travel time estimation [8-10], driving risk analysis [11,12], departure time modeling [13,14], and many others [15-17]. Such data have also been used to directly support route choice behavior analysis, and data-driven route choice models have improved in both effectiveness and accuracy. For example, route choice behaviors and network information in Chicago were studied using data collected with portable GPS devices, and path size (PS) logit models for different travel purposes in different time periods were proposed [18]. Based on the same method, Schussler and Axhausen collected travel data in the Zurich area and calibrated a C-logit model and a PS-logit model [19]. Kim and Mahmassani proposed a trajectory clustering algorithm to analyze the spatial and temporal travel patterns in a network [20], in which a framework for clustering and classifying vehicle trajectory data was built. Additionally, several medium-sized cities in the Netherlands were selected as research objects, and an MNL model based on GPS data was proposed to analyze the route choice behavior [21]. Li et al. collected the GPS data of private cars in Toyota City, explored the effect of travelers' heterogeneity on route choice, and concluded that the route choice behavior is affected by travelers' age, gender, vehicle displacement, and the characteristics of the O-D pair [22]. However, the focus of that analysis was on the travelers' heterogeneity, as opposed to differences in route characteristics. Bierlaire and Frejinger used GPS data from Switzerland to study the behavior characteristics of long-distance travel route selection and gave the estimation results of the PS-logit model and the subnetwork model [23]. Miwa et al. used the taxi travel data of Nagoya City to analyze the characteristics of dynamic route choice behavior; an MNL model was built, and it was concluded that route choice behavior differs at different O-D distances [24]. Yamamoto et al. used pedestrian GPS data from Nagoya to build a nested logit model [25], and Hu et al. used GPS data to analyze route choice behavior changes under preplanned road closures [26].
This manuscript focused on the analysis of route choice behavior of general traffic, based on the massive amount of trajectory data collected from occupied taxicabs. Taxicabs, especially those working with e-hailing platforms such as Uber and Lyft, are mostly equipped with GPS devices for dispatching and safety purposes. However, most existing research studies based on taxi GPS trajectory data focused on the routing behavior of vacant taxi drivers, with the objective of minimizing the search time for the next customer [27,28] or maximizing the profit [29-31], which is significantly different from the behavior of regular drivers. Our underlying assumption was that when a taxi was occupied by customers, the taxi driver would seek to arrive at the destination in the least amount of time or distance, as expected or required by the customer, similar to the objective of a commuter in his/her own car. Additionally, taxi drivers usually had good knowledge of the roadway network and traffic conditions, and thus their travel behavior can be considered very similar to that of well-experienced commuters.
Furthermore, this manuscript tested the hypothesis that trips of different lengths may exhibit different characteristics in drivers' route choice behavior. As opposed to the common practice of developing and calibrating a unified model to describe the route choice behavior of all trips, the Akaike information criterion (AIC) was first used to classify trips into different categories based on trip length. Next, a total of 9 explanatory variables were defined to describe the route choice behavior, and a PS-logit model was then built, which avoided the invalid assumption of IIA in the commonly used multinomial logit model [24]. The taxi trajectory data from over 11,000 taxicabs in Xi'an, China, with 40 million trajectory records each day, were used in the case study.
The results confirmed the hypothesis that commuters' route choice behaviors are heterogenous for trips of varying distances and that considering such heterogeneity in the modeling process would better explain commuters' route choice behaviors. The rest of this paper is organized as follows: Section 2 presents the data used in this research, including the GPS trajectory data and the traffic network. Section 3 discusses the analysis methodology in depth, and Section 4 presents the numerical analysis results. Section 5 concludes this research.

GPS Data Set.
The GPS trajectory data used in this research came from the historical database of the taxi dispatch system in Xi'an, China. The recording time was from 0:00 to 24:00, the recording interval was 30 s, and each record contained the license plate number, timestamp, longitude, latitude, speed, driving direction, and loading state. The data set included data from over 11,000 taxicabs with 40 million trajectory records each day. Such a large amount of data can meet the needs of this research. The following data cleaning and preprocessing were performed: (1) Removed the flawed data with missing values.

Traffic Network.
The OpenStreetMap (OSM) network of Xi'an was downloaded and utilized for this research. Postprocessing included removing duplicate or redundant roads and adding road segment lengths and node information. Additionally, the road segments were classified into seven categories: expressway, national highway, other highways, urban expressway, main road, secondary road, and neighborhood street. The research region is shown in Figure 1.

Hotspot OD Trips Extraction.
Occupied trips between frequent origin-destination (OD) pairs were extracted from the database as the target data for analysis. We first identified pick-up and drop-off hotspots and then extracted the frequent OD pairs between these hotspots.
Journal of Advanced Transportation

Identification of Drop-Off Hotspots.
This step aimed to identify the areas with a high density of drop-off events, with the goal of providing the basis for hotspot OD matching and ensuring that there was a sufficient number of passenger-carrying trips between the same OD pair (from pick-up to drop-off).
Pick-up points and drop-off points can be identified from the change of loading state between two adjacent GPS data records. Taking the taxi GPS data of 19 April 2017 in Xi'an as an example, nearly 594 thousand drop-off points in the research region were obtained from the 40 million trajectory records generated by 11,281 taxis. The DBSCAN spatial clustering algorithm [32] was adopted to identify the drop-off hotspots. The algorithm has two parameters: the cluster neighborhood radius (Eps) and the minimum density threshold (MinPts). In this paper, the K-distance method was used to determine a reasonable Eps. The method consists of three steps:
Step 1: assuming that the drop-off point data set $D = \{P_i(x, y),\ i = 1, 2, \ldots, n\}$ contains n points, we selected a drop-off point $P_i(x, y)$ and calculated the Euclidean distances between $P_i(x, y)$ and each of the other points $P_1(x, y), P_2(x, y), \ldots, P_{i-1}(x, y), P_{i+1}(x, y), \ldots, P_n(x, y)$. These distances were then sorted in ascending order.
Step 2: based on Step 1, we calculated the K-distance of each drop-off point in the data set, that is, the K-th smallest of its sorted distances.
Step 3: we sorted the K-distances of all drop-off points in ascending order and plotted the K-distance figure. In the figure, the K-distance at the inflection point was taken as the Eps of the data set.
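The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the function names `k_distances` and `knee_value` are hypothetical, the brute-force distance computation only suits small samples, and the inflection point is located with a simple "farthest from the chord" heuristic, whereas the paper reads it off the K-distance figure.

```python
import math

def k_distances(points, k):
    """Steps 1-2: for each point, sort the distances to all other points
    and keep the k-th smallest one (its K-distance)."""
    result = []
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        result.append(dists[k - 1])
    return result

def knee_value(sorted_values):
    """Step 3 (assumed heuristic): return the value that lies farthest
    below the straight line joining the first and last sorted values."""
    n = len(sorted_values)
    first, last = sorted_values[0], sorted_values[-1]
    best_index, best_gap = 0, -1.0
    for i, value in enumerate(sorted_values):
        chord = first + (last - first) * i / (n - 1)
        gap = chord - value
        if gap > best_gap:
            best_index, best_gap = i, gap
    return sorted_values[best_index]

# Toy data: four tightly packed points and one distant outlier.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kd = sorted(k_distances(sample, k=1))
eps = knee_value(kd)
print(kd, eps)
```

On real data the candidate Eps would still need a visual check against the K-distance plot, since a heuristic knee can be misled by noise.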
Taking the drop-off point data set of Wednesday, April 19, 2017, as an example, we analyzed the data over time windows of different lengths. We found that when the length of the time window exceeds 8 minutes, the K-distance figure becomes stable and the inflection point becomes clearer, as shown in Figure 2. Finally, considering the limitations of computer performance, we took the drop-off points recorded between 10:00 and 10:10 am on Wednesday, April 19, 2017 (5,000-5,400 points in total); the corresponding K-distance figure is shown in Figure 2(d), in which the K-distance changes significantly around 0.00211. Therefore, 0.00211 was selected as the Eps. This value was used in the clustering of one day's drop-off point data set to identify the hotspot ODs.
MinPts indicates the density of drop-off points required in each cluster. With the Eps fixed, clustering results for the drop-off points were obtained for a range of trial MinPts values, and a reasonable MinPts was determined by comparing these results, which are shown in Table 1.
To obtain as many clusters as possible while ensuring that each cluster has a sufficient number of pick-up or drop-off points, MinPts was set to 800. With this setting, the 594 thousand drop-off points were clustered into 11 clusters (Table 1); the spatial distribution of the drop-off clusters and the number of trips in each cluster are shown in Figure 3.

Identification of Hotspot OD.
To ensure that the trips between the selected OD pairs are sufficient in number and valid, a hotspot OD identification method was proposed in this step. It consisted of the following two steps: (1) for each drop-off point in the drop-off clusters shown in Figure 3 (14,283 points in total), search for the corresponding pick-up point and the trajectory data in between; (2) re-cluster the pick-up points. The DBSCAN algorithm was used for the re-clustering: the pick-up points associated with the 11 drop-off clusters shown in Figure 3 were re-clustered, and eighteen pairs of hotspot ODs were obtained (Table 2). The results show that, with this method, processing only one day's data is enough to ensure a sufficient number of passenger-carrying routes between ODs.
In Table 2, CCluster denotes a pick-up hotspot obtained by re-clustering. "Cluster 1-CCluster : 245" means that there are 245 single passenger-carrying trips between the re-clustered pick-up hotspot CCluster and the drop-off hotspot Cluster 1.

Trip Length Classification.
To test the hypothesis of heterogenous route choice behavior for trips with different lengths, the Akaike information criterion (AIC) was first used to classify trips into different categories based on their length.
A few studies on the classification of trips by travel distance can be found in the literature. In surveys of urban residents' travel, the travel distance was subjectively divided into a few distance segments, such as 0∼3 km, 3∼6 km, 6∼9 km, 9∼12 km, and longer than 12 km [33,34]. For mode split purposes, only a qualitative classification of travel distance (short distance and long distance) was performed [35,36]. In route choice modeling, most studies used only one model to describe all route choice behaviors [8,21,37], although travelers behave differently on different types of passenger-carrying routes. As such, a theoretically sound method for classifying travel routes is currently missing.
Based on the OD-Euclidean distance distribution of passenger-carrying routes, we looked for the characteristic distances at which the travel volume changes significantly. These characteristic values were used as the basis for the preliminary classification. The OD-Euclidean distance distribution of the 14,283 trips in the 11 drop-off clusters mentioned in Section 2.3 is shown in Figure 4. This subset of the data is used in this section.
Figure 3: Spatial distribution of drop-off clusters.
Figure 4 shows three peaks of travel volume, at 3, 7, and 10 km. These three peaks are believed to be consistent with the urban structure of Xi'an:
(1) 3 km radius: within 1-3 km of the Central Business District (CBD), there were many service facilities. These facilities serve residents well, and residents can fulfill their daily needs in this region, such as working, schooling, and shopping.
(2) 7 km radius: as a city with thousands of years of history, the CBD of Xi'an attracted a large number of trips. The CBD of Xi'an is located in the geometric center of the city, and a CBD-centered radius of 6-7 km covered the major urban areas.
(3) 10 km radius: there are many passenger stations, airports, and tourist areas around the city, and these important points of interest also attracted many trips. This explains the occurrence of the third peak.
According to the above analysis, single passenger-carrying taxi routes can be divided into four categories: 0-3 km, 3-7 km, 7-10 km, and longer than 10 km. It should be noted that these are OD-Euclidean distances, which represent the straight-line distances between the pick-up point and the drop-off point and hardly reflect the actual length or travel time of the routes. To reflect the actual length of a taxi passenger-carrying route, circuity was selected as another route classification index. We extracted the Euclidean distance and circuity of each of the 14,283 trips, as shown in Figure 5. The relationship between the OD-Euclidean distance and the average circuity for different types of passenger-carrying routes was then fitted as equation (1). This was a typical regression curve fitting done in Microsoft Excel, and the R-squared value of 0.9416 indicates a satisfactory fit. In equation (1), $cir_{rs}$ is the circuity of the passenger-carrying route from pick-up point r to drop-off point s, calculated as the ratio of the actual travel distance to the OD-Euclidean distance (so that it is never smaller than 1), and $eucl_{rs}$ is the OD-Euclidean distance of the passenger-carrying route from pick-up point r to drop-off point s (unit: kilometer).
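To make the circuity index concrete, the sketch below computes it for a single trip as travelled length divided by straight-line length. Several choices here are assumptions rather than details from the study: the helper names `haversine_km` and `circuity` are hypothetical, the great-circle distance stands in for the OD-Euclidean distance, and the travelled length is approximated by summing the gaps between consecutive GPS fixes instead of using map-matched road lengths.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two (lon, lat) points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def circuity(track):
    """Ratio of travelled length to straight-line OD length for one trip.

    track: list of (lon, lat) fixes ordered from pick-up to drop-off.
    """
    travelled = sum(haversine_km(*p, *q) for p, q in zip(track, track[1:]))
    straight = haversine_km(*track[0], *track[-1])
    if straight == 0:
        raise ValueError("pick-up and drop-off coincide; circuity is undefined")
    return travelled / straight

# A straight trip has circuity 1; an L-shaped detour is roughly 1.41.
straight_trip = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
l_shaped_trip = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(circuity(straight_trip), circuity(l_shaped_trip))
```

With 30 s recording intervals, summing fix-to-fix gaps cuts corners and therefore tends to understate the travelled length; map matching would give a tighter estimate.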
The mean values of the 0-3 km, 3-7 km, and 7-10 km ranges were 1.5 km, 5 km, and 8.5 km, respectively. Considering that only 13.17% of the OD-Euclidean distances were over 10 km and that 80% of them fell within 10-15 km, 12.5 km was selected as the representative value of the last range. By substituting 1.5 km, 5 km, 8.5 km, and 12.5 km into equation (1), the candidate initial clustering centers of the five schemes can be calculated (1.9905, 1.7247, 1.5471, and 1.4324). In addition, there have been studies that divide the travel distance of travelers into 3 or more categories [35,36]. Therefore, we decided to set the number of clusters to 3 or 4. If the number of clusters is 3, there are 4 optional clustering schemes, depending on which three of the four candidate centers are used; if the number of clusters is 4, there is 1 optional clustering scheme. The five clustering schemes are shown in Table 3.
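How a scheme's initial centers drive the clustering of circuities can be illustrated with a one-dimensional K-means sketch. The study ran K-means in SPSS; the function `kmeans_1d` below is a hypothetical stand-in, and the circuity values are invented for illustration. Only the three initial centers are taken from the text.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Plain one-dimensional K-means started from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda m: abs(v - centers[m]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its members.
        new_centers = [
            sum(members) / len(members) if members else centers[m]
            for m, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Invented circuity sample; initial centers are three of the four
# candidate values quoted in the text.
circuities = [1.1, 1.2, 1.3, 1.6, 1.7, 2.0, 2.2]
final_centers, groups = kmeans_1d(circuities, (1.4324, 1.7247, 1.9905))
print(final_centers, groups)
```

Running the same function with each scheme's initial centers would give the five candidate partitions that the AIC comparison then ranks.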
To compare the five clustering schemes, the AIC, proposed by H. Akaike in information theory, was introduced to identify the best scheme. In its standard form, the criterion is

$$\mathrm{AIC} = 2k - 2\ln L(\theta), \qquad (2)$$
where $L(\theta)$ is the maximized likelihood of the model, which becomes larger as the difference among clusters increases, and k is the number of parameters in the model, which grows with the number of classes the model contains. The value of AIC depends on $L(\theta)$ and k: the smaller k is, the more concise the model, and the larger $L(\theta)$ is, the more accurate the model. The AIC therefore considers both complexity and precision in identifying the best scheme. For the circuity data set $\{X_i \mid i = 1, 2, \ldots, K\}$, which contains the circuities of K passenger-carrying routes, let the number of clusters be N, the final cluster centers be $\{C_m \mid m = 1, 2, \ldots, N\}$, the sample sizes of the clusters be $\{Q_m \mid m = 1, 2, \ldots, N\}$, and the internal deviations of the clusters be $\{D_m \mid m = 1, 2, \ldots, N\}$.
In equation (3), $dist(X_j, C_m)$ is the Euclidean distance between $X_j$ and $C_m$, $X_j$ is the circuity of a passenger-carrying route in cluster m, and $C_m$ is the center of cluster m. The density distribution of the deviations in each cluster is shown in equation (4),
where $d_{\max} = \max(D_m)$ and $d_{\min} = \min(D_m)$.
According to the principle of maximum likelihood estimation, the log-likelihood function of the internal deviations of the clusters ($\{D_m \mid m = 1, 2, \ldots, N\}$) can be obtained (equation (5)). Substituting equation (5) into equation (2) yields the AIC that served as the basis of passenger-carrying route classification. The clustering scheme with the minimum AIC was selected as the optimal scheme. Five K-means clustering schemes were implemented in SPSS, a statistical analysis software package developed by IBM, and the AIC values of the five schemes, as shown in Table 3, were 2.885, 2.6137, 2.8041, 3.5233, and 3.0231, respectively. The AIC value of scheme 2 was the smallest, which means that this scheme had the best balance between complexity and precision.
Accordingly, scheme 2 was considered the optimal scheme. In clustering scheme 2, the boundaries of cluster 1 were 1 and 1.489, which corresponded to the passenger-carrying routes with an OD-Euclidean distance longer than 10 km. The boundaries of cluster 2 were 1.489 and 1.826, which corresponded to the passenger-carrying routes with an OD-Euclidean distance between 3 km and 10 km. The boundaries of cluster 3 were 1.826 and 2.544, which corresponded to the passenger-carrying routes with an OD-Euclidean distance between 0 km and 3 km. Accordingly, taxi passenger-carrying routes were classified as 0 km ≤ D ≤ 3 km (short distance), 3 km ≤ D ≤ 10 km (medium distance), and 10 km < D (long distance), where D is the OD-Euclidean distance.
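The scheme selection described above reduces to computing an AIC per scheme and taking the minimum. The sketch below assumes the textbook form of the criterion (2k minus twice the log-likelihood); the paper's own likelihood for cluster deviations is not reproduced here, and the function names `aic` and `best_scheme` are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Textbook AIC: penalize parameters, reward log-likelihood."""
    return 2 * n_params - 2 * log_likelihood

def best_scheme(schemes):
    """Return the name of the scheme with the smallest AIC.

    schemes: dict mapping scheme name -> (log_likelihood, n_params).
    """
    return min(schemes, key=lambda name: aic(*schemes[name]))

# Selecting among the AIC values reported in the text for schemes 1-5.
reported = {1: 2.885, 2: 2.6137, 3: 2.8041, 4: 3.5233, 5: 3.0231}
chosen = min(reported, key=reported.get)
print(chosen)  # prints 2
```

Because only differences in AIC matter, any constant offset in the likelihood that is shared by all schemes leaves the ranking unchanged.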
With these thresholds for trip length classification, the Euclidean distance distribution of the 18 pairs of hotspot ODs is shown in Figure 6. The hotspot OD from Xiaozhai (Cluster18, pick-up cluster) to Shaanxi Province People's Hospital and Xi'an Medical College (Cluster11, drop-off cluster) was selected as the research object for short-distance taxi passenger-carrying routes. The hotspot OD from Lagerstroemia Garden and Four Seasons Garden (Cluster16, pick-up cluster) to Xiaozhai (Cluster10, drop-off cluster) was selected as the research object for medium-distance taxi passenger-carrying routes. The hotspot OD from Xi'an Bei Railway Station (Cluster2, pick-up cluster) to Xi'an Railway Station (Cluster2, drop-off cluster) was selected as the research object for long-distance taxi passenger-carrying routes. These three OD pairs are illustrated in Figure 7.

Route Choice Probability Distribution Analysis.
Figure 8 illustrates the actual probability distribution of route choice for the different passenger-carrying route categories shown in Figure 7. The formula for the fluctuation value of the route choice probability is expressed in terms of $P^{rs}_{ik}$, the probability of driver i choosing route k from r to s.
The fluctuation of route choice probability can be ranked as follows: 0.2010 (short distance) < 0.239 (long distance) < 0.305 (medium distance). The following can be found:
(1) Short-distance passenger-carrying routes had the smallest fluctuation. The most likely explanation is that, because of the limited scale of the network between a short-distance hotspot OD pair, drivers did not have enough options to make a detour and the utility values of different routes were similar.
(2) Medium-distance passenger-carrying routes had the highest fluctuation. The scale of the network between a medium-distance hotspot OD pair was moderate, so drivers had more options to make a detour within an acceptable travel time.
(3) The fluctuation of long-distance passenger-carrying routes was higher than that of short-distance routes but lower than that of medium-distance routes. This was probably because the scale of the network between a long-distance hotspot OD pair was large and drivers had enough options to make a detour, whereas the circuity or delay that drivers would accept was small for long-distance passenger-carrying routes.

Explanatory Variables.
In this study, route choice behavior modeling explanatory variables were selected from three aspects: path factor, road factor, and PS correction term. We defined the coefficients corresponding to the explanatory variables in the model as shown in Table 4 below.
In Table 4, the travel time (TT) equals the difference between the origin and destination GPS timestamps of a single passenger-carrying trip, $K$ represents the length of the path, $D$ represents the OD-Euclidean distance, $N_p$ is the number of intersections, $K_m$ is the length of main road, $K_s$ is the length of secondary road, $K_b$ is the length of branch road, and $K_{co}$ is the length of congested road, which is judged by the average travel speed of the road section from GPS data.

Path Size Logit Model.
The traditional multinomial logit model is a discrete choice model based on random utility theory, which can be used to describe an individual's choice behavior. The model is simple and easy to understand. However, the IID assumption on the random utility term gives the model the IIA property: the ratio of the selection probabilities of two routes depends only on the utilities of those two routes and not on any other route. According to Figure 6, however, there were many common roadway segments among different taxi passenger-carrying routes.
The path size logit model addresses this issue by introducing a correction term into the utility function. Therefore, the PS-logit model was adopted to analyze the taxi passenger-carrying route choice behavior in this paper. The utility function of the PS-logit model is shown in equation (8).
$\Gamma_k$: the set of road segments in route k.
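The effect of the path size correction can be illustrated with a short sketch. It assumes the commonly cited Ben-Akiva and Bierlaire formulation, in which each road segment's share of the route length is divided by the number of alternative routes that use it; whether equation (8) of the paper matches this form term by term is an assumption. The function names, the toy network, and the coefficient `beta_ps=1.0` are all hypothetical.

```python
import math

def path_sizes(routes, link_length):
    """Path size of each route: length-weighted share of links, discounted
    by how many alternative routes use each link.

    routes: dict route name -> list of link ids (the set Gamma_k).
    link_length: dict link id -> link length.
    """
    usage = {}
    for links in routes.values():
        for link in set(links):
            usage[link] = usage.get(link, 0) + 1
    sizes = {}
    for name, links in routes.items():
        total = sum(link_length[a] for a in links)
        sizes[name] = sum(link_length[a] / total / usage[a] for a in links)
    return sizes

def ps_logit_probabilities(systematic_utility, sizes, beta_ps=1.0):
    """Logit probabilities with beta_ps * ln(PS) added to each utility."""
    scores = {k: systematic_utility[k] + beta_ps * math.log(sizes[k]) for k in sizes}
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {k: math.exp(s - top) for k, s in scores.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}

# Routes A and B share link 1; route C is fully independent.
toy_routes = {"A": [1, 2], "B": [1, 3], "C": [4]}
toy_lengths = {1: 1.0, 2: 1.0, 3: 1.0, 4: 2.0}
sizes = path_sizes(toy_routes, toy_lengths)
probs = ps_logit_probabilities({"A": 0.0, "B": 0.0, "C": 0.0}, sizes)
print(sizes, probs)
```

With equal systematic utilities, a plain MNL model would give each route one third; the correction shifts probability toward the independent route C, which is the overlap effect the PS term is meant to capture.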

Model Calibration Results.
With the help of the Biogeme software package, the parameters of the MNL model and the PS-logit model were calibrated for each type of passenger-carrying route. In addition, we aggregated all routes together as a control group. The results are shown in Table 5. According to Table 5, for the different route types, the t-statistics indicated that the explanatory parameters of the two models were statistically significant. The coefficient of the PS correction term was positive, which was consistent with the basic principle of the PS-logit model. In addition, the adjusted likelihood ratio of the PS-logit model was better than that of the MNL model, which meant that the PS-logit model described drivers' passenger-carrying route choice behavior more accurately than the traditional MNL model. Finally, the adjusted likelihood ratio of the control group was significantly lower than those of the other three groups, which showed that dividing the passenger-carrying routes by distance can improve the model. According to Table 5, the following conclusions can be drawn:
(1) The coefficients with positive values included $\beta_{PT}$, $\beta_{TR}$, and Ln(PS). The coefficients with negative values included $\beta_{TT}$, $\beta_{CI}$, $\beta_{SR}$, $\beta_{BR}$, $\beta_{L}$, $\beta_{R}$, and $\beta_{CO}$. This showed that, regardless of travel distance, drivers tended to choose routes with a high proportion of main roads, lower circuity, shorter travel time, and less congestion.
(2) With the increase of travel distance, the absolute values of $\beta_{CI}$, $\beta_{PT}$, $\beta_{TR}$, $\beta_{SR}$, $\beta_{BR}$, and $\beta_{CO}$ increased noticeably. This indicated that as travel distance increases, the impacts of circuity, path structure, and the proportion of congestion on the driver's choice also increase.

Route Choice Preference Analysis.
With the level of consumer satisfaction unchanged, the marginal rate of substitution (MRS) refers to the amount of one product that a consumer must give up in order to gain one additional unit of another product. Many existing research studies use the MRS in the analysis of the calibration results of choice models [38,39]. In this paper, with the utility of a passenger-carrying route kept unchanged, the MRS was defined as the change in the basic variable when another explanatory variable increased by one unit. In this study, the PS-logit model, which had the better adjusted likelihood ratio, was selected as the analysis object. Travel time was selected as the basic variable, and the MRS values between travel time and the other explanatory variables are shown in Table 6.
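For a utility that is linear in its variables, the definition above leads to a ratio of coefficients: keeping utility fixed requires the travel time change to offset the effect of a one-unit increase in the other variable. The sketch below follows that derivation. The coefficient values and the function name `mrs_wrt_travel_time` are hypothetical, and the sign convention is an assumption; the source may report the ratio without the minus sign or as a magnitude.

```python
def mrs_wrt_travel_time(beta, base="TT"):
    """Travel time change that offsets a one-unit increase in each other
    variable while holding a linear utility constant.

    From beta[base] * d_base + beta[k] * 1 = 0, d_base = -beta[k] / beta[base].
    """
    return {
        name: -value / beta[base]
        for name, value in beta.items()
        if name != base
    }

# Hypothetical coefficients, for illustration only.
example_beta = {"TT": -0.5, "CI": -1.0, "PT": 0.25}
example_mrs = mrs_wrt_travel_time(example_beta)
print(example_mrs)
```

In this toy case one extra unit of circuity must be compensated by two units less travel time, so circuity weighs more heavily than the positively valued variable PT.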
According to Table 6, the following conclusions can be drawn:
(1) The relationship among the MRS of explanatory variables was found to be MRS($\beta_{BR}$, $\beta_{TT}$) > MRS(Ln(PS), ... If the goal was to reduce travel time, the first and foremost factors to be considered should be the proportion of branch road, the path-size value, circuity, and the proportion of congestion. The minor factors to be considered should be the number of left turns, the number of right turns, the number of nodes per minute, and the proportions of main road and secondary road.
(2) As the distance of the passenger-carrying route increased, the MRS of circuity and of the proportions of branch road and congestion also increased. On the contrary, the MRS of the frequency of intersections decreased. When the passenger-carrying route was long, drivers usually avoided routes with high circuity and a high proportion of congestion and preferred routes with a high proportion of freeway or highway segments.
The calculation steps of the hit ratio are as follows:
Step 1: assume that the total number of samples is N, the total number of alternatives is M, and there are K parameters in the final calibration result of the model. The calibrated parameter values $\beta_k$ and the corresponding variable values $C_k$ are substituted into the calibrated model to obtain the selection probability $P_m$ of each alternative.
Step 2: if route m has the greatest predicted selection probability for traveler n, then $\delta_{mn} = 1$; otherwise $\delta_{mn} = 0$.
Step 3: when the actual selection of traveler n is consistent with the predicted result $\delta_{mn}$ of the calibrated model, set $S_n^m = 1$; otherwise $S_n^m = 0$. The hit ratio is then the proportion of the N samples for which the prediction matches the actual choice.
In this paper, three different types of OD, as shown in Table 2, were selected to verify the model. The hotspot OD from Tong Hua Men Station (CCluster3, pick-up cluster) to Xi'an Railway Station (Cluster2, drop-off cluster) was selected as the verification object for short-distance taxi passenger-carrying routes; the hotspot OD from Han Cheng Road Station (CCluster, pick-up cluster) to Zhangbabei Station (Cluster1, drop-off cluster) was selected as the verification object for medium-distance taxi passenger-carrying routes; and the hotspot OD from Xi'an Railway Station (Cluster2, pick-up cluster) to Xi'an Bei Railway Station (CCluster2, drop-off cluster) was selected as the verification object for long-distance taxi passenger-carrying routes. After removing abnormal data, these three ODs have 445, 189, and 289 valid trips and 7, 4, and 10 valid routes, respectively. The routes between these three ODs are shown in Figure 9.
Figure 9: Routes of different passenger-carrying route categories for verification: (a) short-distance passenger-carrying routes, (b) medium-distance passenger-carrying routes, and (c) long-distance passenger-carrying routes.
According to the route choice model constructed in Section 4.1, the route choice results of each hotspot OD were calculated and compared with the actual choices. The results are shown in Tables 7-9. Tables 7-9 show that the hit ratios of the short-distance, medium-distance, and long-distance passenger-carrying route choice models are 0.81421, 0.76720, and 0.87889, respectively, indicating that the three types of route choice models are effective and can reasonably explain passenger-carrying route choice behavior. Analyzing additional OD pairs would require a substantial amount of manual work.
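The hit ratio used in this validation can be sketched as a share of correct top-probability predictions. The function name `hit_ratio` and the sample data are hypothetical; the sketch assumes that each traveler's predicted probabilities are available as a dictionary keyed by route.

```python
def hit_ratio(predicted_probabilities, chosen_routes):
    """Share of travelers whose chosen route is the one the model
    assigns the highest probability.

    predicted_probabilities: list of dicts (route -> probability), one per traveler.
    chosen_routes: list of the routes actually chosen, in the same order.
    """
    hits = 0
    for probs, chosen in zip(predicted_probabilities, chosen_routes):
        predicted = max(probs, key=probs.get)  # route with greatest probability
        if predicted == chosen:
            hits += 1
    return hits / len(chosen_routes)

# Invented example with four travelers and two routes.
example_probs = [
    {"a": 0.6, "b": 0.4},
    {"a": 0.3, "b": 0.7},
    {"a": 0.8, "b": 0.2},
    {"a": 0.55, "b": 0.45},
]
example_choices = ["a", "b", "b", "a"]
print(hit_ratio(example_probs, example_choices))  # prints 0.75
```

The measure only checks the most likely route, so it says nothing about how well the remaining probabilities are calibrated.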

Conclusion and Future Work
This manuscript, for the first time, focused on the analysis of route choice behavior based on the massive amount of real-world GPS trajectory data collected from occupied taxicabs. Our analysis based on the trajectory data from Xi'an, China, found that for trips of different lengths, the characteristics of route choice behavior could be very different. As such, according to the distribution of Euclidean distance and volume, five route classification schemes for taxi passenger-carrying routes were proposed based on K-means clustering of circuity. The Akaike information criterion (AIC) was adopted to identify the best route classification scheme. After that, taxi passenger-carrying routes were divided into three categories: short distance, medium distance, and long distance. Based on the MNL model, three PS-logit models were proposed to analyze the route choice behaviors. The numerical analysis validated our hypothesis and revealed heterogenous activity patterns and influencing factors for trips of different lengths.
According to the study, the following conclusions can be drawn: (1) taxi passenger-carrying routes can be classified based on the distribution of Euclidean distance and K-means clustering of circuities; (2) for different taxi passenger-carrying routes, the fluctuation of route choice probability can be ranked as follows: short distance < long distance < medium distance; (3) for different taxi passenger-carrying routes, the first and foremost factors to be considered were the proportion of branch road, the path-size value, circuity, and the proportion of congestion, while the minor factors were the number of left turns, the number of right turns, the number of nodes per minute, and the proportions of main road and secondary road; (4) with the increase of travel distance, drivers usually avoided routes with high circuity and intersection density and preferred routes with a high proportion of freeway or highway; and (5) the effects of circuity, frequency of intersections, path structure, and congestion degree on the utility function were significantly different among different taxi passenger-carrying route categories.
Finally, we selected another OD pair for each category for validation purposes, and the analysis shows consistent conclusions. Future research could focus on using data sets from other cities to validate the model. The limitations to be addressed are as follows. On the one hand, the variables considered in the model in this paper were easy to define, while other factors that are difficult to define or compute were not taken into account, such as trip purpose, preference, network familiarity, and the influence of weather and environment. On the other hand, only Euclidean distance, travel volume, and circuity were considered in the taxi passenger-carrying route classification. If more data types become available, more factors could be considered, such as the network structure between the hotspot ODs. How to identify and select sufficient factors to improve the route classification results may need further discussion.

Data Availability
The GPS trajectory data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.