Automatic fare collection (AFC) systems are widely used around the world and record rich data that researchers can mine to study passenger behavior and estimate operations. However, most transit systems are open systems in which only boarding information is recorded and alighting information is missing. Because of this lack of trip information, validating utility functions for passenger choices is difficult. To fill this research gap, this study uses AFC data from the Beijing metro, which is a closed system that records both boarding and alighting information. To estimate a more reasonable utility function for choice modeling, the study uses the trip chaining method to infer the actual destination of each trip. Based on land use and passenger flow patterns, stations are classified into 7 categories using the k-means clustering method. A trip purpose labelling process is proposed that considers the station category, trip time, trip sequence, and alighting station frequency during five weekdays. We apply multinomial logit models as well as mixed logit models with independent and correlated normally distributed random coefficients to infer passengers’ preferences for ticket fare, walking time, and in-vehicle time in their alighting station choice for different trip purposes. The results show that walking time and in-vehicle time together are the key factors, while the distance-based ticket price is not significant. The estimated alighting stations are validated against real choices from a separate sample to illustrate the accuracy of the station choice models.

In the late 1990s, smartcard payment systems were installed in some big cities, and after more than twenty years of development, more than one hundred cities over five continents have adopted smartcard payment systems [

In addition to the closed travel information loop for each transit user, in some transit systems passengers are required to tap the card only when they enter the vehicle, which provides only boarding information [

Review of studies on estimating alighting stop in a tap-in transit system.

Author | Data | Assumption and constraints | Analysis/use methodology | Application | Pros | Limitations |
---|---|---|---|---|---|---|
Barry et al. (2002) [ | AFC | Two basic assumptions | Trip chaining | New York | Easy to apply | Lack of one trip estimation |
Zhao et al. [ | ADC | Walking distance threshold | Database management systems | Chicago | (i) Integrating the AFC and AVL | The model was just focused on the bus and rail station |
Trépanier et al. (2007) [ | AFC | Walking tolerance is 2 km. | Transportation object-oriented modeling with vanishing route set | Gatineau | The model is quite suitable for regular transit users | Some passenger information such as single ticket user is missing. |
Chu and Chapleau (2008) [ | AFC | 5 min temporal leeway for uncertainty | The linear interpolation and extrapolation to infer the vehicle position | Société de transport de l’Outaouais | Avoids the overestimation of the transfer. | Improves the results of trip purpose and destination inference. |
Nassir et al. (2011) [ | ADCS | Geographical and temporal check; transfer time threshold | OD estimation algorithm | Minneapolis-Saint Paul | Relative relaxation of the search in finding the boarding stops. | The transfer time threshold is fixed |
Wang et al. (2011) [ | ADCS | Walking tolerance is 1 km or 12 min. | Trip chaining methodology based on next trip is bus or rail | London | Validates the automatic inference results against large-scale survey results | Linking system usage to home addresses; access behavior could be better understood |
Munizaga and Palma (2012) [ | AFC | Generalized time | Position-time alighting estimate model | Santiago | (i) Uses generalized time rather than physical distance | The one trip per card destination estimation is missing. |
Gordon et al. (2013) [ | AFC | Walking tolerance is 1 km and max. transfer is 30 min. | Four-step trip chaining algorithm | London | The circuity ratios to decide the potential destination for previous journey. | Not all of the passengers alight from the stops closest to the next journey. |
Alsger et al. (2015) [ | AFC | The dynamic transfer time threshold | OD estimation algorithm | Queensland | Transfer time threshold could be increased. | Extended to compare other estimation methods. |

Trip chaining is the typical methodology in this line of research. It rests on two basic assumptions: (1) a high percentage of riders return to the destination station of their previous trip to begin their next trip, and (2) a high percentage of riders end their last trip of the day at the station where they began their first trip of the day. Beyond these assumptions, each cardholder must have more than one trip in the system; otherwise, it is impossible to infer the alighting station. For some passengers, such as commuters, multiday travel information is recorded, so the destination of a single trip can be inferred from records on other days. If a cardholder has records for only one day and that day contains only one trip, the alighting station cannot be inferred.
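The two assumptions can be illustrated with a minimal Python sketch. It assumes a hypothetical input, namely one cardholder's boarding stations for a single day in chronological order, and it omits the walking distance and transfer time thresholds used in the studies reviewed above. The function name `infer_alightings` is illustrative only.

```python
from typing import List, Optional


def infer_alightings(boardings: List[str]) -> List[Optional[str]]:
    """Infer alighting stations for one card's boardings on one day.

    `boardings` is assumed to be sorted by boarding time (hypothetical input).
    """
    if len(boardings) < 2:
        # A single trip gives no chain to follow, so nothing can be inferred.
        return [None] * len(boardings)
    inferred: List[Optional[str]] = []
    for i in range(len(boardings)):
        if i + 1 < len(boardings):
            # Assumption 1: the next trip starts where the previous trip ended.
            inferred.append(boardings[i + 1])
        else:
            # Assumption 2: the last trip ends at the day's first boarding station.
            inferred.append(boardings[0])
    return inferred


print(infer_alightings(["Huilongguan", "Guomao", "Xidan"]))
```

A card with a single boarding returns `[None]`, which corresponds to the case above in which the alighting station cannot be inferred.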

When choosing an alighting station, passengers jointly consider in-vehicle time, transfer time, walking time, and ticket fare and choose the station with the highest utility. The chosen alighting station can differ by trip purpose because the value of time varies across purposes. To formulate this optimization model, it is necessary to validate the weights and coefficients of these factors. Because of missing information and the lack of closed trip data, the validation of such models is seldom discussed. Early attempts at validation and sensitivity analysis relied on on-board survey data to illustrate the feasibility of the method. However, on-board surveys are expensive and their samples are limited.

The Beijing metro system is a closed system, which contains both boarding and alighting information. Given the walking time, in-vehicle time, and ticket fare of each candidate alighting station within a walking-time buffer for each trip, together with the real alighting station from the AFC data, the coefficient of each utility factor is estimated. Inspired by Tavassoli et al. [

This paper is organized as follows. The following section describes the data and the data preparation process. The next section presents the method for determining trip purposes, trip origins, and trip destinations. In the methodology section, a multinomial logit model and mixed logit models with independent and correlated normally distributed random coefficients are proposed. In the first and second parts of the empirical study, the AFC data are used to calibrate the parameters of the different models. In the last part of the empirical study, a separate sample of AFC records is used to illustrate the model’s accuracy and validity. Conclusions and directions for future work are presented in the last section.

The data used in this paper were obtained from the metro system in Beijing, China, and were excerpted from one week of data in December 2016. At that time, there were 17 lines serving more than 10 million passengers every day with more than 8000 train services. The majority of line headways ranged from 2 to 5 min, and in the peak hour, the headway could be as short as 90 s. There are two kinds of payment in the Beijing metro: the Yikatong card, which can be recharged and used multiple times, and the single-trip pass. Yikatong cardholders account for roughly 80% of all transit passengers, and only Yikatong card data are recorded in the AFC system. In this research, the AFC data, station geometry data, and timetable data are required, and Table

Description of each dataset.

Field | Description |
---|---|
Card ID | Unique number that could be taken as the passenger ID |
O station | Boarding station ID |
Entry time | Access time to the station |
D station | Alighting station ID |
Exit time | Exit time from the station |
Station ID | Unique station number |
Station name | Name of metro station |
Station latitude | Latitude of metro station |
Station longitude | Longitude of metro station |
Station route ID | Route number which serves at metro station |
Service ID | Given number to every trip |
Arrival time | Scheduled arrival time |
Departure time | Scheduled departure time |
Station ID | Given station number |
Route ID | Given route number |
O station | Entry station ID |
D station | Exit station ID |
Ticket price | The price for a specific OD pair. |

The AFC dataset contains the entry and exit information for each passenger, and one record represents one trip. For example, a passenger started a trip at Xizhimen Station at 8:00 AM and alighted at Dongzhimen Station at 8:30 AM. Every station has a unique station ID and station location. For a normal station, the route ID field holds a single route. A transfer station serves more than one route, so its route ID field holds several routes. For example, Xizhimen Station is a transfer station for route 13, route 2, and route 4; it has one unique station ID, station name, and station location in the dataset, and the 3 routes are saved in the route ID field. The timetable dataset records the train arrival and departure times at each stop for each route, from which the passenger in-vehicle time can be inferred. In Beijing, the ticket price is based on the shortest travel distance and does not depend on the route taken. For example, for a trip from Xizhimen Station to Dongzhimen Station, the ticket price is the same regardless of whether the passenger takes route 13 or route 2.

In the database discussed above, the AFC data provide the sample for the empirical study. Walking distances were calculated as the Euclidean distance, and the timetable was used to calculate the travel time between stations using the shortest path.

It has been highlighted that the level of accuracy of AFC data may vary and the data can be affected by various types of errors. These errors may affect the accuracy of individual journeys and passenger behavior analysis. In the original AFC data, some errors are caused by system failure or passenger error. The data were filtered with some transactions excluded, such as reloaded transactions, transactions with missing information such as no boarding or alighting stops, and transactions with the same entry and exit stations.
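The filtering rules described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the record layout and the field names `kind`, `o_station`, and `d_station` are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]  # hypothetical AFC record layout


def clean_afc(records: List[Record]) -> List[Record]:
    """Drop reload transactions, incomplete records, and same-station trips."""
    kept = []
    for rec in records:
        if rec.get("kind") == "reload":
            continue  # card reloads are not trips
        if not rec.get("o_station") or not rec.get("d_station"):
            continue  # boarding or alighting station is missing
        if rec["o_station"] == rec["d_station"]:
            continue  # entry and exit at the same station
        kept.append(rec)
    return kept


sample = [
    {"kind": "trip", "o_station": "Xizhimen", "d_station": "Dongzhimen"},
    {"kind": "reload", "o_station": None, "d_station": None},
    {"kind": "trip", "o_station": "Guomao", "d_station": None},
    {"kind": "trip", "o_station": "Xidan", "d_station": "Xidan"},
]
print(len(clean_afc(sample)))
```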

As the study uses the trip chaining method to infer the actual destinations and potential purpose, we exclude single trip cardholders due to lack of information. Figure

Data preparation process.

Trip purpose could be inferred from their alighting and boarding station. For example, if the passenger started his trip at a residential area and went to CBD, we could say that this trip is a work trip. Based on the land use and the daily entry flow pattern for each metro station, we processed the

Working stations (red)

These stations are usually in the CBD area or near the software plaza. In the morning, commuters take transit to work, and they go back home in the early evening. The number of exiting passengers in the morning is much larger than in the afternoon, and the entry volume in the late afternoon or early evening is much larger than in the morning. Typical stations such as Guomao Station and Zhongguancun Station are marked in red in Figure

Residential stations (orange)

Beijing has 6 ring roads. House prices are unusually high within the 3rd ring, so to save on living expenses, many residents buy or rent a home near the 6th ring or even farther out. There are some huge residential zones in Beijing, such as Huilongguan and Huoying. The passenger flow pattern is the opposite of that at working stations: the morning entry flow is much larger than in the afternoon, and most passengers exit at these stations in the afternoon. Typical stations such as Huilongguan Station and Tiantongyuan Station are marked in orange in Figure

Working-residential stations (yellow)

Although house prices are high, some commuters prefer to rent or buy a home in the downtown area to save travel time. The land use around these stations is a mix of CBD and residential uses, as in university campus areas. The passenger flow at these stations stays stable, without a pronounced peak during the day. Typical stations such as Wukesong Station and Gongzhufen Station are marked in yellow in Figure

Transit hub stations (green)

At transit hubs, the incoming and outgoing passenger flows are always large, in both the morning and the afternoon peak hours. Most of them are key points of the transit lines, such as transfer stations. Typical stations such as Dongzhimen Station, Xizhimen Station, and Songjiazhuang Station are marked in green in Figure

Railway stations (light blue)

Based on land use, railway stations form a distinct station category. The incoming and outgoing flows depend heavily on the railway schedule. Beijing has 3 such railway stations: Beijing railway station, Beijing south railway station, and Beijing west railway station, which are marked in light blue in Figure

Shopping-sightseeing stations (deep blue)

There are some sightseeing and shopping sites such as The Forbidden City and Tiananmen Square, which attract a lot of tourists and visitors every day. For these stations, the total daily passenger volume during the weekends and holidays is usually higher than during workdays. The typical stations for this category, such as Tiananmen East, Tiananmen West, and Xidan stations, are marked in deep blue in Figure

Rural stations (purple)

The Beijing metro network is huge, with an operating length of 608 km. Some rural areas are also served by lines such as the Changping Line and the Fangshan Line. The daily average passenger flow on the rural lines is much smaller than in the downtown area. Typical rural stations are marked in purple in Figure

The typical stations for each category in Beijing metro.
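The station classification relies on k-means clustering. The sketch below is a plain-Python version of Lloyd's algorithm, shown under stated assumptions: each station is described by a hypothetical feature vector (here, the share of daily entries in the morning and in the evening), and the first `k` points serve as initial centroids so that the result is reproducible. The features actually used in the study (land use and entry flow pattern) and its initialisation are not specified here.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Sequence[float]], k: int, iterations: int = 50) -> List[int]:
    """Return a cluster label for each point using Lloyd's algorithm."""
    if len(points) < k:
        raise ValueError("need at least k points")
    # Deterministic initialisation (an assumption): first k points are centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each station joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # converged
        centroids = updated
    return labels


# Hypothetical features: [morning entry share, evening entry share].
stations = [[0.80, 0.10], [0.10, 0.80], [0.75, 0.15], [0.15, 0.70]]
print(kmeans(stations, k=2))
```

In this toy input the first and third stations behave like residential stations (morning entries dominate) and the other two like working stations, so they fall into separate clusters.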

For each trip, the trip purpose could be estimated based on the station category. For example, a passenger started his trip from a residential station and finished his trip at a working station. Based on the station category, we could label this trip as a working trip. This process could efficiently determine the trip purpose during the day.

However, stations in the working-residential category could be either a workplace or a residential place. To determine the purpose of trips to these stations, we applied a filtering process. For each passenger in the Beijing AFC data, the alighting stations and boarding times over the week are compiled into an alighting station list. If a passenger alights at a station more than three times on weekdays, we assume that the passenger is a commuter in the city and that this place is a workplace or a home [

Trip purpose labelling process for work-residential trips.
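A minimal sketch of such a labelling rule is given below. The category names and the frequency threshold (more than three weekday alightings) follow the text; the time-of-day rule used to separate work from home at working-residential stations, and the `noon_hour` parameter, are assumptions made only for illustration.

```python
from collections import Counter
from typing import List


def weekday_alight_counts(alight_stations: List[str]) -> Counter:
    """Count how often one passenger alights at each station on weekdays."""
    return Counter(alight_stations)


def label_trip(dest_category: str, weekday_count: int, boarding_hour: int,
               noon_hour: int = 12) -> str:
    """Label a trip as 'work', 'home', or 'other'."""
    if dest_category == "working":
        return "work"
    if dest_category == "residential":
        return "home"
    if dest_category == "working-residential" and weekday_count > 3:
        # Frequent destination: treated as the commuter's workplace or home.
        # Assumed rule: morning boardings head to work, later ones head home.
        return "work" if boarding_hour < noon_hour else "home"
    return "other"


counts = weekday_alight_counts(["Wukesong"] * 5 + ["Xidan"])
print(label_trip("working-residential", counts["Wukesong"], boarding_hour=8))
```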

Although the boarding and alighting information is recorded in the AFC data, passenger routes are not. In our study, we assume that every passenger is an intelligent agent who wants to minimize travel cost and maximize the utility of travel. As such, the passenger chooses the shortest path from the boarding station to the alighting station, and we use the shortest-path travel time as the in-vehicle time. We also assume that passengers do not detour when they walk to another station, so we take the Euclidean distance between the two stations as the walking distance.
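The shortest-path assumption can be illustrated with Dijkstra's algorithm on a station graph. The sketch below assumes a hypothetical adjacency dictionary whose edge weights are scheduled run times in minutes taken from the timetable; transfer and waiting times are ignored for simplicity.

```python
import heapq
from typing import Dict

Graph = Dict[str, Dict[str, float]]  # station -> {neighbour: minutes}


def shortest_travel_time(graph: Graph, origin: str, destination: str) -> float:
    """Return the minimum in-vehicle time (minutes) between two stations."""
    best = {origin: 0.0}
    queue = [(0.0, origin)]
    while queue:
        elapsed, station = heapq.heappop(queue)
        if station == destination:
            return elapsed
        if elapsed > best.get(station, float("inf")):
            continue  # stale queue entry
        for neighbour, minutes in graph.get(station, {}).items():
            candidate = elapsed + minutes
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return float("inf")  # destination not reachable


# Hypothetical three-station network with run times in minutes.
network = {
    "A": {"B": 3.0, "C": 10.0},
    "B": {"A": 3.0, "C": 4.0},
    "C": {"A": 10.0, "B": 4.0},
}
print(shortest_travel_time(network, "A", "C"))
```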

AFC data recorded the alighting station, but the actual destination is missing. We assume that the passenger is a smart decision-maker, so he/she would choose an alighting station which is closer to the actual destination. In this study, we assume that the actual destination is somewhere in between the two consecutive stations, the alighting station of the previous trip, and the next trip’s origin, as seen in Figure

The assumption for actual destination and potential alighting station choices. Pink circles are the boarding and alighting stations of the first trip. Green circles are the boarding and alighting stations of the next trip. Yellow circles are candidate alighting stations.

When the alighting stations are relaxed, in order to find some candidate alighting stations, we set a walking buffer circle. According to the previous literature, we take a 15 min walk, or nearly 1 km, as the walking buffer radius. The stations which are included in the buffer circle are candidate alighting stations, shown as yellow circles in Figure
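Candidate selection with a walking buffer can be sketched as below. The example assumes hypothetical station coordinates already projected to kilometres on a plane, so that the Euclidean distance applies directly, and it centres the buffer on the observed alighting station.

```python
import math
from typing import Dict, List, Tuple

Coordinates = Dict[str, Tuple[float, float]]  # station -> (x_km, y_km), hypothetical


def candidate_stations(stations: Coordinates, observed: str,
                       radius_km: float = 1.0) -> List[str]:
    """Return stations within the walking buffer of the observed alighting station.

    The observed station itself is always part of the returned choice set.
    """
    ox, oy = stations[observed]
    inside = [
        name for name, (x, y) in stations.items()
        if math.hypot(x - ox, y - oy) <= radius_km
    ]
    return sorted(inside)


coords = {"S1": (0.0, 0.0), "S2": (0.6, 0.5), "S3": (2.0, 0.0)}
print(candidate_stations(coords, "S1"))
```

A trip whose returned list contains only the observed station has no alternative to choose from and therefore carries no information for a choice model.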

The following notation corresponding to the choice model is used:

Utility function of passenger

Coefficients for in-vehicle time, walking time and ticket fare, respectively

Value of in-vehicle time, walking time, and ticket for passenger

Takes the value one if the corresponding parameter is significant in the utility function

Random error term

Choice set for each passenger

Factor set

Coefficient set

Trip purpose set. 1, 2, and 3 represent work, home, and others.

The MNL model is the primary model in transportation research for calculating the probability of each choice in a choice set. In the Beijing metro, the ticket fare is distance-based, which means that passengers could walk a longer distance to save money. When a passenger chooses an alighting station, three factors affect the utility: in-vehicle travel time, walking time, and ticket fare. For each passenger, the utility function can be written as (

The choice of alternative
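As an illustration of the standard MNL form, the sketch below computes a linear-in-parameters utility for each candidate station and converts the utilities to choice probabilities with the logit formula, P_i = exp(V_i) / sum_j exp(V_j). The function and argument names, the example coefficients, and the assumption that times are expressed in hours are illustrative and are not taken from the paper's notation.

```python
import math
from typing import List, Tuple

Alternative = Tuple[float, float, float]  # (in-vehicle time, walking time, fare)


def mnl_probabilities(alternatives: List[Alternative], beta_ivt: float,
                      beta_wt: float, beta_tf: float = 0.0) -> List[float]:
    """Return the MNL choice probability of each candidate alighting station."""
    utilities = [
        beta_ivt * ivt + beta_wt * wt + beta_tf * fare
        for ivt, wt, fare in alternatives
    ]
    # Subtract the maximum utility before exponentiating for numerical stability.
    top = max(utilities)
    weights = [math.exp(u - top) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


# Two hypothetical candidates: the second saves ride time but needs a longer walk.
choices = [(0.50, 0.05, 4.0), (0.45, 0.20, 4.0)]
probs = mnl_probabilities(choices, beta_ivt=-11.2, beta_wt=-18.3)
print(probs)
```

With negative coefficients on both time terms, the candidate with the shorter walk receives the larger probability in this example, since the extra walking outweighs the saved ride time.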

In the standard logit model, the coefficients for the same factors share the same “preference.” However, a different passenger could have a different preference for the same factor. Mixed logit models can be derived from a variety of different behavioral specifications, and each derivation provides a particular interpretation. The mixed logit model is defined on the basis of the functional form for its choice probabilities. The utility function in the mixed logit model and the coefficient in (

Different elements in

As in some cases, the different elements in

We assume that the in-vehicle time and ticket fare follow a multivariate normal distribution.

Using the Cholesky factorization [
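To make the role of the Cholesky factorization concrete, the sketch below draws two correlated normal coefficients as mean + L z, where L is the lower-triangular Cholesky factor of the covariance matrix and z is a vector of independent standard normal draws, and then averages logit probabilities over the draws. It is an illustration under stated assumptions: only two random coefficients, a hypothetical covariance matrix, and a fixed number of pseudo-random draws; it is not the estimation routine used in the study.

```python
import math
import random
from typing import List, Sequence, Tuple


def cholesky_2x2(s11: float, s12: float, s22: float) -> Tuple[float, float, float]:
    """Lower-triangular factor (l11, l21, l22) of [[s11, s12], [s12, s22]]."""
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 ** 2)
    return l11, l21, l22


def simulated_probabilities(alternatives: List[Tuple[float, float]],
                            mean: Sequence[float],
                            cov: Sequence[Sequence[float]],
                            draws: int = 500, seed: int = 0) -> List[float]:
    """Average logit probabilities over correlated normal coefficient draws."""
    l11, l21, l22 = cholesky_2x2(cov[0][0], cov[0][1], cov[1][1])
    rng = random.Random(seed)
    totals = [0.0] * len(alternatives)
    for _ in range(draws):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        b1 = mean[0] + l11 * z1             # first random coefficient
        b2 = mean[1] + l21 * z1 + l22 * z2  # second, correlated with the first
        utilities = [b1 * x1 + b2 * x2 for x1, x2 in alternatives]
        top = max(utilities)
        weights = [math.exp(u - top) for u in utilities]
        denom = sum(weights)
        for i, w in enumerate(weights):
            totals[i] += w / denom
    return [t / draws for t in totals]


# Hypothetical attributes (x1, x2) of two candidates and hypothetical moments.
sim = simulated_probabilities([(0.50, 4.0), (0.45, 5.0)],
                              mean=[-11.0, 0.6],
                              cov=[[4.0, 0.5], [0.5, 0.25]])
print(sim)
```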

From the one-week AFC dataset, there were 5.05 million transactions each workday in the Beijing metro system. A commuter who takes the metro to work and back home makes at least 2 transactions in the dataset. On average, these transactions are made by 2.9 million cardholders. Based on statistical theory and a sample size calculator [

In this study, we choose 1 km as the walking buffer distance [

For some OD pairs, every alternative alighting station is more than 1 km from the real alighting station, so these OD pairs have no candidate alighting stations, which means the passenger could only egress at that station. A logit model cannot be estimated when the choice set contains only the single observed alighting station. Therefore, these records are excluded, after which 13,180 trips remained.

After applying the trip purpose labelling process, 6027 trips are labeled as work trips, 2339 trips are home trips, and the remaining 4814 trips have other purposes. We used Biogeme [

For the utility function, we assumed that the passenger choice may be influenced by in-vehicle time, walking distance, and ticket fare. To determine which of these factors significantly affect the utility, we tested every factor and every combination of factors in the model. Table

Results of the different factor combination of the MNL model.

Model | Purpose | RhS | ILL | FLL | TF_Coff MV | TF_Coff PV | IVT_Coff MV | IVT_Coff PV | WT_Coff MV | WT_Coff PV |
---|---|---|---|---|---|---|---|---|---|---|
Single factor IVT | W | 0.065 | −7213.23 | −6812.13 | — | — | −7.37 | 0.00 | — | — |
Single factor IVT | H | 0.065 | −2847.97 | −2670.73 | — | — | −7.12 | 0.00 | — | — |
Single factor IVT | O | 0.064 | −6065.24 | −5725.43 | — | — | −7.74 | 0.00 | — | — |
Single factor IVT | T | 0.065 | −16071.30 | −15277.80 | — | — | −7.52 | 0.00 | — | — |
Single factor WT | W | 0.412 | −7213.23 | −4466.97 | — | — | — | — | −18.2 | 0.00 |
Single factor WT | H | 0.131 | −2847.97 | −2472.62 | — | — | — | — | −9.87 | 0.00 |
Single factor WT | O | 0.283 | −6065.24 | −4435.62 | — | — | — | — | −13.2 | 0.00 |
Single factor WT | T | 0.376 | −16071.30 | −10801.90 | — | — | — | — | −16.4 | 0.00 |
Two factors WT and IVT | W | 0.475 | −7213.23 | −4015.23 | — | — | −11.2 | 0.00 | −18.3 | 0.00 |
Two factors WT and IVT | H | 0.21 | −2847.97 | −2442.40 | — | — | −7.4 | 0.00 | −9.01 | 0.00 |
Two factors WT and IVT | O | 0.353 | −6065.24 | −3979.53 | — | — | −13.2 | 0.00 | −15.2 | 0.00 |
Two factors WT and IVT | T | 0.414 | −16071.30 | −9729.70 | — | — | −11.2 | 0.00 | −16.4 | 0.00 |
Three factors | W | 0.477 | −7213.23 | −4008.45 | 0.570 | 0.02 | −11.4 | 0.00 | −19.8 | 0.00 |
Three factors | O | 0.354 | −6065.24 | −3960.00 | 0.665 | 0.00 | −13.7 | 0.00 | −15.3 | 0.00 |
Three factors | T | 0.412 | −16071.30 | −9705.12 | 0.598 | 0.00 | −13.1 | 0.00 | −16.6 | 0.00 |

RhS = rho square; ILL = init log likelihood; FLL = final log likelihood; PV =

First, we consider a single factor in the utility function. A single factor could not explain passenger behavior very well, especially the ticket fare, which did not influence the passenger choice. Walking time is the most influential of the three factors. The coefficient for in-vehicle time is almost the same across the four trip types, but the coefficient for walking time differs by trip purpose.

Among the two-factor combinations, in-vehicle time and walking time explained user behavior best of the three possible combinations. This combination could describe every trip purpose well. Regardless of the trip purpose, walking time carries a higher disutility than in-vehicle time. On average, the walking and in-vehicle time coefficient ratio

As for the final log likelihood, the chi-square test was used to analyze the passenger behavior based on different trip purposes rather than overall. In this case, we use

When we consider only the rho square, the model with three factors in the utility function performs slightly better than the two-factor combinations. However, in the three-factor model, the coefficient for ticket fare is positive. In the Beijing metro system, the ticket fare is distance-based and potentially highly correlated with in-vehicle time, so the positive coefficient can be interpreted as an adjustment for an overestimated in-vehicle time coefficient. To be more objective, in the next step, the walking and in-vehicle time model serves as the test model for home, work, other, and total trips, and the three-factor model is the candidate model for work, other, and total trips.
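One common way to compare nested specifications such as the two-factor and three-factor models is a likelihood ratio test, LR = −2(LL_restricted − LL_unrestricted), which is compared against a chi-square distribution with degrees of freedom equal to the number of added parameters. The sketch below applies it to the final log likelihoods reported for work trips; it is an illustration and not necessarily the test carried out in the study. For one degree of freedom, the chi-square tail probability equals erfc(sqrt(x/2)), so only the standard library is needed.

```python
import math
from typing import Tuple


def likelihood_ratio_test(ll_restricted: float,
                          ll_unrestricted: float) -> Tuple[float, float]:
    """Return the LR statistic and its p-value for one added parameter (1 df)."""
    lr = max(-2.0 * (ll_restricted - ll_unrestricted), 0.0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


# Work trips: two-factor model (WT and IVT) versus three-factor model (adds TF).
lr_stat, p_val = likelihood_ratio_test(-4015.23, -4008.45)
print(round(lr_stat, 2), p_val < 0.05)
```

A small p-value only indicates that the extra coefficient improves the fit statistically; it says nothing about whether its sign is behaviorally plausible, which is the concern raised above for the positive fare coefficient.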

We considered the three-factor and two-factor models in the mixed logit model for utility function estimation. For each utility function, similar to the MNL analysis, we test the factors with different combinations such as single-factor or two-factor with independent or correlated distributions.

In-vehicle time, walking time, and ticket fare are all considered in the three-factor utility function. For each trip purpose, fourteen combinations of the mixed logit model were tested. Because of the computational complexity of mixed logit model estimation, only some cases could reach convergence, such as the two independent distributions for fare and in-vehicle time. However, for some combinations, even when the estimation is converged, the coefficients in the model did not pass the

Among the passed models, the penalty for walking time is much higher than that for in-vehicle time, where the home trip has the highest coefficient ratio. Meanwhile, from other mixed logit models, we learned that the ticket fare standard deviation and in-vehicle time standard deviation are not significant for the utility function, which means that different passengers could share the same coefficient for ticket fare and in-vehicle time.

From the previous tests, we learned that walking time and in-vehicle time are more important factors compared with ticket fare. In this case, we only consider the walking and in-vehicle times in the utility function to see which mixed logit combinations could explain the passenger behavior well. From the results, similar to the three-factor utility condition, the single walking time distribution model also passed the

Overall, for work trips, other trips, and total trips, some mixed logit models could describe passenger behavior well, and based on the rho square, the mixed logit models gave better estimation results than the MNL models. Comparing the models with rho square and

The selected combination of mixed models for different trip purposes.

Pur | Mixed logit model | RhS | ILL | FLL | TF_Coefficient MV | TF_Coefficient PV | IVT_Coefficient MV | IVT_Coefficient PV | WT_Coefficient MV | WT_Coefficient PV | TF_Stad MV | TF_Stad PV | WT_Stad MV | WT_Stad PV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
W | Three-factor utility function, WT distribution | 0.627 | −7213.23 | −3202.92 | 0.81 | 0.08 | −18.1 | 0.00 | −249.00 | 0.00 | — | — | −156.00 | 0.02 |
O | Three-factor utility function, WT distribution | 0.512 | −6065.24 | −3212.77 | 1.32 | 0.02 | −21.1 | 0.00 | −312.00 | 0.08 | — | — | −266.00 | 0.08 |
T | Three-factor utility function, independent distributions WT and TF | 0.593 | −16071.3 | −8010.65 | 1.03 | 0.00 | −17.3 | 0.00 | −302.00 | 0.00 | 5.62 | 0.04 | −213.00 | 0.00 |

Stad = standard deviation.

Based on the analysis above, we selected the best model for each trip purpose. For validation, we randomly selected another 9573 cardholders and applied the same preprocessing, such as data cleaning, trip purpose labelling, and candidate station selection, as presented in the first part of the empirical study. For each trip purpose, 70% of the data is used as the sample to estimate the coefficients of each model, and the remaining data is used for alighting station estimation simulation by Biosim [
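The hold-out procedure can be sketched in a few lines. The example assumes a simple random 70/30 split and, as a simplification, takes the candidate station with the highest predicted probability as the estimated alighting station; the simulation tool used in the study may draw choices differently. All names and numbers below are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_sample(items: Sequence, train_share: float = 0.7,
                 seed: int = 0) -> Tuple[list, list]:
    """Randomly split records into an estimation set and a validation set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_share * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def hit_rate(predicted_probs: List[List[float]], observed: List[int]) -> float:
    """Share of trips whose most probable candidate is the observed station."""
    hits = 0
    for probs, chosen in zip(predicted_probs, observed):
        predicted = max(range(len(probs)), key=lambda i: probs[i])
        if predicted == chosen:
            hits += 1
    return hits / len(observed)


estimation, validation = split_sample(range(10))
# Hypothetical predicted probabilities and observed choices for three trips.
rate = hit_rate([[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]], [0, 1, 1])
print(len(estimation), len(validation), rate)
```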

Results for alighting station estimation based on selected MNL and mixed logit models.

Trip purpose | MNL model | MNL percentage | Mixed logit model | Mixed logit percentage |
---|---|---|---|---|
Home | Two factors (WT and IVT) | 66.30% | — | — |
Work | Two factors (WT and IVT) | 78.27% | Three-factor utility function, WT distribution | 81.31% |
Others | Two factors (WT and IVT) | 70.74% | Three-factor utility function, WT distribution | 75.35% |
Total | Two factors (WT and IVT) | 72.59% | Three-factor utility function, independent distributions WT and TF | 79.23% |

From Table

This study is focused on the utility function calibration for alighting station estimation for different trip purposes. The main conclusions of this paper are fivefold:

We provided a two-step trip purpose labelling process to infer the trip purpose. Based on the land use and passenger flow pattern,

The walking buffer radius was applied to infer the real destination. With three assumptions and the trip chaining method, the actual destination and candidate alighting stations of the trips were inferred.

MNL and mixed logit models were proposed to describe passenger behavior. To estimate alighting stations, MNL and mixed logit models with different combinations of independent variables were examined for different trip purposes.

The influence factors for alighting station choice were tested. In the empirical study, passengers were found to have a different penalty for walking time and in-vehicle time based on trip purpose, and in general, walking time has a higher disutility. Ticket fare was not found significant compared with walking time and in-vehicle time.

The validation test demonstrates the feasibility of the methodology proposed in this paper. In the validation test, the model successfully estimated about 75% of the alighting stations. Work trips have higher accuracy than trips with other purposes. This coefficient calibration helps planners understand passenger behavior better and could be used in planning and policy applications.

Using real AFC alighting station data, this research provides a new method to infer the alighting station and to validate models of passenger behavior. Compared with an on-board survey, this approach is much cheaper and more convenient. Meanwhile, this work considers passenger alighting behavior for different trip purposes, which is a new aspect of alighting behavior analysis.

Some aspects of this study could be improved in future research. The trip purpose labelling process is based on land use, passenger flow pattern, trip time, and alighting station frequency. The trip purpose could instead be defined as a latent variable, with a latent logit model applied to capture it automatically from alighting station frequency, trip sequence, and boarding time. Moreover, we will apply the model to a larger data sample to obtain more accurate estimates of complex models such as mixed logit. Finally, if possible, passengers’ sociodemographic characteristics could be incorporated in the choice model to enrich it and to analyze passenger behavior from a different perspective.

The authors declare that there is no conflict of interests regarding the publication of this paper.

This work was supported by the Foundation of China Scholarship Council, the National Natural Science Foundation Project of P.R. China under Grant no. U1434207, and the Beijing Municipal Natural Science Foundation under Reference no. 8162033. The authors thank the coworkers in CEGE transit lab at University of Minnesota for their constructive suggestions and comments that have led to a significant improvement in this paper.