Estimating Train Choices of Rail Transit Passengers with Real Timetable and Automatic Fare Collection Data

An urban rail transit (URT) system is operated according to a relatively punctual schedule, which is one of the most important constraints on a URT passenger's travel. Estimating passengers' train choices is therefore key, because passenger route choices as well as the flow distribution on the URT network can be deduced from them. In this paper we propose a methodology that estimates individual passengers' train choices with real timetable and automatic fare collection (AFC) data. First, we formulate the problem using Manski's paradigm on modelling choice. Then, an integrated framework for estimating individual passengers' train choices is developed through a data-driven approach. The approach links each passenger trip to the most feasible train itinerary. An initial case study on the Shanghai metro shows that the proposed approach works well and can be further used for deducing other important operational indicators such as route choices, passenger flows on sections, and train load factors.


Introduction
Passenger flow is the foundation for making and coordinating operation plans for an urban rail transit (URT) system, and assigning passenger flows on the URT network plays a paramount role in analyzing (calculating, predicting, and simulating) those flows. A number of transit assignment models have been developed using both theory and practical experience, and thorough reviews are available in the literature [1][2][3]. However, unlike urban road traffic systems, a URT system is operated according to a relatively punctual schedule, which is an important constraint on a URT passenger's travel. Thus, the passenger flow distribution on the network is subject not only to passengers' physical route choices but also to their individual train choices, especially in peak hours (Figure 1), which may be the more important issue [4]. For analyzing passenger flows on a schedule-based URT network, estimating passengers' train choices is key for three reasons: (1) On a schedule-based URT network, passenger route choices as well as the flow distribution on the network can be deduced once the train choices of passengers are obtained, but the reverse does not hold.
(2) It gives more precise estimation results in both the spatial and temporal dimensions, since URT passengers may fail to board a train under certain conditions, especially in peak hours, because of overcrowding.
(3) If each passenger's train choice can be identified over a long period of time, this information would be further useful for improving the customer relationship management of a URT company and for improving train timetables. For example, URT companies can check how passengers select trains after timetable improvements.
As mentioned, a number of transit assignment models have been developed for analyzing passenger flows on the network. In those models, the conventional approach to obtaining passenger route choice preference data is to conduct field surveys in rail stations, asking passengers about the exact route they took to reach their destinations. However, the shortcomings of these methods have been identified by more and more researchers. For example, the data resulting from these manual methods may be subject to bias and error and are expensive and time consuming both to collect and to process [5]. In addition, the manual methods usually focus only on a particular location and time [6,7]. As a result, alternative concepts and methods need to be developed.
In recent years, automatic fare collection (AFC) data such as smart card data have been used by transit service providers to analyze passenger demand and system performance. These data have been used for O-D matrix estimation [8,9], demand analysis [10,11], travel behavior analysis [12], operational management, and public transit planning [13][14][15]. In particular, there are emerging studies dealing with AFC data of URT systems. Some impressive publications include works by Chan in 2007 [16] [22], and Sun et al. in 2015 [23]. However, in spite of the widespread attention to the use of AFC data, few studies deal with passenger train choice behavior in a URT system. Kusakabe et al. [17] developed a methodology for estimating which train would be boarded by each smart card holder using long-term transaction data. Their approach was based on the assumption that smart card data that could not be matched to the possible train choices would be assigned with equal probability. Zhou and Xu [20] developed a passenger flow assignment model based on entry and exit time constraints from AFC data. The model includes an algorithm for generating a path's boarding plan, which is similar to passenger train choice. However, the matching degree employed in this algorithm is more intuitive than rigorously defined. Sun and Schonfeld [19] proposed a schedule-based passenger path-choice estimation model using AFC data. The model uses the train schedule connection network (TSCN), which considers passengers' behaviors of boarding and alighting from the train. However, the weighted assignment used by the model may not be appropriate for a factual travel choice process, which uses only one route at a time rather than multiple routes. The problem becomes more obvious for those O-D pairs with fewer passenger trips.
For a better understanding of passenger flows on the network, the objective of this paper is to propose a methodology that can estimate passenger train choices with real timetable and AFC data. The contributions of this paper are as follows: (1) We formulate the addressed problem using Manski's [24] paradigm on modelling choice, which consists of generating the consideration choice set and calculating the corresponding choice probability.
(2) An integrated framework for estimating passenger train choices is developed. The approach links each AFC transaction (a passenger trip) to the most feasible train itinerary (a boarding plan).
(3) Real timetable and AFC data are investigated as the inputs to the proposed methodology, instead of relying on manual methods.
The remainder of this paper is organized as follows. In Section 2, the estimation problem of passenger train choices is described and formulated. Section 3 presents the integrated estimation framework. In particular, methods of deducing the passenger boarding plan, choice probability, and travel behavior parameters are developed with real timetable and AFC data. Section 4 demonstrates a case study of the proposed approach. Finally, Section 5 concludes the paper.

Formulating the Problem
The topic discussed in this paper falls within the scope of choice modelling. From a variety of studies [24,25] it is well known that the size and composition of choice sets matter in choice model estimation. Incorrect choice sets can lead to misspecification of choice models [26,27]. Furthermore, for a variety of reasons the specification of train choice sets for train choice modelling is different from and more complex than that for mode choice and route choice, which is why this topic deserves special attention.
To clearly formulate the estimation problem, Manski's [28] paradigm on predicting choice is used. The essential conceptual contribution of this paradigm lies in its explicit treatment of the processes that make perfect predictions of choice behavior unattainable. To date, most of the existing literature on random utility models still imposes distributional assumptions directly, and this practice has often caused researchers to remain unaware of the restrictiveness of their models because it leaves so much information implicit.
Manski's paradigm states that the probability that passenger $n$ chooses alternative $i$ from the choice set $CS_n$, which is also called his/her consideration set, is given by the following expression:

$$P_n(i \mid US_n) = \sum_{CS_n \subseteq US_n} P_n(i \mid CS_n)\, P(CS_n \mid US_n),$$

with $US_n$ the universal set of all alternatives available to passenger $n$.

The above solution can also be depicted as in Figure 2. The horizontal axis indicates the travel cost of alternative $i$, and the vertical axis indicates the probability that passenger $n$ chooses alternative $i$. As we use travel time as the cost measure in this study, "cost" and "time" are treated as interchangeable throughout the paper. The red vertical line indicates the observed travel time of passenger $n$ extracted from his/her AFC transaction record. Each alternative in his/her consideration set ($CS_n$) can be plotted as a dot in the figure. How, then, do we estimate which train itinerary the passenger actually chose? It seems natural that the alternative that lies close to the red vertical line and has a higher probability is the one most likely used by the passenger.
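The two-stage decomposition above can be illustrated with a minimal Python sketch. The function name `choice_probability`, the data layout, and all numbers are hypothetical and serve only to show how the sum over consideration sets is evaluated; they are not taken from the case study.

```python
def choice_probability(alternative, weighted_sets):
    """Return P(i | US) = sum over CS of P(i | CS) * P(CS | US).

    weighted_sets is a list of (p_cs, cond) pairs, where p_cs is P(CS | US)
    and cond maps each alternative in that CS to P(alternative | CS).
    """
    return sum(p_cs * cond.get(alternative, 0.0) for p_cs, cond in weighted_sets)


# Hypothetical numbers, for illustration only.
weighted_sets = [
    (0.7, {"plan A": 0.6, "plan B": 0.4}),  # consideration set {A, B}
    (0.3, {"plan A": 1.0}),                 # consideration set {A}
]
p_a = choice_probability("plan A", weighted_sets)  # 0.7 * 0.6 + 0.3 * 1.0
p_b = choice_probability("plan B", weighted_sets)  # 0.7 * 0.4
```

When the consideration set is produced by a deterministic filter, as the procedure in this paper appears to do, the sum reduces to a single term with weight 1.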

Methodology
3.1. Overview of Estimation Procedure. For an individual passenger, his/her train choice solution during the travel can be depicted as a boarding plan, which is the order of trains that he/she can take to complete his/her travel. The overall framework of our estimation procedure for this kind of boarding plan is shown in Figure 3. At the beginning of the algorithm, denoted by "a," the AFC data are extracted from the original transaction data and sorted by the fields of origin station, entry time, destination station, and exit time, which will be used later. After these data are sorted, several travel behavior parameters of passengers are extracted from abundant timetable and AFC data, which is denoted by "b." Then, one record of the AFC transaction data (which is also a passenger trip) is extracted for estimation. To generate the consideration set, the boarding plan generation algorithm is applied at "c." At "d," the choice probability of each boarding plan is calculated. At "e," the train choice solution (that is, the boarding plan the passenger chose) is determined based on the probability of each alternative in the consideration set. These processes are repeated until all of the records are estimated.

Figure 3: Overall framework of the estimation procedure (flowchart). Box labels: Start; read the physical network, schedule, and AFC data; extract and sort AFC data with fields of origin station, entry time, destination station, and exit time; extract the first record of the AFC data; determine the eigenperiod to which the ith record belongs; generate the universal set of boarding plans with physical topology and schedule; generate the consideration set of boarding plans with constraints of passengers' travel cost and behavior parameters; calculate the point probabilities for an alternative solution in the consideration set; calculate the probability of the alternative solution based on its point probabilities; extract the first alternative in the consideration set; is it the last alternative?; determine the boarding plan solution the passenger chose.

Universal Set Generation.
This is a two-step part, as shown in Figure 4. Because of the URT system's networked operation, there may be several alternative routes for a given O-D pair, and passengers in practice will choose not only the shortest route but also the second, third, ..., kth shortest route because of their imperfect knowledge of the network, individual differences, congestion, and so forth. First, an improved Deletion Algorithm (DA) [29] based on Depth-First Traversal (DFT) is introduced to find the kth shortest route, and the initial route choice set of the O-D pair is obtained. Second, for each route in the initial route choice set, all the boarding choices of a given passenger at each boarding station (origin, destination, or transfer station) on the route are deduced with the corresponding schedule data. The universal set of the passenger's boarding plans can then be obtained.
The improved DA based on DFT is provided as follows. Unlike other $k$-shortest path algorithms, it does not miss any possible route, including ring routes.
Step 1. Determine the shortest tree of the directed graph $(N, A)$ rooted at origin $s$ based on the Dijkstra Algorithm. Let $p_k$ be the shortest path from origin $s$ to destination $t$ in $(N, A)$. Note that $k = 1$.
Step 2. If $k$ does not exceed $K$, which is the maximum number of shortest paths to be found, and there is still an alternative path in $(N, A)$, let $p = p_k$ and proceed to Step 3; otherwise, the algorithm stops.
Step 5. Let $n_i$ denote any node following $n_h \in p$. Then execute as follows.
Step 5.1. Add the primed node $n'_i$ of node $n_i$ to $N$.
Step 5.3. Compute $d_{n'_i}$ and find the shortest path from $s$ to $n'_i$.
Step 6. Let $p$ be the shortest path from $s$ to the primed node $t'(k)$ of node $t$ in $(N, A)$, so that $p$ is the best alternative path of $p_{k-1}$. Set $k = k + 1$ and proceed to Step 2.
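For readers who want to experiment with route choice sets, the following Python sketch enumerates the $k$ cheapest routes between two stations. It is not the improved Deletion Algorithm described above: it is a simpler best-first enumeration swapped in for illustration, it only returns loopless routes (so, unlike the DA, it excludes ring routes), and it can be slow on large networks. The graph layout, the function name `k_shortest_simple_paths`, and the toy costs are assumptions.

```python
import heapq
from itertools import count


def k_shortest_simple_paths(graph, origin, destination, k):
    """Return up to k loopless paths as (cost, path), cheapest first.

    graph maps a node to a dict {neighbor: non-negative cost}.
    Partial paths are expanded in order of cost (best-first search),
    so complete paths are found in non-decreasing cost order.
    """
    tie = count()  # tie-breaker so the heap never compares two path lists
    heap = [(0.0, next(tie), [origin])]
    found = []
    while heap and len(found) < k:
        cost, _, path = heapq.heappop(heap)
        node = path[-1]
        if node == destination:
            found.append((cost, path))
            continue
        for nxt, weight in graph.get(node, {}).items():
            if nxt not in path:  # keep the path loopless
                heapq.heappush(heap, (cost + weight, next(tie), path + [nxt]))
    return found


# Toy network with hypothetical travel times in minutes.
network = {
    "A": {"B": 2, "C": 4},
    "B": {"C": 1, "D": 7},
    "C": {"D": 3},
}
routes = k_shortest_simple_paths(network, "A", "D", k=2)
```

Here `routes` holds the two cheapest routes from A to D; raising `k` to 3 would also return the direct A-B-D route.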
Moreover, because of congestion, a passenger may fail to board and have to wait for the next train. The maximum "fail to board" (FtB) number is set to 3 based on investigations in China, which means that a passenger can board a train within four runs even if congestion during peak hours prevents the passenger from boarding the first train.
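The FtB limit translates directly into a bound on the number of candidate trains at each boarding station. The sketch below assumes departure times are stored as a sorted list of seconds after midnight and that a passenger can only board trains departing strictly after his/her arrival on the platform; the function name `candidate_departures` and the sample timetable are hypothetical.

```python
from bisect import bisect_right

MAX_FTB = 3  # maximum number of "fail to board" events, as stated in the text


def candidate_departures(departures, platform_arrival, max_ftb=MAX_FTB):
    """Return the departures a passenger could board at one station.

    departures must be sorted in ascending order. The passenger may miss
    at most max_ftb trains, so at most max_ftb + 1 trains are candidates.
    """
    first = bisect_right(departures, platform_arrival)  # strictly later trains
    return departures[first:first + max_ftb + 1]


# Hypothetical timetable: one train every 180 s.
timetable = [100, 280, 460, 640, 820, 1000]
options = candidate_departures(timetable, platform_arrival=300)
```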

Consideration Set Generation.
A boarding plan for a given passenger is the order of trains that the passenger can take to complete his/her travel. Obviously, it is difficult to determine which train the passenger boarded in reality. However, the passenger is rarely delayed while walking out of the destination station, and consequently the train he or she alighted from can be determined accurately. Thus, we can calculate backward from the destination station to the origin station. For a given trip record (AFC transaction data) obtained from the URT system, a boarding plan is considered unreasonable and should be removed from the universal set if its boarding time at the origin station is impossible for the passenger given the constraint of his/her entry time (Figure 5).
Therefore, a filtering algorithm can be developed to further narrow the universal set and obtain the consideration set. The algorithm is described as follows.
Step 1. Obtain possible boarding plans (universal set). For an actual passenger trip, with the corresponding train diagram, the passenger's exit time, and the walking time at the destination station, the possible boarding plans for each route can be easily deduced.
Step 2. Calculate the departure time of a possible boarding plan. Based on the passenger's travel chain combined with train diagrams, the departure time $t_{\text{departure}}$ of the possible boarding plan of each route can be calculated backward from the destination station to the origin station.
Step 3. Compare and remove. As shown in Figure 5, the calculated departure time $t_{\text{departure}}$ of the possible boarding plan at the origin station is compared with the passenger's arrival time ($t_{\text{entry}} + w^{\text{access}}_{\min}$) on the platform, where $t_{\text{entry}}$ is the passenger's entry time at the origin station and $w^{\text{access}}_{\min}$ is the fastest walking time from the entry gate to the platform. If $t_{\text{entry}} + w^{\text{access}}_{\min} < t_{\text{departure}}$, the boarding plan is reasonable for the passenger to choose; otherwise, the boarding plan is unreasonable and is removed from the universal set.
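Step 3 of the filter can be sketched in a few lines of Python. The `BoardingPlan` record, its fields, and the sample values are hypothetical; times are assumed to be seconds after midnight, and the origin departure time of each plan is assumed to have been computed beforehand as in Step 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoardingPlan:
    route: str             # label of the physical route (hypothetical)
    train_ids: tuple       # ordered trains used along the route
    origin_departure: int  # departure time at the origin station, in seconds


def consideration_set(universal_set, entry_time, min_access_walk):
    """Keep the plans whose first train leaves after the earliest possible
    arrival of the passenger on the platform (entry time + fastest walk)."""
    platform_arrival = entry_time + min_access_walk
    return [p for p in universal_set if platform_arrival < p.origin_departure]


universal = [
    BoardingPlan("route 1", ("T101", "T305"), 27_000),
    BoardingPlan("route 1", ("T102", "T305"), 27_180),
    BoardingPlan("route 2", ("T102", "T410"), 27_180),
]
# Passenger taps in at 27 050 s and needs at least 60 s to reach the platform.
feasible = consideration_set(universal, entry_time=27_050, min_access_walk=60)
```

The first plan is dropped because its train leaves at 27 000 s, before the passenger can reach the platform at 27 110 s.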

Point Probability Calculation.
For a given boarding plan in the obtained consideration set, we call each boarding station (origin, destination, or transfer station) in the boarding plan a boarding point. The point probabilities of a boarding plan need to be calculated first.
It should be noted that passengers may fail to board the train under certain conditions, especially in peak hours, because of overcrowding, although they are usually inclined to board the first train. Therefore, without loss of generality, we use "point probability" to denote the probability that a passenger boards the train specified by a given boarding plan. For a boarding point $j$ in plan $i$, the probability that a passenger leaves with the train is $p_{ij}$, which can be obtained directly from the StB (success to board) rates as shown in Figure 6.
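One way to read a point probability off the StB rates is a simple lookup by the number of trains the passenger let pass at that boarding point. The sketch below assumes this reading; the function name `point_probability` and the rate vector are hypothetical.

```python
def point_probability(stb_rates, trains_missed):
    """Return the probability of boarding after missing trains_missed trains.

    stb_rates[m] is the share of passengers who board the (m + 1)-th train
    after arriving on the platform. Plans that need more missed trains than
    the vector covers get probability 0.
    """
    if 0 <= trains_missed < len(stb_rates):
        return stb_rates[trains_missed]
    return 0.0


# Hypothetical StB rates at one station, direction, and period.
origin_rates = (0.66, 0.34, 0.0, 0.0)
p_first_train = point_probability(origin_rates, 0)
p_second_train = point_probability(origin_rates, 1)
```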

Plan Probability Calculation.
The plan probability is a function of the point probabilities. Considering that the boarding point with the minimum probability is the bottleneck for the boarding plan to be chosen, instead of the product of the probabilities at all boarding points, we adopt the following function:

$$P_i = \min_{j} p_{ij},$$

where $P_i$ is the probability of plan $i$ and $p_{ij}$ is the point probability at boarding point $j$ of plan $i$.
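The bottleneck rule can be sketched as follows. The point probabilities used here are those of the two-plan example of Figure 6 (0.66 and 0.27 for plan 1, 0.34 and 0.73 for plan 2); the function names are hypothetical, and picking the plan with the largest bottleneck probability is a simplifying assumption that ignores how close each plan's travel time is to the observed one.

```python
def plan_probability(point_probs):
    """Bottleneck rule: a plan is only as likely as its least likely boarding."""
    return min(point_probs)


def most_likely_plan(plans):
    """Return the name of the plan with the highest bottleneck probability."""
    return max(plans, key=lambda name: plan_probability(plans[name]))


plans = {
    "plan 1": [0.66, 0.27],  # origin station, transfer station
    "plan 2": [0.34, 0.73],
}
p1 = plan_probability(plans["plan 1"])  # bottleneck at the transfer station
p2 = plan_probability(plans["plan 2"])  # bottleneck at the origin station
chosen = most_likely_plan(plans)
```

Under this rule plan 1 scores 0.27 and plan 2 scores 0.34, so plan 2 is selected.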
For example (as shown in Figure 6), suppose there are two boarding plans in the consideration set. For plan 1, the point probability is 0.66 for the train within the given boarding plan at the origin station and 0.27 at the transfer station. For plan 2, the point probability is 0.34 for the origin station and 0.73 for the transfer station.

[...] and the "success to board" (StB) rate ($p_{\text{StB}}$). Walking time parameters are used for generating the consideration sets, while the StB rate parameter is used for calculating the choice probabilities of boarding plans.

Access/Egress Walking Time Extraction.
First, we deduce the parameters $w^{\text{egress}}_{\min}$ and $w^{\text{egress}}_{\max}$ at every station on the network based on AFC data. It should be noted that passengers may be delayed at the origin station and transfer stations by passenger flow, the capacity utilization rate of the train, and other factors, but they are rarely delayed while walking out of the destination station. Thus, the parameters $w^{\text{egress}}_{\min}$ and $w^{\text{egress}}_{\max}$ are easier to deduce. By matching the train's arrival time derived from schedule data with passengers' exit times derived from AFC data, passengers' egress walking times can be obtained and their distribution can be extracted as well. It is a kind of normal distribution and can be calibrated with the AFC data. Then, we set the minimum egress walking time ($w^{\text{egress}}_{\min}$) to the 5th percentile of the calibrated distribution and the maximum egress walking time ($w^{\text{egress}}_{\max}$) to the 95th percentile of the calibrated distribution.
Second, we obtain the parameters $w^{\text{access}}_{\min}$ and $w^{\text{access}}_{\max}$ at every station on the network. Passengers may be delayed while walking to the platform, which makes the distribution of access walking times different from that of egress walking times, and passengers' exact arrival times on the platform cannot be obtained directly. However, we can still suppose $w^{\text{access}}_{\min} = w^{\text{egress}}_{\min}$ and $w^{\text{access}}_{\max} = w^{\text{egress}}_{\max}$, since the processes of a passenger's access and egress are roughly symmetrical and we only want to obtain the thresholds rather than the exact distribution.
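The threshold extraction can be sketched with Python's standard library. The sketch assumes that each passenger alighted from the latest train that arrived before his/her exit time, fits a normal distribution to the resulting walking times, and reads off the 5th and 95th percentiles. Function names and sample times (in seconds) are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist


def egress_walk_times(train_arrivals, exit_times):
    """Match each exit time to the latest earlier train arrival (assumption)
    and return the implied egress walking times."""
    walks = []
    for exit_t in exit_times:
        idx = bisect_right(train_arrivals, exit_t) - 1
        if idx >= 0:  # skip exits recorded before the first train arrived
            walks.append(exit_t - train_arrivals[idx])
    return walks


def walk_thresholds(walks):
    """Fit a normal distribution and return its 5th and 95th percentiles."""
    dist = NormalDist.from_samples(walks)  # needs at least two samples
    return dist.inv_cdf(0.05), dist.inv_cdf(0.95)


arrivals = [0, 300, 600]          # sorted train arrival times at the station
exits = [60, 90, 380, 400, 700]   # AFC exit times at the same station
walks = egress_walk_times(arrivals, exits)
w_min, w_max = walk_thresholds(walks)
```

With the symmetry assumption stated above, the same two values would also serve as the access walking time thresholds.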

Transfer Walking Time Extraction.
In order to extract the parameters $w^{\text{transfer}}_{\min}$ and $w^{\text{transfer}}_{\max}$, the following assumptions are adopted in advance: (1) The walking speed of the same passenger should be on the same level throughout his or her trip chain. In other words, for a given passenger, the walking speeds at stations (origin station, destination station, or transfer station) should not differ greatly from each other.
(2) The delay caused by crowding, high capacity utilization of the train, and similar factors happens to an individual passenger at the origin station and at transfer stations with equal probability.
(3) Last but not least, we only try to extract the thresholds rather than the exact distribution.
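Then, the minimum transfer walking time ($w^{\text{transfer}}_{\min}$) and maximum transfer walking time ($w^{\text{transfer}}_{\max}$) at a transfer station can be calculated as follows.
Step 1. Aggregate the AFC data whose O-D flows use the given transfer station as their unique transfer point.
Step 2. Calculate the egress walking speeds from the egress walking times and distances ($d^{\text{egress}}$) and set the transfer walking speeds equal to the egress walking speeds; that is,

$$v^{\text{transfer}}_{\max} = v^{\text{egress}}_{\max} = \frac{d^{\text{egress}}}{w^{\text{egress}}_{\min}}, \qquad v^{\text{transfer}}_{\min} = v^{\text{egress}}_{\min} = \frac{d^{\text{egress}}}{w^{\text{egress}}_{\max}}.$$

Step 3. Calculate the transfer walking times at the transfer station from the calculated transfer walking speeds and distances ($d^{\text{transfer}}$); that is,

$$w^{\text{transfer}}_{\min} = \frac{d^{\text{transfer}}}{v^{\text{transfer}}_{\max}}, \qquad w^{\text{transfer}}_{\max} = \frac{d^{\text{transfer}}}{v^{\text{transfer}}_{\min}}.$$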
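Steps 2 and 3 amount to a speed conversion, which the following sketch makes explicit. It assumes that the fastest egress walk gives the maximum speed and the slowest egress walk the minimum speed; the function name, the distances (in metres), and the times (in seconds) are hypothetical.

```python
def transfer_walk_thresholds(egress_dist, egress_min, egress_max, transfer_dist):
    """Return (min, max) transfer walking times at a transfer station.

    Egress walking speeds are derived from the egress distance and the
    egress time thresholds, then reused for the transfer passage.
    """
    if min(egress_dist, egress_min, egress_max, transfer_dist) <= 0:
        raise ValueError("distances and times must be positive")
    v_max = egress_dist / egress_min  # fastest walker
    v_min = egress_dist / egress_max  # slowest walker
    return transfer_dist / v_max, transfer_dist / v_min


# 120 m egress passage walked in 60 s to 150 s; 200 m transfer passage.
t_min, t_max = transfer_walk_thresholds(120.0, 60.0, 150.0, 200.0)
```

In this example the speeds are 2.0 m/s and 0.8 m/s, giving transfer walking times of about 100 s and 250 s.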

StB Rate Parameter Extraction.
Finally, we deduce the StB (success to board) rate parameter. Assuming that StB is a direct outcome of overcrowding, which is mostly true in peak periods, we can conclude that as long as passengers depart from the same station in the same direction and period, the StB parameter is the same. In that case we can use those O-D flows without any transfers (and hence no alternative route) to estimate the StB parameter, and then apply the resulting parameters to O-D flows with transfers. The StB parameter can be defined as a vector as follows:

$$\mathbf{p}_{\text{StB}} = (p_0, p_1, p_2, p_3),$$

where $p_0$, $p_1$, $p_2$, and $p_3$ are the probabilities that passengers succeed in boarding the first, second, third, and fourth train, respectively, and all items in the vector sum to 1. Taking a case from the Shanghai metro network as an example, if we want to calculate the StB of the down direction during 8:00 AM∼9:00 AM at Yanchang Rd. Station of Line 1, we can use the data of those O-D flows without any transfers, including Yanchang Rd. → Zhongshan Bei Rd., Yanchang Rd. → Shanghai Railway Station, Yanchang Rd. → Hanzhong Rd., Yanchang Rd. → Xinzha Rd., and Yanchang Rd. → People's Square. Table 1 shows the distribution of passengers boarding different trains during 8:00 AM∼9:00 AM at Yanchang Rd. Based on Table 1, the StB of the down direction during 8:00 AM∼9:00 AM at Yanchang Rd. Station can be deduced (Table 2).
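The estimation of the StB vector from no-transfer trips can be sketched as a counting exercise. The sketch assumes that, for each trip, the platform arrival time and the departure time of the train actually boarded have already been inferred; the function name `stb_vector` and the sample data are hypothetical and do not reproduce Table 1.

```python
from bisect import bisect_right
from collections import Counter


def stb_vector(trips, departures, max_ftb=3):
    """Estimate (p_0, ..., p_max_ftb) from no-transfer trips.

    trips is a list of (platform_arrival, boarded_departure) pairs and
    departures is the sorted list of train departures at the station.
    boarded_departure must appear in departures.
    """
    counts = Counter()
    for platform_arrival, boarded_departure in trips:
        first = bisect_right(departures, platform_arrival)  # first catchable train
        missed = departures.index(boarded_departure) - first
        if 0 <= missed <= max_ftb:  # ignore records outside the FtB limit
            counts[missed] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    return [counts[m] / total for m in range(max_ftb + 1)]


departures = [100, 280, 460, 640]
trips = [(50, 100), (60, 100), (120, 280), (90, 280)]
rates = stb_vector(trips, departures)
```

In this toy data set three of the four passengers board the first train after reaching the platform and one boards the second.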

Data Used in the Test.
In the test, 57 passenger trip records between 07:00 AM and 08:00 AM obtained from the AFC system are used to verify the proposed approach. Table 3 gives a sample record from these 57 passenger trips.
Moreover, as another important input of the proposed approach, the corresponding real timetables of the relevant URT lines (e.g., Line 1, Line 4, Line 7, and Line 9) were obtained from the automatic train supervision (ATS) system. Tables 4-6 show samples of these data.

Results and Discussions.
Using the above input data, the boarding plan estimation for these 57 passenger trips is performed with the proposed approach. Table 7 gives a sample of the estimation results. As can be seen in the table, each passenger trip (which equals an AFC transaction record) derived from the AFC system can be assigned to a unique boarding plan by the proposed approach.
As mentioned, for a schedule-based URT system, the result in Table 7 is the key for passenger flow analysis, from which other important indicators (e.g., route choices, passenger flows on sections, and load factors of trains, as shown in Table 8 and Figure 8) can be deduced.
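Once every trip is assigned to one boarding plan, section flows and train load factors follow by counting. The sketch below assumes each assigned plan is stored as a list of (train, sections) legs and that all trains share one capacity; the data layout, names, and numbers are hypothetical and are not those of Table 8 or Figure 8.

```python
from collections import Counter


def section_flows(assigned_plans):
    """Count passengers per (train_id, section) over all assigned plans.

    Each plan is a list of legs (train_id, sections), where sections lists
    the (from_station, to_station) pairs travelled on that train.
    """
    flows = Counter()
    for plan in assigned_plans:
        for train_id, sections in plan:
            for section in sections:
                flows[(train_id, section)] += 1
    return flows


def load_factors(flows, capacity):
    """Load factor = passengers on board / train capacity, per train section."""
    return {key: n / capacity for key, n in flows.items()}


assigned = [
    [("T101", [("A", "B"), ("B", "C")])],                 # direct trip
    [("T101", [("B", "C")]), ("T205", [("C", "D")])],     # one transfer at C
]
flows = section_flows(assigned)
factors = load_factors(flows, capacity=4)  # toy capacity of 4 passengers
```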
Along with the above case study, some extended discussion can be made. Previous studies use discrete choice analysis extensively to predict passenger choice [...] 2015). As demonstrated in the case study, we identify the key issue of estimating passenger boarding plans, from which route choices, section flows, load factors, and so forth can all be deduced, and we no longer depend on the assumption that smart card data that cannot be matched to possible train choices are assigned with equal probability (Kusakabe et al., 2010). Furthermore, the proposed approach improves on the methodologies of Sun and Schonfeld [19] and Zhou and Xu [20] for calculating passenger boarding plans. On the other hand, compared to the studies presented in [21,23], our approach models the problem of interest considering the temporal dynamics induced by demand profiles, service timetables, and crowdedness.

Conclusions
A URT system is operated based on its schedules. Unlike in urban road traffic systems, it is therefore more important to estimate passengers' train choices, from which passenger route choices as well as the flow distribution on the network can be deduced. Developments in the application of AFC systems have made the collection of detailed passenger trip data in a URT network possible, and these data can be used to obtain a more in-depth understanding of passenger travel behavior. In this paper, we formulate the problem of estimating passenger train choices and subsequently propose an integrated approach for the estimation that combines real timetable and AFC data. Advantages of the proposed approach include the following: (1) An a posteriori estimation framework is proposed, which uses revealed information combining real timetable and AFC data of URT systems rather than a priori knowledge.
(2) The approach links each AFC transaction (a passenger trip) to the most feasible train itinerary (a boarding plan). This is more appropriate for a factual travel choice process, which uses only one route at a time rather than multiple routes.
(3) The travel behavior parameters used in the approach are extracted from abundant timetable and AFC data rather than from manual surveys. Meanwhile, exact pieces of information that are difficult to measure, such as the distributions of passengers' walking speeds and times, do not need to be obtained.
Furthermore, the proposed approach can be used for other challenges in the field of URT operation and management, such as validation of rail transit assignment models, time-dependent train load estimation, and integrated simulation of passenger flows on the network.

Figure 1: Relationship among train choice, route choice, and flow assignment.

Figure 2: Formulating the estimation problem of URT passenger train choices.

Figure 4: Illustration of two-step universal set generation.

Figure 5: Comparisons of the calculated departure time of the possible boarding plan versus the passenger's arrival time on the platform. Note. $t_{\text{entry}}$ is the entry time of the given passenger at the origin station; $w^{\text{access}}_{\min}$ is the walking time of the passenger with the fastest speed from the entry gate to the platform at the origin station.

Figure 8: Example of factual train diagrams with passenger flows on sections and load factors of trains.
$P_n(i \mid US_n)$ is the probability that passenger $n$ will choose alternative $i$ from the universal set $US_n$ of all alternatives available to $n$, $P_n(i \mid CS_n)$ is the conditional probability that passenger $n$ will choose alternative $i$ given that $CS_n$ is his/her consideration set, where $CS_n$ is a subset of $US_n$, and $P(CS_n \mid US_n)$ is the probability that $CS_n$ is the consideration set of passenger $n$ given his/her universal set $US_n$. Thus, the corresponding solution for estimating an individual passenger's train choices with schedule and AFC data consists of two tasks: one is generating the consideration set ($CS_n$) of his/her train choices, and the other is calculating the probability $P_n(i \mid CS_n)$ that he/she will choose alternative $i$ from $CS_n$.
Step 3. Let $I(v)$ denote the set of incoming arcs to node $v$. Let $n_h$ denote the first node of the current path $p$ for which $|I(n_h)| > 1$. If the primed node $n'_h$ of node $n_h$ is not in $N$, proceed to Step 4; otherwise, let $n_i$ denote the first node of the current path $p$ following $n_h$ whose primed node is not in $N$ and proceed to Step 5.
Step 4. Let $d_v$ denote the value of the shortest distance from $s$ to $v$. Compute $d_{n'_h}$ and find the shortest path from $s$ to $n'_h$. Let $n_i = n_{h+1}$.

Table 2: The parameter of StB of down direction during 8:00 AM∼9:00 AM at the station of Yanchang Rd.

Table 3: Samples of passenger trip records.

Table 4: The real timetable of trains on Route 1.

Table 5: The real timetable of trains on Route 2.

Table 6: The real timetable of trains on Route 3.

Table 7: Samples of estimated boarding plans for passenger trips.

Table 8: Route choices deduced from estimated boarding plans.