Estimation Method of Path-Selecting Proportion for Urban Rail Transit Based on AFC Data

With the successful application of automatic fare collection (AFC) systems in urban rail transit (URT), passengers' travel time information is recorded, which makes it possible to analyze passengers' path selection from AFC data. In this paper, the distribution characteristics of the components of travel time are analyzed, and an estimation method of path-selecting proportion is proposed. The method uses travel time data of single-path ODs from the AFC system to estimate the distribution parameters of the components of travel time, mainly entry walking time (ewt), exit walking time (exwt), and transfer walking time (twt). Then, for multipath ODs, the distribution of each path's travel time can be calculated once the distributions of its components are known. After that, each path's path-selecting proportion can be estimated. Finally, simulation experiments were designed to verify the estimation method, and the results show that the error rate is less than 2%. Compared with traditional flow assignment models, the estimation method significantly reduces the cost of manual surveys and provides a new way to calculate the path-selecting proportion for URT.


Introduction
As the basis of network flow assignment calculation, the path-selecting proportion is directly related to the operation and management of urban rail transit (URT), including the calculation of operation indicators, train planning, and fare clearing. There are already many research results on flow assignment for URT under network operation, most of which are multipath models based on path utility.
Nguyen et al. [1] developed a graph-theoretic framework for the passenger assignment problem that encompasses the departure time and route choice dimensions simultaneously. A passenger equilibrium flow model was also defined and a mathematical formulation was suggested. The research results can be used to solve the passenger flow distribution problem of a URT network.
Poon et al. [2] also studied the dynamic traffic assignment model for congested networks and used time-increment simulation to calculate the passenger flow on the network.
The model assumed that vehicles run in full accordance with the schedule and that passengers have full predictive information about present and future network conditions, so the model can be used to simulate the performance of an existing transit system operating with preannounced schedules or to evaluate the effects of changes in schedules, lines, or passenger demand on system performance.
Xu et al. [3] established a multipath assignment model of the URT network, using travel time as the paths' impedance. The basic characteristics of the rail network and of passenger travel behavior are fully considered in the model. With its high accuracy and strong practicality, the model has been successfully applied to the Beijing Subway network. The survey results show that its error rate of section flow is less than 5%.
Si et al. [4] constructed the generalized cost function of the URT network, considering the major factors influencing passenger flow assignment (including travel time and the number of transfers), and then proposed a mathematical optimization model of passenger flow assignment based on the stochastic user equilibrium principle. The results of a numerical example using Beijing URT network data showed that the model was feasible and effective.
The basic idea of the models above can be summarized as follows: (1) determine the impedance of each path, which can be time cost or mileage; (2) design the utility function based on the path impedance; (3) calculate the flow assignment proportion of each path; (4) distribute each origin-destination (OD) pair's total daily passenger flow over the paths between the origin station and the destination station.
Such models can basically guarantee the accuracy of flow assignment. However, the parameters of these models, including the entry and exit walking time at stations and the transfer time, are calibrated by manual survey, which is time-consuming and costly. In addition, when the network structure or the operation organization changes, the parameters need to be recalibrated to remain accurate. Therefore, it is necessary to study a new method of network flow assignment.
In recent years, AFC systems have been widely used in URT and can accurately record passengers' entry time at the origin station and exit time at the destination station. After many years of application, AFC systems have accumulated vast amounts of passenger travel information. However, such information has rarely been used to study travel and path-selecting behavior.
Chapleau et al. [5] and Rahbee [6] used AFC data to predict OD matrices and network flow distribution, but only the entry records were used, which could not support the analysis of passengers' path-selecting behavior.
Lee and Hickman [7] analyzed a large amount of AFC data and found that activity and travel patterns, such as travel period, travel time, and travel region (urban areas to suburbs, urban areas to urban areas, etc.), differ significantly across farecard types. The research results can be used to grasp the characteristics of different types of cardholders.
Kusakabe et al. [8] developed a methodology and an algorithm for estimating which train is boarded by each smart card holder based on long-term transaction data. The proposed method made a number of assumptions and distributed the uncertain smart card data evenly over the possible trains, which needs further research.
Zhou and Xu [9] considered the passenger's travel process together with the train operation and set up a model for calculating each passenger's path from AFC data. The proposed method could estimate each passenger's travel route and train choice. However, the model required a lot of manual investigation work and still needed to be verified by mathematical statistics.
Sun and Xu [10] analyzed travel time reliability and estimated passenger route choice behavior from AFC data. The proposed model could be used to calibrate the parameters of traditional passenger flow distribution models. However, the model needed the walking time of all stations to be obtained by manual survey and did not further exploit the AFC data, such as ticket type and travel period.
Based on the above background, this paper analyzes passengers' travel time data recorded by the AFC system with the theory of mathematical statistics and proposes a new estimation method of path-selecting proportion. This method provides a new idea of network flow assignment for URT.

Analysis of Travel Time
Travel time of passengers in URT network mainly consists of the following six parts: (1) entry walking time (the time of passengers walking from the AFC gate to the platform in the origin station, denoted by ewt); (2) entry platform waiting time (the time of passengers waiting for the train on the platform in the origin station, denoted by epwt); (3) on-train time (the time of passengers travelling on the train, denoted by ott); (4) transfer walking time (the time of passengers walking from the arrival platform to the departure platform in the transfer station, denoted by twt); (5) transfer platform waiting time (the time of passengers waiting for the train on the platform in transfer station, denoted by tpwt); (6) exit walking time (the time of passengers walking from the platform to the AFC gate in the destination station, denoted by exwt).
twt and tpwt occur only at transfer stations. The components of travel time are shown in Figure 1.
Lots of random factors can affect passengers' travel time, so this paper makes the following assumptions: (1) passengers arrive at the station randomly, dispersedly, and stably; (2) all passengers can board the first train after arriving at the platform; (3) the trains run according to the plan strictly with a certain speed level and no abnormal condition occurs.

Walking Time (ewt, twt, and exwt).
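For one station, suppose the walking distance is the same for all passengers, so that the distribution of walking time depends only on the walking speed. Most research results [11, 12] show that walking speed can be considered to follow a normal distribution. Thus, walking time is considered to follow a normal distribution as well; that is,

$$\mathrm{ewt} \sim N(\mu_{\mathrm{ewt}}, \sigma^2_{\mathrm{ewt}}), \quad \mathrm{twt} \sim N(\mu_{\mathrm{twt}}, \sigma^2_{\mathrm{twt}}), \quad \mathrm{exwt} \sim N(\mu_{\mathrm{exwt}}, \sigma^2_{\mathrm{exwt}}), \tag{1}$$

where $\mu_{\mathrm{ewt}}$, $\mu_{\mathrm{twt}}$, and $\mu_{\mathrm{exwt}}$ are the means of ewt, twt, and exwt and $\sigma^2_{\mathrm{ewt}}$, $\sigma^2_{\mathrm{twt}}$, and $\sigma^2_{\mathrm{exwt}}$ are the variances of ewt, twt, and exwt.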

Entry Platform Waiting Time (epwt).
epwt is a random variable between 0 and the interval of the trains. Since passengers are assumed to arrive at the origin station randomly and dispersedly, they arrive at the platform randomly and dispersedly too. Thus, epwt can be considered to follow a uniform distribution; that is,

$$\mathrm{epwt} \sim U(0, I), \tag{2}$$

where $I$ is the interval of the trains. The distribution can be verified by the following simulation experiment. The experimental environment is as follows: the interval of the trains at the station is 180 s between 8:00:00 and 10:00:00, and $\mathrm{ewt} \sim N(145\ \mathrm{s}, 20^2)$.
The experimental procedure is as follows.
(1) Randomly and evenly generate 20,000 passengers between 8:00:00 and 9:50:00; for passenger $i$, let $t_i$ be the time of arrival at the station and let $\mathrm{ewt}_i$ be the entry walking time, so that the time of arrival at the platform is $t_i + \mathrm{ewt}_i$.
(2) Search for the first train arriving at the station after $t_i + \mathrm{ewt}_i$. Let $d_i$ be the departure time of that train from the station; then $\mathrm{epwt}_i = d_i - (t_i + \mathrm{ewt}_i)$.
MATLAB is used to run the simulation experiment. Under the hypothesis $\mathrm{epwt} \sim U[0, 180]$, SPSS is used to perform a Kolmogorov-Smirnov test (KS test for short) on the simulated waiting time data, with the significance level $\alpha$ taken to be 0.05, as is usual. The frequency statistics are shown in Figure 2. The resulting significance is 0.626 ($> \alpha$), so the hypothesis $\mathrm{epwt} \sim U[0, 180]$ is not rejected and is considered tenable.
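The experiment above can be re-created in outline with the Python standard library. The following is only a sketch under stated assumptions: the function names, the random seed, and the train departures at exact multiples of the 180 s interval are illustrative choices, and the KS statistic is compared against the usual large-sample 5% critical value $1.36/\sqrt{n}$ rather than computed by SPSS.

```python
import math
import random

def simulate_epwt(n=20000, headway=180.0, mu_ewt=145.0, sd_ewt=20.0,
                  window=6600.0, seed=1):
    """Simulate entry platform waiting times in seconds.

    window = 6600 s corresponds to gate entries between 8:00:00 and 9:50:00.
    Trains are assumed (for illustration) to depart at 0, headway, 2*headway, ...
    """
    rng = random.Random(seed)
    waits = []
    for _ in range(n):
        t_gate = rng.uniform(0.0, window)                # arrival at the station
        t_platform = t_gate + rng.gauss(mu_ewt, sd_ewt)  # arrival at the platform
        t_depart = math.ceil(t_platform / headway) * headway  # first train after that
        waits.append(t_depart - t_platform)
    return waits

def ks_statistic_uniform(sample, upper):
    """One-sample Kolmogorov-Smirnov statistic D against U(0, upper)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = min(max(x / upper, 0.0), 1.0)
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

waits = simulate_epwt()
d_stat = ks_statistic_uniform(waits, 180.0)
d_crit = 1.36 / math.sqrt(len(waits))  # asymptotic critical value at the 5% level
print(f"mean wait = {sum(waits) / len(waits):.1f} s, "
      f"D = {d_stat:.4f}, D_crit(5%) = {d_crit:.4f}")
```

If `d_stat` is below `d_crit`, the uniform hypothesis is not rejected at the 5% level; the outcome depends on the seed and on how the arrival window lines up with the train interval.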

Transfer Platform Waiting Time (tpwt).
Like epwt, tpwt is a random variable between 0 and the interval of the trains. However, their distributions are different, because passengers arrive randomly and dispersedly at the origin station but in concentrated groups at the transfer station.
Generally speaking, when the interval of the transfer line is long, passengers who walk faster may wait longer; when the interval of the transfer line is short, passengers who walk faster may wait less. It can be seen that there is a strong correlation between tpwt and twt. Thus, this paper analyzes the distribution of (twt + tpwt), which is denoted by twt&tpwt hereinafter.
The factors affecting the distribution of twt&tpwt are as follows: the interval of the before-transfer line ($I_1$); the interval of the after-transfer line ($I_2$); the coordination time between the before-transfer line (Line 1) and the after-transfer line (Line 2), denoted by $\Delta$, as shown in Figure 3; and the distribution of twt.
Obviously, $\Delta$ is related to $I_1$ and $I_2$, and its value is periodic. To facilitate the presentation, let $\Delta_1$ be the first coordination time; it is calculated by formula (3), where $\Delta_{\min}$ is the minimum coordination time and $\Delta_{\max}$ is the maximum one. Let $[I_1, I_2]$ be the least common multiple of $I_1$ and $I_2$. The $k$th coordination time $\Delta_k$ is then calculated by formula (4). Suppose that the passengers arrive at the transfer station with coordination time $\Delta_k$. If the transfer walking time of a passenger is smaller than $\Delta_k$, his/her twt&tpwt is $\Delta_k$; otherwise, the passenger waits for one or more further intervals of Line 2 and his/her twt&tpwt is $\Delta_k + n \cdot I_2$. Since $\mathrm{twt} \sim N(\mu_{\mathrm{twt}}, \sigma^2_{\mathrm{twt}})$, the probability distribution of twt&tpwt can be obtained by formula (5). In formula (5), $n$ can be infinite theoretically, but, in fact, the probability is almost 0 when $n \geq 5$.
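To make the rule in the preceding paragraph concrete, the sketch below computes the distribution of twt&tpwt for one fixed coordination time. It is not a reproduction of formula (5): it only encodes the stated rule that a passenger whose twt falls in $(\Delta + (n-1) I_2,\ \Delta + n I_2]$ ends up with twt&tpwt $= \Delta + n I_2$. The function names and the coordination time of 60 s are hypothetical; the twt parameters (205 s, 20 s) are taken from Table 5.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def twt_tpwt_pmf(delta, headway2, mu_twt, sd_twt, n_max=5):
    """Return {value: probability} of twt&tpwt for one coordination time delta.

    value = delta + n * headway2 for n = 0..n_max, where n counts the extra
    intervals of the after-transfer line the passenger has to wait.
    """
    pmf = {}
    prev = 0.0
    for n in range(n_max + 1):
        value = delta + n * headway2
        cum = normal_cdf(value, mu_twt, sd_twt)  # P(twt <= delta + n * I2)
        pmf[value] = cum - prev
        prev = cum
    return pmf

# Hypothetical coordination time of 60 s; the interval of Line 2 is 180 s.
pmf = twt_tpwt_pmf(delta=60.0, headway2=180.0, mu_twt=205.0, sd_twt=20.0)
mean_tt = sum(v * p for v, p in pmf.items())
var_tt = sum((v - mean_tt) ** 2 * p for v, p in pmf.items())
for value, prob in pmf.items():
    print(f"twt&tpwt = {value:6.1f} s  probability = {prob:.4f}")
print(f"mean = {mean_tt:.1f} s, variance = {var_tt:.1f} s^2")
```

With these inputs almost all of the probability mass sits on the first two reachable trains, which is consistent with the remark that the probability is almost 0 for large $n$.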

On-Train Time (ott).
Based on the assumption that all trains run strictly according to the timetable at a certain speed level, the total running time between two given stations on the same line (Station $i$ and Station $j$) is the same for every train; this constant is denoted by TT. Obviously, the on-train time of passengers travelling between Station $i$ and Station $j$ is equal to TT and is a constant as well.

2.5. Analysis of Independence.
Whether the components of travel time are independent is important for analyzing the distribution characteristics of a path's travel time. According to the analysis in the previous section, ott is a constant, so it is independent of the other components. ott divides travel time into three independent parts: ewt and epwt, twt&tpwt, and exwt. Thus, only the independence of ewt and epwt needs to be analyzed. In fact, a passenger's waiting time is not related to his/her walking speed. As passengers arrive at the station randomly and dispersedly, they arrive at the platform randomly and dispersedly too, so their waiting time is a random variable that does not depend on how fast they walked. This is why a passenger who walks fast may still wait longer at the origin station. Therefore, ewt and epwt can be considered independent.
Based on the above analysis, all the components of travel time are independent.

Model and Algorithm
3.1. AFC Data.
In China, smart cards and AFC systems are used in most cities and record part of the passengers' travel information on the URT network. The basic structure of AFC data is shown in Table 1.
Thus, based on the AFC data structure, any passenger's travel time can be calculated.
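As a minimal illustration of this step, the sketch below computes travel times as exit time minus entry time and groups them by OD. The record layout (card ID, entry station, entry time, exit station, exit time), the timestamp format, and the sample rows are hypothetical stand-ins, since the actual field list of Table 1 is not reproduced here.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

# Hypothetical AFC records: (card_id, entry_station, entry_time, exit_station, exit_time)
records = [
    ("C001", "Sta1", "2015-03-02 08:01:10", "Sta4", "2015-03-02 08:24:40"),
    ("C002", "Sta1", "2015-03-02 08:03:05", "Sta4", "2015-03-02 08:27:15"),
    ("C003", "Sta2", "2015-03-02 08:10:00", "Sta5", "2015-03-02 08:41:30"),
]

def travel_times_by_od(rows):
    """Group travel times (seconds) by (origin, destination)."""
    grouped = defaultdict(list)
    for _card, origin, t_in, destination, t_out in rows:
        entry = datetime.strptime(t_in, TIME_FORMAT)
        exit_ = datetime.strptime(t_out, TIME_FORMAT)
        grouped[(origin, destination)].append((exit_ - entry).total_seconds())
    return dict(grouped)

od_times = travel_times_by_od(records)
for od, times in od_times.items():
    print(od, times)
```

The per-OD lists produced this way are the travel time samples that the moment estimation in the following sections operates on.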

Distribution Parameter Estimation for Components of Travel Time.
According to the analysis in Section 2, the distribution characteristics and parameters of the components of travel time are shown in Table 2.
From Table 2, it can be seen that the parameters to be estimated are the mean and variance of ewt of each station ($\mu_{\mathrm{ewt}}$, $\sigma^2_{\mathrm{ewt}}$); the mean and variance of twt in each transfer direction of each transfer station ($\mu_{\mathrm{twt}}$, $\sigma^2_{\mathrm{twt}}$); and the mean and variance of exwt of each station ($\mu_{\mathrm{exwt}}$, $\sigma^2_{\mathrm{exwt}}$).

Estimation Method of $\mu_{\mathrm{ewt}}$, $\sigma^2_{\mathrm{ewt}}$ and $\mu_{\mathrm{exwt}}$, $\sigma^2_{\mathrm{exwt}}$.
Take an OD with a single path and no transfer (the origin station and the destination station are on the same line) as the research object; its travel time comprises only ewt, epwt, ott, and exwt. Let $i$ be the origin station and let $j$ be the destination station. Then, large samples of passengers' travel time data ($T_{ij} = \{t^1_{ij}, t^2_{ij}, \ldots, t^m_{ij}, \ldots, t^n_{ij}\}$) can be obtained from the AFC system, where $t^m_{ij}$ is the actual travel time of passenger $m$ and $n$ is the sample size.
Based on $T_{ij}$, the mean and variance of the path's travel time ($E(T_{ij})$, $D(T_{ij})$) can be estimated by moment estimation. Moment estimation is a commonly used method of parameter estimation, proposed by K. Pearson in 1894 [13]. According to Khinchin's law of large numbers, the sample moments converge to the population moments when the sample size is large. The principle of moment estimation is as follows: (1) estimate the corresponding population moments by the sample moments and (2) estimate the parameters by making use of the relationship between the unknown parameters and the population moments.
Because the components of travel time are independent, the following equations are established:

$$E(T_{ij}) = \mu^{i}_{\mathrm{ewt}} + \frac{I_i}{2} + TT_{ij} + \mu^{j}_{\mathrm{exwt}}, \qquad D(T_{ij}) = (\sigma^{i}_{\mathrm{ewt}})^2 + \frac{I_i^2}{12} + (\sigma^{j}_{\mathrm{exwt}})^2, \tag{6}$$

where $\mu^{i}_{\mathrm{ewt}}$ is the mean of ewt of station $i$; $(\sigma^{i}_{\mathrm{ewt}})^2$ is the variance of ewt of station $i$; $I_i$ is the interval of station $i$; $TT_{ij}$ is the train running time between station $i$ and station $j$; $\mu^{j}_{\mathrm{exwt}}$ is the mean of exwt of station $j$; $(\sigma^{j}_{\mathrm{exwt}})^2$ is the variance of exwt of station $j$; $I_i$ and $TT_{ij}$ can be obtained from the timetable directly. The terms $I_i/2$ and $I_i^2/12$ are the mean and variance of $\mathrm{epwt} \sim U(0, I_i)$, and the constant ott contributes $TT_{ij}$ to the mean and nothing to the variance.
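Equations (6) hold for any two different stations on the same line and can be converted to the following:

$$\mu^{j}_{\mathrm{exwt}} = E(T_{ij}) - \mu^{i}_{\mathrm{ewt}} - \frac{I_i}{2} - TT_{ij}, \qquad (\sigma^{j}_{\mathrm{exwt}})^2 = D(T_{ij}) - (\sigma^{i}_{\mathrm{ewt}})^2 - \frac{I_i^2}{12}, \qquad \forall i, j \in S_l,\ i \neq j, \tag{7}$$

where $S_l$ is the station set of line $l$, $I_i$ is the train interval at station $i$, and $TT_{ij}$ is the train running time between stations $i$ and $j$. Therefore, if the distribution parameters (mean and variance) of ewt or exwt are known (by survey) for any one station on line $l$, the distribution parameters of all the stations on line $l$ can be calculated by (7). According to the theory of moment estimation, the larger the sample size, the more accurately the parameters are estimated. Thus, in order to improve the accuracy of parameter estimation, the order in which the stations on line $l$ are estimated (called the PES problem hereinafter) can be determined according to the passenger flow.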
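The conversion from (6) to (7) amounts to subtracting the known terms from the sample moments of the travel time. The sketch below illustrates this on synthetic data; the function names, the seed, the running time of 600 s, and the "true" exwt parameters (120 s, 15 s) are hypothetical, and the epwt term assumes the uniform distribution on $[0, I]$ derived earlier.

```python
import random

def sample_moments(samples):
    """First moment and second central moment (1/n form) of a sample."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var

def estimate_exwt(travel_times, mu_ewt, var_ewt, headway, run_time):
    """Solve the mean and variance relations for the destination station's exwt."""
    mean_t, var_t = sample_moments(travel_times)
    mu_exwt = mean_t - mu_ewt - headway / 2.0 - run_time
    var_exwt = var_t - var_ewt - headway ** 2 / 12.0
    return mu_exwt, var_exwt

# Synthetic "AFC" travel times for a single-path, no-transfer OD.
rng = random.Random(7)
HEADWAY, RUN_TIME = 180.0, 600.0  # seconds; illustrative timetable values
sample = [
    rng.gauss(145.0, 20.0)        # ewt at the origin station
    + rng.uniform(0.0, HEADWAY)   # epwt
    + RUN_TIME                    # ott
    + rng.gauss(120.0, 15.0)      # exwt at the destination station
    for _ in range(50000)
]
mu_hat, var_hat = estimate_exwt(sample, mu_ewt=145.0, var_ewt=400.0,
                                headway=HEADWAY, run_time=RUN_TIME)
print(f"estimated exwt mean = {mu_hat:.1f} s, variance = {var_hat:.1f} s^2")
```

Because the variance is obtained as a difference of noisy quantities, it is much less precise than the mean and can even come out negative for small samples, which is one practical reason to prefer ODs with large passenger flow.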
To describe and solve the PES problem, define $P_l$ as the set of distribution parameters of all stations on line $l$; then

$$P_l = \{P^{\mathrm{Sta}\,i}_{\mathrm{ewt}},\ P^{\mathrm{Sta}\,i}_{\mathrm{exwt}} \mid \mathrm{Sta}\,i \in S_l\}, \tag{8}$$

where $P^{\mathrm{Sta}\,i}_{\mathrm{ewt}} = \{\mu^{\mathrm{Sta}\,i}_{\mathrm{ewt}}, (\sigma^{\mathrm{Sta}\,i}_{\mathrm{ewt}})^2\}$ is the set of ewt's distribution parameters of station $i$ on line $l$ and $P^{\mathrm{Sta}\,i}_{\mathrm{exwt}} = \{\mu^{\mathrm{Sta}\,i}_{\mathrm{exwt}}, (\sigma^{\mathrm{Sta}\,i}_{\mathrm{exwt}})^2\}$ is the set of exwt's distribution parameters of station $i$ on line $l$. Taking the five stations of line $l$ of the network in Figure 4(a) as an example, the relationship between the sets of each station's distribution parameters can be represented as the undirected graph in Figure 4(b). In the figure, two sets are connected by an edge if there is only one path between the two stations; otherwise, the two sets are unconnected. $q_{ij}$ is the passenger flow between station $i$ and station $j$.
In order to estimate the parameters accurately, the sample size should be as large as possible. Therefore, the PES problem can be reduced to seeking the maximum spanning tree in Figure 4(b), and the model is described as follows:

$$\max \sum_{(u, v) \in T} w(u, v), \tag{9}$$

where $T$ is the spanning tree of the undirected graph and $w(u, v)$ is the weight of the edge between vertices $u$ and $v$.
A commonly used algorithm for computing the optimal spanning tree is the Kruskal algorithm [14]. Let $V$ be the set of all vertices, let $E$ be the set of all edges, let $m$ be the number of all edges, let $T = (V, E_T)$ be the spanning tree, let $E_T$ be the set of edges of the spanning tree, and let $k$ be the number of edges of the spanning tree. Then, the steps of the Kruskal algorithm for solving the maximum spanning tree are as follows.
Step 1. Set $E_T = \varnothing$, $k = 0$.
For example, suppose the passenger flows of ODs in Figure 4(b) are shown as in Figure 5(a) and the maximum spanning tree calculated by Kruskal algorithm is shown in Figure 5(b).
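For readers who want to see the procedure end to end, here is a compact sketch of Kruskal's algorithm adapted to the maximum spanning tree: edges are scanned in descending order of weight and an edge is kept only if it joins two different components. This is the textbook algorithm with a union-find structure, not necessarily the exact step list of the paper; the vertex labels and passenger flows are hypothetical and do not come from Figure 5.

```python
def maximum_spanning_tree(vertices, edges):
    """Kruskal's algorithm for a maximum spanning tree.

    vertices: iterable of hashable labels
    edges: list of (weight, u, v) tuples
    Returns the list of selected (weight, u, v) edges.
    """
    vertices = list(vertices)
    parent = {v: v for v in vertices}

    def find(v):
        # Find the component root, shortening the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for weight, u, v in sorted(edges, key=lambda e: e[0], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # the edge does not close a cycle
            parent[root_u] = root_v
            tree.append((weight, u, v))
            if len(tree) == len(vertices) - 1:
                break
    return tree

# Hypothetical parameter-set graph: an edge links the ewt set of an origin
# station with the exwt set of a destination station, weighted by OD flow.
nodes = ["ewt_Sta2", "ewt_Sta3", "exwt_Sta4", "exwt_Sta5"]
flows = [
    (900, "ewt_Sta2", "exwt_Sta4"),
    (700, "ewt_Sta2", "exwt_Sta5"),
    (400, "ewt_Sta3", "exwt_Sta4"),
    (300, "ewt_Sta3", "exwt_Sta5"),
]
tree = maximum_spanning_tree(nodes, flows)
print(tree)
```

In this toy graph the lowest-flow edge is left out, so every parameter set is reached through the ODs with the largest samples.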
According to the maximum spanning tree, the node $P^{\mathrm{Sta2}}_{\mathrm{ewt}}$ has the largest number of incident edges, so its parameters are chosen to be estimated by manual survey. Then, the parameters of the other nodes are obtained in succession by formula (7). Therefore, the estimation sequence of parameters can be made as shown in Figure 6.
Based on the above analysis, the estimation method of $\mu_{\mathrm{ewt}}$, $\sigma^2_{\mathrm{ewt}}$ and $\mu_{\mathrm{exwt}}$, $\sigma^2_{\mathrm{exwt}}$ of each station on one line is summarized as follows.
Step 1. Build the relationship graph of parameter sets, of which the value of each edge is the passenger flow between its two vertices collected by AFC system.
Step 2. Use Kruskal algorithm to calculate the maximum spanning tree of the relationship graph.
Step 3. Find the node with the maximum number of edges and use a manual survey to collect samples of its walking time (ewt or exwt); its parameters ($\mu_{\mathrm{ewt}}$, $\sigma^2_{\mathrm{ewt}}$) or ($\mu_{\mathrm{exwt}}$, $\sigma^2_{\mathrm{exwt}}$) can then be estimated by moment estimation.
Step 4. According to the maximum spanning tree, estimate the parameters ($\mu_{\mathrm{ewt}}$, $\sigma^2_{\mathrm{ewt}}$, $\mu_{\mathrm{exwt}}$, and $\sigma^2_{\mathrm{exwt}}$) of the other stations in succession by formula (7).
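Step 4 can be read as a breadth-first traversal of the spanning tree that starts at the surveyed node and applies the single-path mean and variance relations along each edge. The sketch below assumes those relations take the form mean = ewt mean + interval/2 + running time + exwt mean, and variance = ewt variance + interval²/12 + exwt variance; the data structures, station names, and numeric values are hypothetical.

```python
from collections import deque

def propagate_parameters(tree_edges, od_stats, surveyed):
    """Propagate walking-time parameters along the spanning tree.

    tree_edges: list of (origin, destination) pairs; each links the ewt set
                of the origin with the exwt set of the destination
    od_stats:   {(origin, destination): (mean_T, var_T, headway, run_time)}
    surveyed:   {("ewt" or "exwt", station): (mean, variance)} from the survey
    """
    adjacency = {}
    for i, j in tree_edges:
        adjacency.setdefault(("ewt", i), []).append(("exwt", j))
        adjacency.setdefault(("exwt", j), []).append(("ewt", i))

    params = dict(surveyed)
    queue = deque(surveyed)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour in params:
                continue
            # Identify the OD (origin, destination) behind this edge.
            origin = node[1] if node[0] == "ewt" else neighbour[1]
            destination = neighbour[1] if neighbour[0] == "exwt" else node[1]
            mean_t, var_t, headway, run_time = od_stats[(origin, destination)]
            mu_known, var_known = params[node]
            params[neighbour] = (
                mean_t - mu_known - headway / 2.0 - run_time,
                var_t - var_known - headway ** 2 / 12.0,
            )
            queue.append(neighbour)
    return params

# Hypothetical tree and exact OD moments (seconds, seconds^2).
edges = [("Sta2", "Sta4"), ("Sta2", "Sta5"), ("Sta3", "Sta4")]
stats = {
    ("Sta2", "Sta4"): (955.0, 3325.0, 180.0, 600.0),
    ("Sta2", "Sta5"): (1235.0, 3200.0, 180.0, 900.0),
    ("Sta3", "Sta4"): (640.0, 3181.0, 180.0, 300.0),
}
result = propagate_parameters(edges, stats, {("ewt", "Sta2"): (145.0, 400.0)})
for key in sorted(result):
    print(key, result[key])
```

Only one station needs a survey in this sketch; every other parameter set is reached through a chain of tree edges, so estimation errors accumulate along the chain and short chains through high-flow ODs are preferable.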

Estimation Method of $\mu_{\mathrm{twt}}$, $\sigma^2_{\mathrm{twt}}$.
In a URT system, a transfer path means the path from the platform of one line to the platform of another line at a transfer station. Therefore, for a two-line transfer station, there are four transfer paths in total. Based on the analysis of most URT networks in China, for any transfer path of a transfer station, an OD with a single path and one transfer that contains the transfer path can always be found; its travel time includes only ewt, epwt, twt&tpwt, ott, and exwt.
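Suppose that a certain OD $(i, j)$ with a single path and one transfer contains transfer path $r$. As the components of travel time are independent, formulas (10) and (11) are established as follows:

$$E(T_{ij}) = \mu^{i}_{\mathrm{ewt}} + \frac{I_i}{2} + TT_{i,\mathrm{tr}} + \mu^{r}_{\mathrm{twt\&tpwt}} + TT_{\mathrm{tr},j} + \mu^{j}_{\mathrm{exwt}}, \tag{10}$$

$$D(T_{ij}) = (\sigma^{i}_{\mathrm{ewt}})^2 + \frac{I_i^2}{12} + (\sigma^{r}_{\mathrm{twt\&tpwt}})^2 + (\sigma^{j}_{\mathrm{exwt}})^2, \tag{11}$$

where $i$, $j$ are the origin station and the destination station; $E(T_{ij})$, $D(T_{ij})$ are the mean and the variance of the OD's travel time; $\mu^{r}_{\mathrm{twt\&tpwt}}$, $(\sigma^{r}_{\mathrm{twt\&tpwt}})^2$ are the mean and the variance of twt&tpwt of transfer path $r$; $I_i$ is the train interval at station $i$; $TT_{i,\mathrm{tr}}$ is the train running time from station $i$ to the transfer station; $TT_{\mathrm{tr},j}$ is the train running time from the transfer station to station $j$. $E(T_{ij})$, $D(T_{ij})$ can be estimated by moment estimation from AFC data; $I_i$, $TT_{i,\mathrm{tr}}$, and $TT_{\mathrm{tr},j}$ can be obtained from the timetable; $\mu_{\mathrm{ewt}}$, $\sigma^2_{\mathrm{ewt}}$, $\mu_{\mathrm{exwt}}$, and $\sigma^2_{\mathrm{exwt}}$ can be estimated as described in Section 3.2.1. Therefore, only $\mu^{r}_{\mathrm{twt\&tpwt}}$, $(\sigma^{r}_{\mathrm{twt\&tpwt}})^2$ are unknown in formulas (10) and (11), and they can be calculated easily.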
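The "easy calculation" is a subtraction of the known terms, sketched below. The function and argument names are hypothetical, the relations are assumed to have the additive form implied by the independence of the components, and the numbers in the usage example are invented.

```python
def estimate_transfer_moments(mean_t, var_t, ewt, exwt, headway,
                              run_before, run_after):
    """Mean and variance of twt&tpwt for one transfer path.

    mean_t, var_t: sample mean and variance of the OD's travel time (AFC data)
    ewt, exwt:     (mean, variance) of walking time at origin and destination
    headway:       train interval at the origin station
    run_before, run_after: train running times before and after the transfer
    """
    mu_ewt, var_ewt = ewt
    mu_exwt, var_exwt = exwt
    mu_transfer = (mean_t - mu_ewt - headway / 2.0
                   - run_before - run_after - mu_exwt)
    var_transfer = var_t - var_ewt - headway ** 2 / 12.0 - var_exwt
    return mu_transfer, var_transfer

# Invented values in seconds and seconds^2.
mu_tr, var_tr = estimate_transfer_moments(
    mean_t=1495.0, var_t=4625.0,
    ewt=(145.0, 400.0), exwt=(120.0, 225.0),
    headway=180.0, run_before=400.0, run_after=500.0,
)
print(f"twt&tpwt mean = {mu_tr:.1f} s, variance = {var_tr:.1f} s^2")
```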

Estimation Method of Path-Selecting Proportion.
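Let $(i, j)$ be a multipath OD and let $T_{ij}$ be the random variable of its travel time.
Then, the probability density function of the OD's travel time, $f(t_{ij})$, can be described as follows:

$$f(t_{ij}) = \sum_{k=1}^{K} p_k f_k(t_{ij}), \tag{12}$$

$$\sum_{k=1}^{K} p_k = 1, \qquad i, j \in S, \tag{13}$$

where $p_k$ is the path-selecting proportion of the $k$th path between station $i$ and station $j$; $f_k(t_{ij})$ is the probability density function of the $k$th path's travel time; $K$ is the number of paths between station $i$ and station $j$; $S$ is the set of stations on the URT network.
Based on (12) and (13), the following can be deduced as well:

$$\mu_{ij} = \sum_{k=1}^{K} p_k \mu_k, \qquad \sigma_{ij}^2 = \sum_{k=1}^{K} p_k \left(\sigma_k^2 + \mu_k^2\right) - \mu_{ij}^2, \tag{14}$$

where $\mu_{ij}$ is the mean of the OD's travel time; $\sigma_{ij}^2$ is the variance of the OD's travel time; $\mu_k$ is the mean of the $k$th path's travel time; $\sigma_k^2$ is the variance of the $k$th path's travel time. $\mu_{ij}$ and $\sigma_{ij}^2$ can be estimated from the AFC data, while $\mu_k$ and $\sigma_k^2$ can be estimated by the following equations, in which each parameter can be obtained by the methods described in Section 3.2:

$$\mu_k = \mu^{i}_{\mathrm{ewt}} + \frac{I_i}{2} + TT_k + \sum_{\mathrm{tr} \in R_k} \mu^{\mathrm{tr}}_{\mathrm{twt\&tpwt}} + \mu^{j}_{\mathrm{exwt}}, \qquad \sigma_k^2 = (\sigma^{i}_{\mathrm{ewt}})^2 + \frac{I_i^2}{12} + \sum_{\mathrm{tr} \in R_k} (\sigma^{\mathrm{tr}}_{\mathrm{twt\&tpwt}})^2 + (\sigma^{j}_{\mathrm{exwt}})^2, \tag{15}$$

where $\mu^{\mathrm{tr}}_{\mathrm{twt\&tpwt}}$ is the mean of transfer station tr's twt&tpwt; $(\sigma^{\mathrm{tr}}_{\mathrm{twt\&tpwt}})^2$ is the variance of transfer station tr's twt&tpwt; $R_k$ is the set of transfer stations in the $k$th path; $TT_k$ is the total train running time along the $k$th path, obtained from the timetable.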
Combining (13) and (14), each path's path-selecting proportion can be calculated. It is worth noting that only moments of order $n$ ($n \leq 2$), namely the mean and the variance, are used in the derivation above, so together with (13) at most three equations are available. The estimation method therefore applies only to ODs with no more than three paths. When there are more than three paths, the idea in this paper is still applicable, but central moments of order $n$ ($n > 2$) should be introduced.
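A small sketch of this last step follows. It assumes the moment relations of a finite mixture: the proportions sum to 1, the OD mean is the proportion-weighted path mean, and the OD second moment is the proportion-weighted sum of each path's variance plus squared mean. With two paths only the first two relations are needed; with three paths all three are used. The function names and the numeric path moments are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve a small dense linear system by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("paths cannot be distinguished by their moments")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [x - factor * y for x, y in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def path_proportions(od_mean, od_var, path_means, path_vars):
    """Estimate path-selecting proportions for an OD with 2 or 3 paths."""
    k = len(path_means)
    if k not in (2, 3):
        raise ValueError("mean and variance support at most three paths")
    rows = [[1.0] * k, [float(m) for m in path_means]]
    rhs = [1.0, float(od_mean)]
    if k == 3:
        # Second raw moment of each path: variance + mean^2.
        rows.append([v + m * m for m, v in zip(path_means, path_vars)])
        rhs.append(od_var + od_mean ** 2)
    return solve_linear(rows, rhs)

# Two hypothetical paths with true proportions 0.7 : 0.3.
means2, vars2 = [1600.0, 1450.0], [3500.0, 3800.0]
od_mean2 = 0.7 * means2[0] + 0.3 * means2[1]
p2 = path_proportions(od_mean2, 0.0, means2, vars2)  # variance unused for K = 2

# Three hypothetical paths with true proportions 0.5 : 0.3 : 0.2.
means3, vars3 = [1600.0, 1450.0, 1700.0], [3500.0, 3800.0, 3000.0]
true3 = [0.5, 0.3, 0.2]
od_mean3 = sum(p * m for p, m in zip(true3, means3))
od_var3 = sum(p * (v + m * m) for p, m, v in zip(true3, means3, vars3)) - od_mean3 ** 2
p3 = path_proportions(od_mean3, od_var3, means3, vars3)
print(p2, p3)
```

The estimate is only as good as the separation between the path moments: when two paths have nearly equal mean travel times the system becomes ill-conditioned, and the sketch raises an error when the moments coincide exactly.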

Numerical Example
The OD "Sta1 → Sta5" in Figure 4(a) is used as an example to verify the model and algorithm.Obviously, there are two paths between the OD, as shown in Figure 7 and Table 3.
Based on the method proposed in this paper, the process for estimating the proportions of the two paths is as follows.
Step 1 (estimate the distribution parameters of each station's walking time, $\mu_{\mathrm{ewt}}$, $\sigma^2_{\mathrm{ewt}}$ and $\mu_{\mathrm{exwt}}$, $\sigma^2_{\mathrm{exwt}}$). Build the relationship graph of parameter sets and use the Kruskal algorithm to calculate the maximum spanning tree of the graph. Based on the maximum spanning tree, the parameter estimation sequence of stations is obtained.
Take line $l$ and its five stations as an example: as shown in Figure 6, $P^{\mathrm{Sta2}}_{\mathrm{ewt}}$ of Sta2 should be estimated by manual survey, and then the other parameters can be estimated by formula (7).
MATLAB is used to run the simulation experiment. The simulated entry walking times serve as the ewt data of the manual survey, and the simulated entry and exit times serve as the AFC data. The parameters of Sta2 are estimated by the moment estimation method, while those of Sta4 are calculated by formula (7). The results are shown in Table 4.
From Table 4, we can see that the error rate of parameter estimation is less than 0.2%, and the estimation method is verified.
Step 2 (estimate $\mu_{\mathrm{twt}}$, $\sigma^2_{\mathrm{twt}}$ of each transfer path of all transfer stations concerned). With the walking time parameters of all the stations on the URT network known, calculate $\mu_{\mathrm{twt\&tpwt}}$, $\sigma^2_{\mathrm{twt\&tpwt}}$ of each transfer path of the transfer stations from the travel time data of the AFC system by (10) and (11). Then, $\mu_{\mathrm{twt}}$, $\sigma^2_{\mathrm{twt}}$ can be estimated by (5). The parameter estimation method in this step can also be verified by simulation experiments similar to those in Step 1.
Step 3 (estimate each path's path-selecting proportion). Suppose the values of the parameters estimated in Step 1 and Step 2 are partly as shown in Table 5. Design the simulation experiment as follows: randomly generate 10,000 passengers from Sta1 to Sta5 with entry times between 8:00:00 and 9:50:00; the interval of each line is 180 s; the train running time is 21 min on Path 1 and 18 min on Path 2.
Run the simulation three times with different path-selecting proportions between Path 1 and Path 2 (0.7 : 0.3, 0.5 : 0.5, 0.3 : 0.7). Again, the simulated entry and exit times serve as the AFC data.
Then, each path's path-selecting proportion can be estimated by formulas (13) and (14). The results are shown in Table 6.
The results in Table 6 show that the error rate of the path-selecting proportion estimation is less than 2%, which verifies the model and algorithm in this paper.

Conclusion
In the network operation phase of URT, path-selecting proportion is the key to network flow assignment and fare clearing.This paper analyzed the distribution characteristics of the components of travel time and then proposed an estimation method of path-selecting proportion, making use of the travel time data from the AFC system.Also, simulation experiments were created to verify the estimation method, and the results show that the error rate is less than 2% and the method is reliable.
Compared with the traditional models based on path utility, the estimation method of path-selecting proportion has the following advantages: (1) by making full use of the AFC data, the sample size for parameter estimation is large and the results are accurate and (2) the estimation method relies on data analysis and processing, which significantly reduces the cost of manual surveys.
In fact, the estimation method in this paper is being used to analyze and validate the URT network flow assignment results in Shanghai.

Figure 1: Components of travel time.

Figure 4: Relationship graph of parameter sets. (b) Undirected graph of parameter sets of the line.

Figure 5: Results of maximum spanning tree.

Table 1: Structure of AFC data.

Table 2: Distribution characteristics and parameters of the components of travel time.

Table 4: Results of parameter estimation.

Table 5: Values of parameters.
Parameter; mean (s); variance (s^2)
twt of Sta3 (transfer path: from the platform of line 's up direction to the one of line 's down direction); 205; 20^2
twt of Sta1 (transfer path: from the platform of line 's up direction to the one of line 's up direction); 170; 20^2
twt of Sta3 (transfer path: from the platform of line 's up direction to the one of line 's down direction); 192; 20^2

Table 6: Estimation results of path-selecting proportion.