Splitting Travel Time Based on AFC Data: Estimating Walking, Waiting, Transfer, and In-Vehicle Travel Times in Metro System

The walking, waiting, transfer, and delayed in-vehicle travel times are the main contributors to a route's travel time reliability in the metro system. The automatic fare collection (AFC) system provides huge amounts of smart card records which can be used to estimate the distributions of all these times. A new estimation model based on a Bayesian inference formulation is proposed in this paper by integrating the probability measurement of the OD pair with only one effective route, in which all kinds of times follow truncated normal distributions. Then, a Markov Chain Monte Carlo method is designed to estimate all parameters endogenously. Finally, based on AFC data from Guangzhou Metro, the estimations show that all parameters can be estimated endogenously and identifiably. Meanwhile, the truncated property of the travel time is significant and the threshold tested by the surveyed data is reliable. Furthermore, the superiority of the proposed model over the existing model in estimation and forecasting accuracy is also demonstrated.


Introduction
Travel time reliability in terms of variation in travel time has attracted more and more attention recently, not only in the road traffic, but also in the metro system. In the metro system, travel time consists of entry walking time, entry waiting time, in-vehicle travel time, transfer travel time, exit walking time, and so forth. As is well known, walking and waiting time are of high variability, while in-vehicle travel time has long been considered punctual (Kusakabe et al. [1]; Sun and Xu [2]; Zhou and Xu [3]), even though it may be delayed by excessive demand, especially during peak hours. Splitting travel time is referred to as estimating the entry walking, entry waiting, and exit walking time in every station, in-vehicle travel time in every section between two successive stations, and transfer travel time in every transfer station. All these kinds of times provide us with the level of services of nontransfer stations, sections, and transfer stations. They can help us calculate the congestion levels in the walking channels and waiting platforms, evaluate transfer efficiency between two lines, find the train delay in the sections, optimize the train schedule even under the cooperated operation condition, estimate passengers' route choice behavior, and so forth. However, measuring these times is extremely challenging at a network level even when carrying out a field survey in the metro system.
Fortunately, the automatic fare collection (AFC) system is now widely used to collect smart card data in the metro system. This data is called AFC data and records every passenger's travel details, including the entry station, the exit station, and the corresponding times when the card is swiped. Based on AFC data, much research has been done, such as trip generation prediction (Guang et al. [4]; Cai et al. [5]), Origin-Destination (OD) distribution prediction (Cai et al. [6]; Cai et al. [7]; Chapleau et al. [8]; Rahbee [9]), route choice proportion estimation (Sun and Schonfeld [10]; Zhu et al. [11]), and travelers' characteristics extraction (Lee and Hickman [12]). For more applications, see Pelletier et al.'s review [13]. Research has also been done to estimate various times using AFC data. Sun and Xu [14] assumed that walking and waiting times were random, but that in-vehicle travel time was punctual since it was determined by fixed train schedules. Zhou et al. [15] assumed that waiting time followed a uniform distribution while the other types of times, including in-vehicle travel time, followed normal distributions, and they were estimated by the moment estimation method based on the maximum spanning tree. But in this approach, a field survey was still needed to calculate the first station's entry time. In Sun et al.'s research [16], all kinds of times were supposed to follow independent normal distributions and the parameters were estimated by the Maximum Likelihood Estimation (MLE) approach without any additional surveys. Sun et al. [17] proposed an integrated Bayesian statistical inference framework to characterize the passenger flow assignment model in a complex metro network. In this research, all kinds of times were still assumed to follow normal distributions and a Bayesian approach was designed to estimate all parameters, along with the chosen proportions of feasible routes.
However, these distribution assumptions are hardly consistent with the facts because the truncated property of the real data is neglected. In practice, all times have their ranges; for example, the walking time, which depends on the free-flow travel time and the degree of congestion, cannot be very short (e.g., close to 0) or very long (e.g., tending to infinity), which means it is truncated. Meanwhile, because the AFC data only records the travel time, that is, the period between the swiping-in time at the origin station and the swiping-out time at the destination station, it cannot indicate which route the passenger has really traveled between OD pairs with multiple feasible routes (e.g., without loops). Therefore, it is meaningful to select the OD pairs with only one effective route in order to split the travel time of the route exactly into links and then estimate all kinds of times. Generally, a threshold is set to determine whether the second shortest route is considered or not, and then the OD pairs with only one effective route can be selected (Sun et al. [16]). However, the existing research usually neglects the random property of the OD pair with only one effective route, which arises from the travel time reliability of the routes, and calibrates the threshold exogenously. Therefore, in order to estimate the walking, waiting, in-vehicle, and transfer times more precisely, this paper proposes an approach whose contributions are as follows: all kinds of times follow truncated normal distributions, the random property of the OD pair with only one effective route is considered, and the threshold is calibrated endogenously.
The remaining sections are organized as follows: Section 2 presents the methodology, in which the property of the truncated normal distribution is introduced, the method for selecting the OD pairs with only one effective route is proposed, and a Bayesian inference formulation is established together with the estimation approach based on the Markov Chain Monte Carlo (MCMC) method; in Section 3, based on AFC data, train schedules, the network topology, and other data from the Guangzhou Metro, the estimations are discussed and the estimation errors of the existing and the proposed models are compared; Section 4 gives the conclusions.

Methodology
In this paper, all times are measured in minutes, since the time of a link is usually several minutes while the travel time of a route is usually tens of minutes.

Truncated Normal Distribution.
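Suppose that a random variable $x$ follows a normal distribution, $x \sim N(\mu, \sigma^2)$. If $x$ is restricted to the interval $(a, b)$, then $x$ conditional on $a < x < b$ has a truncated normal distribution, $x \sim TN(\mu, \sigma^2, a, b)$, and its probability density function (PDF), whose curve is shown in Figure 1, is defined as

$$f(x; \mu, \sigma, a, b) = \frac{\frac{1}{\sigma}\,\phi\left(\frac{x-\mu}{\sigma}\right)}{\Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right)}, \quad a < x < b,$$

and $f(x; \mu, \sigma, a, b) = 0$ otherwise, where $\mu$ and $\sigma$ are the mean and standard deviation of the parent normal distribution, $f(\cdot)$ is the PDF of the truncated normal distribution, $\phi(\cdot)$ is the PDF of the standard normal distribution, and $\Phi(\cdot)$ is the cumulative distribution function (CDF) of the standard normal distribution.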
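Generally, in the metro system, travel time contains entry walking time, entry waiting time, in-vehicle travel time, transfer travel time, exit walking time, and so forth. Suppose that every type of time is represented by a link in the metro network shown in Figure 2, so that a route from O to D may consist of multiple links, each representing one type of time. The transfer travel link represents the sum of the transfer walking and waiting times, which are hard to distinguish as they always occur together at a transfer station. In practice, all these times are truncated within intervals; that is, the time $t_l$ of link $l$ follows a truncated normal distribution $t_l \sim TN(\mu_l, \sigma_l^2, a_l, b_l)$. The travel time $t_r$ of route $r$ is the sum of the times of its links; that is, $t_r = \sum_{l \in H_r} t_l$, where $H_r$ denotes the set of links constituting route $r$. Supposing that the times of the links are independent of each other, the travel time of the route is still taken to follow a truncated normal distribution, $t_r \sim TN(\mu_r, \sigma_r^2, a_r, b_r)$, whose probability density function is

$$f(t_r; \mu_r, \sigma_r, a_r, b_r) = \frac{\frac{1}{\sigma_r}\,\phi\left(\frac{t_r-\mu_r}{\sigma_r}\right)}{\Phi\left(\frac{b_r-\mu_r}{\sigma_r}\right) - \Phi\left(\frac{a_r-\mu_r}{\sigma_r}\right)}, \quad a_r < t_r < b_r.$$

The mean $\bar{\mu}_r$ and the variance $\bar{\sigma}_r^2$ of the truncated normal distribution of route $r$ can be regarded as perturbations of the mean $\mu_r$ and the variance $\sigma_r^2$ of the parent normal distribution, respectively, and can be derived from the moment equations

$$\bar{\mu}_r = \mu_r + \sigma_r\,\frac{\phi(\alpha_r) - \phi(\beta_r)}{\Phi(\beta_r) - \Phi(\alpha_r)},$$

$$\bar{\sigma}_r^2 = \sigma_r^2\left[1 + \frac{\alpha_r\phi(\alpha_r) - \beta_r\phi(\beta_r)}{\Phi(\beta_r) - \Phi(\alpha_r)} - \left(\frac{\phi(\alpha_r) - \phi(\beta_r)}{\Phi(\beta_r) - \Phi(\alpha_r)}\right)^2\right],$$

where $\alpha_r = (a_r - \mu_r)/\sigma_r$ and $\beta_r = (b_r - \mu_r)/\sigma_r$.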
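The truncated normal density can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names and the walking-link parameters (parent mean 2.0 min, standard deviation 1.0 min, truncation interval 0.5 to 6.0 min) are hypothetical and chosen only for illustration. It evaluates the PDF with the standard library and confirms that it integrates to approximately one over the truncation interval.

```python
import math


def norm_pdf(z):
    """PDF of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncnorm_pdf(x, mu, sigma, a, b):
    """Density of N(mu, sigma^2) conditioned on a < x < b."""
    if not (a < x < b):
        return 0.0
    mass = norm_cdf((b - mu) / sigma) - norm_cdf((a - mu) / sigma)
    return norm_pdf((x - mu) / sigma) / (sigma * mass)


def integrate(f, lo, hi, n=20000):
    """Midpoint-rule integration of f over (lo, hi)."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


# Hypothetical walking link (values are assumptions, not from the paper).
MU, SIGMA, A, B = 2.0, 1.0, 0.5, 6.0
area = integrate(lambda x: truncnorm_pdf(x, MU, SIGMA, A, B), A, B)
print(round(area, 4))
```

Outside the interval the density is zero, so a link time can never fall below the lower boundary or above the upper one, which is the property the normal assumption lacks.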

The OD Pairs with Only One Effective Route.
Though the AFC data records the travel times between OD pairs, not all travel time data can be used to estimate the distribution parameters because there may be multiple effective routes between some OD pairs. This paper focuses only on the OD pairs with only one effective route. Considering the randomness of the travel time of the route, except for the OD pairs between which there is only one feasible route (e.g., no loop) due to the topological structure, whether an OD pair has only one effective route is random. Suppose that there are two or more feasible routes between a specific OD pair. Before calculating the probability that this OD pair has only one effective route, the routes with the first and second shortest travel times should be generated. Usually, the routes with the shortest and the second shortest travel times can be selected easily by K-shortest routes algorithms if the travel times are constant. In the reliability context, researchers have paid more attention to the route with the reliable shortest travel time (Khani and Boyles [19]). Taking account of the fact that the metro network is comparatively smaller than the road network, this paper first generates a number of feasible routes (e.g., at most ten routes) for the OD pair based on the routes' physical lengths. Then, it selects the routes with the first and second reliable shortest travel times from the generated route set $R$ according to the minimization formulation below.
Discrete Dynamics in Nature and Society

The route with the reliable shortest travel time is the solution to the minimization formulation

$$\min_{r \in R} \; E(t_r) + \lambda\,\sigma(t_r),$$

where $\lambda$ is a parameter indicating the value of route reliability (measured by the standard deviation $\sigma(t_r)$) relative to the expected travel time $E(t_r)$. The route with the second reliable shortest travel time is the solution to the same minimization formulation after removing the route with the reliable shortest travel time from the route set $R$. Let $t_1$ denote the reliable shortest travel time and $t_2$ the second reliable shortest travel time. They both follow truncated normal distributions, that is, $t_1 \sim TN(\mu_1, \sigma_1^2, a_1, b_1)$ and $t_2 \sim TN(\mu_2, \sigma_2^2, a_2, b_2)$, respectively. The probability $P(rs \mid \theta)$ that OD pair $rs$ has only one effective route, given the parameters $\theta$, can be measured by

$$P(rs \mid \theta) = P(t_2 - t_1 \ge Y_{rs}) + eps,$$

where $Y_{rs}$ is the threshold value for OD pair $rs$, which varies with the OD pairs due to the scale effects of OD pairs and is computed from the threshold parameter $Y$ by the threshold function (7); and eps is a very small positive value tending to 0 (e.g., 0.000001 in this paper). The purpose of adding eps is to avoid a probability of 0 for the OD pair having only one effective route, that is, to satisfy the condition $P(rs \mid \theta) > 0$ required by the estimation method below. The above equations mean that, for an OD pair with multiple feasible routes, the probability of the OD pair having only one effective route is equal to the probability that the difference between the second reliable shortest travel time and the reliable shortest travel time is larger than the threshold value. Obviously, if the threshold value is not smaller than the upper boundary of this difference, $Y_{rs} \ge b_2 - a_1$, then $P(t_2 - t_1 \ge Y_{rs}) = 0$, and the probability of the OD pair having only one effective route tends to 0; that is, $P(rs \mid \theta) = eps$. If the threshold value is not larger than the lower boundary, $Y_{rs} \le a_2 - b_1$, the probability $P(t_2 - t_1 \ge Y_{rs})$ is equal to 1 and $P(rs \mid \theta) = 1 + eps \approx 1$.
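The selection of the two reliable shortest routes and the probability of having only one effective route can be illustrated with a small Monte Carlo sketch. This is an assumption-laden illustration rather than the paper's procedure: the three routes and their parameters `(mu, sigma, a, b)` are hypothetical, the reliable cost is estimated from random draws instead of closed-form moments, and the threshold is passed in directly because the threshold function is not reproduced here.

```python
import random
import statistics


def sample_truncnorm(rng, mu, sigma, a, b):
    """Draw from N(mu, sigma^2) restricted to (a, b) by rejection sampling."""
    while True:
        x = rng.gauss(mu, sigma)
        if a < x < b:
            return x


def reliable_cost(route, lam, rng, n=5000):
    """Monte Carlo estimate of E(t_r) + lam * SD(t_r) for one route."""
    draws = [sample_truncnorm(rng, *route["params"]) for _ in range(n)]
    return statistics.fmean(draws) + lam * statistics.pstdev(draws)


def two_reliable_shortest(routes, lam=1.0, seed=0):
    """Return the routes with the smallest and second smallest reliable cost."""
    rng = random.Random(seed)
    ranked = sorted(routes, key=lambda r: reliable_cost(r, lam, rng))
    return ranked[0], ranked[1]


def prob_one_effective_route(first, second, threshold, n=20000, eps=1e-6, seed=0):
    """Estimate P(t2 - t1 >= threshold) + eps by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        t1 = sample_truncnorm(rng, *first["params"])
        t2 = sample_truncnorm(rng, *second["params"])
        if t2 - t1 >= threshold:
            hits += 1
    return hits / n + eps


# Hypothetical routes; params are (parent mean, parent sd, lower, upper) in minutes.
routes = [
    {"name": "B", "params": (27.0, 3.0, 21.0, 38.0)},
    {"name": "A", "params": (20.0, 2.0, 16.0, 28.0)},
    {"name": "C", "params": (35.0, 4.0, 26.0, 50.0)},
]
first, second = two_reliable_shortest(routes)
print(first["name"], second["name"])
print(prob_one_effective_route(first, second, threshold=5.0))
```

Because every draw lies strictly inside its truncation interval, a threshold at or above the upper boundary of the difference returns exactly eps, and a threshold at or below the lower boundary returns 1 + eps, matching the two limiting cases described in the text.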

Bayesian Inference.
Let $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$, $\mathbf{a}$, and $\mathbf{b}$ represent the sets of the parent means, variances, interval lower boundaries, and interval upper boundaries of all kinds of times, respectively. Together with the threshold parameter $Y$, they need to be estimated based on the AFC data. The complex structure of the likelihood function, as shown in (9), makes the Maximum Likelihood Estimation method hard to apply when trying to estimate the distribution parameters and the threshold parameter endogenously. Therefore, a Bayesian inference formulation is established. The Bayesian posterior distribution of the unknown parameters given the travel time observations can be derived from Bayes' theorem as follows:

$$P(\boldsymbol{\mu}, \boldsymbol{\sigma}, \mathbf{a}, \mathbf{b}, Y \mid \mathbf{T}) \propto \prod_{rs \in S} L(\boldsymbol{\mu}, \boldsymbol{\sigma}, \mathbf{a}, \mathbf{b}, Y \mid \mathbf{T}_{rs})\,\pi(\boldsymbol{\mu})\,\pi(\boldsymbol{\sigma})\,\pi(\mathbf{a})\,\pi(\mathbf{b})\,\pi(Y),$$

where $P(\boldsymbol{\mu}, \boldsymbol{\sigma}, \mathbf{a}, \mathbf{b}, Y \mid \mathbf{T})$ denotes the density function of all parameters given the observations; $S$ is the set of OD pairs; $\mathbf{T}_{rs}$ is the set of travel time observations of OD pair $rs$; $L(\cdot)$ is the likelihood function of all parameters given the travel time observations; and $\pi(\boldsymbol{\mu})$, $\pi(\boldsymbol{\sigma})$, $\pi(\mathbf{a})$, $\pi(\mathbf{b})$, and $\pi(Y)$ are the prior distributions. Finally, taking account of the threshold parameter $Y$, the Bayesian formulation integrates the probability of the OD pair having only one effective route into the likelihood. The prior distributions need to be determined in advance, although the number of travel time observations in the metro system is large enough to revise the prior knowledge to a great extent.

Estimation Method.
The Markov Chain Monte Carlo (MCMC) method allows us to simulate draws that are slightly dependent and are approximately from a posterior distribution. Those draws can then be used to calculate quantities of interest for the posterior distribution. In Bayesian statistics, the Gibbs sampler and the Metropolis-Hastings (M-H) algorithm (Metropolis et al. [20]; Hastings [21]) are widely used MCMC methods. Without prior knowledge of the relations among parameters and without the full conditional distribution of each parameter, the M-H algorithm is more appropriate for estimating the parameters in this paper. Considering that estimating so many parameters is a high-dimensional problem, the variable-at-a-time Metropolis sampling scheme (Metropolis et al. [20]) is used to avoid the large rejection rate that general Metropolis sampling may cause. Let $\theta$ be the set of all parameters; that is, $\theta = \boldsymbol{\mu} \cup \boldsymbol{\sigma} \cup \mathbf{a} \cup \mathbf{b} \cup Y$. In the variable-at-a-time scheme, a new sample is generated for each coordinate (parameter) of the parameter set in turn. The estimation approach is as follows.
Step 2. Update the travel time distribution of feasible routes based on the distribution parameters of links and select out the routes with the first and second reliable shortest travel time based on minimization formulation (5).
Here, we suppose that the jumping distribution is a normal distribution, which is symmetric; that is, $J(\theta^{*} \mid \theta^{(t-1)}) = J(\theta^{(t-1)} \mid \theta^{*})$. Generally, the jumping distribution can be denoted as $J(\theta_i^{*} \mid \theta_i^{(t-1)}) \sim N(\theta_i^{(t-1)}, \sigma_i^2)$, where $\sigma_i^2$ is the proposal variance for the $i$th parameter. This is known as Gaussian random walk Metropolis sampling.
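The variable-at-a-time Gaussian random walk Metropolis scheme can be sketched on a toy problem. The code below is only an illustration under stated assumptions: the two-parameter target (independent unit-variance normals centred at 1 and -2 with uniform priors on (-10, 10)) is hypothetical and stands in for the paper's much larger posterior, and the function names are invented for this example.

```python
import math
import random


def variable_at_a_time_metropolis(log_post, init, prop_sd, n_iter, seed=0):
    """Update one coordinate at a time with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    theta = list(init)
    current = log_post(theta)
    chain = []
    for _ in range(n_iter):
        for i in range(len(theta)):
            cand = list(theta)
            cand[i] = rng.gauss(theta[i], prop_sd[i])
            cand_lp = log_post(cand)
            # Symmetric proposal: accept with probability min(1, post(cand) / post(theta)).
            if cand_lp >= current or rng.random() < math.exp(cand_lp - current):
                theta, current = cand, cand_lp
        chain.append(list(theta))
    return chain


def toy_log_post(theta):
    """Hypothetical log-posterior: uniform priors on (-10, 10), normal likelihood."""
    if any(abs(v) >= 10.0 for v in theta):
        return -math.inf
    return -0.5 * ((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


chain = variable_at_a_time_metropolis(toy_log_post, [0.0, 0.0], [1.0, 1.0], n_iter=6000)
kept = chain[1000:]  # discard the burn-in draws
means = [sum(draw[i] for draw in kept) / len(kept) for i in range(2)]
print([round(m, 2) for m in means])
```

Proposing one coordinate at a time keeps each move small relative to the full parameter vector, which is why this scheme tends to reject less often than a joint proposal in high dimensions.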

Results and Discussion
(See Figure 3.) Here, we use one month of data for each purpose so as to preserve the distribution property, that is, the data in June for estimation and the data in July for testing. Several hypotheses are made: the entry and exit walking times relating to the same line at the same station are the same, while they differ between lines even at the same (transfer) station; the waiting times relating to different lines are different even at the same (transfer) station, but they are the same for the two directions of the same line; the transfer times in the same transfer station from one line to the different directions of another line are different; the in-vehicle travel times of the two directions between the same successive stations are the same; and all kinds of times follow truncated normal distributions. Figure 4 shows the characteristics of the AFC data collected from Tianhe Coach Terminal to Gangding on June 18, 2014. The number of observations is 6328, the average travel time is 11.61 min, and the standard deviation is 3.39 min. The figure clearly indicates that the travel time approximately follows a truncated normal distribution. Meanwhile, the lower boundary is closer to the average value than the upper boundary. The travel times far longer than the average may result from the common fact that some passengers miss trains or simply walk slowly. This phenomenon exists in most OD pairs according to the AFC data. Therefore, the truncated normal distribution assumption fits the facts better, and the estimations should satisfy the condition that the difference between the upper boundary and the mean is larger than the difference between the mean and the lower boundary.

Estimation Results.
The prior knowledge is noninformative; thus, all parameters are assumed to follow uniform distributions whose ranges are set according to operational experience and the train schedules. The in-vehicle travel time distribution could be easily estimated from the real train schedule. However, the real train schedule is hard to collect even for the operating company. Therefore, the in-vehicle travel time distributions are also estimated in this paper, and the prior knowledge is obtained from the planned train schedule. The ratio between the standard deviation and the mean of the in-vehicle travel time is taken as a constant (Sun et al. [17]); that is, $\sigma_{vt} = c \cdot \mu_{vt}$, where $c$ is the constant ratio. In order to distinguish the parameters conveniently, we use ee, ew, tt, and vt to denote entry/exit walking time, entry waiting time, transfer travel time, and in-vehicle travel time, respectively. The total number of parameters is 156 × 4 + 156 × 4 + 147 × 3 + 1 + 168 × 4 + 1 = 2363 (156 entry/exit walking links, 156 entry waiting links, 147 in-vehicle travel links, 1 constant measuring the ratio between the standard deviation and the mean of the in-vehicle travel time, 168 transfer travel links, and 1 threshold parameter). The random walk sampling variance $\sigma^2$ is 15 min for entry/exit walking time, 7 min for entry waiting time, 15 min for transfer travel time, 2 min for in-vehicle travel time, and 10 min for the threshold parameter. Though the large quantity of data in the Guangzhou Metro system can correct biased prior knowledge, an appropriate distribution assumption is still needed to save simulation time. For the same type of time, the prior distributions of some types of parameters are the same, as exemplified in Table 1. The initial value of each parameter for estimation is the mean of its prior distribution.
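The parameter count can be reproduced with a few lines of arithmetic. In the sketch below the link counts come from the text, while the interpretation of the per-link counts is an assumption: four parameters per link are read as parent mean, variance, lower boundary, and upper boundary, and three per in-vehicle link because the standard deviation is tied to the mean through the shared ratio constant.

```python
# (number of links, parameters per link) for each link type, as listed in the text.
link_types = {
    "entry/exit walking": (156, 4),
    "entry waiting": (156, 4),
    "in-vehicle travel": (147, 3),  # sd assumed tied to the mean via one shared ratio
    "transfer travel": (168, 4),
}
shared_parameters = 2  # the in-vehicle ratio constant and the threshold parameter

total = sum(n_links * n_params for n_links, n_params in link_types.values())
total += shared_parameters
print(total)  # 2363
```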
The splitting model proposed in this paper avoids selecting particular OD pairs in advance, such as the OD pairs with only one feasible route. However, the routes between the selected OD pairs should cover all links. Here, we randomly choose 5000 OD pairs (excluding the OD pairs with a monthly OD volume smaller than 1000) and, for each OD pair, (1, 5) estimation in case that some abnormal data is excluded. The parameter $\lambda$, which indicates the value of the standard deviation relative to the mean, is assumed to be 1 in this paper. We run 10000 iterations, of which the first 5000 form the burn-in period. On a PC with an Intel Core i3-2130 CPU at 3.40 GHz and 4.00 GB RAM, the simulation takes about 20 s per iteration, and finally 5000 effective samples are drawn for each parameter. The effective samples of the parameters are analyzed via histograms and trace plots of the iterations, shown in Figure 5 as an example, and they have all converged. From Figure 5(a), we can see that, after the burn-in period, the sampled values remain stable, which means that the iterations have converged. Meanwhile, in Figure 5(b), the histogram gives the frequency distribution of the effective sampled parameter values, which shows that the samples of the parameter follow a normal distribution. Figure 5 indicates that the drawn samples are effective and the estimations are reliable. Since too many parameters are estimated to list all the results, the estimation results of some parameters are shown in Table 2, including the mean and the 95% Bayesian confidence interval (CI) according to the distribution provided by the Bayesian estimation. The mean of the threshold parameter is 5.04 and its 95% CI is (4.01, 6.10).
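Summaries such as the posterior mean and the 95% interval are computed from the draws kept after the burn-in period. The following is a minimal sketch under stated assumptions: the interval is taken as the equal-tailed range between the 2.5% and 97.5% sample quantiles using a simple nearest-rank rule, and the chain is a hypothetical sequence of simulated draws, not output from the paper's model.

```python
import random


def posterior_summary(chain, burn_in, level=0.95):
    """Mean and equal-tailed interval of the draws kept after burn-in."""
    kept = sorted(chain[burn_in:])
    n = len(kept)
    tail = (1.0 - level) / 2.0
    lower = kept[round(tail * (n - 1))]          # nearest-rank lower quantile
    upper = kept[round((1.0 - tail) * (n - 1))]  # nearest-rank upper quantile
    return sum(kept) / n, (lower, upper)


# Hypothetical chain of 10000 draws for one parameter (illustration only).
rng = random.Random(1)
chain = [rng.gauss(5.0, 0.5) for _ in range(10000)]
mean, (low, high) = posterior_summary(chain, burn_in=5000)
print(round(mean, 2), round(low, 2), round(high, 2))
```

With 10000 iterations and a burn-in of 5000, the summary is based on 5000 draws per parameter, the same split described in the text.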
The prior distributions of the parameters for the same kind of time are set to be the same, as shown in Table 1, but the results in Table 2 show that the posterior distributions can differ significantly, which means that the large amount of AFC data can correct biased prior knowledge. It can also be seen from Table 2 that the lower boundary of the entry waiting time is the smallest (tending to 0) because, in some cases, passengers can catch the departing train without any waiting. Meanwhile, both the difference between the upper and lower boundaries and the variance are the smallest for in-vehicle travel time, which means that the in-vehicle travel time, being tied to the train schedule, varies the least among all kinds of times.
As shown in Figure 4, the truncated characteristic of the observations from Tianhe Coach Terminal to Gangding is significant. According to (2), the distribution of the travel time (unit: min) for this OD pair is $TN(10.22, 6.30, 8.01, 19.33)$; that is, for the parent normal distribution, the mean is 10.22 min and the standard deviation is 2.51 min. For the truncated normal distribution, the mean is 11.06 min and the standard deviation is 3.74 min, which are closer to the real values (the average value is 11.61 min and the standard deviation is 3.39 min) than the mean and standard deviation of the normal distribution. Meanwhile, the estimated boundaries are both consistent with the real censored data. This shows the superiority of the truncated normal distribution over the normal distribution.
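The moments of the truncated distribution can be evaluated directly from the four parameters quoted above. The sketch below uses the standard closed-form moment formulas of the truncated normal distribution and assumes, as in the text, that the second parameter (6.30) is the parent variance; the function names are hypothetical. Under that reading it gives a mean of about 11.06 min, in line with the value quoted above, and a variance of about 3.7 min².

```python
import math


def norm_pdf(z):
    """PDF of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncnorm_moments(mu, var, a, b):
    """Mean and variance of N(mu, var) truncated to the interval (a, b)."""
    sigma = math.sqrt(var)
    alpha = (a - mu) / sigma
    beta = (b - mu) / sigma
    mass = norm_cdf(beta) - norm_cdf(alpha)
    shift = (norm_pdf(alpha) - norm_pdf(beta)) / mass
    mean = mu + sigma * shift
    variance = var * (
        1.0 + (alpha * norm_pdf(alpha) - beta * norm_pdf(beta)) / mass - shift ** 2
    )
    return mean, variance


# Parameters reported for Tianhe Coach Terminal to Gangding (minutes).
mean, variance = truncnorm_moments(10.22, 6.30, 8.01, 19.33)
print(round(mean, 2), round(variance, 2), round(math.sqrt(variance), 2))
```

Because the lower boundary sits much closer to the parent mean than the upper boundary, truncation removes more mass on the left and shifts the mean upward, which is the asymmetry visible in Figure 4.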
The estimation result of the threshold parameter is also significant, with a mean of 26.04 (the unit of the threshold is minutes) and a 95% CI of (24.01, 28.10). Based on (7), the threshold for different OD pairs is plotted as a curve in Figure 6. In order to test the estimation of the threshold parameter and the structure of the threshold equation, a field survey was carried out in June 2014. The survey mainly focused on passengers' travel characteristics in the metro system, and 10000 effective samples were collected. With respect to the threshold, 8 scenarios were designed as follows: for an OD pair whose shortest travel time is 10 min, 20 min, 30 min, 60 min, 90 min, 120 min, 150 min, or 180 min, respondents were asked how much additional travel time they could tolerate. The average value for every scenario was calculated and is displayed as the histogram in Figure 6. It can be seen from Figure 6 that the estimated curve fits the surveyed data well, which demonstrates that the estimated threshold parameter and the structure of the threshold function are both reliable. Specifically, the Mean Absolute Percent Error (MAPE) between the estimated values and the real data is 3.46%, which shows that the model has a good fit to the real data.
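For reference, the MAPE compares each estimated value with the corresponding observed value as a percentage of the observed value and averages the result. The sketch below uses the common definition of the MAPE; the survey and model values in the example are hypothetical placeholders, not the survey results.

```python
def mape(actual, estimated):
    """Mean absolute percent error, in percent, of estimates against actual values."""
    if len(actual) != len(estimated) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    ratios = [abs(a - e) / abs(a) for a, e in zip(actual, estimated)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical tolerated extra times (min) for three scenarios, illustration only.
surveyed = [10.0, 20.0, 40.0]
estimated = [11.0, 18.0, 40.0]
print(round(mape(surveyed, estimated), 2))
```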
According to the estimated threshold, the distribution of the number of OD pairs with only one effective route over the probability intervals is shown in Figure 7. It shows that no OD pair has a probability of 0 of having only one effective route, which satisfies the estimation condition (referring to the condition that the denominator is not equal to 0). There are 3647 OD pairs whose probabilities are equal to 1, either because there is only one feasible route or because of the relation between the threshold and the boundaries of the travel time difference for those OD pairs. These OD pairs can be used to test the forecasting performance.

Performance Test.
The estimated results show their convergence properties based on statistical tests, but they have not yet been tested against real data, for example by a forecasting performance test. In order to test the mobility and estimating effectiveness of the parameters, this paper establishes an index measuring the errors between the estimated results and the real data. First, some representative OD pairs are selected based on the condition that, if a specific OD pair $rs \in S_2$, then the probability of the OD pair having only one effective route is $P(rs \mid \theta) = 1$. The set $S_2$ thus consists of the representative OD pairs. The measuring index is calculated by equations (13)-(15), in which $e_{rs}$ is the error for OD pair $rs$; $K$ is the number of travel time intervals for OD pair $rs$; $p(t_k, t_{k+1})$ denotes the ratio between the number of travel time observations within the interval $(t_k, t_{k+1})$ and the total number of observations; the interval boundaries need to satisfy $t_0 \le \min(\mathbf{T}_{rs}) < t_1$ and $t_{K-1} \le \max(\mathbf{T}_{rs}) < t_K$; $n(S_2)$ denotes the size of the OD pair set $S_2$; and $n(\mathbf{T}_{rs})$ denotes the number of travel time observations in the set $\mathbf{T}_{rs}$.
According to the above statistical analysis, there are 3647 OD pairs for which the probability of having only one effective route satisfies $P(rs \mid \theta) = 1$. The test data is one month of data collected in July 2014, from which the OD pairs with an OD volume smaller than 1000 are deleted to preserve the distribution property. Finally, 1121 OD pairs are left to carry out the test. Based on (13)-(15), the measuring index is calculated for every OD pair with interval length $t_{k+1} - t_k = 0.5$ min. Meanwhile, in order to evaluate the forecasting accuracy, an existing model is simulated for comparison. The difference between the existing model and the proposed model is as follows: for the existing model, all kinds of times are assumed to follow normal distributions and the selected OD pairs satisfy the condition that the ratio between the shortest mean time and the second shortest mean time is smaller than 0.5 (Sun et al. [16]). The error based on the proposed model, $e_{rs}(p)$, is plotted on the horizontal axis and the error based on the existing model, $e_{rs}(e)$, is plotted on the vertical axis. The scattered points are shown in Figure 8.
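An interval-based error of this kind can be sketched as follows. The exact form of (13)-(15) is not reproduced here, so this code is an assumption: it takes the error for one OD pair to be the sum, over 0.5 min intervals, of the absolute difference between the observed proportion of travel times in the interval and the probability the fitted truncated normal distribution assigns to it. All function names and the sample data are hypothetical.

```python
import math


def norm_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncnorm_cdf(x, mu, sigma, a, b):
    """CDF of N(mu, sigma^2) truncated to (a, b)."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    lo = norm_cdf((a - mu) / sigma)
    hi = norm_cdf((b - mu) / sigma)
    return (norm_cdf((x - mu) / sigma) - lo) / (hi - lo)


def interval_error(times, mu, sigma, a, b, width=0.5):
    """Assumed error form: sum over intervals of |observed share - model probability|."""
    start = math.floor(min(times) / width) * width  # t_0 <= min(T) < t_1
    n_bins = int(math.floor((max(times) - start) / width)) + 1
    counts = [0] * n_bins
    for t in times:
        counts[min(int((t - start) / width), n_bins - 1)] += 1
    error = 0.0
    for k, count in enumerate(counts):
        left = start + k * width
        model = truncnorm_cdf(left + width, mu, sigma, a, b) - truncnorm_cdf(left, mu, sigma, a, b)
        error += abs(count / len(times) - model)
    return error


# Hypothetical observations (min) checked against a hypothetical fitted distribution.
observed = [9.2, 9.8, 10.4, 10.9, 11.3, 11.7, 12.5, 13.1, 14.6, 16.0]
print(round(interval_error(observed, 10.22, 2.51, 8.01, 19.33), 3))
```

Under this assumed form the error lies between 0 (perfect agreement on every interval) and 2 (no overlap at all), so smaller values indicate that the estimated distribution reproduces the observed travel time histogram more closely.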
In Figure 8, the solid spots represent the errors $e_{rs}$ and the solid line indicates where the error based on the proposed model is equal to the error based on the existing model; that is, $e_{rs}(p) = e_{rs}(e)$ for the OD pair $rs$. If a spot lies above the line, the error based on the proposed model is smaller than that based on the existing model; that is, $e_{rs}(p) < e_{rs}(e)$ for the OD pair $rs$. Otherwise, $e_{rs}(p) > e_{rs}(e)$. From Figure 8, we can see that, for all selected OD pairs, the error spots lie above or on the line, which indicates that the error based on the proposed model is not greater than that based on the existing model; that is, $e_{rs}(p) \le e_{rs}(e)$. The average error over all selected OD pairs is 0.19 for the proposed model and 0.30 for the existing model. The above results show that the proposed model significantly improves the accuracy of estimating the distributions of route travel time, station walking time, station waiting time, section in-vehicle time, and transfer time at transfer stations.

Conclusions
Travel time reliability in the metro system, referring to the variation in travel time, has attracted more and more attention recently, as it can support evaluating the level of service, calculating route choice proportions, optimizing the train schedule, and so forth. In the metro system, the walking time, waiting time, and delayed in-vehicle time are the main contributors to the variation of the travel time of a route. Thus, estimating the walking time in the station, the waiting time on the platform, the transfer travel time in the transfer channel, and the in-vehicle travel time between two successive stations is the key to estimating the travel time of the route, especially for the OD pairs between which there are several effective routes. Based on AFC data, huge amounts of passengers' travel times between OD pairs can be collected. But the walking, waiting, transfer, and real in-vehicle travel times cannot be obtained directly, and the travel times between some OD pairs cannot be directly assigned to the routes because there is more than one effective route between those OD pairs. Some research has studied estimation methods for the various times in the metro system. However, the truncated property of the data is neglected, and the threshold used to select the OD pairs with only one effective route is calculated exogenously without considering the random property of the travel times of feasible routes. The truncated property of the data follows from the common-sense observation that no kind of time can be too long or too short, since each depends on, for example, the length of the walking channel, the walking speed, the train schedule, and so forth. Therefore, a new estimation model is proposed in this paper based on a Bayesian inference formulation that integrates the probability equation of the OD pair with only one effective route, in which all kinds of times are assumed to follow truncated normal distributions and the threshold is estimated endogenously.
The probability of the OD pair with only one effective route is derived from the relationship between the reliable shortest travel time of the route and the reliable second shortest travel time of the route for the OD pair, which varies with the travel time distribution. Considering that the new model contains complex integrations, Markov Chain Monte Carlo (MCMC) method is designed to estimate all parameters, in which Gaussian random walk Metropolis proposals are employed on all the unknown parameters.
Based on the AFC data collected from Guangzhou Metro, the truncated property of the travel time is demonstrated and the proposed model is estimated. The results show that the drawn samples have converged and all parameters are identifiable. The mean and standard deviation of the truncated normal distribution are closer to the real average value and standard deviation than those of the normal distribution, which indicates that the truncated normal distribution assumption is more reliable. The threshold is tested against the surveyed data, and the MAPE indicates the effectiveness of the threshold parameter estimation and of the assumed structure of the threshold function. Furthermore, regarding the estimating and forecasting accuracy, for every selected OD pair the error based on the proposed model is not greater than that based on the existing model, and, over all selected OD pairs, the average error based on the proposed model is significantly smaller than that based on the existing model. Therefore, both the statistical effectiveness of the estimations and the forecasting accuracy indicate the superiority of the proposed model over the existing model.