Using Smart Card Data Trimmed by Train Schedule to Analyze Metro Passenger Route Choice with Synchronous Clustering



Introduction
Metro passenger route choice is vitally important to metro operation and management, for example in passenger flow distribution and ticket revenue clearing. It can provide useful data for improving train schedules so that train capacity is fully used. However, metro passenger behavior is quite different from car user behavior. The former is largely influenced by both the metro network structure and the train schedule, while the latter is mostly decided by the users themselves. On one hand, different metro network structures lead to different route choices. For example, passengers tend to select routes with few transfers. On the other hand, the train schedule also influences passenger behavior. Coordinated transit lines reduce passengers' waiting time at transfer stations, so routes with coordinated transit lines should be more attractive than those without.
So far, many scholars have modeled, analyzed, and studied passenger route choice behavior in private transportation, such as Kato et al. [1]. Unlike private transportation, metro trains are operated according to the train schedule, so metro passengers' travel is restricted by the schedule. Therefore, traditional methods used for private transportation are not applicable to metro passenger behavior. Hence, researchers have tried to adopt technologies widely used in metro transportation for metro passenger behavior analysis. Among them, the AFC (Automatic Fare Collection) system collects smart card data on passengers' inbound and outbound swiping times at stations, which are useful for analyzing passenger behavior. A lot of research has been done to analyze passenger route choice based on smart card data. However, passengers with different walking and waiting times may take the same route, because train arrivals and departures are discrete. Hence, passengers walking time to/from platforms

Literature Review
Traditional models of passenger behavior can be classified by Wardrop's law (Liu et al. [2]) into nonequilibrium and equilibrium models (Smith et al. [3]). They assume that passengers' trip preference depends on perceived travel time and that individuals' perceptions differ. Some scholars put forward the stochastic user equilibrium (SUE) model to describe the problem. A simulation method was used to realize the SUE model, and experiments were carried out on a large-scale urban rail transit network (Kato et al. [1]). With the continuous expansion of parameter types and network sizes, the SUE model has become increasingly complex in practice (Thomas [3], Cascetta [4]). However, some scholars found that the traditional models may have defects when applied to metro transportation. The main reason is that passengers' travel routes are affected by the metro train schedule; that is, metro passengers' arrivals and departures are tied to trains' arrivals and departures. Thus the applicability of these traditional models is questioned.
The AFC system has been put into application in many metro systems worldwide. It records passengers' inbound swiping time, outbound swiping time, and other related information. These data are useful for analyzing passengers' route behavior in metro systems. Pelletier [6] divided the usage of smart card data into three categories: long-term planning service, short-term planning service, and operation planning service. For example, swipe card data can be used to forecast the passenger flow OD matrix (Munizaga and Palma [7,8]), to carry out demand analysis (Morency et al. [9]), and to support the operation, management, and planning of rail transit (Utsunomiya et al. [10]).
Specifically, smart card data have been receiving more attention, and more research has been carried out recently. Chan [11] put forward two research ideas based on London metro Oyster card data: one was to estimate the OD traffic matrix and the other was to build the metro transit service reliability matrix. This was the first use of historical card data to evaluate metro transit service quality. The main application of smart card data is to analyze passenger travel behavior. For example, Kusakabe et al. [12] proposed a method to predict the specific trains that passengers choose to ride by using a vast amount of long-term historical swipe data and parameters. Zhu et al. [13] proposed a method to calibrate the metro passenger behavior model using AFC data, combining a genetic algorithm with parameter estimation. Zhu et al. [14] presented a methodology for assigning passengers to individual trains using both smart card data and AVL data from train tracking systems; it can estimate the probability of a passenger boarding each feasible train and the probability distribution of the number of trains a passenger is unable to board due to capacity constraints. Ma et al. [15] developed a data mining method to identify the spatiotemporal commuting patterns of Beijing public transit riders using transit smart card data. Hong et al. [16] proposed a methodology for assigning passenger flows on a metro network based on Automatic Fare Collection (AFC) data and the realized timetable. Briand et al. [17] analyzed the behavioral habits of public transport passengers using a real dataset of smart card data covering a period of five years. Faroqi et al. [18] investigated the relationship between passengers' spatial and temporal characteristics from a novel passenger-based perspective using smart card data; the method was applied to four days of smart card data covering 80,000 passengers in Brisbane, Australia. Similarly, Zhu et al. [19] presented an integrated framework for estimating individual passengers' train choices through a data-driven approach with the real timetable and AFC data. Besides, smart card data can also be used for estimation or prediction. For example, Hörcher et al. [20] presented a comprehensive method to estimate the user cost of crowding in terms of the equivalent travel time loss with large-scale smart card data, in a revealed preference route choice framework. Zhao et al. [21] developed a methodology for predicting daily individual trip making and trip attributes using transit smart card data, and tested it on the transit smart card data of 10,000 users in London. Smart card data are also used to design metro train schedules. Zhang et al. [22] proposed a novel method to optimize the skip-stop scheme for bidirectional metro lines using the time-dependent passenger demand extracted from smart card data, so that the average passenger travel time can be minimized.
Recent studies have made progress in analyzing passenger behavior based on smart card data, and some of their methods are applicable to realistic-size networks. The specific focus of this paper is a method that uses only a small number of parameters, so that it can easily be applied to large-scale networks. Hence, this paper uses a data analysis method, namely a cluster algorithm, to analyze passenger route choice behavior on metro networks. Cluster analysis is a method of multivariate statistical analysis. Data are classified according to individual characteristics so that the data in the same category have the highest homogeneity, while different categories have relatively high heterogeneity. The cluster algorithm aims to analyze and mine the intrinsic structure and rules of the given data [23,24]. In the process of data clustering, the algorithm automatically divides data points into different sets according to their attributes. Data points with similar attributes are placed in the same set, while data points with different attributes are placed in different sets [25]. Clustering algorithms can be divided into several types: algorithms based on division (e.g., K-means), algorithms based on density (e.g., DBSCAN and OPTICS), the affinity propagation (AP) algorithm, the synchronous clustering (SynC) algorithm, etc.
The K-means algorithm is the most widely used clustering algorithm based on division. It has been nearly 60 years since it was proposed [26]. However, its biggest shortcoming is that the number of clusters K and the K initial data points must be selected in advance, and different initial values may make the algorithm converge to different results. Hence, many scholars proposed other clustering algorithms, among which the AP algorithm is a typical one [27]. The AP clustering algorithm does not need the number of clusters to be specified in advance. The synchronous clustering algorithm (SynC algorithm) [28,29] is another clustering algorithm that is not sensitive to initial values. The main idea of synchronous clustering is that each data point is regarded as an independent individual, and similar individuals automatically gather to form clusters. Owing to these characteristics, the algorithm has several advantages: (1) it does not require cluster centers to be given in advance, (2) it is not sensitive to the initial value, and (3) it copes well with noisy data.
However, to the best of our knowledge, no studies have adopted the SynC algorithm to analyze metro passenger route choices with smart card data trimmed by train schedules. Hence, taking the advantages of the synchronous clustering algorithm (SynC) into consideration, this paper adopts the SynC algorithm to analyze metro passenger behavior.

Methodology
3.1. Basic Assumptions. Some necessary assumptions and elements are first described as follows:
(1) All passengers' behaviors are assumed to be reasonable, and passengers do not stay in stations for a long time. However, there are always some unreasonable records with a very long or an extremely short travel time for a given OD pair. The proposed algorithm regards these records as noise in the dataset.
(2) Train congestion is not considered in data preprocessing. It means passengers can ride the first arriving train after they reach platforms.
(3) All trains are operated according to the train schedule strictly.

Smart Card Data.
The AFC system records the origin station (O is used in this paper), the destination station (D is used in this paper), and the corresponding inbound and outbound times. These swiping data can be used to obtain the detailed passenger flow demand. Table 1 shows some examples of the entry and exit swiping card data recorded by the AFC system, such as card number, swiping date, inbound station code, inbound swiping time, outbound station code, and outbound swiping time.
Smart card data (AFC data) are defined as $(c, T^{(\mathrm{in})}, T^{(\mathrm{out})}, S^{(O)}, S^{(D)})$, in which $c$ is the card ID, $T^{(\mathrm{in})}$ is the inbound swiping time, $T^{(\mathrm{out})}$ is the outbound swiping time, $S^{(O)}$ is the O station, and $S^{(D)}$ is the D station. Figure 2 shows the metro passenger travel process. A typical metro trip consists of swiping the card at the entry gates, walking to the platform, waiting for the coming train, riding the train (with transfers if any), and finally walking out of the station. As shown in the figure, the symbols include the walking cost time (entry walking time, $T^{(\mathrm{ew})}$), the waiting cost time (waiting time on platforms, $T^{(\mathrm{w})}$), the travel cost time (in-vehicle time, $T^{(V)}$), and the walking-out-of-station cost time (exit walking time, $T^{(\mathrm{xw})}$). If a passenger makes a transfer, the additional transfer walking cost time (transfer walking time, $T^{(\mathrm{tw})}$) and transfer waiting cost time (waiting time, $T^{(\mathrm{tq})}$) are required.

Here, $T^{(\mathrm{in})}$ (inbound swiping time) is defined as the moment passengers swipe in at stations, and $T^{(\mathrm{out})}$ (outbound swiping time) is defined as the moment passengers swipe out of stations. The difference between $T^{(\mathrm{out})}$ and $T^{(\mathrm{in})}$ is the passengers' actual travel time in the metro. Besides, $T^{(\mathrm{b})}$ (actual board time) is defined as the actual moment when passengers board trains, while $T^{(\mathrm{a})}$ (actual alight time) refers to the actual moment when passengers alight from trains. Then, the pure travel time $T^{(\mathrm{p})}$ is the difference between $T^{(\mathrm{a})}$ and $T^{(\mathrm{b})}$. It is obvious that the values of $T^{(\mathrm{b})}$ and $T^{(\mathrm{a})}$ are tied to train arrivals, which are determined by the train schedule.

AFC Data Trimmed by Train Schedule.
The passengers' travel time by metro (called the actual travel time in this paper) can be obtained from smart card data as the difference between the inbound swiping time and the outbound swiping time. Obviously, the actual travel time within one OD pair can differ if passengers select different routes. When the difference in travel time between the routes of an OD pair is large, a passenger's selected route can easily be determined from the travel time. However, the travel time from smart card data also contains inbound and outbound walking time and waiting time, which carry no information about the route. Since train arrivals at stations are discrete, passengers with different walking times may take the same train. That is to say, some passengers may board a train just after they arrive at the platform, while others may wait a long interval for the next train after just missing one. Thus, the travel time without waiting and walking time at the O and D stations provides more useful information than the travel time that includes them.
The train schedule can be used to trim smart card data by removing the walking and waiting time at O stations and the walking time at D stations. The trimmed result is subsequently used in the cluster algorithm. Figure 3 shows some passenger travel times before and after applying the AFC data trimming algorithm. It can be seen that the original AFC data are scattered, while the data after trimming are orderly. The pure travel time reflects the discrete characteristics of train arrival and departure.
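The method to determine passengers' actual boarding and alighting times is shown in Figure 4. First, for each AFC record, the inbound station is set as $S^{(O)} = s$, and the inbound time is $T^{(\mathrm{in})}$. Searching in order through all trains that pass $s$, find train $k$ based on the following equation:

$$D_{k-1,s} < T^{(\mathrm{in})} \le D_{k,s},$$

where $D_{k,s}$ is the scheduled departure time of train $k$ at station $s$. It means that passengers can ride train $k$ to their destinations or transfer stations. Thus the possible actual board time $T^{(\mathrm{b})}$ is

$$T^{(\mathrm{b})} = D_{k,s}.$$

Similarly, the actual alighting time can be obtained in the same way. The outbound station is set as $S^{(D)} = s'$, while the outbound time is $T^{(\mathrm{out})}$. Find train $k'$ with the following equation:

$$A_{k',s'} \le T^{(\mathrm{out})} < A_{k'+1,s'},$$

where $A_{k',s'}$ is the scheduled arrival time of train $k'$ at station $s'$. Thus the possible actual alight time $T^{(\mathrm{a})}$ is

$$T^{(\mathrm{a})} = A_{k',s'}.$$

It should be noted that a minimum walking time is needed to enter or exit the platform through the gates. The minimum time constraint is considered in $T^{(\mathrm{b})}$ and $T^{(\mathrm{a})}$ as follows:

$$T^{(\mathrm{b})} - T^{(\mathrm{in})} \ge w^{\min}_{s}, \qquad T^{(\mathrm{out})} - T^{(\mathrm{a})} \ge w^{\min}_{s'},$$

where $w^{\min}_{s}$ and $w^{\min}_{s'}$ are the minimum walking times at the O and D stations. Therefore, the pure travel time can be acquired by

$$T^{(\mathrm{p})} = T^{(\mathrm{a})} - T^{(\mathrm{b})}.$$

3.4. SynC Algorithm. Based on the pure travel time, this paper applies the SynC algorithm to process these data. This part presents how to use the SynC algorithm to analyze metro passenger route choice.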
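The trimming procedure of the preceding subsection (boarding and alighting times taken from the schedule, then the pure travel time) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the 60-second minimum walking times, and the toy schedule are assumptions, and the sketch handles a single line without transfers.

```python
from bisect import bisect_left, bisect_right
from typing import Optional, Sequence


def board_time(departures: Sequence[float], t_in: float, min_walk: float) -> Optional[float]:
    """First scheduled departure at the O station no earlier than t_in + min_walk."""
    k = bisect_left(departures, t_in + min_walk)
    return departures[k] if k < len(departures) else None


def alight_time(arrivals: Sequence[float], t_out: float, min_walk: float) -> Optional[float]:
    """Last scheduled arrival at the D station no later than t_out - min_walk."""
    k = bisect_right(arrivals, t_out - min_walk)
    return arrivals[k - 1] if k > 0 else None


def pure_travel_time(departures, arrivals, t_in, t_out,
                     min_walk_in=60.0, min_walk_out=60.0) -> Optional[float]:
    """Alight time minus board time; None when no feasible train pair exists."""
    b = board_time(departures, t_in, min_walk_in)
    a = alight_time(arrivals, t_out, min_walk_out)
    if b is None or a is None or a <= b:
        return None  # infeasible record, to be treated as noise
    return a - b


# Toy schedule in seconds after midnight (hypothetical values, sorted ascending).
departures_at_o = [28800.0, 29100.0, 29400.0]  # 08:00, 08:05, 08:10
arrivals_at_d = [30300.0, 30600.0, 30900.0]    # 08:25, 08:30, 08:35

# Swipe in at 08:02:30 and swipe out at 08:32:00.
print(pure_travel_time(departures_at_o, arrivals_at_d, t_in=28950.0, t_out=30720.0))
```

Binary search on the sorted schedule keeps each lookup logarithmic in the number of trains, which matters when a full day of AFC records is trimmed.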

Data Normalization.
Before clustering, the data need to be normalized, since data points may have different scales and dimensions, which affects the effectiveness of the clustering algorithm. Data normalization makes the data fall into a certain range; here the inbound swiping time and the pure travel time are brought into the same range before clustering. Z-score normalization, which is based on the mean and standard deviation of the attribute values, is used in this paper. Its advantage is that it does not need the maximum and minimum values of the data set and works well when the data contain outliers. Its formula is

$$v' = \frac{v - \bar{v}}{\sigma_v},$$

where $v$ is an attribute value, $v'$ is its normalized value, $\bar{v}$ is the mean of the attribute values, and $\sigma_v$ is the standard deviation of the attribute values.
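As a small illustration of the normalization step, the following Python sketch applies Z-score normalization to one attribute. The use of the population standard deviation and the sample values are assumptions of this sketch; the paper does not state which variant of the standard deviation is used.

```python
from statistics import mean, pstdev


def z_score(values):
    """Return (v - mean) / std for every value; all zeros if the attribute is constant."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical inbound swiping times (seconds after midnight) and pure travel times (seconds).
inbound_times = [28800.0, 30600.0, 32400.0]
pure_times = [1500.0, 1800.0, 2400.0]

# Each attribute is normalized separately, then paired into 2-D points for clustering.
points = list(zip(z_score(inbound_times), z_score(pure_times)))
print(points)
```

After this step both axes have zero mean and unit variance, so a single domain distance can be applied to both dimensions.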

Synchronous Clustering Algorithm (SynC Algorithm).
The main idea of the SynC algorithm is to regard each data point as an individual and to let similar points cluster together. The procedure of the algorithm is shown in Figure 5: first, data points are independent and move close to similar data points, as shown in Figure 5(a); second, more and more data points gather around those with the same attributes, as shown in Figure 5(b); finally, all similar data points are clustered together to form a cluster center, while noise data are automatically isolated, as shown in Figure 5(c). The SynC algorithm relies on the following definitions.
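Definition 1 (domain distance $\varepsilon$). It means the maximum distance from the given point.

Definition 2 ($N_\varepsilon(x)$, the collection of data point $x$). Let $x$ be a data point of data set $D$; $N_\varepsilon(x)$ means the data whose distance from $x$ does not exceed $\varepsilon$:

$$N_\varepsilon(x) = \{\, y \in D \mid \mathrm{dist}(y, x) \le \varepsilon \,\},$$

where $\mathrm{dist}(y, x)$ is the distance between data points $y$ and $x$.

Definition 3 (Kuramoto Amplitude of data point $x$). Let $x_i$ be the $i$th dimension of data point $x$. After it is influenced by the other points in $N_\varepsilon(x)$, the Kuramoto Amplitude of data point $x$ can be described as

$$\frac{dx_i}{dt} = \omega_i + \frac{K}{|N_\varepsilon(x)|} \sum_{y \in N_\varepsilon(x)} \sin(y_i - x_i),$$

where $\omega_i$ can be ignored in this cluster algorithm, and $K$ is a constant (equal to 1 in this part). Finally, the Kuramoto Amplitude can be rewritten as

$$x_i(t+1) = x_i(t) + \frac{1}{|N_\varepsilon(x(t))|} \sum_{y \in N_\varepsilon(x(t))} \sin\bigl(y_i(t) - x_i(t)\bigr),$$

where $t$ is the time step, and $t = 0$ represents the initial state.

Definition 4 (synchronous coordination parameter $r_c$). It represents the degree of synchronous coordination of all data points in the data set at the current time step:

$$r_c = \frac{1}{N} \sum_{x \in D} \frac{1}{|N_\varepsilon(x)|} \sum_{y \in N_\varepsilon(x)} e^{-\lVert y - x \rVert},$$

where $N$ is the number of data points in $D$. It can be seen that the synchronous coordination parameter of the data set increases gradually as more data points gather together. After the parameter has not changed for a long time, the data set has converged within $\varepsilon$ and reaches a locally synchronized status. Finally, when all data points gather together ($r_c \to 1$), it reaches a globally synchronized status.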
Definition 5 (optimal domain distance $\varepsilon_{\mathrm{opt}}$). It means the cluster result is the best when $\varepsilon$ is equal to a certain value. The optimal distance can be determined according to the SynC algorithm [28]:

$$\varepsilon_{\mathrm{opt}} = \arg\min_{\varepsilon} L(D, M),$$

where $M$ is the model formed by the cluster centers $c_k$ ($c_k$ being the $k$th cluster center of the given data $D$), and argmin is the function that returns the $\varepsilon$ for which the value of $L(D, M)$ is minimum. $L(D, M)$ is computed by the cost equations of [28], in which $K$ is the number of cluster centers; $C_k$ is the $k$th cluster set; $|C_k|$ is the number of data points in $C_k$; $p$ is the data dimension; and $\mathrm{pdf}(x)$ is the probability that data point $x$ belongs to $C_k$.

Figure 5: Sketch of synchronous clustering (SynC) algorithm process [28].

Therefore, the steps of the synchronous clustering algorithm (SynC algorithm) are described as follows, while the flowchart of the SynC algorithm is shown in Figure 6:
(1) The initial time step is set as $t = 0$, and all data points are regarded as independent cluster centers.
(2) Set the domain distance $\varepsilon$, and calculate $N_\varepsilon(x)$ for all data points.
(3) Compute the Kuramoto Amplitude of all data points using $N_\varepsilon(x)$; the data points $x(t+1)$ are obtained when the algorithm moves to the next time step ($t = t + 1$).

Journal of Advanced Transportation
(4) Compute the synchronous coordination parameter $r_c$ of the data set at this time step.
(5) If $r_c = 1$, the data set reaches a globally synchronized status, the algorithm ends, and the optimal domain distance can be computed. If this is not the case, the algorithm moves to step (6).
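The iteration described in the steps above can be sketched as follows. This is a minimal pure-Python illustration for one fixed domain distance, not the authors' C# implementation: the function names, the convergence tolerance, the merge threshold, and the toy points are assumptions, the neighbourhood is taken to include the point itself, and the search for the optimal domain distance is left out.

```python
import math


def sync_step(points, eps):
    """One time step: move every point toward its eps-neighbours.

    Returns the updated points and the synchronous coordination
    parameter r of the points as they were before the update.
    """
    updated, r_total = [], 0.0
    for p in points:
        # eps-neighbourhood of p; it contains p itself (assumption of this sketch).
        nb = [q for q in points if math.dist(p, q) <= eps]
        # Sine-coupled update applied to each dimension, with K = 1 and no frequency term.
        updated.append(tuple(
            p[d] + sum(math.sin(q[d] - p[d]) for q in nb) / len(nb)
            for d in range(len(p))
        ))
        r_total += sum(math.exp(-math.dist(p, q)) for q in nb) / len(nb)
    return updated, r_total / len(points)


def sync_cluster(points, eps, tol=1e-4, max_steps=200, merge_dist=1e-2):
    """Iterate until r is close to 1, then give coinciding points the same label."""
    pts = [tuple(float(v) for v in p) for p in points]
    for _ in range(max_steps):
        pts, r = sync_step(pts, eps)
        if 1.0 - r < tol:
            break
    labels, centers = [], []
    for p in pts:
        for k, c in enumerate(centers):
            if math.dist(p, c) < merge_dist:
                labels.append(k)
                break
        else:  # no existing centre is close enough: open a new cluster
            centers.append(p)
            labels.append(len(centers) - 1)
    return labels


# Hypothetical normalized (inbound time, pure travel time) points:
# two dense groups and one isolated point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
        (2.0, 2.0), (2.1, 2.0), (2.0, 2.1),
        (5.0, 5.0)]
print(sync_cluster(data, eps=0.5))
```

Points that end up alone in a cluster, like the last point of the toy data, correspond to the noise data that the algorithm isolates automatically.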

Case Study
To evaluate the proposed algorithm of smart card data trimming and SynC, a real-life metro network (the Shanghai Metro system, shown in Figure 7) with a large number of
(2) Train Schedules and AFC Data Trimming. To make the case study easy to program, the planned train schedule is used instead of the actual schedule. The planned weekday train schedule of November 2016 is applied in the case study, and all trains are assumed to operate strictly according to it. The train schedule is used to trim the AFC data and obtain the pure travel time by removing entry/exit walking time and waiting time according to the proposed AFC data trimming method. The process is shown in Table 2.
(3) Data Normalization. The inbound swiping time is selected as the X axis of cluster data set and the pure travel time is selected as the Y axis. However, due to the different dimensions of data points, data normalization is needed to get a better cluster result. The normalization example of the data points is shown in Table 3.
(4) Clustering Process. The algorithm is implemented in C#. Figure 8 shows the process of the SynC algorithm. The X axis is the inbound time after data normalization, while the Y axis is the pure travel time after data normalization. The two horizontal lines in each figure represent the morning and evening peak periods, respectively. Each part of Figure 8 represents a locally synchronized status of the SynC algorithm. In the first part, each data point is regarded as a cluster center (centroid). The data points automatically gather in the locally synchronized status, so that centroids are slowly merged in the following parts. It can be seen that, as clustering proceeds, data points gradually merge to form cluster centers, and noisy data are clearly isolated at the same time, once the optimal domain distance given by (13)-(16) is reached. The final result is shown in Figure 9. The color of a point indicates the cluster it belongs to. The more data points of the same color, the higher the passenger flow of the corresponding route. The passenger route selection probability during both peak and flat periods is easy to obtain from the result.
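How selection probabilities follow from the clustering result can be sketched as below: points are counted per cluster within each period, small clusters are dropped as noise, and the remaining counts are turned into shares. The period boundaries, the 5% noise threshold, and the toy records are assumptions made for illustration only; they are not values from the case study.

```python
from collections import Counter, defaultdict

HOUR = 3600


def period_of(t_in, am_peak=(7 * HOUR, 9 * HOUR), pm_peak=(17 * HOUR, 19 * HOUR)):
    """Assign an inbound swiping time (seconds after midnight) to a period."""
    if am_peak[0] <= t_in < am_peak[1]:
        return "morning peak"
    if pm_peak[0] <= t_in < pm_peak[1]:
        return "evening peak"
    return "flat"


def route_shares(records, min_share=0.05):
    """records: iterable of (inbound_time, cluster_label) pairs.

    Returns {period: {cluster_label: share}}, ignoring clusters whose share
    within the period is below min_share (treated as noise).
    """
    counts = defaultdict(Counter)
    for t_in, label in records:
        counts[period_of(t_in)][label] += 1
    shares = {}
    for period, counter in counts.items():
        total = sum(counter.values())
        kept = {k: n for k, n in counter.items() if n / total >= min_share}
        kept_total = sum(kept.values())
        shares[period] = {k: n / kept_total for k, n in kept.items()}
    return shares


# Hypothetical clustered records: ten trips in the morning peak, three at noon.
records = [(8 * HOUR, 0)] * 6 + [(8 * HOUR, 1)] * 4 + [(12 * HOUR, 1)] * 3
print(route_shares(records))
```

Linking each cluster label to a concrete route still has to be done separately, for example by comparing cluster travel times with route travel times.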

Algorithm Analysis.
The cluster algorithm uses the pure travel time, from which entry/exit walking time and waiting time have been removed using the train schedule. Some comparative analyses are made in this part. Figure 10 shows the cluster results using AFC data with trimming by train schedule (Figure 10(a)) and without trimming (Figure 10(b)). The trimmed results present the metro travel time characteristics clearly, while the untrimmed results present passenger travel time in a disorderly way. Thus, pure travel time trimmed by train schedules can represent the discrete characteristics of metro transportation, since it takes train schedules into consideration.

Table 4 shows the cluster results for the morning peak, flat, and evening peak periods. This table shows passengers' route preference in different periods. Clusters with small passenger flow are regarded as noise. Table 5 shows the route list of this OD pair in the traditional model used by the Shanghai Metro Company [5]. The candidate route sets are generated by the K-shortest-path algorithm with expected route travel time, and the selection probability of each route is calculated by a logit model. The table contains the possible routes that passengers may follow and the corresponding selection probability of each route. It is very important to metro operation, since it is used to calculate the passenger flow distribution of the whole network. The allocation to each metro line is also decided by the line passenger flow computed from the traditional model results. The routes in Table 5 are linked to the cluster centers in Table 4 by comparing travel times. Taking Table 5 (route list) as a contrast, the following results can be summarized from Table 4 (cluster result):
(1) There are mainly two routes during the morning peak period. About 60% of passengers choose the route with a longer travel time but fewer transfers (Route No. 3 in Table 5), and 40% of passengers choose the route with a shorter travel time but more transfers (Table 5). This result differs from Table 5. It is a bit surprising that not all passengers selected the route with the shortest travel time. A possible reason for selecting the route with a longer travel time but fewer transfers during the peak period is that passengers want to avoid station congestion. Station congestion may cause them to miss the first arriving train because there is not enough space in the vehicle and there are too many passengers on the platform. Thus, passengers may think that a transfer would cost them more time during the morning peak.

Result Analysis.
(2) As shown in Table 4, the difference between cluster 3 and cluster 6 is small; thus these two clusters can be considered the same one. After linking clusters to routes, we find that most passengers choose Route No. 1 or Route No. 2, while few choose Route No. 3 during the flat period, which is in line with Table 5. The results reveal that, during the flat period, passengers do not expect a shorter travel time but a more comfortable service instead. Besides, many noisy points can be found during flat periods; they have relatively long travel times and represent little passenger flow in the metro system.

It should be noted that only one week of smart card data is used in the case study. The passenger route selection probability would be more reliable with more smart card data. Therefore, the result of the algorithm can be used to revise traditional model results such as those of Table 5.

Algorithm Extension.
The proposed algorithm can easily be applied to other OD pairs on the metro network. However, the algorithm of AFC data trimmed by train schedule has two limitations when it is used for other OD pairs. One is that passengers can choose any transit line to finish their trips when the origin or destination station is a transfer station. The other is that passengers can choose either upstream or downstream trains of the transit line to finish their trips when the origin or destination station is a normal station. These two kinds of OD pairs are called unclear-route OD pairs, and the trimming algorithm cannot be applied to them. The main reason is that passengers' entry or exit walking time cannot be removed from the AFC data, since it is not clear which transit line or which direction (upstream/downstream) of the train schedule should be used.
There are also two limitations in the SynC algorithm when it is used for other OD pairs. One is that the algorithm is of no use for OD pairs whose routes have similar travel times, because routes cannot be separated by clustering such similar travel times. The other is that the SynC algorithm is of no use when the passenger flow of an OD pair is very low, because the cluster algorithm cannot work with so little data.
The Shanghai metro network is used to discuss the applicability of the proposed algorithm, as shown in Table 6. The date of the selected data is November 15, 2016. There are 119,918 OD pairs in this network, and 5,313,949 passengers traveled on that day. Some interesting findings are as follows:
(1) In about 20% of OD pairs (23,390, 19.50%), passengers can ride more than one line to their destinations. These OD pairs include (a) those for which both upstream and downstream trains are feasible and (b) those with several feasible lines because the origin/destination station is a transfer station. The proposed algorithm of AFC data trimming by train schedule cannot be used for this type of OD pair, which accounts for 11.41% of passenger flow (606,245).
(2) About 6.00% of OD pairs (7,193) have routes with similar travel times. The SynC algorithm cannot separate these data into distinct clusters.
(3) More than 20% of OD pairs (25,737) carry fewer than 5 passengers over the whole day. Such a small passenger flow is useless for clustering. However, only 26,810 passengers (0.50%) travel through these OD pairs.
(4) AFC data also contain some noisy records; for example, passengers may swipe in and swipe out at the same station. There are 4,112 such OD pairs (3.43%), accounting for a passenger flow of 95,712 (1.80%). These data are useless for analyzing passenger behavior in the metro system.
(5) Therefore, excluding the data above, the proposed algorithm can be used in about 50% of OD pairs (59,486, 49.61%) to cluster passenger travel routes, while about 85% of AFC data (passenger flow, 4,520,205, 85.06%) can be used in clustering. That is to say, most AFC data are useful for the analysis of passenger route behavior.
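The shares quoted in this list can be recomputed from the counts given in the text. The short sketch below does this arithmetic; it assumes that the four excluded categories of OD pairs are disjoint, so that the usable OD pairs are simply the remainder. The variable and function names are illustrative.

```python
TOTAL_OD_PAIRS = 119_918
TOTAL_PASSENGERS = 5_313_949

# OD-pair counts of the excluded categories, as quoted in items (1)-(4).
excluded_od_pairs = {
    "unclear routes": 23_390,
    "similar travel time": 7_193,
    "fewer than 5 passengers": 25_737,
    "noisy records": 4_112,
}

# Passenger flows quoted in the text (none is given for "similar travel time").
quoted_flows = {
    "unclear routes": 606_245,
    "fewer than 5 passengers": 26_810,
    "noisy records": 95_712,
    "usable for clustering": 4_520_205,
}


def share(count, total):
    """Percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


for name, n in excluded_od_pairs.items():
    print(f"{name}: {share(n, TOTAL_OD_PAIRS)}% of OD pairs")

# Usable OD pairs obtained by subtraction (assumes disjoint categories).
remaining = TOTAL_OD_PAIRS - sum(excluded_od_pairs.values())
print(f"usable: {remaining} OD pairs ({share(remaining, TOTAL_OD_PAIRS)}%)")

for name, n in quoted_flows.items():
    print(f"{name}: {share(n, TOTAL_PASSENGERS)}% of passenger flow")
```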

Conclusion
This paper studied metro passenger route choice with the train schedule and a cluster algorithm. On the basis of AFC data, the algorithm of AFC data trimmed by train schedule was proposed to obtain the pure travel time. The results were then used in the synchronous clustering algorithm to analyze passenger route choice (selection probability) under time constraints. Then, a case study using Shanghai metro data was conducted to validate the proposed algorithm. It indicated that the probability of route choice in different periods can be calculated through the SynC algorithm, and thus the algorithm can be used to revise traditional model results. The proposed algorithm can help to analyze passenger route preference with smart card data without traditional methods, which contain a large number of parameters. The estimated passenger route preference would be more accurate with more smart card data. However, the proposed method has some limitations that need further research. (1) The journey time of a route should differ between periods. The travel times in Table 5 are theoretical values, calculated from train section running time and passenger transfer cost time. They do not consider congestion in trains or variable train headways. Thus, further research should address the determination of the dynamic travel time of passenger routes over periods. For example, congestion data in train carriages and on station platforms, which can be acquired by passenger flow detection devices based on image recognition, are useful for calculating the dynamic journey time of passenger routes over periods. (2) Besides, how to link the clustering results to the travel routes automatically needs further study, in order to make the data processing more complete.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.