A Network-Based Model of Passenger Transfer Flow between Bus and Metro: An Application to the Public Transport System of Beijing



Introduction
In a public transport (PT) network, it is impossible to provide all passengers with a direct and unimodal PT service between all stations and stops. Passengers sometimes have to transfer between different lines and often between different modes. A trip by PT could, therefore, involve one or more transfers from one mode to another [1,2]. In contrast to door-to-door service, inconvenient transfers can disrupt passenger travel and reduce the competitiveness of PT [3,4]. A better transfer connection between modes has been shown to improve the level of service of PT in general and thus stimulate its overall usage [5][6][7]. To provide a better transfer connection, it is necessary to be able to quantify transfer flows, thus allowing smart transfer planning and management [8]. For example, if PT planning and management authorities want to understand pedestrian behavior at a transfer corridor and further improve connection efficiency, they need to estimate and predict the passengers' transfer flow [9]. Since the combination of bus and metro is a typical one in many cities, much research has focused on how to provide a better-integrated bus and metro system through such transfer connections [10,11], which is also the focus of this paper.
Many rule-based algorithms have been developed to estimate transfer flow based on smart card data [12,13], but they can only estimate the historical transfer flow of an existing station. To predict the transfer flow of a newly planned station, transit assignment models based on transit users' route choices have been used [1,14,15]. Discrete choice models have been used to explain the route choice of travelers based on utility maximization [16]. Such models search for the route choice set of travelers and calculate the probability of each choice, resulting in extensive calibration and computation time [17]. There are also studies using only network properties [18] to assign PT passenger flows, which provide a parsimonious alternative to existing passenger assignment models [19]. However, this type of approach has not yet been used to model transfer flows, and no research has examined the relation between transfer flow and network properties. In this paper, we aim to fill this gap by establishing a model of transfer flow between metro and bus based on network properties. Some network indicators can be obtained directly from the data [20,21], such as transfer time and the number of bus lines around one metro station [22]. Apart from these relatively straightforward indicators, the most important network property introduced in this study is what we call transfer accessibility.
This is a newly defined indicator for the radiation of a transfer station given its position in a bimodal PT network. Intuitively, this indicator represents the accessibility of a transfer station, which is proportional to the sum of potential interactions between all reachable metro stations and all reachable bus stops and inversely proportional to the generalized travel cost of these interactions. The potential interaction is measured in terms of the potential production of a bus stop (or a metro station) plus the potential attraction of a metro station (or a bus stop). For both production and attraction, we use the number of points of interest (POIs) around each station (or stop) as a proxy, a dataset that is typically available nowadays. It should be noted that some research also refers to the robustness of transfer connections within a station as transfer accessibility [23], which should be distinguished from our concept.
Our approach to calculating transfer accessibility based on the sum of potential interactions is very similar to the measurement of gravity-based accessibility [24], which can be regarded as an analogy to Newton's gravitational law [25]. Namely, the exchange of people between two cities is directly proportional to the product of their populations and inversely proportional to the square of the distance between the two cities [24]. In this paper, we propose such a gravity-based model to estimate transfer accessibility and then use it as an explanatory variable to establish a regression model of station-level transfer flows.
The paper is organized as follows. First, the methodology is described, which includes the definition of transfer accessibility and the regression model for transfer flow prediction. Then, the PT data of Beijing used in our study is further explained. Following that, we present the application of our model to those data. In the final section, we draw conclusions and suggest directions for future research.

Methodology
We assume that the network properties of a station can be related to transfer flow between two modes of transportation. In this study, we aim to test this assumption. Since not all single features are normally distributed and a nonlinear relationship may exist between the independent and dependent variables [26], we take the logarithm of the variables to build the regression model if necessary. In the resulting model, Equation (1), $y_j$ is the transfer flow of station $j$, $\varepsilon$ represents the error term, and $x_p$ are the different explanatory variables that represent network properties.
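A minimal sketch of how a model of this kind could be estimated by ordinary least squares is given below. It assumes, purely for illustration, that the transfer flow and every explanatory variable are log-transformed; the text only says that logarithms are taken "if necessary". The function names `fit_log_linear` and `predict_log_linear` are hypothetical.

```python
import numpy as np


def fit_log_linear(X, y):
    """Fit log(y) = b0 + sum_p b_p * log(x_p) + eps by ordinary least squares.

    X: (stations, variables) array of strictly positive network properties.
    y: (stations,) array of strictly positive transfer flows.
    Returns the coefficient vector [b0, b_1, ..., b_P].
    Assumption: all variables are log-transformed.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: intercept column followed by the logged predictors.
    design = np.column_stack([np.ones(len(y)), np.log(X)])
    coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
    return coef


def predict_log_linear(coef, X):
    """Predict transfer flow on the original scale from fitted coefficients."""
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(X.shape[0]), np.log(X)])
    return np.exp(design @ coef)
```

With this log-log form, each coefficient can be read as an elasticity: the approximate percentage change in transfer flow for a one percent change in the corresponding network property.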
Next, we select a group of network properties that are considered to be related to transfer flows. Based on a review of the existing literature, the following network properties are selected (more details in Section 2.2).
(i) Transfer accessibility (the new indicator)
(ii) Transfer time [27]
(iii) The number of bus stops around each metro station [28]
(iv) The number of bus lines per bus stop [22]
As summarized in Figure 1, a regression model is established to find the relationship between transfer flow and the four network attributes mentioned above, among which transfer accessibility needs to be calculated based on a gravity model. The gravity model assumes that transfer accessibility at each station is dependent on the number of reachable POIs, PT stops at this station, and a cost function describing the effect of distance. Its calculation process consists of five steps for a station: (1) find all OD pairs that connect to this station; (2) calculate a proxy for potential trip interactions between every OD pair, specifically the number of POIs surrounding the origin station plus the number surrounding the destination station; (3) for each OD pair, multiply the interaction by a cost function that describes the effect of distance; (4) filter out those OD pairs connected by direct transport, such as direct metro or bus lines; and (5) sum the results over all the reachable OD station pairs to obtain the gravity-based accessibility. The method can be applied in a PT network that includes bus stops and metro stations.

Dependent Variable.
In this study, the dependent variable is the transfer flow. In order to compute transfer flow from smart card data, it is necessary to first identify what a transfer is. When commuters travel in PT networks using smart cards [29], the following data from each trip is available through smart card data: anonymous identities (IDs) of users, IDs of boarding and alighting stations, and timestamps.
During the past decade, different approaches have been proposed to identify transfers based on smart card data [30], many of which are rule-based. For example, different fixed time thresholds are set for the observed time gaps between consecutive trip legs/segments [31]. Transfer time thresholds ranging from 30 minutes to 90 minutes have been used for London to identify transfers with smart card data [12,32]. Alternatively, transfer walking distance can be applied. A maximum threshold of 750 meters on transfer distances was used to estimate transfers in London [33], and 400 meters in The Hague, Netherlands [13]. Some approaches further distinguish transfers from short activities by incorporating the effects of denied boarding, transferring to a vehicle of the same line [13], and the circuity of the path trajectories [34].
In this paper, we also identify transfers using a rule-based approach. Thresholds of transfer time and transfer distance are set to detect transfers based on smart card data. Our research area is the city of Beijing and we focus on the transfers between bus and metro. Firstly, the complexity of the Beijing PT network is similar to that of London and Shanghai. Based on the transfer data of London [12] and Shanghai [35], we can preliminarily determine that the transfer time is generally about 30 minutes for these large-scale cities. The maximum transfer distance is set at 2.5 km, based on the assumed maximum walking speed [33]. Secondly, in order to test whether 30 minutes is reasonable for Beijing, we used Beijing smart card data to analyze the time interval between two adjacent trips of all passengers whose trip interval is up to about 30 minutes and whose transfer distance is within 2.5 km. As shown in Figure 2, the time interval of 95% of trips is less than 25 minutes. Therefore, we set our threshold of transfer time as 25 minutes and the maximum transfer distance as 2.5 km. Following these rules, it is possible to estimate transfer flows through every metro station based on smart card data.
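The rule described above can be sketched as follows. The two thresholds are the ones stated in the text; the record layout (`TripLeg`), the function name, and the idea of passing the walking distance between the alighting and boarding stops as an argument are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_TRANSFER_TIME = timedelta(minutes=25)  # threshold chosen in the text
MAX_TRANSFER_DISTANCE_KM = 2.5             # threshold chosen in the text


@dataclass
class TripLeg:
    """One smart card trip leg (hypothetical record layout)."""
    card_id: str
    mode: str  # "bus" or "metro"
    board_time: datetime
    alight_time: datetime


def is_bus_metro_transfer(first, second, walk_distance_km):
    """Return True if two consecutive legs form a bus-metro transfer.

    walk_distance_km is the distance between the alighting stop of the
    first leg and the boarding stop of the second leg.
    """
    if first.card_id != second.card_id:
        return False
    # Exactly one bus leg and one metro leg, in either order.
    if {first.mode, second.mode} != {"bus", "metro"}:
        return False
    gap = second.board_time - first.alight_time
    if gap < timedelta(0):
        return False  # legs are out of order or overlapping
    return gap < MAX_TRANSFER_TIME and walk_distance_km < MAX_TRANSFER_DISTANCE_KM
```

Counting the pairs of legs that pass this check, grouped by the metro station involved, would give a station-level transfer flow.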
There are many types of transfer, including internal transfers, such as those within the metro system, and external transfers between bus and metro. We consider an internal transfer between different metro lines as one trip segment, since commuters only need to swipe their cards when they enter and leave a metro station and do not swipe their cards when they transfer between different metro lines. In our joint network of bus and metro, one-time transfers between metro and bus comprise the majority of the transfers, accounting for 91% of all transfers between metro and bus, based on Beijing smart card data (Figure 3). Thus, one-time transfers between metro and bus are our research focus in this paper.

Independent Variables.
In our regression model that predicts transfer flow, there are four independent variables in total. The first independent variable is the transfer time of a trip between the bus and the metro, determined from the time interval between a traveler's card swipes. We use the median of all empirical transfer times at one metro station to represent the general transfer time from this metro station to a bus stop (or vice versa). For a newly planned station, transfer time can be initially estimated based on the transfer distance and the estimated waiting time.
The second independent variable is the number of bus stops around one metro station, which reflects the potential opportunities for commuters to transfer. We set the radius as one kilometer and count the number of bus stops within this range from each metro station.
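As an illustration, this count could be computed from station and stop coordinates as sketched below. The use of great-circle (haversine) distance and the function names are assumptions; the text specifies only the one-kilometer radius.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def count_stops_within(station, bus_stops, radius_km=1.0):
    """Count bus stops whose distance to the metro station is at most radius_km.

    station: (lat, lon) of the metro station.
    bus_stops: iterable of (lat, lon) tuples.
    """
    lat, lon = station
    return sum(
        1
        for stop_lat, stop_lon in bus_stops
        if haversine_km(lat, lon, stop_lat, stop_lon) <= radius_km
    )
```

The same helper, called with a different radius, could serve for any other radius-based count around a station or stop.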
The third independent variable is the number of bus lines per bus stop, which reflects the intensity of bus service at a bus stop next to the metro station. The assumption is that if there are more lines at one bus stop, there would be more transfer trips. Having explained these three variables, we now specify the fourth, transfer accessibility, which is the new one put forward in this paper. As introduced before, a gravity-based model is proposed to measure transfer accessibility. This model assumes that the transfer accessibility of each station is dependent on the number of reachable POIs in a city, data which is nowadays easy to obtain, and on a cost function describing the effect of distance.
We use a toy PT network combining a bus network and a metro network to explain our definition. As illustrated in Figure 4, each node represents a metro station (a blue node) or a bus stop (a black node). There are four metro stations (A, B, C, and M) and five bus stops (b1, b2, b3, b4, and b5). A link between two bus stops or two metro stations exists if there are PT services connecting them. A dashed line represents the transfer connection between a bus stop and a metro station. For example, commuters can walk between bus stop b1 and metro station M to transfer and continue their trips.
In this gravity-based model, we focus on one transfer station and find all the OD pairs that can be connected through it. In our case, an OD pair should consist of one bus stop and one metro station. When we focus on one metro station, all possible transfer links from this metro station to the bus stops located around it are searched. In the PT toy network example (Figure 4), we focus on metro station M, which has a possible transfer link with bus stop b1. We assume that a trip transfers from bus to metro; therefore, the origin node could be either bus stop b2 or b3, each connected by a bus line to bus stop b1. The destination node could be metro station A, B, or C, since all metro stations are interconnected and commuters can travel from metro station M to any other metro station.
For one transfer metro station, we search for all potential OD pairs that are connected through this station. We use the number of POIs surrounding a metro station or a bus stop as a proxy for potential trip production or attraction. For metro station M in the above PT toy network, one needs to calculate the number of surrounding POIs of the 6 OD pairs which are connected through this station. For example, the proxy potential trip interaction for metro station M between the OD pair "b2-A" is the sum of the number of POIs around bus stop b2 and metro station A.
The total number of company POIs and housing POIs is counted within a 500-meter radius [35] from each metro station and each bus stop.
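The toy example can be reproduced in a few lines. The origins (b2, b3) and destinations (A, B, C) follow the description of metro station M above; the POI counts are made-up numbers used only to show the arithmetic, and the function names are hypothetical.

```python
from itertools import product


def od_pairs_through(bus_origins, metro_destinations):
    """All bus-to-metro OD pairs that can be connected through one transfer station."""
    return list(product(bus_origins, metro_destinations))


def interaction(poi_count, origin, destination):
    """Proxy for potential trip interaction: POIs around origin plus POIs around destination."""
    return poi_count[origin] + poi_count[destination]


# Toy network around metro station M (bus-to-metro direction).
bus_origins = ["b2", "b3"]            # stops linked by a bus line to b1
metro_destinations = ["A", "B", "C"]  # stations reachable by metro from M

# Hypothetical POI counts, for illustration only.
poi_count = {"b2": 120, "b3": 80, "A": 300, "B": 150, "C": 200}

pairs = od_pairs_through(bus_origins, metro_destinations)
potential = {f"{o}-{d}": interaction(poi_count, o, d) for o, d in pairs}
```

Two origins and three destinations give the 6 OD pairs mentioned in the text, and with these illustrative counts the pair "b2-A" has a potential interaction of 120 + 300 = 420.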
An OD pair might be connected directly by a single PT mode. If that is the case, the amount of transfer flow between this OD pair would be reduced. Therefore, if one wants to estimate transfer demand [36] more accurately, the impact of direct transport should be removed. The number of metro lines, the number of bus lines, the travel time by bus [37], and the standard deviation of travel time will affect commuters' choices. We combine the four factors mentioned above to obtain the transfer demand impact factor $\zeta_k(j)$, Equation (2), where $j$ is the current transfer station and $k$ is the $k$th OD pair connected through station $j$. $\zeta_k(j)$ denotes the transfer demand impact factor of the $k$th OD pair transferring at station $j$. $m_k$ and $n_k$ are the number of metro lines and the number of bus lines, respectively, which can connect the $k$th OD pair directly. $t_{k,\mathrm{total}}$ is the total travel time of the $k$th OD pair when commuters choose to transfer at station $j$. $t_{k,\mathrm{bus}}$ is the average bus travel time of the $k$th OD pair when commuters choose to travel by bus directly. $\mathrm{std}_{t_{k,\mathrm{bus}}}$ is the standard deviation of bus travel time on the $k$th OD pair when commuters choose to travel by bus directly.
If some metro lines can directly connect the $k$th OD pair, set $\zeta_k(j) = 0$, and if there is neither a metro line nor a bus line between the $k$th OD pair of station $j$, set $\zeta_k(j) = 1$. Otherwise, $\zeta_k(j)$ is determined by the effect of multiple parameters, including $n_k$, $t_{k,\mathrm{total}}$, $t_{k,\mathrm{bus}}$, and $\mathrm{std}_{t_{k,\mathrm{bus}}}$. Bus running times and running time variation will affect service reliability and will further affect the attractiveness of travel by bus [22]. Therefore, we can assume that the lower the standard deviation of bus travel time is, the more punctual and stable bus travel time will be, which should motivate commuters to use it [22]. The higher the number of bus lines between one OD pair, the higher the probability of having a good bus connection; this also motivates commuters to use the bus directly instead of transferring.
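The case distinction above can be sketched as follows. Only the two boundary cases are taken from the text; the expression for the intermediate case is not reproduced here, so it is passed in as a caller-supplied function (`intermediate`, a hypothetical stand-in for Equation (2)). All names are illustrative.

```python
def transfer_demand_impact(m_k, n_k, t_total, t_bus, std_bus, intermediate):
    """Transfer demand impact factor zeta_k(j) for one OD pair.

    m_k: number of metro lines that connect the OD pair directly.
    n_k: number of bus lines that connect the OD pair directly.
    t_total: total travel time when transferring at station j.
    t_bus: average travel time of the direct bus.
    std_bus: standard deviation of the direct bus travel time.
    intermediate: callable (n_k, t_total, t_bus, std_bus) -> float used when
        only direct bus lines exist; a placeholder, not the paper's formula.
    """
    if m_k > 0:
        return 0.0  # a direct metro line exists: no transfer demand assumed
    if n_k == 0:
        return 1.0  # no direct line at all: transfer demand is not reduced
    return intermediate(n_k, t_total, t_bus, std_bus)
```

Keeping the intermediate case as a parameter makes it possible to plug in whichever functional form is calibrated, without changing the handling of the two boundary cases.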
We use a combined cost function to model commuters' reluctance to travel a long distance. This function, Equation (3), follows [25]: $f(c_{kj})$ is a generalized impedance function of travel distance with two parameters for calibration, $n$ and $\beta$, and $c_{kj}$ is the travel distance between the $k$th OD pair when traveling through transfer metro station $j$. The shape of this function for different values of its parameters is shown in Figure 5.
The values of $n$ and $\beta$ should be calibrated to calculate transfer accessibility based on the cost function. In Figure 4, if we focus on metro station M, b2-A is one of the potential OD pairs which are connected through this station. In this case, the travel distance $c_{kj}$ between the OD pair "b2-A" is the sum of the distance b2-M and the distance M-A. Based on the estimated $n$, $\beta$, and this travel distance $c_{kj}$, it is possible to obtain the value of the cost function for the OD pair "b2-A". The cost function is not always decreasing: as travel distance increases, it first rises and then gradually decreases until it stabilizes near zero.
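In the transport modelling literature, a combined impedance function is commonly written as a power term multiplied by a negative exponential, $f(c) = c^{n} e^{-\beta c}$. The sketch below assumes that this is the form intended here; this is an assumption, since the text states only the two parameters $n$ and $\beta$ and the rise-then-fall shape. Under that assumption the function peaks at $c = n / \beta$. The function names are hypothetical, and the parameter values in the usage lines are arbitrary.

```python
import math


def combined_cost(c, n, beta):
    """Assumed combined impedance function f(c) = c**n * exp(-beta * c)."""
    return (c ** n) * math.exp(-beta * c)


def peak_distance(n, beta):
    """Distance at which the assumed function is largest.

    Setting f'(c) = c**(n - 1) * exp(-beta * c) * (n - beta * c) to zero
    gives c = n / beta.
    """
    return n / beta


# Arbitrary illustrative parameters: the function rises up to the peak
# and decays towards zero beyond it.
values = [combined_cost(c, n=2.0, beta=0.5) for c in (1.0, 4.0, 20.0)]
```

With $n = 2$ and $\beta = 0.5$ the peak lies at $c = 4$, so the three sampled values first increase and then decrease, matching the qualitative shape described in the text.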
By summing the calculation results over all the potential OD pairs which are connected through station $j$, it is possible to obtain the transfer accessibility of station $j$. The transfer accessibility of metro station $j$, Equation (4), is defined as the sum, over the OD pairs transferring at station $j$, of the product of the potential trip interaction, the transfer demand impact factor, and the cost function. Here $m$ is the number of OD pairs transferring at station $j$, $k$ represents the $k$th OD pair transferring at station $j$, $p_k(j)$ is the potential trip interaction of the $k$th OD pair transferring at station $j$, $\zeta_k(j)$ denotes the transfer demand impact factor of the $k$th OD pair transferring at station $j$, and $f(c_{kj})$ is a cost function describing the effect of distance.
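Written out, the definition above corresponds to a sum of the following form. The symbol $A_j$ for the transfer accessibility of station $j$ is our own notation, introduced only for illustration; the remaining symbols are those defined in the text.

```latex
A_j = \sum_{k=1}^{m} p_k(j)\,\zeta_k(j)\,f(c_{kj})
```

An OD pair with a direct metro connection has $\zeta_k(j) = 0$ and therefore contributes nothing to the sum, which is how the filtering step of the calculation procedure takes effect.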

Application to the PT Network of Beijing
3.1. Data. The case study is conducted in the city of Beijing, the capital of China. Some basic information about Beijing and its network is shown in Table 1.
We use network data, smart card data, and POI data in our research. The number of bus stops around one metro station is counted within a one-kilometer radius from each metro station. In Figure 6, nodes represent metro stations, and the depth of color represents the number of bus stops near each metro station.
A smart card can be used by Beijing's travelers to board the metro, buses, and public bicycles. According to the National Report on Urban Passenger Transport Development [39], 67.4% of travelers used a smart card when they traveled by PT in Beijing in 2017. Therefore, smart card data can, to some extent, be used as a representative sample of the PT passenger population at the time. Notably, our approach can also be applied to the latest PT data obtained from the new smartphone-based payment methods, such as NFC and QR codes, as long as they record the same type of information. Cardholders need to check in and check out when they travel in all PT systems [40]. As shown in Table 2, the data used in this paper is from September 4 to September 11, 2017 (8 days). It contains the records of all the transactions completed by smart cardholders during this period.
Travelers do not need to check out when they transfer within the metro system, but they do need to check out first and check in again if they transfer between metro and bus.
The POI data used in this paper were extracted from the Gaode Maps service, which is the Chinese equivalent of Google Maps [41]. About 1.2 million POIs in twenty categories can be obtained for Beijing. The available information in the POI data includes name, coordinates, and category. The twenty categories include residence and company. Three types of information are extracted from the original POI dataset for each metro station and bus stop: the total number of surrounding POIs, the number of surrounding residence POIs, and the number of surrounding company POIs [35]. The number of POIs around the metro stations is indicated by the depth of color in Figure 7.

Data Preprocessing.
We use the data from September 4, 2017, as an example to illustrate the preprocessing of the raw data. The number of bus card transactions on this day is 141,192,280 and the number of subway card transactions is 5,341,597. Firstly, the anomalous data is removed, including the following cases: (1) when the line number is not available; (2) when there is a missing record of the boarding or alighting stop; (3) when the alighting time is earlier than the boarding time; (4) when the boarding and alighting are at the same stop on the same line; (5) when there is duplicate data; and (6) when the station ID is wrong. After data preprocessing, we obtain 5,070,457 valid bus records and 5,300,593 valid metro records. Consequently, the total number of bus and subway records is 10,371,050. Secondly, users with two consecutive travel records are detected in the combined bus and metro records. We connect two adjacent trip records of the same user into one trip record, leading to three types of travel that include a transfer: bus and bus trips, metro and metro trips, and bus and metro trips. We focus on bus and metro trips and obtain 1,082,269 records. Thirdly, the transfer time and transfer distance are calculated for these bus and metro trips. If the transfer time is less than 25 minutes and the transfer distance is less than 2.5 km for one trip record, we consider it to be a transfer trip. We obtain 566,978 transfer trip records. Similarly, we analyze the remaining 7 days of data to calculate the average transfer flow.

Identifying Transfers and Calculating Variables.
The transfer flow of all metro stations is shown in Figure 8, where it can be observed that stations with more transfer flow are not necessarily located in the city center.
As shown in Figure 9, transfer times range from 3 minutes to 25 minutes. Most of the transfers take around 8 minutes. The number of bus stops within a one-kilometer radius of each metro station ranges from 1 to 25. On average, there are around 8 bus stops near each metro station. The number of bus lines per bus stop varies from 1 to 13, with 3 to 5 appearing most often.
Before calculating the transfer accessibility, the two parameters $n$ and $\beta$ of the cost function in Equation (3) of the gravity-based model need to be determined. Using Equation (1), we estimate the model with the real PT data of Beijing. The R-squared value that results from the different parameter values is indicated by the depth of color in Figure 10. When $n = 5$ and $\beta = 0.1$, the evaluation results are the best; therefore, we use these values.
With 300 metro stations and more than 30,000 bus stops, there would theoretically be about 9 million OD pairs. Based on the formula, we can calculate the transfer accessibility of every metro station, which is indicated by the color depth in Figure 11. It can be observed that some metro stations far from the center are highly accessible, since some of them are the only connections to a lot of distant bus stops.

Correlation Analysis of Variables.
The correlation between the independent variables is analyzed in Table 3. The correlations between transfer accessibility and the other indicators are weak, except for the number of bus lines per bus stop, for which the correlation is slightly higher. We still keep these two variables, since they both have a significant impact on model accuracy (more detail in Table 4).

Model Estimation.
We established a regression model for each of the four independent variables and the transfer flow to explore the influence of every single predictive attribute. We show the relationship between every independent variable and the dependent variable in Figure 12. The four attributes all have a significant impact on the transfer flow.
In our final dataset, we have 306 metro stations. The data is split into a training set (70%) and a test set (30%). The model estimation results based on the training set are summarized in Table 5. All of the coefficients have the hypothesized signs and are all significant.
In general, the coefficients of three attributes, namely transfer accessibility, the number of bus stops, and the number of bus lines per bus stop, are positive and significant in explaining the transfer flow. More bus lines and more bus stops would also lead to more transfer flow. Transfer flow decreases as transfer time increases.
We use cross-validation to evaluate our model in terms of R-square, Equation (5). The K-fold method [42] was chosen for cross-validation. In K-fold cross-validation, the original sample is randomly partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model and the remaining K-1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds can then be combined to produce a single estimation. The advantage of this method over repeated random subsampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. We tested different values of K and finally set K = 6.
R-square is defined as

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad (5)$$

where $\hat{y}_i$ is the value of $y$ predicted by our model, $y_i$ is the actual value of $y$, and $\bar{y}$ is the mean actual value of $y$. R-square reflects the extent to which the fluctuation of $y$ can be described by the fluctuation of the independent variables of our model. The value range of R-square is from 0 to 1. The closer R-square is to 1, the more accurate the model is.
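The evaluation procedure can be sketched as follows: a plain K-fold split with an ordinary least squares fit on each training part and R-square computed on the held-out fold. The function names, the fixed random seed, and the averaging of the fold scores are assumptions made for illustration; the text does not state how the K results are combined.

```python
import numpy as np


def r_square(y_true, y_pred):
    """R-square: 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def k_fold_r_square(X, y, k=6, seed=0):
    """Mean out-of-fold R-square of a linear model under K-fold cross-validation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    # Random partition of the sample indices into k folds.
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        design = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(design, y[train_idx], rcond=None)
        pred = np.column_stack([np.ones(len(val_idx)), X[val_idx]]) @ coef
        scores.append(r_square(y[val_idx], pred))
    return float(np.mean(scores))
```

Each observation ends up in exactly one validation fold, which is the property the text highlights as the advantage over repeated random subsampling.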
We test the prediction results with and without the proposed variable in Table 4. The accuracy of the model is 0.6032 without the variable "transfer accessibility" and 0.6935 with this proposed variable. The combination of the four variables we proposed obtains the higher accuracy. The model we proposed performs well, not only for explaining the data but also for predicting the transfer flows.
Furthermore, we use a residual plot to show the residuals on the vertical axis and the independent variable on the horizontal axis. As shown in Figure 13, the points in the residual plot are randomly dispersed around the horizontal axis, which suggests that our linear regression model is appropriate for the data.
We also use the F-test [43] to evaluate the model. Our testing approach is as follows. We start with two hypotheses. $H_0$ is the null hypothesis that the full regression model does not explain the variance in the transfer flow better than the intercept-only model. $H_1$ is the alternate hypothesis that the full regression model is better. We apply the F-test to the two models. In our example, the p value is 1.11e-80, which is an extremely small number. There is less than a 1% chance that the F-statistic of 188.6 could have occurred by chance under $H_0$. Thus, we reject the null hypothesis and accept the alternate hypothesis $H_1$ that the full model can explain the variance in the dependent variable better than the intercept-only model.
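For reference, the statistic of an overall F-test against an intercept-only model can be computed from the two sums of squares as sketched below. This is the standard textbook formula rather than code from the study, the function name is hypothetical, and the numbers in the usage line are invented for the arithmetic only.

```python
def overall_f_statistic(ss_total, ss_residual, n_obs, n_predictors):
    """F statistic comparing a regression with an intercept-only model.

    ss_total: sum of squared deviations of y from its mean
        (the residual sum of squares of the intercept-only model).
    ss_residual: residual sum of squares of the full regression.
    n_obs: number of observations.
    n_predictors: number of explanatory variables (excluding the intercept).
    """
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    explained = ss_total - ss_residual
    return (explained / df_model) / (ss_residual / df_resid)


# Invented numbers: (80 / 4) / (20 / 20) = 20
example_f = overall_f_statistic(ss_total=100.0, ss_residual=20.0, n_obs=25, n_predictors=4)
```

The p value is then the upper-tail probability of this statistic under an F distribution with `n_predictors` and `n_obs - n_predictors - 1` degrees of freedom.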

Conclusion
In this paper, we have developed a regression model to explain how network-related attributes can be used to model transfer flow in a multimodal PT network. We conducted our case study in a joint bus and metro network in Beijing, and several properties were shown to influence transfer flow between these two modes, namely, transfer accessibility, transfer time, and the number of bus lines per bus stop. Among them, the most important property we proposed was transfer accessibility, which was defined to represent the radiation of a station as a transfer hub, given its position in a multimodal PT network.
We believe that our method could be used not only for explaining transfer flow at existing stations but also for predicting transfer flow at newly planned stations. It provides a parsimonious alternative to existing passenger assignment models, which are mostly expensive in terms of both the modeling effort and the data they require. Our model can be directly applied to the evaluation of the transfer flow at a new station in Beijing. The model can also be used for other cities as long as they have the same data available as we had, including smart card data, network data, and POI data. The innovation of our study lies in the new approach to modeling passenger transfer flow based on network properties. Also, transfer accessibility is a new concept, which might be useful for other PT research as well.
This work can still be improved in a few ways. Firstly, several features can be added to the existing methodology in the future. Cities of different sizes, and thus with different PT network scales, can be used to further validate the findings of this paper. Secondly, the number of passengers depends on the time and period. One can consider the temporal effects on transfer flow in future research. Finally, one-time transfers between metro and bus are our research focus in this paper, since they account for the majority of the transfers between metro and bus, but it would be interesting to explore the transferability of our model to other, more complex transfer types in the future.

Figure 1: Main components of the developed methodology and overall research design.

Figure 5: Cost functions with different parameters.

Figure 6: The metro network in Beijing, China, with the depth of color indicating the number of surrounding bus stops.

Figure 12: The relation between four attributes and transfer flow: (a) transfer accessibility, (b) transfer time, (c) the number of bus stops around each metro station, and (d) the number of bus lines per bus stop.

Table 1: Basic information of PT network in Beijing.

Table 2: Information on smart card data used in this paper.

Table 3: The correlation between the independent variables.

Table 4: The accuracy of the model with different variable combinations.

Table 5: Estimation results of the regression model based on the training set.