Discovering Public Transit Riders’ Travel Pattern from GPS Data: A Case Study in Harbin

This paper proposes a method for measuring public transit riders’ travel patterns based on divided cells and the GPS data of public transit vehicles. The method consists of two parts: detecting urban origin and destination areas and measuring the public transit riders’ travel pattern. Moreover, a series of indicators is proposed to reflect the public transit riders’ travel pattern. A case study using GPS data collected from taxis and buses in Harbin, China, is carried out to evaluate the method. The study is expected to provide a better understanding of public transit riders’ travel patterns.


Introduction
The dramatic increase in urban vehicles has led to many serious problems, including traffic congestion, road accidents, and air pollution, which have become common in many metropolises, and even in smaller cities, all over the world. Public transit is considered one of the most effective solutions to these problems. Public transit includes various services that provide mobility to the general public, including buses, trains, ferries, shared taxis, and their variations [1,2]. Public transit has obvious advantages, such as lower cost, more efficient mobility, and shorter travel time, which lead more and more urban residents to turn to public transit services. The proportion of travelers and commuters in Beijing who use the public transit system (including bus, taxi, and rail transit) has increased continuously over the last decade. This proportion reached 54.2% in 2015, which represents more than 15.5 million trips per day [3]. The same trend has occurred in other metropolises and smaller cities all over the world.
Public transit riders usually exhibit a fixed travel pattern. That is to say, at the macrolevel, a fixed number of origin-destination (OD) pairs are usually located in the same places of the urban area, and the number of trips between these OD pairs stays steady from day to day [4][5][6]. The usage ratio of each kind of public transit also stays steady from day to day, and trips in the morning and evening rush hours account for a large proportion of total daily trips [3]. At the microlevel, a single traveler moves from a residential area to a workplace in the morning and back in the evening each day. If the travel pattern of public transit riders can be identified, urban public transit managers can benefit from it. For example, analyzing riders who prefer public transit vehicles to private cars helps transit authorities improve their strategies and even make new policies to attract new riders [7]. With a better understanding of riders' transfer behavior, transit agencies can adjust bus routes to make transfers easier, which can enhance riders' satisfaction [8]. By calculating the shortest path lengths between all station pairs, the origin-destination matrix, and trip lengths, transit agencies can develop fare change plans to manage demand or raise revenue [9,10].
The origin-destination (OD) matrix is a typical representation of residents' travel patterns, which reflects travel demand, trip generation, trip distribution, and so on. Traditional models for establishing the OD matrix usually rely on travel behavior surveys. In practice, household travel surveys are conducted in many countries [11]. However, survey data have many limitations and errors. For example, some metropolises in Japan (Tokyo, Kyoto, Osaka, etc.) survey residents' travel behavior every 10 years. Since cities grow rapidly, the survey data quickly become out of date [11]. Furthermore, the sampling rate is usually very low, which introduces sampling errors [12]. Meanwhile, many human factors may affect the accuracy of the OD matrix, such as willfully omitting some trips, forgetfulness, and other related factors [13].
Compared with traditional survey data, GPS data and smart card data offer wider coverage, lower cost, and higher accuracy. With the rapid development of data-based technology, various intelligent transportation systems are widely applied in public transit systems. These systems can collect residents' mobility data every day, including longitude, latitude, boarding time, and dropping off time [14]. In the last decade, much research based on these data has been carried out, for example, mining urban recurrent congestion evolution patterns from GPS-equipped vehicle mobility data [14], comparing accessibility in urban slums using smart card and bus GPS data [15,16], discovering functional zones using bus smart card data [17], and partitioning bus operating hours into time-of-day intervals based on bus GPS data [18]. As a result, data-based transportation research has become a hot spot in the transportation field [19].
GPS data and smart card data are usually collected from different subsystems of one intelligent transportation system, or even from different systems, because bus, taxi, and rail transit usually belong to different public transit companies [3]. Therefore, most previous data-based research into transit traveler behavior uses either smart card data [20][21][22] or GPS data [23,24]. Accordingly, smart card data yield the travel patterns of bus and rail transit riders [20][21][22], and GPS data yield the travel behavior of taxi riders [23,24]. To the best of our knowledge, research on public transit riders' travel patterns rarely uses both smart card and GPS data. However, smart card data only provide riders' boarding and dropping off information and lack location information [21,22]. As a result, only approximate locations can be acquired, which causes inaccuracy in origin and destination inference. Moreover, at the microlevel, trips by riders of GPS-equipped public transit vehicles are far fewer than trips by other public transit vehicles [3], and the insufficient sample size leads to inaccuracy in trip distribution. In this light, smart card data and GPS data should be integrated to discover public transit riders' travel patterns.
The aim of this paper is to propose an effective method to explore the public transit riders' travel pattern in an urban area. There are two subgoals identified: detecting urban origin and destination areas at a cell level and measuring public transit riders' travel pattern.
This paper is organized as follows. Section 2 discusses the definition of cells and locating points that will be applied to this research. Section 3 describes the proposed urban public transit riders' travel pattern measuring method. Section 4 applies the proposed public transit riders' travel pattern measuring methodology using taxi and bus GPS data and the urban road network of Harbin. Section 5 provides conclusions and recommendations for future research.

Definition of Cells and Locating Points
In this part, we define some parameters of cells and locating points. Firstly, the urban area is divided into $I \times J$ small cells of the same size. $c(i, j)$ is one of these cells, where $i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$.
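The cell assignment can be sketched as follows. This is a minimal illustration, not the implementation used in this paper: it assumes that each locating point has already been projected to planar coordinates in meters measured from the south-west corner of the grid, and the function name `cell_index`, the row and column convention, and the 1-based indexing are all assumptions made for the example.

```python
def cell_index(x_m, y_m, cell_len_m, n_rows, n_cols):
    """Map a planar point to the 1-based (row, column) index of its cell.

    x_m, y_m   -- metres east / north of the grid's south-west corner (assumed)
    cell_len_m -- side length of one square cell in metres
    n_rows, n_cols -- grid dimensions
    Returns None when the point falls outside the grid.
    """
    if x_m < 0 or y_m < 0:
        return None
    row = int(y_m // cell_len_m) + 1   # rows counted northwards
    col = int(x_m // cell_len_m) + 1   # columns counted eastwards
    if row > n_rows or col > n_cols:
        return None
    return (row, col)


# Hypothetical 3 x 3 grid of 200 m cells.
print(cell_index(450.0, 130.0, 200.0, 3, 3))  # (1, 3)
print(cell_index(650.0, 10.0, 200.0, 3, 3))   # None, east of the grid
```

Points outside the grid return `None` so that they can be discarded before counting.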
According to the taxi GPS dataset, there are four types of occupation status. Since this paper studies the public transit riders' travel pattern, we define two types of locating points according to the boarding and dropping off status, as follows.
Type 1. taxi.boarding represents taxi vehicles' locating points whose occupation status value is 768 (i.e., the occupation status is boarding).
Type 2. taxi.dropping off represents taxi vehicles' locating points whose occupation status value is 16640 (i.e., the occupation status is dropping off).
The locating points whose occupation status value is 256 (i.e., the taxi vehicle is vacant) are of no use for the public transit riders' travel pattern, so they are not used in our research.
taxi ⟨ID, ts⟩ represents a specific locating point of a taxi vehicle, where ID and ts are the Taxi ID and Timestamp from the taxi GPS datasets.
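As a rough illustration of how the two types of taxi locating points could be separated by status value, consider the sketch below. The record layout `TaxiPoint`, the function name, and the sample rows are hypothetical; only the status codes 768, 16640, and 256 come from the dataset description.

```python
from collections import namedtuple

# Hypothetical record layout for one taxi GPS row.
TaxiPoint = namedtuple("TaxiPoint", ["taxi_id", "ts", "lat", "lon", "status"])

BOARDING = 768        # occupation status: boarding
DROPPING_OFF = 16640  # occupation status: dropping off


def split_taxi_points(points):
    """Return (boarding points, dropping off points); other statuses are ignored."""
    boarding = [p for p in points if p.status == BOARDING]
    dropping_off = [p for p in points if p.status == DROPPING_OFF]
    return boarding, dropping_off


# Made-up sample rows, for illustration only.
sample = [
    TaxiPoint("T001", "2015-08-03 07:10:00", 45.75, 126.65, 768),
    TaxiPoint("T001", "2015-08-03 07:25:30", 45.77, 126.62, 16640),
    TaxiPoint("T001", "2015-08-03 07:26:00", 45.77, 126.62, 256),
]
boarding, dropping_off = split_taxi_points(sample)
print(len(boarding), len(dropping_off))  # 1 1
```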
In many cities of China, such as Guangzhou, Xi'an, and Harbin, a passenger only needs to touch the smart card once when boarding a bus, whereas in Beijing a passenger needs to touch the smart card again when dropping off. It is therefore easy to obtain both boarding and dropping off locating points in Beijing. In Harbin, the dropping off locating points can be inferred from the boarding points and time period. To simplify the method, in this paper we use only the data of the most periodic commuters from the bus datasets. That is to say, only riders who travel to work every morning and return home every evening have their locating points included in our research. This type of bus rider generates two trips each day. In this light, we define two types of bus locating points as follows.
Type 1. bus.boarding represents boarding passengers' locating points.
Type 2. bus.dropping off represents dropping off passengers' locating points.
bus ⟨ID, ts⟩ represents a specific locating point of a bus vehicle, where ID and ts are the Card ID and Timestamp from the bus GPS datasets.
The boarding and dropping off locating points are identified from the smart card data as follows.
Step 1. Extract the locating points with the same card ID (i.e., bus ⟨$ID_k$, $ts_1$⟩, bus ⟨$ID_k$, $ts_2$⟩, ..., bus ⟨$ID_k$, $ts_n$⟩) from the whole smart card dataset, where bus ⟨$ID_k$, $ts$⟩ is a locating point whose card ID is $ID_k$ and $n$ is the number of such locating points.
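Step 1 can be sketched as a simple grouping operation. The example below is illustrative only and covers nothing beyond Step 1: it assumes each smart card record is reduced to a (card ID, timestamp) pair, and the function name and sample values are hypothetical.

```python
from collections import defaultdict


def group_by_card(records):
    """Group (card_id, ts) records by card ID, with timestamps in ascending order."""
    grouped = defaultdict(list)
    for card_id, ts in records:
        grouped[card_id].append(ts)
    return {card_id: sorted(ts_list) for card_id, ts_list in grouped.items()}


# Made-up records; ISO-formatted timestamps sort correctly as strings.
records = [
    ("C17", "2015-08-03T17:42:10"),
    ("C17", "2015-08-03T07:31:05"),
    ("C52", "2015-08-03T08:02:44"),
]
by_card = group_by_card(records)
print(by_card["C17"])  # ['2015-08-03T07:31:05', '2015-08-03T17:42:10']
```

Sorting by timestamp within each card keeps a rider's morning and evening records in order, which is convenient when only riders with two trips per day are retained.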

Methodology
The methodology for discovering public transit riders' travel pattern is described in this section. Two stepwise methods are proposed to achieve the main goal: detecting origin and destination areas and measuring the commuter pattern between each OD pair.

Detecting Origin and Destination Areas.
In this part, we detect origin and destination areas from public transit riders' GPS data (i.e., taxi and bus GPS data). Because of urban planning and construction in recent decades, an urban area is usually divided into many different functional zones. Therefore, the origin points and destination points of passengers' trips tend to cluster into several origin areas and destination areas. In this light, we use a clustering algorithm to detect origin areas and destination areas in the urban area. Moreover, the number of clusters is not known in advance, and clusters of this type are usually not spherically shaped. Therefore, we apply a customized Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to solve this problem.
For a specific cell $c(i, j)$, we define a parameter $N_T(i, j)$ to represent the number of locating points (i.e., taxi and bus locating points) in this cell during a specific period $T$. Taking a 3 × 3 block of cells as an example (as shown in Figure 1(a)), the parameters $N_T(i, j)$ of all cells in one period can be easily calculated. For two specific cells $c(i_1, j_1)$ and $c(i_2, j_2)$, the distance between them, $\mathrm{dist}[c(i_1, j_1), c(i_2, j_2)]$, is calculated as follows:
$$\mathrm{dist}[c(i_1, j_1), c(i_2, j_2)] = l \cdot \sqrt{(i_1 - i_2)^2 + (j_1 - j_2)^2},$$
where $l$ is the length of the cell. As shown in Figure 1(b), the distance between $c(i-1, j)$ and $c(i-1, j+1)$ is $l$, and the distance between $c(i-1, j-1)$ and $c(i, j)$ is $l \cdot \sqrt{2}$. Some related parameters are defined as follows.
Object: a cell.
Core object co: a cell that satisfies $N_T(i, j) \geq \theta$, where $\theta$ is the threshold of the parameter $N_T(i, j)$.
$\varepsilon$-Neighborhood of a core object: the space within a radius $\varepsilon$ centered at co.
Figure 2 illustrates the flow chart of the customized DBSCAN algorithm in this research. The original DBSCAN algorithm is based on density. By comparison, the density-reachable points in our customized algorithm are defined by $N_T(i, j) \geq \theta$, and we set the minPts value to 1 in this paper.
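One possible reading of the customized clustering is sketched below; it is not the flow chart of Figure 2. It assumes that a cell is a core object when its locating point count reaches the threshold, that two core cells belong to the same cluster when a chain of core cells links them with each step no longer than the neighborhood radius, and that cells below the threshold are treated as noise. The function names, the Euclidean cell distance, and the sample counts are assumptions made for the example.

```python
import math
from collections import deque


def cell_distance(c1, c2, cell_len):
    """Distance between two cells, in the same unit as cell_len (Euclidean, assumed)."""
    (i1, j1), (i2, j2) = c1, c2
    return cell_len * math.hypot(i1 - i2, j1 - j2)


def cluster_cells(counts, threshold, eps, cell_len):
    """Group core cells (count >= threshold) that are chained within radius eps.

    counts -- dict mapping (i, j) to the number of locating points in that cell
    Returns a list of clusters, each a set of (i, j) cells.
    """
    core = {cell for cell, n in counts.items() if n >= threshold}
    unvisited = set(core)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster = {seed}
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            # Small tolerance so that diagonal neighbours at exactly eps are kept.
            near = [c for c in unvisited
                    if cell_distance(current, c, cell_len) <= eps + 1e-9]
            for c in near:
                unvisited.remove(c)
                cluster.add(c)
                queue.append(c)
        clusters.append(cluster)
    return clusters


# Made-up counts: two diagonal core cells, one isolated core cell, two sparse cells.
counts = {(1, 1): 5, (2, 2): 7, (1, 2): 0, (3, 3): 1, (5, 5): 9}
clusters = cluster_cells(counts, threshold=3, eps=math.sqrt(2) * 200, cell_len=200)
print(len(clusters))  # 2
```

With a radius of √2 times the cell length, each core cell can only reach its eight surrounding cells, so a cluster is a connected patch of core cells.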

Measuring Public Transit Riders' Travel Pattern.
The public transit riders' travel pattern can be reflected by several indicators, such as the number of trips between each OD pair, the proportion of each transit mode, the path between each OD pair, and the travel time of the path. In order to present these indicators, we define some preliminary terms as follows.
$Tr_{mode}(CL_p \to CL_q)$ is the number of trips of a specific transit mode (i.e., taxi or bus) from the $p$th cluster to the $q$th cluster.
Based on the clusters of origin and destination areas, given a specific period $T$, the public transit OD matrix can be calculated. $P_{mode}(CL_p \to CL_q)$ is the proportion of a specific transit mode (i.e., taxi or bus) from the $p$th cluster to the $q$th cluster, which is calculated as follows:
$$P_{mode}(CL_p \to CL_q) = \frac{Tr_{mode}(CL_p \to CL_q)}{Tr_{MODE}(CL_p \to CL_q)},$$
where $Tr_{MODE}(CL_p \to CL_q)$ is the number of trips of all transit modes from the $p$th cluster to the $q$th cluster.
$Tt_{mode}(CL_p \to CL_q)$ is the travel time by the specific transit mode (i.e., taxi or bus) from the $p$th cluster to the $q$th cluster.
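The three indicators can be computed from a list of trips as in the following sketch. It assumes each trip has already been matched to an origin cluster and a destination cluster, and it summarizes travel time by the mean over trips, which is a choice made for this example; the definitions above do not say how individual travel times are aggregated. All names and sample values are hypothetical.

```python
from collections import defaultdict


def od_indicators(trips):
    """Compute per-mode trip number, mode proportion, and mean travel time per OD pair.

    trips -- iterable of (mode, origin_cluster, destination_cluster, travel_time_s)
    Returns a dict keyed by (mode, origin_cluster, destination_cluster).
    """
    trips_by_mode = defaultdict(int)   # (mode, o, d) -> trips of that mode
    trips_all = defaultdict(int)       # (o, d) -> trips of all modes
    time_sum = defaultdict(float)      # (mode, o, d) -> summed travel time
    for mode, o, d, travel_time in trips:
        trips_by_mode[(mode, o, d)] += 1
        trips_all[(o, d)] += 1
        time_sum[(mode, o, d)] += travel_time

    indicators = {}
    for (mode, o, d), n in trips_by_mode.items():
        indicators[(mode, o, d)] = {
            "trips": n,
            "proportion": n / trips_all[(o, d)],
            "mean_travel_time_s": time_sum[(mode, o, d)] / n,
        }
    return indicators


# Made-up trips between clusters 1 and 2.
trips = [
    ("taxi", 1, 2, 600),
    ("taxi", 1, 2, 900),
    ("bus", 1, 2, 1500),
    ("bus", 2, 1, 1200),
]
result = od_indicators(trips)
print(result[("taxi", 1, 2)]["trips"])               # 2
print(result[("taxi", 1, 2)]["mean_travel_time_s"])  # 750.0
```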

Case Study
We apply the proposed methods to the city of Harbin, China. First, the datasets used in this case study are described, and then the stepwise methods are implemented one by one. Approximately 16,000 operating taxis equipped with GPS devices run around Harbin's urban area day and night. The location information is uploaded to the management system every 30 s during the day and every 2 min at night. The data amount to about 2 GB and around 25 million rows each day. The taxi GPS data collected from 3rd Aug. to 7th Aug. 2015 are used, consisting of taxi ID, timestamp, latitude, longitude, and status, as shown in Table 1. There are four kinds of "status" in the table: 17152, 16640, 256, and 768, which represent occupied, dropping off, vacant, and boarding, respectively.
There are nearly 1500 buses traveling around the urban area of Harbin, and these buses belong to 100 routes. The sampling interval is 30 s. The bus IC card records collected from 3rd Aug. to 7th Aug. 2015 are used, consisting of Route ID, Bus ID, Card ID, Timestamp, Latitude, and Longitude, as shown in Table 2.
In our research, another dataset is the digital map of Harbin, which covers most urban areas of Harbin. The research area is nearly 100 square kilometers and just covers the area within the 2nd Ring Road of Harbin, as shown in Figure 3. About 80 percent of the GPS points are located in this research area. We divided this area into 250 square cells of the same size, each 200 × 200 meters, as shown in Figure 3.

Detecting Origin and Destination Areas.
In this paper, we only measure the public transit riders' travel pattern during working days. According to Section 3.1, we can calculate the locating point counts $N_T(i, j)$ of all cells. Taking one working week's dropping off data (consisting of taxi.dropping off and bus.dropping off), that is, from 3rd Aug. to 7th Aug. 2015, as an example, the related data are given in Table 3. The sampling time is from 07:00 to 19:00.
We then apply the customized DBSCAN algorithm to detect the origin and destination clusters. We set the neighborhood radius $\varepsilon$ to 282.8 meters (i.e., $\sqrt{2} \cdot l$, where the cell length $l$ is 200 meters). We apply different values of the threshold $\theta$ in our experiment in order to find the optimal value. As Table 4 shows, when the $\theta$ value is larger than 1500, the number of clusters becomes stable. So we set $\theta$ to 2,000 for each day and 10,000 for the 5 days. In this light, we can obtain the origin clusters (i.e., boarding clusters) and destination clusters (i.e., dropping off clusters) for the 5 days, as shown in Figure 4.
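The threshold selection can be illustrated with a small sweep that counts clusters for each candidate threshold. This is only a sketch with made-up cell counts, not the data of Table 4. It assumes that a radius of √2 times the cell length makes each cell a neighbor of its eight surrounding cells, so clusters are counted as 8-connected groups of cells whose count reaches the threshold; the function names are hypothetical.

```python
def count_clusters(counts, threshold):
    """Count 8-connected groups of cells whose locating point count >= threshold."""
    core = {cell for cell, n in counts.items() if n >= threshold}
    seen = set()
    n_clusters = 0
    for start in core:
        if start in seen:
            continue
        n_clusters += 1
        seen.add(start)
        stack = [start]
        while stack:
            i, j = stack.pop()
            # Visit the eight surrounding cells (the cell itself is already seen).
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    neighbour = (i + di, j + dj)
                    if neighbour in core and neighbour not in seen:
                        seen.add(neighbour)
                        stack.append(neighbour)
    return n_clusters


def sweep_thresholds(counts, thresholds):
    """Return {threshold: number of clusters} for each candidate threshold."""
    return {t: count_clusters(counts, t) for t in thresholds}


# Made-up daily counts for a handful of cells.
counts = {(1, 1): 2500, (1, 2): 1200, (1, 3): 2100, (3, 3): 600, (5, 5): 3000}
print(sweep_thresholds(counts, [500, 1000, 2000, 2800]))
# {500: 3, 1000: 2, 2000: 3, 2800: 1}
```

Plotting or tabulating the cluster number against the threshold in this way shows the range over which the result stops changing.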

Measuring Public Transit Riders' Travel Pattern.
For a specific OD pair, we can measure the travel pattern between them, that is, the trip number $Tr_{mode}(CL_p \to CL_q)$, the mode proportion $P_{mode}(CL_p \to CL_q)$, and the travel time $Tt_{mode}(CL_p \to CL_q)$, as defined in Section 3.2. Taking a 20 × 17 cells area, for example, as shown in Figure 5, there are
Note. Total is the number of dropping off points in all cells, average is the average number of dropping off points in one cell, minimum is the minimum number of dropping off points in one cell, and maximum is the maximum number of dropping off points in one cell.

Conclusion
This paper presented a cell-based method for measuring urban public transit riders' travel pattern. The method uses locating data from GPS-equipped public transit vehicles, which are more realistic and easier to collect. We proposed a customized DBSCAN algorithm to detect the origin and destination areas. We computed three indicators for each OD pair, which reflect the relationship between the origin area and the destination area. We carried out a numerical case study to evaluate our method, using taxi and bus GPS data from Harbin, China. The results reflect the OD pairs and the relationship between each OD pair.
In the future, we will improve the accuracy and efficiency of the customized DBSCAN algorithm. The travel pattern considered in this paper is relatively simple, so we will explore further indicators in more depth. We also plan to extend this research by utilizing more kinds of GPS data from different public transit vehicles.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.