Analysis of passenger travel habits is an important topic in the traffic field. However, passenger travel patterns can only be observed over a period of time, and in big cities such as Beijing a large number of people travel by public transportation every day, which produces large-scale data that are difficult to process. Using the SPARK platform, this paper proposes a trip reconstruction algorithm and adopts the density-based spatial clustering of applications with noise (DBSCAN) algorithm to mine the travel patterns of each Smart Card (SC) user in Beijing. Because some passengers swipe their cards before arriving in order to avoid the crowd of people alighting at the same destination, an algorithm based on passenger travel frequent items is adopted to guarantee the accuracy of the spatial regular patterns. Finally, this paper puts forward a model based on density and node importance to group bus stations. The transportation connections between the areas formed by these bus stations can be seen with the help of SC data. We hope that this research will contribute to further studies.

Traditional studies on passenger travel patterns and passenger segmentation focus solely on passenger physical characteristics or on transit user surveys. Such classification is of little help in understanding passenger travel habits. Therefore, another method is needed to study temporal and spatial regularity. This method must be based on actual data containing passenger travel information, and SC data meet this need.

SC data gathered by automated fare collection (AFC) systems record valuable travel details. However, passenger travel patterns can only be observed over a period of time, and in big cities such as Beijing a large number of people travel by public transportation every day, which produces large-scale data that are difficult to process. This paper adopts the SPARK platform to solve this problem: several computers are used to build the platform and perform the calculations together.

This paper adopts a systematic approach to mine travel patterns and to find temporal and spatial regularity using SC data. After the literature review in this section, this paper introduces the SC dataset adopted and the method used to drop invalid data. We consider each item in the dataset a transaction. This paper then rebuilds the SC dataset by reconstructing the completed transactions of each SC user into trips, which makes it possible to recognize SC user transfer behaviour. After the reconstruction, the density-based spatial clustering of applications with noise (DBSCAN) algorithm is adopted to mine the travel patterns and obtain the temporal and spatial regularity of each SC user in Beijing. In the spatial dimension, this paper designs an algorithm based on passenger travel frequent items to handle the phenomenon that passengers swipe their cards before arriving to avoid the crowd. Then, we combine the temporal and spatial data to classify SC users by temporal and spatial features. Finally, this paper puts forward a model based on distance and node importance to group bus stations into non-overlapping areas, and we then assign the SC data to each area to investigate the transportation connections between them. In Section

Analysis of SC data has attracted research interest, and much research has been done in the past few years. Catherine Morency et al. (2007) measured transit variability with SC data. They built an object model to understand the relationship between different elements within the transit network and then used k-means clustering to indicate the spatial variability of passengers [

In recent years, with the development of the associated modelling methods, solving technology and computing capabilities, the study of SC data has developed rapidly. Jun Liu et al. (2014) presented a traffic monitoring and analysis system for large-scale networks based on Hadoop, an open-source distributed computing platform for big data processing on commodity hardware [

With the help of big data, researchers can identify passenger patterns derived from data through complex models and algorithms. Le Minh Kieu et al. (2015) adopted the DBSCAN algorithm, which can find clusters of arbitrary shapes based on different parameters, to mine travel patterns from around 34.8 million transactions made by a million SC users over 15000 transit stops of the bus, city train, and ferry networks. They segmented transit passengers into four identifiable types based on this analysis. However, because of its high complexity, the algorithm takes a long time to converge when the dataset is very large [

Section

The SC data used in this paper come from Beijing, one of the largest cities in the world. Six million SC records are collected by AFC every day in this city. The total dataset contains around 150 million transactions at over 7000 bus transit stops from October 1, 2015 to October 30, 2015. The dataset includes the following main fields:

(1) CARDID: The unique SC ID, which has been hashed into a unique number to maintain the privacy of the cardholder.

(2) TRADETIME: The time that passengers swipe their cards when they get on the bus.

(3) MARKTIME: The time that passengers swipe their cards when they get off the bus.

(4) LINEID: The transit routes that passengers take.

(5) TRADESTATION: The station that passengers swipe their cards at when they get on the bus.

(6) MARKSTATION: The station that passengers swipe their cards at when they get off the bus.

The SC dataset contains only passenger information, so it becomes useful when combined with bus operation data. The bus operation data include the following fields:

(7) XLDM: The same as LINEID.

(8) ZM: The name of bus stops.

(9) ZDXH: The same as TRADESTATION and MARKSTATION.

(10) ZDJD: The longitude of bus stops.

(11) ZDWD: The latitude of bus stops.
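To illustrate how the two datasets might be combined, the following minimal sketch looks up the name and coordinates of the boarding and alighting stops of one transaction. It assumes that a stop is identified by the pair (XLDM, ZDXH), that records are plain dictionaries keyed by the field names above, and that the line and station numbers in the usage example are hypothetical; none of these details are confirmed by the data description.

```python
from typing import NamedTuple, Optional


class Stop(NamedTuple):
    name: str   # ZM
    lon: float  # ZDJD
    lat: float  # ZDWD


def build_stop_index(operation_rows):
    """Index bus operation rows by (XLDM, ZDXH); this key is an assumption."""
    return {
        (row["XLDM"], row["ZDXH"]): Stop(row["ZM"], float(row["ZDJD"]), float(row["ZDWD"]))
        for row in operation_rows
    }


def locate_transaction(tx, stop_index):
    """Return (origin stop, destination stop) for one SC transaction.

    Either element is None when the stop is missing from the index.
    """
    origin: Optional[Stop] = stop_index.get((tx["LINEID"], tx["TRADESTATION"]))
    destination: Optional[Stop] = stop_index.get((tx["LINEID"], tx["MARKSTATION"]))
    return origin, destination


# Hypothetical line and station numbers; coordinates taken from the result tables.
operation_rows = [
    {"XLDM": "1", "ZM": "Sihui", "ZDXH": "5", "ZDJD": "116.4903", "ZDWD": "39.9052"},
    {"XLDM": "1", "ZM": "Dabeiyao South", "ZDXH": "9", "ZDJD": "116.4552", "ZDWD": "39.9038"},
]
stop_index = build_stop_index(operation_rows)
origin, destination = locate_transaction(
    {"CARDID": "a1", "LINEID": "1", "TRADESTATION": "5", "MARKSTATION": "9"}, stop_index
)
```

A dictionary lookup keeps the join independent of any particular data-processing library; on a distributed platform the same join would typically be expressed as a keyed join between the two datasets.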

In this paper, we call each item collected by AFC a transaction. One passenger may take buses many times each working day, so there are several transactions for each SC user every day. Sometimes, a single bus trip taken by one passenger contains two or three transactions. Constructing the travel trip from individual transactions is therefore a fundamental problem to solve before mining the travel patterns.

This paper adopts an algorithm to connect the individual transactions. The algorithm follows two principles: first, if two transactions can be connected into one trip, the time interval between the transactions must be less than 60 min [

Here, the first boarding stop and the last alighting stop of a completed trip are defined as the “origin stop” and the “destination stop,” respectively. The time interval between the alighting time of a transaction and the boarding time of the next transaction of the same trip is defined as the transferring time. Figure

Trip reconstruction flowchart.

Check data validity. Because of hardware problems, some of the data collected by AFC cannot be used: items that are exactly the same as another item, items with data missing in any of the fields “MARKTIME, TRADETIME, LINEID, TRADESTATION, and MARKSTATION,” and items in which MARKTIME equals TRADETIME. By checking data validity, this paper drops around 1% of the transaction data.

Set time indicator. Each item in the time indicator represents a date. At first, the time indicator points to the first item, and we select all the data for that date.

All the selected data are grouped by “CARDID.” Each group represents all the transactions an SC user made on one working day, and the transactions in each group are sorted by “MARKTIME.”

The time interval between the current transaction and the previous transaction is calculated. If the time interval is less than 60 min, the destination of the current transaction is compared with the origin of the previous transaction. If they differ, the two transactions are connected into one trip, and we continue connecting transactions into the trip until two transactions no longer satisfy the two principles mentioned above.

After all the groups for this date have been processed, we check whether the time indicator is in the final position. If not, the time indicator moves to the next date and Steps 2–5 are repeated.
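The linking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that transactions are dictionaries with `datetime` values in TRADETIME and MARKTIME, that the time interval is measured from the alighting time of the previous transaction to the boarding time of the current one, and that station identifiers are comparable across lines (stop names are used in the example).

```python
from datetime import datetime, timedelta
from itertools import groupby

MAX_INTERVAL = timedelta(minutes=60)  # first principle: interval under 60 min


def reconstruct_trips(transactions):
    """Link each SC user's daily transactions into trips (lists of legs)."""
    def user_day(tx):
        return (tx["CARDID"], tx["TRADETIME"].date())

    trips = []
    for _, group in groupby(sorted(transactions, key=user_day), key=user_day):
        legs = sorted(group, key=lambda tx: tx["MARKTIME"])
        current = [legs[0]]
        for prev, cur in zip(legs, legs[1:]):
            interval = cur["TRADETIME"] - prev["MARKTIME"]
            # Destination of the current leg must differ from the origin of
            # the previous leg, as described in the text.
            linked = interval < MAX_INTERVAL and cur["MARKSTATION"] != prev["TRADESTATION"]
            if linked:
                current.append(cur)
            else:
                trips.append(current)
                current = [cur]
        trips.append(current)
    return trips


def at(hour, minute):
    return datetime(2015, 10, 12, hour, minute)


sample = [
    {"CARDID": "a1", "TRADETIME": at(8, 0), "MARKTIME": at(8, 20),
     "TRADESTATION": "X", "MARKSTATION": "Y"},
    {"CARDID": "a1", "TRADETIME": at(8, 30), "MARKTIME": at(8, 50),
     "TRADESTATION": "Y", "MARKSTATION": "Z"},
    {"CARDID": "a1", "TRADETIME": at(18, 0), "MARKTIME": at(18, 30),
     "TRADESTATION": "Z", "MARKSTATION": "X"},
]
trips = reconstruct_trips(sample)
```

In this example the first two transactions form one trip with origin stop X and destination stop Z, and the evening transaction forms a second trip. Because each (CARDID, date) group is processed independently, the groups could be distributed across the machines of a cluster.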

A passenger may take several journeys with the same origin and destination at different times of one day, which strongly influences the analysis. This paper therefore defines a passenger's travel times as the number of days with travel records; that is, for each person, one travel comprises all the trips recorded on one working day. Under this definition, the number of travel times of a passenger cannot exceed the number of working days in one month.
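As a small illustration of this definition, the sketch below counts the number of distinct days with travel records for each card. The input format, a list of (CARDID, date) pairs with one pair per reconstructed trip, is an assumption made for the example.

```python
from collections import defaultdict
from datetime import date


def travel_times(trip_days):
    """Count distinct travel days per card from (CARDID, date) pairs."""
    days_by_card = defaultdict(set)
    for card_id, day in trip_days:
        days_by_card[card_id].add(day)
    return {card_id: len(days) for card_id, days in days_by_card.items()}


trip_days = [
    ("a1", date(2015, 10, 12)),  # morning trip
    ("a1", date(2015, 10, 12)),  # evening trip on the same day
    ("a1", date(2015, 10, 13)),
    ("b2", date(2015, 10, 12)),
]
counts = travel_times(trip_days)
```

Card `a1` has three trips but only two travel times, because two of its trips fall on the same working day.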

This part adopts DBSCAN algorithm [

(1) If a passenger travels regularly, he or she must travel by bus several times within a certain time. The DBSCAN algorithm is well suited to this kind of data mining. The number of “a certain time” is the

(2) A passenger may take the bus to deal with individual random events. These trips differ considerably from regular trips, and we call them “noise.” The algorithm can find the clusters (regular patterns) and deal with the varying noise effectively.

(3) This algorithm can identify a cluster of any shape and size. This means that we can use it to obtain various regular travel patterns while taking travel frequency into consideration.

(4) This algorithm does not require the initial cores or the number of clusters to be determined in advance. This feature is also essential for travel pattern analysis because the number of patterns of an individual passenger is unknown.

(5) Because of the high complexity of the DBSCAN algorithm, this paper extends the algorithm to a distributed platform: we gather SC data by CARDID to form groups and then process each passenger's travel data within its group. After this change, we can use a computer cluster to mine SC data. Each computer in the cluster processes the data of several groups to increase the speed of calculation.
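The properties listed above can be illustrated with a compact, pure-Python DBSCAN applied to one passenger's boarding times (in minutes after midnight). This is a sketch of the standard algorithm, not the authors' distributed implementation; the function names and the values chosen for `eps` and `min_pts` are hypothetical.

```python
def dbscan_1d(values, eps, min_pts):
    """Label each value with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(values)
    labels = [None] * n

    def neighbours(i):
        # The neighbourhood of a point includes the point itself.
        return [j for j in range(n) if abs(values[i] - values[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: expand
                queue.extend(reachable)
    return labels


def regular_time_patterns(boardings_by_card, eps=15, min_pts=4):
    """Run DBSCAN independently on each CARDID group."""
    return {card: dbscan_1d(times, eps, min_pts)
            for card, times in boardings_by_card.items()}


patterns = regular_time_patterns({
    "a1": [480, 485, 490, 478, 900],  # four morning boardings, one random trip
    "b2": [600, 1000],                # too few trips to form a pattern
})
```

Each CARDID group is clustered on its own, so the per-card calls are independent of one another and can be spread over several machines, which is the idea behind item (5).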

For a D-dimension dataset containing N points:

(1)

(2) Density of point

(3) Core point

The goal of DBSCAN algorithm is to find out the whole core points in dataset

For SC data analysis,
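For reference, the textbook DBSCAN definitions that correspond to the neighbourhood, density, and core-point notions listed above can be written as follows. The symbols $D$, $\varepsilon$, $\mathit{MinPts}$, $N_\varepsilon$, and $\rho$ are assumptions, since the notation used in the original formulas is not recoverable here.

```latex
% Standard DBSCAN definitions (assumed notation)
\begin{align}
  N_\varepsilon(p) &= \{\, q \in D : \operatorname{dist}(p, q) \le \varepsilon \,\}
      && \text{($\varepsilon$-neighbourhood of point $p$)} \\
  \rho(p) &= \lvert N_\varepsilon(p) \rvert
      && \text{(density of point $p$)} \\
  p \text{ is a core point} &\iff \rho(p) \ge \mathit{MinPts}
      && \text{(core point)}
\end{align}
```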

This part also adopts DBSCAN algorithm to analyse passenger travel behaviour. Because the density of the bus station in Beijing is large and the frequency of buses is high, there is no need for passengers to choose another station to board or alight. So the

We gather SC data by CARDID for each user to form a 3-dimension set containing N points. The 3 dimensions are LINEID marked

The number of occurrences of each element in the set is calculated; the elements appearing more than four times are selected as a candidate frequent set, and the other elements are put into an infrequent set. A new dimension, the number of occurrences, is added to the candidate frequent set, so the candidate frequent set can be expressed as

The distance between different elements in the candidate frequent set is calculated, which can be expressed as follows:

The distance between each element in the infrequent set and each element in the frequent set is calculated. If
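The frequency split at the start of this procedure can be sketched as below. The sketch covers only the counting step, not the distance-based step. It assumes that each element is a (LINEID, TRADESTATION, MARKSTATION) tuple, which is a guess at the three dimensions, and the function name is hypothetical.

```python
from collections import Counter


def split_by_frequency(elements, threshold=4):
    """Split elements into a candidate frequent set and an infrequent set.

    Elements occurring more than `threshold` times are candidates; the
    occurrence count is kept as the added dimension.
    """
    counts = Counter(elements)
    frequent = {e: c for e, c in counts.items() if c > threshold}
    infrequent = {e: c for e, c in counts.items() if c <= threshold}
    return frequent, infrequent


# Hypothetical month of trips for one card: (LINEID, TRADESTATION, MARKSTATION).
elements = [("1", "5", "9")] * 12 + [("1", "5", "8")] * 3 + [("57", "2", "14")]
frequent, infrequent = split_by_frequency(elements)
```

Here the element that appears twelve times becomes the candidate frequent item, while the element that differs only in the alighting station (8 instead of 9) lands in the infrequent set. That is the kind of record an early card swipe would produce.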

In a complex bus transit network, we define the number of different lines passing through a bus station as the node importance
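Under this definition, node importance can be computed directly from the bus operation data. The sketch below assumes that a physical station is identified by its name (ZM) and that rows are dictionaries keyed by the field names introduced earlier; the line numbers in the example are hypothetical.

```python
from collections import defaultdict


def node_importance(operation_rows):
    """Number of distinct lines (XLDM) passing through each station (ZM)."""
    lines_by_stop = defaultdict(set)
    for row in operation_rows:
        lines_by_stop[row["ZM"]].add(row["XLDM"])
    return {stop: len(lines) for stop, lines in lines_by_stop.items()}


operation_rows = [
    {"XLDM": "1", "ZM": "Sihui"},
    {"XLDM": "57", "ZM": "Sihui"},
    {"XLDM": "57", "ZM": "Sihui"},  # same line listed twice, counted once
    {"XLDM": "1", "ZM": "Dabeiyao South"},
]
importance = node_importance(operation_rows)
```

Using a set per station ensures that a line serving the same station in both directions is counted once.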

To gather different nodes, this paper puts forward an algorithm based on density and node importance. The flow of the algorithm is as follows:

This step is similar to DBSCAN. A 3-dimension set containing all bus stops is built, whose dimensions are node importance, longitude, and latitude. The dataset can be expressed as

For bus stop

After applying the trip reconstruction algorithm, this paper finds that the total number of passengers who travelled by bus on working days in October is 11966945. Around 30% of passengers travelled by bus on only one working day, and around 18% of passengers had travel records on more than ten working days. The number of passengers changes little when the number of travelling working days is between 10 and 17. The details are shown in Figure

The relationship between passenger number and passenger travel times.

The DBSCAN algorithm is used to analyse the travel behaviour for each passenger and identify whether passengers travel regularly. Passenger number with regular travel time or passenger number with regular travel ODs appears below the sum number of passengers whose trips have a certain

According to the DBSCAN algorithm introduced above, passenger travel regularity in temporal dimension is analyzed. Passenger number with regular time for different travel times is shown in Figure

The relationship between passenger number with regular time and travel times.

Figure

Passenger number variation with regular travel time (X axis) for different time (Y axis) and travel times (Z axis).

Passenger number variation with regular travel time (X axis) for different time (Y axis) and regular travel times (Z axis).

In the temporal dimension, we can see that the morning peak hour begins at 9:00 for irregular passengers and that the number of passengers varies little during the day (8:00-20:00). This is quite different from the behaviour of regular passengers. More details are shown in Figure

Passenger number variation with different time (X axis) and different date (Y axis).

According to the DBSCAN and frequent items algorithm introduced above, passenger number with regular travel ODs is calculated. Passenger number with regular travel ODs for different travel times is shown in Figure

Passenger number variation with regular ODs for different travel times.

Figure

Five origin stops and five destination stops (X axis) with the largest passenger number with regular ODs (Z axis) in the morning for different travel times (Y axis).

Five origin stops and five destination stops (X axis) with the largest passenger number with regular ODs (Z axis) in the evening for different travel times (Y axis).

In the spatial dimension, we count the number of irregular passengers at the 5 largest bus stops on ODs per working day. From Figure

Five origin stops and five destination stops (X axis) with the largest passenger number with regular ODs (Z axis) for different date (Y axis).

This paper groups passengers with regular ODs according to their travel time (morning or evening) and ODs. We then choose the top five origin stops and destination stops with the most passengers for a detailed analysis. Sihui station belongs to both the top five origin stops and the top five destination stops. The distribution and detailed data of the top five stops in the morning are shown in Figure

Detail data for the top five origin stops and destination stops with the most passengers in the morning.

Stop | Lon[°] | Lat[°] | Passenger number (D) | Passenger number (O) | Stop type |
---|---|---|---|---|---|
①Sihui | 116.4903 | 39.9052 | 13228 | 9295 | OD |
②Xierqi | 116.3014 | 40.0496 | 5195 | 6829 | O |
③Qinghe | 116.3417 | 40.0290 | 3630 | 6601 | O |
④Shahebeidaqiao | 116.2626 | 40.1290 | 1820 | 5178 | O |
⑤Liuliqiao North | 116.3040 | 39.8887 | 7078 | 5022 | O |
⑥Dabeiyao South | 116.4552 | 39.9038 | 15842 | 4755 | D |
⑦Liuliqiao East | 116.3114 | 39.8865 | 11415 | 3837 | D |
⑧Liangjiayuan | 116.4641 | 39.9071 | 10051 | 2405 | D |
⑨Xidiaoyutai | 116.2936 | 39.9226 | 9122 | 3883 | D |

The top five origin stops and destination stops with the most passengers in the morning.

From Figure

Detail data for the top five origin stops and destination stops with the most passengers in the evening.

Stop | Lon[°] | Lat[°] | Passenger number (D) | Passenger number (O) | Stop type |
---|---|---|---|---|---|
①Dabeiyao South | 116.4552 | 39.9038 | 3703 | 14671 | O |
②Liuliqiao East | 116.3114 | 39.8865 | 4091 | 12597 | O |
③Sihui | 116.4903 | 39.9052 | 7459 | 12259 | OD |
④Liangjiayuan | 116.4641 | 39.9071 | 1982 | 9707 | O |
⑤Dongzhimen | 116.4302 | 39.9408 | 2729 | 8239 | O |
⑥Tongzhoubeiyuan | 116.6337 | 39.9051 | 5118 | 2426 | D |
⑦Qinghe | 116.3417 | 40.0290 | 5457 | 3022 | D |
⑧Liuliqiao North | 116.3040 | 39.8887 | 5101 | 6730 | D |
⑨Beigao | 116.5063 | 40.0101 | 4589 | 1121 | D |

The top five origin stops and destination stops with the most passengers in the evening.

According to the analysis in both temporal and spatial dimension, different types of passengers can be obtained. Some passengers travel only with regular time, some of them travel only with regular ODs, and some of them travel regularly in both dimensions, while others travel without regularity. Passenger number for these four types is shown in Figure

The relationship between four type passenger number and passenger travel times.
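The four passenger types follow directly from the two regularity results. A minimal sketch, assuming each passenger has already been flagged as regular or not in each dimension (the function and label names are hypothetical):

```python
def passenger_type(regular_time: bool, regular_od: bool) -> str:
    """Map the temporal and spatial regularity flags to one of four types."""
    if regular_time and regular_od:
        return "regular time and regular ODs"
    if regular_time:
        return "regular time only"
    if regular_od:
        return "regular ODs only"
    return "no regularity"


types = [passenger_type(t, od) for t in (True, False) for od in (True, False)]
```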

This paper clusters the bus stations. As a result, more than 2000 areas are identified, containing around 7000 bus stops. According to the result of area clustering, this paper analyses the top five origin areas and destination areas with the most passengers and the traffic connections between these areas and other areas in the morning and evening. In the morning, the number of passengers with regular destinations at the top 5 destination areas is larger than the number with regular origins at the top 5 origin areas, which indicates that passengers coming from many different areas share a few common destinations. The result in the evening is the opposite of that in the morning. The distribution of the areas is shown in Figures

Detail data for the top five origin areas and the top five destination areas in the morning.

Core Stop | OD Type | Passenger number with regular origins | Passenger number with regular destinations |
---|---|---|---|
①Liuliqiao North | OD | 21902 | 35248 |
②Dabeiyao East | OD | 20900 | 57376 |
③Qinghe | O | 19044 | 11036 |
④Yanhuang Museum | O | 17820 | 21272 |
⑤Sihui | O | 17744 | 27048 |
⑥Zhongguancun South | D | 11256 | 36731 |
⑦Dongsanqi South | D | 12478 | 28126 |
⑧Sanyuanqiao | D | 17035 | 27860 |

Detail data for the top five origin areas and the top five destination areas in the evening.

Core Stop | OD Type | Passenger number with regular origins | Passenger number with regular destinations |
---|---|---|---|
①Dabeiyao East | OD | 52139 | 19781 |
②Liuliqiao North | OD | 32129 | 16167 |
③Zhongguancun South | O | 23828 | 9896 |
④Sihui | O | 23350 | 13759 |
⑤Dongzhimen | O | 18466 | 11463 |
⑥Tongzhoubeiyuan East | D | 13350 | 17211 |
⑦Qinghe | D | 9946 | 15493 |
⑧Tongzhoubeiyuan West | D | 4732 | 13790 |

The top five origin areas and the top five destination areas with the most passengers in the morning.

The top five origin areas and the top five destination areas with the most passengers in the evening.

The paper studies the traffic links between different areas according to the OD data, and the study leads to some interesting conclusions. Among the passengers whose regular destination is the CBD area (core stop Dabeiyao East), 5% and 4.5% came from the Tongzhoubeiyuan East area and the Tongzhoubeiyuan West area, respectively, which are the two areas most closely connected to the CBD. In the evening, 6.6% and 5.4% of the passengers returned from the CBD area to the Tongzhoubeiyuan West area and the Tongzhoubeiyuan East area, respectively. The distance between the CBD and Tongzhoubeiyuan is around 15 km. In the morning, 24.0% of the passengers whose regular origin area is Sihui went to other stops within the same area. In the morning, 70% of the passengers whose regular destination is Dongsanqi South came from several areas located within 4 km of the Dongsanqi South area. This means that the Dongsanqi South area is a core traffic area that gathers many passengers from other areas, who then take the subway to the city center. More detailed data are shown in Table

The top five origin and destination areas in the morning and the most closely connected area to them.

Top 5 areas | OD Type | Boarding (alighting) passenger number | Closest connecting area | Alighting (boarding) passenger number | Dist (m) | Prop (%) |
---|---|---|---|---|---|---|
Liuliqiao North | O | 21902 | Yungang | 820 | 15651 | 3.7 |
Dabeiyao East | O | 20900 | Ritanlu | 831 | 1955 | 4.0 |
Qinghe | O | 19044 | Chengfulu | 1200 | 4431 | 6.3 |
Yanhuang Museum | O | 17820 | Anzhenqiao | 1158 | 3268 | 6.5 |
Sihui | O | 17744 | Sihui | 4256 | 0 | 24.0 |
Dabeiyao East | D | 57376 | Tongzhou | 2810 | 14954 | 4.9 |
Zhongguancun South | D | 36731 | Beijing Sport University | 1992 | 4391 | 5.4 |
Liuliqiao North | D | 35248 | Yungang | 1857 | 15651 | 5.3 |
Dongsanqi South | D | 28126 | Tiantong | 2766 | 2189 | 9.8 |
Sanyuanqiao | D | 27860 | Beigao | 1520 | 7752 | 5.5 |
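The proportion column in the table above appears to be the passenger number of the closest connecting area divided by the passenger number of the top-five area, expressed as a percentage. The short check below recomputes it for three rows; the function name is hypothetical, and the interpretation of the column is an inference from the tabulated values rather than a stated definition.

```python
def connection_share(area_passengers: int, connected_passengers: int) -> float:
    """Share (%) of an area's passengers linked to its closest connecting area."""
    return round(100 * connected_passengers / area_passengers, 1)


# (area passenger number, closest-area passenger number) taken from the table.
rows = {
    "Liuliqiao North (O)": (21902, 820),
    "Sihui (O)": (17744, 4256),
    "Dongsanqi South (D)": (28126, 2766),
}
shares = {area: connection_share(*numbers) for area, numbers in rows.items()}
```

For example, 820 / 21902 ≈ 0.0374, which matches the 3.7% listed for Liuliqiao North as an origin area.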

In this paper, four algorithms are used to analyze the temporal and spatial regularity of passengers travelling by bus, based on large-scale SC data, and the traffic relationship between different traffic areas. First, this paper proposes a trip reconstruction algorithm that gathers SC data by CARDID to improve calculation efficiency on the SPARK platform, and it analyses the travel times of bus passengers, that is, the number of days with travel records in 18 working days. The proportion of passengers with different travel times is obtained from this study. In the temporal dimension, the proportion of passengers who travelled regularly is obtained, and the relationship between this proportion and passenger travel times is described. In the spatial dimension, this paper proposes a data recognition algorithm based on frequent items to improve the accuracy of SC data and draws conclusions similar to those in the temporal dimension. According to the temporal and spatial regularities of passengers, passengers are divided into four types: passengers with regular travel time only, passengers with regular ODs only, passengers with both regular travel time and regular ODs, and passengers without regularity. The number of passengers of each of the four types is also obtained. The paper divides the bus stops into areas according to the distance between bus stops and node importance, mainly analyses the passengers with both regular travel time and regular ODs, and determines the traffic connections between different areas.

The data used in this paper came from the Beijing public transportation group. The data can be used for scientific research only with permission, and they are not available through a public database or website.

The authors declare that they have no conflicts of interest.

This research is supported by National Key Technologies Research & Development program (2017YFC0804900).