When passengers take the subway several times within a short period, miss a card swipe on entry or exit, or stay in a station for a long time, the conventional table join method cannot set an accurate time threshold and therefore cannot correctly match passengers' OD pairs. To solve these problems, this paper proposes an OD matching method based on analysis function.
With rapid economic development, rail transit has become a major trend in urban development. Passenger flow analysis is the basis for the safe and reliable operation of urban rail transit. Predicting and accurately grasping passenger flow characteristics and their evolution provides a basis for scientific operating plans, such as departure intervals and departure frequencies. Passenger flow analysis relies on raw data, and improving the management and service level of an urban rail system depends on the comprehensive collection, analysis, and application of metro travel data. Matching and computing the origin-destination (OD) matrix is the most important and basic step. When passengers take the subway, they swipe a card on entering and on leaving the station, and the two card records are stored in the same table. Ideally, every inbound record has a corresponding outbound record. In reality, however, because of human error or equipment failure, inbound and outbound records do not correspond completely. As shown in Figure
Comparison of IC card data of Xiamen metro from March to August in 2019.
As an alternative to the method based on table join, this paper proposes an OD matching algorithm based on analysis function for metro IC card data analysis.
The primary contributions of this paper can be summarized as follows: (1) An OD matching method based on analysis function is proposed for the data mining and OD matching of metro IC card data. (2) We analyze the differences, advantages, and disadvantages of the OD matching methods based on table join and on analysis function. (3) We conduct a theoretical analysis of the characteristics of the OD matching method based on analysis function. (4) Starting from the OD results calculated by the analysis function method, we artificially remove part of the data to construct a dataset, apply OD matching based on table join and on analysis function to the constructed dataset, and verify the effectiveness of the two algorithms by comparing the results.
At present, many scholars have studied metro IC card data. IC card data record passengers' card information. Each time a passenger swipes a card to enter or leave a station, one record is saved in the system; that is, the inbound and outbound swipes are stored as separate records. Each record contains important information, such as the card ID, time, inbound or outbound flag, and station ID. Based on the IC card data, Shin [
Yang et al. [
The studies above all address their research problems using metro IC card data. This paper aims to achieve complete matching of passengers' OD pairs from metro IC card data alone, to identify abnormal records with a missing inbound or outbound swipe, and to construct a standard dataset for verification.
The rest of the paper is organized as follows: Section
The symbols used are defined as follows:
Taking Xiamen metro IC card data as an example,
Metro IC card data.
20190427 | 51688 | 20 | 7 | 108 |
20190427 | 51964 | 20 | 8 | 112 |
20190427 | 52803 | 20 | 7 | 109 |
20190427 | 54023 | 20 | 8 | 103 |
20190427 | 65244 | 20 | 7 | 105 |
A passenger’s complete subway journey produces two card records: one on entering the station and one on leaving it. According to the records in Table
Matched inbound and outbound card records obtained by OD matching.
106 | 20190916 | 07:35:06 | 108 | 07:44:08 | 2.00 |
112 | 20190916 | 07:36:16 | 101 | 07:58:31 | 4.00 |
114 | 20190916 | 08:45:44 | 102 | 09:04:39 | 5.00 |
105 | 20190916 | 08:34:36 | 109 | 08:45:28 | 2.00 |
Records with an outbound card swipe but no corresponding inbound card swipe.
20190916 | 07:25:02 | 8 | 114 |
20190916 | 07:26:17 | 8 | 101 |
20190916 | 07:45:34 | 8 | 110 |
20190916 | 08:32:11 | 8 | 109 |
Records with an inbound card swipe but no corresponding outbound card swipe.
20190916 | 07:25:02 | 7 | 109 |
20190916 | 07:26:17 | 7 | 112 |
20190916 | 07:45:34 | 7 | 106 |
20190916 | 08:32:11 | 7 | 105 |
There are three cases in the metro IC card records: (1) the passenger swipes the card normally when entering and leaving the station; (2) the passenger swipes the card when entering the station, but the outbound swipe is missing or its data is lost; (3) the passenger swipes the card when leaving the station, but the inbound swipe is missing or its data is lost.
Cases 2 and 3 are invalid data: they cannot form a complete boarding and alighting record and must be eliminated during OD matching. Only the data in case 1 can be matched successfully. Cases 2 and 3 are rare, and most of the subway data belong to case 1.
The table join method connects inbound and outbound card records under certain conditions to form a complete record of a passenger boarding and alighting. Its flow chart is shown in Figure Data preparation: select the fields of card number, transaction date, transaction time, inbound and outbound flag, and station ID from the original IC card data, and convert the transaction time into seconds since midnight, for example 7:35 a.m. to 27,300 seconds. Table join: according to the field
OD matching flow chart of metro IC card data based on table join.
In the table join method, 5,400 seconds (1.5 hours) is taken as the threshold; that is, the whole process (swiping the card at the station, waiting for the train, boarding, arriving at the destination, and swiping the card at the exit) is assumed not to exceed 5,400 seconds. This choice has the following problems: (1) There is no empirical basis for selecting 5,400 seconds. (2) In reality, some subway trips last longer than 5,400 seconds, and their records are lost because they exceed the threshold. (3) When a passenger makes several subway trips within 5,400 seconds, additional wrong connections are produced.
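The threshold problems described above can be reproduced with a minimal Python sketch of a threshold-based self-join. This is not the paper's SQL: the record layout, the card numbers, and the helper names `to_seconds` and `threshold_join` are assumptions made for illustration, and the flag values 7 (inbound) and 8 (outbound) are read from the sample tables.

```python
from typing import NamedTuple

THRESHOLD = 5400          # seconds (1.5 hours), the threshold discussed above
INBOUND, OUTBOUND = 7, 8  # flag values as they appear in the sample tables


class Swipe(NamedTuple):
    card: str     # hypothetical card number
    seconds: int  # transaction time in seconds since midnight
    flag: int     # 7 = inbound, 8 = outbound
    station: int


def to_seconds(hhmmss: str) -> int:
    """Convert 'HH:MM:SS' to seconds since midnight."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h * 3600 + m * 60 + s


def threshold_join(swipes, threshold=THRESHOLD):
    """Pair every inbound swipe with every later outbound swipe of the
    same card that falls within the threshold (a plain self-join)."""
    ins = [s for s in swipes if s.flag == INBOUND]
    outs = [s for s in swipes if s.flag == OUTBOUND]
    return [
        (i, o)
        for i in ins
        for o in outs
        if i.card == o.card and 0 < o.seconds - i.seconds <= threshold
    ]


swipes = [
    # Card A: two short trips inside one 5,400-second window.
    Swipe("A", to_seconds("07:35:06"), INBOUND, 106),
    Swipe("A", to_seconds("07:44:08"), OUTBOUND, 108),
    Swipe("A", to_seconds("08:10:00"), INBOUND, 108),
    Swipe("A", to_seconds("08:20:00"), OUTBOUND, 106),
    # Card B: one trip longer than 5,400 seconds.
    Swipe("B", to_seconds("07:00:00"), INBOUND, 101),
    Swipe("B", to_seconds("09:00:00"), OUTBOUND, 114),
]

pairs = threshold_join(swipes)
# The first inbound swipe of card A is also joined to the second outbound swipe.
wrong = [(i, o) for i, o in pairs if (i.station, o.station) == (106, 106)]
lost = [s for s in swipes if s.card == "B"]  # never joined: 7,200 s > 5,400 s
```

In this toy input, card A yields one redundant pair in addition to its two real trips, and card B's trip is dropped entirely, which illustrates problems (2) and (3) without relying on any particular database.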
Oracle
The usage of the lag function is shown above
The metro OD matching algorithm based on the analysis function mainly uses the lag function of Oracle in place of the table join. In this method, the dataset is sorted in ascending order by card number and card swiping time, and the lag function is then used to fetch, for each row, the values of five related fields from the preceding row. Finally, matching is completed according to the OD conditions on the card number, time, and inbound and outbound flag fields. Its flow chart is shown in Figure Select the fields of card number, transaction date, transaction time, inbound and outbound flag, and station ID from the original data table, and convert the transaction time into seconds. For The correct OD pairs are selected according to the following four conditions:
Metro OD matching process based on analysis function.
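The sort-then-lag idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's Oracle SQL: the `lag` helper is a hand-written analogue of the analytic function, the record layout and card numbers are hypothetical, and the matching conditions used here (same card, same date, previous record inbound and current record outbound, outbound time later than inbound time) are assumed stand-ins for the paper's four conditions.

```python
from typing import NamedTuple

INBOUND, OUTBOUND = 7, 8  # flag values as they appear in the sample tables


class Swipe(NamedTuple):
    card: str     # hypothetical card number
    date: str     # transaction date, e.g. "20190916"
    seconds: int  # transaction time in seconds since midnight
    flag: int     # 7 = inbound, 8 = outbound
    station: int


def lag(rows, offset=1):
    """Pure-Python analogue of LAG: yield (previous_row, row), where
    previous_row is None for the first `offset` rows."""
    for idx, row in enumerate(rows):
        yield (rows[idx - offset] if idx >= offset else None), row


def match_od(swipes):
    """Return (od_pairs, only_inbound, only_outbound)."""
    rows = sorted(swipes, key=lambda s: (s.card, s.date, s.seconds))
    pairs, matched = [], set()
    for idx, (prev, cur) in enumerate(lag(rows)):
        # Assumed matching conditions (see the lead-in above).
        if (
            prev is not None
            and prev.card == cur.card
            and prev.date == cur.date
            and prev.flag == INBOUND
            and cur.flag == OUTBOUND
            and cur.seconds > prev.seconds
        ):
            pairs.append((prev, cur))
            matched.update((idx - 1, idx))
    # Unmatched rows are classified by their flag.
    only_in = [s for i, s in enumerate(rows) if i not in matched and s.flag == INBOUND]
    only_out = [s for i, s in enumerate(rows) if i not in matched and s.flag == OUTBOUND]
    return pairs, only_in, only_out


swipes = [
    Swipe("A", "20190916", 27306, INBOUND, 106),   # two trips in a short time
    Swipe("A", "20190916", 27848, OUTBOUND, 108),
    Swipe("A", "20190916", 29400, INBOUND, 108),
    Swipe("A", "20190916", 30000, OUTBOUND, 106),
    Swipe("B", "20190916", 25200, INBOUND, 101),   # trip of 7,200 seconds
    Swipe("B", "20190916", 32400, OUTBOUND, 114),
    Swipe("C", "20190916", 26702, INBOUND, 109),   # outbound swipe missing
    Swipe("D", "20190916", 26702, OUTBOUND, 114),  # inbound swipe missing
]

pairs, only_in, only_out = match_od(swipes)
```

Because each record is compared only with its immediate predecessor, no time threshold is needed: the long trip of card B is still matched, card A's two trips are not cross-connected, and the unmatched records of cards C and D fall into separate lists.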
Taking the metro IC card data in May 2019 as an example, the OD matching SQL statement based on analysis function is as follows:
The following SQL statement uses the
The following SQL statement fetches and stores the data that meets the four conditions of the OD pair to table “metro201905_final,” with the structure shown in Table
Compared with the table join method, which requires a time threshold, the analysis function method proposed in this paper needs no join conditions or thresholds. It can accurately distinguish the three cases of metro IC card records, and it does not discard records because they exceed a threshold. Table
Theoretical comparison of different methods for metro OD matching.
Method | Is accuracy affected by the selected threshold? | Does the method need a separate threshold for each dataset? | Can missing records be classified? |
---|---|---|---|
Based on table join | Yes | Yes | No |
Based on analysis function | No | No | Yes |
In metro OD matching, only records with both an inbound and an outbound swipe can be matched; records with a missing inbound or outbound swipe cannot. However, the original metro data table does not indicate which inbound or outbound records are missing, so the specific missing data are unknown. Consequently: (1) it is impossible to know which records can form complete OD pairs; (2) the correctness of the OD results obtained by different matching methods cannot be proven.
Therefore, we need a dataset for which the missing data are known exactly (including the amount of missing data and the number of missing records), so that the OD results obtained by different methods can be compared with the true OD results of that dataset to establish their correctness.
The flow chart for building the dataset is shown in Figure
Step 1: select the card number, transaction date, arrival time, card type, and inbound station ID fields from the completely matched metro OD dataset. The inbound time field is named as the transaction date, and the inbound station ID field is named as the station ID, and these data are stored in the inbound information table.
Step 2: add the inbound and outbound flag field to the inbound information table, and set this field to inbound for all rows of the table.
Step 3: select the card number, transaction date, outbound time, card type, and outbound station ID fields from the completely matched OD dataset. The outbound time field is named as the transaction date, and the outbound station ID field is named as the station ID, and these data are stored in the outbound information table.
Step 4: add the inbound and outbound flag field to the outbound information table, and set this field to outbound for all rows of the table.
Step 5: merge the inbound and outbound information tables into a single inbound/outbound table, and shuffle the rows of the table randomly once.
Step 6: randomly extract a given proportion of the inbound data.
Step 7: from the data remaining after the extracted inbound data are removed, randomly extract a portion of the outbound data whose card numbers differ from those of the inbound data just extracted.
Step 8: after the extracted outbound data are removed, the remaining data form the constructed dataset.
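The eight steps above can be sketched in Python as follows. This is a simplified illustration rather than the authors' implementation: the function name `build_dataset`, the dictionary keys, the extraction ratios, and the fixed random seed are all assumptions, and the card type field is omitted for brevity.

```python
import random

INBOUND, OUTBOUND = 7, 8  # flag values as they appear in the sample tables


def build_dataset(od_pairs, in_ratio, out_ratio, seed=0):
    """od_pairs: list of dicts with keys card, date, in_time, in_station,
    out_time, out_station. Returns (dataset, removed_in, removed_out)."""
    rng = random.Random(seed)
    # Steps 1-4: split each OD pair into one inbound and one outbound record.
    inbound = [(p["card"], p["date"], p["in_time"], INBOUND, p["in_station"])
               for p in od_pairs]
    outbound = [(p["card"], p["date"], p["out_time"], OUTBOUND, p["out_station"])
                for p in od_pairs]
    # Step 5: merge the two tables and shuffle once.
    records = inbound + outbound
    rng.shuffle(records)
    # Step 6: randomly extract a proportion of the inbound records.
    removed_in = rng.sample(inbound, round(len(inbound) * in_ratio))
    removed_cards = {r[0] for r in removed_in}
    # Step 7: extract outbound records whose card numbers differ from
    # those of the inbound records extracted in step 6.
    candidates = [r for r in outbound if r[0] not in removed_cards]
    n_out = min(len(candidates), round(len(outbound) * out_ratio))
    removed_out = rng.sample(candidates, n_out)
    # Step 8: the remaining records form the constructed dataset.
    removed = set(removed_in) | set(removed_out)
    dataset = [r for r in records if r not in removed]
    return dataset, removed_in, removed_out


# Hypothetical input: 100 complete trips, one per card.
od_pairs = [
    {"card": f"C{n:03d}", "date": "20190916", "in_time": 27000 + n,
     "in_station": 101, "out_time": 28000 + n, "out_station": 114}
    for n in range(100)
]
dataset, removed_in, removed_out = build_dataset(od_pairs, in_ratio=0.10, out_ratio=0.05)

cards_in = {r[0] for r in dataset if r[3] == INBOUND}
cards_out = {r[0] for r in dataset if r[3] == OUTBOUND}
complete = cards_in & cards_out  # cards that still have both swipes
```

Because the removed inbound and outbound records belong to different cards, the number of complete pairs left in the constructed dataset equals the original number of pairs minus both removal counts, so the ground truth is known exactly.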
The dataset construction process of complete metro OD data eliminating some inbound and outbound information.
Data volume of all datasets used.
Dataset | Time | Total |
---|---|---|
IC card data of metro in April | April 2019 | 8709743 |
IC card data of metro in May | May 2019 | 9494211 |
Static site information data | — | 24 |
Card main type data | — | 10 |
Note: the main types of cards include one-way ticket, stored value ticket, period ticket, commemorative ticket, employee ticket, e-card, financial IC card, transportation card, and e-ticket.
Research object: the IC card data of rail transit in April and May 2019 in Xiamen City, Fujian Province, are selected. Data details are shown in Table
Dataset: select the metro IC card data, static station information data, and static card type data in April and May 2019.
Using the original IC card data, the OD results obtained by the analysis function method are compared with the OD results based on the table join method.
First of all, a correct metro IC card dataset is constructed; that is, each passenger’s inbound card has a corresponding outbound card swipe, and there is no error or omission in the data. Then the OD matching of the dataset is carried out by using the method based on table join and the method based on analysis function. Finally, the results are verified with the real OD results.
First, a metro IC card dataset with missing swipes is constructed; that is, some trips lack either the inbound or the outbound record. Then OD matching is carried out on the dataset using the method based on table join and the method based on analysis function. Finally, the results are verified against the real OD results.
The accuracy of matching;
Two evaluation indexes are used: the rate of wrong connections caused by the same passenger taking the subway several times within 5,400 seconds, and the ratio of records left unconnected because the trip exceeds 5,400 seconds. The specific calculation formula is as follows:
In the above equation,
The OD matching results of metro IC card data in April and May 2019 based on table join and analysis function are shown in Tables
OD matching results of two methods based on April 2019 data.
Method | ||||||
---|---|---|---|---|---|---|
Based on table join | 8709743 | 4409973 | 92050 | 10486 | 2.09% | 0.12% |
Based on analysis function | 8709743 | 4328409 | 0 | 0 | 0 | 0 |
OD matching results of two methods based on May 2019 data.
Method | ||||||
---|---|---|---|---|---|---|
Based on table join | 9494211 | 4808486 | 103967 | 11408 | 2.16% | 0.12% |
Based on analysis function | 9494211 | 4715927 | 0 | 0 | 0 | 0 |
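The percentage columns of the table join rows in the two tables above can be reproduced from the count columns. The short check below reflects our reading of those columns, not a formula stated in the text: the wrong connection rate is taken as wrong connections divided by matched OD pairs, and the unconnected ratio as unconnected records divided by total records. The May total of 9,494,211 is taken from the dataset table.

```python
def percent(numerator: int, denominator: int) -> float:
    """Ratio expressed as a percentage, rounded to two decimals."""
    return round(100 * numerator / denominator, 2)


# (total records, OD pairs, wrong connections, unconnected records)
# for the table join method, as reported in the tables above.
april = (8_709_743, 4_409_973, 92_050, 10_486)
may = (9_494_211, 4_808_486, 103_967, 11_408)

results = {}
for name, (total, od, wrong, unconnected) in {"april": april, "may": may}.items():
    results[name] = (percent(wrong, od), percent(unconnected, total))
```

Under this reading, April gives 2.09% and 0.12% and May gives 2.16% and 0.12%, in agreement with the reported values.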
Using the OD data calculated by the analysis function method and the dataset construction process of Section 4.1.2, datasets for April and May 2019 are constructed, and OD matching based on table join is again compared with OD matching based on analysis function. Table
Data analysis after OD matching and calculation of original IC card data.
Date | ||||||
---|---|---|---|---|---|---|
April 2019 | 8,709,743 | 4,328,409 | 28,367 | 24,558 | 2.16% | 0.12% |
May 2019 | 9,494,211 | 4,715,927 | 33,542 | 28,815 | 0 | 0 |
Structure of the constructed dataset: complete metro OD data with some inbound and outbound records removed.
Date | ||||||
---|---|---|---|---|---|---|
April 2019 | 8,656,818 | 28,567 | 24,238 | 4,299,842 | 4,304,171 | 4,275,604 |
May 2019 | 9,431,854 | 33,011 | 28,295 | 4,682,916 | 4,687,632 | 4,654,621 |
OD matching based on table join and analysis function using constructed dataset.
Date | Method | |||||||
---|---|---|---|---|---|---|---|---|
April 2019 | Based on table join | 4,444,086 | 178,815 | 10,333 | 4.02% | 0.12% | 0 | 0 |
April 2019 | Based on analysis function | 4,275,604 | 0 | 0 | 0 | 0 | 28,567 | 24,238 |
April 2019 | Real results in Table | 4,275,604 | — | — | — | — | 28,567 | 24,238 |
May 2019 | Based on table join | 4,753,677 | 79,714 | 11,310 | 1.68% | 0.12% | 0 | 0 |
May 2019 | Based on analysis function | 4,654,621 | 0 | 0 | 0 | 0 | 33,011 | 28,295 |
May 2019 | Real results in Table | 4,654,621 | — | — | — | — | 33,011 | 28,295 |
The results support the following observations. First, comparing the OD passenger flow with the total passenger flow shows that, for the table join method, OD passenger flow volume × 2 > total passenger flow volume. Since each OD pair consumes two card records, the metro OD results obtained by the table join method must contain redundant matches, which is unreasonable. Second, comparing the metro OD results of the two methods shows that the analysis function method produces no wrong connections for multiple trips within 5,400 seconds and leaves no trips longer than 5,400 seconds unconnected. Third, on the constructed datasets, the OD matching results of the proposed analysis function method are identical to the real data, whereas those of the table join method are not. Fourth, on the constructed dataset, when the missing inbound and outbound records identified by the analysis function method are combined with the extracted inbound and outbound data and OD matching is performed, the matching result is again the same as the real data.
Therefore, it can be considered that the analysis function method is better than the table join method in OD matching of metro.
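The relation between OD passenger flow and total passenger flow discussed above can be checked directly from the reported counts. The helper `surplus_records` is a hypothetical name introduced here; it relies only on the fact that each OD pair consumes two card records, so a positive surplus means the OD result uses more records than exist.

```python
def surplus_records(od_pairs: int, total_records: int) -> int:
    """Card records implied by the OD pairs minus card records that exist."""
    return 2 * od_pairs - total_records


# Table join method on the original data (OD pairs, total records).
join_april = surplus_records(4_409_973, 8_709_743)
join_may = surplus_records(4_808_486, 9_494_211)

# Analysis function method on the original data, using the OD counts
# reported in the data analysis table for the original IC card data.
lag_april = surplus_records(4_328_409, 8_709_743)
lag_may = surplus_records(4_715_927, 9_494_211)
```

For the table join method the surplus is positive in both months, so some records must have been matched more than once. For the analysis function method the deficit equals the sum of the two missing-record counts reported for each month (28,367 + 24,558 in April and 33,542 + 28,815 in May), which is what a one-to-one matching should leave over.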
This paper proposes an OD matching algorithm based on analysis function for metro IC card data. Compared with the previous table join OD matching algorithm, the method based on analysis function avoids setting a time threshold and can therefore handle cases in which passengers stay in the subway station for a long time. On the Xiamen metro inbound and outbound IC card datasets of April and May 2019, the method is more accurate than the table join method and can identify both the correct OD pairs and the erroneous or incomplete inbound and outbound card records.
In future work, this method should be verified on inbound and outbound IC card datasets from complex metro networks with transfer stations. Its time and space complexity should also be analyzed and optimized, so that it retains its accuracy at lower time and space cost and can be applied to large-scale datasets.
The GPS data used to support the findings of this study were supplied by Xiamen GNSS Development and Application Co., Ltd. under license and so cannot be made freely available.
The authors declare that they have no conflicts of interest.
This work was supported by China National Social Science Fund (19BXW110).