OD Matching of Metro IC Card Data Based on Analysis Function

When passengers take the subway several times within a short period, miss a card swipe when entering or leaving a station, or stay in a subway station for a long time, the previous table join method cannot set an accurate time threshold parameter and therefore cannot correctly match the passengers' OD pairs. To solve these problems, this paper proposes an OD matching method based on analysis function. LAG () is an analytic function in Oracle that allows access to the row at a given offset prior to the current row without using a self-join. The metro IC card dataset stores a card swiping record every time a passenger enters or leaves a subway station. In this method, the dataset is sorted in ascending order by card number and card swiping time, and the Oracle LAG function is then used to take the previous-row values of the card ID, transaction date, transaction time, inbound/outbound sign, and station ID. Finally, the matching process is completed according to the OD conditions on the card number, time, and inbound/outbound sign fields. This method does not need a time threshold and can therefore handle cases in which passengers stay in the subway station for a long time. The OD matching results on the inbound and outbound IC card swiping dataset of passengers of Xiamen Metro Line in April and May 2019 verify that the analysis function method achieves better OD matching and more accurate identification of missing swipes than the table join method.


Introduction
With rapid economic development, rail transit has become a major trend in urban development. Passenger flow analysis is the basis of safe and reliable operation of urban rail transit. Predicting and accurately grasping passenger flow characteristics and their evolution can provide a decision-making basis for scientific operating plans such as departure interval and departure frequency. Passenger flow analysis is based on the most original data. Improving the management and service level of an urban rail system depends on the comprehensive grasp, analysis, and application of metro travel data. Matching and calculating the origin-destination (OD) matrix is the most important and basic step. When passengers take the subway, they swipe the card when they enter and when they leave the station. The two card swiping records are stored in the same table. Ideally, every inbound swipe record has a corresponding outbound swipe record. In reality, however, due to human error or equipment failure, the inbound and outbound card reading data do not correspond completely. As shown in Figure 1, a comparison of the card reading data in Xiamen from March to August shows that the monthly numbers of inbound and outbound swipes deviate from each other and are not exactly equal. OD matching of the metro data is therefore necessary in order to match the inbound and outbound information correctly and record each pair in the same row, and to separate the records with a missing inbound or outbound swipe and store them in different tables. This paper proposes an OD matching algorithm for metro IC card data based on analysis function. It completes the OD matching of metro IC card data and the identification of wrong data, and it is verified experimentally with the data from April to May in Xiamen.
In contrast to the method based on table join, this paper proposes an OD matching algorithm based on analysis function for metro IC card data analysis. The primary contributions of this paper can be summarized as follows:
(1) An OD matching method based on analysis function is proposed for the data mining and OD matching of metro IC card data.
(2) We analyze the differences, advantages, and disadvantages of OD matching methods based on table join and analysis function.
(3) We conduct a theoretical analysis of the characteristics of the OD matching method based on analysis function.
(4) Starting from the results of the OD matching method based on analysis function, we artificially remove part of the data to construct a dataset and then apply OD matching based on table join and on analysis function to the constructed dataset. The effectiveness of the two algorithms is verified by comparing the analysis results.
Many scholars have studied metro IC card data. IC card data records the card information of passengers. Every time a passenger swipes the card when entering or leaving a station, a record is saved in the system; that is, the inbound and outbound swipes are stored as two separate records. Each record contains important information such as card ID, time, inbound or outbound sign, and station ID. Based on IC card data, Shin [1] carried out metro OD matrix processing. Instead of matching each person's OD, Shin counted the OD matrix according to the inbound/outbound sign and station ID fields, which is not specific to individuals. Shin did not consider calculating the OD match of each swipe record and was concerned only with the statistics. Chen et al. [2] analyzed high-dimensional and multivariate metro data through random matrix theory and then predicted abnormal card swiping data. Through the analysis of IC card data, Yu et al. [3,4] classified the abnormal OD data of passengers into two categories: passenger anomaly and system anomaly. The data they used in the experiment was generated directly from the AFC system, and they did not describe the principle of OD matching of inbound and outbound card swiping data. In Zhiyuan et al.'s work [5], visualization frameworks were used to present the massive passenger flow data of the Shanghai metro network graphically in time and space, processed from four aspects: the network, line, station, and section. The metro card data is also from the AFC system. Li et al. [6] extracted traffic origin and destination (OD) information of travelers from multisource data and used the extracted data for traffic zone division. Finally, a multimode traffic forecasting model was established on this basis. Moon et al. used Seoul smart card data to measure the traffic convenience of each region by considering the actual experience of passengers in the travel network [7].
Kim et al. used Seoul's bus IC card data to introduce a sticky index that quantifies the user's preference range and measures the degree of habitual behavior of individuals and bus routes [8]. Li et al. proposed a framework for extracting potential paths from smart card data. Taking Beijing as an example, a spatial clustering algorithm was used to provide a basis for customizing bus routes [9][10][11][12]. Lee et al. used Seoul's bus smart card data to evaluate the transfer efficiency of bus and subway transfer stations and put forward improvement strategies. The results showed that the evaluation results of the DEA model for transfer efficiency are reasonable [13]. Ha and Lee used metro smart card data from Seoul to classify travel modes into working and nonworking trips to study changes in urban activities and spatial structure [14]. Zhou et al. classified and explored the functions of metro station areas according to the smart card data of the Wuhan metro system and passenger travel data [15]. Zhang et al. card data of Shenzhen metro to estimate how passenger flows are allocated to different routes and trains from empirical analysis [18]. Lin et al. [19] proposed an automated multistage method for inferring the time variable in various components of a metro network. They evaluated the proposed method for a route planning application, using smart card data from Singapore, and compared the estimated results with ground truth values.
Yang et al. [20] considered the optimization problem for timetables in subway systems. Subway passenger flow forecasting models were presented for peak-hour flow by Pan et al. [21], for special event occurrences by Ni et al. [22], and for the Beijing subway using spatiotemporal correlations by Wang and Cai [23]. Chen et al. analyzed the spatiotemporal characteristics of multimode travelers by combining the taxi FCD, the metro IC card data, and the GPS trajectories of Mobike and proposed a binomial logit model (BNL) to estimate mode choices for both peak and off-peak periods. The metro IC card data they used already contained the inbound and outbound information after OD matching [24]. Sun and Guan proposed measuring metro network vulnerability from the perspective of line operation. Passenger flow distribution and redistribution were simulated for different disruption scenarios based on the all-or-nothing assignment rule [25]. The research above uses metro IC card data as an input to other problems, whereas this paper aims to achieve the complete matching of passengers' OD from the metro IC card data alone, find the abnormal data with a missing inbound or outbound swipe, and construct a standard dataset for verification. The rest of the paper is organized as follows: Section 2 overviews the metro IC card data and OD matching method. Section 3 describes the OD matching algorithm based on analysis function. In Section 4, we describe the experimental dataset construction and present qualitative and quantitative results.

Data and Methods
The symbols used are defined as follows:
$N_{ic}$: the amount of IC card data in a dataset
$N_{od}$: the number of OD pairs in a dataset
$N_{err}$: the number of records of wrong connection caused by multiple subway ride records of a passenger in 5,400 seconds
$N_{et}$: the number of OD pairs whose time difference exceeds the threshold of 5,400 seconds
$k_1$: the ratio of the number of records of wrong connection caused by multiple subway ride records of a certain passenger within 5,400 seconds
$k_2$: the proportion of the records of passengers who have not successfully matched OD due to a subway trip time of more than 5,400 seconds
$N_{in}$: the amount of inbound IC card data in a dataset
$N_{out}$: the amount of outbound IC card data in a dataset
$N_{mi}$: the amount of outbound card swiping corresponding to the missing inbound card in a dataset
$N_{mo}$: the amount of inbound card swiping corresponding to the missing outbound card in a dataset

Introduction to the Composition of the Data Dictionary.
Taking Xiamen metro IC card data as an example, TICKET_ID stands for the transaction card number, TXN_DATE for the transaction date, TXN_TIME for the transaction time, and TICKET_MAIN_TYPE for the card type. TRANS_CODE represents the type of transaction (boarding or alighting), in which 7 represents inbound and 8 represents outbound, and TXN_STATION_ID represents the ID of the subway station where the card is swiped. The specific data are shown in Table 1.

Formal Description of Metro OD Matching.
A passenger's complete subway journey is recorded by two card swipes: one for entering the station and one for leaving it. Based on the records in Table 1, this paper studies how to correctly match the OD pairs when the two card swiping records are stored as two different records in the same table. The number of matched pairs is recorded as $N_{od}$, and Table 2 shows the information of the OD pairs. We also find the outbound data with a missing corresponding inbound swipe and the inbound data with a missing corresponding outbound swipe, which are recorded as $N_{mi}$ and $N_{mo}$ and stored in Tables 3 and 4, respectively.

Difficulties of the Problem.
There are three cases in the metro IC card record:
(1) Passengers swipe cards normally when entering and leaving the station
(2) Passengers swipe the card when they enter the station but do not swipe the card when exiting the station, or the data is lost
(3) Passengers swipe the card when they leave the station but do not swipe the card when entering the station, or the data is lost
Cases 2 and 3 are invalid data: they cannot form a complete passenger boarding and alighting record and need to be eliminated during the OD matching algorithm. Only the data in case 1 can be matched successfully. Cases 2 and 3 are rare, and most of the subway data fall under case 1.

Introduction of Metro OD Matching Method Based on Table Join.
The table join method connects the inbound and outbound card swiping data according to certain conditions to form a complete record of passengers getting on and off the train. Its flow chart is shown in Figure 2. The method relies on a time threshold of 5,400 seconds, which has two drawbacks:
(1) There is no actual basis for selecting 5,400 seconds. In reality, some subway trips last longer than 5,400 seconds, so records that exceed the threshold are lost
(2) When a passenger makes multiple subway trips within 5,400 seconds, more wrong connections are produced
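Both drawbacks can be reproduced on a toy example. The following Python sketch is not the paper's implementation: the join condition (same card, inbound before outbound, time difference within the threshold) is an assumption about how the table join is written, and the `Swipe` type, the `table_join_od` helper, and the integer timestamps are hypothetical simplifications.

```python
from typing import NamedTuple

THRESHOLD_S = 5400         # time threshold of the table join method
INBOUND, OUTBOUND = 7, 8   # TRANS_CODE values from the data dictionary


class Swipe(NamedTuple):
    ticket_id: str
    ts: int            # swipe time in seconds (hypothetical stand-in for TXN_DATE + TXN_TIME)
    trans_code: int
    station_id: str


def table_join_od(swipes, threshold=THRESHOLD_S):
    """Join every inbound swipe to every later outbound swipe of the same card
    whose time difference does not exceed the threshold (assumed join condition)."""
    ins = [s for s in swipes if s.trans_code == INBOUND]
    outs = [s for s in swipes if s.trans_code == OUTBOUND]
    return [
        (i, o)
        for i in ins
        for o in outs
        if i.ticket_id == o.ticket_id and 0 < o.ts - i.ts <= threshold
    ]


swipes = [
    # Card A: two short trips within 5,400 seconds
    Swipe("A", 0, INBOUND, "S1"),
    Swipe("A", 600, OUTBOUND, "S2"),
    Swipe("A", 1200, INBOUND, "S2"),
    Swipe("A", 1800, OUTBOUND, "S1"),
    # Card B: a single trip that lasts longer than 5,400 seconds
    Swipe("B", 0, INBOUND, "S1"),
    Swipe("B", 6000, OUTBOUND, "S3"),
]

pairs = table_join_od(swipes)
pair_summary = [(i.ticket_id, i.ts, o.ts) for i, o in pairs]
print(pair_summary)
```

In this toy data, card A's first inbound swipe is joined to both of its outbound swipes (one wrong connection), while card B's valid trip is not matched at all because it exceeds the threshold.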

Introduction to Oracle LAG () Function.
Oracle LAG () [26] is an analytic function that allows you to access the row at a given offset prior to the current row without using a self-join:

LAG (expression [, offset] [, default]) OVER ( [query_partition_clause] order_by_clause )
The usage of the LAG function is shown above. expression is a scalar expression evaluated against the value of the row at a given offset prior to the current row. offset is the number of rows to go back from the current row; its default value is 1. default is the value the function returns if the offset goes beyond the scope of the partition. If you omit default, the function returns NULL. The query_partition_clause divides rows into partitions to which the LAG () function is applied. By default, the function treats the whole result set as a single partition. The order_by_clause specifies the order of the rows in each partition to which the LAG () function is applied. Similar to the LEAD () function, the LAG () function is very useful for calculating the difference between the values of the current and previous rows.
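To make the semantics of offset, default, partitioning, and ordering concrete, the following Python sketch emulates LAG () on a list of dictionaries. It is an illustration only, not Oracle's implementation; the `lag` helper, its parameter names, and the `lag_<column>` output key are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def lag(rows, column, partition_by, order_by, offset=1, default=None):
    """Emulate LAG(column, offset, default) OVER (PARTITION BY ... ORDER BY ...).

    Returns the rows sorted by (partition_by, order_by); each row gains a key
    "lag_<column>" holding the value from `offset` rows earlier in the same
    partition, or `default` when no such row exists.
    """
    ordered = sorted(rows, key=itemgetter(partition_by, order_by))
    result = []
    for _, group in groupby(ordered, key=itemgetter(partition_by)):
        part = list(group)
        for idx, row in enumerate(part):
            prev = part[idx - offset][column] if idx >= offset else default
            result.append({**row, f"lag_{column}": prev})
    return result


rows = [
    {"card": "A", "t": 2, "station": "S2"},
    {"card": "B", "t": 1, "station": "S5"},
    {"card": "A", "t": 1, "station": "S1"},
]
lagged = lag(rows, "station", partition_by="card", order_by="t")
lag_values = [r["lag_station"] for r in lagged]
print(lag_values)
```

The first row of each partition has no predecessor, so it receives the default, which mirrors the NULL that Oracle returns when default is omitted.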

Introduction to the Metro OD Matching Method Based on Analysis Function.
The metro OD matching algorithm based on the analysis function mainly uses the Oracle LAG function to replace the table join. In this method, the dataset is sorted in ascending order by card number and card swiping time, and the Oracle LAG function is then used to take the previous-row values of five related fields. Finally, the matching process is completed according to the OD conditions on the card number, time, and inbound/outbound sign fields. Its flow chart is shown in Figure 3. Taking the metro IC card data in May 2019 as an example, the OD matching SQL statement based on analysis function uses the LAG () function to get the previous-row values of the five fields. "metro201905" is a table of metro IC card data in May 2019, with the structure shown in Table 1.
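The matching logic can be sketched in Python, with a `prev` variable playing the role of LAG () at offset 1. This is a minimal sketch under stated assumptions rather than the paper's SQL: it assumes that an OD pair is an outbound swipe whose previous row belongs to the same card and is an inbound swipe, that any other outbound swipe is missing its inbound swipe, and that any inbound swipe not followed by a same-card outbound swipe is missing its outbound swipe. The `Swipe` type, the `lag_od_match` helper, and the integer timestamps are hypothetical.

```python
from typing import NamedTuple

INBOUND, OUTBOUND = 7, 8   # TRANS_CODE values from the data dictionary


class Swipe(NamedTuple):
    ticket_id: str
    ts: int            # swipe time in seconds (hypothetical stand-in for TXN_DATE + TXN_TIME)
    trans_code: int
    station_id: str


def lag_od_match(swipes):
    """Return (od_pairs, missing_in, missing_out) using only the previous row."""
    ordered = sorted(swipes, key=lambda s: (s.ticket_id, s.ts))
    od_pairs, missing_in, missing_out = [], [], []
    prev = None  # previous row in sort order, i.e. the LAG value at offset 1
    for cur in ordered:
        matched = (
            prev is not None
            and prev.ticket_id == cur.ticket_id
            and prev.trans_code == INBOUND
            and cur.trans_code == OUTBOUND
        )
        if matched:
            od_pairs.append((prev, cur))
        else:
            if prev is not None and prev.trans_code == INBOUND:
                missing_out.append(prev)   # inbound swipe never closed by an outbound swipe
            if cur.trans_code == OUTBOUND:
                missing_in.append(cur)     # outbound swipe with no preceding inbound swipe
        prev = cur
    if prev is not None and prev.trans_code == INBOUND:
        missing_out.append(prev)           # last row is an unclosed inbound swipe
    return od_pairs, missing_in, missing_out


swipes = [
    Swipe("A", 0, INBOUND, "S1"), Swipe("A", 600, OUTBOUND, "S2"),
    Swipe("A", 1200, INBOUND, "S2"), Swipe("A", 1800, OUTBOUND, "S1"),
    Swipe("B", 0, INBOUND, "S1"), Swipe("B", 6000, OUTBOUND, "S3"),  # longer than 5,400 s
    Swipe("C", 300, OUTBOUND, "S4"),                                 # missing inbound swipe
    Swipe("D", 100, INBOUND, "S2"),                                  # missing outbound swipe
]

od_pairs, missing_in, missing_out = lag_od_match(swipes)
od_summary = [(i.ticket_id, i.station_id, o.station_id) for i, o in od_pairs]
print(od_summary, missing_in, missing_out)
```

Because each outbound swipe is compared only with the row immediately before it, no time threshold is involved: card B's long trip is matched, and card A's two trips are not cross-connected.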

Theoretical Analysis and Comparison of Two Methods.
Compared with the table join method, which requires a time threshold, the analysis function method proposed in this paper avoids setting connection conditions and thresholds for the table join. It can accurately distinguish the three cases of metro IC card swiping data, and it does not discard records because they exceed a threshold of a particular size. Table 5 shows a theoretical comparison of the different methods for rail transit OD matching.

Experimental Dataset
Only the inbound and outbound station records themselves are recorded in the original metro data sheet, and the specific missing data is not known:
(1) It is impossible to know which data can be matched into complete OD pairs
(2) The OD results obtained by different matching methods cannot by themselves prove the correctness of the matching
Therefore, we need a real dataset for which the specific missing data is known (including the amount of missing data and the number of missing pieces), so that the OD results obtained by different methods can be compared with the OD results of the real dataset to prove their correctness.

Dataset Construction Process.
The flow chart for building the dataset is shown in Figure 4. The dataset construction process is divided into the following eight steps:
Step 1: select the card number, transaction date, inbound time, card type, and inbound station ID fields from the completely matched metro OD dataset. The inbound time field is renamed as the transaction time, the inbound station ID field is renamed as the station ID, and these data are stored in the inbound information table;
Step 2: add the inbound/outbound flag field to the inbound information table, and set this field to inbound for all data in the table;
Step 3: select the card number, transaction date, outbound time, card type, and outbound station ID fields from the completely matched OD dataset. The outbound time field is renamed as the transaction time, the outbound station ID field is renamed as the station ID, and these data are stored in the outbound information table;
Step 4: add the inbound/outbound flag field to the outbound information table, and set this field to outbound for all data in the table;
Step 5: merge the inbound and outbound information tables into a single inbound/outbound table, and shuffle the data in the table randomly once;
Step 6: randomly extract a part of the inbound data according to a given proportion;
Step 7: from the data remaining after the extracted inbound data is removed, randomly extract a part of the outbound data whose card numbers differ from those of the inbound data just extracted;
Step 8: after the extracted outbound data is removed, the remaining data is the constructed dataset.
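The eight steps can be sketched in Python as follows. This is an illustrative sketch, not the procedure actually run in the experiments: the field names, the `build_dataset` helper, the two removal ratios, and the fixed random seed are all hypothetical, and the outbound ratio is assumed to be taken relative to the full outbound table.

```python
import random

INBOUND, OUTBOUND = 7, 8   # inbound/outbound flag values


def build_dataset(od_pairs, in_ratio, out_ratio, seed=0):
    """Split matched OD pairs into swipe records and remove known inbound/outbound swipes."""
    rng = random.Random(seed)
    # Steps 1-2: inbound information table with an inbound flag
    inbound = [
        {"card": p["card"], "date": p["date"], "time": p["in_time"],
         "card_type": p["card_type"], "station": p["in_station"], "flag": INBOUND}
        for p in od_pairs
    ]
    # Steps 3-4: outbound information table with an outbound flag
    outbound = [
        {"card": p["card"], "date": p["date"], "time": p["out_time"],
         "card_type": p["card_type"], "station": p["out_station"], "flag": OUTBOUND}
        for p in od_pairs
    ]
    # Step 5: merge and shuffle once
    merged = inbound + outbound
    rng.shuffle(merged)
    # Step 6: randomly extract a proportion of the inbound records
    in_rows = [r for r in merged if r["flag"] == INBOUND]
    removed_in = rng.sample(in_rows, int(len(in_rows) * in_ratio))
    removed_in_cards = {r["card"] for r in removed_in}
    # Step 7: extract outbound records whose card numbers differ from those removed in step 6
    candidates = [r for r in merged
                  if r["flag"] == OUTBOUND and r["card"] not in removed_in_cards]
    n_out = min(len(candidates), int(len(outbound) * out_ratio))
    removed_out = rng.sample(candidates, n_out)
    # Step 8: whatever remains is the constructed dataset
    removed_ids = {id(r) for r in removed_in} | {id(r) for r in removed_out}
    dataset = [r for r in merged if id(r) not in removed_ids]
    return dataset, removed_in, removed_out


od_pairs = [
    {"card": f"C{n}", "date": "2019-05-01", "in_time": 1000 + n, "out_time": 2000 + n,
     "card_type": 1, "in_station": "S1", "out_station": "S2"}
    for n in range(10)
]
dataset, removed_in, removed_out = build_dataset(od_pairs, in_ratio=0.2, out_ratio=0.2)
print(len(dataset), len(removed_in), len(removed_out))
```

Keeping the removed records makes the ground truth explicit: the missing inbound and missing outbound swipes that a matching method reports can be compared directly with `removed_in` and `removed_out`.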

Introduction to Instance Objects and Datasets.
Research object: the IC card data of rail transit in April and May 2019 in Xiamen City, Fujian Province, are selected. Data details are shown in Table 6. Dataset: select the metro IC card data, static station information data, and static card type data in April and May 2019.

Evaluation Method.
(1) Using the original IC card data, the OD results obtained by the analysis function method are compared with the OD results based on the table join method.
(2) First, a correct metro IC card dataset is constructed; that is, each passenger's inbound swipe has a corresponding outbound swipe, and there is no error or omission in the data. Then OD matching is carried out on the dataset using the method based on table join and the method based on analysis function. Finally, the results are verified against the real OD results.
(3) First, a metro IC card swiping dataset with missing records is constructed; that is, passengers lack an entry or exit record in a subway trip. Then OD matching is carried out on the dataset using the method based on table join and the method based on analysis function. Finally, the results are verified against the real OD results.

Evaluating Indicator.
(1) The accuracy of matching;
(2) The rate of wrong connections caused by the same passenger taking the subway several times within 5,400 seconds and the ratio of trips left unconnected because they last more than 5,400 seconds are counted as evaluation indexes. The specific calculation formulas are as follows:

$$k_1 = \frac{N_{err}}{N_{od}}, \qquad k_2 = \frac{N_{et}}{N_{od}}$$

In the above equations, $k_1$ represents the ratio of the number of records of wrong connection caused by multiple subway ride records of a certain passenger within 5,400 seconds; $N_{err}$ represents the number of records of wrong connection caused by multiple subway ride records of a passenger in 5,400 seconds; $N_{od}$ represents the amount of passenger flow after OD matching; $k_2$ represents the proportion of the records of passengers who have not successfully matched OD due to a subway trip time of more than 5,400 seconds; $N_{et}$ represents the number of OD pairs whose time difference exceeds the threshold of 5,400 seconds.
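As a small worked example, the two indexes can be computed as follows, reading both $k_1$ and $k_2$ as ratios over $N_{od}$. That reading of the denominators is an assumption based on the symbol definitions, and the counts used here are made up for illustration; they are not results from the experiments.

```python
def error_ratios(n_err, n_et, n_od):
    """Return (k1, k2), assuming both indexes are ratios over the number of OD pairs."""
    if n_od <= 0:
        raise ValueError("n_od must be positive")
    return n_err / n_od, n_et / n_od


# Hypothetical counts, for illustration only
k1, k2 = error_ratios(n_err=30, n_et=12, n_od=6000)
print(f"k1 = {k1:.2%}, k2 = {k2:.2%}")
```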

Experimental Results.
The OD matching results of metro IC card data in April and May 2019 based on table join and analysis function are shown in Tables 7 and 8.
According to the OD data calculated by the analysis function method and the dataset construction process of Section 4.1.2, the datasets of April and May 2019 are constructed, respectively, and the OD matching based on table join and that based on analysis function method are compared again. Table 9 shows the OD matching results of the original IC card data in April and May 2019 through the analysis function.
The results of dataset construction are shown in Table 10.
Figure 4: The dataset construction process of complete metro OD data eliminating some inbound and outbound information.

Analysis of Experimental Results.
Comparing the OD passenger flow with the total passenger flow leads to the following observations:
(1) For the table join method, OD passenger flow volume × 2 > total passenger flow volume.
(2) Therefore, it can be considered that there are redundant matches in the metro OD results obtained by the table join method, which is unreasonable to some extent.
(3) By comparing the metro OD results obtained by the two methods, it can be seen that the method based on the analysis function performs well both for multiple trips within 5,400 seconds and for trips longer than 5,400 seconds, which the table join method leaves unconnected.
(4) On the constructed datasets, the OD matching results obtained by the proposed analysis function method are the same as the real data, whereas the results of the table join method are not.
(5) Using the constructed dataset, the missing inbound and outbound data that the analysis function method stores in the missing data tables are combined with the extracted inbound and outbound data to perform OD matching, and the matching result is the same as the real data.
Therefore, it can be considered that the analysis function method is better than the table join method in metro OD matching.
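The check behind observations (1) and (2) can be written as a short Python sketch. Each OD pair consumes one inbound and one outbound record, so twice the number of pairs cannot exceed the number of records unless some record was matched more than once. The second helper additionally assumes that every record is either part of an OD pair or counted in $N_{mi}$ or $N_{mo}$, which is an interpretation of the symbol definitions. The function names and all counts below are hypothetical.

```python
def has_redundant_matches(n_ic, n_od):
    """True when the OD pairs use more records than exist, i.e. 2 * N_od > N_ic."""
    return 2 * n_od > n_ic


def accounts_for_all_records(n_ic, n_od, n_mi, n_mo):
    """True when matched pairs plus unmatched swipes add up to the record count (assumed identity)."""
    return 2 * n_od + n_mi + n_mo == n_ic


# Hypothetical counts, for illustration only
table_join_redundant = has_redundant_matches(n_ic=1000, n_od=510)
analysis_redundant = has_redundant_matches(n_ic=1000, n_od=480)
analysis_complete = accounts_for_all_records(n_ic=1000, n_od=480, n_mi=15, n_mo=25)
print(table_join_redundant, analysis_redundant, analysis_complete)
```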

Conclusion and Future Work
This paper proposes an OD matching algorithm based on analysis function for metro IC card data. Compared with the previous table join OD matching algorithm, the method based on analysis function avoids setting a time threshold and can therefore handle cases in which passengers stay in the subway station for a long time. On the Xiamen metro inbound and outbound IC card swiping dataset from April and May 2019, this method is more accurate than the table join method and can identify both the correct OD pairs and the wrong IC card swiping records in and out of the station.
In future work, this method should be verified on inbound and outbound IC card datasets from complex metro networks with transfer stations. The time and space complexity of the method should also be analyzed and optimized so that it retains its accuracy at lower time and space cost and can be applied to large-scale datasets.

Data Availability
The GPS data used to support the findings of this study were supplied by Xiamen GNSS Development and Application Co., Ltd. under license and so cannot be made freely available.

Conflicts of Interest
The authors declare that they have no conflicts of interest.