Data quality is essential for the reliable use of data in analysis and applications. Large volumes of automatically collected data inevitably suffer from data quality issues, including missing and invalid data. This paper deals with an invalid data problem in the automated fare collection (AFC) database caused by erroneous associations between fare machines and metro stations, e.g., a fare machine located at Station A is wrongly associated with Station B in the AFC database. Such errors could lead to inappropriate fare charges in a distance-based fare system and bias analyses used in planning and operations practice. We propose a tensor decomposition and isolation forest-based approach to detect and correct the invalidly associated fare machines in the system. The tensor decomposition extracts features of the passenger flows and travel times passing through fare machines. The isolation forest, coupled with a neural network (NN), takes these features as inputs to detect the wrongly associated fare machines and infer their correct association stations. Case studies using data from a metro system show that the proposed detection approach achieves over 90% accuracy in detecting invalid associations when up to 35% of associations are invalid. The inferred association has a 90% accuracy even when the invalid association ratio reaches 40%. The proposed data-driven detection method is useful for large-scale data management in terms of data quality checking and repair.
Smart card data collected from the automatic fare collection (AFC) system (i.e., AFC data) enable many beneficial applications in the public transportation system such as collective and individual mobility analysis, system state monitoring, and operation planning and control [
Data problems tend to arise for the following reasons: (1) Human factors: in the AFC system, transaction records may be missing if passengers fail to tap in/out properly. (2) Infrastructure failure: AFC records are triggered when a passenger taps in or out through an entrance or exit fare machine, so malfunctioning fare machines may lead to missing data (the machine fails to record or upload transactions) and invalid data (erroneous transactions). (3) Inadequate data management: daily data management for transportation systems is a complex practice, and missing and invalid data may arise during database merging, maintenance, or system updates.
Among those data problems, missing and invalid data problems are the most critical and common ones. Figure
Missing and invalid data problems in the AFC dataset.
This paper addresses the invalid data problem, detecting hidden association errors in data that are complete and seemingly valid. Specifically, it aims to detect invalid associations between fare machines and stations in the AFC data. For example, fare machine 001# is located in Metro Station A but wrongly associated with Station B in the AFC database (Figure
Invalid association problem between fare machine and metro station.
We develop a data-driven approach, based on tensor decomposition and machine learning techniques, to automatically detect such invalid associations using AFC data and to infer the correct station to which a fare machine belongs. The approach works in two steps. First, tensor decomposition is used to extract the flow volume and travel time patterns of each fare machine. Then, the isolation forest technique and NN models are used to detect the incorrectly linked fare machines and infer their correct association stations based on the features extracted by the tensor decomposition.
The remainder of the paper is organised as follows: Section
Data quality is one of the most important issues in the big data area. Poor data quality is costly. For example, it is reported that poor data quality costs US businesses 600 billion dollars annually [
Although many studies deal with missing data in transportation, to the best of our knowledge, there is no study on detecting or fixing the association errors in transportation or other related areas, particularly the fare machine-station invalid association problem.
The key idea of a data-driven detection approach is to extract the passenger flow and/or travel time patterns between fare machines and stations. Feature extraction is one of the most important issues in the machine learning field. It reduces the resources required to characterize a large dataset and/or high-dimensional input information. Many feature extraction methods have been proposed in the machine learning community. They can be roughly divided into two groups: conventional statistical learning methods and deep learning-based methods. Conventional statistical learning methods such as principal component analysis (PCA) [
In our problem, passenger flow and travel time patterns are related to multiple modes, e.g., time and location. A tensor is a natural choice to represent and capture these patterns. A tensor is a multidimensional extension of a matrix [
The invalid associations (between fare machines and stations) are treated as anomalies. Anomaly detection is an important topic in data mining. Anomaly detection methods can be roughly divided into three categories: statistical, machine learning, and deep learning models. Statistical methods: statistical methods are the earliest explorations of anomaly detection. The methods in this category first make assumptions about the distribution of the studied dataset. The samples with low probabilities are treated as anomalies. Rousseeuw and Driessen [ Machine learning-based methods: the most widely used anomaly detection methods are the machine learning-based methods, which generally fall into two categories: supervised and unsupervised methods. Supervised methods [ Deep learning-based methods: the emerging deep learning models bring new opportunities to better solve the anomaly detection problem. Hundman et al. [
Let
Given an AFC dataset
Mathematically, the problem is defined as follows:
For convenience, we define the concept of fare machine-related passenger flow (MRF). For an entrance fare machine, MRF refers to the passenger flow tapping in at an entrance fare machine of the origin station and tapping out at a destination station (using any machine) during a certain time slot. For an exit fare machine, MRF represents the passenger flow tapping in at an origin station (using any machine) and tapping out at an exit fare machine during a certain time slot. MRF can be characterized using different features, such as flow volume and travel time. Indicators extracted from the MRF features can be used to characterize fare machines. The hypothesis is that the MRF features of fare machines located at the same station are more similar to each other than those of fare machines located at different stations.
The flow volume and travel time are selected to characterize the MRFs of fare machines. These two features reflect system dynamics from both the demand (mobility patterns) and supply (network and operations) points of view as well as their interactions. They provide complementary knowledge and therefore give a more comprehensive view of the MRF patterns. They are defined for entrance and exit fare machines separately: For entrance fare machines, MRF flow volume measures the number of passengers passing through each fare machine at an origin station and going to a destination station. For exit fare machines, it represents the number of passengers entering the metro system at an origin station and tapping out through an exit fare machine. MRF flow volume reflects the mobility behavior of passengers. MRF travel time indicates the average travel time from a fare machine to a destination station for entrance fare machines and from an origin station to a fare machine for exit fare machines. It reflects the supply characteristics of the metro system, e.g., geographical relationship between stations and scheduling, but also demand characteristics of certain stations as it includes time waiting to board a train under capacity constraints.
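The two MRF features for entrance fare machines can be illustrated with a minimal sketch. It assumes a hypothetical record layout (entry machine, tap-in time, exit station, tap-out time) and made-up sample records; the function names `entrance_mrf` and `cell` and the key order (machine, hour, day, station) are illustrative choices, not taken from the study's data schema. Cells without observations get a flow volume of 0 and an undefined (NaN) travel time.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical AFC records (assumed layout, for illustration only):
# (entry machine, tap-in time, exit station, tap-out time)
records = [
    ("M001", "2018-01-15 08:05", "S2", "2018-01-15 08:25"),
    ("M001", "2018-01-15 08:40", "S2", "2018-01-15 09:10"),
    ("M001", "2018-01-15 09:15", "S3", "2018-01-15 09:50"),
    ("M002", "2018-01-16 08:10", "S2", "2018-01-16 08:28"),
]

FMT = "%Y-%m-%d %H:%M"

def entrance_mrf(records):
    """Aggregate flow volume and mean travel time (minutes)
    per (machine, tap-in hour, day, destination station)."""
    flow = defaultdict(int)
    total_minutes = defaultdict(float)
    for machine, tap_in, dest, tap_out in records:
        t_in = datetime.strptime(tap_in, FMT)
        t_out = datetime.strptime(tap_out, FMT)
        key = (machine, t_in.hour, t_in.date().isoformat(), dest)
        flow[key] += 1
        total_minutes[key] += (t_out - t_in).total_seconds() / 60.0
    travel_time = {k: total_minutes[k] / flow[k] for k in flow}
    return dict(flow), travel_time

flow, travel_time = entrance_mrf(records)

def cell(key):
    """Look up one cell: unobserved flow is 0, unobserved travel time is NaN."""
    return flow.get(key, 0), travel_time.get(key, float("nan"))

print(cell(("M001", 8, "2018-01-15", "S2")))
print(cell(("M002", 9, "2018-01-16", "S3")))
```

The same aggregation would apply to exit fare machines by keying on the exit machine and the origin station instead.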
Figure MRF feature extraction module: it constructs the MRF flow volume and travel time tensors to characterize fare machines and extracts latent MRF flow and travel time features using the tensor decomposition technique. Invalid association detection module: it detects the invalid associations (between fare machines and stations) in two steps. The valid and invalid associations are initially detected using the isolation forest method. Then, the invalid associations are reinspected (the feedback arrow) using neural networks (trained with the valid association data). Association station inference: it infers the station that a fare machine (detected as invalid association) belongs to using the trained neural networks.
Overview of the proposed framework.
For data representation, tensors are used to characterize the MRF flow volume and travel time. A tensor is a high-order generalization of a matrix. The multiway property of a tensor fits the nature of MRF features. For example, MRF flow volume can be characterized by “machine mode” (
To construct the MRF flow volume tensor, the mode variables above are transformed into numerical indices: Machine mode: the fare machines are labeled from 1 to Time mode: the hourly interval is used to represent the tap-in time Day mode: day mode represents the date, thus Station mode: the stations are labeled from 1 to
The MRF flow volume is represented by a size
Structure of MRF flow volume tensor. MRF flow volume tensor consists of 4 modes, i.e., time (
Similarly, the MRF travel time tensor is denoted as
The properties of MRF flow volume and travel time tensors are different, though they share the same structure. The difference stems from the tensor cells that have no AFC observation. For the MRF flow volume tensor, the value of such cells is 0 since the MRF flow volume for the corresponding
Tensor decomposition is used to extract fare machine features from the MRF flow volume and travel time tensors. Given the different properties of these two tensors, different tensor decomposition methods are developed to extract the MRF flow volume and travel time features, respectively.
For MRF flow volume tensor
The CP decomposition of
CP decomposition of the MRF flow volume tensor.
Computing the CP decomposition of
The feature matrix
CP decomposition cannot be applied directly to extract travel time features. This is because the travel time tensor has nonnumerical (i.e., NaN) entries, which makes the operation
The weight tensor
In the initialization phase, NaN cells are filled with random values. As these values are multiplied by 0 during the optimization, they do not influence the results of the optimization objective (optimal solution). After optimization,
The MRF flow volume and travel time feature vectors of each fare machine are concatenated into one single vector to characterize the corresponding fare machine.
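For reference, the two decompositions described above can be summarized in a generic form. The notation below is an assumption introduced for illustration and is not taken from the paper's own equations: $\mathcal{X}$ and $\mathcal{Y}$ denote the four-mode flow volume and travel time tensors, $\mathbf{a}_r, \mathbf{b}_r, \mathbf{c}_r, \mathbf{d}_r$ are the factor vectors of the four modes, $R$ is the number of components, $\circ$ is the outer product, $\ast$ is the element-wise product, and $\mathcal{W}$ is a binary weight tensor. In practice the two tensors would have separate factor sets; one set of symbols is reused here only for brevity.

```latex
% Standard CP objective (flow volume tensor, all cells observed)
\min_{\{\mathbf{a}_r,\mathbf{b}_r,\mathbf{c}_r,\mathbf{d}_r\}}
\left\| \mathcal{X} - \sum_{r=1}^{R}
\mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r \circ \mathbf{d}_r \right\|_F^2

% Weighted CP objective (travel time tensor with NaN cells)
\min_{\{\mathbf{a}_r,\mathbf{b}_r,\mathbf{c}_r,\mathbf{d}_r\}}
\left\| \mathcal{W} \ast \left( \mathcal{Y} - \sum_{r=1}^{R}
\mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r \circ \mathbf{d}_r \right) \right\|_F^2,
\qquad
w_{ijkl} =
\begin{cases}
1, & y_{ijkl} \text{ is observed},\\
0, & y_{ijkl} \text{ is NaN}.
\end{cases}
```

Because every NaN cell carries a weight of 0, whatever value is placed in it at initialization contributes nothing to the objective, which is why a random fill leaves the optimal solution unaffected.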
As fare machines at the same station share similar surrounding points of interest (POIs), the MRF features of these fare machines tend to be similar. Therefore, we first extract the MRF feature of each station. Then, the MRF feature of each machine is compared with the station MRF feature. If a fare machine has an MRF feature similar to that of a station, then this station is likely to be the association station of the fare machine. We divide the inference process into two successive problems, P1 and P2.
To solve P1, we first give two assumptions: (1) the MRF features of the invalid associations are anomalies to their recorded stations. More formally, let
Based on this assumption, the isolation forest method is adopted to solve P1. The isolation forest is an unsupervised anomaly detection model that can be applied directly to a contaminated dataset. Its only requirement is that the outliers be few and different from the normal instances, which matches the aforementioned assumption. The isolation forest detects outliers through random partitioning. It "isolates" observations by randomly selecting a dimension of the MRF feature vector and then randomly choosing a split value between the maximum and minimum values of the selected dimension. Since recursive partitioning can be represented by a tree structure, the number of splits required to isolate an MRF feature is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of such random trees, is a measure of normality. Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular fare machines, they are highly likely to be anomalies [
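The path-length idea described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used in the study: it follows a single point down lazily generated random partitions and omits the normalization that a full isolation forest applies to turn path lengths into anomaly scores. The feature vectors are synthetic, and the names `isolation_depth` and `mean_path_length` are hypothetical.

```python
import random

def isolation_depth(point, sample, rng, depth=0, max_depth=10):
    """Count random axis-aligned splits until at most one sample row
    remains in the partition that contains `point`."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))          # random feature dimension
    lo = min(row[dim] for row in sample)
    hi = max(row[dim] for row in sample)
    if lo == hi:                             # cannot split further
        return depth
    split = rng.uniform(lo, hi)              # random split value
    if point[dim] < split:
        side = [row for row in sample if row[dim] < split]
    else:
        side = [row for row in sample if row[dim] >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def mean_path_length(point, data, n_trees=100, sample_size=64, seed=0):
    """Average isolation depth of `point` over random subsamples of `data`."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += isolation_depth(point, sample, rng)
    return total / n_trees

# Synthetic 2-D feature vectors: 100 "normal" machines plus one outlier.
gen = random.Random(42)
features = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(100)]
outlier = (8.0, 8.0)
features.append(outlier)

outlier_len = mean_path_length(outlier, features)
inlier_len = sum(mean_path_length(p, features) for p in features[:20]) / 20
print(outlier_len, inlier_len)
```

In this toy setting the outlier is isolated after fewer splits on average than the points in the dense cluster, which is the property the detection step relies on.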
Based on the results from the isolation forest, we can divide the fare machine MRF feature vectors into two parts:
The fare machines with their MRF features in
In P2, a reinspection of the fare machines in
Neural network (NN) is used to model the station MRF feature using the MRF features in
For an MRF feature
Using
We utilize AFC data from an urban metro system to evaluate the proposed detection and inference approach. The data cover 7 days from January 15 to 21 in 2018. The fare machine-station association information is carefully checked to ensure its validity for benchmarks. Figure
Number of entrance and exit fare machines in the studied metro system.
We randomly select 1000 entrance fare machines and 1000 exit fare machines and collect the corresponding AFC transaction records to construct the experimental dataset. We then randomly choose a set of fare machines and modify their associated stations to create invalid associations. The proposed approach is validated with the ratio of invalidly associated fare machines ranging from 5% to 40%. Each scenario is run 20 times to reduce the effect of random variation. Table
Model parameters.
Optimal value (potential values) | |
Number of components ( | 8 (1–15) |
Optimization algorithm | |
Error tolerance | 1 |
Maximum number of iterations | 100 (10, 100, 500, 1000) |
Value | |
Threshold score (the threshold score is calculated with the | 0 |
Number of estimators | 1000 (200, 500, 1000, 1500) |
Value | |
The number of top stations in P1 reinspection | 5 |
Number of hidden layers | 2 (1–5) |
Optimizer | Adam (Adam refers to the optimization algorithm proposed in [ |
Number of neurons | (16, 5) |
Table
Confusion matrix of the valid association detection.
 | Invalid association (positive) | Valid association (negative) |
---|---|---|
True detection (true) | True positive (TP) | True negative (TN) |
False detection (false) | False positive (FP) | False negative (FN) |
A set of performance metrics is used to comprehensively evaluate the model performance, including accuracy (Accu), true positive rate (TPR), and false positive rate (FPR):
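As a minimal sketch, the three metrics can be computed from the confusion-matrix counts as follows. It assumes the standard definitions (accuracy as the share of correct detections, TPR as TP over all actual positives, FPR as FP over all actual negatives); the counts in the example are hypothetical and the function name `detection_metrics` is illustrative.

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, true positive rate, and false positive rate
    from confusion-matrix counts (standard definitions assumed)."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn),   # detected invalid / all truly invalid
        "fpr": fp / (fp + tn),   # wrongly flagged / all truly valid
    }

# Hypothetical counts for 1000 fare machines, 100 of them invalid.
m = detection_metrics(tp=90, tn=850, fp=50, fn=10)
print(m)
```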
Figure
Model detection performance (a) TPR, (b) FPR, and (c) accuracy with invalid association ratio ranging from 0.05 to 0.6.
For the P2 evaluation (rematching wrongly associated fare machines to stations), we quantify the model’s capability to effectively allocate large probabilities to the correctly matched stations. We use the top-
Table
Model performance in rematching invalidly associated fare machines (accuracy, %).
Invalid association ratio (%) | 5 | 10 | 15 | 20 | 25 | 30 | 35 | 40 |
---|---|---|---|---|---|---|---|---|
Top-1 | 76.4 | 77.8 | 77.1 | 78.9 | 75.7 | 74.1 | 70.9 | 69.1 |
Top-2 | 86.8 | 87.1 | 87.2 | 88.5 | 87.4 | 85.1 | 82.1 | 81.4 |
Top-3 | 90.8 | 91.1 | 91.7 | 91.8 | 91.7 | 89.9 | 88.3 | 87.7 |
Top-4 | 93.1 | 93.4 | 94.4 | 93.6 | 93.9 | 92.5 | 90.9 | 91.0 |
Top-5 | 94.1 | 95.0 | 95.9 | 95.3 | 95.1 | 94.1 | 92.9 | 93.3 |
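The top-k figures in the table can be read as follows: a fare machine counts as correctly rematched if its true station is among the k stations to which the model assigns the highest probabilities. A minimal sketch, assuming the model outputs one probability per candidate station; the probabilities, station labels, and the function name `top_k_accuracy` are hypothetical.

```python
def top_k_accuracy(probabilities, true_stations, k):
    """Share of machines whose true station is among the k
    highest-probability stations predicted for that machine."""
    hits = 0
    for probs, truth in zip(probabilities, true_stations):
        ranked = sorted(probs, key=probs.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_stations)

# Hypothetical model outputs for four fare machines and three stations.
probabilities = [
    {"A": 0.6, "B": 0.3, "C": 0.1},
    {"A": 0.2, "B": 0.5, "C": 0.3},
    {"A": 0.1, "B": 0.2, "C": 0.7},
    {"A": 0.5, "B": 0.4, "C": 0.1},
]
true_stations = ["A", "C", "C", "C"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(probabilities, true_stations, k))
```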
The results show that the top-
The effectiveness of the detection and inference models depends on the quality of the MRF features. That is, fare machines at different stations should preferably have significantly different MRF features. To explore the feature quality, we utilize principal component analysis (PCA) [
Figure
MRF feature vectors between Station A and (a) Station B, (b) Station C, (c) Station D, and (d) Station E.
Ensuring data quality is essential for the effective use of data in practice. This paper proposes a model to detect invalid data in the AFC dataset caused by erroneous associations between fare machines and stations (e.g., due to delayed dictionary updates or incorrect data merging). It combines tensor decomposition, isolation forest, and NN methods to detect the invalid associations in the recorded dataset and infer the correct station to which a fare machine belongs.
The model is validated using AFC data from a busy metro system. The experimental results show that invalid associations can be detected with more than 90% accuracy when the invalid association ratio is low. The model is also robust to invalid associations: it still achieves 69.62% accuracy in the extreme case when the invalid association ratio is 55%. The association station inference results indicate that the top 3 stations inferred by the model are highly likely (around 90%) to include the correct association station of the studied fare machine. In practice, this narrows further field investigation to a few probable stations.
The proposed model provides useful knowledge for AFC data management in terms of checking data quality and fixing invalid data. Although the study focuses on the invalid data detection problem, the model is general and can be extended to inference applications, e.g., inferring the alighting stations in a bus system that has only boarding records. As the extracted MRF features are meaningful, further studies could build analyses on them, for example, analysing how the utilization of fare machines differs across gates of the same station to improve infrastructure efficiency.
The AFC data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.