A crucial task in traffic data analysis is similarity pattern discovery, which is of great importance to understanding urban mobility and to traffic management. A wide range of methods for similarity discovery have recently been proposed, and their basic assumption is that the traffic data are complete. However, missing data are inevitable in the traffic data collection process for a variety of reasons. In this paper, we propose Bayesian nonparametric tensor decomposition (BNPTD) to achieve incomplete traffic data imputation and similarity pattern discovery simultaneously. BNPTD is a hierarchical probabilistic model that combines Bayesian tensor decomposition with a Dirichlet process mixture model. Furthermore, we develop an efficient variational inference algorithm to learn the model. Extensive experiments on a smart card dataset collected in Guangzhou, China, demonstrate the effectiveness of our method. The proposed BNPTD is general and can also be applied to other spatiotemporal traffic data.

Recent advances in data acquisition technologies and mobile computing have led to the collection of large quantities of urban traffic data from various sources, such as loop detector data, GPS data, and smart card data. These datasets capture rich spatiotemporal information about the whole transportation system and enable a range of traffic analyses. A crucial task in a data-driven transportation system is similarity pattern discovery. For example, as shown in Figure

The graph description of similar patterns in the transportation system. (a) The same trip origin and trip destination. (b) The similar passenger flow time series.

These similarities are beneficial for understanding urban mobility patterns and for the authorities' policy-making. For example, at the aggregate level, classification-based management can be adopted in metro systems, and managers should pay more attention to station A and station B to prevent congestion during the morning peak. At the individual level, travelers with similar trip regularities can be found, that is, familiar strangers [

In general, similarity patterns can be extracted by clustering methods, such as the K-means algorithm and the density-based spatial clustering of applications with noise (DBSCAN) algorithm [

For missing data imputation, a variety of methods have been proposed, such as multioutput Gaussian processes [

Inspired by recent work on Bayesian tensor decomposition, we propose a novel framework named Bayesian nonparametric tensor decomposition (BNPTD) to achieve incomplete traffic data imputation and similarity pattern discovery simultaneously. BNPTD consists of two components: (1) Bayesian tensor decomposition and (2) a Dirichlet process mixture model (DPMM). DPMM is a nonparametric Bayesian mixture model that has shown great promise for data clustering and allows for the automatic determination of an appropriate number of clusters. The two components are combined in a hierarchical probabilistic model, and a variational inference algorithm is presented to derive the posterior distributions of all model parameters and hyperparameters. This combination not only finds groups of similar objects when data are missing but also provides an adaptive prior for the Bayesian tensor decomposition model, which can further improve imputation performance.

In summary, our contributions are as follows:

We propose a Bayesian nonparametric tensor decomposition model to achieve incomplete traffic data imputation and similarity pattern discovery simultaneously.

We present a variational inference algorithm to learn the BNPTD model. Variational inference tends to be faster and easier to scale to large datasets than classical methods such as Markov chain Monte Carlo sampling.

We conduct extensive experiments on a smart card dataset from Guangzhou, China. The results demonstrate that our approach finds interpretable similarity patterns and recovers the missing values well when the traffic data are incomplete.

The rest of this paper is structured as follows. Section

In this section, we review previous approaches to missing data imputation and similarity pattern discovery in traffic data analysis.

Missing data are inevitable in the traffic data collection process, for many reasons. For example, observations are lost when sensors fail. In addition, mobile sensors such as floating cars do not cover every location at every time. Missing data imputation has therefore long been an active research topic, and it has been addressed with a wide range of methods. The traditional approach is to use time series methods such as the autoregressive integrated moving average (ARIMA) model and its variants [

Collaborative filtering methods such as tensor decomposition have been reported to be more effective and efficient than other methods. A tensor (i.e., a multiway array) is a generalization of a matrix. The two most popular tensor decomposition frameworks are Tucker and CANDECOMP/PARAFAC (CP), both of which can capture the underlying multilinear factors. Kolda et al. [

A crucial task in traffic data analysis is similarity pattern discovery, and the discovered similarities can support demand control, personalized travel services, and anomaly detection. Similarity patterns can be extracted by clustering methods. Zhao et al. [

In addition, nonparametric clustering methods such as the Dirichlet process mixture model (DPMM) are widely used for similarity pattern discovery. DPMM is a nonparametric Bayesian mixture model that allows for the automatic determination of an appropriate number of clusters. The most commonly used representation of DPMM is the Chinese restaurant process, for which inference can be performed by MCMC sampling [

In this paper, we propose Bayesian nonparametric tensor decomposition, which couples Bayesian tensor decomposition and a Dirichlet process mixture model through a hierarchical probabilistic model to achieve incomplete traffic data imputation and similarity pattern discovery simultaneously. In contrast to previous studies, DPMM is applied to the low-rank features extracted by Bayesian tensor decomposition rather than to the raw data, which makes similarity pattern discovery feasible when the traffic data are incomplete. Moreover, DPMM benefits missing data imputation by acting as an adaptive prior distribution.

The proposed BNPTD consists of two components: a Bayesian tensor decomposition model and a Dirichlet process mixture model. In the following subsections, we first describe these two components in detail. We then provide an overview of the full Bayesian nonparametric tensor decomposition model. Finally, we derive variational approximations to learn the proposed BNPTD. The description below uses smart card data as an example, but the model can be applied to other spatiotemporal traffic data.

A smart card data record usually has multidimensional attributes such as origin station, destination station, day, and time of day. We organize the smart card data into a third-order tensor

Then, CP decomposition factorizes the tensor into a sum of rank-one tensors; that is,

CP decomposition of a third-order tensor.
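As a minimal sketch of the sum-of-rank-one structure, the NumPy snippet below rebuilds a third-order tensor from three factor matrices. The names `U`, `V`, `W`, the rank, and the mode sizes are illustrative assumptions and are not taken from the paper; the mode order (station, day, time interval) is likewise assumed.

```python
import numpy as np

def cp_reconstruct(U, V, W):
    """CP model: X[i, j, k] = sum_r U[i, r] * V[j, r] * W[k, r]."""
    return np.einsum("ir,jr,kr->ijk", U, V, W)

rng = np.random.default_rng(0)
n_station, n_day, n_time, rank = 4, 3, 5, 2    # illustrative sizes only
U = rng.normal(size=(n_station, rank))         # station factors (assumed mode order)
V = rng.normal(size=(n_day, rank))             # day factors
W = rng.normal(size=(n_time, rank))            # time-of-day factors

X_hat = cp_reconstruct(U, V, W)

# The same tensor written explicitly as a sum of `rank` outer products.
X_sum = sum(
    np.multiply.outer(np.multiply.outer(U[:, r], V[:, r]), W[:, r])
    for r in range(rank)
)
```

Each entry of `X_hat` is the inner product of one row from each factor matrix, which is the elementwise form of the decomposition.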

Elementwise, (

CP decomposition can be solved by the alternating least squares method [

The maximum likelihood estimation of (

We also place a conjugate prior distribution over the precision; that is,

The graphical model of Bayesian tensor decomposition is illustrated in Figure

The graphical model of Bayesian tensor decomposition.
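The conjugate update for the noise precision can be illustrated with a short sketch. It assumes a Gaussian likelihood on the observed entries and a Gamma prior on the precision, which is the standard conjugate pair; the function name and the hyperparameters `a0` and `b0` are hypothetical and not taken from the paper. In a full variational treatment the squared residual would be replaced by its expectation under the factor posteriors, whereas point estimates are used here for brevity.

```python
import numpy as np

def update_precision(Y, mask, Y_hat, a0=1e-6, b0=1e-6):
    """Gamma(shape, rate) posterior of the noise precision from observed residuals."""
    resid = (Y - Y_hat)[mask]                  # only observed entries contribute
    a = a0 + 0.5 * resid.size
    b = b0 + 0.5 * np.sum(resid ** 2)
    return a, b

rng = np.random.default_rng(0)
Y_hat = rng.normal(size=(20, 7, 16))           # stand-in for a CP reconstruction
Y = Y_hat + rng.normal(scale=0.5, size=Y_hat.shape)  # true precision = 1 / 0.5**2 = 4
mask = rng.random(Y.shape) < 0.7               # True where an entry is observed

a, b = update_precision(Y, mask, Y_hat)
tau_mean = a / b                               # posterior mean of the precision
```

With enough observed entries, `tau_mean` should be close to the precision used to generate the noise.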

The smart card data can be represented with a third-order tensor (station

The Dirichlet process mixture model is a Bayesian nonparametric technique that allows for the automatic determination of an appropriate number of mixture components. In this paper, we adopt the stick-breaking construction of DPMM [

For each mode

Draw

Compute

Draw

For each station

Draw

Draw

The graphical model of DPMM.
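The stick-breaking construction can be sketched as follows, using a finite truncation level as is common in variational treatments. The concentration parameter `alpha`, the truncation level, and the isotropic Gaussian used for the component parameters are assumptions made for illustration; the paper's actual base measure is not reproduced here.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking: pi_k = v_k * prod_{j<k} (1 - v_j), v_k ~ Beta(1, alpha)."""
    v = rng.beta(1.0, alpha, size=truncation)
    v[-1] = 1.0                                # close the stick so the weights sum to 1
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(0)
alpha, truncation, rank, n_station = 1.0, 10, 3, 50   # illustrative values

pi = stick_breaking_weights(alpha, truncation, rng)   # mixture weights
mu = rng.normal(scale=3.0, size=(truncation, rank))   # component means (placeholder base measure)
z = rng.choice(truncation, size=n_station, p=pi)      # cluster label for each station
U = mu[z] + rng.normal(size=(n_station, rank))        # station factors around their cluster mean
```

Because the weights decay stochastically, only a few components receive appreciable mass, which is how the number of clusters is determined from the data rather than fixed in advance.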

The BNPTD is comprised of Bayesian tensor decomposition and DPMM, and these components are coupled via a hierarchical probabilistic model. The graphical model of BNPTD is illustrated in Figure

The graphical model of BNPTD.

The advantages of the combination are as follows: (1) missing data imputation and similarity pattern discovery are achieved simultaneously, which is more efficient and avoids error accumulation, and (2) DPMM provides an adaptive prior distribution for Bayesian tensor decomposition, which corresponds to an adaptive regularization term and leads to better imputation performance.
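The adaptive-prior effect in point (2) can be illustrated with a simplified Gaussian update for one station's factor row. This is a sketch under stated assumptions, not the paper's variational update: the other factor matrices are treated as fixed point estimates (their posterior covariances are ignored), and the function name and arguments are hypothetical. The cluster mean `mu_k` and precision `Lambda_k` play the role of the prior supplied by DPMM.

```python
import numpy as np

def update_station_row(y_i, mask_i, V, W, tau, mu_k, Lambda_k):
    """Gaussian conditional of a station factor row under a cluster-specific prior.

    y_i, mask_i : (n_day, n_time) observations and observation mask for one station.
    V, W        : (n_day, R) and (n_time, R) factor matrices (point estimates).
    tau         : noise precision.
    mu_k, Lambda_k : mean and precision of the station's cluster.
    """
    days, times = np.nonzero(mask_i)
    H = V[days] * W[times]                     # (n_obs, R): elementwise product per observed entry
    precision = Lambda_k + tau * H.T @ H       # prior precision acts as a regularizer
    rhs = Lambda_k @ mu_k + tau * H.T @ y_i[days, times]
    mean = np.linalg.solve(precision, rhs)
    return mean, precision

rng = np.random.default_rng(0)
R, n_day, n_time = 3, 7, 16
V = rng.normal(size=(n_day, R))
W = rng.normal(size=(n_time, R))
u_true = rng.normal(size=R)
y_i = np.einsum("r,jr,kr->jk", u_true, V, W)   # noise-free slice for one station
mu_k, Lambda_k = np.ones(R), np.eye(R)

# With no observations, the estimate falls back to the cluster mean.
no_obs = np.zeros((n_day, n_time), dtype=bool)
mean_prior, _ = update_station_row(y_i, no_obs, V, W, 1.0, mu_k, Lambda_k)

# With every entry observed and very precise data, the likelihood dominates.
all_obs = np.ones((n_day, n_time), dtype=bool)
mean_data, _ = update_station_row(y_i, all_obs, V, W, 1e6, mu_k, Lambda_k)
```

The two extremes show the regularization at work: a station with many missing entries is pulled toward the mean of its cluster, while a well-observed station is determined mostly by its own data.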

The goal of BNPTD model learning is to derive the posterior distributions of the parameters and hyperparameters. Variational inference is a method that approximates the posterior distributions through optimization [

We further focus on the mean-field variational family that the variational distribution

On the basis of the mean-field variational family, the optimal variational posterior distribution of each

(1)

(2) for

(3) Update the variational posterior distribution

(4) Update the variational posterior distribution

(5) for

(6) Update the variational posterior distribution

(7) for

(8) Update the variational posterior distribution

(9) for

(10) Update the variational posterior distribution

(11) Update the variational posterior distribution

(12) until maximum iterations exhausted

To evaluate the performance of our method, we conduct extensive experiments on a real-world smart card dataset collected from subway stations in Guangzhou, China. This dataset contains a large number of tap-in/tap-out records from 3/7/2017 to 16/7/2017. Because the subway does not operate around the clock, we focus on the smart card data from 6 a.m. to 10 p.m., which covers the main travel period of the day. We organize the dataset into a third-order tensor (station

To show the effectiveness of the adaptive prior distribution, we compare BNPTD with several baselines, including DA (daily average), BTD (Bayesian tensor decomposition), and GAIN (deep generative models [

In our experiments, we evaluate the imputation accuracy with two widely applied metrics, namely, Root Mean Square Error (RMSE) and Mean Absolute Error (MAE):
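As a minimal sketch, both metrics can be computed over the held-out (artificially removed) entries only. The function name and the boolean `eval_mask` convention are assumptions made for illustration.

```python
import numpy as np

def rmse_mae(y_true, y_pred, eval_mask):
    """RMSE and MAE restricted to the entries selected by eval_mask."""
    err = (y_true - y_pred)[eval_mask]
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    return rmse, mae

y_true = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred = np.array([[1.0, 4.0], [0.0, 4.0]])
eval_mask = np.array([[False, True], [True, False]])   # True = held-out entry

rmse, mae = rmse_mae(y_true, y_pred, eval_mask)        # errors on held-out entries: -2 and 3
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a fuller picture of imputation quality.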

The missing ratio ranged from 10% to 50%. Table

Comparison of BNPTD with other baselines in the random missing scenario (the numbers 10, 20, 30 after the two methods denote CP rank and the best models are highlighted in bold).

Method | 10% RMSE | 10% MAE | 20% RMSE | 20% MAE | 30% RMSE | 30% MAE | 40% RMSE | 40% MAE | 50% RMSE | 50% MAE |
---|---|---|---|---|---|---|---|---|---|---|
BTD (10) | 57.37 | 34.04 | 56.58 | 33.61 | 62.35 | 36.20 | 65.23 | 37.64 | 65.51 | 37.83 |
BNPTD (10) | | | | | | | | | | |
GAIN | 56.91 | 34.77 | 56.29 | 34.46 | 62.22 | 37.29 | 65.24 | 38.91 | 65.66 | 39.27 |
BTD (20) | | 28.26 | 47.93 | 28.64 | 48.86 | 28.96 | 53.66 | 30.74 | | |
BNPTD (20) | 46.61 | | | | | | | | 50.13 | 29.43 |
GAIN | 46.74 | 29.24 | 48.26 | 29.71 | 49.31 | 30.18 | 54.15 | 32.04 | 50.51 | 30.76 |
BTD (30) | 43.79 | 26.59 | 44.96 | 26.90 | 45.79 | 27.21 | 46.80 | 27.66 | 49.21 | 28.77 |
BNPTD (30) | | | | | | | | | | |
GAIN | 44.19 | 27.66 | 45.46 | 28.02 | 46.39 | 28.44 | 47.41 | 28.92 | 49.92 | 30.20 |
DA | 74.29 | 44.07 | 72.57 | 43.00 | 79.23 | 45.80 | 81.81 | 46.93 | 80.82 | 46.33 |

Comparison of BNPTD with other baselines in the continuous missing scenario (the numbers 10, 20, 30 after the two methods denote CP rank and the best models are highlighted in bold).

Method | 10% RMSE | 10% MAE | 20% RMSE | 20% MAE | 30% RMSE | 30% MAE | 40% RMSE | 40% MAE | 50% RMSE | 50% MAE |
---|---|---|---|---|---|---|---|---|---|---|
BTD (10) | 60.92 | 36.73 | 59.67 | 35.76 | 65.79 | 38.38 | 69.10 | 39.76 | 69.71 | 40.03 |
BNPTD (10) | 60.19 | | 59.10 | | 65.01 | | | | | |
GAIN | | 36.88 | | 36.08 | | 38.94 | 68.02 | 40.76 | 68.97 | 41.36 |
BTD (20) | 49.85 | 30.70 | 51.29 | 30.74 | 52.22 | 30.97 | 57.62 | 32.74 | 53.77 | 31.35 |
BNPTD (20) | 49.26 | | 50.80 | | 51.60 | | | | | |
GAIN | | 31.01 | | 31.11 | | 31.52 | 56.46 | 33.56 | 53.06 | 32.39 |
BTD (30) | 49.29 | 30.60 | 50.70 | 30.87 | 52.08 | 31.35 | 53.78 | 31.94 | 58.05 | 34.02 |
BNPTD (30) | | | | | | | | | | |
GAIN | 48.77 | 31.31 | 50.42 | 31.71 | 52.01 | 32.48 | 54.22 | 33.73 | 58.41 | 36.00 |
DA | 73.36 | 43.89 | 71.81 | 42.61 | 78.36 | 45.15 | 80.93 | 46.12 | 80.67 | 45.95 |

The CP rank

In Cluster 1, the subway stations have a large passenger volume every day, with no difference between working days and nonworking days. From Figure

The second departure pattern is anomalous: the passenger volume on the nonworking days from 8/7/2017 to 9/7/2017 is larger than on working days. A web search indicates that an exhibition was being held near these subway stations. The proposed BNPTD can therefore also reveal special events and can be used for anomaly detection.

In Cluster 3, the passenger volume of the stations is below 600 per 10 minutes. These stations are not usually crowded, and their spare capacity could be further exploited to serve more passengers.

Cluster 4 has a larger passenger volume than Cluster 3. The stations belonging to Cluster 4 have small peaks during the morning or evening rush hour. In addition, these stations are mainly situated in the urban fringe, far from the central business area, as can be seen from Figure

Cluster 5 is a morning peak departure pattern, and the peak values on weekends are smaller than those on weekdays. The stations are mainly located close to the residential areas in Figure

The stations with the sixth departure pattern have a larger passenger volume in the evening rush hour, and the peak values on weekends are smaller than those on weekdays. These stations are mainly located in work districts, where many employees take the subway home after work.

The seventh departure pattern has an extremely large passenger volume in the evening rush hour, and these stations are situated in the middle of the central business district (CBD) from Figure

The result of similarity pattern discovery. (a) Cluster 1. (b) Cluster 2. (c) Cluster 3. (d) Cluster 4. (e) Cluster 5. (f) Cluster 6. (g) Cluster 7.

The subway stations distribution in Google Maps.

In this subsection, an evaluation measure called purity is adopted to show the clustering stability of BNPTD with different missing ratios. Purity is an external index that is used to measure the similarity of the formed clusters to external clusters such as ground truth [

It is easy to see that

The purity with different missing ratios.
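For reference, purity can be computed as in the sketch below: each formed cluster is matched to the reference class it overlaps most, and the matched counts are summed and divided by the number of objects. The function name and the toy labels are illustrative assumptions; which reference labeling the paper uses is not reproduced here.

```python
import numpy as np

def purity(labels_pred, labels_ref):
    """Fraction of objects that fall in the majority reference class of their cluster."""
    labels_pred = np.asarray(labels_pred)
    labels_ref = np.asarray(labels_ref)
    matched = 0
    for c in np.unique(labels_pred):
        members = labels_ref[labels_pred == c]           # reference labels inside cluster c
        _, counts = np.unique(members, return_counts=True)
        matched += counts.max()                          # size of the dominant reference class
    return matched / labels_pred.size

labels_pred = [0, 0, 0, 1, 1, 2]   # clusters found on incomplete data (toy example)
labels_ref = [0, 0, 1, 1, 1, 2]    # reference clusters (toy example)

score = purity(labels_pred, labels_ref)   # (2 + 2 + 1) / 6
```

Purity lies in (0, 1], and a value of 1 means every formed cluster contains objects from a single reference class.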

In this paper, a novel Bayesian nonparametric tensor decomposition model called BNPTD is proposed. The model combines the Bayesian tensor decomposition and Dirichlet process mixture model via a hierarchical probabilistic model, which can achieve incomplete traffic data imputation and similarity pattern discovery simultaneously. Moreover, we derive a variational inference algorithm to learn the model efficiently. Experiments on a real-world smart card dataset show the effectiveness of the proposed model.

The similarity patterns extracted in this paper not only help us understand urban mobility but can also serve as prior knowledge to improve traffic prediction. For future work, we plan to develop multitask prediction based on these similarities.

The data used to support the results of this study are available from the corresponding author upon request.

The authors declare that they have no conflicts of interest.

This research was supported by the project of National Natural Science Foundation of China (no. U1811463) and the Natural Science Foundation of Guangdong Province, China (no. 20187616042030004).

The derivation of the optimal variational posterior distributions of the parameters and hyperparameters can be found in the Supplementary File.