Traffic Volume Data Outlier Recovery via Tensor Model

Traffic volume data are already collected and used for a variety of purposes in intelligent transportation systems (ITS). However, the collected data may be abnormal because of outliers caused by malfunctions in data collection and recording systems. To fully analyze and make use of the collected data, it is necessary to develop a valid method for handling outlier data. Many existing algorithms address outlier recovery with time series methods. In this paper, a multiway tensor model is proposed for representing traffic volume data based on its intrinsic multilinear correlations, such as day to day and hour to hour. Then, a novel tensor recovery method, called ADMM-TR, is proposed for recovering outlier-corrupted traffic volume data. The proposed method is evaluated on synthetic data and real world traffic volume data. Experimental results demonstrate the practicability, effectiveness, and advantage of the proposed method, especially for the real world traffic volume data.


Introduction
In order to alleviate traffic congestion and facilitate mobility in metropolises, large amounts of traffic information are collected as part of intelligent transportation systems (ITS), such as CVIS (Cooperative Vehicle Infrastructure System) in China. These collected traffic data have a wide range of applications. Real-time traffic information is provided to travelers to support their decision-making on the optimal route choice [1]. As shown by the work of Kim et al. [2], real-time information can help reduce operation cost and maximize resource utilization. In addition to these applications, the collected data can be applied to maximize the utilization of the infrastructure for a smooth flow of traffic. One such application of real-time traffic data is traffic information control [3]. Furthermore, several data mining techniques have been applied to mine time-related association rules from traffic databases, and their results have been used for traffic prediction, as in the works of Williams et al. [4] and Xu et al. [5]. From the above discussion, it is concluded that the collected traffic data are essential for many potential applications in ITS.
In the real world, the collected data are frequently corrupted by noise, especially by outliers, which may be caused by detector failures, communication problems, or other hardware- or software-related problems. The presence of outlier data in the database significantly degrades the quality and reliability of the data and may impede the effectiveness of ITS applications. Therefore, it is essential to fill the gaps caused by outlier data in order to fully exploit the data and realize the ITS applications.
While many different kinds of traffic data, such as traffic volume, speed, and occupancy, are collected, the focus of this research is on the recovery of traffic volume outlier data. It is assumed that the detectors collecting traffic information are set up at road sections and that the collected values represent the traffic volume for those road sections. The aim of this research is to recover the traffic volume outlier data for road sections.
A survey of the literature in the related field shows that several filtering techniques have been applied to recover outlier traffic data [6][7][8]. Filtering methods include techniques such as singular value decomposition, wavelet analysis, immune algorithms, and spectrum subtraction. Filtering methods formulate the traffic volume as a time series model and smooth the traffic waveform. These approaches recover the outlier data day by day through spectrum analysis and feature extraction. However, the traffic data at the same location are significantly similar from day to day, and these approaches cannot utilize this characteristic. Pei and Ma [6] show that similarity is an important factor affecting recovery performance. The above methods consider similarity in only one mode, and their recovery performance depends mainly on the smoothing threshold. Unfortunately, the smoothing threshold is determined empirically.
In order to improve the recovery performance and take the multidimensional character of traffic data into account, mining the multimode similarities can contribute greatly to recovering outlier values. Our approach is based on utilizing the multimode correlations of traffic data; that is, traffic data have different correlations in different modes, such as the week mode, day mode, and hour mode. More concretely, the proposed method recovers an outlier value using the traffic volume information of the different modes. The problem is not simple, however, because traffic volumes from many days might be corrupted by outlier data simultaneously. In order to handle multiple outlier-corrupted traffic volumes, we use a tensor to model the traffic volume.
In order to solve the traffic volume outlier data problem, we formulate traffic volume recovery as a data recovery problem based on the assumption that the essential traffic volume is low-$n$-rank (low rank) and the outliers are sparse. That is, the corrupted traffic volumes can be formulated as

$$\mathcal{A} = \mathcal{L} + \mathcal{S}, \tag{1}$$

where $\mathcal{A}$ is the observed traffic volume, which is corrupted, $\mathcal{L}$ is the recovered traffic volume, and $\mathcal{S}$ represents the outliers.
In this problem, the locations of the corrupted traffic data entries are unknown. One straightforward solution is to solve the optimization problem (2), which is posed under the assumption that the $n$-rank of $\mathcal{L}$ is small and the outliers are sparse or bounded. The tensor recovery problem (2) has been studied in recent years, as detailed in Section 3. In this paper, a new data recovery method based on the tensor model, called Alternating Direction Method of Multipliers for Tensor Recovery (ADMM-TR), is proposed to handle the outlier traffic volumes.
This paper makes three main contributions: (1) we use a tensor to model the traffic volume and take advantage of the multiway characteristics of tensors, which makes it possible to exploit the correlations of different modes in traffic data; (2) we formulate the recovery of traffic volume outlier data as a tensor recovery problem; (3) we propose the ADMM-TR algorithm, which extends ADMM from the matrix case to the tensor case, to solve the formulated tensor recovery problem for traffic volume, and we prove the convergence of ADMM-TR. It should also be noted that the proposed ADMM-TR method is different from [9], which reports an extension of ADM to tensor recovery. In fact, that work presents ADM for tensor completion, in which data are missing and the locations of the missing entries are known. In this research, by contrast, the objective is to recover data that are corrupted, whether missing or noisy, when the locations of the corrupted values are unknown.
The paper is organized as follows. We review tensor models in Section 2. Section 3 briefly reviews the related data recovery methods. In Section 4, the tensor model for traffic volume is constructed, the traffic data recovery problem is formulated, and an efficient algorithm is proposed to solve the formulation. A simple convergence guarantee for the proposed algorithm is also given. In Section 5, we evaluate the proposed method on synthetic data and real world traffic volume data. Finally, we provide some concluding remarks in Section 6.

Notation and Review of Tensor Models
In this section, we adopt the nomenclature of Kolda and Bader's review on tensor decomposition [10] and partially adopt the notation in [11].
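The $n$-rank of an $N$-dimensional tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, denoted by $r_n$, is the rank of the mode-$n$ unfolding matrix $A_{(n)}$:

$$r_n = \operatorname{rank}_n(\mathcal{A}) = \operatorname{rank}(A_{(n)}).$$

The inner product of two same-size tensors $\mathcal{A}, \mathcal{B} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is defined as the sum of the products of their entries, that is,

$$\langle \mathcal{A}, \mathcal{B} \rangle = \sum_{i_1} \sum_{i_2} \cdots \sum_{i_N} a_{i_1 i_2 \cdots i_N} \, b_{i_1 i_2 \cdots i_N}.$$

The corresponding Frobenius norm is $\|\mathcal{A}\|_F = \sqrt{\langle \mathcal{A}, \mathcal{A} \rangle}$. Besides, the $l_0$ norm of a tensor $\mathcal{A}$, denoted by $\|\mathcal{A}\|_0$, is the number of nonzero elements in $\mathcal{A}$, and the $l_1$ norm is defined as

$$\|\mathcal{A}\|_1 = \sum_{i_1, i_2, \ldots, i_N} |a_{i_1 i_2 \cdots i_N}|.$$

The $n$-mode (matrix) product of a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ with a matrix $U \in \mathbb{R}^{J \times I_n}$ is denoted by $\mathcal{A} \times_n U$ and is of size $I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N$. In terms of flattened matrices, the $n$-mode product can be expressed as

$$\mathcal{Y} = \mathcal{A} \times_n U \iff Y_{(n)} = U A_{(n)}.$$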
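The notation above can be made concrete with a short NumPy sketch. The helper names (`unfold`, `fold`, `n_rank`, `mode_product`) are hypothetical and not taken from the paper, and the column ordering of the unfolding used here is only one possible convention; the $n$-rank and the $n$-mode product do not depend on that choice as long as folding and unfolding are consistent.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: rows are indexed by `mode`, columns by all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of `unfold` for a tensor whose full shape is `shape`."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape([shape[mode]] + rest), 0, mode)

def n_rank(tensor):
    """Tuple of the ranks of all mode-n unfoldings."""
    return tuple(int(np.linalg.matrix_rank(unfold(tensor, n)))
                 for n in range(tensor.ndim))

def mode_product(tensor, matrix, mode):
    """n-mode product, computed through the identity Y_(n) = U A_(n)."""
    shape = list(tensor.shape)
    shape[mode] = matrix.shape[0]
    return fold(matrix @ unfold(tensor, mode), mode, shape)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 6))
B = rng.standard_normal((4, 5, 6))
U = rng.standard_normal((3, 5))

inner = float(np.sum(A * B))         # inner product <A, B>
fro = float(np.sqrt(np.sum(A * A)))  # Frobenius norm
l0 = int(np.count_nonzero(A))        # l0 "norm": number of nonzero entries
l1 = float(np.sum(np.abs(A)))        # l1 norm
Y = mode_product(A, U, 1)            # size 4 x 3 x 6
```

Working through the unfolding keeps every tensor operation as an ordinary matrix operation, which is also how the recovery subproblems in Section 4 are handled.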

Review of Data Recovery Methods
Recently, the problem of recovering the sparse and low-rank components with no prior knowledge about the sparsity pattern of the sparse matrix, or the rank of the low-rank matrix, has been well studied. The authors of [12] proposed the concept of "rank-sparse incoherence" and solved the problem with an interior point solver after reformulating it as a semidefinite program. However, although interior point methods normally take very few iterations to converge, they have difficulty handling large matrices. This limitation prevents the use of the technique in computer vision and in the traffic volume recovery considered in this research.
To solve the problem for large-scale matrices, Wright et al. [13] adopted an iterative thresholding technique and obtained good scalability. Lin et al. proposed an accelerated proximal gradient (APG) algorithm [14] and applied augmented Lagrange multiplier (ALM) techniques [15] to solve the problem. Yuan and Yang [16] utilized the alternating direction method (ADM), which can be regarded as a practical version of the classical ALM method, to solve the matrix recovery problem. The ADM method has been shown to converge quickly, and the results in [16] demonstrate its excellent performance.
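As an illustration of the matrix case reviewed above, the following is a minimal sketch of an ALM-style alternating iteration for splitting a matrix into low-rank and sparse parts. It is not the code of [13]-[16]: the function name `rpca_alm`, the initial penalty, the growth factor `rho`, the cap on the penalty, and the weight `lam = 1/sqrt(max dimension)` are assumptions chosen for the example.

```python
import numpy as np

def rpca_alm(D, max_iter=200, tol=1e-7, rho=1.5):
    """Split D into low-rank L and sparse S (sketch; parameters are assumed)."""
    lam = 1.0 / np.sqrt(max(D.shape))   # weight of the l1 term (assumed)
    norm_D = np.linalg.norm(D)
    mu = 1.25 / np.linalg.norm(D, 2)    # initial penalty (assumed heuristic)
    mu_max = mu * 1e7                   # cap so the penalty stays finite
    S = np.zeros_like(D)
    Y = np.zeros_like(D)                # Lagrange multiplier
    for _ in range(max_iter):
        # Low-rank step: soft-threshold the singular values by 1/mu.
        U, sigma, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sigma - 1.0 / mu, 0.0)) @ Vt
        # Sparse step: entrywise soft-thresholding by lam/mu.
        T = D - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        # Multiplier and penalty updates.
        residual = D - L - S
        Y = Y + mu * residual
        mu = min(rho * mu, mu_max)
        if np.linalg.norm(residual) <= tol * norm_D:
            break
    return L, S

rng = np.random.default_rng(0)
L0 = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 60))  # rank 2
S0 = np.where(rng.random((60, 60)) < 0.05,
              10.0 * rng.choice([-1.0, 1.0], size=(60, 60)), 0.0)
L_hat, S_hat = rpca_alm(L0 + S0)
rel_err = float(np.linalg.norm(L_hat - L0) / np.linalg.norm(L0))
```

Each iteration costs one singular value decomposition, which is why these first-order schemes scale better than interior point solvers on large matrices.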
Inspired by the idea of [16], this paper extends the sparse and low-rank recovery problem to the tensor case, since multidimensional traffic data can be represented in the form of a tensor.

ADMM-TR for Traffic Volume Outlier Recovery
In this section, we present the solution of problem (2). The tensor model for traffic volume is first constructed in Section 4.1. We then present the tensor recovery problem in Section 4.2. In Section 4.3, the classical ADMM approach is introduced. In Section 4.4, we convert the original problem into a constrained convex optimization problem that can be solved by the extended ADMM approach and present the details of the proposed algorithm. The convergence guarantees of the proposed algorithm are also given in this section. In this paper, the correlation of traffic data is analyzed quantitatively based on traffic volume data downloaded from http://pems.dot.ca.gov/. The correlation coefficient used to measure the data correlation is the one given in [17].

Tensor Model for Traffic Volume
Here the correlation coefficient is computed over the whole set of data points, and the results are collected in a correlation coefficient matrix. Table 1 gives the correlation coefficients of four modes, namely, hour, day, week, and month. Conventional methods usually use a day-to-day matrix pattern to model the traffic data. Although each mode of traffic data has a very high similarity, these methods do not utilize the multimode correlations, such as "Day × Hour," "Week × Hour," and "Link × Hour," simultaneously and thus may result in poor recovery performance.
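A day-mode correlation analysis of this kind can be sketched as follows. The exact formula of [17] is not reproduced in the text, so the sketch assumes the ordinary Pearson correlation coefficient as computed by `np.corrcoef`, and it uses a synthetic daily profile (hypothetical values) in place of the PeMS data.

```python
import numpy as np

rng = np.random.default_rng(0)
days, per_day = 16, 288                      # 288 five-minute records per day

# Synthetic stand-in for detector counts: a shared daily profile plus noise.
t = np.arange(per_day)
profile = 200.0 + 150.0 * np.sin(2.0 * np.pi * (t - 72) / per_day)
X = profile + rng.normal(scale=10.0, size=(days, per_day))

# Each row is one day, so np.corrcoef returns the day-to-day correlation matrix.
R_day = np.corrcoef(X)

# Summarize the mode by the mean of the off-diagonal coefficients.
off_diagonal = R_day[~np.eye(days, dtype=bool)]
mean_day_corr = float(off_diagonal.mean())
```

Arranging the same records with a different row index (hours, weeks, or months) and repeating the computation would give the coefficient for that mode.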
To make full use of the multimode correlation and the spatial-temporal information of traffic, traffic data need to be organized as a multiway data set. Fortunately, a tensor pattern is well suited to modeling multiway traffic data. It preserves the original structure and retains sufficient spatial-temporal information.

The Tensor Recovery Problem
Problem (2) is nonconvex and NP-hard in general. We therefore use the approximate formulation (8), where $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is the given tensor to be recovered; (8) relaxes the constraints so that a low-$n$-rank tensor can be recovered from a high-dimensional data tensor despite both small entry-wise noise and gross sparse errors.
Recently, Liu et al. [18] proposed a definition of the nuclear norm of an $N$-mode tensor, given in (9). Based on this definition, the optimization in (8) can be rewritten as (10). In order to recover $(\hat{\mathcal{L}}, \hat{\mathcal{S}})$, instead of directly solving (10), we solve the dual problem (11). The problem in (11) is still difficult to solve because of the interdependent nuclear norm and $l_1$ norm constraints. To simplify the problem, the formulation is rewritten as (12), where $P_n$ is the matrix representation of mode-$n$ unfolding (note that $P_n$ is a permutation matrix; thus $P_n^{T} P_n = I$), and $M_n$ and $N_n$ are additional auxiliary matrices of the same size as the mode-$n$ unfolding of $\mathcal{L}$ (or $\mathcal{S}$).
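Theorem 1 (see [19, Proposition 5.2]). Assume that the optimal solution set of (13) is nonempty. Furthermore, assume that the constraint set of the first block of variables is bounded or else that the matrix formed by the adjoint of the constraint operator times the operator itself is invertible. Then the sequence of iterates generated by (15) is bounded, and every limit point of the iterates of the first block of variables is an optimal solution of the original problem (13).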

ADMM Extension to Tensor Recovery
We observe that (12) is well structured in the sense that both the objective function and the constraints are separable. Thus, we propose an algorithm based on an extension of the classical ADMM approach that takes advantage of this favorable structure to solve the tensor recovery problem.
The augmented Lagrangian of (12) is given in (16); it involves Lagrangian multipliers for the constraints of each mode and positive penalty parameters.
We can now directly apply ADMM with this augmented Lagrangian function.
Computing $M_n$. With all other variables held constant, the optimal $M_n$ is obtained from the subproblem (17). As shown in [20], the optimal solution of (17) is given by (18), where $U \Lambda V^{T}$ is the singular value decomposition given by (19) and the "shrinkage" operator $D_{\tau}(x)$ with $\tau > 0$ is defined as

$$D_{\tau}(x) = \operatorname{sgn}(x) \, \max(|x| - \tau, 0).$$

Computing $N_n$. With all other variables held constant, the optimal $N_n$ is obtained from the subproblem (21). By the well-known $l_1$ minimization [21], the optimal solution of (21) is given by (22), where $D_{\tau}$ is again the "shrinkage" operation.
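The two thresholding steps used in these subproblems can be sketched in NumPy. The function names `shrink` and `svt` are hypothetical; the sketch shows the standard soft-thresholding operator and its application to singular values, not the paper's full update formulas, whose arguments and thresholds depend on equations not reproduced here.

```python
import numpy as np

def shrink(x, tau):
    """Entrywise soft-thresholding: sgn(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt(matrix, tau):
    """Singular value thresholding: apply `shrink` to the singular values."""
    U, sigma, Vt = np.linalg.svd(matrix, full_matrices=False)
    return (U * shrink(sigma, tau)) @ Vt

# Entries with magnitude at most tau are set to zero; the rest move toward zero.
x = np.array([-3.0, -0.5, 0.0, 0.4, 2.0])
shrunk = shrink(x, 1.0)

# Thresholding the singular values lowers the rank of the result.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))
W_low = svt(W, 1.0)
```

`shrink(W, tau)` minimizes $\tau\|X\|_1 + \tfrac{1}{2}\|X - W\|_F^2$ and `svt(W, tau)` minimizes $\tau\|X\|_* + \tfrac{1}{2}\|X - W\|_F^2$, which is why both subproblems admit closed-form solutions.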
Computing $\mathcal{L}$. We now fix all variables except $\mathcal{L}$ and minimize the augmented Lagrangian over $\mathcal{L}$. The resulting problem is the minimization of a quadratic function. The objective function is differentiable, so the minimizer $\mathcal{L}_{\min}$ is characterized by a vanishing derivative of the augmented Lagrangian with respect to $\mathcal{L}$, which yields a closed-form update.

Computing $\mathcal{S}$. We now fix all variables except $\mathcal{S}$ and minimize the augmented Lagrangian over $\mathcal{S}$. The resulting problem is again the minimization of a quadratic function. The objective function is also differentiable, so the minimizer $\mathcal{S}_{\min}$ is characterized by a vanishing derivative of the augmented Lagrangian with respect to $\mathcal{S}$, which again yields a closed-form update.

To allow comparison with RSTD [22], we also use the difference of $\mathcal{L}$ and $\mathcal{S}$ between successive iterations, measured against a given tolerance, as the stopping criterion. The pseudocode of the proposed ADMM-TR algorithm is summarized in Algorithm 1.
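A stopping test of this kind might look like the following sketch. The text does not state whether the change is measured in absolute or relative terms or which tolerance is used, so the relative Frobenius-norm change, the default `tol`, and the function names are assumptions.

```python
import numpy as np

def relative_change(new, old):
    """Frobenius-norm change between iterates, relative to the previous one."""
    return float(np.linalg.norm(new - old) / max(np.linalg.norm(old), 1e-12))

def has_converged(L_new, L_old, S_new, S_old, tol=1e-6):
    """True when both the low-rank and the sparse iterate have stopped moving."""
    return (relative_change(L_new, L_old) < tol
            and relative_change(S_new, S_old) < tol)

# Identical iterates count as converged; a doubled low-rank part does not.
ones = np.ones((2, 3, 4))
same = has_converged(ones, ones, ones, ones)
moved = has_converged(2.0 * ones, ones, ones, ones)
```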
Proof. We check the assumptions of Theorem 1. The constraint set is not bounded, but the operator whose invertibility is required is a constant multiple of the identity operator and is therefore invertible. Thus Theorem 1 can also be applied to ADMM-TR, and Theorem 2 follows.

Numerical Experiments
This section evaluates the empirical performance of the proposed algorithm on synthetic data and compares the results with RSTD (Rank Sparsity Tensor Decomposition) [22]. In addition, experiments on the recovery of traffic volume outlier data illustrate the efficiency of the proposed method in the traffic research field.
We use the Lanczos algorithm for computing the singular value decomposition and adopt the same rule as [22] for predicting the dimension of the principal singular space. For all experiments, two of the parameter vectors are set equal to $[I_1/I_{\max}, I_2/I_{\max}, \ldots, I_N/I_{\max}]^{T}$ and a further parameter is set to $1/\operatorname{sum}([I_1/I_{\max}, I_2/I_{\max}, \ldots, I_N/I_{\max}]^{T})$, where $I_{\max} = \max\{I_n\}$; the remaining parameter is set to $1/\sqrt{I_{\max}}$ as suggested in [23]. All the experiments are conducted and timed on the same desktop with a Pentium(R) Dual-Core 2.50 GHz CPU and 4 GB of memory, running Windows 7 and MATLAB.
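For a given tensor size, the parameter values described above reduce to simple arithmetic on the dimensions. The sketch below evaluates them for a 16 × 24 × 12 tensor; the function and variable names are hypothetical and deliberately neutral, since they only describe how each quantity is computed, not which algorithm parameter it is assigned to.

```python
import numpy as np

def dimension_based_parameters(shape):
    """Quantities derived from the tensor dimensions I_1, ..., I_N."""
    dims = np.asarray(shape, dtype=float)
    i_max = dims.max()
    ratio_vector = dims / i_max               # [I_1/I_max, ..., I_N/I_max]
    inv_ratio_sum = 1.0 / ratio_vector.sum()  # 1 / sum of the ratio vector
    inv_sqrt_max = 1.0 / np.sqrt(i_max)       # 1 / sqrt(I_max)
    return ratio_vector, inv_ratio_sum, inv_sqrt_max

ratio_vector, inv_ratio_sum, inv_sqrt_max = dimension_based_parameters((16, 24, 12))
```

For a cubic tensor such as 40 × 40 × 40, the ratio vector is all ones, so every mode is weighted equally.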
The entries of the sparse tensor $\mathcal{S}_0$ are independently distributed, each taking the value 0 with probability $1 - spr$ and an impulsive value with probability $spr$. We apply the proposed algorithm to the tensor $\mathcal{A}_0 = \mathcal{L}_0 + \mathcal{S}_0$ to recover $\mathcal{L}$ and $\mathcal{S}$ and compare the results with RSTD. For these experiments, two cases of $n$-rank are investigated, $n$-rank = [5, 5, 5] and $n$-rank = [10, 10, 10]. Table 1 presents the average results (across 30 instances) for different $spr$.
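A test tensor of this kind could be generated as in the sketch below. The text does not say how $\mathcal{L}_0$ is constructed or how large the impulsive values are, so the Tucker-style construction (random core times random factor matrices), the outlier magnitude of ±10, and the function names are assumptions.

```python
import numpy as np

def low_n_rank_tensor(shape, ranks, rng):
    """Three-way tensor with the prescribed n-rank: random core and factors."""
    core = rng.standard_normal(ranks)
    factors = [rng.standard_normal((dim, r)) for dim, r in zip(shape, ranks)]
    return np.einsum("abc,ia,jb,kc->ijk", core, *factors)

def sparse_outliers(shape, spr, magnitude, rng):
    """Entries are 0 with probability 1 - spr, else +/- magnitude."""
    mask = rng.random(shape) < spr
    values = rng.choice([-magnitude, magnitude], size=shape)
    return np.where(mask, values, 0.0)

rng = np.random.default_rng(0)
shape, ranks, spr = (40, 40, 40), (5, 5, 5), 0.05
L0 = low_n_rank_tensor(shape, ranks, rng)
S0 = sparse_outliers(shape, spr, magnitude=10.0, rng=rng)
A0 = L0 + S0                                   # observed, corrupted tensor
observed_spr = float(np.count_nonzero(S0)) / S0.size
```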
The quality of recovery is measured by the relative square error (RSE) with respect to $\mathcal{L}_0$ and $\mathcal{S}_0$. Tables 2 and 3 show that the proposed algorithm (ADMM-TR) is about 10 percent faster than RSTD [22] and achieves better accuracy in terms of relative square error. Although both algorithms compute an SVD in every iteration, we observe that the proposed algorithm takes far fewer iterations than RSTD to converge to the optimal solution.
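The error measure can be sketched as follows, assuming the usual definition of a relative error as the Frobenius norm of the difference divided by the Frobenius norm of the ground truth; the defining equation is not reproduced in the text, so this form and the function name `rse` are assumptions.

```python
import numpy as np

def rse(estimate, truth):
    """Relative error ||estimate - truth||_F / ||truth||_F (assumed definition)."""
    return float(np.linalg.norm(estimate - truth) / np.linalg.norm(truth))

# A uniform 10% overestimate of every entry gives a relative error of 0.1.
truth = np.ones((2, 3, 4))
error = rse(1.1 * truth, truth)
```

The same function applies to both the low-rank part (against $\mathcal{L}_0$) and the sparse part (against $\mathcal{S}_0$).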
The more impulsive entries are added, that is, the higher the value of $spr$, the harder the tensor recovery problem becomes. In addition, the problem becomes more difficult when the $n$-rank of ground truth tensors of the same size is higher. In Table 1, different values of $spr$ are set for the two tensor cases. The results show that the recovery accuracy decreases as $spr$ grows in each case. In particular, the recovery accuracy for the tensor $\mathcal{A}_0 \in \mathbb{R}^{40 \times 40 \times 40}$ with $n$-rank = [5, 5, 5] decreases sharply when $spr$ reaches 40%, while the same phenomenon occurs for the tensor $\mathcal{A}_0 \in \mathbb{R}^{40 \times 40 \times 40}$ with $n$-rank = [10, 10, 10] when $spr$ is about 25%. This is because a relatively low rank and a low sparse ratio are preconditions of the tensor recovery problem.

Traffic Volume Data.
To evaluate the performance of the proposed method in recovering traffic volume outlier data, a complete traffic volume data set is used as the test data set. We use the data of a fixed point in Sacramento County, downloaded from http://pems.dot.ca.gov/. The traffic volume data are recorded every 5 minutes. Therefore, a daily traffic volume series for a loop detector contains 288 records, and the whole period of the data lasts for 16 days, that is, from August 2 to August 17, 2010.
Based on the multiple correlations of the traffic volume data, we model the data set as a tensor of size 16 × 24 × 12, which stands for 16 days, 24 hours in a day, and 12 sample intervals (i.e., one record every 5 minutes) per hour. The ratio of outlier data is set from 5% to 15%, and the outlier data are produced randomly. All the results are averaged over 10 instances. In addition to the RSE, the recovery error is reported as the mean absolute percentage error (MAPE), which is computed from the known real values and the recovered values over all recovered traffic volumes.
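The tensor construction and the percentage error can be sketched as follows. The sketch uses random counts as a stand-in for the PeMS series; the way outliers are drawn (uniform values up to twice the maximum count) and the standard form of MAPE are assumptions, since the text gives neither the outlier distribution nor the defining equation.

```python
import numpy as np

def mape(real, recovered):
    """Mean absolute percentage error, in percent (standard form, assumed)."""
    real = np.asarray(real, dtype=float)
    recovered = np.asarray(recovered, dtype=float)
    return float(np.mean(np.abs(real - recovered) / np.abs(real)) * 100.0)

days, hours, per_hour = 16, 24, 12
rng = np.random.default_rng(0)

# Stand-in for 16 days of 5-minute counts from one detector (hypothetical values).
series = rng.integers(50, 400, size=days * hours * per_hour).astype(float)

# Row-major reshape: entry (d, h, k) is the k-th 5-minute record of hour h on day d.
tensor = series.reshape(days, hours, per_hour)

# Corrupt a random 5% of the entries with outlier values.
outlier_ratio = 0.05
mask = rng.random(tensor.shape) < outlier_ratio
corrupted = tensor.copy()
corrupted[mask] = rng.uniform(0.0, 2.0 * tensor.max(), size=int(mask.sum()))

error_before = mape(tensor, corrupted)   # error of the corrupted data itself
```

Evaluating `mape(tensor, recovered)` on the output of a recovery method and comparing it with `error_before` gives a before/after comparison of the kind reported in Table 4.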
Table 4 presents the relative errors of the traffic volume outlier data before and after recovery by ADMM-TR. The results show that the RSE and MAPE with respect to $\mathcal{L}_0$ of the traffic volume data corrupted by outliers are about 5 times those of the data recovered by ADMM-TR. Figures 1, 2, and 3 present the profiles of traffic volume data for a day. The results show that ADMM-TR recovers the outlier-corrupted traffic volume data closely.

Conclusions
In this paper, we concentrate on the mathematical problem of traffic volume outlier data recovery and propose a novel tensor recovery method based on the alternating direction method of multipliers (ADMM). The proposed algorithm automatically separates the low-$n$-rank tensor data from the sparse part. The experiments show that the proposed method is more stable and accurate than RSTD in most cases and converges quickly. Experiments on real world traffic volume data demonstrate the practicability and effectiveness of the proposed method in the traffic research domain.
In the future, we would like to investigate how to choose the parameters of our algorithm automatically and to explore additional applications of our method in the traffic research domain.
Traffic Volume. The correlations of traffic volume data are critical for recovering the corrupted traffic volume data. Traditional methods mostly exploit part of correlations, such as historical or temporal neighboring correlations. The classic methods usually utilize the temporal

Figure 1: Comparison of raw traffic volume data, data corrupted by outliers at a 5% ratio, and data recovered by ADMM-TR.

Table 1: The similarity coefficient of four modes.

Table 4: Traffic volume outlier data before and after being recovered.