Concern about missing traffic data has grown in recent years. This paper proposes a robust imputation approach for missing traffic flow data based on matrix completion. The method exploits the day-to-day similarity of traffic flow by assuming that the constructed traffic flow matrix is low-rank. The physical limits of nonnegativity and road capacity are enforced within the optimization process, which prevents negative and overcapacity values. Moreover, the proposed algorithm imputes missing data and recovers outliers in a unified framework. The experimental results show that the proposed method is more accurate, more stable, and more physically reasonable than the compared methods.
1. Introduction
Traffic information collected by various kinds of sensors is a vital component of intelligent transportation systems (ITS), which aim to influence travel behavior, reduce traffic congestion, improve mobility, and enhance air quality [1]. For example, real-time traffic information can be provided to drivers before and during their trips to support their route choices [2], and it is also an important guideline for modern traffic control systems when adjusting signal timing [3]. Moreover, after proper preprocessing, real-time traffic data can be used for real-time traffic state estimation of transportation networks [4]. In addition, several data mining techniques have been applied to mine time-related association rules from historical traffic databases, and the results have been used for traffic prediction, as in the works of Qiao et al. [5] and Zargari et al. [6].
However, missing traffic data remain inevitable in many places because of detector faults or transmission distortion. About 10% of daily traffic flow data is usually missing in Beijing, China [7]. Turner et al. [8] reported that almost a quarter of the data from San Antonio, Texas, is missing, and more than 5% of the data are lost within the PeMS traffic flow database [9]. Missing data adversely affect the applications of intelligent transportation systems; for example, a traffic control system requires sufficient traffic flow data (i.e., traffic volumes, occupancy rates, and flow speeds) to generate appropriate traffic management strategies [10, 11]. In traffic forecasting, prediction performance degrades sharply when data are missing [12, 13]. Clearly, missing data are a major obstacle for any function that relies on ITS data.
In the past decades, numerous imputation methods have been proposed to handle the missing traffic data problem. These methods can be roughly divided into two categories: interpolation-based methods and inductive-learning-based methods.
Interpolation-based methods fill the missing data with a weighted average calculated from part of the known data. Yin et al. [14] use historical averages from the same detector at the same time period on neighboring days to replace missing data. Zhong et al. [15] interpolate erroneous traffic data using data from similar daily flow variation patterns, taking the error type and traffic condition into account. Such approaches use only part of the traffic information and often fail to estimate missing values accurately at high missing ratios.
On the other hand, inductive learning methods build imputation models from the a priori characteristics of traffic data. Most of these methods rest on some assumptions about traffic flow data. The Autoregressive Integrated Moving Average (ARIMA) model is based on the assumption that the historical and future values of traffic flow provide an indication of the missing value [16]. The probabilistic principal component analysis (PPCA) based methods assume that the basic characteristics of traffic flow variations can be captured by the probability distribution of PPCA [7, 17]. The recent tensor-based methods assume that traffic data are highly correlated across multiple modes (day, week, link) and arrange the traffic flow data into a multiway array (tensor) to capture these correlations. By exploiting these essential characteristics of traffic flow, such imputation methods often outperform traditional interpolation methods [7]. Most inductive missing traffic data imputation methods can be seen to make the following assumptions.
Assumption 1.
Traffic flow is highly similar from day to day, from week to week, and between neighboring links, and this similarity can be utilized to impute missing traffic flow.
Assumption 2.
The traffic flow data have not been spoiled by outliers, although outliers frequently occur in real-world traffic information systems.
Although the inductive imputation methods have achieved some success under these assumptions, they still have shortcomings. First, according to traffic flow theory, traffic volume lies between zero and the road capacity (the maximum traffic flow obtainable on a given roadway using all available lanes), but most data-driven imputation methods ignore this limit. Second, traditional methods such as the PCA-based methods do not work well with large outliers or errors unless the corrupted traffic data are preprocessed [7]. We have previously proposed several tensor completion methods [17–20], which make full use of the multimode correlations [21] of traffic flow data, to impute missing traffic data. However, those works still ignore the nonnegativity (lower bound) and capacity (upper bound) of traffic flow. As a result, these methods may produce unreliable results.
To tackle these shortcomings, this paper proposes a traffic flow data imputation method based on matrix completion. In the proposed method, the traffic data are arranged into an interval×day matrix. The similarity across days is captured by the assumption that the constructed matrix is low-rank. By adding a bound constraint to the optimization problem, the proposed method restricts the reconstructed traffic flow data to lie between zero and the road capacity. Moreover, for traffic flow data corrupted by outliers, the proposed method can simultaneously impute the missing data and recover the outliers through a sparsity assumption. It should be noted that the proposed method can be considered an idealized version of robust PCA (RPCA), but it differs from the natural approaches to robust PCA [22]. The proposed approach does not need to preprocess the data or isolate the outliers before imputation. This advantage allows the proposed matrix completion (MC) method to outperform traditional imputation methods, especially for traffic flow corrupted by outliers.
The rest of this paper is organized as follows. The methodology and algorithm of the proposed method are presented in Section 2. Section 3 presents the imputation test results, including a comparison with other methods. The conclusion and future work are given in Section 4.
2. Methodology
In this section, matrix completion is briefly reviewed, and the proposed missing traffic flow data imputation method is then described in detail.
2.1. Review of Matrix Completion Methods
Let $M$ be an $m \times n$ matrix of rank $r$ with $r < \min(m, n)$, and suppose that only the sampled entries $\{M_{ij} : (i,j) \in \Omega\}$ of this low-rank matrix are available, where $\Omega$ is the set of sampled index pairs. Then [17] proves that most matrices $M$ of rank $r$ can be perfectly reconstructed by solving the optimization problem:
$$\min_{X} \ \|X\|_* \quad \text{s.t.} \quad X_{ij} = M_{ij}, \ (i,j) \in \Omega. \tag{1}$$
In (1), $\|X\|_*$ is the nuclear norm of the matrix $X$, that is, the sum of its singular values.
In [23], the singular value thresholding (SVT) algorithm is used to solve an approximate optimization problem of (1):
$$\min_{X} \ \tau\|X\|_* + \frac{1}{2}\|X\|_F^2 \quad \text{s.t.} \quad P_\Omega(X) = P_\Omega(M), \tag{2}$$
where PΩ is the orthogonal projector onto the span of matrices vanishing outside of Ω so that the (i,j)th component of PΩ(X) is equal to Xij if (i,j)∈Ω and zero otherwise.
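The Lagrangian of (2) is
$$L(X, Y) = \tau\|X\|_* + \frac{1}{2}\|X\|_F^2 + \langle Y, P_\Omega(M - X)\rangle, \tag{3}$$
where $\langle A, B\rangle = \operatorname{trace}(AB^T)$ and the optimization variable is $X \in \mathbb{R}^{m \times n}$. Fix $\tau > 0$ and a sequence $\{\delta_k\}_{k \ge 1}$ of scalar step sizes. Then, starting with $Y^0 = 0 \in \mathbb{R}^{m \times n}$, the algorithm inductively defines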
$$X^k = D_\tau(Y^{k-1}), \qquad Y^k = Y^{k-1} + \delta_k P_\Omega(M - X^k), \tag{4}$$
where
$$D_\tau(X) = U S_\tau(\Sigma) V^T \quad \text{if } X = U \Sigma V^T. \tag{5}$$
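$S_\tau(x)$ is the soft-thresholding (shrinkage) operator applied to $x$, where
$$S_\tau(x) = \begin{cases} x - \tau & \text{if } x > \tau, \\ x + \tau & \text{if } x < -\tau, \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$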
More details for SVT can be found in [24].
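The SVT iteration in (4)-(6) can be sketched in Python with NumPy as follows. This is a minimal illustration rather than the reference implementation of the cited SVT work: the function names, the boolean `mask` encoding of Ω, the stopping rule on the relative residual over observed entries, and the default iteration count are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(x, tau):
    """Entrywise shrinkage operator S_tau from (6)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt_complete(M, mask, tau, delta, n_iter=500, tol=1e-4):
    """Run the SVT iteration (4); `mask` is True on the observed set Omega."""
    Y = np.zeros_like(M, dtype=float)
    X = np.zeros_like(M, dtype=float)
    observed_norm = np.linalg.norm(np.where(mask, M, 0.0))
    for _ in range(n_iter):
        # X^k = D_tau(Y^{k-1}): shrink the singular values of Y, as in (5).
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * soft_threshold(s, tau)) @ Vt
        # P_Omega(M - X^k): residual restricted to observed entries.
        residual = np.where(mask, M - X, 0.0)
        if np.linalg.norm(residual) <= tol * observed_norm:
            break
        # Y^k = Y^{k-1} + delta_k * P_Omega(M - X^k), with a constant step size.
        Y = Y + delta * residual
    return X
```

A constant step size `delta` is used here for simplicity; (4) allows a different step size at every iteration.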
Lin et al. [25] develop an augmented Lagrange multiplier method to solve the MC problem. In their method, the MC problem is formulated as
$$\min_{A, E} \ \|A\|_* \quad \text{s.t.} \quad A + E = D, \ P_\Omega(E) = 0, \tag{7}$$
where $E$ compensates for the unknown entries of $D$, which are simply set to zero. The partial augmented Lagrangian function is then
$$L(A, E, Y, \mu) = \|A\|_* + \langle Y, D - A - E\rangle + \frac{\mu}{2}\|D - A - E\|_F^2. \tag{8}$$
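Then $A$ and $E$ are updated by solving the subproblems of (8). $A$ is updated by
$$\arg\min_{A} \ \|A\|_* + \frac{\mu}{2}\|D - A - E + \mu^{-1}Y\|_F^2 = D_{\mu^{-1}}(D - E + \mu^{-1}Y), \tag{9}$$
and $E$ is updated by
$$\arg\min_{P_\Omega(E) = 0} \ \|D - A - E + \mu^{-1}Y\|_F^2 = P_{\bar\Omega}(D - A + \mu^{-1}Y), \tag{10}$$
where $\bar\Omega$ denotes the complement of $\Omega$.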
More detailed information of the augmented Lagrange multiplier (ALM) method can be found in literature [25].
2.2. The Proposed Algorithm
The goal of the proposed method is to impute missing traffic data while accounting for both the physical limits of traffic flow and possible corruption by outliers. First, the traffic flow data at a given location are arranged into a matrix as follows:
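$$A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}. \tag{11}$$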
In this matrix, $a_{ij}$ represents the discretized volume on day $j$ at time interval $i$ of that day. Because of the physical limits, $a_{ij}$ ranges from zero to the road capacity. The total number of days is $n$, and each day is divided into $m$ time intervals. Suppose that the set of observed traffic volume entries is $\Omega$ and that the traffic volume is corrupted by sparse outliers. The missing traffic data imputation problem then becomes a corrupted matrix completion problem [26–28], and the optimization problem can be described as
$$\min_{A, E} \ \operatorname{rank}(A) + \lambda\|P_\Omega(E)\|_0 \quad \text{s.t.} \quad A + E = D, \ 0 \le A \le C, \tag{12}$$
where $D$ is the observed data matrix, $C$ is the road capacity, $\|\cdot\|_0$ denotes the number of nonzero entries, and $P_\Omega$ is the orthogonal projector onto the span of matrices vanishing outside of $\Omega$.
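The matrix construction in (11) and the observed set Ω can be sketched in Python with NumPy as below. This is only an illustration under stated assumptions: the function name is hypothetical, the input is assumed to be one chronological series of interval volumes covering whole days, and missing readings are assumed to be encoded as NaN.

```python
import numpy as np

def to_interval_day_matrix(series, intervals_per_day):
    """Arrange a chronological volume series into the m x n matrix of (11)."""
    series = np.asarray(series, dtype=float)
    if series.size % intervals_per_day != 0:
        raise ValueError("series length must be a whole number of days")
    # A row-major reshape yields one row per day; transposing makes
    # entry [i, j] the volume at interval i on day j, as in (11).
    D = series.reshape(-1, intervals_per_day).T
    observed = ~np.isnan(D)  # boolean encoding of the index set Omega
    return D, observed

# Two hypothetical days of three intervals each, with one missing reading.
D, observed = to_interval_day_matrix([10, 20, 30, 12, np.nan, 33], intervals_per_day=3)
```

With 5-minute aggregation, `intervals_per_day` would be 288, so each column of the matrix holds one full day.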
Minimizing the $\ell_0$-norm $\|\cdot\|_0$ and minimizing the rank of a matrix are both NP-hard problems [29]. To convert (12) into a convex optimization problem, the rank of $A$ is approximated by the nuclear norm of the matrix (the sum of its singular values), and $\|P_\Omega(E)\|_0$ is approximated by the $\ell_1$-norm of the matrix (the sum of the absolute values of its entries) [29], as follows:
$$\min_{A, E} \ \|A\|_* + \lambda\|P_\Omega(E)\|_1 \quad \text{s.t.} \quad A + E = D, \ 0 \le A \le C. \tag{13}$$
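Because $A = D - E$ under the equality constraint, the bound $0 \le A \le C$ is equivalent to $D - C \le E \le D$. Since $E$ is easier to solve for than $A$, the problem is converted into the following form: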
$$\min_{A, E} \ \|A\|_* + \lambda\|P_\Omega(E)\|_1 \quad \text{s.t.} \quad A + E = D, \ D - C \le E \le D. \tag{14}$$
Considering the faster computation speed and higher accuracy, the augmented Lagrangian method (ALM) [25] is employed to optimize the problem.
By introducing a Lagrange multiplier $Y$ to remove the equality constraint, one obtains the augmented Lagrangian function of (14):
$$L(A, E, Y, \mu) = \|A\|_* + \lambda\|P_\Omega(E)\|_1 + \langle Y, D - A - E\rangle + \frac{\mu}{2}\|D - A - E\|_F^2. \tag{15}$$
Lin et al. [25] proved that updating A and E once when solving this subproblem is sufficient for A and E to converge to the optimal solution of (15). A is updated by
$$\arg\min_{A} \ \|A\|_* + \frac{\mu}{2}\|D - A - E + \mu^{-1}Y\|_F^2 = D_{\mu^{-1}}(D - E + \mu^{-1}Y), \tag{16}$$
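where
$$D_{\mu^{-1}}(D - E + \mu^{-1}Y) = U S_{\mu^{-1}}(\Sigma) V^T \quad \text{if } D - E + \mu^{-1}Y = U \Sigma V^T. \tag{17}$$
$S_{\mu^{-1}}(\cdot)$ is the soft-thresholding (shrinkage) operator $S_\tau$ with $\tau = \mu^{-1}$, where
$$S_\tau(x) = \begin{cases} x - \tau & \text{if } x > \tau, \\ x + \tau & \text{if } x < -\tau, \\ 0 & \text{otherwise.} \end{cases} \tag{18}$$
$E$ is updated by
$$\arg\min_{D - C \le E \le D} \ \lambda\|P_\Omega(E)\|_1 + \frac{\mu}{2}\|D - A - E + \mu^{-1}Y\|_F^2 = P_{D - C \le E \le D}\big(P_\Omega(S_{\lambda\mu^{-1}}(D - A + \mu^{-1}Y)) + P_{\bar\Omega}(D - A + \mu^{-1}Y)\big), \tag{19}$$
where $\bar\Omega$ is the complement of $\Omega$ and $P_{D - C \le E \le D}$ is the projector onto the set of matrices whose entries lie between $D - C$ and $D$. This leads to a matrix completion based traffic data imputation method (MCI), described in Algorithm 1.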
Algorithm 1: Matrix completion based traffic data imputation method.
Input: Observation samples $D_{ij}$, $(i,j) \in \Omega$, of matrix $D \in \mathbb{R}^{m \times n}$
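The updates (16)-(19) can be combined into a loop along the following lines, written in Python with NumPy. This is a hedged sketch and not a reproduction of Algorithm 1: the function name, the multiplier update Y ← Y + μ(D − A − E), the initial value of μ, the growth factor `rho`, and the cap on μ are conventional choices for augmented Lagrangian schemes that are assumed here, and the shrinkage threshold on observed entries is taken to be λ/μ, which is what minimizing λ|e| + (μ/2)(w − e)² gives.

```python
import numpy as np

def soft_threshold(x, tau):
    """Entrywise shrinkage operator S_tau from (18)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def mci_sketch(D, observed, capacity, lam, rho=1.5, n_iter=1000, tol=0.01):
    """Alternate the updates (16)-(19); `observed` is True on Omega."""
    D = np.where(observed, D, 0.0).astype(float)  # unknown entries set to zero
    d_norm = np.linalg.norm(D)                    # Frobenius norm of D
    mu = 1.25 / np.linalg.norm(D, 2)              # assumed initial penalty
    mu_max = 1e7 * mu                             # assumed cap on the penalty
    A = np.zeros_like(D)
    E = np.zeros_like(D)
    Y = np.zeros_like(D)
    for _ in range(n_iter):
        # (16)-(17): singular value thresholding with threshold 1/mu.
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = (U * soft_threshold(s, 1.0 / mu)) @ Vt
        # (19): shrink on observed entries, keep the others, clip to [D - C, D].
        W = D - A + Y / mu
        E = np.where(observed, soft_threshold(W, lam / mu), W)
        E = np.clip(E, D - capacity, D)
        # Assumed multiplier and penalty updates.
        R = D - A - E
        Y = Y + mu * R
        mu = min(rho * mu, mu_max)
        # Stop when ||D - A - E||_F / ||D||_F falls below the tolerance.
        if np.linalg.norm(R) <= tol * d_norm:
            break
    return A, E
```

In this sketch the matrix `D - E` always lies between zero and the capacity because of the clipping step, whereas `A` satisfies the bounds only up to the residual `D - A - E` at termination.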
The evaluation of an imputation method’s performance is a multiobjective problem. In this section, four key performance indicators of the proposed method are discussed, which are the accuracy, stability, robustness, and computation complexity.
3.1. The Test Data
To evaluate the proposed method, traffic flow data from the PeMS [9] open database are used. The dataset was collected by Detector 400141, which is located on the northbound I880 freeway. The freeway has four lanes under surveillance. The sampling period is from July 11, 2013 to July 30, 2013. The data are nearly complete, with a 99.9% observed ratio.
Imputation methods based on inductive learning depend on the validity of their underlying assumptions. For the proposed MCI, the traffic volume matrix is assumed to be low-rank. This assumption is validated by a low-rank approximation of the original data, computed by singular value decomposition according to the Eckart-Young theorem [30]. The singular value decomposition (SVD) of the constructed traffic volume matrix $M_G$ is
$$M_G = U_G \nabla_G V_G^T, \tag{20}$$
where the columns of $U_G$ and $V_G$ are the left-singular and right-singular vectors of $M_G$, respectively, and the diagonal entries of $\nabla_G$ are the singular values of $M_G$.
The full-rank matrix $M_G$ can be approximated by a low-rank matrix $\hat{M}_G$ through the SVD of $M_G$, namely,
$$\hat{M}_G = U_G \hat{\nabla}_G V_G^T, \tag{21}$$
where $\hat{\nabla}_G$ is the same matrix as $\nabla_G$ except that it contains only the $r$ largest singular values (the other singular values are replaced by zero). The low-rank approximation of the selected traffic data is given in Figure 1.
Figure 1: The original data (blue) and the approximated low-rank data (red) from Detector 400141. The low-rank data largely preserve the characteristics of the original data.
As Figure 1 shows, the approximated low-rank data largely preserve the characteristics of the original data. It can therefore be concluded that the low-rank hypothesis for the traffic volume matrix is reasonable.
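The truncated SVD in (20)-(21) can be sketched in Python with NumPy as follows. The function names are hypothetical, and the relative Frobenius error is just one assumed way to quantify how much of the original matrix a rank-r approximation retains.

```python
import numpy as np

def low_rank_approx(M, r):
    """Keep the r largest singular values of M and zero the rest, as in (21)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_hat = s.copy()
    s_hat[r:] = 0.0  # np.linalg.svd returns singular values in descending order
    return (U * s_hat) @ Vt

def relative_error(M, r):
    """Relative Frobenius error of the rank-r approximation."""
    return np.linalg.norm(M - low_rank_approx(M, r)) / np.linalg.norm(M)
```

By the Eckart-Young theorem, the Frobenius error of the rank-r approximation equals the root of the sum of the squared discarded singular values, so a small `relative_error` at small `r` supports the low-rank hypothesis.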
3.2. Quantitative Measures
The measures MAE, MAPE, and SDE allow one to directly compare the performance of multiple imputation techniques.
3.2.1. Accuracy
In this paper, the mean absolute percentage error (MAPE) is used to evaluate the performance of missing traffic data imputation. However, the MAPE tends to be lower when traffic volumes are higher [31]. For this reason, this paper also applies the mean absolute error (MAE) as a complementary measure.
The mean absolute error (MAE) is defined to be
$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} |x_t - M_t|. \tag{22}$$
The mean absolute percentage error (MAPE) is defined to be
$$\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n} \left|\frac{x_t - M_t}{x_t}\right| \times 100\%, \tag{23}$$
where n is the total number of missing data, xt is the observed value, and Mt is the reconstructed value.
3.2.2. Stability
The standard deviation of errors (SDE) of the tested methods is evaluated. A smaller SDE means that the errors are more tightly clustered around their mean value [29]:
$$\mathrm{SDE} = \sqrt{\operatorname{var}(M_t - x_t)}. \tag{24}$$
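The three measures in (22)-(24) can be computed as in the following Python sketch with NumPy. The function names are hypothetical, and the population form of the variance (`ddof=0`, NumPy's default) is an assumption, since the normalization is not specified in the text.

```python
import numpy as np

def mae(x, m):
    """Mean absolute error (22) between observed x and reconstructed m."""
    return float(np.mean(np.abs(x - m)))

def mape(x, m):
    """Mean absolute percentage error (23), in percent; x must be nonzero."""
    return float(np.mean(np.abs((x - m) / x)) * 100.0)

def sde(x, m):
    """Standard deviation of the errors m - x, as in (24)."""
    return float(np.std(m - x))

# Hypothetical held-out volumes and their reconstructions.
x = np.array([100.0, 200.0, 400.0])
m = np.array([110.0, 190.0, 400.0])
print(mae(x, m), mape(x, m), sde(x, m))
```

Because (23) divides by the observed value, intervals with zero observed volume have to be excluded before `mape` is applied.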
3.2.3. Robustness
Robustness is evaluated by the accuracy on datasets with added outliers under different missing ratios.
3.3. The Results without Outlier
In this part, we evaluate the performance of the MCI algorithm and compare it with other state-of-the-art algorithms, including the PCA-based PPCA [7], SVT [22], and IALM [23], on the random missing case without outliers. For MCI, the stopping tolerance on $\|D - A - E\|_F / \|D\|_F$ is set to 0.01, the maximum number of iterations is set to $10^3$, and $\lambda$ is set to 150. For SVT and IALM, the tolerance on the Frobenius-norm residual is set to 0.01, and the maximum number of iterations is also set to $10^3$. For PPCA, similar to [7], the tolerance is set to 0.01, and the dimension of the latent space is set to 15.
In order to better verify the change in imputation performance, the total missing ratio (the number of missing data points divided by the total number of data points) is set from 5% to 70%. The MAE and MAPE curves are shown in Figure 2.
Figure 2: MAE and MAPE curves for the three matrix completion methods and the PPCA method.
In Figure 2, all the methods achieve comparable results when the missing ratio is lower than 30%. However, the performance of all methods except MCI degrades sharply when the missing ratio is higher than 50%. A possible reason is that MCI utilizes the physical limits of traffic flow in the imputation process, whereas the other methods ignore them.
Traffic volume must be nonnegative and no greater than the road capacity. The PPCA imputation strategy is a statistical method that imputes data through the a priori statistical characteristics of the data. As shown in Figure 3, the PPCA method may produce negative volumes during low-flow intervals. The same phenomenon can be found in the other two matrix completion strategies, whose objective functions lack the nonnegativity and road capacity constraints. The proposed MCI overcomes this shortcoming by adding these limits to the algorithm. No negative or overcapacity values are observed in the MCI experiments. Table 1 gives the frequency with which each of the four methods produces unreliable results (negative or overcapacity values) in our experiments.
Table 1: The frequency of producing unreliable results.

| Methods | MCI | PPCA | IALM | SVT |
| --- | --- | --- | --- | --- |
| Frequency | 0% | 38.57% | 7.14% | 11.43% |
Figure 3: The negative value reconstructed by PPCA.
The above experiments test the accuracy of MCI. Next, we test the stability of the imputation methods by SDE under different missing ratios. As shown in Figure 4, the SDEs of MCI and IALM, which use the augmented Lagrangian function, are lower than those of PPCA and SVT. This suggests that, by employing the augmented Lagrangian function, MCI imputes missing data not only more accurately but also more stably.
Figure 4: The standard deviation of errors (SDE) of the tested methods.
3.4. Missing Data Imputation with Outlier
The above experiments assume that the data have not been spoiled by outliers. However, traffic flow series are often corrupted by outliers, which arise for numerous reasons [32]. Unfortunately, these outliers are usually hard to isolate with traditional missing traffic data imputation approaches. Thus, outlier recovery and missing data imputation are often carried out separately in different frameworks [7, 29].
To address this problem, the MCI algorithm imputes missing data and recovers outliers in a unified framework by adding the sparse matrix E.
There are various kinds of outliers in traffic data. Here, we consider only two common outlier scenarios:
volume out of range (VOR): percentage of the detector records with volumes larger than 1000 v/5 min;
volume repeating zero (VRZ): percentage of the detector records with repeating zero volumes for 30 min.
It is hard to enumerate all the situations with different mixing ratios of the two outlier scenarios. In the experiments, the methods are tested on a typical situation in which the VOR and VRZ data are mixed at a ratio of 1 : 1. The ratio of outlier data is set from 5% to 15%, and the outlier data are produced randomly. The missing data ratio is set to 30%. All the results are averaged over 10 instances. The MAE and MAPE for both missing data imputation and outlier recovery are given in Table 2. Figure 5 presents part of the traffic volume data and the reconstructed volume data. The results show that MCI can impute the missing data and recover the traffic volume outliers with reliable performance.
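One way to produce such a test case is sketched below in Python with NumPy. The generation procedure is not specified in the text, so everything here is an assumption made for illustration: the function name, the injected VOR value of 1200 veh/5 min, the use of six consecutive 5-minute intervals for a 30-minute VRZ run, and the placement of runs and entries uniformly at random. Because VRZ runs may overlap, the realized number of VRZ entries can fall slightly below its nominal share.

```python
import numpy as np

def inject_outliers_and_gaps(volumes, outlier_ratio, missing_ratio, rng,
                             vor_value=1200.0, vrz_len=6):
    """Corrupt an (intervals x days) volume matrix with VOR/VRZ outliers and gaps."""
    m, n = volumes.shape
    data = volumes.astype(float).copy()
    n_each = int(round(outlier_ratio * m * n / 2))  # 1:1 split between VRZ and VOR

    # VRZ: zero out runs of vrz_len consecutive intervals within one day (column).
    vrz = np.zeros((m, n), dtype=bool)
    for _ in range(n_each // vrz_len):
        start = rng.integers(0, m - vrz_len + 1)
        day = rng.integers(0, n)
        vrz[start:start + vrz_len, day] = True
    data[vrz] = 0.0

    # VOR: overwrite entries outside the VRZ runs with a value above 1000 veh/5 min.
    vor = np.zeros((m, n), dtype=bool)
    candidates = np.flatnonzero(~vrz)
    vor.reshape(-1)[rng.choice(candidates, size=n_each, replace=False)] = True
    data[vor] = vor_value

    # Missing entries are drawn uniformly at random; `observed` encodes Omega.
    observed = np.ones((m, n), dtype=bool)
    n_missing = int(round(missing_ratio * m * n))
    observed.reshape(-1)[rng.choice(m * n, size=n_missing, replace=False)] = False
    return data, observed, vor, vrz
```

Keeping the `vor` and `vrz` masks makes it possible to score outlier recovery separately from missing data imputation, as the two column groups of Table 2 do.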
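Table 2: The recovery and imputation accuracy with different ratios of outliers.

| Outlier ratio | Missing data imputation MAPE | Missing data imputation MAE | Outlier recovery MAPE | Outlier recovery MAE |
| --- | --- | --- | --- | --- |
| 5% | 13.81% | 32.0799 | 14.89% | 36.9587 |
| 10% | 14.47% | 33.8156 | 15.32% | 39.1584 |
| 15% | 15.95% | 39.7424 | 15.76% | 46.6486 |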
Figure 5: Comparison of the raw traffic volume data, the data with 30% missing entries and 15% outliers, and the data reconstructed by MCI.
3.5. Computation Complexity of MCI Approach
As with IALM [25], it is not necessary to compute the full SVD in MCI. By using Lansvd [21], a fast SVD method that computes only the singular values larger than a particular threshold and their corresponding singular vectors, the cost of the singular value decomposition is not a bottleneck for MCI. Moreover, MCI is faster than traditional matrix completion methods such as SVT because it utilizes the augmented Lagrangian function [25].
It is not easy to choose the parameter λ, which weights the rank of the matrix against the number of sparse outliers. For traffic data without outliers, setting λ larger than 100 gives good performance. For data corrupted by outliers, however, a suitably lower value of λ achieves better results. In this paper, we suggest λ=150 for uncorrupted data and λ=0.05 for corrupted data in real applications.
4. Conclusion and Future Works
In this paper, a matrix completion method that fully utilizes the physical limits of traffic volume and the day-mode similarity has been proposed to deal with the missing traffic flow problem. The experiments show that the proposed method is more reasonable, accurate, and stable than the state-of-the-art methods for traffic flow data. Moreover, the proposed MCI can impute missing data and recover outliers in a unified framework with reliable performance.
Future research should look into missing traffic data imputation methods that incorporate spatial and temporal correlations among adjacent detectors to improve imputation accuracy. In addition, future studies may evaluate the performance of MCI on other parameters such as speed and occupancy. More research is also needed on the appropriate choice of parameter values for MCI.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The research was supported by NSFC (Grant nos. 61271376, 51308115, and 91120010), National Basic Research Program of China (973 Program no. 2012CB725405), and Beijing Natural Science Foundation (4122067).
References
[1] Zhang J., Wang F.-Y., Wang K., Lin W.-H., Xu X., Chen C. Data-driven intelligent transportation systems: a survey.
[2] Jou R. C., Chen K. H. A study of freeway drivers' demand for real-time traffic information along main freeways and alternative routes.
[3] Azimirad E., Pariz N., Sistani M. B. N. A novel fuzzy model and control of single intersection at urban traffic network.
[4] van Hinsbergen C. P. I. J., Schreiter T., Zuurbier F. S., van Lint J. W. C., van Zuylen H. J. Localized extended kalman filter for scalable real-time traffic state estimation.
[5] Qiao W., Haghani A., Hamedi M. A nonparametric model for short-term travel time prediction using bluetooth data.
[6] Zargari S. A., Siabil S. Z., Alavi A. H., Gandomi A. H. A computational intelligence-based approach for short-term traffic flow prediction.
[7] Qu L., Li L., Zhang Y., Hu J. PPCA-based missing data imputation for traffic flow volume: a systematical approach.
[8] Turner S., Albert L., Gajewski B., Eisele W. Archived intelligent transportation system data quality: preliminary analyses of San Antonio TransGuide data.
[9] PeMS. California performance measurement system. http://pems.dot.ca.gov/
[10] Carlson R. C., Papamichail I., Papageorgiou M., Messmer A. Optimal mainstream traffic flow control of large-scale motorway networks.
[11] Nguyen L. H., Scherer W. T. Imputation techniques to account for missing data in support of intelligent transportation systems applications. 2003, UVACTS-13-0-78, University of Virginia, Center for Transportation Studies.
[12] Xu J. R., Li X. Y., Shi H. J. Short-term traffic flow forecasting model under missing data.
[13] van Lint J. W. C., Hoogendoorn S. P., van Zuylen H. J. Accurate freeway travel time prediction with state-space neural networks under missing data.
[14] Yin W., Murray-Tuite P., Rakha H. Imputing erroneous data of single-station Loop detectors for nonincident conditions: comparison between temporal and spatial methods.
[15] Zhong M., Lingras P., Sharma S. Estimation of missing traffic counts using factor, genetic, neural, and regression techniques.
[16] Li L., Li Y., Li Z. Efficient missing data imputing for traffic flow by considering temporal and spatial dependence.
[17] Tan H., Feng G., Feng J., Wang W., Zhang Y. J., Li F. A tensor-based method for missing traffic data completion.
[18] Tan H., Feng J., Chen Z., Yang F., Wang W. Low multilinear rank approximation of tensors and application in missing traffic data.
[19] Tan H., Cheng B., Feng J., Feng G., Wang W., Zhang Y. J. Low-n-rank tensor recovery based on multi-linear augmented Lagrange multiplier method.
[20] Tan H., Bin C., Wuhong W., Yu-Jin Z., Bin R. Tensor recovery via multi-linear augmented lagrange multiplier method. Proceedings of the 6th International Conference on Image and Graphics (ICIG '11), August 2011, pp. 141-146. doi:10.1109/ICIG.2011.160
[21] Yuan X., Junfeng Y. Sparse and low-rank matrix decomposition via alternating direction methods. Preprint, 2009.
[22] de la Torre F., Black M. J. A framework for robust subspace learning.
[23] Candès E. J., Recht B. Exact matrix completion via convex optimization.
[24] Cai J.-F., Candès E. J., Shen Z. A singular value thresholding algorithm for matrix completion.
[25] Lin Z., Chen M., Ma Y. The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices. 2010, http://arxiv.org/abs/1009.5055
[26] Candes E. J., Plan Y. Matrix completion with noise.
[27] Candès E. J., Li X., Ma Y., Wright J. Robust principal component analysis?
[28] Natarajan B. K. Sparse approximate solutions to linear systems.
[29] Indyk P., Ružić M. Near-optimal sparse recovery in the L1 norm. Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS '08), October 2008, pp. 199-207. doi:10.1109/FOCS.2008.82
[30] Stewart G. W. On the early history of the singular value decomposition.
[31] Chen C., Kwon J., Rice J., Skabardonis A., Varaiya P. Detecting errors and imputing missing data for single-loop surveillance systems.
[32] Chen S., Wang W., van Zuylen H. A comparison of outlier detection algorithms for ITS data.