Traffic data play a very important role in Intelligent Transportation Systems (ITS). ITS requires complete traffic data for transportation control, management, guidance, and evaluation. However, the traffic data collected from many different types of sensors often contain missing values due to sensor damage or data transmission errors, which affects the effectiveness and reliability of ITS. To ensure the quality and integrity of traffic flow data, an accurate data imputation method is needed. However, most existing imputation methods cannot fully consider the impact of the sensor with missing data and the spatiotemporal correlation characteristics of traffic flow on imputation results. In this paper, a traffic data imputation method is proposed based on improved low-rank matrix decomposition (ILRMD), which fully considers the influence of missing data and effectively utilizes the spatiotemporal correlation characteristics among traffic data. The proposed method uses not only the traffic data around the sensor with missing data, but also the data of that sensor itself. The information of missing data is reflected in the coefficient matrix, and the spatiotemporal correlation characteristics are applied in order to obtain more accurate imputation results. Real traffic data collected from the Caltrans Performance Measurement System (PeMS) are used to evaluate the imputation performance of the proposed method. Experimental results show that the proposed method improves the imputation accuracy by 87.07% on average compared with the SVR, ARIMA, KNN, DBN-SVR, WNN, and traditional MC methods, and that it is an effective method for data imputation.
Funding: National Key R&D Program of China (2018YFC0808706); National Natural Science Foundation of China (5157081053)
1. Introduction
With the rapid development of the social economy, massive road infrastructure of many kinds has been built [1–4], but traffic congestion still exists on highways. Therefore, it is necessary to collect highway information to serve people's travel demand. The development of information technology has made this collection possible, and the collection equipment used on highways includes Bluetooth sensors, remote traffic microwave sensors, video sensors, and loop detectors. However, traffic flow data are lost to different degrees due to sensor damage, malfunction, or transmission errors. Missing data make it difficult to extract valid information from traffic data. They are also an obstacle in traffic and travel time prediction [5–8], and the integrity of traffic flow data is the premise of data analysis in ITS. Therefore, it is very important to put forward an effective traffic data imputation method. At present, various methods have emerged in the field of traffic flow data imputation. These imputation methods can be roughly divided into three categories: prediction methods, interpolation methods, and statistical learning methods [9].
Traffic flow prediction models [10–12] are critical for road traffic management in complex road networks. Prediction methods usually build predictive models with historical data and treat missing data as values to be predicted. There are many ways to build traffic flow prediction models, from simple null value imputation to complex spatiotemporal imputation models [13]. The representative prediction methods include the Autoregressive Integrated Moving Average Model (ARIMA) [14–16], Bayesian networks (BNs) [17–19], and support vector regression (SVR) [20, 21]. Elshenawy et al. [22] proposed an intelligent data imputation method with an ARIMA model and presented a mechanism based on the Hyndman-Khandakar algorithm to determine the ARIMA parameters. Sun et al. [23] partitioned a day into different time sections and used SVR to forecast traffic flow data. Chen et al. [24] proposed an Autoregressive Integrated Moving Average with Generalized Autoregressive Conditional Heteroscedasticity (ARIMA-GARCH) model for traffic flow prediction. However, these prediction methods fail to utilize the information of the sensor with missing data, which affects data imputation accuracy.
Interpolation methods are divided into temporal-neighboring and pattern-neighboring [25]. Temporal-neighboring methods fill up the missing data by the known data from the same sensors at the same daily time but on some neighboring days [20, 26]. Pattern-neighboring methods use the similarity characteristics of the daily traffic flow data [27] and estimate missing data using historical data collected from the same sensors on different days [17, 20]. The typical pattern-neighboring methods include K-nearest neighbors (KNN) model [28, 29] and Local Least Squares (LLS) [30, 31] model, and the key difficulty of these methods is to determine the neighbors by an appropriate distance metric [32, 33]. Nguyen et al. [34] used the mean value of the historical data to estimate missing data. Smith et al. [35] used historical data or the data from surrounding periods and locations to impute the missing data. The interpolation model assumes that the daily traffic flow data are similar, but the actual traffic flow data fluctuates and changes with time. Therefore, it is impossible to obtain satisfactory imputation performances.
Methods based on statistical learning have been developed in recent years. These methods first assume a probability distribution model for the traffic data and use iterative methods to estimate the parameters of the distribution; the observed data are then used to impute the missing data. The statistical learning methods include Probabilistic Principal Component Analysis (PPCA) [6, 9], Bayesian Principal Component Analysis (BPCA) [26], neural network methods [36], and Markov Chain Monte Carlo (MCMC) [37]. MCMC is a typical imputation method based on statistical learning: it regards the missing data as the target parameter and estimates that parameter from sampled values of the parameter. Higashijima et al. [38] proposed a regression tree imputation method and used a preprocessing method to improve imputation accuracy. Wei et al. [39] proposed a data-driven imputation method and used k-means clustering to group the most correlated road segments; the trained model is able to estimate the missing data at multiple locations under a unified framework. Although the methods based on statistical learning make strong assumptions about traffic data, their performance is superior to that of traditional imputation methods [40] because the assumed probability distribution captures the essentials of traffic flow.
The methods based on prediction and interpolation simply impute the data with the temporal or spatial correlation characteristic and only consider the information of historical data. The historical imputation methods fill the missing data with the known data points collected by the same sensors at the same daily time but on different days. These methods require highly stable historical data, but in practical applications traffic flow data are usually unstable and fluctuate to some extent. The traditional imputation method sets all the missing data to zero and uses the zero-padded data matrix in the imputation, so it cannot take the impact of the missing sensor data on the imputation result into account. Generally, the sensor with missing data has the highest correlation with the final imputation results. However, the missing data are set to zero directly in the traditional imputation method, which ignores the effect of the missing data on the imputation results and reduces their accuracy. In order to address the above problems, a traffic data imputation method is proposed based on improved low-rank matrix decomposition (ILRMD). Compared with the traditional imputation method, the ILRMD method fully considers the impact of missing data on the imputation results. In the process of data imputation, the ILRMD method does not directly discard the information of missing data, and the effect of missing data is reflected in the coefficient matrix. The reconstructed data matrix multiplied by the coefficient matrix, which contains the missing data information, is the imputation result. The ILRMD method uses not only the traffic data around the sensor with missing data, but also the data of that sensor itself. The information contained in the missing data is fully considered, and the spatiotemporal correlation characteristics of the traffic flow are adequately utilized. 
The tested results with traffic data collected from the Caltrans Performance Measurement System (PeMS) show that the proposed algorithm has superior imputation accuracy.
The rest of this paper is organized as follows. Section 2 reviews the related work in traffic data imputation and gives a brief introduction. The traditional imputation approach is introduced in Section 3. Section 4 describes the ILRMD method proposed in this paper. Section 5 discusses the result analysis and method comparison. Section 6 makes the conclusion of this paper and gives some recommendations.
2. Related Work
With the rapid development of machine learning, pattern recognition, computer vision, and data mining, the processing of big data is becoming more and more important. The scale and growth rate of big data are continuously increasing, but large-scale high-dimensional data is often correlative and redundant. Therefore, it is necessary to perform reasonable compression processing on large-scale data. In order to reduce data redundancy, Candes [41] proposed the concept of low rank sparse matrix decomposition in 2009, which is also called Low-Rank Matrix Recovery (LRMR), Low-Rank Matrix Decomposition (LRMD), or Robust Principal Component Analysis (RPCA).
2.1. Low-Rank Matrix Decomposition
For a given data matrix $D\in\mathbb{R}^{m\times n}$ distributed in a linear subspace with approximately low dimension, it can be decomposed into a low-rank matrix $A$ and a sparse matrix $E$ [42]:
$$\min_{A,E}\ \operatorname{rank}(A)+\lambda\|E\|_{0}\quad\text{s.t.}\quad D=A+E \tag{1}$$
where $\|E\|_{0}$ represents the $L_0$ norm of the matrix $E$ and $\lambda$ represents the compromise factor between matrices $A$ and $E$.
Since the optimization problem (1) is NP-hard, it can be relaxed to a convex optimization problem [41–43], which is written as follows:
$$\min_{A,E}\ \|A\|_{*}+\lambda\|E\|_{2,1}\quad\text{s.t.}\quad D=A+E \tag{2}$$
where $\|A\|_{*}$ represents the nuclear norm of matrix $A$, and $\|E\|_{2,1}=\sum_{i=1}^{m}\sqrt{\sum_{j=1}^{n}e_{ij}^{2}}$ is the $L_{2,1}$ norm of the matrix $E$.
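As an illustration of the two norms in (2), the following minimal NumPy sketch computes them for small matrices. The function names are hypothetical, and the row-wise grouping in `l21_norm` (outer sum over i, inner sum over j) is an assumption taken from the index order of the definition; many low-rank papers group by columns instead.

```python
import numpy as np

def nuclear_norm(A: np.ndarray) -> float:
    """Sum of the singular values of A (convex surrogate of rank(A))."""
    return float(np.linalg.svd(A, compute_uv=False).sum())

def l21_norm(E: np.ndarray) -> float:
    """Sum over rows i of the Euclidean norm of row i (inner sum over columns j)."""
    return float(np.sqrt((E ** 2).sum(axis=1)).sum())

A = np.outer([1.0, 2.0], [3.0, 4.0])     # rank-1 matrix, single singular value
E = np.array([[3.0, 4.0], [0.0, 0.0]])   # one nonzero row with norm 5
print(nuclear_norm(A), l21_norm(E))
```

For a rank-1 matrix the nuclear norm equals the product of the two vector norms, which makes the first value easy to check by hand.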
The low-rank characteristic of recovered matrix determines the matrix imputation performance. Therefore, choosing the suitable LRMD solution method is crucial. The main algorithms for solving LRMD problem include Iterative Threshold method [44, 45], the Dual Approach [46], Accelerated Proximal Gradient Algorithm [47], and Augmented Lagrange Multiplier method [48]. In this paper, Augmented Lagrange Multiplier method is used.
2.2. Matrix Imputation Based on Low-Rank Matrix Decomposition
Generally, all the data cannot be recovered from partial samples. However, Candes [42] proved that the missing data can be recovered more accurately when the data matrix is low rank or nearly low rank. The low-rank matrix $A$ obtained with LRMD in Section 2.1 can therefore be used to impute the missing data.
The model of matrix imputation can be written as follows:
$$\min_{A}\ \operatorname{rank}(A)\quad\text{s.t.}\quad P_{\Omega}(A)=P_{\Omega}(D) \tag{3}$$
where $\Omega\subseteq\{1,\dots,m\}\times\{1,\dots,n\}$ is the set of subscripts of the known elements and $P_{\Omega}:\mathbb{R}^{m\times n}\rightarrow\mathbb{R}^{m\times n}$ is a linear projection operator, which is defined as follows:
$$\left[P_{\Omega}(D)\right]_{ij}=\begin{cases}D_{ij}, & (i,j)\in\Omega\\ 0, & (i,j)\notin\Omega\end{cases} \tag{4}$$
The optimization problem (3) is also NP-hard, so it needs to be relaxed into a convex optimization problem:
$$\min_{A}\ \|A\|_{*}\quad\text{s.t.}\quad P_{\Omega}(A)=P_{\Omega}(D) \tag{5}$$
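The projection operator in (4) can be sketched in a few lines of NumPy. The function name `project_observed` and the boolean-mask representation of Ω are assumptions made for illustration.

```python
import numpy as np

def project_observed(D: np.ndarray, observed: np.ndarray) -> np.ndarray:
    """P_Omega(D): keep entries whose mask value is True, set the rest to 0."""
    # np.where (rather than D * mask) also works when missing entries are NaN.
    return np.where(observed, D, 0.0)

D = np.array([[1.0, 2.0], [3.0, 4.0]])
observed = np.array([[True, False], [True, True]])   # Omega as a boolean mask
P = project_observed(D, observed)
print(P)
```

The constraint in (3) and (5) then reads `project_observed(A, observed) == project_observed(D, observed)` elementwise.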
2.3. Matrix Imputation Based on Low-Rank Matrix Representation
The low-rank matrix imputation method mentioned above directly minimizes the rank of imputed data. In order to improve imputation efficiency, a self-expression is applied to LRMD, which is called the low-rank matrix representation [49, 50]. The data matrix $D$ is represented as a linear combination with a dictionary matrix $B$, that is, $D=BZ$. The matrix $Z$ is the coefficient matrix, and it is expected to be low rank. $Z$ can be obtained by solving the following optimization problem:
$$\min_{Z}\ \operatorname{rank}(Z)\quad\text{s.t.}\quad D=BZ \tag{6}$$
Equation (6) can be convexly relaxed to obtain the following:
$$\min_{Z}\ \|Z\|_{*}\quad\text{s.t.}\quad D=BZ \tag{7}$$
If the data matrix $D$ itself is selected as the dictionary matrix, (7) can be written as follows:
$$\min_{Z}\ \|Z\|_{*}\quad\text{s.t.}\quad D=DZ \tag{8}$$
In practical applications, the data matrix $D$ may be disturbed by noise. In order to enhance the robustness, (8) can be revised as follows:
$$\min_{Z,E}\ \|Z\|_{*}+\lambda\|E\|_{2,1}\quad\text{s.t.}\quad D=DZ+E \tag{9}$$
A data matrix D is represented by a data dictionary B, and the coefficient matrix Z is sparser when D has higher similarity with B. But the stochastic noise is usually appended in data matrix D, which will influence the correlation within the data matrix. When the stochastic noise E is removed, the correlation of data matrix can be enhanced. D is selected as a dictionary, and its essence is to reveal the correlation within the data matrix. When the coefficient matrix Z is sparse, data columns in data matrix D are represented by each other’s columns with few coefficients as possible. For the traffic flow data, it has high spatiotemporal correlation characteristics, but it is affected by the weather, holidays, and other factors, which makes the traffic flow data have stochastic volatility. Therefore, if the influence of this stochastic volatility on the traffic data is removed, the correlation between the traffic data will be enhanced. After removing the influence of stochastic noise, the correlation between the data itself is further explored, and the similarity between the data is expressed with as little information as possible. Then the internal correlation of traffic flow data is used to impute the data.
2.4. The Solution of the Coefficient Matrix
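In order to solve (9), an auxiliary variable $J$ is introduced with $J=Z$ to separate the variable $Z$. The coefficient matrix $Z$ can be calculated with the Augmented Lagrange Multiplier method, and the optimization model becomes the following:
$$\min_{Z}\ \|J\|_{*}+\lambda\|D-DZ\|_{2,1}\quad\text{s.t.}\quad J=Z \tag{10}$$
Construct the augmented Lagrange function (11), where $Y$ is a Lagrange multiplier, $\|\cdot\|_{F}^{2}$ is the squared Frobenius norm, which represents the sum of the absolute squares of the elements, and $\mu$ is a weight to tune the error term $\|Z-J\|_{F}^{2}$:
$$L(Z,J)=\|J\|_{*}+\frac{1}{2}\|D-DZ\|_{F}^{2}+\langle Y,Z-J\rangle+\frac{\mu}{2}\|Z-J\|_{F}^{2} \tag{11}$$
The Exact Augmented Lagrange Multiplier (EALM) method is used to solve for the matrices $J$ and $Z$ according to the following:
$$J^{k+1}=\arg\min_{J}L(Z^{k},J)=\arg\min_{J}\ \|J\|_{*}+\langle Y,Z-J\rangle+\frac{\mu}{2}\|Z-J\|_{F}^{2}=\arg\min_{J}\ \|J\|_{*}+\frac{\mu}{2}\left\|Z-J+\frac{Y}{\mu}\right\|_{F}^{2} \tag{12}$$
$$Z^{k+1}=\arg\min_{Z}L(Z,J^{k+1})=\arg\min_{Z}\ \frac{1}{2}\|D-DZ\|_{F}^{2}+\langle Y,Z-J\rangle+\frac{\mu}{2}\|Z-J\|_{F}^{2}=\arg\min_{Z}\ \frac{1}{2}\|D-DZ\|_{F}^{2}+\frac{\mu}{2}\left\|Z-J+\frac{Y}{\mu}\right\|_{F}^{2} \tag{13}$$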
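The J-subproblem in (12) has a well-known closed form: singular value thresholding of $Z+Y/\mu$ with threshold $1/\mu$. The paper does not spell this step out, so the sketch below is a standard technique swapped in for illustration; the names `svt` and `update_J` are hypothetical.

```python
import numpy as np

def svt(M: np.ndarray, tau: float) -> np.ndarray:
    """Singular value thresholding: shrink every singular value of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)       # values below tau are set to zero
    return (U * s_shrunk) @ Vt                # rebuild the matrix from shrunk values

def update_J(Z: np.ndarray, Y: np.ndarray, mu: float) -> np.ndarray:
    """Minimiser of ||J||_* + (mu/2) ||Z - J + Y/mu||_F^2 over J, as in (12)."""
    return svt(Z + Y / mu, 1.0 / mu)

M = np.diag([3.0, 1.0])
print(svt(M, 2.0))    # singular values 3 and 1 shrink to 1 and 0
```

Because small singular values are set to zero, each J-update lowers the rank of the iterate, which is how the nuclear norm term acts in practice.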
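The coefficient matrix $Z$ is updated as follows. First, a projection matrix $W$ is used to mark the observed (non-missing) positions of the matrix $D$, so that $D=W\odot D$, where $\odot$ denotes the elementwise product. For convenience, set $G=J-Y/\mu$, and (13) can be expressed as follows:
$$\min_{Z}\ \frac{1}{2}\left\|W\odot D-W\odot(DZ)\right\|_{F}^{2}+\frac{\mu}{2}\|Z-G\|_{F}^{2} \tag{14}$$
In order to differentiate (14) with respect to $Z$, the elementwise product should be rewritten as a matrix product. The matrices of (14) are expanded column by column as follows:
$$\min_{Z}\ \frac{1}{2}\sum_{i=1}^{n}\left\|w_{i}\odot d_{i}-w_{i}\odot(Dz_{i})\right\|_{2}^{2}+\frac{\mu}{2}\sum_{i=1}^{n}\|z_{i}-g_{i}\|_{2}^{2} \tag{15}$$
where $w_{i}$, $d_{i}$, $z_{i}$, and $g_{i}$ are, respectively, the ith columns of the matrices $W$, $D$, $Z$, and $G$.
Convert the vector $w_{i}$ to a diagonal matrix, i.e., $\hat{W}_{i}=\operatorname{diag}(w_{i})$, so that $w_{i}\odot d_{i}=\hat{W}_{i}d_{i}$. Therefore, (15) can be expressed as follows:
$$\min_{Z}\ \frac{1}{2}\sum_{i=1}^{n}\left\|\hat{W}_{i}d_{i}-\hat{W}_{i}Dz_{i}\right\|_{F}^{2}+\frac{\mu}{2}\sum_{i=1}^{n}\|z_{i}-g_{i}\|_{2}^{2} \tag{16}$$
To simplify (16), $\hat{W}_{i}d_{i}$ is denoted as $K_{i}$, and $\hat{W}_{i}D$ is denoted as $H_{i}$. Then (16) can be simplified as follows:
$$\min_{Z}\ \frac{1}{2}\sum_{i=1}^{n}\left(\|K_{i}-H_{i}z_{i}\|_{F}^{2}+\mu\|z_{i}-g_{i}\|_{2}^{2}\right) \tag{17}$$
Setting the gradient of (17) with respect to $z_{i}$ to zero gives $(H_{i}^{T}H_{i}+\mu I)z_{i}=H_{i}^{T}K_{i}+\mu g_{i}$, so $z_{i}$ can be updated by the following:
$$z_{i}=\left(H_{i}^{T}H_{i}+\mu I\right)^{-1}\left(H_{i}^{T}K_{i}+\mu g_{i}\right) \tag{18}$$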
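The column-wise update can be sketched as follows. The code solves the stationarity condition of (17), $(H_i^T H_i+\mu I)z_i=H_i^T K_i+\mu g_i$, for each column. The function name, the toy sizes, and the assumption that missing entries of D hold finite placeholders (for example 0) are illustrative choices, not part of the paper.

```python
import numpy as np

def update_Z_columns(D: np.ndarray, W: np.ndarray, G: np.ndarray, mu: float) -> np.ndarray:
    """Minimise (17) column by column.

    D: (m, n) data, W: (m, n) mask with 1 = observed and 0 = missing,
    G: (n, n) matrix J - Y/mu, mu: positive penalty weight.
    """
    n = D.shape[1]
    Z = np.empty((n, n))
    identity = np.eye(n)
    for i in range(n):
        H = W[:, [i]] * D          # H_i = diag(w_i) D: row r of D scaled by w_ri
        K = W[:, i] * D[:, i]      # K_i = diag(w_i) d_i
        Z[:, i] = np.linalg.solve(H.T @ H + mu * identity, H.T @ K + mu * G[:, i])
    return Z

rng = np.random.default_rng(0)
D = rng.normal(size=(8, 4))                     # 8 time steps, 4 sensors (toy sizes)
W = (rng.random((8, 4)) > 0.2).astype(float)    # random observation mask
G = rng.normal(size=(4, 4))
mu = 0.5
Z = update_Z_columns(D, W, G, mu)
print(Z.shape)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical design choice; the system matrix is positive definite for any mu > 0, so the solve is always well posed.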
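The above process is repeated until the objective function converges. The coefficient matrix $Z$ is obtained when the termination condition is met, and it is expressed as follows:
$$Z=\begin{bmatrix}z_{11}&z_{12}&\cdots&z_{1n}\\ z_{21}&z_{22}&\cdots&z_{2n}\\ \vdots&\vdots&\ddots&\vdots\\ z_{n1}&z_{n2}&\cdots&z_{nn}\end{bmatrix}\in\mathbb{R}^{n\times n} \tag{19}$$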
3. Traditional Imputation Method with LRMD
The traditional method imputes the missing data by a zero-padding operation. For an original matrix $D_{0}=[d_{1},d_{2},\cdots,d_{n}]\in\mathbb{R}^{m\times n}$, suppose that $d_{p}$ is missing, where $d_{p}$ represents the pth column of $D_{0}$. The missing column of the matrix $D_{0}$ is filled with zeros, which gives the matrix $\hat{D}_{1}$:
$$\hat{D}_{1}=[\hat{d}_{1},\hat{d}_{2},\cdots,\hat{d}_{p-1},0,\hat{d}_{p+1},\cdots,\hat{d}_{n}]\in\mathbb{R}^{m\times n} \tag{20}$$
where $\hat{d}_{ij}=\begin{cases}0, & j=p\\ d_{ij}, & j\in\{1,2,\cdots,n\},\ j\neq p\end{cases}\ (i=1,2,\cdots,m)$ are the elements of the matrix $\hat{D}_{1}$.
Multiplying $\hat{D}_{1}$ by the pth column of the coefficient matrix $Z$, $\hat{d}_{p}=[\hat{d}_{1p},\hat{d}_{2p},\cdots,\hat{d}_{mp}]^{T}$ can be recovered by the following:
$$\hat{d}_{ip}=\sum_{k=1}^{n}\hat{d}_{ik}z_{kp}=d_{i1}z_{1p}+d_{i2}z_{2p}+\cdots+d_{i,p-1}z_{p-1,p}+d_{i,p+1}z_{p+1,p}+\cdots+d_{in}z_{np} \tag{21}$$
The traditional matrix imputation method uses the zero-padding operation to fill the missing column. The reconstructed matrix is then multiplied by the corresponding column of the coefficient matrix $Z$ to obtain the imputed data of the missing column. This method only uses the data around the missing column to impute the missing data; that is to say, the missing column does not contribute to the imputation result. Generally, the sensor with missing data has the highest correlation with the final imputation results. However, the missing data are set to zero directly in the traditional imputation method, which ignores the effect of the missing data on the imputation results and reduces their accuracy.
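The zero-padding recovery in (20) and (21) amounts to one matrix-vector product. The sketch below uses a hypothetical function name and a hand-made coefficient matrix; it also shows that the diagonal coefficient $z_{pp}$ has no effect on the result, because it multiplies the zeroed column.

```python
import numpy as np

def impute_zero_padding(D: np.ndarray, Z: np.ndarray, p: int) -> np.ndarray:
    """Zero column p of D as in (20), then multiply by column p of Z as in (21)."""
    D_hat = D.copy()
    D_hat[:, p] = 0.0
    return D_hat @ Z[:, p]

D = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # 2 time steps, 3 sensors (toy sizes)
Z = np.array([[0.0, 0.5, 0.0],
              [0.0, 0.2, 0.0],           # z_pp = 0.2 is ignored by zero padding
              [0.0, 0.5, 0.0]])
estimate = impute_zero_padding(D, Z, p=1)
print(estimate)                          # 0.5*[1, 4] + 0.5*[3, 6]
```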
4. Traffic Data Imputation with ILRMD
The missing data generally can be divided into three different types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing at Determinate (MAD). This paper mainly deals with the problem of determinate missing. In road networks, traffic data was collected by various types of sensors, which usually demonstrated high temporal-spatial correlation characteristics; that is, traffic data have low-rank characteristic.
In a road network, suppose that there are n sensors and each sensor has m data samples, which can be denoted as a data matrix $D\in\mathbb{R}^{m\times n}$ whose jth column holds the data of the jth sensor. This paper assumes that the data of the pth sensor are missing in $D$. The traditional imputation method based on LRMD fails to consider the impact of the missing data column on imputation results. In order to address this shortcoming and exploit the temporal-spatial correlation characteristics of traffic flow, this paper proposes a data imputation method based on ILRMD.
4.1. The Proposed ILRMD Model
In (9), let $d_{ij}$ and $e_{ij}$ be the elements for the jth ($1\le j\le n$) observed sensor at the ith ($1\le i\le m$) time in the observed matrix $D\in\mathbb{R}^{m\times n}$ and the noise matrix $E\in\mathbb{R}^{m\times n}$, respectively, and let $z_{ij}$ be the elements of the coefficient matrix $Z\in\mathbb{R}^{n\times n}$. According to the matrix multiplication rule, the following is obtained:
$$d_{ij}=\sum_{k=1}^{n}d_{ik}z_{kj}+e_{ij}=d_{i1}z_{1j}+d_{i2}z_{2j}+\cdots+d_{in}z_{nj}+e_{ij} \tag{22}$$
Then, (22) can be transformed into the following:
$$d_{ij}=\frac{d_{i1}z_{1j}+\cdots+d_{i,j-1}z_{j-1,j}+d_{i,j+1}z_{j+1,j}+\cdots+d_{in}z_{nj}}{1-z_{jj}}+\frac{e_{ij}}{1-z_{jj}} \tag{23}$$
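The step from (22) to (23) can be made explicit by moving the term $d_{ij}z_{jj}$ to the left-hand side; the division is valid under the assumption $z_{jj}\neq 1$, which the paper uses implicitly:

```latex
d_{ij}\,(1 - z_{jj}) \;=\; \sum_{k \neq j} d_{ik}\, z_{kj} + e_{ij}
\quad\Longrightarrow\quad
d_{ij} \;=\; \frac{\sum_{k \neq j} d_{ik}\, z_{kj}}{1 - z_{jj}} + \frac{e_{ij}}{1 - z_{jj}},
\qquad z_{jj} \neq 1 .
```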
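The coefficient vector of the jth observed sensor can be expressed as follows:
$$Z_{0}^{j}=\left[\frac{z_{1j}}{1-z_{jj}},\ \cdots,\ \frac{z_{j-1,j}}{1-z_{jj}},\ \frac{z_{j+1,j}}{1-z_{jj}},\ \cdots,\ \frac{z_{nj}}{1-z_{jj}}\right]^{T} \tag{24}$$
The final coefficient matrix $Z_{0}$ of all observed sensors collects these vectors as columns; that is, column j of $Z_{0}$ is column j of $Z$ with its jth entry removed and the remaining entries divided by $1-z_{jj}$ (shown for $2<j<n$):
$$Z_{0}=\left[Z_{0}^{1},Z_{0}^{2},\ldots,Z_{0}^{n}\right]=\begin{bmatrix}
\frac{z_{21}}{1-z_{11}} & \frac{z_{12}}{1-z_{22}} & \cdots & \frac{z_{1j}}{1-z_{jj}} & \cdots & \frac{z_{1n}}{1-z_{nn}}\\
\frac{z_{31}}{1-z_{11}} & \frac{z_{32}}{1-z_{22}} & \cdots & \frac{z_{2j}}{1-z_{jj}} & \cdots & \frac{z_{2n}}{1-z_{nn}}\\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots\\
\frac{z_{j1}}{1-z_{11}} & \frac{z_{j2}}{1-z_{22}} & \cdots & \frac{z_{j-1,j}}{1-z_{jj}} & \cdots & \frac{z_{j-1,n}}{1-z_{nn}}\\
\frac{z_{j+1,1}}{1-z_{11}} & \frac{z_{j+1,2}}{1-z_{22}} & \cdots & \frac{z_{j+1,j}}{1-z_{jj}} & \cdots & \frac{z_{jn}}{1-z_{nn}}\\
\vdots & \vdots & \ddots & \vdots & \ddots & \vdots\\
\frac{z_{n1}}{1-z_{11}} & \frac{z_{n2}}{1-z_{22}} & \cdots & \frac{z_{nj}}{1-z_{jj}} & \cdots & \frac{z_{n-1,n}}{1-z_{nn}}
\end{bmatrix}\in\mathbb{R}^{(n-1)\times n} \tag{25}$$
Let $D_{2}$ denote the matrix $D$ with the pth column removed. According to the matrix multiplication rule, the matrix $D_{2}=[d_{1},d_{2},\cdots,d_{p-1},d_{p+1},\cdots,d_{n}]\in\mathbb{R}^{m\times(n-1)}$ is multiplied by the pth column of the coefficient matrix $Z_{0}$. The value $\tilde{d}_{p}=[\tilde{d}_{1p},\tilde{d}_{2p},\cdots,\tilde{d}_{mp}]^{T}$ is obtained and can be written as follows:
$$\tilde{d}_{ip}=\sum_{k=1}^{p-1}d_{ik}\frac{z_{kp}}{1-z_{pp}}+\sum_{k=p+1}^{n}d_{ik}\frac{z_{kp}}{1-z_{pp}}=\frac{d_{i1}z_{1p}+\cdots+d_{i,p-1}z_{p-1,p}+d_{i,p+1}z_{p+1,p}+\cdots+d_{in}z_{np}}{1-z_{pp}} \tag{26}$$
The ILRMD method proposed in this paper assumes that a certain column of data in the matrix is lost and then multiplies the remaining matrix by the coefficient matrix $Z_{0}$ to recover the missing data. The influence of all observed sensors is considered, including the sensor with missing data. In (24), if the value $z_{jj}$ is zero, only the data of the surrounding sensors are used for imputation. If the value $z_{jj}$ is not zero, both the data of the surrounding sensors and the sensor with missing data are used.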
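A minimal sketch of (24) to (26) follows. To make the result checkable, it uses a synthetic rank-2 matrix and a coefficient matrix built from its right singular vectors, so that $D=DZ$ holds exactly with $E=0$; this construction and the function name `impute_ilrmd` are assumptions for illustration, not the training procedure of the paper.

```python
import numpy as np

def impute_ilrmd(D: np.ndarray, Z: np.ndarray, p: int) -> np.ndarray:
    """Recover column p of D from the other columns via (24)-(26)."""
    z_pp = Z[p, p]
    if np.isclose(z_pp, 1.0):
        raise ValueError("z_pp is numerically 1, so the rescaling in (24) is undefined")
    D2 = np.delete(D, p, axis=1)                   # D without column p, shape (m, n-1)
    z0_p = np.delete(Z[:, p], p) / (1.0 - z_pp)    # column p of Z0, shape (n-1,)
    return D2 @ z0_p

rng = np.random.default_rng(0)
D = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))   # rank 2: 6 times x 4 sensors
_, _, Vt = np.linalg.svd(D, full_matrices=False)
V = Vt[:2].T               # basis of the 2-dimensional row space of D
Z = V @ V.T                # self-expressive coefficients: D = D Z up to rounding
p = 1
recovered = impute_ilrmd(D, Z, p)
print(np.max(np.abs(recovered - D[:, p])))
```

In this noiseless setting the rescaled product reproduces the removed column, whereas the zero-padded product of (21) would return it shrunk by the factor $1-z_{pp}$.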
The differences between the ILRMD method and the traditional imputation method are discussed as follows. The traditional imputation method performs the zero-padding operation on the missing column and then is directly multiplied by the corresponding column of the coefficient matrix Z. The traditional imputation method utilizes the data collected from the surrounding sensors to recover the matrix and ignores the effect of the sensors including missing data. The ILRMD method assumes that the pth column of the data is completely missing and the matrix D2 represents the matrix D after removing the pth column data. Then after the conversion, the weight that is most relevant to each sensor itself is expressed in another form, in order to reduce the effect of the most relevant weight to the imputation result. From (22)-(24), a coefficient matrix Z0 is obtained. The coefficient matrix Z0 considers not only the surrounding sensors, but also the influence of the sensor including missing data. Ultimately the matrix D2 is multiplied by the coefficient matrix Z0 for obtaining imputation result.
The main steps of the proposed imputation method are as follows.
Step 1. The traffic flow data are preprocessed by smoothing and filtering, and the complete traffic flow data of one randomly selected day are used to construct the training matrix D.
Step 2. The preprocessed matrix D is decomposed into the low-rank matrix A and the sparse matrix E according to (1).
Step 3. According to (9), matrix A is decomposed into D and Z, and the coefficient matrix Z is solved with (10) to (19).
Step 4. The test matrix D is constructed, and the matrix D2, which is the matrix D with the pth column removed, is set as the dictionary matrix.
Step 5. The coefficient matrix Z0 is obtained according to (25), and the missing data to be imputed are obtained by (26).
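Steps 3 to 5 can be sketched end to end as below. This is a simplified illustration, not the authors' implementation: it assumes a fully observed training matrix (W is all ones), a fixed penalty μ, a fixed iteration count instead of a convergence test, and synthetic data in place of PeMS data; Steps 1 and 2 (smoothing and the low-rank/sparse split) are omitted. All function names are hypothetical.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: shrink the singular values of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def solve_coefficients(D, mu=1.0, n_iter=200):
    """Alternate the J-step (12), the Z-step (13) and a multiplier step for (11)."""
    n = D.shape[1]
    Z = np.zeros((n, n))
    Y = np.zeros((n, n))
    DtD = D.T @ D
    lhs = DtD + mu * np.eye(n)
    for _ in range(n_iter):
        J = svt(Z + Y / mu, 1.0 / mu)                 # minimise (11) over J
        Z = np.linalg.solve(lhs, DtD + mu * J - Y)    # minimise (11) over Z
        Y = Y + mu * (Z - J)                          # multiplier update
    return Z

def impute_column(D_test, Z, p):
    """Steps 4-5: drop column p, rescale column p of Z as in (24), apply (26)."""
    D2 = np.delete(D_test, p, axis=1)
    z0_p = np.delete(Z[:, p], p) / (1.0 - Z[p, p])
    return D2 @ z0_p

# Synthetic stand-in for two days of data: 288 time steps x 6 sensors (assumed sizes).
rng = np.random.default_rng(1)
loadings = rng.random((2, 6)) + 0.5                   # spatial pattern shared by both days
train = rng.random((288, 2)) @ loadings + 0.01 * rng.normal(size=(288, 6))
test = rng.random((288, 2)) @ loadings + 0.01 * rng.normal(size=(288, 6))

Z = solve_coefficients(train)                         # Step 3 on the training day
p = 0
imputed = impute_column(test, Z, p)                   # Steps 4-5 on the test day
rel_err = np.linalg.norm(imputed - test[:, p]) / np.linalg.norm(test[:, p])
print(f"relative error on the held-out column: {rel_err:.3f}")
```

The Z-step uses the closed form $(D^{T}D+\mu I)Z=D^{T}D+\mu J-Y$, which follows from setting the gradient of (11) with respect to Z to zero when no entries are masked.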
4.2. Performance Evaluation Criteria
The evaluation criteria used to measure the error of the imputed data include root mean square error (RMSE), mean absolute error (MAE), mean squared percentage error (MSPE), and mean absolute percentage error (MAPE). The RMSE and MAPE are selected in this paper. The formulas are as follows:
$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-\hat{y}_{i}\right)^{2}},\qquad \mathrm{MAPE}=\frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_{i}-\hat{y}_{i}}{y_{i}}\right|\times 100\% \tag{27}$$
where $N$ is the total number of missing data points, $y_{i}$ is the actual value of the ith missing data point, and $\hat{y}_{i}$ is the corresponding estimated value.
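The two criteria in (27) can be computed as in the sketch below; the function names are illustrative. MAPE is returned in percent, as written in (27), and is undefined when an actual value is zero.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean square error between actual and imputed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error in percent; requires nonzero actual values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

actual = [100.0, 200.0]
imputed = [110.0, 180.0]            # errors of -10 and +20
print(rmse(actual, imputed), mape(actual, imputed))
```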
5. Experiment Results
5.1. Data Description
The data used to evaluate the performance of the proposed model were collected by mainline detectors and provided by the PeMS database, which includes more than 39,000 individual sensors spanning the highway system in all major metropolitan areas of California. In this paper, 46 mainline sensors numbered from 1108512 to 1221232 are selected for the data imputation test, covering April 1st, 2018, to April 30th, 2018. The traffic flow data are aggregated at 5-minute intervals, which gives 288 data points per day. Training matrices built from 1 day, 7 days, and 14 days of data were tested; however, the experimental results show that the imputation accuracy does not improve noticeably as the training set grows. Therefore, one day of data is used for training. According to the analysis of the spatial-temporal correlation characteristics of traffic flow, the traffic flow data on the same weekday in consecutive weeks have high regularity and relevancy, so this paper selects the same day in two consecutive weeks (two Mondays) for the experiment: the traffic flow data of the 46 observed sensors on April 23rd, 2018, are used as the training matrix, and the data on April 30th, 2018, are used as test data, where the data of sensor 1108512 are assumed to be missing and need to be imputed.
Owing to travel demand, weather, and other factors, the traffic flow data present stochastic fluctuations and abrupt changes. In order to reduce the impact of these stochastic fluctuations on imputation results, a five-point smoothing filter was used to preprocess the data. The original and filtered data of sensor 1108512 on April 8, 2018, are shown in Figure 1.
Figure 1: The original data and filtered data.
From Figure 1, it can be seen that the filtered data reflect the regularity of the traffic data more clearly, and the abrupt points in the original traffic flow data are effectively filtered out.
In this paper, the training data and the test data are first preprocessed with the smoothing filter, which removes abnormal points in the sensor data. Then one sensor is randomly chosen, its data are assumed to be missing, and they are imputed with the proposed model.
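The paper does not give the exact form of its five-point smoothing filter. One simple reading, assumed here purely for illustration, is a centered five-point moving average with edge values repeated at the boundaries; the function name is hypothetical.

```python
import numpy as np

def five_point_smooth(x) -> np.ndarray:
    """Centered 5-point moving average; boundary samples are padded by repetition."""
    x = np.asarray(x, dtype=float)
    padded = np.pad(x, 2, mode="edge")            # two extra samples on each side
    kernel = np.ones(5) / 5.0
    return np.convolve(padded, kernel, mode="valid")   # same length as the input

spike = [0.0, 0.0, 10.0, 0.0, 0.0]                # a single abrupt point
print(five_point_smooth(spike))                   # the spike is spread over the window
```

A daily series of 288 five-minute samples would be passed through the same function column by column before the training and test matrices are built.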
5.2. Results and Performances Analysis
5.2.1. Influence of Parameter λ
The compromise factor λ is an important parameter of low-rank matrix decomposition, and different λ values have an important impact on the performance of data imputation. In order to verify the effectiveness of the ILRMD method, the influence of parameter λ is analyzed. The changes of the RMSE and MAPE of the imputation results with the compromise factor λ are shown in Figures 2(a) and 2(b), respectively.
Figure 2: The effect of compromise factor λ on error function. (a) RMSE changes with compromise factor λ. (b) MAPE changes with compromise factor λ.
From Figure 2, we can see that, for the traditional MC method, both RMSE and MAPE gradually decrease as the compromise factor λ increases, reach their minimum at λ=0.08, and then increase again. For the ILRMD method, RMSE and MAPE also decrease at first as λ increases, reach their minimum at λ=0.15, and then increase slowly. In all cases, the traditional MC method is far less accurate than the ILRMD method. Therefore, in order to compare the imputation results of the two methods in their best state, λ is set to 0.08 for the traditional MC method and 0.15 for the ILRMD method in this paper.
5.2.2. The Selection of the Training Data
Because traffic flow has high spatial-temporal correlation characteristics, it is necessary to analyze the effect of different training data on imputation results. The experiments show that the selection of training data has little influence on the performance of the proposed ILRMD method. In order to show that the performance of the proposed method is not sensitive to the training day, the traffic flow data of four days (April 21st, 22nd, 23rd, and 24th, 2018) are randomly selected as training data to impute the data of April 30th, 2018. The experimental results are shown in Figures 3(a), 3(b), 3(c), and 3(d).
Figure 3: The imputation results of traditional MC and ILRMD methods, panels (a)-(d).
It can be seen from Figure 3 that the proposed ILRMD method consistently performs well and is not sensitive to the selection of training data. The imputation performance for the different training data is shown in Table 1.
Table 1: The performance comparison of different training data.

| Training data | MAPE | RMSE |
|---|---|---|
| April 21st, 2018 | 0.0294 | 0.0454 |
| April 22nd, 2018 | 0.0364 | 0.0588 |
| April 23rd, 2018 | 0.0260 | 0.0453 |
| April 24th, 2018 | 0.0207 | 0.0409 |
It can be seen from Table 1 that the proposed method performs well regardless of which training data are used. The results indicate that the choice of training day has little influence on the proposed ILRMD method. Therefore, only the traffic flow data of one day (April 23rd, 2018) are used to verify the proposed model in this paper.
5.2.3. Comparison of Imputation Results
For the purpose of verifying the performances of ILRMD method, the proposed method is compared with the traditional method. The imputation results of the ILRMD method under the best condition (λ=0.15) and the traditional method under the best condition (λ=0.08) are shown in Figures 4(a) and 4(b).
Figure 4: The imputation results of traditional MC and ILRMD methods. (a) The imputation results of two imputation methods (λ=0.08). (b) The imputation results of two imputation methods (λ=0.15).
From Figure 4, it can be seen that the imputation results of traffic flow data through the ILRMD are more accurate than the traditional MC method. Although the imputation result is obtained in the optimal compromise factor λ with the traditional MC method, there is a big deviation between the imputation result and the real data, and the ILRMD method still recovers the missing traffic data more accurately. When compromise factor λ is set as the optimal value for the ILRMD method, the imputation result is almost identical with the real value, but there are more deviations in traditional methods. It is observed that the imputation results of the proposed ILRMD method have similar traffic patterns with the real traffic flow, especially in morning and evening peak hours.
5.2.4. The Comparison of ILRMD and Other Imputation Methods
In order to evaluate the advantages of our proposed approach, the ARIMA, SVR, DBN-SVR, WNN, KNN, and traditional MC imputation methods are tested with the same experimental data. In the ARIMA model, the autoregressive order p, the moving average order q, and the differencing order d are set to 5, 5, and 1, respectively. In the SVR model, the kernel function is configured as “rbf”, the number of iterations is 10,000, and the penalty factor is taken as 0.01. In the WNN model, the number of iterations is 1000 and the number of hidden layer nodes is 3. In the DBN-SVR model, the number of network layers in the DBN model is set to 3 and the number of iterations is 200. The ILRMD model proposed in this paper is compared with these imputation methods; the imputation results of the different models and the real traffic flow over one day are shown in Figure 5.
Figure 5: The imputation results of different models.
It can be seen from Figure 5 that the imputed traffic flow has traffic patterns similar to those of the real traffic flow. The DBN-SVR model has the worst imputation performance; the ARIMA, SVR, KNN, and WNN models are better than the DBN-SVR model but weaker than the ILRMD method. The imputed values of the proposed ILRMD model almost coincide with the measured data, which shows that the proposed ILRMD model has better imputation performance.
The error analysis is conducted using the two error evaluation criteria, and the results are reported in Table 2. In order to verify the performance of the proposed model more precisely, another sensor, numbered 1119921, is randomly selected for the test. In Table 2, the data of sensors 1108512 and 1119921 are, respectively, assumed to be missing and are imputed. It can be seen from Table 2 that, in both cases, the proposed ILRMD model has the best performance compared with the other approaches. These experiments verify that the ILRMD model proposed in this paper is an effective method for data imputation.
Table 2: The performance comparison of data imputation models.

| Model | MAPE (sensor 1108512) | RMSE (sensor 1108512) | MAPE (sensor 1119921) | RMSE (sensor 1119921) |
|---|---|---|---|---|
| KNN | 0.1338 | 0.5022 | 0.1095 | 0.4260 |
| SVR | 0.1024 | 0.5396 | 0.1401 | 0.2424 |
| WNN | 0.1442 | 0.6660 | 0.3994 | 0.1229 |
| DBN-SVR | 0.1129 | 0.6402 | 0.0994 | 0.7280 |
| ARIMA | 0.2078 | 0.6065 | 0.2047 | 0.5968 |
| Traditional MC | 0.3717 | 0.1830 | 0.1350 | 0.1099 |
| ILRMD | 0.0260 | 0.0453 | 0.0495 | 0.0672 |
For the first condition in Table 2 (sensor 1108512), the imputation accuracy of the ILRMD model improves by 93.01%, 74.61%, 95.96%, 80.57%, 96.30%, and 81.97% compared with the traditional MC, SVR, ARIMA, KNN, DBN-SVR, and WNN methods, respectively. On average, the imputation accuracy is 87.07% higher than that of the other imputation methods. The results demonstrate that the proposed ILRMD model has the best performance compared with the other approaches, and it is an effective method for data imputation.
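The paper does not state how the improvement percentages are computed. The sketch below assumes they are relative MAPE reductions, (baseline − ILRMD) / baseline, which reproduces the reported figures for the traditional MC, SVR, KNN, and WNN methods from the sensor 1108512 column of Table 2; the ARIMA and DBN-SVR figures are taken as reported and are not recomputed. The average is the plain mean of the six reported percentages.

```python
def relative_improvement(baseline_mape: float, proposed_mape: float) -> float:
    """Relative error reduction of the proposed method over a baseline, in percent."""
    return (baseline_mape - proposed_mape) / baseline_mape * 100.0

ilrmd_mape = 0.0260                                   # Table 2, sensor 1108512
baseline_mape = {"Traditional MC": 0.3717, "SVR": 0.1024, "KNN": 0.1338, "WNN": 0.1442}
recomputed = {name: round(relative_improvement(value, ilrmd_mape), 2)
              for name, value in baseline_mape.items()}

reported = [93.01, 74.61, 95.96, 80.57, 96.30, 81.97]  # percentages as stated in the text
average_reported = round(sum(reported) / len(reported), 2)

print(recomputed)
print(average_reported)
```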
6. Conclusions and Recommendations
In this paper, a data imputation method is proposed to impute missing traffic flow data. Unlike most known traffic flow data imputation methods, the ILRMD model makes effective use of the information from the sensor with missing data and takes full advantage of the high spatiotemporal correlation of traffic flow data. The experimental results show that the proposed imputation method is superior to the other methods tested. However, this paper deals only with missing traffic data at a single sensor; only one observed sensor with missing data was considered. In practice, missing traffic data is typically distributed across multiple sensors.
In future research, we will study missing data across multiple sensors. The concept of a missing rate can be introduced, and more effective imputation methods can be proposed for different degrees of missing data in order to improve the imputation accuracy.
Data Availability
The data used in this paper were collected from the Caltrans Performance Measurement System (PeMS) at 46 sensors numbered from 1108512 to 1221232, from 04/01/2018 to 04/27/2018. Researchers who wish to obtain these data can access them through the PeMS website: http://pems.dot.ca.gov/.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This research was partly supported by the National Key R&D Program of China (2018YFC0808706) and the National Natural Science Foundation of China (Grant no. 5157081053). The authors are also grateful to the PeMS for providing the data.