Highway Travel Time Prediction Using Sparse Tensor Completion Tactics and K-Nearest Neighbor Pattern Matching Method

1School of Traffic and Transportation, Beijing Jiaotong University, Beijing, China 2MOE Key Laboratory for Urban Transportation Complex Systems Theory and Technology, Beijing Jiaotong University, Beijing, China 3School of Mechanical and Electronic Control Engineering, Beijing Jiaotong University, Beijing, China 4Department of Civil and Architectural Engineering and Construction Management, University of Cincinnati, Cincinnati, OH, USA


1. Introduction
Travel time is a traffic parameter that effectively reflects traffic conditions, and it is the traffic information most in demand among travelers. How to predict travel time accurately and in a timely manner is a classic research question in intelligent transportation system (ITS) research.
In many countries, travel time estimation and prediction studies are based on the traffic data collected by loop vehicle detectors. In China, although the highway mileage currently ranks first in the world, no loop vehicle detectors or other traffic detection equipment were installed on highways because of early construction budget limitations. In recent years, with the development of remote transportation microwave sensor (RTMS) technology, some highways in China have begun to install an RTMS every 2 to 5 km to monitor traffic flow conditions. An RTMS is a fixed, side-mounted traffic information collection device (the working process of RTMS is described in Section 2). This makes it possible to carry out highway travel time prediction in China.
Abnormal operation, noise interference, or transmission line failure can lead to incomplete traffic data collected by RTMSs, and the large distance between two sensors on a highway can lead to a sparse data problem. Data missing due to equipment failure and other hardware-related problems can be regarded as data missing at random. After introducing the concept of virtual sensor nodes, the sparse data problem caused by the large distance between microwave sensors can be regarded as data missing at the virtual nodes. These phenomena inevitably affect the quality of the traffic data collected by the RTMS, thus affecting the accuracy of travel time prediction. Missing data recovery methods based on the "vector or matrix" structure can be mainly divided into statistical, interpolation, and predictive types. These methods do not exploit the multimode characteristics of traffic data, such as time, space, week, and day, so their completion precision is limited. The tensor model is a new method for solving the missing data problem that can fully use the multimode characteristics of data and complete missing data in high-dimensional space [11].
Aiming at solving the sparse data problem, Quiroga and Bullock [16,17] proposed a travel time estimation method based on the velocity integral model and the position interpolation model using expressway vehicle GPS data. Wang et al. [18] proposed a model based on the naïve Bayesian method to estimate traffic volume on parts of the road network not covered by GPS samples. Xu [19] used a two-dimensional linear interpolation method and a segmentation method to estimate the travel time between expressway stations from remote transportation microwave sensor data. The author of this paper has also proposed a solution to the sparse RTMS data problem: Zhao et al. [20] used a two-dimensional linear interpolation method to solve the problem of sparse microwave detector data. This method has had a positive effect on improving the effectiveness of traffic congestion detection.
Travel time prediction is a research hotspot in modern intelligent transportation systems, and many prediction methods exist. These methods can be divided into model-based methods (ARIMA method [21], Bayesian method [22], multivariable linear regression [10], neural network prediction algorithm [23], Kalman filter algorithm [24], support vector machine [25], etc.) and data-driven methods (K-nearest neighbor matching [5], nonparametric regression [26], etc.). In this paper, the travel time is predicted based on the RTMS data. The sampling interval is short and the sampling frequency is high, so a large amount of data can be collected in a short time. The model-based method has too many parameters when dealing with massive data, and its model structure is complicated. The data-driven approach does not require establishing a model or identifying a large number of parameters. Therefore, it is suitable for handling massive data. The K-nearest neighbor algorithm is a data-driven method, which can make use of multimodal information to cluster data from similar modes together for prediction. The basic idea behind the K-nearest neighbor algorithm is to obtain a complete historical database, extract the data characteristics, and then match the historical data pattern that is most similar to the current data pattern for future situation prediction. The author of this paper has studied travel time prediction based on the K-nearest neighbor algorithm [5]. However, in practical application, it is found that the traditional K-nearest neighbor algorithm has some limitations, such as poor adaptability caused by the fixed value of K and low prediction accuracy when there is a large amount of abnormal data or under congested traffic conditions. Therefore, this paper uses the K-nearest neighbor algorithm to predict travel time and further optimizes the algorithm.
In summary, the major purpose of this paper is similar to that of [10]: to solve the issues caused by missing data and data sparsity and thereby improve the accuracy of travel time estimation and prediction. Therefore, this paper first uses a tensor-based data completion method to complete the missing RTMS data. Then, the concept of virtual nodes is introduced, and the two-dimensional linear interpolation method is used to fill the completely missing data at the virtual nodes to estimate the travel time. The traditional travel time prediction method based on the K-nearest neighbor algorithm is optimized: the state vector (speed, traffic volume, and time of day) is subdivided, the degree of traffic congestion is added as a new state vector component, and the K value is calibrated by the cross-validation method. Finally, the method is verified using the remote transportation microwave sensor data of Jinggangao highway. The framework of this paper is shown in Figure 1.

2. Data Preprocessing
Because RTMS detects traffic lane by lane, the collected data cannot be used directly for travel time prediction; it must first be aggregated into section data. Due to equipment failure, transmission failure, and other causes, the RTMS data contain some abnormal records, which seriously affect data quality. This section describes the data preprocessing process. Firstly, the sublane data are integrated into section data. Secondly, abnormal data are identified using a method based on the threshold method and traffic flow theory.
For the preprocessing of RTMS data, the author of this paper has proposed a method in the literature [19].For the convenience of the reader's understanding, this part introduces the data preprocessing method in the literature [19].
The data in this paper were collected from the Jinggangao highway, Dujiakan station to Jingliang Road station, between January 1, 2016, and March 15, 2016. The experiment segment has three toll stations and 11 sensors, with three lanes in each direction. The highest speed limit of the freeway is 120 km/h, and the length of this highway segment (Dujiakan station to Jingliang Road station) is 9.6 km. Jinggangao Expressway is an expressway linking Beijing with major southern cities such as Guangzhou, Zhuhai, Hong Kong, and Macao. The section studied in this paper is located in Beijing, the capital of China, and is one of the main routes into and out of Beijing in the southwest. The traffic on the experiment segment is heavy every day, and it experiences traffic congestion during peak hours.
As shown in Figure 2, the RTMS [20] is a fixed-type traffic information collection device based on digital radar technology, which can obtain real-time traffic information data. Take traffic speed detection as an example: when a vehicle enters the monitoring area in Figure 2, the RTMS detects and records its entry time. When the vehicle leaves the detection area, the RTMS also records its departure time. The vehicle speed can be calculated using the time difference and detection distance. Therefore, the detected traffic speed is the average speed of all the vehicles that pass through the detection zone during a certain period for a certain lane. RTMS can detect other traffic data including intensity, occupancy, and car style. Table 1 shows the traffic information that can be collected by RTMS. This paper uses the 7 italic items in Table 1.
Table 1 entries (field, meaning): occupy, time occupancy; (6) car class, type of vehicle; (7) direction id, direction of lane; (8) lane id, number of lane; (9) road id, number of road; (10) transfered, flag of transfer.
2.1. Data Integration. RTMS has the characteristic of sublane detection. RTMS detects traffic data of six lanes. However, the basis of traffic flow prediction is instantaneous section speed and intensity. Taking into account the contribution of traffic volume to the section speed calculation, a weighted method that converts the data of each lane to section data is presented [5]. The data of six lanes were integrated into section average speed and total intensity data, as shown in (1), and the data integration period is 5 minutes.
$$V = \frac{\sum_{i} q_i v_i}{Q}, \qquad Q = \sum_{i} q_i, \tag{1}$$

where $V$ represents the average speed of all vehicles passing through the detection section; $v_i$ represents the average speed of the vehicles passing through the detection section in lane $i$ during the detection period; $q_i$ represents the traffic volume of lane $i$ passing through the detection section during the detection period; $Q$ represents the total traffic volume through the detection section during the detection period.
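The volume-weighted aggregation in (1) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation; the function name `integrate_lanes` and the convention of returning `None` when no vehicle is detected are assumptions made here.

```python
def integrate_lanes(lane_speeds, lane_volumes):
    """Combine per-lane records of one 5-minute period into a section record.

    lane_speeds  : average speed of each lane (km/h)
    lane_volumes : vehicle count of each lane in the same period
    Returns (section_speed, total_volume).
    """
    total_volume = sum(lane_volumes)
    if total_volume == 0:
        # No vehicles detected: the weighted speed is undefined.
        # Returning None is an assumed convention, not taken from the paper.
        return None, 0
    weighted_sum = sum(q * v for q, v in zip(lane_volumes, lane_speeds))
    return weighted_sum / total_volume, total_volume


# Hypothetical three-lane example (values are illustrative only).
section_speed, section_volume = integrate_lanes([100.0, 90.0, 80.0], [10, 20, 10])
```

Lanes that carry more vehicles contribute proportionally more to the section speed, which is the reason for weighting by volume instead of taking a plain mean over lanes.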
2.2. Abnormal Data Identification. The abnormal data judgment method is based on the threshold method and traffic flow theory. The threshold method takes into account the fact that the value of a traffic flow parameter must stay within a reasonable threshold interval over a given time interval. Traffic flow theory holds that the relationship between the three parameters of traffic flow satisfies a certain regularity. Based on traffic flow theory, this paper obtains discriminant rules in threshold form and uses them to identify abnormal data.
According to the characteristics of the test section, the rules were determined as follows.
Step 1 (recognition based on the threshold method). The rules determined according to the threshold method are shown in (2). Collected traffic data that do not satisfy (2) are recognized as abnormal data.
In (2), $q$ is the vehicle volume; $C$ is the road capacity; $T$ is the data acquisition cycle; $f_c$ is the correction factor (the value of $f_c$ varies between 1.3 and 1.5); $V$ is the location average speed; $V_{\max}$ is the limit speed of the road; $o$ is the occupancy; $f_V$ is the correction factor (the value of $f_V$ varies between 1.3 and 1.5).
Step 2 (recognition based on traffic flow theory). According to traffic flow theory and the vehicle driving conditions on the road, the abnormal data determination rules are as follows:
(1) The location average speed $V$ is 0, and the volume $q$ is not 0.
(2) The volume $q$ is 0, and the occupancy rate $o$ and the location average speed $V$ are not 0 at the same time.
(3) The occupancy rate is 0, and the volume is greater than the set value.
(4) The speed and volume satisfy the relationship in equation (3), where $q_{\max}$ is the maximum volume and $f_q$ is a correction factor.
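The two recognition steps can be illustrated with a small screening function. This is a hedged sketch: the threshold rule is assumed to take the common form $0 \le q \le f_c C T / 60$, $0 \le V \le f_V V_{\max}$, $0 \le o \le 100$ (with $T$ in minutes and $C$ in veh/h), the default capacity of 6000 veh/h assumes three lanes at 2000 veh/h each, and `q_set` is a hypothetical value for the "set value" in rule (3). Rule (4) is omitted because it depends on the speed-volume curve. Rule (2) is read here as "occupancy and speed are not both zero".

```python
def is_abnormal(q, v, o, capacity=6000.0, cycle_min=5.0, v_limit=120.0,
                f_c=1.5, f_v=1.5, q_set=10.0):
    """Return True if a section record (volume q, speed v, occupancy o) looks abnormal.

    All default parameter values are illustrative assumptions.
    """
    # Step 1: threshold rules (assumed form of the threshold check).
    q_max = f_c * capacity * cycle_min / 60.0   # largest plausible count per cycle
    v_max = f_v * v_limit                       # largest plausible speed
    if not (0 <= q <= q_max and 0 <= v <= v_max and 0 <= o <= 100):
        return True
    # Step 2, rule (1): vehicles counted but zero speed.
    if v == 0 and q != 0:
        return True
    # Step 2, rule (2): no vehicles counted, yet occupancy or speed is nonzero.
    if q == 0 and (o != 0 or v != 0):
        return True
    # Step 2, rule (3): zero occupancy with a volume above the set value.
    if o == 0 and q > q_set:
        return True
    return False


# Hypothetical records: (volume in veh/5 min, speed in km/h, occupancy in %).
records = [(100, 90, 10), (50, 0, 5), (0, 0, 0), (0, 60, 0), (900, 90, 30)]
flags = [is_abnormal(q, v, o) for q, v, o in records]
```

Records flagged here would be removed and later treated as randomly missing data.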
2.3. Validation and Analysis. According to the "Technical Standard of Highway Engineering (JTG B01-2014)," the road capacity of expressways in China's plains area is 2,000 vehicles per hour and the speed limit is 120 km/h. Therefore, according to traffic engineering theory, the relationship between speed and volume should meet equation (4). Figure 3 is a speed-volume scatter plot drawn using the data from one sensor on a test road section. On the volume axis, the unit is veh/5 min in one direction. The points outside the curve are recognized as abnormal data.

3. Random Missing Traffic Data Recovery with Tensor Reconstruction
Abnormal operation, noise interference, transmission line failure, or abnormal data removal can lead to a random data-missing problem. The data-missing problem inevitably affects the quality of the traffic data collected by the RTMS. In this section, a missing data recovery method based on tensor completion is presented.
3.1. Correlation Analysis of Traffic Data. Similarity is one of the factors impacting imputation performance. Combining the multidimensional characteristics of traffic data, mining the multimode similarities will make a great contribution to imputing missing values [11]. In this paper, correlation analysis is used to reveal the similarity of traffic data.
The multimode characteristic is the result of understanding and describing characteristics from different angles. The analysis result differs when one thing is analyzed from different angles or using different methods. Traffic data show significant multimode characteristics when observed from the day, week, space, or time mode, and researchers have found that traffic data show a strong multicorrelation in these modes. Figure 4 shows the distribution of remote transportation microwave sensor data in each mode. Figure 4(a) shows the distribution of data in day mode and time mode. Figure 4(b) is a three-dimensional diagram which shows the distribution of data in space mode, where the x-axis represents the moment, the y-axis represents the location, and the z-axis represents the volume. Figure 4(c) is a three-dimensional diagram which shows the distribution of data in week mode, where the x-axis represents the moment, the y-axis represents the week, and the z-axis represents the volume. It can be seen that RTMS data have similar characteristics in different modes. It can be seen from Figure 4(a) that the volumes between 8 and 9 AM on Thursday were lower than in other periods. As 8-9 AM is the weekday peak, traffic congestion during this time on Thursday caused a drop in traffic volume.
In this paper, the following equation in the literature [10] is used to give the correlation of the various modes of traffic data: where  refers to the whole data points and (, ) refers to the correlation coefficient matrix.Table 2 shows the correlation of traffic data tensors in day mode, week mode, space mode, and time mode.
It can be seen from Table 2 that the traffic data show a high correlation in space, week, and day modes.However, in the time mode, the traffic volume shows different states at different times, which makes the traffic volume tensor exhibit a low correlation in time mode.It can also be seen that, due to the weak correlation between weekdays and weekends, compared to the "week" mode, the correlation in the "day" mode is low.
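One common way to quantify the correlation of a tensor along a mode is to unfold the tensor along that mode and average the pairwise Pearson correlation coefficients of the resulting rows. The sketch below illustrates this idea only; it is not the equation from [10], and the function name `mode_correlation` and the synthetic data are assumptions.

```python
import numpy as np


def mode_correlation(tensor, mode):
    """Mean pairwise Pearson correlation between the slices of `tensor` along `mode`."""
    rows = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
    corr = np.corrcoef(rows)
    # Average the off-diagonal entries (the diagonal is always 1).
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(off_diagonal.mean())


# Synthetic day x time volume matrix: 7 days share one daily profile plus small noise.
rng = np.random.default_rng(0)
profile = 100.0 + 50.0 * np.sin(2.0 * np.pi * np.arange(288) / 288.0)
volume = profile[None, :] + rng.normal(0.0, 1.0, size=(7, 288))

day_corr = mode_correlation(volume, mode=0)    # days resemble each other
time_corr = mode_correlation(volume, mode=1)   # time slots differ within a day
```

In this synthetic example the day mode is highly correlated while the time mode is not, which mirrors the qualitative pattern reported in Table 2.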
3.2. Establishment of Traffic Data Tensor Model. A tensor is a multidimensional array, which is a high-order extension of the first-order vector and the second-order matrix. The position of an element needs to be represented by three or more variables. The N-order tensor is expressed as $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. Section 3.1 illustrated that the expressway traffic data show strong temporal and spatial correlation, which provides a sufficient basis for the construction of a traffic data tensor model. Based on the analysis results of Section 3.1, this paper constructs the tensor form of traffic data $\mathcal{A} \in \mathbb{R}^{4 \times 11 \times 7 \times 288}$, as shown in Figure 5, by combining the multimode spatial-temporal characteristics of traffic volume.

3.3. Low-Rank Analysis of Traffic Data. The low-rank hypothesis can successfully approximate the multimode correlation of the traffic data. In essence, when the tensor is of low rank or has sparse characteristics, the tensor is redundant and can be expressed in a more compact form, so that a complete tensor can be reconstructed from fewer sampled entries. To solve the problem of lost data recovery of the tensor, it is generally assumed that the tensor to be recovered has a low-rank structure; that is, its data are distributed in a low-dimensional linear subspace. Obviously, the traffic data tensor model constructed from the actual data without any human intervention is not of low rank, and most of the time it is of full rank. Therefore, a low-rank approximation is needed. In order to reveal the working principle of the low rank, the high-order singular value decomposition (HOSVD) of the tensor is used for a multimode low-rank approximation. HOSVD can use a tensor with a low-rank structure to approximate a known tensor. The HOSVD model decomposes the tensor into a kernel tensor multiplied by a matrix in each mode, as shown in

$$\mathcal{A} = \mathcal{S} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)}.$$

The tensor $\mathcal{S}$ is called a kernel tensor, and its elements represent the relationship between the principal components of the various modes. $U^{(n)}$ is the expansion matrix of the $n$-mode. $\mathcal{S} \times_n U$ is the $n$-mode product of tensor $\mathcal{S}$ and matrix $U$.
Figure 6 shows the partial results of the low-rank approximation of the traffic tensor using the n-mode rank [3,3,3,3].As shown in Figure 6, the low-rank approximation of the tensor with rank [3, 3, 3, 3] can maintain the main change of 4 × 11 × 7 × 288 traffic volume tensor data.The appropriate multi-low-rank hypothesis can effectively capture the multirelevance of traffic volume, gain the main changes of traffic volume data, and thus estimate the missing data.
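The low-rank approximation with a fixed n-mode rank can be sketched with NumPy as a truncated HOSVD. This is a minimal illustration under stated assumptions: the helper names (`unfold`, `mode_product`, `truncated_hosvd`) are introduced here, and the demonstration tensor is synthetic, with a shortened time mode, not the traffic tensor of this paper.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: axis `mode` becomes the rows, all other axes are flattened."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """n-mode product: `matrix` (J x I_n) acts on axis `mode` of `tensor`."""
    moved = np.moveaxis(tensor, mode, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, mode)


def truncated_hosvd(tensor, ranks):
    """Return (core, factors, approximation) for the given n-mode ranks."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of each unfolding span the mode subspace.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)       # project onto each subspace
    approx = core
    for mode, u in enumerate(factors):
        approx = mode_product(approx, u, mode)     # map back to the original size
    return core, factors, approx


# Synthetic fourth-order tensor with exact n-mode rank [3, 3, 3, 3].
rng = np.random.default_rng(0)
demo = rng.standard_normal((3, 3, 3, 3))
for mode, size in enumerate((4, 11, 7, 24)):
    demo = mode_product(demo, rng.standard_normal((size, 3)), mode)

core, factors, approx = truncated_hosvd(demo, ranks=(3, 3, 3, 3))
rel_error = np.linalg.norm(approx - demo) / np.linalg.norm(demo)
```

For a tensor whose n-mode ranks really are [3, 3, 3, 3], the approximation is exact up to rounding; for real traffic data the residual indicates how much of the variation the low-rank structure captures.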
In this paper, the tensor recovery algorithm based on low-rank and multilinear full-rank decomposition has the same purpose as the HOSVD model: both make a low-rank approximation of a given tensor.
3.4. Tensor Completion Based on Matrix Decomposition. In this paper, the missing tensor data recovery via multilinear low/full-rank decomposition model is used to estimate the missing data in a tensor. For the reader's convenience, the method is introduced using the fourth-order tensor constructed in this paper as an example.
As shown in (7), the tensor completion method uses an optimization algorithm to recover the missing data by minimizing the tensor rank. That is, the recovered tensor has a low-rank structure.

$$\min_{\mathcal{A}} \ \operatorname{rank}(\mathcal{A}) \quad \text{s.t.} \quad \mathcal{A}_{\Omega} = \mathcal{B}_{\Omega}, \tag{7}$$

where $\operatorname{rank}(\mathcal{A})$ represents the rank of the tensor and $\Omega$ is the set of known element subscripts. $\mathcal{B}$ is the original volume (or speed) tensor and $\mathcal{A}$ is the recovered tensor. The constraint is that each value with a subscript in $\Omega$ is equal to the value of the corresponding data in $\mathcal{B}$.
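However, the formula is not solvable. Assuming that the given tensor has an $n$-mode low-rank structure, (7) can be transformed to minimize the $n$-ranks of the given tensor, as shown in

$$\min_{\mathcal{A}} \ \operatorname{rank}(A_{(i)}) \quad \text{s.t.} \quad (A_{(i)})_{\Omega} = (B_{(i)})_{\Omega}, \quad i = 1, 2, 3, 4,$$

where $A_{(i)}$ and $B_{(i)}$ are the mode-$i$ unfoldings of $\mathcal{A}$ and $\mathcal{B}$.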
In order to make full use of the information in all tensor modes, the ranks of the individual modes are combined by a weighted summation strategy. To simplify the tensor completion problem further, it is recast as the minimization of a norm using the full-rank decomposition of each unfolding matrix. The tensor completion problem is thereby transformed into the minimization of an objective $F(X^{(i)}, Y^{(i)}, A^{(i)})$ over $X^{(i)}$, $Y^{(i)}$, and $A^{(i)}$ for $i = 1, 2, 3, 4$, respectively. In the update rules, $Y^{T}$ represents the transpose of matrix $Y$, $Y^{-1}$ represents the Moore-Penrose inverse of the matrix $Y$, and orth(·) is a set of standard orthogonal bases formed by the columns of its matrix argument.

The core strategy of the algorithm is that when one group of variables is optimized, the other groups can be treated as constants. In the process of optimizing the tensor completion problem, the three groups of variables are $X^{(i)}$, $Y^{(i)}$, and $A^{(i)}$, and each group includes 4 variables. In order to obtain the final optimization result, the solution method used in this paper is to optimize $X^{(i)}$, $Y^{(i)}$, and $A^{(i)}$ alternately, and in each optimization pass the values of $X^{(i)}$, $Y^{(i)}$, and $A^{(i)}$ are updated once. The first pass of the algorithm uses the initial values, and once the algorithm reaches the convergence criterion, the optimization is stopped and the result is obtained. The algorithm flow chart is shown in Figure 7.
In the convergence condition in Figure 7, tol is the convergence threshold. This method converts the missing tensor recovery problem into a weighted multilinear low-rank decomposition form. Then, a differential gradient strategy and a block iterative strategy are used to solve this problem. Figure 8 shows the traffic volume and speed data collected by a microwave detector on January 5, 2016, and the result of missing data completion.
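The alternating scheme can be illustrated with a simplified stand-in. The sketch below is not the authors' algorithm: it assumes per-mode updates of the form $X \leftarrow \mathrm{orth}(A_{(i)} Y^{T})$ and $Y \leftarrow X^{T} A_{(i)}$, equal mode weights, a mean-value initialization of missing entries, and a relative-change stopping rule. The names `complete_tensor`, `known`, and `tol` are introduced here for illustration.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding of a tensor."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def fold(matrix, mode, shape):
    """Inverse of `unfold` for a tensor of the given shape."""
    rest = tuple(s for i, s in enumerate(shape) if i != mode)
    return np.moveaxis(matrix.reshape((shape[mode],) + rest), 0, mode)


def complete_tensor(data, known, ranks, tol=1e-6, max_iter=500, seed=0):
    """Fill entries where `known` is False using per-mode low-rank factorizations."""
    rng = np.random.default_rng(seed)
    n_modes = data.ndim
    weights = np.full(n_modes, 1.0 / n_modes)            # equal mode weights (assumed)
    est = np.where(known, data, data[known].mean())       # initial guess for gaps
    ys = [rng.standard_normal((r, est.size // est.shape[m]))
          for m, r in enumerate(ranks)]
    for _ in range(max_iter):
        new_est = np.zeros_like(est)
        for m in range(n_modes):
            a_m = unfold(est, m)
            x, _ = np.linalg.qr(a_m @ ys[m].T)            # orthonormal basis of A Y^T
            ys[m] = x.T @ a_m                             # least-squares update of Y
            new_est += weights[m] * fold(x @ ys[m], m, est.shape)
        new_est[known] = data[known]                      # observed entries stay fixed
        change = np.linalg.norm(new_est - est) / np.linalg.norm(est)
        est = new_est
        if change < tol:                                  # assumed stopping rule
            break
    return est


# Synthetic low-rank tensor with about 30% of the entries removed at random.
rng = np.random.default_rng(1)
core = rng.standard_normal((2, 2, 2, 2))
us = [rng.standard_normal((size, 2)) for size in (4, 11, 7, 24)]
truth = np.einsum('abcd,ia,jb,kc,ld->ijkl', core, *us)
known = rng.random(truth.shape) > 0.3
data = np.where(known, truth, np.nan)

completed = complete_tensor(data, known, ranks=(2, 2, 2, 2))
missing = ~known
err = np.linalg.norm(completed[missing] - truth[missing]) / np.linalg.norm(truth[missing])
baseline = np.linalg.norm(data[known].mean() - truth[missing]) / np.linalg.norm(truth[missing])
```

Comparing `err` with `baseline` (filling every gap with the mean of the observed entries) gives a quick sanity check that the low-rank structure is actually being exploited.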

4. Solution of the Data Sparseness Problem Based on a Two-Dimensional Linear Interpolation Method
So far, the recovery of missing data has been completed using tensor theory. Due to the large distance between every two sensors on the highway, the collected traffic data are sparse, which leads to a large calculation error when the travel time is estimated. This paper introduces virtual nodes between real nodes and solves the sparse data problem using the two-dimensional linear interpolation algorithm. Then, the velocity data after interpolation are used to estimate the travel time between two toll stations.
4.1. Virtual Node Settings. As shown in Figure 9, this paper adds virtual microwave detection nodes on the road so that the distance between two neighboring nodes is less than 1 km after the virtual nodes are added. The actual node locations are where the RTMSs are installed.
From Figure 9, it can be seen that the sparse data problem can be regarded as data missing at the virtual nodes. In order to solve this problem, this paper uses a two-dimensional linear interpolation method [5,20] to interpolate the velocity data at the virtual nodes.
4.2. Two-Dimensional Linear Interpolation Method. For the reader's convenience, this part introduces the two-dimensional linear interpolation method in the literature [5,20].
The principle of the two-dimensional linear interpolation method is shown in Figure 10. The first step of the method is to calculate the average speeds of adjacent detectors at a given time by temporal linear interpolation. Here, $v(i, t)$ and $v(i+1, t)$ represent the average speeds of detection points $i$ and $i+1$ at time $t$, respectively; $v(i, h)$ represents the speed of detection point $i$ in time period $h$; $t_0$ is the start of time period $h$; and $\Delta t$ is the duration of $h$. The second step is to calculate the average speed at the vehicle position between the detectors by spatial linear interpolation, where $v(i, t)$ is the average speed of the detection point $i$ at time $t$; $x(t)$ is the position of the vehicle at time $t$; $x_i$ is the position of detector $i$; and $L$ is the distance between detector $i$ and detector $i+1$.
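The two interpolation steps can be written compactly. The sketch below is one plausible reading, assuming the temporal step interpolates between the speeds of period $h$ and the following period, $v(i,t) = v(i,h) + \frac{t - t_0}{\Delta t}\,[v(i,h+1) - v(i,h)]$, and the spatial step is $v(x,t) = v(i,t) + \frac{x(t) - x_i}{L}\,[v(i+1,t) - v(i,t)]$. The function names and numeric values are hypothetical.

```python
def interpolate_time(v_h, v_next, t, t0, dt):
    """Temporal step: speed of one detector at time t inside period h.

    v_h    : speed of the detector in period h
    v_next : speed of the same detector in the following period (assumed reference)
    t0, dt : start and duration of period h
    """
    return v_h + (v_next - v_h) * (t - t0) / dt


def interpolate_space(v_i, v_ip1, x, x_i, length):
    """Spatial step: speed at position x between detector i and detector i+1."""
    return v_i + (v_ip1 - v_i) * (x - x_i) / length


# Hypothetical example: 5-minute periods (300 s), detectors 2000 m apart.
v_i_t = interpolate_time(80.0, 90.0, t=150.0, t0=0.0, dt=300.0)     # detector i
v_ip1_t = interpolate_time(60.0, 70.0, t=150.0, t0=0.0, dt=300.0)   # detector i+1
v_virtual = interpolate_space(v_i_t, v_ip1_t, x=500.0, x_i=0.0, length=2000.0)
```

Applying the temporal step first gives both detectors a speed at the same instant, so the spatial step then only has to interpolate along the road.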
4.3. Travel Time Estimation. Then, the travel time is estimated based on the piecewise method as follows.
Based on Figure 11, the procedure of travel time estimation between two toll stations on the expressway is as follows.
Step 1. Using the average speeds detected by the RTMSs before and after the toll station, the location average speed at the toll station is calculated by the two-dimensional linear interpolation method.
Step 2. The road section between two neighboring RTMS nodes is defined as $l_i$. Calculate the travel time $t_i$ on $l_i$. If $l_i \le 1000$ m, the travel time is calculated using (15). If $l_i > 1000$ m, virtual nodes are added to make the distance between neighboring nodes less than 1 km, and the travel time between each pair of neighboring nodes is calculated using (15).
Step 3. Calculate the travel time between toll stations A and B, $T = \sum_i t_i$.
Figure 12 shows the distribution of the actual detector nodes and virtual detector nodes of Jinggangao highway (Dujiakan station to Jingliang Road station).
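The piecewise procedure can be sketched as follows. This is an illustration under stated assumptions, not the paper's formula: the travel time on each short section is taken as its length divided by the mean of the two node speeds, which is a hypothetical stand-in for (15), and the speeds at virtual nodes are obtained by spatial linear interpolation only. The function names and the example values are introduced here.

```python
import math


def add_virtual_nodes(positions, speeds, max_gap=1000.0):
    """Insert evenly spaced virtual nodes so that no gap exceeds `max_gap` metres."""
    out_pos, out_spd = [positions[0]], [speeds[0]]
    for k in range(1, len(positions)):
        x0, x1 = positions[k - 1], positions[k]
        v0, v1 = speeds[k - 1], speeds[k]
        pieces = max(1, math.ceil((x1 - x0) / max_gap))
        for j in range(1, pieces + 1):
            frac = j / pieces
            out_pos.append(x0 + frac * (x1 - x0))
            out_spd.append(v0 + frac * (v1 - v0))   # linear interpolation in space
    return out_pos, out_spd


def travel_time(positions, speeds):
    """Sum section travel times (s); positions in m, speeds in km/h (nonzero)."""
    total = 0.0
    for k in range(1, len(positions)):
        length = positions[k] - positions[k - 1]
        mean_speed = (speeds[k - 1] + speeds[k]) / 2.0 / 3.6   # km/h -> m/s
        total += length / mean_speed                           # assumed section formula
    return total


# Hypothetical nodes: two real detectors 2500 m apart, then a toll station 800 m on.
nodes, node_speeds = add_virtual_nodes([0.0, 2500.0, 3300.0], [90.0, 60.0, 72.0])
estimate = travel_time(nodes, node_speeds)
```

Splitting the long section into shorter pieces lets the estimate follow the speed gradient between detectors instead of applying a single speed to the whole 2.5 km.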
Take the travel time estimate from Dujiakan to Jingliang Road station at 8:00 on January 4 as an example. The velocities at the virtual nodes and the toll stations given by the two-dimensional linear interpolation method are shown in Table 3.
Finally, the travel time calculation result is 419 s.

5. Travel Time Prediction Based on the K-Nearest Neighbor Algorithm
The K-nearest neighbor (KNN) method makes its prediction based on weighted average functions. Taking into account the deficiencies of the traditional KNN method, optimization is made in three aspects. Firstly, in addition to the three original state vectors of speed, volume, and time of day, the congestion level is introduced as a new state vector. Secondly, by analyzing the change characteristics of traffic volume, the whole day is divided into seven periods. Thirdly, considering the low adaptability and precision caused by the fixed K value of the traditional KNN algorithm, the cross-validation method is used to calibrate the K value. The procedure of the optimal K-nearest neighbor algorithm is introduced in this section.
5.1. State Vector Construction
5.1.1. Speed and Traffic Volume. The average speed and traffic volume of road sections were selected as the state vector to describe the road traffic status. As the traffic volume at the toll station cannot be measured directly, the traffic volume obtained by the first detector after the toll station in the vehicle heading direction was selected as a state parameter. Taking into account the continuity of the traffic state in the time dimension, the speed and volume of the three cycles before the current state were selected as the state vector of the historical model library. A velocity vector and a traffic volume vector in the pattern library can be expressed as

$$(V_{t-3T},\ V_{t-2T},\ V_{t-T}), \qquad (q_{t-3T},\ q_{t-2T},\ q_{t-T}).$$

5.1.2. Time of the Day. Time of day affects the travel time on the road: travel time is influenced by social activities and traveler habits, and the composition of traffic volume by vehicle type differs between periods, so the travel times are not the same. According to [27], by analyzing the change of traffic volume at different times of the day, this paper divides the whole day into seven periods, as shown in Table 4.
5.1.3. Traffic Congestion Degree. Weather, maintenance, accidents, and other events affect the traffic conditions and travel times, so the road traffic condition needs to be included in the state vector of the pattern library.
This paper uses the degree of the traffic congestion to characterize the impact of different events on road traffic conditions.
It was found that the change in road section average speed can reflect the fluctuation of the traffic condition. In this paper, the road section average speed was used as the standard to assess the degree of traffic congestion. Referring to the "Highway Network Operation Monitoring and Service Interim Technical Requirements" of the Ministry of Transport of the People's Republic of China, the degree of traffic congestion was divided into four levels, as shown in Table 5.
5.2. Pattern Library Establishment. The state vector of the historical model library includes period, speed, traffic volume, and congestion degree. The decision attribute is travel time (estimated by the method in Section 4). The state vectors and decision attributes were coded as shown in Table 6.
Appropriate K values play a key role in accurate predictions [5]. In order to determine the best K value of the historical data library in each period, this paper uses a cross-validation method. The specific steps are as follows.
Step 1. Assume that the minimum and maximum K values are $K_{\min}$ and $K_{\max}$, respectively.
Step 2. The historical data of every period are randomly divided into $n$ parts, recorded as $D_1, D_2, \ldots, D_n$. Each $D_i$, $i = 1, 2, \ldots, n$, is used in turn as test data, and the other $n - 1$ parts are combined as the new historical data library.
Step 3. Calculate the average value of the prediction performance index using the test data under different K values.
Step 4. When the performance index is optimal, the corresponding K is the optimal K value of this period.
The optimal K value of the test road section was calibrated as follows. The minimum and maximum K values are 1 and 50, respectively, and the historical data library in each period is divided equally into 10 parts. Taking the historical data for the segment from Dujiakan to Jingliang Road station as an example, the results of the optimal K value calibration in different periods are shown in Table 7. Taking the peak period at night as an example, the mean absolute error under different K values is shown in Figure 13. When K = 22, the mean absolute error is minimized.
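The calibration steps above can be sketched with NumPy. This is a minimal illustration under stated assumptions: neighbors are ranked by Euclidean distance, the weighted average uses inverse-distance weights (an assumed form of the weighting function), the performance index is the mean absolute error, and the data are synthetic. The names `knn_predict` and `calibrate_k` are introduced here.

```python
import numpy as np


def knn_predict(train_x, train_y, query, k):
    """Weighted-average prediction from the k nearest historical patterns."""
    dist = np.linalg.norm(train_x - query, axis=1)
    idx = np.argsort(dist)[:k]               # uses all points if fewer than k exist
    weights = 1.0 / (dist[idx] + 1e-9)       # inverse-distance weights (assumption)
    return float(np.sum(weights * train_y[idx]) / np.sum(weights))


def calibrate_k(x, y, k_min=1, k_max=50, n_folds=10, seed=0):
    """Return (best_k, best_mae) found by n-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    all_idx = np.arange(len(x))
    best_k, best_mae = None, np.inf
    for k in range(k_min, k_max + 1):
        errors = []
        for fold in folds:
            train = np.setdiff1d(all_idx, fold)      # the other n-1 parts
            for i in fold:
                pred = knn_predict(x[train], y[train], x[i], k)
                errors.append(abs(pred - y[i]))
        mae = float(np.mean(errors))
        if mae < best_mae:
            best_k, best_mae = k, mae
    return best_k, best_mae


# Synthetic pattern library: 3 state components, travel time in seconds.
rng = np.random.default_rng(1)
states = rng.random((120, 3))
times = 300.0 + 200.0 * states[:, 0] + rng.normal(0.0, 5.0, 120)

best_k, best_mae = calibrate_k(states, times, k_min=1, k_max=15, n_folds=5)
naive_mae = float(np.mean(np.abs(times - times.mean())))
check = knn_predict(np.array([[0.0], [1.0], [10.0]]),
                    np.array([10.0, 20.0, 100.0]), np.array([0.5]), k=2)
```

In practice the state components would be scaled or encoded as in Table 6 before distances are computed, and the calibration would be run separately for each of the seven periods.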
5.4. Travel Time Prediction Weighted Average Function. In this paper, the weighted average method is used to make predictions based on the nearest neighbor data.
(1) Considering the influence of the randomly missing data on prediction accuracy, the correlation and low-rank characteristics of the microwave vehicle data are analyzed, and then the traffic tensor model is established. Based on multilinear low-rank decomposition, a tensor completion algorithm is used to fill the randomly missing traffic data.
(2) Virtual nodes are introduced between real nodes, and the sparse data problem is solved using the two-dimensional linear interpolation algorithm. Then, the velocity data after interpolation are used to estimate the travel time between toll stations.
(3) The optimal K-nearest neighbor algorithm is proposed for travel time prediction, with the travel time as the decision attribute and the speed, traffic volume, time of the day, road section, and traffic congestion level as the state vector. The historical model library is established. Taking the Euclidean distance as the criterion, similar patterns are searched using the K-nearest neighbor strategy, and the weighted average method is used to predict travel time. Practical data collected from Jinggangao highway in Beijing are used to verify the algorithm. The results show that the optimal K-nearest neighbor algorithm can accurately predict travel time, and the accuracy of prediction results using data after completion is higher than when using original data.

Figure 1: Framework of the paper.

Figure 4: The trend of the various modes of traffic data.(a) Day mode and time mode.(b) Space mode.(c) Week mode.

Figure 5: Tensor model of traffic data.

Figure 6: Low-rank approximation results of traffic data.

Figure 10: Principle of two-dimensional linear interpolation methods.

Figure 12: Relative position of detectors and toll stations."" refers to the driving direction; "|" refers to the toll station.

where $t$ is the start time of the next travel time forecast period; $T$ is the length of the prediction cycle; $V_{t-nT}$ indicates the average speed of vehicles during the period $[t - nT,\ t - (n-1)T]$; $q_{t-nT}$ indicates the traffic volume during the period $[t - nT,\ t - (n-1)T]$.

Figure 13: MAE under different values of K.

Table 1: Traffic information collected by microwave detectors.

Table 2: The relevance of each pattern.

Table 3: The section speed at the virtual node and the toll station.

Table 4: Division of time in one day. Periods: flat in the morning, peak in the morning, trough at noon, peak at noon, flat at night, peak at night, trough at night.

Table 6: Property and encoding of each state component. State components: $V_{t-3T}$, $V_{t-2T}$, $V_{t-T}$; $q_{t-3T}$, $q_{t-2T}$, $q_{t-T}$.

5.3. K Value Calibration. The basic idea of the KNN algorithm is to set a quantity K, search for the K nearest points to the current state in the search area, and use these points to make predictions. All points are used if there are fewer than K points in the search area, and the K nearest points are kept if there are more than K points in the search area.

Table 8: Travel time prediction error.