Big Data-Driven Based Real-Time Traffic Flow State Identification and Prediction

With the rapid development of urban informatization, the era of big data is coming. To satisfy the demand for traffic congestion early warning, this paper studies a method of real-time traffic flow state identification and prediction based on big data-driven theory. Traffic big data has several characteristics: temporal correlation, spatial correlation, historical correlation, and multistate. Traffic flow state quantification, the basis of traffic flow state identification, is achieved by a traffic clustering model based on SAGA-FCM (simulated annealing genetic algorithm based fuzzy c-means). Considering both simple calculation and predictive accuracy, a bilevel optimization model for regional traffic flow correlation analysis is established to predict traffic flow parameters based on temporal-spatial-historical correlation. A two-stage model for correction coefficients optimization is put forward to simplify the bilevel optimization model. The first stage model is built to calculate the number of temporal-spatial-historical correlation variables. The second stage model is presented to calculate the basic model formulation of regional traffic flow correlation. A case study based on a real-world road network in Beijing, China, is implemented to test the efficiency and applicability of the proposed modeling and computing methods.


Introduction
Real-time traffic flow state identification and prediction is one of the critical components of an intelligent transportation system. It is of practical significance to identify and predict traffic flow state quickly, precisely, and in a timely manner. As the first problem that needs to be solved, traffic flow state can be measured by level of service (LOS), which was first introduced in the 1965 Highway Capacity Manual (HCM 1965) [1]. The latest version of the Highway Capacity Manual (HCM 2010) [2] divides LOS into six levels. Sasaki and Iida [3] divided LOS into three levels. China GB 50220-1995 [4] divides LOS into four levels. Existing studies of LOS evaluation in the transportation literature can be classified into four categories: subjective evaluation based models [5], statistical analysis based models [6], artificial intelligence based models [7], and traffic flow theory based models [8, 9].
Many previous research works concerning traffic flow state identification, especially for Beijing, have been carried out. Guan and He [10] analyzed the statistical features of speed distribution at different densities and divided traffic flow state into four levels based on the flow-rate-density plane. Liao et al. [11] studied traffic state identification based on a perceptual experiment. Xia et al. [12] built a rapid traffic state identification model based on fuzzy theory. Qu et al. [13] determined the relation between traffic state and travel speed. Moreover, the Beijing Traffic Management Bureau built the Beijing Regional Traffic Conditions and LOS Evaluation System mainly based on fixed detectors [14], and the Beijing Municipal Commission of Transport built the Traffic Performance Index System mainly based on floating car data [15].
Real-time traffic flow prediction aims at evaluating the anticipated traffic flow state at a future time. Many studies have focused on the prediction of traffic flow state and parameters. Liu et al. [16] analyzed multidimensional parameters and developed traffic prediction models of different dimensions based on the support vector machine. Dong et al. [17] proposed a preselection space time model to estimate the traffic condition at locations with poor detector data, especially locations without detectors. Canaud et al. [18] presented a probability hypothesis density filtering based model for real-time traffic flow state prediction. Pan et al. [19] put forward a modified stochastic cell transmission model to support short-term traffic flow state prediction. Antoniou et al. [20] proposed an approach for local traffic flow state estimation and prediction based on data-driven computational approaches. Zhang et al. [21] designed a traffic flow state estimator based on the extended Kalman filter method.
Traffic big data, massive and multisource, brings both opportunities and challenges to effective traffic management and control. During data processing, traffic big data meets the same difficulties as general big data, such as capture, storage, search, sharing, analytics, and visualization. The differences between traffic big data and traditional data in the field of traffic flow state identification and prediction are obvious. For traffic flow state identification, the advantage of traffic big data lies mainly in its full coverage. That is, traffic big data can represent traffic flow characteristics as completely as possible. However, the difficulty of data processing grows with the size of the data. Therefore, big data-driven methods should have a stronger data processing capacity. For traffic flow state prediction, the advantage of traffic big data lies mainly in its multiple sources. That is, traffic big data describes more accurately the relationships between the traffic flow state of a given section at a given time and the others. However, these relationships are not easy to find. So, big data-driven methods should describe the physical significance of traffic flow more clearly. Traffic big data can improve the efficiency of real-time traffic flow state identification and prediction, but it may also face more challenges.
Taking into account all the present research in this field, there is still a lack of consideration of traffic big data. Further research remains to be conducted on traffic flow state analysis. In this paper, a method of real-time traffic flow state identification and prediction that is able to handle big data is put forward in detail. The remainder of this paper is organized as follows. Section 2 presents the basic characteristics of traffic big data. In Section 3, the methodology of real-time traffic flow state identification and prediction is proposed. A case study based on a real-world road network is carried out in Section 4 to demonstrate the performance and applicability of the proposed method. Finally, conclusions are drawn in Section 5.

Traffic Big Data Analysis
where $x_t$ is the traffic flow parameter at time $t$, representing flow $q_t$, speed $v_t$, or occupancy $o_t$, and $f$ is a function that describes the trend of the traffic flow time-series data.

Spatial Correlation.
where $x_{i,t}$ is the traffic flow parameter of section $i$ at time $t$, representing flow $q_{i,t}$, speed $v_{i,t}$, or occupancy $o_{i,t}$, and $g$ defines the relationship between the data of upstream and downstream sections.

Historical Correlation.
For different weekdays or weekends, the trip distribution of inhabitants in the same period shows similar characteristics, and the periodicity of traffic flow is especially evident. The correlation between dynamic traffic flow data and historical data is described by a function $h$, where $x_{d,t}$ is the traffic flow parameter on day $d$ at time $t$, representing flow $q_{d,t}$, speed $v_{d,t}$, or occupancy $o_{d,t}$, and $h$ defines the relationship between the data of different workdays or holidays at the same time.

Multistate.
Traffic flow state can be determined by the traffic flow parameters. The correspondence between traffic flow states and traffic flow parameters is as follows: $\mathrm{LOS} = \mathrm{MS}(v, q, o)$, where $\mathrm{LOS}$ is the level of service; $v$, $q$, and $o$ are the traffic flow parameters (speed, flow, and occupancy); and MS defines the relationship between traffic flow states and traffic flow parameters.

Methodology
3.1. Overall Framework. The overall traffic flow state identification and prediction framework is outlined in Figure 1.
(a) SAGA-FCM Based Traffic Clustering Model. As a random social phenomenon, traffic flow state has fuzzy characteristics. For example, "congested" and "uncongested" are fuzzy descriptions of traffic flow, and it is difficult to describe these states quantitatively. Since a membership function can be used to explain fuzzy phenomena, the fuzzy c-means algorithm (FCM) is used to evaluate traffic flow state [22]. However, FCM is sensitive to its initial clustering centers and easily falls into a local optimum, especially for big data [23]. The SAGA-FCM algorithm integrates the strong local search ability of the simulated annealing algorithm (SA) and the strong global search ability of the genetic algorithm (GA) to overcome the drawbacks of the traditional FCM algorithm [24].
(b) DFT Based Historical Data Fusion Model. Road traffic flow state information is periodic [25]. Each object to be predicted can be associated with a set of historical data. Actual intelligent traffic systems have accumulated massive historical data. However, it is difficult to integrate these historical data. During the ongoing development of the traffic prediction system in Beijing, the discrete Fourier transform (DFT) was found effective in averaging historical data and eliminating noise [26]. Traffic parameters are discrete time series; therefore, DFT is applied to integrate massive historical data [27].
(c) Regional Traffic Flow Correlation Model. Temporal correlation, spatial correlation, and historical correlation are the characteristics of traffic big data. Based on this principle, a regional traffic flow correlation model can be established to predict traffic flow parameters. Taking simple calculation and predictive accuracy into consideration, a bilevel optimization model is proposed to simplify the regional traffic flow correlation model.

(d) Correction Coefficients Optimization Model.
Considering the number of temporal, spatial, and historical variables, each of which corresponds to an unknown parameter, the model is too difficult to calibrate. Therefore, a two-stage model for correction coefficients optimization is put forward to reduce the number of variables while preserving accuracy.

Basic Traffic Clustering Model.
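Based on HCM 2010 [2], LOS can be divided into six types, written as $C = [c_1, c_2, \ldots, c_6]$. Considering the monotonicity between traffic flow parameters and LOS, speed ($v$) and occupancy ($o$) are chosen as clustering parameters. Therefore, the data sample used for the traffic clustering model can be written as $X = \{x_1, x_2, \ldots, x_n\}$, $x_j = (v_j, o_j)$. Since the traditional FCM algorithm seeks an optimal classification, the objective function, in the standard FCM form, is expressed as

$$J = \sum_{i=1}^{6} \sum_{j=1}^{n} u_{ij}^{m} \, \lVert x_j - c_i \rVert^{2}, \quad (5)$$

where $u_{ij}$ is the degree of membership with which the sample $x_j$ belongs to LOS $c_i$, subject to $\sum_{i=1}^{6} u_{ij} = 1$, and $m > 1$ is the fuzzy weighting exponent. The clustering centers $c_i$ can be calculated by the following formulation:

$$c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} \, x_j}{\sum_{j=1}^{n} u_{ij}^{m}}. \quad (6)$$

The degrees of membership $u_{ij}$ can be calculated by the following formulation:

$$u_{ij} = \left[ \sum_{k=1}^{6} \left( \frac{\lVert x_j - c_i \rVert}{\lVert x_j - c_k \rVert} \right)^{2/(m-1)} \right]^{-1}. \quad (7)$$

Modification of the clustering centers and degrees of membership is the core of FCM. When the algorithm converges, the clustering centers $c_i$ can be found and the degrees of membership $u_{ij}$ can be calculated. Therefore, the fuzzy clustering result can be achieved.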
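The alternating update of memberships and centers described above can be illustrated with a minimal Python sketch of standard FCM on (speed, occupancy) samples. It is not the implementation used in this paper: the fuzzy exponent `m = 2`, the stopping tolerance, the helper names, and the toy data with two clusters are all assumptions made for the example.

```python
import math

def memberships(samples, centers, m=2.0):
    """Degree of membership of every sample in every cluster (standard FCM rule)."""
    power = 2.0 / (m - 1.0)
    table = []
    for x in samples:
        dists = [math.dist(x, c) for c in centers]
        if 0.0 in dists:
            # Sample coincides with a center: give it full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            table.append([h / total for h in hits])
        else:
            table.append([1.0 / sum((di / dk) ** power for dk in dists) for di in dists])
    return table

def update_centers(samples, table, m=2.0):
    """Membership-weighted mean of the samples for each cluster."""
    dims = len(samples[0])
    centers = []
    for i in range(len(table[0])):
        weights = [row[i] ** m for row in table]
        total = sum(weights)
        centers.append(tuple(
            sum(w * x[k] for w, x in zip(weights, samples)) / total
            for k in range(dims)
        ))
    return centers

def fcm(samples, centers, m=2.0, tol=1e-6, max_iter=200):
    """Iterate the two updates until the centers stop moving."""
    for _ in range(max_iter):
        new_centers = update_centers(samples, memberships(samples, centers, m), m)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(samples, centers, m)

# Toy (speed km/h, occupancy %) samples; the values are made up for illustration.
toy_samples = [(60, 5), (62, 6), (58, 4), (20, 40), (22, 42), (18, 38)]
toy_centers, toy_u = fcm(toy_samples, [(50, 10), (30, 30)])
```

The result depends on the initial centers passed to `fcm`, which is the sensitivity that motivates replacing the plain iteration with a SAGA-based search.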

SAGA-FCM Based Solution Algorithm.
The SAGA-FCM algorithm is used to improve the clustering quality of FCM. The process of the SAGA-FCM algorithm can be summarized in the following steps [28].
Step 5 (fitness value judgment). If $f_i' > f_i$, replace the old individual with the new individual; otherwise, replace the old individual with the new individual with probability $P$, which is determined by the fitness difference $f_i - f_i'$ and the current temperature.
Step 7 (temperature judgment). If $T_i < T_{\mathrm{end}}$, the new solution is output as the optimal solution; otherwise, the temperature is lowered to $T_{i+1}$ according to the cooling schedule, and the algorithm returns to Step 3.
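The acceptance and cooling logic of Steps 5 and 7 can be sketched as follows. Because the exact probability expression is not legible in the text above, the sketch substitutes the common Metropolis rule for a maximized fitness, $P = \exp((f' - f)/T)$, and a geometric cooling factor `k`; both are assumptions, as are the function names.

```python
import math
import random

def accept(f_old, f_new, temperature, rng):
    """Metropolis-style acceptance for a fitness that is being maximized."""
    if f_new > f_old:
        return True
    # Worse (or equal) candidates are accepted with probability
    # exp((f_new - f_old) / T), which is at most 1.
    return rng.random() < math.exp((f_new - f_old) / temperature)

def cooling_schedule(t_start, t_end, k):
    """Yield temperatures T, k*T, k^2*T, ... while they are not below t_end."""
    temperature = t_start
    while temperature >= t_end:
        yield temperature
        temperature *= k

rng = random.Random(0)
temps = list(cooling_schedule(100.0, 10.0, 0.5))
```

At high temperature almost any candidate is accepted, which keeps the search global; as the temperature falls, only improvements survive.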

Calculation of LOS.
Based on the SAGA-FCM algorithm, six clustering centers ($c_i$) can be found. Thus, $u_{ij}$ is calculated by formulation (7). By comparing the numerical magnitudes of $u_{ij}$, the LOS of $x_j$, denoted $\mathrm{LOS}(x_j)$, is obtained.
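One plausible reading of this step is that a sample receives the LOS of the cluster in which its membership is largest. The sketch below follows that reading; the maximum-membership rule, the ordering of levels A to F by descending center speed, and the center values are assumptions for illustration, not the values of Table 1.

```python
import math

LEVELS = "ABCDEF"

def los_of(sample, centers, m=2.0):
    """Return the LOS letter whose cluster has the largest membership for `sample`."""
    # Assumption: order the centers from highest to lowest speed, A = freest flow.
    ordered = sorted(centers, key=lambda c: c[0], reverse=True)
    dists = [math.dist(sample, c) for c in ordered]
    if 0.0 in dists:
        return LEVELS[dists.index(0.0)]
    power = 2.0 / (m - 1.0)
    u = [1.0 / sum((di / dk) ** power for dk in dists) for di in dists]
    return LEVELS[u.index(max(u))]

# Hypothetical (speed km/h, occupancy %) centers.
example_centers = [(75, 5), (62, 10), (50, 16), (38, 24), (25, 35), (12, 55)]
```

For instance, `los_of((60, 11), example_centers)` falls closest to the second center and is labeled B.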
The corresponding inverse Fourier transform formulation, written in the standard form for a sequence of length $N$, is

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) \, e^{\,j 2\pi nk/N}, \quad n = 0, 1, \ldots, N-1.$$

Normally, the number of transform points is taken equal to the length of the sequence. To facilitate programming, the complex exponential form above is converted to trigonometric form by Euler's formula:

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) \left[ \cos\frac{2\pi nk}{N} + j \sin\frac{2\pi nk}{N} \right].$$

Suppose that the traffic flow varies in a one-week cycle, which includes 7 days with different characteristics. Each day contains 288 pieces of data with a time interval of 5 minutes, so one cycle contains $7 \times 288 = 2016$ data points. DFT is applied to average massive historical data and to eliminate noise, so as to achieve the integration of historical data. For example, the process of DFT of traffic volume ($q$) can be summarized in the following steps.
Step 3. Filter the Fourier coefficient series by removing all coefficients whose amplitude is less than the specified threshold, and record the remaining coefficients. The value of the threshold needs to be chosen by analyzing the historical data.
Step 4. Restore the data at any point in the sequence from the recorded coefficients.
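The filtering and restoring steps can be sketched in plain Python as follows. The sketch assumes the standard DFT definition, applies the threshold to the normalized amplitude $|X(k)|/N$, and uses a short synthetic series; the function names, the threshold value, and the data are illustrative assumptions.

```python
import cmath
import math

def dft(x):
    """Forward DFT: X[k] = sum_n x[n] * exp(-2j*pi*n*k/N)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * n * k / n_pts) for n in range(n_pts))
            for k in range(n_pts)]

def idft_real(coeffs):
    """Inverse DFT, keeping the real part (the input series is real)."""
    n_pts = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * math.pi * n * k / n_pts)
                for k in range(n_pts)).real / n_pts
            for n in range(n_pts)]

def dft_smooth(x, threshold):
    """Zero every coefficient whose normalized amplitude |X[k]|/N is below threshold."""
    n_pts = len(x)
    kept = [c if abs(c) / n_pts >= threshold else 0j for c in dft(x)]
    return idft_real(kept)

# Synthetic series: a slow daily-like trend plus a small high-frequency ripple.
N = 16
trend = [10 + 4 * math.cos(2 * math.pi * n / N) for n in range(N)]
noisy = [trend[n] + 0.1 * math.cos(2 * math.pi * 5 * n / N) for n in range(N)]
smoothed = dft_smooth(noisy, threshold=0.5)
```

The direct summation costs $O(N^2)$ operations, which is acceptable for a sketch; for a full week of 2016 samples per detector a library FFT would be the practical choice.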

Basic Model Formulation.
The basic assumption of the regional traffic flow correlation model is that traffic flow has a strong temporal-spatial-historical correlation, namely: (i) in the temporal series, the current traffic flow can be regarded as the continuation of the traffic flow at the previous moment; (ii) in the spatial series, the traffic flow of downstream sections can be seen as a continuation of the upstream traffic flow; (iii) in the historical series, traffic demand characteristics determine that traffic flow characteristics of the same day type in the same period are similar.
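The basic form of the regional traffic flow correlation model, formulation (16), combines the temporal, spatial, and historical estimates of the traffic flow parameter. In the temporal estimate, $x_i(t-\tau)$ is the traffic flow parameter of section $i$ at time $t-\tau$, $\tau$ is the time delay with $\tau \ge 0$, and $a_1, a_2, \ldots$ are regression coefficients.
According to formulation (2), $g$ is generally achieved by a neighbor regression model. Thus, the spatial estimate $x_i^{S}(t)$ is created from the terms $x_j(t-\tau)$, where $x_j(t-\tau)$ is the traffic flow parameter of section $j$ at time $t-\tau$ and $b_1, b_2, \ldots$ are regression coefficients.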

Effectiveness Proof of Proposed Formulation.
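The general form of formula (16) is

$$\hat{X} = \sum_{i=1}^{n} w_i x_i, \quad \sum_{i=1}^{n} w_i = 1,$$

where $\hat{X}$ is the state of the physical system measured by multiple sensors, $x_i$ is the measured value of the $i$th sensor (the measurements are mutually independent), and $w_i$ is a weighting factor.
The total mean square error $\sigma^2$ is a multivariate quadratic function of $w_i$:

$$\sigma^2 = \sum_{i=1}^{n} w_i^2 \sigma_i^2,$$

where $\sigma_i^2$ is the mean square error of $x_i$.
According to the extreme value theorem of multivariate quadratic functions, the weighting factors that minimize $\sigma^2$, and the resulting minimum, can be calculated as follows:

$$w_i = \frac{1/\sigma_i^2}{\sum_{k=1}^{n} 1/\sigma_k^2}, \quad \sigma^2 = \frac{1}{\sum_{k=1}^{n} 1/\sigma_k^2}.$$

Assume that the detection accuracy varies among sensors, and denote the mean square errors of the least accurate and the most accurate sensors by $\sigma^2_{\max}$ and $\sigma^2_{\min}$, respectively. Then, since $\sum_{k} 1/\sigma_k^2 > 1/\sigma^2_{\min}$ whenever more than one sensor is used, $\sigma^2 < \sigma^2_{\min}$. The formula above shows that the measurement accuracy of the system can be improved by multisensor measurement and that the participation of an inaccurate sensor also helps improve the accuracy of the system.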
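The effect of the weighting can be checked numerically with a small sketch. It assumes the usual inverse-variance solution of the constrained minimization, independent unbiased measurements, and made-up variance values; the function names are illustrative.

```python
def optimal_weights(variances):
    """Inverse-variance weights that minimize sum(w_i^2 * var_i) subject to sum(w_i) = 1."""
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [inv / total for inv in inverse]

def fused_variance(variances, weights):
    """Mean square error of the weighted estimate for independent measurements."""
    return sum(w * w * v for w, v in zip(weights, variances))

# Made-up values: one accurate sensor and one inaccurate sensor.
sensor_variances = [1.0, 4.0]
weights = optimal_weights(sensor_variances)
fused = fused_variance(sensor_variances, weights)
```

With these numbers the weights are 0.8 and 0.2 and the fused mean square error is 0.8, lower than the 1.0 of the accurate sensor alone, which is the sense in which an inaccurate sensor still helps.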

Bilevel Optimization Model.
For big data-driven analysis, the speed and accuracy of data processing are both important. The number of unknown parameters in formulation (16) is large and uncontrolled. Therefore, taking simple calculation and predictive accuracy into consideration, a bilevel optimization model is proposed.
The upper level model is formulated to optimize the speed of data processing by minimizing the number of temporal-spatial-historical correlation variables, $n$; that is, the upper level formulation is $\min n$. The lower level model is articulated in accordance with predictive accuracy: the lower level formulation minimizes the prediction error, that is, the deviation between the estimate $x_i(t)$ and $v_i(t)$, where $v_i(t)$ is the actual value of $x_i(t)$.
In the bilevel optimization model, the purpose of the objective functions is to achieve the best combination of fast calculation speed and high prediction accuracy.

Correction Coefficients Optimization Model.
The more temporal-spatial-historical correlation variables there are, the slower the program runs and the higher the accuracy is. The bilevel optimization model above is difficult to solve. Alternatively, a threshold of computing speed can be defined, from which the maximum acceptable number of temporal-spatial-historical correlation variables is derived. In addition, a new variable that can be calculated directly is put forward to replace each unknown coefficient. Since this substitution introduces some errors, which are likely to be systematic, a linear correction method is put forward. Therefore, the correction coefficients optimization model is a two-stage model.

The First Stage Model.
The first stage model is mainly used to determine the number of unknown parameters $n$. It can be determined through the following steps.
Step 1 (determination of the range of $n$). The number of unknown parameters should not exceed a threshold value. After several data tests, it is found that when $n$ is less than 4, the correlation coefficients remain relatively high and differ little from one another; when $n$ is more than 8, the correlation coefficients decay rapidly to relatively low values. Therefore, $n$ ranges over $[4, 8]$.
Step 2 (calculation of the correlation coefficients). Calculate the correlation coefficients $\rho_{ij}(\tau)$ between the studied section $i$ and the related section $j$ with the correlation coefficient formula, find the maximum correlation coefficient $\rho_j$ corresponding to each section $j$, and determine the corresponding time delay $\tau$; namely, $\rho_j = \max_{\tau} \rho_{ij}(\tau)$.
Step 3 (selection of the relative sections). Sort $\rho_j$ in descending order, and take the first $n$ sections as the range of temporal and spatial correlation.
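The search over time delays and the selection of the most correlated sections can be sketched as follows. The sketch assumes the Pearson correlation coefficient, a maximum delay of six intervals, and synthetic series; these choices and all names are assumptions for illustration.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def best_delay(target, related, max_delay):
    """Return (max correlation, delay) comparing target[t] with related[t - delay]."""
    best = (-1.0, 0)
    for delay in range(max_delay + 1):
        rho = pearson(target[delay:], related[:len(related) - delay])
        if rho > best[0]:
            best = (rho, delay)
    return best

def select_sections(target, candidates, n, max_delay=6):
    """Keep the n candidate series with the highest maximum correlation."""
    scored = [(name,) + best_delay(target, series, max_delay)
              for name, series in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]

rng = random.Random(1)
upstream = [rng.random() for _ in range(200)]
unrelated = [rng.random() for _ in range(200)]
target = [0.0, 0.0] + upstream[:-2]     # target repeats upstream two steps later
ranked = select_sections(target, {"upstream": upstream, "unrelated": unrelated}, n=1)
```

In this synthetic case the upstream series is recovered with a delay of two intervals, mirroring how a downstream section lags its upstream neighbor.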

The Second Stage Model.
The second stage model is mainly used to reduce the parameters and determine the base model. Through the first stage model, the number of unknown parameters is reduced, but the model is still difficult to solve and can be further simplified.
Step 1 (calculation of the normalization factor). Take the normalized $\rho_j$ as the parameters of $x_j(t-\tau)$.
Step 2 (solving the system error). Introduce two correction parameters to calibrate the model, in order to reduce the system error caused by parameter substitution.
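A minimal sketch of the two steps is given below. It assumes that normalization means scaling the correlation coefficients to sum to one and that the linear correction has the form actual = slope × estimate + intercept fitted by ordinary least squares; the names `slope` and `intercept`, the coefficient values, and the observations are hypothetical.

```python
def normalize(rhos):
    """Scale the correlation coefficients so that they sum to one."""
    total = sum(rhos)
    return [r / total for r in rhos]

def combine(weights, delayed_values):
    """Weighted sum of the delayed observations of the selected sections."""
    return sum(w * x for w, x in zip(weights, delayed_values))

def fit_linear_correction(estimates, actuals):
    """Ordinary least squares for actual = slope * estimate + intercept."""
    n = len(estimates)
    mean_e, mean_a = sum(estimates) / n, sum(actuals) / n
    sxx = sum((e - mean_e) ** 2 for e in estimates)
    sxy = sum((e - mean_e) * (a - mean_a) for e, a in zip(estimates, actuals))
    slope = sxy / sxx
    return slope, mean_a - slope * mean_e

# Made-up correlation coefficients and delayed speed observations.
weights = normalize([0.9, 0.6, 0.5])
observations = [(40, 42, 38), (50, 55, 45), (30, 28, 33), (60, 58, 62)]
raw = [combine(weights, obs) for obs in observations]
actual = [1.1 * r - 2.0 for r in raw]   # synthetic systematic error
slope, intercept = fit_linear_correction(raw, actual)
```

Only two parameters are fitted regardless of how many sections are selected, which is what keeps the calibration tractable.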

Case Study
4.1. Data Characteristics. A section of the Second Ring Road (Section 1, as shown in Figure 2) and its surrounding roads in Beijing, China, are taken as the object of study to verify the effectiveness and feasibility of the proposed method. Basic traffic flow data are collected by microwave detectors. As shown in Figure 2, Section 1, Section 2, Section 3, and Section 9 are all on the Second Ring Road. Section 4 and Section 5 are Expressway Tie Line. Section 6, Section 7, Section 8, and Section 10 are Secondary Road. Sections 2, 3, ..., 10 are the temporal-spatial correlated sections of Section 1.
Traffic flow state identification needs enough data to divide LOS. The historical data of Section 1 (one week, at five-minute intervals) are used to build the SAGA-FCM based traffic clustering model. The amount of data is large enough to represent nearly all the traffic flow characteristics. Traffic flow prediction needs enough data to describe the temporal-spatial-historical correlation characteristics, that is, to build the correction coefficients optimization model. The historical data of Section 1 (one week, at five-minute intervals) are used for temporal-spatial correlation analysis. The historical data of Sections 1, 2, ..., 10 (one week, at five-minute intervals) are used for temporal-spatial-historical correlation analysis. The historical data of Section 1 (one month, at five-minute intervals) are used for historical correlation analysis.
(1) Parameter Initialization. Choose 9 of the upstream and downstream sections of Section 1 based on their attributes and transport network characteristics. Let $n \in [4, 8]$.

Traffic Flow State Identification
(2) Search of Temporal-Spatial-Historical Correlative Sections.
Calculate the correlation coefficients between Section 1 and its relative sections, its time series data, and its historical trend data. The top 6 of the calculation results are shown in Figure 5, where the abscissa is the time delay $\tau$ and the vertical axis is the value of the correlation coefficient $\rho$.
Compare the correlation coefficients, decide the value of the corresponding $\tau$, and then obtain the maximum correlation coefficient for each section. According to the distribution characteristics of these maxima, let $n = 4$. The search results of temporal-spatial-historical correlative sections are Section 9, Section 1 (time series data), Section 1 (historical trend data), and Section 2.
Results of temporal-spatial-historical correlation coefficients are shown in Table 2.
(3) Calculation of Normalization Parameter. Calculate the normalization parameters, as shown in Table 2.
(4) Solving Correction Parameters. Apply regression analysis to solve the two correction parameters, where $v_i(t)$ is the velocity of Section $i$ at time $t$ and $v_1^{h}(t)$ is the historical relative data of Section 1.
Apply the same method to analyze flow and occupancy, where $q_i(t)$ is the flow of Section $i$ at time $t$, $q_1^{h}(t)$ is the historical relative data of Section 1, $o_i(t)$ is the occupancy of Section $i$ at time $t$, and $o_1^{h}(t)$ is the historical relative data of Section 1.
The fitting curves are shown in Figure 6.

Online Application.
Online calculations mainly deal with real-time data based on prior knowledge. Figure 7 shows the predicted results. The prediction errors of speed, flow, and occupancy are controlled at around 10%. Traffic state deviation is shown in Table 3. Two-level and three-level deviations are considered errors, and they are kept within 10%.

Conclusions
In the era of big data, real-time traffic flow state identification and prediction may face many challenges. This difficulty stems from the characteristics of big data. This paper uses speed and occupancy to build the traffic flow state clustering model. Traffic flow big data strongly shows temporal, spatial, and historical correlations. The regional traffic flow correlation model is established for real-time traffic flow prediction. The characteristics of big data make it difficult to solve the model. In order to reduce the number of parameters while ensuring calculation speed and accuracy, the correction coefficients optimization model, which can be divided into two stages, is put forward. The effectiveness of the method has been validated by the case study.
The core of this paper is to present a traffic temporal-spatial-historical correlation model, which comprehensively considers the temporal, spatial, and historical correlations of traffic flow big data. Compared with a model based on a single type of correlation, the accuracy of the proposed model is relatively high. Besides, from the perspective of spatiotemporal correlation analysis, this model quantitatively solves the weight selection problem of different analytical methods in the traditional combined model. The traffic temporal-spatial-historical correlation model can be applied in other research, such as identification and correction of abnormal data and traffic congestion mechanism analysis.

2.1. Temporal Correlation. By deploying fixed and mobile traffic flow detectors, dynamic traffic data of a section can be obtained. Dynamic traffic flow data $x$ are time-series data, which change continuously over time $t$ with a certain trend; namely, $x_t = f(x_{t-1}, x_{t-2}, \ldots)$.

Figure 1: Overall traffic flow state identification and prediction framework.

Figure 2: Spatial location of research object.

Figure 3: Traffic flow state evaluation result.

4.2.1. Offline Training. The main task of offline training is to establish the evaluation method of traffic flow state. The case study of traffic flow state identification is carried out with two clustering methods (FCM and SAGA-FCM). The value of the objective function calculated by SAGA-FCM is better than the one calculated by FCM. Different traffic flow states are represented by different colors. As shown in Figure 3, traffic flow state is divided into six levels, A to F. Clustering centers are shown in Table 1.

4.2.2. Online Application. The main task of online application is to realize traffic flow state identification. The traffic flow state identification result is shown in Figure 4.
Spatial Correlation. A regional transportation network consists of multiple intersections and roads. There exists a spatial association between the traffic flow data of neighboring junctions or sections and that of the target junction or section; namely, $x_{i,t} = g(x_{1,t}, x_{2,t}, \ldots)$.
$x_i(t)$ is the traffic flow parameter of section $i$ at time $t$, representing flow $q_i(t)$, speed $v_i(t)$, or occupancy $o_i(t)$. $x_i^{T}(t)$, $x_i^{S}(t)$, and $x_i^{H}(t)$ are estimated values of $x_i(t)$: $x_i^{T}(t)$ is calculated by temporal correlation analysis, $x_i^{S}(t)$ by spatial correlation analysis, and $x_i^{H}(t)$ by historical correlation analysis. Each of these three variables is weighted by its own coefficient. According to formulation (1), $f$ is generally achieved by a regression analysis model. Thus, $x_i^{T}(t)$ is created by the following equation:

Table 3: Comparative analysis.
Real-time traffic flow state identification is based on speed and occupancy. Traditional fuzzy c-means clustering does not meet the requirements of big data analysis. To improve the feasibility of traffic flow state clustering, this paper uses the simulated annealing genetic algorithm based fuzzy c-means (SAGA-FCM). The case study shows that the value of the objective function calculated by SAGA-FCM is better.