Unrecorded Accidents Detection on Highways Based on Temporal Data Mining

Automatic detection of traffic accidents, especially those not recorded by traffic police, is crucial to accident black spot identification and traffic safety. A new method of detecting traffic accidents based on temporal data mining is proposed, which can identify accidents that are unknown to and unrecorded by traffic police. A time series model was constructed using ternary numbers to reflect the state of traffic flow based on the cell transmission model. To deal with the aftereffects of linear drift between time series and to reduce the computational cost, the discrete Fourier transform was implemented to turn time series from the time domain to the frequency domain. The pattern of the time series when an accident happened could be recognized using the historical crash data. Then, taking Euclidean distance as the similarity evaluation function, similarity mining of the transformed time series was carried out. If the result was less than the given threshold, the two time series were similar and an accident had probably happened. A numerical example was carried out and the results verified the effectiveness of the proposed method.


Introduction
Road accidents are regarded as one of the leading causes of death for people between the ages of 5 and 44 according to the World Health Organization [1]. Moreover, traffic crashes result in serious economic losses on account of traffic congestion, which in turn leads to a wide variety of adverse consequences such as traffic delays, supply chain interruptions, travel time unreliability, increased noise pollution, and deterioration of air quality [2]. Thus, reducing or avoiding traffic collisions is of great significance to traffic safety. High collision concentration location (HCCL) [3] detection is an effective means of finding accident black spots and taking the necessary continuous improvement measures. Historical accident data are the necessary foundation of any research on this subject. In previous studies, heterogeneity [4] was considered one of the main problems of accident data. Many methods have been proposed to solve this problem, such as latent class clustering [5][6][7][8], Bayesian networks (BNs) [9][10][11], and continuous risk profile (CRP) [12,13]. One commonality among these methods is that historical accident data, more specifically recorded historical accident data, were used as the basic data. However, not all traffic crashes are known to and recorded by traffic police. Minor accidents often happen on highways and are settled privately because the losses are trivial. Usually, traffic accident black spots are identified mainly on the basis of crash data recorded by traffic police departments [14]. This helps to find only the collisions we know of, while those that did happen but remain unknown are left unconsidered. In part, these observations motivated our study.
Traffic accidents are contingent events and are difficult to detect if no alarm is raised. Nonetheless, a traffic accident is bound to affect the traffic flow pattern and cause different levels of congestion [15,16]. Traffic volume, speed, and density are changed by crashes, even minor ones. Thus, capturing the change in these traffic flow parameters is very helpful for detecting traffic accidents, especially unrecorded ones.
Given this, an automatic traffic accident detection method was proposed in this paper. Traffic flow data were used and simulated by the cell transmission model (CTM). According to the different inflows between two cells, a time series model was constructed to reflect the traffic flow state. The time series pattern when accidents happened was established based on historical accident data. To overcome the defect of Euclidean distance, which does not consider linear drift in the time domain, and to reduce the computational cost, the discrete Fourier transform was implemented to turn the time series from the time domain to the frequency domain. Leveraging the strength of temporal data mining at finding time-varying patterns, any time series similar to the given pattern, namely, the unknown and unrecorded accidents, could be found by similarity search. The primary aim and contribution of this paper are to find "the hidden accidents," such as those settled privately, using a temporal data mining method. A case study using real highway traffic data from Harbin, China, was conducted for verification.

Construction of Time Series Reflecting Traffic Flow State.
For highway traffic flow, there was a unique state in each period. One way to describe this was the metaphor of a screen capture of the traffic flow over consecutive periods of time. Every picture reflected the state of the traffic flow at a certain time, and these pictures constituted a sequence over time. This was the rationale for building the time series model to describe the evolution of traffic flow in this paper. Traffic condition estimation was achieved through dynamic traffic assignment (DTA) simulation that utilized temporal aspects of a transportation system. Different values of inflow between cells in CTM were expressed as ternary numbers (0, 1, and 2). A series of ternary numbers, generated by CTM, was introduced to illustrate the traffic flow state. Then, time series data were created by converting ternary numbers to decimal numbers. Thus, the traffic flow state could be reflected by time series data, which formed the basis for mining unrecorded accidents.

Cell Transmission Model.
To model the propagation of traffic flow and construct the time series data in the section below, the spread of highway traffic flow was simulated by CTM in this paper. CTM was proposed by Daganzo [17,18] and was considered a proper method. The relationship between traffic flow ($q$) and density ($k$) was assumed to be of the form
$$q = \min\{vk,\; q_{\max},\; w(k_j - k)\}, \quad 0 \le k \le k_j, \tag{1}$$
where $v$, $q_{\max}$, $w$, and $k_j$ denoted the free-flow speed, the maximum flow (or capacity), the backward wave speed, and the maximum (or jam) density, respectively, as shown in Figure 1.
In Figure 1, if the density $k$ is less than $k_1$, the traffic flow $q$ is equal to $vk$; if it is between $k_1$ and $k_2$, then $q$ reaches its maximum, $q_{\max}$; if it is between $k_2$ and $k_j$, $q$ is equal to $w(k_j - k)$; and $q$ is 0 when the density reaches $k_j$.
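The piecewise relation described for Figure 1 can be written as the minimum of three terms. The following is a minimal sketch of that idea; the function name `ctm_flow` and all parameter values are hypothetical and chosen only for illustration, not taken from the case study.

```python
def ctm_flow(k, v, q_max, w, k_jam):
    """Flow q for density k under a trapezoidal flow-density relation.

    q = min(v * k, q_max, w * (k_jam - k)), valid for 0 <= k <= k_jam.
    """
    if not 0.0 <= k <= k_jam:
        raise ValueError("density must lie in [0, k_jam]")
    return min(v * k, q_max, w * (k_jam - k))


# Hypothetical parameter values (assumptions, not from the paper).
V = 120.0       # free-flow speed, km/h
Q_MAX = 2000.0  # capacity, veh/h
W = 20.0        # backward wave speed, km/h
K_JAM = 150.0   # jam density, veh/km

for density in (5.0, 50.0, 120.0, 150.0):
    print(density, ctm_flow(density, V, Q_MAX, W, K_JAM))
```

With these assumed values the free-flow branch ends at $k_1 = q_{\max}/v$ and the congested branch starts at $k_2 = k_j - q_{\max}/w$.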
Then, the continuous Lighthill-Whitham-Richards (LWR) equations [19,20] for a single highway link were discretized through this method and could be approximated by a set of difference equations. The state of the system was updated over time. Thus, the discontinuous changes of traffic flow could be captured. In CTM, a single road was divided into homogeneous sections (cells), $i$, whose lengths equaled the distance traveled by free-flowing traffic in one clock interval. The state of the system at instant $t$ was then given by the number of vehicles contained in each cell, $n_i(t)$. The following parameters were defined for each cell.

$N_i(t)$ is the maximum number of vehicles that can be present in cell $i$ at time $t$, and $Q_i(t)$ is the maximum number of vehicles that can flow into cell $i$ when the clock advances from $t$ to $t + 1$.
These constants could vary with time (e.g., because of contingent traffic incidents or deliberate traffic control measures), but this dependence could be ignored for simplicity of notation. The first constant, $N_i(t)$, was defined as the product of the cell's length and its jam density, and the second, $Q_i(t)$, was the product of the time interval and the cell's capacity.
If cells were numbered consecutively starting with the upstream end of the road from $i = 1$ to $I$, the recursive relationship of the CTM, as discussed by Daganzo [17,18], could be expressed as
$$n_i(t+1) = n_i(t) + y_i(t) - y_{i+1}(t), \tag{2a}$$
where $y_i(t)$ was the inflow to cell $i$ in the time interval $(t, t+1)$, given by
$$y_i(t) = \min\{n_{i-1}(t),\; Q_i(t),\; \delta[N_i(t) - n_i(t)]\}, \tag{2b}$$
where $\delta = w/v$. Formulas (2a) and (2b) constituted the fundamental equations of CTM. Equation (2a) expressed the status updates of cells over time, while (2b) gave the variations for updating.
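The CTM recursion can be sketched as a single update step: the inflow to each cell is the minimum of the upstream occupancy, the inflow capacity, and the remaining space scaled by $\delta$, and each cell then gains its inflow and loses the inflow of the next cell. The code below is an illustrative sketch only; the function name `ctm_step`, the `demand` argument standing in for a virtual upstream cell, and the unrestricted exit from the last cell are assumptions, not details confirmed by the paper.

```python
def ctm_step(n, N, Q, delta, demand):
    """Advance cell occupancies by one clock tick.

    n[i]   : vehicles currently in cell i (index 0 is the upstream end)
    N[i]   : maximum number of vehicles cell i can hold
    Q[i]   : maximum number of vehicles that can flow into cell i per tick
    delta  : ratio w / v of backward wave speed to free-flow speed
    demand : vehicles waiting to enter the first cell (assumed virtual cell)
    Returns (new occupancies, inflows to each cell).
    """
    cells = len(n)
    inflow = []
    for i in range(cells):
        upstream = demand if i == 0 else n[i - 1]
        inflow.append(min(upstream, Q[i], delta * (N[i] - n[i])))
    # Assumption: vehicles leave the last cell freely, limited only by Q.
    outflow = min(n[-1], Q[-1])
    leaving = inflow[1:] + [outflow]
    new_n = [n[i] + inflow[i] - leaving[i] for i in range(cells)]
    return new_n, inflow


# Three empty cells, two vehicles arriving per tick (hypothetical numbers).
state = [0, 0, 0]
state, y1 = ctm_step(state, N=[10, 10, 10], Q=[3, 3, 3], delta=1.0, demand=2)
state, y2 = ctm_step(state, N=[10, 10, 10], Q=[3, 3, 3], delta=1.0, demand=2)
print(state, y2)
```

A blocked cell (for example, an accident location) could be imitated in this sketch by lowering the corresponding entry of `Q` for the affected ticks.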

Constructing Time Series of Traffic Flow.
Time series data are a sequence of data evolving through time. Temporal data mining has two strengths: it is data-based and pattern-based. The former is more likely to approach the truth; the latter is more likely to extract the features. Every traffic accident was considered to change the traffic flow state to some degree. Features of this change were supposed to be extracted by temporal data mining, and these features could be used to find "the hidden accidents." Gao et al. used the NaSch traffic model to simulate the evolution of traffic flow [21,22]. The state of traffic flow at each time period was regarded as a node of a network. Then, a complex network was constructed, also known as a multiple-mode system model, which could describe the evolution of traffic flow. Zhao et al. studied state estimation for this kind of system [23]. However, the temporal relations among these nodes were ignored in their research. According to CTM, at each time step, the inflow $y_i(t)$ to cell $i$ could be $n_{i-1}(t)$, $Q_i(t)$, or $\delta[N_i(t) - n_i(t)]$. When $y_i(t) = n_{i-1}(t)$, the state of the cell was represented by the number 0; when $y_i(t) = Q_i(t)$, it was represented by the number 1; otherwise, it was represented by the number 2. Then, the state of the system was expressed by a sequence of ternary numbers, for example, {0, 0, 1, 2, 1}.
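The ternary coding rule can be sketched as a small function that reports which of the three candidate terms determines the inflow. This is a sketch under assumptions: the name `inflow_state` is hypothetical, and ties are resolved in the order the text lists the cases (0 first, then 1, otherwise 2).

```python
def inflow_state(n_up, Q_i, N_i, n_i, delta):
    """Return the ternary state (0, 1, or 2) of the inflow to a cell.

    0: inflow equals the upstream occupancy n_up (free flow)
    1: inflow equals the inflow capacity Q_i
    2: inflow is limited by the remaining space delta * (N_i - n_i)
    """
    space = delta * (N_i - n_i)
    y = min(n_up, Q_i, space)
    if y == n_up:
        return 0
    if y == Q_i:
        return 1
    return 2


# Hypothetical cells: light traffic, capacity-limited, and nearly full.
print(inflow_state(2, 3, 10, 0, 1.0))
print(inflow_state(5, 3, 10, 0, 1.0))
print(inflow_state(5, 3, 10, 9, 1.0))
```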
The model employed was described as follows. For $I$ cells, we assumed that, at time period $(t, t + 1)$, the state was represented by a set of ternary numbers, namely, $S_t = \{s_1, s_2, \ldots, s_I\}$. Then, $S_t$ is the precursor of $S_{t+1}$. Here, each ternary number $s_i$ can be considered as an element that can take three different states; that is, $s_i = 0, 1, 2$. Figure 2 depicted the evolution of traffic flow with the tick of a clock.
As shown in Figure 2, when the clock advanced from 0 to 4, the change of state of the system (a single segment) could be observed clearly using the "screen capturing" method. The state $S_t$ was time-varying and was represented by a sequence of numbers. For convenience, a parameter $x$ was introduced in this paper to represent the value of $S_t$, with $x = \sum_{i=1}^{I} 3^{I-i} s_i$; thus, ternary numbers were converted to decimal numbers; for example, {0, 0, 1, 2, 1} was converted to 16. Thus, time series data were created (see Figure 3).
Each single decimal number represented a state of the system. As shown in Figure 3, at time step 9, for instance, the value of the system state was 78; thus, the ternary numbers were {0, 2, 2, 2, 0}, which was the state of traffic flow.
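The two worked examples above (a ternary sequence mapped to 16, and 78 mapped back to a ternary sequence) can be checked with a short conversion sketch. The function names are hypothetical, and the first cell is treated as the most significant digit, which is the reading consistent with both examples.

```python
def ternary_to_decimal(states):
    """Convert a sequence of ternary cell states to one decimal value.

    The first element is treated as the most significant digit.
    """
    value = 0
    for s in states:
        if s not in (0, 1, 2):
            raise ValueError("each state must be 0, 1, or 2")
        value = value * 3 + s
    return value


def decimal_to_ternary(value, n_cells):
    """Recover the n_cells ternary states encoded in a decimal value."""
    digits = []
    for _ in range(n_cells):
        value, digit = divmod(value, 3)
        digits.append(digit)
    if value:
        raise ValueError("value does not fit in n_cells ternary digits")
    return digits[::-1]


print(ternary_to_decimal([0, 0, 1, 2, 1]))  # 16
print(decimal_to_ternary(78, 5))            # [0, 2, 2, 2, 0]
```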

Feature Extraction.
Noise in the raw time series data could reduce the accuracy and credibility of data mining. Linear drift was a typical example. In many clustering analysis methods, for example, k-means, Euclidean distance was frequently used as the similarity measure function. The unrecorded accident detection method proposed in this paper was based mostly on similarity mining, which would be discussed later. Linear drift was the most important factor influencing the accuracy of the results.
In the case of linear drift between two time series $X$ and $Y$ (Figure 4), if Euclidean distance was used and the distance $d$ was beyond the threshold, the two time series were not considered similar by mining algorithms. Since they apparently had a similar shape and trend, this judgment was obviously inaccurate. To prevent such errors, to realize data compression, and to reduce the computational cost, it was necessary to extract features from the original time series data, using the image in feature space to replace the original one.
The discrete Fourier transform (DFT), which had unique merits in time series analysis, was one such way. For a given time series object, DFT could be used to turn it from the time domain to the frequency domain. According to Parseval's theorem, the time-domain energy function was equal to the frequency-domain energy function. Most of the energy in the frequency domain was concentrated in the first few coefficients; hence, the other coefficients could be omitted. Thus, the remaining coefficients could be seen as the features of the original time series. For a given time series $X = \{x_t\}$, $t = 0, 1, \ldots, n - 1$, translating the series from the time domain to the frequency domain, the new sequence obtained by DFT was denoted by $\hat{X} = \{X_f\}$, $f = 0, 1, \ldots, n - 1$, where
$$X_f = \frac{1}{\sqrt{n}} \sum_{t=0}^{n-1} x_t \exp\left(-\frac{j 2\pi f t}{n}\right), \quad j = \sqrt{-1}. \tag{3}$$
Taking the data in Figure 3 as an example, the result of DFT was shown in Figure 5.
The area in which the frequency was greater than 0.0 was the real transformed data of the original finite time series. Seventeen elements were left out of the initial total of 21. Data compression had been realized and, as a result of the transformation from the time domain to the frequency domain, linear drift, as well as other noise in the time domain, no longer existed.
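The feature extraction step can be illustrated with a plain DFT that keeps only the first few coefficients. This is a sketch under stated assumptions rather than the authors' implementation: it uses the unitary $1/\sqrt{n}$ normalization (under which Parseval's equality holds exactly), it drops the zero-frequency coefficient, and it models the drift as a constant vertical offset, which affects only that zero-frequency term. The function names and the sample series are hypothetical.

```python
import cmath
import math


def dft(x):
    """Unitary discrete Fourier transform of a real sequence x."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def dft_features(x, n_coeffs, skip_dc=True):
    """Keep the first n_coeffs coefficients, optionally dropping frequency 0."""
    start = 1 if skip_dc else 0
    return dft(x)[start:start + n_coeffs]


def feature_distance(a, b):
    """Euclidean distance between two complex feature vectors."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))


# Hypothetical state series and a copy shifted by a constant offset.
series = [0, 0, 3, 16, 78, 78, 16, 3, 0, 0]
shifted = [value + 10 for value in series]

time_energy = sum(value * value for value in series)
freq_energy = sum(abs(c) ** 2 for c in dft(series))
print(time_energy, freq_energy)  # equal up to rounding (Parseval)
print(feature_distance(dft_features(series, 4), dft_features(shifted, 4)))
```

In the time domain the two series are far apart in Euclidean distance, whereas the retained frequency-domain features coincide up to rounding error.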

Unrecorded Accidents Detection Using Similarity Mining.
Similarity search is an important research field in temporal data mining. As mentioned above, each recorded accident yielded a piece of time series data, and all recorded historical accident data over a period of time could explain the traffic flow trends when accidents happened. After data preprocessing with the above methods, clustering analysis was implemented in this paper. Owing to the appropriate data preprocessing, the classical k-means method was able to meet the accuracy requirements. The clustering results were considered "normal traffic flow trends" under accidents, and the "hidden accidents" could be found by similarity search.
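The clustering step can be sketched with a small, self-contained k-means routine. This is an illustrative sketch only: the function name `kmeans`, the seeded random initialization, and the toy two-dimensional points are assumptions, and in practice each point would be a real-valued feature vector derived from the transformed time series of one recorded accident.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster real-valued feature vectors with Lloyd's k-means algorithm.

    Returns (labels, centroids). Initial centroids are k distinct points
    drawn with a seeded generator so that runs are repeatable.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy feature vectors forming two well-separated groups (hypothetical data).
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels, toy_centroids = kmeans(toy_points, k=2)
print(toy_labels)
```

Each resulting centroid plays the role of a "normal traffic flow trend" against which candidate series can later be compared.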
The method of similarity measurement used in this paper was Euclidean distance. Let the time sequence of "normal traffic flow trends" of a single road segment be $\{r_i\}$, whose length was $m$. The time series to be measured was denoted by $\{c_i\}$ with length $n$, where $n \ge m$ generally. In similarity search, the subsequences of $\{c_i\}$ whose lengths were $m$ were the objects to be measured. These subsequences were denoted by $\{z_i\}$. The number of such subsequences was $K = n - m + 1$. Then, the similarity metric function was defined as the Euclidean distance between $\{r_i\}$ and each subsequence $\{z_i\}$, in which $\lambda$ was the scaling factor. The number of calculations was therefore $n - m + 1$. The resulting procedure to detect unrecorded accidents for a single road segment was described in Figure 6.
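The subsequence search can be sketched as a sliding window that compares the pattern with every length-$m$ window of the longer series, which gives exactly $n - m + 1$ distance evaluations. This sketch makes simplifying assumptions: the scaling factor is taken as 1, the comparison is done directly on the raw values rather than on transformed features, and the function names and sample data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length real sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def similarity_search(pattern, series, threshold):
    """Return (offset, distance) for every window closer than threshold.

    The pattern has length m and the series has length n >= m, so
    n - m + 1 windows are examined.
    """
    m, n = len(pattern), len(series)
    if n < m:
        raise ValueError("series must be at least as long as the pattern")
    matches = []
    for offset in range(n - m + 1):
        d = euclidean(pattern, series[offset:offset + m])
        if d < threshold:
            matches.append((offset, d))
    return matches


# Hypothetical pattern embedded in a longer series at offset 2.
pattern = [0, 5, 9, 5, 0]
observed = [0, 0, 0, 5, 9, 5, 0, 0]
print(similarity_search(pattern, observed, threshold=1.0))
```

A window whose distance falls below the threshold would be flagged as a probable accident under the method described above.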

The Study Site and Data Preparation.
A case study was conducted based on data extracted from records collected on the Beijing-Harbin Expressway (G1) between Harbin and Lalinhe from January 2010 to July 2011. The traffic accident dataset included a total of 73 crashes recorded on the 72-kilometer road. The study site and the traffic accident data were shown in Figure 7.
In one and a half years, on a highway with annual traffic reaching 8-10 million, "only" 73 crashes happened. Experience suggested that this did not indicate how safe the highway was but rather that some minor accidents were not known or recorded. It was therefore meaningful to detect the unrecorded collisions for traffic safety research.
Owing to the limitations of the detection devices, traffic flow data were derived from data collected by the highway toll collection system. For the detailed calculation process, readers could refer to Weng et al. [24]. The estimation accuracy could meet the technical requirements.

Numerical Experiment.
In the numerical experiment, the time horizon was set to half an hour and the time interval was 30 seconds. The accidents happened at the eleventh clock interval, namely, 5 minutes after the start time of the simulation. Thus, there were 60 points in one piece of time series data. The free-flow speed was 120 km/h; thus, the length of each cell, the product of the free-flow speed and the unit clock interval, was 120 km/h * 30 s = 1 km. Based on the historical accident data, the accident location was set in the middle cell in CTM, and five cells were used to represent a single highway segment where a crash happened, as shown in Figure 8.
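The experiment settings above involve a few unit conversions that can be checked directly. The following is a small arithmetic sketch; the variable names are hypothetical.

```python
# Settings stated for the numerical experiment.
horizon_s = 30 * 60        # time horizon: half an hour, in seconds
tick_s = 30                # clock interval, in seconds
free_flow_kmh = 120.0      # free-flow speed, km/h
accident_tick = 11         # accidents occur at the eleventh clock interval

n_points = horizon_s // tick_s                    # points per time series
cell_length_km = free_flow_kmh * tick_s / 3600.0  # distance covered per tick
accident_start_min = (accident_tick - 1) * tick_s / 60.0

print(n_points, cell_length_km, accident_start_min)  # 60 1.0 5.0
```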
The virtual cell in Figure 8 represented demand generation. There were five inflow states, as shown by the five arrows in the figure. Using the method shown in Figure 6, it was first necessary to identify the "normal traffic flow trends" under accidents. Time series data were constructed using the method mentioned above, and DFT was then implemented for each crash record. k-means cluster analysis was adopted, and the results were shown in Figure 9.
As shown in Figure 9, two clusters were generated by this method. In Figure 9(a), 34 accidents were grouped together. Their common feature was that congestion formed and dissipated gradually within the simulation period. The shapes of these time series were similar to the normal distribution curve. According to the accident records of the traffic police, most of these crashes were single-vehicle crashes and two-car accidents. In Figure 9(b), 33 crashes were involved and the congestion was not eliminated within the study period. Two-thirds of them were crashes between two heavy trucks and one-third were multivehicle accidents. This kind of crash could cause severe congestion and influence traffic seriously. In fact, another cluster existed in this case, involving only five accidents. This cluster interfered less with traffic; according to the records, these were collisions between vehicles and pedestrians. Such collisions are a serious threat to traffic safety but are beyond the scope of this paper, so the result was not shown here.
For the unrecorded accidents, there were usually no casualties and only small losses. They were not recorded because these minor accidents were settled privately and remained unknown even to the traffic police. So, there was every reason to believe that these unrecorded accidents were single-vehicle crashes or two-car accidents. Thus, the cluster shown in Figure 9(a) was our target in the similarity search. For verification, one accident out of the 73 crashes was set aside and treated as unknown.
By similarity search, the held-out accident was found, and its time series data were shown in Figure 10. The traffic flow trend in Figure 10 was indeed similar to the cluster in Figure 9(a). During the first 10 clock intervals in both figures, the traffic was smooth because the traffic flow was unsaturated and no accident had occurred. Then, congestion formed and dissipated gradually over a certain period. The result of the similarity search supported the reliability of the proposed method.

Conclusions
Unrecorded accidents are significant for identifying accident-prone locations. Based on the observation that traffic volume, speed, and density are changed by crashes, even minor ones, an automatic traffic accident identification method was proposed. As most current studies did not pay enough attention to the time factor when studying the relationship between traffic state and crashes on highways, this paper proposed a method to construct time series data using traffic flow data when accidents happened. To avoid the defect of not considering linear drift in the time domain between two sequences, DFT was carried out to extract features from the original time series data. The traffic flow trend could be well understood through clustering analysis. Then, through similarity search, unrecorded accidents, which were believed to be single-vehicle crashes or two-car accidents, were found. The case study using real data from Harbin showed the feasibility of the proposed method.

Figure 1: The relationship of traffic flow and density in CTM.

Figure 2: The traffic flow trends with the tick of a clock.

Figure 3: Time series data reflecting traffic flow trends of a five-cell road.

Figure 4: Linear drift between time series $X$ and $Y$.

Figure 7: The study site and data.

Figure 8: CTM for a single highway segment.