A New Video-Based Crash Detection Method: Balancing Speed and Accuracy Using a Feature Fusion Deep Learning Framework



Introduction
Traffic crashes can cause property damage, injuries, death, and nonrecurrent congestion. Accurate and fast crash detection can improve the response speed of incident management, which in turn reduces the injuries, fatalities, and congestion induced by crashes. Thus, developing such crash detection methods is necessary and important for traffic incident management.
Traditional crash/incident detection methods mostly rely on traffic flow modeling techniques [1][2][3][4][5][6][7]. The basic idea of traffic flow modeling is to identify nonrecurrent congestion based on data from loop detectors, microwave detectors, and probe vehicles. However, nonrecurrent and recurrent congestion can be difficult to differentiate without sufficient and sound historical data. Thus, the performance of the traffic flow modeling approach depends heavily on the quality of the data obtained from traffic detectors. Moreover, it can often fail when the traffic environment is too complex (e.g., multimodal traffic in urban areas). Thus, the detection accuracy of such methods is sometimes not guaranteed. Another emerging method is to identify incidents based on crowdsourcing data [8]. However, such methods can suffer from underreporting when there is no witness around the incident scene. Nowadays, with the development of intelligent transportation systems (ITS), video cameras have been widely installed in many cities and on many highways. Thanks to their wide coverage, vision-based crash detection techniques have gained increasing research attention in recent years [9]. Their basic concept is to automatically identify crash scenes based on the features of traffic images/videos through computer-vision techniques. Such techniques, as a promising intelligent crash detection method, are expected to significantly reduce human labor and have achieved relatively high detection accuracy [10][11][12].
To ensure detection accuracy, a video-based crash detection method needs to be capable of extracting important crash features from traffic images/videos. In general, there are two main types of features of interest: motion (temporal) features and appearance (spatial) features. Appearance features include apparent vehicle damage, vehicle rollovers, and fallen pedestrians. Motion features, which include the intersection of vehicle trajectories and the gathering of pedestrians, need to be identified continuously over time. From this perspective, current crash detection methods can be classified into two groups: motion feature-based methods and feature fusion-based methods.
Many research works are based on motion features, such as the intersection of vehicle trajectories, the overlap of bounding boxes, and the speed change of vehicles. Some used background subtraction methods to extract vehicles' motion features (acceleration, direction, and velocity), to which certain rules and thresholds were applied to identify crashes [9][13][14][15]. Maalou et al. [16] tracked vehicles' motion with optical flow methods and used heuristic methods to find a threshold for crash identification. Sadeky et al. [17] used the Histogram of Flow Gradient (HFG) as the motion feature and discriminated crashes from noncrashes using logistic regression. Chen et al. [18] developed an Extreme Learning Machine (ELM) for crash identification based on motion features represented by Scale-Invariant Feature Transform (SIFT) and optical flow. In recent years, with the development of deep learning methods (e.g., Faster R-CNN (Faster Region-based CNN) [19] and YOLO (You Only Look Once) [20][21][22]), the performance of vehicle detection and tracking has improved significantly. Vicente and Elian [23] used a YOLO model to detect motion features and a support vector machine (SVM) for crash identification. Lee and Shin [24] used Faster R-CNN for vehicle detection and Simple Online and Real-Time Tracking (SORT) for vehicle tracking. Based on those motion features, incidents and crashes in tunnels were detected. Paul [25] applied Mask R-CNN (Mask Region-based CNN) for motion feature extraction and used rules for crash detection. Motion feature-based models depend only on vehicle motions. This requires high precision in object detection and tracking. When the traffic environment is complicated, vehicle detection and tracking performance can degrade, resulting in low crash detection performance. Furthermore, some crashes, such as vehicle rollovers and fallen pedestrians, may not be detected based on motion alone.
Recently, feature fusion-based (i.e., appearance and motion) crash detection methods have become increasingly popular. There are two types. One is based on unsupervised learning methods. For instance, Singh and Mohan [26] and Yao [27] developed crash detection models based on autoencoder methods. The other type is based on a supervised learning framework, which normally combines a module (e.g., a convolutional neural network) for spatial feature extraction with a module (e.g., a recurrent neural network) for temporal feature extraction. Batanina et al. [28] used a Convolutional 3D (C3D) model to capture both spatial and temporal crash features from simulated crash videos. Then, domain adaptation (DA) transfer learning was applied to real-world conditions, and the accuracy was improved by 10%. Huang et al. [29] employed a two-stream network to separately extract appearance features and motion features, which were then combined to detect crashes. According to the previous literature, the performance of crash detection can be improved by feature fusion methods.
Although feature fusion-based methods have achieved better performance than motion feature-based methods, some improvements can still be made. To simultaneously capture both motion and appearance features, such models often have complicated structures and a large number of parameters. As such, those models require substantial computing resources and long computation times, which prevent them from being used in a real-time traffic environment. Thus, current fusion-based models need to find a better balance between detection accuracy and speed.
To fill this gap, we proposed a new feature fusion-based urban traffic crash detection framework that aims to achieve a good balance between detection accuracy and speed. First, we introduced an attention module into residual neural networks to improve the detection of local appearance features. Meanwhile, we linked ResNet with a Conv-LSTM model to simultaneously capture the appearance and motion features of crashes. The proposed model is expected to achieve high accuracy as well as fast detection speed for crash detection.
The remainder of the paper is organized as follows: Section 2 introduces the methods used in this study. Section 3 discusses data preparation. Section 4 presents the modeling results and discusses the research findings. Section 5 provides the research conclusions and future directions.

Methodology
In this section, we introduce our proposed model in detail. Figure 1 shows the overall framework of our model. First, the attention module was combined with ResNet to capture the appearance features of the crash images. ResNet can improve the speed of a conventional convolutional neural network, while the attention module enables the model to focus on localized appearance features instead of irrelevant information, which further boosts the model. Then, the output feature map is reduced in dimension via a 1 × 1 convolutional layer and is fed in chronological order into the Conv-LSTM network to further extract the motion features of crashes. Conv-LSTM has an advantage over a conventional recurrent neural network (e.g., LSTM) in that it is lighter and retains spatial information. Finally, a global pooling layer and a fully connected layer were used to detect a crash (or noncrash).
The following is a detailed description of the residual network (ResNet), the attention module, and the Conv-LSTM module in the framework.

Residual Neural Network (ResNet).
The residual neural network (i.e., ResNet) was proposed in 2015 [30]. In the residual module equation, $W_i$ is the 3 × 3 convolution operation and $i$ is the layer index.
The second type (Figure 2(b)) often appears in deeper residual networks (ResNet-50/101/152). Each residual module includes three convolution layers (1 × 1, 3 × 3, and 1 × 1), and its output is the sum of the input (i.e., the output from the previous residual module) and its convolution. Here, $X_i$ is the 1 × 1 convolution operation.
The selection of a ResNet variant depends on the computational capability and the amount of training data. A deeper network could be more powerful given adequate training data.
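A residual module adds its input to the output of its convolutions before the final activation. The following is a minimal sketch of the two-convolution variant, not the authors' implementation. It assumes a single-channel feature map, zero padding, no batch normalization, and a ReLU between the two convolutions (the usual ResNet layout); the names `conv3x3` and `basic_residual_block` are hypothetical.

```python
def relu(fm):
    """Element-wise ReLU on a 2-D feature map (list of rows)."""
    return [[max(0.0, v) for v in row] for row in fm]


def conv3x3(fm, kernel):
    """3 x 3 convolution with zero padding, so the output keeps the input size."""
    h, w = len(fm), len(fm[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr + 1][dc + 1] * fm[rr][cc]
            out[r][c] = acc
    return out


def basic_residual_block(x, w1, w2):
    """Two 3 x 3 convolutions plus the skip connection, followed by ReLU."""
    branch = conv3x3(relu(conv3x3(x, w1)), w2)
    summed = [[a + b for a, b in zip(rx, rb)] for rx, rb in zip(x, branch)]
    return relu(summed)


# Toy kernels (assumptions for illustration only).
IDENTITY = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
ZERO = [[0.0] * 3 for _ in range(3)]

x = [[1.0, 2.0], [3.0, 4.0]]
print(basic_residual_block(x, ZERO, ZERO))          # only the skip path contributes
print(basic_residual_block(x, IDENTITY, IDENTITY))  # input plus its convolution
```

With zero kernels the block passes its input through unchanged, which illustrates why the skip connection eases gradient flow in deep stacks.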

Visual Attention Module.
In this paper, we further extend ResNet by integrating visual attention modules.
The squeeze-and-excitation (SE) Block is a visual attention module first proposed by Hu et al. [31]. This module has been widely used because it is relatively simple and can improve the efficiency of many convolutional network models. The SE Block belongs to the channel attention mechanism, which assigns different weights to different channels of a feature map. In convolutional neural networks, different channels correspond to different extracted features, and different classification tasks should emphasize different features.
The concept is similar to the way human beings identify objects. For example, people may pay more attention to shape features when distinguishing cats from dogs, while they may focus on texture features when distinguishing jaguars from leopards (both belonging to Felidae). Thus, the SE Block improves the feature selection ability of convolutional neural networks.
As shown in Figure 3, the SE Block converts the input $X \in \mathbb{R}^{H \times W \times C}$ to $U \in \mathbb{R}^{1 \times 1 \times C}$ through a global average pooling operation $F_{sq}$, as shown in the following equation:
$$U = F_{sq}(X). \tag{3}$$
After the global average pooling, the output $U$ is passed through a fully connected layer with a weight of $W \in \mathbb{R}^{C \times C}$, that is, $F_{ex}(\cdot, W)$ in Figure 3, and the result $V$ is as shown in equation (5), where "$*$" refers to matrix multiplication.
An activation function is applied in equation (5), and the result $V$ is also called the attention weight. Finally, the input $X$ is multiplied channel-wise by the attention weight $V$ to adjust the importance of the different channels of the input (equation (6)).
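The squeeze, excitation, and rescaling steps can be sketched as follows. This is an illustrative sketch under stated assumptions: the feature map is stored channels-first as a list of C maps of size H × W, the activation is taken to be a sigmoid, and a single C × C fully connected layer is used as described in this section. The function names `squeeze`, `excite`, and `se_block` are hypothetical.

```python
import math


def squeeze(x):
    """F_sq: global average pooling, giving one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def excite(u, w):
    """F_ex: fully connected layer W (C x C) followed by a sigmoid (assumed)."""
    z = [sum(w[i][j] * u[j] for j in range(len(u))) for i in range(len(w))]
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]


def se_block(x, w):
    """Rescale every channel of x by its attention weight V."""
    v = excite(squeeze(x), w)
    return [[[v[c] * val for val in row] for row in ch] for c, ch in enumerate(x)]


# Toy input with C = 2 channels of size 2 x 2 (values are illustrative).
x = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 10.0], [10.0, 10.0]],
]
w_zero = [[0.0, 0.0], [0.0, 0.0]]  # untrained weights: every channel gets weight 0.5

print(squeeze(x))
print(se_block(x, w_zero))
```

Training adjusts `w` so that channels carrying crash-relevant features receive weights closer to 1 and less useful channels are suppressed.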

Journal of Advanced Transportation
An improved visual attention module over SE is the convolutional block attention module (CBAM), which was first proposed by Gupta [32]. Building on the basic channel attention module (i.e., SE), CBAM introduces a spatial attention module, as shown in Figure 4. Unlike the basic module, the spatial attention module first performs maximum pooling and average pooling operations $F^{s}_{sq}(\cdot)$ on the input $X_S$ along the channel dimension and then converts the resulting two-layer feature map to a single-layer feature map through a 1 × 1 convolutional layer with a weight of $W$, shown as $F^{s}_{ex}(\cdot, W)$ in Figure 4. Finally, softmax is used to convert the original distribution to a probability distribution, which adjusts the importance the model assigns to different spatial positions of the input $X_S$.
The CBAM module can be embedded into residual modules to improve their feature selection performance. Figure 5 shows how the two modules are integrated.
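The spatial attention steps (pooling across channels, a 1 × 1 convolution, and a softmax) can be sketched as below. This is a sketch under assumptions rather than the authors' code: the input is stored channels-first, the 1 × 1 convolution is reduced to two scalar weights `w_max` and `w_avg` with no bias, and the softmax is taken over all spatial positions. All function names are hypothetical.

```python
import math


def channel_pool(x):
    """F^s_sq: per-position max and mean across channels, giving two H x W maps."""
    h, w = len(x[0]), len(x[0][0])
    max_map = [[max(ch[r][c] for ch in x) for c in range(w)] for r in range(h)]
    avg_map = [[sum(ch[r][c] for ch in x) / len(x) for c in range(w)] for r in range(h)]
    return max_map, avg_map


def spatial_attention(x, w_max, w_avg):
    """Return the H x W attention map; w_max and w_avg play the role of the 1 x 1 convolution."""
    max_map, avg_map = channel_pool(x)
    h, w = len(max_map), len(max_map[0])
    scores = [[w_max * max_map[r][c] + w_avg * avg_map[r][c] for c in range(w)]
              for r in range(h)]
    peak = max(max(row) for row in scores)  # subtract the peak for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def apply_spatial_attention(x, attn):
    """Multiply every channel of x by the attention map, position by position."""
    return [[[attn[r][c] * ch[r][c] for c in range(len(attn[0]))]
             for r in range(len(attn))] for ch in x]


# Toy input with 2 channels of size 2 x 2 (values are illustrative).
x = [
    [[1.0, 5.0], [2.0, 0.0]],
    [[3.0, 1.0], [2.0, 4.0]],
]
attn = spatial_attention(x, w_max=1.0, w_avg=0.5)
print(attn)
print(apply_spatial_attention(x, attn))
```

Because the softmax normalizes over positions, the attention map sums to 1 and the position with the strongest pooled response receives the largest weight.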

Feature Fusion Module (Conv-LSTM).
The Conv-LSTM module was first used in precipitation nowcasting [33], and its structure is shown in Figure 6. Traditional LSTM input requires data flattening, which often causes the loss of spatial information. The Conv-LSTM module inherits the gating structure adopted by the traditional LSTM, while it uses convolutional neurons as the basic unit to retain spatial features. The data modeling process is as follows.
First, the inputs $\chi_t$ and $H_{t-1}$ are stacked along the channel dimension to generate $[\chi_t, H_{t-1}]$, and a one-dimensional convolution $F(\cdot; W_\chi, W_H)$ is applied to $[\chi_t, H_{t-1}]$ to produce $[Y_f, Y_i, Y_g, Y_o]$. Then, $[f_t, i_t, g_t, o_t]$ is obtained by applying activation functions to $[Y_f, Y_i, Y_g, Y_o]$. Finally, the outputs $C_t$ and $H_t$ of the Conv-LSTM module at time step $t$ are obtained by gating operations.
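The activation and gating steps can be sketched in a few lines. The sketch assumes the standard LSTM formulation, which matches the symbols used here: a sigmoid for $f_t$, $i_t$, and $o_t$, a tanh for $g_t$, $C_t = f_t \odot C_{t-1} + i_t \odot g_t$, and $H_t = o_t \odot \tanh(C_t)$. The convolution outputs $Y_f, Y_i, Y_g, Y_o$ are taken as given inputs, and the function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def convlstm_gates(y_f, y_i, y_g, y_o, c_prev):
    """One gating step at a single spatial position (standard LSTM gating, assumed)."""
    f_t = sigmoid(y_f)      # forget gate
    i_t = sigmoid(y_i)      # input gate
    g_t = math.tanh(y_g)    # candidate cell update
    o_t = sigmoid(y_o)      # output gate
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * math.tanh(c_t)
    return c_t, h_t


def convlstm_step(y_f, y_i, y_g, y_o, c_prev):
    """Apply the gating element-wise to H x W maps, so the spatial layout is kept."""
    c_t, h_t = [], []
    for r in range(len(c_prev)):
        c_row, h_row = [], []
        for c in range(len(c_prev[0])):
            c_val, h_val = convlstm_gates(
                y_f[r][c], y_i[r][c], y_g[r][c], y_o[r][c], c_prev[r][c]
            )
            c_row.append(c_val)
            h_row.append(h_val)
        c_t.append(c_row)
        h_t.append(h_row)
    return c_t, h_t


# Toy 1 x 2 maps: zero pre-activations make every sigmoid gate equal to 0.5.
zeros = [[0.0, 0.0]]
c_prev = [[2.0, -2.0]]
c_t, h_t = convlstm_step(zeros, zeros, zeros, zeros, c_prev)
print(c_t, h_t)
```

Since every operation is element-wise on the maps, no flattening is needed, which is how the module retains spatial information between time steps.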

Data Preparation
To the best of our knowledge, there is no public database for the crash detection task. Thus, in this study, all data were acquired from local police in China. We prepared two datasets. The first dataset is an urban traffic image dataset, which contains 5061 traffic crash images and 5573 noncrash traffic images. The image dataset was used to train the ResNet plus attention module, while the video dataset was used to train the whole network. By transferring the pretrained ResNet module, the convergence of the whole network could be accelerated. Note that the images and video clips were manually labeled as either crash or noncrash. As such, the trained model was expected to be capable of identifying crashes among normal traffic scenes.

Results and Discussion
All experiments in this study were carried out on a laptop equipped with an Nvidia GTX 1060 GPU. The main specifications of the laptop are as follows: (1) an Intel Core i7-7700HQ CPU @ 2.80 GHz and (2) a GTX 1060 (6 GB) GPU with a core frequency of 1506-1709 MHz and floating-point performance of 4.4 TFLOPS.
First, a set of deep learning models was compared for differentiating crash images (positive) from noncrash images (negative), with the purpose of finding the best crash appearance feature module, which was then linked to Conv-LSTM. VGG-16 and ResNet-50 were used as baseline models. Four extended models were developed by incorporating SE and CBAM modules into VGG and ResNet. The training dataset included 3861 crash and 4373 noncrash images, while the testing dataset included 1200 images for each category. Table 1 shows the performance of those models (i.e., crash appearance feature extraction models) on the test dataset. Compared to VGG-16 and ResNet-50, the extended models with attention modules generally had higher detection accuracy. Among those, the ResNet-50 + CBAM model achieved the highest accuracy of 90.17%. Figure 8 shows the testing accuracy at each training epoch for those crash appearance extraction models. It can also be seen that all models had many more false-positive (FP) cases than false-negative (FN) cases. This indicates that those models tend to classify noncrash traffic scenes as crashes. Some traffic conditions (e.g., stopped vehicles and heavy congestion with many overlapping pedestrians and vehicles) can have appearance features very similar to those of crash scenes. Thus, models based solely on appearance features cannot identify those conditions well.
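The accuracy, false-positive, and false-negative figures discussed here follow from a confusion matrix. The sketch below shows the arithmetic for a balanced test set of 1200 crash and 1200 noncrash images. The counts are hypothetical round numbers chosen for illustration only and are not the values reported in Table 1.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all test images that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def false_positive_rate(fp, tn):
    """Share of noncrash images that were flagged as crashes."""
    return fp / (fp + tn)


def false_negative_rate(fn, tp):
    """Share of crash images that were missed."""
    return fn / (fn + tp)


# Hypothetical counts (NOT from Table 1): 1200 crash and 1200 noncrash test images.
tp, fn = 1100, 100   # crash images: detected / missed
tn, fp = 1000, 200   # noncrash images: correctly rejected / false alarms

print(f"accuracy = {accuracy(tp, tn, fp, fn):.4f}")
print(f"FP rate  = {false_positive_rate(fp, tn):.4f}")
print(f"FN rate  = {false_negative_rate(fn, tp):.4f}")
```

A model with more FP than FN cases, as in this illustration, over-reports crashes, which is the behaviour described for the appearance-only models.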
We further visualized those models using the gradient-weighted class activation mapping (Grad-CAM) technique [34], as shown in Figure 9. ResNet appeared to be better than VGG at focusing on the appearance features of crashes. For example, VGG failed to identify the appearance features of crash D, while ResNet identified them correctly. When attention modules were added, the extended models (ResNet-50 + SE/CBAM) could better focus on appearance features.
Table 2 shows the performance of the six candidate models.
The results indicate that model 1 had the lowest detection accuracy and speed compared to the other models. In general, this model performed well in detecting multivehicle crashes. However, it largely failed to detect vehicle-pedestrian crashes and single-vehicle crashes. The reason could be that such models can only recognize crash motions (e.g., the intersection of vehicle trajectories and abnormal behaviours of nonmotorists [35,36]) rather than crash appearance (e.g., fallen people, vehicle rollover, and vehicle damage). Figure 10 shows some crash scenes that were falsely detected by model 1.
Compared to model 1, the feature fusion-based models had better detection accuracy. Among the feature fusion-based models, the rule-based model (model 2) and the LSTM models (i.e., models 3 and 6) had lower detection accuracy than the Conv-LSTM models (models 4 and 5). The basic idea of the rule-based model is to determine a crash based on the number of detected crash frames in a video clip. Based on experiments, the highest accuracy is achieved when the threshold is set to 10 (i.e., 10 frames). Since such a method uses no sequential information, it may not identify crash motion features well (the FN rate is high). LSTM models require a flattened layout of the appearance feature maps, which can lose spatial information. Conv-LSTM can simultaneously detect both motion features and appearance features while retaining much of their original information (FN decreases compared to the rule-based model).
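The rule-based model can be sketched as a simple frame count. The sketch assumes that a per-frame appearance classifier has already labeled each frame as crash or noncrash and that a clip is declared a crash when the count reaches the threshold (the comparison `>=` is an assumption); the function name `rule_based_crash` is hypothetical.

```python
def rule_based_crash(frame_flags, threshold=10):
    """Declare a crash when at least `threshold` frames were flagged as crash frames."""
    return sum(1 for flag in frame_flags if flag) >= threshold


# A roughly 20-second clip at 24 frames per second has 480 frames (illustrative flags).
clip_a = [True] * 10 + [False] * 470   # 10 crash frames: reaches the threshold
clip_b = [True] * 9 + [False] * 471    # 9 crash frames: below the threshold

print(rule_based_crash(clip_a))
print(rule_based_crash(clip_b))
# Frame order does not matter to the rule, so temporal (motion) cues are ignored.
print(rule_based_crash(list(reversed(clip_a))) == rule_based_crash(clip_a))
```

The last line makes the limitation concrete: reordering the frames leaves the decision unchanged, whereas LSTM and Conv-LSTM models consume the frames in sequence.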
Regarding detection speed, the proposed model framework considerably outperformed the motion-based deep learning models. To detect moving objects with high accuracy, motion-based models often require powerful deep learning models for vehicle detection and tracking. In general, Conv-LSTM achieved the highest detection accuracy with an acceptable detection speed (FPS > 30).
Furthermore, a typical feature fusion-based model (i.e., the C3D model) was also compared to our best model (i.e., model 6). As shown in Table 3, an overfitting issue occurred for the C3D model, with a training accuracy of 99.89% and a test accuracy of 67.22%. The reason is that the C3D model has many more parameters (over 10 times as many) than our proposed model. Since the dataset is limited, the model was easily overfitted. Regarding computational load and detection speed, the proposed model was also better than the C3D model in terms of FLOPs (floating-point operations) and FPS.
Of note, the best Conv-LSTM model (i.e., model 6) still has some false-positive cases. Some noncrash scenes (congestion) cannot be identified well by the model, as shown in Figure 11. This is probably due to the limited sample size. Another reason could be that the proposed model tends to focus on parts of images (because of the attention module) while neglecting the understanding of the whole traffic scene.
As for misdetection (i.e., FN), some typical cases are discussed here (Figure 12). In the first crash, two vehicles collided with each other, which led to an explosion. When it happened, the fire quickly covered the whole traffic scene. Such cases are very rare in our current dataset, so the trained model cannot identify their appearance features well. The second to fifth crashes all happened in congested or complex traffic environments. In such environments, crash features were blocked or difficult to identify, especially when the original image quality was not high.

Conclusion
Detecting crashes in a timely and accurate manner is important for traffic incident management. Previous video-based crash detection models suffer from low detection accuracy (e.g., some motion-based models) or high computational costs (e.g., large feature fusion-based models). To fill this gap, in this paper, we proposed a new feature fusion-based deep learning model framework with the purpose of achieving a balance between accuracy and speed for urban traffic crash detection. To this end, ResNet with attention modules was developed to capture the appearance features of crash images. ResNet is faster than a conventional convolutional neural network, while the attention module enables ResNet to focus on localized appearance features rather than irrelevant information, which further boosts the model's speed. Conv-LSTM was linked to ResNet to simultaneously capture appearance and motion features. Compared to a conventional recurrent neural network (e.g., LSTM), Conv-LSTM can retain most of the spatial information with relatively fewer parameters.
Based on the modeling results, ResNet with attention modules can improve the detection of localized appearance features of crashes. Compared to simple rules and LSTM, Conv-LSTM can better capture the motion features of crashes. The proposed model achieved an overall accuracy of 87.78% with a relatively fast detection speed (FPS > 30), which outperformed conventional motion-based models and existing feature fusion-based models. Thus, the proposed method is a promising crash detection method that achieves a good balance between speed and accuracy.
ResNet has been widely used in various deep learning-based computer-vision tasks for extracting image features. The purpose of ResNet is to solve the training difficulties caused by gradient explosion or vanishing in deep convolutional neural networks. Compared to other conventional neural networks (e.g., VGG (Visual Geometry Group Network)), which continuously stack convolutional layers to obtain higher image expression capabilities, ResNet instead stacks flexible residual modules to obtain stronger expression ability. There are two main types of residual modules. The first type (Figure 2(a)) often appears in shallow residual networks (ResNet-18/34). Each residual module includes two 3 × 3 convolutions, and its output is the sum of the input (i.e., the output from the previous residual module) and its convolution. The ReLU activation function is used to obtain the output of the current residual module.

Figure 1: The proposed crash detection model framework.

The crash images include multiple types, such as single-vehicle, multivehicle, and non-motorist-related crashes. Figure 7 shows some examples. The other dataset is an urban surveillance video dataset, which contains 420 crash video clips and 432 noncrash video clips. The duration of each video clip is around 20 seconds, at 24 or 25 frames per second.

Figure 8: Testing accuracy for each training epoch for crash appearance extractors.

Table 1: Performance of crash appearance feature extractors.

Table 2: Crash detection models' performances on testing set.