Semantic Detection of Vehicle Violation Video Based on Computer 3D Vision

. In order to study the semantic detection accuracy of 3D vehicle accident video, an accident detection method combining 2D image and 3D information was proposed. The 3D semantic bounding box generated by the 3D detection and tracking task of the vehicle is used to extract the motion features of the vehicle, it includes the trajectory of the vehicle and the dimension and orientation of the 3D bounding frame, and the 3D semantic bounding frame is used to establish the evaluation index of the accident detection. The experimental results show that the average loss function of each batch of 1000 images is calculated by the stochastic gradient descent method to update the parameter values. The learning rate was set to 0.001 in the ﬁrst 30,000 iterations and 0.0001 in the last 10,000 iterations. The MOTA of the CEM algorithm is 78.4%, FP is 1.1%, and FN is 3.5%, and the MOTA of the 3-DCMK algorithm is 88.6%, FP is 0.9%, and FN is 1.9%. The MOTA of this method is 89.3%, FP is 0.9%, and FN is 1.2%. The 3D target semantic detection of vehicle accident video has stability and accuracy.


Introduction
With the development of autonomous driving industry, relevant research has been promoted.On-board sensors, onboard control systems, high-precision positioning maps, and other technologies have developed vigorously in recent years.However, for autonomous driving technology, safety is the primary factor to consider [1].To ensure the safety and reliability of autonomous driving, we need a precise understanding of the surrounding environment.Automatic driving system converts sensor information into semantic information, and target detection, as the basis and an important step of semantic information extraction, plays an indispensable role in the automatic driving system.In the automatic driving system, the detection of objects such as people and vehicles is an important part.How to accurately identify objects and quickly locate them is a difficult problem for autonomous driving [2].Different from the target detection in the picture, the environment of automatic driving is three-dimensional space, and three parameters of the center point and length, width, and height of the object need to be obtained.And for vehicles, it is more necessary to know the current driving direction [3].erefore, traditional twodimensional target detection methods cannot meet the needs of autonomous driving, and a more accurate perception of the three-dimensional space around the vehicle is needed.However, the image data alone cannot well satisfy the perception tasks under various weather and extreme light conditions.A vehicle accident detection method based on 3D object detection to generate 3D semantic bounding box is proposed.e method extracts the vehicle posture from the convolutional neural network and predicts the 3D bounding box of the vehicle.3D object detection and 2D to 3D extended Kalman filter were combined to recover the 6-dOF 3D vehicle attitude and tracking track in the video sequence, and the 3D semantic bounding frame was used to establish the evaluation index of accident detection.According to different data sources, 3D target detection can be roughly divided into three directions: target detection based on point cloud data, target detection based on image, and target detection based on the combination of heterogeneous data [4].Point cloud data can be obtained by lidar or binocular camera.ese three modes all have their own characteristics, which combine to produce different research methods, image, as the most common research data in computer vision; although the picture cannot well represent the threedimensional space, the 3D state of the object can be estimated with the help of the information on the image.Monocular cameras provide detailed information in the form of pixels to display shape and texture features at a greater scale.ese features can be used to detect lane lines and traffic lights or other categories of information.However, the drawback of monocular cameras is also obvious: the lack of depth information.Due to the lack of depth information, most methods based on monocular vision are based on two-dimensional image target detection algorithms; firstly, the position of the object on the 2D image is obtained, and then the 3D target border parameter regression is carried out to get the final target position.Some of the previous detection methods on two-dimensional images can be directly performed using model-based or segmentationbased target detection; then, with the application of deep learning in the field of object detection, the deep learning method is gradually adopted for 3D object detection.In view of this research problem, Wang et al. used 3D pixel features to detect vehicles in images; a new model representation 3DVP is proposed, which is segmented after two-dimensional image detection and projected into three-dimensional space to obtain accurate 3D position information [5].e concept of subcategory is proposed; that is, objects with similar attributes are regarded as the same category, and the 3DVP structure is used to construct subcategories; a convolution network is used to generate a heat map for each subclass of the proposed region; after the region of interest is pooled, the network outputs category classification and twodimensional border estimation.e algorithm works well for the problems of occlusion and partial missing objects, but these scenes also appear as a category in the model dictionary; if the vehicle attitude does not appear in the model dictionary, it cannot be accurately detected.To overcome this problem, Ma et al. used a multitask network to estimate vehicle state, partial position, and shape.Car shapes generate 3D candidate box shapes from a series of key points.Firstly, the 2D candidate box is obtained through the improved Region Proposal Network (RPN), and then the candidate box is obtained based on the inferred 3D model matching [6].With the development of deep learning, the algorithm obtains the two-dimensional target border through the neural network and then carries out the three-dimensional parameter regression.Lv et al. proposed Mono3D, which firstly generated numerous detection candidate boxes in three-dimensional space, then feature extraction was carried out for the candidate boxes, and scores were calculated for each candidate box projected on the image plane by using various features such as semantics or shapes, and regression of 3D candidate boxes was carried out [7].However, due to the lack of depth information, the accuracy of position detection is poor.Based on the current research, a vehicle accident detection method based on 3D object detection to generate 3D semantic bounding box is proposed; this method extracts the vehicle posture from the convolutional neural network and predicts the 3D bounding box of the vehicle.Combining 3D object detection and 2D to 3D extended Kalman filter, the 6 degrees of 3D vehicle attitude and tracking trajectory are restored in the video sequence, and the 3D semantic bounding frame is used to establish the accident detection evaluation index.e test results show that each batch of 1000 images was trained by the stochastic gradient descent method, and their average loss function values were calculated to update the parameter values.e learning rate was set to 0.001 in the first 30,000 iterations and 0.0001 in the last 10,000 iterations.e MOTA of the CEM algorithm is 78.4%,FP is 1.1%, and FN is 3.5%, and the MOTA of the 3-DCMK algorithm is 88.6%, FP is 0.9%, and FN is 1.9%.e MOTA of this method is 89.3%, FP is 0.9%, and FN is 1.2%.e semantic detection of vehicle accident video for 3D targets has stability and accuracy [8].

Algorithm Framework.
Figure 1 shows the framework of the proposed algorithm based on monocular vision for 3D object detection and multitarget tracking; the algorithm is mainly composed of three stages.Stage 1: a 3D object detector is trained using convolutional neural network to detect vehicles on the road.Stage 2: using extended Kalman filter to predict and update the status of the detected vehicles, 2D-3D multitarget tracking is generated, so as to realize 6 degrees of 3D target attitude recovery in the whole video sequence.In the last stage, a vehicle behavior function is established based on the trajectory and appearance of the moving vehicle to determine the occurrence of the accident.

3D Vehicle Detection.
e three-dimensional bounding frame of the vehicle can provide information such as position, dimension, and orientation, i.e., 9 degrees of freedom represent the motion state of the vehicle.(x, y, z) represents the position of the vehicle center point in the 3D coordinate system; (w, h, L) represents the width, height, and length of the vehicle; (α, β, c) are the deflection angle, heading angle, and pitch angle of the vehicle.Based on the darknet-53 network model, the model is trained end-to-end, and then 3D bounding box is formed by degrees of parameters and camera projection matrix.A feature description module similar to the feature pyramid network (FPN) is extracted from the convolution layer, which includes deeper convolution layer and deconvolution layer to capture image details.Higher resolution output tensors can capture richer image details, such as distant objects in the field of view, while lower resolution output tensors can capture more image context information.e normalized size of each input image is 416 * 416, and the designed output category is vehicle, the output parameters are 9 degrees of freedom parameters plus a confidence parameter, three feature maps with different sizes are output, and then the tensor depth is shown in formula as follows: 3 * (10 + 1) � 33. (1)

Advances in Multimedia
For these output parameters, a 3D Box Proposal Network (3DBPN) loss function is designed; 3DBPN loss function mainly includes position L loc , dimension L size , and orientation L θ .x i , y i , z i and  x i ,  y i ,  z i represent the real space coordinates and the predicted coordinates in space, respectively; w i , l i , h i and  w i ,  l i ,  h i represent the real size and the predicted size of the object, respectively; θ i and  θ i represent the true and predicted directions, respectively.Different weight α is set for each loss function to further improve the prediction performance of parameters.e loss function is shown in formulas ( 2)-( 5): e candidate generation boxes with low score are filtered by the nonmaximal suppression algorithm to improve the 3D bounding boxes with high accuracy.e 3D detection algorithm of the vehicle provides the position prediction of the vehicle in the monitoring view, which is convenient for the next step of 3D tracking.

Vehicle Tracking.
e relationship between frames is established in the current detection results, and the 2-3D extended Kalman filter (EKF) is fused to predict the motion and state of the vehicles on the road; thus, three-dimensional attitude tracking of the vehicle is realized [9].After the 3D frame size and angle information of the vehicle detected in the initial frame are output, the initial posture of the vehicle is determined.In two consecutive frames, the extended Kalman filter is used to estimate the tracking state of the target in the next frame by 3D reprojection.e observation results are combined with 3D object detection and 2D detection to form all the observation models; the MRF model is used to select the appropriate observation results from a large number of generated hypothetical observations as the initial detection frame for the next frame.Combined with projection and backprojection operations, the locus points are applied to the center of the 3D bounding box.
e recursive neural network LSTM is used to correlate and match the position of vehicles on the road across frames.e velocity is obtained from the LSTM model and is used to estimate the 3D pose to update the 3D position.According to the average size of the training data, the size of the 3D bounding box can be updated periodically with the distance of the visual field.6) and ( 7): Vehicle collisions also bring about rapid changes in the corresponding 3D bounding frame.An appropriate range of changes is set for each vehicle to compare the 3D bounding frame in normal driving.e prior average dimensions of length, width, and height of the bounding frame are set as s, the variation range of reasonable size s i is set as shown in the following formula: e dimension variance O 3D (W, H, L) of the vehicle 3D bounding frame is calculated; the weight parameter w s is set, and the vehicle bounding box size evaluation function D s can be obtained as shown in formulas ( 9) and (10).
e variances of the three rotation angles O 3D (α, β, y) of the vehicle enclosure were calculated accordingly; set the weight parameter wθ to obtain the vehicle rotation angle evaluation function D θ , as shown in formulas (11) and (12).
A vehicle accident detected in a continuous frame is determined by the threshold parameter A t , and the detection result function is set as follows: there was a traffic accident, A > A t ; no traffic accident occurred, otherwise [10].

Training Process.
is experiment mainly tests the performance of vehicle detection and tracking and accident detection under the view of intersection monitoring.e data set used for the experimental test is from KITTI data set and intersection monitoring data set, covering different weather environment and traffic flow, from which 100 video sequences are extracted and annotated.ree-fifths of the total traffic video frames were used for training, and the rest were used as test frames.Each batch of 1000 images was trained by the stochastic gradient descent method, and their average loss function values were calculated to update the parameter values.e learning rate was set to 0.001 in the first 30,000 iterations and 0.0001 in the last 10,000 iterations.

Experimental Results.
e experiment is compared with several advanced 3D attitude detectors.e vehicles were detected using the KITTI dataset official evaluation method  at three different levels of occlusion.In order to evaluate the accuracy of 3D object detection, three evaluation criteria are formulated according to the actual situation [11].Firstly, the error rate between the real distance and the estimated distance between the center point of the object's 3D bounding frame and the camera is calculated.Secondly, the change trend of the intersection ratio between the detected 3D bounding frame and the real 3D bounding frame is compared with the increase in the real distance between the object and the camera.Figures 2(a) and 2(b) show the experimental comparison of the method with other methods under the first two evaluation criteria for 3D detection of the KITTI dataset.
e final evaluation criterion is vehicle orientation comparison, which compares the mean directional similarity (AOS), mean accuracy (AP), and directional score (OS) by detecting three data sets with different degrees of occlusion.Compared with other three 3D detection methods, the comprehensive accuracy of the method is higher than that of other methods.
In 3D target tracking, scale change, direction change, and occlusion will affect tracking performance.e video sequences from 50 traffic surveillance angles were tracked and compared with other tracking methods.e average accuracy (MOTA), false positive (FP), and false negative (FN) of multitarget tracking were used to evaluate the robustness of the model.As shown in Table 1, the 3D target tracking trajectory in this paper is clear, and the vehicle attitude of 3D bounding box regression is relatively accurate [12].
For the evaluation method of traffic accident detection performance, the ROC curve was drawn by using the two evaluation indexes of true positive rate (TPR) and false positive rate (FPR).As shown in Figure 3, compared with BiSTM and STM, the proposed algorithm can identify traffic accidents on the road in a variety of environments with higher accuracy.

Conclusions
A vehicle accident detection method based on 3D object detection to generate 3D semantic bounding box is proposed; the method extracts the vehicle posture from the convolutional neural network and predicts the 3D bounding box of the vehicle.3D object detection and 2D to 3D extended Kalman filter were combined to recover the 6 degrees of 3D vehicle attitude and tracking track in the video sequence, and the 3D semantic bounding frame was used to establish the evaluation index of accident detection.e experimental results show that each batch of 1000 images was trained by the stochastic gradient descent method, and their average loss function values were calculated to update the parameter values.e learning rate was set to 0.001 in the first 30,000 iterations and 0.0001 in the last 10,000 iterations.
e MOTA of the CEM algorithm is 78.4%,FP is 1.1%, and FN is 3.5%, and the MOTA of the 3-DCMK algorithm is 88.6%, FP is 0.9%, and FN is 1.9%.e MOTA of this method is 89.3%, FP is 0.9%, and FN is 1.2%, and the 3D target semantic detection of vehicle accident video is stable.

Figure 2 :
Figure 2: Comparison of 3D detection experiments on KITTI data set.(a) Experimental comparison of center point distance equalization error.(b) Experimental comparison of 3D cross-parallel ratio.

Table 1 :
Experimental comparison of tracking effect.