Traffic Sign Detection via Improved Sparse R-CNN for Autonomous Vehicles

Traffic sign detection is an important component of autonomous vehicles. There is still a mismatch between existing detection algorithms and their practical application in real traffic scenes, which is mainly due to limitations in detection accuracy and data acquisition. To tackle this problem, this study proposes an improved sparse R-CNN that integrates a coordinate attention block with ResNeSt and builds a feature pyramid to modify the backbone, which enables the extracted features to focus on important information and improves the detection accuracy. To obtain more diverse data, the augmentation method used is specifically designed for complex traffic scenarios, and we also present a traffic sign dataset in this study. For on-road autonomous vehicles, we designed two modules, self-adaption augmentation (SAA) and detection time augmentation (DTA), to improve the robustness of the detection algorithm. The evaluations on traffic sign datasets and on-road testing demonstrate the accuracy and effectiveness of the proposed method.


Introduction
Traffic sign detection based on computer vision plays a crucial role in autonomous driving systems. Deep neural networks have been successfully applied in many fields, such as computer vision, communications [1], and networking [2]. Applying detection algorithms based on deep learning to autonomous vehicles has become a hot topic for researchers. The traffic sign detection system can automatically detect and recognize traffic signs in real traffic scenes and then transmit the results to the decision-making module to ensure that the vehicle drives safely in accordance with traffic rules.
It remains challenging to detect and recognize traffic signs in on-road scenes because their appearance is unstable under varying conditions, such as illumination, weather, and noise. These complex factors reduce the detection accuracy of traffic signs. The datasets used to train the detection model also affect the detection accuracy, because they are limited by the scenes, weather conditions, and times at which the data were collected.
To solve the problems mentioned above, the motivation of this study is to make the feature extraction performed by the backbone focus more on the object and to detect multiscale objects. In order to improve the robustness of the detection model, the dataset used for training should be as diverse as possible, so that it can simulate traffic signs in complex traffic scenarios. The transformer structure [3] has recently become a hot topic due to its competitive performance, especially since the vision transformer (ViT) [4] and DETR [5] were proposed to apply transformers to computer vision.
The sparse R-CNN [6], which is inspired by the transformer, is a purely sparse method for object detection compared to ordinary CNN models. It uses a set of learnable proposal boxes and features to replace the thousands of candidates generated by traditional region proposal algorithms, such as selective search [7] in R-CNN and region proposal networks (RPN) [8] in faster R-CNN. The traffic sign detection method proposed in this study is based on sparse R-CNN. The contributions of this study are as follows:
(1) The proposed method, multiscale sparse R-CNN (MSR), integrates a coordinate attention block into the backbone network ResNeSt [9], which improves the ability of the model to find regions of interest in images. Then, a feature pyramid network is used for multiscale detection.
(2) Data augmentation driven by complex traffic scenarios is used to make the training dataset more diverse.
(3) In order to improve the detection accuracy in on-road scenarios, this study designs a self-adaption augmentation (SAA) module in front of the MSR and a detection time augmentation (DTA) module behind the MSR.
(4) This study presents an annotated traffic sign dataset called the Beijing Union University Chinese Traffic Sign Detection Benchmark (BCTSDB).
The rest of this study is organized as follows: in the related work section, we introduce recent object detection and traffic sign detection algorithms. The details of the proposed traffic sign detection model are presented in the proposed method section. The following section focuses on its implementation and comparison with previous methods. The final section summarizes the proposed method and discusses future directions.

Related Work
Traffic signs are usually designed with eye-catching colors to improve identifiability, so that they can be distinguished from the environmental background. Many traditional traffic sign detection methods rely on extracting features from visual information such as color, edges, and shapes.
Reference [10] proposed a traffic sign detection method based on the HOG [11] feature and SVM [12]. First, the method segmented the image of traffic signs by a color threshold to remove most of the interference and then used the maximally stable extremal region algorithm to detect the connected regions. Shape is another important feature of traffic signs. Reference [13] used a shape-based method that jointly considers triangles, circles, and squares, and used connected components for shape recognition to remove the regions without traffic signs in the images. The above detection methods are usually affected by illumination, occlusion, distortion, and scale. When applied to real traffic scenes, their slow detection speed and low accuracy cannot meet the needs of autonomous driving systems.
In recent years, deep learning algorithms have been widely used in object detection tasks for their competitive performance. Detection methods based on deep learning are mainly divided into dense algorithms and dense-to-sparse algorithms [6]. The dense algorithms, also called one-stage algorithms, such as the you only look once (YOLO) series [14][15][16][17], single-shot multibox detector (SSD) [18], and RetinaNet [19], directly output the locations and categories of dense bounding boxes from features in a single-shot way. They directly predict anchor boxes or key points [20] that densely cover spatial positions, and each of these dense candidates is classified and regressed. In anchor-based algorithms in particular, k anchor boxes need to be set for each position in the feature map (H × W), which leads to H × W × k anchors. These candidates are assigned to ground-truth object boxes at training time, and nonmaximum suppression (NMS) is then needed to remove redundant predictions at inference time.
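To make the cost of dense candidates concrete, the following minimal sketch counts the anchors on a single feature map and applies greedy NMS to a few boxes. The feature map size, k = 9, the IoU threshold of 0.5, and the function names are illustrative assumptions and are not values taken from this study.

```python
def num_anchors(h, w, k):
    """Number of anchor boxes on an H x W feature map with k anchors per position."""
    return h * w * k


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, thr=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than thr."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thr for j in keep):
            keep.append(i)
    return keep


# Hypothetical example: a 100 x 152 feature map with 9 anchors per position.
print(num_anchors(100, 152, 9))
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
print(nms(boxes, [0.9, 0.8, 0.7]))
```

Even a single feature map of this size yields more than a hundred thousand candidates, which is why dense detectors depend on NMS at inference time.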
The dense-to-sparse algorithms are also known as two-stage algorithms. This kind of algorithm first uses a region proposal algorithm (e.g., selective search [7] or RPN [8]) to select a small set of foreground region proposals from preset dense candidates in the first stage. The region proposals are then fed into the subsequent network for classification and position regression in the second stage. Examples include R-CNN [7], fast R-CNN [21], and faster R-CNN [8].
More researchers have begun to use deep learning methods for traffic sign detection. Yang et al. [22] used adversarial machine learning to generate adversarial examples in order to improve the detection robustness of autonomous vehicles but did not consider the effect of the environment on the detection. He et al. [23] presented a traffic sign detection method using CapsNet [24] based on visual inspection. However, it extracts HOG features from images, which do not contain semantic information. Dewi et al. [25] used YOLOv4 with synthetic training data to detect traffic signs. Domen et al. [26] proposed an improved mask R-CNN [27] to address the full pipeline of detection with end-to-end learning, but it cannot detect small and multiscale traffic signs. Cao et al. [28] improved faster R-CNN through a high-resolution backbone network [29] and a prime sample strategy [30]. Xie et al. [31] proposed an improved cascade R-CNN [32] for traffic sign detection. Most of the above methods use two-stage algorithms with high computational complexity to detect traffic signs. These methods have improved performance, but their backbone networks, such as VGG [33] and ResNet [34], were designed for classification and cannot extract deeper semantic and contextual information due to the limited receptive field size and lack of cross-channel interaction.
The sparse R-CNN method proposes a new object detection paradigm called sparse algorithms [6], which avoids the RPN and replaces it with a set of N learned object proposals. N is much smaller than the number of anchor boxes used by the dense algorithms or by the first stage of the dense-to-sparse algorithms. Unlike the two-stage algorithms, this method has no RPN structure, and the proposal boxes are generated from a set of preset learnable parameters.
The comparisons of these methods are listed in Table 1. It can be seen from this table that the current algorithms applied in traffic sign detection do not fully integrate attention, multiscale detection, and data augmentation, which decreases the detection accuracy of the trained model in on-road testing. In this study, we incorporate these three factors into a traffic sign detector based on sparse R-CNN to improve the detection accuracy.

Proposed Method
The framework proposed to improve the detection accuracy of traffic signs is illustrated in Figure 1. It consists of two phases: training and inference. Our main contributions are the following three parts: (a) MSR, which integrates a coordinate attention block with the backbone network ResNeSt and builds a feature pyramid for multiscale detection. (b) Data augmentation for complex traffic environments. (c) The SAA and DTA modules, which are designed to improve on-road detection accuracy.

Pipeline.
In our proposed framework, as depicted in Figure 1, $x_{tr}$ represents the training images, and $x_{in}$ represents the testing images.
In the training phase, two modules process the images $x_{tr}$ synchronously. (1) The first module is MSR. The images $x_{tr}$ are first augmented by the data augmentation method and then sent to the MSR to detect traffic signs. In MSR, the features are extracted using our proposed backbone network. In the feature extraction process, the acquired five-scale pyramid features are denoted by $C_1, C_2, C_3, C_4, C_5$. The results obtained by the coordinate attention block and a 3 × 3 convolution kernel are $P_1, P_2, P_3, P_4, P_5$. Then, the corresponding proposal boxes and proposal features are input into the dynamic head to generate object features. Finally, the loss value is calculated, and back-propagation training is carried out to obtain the final model. (2) The second module is SAA. It is a classifier that learns to divide illumination into low-, normal-, and high-light classes in the training phase.
In the inference phase, the test images are first sent to the SAA trained on the training images to classify the illumination of $x_{in}$. According to the image category obtained from SAA, the corresponding nodes in the data augmentation channel are activated for data augmentation. These extended samples are input into the MSR to obtain detection results. The output contains the target probability, category probability, and position information, which are processed by the DTA.

Sparse R-CNN.
Sparse R-CNN avoids the manual setting of a large number of hyperparameters for candidate boxes and many-to-one label assignments. More importantly, the final prediction result can be directly output without NMS, as illustrated in Figure 2. ResNeSt was used as the backbone network for feature extraction in the proposed framework.
In the proposed sparse R-CNN, we use the CIoU [35] loss for bounding box regression. CIoU solves the problem of not being able to directly optimize the parts where the bounding box and ground truth do not overlap. The distance between the two boxes, overlap rate, scale, and penalty terms are all taken into consideration, making the target box regression more stable.
This can also prevent divergence in the training process. The CIoU loss adds an impact term $\alpha v$ to the DIoU loss [35], which considers the length-to-width ratio of the predicted and ground-truth boxes.
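As a concrete reading of the CIoU loss, the sketch below evaluates it for a single pair of boxes. It follows the commonly used CIoU formulation, in which the aspect-ratio term is v = (4/π²)(arctan(w_gt/h_gt) − arctan(w/h))² and the trade-off weight is α = v / ((1 − IoU) + v). The (x1, y1, x2, y2) box format and the function name are assumptions for illustration, not the authors' implementation.

```python
import math


def ciou_loss(pred, gt):
    """CIoU loss for two boxes in (x1, y1, x2, y2) format with positive width and height."""
    # Intersection over union.
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    iou = inter / (w * h + wg * hg - inter)

    # Squared distance between the two box centers (rho^2).
    rho2 = ((pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
        + ((pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes (c^2).
    c2 = (max(pred[2], gt[2]) - min(pred[0], gt[0])) ** 2 \
        + (max(pred[3], gt[3]) - min(pred[1], gt[1])) ** 2

    # Aspect-ratio consistency v and trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v


print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes still give a distance signal
```

For disjoint boxes the IoU term is constant, but the center-distance term still changes as the prediction moves, which is the property that makes the loss optimizable when the boxes do not overlap.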
The CIoU loss function is defined as follows:

$$L_{ciou} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v,$$

where $\alpha$ is a trade-off parameter, and $v$ is a parameter used to measure the consistency of the aspect ratio. Furthermore, $\rho(\cdot)$ is the distance between the central points $b$ and $b^{gt}$ of the two boxes, and $c$ is the diagonal length of the smallest enclosing box covering the two boxes.
Starting from a sparse set of learnable proposals, sparse R-CNN generates proposal boxes to extract the region of interest (ROI) and proposal features to learn ROI features. Both are learnable parameters. The dimensions of the learnable proposal boxes are N × 4, where N represents the number of object candidates, generally ranging from 100 to 300, and 4 corresponds to the four boundaries of the object box. The network sets a fixed number of boxes as learning parameters. The dimensions of the learnable proposal features are N × d, where d represents the dimension of a feature, which is generally 256. The ROI features extracted by the proposal boxes interact one-to-one with the proposal features to supplement high-level feature information, such that the features of the ROI are more conducive to location and classification. This interaction design is called a dynamic instance interactive head. It binds the proposal box, ROI, and feature vector and detects each ROI separately without an NMS operation. Thus, the candidate boxes can be sparse, and the interactions between features can also be sparse. The backbone network extracts a feature map, and each proposal box and proposal feature are fed into its exclusive dynamic head to generate the object feature. The matching cost is defined as follows:

$$L_{joint} = \lambda_{cls} \cdot L_{cls} + \lambda_{L1} \cdot L_{L1} + \lambda_{ciou} \cdot L_{ciou}.$$

Here, $L_{cls}$ and $L_{L1}$ are the focal loss and L1 loss, respectively. Moreover, $L_{ciou}$ represents the CIoU loss, and $\lambda_{cls}$, $\lambda_{L1}$, and $\lambda_{ciou}$ are the coefficients of each component. The final loss $L_{joint}$ is the sum over all pairs, normalized by the number of objects inside the training batch. Sparse R-CNN can be seen as a new detection paradigm that has changed the framework of the dense detector and the dense-to-sparse detector by abandoning the concepts of anchor boxes or reference points.

Table 1: Comparison of traffic sign detection methods.
Method | Model | Attention | Multiscale | Data augmentation
[13] | HOG + SVM | - | - | -
Dense algorithms
Yang et al. [22] | Adversarial network | - | - | ✓
He et al. [24] | HOG + CapsNet | - | - | -
Dewi et al. [25] | YOLOv4 | - | ✓ | ✓
Dense-to-sparse algorithms
Domen et al. [26] | Mask R-CNN | ✓ | - | -
Cao et al. [28] | Faster R-CNN | - | ✓ | ✓
Xie et al. [31] | Cascade R-CNN | ✓ | ✓ | -
Sparse algorithms
Sun et al. [6] | Sparse R-CNN | - | - |

Backbone Network.
Convolutional neural networks were originally designed for image classification. Although they perform competitively in classification tasks, their limited receptive field size and lack of cross-channel interaction restrict them in object detection and image segmentation. Networks with cross-channel representations can alleviate these problems. Reference [9] proposed ResNeSt, which is built from split-attention blocks. Compared to ResNet, it does not require additional computation. ResNeSt draws on the idea of the ResNeXt network [36], dividing the input into k groups, each marked as cardinal 1 to k, and then splitting each cardinal group into R pieces, marked as split 1 to R; hence, there are G = k × R groups in total. The structure of ResNeSt is shown in Figure 3.
In the proposed method, average pooling with a kernel size of 3 × 3 was used to reduce the spatial dimensions, and the 7 × 7 convolution was replaced by three 3 × 3 convolutions, which kept the receptive field the same while reducing the number of parameters. A 2 × 2 average pooling layer is added before the 1 × 1 convolution in shortcut connections with a stride of two.
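The claim that three 3 × 3 convolutions match the receptive field of one 7 × 7 convolution with fewer parameters can be checked with a short calculation. The sketch below assumes stride-1 convolutions and, for simplicity, the same hypothetical channel width C for every layer; the actual stem of the network may use strides and different widths.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1 convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


def conv_weights(kernel, c_in, c_out):
    """Weight count of a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


c = 64  # hypothetical channel width, identical for input and output
single = conv_weights(7, c, c)       # one 7 x 7 convolution: 49 * C^2 weights
stacked = 3 * conv_weights(3, c, c)  # three 3 x 3 convolutions: 27 * C^2 weights

print(stacked_receptive_field([7]), stacked_receptive_field([3, 3, 3]))
print(single, stacked)
```

Under these assumptions both designs cover a 7 × 7 window, while the stacked version needs 27/49 of the weights.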
We construct a feature pyramid with five scales. Each time the pyramid level increases by one, the resolution is halved. The feature maps are extracted by top-down convolution to reduce the degradation that occurs as the depth of the convolutional layers increases, and all maps have 256 channels. The five features at different scales are processed by the multiscale coordinate attention block to enhance attention to the traffic signs. Then, the outputs are processed by a 3 × 3 convolution to obtain feature maps $P_1, P_2, P_3, P_4, P_5$, which contain both high-resolution spatial information and low-resolution semantic information. Figure 4 illustrates the structure of the multiscale attention backbone network.
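The halving rule for the five pyramid levels can be illustrated by listing the feature map shapes for a given input size. In this sketch the stride of the first level (4) and the input size are assumptions chosen for illustration; only the five levels, the factor of two between levels, and the 256 channels come from the text.

```python
def pyramid_shapes(height, width, base_stride=4, levels=5, channels=256):
    """Return (channels, H, W) for each pyramid level; resolution halves per level."""
    shapes = []
    for level in range(levels):
        stride = base_stride * 2 ** level
        # Ceiling division, so odd sizes are rounded up rather than truncated.
        shapes.append((channels, -(-height // stride), -(-width // stride)))
    return shapes


for level, shape in enumerate(pyramid_shapes(512, 1024), start=1):
    print(f"P{level}: {shape}")
```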

Multiscale Coordinate Attention.
Existing attention methods in computer vision, such as SENet [37], BAM [38], and CBAM [39], only consider local area information or do not consider spatial information. Therefore, in the proposed method, we used the multiscale coordinate attention method [40]. It uses two one-dimensional global pooling operations to aggregate the input features along the vertical and horizontal directions into two separate direction-aware feature maps. Then, the two feature maps with the embedded direction-specific information are encoded into two attention maps, each of which captures the long-distance dependency of the input feature map along one spatial direction. The location information is thus preserved in the generated attention maps. Then, the two attention maps are applied to the input feature maps by multiplication to emphasize the representation of the attention region. The specific structure of the coordinate attention block is illustrated in Figure 5. In the proposed method, we use the coordinate attention block to process the pyramid features and obtain traffic sign features at different scales. As shown in Figure 6, the first column is the original image, the second column is the feature map without the attention mechanism, and the third column is the feature map with the attention mechanism. Using the attention mechanism helps the network find the region of interest in images.
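The pooling and re-weighting pattern of coordinate attention can be sketched on a single feature channel. This is a deliberately simplified illustration: the learned 1 × 1 convolutions, normalization, and channel mixing of the real block are replaced by identity mappings (an assumption made here), so only the two directional poolings and the multiplicative attention are shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def coordinate_attention(fmap):
    """Apply simplified coordinate attention to one H x W channel (list of rows).

    The learned transforms of the real block are omitted, so the attention
    weights depend only on the pooled row and column averages.
    """
    h, w = len(fmap), len(fmap[0])
    # 1D global average pooling along the horizontal direction: one value per row.
    row_pool = [sum(row) / w for row in fmap]
    # 1D global average pooling along the vertical direction: one value per column.
    col_pool = [sum(fmap[i][j] for i in range(h)) / h for j in range(w)]
    # Direction-aware attention weights in (0, 1).
    a_h = [sigmoid(v) for v in row_pool]
    a_w = [sigmoid(v) for v in col_pool]
    # Re-weight every position by the attention of its row and of its column.
    return [[fmap[i][j] * a_h[i] * a_w[j] for j in range(w)] for i in range(h)]


feature = [[0.0, 0.0, 0.0],
           [0.0, 4.0, 0.0],
           [0.0, 0.0, 0.0]]
for row in coordinate_attention(feature):
    print(row)
```

Because each position is scaled by a row weight and a column weight, the attention encodes where along each axis the strong responses lie, which is how positional information is retained.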

Data Augmentation.
Deep convolutional neural networks have been successfully applied in the field of computer vision. This type of method is data driven and requires a large amount of training data. As the network architecture becomes deeper, more parameters need to be learned, and more data are required for the model to perform well. In a complicated traffic environment, especially in China, many factors affect traffic sign detection, such as illumination, weather, and noise. However, traffic sign datasets are not diverse enough: they do not contain data from different seasons, times, and weather conditions and only include traffic signs under certain conditions. This affects the detection performance of the system in the actual environment. In the proposed method, we use data augmentation to improve the robustness of the model. Dan and Dieterich [41] used data augmentation methods such as adding noise and blur in their proposed framework, but these methods are not suitable for the environment of autonomous driving. Therefore, the data augmentation method used in this study mainly simulates the complex environment of autonomous driving to make the detection model more stable.
Twenty data augmentation methods were used in the proposed method, as shown in Figure 7. They can be divided into two categories: pixel-level and spatial-level methods. Pixel-level methods change the pixel values of the input image, leaving other properties such as bounding boxes and the spatial position of the object unchanged. Spatial-level methods change both the spatial information and the object position of the input image. The main purpose of these augmentation methods is to simulate the interference factors in complex traffic scenarios.
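The difference between the two categories can be shown with one example of each. Brightness scaling and horizontal translation are chosen here only as simple representatives and are not claimed to be among the twenty methods of Figure 7; the function names and the list-of-rows image format are likewise assumptions.

```python
def adjust_brightness(image, gain):
    """Pixel-level: scale intensities (0-255); bounding boxes stay unchanged."""
    return [[min(255, max(0, round(p * gain))) for p in row] for row in image]


def translate_x(image, boxes, dx):
    """Spatial-level: shift the image right by 0 <= dx <= width pixels.

    Boxes are (x1, y1, x2, y2) in pixel coordinates and must move with the image.
    """
    width = len(image[0])
    shifted = [[0] * dx + row[:width - dx] for row in image]
    new_boxes = [(min(width, x1 + dx), y1, min(width, x2 + dx), y2)
                 for (x1, y1, x2, y2) in boxes]
    return shifted, new_boxes


image = [[10, 20, 30, 40]]
boxes = [(0, 0, 2, 1)]

print(adjust_brightness(image, 1.5), boxes)  # pixels change, boxes are reused as they are
print(translate_x(image, boxes, 1))          # pixels and boxes both change
```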
We also design a box-level data augmentation to supplement the traffic sign classes that appear less frequently in the dataset. It replaces the objects in the bounding box and blurs the border. It uses a transform T(x, y) to mix two images, $I_b$ and $I_o$, into a new image $I_a$. Here, $I_o$ is the object image, $I_b$ is the background image, and M is a binary mask of the object obtained from the ground-truth annotations. $I_b$ and $I_o$ are images selected from the datasets. The proposed method extracts the object region from $I_o$ and rescales it to fit the object region in $I_b$.
This results in a sharp gradient at the junction, so the method further uses a Gaussian kernel $\alpha$ to blur this junction and alleviate the abnormal fitting caused by drastic gradient changes. The box-level augmentation and its results are shown in Figures 8 and 9.
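One plausible way to realize such a box-level replacement is mask-based blending with a smoothed mask, sketched below on small grayscale images. This is an assumption about the general technique rather than the exact transform used in this study: the 3 × 3 Gaussian-like kernel, the function names, and the requirement that the object is already resized to the target box are all illustrative choices.

```python
def blur3(mask):
    """Smooth a 2D mask with a 3 x 3 Gaussian-like kernel (borders replicated)."""
    k = [1, 2, 1]
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii = min(h - 1, max(0, i + di))
                    jj = min(w - 1, max(0, j + dj))
                    acc += k[di + 1] * k[dj + 1] * mask[ii][jj]
            out[i][j] = acc / 16.0
    return out


def paste_object(background, obj, box):
    """Paste obj (already resized to the box) into background at box = (x1, y1, x2, y2)."""
    h, w = len(background), len(background[0])
    x1, y1, x2, y2 = box
    # Binary mask M: 1 inside the target box, 0 elsewhere.
    mask = [[1.0 if (y1 <= i < y2 and x1 <= j < x2) else 0.0 for j in range(w)]
            for i in range(h)]
    soft = blur3(mask)  # soften the border to avoid a sharp gradient at the junction
    out = [row[:] for row in background]
    for i in range(h):
        for j in range(w):
            inside = y1 <= i < y2 and x1 <= j < x2
            # Outside the box there is no object pixel, so the background is kept.
            src = obj[i - y1][j - x1] if inside else background[i][j]
            out[i][j] = soft[i][j] * src + (1.0 - soft[i][j]) * background[i][j]
    return out


background = [[0] * 5 for _ in range(5)]
sign = [[100] * 3 for _ in range(3)]
for row in paste_object(background, sign, (1, 1, 4, 4)):
    print(row)
```

In the printed result the center of the pasted region keeps the object value, while pixels at the border of the box are mixed with the background, which is the softening effect that the blur is meant to provide.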

Self-Adaption Augmentation (SAA).
In order to reduce the influence of illumination in the on-road detection stage, we propose the SAA module. Illumination is the factor with the greatest influence on traffic sign detection. In the proposed framework, a VGG-16 neural network was trained to classify the illumination of the input image. The illumination is divided into low-, normal-, and high-light classes. In the on-road test stage, when an image is input to the SAA module, it is classified according to its illumination. If the image is under low- or high-illumination conditions, the brightness of the image is adjusted and then processed by the DTA module. We adjusted the original image to two different degrees according to the light intensity (Figure 10). Figure 11 shows an example in which three enhanced sample images are generated by SAA. Then, the augmented samples are input to the detector for processing. If the image is under normal illumination, the trained detection model is used directly for detection.
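The control flow of SAA can be sketched as follows. Several parts are stand-ins: a mean-intensity threshold replaces the trained VGG-16 classifier, the gain values in `GAINS` are hypothetical, and returning the original image together with two adjusted versions is only one possible reading of how three samples are produced.

```python
# Hypothetical brightness gains: two adjustment degrees per non-normal class.
GAINS = {"low": (1.3, 1.6), "high": (0.8, 0.6)}


def classify_illumination(image, low=80, high=175):
    """Stand-in for the trained illumination classifier: thresholds on mean intensity."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    if mean < low:
        return "low"
    if mean > high:
        return "high"
    return "normal"


def scale_brightness(image, gain):
    """Scale pixel intensities and clip them to the 0-255 range."""
    return [[min(255, max(0, round(p * gain))) for p in row] for row in image]


def saa(image):
    """Return the list of samples that are passed on to the detector."""
    label = classify_illumination(image)
    if label == "normal":
        return [image]  # normal illumination: detect directly, no adjustment
    # Original image plus two brightness-adjusted versions.
    return [image] + [scale_brightness(image, g) for g in GAINS[label]]


dark = [[10, 20], [30, 40]]
print(classify_illumination(dark), len(saa(dark)))
```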

Detection Time Augmentation (DTA).
The trained detection model used in real traffic scenarios may output false or missed detections, which can cause autonomous vehicles to make wrong decisions and cause traffic accidents. To increase the robustness of the model, we propose a DTA method for traffic sign detection, as illustrated in Figure 12. It first applies SAA to the input image to generate multiple samples. Then, the augmented images are processed by the trained model, and different results are obtained, which can be divided into three parts: the probability that the image contains the object P(obj, x), the category probability of the object P(cls, x), and the location information of the bounding boxes (w, h, x, y). The framework proposed in this study applies a voting mechanism to the output results to determine whether the input image contains the detection object. In the voting rule, $X_i$ is the i-th augmented image, and A indicates the augmentation methods used in the proposed framework.
If the augmented image contains the object, obj(·) is 1; otherwise, it is 0. The final Obj value depends on the statistical results over the augmented sample images. The object category Cls is also obtained by voting, where C is the total number of categories trained in the detection model, and f(c) counts the categories of detected objects in the augmented images based on whether cls is larger than the preset threshold. The final bounding box coordinates are calculated as the average coordinates of the same detection object on different images, where $L_j$ represents the coordinate position of the same object on different images.
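A voting scheme of this kind can be sketched for a single object observed in several augmented images. The majority rule, both thresholds of 0.5, and the data layout are assumptions made for illustration, since the exact voting equations are not reproduced here.

```python
from collections import Counter


def dta_vote(detections, obj_thr=0.5, cls_thr=0.5):
    """Fuse detections of the same object across augmented images.

    detections: one dict per augmented image with keys
    'obj' (objectness), 'cls' (category -> probability), 'box' (w, h, x, y).
    Returns (category, averaged box), or None when the vote rejects the object.
    """
    positives = [d for d in detections if d["obj"] > obj_thr]
    # Majority vote over the augmented images decides whether the object exists.
    if len(positives) * 2 <= len(detections):
        return None
    # Per category, count the images whose class probability exceeds the threshold.
    votes = Counter()
    for d in positives:
        for category, prob in d["cls"].items():
            if prob > cls_thr:
                votes[category] += 1
    if not votes:
        return None
    category = votes.most_common(1)[0][0]
    # Average the box coordinates over the images that voted for the object.
    n = len(positives)
    box = tuple(sum(d["box"][i] for d in positives) / n for i in range(4))
    return category, box


detections = [
    {"obj": 0.9, "cls": {"stop": 0.8}, "box": (10, 10, 100, 50)},
    {"obj": 0.7, "cls": {"stop": 0.6, "yield": 0.2}, "box": (12, 12, 102, 52)},
    {"obj": 0.2, "cls": {"stop": 0.1}, "box": (0, 0, 0, 0)},
]
print(dta_vote(detections))
```

A detection missed in one augmented image can thus still be recovered when the other samples agree, which is how the voting is intended to reduce missed detections.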

Datasets
Traffic sign detection is an important task in the field of autonomous driving. Although some general image datasets, such as VOC, ImageNet, and MSCOCO, contain images of traffic signs, these data cannot be used to train traffic sign detection models. The accuracy and robustness of a network in an actual environment differ depending on the traffic sign dataset used for training. The widely used German Traffic Sign Detection Benchmark (GTSDB) [42] contains three types of traffic signs (mandatory signs, warning signs, and prohibitory signs) and consists of 900 images with a size of 1360 × 800. The dataset published by the Laboratory for Intelligent and Safe Automobiles (LISA) [43] includes 47 types of US traffic signs, with 7,855 images and 6,610 signs ranging in size from 6 × 6 to 167 × 168 pixels taken from American traffic video frames. The traffic signs contained in these datasets are quite different from Chinese traffic signs in color and shape; hence, models trained with datasets such as GTSDB or LISA cannot be directly applied to the recognition of Chinese traffic signs.
Tsinghua-Tencent 100K (TT-100K) [44] is a public dataset collected in China, containing 16,000 images and consisting of 27,000 traffic sign instances divided into 211 categories. The Chinese Traffic Sign Dataset (CTSD) published by the Chinese Academy of Sciences contains 1,100 images divided into 700 training images and 400 testing images with sizes of 1024 × 768 and 1280 × 720. The CSUST Chinese Traffic Sign Detection Benchmark (CCTSDB) [45] expands the CTSD by adding 5,200 images collected from the highway with a size of 1000 × 350. Models trained on these public datasets cannot be applied to real traffic scenarios.
In this study, we present a collected and annotated dataset of traffic signs named the Beijing Union University Chinese Traffic Sign Detection Benchmark (BCTSDB). The sensor-equipped autonomous vehicle used to collect the data is shown in Figure 13. The sensing system consists of a millimeter-wave radar, monocular camera, lidar, infrared camera, GPS receiver, and ultrasonic radar. The monocular camera is used to record video of traffic scenes.

Experimental Setup.
The experimental parameters of the proposed model are summarized in this section. The computer configuration included two NVIDIA TITAN V graphics cards, with a total of 24 GB VRAM. PyTorch was used to implement the network structure. Adam was used as the optimizer with a weight decay [46] of 0.0001. The initial learning rate was set to $2.5 \times 10^{-5}$. The backbone network was initialized using pretrained weights from ImageNet [47], and Xavier initialization [48] was used for new layers. The default numbers of proposal boxes, proposal features, iterations, and SAA samples were 100, 100, 6, and 3, respectively. Our method was evaluated on the BCTSDB and TT-100K. We replaced BN with SynBN, which has been successfully used in MegDet [49], to accelerate model training and to make the training parameters independent of the number of GPUs.

Performances on BCTSDB.
The experiments used average precision (AP) to compare the accuracies of different models. Both recall and precision are considered in the calculation of the AP, which takes the average value of the precision at each recall point from 0 to 1. Precision is the proportion of detections that are correct, and recall is the proportion of labeled objects in the image that are detected correctly.
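The AP computation can be sketched for one class as the area under the interpolated precision-recall curve. The all-point interpolation used here is one common convention and is an assumption, as is the input format; for AP50 and AP75, a detection would be marked as a true positive when its IoU with a ground-truth box is at least 0.5 or 0.75, respectively.

```python
def average_precision(scores, is_true_positive, num_gt):
    """AP as the area under the interpolated precision-recall curve.

    scores: confidence of each detection
    is_true_positive: whether each detection matches a ground-truth object
    num_gt: number of labeled objects (must be positive)
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Interpolation: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum the precision over the recall increments.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Three detections for two labeled signs; the second detection is a false positive.
print(average_precision([0.9, 0.8, 0.7], [True, False, True], num_gt=2))
```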
We randomly divided the BCTSDB into 14,591 training set images containing 23,440 annotated labels and 1,099 test set images containing 1,803 annotated labels. Figure 15 shows the detection results on BCTSDB. The top part of the figure is the original image, and the bottom part is the detection result, which displays the detected bounding boxes on the images. It can be clearly seen in Figure 15 that the method proposed in this study can effectively detect traffic signs. Figure 16 also shows that our proposed model converges faster and reaches a lower loss value.
Figure 13: Autonomous vehicle sensors' layout.
The experiments in this study showed that DTA can effectively improve the accuracy and robustness of the model. For instance, in Figure 17, the upper image shows that the network misses detections without the DTA module, and the lower image shows the detection result after using DTA, revealing that our method can detect all traffic signs in the image.
To verify the effectiveness of each proposed module, we conducted ablation experiments. Data augmentation, multiscale attention, DTA, and SynBN were added step by step to the MSR baseline. The same parameters and training schemes were used for each ablation experiment. The results of the ablation studies on the BCTSDB are listed in Table 2. Compared with sparse R-CNN with ResNeSt101, the proposed method improves AP50 and AP75 by 3.3% and 3.7%, respectively.
We further evaluate the effectiveness of commonly used data augmentation and box-level data augmentation, as listed in Table 3. Experiments have proved that both the commonly used data augmentation methods and box-level augmentation can improve the detection accuracy of the model.
Comparisons among the different methods are presented in Table 4, which lists the detection results based on RetinaNet, YOLOv3, YOLOv5, faster R-CNN, cascade R-CNN, and sparse R-CNN with different backbones. Our method can obtain more competitive results, with AP50 and AP75 values of 99.1% and 96.2%, respectively, which are better than the results of other methods. Compared to the original sparse R-CNN with ResNet101, our model can improve AP50 and AP75 by 4.4% and 12%, respectively. It can also be seen that the method proposed in this study improves the detection accuracy and has little impact on FPS.

Performances on TT-100K.
We also evaluated the method proposed in this article on the TT-100K data, using 6,103 images containing 16,524 labeled boxes as the training dataset and 3,067 images containing 8,181 labeled boxes and 221 categories of traffic signs as the test dataset. The comparison results listed in Table 5 show that the detection accuracy of the proposed method is greatly improved compared with the existing algorithms. Compared to the original sparse R-CNN with ResNet101, our model improves AP50 and AP75 by 7.9% and 8.9%, respectively, and runs at 18 fps using our proposed backbone network. The TT-100K dataset itself has an uneven distribution of image categories, and hence the results of current detection algorithms are generally low on TT-100K. The bottom row of Figure 18 illustrates the detection results of the proposed method on TT-100K.

On-Road Testing.
To further evaluate the model performance in real traffic scenarios, we deployed the model on autonomous vehicles. The autonomous vehicles used for the on-road testing are shown in Figure 19, and the on-road test area is illustrated in Figure 20. This area is the urban road environment competition area of China's Intelligent Vehicle Future Challenge. The area includes many types of intersections, urban road traffic signs, and road markings.
In this part, we used the maximum detectable distance to evaluate our proposed method and calculated the average distance, with the standard deviation as the error, from the images collected during the autonomous driving mode. In Figure 21, the box represents the quartiles, the line inside the box represents the median of the distance, and the ends of the boxes represent the minimum and maximum of each set of distances. It can be concluded from the experimental results that our method can detect traffic signs up to 200 meters away, which provides more processing time for the decision-making module and control module of the autonomous driving system.

Conclusion
Traffic sign detection can achieve high accuracy in an ideal environment, but when applied to autonomous vehicles, the detection accuracy is reduced by complex traffic scenes. In this study, we address this gap through an improved sparse R-CNN method. The main contribution of this study is to integrate the attention mechanism and feature pyramid into the backbone network, so that the extracted features can focus on useful information. The data augmentation method is used to simulate complex traffic scenes. We also present a traffic sign dataset, BCTSDB. The SAA and DTA modules make the on-road traffic sign detection of the autonomous vehicle more robust. The experimental results on the BCTSDB and TT-100K datasets verify the effectiveness of the method in this study. The AP50 and AP75 of the proposed method are 99.1% and 96.2% on BCTSDB, and 53.1% and 48.7% on TT-100K, respectively, which indicates that our proposed method achieves state-of-the-art results.
In the future, our work will focus on increasing the detection speed of the high-accuracy detection algorithm. The development of XAI [50] may provide a quick solution to this problem. In addition, applying HD maps, V2X, and 5G technologies to autonomous driving is a way to accelerate industrialization.

Data Availability
The image data used to support the findings of this study are available from the corresponding author upon request.