Research on Multiscene Vehicle Dataset Based on Improved FCOS Detection Algorithms

Whether in intelligent transportation or autonomous driving, vehicle detection is an important component. Vehicle detection still faces many problems, such as inaccurate localization and low detection accuracy in complex scenes. FCOS, a representative anchor-free detection algorithm, caused a sensation when it appeared, but now seems slightly insufficient. Based on this situation, we propose an improved FCOS algorithm. The improvements are as follows: (1) we introduce a deformable convolution into the backbone to solve the problem that the receptive field cannot cover the overall target; (2) we add a bottom-up information path after the FPN of the neck module to reduce the loss of information during propagation; (3) we introduce a balance module, following the balance principle, which reduces the inconsistent detections of the bbox head caused by the mismatched variances of different feature maps. To strengthen the comparative experiments, we extracted the vehicle categories from the UA-DETRAC, COCO, and Pascal VOC datasets. The experimental results show that our method achieves good results on this dataset.


Introduction
In recent years, with the rapid development of the automobile industry, the number of motor vehicles in cities has also grown rapidly. By the end of 2020, the number of motor vehicles in the country had reached more than 300 million, bringing colossal traffic pressure and many traffic-regulation problems. As traffic pressure and traffic-control problems become more serious, they bring many inconveniences to the production and life of urban residents and also restrict the rapid development of cities and towns. On the other hand, with the gradual maturity of artificial intelligence, its application to vehicles has received extensive attention, and vehicle detection is the focus of this article. Moreover, vehicle detection covers more and more areas, such as road traffic monitoring, automatic entrance gates of residential communities, charging-pile parking places, and automated driving.
Pedestrians or private cars often occupy unguarded charging areas, causing inconvenience to new-energy vehicle owners. When a vehicle is fully charged, the owner needs to be reminded to move the car away from the charging pile.
There are still many problems with vehicle detection in autonomous driving. The vehicle may be seriously occluded or misidentified in different scenarios, which is often one of the leading causes of accidents in self-driving cars. In addition, before license plate recognition, it is often necessary to detect the vehicle body first to narrow the identification range, reduce interference from the surrounding environment, and improve accuracy. Under these conditions, vehicle detection is still a challenging task. The primary purpose of vehicle detection can be divided into two parts. The first is to determine whether a vehicle (such as a bus, truck, or car) appears in the video or image; during detection, its location needs to be determined and marked. The second is to determine the specific category: the particular sort of vehicle is determined by analyzing the semantic information (e.g., [1]) of the car in the frame to complete the vehicle detection task.
Neural network design often follows several principles. The main one is to shorten the information path and enhance information propagation; for example, the residual connection [2] and dense connection [3] have played this role perfectly. It is also effective to improve the flexibility and diversity of information channels, typically using a split-transform-merge strategy [4]. There are also many methods [5][6][7] that combine high-resolution image information with high-level semantic information.
Driven by these cutting-edge algorithms, we propose an improved FCOS algorithm to detect vehicles, consisting mainly of three points. Firstly, the car is a rigid structure and changes laterally under different shooting angles and scene occlusions.
The receptive field of the standard convolution kernel is rectangular, extending to the surroundings, and may not completely cover the entire car. We introduce Deformable ConvNets [8], which enable the convolution kernel to adaptively learn the position deviation of the response according to the target's deformation. Secondly, adding a bottom-up module to the original FPN further enhances the flow of information between different feature layers, reducing the distance between the bottom layer and the top layer. Thirdly, Pang et al. [9] earlier put forward a new idea called Libra R-CNN, arguing that today's detectors all follow region selection, feature extraction, and then gradual convergence under the guidance of a multitask loss, which directly affects the effect of model training. Based on this balance concept, we add a balanced module after the improved FPN, which integrates feature maps of different resolutions, uses the integrated features to strengthen the original pyramid, and then adds nonlocal attention to enhance contextual connections in the network.
This operation can reduce the inconsistency of bbox head recognition caused by different variances, and the whole process is cost-free through interpolation and pooling. The work of this paper is as follows: (1) the introduction of the DCN [8] allows the receptive field of the convolution kernel to change adaptively; (2) we add a bottom-up module behind the traditional FPN [6] to reduce the distance from the bottom to the top; (3) we add a balanced module after the improved FPN to reduce the inconsistency of the bbox head predictions.

Related Works
This section mainly introduces the research methods used for vehicle detection and the FCOS algorithm.

Detection Algorithm.
Many methods based on combining feature extraction and classifiers have previously been proposed to achieve vehicle detection. For example, HOG [10] feature detection was applied first, and then HOG [10] and LBP [11] features were combined to further improve the accuracy of vehicle detection; Li Xiangfeng et al. put forward the Haar [12] feature algorithm. Although these algorithms achieve good detection results in simple scenarios, they struggle to deal with complex scenes.
After 2012, the convolutional neural network-based algorithm in deep learning became popular, extracting semantic information about vehicles from different feature layers and solving the problem of insufficient robustness of traditional algorithms when the dataset is sufficient.
Detection can be done in three ways. The first is the two-stage regression algorithm (e.g., [9,[13][14][15][16][17][18][19][20][21]). These algorithms all propose an anchor as a prior to further improve accuracy and speed up the convergence of the network model, but they are often slower than single-stage detection algorithms. The second is the single-stage algorithm, which predicts categories and bounding boxes directly from the feature maps without a separate proposal step and is typically faster; the FCOS [22] baseline improved in this paper belongs to this category.
The last one is the direction proposed by Facebook, represented by the transformer [28], which introduces the transformer from NLP into CV, greatly simplifying the network model. However, it still has some shortcomings in accuracy and remains far from engineering deployment, as in [29][30][31][32].

FCOS Algorithm.
We choose an anchor-free network because the anchor-based algorithm has many limitations: (1) the result is susceptible to the size, number, and aspect ratio of the anchors, and different tasks need readjustment, which is not conducive to generalization; (2) to better match the GT boxes, many anchors need to be generated, most of which are marked as negative samples, causing an imbalance between positive and negative examples; (3) it is necessary to calculate IoU values, which consumes a lot of computing power, slows down detection, and increases cost. FCOS [22] is an anchor-free detection algorithm. The accuracy of previously proposed anchor-free algorithms differed considerably from that of anchor-based algorithms, but FCOS [22] successfully surpassed the anchor-based detection algorithms through alternative solutions and became the SOTA of that year.
The definition of positive and negative samples in the FCOS [22] algorithm is quite different from before. If a location (x, y) falls into any GT box B = (x0, y0, x1, y1), it is a positive sample, and the distances between this point and the four sides of the bounding box are regressed, as shown in Equation (1):

l* = x − x0, t* = y − y0, r* = x1 − x, b* = y1 − y. (1)

Previous anchor-free algorithms had no optimal solution for overlapping GT box regions, so the point regression was ambiguous. In FCOS [22], the ambiguity is significantly reduced through the feature pyramid. As we all know, the shallow layers of a neural network are rich in detailed features, which benefits small-target detection [33], while higher levels carry more semantic features, which are used to detect large targets. To reduce the overlap of objects with significant scale differences, the parameter m_i refers to the maximum regression distance of feature map i. If a location (x, y) satisfies max(l*, t*, r*, b*) > m_i or max(l*, t*, r*, b*) < m_{i−1}, it is set as a negative sample on that level and no regression is performed. Here, m_i is set to 0, 64, 128, 256, 512, and +∞, respectively, dividing the range into five intervals to reduce the overlapping area. If a location still falls into overlapping boxes on one level, it simply regresses to the box with the smallest area.
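To make the assignment rule concrete, here is a minimal Python sketch (not the authors' code; the function names are ours) of the regression targets of Equation (1) and the m_i-based level assignment:

```python
# A minimal sketch (not the authors' code; names are ours) of the FCOS-style
# regression targets of Equation (1) and the m_i-based pyramid-level
# assignment, with thresholds 0, 64, 128, 256, 512, +inf.
M = [0, 64, 128, 256, 512, float("inf")]

def regression_targets(x, y, box):
    """Distances (l*, t*, r*, b*) from location (x, y) to the box sides."""
    x0, y0, x1, y1 = box
    return x - x0, y - y0, x1 - x, y1 - y

def assign_level(targets):
    """Return the 0-based pyramid level whose range (m_{i-1}, m_i] contains
    max(l*, t*, r*, b*), or None for a negative sample."""
    m = max(targets)
    if m <= 0:  # the location lies outside the GT box
        return None
    for i in range(1, len(M)):
        if M[i - 1] < m <= M[i]:
            return i - 1  # level 0 corresponds to the shallowest map
    return None

t = regression_targets(100, 80, (60, 50, 200, 150))
print(t)                # (40, 30, 100, 70)
print(assign_level(t))  # max is 100, in (64, 128] -> level 1
```

A location is thus handled by exactly one pyramid level, which is what removes most of the regression ambiguity for overlapping objects of different sizes.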
To further suppress prediction boxes far away from the center of the GT box, FCOS [22] applies the centerness method (Equation (2)) and uses BCE loss [34] to optimize the centerness branch:

centerness* = sqrt((min(l*, r*)/max(l*, r*)) × (min(t*, b*)/max(t*, b*))). (2)
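As a quick illustration (a sketch, not the official implementation), the centerness target can be computed as:

```python
from math import sqrt

# A small sketch of the centerness target of Equation (2):
# centerness* = sqrt(min(l,r)/max(l,r) * min(t,b)/max(t,b)).
def centerness(l, t, r, b):
    return sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

print(centerness(50, 50, 50, 50))  # 1.0 at the exact box center
print(centerness(10, 50, 90, 50))  # ~0.333, lower near the border
```

The score decays from 1 at the box center toward 0 at the border, so multiplying it into the classification score downweights low-quality boxes during NMS.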

Methods
This chapter gives a detailed overview of the improved parts of the algorithm (the complete structure is shown in Figure 1). The deformable convolution [8] is equipped with a suppression factor that reduces the influence of noise and background. The added bottom-up module significantly reduces the loss of information on its way to the top layers. The balanced module feeds the integrated feature map into a nonlocal attention structure and then redistributes it to obtain new feature maps. In this paper, the improved FCOS greatly increases accuracy without adding much computation.

Deformable Convolution.
At present, the convolution unit commonly used in CNNs has a fixed geometric structure: within the same feature layer, the receptive field size of all activation units is the same. Considering its characteristics, a vehicle changes appearance under different shooting angles and scene occlusions. The receptive field of the standard convolution kernel is rectangular, extending to the surroundings, and may not completely cover the entire car. Therefore, to allow the detection algorithm to adapt to the scale, posture, and geometric changes of the vehicle target, we introduce deformable convolution [8], which adaptively determines the shape of the receptive field to improve detection and positioning accuracy. Compared with standard convolution, the DCN [8] is more in line with the actual situation. Its principle is shown in Figure 2.
We introduced the deformable convolution of the DCN [8] (illustrated in Figure 3). Deformable convolution is mainly composed of two parts: (1) the previous feature map is convolved to obtain the offsets; (2) according to the offset values, the new sampling coordinates are obtained and convolved to generate a new feature map. In actual training, the model may focus on areas outside the target, introducing noise that is not conducive to detection, so a suppression factor is added to make the model focus more on the target we need.
For example, R = {(−1, −1), (−1, 0), . . ., (0, 1), (1, 1)} denotes the sampling grid of a 3 × 3 convolutional kernel, and the convolution is calculated as in Equation (3):

y(p) = Σ_k w_k · x(p + p_k + Δp_k) · Δm_k, (3)

where Δp_k is the learned offset and Δm_k is the suppression factor, which assigns different weights to the target area and the noisy background area. The sampling coordinate of the convolution kernel on the original feature map is p_k + Δp_k. In the actual calculation, these coordinates are fractional, so we use bilinear interpolation to resolve them, as shown in Equation (4):

x(p) = Σ_q G(q, p) · x(q), (4)

where q enumerates the integer grid locations and G(·, ·) is the bilinear interpolation kernel.
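A small numpy sketch (our own illustration, assuming a single-channel map) of the bilinear sampling in Equation (4):

```python
import numpy as np

# A numpy sketch (our own illustration, single-channel map) of the bilinear
# sampling of Equation (4): a fractional coordinate p = p_k + delta_p_k is
# resolved by weighting the four surrounding integer grid points.
def bilinear_sample(feat, py, px):
    """Sample feat (H x W) at a fractional location (py, px)."""
    h, w = feat.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = py - y0, px - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0]
            + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0]
            + wy * wx * feat[y1, x1])

feat = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(feat, 1.5, 2.5))  # 8.5, the mean of 6, 7, 10, 11
```

Because the weights G(q, p) are differentiable in p, the offsets Δp_k can be learned end-to-end by backpropagation.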
In the experiment, the DCN [8] was added to the C3-C5 layers of the backbone, which brought a considerable increase in accuracy, and we additionally applied it to the C2 feature map to further improve the learning ability of the model.
Improved FPN.

We modified the FCOS [22] neck module. The previous FPN [6] mainly improves the target detection effect by fusing high- and low-level features, which especially benefits small targets. As we all know, high-level features contain semantic information, while low-level features contain more concrete descriptions of detail. Driven by PAN [7] (the champion of the instance segmentation competition that year), this paper adds a bottom-up path augmentation module, shown in Figure 4, after the traditional FPN [6]. In the FPN [6] algorithm, a top-down process is required, so transferring shallow features to the top layer requires dozens or even more than one hundred network layers; after such a long path, the shallow feature information is seriously lost. The bottom-up path augmentation added in this article connects the shallow features to P2 through the lateral connections of the original FPN and then passes them from P2 to the top layer along the bottom-up path. The number of layers traversed is fewer than 10, which better retains the shallow feature information.
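The data flow above can be sketched as follows (a toy numpy version with assumed shapes, using average pooling as a stand-in for the stride-2 convolution):

```python
import numpy as np

# A toy sketch (assumed shapes, not the authors' code) of bottom-up path
# augmentation: P2..P5 come from the top-down FPN; N2..N5 are rebuilt
# bottom-up so shallow localization features reach the top in a few steps.
def downsample2x(x):
    """Stride-2 average pooling as a stand-in for the stride-2 convolution."""
    return 0.25 * (x[::2, ::2] + x[1::2, ::2] + x[::2, 1::2] + x[1::2, 1::2])

def bottom_up_augmentation(P):
    """P: [P2, P3, P4, P5], each level half the spatial size of the previous."""
    N = [P[0]]  # N2 is simply P2
    for level in P[1:]:
        N.append(downsample2x(N[-1]) + level)  # N_{i+1} = down(N_i) + P_{i+1}
    return N

P = [np.ones((16 // 2**i, 16 // 2**i)) for i in range(4)]  # P2..P5
N = bottom_up_augmentation(P)
print([n.shape for n in N])  # [(16, 16), (8, 8), (4, 4), (2, 2)]
```

Each N level is only one downsampling step away from the previous one, which is why shallow information reaches the top through far fewer layers than in the backbone's top-down route.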

Balance Module.
The previous improvements make the original image pass through the FPN [6] to extract multiscale features from top to bottom and then through the bottom-up path to strengthen the localization features. The information of adjacent-resolution feature maps is aggregated and strengthened, but some problems remain: the former does not consider the aggregation of hierarchical feature information across nonadjacent resolutions, and the variance of each feature map differs, so inconsistency problems arise when the maps are sent to the bbox head. Thanks to Libra R-CNN's [9] success, we add a balance module after the improved FPN (Figure 5). It resizes feature maps of different resolutions to the same size, adds the features element-wise, and divides by the number of levels to achieve aggregation (Equation (5)):

C = (1/L) Σ_l C_l, (5)

where L is the number of pyramid levels and C_l is the l-th resized feature map. The feature information of the different scales is aggregated into the N4 feature map and then sent to the following refine structure. The refine structure introduces a nonlocal attention mechanism, which captures long-distance dependence, that is, it builds connections between two nonadjacent pixels in the image. When calculating the response at an arbitrary position, nonlocal attention considers the context of the feature map to assign weights adaptively.
The nonlocal operation is defined as Equation (6):

y_i = (1/C(x)) Σ_j f(x_i, x_j) g(x_j), (6)

where i is the location on the output feature map, j enumerates all other locations, x is the input feature map, f is a pairwise function that computes the correlation between position i and every other position j, g is a unary function that transforms the input, and C(x) is the normalization factor.
Figure 6 shows the specific form of nonlocal attention. First, the input feature map is convolved three times (1 × 1 convolutions) to obtain the θ, φ, and g features, and then the correlation between two positions is calculated through the f function (part 3 in the structure diagram).
Equations (7) and (8) compute the Gaussian distance in the embedding space, corresponding to parts 1 and 2 in the structure diagram: f(x_i, x_j) = exp(θ(x_i)^T φ(x_j)).
Then, the above three features are reshaped, keeping only the channel dimension, and the correlation is calculated by matrix multiplication of θ and φ. The weights are then normalized to 0∼1 by a softmax operation; Equations (9) and (10) are combined to obtain the final form. Finally, the attention coefficients are multiplied back onto the feature matrix g, the number of channels is restored, and the result is added to the original input feature map as a residual operation (part 4 in the structure diagram) to obtain the refined feature map, enhancing the relationship between feature maps and balancing the variance.
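The whole balance module can be sketched in numpy as below (a toy version: random matrices stand in for the 1 × 1 convolutions, and the names and shapes are our assumptions, not the authors' code):

```python
import numpy as np

# A numpy sketch of the balance module (random stand-in weights replace the
# 1 x 1 convolutions): pyramid levels are resized to one size and averaged
# (Equation (5)), then refined with embedded-Gaussian nonlocal attention
# and a residual connection.
rng = np.random.default_rng(0)

def resize_nearest(x, size):
    """Nearest-neighbor resize of a 2-D map to (size, size)."""
    h, w = x.shape
    return x[np.ix_(np.arange(size) * h // size, np.arange(size) * w // size)]

def integrate(levels, ref):
    """C = (1/L) * sum_l resize(C_l): each level contributes equally."""
    return sum(resize_nearest(x, ref) for x in levels) / len(levels)

def nonlocal_refine(x):
    """x: (N, C) flattened map. Embedded-Gaussian attention + residual."""
    n, c = x.shape
    theta, phi, g, w_out = (rng.normal(size=(c, c)) for _ in range(4))
    att = (x @ theta) @ (x @ phi).T           # pairwise similarity f
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)     # softmax normalization C(x)
    return x + (att @ (x @ g)) @ w_out        # aggregation + residual

levels = [np.full((s, s), float(i)) for i, s in enumerate([16, 8, 4, 2])]
fused = integrate(levels, ref=8)
print(fused.shape, fused[0, 0])  # (8, 8) 1.5, the mean of levels 0..3
refined = nonlocal_refine(rng.normal(size=(64, 4)))
print(refined.shape)             # (64, 4): the shape is preserved
```

Since the module only uses interpolation, pooling, and matrix products, and the refined map is redistributed to the original resolutions, the extra cost is small, matching the paper's "cost-free" claim.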

Loss Function.
The final loss function is

L({p_{x,y}}, {t_{x,y}}) = (1/N_pos) Σ_{x,y} L_cls(p_{x,y}, c*_{x,y}) + (λ/N_pos) Σ_{x,y} 1_{c*_{x,y} > 0} L_reg(t_{x,y}, t*_{x,y}).

To highlight the improvements, we have not changed the loss function of the original algorithm. Here, L_cls is the focal loss as in [26], which greatly reduces the imbalance between positive and negative samples. L_reg is the IoU loss as in UnitBox [35], which accounts for the correlation between the coordinates, unlike a weighted sum of L1 and L2 losses. N_pos denotes the number of positive samples, and λ, set to 1 in this paper, is the balance weight for L_reg. The summation is calculated over all locations on the feature maps, and 1_{c*_i > 0} is the indicator function, being 1 if c*_i > 0 and 0 otherwise.
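For illustration (a binary, single-location sketch under our own simplifications, not the authors' exact implementation), the two loss terms behave as follows:

```python
import numpy as np

# A binary sketch (our own simplification, not the authors' exact code) of
# the two loss terms: focal loss for classification and -ln(IoU) for the
# regression branch, combined with lambda = 1 as in the paper.
def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p and label y."""
    pt = p if y == 1 else 1 - p
    a = alpha if y == 1 else 1 - alpha
    return -a * (1 - pt) ** gamma * np.log(pt)

def iou_loss(pred, gt):
    """-ln(IoU) between two (l, t, r, b) distance tuples at one location."""
    inter = (min(pred[0], gt[0]) + min(pred[2], gt[2])) * \
            (min(pred[1], gt[1]) + min(pred[3], gt[3]))
    area_p = (pred[0] + pred[2]) * (pred[1] + pred[3])
    area_g = (gt[0] + gt[2]) * (gt[1] + gt[3])
    return -np.log(inter / (area_p + area_g - inter))

print(iou_loss((40, 30, 100, 70), (40, 30, 100, 70)) == 0.0)  # True
print(focal_loss(0.9, 1) < focal_loss(0.5, 1))                # True
```

The (1 − pt)^γ term shrinks the contribution of easy, well-classified locations, and the IoU loss couples all four distances into one geometric objective instead of penalizing them independently.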

Experiment
Our experiment performed detection on three diverse datasets, UA-DETRAC, MSCOCO2017, and Pascal VOC, for joint training (where from each dataset we only use pictures in the car, bus, and truck categories).

Accuracy Experiment.
We report our main results on the test set (approximately 2,000 images) by uploading detection results to the server. We first forward the input image through the network and obtain the predicted bounding boxes with predicted classes. Unless specified, the postprocessing and data augmentation use the official mmdetection defaults. We hypothesize that the performance of our detector could be improved further with careful hyperparameter tuning. We compare against mainstream algorithms of recent years, and the results are in Table 2. Compared with the other algorithms, our method achieves the best performance on this dataset.

Model Complexity.
We also tested the complexity of each model on the dataset, as shown in Table 3. It can be seen from the table that one-stage networks often have fewer parameters than two-stage networks, and our model dramatically improves accuracy while keeping the parameter count low. For the GFLOPs indicator, the cost rises only slightly.

Visualization of Results
Figure 9 is a visualization of the algorithm's results. It can be seen that vehicles are detected whether they are at a distance or in unusual scenes, even when there is little picture information, which demonstrates the superiority of our algorithm.

Conclusions
We improved the detection algorithm based on the anchor-free FCOS [22]: we introduced the DCN [8] into the original backbone to broaden the receptive field of the convolution kernel, introduced a bottom-up module to improve the FPN and reduce information loss during transmission, and added a balance module because the differing variances of the feature pyramid affect accuracy. These changes have a good effect and demonstrate the superiority of the improved algorithm.

Figure 2: All examples use a 3 × 3 convolutional kernel. (a) Standard convolution: green dots are the sampling points. (b) Deformable convolution: sampling locations (dark blue points) with augmented offsets (light blue arrows).

Figure 4: Block diagram of bottom-up path augmentation.

Figure 1: Improved FCOS algorithm structure: we show all the tricks in one picture for easy reading. Among them, C2∼C5 are the feature layers output by the backbone, the first N2∼N6 group is the improved FPN, and the second group is the result of adding nonlocal attention.

Figure 7: Ablation experiment example. (a) Without DCN, the network's attention is relatively scattered, and the sampling area cannot cover the overall target. (b) After adding DCN, the sampling area is more comprehensive and more features can be learned.

Figure 8: The line chart of the cls and bbox losses, where loss_cls converges faster and fluctuates less.
The dataset is divided into training, verification, and test sets. This article samples all vehicle pictures in the training and validation sets, and the total dataset is about 20,000 images (the specific information is in Table 1). All the annotation information is represented in the VOC format; 90% is used for training and verification, and the remaining 10% is used for testing. The experiments in this article are all based on the mmdetection [36] framework developed by SenseTime.

Table 1: The distribution of the dataset.

Table 2: Our proposed algorithm vs. other algorithms, with ResNet-50 and FPN as defaults.

Table 4: Ablation study for the proposed methods.