UAV Image Small Object Detection Based on Composite Backbone Network

Small objects in traffic scenes are difficult to detect. To improve the accuracy of small object detection in images taken by unmanned aerial vehicles (UAVs), this study proposes a feature-enhancement detection algorithm based on the single shot multibox detector (SSD), named the composite backbone single shot multibox detector (CBSSD), which uses a composite connection backbone to enhance feature representation. First, to improve the detection of small objects, the lead backbone network, VGG16, is kept unchanged, ResNet50 is added as an assistant backbone network, and the residual structure in ResNet50 is used to obtain low-level feature information. This low-level feature information is then fused into the lead network, allowing the lead network to retain rich low-level features. As a result, the prediction layers contain more low-level feature information. The experimental results show that CBSSD has a significantly higher recognition rate and a lower false detection rate than conventional algorithms and that it maintains a good detection effect under low illumination. These results are significant for small object detection in UAV images of traffic scenes. Furthermore, a method to improve the SSD algorithm is proposed.


Introduction
Recently, with the rapid development of artificial intelligence, unmanned aerial vehicle (UAV) detection technology has been widely applied to real traffic scenes [1,2]. Vehicle and pedestrian detection, as an important part of UAV detection technology, is of research significance [3,4]. Object detection methods can be classified as conventional machine learning methods and deep learning methods. Conventional machine learning methods first preprocess the image and then determine candidate regions using the sliding window technique. Subsequently, features of the candidate regions are extracted, and a classifier determines the class of each object to realize object detection. Common hand-crafted features include the scale-invariant feature transform [5], histogram of oriented gradients [6], Haar [7], and speeded-up robust features [8]. However, because conventional machine learning methods rely on manually designed features and the feature extraction process is complex, these methods often suffer from poor generalization ability, slow detection speed, low detection accuracy, and difficulty in adapting to detection tasks in different scenarios.
To address the abovementioned problems, general object detection methods based on convolutional neural networks (CNNs) have gradually gained research attention. At present, object detection methods based on deep convolutional networks are classified as two- and single-stage detection methods. Among the two-stage detection methods, the Faster R-CNN proposed by Ren et al. [9] has the best performance. This network introduces a region proposal network (RPN) that simultaneously predicts the object boundary and object score at each position. After end-to-end training, high-quality region proposals are generated to improve the detection accuracy of the network. To address efficiency, single-stage methods were proposed, with representative methods being you only look once (YOLO) [10] and the single shot multibox detector (SSD) [11][12][13]. YOLO uses the feature map at the top of the CNN to predict category confidences and bounding-box offsets, and it treats detection as a regression problem, which provides the advantage of fast detection. However, YOLO uses a fully connected network, which leads to the loss of spatial information, positioning errors, and missed detections, and in particular to poor detection of small objects, which reduces the final detection accuracy. SSD borrows the anchor idea from Faster R-CNN and uses multiple feature maps of different scales for detection. SSD can detect objects of various sizes because the receptive fields of the feature maps differ. However, the semantic information of the shallow SSD feature maps is poor; therefore, SSD is not well suited to small object detection. To further improve the detection effect of SSD on small objects, Li et al. [14] proposed a feature fusion SSD (FSSD) model, which is an enhanced SSD model with a novel lightweight feature fusion module that can significantly improve SSD performance.
In the feature fusion module, the features of different layers are connected at different scales. Some subsampling blocks generate new feature pyramids, which are sent to multi-bounding box detectors to predict the final detection results. Recently, Liu et al. [15] proposed a new target detection method called the composite backbone network architecture (CBNet). This approach improves the performance of object detectors by combining multiple identical backbones, and CBNet can be easily integrated into most of the advanced detectors, thus significantly improving their performances.
In summary, although object detection technology is well developed, problems still arise in the detection of small objects. To solve this problem, we propose a composite backbone SSD (CBSSD) object detection method. Based on the CBNet network, we introduce the ResNet50 [16][17][18] network. Its residual structure is used to improve the feature extraction ability of the network, retain richer low-level information, and merge deep and shallow features to improve the detection accuracy of the network.

Methods
Based on the CBNet network, the CBSSD method consists of a lead backbone network SSD and an assistant backbone network ResNet50, as shown in Figure 1.

Lead Network
2.1.1. Network Structure. In this study, the SSD model with VGG16 [17,19,20] as the main network was selected. VGG16 is a classical network with a depth of 16 layers that uses 3 × 3 convolution kernels of a single size. The SSD method is based on a feedforward convolutional network, which generates a set of fixed-size priori-bounding boxes together with scores for the presence of object class instances in those boxes and then produces the final detection result through non-maximum suppression (NMS) [21,22]. The first few network layers are based on a standard architecture for high-quality image classification, which is called the basic network. The feature extraction layers conv8_2, conv9_2, conv10_2, and conv11_2 were added to the basic network. SSD differs from YOLO in that SSD performs predictions on the five previously selected feature maps in addition to object detection on the final feature map. Figure 2 shows the schematic of the SSD network prediction. Note that detection is conducted not only on the added feature maps but also on the basic network feature maps conv4_3 and conv7 to ensure that the network has a good detection effect on small objects.
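The NMS step mentioned above can be illustrated with a minimal greedy sketch. This is not the authors' implementation: the corner box format `(xmin, ymin, xmax, ymax)`, the function names, and the IoU threshold of 0.45 are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: return indices of kept boxes, highest score first.

    The threshold value is an assumption, not a setting reported in the text.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this toy case the second box overlaps the first with an IoU of about 0.82, so it is suppressed and only the first and third boxes remain.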

Priori-Bounding Box
SSD designs priori-bounding boxes of different quantities, scales, and width-to-height ratios for each feature map. These priori-bounding boxes are a series of bounding boxes of fixed quantity and size generated by fixed rules. The specific size of a priori-bounding box is determined by its scale and width-to-height ratio, and each feature map layer corresponds to one scale, which is generated as

$$s_k = s_{\min} + \frac{s_{\max} - s_{\min}}{m - 1}(k - 1), \quad k \in [1, m],$$

where $s_k$ represents the scale of the priori-bounding box in the $k$th feature map, $s_{\min}$ is 0.2, $s_{\max}$ is 0.9, and $m$ represents the number of feature maps used for detection, which is 6 in SSD. Each grid cell on each feature map is assigned a different number and size of priori-bounding boxes. In particular, each grid cell of conv4_3, conv10_2, and conv11_2 generates four priori-bounding boxes with width-to-height ratios $a_{r1}$ of {1, 2, 1/2}, and each grid cell of the conv7, conv8_2, and conv9_2 feature maps produces six priori-bounding boxes with width-to-height ratios $a_{r2}$ of {1, 2, 1/2, 3, 1/3}. After determining the scale and width-to-height ratio, the size of the priori-bounding box can be obtained as follows:

$$w_k^a = s_k \sqrt{a_r}, \qquad h_k^a = \frac{s_k}{\sqrt{a_r}},$$

where $w_k^a$ and $h_k^a$ are the width and height of the priori-bounding box, respectively, and $a_r$ is $a_{r1}$ or $a_{r2}$. For the priori-bounding box with a width-to-height ratio of 1, an additional box with scale

$$s'_k = \sqrt{s_k s_{k+1}}$$

is added, which accounts for the fourth and sixth boxes per grid cell. In SSD, the numbers of priori-bounding boxes in the six detection layers are 38 × 38 × 4 = 5776, 19 × 19 × 6 = 2166, 10 × 10 × 6 = 600, 5 × 5 × 6 = 150, 3 × 3 × 4 = 36, and 1 × 1 × 4 = 4, respectively. In total, the network outputs 5776 + 2166 + 600 + 150 + 36 + 4 = 8732 priori-bounding boxes.
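The scale rule and the box count can be checked with a short sketch. The values s_min = 0.2, s_max = 0.9, m = 6, the feature map sizes, and the boxes per grid cell come from the text; the function names are hypothetical, and the extra square box with scale sqrt(s_k * s_{k+1}) follows the usual SSD convention, which is assumed here.

```python
import math

S_MIN, S_MAX = 0.2, 0.9
# conv4_3, conv7, conv8_2, conv9_2, conv10_2, conv11_2
FEATURE_MAP_SIZES = [38, 19, 10, 5, 3, 1]
BOXES_PER_CELL = [4, 6, 6, 6, 4, 4]


def scale(k, m=6):
    """Scale s_k of the k-th detection feature map (k is 1-based)."""
    return S_MIN + (S_MAX - S_MIN) * (k - 1) / (m - 1)


def box_shapes(k, ratios, m=6):
    """(width, height) pairs, relative to the image, for one grid cell."""
    s_k = scale(k, m)
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in ratios]
    # Assumed SSD convention: one extra square box for ratio 1.
    # For k = m this extrapolates the scale rule beyond S_MAX.
    s_extra = math.sqrt(s_k * scale(k + 1, m))
    shapes.append((s_extra, s_extra))
    return shapes


scales = [round(scale(k), 2) for k in range(1, 7)]
total_boxes = sum(f * f * b for f, b in zip(FEATURE_MAP_SIZES, BOXES_PER_CELL))
conv4_3_shapes = box_shapes(1, [1, 2, 1 / 2])
conv7_shapes = box_shapes(2, [1, 2, 1 / 2, 3, 1 / 3])
```

With these settings the six scales are 0.2, 0.34, 0.48, 0.62, 0.76, and 0.9, and the total count reproduces the 8732 priori-bounding boxes stated above.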

Composite Backbone Network
The objects in UAV images are mostly small and suffer from severe blur, texture distortion, and obscure features. Thus, it is difficult for some networks to extract key feature information, which weakens the recognition ability of classifiers. Therefore, based on CBNet [23], a composite backbone network that combines two public backbone networks was proposed. ResNet50, which better preserves low-level details, was selected as the assistant backbone network. The lead backbone network is kept unchanged, and the low-level features extracted by ResNet50 are fused layer by layer into the VGG16 lead backbone network. The feature layer obtained after fusion replaces the original feature layer of the lead backbone network and serves as the new feature layer for the next convolution step (Figure 3).
In the assistant backbone network, the result of each phase can be considered as a higher-level feature. The output of each feature level is part of the lead backbone input and flows to the parallel phase of the subsequent backbone. In this manner, multiple higher and lower features are fused to produce richer feature representations. This process can be expressed as follows:

$$F_{out} = F_l \oplus F_a, \tag{4}$$

$$F_{OUT} = \varepsilon(F_{out}), \tag{5}$$

where $\oplus$ represents element-wise addition, $F_l$ represents the output features of the lead backbone at the current stage, $F_a$ represents the output features of the assistant backbone, $F_{out}$ represents the explicit feature fusion result, and $F_{OUT}$ is the input of the next layer of the lead backbone.
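The composite connection can be sketched on toy data as an element-wise addition followed by a channel adjustment. The sketch below is illustrative only: it models the channel adjustment as a 1 × 1 convolution, stores feature maps as nested lists indexed `[channel][row][column]`, and assumes the two inputs already share the same shape. The function names and the weight values are hypothetical.

```python
def elementwise_add(f_lead, f_assist):
    """F_out = F_l (+) F_a for two feature maps of identical shape [C][H][W]."""
    return [[[a + b for a, b in zip(row_l, row_a)]
             for row_l, row_a in zip(ch_l, ch_a)]
            for ch_l, ch_a in zip(f_lead, f_assist)]


def conv1x1(feature, weights):
    """1x1 convolution without bias.

    weights[o][c] is the contribution of input channel c to output channel o,
    applied independently at every spatial position.
    """
    height, width = len(feature[0]), len(feature[0][0])
    return [[[sum(w_out[c] * feature[c][y][x] for c in range(len(feature)))
              for x in range(width)]
             for y in range(height)]
            for w_out in weights]


def composite_connection(f_lead, f_assist, weights):
    """F_OUT = epsilon(F_l (+) F_a), with epsilon modeled as a 1x1 convolution."""
    return conv1x1(elementwise_add(f_lead, f_assist), weights)


# Toy feature maps with 2 channels and a 2 x 2 spatial size.
f_lead = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [0.0, 1.0]]]
f_assist = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
weights = [[1.0, 0.0], [0.5, 0.5]]  # hypothetical 1x1 kernel, 2 -> 2 channels
fused = composite_connection(f_lead, f_assist, weights)
```

In a real network the assistant features would first need to be brought to the spatial size and channel count of the lead features; that step is omitted here.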
The step from $F_{out}$ to $F_{OUT}$ adjusts the channels. As shown in Equation (5), $\varepsilon$ denotes a 1 × 1 convolution operation. In theory, this composite connection can be applied at any layer of the backbone; our experiments used the most basic composite connection. The proposed composite connection method is therefore not limited by the feature size. To simplify the operations, the 150 × 150, 75 × 75, and 38 × 38 feature layers on the lead backbone were selected, corresponding to the outputs of three ResNet50 layers.
SSD generates far more priori-bounding boxes than there are objects, so after matching against the real-bounding boxes most of them are negative samples. This results in an imbalance between the positive and negative samples. As a result, when calculating the loss, negative samples occupy a large proportion, making it difficult for the model to converge. Therefore, after matching, a hard sample-mining strategy is used to keep the ratio of positive to negative samples at 1 : 3 before the samples are input into the network for training. The loss function selected in this study was the same as that used in the conventional SSD network, which is the weighted sum of the positioning loss (smooth L1 [24][25][26]) and the confidence loss (Softmax [27][28][29]), as expressed by

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right),$$

where $x$ is the matching result of the prediction-bounding box and the real-bounding box of different categories, $c$ is the category confidence information of the prediction-bounding box, $l$ is the location information of the prediction-bounding box, and $g$ is the location of the real-bounding box. $N$ represents the number of matched priori-bounding boxes. When $N = 0$, the total loss is 0. $\alpha$ is the weight coefficient, $L_{loc}(x, l, g)$ is the position loss, and $L_{conf}(x, c)$ is the classification loss. The position loss is a smooth L1 loss between the prediction-bounding box and the real-bounding box, as expressed by

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_i^{m} - \hat{g}_j^{m}\right),$$

where $\hat{g}_j^{m}$ is the encoded offset of the real-bounding box, and $x_{ij}^{k}$ represents whether the $i$th prediction-bounding box and the $j$th real-bounding box match in category $k$. If they match, the value is 1; if they do not match, the value is 0.
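The position loss and the weighted sum can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, offsets are passed as `(cx, cy, w, h)` tuples for matched boxes only, and `alpha=1.0` is an assumed default because the text does not report the value used.

```python
def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def localization_loss(pred_offsets, target_offsets):
    """Sum of smooth L1 over the (cx, cy, w, h) offsets of all matched boxes."""
    return sum(smooth_l1(p - t)
               for pred, target in zip(pred_offsets, target_offsets)
               for p, t in zip(pred, target))


def total_loss(conf_loss, loc_loss, num_matched, alpha=1.0):
    """Weighted sum of confidence and position loss, normalized by N.

    Returns 0 when no priori-bounding box is matched, as stated in the text.
    alpha = 1.0 is an assumption.
    """
    if num_matched == 0:
        return 0.0
    return (conf_loss + alpha * loc_loss) / num_matched


# One matched box: small errors fall in the quadratic regime,
# the large height error falls in the linear regime.
pred = [(0.1, 0.2, 0.0, 2.0)]
target = [(0.0, 0.0, 0.0, 0.0)]
loc = localization_loss(pred, target)
loss = total_loss(conf_loss=2.0, loc_loss=loc, num_matched=1)
```

The linear regime keeps a single badly localized box from dominating the gradient, which is the usual motivation for smooth L1 over a plain squared error.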
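Similar to Faster R-CNN, SSD regresses the offsets of the central coordinates (cx, cy), width w, and height h relative to the priori-bounding box $d$. The calculation method is expressed by the following:

$$\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \quad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}, \quad \hat{g}_j^{w} = \log\frac{g_j^{w}}{d_i^{w}}, \quad \hat{g}_j^{h} = \log\frac{g_j^{h}}{d_i^{h}}.$$

The classic Softmax loss was used as the classification loss, as expressed by

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log \hat{c}_i^{p} - \sum_{i \in Neg} \log \hat{c}_i^{0}, \qquad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})},$$

where $x_{ij}^{p}$ represents whether the $i$th prediction-bounding box and the $j$th real-bounding box match in category $p$. If they match, the value is 1; if they do not match, the value is 0. Here, $\hat{c}_i^{p}$ is the Softmax-normalized confidence of the $i$th prediction-bounding box for category $p$.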
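The offset encoding and the confidence loss with the 1 : 3 sample ratio can be sketched together. The function names are hypothetical, label 0 is assumed to denote background, and ranking negatives by their loss is the usual SSD choice, which the text does not spell out.

```python
import math


def encode(gt, prior):
    """Offsets of a real box relative to a priori box; both are (cx, cy, w, h)."""
    return ((gt[0] - prior[0]) / prior[2],
            (gt[1] - prior[1]) / prior[3],
            math.log(gt[2] / prior[2]),
            math.log(gt[3] / prior[3]))


def softmax_nll(logits, label):
    """-log softmax(logits)[label], computed with the max-shift for stability."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def confidence_loss(logits, labels, neg_pos_ratio=3):
    """Softmax loss over positives plus the hardest negatives.

    labels[i] == 0 marks a negative (background) box. At most
    neg_pos_ratio negatives per positive are kept, hardest first.
    """
    losses = [softmax_nll(z, y) for z, y in zip(logits, labels)]
    positives = [l for l, y in zip(losses, labels) if y > 0]
    negatives = sorted((l for l, y in zip(losses, labels) if y == 0), reverse=True)
    return sum(positives) + sum(negatives[:neg_pos_ratio * len(positives)])


# A real box identical to its priori box has zero offsets.
offsets = encode((0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2))

# One positive and five negatives with uninformative logits:
# only three negatives are kept, so four terms of log(2) remain.
logits = [[0.0, 0.0]] * 6
labels = [1, 0, 0, 0, 0, 0]
conf = confidence_loss(logits, labels)
```

Dividing the width and height differences by the priori box size and taking logarithms makes the regression targets roughly scale-invariant, so small and large boxes contribute comparably.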

Experiment
3.1. Implementation Details. The proposed framework uses a composite connection of VGG16 and ResNet50 as the backbone. In the training phase, the learning rate for the first 50 epochs was set to 5 × 10e-4, and the learning rate was automatically reduced by 50% whenever the loss failed to decrease three times in a row. For training beyond the first 50 epochs, the initial learning rate was set to 10e-4, with the same 50% reduction rule. Training was completed when the loss still did not decrease after three reductions of the learning rate. The experimental environment used in this study was as follows: the CPU was an Intel Core i5-9400F with a main frequency of 2.90 GHz (six cores); the memory was 16 GB; the GPU was an RTX 2060 Super; the operating system was 64-bit Windows; and the machine learning framework was TensorFlow 2.3.
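The learning rate rule described above resembles a reduce-on-plateau schedule. The sketch below is one possible reading, not the authors' training script: the class name is hypothetical, "did not decrease three times" is interpreted as three consecutive epochs without a new best loss, and the starting value of 5e-4 is an assumed interpretation of the stated rate.

```python
class PlateauSchedule:
    """Halve the learning rate after `patience` epochs without improvement.

    Training is flagged as finished when the loss stalls again after
    `max_reductions` reductions. All defaults are assumptions based on
    one reading of the text.
    """

    def __init__(self, initial_lr, factor=0.5, patience=3, max_reductions=3):
        self.lr = initial_lr
        self.factor = factor
        self.patience = patience
        self.max_reductions = max_reductions
        self.best = float("inf")
        self.stale = 0          # epochs since the last improvement
        self.reductions = 0     # how many times the rate was lowered
        self.finished = False

    def step(self, loss):
        """Record one epoch's loss and return the learning rate to use next."""
        if loss < self.best:
            self.best = loss
            self.stale = 0
            return self.lr
        self.stale += 1
        if self.stale >= self.patience:
            self.stale = 0
            if self.reductions >= self.max_reductions:
                self.finished = True
            else:
                self.lr *= self.factor
                self.reductions += 1
        return self.lr


schedule = PlateauSchedule(initial_lr=5e-4)
rates = [schedule.step(loss) for loss in [1.0, 0.9, 0.95, 0.95, 0.95]]
```

After two improving epochs and three stalled ones, the rate is halved once. Frameworks such as TensorFlow ship a comparable callback, but whether the authors used it is not stated.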

The Datasets.
Two datasets were used in this study: Pascal VOC 2012 datasets [30] for testing the feasibility of the network and Visdrone2019 UAV aerial photography datasets [31] for training.
3.2.1. Pascal VOC2012 Datasets. As one of the benchmark datasets, Pascal VOC2012 has frequently been used in object detection, image segmentation experiments, and model effect evaluations. The datasets consist of four major categories and 20 subcategories, with 17125 images, including training images and test images.

Visdrone2019 Aerial Datasets
The Visdrone2019 aerial datasets are low-altitude aerial datasets that are mostly used for small object detection. They contain 13 types of objects and 7,634 images. Most of the images are traffic scenes that contain dense small objects.

Performance Inspection.
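In this study, the mean average precision (mAP) was used to evaluate the quality of the model. The calculation methods for mAP are expressed by the following:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN},$$

$$AP = \int_0^1 P(R)\, dR, \qquad mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i,$$

where TP represents the number of accurately predicted target boxes, FP represents the number of incorrectly predicted target boxes, FN represents the number of missed ground truths, P(R) is the precision as a function of recall, AP represents the average precision of one category, and n is the number of categories.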
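A small sketch may help make the metric concrete. The text does not say which AP variant was used; the code below assumes the all-point interpolated form, in which the precision at each recall level is replaced by the best precision achieved at that recall or higher. The function names and the toy detections are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_true_positive, num_ground_truth):
    """Area under the interpolated precision-recall curve for one category.

    is_true_positive: one flag per detection, sorted by descending confidence.
    """
    tp = fp = 0
    points = []  # (recall, precision) after each detection
    for flag in is_true_positive:
        if flag:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_ground_truth, tp / (tp + fp)))
    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        # Interpolation: best precision at this recall level or beyond.
        best = max(p for _, p in points[i:])
        ap += (recall - prev_recall) * best
        prev_recall = recall
    return ap


def mean_average_precision(aps):
    """mAP: plain mean of the per-category AP values."""
    return sum(aps) / len(aps)


# Three detections for a category with two ground-truth objects.
ap = average_precision([True, False, True], num_ground_truth=2)
m_ap = mean_average_precision([ap, 0.5])
```

In the toy case the first detection reaches recall 0.5 at precision 1.0 and the third reaches recall 1.0 at precision 2/3, giving an AP of 0.5 × 1.0 + 0.5 × 2/3 = 5/6.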

Performance Test.
To test the detection effect of the proposed network, a performance test was conducted using the Pascal VOC 2012 datasets, and the performance was compared with that of conventional object detection algorithms. As shown in Table 1, the proposed algorithm achieved the highest mAP among the compared algorithms, which confirms its feasibility.
3.5. Training. To verify the detection effect of the proposed network on small objects, training was conducted using the Visdrone2019 UAV aerial photography datasets, and the performance of the proposed CBSSD algorithm was compared with that of conventional object detection algorithms. The experimental data show that the proposed algorithm exhibited a significant improvement over the original network in terms of mAP. As shown in Table 2, the detection accuracy improved by 7.5 percentage points compared with the original algorithm, a relative improvement of up to 65%. As shown in Figure 4, CBSSD has advantages in detection accuracy for each category of objects, especially for small objects. Figure 5 shows the detection results of CBSSD on the Visdrone2019 datasets. As shown in the figure, CBSSD maintains high performance despite dense and blurred images and uneven lighting. Figure 6 compares the detection results of the CBSSD algorithm with those of classical detection algorithms and shows that CBSSD has a better detection effect than several classical object detection algorithms.
The CBSSD algorithm has a significant effect on dense small object detection, as shown in Figure 7. An important characteristic of UAV aerial images is that they contain many small objects. General detection algorithms perform poorly on such objects because their feature maps lose much of the information about them. The feature maps of CBSSD retain more low-level detail information; therefore, the detection effect for this type of object is better.
The CBSSD algorithm still has a good detection effect for images with weak light intensity and uneven lighting, as shown in Figure 8. It also performs well in low-light environments, where the object texture is distorted and detection is more difficult.
In summary, the experiments showed that the detection accuracy of the proposed CBSSD algorithm improved significantly. More objects were detected and recognized, recognition accuracy improved, and false detections were reduced. For dense small objects, the detection effect was significantly enhanced. In particular, a good detection effect was maintained for blurred images and images with uneven lighting.

Conclusion
This study analyzed the problems associated with small object detection in UAV aerial images. By combining existing feature extraction backbones through a composite connection, a backbone with stronger feature expression ability is proposed, which addresses the poor detection obtained when UAV aerial images are dense, blurred, or unevenly lit. The experimental results showed that, compared with other algorithms, the proposed CBSSD algorithm significantly improved the detection effect of small objects in UAV aerial images. Hence, UAV aerial image detection technology can be better applied to traffic scenes. Moreover, an improvement method for the SSD algorithm was proposed.
In the future, a clustering algorithm will be used to determine priori-bounding box sizes suitable for an SSD network, which would remove the need to set these sizes manually and further improve the detection effect for small objects.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.

Mobile Information Systems