YOLOv5-Based Vehicle Detection Method for High-Resolution UAV Images

To address the feature loss caused by compressing high-resolution images during the normalization stage, an adaptive clipping algorithm based on the You Only Look Once (YOLO) object detection algorithm is proposed for the data preprocessing and detection stages. First, the high-resolution training dataset is augmented with the adaptive clipping algorithm, which generates a new training set that retains the detailed features the object detection network needs to learn. During detection, the image is processed in chunks produced by the adaptive clipping algorithm, and the coordinates of the chunk-level detections are merged by position mapping. Finally, the chunked detection results are combined with the global detection results and output. Experiments compare the improved YOLO algorithm with the original algorithm for detecting vehicles in the test set. The results show that, compared with the original YOLO object detection algorithm, the precision of our algorithm increases from 79.5% to 91.9%, the recall increases from 44.2% to 82.5%, and the mAP@0.5 increases from 47.9% to 89.6%. Applying the adaptive clipping algorithm in the vehicle detection process effectively improves the performance of the traditional object detection algorithm.


Introduction
With the rapid development of the social economy and accelerating urbanization, traffic problems are becoming increasingly serious [1], and effective traffic monitoring helps to address them. As AI enters the agenda at the national level, intelligent transportation systems are becoming a development trend [2][3][4]. Unmanned aerial vehicles (UAVs) have wide application prospects in the field of transportation, and UAVs equipped with high-definition cameras have great potential and advantages in parking lot management, intelligent traffic control, and disaster rescue [5][6][7][8][9]. Owing to its fast recognition speed, high accuracy, and good detection effect, the improved YOLO algorithm can support decision-making in a variety of complex traffic conditions. Aerial images taken by UAVs differ from the ground images used for vehicle detection: the ground view is mainly taken by a fixed camera, whereas the aerial view is taken from above by a mobile UAV carrying a camera, so some side information about the vehicle is lost [10]. The image quality of the camera carried by the UAV is much higher than that of the ground camera (most cameras are 4K, and some high-end models can output images with a resolution of 8K), and the amount of information carried by the image is huge. Therefore, images need to be used correctly and reasonably. In addition, in aerial images, objects of interest are usually small and dense. For example, when a DJI Inspire 2 with a Zenmuse X7 camera is used, the output image size is 5760 × 3240 pixels; at such a high resolution, a vehicle may occupy only 50 × 50 pixels or less [11], and it is very challenging to detect such a small vehicle in large images.
In the field of deep learning, image classification networks based on convolutional neural networks, such as AlexNet, VGG, and ResNet [12][13][14][15], were developed to achieve higher scores in the ImageNet classification competition. Convolutional neural networks have since been increasingly used in the object detection field [16,17]. Redmon et al. [18] proposed the You Only Look Once (YOLO) object detection network, which treats object detection as a regression problem and uses an end-to-end framework to directly predict category and location information. The following year, Redmon and Farhadi [19] proposed an improved version named YOLO9000, which added anchor boxes to make it easier for the detection head to predict the target box and added batch normalization (BN) to reduce overfitting. The most recent version of the YOLO object detection algorithm is YOLOv5, which significantly improves accuracy and efficiency by replacing the backbone with CSP-DarkNet and adding data augmentation methods such as mosaic.
Ground target detection based on deep learning has been well developed. However, current technology still has shortcomings in vehicle detection from UAVs, for example, when the targets are small and densely packed, as with cars in parking lots. Taking the YOLO object detection network as an example, the downsampling factor of YOLO is 32, and the network outputs a 13 × 13 prediction grid. If the distance between two target objects is less than 32 pixels, the network makes errors when differentiating the targets [11].
Therefore, some researchers are committed to improving the network structure. Zhong et al. [20] used convolutional neural networks to generate vehicle-like regions from the feature maps of different layers in the backbone and pooled the features of the deep and shallow layers, which helps to detect small objects more effectively. Yang et al. [21] used cross-layer skip connections to overcome the feature loss that deep convolutional neural networks cause for small objects. Sommer et al. [22] showed that the current region proposal network (RPN) did not work effectively for small objects, so an improved RPN, together with an improved Fast R-CNN, was used to detect small objects. The above researchers have conducted in-depth studies on network structures. However, because of the strict limitation on the input size of the convolutional neural network, the above algorithms do little to improve vehicle detection in high-resolution images.
Because of the limitations of the convolutional neural network, current mainstream object detection networks have strict requirements for the size of the input image, and these requirements differ between networks. Images that do not meet the required resolution need to be compressed or zero-padded before detection. Faster R-CNN [23] uses 1000 × 600 pixel images as the regular input, SSD [24] uses 300 × 300 or 512 × 512 pixel images, and the latest YOLOv5 algorithm uses 640 × 640 pixel images. However, the resolution of images captured by UAVs is much higher than the image size accepted by the above object detection models, and the information lost during image compression seriously affects the detection of small targets.
To solve the problem of feature loss in object detection on high-resolution UAV images, this paper proposes an adaptive clipping algorithm that prepares UAV images as the input for both training and detection. The algorithm is based on the YOLOv5 object detection network. High-resolution images are adaptively clipped according to the input size requirements and then input to the network for training. After training, the small object detection problem is transformed into a standard detection problem by sliding a window over the image and detecting each chunk, with the step size calculated by the adaptive clipping algorithm. The algorithm is evaluated using precision, recall, and mAP, and its effect is verified on actual vehicle detection images.
The rest of this paper is organized as follows: Section 2 presents the principles and implementation of the YOLOv5-based adaptive clipping algorithm, Section 3 describes the experimental procedure of the algorithm in this paper based on a modified VisDrone dataset, and Section 4 presents and analyzes the results of the operation of the proposed algorithm. Finally, a conclusion is drawn in Section 5.

Description of the Methodology
The workflow of the proposed YOLOv5-based high-resolution UAV image vehicle detection algorithm is shown in Figure 1.
The drone acquires high-resolution images or videos, which are processed to form an image library and organized into an initial training dataset by manual labeling. The proposed adaptive clipping algorithm then splits this dataset into the final training dataset, which is used to train the YOLOv5 object detection algorithm and obtain the corresponding model weights.
The detection process uses the improved adaptive clipping detection algorithm to take chunks of the images in the test set. After the coordinates of the detection frames in the current clip are obtained, they are adjusted according to the sliding window step given by the adaptive clipping algorithm. Then, the detection frames from the adaptive clipping detection are merged with the detection frames from the original image, and nonmaximum suppression is applied. Finally, the complete object detection image is output.

The Proposed Adaptive Clipping Method
YOLOv5 Object Detection Algorithm.
The proposed adaptive clipping algorithm applies to both the training data preprocessing process and the detection process of the object detection algorithm. The YOLOv5 algorithm, as the latest version of the YOLO algorithm, is known for its very high detection speed and high accuracy. Currently, the YOLOv5 model has an inference time as low as 2 ms per image on a single NVIDIA Tesla V100. The proposed algorithm requires the input image to be detected in chunks and then combined into a single image; therefore, the YOLOv5 algorithm is chosen as the object detection algorithm to maintain real-time performance. The YOLOv5 network model consists of three main structures: the backbone, the feature pyramid network, and the detection head. The backbone extracts image features at different scales, the feature pyramid network fuses the features from different scales and passes them to the detection head, and the detection head uses these features to predict the object category and generate the object bounding box. The YOLOv5 network structure is shown in Figure 2.

Adaptive Clipping of Datasets.
Taking a DJI Inspire 2 Zenmuse X7 UAV as an example, the maximum image size output by the camera is 5760 × 3240 pixels, and the size of a vehicle on the ground is only approximately 30-50 pixels when the UAV is flying at an altitude of 50-100 meters. The algorithm compresses the input image to 640 × 640 pixels during the object detection process. At this time, the length of the vehicle on the ground is only 4-6 pixels, and the image detail features of the vehicle suffer a large amount of loss. Figure 3 shows the detailed features of the vehicle in the same area before and after the compression of the original image.
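The scale of this loss can be checked with a short calculation. The sketch below assumes that the image is resized with its aspect ratio preserved, so that the long side of 5760 pixels maps onto the 640-pixel network input; the function name `compressed_size` is illustrative and not part of any library.

```python
def compressed_size(object_px: float, image_long_side: int, network_input: int = 640) -> float:
    """Object size in pixels after the long image side is scaled to the network input."""
    scale = network_input / image_long_side  # 640 / 5760 = 1/9 for the example camera
    return object_px * scale


# Vehicles of 30-50 pixels in the 5760 x 3240 source image
for px in (30, 50):
    print(f"{px} px -> {compressed_size(px, 5760):.1f} px after compression")
```

Under this assumption a 30-50 pixel vehicle shrinks to only a few pixels, which is in line with the estimate given above.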
In this paper, we propose an adaptive image clipping algorithm for training sets of high-resolution images captured by UAVs. A window whose size matches the input size required by the object detection network slides over each high-resolution image and clips it with overlap, generating a new, augmented dataset. The clipping frame coordinates are calculated using Formulas (1)-(4), where $I_w$ denotes the number of horizontal pixels in the original image, $I_h$ denotes the number of vertical pixels in the original image, $F_w$ and $F_h$ represent the width and height of the input image of the object detection network, $N_w$ and $N_h$ denote the number of clip frames finally generated in the horizontal and vertical directions (the calculated results in parentheses are rounded down), and $S_w$ and $S_h$ are the step lengths of the horizontal and vertical sliding of the clip frame.
The workflow of the sliding window equations is shown in Figure 4. First, we calculate how many windows are needed using Formulas (1) and (2), allowing the windows to extend beyond the image. We then distribute the excess equally as the overlap between adjacent sliding windows using Formulas (3) and (4). Note that when the image size is exactly divisible by the window size, we still add an extra window and divide its entire extent equally among the overlaps.
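A minimal sketch of this window calculation follows. It is reconstructed from the description above rather than taken from the authors' code, and it assumes that each axis uses N = floor(I / F) + 1 windows and a step S = F - (N·F - I) / (N - 1), which spreads the excess evenly over the N - 1 overlaps. The function names `axis_windows` and `clip_boxes` are hypothetical.

```python
from typing import List, Tuple


def axis_windows(image_len: int, window_len: int) -> Tuple[int, float]:
    """Return (number of windows, sliding step) along one axis."""
    n = image_len // window_len + 1        # rounded down, plus one extra window
    if n == 1:                             # image shorter than the window: no sliding
        return 1, 0.0
    excess = n * window_len - image_len    # how far the windows overshoot the image
    step = window_len - excess / (n - 1)   # spread the excess over the n - 1 overlaps
    return n, step


def clip_boxes(img_w: int, img_h: int, win_w: int = 640, win_h: int = 640) -> List[Tuple[int, int, int, int]]:
    """Clip frames as (x1, y1, x2, y2) in pixel coordinates of the original image."""
    n_w, s_w = axis_windows(img_w, win_w)
    n_h, s_h = axis_windows(img_h, win_h)
    boxes = []
    for row in range(n_h):
        for col in range(n_w):
            x1, y1 = round(col * s_w), round(row * s_h)
            # If the image is smaller than the window, the frame extends past
            # the image border and would need padding (not handled here).
            boxes.append((x1, y1, x1 + win_w, y1 + win_h))
    return boxes


print(len(clip_boxes(1360, 765)), "clips for a 1360 x 765 image")
```

Under these assumptions, a 1360 × 765 image yields 3 × 2 = 6 clips with steps of 360 and 125 pixels, and the last clip ends exactly at the image border. A 5760-pixel axis, which is exactly divisible by 640, yields 10 windows instead of 9.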
The label format of the YOLOv5 algorithm uses normalized relative coordinates. For example, (0.5, 0.5) represents the center point of an image, and (1, 1) represents the point in the bottom right corner of an image. Therefore, the original labels need to be mapped according to the rules of adaptive clipping to generate the labels of the new image, and the algorithm flow of label mapping proposed in this paper is shown in Algorithm 1.
[...] output from the clip map. IoU is the intersection-over-union discriminant function, which calculates the ratio of the intersection to the union of two regions $A$ and $B$. IoU is calculated as
$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}. \quad (5)$$

Adaptive Clipping Detection.
The network structure of the YOLOv5 object detection algorithm has strict requirements concerning the resolution of the input raw images. The default input image size in YOLOv5 is 640 × 640; thus, all images larger than this resolution are compressed, and image detail features are inevitably lost during the compression process. To address this issue, this paper proposes adaptive clipping of images during inference, using the clip coordinates calculated by Formulas (1)-(4). During inference, the algorithm uses the input image width $F_w$ required by the network, as in Formula (1); the input image height $F_h$ required by the network, as in Formula (2); and the calculated chunk detection frame coordinates to clip the original image with overlap and detect the clipped images separately. The algorithm flow is shown in Algorithm 2.
In Algorithm 2, img is the input image at its original resolution, and the clipped image size is the input image size of the object detection algorithm (640 in this paper). The output of the Adaptive clipping function is calculated by Formulas (1)-(4). The Model function is the trained YOLOv5 network model, which returns the prediction frame information of the input image. The Concat function is the combination function, which outputs the tensor obtained by combining multiple tensors. Finally, the NMS function is the nonmaximum suppression function, which eliminates redundant prediction frames by keeping the frame with the highest confidence value and removing lower-confidence frames that overlap it heavily.
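The NMS step can be illustrated with a plain-Python sketch of greedy nonmaximum suppression. This is a generic textbook version, not the implementation used inside YOLOv5; the box layout (x1, y1, x2, y2, confidence) and the threshold value of 0.45 are assumptions chosen for the example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float, float]  # x1, y1, x2, y2, confidence


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], iou_threshold: float = 0.45) -> List[Box]:
    """Keep the most confident box and drop boxes that overlap a kept box too much."""
    kept: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept


candidates = [(0, 0, 10, 10, 0.9), (1, 1, 11, 11, 0.8), (50, 50, 60, 60, 0.7)]
print(nms(candidates))  # the second box overlaps the first heavily and is removed
```

Because adjacent clips overlap, the same vehicle is often predicted in two clips; after position mapping these duplicates overlap strongly and are reduced to one box by this step.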
Since single images inevitably contain some large objects, to avoid detection errors caused by the incomplete combination of object features when a single large object is split into multiple clips, the algorithm inputs the whole image for inference after the inference of the clips. Finally, nonmaximum suppression is used for all inference results, including the clipped images and the whole images. The principle of this part of the algorithm flow is shown in Figure 5.
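The position mapping and the combination of the two branches can be sketched as follows. The detector arguments `detect_clip` and `detect_global` are hypothetical callables that stand in for the trained model; the sketch also assumes that the global branch already returns boxes rescaled to the pixel coordinates of the original image.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float, float]  # x1, y1, x2, y2, confidence
Origin = Tuple[int, int]                        # top-left corner of a clip


def to_global(detections: List[Box], clip_origin: Origin) -> List[Box]:
    """Shift clip-local boxes by the clip's top-left corner in the original image."""
    ox, oy = clip_origin
    return [(x1 + ox, y1 + oy, x2 + ox, y2 + oy, conf)
            for x1, y1, x2, y2, conf in detections]


def gather_candidates(clip_origins: List[Origin],
                      detect_clip: Callable[[Origin], List[Box]],
                      detect_global: Callable[[], List[Box]]) -> List[Box]:
    """Collect boxes from every clip and from the whole image in one list."""
    candidates: List[Box] = []
    for origin in clip_origins:
        candidates.extend(to_global(detect_clip(origin), origin))
    candidates.extend(detect_global())  # whole-image branch for large objects
    return candidates                   # nonmaximum suppression is applied afterwards


# Stub detectors that return fixed boxes, used only to show the data flow
boxes = gather_candidates(
    clip_origins=[(0, 0), (360, 125)],
    detect_clip=lambda origin: [(10, 20, 40, 50, 0.9)],
    detect_global=lambda: [(100, 100, 300, 300, 0.8)],
)
print(boxes)
```

Keeping all candidates in original-image coordinates before suppression means that a large object split across clips can still be represented by its single whole-image box.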

Experiments
The VisDrone drone dataset [25] was filmed and produced by the AISKYEYE team at Tianjin University. The base dataset consists of 260,000 frames of video and more than 10,000 still images from 14 different cities, collected by various models of drones. The VisDrone dataset is labeled with ten categories, namely, pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle. However, it suffers from an imbalance in the data distribution of different classes. To overcome this problem and keep the variables controlled when verifying the validity of the algorithm, we removed the category labels for people and nonmotorized vehicles. Based on vehicle characteristics, we retained only the car, van, bus, and truck categories and unified them into a single category named car by modifying the labels, which facilitates the monitoring and identification of targets. The adjusted training set has a total of 6471 images, the validation set has a total of 548 images, and approximately 175,000 cars are labeled in total. We use an Intel i7-7700 CPU with 16 GB of memory and an NVIDIA RTX 2070 GPU (8 GB) for the experiments, and the software environment is Python 3.7 with PyTorch 1.8.
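The category merging described above can be sketched as a simple filter. The sketch works on category names rather than on the numeric identifiers of the VisDrone annotation files, and the input layout, a list of (category name, box) pairs, is an assumption made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]

# Categories retained in the modified dataset; all are relabeled as class 0 ("car")
VEHICLE_CLASSES = {"car", "van", "bus", "truck"}


def merge_vehicle_labels(annotations: List[Tuple[str, Box]]) -> List[Tuple[int, Box]]:
    """Drop non-vehicle annotations and map the remaining ones to a single class."""
    return [(0, box) for name, box in annotations if name in VEHICLE_CLASSES]


sample = [
    ("pedestrian", (5, 5, 15, 30)),
    ("van", (100, 40, 160, 90)),
    ("bicycle", (200, 10, 215, 35)),
    ("truck", (300, 60, 380, 120)),
]
print(merge_vehicle_labels(sample))
```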

Data Preprocessing Results.
The training set is adaptively clipped using the proposed algorithm, and the clipping process discards the images that do not contain the object in the generated clipping map. The algorithm generates 35,742 images for the training set and 2656 images for the validation set. The labels of the clipped training set are reassigned using Algorithm 1 according to the YOLO-TXT format. The format requirements are shown in Figure 6.
Each image generates a txt file of the same name, and each line in the txt file represents the label of an individual object. The first column is the object class, numbered from 0. Since all classes were merged, only one class is included in the dataset. The second and third columns are the x and y coordinates of the center of the object frame, normalized by the width and height in pixels of the image, respectively. The fourth and fifth columns are the width and height of the object frame, normalized in the same way. The converted label image is shown in Figure 7.
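To make the label format concrete, the sketch below maps one pixel-coordinate box from the full image into a clip and writes it as a YOLO-TXT line. It is an illustration in the spirit of Algorithm 1, not a reproduction of it: the function name `remap_label` and the `min_visible` threshold, which drops boxes that lie mostly outside the clip, are assumptions.

```python
from typing import Optional, Tuple

Rect = Tuple[float, float, float, float]  # x1, y1, x2, y2 in original-image pixels


def remap_label(box: Rect, clip: Rect, min_visible: float = 0.5) -> Optional[str]:
    """Return 'class cx cy w h' normalized by the clip size, or None if the box is dropped."""
    cx1, cy1, cx2, cy2 = clip
    # Part of the box that lies inside the clip
    ix1, iy1 = max(box[0], cx1), max(box[1], cy1)
    ix2, iy2 = min(box[2], cx2), min(box[3], cy2)
    if ix2 <= ix1 or iy2 <= iy1:
        return None  # no overlap with this clip
    visible = (ix2 - ix1) * (iy2 - iy1) / ((box[2] - box[0]) * (box[3] - box[1]))
    if visible < min_visible:
        return None  # too little of the object is inside the clip (assumed rule)
    clip_w, clip_h = cx2 - cx1, cy2 - cy1
    x_center = ((ix1 + ix2) / 2 - cx1) / clip_w
    y_center = ((iy1 + iy2) / 2 - cy1) / clip_h
    width, height = (ix2 - ix1) / clip_w, (iy2 - iy1) / clip_h
    # Single-class dataset, so the class index is always 0
    return f"0 {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


print(remap_label(box=(488, 253, 552, 317), clip=(360, 125, 1000, 765)))
```

Each returned line corresponds to one row of the txt file of the clipped image, with the five columns described above.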

Clipping Test Results.
The YOLOv5 model is modified using Algorithm 2. We use transfer learning to initialize the model parameters with a model pretrained on the MS COCO dataset. The detection process of the algorithm is shown in Figure 8. We chunk the input image according to its size and the model's hyperparameters. The global detection branch takes the original image and runs inference on it directly, while the chunking detection branch runs detection on the image chunks. For example, the algorithm divides the original image in Figure 8 into six blocks for inference. After inference, the target boxes of the two detection branches are combined, and the redundant target boxes are removed using a nonmaximum suppression algorithm. The final result is then marked on the image. To verify the generalization performance of the adaptive clipping algorithm, we compare its performance with Faster R-CNN [23] and Cascade R-CNN [26] on the transformed VisDrone dataset. The experimental group uses the adaptive clipping algorithm to train on and detect the data, whereas the control group uses the original algorithm to train on and detect the high-resolution images directly.

Results and Analysis
We use different metrics, including the precision, recall, and mean average precision (mAP), to verify the effectiveness of the network. For a classification problem, the samples can be classified as true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) according to the combination of the ground truth and the prediction of the neural network. The formulas for the precision and recall are shown in Formulas (6) and (7), respectively:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad (6)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}. \quad (7)$$
The mAP is the average of the detection precision over all categories and is calculated as
$$\mathrm{mAP} = \frac{1}{n} \sum_{k=1}^{n} J(P, R)_k, \quad (8)$$
where $J(P, R)_k$ is the average precision function for category $k$, that is, the area under the P-R curve formed by the precision $P$ and the recall $R$; $n$ is the total number of categories, and $k$ is the current category.
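These definitions translate directly into code. The sketch below uses illustrative function names and, as a worked example, the counts from one scene reported later in the paper, where 50 boxes are predicted and all 48 vehicles are found (TP = 48, FP = 2, FN = 0).

```python
from typing import Sequence


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted boxes that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of ground-truth objects that are detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def mean_average_precision(ap_per_class: Sequence[float]) -> float:
    """Mean of the per-category average precision values."""
    return sum(ap_per_class) / len(ap_per_class)


print(precision(tp=48, fp=2))   # 48 correct boxes out of 50 predictions
print(recall(tp=48, fn=0))      # all 48 vehicles found
print(recall(tp=255, fn=13))    # 255 of 268 vehicles found in the dense scene
```

Because the modified dataset contains a single category, the mAP here equals the average precision of that one category.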

Analysis of Model Training Results.
The loss function is used to determine the training state of the model in the current iteration by calculating the difference between the predicted and true values. The YOLOv5 loss function is calculated as
$$\mathrm{Loss} = l_{\mathrm{object}} + l_{\mathrm{box}} + l_{\mathrm{class}}, \quad (9)$$
where $l_{\mathrm{object}}$ is the confidence loss, $l_{\mathrm{box}}$ is the bounding box loss, and $l_{\mathrm{class}}$ is the category loss. Since there is only one class in the training set of this paper, $l_{\mathrm{class}}$ is 0. The loss function curve of the training process is shown in Figure 9.
As the loss curve shows, at 200 epochs, the curve essentially stops decreasing, and the network training is essentially complete. The value of the loss function on the training set decreased from an initial 0.3187 to approximately 0.1397, and the value of the loss function on the validation set decreased from an initial 0.5425 to 0.2487.
The precision measures how many of the model's detections are correct. The recall measures how many of the ground-truth objects the model finds. Figure 10 shows the variation of the precision and recall during training as a function of the number of epochs. The highest precision achieved by the model during training is 0.93087, and the highest recall is 0.8169.
The mAP is an evaluation metric that assesses network performance in the object detection field. mAP@0.5 is the area under the P-R curve when the IoU threshold for counting a detection as positive is set to 0.5. mAP@0.5:0.95 is the average of the areas under the P-R curves obtained with IoU thresholds from 0.5 to 0.95 in steps of 0.05. Thus, a high mAP@0.5:0.95 is harder to achieve. Figure 11 shows the mAP curve during training. The final mAP@0.5 achieved by the algorithm is 0.894, and the mAP@0.5:0.95 is 0.623.
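The two quantities can be sketched as follows. The area under the P-R curve is computed here with all-point interpolation in the style of Pascal VOC, which is an assumption of this sketch; the evaluation code shipped with YOLOv5 may interpolate differently. The callable `ap_at_threshold` is hypothetical and stands for a full evaluation run at one IoU threshold.

```python
from typing import Callable, List, Sequence


def average_precision(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Area under a P-R curve whose points are sorted by increasing recall."""
    # Pad the curve so that it spans recall 0 to recall 1
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# IoU thresholds 0.50, 0.55, ..., 0.95 (ten values)
IOU_THRESHOLDS: List[float] = [0.5 + 0.05 * i for i in range(10)]


def map_50_95(ap_at_threshold: Callable[[float], float]) -> float:
    """Average the AP obtained at each IoU threshold."""
    return sum(ap_at_threshold(t) for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


# Toy curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0
print(average_precision([0.5, 1.0], [1.0, 0.5]))
```

Since a stricter IoU threshold turns some true positives into false positives, the AP at 0.95 is usually well below the AP at 0.5, which is why mAP@0.5:0.95 is lower than mAP@0.5 for the same model.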
As shown in Table 1, we compare Faster R-CNN, Cascade R-CNN, and YOLOv5 on the original dataset and on the data processed by the proposed algorithm. The results show that before using the proposed algorithm, the mAP of Cascade R-CNN exceeds that of Faster R-CNN and YOLOv5, while YOLOv5 is better in precision and inference time. After the adaptive clipping algorithm is used, the metrics of all three object detection frameworks improve to some extent, and our algorithm outperforms the other two algorithms in all metrics. The inference time remains within an acceptable range.

Analysis of Detection Results.
To test the model rigorously, 500 test set images that were not involved in training are used. The detection function provided by the original YOLOv5 algorithm and the improved adaptive clipping detection function are applied to the test set, and the detection results are evaluated against the labels. The results, presented in Table 1, show that the original model suffers significant feature loss from input image compression when detection is performed on high-resolution images; therefore, the original model scores lower than the model with the proposed algorithm on all indices. Figure 12 compares the detection effect of the proposed algorithm and the original algorithm. Panels (a-c) and (g-i) show the detection results of the proposed algorithm, and panels (d-f) and (j-l) show those of the original algorithm. In (a-f), in which the UAV flies at a low altitude and is tilted, the vehicle object size is approximately 100 pixels in the close view and only 30 pixels or less in the far view. Panels (d-f) show that the original algorithm detects near vehicles well but misses a large area of far vehicles. The proposed algorithm can detect both small objects at a distance and large objects nearby because of the adaptive clipping of the detection images. The images in (g-l) were taken at high altitudes, and the object size is generally smaller than 50 pixels. Here, the advantage of the proposed algorithm becomes apparent. Panels (g, j) show that the original algorithm detects only two buses and one car, which are large objects, whereas the proposed adaptive clipping detection algorithm detects all 45 vehicles. Panels (h, k) show the detection of a large number of dense objects. Because the objects are too small and dense, the original algorithm detects only one vehicle, while the proposed algorithm detects 255 objects, accounting for 95.1% of all 268 objects. The vehicle targets in (i, l) are smaller than 30 pixels in size. The original algorithm did not detect any targets, while the algorithm in this paper detected 50 targets, including all 48 objects plus two false positive detections.

Conclusion
This paper proposes a vehicle detection method for high-resolution images captured by UAVs, which addresses the limitation that image size and object size impose on traditional object detection algorithms: compressing high-resolution images limits the performance of the network when detecting small targets. Taking the YOLOv5 object detection algorithm as the baseline, we propose an adaptive clipping algorithm for high-resolution images, applied during data preprocessing and detection, to detect small vehicles. We use evaluation indices such as precision, recall, and mAP to evaluate the performance of the algorithm and design comparison experiments to verify its effectiveness. The experiments show that the proposed method improves vehicle detection in high-resolution UAV aerial images.
The detection speed of the framework determines the vehicle detection efficiency and real-time performance during UAV operations, so improving the operating speed of the algorithm is a goal of future research. Furthermore, in subsequent research, the single-scale detection process and the network model structure can be improved, for example, through model pruning, backbone structure optimization, and reparameterization. With these improvements, UAVs can be more widely used in intelligent traffic management.

Data Availability
The data underlying the results presented in the study are available within the manuscript.

Conflicts of Interest
The authors declare that there is no potential conflict of interest regarding this paper, and all authors have seen the manuscript and approved its submission.