Improved YOLOv4 for Pedestrian Detection and Counting in UAV Images

Images captured by UAVs (unmanned aerial vehicles) contain small pedestrian targets, and key information is lost after multiple rounds of down sampling; these problems are difficult for existing methods to overcome. We propose an improved YOLOv4 model for pedestrian detection and counting in UAV images, named YOLO-CC. We use a lightweight YOLOv4 for pedestrian detection, which replaces the backbone with CSPDarknet-34 and fuses two feature layers with an FPN (Feature Pyramid Network). We expand the receptive field using multiscale convolution on the high-level feature map and generate a crowd density map by feature dimension reduction. By embedding the density map generation method into the network for end-to-end training, our model effectively improves the accuracy of detection and counting and makes feature extraction focus more on small targets. Our experiments demonstrate that YOLO-CC achieves an AP50 21.76 points higher than that of the original YOLOv4 on the VisDrone2021-counting data set while running faster than the original YOLOv4.


Introduction
UAV remote sensing is widely used in agricultural and forestry plant protection, production monitoring, geographic mapping, public security inspection, emergency rescue, and other civil fields. With the continuous improvement of hardware performance, the application of computer vision on UAVs has attracted the attention of relevant experts and scholars. Compared with fixed street cameras, UAVs are more flexible and can monitor any range of public places, factories, and road traffic. In UAV aerial images, especially overhead images, crowd occlusion is rare. The main difficulty is that the human target is small, and too many down sampling steps lead to the loss of key information. Using mainstream target detection algorithms to detect and count pedestrians in UAV images is an effective approach: the task locates the human body by learning features of the body or head from the image information, and the pedestrian counting result is the number of located human bodies.
Early studies on pedestrian counting mainly focused on sliding-window human detectors, whose detections were then tallied to obtain the number of people [1]. Human detection is usually based either on the whole body or on local parts. Detection based on the whole human body [2,3] is the traditional pedestrian detection method.
These methods train a classifier on features extracted from the whole body, such as Haar features [4], histogram of oriented gradients features [5], and edgelet features [6].
Since then, machine learning algorithms such as support vector machines and random forests have improved prediction results to varying degrees, but these methods degrade in high-density crowds and have limitations. To solve this problem, researchers turned to local detection [7][8][9], estimating the number of people in a specified area with classifiers built on the face, head, or shoulders. With the development of convolutional neural networks and target detection technology, more and more deep neural network models have been applied to counting tasks.
For example, [10] counts people through face detection, and [11] uses the YOLO algorithm to count people through human body detection. Literature [12] proposed a new network structure, SAF R-CNN, which trains dedicated subnetworks for large and small pedestrian targets to capture their distinct characteristics. Regression-based counting is generally used in crowd counting scenes: by learning the feature information corresponding to the crowd in the image, the number of people can be regressed directly, or a crowd density map can be regressed and the number of people computed from it. At present, in regression-based crowd counting, deep convolutional neural networks have become a research hotspot and are widely used in many scenes. Cong et al. first proposed the neural-network-based crowd counting model Crowd CNN in 2015 [13]. The model has a six-layer convolutional neural network and achieved the best performance at the time on UCSD and other data sets. Wang et al. proposed a seven-layer convolutional neural network model and achieved good results on the UCF data set [14]. Zhang et al. proposed a multi-column convolutional neural network structure to map the image into a crowd density map and released the labeled ShanghaiTech data set [15]. Liu et al. proposed an end-to-end trainable deep network structure that exploits features obtained from multiple receptive fields of different sizes, learns the importance of each feature at each image position, adaptively encodes the scale of context information, and raised the prospect of counting on UAV platforms [16]. Jiang et al. proposed a method to reduce the counting error caused by differences in crowd density. The method has two networks, DANet (Density Attention Network) and ASNet (Attention Scale Network).
DANet provides ASNet with attention masks related to regions of different density levels. ASNet first generates density maps and scaling factors and then multiplies them by the attention masks to output separate attention-based density maps [17]. The following problems remain in the above research: (1) Small targets in UAV images are easily ignored, and key information is easily lost after multiple down sampling steps. (2) The background of aerial images is complex, making it difficult to focus on the target area, especially in scenes with sparse pedestrians; simple density map generation methods lead to large errors in the counted number of people.
We propose an end-to-end small target pedestrian detection and counting network model named YOLO-CC (YOLO and Crowd Counting) for UAV aerial images. The main novelties and contributions of this study are as follows: (1) CSPDarknet-34 is used as the backbone network for feature extraction, and the number of down sampling steps of the original YOLOv4 is reduced.
(2) The density map generation network module is embedded into our model; it generates the density map, computes the number of people, and strengthens the backbone's attention to the target area. (3) A multiscale convolutional neural network is applied in the density map generation module.
The results show that the proposed method performs well on small target pedestrian detection and counting and runs in real time. On the VisDrone2021-counting data set, the AP50 value is 39.32% and the MAE is 7.29. YOLO [18] was the first one-stage target detection algorithm based on deep learning. The algorithm learns and extracts features of the whole image through a neural network to predict every bounding box and the categories of all objects in the image simultaneously. First, the input image is divided into an S × S grid. Each grid cell predicts B bounding boxes and C category probabilities. Each bounding box contains 5 parameters: x, y, w, h, and confidence. The confidence is expressed as the IoU (Intersection over Union) between the prediction box and the real box. Finally, an S × S × (B × 5 + C) tensor is output, and the above information is mapped to this tensor. After decoding this information, NMS (non-maximum suppression) is used to remove duplicate detections.
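The duplicate-removal step can be illustrated with a minimal greedy NMS sketch. Boxes are given as (x1, y1, x2, y2) corner coordinates; the threshold value of 0.5 is an illustrative choice, not necessarily the one used in YOLO-CC.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # discard boxes that overlap the kept box too strongly
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Here two heavily overlapping detections of the same person collapse to the higher-scoring one, while a distant detection survives.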

Related Work
The anchoring technique was introduced in YOLOv2 [19], which uses the offset between the anchor and the ground-truth box to locate the target, normalizes the output of each layer, and accelerates the convergence of the network. The output feature maps have size 13 × 13, obtained by five down sampling steps from 416 × 416 input images; these feature maps suffice for detecting most targets but cannot meet the needs of multiscale target detection. YOLOv3 [20] uses Darknet-53 as the backbone network, which increases the depth of the network, and extracts three feature layers for prediction. The scales of the feature maps are 13 × 13, 26 × 26, and 52 × 52, used to predict large, medium, and small targets in images, respectively. YOLOv3 adopts the FPN (feature pyramid network) model for feature fusion of the output feature maps, which improves the detection accuracy of small targets.
YOLOv4 [21] adopts CSPDarknet-53 as the backbone network for feature extraction, which further increases detection accuracy while keeping the network depth unchanged.
The neck of YOLOv4 uses PANet (Path Aggregation Network) for feature fusion. SPP (spatial pyramid pooling) is used to expand the receptive field; it applies max pooling with kernel sizes k = [1 × 1, 5 × 5, 9 × 9, 13 × 13] and then concatenates the resulting feature maps. The main differences among the YOLO series are given in Table 1. Figure 1 shows the YOLOv4 structure.
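The SPP operation described above can be sketched in PyTorch as parallel stride-1 max-pools concatenated with the identity branch (which plays the role of the 1 × 1 "pool"); channel counts in the test below are illustrative only.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling in the YOLOv4 style: parallel max-pools
    with kernel sizes 5, 9, 13 (stride 1, 'same' padding) plus the
    identity branch, concatenated along the channel dimension."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):
        # spatial size is preserved; channels grow by a factor of 4
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```

Because every branch preserves the spatial resolution, a 256-channel 13 × 13 input yields a 1024-channel 13 × 13 output, which is then reduced again by subsequent convolutions.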

Proposed Work
The YOLO-CC network model we designed consists of a human body detection network module and a density map regression network module. The overall structure of the network is shown in Figure 2. The main goal of the human body detection network module is to locate and mark the human body. The main goal of the density map regression network module is to generate a density map to estimate the number of pedestrians and to strengthen attention to the target area during feature extraction.

Human Body Detection Network
Module. Because the original feature extraction network CSPDarknet-53 has too many parameters and too many down sampling steps, it easily causes slow convergence and network degradation. We optimized the network structure with a lightweight CSPDarknet-34 as the backbone feature extraction network, reducing the number of convolution layers in the residual network from 53 to 34. The input image size is 416 × 416 with 3 channels. The main function of the first convolution unit is to increase the number of channels to 32; the subsequent convolution operations consist of three CSP (Cross Stage Partial) residual units, containing 1, 8, and 6 residual blocks, respectively. Each residual block consists of a trunk of two convolution layers and a residual edge. Each convolution block in a CSP residual unit is composed of a convolutional layer, a batch normalization layer, and a mish activation function. The purpose of the batch normalization layer is to keep the distribution of each layer's parameters as consistent as possible and to accelerate convergence. Compared with the ReLU activation function, the mish activation function is smoother at zero and has stronger generalization ability. Mathematically, the mish activation function can be described as

mish(x) = x · tanh(ln(1 + e^x)),

where x represents the input matrix. CSPDarknet-34 includes three down sampling operations, each performed by a convolution layer with stride 2 and kernel size 3 × 3. The extracted effective feature maps are C2 and C3, with 128 and 256 channels and scales of 104 × 104 and 52 × 52, respectively. The C3 feature map is fed into SPP to extract spatial feature information at different sizes. Then, after three convolutional layers, the dimension of the C3 feature map is reduced to 128; the main purpose is to summarize the effective features and reduce subsequent computation. FPN is used for feature fusion; its purpose is to combine the position information of the low-level feature layer with the semantic information of the high-level feature layer. Specifically, C3 is upsampled and concatenated with C2 to produce one YOLO head output, and C3 itself produces another YOLO head output after a few further operations. Target localization is anchor-based; because the targets in the data set are small, we use only two anchor scales, 10 × 10 and 15 × 15. The max-IoU matching algorithm computes the matching degree between the ground truth and each anchor, and the best-matching anchor is selected as the prediction box of the current target.
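As a minimal illustration, the mish activation maps directly onto PyTorch primitives (tanh of softplus is numerically safer than forming ln(1 + e^x) explicitly):

```python
import torch
import torch.nn.functional as F

def mish(x):
    """Mish activation: x * tanh(softplus(x)) = x * tanh(ln(1 + e^x)).
    Smooth at zero, unlike ReLU, and approximately identity for large x."""
    return x * torch.tanh(F.softplus(x))
```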

Density Map Regression Network Module.
In order to make the network model output the crowd density in the image more intuitively and make feature extraction pay more attention to the target area, inspired by the mask map generation in [22] for enhancing attention to the target area during target detection training, we designed the density map regression network module to generate the crowd density map. Its structure is shown in Figure 3. As the input of the MCNN (multiscale convolutional neural network) block, the C3 feature layer is convolved in parallel by four different convolutional kernels, and the results are concatenated. The main purpose of this operation is to extract multiscale crowd image features. Images usually contain heads of different sizes and different aggregation patterns, so convolutional kernels of a single size are unlikely to capture crowd density information at different scales; it is more natural to use convolutional kernels of different sizes to map the original pixels to a density map. The kernel sizes are 1 × 1, 3 × 3, 5 × 5, and 7 × 7. The concatenated feature map has 256 channels, which are reduced to 128 after three convolution operations. Then the feature map is down sampled by a convolution with kernel size 3 and stride 2, reducing its size to 26 × 26. Finally, three convolutional layers reduce the number of channels in the feature map to 1, yielding the final crowd density map. The parameter settings of each layer of the density map generation module are shown in Table 2.
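The parallel multiscale convolution can be sketched in PyTorch as follows; the per-branch channel count of 64 is an assumption chosen so that the four concatenated branches yield the 256 channels stated above.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Sketch of the MCNN block: the C3 feature map is convolved in
    parallel with 1x1, 3x3, 5x5, and 7x7 kernels ('same' padding) and
    the branch outputs are concatenated along the channel dimension."""
    def __init__(self, in_channels=256, branch_channels=64):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, branch_channels, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5, 7)
        )

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```

With these assumed channel counts, a 256-channel 52 × 52 C3 map comes out as a 256-channel 52 × 52 multiscale feature map, ready for the dimension-reduction convolutions.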
We use a simple but intuitive way to generate the ground-truth crowd density map. If there is a head at position x_i in the image, it is represented by a delta function δ(x − x_i), so an image with N heads can be expressed as

H(x) = Σ_{i=1}^{N} δ(x − x_i).

Then, the image is transformed into a density map by convolution with a Gaussian kernel:

F(x) = H(x) ∗ G_σ(x),

where G_σ(x) is the Gaussian kernel function. Specifically, we first resize the original annotation to 26 × 26, add 1 at each pixel containing a head, and set the pixel value of areas without a head to 0; the total number of people is then the sum of the image's pixel values. Next, Gaussian filtering with a 3 × 3 kernel is applied to turn the image into a density map, which prevents the model output from converging to all zeros while leaving the total count unchanged. The actual effect of the density map generated by this method is shown in Figure 4. The number of people in the image can be calculated by summing the values of all pixels in the density map.
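The construction above can be sketched in NumPy; the 3 × 3 kernel is normalized to sum to 1 so the total count is preserved for heads away from the border, and σ = 1 is an illustrative choice.

```python
import numpy as np

def density_map(head_points, size=26, sigma=1.0):
    """Build a crowd density map: start from a size x size grid with 1
    added at each head position, then smooth with a 3x3 Gaussian kernel
    normalized to sum to 1, so the sum over all pixels still equals the
    head count (for interior heads)."""
    grid = np.zeros((size, size), dtype=np.float64)
    for (row, col) in head_points:
        grid[row, col] += 1.0

    # 3x3 Gaussian kernel, normalized so total mass is preserved
    ax = np.array([-1.0, 0.0, 1.0])
    g = np.exp(-ax**2 / (2 * sigma**2))
    kernel = np.outer(g, g)
    kernel /= kernel.sum()

    # 'same' convolution via zero padding
    padded = np.pad(grid, 1)
    out = np.zeros_like(grid)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + size, j:j + size]
    return out
```

A single interior head produces a map whose pixel values sum to 1 and whose peak sits at the annotated position, matching the count-by-summation rule in the text.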

Experiments and Evaluation
We implemented the proposed YOLO-CC in PyTorch; the models are trained and tested on an NVIDIA GeForce RTX 3090. Our CUDA version is 11.4, and the CPU is an Intel i9-10900K. In the human body detection module, the coordinate error uses the mean square error, and the classification and confidence errors use the cross-entropy loss function. In the density map regression module, the error between the predicted density map and the real density map uses the MAE (mean absolute error); the final loss is the sum of the above errors. During training, the input images are uniformly resized to 416 × 416. The learning rate follows a cosine annealing schedule: the maximum learning rate is 1 × 10−3 in the first 30 epochs, followed by a maximum learning rate of 1 × 10−4, with a minimum learning rate of 1 × 10−6.
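One way to realize such a two-phase cosine schedule is sketched below; the total epoch count of 100 and the exact phase boundaries are assumptions for illustration, since the text only fixes the maxima (1e-3, then 1e-4) and the floor (1e-6).

```python
import math

def learning_rate(epoch, total_epochs=100, warm_epochs=30):
    """Two-phase cosine annealing: anneal from 1e-3 over the first
    30 epochs, then from 1e-4 over the remaining epochs, never
    dropping below the 1e-6 floor."""
    lr_max = 1e-3 if epoch < warm_epochs else 1e-4
    lr_min = 1e-6
    start = 0 if epoch < warm_epochs else warm_epochs
    end = warm_epochs if epoch < warm_epochs else total_epochs
    t = (epoch - start) / max(1, end - start)  # progress within phase
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

Each phase starts at its maximum (cos 0 = 1) and decays smoothly toward the floor, which is the usual motivation for cosine annealing over step decay.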

Data Set and Evaluations Metrics.
The data set we used is VisDrone2021-counting [26], from the 2021 Vision Meets Drones Challenge. The data set is divided into two parts, train and test-challenge, containing 1807 and 912 RGB images, respectively. Test-challenge is dedicated to contest testing and does not provide ground-truth labels. Therefore, the images used in this paper are from train and are divided into a training set, a validation set, and a test set in the ratio 7 : 1 : 2. For the evaluation of pedestrian detection quality, we adopt the AP50 (average precision) metric. Specifically, for AP50, a bounding box prediction counts as a true positive when the IoU between the predicted and the ground truth box is higher than 0.5. For the evaluation of counting quality, we adopt MAE (mean absolute error) and MSE (mean squared error).
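The counting metrics, taken as literally defined in the text (plain mean squared error, not its root), reduce to a few lines over per-image predicted and ground-truth counts:

```python
def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / len(pred_counts)

def mse(pred_counts, true_counts):
    """Mean squared error between predicted and ground-truth counts."""
    return sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts)) / len(pred_counts)
```

MAE measures average miscount per image, while MSE penalizes large per-image errors more heavily.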

Experimental Results.
In this section, we evaluate the two modules of YOLO-CC on the test set and compare them with other methods. We use the original YOLOv4 as the baseline; Table 3 reports the results. YOLOX, also a common object detection method, achieved a 5.12% improvement in AP50 over the baseline, while our method is 16.64% higher than YOLOX in AP50. Visualization results of YOLO-CC, YOLOv4, and YOLOX on the test set are shown in Figure 5.
YOLO-CC performs well on the test set. The human body detection results generated by YOLO-CC better indicate the specific location of the human body, and the density map more accurately reflects the actual crowd distribution.

K-Fold Validation Experiment.
The K-fold validation experiment divides the data set into K equal parts, chooses one part as the test set to evaluate model performance, and uses the other K − 1 parts as the training set to train the model parameters; model performance is then assessed comprehensively from the results of the multiple runs. In our experiment, K is 5. Figure 6 compares the estimated number of pedestrians from the density maps generated by the YOLO-CC model with the actual number of pedestrians. Table 5 shows the results of the K-fold validation experiment. The experimental results show that the YOLO-CC model performs consistently across different test sets. The estimated number of pedestrians in the density map fits the actual number of pedestrians, with an average error of 12.19.
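The fold construction described above can be sketched generically; the round-robin assignment of images to folds is one simple choice, not necessarily the split used in the experiment.

```python
def k_fold_splits(items, k=5):
    """Split a data set into k roughly equal folds; each fold serves once
    as the test set while the remaining k-1 folds form the training set."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

With k = 5, every image appears in exactly one test set and four training sets, so the averaged metrics reflect performance over the whole data set.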

Conclusion
This paper designs the YOLO-CC network model, which consists of a human body detection network module and a density map regression network module. In the human body detection network module, CSPDarknet-34 is first used as the backbone network for feature extraction, then SPP and FPN are used for feature enhancement and fusion, and finally fixed-scale anchors are used for localization and detection. In the density map regression network module, multiscale convolution is first used to extract features, and then feature dimension reduction is used to generate the predicted density map. Our experiments show that the human body detection network module yields better pedestrian detection results, and the density map regression network module improves attention to the target area and gives better feedback about the pedestrian distribution.
In the future, we will focus on more complex aerial images, such as dense crowds, to improve the accuracy of pedestrian detection and counting.

Data Availability
All data included in this study can be downloaded from the official website of "VisDrone - Vision Meets Drones: A Challenge" or obtained by contacting the corresponding authors.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.