Enhancement of Local Crowd Location and Count: Multiscale Counting Guided by Head RGB-Mask

Background: In dense crowd images, traditional detection models often suffer from inaccurate multiscale target counts and low recall rates. Methods: To solve these two problems, this paper proposes an MLP-CNN model which, combined with an FPN feature pyramid, can fuse low-resolution and high-resolution semantic feature maps with little computation and can effectively solve the problem of inaccurate head counts for multiscale crowds. The MLP-CNN "mid-term" fusion model can effectively fuse the features of the RGB head image and the RGB-Mask image. With the help of head RGB-Mask annotation and adaptive Gaussian kernel regression, an enhanced density map can be generated, which effectively addresses the low recall of head detection. Results: The MLP-CNN model was applied to the ShanghaiTech, UCF_CC_50, and UCF-QNRF datasets. The test results show that the error of the proposed method is significantly improved, and the recall rate reaches 79.91%. Conclusion: The MLP-CNN model not only improves the accuracy of crowd counting through density map regression but also improves the detection rate of multiscale head targets.


Introduction
At present, image-based crowd counting still faces many problems: (1) image clutter, uneven crowd distribution, crowd overlap, and occlusion lead to low head detection rates; (2) pedestrians appear at different scales in the image: because heads lie at different distances from the camera, they have different scales, and small-scale heads are not easy to detect. All these factors create huge challenges for the further advancement of crowd counting [1][2][3][4][5].
Current crowd counting methods can be divided into two categories: methods based on object detection and methods based on feature regression [6][7][8][9]. Early work used some kind of object detection model to detect individual objects. However, detection architectures require substantial computational resources and do not handle occlusion or scale-dependent feature extraction well. When a head is small or occluded, it usually cannot be detected; therefore, the main problem is the low recall rate of head detection. In real dense crowd scenes, small heads are common. As a result, detection-based dense crowd counting has gradually been replaced by other methods because it tends to underestimate counts [10][11][12].
In the past, head detection could only handle crowds of dozens of people. When the crowd size exceeds a few hundred people, detection models struggle to cope with the small head sizes and severe occlusion. In contrast, regression methods based on density maps can more reliably capture the overall characteristics of a crowd and can effectively estimate its size [9,13,14].
Usually, a Gaussian kernel is generated with each head as the center, but it does not match the size of the head, and the density map is visibly disturbed by the background [15][16][17]. Therefore, the density maps generated this way also suffer from significant deficiencies. As shown in Figure 1, GT and ES are the ground-truth density maps and the density maps estimated by the MCNN model on the ShanghaiTech Part A dataset; the density maps estimated by MCNN are obviously distorted. The problem is that density map regression can only estimate the count but cannot locate head positions, which severely limits the application of crowd counting in video anomaly detection and pedestrian reidentification. As also shown in Figure 1, head detection with the YOLO v4 model cannot detect small-scale heads. In contrast, the RGB head annotation box provides more head localization information. If these head ROI images can be used as training masks, they will help strengthen head features and facilitate estimating head size.
There are currently methods that utilize adaptive Gaussian kernels to generate high-quality density maps [18,19]. High-quality density maps train more robust regression networks, providing prior knowledge for crowd detection that is closer to the actual distribution of crowds [20]. One reason that previous detection methods cannot detect small heads is the lack of a scale perceptron, or a limitation of their own structure. For those tiny heads, efficient scale-adaptive perceptrons should be designed. Fortunately, fusing RGB image features with head RGB-Mask image features can provide a prior for estimating head size, which helps to design suitable scale-fusion perceptrons for heads of different scales [21,22].
Aiming at the shortcomings of the above methods, this paper attempts to use the prior information provided by the density map, combined with the RGB-Mask labeled data, to achieve high-recall and highly robust density-map-guided detection. The contributions of our work are summarized as follows: (i) In the past, little work counted and detected people of different sizes using multifeature fusion; in particular, previous work has rarely fused the RGB-Mask feature into the RGB feature. This paper proposes a "mid-term fusion" scheme between the RGB-Mask feature and the RGB feature. The choice of mid-term fusion not only ensures the effective fusion of the head RGB-Mask feature and the head RGB feature but also ensures that the head RGB-Mask strengthens local small-target features during VGG16 small-target feature extraction. Therefore, these enhanced small-target head features can be effectively connected with the low-resolution semantic features in the subsequent FPN feature pyramid. (ii) Through analysis of previous work, it is found that the traditional FPN feature pyramid starts with high-resolution semantic features, so low-resolution semantic feature information and low-resolution semantic feature maps are insufficient. The improved FPN model starts with low-resolution semantic features and, after fusing them with high-resolution semantic features, also ends with low-resolution semantic features. In this way, feature maps carrying low-resolution and high-resolution semantic information can be fused with little computation. It can take into account both the high-level semantic features that carry little small-target information and the low-level semantic features that carry more. Finally, the high-semantic feature layers are upsampled and stacked downward to preserve the features and information of small targets.
(iii) Through analysis of previous work, it is found that many methods realize crowd detection using the cross entropy loss alone or the L1 and L2 loss functions alone; however, there is little literature on combining the cross entropy loss with the L1 and L2 loss functions. Because the cross entropy loss is only effective for low-density pedestrian detection, it is not suitable for dense crowd detection on its own. Therefore, this paper attempts to combine the cross entropy loss with the L1 and L2 loss functions and then realize small-scale head detection with density map regression as the guiding model.

Detection-Based Counting.
Early work on the crowd counting problem focused on detection-based counting methods. These works count the total number of pedestrians by detecting the body, head, or shoulders [23][24][25][26][27]. Reference [23] proposed a method based on skeleton detection to count the total number of pedestrians in crowd scenes. Specifically, a skeleton map is obtained by foreground segmentation, and the moving target is detected by comparing the difference between the skeleton and the background. The works [24,25] used a real-time skeleton detection model based on OpenPose to detect pedestrians.
This method achieved initial success in sparse crowds. However, under occlusion, the detection of multiple human skeletons becomes abnormal due to overlap, which leads to wrong counts. Occlusions are common in real-world scenarios, so most pedestrian detection and counting systems fail there. To achieve efficient detection, head region-based detection is an effective way to mitigate occlusion [26,27].
In recent years, CNN-based head-and-shoulders pedestrian detection has developed considerably. For example, R-CNN [28], Fast R-CNN [29], R-FCN [30], and Mask R-CNN [31] can be applied to low-density crowd counting, but these detection models do not perform well on small object detection. The reason is that these models lack an effective head-scale processing strategy for small target objects. For another group of methods such as OverFeat [32], YOLO [33], and SSD [34], although these frameworks can detect some objects at smaller scales, the detection performance is poor, with especially large errors on small objects. Although SSD balances computation time and accuracy well, the above methods are clearly unable to cope with dense crowds with severe occlusion because no effective strategy is designed for them.
Crowd counting is an extremely challenging job. Crowded images are currently divided into two categories: crowds that can be visually resolved and small-resolution clumps that cannot. For resolvable crowds, counting can be done using regression-based methods. Much literature [35][36][37][38][39] uses regression methods to address the crowd counting problem. These methods first extract local edge features and texture features of crowd images and then learn a regression function to estimate the sum of all local counts in the image. The regression function builds a mapping from local features to counts. Commonly used regression functions include linear regression [35], piecewise linear regression [36], ridge regression [37], Gaussian process regression [38], and neural networks [39].
A small head in an indistinguishable crowd image covers only 10 to 20 pixels, so there is not enough information to extract pedestrian features. Ji et al. [40] considered the difficulty of learning such features and therefore used random forest regression to learn the nonlinear mapping between local patch features and density maps. Following this work, Sadler et al. [41] used random forests to regress crowd density, and training efficiency was also greatly improved. In [42], Mo used the response of a Laws filter convolved with a mask to obtain a two-dimensional density layer and finally realized regression on difficult-to-distinguish crowd images, where the mask is created by a gray-scale restricted region-growing method. In other words, methods based on these regressions are more likely to fail at crowd counting in image areas with high crowd density due to the lack of deeper features. Therefore, the counting problem for visually indistinguishable crowded images cannot be completely solved by them.

Density Map Regression.
Methods based on density map regression have achieved a breakthrough in addressing indistinguishable crowd counting [8,43-49]. Powerful CNNs play an important role in the density map regression process, and Wang et al. [43] showed that features extracted by deep models are more effective than handcrafted features. Compared with count-regression methods, density-map-regression methods preserve a large amount of spatial distribution information in the crowd area, so density map regression is more suitable for analyzing small targets. The crowd counting process is to first regress the density map of the crowd and then obtain the count by integrating over the density map.
Pai et al. [44] aimed to achieve dense crowd counting in visually indistinguishable crowded images. This method convolves image patches with a Gabor filter and classifies the Gabor filter responses with a support vector machine (SVM). The method is effective for counting both high-density and low-density crowd images in a specific scene, but it does not transfer to other scenes; its migration performance is poor.
Reference [44] proposed density map regression with an adaptive Gaussian kernel, which can better handle density map estimation in regions with different density levels. Miangoleh et al. [45] attempted to learn various density levels to integrate contextual information and generate high-resolution density maps. Reference [45] also proposed using density map regression results to guide detection. Reference [46] proposed a framework called Hydra-CNN, which achieves the final density prediction by extracting a pyramid of image feature blocks at multiple scales. Zhang et al. adopted a CNN with geometric or perspective information to fuse scale-dependent contextual information and achieve multiscale perception. Zhang et al. [48] fused features from different counting network layers to obtain representations robust to scale changes. Reference [49] proposed a Deep Scale Purification Network (DSPNet) to extract multiscale features and compensate for the loss of context. Sam et al. [50] proposed Switch-CNN, which trains an optimal regressor for a specific input, thereby improving counting ability.
Density map regression based on deep learning [51][52][53] has solved many dense crowd counting problems in the past few years, but it also has a shortcoming: although it captures the location information of crowds in crowded images, it cannot localize pedestrian bounding boxes. This limits further applications in surveillance domains such as pedestrian tracking and reidentification.

Density Map Regression Guided Detection.
In order to simultaneously estimate the number of human heads and detect bounding boxes when regressing the density map, Zhong et al. [54] used density map regression to improve head detection results, but that method does not work for cross-scene counting. The research most relevant to this paper is that of Hou et al. [55], who used cross-modal data to achieve crowd counting in RGB-D images with the help of a regression-guided detection network (RDNet), leveraging density maps to improve head detection rates in the detection network. To improve robustness, the detector directly classifies anchors into specific classes and regresses bounding boxes in a dense manner. These convolutional features usually capture only basic visual patterns and lack strong semantic information, which may lead to many false positives.

Methods
The overall architecture of the method described in this paper includes two core components: the head RGB-Mask perceptron and the adaptive Gaussian kernel density map regressor. This section analyzes the internal mechanism of the method from the perspective of its formulas and structure. The head RGB-Mask perceptron is implemented with the help of the MLP-CNN network. The regression-guided detection of the adaptive Gaussian density map is realized with the help of the MR-CNN network. The training data in this paper uses the binarized head RGB-Mask and the masked head annotation boxes as input to strengthen head-supervised training.

Adaptive Gaussian Kernel Density Map.
Adaptive Gaussian kernel regression is able to produce density maps that are closer to the true density map. The adaptive Gaussian kernel can gradually approach the guide size of the head mask through training. The density maps produced by regression can provide prior knowledge for the head detection module of the MR-CNN network. This prior can guide the location and size of the generated head detection boxes.

Gaussian Density Map. Crowd estimation requires the conversion of labeled head images into crowd density maps.
Assuming that an image has $N$ heads, the ground-truth annotation can be expressed as

$$H(x) = \sum_{i=1}^{N} \delta(x - x_i), \tag{1}$$

where $\delta$ is the impulse function, $x_i$ is the pixel position of the $i$-th head, $\delta(x - x_i)$ is the impulse response of the head position in the image, and $N$ is the total number of heads in the image. The density map based on the traditional Gaussian kernel can then be expressed as

$$F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \qquad \sigma_i = \beta S_i, \tag{2}$$

where $S_i$ in formula (2) is the average distance of the $m$ nearest heads from the head at $x_i$. In a dense crowd, $S_i$ is approximately equal to the size of the head.
Here the experimental parameter $\beta$ is adaptively adjusted according to the actual crowding degree of each image, so the size of the Gaussian kernel is variable.
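As an illustration, the geometry-adaptive kernel of formula (2) can be sketched in a few lines of NumPy. The function name, the fallback sigma for isolated heads, and the defaults beta = 0.3 and m = 3 are assumptions for this sketch, not values taken from the paper.

```python
import numpy as np

def density_map(shape, heads, beta=0.3, m=3):
    """Geometry-adaptive Gaussian density map, a sketch of formula (2).

    shape : (H, W) of the image.
    heads : iterable of (x, y) head centres.
    For each head, sigma_i = beta * S_i, where S_i is the mean
    distance to its m nearest neighbouring heads.
    """
    H, W = shape
    heads = np.asarray(heads, dtype=float)
    dmap = np.zeros((H, W))
    ys, xs = np.mgrid[0:H, 0:W]
    # pairwise distances between head centres (diagonal excluded)
    d = np.linalg.norm(heads[:, None, :] - heads[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    for i, (x, y) in enumerate(heads):
        k = min(m, len(heads) - 1)
        S_i = np.sort(d[i])[:k].mean() if k > 0 else 15.0  # assumed fallback
        sigma = max(beta * S_i, 1.0)
        # stamp a Gaussian centred on the head; normalise so each head
        # integrates to exactly 1 even when truncated at the border
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()
    return dmap
```

Because each stamped kernel is renormalized, the integral of the map equals the head count exactly, which is the property the counting loss relies on.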

Head RGB-Mask Adaptive Estimation.
In order to make the density map better correspond to images with different head sizes in dense crowds, the traditional Gaussian kernel function is improved, and a Gaussian kernel based on head RGB-Mask geometric adaptivity is proposed.
On the basis of the traditional Gaussian density map formula, prior knowledge is used to further enhance the adaptivity of the Gaussian kernel to the head RGB-Mask features. Different from the prior knowledge of previous algorithms, this paper proposes a new head RGB-Mask perception prior, which further highlights the head RGB-Mask target by considering the position and size relationships of the head RGB-Mask geometric constraints. This prior knowledge is represented by a Gaussian model as

$$P(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left\|(x_\theta, y_\theta) - \mu\right\|^2}{2\sigma^2}\right), \tag{3}$$

where $\mu$ indicates the position of the Gaussian peak; $\sigma$ controls the shape of the Gaussian curve (the smaller $\sigma$ is, the steeper the curve); and $(x_\theta, y_\theta, d_i)$ is the coordinate of pixel $\theta$ in the normalized image coordinate system. The XY plane corresponds to the image plane, and $d_i$ corresponds to the head RGB-Mask size in the image. The density map regression module takes an image as input and utilizes a CNN for density map estimation. The density map generation strategy is to use the head RGB-Mask adaptive Gaussian kernel. Given a training set of heads with annotated boxes, if the image contains a total of $N$ heads, the adaptive Gaussian kernel density map of the image can be written as

$$F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G(x, y, d_i), \tag{4}$$

where $G(x, y, d_i)$ is a 2D Gaussian kernel with adaptive bandwidth. This transforms the crowd counting problem into learning a mapping $F: I(x) \rightarrow F(x)$ from the image space $I(x)$ to the density map space $F(x)$. Once the mapping is established, a density map can be obtained for any given image, and its integral over the entire image is an estimate of the total head count.
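To illustrate how the annotated mask size $d_i$ can drive the kernel bandwidth in formula (4), here is a hedged NumPy sketch of a density map built from head annotation boxes. The scaling factor `alpha` and the use of the box diagonal as $d_i$ are assumptions for illustration, not the paper's exact definition of $G(x, y, d_i)$.

```python
import numpy as np

def mask_adaptive_density(shape, boxes, alpha=0.3):
    """Density map whose kernel bandwidth follows the annotated
    head-box size d_i (a sketch of the adaptive kernel G(x, y, d_i)).

    boxes : list of (x1, y1, x2, y2) head annotation boxes. Each
    Gaussian is centred on the box centre; its sigma is tied to the
    box diagonal, so small heads get sharp, narrow kernels.
    """
    H, W = shape
    dmap = np.zeros((H, W))
    ys, xs = np.mgrid[0:H, 0:W]
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        d_i = np.hypot(x2 - x1, y2 - y1)   # head mask size (box diagonal)
        sigma = max(alpha * d_i, 1.0)
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()                # each head integrates to 1
    return dmap
```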

Head RGB-Mask Perception Network. RGB-Mask Perceptron.
For the head RGB-Mask perceptron, the annotated head RGB-Mask dataset is used to train MLP-CNN. MLP-CNN includes multiple scalable submodules, and each submodule unit consists of a VGG16 network. In order to find a reasonable structure for the MLP-CNN variant, each VGG16 unit of the MLP-CNN is connected in series and in parallel. The RGB-Mask features are captured by the first ten convolutional layers of VGG16. Finally, the RGB-Mask features of MLP-CNN enter the RGB feature network through the mid-level entry and complete the feature fusion of the head RGB and head RGB-Mask.
The Head RGB-Mask and RGB Fusion Network. As shown in Figure 2(a), different entry points are used to fuse the head RGB-Mask and RGB models. The head RGB and head RGB-Mask inputs can be directly concatenated, resulting in a new first convolutional layer; this is called early fusion. The scores of the head RGB network and the head RGB-Mask branch can also be concatenated at the end of the network, with a 1 × 1 convolution then used as the classifier; this is called late fusion.
This Paper Adopts Mid-Term Fusion. Early fusion is more expressive than mid-level fusion because it can fully exploit the correlations between features. However, the greater the expressive power, the higher the required training cost. The benefit of late fusion is that most of the network's initialization weights can be reused directly without readjusting them for additional inputs. Unfortunately, it does not allow the network to learn high-level interdependencies between the individual input modalities, since only the resulting scores at the classification level are fused.
Finally, the scores of the head RGB-Mask branch can be merged before a max-pooling layer of the RGB network, followed by a 1 × 1 convolutional layer. The number of MLP-CNN modules used in this mid-level fusion method is determined by the desired spatial dimension in the RGB network. Therefore, these models are designed optimally according to the number of VGG16 units in the MLP-CNN module, taking into account both training cost and the high-level interdependence between input modalities.
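The mid-term fusion step described above (channel-wise concatenation of the two branches, a 1 × 1 convolution, then max pooling) can be sketched with plain NumPy. The function names and tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))

def max_pool2(x):
    """2x2 max pooling on (C, H, W); H and W are assumed even."""
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

def mid_fusion(rgb_feat, mask_feat, w):
    """Mid-term fusion sketch: concatenate the RGB and RGB-Mask
    feature maps channel-wise, project them with a 1x1 convolution,
    then apply the RGB network's max-pooling layer."""
    fused = np.concatenate([rgb_feat, mask_feat], axis=0)
    return max_pool2(conv1x1(fused, w))
```

For example, fusing a (4, 8, 8) RGB map with a (2, 8, 8) mask map through a (8, 6) projection yields an (8, 4, 4) output, ready for the next RGB stage.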
FPN (Feature Pyramid Network). In order to handle multiscale targets, a feature pyramid structure is added, as shown in Figure 2(b). The purpose of the feature pyramid is to increase the CNN's ability to handle head scale changes. The pathway on the left side of the feature pyramid is called bottom-up: the network first performs traditional bottom-up feature convolution (left side of the figure). The pathway on the right is called top-down, and the horizontal arrows are called lateral connections; the top-down pathway fuses adjacent feature maps from top to bottom. The rationale is that high-level features carry more semantics, while low-level features carry less semantics but relatively more location information.
The specific method is as follows. Of the two feature layers being merged, the higher-level one is upsampled 2x by interpolation; that is, on the basis of the original pixels, an interpolation algorithm inserts new pixels between them, doubling the feature size. The lower-level features pass through a 1 × 1 convolution to change their number of channels, and then the upsampled result and the 1 × 1 convolution result are added element-wise. The lateral connection uses the 1 × 1 convolution to change the number of channels so that every level's output has 256 channels, which is convenient for classifying the merged features later.
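One top-down merge step of the FPN just described can be sketched as follows. This is a minimal NumPy illustration of 2x upsampling plus a lateral 1 × 1 convolution and element-wise addition; nearest-neighbour upsampling stands in for the unspecified interpolation method, and all shapes and names are assumptions.

```python
import numpy as np

def upsample2(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_merge(top, lateral, w_lat):
    """One top-down FPN merge: upsample the coarser map 2x, project
    the lateral (lower-level) map to the same channel count with a
    1x1 convolution (w_lat has shape (C_out, C_in)), then add
    element-wise."""
    lat = np.tensordot(w_lat, lateral, axes=([1], [0]))
    return upsample2(top) + lat
```

In the real network the projected channel count would be 256 at every level, as stated above; the toy shapes below just demonstrate the mechanics.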
With the improved FPN network structure, head RGB-Mask annotation is used as a prior during feature training, and the head RGB-Mask strengthens local small-target features in the VGG16 small-target feature extraction process. Therefore, these strengthened small-target head features can be effectively connected with low-resolution semantic features. The pathway starts from the low-resolution semantic features and, after fusion with the high-resolution semantic features, ends with low-resolution semantic features. It can fuse the feature maps with strong low-resolution semantics and the feature maps with weak high-resolution semantics but rich spatial information at little computational cost. The improved FPN network can take into account both the high-level semantic features carrying little small-target information and the low-level semantic features carrying more. Finally, the high-semantic feature layers are upsampled and stacked downward to preserve the features and information of small targets.
Density Map Generator. First, the box coordinates of the human heads in the original image are calibrated, and the density function is obtained with the help of the Gaussian kernel function. However, this assumes that each Gaussian kernel is independent in the sample space. In fact, head pixels are inconsistent in scale across regions at different distances due to scale variation. Also, in practice, it is impossible to obtain the head size accurately because of occlusion between heads, so it is difficult to find the relationship between head size and the density map. Therefore, when the average distance of adjacent heads is used as the parameter within the same scale region, the difference between the generated density map and the real density map is large, as shown in Figure 2.
RGB-Mask Perceptron. In order to accurately estimate crowd density, the head RGB-Mask perception parameter must be added to the adaptive Gaussian kernel function. Because of image distortion, the geometry of the head usually cannot be determined in the original scene, since the original image lacks spatial constraint information for the head pixels. To obtain this information, the perceptron fused with the head RGB-Mask image information is used as the head range constraint. Heads of different scales can give a reasonable head RGB-Mask range for the geometrically distorted part. The parameter σ of the adaptive Gaussian kernel is determined from each head size.
MR-CNN Detector. The detection network takes the features of heads at different scales as input and estimates the center point of each head object at each scale. Then, head-mask reinforcement learning is used to move the head center points toward the reinforced feature boundaries, which are finally represented as detection boxes, as shown in Figure 2(c).

Dataset Introduction and Evaluation Criteria.
The crowd counting method in this paper has been evaluated experimentally on three standard datasets, ShanghaiTech, UCF_CC_50, and UCF-QNRF, as shown in Table 1. ShanghaiTech contains two parts, part_A_Final and part_B_Final; this paper uses all three datasets for model training and testing. The feasibility and applicability of the proposed method are verified by experimental comparison. This paper first gives the relevant parameters of the three datasets used in the experiments. Then, the comparison between the method used in this paper and current state-of-the-art crowd counting methods on these datasets is given, along with crowd detection results with high recall. Finally, ablation studies demonstrate the independent effectiveness of each unit in our combined approach.
Metrics. Mean absolute error (MAE), root mean squared error (RMSE), and cross entropy are used to evaluate crowd counting. MAE loss is also known as L1 loss, and RMSE loss is also known as L2 loss:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|N_i - n_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(N_i - n_i\right)^2},$$

where $N$ is the total number of test images, $N_i$ is the actual number of people in the $i$-th test image, and $n_i$ is the estimated number of people in the $i$-th image.
The cross entropy loss is

$$L_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right],$$

where $y_i$ represents the label of sample $i$ (the head class is 1, the non-head class is 0) and $p_i$ represents the probability that sample $i$ is predicted to be the head class. The L1 and L2 losses are more suitable for regression problems. Therefore, the common density-map-regression method can complete dense crowd counting well, but it is difficult for it to satisfy dense crowd detection at the same time. Because the L1 and L2 losses cannot be applied to dense crowd images with non-Gaussian distributions under the classification task, the detection effect would be very poor, and small-scale heads could not be detected. Cross entropy does not rely on the assumption of a Gaussian distribution. Therefore, combining cross entropy in classification detection compensates for the inability of the L1 and L2 losses to fully detect heads in dense crowd distributions. Another reason is that, relative to the L1 and L2 losses, the cross entropy loss is monotonic as a whole: the greater the loss, the greater the gradient, which is convenient for gradient descent backpropagation and optimization. Therefore, for classification problems, cross entropy is often used as the loss function.
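The three evaluation quantities above have a straightforward NumPy rendering; the function names are ours.

```python
import numpy as np

def mae(gt, est):
    """Mean absolute error (L1) over N test images."""
    gt, est = np.asarray(gt, float), np.asarray(est, float)
    return np.abs(gt - est).mean()

def rmse(gt, est):
    """Root mean squared error (L2) over N test images."""
    gt, est = np.asarray(gt, float), np.asarray(est, float)
    return np.sqrt(((gt - est) ** 2).mean())

def cross_entropy(y, p, eps=1e-12):
    """Binary cross entropy: y is 1 for head, 0 for non-head;
    p is the predicted head probability. eps guards log(0)."""
    y = np.asarray(y, float)
    p = np.clip(np.asarray(p, float), eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()
```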

Advantages of Cross Entropy Loss
Since the model in this paper follows a technical route of density-map-regression-guided detection, from the training perspective, density map regression based on a Gaussian distribution is the primary task of our work, and the head density points in the density map are a favorable premise for guided detection. Therefore, our work first completes training of the adaptive Gaussian regression model and then completes head detection based on head-enhanced feature learning. Here, the L1 and L2 losses are used to train the adaptive Gaussian regression model, and cross entropy then completes the training of the head detection model on that basis. Therefore, cross entropy combined with the L1 and L2 losses is suitable for the overall training of the density-map-regression-guided detection model.
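The combined objective described above can be sketched as a weighted sum of an L2 term on the regressed density map and a cross entropy term on head/non-head classification. The weighting factor `lam` and the function signature are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def combined_loss(dens_gt, dens_est, cls_y, cls_p, lam=1.0, eps=1e-12):
    """Sketch of the combined training objective: an L2 (MSE) term on
    the density map regression plus a cross entropy term on the
    head/non-head classification. lam is an assumed weighting factor."""
    # L2 regression term over density map values
    l2 = ((np.asarray(dens_gt, float) - np.asarray(dens_est, float)) ** 2).mean()
    # binary cross entropy term over head classifications
    y = np.asarray(cls_y, float)
    p = np.clip(np.asarray(cls_p, float), eps, 1 - eps)
    ce = -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()
    return l2 + lam * ce
```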

Dataset Parameter Setting and Training.
Preprocessing. The acquisition of the head RGB-Mask goes through two preprocessing steps, all implemented programmatically, as shown in Figure 3. First, the rectangular RGB image of each head is cropped using the head annotation box in the dataset, and pixels outside the head annotation are masked out. The RGB image is thus converted into small head images, which highlight the mask feature of the head, and is finally converted back into an RGB image.
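The masking step can be sketched as follows, assuming axis-aligned (x1, y1, x2, y2) annotation boxes and zero as the mask value outside the head regions; both assumptions are ours for illustration.

```python
import numpy as np

def head_rgb_mask(image, boxes):
    """Produce a head RGB-Mask image: keep the pixels inside each head
    annotation box and zero out everything else.

    image : (H, W, 3) RGB array.
    boxes : list of (x1, y1, x2, y2) head annotation boxes.
    """
    masked = np.zeros_like(image)
    for x1, y1, x2, y2 in boxes:
        # copy only the head region; the rest stays at the mask value 0
        masked[y1:y2, x1:x2] = image[y1:y2, x1:x2]
    return masked
```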
MLP-CNN Training Settings. MLP-CNN is trained end-to-end. The initial value of the Gaussian parameter in MLP-CNN is set to 0.5, with a standard deviation of 0.02. In our experiments, MLP-CNN uses stochastic gradient descent (SGD) with momentum and a small learning rate to train the model on the ShanghaiTech, UCF_CC_50, and UCF-QNRF datasets; the initial learning rate is set to 0.005, and the momentum is set to 0.85. With this setting, training converges faster, as shown in Figure 4. Our method is implemented in the PyTorch framework. In terms of hardware, three NVIDIA 1080 Ti GPUs and four Intel(R) Xeon(R) E5-2630 v4 CPUs are used to meet the performance requirements of the graphics cards and computing units.
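For concreteness, one SGD-with-momentum parameter update using the stated settings (learning rate 0.005, momentum 0.85) can be written in NumPy. This is the textbook update rule, not the paper's exact PyTorch configuration.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.005, momentum=0.85):
    """One SGD-with-momentum update using the paper's stated settings
    (learning rate 0.005, momentum 0.85). Returns the updated
    parameters and the updated velocity buffer."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```

In PyTorch the same behaviour would be configured through `torch.optim.SGD(params, lr=0.005, momentum=0.85)`.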

Crowd Counting.
Experimental data were collected on state-of-the-art crowd counting methods from 2015 to 2021; the performance of these methods on the different datasets is reported, together with a comparison between the method used in this paper and the current state-of-the-art methods. From Table 2, it can be found that performance gradually improves in more recent years, so this paper compares mainly against the methods closest to ours from 2021, as shown in Table 2.
ShanghaiTech Dataset. Our method is compared with other state-of-the-art methods on Part A and Part B of the ShanghaiTech dataset. The specific performance is as follows. For Part A, our results achieve an 8.89/6.01 improvement in the MAE and RMSE metrics compared with the 2021 state-of-the-art method Partial Annotations. In particular, our results are 46.3/67.6 better than SFCN in 2019 and 0.9/1.9 better than MCNN in 2016, a clear improvement for the Part A count, as shown in Figure 5(a). For Part B, our method achieves 2.45/6.11 improvements in MAE and RMSE compared with Partial Annotations in 2021. In particular, our results are 1.12/3.41 better than ic-CNN in 2018 and 16.82/28.71 better than the classic MCNN in 2016, and the qualitative results show that our method performs well on databases with different degrees of crowding, as shown in Figure 5(b). At the same time, compared with the ground-truth density map, the crowd Gaussian boundaries are more prominent; compared with MCNN, the saliency of the head regions is more obvious, as shown in Figure 6.
UCF-QNRF Dataset. The performance of our method on the UCF-QNRF dataset is shown in Table 2. From the results, our method achieves a 24.52/49.36 improvement in MAE and RMSE compared with the 2021 state-of-the-art method Partial Annotations. In particular, our results are 1.99/11.81 better than DUBNet in 2020 and 173.39/257.31 better than the earlier baseline. This performance is also a clear improvement on the UCF-QNRF count, as shown in Figure 5(c). Compared with the ground-truth density map, the crowd Gaussian boundaries are more prominent; compared with MCNN, the saliency of the head regions is more obvious, as shown in Figure 7.
UCF_CC_50 Dataset. The performance of our method on the UCF_CC_50 dataset is shown in Table 2. This is also a significant improvement in the counts on the UCF_CC_50 dataset, as shown in Figure 5(d). The Gaussian boundaries of the crowd are more prominent in both the estimated and ground-truth density maps, and compared with MCNN the saliency of the head regions is more obvious, as shown in Figure 7. Analysis of the overall qualitative results shows that our method performs well at varying degrees of crowding. The main reason is that our proposed network learns more spatial context information from the head RGB-Mask, which is consistent with our original motivation. The results verify the effectiveness of our method. The conclusion after comparison is that, on the UCF-QNRF dataset, this method performs better than the DFN, SS-CNN, and RPN models. On the UCF_CC_50 dataset it performs better than DFN, but its error is worse than that of the SS-CNN and SD-CNN models, as shown in Table 3. The reason is that SS-CNN and SD-CNN contribute strong multiscale sensing mechanisms, while in overly dense crowds our method relies only on the improved FPN to judge the head size of small targets, which has certain limitations. In addition, on the ShanghaiTech dataset, the error of our method is slightly better than that of DFN. Beyond the design characteristics of each method, the form of dataset training and annotation directly affects the counting accuracy of a model on dense crowds. Generally, SSL fits the model with both labeled and unlabeled data, but unlabeled data may degrade the model. FSL performs best because it fully labels all samples, but the labeling cost is too high.
Although SSAL can reduce the labeling cost, using only some fully labeled images for network training loses the head pose, illumination, image perspective, and other information of the unused unlabeled images. PAL can maximally retain the head pose, illumination, viewing angle, and other information of the pictures in the dataset, while using fewer annotations to approximate full annotation and thus complete more accurate crowd counting. Therefore, PAL is generally better than SSAL.

Model Complexity and Processing Time Experiment.
On the UCF-QNRF dataset, this method is compared with the most advanced counting networks in terms of model parameters (Params) and processing time (Time/s) in order to verify the complexity and time consumption of the model. Params measures the complexity of the model, and Time/s measures its runtime performance. Through comparison, it is found that adding the FPN and the fusion mechanism to the model increases the number of parameters, and the larger parameter count increases the image processing time, so some runtime performance is sacrificed. For MLP-CNN, Params = 14.25 × 10^6 and Time = 2.39 s, as shown in Table 3. The method of this paper extracts the head center points (red points) in the density map, as shown in Figure 7(b). The localization performance of our method on the ShanghaiTech dataset is evaluated by the precision and recall between the extracted estimated location points (red points) and the ground-truth annotated head center points (green points), as shown in Figure 7(c).
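The two quantities compared here can be sketched as follows; the layer shapes and the timed function are illustrative assumptions, not the actual MLP-CNN configuration:

```python
import time

def conv_params(layers):
    # Parameter count of a plain convolutional stack: each (c_in, c_out, k)
    # layer contributes c_in*c_out*k*k weights plus c_out biases.
    return sum(ci * co * k * k + co for ci, co, k in layers)

def timed(fn, *args):
    # Wall-clock processing time (Time/s) of one call, e.g. a forward pass.
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start

# Illustrative three-layer stack (not the real MLP-CNN):
layers = [(3, 64, 3), (64, 64, 3), (64, 1, 1)]
print(conv_params(layers))  # 1792 + 36928 + 65 = 38785
```

Params of the full network is the sum of such per-layer counts over every convolutional and fully connected layer.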
Before using the cross entropy loss, our method misses detections of small-scale heads, as shown in Figure 8(b). The estimated location points cover crowds of various scales, as shown in Figure 8(c). With the help of the cross entropy loss, heads of different scales can be detected well, especially small heads.
The localization result is shown in Figure 8(d). Compared with current, more sophisticated feature-extraction detection frameworks, our method outperforms the others in both precision and recall.
This is because the spatial context information of the head RGB-Mask image constrains the size range of the adaptive Gaussian kernel. For head classification on the density map, cross entropy avoids the learning-rate decay of the mean-squared-error loss function and its Gaussian-distribution assumption, as well as the gradient explosion problems caused by L1 and L2, which effectively improves the validity of the detection results.
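The point-based precision and recall above can be sketched as a greedy nearest-first matching between estimated and annotated head centers within a pixel distance threshold; the matching rule and threshold here are assumptions, since the paper does not specify them:

```python
import math

def match_points(pred, gt, thresh):
    # Greedily match predicted head centers to ground-truth centers.
    # A prediction is a true positive if it lies within `thresh` pixels
    # of a still-unmatched ground-truth point (nearest pairs first).
    pairs = sorted(
        ((math.dist(p, g), i, j)
         for i, p in enumerate(pred) for j, g in enumerate(gt)),
        key=lambda t: t[0],
    )
    used_p, used_g, tp = set(), set(), 0
    for d, i, j in pairs:
        if d > thresh:
            break  # remaining pairs are farther still
        if i in used_p or j in used_g:
            continue
        used_p.add(i); used_g.add(j); tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return precision, recall
```

With two of three predictions falling near distinct ground-truth heads, both precision and recall evaluate to 2/3.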

Effectiveness of the Head RGB-Mask Adaptive Gaussian Kernel. In this part, an ablation experiment is carried out on the RGB-Mask adaptive Gaussian kernel. As shown in Table 4, four different variants were selected for qualitative analysis: combinations of the Gaussian kernel function G(X), the density function H(X), the multivariate Gaussian function G(X_n), and the head RGB-Mask perceptron are evaluated. From the results, the density function H(X) built on the Gaussian kernel G(X) has a large counting error; notably, it does not converge. The reason is that G(X) cannot obtain boundary constraints from the head spatial context information across different dimensions and is not suited to convergence on denser crowds. The density function H(X) built on the multivariate Gaussian kernel G(X_n) is better suited to parallel processing of the counting results, and the processing time is shortened. Moreover, introducing the head RGB-Mask perceptron constrains the edge expansion of each Gaussian kernel and further shortens the convergence time. This means that the combination of the multivariate Gaussian kernel G(X_n) with perception of head RGB-Mask information helps crowd counting achieve smaller MAE and RMSE errors.
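As a concrete reference, the standard geometry-adaptive Gaussian kernel (sigma proportional to the mean k-nearest-neighbour head distance) can be sketched as follows; the head RGB-Mask boundary constraint of the G(X_n) variant is not reproduced here, and beta and k are illustrative values:

```python
import math

def density_map(points, h, w, beta=0.3, k=2):
    # Density map via geometry-adaptive Gaussian kernels: each head's
    # sigma is beta times its mean distance to its k nearest neighbouring
    # heads, so denser regions get tighter kernels.
    dm = [[0.0] * w for _ in range(h)]
    for i, (x, y) in enumerate(points):
        dists = sorted(math.dist((x, y), q)
                       for j, q in enumerate(points) if j != i)
        if dists:
            kk = min(k, len(dists))
            sigma = max(beta * sum(dists[:kk]) / kk, 1.0)
        else:
            sigma = 1.0  # lone head: fall back to a fixed kernel
        # Rasterise one Gaussian, normalised so each head integrates to 1.
        g = [[math.exp(-((c - x) ** 2 + (r - y) ** 2) / (2 * sigma ** 2))
              for c in range(w)] for r in range(h)]
        total = sum(map(sum, g))
        for r in range(h):
            for c in range(w):
                dm[r][c] += g[r][c] / total
    return dm
```

Because each head's kernel is normalised to unit mass, the sum over the whole map recovers the annotated head count.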
This part is the ablation study of variables in MLP-CNN. As shown in Table 4, the modules of the head RGB-Mask obtain the head RGB-Mask features from the same dimension at the same position, and the spatial context information can effectively learn the differences between head regions. However, the concatenated structure of the head RGB-Mask features loses the correspondence between these features, which leads to ambiguity in the selection of the same feature, and important RGB-Mask information for crowd counting may be lost. As shown in Table 5, VGG16 was chosen as the encoder for the following reason: after comprehensively considering a variety of encoders, VGG16 was found to improve processing efficiency over GoogLeNet Inception V1, while VGG19 and Inception V2/V3 can ultimately extract more effective features but, as overly complex network models, may bring overfitting and additional training pressure.

Effectiveness of the Head RGB-Mask Feature Fusion Method.
This part also discusses how to use the head RGB-Mask information in the adaptive multivariate Gaussian kernel. Four different feature fusion combinations are tried, and the results are shown in Table 6. From the results, fusion using only RGB and head RGB-Mask features is not as good as density map regression using only the adaptive Gaussian kernel. This is because there is a feature coupling relationship between RGB and the head RGB-Mask, whereas the adaptive Gaussian kernel can reflect the spatial interaction of multihead RGB-Mask features; the fusion of RGB and head RGB-Mask features alone can only identify complex channel features. Neither channel used alone in combination with the adaptive Gaussian kernel is comparable either, because the coupling degree of the local features of the channel information or the head RGB-Mask feature information is still not optimal. Using the fused RGB and head RGB-Mask features, the head RGB-Mask channel features of the adaptive Gaussian kernel can be mined.
Invalid iterations in predicting the final crowd density map can thus be suppressed. Therefore, the combination of the adaptive Gaussian kernel with multimodal fusion of RGB and head RGB-Mask features is the best combination for the crowd counting network.
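Structurally, the "mid-term" fusion above can be sketched as two separate encoders whose mid-level features are combined before a shared regressor; the one-line encoders below are illustrative stand-ins for the real CNN branches, not the paper's architecture:

```python
def encode(values, weight):
    # Stand-in for a CNN branch: a scaled ReLU per spatial location.
    return [max(0.0, weight * v) for v in values]

def mid_fusion(rgb, rgb_mask, w_rgb=0.8, w_mask=1.2):
    # Encode each modality separately (early layers stay modality-specific) ...
    f_rgb = encode(rgb, w_rgb)
    f_mask = encode(rgb_mask, w_mask)
    # ... then fuse at mid-level and apply a shared regressor
    # (here a fixed, equal-weight 1x1 mix for illustration).
    return [0.5 * (a + b) for a, b in zip(f_rgb, f_mask)]
```

The key design point is that fusion happens after each branch has built modality-specific features but before the density regression, so the regressor sees both channel information and mask spatial context.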

Effectiveness of Dense Crowd Object Detection Based on Cross Entropy Loss.
This part discusses ablation experiments on the combination of cross entropy loss with L1 and L2 losses, so as to guide more accurate crowd head detection and complete effective crowd localization. Different combination schemes were tried, and the results are shown in Table 7. As can be seen from the results, using cross entropy loss alone cannot identify crowds with large scale differences, as shown in Figure 9(a): cross entropy loss alone is effective only for low-density pedestrian detection and is not suitable for dense crowds. From Figures 9(b) and 9(c), YOLO V4 and YOLO V5 cannot identify people at smaller scales. Therefore, it is necessary for the cross entropy loss to use the density map regression produced by L1 and L2 loss training as a prior to guide detection. The combination of cross entropy loss with L1 and L2 losses enables small-scale head detection, as shown in Figure 9(d).
Among the above cases, the combination H(x) + RGB-Mask + AGK + cross entropy loss + L1 and L2 loss has the best detection results for crowds with large density differences. The comparison of precision-recall curves across all cases in Figures 10(b) and 10(c) highlights the advantage of the mask method: the combination of cross entropy loss with L1 and L2 losses used in this paper achieves the largest precision and recall. Figure 10(d) analyzes the detection results of four object detection frameworks. No detection framework used alone is applicable to dense crowds; completing head detection in dense crowds requires the combination H(x) + RGB-Mask + AGK + cross entropy loss + L1 and L2 loss used in this method.
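The combined objective discussed above can be sketched as follows, assuming a per-location binary head/background label for the cross entropy term and equal loss weights (the actual weighting in the experiments may differ):

```python
import math

def combined_loss(pred_density, gt_density, pred_prob, gt_label,
                  w_ce=1.0, w_l1=1.0, w_l2=1.0):
    # L1 + L2 regression losses on the density map, plus binary cross
    # entropy on per-location head/background classification.
    n = len(pred_density)
    l1 = sum(abs(p - g) for p, g in zip(pred_density, gt_density)) / n
    l2 = sum((p - g) ** 2 for p, g in zip(pred_density, gt_density)) / n
    eps = 1e-12  # numerical guard for log(0)
    ce = -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
              for p, y in zip(pred_prob, gt_label)) / len(pred_prob)
    return w_ce * ce + w_l1 * l1 + w_l2 * l2
```

The L1/L2 terms drive the density map regression, while the cross entropy term sharpens the head/background decision that guides localization.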

Conclusions
In this paper, we propose a crowd counting and detection model. Our MLPNet uses the first ten layers of VGG-16 for feature extraction; the proposed MLP-CNN uses a fusion network based on RGB and head RGB-Mask images to extract image channel features, uses an adaptive Gaussian kernel model to extract spatial edge constraint features, and estimates crowd density maps. Cross entropy combined with the L1 and L2 loss functions ensures the accuracy of the density-map-regression-guided detection model and improves the results of dense crowd counting and small head detection. Experiments are conducted on the ShanghaiTech, UCF_CC_50, and UCF-QNRF datasets, and our method achieves results in crowd counting as satisfactory as other state-of-the-art techniques. The detection network can detect crowds with uneven scales, noise, and multiple densities, which improves localization performance for smaller heads in the crowd.
MLP-CNN has certain limitations in counting crowds in overly dense areas. When the crowd is too dense and there are too many small-scale heads, large errors arise in crowd detection and counting. For example, Figure 11 shows crowd detection results on the ShanghaiTech PartA dataset: Figure 11(a) shows the ground-truth annotation of the crowd, and Figure 11(b) shows the actual detection results. It can be clearly seen from Figure 11(b) that, in the most crowded part of the scene, head detection finds only a small number of heads with obvious characteristics, and the detection rate for heads without obvious characteristics in overcrowded regions is very low, while in less congested areas the detection rate is very high. Although the FPN scale pyramid and the mask fusion module in our method can improve the detection accuracy of some small-scale heads, when the crowd is too dense, occlusion among high-density people is serious, head resolution is low, and head features are confused. Therefore, in practical applications, this method is largely limited by congestion, resolution, and occlusion. These problems need to be solved in the future.

Discussion
The comparison of visualization results also demonstrates the effectiveness of our method for crowd detection in complex scenes. In the future, we will extend our approach to video crowd counting and detection, in particular improving the real-time processing capability of the overall algorithm.

Data Availability
The ShanghaiTech dataset, UCF_CC_50 dataset, and UCF-QNRF dataset used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.