Multi-Scale Guided Attention Network for Crowd Counting

CNN-based crowd counting methods use image pyramids and dense connections to fuse features and thereby address scale variation and information loss. However, these operations lead to information redundancy and to confusion between crowd and background information. In this paper, we propose a multi-scale guided attention network (MGANet) to solve these problems. Specifically, the multilayer features of the network are fused in a top-down manner to obtain multiscale and context information. An attention mechanism then guides the features of each layer in both the spatial and channel dimensions, so that the network pays more attention to the crowd in the image and ignores irrelevant information; the guided features are further integrated to obtain the final high-quality density map. In addition, we propose a counting loss function that combines SSIM Loss, MAE Loss, and MSE Loss to achieve effective network convergence. We experiment on four major datasets and obtain good results. The effectiveness of the network modules is verified by the corresponding ablation experiments. The source code is available at https://github.com/lpfworld/MGANet.


Introduction
Crowd counting can count the number of people in images or video frames to realize the effective management of different scenes such as meetings and sports events. It has been widely used in public security, intelligent traffic, video surveillance, etc. [1][2][3][4][5]. Crowd counting can also be used to count the number of cells, viruses, and animals, extending the field of research into medical and behavioral science [6]. However, it is still a challenging task in the field of computer vision due to the problems of crowd occlusion, scale variation, uneven data distribution, and so forth (see Figure 1).
With the excellent performance of convolutional neural networks in various fields of computer vision, researchers have made various attempts at crowd counting. Some researchers used multicolumn structures with different convolution kernels to handle scale variation [7][8][9][10]. Others used parallel convolution kernels to obtain feature maps at different scales or fused multiscale information through dense connections between multilayer features [11,12]. However, in these structures, the features learned by different branches are highly repetitive and therefore contribute little to the extraction of multiscale information. At the same time, these redundant features disturb the generated crowd density maps, so that background or other image content is easily mistaken for the crowd [12,13]. Some networks use spatial attention during training to emphasize the crowd in images and thereby reduce problems such as background interference [13,14]. The feature maps of different layers in the network contain different scale and semantic features, so we cascade the features of different layers from top to bottom. The low-level layers contain more detailed information, which helps form the density map of high-density scenes, while the high-level layers contain more semantic information, which helps distinguish human heads from background noise. In this way, information at different scales can be obtained without increasing the complexity of the network structure. In comparison, most crowd counting networks use bottom-up multilayer fusion and then use deconvolution or dilated convolution to keep the scale of the final density map unchanged. In our top-down network structure, we do not need to consider how to recover the scale of the final density map.
We cascade the multilayer features of the network rather than overlaying the multilayer feature maps along the channel dimension. Cascading at different layers effectively prevents information loss, so that the final density map contains complete context information. Fusing the features of different layers yields rich information, but a further selection is needed before the density map is generated. Most networks use spatial attention mechanisms to selectively strengthen the human head area [15][16][17]. However, each channel also carries information corresponding to a specific class; in a crowd image, the classes are head and background. Therefore, we optimize the resulting features by combining spatial attention with channel attention.
MSE Loss is the most commonly used loss function in crowd counting methods. However, in crowd scenes, the texture features and pixel correlations of high-density and low-density areas are different. MSE Loss is based on the assumption of pixel independence and ignores the local correlation of the density map. Moreover, MSE Loss does not take into account the global counting error of the input image. To address these limitations, we combine SSIM Loss, MAE Loss, and MSE Loss as the final loss. The combined loss function measures the local consistency between the predicted density map and the ground truth density map as well as the difference between the predicted and the real number of people, so that the network converges better and generates a high-quality density map.
In this paper, we propose a novel network for crowd counting. The network integrates the multiscale features of the crowd from top to bottom and uses spatial and channel attention mechanisms to further guide the features to produce high-quality density maps.
This network is called MGANet, and its structure is shown in Figure 2. Specifically, our network consists of two parts: MFE (Multiscale Feature Extraction) and MFG (Multiscale Feature Guide). MFE uses the VGG16 backbone to extract multiscale features and fuses the multilayer features step by step in a top-down manner, which both captures the semantic features contained in different levels and expresses the detailed features of people at different scales. MFG separates head information from background information through channel attention, locates the head area more accurately through spatial attention, and uses the outputs of the two kinds of attention to form an effective density map. MFG contains three columns of feature outputs with different resolutions, which are converted into density maps by Conv 1 × 1; the density maps are then upsampled to the same resolution. SSIM Loss, MAE Loss, and MSE Loss are combined as the final loss function. Experiments on four datasets (ShanghaiTech [11], UCF_CC_50 [18], WorldExpo'10 [19], and Smart City [12]) demonstrate the effectiveness and robustness of our method.
In summary, the main contributions of our paper are as follows:
(1) We propose a novel crowd counting network, MGANet, which consists of MFE and MFG. It effectively deals with the information redundancy and confusion caused by multiscale feature fusion and generates a high-quality density map for accurate crowd counting.
(2) We propose a top-down feature fusion network based on MFE, which makes the feature information of different layers complementary through a multilayer feature cascade. The network effectively adapts to changes in crowd scale and prevents the loss of context information. MFG is a network containing channel attention and spatial attention, which eliminates the influence of redundant features so that the resulting density map pays more attention to the crowd.

Related Work
Researchers have summarized a series of excellent crowd counting methods [20,21]. We briefly introduce the research related to this paper, including crowd counting methods and attention mechanisms applied to crowd counting.

Crowd Counting Methods.
Crowd counting methods can be classified into two categories: traditional methods and CNN-based methods.

Traditional Methods.
Traditional methods are mainly classified into two categories: detection-based methods and regression-based methods. Early work used detection-based methods to count the crowd. These methods used sliding windows to extract handcrafted whole-body features for detection, for example, Haar wavelets, HOG, or pedestrian edge information [4,22]. In crowded situations, however, these methods do not work. To solve this problem, researchers tried to detect specific body parts rather than the whole body, such as the shoulders or head [23,24], but these methods still cannot cope with high-density crowd scenes. Regression-based methods count the crowd by learning a mapping from features to the crowd count [25][26][27]. Such a method first extracts low-level features such as texture and edges and then selects an appropriate machine learning regression model to map the low-level features to the crowd count. For both detection and regression methods, the output information is relatively limited and the processing steps are relatively complicated.

CNN-Based Methods.
In recent years, deep learning has developed rapidly, and CNN-based methods have brought great progress in crowd counting. Researchers have tried CNN-based object detectors for counting, including YOLO [28] and Faster-RCNN [29]. Although this approach improves considerably on traditional methods, better predictions for dense crowds are obtained by using a fully convolutional network to map the image to a density map. A detailed survey of CNN-based counting is presented in [30]. MCNN uses a three-column convolutional neural network whose columns have similar structures but convolution kernels of different sizes, so that kernels of various scales adapt to heads of different sizes [10]. Switch-CNN is also a multicolumn structure, which uses a density classifier to route patches of different densities to the appropriate column [8]. MSCNN uses multiscale clusters to generate scale-dependent features in a single-column structure, thus achieving accurate and cost-effective crowd counting in practical applications [31]. CSRNet is divided into front-end and back-end networks: VGG16 with its fully connected layers removed serves as the front end, while a dilated convolutional network serves as the back end. The receptive field is enlarged while the resolution is maintained, which yields a high-quality crowd density map [11]. CP-CNN first extracts and classifies the features of the whole input image and converts the classification results into a global context related to the density level; it performs the same operation on patches cut from the image to obtain the local context. Finally, the combined features are constrained so that the network can adaptively learn the features of the corresponding density level for any image [32].
SANet extracts head information at multiple scales using a module similar to the Inception architecture. Convolution kernels of different sizes are used simultaneously in each convolutional layer, and the final high-resolution density map is obtained through deconvolution [33]. DADNet fuses scale-aware attention outputs with different dilation rates to capture different visual granularities of the crowd regions of interest and uses deformable convolution to generate high-quality density maps [15]. CAN fuses features acquired from different receptive fields, learns the importance of each feature in the image, and adaptively encodes the scale of context information to accurately predict crowd density [16]. These studies aggregate multiscale information by means of multicolumn structures, parallel dilated convolutions, or the fusion of features from different levels. MS-GAN combines a multiscale convolutional neural network (generator) and an adversarial network (discriminator) to generate a high-quality density map [34].

Attention Mechanism.
To extract effective features, the attention mechanism has also been extensively studied in the field of crowd counting. DecideNet argues that detection-based methods work better in sparse scenes, while regression-based methods work better in dense crowd scenes. Therefore, it adopts an attention mechanism to adaptively adjust the weights of detection and regression according to the crowd density [35]. DADNet uses multicolumn dilated convolutions to deal with multiscale problems and uses Conv + sigmoid to predict an attention map on each column to achieve feature selection [15]. SCAR adds a spatial-wise attention model and a channel-wise attention model to filter image features and then integrates the features [36]. Focus for Free integrates the tasks of segmentation, counting, and classification; it uses segmentation to provide spatial attention for counting and classification to provide channel attention for counting, which improves counting accuracy [37]. ASNet first generates interest masks at different density levels, then generates density maps and scaling factors, multiplies them to output individual attention-based density maps, and adds these density maps to produce the final density map [20]. These attention mechanisms improve counting accuracy, but the gain depends on the feature fusion method. This paper likewise applies channel and spatial attention mechanisms to the multiscale fused features. With a series of attention modules, the final attention map can be refined and the noise gradually eliminated, giving more weight to the truly important areas.

Method
We propose MGANet to count people accurately across different degrees of crowding. In this section, we elaborate on four aspects: MFE, MFG, feature fusion, and the loss function.

MFE (Multiscale Feature Extraction).
During crowd image acquisition, people in the same scene stand at different distances from the camera, which causes a perspective effect; that is, there is a multiscale problem. To solve this problem, we propose a top-down feature fusion method. VGG16 has good feature representation ability, and its network is easy to modify. Therefore, our network uses VGG16 as the front-end feature extractor. Specifically, Conv3_3, Conv4_3, and Conv5_3 output features at 1/4, 1/8, and 1/16 of the original input size. As the network propagates forward, the receptive field of the feature map gradually becomes larger. Consequently, the feature maps generated by the front layers of the network should attend to small heads, while the feature maps generated by the rear layers should attend to large heads. The receptive field is calculated as

$$RF_i = (RF_{i+1} - 1) \times \mathrm{stride} + K_{size},$$

where $RF_i$ is the receptive field of the $i$th convolution layer, $RF_{i+1}$ is the receptive field of the $(i+1)$th layer, stride is the step length of the convolution, and $K_{size}$ is the convolution kernel size. In addition, the high-level features contain more semantic information, while the low-level features retain the location information of the input image. Therefore, combining the features of different levels is an effective way to improve network performance. In our fusion method, $F_{ij}$ denotes the fused feature in the $i$th row and $j$th column, which is obtained by fusing the feature in the $(i-1)$th row and $(j-1)$th column with the feature produced by the upsampling module. This fusion method not only addresses the multiscale problem of human heads but also makes full use of the semantic and location information in the feature maps of different layers. In this work, $f(U)$ is the key to connecting the layers; we name it the upsampled embedded block (UEB). Its detailed design is shown in Figure 3.
First, the feature map output by Conv5_3 is upsampled by the UEB with nearest-neighbor interpolation and fused with the feature map output by Conv4_3; the same operation is applied to Conv4_3 and Conv3_3. The network thus obtains the first layer of fused features and then repeats similar operations. From Conv3_3, Conv4_3, and Conv5_3, we obtain three sets of fused features ($M_1$, $M_2$, $M_3$), which are fed into the MFG for further processing.
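The receptive-field recursion mentioned in this section can be checked numerically. The following is a minimal sketch, assuming the recursion takes the standard form RF_i = (RF_{i+1} - 1) × stride + Ksize and assuming the usual VGG16 layout (3 × 3 convolutions with stride 1 and 2 × 2 max-pooling with stride 2); the helper names `layers_up_to`, `receptive_field`, and `total_stride` are hypothetical and are not taken from the MGANet code.

```python
# Assumed VGG16 building blocks, written as (kernel_size, stride).
CONV = (3, 1)   # 3x3 convolution, stride 1
POOL = (2, 2)   # 2x2 max-pooling, stride 2


def layers_up_to(block_sizes):
    """Build the (kernel, stride) list for consecutive VGG-style blocks.

    block_sizes lists the number of convolutions in each block; a pooling
    layer is inserted between blocks but not after the last one.
    """
    layers = []
    for index, n_convs in enumerate(block_sizes):
        if index > 0:
            layers.append(POOL)
        layers.extend([CONV] * n_convs)
    return layers


def receptive_field(layers):
    """Apply RF_i = (RF_{i+1} - 1) * stride + Ksize from the output backward."""
    rf = 1  # a single output pixel
    for ksize, stride in reversed(layers):
        rf = (rf - 1) * stride + ksize
    return rf


def total_stride(layers):
    """Product of all strides, i.e. the downsampling factor of the output."""
    factor = 1
    for _, stride in layers:
        factor *= stride
    return factor


taps = {
    "Conv3_3": layers_up_to([2, 2, 3]),
    "Conv4_3": layers_up_to([2, 2, 3, 3]),
    "Conv5_3": layers_up_to([2, 2, 3, 3, 3]),
}
for name, layers in taps.items():
    print(name, "receptive field:", receptive_field(layers),
          "downsampling:", total_stride(layers))
```

Under these assumptions the three tapped layers downsample by 4, 8, and 16, which agrees with the 1/4, 1/8, and 1/16 feature sizes stated in the text, and the receptive field grows with depth as described.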

MFG (Multiscale Feature Guide).
The features obtained from MFE need to be refined further to generate a high-quality density map. The refinement consists of a spatial attention mechanism and a channel attention mechanism.

Spatial Attention.
The fully convolutional network regresses the pixels of the density map and does not explicitly give more attention to the head region. In other words, the crowd background may influence the training loss. To solve the problem of crowd and background confusion, we use spatial attention to make the network generate a spatial attention map of crowd head information. Then, the attention map is applied to suppress information from nonhead regions, so that the network focuses more on the head region (see Figure 4).
$M_i$, the feature map produced by MFE on each column, is passed through a 1 × 1 convolution followed by a sigmoid to predict an attention map; $M_i$ is then multiplied by the attention map to obtain the spatial attention feature map $F_i^{s\text{-}attr}$:

$$F_i^{s\text{-}attr} = W_i \odot M_i,$$

where $W_i$ is the attention map obtained by conv + sigmoid and $i \in \{1, 2, 3\}$. The attention map is computed as

$$W_i = \mathrm{Sigmoid}\left(C_{1 \times 1}(I_i; \Theta_c)\right),$$

where $i \in [1, S]$, $I_i$ is the feature map of each channel, $C_{1 \times 1}$ denotes the 1 × 1 convolution, $\Theta_c$ denotes the parameters of the 1 × 1 convolution, and Sigmoid is the activation function.
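The spatial attention step (a 1 × 1 convolution, a sigmoid, and an elementwise product with the input feature) can be illustrated with a small pure-Python sketch. It assumes that the 1 × 1 convolution maps all channels to a single-channel attention map that is shared across channels; the function names and the nested-list layout (channels × height × width) are illustrative choices and do not come from the authors' implementation.

```python
import math


def sigmoid(value):
    """Logistic function used to squash attention scores into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def spatial_attention(feature, weights, bias=0.0):
    """Guide a C x H x W feature (nested lists) with a spatial attention map.

    weights holds one coefficient per input channel, i.e. a 1x1 convolution
    with a single output channel (an assumption of this sketch).
    Returns (attention_map, guided_feature).
    """
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])

    # 1x1 convolution followed by sigmoid: one attention value per pixel.
    attention = [
        [
            sigmoid(bias + sum(weights[c] * feature[c][y][x]
                               for c in range(channels)))
            for x in range(width)
        ]
        for y in range(height)
    ]

    # Elementwise product of the attention map with every channel.
    guided = [
        [
            [feature[c][y][x] * attention[y][x] for x in range(width)]
            for y in range(height)
        ]
        for c in range(channels)
    ]
    return attention, guided


example = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, -1.0], [2.0, 0.5]],
]
attention_map, guided_feature = spatial_attention(example, [1.0, 0.0])
```

Because the attention values lie strictly between 0 and 1, pixels with a low score are attenuated rather than removed, which matches the description of suppressing nonhead regions.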

Channel Attention.
Each channel of a high-level feature can be viewed as the response to a specific category, and different semantic responses are linked to each other. By exploiting the dependencies between channels, the interdependence between features can be enhanced and the expression of feature semantics can be improved (see Figure 5). The dimension of $M_i$ is $C \times H \times W$. First, $M_i$ is reshaped to $C \times N$, where $N = H \times W$. Then, the reshaped $M_i$ is multiplied by its transpose to obtain $C_1$. Finally, $C_1$ is passed through softmax to obtain $C_2$:

$$C_1 = M_i M_i^{T}, \qquad C_2 = \mathrm{softmax}(C_1),$$

where $x_{ji}$ is the effect of the $i$th channel on the $j$th channel and $C_2$ consists of the $x_{ji}$. $C_2$ is then matrix-multiplied with the reshaped $M_i$, and the result is combined with $M_i$ to give the channel attention feature $F_i^{C\text{-}attr}$:

$$F_i^{C\text{-}attr} = w \, (C_2 M_i) + M_i,$$

where $w$ is initialized to 0 and gradually learns to assign more weight to the attention term. In this process, the $C_2$ matrix also plays the role of attention: each of its rows describes the dependency between one channel and all other channels. Softmax maps the values to the range 0-1; the greater the value, the stronger the dependency. By multiplying the attention map and $M_i$, the dependent channels are selectively integrated, the semantic feature expression is improved, and long-range semantic dependencies between channels are modeled.
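The channel attention steps described above (reshape to C × N, multiply by the transpose, apply softmax, and add the weighted result back to the input) can be sketched in a few lines. This is a minimal pure-Python illustration under the assumption that softmax is applied row by row and that the learnable scalar w is passed in as a plain number; it is not the authors' implementation, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of channel affinities."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def channel_attention(feature, w=0.0):
    """Apply channel attention to a C x N feature (channels already flattened).

    Returns (c2, output) where c2 is the C x C attention matrix and
    output = w * (c2 @ feature) + feature.
    """
    channels = len(feature)
    n = len(feature[0])

    # C1 = M M^T: pairwise similarity between channels (C x C).
    c1 = [
        [sum(feature[j][k] * feature[i][k] for k in range(n))
         for i in range(channels)]
        for j in range(channels)
    ]

    # C2: each row is a distribution over the channels that influence channel j.
    c2 = [softmax(row) for row in c1]

    # Weighted mix of all channels, added back to the input as a residual.
    output = [
        [
            w * sum(c2[j][i] * feature[i][k] for i in range(channels))
            + feature[j][k]
            for k in range(n)
        ]
        for j in range(channels)
    ]
    return c2, output


identity_like = [[1.0, 0.0], [0.0, 1.0]]
attention_matrix, mixed = channel_attention(identity_like, w=1.0)
```

With w = 0 the block returns its input unchanged, which is why initializing w to 0 lets the network start from the plain fused features and introduce channel mixing gradually during training.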

Fusion of Features.
The outputs of spatial attention and channel attention are fused by an elementwise sum, and the density map is obtained through Conv 1 × 1. Since the three density maps are 1/4, 1/8, and 1/16 of the original input size, upsampling is needed to bring them all to the final size of 1/4 of the original image.

[Figure: fusion block diagram. Labels: low-level feature, high-level feature, upsampling, Conv 3 × 3, fusion feature.] Bilinear interpolation is used for upsampling to restore the high-level features to the same size as the low-level features. The Conv 3 × 3 convolution learns weights and extracts the high-level semantics for better-integrated information, so that the fused features can be further fused from top to bottom. The features fused are those of Conv3_3, Conv4_3, and Conv5_3 from the VGG16 backbone network.
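The upsample-then-convolve step of the fusion block can be sketched as follows. This is a simplified single-channel illustration with several assumptions that the text does not confirm: nearest-neighbor upsampling by a factor of 2 stands in for the interpolation, the 3 × 3 convolution uses zero padding, and the upsampled high-level map is merged with the low-level map by an elementwise sum. All function names are hypothetical.

```python
def upsample_nearest(fmap, factor=2):
    """Enlarge a 2D map by repeating each value factor x factor times."""
    height, width = len(fmap), len(fmap[0])
    return [
        [fmap[y // factor][x // factor] for x in range(width * factor)]
        for y in range(height * factor)
    ]


def conv3x3(fmap, kernel):
    """3x3 convolution (cross-correlation) with zero padding, stride 1."""
    height, width = len(fmap), len(fmap[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += kernel[dy + 1][dx + 1] * fmap[yy][xx]
            out[y][x] = acc
    return out


def fuse_top_down(low, high, kernel):
    """Upsample the high-level map, refine it with a 3x3 convolution, and
    add it to the low-level map (the sum is an assumption of this sketch)."""
    refined = conv3x3(upsample_nearest(high, 2), kernel)
    return [
        [low[y][x] + refined[y][x] for x in range(len(low[0]))]
        for y in range(len(low))
    ]


identity_kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
high_level = [[1.0, 2.0], [3.0, 4.0]]
low_level = [[0.0] * 4 for _ in range(4)]
fused = fuse_top_down(low_level, high_level, identity_kernel)
```

Applying the same function first to the deepest pair of layers and then to the result and the next shallower layer gives the step-by-step top-down cascade described in the text.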

Loss Function.
In crowd scenes, the local features of high-density regions differ, but MSE Loss is based on the assumption of pixel independence and does not consider the local correlation of the density map. Moreover, it does not take the counting error of the image into account. Therefore, we add SSIM Loss and MAE Loss to the loss function. SSIM Loss calculates the similarity between two images according to their local features, so the generated crowd density map can be compared with the ground truth. MAE Loss directly measures the difference between the estimated crowd number and the ground truth.

SSIM Loss.
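SSIM is an indicator widely used in the field of image quality assessment. It calculates the similarity between two images based on local patterns (including mean, variance, and covariance). The value range of SSIM is [−1, 1]. The more similar the two images are, the greater the value is; when the two images are identical, it is equal to 1. First, the local statistics are estimated using an 11 × 11 normalized Gaussian kernel with a standard deviation of 1.5. Then, the weight set is defined as $W = \{W(r) \mid r \in R\}$, $R = \{(-5, -5), \ldots, (5, 5)\}$, where $r$ is the offset from the center and $R$ contains all positions of the kernel. For each position $t$, the local statistics of the density map $F_d$ and the corresponding ground truth $D$ are then calculated. The local mean $\mu_{F_d}$ and variance $\sigma^2_{F_d}$ of $F_d$ are calculated as follows:

$$\mu_{F_d}(t) = \sum_{r \in R} W(r)\, F_d(t + r), \qquad \sigma^2_{F_d}(t) = \sum_{r \in R} W(r) \left[ F_d(t + r) - \mu_{F_d}(t) \right]^2 .$$

The local mean $\mu_D$ and variance $\sigma^2_D$ of $D$ are as follows:

$$\mu_{D}(t) = \sum_{r \in R} W(r)\, D(t + r), \qquad \sigma^2_{D}(t) = \sum_{r \in R} W(r) \left[ D(t + r) - \mu_{D}(t) \right]^2 .$$

The local covariance $\sigma_{F_d D}$ between $F_d$ and $D$ is calculated as follows:

$$\sigma_{F_d D}(t) = \sum_{r \in R} W(r) \left[ F_d(t + r) - \mu_{F_d}(t) \right] \left[ D(t + r) - \mu_{D}(t) \right] .$$

From these statistics, SSIM is calculated point by point:

$$SSIM = \frac{(2 \mu_{F_d} \mu_D + Q_1)(2 \sigma_{F_d D} + Q_2)}{(\mu_{F_d}^2 + \mu_D^2 + Q_1)(\sigma_{F_d}^2 + \sigma_D^2 + Q_2)},$$

where $Q_1$ and $Q_2$ are very small constants that avoid division by 0. Finally, the SSIM loss function is defined as follows:

$$L_{SSIM} = 1 - \frac{1}{M} \sum_{t} SSIM(t),$$

where $M$ is the number of pixels in the density map.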
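The SSIM loss computation can be made concrete with a short pure-Python sketch. It follows the description above (an 11 × 11 normalized Gaussian kernel with standard deviation 1.5, local means, variances, covariance, pointwise SSIM, and one minus the average), but two details are assumptions of this sketch rather than statements from the paper: the values chosen for the small constants `q1` and `q2`, and the border handling, which here evaluates SSIM only at positions where the whole window fits inside the map.

```python
import math


def gaussian_kernel(size=11, sigma=1.5):
    """Normalized 2D Gaussian weights W(r) that sum to 1."""
    half = size // 2
    raw = [
        [math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
         for dx in range(-half, half + 1)]
        for dy in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in raw)
    return [[value / total for value in row] for row in raw]


def ssim_loss(pred, truth, size=11, sigma=1.5, q1=1e-4, q2=9e-4):
    """Return 1 - mean local SSIM between two 2D maps (nested lists).

    q1 and q2 are hypothetical stabilizing constants; only window positions
    that lie fully inside the map are evaluated (assumed border handling).
    """
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    height, width = len(pred), len(pred[0])
    if height < size or width < size:
        raise ValueError("maps must be at least as large as the kernel")

    scores = []
    for ty in range(half, height - half):
        for tx in range(half, width - half):
            # Collect (weight, prediction, truth) triples for this window.
            window = [
                (kernel[dy + half][dx + half],
                 pred[ty + dy][tx + dx],
                 truth[ty + dy][tx + dx])
                for dy in range(-half, half + 1)
                for dx in range(-half, half + 1)
            ]
            mu_p = sum(w * p for w, p, _ in window)
            mu_t = sum(w * t for w, _, t in window)
            var_p = sum(w * (p - mu_p) * (p - mu_p) for w, p, _ in window)
            var_t = sum(w * (t - mu_t) * (t - mu_t) for w, _, t in window)
            cov = sum(w * (p - mu_p) * (t - mu_t) for w, p, t in window)
            numerator = (2.0 * mu_p * mu_t + q1) * (2.0 * cov + q2)
            denominator = ((mu_p * mu_p + mu_t * mu_t + q1)
                           * (var_p + var_t + q2))
            scores.append(numerator / denominator)
    return 1.0 - sum(scores) / len(scores)


pattern = [[float((x + y) % 3) for x in range(12)] for y in range(12)]
blank = [[0.0] * 12 for _ in range(12)]
loss_same = ssim_loss(pattern, pattern)
loss_diff = ssim_loss(pattern, blank)
```

The loss is close to 0 when the two maps agree and grows toward 1 as their local structure diverges, which is the property that lets it penalize locally inconsistent density maps that a pixelwise loss would treat as independent errors.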

MAE Loss.
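The mean absolute error loss is introduced to measure the difference between the ground truth counts and the estimated counts as follows:

$$L_{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C(I_i) - C'(I_i) \right|,$$

where $N$ is the number of training samples, $I_i$ is the density map generated by MGANet, $C(I_i)$ is the estimated count, and $C'(I_i)$ is the label value.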

MSE Loss.
The Euclidean loss is used to assess the difference between the ground truth density map and the density map output by the model, so that the model can adjust its parameters and produce a density map closer to the real situation. The Euclidean loss measures the estimation error at the pixel level. It is computed from the squared Euclidean distance $\| F(X_i; \theta) - F_i \|_2^2$, where $F(X_i; \theta)$ is the output of MGANet, $\theta$ denotes the learnable parameters of the model, $X_i$ is the input image, and $F_i$ is the ground truth density map.

Experiment
In this section, we introduce the training method, the evaluation metrics, the experimental setup, and the four standard benchmark datasets. Finally, we report experimental results on the datasets.

Training Details.
MGANet is an end-to-end structure implemented using the PyTorch framework, so the training process is straightforward. We set the training batch size to 1. MGANet uses standard SGD as the optimizer with a momentum of 0.9. In addition, we employ random Gaussian initialization with a standard deviation of 0.01. The initial learning rate is set to 0.001 and decreases as the number of iterations increases.

Evaluation Metrics.
Following the existing literature, we use the mean absolute error (MAE) and the mean squared error (MSE) as evaluation metrics. MAE indicates the accuracy of the model's counting, and MSE represents the robustness of the model. The formulae are as follows:

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| Z_i - Z_i' \right|, \qquad MSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( Z_i - Z_i' \right)^2},$$

where $N$ is the number of test images, $Z_i$ is the actual number of people in the $i$th test image, and $Z_i'$ is the corresponding estimated count output by our model.
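As a small worked example of the two metrics, the sketch below computes MAE and MSE over a list of per-image counts. It assumes the convention common in the crowd counting literature in which "MSE" denotes the square root of the mean squared count error; the counts used are made-up illustrative numbers, not results from the paper.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between true and estimated counts."""
    return sum(abs(z - zp) for z, zp in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Root of the average squared count difference (assumed convention)."""
    squared = sum((z - zp) ** 2 for z, zp in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical per-image counts, for illustration only.
true_counts = [100.0, 200.0, 300.0]
estimated_counts = [110.0, 190.0, 330.0]

mae_value = mean_absolute_error(true_counts, estimated_counts)
mse_value = mean_squared_error(true_counts, estimated_counts)
print(f"MAE = {mae_value:.2f}, MSE = {mse_value:.2f}")
```

Because the squared term weights large errors more heavily, a single badly estimated image raises MSE more than MAE, which is why MSE is read as a robustness indicator.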

Ground Truth Generation.
So that the density map can adapt to the various conditions of crowd images, an image with $N$ heads is represented by a density map $F(x)$. $F(x)$ is computed by convolving the delta function $\delta(x - x_i)$ with a Gaussian kernel $G_{\sigma_i}(x)$ normalized to 1:

$$F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \qquad \sigma_i = \beta \bar{d}_i,$$

where $x_i$ is the pixel position of the $i$th pedestrian head, $\sigma_i$ is the spread of the Gaussian kernel, which is determined by the crowd distribution, $\beta$ is a constant, and $\bar{d}_i$ is the average distance to the $k$ nearest neighbors of the target.
In our experiments, we follow the configuration in CSRNet [11]. Certain parameters are set to fixed values ($\beta = 0.3$, $k = 3$). The setups are shown in Table 1.
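The ground truth generation described above can be sketched in pure Python. The sketch places one Gaussian per annotated head and either uses a fixed spread or the geometry-adaptive rule (β times the average distance to the k nearest neighbors, with β = 0.3 and k = 3 as stated). Two details are assumptions of this sketch: head positions are distinct integer pixel coordinates inside the image, and each Gaussian is renormalized over the image so that the map sums exactly to the head count. The function names are hypothetical.

```python
import math


def knn_mean_distance(points, index, k=3):
    """Average distance from points[index] to its k nearest neighbors."""
    px, py = points[index]
    distances = sorted(
        math.hypot(px - qx, py - qy)
        for j, (qx, qy) in enumerate(points)
        if j != index
    )
    nearest = distances[:k]
    if not nearest:
        raise ValueError("geometry-adaptive kernels need at least two heads")
    return sum(nearest) / len(nearest)


def density_map(points, height, width, beta=0.3, k=3, fixed_sigma=None):
    """Build a density map from (x, y) head positions.

    If fixed_sigma is given, every head uses that spread; otherwise the
    spread is beta * (mean distance to the k nearest neighbors).
    """
    density = [[0.0] * width for _ in range(height)]
    for index, (px, py) in enumerate(points):
        if fixed_sigma is not None:
            sigma = fixed_sigma
        else:
            sigma = beta * knn_mean_distance(points, index, k)
        weights = [
            [math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * sigma * sigma))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(sum(row) for row in weights)
        # Normalize to 1 so that each head contributes exactly one person.
        for y in range(height):
            for x in range(width):
                density[y][x] += weights[y][x] / total
    return density


heads = [(2, 2), (7, 7), (2, 7)]  # hypothetical annotations
fixed_map = density_map(heads, 10, 10, fixed_sigma=4.0)
adaptive_map = density_map(heads, 10, 10, beta=0.3, k=3)
```

Summing either map recovers the number of annotated heads, which is what allows the count to be read off a predicted density map by integration.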

Comparisons with State of the Art.
We evaluate our method on four publicly available crowd counting datasets: ShanghaiTech, UCF_CC_50, WorldExpo'10, and Smart City. These datasets cover different crowd situations, such as dense scenes and sparse scenes. Tables 2-5

ShanghaiTech Dataset.
The ShanghaiTech dataset [11] consists of 1198 crowd images with 330,165 annotated people and includes two sets: Part A and Part B. Part A consists of 300 training images and 182 testing images. Part B, which is taken in sparser scenes, contains 400 training images and 316 testing images, with counts ranging from 9 to 578.
The test results and visual results on the ShanghaiTech dataset are shown in Table 6 and Figure 6. In the visualization, we tested low-density to high-density crowd images in Part A and Part B, respectively. We obtain an MSE of 96.7 on Part A, and the other results are also good. It can be seen that our method performs well in both low-density and high-density scenarios.

UCF_CC_50 Dataset.
The UCF_CC_50 dataset [18] is very challenging because of its varied perspectives, small size, and varied resolutions. It contains 50 crowd images with 63974 annotated people. The number of annotated persons per image ranges from 94 to 4543, with an average of 1280. Fivefold cross-validation is the most commonly used protocol on this dataset. Our results fall short of the 212.2 and 243.7 achieved by CAN [16]. In the UCF_CC_50 dataset, the images are high-density scenes without complex background information, whereas our MGANet is better suited to scenes with complex backgrounds. The test results and visual results on UCF_CC_50 are shown in Table 2 and Figure 7. Our MAE and MSE are 240.8 and 311.5, respectively, a clear improvement over the other methods, which shows that our model also performs well on small datasets and in high-density scenes.

The test results and visual results on the WorldExpo'10 dataset are shown in Table 3 and Figure 8. The data is divided into 5 different scenes with large background interference. We tested each of the five scenes, and the final score is 7.86. In scenes S1 and S5, the scores are 2.1 and 3.0, respectively, better than the 2.4 and 4.0 of CAN [16]. It can be seen that our model also performs well on the multiscene crowd counting dataset under complex background.

Smart City Dataset.
Smart City [12] contains 50 images. The data were collected from a high shooting angle, and the dataset covers ten scenes such as sidewalks and shopping malls. The images are divided into indoor scenes and outdoor scenes, and there are few pedestrians, ranging from 1 to 14 people per image, with an average of 7.4 people.
The test results and visual results on the Smart City dataset are shown in Table 4 and Figure 9. Relatively speaking, the background of people in this dataset is more complex. On this dataset, we obtain the best results: 8.2 and 10.2, respectively, better than the 9.4 and 11.4 of CAN [16]. It can be seen that our model also performs well on the multiscene crowd counting dataset under complex background.

Table 5). As the number of cascade operations increases, the indicators gradually improve. However, the complexity of the model also gradually increases, and beyond a certain point the performance is not significantly improved. Table 7.

The Function of the Attention Mechanism
(2) Comparison of the Same Form of Attention Mechanism. Channel attention and spatial attention are also included in CBAM [39]. Although the names are similar, the designs differ. We believe that the tasks of CBAM and MFG are different. CBAM pays more attention to the recognition of target objects, which also gives CBAM better interpretability. MFG pays more attention to pixel-level attention design and is more suitable for crowd counting. To be fair, we add CBAM and MFG to the MCNN [10] network, respectively, and the

Dataset                      Parameter settings
ShanghaiTech Part A [11]     σ = 4
ShanghaiTech Part B [11]     σ = 15
WorldExpo'10 [19]            σ = 3
Smart City [12]              σ = 3
UCF_CC_50 [18]               Geometry-adaptive kernels

Table 9. It can be seen that MSE Loss, the most commonly used loss, still plays the major role in network convergence, but the combination with SSIM Loss and MAE Loss proposed in this paper also promotes network convergence to a certain extent.
(2) Setting of Hyperparameters α and β. We briefly compare different parameter settings of the presented loss function on ShanghaiTech Part A, as shown in Table 10. The results indicate that the MAE and MSE are lowest when α and β are both $10^{-4}$.

Figure 9: Visualization results on the Smart City dataset. The first column is the original image, the second column is the ground truth, the third column is the prediction, and the fourth column is the true/predicted values.

Conclusion
In this paper, we propose MGANet for accurate and effective crowd counting. To obtain multiscale information and prevent the loss of context information, we propose a top-down method to concatenate deep and shallow features. To make the network pay more attention to the spatial and channel information of the crowd in the image and ignore irrelevant information, we design a combination of spatial attention and channel attention that operates at the pixel level and further guides the features. To obtain a high-quality density map, we build on the commonly used MSE Loss and combine it with SSIM Loss and MAE Loss, which yields a loss function that effectively promotes network convergence. Extensive experiments demonstrate the effectiveness and robustness of our method, and the related ablation experiments confirm the effectiveness of each module.

Data Availability
All data can be found at https://github.com/lpfworld/MGANet. Other raw data supporting the conclusions of this article can be obtained from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.