Multiscale Aggregate Networks with Dense Connections for Crowd Counting

State-of-the-art crowd counting methods use a fully convolutional network that extracts image features and then generates a crowd density map. However, this process often suffers from multiscale and contextual information loss. To address these problems, we propose a multiscale aggregation network (MANet) that includes a feature extraction encoder (FEE) and a density map decoder (DMD). The FEE uses a cascaded scale pyramid network to extract multiscale features and obtains contextual features through dense connections. The DMD uses deconvolution and fusion operations to generate features containing detailed information. These features can be further converted into high-quality density maps to accurately calculate the number of people in a crowd. An empirical comparison using four mainstream datasets (ShanghaiTech, WorldExpo'10, UCF_CC_50, and SmartCity) shows that the proposed method achieves lower mean absolute error and mean squared error than competing methods. The source code is available at https://github.com/lpfworld/MANet.


Introduction
Crowd counting technology is widely used in video surveillance, crowd management, traffic control, and other fields, as well as at sporting events and political meetings [1,2]. Crowd counting methods can also be extended to indirectly related fields, such as medical image analysis and animal group behavioral analysis [3]. Although the relevant research has achieved good results, considerable challenges persist owing to large-scale variations, heavy occlusion, background noise, and perspective distortion (Figure 1).
Researchers have proposed different approaches to solve these problems. For example, numerous multicolumn networks have been proposed. Multicolumn architectures involve several columns of a convolutional neural network (CNN) with different receptive fields to accommodate multiscale crowds [4][5][6][7]. Although these methods have achieved good results, the multicolumn structure induces a considerable increase in parameters and computational costs. Furthermore, the similarity of the column networks results in highly redundant learned features [8][9][10]. The goal of our architecture is to retain more multiscale contextual features. The proposed network comprises an encoder that extracts and retains the required features and a decoder that gradually recovers the image resolution and interprets the encoded features.
A feature contains different information at different layers of the neural network. Most crowd counting methods use a 1 × 1 convolution to transform the feature of the last layer of the network into a density map. However, these methods ignore the relations between features at different layers. We use dense connections to improve the structure and integrate features from different layers.
Dilated convolution can effectively expand the receptive field without increasing the number of parameters or the computational cost [11][12][13]. Li et al. [8] proposed a congested scene recognition network (CSRNet) by combining VGG-16 and dilated convolution layers to aggregate multiscale contextual features. Chen et al. [14] proposed a scale pyramid network, which applies different dilation rates in parallel for multiscale information extraction. Although these methods show good performance in many tasks, dilated convolution modules usually have excessive memory requirements. Therefore, such modules must balance efficiency and effectiveness.
In this study, we propose a multiscale aggregation network (MANet) for crowd counting (Figure 2). The proposed MANet is an encoder-decoder network that uses a densely connected multiscale aggregation module in the encoder, referred to as a cascaded scale pyramid network (CSPN). The CSPN contains four parallel dilated convolutions with different dilation rates for capturing the features of different receptive fields. The features obtained using the four dilated convolutions are further fused in a cascade manner to improve the ability of the network to handle multiscale features and to resist interference. Furthermore, a dimension reduction operation reduces the redundant computations that are typical of deep convolutional networks. To restore the resolution of the features, we use deconvolutions with different parameters on the features of different layers in the decoder. The loss function combines a Euclidean loss and a mean absolute error count loss, which together form a valid training objective. We conduct experiments using four major datasets (ShanghaiTech [7], SmartCity [9], WorldExpo'10 [15], and UCF_CC_50 [16]), achieving excellent results.

Related Work
A series of excellent crowd counting methods have been proposed [1,17]. These methods can be categorized as detection-based, regression-based, and CNN-based approaches.

Detection-Based Approaches.
Early detection-based methods used a sliding window for target detection, including the manual extraction of features of the human body or specific parts [18], such as the Haar wavelet [19] and the histogram of oriented gradients [20]. To improve detection accuracy, researchers have analyzed crowd scenes by detecting specific body parts rather than the entire body [21]. Recently, researchers have attempted to employ CNN-based object detectors, such as YOLO [22], SSD [23], and Faster R-CNN [24], to count objects. However, even if only a pedestrian's head or smaller body parts are detected, these methods often cannot handle high-density crowd scenes owing to heavy occlusion and illumination changes.

Regression-Based Approaches.
Regression-based approaches for crowd counting cannot accurately locate pedestrians. However, they can provide more accurate count estimates in crowded scenes. In particular, regression-based approaches include feature-based regression approaches and density estimation-based regression approaches.

Feature-Based Regression Approaches.
Feature-based regression approaches attempt to extract various features from local image blocks [25][26][27]. Foreground or textural features provide low-level information. Similar methods have been formulated based on Fourier analysis, SIFT [28], and interest points [29]. Feature-based regression methods handle occlusion and clutter effectively. However, they ignore scale information.

Density Estimation-Based Regression Approaches.
Density estimation-based regression methods consider the relation between image features and data regression. Lempitsky and Zisserman [30] proposed a linear mapping method considering local region features and density maps. Pham et al. [31] attempted to use random forest regression to realize a nonlinear mapping. Based on these studies, many density estimation-based regression methods for crowd counting have been developed [17,32,33].

CNN-Based Approaches.
Multicolumn methods such as Switch-CNN [4] train a classifier to route image patches into appropriate CNN columns as inputs. A previous study proposed CP-CNN [6], which involves two-column networks to extract both global and local contextual information. The network maps the input data to a high-dimensional feature map and then inputs the previously extracted contextual information into a final fusion network to obtain a high-quality density map. In SAANet [34], global and local attention weights are used to capture variations in the crowd density between and within images.
This attention mechanism allows the network to automatically focus on local and global scales. SANet [10] attempted to extract multiscale head information from each image using a similar front-end network module; the final density map is then obtained by deconvolution using different-sized convolution kernels in each layer. Although these CNN-based methods show good crowd counting ability, they have several disadvantages: their redundant parameters and slow convergence make them difficult to train to solve the multiscale and occlusion problems.
Other studies have combined multiple counting strategies. DecideNet [35] used detection-based methods to count crowds in sparse scenes and regression-based methods in dense scenes, adopting an attention mechanism to regulate the use of the two methods. Sam et al. [36] proposed locating each person in a dense crowd using a bounding box to size the identified heads and then counting them. Another study proposed an adaptive dilated convolution that learns a continuous dilation rate at different positions in the image to effectively match scale changes across positions [37]. The PACNN [38] framework eliminates the need for a density regression paradigm.
The specific operation involves encoding the input through perspective-aware layers and adaptively combining multiscale density maps. In ASNet [39], intermediate density maps and scaling factors are first generated and then multiplied by attention masks to output multiple density maps at different density levels. The final density map is obtained by combining these density maps.

Proposed Approach
An overview of the proposed model is shown in Figure 2. In this section, we describe the proposed model. In Section 3.1, we introduce the cascaded scale pyramid network (CSPN). In Sections 3.2 and 3.3, we describe the feature extraction encoder (FEE) and density map decoder (DMD), respectively. Network parameters are introduced in Section 3.4.

Cascaded Scale Pyramid Network (CSPN).
The scale of people often varies continuously across an image and spans a large range. A network structure that achieves better results usually involves a more complex design. Considering these challenges, we propose the CSPN, which balances efficiency and effectiveness. A standard convolution can be decomposed into two steps [40]: in the first step, pointwise convolution is used to reduce the dimension; in the second step, multiscale features are extracted using a spatial pyramid of dilated convolutions. Motivated by this idea, we define the computational process of our module (Figure 3).
First, the M-dimensional input is reduced to a d-dimensional representation using d pointwise (1 × 1) convolution kernels. Then, four dilated convolutions with different dilation rates process the reduced features in parallel, yielding four features of the same size. Finally, these four features are cascaded, and the result is added to the original input features to obtain the final output. In detail:

(1) The M-dimensional input is reduced to d dimensions using d pointwise (1 × 1) convolution kernels, which removes redundant computation.

(2) The low-dimensional features are computed by parallel dilated convolutions with different dilation rates (d1 = 1, d2 = 4, d3 = 8, and d4 = 16), which rapidly enlarges the receptive field and captures information at multiple scales. Each dilated convolution in the CSPN has the same number of channels. For a 3 × 3 kernel, the receptive fields of the four dilated convolutions are 3 × 3, 9 × 9, 17 × 17, and 33 × 33, respectively.

(3) The outputs are fused in a cascade manner to eliminate the "gridding issue," and the output of the scale pyramid is obtained as

M_s^{ot} = F_s + M_{s-1}^{ot}, with M_1^{ot} = F_1,

where M_s^{ot} represents the fused features of the s-th branch and F_s denotes the output of the s-th dilated convolution. The fused features are spliced (concatenated) to obtain the output of the scale pyramid M^{ot} ∈ R^{H × W × Σ_{s=1}^{4} C_s}, where W and H represent the width and height of the feature map, respectively, and C_s represents the number of output channels of the s-th column.
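The receptive-field sizes quoted above follow directly from the effective kernel size of a dilated convolution, d·(k − 1) + 1. A minimal check in plain Python, using the dilation rates from the text:

```python
def effective_kernel(k, d):
    """Effective (receptive-field) size of a k x k convolution with dilation rate d."""
    return d * (k - 1) + 1

# The four parallel CSPN branches use 3 x 3 convolutions with
# dilation rates 1, 4, 8, and 16.
print([effective_kernel(3, d) for d in (1, 4, 8, 16)])  # [3, 9, 17, 33]
```

This confirms the 3 × 3, 9 × 9, 17 × 17, and 33 × 33 receptive fields of the four branches.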

Feature Extraction Encoder (FEE).
We employ SPPNet as the front-end network of the encoder and feed the generated features to the CSPN. Four CSPNs, connected using specific rules, are used (Figure 4). Each CSPN improves the information flow within the network by sharing the features of the previous CSPN.
If the dense connection method is adopted and each layer produces k features, then k_0 + k(i − 1) features are input to the i-th layer. Here, k_0 is the number of channels in the input layer, and the hyperparameter k is the growth rate of the network. A larger k signifies that more information flows through the network and the feature extraction ability becomes stronger, but the number of model computations also increases.
Since each layer of the network receives the features of all previous layers as inputs, a transition layer follows each densely connected block for dimensionality reduction. We set a compression factor θ (0 < θ ≤ 1) for the dense connections. When θ = 1, the number of output channels does not change. In our transition layers, θ is set to 0.5, implying that each transition layer reduces the number of output channels to half the number of inputs.
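The channel bookkeeping above can be sketched in a few lines; the concrete values of k_0 and k below are hypothetical, chosen only for illustration:

```python
def dense_input_channels(i, k0, k):
    """Channels fed to the i-th densely connected layer (1-indexed): k0 + k*(i-1)."""
    return k0 + k * (i - 1)

def transition_channels(c_in, theta=0.5):
    """Channels after the compression (transition) layer with factor theta."""
    return int(theta * c_in)

# Hypothetical example: k0 = 64 input channels, growth rate k = 32.
c4 = dense_input_channels(4, 64, 32)
print(c4)                       # 160
print(transition_channels(c4))  # 80
```

With θ = 0.5, each transition layer halves the accumulated channel count, keeping the densely connected encoder tractable.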

Density Map Decoder (DMD).
CNN-based methods generate a low-resolution density map during continuous convolution and pooling, owing to which details of the crowd are usually lost [10,14]. We use four fusion layers to progressively refine the features and obtain a high-quality density map. Four deconvolutions restore the resolution of the feature maps; in each deconvolution, the number of input channels equals the number of output channels. Finally, we adopt a 1 × 1 convolution to generate a high-resolution density map with the same resolution as the input image.
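As an illustration of the resolution recovery, the spatial size of a transposed-convolution output is stride·(n − 1) + kernel − 2·padding. The kernel/stride/padding values below are assumptions for the sketch (the paper only states that deconvolutions with different parameters are used), chosen so that each step doubles the resolution:

```python
def deconv_out(n, kernel=4, stride=2, pad=1):
    """Output length of a transposed convolution along one spatial axis
    (assumed kernel/stride/padding; the paper does not give exact values)."""
    return stride * (n - 1) + kernel - 2 * pad

# Four stride-2 deconvolutions restore a 16x-downsampled feature map.
size = 24
for _ in range(4):
    size = deconv_out(size)
print(size)  # 384 == 24 * 16
```

Under these assumed parameters, four successive deconvolutions bring a 16×-downsampled encoder output back to the input resolution.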

Loss Function.
The Euclidean distance is used to assess the difference between the ground truth density map and the model output density map. Based on this assessment, the model parameters are adjusted to produce a density map that closely depicts the ground truth. The Euclidean loss provides an estimation error at the pixel level:

L_E(θ) = (1/2N) Σ_{i=1}^{N} ||F(X_i; θ) − F_i||²₂,

where F(X_i; θ) denotes the output of MANet, θ represents the trainable model parameters, X_i denotes the i-th input image, and F_i represents the corresponding ground truth. In addition, a mean absolute error (MAE) loss on the counts is introduced:

L_C = (1/N) Σ_{i=1}^{N} |C(I_i) − C′(I_i)|,

where I_i represents the density map generated using MANet, C(I_i) represents the estimated count, and C′(I_i) denotes the label count. The final loss is a weighted combination of the two:

L = L_E + α L_C,

where α is a weighting hyperparameter, set to 0.01.
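A minimal NumPy sketch of this combined loss (function and variable names are ours, not from the released code):

```python
import numpy as np

def manet_loss(pred, gt, alpha=0.01):
    """Euclidean pixel-wise loss plus alpha times an MAE count loss.

    pred, gt: arrays of shape (N, H, W) holding predicted and
    ground-truth density maps for a batch of N images.
    """
    n = pred.shape[0]
    l_pix = ((pred - gt) ** 2).sum() / (2 * n)   # Euclidean loss L_E
    est = pred.reshape(n, -1).sum(axis=1)        # estimated counts C(I_i)
    lab = gt.reshape(n, -1).sum(axis=1)          # label counts C'(I_i)
    l_cnt = np.abs(est - lab).mean()             # count loss L_C
    return l_pix + alpha * l_cnt

# A perfect prediction gives zero loss.
gt = np.ones((1, 4, 4))
print(manet_loss(gt, gt))  # 0.0
```

The count term is cheap to compute because the estimated count is simply the sum of the density map, so α = 0.01 only gently nudges the pixel-wise objective toward count consistency.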

Experiments
We evaluate the proposed MANet using four datasets (ShanghaiTech [7], SmartCity [9], WorldExpo'10 [15], and UCF_CC_50 [16]). First, we introduce the evaluation metrics, ground truth generation, and training details. Then, we compare the proposed method with state-of-the-art methods on these datasets. Finally, we demonstrate the effectiveness of our modules via ablation experiments. The experiments were implemented in PyTorch, and the detailed network configuration is shown in Figure 5.

Evaluation Metrics.
Following the existing literature, the evaluation metrics are the MAE and the mean squared error (MSE), which together evaluate the performance of crowd counting methods. The MAE indicates the accuracy of the count, and the MSE reflects the robustness of the model. The MAE and MSE are calculated as follows:

MAE = (1/N) Σ_{i=1}^{N} |Z_i − Z_i′|,
MSE = sqrt( (1/N) Σ_{i=1}^{N} (Z_i − Z_i′)² ),

where N represents the number of test images, Z_i denotes the actual number of people in the i-th test image, and Z_i′ denotes the corresponding estimated count, i.e., the model output.
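For concreteness, the two metrics can be computed as follows (note that the "MSE" conventional in the crowd counting literature is actually a root mean squared error; the counts below are toy values):

```python
import math

def mae(actual, estimated):
    """Mean absolute counting error over N test images."""
    return sum(abs(z - z2) for z, z2 in zip(actual, estimated)) / len(actual)

def mse(actual, estimated):
    """Root mean squared counting error (called MSE in this literature)."""
    return math.sqrt(sum((z - z2) ** 2 for z, z2 in zip(actual, estimated)) / len(actual))

counts_gt = [100, 250, 40]   # ground-truth counts (toy values)
counts_est = [90, 260, 46]   # model estimates (toy values)
print(round(mae(counts_gt, counts_est), 2))  # 8.67
print(round(mse(counts_gt, counts_est), 2))  # 8.87
```

Because the MSE squares the per-image errors before averaging, a single badly miscounted image inflates it far more than the MAE, which is why it is read as a robustness indicator.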

Ground Truth Generation.
We follow the scheme used in previous studies [7,8,14] to prepare the ground truth density maps. To allow the density map to adapt to various crowd conditions, an image with N heads is represented as

F(x) = Σ_{i=1}^{N} δ(x − x_i) * G_{σ_i}(x), with σ_i = β d_i,

where x_i represents the pixel position of the i-th pedestrian's head, the delta function δ(x − x_i) is convolved with a Gaussian kernel G_{σ_i}(x) normalized to 1, β is a constant, and d_i represents the average distance from the i-th head to its k nearest neighbors. In our experiments, we follow a previously proposed configuration [8], with β = 0.3 and k = 3. The parameter settings for the different datasets are listed in Table 1.
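A sketch of this geometry-adaptive generation (NumPy; helper names are ours). Because each Gaussian is normalized to sum to 1 within the map, the density map integrates to the head count:

```python
import numpy as np

def density_map(shape, heads, beta=0.3, k=3):
    """Build a ground-truth density map from head coordinates.

    Each head is blurred with a Gaussian whose sigma is beta times the
    mean distance to its k nearest neighbouring heads (sigma_i = beta * d_i).
    """
    h, w = shape
    dmap = np.zeros((h, w))
    pts = np.asarray(heads, dtype=float)
    yy, xx = np.mgrid[0:h, 0:w]
    for i, (y, x) in enumerate(pts):
        dists = np.sort(np.sqrt(((pts - pts[i]) ** 2).sum(axis=1)))[1:k + 1]
        sigma = beta * dists.mean() if dists.size else 4.0  # fallback: lone head
        g = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()  # each head contributes exactly 1 to the integral
    return dmap

heads = [(10, 10), (12, 30), (30, 20)]
dmap = density_map((50, 50), heads)
print(round(dmap.sum()))  # 3  (one unit of density per annotated head)
```

In dense regions the nearest-neighbor distances shrink, so the Gaussians narrow automatically, which is the point of the geometry-adaptive kernel.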

Training Details.
MANet has an end-to-end structure, and the training process is simple. We set the training batch size to 1. MANet uses standard SGD with momentum (0.9) as the optimizer. The weights are initialized from a random Gaussian with a standard deviation of 0.01. The initial learning rate is set to 1e−5 and decreases as the number of iterations increases.
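The parameter update performed by this optimizer is classical momentum; a one-step sketch with the paper's hyperparameters (the gradient value is arbitrary):

```python
def sgd_momentum_step(w, v, grad, lr=1e-5, momentum=0.9):
    """One SGD-with-momentum update: v <- m*v - lr*g, then w <- w + v."""
    v = momentum * v - lr * grad
    return w + v, v

w, v = 0.5, 0.0
w, v = sgd_momentum_step(w, v, grad=2.0)
print(round(w, 6))  # 0.49998
```

With a learning rate of 1e−5 each step moves the weights only slightly, which suits the batch size of 1 used here.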

Comparisons with State-of-the-Art (SOTA) Methods.
We illustrate the results of our method using four challenging datasets covering different crowd situations, such as dense and sparse scenes. We present the density estimation results generated using MANet and discuss the remaining problems of the model based on the results.

ShanghaiTech Dataset.
The test and visualization results obtained using the ShanghaiTech dataset are presented in Table 2 and illustrated in Figure 6. On Part_A, our approach outperforms PACNN [38], the most recently proposed method, by 1.49% and 10.2% in terms of MAE and MSE, respectively. These are good, although not exceptional, results. Compared with Part_A, the improvement of the proposed method over PACNN [38] on Part_B is smaller. This is because the image sources differ. The images in Part_A were downloaded randomly from the Internet, and the crowd density is very high. The images in Part_B were captured in street scenes with a low crowd density and relatively complex backgrounds compared with Part_A. Our proposed network handles the multiscale problem well; however, it does not completely solve the problem of complex backgrounds. Many recent studies have added an attention mechanism, which improves performance in some cases [14,38].

UCF_CC_50 Dataset.
UCF_CC_50 [16] contains 50 crowd images with a total of 63,974 people; the number of annotated people per image ranges from 94 to 4,543 (an average of 1,280). Fivefold cross-validation is the most commonly used protocol on this dataset. The test and visualization results obtained using the UCF_CC_50 dataset are presented in Table 3 and illustrated in Figure 7. UCF_CC_50 is a very challenging dataset: it is small, the resolution of the images is not high, and the pedestrians are captured from different perspectives, so scale variations are obvious. The MAE and MSE values obtained using the proposed method are 240.8 and 311.5, respectively; these values are 7.09% and 7.26% lower (better) than those obtained using SPN [14]. Only some images in this dataset have background interference. These findings also prove that the proposed method achieves good results when handling small datasets with large scale changes and dense crowds.

WorldExpo'10 Dataset.
WorldExpo'10 [15] includes images captured using 108 different surveillance cameras, containing 3,980 annotated frames from 1,132 video sequences, which enables cross-scene evaluation of a model. Regions of interest are provided for all scenes. The test and visualization results obtained using the WorldExpo'10 dataset are provided in Table 4 and illustrated in Figure 8. The dataset is divided into five scenes with different degrees of background interference. We tested each of them, and the average MAE is 7.86. The best results are obtained in S1 and S5, i.e., 2.1 and 3.0, respectively. However, our results are not as good as those obtained using SOTA methods in the other scenes [12,38]. Relative to the other datasets, the shooting distance is long, the crowd does not show obvious multiscale changes, and the background interference is higher. In this case, our approach still shows good performance.

SmartCity Dataset.
SmartCity [9] contains 50 images collected from a high shooting angle across ten scenes, such as sidewalks. The test and visualization results obtained using the SmartCity dataset are presented in Table 5 and illustrated in Figure 9. The MAE and MSE values are 8.2 and 9.6, respectively; these values are 4.65% and 17.24% lower (better) than those obtained using SaCNN [9]. Differing from UCF_CC_50, the SmartCity dataset is small, its crowds are sparse, and its images have complex backgrounds that are usually easy to identify. The results demonstrate that the proposed model performs well on small datasets with images of sparse crowds.
As shown in the tables, our method obtains the lowest MAE and MSE values on multiple datasets. These results demonstrate the effectiveness of the proposed method, particularly in the case of a high-density crowd in an image. This observation not only proves the effectiveness of our method but also demonstrates its robustness. We compare the visualization results of the proposed method with those obtained using SOTA methods. The density map produced by our model is of higher quality and closer to the ground truth map (Figure 10). This also proves that our model can retain more multiscale and contextual information. However, our results also indicate that the proposed model has some limitations. Occasionally, objects in the background of an image are mistakenly classified as pedestrians; these cases are indicated by boxes outlined in red (Figures 7 and 8). Such errors may cause further problems in scenes whose backgrounds resemble the targets, and we must address this issue through certain mechanisms.

Table 3: Comparison of MAE and MSE on the UCF_CC_50 dataset.
Method          | MAE   | MSE
Zhang et al. [15] | 467.0 | 498.5
MCNN [7]        | 377.6 | 509.1
Switch-CNN [4]  | 318.1 | 439.2
CP-CNN [6]      | 295.8 | 320.9
CSRNet [8]      | 266.1 | 397.5
SANet [10]      | 258.4 | 344.9
SPN [14]        | 259.2 | 335.9
MANet (ours)    | 240.8 | 311.5

Ablation Experiments.
In this section, we describe several ablation studies, including the CSPN and dense connection operations, to demonstrate the effects of different modules in our proposed MANet.

Effectiveness of CSPN.
To prove the effectiveness of the CSPN structure, we conduct multiple ablation experiments: (1) the last convolution layer in MCNN is replaced with the CSPN (MCNN + CSPN); (2) the last convolution layer in MCNN is replaced with the SAN of SANet (MCNN + SAN); (3) the back end of CSRNet is replaced with the CSPN (CSRNet + CSPN); (4) the CSPN in MANet is replaced by an ordinary convolution (CNet). The results are listed in Table 6.
Our experiment on MCNN proves that the CSPN is effective: the MAE of MCNN is reduced from 110.2 to 92.4, and the MSE is reduced from 173.2 to 157.5. However, our effect is similar to that of SAN, and the results compared with CSRNet are also similar. Moreover, the self-ablation experiment (CNet) proves the effectiveness of the CSPN.

Effectiveness of Dense Connections.
The results of the dense connection ablation are presented in Table 7.
The results demonstrate that incorporating dense connections provides better results than omitting them. More connections help the model retain features; however, the disadvantage is that the larger number of features requires more computational resources and training time.

Figure caption: The first, second, and third columns show test samples, the corresponding ground truth, and the generated density maps, respectively. The box outlined in red marks an area where the background was mistaken for a head.

Figure 10: Comparison of the visualization results obtained using our method and those obtained using SOTA methods. From left to right: test samples, ground truth, and the visualization results obtained using MCNN [7], DSNet [41], and MANet (ours).

Conclusion
In this study, we proposed MANet, an innovative encoder-decoder structure for crowd counting. MANet comprises an FEE and a DMD. The FEE uses dense connections to integrate the features extracted by the CSPN, a multiscale aggregation module, to obtain multiscale and contextual information. The DMD adopts deconvolutions and fusion operations to obtain features containing detailed information and thereby produce high-quality density maps. We conducted numerous experiments using the four datasets. The experimental results show that the proposed MANet performs well in terms of MAE and MSE. Future work will focus on adding an attention mechanism for an improved distinction between crowds and backgrounds.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding this study.