A Fusion-Based Dense Crowd Counting Method for Multi-Imaging Systems

Dense crowd counting has become an essential technology for urban security management. Traditional crowd counting methods mainly apply to scenes with a single view and obvious features; they cannot handle large areas in which crowd features are fuzzy. Therefore, this paper proposes a crowd counting method based on high and low view information fusion (HLIF) for large and complex scenes. First, a neural network based on an attention mechanism (AMNet) is established to obtain a global density map from a high view and crowd counts from a low view. Then, the temporal correlation and spatial complementarity between cameras are used to calibrate the overlap areas of the two images. Finally, the total number of people is calculated by combining the low-view crowd counts and the high-view density map. Compared to single-view crowd counting methods, HLIF is experimentally more accurate and has been successfully applied in practice.


Introduction
Crowd counting and density estimation have essential applications in urban security governance, such as preventing dangerous events like crowd stampedes and illegal assemblies. With the continuous development of deep learning, the performance of single-camera dense crowd counting methods has gradually improved, achieving good results on existing single-view datasets [1][2][3]. However, for large, wide scenes that require more robust perception, such as parks, squares, stadiums, and large train stations (Figure 1), a single camera cannot cover the entire scene while also keeping the feature information clear. It is also challenging to account for the impact of objects such as buildings, landscapes, and stairs on the count. Therefore, multiple cameras with overlapping fields of view can provide complementary information about different features, effectively addressing problems such as occlusion and perspective.
Traditional multi-view counting methods are mainly based on detection [4][5][6][7], regression [8,9], and 3D cylinders [10]. The detection-based methods obtain the detection results of each view with a detector and then integrate the information from multiple perspectives. The regression-based methods extract the features of each view, such as size, shape, and key points, and construct a regression function between the feature vector and the crowd size. The 3D cylinder-based methods determine the position of a person in a 3D scene by minimizing the distance between the 3D projected position of the person and the camera view. However, these methods depend on manual feature extraction and foreground extraction techniques, which are ineffective for high-density crowd counting. Building on the excellent performance of deep neural networks in single-view dense crowd counting, Zhang et al. [11] constructed a deep neural network for multi-view counting and achieved more accurate results. Existing algorithms have made good progress on low- and medium-density crowd counting problems. However, further research is required for high-density or ultra-high-density scenarios.
Based on the above problems, we propose a crowd counting method based on high and low view information fusion (HLIF). First, under the premise that the high-altitude view includes the low-altitude view, as shown in Figure 2(a), an attention mechanism-based low-altitude crowd counting model, AMNet, is established for accurate crowd counting in low-altitude view images. Second, corresponding feature points between the high-altitude view and low-altitude view images are selected to fuse the discrete crowd density map of the high-altitude view images with the low-altitude view images. Finally, the global number of people is derived from the low and high view crowd density map information in the overlap area.
The main existing multi-view datasets are PETS2009 [12], DukeMTMC [13], and City Street [14], as shown in Figure 3, but these datasets have some shortcomings. PETS2009 focuses on a sidewalk scene, not a large-view scene. DukeMTMC focuses on multi-view tracking and human detection, so strictly speaking it is not a dense crowd dataset. City Street improves on the resolution and crowd size of the first two datasets, but the crowd density level is still low. To better explore the multi-view approach, we established the dataset CROWD_SZ for a landmark square with a large musical fountain; it contains rich crowd density levels, multiple camera views with different spatial perceptions, etc., as shown in Figure 2(b).
In summary, the work and contributions of this paper are as follows.
(i) For large-view scenes, a high- and low-altitude view image fusion mechanism is established. It effectively utilizes the spatial complementarity between high- and low-altitude image information to align and fuse the crowd density maps of high-altitude view and low-altitude view images, and it is experimentally verified to be highly accurate.
(ii) We construct a multi-view dense crowd counting dataset from real-world scenes. It provides video frames of different scenes with good statistical dispersion and fully shows the actual situation in multi-view scenes.

Related Work
Since single-view images are primarily close-up images with more obvious crowd characteristics, existing methods have improved on scale variation, background noise, and other issues. However, they can only count the number of people in a single region and cannot reflect the global crowd dynamics in large scenes. Multi-view counting can compensate for the lack of information provided by single-view images and effectively solve the problems of occlusion and perspective in dense crowd scenes. In addition, reliable single-view counting helps improve the perception of large scenes and the accuracy of global crowd calculation, and it is the basis of multi-view research.

Single-View Perception Methods.
To address the problems of scale variation and uneven distribution in crowd counting, Zhang et al. [1] constructed a multi-column convolutional neural network that uses convolutional kernels of different sizes to extract head information at various scales, which is a pioneering work in crowd counting algorithms. PACNN [14] is a four-column perspective-aware convolutional neural network that integrates perspective information into density regression to provide additional knowledge of the scale variation of people in the image, and the output density maps are combined to adapt to scale variation. DeepCount [15] introduces multi-gradient fusion, where the backbone network receives gradients from multiple branches to learn density information. MMNet [16] is an end-to-end scale-aware network that handles head scale variation by integrating multi-scale features generated by filters of different sizes. SASNet [17] is a scale-adaptive selection network that automatically learns the internal correspondence between scale and feature level. STNet [18] consists of a scale tree diversity enhancer and multi-level auxiliaries and uses a tree structure for hierarchical parsing of coarse-to-fine crowd regions to alleviate the scale variation problem. PFDNet [19] overlays multiple perspective views in the backbone network and introduces fractional dilated convolution to address the scale variation problem.
International Journal of Intelligent Systems
In a recent study, Gao et al. [20] analyzed crowd counting technologies in the fields of the Internet of Things, healthcare systems, and cell counting, showing the trends and challenges of crowd counting technologies in various fields. Guo et al. [21] proposed a lightweight Ghost Attention Pyramid Network (GAPN) that combines the advantages of Ghost Convolution and Ghost Batch Normalization, reduces the network parameters and computation, and uses a Pyramid Attention Mechanism to capture multi-scale crowd features. The Scale Region Recognition Network (SRRNet) was proposed in [22]; it addresses the scale variation problem by encoding multiple scale feature representations through a Scale-Level Awareness module and suppresses background interference with an Object Region Recognition module. Zhai et al. [23] proposed the Scale-Context Perceptive Network (SCPNet), which consists of a Scale Perceptive (SP) module and a Context Perceptive (CP) module. The scale variation problem is addressed by a local-global branching structure, and the CP module uses a channel-spatial self-attention mechanism to suppress the effect of background interference. Zhai et al. [24] proposed the Attentive Hierarchy ConvNet (AHNet), in which a discriminative feature extractor extracts multi-level feature representations and a hierarchical feature aggregator mines semantic features in a coarse-to-fine manner.
Zhai et al. [25] proposed the Feature Pyramid Attention Network (FPANet), which uses a lightweight structure to extract features at multiple scales and uses an attention mechanism to focus on crowd regions and suppress background interference. The Dense Attention Fusion Network (DAFNet) was proposed in [26]. DAFNet employs a partitioning strategy and designs two key modules: the Iterative Attention Fusion (IAF) module and the Dense Spatial Pyramid (DSP) module. The IAF module utilizes multi-scale channel attention units to mitigate the effect of background clutter, and the DSP module utilizes hierarchical information from different receptive fields to overcome the problem of object scale variation. Guo et al. [27] proposed a Triple Attention and Scale-Aware Network (TASNet) for object counting in remotely sensed images, where the feature pyramid module uses a lightweight structure to extract multi-scale features, the triple-weighted map attention module uses three-dimensional attention to distinguish between the object region and the background region, and the pyramid feature aggregation module uses adaptive weight fusion to generate the final density map. Guo et al. [28] proposed a Spatial-Frequency Attention Network (SFANet), where the spatial attention module emphasizes features at different locations in the spatial domain and adaptively selects regions containing individuals, and the multi-spectral channel attention module obtains a more complete representation of each channel with frequency components in the frequency domain.
Inspired by biology, Zhai et al. [29] proposed the Grouped Segmentation Attention Network (GSANet), which reduces the computational cost by dividing the input feature map into multiple subgroups and processing the subfeatures of each subgroup in parallel. At the same time, it combines the information of the spatial and channel dimensions to mitigate the estimation error in the background region. Finally, it employs a learning-based cross-group strategy to aggregate and facilitate the fusion of feature maps with different channel dimensions. Zhai et al. [30] proposed a Dual Attention Perception Network for robust crowd counting in dense crowd scenes with scale variations. The network consists of a Spatial Attention (SA) module and a Channel Attention (CA) module. The SA module focuses on spatial dependencies throughout the feature map to accurately localize the head. The CA module processes the relationships between the channel maps and highlights discriminative information in specific channels. The interaction between the two modules provides synergy and helps in learning discriminative features with attention to the head region.
Although the multi-column structure can effectively handle the scale variation problem, such models usually depend on the number of network branches. They have many parameters, are difficult to train, and have poor real-time counting capability. The density map generated by attention-based methods contains certain errors, so it cannot be used directly for density estimation; a way of perceiving density that minimizes these errors is needed.
Based on this, our low-view crowd counting method uses a single-column structure to ensure the accuracy of counting while simplifying the computational complexity of the network.

Multi-View Perception Methods.
Given the limitations of traditional multi-view counting methods, the powerful feature extraction capability of deep neural networks, and their success in single-view counting, more and more neural network-based multi-view counting methods have been proposed. The multi-view multi-scale (MVMS) [11] model is the first DNN-based multi-view counting method. Using the camera geometry, the feature maps from all cameras are projected onto the ground plane in the 3D world so that the same person's features are approximately aligned across multiple views. The aligned single-view feature maps are fused together and used to predict the scene-level ground-plane density map. Zhang and Chan [31] switched to a 3D density map and 3D projection to improve the counting performance. Zheng et al. [32] later further enhanced the performance of the fusion model in MVMS by modeling the correlation between each pair of views. Since projecting single-view features onto the ground plane for fusion requires camera calibration, these methods cannot be applied to scenes where camera calibration is impossible. Zhang et al. [33] proposed a cross-view cross-scene multi-view counting model (CSCV) that incorporates camera selection and noise into the training and can output density maps in different scenes with arbitrary camera layouts. In [34], a calibration-free multi-view counting model (CF-MVCC) was proposed to obtain the whole-scene person count by weighting the confidence scores of camera view content and distance information. To solve the view scale inconsistency problem in multi-view counting methods, Liu et al. [35] proposed a multi-view crowd counting model (SASNet) based on scale aggregation and spatially aware networks, in which a multi-branch adaptive scale aggregation module selects the appropriate scale for each pixel in each view based on the extracted features, which ensures scale consistency across views.
A key issue in multi-view counting is how to fuse information from multiple cameras. Existing methods primarily project features from the original image coordinates (2D) to the world coordinate system (3D). Then, the projected features of multiple views are fused to generate a scene-level density map. Therefore, it is necessary to obtain the internal and external parameters of the camera and the z-coordinate (height) in the world coordinate system, which limits the application of these methods in natural scenes and prevents their use across scenes. In addition, the validation datasets used by these methods only partially satisfy the requirements of large scenes. The crowd density needs to be higher to fully illustrate the effectiveness of dense crowd counting methods in large scenes.
Based on the above problems, we propose a new crowd counting method based on the fusion of high- and low-altitude information and validate it using a new large-scene dense crowd dataset.

Method
To address incomplete low-view coverage and unclear high-view texture in dense crowd estimation, global density information is corrected using the crowd counts from the low view. As shown in Figure 4, high and low view information fusion (HLIF) consists of two modules: local information processing and information fusion. A high-view image is fed into the network to obtain a discrete density map with multiple density levels, which directly reflects the density similarity of each region. At the same time, the overlap area with the low-view image is determined, and its crowd count is obtained. The information fusion module combines the number of people counted in the low view with the pixel counts of the high-view density map in the overlap area, uses the similarity of density distribution to compensate for blind areas, establishes a proportional relation, and deduces the crowd count.

Low-View Crowd Counting
3.1.1. AMNet Architecture. In this method, we select the first 13 layers of VGG-16 [36] as the front-end network, as shown in Figure 5. VGG-16 has a simple structure, strong feature extraction ability, and strong transfer learning ability, and it has performed well in many fields, such as image recognition and target detection. Consecutive 3 × 3 convolutional kernels are used to preserve the receptive field while reducing the number of parameters. A pooling size of 2 × 2 is selected to extract the main features, reduce the size of the output feature map, and simplify the computational complexity of the network. The back-end of the network introduces the Convolutional Block Attention Module (CBAM) [37]. The spatial attention module takes the feature map output from the front-end network as input and assigns weights to pixels at different locations, and the channel attention module assigns weights to the foreground and background information in the feature channels, so that the model pays more attention to head location information. The attention module generates weights of the corresponding dimensions for the original feature maps in channel and space, respectively, and multiplies them with the input feature maps, which makes the network focus on the more important feature information. Finally, successive convolutional layers output the density map of the crowd distribution.

Ground Truth Generation.
Before training, we first generate the ground truth, which gives the real number of people in the image; the density map is generated by convolving the head annotations with a normalized Gaussian kernel. The specific process is as follows. If there is a head at a particular position $x_i$, it can be represented as $\delta(x - x_i)$. If there are $N$ heads in a picture, the annotated picture can be expressed as follows:

$$\sum_{i=1}^{N} \delta(x - x_i)$$

The size of a head is usually related to the distance between its center and the centers of its $K$ adjacent heads, so the distribution can be obtained with a geometry-adaptive Gaussian kernel, and the density map $F(x)$ can be expressed as

$$F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \quad \sigma_i = \beta d_i$$

where $G_{\sigma_i}$ is a Gaussian kernel, $\sigma_i$ represents the average adaptive distance of the head within a certain range, $\beta$ is usually 0.3, and $d_i$ is the average distance to the $K$ nearest neighbors of the current head.
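The geometry-adaptive ground truth described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: it assumes $\sigma_i = \beta d_i$ with $\beta = 0.3$, normalizes each kernel over the image so that every head contributes exactly one person, and introduces hypothetical names (`adaptive_sigmas`, `density_map`), a hypothetical fallback $\sigma$ of 1.0 for an isolated head, and a 0.5-pixel floor on $\sigma_i$.

```python
import math


def adaptive_sigmas(heads, k=3, beta=0.3, fallback=1.0, floor=0.5):
    """Return sigma_i = beta * d_i, with d_i the mean distance to the k nearest heads."""
    sigmas = []
    for i, (xi, yi) in enumerate(heads):
        dists = sorted(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(heads)
            if j != i
        )
        nearest = dists[:k]
        # fallback and floor are assumptions of this sketch, not values from the paper
        sigma = beta * sum(nearest) / len(nearest) if nearest else fallback
        sigmas.append(max(sigma, floor))
    return sigmas


def density_map(heads, height, width, k=3, beta=0.3):
    """Sum one normalized Gaussian per annotated head into a height x width grid."""
    grid = [[0.0] * width for _ in range(height)]
    for (hx, hy), sigma in zip(heads, adaptive_sigmas(heads, k, beta)):
        kernel = [
            [math.exp(-((x - hx) ** 2 + (y - hy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, kernel))  # normalize so each head integrates to 1
        for y in range(height):
            for x in range(width):
                grid[y][x] += kernel[y][x] / total
    return grid


heads = [(4, 4), (10, 5), (6, 12)]  # hypothetical (x, y) head annotations
gt = density_map(heads, height=16, width=16)
count = sum(map(sum, gt))  # the map integrates to the number of heads
```

Normalizing inside the image is a design choice of the sketch: it keeps the sum of the map equal to the annotated count even when a kernel is clipped by the image border.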

Data Augmentation.
We crop four patches, each a quarter of the size of the original image, from different locations of each image and convert them to grayscale. At the same time, the adaptive kernel is applied to the annotation files, a weighted sum over the original image is computed, and the ground truth of each image is obtained by traversing the annotations; this ground truth is used in the subsequent model training.

3.1.4. Training Details. The model's input is the crowd image, and the output is the density map. Stochastic Gradient Descent (SGD) is applied with a fixed learning rate of 1e-6 during training. We choose the Euclidean distance to measure the difference between the ground truth and the estimated density map. The loss function is given as

$$L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(X_i; \Theta) - F_i^{GT} \right\|_2^2$$

where $N$ is the size of the training batch, $F(X_i; \Theta)$ is the output generated by AMNet with parameters $\Theta$, $X_i$ represents the input image, and $F_i^{GT}$ is the ground truth of the input image $X_i$.
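As an illustration of the Euclidean loss between estimated and ground-truth density maps, the sketch below assumes the batch-averaged form with a $1/(2N)$ factor that is common in density-map regression; that factor and the function name `euclidean_loss` are assumptions of this example rather than details confirmed by the text.

```python
def euclidean_loss(predicted, ground_truth):
    """Return 1/(2N) * sum_i ||pred_i - gt_i||_2^2 over a batch of 2D density maps.

    The 1/(2N) scaling is an assumption of this sketch.
    """
    n = len(predicted)
    total = 0.0
    for pred_map, gt_map in zip(predicted, ground_truth):
        for pred_row, gt_row in zip(pred_map, gt_map):
            total += sum((p - g) ** 2 for p, g in zip(pred_row, gt_row))
    return total / (2 * n)


# One 2 x 2 map in the batch: squared error is 1 + 4 = 5, so the loss is 5 / 2.
batch_pred = [[[1.0, 2.0], [0.0, 0.0]]]
batch_gt = [[[0.0, 0.0], [0.0, 0.0]]]
loss = euclidean_loss(batch_pred, batch_gt)
```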

Discrete Density Map Generation.
First, the image is input into AMNet to obtain the density map. Second, using a colormap in Matplotlib, the density map is discretely divided into four density levels: low density, medium density, high density, and ultra-high density.
We determined the thresholds for the different density classes from the statistics of CROWD_SZ. Specifically, we took the number of people at each pixel as the density value and then calculated the mean and standard deviation of the density values over all pixels of the images. We used a mean plus or minus standard deviation as the dividing line between medium and high density, and two mean plus or minus standard deviations as the dividing lines between low and medium density and between high and ultra-high density. Finally, as shown in Figure 6, the density map displays different colors according to the density level, namely, orange, green, red, and purple. Color more intuitively reflects the correlation between regions with the same density level in the same image. In the image from the high view, some regions overlap with the image from the low view, so the density distribution in these regions can be observed more clearly. However, some high-view regions are not covered by any low-view camera. Therefore, analogous reasoning is applied to other high-view regions with the same density. The overlap area between high- and low-view images reflects both the distribution of crowds in the high-view image and the specific number of people in the low-view image. The corresponding relationship is shown in Figure 7.
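The discretization step can be sketched as follows. The code computes the per-pixel mean and standard deviation and assigns each pixel to one of the four levels given three ascending cut points. How the mean and standard deviation are combined into those cut points is left to the caller, because the exact placement is not fully specified in the text; the function names and the sample cut points are hypothetical.

```python
import bisect
import statistics

LEVELS = ("low", "medium", "high", "ultra-high")


def density_stats(density):
    """Mean and population standard deviation over all pixels of a 2D density map."""
    values = [v for row in density for v in row]
    return statistics.mean(values), statistics.pstdev(values)


def discretize(density, cuts):
    """Label every pixel with a density level using three ascending cut points."""
    ordered = sorted(cuts)
    if len(ordered) != 3:
        raise ValueError("four density levels need exactly three cut points")
    return [[LEVELS[bisect.bisect_right(ordered, v)] for v in row] for row in density]


# Hypothetical density values and cut points, for illustration only.
labels = discretize([[0.0, 0.5, 1.5, 3.0]], cuts=(0.25, 1.0, 2.0))
mean, std = density_stats([[1.0, 3.0]])
```

Each label can then be mapped to a display color (orange, green, red, and purple in Figure 6) and the pixels per level counted.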

High- and Low-View Image Region Registration
To achieve more accurate regional registration, as shown in Figure 8, this paper finds the corresponding feature points in the high-view images and registers them according to homography theory. The position of a feature point in the low-view image is represented by $(x, y)$, the position coordinates of multiple feature points are represented by $(x_r, y_r)$, and the corresponding feature points in the high-view region are represented by $(X_r, Y_r)$. The transformation relation of the two images is as follows:

$$\begin{bmatrix} X_r \\ Y_r \\ 1 \end{bmatrix} = H \begin{bmatrix} x_r \\ y_r \\ 1 \end{bmatrix}$$

where $H$ is the transformation matrix.
To verify the registration accuracy, ground marker lines were selected as reference lines, as shown in Figure 9. The red line in Figure 9(a) is the ground marker line, and the verification result in the fifth column is obtained by applying the transformation; the two overlap closely. After the matrix accuracy is verified, the rectangular region to be studied is selected from the high-view image and transformed to obtain the corresponding overlap region in the low-view image. In Figure 9(a), the red border is the selected overlap region, and the blue quadrangle in Figure 9(c) is obtained after the homography matrix transformation.
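Applying a homography to image points can be sketched as below. The sketch assumes $H$ is a 3 × 3 matrix acting on homogeneous coordinates, with the result divided by the third component; the matrices in the example are made up, and estimating $H$ from matched feature points is outside the scope of the sketch.

```python
def apply_homography(H, x, y):
    """Map a point (x, y) through the 3x3 matrix H using homogeneous coordinates."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return u / w, v / w


def map_polygon(H, corners):
    """Map every corner of a region; a rectangle generally becomes a quadrangle."""
    return [apply_homography(H, x, y) for x, y in corners]


# Hypothetical matrices: a scale-and-shift, and one with a non-unit last entry.
H_affine = [[2, 0, 10], [0, 2, 5], [0, 0, 1]]
H_scaled = [[1, 0, 0], [0, 1, 0], [0, 0, 2]]
mapped_point = apply_homography(H_affine, 3, 4)
quadrangle = map_polygon(H_affine, [(0, 0), (4, 0), (4, 2), (0, 2)])
```

Mapping in the opposite direction uses the inverse matrix of $H$.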

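Information Fusion. The information fusion process is as follows:
(i) Image acquisition: at the same time, select 1 high-view device, called A, and $m$ low-view devices, called $B_1, B_2, B_3, \ldots, B_m$.
(ii) Region registration: select $m$ rectangular areas (in the overlap areas) on device A as $L_1, L_2, L_3, \ldots, L_m$ and use the matrix $H$ to obtain the overlap area on device $B_i$ corresponding to each rectangular region.
(iii) Information processing: input the high-view image to get the discrete density map and calculate the pixel number $b_t$ of each density level, where $t$ represents the density level. The number of pixels at each density level in the region $L_m$ is denoted as $n_j$. Through the registration matrix, the discrete density map is projected onto the corresponding low-view image, and AMNet is used to calculate the specific number of people $l_j$ in the covered area.
(iv) Finally, establish a proportional relationship to calculate the global number $G$.
In this proportional relationship, $G$ is the global number of people in the high-view image, $n_j$ represents the pixels of the same areas in the discrete density map, $l_j$ is the number of people in the area covered by the discrete density map, $b_t$ is the number of all pixels contained in each density level, $m$ is the number of overlap areas, and $t$ is the density level.

The evaluation metrics are the Mean Absolute Error (MAE) and the Mean Squared Error (MSE):

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| Z_i - \hat{Z}_i \right|, \quad MSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( Z_i - \hat{Z}_i \right)^2}$$

where $N$ is the number of test images and $Z_i$ and $\hat{Z}_i$ represent the ground truth and estimated values of the $i$th image.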
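The two evaluation metrics can be computed as in the following sketch. It assumes the convention common in the crowd counting literature, in which the quantity reported as MSE is the square root of the mean squared count error; that convention, the function names, and the sample counts are assumptions of this example.

```python
import math


def mae(ground_truth, estimated):
    """Mean absolute error between true and estimated per-image counts."""
    return sum(abs(z - z_hat) for z, z_hat in zip(ground_truth, estimated)) / len(ground_truth)


def mse(ground_truth, estimated):
    """Root of the mean squared count error (assumed crowd counting convention)."""
    squared = sum((z - z_hat) ** 2 for z, z_hat in zip(ground_truth, estimated))
    return math.sqrt(squared / len(ground_truth))


# Hypothetical counts for two test images.
true_counts = [10, 20]
est_counts = [12, 16]
mae_value = mae(true_counts, est_counts)   # (2 + 4) / 2
mse_value = mse(true_counts, est_counts)   # sqrt((4 + 16) / 2)
```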

AMNet Performance Evaluation.
Our model was trained on two benchmark datasets, ShanghaiTech and UCF_CC_50.Experimental results show that this structure is superior to most existing crowd counting networks.
As shown in Table 1, the ShanghaiTech dataset contains 1,198 images, with a total count of 330,165 people, divided into two parts: crowded scenes (Part_A) and sparse scenes (Part_B). In Part_A, AMNet obtains the best MAE and the second-best MSE, and the MAE is 1.1% lower than that of IG-CNN. In Part_B, AMNet achieves the best MAE and MSE; the MAE is reduced by 23.6% compared to IG-CNN, and the MSE is reduced by 11.8% compared to PCC Net.
UCF_CC_50 specifically collects high-density crowd scenes, which compensates for the shortcomings of the existing dataset. The number of people in a single image ranges from 94 to 4,543, with an average of 1,280. However, this dataset contains only 50 images, so we used 5-fold cross-validation. As shown in Table 2, the MAE was 2.8% lower than that of the previous best method, SANet.
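For a 50-image dataset, a 5-fold split can be sketched as below. The split is contiguous and unshuffled, which is an assumption of this example; the text does not say how the folds were drawn, and `k_fold_splits` is a hypothetical helper name.

```python
def k_fold_splits(n_items, k=5):
    """Return k (train_indices, test_indices) pairs covering range(n_items)."""
    indices = list(range(n_items))
    fold_size, remainder = divmod(n_items, k)
    splits = []
    start = 0
    for fold in range(k):
        # the first `remainder` folds take one extra item when n_items % k != 0
        size = fold_size + (1 if fold < remainder else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits


splits = k_fold_splits(50, k=5)  # each fold tests on 10 images and trains on 40
```

The model is trained once per fold, and the reported error is the average over the five held-out folds.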

Validation of Fusion Algorithm.
At present, there is no dataset with both high and low views on which to test the proposed method, so we collected a crowd dataset, CROWD_SZ. The dataset was collected from video surveillance of the Jinji Lake Urban Life Square in Suzhou, Jiangsu Province, China. The Jinji Lake Urban Life Square is famous for its large musical fountain and attracts a large number of visitors while the fountain is running. Thus, it is a typical large-scale, high-density scene. As shown in Figure 10, one high view and four low views are selected to determine their overlapping regions. We chose the period from 18:00 to 21:00 on October 1, 2018, during which the crowd flow, gathering, and dispersal can be comprehensively observed, including before the fountain opens and after the fountain closes.
First, AMNet was used to obtain the discrete density map of the high-view image, as shown in Figure 10. The color components of the different density levels were extracted, and both the number of pixels of each density level in the whole image and the number of pixels of each density level inside the overlapping area were counted.
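Using 19:35:00 as an example, the process of estimating the number of people in the medium density area is as follows: the first column of Figure 11 is the color component diagram of the medium density level, and the pixel value $b_2$ is 169,124. The second column of Figure 11 lists two selected overlapping regions, and the pixel values $n_j$ are 2,592 and 1,224, respectively. Through the homography matrix, the regions are mapped to the corresponding regions of the low-view image, namely, the green mask part in the image. Through AMNet, the crowd counts $l_j$ are 42.6 and 18.2, respectively; $c$ is the scale factor, and $g$ is the crowd count in the current density level. Table 3 shows the specific calculation. The global number is obtained by adding up the counts of all density levels.
As shown in Table 4, the global count calculation at 20:05:00 is taken as an example to illustrate the global crowd calculation. The global number $G$ is obtained by summing the inferred crowd counts of all density levels. At 20:05:00, the global crowd count in this scene is 5,981.4331.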
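The proportional step can be illustrated with the medium-density figures quoted for 19:35:00 ($n_j$ = 2,592 and 1,224 pixels, $l_j$ = 42.6 and 18.2 people, $b_2$ = 169,124 pixels). The sketch assumes one plausible reading of the relationship: the scale factor $c$ is the pooled people-per-pixel ratio over the overlap regions, and the level count is $g = c \cdot b_t$. The pooling choice and the function names are assumptions, so the result is not claimed to reproduce Table 3.

```python
def level_count(overlap_pixels, overlap_counts, level_pixels):
    """Scale the low-view counts of the overlap regions up to a whole density level.

    Assumption of this sketch: c = sum(l_j) / sum(n_j), and g = c * b_t.
    """
    c = sum(overlap_counts) / sum(overlap_pixels)  # people per pixel in the overlap
    return c * level_pixels


def global_count(per_level_counts):
    """Sum the inferred counts of all density levels into the global number G."""
    return sum(per_level_counts)


# Medium-density level at 19:35:00, using the values quoted in the text.
g_medium = level_count([2592, 1224], [42.6, 18.2], 169124)
```

Repeating `level_count` for each density level and passing the results to `global_count` gives the global estimate under the same assumption.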
We selected the video clips from 19:28:40 to 20:45:00 to observe the changes in crowd density. During this period, the number of people was calculated every five minutes. As shown in Figure 12, the average error is about 258 people, and the estimation accuracy is up to 93.8%. From 19:28:40, the number of people continues to increase, and it remains relatively stable from 19:45:00 to 20:20:00. After 20:20:00, the crowd count continues to decrease. However, since the selected scene is an open space, manual labeling contains errors, so the ground truth here is approximate, although it is basically consistent with the actual changes.
In addition, we use several popular methods to count the number of people in this dataset, such as CSRNET, MCNN, and AMNet. The performance of these methods is poor in this scene, as shown in Figure 13. Although the crowd distribution can be captured, the calculated results are very different from the ground truth.
Because the high-altitude camera is far from the scene, the crowd appears only as pixels, and the effective characteristic information of the human body is not obvious, so low-altitude information is needed as a supplement. In addition, in the CROWD_SZ dataset, the musical fountain performance causes strong light and shadow changes, which lead to overexposure or insufficient light in the global image and the loss of more information; this loss is also compensated by the low-altitude information. For example, in the first three columns of Figure 13, neural network models such as MCNN are affected by the light and estimate the exposed area as a blank area. However, using AMNet on low-altitude images can effectively handle scenes with changing light and darkness and compensate globally for the lost information with more accurate low-altitude crowd counts, resulting in a more accurate global crowd count estimate. Thus, the HLIF structure effectively reduces the impact of light on the global crowd counting. In the last three columns, the light source is off and the overall lighting of the square is weak, resulting in unclear features in the area farthest from the camera and the worst recognition effect of CSRNET. In the fifth column, the estimated value and actual value of MCNN on the image are 2,187.50 and 5,471, respectively, with an error of nearly 3,300 people. After introducing the high-altitude perspective image information fusion module, the estimated result is 5,621.44, with the error reduced to about 150 people. Therefore, a more accurate number of people can be obtained through the similarity of density distribution.

Table 2 (partial):
Method                                   MAE     MSE
[43]                                     258.4   334.9
SAAN [44]                                271.6   391.0
PACNN [14]                               267.9   357.8
Onoro-Rubio and López-Sastre [45]        465.7   371.8
PGCNet [46]                              259.4   317.6
CSRNET [47]                              266.

Conclusion
Crowd counting in public places, especially large-scale high-density spaces, is a very challenging task for public security. The video surveillance system is a good tool for monitoring and managing crowds. The rich video information can provide a more accurate estimate of the number of people, which becomes an important foundation for security management decisions.
In this paper, we propose a crowd counting method based on high and low view information fusion, which effectively addresses the problem of dense crowd counting in large-view scenes. A crowd counting network based on an attention mechanism (AMNet) is established and used to obtain a discrete density map. The overlapping areas are defined by the registration mechanism between the high-altitude view and the low-altitude view in the scene. The global number of people is calculated by combining the density distribution information of the high-altitude view with the number of people in the low-altitude view in the overlapping areas.
Compared with traditional crowd counting algorithms, our proposed high-altitude and low-altitude information fusion algorithm (HLIF) can effectively use the low-altitude counting results to compensate and optimize the global crowd counting, adapt to the drastic changes of light at night, and improve the overall counting accuracy.

Figure 1: Dense crowd in a large view scene.
Registration. The images to be processed are images from the high view and the low view. The choice of perspective should follow these principles:
(i) The high-view range must include the area covered by the low-view range.
(ii) Low-view camera equipment should be able to capture a richer set of human features.
(iii) The selected low-view areas should include the different gathering forms present at the time.
(iv) High- and low-view images should be consistent in time.

4.1. Evaluation Metrics. In this paper, we use the Mean Absolute Error (MAE) and the Mean Squared Error (MSE) to evaluate the performance of the model on the test set, which are defined as follows:

Figure 7: Schematic diagram of overlap area of high-and low-view images.

Figure 9: (a) The overlap region in the high-view image. (b) Corresponding low-view images. (c) The result of fusion. (d) Ground marker line. (e) The result of fusion accuracy validation.

follows: the first column of Figure 11 is the color component diagram of the medium density level, and the pixel value $b_2$ is 169,124. The second column of Figure 11 lists two selected overlapping regions, and the pixel values $n_j$ are 2,592 and 1,224, respectively. Through the homography matrix, the regions are mapped to the corresponding regions of the low-view image, namely, the green mask part in the image.

Figure 11: The first column shows the two selected regions, the second column shows the discrete density maps generated by AMNet, and the third column shows the corresponding low-view images of these regions.

Figure 12: Test results of high and low view information fusion algorithm on CROWD_SZ.

recognition effect of CSRNET. In the fifth column, the estimated value and actual value of MCNN on the image are 2,187.50 and 5,471, respectively, with an error of nearly 3,300 people. After introducing the high-altitude perspective image information fusion module, the estimated result is 5,621.44, with the error reduced to about 150 people. Therefore, a more accurate number of people can be obtained through the similarity of density distribution.

Figure 13: The first row shows samples of the test set in the CROWD_SZ dataset. The second row shows the ground truth for each sample. The third, fourth, and fifth rows are density maps generated by CSRNET, MCNN, and AMNet, respectively. The sixth row is the test result of AMNet + HLIF.
1 high-view device, called A, and $m$ low-view devices, called $B_1, B_2, B_3, \ldots, B_m$.
(ii) Region registration: select $m$ rectangular areas (in the overlap areas) on device A as $L_1, L_2, L_3, \ldots, L_m$ and use the matrix $H$ to obtain the overlap area on device $B_i$ corresponding to each rectangular region.
(iii) Information processing: input the high-view image to get the discrete density map and calculate the pixel number $b_t$ of each density level, where $t$ represents the density level. The number of pixels at each density level in the region $L_m$ is denoted as $n_j$.

Table 1: Estimation errors on ShanghaiTech dataset. The bold value shows the best result in every column.
The bold value shows the crowd count in the medium density level.
The bold value shows the global crowd count.