This paper proposes an accurate crowd counting method based on a convolutional neural network and low-rank and sparse structure. To this end, we first propose an effective deep-fusion convolutional neural network to improve the accuracy of density map regression. Furthermore, we observe that most existing CNN based crowd counting methods obtain the overall counting by direct integral of the estimated density map, which limits the accuracy of counting. Instead of direct integral, we adopt a regression method based on a low-rank and sparse penalty to improve the accuracy of the projection from density map to global counting. Experiments demonstrate the importance of this regression process for crowd counting performance. The proposed low-rank and sparse based deep-fusion convolutional neural network
Recent years have witnessed extensively crowded scenes, such as concerts, political speeches, ceremonies, marathons, and tourist spots. The crowd counting problem, as a machine learning and computer vision problem, takes a single image or a surveillance frame as input and aims to estimate how many people are in it. It is of significant importance to public security and automatic surveillance [
To solve these problems and improve the accuracy of crowd counting, many methods have been proposed in the crowd counting literature. This paper is not the first to leverage a convolutional neural network (CNN) model for crowd counting. However, most CNN based crowd counting methods adopt a two-stage pipeline: crowd density estimation with an end-to-end deep network, followed by direct integral to obtain the global counting, which accumulates errors and limits counting accuracy. To address this problem, we propose a low-rank and sparse based deep-fusion convolutional neural network
In this paper, we aim to improve the accuracy of crowd counting from a single image. Motivated by the density map regression architecture and the feature-based global regression architecture, we propose the low-rank and sparse based deep-fusion convolutional neural network for crowd counting, which contains two key components: a deep-fusion network for density map regression and a subsequent regression method that maps the estimated density map to global counting. As the spatial information and the global counting of crowds are used by the two steps, respectively, the images are projected step by step, from surveillance frame to gray-scale density map and ultimately to global counting. The contributions of the proposed method can be summarized as follows.
To improve the accuracy of density map regression, inspired by the inception structure of GoogLeNet [
To improve the accuracy of global counting regression, which is the ultimate objective of crowd counting methods, we adopt least squares regression with a low-rank and sparse penalty to project the estimated density map to global counting, instead of the direct integral adopted by most existing CNN based crowd counting methods. The intuition is simple: the estimated density maps are coarse and contain many errors and ambiguities, so a subsequent regression model is needed to reduce these errors when obtaining the overall count. Inspired by low-rank and sparse learning, which builds on the assumption that signals contain a low-rank part and a sparse part, we impose a low-rank and sparse penalty on the estimated density map. We then solve the problem by transforming the penalty on the density map into a penalty on the regression parameters. Compared with other regression methods, such as Ridge Regression and LSSVM, our method can also be viewed as fine-tuning the estimated density map under the assumption that an accurate density map should contain low-rank structure and sparse structure.
Experiments on large-scale crowd counting datasets demonstrate that, to our knowledge,
Existing crowd counting methods can be divided into location-based methods and regression based methods.
The location-based methods are based on the foundation that a crowd is composed of single targets which can be detected and then counted. These methods attempt to locate every single person by detector scanning [
Another popular crowd counting pipeline, the regression based method, treats the whole crowd instead of a single person as the target and avoids the challenging task of detecting each individual. These methods, which are more suitable for extensively crowded scenes, can also be divided into two categories, global counting regression methods [
The global feature-based regression pipeline usually contains three successive steps: (1) foreground segmentation, (2) feature extraction, and (3) crowd counting regression. Pixel features [
Compared with global feature-based regression methods, density map regression methods, proposed by [
In recent years, with the prosperity of convolutional neural networks (CNN) in image classification [
The CNN based counting regression methods, such as
Although many crowd counting methods have been proposed, as shown above, most existing methods, including the CNN based ones, either regress the global counting without the spatial information or use direct integral of the estimated density maps to obtain the global counting without using the global counting as supervision. By adopting a CNN based density map estimation architecture together with a learning process that projects the density map to the overall count, our model differs from existing methods and benefits from both the spatial information and the global information of crowds.
Low-rank and sparse structures have been studied extensively in matrix completion, compressed sensing, and dimensionality reduction. Principal Component Analysis (PCA) [
A variant of PCA [
Furthermore, Go Decomposition (GoDec) [
Chalapathy et al. [
In this section, we illustrate the proposed deep-fusion network structure for crowd density map regression. The goal of the deep-fusion network is to learn a density map regression function
The first step is to calculate the density maps that serve as the training ground truth of the network. With the position of each pedestrian labeled, the true density map is determined by each pedestrian's location, shape, and perspective distortion. Due to severe occlusions, pedestrians' bodies overlap with each other, and the head is the main cue for judging whether a pedestrian is present in extensively crowded images. So our work follows [
The density map of the whole image is calculated as a sum of Gaussian kernels of all the pedestrians as in
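The sum-of-Gaussians construction can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `density_map`, the fixed bandwidth `sigma`, and the head coordinates are hypothetical, and the kernel settings actually used in the paper are not reproduced here. Each kernel is normalized on the pixel grid, so summing the whole map (the direct integral) recovers the number of annotated heads.

```python
import numpy as np

def density_map(shape, heads, sigma=4.0):
    """Sum one normalized 2-D Gaussian per annotated head position.

    `sigma` is an assumed fixed bandwidth, not a value taken from the paper.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    dmap = np.zeros(shape, dtype=np.float64)
    for cy, cx in heads:
        g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
        dmap += g / g.sum()  # normalize on the grid so each head contributes 1
    return dmap

heads = [(20, 30), (22, 34), (50, 10)]  # hypothetical (row, col) head annotations
dmap = density_map((64, 64), heads)
count = dmap.sum()                      # direct integral of the density map
```

Because every kernel integrates to one, `count` equals `len(heads)` up to floating-point error; with an estimated (rather than ground-truth) map this equality no longer holds, which is the source of the accumulated error discussed later.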
Multiscale is a significant problem of almost all current computer vision tasks, especially in crowd counting problems owing to the perspective distortion of surveillance cameras [
Network structure of GoogLeNet.
Deep-fusion network structure.
The fusion unit, constructed from three parallel base networks, is the basic cell of the proposed deep-fusion network. At the end of each fusion unit, the feature maps of the base networks are concatenated to obtain the fused representation, which serves as the input of the next fusion unit.
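The data flow of a fusion unit can be illustrated with a small NumPy sketch. It is not the paper's network: the stand-in "base network" `box_filter` (a mean filter), the kernel sizes `(3, 5, 7)`, and the function names are all assumptions chosen only to show three parallel branches whose outputs are concatenated along the channel axis and fed to the next unit.

```python
import numpy as np

def box_filter(x, k):
    """Stand-in base network: a k x k mean filter per channel (k odd, same size output)."""
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.zeros(x.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += xp[:, dy:dy + x.shape[1], dx:dx + x.shape[2]]
    return out / (k * k)

def fusion_unit(x, kernel_sizes=(3, 5, 7)):
    """Run three parallel branches and concatenate their feature maps along channels."""
    branches = [box_filter(x, k) for k in kernel_sizes]
    return np.concatenate(branches, axis=0)

x = np.random.default_rng(0).random((1, 32, 32))  # (channels, height, width)
y1 = fusion_unit(x)   # output of the first fusion unit
y2 = fusion_unit(y1)  # fused output is the input of the next fusion unit
```

In this toy version the channel count triples at every unit (1, 3, 9); a trained network would control the number of channels through its convolution layers.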
To further illustrate the scale-invariance property of this deep-fusion network structure, we use
Through analyzing the information flow path, we can figure out that the output representation of a fusion unit comes from the three base networks and can also flow to the next three base networks of the next fusion unit. Specifically, in our network structure, the three base networks in the first fusion unit bring about three information flow paths of the whole network,
As heads are small instances, the receptive field of the network should be considerably small to match them, and detailed information rather than semantic information should be used, in contrast to conventional networks for classification and detection tasks. Because the receptive field grows and detailed features are projected to semantic features layer by layer as the network deepens, the proposed deep-fusion network is rather shallow compared with ResNet and GoogLeNet. The configuration of our network layers is shown in Table
Configuration of the conv layers in deep-fusion network.
Layer | Configuration |
---|---|
Conv 1_1 | Filter |
Conv 1_2 | Filter |
Conv 1_3 | Filter |
|
|
Conv 2_1 | Filter |
Conv 2_2 | Filter |
Conv 2_3 | Filter |
|
|
Conv 3_1 | Filter |
Conv 3_2 | Filter |
Conv 3_3 | Filter |
Conv 4_1 | Filter |
Conv 4_2 | Filter |
Conv 4_3 | Filter |
|
|
Conv_all | Filter |
What is more, rectified linear unit (ReLU) [
In this section, we describe the motivation, problem definition, and solution of the second component of our method, the low-rank and sparse regression from density map to overall counting, and comment on its properties.
The output of the density map regression network is a two-dimensional crowd density map, while the ultimate objective of crowd counting methods is the number of people. To our knowledge, most existing CNN based methods simply sum up all values of the estimated density map as the overall counting, which accumulates the errors in the estimated density map without using the global counting information of the image. To solve this problem, in this paper, we propose for the first time to use a subsequent regression method to project the estimated density map to the final crowd counting. The inspirations are listed below.
Inspired by the feature-based methods, which use hand-crafted features and regression methods to obtain the overall crowd count, it is intuitive to treat the feature maps of convolutional networks as the extracted features of a subsequent counting regression model that projects the estimated density map to the overall crowd counting.
In comparison with the density maps calculated in (
Calculated density map (a) and estimated density map (b) of 132nd test image in Part A dataset.
On the one hand, the global counting regression problem is essentially the projection from a high dimensional feature to a number; we can formulate it with the commonly used regression function in
Thus,
On the other hand, in the context of density map modification, the error of the overall counting is caused by the errors of the coarse estimated density map. Therefore, the global regression problem can also be viewed as eliminating the errors and constructing a more accurate density map with low-rank and sparse structure. Consequently,
In Section
As the density map is a two-dimensional signal and should contain low-rank and sparse structure, inspired by GoDec, we add a low-rank and sparse penalty on the modified density map to eliminate the errors.
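For intuition about the kind of structure GoDec targets, the sketch below splits a matrix into a low-rank part and a sparse part by simple alternation. It is a naive illustration, not the paper's method and not the original GoDec implementation: it uses a plain truncated SVD where GoDec uses bilateral random projections, and the function name `lowrank_sparse_split`, the parameters `rank` and `card`, and the synthetic input are assumptions.

```python
import numpy as np

def lowrank_sparse_split(X, rank, card, n_iter=20):
    """Naive GoDec-style alternation: X ~ L + S, rank(L) <= rank, nnz(S) <= card."""
    S = np.zeros_like(X)
    L = np.zeros_like(X)
    for _ in range(n_iter):
        # Low-rank step: best rank-`rank` approximation of X - S (truncated SVD).
        U, s, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Sparse step: keep only the `card` largest-magnitude entries of X - L.
        R = X - L
        S = np.zeros_like(X)
        if card > 0:
            idx = np.argsort(np.abs(R), axis=None)[-card:]
            S.flat[idx] = R.flat[idx]
    return L, S

rng = np.random.default_rng(0)
L0 = rng.random((40, 2)) @ rng.random((2, 40))       # synthetic rank-2 part
S0 = np.zeros((40, 40))
S0.flat[rng.choice(1600, size=20, replace=False)] = 5.0  # synthetic sparse spikes
X = L0 + S0
L, S = lowrank_sparse_split(X, rank=2, card=20)
```

Each step minimizes the Frobenius residual over one block with the other held fixed, so the residual `X - L - S` does not increase from one iteration to the next.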
To solve this problem, as the estimated density map
Nevertheless, GoDec is mainly a signal decomposition method used for matrix completion and background modeling by capturing the main part of the signal without supervision, as shown in (
The problem defined in Section
To solve this problem, we transform the problem mathematically to
Note that we can solve this optimization problem with [
This low-rank and sparse based regression method is essentially a regression projection from feature to number, with the low-rank and sparse penalty imposed on the modified density map. To solve the problem, we transform the penalty on the density map into a penalty on the regression parameter matrix.
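To make the idea of a learned projection from density map to count concrete, the following sketch contrasts the direct integral with a per-pixel weighted sum fitted by ridge regression. The ridge (L2) penalty is only a stand-in chosen because it has a closed form; it is not the low-rank and sparse penalty of the proposed method, which is not reproduced here. The function names, the value of `lam`, and the synthetic data (true maps plus a constant background bias) are all assumptions.

```python
import numpy as np

def direct_integral(dmaps):
    """Baseline: sum every pixel, i.e. a weighted sum with all weights equal to 1."""
    return dmaps.reshape(len(dmaps), -1).sum(axis=1)

def fit_ridge(dmaps, counts, lam=1e-3):
    """Learn per-pixel weights w so that count ~ <w, density map> (L2 penalty only)."""
    A = dmaps.reshape(len(dmaps), -1)       # one flattened density map per row
    K = A @ A.T + lam * np.eye(len(A))      # dual form: cheap when images << pixels
    return A.T @ np.linalg.solve(K, counts)

def predict(dmaps, w):
    """Apply the learned weights: one weighted sum per density map."""
    return dmaps.reshape(len(dmaps), -1) @ w

rng = np.random.default_rng(0)
true_maps = rng.random((30, 16, 16)) * 0.05   # synthetic ground-truth density maps
counts = true_maps.sum(axis=(1, 2))           # their true counts
est_maps = true_maps + 0.01                   # synthetic estimates with background bias

err_direct = np.abs(direct_integral(est_maps) - counts).mean()
w = fit_ridge(est_maps, counts)
err_ridge = np.abs(predict(est_maps, w) - counts).mean()  # training error only
```

On this toy data the direct integral inherits the full background bias, while the fitted weights absorb it on the training images. The comparison is made on the training set only and says nothing about generalization.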
While our model is not a popular end-to-end architecture, the following process of low-rank and sparse learning can also be viewed as another pairwise product layer, whose weight is trained with low-rank and sparse penalty. Zhang et al. [
Experiments show that our model can improve the accuracy and robustness of existing crowd counting methods. Implementation of the proposed model is based on the Caffe framework developed by [
Following the existing works [
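The three metrics reported in the tables (MAE, MSE, MRE) can be computed as in the sketch below. The definitions are assumptions, since the formulas are not given in this text: MAE is taken as the mean absolute error, MSE as the root of the mean squared error (the usual convention in crowd counting work), and MRE as the mean relative error in percent. The function name and the sample numbers are hypothetical.

```python
import numpy as np

def counting_errors(pred, gt):
    """Return (MAE, MSE, MRE) under the assumed definitions described above."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    err = pred - gt
    mae = np.mean(np.abs(err))               # mean absolute error
    mse = np.sqrt(np.mean(err ** 2))         # root of the mean squared error
    mre = 100.0 * np.mean(np.abs(err) / gt)  # mean relative error, in percent
    return mae, mse, mre

# Hypothetical predicted and ground-truth counts for three images.
mae, mse, mre = counting_errors([110, 90, 205], [100, 100, 200])
```

MRE divides by the ground-truth count, so it assumes every evaluated image contains at least one person.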
We evaluate our
Shanghaitech is a large-scale crowd counting dataset released in 2016 [
WorldExpo10 dataset [
Following the data augmentation method of [
Location-based method, including
Feature-based regression methods, including
CNN based density map regression methods, including
In Tables
Performance comparison of crowd counting methods for Shanghaitech Part A dataset.
| Network | MAE | MSE | MRE |
|---|---|---|---|
| ACF [ | 390.5 | 526.8 | 84.7 |
| LBP + RR | 303.2 | 371.0 | 70.4 |
| LBP + LSSVM | 224.5 | 294.6 | 83.3 |
| | 179.7 | 252.9 | 67.7 |
| | 110.2 | 173.2 | 37.9 |
| | 118.4 | 171.1 | 39.4 |
| | 110.1 | 170.1 | 37.0 |
| | 115.8 | 167.9 | 38.5 |
| | 99.1 | 145.3 | 30.0 |
| | 95.7 | 143.2 | 26.1 |
| Deep-fusion network | 97.9 | 145.1 | 29.5 |
| Fusion + RR | 126.9 | 187.8 | 31.7 |
| Fusion + LSSVM | 99.3 | 145.2 | 29.1 |
| | | | |
Performance comparison of crowd counting methods for Shanghaitech Part B dataset.
| Network | MAE | MSE | MRE |
|---|---|---|---|
| ACF [ | 69.7 | 108.0 | 70.4 |
| LBP + RR | 59.1 | 81.7 | 69.2 |
| LBP + LSSVM | 48.3 | 67.8 | 57.6 |
| | 32.0 | 49.8 | 37.6 |
| | 26.4 | 41.3 | 24.2 |
| | 26.1 | 37.7 | 25.9 |
| | 20.3 | 31.0 | 22.6 |
| | 21.7 | 32.4 | 20.26 |
| | 19.8 | 33.1 | 18.1 |
| | 17.1 | 26.3 | 15.5 |
| Deep-fusion network | 17.3 | 28.9 | 16.4 |
| Fusion + RR | 20.5 | 28.7 | 18.9 |
| Fusion + LSSVM | 17.6 | 30.1 | 15.3 |
| | | | |
Performance comparison of crowd counting methods for the WorldExpo10 dataset.
| Network | MAE | MSE | MRE |
|---|---|---|---|
| ACF [ | 41.79 | 52.36 | 79.56 |
| LBP + RR | 31.01 | 44.53 | 80.97 |
| LBP + LSSVM | 28.86 | 42.79 | 74.69 |
| Gabor + LSSVM | 33.61 | 46.69 | 84.53 |
| | 12.90 | 9.62 | 40.96 |
| | 11.60 | 16.78 | 36.50 |
| | 12.56 | 17.75 | 35.21 |
| | 10.56 | 14.86 | 30.76 |
| | 13.18 | 18.76 | 36.38 |
| | 13.93 | 19.70 | 41.71 |
| | 8.76 | 11.83 | 25.25 |
| Deep-fusion network | 10.48 | 15.04 | 28.99 |
| Fusion + RR | 30.43 | 41.17 | 152.28 |
| Fusion + LSSVM | 13.81 | 16.60 | 67.59 |
| | | | |
Among these methods, location-based method,
Clearly, experimental results demonstrate that the proposed
One interesting observation is that the performance of the deep-fusion network with Ridge Regression degrades considerably compared with the direct integral, which might be caused by the fact that the
In Figure
Compared with the ground truth density maps, the estimated ones calculated by density regression networks are coarser and contain many more small nonzero errors in background areas. Among the estimated density maps, the modified ones calculated by our proposed
To illustrate the density map regression performance with respect to network structure, we construct
Performance of various network structures on Shanghaitech Part A dataset.
| Network | MAE | MSE | MRE |
|---|---|---|---|
| One-column network | 179.7 | 252.9 | 67.7 |
| Three-column network | 110.2 | 173.2 | 37.9 |
| Alexnet_density | 125.1 | 185.4 | 41.6 |
| VGGnet_density | 128.5 | 189.6 | 43.5 |
| Deep-fusion | | | |
The crowd counting performance of various network structures shown in Table
To compare the robustness of the methods and analyze the scale-invariant capability of the CNN based methods in more detail, we need to know their performance on pedestrians of diverse sizes. As the size of a pedestrian is generally inversely related to the crowd density in extensively crowded images, we divide the two datasets into subgroups according to the number of pedestrians in the images. The performance of CNN based methods is presented in Figure
Density map regression performance of methods.
The
MRE of the CNN based methods in the WorldExpo10 dataset evaluated in different subgroups.
In this paper, we have proposed an accurate crowd counting method, called low-rank and sparse based deep-fusion convolutional neural network
The paper matches the formatting instructions of IJCAI-07.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work is supported by the National Natural Science Foundation of China (61473149). The support of IJCAI, Inc., is acknowledged.