Low-Rank and Sparse Based Deep-Fusion Convolutional Neural Network for Crowd Counting

This paper proposes an accurate crowd counting method based on a convolutional neural network and low-rank and sparse structure. To this end, we first propose an effective deep-fusion convolutional neural network to improve the accuracy of density map regression. Furthermore, we observe that most existing CNN based crowd counting methods obtain the overall counting by direct integral of the estimated density map, which limits the accuracy of counting. Instead of the direct integral, we adopt a regression method based on a low-rank and sparse penalty to improve the accuracy of the projection from density map to global counting. Experiments demonstrate the importance of this regression process for crowd counting performance. The proposed low-rank and sparse based deep-fusion convolutional neural network (LFCNN) outperforms existing crowd counting methods and achieves state-of-the-art performance.


Introduction
Recent years have witnessed extensively crowded scenes, such as concerts, political speeches, ceremonies, marathons, and tourist spots. The crowd counting problem, as a machine learning and computer vision problem, takes a single image or a surveillance frame as input and aims to estimate how many people are in it. It is of significant importance to public security and automatic surveillance [1]. Though tremendous strides have been made in crowd counting, it remains a challenge due to severe occlusion, varying perspective distortion, and diverse crowd densities.
To solve these problems and improve the accuracy of crowd counting, many methods have been proposed in the crowd counting literature. This paper is not the first to leverage a convolutional neural network (CNN) model for this purpose. However, most CNN based crowd counting methods adopt a two-stage pipeline: crowd density estimation with an end-to-end deep network, followed by direct integral to obtain the global counting, which accumulates errors and limits counting accuracy. To solve this problem, we propose a low-rank and sparse based deep-fusion convolutional neural network (LFCNN), which adopts a low-rank and sparse penalty based regression process instead of the direct integral.
1.1. Contributions. In this paper, we aim to improve the accuracy of crowd counting from a single image. Motivated by the density map regression architecture and the feature-based global regression architecture, we propose the low-rank and sparse based deep-fusion convolutional neural network for crowd counting, which contains two key components: a deep-fusion network for density map regression and a subsequent regression method that maps the estimated density map to the global counting. As the spatial information and the global counting of crowds are used by the two steps, respectively, the images are projected step by step, from surveillance frame to gray-scale density map and ultimately to global counting. The contributions of the proposed method can be summarized as follows.
To improve the accuracy of density map regression, inspired by the inception structure of GoogLeNet [2], we propose a deep-fusion network structure to capture multiscale targets in crowded images. In each inception unit of our deep-fusion network, to achieve robustness to variation in people's size, conv layers with filters of various sizes and numbers are utilized as base networks. At the end of each inception,

Related Works
2.1. Crowd Counting. Existing crowd counting methods can be divided into location-based methods and regression based methods.
The location-based methods are founded on the idea that a crowd is composed of single targets which can be detected and then counted. These methods attempt to locate every single person by detector scanning [3,4], tracking, and trajectory clustering [5] before obtaining the counting result. However, in extensively crowded scenes, a single person is prone to overlap with another and can hardly be detected precisely, which leads to relatively severe errors in the counting result.
Another popular crowd counting pipeline, the regression based method, treats the whole crowd instead of a single person as the target and avoids the challenging task of detecting individual persons. These methods, more suitable for extensively crowded scenes, can be further divided into two categories: global counting regression methods [6][7][8] and density map regression methods [9][10][11][12][13][14].
The global feature-based regression pipeline usually contains three successive steps: (1) foreground segmentation, (2) feature extraction, and (3) crowd counting regression. Pixel features [6], texture features [7], and integrated features [8,15] were utilized, and regression models, such as Gaussian Process [8], Ridge Regression [16], and neural network and random forest [15], were adopted to achieve better performance. Despite the effectiveness of these methods, using only the global counting as the supervision signal, without the spatial information of the crowds, largely limited their accuracy.
Compared with global feature-based regression methods, density map regression methods, proposed by [9], further improved the accuracy of crowd counting by utilizing the crowd's spatial information contained in the density map, which is calculated from the position of each person and denotes the crowd density of a local area. Following this pipeline, [10,17] improved the counting accuracy by using a modified random forest algorithm as the regression model of the density map.
In recent years, with the success of convolutional neural networks (CNN) in image classification [18,19], detection [20,21], segmentation [22], and pedestrian detection [23], the CNN model has also been leveraged by crowd counting methods. Zhang et al. [11] first proposed a CNN based density map regression method, called Patch-CNN, and demonstrated significant improvement over methods based on handcrafted features. Based on this pipeline, Zhang et al. [12] adopted three networks with various kernel sizes to construct MCNN, which was more adaptive to variations in person size. Inspired by combining high-level semantic information and low-level detailed features, Boominathan et al. [13] combined deep and shallow networks to construct Longshort CNN as the density map regression network. Aiming to solve the multiscale problem of person size, Onoro-Rubio and López-Sastre [14] proposed Hydra-CNN, using a pyramid of image patches of multiple scales to train multiple networks and benefitting from the integration of multiple models. Another attempt to improve counting accuracy by model ensemble is Boost-CNN [24], which applied boosting to the density map regression CNN model. To sum up, the success of these methods can be attributed to two reasons: the automatic learning ability of end-to-end density map regression networks and the use of spatial information in density maps. These methods attempted to improve the accuracy of density map regression by adopting more and more complicated network structures. However, as the global counting instead of the density map is the objective of counting methods, these methods are not exactly trained end-to-end for global counting regression, and there is always a gap between the output of the network and the objective of the counting problem. The direct integral, adopted by most existing methods to project from density maps to global counting, accumulates the error in the estimated density maps and limits counting accuracy.
The CNN based counting regression methods, such as Patch-count CNN [25], Patch-multitask CNN [26], and TSCCM [27], applied fully connected layers to directly regress the number of persons in image patches. Though these methods constructed an end-to-end network for the counting task, a surveillance frame needs to be cut into a large number of patches, each counted by the network, which is fairly time-consuming. Moreover, without using the spatial information in density maps, the accuracy of these methods is severely limited.
Though many crowd counting methods have been proposed as shown above, most existing methods, including the CNN based ones, either regress the global counting without the spatial information or obtain the global counting by direct integral of estimated density maps without using the global counting as supervision. By adopting a CNN based density map estimation architecture and a learning process that projects the density map to the overall count, our model differs from the existing methods and benefits from both the spatial information and the global information of crowds.

Low-Rank and Sparse Structure.
Low-rank and sparse structures have been extensively studied in matrix completion, compressed sensing, and dimensionality reduction. Principal Component Analysis (PCA) [28] is based on the assumption that signals usually have low intrinsic complexity, are low-rank, or lie on some low-dimensional manifold. It applies a linear projection to seek such a low-rank representation by minimizing the error between the signal and the low-rank representation under the decomposition $\mathbf{X} = \mathbf{L} + \mathbf{E}$, where $\mathbf{E}$ denotes the noise and $\mathbf{L}$ is the low-rank part of the signal $\mathbf{X}$.
A variant of PCA [28], known as robust PCA (RPCA) [29,30], is built upon the theory that the signal matrix has a low-rank structure and the noise is sparsely distributed, affecting only a fraction of the signal matrix entries.
Furthermore, Go Decomposition (GoDec) [31] proposed a low-rank + sparse decomposition of a signal. Chalapathy et al. [32] applied a deep neural network to construct a robust nonlinear subspace that captures the majority of data points and detects anomalous instances, while allowing some data to have arbitrary corruption. As the deep network extends the robust PCA model to the nonlinear autoencoder setting, the nonlinearity helps discover potentially more subtle anomalies, which improves the robustness of the model.
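The low-rank + sparse idea can be illustrated with a small numerical sketch. The code below is not the GoDec algorithm of [31] itself (which relies on randomized projections); it is a naive alternating scheme with an exact SVD, written under the assumption that the target rank and the number of sparse entries are known. The function name `low_rank_sparse_split` and its parameters `rank` and `card` (assumed to be at least 1) are hypothetical.

```python
import numpy as np

def low_rank_sparse_split(X, rank, card, n_iter=20):
    """Naive alternating split X ~ L + S with rank(L) <= rank and nnz(S) <= card."""
    S = np.zeros_like(X)
    for _ in range(n_iter):
        # Low-rank step: best rank-`rank` approximation of X - S.
        U, s, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Sparse step: keep the `card` largest-magnitude residual entries.
        R = X - L
        S = np.zeros_like(X)
        keep = np.argpartition(np.abs(R).ravel(), -card)[-card:]
        S.flat[keep] = R.flat[keep]
    return L, S

# Synthetic signal: a rank-2 matrix corrupted by 15 large sparse spikes.
rng = np.random.default_rng(0)
low = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
spikes = np.zeros((30, 20))
spikes.flat[rng.choice(600, size=15, replace=False)] = 10.0
X = low + spikes

L, S = low_rank_sparse_split(X, rank=2, card=15)
residual = np.linalg.norm(X - L - S)
```

Each step minimizes the residual over one of the two parts while the other is fixed, so the residual does not increase from one iteration to the next.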

Notation and Problem Definition
Notation. We use boldface lowercase letters like $\mathbf{z}$ to denote vectors. Boldface uppercase letters like $\mathbf{Z}$ are used to denote matrices. $\|\cdot\|_1$ denotes the $\ell_1$ norm, and $\|\cdot\|_*$ denotes the trace norm. $\odot$ denotes the Hadamard product.
Problem Definition. Suppose that we have $N$ images, denoted by $\mathcal{I} = \{(\mathbf{I}_i, h_i)\}_{i=1}^{N}$, where $\mathbf{I}_i$ is the $i$th image and $h_i$ is the number of people in this image. The persons in this image are denoted by $P_i = \{p_j\}_{j=1}^{h_i}$ and the location of person $p_j$ is $(x_j, y_j)$. The density map is denoted by $\mathbf{D}$. More specifically, $\mathbf{D}_i$ denotes the density map calculated as the ground truth of the density map regression network, $\mathbf{D}_i^{e}$ denotes the density map estimated by the CNN, and $\mathbf{D}_i^{m}$ denotes the density map modified by the low-rank and sparse regression method.

Deep-Fusion Density Map Regression Network
In this section, we illustrate the proposed deep-fusion network structure for crowd density map regression. The goal of the deep-fusion network is to learn a density map regression function $F: \mathbf{I} \to \mathbf{D}$, where $\mathbf{I}$ is the surveillance image of an arbitrary scene and $\mathbf{D}$ is its crowd density map. We first illustrate how the supervision signal $\mathbf{D}$ is calculated, and then explain the deep-fusion network structure in some detail.

Density Map Calculation.
The first step is to calculate the density maps used as the training ground truth of the network. With the position of each pedestrian labeled, the true density map is determined by the pedestrians' locations, shapes, and the perspective distortion. Due to severe occlusions, pedestrians' bodies overlap with each other, and the head is the main cue for judging whether a pedestrian exists in extensively crowded images. So our work follows [9] and adopts a Gaussian kernel centered on the location of each pedestrian's head to denote that pedestrian in the calculated density map, as in (4), where $p_j$ is the $j$th pedestrian and $(x_j, y_j)$ is its location. Ideally, the parameter of the Gaussian kernel should correlate with the size of each head, which is influenced by the height and angle of the surveillance camera according to perspective distortion theory. Most scene-specific methods [8] obtain the parameter by measuring the perspective distortion parameter of each scene as prior knowledge of the crowd counting model. However, for arbitrary scenes, measuring every single image to get its perspective distortion parameter is much too time-consuming and almost impossible. In our model, we define a global constant parameter for all training images based on the average size of all the heads in the datasets.
The density map of the whole image is calculated as a sum of the Gaussian kernels of all the pedestrians, as in

$$\mathbf{D}_i(x, y) = \sum_{j=1}^{h_i} \delta(x - x_j, y - y_j) * G_{\sigma}(x, y), \tag{5}$$

where $\mathbf{D}_i$ is the calculated density map of $\mathbf{I}_i$, $\delta(x - x_j, y - y_j)$ stands for the impulse function, $G_{\sigma}$ is the Gaussian kernel with parameter $\sigma$, and $*$ denotes convolution.
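The ground-truth calculation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `density_map`, the value `sigma=4.0`, and the head coordinates are hypothetical, and each kernel is normalized on the discrete pixel grid (an assumption) so that every annotated head contributes exactly one to the integral of the map.

```python
import numpy as np

def density_map(shape, heads, sigma=4.0):
    """Sum one normalized 2D Gaussian per annotated head position (x, y)."""
    height, width = shape
    ys, xs = np.mgrid[0:height, 0:width]
    D = np.zeros(shape, dtype=float)
    for hx, hy in heads:
        g = np.exp(-((xs - hx) ** 2 + (ys - hy) ** 2) / (2.0 * sigma ** 2))
        D += g / g.sum()  # discrete normalization: each head adds a mass of 1
    return D

# Three hypothetical head annotations in a 64 x 64 image.
heads = [(20, 30), (22, 31), (50, 10)]
D = density_map((64, 64), heads, sigma=4.0)
total = D.sum()  # equals the number of annotated heads up to rounding
```

With a single global `sigma`, as in the text, nearby heads produce overlapping kernels, but the integral of the map still equals the number of annotations.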

Deep-Fusion Network.
Multiscale is a significant problem in almost all current computer vision tasks, especially in crowd counting owing to the perspective distortion of surveillance cameras [12]. Motivated by GoogLeNet [2], as shown in Figure 1, which integrates several parallel conv layers with various receptive fields in an inception unit, we propose to use a deep-fusion network to handle the scale variation problem. The overall structure of our deep-fusion network is illustrated in Figure 2. The fusion unit, constructed from three parallel base networks, is the basic cell of the proposed deep-fusion network. At the end of each fusion unit, the feature maps of the base networks are concatenated to obtain the fused representation, which serves as the input of the next fusion unit.
To further illustrate the scale-invariance property of this deep-fusion network structure, we use $f_k^j$ to denote the nonlinear function of the $j$th base network in the $k$th fusion unit and $\mathbf{R}_k$ to denote the output representation of the $k$th fusion unit; then we have

$$\mathbf{R}_k = g_k\big(f_k^1(\mathbf{R}_{k-1}), f_k^2(\mathbf{R}_{k-1}), f_k^3(\mathbf{R}_{k-1})\big), \tag{6}$$

where $g_k$ denotes the fusing function of the $k$th fusion unit, and we adopt concatenation as the fusing function in this paper. By analyzing the information flow path, we can see that the output representation of a fusion unit comes from its three base networks and can flow to each of the three base networks of the next fusion unit. Specifically, in our network structure, the three base networks in the first fusion unit bring about three information flow paths through the whole network. That is, with three fusion units, our deep-fusion network actually contains $3^3 = 27$ information flow paths, and each path corresponds to a latent network with a specific receptive field, which can capture targets within a specific range of sizes. As a result, the fused network integrates 27 latent networks and is able to capture heads of various sizes.
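The path-counting argument can be checked with a short enumeration. The kernel sizes below are hypothetical stand-ins (the actual layer configuration is the one in Table 1), and the sketch assumes one stride-1 convolution per base network and ignores pooling, so the receptive field values are illustrative only.

```python
from itertools import product

# Hypothetical kernel sizes of the three parallel base networks in a fusion unit.
KERNELS = (3, 5, 7)
N_UNITS = 3

def receptive_field(path):
    """Receptive field of stacked stride-1 convolutions with the given kernel sizes."""
    return 1 + sum(k - 1 for k in path)

# One path = one choice of base network in each of the three fusion units.
paths = list(product(KERNELS, repeat=N_UNITS))
fields = sorted({receptive_field(p) for p in paths})

print(len(paths))  # 27
print(fields)      # [7, 9, 11, 13, 15, 17, 19]
```

Under these assumptions the 27 paths cover seven distinct receptive field sizes, which is the sense in which the fused network can respond to heads of different scales.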
As the heads are small instances, the receptive field of the network should be considerably small to match them, and more detailed information rather than semantic information should be used, compared with conventional networks for classification and detection tasks. Because the receptive field is enlarged and detailed features are projected to semantic features layer by layer as the network grows deeper, the proposed deep-fusion network is rather shallow compared with ResNet and GoogLeNet. The configuration of our network layers is shown in Table 1.
In addition, the rectified linear unit (ReLU) [33] is adopted as the activation function and max pooling is used. A fully convolutional layer, instead of a fully connected layer, is used to estimate the density map from the concatenated feature maps, which not only largely reduces the number of parameters but also allows the input image to be of arbitrary size. Finally, the Euclidean loss is defined as the loss function of our deep-fusion network, as illustrated in (7).
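A minimal sketch of a Euclidean loss between estimated and ground-truth density maps is given below. The $1/(2N)$ scaling is an assumption that follows the convention of the Euclidean loss layer in Caffe, the framework used in the experiments; the function name `euclidean_loss` is hypothetical.

```python
import numpy as np

def euclidean_loss(estimated, target):
    """Sum of squared pixelwise differences, divided by 2N for a batch of N maps."""
    estimated = np.asarray(estimated, dtype=float)
    target = np.asarray(target, dtype=float)
    n = estimated.shape[0]  # batch size N (first axis)
    return np.sum((estimated - target) ** 2) / (2.0 * n)

# Two 2 x 2 maps: every pixel is off by 1, so the loss is 8 / (2 * 2) = 2.
est = np.zeros((2, 2, 2))
tgt = np.ones((2, 2, 2))
loss = euclidean_loss(est, tgt)
```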

Low-Rank and Sparse Based Regression
In this section, we further illustrate the motivation, problem definition, solution, and some comments on the second component of our method: the low-rank and sparse regression from density map to overall counting.

Motivations.
The output of the density map regression network is the two-dimensional crowd density map, while the ultimate objective of crowd counting methods is the number of persons. To our knowledge, most existing CNN based methods simply sum up all values of the estimated density map as the overall counting, which accumulates the error in the estimated density map without using the global counting information of the image. To solve this problem, in this paper, we propose for the first time to use a subsequent regression method to project the estimated density map to the final crowd counting. The inspirations are listed below.
Inspired by the feature-based methods, which use handcrafted features and regression methods to obtain the overall crowd count, it is intuitive to employ the feature maps of conv networks as the extracted features of a subsequent counting regression model that projects the estimated density map to the overall crowd counting.
In comparison with the density maps calculated in (5), the estimated density maps are coarser and noisier, containing many more small nonzero values in background areas, as shown in Figure 3. The cause of the small nonzero values is that background objects, such as buildings, sky, and trees, are mistaken for human targets by the density map regression network. Such noise and estimation errors are accumulated by the direct integral in most existing CNN based methods, which limits the accuracy of counting. To eliminate such errors, enlightened by GoDec [31], we apply the low-rank and sparse penalty to the estimated density map while regressing the global counting. In other words, this regression can also be regarded as a modification process that constructs more accurate density maps with low-rank and sparse structure from the density maps estimated by the CNN.
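The way small background values accumulate under direct integral can be illustrated numerically. All values below are hypothetical: a 480 x 640 map with three people, a constant noise level of 0.001 per pixel, and an idealized binary weight matrix built from the ground truth, which is used here only to show the effect of suppressing the background.

```python
import numpy as np

# Hypothetical 480 x 640 density map: 3 people spread over a 10 x 10 head region.
clean = np.zeros((480, 640))
clean[100:110, 200:210] = 3.0 / 100.0

# Assumed estimation noise: a small positive value at every pixel.
noisy = clean + 1e-3

direct_count = noisy.sum()            # direct integral of the noisy map
mask = (clean > 0).astype(float)      # idealized sparse weight matrix
masked_count = (noisy * mask).sum()   # integral after suppressing the background
```

In this toy setting the direct integral reports about 310 people instead of 3, whereas the masked integral is about 3.1, although the per-pixel noise is the same in both cases.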

Definition of Counting Regression.
On the one hand, the global counting regression problem is essentially the projection from a high-dimensional feature to a number; we can formulate it with the commonly used regression function in (8), where $\mathbf{h} = [h_1, \ldots, h_i, \ldots, h_N]^{T}$ and $h_i$ is the number of labeled pedestrians in image $\mathbf{I}_i$. We also represent the $m \times n$ gray-scale density map $\mathbf{D}_i^{e}$ by the vector $\mathbf{x}_i \in \mathbb{R}^{V}$ ($V = mn$) obtained by concatenating its columns; therefore the matrix $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_i, \ldots, \mathbf{x}_N] \in \mathbb{R}^{V \times N}$ contains the $N$ density maps estimated by the deep-fusion network.
Thus, $\mathbf{w} = [w_1, \ldots, w_v, \ldots, w_V] \in \mathbb{R}^{V}$ is the parameter that maps the density map to the overall count. The global counting of the $i$th image can be calculated by

$$h_i = \mathbf{w}^{T}\mathbf{x}_i. \tag{9}$$

On the other hand, in the context of density map modification, the error of the overall counting is caused by the errors of the coarse estimated density map. Therefore, the global regression problem can also be viewed as eliminating the errors and constructing a more accurate density map with low-rank and sparse structure. Consequently, $\mathbf{W} \in \mathbb{R}^{m \times n}$, which is reshaped from $\mathbf{w}$, denotes the learnt parameter that eliminates the noise of the estimated density map, and the modified density map $\mathbf{D}_i^{m}$ can be calculated from the estimated density map $\mathbf{D}_i^{e}$ with

$$\mathbf{D}_i^{m} = \mathbf{D}_i^{e} \odot \mathbf{W}, \tag{10}$$

where $\odot$ denotes the Hadamard product and the entry in the $p$th row and $q$th column of $\mathbf{D}_i^{m}$, $d_{pq}^{m}$, is the product of $d_{pq}^{e}$ and $w_{pq}$, which are the entries in the same position of the two matrices. Thus the overall crowd counting is the integral of the modified density map:

$$h_i = \mathbf{1}^{T}(\mathbf{D}_i^{e} \odot \mathbf{W})\mathbf{1}, \tag{11}$$

where $\mathbf{1}$ is the all-ones vector. So the counting regression problem can also be defined as

$$\min_{\mathbf{W}} \sum_{i=1}^{N} \big(\mathbf{1}^{T}(\mathbf{D}_i^{e} \odot \mathbf{W})\mathbf{1} - h_i\big)^2. \tag{12}$$

As the density map is a two-dimensional signal and should contain low-rank and sparse structure, enlightened by GoDec, we attempt to add a low-rank and sparse penalty on the modified density map to eliminate the errors, as in (13), where the modified density map is calculated by $\mathbf{D}_i^{m} = \mathbf{D}_i^{e} \odot \mathbf{W}$ as in Section 5.2 and $\mathbf{L}_i$ and $\mathbf{S}_i$ are the low-rank and sparse structures of the $i$th modified density map, respectively. The target of this optimization problem is actually $\{\mathbf{L}_i, \mathbf{S}_i\}_{i=1}^{N}$, and all the low-rank and sparse structures of the density maps need to be computed with the counting as the supervision signal, which makes the problem difficult to solve.
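The equivalence between the vector form of the count (inner product of the weight vector and the stacked density map) and the matrix form (integral of the Hadamard product) can be verified directly. The sketch below uses random stand-ins for an estimated density map and a weight matrix; the sizes are arbitrary, and column stacking is done with Fortran-order flattening.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 5
D_e = rng.random((m, n))   # stand-in for an estimated density map
W = rng.random((m, n))     # stand-in for the weight matrix

# Vector form: concatenate columns, then take the inner product.
x = D_e.flatten(order="F")
w = W.flatten(order="F")
count_vec = w @ x

# Matrix form: integral (sum of all entries) of the Hadamard product.
count_mat = np.ones(m) @ (D_e * W) @ np.ones(n)
```

Setting every entry of `W` to one recovers the direct integral, so the direct integral is the special case of this regression with a fixed, unlearned weight.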
To solve this problem, as the estimated density map $\mathbf{D}_i^{e}$ is a constant matrix with a constant $\mathrm{rank}(\mathbf{D}_i^{e})$ and Horn [34] has theoretically justified that $\mathrm{rank}(\mathbf{A} \odot \mathbf{B}) \le \mathrm{rank}(\mathbf{A})\,\mathrm{rank}(\mathbf{B})$, the low-rank constraint on $\mathbf{L}_i$ can be relaxed to the constraint that the parameter matrix $\mathbf{W}$ contains a low-rank part $\mathbf{L}_W$. Moreover, as the density map $\mathbf{D}_i^{e}$ is also a dense matrix with an abundance of nonzero values, $\mathbf{W}$ should also contain a sparse part $\mathbf{S}_W$ to ensure that $\mathbf{S}_i = \mathbf{D}_i^{e} \odot \mathbf{S}_W$ is sparse for all density maps. To sum up, the low-rank and sparse penalty on the modified density maps can be transferred to the weight matrix $\mathbf{W}$, as shown in

$$\mathbf{W} = \mathbf{L}_W + \mathbf{S}_W. \tag{14}$$

So the previous optimization problem can be formulated as in (15).
Nevertheless, GoDec is mainly a signal decomposition method used for matrix completion and background modeling by capturing the main part of the signal without supervision, as shown in (2), and the problem in (3) is mainly a matrix construction problem instead of the regression problem shown in Section 5.2.
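The rank bound for Hadamard products used in this argument can be checked numerically. The matrices below are random low-rank stand-ins with arbitrary sizes; the check illustrates the inequality on one example and is of course not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(12, 2)) @ rng.normal(size=(2, 12))  # rank-2 matrix
B = rng.normal(size=(12, 3)) @ rng.normal(size=(3, 12))  # rank-3 matrix

rank_A = np.linalg.matrix_rank(A)
rank_B = np.linalg.matrix_rank(B)
rank_H = np.linalg.matrix_rank(A * B)  # rank of the Hadamard product
```

For the regression above, this means that a low-rank weight matrix keeps the rank of the modified density map bounded by a constant multiple of the rank of the estimated density map.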
The problem defined in Section 5.2 is indeed a regression problem with the global counting as the supervision signal, whose density map is constrained by the low-rank and sparse structure. To solve this problem, we add the low-rank and sparse penalty of the regression weight matrix to the density map regression problem formulated in (12). Thus the previous regression problem becomes the problem in (16), which is equivalent to the problem in (17).

Solution.
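To solve this problem, we transform it mathematically to

$$\min_{\mathbf{L}_W, \mathbf{S}_W} \sum_{i=1}^{N} \big(\mathbf{1}^{T}(\mathbf{D}_i^{e} \odot (\mathbf{L}_W + \mathbf{S}_W))\mathbf{1} - h_i\big)^2 + \alpha \|\mathbf{L}_W\|_* + \beta \|\mathbf{S}_W\|_1, \tag{18}$$

where the trace norm regularization term encourages the desirable low-rank structure in the matrix $\mathbf{L}_W$, the $\ell_1$-norm regularization term induces the desirable sparse structure in the matrix $\mathbf{S}_W$, and $\alpha$ and $\beta$ are nonnegative trade-off parameters. When the matrices $\mathbf{L}_W$ and $\mathbf{S}_W$ are reshaped to vectors $\mathbf{l}_W$ and $\mathbf{s}_W$, the vector version of $\mathbf{W}$, $\mathbf{w}$, equals $\mathbf{l}_W + \mathbf{s}_W$ and the whole problem can be represented as

$$\min_{\mathbf{l}_W, \mathbf{s}_W} \|\mathbf{X}^{T}(\mathbf{l}_W + \mathbf{s}_W) - \mathbf{h}\|_2^2 + \alpha \|\mathbf{L}_W\|_* + \beta \|\mathbf{s}_W\|_1. \tag{19}$$

Note that we can solve this optimization problem with [35].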
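A problem of this form, a squared loss plus a trace norm on one part and an $\ell_1$ norm on the other, can be approached with proximal gradient descent. The sketch below is a generic illustration and is not the solver of [35] or the MALSAR implementation used in the experiments. The function names, the factor 0.5 on the loss, the values of `alpha` and `beta`, and the synthetic data are all assumptions.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * trace norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entrywise soft-thresholding: proximal operator of tau * l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def objective(X, h, L, S, alpha, beta):
    resid = X.T @ (L + S).flatten(order="F") - h
    trace_norm = np.linalg.svd(L, compute_uv=False).sum()
    return 0.5 * resid @ resid + alpha * trace_norm + beta * np.abs(S).sum()

def fit(X, h, shape, alpha, beta, n_iter=200):
    """Proximal gradient on 0.5*||X^T(l+s) - h||^2 + alpha*||L||_* + beta*||S||_1."""
    L, S = np.zeros(shape), np.zeros(shape)
    # Joint Lipschitz constant of the smooth part in (L, S) is 2 * ||X||_2^2.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    history = []
    for _ in range(n_iter):
        resid = X.T @ (L + S).flatten(order="F") - h
        grad = (X @ resid).reshape(shape, order="F")  # same gradient for L and S
        L = svt(L - step * grad, step * alpha)
        S = soft(S - step * grad, step * beta)
        history.append(objective(X, h, L, S, alpha, beta))
    return L, S, history

# Synthetic data: random "density maps" whose true count uses only the top rows.
rng = np.random.default_rng(0)
m, n, N = 6, 8, 40
maps = rng.random((N, m, n))
X = np.stack([d.flatten(order="F") for d in maps], axis=1)  # V x N
h = maps[:, :3, :].sum(axis=(1, 2))
L, S, history = fit(X, h, (m, n), alpha=0.1, beta=0.1)
```

With this step size the objective does not increase between iterations, which gives a simple sanity check on `history`. Accelerated variants would converge faster but are omitted to keep the sketch short.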

Comments. This low-rank and sparse based regression method is actually a regression projection from feature to number, with the low-rank and sparse penalty on the modified density map. To solve the problem, we transfer the penalty on the density map to the regression parameter matrix.
While our model is not a popular end-to-end architecture, the subsequent low-rank and sparse learning process can also be viewed as an additional element-wise product layer, whose weight is trained with the low-rank and sparse penalty. Zhang et al. [11] also attempted to add a fully connected layer, without penalty, after the estimated density map to close the gap between the density map and the overall count, but the performance of the network degenerated considerably, partly because overfitting harms the convergence of the network. By analyzing the essence of the crowd counting problem, we attribute the success of our model to the penalty on the parameter, which matches the low-rank and sparse structure of the density map in this task.

Experiments
Experiments show that our model can improve on the accuracy and robustness of existing crowd counting methods. Implementation of the proposed model is based on the Caffe framework developed by [36] and the MALSAR toolbox released by [37].
6.2. Datasets. We evaluate our LFCNN model on two existing large-scale datasets, the Shanghaitech dataset and the WorldExpo10 dataset, instead of low-density or single-scene datasets such as UCSD [38] and PETS2009 [39]. Shanghaitech is a large-scale crowd counting dataset released in 2016 [12], which contains 1198 annotated images with 330165 people annotated at the centers of their heads. There are two parts in it. Part A contains pictures collected from the Internet with arbitrary sizes and scenes, while Part B is composed of frames collected by several different surveillance cameras in a crowded street.
The WorldExpo10 dataset [11] is another existing large-scale crowd counting dataset, containing 1132 annotated video sequences captured by 108 surveillance cameras on the campus of WorldExpo. Different from the cross-scene crowd counting application of [11], we divide all of the annotated images into training and test sets with a proportion of 6 : 4 to evaluate the methods' performance in arbitrary-scene crowd counting.

Accuracy of Crowd Counting.
In Tables 2, 3, and 4, we compare the performance of the proposed LFCNN method with those of other existing crowd counting methods of three categories on the Shanghaitech Part A dataset, the Shanghaitech Part B dataset, and the WorldExpo10 dataset, respectively. In addition to the low-rank and sparse based regression method, we also adopt other regression methods, such as Ridge Regression and LSSVM, to evaluate the effect of the low-rank and sparse penalty. Among these methods, the location-based method, ACF, cannot achieve decent crowd counting accuracy, which is largely due to the severe occlusion in extensively crowded scenes, especially in the Part A dataset. The effect of the feature-based regression methods largely depends on the type of handcrafted features and regression methods. In addition, it is not surprising to find that the CNN based methods achieve better accuracy.
The experimental results demonstrate that the proposed LFCNN method outperforms all existing CNN based crowd counting methods by a margin and reduces the MRE (mean relative error) of state-of-the-art methods by 33.71%, 23.87%, and 19.80%, respectively, on the three large-scale datasets. The Shanghaitech Part A dataset has the highest average crowd count, and the proposed LFCNN shows the largest accuracy improvement on it, which demonstrates the favorable performance of our method on extensively crowded scenes.
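For reference, the two error metrics and the relative reduction quoted above can be computed as in the sketch below. The text only names MRE as the mean relative error, so normalizing each absolute error by the ground-truth count is an assumption, and the example numbers are made up rather than taken from the tables.

```python
def mae(pred, true):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def mre(pred, true):
    """Mean relative error: absolute error divided by the ground-truth count."""
    return sum(abs(p - t) / t for p, t in zip(pred, true)) / len(true)

def relative_reduction(baseline, ours):
    """Fraction by which `ours` lowers the error of `baseline`."""
    return (baseline - ours) / baseline

# Hypothetical counts for two test images.
true_counts = [100, 200]
pred_counts = [90, 210]
example_mae = mae(pred_counts, true_counts)   # 10.0
example_mre = mre(pred_counts, true_counts)   # about 0.075
gain = relative_reduction(0.20, 0.15)         # about 0.25, i.e. a 25% reduction
```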
One interesting issue we observe is that the performance of the deep-fusion network with Ridge Regression degenerates considerably compared with the direct integral, which might be caused by the fact that the $\ell_2$-norm penalty of the Ridge
In Figure 3, we show the density map regression and crowd counting results of some test images. The first column is the test image, the second column contains the ground truth density maps calculated from the annotated targets, the density maps in the third and fourth columns are calculated by Patch-CNN [11] and MCNN [12], respectively, the ones in the fifth column are calculated by our proposed deep-fusion network, and the ones in the sixth column are the modified density maps calculated as the pointwise product of the weight matrix of the low-rank and sparse regression and the density maps in the fifth column.
Compared with the ground truth density maps, the estimated ones calculated by the density regression networks are coarser and contain many more small nonzero errors in background areas. Among the estimated density maps, the modified ones calculated by our proposed LFCNN are more accurate and fine-grained. By comparing the fifth column and the sixth column, we can see that the accuracy is largely due to the low-rank and sparse regression process.

Deep-Fusion Network.
To illustrate the density map regression performance with respect to network structure, we construct MCNN network structure following [12] and density map regression network based on Alexnet [43] and VGGnet [44] by replacing the fully connected layers with fully convolutional layers.The VGGnet contains 16 layers, which is much too deep for density map regression task as the gradient back-propagated from the density map is considerably small and vanishes in a deep network structure.So we choose the first 6 layers of VGGnet to construct the VGGnet density network.The density map regression performance of the networks on Shanghaitech Part A dataset is shown in Table 5.
The crowd counting performance of various network structures shown in Table 5 verifies the accuracy of the deep-fusion structure on the density map regression task. The proposed deep-fusion network structure reduces the MAE by 45.58%, 11.16%, 21.74%, and 23.81% compared with the one-column network, the three-column network, the Alexnet-based density map regression network, and the VGGnet-based density map regression network, respectively. The effect of the proposed deep-fusion network can be attributed to its capability of capturing multiscale small targets, which is one of the key problems in surveillance based crowd counting. As only pedestrians need to be detected in the crowd counting problem, instead of the thousands of objects in the image classification task, the abundant features extracted by Alexnet and VGGnet may not be able to show their full capability. Tables 2, 3, and 4 also demonstrate that the accuracy of crowd counting is enhanced by two aspects: the deep-fusion network structure and the low-rank and sparse based regression process.
6.6. Robustness of Crowd Counting. To compare the robustness characteristics of the methods and analyze the scale-invariant capability of the CNN based methods in more detailed view, we need to know their performance on pedestrians of diverse sizes. As the size of a pedestrian is always inversely proportional to the crowd density in extensively crowded images, we divide the two datasets into subgroups according to the number of pedestrians in the images. The performance of the CNN based methods is presented in Figure 4.
The $x$-axis denotes the number of pedestrians in an image; as the number of pedestrians increases, the average size of each one decreases. In Figure 5, when the number of pedestrians is under 50, which means that the average size of a pedestrian is large, the MRE decreases by about 50% compared with [11] and by about 35% compared with [12]. When the crowd count is above 350, which means that the pedestrians are small, the MRE is about half the MRE of [12] and only a third of the MRE of [11], which shows not only the robustness of our method but also its strong capability of capturing small instances.
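The subgroup evaluation can be sketched as follows. The bin edges 50 and 350 are taken from the thresholds mentioned in the text, but the actual subgroup boundaries used for the figures are not stated, so they are assumptions here, as are the function names and the example counts.

```python
def mre(pred, true):
    """Mean relative error with respect to the ground-truth counts."""
    return sum(abs(p - t) / t for p, t in zip(pred, true)) / len(true)

def mre_by_subgroup(pred, true, edges=(50, 350)):
    """Group images by ground-truth count and compute the MRE of each group."""
    groups = {}
    for p, t in zip(pred, true):
        # Bin index = number of edges that the true count has reached.
        idx = sum(t >= e for e in edges)
        groups.setdefault(idx, []).append((p, t))
    return {idx: mre(*zip(*pairs)) for idx, pairs in sorted(groups.items())}

# Hypothetical counts: two sparse images, one medium, one dense.
true_counts = [20, 40, 100, 400]
pred_counts = [22, 36, 110, 380]
per_group = mre_by_subgroup(pred_counts, true_counts)
# per_group maps 0 (under 50), 1 (50 to 349), and 2 (350 and above) to an MRE.
```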

Conclusion
In this paper, we have proposed an accurate crowd counting method, called the low-rank and sparse based deep-fusion convolutional neural network (LFCNN). In this method, the proposed deep-fusion network is designed to capture multiscale targets and improve the accuracy of density map regression. The global counting is then regressed through a low-rank and sparse based regression. To our knowledge, this is the first time that the low-rank and sparse penalty is used for the regression of global counting. Experiments on large-scale crowd counting datasets demonstrate the improvement in accuracy achieved by the proposed method.

Figure 3: Calculated density map (a) and estimated density map (b) of the 132nd test image in the Part A dataset.


Figure 4: Density map regression performance of methods.

Figure 5: MRE of the CNN based methods in the WorldExpo10 dataset evaluated in different subgroups.

Table 2: Performance comparison of crowd counting methods for Shanghaitech Part A dataset.

Table 3: Performance comparison of crowd counting methods for Shanghaitech Part B dataset.

Table 4: Performance comparison of crowd counting methods for the WorldExpo10 dataset.

Table 5: Performance of various network structures on Shanghaitech Part A dataset.