Estimating the crowd density of public territories, such as scenic spots, is of great importance for ensuring population safety and social stability. Owing to problems in scenic spots such as illumination change, camera angle change, and pedestrian occlusion, current methods are unable to make accurate estimations. To deal with these problems, an ensemble learning (EL) method using support vector regression (SVR) is proposed in this study for crowd density estimation (CDE). The method first uses the human head width as a reference to separate the foreground into multiple levels of blocks. A first-level SVR model then makes a coarse people-count prediction from three features extracted from each image block (D-SIFT, ULBP, and GIST), and its prediction results are used as new features for a second-level SVR model that makes the fine prediction. The predictions of all image blocks are summed, and the density is estimated according to the crowd levels predefined for different scenic spot scenes. Experimental results demonstrate that the proposed method achieves a classification rate of over 85% for multiple scenic spot scenes and is an effective CDE method with strong adaptability.
With rising living standards and the constant acceleration of urbanization, collective activities in large-scale public places are becoming more and more frequent. In recent years, dense crowds have caused frequent accidents. Therefore, using computer vision to intelligently monitor crowds, issue timely warnings, and take effective measures plays an essential role in social stability and population safety.
Current human crowd density estimation (CDE) methods are mainly divided into two categories.
However, in actual scenic spots, existing methods struggle to make accurate predictions because of problems such as illumination change, inconsistent camera angles and heights, continuous pedestrian crowding, and severe occlusion. To deal with these problems, the unique characteristics of scenic spot monitoring are considered, and a CDE method based on multifeature ensemble learning (EL) is proposed. Experiments demonstrate that, compared with the people-count estimation methods commonly used in recent years, the proposed method not only achieves relatively good results on public data sets but also performs well in experimental scenic spot scenes, showing strong adaptability.
The remainder of this paper is organized as follows.
In summary, CDE methods should solve the following problems.
To address the above problems, a density estimation method based on multifeature EL is developed, with its flow chart shown in Figure. Foreground segmentation is widely used in density estimation because it reduces background interference; a typical example is the mixture-of-Gaussians technique [. To handle the change of camera perspective projection, an image blocking method is adopted in this study, and the separated blocks (subimages) are resized to a uniform size. To better describe the people count of crowds, three descriptors are extracted, namely dense SIFT [, ULBP, and GIST. Support vector regression (SVR) is used to fit the subimage features to the people counts, and an EL method is proposed that uses a two-level SVR model to predict the local people count of each subimage. Finally, the local people counts of all subimages are summed, and the density is estimated according to the people-count standards set for different scenic spot scenes.
Flow chart of the methodology.
In a scenic spot monitoring environment, cameras are often installed at high positions with a certain tilt angle to the horizontal plane. Therefore, perspective projection effects exist in the collected images, and they manifest in two ways.
Currently, methods solving perspective projection change are divided into the following types.
(a) Reference persons at the nearest and farthest point in the labeled ground plane area and (b) the calculated weight map.
Perspective transformation methods and feature weighting methods require training separate models for different scenes. Traditional blocking methods [
(a) Blocking method diagram and (b) blocking of a scenic spot scene.
Select a reference pedestrian. When his/her head enters the region of interest, the head width is measured as
After being resized to a uniform size, the separated blocks (subimages) can be trained as unified samples. This method not only solves the perspective projection problem but also improves scalability and adaptability to other scenes.
Since crowds usually exhibit typical texture and dense corner points, three descriptors, including D-SIFT, ULBP, and GIST, are used in this study to describe the people count of crowds.
The SIFT feature is a method for detecting local features [. To extract dense SIFT, a gray-level image is read, and a patch slides over the image with a fixed sampling step; this patch is the sampling area of the descriptor. The size of the patch is. In each patch, the gradient of each pixel is calculated, and the gradient histogram over 8 directions is computed for each cell, resulting in a feature dimension of
Dense-SIFT descriptor geometry.
For an image with width
For an image of size
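The number of descriptors produced by the dense grid can be sketched as follows; the 16-pixel patch and 8-pixel step are assumed defaults for illustration, since the paper's exact values are elided here.

```python
# Dense-SIFT grid geometry: a sketch assuming a 16-pixel patch and an
# 8-pixel sampling step (typical defaults; not the paper's exact values).

def num_patches(width, height, patch=16, step=8):
    """Number of sliding positions of a square patch on a regular grid."""
    nx = (width - patch) // step + 1
    ny = (height - patch) // step + 1
    return nx * ny

# Each patch is split into a 4x4 grid of cells with an 8-bin gradient
# orientation histogram per cell, giving the standard 128-D descriptor.
DESC_DIM = 4 * 4 * 8
```

For example, a 64x64 subimage under these assumptions yields 7x7 = 49 descriptors of 128 dimensions each, which motivates the dimensionality reduction discussed next.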
Due to its high dimensionality, the D-SIFT feature is unsuitable for direct use. In this study, the BOF method [
The bag of features diagram: (a) the feature descriptors extracted from images, (b) clustering, (c) the formation of word frequency statistics histogram, and (d) the new feature from the feature histogram.
After processing the original D-SIFT features by BOF, the dimensionality is determined by the number of cluster centers, that is, the predefined word number. In the proposed scheme, 500 or 1000 words are usually set. Therefore, it is clear that BOF can significantly reduce the dimensionality of D-SIFT features.
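The quantization step of BOF can be sketched in numpy as below: given a codebook of cluster centers (learned elsewhere, e.g., by k-means over D-SIFT descriptors), each descriptor is assigned to its nearest visual word and an L1-normalized word-frequency histogram is formed. The toy 2-D codebook here is illustrative only; real codebooks use 500 or 1000 words over 128-D descriptors.

```python
import numpy as np

def bof_histogram(descriptors, centers):
    """Quantize local descriptors against a codebook and return an
    L1-normalized word-frequency histogram (the BOF vector)."""
    # Squared Euclidean distance from every descriptor to every center.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # nearest visual word per descriptor
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy 4-word codebook in 2-D for illustration.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
desc = np.array([[0.1, 0.0], [0.9, 0.1], [0.1, 0.1]])
h = bof_histogram(desc, centers)
```

The resulting vector's length equals the codebook size regardless of how many descriptors a subimage produces, which is what makes the representation usable by the regressor.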
LBP (local binary pattern) is an operator that describes the local texture feature of an image. It has significant advantages such as rotation invariance and grayscale invariance. For a circular area that has a center of
Obviously, an LBP operator
Ojala et al. observed that, in actual images, LBP patterns rarely contain more than two circular 0/1 transitions. Under this consideration, let
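The "at most two transitions" rule can be checked directly. The sketch below counts circular 0/1 transitions of an 8-bit pattern; exactly 58 of the 256 patterns are uniform, and mapping each uniform pattern to its own bin plus one shared bin for all non-uniform patterns yields the 59-bin ULBP histogram.

```python
def transitions(pattern, bits=8):
    """Number of 0/1 transitions in a circular binary pattern."""
    return sum(((pattern >> i) & 1) != ((pattern >> ((i + 1) % bits)) & 1)
               for i in range(bits))

def is_uniform(pattern, bits=8):
    """'Uniform' LBP patterns have at most two circular transitions."""
    return transitions(pattern, bits) <= 2

# For P = 8 neighbors there are 2 + 8*7 = 58 uniform patterns.
uniform_count = sum(is_uniform(p) for p in range(256))
```

This reduction from 256 to 59 bins is what makes ULBP a compact texture descriptor for the subimages.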
Numerous psychological studies have shown that humans can extract information from an image within 100 ms to determine the class of a scene or obtain its global features, which are also called the GIST of a scene [
Later work aimed to compute such GIST features algorithmically. Oliva and Torralba proposed using multiscale, multidirection Gabor filters to filter scene images, extract their contour information, and form GIST features [
Multiscale multidirection Gabor filters are based on normal Gabor filter
Divide an image with
In this study,
GIST feature extraction diagram.
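A rough numpy sketch of GIST-style extraction in the spirit described above: a small bank of Gabor filters at multiple scales and orientations is applied, and the response magnitudes are averaged over a grid of blocks. The filter bank size (2 scales x 4 orientations) and the 4x4 grid are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """A real-valued Gabor kernel: a sinusoid windowed by a Gaussian."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

def gist_descriptor(image, scales=(4, 8), n_theta=4, grid=4):
    """Filter the image with a Gabor bank and average the response
    magnitude over a grid x grid partition, concatenating all blocks."""
    h, w = image.shape
    feats = []
    for wl in scales:
        for k in range(n_theta):
            kern = gabor_kernel(9, wl, k * np.pi / n_theta, sigma=wl / 2)
            # 'same'-size filtering via FFT (circular boundary handling).
            resp = np.abs(np.fft.ifft2(np.fft.fft2(image) *
                                       np.fft.fft2(kern, s=image.shape)))
            for i in range(grid):
                for j in range(grid):
                    block = resp[i * h // grid:(i + 1) * h // grid,
                                 j * w // grid:(j + 1) * w // grid]
                    feats.append(block.mean())
    return np.array(feats)
```

Under these assumed settings the descriptor has 2 x 4 x 16 = 128 dimensions, one mean response per (scale, orientation, block) triple.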
After features are extracted from the subimages, we adopt a procedure in which the people count is estimated first, followed by density estimation, as shown in Figure
The process from features to density estimation.
By EL, the features are converted into the estimation of the people counts in crowds. Then, the people counts of all subimages of an image are summed and classified, finally obtaining the density estimation.
SVR is used in this study to establish the relation between the features and the people counts in subimages and thus determine the people count of a crowd. Support vector machines seek the optimal trade-off between model complexity and learning ability from limited sample information and therefore achieve good generalization. In SVR, the relation between the prediction values and the image features can be expressed as follows [
Radial basis function (RBF) is usually used for kernel function [
In this study, an EL method combining SVR is proposed to improve the prediction accuracy. EL accomplishes learning tasks by establishing and combining multiple learning machines. Currently, EL methods are categorized into two major classes [
The three features selected in this study have certain differences. Therefore, the parallel method is adopted, as shown in Figure
EL diagram.
The three features extracted in Section
Learning is used as the combining strategy; that is, a learning machine performs the combination. Here, the individual learning machines are called primary learning machines, while the learning machine used for combination is called the secondary learning machine. The learning-based combining algorithm is shown in Algorithm
Input: initial training set D = {(x_i, y_i), i = 1, …, m}; primary SVR algorithm; secondary SVR algorithm
Process:
    for t = 1, …, T do  // use part of the training set
        train primary learning machine h_t  // primary learning machines training
    end for
    D′ = ∅
    for i = 1, …, m do
        for t = 1, …, T do
            z_it = h_t(x_i)  // predict with primary learning machines
        end for
        add ((z_i1, …, z_iT), y_i) to D′
    end for
    train secondary learning machine h′ on D′  // secondary learning machine training
Output: H(x) = h′(h_1(x), …, h_T(x))
At the training stage, the secondary training set is generated by the primary learning machines. If the secondary learning machine were trained on the same samples as the primary ones, the risk of overfitting would be high. Therefore, only part of the training set is used to train the primary learning machines, and their prediction results on the remaining samples are used as the training set of the secondary learning machine. In this way, the overfitting problem is alleviated. As shown in Algorithm
At the prediction stage,
Considering that different features have different sensitivities to crowd density, we adopt two levels of regression to compensate for each feature's drawbacks and improve the prediction accuracy.
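The two-level scheme can be sketched with scikit-learn's SVR on synthetic stand-in data (the real inputs are the D-SIFT/ULBP/GIST vectors with annotated counts): one RBF-kernel SVR per descriptor is trained on half the samples, and their predictions on the other half train the secondary SVR, mirroring the split described above.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-ins for the three per-block descriptors and the true
# per-block people counts (illustrative only).
n = 200
feats = {name: rng.normal(size=(n, 8)) for name in ("dsift", "ulbp", "gist")}
counts = sum(f[:, 0] for f in feats.values()) + rng.normal(0.0, 0.1, n)

half = n // 2  # disjoint halves reduce the secondary model's overfitting

# Level 1: one RBF-kernel SVR per feature type, trained on the first half.
primary = {name: SVR(kernel="rbf", C=10.0).fit(f[:half], counts[:half])
           for name, f in feats.items()}

# Level 2: the primary predictions on the *second* half form the new
# 3-D feature that trains the secondary SVR.
meta_X = np.column_stack([primary[name].predict(feats[name][half:])
                          for name in ("dsift", "ulbp", "gist")])
secondary = SVR(kernel="rbf", C=10.0).fit(meta_X, counts[half:])

def predict_count(block_feats):
    """Two-level prediction for blocks given their three descriptors."""
    z = np.column_stack([primary[name].predict(block_feats[name])
                         for name in ("dsift", "ulbp", "gist")])
    return secondary.predict(z)
```

At test time, each block's three descriptors pass through the primary models, and the secondary model fuses the three coarse counts into the final per-block estimate.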
For a test image, the predicted people counts of all blocks are added to generate the estimation of the people count of this image.
It is worth noting that summing the predicted people counts of all blocks accumulates the per-block prediction errors, and an estimate of the density level is of more interest here than an exact people count. Therefore, a classification step is applied after the predicted people counts of all blocks are summed.
In this study, a common five-level classification is used to convert the people count into a density estimate.
In this paper, different classification standards are set for different scenes. The maximum number of people
The different density images on scene 1: (a) very low, (b) low, (c) medium, (d) high, and (e) very high.
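The count-to-level mapping can be sketched as follows. The function and its equal-width thresholds relative to a scene-specific maximum `n_max` are illustrative assumptions, since the exact per-scene standards are not reproduced here.

```python
def density_level(total_count, n_max):
    """Map a summed people-count estimate to one of five density levels.

    n_max is the scene-specific maximum number of people; the equal-width
    0.2 thresholds below are illustrative, not the paper's exact standards.
    """
    ratio = total_count / n_max
    if ratio < 0.2:
        return "very low"
    elif ratio < 0.4:
        return "low"
    elif ratio < 0.6:
        return "medium"
    elif ratio < 0.8:
        return "high"
    return "very high"
```

Defining the thresholds per scene lets the same trained regressor serve scenes whose walkable areas, and hence plausible crowd sizes, differ widely.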
To verify the effect of feature selection on prediction accuracy, as well as the blocking method, the EL scheme, and the adaptability to different scenes, comparative tests are conducted on multiple scenes of the Pingjiang scenic spot in Suzhou City and on the UCSD public crowd data set [
In the data set of Pingjiang scenic spot of Suzhou City, scene 1 shown in Figure
The UCSD public crowd data set contains a video with a total of 2000 frames, captured at the University of California, San Diego, along with people-count annotations for all frames. Following other algorithms [
For the comparison of people count estimation, mean absolute error (MAE) and mean relative error (MRE) are used as the criteria [
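The two criteria can be computed as below: MAE averages the absolute count error per image, and MRE normalizes each absolute error by the true count before averaging.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error of people-count predictions over test images."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def mre(y_true, y_pred):
    """Mean relative error (%): per-image absolute error / true count."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / y_true)
```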
The performances of different individual features in people count prediction in block samples are compared. The experiment is conducted on the block samples from the data set of Pingjiang scenic spot of Suzhou City. Six features including HOG, original LBP (LBP), GLCM, D-SIFT, ULBP, and GIST are used. The subimages of the 900 training samples are used to train the model. The remaining subimages of the 600 test samples are used to predict and compare with the ground truth. The results are shown in Table
Comparison between different individual features.
Feature | HOG | LBP | GLCM | ULBP | GIST | D-SIFT |
---|---|---|---|---|---|---|
MAE | 1.28 | 1.15 | 1.12 | 1.06 | 0.97 | |
MRE (%) | 21.7 | 19.5 | 19.0 | 18.0 | 16.4 | |
It is clear from Table
Then, different combinations of D-SIFT, ULBP, and GIST are compared: two-feature combinations are compared with the three-feature combination, and cascading (concatenation) is compared with EL. To ensure a fair comparison, the cascading combinations use exactly the same training and test samples as the individual features. For EL, since a two-level SVR model is used, half of the training samples serve as each level's regression training samples, and the test samples are the same as for cascading. Table
Comparison between multifeature combinations.
Combination method | MAE | MRE (%) |
---|---|---|
ULBP + GIST | 0.92 | 15.6 |
ULBP + GIST (EL) | 0.85 | 14.4 |
ULBP + D-SIFT | 0.82 | 13.9 |
GIST + D-SIFT | 0.81 | 13.7 |
ULBP + D-SIFT + GIST | 0.78 | 13.2 |
ULBP + D-SIFT (EL) | 0.78 | 13.2 |
GIST + D-SIFT (EL) | 0.72 | 12.2 |
D-SIFT + ULBP + GIST (EL) | | |
It is clear from Table
In order to verify the accuracy of the proposed blocking method, a global estimation method is designed for comparison.
This global method differs from the proposed method in that it removes the blocking step, extracting the ULBP, GIST, and D-SIFT features directly from the foreground of the region of interest. 450 training samples, along with their people-count annotations, are used to train the first-level SVR model. The second-level SVR is then trained on the output of the first level using another 450 training samples. Finally, the 600 test samples are predicted using the two-level SVR, and the classification results are compared with the ground truth. Results are shown in Table
The classification accuracy of global method (%).
Scene 1 | VL | L | M | H | VH |
---|---|---|---|---|---|
VL | | 10.14 | | | |
L | 10.32 | | 16.67 | | |
M | | 13.22 | | 8.26 | 0.83 |
H | | | 8.40 | | 23.53 |
VH | | | 4.17 | 25.00 | |
In the proposed method, the 600 test samples are divided into blocks, and the model trained by EL with the three features in Section
The classification accuracy of our method (%).
Scene 1 | VL | L | M | H | VH |
---|---|---|---|---|---|
VL | | 0.72 | | | |
L | 1.59 | | 2.38 | | |
M | | 7.44 | | 4.96 | |
H | | | 8.40 | | 5.88 |
VH | | | | 12.50 | |
Experimental results show that the proposed method achieves an average accuracy of 91.67%, while the global analysis method only achieves an average accuracy of 76.5%. This indicates that the proposed blocking method is indeed capable of handling the perspective effect problem.
To further verify the effectiveness of the proposed method, algorithms in [
Comparison with other algorithms on UCSD dataset.
Method | MAE | MRE (%) |
---|---|---|
Wu et al. [ | 2.60 | 14.2 |
Chan et al. [ | 2.30 | 12.6 |
Zhang et al. [ | 2.08 | 11.3 |
Chen et al. [ | 2.07 | 11.3 |
Proposed | 1.95 | 10.7 |
Lempitsky and Zisserman [ | | |
Hu et al. [ | 1.98 | 10.8 |
Zhang et al. [ | | |
It is clear in Table
To further verify the adaptability of the model, other scenes are tested.
Figure
Ten-hour prediction data of scene 2.
The 5-class classification accuracy of 300 test samples of two selected scenes is shown in Tables
The accuracy of our method on scene 2 (%).
Scene 2 | VL | L | M | H | VH |
---|---|---|---|---|---|
VL | | 2.90 | | | |
L | 4.76 | | 7.94 | | |
M | | 4.92 | | 6.56 | |
H | | | 6.78 | | 10.17 |
VH | | | | 18.75 | 81.25 |
The accuracy of our method on scene 3 (%).
Scene 3 | VL | L | M | H | VH |
---|---|---|---|---|---|
VL | | 4.17 | | | |
L | 4.92 | | 6.56 | | |
M | | 6.15 | | 3.08 | |
H | | | 5.17 | | 10.34 |
VH | | | 2.27 | 15.91 | |
The images of scene 2 and scene 3.
Experimental results show that, under different scenes, the same method can achieve accuracy over 85%. Particularly, the average classification accuracy for scene 2 and scene 3 is 88% and 89%, respectively, indicating that the proposed method has relatively strong scene adaptability.
In this study, a scenic spot crowd density estimation algorithm based on multifeature ensemble learning is proposed. The algorithm solves the perspective effect problem by introducing a new blocking method for scenic spot scenes. In each block of an image, a first SVR layer makes a coarse regression prediction of the people count from the extracted features, and a second SVR layer then performs fine regression on the coarse results. The people counts of all subimages are summed and graded for density estimation according to the standards defined for different scenes. Experiments demonstrate that the proposed algorithm is robust and effective for crowd density estimation.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was supported by the Shenzhen Basic Science & Technology Foundation of China (Grant no. JCYJ20150422150029095) and the Suzhou Industrial Technology Innovation Foundation of China (Grant no. SS201616).