Predicting Pedestrian Counts for Crossing Scenario Based on Fused Infrared-Visual Videos

Estimating the number of pedestrians from surveillance videos and images is a critical task for implementing intelligent signal control at intersections. However, it remains difficult because the pedestrian waiting area is an outdoor scene with a complex and time-varying surrounding environment. In this study, a method for estimating pedestrian counts based on multisource video data is proposed. First, a partial least squares regression (PLSR) model is developed to estimate the number of pedestrians from single-source video (either visible light video or infrared video). Meanwhile, the temporal feature of the scene (daytime or nighttime) is identified from the visible light video. According to the recognized time period, the pedestrian count detection results from the visible light and infrared video data are combined with preset corresponding confidence levels. Empirical experiments showed that this fusion method based on environment perception enables 24-hour monitoring of outdoor scenes such as the pedestrian waiting area and substantially improves the accuracy of pedestrian counting.


Introduction
Estimating the number of pedestrians is a critical task within intelligent transportation systems. Pedestrian counts are a vital input for intersection signal control [1], the guidance of passenger flow, and early warning of large-scale crowd gathering [2,3]. However, estimating pedestrian counts in outdoor scenes, such as the pedestrian waiting area, remains an unsolved challenge.
Generally, there are two main approaches to estimating the number of pedestrians. The first relies on reliable tracking of individual pedestrians: each pedestrian is identified in the image data, and the identifications are tallied [4][5][6][7]. However, this approach is only suitable when the pedestrian density is low; when the density is high and pedestrians overlap severely, its performance deteriorates. The second approach extracts features from the image data and applies regression analysis to estimate the pedestrian count, rather than trying to identify each pedestrian in the image. This approach is more flexible, since there is no need to track each pedestrian.
Surveillance video data, which can be divided into visible light video and infrared video, have frequently been adopted to estimate the number of pedestrians. Infrared video is mostly used to determine whether people are present in a scene or whether a target is a human being [8][9][10][11], but it has rarely been used for estimating pedestrian counts. In contrast, tremendous effort has been invested in estimating pedestrian counts from visible light video. For instance, Davies et al. [12] used geometric features such as areas and perimeters to estimate the number of pedestrians in the image. He [13] proposed a two-region learning algorithm, applying improved aggregate channel feature detection and Gaussian process regression to estimate the number of pedestrians. Chan [14] segmented the image, extracted the features of each segmented region, and then used Gaussian process regression to learn the correspondence between the features and the number of pedestrians in each segment. Zhang [15] applied dimensionality reduction techniques to process high-dimensional image features and performed regression analysis. Li [16] proposed a feature description operator combining the wavelet transform and the gray-level co-occurrence matrix and used an SVM to obtain the pedestrian density model.

Figure 1: Pedestrian-counting framework.
Yan [17] used the simile classifier to optimize subimages and then used a regression analysis model to establish the relationship between subimage blocks and the number of pedestrians. However, the abovementioned studies are based on visible light video, which is sensitive to lighting conditions and therefore cannot monitor the pedestrian waiting area around the clock.
In this study, we propose a pedestrian number estimation method that fuses visible light video and infrared video based on environment perception, in order to realize 24-hour pedestrian count detection for the pedestrian waiting area. First, partial least squares regression (PLSR) is employed to obtain the number of pedestrians in the image from visible light video and from infrared video, respectively. Then, based on the environmental feature obtained from the visible light video, an information fusion model is established to obtain the final number of pedestrians in the image. The overall framework is shown in Figure 1.
The remainder of this paper is organized as follows. Section 2 describes the image processing and how features are extracted from images. Section 3 describes the pedestrian count estimation model and how the visible light detection result is fused with the infrared detection result. Section 4 reports and analyzes the experimental results, and Section 5 summarizes the work and discusses future directions.

Image Processing
In this section, visible light image processing and infrared image processing procedures are introduced correspondingly.

Visible Light Image Processing.
The most important task in image processing is to extract the motion foreground from the image. For visible light images, the background difference method was adopted to obtain the motion foreground.
Since the background image gradually changes over time in the actual scene, it needs to be updated in real time; a Kalman filter was used for this here. Specifically, the background image at time t is determined by the background image at time t − 1 and the real-time image at time t, through a prediction step and an update step. The prediction equations are

B⁻(x, y, t) = B(x, y, t − 1),
P⁻(x, y, t) = P(x, y, t − 1) + Q,

where B(x, y, t − 1) is the optimal background estimate at time t − 1, P is the estimation error covariance, and Q is the process noise; the update step then corrects the predicted background with the current frame I(x, y, t) via the Kalman gain. Then, a binary region-of-interest (ROI) mask proposed by Chan [18] was applied to the background difference image D_v(x, y, t) = |I(x, y, t) − B(x, y, t)|, which not only reduces the amount of subsequent computation but also suppresses interference from noninterest areas. After applying the mask, the binary foreground is obtained by thresholding:

F_v(x, y, t) = 1 if D_v(x, y, t) > Th_v, and 0 otherwise,

where Th_v is the threshold used for binarization. In our experiments, we set Th_v = 45.
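The update-and-threshold procedure above can be sketched as follows. This is a minimal per-pixel sketch in Python with NumPy; the noise parameters q and r are illustrative assumptions, not values reported in the paper.

```python
import numpy as np

def update_background(bg, frame, p, q=1e-3, r=1e-1):
    """One Kalman-filter step per pixel: predict, then correct with the new frame.
    q (process noise) and r (measurement noise) are assumed tuning constants."""
    # Prediction: the background is assumed static, so the predicted background
    # is the previous estimate; the error covariance grows by the process noise.
    bg_pred = bg
    p_pred = p + q
    # Update: blend the prediction with the observed frame via the Kalman gain.
    k = p_pred / (p_pred + r)
    bg_new = bg_pred + k * (frame.astype(float) - bg_pred)
    p_new = (1.0 - k) * p_pred
    return bg_new, p_new

def binary_foreground(frame, bg, th=45):
    """Background difference followed by thresholding (Th_v = 45 in the paper)."""
    return (np.abs(frame.astype(float) - bg) > th).astype(np.uint8)
```

In practice the ROI mask would be multiplied into the thresholded image before any further processing.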
For the binary foreground image F_v(x, y, t), a closing operation (dilation followed by erosion) is applied to fill small holes in the connected domains, connect adjacent objects, and smooth boundaries [19]. Connected domain analysis is then performed, and connected domains with small areas are eliminated to remove noise [20]. The final result of visible light image processing is the image M_v(x, y, t). The set of blobs in M_v(x, y, t) is S_v = {v_1, v_2, ..., v_N}, where v_i is the i-th blob and N is the total number of blobs in the image. For example, Figures 2(a) and 2(b) show the original image and the background image, respectively.

Infrared Image Processing.
The infrared video data are formed by thermal radiation, which is insensitive to ambient light. Since pedestrians generally appear as highlighted regions in the infrared image, we extract the foreground from the gray values of the image. First, the projections of the infrared image on the R, G, and B color channels are analyzed to find the projection in which pedestrians differ most from the surrounding environment. Figure 3 illustrates that, in the projection on the G color channel, the pedestrians are the most prominent and easiest to distinguish. This projection is defined as the grayscale image I_g(x, y, t).
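A simple way to automate this channel choice is to score each channel by a contrast proxy. The sketch below uses the per-channel standard deviation as that proxy; this scoring rule is an assumption for illustration, since the paper selects the channel by visual analysis (Figure 3).

```python
import numpy as np

def best_channel(rgb):
    """Pick the color channel (0=R, 1=G, 2=B) where bright pedestrian regions
    stand out most, scored by the channel's standard deviation (an assumed
    proxy for pedestrian/background contrast)."""
    scores = [float(rgb[..., c].std()) for c in range(3)]
    return int(np.argmax(scores))
```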
With the ROI mask applied, the binary foreground of the infrared image, F_ir(x, y, t), is computed by thresholding the grayscale image:

F_ir(x, y, t) = 1 if I_g(x, y, t) > Th_ir, and 0 otherwise,

where Th_ir is the threshold used for binarization. In our experiments, we set Th_ir = 120.
For the binary foreground F_ir(x, y, t), the closing operation and connected domain analysis are likewise performed to remove noise and preserve the integrity of the pedestrians. The final result of infrared image processing is the image M_ir(x, y, t). The set of blobs in M_ir(x, y, t) is S_ir = {u_1, u_2, ..., u_M}, where u_j is the j-th blob and M is the total number of blobs in the image. For example, Figure 4(a) shows the image I_g(x, y, t).

Feature Extraction.
Here the visible light feature extraction procedure is taken as an example; the procedure for infrared images is similar. For each blob, geometric and positional features are extracted and then used to infer the number of pedestrians. Take the blob v_i as an example; its geometric and positional features are calculated as follows: (1) Area A_i, which is the weighted sum of all pixels in the blob.

(2) Number of edge points E_i, which is the weighted sum of pixels on the boundary of the blob, where the edge image is generated by applying the Sobel edge detector to the blob image v_i.
(3) Length of the blob L_i, which is the maximum number of pixels in the horizontal direction of the blob. (4) Height of the blob H_i, which is the maximum number of pixels in the vertical direction of the blob. (5) Horizontal position X_i, which is the horizontal position of the center pixel of the blob in the image M_v (for infrared images, in M_ir), computed from the set of horizontal pixel positions of the blob v_i. (6) Vertical position Y_i, which is the vertical position of the center pixel of the blob in the image M_v (for infrared images, in M_ir), computed from the set of vertical pixel positions of the blob v_i. The features A, E, L, and H correlate strongly with pedestrian crowd density: at the same position in the image, the larger these values, the more pedestrians the blob contains. Moreover, the farther a pedestrian is from the camera lens, the smaller he appears in the image. We therefore use the position features X and Y to record the positional relationship between pedestrians and the camera lens and so preserve the accuracy of pedestrian counting.
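The six descriptors can be computed from a blob's binary mask as sketched below. The Sobel-magnitude edge count and the unweighted pixel sums are simplifying assumptions (the paper's area and edge features are weighted sums; the weights are not reproduced here).

```python
import numpy as np
from scipy import ndimage

def blob_features(mask):
    """Geometric and positional features of one binary blob, mirroring the
    paper's six descriptors: area A, edge-point count E, length L, height H,
    and the horizontal/vertical center position (X, Y)."""
    ys, xs = np.nonzero(mask)
    area = len(xs)                                 # A: pixel count (unweighted)
    # Sobel gradient magnitude approximates the blob's edge pixels.
    sx = ndimage.sobel(mask.astype(float), axis=1)
    sy = ndimage.sobel(mask.astype(float), axis=0)
    edges = int((np.hypot(sx, sy) > 0).sum())      # E: edge-point count
    length = int(xs.max() - xs.min() + 1)          # L: max horizontal extent
    height = int(ys.max() - ys.min() + 1)          # H: max vertical extent
    x_pos = float(xs.mean())                       # X: horizontal center
    y_pos = float(ys.mean())                       # Y: vertical center
    return area, edges, length, height, x_pos, y_pos
```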
Since the final decision is based on both the visible light and infrared detection results, an indicator is introduced so that the appropriate detection method can be trusted in each situation. In this study, the indicator is the ambient brightness computed from the visible light image, taken as the mean gray value of the frame:

E_t = (1 / (W · H)) Σ_x Σ_y I(x, y, t),

where E_t denotes the ambient brightness at time t and W and H are the image width and height.

Model Establishment
This section focuses on how to infer the number of pedestrians from the extracted features. Two tasks are carried out: (1) a pedestrian count estimation model is developed based on the features extracted from single-source video; (2) an information fusion model is then established based on the detection results of the multisource video.

Pedestrian Count Estimation Model.
To estimate the number of pedestrians in a blob while avoiding the problem that strongly correlated features may mask the true relationship between variables, we apply partial least squares regression (PLSR) [21,22]. PLSR is a multivariate statistical analysis method. It borrows the idea of extracting components from the explanatory variables used in principal component regression, and it can effectively handle multicollinearity among the variables.
First, components are extracted from X_0 and Y_0 (the standardized feature matrix and response). Let t_1 and u_1 be the first components of X and Y; they are chosen to maximize cov(t_1, u_1), the covariance between t_1 and u_1, so that each component carries as much of its own block's variation as possible while being maximally correlated with the other.
After the first components t_1 and u_1 are extracted, X is regressed on t_1 and Y is regressed on t_1, respectively. If the regression equations reach satisfactory accuracy, the algorithm terminates; otherwise, a second round of component extraction is performed on the residuals of X and Y, and so on until satisfactory accuracy is achieved. If a total of m components t_1, ..., t_m are finally extracted, PLSR regresses Y on t_1, ..., t_m and then re-expresses the result as regression equations of Y on the original variables x_1, ..., x_p.
Take the visible light image M_v(x, y, t) as an example. Based on PLSR, we establish a pedestrian estimation model whose input is the feature set {A_i, E_i, L_i, H_i, X_i, Y_i} of the blob v_i and whose output is the number of pedestrians n_i contained in the blob v_i.
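The component-extraction-and-regression loop above can be sketched as a minimal PLS1 (single-response NIPALS) implementation in NumPy. This is a sketch of the standard algorithm, not the paper's code; the deflation and coefficient-recovery steps follow the textbook formulation.

```python
import numpy as np

def pls_fit(X, y, n_components=2):
    """Minimal PLS1 (NIPALS): extract components maximizing covariance with y,
    deflate, and recover coefficients on the original features."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xd, yd = X - x_mean, y - y_mean
    W, P, Q = [], [], []
    for _ in range(n_components):
        w = Xd.T @ yd
        w /= np.linalg.norm(w)           # weight vector maximizing covariance
        t = Xd @ w                       # extracted component (score)
        p = Xd.T @ t / (t @ t)           # X loading
        q = (yd @ t) / (t @ t)           # y loading
        Xd = Xd - np.outer(t, p)         # deflate and extract the next round
        yd = yd - q * t                  # from the residual information
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    B = W @ np.linalg.solve(P.T @ W, Q)  # coefficients on original features
    return B, x_mean, y_mean

def pls_predict(X, B, x_mean, y_mean):
    return (X - x_mean) @ B + y_mean
```

Fitted to the six blob features, pls_predict would return the estimated count n_i for each blob.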
Based on this model, the number of pedestrians in each blob of the image M_v(x, y, t) is calculated. The total number of pedestrians PN_v in the image is

PN_v = Σ_i ⌊n_i + 0.5⌋,

where ⌊n_i + 0.5⌋ denotes rounding n_i to the nearest integer. PN_v is the visible light detection result. Using the same method on the infrared image, we obtain the infrared detection result PN_ir.

Information Fusion Model.
The environment of an outdoor scene such as the pedestrian waiting area varies substantially through the day with lighting conditions, temperature, etc. To maintain the accuracy of the pedestrian count estimates, a method that combines the visible light and infrared detection results is proposed, exploiting the conditions under which each is applicable. First, the current scenario (day or night) is identified from the ambient brightness E_t obtained above. Then, according to the recognized scenario, a corresponding confidence level is assigned to the visible light and infrared detection results. Under good daylight conditions, we trust the visible light detection result; otherwise, we trust the infrared detection result. The information fusion result PN_t at time t is therefore

PN_t = α_v · PN_v + α_ir · PN_ir,

where α_v is the confidence level of the visible light detection result and α_ir is that of the infrared detection result: α_v = 1 and α_ir = 0 if E_t > Th_e, and α_v = 0 and α_ir = 1 otherwise. Th_e is the environment segmentation threshold, and we set Th_e = 75.
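The fusion rule reduces to a brightness-gated selection, as in this short sketch:

```python
def fuse_counts(pn_visible, pn_infrared, brightness, th_e=75):
    """Environment-perception fusion: trust the visible light count in bright
    scenes (E_t > Th_e) and the infrared count otherwise (Th_e = 75 here)."""
    alpha_v = 1.0 if brightness > th_e else 0.0
    alpha_ir = 1.0 - alpha_v
    return alpha_v * pn_visible + alpha_ir * pn_infrared
```

For example, with counts PN_v = 5 and PN_ir = 7, a bright daytime frame (E_t = 120) yields 5, while a dark nighttime frame (E_t = 10) yields 7.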

Empirical Analysis
The empirical analysis was conducted on the campus of Tongji University. A total of 106 groups of daytime images and 18 groups of nighttime images (examples shown in Figure 5) were collected. The visible light images are 640 × 480 pixels, and the infrared images are 320 × 240 pixels. This section uses 8-fold cross validation to divide the image set into training and test sets and then checks the accuracy of the proposed method.
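The 8-fold split can be sketched as follows; the random seed is an illustrative assumption, and the paper does not specify how ties in group sizes are handled.

```python
import numpy as np

def k_fold_indices(n_samples, k=8, seed=0):
    """Randomly partition sample indices into k folds; each fold serves once
    as the test set while the remaining folds form the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```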

Daytime Scenario.
For the subset of daytime images, the visible light detection results are shown in Table 1 and the infrared detection results in Table 2. Figure 6 is a schematic diagram of information fusion in a daytime scenario. As Figure 6 shows, the visible light image is clearer and its processing result contains less noise, because the resolution of the visible light image is higher than that of the infrared image. The information fusion result therefore trusts the visible light detection, which is consistent with the actual situation.

Nighttime Scenario.
For the group of nighttime images, there are no street lights near the experimental site, so visible light detection fails completely and its result is 0. The infrared detection results are shown in Table 3. Figure 7 is a schematic diagram of information fusion in a nighttime scenario. Since the ambient brightness is very low at this time, the information fusion selects the infrared detection result, which is consistent with the actual situation.

Influence of Thresholds Th_v and Th_ir.
The thresholds Th_v and Th_ir are key parameters in this study; they distinguish pedestrians from the background in the image. If Th_v and Th_ir are too large, many pixels representing pedestrians are misjudged as background, the motion foreground becomes incomplete, and the final pedestrian count is too small. If they are too small, many background pixels are misjudged as pedestrians, the motion foreground contains a lot of noise, and the final count is too large.
Different thresholds Th_v are examined here as an example; the analysis for Th_ir is similar. For the same visible light image, the motion foreground extraction results under different thresholds are shown in Figure 8, and the corresponding pedestrian detection results are shown in Table 4. According to Figure 8 and Table 4, when the threshold is too small (Th_v = 30), the motion foreground contains more noise and the final pedestrian count is too large; when it is too large (Th_v = 70), the motion foreground is incomplete and the count is too small. The thresholds Th_v and Th_ir therefore need to be set according to the characteristics of the data and the actual situation.

Contribution of the Features.
The method in this paper is based on six features (Section 2.3, Feature Extraction). To evaluate the contribution of each feature to the final result, the average elastic coefficient is introduced: the larger a feature's average elastic coefficient, the greater its contribution to the final result. The average elastic coefficient of feature j is

η_j = b_j · (x̄_j / ȳ),

where b_j is the regression coefficient of the feature, x̄_j is the average of the independent variable, and ȳ is the average of the dependent variable.
The average elastic coefficient of each feature is calculated separately for the visible light model and the infrared model; the results are shown in Figure 9. We find that the position feature recording the distance from the pedestrian to the lens is the most influential feature in both models, because that distance is the most important factor in the testing scenario of this paper. The size-related features are also reasonable predictors of crowd density, reflecting the number of pedestrians from different angles. One possible explanation for the low contribution of the remaining features is that the camera's field of view is parallel to the road, not vertical or oblique, in the testing scenario of this paper.
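Given fitted coefficients (e.g. from the PLSR model of Section 3), the average elastic coefficients follow directly from the formula above, as sketched here:

```python
import numpy as np

def elastic_coefficients(B, X, y):
    """Average elasticity of each feature: the regression coefficient scaled
    by the ratio of the feature mean to the response mean, making the
    contributions comparable across features with different units."""
    return B * X.mean(axis=0) / y.mean()
```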

The Efficiency of Background Update.
Visible light video detection relies on background differencing to obtain the motion foreground. Since the environment around the pedestrian waiting area varies greatly within a day, real-time background updating is a must. Here, the efficiency of the Kalman filter based background update method is tested: three rounds of tests were performed on 102 images, with results shown in Figure 10 and Table 5. (Note: this test was performed on a laptop, and the test software is MATLAB 2017b.) According to Table 5 and Figure 10, the average time per background update is about 0.0168 s, which is adequate for practical applications.

Accuracy Verification.
The accuracy of the proposed method is verified with 8-fold cross validation. The 124 images are randomly divided into 8 groups; each group in turn serves as the test set while the remaining 7 groups form the training set. The experimental results are shown in Figure 11 and Table 6. As Figure 11 shows, the accuracy of visible light detection alone is sometimes higher and sometimes lower than that of infrared detection alone, but the accuracy of information fusion detection is always the highest. Combined with Table 6, the average accuracy of information fusion detection is higher than that of the individual visible light and infrared detections when both daytime and nighttime scenarios are considered. Moreover, since there is no public dataset containing both infrared and visible light images, we tested other methods on the dataset of this paper to show the advantage of the proposed method. Table 7 lists the prediction accuracy comparisons: the fusion method provides lower MSE and higher accuracy than the existing methods. Therefore, for 24-hour pedestrian counting in outdoor scenarios, fusing visible light video and infrared video from the perspective of environment perception is more effective than using a single video source (visible light or infrared).

Conclusion
In this study, a method that fuses visible light video and infrared video based on environment perception has been proposed for estimating the number of pedestrians. The method combines visible light information with infrared information to enable pedestrian counting in complex outdoor scenarios. The proposed approach rests on two components: estimation of the number of pedestrians from single-source video, and information fusion of the multisource detection results. First, PLSR was applied to combine dimensionality reduction with regression analysis to establish the pedestrian number estimation model for single-source video; this reduces redundancy in the feature set and effectively handles the multicollinearity between variables. Meanwhile, the ambient brightness was employed to identify the scene of each image and to integrate the visible light and infrared detection results. The empirical analyses showed that, for 24-hour pedestrian counting in outdoor scenarios, the proposed method outperforms methods using a single information source, which expands the application scenarios of pedestrian counting and provides a reference for related research. Future work includes expanding the sample size of the empirical analyses and testing the feasibility of deep learning networks for identifying different scenarios (day, night, rain, fog, etc.). In addition, heavy fog or rain substantially increases video noise, and reducing the interference of such noise on pedestrian counting is a challenging issue to be investigated. Continued improvement of the information fusion model and the feasibility of new sensing equipment (such as laser scanners) for estimating the number of pedestrians will also be explored.
Figure 2: (a) original image; (b) background image; (c) background difference result; (d) ROI mask.

Figure 3: The RGB analysis of infrared images.

Figure 4: (a) grayscale infrared image; (b) ROI mask; (d) final infrared processing result.

Figure 5: Examples of experimental data.

Figure 6: Schematic diagram of information fusion in a daytime scenario.

Figure 7: Schematic diagram of information fusion in a nighttime scenario.

Figure 8: The results of motion foreground extraction with different thresholds Th_v.

Figure 9: The calculation results of the average elastic coefficient.

Figure 10: The efficiency of the background update.

Table 1: Visible light detection results in a daytime scenario.

Table 2: Infrared detection results in a daytime scenario.

Table 3: Infrared detection results in a nighttime scenario.

Table 4: The results of pedestrian counting with different thresholds Th_v.

Table 5: The efficiency of the background update.

Table 6: The results of 8-fold cross validation.

Table 7: The comparison of MSE and accuracy.