Pedestrian Detection in Crowded Environments through Bayesian Prediction of Sequential Probability Matrices

In order to safely navigate populated environments, an autonomous vehicle must be able to detect human shapes using its sensory systems, so that it can properly avoid a collision. In this paper, we introduce a Bayesian approach to the Viola-Jones algorithm, as a method to automatically detect pedestrians in image sequences. We present a probabilistic interpretation of the basic execution of the original tool and develop a technique to produce approximate convolutions of probability matrices with multiple local maxima.


Introduction
Being able to detect and avoid pedestrians is an essential feature of autonomous vehicles if they are to guarantee safe behavior in populated environments. However, automatically detecting human shapes in images is a very complex task for a computer vision system, and it has been widely studied.
One of the most common frameworks in the literature is Viola-Jones [1], based on feature training and classifier cascades, which is explained in detail in Section 2.1. This technique has been improved by its authors by considering object motion [2,3] and also by applying several classifiers simultaneously [4] or RealBoost to improve weak classifiers [5].
The main contribution of this paper is the introduction of a Bayesian approach to pedestrian detection methods (exemplified by, but not limited to, the Viola-Jones framework), achieved by creating a statistical interpretation of the basic execution of the original algorithm and developing a technique to produce approximate convolutions of probabilistic matrices with multiple local maxima. This aims to increase the precision of the framework for its usage on autonomous vehicles, in order to more efficiently detect and avoid obstacles and pedestrians in image sequences.
Furthermore, the method we present can be used with both preprocessed binary results and unaltered probabilistic elements. As the latter are commonly returned by the sensors of a robot, this allows for greater flexibility and a more accurate management of the uncertainty of the available data.
Related Work.
Another important algorithm for detecting pedestrians consists of using Histograms of Oriented Gradients (HOG) to define the features on an image [6]. This algorithm has been implemented for FPGA-based accelerators [7] and GPUs [8] and combined with Support Vector Machine (SVM) classifiers [9,10]. Variations of histogram-based detection methods, such as Co-occurrence HOG [11] and combinations with wavelet methods [12], also exist. Bayesian methods have also been applied to the problem of pedestrian detection [13].
Both HOG and Viola-Jones algorithms are included in the official release of OpenCV [14]. Although the former usually provides very precise detection results, as studied in [15], it has been shown to perform slightly slower than the latter and is therefore less suitable for a real-time operation like pedestrian detection for a moving vehicle.

Viola-Jones Framework.
The Viola-Jones object detection framework uses object features which, similarly to Haar-like features [16], are defined by additions and subtractions of the sums of pixel values within rectangular, nonrotated areas of an image. The different types of features used by Viola-Jones are shown in Figure 1.
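Thanks to the usage of integral images, such that

$$II(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y'),$$

where $II$ is the integral of image $I$, these operations can be done in constant time. For example, the sum of all the pixels of the rectangle in Figure 2 would be calculated as

$$II_D - II_B - II_C + II_A,$$

since each $II_X$ value is the sum of all the pixels in the rectangle defined by the opposite corners $O$ and $X$, where $O$ is the image origin and $A$, $B$, $C$, and $D$ are the top-left, top-right, bottom-left, and bottom-right corners of the rectangle.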
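The constant-time rectangle sum can be illustrated with a minimal NumPy sketch. The function names and the zero-padded layout are choices made for this example, not part of the original framework description.

```python
import numpy as np

def integral_image(image):
    """Cumulative sum over rows and columns, padded with a leading zero row and column."""
    ii = np.cumsum(np.cumsum(image, axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, top, left, bottom, right):
    """Sum of image[top:bottom, left:right] using four lookups in the integral image."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

image = np.arange(20).reshape(4, 5)
ii = integral_image(image)
total = rect_sum(ii, 1, 2, 3, 4)  # same region as image[1:3, 2:4]
```

The padding avoids special cases for rectangles that touch the top or left image border; whatever the rectangle size, the cost is four array reads.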
A set of classifiers is then trained using AdaBoost [17], and a cascade architecture allows the result to be used in real time by immediately discarding a sample as soon as one classifier rejects it, as shown in Figure 3.
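The early-rejection behavior of a cascade can be sketched as follows. The stages here are hypothetical threshold tests standing in for trained AdaBoost classifiers; only the control flow is meant to be illustrative.

```python
def cascade_accepts(sample, stages):
    """Return True only if every stage accepts; stop at the first rejection."""
    for stage in stages:
        if not stage(sample):
            return False
    return True

calls = []  # records which stages were actually evaluated

def make_stage(name, threshold):
    """Build a toy stage (an assumption for this example) that accepts scores above a threshold."""
    def stage(sample):
        calls.append(name)
        return sample >= threshold
    return stage

stages = [make_stage("s1", 0.2), make_stage("s2", 0.5), make_stage("s3", 0.8)]
accepted = cascade_accepts(0.4, stages)
```

In this run the third stage is never evaluated, which is where the real-time saving comes from: most image windows are rejected by the first, cheapest stages.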

Bayesian Model.
Let $O$ and $V$ be two random variables.
(i) $O$ expresses the existence or absence of objects of interest (in our case, pedestrians) within an image, for each pixel location.
(ii) $V$ shows an equivalent value, as returned by the Viola-Jones detection when applied to an image.
It is possible to use $V$ as evidence to evaluate the degree of belief of proposition $O$ (i.e., $P(O \mid V)$), by applying Bayes' theorem:

$$P(O \mid V) = \frac{P(V \mid O)\, P(O)}{P(V)}.$$

The common use of a Bayesian model is to weed out wrong positive detections by comparing them to previous observations. However, when detecting pedestrians this decision could be damaging to the procedure, since false positives are preferable to false negatives: a missed detection involves immediate danger, whereas a false detection would only cause a less efficient route.
Therefore, we propose a reverse application of Bayes' theorem, which filters absences of objects rather than detections, by considering the reverse values of the presented variables:

$$P(\neg O \mid \neg V) = \frac{P(\neg V \mid \neg O)\, P(\neg O)}{P(\neg V)},$$

where $P(\neg V \mid \neg O)$ and $P(\neg O)$ are calculated as explained in the following subsections.
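As a rough illustration of the reverse update, the sketch below applies Bayes' theorem per pixel to the probability of absence. The evidence term is expanded with the law of total probability, which requires the probability of no detection when an object is present; that `miss_prob` parameter is an assumption of this example and is not specified in the text.

```python
import numpy as np

def absence_posterior(lik_absent, prior_absent, miss_prob):
    """Per-pixel P(no object | no detection) via Bayes' theorem.

    lik_absent   : P(no detection | no object), the likelihood matrix
    prior_absent : P(no object), the prior matrix
    miss_prob    : P(no detection | object), a hypothetical input
    """
    lik_absent = np.asarray(lik_absent, dtype=float)
    prior_absent = np.asarray(prior_absent, dtype=float)
    numerator = lik_absent * prior_absent
    # Law of total probability over "object absent" and "object present".
    evidence = numerator + miss_prob * (1.0 - prior_absent)
    # Cells with zero evidence carry no information; report 0 there.
    return np.divide(numerator, evidence,
                     out=np.zeros_like(numerator), where=evidence > 0)

likelihood = np.array([[0.9, 0.5], [0.0, 1.0]])
prior = np.array([[0.5, 0.5], [0.5, 0.2]])
posterior = absence_posterior(likelihood, prior, miss_prob=0.1)
```

A cell whose absence likelihood is zero (every detection covers it) keeps a zero posterior of absence regardless of the prior, so strong detections are never filtered out by this update.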

Likelihood.
The default behavior of the Viola-Jones detection method, for a given image, is to return a set of rectangles within which objects of interest have been found. A binary matrix can be produced from these areas, such that each cell is set to 1 if it belongs to one of them, and 0 otherwise. In our work, the binary matrix corresponding to the $i$th rectangle is named $R_i$. Some of these marked areas may be superfluous (false positives), and others may overlap. The more rectangles that overlap over a group of pixels, the more likely that group is to contain an actual object of interest.
The original Viola-Jones algorithm allows for a minimum overlap restriction: a rectangle would only be valid if it can be computed as the intersection of a given number of overlapping detections.
Instead, we propose producing a detection matrix, such that the value of each of its cells is equal to the number of rectangles that overlap over its corresponding pixel (Figure 4). This matrix is equal to the sum of the binary matrices of all the observed detections.
The likelihood matrix for the probability of absence of objects of interest within an image, $P(\neg V \mid \neg O)$, is proportional to the complement of the detection matrix; for $n$ detections, this would be

$$P(\neg V \mid \neg O) \propto 1 - \frac{1}{n} \sum_{i=1}^{n} R_i.$$

The concept of associating a weight value to each detection was also presented in the Soft Cascade method [18]. Its results are returned as rectangular areas, but unlike Viola-Jones, these are isolated and as such cannot be processed into probabilistic matrices. Preliminary tests showed that, because of this restriction, the accuracy of this technique is noticeably inferior to that of the probabilistic interpretation of Viola-Jones that we present in this work. Therefore, we chose not to use Soft Cascade in our experiments.
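The construction of the detection matrix and of the absence likelihood can be sketched as below. Rectangles are assumed to be given in (x, y, width, height) form, and dividing by the number of detections is one normalization consistent with the proportionality stated above; both are assumptions of this example.

```python
import numpy as np

def rectangle_mask(shape, rect):
    """Binary matrix with ones inside the rectangle (x, y, width, height)."""
    x, y, w, h = rect
    mask = np.zeros(shape, dtype=int)
    mask[y:y + h, x:x + w] = 1
    return mask

def detection_matrix(shape, rects):
    """Per-pixel count of overlapping rectangles (sum of the binary matrices)."""
    total = np.zeros(shape, dtype=int)
    for rect in rects:
        total += rectangle_mask(shape, rect)
    return total

def absence_likelihood(shape, rects):
    """Complement of the detection matrix, scaled to [0, 1] by the number of detections."""
    if not rects:
        return np.ones(shape)
    return 1.0 - detection_matrix(shape, rects) / len(rects)

rects = [(0, 0, 3, 2), (1, 1, 3, 2)]
counts = detection_matrix((4, 6), rects)
likelihood = absence_likelihood((4, 6), rects)
```

Pixels covered by every rectangle get an absence likelihood of 0, pixels covered by none get 1, and partially covered pixels fall in between.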

Prior.
The usage of Bayes' theorem involves an evolution of the resulting posterior probability function, in order to produce the prior probability function for the following iteration of the algorithm (typically a convolution is applied).
Ideally, at each time step $t$, the location of an object is determined by a certain probability distribution. The distribution of the appearance of objects of interest in our experiments is extracted from the normalized addition of overlapping binary rectangular distributions, which is asymmetrical and has a flat top. A new probability distribution was developed to approximate this behavior.
Let $D$ be a set of detections as returned by the Viola-Jones method for a particular object of interest. An object can be represented as a $\langle |D|, A, B \rangle$ tuple, such that
(i) $|D|$ is the number of elements in set $D$,
(ii) $A$ is the minimal rectangular area that holds the intersection of all the elements in $D$, and
(iii) $B$ is the minimal rectangular area that holds the union of all the elements in $D$.
Using these data, a two-dimensional function which simulates the summation of all the elements in $D$ was modeled. If considering a single dimension, rectangles $A$ and $B$ can be seen as two segments $a_1 a_2$ and $b_1 b_2$, respectively, where $b_1 < a_1 < a_2 < b_2$ (Figure 5). Consider a function $f$ defined over these segments, whose shape follows the cross section in Figure 5. The shape of $f$ suits our needs, but its height is scaled down so that, for two dimensions, the summation of the detections of a single object can be calculated from $|D|$ and the limits of areas $A$ and $B$.
A probability matrix can therefore be generated for all the detected objects of interest, using the tuples which define them.
In order to isolate each object of interest among the added distributions of all the detections in an image, we locate the maximum value in the probability matrix and analyze its adjacent cells to define a tuple, such that
(i) area $A$ contains all the cells that share a maximum probability value $|D|$, caused by the overlapping of all the involved detection rectangles, and
(ii) area $B$ contains all the cells that are delimited by local minima and zero values, so that we can assume that all nonzero cells that are not contained in $B$ belong to unrelated detections.
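The exact function used by the method is not reproduced here. As a hedged illustration, the sketch below uses a trapezoid, which is one shape that matches the description: flat over the intersection rectangle, zero outside the union rectangle, and possibly asymmetric slopes in between. The trapezoid itself, the product of two one-dimensional profiles, and the scaling by the number of detections are all assumptions of this example.

```python
import numpy as np

def trapezoid(x, b1, a1, a2, b2):
    """0 outside [b1, b2], 1 on [a1, a2], linear in between (requires b1 < a1 <= a2 < b2)."""
    x = np.asarray(x, dtype=float)
    rising = (x - b1) / (a1 - b1)
    falling = (b2 - x) / (b2 - a2)
    return np.clip(np.minimum(rising, falling), 0.0, 1.0)

def object_surface(shape, count, inner, outer):
    """Approximate summed detections of one object.

    inner, outer : (left, right, top, bottom) limits of the intersection
                   and union rectangles, respectively (hypothetical layout)
    count        : number of detections of the object
    """
    height, width = shape
    profile_x = trapezoid(np.arange(width), outer[0], inner[0], inner[1], outer[1])
    profile_y = trapezoid(np.arange(height), outer[2], inner[2], inner[3], outer[3])
    return count * np.outer(profile_y, profile_x)

surface = object_surface((8, 12), count=3, inner=(4, 6, 3, 4), outer=(2, 9, 1, 7))
```

The surface peaks at the detection count over the intersection area and decays to zero at the limits of the union area, mimicking the stacked-rectangle profile with a smooth slope.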
After an object is located, its data are stored and it is removed from the probability matrix. This procedure is repeated until the matrix is empty. Once all objects are extracted, they are matched to those of previous time steps to study their relative movement. When the objects involved are clearly individual, their movements can be analyzed and predicted separately. In our case, their number and their correspondences between frames are unknown.
Using a minimum mean square error estimation, each object is then added to a previously stored trajectory, which is used to predict new values for the following time step, using a linear regression over the tuple values.
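A minimal sketch of this step, assuming each object is flattened into a numeric vector of its tuple values and each trajectory is a list of such vectors, one per time step. How the minimum mean square error criterion is applied is not detailed in the text; comparing a new observation against each trajectory's extrapolated prediction is an assumption of this example.

```python
import numpy as np

def predict_next(history):
    """Fit a line to each tuple component over time and extrapolate one step."""
    history = np.asarray(history, dtype=float)
    if len(history) < 2:
        return history[-1]  # not enough points for a regression
    steps = np.arange(len(history))
    slope, intercept = np.polyfit(steps, history, deg=1)
    return slope * len(history) + intercept

def match_trajectory(observation, trajectories):
    """Index of the trajectory whose prediction has the least mean squared error."""
    observation = np.asarray(observation, dtype=float)
    errors = [np.mean((predict_next(t) - observation) ** 2) for t in trajectories]
    return int(np.argmin(errors))

trajectories = [
    [[2, 10, 20], [2, 12, 22], [2, 14, 24]],
    [[1, 50, 60], [1, 49, 59]],
]
prediction = predict_next(trajectories[0])
best = match_trajectory([2, 16.5, 25.5], trajectories)
```

Once an observation is matched, it would be appended to the selected trajectory so that the next regression includes it.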
The prediction values are finally used to generate the prior probability matrix using (9) (Figure 6).

Results and Discussion
Our method was tested over twelve image sequences, described in Table 1 and exemplified by Figure 7. Dataset ETSII was recorded in the parking lot of the Computer Engineering School of Universidad de La Laguna. Datasets ITER1 and ITER2 were filmed in the outer limits and in the parking lot of the Institute of Technology and Renewable Energy (ITER) facilities in Tenerife (Spain), respectively.
These three image sequences were captured by the visual sensors of the VERDINO prototype (Figure 8), a modified EZ-GO TXT-2 golf cart equipped with computerized steering, braking, and traction control systems. Its sensor system consists of a differential GPS, an Inertial Measurement Unit (IMU), an odometer, three Sick LMS221-30206 laser range finders, two thermal stereo cameras, and two Santachi DSP220x optical cameras.
Datasets BAHNHOF, JELMOLI, and SUNNY DAY were downloaded from Andreas Ess' Robust Multi-Person Tracking from Mobile Platforms website at the Swiss Federal Institute of Technology. These image sequences were recorded using a pair of AVT Marlins F033C and have been used in publications [19-22].
Figure 7: Example frames for all datasets referenced in Table 1.
Datasets CAVIAR1 to CAVIAR4 belong to the Context Aware Vision using Image-based Active Recognition (CAVIAR) project [23] and were recorded in a shopping center in Portugal using a static camera. The selected image sequences correspond to the corridor views of clips WalkByShop1 (CAVIAR1), OneShopOneWait1 (CAVIAR2), OneShopOneWait2 (CAVIAR3), and ThreePastShop1 (CAVIAR4).
Dataset DAIMLER corresponds to the Daimler pedestrian detection benchmark dataset, introduced in [24], and dataset CALTECH corresponds to sequence V002 from testing set seq06 of the Caltech pedestrian detection benchmark [15,25].Both datasets were recorded from a vehicle driving through regular traffic in an urban environment.
Ten tests were conducted over each image dataset; the average results are shown in Figures 9 and 10. As explained in Section 2.2, the main goal of our detection enhancement method is to reduce the number of false negatives returned by the Viola-Jones framework. As such, classic analysis techniques such as receiver operating characteristic (ROC) and detection error tradeoff (DET) curves, which depend on the number of false positives in the results, do not properly display the improvement introduced by our approach. We instead present the average ratio between the number of false negatives returned by both the original and the enhanced detection methods and the number of true positives found in the input frames.
We observed that our Bayesian approach always provides less conservative detection rates than Viola-Jones, successfully lowering the rate of false negatives for all datasets. Results were especially good for the ETSII, ITER, CAVIAR, and DAIMLER datasets. The sequences for these sets have good visibility, which results in more accurate detections by the original method and, consequently, a higher improvement introduced by our approach.
The rest of the datasets have higher occlusion rates and feature pedestrians in poses and locations that complicate their detection, which reduces the improvement that Bayesian processing can provide. This effect was especially noticeable for the CALTECH dataset, which features very few clearly visible pedestrians.

Conclusions
We have developed a Bayesian approach to the Viola-Jones detection method and applied it to a real case where pedestrians must be located and avoided by a self-guided device. Our method describes a statistical modification of the original tool, which is combined with a form of approximate convolution of two-dimensional probability matrices with multiple local maxima.
Our algorithm has been shown to improve the precision of the results by restricting a probabilistic matrix returned by the original method to the area where objects are expected to appear, according to their previously observed movements.
It was found that our method behaves best when pedestrians are clearly visible, so that the detections by the original method can be properly enhanced by Bayesian processing. More accurate detection algorithms are expected to improve the results of our approach in situations of high visual occlusion. This proposal serves as grounds for further research.

Figure 1: Features used by the Viola-Jones framework. The value of each feature is the sum of the pixels in the white area minus the sum of the pixels in the gray area.

Figure 2: Example of a rectangle in an integral image. The sum of its pixels would be calculated as $II_D - II_B - II_C + II_A$.

Figure 4: Unaltered Viola-Jones result for a minimum of three overlapping detections (a) and corresponding likelihood probability function (b). Brighter areas represent a higher probability of presence of objects.

Figure 5: Example of theoretical cross section of the approximate probability distribution for an object of interest.

Figure 9: Comparison of the performances of the unaltered Viola-Jones tool and the presented Bayesian method.

Figure 10: Average false negative rate for each complete image dataset.
$\in \mathbb{R}^2$, and where $A_l$, $A_r$, $A_u$, and $A_d$ are, respectively, the leftmost, rightmost, upper, and lower limits of area $A$, and $B_l$, $B_r$, $B_u$, and $B_d$ are the corresponding limits of area $B$.

Table 1: Features of the image datasets.