Person Detection for an Orthogonally Placed Monocular Camera

Counting of passengers entering and exiting means of transport is one of the basic functionalities of passenger flow monitoring systems. Exact numbers of passengers are important in areas such as public transport surveillance, passenger flow prediction, transport planning, and transport vehicle load monitoring. To allow mass utilization of passenger flow monitoring systems, their cost must be low. As the overall price is mainly given by the prices of the sensor and the processing unit, we propose the utilization of a visible spectrum camera and data processing algorithms of low time complexity to ensure a low price of the final product. To guarantee the anonymity of passengers, we suggest orthogonal scanning of a scene. As the precision of the counting is significantly influenced by the precision of passenger recognition, we focus on the development of an appropriate recognition method. We present two contrasting approaches which can be used for passenger recognition in means of transport with and without entrance steps, or with split level flooring. The first approach is the utilization of an appropriate convolutional neural network (ConvNet), which is currently the prevailing approach in computer vision. The second approach is the utilization of histograms of oriented gradients (HOG) features in combination with a support vector machine classifier. This approach is a representative of classical methods. We study both approaches in terms of practical applications, where real-time processing of data is one of the basic assumptions. Specifically, we examine classification performance and time complexity of the approaches for various topologies and settings, respectively. For this purpose, we form and make publicly available a large-scale, class-balanced dataset of labelled RGB images. We demonstrate that, compared to ConvNets, the HOG-based passenger recognition is more suitable for practical applications.
For an appropriate setting, it defeats the ConvNets in terms of time complexity while keeping excellent classification performance. To allow verification of theoretical findings, we construct an engineering prototype of the system.


Introduction
In passenger transport, person flow monitoring has an indispensable importance. In some areas of public transport, passenger flow monitoring systems are employed to automate this task. One of the basic measures, which must be provided by the system, is the number of transported passengers. A precise counting of passengers entering and exiting means of transport has a positive effect on public transport surveillance, passenger flow prediction, transport planning, transport vehicle load monitoring, station control and management, and cost optimization [1,2].
To ensure a robust and precise counting of passengers in real time, a passenger flow monitoring system must be based on an appropriate imaging system and data processing algorithms. In order to allow a mass deployment of such a monitoring system, a low-cost final solution is equally important. The solution should also meet legal requirements where passenger anonymity is of great importance. Specifically, identification of individuals according to their faces must be avoided.
The imaging system must ensure the acquisition and processing of data, i.e., its basic components are a sensor and a processing unit. In order to develop an inexpensive solution, a low price of both components is crucial. While the lower price limit of the processing unit is mainly given by the complexities of the data processing algorithms used, the lower price limit of the sensor is given by the sensing technology used. Radar sensors [3], laser scanners [4], 3D laser scanners [5], or infrared sensors [6] are applicable for the counting of passengers. All these sensors naturally guarantee a high level of passenger anonymity. Their main drawbacks are high prices and frequent errors in the counting [7,8]. For these reasons, cameras operating in the visible spectrum of light are preferably used for the counting of persons [9]. Conventional cameras (cameras operating at wavelengths of visible light) are significantly cheaper than the previously mentioned sensors. The cameras can be combined with depth sensing devices [10]. The fusion of data can result in a more balanced trade-off between false positives and false negatives [11]. On the other hand, a depth sensing device increases the final price of the imaging system. The automated counting of persons in a scene is usually carried out in colour images or in sequences of colour images. Many data processing algorithms aimed at precise counting of persons in crowded scene images have been presented [9]. Most of them are designed for overriding installations of cameras. Cameras installed at public as well as at private places usually look down on scenes from angles that typically range between 40° and 80° (from the ground).
Considering the low subject distances in means of transport (the distance between a camera and a passenger), we conclude that the anonymity of passengers is not guaranteed for such a setup (i.e., data processing algorithms aimed at processing of such images cannot be used for the counting of passengers). Only orthogonally captured images (camera placed above a scene, looking directly down on the scene) assure a high level of passenger anonymity (Figure 1).
A data processing chain aimed at counting of persons in orthogonally captured images is composed of three fundamental steps: person detection, multiperson tracking, and person counting (Figure 2). In the first step, a processed image is examined for the presence of persons. The following step is the tracking, where all persons detected within the first step are matched with existing tracking models of persons. If a person cannot be associated with any existing model, a new tracking model is initialized. The last step of the chain is the counting. If a person described by a tracking model leaves the scene, which is usually defined by virtual lines, counting is triggered [11]. Naturally, an integral part of this data processing chain is an algorithm which splits video data provided by a camera into individual images.
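The counting step can be illustrated with a minimal sketch. It assumes, hypothetically, that a tracking model yields a sequence of head positions along the axis perpendicular to a single virtual line; the function name, the coordinate convention, and the meaning of "entry" and "exit" are illustrative assumptions, not details of the described system.

```python
def count_crossings(track_y, line_y):
    """Count crossings of one virtual line by one tracked person.

    track_y: successive positions of the person along the axis
             perpendicular to the line (hypothetical tracker output).
    line_y:  position of the virtual line on the same axis.
    Returns (entries, exits); a crossing towards larger coordinates is
    treated as an entry and the opposite as an exit (assumed convention).
    """
    entries = exits = 0
    # Compare each position with the next one to detect a line crossing.
    for prev, curr in zip(track_y, track_y[1:]):
        if prev < line_y <= curr:
            entries += 1
        elif curr < line_y <= prev:
            exits += 1
    return entries, exits
```

A real system would keep one such counter per tracking model and sum the results over all persons that leave the scene.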
Accuracy and time complexity of the data processing chain are primarily given by the accuracy and time complexity of the person detection. Person detection is a process of location and recognition of persons in images. Within this process, possible locations of persons (regions) are proposed using an appropriate technique. The regions determine candidate object images, which are classified using an appropriate object recognition system. The proposition of regions can be carried out using an exhaustive method such as a sliding window [12] or using an advanced time-efficient method such as a selective search algorithm [13]. In modern object detection systems, both the location and the recognition are carried out by a single deep neural network [14][15][16]. These systems are characterized by high detection accuracy but high time complexity.
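The exhaustive sliding-window proposal mentioned above can be sketched in a few lines. The window side length and the stride are illustrative assumptions, and the function name is hypothetical; the sketch only shows why the number of candidate object images grows quickly with the frame size.

```python
def sliding_windows(frame_width, frame_height, window=51, stride=17):
    """Yield (x, y, width, height) boxes that cover the frame exhaustively.

    window and stride are assumed values chosen for illustration only.
    """
    for y in range(0, frame_height - window + 1, stride):
        for x in range(0, frame_width - window + 1, stride):
            yield (x, y, window, window)
```

Every box produced this way must be passed to the recognition system, which is why a time-efficient proposal method and a fast classifier both matter for real-time operation.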
As the analysis shows, a low-cost passenger counting system should be based on a conventional camera (due to low prices of visible light cameras). In order to guarantee the anonymity of passengers, the camera must be placed above a scene, looking directly down on the scene. For the data processing, methods capable of processing orthogonally captured images must be used. The resulting data processing chain must be robust and precise. To keep the low-cost requirement, the time complexity of the methods should be as low as possible. From this perspective, the detection of passengers seems to be the weak link in the chain.
As the time complexity of the single deep neural network detectors is high [14][15][16], we choose to implement the passenger detector as a two-stage system. When using a robust and time-efficient region proposal method such as the selective search algorithm [13], the accuracy and computational complexity of the detector are mainly given by the object recognition method used. In colour images, the recognition of persons typically relies on optical flow features [11,17,18]. An alternative approach to the detection of persons is the detection of their heads and shoulders [19]; however, a head itself can provide a strong feature due to its almost circular shape. The counting of heads is typically used for counting of persons in dense crowd images [20,21]. Recognition of heads in orthogonally captured images can also rely on optical flow analysis [22]. The main disadvantages of optical flow-based methods are their high computational complexity and noise sensitivity [23].
Considering the importance of passenger recognition for their counting, we focus on the development of a price-competitive and time-efficient object recognition system. As the system is aimed at recognition of passengers, we name it "the passenger recognition system." As the trend in object recognition is still clearly heading towards convolutional neural networks (ConvNets) [24,25], we examine the performance of ConvNets for passenger recognition. Usually, ConvNet-based object recognition systems have good classification performance, but their time complexity is typically high. For this reason, we propose a competitive approach which is based on histograms of oriented gradients (HOG) features [26] and on a support vector machine (SVM) classifier. For an appropriate setting of parameters, HOG-based object recognition can have good classification performance while keeping low time complexity [27].
Recognition of passengers in orthogonally captured images using the HOG features and the SVM classifier, based on object images which comprise heads and shoulders of passengers, has proven useful in scenes without height differences [19]. Modern public means of transport are increasingly low-floor (i.e., there is no or negligible height difference in the area of a doorway), but a substantial part of operated buses, trams, trains, and trolleybuses are high-floor [28][29][30]. Considering this fact, we conclude that the robustness of the HOG-based passenger recognition system must be verified in the context of variable distances between a camera lens and passenger heads. We also expect that the time complexity of the system can be reduced once the object image contains only the heads of the passengers (omitting the shoulders results in smaller object images and consequently reduces the data processing time). We deduce the suitability of such an approach from remarkable results of HOG-based object recognition systems on similar tasks, e.g., for grape detection [31,32] (see Figure 3; the round shape of grapes is similar to the shape of heads).
Within this article, we study the classification performances and time complexities of passenger recognition systems. The systems are aimed at recognition of passengers in orthogonally captured images, where the recognition quality is not adversely affected by the variable distance between the passenger and camera sensor. The passenger recognition systems are based either on ConvNets or on HOG features. Both approaches rely on the detection of heads. In the case of ConvNet-based systems, we consider ConvNets of various topologies. In the case of the HOG-based system, we examine various settings of parameters. We validate the theoretical results in a real-world application. For this purpose, we develop an engineering prototype of the system.

Engineering Prototype of the System.
Two basic components of the system are the sensor and the processing unit (Figure 4). In our case, we use an industrial colour camera Basler acA2500-60uc [34] as the sensor. The camera is placed in a means of transport, at the ceiling near a door. The optical axis of the camera is perpendicular to the vehicle floor. Considering the construction of means of transport, we expect the average subject distances to be from 0.2 m to 1 m. The camera should monitor an area of about 2.4 m × 2.0 m. With respect to these parameters, we equipped the camera with a Computar M3514-MP lens [35]. The output of the camera (i.e., the input of the data processing chain) is a sequence of RGB images.
We use the prototype for data collection as well as for the validation of the proposed recognition methods, i.e., the prototype must be capable of processing acquired images in real time. In order to allow testing of all proposed solutions (including solutions based on ConvNets), we use a single-board computer VOB-P3310. It offers an NVIDIA Tegra X2 (2.0 GHz, 6 cores) CPU together with 8 GB RAM and it

Passenger Recognition.
Candidate object images may or may not contain complete heads of passengers (Figure 5). According to this criterion, the images are classified either as "head" or "not head" by a passenger recognition system. Inputs of the recognition system are size-normalized RGB object images of dimensions 51 × 51 pixels ([51, 51] px). Its outputs are labels of the images, where the labels "head" and "not head" are allowed.

Passenger Recognition Based on ConvNets.
In terms of classification accuracy, the state-of-the-art object recognition systems are based on one of the successful deep ConvNet architectures [37]. Mostly, they process raw image data (i.e., no image preprocessing is usually carried out). They consist of multiple layers arranged in a feed-forward manner. Upper and lower level layers ensure feature extraction and classification of object images, respectively. The feature extraction is usually carried out using convolutional and pooling layers, where the convolutional layers are typically combined with a ReLU activation function. The classification is generally ensured by a softmax activation function. The function processes data at the output of the last network layer, where a fully connected layer is placed. The number of neurons of this layer corresponds to the number of object classes [38]. The main drawback of the state-of-the-art deep ConvNet architectures is their high computational demands.
The passenger recognition can be simply implemented as a ConvNet of an appropriate architecture, where the network ensures both feature extraction and classification (Figure 6). As a low time complexity of the system is crucial, we test the performance of five ConvNet architectures of different complexities. The simplest architecture, Net1, consists of one convolutional layer (32 filters with 3 × 3 px kernels), one max-pooling layer (2 × 2 px nonoverlapping pools), and two fully connected layers of 512 and 2 neurons, respectively. The classification is carried out using the softmax function. In the second simplest architecture, Net2, we replace the convolutional and the max-pooling layers by the sequence of layers: convolutional layer (32, 3 × 3) + convolutional layer (32, 3 × 3) + max-pooling layer + convolutional layer (64, 3 × 3) + convolutional layer (64, 3 × 3) + max-pooling layer, where 2 × 2 px nonoverlapping pools are used at both pooling layers. In both networks, we use ReLU activation functions at the convolutional and fully connected layers. To reduce overfitting, we place dropout layers after each max-pooling layer and after the first fully connected layer in both networks. The dropout rate is 25% and 50% for the max-pooling and the fully connected layers, respectively.
Figure 3: Comparison of head (three images in (a)) and grape images (three images in (b)). For each category, we provide an original RGB image, an image obtained by filtering of the RGB image using the Canny edge detector [33], and gradients obtained using a HOG descriptor [26], respectively.
The remaining three architectures studied within this article are the well-known LeNet-5 [39,40], AlexNet [41], and VGG-16 net [42]. The networks are ordered according to their complexities. LeNet-5 is the pioneering ConvNet of a relatively simple architecture. AlexNet is probably the most cited deep ConvNet with a huge number of industrial and engineering applications. VGG-16 is a representative of very deep ConvNets. As it consists of only 13 convolutional and 3 fully connected layers, the real-time processing of data by VGG-16 implemented in the engineering prototype (Section 2.1) is still possible.
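To give a feeling for the size of the simplest network, the following sketch counts the trainable parameters of Net1 from the layer sizes stated above. The text does not specify the padding of the convolutional layer; unpadded ("valid") convolution is assumed here, and the function name is hypothetical.

```python
def net1_parameter_count(input_side=51, channels=3, filters=32,
                         kernel=3, pool=2, hidden=512, classes=2):
    """Count weights and biases of Net1 layer by layer."""
    # Assumption: convolution without padding, so each side shrinks by kernel - 1.
    conv_side = input_side - kernel + 1
    conv_params = kernel * kernel * channels * filters + filters
    # Nonoverlapping pooling; an incomplete border pool is dropped.
    pooled_side = conv_side // pool
    flattened = pooled_side * pooled_side * filters
    fc1_params = flattened * hidden + hidden
    fc2_params = hidden * classes + classes
    return {
        "conv": conv_params,
        "flattened": flattened,
        "fc1": fc1_params,
        "fc2": fc2_params,
        "total": conv_params + fc1_params + fc2_params,
    }
```

Under these assumptions, almost all parameters sit in the first fully connected layer, which suggests that even a shallow network of this kind is not necessarily cheap to evaluate.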
We train all the networks from scratch with initial weights set randomly with normal distribution (mean = 0, standard deviation = 0.05). In addition, we use transfer learning (TL) for AlexNet and VGG-16 in order to test the possibility of better performance [41,42]. For both architectures, we fine-tune the last three layers of the pretrained networks.
Due to the stochastic character of the training process, we repeat the training a hundred times for each network and training strategy. For each training, we randomly split the training set into training and validation subsets at the ratio 85 : 15. For each training subset, we run the training in a batch mode for 100 epochs with batches of 32 images. We randomly shuffle data in training subsets for each epoch. We use an ADAM optimizer [43] with the initial learning rate set to $10^{-3}$ and exponential decay rates for the first and second moment estimates set to 0.9 and 0.999, respectively. The optimizer and the setting of the hyperparameters are the results of a pilot study. We minimize a binary cross-entropy function:
$$L = -\frac{1}{n} \sum_{j=1}^{n} \left[ y_j \ln \hat{y}_j + \left(1 - y_j\right) \ln \left(1 - \hat{y}_j\right) \right], \tag{1}$$
where $n$ is the number of images in the training subset and $y_j$ and $\hat{y}_j$ are an actual and a predicted class of the $j$-th object image, respectively. We validate each such trained network on the corresponding validation subset using the cross-entropy function (1).
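The binary cross-entropy used for training and validation can be sketched as follows. Averaging over the number of images and clipping the predictions away from 0 and 1 are assumptions of this sketch; the function name and the `eps` parameter are hypothetical.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over a set of object images.

    y_true: actual classes (0 for "not head", 1 for "head").
    y_pred: predicted probabilities of the class "head".
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip the prediction to avoid log(0); eps is an assumed safeguard.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)
```

A confident correct prediction contributes a value close to zero, whereas an undecided prediction of 0.5 contributes ln 2 regardless of the actual class.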

Passenger Recognition Based on HOG and SVM.
Herein, we present a passenger recognition system developed using traditional computer vision techniques. A vision pipeline of the system consists of three successive steps: image preprocessing, feature extraction, and classification (Figure 7). For the feature extraction and classification, we use the HOG descriptor and SVM classifier, respectively. In order to reduce the time complexity of the system, we convert input RGB images to the grayscale format within the image preprocessing. The conversion is carried out according to the ITU-R recommendation BT.601 [44]. The second step of the preprocessing is the unity-based normalization of the grayscale images [31].
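The two preprocessing steps can be sketched in plain Python. The luma weights 0.299, 0.587, and 0.114 are those of BT.601; representing an image as nested lists of (R, G, B) tuples, the function names, and the handling of a constant image are assumptions made only for this illustration.

```python
def rgb_to_gray(image):
    """Convert rows of (R, G, B) tuples to rows of BT.601 luma values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in image]


def unity_normalize(gray):
    """Rescale grayscale values linearly to the interval [0, 1]."""
    flat = [value for row in gray for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Constant image: mapping everything to 0 is an assumed convention.
        return [[0.0 for _ in row] for row in gray]
    return [[(value - lo) / (hi - lo) for value in row] for row in gray]
```

Working with one channel instead of three reduces the amount of data that the feature extraction has to process, which is the motivation given above for the conversion.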
The HOG descriptor encodes local shape information from regions within an image into a feature vector [26]. The descriptor has five parameters: number of bins, orientation binning, size of cells (in pixels), number of cells in blocks, and number of overlapping cells between adjacent blocks. As the size of cells significantly influences the final performance of image recognition systems [27] (Figure 8), we study the influence of this parameter on the classification performance of the HOG-based passenger recognition system. Specifically, we consider cells of sizes [6,6], [8,8], ..., [16,16] px. For the remaining parameters, we use a conservative setting which has proven to be efficient: linear gradient voting into 9 bins linearly spread over 0 to 180 degrees, blocks of 2 × 2 cells, and 1 overlapping cell between adjacent blocks in both directions.
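The effect of the cell size on the amount of data passed to the classifier can be estimated with a short calculation of the feature vector length for a [51, 51] px object image. The sketch assumes that incomplete cells at the image border are discarded, which is common in HOG implementations but is not stated in the text; the function name is hypothetical.

```python
def hog_feature_length(image_side=51, cell=8, block_cells=2,
                       overlap_cells=1, bins=9):
    """Length of the HOG feature vector of a square image."""
    # Assumption: cells that do not fit completely into the image are dropped.
    cells = image_side // cell
    # Blocks are shifted by the number of nonoverlapping cells.
    stride = block_cells - overlap_cells
    blocks = (cells - block_cells) // stride + 1
    return blocks * blocks * block_cells * block_cells * bins
```

Under these assumptions, the vector shrinks from 1764 elements for cells of [6,6] px to 576 elements for [10,10] px and to 144 elements for [14,14] px and [16,16] px, which indicates why the cell size affects both the time complexity and the information available to the SVM.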
Training of the SVM classifier is an optimization problem which searches for a hyperplane with a maximal margin from the training data [45]. In the case that the data is not linearly separable, the data must be transformed into a linearly separable problem using an appropriate kernel function. For strongly nonlinear problems, selection of the kernel function is crucial. Considering this fact, we test the influence of various kernels on the performance of the HOG-based passenger recognition system. Specifically, we focus on the well-established linear, Gaussian radial basis function (RBF), and polynomial kernel functions (we use polynomial kernel with order equal to 2 and 3).
Performances of SVM classifiers are also influenced by settings of their regularization constants. In the case that an SVM classifier uses the RBF kernel, its performance is further influenced by kernel width. In a pilot study, we have found setting of the regularization constant at 1 to be optimal. We use a subsampling-based heuristic procedure to find the optimal setting of the kernel width.
As classification performances of classifiers strongly depend on the composition of training sets, we search for a setting ensuring the best performance of the HOG-based passenger recognition system. We carry out the search in the manner described in Section 2.2.1. Specifically, we randomly split the training set into training and validation subsets at the ratio 85 : 15, and we train and validate the system on the subsets. We repeat the training-validation process a hundred times for each possible combination of kernel function and cell size. We carry out the validation on corresponding validation subsets using a loss function that is given as a sum of misclassified observations, i.e.,
$$L = \sum_{j=1}^{n} I\left\{ \hat{y}_j \neq y_j \right\}, \tag{2}$$
where $I\{\cdot\}$ is the indicator function, $n$ is the number of images in the validation subset, and $y_j$ and $\hat{y}_j$ are the actual and the predicted class of the $j$-th object image, respectively.
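One repetition of the training-validation process relies on a random 85 : 15 split and on a count of misclassified validation images. Both can be sketched as follows; the function names, the `seed` parameter, and the rounding of the split point are assumptions of this illustration, and the training of the SVM itself is omitted.

```python
import random


def split_85_15(items, seed=None, train_fraction=0.85):
    """Randomly split items into training and validation subsets."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Rounding the split point to the nearest integer is an assumption.
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def misclassification_count(y_true, y_pred):
    """Sum of the indicator of a wrong prediction over all observations."""
    return sum(1 for y, p in zip(y_true, y_pred) if y != p)
```

Repeating the split with different seeds and keeping the model with the smallest count mirrors the selection procedure described above.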

Evaluation of Passenger Recognition.
Two key aspects of the presented passenger recognition systems are their classification performances and their time complexities. A common practice of the evaluation of the classification performance is calculation of accuracy over a testing set (a dataset independent of the training set). For the classification of images into categories "positive" and "negative," the accuracy is given as follows:
$$\mathrm{accuracy} = \frac{|TP| + |TN|}{|TP| + |TN| + |FP| + |FN|}, \tag{3}$$
where |TP| is the number of correctly classified positive images, |FN| is the number of misclassified positive images, |FP| is the number of misclassified negative images, and |TN| is the number of correctly classified negative images.
To evaluate the classification performance comprehensively, we use three additional measures [31,46]:
$$\mathrm{precision} = \frac{|TP|}{|TP| + |FP|}, \tag{4}$$
$$\mathrm{recall} = \frac{|TP|}{|TP| + |FN|}, \tag{5}$$
$$F_1\text{-score} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}. \tag{6}$$
Figure 8: ... [16,16] px, [8,8] px, and [6,6] px, respectively. The length of white abscissae is related to the gradients in the image.
Journal of Advanced Transportation
To evaluate the time complexities of the systems, we measure the times that the systems needed to process the testing set. To keep the results independent of the hardware used, we operate with a relative computational time. For the j-th evaluated system, the relative computational time (7) is defined in terms of $t_j$, the time the j-th system needs to process the data, and $k$, the number of all evaluated systems. We carry out the evaluation of passenger recognition systems using the best models obtained within the training process (see Sections 2.2.1 and 2.2.2). In the case of ConvNet-based systems, we use, for each architecture, the model with the smallest value of the cost function (1) obtained by its validation. In the case of the HOG-based system, we use, for each setting, the model with the smallest value of the cost function (2) obtained by its validation.
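The classification measures can be computed directly from the four counts of the confusion matrix. The sketch below uses the standard definitions of accuracy, precision, recall, and F1-score; the function name is hypothetical, and non-zero denominators are assumed.

```python
def classification_measures(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F1-score from confusion counts.

    tp, fn: correctly classified and misclassified positive images.
    fp, tn: misclassified and correctly classified negative images.
    Assumes that no denominator is zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, 8 correctly and 2 incorrectly classified "head" images together with 9 correctly and 1 incorrectly classified "not head" images give an accuracy of 0.85 and a recall of 0.8.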

Training and Testing Sets.
Quality and composition of the training and testing sets conspicuously influence the overall performance of object recognition systems in real-life applications. Data included in the sets should reflect as many aspects of the real situation as possible. Considering this fact, we base the sets on video sequences acquired in the means of public transport and similar public places under various light conditions, using the engineering prototype.
A set of candidate object images generated by a search algorithm from a frame is imbalanced (often highly) [12,13] with a predominance of images without complete heads (Figure 5). As conventional SVMs are not suitable for imbalanced learning tasks [47], the training and testing sets must be balanced to get unbiased results. Considering these facts, we create the sets manually to ensure the balance of the classes in the sets.
Specifically, we perform four distinct video recording experiments. They are set to simulate the real situation as well as to reflect the architecture of the assumed person flow monitoring system (see Figure 4). All the experiments include stairs and a group of persons walking under the acquisition sensor. Men, women, and children as well as people with and without a head cover (hats, scarves, caps, and hoods) are included. Since the camera lens is focused manually (once for each experiment), the acquired frames show certain blurring according to the specific distance between the object and the lens. We varied locations, lighting conditions, number of frames, mean distance between persons $D_{PP}$ (mean distance between a subject and two other nearest persons), and minimal and maximal distances between a head and the sensor, $\min D_{HS}$ and $\max D_{HS}$, in each experiment (Table 1).
We cut out and size-normalize 6020 unique object images from the video data (the dimensions of the normalized images are [51, 51] px). We label the images according to the presence/absence of heads (Figure 9). We mix and divide the labelled images into the training and testing sets according to Table 2. We make the sets publicly available at [48]. The sets contain large-scale class-balanced data, which makes them universally applicable (the sets can be used to design any classifier, including classifiers which are not suitable to be trained with imbalanced training sets).

Validation of Passenger Recognition Systems.
We train and validate each proposed architecture (ConvNet-based system) and each setting (HOG-based system) a hundred times. To show the validation results, we use box plots. Results obtained for the systems based on ConvNets are shown in Figure 10. The central lines in the graphs are medians of the loss function (1); the edges of the boxes are 25th and 75th percentiles; and the whiskers indicate the variability outside the upper and lower quartiles. The data are grouped according to the architectures and training strategies (x-axis). The values on the y-axis correspond to the loss function values. Figure 11 shows validation results obtained for the HOG-based passenger recognition system using the loss function (2). We use a separate graph for each kernel function. Data in the graphs are grouped according to the sizes of cells. Outliers are symbolized using stars.

Classification Performance of Passenger Recognition Systems.
In Table 3, we summarize evaluation results obtained from the testing set using the measures (3)-(6). The results are grouped into two sections according to the approach they are based on. The best results obtained for each measure are in bold for both approaches.

Time Complexities of Passenger Recognition Systems.
We display relative computational times (7) as a bar graph (the lower chart in Figure 12), where the time and evaluated systems are on the y- and x-axes, respectively. Above each result, we display the F1-score (6) of the system as a bar graph (the upper chart in Figure 12), where the F1-score and evaluated systems are on the y- and x-axes, respectively.

Discussion
The main objective of the presented work is a comparison of the two well-established object recognition approaches for the passenger recognition task. As the evaluation results (Table 3) show, for the cells of size [10,10] px and the polynomial kernel function of degree 3, the classification performance of the HOG-based system slightly exceeds the classification performance of ConvNet-based systems. For this setting, the HOG-based system has the highest values of all four measures. The ConvNet-based systems show the best results for only one measure at a time (aside from LeNet-5 with the highest accuracy and F1-score). Except for recall, the HOG-based system also exceeds the ConvNet-based systems in the values of the performance measures. Further, for this setting, the HOG-based system has significantly lower time complexity when compared to the ConvNet-based systems (Figure 12). Considering all these facts, we conclude that the HOG-based passenger recognition system, with polynomial kernel function of degree 3 and cells of size [10,10] px, best fits the requirements for implementation into the low-cost automated real-time passenger counting system. This is in agreement with an earlier study of passenger recognition without the height differential [19].
Figure 9: Examples of object images in the sets. The first three images (a) are labelled as "head" while the remaining three (b) are labelled as "not head."
The well-established ConvNets such as AlexNet and VGG-16 are expected to be a good basis of object recognition systems. As the validation results (Figure 10) show, they feature good learning ability, resulting in small loss function values.
A similar ability can be observed for the AlexNet. From this perspective, the proposed networks Net1 and Net2 seem to be insufficient. However, their classification performance evaluated on the testing set (Table 3) is comparable with AlexNet- and LeNet-5-based systems (there is no clear winner among these four networks). Surprisingly, the VGG-16-based system has the worst performance in the category of the ConvNet-based systems. The most likely explanation of this phenomenon is the relatively high learning capacity of VGG-16 (compared to the other presented architectures), which may cause overfitting on the head recognition task. Considering the high time complexity of VGG-16 (Figure 12), we conclude that, despite expectations, VGG-16 is not appropriate for passenger recognition.
We also investigated possible benefits of transfer learning for the training of ConvNet-based passenger recognition systems. We observe a lower variability in the cost function values for the networks trained using TL, when compared to the networks trained from scratch (Figure 10). Also, the medians of the cost function values are shifted towards smaller values for TL. We conclude that a model with a low cost function value is more likely to be obtained using TL than by training from scratch.
The size of cells has been reported to be the seminal parameter predetermining the performance of object recognition systems which are based on HOG features [27]. The experimental results presented in this article confirm this finding. An incorrect setting of the cell size results in inferior classification (compare results obtained for cells of size [10,10] px and [14,14] px in Figure 11 and Table 3). Also, the time complexity of the HOG-based system strongly depends on the setting of this parameter (compare, e.g., results for cells of size [6,6] px and [10,10] px in Figure 12).

Conclusions
Presently, deep ConvNets are usually considered the first choice when developing an image recognition system. We established that image recognition systems with equally good classification performances can be developed using traditional computer vision methods. When appropriately designed and set up, such systems can beat ConvNet-based solutions in terms of time efficiency, which is particularly important in real-world applications. This is also the case of the HOG-based passenger recognition system, where the utilization of HOG features in combination with the SVM classifier can result in time-efficient and accurate passenger recognition. In this context, we showed that passenger heads are sufficient for precise yet fast passenger recognition. We also showed that the HOG-based system is highly flexible, as it can be employed in both low-floor and high-floor means of transport. Its implementation into a passenger monitoring system is currently being developed, allowing us to use a basic processing unit. Cost savings on the unit are reflected in the final price of the person flow monitoring system and thus support its mass use in means of transport all over the world.

Data Availability
Data used to support the findings of the study are available at https://www.researchgate.net/publication/342888989_Dataset_for_head_detector.

Conflicts of Interest
The authors declare that they do not have any commercial or associative interest that represents conflicts of interest in connection with the work submitted.