Because of its importance to computer vision and surveillance systems, object tracking has progressed rapidly over the last two decades. Research on such systems still faces several theoretical and technical problems that degrade not only the accuracy of position measurements but also the continuity of tracking. In this paper, a novel strategy for tracking multiple objects using static cameras is introduced, which yields a cheap, easy-to-install, and robust tracking system. The proposed tracking strategy is based on scenes captured by a number of static video cameras. Each camera is attached to a workstation that analyzes its stream. All workstations are connected directly to the tracking server, which coordinates the system, collects the data, and creates the output spatio-temporal database. Our contribution is twofold. The first part is a new methodology for transforming the image coordinates of an object to its real-world coordinates. The second is a flexible event-based object tracking strategy. The proposed tracking strategy has been tested over a CAD simulation of a soccer game environment. Preliminary experimental results show the robust performance of the proposed tracking strategy.
Because of the advance in surveillance systems, object tracking has been an active research topic in the computer vision community over the last two decades as it is an essential prerequisite for analyzing and understanding video data. However, tracking an object, in general, is a challenging problem. Difficulties in tracking objects may arise due to several reasons such as abrupt object motion, changing appearance patterns of the object and/or the scene, nonrigid object structures, partial/full object-to-object and object-to-scene occlusions, camera motion, loss of information caused by projection of the 3D world on a 2D image, noise in images, complex object motion, complex object shapes, and real-time processing requirements. Moreover, tracking is usually performed in the context of higher-level applications, which in turn require the location and/or shape of the object in every captured frame. Accordingly, several assumptions should be considered to constrain the tracking problem for a particular application.
A great deal of interest in the field of object tracking has been generated by (i) the recent evolution of high-speed computers, (ii) the availability of high-quality and inexpensive sensors (video cameras), and (iii) the increasing demand for automated real-time video analysis. Tracking an object means obtaining its spatio-temporal information by estimating its trajectory in the image plane as it moves around a scene, which in turn helps in studying and predicting its future behavior. There are three main steps in video analysis: detecting interesting moving objects, tracking those objects from frame to frame, and analyzing the recognized object tracks to determine their behavior. Examples of tracking include tracking rockets in a military defense system, customers in a market, players in a sports game, and cars in a street; tracking these objects helps in military defense, goods arrangement, sports strategy, and traffic control, respectively.
Basically, tracking can be done by one or multiple types of sensors [
The three scenarios of a surveillance system: high, medium, and low resolution.
Using color information may resolve some tracking situations. Various methods have been proposed to represent the appearance model of tracked objects, such as the polar representation method [
The appearance template of a tracked object.
Sometimes, it may be more beneficial to replace a high-precision sensor with multiple low-precision ones. Such a strategy greatly reduces cost, as low-precision sensors are cheaper than high-precision ones [
However, multiple-sensor tracking systems suffer from several hurdles, such as sensor registration and integration. Furthermore, they become very complicated if the sensor network design is noncentralized. In the centralized network scenario, data streams from the machine behind each sensor (the workstation) to a dedicated server, and vice versa, so the data stream paths are clearer than those in noncentralized systems. Generally, centralized (server-workstation) networks are preferred over noncentralized (peer-to-peer) networks as they usually provide better stability, robustness, and organization.
Usually, in several tracking systems, the tracked objects not only move randomly (not as rigid bodies) but are also captured in low resolution. These restrictions result in several hurdles that are usually faced during a tracking system's operation. Such hurdles include (i) the automatic initialization and installation of new objects, (ii) detection of objects that enter the field of view (FOV) of each camera, (iii) detection of exiting objects, (iv) guaranteeing an object's continuous migration from one camera's FOV to another's, and, most importantly, (v) solving the merging/splitting situations of objects or groups of objects, which are the major challenge in the tracking system.
In this paper, a real-time tracking system is proposed for automatically tracking a set of moving objects using a number of video streams collected by a set of static cameras. Simple installation, portability, and low cost are the salient features of the proposed tracking system [
The main contributions of this paper are (i) a novel image-to-world coordinate transformation that benefits the system's setup and tracking phases and improves the flexibility of tracking objects that travel between cameras' FOVs, (ii) a new concept that considers the moving object as a group of physical targets, which in turn solves and corrects many tracking problems, and (iii) a novel event-driven tracking algorithm for efficiently tracking multiple objects over a network of multiple cameras.
The rest of the paper is organized as follows. Section
Over the last two decades, a great deal of work in the area of object tracking has been developed. The structure and organization of these systems differ according to their purposes, the physical properties of the sensors used, and economic considerations. To the best of our knowledge, based on the underlying architecture, object tracking systems can be classified into three main categories. Two of them are built over a network of camera stations: the first depends on static cameras, while the second uses dynamic cameras capable of zooming and rotating. The third category depends on a wide-screen (panorama) image, which can be composed from a number of static cameras on the same side of the tracking field. Through the rest of this section, the first category, which employs static cameras, will be discussed in detail because of its similarity to our system. Afterwards, the other two types will be briefly reviewed.
According to the first category, the tracking system consists of a set of workstations that cover the tracked environment using a set of static cameras. Kang et al. [
Another architecture that uses stationary cameras was proposed by Iwase and Saito in [
On the other hand, tracking architectures that use a network of dynamic cameras are sometimes preferred because of their low cost. For illustration, tracking in a soccer game needs no further infrastructure, as it can rely on the already-installed dynamic cameras that deliver the TV broadcast video streams. Hayet presented a modular architecture for tracking objects in sports broadcasting [
Finally, the use of a wide image (panorama image) makes it possible to track objects using one machine, which avoids the difficulties of correlating data over multiple machines. Sullivan and Carlsson in [
Through the next subsections, a quick review of the methodology used to extract the tracked objects, as well as the employment of the Kalman filter in the literature, will be introduced.
A very fast approach to extracting the moving objects in a video stream is the background subtraction method [
Capturing interesting moving objects using the background subtraction method: the background model, an input frame, and the extracted interesting objects (foreground pixels after subtracting the background).
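The background-subtraction step can be sketched as follows. This is a minimal illustration using a running-average background model and a fixed difference threshold; the learning rate and threshold values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Recursively update the background model with a running average.
    The learning rate `alpha` is an illustrative choice."""
    return (1.0 - alpha) * background + alpha * frame

def foreground_mask(background, frame, threshold=25.0):
    """Mark pixels whose absolute difference from the background model
    exceeds `threshold` as foreground."""
    return np.abs(frame.astype(np.float64) - background) > threshold

# Toy example: a flat background with one bright moving "object".
background = np.full((8, 8), 50.0)
frame = background.copy()
frame[2:4, 3:5] = 200.0          # the moving object occupies a 2x2 region
mask = foreground_mask(background, frame)
background = update_background(background, frame)
```

In practice the recursive update would run on every frame, so the model slowly absorbs illumination changes while fast-moving objects remain foreground.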
Two different techniques can be used for creating the background model: recursive [
Next, objects are created by connecting foreground pixels. For each object, a number of foreground pixels are clustered into a blob. These pixels are clustered by applying Otsu's method, which determines the threshold deciding whether each pixel belongs to an object or not. This blob is the source of the object's appearance in terms of its dimensions, its color signature, and even its position. An object may be represented by more than one blob. Finally, the object's dimensions and position are represented by a rectangle bounding its blobs. Figure
Steps of creating the object by the background subtraction method.
The bounding rectangle of tracked object
The tracked object is always expressed by
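The blob-creation step described above, connecting foreground pixels into blobs and bounding each blob with a rectangle, can be sketched as a simple 4-connected flood fill. This is an illustrative implementation, not the paper's code; the binary mask is assumed to have already been produced by thresholding.

```python
from collections import deque
import numpy as np

def blobs_and_boxes(mask):
    """Cluster 4-connected foreground pixels into blobs and return the
    bounding rectangle (top, left, bottom, right) of each blob."""
    visited = np.zeros_like(mask, dtype=bool)
    boxes = []
    h, w = mask.shape
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not visited[sy, sx]:
                top, left, bottom, right = sy, sx, sy, sx
                queue = deque([(sy, sx)])
                visited[sy, sx] = True
                while queue:                      # breadth-first flood fill
                    y, x = queue.popleft()
                    top, bottom = min(top, y), max(bottom, y)
                    left, right = min(left, x), max(right, x)
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            queue.append((ny, nx))
                boxes.append((top, left, bottom, right))
    return boxes

mask = np.zeros((6, 6), dtype=bool)
mask[1:3, 1:3] = True   # blob 1: a 2x2 square
mask[4, 4] = True       # blob 2: a single pixel
print(blobs_and_boxes(mask))  # [(1, 1, 2, 2), (4, 4, 4, 4)]
```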
An object tracking system determines the location of one or more objects over time using a sequence of images (frames). The first idea for finding the location of an object in the next image is to search for it in a small rectangular area around its present location. However, this idea may fail if the object moves out of the search rectangle. Hence, the method must be modified to predict the object's location in the next image and then search in a small rectangular area around that prediction. To accomplish this, the Kalman filter can be successfully employed because of its ability to predict in noisy environments [
Kalman filter recursive algorithm flow chart.
The Kalman filter works as an estimator for the process state at some time and then obtains feedback in the form of noisy measurements. The equations of the Kalman filter are therefore divided into two groups: prediction (time update) and correction (measurement update). The process solved by the Kalman filter is governed by a linear stochastic difference equation. In tracking systems this process is nonlinear, so the Kalman filter must be modified to tackle this problem. The modification of the filter comes in a famous form called the extended Kalman filter (EKF) [
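For illustration, a minimal linear Kalman filter with a constant-velocity motion model is sketched below. The paper's system uses the EKF for its nonlinear process, so this linear version, with illustrative noise covariances, only demonstrates the predict/correct cycle.

```python
import numpy as np

dt = 1.0  # one frame interval
# State: [x, y, vx, vy]; constant-velocity motion model (an illustrative
# assumption; the paper's process is nonlinear and handled by an EKF).
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # only position is measured
Q = np.eye(4) * 1e-2    # process noise covariance (illustrative)
R = np.eye(2) * 1.0     # measurement noise covariance (illustrative)

def predict(x, P):
    """Time update: project the state and covariance one frame forward."""
    return F @ x, F @ P @ F.T + Q

def correct(x, P, z):
    """Measurement update: blend the prediction with a noisy measurement."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

x = np.array([0.0, 0.0, 1.0, 0.5])   # start at origin, moving right and up
P = np.eye(4)
x, P = predict(x, P)                              # predicted position (1.0, 0.5)
x, P = correct(x, P, np.array([1.1, 0.4]))        # noisy position measurement
```

The predicted position defines the center of the small search rectangle in the next frame; the correction step pulls the estimate toward the actual measurement once the object is found.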
Through this section, the proposed tracking strategy and its main components will be illustrated in detail.
The proposed tracking system architecture, as well as its main elements, is illustrated in Figure
System architecture layers and elements.
As illustrated in Figure
The ideal positioning of a couple of cameras. The two cameras’ axes are perpendicular in top view.
The overall system's FOV encompasses the visible FOVs of all the cameras in the system. On the other hand, the system is not interested in any virtual FOV (the areas left uncovered by cameras). Each camera is connected to a workstation as illustrated in Figure
In the server layer, the server machine works as the system's central point. It receives the spatio-temporal data from each workstation and then combines them to (i) guarantee data integration, (ii) correct falsely received data, and (iii) predict the most likely correct data. Hence, data streams from the workstations to the server, and vice versa. Finally, the output spatio-temporal data from the system are stored in a database. This database can be expressed in a general markup language such as XML so it can easily be sent in real time to other media over low bandwidth, for example to TV shows, mobile phones, or the Internet. In addition, the complete database can be linked to the video streams and used offline as a multimedia database to analyze situations that happened in the tracked environment, as a kind of data mining.
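As a sketch of how one spatio-temporal record might be expressed in XML for low-bandwidth delivery: the paper does not give a schema, so the tag and attribute names below are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout: one <frame> element per time stamp, with one
# <object> child per tracked object carrying its real-world coordinates.
frame = ET.Element("frame", time="00:01:23.040")
ET.SubElement(frame, "object", id="7", x="34.2", y="51.8")
ET.SubElement(frame, "object", id="12", x="40.1", y="49.3")
xml_text = ET.tostring(frame, encoding="unicode")
print(xml_text)
```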
Before tracking, each workstation's FOV must be defined with respect to the general system coordinates to avoid gaps and blind areas in the tracked environment. The workstation's FOV is defined as a polygon whose corners represent the pitch area covered by the camera attached to that workstation. Choosing these corners manually is more accurate, less expensive, and more transparent in many situations. The corners are chosen over one captured frame before tracking begins.
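Once the FOV polygon corners are chosen, deciding whether a ground point lies inside a workstation's FOV is a standard point-in-polygon test. A ray-casting sketch (not from the paper) follows:

```python
def inside_fov(point, polygon):
    """Ray-casting test: does `point` (x, y) lie inside the FOV `polygon`,
    given as a list of (x, y) corners in order?"""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count crossings of a horizontal ray cast to the right of the point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

fov = [(0, 0), (10, 0), (10, 6), (0, 6)]   # a rectangular coverage polygon
print(inside_fov((5, 3), fov))   # True
print(inside_fov((12, 3), fov))  # False
```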
At each workstation, tracking moving objects is performed using the methodology illustrated in Figure
Workstation tracking methodology.
After splitting an occluded group, the workstation relies on three sources to match the input objects of the occluded group with its output objects: (i) EKF prediction and correction, (ii) the color histogram of each object, and (iii) the feedback coming from the server.
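The second matching cue, the color histogram, can be sketched as follows. The bin count, per-channel layout, and histogram-intersection similarity are illustrative choices rather than the paper's exact appearance model.

```python
import numpy as np

def color_histogram(pixels, bins=8):
    """Normalized per-channel color histogram of an object's blob pixels
    (an N x 3 array of RGB values in 0..255)."""
    hist = np.concatenate([
        np.histogram(pixels[:, c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ]).astype(float)
    return hist / hist.sum()

def intersection(h1, h2):
    """Histogram intersection similarity: 1.0 for identical histograms."""
    return float(np.minimum(h1, h2).sum())

# Two synthetic "shirts" with different dominant colors.
rng = np.random.default_rng(0)
red_shirt = rng.normal([200, 40, 40], 10, size=(100, 3)).clip(0, 255)
blue_shirt = rng.normal([40, 40, 200], 10, size=(100, 3)).clip(0, 255)
h_red, h_blue = color_histogram(red_shirt), color_histogram(blue_shirt)
# An object re-emerging from a group is matched to the candidate whose
# histogram is most similar to its stored appearance template.
assert intersection(h_red, h_red) > intersection(h_red, h_blue)
```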
The server machine acts as the center of the system as it receives all data coming from the workstations connected to the cameras (as shown in Figure
Server jobs and interactivity with workstations.
Data synchronization is performed by the scheme presented in [
In any tracking system, the image-to-world coordinate transformation is an essential and complicated step. This transformation should be calculated accurately to obtain the precise real position of each moving object. However, depending primarily on the internal parameters of the camera and its position makes this transformation very complicated and expensive. Camera calibration is one such complicated approach, proposed in [
At the server, a TG-creator module creates the TG for each workstation. Initially, the TG-creator reads the real coordinates of the TG points, which are organized and measured manually. These data are presented in the form of an XML file such as the one shown in Algorithm
<scene> <point name=" ……………. …………….
The second step in the TG-creator is to use the graphical user interface (GUI) shown in Figure
A part of TG-Creator interface. Numbers and black rectangles refer to some of equivalent points in both an initial frame in the left and real model in the right.
Hence, by simple clicks using TG-creator interface, the operator can assign each point in the initial frame, at the left part, to its equivalent, at the right, in the real model. As soon as the manual assignment is finished, the last step in TG-creator is to produce an XML file that collects the coordinates of each image point
The XML file written by the TG-creator and expresses one workstation's TG is written as in Algorithm
……………. …………….
Transforming a point from image coordinates to real coordinates begins with reading the previous XML file. Next, the TG-cell in which the transformed point lies is detected (or the nearest TG-cell if the point does not lie inside any cell). Once the TG-cell is detected, the transformation algorithm starts, as illustrated in Algorithm
Input data: the image coordinates of point P(xi, yi); the TG-cell in which point P lies; the real x-coordinate of point P11, which is called P11-xr.
Calculated parameters: Rn: the iterated value of R after n tries. R0: the initial value of R, which equals 0.5. Px1, Px2: calculated from the cell points' image coordinates. xth: the perpendicular distance in pixels from point P(xi, yi) to the line Px1-Px2. Pxr: the real x-coordinate of point P; it is the goal of this algorithm.
Algorithm: Start: calculate Rn:
Referring to Figure
The considered point P with its hosting cell C.
Hence, using the previous pair of equations, the real
A sample illustration of transformation algorithm basics and steps.
The transformed point P and its threshold distance in the two directions
Deciding the next value of R ratio in
An example of iterating of R ratio in
In the same way, the real
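A hedged sketch of the ratio-iteration idea described above: the real x-coordinate inside a TG-cell is located by bisecting a ratio R in [0, 1], starting from R0 = 0.5 as in the paper, until the line interpolated between the cell's edges passes within a pixel threshold xth of point P; R then interpolates the real x-coordinate. The cell-corner names follow the paper (P11, P12, P21, P22), but the interpolation details are simplified assumptions.

```python
def lerp(a, b, r):
    """Linear interpolation between image points a and b at ratio r."""
    return (a[0] + r * (b[0] - a[0]), a[1] + r * (b[1] - a[1]))

def real_x(p, p11, p12, p21, p22, p11_xr, cell_width, xth=0.5, max_iter=32):
    """p and p11..p22 are image coordinates; p11_xr is the real x of corner
    P11 and cell_width the real spacing of TG points (e.g. 8 m)."""
    lo, hi = 0.0, 1.0
    r = 0.5                                   # R0 = 0.5, as in the paper
    for _ in range(max_iter):
        px1 = lerp(p11, p12, r)               # point on the top cell edge
        px2 = lerp(p21, p22, r)               # point on the bottom cell edge
        # Horizontal offset of P from the line Px1-Px2, simplified here to
        # the x-distance measured at P's height.
        t = (p[1] - px1[1]) / (px2[1] - px1[1]) if px2[1] != px1[1] else 0.0
        x_line = px1[0] + t * (px2[0] - px1[0])
        if abs(p[0] - x_line) <= xth:         # within the xth threshold
            break
        if p[0] > x_line:                     # P is right of the line
            lo = r
        else:                                 # P is left of the line
            hi = r
        r = (lo + hi) / 2.0
    return p11_xr + r * cell_width

# Axis-aligned toy cell: image x spans 100..180, real width 8 m, P halfway.
xr = real_x((140.0, 50.0), (100, 0), (180, 0), (100, 100), (180, 100),
            p11_xr=16.0, cell_width=8.0)
```

The same iteration, rotated by 90 degrees, yields the real y-coordinate.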
As soon as the image coordinates are transformed to the corresponding real coordinates, the workstation coverage modules can be started immediately. After constructing the TG, the workstation coverage modules help in plotting the coverage polygon over the ground of each camera's FOV. This plotting is performed manually at the server with a developed GUI software aid tool. The server receives the initial frame of each workstation and its TG XML file, and then the server operator manually assigns the corner points of the coverage polygon. Figure
Workstations FOVs versus their illustration in the server coverage aid tool.
Cameras FOVs (initial frame)
Workstation's coverage polygon
Coverage polygons of all workstations
The main goal in any tracking system is to solve the problem of discontinuity of object paths (tracks) during the tracking process. This problem occurs for many reasons, such as (i) occlusion by objects or structures in the scene (the object disappears partially or completely behind a static object), (ii) the object disappearing into blind areas, or (iii) objects exiting and reentering the scene. Basically, object occlusion falls into two main categories: (i) interobject occlusion and (ii) occlusion of objects by the scene. The former happens when a group of objects appears in the scene and some objects are blocked by others in the group. This poses a difficulty not only in identifying the objects in the group but also in determining the exact position of each individual object. The latter occurs when an object is not observed for some amount of time because it is blocked by static or dynamic structures in the scene; the tracking system then has to wait for the object's reappearance or conclude that the object has left the scene. To minimize the effect of occlusion, we assume an environment totally covered by several cameras; moreover, the ground is always visible, since there are no static objects. The discontinuity of object tracks is overcome by (i) initializing new objects, (ii) successfully solving merging/splitting situations, and (iii) precisely integrating the entering/exiting of objects across the FOVs of the tracking workstations.
In most tracking systems, each physical target is represented as one corresponding object. However, this representation is not suitable for tracking multiple continually interacting targets. The generic formulation of the tracked objects in our system assumes that "the tracked object can be represented as a group of micro-objects." This assumption is illustrated in Definition
Considering Definition
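The group-of-micro-objects idea can be sketched with a small data structure: each tracked object carries the set of physical targets it currently contains, so a merge joins the groups and a split partitions them. Class and method names are illustrative, not from the paper.

```python
class TrackedObject:
    """One visual object, holding one or more physical targets."""

    def __init__(self, targets):
        self.targets = set(targets)   # the physical targets inside this object

    def merge(self, other):
        """Two objects occlude each other: track them as one group."""
        return TrackedObject(self.targets | other.targets)

    def split(self, leaving):
        """Some targets leave the group: return (remaining, departed)."""
        leaving = set(leaving)
        return TrackedObject(self.targets - leaving), TrackedObject(leaving)

a = TrackedObject({"player_3"})
b = TrackedObject({"player_7", "referee_1"})
group = a.merge(b)                      # one object holding three targets
rest, gone = group.split({"referee_1"})  # the referee walks away
```

Representing the group explicitly is what lets the system keep the identities of all three targets alive through the occlusion instead of losing them when the blobs merge.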
Figure
Three players begin tracking as one object and then split.
In a traditional tracking system, the workstations are entrusted with the segmentation and object-creation stage and the object-path-creation stage, while the server is entrusted with the feedback-correction stage. We break the linearity of cascading the classic tracking stages by "taking the decision of tracking from events performed in those stages." An early version of this strategy was introduced in [
Tracking events used, their meanings, and their responses.
Event | Description | Corresponding response |
---|---|---|
E1 | Creating a new object | R1 |
E2 | Object entering the FOV of a workstation | R2 |
E3 | Object split from a group of targets | R3 |
E4 | Object exiting the FOV of a workstation | R4 |
E5 | Object merged with another one | R5 |
E6 | Object is tracked in a stable manner | R6 |
Each event is triggered by a collection of conditions generated by simple operations and calculations. For example, changing the size of an object's bounding box
Parameters used in the tracking algorithm.
Parameters | Description |
---|---|
 | Rectangle bounding the object |
 | Rectangle bounding the object |
 | Appearance template of the object |
 | The |
 | An array that contains all objects that were sent to the server by the workstations |
 | An array that contains all objects created by the server's association module |
 | An array that contains the predicted location of each object in |
As illustrated in Algorithm
Input data:
SSB: set of N segmented bounding boxes of objects, SSB = { ... }
KBB: set of M EKF bounding boxes of objects, KBB = { ... }
AT: set of M appearance templates, AT = { ... }
Algorithm (defining events for each workstation):
For each ...
  E1: if ... then trigger R1; go to OUT1
  E2: if ... then trigger R2; go to OUT1
  E3: if size( ... ) ... then trigger R3; go to OUT1
  OUT1: Next
For each ...
  E4: if ... then trigger R4; go to OUT2
  E5: if size( ... ) ... then trigger R5; go to OUT2
  E6: if ... then trigger R6; go to OUT2
  OUT2: Next
Event responses (performed over the SRM, the Server Response Module, and the WRM, the Workstation Response Module):
R1: WRM.register( ... ); WRM.create_object_template( ... ); WRM.send_to_SRM( ... ); SRM.associate( ... ); SRM.create_and_send_to_each_WRM( ... )
R2: WRM.register( ... ); WRM.send_to_SRM( ... ); SRM.associate( ... ); SRM.create_and_send_to_each_WRM( ... )
R3: For all ... : if not ... then { if ... then WRM.trigger_new_response( ... ); WRM.build_history_of( ... ) } else if ... then ... ; Next; WRM.trigger_stable_response( ... )
R4: WRM.stop_tracking( ... )
R5: WRM.register_group( ... ); WRM.announce_to_SRM( ... ); SRM.associate( ... ); SRM.create_and_send_to_each_WRM( ... ); if WRM.search( ... ) then WRM.split( ... ) else WRM.give_EKF_position( ... )
R6: WRM.track( ... ); WRM.send_to_SRM( ... ); SRM.associate( ... ); SRM.create_and_send_to_each_WRM( ... )
As illustrated in Algorithm
The flow chart of tracking events algorithm.
Triggering
The second event
The third event (
The second set of events (which are
The merge situation of objects fires the event
All the previous events are regarded as nonstable events, so objects that do not trigger any of them are said to be tracked in a stable way. The last event,
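The event-driven dispatch itself can be sketched as a mapping from triggered events to response handlers, instead of running the classic tracking stages as a fixed cascade. The event-to-response pairing follows the algorithm listing; the handler bodies here are placeholders, since the real responses involve the WRM and SRM calls listed above.

```python
from enum import Enum, auto

class Event(Enum):
    NEW_OBJECT = auto()     # E1: creating a new object
    ENTER_FOV = auto()      # E2: object entering the workstation's FOV
    SPLIT = auto()          # E3: object split from a group of targets
    EXIT_FOV = auto()       # E4: object exiting the workstation's FOV
    MERGE = auto()          # E5: object merged with another one
    STABLE = auto()         # E6: object tracked in a stable manner

# Placeholder handlers standing in for the real R1/R4/R6 responses.
def r1(obj): return f"register and template {obj}"
def r4(obj): return f"stop tracking {obj}"
def r6(obj): return f"stable track {obj}"

RESPONSES = {Event.NEW_OBJECT: r1, Event.EXIT_FOV: r4, Event.STABLE: r6}

def dispatch(event, obj):
    """Route a triggered event to its registered response, if any."""
    handler = RESPONSES.get(event)
    return handler(obj) if handler else None

print(dispatch(Event.EXIT_FOV, "player_9"))  # stop tracking player_9
```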
Once all the possible events are successfully triggered, the continuity of object tracks can be preserved correctly. Figure
Four cases showing the robustness of the tracking algorithm in solving some tracking problems.
Grouping of two players in the left frame is solved by another camera's frame
Frames from one camera solve two splits
The first row shows a player entering, who takes the same label in another camera shown in the second row
Frames show a player exiting a camera's FOV and the system stopping tracking him
In this section, three experiments will be presented. The first two experiments determine the accuracy of the proposed image-to-world transformation method, while the third evaluates the proposed algorithm of object tracking.
In this experiment, the efficiency of the proposed image-to-world transformation method is measured on a real three-dimensional CAD model. This model simulates the soccer pitch with a camera FOV that captures the center of the pitch. The frame size was adjusted to 720 × 576 pixels. Each TG point is 8 meters away from its neighbors in the four directions. Five paths were constructed in the CAD simulation, each consisting of 40 points. Each point's real position is known in the CAD simulation, and its image position is fed manually. We chose these paths, as well as the positions of their points, to test the efficiency of the algorithm near and far from the camera, at the boundaries, and in the middle of the camera's FOV. Moreover, we did not take any diagonal path (across the camera's FOV) that might yield misleading results for the calculated Root Mean Square Error (RMSE). The proposed transformation method is compared here with the fast projection method [
The tested five paths and two projection planes in the middle camera.
For each path, RMSE is calculated using each point’s real position (
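The per-axis RMSE used in this comparison can be computed as follows; this is a straightforward sketch, and the example coordinates are illustrative, not taken from the experiment.

```python
import math

def rmse(real, estimated):
    """Root mean square error between a path's real point positions and the
    positions reported by the transformation, along one axis (in meters)."""
    assert len(real) == len(estimated)
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(real, estimated)) / len(real))

# Illustrative x-coordinates of four path points (real vs. transformed).
real_x = [0.0, 8.0, 16.0, 24.0]
est_x = [0.4, 7.7, 16.5, 23.8]
print(round(rmse(real_x, est_x), 3))  # 0.367
```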
Results for each path are summarized in Table
Experiment I results; comparing five paths using the image transformation algorithm and two projection planes. The comparison employs the RMSE over the two axes.
RMSE | Path 1 | Path 2 | Path 3 | Path 4 | Path 5 |
---|---|---|---|---|---|
TG | 0.59 m | 0.51 m | 0.60 m | 0.31 m | 1.02 m |
 | 0.65 m | 0.57 m | 0.67 m | 0.28 m | 0.98 m |
Plane 1 | 29.77 m | 13.53 m | 28.81 m | 19.13 m | 31.35 m |
 | 33.03 m | 15.28 m | 25.11 m | 18.64 m | 35.21 m |
Plane 2 | 12.95 m | 14.69 m | 11.73 m | 10.75 m | 15.02 m |
 | 13.48 m | 15.49 m | 13.22 m | 9.61 m | 13.84 m |
Results in Table
This experiment extends the previous one by assuming that each TG point is 8, 16, 24, 32, or 40 meters away from its neighbors in the four directions. The RMSE of the same five paths shown in Figure
RMSE of the five paths tested in experiment II. The test was performed over five different distances of TG points.
Calculated RMSEs over the
Calculated RMSEs over the
By observing Figure
In this experiment, the proposed tracking algorithm is tested over a three-dimensional CAD simulation. The environment of this experiment is the soccer pitch covered by a network of five cameras. The pitch includes 25 moving objects, divided into two teams and three referees (wearing differently colored clothes). The simulation was 2 minutes long, at 25 frames per second, in MPEG format, with a frame size of 768 × 576 pixels. The workstations, as well as the server, have Intel i5 CPUs, and the modules were built over the .NET framework running under the Windows 7 operating system. The FOVs of the five static cameras used to cover the soccer field are shown in Figure
Workstations' FOVs versus their illustration in the server coverage aid tool.
The proposed tracking events algorithm can be evaluated by comparing the events generated automatically by the system against the manual observation of workstation and server events, expressed as percentages. The event ratio is calculated using the following formula:
Experiment III results; the event-tracking evaluation.
Event | Manual and CAD observing | System succession | Event_ratio |
---|---|---|---|
New-object-register event | 25 | 22 | 88% |
Object-enter event | 31 | 30 | 97% |
Object-split event | 34 | 27 | 79% |
Object-exit event | 26 | 26 | 100% |
Object-merged event | 32 | 30 | 94% |
Targets trajectory correlation | 1.0 | 0.81 | 81% |
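The event ratio in the table above can be reproduced as below; the formula (events the system detected divided by events observed manually in the CAD simulation) is inferred from the reported numbers, for example 22/25 = 88%.

```python
def event_ratio(system_detected, manually_observed):
    """Percentage of manually observed events the system also detected."""
    return round(100.0 * system_detected / manually_observed)

print(event_ratio(22, 25))  # 88  (new-object-register event)
print(event_ratio(27, 34))  # 79  (object-split event)
```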
The reported results in Table
As proof of the system's ability to preserve the continuity of object tracks, we present plots of object tracks in two different ways. The first is a plot of two objects over the pitch, as shown in Figure
Comparing real and system path of two players. The events new (N), merging (G), and splitting (S) are shown on the path.
In this paper, a novel strategy was presented for constructing a tracking system in a crowded environment. The novelty of this strategy appears in the flexibility of the system architecture, a clear solution to the image transformation and system setup problems, a new model of the tracked objects, and an event-driven algorithm for tracking objects over a network of static cameras. The proposed tracking strategy offers potential for surveillance systems that require wide-area observation and tracking of objects over multiple cameras, for example in airports, train stations, or shopping malls. A single camera cannot observe the complete area of interest, as structures in the scene constrain the visible areas and device resolution is limited; therefore, the surveillance of wide areas requires multiple cameras for object tracking. Over a CAD simulation of a soccer game, experiments have shown that the proposed tracking strategy performs well even in cluttered environments. Our future investigations are to extend the proposed tracking algorithm to include other events. Moreover, although the soccer game is a very rich tracking environment, the system should be tested in other environments such as parking areas, other sports playgrounds, and hypermarkets.