A TBB-CUDA Implementation for Background Removal in a Video-Based Fire Detection System

This paper presents a parallel TBB-CUDA implementation that accelerates the single-Gaussian distribution model, which is effective for background removal in video-based fire detection systems. In this framework, TBB handles the initialization of the estimated Gaussian model on the CPU, and CUDA performs background removal and adaptation of the model on the GPU. The implementation exploits the combined computing power of TBB and CUDA and can therefore be applied in real-time environments. Over 220 video sequences are used in the experiments. The experimental results show that TBB+CUDA achieves a higher speedup than either TBB or CUDA alone. The proposed framework overcomes the limited memory bandwidth and small number of execution units of the CPU, and it reduces the data transfer latency and memory latency between CPU and GPU.


Introduction
Video-based fire detection systems play an important role in existing surveillance systems. Compared with conventional fire detection methods based on particle sensors [1], visual fire detection is more suitable for open or large spaces, and it provides abundant and intuitive information. For video-based fire detection, motion and color are the usual cues. Several methods find moving, flame-colored pixels by integrating background removal algorithms with Gaussian distribution models [2][3][4]. In addition to the ordinary motion and color cues, flame and fire flicker can be detected by analyzing the video in the wavelet domain [5][6][7]. These methods have been successfully applied in surveillance systems and proven effective. However, real-time processing requires fire detection to be accelerated. Parallel processing is a suitable way to provide satisfactory performance for realistic applications.
The GPU (graphics processing unit) has recently become a popular parallel platform for large-scale data computing. CUDA (Compute Unified Device Architecture), created by NVIDIA [8], provides a data-parallel programming framework and enables parallel execution of C function kernels [9]. For this reason, many developers have taken advantage of the high performance of CUDA to accelerate computation across various problem domains, such as signal processing, computer vision, computational geometry, and scientific computing [10][11][12][13]. However, CUDA focuses on compute-intensive calculations. The memory latency and the data transfer latency between CPU and GPU still need further consideration.
TBB (Intel Threading Building Blocks) is a runtime-based parallel library that offers a rich methodology for expressing parallelism in C++ programs [14,15]. As a typical fine-grained parallel model, TBB supports parallel tasks that run on threads. In addition, TBB implements task stealing to balance the parallel workload across the available processing cores, which reduces load imbalance, increases core utilization, and adapts to dynamic environments. Several studies have used TBB to improve algorithms such as the Floyd-Warshall algorithm [16], a Three-tier Parallel Genetic Algorithm (TPGA) [17], and the solution of large dense linear equations [18].
In our work, a parallel programming framework of CUDA+TBB is provided. We apply TBB to the initialization work running on the CPU, and CUDA to background removal and adaptation of the model running on the GPU. The hybrid parallel mode overcomes the limited memory bandwidth and the small number of CPU execution units that constrain TBB. CUDA+TBB also addresses the major drawback of CUDA by reducing unnecessary data transfer latency and memory latency between GPU and CPU, which accelerates the computation. The rest of this paper is organized as follows. Section 2 presents the background modeling techniques based on the single-Gaussian model and the adaptation of the model parameters. Parallelization to accelerate background removal is discussed in Section 3. Section 4 presents experiments in which the parallel implementations are applied to video-based fire detection. We end this paper in Section 5 with conclusions and future work.

Background Modeling
The Gaussian distribution is a common probability model that is widely used in pattern recognition and image processing to describe random variables such as pixel values and noise. In digital image processing, the single-Gaussian model is used for foreground extraction in images whose background is simple and stable. In a fire detection system, natural fire flames appear as dynamic objects in the video images observed by a fixed camera. The color of the fire regions is also a very important feature for distinguishing fire from other objects, so color information helps to locate fire regions more precisely. In this paper, we use a single-Gaussian model whose mean and covariance matrix are extracted from the video frames in the RGB color space [19].

Background Removal Algorithm.
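The first step is to initialize the background model. In image processing, all operations are performed per pixel, and each pixel is described by the triple $(I, \mu_t, \sigma_t)$, where $I$ is the pixel value, $\mu_t$ is its mean value, and $\sigma_t$ is its standard deviation. The subscript $t$ denotes time $t$. In this project, we use $N$ ($N = 20$) frames to initialize the background. The mean and the standard deviation are computed as follows:

$$\mu_i(x, y) = \frac{1}{N} \sum_{t=1}^{N} I_i^t(x, y), \qquad \sigma_i(x, y) = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left( I_i^t(x, y) - \mu_i(x, y) \right)^2}, \tag{1}$$

where $\mu_i(x, y)$ is the mean value of $I_i(x, y)$, $\sigma_i(x, y)$ is the standard deviation of $I_i(x, y)$, and $i \in \{R, G, B\}$.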
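The initialization step can be sketched in plain C++ as follows. This is a minimal serial sketch, not the authors' code: the names `BackgroundModel` and `init_background`, the flat interleaved-RGB frame layout, and the use of the population form (division by $N$) for the standard deviation are all assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the per-sample background statistics.
struct BackgroundModel {
    std::vector<float> mean;    // mu_i(x, y), one entry per channel sample
    std::vector<float> stddev;  // sigma_i(x, y), same layout as mean
};

// frames: the N initialization frames, each a flat array of interleaved
// RGB samples (width * height * 3 values). All frames share one size.
BackgroundModel init_background(const std::vector<std::vector<float>>& frames) {
    const std::size_t n = frames.size();
    const std::size_t samples = (n > 0) ? frames[0].size() : 0;
    BackgroundModel model{std::vector<float>(samples, 0.0f),
                          std::vector<float>(samples, 0.0f)};
    if (n == 0) return model;

    for (std::size_t s = 0; s < samples; ++s) {
        // Mean over the N frames.
        float sum = 0.0f;
        for (std::size_t t = 0; t < n; ++t) sum += frames[t][s];
        const float mu = sum / static_cast<float>(n);

        // Standard deviation over the N frames (assumed 1/N normalization).
        float sq = 0.0f;
        for (std::size_t t = 0; t < n; ++t) {
            const float d = frames[t][s] - mu;
            sq += d * d;
        }
        model.mean[s] = mu;
        model.stddev[s] = std::sqrt(sq / static_cast<float>(n));
    }
    return model;
}
```

Each sample is independent of every other sample, which is what makes the outer loop a natural candidate for a parallel loop on the CPU.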
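The second step is to classify the pixel. To reduce the computation, the channels of the color space are assumed to be independent, so for each pixel the color probability is calculated by [19]

$$p(I(x, y)) = p_R(I_R(x, y)) \, p_G(I_G(x, y)) \, p_B(I_B(x, y)),$$

where $p_R$, $p_G$, and $p_B$ are the distribution models for the red, green, and blue channels, respectively, $I(x, y)$ is the pixel value at coordinate $(x, y)$, and $p(I(x, y))$ is the probability density of $I(x, y)$. For a given color channel $i$, the distribution is

$$p_i(I_i(x, y)) = \frac{1}{\sqrt{2\pi} \, \sigma_i(x, y)} \exp\left( -\frac{\left( I_i(x, y) - \mu_i(x, y) \right)^2}{2 \sigma_i^2(x, y)} \right).$$

Using the model parameters in (1), the following formulas determine whether a pixel is foreground or background:

$$D_i(x, y) = \begin{cases} 1, & \left| I_i(x, y) - \mu_i(x, y) \right| > k \, \sigma_i(x, y), \\ 0, & \text{otherwise}, \end{cases} \qquad D(x, y) = \begin{cases} 1, & \sum_{i \in \{R, G, B\}} D_i(x, y) \geq 2, \\ 0, & \text{otherwise}, \end{cases}$$

where $D_i(x, y)$ denotes the detection result for channel $i$: it is set to 1 if the pixel at location $(x, y)$ has changed in that channel and to 0 otherwise. $k$ is a constant that affects the final change detection; in this experiment, $k = 2.5$. $D(x, y)$ denotes the detection result for the pixel. If at least two color channels have changed, $D(x, y) = 1$ and the pixel is marked as a foreground pixel, which is regarded as a fire-suspected area; otherwise it is considered background.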
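The classification rule can be illustrated with a short C++ sketch. It assumes that a channel counts as changed when its value differs from the background mean by more than a constant multiple (2.5 in the experiment) of the standard deviation; the struct `PixelModel` and the function names are hypothetical.

```cpp
#include <cmath>

// Hypothetical per-pixel record: current value, background mean, and
// background standard deviation for the R, G, and B channels.
struct PixelModel {
    float value[3];
    float mean[3];
    float stddev[3];
};

// Returns 1 when channel i deviates from its mean by more than k standard
// deviations (assumed form of the per-channel test), 0 otherwise.
int channel_changed(const PixelModel& p, int i, float k) {
    return std::fabs(p.value[i] - p.mean[i]) > k * p.stddev[i] ? 1 : 0;
}

// A pixel is foreground (fire-suspected) when at least two of its three
// channels have changed.
bool is_foreground(const PixelModel& p, float k = 2.5f) {
    int changed = 0;
    for (int i = 0; i < 3; ++i) changed += channel_changed(p, i, k);
    return changed >= 2;
}
```

Requiring two channels rather than one makes the decision less sensitive to noise in a single channel, at the cost of missing changes confined to one channel.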

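Adaptation of Model Parameters.
This step updates the background model by adapting its parameters. In general, the observed scene can change with lighting or other natural effects. To respond to such environmental changes, the parameters of each pixel are updated by

$$\mu_i^t(x, y) = (1 - \alpha_i) \, \mu_i^{t-1}(x, y) + \alpha_i \, I_i^t(x, y),$$

$$\left( \sigma_i^t(x, y) \right)^2 = (1 - \alpha_i) \left( \sigma_i^{t-1}(x, y) \right)^2 + \alpha_i \left( I_i^t(x, y) - \mu_i^t(x, y) \right)^2,$$

where $I_i^t(x, y)$ denotes the value of the pixel at coordinate $(x, y)$ in the $i$th color channel at time $t$, $\mu_i^t(x, y)$ and $\mu_i^{t-1}(x, y)$ denote the mean values at times $t$ and $t - 1$, respectively, and $\sigma_i^t(x, y)$ and $\sigma_i^{t-1}(x, y)$ are the standard deviations at times $t$ and $t - 1$, respectively. $\alpha_i$ is a constant for updating the model parameters in the $i$th color channel, which ranges from 0 to 1.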
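As an illustration, the update for a single channel of a single pixel might look like the sketch below. It assumes the common running-average form, in which the new mean is `(1 - alpha) * old_mean + alpha * value` and the variance is blended the same way; the weighting convention and the names `ChannelStats` and `adapt` are assumptions, not taken from the paper.

```cpp
#include <cmath>

// Hypothetical per-channel background statistics for one pixel.
struct ChannelStats {
    float mean;
    float stddev;
};

// One adaptation step for a single channel (assumed running-average form).
// alpha in [0, 1]: 0 keeps the old model, 1 replaces the mean with the
// current value.
ChannelStats adapt(ChannelStats prev, float value, float alpha) {
    ChannelStats next;
    next.mean = (1.0f - alpha) * prev.mean + alpha * value;
    const float d = value - next.mean;
    const float variance =
        (1.0f - alpha) * prev.stddev * prev.stddev + alpha * d * d;
    next.stddev = std::sqrt(variance);
    return next;
}
```

A small `alpha` makes the model follow slow lighting changes while staying stable against brief foreground events.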

Parallelization of the Background Removal Model
In this section, we apply TBB+CUDA to the background removal model. The hybrid architecture of CPU and GPU is shown in Figure 1. First, the video is decoded from the AVI format on the CPU. From the decoded video frames, the mean and standard deviation are initialized on the CPU, and the initialized result is transferred to global memory on the GPU. Second, the kernel function reads the parameters from global memory and performs the multithreaded computing tasks. After all the calculations are finished, the result is transferred from the GPU back to the CPU.

Realization of Parallel Algorithm Based on TBB.
TBB supports scalable parallel programming using standard ISO C++ code, and it lets the programmer focus on the parallel computation without having to deal with threads explicitly. In TBB, we specify tasks instead of threads. Tasks are mapped and scheduled onto physical threads by the TBB scheduler [20]. Moreover, TBB abstracts platform details and simplifies parallel programming. In addition, TBB provides a template-based runtime library that contains a set of data structures and algorithms, which allows developers to concentrate on identifying concurrency rather than on managing it [21]. Through the relevant TBB template class, the computation of the mean value and standard deviation is distributed over different threads, making full use of the multicore resources. In particular, the TBB-based algorithm consists of the following steps.
Step 1. Install the TBB parallel computing platform and set up the environment for the preprocessing.
Step 2. Initialize a TBB task scheduler. The task scheduler object is a task_scheduler_init, which is responsible for allocating the worker threads.
Step 3. Develop the parallel computing template class. The parallel_for template is selected, and the mean and the standard deviation of the $N$ frames are computed in the body object.
Step 4. Invoke the parallel template "parallel_for." Once the loop body is written as a body object, the general form of the call is parallel_for(blocked_range<T>(begin, end, grainsize), body). parallel_for breaks the iteration space into chunks and runs each chunk on a separate thread. The size of the chunk handled by each call to the body's operator() is governed by grainsize. In this project, grainsize is set to 1000 [11].
Step 5. Terminate the TBB task scheduler and collect the results.
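To make the role of grainsize concrete, the following standard-library sketch mimics how a blocked range is subdivided: a range is halved at its midpoint as long as it holds more than grainsize iterations. This reproduces only the splitting rule, as applied when TBB's simple_partitioner is used (other partitioners may stop splitting earlier); it does not use TBB or run anything in parallel, and `split_range` and `Chunk` are hypothetical names.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Chunk = std::pair<std::size_t, std::size_t>;  // half-open [begin, end)

// Recursively halves [begin, end) while it holds more than grainsize
// iterations, then records the resulting chunk. grainsize must be >= 1.
void split_range(std::size_t begin, std::size_t end, std::size_t grainsize,
                 std::vector<Chunk>& out) {
    if (begin >= end) return;  // nothing to do for an empty range
    if (end - begin > grainsize) {
        const std::size_t mid = begin + (end - begin) / 2;
        split_range(begin, mid, grainsize, out);
        split_range(mid, end, grainsize, out);
    } else {
        out.emplace_back(begin, end);  // one chunk = one call to the loop body
    }
}
```

For a 320 × 240 frame (76800 pixels) and a grainsize of 1000, this rule yields 128 chunks of 600 pixels each, so every chunk ends up between half the grainsize and the grainsize.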

Realization of Parallel Algorithm Based on CUDA.
In this section, we present the CUDA-based background removal model.Details can be described as follows.
Step 1. Declare shared memory. Each block has 16 KB of shared memory, which is used to store the mean values and standard deviations of 256 pixels. Shared memory is divided into 16 banks, each 32 bits wide, and successive 32-bit words are assigned to successive banks. The SM executes shared-memory instructions one half-warp at a time, so the threads of a half-warp read the data linearly from the banks in shared memory.
Step 2. Assign the number of threads and the number of blocks. We set the number of threads per block to 256. The number of blocks in the grid is width × height / 256.
Step 3. Compute the position of the first pixel. In this project, each thread handles one pixel. For each thread, CUDA provides the thread number as threadIdx.x and the block number as blockIdx.x. Each thread can determine the location of its data source from threadIdx.x and blockIdx.x. The formula is as follows:

$$\mathit{idx} = \mathrm{blockIdx.x} \times \mathrm{blockDim.x} + \mathrm{threadIdx.x}.$$

The value of $\mathit{idx}$ is the start of the data source. Each thread executes the kernel program on its own data, which matches CUDA's data-parallel execution model.
Step 4. Copy data from global memory to shared memory. Because the threads of a block share the same shared memory, and the access latency of shared memory is lower than that of global memory, we transfer the data from global memory to shared memory, including $\mu_i(x, y)$, $\sigma_i(x, y)$, and $I_i^t(x, y)$, $i \in \{R, G, B\}$.
Step 5. Transfer the updated data from the GPU to the CPU and release the GPU memory.
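The launch arithmetic in Steps 1 to 3 can be checked on the host with a few lines of C++. This is a host-side sketch rather than a CUDA kernel: the function names are hypothetical, the block count is rounded up so that image sizes not divisible by 256 are still covered, and the shared-memory estimate assumes that the mean and standard deviation are stored as 4-byte floats for three channels.

```cpp
#include <cstddef>

constexpr std::size_t kThreadsPerBlock = 256;  // threads per block (Step 2)

// Number of blocks so that every pixel gets one thread (rounded up).
constexpr std::size_t blocks_for(std::size_t width, std::size_t height) {
    return (width * height + kThreadsPerBlock - 1) / kThreadsPerBlock;
}

// Host-side equivalent of blockIdx.x * blockDim.x + threadIdx.x.
constexpr std::size_t global_index(std::size_t block_idx, std::size_t thread_idx) {
    return block_idx * kThreadsPerBlock + thread_idx;
}

// Shared memory needed per block for mean and standard deviation,
// assuming 3 channels and float storage (an assumption, see lead-in).
constexpr std::size_t shared_bytes_per_block() {
    return kThreadsPerBlock * 3 * 2 * sizeof(float);
}
```

For a 320 × 240 frame this gives 300 blocks, and the last thread of the last block maps to pixel 76799. With a rounded-up block count, a kernel would also need a bounds check such as `idx < width * height` before touching memory.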

Experiment Results
The experiments were conducted on a Pentium (R) Dual-Core E6500 2.94 GHz personal computer equipped with NVIDIA's GeForce 210 GTX graphics card. The software is Intel TBB and CUDA 4.0. We applied three different parallel methods: TBB, CUDA, and TBB+CUDA. The runtimes measured for different image sizes are given in Table 1. The runtime of TBB is clearly lower than that of the serial algorithm. TBB automatically decides the number of threads that the task scheduler will use and then distributes the work over the cores according to the work-stealing algorithm. The benefit of TBB is evident when its results are compared with those of the serial approach.
The speedups of the three methods on different image sizes are listed in Table 2. Speedup is the ratio of sequential runtime to parallel runtime for the same task. As can be seen, TBB achieves lower performance than CUDA. The reason is that the required number of threads is much larger than the two cores of the CPU, whereas the GPU has many execution cores and a larger number of registers for data processing. The speedup of CUDA is at least about 9 times that of TBB, and TBB+CUDA achieves a higher speedup than CUDA.
In contrast with the TBB and CUDA methods, TBB+CUDA significantly accelerates the single-Gaussian distribution model. It reduces the communication overhead of data transfer latency and memory latency between CPU and GPU. The total runtime includes the data transfer time from CPU to GPU, the runtime of the kernel, and the data transfer time from GPU to CPU, so it is beneficial to cut unnecessary latency and reduce the proportion of CPU-GPU data transfer time in the total runtime. As shown in Table 3, the latency of CUDA is nearly 10-19 times larger than that of CUDA+TBB. Because TBB is used to compute the initial parameters, the proportion of latency with CUDA+TBB shows a 2%-17% improvement over CUDA.
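For reference, the quantities compared in Tables 2 and 3 can be written compactly. The symbols below are introduced here for illustration and do not appear in the original text: $S$ is the speedup, $T_{\text{H2D}}$ and $T_{\text{D2H}}$ are the CPU-to-GPU and GPU-to-CPU transfer times, and $r_{\text{transfer}}$ is one plausible reading of the "proportion of CPU-GPU data transfer latency."

```latex
S = \frac{T_{\text{serial}}}{T_{\text{parallel}}}, \qquad
T_{\text{total}} = T_{\text{H2D}} + T_{\text{kernel}} + T_{\text{D2H}}, \qquad
r_{\text{transfer}} = \frac{T_{\text{H2D}} + T_{\text{D2H}}}{T_{\text{total}}}
```

Under this reading, moving the initialization to the CPU lowers $r_{\text{transfer}}$ by shrinking the transfer terms while leaving $T_{\text{kernel}}$ unchanged.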
In this experiment, the images were captured at nighttime. Experimental results are presented in Figure 2 and Figure 3, where the true positive (TP) rate of fire detection is demonstrated. An average TP rate of 94% is obtained over 220 test video sequences. Figure 3 shows the extent of the flame and the effect of background removal in video frames of 320 × 240 pixels.

Conclusion
To accelerate background removal, this paper proposes a hybrid parallel mode. This parallel mode consists of two phases: the initialization phase, performed by TBB on the CPU, and the parallel computing phase for background removal and adaptation of the model, performed by CUDA on the GPU. The experimental results indicate that our solution makes full use of the computing resources of both GPU and CPU, leading to a higher speedup than TBB or CUDA alone. The hybrid parallel mode is generic to a certain extent and can also be applied to other areas such as traffic routing and logistics location.

Table 1: Runtime on different image sizes (ms).

Table 2: Speedup of the three modes on different image sizes.

Table 3: Proportion of CPU-GPU data transfer latency.