A Time-Consistent Video Segmentation Algorithm Designed for Real-Time Implementation

In this paper, we propose a time-consistent video segmentation algorithm designed for real-time implementation. Our segmentation algorithm is based on a region merging process that combines both spatial and motion information. The spatial segmentation benefits from an adaptive decision rule and a specific order of merging. Our method has proven to be efficient for the segmentation of natural images (flat or textured regions) with few parameters to be set. Temporal consistency of the segmentation is ensured by incorporating motion information through an improved change detection mask. This mask is designed using both illumination differences between frames and the region segmentation of the previous frame. By considering both the pixel and region levels, we obtain a particularly efficient algorithm at a low computational cost, allowing its real-time implementation on the TriMedia processor for CIF image sequences.


I. INTRODUCTION
Segmentation of videos into homogeneous regions is an important issue for many video applications such as region-based motion estimation, image enhancement (since different processing may be applied to different regions), and 2D-to-3D conversion (where segmentation can be used for depth estimation). These applications require two main features from segmentation: accuracy of region boundaries in the spatial segmentation, and temporal stability of the segmentation from frame to frame. Spatial segmentation methods can be classified into two main categories, namely contour-based and region-based methods. In the first category, edges are computed and connected components are extracted [1]. The main drawback of such an approach is that the computation of the gradient is prone to large errors, especially in noisy images. Moreover, it cannot take advantage of statistical properties of the considered image regions. The second category of methods, i.e. region-based, is less sensitive to noise. We are interested here in a bottom-up segmentation approach. In these methods, two important points must be considered: the first one is the order of merging and the second one is the similarity criterion; see for example [2], [3], [4]. When dealing with video segmentation, the temporal dimension must be added. Various approaches have been tested. Some authors consider video data as a volume [5], while others take advantage of motion information, such as scene change detection or motion fields [6], [7]. Other works address object tracking [8], which is beyond the topic of this paper.
In this paper, we propose a spatial segmentation that benefits from both an adaptive decision rule and an original order of merging. This method gives good results for spatial segmentation with few parameters to be set. We use motion information to improve the temporal consistency. Since motion estimation is a real bottleneck for real-time implementation, we instead combine an improved Change Detection Mask (CDM) with the spatial segmentation in order to improve the temporal consistency of our segmentation. By making comparisons both at the pixel level and at the region level, we obtain an efficient video segmentation algorithm at a low computational cost. Our method runs in real time on the TriMedia processor for CIF image sequences. Moreover, experimental results on real video sequences demonstrate a good temporal consistency.
The paper is organized as follows. The spatial segmentation method is detailed in section II. The temporal consistency improvement is detailed in section III. In section IV, we discuss the implementation of the algorithm. Experimental results and measures are given in section V.

II. SPATIAL SEGMENTATION
Let us consider an image I; the notation |·| represents the cardinality of a set, and I(p, n) the pixel intensity at position p = (x, y)^T in frame n.
A region-based segmentation problem aims at finding a relevant partition of the image domain into m regions {S_1, S_2, ..., S_m}. The segmentation algorithm detailed here uses an implicit encoding of the initial 4-connected planar sampling grid of pixels. It combines a specific order of merging with an adaptive threshold. Both steps are detailed hereafter.
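As a sketch, such an implicit encoding of the evolving partition can be realized with a disjoint-set (union-find) structure over the pixel indices, which also maintains the region sizes |S_i| needed later by the merging predicate. The class and method names below are our own illustration, not the paper's actual implementation:

```python
# Disjoint-set (union-find) over the pixels of a W x H grid: each pixel
# starts as its own region; merging two regions is a union of their sets.
# This is a generic sketch, not the paper's TriMedia implementation.

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))  # each pixel is initially its own root
        self.size = [1] * n           # region cardinality |S_i| at each root

    def find(self, a):
        # Follow parent links to the root, with path halving for speed.
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]
            a = self.parent[a]
        return a

    def union(self, a, b):
        # Merge the regions containing pixels a and b; return the new root.
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:  # union by size
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra
```

Keeping the sizes at the roots means the adaptive threshold of section II-B can be evaluated in constant time per candidate merge.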

A. ORDER OF MERGING
The order of merging is built on the edge weights as in [9], [10]. The idea behind this order of merging is to merge similar regions first rather than different ones. An edge e denotes a couple of pixels (p, p′) in a 4-connectivity scheme, and its weight measures the distance between the two pixel colors. For our algorithm, we consider the YUV color space, since this is the native color space of CIF sequences. The (L*a*b*) color space provides partitions of slightly better subjective quality, but at a higher computational cost.
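A minimal sketch of the edge-weight computation over the 4-connected grid is given below. The exact color distance is not reproduced here, so the sum of absolute YUV channel differences used in this sketch is our assumption:

```python
# Build the 4-connectivity edge list with color-distance weights (sketch).
# Assumption: the distance between two pixel colors is taken as the sum of
# absolute differences over the Y, U, V channels.

def edge_weights(yuv, w, h):
    """yuv[p] = (Y, U, V) tuple for pixel index p = y * w + x."""
    def dist(p, q):
        return sum(abs(a - b) for a, b in zip(yuv[p], yuv[q]))

    edges = []  # list of (weight, pixel, pixel) tuples
    for y in range(h):
        for x in range(w):
            p = y * w + x
            if x + 1 < w:                          # right neighbour
                edges.append((dist(p, p + 1), p, p + 1))
            if y + 1 < h:                          # bottom neighbour
                edges.append((dist(p, p + w), p, p + w))
    return edges
```

For a W × H image this produces 2WH − W − H edges, each visited exactly once.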
The edges are sorted in increasing order of their weights, and the corresponding couples of pixels are processed in this order for merging. As far as the implementation is concerned, the image is only scanned twice for this sorting: a first time to count the number of edges with the same weight and store these counts in a table (the edge weight histogram), and a second time to store each edge in the memory area corresponding to its weight.
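The two-scan sort described above is a counting sort over the bounded range of edge weights. A sketch (function names are ours):

```python
# Two-pass counting sort of edges by weight: pass 1 builds the edge-weight
# histogram, pass 2 scatters each edge into the slot reserved for its weight.
# Runs in O(E + max_weight) time, with no comparison-based sorting.

def counting_sort_edges(edges, max_weight):
    """edges: list of (weight, p, q); weights are integers in [0, max_weight]."""
    hist = [0] * (max_weight + 1)
    for w, _, _ in edges:              # pass 1: histogram of weights
        hist[w] += 1

    start = [0] * (max_weight + 1)     # first output slot for each weight
    acc = 0
    for w in range(max_weight + 1):
        start[w] = acc
        acc += hist[w]

    out = [None] * len(edges)
    for e in edges:                    # pass 2: stable scatter by weight
        out[start[e[0]]] = e
        start[e[0]] += 1
    return out
```

Because the weight range is small and fixed, this avoids the O(E log E) cost of a comparison sort, which matters for a real-time target.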

B. THE CRITERION OF MERGING
Given two regions S_1 and S_2, we want to know whether these two regions have to be merged. A similarity criterion between the two regions must then be chosen and evaluated. A couple of adjacent regions will be considered as similar, and merged, if their merging criterion is below a given threshold. The choice of the corresponding threshold is often difficult. In this paper, we propose an adaptive threshold for a similarity criterion based on the means of the intensities of the two regions. This adaptive threshold is justified using statistical inequalities as in [9], but we propose here a simpler statistical interpretation of the image, which leads to a criterion better suited to real-time implementation. Such a predicate gives very good segmentation results, as shown in Fig. 2.
1) Merging predicate: The mean of the intensities of region S_i is computed as

$$\bar{S}_i = \frac{1}{|S_i|} \sum_{k=1}^{|S_i|} I_i(p_k, n), \qquad (1)$$

where I_i(p_k, n) is the intensity of the k-th pixel of region S_i. The merging predicate is then

$$|\bar{S}_1 - \bar{S}_2| \le g \sqrt{\frac{1}{2Q}\left(\frac{1}{|S_1|} + \frac{1}{|S_2|}\right) \ln\frac{2}{\delta}}, \qquad (2)$$

where g is the maximum level of I (g = 255 for grayscale images) and 0 < δ ≤ 1 is the confidence parameter of the statistical justification below. The parameter Q allows one to tune the coarseness of the segmentation. In the experiments, we choose Q = 2, which gives good results for CIF (Common Intermediate Format, 352 × 288) images.
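A direct sketch of this predicate follows. The constant under the square root, including the confidence term ln(2/δ), matches our reconstruction of the bound; the default value of δ is our own assumption and should be tuned:

```python
import math

# Adaptive merging predicate (sketch). Two adjacent regions are merged when
# the difference of their mean intensities falls below a bound that shrinks
# as the regions grow. g is the maximum intensity level, Q tunes coarseness;
# the default delta is an assumed confidence parameter, not the paper's value.

def should_merge(mean1, n1, mean2, n2, g=255, Q=2, delta=1.0 / (352 * 288)):
    """mean1, mean2: region mean intensities; n1, n2: region sizes |S_i|."""
    bound = g * math.sqrt((1.0 / (2 * Q)) * (1.0 / n1 + 1.0 / n2)
                          * math.log(2.0 / delta))
    return abs(mean1 - mean2) <= bound
```

Note the adaptivity: for small regions the bound is loose (little statistical evidence, so merging is encouraged), while for large regions it tightens, so only genuinely similar regions keep merging.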
2) Statistical justification of the predicate: Classically, the image I is considered to be an observation of a perfect image I*, and pixel intensities are then considered as observations of a vector of random variables (r.v.) denoted X = (X_1, ..., X_n)^T. The merging predicate is based on the use of statistical inequalities, especially the McDiarmid inequality, as in [9].
Theorem 2.1 (the independent bounded differences inequality [11]): Let X = (X_1, X_2, ..., X_n) be a family of n independent r.v. with X_k taking values in a set A_k for each k. Suppose that the real-valued function f defined on ∏_k A_k satisfies |f(x) − f(x′)| ≤ c_k whenever the vectors x and x′ differ only in the k-th coordinate. Let E(f(X)) be the expected value of the r.v. f(X); then for any τ ≥ 0,

$$P\big(|f(X) - E(f(X))| \ge \tau\big) \le 2 \exp\left(-\frac{2\tau^2}{\sum_k c_k^2}\right).$$

When testing the deviation between two adjacent regions S_1 and S_2, we consider the vector of random variables X = (I*_1(p_1), ..., I*_1(p_{|S_1|}), I*_2(p_1), ..., I*_2(p_{|S_2|})), where I_i(p_j) is the intensity of the j-th pixel of S_i, corresponding to the observation of the random variable I*_i(p_j). In this case, the size of the random vector is n = |S_1| + |S_2|. In order to apply Theorem 2.1, we choose f(x) = \bar{S}_1 − \bar{S}_2. Modeling each pixel intensity, as in [9], as a sum of Q independent r.v. each taking values in an interval of length g/Q, a change in one such component of a pixel of S_i modifies f by at most c_k = g/(Q|S_i|), so that Σ_k c_k² = (g²/Q)(1/|S_1| + 1/|S_2|). By inversion of the theorem, we have with a probability of at least 1 − δ (0 < δ ≤ 1):

$$\left|\bar{S}_1 - \bar{S}_2 - E(\bar{S}_1 - \bar{S}_2)\right| \le g \sqrt{\frac{1}{2Q}\left(\frac{1}{|S_1|} + \frac{1}{|S_2|}\right) \ln\frac{2}{\delta}},$$

where E(S) denotes the expected value of S. If S_1 and S_2 belong to the same region in I*, the expectation E(\bar{S}_1 − \bar{S}_2) is null and the predicate follows.

C. SEGMENTATION ALGORITHM
Our spatial segmentation can be divided into three steps. In the first one, we compute the weights of the edges and their histogram. In the second step, we sort the edges in increasing order of their weights. In the last step, we merge the pixels or regions connected by each edge, following this order. The implementation of this last step is given in section IV.

III. TIME CONSISTENCY IMPROVEMENT
In video segmentation, the quality of the spatial segmentation is not the only requirement; time consistency is also a very important one. If one region is segmented very differently in two successive frames because of noise, occlusion or disocclusion, the segmentation results become very difficult to exploit for applications such as image enhancement, depth estimation or motion estimation. Many works, see for example [7], use motion estimation to improve time consistency in video segmentation. However, motion estimation is a real bottleneck for real-time implementation and is even sometimes unreliable. In this paper, we combine an improved Change Detection Mask (CDM) with the spatial segmentation in order to improve the temporal consistency of our segmentation.
The CDM is designed using both illumination differences between frames and the region segmentation of the previous frame. We first detect changing pixels using the frame difference. Then, we take advantage of the region segmentation of the previous frame in order to classify the pixels not only at the pixel level but also at the region level. Given the current frame I(:, n) and the previous one I(:, n − 1), the frame difference FD is given by FD(p) = |I(p, n) − I(p, n − 1)|. Classically, FD is thresholded in order to distinguish changing pixels from noise, yielding a binary label map L; a pixel p with L(p) = 1 is denoted a changing pixel. We then use the previous segmentation in order to convert the CDM from the pixel level to the region level, which is more reliable. For each region S_i in the previous segmentation, we compute τ(S_i), the ratio of changing pixels in the region: τ(S_i) = N_{i,changing} / |S_i|, where N_{i,changing} is the number of changing pixels in region S_i. Pixels are then classified using three categories:
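The pixel-level and region-level computations above can be sketched as follows; the frame-difference threshold value is an assumption, and frames are taken as flat lists of intensities with a per-pixel region label map from the previous segmentation:

```python
# Region-level Change Detection Mask (sketch; the pixel threshold value
# is an assumption). Pass 1 over the pixels computes the frame difference
# FD(p) and the pixel-level label L(p); the per-region ratio tau(S_i) of
# changing pixels then lifts the decision to the region level.

def region_change_ratio(frame_prev, frame_cur, labels_prev, n_regions,
                        pixel_thresh=15):
    """labels_prev[p]: region index of pixel p in the previous segmentation."""
    changing = [0] * n_regions   # N_{i,changing} per region
    total = [0] * n_regions      # |S_i| per region
    L = []                       # pixel-level change labels
    for p, (a, b) in enumerate(zip(frame_prev, frame_cur)):
        fd = abs(a - b)                       # frame difference FD(p)
        lp = 1 if fd > pixel_thresh else 0    # pixel-level label L(p)
        L.append(lp)
        r = labels_prev[p]
        total[r] += 1
        changing[r] += lp
    tau = [changing[r] / total[r] if total[r] else 0.0
           for r in range(n_regions)]
    return L, tau
```

Working with τ(S_i) rather than L(p) alone makes the mask robust to isolated noisy pixels, since a single flickering pixel barely moves the ratio of a large region.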

Fig. 1. (a) Segmentation of frame n−1. (b) CDM computed from the segmentation of frame n−1 and the frame difference between frames n−1 and n. (c) Segmentation of frame n without the CDM. (d) Segmentation of frame n with the CDM.