Benefits of Using Decorrelated Color Information for Face Segmentation / Tracking

We analyze in this paper the benefits that can be derived from employing color image alignment techniques in the context of face segmentation or tracking based on texture (defined as the patch of intensities) template matching. By making full use of the decorrelated color information, improvements in the accuracy of the segmentation are demonstrated. This is intended to enhance the face segmentation algorithm by increasing its robustness to differences between images caused by various image acquisition devices or settings, or by variations in the ambient illumination conditions.


INTRODUCTION
The use of color information is becoming increasingly important in today's image processing applications as inexpensive color image acquisition devices become widely available. Color image processing permits a more extensive image representation, which is expected to lead to better results.
We deal in this paper with the specific case of face segmentation which employs face modeling techniques. This can also be viewed in the more general context of deformable template matching, using for this purpose a statistical model of shape variations. Extensive work has been carried out in the area of face modeling and face segmentation using statistical models [1][2][3][4][5]. These techniques were initially developed for gray-level images. Extensions have later been proposed for color images [6,7]. Some advantages of using the color extension have been demonstrated, but mostly for working in a controlled image acquisition environment. Processing color information can thus be challenging, especially when designing more general applications that are supposed to work within unconstrained image acquisition conditions. We demonstrate in this paper some positive results when using decorrelated color information for applications which include face segmentation and face tracking, intended to work under no predefined constraints.
The outline of this paper is as follows. In Section 2, we briefly describe several decorrelated color spaces in terms of their transforms from the common RGB color space; we also compare these color spaces in terms of how well they decorrelate the image channels on a series of test images. In Section 3, a face segmentation method is described, based on a statistical shape model and a fixed face texture template. The limitations of the method described in Section 3 are addressed in Section 4, which introduces texture alignment and color transfer techniques in order to adapt the texture template to the color distribution of the current image. These operations are facilitated by converting the texture data to one of the decorrelated color spaces presented in Section 2. In Section 5, we present our experiments, performed on a general face image database built as a mixture of images gathered mostly from various standard face image and video databases, together with a set of comparative results. Finally, in Section 6 we draw the conclusions of our work.

IMAGE DECORRELATION WITH RESPECT TO COLOR INFORMATION
Colorwise image decorrelation is useful for applying color image processing operations independently on each image channel.

Karhunen-Loève transform
The Karhunen-Loève transform (KLT) is optimal in terms of energy compaction and mean-squared error minimization for a truncated representation. Applied to a color image, the KLT creates image basis vectors which are orthogonal, and it thus achieves complete decorrelation of the image channels [8][9][10][11][12]. In this transform, T contains the image color signals and E is the mathematical expectation; $C_{RGB}$ is the covariance matrix of the image color signals, with entries defined for x, y ∈ {R, G, B}; and A is the transformation matrix formed by the eigenvectors of the covariance matrix $C_{RGB}$. Yet, the KLT is data dependent, meaning that it requires the recalculation of the transformation matrix A for each set of data (e.g., each new image).
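As a rough illustration of the per-image KLT described above, the following NumPy sketch estimates the channel covariance of one image, takes its eigenvectors as the transform, and projects the centered pixels onto them. The function name `klt_decorrelate` and the synthetic test image are illustrative assumptions, not part of the original description.

```python
import numpy as np

def klt_decorrelate(image):
    """Project the color channels of one image onto the eigenvectors
    of their own covariance matrix (a per-image KLT)."""
    pixels = image.reshape(-1, 3).astype(np.float64)   # one row per pixel
    mean = pixels.mean(axis=0)                         # expectation of each channel
    centered = pixels - mean
    cov = centered.T @ centered / len(pixels)          # 3 x 3 channel covariance
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigh: symmetric input
    order = np.argsort(eigvals)[::-1]                  # strongest component first
    A = eigvecs[:, order]                              # columns are eigenvectors
    transformed = centered @ A
    return transformed.reshape(image.shape), A, mean

# Synthetic, strongly correlated RGB image (hypothetical test data).
rng = np.random.default_rng(0)
base = rng.random((32, 32, 1))
image = np.concatenate([base, 0.8 * base, 0.6 * base], axis=2)
image = image + 0.05 * rng.random((32, 32, 3))

channels, A, mean = klt_decorrelate(image)
cov_after = np.cov(channels.reshape(-1, 3), rowvar=False)
```

Because the transform is recomputed from each image, `A` and `mean` would have to be stored alongside the transformed data if the RGB values are to be recovered later.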

I1I2I3 color space
An interesting color space is I1I2I3, proposed by Ohta et al. [13], which realizes a statistical minimization of the interchannel correlations (decorrelation of the RGB components) for natural images. The conversion from RGB to I1I2I3 is given by the simple linear transformation in (4) as follows:

$$I_1 = \frac{R + G + B}{3}, \qquad I_2 = \frac{R - B}{2}, \qquad I_3 = \frac{2G - R - B}{4}. \tag{4}$$

I1 stands as the achromatic (intensity) component, while I2 and I3 are the chromatic components. We remark that the simple numeric transformation from RGB to I1I2I3 enables simple and efficient transformation of datasets between these two color spaces.
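Since the conversion is a fixed linear map, it can be applied to a whole image with one matrix product. The sketch below assumes the weights commonly attributed to Ohta et al., namely I1 = (R + G + B)/3, I2 = (R - B)/2, and I3 = (2G - R - B)/4; the function names are illustrative.

```python
import numpy as np

# Rows give I1, I2, I3 as linear combinations of (R, G, B).
# The weights are the commonly cited Ohta values (an assumption here).
RGB_TO_I1I2I3 = np.array([
    [1 / 3, 1 / 3, 1 / 3],
    [1 / 2, 0.0, -1 / 2],
    [-1 / 4, 1 / 2, -1 / 4],
])

def rgb_to_i1i2i3(image):
    """Convert an H x W x 3 RGB array to I1I2I3."""
    return image.astype(np.float64) @ RGB_TO_I1I2I3.T

def i1i2i3_to_rgb(image):
    """Invert the linear map to return to RGB."""
    return image @ np.linalg.inv(RGB_TO_I1I2I3).T

gray = np.array([[[0.5, 0.5, 0.5]]])
red = np.array([[[1.0, 0.0, 0.0]]])
gray_converted = rgb_to_i1i2i3(gray)   # chromatic channels vanish for gray
red_roundtrip = i1i2i3_to_rgb(rgb_to_i1i2i3(red))
```

Unlike the KLT, the matrix is the same for every image, so no per-image statistics need to be stored.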
I1I2I3 was designed as an approximation of the KLT of the RGB data, to be used for region segmentation on color images. As the transformation to I1I2I3 represents a good approximation of the KLT for a large set of natural images, the resulting color channels are almost completely decorrelated.
In the previous work of Ohta et al., the discriminating power of 109 linear combinations of R, G, and B was tested on eight different color scenes. The selected linear combinations were gathered such that they could successfully be used for segmenting important (large area) regions of an image, based on a histogram threshold. It was found that 83 of the linear combinations had all positive weights, corresponding mainly to an intensity component which is best approximated by I1; another 22 showed opposite signs for the weights of R and B, representing the difference between the R and B components, which is best approximated by I2; finally, the remaining 4 linear combinations could be approximated by I3. Thus it was shown that the I1, I2, and I3 components in (4) are effective for discriminating between different regions and that they are significant in this order [13]. We can further conclude, based on the above figures, that the percentage of color features which are well discriminated on the first, second, and third channels is around 76.15%, 20.18%, and 3.67%, respectively. I1I2I3 is also found in [14] to perform better than other color spaces such as YIQ, CIELAB, and UVW for segmentation of color images based on Markov random field (MRF) processing. In [15], the I1I2I3 color space was used for color image segmentation based on an MRF model and simulated annealing due to its effectiveness in terms of the quality of the segmentation and the reduced complexity of the transformation.

lαβ color space
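Assuming that the human visual system is ideal for processing natural scenes, Ruderman et al. [16] developed the lαβ color space, which also minimizes the correlation between channels for natural images. The conversion from RGB is realized by means of an initial transform to LMS cone space, followed by a conversion of the data to logarithmic space (used to reduce skewness):

$$\begin{bmatrix} L \\ M \\ S \end{bmatrix} = \begin{bmatrix} 0.3811 & 0.5783 & 0.0402 \\ 0.1967 & 0.7244 & 0.0782 \\ 0.0241 & 0.1288 & 0.8444 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}, \tag{5}$$

$$L' = \log L, \qquad M' = \log M, \qquad S' = \log S. \tag{6}$$

Finally, the lαβ data is obtained from

$$\begin{bmatrix} l \\ \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt{3}} & 0 & 0 \\ 0 & \frac{1}{\sqrt{6}} & 0 \\ 0 & 0 & \frac{1}{\sqrt{2}} \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & -2 \\ 1 & -1 & 0 \end{bmatrix} \begin{bmatrix} L' \\ M' \\ S' \end{bmatrix}. \tag{7}$$

This color space has successfully been used in [17,18] for image color transfer operations, which will be described in Section 4.1.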
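The three steps of this conversion (RGB to LMS, logarithm, LMS to lαβ) can be sketched as follows. The RGB-to-LMS matrix uses the coefficients quoted in the text; the second matrix is the form usually given for this color space, and both the base-10 logarithm and the small clipping constant `eps` are assumptions made for the sketch.

```python
import numpy as np

RGB_TO_LMS = np.array([
    [0.3811, 0.5783, 0.0402],
    [0.1967, 0.7244, 0.0782],
    [0.0241, 0.1288, 0.8444],
])

# Scaling and mixing matrix usually quoted for the l-alpha-beta space.
LMS_TO_LALPHABETA = np.diag([1 / np.sqrt(3), 1 / np.sqrt(6), 1 / np.sqrt(2)]) @ np.array([
    [1.0, 1.0, 1.0],
    [1.0, 1.0, -2.0],
    [1.0, -1.0, 0.0],
])

def rgb_to_lalphabeta(image, eps=1e-6):
    """Convert an H x W x 3 RGB array (values in [0, 1]) to l-alpha-beta."""
    rgb = np.clip(image.astype(np.float64), eps, None)   # avoid log(0); eps is an assumption
    lms = rgb @ RGB_TO_LMS.T
    log_lms = np.log10(lms)                              # base 10 assumed
    return log_lms @ LMS_TO_LALPHABETA.T

def lalphabeta_to_rgb(image):
    """Invert the three steps in reverse order."""
    log_lms = image @ np.linalg.inv(LMS_TO_LALPHABETA).T
    lms = 10.0 ** log_lms
    return lms @ np.linalg.inv(RGB_TO_LMS).T

gray = np.full((1, 1, 3), 0.5)
lab = rgb_to_lalphabeta(gray)
restored = lalphabeta_to_rgb(lab)
```

For a gray pixel the two chromatic channels come out close to zero, which is a quick sanity check on the matrices.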

Comparison between the different color image representations
The correlation between two image channels is computed from $v_i$ and $v_j$, which represent the ith and jth image channel signals, respectively (with i, j = 1, ..., 3, i ≠ j), for a certain color image representation.
The total interchannel correlation is then calculated from these pairwise correlations. The correlation coefficients have been measured for several test images (see Figure 1) in the discussed color image representations using the above formulae [19]. Results are summarized in Table 1.
It can be observed that the RGB representation presents a very high interchannel correlation, while the I1I2I3 and lαβ image representations significantly reduce this correlation. As stated above, the KLT, which is adapted to each particular image, achieves total decorrelation of the image channels.
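A measurement of this kind can be sketched as below. The exact formulae of the paper are not reproduced here: the sketch assumes the Pearson correlation coefficient for each channel pair and takes the total as the sum of the absolute pairwise values, both of which are assumptions, as are the function name and the synthetic images.

```python
import numpy as np

def interchannel_correlations(image):
    """Return the pairwise Pearson correlations of the three channels
    and their summed absolute value (assumed definition of the total)."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    corr = np.corrcoef(pixels, rowvar=False)        # 3 x 3 correlation matrix
    pairs = [(0, 1), (0, 2), (1, 2)]
    pairwise = {pair: corr[pair] for pair in pairs}
    total = sum(abs(value) for value in pairwise.values())
    return pairwise, total

# Hypothetical test data: one image with nearly identical channels,
# one image with statistically independent channels.
rng = np.random.default_rng(0)
base = rng.random((64, 64, 1))
correlated = np.repeat(base, 3, axis=2) + 0.01 * rng.random((64, 64, 3))
independent = rng.random((64, 64, 3))

_, total_correlated = interchannel_correlations(correlated)
_, total_independent = interchannel_correlations(independent)
```

Running the same function on an image before and after conversion to a decorrelated color space gives a direct comparison of the kind summarized in Table 1.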

FACE SEGMENTATION USING DEFORMABLE TEMPLATE MATCHING
Note that the term texture, frequently used in this paper, refers in the context of this work to the set of pixel intensities across an object, also subsequent to a suitable normalization.

Statistical shape models
We are interested in designing a shape model robust to head pose variations. The shape is defined as the set of positions of some fiducial points on the face. The model is statistically built from a training dataset which contains image examples, annotated with a fixed set of landmark points. The sets of 2-D coordinates of the landmark points define the shapes inside the image frame. These shapes are aligned using generalized Procrustes analysis [20], a technique for removing the differences in translation, rotation, and scale among the training shapes. This defines the shapes in the normalized frame.
Let N be the number of training examples. Each shape example is represented as a vector s of concatenated coordinates of its points $(x_1, x_2, \ldots, x_L, y_1, y_2, \ldots, y_L)^T$, where L is the number of landmark points. Principal component analysis (PCA) is then applied to the set of aligned shape vectors, reducing the initial dimensionality of the data. It can be noted that PCA is very similar to KLT. In a geometric interpretation, KLT can be viewed as a rotation of the coordinate system, while for PCA, the rotation of the coordinate system is preceded by a shift of the origin to the mean point [21]. Shape variability is thus linearly modeled as a base (mean) shape plus a linear combination of shape eigenvectors:

$$s_m = \bar{s} + \Phi_s b_s, \tag{10}$$

where $s_m$ represents a modeled shape, $\bar{s}$ is the mean of the aligned shapes, $\Phi_s = (\phi_{s_1} \mid \phi_{s_2} \mid \cdots \mid \phi_{s_p})$ is a matrix having p shape eigenvectors as its columns (p < N); finally, $b_s$ defines the set of parameters of the shape model. p is chosen so that a certain percentage of the total variance of the data is retained. The standard deviation of each parameter of the face model, as obtained from the training dataset, provides its dynamic range. Altering the model parameters only within their dynamic range helps ensure that only plausible instances of the modeled object are generated. A description of the way in which the optimal model parameters for a new image can automatically be estimated follows in Section 3.2.
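The construction of such a linear shape model can be sketched with a singular value decomposition of the centered, already aligned shape vectors. The function names, the 3-standard-deviation limit used for the dynamic range, and the toy data are assumptions made for illustration; Procrustes alignment is assumed to have been done beforehand.

```python
import numpy as np

def build_shape_model(shapes, variance_kept=0.95):
    """shapes: N x 2L array of aligned shape vectors (x_1..x_L, y_1..y_L)."""
    mean_shape = shapes.mean(axis=0)
    centered = shapes - mean_shape
    # The SVD of the centered data yields the PCA eigenvectors (rows of vt).
    _, singular, vt = np.linalg.svd(centered, full_matrices=False)
    variances = singular ** 2 / (len(shapes) - 1)
    cumulative = np.cumsum(variances) / variances.sum()
    p = int(np.searchsorted(cumulative, variance_kept)) + 1
    p = min(p, len(variances))
    return mean_shape, vt[:p].T, variances[:p]

def synthesize_shape(mean_shape, eigvecs, variances, b, limit=3.0):
    """Mean shape plus a linear combination of eigenvectors, with each
    parameter clipped to +/- limit standard deviations (assumed range)."""
    bound = limit * np.sqrt(variances)
    b = np.clip(b, -bound, bound)
    return mean_shape + eigvecs @ b

# Toy training set with two independent modes of variation.
v1 = np.zeros(10)
v1[0] = 1.0
v2 = np.zeros(10)
v2[5] = 1.0
a_coeffs = np.linspace(-3.0, 3.0, 20)
b_coeffs = np.tile([1.0, -1.0], 10)
shapes = 5.0 + np.outer(a_coeffs, v1) + np.outer(b_coeffs, v2)

mean_shape, eigvecs, variances = build_shape_model(shapes, variance_kept=0.99)
extreme = synthesize_shape(mean_shape, eigvecs, variances, b=np.array([100.0, 0.0]))
```

In the toy example the requested parameter value of 100 is clipped to the edge of the dynamic range, so `extreme` remains a plausible instance of the modeled shape.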

Face texture template optimization algorithm
In order to optimize the face model parameters, a texture template is also required. The separation between shape and texture is realized using a reference shape. Based on this reference shape, the so-called texture examples can be extracted. The reference shape is usually chosen as the pointwise mean of the shape examples. The texture examples are defined in the normalized frame of the reference shape. Each image example is then distorted such that the points that define its attached shape, used as control points, match the reference shape, while the topology is preserved. An image warping method is employed for this purpose. Image warping methods are discussed in Section 3.3.
Subsequent to the warping stage, all shape differences between the image examples have been removed. The texture across each image object is thus mapped into a shape-normalized representation. The resulting images are also called the image examples in the normalized frame. For each of these images, the corresponding pixel values across their common shape are scanned to form the texture vectors $t_{im} = (t_{im_1}, t_{im_2}, \ldots, t_{im_P})^T$, where P is the number of texture samples.
Based on previous experiments, we remark that the variability of the shape component of the face is much more important than the variability of the texture component in terms of a successful segmentation of the face. Due to this fact, we consider in the following a simplified formulation of a model-based face segmentation technique, where the modeled image is represented by a fixed texture template; extensions could be made to include texture variability, yet that was beyond the purpose of the current work. Thus during an optimization stage (fitting the model to a query image), the parameters to be found are $p = (g_s^T, b_s^T)^T$, where $g_s$ are the shape 2-D position, 2-D rotation, and scale parameters inside the image frame, and $b_s$ are the shape model parameters. The optimization of the parameters p is realized by minimizing the reconstruction error between the query image and the modeled image. The error is evaluated in the coordinate frame of the model, that is, in the normalized texture reference frame, rather than in the coordinate frame of the image. The difference between the query image and the modeled image is thus given by the residual r(p), the difference between the (normalized) image texture and the (normalized) template texture; $\|r(p)\|^2$ is the reconstruction error, with $\|\cdot\|$ denoting the Euclidean norm.
A first-order Taylor expansion of r(p) is given by

$$r(p + \delta p) \approx r(p) + \frac{\partial r}{\partial p}\,\delta p. \tag{12}$$

δp should be chosen so as to minimize $\|r(p + \delta p)\|^2$. It follows that

$$\delta p = -\left(\left(\frac{\partial r}{\partial p}\right)^{T} \frac{\partial r}{\partial p}\right)^{-1} \left(\frac{\partial r}{\partial p}\right)^{T} r(p). \tag{13}$$

Normally, the gradient matrix ∂r/∂p should be recomputed at each iteration. Yet, as the error is estimated in a normalized texture frame, it was shown that this gradient matrix may be considered as fixed, so that it can be precomputed from a training dataset; these techniques, introduced in [22] and extended to also incorporate a statistical texture variation model (as opposed to the fixed texture template described above), are called active appearance models (AAMs). Using this technique, each parameter in p is systematically displaced from its known optimal value, retaining the normalized texture differences. The resulting matrices are then averaged over several displacement amounts and over several training images. The update direction of the model parameters p is then given by

$$\delta p = -R\, r(p), \tag{14}$$

where $R = ((\partial r/\partial p)^T (\partial r/\partial p))^{-1} (\partial r/\partial p)^T$ is the pseudoinverse of the determined gradient matrix, which can be precomputed as part of the training stage. The parameters p continue to be updated iteratively until the error can no longer be reduced and convergence is declared.
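The fixed-gradient scheme can be illustrated on a toy residual function. In the sketch below the gradient matrix is estimated once by displacing each parameter from a known optimum, its pseudoinverse R is precomputed, and the parameters are then updated with δp = −R r(p) until the error stops decreasing. The residual `toy_residual`, the matrix `M`, and the displacement sizes are hypothetical stand-ins for the texture difference of a real model, and only a single training example is used.

```python
import numpy as np

def estimate_gradient(residual, p_opt, deltas=(0.05, 0.1)):
    """Estimate dr/dp by displacing each parameter from its known optimum
    and averaging the finite differences over signs and step sizes."""
    p_opt = np.asarray(p_opt, dtype=float)
    r0 = residual(p_opt)
    columns = []
    for k in range(len(p_opt)):
        estimates = []
        for delta in deltas:
            for signed in (-delta, delta):
                step = np.zeros_like(p_opt)
                step[k] = signed
                estimates.append((residual(p_opt + step) - r0) / signed)
        columns.append(np.mean(estimates, axis=0))
    return np.stack(columns, axis=1)            # one column per parameter

def fit(residual, p_start, R, max_iter=50):
    """Iterate p <- p - R r(p) while the squared error keeps decreasing."""
    p = np.asarray(p_start, dtype=float)
    error = np.sum(residual(p) ** 2)
    for _ in range(max_iter):
        candidate = p - R @ residual(p)
        candidate_error = np.sum(residual(candidate) ** 2)
        if candidate_error >= error:            # no further reduction: converged
            break
        p, error = candidate, candidate_error
    return p

# Toy, mildly nonlinear residual with a known optimum (hypothetical).
M = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [1.0, -1.0], [2.0, 1.0], [0.5, 2.0]])
p_true = np.array([0.5, -0.3])

def toy_residual(p):
    linear = M @ (p - p_true)
    return linear + 0.05 * linear ** 2

J = estimate_gradient(toy_residual, p_true)
R = np.linalg.solve(J.T @ J, J.T)               # pseudoinverse of the fixed gradient
p_fit = fit(toy_residual, np.zeros(2), R)
```

The point of the sketch is that `R` is computed once, offline, so that each iteration costs only one residual evaluation and one matrix-vector product.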

A TPS-based model fitting technique
Piecewise affine warping is extensively used in techniques like AAM due to its reduced computational costs. A triangulation (e.g., Delaunay) is used to partition the convex hull of the control points. The points inside triangles are then mapped via an affine transformation which uniquely assigns the corners of a triangle to their new positions. Although the assumption that the face patches are piecewise affine within the triangles is a satisfactory solution when there is a sufficiently large number of landmark points, it also shows an important drawback. When modeling large face pose variations, the corners of some triangles tend to get reversed due to occlusions of the corresponding landmark points. This affects the image warping outcome by creating erroneous face patches. The errors are further propagated into the fitting algorithm, resulting in an incorrect fit. That is why the piecewise affine warping method works well mostly for modeling frontal or nearly frontal faces.
A more advanced and accurate warping method is obtained by employing thin plate splines (TPSs), introduced in [23]. A short description of this warping method is also given in the appendix. An initial drawback of thin plate splines was that they were quite expensive to calculate. The solution requires the inversion of an $N \times N$ matrix (the bending energy matrix), which has a computational complexity of $O(N^3)$, where N is the number of points in the dataset (i.e., the number of pixels in the image); furthermore, the evaluation process is $O(N^2)$. Fortunately, important progress has been made in order to speed this process up. An approximation approach was proved in [24] to be very efficient in dealing with the first problem, greatly reducing the computational burden. As far as the evaluation process is concerned, the multilevel fast multipole method (MLFMM) framework was described in [25] for the evaluation of two-dimensional polyharmonic splines, while in [26] this work was extended for the specific case of TPS, showing that a reduction of the computational complexity from $O(N^2)$ to $O(N \log N)$ is indeed possible. Thus the computational difficulties involving the use of TPS have to an important extent been removed. We show in Figures 2 and 3 an example of fitting the model based on TPS warping. The error is evaluated relative to the number of available data points after the deformation.
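For a small number of control points, the TPS described in the appendix can be fitted directly by solving one dense linear system, which is the cubic-cost step mentioned above. The sketch below uses the kernel U(r) = r² log(r) and the usual block system for the weights w and the affine part a; the function names and the toy control points are assumptions, and none of the fast approximation schemes of [24], [25], or [26] is attempted.

```python
import numpy as np

def tps_kernel(r):
    """U(r) = r^2 log(r), with U(0) = 0 by continuity."""
    out = np.zeros_like(r)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out

def fit_tps(control, values):
    """control: p x 2 control points; values: p target values v_i.
    Returns the nonlinear weights w (p,) and affine part a (3,)."""
    p = len(control)
    dists = np.linalg.norm(control[:, None, :] - control[None, :, :], axis=2)
    K = tps_kernel(dists)
    P = np.hstack([np.ones((p, 1)), control])   # rows (1, x_i, y_i)
    system = np.zeros((p + 3, p + 3))
    system[:p, :p] = K
    system[:p, p:] = P
    system[p:, :p] = P.T
    rhs = np.concatenate([values, np.zeros(3)])
    solution = np.linalg.solve(system, rhs)     # dense solve, cubic cost
    return solution[:p], solution[p:]

def evaluate_tps(control, w, a, points):
    """Evaluate the fitted spline at an m x 2 array of points."""
    dists = np.linalg.norm(points[:, None, :] - control[None, :, :], axis=2)
    return a[0] + points @ a[1:] + tps_kernel(dists) @ w

control = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
values = np.array([0.0, 1.0, 0.0, 1.0, 0.8])
w, a = fit_tps(control, values)
reproduced = evaluate_tps(control, w, a, control)
```

A 2-D image warp would use two such splines, one for each output coordinate, fitted on the same control points.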

IMPROVED MODEL FITTING BY MEANS OF LOCAL COLOR TRANSFER
A face detection algorithm is first applied to the current image. We used here the Viola-Jones face detector [27], which is based on the AdaBoost algorithm [28]. A statistical relation between the face detector estimates for the face position and size (rectangle region) and the position and size of the reference shape inside the image frame is initially learnt (offline) from a set of training images. This relation is then used to obtain a more accurate initialization for the reference shape, tuned to the employed face detection algorithm. It is also important to have an initialization reasonably close to the real values in order to ensure the convergence of the fitting algorithm described in Section 3. Color statistics are then extracted across the convex hull of the landmark points of the initialized reference shape.

Image color transfer
According to [17], color can be transferred between two images (global color transfer) using the formula in (15), applied in the lαβ color space, where μ and σ are, respectively, the mean and standard deviation of the Gaussian distribution in the considered color space.
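A global transfer of this kind is usually implemented by matching the per-channel mean and standard deviation of one image to those of another. The sketch below follows that common form and is not a transcription of (15); the neutral names `image` (the data being recolored) and `reference` (the data supplying the statistics) are chosen to avoid committing to a source/target convention, and both inputs are assumed to be already converted to a decorrelated color space.

```python
import numpy as np

def transfer_color_statistics(image, reference):
    """Shift and scale each channel of `image` so that its mean and
    standard deviation match those of `reference` (both H x W x 3)."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    ref = reference.reshape(-1, 3).astype(np.float64)
    mu_img, sigma_img = pixels.mean(axis=0), pixels.std(axis=0)
    mu_ref, sigma_ref = ref.mean(axis=0), ref.std(axis=0)
    sigma_img = np.where(sigma_img > 0, sigma_img, 1.0)   # guard flat channels
    out = (pixels - mu_img) * (sigma_ref / sigma_img) + mu_ref
    return out.reshape(image.shape)

# Hypothetical data standing in for a texture template and a face region.
rng = np.random.default_rng(1)
template = rng.normal(0.2, 0.05, size=(16, 16, 3))
face_region = rng.normal(0.6, 0.20, size=(16, 16, 3))
adapted = transfer_color_statistics(template, face_region)
```

Treating the channels independently in this way is only reasonable because the color space is decorrelated, which is the motivation given in Section 2.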
For local color transfer between two images, color statistics (e.g., mean and variance of the Gaussian-modeled color distribution) are gathered from the target and source image, respectively, and used to calculate the color influence map (CIM).CIM contains the weights for each pixel in the target image, determined based on their proximity to the color range in the source image.
Consider the distance between a pixel and the center of the color distribution. For three-dimensional color data this is the Mahalanobis distance given by

$$d = \sqrt{(c - \mu)^T S^{-1} (c - \mu)}, \tag{16}$$

where c is the color vector of the pixel, μ is the center (mean) of the distribution, and S is the covariance matrix of the three-variate color texture vector. Yet, if a decorrelated color space is used, then the covariance matrix S is close to being diagonal and (16) reduces to the normalized Euclidean distance (17):

$$d = \sqrt{\sum_{i=1}^{3} \frac{(c_i - \mu_i)^2}{\sigma_i^2}}, \tag{17}$$

where σ is the standard deviation vector of c over the sample set.
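The reduction from the Mahalanobis distance to the normalized Euclidean distance can be checked numerically: when the covariance matrix is exactly diagonal, the two distances coincide. The function names and the sample values below are illustrative.

```python
import numpy as np

def mahalanobis(c, mu, S):
    """Mahalanobis distance of color c from the distribution center mu."""
    diff = np.asarray(c, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))

def normalized_euclidean(c, mu, sigma):
    """Per-channel differences scaled by the standard deviations."""
    diff = np.asarray(c, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(np.sum((diff / np.asarray(sigma, dtype=float)) ** 2)))

c = np.array([1.0, 2.0, 3.0])
mu = np.zeros(3)
sigma = np.array([1.0, 2.0, 3.0])
S_diagonal = np.diag(sigma ** 2)            # fully decorrelated channels

d_full = mahalanobis(c, mu, S_diagonal)
d_fast = normalized_euclidean(c, mu, sigma)
```

The normalized form needs only three divisions per pixel instead of a 3 × 3 solve, which matters when a weight has to be computed for every pixel of the image.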
The weights in the CIM are calculated using a function f(d) of the above distance, subject to a set of conditions. A suitable function was proposed in [18] for use with the lαβ color space. The color transfer equation in (15) was also extended in [18], including the case in which a single color c is used as the source for color transfer.

Adaptive texture template matching
Using a decorrelated color space (see Section 2), the color of the texture template (see Figure 4(a)) can be adapted to the current image, increasing the chance of a correct fitting (correct face segmentation) of the face model. Experimental results to support this premise and to confirm the benefits of employing color adaptation techniques with the template matching algorithm follow next.

EXPERIMENTS
The experiments have been performed on a randomly chosen subset of 16 images from the database in Figure 1. The images have been semiautomatically annotated and the set of annotations has been used as the ground truth for calculating the boundary errors, which give an objective measure of the fitting quality of the face model. The boundary errors are measured between the exact shape in the image frame (obtained from the ground truth annotations) and the optimized model shape in the image frame. The boundary error is calculated as the point-to-point (Pt-Pt) error, which is given by the Euclidean distance between the two shape vectors of concatenated x and y coordinates of the landmark points. The mean and standard deviation of the Pt-Pt errors are used to evaluate the boundary errors over a whole set of images. The results are summarized in Table 2.
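The Pt-Pt error as defined above, together with its mean and standard deviation over a set of images, can be computed as in the following sketch. The function names and the toy shape vectors are illustrative, and the use of the population standard deviation (NumPy's default) is an assumption.

```python
import numpy as np

def pt_pt_error(shape_true, shape_fit):
    """Euclidean distance between two shape vectors of concatenated
    x and y landmark coordinates."""
    diff = np.asarray(shape_true, dtype=float) - np.asarray(shape_fit, dtype=float)
    return float(np.linalg.norm(diff))

def summarize_errors(true_shapes, fitted_shapes):
    """Mean and standard deviation of the Pt-Pt error over a set of images."""
    errors = np.array([pt_pt_error(t, f) for t, f in zip(true_shapes, fitted_shapes)])
    return float(errors.mean()), float(errors.std())

# Two toy images with two landmarks each: (x_1, x_2, y_1, y_2).
truth = [np.array([0.0, 0.0, 0.0, 0.0]), np.array([1.0, 2.0, 3.0, 4.0])]
fitted = [np.array([3.0, 0.0, 4.0, 0.0]), np.array([1.0, 2.0, 3.0, 4.0])]
mean_error, std_error = summarize_errors(truth, fitted)
```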
An implementation based only on the intensity (gray scale) component has also been tested. The gray scale images have been obtained by applying the standard mix of RGB components in (22). The initial results (no color adaptation) show a slight gain in the fitting accuracy over the gray scale implementation when color information is added. However, a significant increase in face segmentation accuracy can be observed when adapting the color of the texture template using color transfer techniques. It can also be noted that the implementation based on the I1I2I3 color space performs slightly better in terms of segmentation accuracy, although subjectively better color adaptation results have been observed when using the lαβ color space. This can be explained by the fact that the I1I2I3 color space representation is more suitable to be used together with the fitting algorithm, which is implemented in the RGB color space.
The robustness to changes in the illumination conditions was also tested using the Oulu face image database [29]. An example of color adaptation of the texture template for this database is shown in Figure 5.

DISCUSSION AND CONCLUSIONS
We analyzed in this paper the possibility of enhancing a face segmentation/tracking method based on texture template matching by means of color image alignment. We also presented a model parameter optimization approach which minimizes the error between the texture template and the warped image texture across the current shape. We employed here the TPS-based warping method, which is more robust to head pose variations.
The color alignment techniques make use of the decorrelated color statistics of the current image and template image.Improvements of the accuracy of the segmentation have been demonstrated.
From our experiments, we can conclude that the color adaptation method for the texture template can also be useful in face tracking applications which employ face modeling techniques similar to the one described in Section 3. In particular, significant improvements and increased robustness were shown for the case of tracking a face under changes in the illumination conditions, such as a change in the type of illuminant. This may be a real change of the illuminant, or it could be caused by a wrong white balance setting of the image acquisition device.

APPENDIX IMAGE WARPING: PRINCIPAL WARPS
The thin plate spline (TPS)-based warping method, also named principal warps, was first introduced in [23]. It represents a nonrigid registration method, built upon an analogy with a theory in mechanics. Namely, the analogy is made with minimizing the bending energy of a thin metal plate on which pressure is exerted using some point constraints. The bending energy is then given by a quadratic form; the spline is represented as a linear combination (superposition) of eigenvectors of the bending energy matrix:

$$f(x, y) = a_1 + a_x x + a_y y + \sum_{i=1}^{p} w_i\, U\big(\|(x_i, y_i) - (x, y)\|\big), \tag{A.1}$$

where $U(r) = r^2 \log(r)$; $(x_i, y_i)$ are the initial control points. $a = (a_1, a_x, a_y)$ defines the affine part, while w defines the nonlinear part of the deformation.
The total bending energy is expressed as

$$I_f = \iint_{\mathbb{R}^2} \left[ \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \right] dx\,dy. \tag{A.2}$$

The surface is deformed so as to have minimum bending energy. The conditions that need to be met so that (A.1) is valid (so that f(x, y) has second-order derivatives) are given by

$$\sum_{i=1}^{p} w_i = 0, \qquad \sum_{i=1}^{p} w_i x_i = 0, \qquad \sum_{i=1}^{p} w_i y_i = 0. \tag{A.3}$$

Combined with the interpolation conditions $f(x_i, y_i) = v_i$, (A.1) can now be written as the linear system in (A.4):

$$\begin{bmatrix} K & P \\ P^T & O \end{bmatrix} \begin{bmatrix} w \\ a \end{bmatrix} = \begin{bmatrix} v \\ o \end{bmatrix}, \tag{A.4}$$

where $K_{ij} = U(\|(x_i, y_i) - (x_j, y_j)\|)$, O is a 3 × 3 matrix of zeros, o is a 3 × 1 vector of zeros, and the ith row of P is $(1, x_i, y_i)$; w and v are the column vectors formed by $w_i$ and $v_i$, respectively, while $a = [a_1\; a_x\; a_y]^T$.

Figure 4: Face modeling using color transfer. (a) Generating the color-adapted template from the original template and the estimated face region in the image; (b) converged model with no color transfer; and (c) converged model using the color-adapted template.

Figure 5: Example of color adaptation of the texture template under 16 different camera calibration settings and illumination conditions from the Oulu database (all 4 × 4 combinations of horizon, incandescent, fluorescent, and daylight illuminants). (a) Initial images; (b) color-adapted templates.

Table 1: Decorrelation results over the tested database.
Figure 1: General, mixed color face image database.