One of the critical issues in facial expression recognition is eliminating the negative effects caused by varying poses and illuminations. In this paper, a two-stage illumination estimation framework is proposed based on three-dimensional representative faces and clustering, which can estimate illumination directions under a series of poses. First, 256 training 3D face models are adaptively categorized into a number of facial structure types by
In the last few years, with the rapid progress of human-computer intelligent interaction (HCII), automatic facial expression recognition has become a very active topic in the machine vision community. Although recognition of frontal faces under indoor lighting is already relatively mature, performance under varying pose and illumination (PI) is still far from satisfactory [
In order to eliminate the negative effects of varying PI on expression recognition, we have to estimate them first. The estimation of PI can be done in two steps, and since it is easier to obtain illumination-invariant descriptors, the first step should be pose estimation [
There are many state-of-the-art works on illumination estimation for face recognition, which can be roughly divided into two categories, namely, traditional 2D based methods [
However, there are three main problems with 3D reconstruction based methods: (1) the generalization problem: all 3D reconstruction based methods require that the subject to be recognized also appear in the training set, which is acceptable for face recognition but cannot be met in subject-independent expression recognition; (2) the 3D reconstruction process (e.g., the 3D morphable model) itself is computationally expensive [
In this paper, we propose a subject-independent illumination estimation method, which solves the generalization problem by means of representative faces (RFs) and a clustering technique, with a complexity of
Figure
System overview.
The rest of the paper is organized as follows. In the next section, we give a brief introduction to the dataset we use and the preprocessing method. In Section
The dataset used to generate RFs in our experiments is BJUT-3D Face Database [
The shape vector consists of
The texture vector represents the color of the
The triangular patch vector represents the
We initially select 400 subjects at random (199 males and 201 females), but, due to defects in some models (burrs and triangular patch relation errors), we keep only 320 good models as the dataset for our experiments; Figure
3D face model (a) before mesh simplification and (b) after mesh simplification.
The training set should contain as many types of facial appearance as possible in order to make our RF and clustering based method work well, so a relatively large number of 3D face models are required to generate the training set. All 320 3D face models are randomly divided into two parts: 256 models are used to generate the representative faces for training, and the other 64 models are kept to generate test images under varying illuminations for testing the generalization ability of our method.
As described in Section
Another useful step to make 3D face vectors computable is pixel-wise correspondence, which is necessary for generating the RFs in Section
After mesh simplification and pixel-wise correspondence, the shape and texture information of the
The triangular patches can be computed using the Delaunay triangulation algorithm [
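For concreteness, here is a minimal sketch of how such a triangulation could be computed with SciPy, assuming the mesh vertices of a roughly frontal face are first projected onto the x-y plane (the input file name is hypothetical):

```python
import numpy as np
from scipy.spatial import Delaunay

# verts: (N, 3) array of 3D vertex coordinates of one face model
verts = np.load("face_vertices.npy")  # hypothetical file name

# A frontal face mesh is approximately a height field over (x, y),
# so a 2D Delaunay triangulation yields a valid surface triangulation.
tri = Delaunay(verts[:, :2])

# tri.simplices is an (M, 3) array of vertex indices, one row per
# triangular patch -- i.e., the triangular patch vector of the model.
patches = tri.simplices
```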
It is the subject’s 3D facial structure that determines the appearance (intensity distribution) of his/her photo under various illuminations, and facial organs such as eyes, nose, and mouth are the main cause of 3D structure difference among different people. Though people’s facial appearances differ in thousands of ways, their facial structures can be classified into some main types according to the positions and shapes of their facial organs. By clustering all 256 3D face models according to the coordinates of their main facial organs, we actually cluster them into a number of facial structure types; namely, faces in one cluster are alike and each cluster represents a facial structure type (as shown in Figure
Before clustering, we must obtain the coordinates of four facial fiducial points (the two eyes, the nose tip, and the upper-lip tip), whose detection in a 3D face model is much easier than in a 2D image. The point with the greatest
Four fiducial points.
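The sentence above is truncated in our source, but it plausibly refers to taking the vertex with the greatest depth value as the nose tip. A sketch of that heuristic, under the assumption that the model is roughly frontal with the z-axis pointing out of the face:

```python
import numpy as np

def nose_tip(verts):
    """Return the vertex with the greatest z value, assuming a roughly
    frontal model whose z-axis points out of the face (our assumption)."""
    return verts[np.argmax(verts[:, 2])]
```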
Whether two subjects are alike has nothing to do with their head size or pose in an image, so our clustering algorithm should be scale- and rotation-invariant. Each model in the database undergoes a 3D transformation with a vertical stretch that maps its four fiducial points to a common destination set of fiducial points. Mathematically, the four 3D fiducial points
Here,
Using 256 3D models, the best transformation matrix is found by optimizing the 7 parameters
Then, we have all models’ nose tips aligned to a base point
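The exact parameterization of the seven parameters is elided above; one plausible reading is three rotation angles, three translations, and one vertical stretch, fitted by least squares so that the four fiducial points land on the common destination set. A sketch under that assumption:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(p, src, dst):
    # p = (rx, ry, rz, tx, ty, tz, sy): 3 rotation angles, 3 translations,
    # and one vertical stretch -- one plausible reading of the 7 parameters.
    R = Rotation.from_euler("xyz", p[:3]).as_matrix()
    S = np.diag([1.0, p[6], 1.0])            # vertical (y) stretch
    mapped = src @ (S @ R).T + p[3:6]
    return (mapped - dst).ravel()

# src, dst: (4, 3) arrays of fiducial points (eyes, nose tip, upper-lip tip)
def fit_transform(src, dst):
    p0 = np.array([0, 0, 0, 0, 0, 0, 1.0])  # identity initial guess
    return least_squares(residuals, p0, args=(src, dst)).x
```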
It is difficult to decide an appropriate cluster number for
(a) Choice of the number of clusters and (b) silhouette values for 31 clusters.
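A sketch of how the cluster number could be chosen with the silhouette criterion; the use of k-means here is our assumption, since the excerpt does not name the clustering algorithm:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X: (256, 12) array -- the aligned coordinates of the 4 fiducial
# points (4 points x 3 coordinates) for each training model.
def choose_k(X, k_min=2, k_max=60):
    best_k, best_s = k_min, -1.0
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        s = silhouette_score(X, labels)   # mean silhouette over all models
        if s > best_s:
            best_k, best_s = k, s
    return best_k  # 31 in the paper's experiments
```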
Six RFs of six clusters representing six different facial structure types.
To have an idea of how well separated the resulting clusters are, see Figure
Now, we have 31 clusters. Faces within each cluster are alike and each cluster represents a facial structure type. The next step is to find a representative face that can represent its cluster’s facial structure type.
A 3D average face represents a kind of stable 3D structure hidden behind all the individual faces that contribute to computing it. We believe that, for the purpose of illumination estimation, it is stable and representative enough to approximate all individual faces belonging to its facial structure type. So we generate an average face for each of the 31 clusters to represent the 31 types of facial structures.
The average face can be computed as follows:
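The equation itself is missing from our source; under pixel-wise correspondence, the natural formula is the per-component mean of the shape and texture vectors over all cluster members. A minimal sketch under that assumption:

```python
import numpy as np

def average_face(shapes, textures):
    """shapes: (n, 3N) aligned shape vectors of one cluster's members;
    textures: (n, 3N) corresponding texture (RGB) vectors.
    Pixel-wise correspondence makes a per-component mean meaningful."""
    return shapes.mean(axis=0), textures.mean(axis=0)
```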
The triangular patches can be computed using the Delaunay triangulation algorithm [
We refer to [
The training set is generated from the 31 RFs by rotating them to certain poses and illuminating them with certain lights. For instance, 13 kinds of illuminations are defined in the experiments of Section
Figure
A representative face under 13 illuminations.
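The training images are rendered from the RFs rather than photographed. As a sketch, one directional light under a Lambertian reflectance model could be rendered as follows (the reflectance model is our assumption; the paper does not specify its renderer):

```python
import numpy as np

def render_lambertian(normals, albedo, light_dir, ambient=0.1):
    """normals: (N, 3) unit vertex normals; albedo: (N,) texture
    intensity; light_dir: (3,) unit vector toward the lamp-house.
    Returns per-vertex intensity under one directional light."""
    shading = np.clip(normals @ light_dir, 0.0, None)  # clamp backfacing
    return albedo * (ambient + (1.0 - ambient) * shading)
```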
In a pattern recognition problem, it is of great significance to obtain the most discriminative feature for classification. In this paper, we select the pixels that are most sensitive to illumination changes as the feature. We call it the "saltire-over-cross" feature because its shape resembles the pattern on a Union Jack.
As illustrated in Figure , the feature consists of 22 pixel lines in four directions: horizontal (7 pixel lines), vertical (5 pixel lines), +45° diagonal (5 pixel lines), and −45° diagonal (5 pixel lines).
Saltire-over-cross feature—the gray values of pixels in the 22 pixel lines are selected as features.
Positions of the 13 lamp-houses.
We use a general face detection algorithm to locate the face region in an image and take the centroid of the region as the center of the "saltire-over-cross" (continuous line in Figure
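A sketch of extracting the feature, assuming evenly spaced parallel lines through the face-region centroid (the exact line spacing and length are not recoverable from our source; `gap` and `length` are illustrative parameters):

```python
import numpy as np

def line_values(img, cx, cy, dx, dy, length):
    """Sample gray values along one pixel line through (cx, cy)."""
    t = np.arange(-length // 2, length // 2)
    xs = np.clip((cx + t * dx).astype(int), 0, img.shape[1] - 1)
    ys = np.clip((cy + t * dy).astype(int), 0, img.shape[0] - 1)
    return img[ys, xs]

def saltire_over_cross(img, cx, cy, length, gap=8):
    feats = []
    # 7 horizontal, 5 vertical, 5 diagonal (+45°), 5 diagonal (-45°) lines
    for n, (dx, dy) in [(7, (1, 0)), (5, (0, 1)), (5, (1, 1)), (5, (1, -1))]:
        for k in range(-(n // 2), n // 2 + 1):
            # parallel lines, offset perpendicular to the line direction
            ox, oy = -dy * k * gap, dx * k * gap
            feats.append(line_values(img, cx + ox, cy + oy, dx, dy, length))
    return np.concatenate(feats)  # gray values of all 22 pixel lines
```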
We also try using all pixels within the facial region and the concatenated histograms of image partitions [
Comparison of different features.
Feature | Dimension | Accuracy (%) | Description |
---|---|---|---|
Saltire-over-cross feature | 2825 | 96.27 | The downsampled saltire-over-cross feature. |
Saltire-over-cross feature | 5649 | 96.94 | The original saltire-over-cross feature. |
Regional feature | 2856 | 94.47 | Select pixels of the whole facial region as feature, and resample the feature vector to 2856-dimension. |
Regional feature | 5820 | 94.95 | Select pixels of the whole facial region as feature, and resample the feature vector to 5820-dimension. |
Partitioned histogram | 11520 | 89.78 | Partition the whole facial region into some blocks, extract histogram from each block, and concatenate all histograms to form a feature vector. |
Unlike many traditional classifiers that aim at minimizing the empirical risk, SVM [
In this paper, C-SVM with the radial basis function (RBF) kernel is used as our classifier. There are two parameters for C-SVM with an RBF kernel,
Fivefold cross-validation and the grid-search technique described in [
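A sketch of this parameter search with scikit-learn; the search ranges below are illustrative, not those of the cited paper:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X: feature matrix (saltire-over-cross vectors); y: illumination labels.
param_grid = {
    "C": [2**k for k in range(-5, 16, 2)],       # illustrative grid
    "gamma": [2**k for k in range(-15, 4, 2)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # 5-fold CV
# search.fit(X, y); the best C and gamma are in search.best_params_
```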
SVM, as explained above, is suitable only for binary classification, while our illumination estimation is an
In a voting strategy, each binary classification is considered a vote that can be cast for any data point; at the end, a point is assigned to the class with the maximum number of votes. In an eliminating strategy, the margin size of each dichotomy is regarded as the classification confidence of that dichotomy; all dichotomies are sorted by their confidence, and, in each binary classification, one class is eliminated (see the paper in [
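A minimal sketch of the voting strategy over all pairwise SVMs (the training loop is omitted; `clf[(i, j)]` is a hypothetical dict of fitted binary classifiers):

```python
from itertools import combinations
import numpy as np

def vote(clf, x, n_classes):
    """clf: dict mapping a class pair (i, j) to a fitted binary SVM
    whose predict() returns i or j; x: one feature vector."""
    votes = np.zeros(n_classes, dtype=int)
    for i, j in combinations(range(n_classes), 2):
        winner = clf[(i, j)].predict(x.reshape(1, -1))[0]
        votes[winner] += 1
    return int(np.argmax(votes))  # class with the maximum number of votes
```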
When we use the “one-against-one" approach [
To test the validity of our illumination estimation method, several experiments are conducted on different datasets, such as images generated from the 64 3D face models kept for testing, CAS-PEAL face database, and CMU PIE database, as well as a small test set created by ourselves.
In this experiment, we define 13 lamp-house positions, each corresponding to one illumination class. As shown in Figure
Tilt angle and pan angle in pose definition.
For each test, two estimation results are given; one is for the 3D face models participating in the generation of RFs, which are projected with 13 illuminations into 2D to form
Illumination estimation performance under certain poses (1st row indicates pan angle; 1st column indicates tilt angle).
Tilt \ Pan | 0° | 30° | 60° |
---|---|---|---|
−40° | 92.22%/91.59% | 93.60%/94.23% | 92.01%/90.38% |
−20° | 94.35%/92.67% | 92.07%/93.60% | 93.90%/92.07% |
0° | 97.39%/96.88% | 96.94%/96.51% | 95.10%/93.99% |
20° | 90.53%/89.54% | 90.99%/90.87% | 88.43%/88.46% |
40° | 82.48%/82.57% | 80.38%/81.01% | 73.23%/72.84% |
From the estimation results shown in Table
Though the test images in group-II play no part in the generation of the RFs, we achieve comparable results when estimating samples in group-II. Actually, the accuracy for group-II is, by and large, only slightly lower than that for group-I (sometimes even slightly higher), which supports our second argument: there are some main types of facial structures, and the clustering technique does provide our illumination estimation system with good generalization ability.
In the experiments of Section
To further verify the robustness of our system, we conduct experiments on the CAS-PEAL face database from the Chinese Academy of Sciences [
Since the illumination positions of the images in the CAS-PEAL face database differ from those we set in Section
Lamp-house positions of CAS-PEAL face database.
To be consistent with the sample images of the training set, we interactively select the face regions and normalize all face images to a uniform size, as shown in Figure
Normalized facial images from CAS-PEAL face database.
Finally, a 57.33% recognition accuracy is achieved when estimating 15 illumination conditions with our 3D representative face and clustering based method. To analyze the experimental results, we print the 15 × 15 error matrix ErrMat, as shown in Table
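The 15 × 15 error matrix is simply a confusion matrix over the 15 illumination classes; for example, with scikit-learn (the label arrays below are placeholders for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# y_true: ground-truth illumination labels (0..14); y_pred: estimates.
y_true = np.array([0, 3, 14, 7])   # placeholder values for illustration
y_pred = np.array([0, 3, 12, 7])
err_mat = confusion_matrix(y_true, y_pred, labels=list(range(15)))  # 15 x 15
```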
As can be seen, elements in the right part of ErrMat (column index > 10) have higher values, which indicates that the illumination classes corresponding to those columns are frequently confused. We attribute the errors mainly to two mismatches between the training and test images: (1) all individuals in the training set have their hair covered during image acquisition, so no hair is shown in the training images, while most subjects in the test images have their foreheads masked by hair, which results in low intensity in the upper area of the image and makes the lamp-house appear to be in a lower position; (2) no accouterments can be found in the training images, while some individuals in the test set are wearing accouterments, such as glasses.
A two-stage recognition framework is presented in this paper to further improve the recognition performance on the test set from the CAS-PEAL database. In the first stage, we use the saltire-over-cross features discussed in Section
Since the total number of classes is reduced greatly (only three vertical lamp-house positions) in the second stage, the classifier has more margin to distinguish one class from another. In this way, the vertical angle of the lamp-house positions can be estimated more accurately.
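A hedged sketch of the two-stage scheme: stage one predicts one of the 15 classes; stage two re-classifies only the three vertical lamp-house positions sharing the predicted horizontal angle. The grouping of classes by horizontal angle, and the reuse of the same feature vector in both stages, are our assumptions about details elided above:

```python
def two_stage_predict(stage1, stage2_by_pan, pan_of, x):
    """stage1: a 15-class classifier over all lamp-house positions;
    stage2_by_pan: dict mapping a horizontal (pan) angle to a 3-class
    classifier over the vertical positions sharing that pan angle;
    pan_of: dict mapping a class label to its pan angle."""
    coarse = stage1.predict(x.reshape(1, -1))[0]        # stage 1: coarse class
    pan = pan_of[coarse]                                 # keep horizontal angle
    return stage2_by_pan[pan].predict(x.reshape(1, -1))[0]  # stage 2: refine vertical
```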
It can be observed in this experiment that the performance on the CAS-PEAL database improves greatly with two-stage classification: the overall accuracy rises from 57.33% to 78.67%. As illustrated in Figure
Accuracy comparison for each illumination class.
Neither the CAS-PEAL nor the CMU PIE face database provides images with both illumination and expression variations at the same time. Therefore, we create a small set of images ourselves for this experiment, containing more than 40 images of 5 people posing several expressions under some of the illuminations defined in Figure
Example images from the small dataset collected by ourselves.
In this experiment, a 72% recognition accuracy is achieved on the small image set by our two-stage illumination estimation framework based on 3D RFs and clustering.
In [
We set the head poses and lighting directions the same as in Huang et al.'s experiments [
Average estimation error with varying illumination under a series of poses. For example, "±90°, I-2-H" indicates horizontal (H) changes from −90° to 90° in 2° increments (I); V denotes vertical.
Pose range | Illumination range | Average error | Standard deviation |
---|---|---|---|
±90°, I-10-H | ±45°, I-4-H | 6.41°/7.32° | 6.32°/6.36° |
±90°, I-20-H | ±45°, I-10-H | 9.74°/8.27° | 13.53°/7.19° |
±45°, I-10-V | ±45°, I-10-V | 2.03°/9.12° | 3.07°/10.40° |
It turns out that our method achieves comparable performance when estimating horizontal illumination changes without any 3D reconstruction. Meanwhile, it outperforms Huang et al.'s method [
In Table
Estimation accuracy by computing the relative horizontal angles for different flashes. Camera numbers are 5, 27, and 34. Illumination numbers are 9, 11, and 21. The last column (computed values) is calculated from geometric information provided in the database.
 | Average estimation | Standard deviation | Computed values |
---|---|---|---|
Illumination: | 8.46°/10.28° | 34.52°/20.04° | 16.53° |
Illumination: | 2.12°/1.89° | 7.28°/5.63° | 4.30° |
The error matrix.
Example cropped images from the CMU PIE face database.
As shown in Table
In this paper, a subject-independent illumination estimation method is proposed, which solves the generalization problem by using the RF and clustering technique, with a complexity of
When estimating test images with expression variations (e.g., Section
The illumination estimation algorithm presented in this paper can be applied to a broad range of face recognition and expression recognition applications to estimate illumination conditions, as long as the objects to be classified are at similar locations, orientations, and scales in both the training and the test images. For this reason, in our experiments we manually crop and normalize the test images. However, if the system is used in conjunction with appropriate segmentation and rectification algorithms, these constraints can be removed.
An earlier version of this work appeared in the 6th International Symposium on Neural Networks [
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work is supported by the Tianjin Higher Education Fund for Science and Technology Development under Grant no. 20110808 and the National Natural Science Foundation of China (NSFC) under Grant no. 61173032. The authors would also like to thank the Beijing University of Technology for providing the BJUT-3D Face Database. Portions of the research in this paper use the BJUT-3D Face Database collected under the joint sponsoring of the National Natural Science Foundation of China, Beijing Natural Science Foundation Program, and Beijing Science and Educational Committee Program.