Facial feature point detection has seen great research advances in recent years. Numerous methods have been developed and applied in practical face analysis systems. However, it remains a challenging task because of the large variability of expressions and gestures and the presence of occlusions in real-world photographs. In this paper, we present a robust sparse reconstruction method for the face alignment problem. Instead of a direct regression between the feature space and the shape space, the concept of shape increment reconstruction is introduced. Moreover, a set of coupled overcomplete dictionaries, termed the shape increment dictionary and the local appearance dictionary, is learned in a regressive manner to select robust features and fit shape increments. Additionally, to make the learned model generalize better, we select the best matched parameter set through extensive validation tests. Experimental results on three public datasets demonstrate that the proposed method achieves better robustness than state-of-the-art methods.
In most of the literature, facial feature points are also referred to as facial landmarks or facial fiducial points. These points are mainly located around the edges or corners of facial components such as the eyebrows, eyes, mouth, nose, and jaw (see Figure
Schematic diagram of our robust sparse reconstruction method for facial feature point detection.
In recent years, regression-based methods have gained increasing attention for robust facial feature point detection. Among these methods, a cascade framework is adopted to recursively estimate the face shape
In this paper, we propose a sparse reconstruction method that embeds sparse coding in the reconstruction of shape increments. As a popular signal coding algorithm, sparse coding has recently been applied successfully in computer vision and machine learning, for example to feature selection and clustering analysis, image classification, and face recognition [
The remainder of this paper is organized as follows: related work is introduced in Section
During the past two decades, a large number of methods have been proposed for facial feature point detection. Among the early methods, the Active Appearance Model (AAM) [
Generally, Constrained Local Model- (CLM-) based methods [
The aforementioned methods share a common characteristic: face shape variations are controlled through certain model parameters. Different from those methods, the regression-based methods [
Our method belongs to the regression-based family, like [
In this paper, the alignment performance of all methods is assessed with the following formula:
Multi-initialization diversifies the initial shape of the iteration, which improves the robustness of the reconstruction model. Specifically, we randomly select multiple ground-truth face shapes from the training set to form a group of initial shapes for the current image. This strategy enlarges the training sample size and enriches the extracted feature information, making each regression model more robust. During the testing stage, multi-initialization creates more chances to step out of potential local minima that may otherwise lead to inaccurate feature point localization.
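The multi-initialization strategy can be pictured as drawing several ground-truth shapes at random to start from. A minimal sketch (the array shapes and the sampling routine are our own illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def sample_initial_shapes(train_shapes, n_init, seed=None):
    """Randomly draw n_init ground-truth shapes from the training set to
    serve as the group of initial shapes for one image.

    train_shapes: (N, P, 2) array of N training shapes with P points each.
    Returns an (n_init, P, 2) array of starting shapes.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(train_shapes), size=n_init, replace=False)
    return train_shapes[idx]

# Toy example: 100 training shapes of 29 points, 5 initializations.
shapes = np.random.default_rng(1).random((100, 29, 2))
inits = sample_initial_shapes(shapes, n_init=5, seed=0)
print(inits.shape)  # (5, 29, 2)
```

At test time the same sampling would be repeated so that each initialization yields a candidate result, giving the model several chances to escape a poor local minimum.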
In our method, there are four key parameters: the size of the feature dictionary, the size of the shape increment dictionary, and their corresponding sparsity levels. The selection of these four parameters has a direct influence on the learned reconstruction model. We therefore run a large number of validation tests to find the best matched parameters. According to the validation results, we adopt three sets of parameters to train the model.
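The validation tests can be sketched as a simple enumeration over candidate values of the four parameters, keeping the combination with the lowest validation error (the parameter names and the toy validation function below are illustrative assumptions, not the paper's code):

```python
import itertools

def select_parameters(validate, dict_sizes=(256, 512), sparsities=(1, 5)):
    """Enumerate combinations of the four key parameters and return the
    set with the lowest mean validation error. `validate` is a caller-
    supplied function mapping a parameter dict to a mean error."""
    best, best_err = None, float("inf")
    for fd, sd, fs, ss in itertools.product(dict_sizes, dict_sizes,
                                            sparsities, sparsities):
        params = {"feature_dict_size": fd, "shape_dict_size": sd,
                  "feature_sparsity": fs, "shape_sparsity": ss}
        err = validate(params)
        if err < best_err:
            best, best_err = params, err
    return best, best_err

def toy(p):
    # Pretend validation error: lowest at feature size 512, shape size 256.
    return (abs(p["feature_dict_size"] - 512)
            + abs(p["shape_dict_size"] - 256)) / 1000 + 0.07

best, err = select_parameters(toy)
print(best["feature_dict_size"], best["shape_dict_size"])  # 512 256
```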
We use the Orthogonal Matching Pursuit (OMP) [
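OMP greedily selects, at each step, the dictionary atom most correlated with the current residual and then refits the coefficients on the selected support by least squares. A minimal NumPy sketch (our own implementation for illustration, not the paper's code):

```python
import numpy as np

def omp(D, x, k):
    """Orthogonal Matching Pursuit: greedily pick up to k dictionary
    atoms and least-squares fit the signal on the selected support."""
    residual, support = x.copy(), []
    for _ in range(k):
        # Atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # Refit coefficients on the whole support, update the residual.
        coef_s, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef_s
    coef = np.zeros(D.shape[1])
    coef[support] = coef_s
    return coef

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 256))   # overcomplete: 64-dim signals, 256 atoms
D /= np.linalg.norm(D, axis=0)       # OMP assumes unit-norm atoms
true_support = rng.choice(256, 5, replace=False)
x = D[:, true_support] @ rng.standard_normal(5)  # exactly 5-sparse signal

coef = omp(D, x, k=5)
print(np.count_nonzero(coef))        # number of selected atoms (at most 5)
```

In the paper's setting, D would be one of the learned overcomplete dictionaries and k the sparsity level chosen by validation.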
In the Supervised Descent Method (SDM [
To better represent local appearances around facial feature points, the extracted HoG features are also encoded into sparse coefficients:
Now we describe the shape regression framework in detail (see Pseudocode
[Pseudocode: the shape regression framework. Input: training images, corresponding initial shapes, and the sparse coding parameter set; the procedure consists of three training steps followed by three testing steps.]
During the testing stage, we can get the local appearance coefficients
In this section, we summarize the following three contributions of the proposed method:

(1) Sparse coding is utilized to learn a set of coupled dictionaries, named the shape increment dictionary and the local appearance dictionary. The corresponding sparse coefficients are embedded in a regression framework for approximating the ground-truth shape increments.

(2) A strategy of alternate verification and local enumeration is applied to select the best parameter set in extensive experiments. Moreover, the experimental results show that the proposed method is highly stable under different parameter settings.

(3) We also rebuild the testing conditions by removing the top 5%, 10%, 15%, 20%, and 25% of the testing images according to the descending order of the normalized alignment error, and compare the proposed method with three classical methods on three publicly available face datasets. The results support that the proposed method achieves better detection accuracy and robustness than the other three methods.
In this section, three publicly available face datasets are selected for performance comparison: Labeled Face Parts in the Wild (LFPW-68 points and LFPW-29 points [
The implementation codes of SDM [
Generally, the sizes of the shape increment dictionary and the local appearance dictionary in our method depend on the dimensionality of the HoG descriptor. In the following validation experiments, we introduce how to select the best combination of parameters. The parameter settings of SDM, ESR, and RCPR are consistent with the original settings reported in their papers. In SDM, the regression runs for 5 stages. In ESR, the number of features in a fern and the number of candidate pixel features are 5 and 400, respectively, and the method uses 10 external and 500 internal stages to train a two-level boosted framework. In RCPR, 15 iterations, 5 restarts, 400 features, and 100 random fern regressors are adopted.
In our experiments, we use the following equation to calculate and normalize the alignment errors. First, we calculate the localization error between the ground-truth point coordinates and the detected point coordinates, that is, the Euclidean distance between the two coordinate vectors. This error is then normalized by the interocular distance as follows:
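The interocular-normalized error can be sketched as follows (the eye-center indices used for the interocular distance are dataset-dependent; the ones below are placeholders):

```python
import numpy as np

def normalized_error(pred, gt, left_eye_idx, right_eye_idx):
    """Mean point-to-point Euclidean error between the detected and
    ground-truth shapes, normalized by the interocular distance.

    pred, gt: (P, 2) arrays of point coordinates.
    left_eye_idx, right_eye_idx: indices of the two eye-center points.
    """
    per_point = np.linalg.norm(pred - gt, axis=1)  # P point-wise distances
    interocular = np.linalg.norm(gt[left_eye_idx] - gt[right_eye_idx])
    return per_point.mean() / interocular

# Toy example: shift every point of a 4-point shape by 0.05 horizontally.
gt = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pred = gt + np.array([0.05, 0.0])
err = normalized_error(pred, gt, left_eye_idx=0, right_eye_idx=1)
print(round(err, 3))  # 0.05: mean offset 0.05 over interocular distance 1.0
```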
In this section, we introduce how the strategy of alternate verification and local enumeration is used to find the final parameter values. As described above, there are six variables
First, the values of
Comparison of different parameter sets on LFPW (68 points) dataset. Here
| | | Size of feature dictionary | Size of shape increment dictionary | | | Mean errors |
|---|---|---|---|---|---|---|
| 2 | 2 | 256 | 256 | 5 | 1 | 0.079381 |
| | | 256 | 512 | 5 | 1 | 0.08572 |
| | | 512 | 256 | 5 | 1 | 0.085932 |
| | | 512 | 512 | 5 | 1 | 0.086731 |
| 2 | 4 | 256 | 256 | 5 | 1 | 0.081958 |
| | | 256 | 512 | 5 | 1 | 0.083715 |
| | | 512 | 256 | 5 | 1 | 0.086187 |
| | | 512 | 512 | 5 | 1 | 0.087157 |
| 2 | 6 | 256 | 256 | 5 | 1 | 0.07937 |
| | | 256 | 512 | 5 | 1 | 0.07986 |
| | | 512 | 256 | 5 | 1 | 0.075987 |
| | | 512 | 512 | 5 | 1 | 0.084429 |
| 2 | 8 | 256 | 256 | 5 | 1 | 0.075863 |
| | | 256 | 512 | 5 | 1 | 0.082588 |
| | | 512 | 256 | 5 | 1 | 0.077048 |
| | | 512 | 512 | 5 | 1 | 0.082644 |
| 2 | 10 | 256 | 256 | 5 | 1 | 0.076178 |
| | | 256 | 512 | 5 | 1 | 0.076865 |
| | | 512 | 256 | 5 | 1 | 0.080907 |
| | | 512 | 512 | 5 | 1 | 0.088414 |
Comparison of multi-initialization and multiparameter strategies on LFPW (68 points) dataset. Here
| | | Size of feature dictionary | Size of shape increment dictionary | | | Mean errors | Fusion errors |
|---|---|---|---|---|---|---|---|
| 6 | 10 | 512 | 256 | 4 | 10 | 0.062189 | 0.055179 |
| 10 | 10 | 512 | 256 | 4 | 10 | 0.06075 | |
| 8 | 8 | 512 | 256 | 4 | 10 | 0.061787 | |
A small number of facial images have large shape variations and severe occlusions, which challenges the random multi-initialization strategy: it may fail to generate an appropriate starting shape. We therefore compare our method with three classic methods on rebuilt datasets. These datasets still include most of the images from LFPW (68 points), LFPW (29 points), and COFW (29 points); we merely remove the top 5%, 10%, 15%, 20%, and 25% of the testing facial images in each dataset by sorting the alignment errors in descending order (see Figure
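Rebuilding the test set in this way amounts to sorting the per-image errors in descending order and dropping the hardest fraction, e.g.:

```python
import numpy as np

def trim_hardest(errors, frac):
    """Drop the top `frac` of test images by alignment error (descending
    order) and return the mean error over the remaining images."""
    errors = np.sort(np.asarray(errors))[::-1]      # descending
    keep = errors[int(round(frac * len(errors))):]  # remove the top frac
    return keep.mean()

errs = np.array([0.30, 0.05, 0.06, 0.07, 0.04,
                 0.05, 0.06, 0.05, 0.04, 0.08])
print(round(trim_hardest(errs, 0.10), 3))  # drops the 0.30 outlier
```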
Cumulative Error Distribution (CED) curves of four methods tested on LFPW (68 points), LFPW (29 points), and COFW (29 points) datasets. The top (a) 0%, (b) 5%, (c) 10%, (d) 15%, (e) 20%, and (f) 25% of the testing images are removed according to the descending order sorted by the normalized alignment errors.
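A CED curve reports, for each error threshold, the fraction of test images whose normalized error does not exceed it; a minimal sketch:

```python
import numpy as np

def ced(errors, thresholds):
    """Cumulative Error Distribution: for each threshold, the fraction
    of test images whose normalized error is at or below it."""
    errors = np.asarray(errors)
    return np.array([(errors <= t).mean() for t in thresholds])

errs = [0.03, 0.05, 0.05, 0.08, 0.12]
print(ced(errs, [0.05, 0.10]))  # 3/5 of images at 0.05, 4/5 at 0.10
```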
In Figure
In general, the more facial feature points there are, the more difficult they are to detect. Comparing the five facial components, the mean errors of the nose and eyes given in Tables
Mean error of each facial component on LFPW (68 points), LFPW (29 points), and COFW (29 points) datasets. The top 5% maximal mean errors of the testing facial images in each dataset are removed.
Dataset | Method | Contour | Eyebrow | Mouth | Nose | Eye |
---|---|---|---|---|---|---|
LFPW (68 points) | SDM | 0.0829 | 0.0619 | 0.0478 | 0.0395 | 0.0369 |
ESR | 0.0862 | 0.0750 | 0.0651 | 0.0596 | 0.0527 | |
RCPR | 0.0948 | 0.0690 | 0.0562 | 0.0493 | 0.0433 | |
Our method | 0.0747 | 0.0587 | 0.0455 | 0.0405 | 0.0392 |
Dataset | Method | Jaw | Eyebrow | Mouth | Nose | Eye |
---|---|---|---|---|---|---|
LFPW (29 points) | SDM | 0.0422 | 0.0422 | 0.0410 | 0.0422 | 0.0328 |
ESR | 0.0570 | 0.0459 | 0.0531 | 0.0502 | 0.0400 | |
RCPR | 0.0748 | 0.0507 | 0.0568 | 0.0509 | 0.0357 | |
Our method | 0.0382 | 0.0403 | 0.0401 | 0.0406 | 0.0323 |
Dataset | Method | Jaw | Eyebrow | Mouth | Nose | Eye |
---|---|---|---|---|---|---|
COFW (29 points) | SDM | 0.0713 | 0.0714 | 0.0709 | 0.0600 | 0.0519 |
ESR | 0.1507 | 0.1022 | 0.1082 | 0.0952 | 0.0801 | |
RCPR | 0.1209 | 0.0810 | 0.0781 | 0.0655 | 0.0539 | |
Our method | 0.0668 | 0.0642 | 0.0702 | 0.0567 | 0.0497 |
Mean error of each facial component on LFPW (68 points), LFPW (29 points), and COFW (29 points) datasets. The top 25% maximal mean errors of the testing facial images in each dataset are removed.
Dataset | Method | Contour | Eyebrow | Mouth | Nose | Eye |
---|---|---|---|---|---|---|
LFPW (68 points) | SDM | 0.0718 | 0.0581 | 0.0433 | 0.0363 | 0.0337 |
ESR | 0.0746 | 0.0634 | 0.0535 | 0.0435 | 0.0408 | |
RCPR | 0.0830 | 0.0615 | 0.0484 | 0.0414 | 0.0367 | |
Our method | 0.0634 | 0.0518 | 0.0406 | 0.0364 | 0.0332 |
Dataset | Method | Jaw | Eyebrow | Mouth | Nose | Eye |
---|---|---|---|---|---|---|
LFPW (29 points) | SDM | 0.0385 | 0.0381 | 0.0376 | 0.0389 | 0.0295 |
ESR | 0.0498 | 0.0419 | 0.0461 | 0.0435 | 0.0350 | |
RCPR | 0.0637 | 0.0460 | 0.0471 | 0.0433 | 0.0309 | |
Our method | 0.0336 | 0.0360 | 0.0365 | 0.0362 | 0.0292 |
Dataset | Method | Jaw | Eyebrow | Mouth | Nose | Eye |
---|---|---|---|---|---|---|
COFW (29 points) | SDM | 0.0607 | 0.0643 | 0.0614 | 0.0525 | 0.0457 |
ESR | 0.1255 | 0.0873 | 0.0882 | 0.0781 | 0.0664 | |
RCPR | 0.1055 | 0.0679 | 0.0633 | 0.0533 | 0.0440 | |
Our method | 0.0561 | 0.0569 | 0.0603 | 0.0497 | 0.0435 |
Figure
Mean errors of four methods tested on LFPW (68 points), LFPW (29 points), and COFW (29 points) datasets.
Specifically, we plot the detection curves of five facial components in Figure
Facial feature point detection curves of four methods for each facial component on LFPW (68 points), LFPW (29 points), and COFW (29 points) datasets. (a) Eyebrow. (b) Eye. (c) Nose. (d) Mouth. (e) Contour or jaw.
A robust sparse reconstruction method for facial feature point detection is proposed in this paper. We build the regressive training model by learning a coupled set of shape increment dictionaries and local appearance dictionaries, which encode various facial poses and rich local textures. We then apply the sparse model to infer the final face shape of an input image by continually reconstructing shape increments. Moreover, to find the best matched parameters, we perform extensive validation tests using alternate verification and local enumeration. The comparison results show that our sparse coding based reconstruction model is highly stable. In further experiments, we compare the proposed method with three classic methods on three publicly available face datasets while removing the top 0%, 5%, 10%, 15%, 20%, and 25% of the testing facial images according to the descending order of alignment errors. The experimental results also support that our method is superior to the others in detection accuracy and robustness.
The authors declare that they have no competing interests.
This work is supported by the National Key Research & Development Plan of China (no. 2016YFB1001401) and National Natural Science Foundation of China (no. 61572110).