Facial expression recognition plays an important role in communicating the emotions and intentions of human beings. Recognizing facial expressions in uncontrolled environments is more difficult than in controlled environments because of variations in occlusion, illumination, and noise. In this paper, we present a new framework for effective facial expression recognition from real-time facial images. Unlike methods that spend considerable time dividing the image into blocks or processing the whole face image, our method extracts discriminative features from salient face regions and then combines them with texture and orientation features for a richer representation. Furthermore, we reduce the data dimension by selecting only the most discriminative features. The proposed framework maintains a high recognition accuracy rate even in the presence of occlusion, illumination changes, and noise. To demonstrate its robustness, we evaluated it on three publicly available challenging datasets. The experimental results show that the proposed framework outperforms existing techniques, which indicates the considerable potential of combining geometric features with appearance-based features.
Facial expression recognition (FER) has emerged as an important research area over the last two decades. Facial expression is one of the most immediate, natural, and powerful means for humans to communicate their intentions and emotions. FER systems can be used in many important applications such as driver safety, health care, video conferencing, virtual reality, and cognitive science.
Generally, facial expressions can be classified into neutral, anger, disgust, fear, surprise, sadness, and happiness. Recent research shows that young people's ability to read the feelings and emotions of others is declining due to the extensive use of digital devices [
An automatic FER system commonly consists of four steps: preprocessing, feature extraction, feature selection, and classification of facial expressions. In the preprocessing step, the face region is first detected and then extracted from the input image, because it is the area that contains expression-related information. The most common algorithm used for face detection is the Viola–Jones object detection algorithm [
Although a lot of work has been done to develop robust FER systems, several common problems still hinder their deployment in real-time environments: (i) the extracted features are sensitive to changes in illumination, occlusion, and noise, so even a slight change may reduce the recognition accuracy rate; (ii) the large feature dimension further degrades the performance of such systems.
The contributions of the proposed work are as follows. A dual-feature fusion technique is proposed for effective and efficient classification of facial expressions in unconstrained environments. The proposed framework is based on local and global features, which makes it robust to changes in occlusion, illumination, and noise. A feature selection process is used to retain discriminative features and discard redundant ones. The reduction in feature vector length also reduces the time complexity, which makes the proposed framework suitable for real-time applications.
The rest of the paper is organized as follows: Section
Numerous methods for facial expression recognition have been developed owing to its growing importance. Based on how features are extracted, these methods are mainly categorized into geometric-based and appearance-based methods.
In geometric-based methods, information such as the shape of the face and its components is used for feature extraction. The first important and challenging step in geometric-based methods is to initialize a set of facial points and track them as the facial expression evolves over time. The study presented in [
In contrast, appearance-based feature extraction methods encode face appearance variations without taking muscle motion into account. Chen et al. [
Apart from purely appearance-based or geometric-based feature extraction, the fusion of these two feature extraction methods is also a promising trend. Zhang et al. [
In this paper, unlike other methods, we select informative local face regions instead of dividing the face image into nonoverlapping blocks. Such a representation can improve classification performance compared with block-based image representation. The appearance-based features are computed both from local face regions and from the whole face area; these features are then fused to obtain a more robust representation.
The workflow of the proposed dual-feature fusion framework is illustrated in Figure
Proposed framework flow diagram.
In order to extract the region of interest (i.e., the face portion), we utilized the Viola–Jones algorithm [
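For illustration, the sketch below performs Viola–Jones face detection with OpenCV's bundled Haar cascade; the input file name and the detector parameters (scaleFactor, minNeighbors) are illustrative assumptions, not the exact settings of our experiments.

```python
import cv2

# Viola-Jones face detection via OpenCV's pretrained Haar cascade.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

gray = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    face = gray[y:y + h, x:x + w]  # cropped face region used in later steps
```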
Spatial misalignment usually occurs due to expression and pose variations in the face image. Dividing the face image into nonoverlapping blocks or exploiting holistic features cannot resolve this issue [
For this purpose, we used the method presented by Kazemi and Sullivan [
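The Kazemi–Sullivan ensemble-of-regression-trees method is implemented in the dlib library; a minimal sketch is given below, assuming the publicly available 68-point model file (the model file name and image path are placeholders).

```python
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = dlib.load_grayscale_image("input.jpg")
for rect in detector(img):
    shape = predictor(img, rect)
    # (x, y) coordinates of the estimated facial landmarks
    points = [(shape.part(i).x, shape.part(i).y)
              for i in range(shape.num_parts)]
```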
After landmark position estimation, we use the facial point locations to divide the face image into 29 local regions. Local features are extracted from all of these regions. To reduce the data dimensions, we do not require an exhaustive search technique as performed in [
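As a hedged illustration of this step, the sketch below crops a fixed-size patch around each selected landmark; the patch size and the mapping from the detected landmarks to our 29 regions are assumptions made for illustration only.

```python
import numpy as np

def landmark_patches(gray, points, half=8):
    """Crop a (2*half x 2*half) patch centered on each landmark.

    `points` is a list of (x, y) landmark coordinates; the patch size
    and the landmark subset are illustrative assumptions.
    """
    h, w = gray.shape
    patches = []
    for (x, y) in points:
        y0, y1 = max(0, y - half), min(h, y + half)
        x0, x1 = max(0, x - half), min(w, x + half)
        patches.append(gray[y0:y1, x0:x1])
    return patches
```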
The Weber local descriptor (WLD) was proposed by Chen et al. [
Formally, the differential excitation component can be defined as
$$\xi(x_c) = \arctan\left(\sum_{i=0}^{p-1} \frac{x_i - x_c}{x_c}\right),$$
where $x_c$ is the intensity of the center pixel and $x_0, \ldots, x_{p-1}$ are the intensities of its $p$ neighbors.
Figures
WLD excitation and orientation component.
The first row shows the original images, the second row shows the corresponding excitation component images, and the third row shows the orientation component images.
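A minimal NumPy sketch of the two WLD components is given below, assuming the standard definitions of Chen et al.; it is a vectorized illustration rather than the exact implementation used in our experiments.

```python
import numpy as np

def wld_components(img, eps=1e-6):
    """Differential excitation and orientation maps of a grayscale image."""
    f = img.astype(np.float64)
    p = np.pad(f, 1, mode="edge")
    c = p[1:-1, 1:-1]  # center pixels

    # Sum of intensity differences between the 8 neighbors and the center.
    diff_sum = np.zeros_like(c)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            diff_sum += p[1 + dy:p.shape[0] - 1 + dy,
                          1 + dx:p.shape[1] - 1 + dx] - c

    excitation = np.arctan(diff_sum / (c + eps))

    # Orientation from differences of opposite neighbors.
    v_h = p[1:-1, 2:] - p[1:-1, :-2]   # horizontal difference
    v_v = p[2:, 1:-1] - p[:-2, 1:-1]   # vertical difference
    orientation = np.arctan2(v_v, v_h)
    return excitation, orientation
```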
We can compute the DCT of an input scanned image $f(x, y)$ of size $M \times N$ as
$$F(u, v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y) \cos\!\left[\frac{\pi (2x+1) u}{2M}\right] \cos\!\left[\frac{\pi (2y+1) v}{2N}\right],$$
where $\alpha(u) = \sqrt{1/M}$ for $u = 0$ and $\alpha(u) = \sqrt{2/M}$ otherwise, and $\alpha(v)$ is defined analogously with $N$.
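In practice the coefficients can be obtained with an off-the-shelf routine; the sketch below uses SciPy's orthonormal 2-D DCT and keeps a low-frequency square of coefficients, a simplified selection rule assumed here for illustration rather than the exact rule of the proposed framework.

```python
import numpy as np
from scipy.fft import dctn

def dct_features(face, n=8):
    """Return the n x n lowest-frequency DCT coefficients as a vector.

    The square low-frequency crop is an illustrative selection rule.
    """
    coeffs = dctn(face.astype(np.float64), norm="ortho")
    return coeffs[:n, :n].ravel()
```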
After extracting the appearance-based and geometric-based features, we employed a score-level fusion strategy to combine them. Feature-level fusion and score-level fusion are the two fusion strategies most widely used in the literature. In feature-level fusion, the different feature vectors are simply concatenated after a normalization process. In score-level fusion, by contrast, a distance-based classifier computes the distance between the feature vectors of training and testing samples, and the resulting scores are combined. Feature-level fusion generally produces a large data dimension [
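The sketch below illustrates one common form of score-level fusion: each matcher's distance scores are min-max normalized and combined with a weighted sum. The weight value is an assumption made for illustration.

```python
import numpy as np

def min_max(scores):
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def fuse_scores(d_appearance, d_geometric, w=0.5):
    """Weighted-sum score-level fusion of two sets of distance scores.

    `w` balances the appearance-based and geometric-based matchers;
    the value 0.5 is an illustrative assumption.
    """
    return w * min_max(d_appearance) + (1 - w) * min_max(d_geometric)
```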
The procedure of feature extraction and fusion is presented in Algorithm
Input: testing sample images.
For both binary and multi-class classification problems, the SVM [
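A minimal scikit-learn sketch of the classification step is shown below on placeholder data; the RBF kernel and the value of C are illustrative assumptions rather than the tuned settings of our experiments.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(140, 64))   # placeholder fused feature vectors
y = np.tile(np.arange(7), 20)    # placeholder labels for 7 expressions

# Scaling followed by a multi-class (one-vs-one) RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
clf.fit(X, y)
pred = clf.predict(X[:5])
```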
To evaluate the performance of the proposed framework, we used three publicly available benchmark databases, namely, the MMI database, the extended Cohn–Kanade (CK+) database, and static facial expressions in the wild (SFEW).

MMI database: this image database [

Extended Cohn–Kanade (CK+): this database contains 593 video sequences of 123 subjects [

Static facial expressions in the wild (SFEW): the SFEW [
Sample images from each database are shown in Figure

Sample images taken from the MMI, CK+, and SFEW databases.
Number of selected images per expression from the MMI, CK+, and SFEW databases.

| Dataset | Neutral | Fear | Disgust | Angry | Surprised | Sad | Happy | Total |
|---|---|---|---|---|---|---|---|---|
| MMI | 36 | 41 | 39 | 45 | 39 | 34 | 39 | 273 |
| CK+ | N/A | 90 | 90 | 90 | 90 | 90 | 90 | 540 |
| SFEW | N/A | 50 | 41 | 50 | 50 | 50 | 50 | 291 |
To make maximum use of the available data, we employed 5-fold and 10-fold cross-validation for all experiments. To give a fuller picture of facial expression recognition performance, average accuracy rates and confusion matrices are reported across all three datasets.
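This evaluation protocol can be reproduced along the following lines with scikit-learn; the classifier and the synthetic data are placeholders standing in for the fused features and labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X = np.random.default_rng(0).normal(size=(120, 64))  # placeholder features
y = np.tile(np.arange(6), 20)                        # placeholder labels

for k in (5, 10):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv).mean()
    print(f"{k}-fold mean accuracy: {acc:.3f}")
```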
This section presents the results obtained on the MMI, CK+, and SFEW datasets. The MMI dataset contains mostly spontaneous expressions. The proposed framework achieved an average recognition accuracy of 96% on MMI and 98.62% on CK+. The confusion matrices for classifying 7 facial expressions on the MMI dataset and 6 basic expressions on CK+ are shown in Tables
Confusion matrix of recognition accuracy for MMI database.
| | Neutral (%) | Fear (%) | Disgust (%) | Angry (%) | Surprised (%) | Sad (%) | Happy (%) |
|---|---|---|---|---|---|---|---|
| Neutral | 100.00 | 0 | 0 | 0 | 0 | 0 | 0 |
| Fear | 4.88 | 92.68 | 0 | 0 | 2.44 | 0 | 0 |
| Disgust | 2.56 | 0 | 94.87 | 2.56 | 0 | 0 | 0 |
| Angry | 4.44 | 0 | 4.44 | 91.11 | 0 | 0 | 0 |
| Surprised | 0 | 2.56 | 0 | 0 | 97.44 | 0 | 0 |
| Sad | 0 | 0 | 0 | 0 | 0 | 100.00 | 0 |
| Happy | 0 | 2.56 | 0 | 0 | 0 | 0 | 97.44 |
Confusion matrix of recognition accuracy for CK+ database.
| | Fear (%) | Disgust (%) | Angry (%) | Surprised (%) | Sad (%) | Happy (%) |
|---|---|---|---|---|---|---|
| Fear | 95.00 | 2.8 | 2.2 | 0 | 0 | 0 |
| Disgust | 0 | 100.00 | 0 | 0 | 0 | 0 |
| Angry | 0 | 0 | 97.78 | 0 | 2.22 | 0 |
| Surprised | 0 | 0 | 0 | 98.89 | 1.11 | 0 |
| Sad | 0 | 0 | 0 | 0 | 100.00 | 0 |
| Happy | 0 | 0 | 0 | 0 | 0 | 100.00 |
In Table
The confusion matrix in Table
The confusion matrix for SFEW results is shown in Table
Confusion matrix of the recognition accuracy for the SFEW database.
| | Fear (%) | Disgust (%) | Angry (%) | Surprised (%) | Sad (%) | Happy (%) |
|---|---|---|---|---|---|---|
| Fear | 64.0 | 0.0 | 6.0 | 14.0 | 8.0 | 8.0 |
| Disgust | 7.3 | 31.7 | 17.1 | 12.2 | 19.5 | 12.2 |
| Angry | 6.0 | 10.0 | 42.0 | 10.0 | 14.0 | 18.0 |
| Surprised | 22.0 | 0.0 | 16.0 | 42.0 | 12.0 | 8.0 |
| Sad | 8.0 | 8.0 | 8.0 | 2.0 | 64.0 | 10.0 |
| Happy | 10.0 | 4.0 | 14.0 | 8.0 | 10.0 | 54.0 |
Table
Comparison of per-expression recognition accuracy on the MMI database.
| Method | Fear (%) | Disgust (%) | Angry (%) | Surprised (%) | Sad (%) | Happy (%) | Mean (%) |
|---|---|---|---|---|---|---|---|
| Chen et al. [ | 68.40 | 65.30 | 69.50 | 82.60 | 68.20 | 83.90 | 73.00 |
| Cruz et al. [ | 91.36 | 92.27 | 88.44 | 97.63 | 93.53 | 98.75 | 93.66 |
| Ghimire et al. [ | 70.00 | 80.00 | 70.00 | 90.00 | 73.33 | 92.50 | 79.305 |
| Chen et al. [ | 76.50 | 60.40 | 70.20 | 84.20 | 62.10 | 81.20 | 72.40 |
| Alphonse and Dharma [ | 81.30 | 81.30 | 82.00 | 90.00 | 76.70 | 83.33 | 82.44 |
| Yu et al. [ | 81.24 | 88.21 | 83.24 | 85.29 | 85.77 | 93.22 | 86.16 |
| Proposed method | 92.70 | 94.90 | 91.10 | 97.40 | 100.00 | 97.40 | 95.58 |
In Table
Comparison of per-expression recognition accuracy on the CK+ database.
| Method | Fear (%) | Disgust (%) | Angry (%) | Surprised (%) | Sad (%) | Happy (%) | Mean (%) |
|---|---|---|---|---|---|---|---|
| Chen et al. [ | 92.50 | 86.20 | 96.10 | 96.40 | 94.10 | 98.20 | 91.20 |
| Cruz et al. [ | 89.33 | 91.58 | 93.52 | 94.75 | 87.00 | 100.00 | 92.69 |
| Ghimire et al. [ | 96.00 | 96.67 | 97.50 | 100.00 | 93.33 | 100.00 | 97.80 |
| Chen et al. [ | 91.70 | 94.30 | 95.60 | 97.50 | 89.40 | 95.90 | 93.80 |
| Alphonse and Dharma [ | 99.23 | 97.36 | 92.77 | 99.55 | 98.69 | 98.69 | 97.715 |
| Yu et al. [ | 99.71 | 99.68 | 100.00 | 100.00 | 99.14 | 99.89 | 99.73 |
| Proposed method | 95.00 | 100.00 | 97.80 | 98.90 | 100.00 | 100.00 | 98.62 |
Figure
Comparison between existing method and proposed approach based on recognition accuracy.
In uncontrolled environments, noise and occlusions are the main factors that degrade image quality and reduce the facial expression recognition accuracy rate. Any FER system is required to perform well in the presence of noise and partial occlusions. In this section, we examine the robustness of the proposed method under these conditions.
To check robustness against noise, we randomly added salt-and-pepper noise of different levels to the images of the MMI and CK+ databases. This type of noise is composed of two components: salt noise, which appears as bright spots in the image, and pepper noise, which appears as dark spots. As shown in Figure
Sample images of salt and pepper noise from (a) MMI and (b) CK+ where
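Salt-and-pepper corruption of this kind can be generated as in the sketch below, where `density` is the fraction of affected pixels (split evenly between salt and pepper, an assumption of this illustration).

```python
import numpy as np

def add_salt_pepper(img, density, seed=None):
    """Corrupt a grayscale uint8 image with salt (255) and pepper (0) noise."""
    rng = np.random.default_rng(seed)
    noisy = img.copy()
    r = rng.random(img.shape)
    noisy[r < density / 2] = 0                       # pepper: dark spots
    noisy[(r >= density / 2) & (r < density)] = 255  # salt: bright spots
    return noisy
```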
The results illustrated in Figure
Recognition accuracy of MMI and CK+ databases in the presence of noise.
To assess the performance of the proposed method in the presence of occlusions, we added a block of random size to each test image. Blocks ranging from [15 × 15] to [55 × 55] pixels, placed at random locations on the face images, are shown in Figure
Sample images of occlusion from (a) MMI and (b) CK+ databases with varying block size.
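Synthetic occlusions of this kind can be generated as sketched below; the constant-intensity occluder value is an assumption of this illustration, since any patch that hides the underlying region serves the purpose.

```python
import numpy as np

def add_occlusion(img, block, seed=None):
    """Overwrite a block x block square at a random position in the image."""
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    y = int(rng.integers(0, h - block + 1))
    x = int(rng.integers(0, w - block + 1))
    occluded = img.copy()
    occluded[y:y + block, x:x + block] = 0  # black occluder (illustrative choice)
    return occluded
```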
The average recognition accuracy rates for both MMI and CK+ are illustrated in Table
Assessment of MMI and CK+ results in the presence of occlusions.
| Block size | MMI (%) | CK+ (%) |
|---|---|---|
| [15 × 15] | 91.9 | 98.1 |
| [25 × 25] | 90.8 | 98.3 |
| [35 × 35] | 90.5 | 90.6 |
| [45 × 45] | 88.3 | 88.5 |
| [55 × 55] | 75.1 | 90.6 |
To demonstrate the robustness of the proposed method against noise and occlusions, we also compared its performance with the existing method [
Comparison graph of the proposed method accuracy rate assessment with other methods in the presence of noise.
Comparative assessment with the existing method in the presence of occlusions.
Facial expression recognition in real-world settings is a long-standing problem. Low image quality, partial occlusions, and illumination variation in real-world environments make the feature extraction process more challenging. In this paper, we exploit both texture and geometric features for effective facial expression recognition. Effective geometric features are derived from facial landmark detection, which captures changes in facial configuration. Considering that geometric feature extraction may fail under various conditions, adding texture features to the geometric features helps capture minor changes in expression. WLD is utilized to extract texture features and is effective at capturing subtle facial changes. Furthermore, we employed score-level fusion to combine the geometric and texture features, which reduces the number of features. The performance of the proposed approach is evaluated on standard databases, namely MMI, CK+, and SFEW, and the results are compared with state-of-the-art approaches. The effectiveness of the proposed dual-feature fusion strategy is verified by these experimental results.
Although WLD works well on face images for the extraction of salient features, standard WLD cannot effectively represent local intensity variation because it neglects the different orientations of neighboring pixels. In future work, we plan to address this issue and to experiment with ethnographic datasets.
The authors confirm that the data generated or analyzed and the information supporting the findings of this study are available within the article.
The authors declare no conflicts of interest.
All the co-authors have made significant contributions to the conceptualization, data analysis, experimentation, scientific discussion, preparation of the original draft, and revision and organization of the paper.
This study was supported by the Deanship of Scientific Research, King Saud University, Riyadh, Saudi Arabia, through the Research Group under Project RG-1439-039.