In facial expression recognition, a convolutional neural network (CNN) model that performs well on one dataset (the source dataset) usually performs poorly on another dataset (the target dataset), because the feature distribution of the same emotion varies across datasets. To improve the cross-dataset accuracy of a CNN model, we introduce an unsupervised domain adaptation method that is especially suitable for small, unlabelled target datasets. To compensate for the shortage of samples in the target dataset, we train a generative adversarial network (GAN) on the target dataset and use the GAN-generated samples to fine-tune the model pretrained on the source dataset. During fine-tuning, we dynamically assign distributed pseudolabels to the unlabelled GAN-generated samples according to the current prediction probabilities. Our method can be easily applied to any existing CNN. We demonstrate its effectiveness on four facial expression recognition datasets with two CNN structures and obtain encouraging results.
Facial expression recognition (FER) has a wide spectrum of potential applications in human-computer interaction, cognitive psychology, computational neuroscience, and medical healthcare. In recent years, convolutional neural networks (CNN) have achieved many exciting results in artificial intelligence and pattern recognition and have been successfully applied to facial expression recognition [
In this paper, we aim to improve the cross-dataset accuracy of a CNN model for facial expression recognition. One way to solve this problem is to rebuild the model from scratch using large-scale newly collected samples. Large amounts of training samples, such as the dataset ImageNet [
A generative adversarial network (GAN) has two subnetworks: a generator and a discriminator. The discriminator decides whether a sample is real or generated, while the generator produces samples to fool the discriminator. The GAN was first proposed by Goodfellow et al. [
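The adversarial game between the two subnetworks can be sketched as a single alternating training step. This is a minimal illustration of the standard GAN objective, not the paper's exact training code; the latent dimension and optimizers are assumptions.

```python
# Minimal sketch of one GAN training step: the discriminator D learns to
# separate real images (label 1) from generated ones (label 0), while the
# generator G learns to make D label its outputs as real.
import torch
import torch.nn as nn

def gan_step(G, D, real, opt_g, opt_d, z_dim=100):
    bce = nn.BCEWithLogitsLoss()
    b = real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # --- discriminator update: real -> 1, fake -> 0 ---
    z = torch.randn(b, z_dim)
    fake = G(z).detach()                 # stop gradients flowing into G
    d_loss = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- generator update: make D classify fakes as real ---
    z = torch.randn(b, z_dim)
    g_loss = bce(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

In practice the two updates are repeated in alternation until the generated samples become difficult for the discriminator to distinguish from real ones.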
In this paper, the dataset used to train the baseline CNN is referred to as the source dataset, and the dataset on which cross-dataset performance is tested is referred to as the target dataset. Our method uses samples generated by a generative adversarial network (GAN) to make up for the shortage of samples in the target dataset. More specifically, we apply our method to two widely used CNN structures, Alexnet [ Our main contributions are as follows: (1) introducing an unsupervised domain adaptation method using GAN-generated samples; (2) proposing a distributed pseudolabel method for samples generated by the GAN; (3) improving the cross-dataset accuracy of a baseline CNN in facial expression recognition using the proposed method.
The overall architecture of our unsupervised domain adaptation is shown in Figure
Overall training structure of the domain adaptation.
In a supervised training task, it is standard to use the cross-entropy loss during training. Let
Let
The one-hot label of a real JAFFE image and the distributed label of a generated JAFFE image used in our DPL method.
By applying (
Let
During training, each time a GAN-generated image passes through the CNN, we assign it a new distributed pseudolabel according to the current prediction, so the label of the generated image changes dynamically. With DPL, the cross-entropy loss function for GAN-generated images changes to (
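The DPL assignment described above can be sketched as follows. The 0.4-0.2-0.2 weighting for the top-3 classes is one of the combinations used in our experiments; the helper names are illustrative, and the remaining probability mass is spread uniformly over the other classes.

```python
# Sketch of the distributed pseudolabel (DPL): each time a GAN-generated
# image passes through the network, its top-k predicted classes receive
# fixed weights and the remainder is shared by the other classes.
import numpy as np

def distributed_pseudolabel(probs, weights=(0.4, 0.2, 0.2)):
    """Build a soft label from the current prediction `probs`."""
    n = len(probs)
    top = np.argsort(probs)[::-1][:len(weights)]     # top-k class indices
    rest = (1.0 - sum(weights)) / (n - len(weights)) # mass for the others
    label = np.full(n, rest)
    for cls, w in zip(top, weights):
        label[cls] = w
    return label

def soft_cross_entropy(probs, label, eps=1e-12):
    # cross-entropy between the soft pseudolabel and the prediction
    return -np.sum(label * np.log(probs + eps))
```

Because the label is rebuilt from the current prediction at every pass, it tracks the network's evolving belief about the generated image instead of freezing an early, possibly wrong, hard decision.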
We use four FER datasets and seven emotions in our experiments. Figure
Sample images from four datasets.
The Alexnet [
Experiment results of the recognition accuracy on the target dataset.
Model | Source Dataset | Target Dataset | Source Only | Our Result |
---|---|---|---|---|
VGG11 | FER-2013 | JAFFE | 44.60% | 59.62% |
Alexnet | FER-2013 | JAFFE | 50.70% | 54.46% |
Alexnet | FER-2013 | MMI | 58.14% | 61.86% |
Alexnet | FER-2013 | CK+ | 71.90% | 76.58% |
Alexnet | CK+ | JAFFE | 46.94% | 51.64% |
Alexnet | JAFFE | CK+ | 60.33% | 65.01% |
We resize the face-cropped images of CK+, JAFFE, and MMI to
Network structure of GAN. The convolutional layer is denoted as conv and the transposed convolutional layer is denoted as dconv. N stands for neurons (channels), S stands for stride, and K stands for kernel size. LReLU means leaky ReLU nonlinearity and BN means batch normalization.
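A DCGAN-style generator/discriminator pair consistent with the caption's notation (conv/dconv layers, batch normalization, leaky ReLU) can be sketched as below. The channel counts, kernel sizes, and output resolution here are illustrative assumptions, not the paper's exact configuration.

```python
# DCGAN-style sketch: dconv (transposed conv) layers upsample the latent
# vector into an image; conv layers downsample it back to a real/fake score.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            # dconv: N=128, K=4, S=1 -> 4x4 feature map, BN + ReLU
            nn.ConvTranspose2d(z_dim, 128, 4, 1, 0), nn.BatchNorm2d(128), nn.ReLU(True),
            # dconv: N=64, K=4, S=2 -> 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            # dconv: N=1, K=4, S=2 -> 16x16 grayscale image in [-1, 1]
            nn.ConvTranspose2d(64, 1, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # conv: N=64, K=4, S=2, LReLU
            nn.Conv2d(1, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
            # conv: N=128, K=4, S=2, BN + LReLU
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 1, 4, 1, 0),  # 1x1 real/fake logit
        )

    def forward(self, x):
        return self.net(x).view(-1, 1)
```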
From left to right, GAN generated images from CK+, JAFFE, and MMI.
In the domain adaptation training process, we use 2k GAN images in each experiment. The weights of the top 3 classes (
We conduct a series of experiments over different datasets. During training, the source dataset and its labels are used to train the CNN, whereas the target dataset is used only to generate GAN samples, without any label information. The labels of the target dataset are used only for testing. First, the relatively large FER-2013 dataset is used as the source dataset. With Alexnet as the CNN structure, we take JAFFE, MMI, and CK+ as target datasets and obtain recognition accuracy improvements of 3.76%, 3.72%, and 4.41%, respectively. We also train VGG11 with FER-2013 as the source dataset and JAFFE as the target dataset to examine our method on a different CNN structure; the recognition accuracy increases by 15.02%. We then use smaller datasets as source datasets to further test our method. With CK+ as the source dataset, we obtain a 4.70% improvement in recognition accuracy on JAFFE, and a 4.68% improvement with JAFFE as the source dataset and CK+ as the target dataset. These results show that our method improves the CNN model's recognition accuracy on the target dataset across different datasets as well as different CNN structures.
We compare our experiment results with other published cross-dataset recognition accuracy results in Table
Comparison with other published methods.
Method | Source Dataset | Target Dataset | Recognition Accuracy on The Target Dataset |
---|---|---|---|
Meguid et al. [ | Bu-3DFE | JAFFE | 41.96% |
Wen et al. [ | FER2013 | JAFFE | 50.70% |
Gu et al. [ | CK | JAFFE | 55.87% |
Zhu et al. [ | FEED | JAFFE | |
Our Method | CK+ | JAFFE | 51.64% |
Our Method | FER2013 | JAFFE | 59.62% |
| |||
Mayer et al. [ | CK | MMI | 60.30% |
Mayer et al. [ | FEED | MMI | 58.90% |
Our Method | FER2013 | MMI | |
| |||
Gu et al. [ | JAFFE | CK+ | 54.05% |
Mayer et al. [ | FEED | CK+ | 56.60% |
Wen et al. [ | FER2013 | CK+ | 76.05% |
Our Method | JAFFE | CK+ | 65.01% |
Our Method | FER2013 | CK+ | |
We compare the confusion matrix of our result with the baseline CNN trained only with the source dataset to see the recognition accuracy changes of each class of the emotions. In Tables
The target dataset recognition accuracy (%) confusion matrix of baseline CNN, FER-2013
Angry | Disgust | Fear | Happy | Sad | Surprise | Neutral | |
---|---|---|---|---|---|---|---|
Angry | | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 100.00 |
Disgust | 0.00 | | 3.45 | 0.00 | 27.59 | 0.00 | 55.17 |
Fear | 0.00 | 0.00 | | 3.13 | 0.00 | 9.38 | 53.13 |
Happy | 0.00 | 0.00 | 0.00 | | 0.00 | 3.23 | 32.26 |
Sad | 3.23 | 0.00 | 0.00 | 0.00 | | 0.00 | 80.65 |
Surprise | 0.00 | 0.00 | 0.00 | 6.67 | 0.00 | | 10.00 |
Neutral | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
The target dataset recognition accuracy (%) confusion matrix of our method, FER-2013
Angry | Disgust | Fear | Happy | Sad | Surprise | Neutral | |
---|---|---|---|---|---|---|---|
Angry | | 13.33 | 0.00 | 0.00 | 10.00 | 0.00 | 53.33 |
Disgust | 6.90 | | 3.45 | 0.00 | 17.24 | 0.00 | 13.79 |
Fear | 9.38 | 31.25 | | 0.00 | 0.00 | 6.25 | 3.13 |
Happy | 0.00 | 0.00 | 0.00 | | 0.00 | 0.00 | 3.23 |
Sad | 25.81 | 9.68 | 16.13 | 3.23 | | 3.23 | 9.68 |
Surprise | 0.00 | 0.00 | 6.67 | 13.33 | 0.00 | | 0.00 |
Neutral | 6.67 | 0.00 | 0.00 | 6.67 | 6.67 | 3.33 | |
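The row-normalised confusion matrices above can be computed as sketched below; the helper name and class ordering follow the tables' seven-emotion layout.

```python
# Sketch of a per-class recognition-accuracy confusion matrix (in percent):
# rows are the true emotion, columns the predicted emotion, and each row is
# normalised by its class count.
import numpy as np

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

def confusion_percent(y_true, y_pred, n_classes=7):
    m = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    row_sums = m.sum(axis=1, keepdims=True)
    return 100.0 * m / np.maximum(row_sums, 1)   # avoid division by zero
```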
We compare DPL with two alternative methods that use GAN-generated images: the pseudolabel [ Pseudolabel takes the class with the highest predicted probability as the unlabelled image's one-hot pseudolabel and updates this pseudolabel each time the unlabelled image is fed into the network. LSRO is a regularization method used for GAN samples generated from a person re-ID dataset; it presumes that the generated samples belong to none of the predefined persons and should be labelled with a uniform distribution
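The two alternative labelling schemes just described can be sketched as follows; both take the network's current class probabilities for an unlabelled GAN image, and the function names are illustrative.

```python
# Pseudolabel: hard one-hot label on the currently most likely class.
# LSRO: uniform label, treating the generated sample as belonging to no
# predefined class in particular.
import numpy as np

def pseudolabel(probs):
    label = np.zeros_like(probs)
    label[np.argmax(probs)] = 1.0   # one-hot on the arg-max class
    return label

def lsro_label(probs):
    return np.full_like(probs, 1.0 / len(probs))   # uniform over classes
```

DPL sits between these two extremes: sharper than LSRO's uniform label, but softer and more revisable than a hard one-hot pseudolabel.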
This experiment is conducted with both CNN architectures, Alexnet and VGG11, using 2k GAN images. The source dataset is FER-2013 and the target dataset is JAFFE. Table
Comparison with two other methods using GAN generated images, FER-2013
Method | Recognition Accuracy on Target Dataset | |
---|---|---|
Alexnet | VGG11 | |
Baseline | 50.70% | 44.60% |
Pseudo-label | 51.17% | 42.72% |
LSRO | 53.99% | 57.75% |
DPL | | |
In previous experiments, we fine-tuned the CNN only with GAN-generated samples; we now investigate how our method performs with real images from the target dataset. We use FER-2013 as the source dataset to train a CNN, and treat JAFFE as the set of unlabelled target images used to fine-tune the CNN with DPL. Since the JAFFE dataset has only 213 images, we also fine-tune a CNN with 213 generated images for comparison. Table
Comparison with real images, FER-2013
Method | Recognition Accuracy on Target Dataset | |
---|---|---|
Alexnet | VGG11 | |
Source-only | 50.70% | 44.60% |
Real-213 | 51.64% | 56.34% |
GAN-213 | 52.11% | 55.87% |
GAN-2k | 54.46% | 59.62% |
GAN-2k+Real-213 | 55.40% | 60.09% |
The weights of the top 3 classes (
The experiment result with different weight combinations (
Here we investigate how the number of GAN-generated images used for DPL affects the results. We take FER-2013 as the source dataset and JAFFE as the target dataset. The 0.4-0.2-0.2 weight combination is used for DPL, the learning rate is set to 0.000001, and we stop fine-tuning after 10 epochs. From Figure
The experiment results using different numbers of GAN images.
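The fine-tuning configuration described above (small learning rate, fixed epoch budget, a configurable pool of GAN images) can be sketched as a loop; the optimizer choice (plain SGD) is an assumption, and `label_fn` stands in for the DPL soft-label assignment.

```python
# Hedged sketch of fine-tuning a pretrained CNN on unlabelled GAN images:
# each batch gets a fresh soft label from `label_fn` based on the current
# prediction, then a soft-label cross-entropy step is taken.
import torch

def fine_tune(model, gan_images, label_fn, epochs=10, lr=1e-6):
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # assumption: SGD
    for _ in range(epochs):
        for x in gan_images:                       # one generated batch
            probs = torch.softmax(model(x), dim=1)
            label = label_fn(probs.detach())       # e.g. DPL soft label
            loss = -(label * torch.log(probs + 1e-12)).sum(dim=1).mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```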
In this paper, we propose an unsupervised domain adaptation method that uses GAN-generated samples to improve the cross-dataset performance of facial expression recognition. When training the CNN with unlabelled GAN-generated samples, we introduce a distributed pseudolabel (DPL) method. With our method, domain adaptation can be achieved with limited target data and no ground-truth labels. Experiments show that our method outperforms other GAN-based domain adaptation methods and achieves state-of-the-art cross-dataset recognition accuracy. When using FER-2013 as the source dataset, we obtain 15.02%, 3.76%, 3.72%, and 4.41% recognition accuracy improvements on the target datasets JAFFE (VGG11), JAFFE (Alexnet), MMI, and CK+, respectively. When using CK+ as the source dataset, we obtain a 4.70% improvement in recognition accuracy on JAFFE, and a 4.68% improvement with JAFFE as the source dataset and CK+ as the target dataset. Future work may extend the unsupervised DPL to a semisupervised version, since real-world samples with ground-truth labels in the target dataset might provide a better estimation of the target data. It will also be intriguing to apply our method to other domain adaptation tasks.
The datasets used during the current study are available in the following repository:
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was supported in part by the National Natural Science Foundation of China, Grant no. 51575388.