Towards a Better Performance in Facial Expression Recognition: A Data-Centric Approach

Facial expression is the best evidence of our emotions. Its automatic detection and recognition are key for robotics, medicine, healthcare, education, psychology, sociology, marketing, security, entertainment, and many other areas. Experiments in laboratory environments achieve high performance; in real-world scenarios, however, recognition remains challenging. Deep learning techniques based on convolutional neural networks (CNNs) have shown great potential, but most of the research is exclusively model-centric, searching for better algorithms to improve recognition, and progress is insufficient. Although data is the main resource for automatic learning, few works focus on improving the quality of datasets. We propose a novel data-centric method to tackle misclassification, a problem commonly encountered in facial image datasets. The strategy is to progressively refine the dataset through successive trainings of a fixed CNN model. Each training uses the facial images corresponding to the correct predictions of the previous training, allowing the model to capture more distinctive features of each class of facial expression. After the last training, the model performs an automatic reclassification of the whole dataset. Unlike other similar work, our method avoids modifying, deleting, or augmenting facial images. Experimental results on three representative datasets prove the effectiveness of the proposed method, improving the validation accuracy by 20.45%, 14.47%, and 39.66% for FER2013, NHFI, and AffectNet, respectively. The recognition rates on the reclassified versions of these datasets are 86.71%, 70.44%, and 89.17%, representing state-of-the-art performance.


Introduction
Our facial gestures speak more than a thousand words. Among the dynamic activities of the human body, the muscular movements of the face have meaning and potential interpretation. Facial expressions associated with the emotional state of a person are considered universal and the main signal to manifest and infer our feelings and sensations [1, 2]. An important study [3] quantified the degree of influence of the elements involved in the communication of emotions, determining the nonverbal part (facial and body gestures) as the most influential with 55%, compared with 38% for the tone of voice and only 7% for verbal language. In a conversational context, the exclusively verbal manifestation of anger or happiness must be accompanied by a facial gesture to convey the credibility and conviction of the interlocutor. The gesture alone may even be enough to describe the emotion we are experiencing, as we often pay more attention to the face than to the words. The recent pandemic has shown that when a facial mask is present, the human capacity to infer emotions is reduced [4]. Therefore, facial expressions that communicate emotions are essential in daily life at the individual, interpersonal, and social levels [5]. Apart from interacting with other people, we are increasingly surrounded by machines trying to imitate human behavior, so there is a need to interact with them. In the near future, this will be common practice, and the intention is to make such interaction as natural as possible. In the same way that people can infer the emotional state of others from facial expressions, computers and robots may also be able to recognize expressions and interpret human emotions. In recent years, automatic facial expression recognition (FER) has become an important area of research and development to improve human-machine interaction (HMI), leading communication to a more emotional, affective, and intelligent level [6, 7]. This can be applied to many activities and fields such as human behavior, healthcare, medicine, psychology, psychiatry, marketing, digital advertisement, customer feedback assessment, video games, video security, video surveillance, mobile phone unlocking, crime investigation (lie detection), online learning, and automobile safety [8-11].
1.1. Problem. Humans can easily recognize facial expressions; however, it is still a challenge for machines [12]. Automatic FER is one of the key tasks in the field of computer vision. This problem has motivated competitions such as the one organized on the Kaggle platform [13]. A popular approach is to classify the facial expression in a static image of a human face and associate it with one of the seven basic universal human emotions: happiness, surprise, anger, sadness, fear, disgust, and neutral [14, 15]. Some models measure emotions with continuous values (e.g., valence and arousal); however, very few annotated facial databases of this kind exist [16]. In contrast, for a discrete (categorical) model, a wider range of datasets is available. Deep learning is preferred for this task, avoiding the high cost in time and effort of manually defining multiple and complex features of facial expressions. In particular, convolutional neural networks (CNNs) have shown promising results on different facial image datasets. Images captured in a specific and controlled environment (in the lab) involve a few people, present no variations in environmental conditions, and contain gestures with a high degree of expressivity, so a good level of accuracy can be achieved. Another way is collecting images in real-world situations from the Internet, which is referred to as in the wild [17]. The heterogeneity of human faces, people less expressive than others, subtle differences between expressions, variations in head pose, different body postures, lighting changes in the environment, and occlusions are some of the factors that make FER outside the laboratory a difficult task, even for humans [9, 18-20].

1.2. Motivation.
A deep learning solution consists of a model and data. The vast majority of work follows a model-centric approach, whose purpose is finding new algorithms to achieve better performance on a certain facial image dataset. Several CNN architectures have been proposed, both customized (created from scratch) and pretrained using transfer learning and fine-tuning techniques. Each one tests different hyperparameters and includes regularization mechanisms such as data augmentation, dropout, and batch normalization [9]. In practice, this process is very time-consuming and has not achieved the aim of ideal performance. On the other side, there is data-centric research, guided by the principle that data is the most important resource and its quality directly influences the performance of learning models. Very few studies have focused on improving FER datasets, even though their creators themselves admit problems in the quality of the data [8]. The lack of remarkable results from the model-centric approach, the little work focused on the data, and the premise that the data would be more important than the model motivate us to propose a novel data-centric method to improve existing FER datasets and achieve better performance of recognition models.
1.3. Hypothesis. The quality of the dataset is a prerequisite for improving the accuracy of FER models. If the inherent drawbacks of the dataset are not reduced, it is very difficult to improve the performance of a FER system. In other words, better performance and higher accuracy are expected if the dataset is improved.
1.4. Method. Improving the main resource of a FER model, i.e., the dataset, implies improving the accuracy of the recognition. To validate our proposal, we used some representative datasets of this domain, which suffer from well-known problems such as imbalance, irrelevant images, and misclassified images. Our interest is in dealing with misclassification, since balancing or removing irrelevant images would modify the size of the dataset. In contrast, a reclassification generates a new distribution of the available images in a better-quality dataset. The strategy is a progressive refinement of the dataset over several trainings of the same CNN-based model. After each training, the prediction of all facial images is performed, and only the correct ones are selected to form the dataset for the next training. This process is repeated until there are few incorrect predictions, usually single-digit numbers. As a result, the last trained model achieves very high accuracy, so it is in charge of relabeling all the images of the original dataset. Therefore, a new distribution of the dataset is generated without altering its size or modifying the images. In the final step, the same CNN model is trained on the reclassified version of the dataset, and the accuracy is higher compared to the original dataset. The experiments performed in the present work show an increase of 20.45%, 14.47%, and 39.66% for the FER2013, NHFI, and AffectNet datasets, respectively. State-of-the-art performance was also achieved for these datasets.

1.5. Contributions. Our research work provides:
(1) a novel data-centric method to reclassify the images of a dataset, allowing higher precision of a FER model,
(2) a methodology applicable to datasets from other domains, supported by computer tools, especially Python and deep learning libraries, and
(3) a reclassified version of each dataset, which may be useful for further research, publicly available for FER2013 and NHFI, whereas for AffectNet this is not possible due to licensing restrictions.
The content of this work is organized as follows: Section 2 reviews data-centric works. Section 3 presents the FER datasets. Section 4 describes the methodology in detail. Section 5 explains the experimentation, and Section 6 presents the results obtained. Finally, Section 7 includes the conclusions and mentions future work.

Related Work
Our bibliographic search on improving the performance of FER in the wild using deep learning shows the supremacy of model-centric research. This approach focuses on better architectures, hyperparameter tuning, and regularization techniques [21]. However, no significant progress can be expected when the data used are not reliable. On the other hand, data-centric efforts are scarce. There are few studies that deal with the dataset to improve the performance of a FER system. After analyzing the related literature, we can say the techniques frequently used under this approach include image preprocessing, removing noise, deleting images with errors, data augmentation, and reclassification.

For instance, Liu et al. [11] analyzed expression recognition considering the importance of data preprocessing by improving the image contrast. More discriminative facial features are obtained using a hybrid method for extraction and a classification network combining VGG-16 and ResNet. Experiments on three benchmark datasets (CK+, FER2013, and AR) achieved state-of-the-art recognition rates: 98.6%, 94.5%, and 97.2%, respectively. Kim et al. [22] designed an image and video preprocessing system called the FIT (facial image threshing) machine, capable of eliminating irrelevant facial images and cropping, resizing, and reorganizing the classification of facial images before training the Xception algorithm, improving the validation accuracy by 16.95% on the FER2013 dataset. Mazen et al. [8] applied the following operations on the dataset: (1) non-face images, text images, and profile images are deleted, (2) wrongly labeled images are relabeled using a CNN, and (3) data augmentation is used to overcome the class imbalance, generating new face images for the minority classes with a cycle generative adversarial network (CycleGAN). As a result, the average test accuracy was increased from 64% for the original FER2013 dataset to 91.76% for the modified balanced version.

The cited works address the preprocessing of the dataset before the training of a model; however, the operations applied change the total number of images, either by removing or augmenting. In addition, the images are modified by cropping, resizing, or retouching the contrast. Our goal is to preserve the images and the size of the dataset, so we focus on misclassification, one of the most influential problems behind the lower performance of FER models. For instance, Kim and Wallraven [23] presented a study of the quality of the labeling on AffectNet. Due to the large size of the dataset, a subset with a total of 800 difficult-to-recognize images of the different categorical expressions was selected to be relabeled by 13 human annotators. After the crowd reannotation, 83.25% of the total number of votes did not match the original dataset labels. In addition, the predictions of several ResNets trained on the original AffectNet were compared with the labels assigned by the human crowd, finding that there is no good agreement for categorical expressions. This pilot test suggests the low labeling quality of the original dataset for these difficult facial images, influencing the poor performance of a deep learning model. More extensive reannotation work is reported to be in progress; however, manual annotation demands great effort and time.

Our work does not require any kind of preparation or modification of the images and avoids decreasing or increasing their number. It aims to automatically reclassify images to reduce the intraclass variability and interclass overlapping of the original dataset and, as a consequence, improve recognition performance.

Datasets
There are multiple image datasets created for automatic emotion recognition based on facial expressions. We have considered FER2013, AffectNet, and NHFI (natural human face image), mainly due to availability, size, image format, and categories of facial expressions.
3.1. Characteristics. The FER2013 dataset (created by Pierre-Luc Carrier and Aaron Courville) and AffectNet (Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor) are standards taken as benchmarks for competitions [24], whereas NHFI (Sudarshan Vaidya) is a novel dataset created for the purpose of providing more data with better manual annotation, which we propose to analyze in the present study. Table 1 summarizes the most relevant characteristics of these datasets.
The quality of a dataset is more affected as its size increases, so we selected datasets at different scales: small (thousands of images), mid (tens of thousands), and large scale (hundreds of thousands). The facial images included are static, not video sequences, with a 2D or flat appearance, in contrast to 3D images that generate a perception of depth [1]. Each image has a facial expression category assigned to it, a task performed entirely by humans, except for AffectNet, where one part was manually annotated and the rest automatically annotated using a neural network trained on all manually annotated training set samples [16]. The datasets are not balanced, i.e., they do not have the same, or at least a similar, number of images for each category. This drawback is discussed later. To examine the influence of image color and size on recognition performance, we have images in grayscale and RGB mode, as well as in small and medium sizes. JPG and PNG are standard image formats and are easy to convert to each other. The datasets encompass difficult naturalistic conditions (in the wild), with images far from a controlled environment and closer to reality, with different lighting levels, ages, poses, intensities of expression, and occlusions, making recognition a challenging task [19].

3.2. Acquisition.
The FER2013 (https://www.kaggle.com/datasets/deadskull7/fer2013) and NHFI (https://www.kaggle.com/datasets/sudarshanvaidya/random-images-for-face-emotion-recognition) datasets are publicly available on Kaggle, whereas AffectNet requires permission for use via a request form to the authors (request form: mohammadmahoor.com/affectnet-request-form/).

FER2013 can be obtained in a comma-separated values (CSV) format whose columns represent the following attributes: a value between 0 and 6 for each of the 7 possible emotions (0: angry, 1: disgust, 2: fear, 3: happy, 4: neutral, 5: sad, and 6: surprise), a list of 2304 integer values, each equivalent to one pixel of the image of size 48 × 48, and finally the subset to which the image belongs: training or test. Since the images are not directly visible, we used a Python script with the Pandas and NumPy libraries to read the file, store the integer values as pixel arrays, and convert them to image files. A total of 35886 images are obtained after transforming the pixel arrays to image files in JPG format, in grayscale and with a resolution of 48 × 48 pixels, divided into two subsets: training and test, with 28708 and 7178 images, respectively. Each subset includes 7 folders, one for each type of facial expression.

NHFI is downloaded as a compressed file, which after decompression generates 8 folders whose names are practically the same as in the previous dataset; only the "contempt" category is excluded for a fair comparison. Inside each folder are images in PNG format.

In the case of AffectNet, the link provided in response to the request allows for the download of two compressed archives for training and validation. After the extraction of each archive, an "images" folder containing the JPG files and another one called "annotations" containing the NPY files of the corresponding labels are created. We developed a Python script (github.com/cimejia/FERdatasets/blob/main/createAffecnet.py) to read the facial expression category from the NPY file and move the JPG file to the corresponding folder. It is worth mentioning that AffectNet has two versions; we used the small one, released in March 2021, containing only the manually annotated images with 8 labels (contempt is omitted). The full AffectNet dataset is huge (122 GB), and a specific request is necessary [16].
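For illustration, the following is a minimal sketch of the FER2013 conversion, assuming a local fer2013.csv with the columns described above (the CSV of the Kaggle release names them "emotion", "pixels", and "Usage"); the output folder layout and the merging of both test usages into a single test subset are our choices, not canonical.

```python
# Sketch: convert fer2013.csv into folders of JPG files, one per category.
import os
import numpy as np
import pandas as pd
from PIL import Image

# Label order follows the 0-6 mapping given above.
LABELS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
SUBSETS = {"Training": "train", "PublicTest": "test", "PrivateTest": "test"}

df = pd.read_csv("fer2013.csv")
for i, row in df.iterrows():
    # "pixels" is a space-separated string of 2304 grayscale values.
    pixels = np.array(row["pixels"].split(), dtype=np.uint8).reshape(48, 48)
    out_dir = os.path.join("fer2013", SUBSETS[row["Usage"]], LABELS[row["emotion"]])
    os.makedirs(out_dir, exist_ok=True)
    Image.fromarray(pixels).save(os.path.join(out_dir, f"{i}.jpg"))
```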

3.3. Drawbacks.
Automatic collection from the Internet and label crowdsourcing are the main reasons for the quantity and quality drawbacks of FER datasets. Regarding quantity, the major disadvantage is the imbalance, with some categories largely exceeding the number of facial images in others. On the other hand, the quality of the content is highly affected by the presence of irrelevant images and misclassification. These problems are widely mentioned in the literature and increase as the size of the dataset grows [8].

3.3.1. Imbalance. An imbalanced dataset could lead to a recognition model biased in favor of the majority classes. Having the same number of images per category is a difficult task. Facial images are usually sourced from the Internet and collected manually or automatically through browser plugins or programming scripts. These images are posted by people who tend to show smiling or happy faces, so this category predominates, in contrast to categories such as disgust, anger, or sadness, which users do not usually post. Table 2 indicates the number of images per facial expression category in each dataset.
All three datasets show a significant imbalance (Figure 1). In FER2013 (Figure 1(a)), the "happy" category predominates, the "disgust" category has few samples, and the rest of the categories are approximately even. NHFI (Figure 1(b)) presents a similar behavior, but is less irregular. In AffectNet (Figure 1(c)), the difference in the number of images between categories is much more pronounced.
Comparing the distributions on the same scale (Figure 1(d)), the imbalance is much more significant in AffectNet. A common pattern is the higher number of samples for the happy category and the lowest number for the disgust category. As mentioned before, this is because people tend to post images of happy faces and avoid showing other types of expression.
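Counts such as those in Table 2 can be tabulated with a few lines; a minimal sketch, assuming each dataset is organized as folders of image files per category (path names are illustrative):

```python
# Sketch: per-category image counts behind Table 2 and Figure 1.
from pathlib import Path

def class_counts(root: str) -> dict:
    """Map each category folder under `root` to its number of files."""
    return {d.name: sum(1 for f in d.iterdir() if f.is_file())
            for d in sorted(Path(root).iterdir()) if d.is_dir()}

print(class_counts("fer2013/train"))  # e.g., {'angry': ..., 'disgust': ..., ...}
```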

3.3.2. Misclassification and Irrelevant Images. Here, we join both problems related to the content of the datasets. Misclassification or mislabeling refers to placing facial images in the wrong directories. Among the factors that lead to this problem are: (a) emotions are subjective, and it is common for two people to have different opinions on the same facial image, (b) there are slight differences between certain facial expressions, e.g., fear and surprise, disgust and anger, and contempt and sadness, (c) the degree of expressiveness varies from person to person, so gestures may appear exaggerated in one case and inhibited in others, and (d) human beings can feel multiple emotions at a given instant, something that is difficult to combine in a facial expression and can be confusing, e.g., a smile with tears is a combined emotion easily mistaken for sadness [9, 25, 26]. Irrelevant images are those with watermarks, occlusions, no faces, poorly visible or very dark content, cartoons, text or symbols, half-side or sleeping faces, closed eyes, and cropped, rotated, retouched, or duplicated images. It is important to check for these drawbacks in each dataset; however, an exhaustive manual and visual review of a large number of images is impractical. We designed the following procedure to easily locate such errors.
The search for facial images with errors follows the flowchart shown in Figure 2. We reused the CNN for facial expression recognition designed by Akshit Bhalla [27]. During the training on each dataset, we monitored the accuracy on the validation set at each iteration (epoch) to save the best model parameters. This model is used to perform the prediction on all the images of the validation set. The confusion matrix is obtained from these predictions, where the off-diagonal positions allow us to identify the failures and their corresponding images. As a result, we have a smaller set of images in each class that is stored in a separate folder. We then visually reviewed them to select examples of mislabeling and irrelevant images with their respective file names (Figures 3-5).
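A minimal sketch of this procedure follows, assuming a saved checkpoint best_model.h5 and a non-shuffled Keras generator val_gen over the validation images (both names are illustrative; the generator must not shuffle so that predictions stay aligned with file paths):

```python
# Sketch of the error search of Figure 2: collect misclassified images.
import shutil
from pathlib import Path
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("best_model.h5")
probs = model.predict(val_gen)          # class probabilities per image
pred = np.argmax(probs, axis=1)
true = val_gen.classes                  # labels inferred from the folders
names = list(val_gen.class_indices)     # index -> category name

# Copy each wrongly predicted image into errors/<true>_as_<predicted>/.
for path, t, p in zip(val_gen.filepaths, true, pred):
    if t != p:
        out = Path("errors") / f"{names[t]}_as_{names[p]}"
        out.mkdir(parents=True, exist_ok=True)
        shutil.copy(path, out / Path(path).name)
```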
In this section, we examined the problems of the FER datasets, which can be summarized as class imbalance and the existence of a significant number of images that are irrelevant or do not correspond to the correct category.
Combined or separately, these problems cause the performance of a FER model to degrade considerably, as well as the learning to be biased in favor of the dominant classes [8, 9, 18, 22]. Therefore, the search for more convenient architectures and configurations for recognition models is a waste of time when the data used are of low quality. First, it is necessary to address these problems to improve the datasets. Dealing with either the imbalance or the irrelevant images involves changing the size of the original dataset. Our work focuses on the problem of misclassification, keeping the number of available images of the dataset. To this end, we propose a novel data-centric method based on deep learning for the automatic relabeling of facial images.

Proposed Method
Our goal is to achieve increased accuracy in facial expression recognition through deep learning by first improving the dataset used. We propose a data-centric approach that specifically addresses the misclassification typically encountered in FER datasets. This drawback is likely the most influential in the lower performance of recognition models in in-the-wild scenarios. Since a visual inspection of every facial image in a dataset would be an extremely time-consuming and tedious task, we designed a method to automatically reclassify the images of a dataset and improve the performance of a FER model.

4.1. Workflow.
The proposed method consists of a series of steps represented by the workflow diagram in Figure 6.
Among these steps: (4) the training is monitored to save in a model file the parameters (weights and biases) corresponding to the iteration (epoch) with the best validation accuracy, and (5) the best model is used to perform the prediction of all facial images in the dataset; the results obtained allow us to generate the confusion matrix.

In summary, we propose a process of iterative trainings to create successively more refined versions of the dataset. Each version is smaller, since only the correct predictions of facial expression are included, but it maintains a significant number of images. At the last training, a much more reliable dataset is obtained, as well as a model that produces a low number of incorrect predictions (single-digit values for each class). The convolutional network is fixed in terms of its architecture and hyperparameters along this process, as sketched below.
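The following is a compact sketch of the refinement loop; build_model(), make_generator(), and copy_correct() are hypothetical helpers (the last one copies the correctly predicted files into a new version folder, analogous to the error-search step of Section 3.3.2), and EPOCHS, CALLBACKS, and the stopping threshold are illustrative values, not our exact configuration:

```python
# Sketch of the iterative dataset refinement with a fixed CNN.
import numpy as np

dataset_dir = "dataset_v0"                 # original distribution
for step in range(5):                      # five trainings were enough in our runs
    model = build_model()                  # same fixed architecture each time
    model.fit(make_generator(dataset_dir), epochs=EPOCHS, callbacks=CALLBACKS)

    gen = make_generator(dataset_dir, shuffle=False)
    pred = np.argmax(model.predict(gen), axis=1)
    wrong = int((pred != gen.classes).sum())

    next_dir = f"dataset_v{step + 1}"
    copy_correct(gen, pred, next_dir)      # keep only the correct predictions
    if wrong < 10 * gen.num_classes:       # roughly single-digit errors per class
        break
    dataset_dir = next_dir
```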
The key idea is that feature extraction is a crucial part of a FER system, and the expression classification accuracy improves with an effective extraction of facial features [10, 11]. The progressive refinement of the dataset produces a smaller number of images for each training, but with less variability in the gestures of the faces. Therefore, the model can gradually capture more distinctive features of each class. As a consequence, it is possible to increase intraclass similarity and enlarge interclass differences within a dataset, thereby improving the accuracy of facial expression recognition in real-world scenarios.

4.2. Models.
We leverage CNNs, the current state-of-the-art tools in computer vision, for facial expression prediction in images. The design of CNNs imitates the human visual system: the convolutional part would be the eyes of the network, whereas the classifier part would be the brain, which decides the class of the object. CNNs can be created from scratch or pretrained using the transfer learning technique. In this work, we demonstrate the use of both alternatives, describing the architecture implemented for each of the selected datasets.

4.2.1. FER2013.
We reused the CNN presented on the Kaggle site (https://www.kaggle.com/bhallaakshit/facial-expression-recognition), whose performance has shown good results in the task of facial expression recognition on this dataset (Figure 7).
The 48 × 48 pixel grayscale input image is passed through 4 convolutional layers; each layer applies a number of filters (kernels) to generate feature maps that include hierarchically detected patterns, from the simplest to the most complex. Here, 64, 128, 512, and 512 filters of size 3 × 3, 5 × 5, 3 × 3, and 3 × 3 pixels, respectively, are applied. A ReLU activation function then turns the negative values to zero and maintains the positive values. Next, a max-pooling operation reduces the image dimensions by half but preserves the found features. Batch normalization stabilizes the result of a convolution, whereas dropout enables the active participation of all neurons in the learning process; both are recommended regularization techniques to avoid possible overfitting. The flatten operation converts the feature maps into a vector of values suitable as input for the classifier, which is a traditional fully connected neural network with an input layer that receives the features in vector shape, two hidden layers of 256 and 512 neurons, and an output layer with a Softmax activation function producing 7 probability values, one for each facial expression class.
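A minimal Keras reconstruction of this architecture is given below for illustration. The filter counts and kernel sizes follow the description above; the dropout rates and the exact placement of normalization and pooling within each block are assumptions, not taken from the original notebook.

```python
# Sketch of the from-scratch CNN of Figure 7 (rates and ordering assumed).
from tensorflow.keras import Sequential, layers

model = Sequential([
    layers.Input(shape=(48, 48, 1)),        # 48x48 grayscale input
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(2),
    layers.Dropout(0.25),
    layers.Conv2D(128, 5, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(2),
    layers.Dropout(0.25),
    layers.Conv2D(512, 3, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(2),
    layers.Dropout(0.25),
    layers.Conv2D(512, 3, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(2),
    layers.Dropout(0.25),
    layers.Flatten(),                       # feature maps -> vector
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.25),
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.25),
    layers.Dense(7, activation="softmax"),  # one probability per expression
])
```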

4.2.2. NHFI.
We tested the same CNN model with this dataset; however, the results after the first filtering indicated an insignificant increase in accuracy (approx. 1.5%), as shown in Table 3. Therefore, we searched for other architectures to achieve higher accuracy. A model using the transfer learning technique showed the best performance for this dataset. In the first filtering, the accuracy improved from 0.5597 to 0.8367 (27.7%), as opposed to 1.5% with the CNN from scratch. Thus, we were able to demonstrate that the proposed method works in both cases (pretrained and from-scratch models). With transfer learning, the training phase is much faster, since we only train the classifier parameters while keeping fixed the convolutional base that has already learned features useful for most computer vision problems. The structure is presented in Figure 8.
The model is based on EfficientNet, a very popular CNN pretrained on the ImageNet dataset [28]. We used version B0, whose convolutional base is kept for feature extraction. The advantage is that the image with the original size of 224 × 224 pixels is accepted as input. The classifier receives the features in the form of a flattened vector and decides the class to which the input image belongs by means of a fully connected neural network with two dense layers of 256 and 512 neurons, to which the ReLU activation function is applied, plus the batch normalization and dropout regularization techniques to reduce possible overfitting. The Softmax function in the last dense layer outputs a distribution of probabilities corresponding to each of the 7 categories of facial expression.
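A sketch of this transfer learning model follows: the EfficientNet-B0 convolutional base is frozen and only our classifier on top is trained. The dropout rates are assumptions.

```python
# Sketch of the transfer-learning model of Figure 8.
from tensorflow.keras import Sequential, layers
from tensorflow.keras.applications import EfficientNetB0

base = EfficientNetB0(include_top=False, weights="imagenet",
                      input_shape=(224, 224, 3))
base.trainable = False                      # keep the pretrained features fixed

model = Sequential([
    base,
    layers.Flatten(),                       # features as a flattened vector
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(7, activation="softmax"),
])
```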

4.2.3. AffectNet.
We performed several trials with different architectures to determine the most suitable CNN for this dataset. The best result was obtained with the CNN used for the FER2013 dataset (Figure 7). It is only necessary to change the size and color mode of the AffectNet images from 224 × 224 pixels in RGB to 48 × 48 pixels in grayscale. This conversion is performed automatically using the image generator utility in Python.
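A minimal sketch of this on-the-fly conversion, using the Keras ImageDataGenerator (directory names are illustrative):

```python
# Sketch: AffectNet images resized to 48x48 and converted to grayscale
# directly by the generator, with no change to the files on disk.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = gen.flow_from_directory(
    "affectnet/train",
    target_size=(48, 48),        # resize on the fly
    color_mode="grayscale",      # RGB -> single channel
    batch_size=64,
    class_mode="categorical",
)
```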

Experiments
The core of the experimentation is the series of trainings of each CNN-based model on the respective dataset. The main characteristics of the computational platform used are an Intel(R) Core(TM) i9-7920X processor at 2.90 GHz, 64 GB of RAM, an NVIDIA GeForce RTX208 GPU with 12 GB of RAM, and the Linux Ubuntu 18.04.5 LTS operating system. The CNN architectures described in the previous section are implemented using Python version 2.7.17, supported by standard libraries such as OS, NumPy, and Matplotlib to manage directories and files, numeric arrays, and visualization, respectively. For deep learning work, we used libraries such as TensorFlow, Keras, and scikit-learn, as well as the ImageDataGenerator utility for image preprocessing.
The learning process is aimed at the model learning to associate facial images with labels of expression categories. A series of values known as hyperparameters must be explicitly defined by the programmer before training. There are no fixed rules for determining these values; they are the result of several tests to find the most convenient ones. Table 4 shows the hyperparameters for each model and dataset, which are maintained for all experiments. Our method of dataset refinement required five successive trainings for each dataset to meet the quality criteria. At each training, the model is fed with the facial images from the training subset of each dataset in batches of 64 images (batch size). We used the ImageDataGenerator utility from Keras to work with an image generator in batches, since the large number and size of the images would cause a memory storage problem. It also allows us to pass the images directly to the training model from directories, as well as to automatically label the images with the respective category and to perform data augmentation. For each batch, predicted and actual labels are compared, obtaining a loss (via the categorical_crossentropy function) and an accuracy. The backpropagation and Adam (based on gradient descent) algorithms are applied to update the model weights according to the learning rate value. When all batches are completed, one epoch is accomplished, i.e., one iteration over all training images. The accuracy and loss values are measured after each epoch using the images from the validation subset. One hundred epochs were run for FER2013, whereas for NHFI and AffectNet fifty epochs were sufficient to reach the maximum level of accuracy, since beyond this value the behavior of the model remains practically stable and no improvement is appreciable. The callback utility from Keras is leveraged to perform certain actions during training, such as setting a checkpoint and reducing the learning rate. The model is only saved to disk if the validation accuracy in the current epoch is greater than the best value seen so far. On the other hand, the learning rate tells us how much the weights are updated each time and is often between 0 and 1. It decreases from an initial value to a minimum if the loss does not improve after a certain number of epochs, which usually results in better training.
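The following is a minimal sketch of this training configuration, assuming a compiled-ready model and the generators train_gen and val_gen from above; the checkpoint file name and patience values are illustrative.

```python
# Sketch: Adam + categorical cross-entropy, a checkpoint on validation
# accuracy, and learning-rate reduction when the validation loss plateaus.
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    ModelCheckpoint("best_model.h5", monitor="val_accuracy",
                    save_best_only=True),   # keep only the best epoch
    ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                      patience=5, min_lr=1e-6),
]

history = model.fit(train_gen, validation_data=val_gen,
                    epochs=100, callbacks=callbacks)
```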

Results
The results of the experimentation are presented graphically by means of learning curves and confusion matrices, whereas the numerical metric used for comparison is the validation accuracy. These tools allow us to evaluate the performance of the model and the improvement of the dataset. During the training and validation of each model, loss and accuracy values were collected, respectively. This generates the so-called learning curves, where the horizontal axis represents the number of epochs and the vertical axis represents either the accuracy or the error. The confusion matrix, also known as the error matrix, is a table to visualize the model performance, as it presents information about actual and predicted classifications carried out by a classifier model. Rows represent the instances of actual classes, whereas columns represent the instances the classifier predicts [29]. From this matrix, several performance metrics can be obtained; however, we focus on accuracy, which is the number of correct predictions (on the diagonal) divided by the total number. The results obtained for each of the analyzed datasets are presented next.

6.1. FER2013. The learning curves and confusion matrix for each of the five trainings required for the FER2013 dataset are shown in Figure 9. For each training (including validation), the following are presented: the accuracy curves (left), the loss curves (middle), and the corresponding confusion matrix (right). As more trainings are performed, the accuracy curves (training and validation) reach higher values, whereas the loss curves decrease in height and approach zero. In addition, the pairs of curves are very close to each other in all the graphs. Therefore, the accuracy of the model is higher, the error is lower, and there is no overfitting. This ideal behavior is the product of the successive filtering of the dataset. The confusion matrices include the predictions of facial expressions for all the images in the dataset used for each training. The progressive trainings cause the desired effect in each matrix, that is, reducing the values outside the main diagonal and increasing the values on this diagonal. The model becomes more accurate each time because wrong predictions are discarded in subsequent trainings. As a result, more distinctive features of each class are captured. In this way, the intraclass variability of the facial images is decreased and the interclass variability is increased.
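For reference, both visualizations can be produced from the training run above; a minimal sketch, assuming the history, model, and non-shuffled val_gen from the previous section:

```python
# Sketch: learning curves from the Keras History object and a confusion
# matrix computed from the validation predictions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

plt.plot(history.history["accuracy"], label="train")
plt.plot(history.history["val_accuracy"], label="validation")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()

pred = np.argmax(model.predict(val_gen), axis=1)
cm = confusion_matrix(val_gen.classes, pred)
ConfusionMatrixDisplay(cm, display_labels=list(val_gen.class_indices)).plot()
plt.show()
```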
The process of dataset refinement is summarized in Table 5. Five trainings (four filtering operations) on the FER2013 dataset were necessary to achieve the expected performance metric (validation accuracy). A further training was not necessary because there was no significant improvement in the accuracy. The number of images gradually decreases, but it is still considerable for each training. The model with the highest accuracy (97.7%) has captured the most distinctive features of each facial expression category and is the one chosen for reclassifying all images in the dataset. The predict() method is used to assign the category of every facial image of the original dataset, generating a new distribution of the FER2013 dataset. The comparison is presented in Table 6.
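A minimal sketch of this final reclassification step follows, assuming the model from the last training; paths are illustrative, and the key point is that every file is copied unchanged into the folder of its predicted category:

```python
# Sketch: relabel every image of the original dataset with the best model.
import shutil
from pathlib import Path
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "fer2013/train", target_size=(48, 48), color_mode="grayscale",
    class_mode="categorical", shuffle=False)

pred = np.argmax(model.predict(gen), axis=1)
names = list(gen.class_indices)

for path, p in zip(gen.filepaths, pred):
    out = Path("fer2013_reclassified/train") / names[p]
    out.mkdir(parents=True, exist_ok=True)
    shutil.copy(path, out / Path(path).name)  # same images, new distribution
```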
Figure 10 shows that the categories "disgust" and "sad" have minimal variation, those of "angry," "happy," and "surprise" vary moderately, and the most affected categories are "fear" (decreasing) and "neutral" (increasing), indicating that the original FER2013 dataset suffers from misclassified facial images, especially between these two categories.
The decisive test of the effectiveness of our method is to train the same CNN on the reclassified FER2013 dataset. Figure 11 shows that better learning curves are obtained, and the confusion matrix indicates more correct and fewer incorrect predictions. Higher accuracy and lower loss are verified in Table 7.
The results confirm a more reliable dataset that keeps the number of images. The reclassified FER2013 enabled a very significant increase in the validation accuracy of the model, by 20.45%, and the loss is much lower (0.34). The training accuracy is acceptable (88.76%) and very close to the validation accuracy, and the loss is lower. There is no overfitting and no significant difference between training loss and validation loss. All the categories show improved accuracy; in particular, there is a remarkable improvement for "angry" (an increase of 26%), "fear" (38%), and "sad" (25%), i.e., those that showed the most overlapping or confusion. According to the experiments, only 40 epochs in each training would be sufficient, since the behavior remains practically stable beyond this number.

6.2. NHFI.
The accuracy curves for the NHFI dataset (left side of Figure 12) start quite separated from each other, evidencing the presence of overfitting, but as the trainings are performed, the curves become closer and reach high accuracy; the loss curves behave similarly but in the opposite direction, becoming closer and nearer to the horizontal axis. The confusion matrices show higher values on the main diagonal and lower values off this diagonal, indicating the progressive improvement of the model accuracy, as well as of the quality of the dataset used in each training. Despite the successive discarding of incorrect predictions, the number of images remains significant with respect to the original quantity. Table 8 shows the evolution of the trainings on the NHFI dataset.
The reclassification of the original NHFI dataset is performed with the highest-accuracy model (96.66%). A new distribution of the dataset is generated, which is shown in Table 9. In Figure 13, we can note that the "angry" and "neutral" categories had the greatest changes, indicating that these categories have the most intraclass variability in the original dataset.
To demonstrate the improved recognition, the same CNN is trained on the reclassified NHFI dataset and the result is compared to the original dataset (Figure 14). The overfitting was not reduced, but the accuracy is higher in both the training and validation subsets. The loss decreased for the reclassified NHFI dataset, as did the off-diagonal values of the confusion matrix.
The performance results for the original and reclassified distributions of the NHFI dataset are presented in Table 10.
We were able to significantly increase the accuracy on both the training and validation subsets, by 18.74% and 14.47%, respectively. Except for the "angry" and "happy" categories, the validation accuracy is highly increased in the rest of the categories, particularly in the "sad" category, from 49% to 85%. The methodology based on successive filtering with a transfer learning model has thus been successfully applied to a different dataset than FER2013.

6.3. AffectNet.
The version of the AffectNet dataset we selected contains 287401 images with a large imbalance between the categories (Figure 1(c)). Training on this dataset can lead to biases and an erroneous assessment of model accuracy. Therefore, we applied downsampling to balance all categories by considering the one with the lowest number of images. The "disgust" category, with 4300 images, limited the other categories; 3800 of these were randomly selected for training using the split-folders (https://pypi.org/project/split-folders/) library, whereas 500 images per category come with the dataset by default for validation. The balanced version is shown in Table 11.
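For illustration, the downsampling step can also be expressed without the split-folders dependency; a minimal sketch with illustrative paths (we used split-folders in practice, so this is an equivalent reconstruction, not our exact script):

```python
# Sketch: keep 3800 randomly chosen training images per category.
import random
import shutil
from pathlib import Path

random.seed(1337)  # make the selection reproducible
for class_dir in Path("affectnet/train").iterdir():
    if class_dir.is_dir():
        files = sorted(class_dir.glob("*.jpg"))
        for f in random.sample(files, 3800):
            out = Path("affectnet_balanced/train") / class_dir.name
            out.mkdir(parents=True, exist_ok=True)
            shutil.copy(f, out / f.name)
```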
The refinement process is performed on this balanced version of the AffectNet dataset. The accuracy curves for the training and validation subsets (Figure 15) start with a small separation, which decreases as successive trainings are performed; the validation curve even finishes outperforming the training curve in accuracy. The same behavior, but in the opposite direction, is presented by the loss curves. The values on the main diagonal of the confusion matrix increase with each training and those off this diagonal decrease, indicating a higher accuracy of the model due to a better dataset. The evolution of the successive trainings is summarized in Table 12.
The model in the last training reaches a high validation accuracy (95.9%), which allows us to reclassify the balanced dataset. A new distribution of AffectNet with 30100 images is generated, whose number of images per category is presented in Table 13.
After the reclassification of the balanced dataset, the new distribution is imbalanced (Figure 16). The happy and fear categories have increased significantly, whereas the anger category has increased slightly. In the remaining cases, there is a decrease, mainly in the disgust and surprise categories.
Next, the CNN-based model is trained on the new version of the AffectNet dataset to verify that our method works. Figure 17 presents the learning curves for both versions of the dataset, where the new AffectNet (Figure 17(b)) achieves better performance with higher accuracy.
There is a notable improvement in accuracy compared to the first training on the balanced dataset (Table 14). Due to the downsampling, the split ratio is 88% and 12% for the training and validation subsets, respectively. For the new version of the dataset, the proportion is 80% and 20%, and even with more validation images, the accuracy almost doubles (an increase of 39.66%).
We successfully applied our method to a smaller and more balanced version of the AffectNet dataset. The purpose, however, is to improve the original AffectNet dataset, which is larger and imbalanced. To this end, the last trained model is used to reclassify the facial images in the full version of AffectNet. The new distribution is presented in Table 15.
The bar plot in Figure 18 shows that the shape of the distribution of the new reclassified AffectNet is similar; however, there is a clear increase of images in the fear, disgust, and surprise categories. This suggests that many facial images of these categories were misclassified as happy or neutral.
The following demonstrates the improved performance in facial expression recognition. The reclassified version of the AffectNet dataset is used to train the same CNN-based model, resulting in the learning curves and confusion matrix displayed in Figure 19. The accuracy curves of the training and validation subsets increase from the first epoch and reach a very high level, close to 90%. Also, both curves stay very near each other. The error curves decrease together to levels near zero, which is desirable. Using the validation set as suggested by the creators of the dataset, we calculated the accuracy with the evaluate() method and generated the normalized confusion matrix. The accuracy on the reclassified validation set is 89.17%, and for each facial expression category it fluctuates between 86% and 96%, which demonstrates a high rate of recognition and no bias toward any of the categories, as opposed to the original dataset. This behavior confirms the better FER performance on the reclassified AffectNet dataset.
Finally, in Table 16, the results of the proposed method are compared with the state-of-the-art performance on the same datasets used in the present work. These are single-network models that did not use images beyond those existing in the datasets. In all cases, our reclassified versions of the datasets allow us the highest accuracy values.

The increase in validation accuracy by 20.45%, 14.47%, and 39.66% for FER2013, NHFI, and AffectNet, respectively, corroborates the efficacy of the proposed method. The results suggest that the quality and size of the dataset determine the most appropriate type of model. NHFI is a small and better-annotated dataset, so a pretrained model is convenient, unlike larger and lower-quality datasets, which need a model from scratch, with longer training and more parameters. The reclassified versions of these datasets maintain the same number of images as the original datasets, but with less overlapping between categories and less variability within the same category of facial expression. This allows us to achieve the state-of-the-art performance of single-network FER models, with 86.71%, 70.44%, and 89.17% for FER2013, NHFI, and AffectNet, respectively. The recognition rates improved most significantly for the largest and worst-classified datasets, i.e., the proposed method works best for datasets with a high level of misclassified images. The refinement process of the dataset would enable several types of models to work well, not only diverse CNN architectures but also others such as the transformer. Our proposal, beyond its application to the FER domain, is also useful for a variety of computer vision problems where the data are images. Furthermore, it can serve as a debugging tool in the automatic collection of image datasets. We maintained the size of the dataset, considering that quantity is important. However, there are irrelevant images that should be removed, and the imbalance could be addressed with data augmentation or GANs. We believe that these contributions would improve the quality of the dataset and the accuracy of the models. Therefore, a methodology for automatic learning should consider the quality of the dataset as a prerequisite to the search for better network architectures and model configurations.

Figure 2: Workflow for selecting and showing some error samples.
Figure 5: Some errors in the AffectNet dataset.
Figure 7: Architecture of the CNN for the FER2013 dataset.
Figure 8: Architecture of the CNN with transfer learning for the NHFI dataset.
Figure 10: Graphical comparison of both distributions.
Figure 19: The learning curves and confusion matrix for the reclassified AffectNet dataset.

Table 1: Datasets considered and their main characteristics.
Table 2: Distribution of categories and number of images in FER datasets.
Table 3: Refinement of the NHFI dataset using the CNN from scratch.
Table 4: Training hyperparameters set for our experiments.
Table 5: Summary of experimental results for the FER2013 dataset.
Table 6: Distribution of the original and reclassified FER2013 dataset.
Table 7: Comparison of the training results for the original and reclassified FER2013 datasets.
Table 8: Summary of experimental results for the NHFI dataset.
Table 9: Distribution of the original and reclassified NHFI dataset.

Table 14: Comparison of the training results for the balanced and reclassified AffectNet datasets.
Table 15: Distribution of the original and new AffectNet datasets.
Table 16: Comparison of state-of-the-art performance on the FER datasets considered.

Conclusions
Facial expression recognition in the wild is a challenging problem for computer systems. Promising results have been achieved with deep learning methods, where the model and the data share responsibility. The vast majority of the research is oriented towards designing better models, which is not sufficient when the data suffer from drawbacks. One of the most influential problems in FER datasets is misclassification. In this work, we presented and implemented a method to reclassify all the facial images of a dataset by generating a new distribution that increases the accuracy of FER models. The proposed method keeps the convolutional network fixed and iteratively improves the data over successive trainings. After each training, the dataset is evaluated with the confusion matrix, and the facial images corresponding to the correct predictions (on-diagonal) are selected to form the subsequent training data. This process gradually generates a more accurate model and more distinctive features for each category of facial expression. The model from the last training is used to reclassify all the images, creating a new distribution of the dataset. We experimented with popular FER datasets and CNNs created from scratch and with transfer learning.