Designing an Efficient System for Emotion Recognition Using CNN



Introduction
Various techniques have emerged to perform emotion recognition from faces, including support vector machines (SVMs) [1], hidden Markov models [2], and neural networks [3,4]. In machine learning, the SVM is applied to classification and regression analysis. It is a supervised machine learning algorithm that can be used to solve data classification and regression problems.
SVM was developed from statistical learning theory by Vapnik and Chervonenkis in 1990. Its basic idea is to transform the input space into a high-dimensional feature space using a kernel function and then obtain a maximum-margin classification in this new feature space.
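The kernel idea described above can be illustrated with a minimal sketch. The paper does not name a specific kernel, so the Gaussian (RBF) kernel, the `gamma` value, and the sample points below are assumptions chosen only for illustration. The kernel returns the similarity of two inputs in the implicit feature space without computing that space explicitly.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def gram_matrix(points, gamma=0.5):
    """Pairwise kernel values; an SVM solver works on this matrix."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]

# Hypothetical 2-D feature vectors (not taken from the paper).
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
K = gram_matrix(points)
```

Each diagonal entry equals 1 because a point is maximally similar to itself, and the similarity decays as the distance between two points grows.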
Hidden Markov models (HMMs) have been widely used to model the temporal behavior of facial expressions in image sequences because they are able to capture temporal dependencies. HMMs are probabilistic models which consist of a countable number of states, transitions, and corresponding emissions. HMMs are easy to model, but their behavior varies with the parameters that describe them.
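To make the roles of states, transitions, and emissions concrete, the following sketch computes the likelihood of an observation sequence with the standard forward algorithm. The two states, the observation symbols, and all probabilities are hypothetical values invented for this example; they do not come from any of the cited works.

```python
def forward_likelihood(observations, initial, transition, emission):
    """Return P(observations) under an HMM using the forward algorithm."""
    states = list(initial)
    # alpha[s]: probability of the observations so far, ending in state s.
    alpha = {s: initial[s] * emission[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emission[s][obs]
            * sum(alpha[prev] * transition[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical two-state model of a face over consecutive frames.
initial = {"neutral": 0.6, "smile": 0.4}
transition = {
    "neutral": {"neutral": 0.7, "smile": 0.3},
    "smile": {"neutral": 0.2, "smile": 0.8},
}
emission = {
    "neutral": {"mouth_closed": 0.9, "mouth_open": 0.1},
    "smile": {"mouth_closed": 0.3, "mouth_open": 0.7},
}

likelihood = forward_likelihood(
    ["mouth_closed", "mouth_open", "mouth_open"], initial, transition, emission
)
```

In a recognition setting, one such model would typically be fitted per expression, and a sequence would be assigned to the model that gives it the highest likelihood.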
Researchers have proposed many neural-network-based techniques to recognize facial expressions. Neural networks can easily implement the mapping from the feature space of face images to the facial expression space.
In addition, neural networks work well for pattern classification problems in engineering [5,6]. They have been used for several image classification problems in image processing, and researchers have developed various NN structures suited to their specific problems. After the network is trained, the generated model can be tested. To optimize performance, the hyperparameters can be fine-tuned and the neural network architecture can be customized by incorporating specific layers. In fact, a variety of optimization tools has been employed in neural networks to learn from past experience and use that prior training to identify new patterns and classify new sentiment data.
We have recently witnessed progress in the use of deep learning (DL) for neural-network-based classification [7][8][9][10]. Facial expression recognition in videos is an active area of research in computer vision. Accordingly, researchers in this field are interested in developing techniques to interpret facial expressions, code them, and extract their features in order to better predict emotions. Capitalizing on the remarkable success of deep learning, various architecture types are harnessed to attain superior performance. The remainder of this paper is organized as follows: Section 2 reviews and discusses the previous approaches used for expression recognition. Section 3 describes the proposed work. Section 4 introduces the results and discussion. Section 5 concludes this paper.

Previous Work
Sinha and Aneesh [9] embedded handcrafted features in the training process of the network in order to reduce the difference between the features learned by deep networks and the handcrafted ones. Their work was based on the HoloNet with feature loss (HNwFL) for feature learning and the fusion network for recognition. In HNwFL, the handcrafted feature information was integrated into the HoloNet, and the new feature loss was tested on the CK+, JAFFE, and FER2013 datasets. When compared with other works, this network obtained the best accuracy, with 97.35% on the CK+ dataset. In fact, the suggested network provided much better accuracy than both the network without feature loss and the original handcrafted features.
The study in [11] applied computer vision and image processing to help teach young autistic children to recognize human facial expressions. Facial expression recognition using a deep convolutional neural network was proposed by Haque et al. Kaggle's FER2013 dataset was used to train and evaluate the model. Their work resulted in 63.11% accuracy without overfitting the model. The bright images achieved better accuracy than the dark ones. Thus, the dataset was divided into four groups with different lighting conditions, and the same model was trained again on each group.
Mollahosseini et al. [12] proposed a deep neural network architecture to address the FER problem across well-known face datasets. Seven publicly available databases, namely MultiPIE, MMI, CK+, DISFA, FERA, SFEW, and FER2013, were used in this experiment. The proposed network consisted of two convolutional layers, each followed by max pooling, and then four inception layers. The network took facial images as input and classified them into six emotional expressions (anger, disgust, fear, happiness, sadness, and surprise). Its results were comparable to or better than those of traditional convolutional neural networks in both accuracy and training time.
Tümen et al. [13] aimed to build a facial expression recognition (FER) system based on a convolutional neural network (CNN) so as to classify the expressions in the FER2013 database. The presented CNN achieved a 57.1% accuracy rate on the FER2013 database [13]. It also provided good results when detecting emotional expressions without any preprocessing. A particularly high rate was achieved in the three classes of happiness, surprise, and disgust.
Xie et al. [14] introduced a novel approach for FER named deep attentive multipath convolutional neural networks (DAM-CNNs). The proposed model contained three modules: the VGG-Face, the Salient Expressional Region Descriptor (SERD), and the Multipath Variation-Suppressing Network (MPVS-Net). The VGG-Face (Visual Geometry Group) module extracted features. The SERD automatically located expression-related regions in the target image. The MPVS-Net module separated expressional information from irrelevant variations. By jointly combining SERD and MPVS-Net, the DAM-CNN was able to highlight relevant facial traits and yield a robust image representation for emotion recognition. AlMarri [15]

Background
The current paper is based on the technique of Github et al. [16,17], which was selected according to the tests presented in the "Study of Previous Works" section. This method was modified to boost the performance of its neural network. The approach of Github et al. [16,17] is presented through the neural network architecture summarized in Figure 1. Initially, a large number of images were captured with a camera. The face extraction module then used trained Haar cascades or deep neural networks (DNNs). The classification algorithm was trained on the FER2013 dataset. Finally, a model was generated to classify emotions from different facial expressions: angry, disgusted, fearful, happy, sad, surprised, and neutral.
3.1. Proposed Method. Facial expression recognition is difficult to achieve because of the slight differences between several emotions, which require an efficient and precise algorithm to be trained.
The imbalanced distribution of emotion classes gave rise to low accuracy. Therefore, two fundamental approaches, namely data-centric machine learning (ML) and the model-centric approach [18], were used in the deep learning algorithms. Data-centric ML aimed at improving the quality of the dataset, while the model-centric approach relied on (i) the early stopping method using Keras and (ii) hyperparameter tuning.
Two enhancement strategies were adopted to overcome the shortcomings of the Kant et al. method and improve its performance. The first strategy relied on the RelGAN (relative attributes generative adversarial networks) approach [19,20], while the second introduced the early stopping technique. These two strategies are shown in the functional block diagram of our implementation (Figure 2). First, the anavid dataset [21] was collected from both private and public datasets, and its preparation was optimized in three versions. Then, various data augmentation techniques were applied to the anavid dataset. Moreover, the RelGAN method was applied to the same dataset. After our own dataset was created, the resulting images were divided into training, test, and validation sets according to their usage. Afterwards, memory optimization was integrated into the algorithm. Subsequently, early stopping was applied to the neural network architecture. In addition, the designed model was improved by tuning several candidate hyperparameter combinations. Finally, the emotion classification was produced.
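The division into training, validation, and test sets mentioned above can be sketched as follows. The paper does not report the split proportions, so the 80/10/10 ratio, the fixed seed, and the file names are assumptions made purely for illustration.

```python
import random

def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle items reproducibly and cut them into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains after train and val
    return train, val, test

# Hypothetical image identifiers standing in for the collected frames.
image_ids = [f"img_{i:04d}.png" for i in range(1000)]
train, val, test = split_dataset(image_ids)
```

Shuffling before cutting keeps the three subsets disjoint while avoiding any ordering bias from how the images were collected.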
The three contributions (data augmentation, generative adversarial networks (GAN), and early stopping) are each described in detail in the following subsections.

Data Augmentation.
Data augmentation is a widely employed method in computer vision, which entails applying diverse transformations to the initial training data, thereby generating additional augmented samples. By adopting this approach, the size and diversity of the training dataset can be expanded, leading to enhanced performance and robustness of computer vision models. Both the input images and their labels can be augmented using certain techniques. In computer vision, typical data augmentation techniques include flipping or mirroring, rotation, scaling and cropping, translation, Gaussian noise, and color jittering. These are just a few instances of computer vision techniques for data augmentation. The selection and combination of the augmentation methods depend on the particular task, the dataset, and the desired characteristics that the model should learn. Data augmentation is a useful technique for enhancing the performance and generalization of computer vision models, particularly when training data are scarce.

Generative Adversarial Network (GAN).
The generative adversarial network (GAN), a powerful unsupervised generative model, has been widely applied to generate photorealistic images and a variety of other image types. It is thus beneficial for emotion classification and for augmenting training data. An analysis of GAN methods was performed, which identified a whole GAN family including CycleGAN, DCGAN, MoCoGAN, and RelGAN [22][23][24], which are representative methods in multidomain image-to-image translation. The RelGAN technique was selected, and its code was modified. The RelGAN technique consists of generating the same face with other expressions and looks ("Pale_Skin," "Smiling," "Eyeglasses," and "Gray_Hair"). This method is capable of modifying images by changing particular attributes of interest [19,20]. Four attributes were applied to public datasets:
(i) Pale_Skin
(ii) Smiling
(iii) Eyeglasses
(iv) Gray_Hair
Figure 3 illustrates an original image and its four effects ("Pale_Skin," "Smiling," "Eyeglasses," and "Gray_Hair"). Tests revealed the efficiency of the RelGAN technique on a public dataset because the public images were taken from a close distance and the facial features were therefore clear. Using this technology, a DL algorithm could generate more facial expressions.

Early Stopping.
Early stopping [25,26] is an effective technique that is applied to prevent overfitting. The experiments show that early stopping can enable the networks to learn more novel features and achieve high predictive performance.

Datasets. A series of experiments was performed over different datasets:
JAFFE [27]: The Japanese Female Facial Expression (JAFFE) database contains 213 images of posed expressions from 10 Japanese female subjects. Each subject displays 7 different emotional facial expressions (6 basic facial expressions + 1 neutral). The database is challenging since it contains few example images per subject and expression. The subjects were asked to pose facial expressions related to Ekman's emotions: "happiness," "anger," "sadness," "surprise," "disgust," "fear," and "neutral." The resolution of the original facial images is 256 × 256 pixels in TIFF format. The images of each expression were assembled by Miyuki Kamachi, Michael Lyons, and Jiro Gyoba at Kyushu University, Japan. Each image is annotated with average semantic ratings, given by 60 Japanese viewers, on nouns describing the posed expression.
Facial expression recognition (FER) 2013 [28]: The FER2013 database was presented during ICML 2013. The dataset was created and labeled using the Google search API. It [29] is used in 32% of research studies and consists of 35,887 images. In fact, FER2013 includes 28,709 training images, 3,589 validation images, and 3,589 test images. In order to label all of the roughly 36k photos, the dataset uses Ekman's emotions, which comprise seven emotional expressions (neutral, fear, disgust, sadness, happiness, surprise, and anger). Each image is in grayscale with a resolution of 48 × 48 pixels. The images are smaller than those of other datasets. FER2013 has more variation in the frames, including facial occlusion (mostly with a hand), partial faces, low-contrast images, and eyeglasses.
KDEF [30]: Karolinska Directed Emotional Faces (KDEF) comprises images of 70 individuals (35 men and 35 women), all between 20 and 30 years old. Each image belongs to one of the following classes (fear, anger, disgust, happiness, neutrality, sadness, and surprise), and each expression is photographed from five different angles (full left profile, full right profile, half left profile, half right profile, and straight). The KDEF dataset consists of color photographs with a 562 × 762 image format. The KDEF dataset [31] consists of 4,900 images which were made at the Karolinska Institute, Stockholm. It is widely used in the field of facial expression recognition.
The extended Cohn-Kanade (CK+) [32] is a popular facial expression recognition dataset that is commonly used in several works. The CK+ dataset contains 593 image sequences of persons of different age (18 to 50 years old), gender, and heritage. The dataset contains images with a resolution of 640 × 490. Among all sequences, 327 are labeled with one of seven basic expression labels: anger, contempt, disgust, fear, happiness, sadness, and surprise.
The MMA Facial Expression Database (MMAFEDB) [32] consists of emotion and expression images with a resolution of 48 × 48 pixels. It is divided into 3 sections: testing, training, and validation. Each section includes seven classes.
The Yale face database [33] contains 165 grayscale images of 15 individuals with different facial expressions and configurations, including center-light, w/glasses, happy, left-light, w/no glasses, normal, right-light, sad, sleepy, surprised, and wink.
The GENKI database [34] was collected by the Machine Perception Laboratory (MPLab), University of California, San Diego. The MPLab GENKI-4K is a database subset containing 4,000 real-life faces annotated in two classes, "smiling" and "nonsmiling," downloaded from publicly available Internet repositories. These images involve a wide range of backgrounds, geographical locations, illumination conditions, personal identity, and ethnicity. This dataset has large variations in pose, age, and gender. It consists of 1,940 grayscale face images with varying facial expressions. The GENKI-SZSL subset includes 3,500 images collected from the Internet. They are classified according to face location and size. The current release of the GENKI dataset is GENKI-R2009a, a version which consists of 7,172 images that combine to form the following subsets:
(1) GENKI-4K: 4,000 images with expression and head-pose labels
(2) GENKI-SZSL: 3,500 images with face position and size labels
A training set consisting of 2,958 data samples is selected from the whole GENKI dataset.

Evaluation Criteria. The confusion matrix [35,36], an evaluation measure broadly applied in deep learning classification algorithms, is used to reflect the behavior of various models in supervised classification contexts. It is a square matrix in which the rows represent the actual class of the instances and the columns show their predicted class. A 2 × 2 confusion matrix reports the number of true positives (TP) and false negatives (FN) in the row of actual positives, and the number of false positives (FP) and true negatives (TN) in the row of actual negatives.
(i) True negatives are the number of negative elements that are correctly predicted as negatives
(ii) True positives are the number of positive elements that are correctly labeled as positives
(iii) False negatives are the number of positive elements that are falsely classified as negatives
(iv) False positives are the number of negative elements that are incorrectly classified as positives
Memory management [37] is the biggest challenge in deep neural networks (DNNs). In fact, memory can still be an important constraint when the model is too big, when working with large amounts of data, or when running two models at the same time.
During training, the use and generation of Keras models require a large amount of memory. Thus, optimization for embedded GPUs is addressed in order to fit within the limited memory bandwidth and potentially decrease memory storage costs.
In the proposed approach, we intend to reduce memory consumption. In fact, some instructions are integrated into the CNN algorithm to achieve real-time processing. The following instructions (TensorFlow 1.x API) are added:
    import tensorflow as tf
    from keras import backend as K
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.1)
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
The memory consumption of facial emotion recognition using the convolutional neural network (CNN) is thereby optimized. In fact, memory optimization is integrated into the algorithm with a variable parameter (0.192 and 0.1). As illustrated in Table 1, the proposed experiment reduces memory consumption from 9.07 to 4.89. Hardware resource utilization is further reduced from 4.89 to 1.64 by setting the parameter to 0.1 (see Table 1). As a result, the current approach offers a flexible trade-off between the high accuracy of the overall adopted network and lower memory consumption.

Study of Previous Works.
In this section, three previous methods by Github et al. [16,17], Li et al. [38], and Correa et al. [39] are compared. The Github et al. method [16,17] is described in the Background section above. Li et al. [38] proposed a multikernel convolutional block to extract facial expression features. First, this block was built from three depth-wise separable convolutional kernels. Second, the multichannel information was fused to obtain multikernel enhancement features. Then, a "channel split" operation was performed on the input of the multikernel convolutional block. Finally, a lightweight multikernel feature expression recognition network was designed by alternately using the multikernel convolutional block and depth-wise separable convolutions. Correa et al. [39] designed a neural-network-based artificial intelligence system for emotion recognition. Three promising neural network architectures were trained and subjected to various facial expression classification tasks. Then, the best performing network was further improved. For this purpose, different approaches were tested and evaluated. The final model was applied in a real video stream application that could instantaneously return the user's emotion. In this paper, deeply trained models for emotion detection are presented through the use of the FER emotion databases. The Github et al. method [16,17] achieved the best results compared to the Correa et al. [39] and Li et al. [38] methods. Its simulation yields the highest values in the different classes (angry, happy, sad, and neutral). The high efficiency of the Kant et al. model is reflected in its high true positive values in the four categories (see Table 2). Consequently, the Kant et al. method is used in all remaining tests. In order to further improve its performance, different tools such as data augmentation, RelGAN technology, and early stopping are applied.

Model Comparison (with and without Data Augmentation).
A small dataset affects the trained model: the model does not generalize reliably to the test and validation images. Thus, the generated models suffer from overfitting, which can be addressed through various proposed methods. According to the study in [40], one approach is to add regularization to the weights. Another technique is dropout [41], which consists of dropping certain connections in layers or removing neurons from layers during training. Another popular method is batch normalization [42], which can be applied to any neural network layer. Some works explore data augmentation techniques to resolve the overfitting problem. In deep convolutional neural networks, a limited number of pictures contributes to overfitting. During training, when the number of samples in each class is too small, the resulting accuracy is poor. For this reason, different kinds of data augmentation are applied to the dataset. Various data augmentation techniques such as rotation, noise addition, flipping, zooming, cropping, and brightness adjustment are performed. As a result, the classification loss is reduced, and the neural network learns better and generates a better model. As can be seen in Table 3, data augmentation is an effective way to improve the accuracy of image classifiers. The model trained on the balanced dataset obtained through data augmentation achieves higher true positive values in the four emotion categories than the models trained on unbalanced datasets. The most important facial landmarks are the mouth, the corners of the eyes, the lobes of the ears, the chin, and the tip of the nose. They help to distinguish the emotion class for pictures captured in front of the camera. In the case of blurry images or pictures captured far from the camera, the algorithm provides incorrect classifications (see Table 4).
This model can be improved by adding more datasets and by collecting enough training data using the RelGAN technique. In the following section, further experiments are carried out using various datasets.

Journal of Electrical and Computer Engineering
The early stopping method is applied in the implementation phase. The hyperparameters of early stopping are adjusted to further improve the model. The hyperparameters are summarized in Table 5.
"mode": set to "min". The minimum is sought when the validation loss is monitored, whereas the maximum would be sought when the validation accuracy is monitored.
"monitor": the quantity to be monitored.
"patience": the number of epochs without improvement in the monitored quantity after which training stops. For example, if patience is set to 3 and the loss does not decrease for three consecutive epochs, the training stops.
"verbose": can be set to 1. Once the training is stopped, the epoch number is printed.
"min_delta": fixed to 0.00001. Therefore, an absolute change of less than min_delta is not considered an improvement.
"restore_best_weights": set to True to make sure that the final model we get is the best one.
Supervised machine learning (ML) algorithms require a lot of data to be able to generalize. The principle of data augmentation is to generate more images from the original dataset and to create more variations of image appearance, such as different backgrounds and contexts. Therefore, during training, more image variations are obtained to boost the performance of the recognition model. In order to increase both data size and training data diversity, various transformations are used.
As a result, the anavid dataset [21] is modified into three different sets, and the model is trained again on each of them. Table 6 lists these datasets with the distribution of frames in each class.
A more detailed analysis of our training data provides the first element of understanding. When we do not have sufficient data, it is impossible to carry out the initial application efficiently. In fact, the number of images is not the same for each class, so the classes are unbalanced: with little data, some classes are frequent and others are rare. This lack of data hampers our network in two different ways:
(i) By not providing it with sufficient information to allow it to learn to differentiate certain classes
(ii) By leading it to specialize in the training data (overfitting) and therefore be unable to provide a correct answer for the test data
For this reason, the neural network can face the overfitting problem. In brief, neural networks require a lot of training data and balanced classes.
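The class imbalance described above can be quantified before any augmentation is applied. The sketch below counts the samples per class and reports how many additional (for example, augmented) images each class would need to match the largest class. The label counts are hypothetical and do not reproduce the distributions in Table 6; balancing up to the largest class is one possible policy, assumed here for illustration.

```python
from collections import Counter

def augmentation_plan(labels):
    """Return, per class, how many extra samples would balance the dataset."""
    counts = Counter(labels)
    target = max(counts.values())  # balance every class up to the largest one
    return {label: target - n for label, n in counts.items()}

# Hypothetical, deliberately unbalanced label list.
labels = ["happy"] * 500 + ["neutral"] * 300 + ["angry"] * 120 + ["sad"] * 80
plan = augmentation_plan(labels)
```

With these example counts, the rarest class ("sad") needs the most additional images, which is where augmentation or GAN-generated samples would be concentrated.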
The models generated by training on the abovementioned three datasets are tested on a test dataset that includes 2,000 images for each class (happy, neutral, angry, and sad). The results are illustrated in different confusion matrices (see Table 7).
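A multiclass confusion matrix of the kind reported here, with rows as actual classes and columns as predicted classes, can be built as in the following sketch. The label lists are tiny invented examples rather than the paper's test results, and the helper names are hypothetical.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def per_class_counts(matrix, k):
    """Return (TP, FN, FP, TN) for class k, treating it as the positive class."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # actual k, predicted otherwise
    fp = sum(row[k] for row in matrix) - tp  # predicted k, actually otherwise
    tn = total - tp - fn - fp
    return tp, fn, fp, tn

classes = ["happy", "neutral", "angry", "sad"]
# Hypothetical labels for eight test images.
actual = ["happy", "happy", "neutral", "angry", "sad", "sad", "neutral", "angry"]
predicted = ["happy", "neutral", "neutral", "angry", "sad", "happy", "neutral", "sad"]

matrix = confusion_matrix(actual, predicted, classes)
happy_counts = per_class_counts(matrix, classes.index("happy"))
```

The diagonal entries count the correctly classified images of each class, and every off-diagonal entry shows which class a misclassified image was confused with.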
It is interesting to look at class-by-class results through a confusion matrix. It indicates, along its diagonal, the proportion of data correctly predicted for each class. The rest of the data is assigned to other classes (see Table 8). The confusion matrix shows a considerable improvement, as the combined approach helps to generalize the model and boost its performance. Early stopping is used to avoid overfitting in our experiments. Throughout the simulation, the designed networks are trained with an iterative technique which allows the model to better fit the training dataset. Early stopping improves the model performance on data outside the training set.
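The stopping rule behind early stopping can be sketched in plain Python. This is an illustrative re-implementation of the patience and min_delta logic, not the Keras callback itself; the class name `EarlyStopper` and the validation losses are hypothetical, while patience = 3 and min_delta = 0.00001 follow the values mentioned in the text.

```python
class EarlyStopper:
    """Track a monitored value and signal when it has stopped improving."""

    def __init__(self, patience=3, min_delta=1e-5, mode="min"):
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' or 'max'")
        self.patience = patience
        self.min_delta = min_delta
        self.mode = mode
        self.best = None
        self.best_epoch = None
        self.wait = 0

    def _improved(self, value):
        if self.best is None:
            return True
        if self.mode == "min":
            return value < self.best - self.min_delta
        return value > self.best + self.min_delta

    def update(self, epoch, value):
        """Record one epoch; return True when training should stop."""
        if self._improved(value):
            self.best, self.best_epoch, self.wait = value, epoch, 0
            return False
        self.wait += 1
        return self.wait >= self.patience

# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.65, 0.67, 0.60]
stopper = EarlyStopper(patience=3, min_delta=1e-5, mode="min")
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.update(epoch, loss):
        stopped_at = epoch
        break
```

In this run, the loss last improves at epoch 2, so training stops three epochs later and the weights saved at `best_epoch` would be the ones to restore. The later, lower loss at epoch 6 is never reached, which illustrates the trade-off in choosing the patience value.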

4.7. Comparison with State-of-the-Art Methods. The proposed method is compared to previous works [11][12][13][14][15][16][17]43]. Experiments demonstrate that our model outperforms the state-of-the-art methods on FER2013 and achieves an accuracy of 65.89%. Simulation results on FER2013 are listed in Table 8. They show that the proposed method provides better results, with a performance gain ranging from 0.55% to 35.7%.

Conclusion
Facial emotions are important factors in human communication. Feeling and expressing emotions are basic skills of social interaction. This work aims to study human emotion. FER can be combined with early stopping and data augmentation, using ordinary transformations and GAN technologies, in order to boost the performance of our DL algorithm. Moreover, different experiments are carried out to analyze the validity and the applicability of this technique relative to other methods in the field of emotion recognition. In future work, we will carry out new experiments to further improve this research. In addition, we will further enhance the performance of the Jetson card by studying the structural characteristics of embedded systems and improving their functionality.
The data-centric approach improved the quality of the dataset by:
(i) Balancing classes: the number of items should be the same in each class
(ii) Data augmentation: adding augmented images to some classes
(iii) Increasing diversity: images should be from shops (high quality) and consumer-taken (low quality)
(iv) Removing classes
The model-centric approach included two techniques: the early stopping method using Keras and hyperparameter tuning.
(i) Flipping or mirroring: This entails flipping the image either horizontally or vertically to produce a mirrored replica. It aids in improving the model's ability to generalize to various orientations or viewpoints.
(ii) Rotation: Rotating the image by a predetermined angle, such as 90 degrees, can help models become invariant to rotation.
(iii) Scaling and cropping: Rescaling the image to multiple sizes or randomly cropping a section of the image helps increase the model's robustness and ability to handle objects of various scales.
(iv) Translation: Shifting the image horizontally or vertically makes models more tolerant of object displacement and helps them learn spatial invariance.
(v) Gaussian noise: Incorporating random noise into the image simulates variations in lighting conditions and enhances the model's robustness in noisy environments.
(vi) Color jittering: Changing the image's color values, such as brightness, contrast, or saturation, can help the model better adapt to changes in the color and lighting of an environment.
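Several of the transformations listed above can be sketched on a grayscale image stored as a nested list of pixel values. This is a minimal illustration with hypothetical function names and a tiny made-up image; a real pipeline would normally rely on an image library, and the sketch covers only flipping, rotation, translation, and brightness adjustment.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def flip_vertical(image):
    """Mirror the image top to bottom."""
    return [row[:] for row in image[::-1]]

def rotate_90_clockwise(image):
    """Rotate by 90 degrees clockwise (height and width are swapped)."""
    return [list(row) for row in zip(*image[::-1])]

def translate_right(image, shift, fill=0):
    """Shift pixels right by `shift` columns, padding the left with `fill`."""
    return [
        [fill] * min(shift, len(row)) + row[:max(0, len(row) - shift)]
        for row in image
    ]

def adjust_brightness(image, delta):
    """Add `delta` to each pixel and clamp to the 8-bit range 0..255."""
    return [[min(255, max(0, px + delta)) for px in row] for row in image]

# Tiny hypothetical 2 x 3 grayscale image.
image = [[10, 20, 30],
         [40, 50, 60]]

augmented = [
    flip_horizontal(image),
    flip_vertical(image),
    rotate_90_clockwise(image),
    translate_right(image, 1),
    adjust_brightness(image, 200),
]
```

Each transformed copy keeps the label of its source image, so one labeled face yields several training samples.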

Figure 2: Block diagram of proposed method.

4.6. Comparative Study of a Combination of Early Stopping and Data Augmentation Test. Early stopping can be combined with data augmentation in order to enhance the model. Early stopping is added to prevent deep learning neural network models from overfitting. The early stopping technique provides three benefits:
(i) It reduces overfitting when added to an existing model
(ii) It reduces the training execution time by choosing the number of training epochs
(iii) It yields high rates in the confusion matrix

Table 4: Example of images.