Deploying Machine Learning Techniques for Human Emotion Detection

Emotion recognition is one of the trending research fields. It is involved in several applications. Its most interesting applications include robotic vision and interactive robotic communication. Human emotions can be detected using both speech and visual modalities. Facial expressions can be considered as ideal means for detecting the persons' emotions. This paper presents a real-time approach for implementing emotion detection and deploying it in the robotic vision applications. The proposed approach consists of four phases: preprocessing, key point generation, key point selection and angular encoding, and classification. The main idea is to generate key points using MediaPipe face mesh algorithm, which is based on real-time deep learning. In addition, the generated key points are encoded using a sequence of carefully designed mesh generator and angular encoding modules. Furthermore, feature decomposition is performed using Principal Component Analysis (PCA). This phase is deployed to enhance the accuracy of emotion detection. Finally, the decomposed features are enrolled into a Machine Learning (ML) technique that depends on a Support Vector Machine (SVM), k-Nearest Neighbor (KNN), Naïve Bayes (NB), Logistic Regression (LR), or Random Forest (RF) classifier. Moreover, we deploy a Multilayer Perceptron (MLP) as an efficient deep neural network technique. The presented techniques are evaluated on different datasets with different evaluation metrics. The simulation results reveal that they achieve a superior performance with a human emotion detection accuracy of 97%, which ensures superiority among the efforts in this field.


Introduction
Recognition of human emotions is a vital task involved in several applications, such as augmented and virtual reality [1,2], advanced driver assistance systems [3], human-computer interaction [4], and security systems [5][6][7]. Humans have several ways of interpreting the emotions of others, such as speech and linguistic aspects [8] and facial expressions [9][10][11]. Furthermore, emotions can be detected based on gaze direction [12] and biosignals including electroencephalogram (EEG) and electrocardiogram (ECG). Emotional expressions are used for intelligent Human-Robot Interaction (HRI). Emotion analysis can also be used to track students' emotions in order to enhance the learning environment, so that students can learn better. The information obtained through emotion analysis is also useful for monitoring the overall mood of a group of persons to identify any destructive events [13]. In human interaction, 7% of the affective information is conveyed by words, 38% is conveyed by speech tone, and 55% is conveyed by facial expressions [14]. Therefore, facial emotion analysis can be a dependable approach to recognize human emotions for HRI applications. The robot vision issue can be handled using thermal images [15][16][17] and RGB images [18]. This paper presents a real-time study for emotion detection and deployment in robotic vision applications. The proposed approach consists of four phases: preprocessing, feature extraction and selection, feature decomposition, and classification. Feature extraction and selection is carried out by the MediaPipe face mesh algorithm, which is based on real-time deep learning. In addition, the feature decomposition phase is performed by PCA. This phase is deployed to enhance the accuracy of emotion detection. The extracted features are decomposed using the Singular Value Decomposition (SVD). Finally, the obtained features are enrolled into a selected classifier.
In addition, an MLP deep neural network is utilized. The introduced techniques are assessed on different datasets with the help of different evaluation metrics. Moreover, this paper introduces a hardware implementation of the proposed models. The main contributions of this work can be summarized as follows: (1) A novel fast and robust emotion detection framework for robotic vision applications is proposed. (2) An emotion face mesh is introduced depending on automatic key point determination from face images. (3) Key point angular encoding is presented to generate sensitive and distinguishable angular features. (4) Emotion classification is performed depending on various machine learning techniques. (5) A brief comparison is made between the deployed techniques in terms of accuracy, scalability, and processing time.
The remaining parts of this paper are organized as follows. Section 2 covers the works introduced in the literature. Section 3 describes the datasets utilized in this work. The proposed methodology is discussed in Section 4, and its simulation results are given in Section 5. The discussion in Section 6 compares the performance of the proposed approach with the works in the literature. Finally, the concluding remarks are given in Section 7.

Related Work
Several researchers have presented frameworks to handle the issue of HRI. The work in [19] offers a conditional-generative-adversarial-network-based (CGAN-based) framework to reduce intraclass variances by managing facial expressions individually, while simultaneously learning generative and discriminative representations. A generator G and three discriminators (Di, Da, and Dexp) make up this architecture. Any query face image is transformed into a prototypic facial expression form with certain factors kept by the generator G. An accuracy of 81.83% was achieved. A model based on CNN was proposed in the work of [20]. It was designed for smile detection, emotion recognition, and gender classification.
Therefore, it is considered a multi-task model. It achieved an accuracy of 71.03%. Some efforts have been presented for emotion detection using deep learning. The work in [21] introduced a deep CNN to deploy a facial expression recognition system. This system can automatically extract the features of facial expressions to allow automatic recognition. In addition, it consists of input, preprocessing, recognition, and output modules. Furthermore, it was used to simulate and assess the recognition performance under the effect of several aspects such as network structure, learning rate, and preprocessing on both the Japanese Female Facial Expression (JAFFE) dataset and the Extended Cohn-Kanade (CK+) dataset. To make the results more convincing, the authors used the k-Nearest Neighbor (KNN) technique. For the JAFFE and CK+ datasets, the accuracies are 76.7442% and 80.303%, respectively. Another model was proposed in [22]. It was tested on a facial expression dataset of HDR images, considering a collection of faces under different lighting conditions. It is based on SVM, Local Binary Patterns (LBPs), and appearance. It relies on the Speeded-Up Robust Feature (SURF) transform to conduct the emotion recognition task. This model revealed accuracy levels up to 80%. In [23], the authors presented a model for submission to the fifth Emotion Recognition in the Wild (EmotiW 2017) group-level emotion recognition subchallenge. They deployed a CNN to extract features from the detected face images. The CNN is trained for the face identification task, rather than being traditionally pretrained on emotion recognition problems. In the final pipeline, an ensemble of Random Forest (RF) classifiers was learned to predict an emotion score using an available training set. This model achieved an accuracy of 75.4% on the validation data.
Another trend in this field is to detect emotions from videos. The authors of [24] presented a hybrid deep learning model for emotion detection from videos. A spatial CNN is used for processing of static facial images and a temporal CNN for optical flow images. These two processing branches are used to learn high-level spatial and temporal features on video segments, separately. These two CNNs are fine-tuned using pretrained CNN models and target video facial expression datasets. A deep fusion network, which is deployed using a Deep Belief Network (DBN) model, fuses the collected features from the segment-level spatial and temporal branches. The obtained fused features are enrolled into a linear SVM for facial expression classification tasks. The authors achieved an accuracy of 75.39%. Moreover, another video-based emotion detection algorithm was presented in [25]. The authors investigated different ways for pooling spatial and temporal data. For video-based face expression identification, they discovered that pooling spatial and temporal information together is more efficient. Unlike the framework given in [24], this work is end-to-end trainable for whole-video recognition. The goal of this framework is to create a trainable deep neural network framework for pattern identification that integrates spatial and temporal information from video using CNNs and LSTMs. This framework achieved an accuracy of 65.72%.

Computational Intelligence and Neuroscience

Dataset Description
The proposed models are evaluated on three datasets: Cohn-Kanade (CK+) [26], Japanese Female Facial Expression (JAFFE) [27], and Real-world Affective Faces Database (RAF-DB) [28]. A description of each of them is given below.

Cohn-Kanade (CK+).
The CK+ dataset [26] consists of 593 video sequences from 123 participants. Each sequence contains images beginning from the onset (neutral frame) and progressing to the peak expression (last frame). The label associated with each sequence is taken from the peak expression. The dataset contains images for seven different expressions: anger, contempt, fear, disgust, happiness, surprise, and sadness. The images have a resolution of 640 × 480 pixels. In this work, the images are cropped to 48 × 48 pixels to focus on the subject's face. Figure 1 shows sample images for each expression.

Japanese Female Facial Expression (JAFFE).
The JAFFE dataset [27] has 213 photos of ten different female actors posing for seven different facial expressions. There are six primary expressions: happiness, sadness, surprise, anger, disgust, and fear, plus one neutral expression. The images have a resolution of 256 × 256 pixels. Figure 2 shows sample images for each expression.

Proposed Methodology
This paper presents an emotion detection approach based on deep and machine learning techniques. The main idea of this approach is to deploy deep learning as an automatic key point generator using the MediaPipe technique. Then, a sensitive mathematical process is performed to encode the generated key points into a set of distinguishable features. In addition, different machine learning techniques are implemented on the extracted features to perform the classification task. The proposed approach consists of four main phases. The first phase is image preprocessing, in which a super-resolution task is carried out using SRGAN. In the second phase, we deploy MediaPipe to generate key landmarks on the face images. Furthermore, we present a key landmark analysis and an angular encoding module. This module contains three subphases (key landmark selection, emotional mesh generation, and mesh angular encoding). The main idea of this module is to generate an emotional mesh that connects the selected key landmarks. Then, the obtained mesh is encoded into angular values to generate a feature map. Finally, the generated feature map is enrolled into a classifier to be discriminated into six categories. Figure 4 represents the proposed framework.

Preprocessing.
Generally, the images that are captured by robotic vision devices have a limited resolution due to the hardware limitations of cameras involved in such systems. Furthermore, most of the available datasets for human emotion recognition are down-sized because of the storage limitations.
Therefore, the first module in the proposed approach is the super-resolution module. In addition, the proposed approach involves angular feature extraction from the geometry of the face images, which requires a clear representation of the landmarks and boundaries of the face images to allow proper facial emotion recognition. SRGAN [29], a Generative Adversarial Network (GAN) for image Super-Resolution (SR), is employed in the current research to increase the perceptual quality of images prior to further processing. With SRGAN, the images are super-resolved with a 4x upscaling factor, while minimizing the Mean Squared Error (MSE) between the super-resolved and original images and maximizing the Peak Signal-to-Noise Ratio (PSNR). Figure 5 illustrates the preprocessing step employing the SRGAN. The figure displays an original image selected from the CK+ dataset and the corresponding super-resolved image after SRGAN. The original image size is 48 × 48 pixels, and the super-resolved image size is 192 × 192 pixels.
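As a rough illustration of the two quality measures mentioned above, the following minimal sketch computes the MSE and PSNR between two grayscale images. It is not the SRGAN training objective itself. The function names `mse` and `psnr`, the nested-list image representation, and the 8-bit peak value of 255 are assumptions made for this example, and both images are assumed to have the same size.

```python
import math


def mse(reference, restored):
    """Mean squared error between two equally sized grayscale images.

    Both images are nested lists of pixel intensities (rows of values).
    """
    total = 0.0
    count = 0
    for row_ref, row_res in zip(reference, restored):
        for p_ref, p_res in zip(row_ref, row_res):
            total += (p_ref - p_res) ** 2
            count += 1
    return total / count


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    err = mse(reference, restored)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)
```

A lower MSE corresponds to a higher PSNR, so minimizing the former and maximizing the latter pull in the same direction.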

Key Landmark Generation.
The process of key landmark generation is performed using the deep MediaPipe technique. MediaPipe [30] is an open-source ML framework developed by Google and devoted to building real-life computer vision applications. MediaPipe capabilities allow developers to focus on algorithm or model development, while using MediaPipe to iteratively improve their application with results that are consistent across different devices and platforms [31]. Solutions that are currently implemented with MediaPipe include face detection, face mesh annotation, iris localization, hand detection, pose estimation, hair segmentation, object detection and tracking, and 3D object detection (Objectron). These solutions are released on different platforms: mobile (Android and iOS), C++, Python, and JS. Real-life examples of ML solutions in MediaPipe are shown in Figure 6.
In the current work, the face mesh solution from the MediaPipe framework is employed to annotate the landmarks and boundaries of the face. Face mesh calculates 468 3D face landmarks in real time. It uses ML to infer 3D surface geometry using just a single camera input without a specialized depth sensor [32]. The solution provides real-time performance, even on mobile devices. Figure 7 displays an image selected from the JAFFE dataset with the 468 facial landmarks annotated on the image.
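Face mesh landmarks are typically reported with x and y normalized by the image width and height, so a conversion step is needed before any geometry is computed in pixel space. The sketch below shows one possible conversion. The helper name `to_pixel_coords` and the tuple-based landmark representation are hypothetical and are not part of the MediaPipe API or of the proposed framework.

```python
def to_pixel_coords(landmarks, width, height):
    """Convert normalized (x, y, z) landmarks to integer pixel (col, row) pairs.

    Values slightly outside [0, 1] are clamped to the image borders.
    The depth component z is ignored in this 2D sketch.
    """
    pixels = []
    for x, y, _z in landmarks:
        col = min(max(int(x * width), 0), width - 1)
        row = min(max(int(y * height), 0), height - 1)
        pixels.append((col, row))
    return pixels
```

For a 192 × 192 super-resolved image, a landmark at normalized position (0.5, 0.25) maps to pixel column 96 and row 48.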

Proposed Key Landmark Analysis and Angular Encoding.
This paper presents a key landmark analysis and an angular encoding module. This module contains three subphases (key landmark selection, emotional face mesh generation, and mesh angular encoding). The main idea of this module is to generate an emotional mesh, which connects the selected key landmarks. Then, the obtained mesh is encoded into angular values to generate a feature map. In the following subsections, each step in this module is discussed.

The selection of the key landmarks and their locations (Figure 8) is based on the Facial Action Coding System (FACS) [33,34], which encodes movements of individual facial muscles. It can be used to describe the facial actions that make up an expression based on changes in facial muscles, regardless of emotion. The movements of particular facial muscles, known as Action Units (AUs), are encoded by FACS. This requires unique instantaneous changes in facial appearance [35]. Table 2 describes the facial emotion-related AUs and the corresponding FACS names. A graphic-based demonstration of FACS with isolated AUs is illustrated in [36]. Hence, facial emotions can be represented using reliable combinations of different AUs, as demonstrated in Table 3. Each key landmark location is chosen such that it is likely to be affected by a specific emotion-related AU, which aims at better recognition of facial expressions.

Table 4 defines the edges that constitute the emotion face mesh, as well as the start and end vertices for each edge. The vertex IDs are defined in Table 1. The mesh has 27 vertices and 38 edges. The deformation of the emotion face mesh, measured by the deviation of angles between edges, reflects facial muscle contraction and relaxation, which is used to identify facial emotions. Figure 9 displays the emotion face mesh for sample images selected from the JAFFE dataset with different emotions.

Mesh Angular Encoding.
After acquiring the key landmarks and establishing the emotion face mesh, we use the mesh to extract the relevant features for emotion classification. The relevant features employed are geometric features, since most emotions can be detected from geometric changes. Ten features are extracted, defining angles between specific edges of the emotion face mesh. The angles are represented in degrees in the range of (0°, 360°). These features are then fed to the ML classifiers, which learn from them to identify each emotion. The low dimensionality of the features (10 features) makes them more resistant to local facial changes. In addition, the classifiers can be trained in a much shorter time. Moreover, the overall complexity of the proposed framework is significantly reduced. The list of angles taken as discriminant features for emotion classification and the three vertex IDs forming each angle are given in Table 5. An example depicting the angular features and their locations on a test face image is shown in Figure 10.
The angle between three vertices can be computed as follows (consider Figure 11). Let $(x_i, y_i)$ denote the coordinates of vertex $P_i$. The angle $\theta$ between the line (edge) connecting $P_2$ and $P_3$ and the line (edge) connecting the points $P_2$ and $P_1$ is unknown. The angle $\beta$ between the line $P_2$-$P_3$ and the X-axis can be computed as

$$\beta = \tan^{-1}\left(\frac{y_3 - y_2}{x_3 - x_2}\right).$$

Similarly, the angle $\alpha$ between the line $P_2$-$P_1$ and the X-axis can be computed as

$$\alpha = \tan^{-1}\left(\frac{y_1 - y_2}{x_1 - x_2}\right).$$

Hence, the angle $\theta$ will be

$$\theta = \beta - \alpha.$$

Using the above procedure, ten angles between prescribed edges in the emotion face mesh are computed and then used for classification. Angle values are all positive; negative values are avoided by adding 360° to them. Furthermore, the generated feature maps are redistributed using PCA to enhance their distribution.
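The three-vertex angle computation described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names `edge_angle` and `vertex_angle` are hypothetical, and `math.atan2` is used in place of a plain arctangent of the slope (an assumption of this sketch) so that vertical edges do not cause a division by zero.

```python
import math


def edge_angle(start, end):
    """Angle in degrees between the line start->end and the X-axis."""
    return math.degrees(math.atan2(end[1] - start[1], end[0] - start[0]))


def vertex_angle(p1, p2, p3):
    """Angle theta at P2 between edge P2-P3 and edge P2-P1, in [0, 360)."""
    beta = edge_angle(p2, p3)    # edge P2-P3 against the X-axis
    alpha = edge_angle(p2, p1)   # edge P2-P1 against the X-axis
    theta = beta - alpha
    if theta < 0:
        theta += 360.0           # keep the feature positive
    return theta
```

Calling `vertex_angle` once per row of the angle table (three vertex IDs per angle) would yield the ten-element feature vector for one face. Note that image coordinates usually have the Y-axis pointing downward, which mirrors the sign of the angles but does not affect their consistency across images.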

Classification.
In this work, we develop an automated facial expression identifier to recognize human emotions for robotic vision applications. Discriminant features extracted from a face (Section 4.3) are fed to classifiers to recognize the emotion in the given face. Decision Tree (DT), KNN, a multiclass SVM [37], Gaussian NB, MLP with backpropagation, Quadratic Discriminant Analysis (QDA), RF, and LR classifiers are used for classification. The trial-and-error method and grid search [38] are conducted to identify the optimal structure and hyperparameters of the classifiers. In addition, 10-fold cross-validation is employed to estimate the optimal hyperparameter combinations and to avoid overfitting. The optimal hyperparameters of the classifiers adopted in the current work are listed in Table 6. The images in the dataset are divided into two parts: a training part and a testing part. The training part is used to train/validate the classifier, and the testing part is used to test the performance of the classifier. The splitting scheme is 80/20, as shown in Figure 12. The 10-fold cross-validation adopted in the current model further splits the training part into ten folds (subsets). Nine folds are used to train the classifier, while the remaining fold is used to validate the training. This process continues until each of the ten folds is used exactly once for validation. The optimal configurations identified in the training stage are then applied in the testing stage.
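The splitting scheme described above can be sketched with index lists alone. The following is a minimal illustration under stated assumptions: the function names `train_test_split_indices` and `k_fold`, the fixed random seed, and the unstratified shuffling are choices made for this example and are not taken from the paper.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into training and testing parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation.

    Every index appears in exactly one validation fold.
    """
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation
```

In a grid search, each candidate hyperparameter combination would be scored by averaging its validation accuracy over the ten folds, and only the best combination would then be evaluated once on the held-out testing part.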

Experimental Results
Experiments are performed on an Intel Core i3 machine with 8 GB RAM. Python 3.9 is used as the development environment. To evaluate the performance of the proposed model, eight classifiers are employed to classify facial expressions across two benchmark datasets. The hyperparameters employed for each classifier are presented in Table 6. The classification is based on ten features extracted from images in each dataset using the procedure described in Section 4.
Learning curves, which show the cross-validation scores and behaviors for different training sizes for the adopted classifiers in the case of CK+, are shown in Figure 13. The confusion matrix for each classifier on the CK+ dataset using the proposed model is shown in Figure 14. It shows that the per-class accuracies of the Anger, Happy, and Surprise classes are higher with all classifiers than those of the other emotions, while the Contempt and Sadness classes have lower per-class accuracies. Moreover, the confusion matrices for the classifiers on the JAFFE dataset are shown in Figure 15. The performances of the proposed framework with eight classifiers on the CK+, JAFFE, and RAF-DB [28] datasets are presented in Tables 7-9. The reported results include accuracy, precision, recall, and F1-score, as well as the training time taken for each classifier. A visual comparison between the classifier accuracies across the used datasets is shown in Figure 16.
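The metrics in such a classification report can all be derived from a confusion matrix. The sketch below shows one way to do so. The function name `classification_report`, the row-as-true-class convention, and the use of macro averaging are assumptions of this example, since the averaging scheme is not stated in the text.

```python
def classification_report(confusion):
    """Compute accuracy and macro-averaged precision, recall, and F1-score.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    precisions, recalls, f1_scores = [], [], []
    for i in range(n):
        tp = confusion[i][i]
        predicted = sum(confusion[r][i] for r in range(n))  # column sum
        actual = sum(confusion[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1_scores.append(f1)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
        "per_class_recall": recalls,
    }
```

Under this convention, the per-class accuracy read off a row-normalized confusion matrix is the same quantity as the per-class recall returned here.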
Results reveal that the KNN classifier outperforms the other classifiers in terms of accuracy, precision, recall, and F1-score. It achieved the best accuracies of 97% and 95% on the CK+ and JAFFE datasets, respectively. The accuracies of the Gaussian NB, QDA, DT, LR, RF, MLP, and SVM classifiers on CK+ are 84%, 86%, 86%, 87%, 89%, 94%, and 94%, respectively, and those on JAFFE are 90%, 79%, 90%, 86%, 93%, 90%, and 88%, respectively. In addition, the time required to train the KNN and Gaussian NB classifiers is 0.005 sec on CK+, which is the lowest among all classifiers. The MLP and RF classifiers have the highest training times, which are 1.82 sec and 0.74 sec, respectively. Moreover, the proposed models are evaluated on the RAF-DB. The results of this evaluation reveal that the proposed MLP and SVM models can be considered good emotion detection models for this database, with an accuracy of 67% for both models. Therefore, the proposed approach provides a variety of models that are well suited to robust emotion detection environments.

Discussion
The simulation results reveal that the proposed approach shows a high performance in human emotion detection. Furthermore, they clarify that the proposed encoding module has a superior performance with the deployed classifiers including KNN, SVM, and MLP. In this section, a brief comparison is presented between the proposed approach and the works in the literature as illustrated in Table 10. It can be observed that the proposed approach has a superior performance among the efforts in this field.

Figure 16: Accuracies of eight ML classifiers on RAF, CK+, and JAFFE datasets for facial expression classification using the proposed approach.

Conclusion
The issue of Human-Robot Interaction (HRI) has been discussed in this paper. As a solution, the paper presented a novel approach for facial expression recognition. The proposed approach consists of four phases, which are carried out to extract key points from facial images using a real-time algorithm (MediaPipe). Furthermore, these key points are enrolled into a sequence of selection, mesh generator, and angular encoding modules. Moreover, the generated feature maps are classified using several classification algorithms, including SVM, KNN, RF, QDA, NB, LR, DT, and MLP. The novelty of the proposed approach lies in the proposed key point analysis and angular encoding algorithm. This algorithm is efficient, because it generates only ten features (angular values), which are discriminative for different emotional classification categories. The proposed approach has been evaluated on the CK+, JAFFE, and RAF-DB datasets. It reveals a superior performance in terms of detection accuracy and processing time. Furthermore, the low dimensionality of the extracted features enables the ML-based approaches to reach an optimum performance in a short time with a much lower computational cost than that of the DL-based approaches, which require more time for convergence and a much higher computational cost.
In addition, the future work that can be deduced from this paper is introducing a method for emotion detection from other modalities such as videos, spoken words, and written text. Furthermore, hardware implementation of the proposed approach is a research trend, which we are working on. Moreover, further machine learning techniques such as dictionary learning and semi-supervised learning can be performed to solve this issue.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.