Ensemble Convolution Neural Network for Robust Video Emotion Recognition Using Deep Semantics

Human emotion recognition from videos involves accurately interpreting facial features while coping with face alignment, occlusion, and shape-illumination problems. Dynamic emotion recognition is even more demanding: the situation becomes more challenging with multiple persons and fast-moving faces. In this work, an ensemble max-rule method is proposed. To obtain the ensemble results, three primary pipelines, CNN HOG-KLT, CNN Haar-SVM, and CNN PATCH, are developed in parallel to detect human emotions from the vital frames extracted from videos. The first method uses the HOG and KLT algorithms for face detection and tracking. The second method uses the Haar cascade and an SVM to detect the face. Template matching is used for face detection in the third method. A convolutional neural network (CNN) is used for emotion classification in CNN HOG-KLT and CNN Haar-SVM. To handle occluded images, a patch-based CNN is introduced for emotion recognition in CNN PATCH. Finally, all three methods are ensembled based on the max rule. The resulting CNN ENSEMBLE achieves 92.07% recognition accuracy on both occluded and nonoccluded facial videos.


Introduction
Human emotions are inevitable in day-to-day interactions; they act as a catalyst for improving communication. Generally, humans use the face, hands, voice, and body gestures to express their feelings. Among these, the human face has been the most prominent and expressive medium for carrying emotions during interactions. Facial emotion recognition is the technology used to reveal information about a person's emotional state or sentiments by analysing facial expressions from both static images and videos. It is a part of affective computing.
Emotions are integral to human communication: smiling to greet and show respect, frowning in confusion, raising the voice during arguments, and so on. They are the best means of nonverbal communication, irrespective of culture, religion, or race. Emotion recognition helps determine how someone feels by obtaining information about their emotional state, so it can also be used for verification and recognition purposes.
In conventional market research, companies use surveys and customer reviews (verbal methods) to understand the demands and needs of customers. The other method is behavioural, in which companies record video feeds of users interacting with a product and manually analyse the video to observe a user's reactions and emotions. This method is useful, but it is time-consuming and tedious, and it raises the overall cost. Market research firms can now easily automate video analysis and detect their users' facial expressions using artificial intelligence (AI)-enabled facial emotion recognition systems. This saves time and labour while also lowering costs, and it lets market research firms scale their data collection efforts.

Benefits of Emotion Detection
1.1.1. Assess Personality Traits in Interviews. Personal interviews are an excellent way to interact with potential candidates and determine whether they are a good fit for the position. However, analysing a candidate's personality in such a short period of time is not always possible, and the many categories of discussion and judgement add to the complexity. Through facial expressions, emotion detection can assess and measure a candidate's emotions. It assists interviewers in understanding a candidate's mood and personality traits. Human resources can use this technology to develop recruiting strategies and policies to get the most out of their employees.
1.1.2. Product Testing and Client Feedback. When customers try a product, emotion detection technology can help the product industry understand their genuine emotions. Companies can set up a product testing session, record it, and then analyse it to detect and assess the facial emotions that emerge during the session. Because AI powers emotion detection, it can evaluate user reactions to new product launches.
1.1.3. Enhances Customer Service. Emotion detection improves the user experience in nearly every industry. This technology enables retailers to create more personalised customer offers by analysing their browsing and purchasing habits. Furthermore, healthcare providers can use facial recognition to create better care plans and deliver services much more quickly.

1.1.4. Psychology and Crime Prediction. Human emotion recognition has applications in psychology. Emotions, which are active most of the time, control our behaviour unconsciously, and they can significantly influence criminal behaviour. According to criminal psychologists [1], there are nine levels of emotional motivation for criminal conduct: bothered, annoyed, indignant, frustrated, infuriated, hostile, wrath, fury, and rage. Using emotion analysis, a psychologist can understand a person's emotions and emotional fluctuations. As a result, emotion recognition can be used to predict crime.

Classes of Emotion.
Emotions are generally classified as positive or negative. The six basic emotions are anger, happiness, fear, disgust, sadness, and surprise. Other reported states include embarrassment, interest, pain, shame, shyness, anticipation, smiling, laughter, sorrow, hunger, and curiosity. Emotions can be discrete or dimensional [2]. According to the Natya Sastra [3], nine basic emotions are identified: love, laughter, sorrow, anger, courage, fear, disgust, surprise, and peace.
If a person is angry, the eyebrows are drawn together and lowered on the inside while the outer ends are pulled outward, in one or both eyebrows. Vertical lines, generally two, appear between the eyebrows; the lower lids are raised; the eyes focus at the centre, staring or bulging; and the lips are tightly pressed together with the corners down or in a square shape, the upper part of the mouth curved, the nostrils possibly dilated, and the lower jaw projecting out. If a person is happy, the corners of the lips are drawn wide and upwards, the lips parted and teeth exposed, or the lips widened with wrinkles running from the outer nose to the outer lip; the portion of the cheeks below the eyes is raised; and the lower lids may show wrinkles or tension, exhibiting crow's feet near the corners of both eyes. Therefore, identifying facial expressions is the key to emotion recognition.
For facial expression recognition [4], the extraction of facial features capturing changes in appearance is achieved by harvesting deep feature semantics. To achieve this, traditional CNNs are optimised using a softmax loss, which penalises misclassified samples and forces the features of different classes to stay apart.
Deep convolutional neural networks (CNNs) otherwise require massive training to produce better accuracy. Due to the limited public databases available for facial expressions, data augmentation mechanisms must be employed. Cropping sample images at varied angles yields images at various positions and scales, which further reduces the sensitivity of the overall system. Therefore, if data are augmented for experimental purposes, the utmost care should be taken to preserve the model's robustness.
Deep CNNs can hierarchically learn features from samples to represent all possible complex variations of input images [5]. However, max pooling layers only consider the first-order statistics of the input features, which limits the learning of deep semantic features. This becomes further complicated when pose variations and/or partial facial occlusions are included. Occlusions and variant poses are two major factors causing significant changes in facial appearance, and removing occluded regions is not practical in real-time video emotion recognition. Real-world occlusion remains a difficult problem for emotion detection research, and a CNN that ignores occlusion and pose variations may produce inaccurate facial features.
In contrast, human intelligence can exploit both local facial regions and the holistic face for better perception of emotions under partial or complete facial occlusion. Due to the dynamism in the variation of local parts such as the eyes, nose, and mouth, the vital issue lies in robustly detecting such information in every keyframe. Directly feeding the keyframes underutilises the prior knowledge hidden in consecutive frames, and hand-crafted facial descriptors are unsuitable for capturing powerful temporal features in facial images.
Several mathematical models capable of processing under adverse conditions have been proposed to address these challenges. Constrained frontal faces can be analysed by facial expression classifiers [6]. Subspace analysis techniques [7] require extensive training and are not suitable. Recognition systems based on local feature representations [8] respond better under face illumination variations; however, occlusion and pose reduce their accuracy. Stacked supervised autoencoders [9] are better at solving the above problems, but they need accurate, occlusion-free training data. The proposed work concentrates on emotion recognition from videos performed via ensemble-based approaches, as approaches such as [6] are less robust in processing video data streams. Though the work achieves dynamic emotion recognition via crucial frame extraction, face recognition, and further image-based emotion detection, the idea of SSPP is not addressed, as it might be computationally expensive and lead to delayed recognition.

Scientific Programming
Therefore, the proposed work uses KLT tracking to track the recognised faces in video accurately. The extraction of visual facial features is crucial for facial recognition, since the colour and shape of faces in video are similar. In this work, HOG features are used to precisely capture facial characteristics such as the directions and edges of the face and facial intensities, which are later fed to the SVM classifier for robust facial recognition. Avoiding a large number of layers and the need for colossal training, the proposed work encompasses nine CNN layers with extensive data augmentation. The primary goal is to analyse the emotions of all the persons in a given video.

Challenges.
A facial expression is representative of a specific emotion, yet it is not easy even for humans to recognise emotions accurately. Studies show that different people recognise different emotions in the same facial expression, and it is even more challenging for AI to differentiate between these emotions.

Technical Challenges.
Emotion recognition faces many challenges. Identifying an object, continuous detection, and incomplete or unpredictable actions are the most widespread technical challenges in implementing emotion recognition.
Face occlusion is one of the main challenges in captured videos and pictures. Another commonly seen challenge is lighting. Identifying facial features and recognising unfinished emotions are also crucial challenges in emotion recognition.

Psychological Challenges.
Psychologists have studied the connection between facial expressions and emotions since the middle of the 19th century. Cultural differences in emotional expression are one of the main challenges. Infants and children express feelings differently than adults, so identifying children's emotions is another challenge.
An ensemble method of image classification is proposed to improve emotion recognition. Ensemble learning aims to assemble diverse models or multiple predictions and boost prediction performance. The ensemble combines numerous learning algorithms to exploit their collective performance, improving on the existing models by merging several of them into one reliable model. For image classification, both occluded and nonoccluded facial videos are used.
Online social networks produce billions of pieces of visual information, which are useful for recognising sentiment. A proverb in many languages goes, "A picture is worth a thousand words," meaning that a single image can convey many ideas more effectively than a spoken description. Visual sentiment analysis of social network content helps us understand user behaviour and provides useful information for related data analysis. The majority of users of social networking platforms prefer images and emoticons to typing long sentences, and the Twitter platform encourages communication between users through short texts or images. The main objective of this work is to classify the sentiment behind messages represented as images in social networks, using state-of-the-art machine learning and deep learning methods. The primary goal of this paper is to examine social media posts in the form of image data to identify user attitudes regarding a particular topic of discussion. Utilising attitudes toward a topic of discussion on social media can help identify and predict sentiments. Additionally, it aids in assessing personality traits in interviews, product testing and client feedback, and enhancing customer service to adapt quickly to constantly changing needs. This paper proposes an ensemble deep learning classification algorithm, namely CNN ENSEMBLE, which combines the outcomes of CNN HOG-KLT, CNN Haar-SVM, and CNN PATCH for the analysis of sentiments. The proposed algorithm performs classification over occluded and nonoccluded social media post images to improve the accuracy of emotion classification.
Therefore, the main contributions of this paper are as follows: (i) three different face detection methods, HOG-KLT, Haar-SVM, and PATCH, are used in parallel for emotion analysis, effectively identifying occluded and nonoccluded faces; (ii) an efficient CNN-based emotion classification model highlights the impact of handling occlusion and nonocclusion in improving the classification of emotions; and (iii) an ensemble deep learning algorithm, named CNN ENSEMBLE, is proposed for performing effective emotion recognition.
These sentiment classification algorithms have been evaluated on the extended CK+ dataset for emotion recognition. The results obtained from this work show that CNN ENSEMBLE emerges as the most accurate model for emotion recognition on occluded and nonoccluded faces.
The rest of this article is organised as follows: Section 2 surveys related work in the literature on emotion classification and compares it with the proposed work. Section 3 explains the methods used and the algorithms proposed in this paper. Section 4 discusses the results obtained from this work and compares them with existing work. Section 5 gives the conclusions drawn from this work and lists some future enhancements.

Related Works
In this section, we survey facial expression recognition, occlusion-aware facial expression recognition, and the techniques used for facial emotion recognition.
2.1. Facial Expression Recognition. Nowadays, distance learning and e-classrooms are part of our lives; in an e-learning scenario, teachers can understand students' engagement in learning by analysing their facial expressions [10]. The Viola-Jones and Haar cascade algorithms are used for object detection and feature extraction, and a CNN is used for expression classification. Facial expressions have been analysed for various applications; however, not all facial regions contribute to expression detection, and a few areas do not change across varied expressions [11, 12]. Determining human behaviour from facial expressions is an excellent application in healthcare, tourism and hospitality, and the retail industry. Sajjad et al. [13] proposed a framework for analysing human behaviour via facial expressions from video. The face is initially detected using the Viola-Jones algorithm and then tracked via KLT. Viola-Jones has various stages, including Haar feature selection, which selects the most important facial features [14, 15], AdaBoost training, and cascading classifiers.
The localisation of the eyes and nose is performed by Haar cascades [16, 17]. The algorithm first identifies the rectangular areas containing the eyes and nose, and the eye centres are then computed. If there is a mismatch in detection, anthropometric statistics are used. Face alignment is achieved using the position of the eyes, since the eyes do not move with expressions. After the nose position is extracted, the mouth region is identified using the nose region as a reference. The curves of the upper lips can be detected using horizontal edge detection techniques, and the position of the eyes is also used for identifying the eyebrow region of interest.
Second, the detected face (registered into the database if not already present) is recognised using the SVM classifier, followed by facial expression recognition using a CNN. The proposed work is constructed along these lines with additional semantic feature extraction using the CNN. Bounding box approaches have been combined with confidence score and class prediction parameters within CNN layers to achieve improved detection accuracy in video surveillance; the proposed work also uses bounding box approaches for improved face detection accuracy.

Occlusion-Aware Facial Expression Recognition.
Occlusion is the primary issue in handling real-time videos. Partial facial occlusion has been widely addressed in the literature; however, real-life occlusion detection is essential for applications in the healthcare and hospitality industries.
In handling occlusions, patch-based approaches [18, 19] have emerged as the state-of-the-art in real time. VGGNet is used to represent the input image as feature maps, which ACNN then decomposes into subfeature maps. This decomposition into multiple subfeature maps results in the identification of local patches. The feature maps are sent to the Gg unit to identify the location of facial occlusion.
Patch-based ACNN (pACNN) performs region decomposition and occlusion perception. Region decomposition uses an exclusive approach [20] to select 24 out of 69 facial landmarks. Using the selected landmarks, the local feature maps are cropped and fed to the respective convolution layers without compromising spatial resolution. After sufficient learning, the feature maps are converted to vector-shaped local features and provided to the attention layer. The attention net determines a scalar weight quantifying the importance of each identified local patch. Global-local-based ACNN (gACNN) takes care of the later stages of processing: it takes the full-face region and extracts the local details of the patches along with their respective global context cues.
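The patch-attention idea above can be illustrated with a minimal NumPy sketch: each vector-shaped local feature receives a scalar weight, and the weighted patch features are concatenated into one local representation. The sigmoid-of-linear-projection attention unit, the feature dimension, and the random features are assumptions for illustration, not the ACNN architecture itself.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_weighted_patches(patch_feats, attn_w, attn_b):
    """For each vector-shaped local feature, an attention unit produces a
    scalar importance weight (here a sigmoid of a linear projection, a
    stand-in for the attention net); occluded patches would be expected
    to receive low weights after training."""
    weights = sigmoid(patch_feats @ attn_w + attn_b)   # one scalar per patch
    weighted = patch_feats * weights[:, None]          # re-weight each patch
    return weighted.ravel(), weights                   # concatenated local rep.

rng = np.random.default_rng(1)
patch_feats = rng.standard_normal((24, 8))   # 24 landmark-centred patch vectors
attn_w = rng.standard_normal(8)              # untrained projection (illustrative)
attn_b = 0.0
local_rep, weights = attention_weighted_patches(patch_feats, attn_w, attn_b)
```

In gACNN, a representation like `local_rep` would then be combined with a global full-face feature vector before classification.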

Datasets for Facial Emotion Recognition. Table 1 summarises publicly available video datasets and the addressed emotion categories.
The proposed work uses the CK+ and ISED databases and proceeds to facial expression detection using deep learning models. Additionally, the proposed work concentrates on extracting emotions belonging to the basic categories.

Machine and Deep Learning Approaches for Facial Emotion Recognition.
Emotions related to e-learning, such as boredom, confusion, contempt, curiosity, disgust, eureka, delight, and frustration, have mainly been identified in recent literature [32-39]. Deep learning models, mainly convolutional neural networks, are used for emotion classification. Different deep learning models such as VGGNet [34, 39] and ResNet [35] have been used for implementation. A variant of CNN, DCFA-CNN [36], was tested with different image datasets and achieved excellent classification results. Yolcu et al. [40] present a deep learning-based system for customer behaviour monitoring applications; the system uses a three-cascade CNN for head pose estimation, facial component segmentation, and expression classification. GoogLeNet and AlexNet, which consist of consecutive CNN layers, are widely used in facial expression recognition [41]. Table 2 presents various classifiers used for facial emotion recognition.

Ensemble Framework for Facial Emotion Recognition
The framework in Figure 1 shows facial emotion recognition from videos using ensemble CNN classifiers. It highlights the ensemble CNN for robust video emotion detection with late multiple-feature fusion using deep semantic facial features. The input used for emotion recognition is a video, which may contain a single person or multiple people; emotion can be identified for both occluded and nonoccluded faces. The input video contains a sequence of frames. Initially, the frames are extracted from the video; these frames may contain faces or nonfaces. The keyframes are then identified using a keyframe extraction method. The idea is to create a model that ensembles the inherent emotional information within the video frames. In this work, three different methods are used for emotion recognition and for improving accuracy, and all the methods are fused based on an ensemble strategy. Before emotion recognition, the first step is identifying the faces in the video frames.
In the first method, the face is detected with the Haar cascade algorithm and tracked using KLT tracking; the detected face image is fed into a CNN for emotion classification. Similarly, in the second method, face detection is achieved using HOG features and an SVM, and the images classified as faces are the input to the CNN emotion classifier. In the third method, template matching is used to detect the faces in the frame, and the emotion in the image is then recognised using a patch-based CNN. This proposed end-to-end trainable Patch-Gated Convolutional Neural Network (PG-CNN) can automatically detect and focus on the most discriminative nonoccluded areas of the face. After emotions are identified by the three distinct methods, ensemble max-rule-based emotion recognition accurately classifies the emotions.
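The max-rule fusion that closes the pipeline can be sketched as follows: per class, keep the highest confidence any base classifier assigned, then predict the top class. The three probability vectors and the six-class ordering are made-up illustrative values, not outputs of the trained networks.

```python
import numpy as np

# Hypothetical per-class probability outputs of the three pipelines
# (CNN HOG-KLT, CNN Haar-SVM, CNN PATCH) for one keyframe; the class
# order (anger, happiness, fear, disgust, sadness, surprise) is an
# assumption for illustration.
p_hog_klt  = np.array([0.10, 0.60, 0.05, 0.05, 0.10, 0.10])
p_haar_svm = np.array([0.20, 0.50, 0.10, 0.05, 0.05, 0.10])
p_patch    = np.array([0.05, 0.70, 0.05, 0.05, 0.05, 0.10])

def ensemble_max(*prob_vectors):
    """Max-rule fusion: take the element-wise maximum of the base
    classifiers' class confidences, then predict the top class."""
    stacked = np.vstack(prob_vectors)
    fused = stacked.max(axis=0)          # per-class maximum confidence
    return fused, int(np.argmax(fused))  # fused scores, predicted label

fused, label = ensemble_max(p_hog_klt, p_haar_svm, p_patch)
```

Here all three base classifiers favour class 1 ("happiness" in the assumed ordering), so the fused prediction is class 1.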

Keyframe Extraction.
Keyframe extraction is used to remove redundant frames, which reduces the dimensionality of the feature vector used for classification. The input video is processed for keyframe extraction, and multiple keyframes are extracted in this module. This work uses the histogram difference method for keyframe extraction: the difference between consecutive frames is calculated and compared against a threshold value. Consider two frames f1 and f2. If any changes or differences are found in f2 relative to f1, then f2 is taken into account; if there are no changes, the next frame is examined, and the process continues until frame fn.
The process has two main phases. In the first phase, the threshold (TD) value is computed using the mean and standard deviation of the histogram of the absolute difference of successive image frames. In the second phase, the threshold (TD) is compared against the histogram difference of consecutive image frames to extract the keyframes.
The video frames are extracted one by one at first. The histogram difference between two successive frames is calculated for each video frame. To determine the threshold point, the mean (M) and standard deviation (SD) of the absolute difference of the histograms are calculated.
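The histogram-difference keyframe selection described above can be sketched in NumPy. The bin count and the `alpha` weighting of the standard deviation in the threshold TD are illustrative assumptions, not values from the paper.

```python
import numpy as np

def keyframes_by_histogram(frames, bins=16, alpha=0.5):
    """Select keyframes where the histogram difference to the previous
    frame exceeds TD = M + alpha * SD, where M and SD are the mean and
    standard deviation of all successive-frame histogram differences."""
    hists = [np.histogram(f, bins=bins, range=(0, 255))[0] for f in frames]
    diffs = np.array([np.abs(hists[i + 1] - hists[i]).sum()
                      for i in range(len(hists) - 1)])
    td = diffs.mean() + alpha * diffs.std()   # threshold TD
    # frame i+1 is a keyframe when its difference from frame i exceeds TD
    return [i + 1 for i, d in enumerate(diffs) if d > td]

frames = [np.full((8, 8), 10, dtype=np.uint8) for _ in range(5)]
frames[3] = np.full((8, 8), 200, dtype=np.uint8)   # abrupt content change
keys = keyframes_by_histogram(frames)
```

With this toy sequence, frame 3 (the sudden change) and frame 4 (the change back) exceed the threshold and are selected as keyframes.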

Face Detection and Tracking.
The Haar cascade and Kanade-Lucas-Tomasi algorithms are used to detect and track faces in the frames extracted from videos. The steps used to compute face detection and tracking are mentioned below.
Step 1: Input the keyframes.
Step 2: Identify the relevant features RLFi using the Haar cascade algorithm, which locates the face. The located feature points need to be reliably trackable.
Step 3: Use the KLT method, which computes the displacement of the tracked points from one keyframe to another. It finds the traceable feature points in the first frame and then follows the detected features in the succeeding frames using the calculated displacement.
The Haar features are applied to determine the facial features, in which the line, edge, and rectangle features are denoted by Lf, Ef, and Rf. The value of a feature VFi is calculated as the sum of the pixel values in the black area minus the sum of the pixel values in the white area. A threshold Ti is set for each feature. Initially, the average sum of each feature is calculated; the difference is then computed and checked against Ti. If the value meets or exceeds Ti, the feature is detected as a relevant feature RLFi. The integral image Im(a, b) gives the sum of the pixel values in an image or a rectangular part of an image, as in equation (2), in which I(a', b') is the intensity of the original image:

Im(a, b) = Σ_{a' ≤ a, b' ≤ b} I(a', b'). (2)
The integral image can be calculated in a single pass using equation (3), in which csum(a, b) is the cumulative row sum:

csum(a, b) = csum(a, b − 1) + I(a, b), Im(a, b) = Im(a − 1, b) + csum(a, b). (3)

In the integral image, csum(a, 1) = 0 and Im(1, b) = 0. After the integral image has been generated, each feature can be calculated in constant time.
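A minimal sketch of the integral image and of a two-rectangle Haar-like feature (black-area sum minus white-area sum); the toy 4 × 4 image is illustrative.

```python
import numpy as np

def integral_image(img):
    """Im(a, b): sum of all pixels above and to the left of (a, b),
    built in a single pass with cumulative row/column sums."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of pixels in the rectangle [r0:r1, c0:c1) recovered from the
    integral image with at most four lookups (constant time)."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

# A two-rectangle (edge) Haar-like feature: dark half minus bright half.
img = np.zeros((4, 4), dtype=np.int64)
img[:, 2:] = 10                       # right half brighter
ii = integral_image(img)
left  = rect_sum(ii, 0, 0, 4, 2)      # "black" region sum
right = rect_sum(ii, 0, 2, 4, 4)      # "white" region sum
feature_value = left - right
```

The strongly negative feature value signals a vertical dark-to-bright edge, the kind of intensity pattern Haar features use to locate eyes, nose, and mouth regions.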
(2) Kanade-Lucas-Tomasi Algorithm. Face tracking is required on the keyframes once faces are detected. Kanade-Lucas-Tomasi (KLT) is an effective feature-based face-tracking algorithm that continuously tracks human faces across the keyframes extracted from videos. The method finds the parameters that minimise the dissimilarity between feature points under a translational motion model. For tracking the face, it finds the traceable feature points in the first keyframe and then follows the detected features in the succeeding keyframes based on the computed displacement value.
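The core KLT displacement step can be sketched as a single Lucas-Kanade iteration under a pure-translation model: solve the 2 × 2 normal equations G d = b for the displacement d. The synthetic Gaussian patch and its sub-pixel shift are illustrative assumptions.

```python
import numpy as np

def lk_displacement(I, J):
    """One Lucas-Kanade step: estimate the translation d = (dx, dy)
    that moves patch I onto patch J, from the spatial gradients of I
    and the temporal difference J - I."""
    Ix = np.gradient(I, axis=1)          # horizontal image gradient
    Iy = np.gradient(I, axis=0)          # vertical image gradient
    It = J - I                           # temporal difference
    G = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    return np.linalg.solve(G, b)         # (dx, dy)

ys, xs = np.mgrid[0:21, 0:21].astype(float)

def gauss_patch(cx, cy):
    """Smooth synthetic 'feature' centred at (cx, cy)."""
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * 4.0 ** 2))

I = gauss_patch(10.0, 10.0)
J = gauss_patch(10.3, 9.8)               # feature moved by (+0.3, -0.2)
dx, dy = lk_displacement(I, J)
```

In the full KLT tracker this step is iterated over an image pyramid; a single iteration already recovers the small sub-pixel shift here to good accuracy.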
Let us assume that initially one of the corner points is (a, b). The coordinates of the new point will be a' = a + v1 and b' = b + v2. The warp function W((a, b); d) = (a + v1, b + v2) is used to calculate the new coordinates, where d = (v1, v2) is the displacement parameter. The alignment error between the two keyframes is minimised with respect to d. Assuming an initial estimate of d is known, the increment ∆d is found by expanding the error in a first-order Taylor series and differentiating with respect to ∆d.

(3) Emotion Recognition by CNN. The first layer that extracts features from the input image is convolutional. Convolution can perform edge detection, blurring, and sharpening operations by applying filters to an image. When the image is too large, the pooling layer is used to reduce the number of parameters: spatial pooling, such as average pooling, reduces the size of each map while retaining important information. The fully connected layer flattens the matrix into a vector and feeds it into a neural network.
During emotion recognition, the convolutional layer recognises features in pixels, the pooling layers make these features more abstract, and the fully connected layer is responsible for the classification of emotions. The first layer is convolutional with a kernel size of 5 × 5 pixels and 16 output channels. The second layer is a max pooling layer with a 2 × 2 kernel size. Image pixels are used directly as input to the convolutional layer for emotion recognition. In emotion classification, one or more 2D matrices are fed into the convolutional layer, and multiple 2D matrices are generated as output using the following equation:

O_k = af(Σ_j (Im_j * Kn_j,k) + C_k).

Each input matrix Im_j is convolved with a corresponding kernel matrix Kn_j,k; the sum of all convolved matrices is computed, and a bias value C_k is added to each element of the resulting matrix. Finally, a nonlinear activation function af is applied to each element of the resulting matrix to produce one output matrix O_k. Each set of kernel matrices represents a local feature extractor that extracts regional features from the input matrices. The learning procedure aims to find the groups of kernel matrices Kn_j,k that extract good discriminative features for emotion recognition. Backpropagation, a neural network connection-weight optimisation algorithm, can train the kernel matrices and biases as shared neuron connection weights.
The pooling layer is used to reduce the feature dimension. It reduces the number of output neurons of the convolutional layer; pooling algorithms combine neighbouring elements of the convolution output matrices. Max pooling is used here: the max pooling layer with a 2 × 2 kernel size chooses the highest value from four adjacent input matrix elements to generate one element of the output matrix. During error backpropagation, the gradient signal must be routed back only to the neurons that contributed to the pooling output. In our CNN model, the ReLU activation function f(x) = max(0, x) is used in the convolutional layer, which significantly improves both learning speed and emotion recognition performance.
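The convolution, ReLU, and 2 × 2 max pooling stages described above can be sketched for a single input map and kernel (bias C_k = 0; the toy kernel and input are chosen for illustration).

```python
import numpy as np

def conv2d_valid(Im, Kn):
    """'Valid' 2-D convolution of one input map Im with kernel Kn, the
    single-channel case of O_k = af(sum_j Im_j * Kn_j,k + C_k)."""
    kh, kw = Kn.shape
    H, W = Im.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    flipped = Kn[::-1, ::-1]            # true convolution flips the kernel
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(Im[r:r + kh, c:c + kw] * flipped)
    return out

def relu(x):
    return np.maximum(0, x)             # f(x) = max(0, x)

def maxpool2x2(x):
    """Pick the highest of each non-overlapping 2x2 neighbourhood."""
    H, W = x.shape
    x = x[:H - H % 2, :W - W % 2]       # drop an odd trailing row/column
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

Im = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 input map
Kn = np.array([[0., 0.], [0., 1.]])             # toy 2x2 kernel, bias = 0
O = relu(conv2d_valid(Im, Kn))                  # convolution + activation
P = maxpool2x2(O)                               # pooled feature map
```

The real model uses 5 × 5 kernels with 16 output channels; the toy kernel here simply copies the input so the pooling behaviour is easy to verify by hand.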
Batch learning is used to accelerate and improve learning speed and accuracy. Instead of updating the connection weights after each backpropagation, we process 128 input samples in a batch and update the whole set with a single update. To further speed up learning, a momentum term combined with weight decay is applied. The weight is updated using the following equation:

ω_i(t + 1) = ω_i(t) − η δE/δω_i + α∆ω_i(t) − ληω_i(t).

The ω_i(t) − η δE/δω_i part is the backpropagation step, where ω_i(t) is the current weight vector, δE/δω_i is the error gradient with respect to the weight vector, and η is the learning rate. The α∆ω_i(t) term is the momentum part, where α is the momentum rate; the momentum update speeds up learning. The ληω_i(t) term is the weight decay part, where λ is the weight decay rate; it slightly shrinks the weight vector towards zero in each learning iteration, which helps stabilise the learning process. The working process of the CNN during emotion recognition is shown in Figure 3: the CNN analyses the features of the face image and detects the type of emotion.

Method 2: CNN

(1) Histogram of Oriented Gradients (HOG) Face Feature Extraction. The HOG feature descriptor is used to emphasise face structures or shapes. The gradient magnitude and gradient angle are used to compute the features in this descriptor, and it outperforms other edge descriptors. It generates histograms for regions of the face image based on the magnitude and direction of the gradient. Using HOG, each face image is first divided into small square cells, and a histogram of oriented gradients is computed for each cell. The result is then normalised using a block-wise pattern, and a descriptor for each cell is output. Figure 4 depicts the flow of HOG feature computation from the input face image.

Seven significant steps are used to compute HOG features from the input face image, as explained below.
Step 1: Input the face image and perform preprocessing. Consider the input face image. Initially, the images are preprocessed to bring the width-to-height ratio to 1 : 2; most preferably, the input face image should be resized to 64 × 128. The resized images are then used for the optimal extraction of features.
Step 2: Compute gradients. The gradient components gradient_x and gradient_y are calculated for each pixel value in the input image, combining the image's magnitude and angle. gradient_x is computed by equation (11) and gradient_y by equation (12), in which R and C represent the row and column of the image matrix A:

gradient_x(R, C) = A(R, C + 1) − A(R, C − 1), (11)
gradient_y(R, C) = A(R + 1, C) − A(R − 1, C). (12)

After the gradient value of each pixel has been computed, the magnitude and angle values are computed by equations (13) and (14):

magnitude(R, C) = √(gradient_x² + gradient_y²), (13)
angle(R, C) = tan⁻¹(gradient_y / gradient_x). (14)
For normalisation, the block matrix is constructed based on equation (16), and each value is divided by the square root of the sum of the squares of the values (k), as per equation (17).
This completes the process of creating HOG features for the image. For the 16 × 16 blocks of the image, the features for the complete image are built by concatenating the block features. The generated features are then given as input to the SVM classifier for detecting faces.
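The gradient, orientation-histogram, and normalisation steps above can be sketched for a single 8 × 8 cell. Nearest-bin voting is used here instead of the interpolated voting of full HOG, and the horizontal-ramp image is illustrative.

```python
import numpy as np

def hog_cell_histogram(cell, bins=9):
    """Unsigned 9-bin orientation histogram for one cell: each pixel
    votes its gradient magnitude into the bin of its gradient angle
    (nearest-bin voting; real HOG interpolates between bins)."""
    gx = np.gradient(cell.astype(float), axis=1)   # gradient_x
    gy = np.gradient(cell.astype(float), axis=0)   # gradient_y
    magnitude = np.hypot(gx, gy)                   # sqrt(gx^2 + gy^2)
    angle = np.degrees(np.arctan2(gy, gx)) % 180   # unsigned orientation
    hist = np.zeros(bins)
    idx = np.minimum((angle / (180 / bins)).astype(int), bins - 1)
    np.add.at(hist, idx.ravel(), magnitude.ravel())
    return hist

def l2_normalize(block, eps=1e-6):
    """Block-wise normalisation: divide each value by the square root
    of the sum of squares."""
    return block / np.sqrt(np.sum(block ** 2) + eps)

cell = np.tile(np.arange(8, dtype=float), (8, 1)) * 10  # horizontal ramp
hist = hog_cell_histogram(cell)
feat = l2_normalize(hist)
```

For this ramp, every pixel has a purely horizontal gradient, so all of the magnitude lands in the 0-degree bin; the normalised block has unit length, as required by the block-wise normalisation step.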
(2) Support Vector Machine (SVM) for Face Recognition. A classic two-class recognition problem is solved using a support vector machine (SVM). It transforms the data using the kernel trick and then finds an optimal boundary between the possible outputs based on the transformation. This work uses SVM for face recognition by interpreting the SVM classifier's output over a representation of facial images cast as a two-class problem. The SVM selects the decision boundary that maximises the margin to the classes' closest data points; the maximum margin classifier, or maximum margin hyperplane, is the decision boundary generated by SVMs. Let P_j be the HOG features and Q_j be the class labels of the training data, where face images are labelled +1 and nonface images −1. The SVM algorithm considers the input pairs (P_j, Q_j) during training and then finds the optimal decision surface with T_n support vectors. The linear decision surface is calculated by equation (18), in which α_j is the coefficient weight, Q_j is the class label of the support vector SV_j, and w is the weighted summation; the computation of w in equation (18) is given by equation (19):

f(A) = w · A + b, (18)
w = Σ_{j=1..T_n} α_j Q_j SV_j. (19)
Here, A ∈ F_n is a facial-image representation vector, where F_n is referred to as face space. Face space can be another feature space or the vectorised original pixel values. The SVM classifier function is calculated by equation (18).
To build a classifier for an image "A", a training set with two classes is fed in: one of nonfacial images and the other of facial images. The SVM algorithm creates a linear decision surface to determine whether the image "A" is a face. The following equation states that image "A" is a face image if the input image A meets the requirement.
If the input image A satisfies the condition given in the following equation, then image "A" is a nonface image.
After the frames with faces have been detected, all the face images are passed to the CNN for emotion recognition.
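The linear decision surface of equations (18)-(19) can be sketched as follows. The support vectors, coefficients, and bias below are illustrative toy values, not trained parameters from the paper:

```python
def svm_decision(A, support_vectors, alphas, labels, bias):
    """Linear SVM decision value: f(A) = sum_j alpha_j * Q_j * (SV_j . A) + b.

    A is a HOG feature vector; a positive value classifies "A" as a face
    (+1), a negative value as a nonface (-1).
    """
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    f = sum(a * q * dot(sv, A)
            for a, q, sv in zip(alphas, labels, support_vectors))
    return f + bias

# Toy example with two support vectors (values are illustrative only).
svs = [[1.0, 0.0], [0.0, 1.0]]
alphas = [0.5, 0.5]
labels = [+1, -1]          # face / nonface support vectors
score = svm_decision([2.0, 0.0], svs, alphas, labels, bias=0.0)
is_face = score > 0        # positive score: classified as a face
```

In practice the α_j, SV_j, and bias come out of SVM training on the labelled HOG vectors; only the sign of the decision value is needed at detection time.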

Emotion Recognition by CNN.
Emotion recognition is carried out by convolution layers with different filter sizes and the pooling layers of the CNN. The flow of detecting emotion from the face with CNN is shown in Figure 5. The working process of CNN has already been discussed in Section 3.2.2.

Method 3: CNN PATCH .
In this method, the face is recognised by a template-based method, and the emotion is detected by analysing patches from the face with a patch-based CNN.

Template-Based Face Detection.
Using the correlation between templates and input images, template matching locates faces with predefined face templates. For instance, a human face can first be decomposed into its eyes, face contour, nose, and mouth, and an edge-detection technique can then create an edge-rich face model. Template matching is a method of searching for and finding a template within a larger image: it determines how similar the input face is to the template (training) images. The presence of full-face features can then be ascertained by analysing the correlation between the input face images and the standard patterns stored for the full-face parts. The input images are examined at various scales to achieve shape and scale invariance. Algorithm 2 explains the process of template-based face detection in keyframes.
Let K_f(x, y) and t(x, y) denote the keyframe and the template image, respectively. During the matching of t and K_f, the correlation value (cv) is calculated using equation (22) and then normalised using equation (23). The correlation threshold (T) is computed by adding the mean to an arbitrary number of standard deviations. If cv exceeds T, that segment is marked as a face.
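A minimal sketch of the normalised correlation test described above, assuming a standard zero-mean normalised correlation for equations (22)-(23) (the flattened patches and the fixed threshold are illustrative, not the paper's exact values):

```python
import math

def norm_correlation(patch, template):
    """Normalised correlation between an image patch and a template.

    Both are flat lists of equal length; the value lies in [-1, 1],
    with 1 meaning a perfect (linear) match.
    """
    n = len(patch)
    mp = sum(patch) / n
    mt = sum(template) / n
    num = sum((p - mp) * (t - mt) for p, t in zip(patch, template))
    den = math.sqrt(sum((p - mp) ** 2 for p in patch) *
                    sum((t - mt) ** 2 for t in template))
    return num / den if den else 0.0

template = [10, 20, 30, 40]
cv_match = norm_correlation([12, 22, 32, 42], template)  # same shape, shifted
cv_flat = norm_correlation([25, 25, 25, 26], template)   # poor match
T = 0.9  # stands in for mean + k standard deviations in the paper
# cv_match exceeds T, so that segment would be marked as a face region.
```

Because the correlation is normalised, a patch that is a brightness-shifted copy of the template still scores 1, which is what makes the threshold T usable across lighting conditions.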

Patch-Based CNN for Emotion Recognition.
For the effective handling of occlusion in face images, a Patch-Gated CNN (PG-CNN) [59] is used in this work. The primary reason to use patches instead of the entire face is to increase the number of training samples for effective and optimal feature-based CNN learning. The second reason is that a traditional CNN must resize faces when using full-face images as input, which significantly reduces the discriminative information; using local patches maintains the native resolution of the original face images, which increases discriminative ability. The framework of patch-based CNN during emotion recognition is shown in Figure 5. This approach uses facial landmarks for region decomposition to generate the input image patches. An end-to-end trainable Patch-Gated Convolutional Neural Network (PG-CNN) [59, 60] automatically perceives the occluded regions of the face and focuses on the most discriminative unoccluded areas. According to the locations of the facial landmarks, PG-CNN divides an intermediate feature map into 24 patches to identify potential regions of interest on the face. A Patch-Gated Unit in PG-CNN then computes a weight from each patch itself and reweights the patch according to its relevance. The working of the Patch-Gated Unit, followed by CNN, during partial facial occlusion is represented in Figure 6. The algorithm for occlusion detection from the patched image is described in Algorithm 3.
The keyframes from the keyframe extraction phase are taken as the input image of the network. The network receives the input and represents it as feature maps. The feature maps of the entire face are then divided

Scientific Programming
Then, depending on the n points discovered, n facial landmark points are located. The informative facial area, consisting of the two eyes, nose, mouth, cheeks, and eyebrows, is then covered by a new computation of m points. The selected patches are then defined as regions centred on each of these points. The following procedure is used in this study to compute the region decomposition for creating feature maps.
Step 1: Detect 68 facial landmark points.
Step 2: Select or re-compute 24 points that cover the informative regions of the face: eyes, nose, mouth, cheeks, and eyebrows.
(2) Occlusion Perception with PG-Unit. The PG-Unit embedded in the PG-CNN automatically perceives the blocked facial patches and pays attention mainly to the unblocked and informative patches. In each patch-specific PG-Unit, the cropped local feature maps are fed to two convolution layers without decreasing the spatial resolution, which is more effective at preserving information when learning region-specific patterns. The final 512 × 6 × 6 feature maps are processed in two branches: the first branch encodes the input feature maps as a vector-shaped local feature, and the second branch is an attention net that estimates a scalar weight denoting the importance of the local patch. The computed weight then weights the local feature.
Each local patch is encoded as a weighted vector of local features by a Patch-Gated Unit (PG-Unit). The PG-Unit computes the weight of each patch with an attention net, considering its obstructed-ness (the extent to which the patch is occluded). Finally, the weighted local features are concatenated and serve as a representation of the occluded face. Three fully connected layers then assign the face to one of the emotional categories. PG-CNN is optimised by minimising the soft-max loss. The following steps identify occlusion with the PG-Unit and perform the subsequent emotion recognition.
Step 1: Input the feature map SFM_i of patch P_i to PG-Unit_i.
Step 2: PG-Unit_i calculates the weighted feature φ_i as per equation (24), the importance (unobstructed-ness) α_i from equation (25), and the feature vector ψ_i by equation (26), in which P̃_i = φ(P_i) is the last feature map ahead of the two branches, ⊗ denotes the product, and (·) denotes the attention-net operations: pooling, convolution, inner products, and a sigmoid activation.
Step 3: The sigmoid activation forces the output α_i into the range [0, 1], where 1 indicates the most salient, unobstructed patch and 0 a completely blocked patch.
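The gating in Steps 1-3 can be sketched as follows. The attention net is collapsed here to a single pre-activation score per patch, so the example captures only the sigmoid gating and reweighting of equations (24)-(25), not the full convolutional branches:

```python
import math

def pg_unit(local_feature, attention_score):
    """Sketch of a PG-Unit: weight a patch's feature vector by its
    unobstructed-ness.

    The sigmoid forces alpha into [0, 1]: 1 means a salient, unblocked
    patch; 0 means a completely occluded one.
    """
    alpha = 1.0 / (1.0 + math.exp(-attention_score))  # in the spirit of equation (25)
    weighted = [alpha * f for f in local_feature]     # in the spirit of equation (24)
    return alpha, weighted

# A visible patch gets a high attention score, an occluded patch a low one.
alpha_vis, feat_vis = pg_unit([1.0, 2.0], attention_score=4.0)
alpha_occ, feat_occ = pg_unit([1.0, 2.0], attention_score=-4.0)
# The occluded patch contributes far less to the concatenated face
# representation than the visible one.
```

In the full PG-CNN, the 24 weighted patch features produced this way are concatenated and passed through the fully connected layers mentioned above.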

Ensemble Max Rule Method for Emotion Recognition: CNN ENSEMBLE .
The idea is to create a model that ensembles the inherent emotional information within the video frames. Two basic approaches are proposed to achieve this purpose: (1) maximum emotion ensembles (MSE) and (2) late multiple-feature fusion (LMFF).
In maximum emotion ensembles, three models are explored: (1) Max. Emotions, (2) Max. Emotion Intensity, and (3) Max. Emotion Sustenance [19, 61]. All three models work on the extracted keyframes of the given video. Maximum Emotions takes the maximum probability related to each emotion across the separated keyframes and reports it as the final emotion. The Maximum Emotion Intensity model measures the intensity of emotions in every keyframe and recommends the most intensified emotion. The Maximum Emotion Sustenance model is more accurate than the other two [19, 61]: it measures the emotion in every keyframe and looks globally for the emotion that occurred repeatedly over the longest sequence of keyframes. Late multiple-feature fusion operates independently in three ways. The first method (Method 1) performs face detection and tracking from the input video and then uses CNN for emotion detection over the face bounding box. The other two methods perform video emotion recognition via image-based approaches: the input sequence is split into multiple keyframes, which are fed to the face recognition module to identify the sample. In Method 2, the corresponding face set from the database is identified, features extracted from the input face are matched with the trained sample features using SVM, and the result is fed to CNN for emotion detection. The last approach (Method 3) uses patch-based ACNN for occlusion-aware emotion recognition. All three emotion recommendations are later fused in the ensembling setting to recommend the emotion at the output. The pseudocode for ensemble classification is stated in Algorithm 4.
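The Max-rule fusion of the three methods' outputs can be sketched as follows. The emotion labels and probability values are illustrative; the source does not publish the per-method score vectors:

```python
def max_rule_ensemble(prob_sets):
    """Max-rule fusion: for each emotion take the maximum probability
    produced by any of the methods, then pick the winning emotion.

    prob_sets is a list of dicts {emotion: probability}, one per method.
    """
    emotions = prob_sets[0].keys()
    fused = {e: max(p[e] for p in prob_sets) for e in emotions}
    return max(fused, key=fused.get), fused

# Illustrative outputs from CNN HOG-KLT, CNN Haar-SVM, and CNN PATCH.
p1 = {"happy": 0.6, "sad": 0.3, "angry": 0.1}
p2 = {"happy": 0.5, "sad": 0.4, "angry": 0.1}
p3 = {"happy": 0.2, "sad": 0.7, "angry": 0.1}  # occlusion-aware method
label, fused = max_rule_ensemble([p1, p2, p3])
# label == "sad": the patch-based method's confident vote wins under the Max rule.
```

The Max rule lets a single confident method (here, the occlusion-aware one) override two weaker votes, which is exactly the behaviour wanted when two of the three methods struggle with an occluded face.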

Experimental Results and Discussion
This section discusses the dataset used in this work and the evaluation results of emotion recognition with the ensemble method.
4.1. Video Emotion Dataset. The primary dataset used in this work is the Extended Cohn-Kanade (CK+) dataset, which contains 593 video sequences from a total of 123 subjects. Of these, 327 videos are labelled with anger, contempt, disgust, fear, happiness, sadness, or surprise. A detailed description of the dataset is tabulated in Table 3. The CK+ database is one of the most widely used facial expression classification databases. We also included some camera-recorded participants' facial expressions, captured without disturbing their natural emotional display. The videos are 1-10 seconds long, and each contains an average of 10-15 keyframes. A total of 1830 videos were taken for the experiment, of which 80% were used for training and 20% for testing.

Keyframe Extraction.
A keyframe extraction approach [61] uses the histogram with deep learning to extract the pertinent keyframes from the video sequence. The keyframe extraction achieves the highest recall and precision values for all the video sequences. In most cases, the highest value of a single metric is insufficient: the precision metric assesses a method's capacity to obtain the most accurate outcomes, and a high precision value indicates more substantial keyframe relevance, but a high precision value can also be obtained by choosing just a few keyframes from a video sequence. The keyframe extraction algorithm therefore depends heavily on both accuracy and speed. If the algorithm is slow, the throughput of the system is affected; it is also necessary that the extracted keyframes be relevant and accurate, since they affect subsequent processes such as object detection, classification, and object description. Precision is defined in equation (27) and Recall in equation (28). The Precision and Recall values achieved using the keyframe extraction method are high, so the model yields unique frames without replicas. The CPU time to extract the keyframes (0.50) was also measured, showing that the extraction speed is good.
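Precision (equation (27)) and Recall (equation (28)) for keyframe extraction reduce to simple set arithmetic over selected and ground-truth-relevant frames. A sketch with hypothetical frame ids:

```python
def precision_recall(selected, relevant):
    """Precision and Recall for keyframe extraction, given sets of
    selected frame ids and ground-truth-relevant frame ids.
    """
    tp = len(selected & relevant)               # correctly selected keyframes
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical frame ids: 4 frames selected, 5 frames truly relevant.
selected = {3, 10, 17, 25}
relevant = {3, 10, 17, 25, 40}
p, r = precision_recall(selected, relevant)
# p == 1.0 (no replicas or irrelevant frames), r == 0.8 (one keyframe missed)
```

This also illustrates the caveat in the text: selecting very few frames can keep precision at 1.0 while recall quietly drops, so both metrics are needed.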

Face Detection.
Facial detection plays a significant role in facial identification and emotion recognition. Detecting faces in images is complicated by the variability across human faces, including pose, expression, position and orientation, skin colour, glasses or facial hair, differences in camera gain, lighting conditions, and image resolution. The strength of this method is that it concentrates computational resources on the area of an image holding a face.
Object detection, one of the computer technologies involved in image processing and computer vision, deals with finding instances of objects such as people, cars, buildings, and trees. The primary goal of face detection algorithms is to determine whether there is a face in the image. In this paper, we employ two face-detection techniques. Face detection allows us to gather the data required for emotion analysis.

Face Detection Using Haar Cascade and KLT
Algorithm. The video may contain a single person or multiple people, and emotion can be identified for both occluded and nonoccluded faces. Initially, the face is detected with the Haar cascade algorithm and then tracked using KLT tracking, which accurately follows the detected face. Sample results of face detection using Haar and KLT are displayed in Figure 7 and tabulated in Table 5.
In Method 1, the keyframes with faces are identified and tracked by Haar cascading and the KLT algorithm. The model accurately detects the faces in the image, but partially occluded or side-angled faces are missed. In our experiment, the model achieved an accuracy of 92.6%.
Face detection using HOG features with an SVM offers more accurate results than Haar-KLT because it detects faces with angle changes, or faces partially covered to some extent.

Emotion Recognition.
After the face has been identified, the emotions are detected using CNN and patch-based CNN. The performance of emotion recognition is discussed below.

Patch-Based CNN for Emotion
Recognition. In this study, CNN is used as the base classifier of PG-CNN because of its straightforward structure and strong object-categorisation performance. The first nine convolution layers are selected as the feature map for region decomposition, after which 24 PG-Units are attached. The model was initialised with a model pretrained on the ImageNet dataset. For each dataset, both the training and test corpora are mixed with occluded images at a ratio of 1 : 1. We adopt a batch-based stochastic gradient descent method to optimise the model. The base learning rate was set to 0.001 and was reduced by the polynomial policy with a gamma of 0.1; the momentum was set to 0.9 and the weight decay to 0.0005. The models were trained on a Titan-X GPU with 12 GB of memory. During the training stage, we set the batch size to 128 and the maximum number of iterations to 50 K. It took about 1.5 days to finish optimising the model.
During region decomposition, the CNN divides the feature maps into multiple subfeature maps. The facial image is aligned by fixing the 68 facial landmarks around the face, and the region is decomposed by splitting the landmarks into 24 patches that span the entire informative area. Patches are then extracted based on the landmark locations on each subject's face. The following procedure is used to choose the facial patches: (i) Sixteen points are picked from the original 68 facial landmarks to cover the eyebrows, eyes, nose, and mouth.
Based on the 512 × 28 × 28 feature maps and the 24 local region centres, a total of 24 local regions are obtained, each of size 512 × 6 × 6.
The inbuilt PG-CNN detects blocked face patches automatically and focuses mainly on the unblocked and informative patches. In each patch-specific PG-Unit, the cropped local feature maps are fed to two convolution layers, the attention layer and the encoding layer. Figure 9 depicts the regional features.
Table 9 shows the performance on both nonoccluded and occluded images during emotion recognition. For both scenarios, the overall accuracy on the seven facial expression categories is evaluated using a 10-fold evaluation.
A 10-fold accuracy evaluation has been performed on the CK+ and ISED datasets with synthetic occlusions. The occlusion sizes are 8 × 8, 16 × 16, and 24 × 24, represented by R8, R16, and R24, respectively; the full image size is 48 × 48. The input images (size 48 × 48) without occlusion achieve a high accuracy of 97.02%. On the same image set, synthetic occlusion was applied at the different scales (R8, R16, R24). The accuracy on occluded images varies with the amount of occlusion, but Table 10 shows that it remains high.
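Applying a synthetic occlusion of a given scale to a test image amounts to masking a square block of pixels. A minimal sketch (the block position, fill value, and blank test image are illustrative assumptions; the paper does not specify them):

```python
def apply_occlusion(image, top, left, size, fill=0):
    """Place a size x size synthetic occluding block (e.g., R8, R16, R24)
    on a grayscale image given as a 2D list; returns a new image.
    """
    out = [row[:] for row in image]
    for r in range(top, min(top + size, len(out))):
        for c in range(left, min(left + size, len(out[0]))):
            out[r][c] = fill
    return out

img = [[255] * 48 for _ in range(48)]                # blank 48 x 48 test image
occ = apply_occlusion(img, top=20, left=20, size=8)  # an R8 occlusion
covered = sum(v == 0 for row in occ for v in row)
# covered == 64: an 8 x 8 block of pixels is blocked.
```

Repeating this with size 16 and 24 yields the R16 and R24 conditions used in Table 10.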

Performance of CNN vs. Patch-Based CNN vs.
Ensemble. Table 9 shows the different sets of experiments conducted for emotion detection from facial expressions. In the first method, CNN HOG-KLT, face detection is performed by the HOG and KLT methods, and CNN classifies the detected frames. Similarly, in CNN Haar-SVM, Haar cascade and SVM techniques are used for face detection in the extracted frames, and again CNN classifies the detected frames. From the confusion matrix, performance measures such as Precision (equation (32)), Recall (equation (33)), Accuracy (equation (34)), and F-measure (equation (35)) are evaluated and tabulated in Tables 14-17.

Conclusion
An ensemble of CNN methods performs robust emotion recognition of faces using multiple facial features. The proposed CNN ENSEMBLE approach is suitable for both a single person and multiple persons in the video. Despite partial occlusions, the proposed work responds much better than previous CNN-based approaches. All the faces with emotions within the keyframes are initially detected using the CNN HOG-KLT, CNN Haar-SVM, and CNN PATCH methods.
After that, the CNN ENSEMBLE method ensembles the detected emotions by the Max rule and achieves a maximum accuracy of 92.07%. In addition, other performance measures such as Precision, Recall, and F-measure also show that ensembling increases the emotion recognition rates. This system can detect emotions under occlusion, although it has to be further improved to handle more partial and complete occlusions. There is also a plan to consider contextual information along with facial images to recognise human emotions; the features extracted from the facial and context regions surrounding a person can then be fused to produce richer labels and classify the emotions in future work.
(i) To propose new techniques for face detection
(ii) To develop a system with high accuracy of face emotion detection
(iii) To propose an ensemble convolution neural network for face emotion classification

(1) Haar Cascade Algorithm. The Haar cascade is used to recognise faces in keyframes. It essentially identifies adjacent rectangular regions in a detection window at a specific location; the calculation entails adding the pixel intensities in each region and subtracting the sums. These features can be expensive to compute for a large image. To overcome this difficulty, integral images are used, which reduce the number of operations compared with working directly on the larger original images. An integral image returns, at any location (a, b), the sum of all pixel values up to and including the current pixel. Instead of computing at each pixel, it creates subrectangles and array references for each subrectangle, and the Haar features are then computed from these. Haar is primarily used to extract three distinct feature types: line, edge, and rectangle features, as represented in Figure 2.
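The integral-image trick described above can be sketched in a few lines. The 3 × 3 test image is illustrative:

```python
def integral_image(A):
    """Integral (summed-area) image: ii[r][c] is the sum of all pixels
    A[i][j] with i <= r and j <= c, so any rectangular Haar feature sum
    needs only four lookups instead of a loop over the rectangle.
    """
    rows, cols = len(A), len(A[0])
    ii = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        run = 0  # running sum of the current row
        for c in range(cols):
            run += A[r][c]
            ii[r][c] = run + (ii[r - 1][c] if r > 0 else 0)
    return ii

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
ii = integral_image(A)
# ii[2][2] == 45: the sum of the whole image in a single lookup.
```

A Haar feature then subtracts the integral-image sums of adjacent rectangles, which is why the cascade stays fast even when scanning many detection windows.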

Figure 1 :
Figure 1: Ensemble framework for facial emotion recognition.

(1) Extract the video frames (f_1 ... f_n)
(2) Find the histogram difference between two adjacent frames (f_j, f_j+1)
(3) Calculate the mean (M) and standard deviation (SD) of the absolute difference
(4) Compute the threshold, TH
(5) Compare the difference (d) with TH: if d > TH, select the frame as a keyframe (f_k); else go to step 2
(6) Continue the process until the end of the video
ALGORITHM 1: Histogram difference for keyframe selection.
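Algorithm 1 can be sketched as follows. The exact threshold rule (mean plus one standard deviation) and the toy histograms are assumptions for illustration; the paper only states that TH is computed from M and SD:

```python
import statistics

def select_keyframes(histograms):
    """Sketch of Algorithm 1: pick keyframes where the absolute histogram
    difference between adjacent frames exceeds mean + one standard
    deviation of all the differences.
    """
    diffs = [sum(abs(a - b) for a, b in zip(h1, h2))
             for h1, h2 in zip(histograms, histograms[1:])]
    th = statistics.mean(diffs) + statistics.stdev(diffs)
    # Frame j+1 is a keyframe when the j-th difference crosses the threshold.
    return [j + 1 for j, d in enumerate(diffs) if d > th]

# Four nearly identical frame histograms and one abrupt scene change.
hists = [[10, 10, 10], [10, 11, 10], [11, 10, 10], [40, 0, 5], [39, 1, 5]]
keys = select_keyframes(hists)
# keys == [3]: only the frame at the abrupt change is kept.
```

Because the threshold adapts to the video's own difference statistics, slow pans produce no keyframes while genuine shot changes do.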
Method 2: CNN Haar-SVM. Method 2, developed inside the ensemble framework, consists of two main steps: (1) face detection and (2) emotion recognition.
3.3.1. Face Detection. Face recognition is a two-step process. Initially, HOG-based normalised face features are extracted. After the normalised feature vectors have been generated, all the features are given as input to the SVM classifier for face recognition.

Figure 3: The process of CNN during emotion recognition.
Figure 4:
Input: Keyframes
Output: Representation of the occluded face
Input the extracted keyframe KF_i as a face image
Generate a feature map (FM) from each keyframe
Return 24 local patches (P_1, P_2, ..., P_24)
For each local patch:
Decompose the feature map into 24 subfeature maps (SFM_1 ... SFM_24)
Encode a weighted vector (wv) of the local feature (lf) by a PG-Unit
The PG-Unit computes the weight with an attention net based on its obstructed-ness
End For
Concatenate the weighted local features
Return the representation of the occluded face
ALGORITHM 3: Occlusion detection from the patched image.

Figure 7 :
Figure 7: Sample result for face detection in method 1.

Figure 8 :
Figure 8: Sample result for face detection in method 2.

Figure 9 :
Figure 9: The sample patches generated from the input face image.

Table 1 :
Facial expression video datasets and emotion categories.
, in which H is called the Hessian matrix.
Δd = H⁻¹ Σ_a [∇I (∂W/∂d)]ᵀ [T(a) − I(W(a; d))]. (8)
3.2.2. Emotion Recognition by CNN. This work uses CNN to achieve high precision in emotion recognition. There are two primary functions of CNN: feature extraction and classification. CNN has multiple layers, each performing a specific transformation. The goal of CNN is to reduce the images so that they are easier to process without losing the features needed for accurate prediction.
This maintains visibility in the image. Normalising the gradients over 16 × 16 blocks compensates for fluctuations in lighting. A 16 × 16 block is formed by joining four 8 × 8 cells. In step 4, a histogram comprising a 9 × 1 matrix is computed for each 8 × 8 cell; a 16 × 16 block therefore contains four 9 × 1 matrices, which are concatenated into a single 36 × 1 matrix.
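The block normalisation described above (and in equations (16)-(17)) is an L2 normalisation of the concatenated 36 × 1 vector. A sketch with a uniform illustrative block:

```python
import math

def l2_normalise(block):
    """Normalise a concatenated 36 x 1 block vector (four 9-bin cell
    histograms) by dividing each value by the root of the sum of
    squares, as in equations (16)-(17).
    """
    k = math.sqrt(sum(v * v for v in block))
    return [v / k for v in block] if k else block

block = [3.0] * 36        # illustrative uniform block histogram
normed = l2_normalise(block)
# The normalised block has unit L2 norm, so uniform lighting changes
# (which scale all bins equally) cancel out.
```

Scaling every bin by the same factor (a global brightness change) leaves the normalised vector unchanged, which is exactly why HOG uses this step.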
are evaluated during keyframe extraction and tabulated in Table 4.
Training Set: Data = {(s_a, r_a) | s_a ∈ R, r_a ∈ {Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral}}
Identify the number of training samples that need replacement, i.e., TS = round(n × SR)
For b = 1 : TS
Randomly pick "z" samples from Data_a
If z is a majority-class sample, then
Generate a neighbourhood of z based on a_i′(b) = a_i(b) − λrσ_i and replace z in Data_a
Else if z is a minority sample, then
Compute m = Round((Imbalanced ratio − 1)/(SR + 1))
Replace m neighbourhoods of z in Data_a
End For
Build base classifier BC_a from Data_a

Table 3 :
Video emotion dataset description.

Table 4 :
Recall, Precision, and CPU time for keyframe extraction.

Table 5 :
Sample result for face detection in method 1 for two different keyframes.
Face Detection Using HOG and SVM. Feature extraction for facial emotion recognition is performed through HOG features: the video is fed to keyframe extraction, and from each keyframe all the information and details about the face (face directions, edges, intensities, and colour) are extracted and saved in a separate file. This information is fed to the SVM classifier for accurate face detection. Sample results of face detection using HOG and SVM are displayed in Figure 8 and tabulated in Table 6. The effectiveness of the face detection model is evaluated based on Precision (equation (29)), Recall (equation (30)), and Accuracy (equation (31)). The Precision and Recall values in Table 7 and the Accuracy in Table 8 show that more accurate face detection is achieved with the HOG-SVM model.

Table 6 :
Sample result for face detection in method 2.

Table 7 :
Precision and Recall for face detection using HOG-SVM and Haar KLT.

Table 8 :
Accuracy for face detection using HOG-SVM and Haar KLT.

Table 10 :
Accuracy of PG-CNN under different amounts of synthetic occlusions.

Table 13 :
Confusion matrix for CNN PATCH .

Table 14 :
Accuracy of the maximum ensemble of different techniques. Bold values show that the accuracy values for all emotions are higher in the CNN ENSEMBLE model compared with the other models.

Table 15 :
The Precision of the maximum ensemble of different techniques. Bold values show that the CNN ENSEMBLE model has higher precision values compared with the other models.

Table 16 :
Recall of the maximum ensemble of different techniques. Bold values show that the CNN ENSEMBLE model has higher recall values compared with the other models.

Table 17 :
F-measure of the maximum ensemble of different techniques. Bold values show that the CNN ENSEMBLE model has higher F-measure values compared with the other models.

Table 18 :
Comparison with existing emotion classification methods. The comparison of existing emotion classification methods with the proposed model is tabulated in Table 18. Based on the comparison, the proposed ensemble CNN (CNN ENSEMBLE) is more suitable for identifying the emotion class.