Fusion of Machine Learning and Privacy Preserving for Secure Facial Expression Recognition

Department of Computer Science & IT, Sarhad University of Science and Information Technology, Peshawar, Pakistan; School of Information and Electronics, Beijing Institute of Technology, Beijing, China; Department of IT & Computer Science, Pak-Austria Fachhochschule: Institute of Applied Sciences and Technology, Haripur, Pakistan; Department of Computer Science, University of Swabi, Swabi, Pakistan; Department of Accounting & Information Systems, College of Business & Economics, Qatar University, Doha, Qatar


Introduction
Facial expressions carry the most important and richest nonverbal emotional information in social communication [1]. People communicate with each other through verbal and nonverbal channels [2]. Nonverbal communication involves facial gestures, eye contact, facial expressions, and paralanguage [3]. According to earlier research, 50 percent of the information in a conversation is conveyed through facial expression, 40 percent through voice, and 8 percent through language. In addition, because of rapid progress in technology, we spend most of our time on electronic devices whose software interfaces are tense, primitive, and nonverbal. Therefore, facial expression recognition should be improved further to enable more natural and intelligent human-machine interaction.
Facial expression recognition is used in various domains such as Intelligent Tutoring Systems (ITS), psychology, human-machine interaction, behavioral science, intelligent transportation, and interactive games [4]. It can help monitor abnormal expressions in crowds at public places in order to prevent crime. It can also help the service industry capture customer feedback promptly, and it can support timely treatment in hospitals by observing the real-time expressions of patients. According to Ekman and Friesen [5], there are six basic expressions: happiness, surprise, disgust, fear, sadness, and anger (some researchers treat the neutral expression as a seventh). These expressions are conveyed among almost all species.
Facial expression recognition has been widely studied. Despite the available research, robust FER is still an open and challenging task [6,7]. Most recognition algorithms do not consider the inter-class variations caused by differences in the facial attributes of the same individual. Hence, expression classification is mostly done using facial expression information together with identity-related information [8,9]. The main drawback is that this harms the overall generalization capability of FER systems, which degrades performance on unseen identities [10]. An efficient FER system plays a vital role in the treatment of patients by observing their variable behavior patterns. A happy expression indicates a healthy and positive mental state, while sad and angry expressions indicate an unhealthy mental state. Mental conditions such as autism or anxiety can be detected from the emotional conflicts of a particular patient. An important application of FER is e-health care: nowadays, almost 0.3 billion people suffer from depression, which can lead to suicidal tendencies if it is not treated promptly and effectively [11]. In general, mental health treatment faces many barriers, such as financial cost, social stigma, and a shortage of accessible options. Normally, clinical staff interview a patient to identify symptoms of depression through verbal and nonverbal indicators, and patients are asked to fill in a questionnaire that measures the severity of depression [12]. An AI-based system for the timely detection of depression symptoms can help overcome the barriers to timely and effective treatment.
In this paper, we use a combination of different techniques to develop a robust model. Initially, we apply different preprocessing techniques to fine-tune the images and remove highly uncorrelated information. Face detection is performed using facial attributes for the following reasons: (1) The human face has a unique structure, and its most important local parts, such as the eyes, mouth, and nose, help us detect the face in an unconstrained environment. So, the partness map (response map) of five different parts is used in the method. (2) The face adheres to spatial arrangements, such as the hair being above the eyes and the lips below the nose. Hence, the faceness score is derived from the response configuration. (3) Face hypotheses are used to estimate more accurate face locations. Our contribution is to introduce special attribute supervision to discover facial part responses. We adapt the Deep Convolutional Generative Adversarial Network (DCGAN) for data augmentation. It provides realistic data augmentation and improves generalization performance in the low-data regime.
For accurate and robust FER, the feature representation of the facial images is the most important step. A considerable amount of research has been done on local and global feature extraction [13]. Fan et al. [14] suggested the MRE-CNN model, which aims to enhance the learning power of the convolutional neural network by considering both local and global features. Li and Deng [15] introduced the DLP-CNN framework, in which the discriminative power of deep features is enhanced by maximizing the interclass scatter while preserving locality closeness. Still, these methods are unable to find the relative relationship between the local features. A face has a definite structure in which every part has a relative relationship with the other parts. To address this issue, we propose a method that is capable of spatial transformation through an action-unit-aware mechanism and thus forwards the most desirable features for dynamic routing between capsules. Finally, the squashing function is used for classification. We also faced the challenge of performing classification when the client and the server are mutually distrustful, without disclosing the private contents of the facial images and without revealing the result to the server.
There are many practical and potential applications, but the main focus is to capture useful and discriminative features. Better feature representations help improve the overall efficiency of the system. An appropriate, flexible, and effective facial expression recognition system will benefit the industry.
There are many standard cryptographic techniques, such as secure multiparty computation and homomorphic encryption, but they are computationally far too expensive. Thus, we provide a practical solution to the aforementioned problems. We assess the effectiveness and performance of the introduced model on the Extended Cohn-Kanade (CK+), MMI, Oulu-CASIA, and Real-world Affective Faces (RAF) databases. Figure 1 shows some sample images from the CK+ database. The main contributions of this paper are as follows: (1) We propose a network that is capable of finding the active relationship between the features from different local regions. Spatial information is also introduced through prior knowledge of the probability of an object's existence. (2) We implement a simple FER system without using cryptographic techniques of high computational complexity. (3) At the same time, we achieve the same classification accuracy as a conventional (non-privacy-preserving) algorithm.
The remaining sections are organized as follows. In Section 2, we describe the problems with the existing methods. In Section 3, we elaborate on our novel architecture and the underlying information. Section 4 presents the results and analysis. Finally, we give the conclusion of our research and explain the direction for future work in the last section.

Related Work
The main goal of FER is to capture meaningful features that are discriminative, descriptive, and invariant to facial variations such as occlusion, illumination, pose, and other identity-related details. There are two main approaches to feature extraction: (1) handcrafted methods and (2) deep-learning-based methods. Nowadays, deep learning methods achieve remarkable results. Earlier, however, facial expression recognition was mostly based on handcrafted (human-engineered) features such as Histograms of Oriented Gradients (HOG) [16], the n-dimensional scale-invariant feature transform (n-SIFT) [17], and Local Phase Quantization (LPQ) [18]. These methods are used to extract global as well as detailed information from an individual face. However, the information is obtained from the overall facial region and ignores the expression changes in the local regions, which contain the eyes, nose, and mouth. These methods perform well in a lab-controlled environment where subjects pose expressions under constant illumination, stable eye gaze, and limited head pose movement. Existing handcrafted approaches show comparatively lower recognition accuracy, and considerable effort goes into manually extracting the discriminative features that are linked to expression changes. For in-the-wild scenarios, deep learning methods have been applied to make facial expression recognition more robust [19][20][21][22]. However, the deep representation is affected because the facial attributes of a particular subject carry a large number of variations, such as the gender, ethnicity, and age of the person posing the expression. This is a major disadvantage: the generalization capability of the model is strongly and negatively affected, and the performance of facial expression recognition degrades on unseen subjects.
Although quite a lot of work has been done toward improving the performance of FER, alleviating the influence of inter-subject variations is still a challenge and an open area of research.
Several techniques have been implemented that reduce intra-class variations and increase interclass differences, which increases the discriminative power of the features extracted for FER in real-time scenarios [23]. The Identity-Aware CNN (IACNN) showed that facial expression recognition performance can be enhanced by reducing the influence of identity-related information through expression-sensitive and identity-sensitive contrastive losses [24]. The island loss has been proposed for extracting effective discriminative features for FER [25]. Moreover, in [26], a person-independent expression representation is learned with residue learning. However, this technique is computationally costly, and because the same intermediate representation is used to generate neutral images for the same identities, it is also unable to disentangle the expression information from the identity information. In [24], the effectiveness of the contrastive loss is heavily affected by the large data expansion caused by compiling the training data in the form of image pairs [25]. Similarly, in [27], facial expressions are transferred to a fixed identity to remove the influence of identity-related information. The problem with such methods is that the efficiency of FER depends on the expression transfer procedure. In short, FER based on deep learning has outperformed the traditional handcrafted methods. However, there is still a gap in deep learning because very few studies have used facial depth images as the input to deep networks. Compared with the existing models, our main goal is to design a network that can be fully adopted for the decomposition of the facial region, is easy to implement, and is robust.
Different researchers have implemented different methods to ensure privacy. In [28], privacy-preserving data classification was done using Principal Component Analysis (PCA) for feature extraction and the nearest neighbor rule for classification. However, it failed to perform in the presence of nonlinear facial variations. Fisher Linear Discriminant Analysis (FLDA) was proposed in [29], and it had a lower error rate than PCA. However, it did not work well for maintaining the privacy of the discriminative features of a specific class in the multimodal case. Hence, LFDA was proposed to overcome the deficiencies of FLDA. The work in [30] meets the privacy requirements by hiding the test image and obtaining results using Paillier homomorphic encryption [31]. In [32], the authors proposed EPOM, which achieves secure integer number processing without leaking private data to unauthorized parties. In [33], it was proposed that subprotocols can dramatically reduce the number of messages exchanged during the iterative approximation process based on the coordinate rotation digital computer algorithm. Because of the large keys used for encryption and decryption, this approach involves computationally intensive operations such as a large number of exponentiations. It also supports only a limited number of operations during the classification of data, which makes the client and the server communicate even more with each other. In contrast, our proposed method achieves the true recognition rate even in the presence of the privacy protocol, which uses randomization and supports intensive multiplication and addition.

Preprocessing.
Preprocessing is very important because it aims to capture the meaningful features and to align and normalize the most needed visual information conveyed by the facial image. Every real-time image is affected by nonlinear facial variations, i.e., varying illumination, differences in contrast between the foreground and background, and irrelevant head poses. Therefore, to preserve as much of the semantic meaning of the features as possible for training the deep neural network, we need to apply some preprocessing techniques. This step eliminates highly uncorrelated data in the image.

Face Detection.
Face detection is one of the vital steps in FER because of excessive background: there is still highly uncorrelated information in the image, even in images taken from benchmark datasets. Most of the datasets contain almost frontal, high-resolution images, so the Viola and Jones algorithm [34] is used in most scenarios. In this paper, Faceness-Net is used. A full image is provided as the input to a convolutional neural network, which generates a partness map. The partness map is generated for different facial parts such as the eyes, nose, and mouth. Facial attributes are further categorized to distinguish each part from the others; for example, hair can be blond, black, wavy, straight, etc. In the next stage, the face proposals are refined further, so that the usefulness of facial attributes is exploited for learning an optimized and robust face detector. A CNN is trained on uncropped images and is used to obtain face part detectors without any explicit part supervision. The faceness score is evaluated from the face part responses while considering the spatial arrangements associated with them. After the face proposals are generated, a strong face detector is trained, and it outperforms all other methods.
In Figure 2, the face is divided into five important parts, where eyes, nose, and hair are much more effective as compared to mouth and beard, which can be partially occluded.
Therefore, the combination of facial parts gives much better results compared to individual facial parts.

Data Augmentation.
When a deep neural network is chosen for FER, data augmentation is used to produce better results by providing a large amount of data. It improves the generalization capability of the model, since many of the publicly available datasets are not large enough to validate the results reliably. Large training data yield a well-trained model.
There are some standard methods of data augmentation, such as skewing, rotating, shifting, changing the color scheme, resizing the image, and adding image noise [22]. To learn the augmented data automatically in the low-data setting, we use a Deep Convolutional Generative Adversarial Network (DCGAN). It alleviates the overfitting problem on the on-the-fly data. The input samples are randomly cropped from all four sides, and a horizontal flip is then applied, which makes the dataset ten times bigger than the original one.
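The crop-and-flip step can be illustrated with a short sketch. The text does not specify the crop positions, so the code below assumes the common ten-view scheme (four corner crops plus a centre crop, each with its horizontal mirror); the function name `ten_crop` and the crop size are hypothetical, and the DCGAN part is not shown.

```python
import numpy as np

def ten_crop(image, crop):
    """Return 10 views of `image`: 4 corner crops, 1 centre crop, and their mirrors.

    Assumption: the corner/centre layout is illustrative; the paper only
    states that crops are taken from all four sides and then flipped.
    """
    h, w = image.shape[:2]
    if crop > h or crop > w:
        raise ValueError("crop size must not exceed the image size")
    top, left = (h - crop) // 2, (w - crop) // 2
    # (row, column) offsets of the five crop windows
    offsets = [(0, 0), (0, w - crop), (h - crop, 0), (h - crop, w - crop), (top, left)]
    crops = [image[y:y + crop, x:x + crop] for y, x in offsets]
    # Horizontal flip: reverse the column axis of every crop
    flipped = [c[:, ::-1] for c in crops]
    return crops + flipped

# Usage: a dummy 48 x 48 grayscale face yields ten 40 x 40 views.
views = ten_crop(np.arange(48 * 48).reshape(48, 48), crop=40)
```

Each original image contributes ten training views under this assumption, which matches the tenfold growth described above.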

Dual Enhanced Capsule Network (DE-Capsnet).
The entire network is shown in Figure 3, where the model is divided into portions. First, we preprocess the images to remove the uncorrelated information linked to the facial image. Then, two modules perform the further processing. The first part, the box with the purple dashed line, is an action-unit-aware attention module consisting of deep convolutional layers that extract enhanced feature maps; it is termed enhancement module 1. In the second part, those enhanced feature maps are encoded between capsules using dynamic routing, and decoding is done by the fully connected layers (this process is shown with the green dashed lines). At the end, the squashing function is used for the recognition of facial expressions.
VGG19 is used in enhancement module 1 because it is robust in object classification while having a simple architecture. Each stage consists of multiple convolutional layers followed by a max-pooling layer. Each of the first 2 stages has 2 convolutional layers, whereas each of the last 3 stages has 3 convolutional layers. We do not retain the last 3 layers because we need the feature maps.
To obtain the attention map, we use the generation method of Li et al. [35]. Furthermore, we have made appropriate adjustments to the datasets used in our work to obtain the key facial landmarks. Figure 3 shows the facial image with blue facial landmarks along with the attention map. The action unit centres are obtained from the key facial points by using a scaled distance. To make sure that the scale is the same for all facial images, the images are resized. To make the shifting distance as adaptive as possible across images, a measurement reference is used to express the shift. To locate the action unit centres, the inner-corner distance is used as the scaled distance. For each action unit, the 7 pixels in the surrounding area are taken in the experiments; as a result, the size of each action unit area is 15 × 15. A higher weight, H_w, is assigned to the points closest to the action unit centre.
The Manhattan distance to the action unit centre is denoted m_d. Hence, the areas that have higher values in the attention map correspond to the active areas of action units in facial images, and the attention map enhances them further.
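The construction of such an attention map can be sketched as follows. This is a minimal illustration, not the exact formula of [35]: it assumes the weight falls off linearly with the Manhattan distance m_d inside each 15 × 15 area, and the decay rate (`decay=0.05`) and the function name are hypothetical choices made only for this example.

```python
import numpy as np

def au_attention_map(size, centres, radius=7, decay=0.05):
    """Build a (size x size) attention map from action unit centres.

    Assumptions: linear decay of the weight with Manhattan distance and
    decay=0.05 are illustrative; radius=7 gives the 15 x 15 area from the text.
    """
    att = np.zeros((size, size))
    ys, xs = np.mgrid[0:size, 0:size]
    for cy, cx in centres:
        m_d = np.abs(ys - cy) + np.abs(xs - cx)          # Manhattan distance
        in_area = (np.abs(ys - cy) <= radius) & (np.abs(xs - cx) <= radius)
        weight = np.clip(1.0 - decay * m_d, 0.0, 1.0)    # highest at the centre
        # Overlapping action unit areas keep the larger weight
        att = np.maximum(att, np.where(in_area, weight, 0.0))
    return att

# Usage: one hypothetical action unit centre in a 28 x 28 map.
att = au_attention_map(28, [(14, 14)])
```

The centre pixel receives weight 1 and the weights shrink towards the border of the area, so multiplying feature maps by this map emphasises the regions around the action units.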
After the attention maps are generated, they are forwarded to stage 3 and stage 4, as shown in Figure 3. The feature maps generated after the pooling layer of the second stage are multiplied with the attention map of the first stage, in parallel with the convolutional layers of the third stage. The results obtained after the convolutions are added element by element and then forwarded as input to the max-pooling layer of the current stage. A similar operation is done at the fourth stage by jointly combining the convolutional layers with the attention map. The reason for using attention maps is that not all areas are equally important for facial expression recognition.
After enhancement module 1, we obtain 512 × 7 × 7 feature maps. For dynamic routing, the feature maps are fed between the primary capsule layers and the face capsule layers. Three fully connected layers are used for decoding and reconstructing the facial image. The nonlinear squashing function is used for facial expression recognition; it is defined in equation (2), where k indexes the capsule and u_k and j_k are the output and input vectors, respectively. L_m is the margin loss to be minimized, and L_r is the reconstruction loss; both are used for updating the parameters of the network. The total loss is denoted L_t. The loss functions are defined in equations (3)-(5), respectively.
Here, cc denotes the classification category, and I_cc is the indicator function for that category. The upper and lower boundaries are represented by b+ and b−, respectively. f represents the original image, whereas f_c represents the reconstructed image. This classification involves a single party, which performs both the training and the testing phase. In contrast, we propose a method in which the server is in charge of training, and testing is done by both parties collaboratively.
Figure 3: Overview of the proposed method.
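The squashing function and the three losses can be sketched in NumPy using the symbols above. The forms below are the ones commonly used for capsule networks and are an assumption about equations (2)-(5), which are not reproduced in this text; the down-weighting factor `lam` and the reconstruction weight `alpha` are hypothetical parameters that do not appear in the description, while b+ = 0.9 and b− = 0.1 are the values reported in the implementation details.

```python
import numpy as np

def squash(j, eps=1e-9):
    """Squash input vector(s) j so the output length lies in [0, 1)."""
    sq = np.sum(j * j, axis=-1, keepdims=True)            # squared length
    return (sq / (1.0 + sq)) * j / np.sqrt(sq + eps)

def margin_loss(u, label, b_plus=0.9, b_minus=0.1, lam=0.5):
    """L_m over class capsules u of shape (num_classes, dim).

    Assumption: lam down-weights absent classes; it is not given in the text.
    """
    lengths = np.linalg.norm(u, axis=-1)
    indicator = np.zeros_like(lengths)                     # I_cc
    indicator[label] = 1.0
    present = indicator * np.maximum(0.0, b_plus - lengths) ** 2
    absent = lam * (1.0 - indicator) * np.maximum(0.0, lengths - b_minus) ** 2
    return float(np.sum(present + absent))

def total_loss(u, label, f, f_c, alpha=0.0005):
    """L_t = L_m + alpha * L_r, with L_r the squared reconstruction error.

    Assumption: alpha is an illustrative weighting, not taken from the text.
    """
    l_r = float(np.sum((f - f_c) ** 2))
    return margin_loss(u, label) + alpha * l_r
```

Because the squashed length of a capsule behaves like a probability, the predicted expression is simply the class capsule with the largest length.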

Information Security.
A security algorithm is information-secure in the sense that its security derives purely from information theory. The idea of information-secure communication was initiated by the applied scientist Shannon, who further showed that the one-time pad system achieves perfect security subject to the following two conditions [36]: (1) the key that randomizes the information must be random and must be used only once; (2) the key must be as long as the information. If a scheme randomizes its parameters and the above conditions are satisfied, it is hard to unmask the parameters even for an adversary with exceptional computational power. For example, if the random space is the same as the message space and is equal to 1024 bits, then the prior and posterior probabilities are the same, i.e., observing the randomized value gives no advantage over the prior probability.
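The two conditions can be made concrete with a small sketch of an additive one-time pad over 1024-bit integers. The helper names (`mask`, `unmask`, `explaining_key`) are hypothetical and chosen for illustration; the point is that every candidate message is explained by exactly one key, so seeing the masked value does not change the adversary's prior.

```python
import secrets

BITS = 1024
MOD = 1 << BITS  # message space and key space have the same size

def mask(message, key):
    """Randomize a message with a fresh, uniformly random key."""
    return (message + key) % MOD

def unmask(cipher, key):
    """Recover the message; only the key holder can do this."""
    return (cipher - key) % MOD

def explaining_key(cipher, candidate):
    """The unique key under which `candidate` would produce `cipher`."""
    return (cipher - candidate) % MOD

# Usage: the key is drawn once and never reused.
key = secrets.randbelow(MOD)
cipher = mask(123456789, key)
assert unmask(cipher, key) == 123456789
```

Since `explaining_key` returns a valid key for any candidate message, all messages remain equally consistent with the observed value, which is the sense in which the prior and posterior probabilities coincide.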

Privacy Preserving Security.
The main aim is to ensure secure operation between the client and the server. Both of them want to communicate with each other, and for that purpose they need to compute $u^{T}v + p$, where u is a vector known to the client, v is a vector known to the server, and p is a scalar. However, only the client will learn the outcome of $u^{T}v + p$.
The client's input u is composed of integers, and the server's input v is composed of floating-point numbers. Since we work with integer random numbers, the elements of the vector v are converted into integers by scaling them and rounding to the nearest integer. We use a large scaling factor s and compute $u^{T}(sv) + sp$. First, the client adds random numbers to its own vector; the server then performs a few operations and returns the result to the client. The operation is made valid by first scaling the scalar p and the vector v by the factor s and then dividing the outcome by that scaling factor. As a result, the server learns nothing about the client's input, and the same holds for the client: it learns only the result, without any knowledge of the server's vector and scalar. Hence, the above process is a two-party protocol that is completely information secure. Figure 4 demonstrates that any unknown input face image of any identity can be applied to synthesize a realistic equivalent face image of any other image.
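The scaling step described above can be sketched on its own. The code below shows only the fixed-point conversion and the final division by s, computed in one place for clarity; it does not implement the randomization or the message exchange of the two-party protocol, and the function name and the value of `s` are illustrative assumptions.

```python
import numpy as np

def scaled_inner_product(u, v, p, s=10**6):
    """Approximate u^T v + p using integer arithmetic only.

    Assumption: s = 10**6 is an illustrative scaling factor.
    """
    u_int = np.asarray(u, dtype=np.int64)                  # client input: integers
    v_int = np.rint(s * np.asarray(v, dtype=float)).astype(np.int64)  # scaled server vector
    p_int = int(round(s * p))                              # scaled server scalar
    total = int(np.dot(u_int, v_int)) + p_int              # u^T (s v) + s p
    return total / s                                       # undo the scaling

# Usage: matches the floating-point result up to rounding at 1/s.
result = scaled_inner_product([3, -2, 5], [0.25, 1.5, -0.125], 0.75)
```

A larger s reduces the rounding error introduced by converting v and p to integers, at the cost of larger intermediate values.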

Facial Expression Recognition Based on Privacy Preserving.
The first step of every procedure is to state the basic requirements and then fulfill them accordingly. The 3 requirements met by this process are as follows: Requirement 1: no sophisticated public-key encryption system is used. Requirement 2: the client's input sample and the classification result are hidden from the server. Requirement 3: the server's classification parameters are hidden, and the client remains unaware of the mean of the database.
To explain this, let us break down the traditional assessment phase into four steps as follows:
Step 1. Find the test image difference. The difference test image is obtained by subtracting b, the mean of the database, from the testing image test. At the start, the client cannot send the test image because of privacy concerns. Therefore, the client sends only the image masked with a noise vector $\tilde{n} \in \mathbb{Z}^{n \times 1}$ that has the same size as the test image. Since the server receives only the noise-masked vector, it obtains no information about the test image vector, and the difference it computes is a difference noise vector. The difference between the test image and the noise image is known only to the client.
Step 2. Project the difference to a lower dimension. Here, B is the transformation matrix, and a is the low-dimensional vector corresponding to test. The server, however, has to project the noise image to obtain its low-dimensional vector.
Step 3. Euclidean distance calculation. Here, $j_i$ is the i-th (low-dimensional) training image and $Ed_i$ is the corresponding Euclidean distance, for $i = 1, 2, \ldots, N$.
Step 4. Find the minimum distance to match the test image against the known set. The matching training image is denoted by $m_i$. It is hard for the server to calculate the original distance $Ed_i$ because the server knows neither the r vector nor the test image. So, to obtain the matching image, the server sends all the $Ed_i$ to the client, each masked with a random number $r_i$. The client can then calculate the actual distances. It must be ensured that only the client knows $m_i$; if the server learned it, the server could calculate the Euclidean distances between the provided test images and the training images. The server would then be able to find the expression corresponding to the test image, which would ultimately affect privacy.
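For orientation, the four steps of the traditional, non-private assessment phase can be sketched as plain NumPy. This is the baseline that the protocol protects, not the privacy-preserving version: the noise masking and the random numbers r_i are omitted, and the projection a = B^T (test − b) and the squared Euclidean distance are assumptions about the omitted equations. The function name `classify_plain` is hypothetical.

```python
import numpy as np

def classify_plain(test, b, B, J, labels):
    """Nearest-neighbour matching in the projected space (no privacy).

    test: (n,) test image, b: (n,) database mean, B: (n, d) transformation
    matrix, J: (N, d) low-dimensional training images, labels: N expressions.
    """
    diff = test - b                          # Step 1: difference test image
    a = B.T @ diff                           # Step 2: low-dimensional vector
    Ed = np.sum((J - a) ** 2, axis=1)        # Step 3: distances Ed_i, i = 1..N
    m = int(np.argmin(Ed))                   # Step 4: index of the best match
    return labels[m], Ed

# Usage with toy data: 4-pixel "images" projected onto 2 dimensions.
B = np.eye(4)[:, :2]
J = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
label, Ed = classify_plain(np.array([0.9, 1.2, 7.0, 7.0]), np.zeros(4), B, J,
                           ["anger", "happiness", "surprise"])
```

In the privacy-preserving variant, the server never sees `test` or the true `Ed`, and the final arg-min in Step 4 is evaluated by the client.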

Privacy Analysis.
In this part of the research, we examine whether our method is susceptible to any privacy leakage. Our method is based on computation by both parties, and therefore the only possible source of privacy leakage is the interaction between them. To prove that our method does not leak unwanted information to the client or the server, Goldreich's privacy definition is used [37]:

Definition of Secured Privacy for Both Parties' Computation.
The protocol we use for security should not disclose hidden information to a (semihonest) third party, except for the information that can be inferred from the inputs and outputs of those parties.
Our primary purpose is to verify whether the proposed two-party computation satisfies this definition of privacy. The four steps above state clearly what the inputs and outputs of the client and the server are. Therefore, we have to make sure that neither of them can infer anything beyond its known inputs and outputs; if so, the proposed method preserves privacy. The client's goal is to keep the server unaware of the test image and of the classification result. The server, on the other side, has to keep the classification parameters away from the client. The client initially shares only the noise image instead of the true (original) image; however, both images have the same size. So, the server learns the size, which is not a privacy leakage. In return, the server shares the randomized Euclidean distances obtained with the help of random integers. Hence, information-theoretic security is achieved.

Results and Discussion
We have used four of the most popular databases to produce the results: CK+ [38], MMI [39], Oulu-CASIA [40], and RAF [23]. RAF covers large-pose and real-world expressions, which the first three do not. So, to check the robustness of our method on large-pose expressions, we have used the RAF database.

Description of Databases.
The Extended Cohn-Kanade (CK+) database is the most widely used and most popular database in facial expression recognition. It contains 593 video sequences, which vary from 10 to 60 frames and shift from the neutral expression to other expressions. A total of 123 subjects performed the different expressions; their ages range from 18 to 30 years, and most of them are female. Of these sequences, 327 are categorized into seven expressions. The main reason why algorithms are not evaluated uniformly on CK+ is that it does not provide specific training, validation, and test sets. The MMI database is laboratory-controlled; 75 subjects performed 2900 expressions, recorded as both video sequences and high-resolution static images, of which 326 video sequences are obtained from 32 subjects. The MMI database differs from CK+ in that it uses onset, apex, and offset phases: each sequence starts with the neutral expression, reaches the peak, and then returns to the neutral expression. This database has very challenging conditions, i.e., large interpersonal variations; the subjects perform different nonuniform expressions and have accessories or facial hair such as glasses and mustaches.
The Oulu-CASIA database consists of 2880 images from 80 subjects for six expressions; most of them are males aged between 23 and 58 years. This database is specially designed to tackle the problem of illumination due to environmental changes. It consists of two different imaging systems; the first one is Near Infrared (NIR), whereas the second one is Visible Light (VIS).
There are 3 different illumination scenarios: the first one is normal indoor illumination; the second one is weak illumination, where only the computer display is on; and the third one is dark illumination, where all the lights are off.
The Real-world Affective Faces Database is used, which consists of 29672 highly diverse real-world facial images.
These images were downloaded from the Internet and labeled through crowdsourcing; 40 annotators independently labeled each image. This database covers large variability in the subjects' gender, age, and ethnicity, as well as in lighting conditions, head pose, eye gaze, occlusions, and post-processing operations, which helps us validate our network on versatile data.

Implementation Details.
The facial image is first preprocessed using face detection, data augmentation, and illumination normalization to fine-tune the image. The highly uncorrelated data are removed so that further processing yields a high-quality result. Then, landmark detection is used to identify the key facial points. After that, VGG19 is used as the backbone of the network, and feature maps of 512 × 7 × 7 are obtained after the 1st enhancement module. Then, 256 × 6 × 6 feature maps are obtained from 2 × 2 convolutional kernels with a stride of one; those feature maps are forwarded to the primary capsule layers with 8D capsules and 32 convolutional layers. Three routing iterations are then executed between the primary capsule layers and the face capsule layers. Every expression has a 16D capsule, and all the lower capsules forward information to the capsules above. Then, with 3 fully connected layers, we use the squashing function for classification. The Adam optimizer is used with a learning rate of 0.0001. The value of b+ is 0.9, and b− is 0.1. Furthermore, the batch size is set to 16, and the maximum number of iterations is set to 300. The whole network is trained end to end.
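The routing stage can be illustrated with a compact NumPy sketch of routing by agreement, using the dimensions given above. Several details are assumptions for the purpose of the example: the 256 × 6 × 6 maps are read as 32 × 6 × 6 primary capsules of dimension 8, seven expression classes are used, and the transformation weights `W` are random rather than learned.

```python
import numpy as np

def squash(j, eps=1e-9):
    """Shrink vectors so their length lies in [0, 1)."""
    sq = np.sum(j * j, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * j / np.sqrt(sq + eps)

def dynamic_routing(u_hat, iterations=3):
    """Route prediction vectors u_hat (n_primary, n_classes, dim) to class capsules."""
    n_in, n_out, _ = u_hat.shape
    logits = np.zeros((n_in, n_out))
    for _ in range(iterations):
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)               # coupling coefficients
        s = np.sum(c[:, :, None] * u_hat, axis=0)          # weighted sum per class
        v = squash(s)                                      # class capsule outputs
        logits = logits + np.sum(u_hat * v[None], axis=-1) # agreement update
    return v

# A 2 x 2 kernel with stride 1 maps a 7 x 7 input to 6 x 6.
conv_out = (7 - 2) // 1 + 1

rng = np.random.default_rng(0)
n_primary = 32 * conv_out * conv_out                       # assumed: 1152 capsules of 8D
primary = rng.normal(size=(n_primary, 8))
W = rng.normal(scale=0.05, size=(n_primary, 7, 16, 8))     # random stand-in weights
u_hat = np.einsum("ijdk,ik->ijd", W, primary)              # predictions for 16D capsules
face_caps = dynamic_routing(u_hat, iterations=3)
```

The length of each row of `face_caps` plays the role of the class score, so the predicted expression would be the row with the largest norm.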
In the Extended Cohn-Kanade database, we take the last one to three frames of each sequence and use the first frame as the neutral expression for data selection. The subjects are divided into 10 groups, and a 10-fold cross-validation is performed. Table 1 shows the average accuracy rates compared with other existing state-of-the-art methods. Our image-based method achieves the highest accuracy of 98.95 percent, even against sequence-based techniques that extract features from a sequence of images or videos.
For the MMI database, we take three frames from the middle of each sequence, where the peak expression occurs, and build a dataset of 624 images. Data augmentation is then performed, and the data are distributed among 10 sets. For experimentation, person-independent 10-fold cross-validation is performed using the first frame (the neutral expression) and the three peak frames from every frontal sequence. Table 2 shows that our method achieves a higher average accuracy than the other existing methods.
For the Oulu-CASIA database, we use the last three frames of every sequence for training and testing. As with CK+, 10-fold cross-validation is performed, with the folds split by subject so that each fold is completely disjoint from the others. Table 3 shows the average accuracy rates; our method outperforms the compared methods and achieves the highest accuracy of 91.2 percent.
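The two frame-selection rules described above (middle frames for MMI, last frames for Oulu-CASIA) can be sketched as below. The helper names and the handling of sequences shorter than three frames are assumptions; the text does not describe these edge cases.

```python
def middle_frames(frames, k=3):
    """Return k consecutive frames centred on the middle of the sequence.

    Suits sequences that run neutral -> peak -> neutral, as described for MMI.
    """
    if len(frames) <= k:
        return list(frames)
    start = (len(frames) - k) // 2
    return list(frames[start:start + k])

def last_frames(frames, k=3):
    """Return the final k frames of the sequence.

    Suits sequences that end at the peak expression, as described for Oulu-CASIA.
    """
    return list(frames[-k:])
```

For a 10-frame sequence indexed 0 to 9, `middle_frames` selects frames 3, 4, and 5, while `last_frames` selects frames 7, 8, and 9.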
As with the other databases, we perform 10-fold cross-validation on the RAF database. Table 4 shows the average accuracy rates of our method on the RAF database. We first obtained the true positives, false positives, true negatives, and false negatives, and then calculated the per-class precision and F1 score over the 10 folds. Figure 5(a) shows the per-class precision and Figure 5(b) shows the per-class F1 score on the evaluated databases.
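The per-class precision and F1 score follow directly from the true-positive, false-positive, and false-negative counts. The sketch below assumes that the labels and predictions of all 10 folds are concatenated before counting; the text does not say whether counts were pooled or scores were averaged per fold, and the function name is hypothetical.

```python
def per_class_metrics(y_true, y_pred, classes):
    """Compute precision, recall, and F1 for each class from label lists."""
    metrics = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Guard against empty denominators for classes never predicted or never present.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics
```

True negatives are not needed for either score, which makes both measures insensitive to the large number of correctly rejected samples in a multi-class setting.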

Table 1: Average accuracy (%) on the Extended Cohn-Kanade (CK+) database.
Method                          Accuracy
LBP-TOP [41]                    88.99
HOG 3D [16]                     91.44
MSR [42]                        91.40
STM-ExpLet [43]                 94.19
DTAGN [44]                      97.27
3D-CNN-DAP [45]                 92.4
NMF-SSCCA [46]                  97.3
FER-MPI-SFL (baseline) [47]     98.2
(Ours)                          98.95

Table 2: Average accuracy (%) on the MMI database.
Method                          Accuracy
LBP-TOP [41]                    59.51
HOG 3D [16]                     60.89
CSPL [48]                       73.53
STM-ExpLet [43]                 75.2
DTAGN-joint [44]                70.3
3D-CNN-DAP [45]                 63.4
FER-MPI-SFL (baseline) [47]     83.1
(Ours)                          89.31

Table 3: Average accuracy (%) on the Oulu-CASIA database.
Method                          Accuracy
LBP-TOP [41]                    68.1
HOG 3D [16]                     70.6
STM-ExpLet [43]                 74.59
Atlases [49]                    75.52
DTAGN-joint [44]                81.46
FN2EN [50]                      87.71
PPDN [51]                       84.59
FER-MPI-SFL (baseline) [47]     87.39
(Ours)                          91.2

Threats to Validation.
There are a few factors that can affect the robustness of facial expression recognition. While validating our approach, we found some limitations in the existing publicly available databases. The recognition of an expression with a closed mouth is less accurate than that of an expression with an open mouth. Considering the agreement of facial expressions across face angles, we noticed that the perceived arousal from the frontal face is higher than that from a shifted face angle. Happiness, disgust with a closed mouth, and surprise remain unaffected when the face is turned away. Furthermore, the effective valence near the frontal view is conveyed more by the full left-side profile than by the full right-side profile. This is because the left hemiface shows a more spontaneous response than the right hemiface. Facial expression analysis can be enhanced by facial motion information if the image is subtle or degraded. A dynamic neutral expression with blinking of the eyes or chewing is also a threat. Moreover, the dwell time is a key factor; it is longer over the eyes than over the mouth. However, the dwell time over the mouth for a happy expression is relatively high. As the intensity increases, the accuracy also increases, whereas the dwell time and round trip decrease. Overall, the response time of females is faster than that of males, even in a low-intensity environment. Finally, it was also concluded that the dwell time of the female eye is longer than that of the male.

Conclusions
In this paper, we have introduced a state-of-the-art architecture that is robust and effective. A facial image is first preprocessed using different techniques to counter the problems of excessive background, limited data, varying illumination, pose variation, and occlusion. The facial image is fine-tuned and then forwarded to a dual enhanced capsule network that is capable of handling the spatial transformation. It uses an action-unit-aware mechanism, which helps to locate the active areas and thereby improves facial expression recognition. The feature representation ability is enhanced by multiple convolutional layers, which help to capture the key information present in the particular structure of the face. We performed privacy preservation with the help of a randomization technique, which has the added benefit of being computationally inexpensive. It also enables secure communication between two untrusted parties. Different databases have different sets of pictures under varying conditions. As a result, class imbalance occurs due to the inconsistency in expression annotations. A cost-sensitive layer could therefore be incorporated when training the deep neural networks. Meanwhile, a powerful deep neural network could be designed with prior knowledge of changes in the local environment, capable of predicting specific parameters and of inherently handling and recovering from facial occlusions without any intervention. Furthermore, to improve the robustness of FER, it can be fused with other models. Incorporating other modalities, such as depth information from three-dimensional face models, neurosciences, cognitive sciences, infrared images, and physiological data, is a promising direction for future research.

Data Availability
All the data are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.