Improved Generative Adversarial Networks for Student Classroom Facial Expression Recognition

To assess students’ learning efficiency under different teaching modes, we take students’ facial expressions in the classroom as the object of study and present an enhanced generative adversarial network. The generator is designed as an encoder-decoder combination arranged in a cascade structure with the discriminator, so that it retains features of different expression intensities to the maximum extent. We also add a new auxiliary classifier, which classifies features by intensity and improves the model’s recognition of detailed features of similar expressions, thus improving overall facial expression recognition accuracy. On public datasets, our approach has a clear advantage over other facial expression recognition approaches. Finally, we conduct experimental validation on a self-made student facial expression dataset. The experimental findings show that our approach’s recognition accuracy is superior to that of other methods, demonstrating the method’s efficacy.


Introduction
In classroom teaching, current assistive teaching systems do not visually represent what the teacher explains and what the students understand. It is also debated whether students prefer the traditional classroom teaching style or the modern smart classroom teaching style. The literature [1] mentions that smart teaching and intelligent learning environments can give full play to students' cognitive abilities, greatly increase their interactivity, and provide better mastery of new knowledge. As far as current investigations go, there is no intuitive system to measure students' acceptance of different teaching methods. We therefore concentrate on this problem. We set out to identify facial expressions: by obtaining the emotional expressions of the teacher and the facial expressions of the students and then performing facial expression analysis, we can determine the students' acceptance of and satisfaction with the teaching method. Our research provides some reference value for the quality of teaching and can reflect the effectiveness of teaching at the biotechnical level.
In human communication, facial expression is an important communication tool. It adds different emotional factors to nonverbal communication, and it is crucial in comprehending one another's emotional expression. With the advancement of biotechnology and computer science, facial expressions are used in various industries. The most common application area is privacy and security, which is most directly demonstrated by the face unlocking feature on cell phones and computers. Second, in the field of transportation, driver fatigue and drunken driving are also detected by capturing facial expressions. Facial expression recognition technology is also frequently integrated into the fields of virtual reality, medical care, and service robotics [2][3][4]. Of course, facial expression recognition is not simple, and several technical difficulties remain to be overcome. Different countries have different language and cultural backgrounds, and the meanings conveyed by their facial expressions differ to a greater or lesser degree. In addition, the results of facial expression recognition are often unsatisfactory because of the objective influence of nonstructural conditions, such as occlusion, illumination, and focus problems. Recently, many studies have appeared in the field of facial recognition to address these technical challenges, but the technological breakthroughs are all relatively limited [5].
The process of recording real-life student emotions is known as facial expression recognition, and inner feelings can be mapped from the fluctuations of emotions. The process takes dynamic video frames and still image sequences as its main recognition subject and, building on face recognition, synthesizes the linked reactions among the five facial features to predict facial expressions. The literature [3] starts the study from the simplest basic facial expressions, mainly expressions of joy and sadness. To obtain facial expressions accurately, the authors first remove noise from the images by preprocessing operations, followed by face detection to delineate the range of facial expression features. Feature fusion is then performed jointly using the linkage between the eyes, eyebrows, mouth, and cheeks, and finally, facial expressions are predicted by matching against the training feature library.
To address the difficulties in facial expression recognition research, researchers have made unremitting efforts. Some have focused on handcrafted features. For example, literature [6] proposed the use of Gabor filters to optimize handcrafted features, and literature [7] proposed local binary patterns to overcome the limitations of handcrafted features. The literature [8] proposed a gradient histogram method to extract features, which further enriched the handcrafted feature set. Other researchers have focused on deep neural networks. For example, the literature [9] improved the network structure in the neural network approach, and the authors chose to fine-tune with a two-stage training algorithm to adapt the feature linkage between the five facial features and enhance expression recognition. The literature [10] adopted generative adversarial networks, which further explored the intrinsic features of the face and eliminated the interference of nonsubjective factors. The literature [11], on the other hand, performed adaptive optimization on the constraint function and proposed island loss to determine the attribution problem between features by learning the connection between different expressions. The literature [12] focuses on the attention mechanism, proposes an adaptive regional attention network, and validates the efficiency of the network on the available dataset; the results showed that integrating the learnt model can increase the model's robustness.
However, facial emotion detection is not simple work, and the previously mentioned studies ignore the direct connection between facial attributes and emotions. The main reason for the poor recognition results is the inability to map the deformation among the five facial nodes, so changes at specific locations cannot be captured. Some researchers have proposed setting up standard lines on the face for facial node calibration, and the literature [13] also mentions that this approach can decrease the data variance and improve the stability of the model. The literature [14] proposed model-aware flags for the automatic perception of facial position, and experiments demonstrated that this method not only reduces the workload but also preserves the robustness of the model. In the literature [15], the experiments unexpectedly showed that additional flagging of facial positions by predetermined trajectories could increase the recognition speed of the model without affecting the accuracy. All the above methods take an end-to-end form, and such methods also have certain limitations. Their recognition performance is limited by the quality of the facial markers, and when facial expression features are captured, they can easily be absorbed into shallow features in a nonmaximal suppression operation.
To counteract the drawbacks of deep learning approaches, the literature [16] used a multitask learning strategy in neural network construction to enhance the primary task by shifting the learning number of different tasks. In addition, the literature [17,18] added facial detection flags in the feature design of the facial action structure unit, which can help improve facial emotion recognition accuracy. In terms of multitask parameters, most previous studies optimized on the basis of hard parameter sharing, but this approach limited the recognition efficiency of facial expressions to some extent. More recently, soft parameter sharing has started to be developed, such as the multitask convolutional partial sharing strategy in the literature [19] and the cross-stitch network proposed in the literature [20], which successfully break the efficiency limitation.
In our study, we consider various models comprehensively and choose a generative adversarial network as the base method. To obtain the intensity features of different expressions hierarchically, we add a new auxiliary classifier and optimize the network structure. Finally, the effectiveness of our approach is demonstrated on both public and self-made datasets. The rest of the study is arranged as follows. Section 2 presents the work related to different facial expression recognition methods. Section 3 introduces our adaptive improvement strategy and implementation process for generative adversarial networks. Section 4 presents the experimental datasets and the comparison of experimental methods. Finally, Section 5 presents research prospects and improvement directions.

Related Work
Traditional facial expression recognition research mainly relies on extracting geometric features, texture features, and hybrid features of the face [21]. The active shape model is the most widely used geometric feature method in facial expression recognition work; it uses facial feature points as a reference to construct geometric features and then localizes them. In practical applications, the method is affected by lighting and occlusion and does not achieve good recognition results. The facial action unit is also a typical example of the geometric feature method. This method first divides the face into units and then compares them with the facial reference points by calculating the relative distance between units. However, this method requires intensive training in advance and has very high time complexity [22]. Texture feature-based facial expression recognition methods, such as Gabor filter and local orientation pattern methods, are more common and usually compute faster, but they are not effective for motion scenes. In the face of occlusion, the most effective method is the scale-invariant feature transform, which automatically finds spatial extrema, extracts their position, scale, and rotation invariants, and can circumvent the effect of occlusion through local mapping; however, this method is not effective for targets with smoothed edges.
In facial expression recognition work, the input video frames or images are first preprocessed and then fed to convolutional layers of different scales for feature extraction; the facial features are then transformed into independent vectors, and finally the classification is completed by fully connected layers [23]. Different application scenarios place different structural requirements on convolutional neural networks [24], and to address the influence of nonstructural environmental factors, facial expression recognition work often requires specific preprocessing operations, such as the HOG feature method [25], the LBP method [26,27], and the ROI method [28][29][30]. Different features are extracted at different stages, resulting in multiple features of different dimensions that cannot be unified in time, which affects the convergence efficiency of neural networks. Besides, researchers often use convolutional neural networks as a base network and optimize and upgrade them according to the requirements of different tasks to increase the adaptability and performance of deep networks. Some researchers have designed cascade networks to enhance the efficiency of facial node localization [31]. Some have added auxiliary modules to improve the robustness of the model [32]. Others have divided the network into parallel or tandem networks of small modules to combine features at the decision level [33,34]. All of the above methods aim to improve the depth and parameter tuning of the network, which invariably increases the number of parameters. Considering the computational cost, some researchers have proposed recurrent neural networks [15], capsule networks [35], deep belief networks [36,37], and so on.
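As a concrete illustration of one of the preprocessing descriptors named above, the basic 3×3 local binary pattern (LBP) code can be sketched in a few lines of Python. This is the textbook operator and not necessarily the variant used in [26,27]; the function names `lbp_code` and `lbp_histogram` and the clockwise bit order starting at the top-left neighbor are assumptions made for illustration.

```python
def lbp_code(patch):
    """Basic 8-neighbor LBP code for the center of a 3x3 grayscale patch.

    Each neighbor contributes a 1 bit if it is >= the center pixel.
    Bit order (an illustrative choice): clockwise from the top-left
    neighbor, most significant bit first.
    """
    center = patch[1][1]
    # (row, col) positions of the 8 neighbors, clockwise from top-left.
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for r, c in order:
        code = (code << 1) | (1 if patch[r][c] >= center else 0)
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over the interior pixels of a 2-D list."""
    hist = [0] * 256
    for i in range(1, len(image) - 1):
        for j in range(1, len(image[0]) - 1):
            patch = [row[j - 1:j + 2] for row in image[i - 1:i + 2]]
            hist[lbp_code(patch)] += 1
    return hist
```

The histogram, rather than the raw codes, is what would typically be passed to a classifier such as an SVM.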
For deep learning methods, the recognition accuracy generally increases with the volume of training data: the richer the dataset, the higher the recognition accuracy. For facial emotion detection, building a database of facial expressions is undoubtedly a difficult and long-lasting task. The features of facial expressions are deeply related to different cultural backgrounds, and the process of data annotation usually requires the annotators to have a certain understanding of national culture and background. In addition, the optimization process of neural networks is often not transparent enough, and most researchers rely on constant repetition of experiments and on experience to verify the optimal parameter sizes [38]. Therefore, the duration and computational cost of the project need to be considered before adopting a deep learning approach. To circumvent complex parameter tuning strategies, the literature [39] proposed the multigranularity cascade forest method, an integrated neural network structure inspired by the cascade forest classification rule and the random forest rule. Compared with pure deep learning methods, this method has a smaller number of parameters and sets hidden layer hyperparameters to reduce the computational cost.

Method
Pipeline Overview.
Researchers usually take an unsupervised approach to train the adversarial model, which is a single deep neural network model divided into two parts in the phased design of the network. The generator forms the front end of the network and the discriminator the back end. The principle of the generative adversarial network is simulated training at the neural network level, where different samples are iterated and generated in a random mode. The original samples are fed in at the input side, and the generator generates pseudosamples based on them. The usability of a generated sample is judged by checking whether its difference from the original sample falls within a specified threshold. If the generated sample does not meet the standard, the procedure is iterated so that the feature values of the pseudosamples approach those of the true samples.
The structure of the generative adversarial network is shown in Figure 1.
In our study, face recognition systems can be made more robust by combining facial expression recognition with generative adversarial networks. Generative adversarial networks essentially play the facial expression details against each other through repeated update iterations until the best facial expression features are obtained, which are then output to the terminal. To refine the detail features of facial expressions, we define the classification of facial expressions so as to prevent errors from growing with different expression strengths.
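The iterate-until-within-threshold idea described above can be illustrated with a deliberately simplified sketch. It is not a neural network: the "generator" here is just a feature vector nudged toward the real sample, and the "discriminator" is a distance check against a threshold. The names `refine_pseudosample`, `threshold`, and `step` are hypothetical and do not come from the paper.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def refine_pseudosample(real, pseudo, threshold=0.05, step=0.5, max_iters=100):
    """Toy feedback loop: accept the pseudosample once it lies within
    `threshold` of the real sample; otherwise feed the difference back
    and correct it.

    Returns the final pseudosample and the number of correction rounds used.
    """
    pseudo = list(pseudo)
    for rounds in range(max_iters):
        if distance(real, pseudo) <= threshold:  # "discriminator" accepts
            return pseudo, rounds
        # "generator" correction: move a fraction `step` toward the real features
        pseudo = [p + step * (r - p) for r, p in zip(real, pseudo)]
    return pseudo, max_iters
```

In an actual adversarial network the correction is a gradient step on learned weights and the acceptance test is a learned classifier, so this loop only mirrors the control flow, not the learning.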

Generator.
The generator is in the front part of the adversarial network, and its input is the real sample. After the real samples are input, the generator parses them, divides them into different feature nodes, and finally simulates the feature nodes to generate pseudosamples. The working process of the generator is shown in Figure 2.
We refer to the literature [40,41] for an enhanced approach to generative adversarial networks, in which the generator works as an encoder and decoder in tandem. After several experimental verifications, we also apply the nested combination of encoding and decoding to the generator network. The encoder of the generator acquires facial expression features of different intensities, $I_{low}$, by downsampling. Researchers in the literature [42] added a residual structure to the generator to improve the efficiency of the generator encoding, and we also verified the effectiveness of this method experimentally. In the decoder network layer, we use upsampling to transform the intensity features of facial expressions and then apply nonlinear activation by ReLU. Following the decoder network optimization method in the literature [43], we implemented facial expression intensity configuration using the X-conv operator. Given K input expression points $(p_1, p_2, \ldots, p_K)$, a multilayer perceptron computes a transformation matrix $X = \mathrm{MLP}(p_1, p_2, \ldots, p_K)$ of dimension $K \times K$, after which the summation over feature elements simplifies to the commonly used convolution operator. When the transformation matrix $X$ is computed, different facial expression nodes have different effects. The X-conv operator is defined over the following quantities: $p$ represents the facial expression feature node, $\mathbf{K}$ represents the facial expression traversal function, $P = (p_1, p_2, \ldots, p_K)^T$ represents the K nodes within the neighborhood of the expression feature node, and $F = (f_1, f_2, \ldots, f_K)$ represents the expression feature nodes in different domains.
In the nonlinear connection of the X-conv operator, facial expressions of different intensities have different feature expressions in the generator, and the details of the X-conv operator at each level are shown in Figure 3.
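To make the role of the K × K transformation matrix concrete, the following is a minimal sketch of an X-conv step in plain Python. In the PointCNN-style formulation that the description above appears to follow, the operator is usually written as F_p = Conv(K, MLP(P − p) × [MLP_δ(P − p), F]); whether the authors' equation is identical is an assumption. The sketch omits the coordinate-lifting MLP_δ, uses a single output channel, and takes the MLP as a caller-supplied function, so `x_conv`, `mlp`, and `kernel` are illustrative names only.

```python
def matmul(A, B):
    """Plain-Python matrix product of A (n x m) and B (m x q)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def x_conv(p, P, F, mlp, kernel):
    """Simplified X-conv around a representative point p.

    p      : coordinates of the representative point (length d)
    P      : K neighbor points, each of length d
    F      : K neighbor feature vectors, each of length C
    mlp    : callable mapping the K local coordinates to a K x K matrix X
             (stands in for the learned MLP; hypothetical here)
    kernel : K x C weights of a single-output-channel convolution
    """
    local = [[q - c for q, c in zip(point, p)] for point in P]  # P - p
    X = mlp(local)                                              # K x K transform
    FX = matmul(X, F)                                           # weight and reorder features
    # After the transform, the convolution is an element-wise product and sum.
    return sum(w * f for wrow, frow in zip(kernel, FX) for w, f in zip(wrow, frow))
```

With an identity matrix for X the operator reduces to an ordinary convolution over the neighbors; a permutation matrix shows how X can reorder which neighbor each kernel weight sees.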

Discriminator.
The discriminator network consists of a combination of fully connected and deconvolutional layers.
The discriminator sits at the output port of the generator. In the discriminator, different threshold ranges are set, and pseudosamples are marked as invalid if they fall below the threshold range. The feature information of an invalid sample is fed back to the generator together with the simulation side of the real sample. All feedback passes the correct feature values by back propagation, and the generator automatically corrects the newly generated expression features based on the feedback feature values. The discriminator principle is shown in Figure 4. The intensity of facial expression features varies with the type of facial expression. Low-intensity expression features are less demanding on the generator, which only needs to filter the facial contour data density. High-intensity expression features must first be decomposed and then converted into combinations of low-intensity features. Researchers in the literature [44] used an alternating training model to optimize the discriminator with threshold discretization detection of pseudosamples. In our min-max definition, Gen denotes the generator's twin sample of the real sample, Dis denotes the discriminator's threshold discrete detection of the pseudosample, and $I_{low}$ and $I_{high}$ represent the feature intensity grading corresponding to facial expressions. The generator Gen and the discriminator Dis are distributed according to a certain linear function, in which N represents the expression feature intensity. During the intensity feature convergence process, the pseudosample features can be ranked by their degree of threshold discretization under the detection of the discriminator. The generator fine-tunes the new features at a later stage based on the feature discretization values fed back by the discriminator.
The different levels of discriminator network layers we constructed are shown in Figure 5.
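For reference, the standard adversarial min-max objective, written with the symbols used in this section, takes the form below. It is the generic objective of generative adversarial networks with $I_{low}$ as the generator input and $I_{high}$ as the real sample; treating it as the exact definition used in this work is an assumption.

```latex
\min_{Gen}\;\max_{Dis}\;
\mathbb{E}_{I_{high}}\bigl[\log Dis(I_{high})\bigr]
+ \mathbb{E}_{I_{low}}\bigl[\log\bigl(1 - Dis(Gen(I_{low}))\bigr)\bigr]
```

The discriminator maximizes this quantity by scoring real samples near 1 and pseudosamples near 0, while the generator minimizes it by producing pseudosamples that the discriminator scores highly.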

Auxiliary Classifier and Loss Function.
The intensity of facial expression features can cause feature loss in the middle transition layer of the network. For this reason, we add auxiliary classifiers in the middle layer, which retain the facial expression feature information under different intensities. In an actual course scenario, facial expressions involve different levels of facial muscle activity. To maintain a stable mapping between expression changes and feature intensities, the adversarial loss function is used to guide the feature decomposition of real expressions. An adaptive linear fitting function is added to the auxiliary classifier network layer, and during the production of classifier pseudosamples all samples are configured by default as combinations of low-intensity features. This prevents feature intensity confusion in the process of expression feature perception. In the feature perception term added in the auxiliary classifier, ϕ represents the expression feature intensity perceptron. In refining the pixel feature representation of 2-dimensional images of facial expressions, the high-intensity facial expression feature $I_{high}$ and the linked expression feature $Gen(I_{low})$ generated by the generator use point-by-point loss optimization to overcome the feature refinement and loss problems arising from the high-intensity feature decomposition. Researchers in the literature [45] validated the point-by-point loss optimization algorithm experimentally and found that the L2 loss function is more stable. In the point-by-point loss, $N_{pixel}$ denotes the intensity expression of the facial expression at the two-dimensional level.
According to the constraint effect of the loss function, we designed a new loss function in which $\omega_1$, $\omega_2$, and $\omega_3$ denote the expression intensity feature weighting coefficients.
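The way three loss terms can be combined with weighting coefficients is sketched below in plain Python. The sketch assumes a weighted sum of an adversarial term, a perception term on the features ϕ, and a point-by-point L2 term; the exact form of each term and which coefficient weights which term are assumptions, and all function names are hypothetical.

```python
import math


def adversarial_loss(d_real, d_fake):
    """Discriminator-side adversarial term: -[log D(real) + log(1 - D(fake))].

    d_real and d_fake are discriminator outputs in the open interval (0, 1).
    """
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def perceptual_loss(phi_real, phi_fake):
    """Mean squared difference between phi(I_high) and phi(Gen(I_low))."""
    return sum((a - b) ** 2 for a, b in zip(phi_real, phi_fake)) / len(phi_real)


def pixel_l2_loss(i_high, generated):
    """Point-by-point L2 loss averaged over the N_pixel entries of flattened images."""
    n_pixel = len(i_high)
    return sum((a - b) ** 2 for a, b in zip(i_high, generated)) / n_pixel


def total_loss(l_adv, l_perc, l_pixel, w1=1.0, w2=1.0, w3=1.0):
    """Weighted sum of the three terms; w1, w2, w3 play the role of the omega weights."""
    return w1 * l_adv + w2 * l_perc + w3 * l_pixel
```

Raising the pixel weight pushes the generator toward exact reconstruction of $I_{high}$, while raising the perception weight favors agreement in feature space; the defaults of 1.0 are placeholders, not values from the paper.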

Improved Generative Adversarial Networks.
In our study, to assess students' learning efficiency from their facial expressions in the classroom, we present an enhanced generative adversarial network strategy that improves the accuracy of facial expression recognition models while also separating similar expressions using feature intensity classification. The auxiliary classifier provides feature generation guidelines to the generator and pseudosample feature discrimination to the discriminator. At the pixel level, the middle-layer neural network of the auxiliary classifier uses the X-conv operator to assist in synthesizing independent facial expression pseudofeatures, which are fed back in parallel with the generator in the joint output. The back propagation information from the discriminator acts as a filter in the auxiliary classifier, extracting the feedback that helps enhance the effectiveness of the pseudofeatures and passing it into the real sample perception network. The facial expression detection network is shown in Figure 6.
We used the Oulu-CASIA (OC) dataset, the Cohn-Kanade (CK+) dataset, and a facial expression dataset with occlusion (FMEO) for the experimental test. Before performing expression classification on these datasets, we collaborated with medical schools to manually standardize clear boundaries between expressions, and we then preprocessed all data to segment the images to specified sizes, taking a different testing approach for different sizes of data. The Oulu-CASIA dataset [46] contains a total of 2880 samples from the expression acquisition of 80 volunteers, which were captured using video recording and divided into a visible light (VIS) series and a near infrared (NIR) series according to the imaging system. Three different illumination methods were selected for the acquisition process to analyze how the environment affects the detection methods. There are 480 videos of normal illumination samples, 60 videos of low illumination samples, and 15 videos of dark scenes.
For the selection of the training set, we chose all the normal illumination video frame samples. The details of expression classification are shown in Table 1.
The Cohn-Kanade (CK+) dataset [47] contains a total of 593 video samples of facial expressions captured from 118 volunteers. Each video is divided into 20-50 frames, and all video frame sequences are captured using a facial action coding system, which automatically classifies the expressions and labels them accordingly after the capture is completed. Its detailed facial expression classification information is shown in Table 2.
To evaluate the effectiveness of our strategy in complex situations such as occlusion, we chose FMEO for the validation test. The dataset contains a total of 690 samples from 10 young volunteers, whose faces were partially covered with props, such as hats, glasses, and masks, while the facial expression samples were collected. The detailed classification of facial expressions in this dataset is shown in Table 3.

Experimental Settings.
We trained the two-dimensional samples separately from the three-dimensional samples. The detailed parameter settings are shown in Table 4. In the validation process, we adopted the method mentioned in the literature [11]. For multitask learning training, to compare random input expressions fairly, we used a random search strategy for hyperparameter tuning.
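A random search over hyperparameters, as mentioned above, can be sketched as follows. The search space, the objective, and the name `random_search` are illustrative assumptions; the actual hyperparameters and ranges used in the experiments are those of Table 4 and are not reproduced here.

```python
import random


def random_search(objective, space, n_trials=20, seed=0):
    """Sample `n_trials` configurations from `space` and keep the best one.

    objective : callable taking a dict of hyperparameters and returning a
                score to maximize (for example, validation accuracy)
    space     : dict mapping a name to either a (low, high) float range
                or a list of discrete choices
    """
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {}
        for name, spec in space.items():
            if isinstance(spec, tuple):
                cfg[name] = rng.uniform(*spec)   # continuous range
            else:
                cfg[name] = rng.choice(spec)     # discrete choices
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

In practice `objective` would train the network with the sampled configuration and return a validation metric, which makes each trial expensive; the number of trials is therefore usually chosen from the available compute budget.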

Experimental Results.
In the facial emotion detection work, we mainly analyze three metrics: accuracy (Acc), F1 score, and recall (R). To verify that our method is effective, we chose a traditional facial emotion detection approach and neural network-based facial emotion detection methods as control groups, comparing against three methods: LBP_SVM, CNN, and LSTM. During the training and tuning phase, each network was trained independently without the recognition module to confirm the accuracy of each technique. The experimental results are shown in Table 5, which demonstrates the effectiveness of our strategy for facial emotion detection. In these experiments, CNN is the more commonly used method; however, it falls short of the LSTM approach in terms of facial expression recognition accuracy. This is mostly owing to the benefits of the LSTM's network topology, which can achieve local perception and maximize memory information fusion. Our method uses generative adversarial networks with a new CNN-based auxiliary classifier, which can recognize similar expressions hierarchically starting from the expression feature strength, further improving the accuracy of facial expression recognition while obtaining better robustness.
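The three metrics can be computed from predicted and true labels as in the following sketch. Macro averaging over expression classes is an assumption here, since the averaging scheme is not stated; `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, macro-averaged recall, and macro-averaged F1.

    The averages run over the classes that occur in y_true.
    """
    classes = sorted(set(y_true))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    recalls, f1s = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)
    return accuracy, sum(recalls) / len(classes), sum(f1s) / len(classes)
```

For example, with true labels `["happy", "happy", "sad", "sad"]` and predictions `["happy", "sad", "sad", "sad"]`, the accuracy and the macro recall are both 0.75, while the macro F1 is lower because the missed "happy" sample reduces the F1 of both classes.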
The experimental results show that performance is best on the OC and FMEO datasets. Because of the computational cost, we mainly use the experimental results on OC and FMEO as the main judging criteria. To test the efficiency of our approach for facial expression recognition in the classroom, we conducted experimental validation on a self-made dataset. We collected classroom expression video data of 300 college students, manually labeled the self-made dataset according to the OC dataset labeling rules, and then tested it with the trained model. The results are shown in Table 6.

Scientific Programming
In the students' facial expression recognition experiments, our improved generative adversarial network outperforms the others, and it further proves the effectiveness of our approach.

Conclusion
We offer a method for recognizing facial expressions based on an upgraded generative adversarial network. The method is a deep training model, and we divide the network into three stages. The front end of the network is the generator network layer, which relies on real sample features to generate pseudosamples. The middle of the network is the auxiliary classifier, which assists the generator in generating pseudosamples that are closer to the real samples. The end of the network is the discriminator network layer, which determines whether the pseudosamples satisfy the output conditions according to the degree of threshold discretization; pseudosamples that do not satisfy the conditions are fed back to the front layer for reconstruction. In the experiments, we test the efficiency of the strategy on open-source datasets and also on the self-made student dataset. The experimental results show that the facial expression detection accuracy of our method stays above 92% and that the overall performance of the model exceeds that of the other methods.
Facial expressions are very complex to capture, and there are thousands of facial expressions in different scenes. In this paper, we tentatively select facial expressions with more prominent features as the objects of study. However, for many obscure expressions, our method still does not perform well. In further research, we plan to use a dual RNN framework to perceive the 3D features of facial expressions and to enhance the model's tolerance of high-intensity feature expressions.
Data Availability
The dataset can be accessed upon request.