Improved Convolutional Neural Networks for Course Teaching Quality Assessment



Introduction
Higher education research shows that the cultivation of innovative talent is closely related not only to the quality of classroom teaching [1] but also to the concentration that students show through their facial expressions during teaching [2]. In view of this, building a teaching effect evaluation model with deep learning, based on correlation analysis between facial expressions and classroom teaching effects, has become a main research direction in the field of education.
Facial expressions reflect real human emotions. Psychologist Albert Mehrabian pointed out that "emotional expression = 7% language + 38% voice + 55% facial expression" [3]. As a carrier of emotion and psychology, facial expression plays an important role in human emotion judgment. According to Ekman's basic emotion theory, facial expressions carry a large amount of emotional semantics, generally divided into six types: happiness, disgust, anger, sadness, fear, and surprise [4]. However, emotion is usually continuous and context-dependent, and expressions differ in intensity, so the basic emotion theory still has some limitations. Different from ordinary expressions, microexpressions are spontaneous expressions generated under the influence of subjective emotions [5]. Microexpressions are characterized by short duration (1/25 to 1/3 s) and small amplitude of action, which makes microexpression recognition very difficult.
In earlier microexpression recognition work, handcrafted feature extraction was used to analyze microexpressions. However, because the underlying features are extracted manually, the extracted features are insufficient, resulting in low recognition accuracy [6]. Deep learning algorithms perform outstandingly in image feature extraction. Therefore, they can be used to extract microexpression features more effectively, which improves recognition. In addition, due to limited computing power and the scale of facial expression video data, traditional methods usually analyze a static or single facial expression, ignoring the periodicity of facial expression. The generation of a facial expression is a process that changes over time. Dynamic facial expressions capture these changes more naturally, while a single frame cannot reflect the overall information of an expression. Therefore, analysis based on dynamic expression sequences is more helpful for microexpression recognition.
Based on dynamic multiexpression sequences, this paper proposes a separate long-term recurrent convolutional networks (SLRCN) model that combines spatial and temporal features. First, a convolutional neural network is used as a deep visual feature extractor to extract static features of microexpressions in images [7], and the features extracted from video sequences are fed to a bidirectional recurrent neural network to obtain the time-series output. This can improve the accuracy of microexpression recognition. In addition, practical application scenarios of facial expression sequences are studied in order to combine teaching evaluation with facial expression analysis. Students' learning status is analyzed by collecting their facial expressions. The model can effectively capture the changes in students' facial expressions and evaluate their learning status, thus promoting the improvement of teaching quality and providing a new method for teaching quality evaluation.
This model has the following advantages: (1) The model is simple in structure and does not require much data preprocessing. (2) The overfitting problem is addressed by introducing transfer learning. (3) It can be used when the data set is insufficient.
This paper consists of five parts: the introduction, the state of the art, the methodology, the experiment and analysis, and the conclusion.

Correlation between Facial Expression Recognition and Teaching Effect.
Facial expression generally refers to the representation of various emotional states through facial changes. It is an extremely important means of nonverbal communication. Artists often express the inner feelings of characters by depicting their facial expressions, so as to portray their state of mind vividly. Paul Ekman, a renowned American psychologist known as "the Pope of the Face," has long studied the relation between facial expressions and inner truth. He found that involuntary reactions are the best indicator of true feelings. When a subject's facial expression is not consistent with his real thoughts, there are always corresponding flaws. In view of this, universities in China have used deep learning and other technologies to conduct in-depth research on subjects' facial expressions and the classroom teaching effect. Literature [8] takes the FER2013 face data set as the research object and proposes an improved multiscale feature fusion algorithm for detecting small faces. The final experimental results show that the recognition accuracy of this method reaches 73.669%. Literature [9] proposes a classroom teaching effect evaluation system based on an improved VGG network model, which combines expression concentration and head-up rate. Many interesting teaching rules can be deduced from practical classroom teaching experiments. For example, during roughly the first ten minutes of class, the overall attention of the class increases slowly, and about five minutes before class ends, it drops sharply. This requires teachers to adopt different teaching methods in different teaching periods and to promote students' interest in learning in order to achieve satisfactory teaching results. Literature [10] proposed a facial expression detection method based on deep learning. Firstly, a face detection model is constructed based on an optimized fusion of FaceBoxes and MTCNN. Secondly, this model is tested and optimized on FDDB, an open-source face database. Finally, based on statistics of the students' head-up rate, an evaluation standard for classroom teaching based on facial expression is constructed. At present, the recognition accuracy of classroom teaching effect evaluation models based on facial expression recognition is still low and has not reached a commercially usable level. Therefore, on the basis of existing research, this paper focuses on the inner relationship between face detection and classroom teaching effect evaluation, and on constructing a feasible classroom teaching effect evaluation model and applying it to teaching practice.

Facial Expression Recognition.
Literature [11] proposed the Facial Action Coding System (FACS) in 1976. FACS divides the face area into 44 action units (AUs) and combines different AUs to form FACS codes. Each FACS code corresponds to a facial expression. Based on this, Emotion FACS was developed after analyzing a large number of facial expression pictures [12]. The MIT lab trained sparse codebooks for emotion analysis of microexpressions. By exploiting the sparsity of tiny temporal motion patterns, local spatiotemporal features are extracted in the facial region [13], microexpression codebooks are learned from the data, and features are encoded in a sparse manner. Experiments on the AVEC 2012 data set show that this approach performs well.

Expression Feature Extraction.
Expression feature extraction methods can be divided into two categories: static image and dynamic image. Dynamic feature extraction mainly focuses on facial deformation and facial muscle movement. Representative methods based on dynamic feature extraction include optical flow method [14], motion model, geometric method, and feature point tracking method.
In literature [15], microexpression detection and recognition are carried out with a 3D histogram method that uses the gradient relationship between associated frames. Literature [16] uses strain patterns to process long videos with the optical flow method. The face is segmented into several specific subareas (such as the mouth and eyes), and microexpressions are then identified. Literature [17] uses the Local Binary Patterns from Three Orthogonal Planes (LBP-TOP) algorithm to extract the features of microexpression image sequences. In this method, dynamic local texture features in the time and space domains are extracted by extending the descriptor from 2D to 3D. The CASME database was established in literature [18], and Gabor filtering was applied to extract the feature values of microexpression sequences. A classifier combining Gentle Adaptive Boosting with support vector machines (GentleSVM) is used for classification and recognition.
In terms of microexpression recognition based on spatiotemporal motion information, literature [19] detected and recognized microexpressions by constructing optical strain features and optical strain weighted features from facial optical strain. Literature [20] used Eulerian magnification to analyze the phase in the frequency domain and the amplitude in the time domain, in order to amplify the movement information of microexpressions and eliminate irrelevant facial dynamics, and then used the LBP-TOP algorithm for feature extraction. Literature [21] proposed a facial dynamics map (FDM) method to characterize microexpression sequences. The method calculates the optical flow of the microexpression sequence and aligns the sequence accurately in the optical flow field.

Deep Learning and Microexpression Recognition.
Different from traditional machine learning algorithms, deep learning highlights the importance of feature learning. Through layer-by-layer feature mapping, features of the original data space are mapped to a new feature space, making classification and prediction easier. Deep learning can use data to extract features that meet the requirements and overcomes the defect that handcrafted features cannot be extended. Literature [22] introduced deep learning into microexpression recognition and extracted microexpression features through feature selection. However, due to the small sample size of the data set, overfitting occurs easily during training, which affects the recognition accuracy of the network. In literature [23], a convolutional neural network was used to encode the spatial features of microexpressions in different states. The spatial features with expression state constraints were transferred to the temporal features of microexpressions, and an LSTM network was used to encode the temporal features of microexpressions in different states. Literature [24] proposed an enriched long-term recurrent convolutional network that extracts optical flow features from data sets to enrich the input of each time step or of a given time length.

Methodology
Microexpression recognition locates the face in a complex scene with a face detection algorithm, detects and segments the face contour, extracts microexpression features, and establishes a recognition and classification model. Its basic steps are as follows: (1) acquire and process facial expression images and expression sequences; (2) extract microexpression features from the facial expression sequences, removing redundancy between features to reduce the feature dimension; (3) based on a long-term recurrent network, use the microexpression features as the input of the time-series model to learn the dynamic process of the time-varying output sequence; (4) establish a dynamic prediction model to classify and recognize facial microexpressions. See Figure 1. The method in this paper is based on the framework of long-term recurrent convolutional networks (LRCN), and the model is improved to make it more suitable for recognizing microexpression video clips. To cope with the small amount of data in microexpression data sets, transfer learning is adopted to avoid network overfitting. The convolutional neural network (CNN) and LSTM are fine-tuned, and the SLRCN method is proposed. The convolutional neural network and the long-term recurrent network are combined as two independent modules: one obtains the spatial-domain features, and the other classifies the temporal-domain features. Firstly, the feature vector of each microexpression image frame is extracted using the pretrained CNN model to form a feature sequence. Then, the temporally correlated feature sequence is fed into the LSTM network to obtain the time-series output. Through this method, the structure and output of the CNN can be fine-tuned to raise classification accuracy and to support learning on small-scale data sets.

LRCN Network.
LRCN is a recurrent convolutional structure combining a conventional CNN with LSTM. The network can process both sequential video input and single-frame images, and it supports both single-value prediction and sequence prediction. It is also suitable for large-scale visual learning. The LRCN model directly connects the long-term recurrent network to the convolutional neural network so that convolutional perception and temporal dynamics are learned simultaneously. Combined with a deep hierarchical visual feature extraction model, this model can learn to recognize and serialize spatiotemporal dynamics. This covers tasks with sequential data (input or output), such as video and description. See Figure 2.
At time $n$, a parameterized feature transformation is applied to each visual input $q_n$ (a single image or video frame) to generate a fixed-length vector $l_n \in \mathbb{R}^d$, where $\mathbb{R}^d$ is the set of d-dimensional real vectors. The feature-space representation of the video input sequence, $[l_1, l_2, \ldots, l_N]$, is established and then fed into the sequence model.
In its usual form, the sequence model maps the input $i_n$ and the hidden state $b_{n-1}$ of the previous time step to the output $k_n$ and the updated hidden state $b_n$. The states $b_1, b_2, \ldots$ are calculated successively until $b_n$ is obtained, where $M$ denotes the weight parameters. To predict the distribution $U(j_n)$ at time step $n$, the last step is to apply a softmax function to the output $k_n$ of the sequence model, which maps a vector to a probability distribution.
LRCN instantiates the following learning tasks for three major visual problems (behavior recognition, image description, and video description): (1) Sequential input, fixed output: $[i_1, i_2, \ldots, i_T] \to j$. For vision-oriented behavior prediction, a video of arbitrary length $T$ is used as input to predict the corresponding behavior label. (2) Fixed input, sequential output: $i \to [j_1, j_2, \ldots, j_N]$. For the problem of image description, a fixed image is used as input to output description labels of arbitrary length. Experimental results show that LRCN is a model that is deep in both space and time. It can be applied to a variety of visual tasks involving inputs and outputs of different dimensions and works well in video sequence analysis.
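The sequential-input, fixed-output case can be illustrated with a minimal NumPy sketch. It is not the LRCN implementation: a plain tanh recurrence stands in for the LSTM, the per-step softmax outputs are averaged to get one label distribution (an assumption made here for illustration), and the names `sequence_to_label`, `M_in`, `M_rec`, and `M_out` are hypothetical.

```python
import numpy as np

def softmax(k):
    """Numerically stable softmax over the last axis."""
    z = k - np.max(k, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def sequence_to_label(features, M_in, M_rec, M_out):
    """Map per-frame vectors l_1..l_N of shape (N, d) to one class distribution.

    A simple tanh recurrence is used as a stand-in for the LSTM (assumption).
    """
    b = np.zeros(M_rec.shape[0])               # initial hidden state b_0
    step_probs = []
    for l_n in features:
        b = np.tanh(M_in @ l_n + M_rec @ b)    # updated hidden state b_n
        k_n = M_out @ b                        # sequence-model output k_n
        step_probs.append(softmax(k_n))        # per-step distribution U(j_n)
    # Assumption: average the per-step distributions to obtain a fixed output.
    return np.mean(step_probs, axis=0)

rng = np.random.default_rng(0)
N, d, hidden, classes = 10, 8, 16, 5           # toy sizes, not the paper's
features = rng.normal(size=(N, d))
M_in = rng.normal(scale=0.1, size=(hidden, d))
M_rec = rng.normal(scale=0.1, size=(hidden, hidden))
M_out = rng.normal(scale=0.1, size=(classes, hidden))
probs = sequence_to_label(features, M_in, M_rec, M_out)
print(probs.shape, probs.sum())
```

Averaging is only one way to collapse a sequence into a single prediction; taking the distribution at the last time step would be an equally simple alternative.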

SLRCN Network.
Since microexpressions are video frame sequences, it is particularly important to extract features in both the spatial and the temporal domain. Therefore, by taking advantage of LRCN's "dual depth" sequence model in behavior recognition and applying LRCN to microexpression sequence classification, the SLRCN model is proposed. The method consists of three parts: preprocessing, microexpression feature extraction, and feature sequence classification. Preprocessing includes facial cropping and alignment to extract the key facial areas. Feature extraction applies a CNN model pretrained on faces to the image frames to build the feature set. Sequence classification feeds the feature set of the video sequence to the LSTM network and then classifies the microvariations of the given sequence. This method has the following advantages: (1) Being based on LRCN, the structure is simple, requires less input preprocessing and manual feature design, and reduces intermediate steps. (2) It suits microexpression data sets with an insufficient amount of data and can extract facial microfeatures through transfer learning to avoid overfitting during training. (3) The training process can be visualized, which is convenient for modifying the model and tuning parameters and features.
SLRCN consists of two parts in the training process. The CNN is used to extract the image features of facial expression frames, and the LSTM is used as a temporal classifier to analyze the correlation of features in the temporal dimension.

CNN as a Feature Extractor.
As a deep learning model, the CNN is well suited to extracting basic image features and reducing model complexity. Therefore, a CNN is used to extract the feature vectors of microexpression sequences, which gives stronger adaptability and better feature expression in different environments. For microexpression recognition, the sample size of the data set is very small, and overfitting may occur during network training. It is not feasible to train CNN models directly on microexpression data. In order to reduce overfitting when training deep networks on microexpression data sets, CNN models pretrained on objects and faces are used for transfer learning, and feature selection is used to extract task-related deep features. Literature [25] used the ImageNet database to initialize a residual network for transfer learning in microexpression recognition, and further pretraining was carried out on several macroexpression databases. Finally, microexpression data sets were used to fine-tune the residual network and the microexpression units. In general, however, the expressions in macroexpression databases change greatly and have obvious expression characteristics, while microexpressions change only slightly and are closer to an unchanged face image. Therefore, the VGGFace model for face recognition is used as the feature extractor for microexpression frames, since it can extract subtle features across different environments and people.
The VGGFace model used in this paper is based on the Squeeze-and-Excitation Networks (SENet) architecture and is trained on the VGGFace2 face database. SENet enhances the adaptability of the network by embedding the SE structure in a residual network (ResNet) and improves network performance by modeling the relationship between feature channels. See Figure 3.
As shown in Figure 3, the transformation $F_{tr}: I \to P$, with $P = [p_1, p_2, \ldots, p_z, \ldots, p_C]^{\top}$, is realized as follows:
$$p_z = q_z * I = \sum_{x=1}^{C'} q_z^{x} * i_x$$
where $q_z = [q_z^1, q_z^2, \ldots, q_z^{C'}]$, $I = [i_1, i_2, \ldots, i_{C'}]^{\top}$, $q_z^x$ is a two-dimensional spatial kernel, $q_z$ is the z-th convolution kernel, and $i_x$ is the x-th input. After the above convolution operation, feature $P$ is obtained, which is a feature map of size $M \times B \times C$. Feature compression transforms the $M \times B \times C$ input into the $1 \times 1 \times C$ output $k \in \mathbb{R}^C$, which is calculated as follows:
$$k_z = F_{sq}(p_z) = \frac{1}{M \times B} \sum_{a=1}^{M} \sum_{b=1}^{B} p_z(a, b)$$
The feature $S = [s_1, s_2, \ldots, s_C]$ obtained in the feature excitation process has dimension $1 \times 1 \times C$ and describes the weights of the $C$ feature maps in feature $P$, namely,
$$S = F_{ex}(k, M) = \sigma(M_2\, \delta(M_1 k))$$
where $M_1 \in \mathbb{R}^{C/r \times C}$ is the dimensionality reduction operation of the fully connected layer, $M_2 \in \mathbb{R}^{C \times C/r}$ is the dimensionality increase operation of the fully connected layer, and, as in the standard SENet formulation, $\delta$ is the ReLU function and $\sigma$ is the sigmoid function. The features are then rescaled:
$$\tilde{p}_z = F_{scale}(p_z, s_z) = s_z \cdot p_z$$
Feature extraction is performed by fine-tuning: features are compressed in the global average pooling (GAP) layer, two fully connected layers model the correlation between channels, and overfitting is reduced by lowering the number of parameters and the amount of computation in the model.
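The feature compression, feature excitation, and rescaling steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the VGGFace code: the weights are random, biases are omitted, ReLU and sigmoid are the standard SENet activations, and the function name `se_block` and the toy sizes are hypothetical.

```python
import numpy as np

def se_block(P, M1, M2):
    """Apply squeeze-and-excitation to a feature map.

    P:  feature map of shape (M, B, C)
    M1: reduction weights of shape (C // r, C)
    M2: expansion weights of shape (C, C // r)
    Returns the rescaled feature map and the channel weights S.
    """
    k = P.mean(axis=(0, 1))                       # compression: one value per channel, shape (C,)
    hidden = np.maximum(M1 @ k, 0.0)              # reduction followed by ReLU, shape (C // r,)
    S = 1.0 / (1.0 + np.exp(-(M2 @ hidden)))      # expansion followed by sigmoid, shape (C,)
    return P * S, S                               # rescale: broadcast S over the spatial axes

rng = np.random.default_rng(0)
height, width, C, r = 4, 4, 8, 4                  # toy sizes; r is the reduction ratio
P = rng.normal(size=(height, width, C))
M1 = rng.normal(scale=0.5, size=(C // r, C))
M2 = rng.normal(scale=0.5, size=(C, C // r))
P_scaled, S = se_block(P, M1, M2)
print(P_scaled.shape, S.round(3))
```

Because every entry of S lies strictly between 0 and 1, the block can only attenuate channels relative to each other; it never changes the spatial layout of a feature map.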

LSTM Builds Sequence Classifier.
Since microexpression changes occur in continuous time, it is difficult to identify them accurately without using their temporal information. Therefore, in order to exploit the temporal variation of the expression sequence, a recurrent neural network is used to process input sequences of arbitrary length, which makes the time-dimension information easier to handle. A bidirectional recurrent neural network with LSTM nodes is used to process the time-series data, and a long-term recurrent convolutional network is constructed to judge whether a given sequence contains relevant microexpressions and to classify it. The expression feature input sequence $\text{MicroE\_Features} = (i_1, \ldots, i_N)$, the forward and backward hidden sequences $\overrightarrow{b} = (\overrightarrow{b}_1, \ldots, \overrightarrow{b}_N)$ and $\overleftarrow{b} = (\overleftarrow{b}_1, \ldots, \overleftarrow{b}_N)$, and the output sequence $j = (j_1, \ldots, j_N)$ of the bidirectional LSTM model are defined. Then, the output sequence $j$ is updated as
$$\overrightarrow{b}_n = B\left(M_{i\overrightarrow{b}}\, i_n + M_{\overrightarrow{b}\overrightarrow{b}}\, \overrightarrow{b}_{n-1} + h_{\overrightarrow{b}}\right)$$
$$\overleftarrow{b}_n = B\left(M_{i\overleftarrow{b}}\, i_n + M_{\overleftarrow{b}\overleftarrow{b}}\, \overleftarrow{b}_{n+1} + h_{\overleftarrow{b}}\right)$$
$$j_n = M_{\overrightarrow{b}j}\, \overrightarrow{b}_n + M_{\overleftarrow{b}j}\, \overleftarrow{b}_n + h_j$$
where $M$ denotes the weights of the bidirectional LSTM model, $h$ is the bias term, and $B(\cdot)$ is the activation function, which is computed with long short-term memory cells. The bidirectional LSTM and the memory cell are shown in Figures 4 and 5. In Figure 5, $f_n$, $x_n$, and $o_n$ represent the forget gate, input gate, and output gate, respectively, and $C_n$ represents the state of the memory cell at time $n$. The LSTM inputs are the spatial features extracted from all sequence frames using the pretrained model. A single-layer bidirectional LSTM structure is used in this paper, with a hidden layer of 512 nodes. A dropout layer is placed between the LSTM hidden layer and the fully connected layer to mask neurons randomly with a certain probability. It enhances the robustness of the network nodes by reducing the co-adaptation between neurons.
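A bidirectional LSTM pass over a feature sequence can be sketched in NumPy as below. This is an illustrative sketch with random weights and toy sizes (the paper uses 512 hidden nodes and 2048-dimensional inputs); dropout and the fully connected layer are left out, and the names `lstm_pass`, `bilstm`, and `make_params` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_pass(seq, W, U, h):
    """Run one LSTM direction over seq of shape (N, d).

    W: (4H, d) input weights, U: (4H, H) recurrent weights, h: (4H,) biases.
    Returns the hidden states, shape (N, H).
    """
    H = U.shape[1]
    b = np.zeros(H)                      # hidden state
    C = np.zeros(H)                      # memory cell state C_n
    states = []
    for i_n in seq:
        z = W @ i_n + U @ b + h
        f_n = sigmoid(z[:H])             # forget gate
        x_n = sigmoid(z[H:2 * H])        # input gate
        o_n = sigmoid(z[2 * H:3 * H])    # output gate
        g_n = np.tanh(z[3 * H:])         # candidate cell update
        C = f_n * C + x_n * g_n
        b = o_n * np.tanh(C)
        states.append(b)
    return np.stack(states)

def bilstm(seq, fwd, bwd):
    """Concatenate forward and backward hidden states per time step, shape (N, 2H)."""
    forward = lstm_pass(seq, *fwd)
    backward = lstm_pass(seq[::-1], *bwd)[::-1]   # process reversed, then re-align in time
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
N, d, H = 10, 32, 8                      # toy sizes, not the paper's

def make_params():
    return (rng.normal(scale=0.1, size=(4 * H, d)),
            rng.normal(scale=0.1, size=(4 * H, H)),
            np.zeros(4 * H))

seq = rng.normal(size=(N, d))
fwd, bwd = make_params(), make_params()
out = bilstm(seq, fwd, bwd)
print(out.shape)
```

The forward half of each output row only sees frames up to that time step, while the backward half only sees frames from that step onward, which is what lets the classifier use context on both sides of a frame.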

SLRCN Is Used for Microexpression Recognition.
According to the improved method, for a given sequence of microexpressions, the steps of microexpression recognition in this paper are as follows:
(1) Load the microexpression video files to establish the sequence set $I = (I_1, I_2, I_3, \ldots, I_T)$ and its corresponding label set $J = (J_1, J_2, J_3, \ldots, J_T)$. $I_x$ represents the x-th microexpression sequence in the set, $I_x = (i^x_1, i^x_2, \ldots, i^x_{t_x})$, where $i^x_y$ represents the y-th image in the x-th microexpression sequence and $t_x$ is the length of the x-th microexpression sequence. $J_x$ is the x-th label in the set.
(2) First normalize the sequence length: the number of LSTM time steps is set to a fixed value, so that every sequence has the same length. Face detection is then performed on the normalized video sequence images in turn to extract the face region. The effective image sizes are normalized, and the processed data set $I = (I_1, I_2, I_3, \ldots, I_T)$ is obtained. This step makes the input video sequence suitable as input to the CNN. Because the collected microexpression sequences contain a lot of noise and redundant information, it is necessary to remove irrelevant areas in the image and eliminate data noise. Face alignment and face cropping are performed on the microexpression sequences in the data set. A Haar face detector is used to detect faces, and the active appearance model (AAM) algorithm is used to extract facial feature points in the neutral expression state of each microexpression sample sequence. According to the coordinates of the feature points, the face contour is cut out, and the image is normalized to 224 × 224 × 3 so that size differences do not affect the results.
(3) Facial features are extracted using transfer learning with the pretrained weights of the VGGFace model, and these weights are fine-tuned so that the model adapts more effectively to microexpressions and converges faster. The network input is a 224 × 224 × 3 facial expression image, and the output is a feature vector $i$ of length 2048 obtained from the fully connected layer after the global average pooling layer.
In Formula (7), $w_x \in \mathbb{R}^{t}$, and the feature vector $i$ output by the extractor is L2-normalized to obtain $\hat{i}$.
Finally, the features obtained are saved into the set $I = (I_1, I_2, I_3, \ldots, I_T)$, and the feature set is established. Here, $I_x = (i^x_1, i^x_2, i^x_3, \ldots, i^x_N)$ is the x-th extracted feature sequence in the set. It is the $N \times 2048$ matrix generated for one sequence, and $i^x_n$ is the n-th feature vector of that sequence.
(4) Establish the sequential-input, fixed-output prediction $j = F(I_x; M)$, where $F$ is the activation function, $M$ denotes the decision parameters of the bidirectional LSTM model, and $j$ is the prediction over multiple categories. The implementation steps are shown in Figure 6.
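The L2 normalization of the extracted feature vectors can be illustrated with a short NumPy sketch. The random matrix below merely stands in for the N × 2048 output of the feature extractor; the function name `l2_normalize` and the `eps` guard against zero vectors are assumptions added for the example.

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    """Scale each row vector to unit Euclidean length.

    features: array of shape (N, D), one feature vector per frame.
    eps guards against division by zero for all-zero vectors (assumption).
    """
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    return features / np.maximum(norms, eps)

rng = np.random.default_rng(0)
raw_sequence = rng.normal(size=(10, 2048))   # stand-in for extractor output, N = 10 frames
sequence = l2_normalize(raw_sequence)        # one N x 2048 feature sequence
print(np.linalg.norm(sequence, axis=1)[:3])
```

Normalizing each frame vector keeps the scale of the LSTM inputs comparable across frames and subjects, regardless of the raw activation magnitudes.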

Result Analysis and Discussion
In order to test the performance of the model, the CASME-II data set was used. The network model was trained according to the method in this paper, and the effectiveness of the method was verified.

The Data Set.
The CASME-II data set was used for the experiments. CASME-II is a database of naturally induced microexpressions created by Fu Xiaolan's team at the Chinese Academy of Sciences. It contains 255 microexpression samples (video clips) from 26 Asian participants with an average age of 22 years. The data set was collected under appropriate lighting conditions in a strictly controlled experimental environment, and the image resolution is 640 pixels × 480 pixels. Each sample in the database is labeled with the start frame, the end frame, and the corresponding microexpression label. The database provides seven classes: happiness, surprise, disgust, fear, sadness, repression, and others. The microexpressions captured in the database are relatively pure and clear, with no noise such as head movements or unrelated facial movements. The data set in this paper is divided into 5 categories, as shown in Table 1.

Data Set Preprocessing.
In order to reduce the differences between individuals and between microexpressions, the microexpression sequences in the data set are preprocessed first. Face alignment is carried out on each image, and the facial expression region is cropped. The resolution of the image frames is uniformly adjusted to 224 pixels × 224 pixels to match the input spatial dimension of the VGGFace network model. Because the number of frames varies between microexpression sequences in the data set, the Temporal Interpolation Model (TIM) is used to interpolate each image sequence into a fixed-length sequence of 20 frames. The 20-frame sequence is split into two 10-frame time series, and the 10-frame samples are then saved as training data. Two sets of data are thus obtained from each video.
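The length normalization and splitting can be sketched as follows. Note the substitution: plain linear interpolation along the time axis is used here as a simple stand-in for TIM, which is a different and more elaborate method. Splitting into the first and second half is also an assumption, since the text does not say how the two 10-frame series are formed. The names `resample_frames` and `split_halves` are hypothetical.

```python
import numpy as np

def resample_frames(frames, target_len=20):
    """Resample a clip to target_len frames by linear interpolation in time.

    frames: array of shape (T, H, W, C). Linear interpolation is a stand-in
    for TIM (assumption), chosen only to keep the example short.
    """
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]
    if T == 1:
        return np.repeat(frames, target_len, axis=0)
    pos = np.linspace(0, T - 1, target_len)       # fractional source positions
    lo = np.floor(pos).astype(int)                # frame just before each position
    hi = np.minimum(lo + 1, T - 1)                # frame just after, clamped at the end
    w = (pos - lo).reshape(-1, *([1] * (frames.ndim - 1)))
    return (1.0 - w) * frames[lo] + w * frames[hi]

def split_halves(seq):
    """Split a sequence into its first and second half (assumed split rule)."""
    half = len(seq) // 2
    return seq[:half], seq[half:]

# Toy clip: 7 frames of 2 x 2 pixels whose value equals the frame index.
clip = np.arange(7, dtype=float).reshape(7, 1, 1, 1) * np.ones((1, 2, 2, 1))
seq20 = resample_frames(clip, 20)
first, second = split_halves(seq20)
print(seq20.shape, first.shape, second.shape)
```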
Because the number of samples is small, the data set needs to be expanded. In this paper, mirroring is adopted for this purpose: the samples in the data set are horizontally mirrored one by one.
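Horizontal mirroring of a frame sequence amounts to reversing the width axis of every frame. A minimal sketch, assuming clips are stored as arrays of shape (T, H, W, C) and the data set as a list of (clip, label) pairs; the names `mirror_sequence` and `augment` are hypothetical.

```python
import numpy as np

def mirror_sequence(frames):
    """Flip every frame left to right. frames has shape (T, H, W, C)."""
    return frames[:, :, ::-1, :]

def augment(dataset):
    """Return the original samples followed by their mirrored copies.

    dataset: list of (frames, label) pairs; labels are unchanged by mirroring.
    """
    return dataset + [(mirror_sequence(frames), label) for frames, label in dataset]

frames = np.arange(2 * 3 * 4 * 1).reshape(2, 3, 4, 1)   # toy clip: 2 frames of 3 x 4 pixels
dataset = [(frames, 3)]
augmented = augment(dataset)
print(len(dataset), "->", len(augmented))
```

Mirroring doubles the number of samples without changing any label, since an expression class does not depend on which side of the face is on the left.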

Experimental Analysis.
The algorithms adopted in literature [26], literature [27], and literature [28] and the algorithm proposed in this paper were trained for comparison. After 50 iterations, the changes in recognition accuracy and loss for the above four algorithms are shown in Figures 7 and 8, respectively.
In order to evaluate the actual performance of the model, the above four algorithms were each run on the test set. The changes in model accuracy and loss for each algorithm are shown in Figures 9 and 10.
Analysis of the recognition accuracy and loss shows that the four algorithms perform differently on the training set and the test set. The algorithm proposed in this paper performs best on both the training set and the test set, while the algorithm in literature [26] performs worst on both.
To further analyze the effects of the above four algorithms, ROC curves are drawn in this paper, as shown in Figure 11.
The ROC curve is a graphical way to show the trade-off between the true positive rate and the false positive rate of a classifier. The true positive rate is plotted along the Y-axis and the false positive rate along the X-axis. In the ROC plot, a model whose curve lies nearer the upper left corner is better. In Figure 11, the model nearest the upper left corner is the model proposed in this paper, which is suitable for facial expression recognition. The worst-performing model is that of literature [27], which is farthest from the upper left corner. In addition, the area under the ROC curve is another criterion for classifiers: the larger the area, the better the predictive ability of the model. In terms of total area, the model proposed in this paper has the largest area.
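How an ROC curve and its area are computed can be shown with a small NumPy sketch for the binary case. The labels and scores below are made-up toy values, not results from this paper. The sketch assumes distinct scores and at least one sample of each class; for the multi-class setting it would have to be applied per class (for example one-vs-rest), which is an assumption and not something the text specifies.

```python
import numpy as np

def roc_points(labels, scores):
    """Return (fpr, tpr) arrays obtained by sweeping the decision threshold.

    labels: 1 for positive, 0 for negative. Assumes distinct scores and
    at least one positive and one negative sample.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")    # highest score first
    labels = labels[order]
    tps = np.cumsum(labels == 1)                  # true positives at each cut
    fps = np.cumsum(labels == 0)                  # false positives at each cut
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example with made-up values.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_points(labels, scores)
auc = area_under_curve(fpr, tpr)
print(round(auc, 4))
```

The area equals the fraction of positive and negative pairs that the scores rank correctly, which is why a curve hugging the upper left corner corresponds to an area close to 1.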

Conclusion
The application of facial expression recognition in learning scenes is a trend in constructing the new classroom. With the help of the research foundations of informatics, psychology, and pedagogy, learners' learning state can be studied through expression analysis. This paper focuses on common problems in current research on microexpression recognition and realizes the recognition and classification of microexpression sequences through deep learning. Based on the excellent performance of LRCN in behavior recognition, the SLRCN method is proposed as an improvement. This method is more suitable for microexpression data sets. In order to reduce the risk of overfitting when training the deep network, the feature set of facial expression frames is extracted through the pretrained VGGFace model using transfer learning. The feature sets are fed into the bidirectional LSTM network to address the short duration and time dependence of microexpression changes. Experimental results show that this method has high accuracy. However, the recognition rate is still limited, mainly by the insufficient amount of labeled microexpression data, the uneven distribution of the classes, and the generally weak intensity of microexpressions. Further work is needed to enrich the data. Based on dynamic expression sequence analysis of learners' emotions, a psychological characteristic model was established to study the relationship between learning state and emotional changes in the learning process, so as to promote the progress of microexpression recognition in teaching quality evaluation.
Data Availability
The labeled data set used to support the findings of this study is available from the author upon request.

Conflicts of Interest
The author declares that there are no conflicts of interest.