A Method for Analyzing Learning Sentiment Based on Classroom Time-Series Images

With the development of smart classrooms, analyzing students' emotions during classroom learning is an effective means of accurately capturing their learning process. Although facial expression-based emotion analysis methods are effective in analyzing classroom learning emotions, current research focuses on facial expressions alone and does not consider the fact that the same expression under different postures does not represent the same emotion. To provide a continuous and deeper understanding of students' learning emotions, this study proposes an algorithm to characterize learning emotions based on classroom time-series image data. First, a facial expression dataset for classroom scenarios is established to address the lack of expression databases collected in real teaching environments. Second, to improve the accuracy of facial expression recognition, a residual channel cross transformer masking net expression recognition model is proposed. Finally, to address the problem that the existing research dimension of learning emotion is too narrow, this paper fuses the facial expression and head posture data obtained from deep learning models and innovatively proposes a Dempster–Shafer evidence-theoretic fusion model to characterize the learning emotion within the lecture duration of each knowledge point. The experiments show that both the proposed expression recognition model and the learning sentiment analysis algorithm perform well, with the expression recognition model achieving an accuracy of 73.58% on the FER2013 dataset. The proposed learning emotion analysis method provides technical support for holistic analysis of student learning effects and evaluation of students' understanding of the knowledge points.


Introduction
Improving learning outcomes is a constant theme in education, and learning outcomes can be analyzed in terms of behavioral engagement, emotional engagement, and cognitive engagement. In the classification of educational objectives, Bloom's treatment of emotion as a separate broad area indicates the importance of emotion in the analysis of learning outcomes in education. Morrish et al. [1] demonstrated that emotions can influence behavior, thinking skills, and decision-making abilities, and Shen [2] pointed out that students' affective states are an important factor in learning outcomes. Research has shown that students' learning emotions in the classroom are an important indicator of students' classroom learning status and classroom learning effectiveness.
In traditional classrooms, teachers mainly judge students' learning emotions through observation, but a teacher has limited capacity to perceive the learning emotions of many students simultaneously. With the development of artificial intelligence, it has become crucial to reduce teacher effort through automated machine learning methods that automatically select models for data analysis [3]. Therefore, Pei and Shan [4] developed a classroom microexpression recognition algorithm based on a convolutional neural network and automatic face detection, which improved the recognition rate and provided a new direction for the application of deep learning to classroom expression recognition.
However, many current studies analyze students' classroom learning emotions from a single dimension, which has certain limitations. For example, Fakhar et al. [5] developed a real-time automatic emotion recognition system that uses a deep learning model to identify three emotions (happy, sad, and fear) for classroom assessment. Liu et al. [6] proposed a new approach to infrared facial expression recognition with multilabel distribution learning, constructing an expression recognition network through label learning based on the Cauchy distribution function, through which the seven traditional labels of angry, scared, disgusted, happy, sad, surprised, and neutral were detected to analyze students' classroom learning emotions. However, these studies have some limitations: the expression categories they use do not apply to real classroom scenarios, and the accuracy of the analyzed learning emotions suffers when the same facial expression is produced either by a student thinking about a problem or by the surrounding environment. To remedy this deficiency, multimodal emotion recognition methods have been developed. In multimodal recognition, the evaluation accuracy of the model is mainly improved by analyzing the combined features among different data sources [7]. Based on this, Wang et al. [8] have proposed multimodal deep belief networks combining different features from multiple physiological-psychological signals and video signals to obtain fused features of each modality for a more accurate assessment of emotion. However, this type of research is extremely demanding in terms of data collection conditions and is not universally applicable within the larger classroom environment. In recent years, classroom recording via video cameras has become the main means of data collection in smart classrooms. From a data analysis perspective, the use of appropriate scientometric analysis methods to map, mine, sort, and analyze classroom data in order to show the logic and relationships among them, together with combined predictive approaches, can help improve the accuracy of analysis [9, 10]. Therefore, to address the lack of existing classroom expression databases and the inaccuracy caused by analyzing learning emotions in a single dimension, this paper presents a flexible and diverse analysis of classroom data in smart classroom scenarios and proposes a learning emotion analysis method applicable to real classroom environments, which acquires facial expression and head posture features and fuses these multidimensional features using Dempster-Shafer theory (DST) to achieve an accurate characterization of students' learning emotions in the classroom. The main contributions of this study can be summarized as follows:
(1) An expression dataset applicable to a real classroom environment is constructed to provide a database for expression recognition in classroom teaching videos.
(2) A residual channel cross transformer masking net (RCTMasking-Net) is proposed, which uses down-sampling and up-sampling for multiscale feature extraction and fusion, combining shallow and deep information to increase the effective receptive field of the model, while using a channel cross-attention mechanism for information fusion to better capture feature information, so that the model is less likely to lose too much information at the shallow level and can thus better classify expressions.
(3) A learning emotion representation model based on multidimensional temporal image analysis is proposed, which integrates facial expression and head posture features to analyze learning emotions, avoiding the limitations of single-dimensional analysis and not only better characterizing students' classroom learning emotions but also reflecting their learning outcomes.
The rest of the paper is organized as follows. Section 2 presents the relevant research work. Section 3 proposes a method for analyzing learning sentiment based on classroom temporal images. Section 4 presents the experimental procedure and a comparative analysis of the results of different experiments to evaluate the performance of the proposed algorithm. Finally, Section 5 concludes the paper.

Related Work
In the field of education, learning effects can be analyzed through facial expressions, head posture, and multimodal emotions with artificial intelligence technology in the offline classroom. Therefore, this paper will present research in these areas.
In a study on student-based facial expression recognition, Yi [11] analyzed students' affective states from their learning status, learning level, and learning effectiveness and established a student affect model with an affect-based evaluation index system. Han et al. [12] proposed a classroom evaluation method based on video analysis of students' facial expressions, combining the different states of other organs to evaluate students' states, and analyzed classroom effectiveness through head angle, eyebrow, eye, and lip states. Krishnnan et al. [13] proposed a new algorithmic framework for keyframe recognition in video using structural similarity to improve the recognition rate, incorporating facial expressions as well as sleepiness detection to sense the student's learning status. Mukhopadhyay et al. [14] proposed a method for assessing students' affective states in online learning based on facial expressions, which was effective in assessing learning outcomes. The aforementioned studies achieved good results in expression recognition, but the expressions defined were the six categories of happiness, sadness, anger, fear, surprise, and disgust, which do not apply to assessing students' emotional state during classroom learning. Pabba and Kumar [15] proposed a real-time student group engagement monitoring system that analyzes students' facial expressions to obtain academic affective states related to the learning environment: boredom, confusion, concentration, frustration, yawning, and sleepiness.
In the study of emotion based on head posture, Huang et al. [16] distinguished students' classroom emotional states by head posture and facial expressions and proposed a method for locating facial feature points based on a cascaded deep convolutional neural network. Duan [17] analyzed learning emotion by detecting attention and proposed an attention detection method based on head-up/head-down and eye closure detection. Leelavathy et al. [18] used a variety of machine learning techniques to predict student attention and learning affect using eye movements and head posture. Xu and Teng [19] proposed a classroom attention scoring system based on head Euler angles, introducing spatial information to modify the Euler angles so that more accurate angles can be obtained to assess attention and thus analyze students' learning emotions. The abovementioned studies explored the relationship between head posture and learning outcomes; however, they did not consider that analysis from a single dimension alone leads to inaccurate results. Nevertheless, these studies provide a reference for extracting head posture features as input for multimodal analysis.
In multimodal emotion research, Yang et al. [20] proposed a multimodal emotion computation model combining logic functions with a framework containing emotion expression patterns such as speech, text, and facial expressions, computing emotion by analyzing emotional interactions in online collaborative learning. Li et al. [21] proposed a multichannel learning sentiment analysis method using speech and image data, together with a quantitative pleasure-arousal-dominance sentiment scale, to analyze learning states. Peng and Nagao [22] proposed a multimodal intelligence detection model to identify students' classroom learning status through the multimodal fusion of face, heart rate, and voice. Peng et al. [23] extracted eye, lip, and head features from interactive videos of a student online tutoring system and combined them with electroencephalogram brainwave sensor data to analyze learner effectiveness. Ling et al. [24] used multiple deep learning models to obtain head posture and classroom audio information and fused the corresponding audio information and head posture to detect students' learning attention and analyze the classroom learning effect. The research described previously conducted fusion analysis across multiple modalities, avoiding the limitations of a single dimension and improving accuracy in analyzing student learning outcomes. However, the data required by the abovementioned studies are difficult to obtain in most smart classrooms, and the cost of wearable sensors is too high for them to be applicable in a large smart classroom environment.
In summary, there is a paucity of research on the definition of student expressions in the classroom; most studies use expression categories that do not apply to real teaching environments, and there is still much room for improvement in studies that analyze learning outcomes in real classroom scenarios. Therefore, this paper constructs a dataset of expressions applicable to the classroom environment. The fusion analysis of students' classroom head posture and facial expressions in a real teaching environment is used to obtain students' learning emotions, provide technical support for assessing learning effectiveness, and help teachers understand students' classroom learning promptly as well as intervene with individual students to improve classroom learning effectiveness.

A Model of Affective Representations of Classroom Learning
The block diagram of the whole classroom student learning emotion representation model is shown in Figure 1. In this paper, RCTMasking-Net and HeadPoseEstimate are used to obtain facial expression and head pose data, and a modified DST model is used to fuse facial expression and head pose to obtain the student learning emotions within the lecture duration of each knowledge point. Face detection is performed using trained MTCNN and FaceNet networks.

Expression Recognition Based on RCTMasking-Net.
The RCTMasking-Net network model is shown in Figure 2. This network structure uses ResNet34 as the backbone, splits the network into four modules, uses the four residual layers of ResNet34 for feature processing, and adds a channel cross transformer masking (CTMasking) block responsible for the corresponding feature mapping in each module. The CTMasking block mainly draws on UCTransNet, a U-Net-based network structure [25].
Unlike the traditional U-Net, this network does not use the skip connections of the U-Net but replaces them with a channel cross transformer (CCT) and channel cross attention (CCA).
First, the collected classroom face images are used as the original input, and the input feature map F ∈ R^(C×W×H) is obtained through the first convolution-pooling step of ResNet34. Second, the feature map F passes through the first residual layer to obtain the feature map F_RL = RL(F), with F_RL ∈ R^(C×W×H). Then, the feature map F_RL passes through the CTMasking block to obtain a masking feature map F_CTM = CTM(F_RL) of the same size. Finally, the output feature map F_RCTM of the first RCTM block is obtained through equation (1) [26]. Equation (1) improves the effective receptive field of the model without losing too much information at the shallow level. The feature map F_RCTM is more informative for assessment than the feature map F_RL [27].
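Although equation (1) itself is not reproduced in the extracted text, its role can be illustrated with a short sketch. The combination rule below assumes the ResMaskingNet-style residual masking form F_RCTM = F_RL ⊙ (1 + F_CTM); the function name and toy tensor shapes are illustrative, not the paper's exact formula.

```python
import numpy as np

def rctm_block_output(f_rl, f_ctm):
    """Combine a residual-layer feature map F_RL with its same-size masking
    map F_CTM. Assumes the ResMaskingNet-style rule F_RCTM = F_RL * (1 + F_CTM),
    which preserves the shallow features while amplifying masked regions."""
    return f_rl * (1.0 + f_ctm)

# Toy feature maps with shape (C, H, W)
f_rl = np.ones((4, 8, 8))          # residual-layer features
f_ctm = np.full((4, 8, 8), 0.5)    # masking map values in [0, 1]
f_rctm = rctm_block_output(f_rl, f_ctm)
print(f_rctm.shape)  # (4, 8, 8)
```

Because the mask enters multiplicatively around an identity term, regions with a zero mask still pass the original features through unchanged, which is what keeps shallow information from being lost.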

Channel Cross Transformer.
As shown in Figure 2, the output features of the first three down-sampled convolutions are adjusted into two-dimensional tiled token sequences of patch size P, P/2, and P/4, respectively. These three layer outputs are labeled F_i (i = 1, 2, 3), with F_i ∈ R^(HW/i²×C_i) used to generate the queries Q_i (i = 1, 2, 3) and F_Σ = concat(F_1, F_2, F_3) used to generate the key K and value V for multiheaded cross-attention, as in equation (2). In equation (3), the similarity matrix M_i between Q_i and K is computed, and V is weighted by M_i through the cross-attention mechanism so that the gradients can propagate smoothly when the channels are subjected to the attention operation; the calculation is shown in equation (4), where ψ(·) and σ(·) denote the respective normalization operations [28]. Unlike self-attention, this method performs the attention operation along the channel dimension and uses instance normalization, so that the similarity matrix of each instance can be normalized and the gradient can propagate smoothly.
First, after the multiheaded attention mechanism, the output is given by equation (4), where N is the number of attention heads. Next, after the MLP and residual connection, the output is given by equation (5).
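A minimal numerical sketch of the channel-wise attention described above follows. Since the exact projections of equations (2)-(5) are not reproduced in the text, this sketch only demonstrates the core idea: the similarity matrix is formed between channels rather than between tokens, and it is instance-normalized before the softmax. The shapes and the single-head simplification are assumptions.

```python
import numpy as np

def channel_cross_attention(q, k, v, eps=1e-6):
    """Single-head channel-wise cross attention (simplified sketch).

    q: (L, Cq), k/v: (L, Ck) token sequences of L patch tokens. The similarity
    matrix is computed between CHANNELS (Cq x Ck) and instance-normalized
    before the softmax so that gradients propagate smoothly."""
    sim = q.T @ k                                   # (Cq, Ck) channel similarity
    sim = (sim - sim.mean()) / (sim.std() + eps)    # instance normalization
    att = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)  # row softmax
    return (att @ v.T).T                            # back to (L, Cq)

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))    # 16 tokens, 8 query channels
k = rng.standard_normal((16, 12))   # 12 key channels
v = rng.standard_normal((16, 12))
out = channel_cross_attention(q, k, v)
print(out.shape)  # (16, 8)
```

Note that the attention map has shape (Cq, Ck) regardless of the number of tokens, which is what lets queries and keys with different channel counts be fused.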

Channel Cross Attention.
In the channel cross-attention block, O_i ∈ R^(C×H×W) and D_i ∈ R^(C×H×W) are used as inputs, where D_i (i = 2, 3) is the result of up-sampling after concatenating O_i with the channel attention block output M_i, and D_1 is the result of up-sampling after the third down-sampling in the network. A global average pooling layer is used for spatial compression to produce the vector ς(X) ∈ R^(C×1×1), whose k-th channel ς_k(X) is calculated as in equation (6), where the terms in equations (6) and (7) are the weights of the two linear layers and the ReLU operator [29].
A single linear layer and sigmoid function are used to construct the attentional feature O_i, as in equation (7); the activation function σ(M_i) in equation (7) indicates the importance of each channel.
Finally, O_i and D_i are concatenated and up-sampled in turn before passing through a convolution layer to generate F_CTM.
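The gating step of the CCA block can be sketched as follows, assuming a single linear layer between the pooled vector and the sigmoid, as the text describes. The weight shape and toy inputs are illustrative, since equations (6) and (7) are not reproduced in the extracted text.

```python
import numpy as np

def cca_gate(o_i, d_i, w):
    """Channel cross-attention gate (sketch of the GAP -> linear -> sigmoid path).

    o_i, d_i: (C, H, W) feature maps; w: (C, C) weight of the single linear
    layer. Global average pooling compresses space to one value per channel,
    and the sigmoid output re-weights the channels of o_i by importance."""
    gap = d_i.mean(axis=(1, 2))               # spatial squeeze: (C,)
    gate = 1.0 / (1.0 + np.exp(-(w @ gap)))   # sigmoid(M_i): channel weights
    return o_i * gate[:, None, None]          # attended feature map

c = 4
o_i = np.ones((c, 8, 8))
d_i = np.ones((c, 8, 8))
w = np.eye(c)                                 # identity weights for the demo
attended = cca_gate(o_i, d_i, w)
print(attended.shape)  # (4, 8, 8)
```

The gate acts per channel, so spatial structure in O_i is untouched; only the relative importance of channels changes.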

Head Pose Estimation.
The head pose estimation algorithm aims to calculate the Euler angles of each individual's head from the images. Face alignment is performed using 3D dense face alignment (3DDFA) [30]. The algorithm uses 3D morphable models [31] for the face representation, which is calculated as follows:

S = S̄ + A_id α_id + A_exp α_exp. (8)

In equation (8), S represents the 3D face shape, S̄ represents the average face shape, α_id is the shape parameter of the 3D base shape principal axes A_id, and α_exp is the expression parameter of the 3D offset shape principal axes A_exp. The 3D face shape is then projected onto the image plane using scaled orthographic projection:

V(p) = f · Pr · R · S + t_2d. (9)

In equation (9), V(p) is the projection function generating the 2D positions of the model vertices, f is the scale factor, Pr is the orthographic projection matrix, R is the rotation matrix constructed from the Euler angles, and t_2d is the translation vector.
The pitch, yaw, and roll angles are obtained from R. The head pose of student i over the duration t of the lecture on the n-th knowledge point is denoted by HP_id:

HP_id = {(p_i1, y_i1, r_i1), (p_i2, y_i2, r_i2), . . ., (p_it, y_it, r_it)}. (10)
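The step from R to the three Euler angles can be sketched as below. 3DDFA's exact angle convention is not stated in the text, so the ZYX-style decomposition here is an illustrative assumption.

```python
import math

def euler_from_rotation(r):
    """Recover (pitch, yaw, roll) in radians from a 3x3 rotation matrix R,
    using one common ZYX convention from head-pose work."""
    yaw = math.asin(-r[2][0])
    pitch = math.atan2(r[2][1], r[2][2])
    roll = math.atan2(r[1][0], r[0][0])
    return pitch, yaw, roll

# The identity rotation corresponds to a frontal, level head pose.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pitch, yaw, roll = euler_from_rotation(identity)
print(pitch, yaw, roll)  # all zero for a frontal pose
```

Applying this per frame over the lecture duration yields exactly the (p_it, y_it, r_it) triples collected in HP_id of equation (10).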

A DST-Based Approach to the Emotional Representation of Learning.
In the actual teaching environment, the three main positions of students' gaze are the PPT, the desktop, and the teacher, and the attention states that students show can be determined according to their gaze orientation in the classroom. Combining existing research and the actual environment, this paper sets the attention recognition framework as Ω = {attention, inattention, lookdown}. The attention hypothesis covers gaze towards the PPT and the teacher's position, the lookdown hypothesis covers looking at the desktop, and gaze towards other positions is classified as inattention.
In this paper, the yaw angle Y and pitch angle P in the obtained student head posture HP_id are used as two separate bodies of evidence: M = {Y, P}. The head data within the lecture duration of a knowledge point are used to assign basic probabilities to each part of the attention recognition frame based on the gaze thresholds corresponding to the pitch and yaw angles at each position. This paper proposes a new two-dimensional probability assignment method that not only fits the actual teaching scenario but also makes full use of the information contained in the two bodies of evidence, improving the accuracy of DST decision fusion to a large extent.
The assignment formula is shown in equation (11), where Y_threshold1 and Y_threshold2 are the thresholds when the yaw angle is towards the PPT and the teacher's position, respectively, and P_threshold is the threshold when looking down at the desk.
Based on the number of knowledge points n taught by the teacher and the teaching time t of each knowledge point, the basic probability assignment values of each body of evidence for each attention hypothesis were calculated over the time t. The probability values of the two bodies of evidence for each attention hypothesis were spatially fused [32] to obtain the probability of each attention state within the lecture duration t of the n-th knowledge point, with the maximum probability giving the main attention state A_sn of student s within that knowledge point, as shown in equation (12). In equation (12), K is the normalization factor, and m(h_1sn) and m(h_2sn) denote the values of the two bodies of evidence Y and P for the n-th knowledge point taught to student s over the duration t; the formula for calculating K is given in equation (13). Since, in learning emotion calculation, judging from facial expressions alone cannot accurately portray the learning emotion of the student in the learning scenario, this paper combines attention states with facial expressions to portray learning emotion.
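The fusion of equations (12) and (13) can be sketched for the singleton hypotheses of Ω as follows. The BPA values below are made up for illustration; the paper's actual assignments come from the yaw/pitch thresholds of equation (11).

```python
def dempster_combine(m1, m2):
    """Dempster's rule for two bodies of evidence over the same frame.

    m1, m2: dicts mapping singleton hypotheses to basic probabilities. With
    singleton focal elements only, the combined mass of hypothesis A is
    m1(A) * m2(A) / K, where K sums the non-conflicting products."""
    k = sum(m1[h] * m2[h] for h in m1)            # normalization factor K
    return {h: m1[h] * m2[h] / k for h in m1}

# Illustrative BPAs: yaw evidence strongly suggests attention, pitch mildly agrees.
m_yaw = {"attention": 0.7, "inattention": 0.2, "lookdown": 0.1}
m_pitch = {"attention": 0.6, "inattention": 0.1, "lookdown": 0.3}
fused = dempster_combine(m_yaw, m_pitch)
state = max(fused, key=fused.get)                 # main attention state A_sn
print(state)  # attention
```

Fusion sharpens agreement: here the fused attention mass is 0.42/0.47 ≈ 0.894, larger than either individual assignment, which is why combining the two angles is more decisive than using either alone.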

Experimental Results and Analysis
In this section, first, the performance of the proposed expression classification algorithm (RCTMasking-Net) is evaluated. The algorithm is trained on the publicly available FER2013 dataset, and the trained model is evaluated by metrics such as the confusion matrix and accuracy. Second, the proposed student learning emotion representation model is evaluated and analyzed to verify its reliability in analyzing learning outcomes. Table 3 shows the number of parameters as well as the accuracy of each model.
As can be seen from Table 3, although the network model in this paper has the largest number of parameters, it outperforms recent network models in terms of accuracy. The experiments show that the fusion of multiscale features extracted by down-sampling and up-sampling and the combination of shallow and deep information can increase the effective receptive field of the model, and that the channel cross-attention mechanism for information fusion can better capture feature information so that the model does not lose too much information in the shallow layers, which ultimately makes the effective receptive field of the model larger than that of other models and thus improves the recognition accuracy.
In this study, the training period was set to 100 epochs; the best accuracy reached 95.96% on the training set and 70.97% on the validation set, and the accuracy curves are shown in Figure 4.
In addition, as the models in this paper use ResNet34 and ResMaskingNet as the primary framework, the ResNet34, ResMaskingNet, and RCTMasking-Net models were compared under the ClassFaceD dataset. The results are shown in Table 4. The accuracy of the proposed model was 65.16%, compared to 62.23% and 63.35% for ResNet34 and ResMaskingNet, respectively. The results in Tables 3 and 4 show that the proposed network significantly outperforms the other network models in expression recognition.

Ablation Experiments.
To explore the role of the channel cross-attention mechanism in the RCTMasking-Net model, an ablation experiment was conducted on FER2013 against ResMaskingNet (i.e., without the channel cross-attention mechanism) under the same experimental parameters (Adam optimizer, initial learning rate of 0.0001, weight decay of 0.001, batch size of 48, and 50 training epochs); the confusion matrices of the experimental results are shown in Figure 5.
Figure 5 shows the correct recognition rates of both models for each category of facial expression, with the RCTMasking-Net model outperforming the ResMaskingNet model for most expressions. The experimental results show that incorporating the channel cross-attention mechanism better captures feature information and thus improves the accuracy of the model.

Sentiment Analysis of Learning over Whole Knowledge Point Lectures

Experiments on the Integration of Learning Emotions within the Lecture Duration of Knowledge Points.
In this paper, the collected keyframe image set is used as input data to obtain the head pose data of classroom students through the trained HeadPoseEstimate model. The network model achieves an average NME of 3.59% on the AFLW2000-3D dataset, with good recognition results and a computation time of 7.2 ms, and it outperforms other network models in multiperson scenarios. The results of the data passed through the HeadPoseEstimate model are shown in Figure 6. The model is used to obtain individual student head posture data for the duration of the lecture, which is then used to analyze the gaze direction of the students. In the classroom scenario, two gaze points are specified for the student's head up (towards the PPT and the teacher's position) and one for the head down (towards the desk). The yaw angle is used to determine whether the student's gaze lands on the PPT or the teacher's position, and the pitch angle is used to determine whether the student's gaze lands down at the desk. Because the spatial coordinates of the seats differ, the corresponding gaze landing points have different rotation angles. In this paper, the threshold values Y_threshold1, Y_threshold2, and P_threshold corresponding to each landing point are obtained by gazing at the corresponding landing point from each position. The thresholds and angle change curves obtained for seat 1 (the first position on the left in the first row in Figure 6) are shown in Figure 7.
Subsequently, experiments were conducted with six students over the lecture durations of 48 knowledge points. Table 5 shows the probability assigned by each body of evidence to the three attention states for the 1st student over the lecture durations of multiple knowledge points. Spatial fusion of the results in Table 5 yields probability values for each attention state for that student, and the results are shown in Table 6, in which m(h_11) represents the probability of each attention state for the 1st student over the lecture duration of knowledge point 1.
Table 6 shows that the probability of the attention state during the lecture duration of knowledge point 1 is 0.8796, so the attention state of student number 0616 during the lecture duration of knowledge point 1 is attention. The attention state during the lecture duration of knowledge point 3 was looking down at the desk, and the attention states of students within the other knowledge points can be deduced by analogy.
At the same time, the expression recognition method in Section 3.1 was used to obtain the temporal expression data of student number 0616 in the classroom and the main expressions within the lecture duration of each knowledge point, as shown in Figure 9. To portray the learning emotions within the lecture duration of a knowledge point, the most frequent expression within the lecture duration of a knowledge point was taken as the main expression for that knowledge point.
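The "most frequent expression" rule above can be sketched in a few lines; the label names are illustrative, not the dataset's exact categories.

```python
from collections import Counter

def main_expression(frame_labels):
    """Return the most frequent expression label among the per-frame labels
    within one knowledge point's lecture duration."""
    return Counter(frame_labels).most_common(1)[0][0]

frames = ["neutral", "confused", "neutral", "neutral", "happy"]
print(main_expression(frames))  # neutral
```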
The results are shown in Table 7, which combines students' attention states and main expressions during the lecture time of each knowledge point using formula (15). For presentation purposes, concentration is denoted by "1," doubt by "0," and distraction by "−1."

Analysis of the Results.
In this paper, we invited experts in education and psychology to watch the videos and label the learning emotions within the lecture time of the corresponding knowledge points; the emotion with the highest superimposed score was taken as the learning emotion within the lecture time of that knowledge point. Classroom tests were also used to assess the reliability of the students' learning emotions in the classroom. In this section, the accuracy of the DST fusion model-based approach to characterizing learning emotion is compared with the accuracy of characterizing student learning emotion using expression classification alone [8] and head posture alone [13]; the results are shown in Table 8. Finally, the results of Table 6 are combined with those of Table 7 to synthesize the learning emotions of these six students over the forty-eight lecture durations, and the classroom test scores of each student are compared, as shown in Table 9.
As can be seen from Table 8, there is a large margin of error in the actual classroom if only the single dimension of facial expressions is used to characterize students' learning emotions. The accuracy rates for students 0616 and 0621 were 27.27% and 54.55%, respectively, using facial expressions only, and 72.72% and 45.45% using head pose only. By reviewing the classroom videos, it was found that student number 0616 remained expressionless during the lecture, resulting in a lower accuracy rate from facial expressions alone. Student number 0621 had a puzzled expression during the lecture but was easily judged as concentrating from head posture alone, resulting in a lower accuracy rate. The experiment showed that, as most students remain expressionless in the classroom, their expressions are classified as neutral, and it is difficult to judge learning emotions from facial expressions alone. The analysis of head posture can only determine whether students are paying attention, ignoring the presence of doubt. These problems can be avoided when facial expressions are combined with head posture to characterize students' learning emotions.
Classroom tests can be used to assess the effectiveness of student learning, which is influenced by students' emotional states in the classroom. When students are in a positive affective state of learning, they have a higher level of mastery of the knowledge, and their test scores are relatively high.

Conclusion
In this paper, a facial expression recognition network based on a channel cross-attention masking block and a DST-based learning emotion analysis algorithm were proposed to improve the assessment of learning effectiveness in a real classroom environment using classroom time-series image data. The method predicts students' learning emotions in a real classroom setting and verifies its effectiveness against individual student performance. The experiments demonstrate that the learning emotion analysis algorithm can analyze learning emotions more accurately, effectively avoid the limitation of judging learning emotions by expression alone, help teachers better understand students' learning effectiveness, and support intervention measures to improve it. However, the method has the limitation that a gaze threshold must be set for each new position before a student's learning emotion can be analyzed. At the same time, students in the head-down state are judged as inattentive and their emotions are ultimately judged as wandering; the method ignores the possibility that students are reading books with their heads down, which leads to emotional misjudgment. In general, the proposed algorithm helps teachers analyze the overall classroom learning effect over the length of the lecture and consider whether to reinforce the teaching of particular content, which makes it well suited to real classroom situations. In addition, to better define classroom expressions, this paper samples and annotates classroom videos to create a facial expression dataset, ClassFaceD, which applies to the classroom environment. The analysis of learning effectiveness includes aspects such as learning effect, cognitive state, and active thinking. Therefore, future research will consider combining cognitive states to reflect students' classroom learning effectiveness, using student seating, interactions, and student relationships in smart classrooms to build classroom
temporal social network features, and combining temporal data from learner knowledge tests to accurately portray learners' cognitive states. The integration of cognitive state, classroom attention, and learning emotions will help teachers understand students' learning status in a timely and accurate manner and support them in optimizing the teaching process. At the same time, more data will be collected to create a rich and diverse dataset. In addition, we plan to deploy the algorithm to embedded devices for use in smart classrooms to continuously help improve learning outcomes.

Figure 4: Accuracy curves of the training and validation sets.

Figure 6: Results of the HeadPoseEstimate model for classrooms.

Figure 8: Curves of change for different angles for student number 0616.

Figure 9: Expressions of student number 0616 over a period of time and the main expressions over the lecture durations of some knowledge points: (a) a period of time and (b) some knowledge points.

Table 3: Comparison of the RCTMasking-Net model with other models.

Networks                     Parameters (×10^6)   Accuracy (%)
VGG19 [27]                   139.5                70.80
ResNet34 [27]                27.6                 72.42
EfficientNet-XGBoost [33]    —                    72.54
Inception-v3 [34]            37.0                 73.09
ResMaskingNet [27]           142.9                73.11
VGG [35]                     143.7                73.28
STN + TL [36]                —                    73.31
Cbam ResNet50 [27]           28.5                 73.39
LHC-Net [37]                 32.4                 73.39
LHC-NetC [37]                32.                  —
RCTMasking-Net (proposed)    —                    73.58

Table 4: Comparison of models under the ClassFaceD dataset.

Table 5: Probability distribution of each body of evidence for student number 0616 over the lecture durations of multiple knowledge points.

Table 6: Results after spatial fusion.

Table 7: Student learning emotions within lecture time for each knowledge point.

As can be seen from Table 9, the results of the learning emotion characterization model proposed in this paper correspond to the students' test scores: students with higher assessment scores are basically in a focused mood, indicating that the proposed model has good reliability. At the same time, Tables 7 and 9 show that student 0616 was basically in a distracted learning state in the early stage; the course teacher intervened by moving student 0616's seat to the front row, and after the 11th knowledge point, his learning state gradually became positive and his final overall test score improved. This shows that the proposed algorithm can accurately analyze students' learning emotions and provide a basis for teachers to take interventions to improve learning outcomes.