Innovative Research on the Development of Online Education Mode of Internet Thinking Based on the Discrimination of Learning Attention under the Analysis of Head Posture

With the rapid development of Internet technology and the spread of 5G and broadband, online education in China, especially mobile online education, is in full swing. Based on the development status of online education in China, this paper analyzes the innovative application of learning attention discrimination based on head posture analysis in the development of the online education mode of Internet thinking. Learning attention is an important factor in students' learning efficiency and directly affects their learning outcomes. To monitor students' learning attention effectively in online teaching, a method of distinguishing students' learning attention based on head posture recognition is proposed. In the tracking process, as long as the head angle of the current frame is close to the head angle of a key frame in a certain scale model, the visual angle apparent model can reduce error accumulation in large-scale tracking. A Dynamic Bayesian Network (DBN) model is used to infer students' Learning Attention Goal (LAG); it combines the relationships among multiple LAGs, multiple student positions, and multicamera face images. We measure the head posture through the similarity vector between the face image and multiple face categories, without explicitly calculating the specific head posture value. The test results show that the proposed model can effectively detect students' learning attention and has good application prospects.


Introduction
Traditional education can no longer meet the needs of modern education. The cost of computer production is declining, and the modern education mode with the computer as its medium is more and more widely used. China's online education has been on the rise, leading many offline education and training institutions to develop online education vigorously, and even many Internet companies that previously paid no attention to education have begun to enter the field [1]. China's online education is so full of vitality that everyone applauds it, but its development cannot be smooth sailing, and it is constrained by many problems. Solving these problems is the basic requirement for promoting the further development of online education.
Since the explosive birth of information transmission technology in the 19th century, people have never stopped exploring the application of new technologies in the field of education. Root et al. define an online education management system as "a software application for classroom education and after-school training, involving educational administration, information transmission, report generation and learning effect tracking" [2]. Lockee et al., however, define it as "the integration of networked tools to support online learning" [3]. Zhang J. and Zhang F. discuss the development trend and form of online higher education through research based on relevance theory, and emphasize that Massive Open Online Courses (MOOCs) are not only an online classroom but also an interaction between teachers and students [4]. Fehr introduces the practical application of MOOCs, analyzes their existing problems, and holds that "There is no doubt that the development of MOOCs is of great significance to higher education all over the world [5]. It provides opportunities for the online development of traditional higher education in a flexible and convenient way, but it cannot be considered that virtual education can replace traditional education. In disciplines such as medicine and architecture, the advantages of traditional higher education model are still incomparable." By analyzing the development data of online higher education in the United States from 2002 to 2014, Mcauliffe et al. preliminarily explore the development direction and path of online higher education in China and form their own theoretical model. In short, scholars at home and abroad started their research on online higher education late, and mostly focus on analyzing foreign literature. Even where a few models have been put forward, their feasibility and suitability are open to question [6].
Attention in visual learning is related to head posture and gaze direction. Research shows [7,8] that, in many cases, it is enough to analyze students' Learning Attention Goal (LAG) through head posture. Because students are not used to looking at a target out of the corner of their eyes for a long time, they turn their heads to face the target. Therefore, this paper uses the head posture to analyze the LAG of students, and we propose a Dynamic Bayesian Network (DBN) model to infer it. The head pose is measured by the similarity vector between the face image and multiple face categories, without explicitly calculating the specific head pose value.
The observations of the probability model include the face image and the face position under multiple cameras. We collected test data in the teaching environment, and the experimental results show that our model is effective.
Structure of This Paper. The first section introduces the research background, significance, and main innovations of this paper. Section two presents a summary of related research and introduces the research status of the key technologies (head pose estimation, face feature points, and attention recognition). Section three studies the discrimination process of learning attention based on head posture analysis. Section four analyzes and discusses the experimental results. Finally, the paper summarizes the full text, analyzes the problems in the current methods, and looks forward to future research directions.

Related Work
In recent years, more and more researchers have begun to study the problem of visual LAG recognition. Bce et al. study the problem of students' LAG recognition in a small round table environment, in which an omnidirectional camera is placed on the conference table [9]. Later, they studied a method of identifying LAG in an environment with multiple remote cameras [4]. Hamrah et al. also study the problem of LAG recognition in different conference environments [10]. However, in the meeting environment, students mainly sit in fixed seats, and their bodies do not move much. Zhou et al. also monitor students' LAG in an outdoor environment, but the range of head posture is limited to near-frontal faces. These works mainly deal with the analysis of multiple LAGs from fixed positions or of a single LAG from multiple student positions [11]. Our application environment, however, includes both multiple student locations and multiple LAGs. Zhao proposed a head pose recognition algorithm based on template matching. For each recognition object, multiple head images in different poses are extracted as sample images, and each image is marked with the corresponding pose parameters [12]. Yeager et al. found that in most cases, students' attention target behavior can be obtained by analyzing the head posture angle [13]. Yfka et al. detect face feature points through a random cascade regression tree and use the N-point perspective algorithm to estimate head posture, thus realizing the visualization of students' learning attention [14].
Hirata and Kusatake establish a human-computer interaction system using a learning attention detection model, which can effectively judge the position and head posture of the target person in a multiperson environment [15]. Guo et al. study the problem of attention target recognition in different conference environments. However, in the conference environment, users mainly sit in fixed seats, and their bodies do not move much [16]. Lu and Yanmin study the problem of attention target recognition in an outdoor environment. They mainly analyze whether passers-by looked at posters on the wall [17]. Xiao et al. introduce the run-length matrix of binary patterns into the random feature selection of a random tree, which improves the classification ability of a single decision tree and achieves a better recognition rate for multiclass discrete head pose estimation [18]. Ren et al. define the distance between the eyes and the screen and the range of head posture by calibrating the system, and use a single light source to track the gaze direction of the eyeballs [19]. In [20], three-dimensional faces are recognized by combining stereo vision information such as rotation and pitch, and the learning attention direction of the gaze can be accurately tracked through the details of eye images.
A review of the literature above shows that most of these methods place certain requirements on equipment. Because of the low resolution of individual faces in the classroom scene, together with illumination changes, occlusion, and large posture changes in the environment, learning attention recognition is very difficult. Therefore, we adopt a noninvasive learning attention recognition method based on the head posture direction and recognize the LAG of many people at the same time in a large classroom scene.

Analysis of Students' Learning Attention.
The main purpose of this study is to distinguish students' learning attention and to determine its direction by estimating, from the head posture, whether students' eyes are concentrated on the blackboard area.

Scientific Programming
As shown in Figure 1, when a student's eyes are focused on a point on the blackboard, such as P_1, the student is considered to be focused on learning. Conversely, when the student's eyes stay away from the blackboard area for a long time, such as at P_2, the student is considered to be distracted. Under normal circumstances, people are not used to looking sideways at the target they are paying attention to. Therefore, the rotation direction of the head can be taken as an approximation of the student's line of sight in order to analyze learning attention.
A coordinate system is established according to the classroom environment. It takes the center point of the blackboard as the coordinate origin, the horizontal direction to the right of the origin as the positive X-axis, the vertical direction from the origin as the positive Y-axis, and the direction perpendicular to the XY plane and pointing toward the students as the positive Z-axis. The point at which a student's line of sight reaches the edge of the blackboard is taken as the criterion for abnormal behavior, as shown in Figure 2. α_1, α_2, β_1, and β_2 are the thresholds of abnormal head deflection: α_1 and α_2 bound the rotation range in the θ_Yaw direction, and β_1 and β_2 bound the rotation range in the θ_Pitch direction. When the rotation of the head exceeds a threshold, the student's line of sight is considered to be outside the blackboard area, and learning attention is judged to be distracted.
Assume that the blackboard has a length of h and a width of d, and that the head center has coordinates F(x, y, z). When a student sits in the first row of the classroom, at points B and D shown in Figure 1, and looks at the left and right edges of the blackboard, the head reaches its maximum rotation range in the θ_Yaw direction, which is given by equation (1). When the student sits at point C and looks at the upper and lower edges of the blackboard, the head reaches its maximum rotation range in the θ_Pitch direction, which is given by equation (2). According to the actual teaching environment, assuming that the center point of the head coincides with the eyes and that the eyes of an adult student are 1.3 m above the ground, the head rotation range of the students is determined to be [−7°, 28°] in the θ_Pitch direction and [−48°, 48°] in the θ_Yaw direction.
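The threshold geometry described above can be illustrated with a minimal Python sketch. It is not a reproduction of equations (1) and (2): it assumes the blackboard lies in the plane z = 0, centered on the origin, with its length h along X and its width d along Y, and that each limit is the arctangent of the offset to a blackboard edge divided by the depth z. The function names are hypothetical.

```python
import math

# Thresholds reported in the text: yaw in [-48, 48], pitch in [-7, 28] (degrees).
PAPER_LIMITS = ((-48.0, 48.0), (-7.0, 28.0))

def head_angle_limits(h, d, head):
    """Return ((yaw_min, yaw_max), (pitch_min, pitch_max)) in degrees.

    Assumed geometry: blackboard centered at the origin in the plane z = 0,
    spanning [-h/2, h/2] along X and [-d/2, d/2] along Y; head at (x, y, z).
    """
    x, y, z = head
    if z <= 0:
        raise ValueError("head must be in front of the blackboard (z > 0)")
    yaw = tuple(math.degrees(math.atan2(edge - x, z)) for edge in (-h / 2, h / 2))
    pitch = tuple(math.degrees(math.atan2(edge - y, z)) for edge in (-d / 2, d / 2))
    return yaw, pitch

def is_attentive(yaw_deg, pitch_deg, limits=PAPER_LIMITS):
    """True when both head angles stay inside the given limits."""
    (yaw_lo, yaw_hi), (pitch_lo, pitch_hi) = limits
    return yaw_lo <= yaw_deg <= yaw_hi and pitch_lo <= pitch_deg <= pitch_hi
```

For example, `is_attentive(10, 5)` uses the ranges quoted in the text, while `head_angle_limits` can derive seat-specific limits under the stated assumptions.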

Discrimination Process of Learning Attention Based on Head Posture Analysis

Visualization of Students' Learning Attention Based on Head Pose Estimation.
Combining the advantages of the eye tracker and the single-camera learning attention analysis system, this paper proposes a visual analysis method of students' learning attention based on head pose estimation from a single image, and constructs a corresponding visual analysis system [21]. The front camera, installed in the middle position above the blackboard, records the students during the lecture, and the method shown in Figure 3 is then used to estimate the students' head posture. Finally, the students' gaze is projected, by geometric derivation, onto the teacher's lecture video recorded by the rear camera. As shown in Figure 3, this method mainly consists of the following six steps.
(1) Acquisition of data (video frames): the classroom teaching video is acquired with a Microsoft LifeCam 1080p camera, and the video frames are separated.
(2) Camera calibration: to improve the accuracy of head pose recognition, a convenient and accurate calibration method is used to calibrate the camera parameters.
(3) Face detection: a publicly available face detector is used to detect faces in the video frames [22].
(4) Face feature point detection: a random cascade regression tree is used to obtain the coordinates of 19 face feature points, which provide the two-dimensional information for the pose solution.
(5) Head pose estimation: the rotation and translation matrices of the head are solved from the detected feature points.
(6) Students' viewpoint positioning: according to the rotation and translation matrices of the head posture, the students' viewpoint is projected onto the teacher's lecture video shot by the rear camera using the transformation between spatial coordinate systems, so as to realize the visual display of students' learning attention.
In summary, the method combines the eye tracker and the single-camera learning attention analysis system: the random cascade regression tree is used to locate the face feature points, and a rigid model obtained by statistical measurement is introduced as the 3D face approximation. The students' gaze is projected onto the video images of the teacher's lecture, and the visual analysis of students' learning attention is realized.
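The viewpoint positioning step can be sketched as a ray-plane intersection. The sketch below is only an illustration under stated assumptions: it replaces the full rotation and translation matrices with a yaw angle and a pitch angle, places the blackboard in the plane z = 0 with the student at z > 0, and takes positive yaw as a turn toward +X and positive pitch as a turn toward +Y. The function names are hypothetical.

```python
import math

def gaze_point_on_blackboard(head, yaw_deg, pitch_deg):
    """Intersect the head-direction ray with the blackboard plane z = 0.

    head: (x, y, z) head center with z > 0 (assumed coordinate system).
    Returns (px, py) on the plane, or None if the ray points away from it.
    """
    x, y, z = head
    yaw = math.radians(yaw_deg)
    pitch = math.radians(pitch_deg)
    # Unit direction of the line of sight; zero angles look straight at the board.
    dx = math.sin(yaw) * math.cos(pitch)
    dy = math.sin(pitch)
    dz = -math.cos(yaw) * math.cos(pitch)
    if dz >= 0:
        return None  # looking parallel to or away from the blackboard
    t = z / -dz  # ray parameter at which z reaches 0
    return (x + t * dx, y + t * dy)

def on_blackboard(point, h, d):
    """True when the gaze point falls inside an h-by-d board centered at the origin."""
    if point is None:
        return False
    px, py = point
    return abs(px) <= h / 2 and abs(py) <= d / 2
```

A point returned by `gaze_point_on_blackboard` could then be mapped into the rear-camera image with that camera's calibration, which this sketch does not cover.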

Visual Angle Apparent Model
(1) Key Frame Adjustment.
After the current frame t has been tracked, the current frame t becomes a new key frame in the model so that the attitude parameters of frame t + 1 can be calculated, and the expectation and covariance of X must be expanded accordingly. Because x_{t+1} is unknown at this moment, the expectation and covariance of X are extended with entries for the new frame. Here, E_X^t and (σ_X^t)^2 represent the expectation and covariance of X after the tracking of the current frame t is completed, and E_X^{t+1} and (σ_X^{t+1})^2 represent the extended expectation and covariance of X.
To reduce the number of key frames in the model, only one key frame is selected from each perspective. If the attitude parameters of one key frame are very close to those of other key frames in the model, that key frame is removed from the model [23]. Since the previous frame t − 1 is always used as the reference frame of the current frame t, the previous frame t − 1 is likely to be removed from the model after the tracking of the current frame t is completed. Let the largest of the three rotation angles of the previous frame be ω_{t−1}. When the corresponding angle ω_i of some key frame i in the model is close enough to ω_{t−1} to satisfy the removal condition, the previous frame t − 1 is removed from the model, where τ is the threshold of that condition, determined by the number of key frames in the model. The removal is completed by deleting the corresponding rows and columns in E_X^t and (σ_X^t)^2.
(2) Multiscale Visual Angle Apparent Model.
When the head movement range is small, the visual angle apparent model can effectively reduce the tracking error. The visual angle apparent model for small-range head movement is also called the single-scale visual angle apparent model. When the head moves over a large range, however, it exceeds the effective range of the single-scale model. There are two main reasons for this: first, the scale of the head image changes considerably during large-range movement; second, the apparent model parameters themselves accumulate large errors over long-term movement, which makes the tracking result deviate from the true value [24]. Therefore, each visual angle apparent model is only valid within a certain range.
When the head movement range is large, multiple visual angle apparent models can be used to cooperate with each other to reduce the error accumulation caused by large-scale movement. In the specific implementation, the effective range of each visual angle apparent model is defined as a space neighborhood around its initial frame head position.
Let m denote the initial frame of the current apparent model and D_m the distance between the head and the camera in that frame. When the distance D_t between the head and the camera in the current frame satisfies the scale-change condition with respect to D_m, the tracking method generates a new apparent model, where η is the predefined threshold of that condition. The current frame t becomes the initial frame of the new model. Multiple visual angle apparent models that are continuous in space are called multiscale visual angle apparent models.
On the one hand, the multiscale visual angle apparent model can effectively solve the error accumulation problem of tracking large-scale motion, especially large-scale forward and backward motion. On the other hand, because each scale visual angle apparent model is only responsible for tracking motion in a small range, it is beneficial to the application of the model. For example, when the head leaves the camera view and re-enters, the head posture can be quickly recovered by using the multiscale visual angle apparent model.
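The two bookkeeping rules of this section, key frame removal and the creation of a new scale model, can be sketched as follows. The exact conditions are not reproduced here; the sketch assumes that "close" means an absolute angle difference below τ and that a model's effective range is an absolute distance difference of at most η from its initial-frame distance D_m. Both forms, and the function names, are assumptions made for illustration.

```python
def should_remove_previous_frame(keyframe_angles, prev_angle, tau):
    """Assumed rule: drop frame t-1 if some key frame angle is within tau of it.

    keyframe_angles: largest-axis angles (degrees) of key frames in the model.
    prev_angle: largest of the three rotation angles of frame t-1.
    """
    return any(abs(angle - prev_angle) < tau for angle in keyframe_angles)

def select_scale_model(initial_distances, d_t, eta):
    """Return the index of the scale model that covers head distance d_t.

    initial_distances: list of D_m values, one per existing apparent model.
    If no model is within eta of d_t, a new model is started at d_t and
    appended to the list (the current frame becomes its initial frame).
    """
    for index, d_m in enumerate(initial_distances):
        if abs(d_t - d_m) <= eta:
            return index
    initial_distances.append(d_t)
    return len(initial_distances) - 1
```

Because existing models are kept in the list, a head that leaves the camera view and re-enters at a previously visited distance is matched to the earlier model rather than starting from scratch, which mirrors the recovery behavior described above.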

DBN Model for Analyzing Visual LAG.
We propose a DBN model to infer students' attention goals by calculating the maximum posterior probability. The model integrates the relationships among multiple attention targets, multiple student positions, and multicamera face images, and performs joint inference.
(1) Overview of Models.
The Dynamic Bayesian Network (DBN) model for analyzing LAG proposed in this paper [25] is shown in Figure 4. The model contains the following variables:
The hidden variable F_t represents the LAG of the student at time t; it takes one of M possible LAGs.
The hidden variable C_t^i represents the pose category of the face image shot by camera i at time t; it takes one of K face pose categories.
The observation variable Z_t^i represents the face image taken by camera i at time t.
The observation variable L_t^i represents the horizontal position of the face image shot by camera i at time t.
The model combines multicamera information to analyze LAG more accurately. For example, when students stand in area 2 or area 3, both cameras can capture them. In some cases, the images obtained by a single camera may not be sufficient to determine the LAG of students accurately. The information from the other camera may then make the students' goals easier to determine.
(2) Description of Each Part of the Model.
According to the probability model, the joint probability distribution of all variables can be written as a product of the following terms. P(F_t | F_{t−1}) is the transition probability matrix between different LAGs at adjacent moments; its purpose is to enforce temporal smoothness. P(C_t^i | C_{t−1}^i, L_t^i, F_t) represents the probabilistic dependence of the face pose category C_t^i on the LAG F_t, the face position L_t^i, and the face pose category C_{t−1}^i at the previous moment. This is the core of the model: it describes the probabilistic dependence among multiple LAGs, multiple student positions, and the faces obtained by multiple cameras.
P(Z_t^i | C_t^i) is the likelihood of the face observation. Given the face pose category C_t^i, the likelihood represents the probability that the face observation Z_t^i is generated by that category. In this likelihood, Λ is the normalization factor, M_k represents the image subspace of the face pose category C_t^i = k, and d^2(Z_t^i, M_k) represents the distance from the face image to the image subspace, such as the reconstruction error when the image is projected onto the subspace.
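One factorization consistent with the three terms described above is sketched below in LaTeX. It is an illustrative reconstruction, not the paper's own equations: it assumes N cameras and T time steps (both symbols introduced here), treats the positions L as given inputs, reads P(F_1 | F_0) as the prior over the first LAG, and assumes an exponential form for the likelihood with no additional scale parameter.

```latex
% Assumed factorization of the DBN (N cameras, T time steps)
P(F_{1:T}, C_{1:T}, Z_{1:T} \mid L_{1:T})
  = \prod_{t=1}^{T} P(F_t \mid F_{t-1})
    \prod_{i=1}^{N} P(C_t^i \mid C_{t-1}^i, L_t^i, F_t)\,
                    P(Z_t^i \mid C_t^i)

% Assumed form of the face observation likelihood
P(Z_t^i \mid C_t^i = k) = \frac{1}{\Lambda} \exp\!\bigl(-d^2(Z_t^i, M_k)\bigr)
```

Under this form, a face image that is reconstructed well by the subspace M_k (small d^2) receives a high likelihood for pose category k.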

(3) Model Reasoning.
The analysis of LAG is treated as an inference problem on the probability model. Given the observations Z and L, we want to infer the hidden variables F and C. That is, our objective is to maximize the joint probability density distribution of the model. We use the approximate inference algorithm proposed in [26] to minimize the cost function "free energy." The free energy uses a simple probability density distribution Q(h) to approximate the true posterior probability density distribution P(h|v), where h and v represent the hidden variables (F, C) and the observed variables (Z, L), respectively. Q(h) is then used to calculate the objective function. Following the method of [11], the free energy E can be written in terms of Q(h); given the states of the hidden variables F and C at time t − 1, minimizing E yields the update of Q(h), as in formula (12). The probability density distribution is then calculated iteratively. Finally, the LAG with the highest probability is the final inference result of the model.
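To make the inference step concrete, the following Python sketch computes the posterior over the LAG at one time step. It is deliberately simpler than the variational free-energy method of [26]: it assumes the previous LAG and previous pose categories are known, assumes small discrete state spaces, and marginalizes the pose categories exactly instead of iterating on Q(h). The function name and the table layout are hypothetical.

```python
def infer_lag(prev_lag, transition, pose_given_lag, likelihoods):
    """Posterior over the current LAG for one time step (exact, not variational).

    prev_lag: index of F_{t-1}.
    transition[f_prev][f]: P(F_t = f | F_{t-1} = f_prev).
    pose_given_lag[i][f][k]: P(C_t^i = k | C_{t-1}^i, L_t^i, F_t = f), already
        conditioned on the known previous pose and position for camera i.
    likelihoods[i][k]: P(Z_t^i | C_t^i = k) for camera i.
    Returns (best_lag_index, posterior_list).
    """
    scores = []
    for f, p_trans in enumerate(transition[prev_lag]):
        score = p_trans
        for cam_pose, cam_like in zip(pose_given_lag, likelihoods):
            # Sum out the hidden pose category of this camera.
            score *= sum(p * l for p, l in zip(cam_pose[f], cam_like))
        scores.append(score)
    total = sum(scores)
    if total == 0:
        raise ValueError("all LAG hypotheses have zero probability")
    posterior = [s / total for s in scores]
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return best, posterior
```

Cameras contribute as independent factors, so an ambiguous image from one camera can be outweighed by a clearer image from another, which is the multicamera benefit the model description points to.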

The Efficiency Test of Visual Analysis of Learning Attention in This Paper.
For this test, a conference room was converted into a small classroom. The subjects sat 6 m in front of the blackboard and looked at the teacher, who was writing on the blackboard. A front camera was installed directly above the blackboard to monitor the subjects' learning status. A rear camera was installed behind the subjects to monitor the teacher's teaching. To ensure test accuracy, the cameras were calibrated during installation.
Based on the captured video frames, the head pose estimation method proposed in this paper is used to calculate the three-dimensional angles and three-dimensional coordinates (six-dimensional information in total) of the subject's head pose; for convenience of display, this information is marked at the tip of the nose. Using the derived visualization method of students' learning attention based on head posture, the physical position of the subject's gaze point can be calculated and marked on the captured video frame. The 1080P high-definition videos taken by the front and rear cameras are analyzed for single-person learning attention, and each video lasts 2 minutes. To test the parallel acceleration performance of the visual analysis method, 1 to 4 physical threads of an i5-4570 (4-core) CPU are used to process the two videos serially or in parallel. The experimental system uses 32 GB of memory and a TitanX graphics card (12 GB of memory) to ensure that memory and the graphics card do not become the performance bottleneck of the hardware system.
First, the two 2-minute videos are processed in five steps (frame image reading, face detection, face feature point detection, head pose estimation, and learning attention visualization), and the running time of each step and the total running time of the whole algorithm are recorded. We first use a single thread to obtain the running time of serial computing. We then use 2 to 4 threads to obtain the running time of parallel computing and observe the effect of parallel acceleration (see Figure 5). The following conclusions can be drawn from the data in Figure 5.
When one thread is used for serial processing of visual analysis of students' learning attention, the processing speed of 1080P single face video is very slow, and it takes 711.58 ms to process one frame image.
Comparing the running time of each step of serial processing shows that face detection and face feature point detection are the most time-consuming steps, at 541.33 ms and 122.36 ms, respectively. Therefore, the key to speeding up the visual analysis of learning attention lies in reducing the time spent on face detection and face feature point detection.
With 2 to 4 threads for parallel computation, parallelism effectively reduces the time spent on face detection, but the acceleration is not obvious for the other four steps (reading frames, face feature point detection, head pose estimation, and learning attention visualization). To analyze the attention of 25 to 50 students in classroom teaching in real time with 4K high-definition cameras in future practical applications, large-scale parallel processing must be carried out by computing clusters with many nodes. Because this method uses a single image for head pose estimation, it is well suited to parallel processing across many nodes.
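The observation that only face detection benefits from additional threads can be turned into a rough upper bound on the achievable speedup, in the spirit of Amdahl's law. The Python sketch below uses the serial timings quoted above; the "other_steps" figure is simply the remainder of the 711.58 ms total, and perfect scaling of the parallel step is an idealizing assumption, not a measured result.

```python
# Serial per-frame timings in milliseconds; the first two are quoted in the text.
STEP_MS = {
    "face_detection": 541.33,
    "feature_point_detection": 122.36,
    # Remainder of the 711.58 ms total: frame reading, pose estimation, visualization.
    "other_steps": 711.58 - 541.33 - 122.36,
}

def estimated_frame_time(step_ms, parallel_steps, threads):
    """Estimate per-frame time if only `parallel_steps` scale ideally with threads."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    serial = sum(ms for name, ms in step_ms.items() if name not in parallel_steps)
    parallel = sum(ms for name, ms in step_ms.items() if name in parallel_steps)
    return serial + parallel / threads

def estimated_speedup(step_ms, parallel_steps, threads):
    """Ratio of single-thread time to the estimated multi-thread time."""
    base = estimated_frame_time(step_ms, parallel_steps, 1)
    return base / estimated_frame_time(step_ms, parallel_steps, threads)
```

Calling `estimated_speedup(STEP_MS, {"face_detection"}, 4)` shows that, under these assumptions, the steps that do not parallelize cap the benefit well below a fourfold gain, which is consistent with the argument for distributing whole frames across many nodes.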

Head Posture Tracking Experiment Using the Visual Angle Apparent Model

Experiment 1.
This experiment evaluates the tracking results of the visual angle apparent model when the body moves back and forth over a large range. A video sequence was recorded at 12 Hz using a Digiclops stereo camera, in which the subjects moved along the z-axis from a position about 0.5 m away from the camera to a position about 1.5 m away from the camera. During the movement, the subjects constantly changed their head posture, and the rotation angle of the head around the three coordinate axes ranged from −45° to 45°. According to the effective range of

Experiment 2.
This experiment tests the tracking results of the proposed method when the head leaves the camera view and then re-enters. Another video sequence was recorded at 12 Hz using the Digiclops stereo camera, in which the subjects left the camera's field of view when they were about 1.2 m away from the camera and re-entered it when they were about 0.8 m away from the camera.

Experiment 3.
To further evaluate the performance of the proposed method, its tracking error was measured in Experiment 3. The ground-truth head posture parameters were obtained with the pciBIRD motion sensor.
Three video sequences were recorded in the experiment, and the head pose parameters of each frame were obtained by pciBIRD. All three video sequences were recorded at 12 Hz, with an average length of 1020 frames. During recording, the motion of the tracked object was similar to that of Experiment 1, and the parameter settings were the same as those of Experiment 1. The average tracking error (the average error of the three angles) for video sequence one is shown in Figure 6, which displays only the segment in which the tracked object moves from about 0.9 m to about 1.3 m away from the camera, a total of 320 frames. The proposed method tracks the attitude parameters accurately (4° root mean square error), while Morency's method has a larger error (8° root mean square error), with a maximum error of up to 20°. Figure 6 also shows the error when using the single-scale visual angle apparent model. When the head movement exceeds the effective range of the single-scale model, the tracking error is larger.
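The error measures used in this experiment can be written down in a few lines of Python. This is a minimal sketch: it assumes the per-frame error is the mean absolute difference over the three rotation angles and that the root mean square is taken over frames; the function names are hypothetical, and no data from the experiment are reproduced.

```python
import math

def mean_angle_error(estimated, reference):
    """Mean absolute error over the three rotation angles of one frame (degrees)."""
    if len(estimated) != 3 or len(reference) != 3:
        raise ValueError("expected three rotation angles per frame")
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / 3

def rms(errors):
    """Root mean square of a sequence of per-frame errors."""
    if not errors:
        raise ValueError("need at least one error value")
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def sequence_rmse(estimated_frames, reference_frames):
    """RMS over frames of the per-frame mean angle error."""
    if len(estimated_frames) != len(reference_frames):
        raise ValueError("sequences must have the same length")
    return rms([mean_angle_error(e, r)
                for e, r in zip(estimated_frames, reference_frames)])
```

Here `reference_frames` would hold the sensor readings and `estimated_frames` the tracker output for the same frames.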

LAG Recognition.
We evaluate the validity and extensibility of the model through two sets of experiments. In the first group of experiments, we used the data of all 8 people for both training and testing. In the second group, leave-one-out cross-validation was adopted, that is, training with the data of 7 people and testing with the data of the remaining person. A total of 8 rounds were run, so that each person's data served as the test set once. Figures 7 and 8 show the results of the two groups of experiments, respectively. The percentage in each figure is the recognition accuracy, obtained by dividing the number of correctly recognized frames by the total number of frames. Accuracy is evaluated only on manually labeled video clips. In the second group of experiments, we consider only the results on the test-set videos and report the average over the 8 rounds.
As the figures show, the results of the first group of experiments are very good. This is because the students in the training set and the test set are the same. The results of the second group are not as good as those of the first. This is understandable, because the students in the test set do not appear in the training set, and the appearance and illumination of different students differ considerably. When students stand in area 1 and look at LAG4, 5, and 6, the accuracy is not very high. These targets are often mistakenly identified as an adjacent LAG.
This is because LAG4, 5, and 6 are relatively close to each other and far away from area 1. Therefore, when students look at these different targets, they often turn their heads only very slightly, and sometimes they turn only their eyeballs instead of their heads. If these cases are excluded, the experimental results are acceptable considering the difficulty of data captured in real scenes.
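The evaluation protocol of the second group of experiments, together with the frame-level accuracy measure, can be sketched in Python as follows. The sketch only shows how the splits and the accuracy are formed; the person identifiers and the function names are hypothetical, and model training itself is left out.

```python
def leave_one_person_out(person_ids):
    """Yield (training_ids, held_out_id) pairs, one round per person."""
    for held_out in person_ids:
        training = [p for p in person_ids if p != held_out]
        yield training, held_out

def frame_accuracy(predicted, labeled):
    """Correctly recognized frames divided by the total number of labeled frames."""
    if len(predicted) != len(labeled) or not labeled:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for p, t in zip(predicted, labeled) if p == t)
    return correct / len(labeled)

def average_accuracy(per_round_accuracies):
    """Average of the per-round test accuracies."""
    if not per_round_accuracies:
        raise ValueError("need at least one round")
    return sum(per_round_accuracies) / len(per_round_accuracies)
```

With 8 participants, `leave_one_person_out(range(8))` produces the 8 rounds described above, each training on 7 people and testing on the remaining one.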

Discriminant Analysis of Students' Learning Attention.
Based on the above analysis, students' learning attention is analyzed using the criterion defined earlier. To verify the detection effect of this method, learners were asked to imitate the traditional classroom learning process, and everyday classroom behaviors such as listening carefully at irregular intervals, looking down at a mobile phone, and looking around were tested. The procedure was as follows: a high-definition camera with 12 million pixels was selected as the acquisition tool and fixed 2 m in front of the learner; the learning process of the learner was recorded for 60 s under ordinary illumination; and the learner's learning attention was detected by analyzing the sampled video.
Implementation process of the algorithm: when a student sits in front of the camera, the camera records the student's learning, and each frame is then processed by the learning attention detection system, which records the student's head rotation. A frame in which any angle is out of range is recorded as 1, and a frame in which no angle exceeds its range is recorded as 0. First, if the record stays at 1 continuously for 2 seconds, the event is judged to be a learning distraction and is counted. Second, the ratio of the time recorded as 1 per unit time to the total time is calculated, and the student's classroom learning attention is output. From this, we can count the number of distractions and assess the student's learning attention per unit time. Some typical behaviors are shown in Figures 9-11.
As shown in Figures 9-11, the displacement deviation of the head posture in the X, Y, and Z directions estimated by this method is within the acceptable range. The maximum error of displacement estimation on the x-axis and y-axis is less than 40 mm and 50 mm, respectively, and the maximum error on the z-axis is less than 230 mm. The discriminant analysis of students' learning attention in this paper is accurate: the estimated curve basically coincides with the calibration curve, and the deviation is small.
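The counting rule described in this section can be sketched in Python as follows. The sketch assumes a fixed frame rate, uses the yaw and pitch ranges quoted earlier in the paper as defaults, counts each continuous out-of-range run once when it reaches 2 seconds, and reports the share of out-of-range frames; the function names and the `fps` parameter are assumptions made for illustration.

```python
def out_of_range(yaw_deg, pitch_deg,
                 yaw_limits=(-48.0, 48.0), pitch_limits=(-7.0, 28.0)):
    """Return 1 if either head angle leaves its range, else 0."""
    inside = (yaw_limits[0] <= yaw_deg <= yaw_limits[1]
              and pitch_limits[0] <= pitch_deg <= pitch_limits[1])
    return 0 if inside else 1

def analyze_attention(flags, fps, window_s=2.0):
    """Count distractions and compute the out-of-range ratio.

    flags: per-frame records (1 = out of range, 0 = within range).
    fps: frames per second of the sampled video (assumed constant).
    Returns (distraction_count, distracted_ratio).
    """
    window = max(1, round(fps * window_s))
    count = 0
    run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        if run == window:  # each continuous run is counted once, at the 2 s mark
            count += 1
    ratio = sum(1 for f in flags if f) / len(flags) if flags else 0.0
    return count, ratio
```

One way to express attention per unit time is `1 - distracted_ratio`; this is an interpretation of the output described above rather than a definition taken from the text.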

Conclusion
Online education is a new trend in the development of education. It will bring profound changes in educational concepts, educational systems, teaching methods, and personnel training models, and will play a positive role in deepening education and teaching reform, improving education quality, and promoting education equity. In this paper, a discrimination method of learning attention based on head posture analysis is proposed. Key frames with different head postures are selected and generated online. Besides attaching posture parameters to the key frames, the head region is accurately extracted from each key frame as the head perspective, and the key frames are combined into a multiscale visual angle apparent model according to their spatial distribution. A DBN model is proposed to infer students' LAG. The model integrates the relationships among multiple LAGs, multiple student positions, and multicamera face images, and performs joint inference. We measure the head posture by the similarity vector between the face image and multiple face categories, without explicitly calculating the specific head posture value. We collected test data in the teaching environment, and the experimental results show that our model achieves good results.
[Figure legend: real data; data from literature [26]; data in this paper]
In the current experiment, when the user looks at a distant visual attention target, the attitude measurement is inaccurate because the differences between images are small. In the future, we will consider fusing motion information to detect changes in the user's visual attention target. The visual attention targets in this paper are several screens on the wall. In the future, we will consider extending the visual attention targets to more places, such as different areas on the workbench.
Data Availability
The dataset used in this paper is available from the corresponding author upon request.