Research on Feature Fusion Speech Emotion Recognition Technology for Smart Teaching

At present, numerous intelligent teaching applications based on advanced pattern recognition technologies, such as face sign-in, classroom action recognition, and student facial expression recognition, have gradually been deployed in schools. Speech emotion recognition can analyze the emotional characteristics of teaching, revealing the rules and details of how teachers grasp emotion. In order to realize emotion classification in intelligent teaching through machine learning, a teaching speech emotion recognition method combining multifeature fusion and deep learning is proposed. The proposed method processes the speech spectrum features at the front end in a multifeature fusion manner and combines them with an artificial neural network classifier to form the final speech emotion recognition model. First, after speech preprocessing, three features are selected and fused using a network structure trained by parallel subnetworks. Then, a hybrid neural network classifier combining a convolutional neural network and a recurrent neural network is used. Finally, 100 open web courses were used to train and test the model. The test results show that the hybrid neural network sound model using multifeature fusion has good teaching speech emotion recognition capability.


Introduction
Emotional measurement in classroom teaching has long been a pressing issue in instructional psychology. Classroom emotion is a very important but easily neglected aspect of the teaching process, and it is extremely important in promoting the development of students' quality. At present, most teaching evaluation methods still rely mainly on people's subjective judgments and inferences [1][2][3][4][5][6]. This manual approach to evaluating the quality of classroom teaching suffers from uneven evaluation benchmarks, too many subjective factors, and poor generalization, and it cannot produce convincing data from an objective perspective [7][8][9]. At the same time, traditional classroom teaching evaluation tends to focus on only one or a few classrooms, making it impossible to discover the patterns of excellent classrooms from a macroscopic perspective.
In recent years, big data and artificial intelligence technologies have developed rapidly in all walks of life. Intelligent teaching based on AI technology has gained wide attention. Technologies such as face sign-in, cloud classroom, intelligent Q&A, and photo search greatly save teachers' and students' time and energy and change the traditional classroom teaching mode. In terms of teaching evaluation, artificial intelligence technology can also be used to achieve a more objective evaluation of smart teaching [10][11][12]. With a large amount of data as support, the evaluation results can be guaranteed to be convincing. Therefore, the use of deep learning technology to achieve teaching evaluation is of great research importance. Teaching emotion, as one aspect of teaching evaluation, also has very practical application value.
Teaching emotions play a very important role in the process of intelligent teaching. A good teacher can accurately grasp the teaching emotion, which can improve students' attention while also making them feel happy and satisfied and eventually make them enjoy the course they are taking [13,14]. However, the complexity of emotion itself has led to a lack of knowledge on how to use emotion in teaching to optimize instruction. Nowadays, the maturity of deep learning algorithms allows us to have a clearer and more objective perception of emotions. By recognizing and analyzing classroom videos through neural networks, we can comprehensively analyze the characteristics and shortcomings of teaching emotions in excellent classrooms from a macro perspective [15][16][17]. MOOC platforms host a large number of excellent public class videos, which can be turned into a fairly standard classroom speech emotion dataset after professional audio processing.
In order to fully explore the features of teaching emotion in the classroom and thus achieve the task of classifying teaching emotion, we propose a multifeature fusion and deep learning method for teaching speech emotion recognition. We use the trained neural network to analyze and compare a larger set of public classroom videos, so as to summarize the current emotion patterns of excellent courses. The rest of the paper is organized as follows: Section 2 studies the related works in detail, Section 3 describes the preprocessing of emotional speech signals, Section 4 details the three feature extraction methods, Section 5 presents the overall architecture for teaching speech emotion recognition, and Section 6 provides results and discussion. Finally, the paper is concluded in Section 7.

Related Works
In the 1960s, the American psychologist B.S. Bloom began to examine the emotion-related aspects of teaching and learning and proposed the influential affective domain and classification of educational goals. Bloom stated that the fundamental value of education is to achieve development rather than a challenge. Ketonen et al. [18] proposed a theory of classification of educational goals in which goals can be classified into categories based on emotions. Christelle et al. [19] argue that emotions can be seen as a continuum ordered according to a hierarchy and point out that emotions are not only an interpretation of figurative or abstract matters but also a control of the unconscious. These studies overcame the impracticality of traditional methods of evaluating affect in teaching and made a concrete, operational approach to evaluating affect in teaching possible.
However, the above descriptions are too abstract, making it difficult for machines without relevant knowledge to achieve accurate category classification. Speech is the fastest and most natural way for humans to interact. Speech emotion recognition is particularly useful for applications that require natural human-computer interaction, such as Web TV, fatigued-driving detection, language translation, and distance learning. This is because the emotional state of the speaker plays a crucial role in all aspects of communication.
The main purpose of employing speech emotion recognition is to adjust the system's functions accordingly when emotions are detected in the voice. Currently, speech emotion recognition systems consist of two phases [20][21][22][23]: (1) front-end processing (feature extraction), which extracts the appropriate speech features from the available speech data. Each feature parameter in speech emotion recognition contains only part of the information of the speech signal. In order to fully characterize the speech signal, the combined use of various feature parameters has become an important research direction, and how to combine, transform, and trade off the various feature parameters is a problem that must be solved. (2) Classification, which determines the potential emotion of the speech. Most current research has focused on the selection of classifiers. The main types of existing classifiers are hidden Markov models (HMM), Gaussian mixture models (GMM), artificial neural networks (ANN), and support vector machines (SVM) [24][25][26]. These classifiers are widely used in speech emotion recognition in individual or combined form.
To address the fusion problem of feature parameters and the selection problem of classifiers, a multifeature fusion approach is proposed for front-end processing of speech spectrum features, combined with an artificial neural network classifier to form the final speech emotion recognition model. The classifier used is a hybrid neural network (HNN) classifier combining a convolutional neural network (CNN) and a recurrent neural network (RNN), and the effectiveness of the method is demonstrated experimentally.
The main innovations of this paper are reflected in the following two points: (1) three features are selected based on the similarity and complementarity among sound features, and the features are fused using a network structure trained by parallel subnetworks; (2) in the proposed HNN classifier, the CNN part uses DenseNet, which performs excellently in image recognition tasks, and the RNN part uses the LSTM neural network, which is commonly used in speech recognition and text annotation tasks. The combination of the two makes it easier to find patterns in data with time-series features and thus identify the target emotion class more reliably.

Preprocessing of Emotional Speech Signals
The preprocessing of the speech signal includes preemphasis, short-time analysis, framing, windowing, and endpoint detection.
(1) Preemphasis: The spectrum of the speech signal is obtained through the Fourier transform. Preemphasis boosts the high-frequency part of the signal, which flattens the signal spectrum and makes spectral analysis and the analysis of channel parameters easier. (2) Short-time analysis: As a whole, the speech signal changes over time and is a nonstationary process, so digital signal processing techniques for stationary signals cannot be applied to it directly. However, over a short time range (generally considered to be 10-30 ms), the signal can be treated as approximately stationary, because the movement of the oral muscles is extremely slow. Therefore, the analysis and processing of speech are all carried out on a short-time basis.
(3) Framing: In the short-time analysis of speech, the signal needs to be divided into frames and processed frame by frame. A frame is usually 10-30 ms, and in order to achieve a smooth transition and maintain continuity, adjacent frames usually overlap. According to the frame length and step length, all speech frames of a speech segment can be obtained. (4) Windowing: After framing the speech signal, a Fourier transform is usually applied, but applying it directly causes the Gibbs phenomenon. Therefore, windowing is needed. Commonly used window functions are the rectangular window and the Hamming window. The Hamming window function is defined as w(n) = 0.54 − 0.46 cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1, where N is the length of the Hamming window. (5) Endpoint detection: The start and end points of the speech signal are accurately located within a segment of speech, separating the effective speech signal from the invalid noise signal.
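As a concrete illustration, the framing and windowing steps above can be sketched in Python as follows; the function name, the 16 kHz sampling rate, and the random test signal are illustrative assumptions rather than details from the paper:

```python
import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25, step_ms=10):
    """Split a 1-D speech signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step_len = int(sample_rate * step_ms / 1000)
    num_frames = 1 + max(0, (len(signal) - frame_len) // step_len)
    # Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1))
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * step_len: i * step_len + frame_len] * window
        for i in range(num_frames)
    ])
    return frames

# Example (illustrative): 1 s of a 16 kHz signal -> 98 frames of 400 samples each
sig = np.random.randn(16000)
frames = frame_and_window(sig, 16000)
```

With a 25 ms frame and a 10 ms step, adjacent frames share 15 ms of samples, which gives the smooth frame-to-frame transition the text describes.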

Three Feature Extraction Methods
Speech expresses emotion because it contains parameters that characterize emotion, and variations in emotion are reflected by differences in the feature parameters. The focus of feature extraction is to retain emotion-relevant information and eliminate redundant, uncorrelated information. The main features commonly used in speech recognition are the sound spectrogram, Mel-frequency cepstral coefficient (MFCC) features, and filter bank (FBank) features. In this paper, these three features are selected for teaching speech emotion recognition.

Sound Spectrogram Characteristics.
The speech signal is a one-dimensional signal; intuitively, only its time-domain information can be seen, not its frequency-domain information. Therefore, this section extracts sound spectrogram features, and the flow chart of the extraction is shown in Figure 1.
The frame length is set to 25 ms and the frame shift to 10 ms; then, a window is applied to each frame, and the Hamming window is used in this paper. The short-time Fourier transform (STFT) is applied to each frame after windowing. The STFT is introduced to preserve the time-frequency relationship of the speech signal, and the transformation equation is X(n, k) = Σ_m z(m)w(n − m)e^(−j2πkm/N), where z(m) is the source signal, w(m) is the window function, and N is the transform length. Finally, the frames of the transformed signal are stacked along another dimension to obtain the sound spectrogram features.
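Under the same 25 ms/10 ms framing assumptions, a minimal spectrogram extractor might look like the sketch below; the log compression and the small ε constant are common practice rather than steps specified in the paper:

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_ms=25, step_ms=10):
    """Frame the signal, apply a Hamming window, take the FFT magnitude
    of every frame, and stack the frames into a time-frequency image."""
    n = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    window = np.hamming(n)
    frames = [signal[i:i + n] * window for i in range(0, len(signal) - n + 1, step)]
    magnitude = np.abs(np.fft.rfft(frames, axis=1))   # |STFT| of each frame
    return np.log(magnitude + 1e-10).T                # (frequency bins, time frames)

spec = spectrogram(np.random.randn(16000), 16000)     # 1 s of audio at 16 kHz
```

Each column of the result is one windowed frame's spectrum, so the stacked image preserves the time-frequency relationship described above.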

FBank Features.
FBank is a feature that mimics the way the human ear processes audio, which can improve the recognition performance of a speech recognition system. In order to retain as much information of the sound signal as possible and obtain the best feature parameters, the FBank features are extracted after framing, windowing, and related steps. The flow chart is shown in Figure 2.
Preemphasis refers to the use of a high-pass filter to enhance the high-frequency region of the speech signal and keep the spectrum flat over the whole frequency range from low to high frequency. The high-pass filter selected for preemphasis is y(n) = x(n) − a·x(n − 1), where x(n) is the input signal, y(n) is the output signal, and a is the preemphasis factor, a value between 0.9 and 1.0. In this paper, a = 0.97. When extracting FBank features, the length of each frame is set to 25 ms, the shift between two adjacent frames is set to 10 ms, and the Hamming window is selected as the window function. After windowing, in order to obtain the spectral energy distribution of the signal, a fast Fourier transform (FFT) is applied to each processed frame: S_i(k) = Σ_{n=0}^{N−1} s_i(n)e^(−j2πkn/N), 0 ≤ k ≤ N − 1, where N is the number of samples and s_i(n) denotes the i-th frame of the input signal.
Since the FBank features take into account the auditory characteristics of the human ear, the obtained frequency-domain features must also be transformed into a nonlinear spectrum in the Mel domain. The Mel transformation is Mel(f) = 2595·lg(1 + f/700), where f is the signal frequency. Set the number of triangular filters in the bank to M and let f(m), m = 1, 2, ..., M, be the center frequencies of the filter bank; these frequencies are equally distributed on the Mel frequency axis. The frequency-domain features are filtered through the Mel filter bank to obtain, for each frame of the signal, the energy value in the frequency band covered by each filter. Then, logarithmic operations are performed on the resulting energy values to finally obtain the FBank features. The logarithmic energy output of each filter is calculated as e(m) = ln(Σ_k |S_i(k)|²·H_m(k)), 1 ≤ m ≤ M, where H_m(k) is the frequency response of the m-th triangular filter.
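The full FBank pipeline can be sketched end to end as below. The a = 0.97 preemphasis factor and the 25 ms/10 ms framing come from the paper; the choice of 40 triangular filters and the exact filter construction follow common FBank recipes and are assumptions:

```python
import numpy as np

def mel(f):            # Hz -> Mel: Mel(f) = 2595 * lg(1 + f/700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):        # Mel -> Hz (inverse of the transform above)
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank(signal, sr, n_filters=40, frame_ms=25, step_ms=10, alpha=0.97):
    # (1) preemphasis: y(n) = x(n) - a * x(n-1)
    sig = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # (2) framing + Hamming window
    n = int(sr * frame_ms / 1000); step = int(sr * step_ms / 1000)
    win = np.hamming(n)
    frames = np.stack([sig[i:i + n] * win for i in range(0, len(sig) - n + 1, step)])
    # (3) power spectrum via FFT
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # (4) triangular filters with center frequencies equally spaced on the Mel axis
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, power.shape[1]))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    # (5) log filter-bank energies
    return np.log(power @ fb.T + 1e-10)
```

Each row of the output is one frame's vector of log filter-bank energies, which is the FBank feature used downstream.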

MFCC Features.
Although the FBank features already closely match the response characteristics of the human ear, they still have some shortcomings, such as the overlap between adjacent filters of the FBank filter bank. Therefore, MFCC features have been proposed. The MFCC features can be obtained by performing a discrete cosine transform (DCT) on the extracted FBank features, and the extraction process of MFCC features is shown in Figure 3. The DCT removes the correlation between the dimensions of the signal and maps the signal to a multidimensional space. The discrete cosine transform is C(l) = Σ_{m=1}^{M} e(m)·cos(πl(m − 0.5)/M), l = 1, 2, ..., L, where L is the order of the MFCC coefficients and M is the number of triangular filters. In general, to better reflect the dynamic continuity of the signal, the differential form of the static features can be used to represent this dynamic continuity.
The first-order difference can be computed as Δc_t = (c_{t+1} − c_{t−1})/2, where Δc_t denotes the first-order difference at frame t. The parameters of the second-order difference are obtained by substituting the result of the first-order difference back into the same difference formula.
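The DCT and difference steps can be sketched as follows; keeping 13 cepstral coefficients is a conventional choice, not a figure from the paper:

```python
import numpy as np

def mfcc_from_fbank(log_energies, num_ceps=13):
    """Apply a DCT to the log filter-bank energies to decorrelate the dimensions.
    Implements C(l) = sum_m e(m) * cos(pi * l * (m - 0.5) / M) for l = 1..L."""
    n_frames, M = log_energies.shape
    m = np.arange(1, M + 1)
    basis = np.cos(np.pi * np.arange(1, num_ceps + 1)[:, None] * (m - 0.5) / M)
    return log_energies @ basis.T                     # (frames, num_ceps)

def delta(features):
    """First-order difference capturing the dynamic continuity of the features;
    applying it twice yields the second-order (acceleration) features."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode='edge')
    return (padded[2:] - padded[:-2]) / 2.0
```

Static MFCCs, their deltas, and delta-deltas are typically concatenated frame-wise to form the final dynamic feature vector.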

Structure of HNN Classifier.
The overall architecture of teaching speech emotion recognition consists of three parts, the first two of which constitute the HNN. The first part is a convolutional feature extractor, which takes as input a spectrogram image representation of an audio file.
This feature extractor uses DenseNet [27] to convolve the input image, merge it over several stages, and generate a flattened feature map. The second part is a recurrent neural network (RNN) [28]. An RNN can handle inputs of various lengths and therefore does not require cropping or padding of the audio input. An LSTM structure was chosen for the RNN to address the long-term dependencies present in sequential data. Then, statistics of the LSTM output are computed through a pooling layer. Usually, only average pooling is applied; in order to obtain richer statistical information about the output of the LSTM network, we also perform maximum pooling and minimum pooling. The LSTM is set to have 128 units. The merging process concatenates, for each unit i, the mean, maximum, and minimum of r(t)_i over all time steps t, where r(t)_i represents the i-th element of the LSTM output r(t) at time t. The network structure of the HNN is shown in Figure 4.
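The LSTM itself is omitted here; the sketch below only shows the statistical pooling step applied to an LSTM output sequence of shape (time steps, units). The 50-step test sequence is an illustrative assumption; the 128 units match the paper:

```python
import numpy as np

def statistical_pooling(h):
    """h: (T, units) matrix of LSTM outputs r(t) over T time steps.
    Concatenates the per-unit mean, maximum, and minimum over time,
    turning a variable-length sequence into one fixed-length vector."""
    return np.concatenate([h.mean(axis=0), h.max(axis=0), h.min(axis=0)])

h = np.random.randn(50, 128)        # 50 time steps, 128 LSTM units
pooled = statistical_pooling(h)     # fixed 384-dimensional utterance vector
```

Because the pooled vector's length depends only on the number of units, utterances of any duration map to the same fixed-size representation.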

Multifeature Sound Model Training.
In this paper, a network structure of parallel subnetwork training is used to fuse the features. The advantage of feature fusion is that it can obtain the most effective and lowest-dimensional feature vector set that benefits the final decision. The structure diagram of the parallel subnetwork training is shown in Figure 5. The feature fusion process is trained using parallel, separate networks for the three features. Each individual network consists of an HNN and performs deep processing of one of the three different features. A fully connected layer is connected afterwards, whose role is to merge the outputs of the independent networks through a tandem (concatenation) connection to form the sound model. The third part of the overall architecture for teaching speech emotion recognition consists of two fully connected layers and a softmax layer for predicting emotion categories, with sizes of 128, 32, and 9, respectively.
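A minimal forward-pass sketch of the tandem fusion and the 128-32-9 classification head follows. The 384-dimensional subnetwork outputs, the random weights, and the ReLU activations are illustrative assumptions; only the layer sizes (128, 32, 9) come from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical pooled outputs of the three parallel HNN subnetworks.
out_spec, out_fbank, out_mfcc = (rng.standard_normal(384) for _ in range(3))
fused = np.concatenate([out_spec, out_fbank, out_mfcc])   # tandem fusion

# Classification head: fully connected layers of size 128 and 32, softmax over 9 classes.
w1 = rng.standard_normal((fused.size, 128)) * 0.01
w2 = rng.standard_normal((128, 32)) * 0.01
w3 = rng.standard_normal((32, 9)) * 0.01

probs = softmax(relu(relu(fused @ w1) @ w2) @ w3)         # emotion class probabilities
```

In training, the three subnetworks and this head would be optimized jointly, so the fusion layer learns how to weight each feature stream.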
Since three different types of features are selected in this paper and their dimensions differ, the traditional Euclidean distance cannot be used for direct comparison. Therefore, this paper introduces the feature proximity and uses the average dimensional spacing between vectors instead of the Euclidean distance to represent the proximity between different features. The average dimensional spacing between vectors is d_μ(i, j) = (1/D)Σ_{k=1}^{D} |u_i(k) − u_j(k)| and d_σ²(i, j) = (1/D)Σ_{k=1}^{D} |v_i(k) − v_j(k)|, where d_μ(i, j) and d_σ²(i, j) denote the mean interval and variance interval of the average dimension among the feature vectors, respectively, and D is the feature dimension. u_i and u_j denote the vectors of the mean values of the class i and class j sound features, respectively, and v_i and v_j denote the vectors of the variances of the class i and class j sound features, respectively.
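A minimal sketch of this distance, under the assumption that "average dimensional spacing" means the per-dimension mean absolute difference between two statistic vectors:

```python
import numpy as np

def average_dimensional_spacing(u_i, u_j):
    """Mean absolute per-dimension difference between two statistic vectors.
    This is an assumed reading of the paper's 'average dimensional spacing';
    the same function applies to mean vectors (d_mu) and variance vectors (d_sigma^2)."""
    u_i, u_j = np.asarray(u_i, float), np.asarray(u_j, float)
    return np.mean(np.abs(u_i - u_j))

d_mu = average_dimensional_spacing([1.0, 2.0, 3.0], [1.0, 2.0, 6.0])   # -> 1.0
```

Dividing by the dimension D normalizes the distance, so vectors of different lengths yield comparable proximity values.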

Experimental Data Sources and Preprocessing.
The source of the dataset is 100 courses on the Chinese university MOOC platform (https://www.cmooc.com). In this paper, we use a BeautifulSoup4-based crawler written in Python to extract the video links from the web pages and then use the Requests library to send a POST request to each link to obtain the download link of the video. Audio splitting uses the split_on_silence function in the pydub library. Audio splitting determines the speaker's speech intervals based on silence durations, thus splitting the audio of a class into a large number of segments. The 100 crawled courses were divided into more than 30,000 audio segments after audio cutting and were manually labeled. The distribution of the labeled data is shown in Figure 6, with 50 classrooms used as the training dataset.
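pydub's split_on_silence operates on AudioSegment objects loaded from real audio files; the pure-NumPy stand-in below illustrates the same idea (cut wherever short-time energy stays below a threshold for long enough). All thresholds and the synthetic test waveform are illustrative assumptions:

```python
import numpy as np

def split_on_silence_np(samples, sr, silence_thresh=0.01, min_silence_ms=500):
    """Split a waveform into speech segments wherever the RMS energy stays
    below silence_thresh for at least min_silence_ms (cf. pydub's split_on_silence)."""
    hop = int(sr * 0.01)                       # 10 ms analysis blocks
    min_silent_blocks = max(1, min_silence_ms // 10)
    n_blocks = len(samples) // hop
    rms = np.array([np.sqrt(np.mean(samples[i * hop:(i + 1) * hop] ** 2))
                    for i in range(n_blocks)])
    loud = rms >= silence_thresh
    segments, start, silent_run = [], None, 0
    for i, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = i                      # a new speech segment begins
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent_blocks:
                # silence lasted long enough: close the current segment
                segments.append(samples[start * hop:(i - silent_run + 1) * hop])
                start, silent_run = None, 0
    if start is not None:
        segments.append(samples[start * hop:])
    return segments

# Two 0.5 s bursts separated by 1 s of silence should yield two segments
sr = 16000
wave = np.concatenate([0.5 * np.ones(8000), np.zeros(16000), 0.5 * np.ones(8000)])
parts = split_on_silence_np(wave, sr)
```

In the actual pipeline, the equivalent pydub call would be applied to the downloaded course audio, producing the per-utterance clips that are then labeled.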
It can be seen that there is severe class imbalance in the labeled teaching emotion dataset. The most frequent label, "calm," appears 10,459 times, accounting for 71.52% of the total dataset, while the least frequent label, "surprised," appears only 2 times, less than 0.1% of the total. Since the distribution of the dataset is so unbalanced, some data preprocessing is needed to ensure that the neural network generalizes and does not overfit severely. Undersampling is performed on the "calm" labels, which account for more than 71% of the total data. Undersampling discards a large amount of data; we randomly discarded about 90% of the "calm" samples, leaving a final sample of 1,085 "calm" labels. Similarly, for the smaller classes, such as the "tension," "hesitation," and "satisfaction" labels, we perform oversampling. Oversampling duplicates a proportion of the minority-class data to ensure that the amount of data is not so small as to cause overfitting problems. Although the class proportions change after oversampling and undersampling, the balance of the data is much improved. The distribution of the preprocessed teaching emotion data is shown in Table 1.
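The rebalancing step can be sketched as follows. The target of 1,085 matches the final "calm" count reported above; the example labels, counts, and fixed seed are illustrative assumptions:

```python
import random

def rebalance(samples_by_label, target):
    """Randomly undersample majority classes and oversample (duplicate)
    minority classes so every label ends up with exactly `target` examples."""
    random.seed(42)                      # fixed seed only for reproducibility
    balanced = {}
    for label, items in samples_by_label.items():
        if len(items) >= target:
            # undersample: keep a random subset
            balanced[label] = random.sample(items, target)
        else:
            # oversample: duplicate random existing items to reach the target
            balanced[label] = items + random.choices(items, k=target - len(items))
    return balanced

data = {"calm": list(range(10459)), "tension": list(range(300))}
balanced = rebalance(data, target=1085)
```

Undersampling throws information away while oversampling repeats it, so the target is usually chosen near the size of the mid-frequency classes, as done here.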

Analysis of Emotion Recognition Results.
In order to verify the recognition effect of multifeature fusion, this paper sets up comparison experiments that fuse different feature combinations. Because the dimensionality of the spectrogram features is relatively large, their network structure adds an extra pooling layer to the convolution-pooling-convolution structure to reduce the dimensionality. The network parameters of the HNN are set as shown in Table 2. The error rate comparison results on the test set are shown in Figure 7.
As can be seen from Figure 7, Test I trained only on FBank features for recognition, so it had the worst recognition effect, with a recognition error rate of 25.92%. Test II introduces MFCC features on top of the FBank feature training; the recognition effect is slightly improved, and the recognition error rate is 0.41% lower than that of Test I. Test III introduces the sound spectrogram features on the basis of Test I; the recognition effect is also improved, and the recognition error rate is reduced by 0.86% compared with Test I. Test IV introduces both MFCC and sound spectrogram features on the basis of Test I. Analyzing the test results, the recognition effect of Test IV is the best: its recognition error rate is reduced by 1.28%, 0.87%, and 0.42%, respectively, compared with the first three tests. From the overall recognition results, it can be concluded that the HNN sound model with multifeature fusion has better teaching speech emotion recognition ability.

Teaching Emotion Distribution Pattern.
In this paper, based on the data obtained, the classes were classified into the following six categories according to their emotional characteristics: calm classroom, inquiry classroom, question-and-answer classroom, encouraging classroom, exciting classroom, and balanced classroom. After the identification of teaching emotions, the distribution of categories across the 100 courses is shown in Table 3. Almost all classrooms are dominated by one emotion: the "calm" emotion. In other words, almost all classrooms are currently teacher-led. Only the three inquiry-based classes, in which the percentage of teacher-led lessons is below 50%, are truly student-led. Teachers should explore more ways to guide students and really make them the masters of the classroom.

Conclusions
In this paper, we fuse FBank features, MFCC features, and sound spectrogram features for HNN network training, thus building a multifeature sound model. The fusion process uses independent, parallel subnetwork training, followed by joint training through fully connected layers. The HNN classifier used is a hybrid classifier combining a CNN and an RNN. The experimental results show that the recognition effect of the model improves as more features are fused. By training the teaching speech emotion recognition model, we can achieve fast recognition and classification of emotions in the intelligent teaching process. However, the study has certain shortcomings that need to be improved: (1) the dataset produced is highly imbalanced, which is very unfavorable for neural network training; (2) only 50 classroom test datasets were produced, which is still insufficient for large-scale training. Both the data and the neural network model have much room for improvement and modification.
Data Availability
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.