Evaluation of Multimedia Popular Music Teaching Effect Based on Audio Frame Feature Recognition Technology

Music education should pay attention to popular music that exists in students’ real life and deeply affects them. Moreover, it needs to be combined with “popular classical music” to make them happily learn popular music, appreciate popular music artistically, and feel popular music aesthetically. This study combines the audio frame feature recognition technology to evaluate the effect of multimedia popular music teaching and improve the quality of multimedia popular music teaching. Moreover, this study adaptively revises the speech spectrum technology to construct a multimedia pop music system based on audio frame feature recognition technology. Finally, this study verifies the performance of this system through experimental research. According to the results of experimental research, it can be seen that the effect of the system proposed in this study is very good.


Introduction
The composition of music art should be diverse, including traditional music, modern music, and other mainstream and nonmainstream forms. Therefore, since we attach importance to classical and traditional music, we must also attach importance to modern and popular music. Most popular music is passionate and full of emotion, whether sentimental, joyful, or resentful, and it always aims to depict and adapt to the public's psychology as closely as possible. In addition, the appreciation of popular music also requires some artistic guidance and the artistic imagination of the audience. The need for spiritual expression is the essence of popular music. Although popular music is as vast as a sea of smoke, something shining always remains after the great waves wash the sand, and it becomes a classic and an inspiration [1].
At present, our country is in a period of rapid development, and our society is in a period of transformation. Students living in such an era face heavy learning pressure on the one hand and are in a psychologically sensitive period on the other. They are in a special growth stage of transition from immaturity to maturity.
Compared with childhood before it and adulthood after it, the psychology of this period is characterized by poor stability, high emotionality, high sentimentality, and strong sensitivity [2]. Popular music has a distinctly popular character. Most of its content is close to the lives of ordinary people and expresses their feelings. Today, popular music has become the mainstream music culture of society. For teenagers, it is clearly different from children's songs, and it carries the atmosphere of the times, which resonates with them. Moreover, popular music occupies an important place in the lives of college students [3].
This study combines audio frame feature recognition technology to evaluate the effect of multimedia popular music teaching, improve the quality of multimedia popular music teaching, strengthen the role of popular music in the growth of students, and promote the healthy development of students' body and mind.

Related Work
Owing to the rapid development of computer technology and informatics, and to the demand for fast and effective audio recognition, audio recognition based on audio frames has been widely used. Literature [4] laid a theoretical basis for audio frame recognition technology. Literature [5] introduced the speech spectrogram and a way to depict it automatically. Fingerprints are generally believed to differ from person to person, and it usually takes millions of people to find two that are almost identical. The same should be true for audio frames [6]. Literature [7] obtained a method based on pattern matching and probabilistic statistical analysis to support the development of audio frame recognition technology. Many scholars paid attention to this, which pushed audio frame recognition to a peak; during this period, research focused on feature extraction. Literature [8] proposed the UBM-MAP (universal background model, maximum a posteriori) structure for the speaker verification task, which moved audio frame recognition from the laboratory toward practical use. Its important contribution is that UBM-MAP reduces the dependence of the GMM statistical model on the training set: only a few sentences from the speaker are needed to train the model, so it is relatively simple and flexible to use, and its accuracy is relatively high. Subsequently, support-vector machine (SVM) technology was introduced into audio frame recognition and achieved good results [9].
Although there have been many matching algorithms such as GMM-SVM, their effect is not as good as that of GMM and GMM-UBM [10]. Under the current development trend, audio frame recognition has gradually moved from the laboratory stage to the practical stage. In a clean speech environment, audio frame recognition can reach high accuracy, but in a noisy environment the accuracy drops considerably, so noise has become one of the main factors affecting recognition performance. Therefore, research on noise suppression algorithms is urgent. Speech enhancement technology arose in this context; its purpose is to extract the clean speech signal from noisy speech as far as possible [11]. Literature [12] proposed the use of spectral subtraction to eliminate noise; literature [13] studied Wiener filtering algorithms for noise removal. These algorithms, based on short-time spectrum estimation, are suited to environments with relatively high signal-to-noise ratios, and they are simple and easy to implement, so they have remained in wide use.
The vigorous development of very-large-scale integration (VLSI) circuit technology has made real-time implementation of speech enhancement possible. Literature [14] published a soft-decision noise removal algorithm; literature [15] applied the Kalman filter to speech denoising. However, these traditional filters all rely on spectrum analysis, which uses the Fourier transform to map the signal into the frequency domain and then analyzes it there. This approach works only when the signal is stationary and its spectral characteristics differ clearly from those of the noise. In real life, signals are often nonstationary, and the frequency bands of the signal and the noise tend to overlap, so traditional methods are becoming less and less satisfactory. The rapid development of mobile communication technology has given a practical impetus to research on speech enhancement. For example, wavelet decomposition [16] has been proposed for noisy speech signals. The method is built on the mathematical analysis of wavelet decomposition, a joint time-domain and frequency-domain analysis with multiresolution characteristics, so the local characteristics of the signal can be described in both the time domain and the frequency domain.
This property makes it well suited to the analysis of nonstationary signals. It also incorporates part of the theoretical basis of spectral subtraction and is now a focus of multidisciplinary attention. However, wavelet denoising has a weak point: the noise energy needs to be estimated, but the noise that is present is often unknown. Therefore, the independent component analysis method [17] has been developed. Its central idea is to take a set of observed signals that are linear mixtures of source signals (such as pure speech and noise) and, assuming that the source signals are mutually independent, to separate the sources. Speech and noise satisfy this assumption, and the method requires no knowledge of the noise characteristics.

Audio Frame Feature Recognition Algorithm Model
Adaptive postfiltering is a technique that adaptively corrects the speech spectrum according to the spectral characteristics of the local speech in order to improve the quality of the synthesized speech. To understand the principle of adaptive postfiltering in speech coding, it is explained here in terms of Wiener filtering and the hearing model of the human ear. A central task of signal processing is to extract the signal from noise, or to suppress the accompanying noise as far as possible. One effective way to achieve this is to design a filter with optimal linear filtering characteristics. The classical Wiener filter describes how to design the best filter for noise suppression: determine the system function H(z) of the filter so that the mean square error (MSE) between the filtered output signal and the original signal is minimized. We assume that the spectral density of the signal is S(ω) and that the spectral density of the independent additive noise is N(ω); the frequency response of the optimal filter is then [18]

$$H(\omega) = \frac{S(\omega)}{S(\omega) + N(\omega)}. \tag{1}$$

From formula (1), it can be seen that the gain of the filter is close to 1 at frequencies with a large signal-to-noise ratio (SNR). At frequencies with a smaller SNR, the gain of the filter is correspondingly smaller. The postfilter of the conventional narrowband encoder is usually applied to the synthesized speech at the decoding end, as shown in Figure 1.
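The behavior of the Wiener gain described above can be checked numerically. The following is a minimal sketch, assuming the standard form of the gain (signal density divided by the sum of signal and noise densities); the function name and the three spectral density values are hypothetical and chosen only to show a high-SNR, a 0 dB, and a low-SNR frequency bin.

```python
import numpy as np


def wiener_gain(signal_psd, noise_psd):
    """Per-frequency Wiener gain S / (S + N) (assumed standard form)."""
    signal_psd = np.asarray(signal_psd, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    return signal_psd / (signal_psd + noise_psd)


# Hypothetical spectral densities at three frequency bins (not from the paper).
S = np.array([100.0, 1.0, 0.01])  # signal: strong, equal to noise, weak
N = np.array([1.0, 1.0, 1.0])     # noise: flat
gain = wiener_gain(S, N)
print(gain)
```

The gain stays near 1 where the signal dominates and falls toward 0 where the noise dominates, which is the property the postfilter design relies on.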
The ideal short-time postfilter has a frequency response that is similar to the spectral envelope of the speech signal. In the linear predictive encoder, the frequency response of the LPC synthesis filter is similar to the spectral envelope of the input speech signal. Therefore, the transfer function of the short-time postfilter is generally [19]

$$H(z) = \frac{1 - A(z/\beta)}{1 - A(z/\gamma)}. \tag{2}$$

Among them, $A(z) = \sum_{i=1}^{p} a_i z^{-i}$ is the transfer function of the LPC predictor, $a_i$ are the LPC predictor coefficients, p is the order of the LPC predictor, and the corresponding transfer function of the LPC synthesis filter is 1/(1 − A(z)). The scale factor γ corrects the LPC synthesis filter as shown in Figure 2.
If 1/(1 − A(z/γ)) alone is used as a short-time postfilter, it reduces noise, but it introduces a spectral skew with a low-pass effect, which can lead to a "muffled" sound. Therefore, a corresponding all-zero filter 1 − A(z/β) is introduced to reduce the spectral skew.
Thus, the frequency response of the short-time postfilter H(z) is as follows:

$$20\lg\left|H(e^{j\omega})\right| = 20\lg\frac{1}{\left|1 - A(e^{j\omega}/\gamma)\right|} - 20\lg\frac{1}{\left|1 - A(e^{j\omega}/\beta)\right|}. \tag{3}$$

From formula (3), it can be seen that, in the logarithmic domain, the frequency response of H(z) is the difference between the frequency responses of the two weighted LPC synthesis filters, so that some of the skew can be removed, as shown in Figure 3.
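The statement that the log response of the short-time postfilter is the difference of two weighted synthesis filter responses can be verified with a few lines of NumPy. This is only a sketch under stated assumptions: the postfilter is taken to be the ratio of a zero section weighted by `beta` and a pole section weighted by `gamma` (the name `gamma` is an assumption about the notation), and the second-order predictor coefficients are hypothetical.

```python
import numpy as np


def weight(a, factor):
    """Coefficients of A(z/factor): a_i becomes a_i * factor**i, i = 1..p."""
    a = np.asarray(a, dtype=float)
    return a * factor ** np.arange(1, len(a) + 1)


def one_minus_a(a, w):
    """Evaluate 1 - A(e^{jw}) for A(z) = sum_i a_i z^{-i} on the grid w."""
    i = np.arange(1, len(a) + 1)
    return 1.0 - np.exp(-1j * np.outer(w, i)) @ np.asarray(a, dtype=float)


def synthesis_db(a, factor, w):
    """Log-magnitude response of the weighted synthesis filter 1/(1 - A(z/factor))."""
    return -20.0 * np.log10(np.abs(one_minus_a(weight(a, factor), w)))


def postfilter_db(a, beta, gamma, w):
    """Log-magnitude response of (1 - A(z/beta)) / (1 - A(z/gamma))."""
    num = np.abs(one_minus_a(weight(a, beta), w))
    den = np.abs(one_minus_a(weight(a, gamma), w))
    return 20.0 * np.log10(num / den)


a = [1.2, -0.6]                       # hypothetical stable 2nd-order predictor
w = np.linspace(0.0, np.pi, 64)       # frequency grid in rad/sample
response = postfilter_db(a, beta=0.5, gamma=0.8, w=w)
difference = synthesis_db(a, 0.8, w) - synthesis_db(a, 0.5, w)
```

Here `response` and `difference` agree up to rounding, and choosing equal weighting factors gives a flat 0 dB response, so the amount of postfiltering is controlled by how far apart the two factors are.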
Usually, in order to further reduce the low-pass effect, a first-order filter with the transfer function $1 - \mu z^{-1}$ can be cascaded with the short-time postfilter. The long-time postfilter is introduced to weaken the components between the harmonics of the fundamental tone without introducing spectral skew. The transfer function of the long-time postfilter with zeros and poles is [20]

$$H(z) = G\,\frac{1 + \gamma z^{-p}}{1 - \lambda z^{-p}}. \tag{4}$$

Among them, G is the adaptive gain factor, p is the fundamental period, and 0 < λ < 1, 0 < γ < 1. The phases of the p poles of H(z) are 0, 2π/p, 4π/p, …, (p − 1)2π/p, corresponding in turn to the peaks of the harmonics of the fundamental tone. The phases of the p zeros of H(z) are π/p, 3π/p, …, (2p − 1)π/p, corresponding to the troughs between the harmonics of the fundamental tone. γ and λ vary with the turbidity of the speech, thus controlling the degree of long-time postfiltering according to the periodicity of the speech. The adaptive gain G is very important for the long-time postfilter. For clear (unvoiced) tones or most consonants, γ and λ are usually 0, that is, there is no long-time postfilter; if G = 1, the energy of the speech signal after long-time postfiltering is equal to the energy before filtering. For stable turbid (voiced) tones, if G = 1, the energy of the signal is amplified by the long-time postfilter.
This is because, according to formula (5), each current fundamental tone cycle waveform is superimposed on the previous fundamental tone cycle waveform.

Advances in Multimedia
This leads to different effects of the postfilter power gain on clear and turbid tones: the volume of the clear tones decreases relative to the turbid tones, and thus the speech quality is impaired. A derivation is given as follows. The all-pole part (denominator) of the long-time postfilter transfer function corresponds to a recursive infinite impulse response (IIR) filtering operation, whose influence extends into future frames, whereas the all-zero part (numerator) corresponds to a nonrecursive finite impulse response (FIR) filtering operation, whose influence essentially stays within the current frame. Therefore, in practical applications, a very small value of λ is generally chosen, or even λ = 0. In the postfilter design of this wideband embedded speech encoder, the long-time postfilter used is the filter with no poles.
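The energy argument for the long-time postfilter can be illustrated with a direct time-domain implementation. The sketch below assumes the pole-zero comb form with a zero coefficient `gamma`, a pole coefficient `lam`, and a gain `G` acting on a delay of one fundamental period `p`; the parameter names and the test signal are hypothetical.

```python
import numpy as np


def long_time_postfilter(x, p, gamma, lam, G=1.0):
    """Comb filter y(n) = G * (x(n) + gamma * x(n - p)) + lam * y(n - p).

    Assumed form of the pole-zero long-time postfilter; samples before the
    start of the buffer are treated as zero.
    """
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for n in range(len(x)):
        x_delayed = x[n - p] if n >= p else 0.0
        y_delayed = y[n - p] if n >= p else 0.0
        y[n] = G * (x[n] + gamma * x_delayed) + lam * y_delayed
    return y


# Hypothetical periodic ("turbid") signal with fundamental period 4.
x = np.tile([1.0, 0.0, 0.0, 0.0], 10)
y = long_time_postfilter(x, p=4, gamma=0.3, lam=0.0)   # zeros only, no poles
energy_in, energy_out = np.sum(x ** 2), np.sum(y ** 2)

# Impulse response with a pole: the recursion keeps ringing every p samples.
impulse = np.zeros(12)
impulse[0] = 1.0
h = long_time_postfilter(impulse, p=4, gamma=0.3, lam=0.2)
```

With G = 1 the periodic input leaves the filter with more energy than it entered with, because each cycle is added to the previous one, and the impulse response shows how a nonzero pole coefficient extends the filter memory beyond one period.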
For an analysis-by-synthesis encoder such as the CELP-based model, the optimal excitation parameters are searched in the perceptually weighted domain by minimizing the mean square error between the input speech and the synthesized speech. The perceptually weighted filter for a conventional narrowband signal is [21]

$$W'(z) = \frac{A'(z/\gamma_1)}{A'(z/\gamma_2)}. \tag{6}$$

Among them, A′(z) is the linear prediction filter, and γ_1 and γ_2 are the control factors. In this way, the quantization noise (usually assumed to be white noise) is weighted by 1/W′(z), which shapes the noise spectrum to have a resonant peak structure similar to that of the input speech signal.
However, traditional perceptually weighted filters for narrowband signals do not exhibit large spectral tilts. For wideband signals, the dynamic range between low and high frequencies is very large, and so is the spectral tilt, which requires the perceptually weighted filter to represent not only the resonant peak structure but also the spectral tilt. Therefore, the perceptual weighting of the wideband signal is decomposed into steps. First, the input signal is pre-emphasized, that is, the high-frequency part is raised by the pre-emphasis filter $P(z) = 1 - \mu z^{-1}$. Then, the LPC prediction coefficients, with transfer function A(z), are calculated. Finally, the perceptually weighted filter is obtained, as shown in the following formula:

$$W(z) = \frac{A(z/\gamma_1)}{P(z)} = \frac{A(z/\gamma_1)}{1 - \mu z^{-1}}. \tag{7}$$

A(z) is calculated on the basis of the pre-emphasized signal, so the tilt of 1/A(z/γ_1) is smaller than it would be if A(z) were calculated directly on the input speech. At the same time, the synthesized speech has to be de-emphasized at the decoding end, that is, filtered by 1/P(z). In this way, the spectral correction of the quantization error is $W^{-1}(z)P^{-1}(z)$, that is, 1/A(z/γ_1).
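The pre-emphasis at the encoder and the de-emphasis at the decoder form an exact inverse pair, which is what allows the tilt to be removed before LPC analysis and restored afterward. The following minimal sketch shows the two first-order filters; the value `mu = 0.68` and the input samples are assumptions made for illustration, since the text does not state the coefficient.

```python
import numpy as np


def pre_emphasis(x, mu):
    """Apply P(z) = 1 - mu * z^{-1}: y(n) = x(n) - mu * x(n - 1)."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= mu * x[:-1]
    return y


def de_emphasis(y, mu):
    """Apply 1 / P(z): x(n) = y(n) + mu * x(n - 1), starting from zero state."""
    y = np.asarray(y, dtype=float)
    x = np.zeros_like(y)
    previous = 0.0
    for n, value in enumerate(y):
        previous = value + mu * previous
        x[n] = previous
    return x


mu = 0.68                                    # assumed value, not given in the text
speech = np.array([0.0, 1.0, 0.5, -0.25, -1.0, 0.3])   # hypothetical samples
emphasized = pre_emphasis(speech, mu)
restored = de_emphasis(emphasized, mu)
```

Because `restored` matches `speech`, any noise shaping designed on the pre-emphasized signal is seen through 1/P(z) at the output, which is why the overall correction of the quantization error reduces to the weighted synthesis response.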
Although the noise spectrum is shaped according to 1/A(z/γ_1), experiments show that there is still subtle noise in the synthesized speech, especially at low code rates, so it is necessary to introduce a postfilter at the decoding end. Long-time postfiltering works better when it is applied to the prediction error (residual) signal than when it is applied to the speech signal. Moreover, the control factor of the long-time postfilter depends on the turbidity of the speech, so it can be calculated more accurately in the residual signal domain. The postfilter design in G.729 supports this idea: the synthesized speech is first passed through the short-time predictor to obtain the residual signal; then, the long-time postfilter is applied to this residual signal; and finally, the short-time postfilter is applied. Figure 4 shows the postprocessing flowchart of this wideband embedded encoder, and the modules are described in detail below. The antisparse processing is performed only at the rate of 8 kb/s, and it acts on the fixed codebook vector with the purpose of improving the perceptual quality at low bit rates.
This is because, if only the 8 kb/s stream is received at the decoder, the fixed codebook vector has only three nonzero sample points per subframe (it is "sparse"), and this sparsity makes the speech sound unnatural. To reduce this artifact, antisparse processing is applied to the fixed codebook vector. The fixed codebook gain is smoothed on the basis of two parameters, the turbidity and the stability of the speech. The turbidity λ of the speech is estimated by formula (8) from $E_v$ and $E_c$, the energies of the adaptive codebook and the fixed codebook, respectively, where $E_v = g_p^2 \sum_n v^2(n)$ and $E_c = g_c^2 \sum_n c^2(n)$. The closer λ is to 0, the closer the frame is to pure turbid speech. The closer λ is to 1, the closer the frame is to pure clear speech.
The stability factor θ is estimated from the distance $D_s$ between the ISP coefficients of the current frame and those of the previous frame, as in formula (9) [22]. (ISP stands for immittance spectral pair, a frequency-domain representation of the LPC coefficients.) Among them, p is the order of the linear prediction, $isp_n$ is the ISP coefficient of the current frame, and $isp_{n-1}$ is the ISP coefficient of the previous frame. The closer θ is to 1, the more stable the frame is.
Considering both turbidity and stability, the smoothing control factor $S_m$ is defined by formula (10). If $S_m$ is close to 1, the signal is stable and nonturbid, such as stationary background noise. The smoothing process for the fixed codebook gain $g_c$ uses a threshold $g_c^{thres}$, whose initial value is 0, and proceeds as follows:
(1) If $g_c < g_c^{thres}$, the algorithm calculates $tmp = 1.19\,g_c$ and then compares tmp with $g_c^{thres}$. If $tmp > g_c^{thres}$, the algorithm sets tmp to $g_c^{thres}$.
(2) If $g_c \ge g_c^{thres}$, the algorithm calculates $tmp = 0.84\,g_c$ and then compares tmp with $g_c^{thres}$. If $tmp < g_c^{thres}$, the algorithm sets tmp to $g_c^{thres}$.
(3) The algorithm updates the threshold, that is, it sets $g_c^{thres} = tmp$.
(4) Finally, the smoothed fixed codebook gain is obtained: $g_c = S_m \cdot tmp + (1 - S_m)\,g_c$.
The fixed codebook describes the details of speech; its energy is mainly concentrated in the high-frequency part, and the low-frequency part has less energy. For pure turbid speech, adjusting the energy of the fixed codebook at low and high frequencies within a reasonable range can improve the perceived quality of the speech. The encoder uses high-frequency enhancement filters to enhance the first and second layers, as shown in Figure 5. The high-frequency enhancement filter is a high-pass filter whose coefficient $c_{pe}$ is adaptively adjusted according to the turbidity of the speech:

$$c_{pe} = 0.125\,(1 + r_v), \qquad r_v = \frac{E_v - E_c}{E_v + E_c},$$

where $E_v$ and $E_c$ are the energies of the adaptive codebook and the fixed codebook, respectively. When the turbidity is larger (that is, $c_{pe}$ is close to 0.25), the higher frequencies are enhanced and the lower frequencies are weakened. The high-frequency enhancement filter is expressed as follows:

$$H_{pe}(z) = -c_{pe}\,z + 1 - c_{pe}\,z^{-1}. \tag{12}$$

The fixed codebook is passed through this filter to get a new fixed codebook:

$$c'(n) = c(n) - c_{pe}\,\bigl(c(n+1) + c(n-1)\bigr). \tag{13}$$
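The gain smoothing steps and the high-frequency enhancement of the fixed codebook can be sketched as two small functions. This is an illustrative sketch rather than the encoder's actual code: the function names are hypothetical, the smoothing factor is passed in as a plain number, and samples outside the subframe are assumed to be zero when the enhancement filter reaches past the ends of the vector.

```python
import numpy as np


def smooth_gain(g_c, g_thres, s_m):
    """One pass of smoothing steps (1)-(4).

    Returns the smoothed gain and the updated threshold (step 3).
    """
    if g_c < g_thres:
        tmp = min(1.19 * g_c, g_thres)   # step (1): raise, but not above the threshold
    else:
        tmp = max(0.84 * g_c, g_thres)   # step (2): lower, but not below the threshold
    smoothed = s_m * tmp + (1.0 - s_m) * g_c   # step (4)
    return smoothed, tmp


def enhance_codebook(c, c_pe):
    """c'(n) = c(n) - c_pe * (c(n + 1) + c(n - 1)), zero outside the subframe."""
    c = np.asarray(c, dtype=float)
    padded = np.concatenate(([0.0], c, [0.0]))
    return c - c_pe * (padded[2:] + padded[:-2])


# Hypothetical gains for three consecutive subframes, fully smoothed (s_m = 1).
threshold = 0.0                      # initial value of the threshold
history = []
for gain in (1.0, 0.5, 1.0):
    smoothed, threshold = smooth_gain(gain, threshold, s_m=1.0)
    history.append(smoothed)

pulse = enhance_codebook([0.0, 1.0, 0.0], c_pe=0.25)
```

The threshold follows the gain with a limited step in each direction, so with a smoothing factor near 1 abrupt gain changes are damped, while a factor near 0 leaves the gain untouched. The enhanced pulse gains negative side lobes, which is the high-pass effect described above.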
In turn, the total synthetic excitation exc2(n) is calculated according to formula (14), where v(n) is the adaptive codebook vector, $g_p$ is the adaptive codebook gain, c′(n) is the enhanced fixed codebook vector, and $g_c$ is the fixed codebook gain:

$$exc2(n) = g_p\,v(n) + g_c\,c'(n). \tag{14}$$

The long-time postfilter of this encoder follows the design of a conventional long-time postfilter. It is applied to the excitation in order to eliminate the noise between the excitation harmonics. Figure 6 shows an example of a long-time postfilter, whose expression is given in formula (15). There, T is the integer fundamental delay of the current subframe, r = 0.5, and g is the adaptive control factor with 0 < g < 1, which allows adaptive control of the long-time postfilter: if the current subframe excitation is strongly correlated with past excitations (for example, a turbid tone), g tends to 1. Conversely, if the current subframe excitation is weakly correlated with past excitations (for example, a clear tone), g tends to 0, that is, the excitation does not pass through the long-time postfilter. The values of T and g are calculated by the following procedure.
Here, the selection of T is very important because it determines the harmonic period of the long-time filter, so it has to be refined. First, the best integer fundamental delay $T_1$ is selected in the range $[T_0 - 1, T_0 + 1]$, where $T_0$ is the integer fundamental delay of the current subframe. The autocorrelation R(k) between the current subframe excitation r(n) and the delayed excitation r(n − k) is calculated as

$$R(k) = \sum_{n} r(n)\,r(n - k), \tag{16}$$

where the sum runs over the samples of the current subframe, and the k with the maximum R(k) is the best integer fundamental delay $T_1$.
The best fractional fundamental delay T is then selected. T is chosen around $T_1$ with a resolution of 1/8. The algorithm calculates R′(k) (as in formula (17)), and the delay that maximizes R′(k) is the best fundamental delay T. Among them, r(n) is the current subframe excitation and $r_k(n)$ is the excitation code vector obtained by interpolating around $T_1$. $r_k(n)$ is first obtained with an interpolation filter of length 33; after the optimal fractional fundamental delay T is found, $r_k(n)$ is recomputed with an interpolation filter of length 129. When the correlation calculated with the filter of length 129 is larger than that obtained with the filter of length 33, the filter of length 129 is chosen.
When the optimal fundamental delay T is found, the normalized autocorrelation is obtained by dividing R(T) by the sum of the squares of r(n). If the normalized autocorrelation is less than 0.5, as in formula (18), then g = 0, which is equivalent to the excitation not passing through the long-time filter. That is, when the correlation between the excitation of the frame and the past excitation is small, the long-time filter is bypassed.
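The integer part of the delay search and the 0.5 threshold on the normalized autocorrelation can be sketched as follows. The sketch covers only the integer search and the bypass decision, not the fractional refinement with the interpolation filters; the function names, the buffer layout (past excitation stored before the current subframe), and the sinusoidal test excitation are assumptions.

```python
import numpy as np


def best_integer_delay(exc, start, length, t0):
    """Choose T1 in [t0 - 1, t0 + 1] that maximizes R(k) = sum_n r(n) r(n - k).

    exc holds past and current excitation; the current subframe is
    exc[start:start + length], so start must be at least t0 + 1.
    """
    exc = np.asarray(exc, dtype=float)
    r = exc[start:start + length]

    def corr(k):
        return float(np.dot(r, exc[start - k:start - k + length]))

    t1 = max((t0 - 1, t0, t0 + 1), key=corr)
    return t1, corr(t1)


def normalized_correlation(r_t, r):
    """R(T) divided by the sum of squares of the current subframe excitation."""
    r = np.asarray(r, dtype=float)
    return r_t / float(np.dot(r, r))


# Hypothetical excitation: a sinusoid with a period of 32 samples.
n = np.arange(192)
exc = np.sin(2.0 * np.pi * n / 32.0)
start, length = 128, 64
t1, r_t1 = best_integer_delay(exc, start, length, t0=33)
ratio = normalized_correlation(r_t1, exc[start:start + length])
bypass_filter = ratio < 0.5          # True would mean g = 0
```

For this strongly periodic excitation the search settles on the true period and the normalized correlation is close to 1, so the long-time filter stays active; an excitation with little periodicity would fall below 0.5 and be passed through unchanged.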
The gain coefficient g is calculated by formula (19).
The core layer of this embedded encoder is the CELP model. It must be able to handle both wideband speech (bandwidth 50-7000 Hz) and narrowband speech (bandwidth 300-4000 Hz). In order to improve the quality of the synthesized speech for these two types of input, this study introduces the traditional short-time postfilter. The purpose of applying the short-time postfilter to the synthesized speech is to attenuate the noise between the resonance peaks. Its expression is given in formula (20), where $\hat{A}(z)$ is the quantized linear prediction filter. Experiments show that the short-time postfilter performs best with the control factors $r_1 = 0.6$ and $r_2 = 0.7$. These values also show that the short-time postfiltering for wideband speech cannot be too strong (in narrowband speech encoders, the control factors of the short-time postfilter are usually $r_1 = 0.5$ and $r_2 = 0.8$). If $h_f(n)$ is the impulse response of $\hat{A}(z/r_1)/\hat{A}(z/r_2)$, the gain $g_f$ is calculated from $h_f(n)$ as in formula (21). Figure 7(a) shows the frequency response of the synthesis filter $1/\hat{A}(z)$ for one frame of speech, and Figure 7(b) shows the frequency response of $\hat{A}(z/r_1)/\hat{A}(z/r_2)$. It can be seen from the figure that the filter in (b) tracks the resonance peaks of the speech spectrum and weakens the energy between them, but it introduces a spectral tilt. Adding the spectral tilt compensation filter reduces the spectral tilt of the short-time postfilter, as shown in Figure 7(c). The synthesized speech therefore undergoes spectral tilt compensation and adaptive gain control after the short-time postfilter, and these three modules work as one unit. The filter $H_t(z)$ compensates for the skew of the short-time postfilter, and its expression is given in formula (22). Here, $r_i k_1'$ is the skew factor and $g_i = 1 - |r_i k_1'|$. $r_i$ is a constant: $r_i = 0.9$ when $k_1' \le 0$, and $r_i = 0.2$ when $k_1' > 0$.
The purpose of the adaptive gain control is to compensate for the energy difference between the synthesized speech s(n) before filtering and the filtered speech sf(n). The gain adjustment factor is calculated as in formula (23). The gain-adjusted speech sf′(n) is as follows:

$$sf'(n) = g^{(n)}\,sf(n), \quad n = 0, \ldots, 64. \tag{24}$$

The initial value of $g^{(n)}$ is $g^{(-1)} = 1$, and it is then updated point by point as in formula (25).
For a given input signal x(n), an output with a sampling rate of L/M times that of the input is obtained by interpolating x(n) by a factor of L, passing the result through a low-pass filter h(n), and then decimating by a factor of M. The frequency response of the low-pass filter h(n) is expressed as follows:

$$H(e^{j\omega}) = \begin{cases} C, & |\omega| \le \omega_x, \\ 0, & \text{otherwise}. \end{cases}$$

Among them, $\omega_x$ is the normalized cutoff frequency, and C is a calibration constant that should be taken as C = L. The L/M sampling rate conversion equation is as follows:

$$y(n) = \sum_{k=0}^{K-1} h\bigl(kL + \langle nM \rangle_L\bigr)\, x\bigl(\lfloor nM/L \rfloor - k\bigr).$$

Among them, K = N/L, N is the length of the filter h(n), $\langle nM \rangle_L$ denotes the remainder of nM divided by L, and $\lfloor nM/L \rfloor$ denotes nM/L rounded down to an integer.
(1) The algorithm performs downsampling from 16 kHz to 12.8 kHz. We set L = 4 and M = 5, that is, 4/5 downsampling; after conversion, the sampling rate is 12.8 kHz, and each frame of speech changes from 320 sample points to 256 sample points. The normalized cutoff frequency $\omega_x$ of h(n) is 0.2π, the length is N = 120, and the amplitude response is shown in Figure 8.
(2) The algorithm performs upsampling from 8 kHz to 12.8 kHz. We set L = 8 and M = 5, that is, 8/5 upsampling; the converted sampling rate is 12.8 kHz, and each frame of speech changes from 160 sample points to 256 sample points. The normalized cutoff frequency $\omega_x$ of h(n) is 0.125π, the length is N = 256, and the amplitude-frequency response is shown in Figure 9.
(3) The algorithm performs upsampling from 12.8 kHz to 16 kHz. We set L = 5 and M = 4, that is, 5/4 upsampling; the converted sampling rate is 16 kHz, and each frame of speech changes from 256 sample points to 320 sample points. The normalized cutoff frequency $\omega_x$ of h(n) is 0.2π, the length is N = 120, and the amplitude response is shown in Figure 10.
(4) The algorithm performs downsampling from 12.8 kHz to 8 kHz. We set L = 5 and M = 8, that is, 5/8 downsampling of the speech signal with a 12.8 kHz sampling rate; the converted sampling rate is 8 kHz, and each frame of speech changes from 256 sample points to 160 sample points. The normalized cutoff frequency $\omega_x$ of h(n) is 0.125π, the length is N = 240, and the amplitude-frequency response is shown in Figure 11.
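The four conversions listed above all follow the same pattern: insert zeros to raise the rate by L, low-pass filter, and keep every M-th sample. The sketch below implements this directly rather than in the more efficient polyphase form; the Hamming-windowed sinc design of h(n) is an assumption, since the text specifies only the cutoff frequency, the gain, and the filter length.

```python
import numpy as np


def lowpass_fir(num_taps, cutoff, gain):
    """Hamming-windowed sinc low-pass filter; cutoff is in rad/sample.

    The window choice is an assumption; only cutoff, gain, and length
    are taken from the text.
    """
    n = np.arange(num_taps) - (num_taps - 1) / 2.0
    ideal = (cutoff / np.pi) * np.sinc(cutoff * n / np.pi)
    return gain * ideal * np.hamming(num_taps)


def resample(x, L, M, num_taps):
    """Change the sampling rate by L/M: zero-stuff by L, filter, keep every M-th sample."""
    x = np.asarray(x, dtype=float)
    upsampled = np.zeros(len(x) * L)
    upsampled[::L] = x
    h = lowpass_fir(num_taps, np.pi / max(L, M), gain=L)
    filtered = np.convolve(upsampled, h, mode="same")
    return filtered[::M]


# Frame lengths for the four conversions (L, M, N as listed in the text).
frame_16k = np.ones(320)
frame_12k8 = resample(frame_16k, L=4, M=5, num_taps=120)      # 16 kHz -> 12.8 kHz
frame_from_8k = resample(np.ones(160), L=8, M=5, num_taps=256)  # 8 kHz -> 12.8 kHz
frame_back_16k = resample(np.ones(256), L=5, M=4, num_taps=120)  # 12.8 kHz -> 16 kHz
frame_8k = resample(np.ones(256), L=5, M=8, num_taps=240)        # 12.8 kHz -> 8 kHz
```

The cutoff π/max(L, M) reproduces the values 0.2π and 0.125π given for the four cases, and the filter gain of L restores the amplitude lost by zero insertion, so a constant input stays close to constant away from the frame edges.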

Evaluation of Multimedia Popular Music Teaching Effect Based on Audio Frame Feature Recognition
The music teaching system provides a variety of music learning services, online guidance, virtual environment learning, and intelligent evaluation. To realize these functions, the entire platform adopts a five-layer architecture. From bottom to top, the layers are the access layer, data processing layer, data storage layer, scene management layer, and application layer, as shown in Figure 12.
The system builds a corresponding database for students. Based on traditional teaching experience, this study carries out a quantitative analysis of the teaching content at all levels.
The statistical analysis of a large amount of data can provide stronger reference data for the teaching and training of teachers and students. The framework of the external facilities and equipment of the system is shown in Figure 13. The audio frame feature recognition effect and the teaching effect of the system proposed in this study are evaluated, and the results are shown in Tables 1 and 2. These results indicate that the multimedia popular music system based on audio frame feature recognition technology proposed in this study performs well, so the system can be applied in actual teaching.

Conclusions
Popular music is diverse: some of it is suitable for college students to appreciate, and some of it should indeed be kept away from them. Precisely because the quality of popular music is so uneven, educators worry that young students will be harmed, and so they tend to reject popular music. However, in the context of society as a whole, this kind of educational rejection will not reduce the impact of popular music on college students. In the past, the rejection and criticism of popular music in theoretical circles were partly influenced by the opposition between Eastern and Western ideologies. Critics subconsciously assumed that anything imported from the West was corrupt and bad, and that popular music was purely Western in its origin, content, and form; therefore, they held that a socialist country should resist it and protect young people from this decadent culture. Before the reform and opening up, this view persisted for a long time. This study combines audio frame feature recognition technology to evaluate the effect of multimedia popular music teaching, improve the quality of multimedia popular music teaching, strengthen the role of popular music in the growth of students, and promote the healthy development of students' body and mind.

Data Availability
The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest
The author declares no competing interests.

Acknowledgments
This study was sponsored by the Shandong University of Arts.