Research on Multimodal Music Emotion Recognition Method Based on Image Sequence

The task of a music performance system is to control lighting changes by identifying the emotional elements of music. Once an identification error occurs, a good stage effect cannot be created. Therefore, a multimodal music emotion recognition method based on image sequence is studied. The emotional characteristics of music, including acoustic characteristics, melody characteristics, and audio characteristics, are analyzed, and a feature vector is constructed. A recognition and classification model based on a neural network is trained, the weights and thresholds of each layer are adjusted, and the feature vector is then input into the trained model to realize the intelligent recognition and classification of multimodal music emotion. The threshold of the starting point range of a specific humming note is given by the center clipping method, which eliminates the low-amplitude part of the humming note signal; the short-time spectral structure features and envelope features of the pitch are then extracted to complete the multimodal music emotion recognition. The results show that the calculated kappa coefficient k is greater than 0.75, which indicates that the recognition and classification results are in good agreement with the actual results and that the classification and recognition accuracy is high.


Introduction
Music appeared earlier than language: before human beings used language to express their feelings, they had learned to use music [1,2]. Music has played an important role in human history and has been integrated into all aspects of human life [3]. Music is an art form that takes sound as a means of communication and then produces emotional experience, and it can communicate emotion directly in the form of sound movement [4,5]. The essence of music is emotion. The specific form of the acoustic vibration of music is directly related to human emotion, and according to this connection, music can be used to describe people's emotional activities in detail [6]. All music activities obey and reflect the fluctuations of people's inner world, whether the creators and performers vent their emotions or the listeners accept the emotional connotation of the music. With the continuous development of science and technology, digital music technology has brought great changes to music, a traditional and classic way of emotional communication, and the development of computer science has brought revolutionary progress to the creation, communication, storage, and release of music works. Especially with the continuous enrichment of computer music materials, it has become an urgent research topic to study the emotional information of music works with intelligent information analysis and processing methods, so that computers can recognize and express music emotions like people.
The development of image sequence technology has likewise advanced the creation, communication, storage, and release of music works. Generally, image sequence noise is an unpredictable random signal. Noise is an important consideration in image sequence processing: it affects every stage, from input and acquisition through processing to the output of results [7,8]. In particular, the suppression of noise at image input and acquisition is a key problem. If the input is accompanied by large noise, it will inevitably affect the whole process and the output results. Therefore, a good image sequence processing system, whether it uses analog processing or digital processing by computer, takes reducing the noise of the first stage as its main target [9,10]. With the continuous enrichment of computer music materials, using intelligent information analysis and processing methods based on image sequences to study the emotional information of music works, so that computers can recognize and express multimodal music emotions like people, has become an important research topic.
Relevant scholars have carried out many studies in this regard. Reference [11] examined the common neural mechanism of emotion processing in music and vocalization and compared the neural mechanisms involved in vocalization and music processing, so as to observe their possible similarities in the coding of emotional content. Positive and negative emotional vocalizations (such as laughter and crying) and digitally extracted violin music, which share a common melody contour and main pitch/frequency characteristics, were used as stimuli. Reference [12] proposed that the semantic memory and episodic memory of music are served by different neural networks; existing knowledge of the retrieval of these two kinds of memory had been obtained mainly with verbal and visuospatial materials. Two delayed recognition tasks were constructed, one containing only familiar items and the other only unfamiliar items. For each recognition task, the retrieval targets had been presented in the preceding semantic task. The episodic task and the semantic task were compared with two perceptual control tasks and with each other. Based on the above analysis, a multimodal music emotion recognition method based on image sequence is proposed. The music emotion features, including acoustic features, melody features, and audio features, are analyzed, and the feature vector is constructed.
The recognition and classification model based on a neural network is trained, the weights and thresholds of each layer are adjusted, and the feature vector is input into the trained model to realize the intelligent recognition and classification of multimodal music emotion. The threshold of the starting point range of a specific humming note is given by the center clipping method, which eliminates the low-amplitude part of the humming note signal; the short-time spectral structure features and envelope features of the pitch are then extracted to complete the multimodal music emotion recognition.
The recognition and expression of multimodal music emotion enable users to realize emotional human-computer interaction through music, which enriches the research content of human-computer interaction technology.

Multimodal Music Emotion Recognition and Classification Based on Image Sequence
In addition to the music itself, a perfect music performance needs a complementary live atmosphere. In music performance, the on-site atmosphere is set off mainly by lighting, which is often changed with the emotional factors expressed in the music to help create a good stage effect. In this context, multimodal music emotion recognition is very important for better control of the lighting [13][14][15]. Therefore, a classification and recognition model for multimodal music emotion is constructed to study the intelligent recognition and classification of multimodal music emotion in a music performance system.

Analysis of Emotional Characteristics of Multimodal Music.
The realization of multimodal music emotion recognition is based on multimodal music emotion features, so feature extraction is the first link of this research [16,17]. Most previous multimodal music emotion classification methods take a single music feature as the classification basis. Although they can complete the classification task, their accuracy cannot be guaranteed. To solve this problem, in this study a variety of music features are extracted and fused based on image sequences, and classification and recognition are then performed on the fused features. The principle of image sequence is shown in Figure 1.
In order to identify the emotional characteristics in music [18], it is necessary to understand the composition of music. The music-related factors that clearly show emotional characteristics include acoustic characteristics, melody characteristics, and audio characteristics.

Acoustic Characteristics.
Acoustic feature refers to the physical quantity that represents the acoustic characteristics of multimodal music speech. It is also a general term for the acoustic performance of many elements of sound, for example, the energy concentration area, formant frequency, formant intensity, and bandwidth representing the timbre of multimodal music, as well as the duration, fundamental frequency, and average voice power representing the prosodic characteristics of multimodal music speech. For the classification of multimodal music speech, the traditional method is to study the characteristics of pronunciation organs, such as the tongue position of vowels, front and back, and the pronunciation position of consonants. Now, with the progress of science and technology, further fine research can be made according to the acoustic characteristics.

Scientific Programming
Acoustic factor is the most basic component of music [19,20]. Music with different emotions shows different acoustic characteristics, and the basic corresponding relationship is shown in Table 1.

Melody Characteristics.
Melody features describe the melody, that is, the line formed by high and low tones of different lengths, which is the soul of music. The tones are organized according to certain laws [21][22][23].
The extracted features include five aspects.
(1) Balance parameter $Y_1$: Balance refers to the proportional value of the volume in the left and right channels.
(2) Volume parameter $Y_2$: Volume refers to the loudness of the sound that can be heard by the human ear.
(3) Pitch parameter $Y_3$: Pitch refers to the vibration frequency of the fundamental frequency of a note. Fast-paced music has a fast vibration frequency, and slow-paced music has a slow one.
(4) Average strength parameter $Y_4$: Strength refers to the strength of the power generated by music. Soothing music has weak strength, while more shocking music has strong strength [24,25].
(5) Note energy parameter $Y_5$: Note energy refers to the sum of the products of note pitch and length.
In the calculation formulas of these parameters, $Pan(k)$ represents the balance value of the left and right channels, with a range of 0-127; $Volume$ represents the volume of the track, with a range of 0-127; $Pitch$ stands for note pitch; $n$ represents the number of notes in the track; $Vel(k,i)$ represents the intensity value of the $i$th note in the $k$th track; $k$ indicates the track number; $N$ represents the number of notes in the $k$th track; $P_{ij}$ and $D_{ij}$ represent the pitch and length of the $i$th note in the $j$th track channel.
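As a rough illustration of the last two parameters, the sketch below computes an average strength and a note energy for one track directly from their verbal definitions. The note representation (a tuple of pitch, length, and intensity with MIDI-style values) and the function names are assumptions made for this example only, and the exact formulas used in this work may include additional normalization.

```python
from typing import List, Tuple

# Hypothetical note representation: (pitch, length, intensity).
Note = Tuple[int, float, int]


def average_strength(notes: List[Note]) -> float:
    """Mean intensity of the notes in one track (a reading of Y4)."""
    if not notes:
        return 0.0
    return sum(vel for _, _, vel in notes) / len(notes)


def note_energy(notes: List[Note]) -> float:
    """Sum over notes of pitch times length (the verbal definition of Y5)."""
    return sum(pitch * length for pitch, length, _ in notes)


# Illustrative track with three notes (made-up values).
track = [(60, 1.0, 80), (64, 0.5, 100), (67, 2.0, 60)]
print(average_strength(track))
print(note_energy(track))
```

Under these assumptions, a soothing track with low intensity values gives a small average strength, in line with the description of $Y_4$.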

Audio Features.
Audio features are an important condition for recognizing multimodal music emotion. Different music emotions are expressed through different audio features. Audio is one of the important influencing factors in music and affects its rhythm. The faster the rhythm, the more pronounced the audio and the happier the multimodal music emotion expressed; conversely, the emotion is duller or more depressing [26,27]. Audio features based on image sequences can be described from two aspects, time domain features and frequency domain features [28].
Time domain characteristics: The time domain characteristics of audio refer to the time domain parameters of each frame calculated from the music signal, mainly including zero crossing rate and amplitude [29][30][31]. The specific analysis is as follows.
(1) Zero crossing rate $Z_n$: Zero crossing rate refers to the frequency at which the audio signal waveform passes through the zero level. Generally speaking, the zero crossing rate in the high-frequency band of a piece of music is relatively high, and in the low-frequency band it is relatively low. Through this parameter, voiced and unvoiced sounds in music can be well distinguished. Generally, unvoiced sounds are mostly used in cheerful music, while voiced sounds are often used in slow and deep music. In the calculation formula of the zero crossing rate, $\mathrm{sgn}[x(m)]$ represents the sign function of the audio signal $x(m)$; $N$ represents the effective width of the window; $n$ represents the time position of the window.
(2) Amplitude $M_n$: Amplitude refers to the extent of the waveform vibration of the audio signal [32][33][34]. The more passionate the music, the greater the audio amplitude; the more soothing the music, the smoother the audio amplitude. The audio amplitude is described as follows:

In the formula, w(n − m) represents the moving window function.
(3) Frequency domain characteristics: The frequency domain characteristics of audio include the spectral centroid $C_t$ and the spectral flux $F_t$. In their calculation formulas, $M_t[n]$ represents the amplitude of the short-time spectrum of frame $t$ at frequency point $n$; $N_t[n]$ and $N_{t-1}[n]$ represent the normalized amplitude of the spectrum of frame $t$ and frame $t-1$ at frequency point $n$, respectively.
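The time domain and frequency domain parameters described above can be sketched as follows. This is a minimal illustration that assumes the common textbook definitions (a rectangular window, frequency points indexed from 0, and spectra normalized to unit sum); the function names are chosen here for illustration, and the exact formulas used in this work may differ in scaling.

```python
from typing import List


def sgn(value: float) -> int:
    """Sign function: 1 for non-negative samples, -1 otherwise."""
    return 1 if value >= 0 else -1


def zero_crossing_rate(frame: List[float]) -> float:
    """Half the summed absolute sign differences between adjacent samples."""
    return 0.5 * sum(abs(sgn(frame[m]) - sgn(frame[m - 1]))
                     for m in range(1, len(frame)))


def short_time_amplitude(frame: List[float]) -> float:
    """Sum of absolute sample values under a rectangular window."""
    return sum(abs(x) for x in frame)


def spectral_centroid(mag: List[float]) -> float:
    """Magnitude-weighted mean frequency point of one frame."""
    total = sum(mag)
    if total == 0:
        return 0.0
    return sum(n * m for n, m in enumerate(mag)) / total


def normalize(mag: List[float]) -> List[float]:
    """Scale a magnitude spectrum so that its values sum to 1."""
    total = sum(mag)
    return [m / total for m in mag] if total else list(mag)


def spectral_flux(mag_t: List[float], mag_prev: List[float]) -> float:
    """Squared difference between normalized spectra of adjacent frames."""
    cur, prev = normalize(mag_t), normalize(mag_prev)
    return sum((a - b) ** 2 for a, b in zip(cur, prev))


frame = [0.5, -0.5, 0.5, -0.5]
print(zero_crossing_rate(frame), short_time_amplitude(frame))
print(spectral_centroid([0.0, 0.0, 1.0, 0.0]))
```

A frame that alternates in sign at every sample gives the largest zero crossing rate for its length, while two frames with the same spectral shape give a flux of zero regardless of their loudness.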
Based on the above three categories and 14 multimodal music emotional features, a feature vector is formed to describe the emotional factors of a piece of music. In the feature vector, $U_1$ represents acoustic characteristics; $U_2$ represents melody characteristics; $U_3$ represents audio characteristics. The audio feature structure is shown in Figure 2.

Construction of Multimodal Music Emotion Recognition Classification Model.
Based on the emotional features contained in the above music, a classification and recognition model is established to realize multimodal music emotion recognition and classification, and a neural network is used to construct the model [35,36]. The BP neural network is an intelligent algorithm developed by simulating the working principle of the neural network of the human brain.
The neural network mainly includes three layers, and the classification is realized through the operation of each layer. The classification and recognition model constructed with this algorithm is shown in Figure 3.
In Figure 3, training is the key step in model construction. The specific process is as follows. First, the selected training samples are input, and the results are obtained after the operations of the hidden layer and the output layer. These results are then compared with the expected results. When the difference between them is less than the set threshold, the training is complete; otherwise, the difference is propagated back from the output to the input, and the process is repeated until the optimal weights and thresholds are reached. The purpose of BP neural network training is to adjust and optimize the weights and thresholds connecting every two layers in the model. The adjustment formulas are described as follows.
(1) Adjustment formula of the connection weight $w_{ij}$ and threshold $\theta_j$ between the input layer and the hidden layer: In the formula, $\mu_j^k$ represents the error value in the hidden layer; $c_i$ represents the input eigenvector; $N$ represents the number of iterations; $k$ represents the number of training samples; $n$ represents the number of neurons in the input layer; $p$ represents the number of neurons in the hidden layer.
(2) Adjustment formula of the connection weight $v_{jt}$ and threshold $c_t$ between the hidden layer and the output layer: In the formula, $d_t^k$ represents the error value between the target eigenvector and the actual output vector; $y_j$ represents the output of the hidden layer.
The trained model based on the BP neural network can classify multimodal music emotion when test music samples are input.
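The training loop described above (forward pass, comparison with the expected output, back propagation of the difference, and adjustment of the weights and thresholds until the error falls below a set threshold) can be sketched as follows. This is a minimal illustration rather than the model used in this work: the sigmoid activation, the learning rate, the stopping tolerance, the name `gamma` for the output-layer threshold, and the toy OR data are all assumptions made for the example.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class BPNetwork:
    """Three-layer network (input, hidden, output) with sigmoid units."""

    def __init__(self, n_in: int, n_hidden: int, n_out: int, seed: int = 0):
        rng = random.Random(seed)
        # w[i][j]: input i -> hidden j, theta[j]: hidden-layer threshold
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                  for _ in range(n_in)]
        self.theta = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        # v[j][t]: hidden j -> output t, gamma[t]: output-layer threshold
        self.v = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)]
                  for _ in range(n_hidden)]
        self.gamma = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]

    def forward(self, c):
        y = [sigmoid(sum(c[i] * self.w[i][j] for i in range(len(c)))
                     - self.theta[j]) for j in range(len(self.theta))]
        out = [sigmoid(sum(y[j] * self.v[j][t] for j in range(len(y)))
                       - self.gamma[t]) for t in range(len(self.gamma))]
        return y, out

    def predict(self, c):
        return self.forward(c)[1]

    def train(self, samples, targets, lr=0.5, tol=0.01, max_epochs=5000):
        total_error = float("inf")
        for _ in range(max_epochs):
            total_error = 0.0
            for c, target in zip(samples, targets):
                y, out = self.forward(c)
                # Output-layer error d_t and hidden-layer error mu_j
                d = [(target[t] - out[t]) * out[t] * (1 - out[t])
                     for t in range(len(out))]
                mu = [y[j] * (1 - y[j])
                      * sum(d[t] * self.v[j][t] for t in range(len(d)))
                      for j in range(len(y))]
                # Adjust hidden -> output weights and thresholds
                for j in range(len(y)):
                    for t in range(len(d)):
                        self.v[j][t] += lr * d[t] * y[j]
                for t in range(len(d)):
                    self.gamma[t] -= lr * d[t]
                # Adjust input -> hidden weights and thresholds
                for i in range(len(c)):
                    for j in range(len(mu)):
                        self.w[i][j] += lr * mu[j] * c[i]
                for j in range(len(mu)):
                    self.theta[j] -= lr * mu[j]
                total_error += 0.5 * sum((target[t] - out[t]) ** 2
                                         for t in range(len(out)))
            if total_error < tol:  # stop once the error is below the threshold
                break
        return total_error


# Toy data (logical OR), standing in for emotion feature vectors.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
T = [[0], [1], [1], [1]]
net = BPNetwork(n_in=2, n_hidden=3, n_out=1)
final_error = net.train(X, T)
print([round(net.predict(x)[0]) for x in X])
```

In an emotion classifier of the kind described here, the input would be the fused feature vector and the output layer would have one unit per emotion category.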

Calculation of Correlation Function between Note Signals.
In the process of intelligent optimization and recognition of the note starting point in feature tone retrieval, the initial note signal is preprocessed based on the image sequence to filter out the noise in the high-frequency part. The random note signal is divided into short-term stationary signals based on the image sequence, the similarity between different note waveform signals is calculated, and the cross-correlation function between the note signals is obtained. The design of the recognition framework based on image sequence is shown in Figure 4.
The pervasive environment combines network technology and mobile technology to support a customer-oriented adaptive recommendation structure. It is composed of network devices, including computers, mobile phones, and various network-connected appliances, and of network services, including computing, management, and control. In this environment, the network can collect query, configuration, and management information from users and administrators, transfer this information to each server port, and then apply it to the comprehensive platform through the network to provide the basis for the design of the recommendation system.
The specific steps are as follows. Assuming that $n$ represents the note frame length and $N$ represents the number of sampling points in the frame, each humming note signal in the feature tone retrieval is windowed and divided into frames by formula (11), so that each humming note signal is short-term stationary. In the formula, $x(n)$ represents any humming note signal and $E(n)$ represents the short-time energy of $x(n)$.
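The windowing and framing step can be sketched as follows, assuming a Hamming window and the usual definition of short-time energy as the sum of squared windowed samples in a frame. The window type, the frame length, and the hop size are assumptions made for illustration, since formula (11) is not reproduced here.

```python
import math
from typing import List


def hamming(length: int) -> List[float]:
    """Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (length - 1))
            for k in range(length)]


def frame_signal(x: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split x into overlapping frames; a trailing partial frame is dropped."""
    return [x[start:start + frame_len]
            for start in range(0, len(x) - frame_len + 1, hop)]


def short_time_energy(x: List[float], frame_len: int, hop: int) -> List[float]:
    """Energy of each windowed frame: sum of squared windowed samples."""
    window = hamming(frame_len)
    return [sum((sample * w) ** 2 for sample, w in zip(frame, window))
            for frame in frame_signal(x, frame_len, hop)]


# A quiet segment followed by a louder one: the energy rises between frames.
signal = [0.1] * 8 + [1.0] * 8
print(short_time_energy(signal, frame_len=8, hop=8))
```

Frames with energy below a chosen level can then be treated as silence when searching for note starting points.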
Assuming that $s(k)$ represents the current sampling value of the short-time humming note signal, $s(k)$ is defined as a linear combination of historical sampling values and the excitation signal. In the formula, $a_i$ represents the prediction coefficient of the image sequence, $p$ represents the prediction order of the image sequence, $G$ represents the gain factor of the image sequence, $e(n)$ represents the glottal pulse excitation of the image sequence, and $v(n)$ represents the channel response value of the image sequence.
$x(n)$ is regarded as the result of the glottal pulse excitation $e(n)$ filtered by the channel response $v(n)$, and $e(n)$ is a short-time humming note signal with periodic characteristics.
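The linear combination described above can be sketched as follows, assuming the usual all-pole prediction form $s(k) = \sum_{i=1}^{p} a_i s(k-i) + G e(k)$. The coefficient values and the pulse train used as excitation are illustrative assumptions only.

```python
from typing import List


def synthesize(a: List[float], gain: float, excitation: List[float]) -> List[float]:
    """Generate s(k) from p past samples weighted by a_i plus gain * e(k)."""
    s: List[float] = []
    for k, e in enumerate(excitation):
        # Weighted sum of the available past samples s(k-1), ..., s(k-p)
        past = sum(a[i] * s[k - 1 - i] for i in range(len(a)) if k - 1 - i >= 0)
        s.append(past + gain * e)
    return s


# Periodic pulse excitation (one pulse every 8 samples), prediction order p = 1.
pulses = [1.0 if k % 8 == 0 else 0.0 for k in range(32)]
note = synthesize(a=[0.5], gain=1.0, excitation=pulses)
print(note[:4])
```

Each pulse starts a decaying response, so the output inherits the period of the excitation, which is the property the correlation step relies on.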
Assuming that $R_{cross}(t)$ represents a function with the same period, the similarity between the waveform signals of different humming notes is calculated. This similarity mainly has two states, the cross and the Jiugong (nine-square) grid, as shown in Figures 5 and 6.
Regular squares are used to represent the similarity between different humming note waveform signals. Generally, the image sequence value is 0 or 1. The two-dimensional space is formed by a large number of image sequences. The adjacent elements are the subelements to be studied, and their shape is mainly square. $y(n)$ represents a signal with the same period $T$ as $x(n)$, for which a discrete-time expression is given. Based on the image sequence, the center clipping method is used to give the threshold of the starting point range of specific humming notes. In the formula, the noise term represents additive white Gaussian noise independent of $s(n)$. $y(n)$ is the three-level signal obtained from the humming note signal $x(n)$ by the clipping method, which eliminates the low-amplitude part of the humming note signal; the correlation function between humming note starting point signals is then calculated from it.
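A minimal sketch of three-level center clipping followed by a correlation search is given below. It assumes the common form of the technique, in which the clipping threshold is a fixed fraction of the peak amplitude; the ratio of 0.3, the lag range, and the function names are assumptions made for illustration and are not taken from this work.

```python
import math
from typing import List


def three_level_clip(x: List[float], ratio: float = 0.3) -> List[int]:
    """Map samples above +C to 1, below -C to -1, and the rest to 0,
    where C = ratio * max|x|."""
    peak = max((abs(v) for v in x), default=0.0)
    c = ratio * peak
    return [1 if v > c else -1 if v < -c else 0 for v in x]


def cross_correlation(x: List[float], y: List[float], lag: int) -> float:
    """Sum of x(n) * y(n + lag) over the overlapping samples (lag >= 0)."""
    return sum(x[n] * y[n + lag] for n in range(len(x)) if n + lag < len(y))


def estimate_period(x: List[float], min_lag: int, max_lag: int,
                    ratio: float = 0.3) -> int:
    """Lag in [min_lag, max_lag] that maximizes the correlation of the
    clipped signal with itself."""
    y = three_level_clip(x, ratio)
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: cross_correlation(y, y, lag))


# Sinusoid with a period of 8 samples as a stand-in for a hummed note.
x = [math.sin(2 * math.pi * n / 8) for n in range(64)]
print(estimate_period(x, min_lag=2, max_lag=12))
```

Because the clipped signal only takes the values -1, 0, and 1, the low-amplitude part of the waveform no longer contributes to the correlation sum.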
To sum up, it can be explained that, in the process of intelligent optimization and recognition of note starting point of feature tone retrieval, the initial note signal is preprocessed, the similarity between different note waveform signals is calculated, and the cross-correlation function between each note signal is obtained, which lays a foundation for intelligent optimization and recognition of note starting point of feature tone retrieval.

Intelligent Optimization Recognition of Note Starting Point Based on Starting Point Feature.
Melody pitch feature extraction is a key link in the intelligent optimization and recognition of the note starting point in feature tone retrieval and directly affects the quality of the retrieval. Therefore, in the recognition process, the short-term spectral structure features and envelope features of the melody pitch are extracted from the correlation function between the note starting point signals obtained on the basis of the image sequence. Through the feature transformation and fusion of each melody pitch starting point, the intelligent optimization recognition of the note starting point is carried out, as shown in Figure 7.
According to Figure 7, first, the input multimodal music audio signal is prefiltered, and the input analog audio is converted into a digital audio signal within the frequency range audible to the human ear. Second, according to the short-time stationarity of the audio signal, the pre-emphasized audio signal is divided into frames, and a Hamming window is applied to each frame to reduce the influence of the Gibbs effect. The short-time Fourier transform converts the time domain signal into a frequency domain signal, which is convenient for the subsequent triangular filtering by the Mel filter bank. Then, the logarithm of the filtered signal is taken, and the discrete cosine transform is carried out to remove the correlation between the dimensions and map the signal to a low-dimensional space. Finally, the Mel cepstrum coefficients are obtained by spectral weighting, cepstral mean subtraction, and difference processing. Because the lower-order cepstral parameters are easily affected by the characteristics of the speaker and channel, this processing improves the recognition ability. The specific steps of intelligent optimization identification are as follows.
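The main stages of this pipeline (pre-emphasis, Hamming windowing, Fourier transform, triangular Mel filtering, logarithm, and discrete cosine transform) can be sketched for a single frame as follows. The pre-emphasis coefficient of 0.97, the number of filters, the number of coefficients, and the 8000 Hz sampling rate are assumptions made for illustration; spectral weighting, cepstral mean subtraction, and difference processing are left out to keep the example short.

```python
import cmath
import math
from typing import List


def hz_to_mel(f: float) -> float:
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m: float) -> float:
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def pre_emphasis(x: List[float], alpha: float = 0.97) -> List[float]:
    """First-order high-pass filter: y(n) = x(n) - alpha * x(n - 1)."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def power_spectrum(frame: List[float]) -> List[float]:
    """Naive DFT power spectrum of one Hamming-windowed frame."""
    size = len(frame)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1))
              for n in range(size)]
    xs = [s * w for s, w in zip(frame, window)]
    return [abs(sum(xs[n] * cmath.exp(-2j * math.pi * k * n / size)
                    for n in range(size))) ** 2 / size
            for k in range(size // 2 + 1)]


def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> List[List[float]]:
    """Triangular filters with centers equally spaced on the Mel scale."""
    mel_max = hz_to_mel(sample_rate / 2)
    points = [mel_to_hz(mel_max * i / (n_filters + 1))
              for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in points]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):       # rising edge of the triangle
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):       # falling edge of the triangle
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank


def mfcc(frame: List[float], sample_rate: int,
         n_filters: int = 12, n_coeffs: int = 6) -> List[float]:
    """Log Mel filter bank energies followed by a DCT-II."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    log_e = [math.log(sum(h * p for h, p in zip(row, spec)) + 1e-10)
             for row in bank]
    return [sum(log_e[m] * math.cos(math.pi * q * (m + 0.5) / n_filters)
                for m in range(n_filters))
            for q in range(n_coeffs)]


sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(64)]
coeffs = mfcc(pre_emphasis(tone), sample_rate)
print(len(coeffs))
```

A practical implementation would use a fast Fourier transform and process every frame of the signal; the naive transform is used here only to keep the sketch free of external dependencies.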
Assuming that $z(o)$ represents the smoothing parameter of the pitch trajectory, the short-time spectral structure features of the humming melody pitch, represented by BN, and the envelope features, represented by MFCC, are extracted on the basis of the obtained $R'_{cross}(t)$. In the formulae, $\varpi(j)$ represents the number of starting points of humming notes and $\varepsilon(h)$ represents the offset vector.
A set of transformation matrices for the starting points of the humming melody pitch is obtained by discriminative training. Based on the image sequence, each transformation matrix in the set corresponds to a region in the partition of the feature space of humming note starting points, and each feature vector is transformed with the transformation matrix of the region to which it belongs. Assuming that $o(t)$ represents the input feature at time $t$ and $A_i$ represents the transformation matrix corresponding to the $i$th region, the feature transformation of the $s$th melody pitch segment is described. In the formula, $R$ represents the starting segment of the melody pitch after region division and $x_{i,s}$ represents the weight coefficient corresponding to the selected feature transformation matrix $A_i$.
Assuming that $h$ represents the excitation signal of the humming melody pitch node in the BN layer, the transformation matrix features represented by $M_{BN}$ and $M_{RDLT}$ are fused. In the formula, $xo(t)$ represents the regularization function.
Assuming that $\beta$ represents the estimated value of the noise spectrum, the parameters of the fused transformation matrix feature $y_{con}(t)$ are optimized. In the formula, $M_O$ represents the transformation matrix corresponding to the nonzero coefficient term. Based on the results calculated by formula (21), the intelligent recognition of the note starting point in feature tone retrieval can be effectively completed, which completes the multimodal music emotion recognition method based on image sequence.

Experimental Analysis
In order to test the application effect of the multimodal music emotion recognition method based on image sequence, MATLAB is used as the algorithm operation platform, and a specific example is selected for simulation test and analysis. The experimental environment settings are shown in Table 2.
The samples selected in the test are from the emotional corpus. According to the emotions they express, the selected samples are divided into five categories. The specific distribution of samples is shown in Table 3.
The kappa coefficient is selected as the index to evaluate the intelligent recognition and classification of music emotion. It is used to test consistency and to measure classification accuracy, and is calculated as follows:
$$k = \frac{p_o - p_e}{1 - p_e}$$
In the formula, $p_o$ is the observed consistency rate and $p_e$ represents the expected consistency rate. The value of $k$ lies in $[-1, 1]$, and the larger $k$ is, the more consistent the two results are. When $k \geq 0.75$, the results are consistent and the classification and recognition are accurate. If $k < 0.4$, the results lack consistency and the classification and recognition accuracy is poor.
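As an illustration of how the index is obtained from classification results, the sketch below computes the kappa coefficient from a confusion matrix, taking the observed consistency rate as the proportion of samples on the diagonal and the expected consistency rate as the sum of the products of the row and column proportions. The confusion matrix shown is hypothetical and is not data from this experiment.

```python
from typing import List


def cohen_kappa(confusion: List[List[int]]) -> float:
    """Kappa = (p_o - p_e) / (1 - p_e) for a square confusion matrix
    whose rows are actual classes and columns are predicted classes."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: proportion of samples on the diagonal
    p_o = sum(confusion[i][i] for i in range(size)) / total
    # Expected agreement: row total times column total for each class
    p_e = sum(sum(confusion[i]) * sum(row[i] for row in confusion)
              for i in range(size)) / (total * total)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical two-class result: 90 of 100 samples classified correctly.
example = [[45, 5],
           [5, 45]]
print(cohen_kappa(example))
```

With this hypothetical matrix, the observed rate is 0.9 and the expected rate is 0.5, so the coefficient is 0.8, which would fall in the range described above as consistent.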
The test samples in Table 3 are input into the trained model. The kappa coefficient values calculated above are all greater than 0.75, indicating that the recognition and classification results are in good agreement with the actual results and that the classification and recognition accuracy is high, which achieves the research purpose. The recognition results of different methods are shown in Table 4. It can be seen from Table 4 that different methods give different identification results for the same test sample. Rhythm and melody characteristics have a great influence on the recognition of music emotion. On the premise that the image sequence is unchanged, selecting an appropriate music feature input vector improves the accuracy of multimodal music emotion recognition to a certain extent.

Conclusion.
Multimodal music emotion recognition is a breakthrough in the field of artificial intelligence. It has become a new research focus of computer science, cognitive science, neuroscience, brain science, psychology, behavioral science, and other interdisciplinary fields. Multimodal music emotion understanding is an important branch of affective computing and has broad development prospects. The performance of the multimodal music emotion recognition method based on image sequence is verified through an example. The kappa coefficient shows that the classification and recognition accuracy of the algorithm is high, which achieves the research goal. Meanwhile, rhythm and melody characteristics have a great influence on the recognition of music emotion.

Prospect.
A possible future research direction is to apply deep learning methods to music emotion recognition. Deep learning is a learning method based on a hierarchical feature structure and includes unsupervised feature learning. Artificial neural networks with many hidden layers have excellent feature learning ability, and the features they learn characterize the nature of the data more fundamentally; such features could be learned from millions of pieces of music. Thus, the machine could independently choose better music features to describe the relationship between music and emotion.
Data Availability
The raw data supporting the conclusions of this article will be made available by the author, without undue reservation.

Conflicts of Interest
The author declares that there are no conflicts of interest regarding this work.