Music Emotion Classification Method Based on Deep Learning and Improved Attention Mechanism

Since the existing music emotion classification researches focus on the single-modal analysis of audio or lyrics, the correlation among models are neglected, which lead to partial information loss. Therefore, a music emotion classification method based on deep learning and improved attention mechanism is proposed. First, the music lyrics features are extracted by Term Frequency-Inverse Document Frequency (TF-IDF) and Word2vec method, and the term frequency weight vector and word vector are obtained. Then, by using the feature extraction ability of Convolutional Neural Network (CNN) and the ability of Long Short-Term Memory (LSTM) network to process the serialized data, and integrating the matching attention mechanism, an emotion analysis model based on CNN-LSTM is constructed. Finally, the output results of the deep neural network and CNN-LSTM model are fused, and the emotion types are obtained by Softmax classifier. The experimental analysis based on the selected data sets shows that the average classification accuracy of the proposed method is 0.848, which is better than the other comparison methods, and the classification efficiency has been greatly improved.


Introduction
Due to the complexity of music duration and composition, the emotional features extracted from music show the characteristics of large quantity, multiple dimensions, and difficult to analyze [1]. e results of music emotion classification can be well applied to music recommendation function to reduce the disadvantages of collaborative filtering recommendation [2,3]. At the same time, music can artistically express the emotional information contained therein, and listeners can also obtain emotional tendency through music audio and lyrics information [4]. Music emotion analysis can be well applied to the music recommendation function. Major online music applications have launched the music recommendation function to recommend suitable music and improve the user experience by analyzing the listening habits of different users [5,6]. However, most of the applications recommend popular songs but ignore personalized works, which is difficult to meet the needs of listeners under different emotions. erefore, the research on emotional classification of songs has a good application prospect [7].
Before the emergence of intelligent algorithms, the way of establishing classification labels mainly depended on manual work, and the songs with different music styles were established into corresponding song lists [8]. However, such methods are not only inefficient, but also have high requirements for manual experience, and the classification accuracy is also uneven [9,10]. On the basis of manual classification, the traditional classification methods are gradually mature, which mainly include methods such as logistic regression, naive Bayes, random forest, and support vector machine [11]. For example, Rao Veeranki et al. [12] analyzed and compared the performances of four traditional methods in the process of music emotion classification: logical regression, naive Bayes, random forest, and support vector machine, and took the parameters such as mean, variance, kurtosis, and skewness as analysis indicators, which effectively improved the efficiency of music emotion classification [12]. However, the targeted feature extraction in mixed audio needs to be improved. Kumar and Vardhan [13] made full use of other emotional features according to the rule-based emotion classification algorithm, and obtained better classification accuracy by adding more words to the dictionary [13]. However, the granularity segmentation of music needs to be further improved. Tiple and Patwardhan [14] proposed a new emotion classification system through link preprocessing, feature extraction, and classification steps [14]. e proposed Spiking Neural Network (SNN) classifier based on gradient descent was used to process frames and extract the time, spectrum, and energy features related to music. Combined with the optimal weight value to reduce the training error, the gradient descent method was optimized. Chen and Li [15] proposed a multi-modal ensemble learning method based on stacking [15]. is method was different from the feature-level and decision-level fusion methods. However, this classification method is inefficient in the environment facing a large number of new music creation, and cannot flexibly meet the needs of category expansion in the later stage [16].
Nowadays, classification algorithms based on machine learning methods and deep neural network learning have carried out extensive research in the fields of audio, image and text, and achieved rich results [17,18]. With the rise of artificial intelligence-related technologies, computers can realize complex emotion analysis and calculation, and automatically output emotion analysis results through algorithms. Scholars' researches on music emotion feature extraction and classification model are also gradually carried out. Hizlisoy et al. [19] proposed a music emotion recognition method based on CNN-LSTM [19]. e experimental evaluation on the constructed emotional music database effectively showed the good performance of the proposed method. However, this method ignores the timing features of audio itself. Sorussa et al. [20] proposed a digital music emotion classification system with different emotion categories, which used supervised machine learning technology to identify the acoustic features and create prediction models [20]. is method effectively improves the accuracy of the algorithm classification, but the efficiency of machine learning needs to be improved. Gan [21] proposed a recurrent neural network method with channel-attention mechanism to classify the music features [21]. e above machine learning methods have achieved good results in the field of music emotion classification, but in the process of dealing with lyrics and melody, the relationship between lyrics and melody emotions is separated in different ways, without considering the consistency of emotion between lyrics and melody [22], so there is room for further optimization of the detailed classification of music emotion.
Aiming at the problem that most existing classification methods are difficult to deal with multi-dimensional and complex music texts, a music emotion classification method based on deep learning and improved attention mechanism is proposed. Its innovations are summarized as follows: (1) Aiming at the problem of high dimension and sparsity of word vector, the proposed method combines CNN and LSTM to construct emotion classification model, and integrates the matching attention mechanism to further improve the classification accuracy. (2) In order to solve the problem of insufficient performance of single feature classification, the proposed method uses CNN-LSTM model and deep neural network to process word vector and word frequency weight vector, respectively, and carries out feature concatenation to ensure the reliability of emotion classification.

Lyrics Feature Extraction
2.1. TF-IDF Feature Extraction. Term Frequency-Inverse Document Frequency (TF-IDF) is a feature extraction method that represents the weights according to the frequency of word items in the text. TF-IDF can calculate the number of word occurrences by means of probability statistics, evaluate the proportion of word items in the text to determine the importance of the word, and use this to represent the emotional polarity of the lyric text. e more times an emotional representative word appears in a lyrics text, the higher the importance of the emotional word in the emotional classification evaluation. Integrating all word frequency information, the emotional tendency of the whole lyrics text can be evaluated [23]. TF is the word frequency of a word, indicating the number of times a word item appears in the text. TF is calculated as follows: where n(i, j) represents the number of times that word w i appears in document d j , and its denominator represents the total number of words in the document. IDF is the inverse text frequency. e fewer times the current text contains word items, the stronger the classification ability of the word items to the text. It can be obtained by dividing the total number of words in the data set by the number of samples containing word items and through logarithmic operation. IDF is calculated as follows: where |D| represents the number of all documents, and the denominator represents the number of documents containing the word w i . e TF-IDF calculation result is represented by the product of TF and IDF. If the current word item appears less in the current category and more in the overall sample, the larger the TF-IDF value, the higher the classification ability of the feature item. To sum up, the core of TF-IDF text feature extraction method is to remove the influence of common words and retain important features with text resolution.
TF-IDF is a simple and convenient text word frequency feature extraction method, but it also has some defects. e words in the text are regarded as independent feature items, 2 Computational Intelligence and Neuroscience ignoring the relationship between words and ignoring the relationship between words and the whole article. is representation method has good statistical significance for local content, but ignores the integrity of the text, resulting in the loss of fine-grained emotional semantic content. For example, a certain emotion polar word only appears in the song lyrics text of this emotion, but less in other emotion types, which will lead to the error of emotion evaluation.

Word2vec Word Vector.
Word2vec is a distributed text representation method, which maps each word item in the text to a word vector. Word2vec improves the shortcomings of the traditional deep learning word embedding model, with faster training speed and fewer vector dimensions. Word2vec usually includes two model structures: Continues Bag of Words (CBOW) and Skip-Gram, as shown in Figure 1. e model includes input layer, projection layer, and output layer. In CBOW method, the surrounding words are used to predict the central word, so as to use the prediction results of the central word to represent the a priori probability; Skip-Gram uses the central word to predict the surrounding words, so as to predict the overall result and represent a posteriori probability. erefore, CBOW will be faster than Skip-Gram in practical use. e parameter dimension of Word2vec-generated word vector is related to the number of hidden layer units in the network. e default value of the program is 100 dimensions.
Word2vec also has some defects: because words and vectors are one-to-one, the problem of polysemy cannot be solved; a static word vector representation, although it has strong universality, it cannot be dynamically optimized for specific tasks. For special text types such as lyrics, text information is different from evaluation text, which can be expressed directly through the emotion polar words. e implicit semantic expression in lyrics is often difficult to summarize emotion through separated word frequency information. e word vector method integrating context semantic information has better classification performance [24].

Model Construction.
After preprocessing the lyrics text, two emotion feature vectors, vector space model and distributed vector representation, can be extracted. Word2vec is extracted as word vector representation, which can be applied to deep learning methods, but it often has the characteristics of high dimension and sparsity. A single network model cannot deal with the features well. e architecture of CNN-LSTM not only has the advantage of CNN extracting local features, but also has the advantage of LSTM connecting the extracted features in sequences. Although TF-IDF representation method based on word frequency statistics has some defects in the semantic representation, it also has good distinguishing ability for text information with prominent keywords. In order to integrate the emotional feature representation of two kinds of text, a lyric emotion classification model based on CNN-LSTM is constructed. e model architecture is shown in Figure 2. e proposed model is divided into two parts: word vector + CNN-LSTM and word frequency weight + Deep Neural Network (DNN). First, CNN is used to extract multiple sets of word vector features of the input text, and the extracted features are integrated into the input of LSTM neural network to output a new set of word vector feature representation. en, the word bag model vector extracted by TF-IDF is processed by DNN. e features of the two categories are concatenated as the fusion representation of lyrics text, which is finally classified and output by Softmax to obtain the text emotion classification results.
e lyrics emotion classification model based on CNN-LSTM is similar to the audio emotion classification model in network structure. e inputs of the audio classification Computational Intelligence and Neuroscience 3 model are spectrogram and low-level descriptor features, respectively, while the inputs of the text classification model are word vector and word frequency weight vector, respectively [25]. e proposed model also has some performance differences when applied to audio features and text features. Because the audio feature dimension depends on the extracted spectrum description feature, the sequence length depends on the segmentation mode and frame interval of the original music; the text feature dimension depends on the distributed representation dimension set in the feature extraction stage, and the length of the text sequence is directly related to the number of word items. For the theme of song audio classification, CNN plays a leading role in the CNN-LSTM combined network. CNN is used to extract spectrum feature, which requires deeper convolution operation.
e size of convolution kernel and stride will affect the classification performance. For the propose of lyrics text classification, the original serialized text feature word vector is difficult to train due to its high dimension and sparsity. CNN mainly provides feature compression ability. Bidirectional LSTM and matching attention mechanism have a greater impact on classification accuracy than convolution layer.
BiLSTM is a structure composed of forward LSTM and backward LSTM, which can well complete the extraction of data features. BiLSTM can well analyze bidirectional data information and provide more fine-grained calculation. e calculation process is as follows: where, one LSTM layer processes the sequence from left to right, and the other LSTM layer processes the sequence from right to left. W �→ and W ← represent the network hidden layer parameters, x t represents the input data, h e BiLSTM structure is shown in Figure 3.

Input Layer.
e input of the model is audio feature data. e original music file is preprocessed, and the word vector and word frequency weight vector are obtained, respectively.
e feature size of each spectrogram is normalized to 256 × 256 × 3, where 256 is the width and height of the image, and 3 represents the number of channels (RGB) of the color spectrogram.

CNN Layer.
e detailed view of the CNN layer model is shown in Figure 4. In the model implementation, CNN layer includes two convolution layers and two pooling layers. e input of first convolution is the word vector, the convolution operation is performed through 64 convolution kernels with size of 2 × 2 and step of 1, and Relu is used as the activation function [26]. e output vector size H of CNN depends on G (input size), κ (convolution kernel size), P (padding size), and τ (step size). e calculation is as follows: During the convolution feature extraction, first, use a single convolution kernel to calculate each local feature of the input. en, concatenate the calculated features vertically, and finally perform nonlinear calculation on the concatenated features through the activation function to obtain the final convolution feature. e mathematical expression is as follows: where, J κ represents the convolution kernel with height κ, H is the size of the output vector, h 1κ (i) is the i-th local feature, hr 1κ is the output c8onvolution feature, X is the input, and f is the tanh activation function.

LSTM Layer.
In order to fuse different features to improve the classification accuracy, cascade is used to connect the merged results as the input of LSTM layer. e mathematical expression is as follows: where, hp 1κ is the result of pooling operation; φ is the merge connection function, and h 1 is the input of LSTM. e word vectors generated from the sample set are further extracted by the CNN layer. Specifically, for the lyrics sample represented as [v (1) , v (2) , . . . , v (T) ], where T is the number of frames after lyrics segmentation. After passing through the CNN layer, a sequence vector [c (1) , c (2) , . . . , c (T) ] is obtained as the input of the LSTM layer. e detailed view of LSTM model is shown in Figure 5.
Input the vector output from the previous layer and selected by the feature into the bidirectional LSTM. e LSTM layer in the model has 128 units, and the output results can be expressed as [l (1) , l (2) , . . . , l (T) ].

Matching Attention Mechanism.
For fine-grained emotion analysis, the ordinary attention mechanism cannot accurately extract the target words of fine-grained elements, resulting in the low accuracy of emotion analysis. erefore, based on the original attention mechanism, a matching attention mechanism is built to improve this problem. e input of attention matching mechanism includes two parts. One part is the weighted word vector after the feature fusion of Word2vec word vector feature and TF-IDF feature based on word frequency statistics; the other part is the word vector generated after the fine-grained feature information in the data set is sent to Word2vec. First, these two parts of input are fed into the matching attention mechanism, and the context information and fine-grained element information q s are captured at the same time. e calculation is as follows: where, Average represents the average value of the input vector, e a i is the word vector, e ω j is the weight word vector, and m and n are the numbers of word vectors and weight word vectors, respectively. In order to make the information of fine-grained elements meaningful, only the dimensions related to fine-grained elements will be retained in the Q t , while other dimensions will be deleted. e calculation process is as follows: where Q t is the weight vector of k fine-grained elements. It mainly looks for the dimensions related to fine-grained elements by looking at the words nearby in the word vector space. ω t and b t represents the weight matrix and offset vector, respectively. After the loss function is determined, the parameters in ω t and b t can be updated by gradient descent method. When the loss function converges, the optimal solution can be obtained. en multiply Q t by a randomly initialized matrix ψ to obtain the target word a s matching the fine-grained elements identified by matching attention, which is expressed as follows: where, the dimension of ψ is k × d. It can be updated by gradient descent method. Finally, matching attention weight p i is calculated according to a s . e calculation is as follows: where ω 0 ∈ R d×d is a trainable weight matrix, h is the output of LSTM hidden layer. Finally, the weighted sum of the hidden vector h i generated by the bidirectional LSTM and the matching attention weight p i is used for the sentence representation Z s of emotion classification, which is expressed as follows: Take Z s as the feature of the final emotion classification and put it into the fully connected layer for linear transformation, and Softmax classifier is used for emotion classification to obtain the final emotion ϕ. e mathematical expression is as follows:

DNN Layer.
e input of DNN layer is the word frequency weight vector, which contains three hidden layers, also known as FC (fully connected layer). All nodes of FC in the network are connected with the nodes of the previous layer to achieve the purpose of integrating feature information and reducing dimension. e number of nodes of the three FCs in the model is 256, 128, and 64, respectively. e dimension of the input feature is further compressed after passing through the DNN layer.

Output Layer.
e output layer consists of FC and Softmax. First, the output of word vector features through CNN layer, LSTM layer, and attention mechanism layer and the output of word frequency weight vector features through Computational Intelligence and Neuroscience DNN layer are concatenated as the final classification feature vector representation. Output classification is through FC and Softmax layers. e Softmax layer is a loss function, which is used to map the output to the probability interval to obtain the classification probability distribution, so as to output the classification results. e model is actually classified into four emotional categories: happy, sad, healing, and calm.

Experiments and Analysis
In order to build a parallel corpus of Chinese audio and lyrics, the data source is locked on the domestic music platform, and the data is collected based on the target of the task. In order to select the songs with higher quality, the songs with more credibility are selected, that is, the songs with a playback times of more than 2.5 million. In order to further carry out the task of music emotion classification, four kinds of emotion labels with happy, sad, healing, and calm were selected as candidates. A total of about 6000 music were collected. After further screening of song length, audio quality and language, 5286 music were finally retained as the candidate data set. On this data set, it is divided into two parts: training set and test set. e specific information of each data set is shown in Table 1.
In the experiment, the lyric text is preprocessed. First, the word segmentation is carried out to remove the invalid information related to music composition and stop words, and constructs a pure lyric text word item combination representation. en, the text is transformed into a digital vector recognized by the computer through different feature extraction methods, and the feature dimension of the text vector is set to 100 dimensions. Finally, input them to the classifier to output the emotion classification results.
e parameters of LSTM model are shown in Table 2.

Classification Accuracy of Different Text Features.
e experiment adopts different word frequency weight vector feature extraction methods to verify the emotional classification performance of lyrics. First, the preprocessed text is represented by text vector through TF-IDF and Word2vec feature extraction methods, and then the same SVM classifier is used to output the emotion classification results. e classification accuracy of different text extraction methods is shown in Table 3.
As can be seen from Table 3, TF-IDF is completely based on the features of word frequency. When facing the sample data with implicit emotional semantics such as lyrics, the emotional classification performance is slightly insufficient, and the average classification accuracy is only 0.701. At the same time, the distributed word vector feature representation extracted by Word2vec tool has also achieved good accuracy in SVM classifier. e classification accuracy of happy emotion is as high as 0.801 and the average classification accuracy is 0.746. is distributed vector can be well used as the input of deep network method.

Classification Accuracy of Different Attention
Mechanisms. In order to study whether the matching attention mechanism can further improve the performance, it is compared with the traditional attention mechanism. e evaluation results of the joint training model on the selected data set are shown in Table 4, in which the average classification accuracy is used for performance evaluation.
It can be seen from Table 4 that the matching attention mechanism significantly improves the classification performance of the model, and its average classification accuracy is 0.826, which is 0.073 higher than that of the traditional attention mechanism. e traditional attention mechanism cannot accurately extract the target words of fine-grained elements, resulting in the low accuracy of emotion analysis. e matching attention mechanism can solve this problem, and greatly improve the classification accuracy combined with the context information.

Experimental Comparison and Analysis with Other
Methods. In order to demonstrate the performance of the proposed method, it is experimentally analyzed with Table 1: e size of the data set and the number of songs contained in each type of emotion. Test set  Total  Happy  1078  194  1272  Sad  1267  317  1584  Calm  882  221  1103  Healing  1061  266  1327  Total  4288 998 5286   Reference [14,15,19] on the selected data set. e classification accuracy of different emotions in lyrics is shown in Figure 6 and Table 5.

Training set
As can be seen from Figure 5 and Table 5, the proposed method combines the characteristics of CNN and LSTM, constructs an emotion analysis model based on CNN-LSTM, and combines DNN network learning to greatly improve the accuracy of music emotion classification, with an average classification accuracy of 0.848. e fusion processing of deep learning network improves the classification performance, especially integrates the matching attention mechanism, accurately extracts the target words of fine-grained elements, and significantly improves the classification accuracy of emotional types such as calm and healing, which are 0.056 and 0.045 higher than those in reference [19]. Reference [19] uses CNN-LSTM architecture to complete music emotion recognition. Although the average classification accuracy reaches 0.815, it is easy to confuse emotion types such as calm and healing, and the performance needs to be improved. Reference [14] classifies emotions based on gradient descent SNN classifier, and reference [15] classifies emotions based on stacking multi-modal ensemble learning method. Both of them are difficult to accurately distinguish massive and complex music types, and the average classification accuracy is about 0.800. In conclusion, the proposed method uses the comprehensive feature extraction ability of CNN and the ability of LSTM to process the serialized data to obtain better classification results, and has stable performance and high robustness under each subclassification.

Conclusion
Music contains rich human emotional information. e study of music emotional classification is helpful to organize and retrieve massive music data. Because of the large number and multiple dimensions of music, it is difficult and incomplete to extract emotional features. erefore, a music emotion classification method based on deep learning and improved attention mechanism is proposed. e word frequency weight vector obtained based on TF-IDF is input into DNN for feature analysis, and the word vector obtained by Word2vec method is sent into the emotion analysis model based on CNN-LSTM. After the output of the two feature extraction channels are fused, the output layer outputs the emotion type. e experimental results based on the selected data set show that matching attention mechanism can more accurately extract the target words of finegrained elements and improve the classification performance. Compared with the traditional attention mechanism, its average classification accuracy is improved by 0.073. e processing of audio features in this paper is still a little rough. Only using the existing common features cannot fully reflect the relationship between music structured information and human emotion. erefore, the feature extraction method with more music emotion classification ability can be further explored.

Data Availability
e data used to support the findings of this study are included within the article.

Conflicts of Interest
e author declares that there are no conflicts of interest regarding the publication of this paper.