Research and Implementation of Emotional Classification of Traditional Folk Songs Based on Joint Time-Frequency Analysis



Introduction
Music is a traditional art that is very good at expressing emotions. This has been confirmed by many aestheticians and musicians in modern times [1]. Music is good at expressing emotions because its mode of existence and its form are similar to those of emotions. With the emergence of traditional folk songs, working people gained a way to express their thoughts, feelings, demands, and wishes. Traditional folk songs are generally created orally, passed down orally, and continuously processed and revised by the collective, eventually forming a style with its own characteristics that serves certain social functions and reflects different social characteristics across regions, times, and social states [2,3]. Traditional folk songs represent the culture of a nation and the memory of an era. With the development of the times, traditional folk songs are gradually fading from public view; therefore, we must remember traditional folk songs and pass on the classics [4]. To search for traditional folk songs more effectively, however, their classification is particularly important.
Emotion is a one-dimensional dynamic art that flows in time; its orientation, persistence, frequency, speed, intensity, and other aspects are varied and changeable, while music has a variety of modes, tonalities, rhythms, tempos, pitches, and intensities [5]. The study of emotional factors in traditional Chinese folk songs is therefore still a topic that needs to be discussed in depth. A review of a large number of materials on the emotional factors in traditional folk songs, later expanded to research materials on traditional folk songs in general, including books, magazines, and websites, shows that there are few books and articles on emotional factors in traditional Chinese folk songs. Less than 5% of songs experience emotional shifts as a result of the singer's interpretation [6,7]. In light of the foregoing, emotional analysis of text has gradually shifted in recent years to emotional tendency analysis based on lyrics, which has become popular. As a result, this paper begins with the lyrics of traditional folk songs [8]. Instead of processing signals in the time domain or the frequency domain alone, the JTFA (Joint Time-Frequency Analysis) method combines the two, allowing signals to be presented in both domains at the same time. The composition of the entire signal can be obtained through the transformation, making it easy to extract the changing characteristics of the signal in three-dimensional space [9]. JTFA has been used in a variety of fields, including biomedical signals, speech signals, and mechanical vibration signals.
However, algorithms based on short-term features consider only the short-term spectrum or energy of music. They are well suited to some traditional music information retrieval tasks, such as identification and classification, but not to applications such as music structure analysis and sound source separation, which must consider the spectral characteristics of music as well as their distribution and changes over time. To solve these issues, researchers typically employ the second type of music information retrieval algorithm, namely, the JTFA-based method. The purpose of this paper is to classify the emotions in traditional folk songs by combining existing music emotion classification standards [10,11], calculating weights using effective feature fusion and an extreme value table of emotional words, and judging the degree of emotion in traditional folk songs.

Related Work
In fact, in modern Chinese folk songs, the five major emotions are inextricably linked: they are integrated with one another, and one transforms into another. These characteristics can reflect a variety of personality traits, cultural backgrounds, and aesthetic preferences of people in various locations. Traditional folk songs are an inexhaustible source of information for people interested in learning about history, society, and life. The S-VSM (Sentiment Vector Space Model) was proposed in reference [12], and the result was clearly improved. Reference [13] differs from the above-mentioned research methods in that it uses an emotional dictionary to obtain the emotional words in the corpus. Reference [14] used a neural network to extract emotional features and then classified the samples, reaching a classification accuracy of 89%. Reference [15] proposed a text classifier based on a hybrid model that combines two common deep learning models, sparse autoencoders and deep belief networks; in English text classification experiments, this hybrid model achieves high accuracy. The use of unsupervised learning methods to judge the tendency of opinions is discussed in references [16,17]. This method extracts adjectives and adverbs from the text, estimates their emotional tendencies, and then calculates the overall emotional tendency from the average tendency of these words. On multidomain data, this method has an average accuracy of 74%.
Another approach to emotion judgment treats it as a classification problem and uses machine learning to perform the classification. To distinguish between factual statements and emotional expression, reference [18] employs a variety of models. Reference [19] advanced the classification process by introducing a hierarchical emotion analysis process, which differs from the earlier practice of categorizing music directly; to accomplish the task of music emotion analysis, a two-stage method is used. In reference [20], musical emotions are divided into six categories, and the classification of musical emotions is studied using spectrum-related information as features. According to reference [21], music emotion recognition should be changed from a discrete classification problem to a regression problem, with the timbre, pitch, and volume characteristics of the music extracted. The coordinates of music pieces in the Valence-Arousal emotional coordinate system are calculated using the SVR (Support Vector Regression) model, and the coordinates are then converted into emotional categories. In reference [22], the spectrum of a music signal is divided into subbands, frequency-domain features are extracted, subband and non-subband audio features are fused, and finally an SVM classifier is used for training and classification.
The JTFA method can accurately extract the time-varying characteristic information and other important characteristics of nonlinear signals, making up for the shortcomings of time-domain and frequency-domain analysis methods. STFT (Short-Time Fourier Transform), the wavelet transform, the Wigner-Ville distribution, and empirical mode decomposition are some of the most commonly used JTFA methods. These methods have been used in speech analysis, image recognition [23][24][25], bioengineering, mechanical equipment fault diagnosis, and other disciplines and engineering fields [26].

Research Method
3.1. Music Feature Extraction. Traditional folk songs are the sum of music and literature. They are a direct embodiment of emotion. Emotion is an attitude, feeling, and reaction of people to things, which may or may not be expressed outwardly. Because good music can resonate with the audience, listeners feel the urge to follow its rhythm. Good lyrics can express the feelings of a song incisively and vividly and arouse the audience's resonance. In previous research, features were extracted mainly from audio, but good results were not achieved.
The emotional characteristics of traditional folk songs are influenced by their living environment and mode of communication. Traditional folk songs take their themes from real life and are an essential part of the spiritual lives of working people of all ages. This is how people express their emotions [23]. Traditional folk songs are the essence of national culture: they encapsulate a nation's spirit, character, temperament, and psychological characteristics, and they possess distinct national characteristics and local color. Because traditional folk song lyrics are colloquial, the accuracy of word segmentation affects the accuracy of emotion analysis. In English, spaces serve as separators between words, so segmentation is relatively simple. In Chinese, there is no clear separator between words [25]; only sentences are delimited, by punctuation marks. The expressive force of the basic elements of music varies from piece to piece, which explains why music can express a variety of emotions. For example, music in the category of joy and enthusiasm has a bright and lively timbre, a warm and cheerful rhythm, and a mostly major mode, whereas angry and irritable music has a deep and heavy timbre, a tense and rapid rhythm, and a mostly minor mode.
This paper intends to use audio signal processing to distinguish the emotions expressed by different music. From the above analysis, it can be seen that the basic elements of music dominate its emotional expression. Therefore, mapping the basic elements of music to audio features lays a foundation for the follow-up work (Figure 1).
To sum up, by establishing the relationship between music emotional expressiveness and basic elements of music and then mapping the basic elements of music into audio features, the relationship between music emotional expressiveness and audio features is finally established, which lays a foundation for the follow-up work.
From the point of view of audio signal processing, extracting tonal features requires analyzing the frequency spectrum, and the chroma vector feature and the mode feature are the features that best describe the musical mode. The time-frequency transform adopts the constant Q transform; the chroma features are derived in the frequency domain under the constant Q transform, and a filter bank following a $\log_2$ rule is designed, in which the center frequency of each filter is

$$f_{k_{lf}} = f_{\min} \cdot 2^{k_{lf}/\beta},$$

where $f_{\min}$ is the minimum center frequency of the filter bank, set to 220 Hz in this paper; $k_{lf}$ is the filter index; and $\beta$ is the number of filters per octave.
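The geometric spacing of the constant Q filter bank can be illustrated with a minimal sketch. The 220 Hz minimum frequency comes from the text; the choice of 12 filters per octave, the 36-filter bank size, and the function names are assumptions made only for illustration.

```python
def cq_center_frequencies(f_min=220.0, bins_per_octave=12, n_bins=36):
    """Center frequencies f_k = f_min * 2**(k / beta) of a constant Q bank.

    bins_per_octave (beta) and n_bins are illustrative assumptions.
    """
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def chroma_bin(k, bins_per_octave=12):
    """Fold filter index k onto its pitch-class (chroma) index."""
    return k % bins_per_octave


freqs = cq_center_frequencies()
print(freqs[0], freqs[12], chroma_bin(14))
```

With 12 filters per octave, each filter sits one semitone above the previous one, so filters that are 12 indices apart share a chroma bin and their energies can be summed to form the chroma vector.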
The cross-correlation function calculates the exchangeable energy between the signals $x(t)$ and $y(t-\beta)$ and quantifies it, so as to obtain the similarity between the two signals. In the formula of the cross-correlation function, $\alpha = 1$, and $\beta$ is used to eliminate the time offset.
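As a rough illustration of how a lag can be used to remove a time offset, the sketch below evaluates a plain discrete cross-correlation, the sum of x[t] * y[t - lag], and picks the lag that maximizes it. This is the textbook discrete form, assumed here for illustration; it is not necessarily the exact formula used in this paper, and the function names and toy signals are hypothetical.

```python
def cross_correlation(x, y, lag):
    """Sum of x[t] * y[t - lag] over the samples where both signals exist."""
    total = 0.0
    for t in range(len(x)):
        s = t - lag
        if 0 <= s < len(y):
            total += x[t] * y[s]
    return total


def best_lag(x, y, max_lag):
    """Lag in [-max_lag, max_lag] at which the two signals are most similar."""
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cross_correlation(x, y, lag))


# Toy signals: x is y delayed by three samples.
y = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
x = [0.0, 0.0, 0.0] + y[:5]
print(best_lag(x, y, 5))
```

The maximizing lag recovers the delay between the two toy signals, which is the sense in which the lag parameter compensates for a time offset.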
From the mathematical point of view, data space, feature space, and category space can be linked together, and the nonlinear transformation between them can be completed by kernel processing. If there is a mapping function $\phi$ that represents the relationship between the data space and the feature space, then the role of the kernel function is to realize the inner product of the mapped vectors:

$$K(x_i, y_i) = \langle \phi(x_i), \phi(y_i) \rangle.$$

Of course, a function $K$ can be used as a kernel function only if it meets the following (Mercer) condition: for any function $g(x)$ that is not identically 0 and satisfies $\int g(x)^2 \, dx < \infty$,

$$\iint K(x, y)\, g(x)\, g(y) \, dx \, dy \ge 0.$$
When applying kernel principal component analysis to nonstationary signals, the sample data should first be divided into a training set and a test set. The data are normalized, the kernel matrix is computed, and its eigenvalues and eigenvectors in the high-dimensional space are obtained; the first few principal eigenvectors are selected to form a feature space, and the training set and test set are projected into this feature space. Finally, the training set is input into the SVM for training, and the model is verified with the test set.
Implementation process of KPCA (kernel principal component analysis): the nonlinear mapping $\phi$ first completes the transformation from the input space to the feature space, and PCA is then used to process the mapped data.
Assume that a set of observed values is given. The general PCA method solves the eigenvalue problem of the sample covariance matrix and retains the eigenvalues with a large contribution rate together with their corresponding eigenvectors.
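The KPCA steps described above (kernel matrix, centering in feature space, leading eigenvector, projection) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the Gaussian kernel scaling, the power-iteration eigen-solver, the restriction to the first component, and the toy data are all choices made here, and normalization, the train/test split, and the SVM stage are omitted.

```python
import math


def rbf_kernel(a, b, delta=1.0):
    """Gaussian kernel; the 2 * delta**2 scaling is an assumed convention."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * delta ** 2))


def centered_kernel_matrix(data, kernel):
    """Kernel matrix with the feature-space mean removed."""
    n = len(data)
    k = [[kernel(a, b) for b in data] for a in data]
    row_mean = [sum(row) / n for row in k]  # equals column mean (symmetry)
    total_mean = sum(row_mean) / n
    return [[k[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


def top_eigenpair(matrix, iterations=500):
    """Leading eigenvalue and unit eigenvector by power iteration."""
    n = len(matrix)
    v = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-constant start
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
        eigenvalue = norm
    return eigenvalue, v


def first_component_scores(data, kernel):
    """Projection of each training sample onto the first kernel component."""
    kc = centered_kernel_matrix(data, kernel)
    lam, v = top_eigenpair(kc)
    return [math.sqrt(lam) * vi for vi in v]


# Toy data: two well-separated clusters of three points each.
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
        [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]]
scores = first_component_scores(data, rbf_kernel)
print([round(s, 3) for s in scores])
```

On the toy data, the first kernel principal component assigns opposite signs to the two clusters, which is the kind of low-dimensional feature that would then be passed to a classifier.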
The traditional folk song is a type of song that people create orally to express their emotions in social situations. It is spread orally by working people and continuously processed, culminating in the formation and transmission of songs with corresponding characteristics. Traditional folk songs are also beneficial to people's productive work. The emergence of work chants in productive labor greatly encourages working people's confidence in life. As a result, they are also a necessity for working people engaged in productive labor.

3.2. Emotional Classification of Traditional Folk Songs.
China is a large country with an ancient civilization, a vast territory, many nationalities, and a long history, and it has a very rich musical cultural heritage. It has all kinds of traditional folk songs from different nationalities and regions. Each traditional folk song is spread by word of mouth, and singing is a social activity at all levels of society.
Traditional folk songs truly reflect the aspirations of the people. They transcend the characteristics of their own music, bringing us closer to the soul of a country. The supply of traditional folk songs is inexhaustible. In this field, music has been expanded to express all types of special emotions to varying degrees. All emotions of the soul, including joy, humor, frivolity, willfulness, and elation, as well as anxiety, worry, sadness, pain, and melancholy in various degrees, and even emotions like awe, worship, and love, fall within the scope of music.
Emotion requires catharsis and release, and human beings are the external expression of this requirement. Its linguistic expressions are closely related to music, and it serves as the foundation for expressiveness by allowing the will to translate expressive movements into musical tones. Traditional folk songs are songs that are created orally and passed down through the generations. Their distinct feature is the ability to express emotions. Traditional folk songs may lack solemnity and deep twists and turns, but they are extremely rare and valuable in their ability to express emotions.
Mobile Information Systems
According to the dictionary-based emotion classification method, only the emotional words in lyrics contribute to the emotional tendency of a song, while the degree adverbs and negative adverbs that modify emotional words affect their intensity and tendency. The following assumptions underpin the dictionary method used in this paper: negative adverbs and degree adverbs change the tendency or intensity of emotions by modifying the emotional words closest to them. When a negative adverb appears alone, the polarity of the emotional word it modifies is usually reversed, whereas when a degree adverb appears alone, the intensity of the emotion is usually increased or decreased.
Assume that the unique emotional word contained in the emotional unit $U$ is represented as $w$, and that its emotional value $v(w)$, given by the emotional dictionary, is one of $-1$, $-0.5$, $0.5$, and $1$. In addition, the negative factor, denoted as $n$, indicates the change of emotional tendency or emotional degree caused by negative adverbs, and the degree factor is denoted as $d$.
Then, the emotional value of $U$ can be calculated by formula (6). Let $t$ contain $k$ emotional units, which constitute the set $U$, that is, $U_i \in U$ $(i = 1, 2, \cdots, k)$. Finally, sum the emotional values of all sentences to get the emotional value $v(t)$ of song $t$, as shown in formula (7).
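The dictionary method can be sketched as follows. The formulas themselves are not reproduced here, so the unit score is assumed to take the multiplicative form v(U) = n * d * v(w), and the song score is taken as the sum of unit scores. The tiny English dictionaries, the factor values, and the rule that modifiers attach to the next emotional word are all hypothetical simplifications for illustration.

```python
# Hypothetical miniature dictionaries; values of v(w) follow the
# {-1, -0.5, 0.5, 1} scale described in the text.
EMOTION_DICT = {"joy": 1.0, "glad": 0.5, "sad": -0.5, "grief": -1.0}
NEGATIONS = {"not": -1.0, "never": -1.0}   # assumed negative factors n
DEGREES = {"very": 1.5, "slightly": 0.5}   # assumed degree factors d


def unit_value(word, modifiers):
    """Assumed form of the unit score: v(U) = n * d * v(w)."""
    n = 1.0
    d = 1.0
    for m in modifiers:
        n *= NEGATIONS.get(m, 1.0)
        d *= DEGREES.get(m, 1.0)
    return n * d * EMOTION_DICT[word]


def song_value(tokens):
    """Sum of unit scores; modifiers attach to the next emotional word."""
    total = 0.0
    pending = []
    for tok in tokens:
        if tok in EMOTION_DICT:
            total += unit_value(tok, pending)
            pending = []
        elif tok in NEGATIONS or tok in DEGREES:
            pending.append(tok)
    return total


print(song_value("i am not sad".split()))
print(song_value("never very glad and grief".split()))
```

A positive total suggests a positive emotional tendency and a negative total a negative one; for Chinese lyrics, the tokens would come from a word segmentation step rather than from splitting on spaces.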
Fine classification of music uses the long-term characteristics of music, such as pitch, rhythm, and volume, to divide music into various emotional categories. The second layer of the music emotion model consists of two fine classification models, namely, the bright-system classification model and the dark-system classification model. The bright-system classification model takes as input the data judged to be bright-system music by the first-layer coarse classification model. The dark-system classification model takes as input the data judged to be dark-system music by the first-layer coarse classification model. The latter model deals with a binary classification problem and is used to distinguish between anger and irritability and fear and gloom.
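The routing between the two layers can be sketched as a small dispatcher. The three classifiers below are toy stand-ins with hypothetical rules and feature names (`major`, `tempo`); the dark-system labels come from the text, while the bright-system labels are placeholders, since the categories handled by the bright-system model are not listed here.

```python
def two_layer_classify(features, coarse, bright_fine, dark_fine):
    """Route a sample through the coarse model, then the matching fine model."""
    system = coarse(features)  # expected to return "bright" or "dark"
    fine = bright_fine if system == "bright" else dark_fine
    return system, fine(features)


# Toy stand-in models (assumed rules, for illustration only).
def coarse(f):
    return "bright" if f["major"] else "dark"


def bright_fine(f):
    return "bright category A" if f["tempo"] > 100 else "bright category B"


def dark_fine(f):
    return "anger and irritability" if f["tempo"] > 120 else "fear and gloom"


print(two_layer_classify({"major": False, "tempo": 150},
                         coarse, bright_fine, dark_fine))
```

Because each fine model only ever sees samples from its own system, it can be trained on a narrower and more homogeneous subset than a single-layer model.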
The RF (Random Forest) model is based on the idea of model ensembling. Its core is bootstrap aggregating (bagging), an ensemble algorithm for machine learning models that aims to improve the stability and accuracy of models in statistical classification and regression tasks. Each decision tree in the RF model is split completely, without pruning, because an appropriate degree of fitting is ensured by the two random sampling processes.
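The two random sampling processes, bootstrap sampling of the training rows and random selection of features, can be sketched with a deliberately simplified ensemble. For brevity, each base learner below is a depth-1 stump on one randomly chosen feature instead of a fully split tree, so this illustrates bagging and majority voting only; the function names, toy data, and parameter values are assumptions.

```python
import random
from collections import Counter


def fit_stump(samples, labels, feature):
    """Best midpoint threshold on one feature (depth-1 stand-in for a tree)."""
    values = sorted({s[feature] for s in samples})
    best = None
    for low, high in zip(values, values[1:]):
        thr = (low + high) / 2.0
        left = [l for s, l in zip(samples, labels) if s[feature] <= thr]
        right = [l for s, l in zip(samples, labels) if s[feature] > thr]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        correct = left.count(left_label) + right.count(right_label)
        if best is None or correct > best[0]:
            best = (correct, thr, left_label, right_label)
    if best is None:  # only one distinct value: predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return feature, float("inf"), majority, majority
    return feature, best[1], best[2], best[3]


def fit_forest(samples, labels, n_trees=25, seed=0):
    """Bagging: each stump sees a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    n, n_features = len(samples), len(samples[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap rows
        feature = rng.randrange(n_features)         # random feature
        forest.append(fit_stump([samples[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest


def predict(forest, sample):
    """Majority vote over all base learners."""
    votes = Counter(left if sample[f] <= thr else right
                    for f, thr, left, right in forest)
    return votes.most_common(1)[0][0]


X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = ["dark", "dark", "dark", "bright", "bright", "bright"]
forest = fit_forest(X, y)
print(predict(forest, [0.15, 0.25]), predict(forest, [0.85, 0.75]))
```

An individual stump trained on an unlucky bootstrap sample can be poor, but the vote over many independently sampled learners is far more stable, which is the point of bagging.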
Therefore, the RF model has been built. The accuracy of each individual decision tree is not high, but the decision trees are independent of each other, and each plays a good role in distinguishing data samples in some narrow areas (features). After bagging integration, an RF model with high stability and accuracy is finally obtained. The schematic diagram is shown in Figure 2.

SVM (Support Vector Machine) is a machine learning method in which the idea of structural risk minimization is well realized. It is currently a very popular research topic and one of the latest developments in statistical learning theory.
It can be proved that the Lagrange coefficients $a_i$ of most samples are 0, and the samples whose coefficients are nonzero are the support vectors. In the optimal function obtained by solving the above problem, $b^*$ is the classification threshold, which in the general formula (8) can also be obtained by taking the median value of any pair of support vectors from the two classes.
SVM based on the Gaussian radial basis kernel function is currently the most widely used and also shows the best performance [12]. This kernel function maps the original finite-dimensional space into an infinite-dimensional space through a nonlinear kernel, and the value of its parameter δ is very important. If δ is too large, the weights of higher-order features decay rapidly, and the result is equivalent to a low-dimensional space, so that all samples tend to fall into one class.
However, if δ is too small, there may be a serious overfitting problem, because almost any data, even random data, can then be mapped to be linearly separable. Therefore, the parameter value must be chosen appropriately so that the Gaussian kernel function retains its flexibility.
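The effect of δ can be seen directly on a small kernel matrix. The sketch below assumes the common form K(x, y) = exp(-||x - y||^2 / (2 δ^2)); the exact scaling used in this paper is not shown here, so the factor of 2 is an assumption, as are the toy points and function names.

```python
import math


def gaussian_kernel(x, y, delta):
    """Gaussian RBF kernel; the 2 * delta**2 scaling is an assumed convention."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * delta ** 2))


def kernel_matrix(points, delta):
    """Pairwise kernel values between all points."""
    return [[gaussian_kernel(p, q, delta) for q in points] for p in points]


points = [[0.0], [1.0], [2.0]]
wide = kernel_matrix(points, delta=100.0)    # delta too large
narrow = kernel_matrix(points, delta=0.01)   # delta too small

for name, k in (("wide", wide), ("narrow", narrow)):
    print(name, [[round(v, 4) for v in row] for row in k])
```

With a very large δ every entry approaches 1, so all samples look alike to the classifier; with a very small δ the matrix approaches the identity, so each sample resembles only itself and the model can memorize the training set.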

Results Analysis and Discussion
The traditional folk song is a broad music genre that includes both song and music, as opposed to pure poetry or instrumental music. It has artistic value in that it combines music and literature, and it provides people with a new aesthetic experience. The expression of emotion in traditional folk songs is concrete and meticulous. Every traditional folk song contains a wealth of emotional elements. Traditional folk songs are the most regionally distinctive form of national folk music. They have different characteristics in different places, but their emotional expression is the same: there will be joy, sorrow, anger, and disgust. People respond to truly beautiful music, and traditional folk songs have this quality. Formally, songs that evoke nostalgia frequently use stepwise melody lines with a melodic range of no more than two octaves. The melody is fairly gentle and relaxing. It has a denser rhythm, notes that are shorter in duration, and beats that change more frequently. To express quiet and gentle emotions, the majority of these songs use minor keys. This type of emotion is often expressed unilaterally at first, but it is bound to elicit a response in order to gain understanding, and the expression of this emotion evolves into communication over time. Communication has many different aspects. Singing is a way to express love that dates back to the most basic needs of human nature.
The audio features chosen in this paper can be divided into two categories in terms of signal processing: short-term features and long-term features. The difference between them lies in the frame lengths used when processing the music signal. Short-term frames (50 ms per frame) are used to calculate short-term features, while long-term frames (5 s per frame) are used to calculate long-term features. It is necessary to combine short-term and long-term features in order to train the classification model. Because short-term and long-term features are computed over different frame lengths, the short-term and long-term feature matrices calculated for the same piece of music have different dimensions, so they cannot be concatenated directly into a complete feature matrix. As a result, feature fusion is used to solve this issue. Each single-layer classification model receives the fused short-term and long-term features as training data. The overall classification accuracy obtained by training with a single-layer classification model is shown in Figure 3.
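One simple way to reconcile the two frame rates is sketched below. The fusion scheme is not spelled out here, so this sketch assumes one plausible choice: the short-term frames falling inside each 5 s window are summarized by their per-feature mean and standard deviation and then concatenated with that window's long-term features. The function name and toy feature values are hypothetical.

```python
import statistics

SHORT_FRAME_S = 0.05  # 50 ms per short-term frame, from the text
LONG_FRAME_S = 5.0    # 5 s per long-term frame, from the text
FRAMES_PER_WINDOW = round(LONG_FRAME_S / SHORT_FRAME_S)  # 100


def fuse_features(short_feats, long_feats):
    """Return one fused row per long-term frame.

    Each row is [means of short features, stds of short features,
    long-term features] (an assumed fusion layout).
    """
    fused = []
    for i, long_row in enumerate(long_feats):
        start = i * FRAMES_PER_WINDOW
        block = short_feats[start:start + FRAMES_PER_WINDOW]
        if not block:
            break
        columns = list(zip(*block))
        means = [statistics.fmean(c) for c in columns]
        stds = [statistics.pstdev(c) for c in columns]
        fused.append(means + stds + list(long_row))
    return fused


# Toy input: 10 s of audio, 2 short-term features and 1 long-term feature.
short_feats = [[float(i), 1.0] for i in range(200)]
long_feats = [[10.0], [20.0]]
fused = fuse_features(short_feats, long_feats)
print(len(fused), len(fused[0]), fused[0])
```

After fusion, every row has the same length regardless of how many short-term frames it summarizes, so the rows can be stacked into a single training matrix.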
As can be seen from Figure 3, BPNN (BP neural network), a deep learning algorithm, has better overall classification accuracy. RF is an ensemble model that uses bagging to combine several tree models, and it also achieves good overall classification accuracy. To sum up, both the BPNN and RF models achieve good overall classification accuracy.
Figure 4 shows the recall rate of each classification model for each emotion category.
Looking at the recall rate for each emotional category, the two categories of joy and enthusiasm and of anger and irritability have higher recall rates, indicating that the classification models recognize these two types of music well; however, the recall rate for the category of fear and gloom is low, indicating that the classification models have poor recognition ability for this category. BPNN has the best ability to identify music in the quiet, sad, and lost categories, where the other models perform poorly.
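For reference, the per-category recall rate is the fraction of samples of a category that the model labels correctly. A minimal sketch with made-up labels (not the experimental data of this paper) is:

```python
from collections import Counter


def recall_per_class(y_true, y_pred):
    """Recall of each class: correctly predicted samples / true samples."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in support}


# Made-up labels for illustration only.
y_true = ["joy", "joy", "anger", "anger", "fear", "fear", "fear", "fear"]
y_pred = ["joy", "joy", "anger", "joy", "fear", "anger", "anger", "joy"]
recalls = recall_per_class(y_true, y_pred)
print(recalls)
```

A low recall for one category, as for the hypothetical "fear" class here, means the model often assigns that category's samples to other classes even if its overall accuracy looks acceptable.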
The lyricism of traditional folk songs, of course, refers not only to romantic love but also to love for the motherland, society, hometown, parents, and children. Emotion is the most common and most important factor in traditional folk songs. From a content standpoint, nostalgic songs primarily describe the yearning, nostalgia, attachment, and love of people, especially those who have been away from their hometown for a long time, for their hometown's landscapes, folk customs, and specialties. They can be divided into two categories based on the emphasis of emotional expression: the first is songs that express homesickness. This genre of music focuses on expressing the longing and nostalgia of wanderers or expatriates in foreign countries for their home country.
For KPCA, different kernel functions and different numbers of eigenvalues are compared on the same data; the classification results of each kernel function are shown in Figures 5, 6, and 7.
Comparing the above figures, it can be concluded that, for the same number of eigenvalues, the radial basis function gives the best classification result. As can be seen from Figure 6, the training accuracy of the P-order polynomial kernel function and the linear kernel function is very high, but their classification accuracy is not. Comparing at the same feature dimension of 3, it is found that the first principal component contribution rates of the P-order polynomial kernel function and the linear kernel function are too concentrated, reaching over 99%, which is unfavorable for classification. It can also be seen that Gaussian KPCA needs fewer features and is more effective than PCA, which shows that KPCA has a better classification effect than PCA for stationary signals.
The Gaussian kernel is used in the SVM. When performing classification training, not only the parameter δ but also a weight parameter c must be determined; c is called the error penalty coefficient and represents the degree to which misclassified samples are penalized. The size of c needs to be selected appropriately. If c is larger, the fitting degree is higher, but the generalization ability is weaker; if c is smaller, the model complexity is reduced, but the empirical risk may become large.
This paper addresses the classification of nonstationary nonlinear signals. For classification, feature extraction is the key, since it determines the performance of the classifier, and the performance of the classifier can in turn directly verify the correctness of the experimental principle. The energy features of the coefficients are extracted and subjected to kernel principal component analysis with different kernel functions. After the principal component analysis, the important feature values are obtained, and classification is then performed. Figure 8 shows the classification results for nonstationary signals.
After analyzing the overall performance of the classification model in the last two sections, this section analyzes and compares the accuracy of each emotion category. Figure 9 shows the performance of each classification model for each category, where TL_C is the two-layer music emotion classification model, SLN_C is the single-layer neural network classification model, and SLR_C is the single-layer RF classification model.
For quiet and soft music, the indexes of the double-layer model are slightly higher than those of the SLN_C model and much higher than those of the RF model. In the sad and lost category, the double-layer model and the SLN_C model perform at roughly the same level; both have a better ability to identify positive samples and distinguish negative samples than the RF model in this category.
The two-layer classification model is slightly better than the RF model for angry music, with an accuracy about 3% higher. For music in the fear and gloom category, the SLN_C model has a low recognition rate, whereas the other two models have relatively high recognition rates; the double-layer classification model is slightly more accurate than the RF model (by less than 2%). To summarize, the double-layer music emotion classification model has a good ability to identify positive samples for each emotion category, as well as a good ability to distinguish negative samples, and the accuracy of each category is kept at a relatively high level, demonstrating that it is more robust than SLN_C and RF.

Conclusion
The most widely expressed content in traditional folk songs is emotion, such as affection, friendship, love, or homesickness, which is shared by different ethnic groups. Joy, anger, sorrow, and happiness are all in it; within its short length, there is a glimpse into a colorful world. Based on JTFA, this paper makes an in-depth analysis of the emotional expressiveness of music signals, links music emotions with music elements, and maps them to the corresponding audio features. The time-frequency domain characteristics of music signals with different emotions are studied. An emotion classification system based on a public dictionary is constructed, from which emotion words can be obtained automatically. Based on all the experimental data, the overall classification accuracy of the double-layer classification model is compared with that of the best single-layer classification model. The double-layer classification model is more robust in the task of music emotion classification, which shows that it is effective for classifying music emotion.

Figure 4: Recall rate of different emotional categories in different models.

Figure 6: Results when the number of features is 4.