Feature Extraction of Music Signal Based on Adaptive Wave Equation Inversion

The digitization, analysis, and processing of music signals are the core of digital music technology. Music signal processing is generally preceded by preprocessing, which usually includes antialiasing filtering, digitization, preemphasis, windowing, and framing. Songs distributed on the Internet in the popular WAV and MP3 formats are already digital and do not need to be digitized again. Preprocessing affects the effectiveness and reliability of feature parameter extraction from music signals. Since the music signal is a kind of voice signal, voice processing techniques also apply to it. In traditional full-wave equation inversion, the objective function is the minimum mean square error between real data and simulated data, and the gradient direction is determined by the cross-correlation of the back-propagated residual wave field with the second time derivative of the forward-simulated wave field. When there is a large gap between the initial model and the true model, the inversion is prone to cycle skipping (cycle jumping). This paper uses adaptive wave equation inversion instead. The method adopts the idea of a penalty function and introduces a Wiener filter to establish a dual objective function for the phase difference that appears in the inversion. The paper discusses the formulas for the adjoint source, the gradient, and the iteration step length and uses the conjugate gradient method to reduce the phase difference iteratively. On a test function group and a recorded music signal library, a large number of simulation experiments and comparative music signal recognition experiments were performed on the extracted features, verifying the time-frequency analysis performance of wave equation inversion and the improvement of the decomposition algorithm. The features extracted by wave equation inversion achieve a higher recognition rate than the features extracted with the standard decomposition algorithm, which indicates that wave equation inversion has a better decomposition ability.


Introduction
Music expresses people's thoughts and conveys happiness, anger, sorrow, and joy. It exists in every culture and country and is closely tied to daily life [1]. Since the reform and opening up, music has developed and changed continuously, producing many different styles and a large number of works [2]. When using a music search engine, 64% of users cannot find the song they want to listen to, and many users are not clear about what music they need [3]. Faced with such a large volume of music, listeners find it difficult to locate their favorite pieces, and music classification and search still have considerable room for development [4]. With the popularization of the Internet and the continuous development of network applications, a huge user group and a massive scale of data make digital music retrieval and recommendation increasingly important [5]. Music classification is an important field of music retrieval. It is the premise, the technical means, and the main work content of research on content-based music retrieval and recommendation, and the study of music style classification has broad room for development [6]. Classifying music by style helps people find their favorite music quickly and makes it possible to play different styles of music at different times and on different occasions.
In music signal recognition, the key issue is to establish an acoustic model of the recognition primitives [7]. Some problems in acoustic modeling of music signals have not yet been completely solved, so the performance of some products still falls short of the intended use requirements [8]. The acoustic model is built on the characteristic parameters of the music signal, so the amount of useful information these parameters contain directly determines how accurately the acoustic model describes the signal. If the parameters carry little information, the resulting acoustic model is also imperfect. Before the acoustic model is established, the characteristic parameters must therefore be studied to extract those with the most useful information [9]. Acoustic models established from characteristic parameters fall mainly into two categories. The first performs mapping and warping along the time axis and measures the distortion between two sets of characteristic parameters. The second is a model based on statistical knowledge, which starts from an initial model and training data and repeatedly reestimates and optimizes the parameters until the model converges. This training algorithm does not yield a globally optimal analytical solution and easily falls into a local optimum, so the final model parameters can differ considerably depending on the initial model. Therefore, the study of feature parameter extraction and model initialization is of great significance in music signal recognition.
This article takes the adaptive wave equation inversion method as its core and analyzes wave equation inversion in the time domain, including its basic idea. For full-wave equation inversion in the time domain, the objective function is given and a local optimization algorithm, namely the gradient method, is used for the inversion; the gradient formula is given together with a detailed derivation. To address cycle skipping, adaptive wave equation inversion is introduced, including its basic principle and objective function. With the new objective function, the formulas of the adjoint source and the gradient are given and derived in detail, and the gradients of adaptive wave equation inversion and full-wave equation inversion are compared. The gradient formula of the conjugate gradient method and the selection of the step length are also introduced. We perform vowel recognition experiments on music signal libraries 1 and 2. For features of the same dimension, wave equation inversion has a higher recognition rate on music signal library 1 than the three contrasted features. On music signal library 2, it also has a higher recognition rate both for the original signal and at low signal-to-noise ratio. Among the feature combinations in this article, HMS-MFCC has a strong characterization ability, while EWCF is more susceptible to noise but has the lowest dimensionality.

Related Work
The purpose of feature extraction is to obtain information that is conducive to identification and to eliminate interference in the music signal. The music signal carries not only a large amount of musical information but also personal characteristic information. The characteristic parameters of the music signal should accurately represent all the information in the original signal that helps to distinguish it. Because research is not yet thorough enough, the existing characteristic parameters cannot completely and accurately characterize the information in the music signal. At present, the characteristic parameters of music signals can be divided into time domain, frequency domain, and cepstrum domain parameters. Time domain parameters are obtained by reducing the dimensionality of each frame of the music signal in the time domain to form a set of feature vectors; they mainly include short-term energy, short-term zero-crossing rate, and autocorrelation coefficients. Frequency domain and cepstrum domain parameters are obtained by transforming each frame of the music signal into the frequency domain and either extracting characteristic parameters there or converting the frequency domain parameters into the cepstrum domain.
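The two simplest time domain parameters named above can be illustrated with a minimal sketch. The function names and the sample frame are hypothetical, and the definitions used here (sum of squares, fraction of sign changes) are the common textbook ones rather than formulas taken from this paper.

```python
def short_time_energy(frame):
    """Sum of squared sample values in one frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


# Hypothetical 8-sample frame that changes sign at every sample.
frame = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
energy = short_time_energy(frame)
zcr = zero_crossing_rate(frame)
```

Each frame is thus reduced to one number per parameter, which is the dimensionality reduction the paragraph describes.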
At present, no single feature parameter can represent all the useful information in a music signal, not even the relatively mature MFCC parameters [10]. Each parameter is only an approximate description of a certain aspect of the music signal. For example, the commonly used MFCC parameters simulate the human auditory system and mainly consider low-frequency components. Low-frequency components therefore dominate the parameters, and the differences among the individual MFCC components are not exploited for feature selection, so some important information is lost [11]. Researchers have proposed many algorithms to improve the characteristic parameters of music signals [12].
The Mel-frequency cepstral coefficient (MFCC) is currently the most widely used characteristic coefficient in music signal recognition systems. It is extracted by simulating the auditory system of the human ear, establishing a model that describes the energy distribution of the music signal in the frequency domain [13]. The human auditory system perceives sounds of different frequencies differently. Below 1,000 Hz, perception is approximately linear in frequency; above 1,000 Hz, it is approximately logarithmic in frequency [14]. Compared with LPC and LPCC parameters, MFCC parameters emphasize the low-frequency information of music signals, suppress high-frequency noise interference, and make no assumptions about the signal, so they can be used in a wide range of situations.
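The roughly linear behavior below 1,000 Hz and logarithmic behavior above it are usually captured by a Mel scale mapping. The sketch below uses one widely used form of that mapping, with constants 2595 and 700; this particular formula is an assumption, since the paper does not state which mapping it uses.

```python
import math


def hz_to_mel(f_hz):
    """Map frequency in Hz to the Mel scale (common 2595/700 form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal 1,000 Hz steps cover fewer Mels at high frequency than at low frequency.
low_band = hz_to_mel(1000.0) - hz_to_mel(0.0)
high_band = hz_to_mel(8000.0) - hz_to_mel(7000.0)
```

With this mapping 1,000 Hz lands at about 1,000 Mel, and the compression of the high band relative to the low band is what makes MFCC emphasize low-frequency information.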
With the advancement of computer science and technology, the basic theories and key technologies of music signal recognition were initially developed [15]. The main research results of this period are dynamic programming (DP) and linear prediction (LP). Dynamic programming aligns a group of music signals in time and thereby handles the unequal lengths of music signals in recognition [16]. Linear predictive analysis provides a better mathematical model of music signal generation and has had a profound impact on the development and application of music signal recognition technology [17]. At the same time, NEC Laboratory, Tokyo Radio Laboratory in Japan, and Kyoto University successively developed dedicated hardware for music signal recognition, laying a solid foundation for further theoretical research and practical application [18,19]. The Baum-Welch algorithm is essentially an expectation-maximization algorithm [20]. It guarantees that the output probability of the model does not decrease after each reestimation, but it depends strongly on the initial parameters: for different initial parameters, the final output probability is not unique. Therefore, the traditional Baum-Welch algorithm cannot accurately and completely establish an acoustic model of the training observation sequence [21]. For hidden Markov models, how to train a good acoustic model has always been a difficult research problem [22].
To keep the Baum-Welch algorithm's dependence on the initial model parameters from trapping the final trained model in a local optimum, researchers have proposed various solutions and algorithms [23]. These fall into two groups. The first combines Baum-Welch training with other algorithms that optimize the model parameters obtained from each reestimation; such algorithms generally have the advantage of global optimization [24]. The second optimizes parameters at the model initialization stage and tries to choose more appropriate initial parameters [25]. The music signal generation system is formed by connecting three functions in series, namely the excitation model, the vocal tract model, and the radiation model.

Music Signal Processing Technology
Common vocal tract models include the lossless acoustic tube model and the formant model. The excitation wave from the sound source is shaped by the vocal tract, which resonates in certain frequency bands. The peak that the spectral envelope exhibits at a resonant frequency is called a formant. The vocal tract for general vowels is represented by an all-pole model, whereas nongeneral vowels and most consonants are represented by a pole-zero model. Each formant is modeled by a second-order resonator with transfer function V_i, and linearly combining multiple V_i gives the formant model of the vocal tract. Since the excitation model of the music signal is an all-pole expression, the ratio of the sound pressure of the music signal to the output volume velocity of the vocal tract is called the radiation impedance; its expression is derived under the assumption that the open area of the lips is much smaller than the surface area of the head. In practice, the physical process of music signal generation differs from the three models above but is approximately equivalent to them. This is also consistent with the music signal being a time-varying signal that is stable only over short intervals. In addition, voiced fricatives have unvoiced and voiced excitation sources at the same time, and their excitation cannot be obtained by simply superimposing the two.
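As an illustration of the second-order resonator mentioned above, the sketch below uses a standard textbook digital resonator, 1 / (1 - 2 r cos(theta) z^-1 + r^2 z^-2) with r = exp(-pi B / fs) and theta = 2 pi F / fs. This form, the function names, and the numeric values (500 Hz formant, 60 Hz bandwidth, 8 kHz sampling) are assumptions for illustration; the paper's own expression is not reproduced here.

```python
import cmath
import math


def resonator_coeffs(f_center, bandwidth, fs):
    """Denominator coefficients (a1, a2) of 1 / (1 + a1 z^-1 + a2 z^-2)."""
    r = math.exp(-math.pi * bandwidth / fs)     # pole radius from bandwidth
    theta = 2.0 * math.pi * f_center / fs       # pole angle from formant frequency
    return -2.0 * r * math.cos(theta), r * r


def magnitude(a1, a2, f, fs):
    """Magnitude response of the resonator at frequency f in Hz."""
    z_inv = cmath.exp(-2j * math.pi * f / fs)
    return abs(1.0 / (1.0 + a1 * z_inv + a2 * z_inv * z_inv))


fs = 8000.0
a1, a2 = resonator_coeffs(500.0, 60.0, fs)
freqs = [10.0 * k for k in range(1, 400)]       # 10 Hz to 3990 Hz
peak_freq = max(freqs, key=lambda f: magnitude(a1, a2, f, fs))
```

The magnitude response peaks near the chosen formant frequency, which is the resonance behavior the formant model relies on.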

Preprocessing of Music Signal
The music signal is represented mathematically as a time-varying function. After sampling, it is a column vector of dimension N × 1, where N is the total number of samples. Sampling and A/D conversion change the music signal from an analog signal to a digital signal. The sampling rate is the number of samples taken per second. The higher the sampling rate, the more information is captured per unit time and the more faithfully the music signal can be restored. To preserve the characteristics of the music signal and avoid spectrum aliasing, sampling must satisfy the Nyquist theorem, that is, the sampling frequency must satisfy $f_s > 2 f_m$, where $f_m$ is the highest frequency of the music signal. Quantization divides the entire amplitude range into a finite set of intervals, assigns one representative value to each interval, and treats every sample whose amplitude falls in an interval as having that representative value.
Preemphasis compensates for the roughly 6 dB/octave amplitude drop of the music signal in the high-frequency band above 800 Hz. The high-frequency part of the signal is therefore boosted by passing it through a preemphasis transfer function, which in some cases is also used to remove a DC level offset.
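A minimal sketch of preemphasis follows, assuming the common first-order filter y[n] = x[n] - mu * x[n-1]. The filter form and the coefficient mu = 0.97 are assumptions, since the paper's transfer function is not reproduced in the text.

```python
def preemphasis(signal, mu=0.97):
    """Apply y[n] = x[n] - mu * x[n-1]; the first sample is passed through."""
    if not signal:
        return []
    out = [signal[0]]
    for prev, cur in zip(signal, signal[1:]):
        out.append(cur - mu * prev)
    return out


# A constant (DC) input is almost cancelled after the first sample,
# while a rapidly alternating (high-frequency) input is amplified.
dc = preemphasis([1.0] * 5)
alternating = preemphasis([1.0, -1.0, 1.0, -1.0])
```

The two test inputs show both effects mentioned in the paragraph: the DC level is suppressed and the high-frequency component is boosted.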

Advances in Mathematical Physics
The music signal is stable only over short intervals, and its characteristics can be considered unchanged within about 10 ms. The segment obtained by multiplying the music signal by the window function is called a frame, and the length of the segment is called the frame length. Generally, there are 33-100 frames per second. The offset between the starts of adjacent frames is called the frame shift. To make the transition between frames continuous and smooth, adjacent frames overlap, and the frame shift is usually 1/3 of the frame length.
The main lobe of the rectangular window is narrow, so its frequency resolution is high, but its side lobe peaks are large, so spectrum leakage is serious. The main lobe of the Hamming window is wider, so its frequency resolution is lower, but its side lobes are strongly attenuated, which gives a smoother spectrum with less leakage and better preserves the waveform characteristics of the music signal. After multiplication by the window function, the music signal waveform should show no sharp changes at the frame edges, and its characteristics should be maintained as much as possible. When selecting the window, the main lobe width, frequency resolution, and side lobe attenuation should be considered together.
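Framing and windowing can be sketched as follows. The 8 kHz sampling rate, 30 ms frame length, and 440 Hz test tone are hypothetical values chosen so that the frame shift is 1/3 of the frame length; `frame_signal` is an illustrative helper, not a function from the paper.

```python
import numpy as np


def frame_signal(x, frame_len, frame_shift):
    """Split x into overlapping frames and apply a Hamming window to each."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack(
        [x[i * frame_shift : i * frame_shift + frame_len] for i in range(n_frames)]
    )
    return frames * window          # window broadcast over every frame


fs = 8000                                            # assumed sampling rate in Hz
x = np.sin(2.0 * np.pi * 440.0 * np.arange(fs) / fs)  # one second of a test tone
frame_len = 240                                      # 30 ms at 8 kHz
frame_shift = frame_len // 3                         # 1/3 of the frame length
frames = frame_signal(x, frame_len, frame_shift)
```

With these values one second of signal yields 98 frames, which falls inside the 33-100 frames per second range, and the Hamming window tapers each frame toward its edges so that no sharp change appears there.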
Endpoint detection finds the start and end of the sound segment in the signal, which makes it possible to remove silent segments, enhance the useful part of the signal, and reduce the length of the voice to be processed. For isolated word recognition, its main purpose is to reduce the amount of calculation and the noise interference and to increase accuracy; for continuous speech recognition, it is mainly used to segment the recognition primitives so that they can be modeled and recognized. Only when the start and end of the voice signal are found accurately can the subsequent processing be performed accurately. The dual-threshold method of endpoint detection is illustrated in a schematic diagram.

In full-wave equation inversion (FWI), the forward-simulated wave field is u, the observed wave field is d, and the residual of the two is $\delta d = u - d$. When the phase difference between the predicted data and the real data is greater than half a cycle, cycle skipping (also called cycle jumping) occurs. With actual seismic data, the initial model is usually not accurate enough, so the inversion is prone to cycle skipping, which strongly degrades the result. For this reason, we introduce a penalty term that constrains the objective function to overcome cycle skipping. Figure 2 is a schematic diagram of cycle skipping artifacts in FWI. The solid blue line represents the true waveform as a function of time, with period T. In the upper example, the solid red line represents a predicted waveform whose time delay relative to the real waveform is greater than T/2. In this case, FWI updates the underground medium model so that the (n + 1)th period of the predicted data matches the nth period of the observed data. The model update is therefore wrong, and the inversion result is biased.
In the lower example, the nth period of the predicted data matches the nth period of the observed data because the time delay is less than T/2, and FWI updates the underground medium model correctly.
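The role of the T/2 threshold can be checked numerically. The sketch below is an illustration only, not the paper's experiment: it uses a hypothetical Gaussian-windowed cosine with period T = 0.1 s as the waveform and evaluates the least-squares misfit as a function of the delay between predicted and observed data.

```python
import numpy as np

T = 0.1                               # dominant period in seconds (hypothetical)
t = np.arange(0.0, 2.0, 0.001)        # time axis with 1 ms sampling


def wavelet(delay):
    """Gaussian-windowed cosine centred at 1 s plus the given delay."""
    tc = t - 1.0 - delay
    return np.exp(-tc**2 / (2.0 * (2.0 * T) ** 2)) * np.cos(2.0 * np.pi * tc / T)


observed = wavelet(0.0)


def misfit(delay):
    """Least-squares misfit between delayed prediction and observation."""
    return 0.5 * np.sum((wavelet(delay) - observed) ** 2)


# Search around one full period for the spurious (cycle-skipped) minimum.
delays = np.linspace(0.75 * T, 1.25 * T, 51)
spurious = delays[np.argmin([misfit(d) for d in delays])]
```

For delays below T/2 the misfit decreases toward zero delay, so a gradient method moves in the right direction. For delays above T/2 the misfit decreases toward a delay of about one full period, where a nonzero local minimum sits, which is the wrong update described for the upper example of Figure 2.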
Adaptive wave equation inversion is proposed to suppress the influence of cycle skipping on the inversion. It can start from an unsatisfactory initial model and still obtain relatively good inversion results.
The theory and method of adaptive wave equation inversion differ from the traditional full-wave equation inversion method. Instead of subtracting one data set directly from the other, one data set is first convolved with a filter and the result is subtracted from the other data set. In this way, adaptive wave equation inversion suppresses cycle skipping well.
The convolution of a signal $f(t)$ with the impulse signal $\delta(t)$ equals $f(t)$ itself, so convolving the wave field d with the impulse function returns d. When the predicted wave field u is very close to the real wave field d, $u * \delta = d$ holds approximately. In forward adaptive wave equation inversion, the filter coefficients are calculated and the simulated data are convolved with them. Through continued iteration, the simulated data move closer to the real data, the phase difference between the two gradually decreases, and cycle skipping is suppressed. The filter gradually becomes, or approximates, an impulse function; at that point the difference between the simulated data and the real data is minimized and a good inversion result is obtained. Alternatively, the real data can be convolved with the filter coefficients and then compared with the simulated data, and iteration again reduces the gap between the two. This variant is called backward adaptive wave equation inversion.

Objective Function of Adaptive Wave Equation Inversion.
The objective function of adaptive wave equation inversion differs from that of traditional full-wave equation inversion. There are two objective functions, and the inversion is accordingly divided into two steps. The first step calculates the filter coefficients. The second step determines the new adjoint source from the filter coefficients and calculates the gradient, which is combined with the step length for iterative updating. In the first step, a Wiener filter w of order l is defined. The filter is convolved with the real data, and the least-squares difference between the result and the simulated data gives the objective function $C(m)$:

$$C(m) = \lVert D w - u \rVert^{2}$$

Here u is the forward-simulated wave field, w contains the coefficients of the Wiener filter, D is the Toeplitz matrix whose columns each contain the seismic survey record wave field d, and Dw is the convolution of the real data d with the filter w. In traditional full-wave equation inversion, the objective function is the mean square error between the predicted data and the real data. With a less-than-ideal initial model, that inversion gives poor or even wrong results, and cycle skipping is one of the causes.
The first step is to find the coefficients of the filter. The principle of the Wiener filter is briefly as follows. Let $w(m)$ be the unit impulse response of the system and $x(n)$ an input random signal,

$$x(n) = s(n) + v(n),$$

where $s(n)$ represents the signal and $v(n)$ represents the noise. The filter output, which is the estimate of the desired signal, is

$$y(n) = \hat{s}(n) = \sum_{m} w(m)\, x(n - m).$$

The error is

$$e(n) = s(n) - \hat{s}(n).$$

The mean square error is

$$E\left[e^{2}(n)\right] = E\left[\left(s(n) - \hat{s}(n)\right)^{2}\right].$$

Further, we get

$$E\left[e^{2}(n)\right] = E\left[\left(s(n) - \sum_{m} w(m)\, x(n - m)\right)^{2}\right].$$

Designing a Wiener filter means finding the unit impulse response or transfer function that minimizes the mean square error, which amounts to solving the Wiener-Hopf equation. Used here, the Wiener filter suppresses cycle skipping well. Applying Wiener filtering to the objective function $C(m)$ of adaptive wave equation inversion gives the filter w:

$$w = \left(D^{T} D\right)^{-1} D^{T} u$$

$D^{T} D$ is the autocorrelation of the seismic survey record wave field d, and $D^{T} u$ is the cross-correlation between the forward-simulated wave field u and the seismic survey record wave field d. In words, the filter w is the inverse of the autocorrelation matrix of the observed data multiplied by the cross-correlation between the observed data and the predicted data. When the observed data are consistent with the predicted data, that is, when d = u, w is an impulse function. In general, however, the predicted data are not equal to the observed data, and the inversion tries to drive the filter w toward an impulse. When designing the l-order filter w, the seismic source wavelet should be taken into consideration.
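The filter computation described above, the inverse of the autocorrelation of the observed data applied to the cross-correlation with the predicted data, can be sketched with a least-squares solve. The helper names are hypothetical, the filter is restricted to non-negative lags for brevity (an assumption), and random data stand in for seismic or music records.

```python
import numpy as np


def convolution_matrix(d, n_coeff):
    """Toeplitz matrix D such that D @ w equals np.convolve(d, w)."""
    n = len(d)
    D = np.zeros((n + n_coeff - 1, n_coeff))
    for j in range(n_coeff):
        D[j : j + n, j] = d         # column j is d delayed by j samples
    return D


def wiener_filter(d, u, n_coeff):
    """Least-squares filter w minimising ||D w - u||^2."""
    D = convolution_matrix(d, n_coeff)
    u_padded = np.zeros(D.shape[0])
    u_padded[: len(u)] = u          # zero-pad u to the convolution length
    w, *_ = np.linalg.lstsq(D, u_padded, rcond=None)
    return w


rng = np.random.default_rng(0)
d = rng.standard_normal(64)                      # stand-in for observed data
w_same = wiener_filter(d, d, 8)                  # predicted equals observed
u_delayed = np.concatenate([np.zeros(3), d])     # predicted is d delayed by 3
w_shift = wiener_filter(d, u_delayed, 8)
```

When the predicted data equal the observed data the filter is an impulse at zero lag, and when the predicted data are a delayed copy the impulse moves to the corresponding lag. This displacement of the filter energy away from zero lag is what the second objective function penalizes.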
After the filter coefficients are calculated, the objective function $f(m)$ of adaptive full-wave equation inversion is formed from the filter w and a weighting matrix T, where T is an $(l + 1) \times (l + 1)$ diagonal matrix. The purpose of this objective function is to constrain the filter w, using the idea of a penalty function. In its simplest form, T is based on the absolute phase difference between the simulated data and the real data, but a more complex form of T can provide faster and more stable convergence.
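One form of this penalty that is commonly used for adaptive waveform inversion is the normalized quantity f = ||T w||^2 / (2 ||w||^2), with the diagonal of T equal to the absolute lag of each filter coefficient. The sketch below assumes that form; the paper's own expression for f(m) is not reproduced in the text, so both the normalization and the choice of T are assumptions.

```python
import numpy as np


def awi_penalty(w, lags):
    """Normalised penalty 0.5 * ||T w||^2 / ||w||^2 with T = diag(|lag|)."""
    T = np.diag(np.abs(lags).astype(float))
    Tw = T @ w
    return 0.5 * float(Tw @ Tw) / float(w @ w)


lags = np.arange(-4, 5)             # hypothetical filter lags -4 .. +4
impulse = np.zeros(9)
impulse[4] = 1.0                    # all energy at zero lag
shifted = np.zeros(9)
shifted[7] = 1.0                    # all energy at lag +3

f_impulse = awi_penalty(impulse, lags)
f_shifted = awi_penalty(shifted, lags)
```

Under this assumed form the penalty is zero when the filter is an impulse at zero lag and grows with the square of the lag at which the filter energy sits, so minimizing it pushes the filter toward an impulse.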

Adjoint Sources and Gradients of Adaptive Wave Equation Inversion
Because the objective function changes, the adjoint sources and gradients of adaptive wave equation inversion differ from those of full-wave equation inversion. The formulas are given and derived here. A represents the matrix of numerical operators that implements the wave equation, s is the seismic source, and u is the wave field generated by model m, so that $A u = s$.
Taking the partial derivative of the objective function with respect to the model m gives the gradient. Up to this point, the derivation is the same as that of the gradient formula for full-wave equation inversion. The gradient of adaptive full-wave equation inversion is then obtained by introducing the adjoint source variable δs, which replaces the data residual used in full-wave equation inversion. Through this derivation, the gradient and the adjoint source of adaptive wave equation inversion in the time domain are obtained. Compared with full-wave equation inversion in the time domain, an additional transformation is needed to obtain the final gradient.
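The chain of derivatives described above can be written out as a sketch under stated assumptions: the wave equation is $A(m)\,u = s$ with a source s that does not depend on m, the filter is $w = (D^{T}D)^{-1}D^{T}u$, and the penalty has the normalized form $f = \lVert T w \rVert^{2} / (2 \lVert w \rVert^{2})$. The last of these is an assumed form, so the resulting adjoint source is illustrative and may differ from the paper's expression.

```latex
\begin{align}
A(m)\,u = s
  \quad &\Rightarrow\quad
  \frac{\partial u}{\partial m} = -A^{-1}\,\frac{\partial A}{\partial m}\,u, \\
\frac{\partial f}{\partial m}
  &= \left(\frac{\partial u}{\partial m}\right)^{T}\frac{\partial f}{\partial u}
   = -\,u^{T}\left(\frac{\partial A}{\partial m}\right)^{T} A^{-T}\,\delta s,
  \qquad \delta s \equiv \frac{\partial f}{\partial u}, \\
\frac{\partial f}{\partial w}
  &= \frac{T^{2} w}{w^{T} w} - \frac{\left(w^{T} T^{2} w\right) w}{\left(w^{T} w\right)^{2}}
   = \frac{\left(T^{2} - 2 f\, I\right) w}{w^{T} w}, \\
\delta s
  &= D\left(D^{T} D\right)^{-1}\frac{\partial f}{\partial w}
   = D\left(D^{T} D\right)^{-1}\frac{\left(T^{2} - 2 f\, I\right) w}{w^{T} w}.
\end{align}
```

The second line has the same structure as in full-wave equation inversion, where the adjoint source is the data residual $\delta d$; only the definition of $\delta s$ changes. The factor $A^{-T}\,\delta s$ is the back-propagation of the adjoint source.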
The gradient in full-wave equation inversion is the time integral of the product of the second time derivative of the forward wave field and the back-propagated residual wave field. The gradient of adaptive wave equation inversion differs: it is the time integral of the product of the second time derivative of the forward wave field and the back-propagated new adjoint source. Finding the new adjoint source therefore plays an important role in the whole method. Overall, the objective function and gradient formula of adaptive wave equation inversion are designed to suppress the adverse effects of cycle skipping. When the filter is instead convolved with the simulated data and the L2 norm of the difference from the real data is minimized, the method that obtains the filter coefficients and adjoint source in this form is called forward adaptive wave equation inversion. In this case the filter v minimizes

$$\lVert U v - d \rVert^{2}$$

Here U is the Toeplitz matrix whose columns each contain the simulated data u, and v contains the coefficients of the forward filter. The difference between the two methods is whether the filter is convolved with the real data or with the simulated data.

Conjugate Gradient Method Adaptive Wave Equation Inversion
The gradient method is the earliest local optimization algorithm. Its advantages are that the algorithm is relatively simple, each iteration requires little calculation, and memory usage is small. It converges to a local minimum even from a poor initial point. Its disadvantages are that the convergence rate is slow and that it converges to a local minimum rather than a global minimum. Newton's method converges very fast, with quadratic convergence near the minimum. However, it needs the Hessian matrix and its inverse, so the amount of calculation per iteration is large, and it requires a good initial point, which is difficult to construct. The Gauss-Newton method improves on Newton's method: by exploiting the least-squares structure of the problem, it avoids computing second-order partial derivatives.
The conjugate gradient method is an important local optimization algorithm. It has good convergence, high stability, and needs no additional parameters. It uses the gradient of the objective function to generate conjugate directions. Although it requires slightly more calculation than the steepest descent method, it overcomes the slow convergence of steepest descent. Newton's method needs not only first-order derivative information but also second-order derivative information, including calculating, storing, and inverting the Hessian matrix; the conjugate gradient method needs only first-order derivative information, and its convergence is comparable to that of Newton's method. The conjugate gradient method is therefore an effective algorithm for linear and nonlinear optimization. On this basis, this paper selects the nonlinear conjugate gradient method for adaptive wave equation inversion. The calculation formula is as follows:

$$g(0) = E_{m_0}, \qquad g(k) = E_{m_k} + \beta_k\, g(k - 1)$$

The search direction combines the negative gradient of the current model with the previously calculated conjugate gradient direction. Here $g(k)$ is the conjugate gradient direction at the kth iteration, $g(k - 1)$ is the conjugate gradient direction at the previous iteration, $E_{m_0}$ is the negative gradient calculated from the initial model, $E_{m_k}$ is the negative gradient calculated from the model $m_k$ at the kth iteration, and $\beta_k$ is the weighting coefficient. The flow chart of adaptive wave equation inversion is shown in Figure 3.
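The direction update can be illustrated on a small problem. The sketch below makes two assumptions that the text does not specify: the weighting coefficient is computed with the Fletcher-Reeves formula, and the objective is a simple quadratic, for which the step length has a closed form. In the actual inversion the gradient would come from the adjoint source and the step length from a line search.

```python
import numpy as np


def conjugate_gradient(Q, b, m0, n_iter):
    """Minimise 0.5 m^T Q m - b^T m using conjugate search directions."""
    m = np.asarray(m0, dtype=float)
    E = b - Q @ m                   # negative gradient at the initial model
    g = E.copy()                    # g(0) = E_m0
    for _ in range(n_iter):
        if E @ E < 1e-30:           # gradient has vanished, nothing to update
            break
        alpha = (E @ g) / (g @ Q @ g)       # exact step length for a quadratic
        m = m + alpha * g
        E_new = b - Q @ m                   # negative gradient at the new model
        beta = (E_new @ E_new) / (E @ E)    # Fletcher-Reeves weight (assumed)
        g = E_new + beta * g                # g(k) = E_mk + beta_k g(k-1)
        E = E_new
    return m


Q = np.array([[4.0, 1.0], [1.0, 3.0]])      # hypothetical positive definite matrix
b = np.array([1.0, 2.0])
m = conjugate_gradient(Q, b, np.zeros(2), n_iter=2)
```

For this two-parameter quadratic the method reaches the minimizer in two iterations, which illustrates the faster convergence compared with steepest descent while using only first-order derivative information.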

Experiment and Result Analysis
Table 1 lists the feature symbols and their composition. Two points about the experiments should be noted: (1) Each recognition rate reported for a noisy condition is the median of five repeated experiments. The "SNR=mixed" condition is formed as follows: if the music signal library contains l samples, a random function generates l random numbers with a mean of 25 and a standard deviation of 6, and noise is added to each original music signal at the SNR given by its random number, forming a music signal library with different SNRs. (2) The feature extraction method in this paper uses the Sliding-fastBSpline-EMD decomposition algorithm. Unless otherwise stated, the window length is 3 and the sliding overlap number is 2.
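The construction of the "SNR=mixed" library can be sketched as follows. The text does not state the distribution of the random SNR values, the type of noise, or the unit, so this sketch assumes normally distributed SNRs in dB and white Gaussian noise; the sine-wave library and all names are hypothetical stand-ins.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng):
    """Add white Gaussian noise scaled so that the result has the given SNR."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    noise_power = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [x + scale * n for x, n in zip(signal, noise)]


def snr_db(clean, noisy):
    """Measured SNR in dB of a noisy signal relative to its clean version."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


rng = random.Random(0)
# Hypothetical library of four clean signals.
library = [[math.sin(0.05 * n * (k + 1)) for n in range(2000)] for k in range(4)]
# One random SNR per signal, mean 25 and standard deviation 6.
snrs = [rng.gauss(25.0, 6.0) for _ in library]
noisy_library = [add_noise_at_snr(s, snr, rng) for s, snr in zip(library, snrs)]
```

Because the noise is rescaled using its measured power, each noisy signal has exactly the SNR drawn for it, so the library mixes SNR levels in the way the procedure describes.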
Experiment on Music Signal Library 1.
Music signal library 1 consists of different vowels spoken by the same person, so it tests only how well the features characterize and distinguish different vowels. The classification is therefore more accurate, and the recognition rate of each group of features is higher. The comparison of same-dimensional features in Figure 4 shows that the recognition rate of wave equation inversion is higher than that of the commonly used features LPCC, MFCC, and WPTSBCC under several noise levels. Moreover, the lower the signal-to-noise ratio, the larger the advantage of wave equation inversion over the three contrast features. This shows that wave equation inversion not only distinguishes different vowels better than these three features but also has better antinoise performance under these conditions.
In Figure 5, the wave equation inversion has a higher recognition rate than the other three methods in general. At the same time, it can be found that the difference between their recognition rates can reach up to 9.5 percentage points.

This fully reflects that the wave equation inversion has a strong characterization ability in the combined features.
The results in Figure 6 show that wave equation inversion takes the least time, about 0.2 ms on average, which meets the real-time requirements of the system. WPTSBCC takes the most time, about 0.6 ms, which is three times that of wave equation inversion.

Experiment on Music Signal Library 2.
It can be seen from the results in Figure 7 that the recognition rate of wave equation inversion is higher than that of the three comparison features, and its advantage is more obvious in the presence of noise. This shows that in these two cases the HMS of the signal provides a spectrum that reflects the true frequency-energy distribution of the signal better than the Fourier spectrum and the wavelet coefficient energy spectrum do, and that wave equation inversion has better characterization capabilities. Except for LPCC, the other three features are all based on a frequency spectrum. Since music signal library 2 contains six different vowels pronounced by different people, and pronunciation varies from person to person, this diversity strongly affects the recognition of the six vowels. The difference in recognition rate reflects the different effects of this diversity and of noise on the three spectra.

Figure 3: Flow chart of adaptive wave equation inversion.

Table 1
Characteristic symbol | Feature composition
WPTSBCC | The first 12-order WPTSBCC and its first-order and second-order difference combination
HMS-MFCC | 24th order HMS-MFCC
LPCC | The first 12-order LPCC and its first-order and second-order difference combination
EWCF | Instantaneous energy-weighted center frequency of all IMF components
MFCC | The first 12-order MFCC and its first-order and second-order difference combination
The recognition rate for different feature vector dimensions on music signal library 2 is shown in Figure 8, and the corresponding recognition time is shown in Figure 9. Because EWCF is greatly affected by noise, the recognition results reported for each of the two music signal libraries list EWCF for the high signal-to-noise ratio case, in which it has characterization ability. It can be found that the recognition rates of the features extracted by the wave equation inversion are generally higher than those of the features extracted with the standard decomposition algorithm. This shows that the wave equation inversion provides clearer and more realistic signals than the standard decomposition algorithm.
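To make the EWCF feature more concrete, the following Python sketch computes one energy-weighted center frequency per component. It is a sketch under stated assumptions, not the procedure used in the experiments: it assumes the IMF components have already been obtained by a decomposition step, that instantaneous amplitude and frequency come from the analytic signal, and that the weighting is the ratio sum(a^2 f) / sum(a^2). The function names are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal of a real sequence via the FFT (Hilbert transform)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    spectrum = np.fft.fft(x)
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # and zero out negative frequencies.
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def energy_weighted_center_frequency(component, fs):
    """Assumed EWCF definition: sum(a^2 * f) / sum(a^2) for one component."""
    z = analytic_signal(component)
    amplitude = np.abs(z)
    phase = np.unwrap(np.angle(z))
    # Instantaneous frequency in Hz from the phase derivative.
    inst_freq = np.diff(phase) * fs / (2.0 * np.pi)
    # Instantaneous energy, aligned with the n - 1 frequency samples.
    energy = amplitude[:-1] ** 2
    return float(np.sum(energy * inst_freq) / np.sum(energy))

def ewcf_features(components, fs):
    """One EWCF value per component (hypothetical feature layout)."""
    return np.array([energy_weighted_center_frequency(c, fs)
                     for c in components])
```

Because each component contributes a single value, the resulting feature vector is short, which is consistent with the compressed feature dimension reported for EWCF. Additive noise perturbs the instantaneous frequency directly, which is one plausible reason for the noise sensitivity noted above.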

Conclusion
In this paper, research on the adaptive time-domain wave equation inversion method is carried out. We introduced the concept of inversion and the principle of full-wave equation inversion. For full-wave equation inversion in the time domain, the objective function is given and the calculation formula of the gradient is derived. The principle of adaptive wave equation inversion is introduced in detail: two objective functions are introduced, the calculation formulas of the accompanying (adjoint) source and the gradient are discussed, and the solution method of adaptive wave equation inversion is introduced. Through the analysis of the principle of feature extraction in common music signal recognition, the effective mechanism of integrating HHT into the feature extraction process is studied, and the feature extraction framework of this article is established. Based on the instantaneous frequency and instantaneous energy of HMS and IMF, respectively, two sets of features, HMS-MFCC and EWCF, are extracted. The experimental results on music signal libraries 1 and 2 show that HMS-MFCC has strong characterization capabilities and that, in most cases, wave equation inversion has a higher recognition rate than LPCC, MFCC, and WPTSBCC. EWCF is greatly affected by noise, but it has a high recognition rate at high signal-to-noise ratio, and its feature dimension is greatly compressed, which helps reduce the complexity of the recognition system. However, the research and experiments in this article are all based on the recognition of nonspecific music signals with a small vocabulary and isolated words. Human language is generally continuous, has a large vocabulary, and is subject to relatively strong noise interference from the background environment. Moreover, the music signal contains various other characteristics such as phoneme and timbre.
Because our research on music signal recognition technology has been under way for only a limited time, we conducted in-depth research only on the feature parameter extraction algorithm of the music signal and the matching model of the music signal recognition system; the research on other aspects of music signal recognition technology remains deficient. A music signal is a complex signal that contains many characteristics. Integrating these important characteristics and applying them to music signal recognition technology is another important direction for follow-up research.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.