Music Genre Classification Based on Deep Learning

Human musical life can be traced back to ancient times. The music of human society is rich and varied, which makes it difficult to classify music efficiently and accurately, and music classification has become a daunting task. On this basis, this paper studies deep learning methods for music classification. The classification structure of the music signal channel is designed, and fully connected neural networks associated with music are investigated in order to design an appropriate network model. According to different music sequence measurements, a feature sequence mechanism with feedback optimization is also investigated. The type probabilities of the different tracks are computed by the softmax activation function, and the value of the cross-entropy loss is obtained. Finally, the Adam optimization algorithm is used to optimize the proposed network model. Adam maintains an independent adaptive learning rate for each network parameter, computed from first- and second-order moment estimates of the gradient. The experimental results show that the proposed method effectively increases the accuracy of music classification and is helpful for music channel classification. Moreover, we observed that the number of neurons in the network has a significant impact on the training and testing errors.


Introduction
The creation and performance of early popular music were mostly commercial and took place in cities and towns, which distinguished it from folk music with its strong rural color. At the same time, it does not have the standardization and stability of art music [1]. In the early days, such music was in many cases transmitted only orally. Therefore, some people say that popular music is different from both art music and folk music. The term, in fact, generally refers to a kind of music that is easy to understand, relaxed and lively, easy to spread, and has a large audience. Some people say that some particular music is "popular music" [2]. Music genre is an important label for describing music. Music tags play an important part in locating and separating digital music resources [3]. However, given the huge amount of musical data, identifying and classifying music has become increasingly daunting. With an enormous music catalogue, relying on manual annotation for classification consumes significant costs, resources, and time. Moreover, we believe that manual annotation will not be able to meet the needs of the current era, which is shaped by big data, the Internet of Things, and people's increasing interest in music. Therefore, music classification has gradually become a research hotspot.
At present, scholars in related fields have carried out theoretical research on the classification of music themes. For example, the authors in [4] proposed an engine system for classifying genres, which aims to replace these features by a new model. The model can also recommend music from vocal music that has been extracted from online music. Their experimental results show that this method not only has a certain efficiency but can also effectively modulate speech pitch and construct separation masking based on neural recursion. In this way, voice signals mixed with music can be screened and deleted. The music pitch classification method based on the RNN model can improve the time trajectory of speech and music pitch values. Moreover, it can determine whether an unknown continuous pitch sequence belongs to speech or music. This method has significant classification performance without losing speech noise separation performance. Nevertheless, the previously mentioned approaches still have some drawbacks, such as low classification precision, poor effect, and lengthy computation time.
In order to solve the above problems, a classification method for music genres based on deep learning is proposed in this paper. First, data preprocessing is used to filter the music signals. Furthermore, using a fully connected neural network structure, the extraction of music genre features is completed. Finally, the attention mechanism is used to design a music genre classification network model. The music genre classification effect of the proposed method is better than those of other approaches, and it effectively improves the classification accuracy of the music genre. Moreover, our approach shortens the classification time significantly. The main contributions are as follows:
(i) We study the classification of the design structure of the music signal channel, and the connected neural network associated with music is designed.
(ii) According to different music sequence measurements, the feature sequence mechanism of music design feedback optimization is studied.
(iii) The type probabilities of the different tracks are computed by the softmax function, and the value of the cross-entropy loss is obtained.
(iv) Finally, the Adam optimization algorithm is used as the optimization algorithm of the network model, and an independent adaptive learning rate is designed.
The remainder of the paper is organized as follows. In Section 2, we briefly discuss the basic theory of deep learning, including neural networks, back-propagation (BP) networks, and activation functions. In Section 3, the fundamentals of music signal analysis are illustrated. In Section 4, we discuss the classification of music genres and feature extraction and propose a neural network model. Experimental discussion and results are presented in Section 5. Finally, Section 6 summarizes the paper and presents directions for future research.

Basic Theory of Deep Learning
Deep learning is a branch of machine learning that deals with learning algorithms using deep neural networks. Deep learning methods are developed from artificial neural networks (ANNs), which are among the most commonly used and representative model structures in the field of machine learning. A deep neural network (DNN) is a neural network formed from the interconnection of neurons and weights, and it may have many hidden layers and neurons [5]. Deep learning can learn higher-level feature expressions from complex and large samples.

Neural Networks.
Deep learning is developed from artificial neural networks, which are in turn abstracted from the structure of biological neural networks. In the network, information is transmitted and activated through the interconnection between basic units, known as neurons, which imitates the process of information transmission between biological neurons [6]. Several neurons are connected with each other in such a way that communication occurs among them [7]. The basic structure of the neuron is shown in Figure 1.
In Figure 1, $x_i$ is an input signal, and the arrow starting from each input signal represents a connection. Each connection corresponds to a particular weight $w_i$. After the input signals pass through these connections, they are weighted and summed to obtain $a$ (the usual output of a hidden neuron before activation). Finally, $a$ goes through a nonlinear function $h$ to produce the output $o$. The nonlinear function $h$ is called the activation function and is used to tune the performance of the network [8]. The process from neuron input to output can be described mathematically as follows:

$$a = \sum_{i} w_i x_i + b, \qquad o = h(a) \tag{1}$$

In formula (1), $b$ is the bias term of the neuron. Multiple neurons with the same inputs form a hidden layer. The output of one layer of neurons is used as the input of the next layer of neurons, and the basic neural network is formed according to this connection method. The input of a neuron can come from either the input signal or the output of other neurons [9]. The structure of the fully connected neural network is shown in Figure 2. From bottom to top, as shown in Figure 2, the input layer takes the inputs, which pass through several neuron layers, and the output layer creates the output. The network structure in Figure 2 has only one hidden layer, and this type of neural network is also called a single hidden layer feedforward neural network. In deep learning, multiple hidden layers can also be set, and each hidden layer is given a different number of neurons according to the actual situation to improve the learning capability. The connection weight matrix between each layer and the previous layer is multiplied by the output values of the neurons of the previous layer, and the bias term of this layer is added to obtain a linear output. The linear output then passes through the activation function of this layer, which performs a nonlinear transformation to give the output of this layer of neurons [10].
The process by which the neurons in each layer go from receiving input to calculating output can be described as follows:

$$z^{l} = W^{l} a^{l-1} + b^{l}, \qquad a^{l} = f^{l}(z^{l}) \tag{2}$$

In formula (2), $z^{l}$ is the linear output vector of the neurons in layer $l$, which is calculated from the output vector $a^{l-1}$ of the neurons in layer $l-1$, the connection weight matrix $W^{l}$ of layer $l$, and the bias term $b^{l}$ of layer $l$. Furthermore, $a^{l}$ is the nonlinear output vector of the layer $l$ neurons, obtained by passing the linear output $z^{l}$ through the activation function $f^{l}(\cdot)$ of layer $l$.
Let us again refer to the basic architecture of the neural network shown in Figure 2. Starting from the input layer and moving along the direction from input to output, a series of linear and activation operations is carried out on the input vector using the connection weight matrix and bias term of each layer [11]. The outputs are calculated layer by layer until the target prediction result is obtained at the output layer. This process is called forward propagation.
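The layer-by-layer computation described above can be sketched in a few lines of plain Python. This is only an illustration of forward propagation under assumed values: the weights, biases, layer sizes, and the helper names `dense_forward` and `forward` are hypothetical and are not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(a_prev, W, b, f):
    """One layer: z = W a_prev + b (linear output), a = f(z) (activation)."""
    z = [sum(w * a for w, a in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    return z, [f(z_i) for z_i in z]

def forward(x, layers):
    """Propagate x through a list of (W, b, f) layers, input to output."""
    a = x
    for W, b, f in layers:
        _, a = dense_forward(a, W, b, f)
    return a

# Hypothetical network: 2 inputs -> 2 hidden neurons (sigmoid) -> 1 linear output.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0], sigmoid),
    ([[1.0, -1.0]], [0.0], lambda z: z),
]
y = forward([1.0, 1.0], layers)
```

Each tuple in `layers` holds the weight matrix, bias vector, and activation function of one layer, so adding a hidden layer only means appending another tuple.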

Back-Propagation (BP) Algorithm.
The input layer, hidden layer, and output layer are the three components that make up the front, middle, and end of the BP neural network. The input vector is $x = (x_1, x_2, \ldots, x_i, \ldots, x_n)^T$, and it is assumed that $x_0 = -1$. The hidden layer lies in the middle of the neural network, and it slows down training. The output vector is the result of the generated data, $y = (y_1, y_2, \ldots, y_i, \ldots, y_n)^T$, and $y_0 = -1$ can be provided as an additional assumption. The BP neural network is the result of combining the back-propagation algorithm with the neural network, which is currently one of the most cutting-edge fields. The topology of the BP neural network is shown in Figure 3. This research employs the modified BP neural network model to evaluate music classification, which can eliminate the instability and slow convergence of the classic model and can improve the accuracy of the evaluation findings [12]. The topological structure of the BP neural network model is shown in Figure 4.
In back-propagation, we first calculate the error of the output layer according to the error loss function, then transfer it layer by layer to the middle layers and update the parameters of each layer [13,14]. Through continuous iteration, the error calculated by the loss function is minimized and the parameters converge. The back-propagation algorithm adopts the gradient descent method, as given in equation (3), to update the parameters:

$$w_{ij}^{l} \leftarrow w_{ij}^{l} - \eta \nabla w_{ij}^{l}, \qquad b_{i}^{l} \leftarrow b_{i}^{l} - \eta \nabla b_{i}^{l} \tag{3}$$

In formula (3), $\eta$ is the learning rate, and $\nabla w_{ij}^{l}$ and $\nabla b_{i}^{l}$ are the gradients of the error loss function with respect to the connection weight $w_{ij}^{l}$ and the bias term $b_{i}^{l}$, respectively. It can be seen that the key of the back-propagation algorithm is to find the gradient of the error loss function with respect to the parameters [15]. The calculation process is given in the following steps.
Step 1: Calculate the loss error according to the target prediction and expected output of the output layer using the following equation:

$$L = c(a^{N}, y) \tag{4}$$

In formula (4), $L$ is the loss error, $a^{N}$ is the target prediction vector of the output layer, $y$ is the target expectation vector, and the function $c(\cdot)$ denotes the loss function.

Mobile Information Systems
Step 2: Calculate the error term $\delta^{l}$ of layer $l$ in the network according to the error loss $L$ using the following equation:

$$\delta^{l} = \frac{\partial L}{\partial z^{l}} \tag{5}$$

Step 3: Calculate the error term of neuron $i$ in layer $l$ according to the chain rule, as given in the following equation:

$$\delta_{i}^{l} = \Big(\sum_{j} w_{ji}^{l+1}\, \delta_{j}^{l+1}\Big)\, f^{l\prime}(z_{i}^{l}) \tag{6}$$

It can be seen from formula (6) that the error term of layer $l$ is affected by the error term of layer $l+1$. In other words, the error of the network propagates in the opposite direction, layer by layer, through the back-propagation algorithm.
Step 4: Calculate the gradients of the connection weights and the bias term of each layer according to the error term using the following equation:

$$\nabla w_{ij}^{l} = \delta_{i}^{l}\, a_{j}^{l-1}, \qquad \nabla b_{i}^{l} = \delta_{i}^{l} \tag{7}$$

As can be seen from formula (7), the gradient of the current layer connection weight $w_{ij}^{l}$ depends on the error term of the current layer neuron and the output of the previous layer neuron, while the gradient of the current layer bias term $b_{i}^{l}$ depends only on the error term of the current layer neuron. By substituting the above results into formula (3), the parameter update of each round of the training process can be completed.
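The four steps above can be traced on a very small network. The sketch below is an illustration under stated assumptions rather than the paper's model: it uses one sigmoid hidden layer, a linear output layer, a squared-error loss, and hypothetical weights, inputs, and learning rate (`eta=0.1`).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def train_step(x, y, W1, b1, W2, b2, eta=0.1):
    """One forward pass plus one gradient-descent update (parameters change in place)."""
    # Forward propagation.
    z1 = [z + b for z, b in zip(matvec(W1, x), b1)]
    a1 = [sigmoid(z) for z in z1]
    a2 = [z + b for z, b in zip(matvec(W2, a1), b2)]  # linear output layer
    # Step 1: loss error (squared error is an assumption of this sketch).
    loss = 0.5 * sum((a - t) ** 2 for a, t in zip(a2, y))
    # Step 2: error term of the output layer, dL/dz for a linear output.
    d2 = [a - t for a, t in zip(a2, y)]
    # Step 3: error term of the hidden layer via the chain rule.
    d1 = [sum(W2[j][i] * d2[j] for j in range(len(d2))) * a1[i] * (1.0 - a1[i])
          for i in range(len(a1))]
    # Step 4: gradients (error term times previous-layer output), then update.
    for i in range(len(W2)):
        for j in range(len(a1)):
            W2[i][j] -= eta * d2[i] * a1[j]
        b2[i] -= eta * d2[i]
    for i in range(len(W1)):
        for j in range(len(x)):
            W1[i][j] -= eta * d1[i] * x[j]
        b1[i] -= eta * d1[i]
    return loss

# Hypothetical parameters and a single training sample.
W1, b1 = [[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0]
W2, b2 = [[0.5, -0.5]], [0.0]
losses = [train_step([1.0, 0.5], [1.0], W1, b1, W2, b2) for _ in range(50)]
```

The hidden-layer error term is computed before the output weights are updated, so both layers use gradients from the same forward pass.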

Activation Functions.
The activation function introduces nonlinearity, turning the neural network into a nonlinear model and giving it the ability to solve linearly inseparable problems [16]. Various activation functions are available for neural networks, and one function can be replaced with another in order to improve the accuracy of the model. Well-known and widely used activation functions include the tanh function, the ReLU (Rectified Linear Unit) function, the sigmoid function, and the softmax function. Among these, the softmax function is often used in classification tasks [12,17]. An appropriate activation function is selected according to the needs of the task and the characteristics of the network layer. The three activation function images are illustrated in Figure 3.
In the next discussion, we offer a brief description and mathematical model of each activation function. In the later sections, we will demonstrate that these functions have impacts on the network accuracy and prediction outcomes.
(1) tanh: the tanh function is the hyperbolic tangent function, which maps variables to values in the range (−1, 1). However, the tanh function has the problem of gradient saturation; that is, the derivative of the function at both ends is almost zero. This easily causes vanishing gradients during back-propagation, which makes the training of the network model very slow or difficult to converge. The function's mathematical expression is given in the following equation:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{8}$$

(2) Sigmoid: the sigmoid function image is similar to that of the tanh function, and vanishing gradients are also prone to occur. The function's mathematical expression is given in the following equation:

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{9}$$

(3) ReLU: the ReLU function is a linear rectification function and a nonsaturating activation function, which alleviates the vanishing gradient caused by the derivative tending to zero. The ReLU function truncates negative values to 0. Its derivative is easier to compute, and it can speed up the convergence of the network model [18]. The mathematical expression of the ReLU function is given by the following equation:

$$\operatorname{ReLU}(x) = \max(0, x) \tag{10}$$

(4) Softmax: the softmax function is generally used in the output layer of the neural network to complete the classification task. In multiclass classification, the softmax function takes the original outputs, calculates new outputs, and maps them to the range (0, 1) so that they sum to 1. In this way, the output of the neural network becomes a probability distribution over the target labels. The function's mathematical expression is given in the following equation:

$$\operatorname{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \tag{11}$$

Figure 4: Topological structure of BP neural network model.
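The four activation functions above can be written directly with the Python standard library. This is a minimal sketch; the example input values are arbitrary, and subtracting the maximum inside `softmax` is a common numerical-stability choice that does not change the result and is not something the paper specifies.

```python
import math

def tanh(x):
    return math.tanh(x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Negative inputs are truncated to zero.
    return max(0.0, x)

def softmax(xs):
    m = max(xs)  # shift by the maximum to avoid overflow in exp
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Arbitrary raw outputs turned into a probability distribution.
probs = softmax([2.0, 1.0, 0.1])
```

The largest raw output receives the largest probability, and the probabilities sum to 1, which is what allows the output layer to be read as a distribution over genres.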

Overview of Music Genres.
Music has developed alongside the evolution of human beings. Under the influence of different periods, regions, nationalities, and cultures, it has gradually formed distinctive characteristics in musical thought, creative principles, artistic personality, and means of expression and techniques, and music types with different styles have appeared. These types can be called music genres. Popular music genres include classical, jazz, blues, hip-hop, rock, country, pop, and metal [19,20]. There is no strict standard for the classification of music genres, which is subjective. Music works of the same genre have similar artistic styles.

Music Features.
The features of a music genre can be divided into three types: (i) time domain characteristics, (ii) frequency domain characteristics, and (iii) cepstrum domain characteristics. Time domain features can be extracted directly from the waveform of the original signal. The processing is simple and requires little mathematical calculation, and these features are widely used in research on music classification tasks [20,21]. The two common time domain features are described in detail below:
(1) Short-time energy: Short-time energy is the sum of energy in a small window, reflecting the amplitude variation of the music signal over a period of time. It is generally used to detect silence in a piece of music, carry out endpoint detection, and identify the beginning, transition, or end of a music signal [22]. The calculation formula for the short-time energy is given by the following equation:

$$E_n = \sum_{k=-\infty}^{\infty} \left[x(k)\,\omega(n-k)\right]^2 \tag{12}$$

In formula (12), $x(k)$ is the music signal and $\omega(n-k)$ represents the window function. The more popular window functions used to calculate short-time energy include the rectangular window and an improved raised cosine window, the Hamming window [23]. The calculation formula for the window function is given by equation (13), in which $N$ represents the length of the window.
(2) Short-time zero crossing rate: If adjacent signal samples carry opposite algebraic signs, a zero crossing is considered to have occurred. The zero crossing rate directly reflects the amount of high-frequency content in the music signal. The short-time zero crossing rate is commonly used to detect silent frames in time domain analysis of speech. The calculation method of this feature is given by the following equation:

$$Z_n = \frac{1}{2}\sum_{m} \left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right| \tag{14}$$

In formula (14), $x_n(m)$ represents a discrete speech signal, and $\operatorname{sgn}[\cdot]$ is the sign function. The definition of the sign function is given by the following equation:

$$\operatorname{sgn}[x] = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases} \tag{15}$$

Frequency Domain Characteristics.
(1) Spectrum centroid (SC): The spectrum centroid is a commonly used measure. Its value indicates where the frequency content of the music signal is concentrated. The larger the value, the more high-frequency components the signal contains, and vice versa. The calculation formula is given by equation (16).
(2) Spectrum energy (SE): This frequency domain feature is used to characterize the frequency domain energy of a frame of the music signal. The calculation formula for the spectrum energy is given by equation (17).
(3) Spectrum flux (SF): The spectrum flux is a dynamic feature that represents how the spectrum of the music signal changes. It is the sum of the squares of the differences between all adjacent frames of a discrete frequency domain music signal. The calculation formula is given by equation (18).
In the three formulas above, $F(\omega)$ represents the Fourier transform of each frame of the signal. Furthermore, $l_0$ and $h_0$ represent the minimum frequency and maximum frequency of a piece of music in the frequency domain signal, respectively.
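As an illustration of the two time domain features described in this section, the sketch below computes the short-time energy and the zero crossing count of a single frame. It assumes the frame has already been cut from the signal, uses the standard Hamming window as one possible window function, and the function names are hypothetical.

```python
import math

def hamming(N):
    """Standard Hamming window of length N (assumed here as the window function)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def short_time_energy(frame, window):
    """Sum of squared, windowed samples of one frame."""
    return sum((x * w) ** 2 for x, w in zip(frame, window))

def sgn(x):
    return 1 if x >= 0 else -1

def zero_crossings(frame):
    """Half the summed sign differences, i.e. the number of sign changes in the frame."""
    return 0.5 * sum(abs(sgn(frame[m]) - sgn(frame[m - 1]))
                     for m in range(1, len(frame)))

# A 64-sample frame of a pure tone (hypothetical test signal).
frame = [math.sin(2.0 * math.pi * 4.0 * n / 64.0) for n in range(64)]
energy = short_time_energy(frame, hamming(len(frame)))
crossings = zero_crossings(frame)
```

Dividing `crossings` by the frame length gives a rate rather than a count, which makes frames of different lengths comparable.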

Cepstrum Domain Characteristics.
The music signal is transformed into the frequency domain through the Fourier transform, and the frequency domain characteristics are obtained through mathematical calculation and analysis, as discussed in the previous sections. Then, the logarithm of the music spectrum is taken and the inverse Fourier transform is performed. The audio signal in the frequency domain is thereby converted to the cepstrum domain, from which the cepstrum domain characteristics are obtained [24,25]. The most common cepstrum domain features and related formulas are listed below:
(1) Mel frequency cepstral coefficient (MFCC): This is one of the most commonly used cepstral domain features, and it represents audio signals well. The Mel scale converts the nonlinear relationship between perceived pitch and physical frequency into a linear one. The MFCC is calculated through preemphasis, framing, windowing, the fast Fourier transform, and taking the absolute value or the squared value. The result is passed through the triangular band-pass Mel frequency filter bank, the logarithm of the output energy of each filter is taken, and a discrete cosine transform (DCT) is performed to obtain the Mel frequency cepstrum coefficients [26]. The relationship between the Mel frequency, represented by $\operatorname{mel}(f)$, and the linear frequency, represented by $f$, is given by the following equation:

$$\operatorname{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{19}$$

(2) Linear prediction cepstrum: Combining the two principles of linear prediction and the cepstrum, the all-pole model function is defined in equation (20) [27], in which $a_k$ and $p$ represent the prediction coefficients and the prediction order, respectively. Assuming that $h(n)$ represents the impulse response of the original music signal without preprocessing and $H(z)$ represents the system function, the cepstrum is obtained by first calculating the logarithm of $H(z)$ and then performing the inverse transformation, as given by equation (21).
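The mapping between linear frequency and Mel frequency can be sketched as a pair of conversion functions. The constants 2595 and 700 are those of the commonly used form of the Mel scale; the function names are hypothetical.

```python
import math

def hz_to_mel(f):
    """Linear frequency in Hz to Mel frequency."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, Mel frequency back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal steps in Hz shrink on the Mel scale as frequency grows.
low_step = hz_to_mel(1000.0) - hz_to_mel(500.0)
high_step = hz_to_mel(8000.0) - hz_to_mel(7500.0)
```

The same 500 Hz step covers far fewer Mels at high frequencies than at low ones, which is why Mel filter banks use narrow filters at low frequencies and wide filters at high frequencies.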

Classification of Music Genres
In the proposed deep learning-based music genre classification method, the music genre characteristics are extracted after preprocessing the music signals. The music genre classification neural network model is then designed according to the fully connected neural network structure. Based on the characteristic sequence of the input music genre, the attention mechanism is studied, and the classification network of this article is designed using the attention mechanism to realize the classification of music genres.

Music Signal Preprocessing.
Preprocessing the music signal is a very important stage in the music genre classification method. Preprocessing makes the subsequently extracted features more effective. Moreover, less useful signal content and noise can be removed to improve the prediction accuracy. The following steps were carried out to preprocess the music signals.
(1) Preemphasis: Preemphasis is introduced in order to improve the high-frequency resolution of the music signal [28] and to allow spectrum analysis over the entire frequency band. Preemphasis is generally achieved through a first-order digital filter before the feature parameter extraction. The transfer function of the filter is given by the following equation:

$$H(z) = 1 - a z^{-1} \tag{22}$$

In formula (22), the parameter $a$ denotes the preemphasis factor, which is generally a decimal value close to 1. If the sample value of the music genre signal at time $n$ is $x(n)$, then the result $\hat{x}(n)$ after the preemphasis phase is given by the following equation:

$$\hat{x}(n) = x(n) - a\,x(n-1) \tag{23}$$

(2) Framing: In order to transition smoothly between two frames of the signal and to ensure that information is not lost, adjacent frames need to overlap by 1/3 to 1/2 of the frame length. This overlapping fragment is called the frame shift. The theoretical number of frames of a music signal segment is computed by equation (24), in which $N_1$ is the entire length of the music signal, $N_2$ is the length of the frame, $N$ is the total number of frames, and $N_0$ is the frame shift.
(3) Windowing: After framing all music genre segments, windowing is applied to the framed music signal in order to increase the continuity between frames, reduce edge effects, and reduce spectrum leakage. The commonly used window functions in audio signal processing include (i) the Hamming window, (ii) the rectangular window, and (iii) the Hanning window.
The three window functions are defined as follows, where $n = 0, 1, \ldots, N_2 - 1$ and $N_2$ is the frame length:

$$\omega_{\text{Hamming}}(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N_2 - 1}\right), \qquad \omega_{\text{rect}}(n) = 1, \qquad \omega_{\text{Hanning}}(n) = 0.5\left[1 - \cos\left(\frac{2\pi n}{N_2 - 1}\right)\right] \tag{25}$$

These three window functions all have low-pass characteristics, and their performance is mainly determined by the attenuation of the first side lobe and the width of the main lobe. Since the boundary of the Hamming window is smooth, its first side lobe attenuation is the strongest, which effectively avoids spectrum leakage [29]. Consequently, this paper selects the Hamming window as the window function.
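The three preprocessing steps can be chained as in the sketch below. The preemphasis factor 0.97, the frame length, and the hop size (here half a frame, so adjacent frames overlap by 1/2) are illustrative assumptions and not values reported in the paper; the function names are hypothetical.

```python
import math

def preemphasis(x, a=0.97):
    """First-order high-pass filter: out(n) = x(n) - a * x(n - 1)."""
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

def frame_signal(x, frame_len, hop):
    """Cut x into overlapping frames; trailing samples that do not fill a frame are dropped."""
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]

def hamming(N):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]

def preprocess(x, frame_len=8, hop=4, a=0.97):
    """Preemphasis, framing, then Hamming windowing of every frame."""
    w = hamming(frame_len)
    return [[s * w_n for s, w_n in zip(frame, w)]
            for frame in frame_signal(preemphasis(x, a), frame_len, hop)]

# A 16-sample ramp as a stand-in for a music signal.
frames = preprocess([float(i) for i in range(16)])
```

With 16 samples, a frame length of 8, and a hop of 4, the frames start at samples 0, 4, and 8, so three frames are produced.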

Music Feature Extraction.
After preprocessing the signal of each music genre, the characteristic of the music genre, namely, the MFCC, is extracted. The specific steps for extracting the MFCC characteristic parameters of music genre signals are as follows:
(1) Perform the FFT on every frame of the preprocessed music genre signal to obtain its frequency spectrum.
(2) Take the squared modulus of the spectrum computed in the previous step to obtain the discrete power spectrum, denoted by $|X(k)|^2$, of every music signal.
(3) Pass the power spectrum $|X(k)|^2$ through a set of Mel filters, as given by equation (26).
(4) Finally, calculate the natural logarithm to obtain the MFCC parameters of every music genre signal, as given by equation (27).
The frequency range of a music signal extends from a few hertz to several kilohertz, and it changes relatively slowly. Therefore, the MFCC parameters extracted from each frame of the music genre signal in this paper are 12-dimensional.
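The four extraction steps above can be sketched for a single frame as follows. This is a rough illustration under stated assumptions: a naive DFT stands in for the FFT, the triangular filter bank layout, the 22050 Hz sample rate, and all function names are assumptions of the sketch, and the DCT that is commonly applied after the logarithm is left out because it is not among the four steps listed.

```python
import cmath
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Steps 1-2: squared modulus of the DFT (naive O(N^2) version for clarity)."""
    N = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N)
                    for n in range(N))) ** 2
            for k in range(N // 2 + 1)]

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale."""
    mel_max = hz_to_mel(sample_rate / 2.0)
    hz = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * h / sample_rate)) for h in hz]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):      # rising edge
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):      # falling edge
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank

def log_mel_energies(frame, n_filters=12, sample_rate=22050):
    """Steps 3-4: Mel filtering of the power spectrum, then the natural logarithm."""
    p = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # A small constant guards against log(0) for empty filters.
    return [math.log(sum(f * v for f, v in zip(filt, p)) + 1e-10) for filt in bank]

# One 256-sample frame of a 440 Hz tone as a stand-in for a music frame.
frame = [math.sin(2.0 * math.pi * 440.0 * n / 22050.0) for n in range(256)]
features = log_mel_energies(frame)
```

Using 12 filters yields one 12-dimensional feature vector per frame, matching the dimensionality stated above.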

Design of Network Model for Music Genre Classification.
The neural network learning process is shown in Figure 5(a). According to the neural network structure, the design of the music classification model is shown in Figure 5(b) [15,16]. The input layer processes the music signal through preemphasis, framing, and windowing to extract music genre features. The network then learns from the music genre feature sequence extracted by the input layer, and the influence on the state at the current time is calculated from the future and the past, respectively. The feature representation $H = (H_1, H_2, \ldots, H_L)$ is obtained and combined with the context semantic information, and it is input into the attention mechanism network. The attention mechanism network learns the input feature representation $H$ and obtains the corresponding attention probability distribution [14]. Subsequently, it multiplies each attention probability by its corresponding feature vector and finally obtains the music genre feature vector representation $v$. The attention process first computes an attention score $e_t$ for the feature vector $H_t$ at time $t$ in the feature representation $H$. The softmax activation function is then applied to the scores, as given by equation (28), and the resulting attention probabilities $\alpha_t$ are used to compute $v$:

$$\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{L}\exp(e_k)}, \qquad v = \sum_{t=1}^{L} \alpha_t H_t$$

The output layer of the network model is defined by calculating the cross-entropy loss function:

$$C = -\frac{1}{n}\sum_{x}\left[y \ln a + (1 - y)\ln(1 - a)\right]$$

In the above formula, $C$ is the loss, $n$ is the number of samples, $x$ is the input sample, and $y$ and $a$ are the target expected value and the output predicted value, respectively, of the network model for input $x$. The genre label is then computed from the network output, and the classification of music genres is realized through the steps described above.
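The attention pooling and the loss can be illustrated with a short sketch. The form of the attention score is an assumption here: each score is taken as the dot product of a feature vector with a hypothetical learned vector `u`. The loss function shown is the multiclass form of cross-entropy for a softmax output, and the feature values are made up.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(H, u):
    """Return attention probabilities and the weighted sum of the feature vectors."""
    # Assumed scoring function: dot product of each feature vector with u.
    scores = [sum(h_d * u_d for h_d, u_d in zip(h, u)) for h in H]
    alpha = softmax(scores)
    dim = len(H[0])
    v = [sum(alpha[t] * H[t][d] for t in range(len(H))) for d in range(dim)]
    return alpha, v

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])

# Three time steps of 2-dimensional features (hypothetical values).
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, v = attention_pool(H, u=[0.5, 0.5])
loss = cross_entropy(softmax([2.0, 0.5, 0.1]), target_index=0)
```

The third time step has the largest score, so it receives the largest attention probability and contributes most to `v`.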

Experimental Environment and Datasets.
In order to verify the effectiveness of the music genre classification method based on deep learning, MATLAB 2016a was used to extract the features of the music signals. We built a fully connected neural network with the Theano library in Python. Model training uses the Adam optimization method as the gradient descent optimization algorithm. The learning rate is set to 0.001, and training runs for 200 rounds. All experiments are carried out and verified on the GTZAN dataset, which contains a total of 1000 audio files. These 1000 files cover 10 genres of music, with 100 samples per genre. The experiments were carried out several times, and the reported results are averaged over multiple runs. In the experiments, random sampling without replacement is adopted, and 80% of the samples of each music genre are selected as the training set. The distribution of the number of music genres in each category of the training set and validation set is shown in Table 1.
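For reference, a single Adam update can be sketched as below. The learning rate 0.001 matches the setting above; the decay rates `beta1`, `beta2` and the constant `eps` are the defaults commonly used with Adam and are assumptions here, as is the toy objective being minimized.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update params in place from grads; m and v hold the running moment estimates."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g        # first-moment estimate
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m[i] / (1.0 - beta1 ** t)              # bias correction
        v_hat = v[i] / (1.0 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

# Toy objective (w - 3)^2 with gradient 2 (w - 3); 200 rounds as in the setup above.
w, m, v = [0.0], [0.0], [0.0]
for t in range(1, 201):
    adam_step(w, [2.0 * (w[0] - 3.0)], m, v, t)
```

Because the step is normalized by the square root of the second-moment estimate, each parameter effectively gets its own adaptive step size, which is close to `lr` in magnitude during the first iterations.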

Classification Evaluation Index.
We performed the music genre classification experiments on music files of five different genres: rock, metal, country, classical, and blues. This is a multiclass classification task, and the categories are relatively balanced. The overall accuracy over the sample population is expressed as follows: In the above formula, M(i, j) is the number of samples in the population.
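One common way to compute such an accuracy from a confusion matrix is sketched below. It assumes that entry `M[i][j]` counts the samples of genre `i` that were classified as genre `j`, which is an interpretation rather than the paper's stated definition, and the matrix values are made up for illustration.

```python
def overall_accuracy(M):
    """Correctly classified samples (the diagonal) divided by all samples."""
    correct = sum(M[i][i] for i in range(len(M)))
    total = sum(sum(row) for row in M)
    return correct / total

def per_class_accuracy(M):
    """Fraction of each true genre (row) that was classified correctly."""
    return [M[i][i] / sum(M[i]) for i in range(len(M))]

# Hypothetical 2-genre confusion matrix with 20 validation samples per genre.
M = [[18, 2],
     [1, 19]]
acc = overall_accuracy(M)
per_class = per_class_accuracy(M)
```

For this made-up matrix, 37 of the 40 samples lie on the diagonal, and the per-genre values show which genre accounts for more of the errors.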

Music Genre Classification Effect.
After the music genre classification network model is trained by the proposed method, its classification performance is evaluated using the validation set. The results and the prediction confusion matrix for the five genres are shown in Table 2.
Analyzing the results in Table 2, we conclude that metal music, classical music, and blues music are all largely assigned to their correct categories, with accuracy rates of 94.94 percent, 92.50 percent, and 95.00 percent, respectively. Rock music and country music, in contrast, are sometimes mislabeled. Because some country music can be used as an accompaniment to country dancing, and some rock music is mistakenly categorized as country music, country music is often confused with rock music. Rock music is also sometimes confused with metal music; a possible reason is that both pay more attention to rhythm and are therefore similar. In general, the proposed method effectively classifies the music of the above five genres and has a better effect on the classification of music genres. The total number of neurons in the BP neural network has a significant impact on the training and test errors. For example, as shown in Table 3, when the number of neurons increases, the training error continues to decrease, and we observed a certain correlation between them. After the analysis, we concluded that 7 neurons is the most suitable setting for our experimental setup.

Classification Accuracy of Music Genres.
The evaluation results and a comparative study of the classification accuracy of various music genre classification approaches are presented in Figure 6.
We can observe from Figure 6 that, under different validation sets, the average classification accuracy of the method in [4] is 73% and that of the method in [30] is 82%, while the average music genre classification accuracy of the proposed method is 91%. Compared with the method in [4] and the approach presented in [30], the accuracy of the proposed music genre classification method is therefore significantly higher.

Music Genre Classification Time.
The evaluation results in terms of classification time, comparing the proposed approach with other music genre classification techniques, are presented in Figure 7.
We can observe from Figure 7 that when the size of the validation set increases, the music genre classification time of all techniques also increases. The technique based on the deep learning algorithm proposed in this paper has the benefit of improving the accuracy, precision, and effectiveness of music classification.

Conclusions and Future Work
In this paper, a classification method based on a deep learning algorithm was proposed, which has the advantage of improving the accuracy, precision, and effectiveness of music classification. The experimental results demonstrated that the proposed method effectively improves the accuracy of music classification and is helpful for music channel classification. Moreover, its music genre classification accuracy is high, and it effectively shortens the music genre classification time, which gives it a better music genre classification effect. However, because the research scope of this algorithm is not extended to the subject of finite element, the proposed method has some limitations. In the process of extracting music genre features, this paper ignores the accompaniment information of the music. The main melody of the same piece of music, accompanied by different music, may present different genres and styles.
In subsequent research, we can consider combining the main melody and accompaniment of music when extracting features to further improve the accuracy of classification. Moreover, more advanced deep learning methods such as deep neural networks should be considered to improve the accuracy of the prediction outcome. In learning algorithms, training is one of the activities that take significant time and can degrade the performance of the whole system. Therefore, we will consider dividing the training and prediction phases over an edge-cloud architecture, so that training happens in the remote cloud, which usually has abundant resources. The prediction part of the algorithm should run on the edge, which should improve the processing and response time of the system.

Data Availability
The data used to support the findings of this study are available from the author upon request.

Conflicts of Interest
The author declares that he has no conflicts of interest.