Automatic Recommendation Algorithm for Video Background Music Based on Deep Learning



Introduction
As people's ability to collect and store video background music grows, users' preferences for video background music gradually develop their own styles, which poses great difficulty for the preference recommendation function of music listening software [1,2]. Although some manual and simple mining methods have been used to improve the accuracy of the songs the system recommends to users, the ability to analyze and process the structure of the video background music itself has not improved correspondingly. Regarding the connotation of the video background music itself, perhaps only a musician or a professional music critic can truly understand its meaning and distinguish its genre while listening [3-5]. Ordinary users, by contrast, can only roughly perceive the mood and themes being expressed, mostly as passive perceptions. Only artists understand the isomorphic correspondence between the sound waves of the video background music and human emotions.
This makes human understanding of video background music limited, with considerable individual differences. With the development of computer technology, its applications have spread to all fields of economic and social life. Therefore, discovering the isomorphic relationship between video background music and human emotions through computers has become a means of solving such problems [6,7].
For a video background music recommendation system, different pieces have different styles, divided into many genres and musicians, and different listeners may understand the same piece differently. Among the many listeners, each user has his or her own unique tastes [8]. These tastes are often reflected in the process of listening. Therefore, a recommendation service is needed to recommend each user's favorite video background music [9]. The attribute information of a piece of video background music includes its singer, author, style, album, format, and duration. Using this information, pieces can be clustered and their similarity computed, and a recommendation list can then be generated. The essence is to turn each piece into a vector whose dimensions are composed of its attributes and their proportions [10]. Various similarity measures are used to compute similarity, and the results are finally recommended to users. However, this method has certain disadvantages, because attribute information does not fully reflect the music itself [11]. For example, pieces in the same album may have different styles, so this recommendation method does not work very well. The idea of hybrid recommendation takes into account information beyond the music itself. For example, external information such as the listening environment can be used, because the user's mood while listening is relevant, and environmental factors may affect that mood [12].
After collecting this external information, we cluster the video background music into different types (happy, sad, etc.) and then recommend them to users with different needs [13]. However, this is a relatively blunt method that is not suitable for personalized recommendation; introducing external information increases the computational workload and also affects the playback experience, resulting in poor recommendation quality [14]. Related scholars published a paper on a deep neural network encoded by restricted Boltzmann machines [15]. The paper proposed pretraining layer by layer with restricted Boltzmann machines and then fine-tuning the parameters on real data sets. This paper brought Boltzmann machines and neural networks back into the sight of academia. Related scholars have proposed a personalized search ranking algorithm based on local knowledge bases [16]. Typical documents are extracted from categories according to the degree of preference, and typical documents from different categories constitute the user's personalized local document library. The final search results are sorted by similarity to the local document library. Related scholars have proposed a recommendation system based on the Deep Belief Network (DBN) [17]. The friendliness of deep learning to big data, its strong representation ability, and its excellent noise resistance all make it attractive in the recommendation field. But deep learning is not a panacea. First, training deep learning-based models is extremely costly, requiring large amounts of data and energy. Second, the abstract features extracted by the machine are not easy for people to understand. For resources with little data and easily extracted features, traditional recommendation methods better balance cost and output [18].
Based on the traditional model, researchers have proposed a new method that comprehensively considers the total number of tag citations and the total number of tags in the profile [19]. It is mainly used to calculate the weight of tags in the user profile and the resource profile, and the calculation method matches the query against the resource profile and the user profile [20,21]. Early research on collaborative filtering in the field of video background music was based on explicit feedback, that is, users' ratings of songs or artists [22]. However, since the most common feedback collection method now is to collect users' listening records, the feedback used in video background music recommendation has changed from explicit to implicit. Implicit feedback has an obvious disadvantage: users no longer express their preferences for songs explicitly, and preferences must instead be inferred from their listening records [23,24]. This paper introduces the proposed tightly coupled deep learning fusion model module by module. It introduces the principle and process of mining hidden features on the user side based on an extended stacked autoencoder and then introduces hidden feature mining on the video background music side in detail, from the representation of text data to the use of deep learning models. Specifically, the main contributions of this paper can be summarized as follows. First, in terms of model design, this paper uses convolutional neural networks to mine the hidden features of video background music and improves this model by introducing an attention mechanism, which can better locate local keywords in the text and improve the accuracy of hidden feature mining on the video background music side.
The two content-mining-based deep learning models are tightly coupled through a probabilistic matrix factorization model, and the feasibility and operating mechanism of the model are explained theoretically through formula derivation. Second, according to the characteristics of digital video background music service platforms, this article uses social tags to describe the extra information of the video background music, designs a method that can infer the context and recommend accordingly, and implements a system on this basis. The system can provide services similar to internet radio stations and evaluates the effectiveness of the recommendation algorithm through interaction with users.
Third, this paper verifies the effectiveness of the deep learning algorithm and the recommendation effect of the system based on it, and the results show that the recommendation algorithm designed in this paper can improve recommendation quality.
The rest of this article is organized as follows. Section 2 discusses theories related to automatic recommendation of video background music. Section 3 constructs a recommendation algorithm based on the deep learning fusion model. Section 4 presents experimental testing and result evaluation. Section 5 summarizes the full text.

Video Background Music Automatic Recommendation Related Theories

Video Background Music Recommendation System. In this information-overloaded internet era, how to help users make full use of information on the internet has always been a hot research direction and pain point. In this environment, the cost for internet information producers of making their information stand out among massive information streams is getting higher and higher; it is also difficult for consumers to find what they really need in the vast ocean of information. Mature business solutions mainly fall into two categories: search engines and recommendation systems. The recommendation system digs out potential points of interest based on the user's historical behavior. Although search engines and recommendation systems convey information in different ways, they also have similarities. The common purpose of both is to help users obtain more valuable information; the difference lies in whether users obtain it actively or receive it passively. The overall architecture of the video background music recommendation system is shown in Figure 1.
The recommendation system is a typical data-driven product. To provide an effective recommendation service, it is necessary to collect different information about users or video background music to characterize them. There are many types of data sources used in recommendation systems, which can be roughly divided into two categories: content-related data describing users and video background music, and data generated by interactions between users and video background music. The general classification of recommendation algorithms follows from the data sources used: if a system uses content-related data of users or video background music, it is considered a content-based recommendation system.

Collaborative Filtering Algorithm.
The recommendation model based on matrix factorization assumes that whether a user likes a piece of video background music is controlled by a series of latent factors with different weights, and each piece of video background music likewise has its own set of latent factors. Taking the inner product of the user's latent factors and the music's latent factors yields a quantitative value of the user's preference for that piece. For each user, we compute this value for all unrated pieces, sort them from largest to smallest, and obtain the user's predicted recommendation list, realizing the recommendation function. The entire calculation is performed offline, so online prediction becomes very fast, and the efficiency of model-based collaborative filtering is usually better than that of neighbor-based collaborative filtering. The latent factor model decomposes the rating matrix R into a user hidden feature matrix U and a video background music hidden feature matrix V. Each row vector u_i of U is the hidden feature vector of user i, and u_ik is the user's weight on the k-th latent factor. Each row vector v_j of V is the hidden feature vector of video background music j, and likewise v_jk is its weight on the k-th latent factor. The inner product of the two vectors represents the "preference" of user i for video background music j:

r̂_ui = u_i^T v_j = Σ_{k=1}^{K} u_ik v_jk, (1)

where K is the number of latent factors, that is, the dimension of the vectors u_i and v_j, and r̂_ui is the value predicted by the recommendation system.
To obtain the two feature matrices, LFM defines the following loss function for computing U and V:

L = Σ_{i,j} I_ij (r_ij − u_i^T v_j)^2 + λ (‖U‖^2 + ‖V‖^2), (2)

where the second half of the loss function is a regularization term added to prevent overfitting, λ is the regularization parameter, and I_ij is the indicator function, equal to 1 when user i has rated item j and 0 otherwise.
At the same time, differences in users' rating ranges are taken into consideration. For example, some users habitually give high scores regardless of whether the video background music is good or bad, while others are very extreme and rate highly only the pieces they like. Therefore, when predicting the score, bias information for the user and the video background music is introduced.
The prediction then becomes

r̂_ij = μ + b_i + b_j + u_i^T v_j,

where μ is the average of the nonzero elements of the rating matrix, b_i is the average rating (bias) of user i, and b_j is the average rating (bias) of video background music j. The loss function becomes

L = Σ_{i,j} I_ij (r_ij − μ − b_i − b_j − u_i^T v_j)^2 + λ (‖U‖^2 + ‖V‖^2 + ‖b‖^2).

The loss function is usually optimized by gradient descent, which updates the parameters using the partial derivatives of all unknown parameters; the negative gradient direction is the direction in which the loss function drops fastest. Updates are performed iteratively, and iteration stops when the set number of iterations is reached or the decrease of the loss function falls below a set threshold. Let e_ij = r_ij − r̂_ij and assume the learning rate in gradient descent is θ; then the parameter update formulas in each round of iteration are

u_ik ← u_ik + θ (e_ij v_jk − λ u_ik), v_jk ← v_jk + θ (e_ij u_ik − λ v_jk).

After the updates, the model outputs the user hidden feature matrix U and the video background music hidden feature matrix V. In the prediction stage, the prediction score is obtained from the inner product of the user and song vectors in the matrices.
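The biased latent factor model and its gradient-descent updates can be sketched in a few lines of numpy. This is a minimal illustration rather than the paper's implementation; the factor count, learning rate, regularization strength, and iteration budget are placeholder assumptions.

```python
import numpy as np

def train_lfm(R, K=8, lr=0.01, reg=0.05, epochs=200, seed=0):
    """SGD training of a biased latent factor model.
    R: rating matrix, with 0 marking missing entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, K))
    V = 0.1 * rng.standard_normal((n_items, K))
    b_u = np.zeros(n_users)
    b_v = np.zeros(n_items)
    obs = np.argwhere(R > 0)           # observed (user, item) pairs
    mu = R[R > 0].mean()               # global mean of nonzero ratings
    for _ in range(epochs):
        for i, j in obs:
            pred = mu + b_u[i] + b_v[j] + U[i] @ V[j]
            e = R[i, j] - pred         # prediction error e_ij
            b_u[i] += lr * (e - reg * b_u[i])
            b_v[j] += lr * (e - reg * b_v[j])
            # simultaneous update of the two factor vectors
            U[i], V[j] = (U[i] + lr * (e * V[j] - reg * U[i]),
                          V[j] + lr * (e * U[i] - reg * V[j]))
    return mu, b_u, b_v, U, V

def predict_all(mu, b_u, b_v, U, V):
    """Full prediction matrix: mu + user bias + item bias + U V^T."""
    return mu + b_u[:, None] + b_v[None, :] + U @ V.T
```

Sorting each user's row of the prediction matrix in descending order over unrated items then yields the recommendation list described above.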

Audio Content Characteristics.
The first step of MFCC calculation is to use a filter bank composed of triangular filters to convert the amplitude spectrum obtained by the DFT onto the mel scale. Each triangular filter defines the response of one frequency band and is normalized so that the sum of the weights of each triangle is the same.
A filter bank F_i(k) is composed of M triangular filters of equal height, with each filter defined as

F_i(k) = 0, for k < f_b(i−1);
F_i(k) = (k − f_b(i−1)) / (f_b(i) − f_b(i−1)), for f_b(i−1) ≤ k ≤ f_b(i);
F_i(k) = (f_b(i+1) − k) / (f_b(i+1) − f_b(i)), for f_b(i) < k ≤ f_b(i+1);
F_i(k) = 0, for k > f_b(i+1),

where i denotes the i-th filter and f_b(i) is a boundary point of the filter, corresponding to the k-th coefficient of the N-point DFT. The position of the boundary point f_b(i) depends on the sampling frequency F_s and the number of points N in the DFT. The number of filters equals the number of MFCC coefficients.
The MFCC coefficients are then obtained through the Discrete Cosine Transform (DCT); their physical meaning is the distribution of the signal's spectral energy over different frequency intervals, with each filter capturing the spectral energy of its corresponding frequency range. The MFCC coefficients are computed as

c_j = Σ_{i=1}^{M} X_i cos( j (i − 1/2) π / M ),

where M is the number of filters in the filter bank, j is the index of the cepstral coefficient (j < M), and X_i is the logarithmic energy output of the i-th filter:

X_i = log( Σ_k |X(k)| F_i(k) ),

where X(k) is the amplitude spectrum of the Fourier transform. Through these two steps, the audio can be described by a series of cepstral vectors, each vector being the MFCC coefficients of one frame. The spectral centroid is a metric used to characterize the frequency spectrum in digital signal processing. It indicates where the "centroid" of the amplitude spectrum of the short-time Fourier transform lies and measures the amplitude-weighted average frequency of the spectrum. The brightness of timbre in human perception is related to this feature. Its formula is

SC = Σ_{n=1}^{N} n |X(n)| / Σ_{n=1}^{N} |X(n)|,

where X(n) is the amplitude of the Fourier transform at frequency bin n, X is a DFT frame, and N is the number of frequency points, for example, half the number of samples in the DFT frame. The main purpose of rhythm feature extraction is to extract the regular temporal variations of the audio, such as tempo, beat, and rhythmic structure. Tempo is the speed of a piece of music; the listener's intuitive feeling is how fast or slow the song is. In a given prosodic structure, the tempo is the beat rate, and the beat describes the time at which an acoustic event occurs. The rhythmic structure describes the basic pattern in which video background music events occur. The calculation of rhythm features is based on the periodicity of the measured quantities; in video background music these are the onsets in the audio file and the extracted low-level features (such as energy and spectral features). Specifically, onsets are used to estimate beat positions; for example, the onset in a piano piece is the moment a key is pressed. When onsets are combined with spectral characteristics, the loudness at the current point can be estimated.
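The filter bank construction, the DCT step, and the spectral centroid described above can be sketched with numpy as follows. The frame length, filter count, and coefficient count are placeholder assumptions for illustration, not values given in the paper.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Equal-height triangular filters with boundaries uniform on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_filters + 2 boundary points f_b, mel-uniform from 0 Hz to fs/2
    mels = np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mels) / fs).astype(int)
    F = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            F[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            F[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return F

def mfcc(frame, fs, n_filters=26, n_coeffs=13):
    """|DFT| -> mel filter energies -> log -> DCT-II, as in the two steps above."""
    spec = np.abs(np.fft.rfft(frame))
    F = mel_filterbank(n_filters, len(frame), fs)
    X = np.log(F @ spec + 1e-10)                    # X_i, log filter outputs
    i = np.arange(n_filters)
    # c_j = sum_i X_i * cos(j * (i + 1/2) * pi / M)  (i indexed from 0)
    return np.array([np.sum(X * np.cos(j * (i + 0.5) * np.pi / n_filters))
                     for j in range(n_coeffs)])

def spectral_centroid(frame):
    """Amplitude-weighted mean frequency bin of one DFT frame."""
    X = np.abs(np.fft.rfft(frame))
    n = np.arange(len(X))
    return np.sum(n * X) / (np.sum(X) + 1e-10)
```

For a pure tone, the spectral centroid sits at the tone's frequency bin, matching the "brightness" interpretation above.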

User Interest Mining Based on Extended Stacked Autoencoder. Through the mining of user-side content information, the user's hidden feature matrix can be obtained, which can be regarded as an implicit factor matrix that affects the user's preference for video background music. The user-side content information that can usually be obtained is mostly structured short-text information, such as the user's gender, age, occupation, and zip code; this information does not involve word order. Based on the previous introduction, investigation, and analysis, the SDAE model is found to be suitable for such structured data. For mining user interest, this paper extends the SDAE model (abbreviated ASDAE) by adding the user's historical behavior feedback as model input and constraining the model's reconstruction and training process, so that the feature representations obtained by the encoder are richer; testing shows that the ASDAE model improves on the SDAE to a certain extent. The basic idea of the model is to learn hidden features from noisy input data and to obtain a dimensionality-reduced re-expression of the original data by minimizing the reconstruction error. The model includes an encoder and a decoder: assuming an SDAE contains L layers, the first L/2 layers are the encoder and the last L/2 layers are the decoder. The user interest mining process based on the extended stacked autoencoder is shown in Figure 2.
Tonal features are an important part of video background music, so extracting tonal content from a song to characterize it is very important. In video background music, key (also known as tonality) is the general term for the tonic and the mode of a piece. A key can be thought of as a series of different musical tones organized around a tonic. Besides the tonic, two other pitches are important in a tune: the dominant and the subdominant. A mode organizes musical tones into an organic system according to certain interval relationships. According to differences in the arrangement of the interval relationships, modes can be divided into two categories: major and minor. For example, if the tonic of a song is C and the interval arrangement is major, then its tonal feature can be described as "C major." Generally speaking, the tonal characteristics of a song determine the most intuitive emotional expression it brings to the listener: it is generally believed that major songs give listeners a broad and bright feeling, while minor songs give a lyrical and melancholic feeling. The specific user interest mining process is as follows: (1) First, obtain and encode the user-side content information, converting the structured content (such as ID, age, gender, and occupation) into vector form X, mainly through one-hot encoding.
(2) Convert the user's historical behavior information (the rating matrix) into a user feedback table; that is, set interacted positions to 1 and noninteracted positions to 0, obtaining the user's feedback set over all video background music.
(3) The greedy method is used to pretrain each layer of the ASDAE model layer by layer to initialize the parameters of the ASDAE network. During layer-by-layer training, for each layer in the network an output layer is first added, pretraining is performed by minimizing the reconstruction error, the network parameters of that layer are obtained, and the output layer is then removed. The function of the coding layer is h = σ(W x + b), where W and b are the layer's weight matrix and bias and σ is the activation function. (4) Take the vectorized user content information X and the user feedback table S as input to the ASDAE, add an objective function to the pretrained model, and perform supervised training to fine-tune the network parameters. The objective function here is determined by the model's task: it is the error between the user implicit matrix generated by the matrix factorization part of the fusion model and the user interest matrix output by the model.
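The input-preparation steps (1) and (2) above, one-hot encoding of structured attributes and binarization of the rating matrix into the feedback table S, can be sketched as follows; the vocabulary argument is a hypothetical helper for illustration, not part of the paper's system.

```python
import numpy as np

def feedback_table(R):
    """Step (2): binarize the rating matrix into the implicit feedback table S
    (1 where the user interacted with the track, 0 otherwise)."""
    return (R > 0).astype(np.float64)

def one_hot(value, vocabulary):
    """Step (1): one-hot encode one structured attribute (e.g. occupation).
    `vocabulary` lists the attribute's possible values."""
    v = np.zeros(len(vocabulary))
    v[vocabulary.index(value)] = 1.0
    return v
```

Concatenating the one-hot vectors of all attributes yields the content vector X that is fed to the ASDAE together with S.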

Video Background Music Feature Mining Based on Extended Convolutional Neural Network.
In the long-text information on the video background music side, comment information contains not only attributes of the music but also the user's emotional tendencies; it is rich in information, and the features extracted from it may be relatively more valuable. Therefore, when introducing content information on the video background music side, this article finally decided to add comment data. In text mining, the CNN model has been applied in many attempts and is popular for its simplicity and good effect. The general idea of using a CNN is to first vectorize the entire text and then use convolution, pooling, and fully connected layers to gradually extract text features. But in a long text, such as the comment information used in this article, a single comment may cover the functions and characteristics of the music as well as the user's attitude. Different regions of the text are not equally important, and in fact the text is full of words that have nothing to do with the attributes of the video background music; grasping the key words in local regions helps obtain the attributes of the music more accurately. Therefore, this article extends the CNN model by introducing an attention mechanism, denoted ACNN. The experimental comparison below shows that the improved model helps improve the effect, which confirms the value of introducing the attention mechanism. The structure of the improved CNN model is shown in Figure 3. The convolutional layer is used to extract text features; because of the particular contextual information contained in text, its processing differs from that of computer signal processing.
Therefore, after obtaining the attention-weighted text sequence, the document features are further extracted through the convolution structure. The convolutional layer uniformly transforms the input image or text into a numerical matrix and slides a fixed-size convolution kernel over it from left to right and top to bottom with a certain stride. The convolution kernel can be understood as a weight matrix, and the convolution operation takes, at each position, the sum of the products of the kernel and the numbers at the corresponding positions. Let W_c denote the weight matrix of the convolutional layer, b_c its bias, and * the convolution operation; the text feature C obtained after convolution is

C = f(W_c * X + b_c).

Among the nonlinear activation functions f there are sigmoid, ReLU, and others; after comparing their effects, this article uses ReLU, which avoids the vanishing gradient problem.
The pooling layer further mines the text features obtained after convolution; while ensuring that the target is unaffected by relative position changes, it reduces the dimensionality of the extracted features, which avoids overfitting to a certain extent. The pooling layer sets a fixed-size pooling window over the feature map from the previous layer and performs the pooling operation from left to right and top to bottom with a certain stride. There are two pooling operations, taking the maximum or the average value within the window, and pooling itself has no parameters to learn in back-propagation. The fully connected layer synthesizes the previously extracted features and maps the result to a vector space of a specific dimension as the final output of this part of the model. Each node in the fully connected layer is connected to all outputs of the previous layer to achieve feature mapping; through the full connection, text features of the required dimension are finally obtained. The feature output after the fully connected layer is

O = f(W_f C_p + b_f),

where W_f and b_f are the weight matrix and bias of the fully connected layer and C_p is the pooled feature vector.
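The attention, convolution, pooling, and fully connected stages described above can be sketched as a toy numpy forward pass. The dot-product attention form and all dimensions are assumptions for illustration, since the paper does not specify them, and the random weights stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def acnn_features(X, n_kernels=4, width=3, out_dim=8):
    """ACNN-style pipeline on an embedded comment X (n_words x embed_dim):
    attention re-weighting -> 1-D convolution -> ReLU -> max pooling -> FC.
    All weights are random placeholders; a real model learns them."""
    n, d = X.shape
    # attention: score each word against a query vector, softmax, re-weight
    q = rng.standard_normal(d)                    # query vector (assumption)
    scores = X @ q
    a = np.exp(scores - scores.max())
    a /= a.sum()
    Xa = X * a[:, None]                           # attention-weighted sequence
    # convolution: slide width-sized kernels along the word dimension
    Wc = rng.standard_normal((n_kernels, width, d))
    conv = np.array([[np.sum(Wc[m] * Xa[t:t + width])
                      for t in range(n - width + 1)]
                     for m in range(n_kernels)])
    conv = np.maximum(conv, 0.0)                  # ReLU activation
    pooled = conv.max(axis=1)                     # max pooling per feature map
    Wf = rng.standard_normal((out_dim, n_kernels))
    return np.maximum(Wf @ pooled, 0.0)           # fully connected output O
```

The max pooling step is what makes the extracted keyword features insensitive to where in the comment they occur, as the paragraph above notes.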

Recommendation Algorithm Based on Deep Learning Fusion Model. According to the principle of PMF, we first assume that the two implicit feature matrices U and V obtained after decomposition both follow Gaussian distributions.
Based on the Bayesian formula and maximum a posteriori estimation, maximizing the log-posterior probability can be transformed into minimizing a loss function. During model optimization, the parameters are updated alternately: V is updated with U and W fixed, or U is updated with V and W fixed. Through gradient descent, with learning rate θ and regularization parameters λ_U and λ_V, the update rules for U and V are

u_i ← u_i + θ ( Σ_j I_ij (r_ij − u_i^T v_j) v_j − λ_U u_i ),
v_j ← v_j + θ ( Σ_i I_ij (r_ij − u_i^T v_j) u_i − λ_V v_j ).

Finally, after the hidden feature matrices U and V are obtained, the score prediction is made by r̂_ij = u_i^T v_j.
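The alternating update scheme can be sketched as a single PMF step in numpy. This is a simplified version without the deep-model coupling terms W; the learning rate and regularization values are placeholder assumptions.

```python
import numpy as np

def pmf_step(R, I, U, V, lr=0.005, lam=0.1):
    """One alternating gradient step of PMF: update U with V fixed,
    then V with the new U fixed. I is the observation indicator matrix."""
    E = I * (R - U @ V.T)              # errors on observed entries only
    U = U + lr * (E @ V - lam * U)     # u_i <- u_i + lr*(sum_j e_ij v_j - lam u_i)
    E = I * (R - U @ V.T)              # recompute errors with the new U
    V = V + lr * (E.T @ U - lam * V)   # v_j <- v_j + lr*(sum_i e_ij u_i - lam v_j)
    return U, V
```

Iterating this step drives down the regularized squared error on the observed ratings, after which U V^T gives the predicted scores.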

Performance Test.
Since the algorithm designed in this paper involves the selection of a similarity measure in the first stage, the effect of collaborative filtering under different measures must be evaluated on the data set. For the collaborative filtering stage using the deep learning fusion model and the Euclidean metric, the comparison of MAE values under different training set proportions is shown in Figure 4. It can be seen that the prediction accuracy of the deep learning fusion model is generally higher than that of Euclidean similarity. When using the complete data set for actual recommendation, the deep learning fusion model should therefore measure the user's predicted preferences more accurately.
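The MAE metric used in Figures 4 and 5 is simply the mean absolute difference between predicted and actual scores:

```python
import numpy as np

def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.mean(np.abs(predicted - actual))
```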

After the second-stage refinement, the MAE of the predicted scores varies with the α parameter as shown in Figure 5 (with 91% of the data as the training set). This test shows that, when certain contextual factors are considered, case-based reasoning in the deep learning fusion model improves the prediction score to a certain extent. However, as α increases, the prediction deviates more and more from the user's original model and becomes a recommendation that relies only on contextual topics, and the impact of the different similarity measures used in the collaborative filtering stage on the predicted score also becomes smaller.
When the Tanimoto coefficient or log-likelihood ratio similarity is used, the input model ignores preference values, so the predicted and actual scores in the data set test are both 1 (indicating interest), which means the MAE is always 0. Evaluating preference-free recommendation this way is invalid, so precision and recall must be used instead to evaluate the recommendation effect under the Tanimoto coefficient and log-likelihood ratio similarity. With a 91% training set, the recommendation precision of the collaborative filtering stage under the four similarity measures, for different numbers of recommended songs, is shown in Figure 6.
It can be seen that the precision first shows an increasing trend with the number of recommendations; once too many pieces are recommended, the proportion of recommended pieces the user is predicted to like begins to decrease. Figure 7 shows the recall of the collaborative filtering stage under the four similarity measures. Recall basically rises as the number of recommended pieces increases, but after the number of recommendations reaches a certain value, the increase slows and even falls back slightly. This shows that what the algorithm can dig out of the user's preferences has reached its upper limit. On the whole, the log-likelihood ratio similarity on the preference-free data model is more suitable than the other measures for the collaborative filtering stage on this data set.
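The precision and recall reported in Figures 6 and 7 can be computed per user as follows. This is a standard top-k formulation; the paper does not give its exact definition, so the details here are an assumption.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommended tracks against the set
    of tracks the user actually liked (`relevant`)."""
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Averaging these values over all test users at each list length k reproduces curves of the kind shown in the two figures.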
After the log-likelihood ratio similarity is used for the first-stage collaborative filtering and screening, and after the second-stage refinement, the precision and recall of the final result vary with the α parameter as shown in Figures 8 and 9 (with 150 recommendations).

It can be seen that, when certain contextual factors are considered, recommendation quality improves after the second-stage case-based reasoning refinement. However, when the α parameter reaches 0.6, the recall rate drops again, which shows that the user's long-term interest model still carries relatively high importance in recommendation.
Using the single-machine similarity algorithm on a single server, the offline similarity calculation takes 23.2 minutes on average. Using the distributed co-occurrence matrix method with 4 identically configured servers for MapReduce computation, the similarity calculation completes in only 7.1 minutes on average, which illustrates the effectiveness of the matrix blocking method in optimizing similarity calculation.
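The distributed computation is described only at a high level; as a minimal single-machine sketch, the item co-occurrence counts that the MapReduce jobs parallelize can be built from users' listening histories as follows (the history format is an assumption).

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_matrix(histories):
    """Count, for every pair of tracks, how many users listened to both.
    `histories` maps a user ID to the set of tracks that user played."""
    counts = defaultdict(int)
    for tracks in histories.values():
        for a, b in combinations(sorted(tracks), 2):
            counts[(a, b)] += 1    # symmetric pair, stored once
    return dict(counts)
```

In the distributed version, each mapper emits pair counts for a block of users and the reducers sum them, which is why splitting the matrix into blocks scales the computation across servers.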

Evaluation of Test Results. After the Last.fm user data is crawled by the acquisition module, it is preprocessed and written into the rating database. The information loading time recorded in the user table is shown in Figure 10.
After completing the data preparation, we determine the similarity measure and the α coefficient according to the algorithm tests above, complete the algorithm module configuration, and start the Servlet to wait for user input.
The user input interface includes four input boxes, for user name, song title, artist, and tag, plus execution buttons. In this experiment, since a rating preference model must be established on the existing data set for testing, the user name is required. The search condition can be song title and artist, or tag keywords. The tracks played for the user are the top songs of the recommendation list formed under the search conditions; in that list, apart from the tracks based on the contextual theme mentioned above, other tracks recommended from the user's historical evaluation records are also adjusted by the prediction score adjustment. Functional testing basically verified the usability of each module. Finally, the system was deployed on a public network server, and each functional module operates normally. In the Last.fm community, 45 member users from the Asian Music team were randomly invited to evaluate their experience of the system. Among them, 41 users gave an average rating of more than 4 stars in their feedback on 10 songs in the recommended list, and they also saved tracks to their collections.
In summary, the algorithm designed in this paper can improve the quality of recommendation and can be applied to the radio-type video background music service website to reflect its effectiveness and achieve the expected goal of this subject.

Conclusion
In terms of model design, this paper designs a model that combines a deep learning model with collaborative filtering. An improved stacked denoising autoencoder processes structured data to obtain user interest; a convolutional neural network mines unstructured long text to obtain the hidden features of the video background music. Introducing an attention mechanism to capture local key points of the text improves the mining effect and also strengthens the interpretability of the model. Based on the idea of probabilistic matrix factorization, a tightly coupled fusion of deep learning and matrix factorization is realized. The model uses a unified objective function to optimize both parts jointly: the deep model part provides corrections to the matrix factorization, and the matrix factorization part guides the feature extraction of the deep learning part, giving the model better prediction performance. For digital video background music service platforms such as video background music network radio stations, where context must be considered when recommending, this work studies the acquisition of context information, the establishment of a recommendation model that incorporates context, context-aware algorithms, and their fusion with traditional recommendation algorithms. This breaks through the key technology of short-term user preference discovery, realizes a system applicable to background music recommendation for internet radio videos, and completes offline comparison tests and online simulation experiments.
The results show that the algorithm in this paper achieves higher recommendation accuracy than the control system: the mean absolute error (MAE) of the prediction score is more than 10% lower than that of the control system, and precision and recall improve by more than 20%. Moreover, the system can improve user satisfaction to a certain extent.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The author declares no conflicts of interest regarding the publication of this paper.