Tag-Aware Recommender System Based on Deep Reinforcement Learning

Abstract: Recently, the application of deep reinforcement learning in recommender systems is flourishing, standing out by overcoming drawbacks of traditional methods and achieving high recommendation quality. The dynamics, long-term returns, and sparse data issues in recommender systems have been effectively addressed. However, applying deep reinforcement learning introduces problems of interpretability, overfitting, complex reward function design, and user cold start. This paper proposes a tag-aware recommender system based on deep reinforcement learning that requires no complex reward function design, taking advantage of tags to make up for the interpretability problems existing in recommender systems. Our experiment is carried out on the MovieLens dataset. The results show that the DRL-based recommender system is superior to traditional algorithms in minimum error, and that the application of tags has little effect on accuracy while compensating for the lack of interpretability. In addition, the DRL-based recommender system performs well on user cold start problems.


Introduction
With the increasing amount of information and ever easier access to it, the range of choices users face for goods, movies, restaurants, etc. has significantly increased. On one hand, mass information brings more convenience; on the other, information overload brings the trouble of over-choice as well. Recommender systems are information filtering tools that deal with this problem by providing users with relevant information in a highly personalized manner [1].
As a sub-area of machine learning, deep reinforcement learning based recommender systems have gained significant attention by overcoming drawbacks of traditional methods and achieving high recommendation quality. Traditional algorithms regard recommendation as a static process, while DRL-based algorithms model the dynamic changes in the interests and distributions of users and items. Owing to MDP modeling and cumulative rewards, long-term returns are considered, which improves user stickiness. The application of deep neural networks effectively addresses sparse data issues in recommender systems. However, every coin has two sides. Deep neural networks lack interpretability and tend to overfit. Reinforcement learning needs complex reward function design and can hardly handle the user cold start problem. Currently, most algorithms combining reinforcement learning with recommender systems are based on DQN. However, Q-value based deep reinforcement learning algorithms are only suitable for low-dimensional, discrete action spaces; as is well known, DQN was first proposed for Atari games, which have only four actions. In the rating prediction problem studied in this paper, which has ten distinct ratings, a Q-value based method is no longer appropriate. Actor-Critic based deep reinforcement learning algorithms, by contrast, are not limited to discrete action spaces and can even handle continuous action spaces, so the algorithm in this paper is built on the Actor-Critic framework. Specifically, we apply DDPG as our basic algorithm.
Research on recommender systems includes Top-N recommendation and rating prediction; this paper focuses on rating prediction. In this paper, we propose a tag-aware recommender system based on deep reinforcement learning.
Firstly, we model the recommendation problem (rating prediction) with deep reinforcement learning. The deep reinforcement learning approach models recommendation as a dynamic process. Spatially, it is scalable as the number of users and items increases. Temporally, it adapts to the dynamic changes in user interests, taking not only short-term returns but also long-term returns into account. Besides, complex data preprocessing is not necessary, and the algorithm can automatically learn feature representations from scratch [6].
Secondly, tag information is used to make up for the interpretability problems existing in recommender systems. By Wikipedia's definition, a tag is a non-hierarchical keyword used to describe information, which can describe the semantics of an item. Depending on who tags the item, there are generally two types of tagging applications: one asks the author or an expert to tag the item; the other allows regular users to tag the item, the latter also being called user-generated content. When a user tags an item, the tag describes the user's interests on the one hand and the semantics of the item on the other. Users apply tags to express their views on items, so tags are an important link between users and items. Tags also reflect users' interests and form an important data source; the effective use of tags is of great help in improving the quality of personalized recommendations [7]. Douban, for instance, makes good use of tag data, increasing both the diversity and the interpretability of its recommendations.
Finally, the user cold start problem. Recommender systems need to predict users' future behavior and interests according to their historical behavior and interests, so a large amount of user behavior data is an important component and prerequisite [8]. Designing a personalized recommender system without a large amount of user data is a cold start issue. This paper focuses on the user cold start problem, i.e., how to make personalized recommendations for new users. Deep reinforcement learning can dig into the potential connections between user characteristics and item characteristics; hence, it has potential advantages in solving cold start problems.
Our contributions are listed as follows: 1. Apply deep reinforcement learning to rating prediction, adapting to the dynamics of users and items and taking long-term returns into account. 2. Use tag applications to make up for the lack of interpretability in recommender systems. 3. Address the user cold start issue. The rest of this paper is organized as follows. Section 2 briefly reviews work related to the combination of deep reinforcement learning and recommender systems. Section 3 first defines the recommendation problem in terms of deep reinforcement learning and then uses the DDPG algorithm to predict ratings. Section 4 carries out experiments on the MovieLens 20M dataset, and the results show that our algorithm is superior to traditional algorithms in minimum error and performs well on user cold start problems. Section 5 concludes the paper and puts forward directions for future work.

Traditional recommender algorithms and rating prediction
From GroupLens to Netflix, and then to Yahoo! Music's KDD Cup, rating prediction has always been a hotspot in recommender systems. The basic dataset for rating prediction is the user-rating dataset, which consists of users' rating records; each record is a triple (u, i, r), indicating that user u gives item i a rating r. Because it is impossible for users to rate all items and new users appear every day, the key to rating prediction is to predict unknown user-rating records from historical records. For example, suppose there are two users A and B, and three movies C, D, and E. User A rates movie C with 2 and movie D with 5, and has not rated movie E. User B is a new user without any rating records. When user A browses the web and sees movie E, we want to help user A decide whether to watch this movie by predicting a rating. When user B browses the web, we want to help the user decide which movie to watch by predicting B's ratings for movies C, D, and E.
Recommender systems predict users' preferences on items and automatically recommend items that users may be interested in [9][12]. Recommendation algorithms are usually classified into three categories [9][11]: collaborative filtering, content-based, and hybrid recommender systems. Collaborative filtering makes recommendations according to users' or items' historical records, either explicit or implicit. Content-based recommendation relies on items' and users' auxiliary information, such as audio, images, and videos. Hybrid recommendation integrates at least two different recommendation algorithms [10][11].
Since the Netflix Prize competition, researchers from different countries have come up with numerous rating prediction algorithms. Traditional algorithms include average-based prediction, where predictions are made by calculating the average of the ratings, and neighborhood-based approaches, where predictions are calculated from the similarity of users or items [2]. With the development of machine learning, the Latent Factor Model [3] and Matrix Factorization [4] were proposed; their essence is to complete the rating matrix through dimensionality reduction. Representative algorithms include SVD, LFM, and SVD++ [5], an extension of SVD.

DRL based recommender system
Reinforcement learning operates on a trial-and-error paradigm [13]. The basic model is composed of the following components: agents, environments, states, actions, and rewards. The combination of deep neural networks and reinforcement learning forms DRL, which has achieved human-level performance across multiple domains such as games and self-driving cars. Deep neural networks enable the agent to learn from scratch.
Recently, DRL has obtained good results [14][15][16] in recommender systems. Zhao et al. [17] explored the page-wise recommendation scenario with DRL; the proposed framework, DeepPage, is able to adaptively optimize a page of items based on users' real-time actions. On this basis, the list-wise method was further proposed [18]; these two articles mainly solve the ranking problem in recommender systems, applying the DDPG framework. Zhao et al. [19] proposed a DRL framework, DEERS, for recommendation with both negative and positive feedback in a sequential interaction setting, especially highlighting the importance of negative feedback; the paper also demonstrates the effectiveness of the proposed framework in a real-world e-commerce setting. Zheng et al. [20] proposed a news recommender system, DRN, with DRL to tackle the following three challenges: (1) dynamic changes of news content and user preference; (2) single feedback; (3) diversity of recommendations. That work takes not only clicks or ratings into consideration but also user stickiness. Chen et al. [21] proposed a robust deep Q-learning algorithm to address the instability issue with two strategies: stratified sampling replay and approximate regretted reward. The former solves the problem from the sample aspect, the latter from the reward aspect. DQN-based algorithms alleviate the problem of distribution shift in dynamic environments but need complex reward functions. Chen et al. [22] build a more predictive user model and learn the reward function in a way consistent with the user model. The learned reward function benefits reinforcement learning in a more principled way than relying on hand-designed rewards. The user model makes model-based RL possible and fits new users online, which addresses the user cold start problem.
Although complex reward functions no longer need to be built when using user models, the design of reward functions is still required during the user model building phase. Choi et al. [14] proposed solving the cold start problem with RL and bi-clustering, using bi-clustering to alleviate the cold start problem and provide interpretability for the recommender system. Munemasa et al. [15] proposed using DRL for store recommendation.

Method
Considering that users' rating of movies is a typical sequential decision process, in accord with the delayed feedback in reinforcement learning, we apply reinforcement learning to model the recommendation problem. In this paper, the dataset of user rating records is viewed as the environment, and the agent needs to perceive the environment when predicting ratings. Reinforcement learning is usually modeled as a Markov decision process (MDP), which is a tuple <S, A, P, R, γ>, so our model is defined as follows:
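The view of the rating dataset as an environment can be sketched as a minimal interface. All names and the record layout here are our own assumptions for illustration, not the paper's implementation:

```python
class RatingEnv:
    """Minimal sketch: the chronologically sorted rating records act as the
    environment; an episode simply walks through the records in order.
    Each record is assumed to be a (state_vector, true_rating) pair."""

    def __init__(self, records):
        self.records = records
        self.t = 0

    def reset(self):
        self.t = 0
        state, _ = self.records[0]
        return state

    def step(self, predicted_rating):
        _, true_rating = self.records[self.t]
        reward = -abs(predicted_rating - true_rating)  # negative absolute error
        self.t += 1
        done = self.t >= len(self.records)
        next_state = None if done else self.records[self.t][0]
        return next_state, reward, done
```

The agent repeatedly observes a state, emits a rating as its action, and receives the (delayed) error feedback as reward.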

Problem definition
State Space: A state should represent the explicit features of users and movies as well as the implicit features between them. Based on the MovieLens dataset, each state has 28 dimensions, and states are sorted in chronological order by rating timestamp.
Action Space: Our goal is to predict users' ratings of movies, so we regard ratings directly as actions. Ratings range from 0.5 to 5 in half-star steps; thus, there are 10 discrete ratings in total, and the action space has 10 actions.

Reward Function:
The key to rating prediction is accuracy. The larger the difference between the predicted rating and the actual rating, the smaller the reward; conversely, the smaller the difference, the larger the reward. The reward in this paper is the negative absolute difference between the predicted rating and the true rating: R = −|r̂ − r|, where r̂ is the predicted rating and r is the true rating. Since user ratings are the only feedback used, no complex reward function design or reward shaping is required.
Discount Factor: γ ∈ [0, 1]. When γ = 0, the recommender system only takes the immediate reward into consideration; when γ = 1, all future rewards are fully counted.
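As a sketch of the two design choices above, the following hypothetical helpers (function names are ours) snap a continuous actor output to one of the 10 half-star ratings and compute the negative-absolute-error reward:

```python
import numpy as np

# The 10 discrete actions: half-star ratings 0.5, 1.0, ..., 5.0
RATINGS = np.arange(0.5, 5.01, 0.5)

def discretize(raw_action: float) -> float:
    """Snap a continuous actor output to the nearest valid half-star rating."""
    return float(RATINGS[np.argmin(np.abs(RATINGS - raw_action))])

def reward(predicted: float, actual: float) -> float:
    """Negative absolute rating error: a larger error yields a smaller reward."""
    return -abs(predicted - actual)
```

This is the discretization step the conclusion refers to: DDPG outputs a continuous action, which is mapped onto the 10-rating action space.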

DDPG based rating prediction algorithm
DDPG [23] stands for Deep Deterministic Policy Gradient, a combination of the Actor-Critic and DQN algorithms. "Deep" means using the experience replay pool and double network structure from DQN to promote effective neural network learning. "Deterministic" means that the Actor no longer outputs a probability for each action but rather a specific action, which helps learning in continuous action spaces.

Figure 1 The network structure of DDPG
We call the two networks in the Actor the action estimation network (rating prediction network) and the action reality network, and the two networks in the Critic the state estimation network and the state reality network.
DDPG applies a double network structure similar to DQN: both the Actor and the Critic have a target-net and an eval-net. It is important to emphasize that we only train the parameters of the action estimation network (rating prediction network) and the state estimation network, while the parameters of the action reality network and the state reality network are copied from the first two networks at certain intervals.
First, on the Critic's side, the learning process is similar to that of DQN: the network is trained with the squared loss between the real Q value and the estimated Q value,

L = E[(y − Q(s, a))²], where y = r + γ Q′(s′, a′).

Here Q(s, a) is obtained from the state estimation network, and a is the action passed over by the action estimation network (rating prediction network); y = r + γ Q′(s′, a′) is the real Q value. Instead of using a greedy strategy to select the action a′, we obtain a′ directly from the action reality network. In general, the training of the Critic's state estimation network is based on the squared loss between the real and estimated Q values. The estimated Q value is obtained by feeding the current state s and the action a, output by the action estimation network (rating prediction network), into the state estimation network. The real Q value is obtained by feeding the observed reward r, the next state s′, and the action a′ from the action reality network into the state reality network and applying the discount factor.
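The Critic's squared TD loss can be sketched in a few lines. This is a simplified numpy illustration with assumed variable names, not the paper's implementation:

```python
import numpy as np

def critic_loss(q_eval, rewards, q_target_next, gamma=0.99):
    """Squared TD loss for the state estimation (eval) network.

    q_eval:        Q(s, a) from the state estimation network
    rewards:       observed rewards r
    q_target_next: Q'(s', a') from the state reality (target) network,
                   with a' supplied by the action reality network
    gamma:         discount factor (0.99 is an assumed value)
    """
    y = rewards + gamma * q_target_next   # the "real" Q value
    return np.mean((y - q_eval) ** 2)     # squared loss
```

In a full implementation the gradient of this loss is backpropagated only through q_eval; the target networks are held fixed.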
Second, on the Actor's side, we update the parameters of the action estimation network (rating prediction network) according to the deterministic policy gradient [23]:

∇_θ J ≈ E[ ∇_a Q(s, a) |_{a=μ(s)} · ∇_θ μ(s) ]

Let us use an example to explain this formula. Suppose that for the same state, the action estimation network (rating prediction network) predicts two different ratings a1 and a2 and gets two feedback Q values from the state estimation network, Q1 and Q2. Assume Q1 > Q2, that is, rating a1 is closer to the true value. Then, following the idea of Policy Gradient, we increase the probability of action a1 and decrease the probability of action a2. In short, the Actor wants to obtain as large a Q value as possible. Therefore, the Actor's loss can be simply understood as: the greater the feedback Q value, the smaller the loss, and vice versa.
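The intuition that a larger feedback Q value means a smaller Actor loss can be written as a one-line sketch. This is an illustration only; in a real implementation the gradient flows through the Critic into the Actor's parameters, as in the formula above:

```python
import numpy as np

def actor_loss(q_values):
    """Actor loss sketch: the Actor maximises the Critic's feedback Q values,
    so minimising the negative mean Q value performs the gradient ascent
    described above (greater Q -> smaller loss)."""
    return -np.mean(np.asarray(q_values))
```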
In addition, traditional DQN uses a 'hard' mode of target-net parameter updates, assigning the eval-net parameters to the target-net every certain number of steps. DDPG instead applies a 'soft' mode, updating the target-net parameters a little at each step. This method of parameter updating has been shown to greatly improve the stability of learning.
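The 'soft' update can be sketched as follows. The mixing coefficient tau is an assumed value; the paper does not report its setting. Setting tau = 1 recovers DQN's 'hard' copy:

```python
def soft_update(target_params, eval_params, tau=0.01):
    """Blend a small fraction tau of the eval-net parameters into the
    target-net at every step (parameters shown as flat lists of floats
    for illustration)."""
    return [(1 - tau) * t + tau * e for t, e in zip(target_params, eval_params)]
```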

Dataset
MovieLens datasets are widely used in recommendation research. Our experiment employs the MovieLens 20M dataset, which contains 20 million ratings and tag applications from 138,493 users for 27,278 movies. Only users with at least 20 ratings are included.
Different from former datasets, the 20M dataset does not include any demographic information (age, gender, occupation, zip code), which the site stopped collecting, but it does include tag applications [24]. Before the experiment, we preprocessed the dataset: 1. Tags are words or short phrases applied by users to movies. This paper does not use word2vec or any other NLP method; instead, we directly selected the 1127 most commonly used tags, assigned ID numbers according to the tags' initials, and used the tagId directly as a feature.
2. Only users who have both ratings and tags for a movie are selected. 3. All features are normalized. 4. Records include both a tag and a rating. Specifically, our dataset is sorted by rating timestamp in chronological order; the first 80% of the data are used for training and the last 20% for testing.
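The chronological 80/20 split described above can be sketched as follows (the record layout with a 'timestamp' key is our assumption):

```python
def chronological_split(records, train_frac=0.8):
    """Sort rating records by timestamp and use the first 80% for training
    and the remaining 20% for testing."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]
```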
In order to test the user cold start problem, the users in the test set are divided into two parts: 502 old users (existing in the training set) and the remaining 960 new users (not in the training set). There are 21,227 records for old users and 21,902 records for new users.

Name       Users  Movies  Records  Tags
Dataset     5510    7525   214129  1127
Train set   4550    6802   171000  1127
Test set    1462    4030    43129  1098

The preprocessing of the experimental data in this paper follows TRSDL [32]. Although the evaluation measures are the same, the experimental results are not comparable due to the different data preprocessing methods.

MAE (mean absolute error)
MAE measures the average absolute difference between the true ratings and the ratings estimated by a recommendation algorithm: MAE = (1/N) Σ |r̂_ui − r_ui|.

RMSE (root mean squared error)
RMSE is the evaluation criterion used by the Netflix Prize: RMSE = sqrt((1/N) Σ (r̂_ui − r_ui)²). The smaller the RMSE, the more accurate the algorithm.
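Both metrics are straightforward to compute; a minimal sketch:

```python
import math

def mae(true, pred):
    """Mean absolute error between true and predicted ratings."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)

def rmse(true, pred):
    """Root mean squared error, the Netflix Prize criterion."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))
```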

Compared methods
This paper selects some classic algorithms in recommender systems for comparative analysis. Normal predictor: an algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
The prediction r̂_ui is generated from a normal distribution N(μ̂, σ̂²), where μ̂ and σ̂² are estimated from the training data using maximum likelihood estimation.
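A minimal sketch of the normal predictor, fitting μ and σ by maximum likelihood and sampling predictions (function and variable names are ours):

```python
import random
import statistics

def fit_normal_predictor(train_ratings):
    """Fit mu and sigma on the training ratings by maximum likelihood
    (pstdev divides by N, the ML estimate) and return a sampler that
    draws a predicted rating from N(mu, sigma^2)."""
    mu = statistics.fmean(train_ratings)
    sigma = statistics.pstdev(train_ratings)
    return lambda: random.gauss(mu, sigma)
```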

Co-clustering:
A collaborative filtering algorithm based on co-clustering [31]. Users and items are assigned to clusters C_u and C_i and to co-clusters C_ui. The prediction is set as:

r̂_ui = C̄_ui + (μ_u − C̄_u) + (μ_i − C̄_i)

where C̄_ui is the average rating of the co-cluster C_ui, C̄_u is the average rating of u's cluster, and C̄_i is the average rating of i's cluster.
KNN basic: a basic collaborative filtering algorithm. The prediction is a similarity-weighted average over the k nearest neighbors:

r̂_ui = Σ_{v∈N(u)} sim(u, v) · r_vi / Σ_{v∈N(u)} sim(u, v)

KNN with means: a basic collaborative filtering algorithm taking into account the mean rating of each user. The prediction is set as:

r̂_ui = μ_u + Σ_{v∈N(u)} sim(u, v) · (r_vi − μ_v) / Σ_{v∈N(u)} sim(u, v)

KNN with baseline:
A basic collaborative filtering algorithm taking a baseline rating into account. The prediction is set as:

r̂_ui = b_ui + Σ_{v∈N(u)} sim(u, v) · (r_vi − b_vi) / Σ_{v∈N(u)} sim(u, v)

where b_ui = μ + b_u + b_i is the baseline rating.

Slope one:
A simple yet accurate collaborative filtering algorithm [30]. The prediction is set as:

r̂_ui = μ_u + (1 / |R_i(u)|) Σ_{j∈R_i(u)} dev(i, j)

where R_i(u) is the set of relevant items, i.e., the set of items j rated by u that also have at least one common user with i, and dev(i, j) is the average difference between the ratings of i and those of j:

dev(i, j) = (1 / |U_ij|) Σ_{v∈U_ij} (r_vi − r_vj)

SVD:
The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. When baselines are not used, this is equivalent to Probabilistic Matrix Factorization [25]. The prediction is set as r̂_ui = μ + b_u + b_i + q_iᵀp_u. If user u is unknown, then the bias b_u and the factors p_u are assumed to be zero; the same applies to item i with b_i and q_i [26][27].
SVD++:
An extension of SVD taking into account implicit ratings. The prediction is set as:

r̂_ui = μ + b_u + b_i + q_iᵀ(p_u + |I_u|^(−1/2) Σ_{j∈I_u} y_j)

where the y_j terms are a new set of item factors that capture implicit ratings. Here, an implicit rating describes the fact that a user u rated an item j, regardless of the rating value.
NMF:
A collaborative filtering algorithm based on non-negative matrix factorization, very similar to SVD. The prediction is set as r̂_ui = q_iᵀp_u, where the user and item factors are kept positive. Our implementation follows that suggested in [28], which is equivalent to [29] in its non-regularized form. Both are direct applications of NMF for dense matrices.
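As an illustration of one of the simpler baselines above, a minimal (unweighted) Slope One sketch, assuming ratings are stored as a dict mapping each user to an {item: rating} dict (data layout and names are ours):

```python
def slope_one(train, user, item):
    """Predict user's rating of item as mu_u plus the average deviation
    dev(item, j) over the items j rated by the user (unweighted variant)."""
    u_ratings = train[user]
    mu_u = sum(u_ratings.values()) / len(u_ratings)
    devs = []
    for j in u_ratings:
        # dev(item, j): average of r_vi - r_vj over users v who rated both
        diffs = [v[item] - v[j] for v in train.values() if item in v and j in v]
        if diffs:
            devs.append(sum(diffs) / len(diffs))
    return mu_u + (sum(devs) / len(devs) if devs else 0.0)
```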

Experimental results
Our experiment is carried out on the processed ml-20m dataset. First, we use traditional algorithms as baselines for comparison. Then the tag-aware DDPG algorithm proposed in this paper is employed to calculate the error; meanwhile, to verify whether tags affect the error, we also calculate the error without tags for comparative analysis. Finally, we select a test set containing only new users to independently verify the user cold start issue.
Table 3 shows the minimum errors of the various algorithms. SVD++ performs best among the traditional algorithms, with an MAE of 0.3816 and an RMSE of 0.5620. The minimum errors of our algorithm are slightly lower still: the tag-free variant reaches an MAE of 0.3720 and an RMSE of 0.5432, while the tag-aware variant reaches 0.3900 and 0.5577.
Clearly, the best results of our algorithm are superior to those of the traditional methods. In addition to the advantage in reducing error, a DDPG-based recommender system is more scalable as the number and characteristics of users and items grow, and it can adapt to the dynamic changes of users and items as well. Moreover, deep learning is good at digging out potential connections between users and items, which provides a better basis for optimizing the long-term user experience.
With RMSE as the evaluation indicator, DDPG's performance shows that although it exceeds the traditional algorithms in minimum error, its robustness is poor: the range between the minimum and maximum values is large. Besides, the use of tag applications has little effect on the error but adds interpretability to the recommender system. Moreover, when all users in the test set are new users, DDPG still performs excellently. More details are analyzed below.
a) Tag-free and tag-aware

Name       MAE average  MAE best  RMSE average  RMSE best
Tag-free        0.7165    0.3720        0.9448     0.5432
Tag-aware       0.7143    0.3900        0.9456     0.5577

Table 5 Comparison results of tag-free and tag-aware

Since reinforcement learning learns from scratch, the initial predicted ratings are rather random, making the training error very large at first. However, as training proceeds, reinforcement learning gradually learns the correct strategy; the error decreases and stabilizes.
The average RMSE of tag-free is 0.9448 and the best RMSE is 0.5432; the average RMSE of tag-aware is 0.9456 and the best RMSE is 0.5577. The results show that the application of tags has little effect on accuracy while making up for the lack of interpretability in recommender systems.

Figure 5 The distribution of RMSE of tag-free
Figure 6 The distribution of RMSE of tag-aware

The error distribution follows a normal distribution, and the number of errors to the left of the mean is greater than to the right; that is, most errors are concentrated in the interval with smaller errors. It can be seen that the DDPG algorithm tends to produce smaller errors, i.e., more accurate predictions.
b) Cold start:

Name        MAE average  MAE best  RMSE average  RMSE best
Cold Start       0.7044    0.3600        0.9388     0.4939

Table 6 The results of Cold Start
Figure 7 RMSE of cold start
Figure 8 The distribution of RMSE of cold start

When the test set contains only new users, the accuracy of DDPG remains high. It can be inferred that deep reinforcement learning offers a good solution to the user cold start problem, which former algorithms cannot solve.
DDPG shows a lower error in dealing with the user cold start problem, which indicates that the method adopted in this paper also mitigates overfitting.
The error distribution again follows a normal distribution, and the DDPG algorithm again tends to produce smaller errors.

Conclusion
The combination of deep reinforcement learning and recommender systems has become a popular trend, and internet giants like Google and Alibaba have done a lot of theoretical exploration and engineering practice. In this paper, the DDPG algorithm is applied to predict ratings in a recommender system. Since DDPG is generally used to deal with large-scale continuous actions, this paper first discretizes the continuous action, which is the movie rating. Although the average error is higher than that of traditional algorithms, the minimum error is much smaller than that of existing recommendation algorithms, and the results of this experiment tend toward smaller errors. Then, without increasing the error, tags are used to make up for the lack of interpretability in the recommender system. Finally, on the issue of user cold start, the experiment shows that the recommendation algorithm used in this paper achieves smaller errors and also has a good effect on the overfitting problem.
For future work, we have the following directions: 1) Scalability. This paper uses the MovieLens 20M dataset; we can continue research on the 25M and latest datasets to explore scalability. 2) Robustness. Although the error of the DDPG algorithm converges to a good minimum, the error range is large, so there is room for improving robustness. 3) Parameters. The DDPG algorithm requires a lot of tuning, a common problem in machine learning. We hope to propose more adaptive recommendation algorithms.