Investigating the Temporal Effect of User Preferences with Application in Movie Recommendation

With the rapid development of the mobile Internet and smart devices, more and more online content providers have begun to collect the preferences of their customers through various apps on mobile devices. These preferences are largely reflected by explicit rating scores on online items. Both positive and negative ratings are helpful for recommender systems to provide relevant items to a target user. Based on an empirical analysis of three real-world movie-rating data sets, we observe that users' rating criteria change over time and that past positive and negative ratings have different influences on users' future preferences. Given this, we propose a recommendation model on a session-based temporal graph that considers the difference between long- and short-term preferences as well as the different temporal effects of positive and negative ratings. Extensive experimental results validate the significant accuracy improvement of our proposed model over state-of-the-art methods.


Introduction
Nowadays, a huge ecosystem of independent content providers (such as Facebook, Netflix, Google Maps, and Snapchat) and consumers (web users) is emerging on the mobile Internet. Confronted with the problem of finding a needle in a haystack, many web users resort to information filtering technology to find more relevant content. Recommender systems have been deployed on the websites of many industries [1] to make web services more relevant and engaging for their users and to promote the scale and profitability of such businesses [2]. In recent decades, recommender systems have received considerable research attention, and many effective recommendation approaches have been proposed, such as social network-based recommendation models [3], graph-based recommendation models [4,5], and context-aware recommendation models [6,7]; a recent and up-to-date review can be found in the work of Lu et al. [8].
Many of these works focus on movie recommendation or are based on movie-rating data sets [9,10]. Typically, in online video-watching websites with recommender systems, users are asked to rate movies with discrete scores to express their individual opinions, where a high score usually indicates user preference for the movie. Taking https://www.netflix.com as an example, users are encouraged to rate movies and TV shows (items in general) on a rating scale from 1 star to 5 stars, where one star means "Hate It" and five stars mean "Love It." This kind of explicit feedback largely reflects user preferences. Even if a user dislikes a movie after watching it, he must have been attracted by its title, cast, director, genres, or other features; otherwise he would never have watched it. Hence, negative ratings carry much useful information and should not be neglected or treated as purely negative signals. Many works have shown that both positive and negative opinions help make effective recommendations.
First, given a rating scale where the highest score denotes the most positive opinion and the lowest score the most negative one, users' rating scores are not distributed evenly along the whole rating scale [11]. Second, different users may have different rating criteria: some good-tempered users are willing to give high scores, whereas other, more critical people seldom give full marks to any item they have watched [12]. Last but not least, negative ratings indicate dislike and, simultaneously, relevance; they may play either a negative or a positive role depending on the sparsity of the training set and the popularity of the corresponding items [13].
As mobile platforms become more and more user friendly, computationally powerful, and readily available, online content providers have begun to develop mobile apps that offer more personalized content. People can watch their favorite movies and TV shows wherever and whenever they have a break. This mobile feature poses a new challenge to recommender systems. Most previous works do not consider the temporal variation in users' rating criteria. According to the memory effect of movie-watching behavior [14] and the anchoring bias in movie-rating behavior [15], the current rating of a user is influenced by his previous watching and rating history. Therefore, individual rating criteria may vary over different periods, depending on the items he has previously watched. Besides, a user's negative ratings may have a temporal influence on his future preference that differs from that of his positive ratings.
In this paper, we empirically analyze three typical data sets created by popular online video services (MovieLens, Netflix, and MovieTweetings), focusing on the temporal effects of the rating behavior of each individual user. We concentrate on the time-varying rating criterion and the different temporal effects of positive and negative ratings on future behavior. We then propose a session-based recommendation model that takes these temporal characteristics of user ratings into account. Compared with five state-of-the-art methods on the aforementioned movie-rating data sets, our proposed model is shown to give more accurate predictions of user preference.

Empirical Analysis
In this section, we empirically analyze the temporal variation in users' rating criteria and the temporal effects of positive and negative ratings, with the aim of understanding the temporal characteristics of users' rating behaviors and verifying the following two assertions.
Assertion I. The rating criterion of a user varies over time.
Assertion II. The positive and negative ratings of a user have different temporal influences on his future preference. For the convenience of readers, we list all the notations used in this paper in "Notations" (Table 1).

The Rating Criterion.
In this paper, we investigate users' rating criteria in two respects: average rating score and rating scale. Specifically, we consider the monthly average rating score and rating scale of each user as two independent random variables and estimate their standard deviations across months. To obtain a reliable estimate, we consider only the users who are active in more than 2 months of the whole period. Figures 1(a), 1(b), and 1(c) show the distributions of the standard deviation of average rating scores for the MovieLens, Netflix, and MovieTweetings data sets, respectively. The mean values of the deviations for the three data sets (0.36427, 0.56429, and 0.79849) are all significantly greater than 0 (p value ≈ 0, obtained by a one-sided t-test). Similarly, Figures 1(d), 1(e), and 1(f) show the distributions of the standard deviation of users' rating scales. The mean values for MovieLens, Netflix, and MovieTweetings are 0.93737, 0.93495, and 1.05155, respectively (p value ≈ 0, obtained by a one-sided t-test). These observations indicate that every user has a significantly changing rating criterion over time, providing empirical evidence for Assertion I.
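The per-user monthly estimate described above can be sketched as follows. This is a minimal Python sketch; the `(user, score, timestamp)` record format and the use of the population standard deviation are assumptions, since the paper does not specify its data layout or estimator.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean, pstdev

def monthly_criterion_deviation(ratings):
    """ratings: iterable of (user, score, timestamp) tuples.
    Returns, per user, the standard deviation of the monthly average
    rating score, keeping only users active in more than 2 distinct
    months (as in the paper's reliability filter)."""
    by_user_month = defaultdict(lambda: defaultdict(list))
    for user, score, ts in ratings:
        # Bucket each rating into a calendar month (UTC).
        month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
        by_user_month[user][month].append(score)
    result = {}
    for user, months in by_user_month.items():
        if len(months) > 2:
            monthly_avgs = [mean(scores) for scores in months.values()]
            result[user] = pstdev(monthly_avgs)  # population std dev
    return result
```

The same routine works for the rating scale by replacing `mean(scores)` with `max(scores) - min(scores)` per month.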

The Positive and Negative Ratings.
Note that the rating criterion on items varies from person to person; we therefore take the middle score m_u = (r_u^max + r_u^min)/2 of each individual user u, instead of the middle of the systematic rating scale, to distinguish his own positive ratings (rating score no less than m_u) from negative ratings (rating score less than m_u).
We use a session to represent a continuous period of user activity; the records of user u are thus divided into several sequential sessions S_u = {s_{u,1}, s_{u,2}, ..., s_{u,n}}. In this paper, the sessions are divided by month; that is, two ratings of the same user belong to the same session if and only if they occur in the same month. For a user u, the items to which user u gives positive ratings in session s_{u,t} constitute his positive item set P_{u,t}, and the negative item set N_{u,t} is defined similarly. For a target user u, we take his latest positive item set P_{u,n} as the future preference, and all the previous positive and negative item sets P_{u,t} and N_{u,t}, t = 1, 2, ..., n − 1, are treated as previous interests.
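The session split for a single user can be sketched as below; the `(item, score, timestamp)` input format is illustrative, not the paper's actual data layout.

```python
from collections import defaultdict
from datetime import datetime, timezone

def split_sessions(user_ratings):
    """user_ratings: list of (item, score, timestamp) for one user.
    Returns {month: (positive_items, negative_items)}, where positive
    means score >= m_u = (r_max + r_min) / 2 for this user."""
    scores = [score for _, score, _ in user_ratings]
    m_u = (max(scores) + min(scores)) / 2  # per-user middle score
    sessions = defaultdict(lambda: (set(), set()))
    for item, score, ts in user_ratings:
        month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
        positive, negative = sessions[month]
        (positive if score >= m_u else negative).add(item)
    return dict(sessions)
```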
The correlation Cor(A, B) between two item sets A and B is defined as the average cosine similarity over all item pairs with one item from A and the other from B. Figure 2 plots the correlations Cor(P_{u,n}, P_{u,t}) (the black line) and Cor(P_{u,n}, N_{u,t}) (the red line) against the time gap |n − t|, averaged over all users, for the MovieLens, Netflix, and MovieTweetings data sets. We can see that the future preference of a user is clearly more influenced by his past positive ratings than by his past negative ratings. From the temporal point of view, the bigger the time gap, the less the future preference is influenced by the previous positive/negative ratings. However, the decay rates of the influences of positive and negative opinions vary across data sets. For MovieLens, the influence of positive ratings on future preference is more stable than that of negative ratings (decay rates 0.00584 for positive versus 0.02325 for negative), while for Netflix the decay rates of positive and negative ratings are very similar (0.00747 versus 0.00912).
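The set correlation Cor(A, B) can be sketched as follows. Here each item is represented by the set of users who rated it, so the cosine similarity of two binary item vectors reduces to a set computation; this binary representation is an assumption, as the paper does not specify the item vectors used.

```python
from itertools import product
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two items represented as the sets of
    users who rated them (binary incidence vectors)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def set_correlation(A, B, raters):
    """Cor(A, B): average cosine similarity over all item pairs (x, y)
    with x from A and y from B; raters maps item -> set of users."""
    pairs = [(x, y) for x, y in product(A, B)]
    if not pairs:
        return 0.0
    return sum(cosine(raters[x], raters[y]) for x, y in pairs) / len(pairs)
```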
Since the first and last sessions of the MovieTweetings data set contain data from only 1 day and 2 days, respectively, we ignore the last points of the curves, with a time gap of 6 months. In contrast to the above observations, we find that the influence of negative ratings is more stable than that of positive ratings in MovieTweetings (decay rates 0.01174 for positive versus 0.00319 for negative). Therefore, users' positive and negative ratings have different temporal influences on their future preferences, providing empirical evidence for Assertion II.

Recommendation Model
Based on the session-based temporal graph (STG) introduced by Xiang et al. [4], we propose a session-based recommendation model with the temporal effect of user preferences (STeuP), an enhanced version of the Injected Preference Fusion (IPF) model associated with the STG. Users and items are represented by user nodes u ∈ U and movie nodes v ∈ V, respectively. To represent users' ratings in different periods, we associate a session node s_{u,t} with the movies rated by user u in that session. These three types of nodes are connected by weighted directed edges, namely, the edge sets E_{U,V}, E_{V,U}, E_{S,V}, and E_{V,S}. The edges attached to session nodes reflect the short-term rating criteria of users, while the edges attached to user nodes reflect users' long-term preferences. Figure 3 gives an example of a session-based temporal graph.
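The node/edge skeleton of the STG can be sketched as below. The node labels such as `('u', user)` are illustrative conventions, and the temporally decayed edge weights defined in the next paragraphs are attached separately.

```python
from collections import defaultdict

def build_stg(records):
    """records: iterable of (user, item, score, month) tuples.
    Builds the skeleton of a session-based temporal graph with user
    nodes ('u', user), item nodes ('v', item), and session nodes
    ('s', user, month), connected in both directions."""
    graph = defaultdict(set)
    for user, item, score, month in records:
        u, v, s = ("u", user), ("v", item), ("s", user, month)
        graph[u].add(v); graph[v].add(u)  # long-term edges E_UV / E_VU
        graph[s].add(v); graph[v].add(s)  # short-term edges E_SV / E_VS
    return graph
```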
To eliminate the effect of the different rating criteria of different individuals, the rating score of a user is normalized according to his own rating scale,

r̂(u, v) = (r(u, v) − r_u^min) / (r_u^max − r_u^min),  (1)

which reflects the user's long-term rating criterion. In this way, the rating scores of all users are strictly mapped to [0, 1], where the maximum rating score of each user becomes 1 and the minimum rating score becomes 0. Since the short-term rating criterion of a user varies over periods, we normalize his rating score within a particular session s_{u,t} by

r̃(u, v) = (r(u, v) − r_{u,t}^min) / (r_{u,t}^max − r_{u,t}^min).  (2)

Recall that our recommendation task is to recommend movies for a target user to watch in the future. Naturally, a rating whose occurrence time is closer to the target time T is more useful to this task. Since the temporal influences of positive and negative ratings may differ, following previous works [19,20], we use two exponential functions, with decay factors α and β, to model the relevance of positive and negative ratings at time t_{u,v} to the user's preference at the target time T. Hence, the edge weights in E_{U,V} and E_{V,U} are defined as

w(u, v) = w(v, u) = r̂(u, v) · e^{−α(T − t_{u,v})} if r(u, v) ≥ m_u, and r̂(u, v) · e^{−β(T − t_{u,v})} otherwise.  (3)

Similarly, within a given session, the rating whose occurrence time is closer to the target time T is more important within that session. We use the same exponential functions to model the temporal influences of positive and negative ratings within a session, with the middle rating value of the session, m_{u,t}, used to distinguish positive from negative ratings. Thus, the edge weights of E_{S,V} and E_{V,S} are calculated by

w(s_{u,t}, v) = w(v, s_{u,t}) = r̃(u, v) · e^{−α(T − t_{u,v})} if r(u, v) ≥ m_{u,t}, and r̃(u, v) · e^{−β(T − t_{u,v})} otherwise.  (4)

After setting the initial edge weights of the STG, we normalize them so that the outgoing weights of each node sum to 1; for an item node v, the weights toward its user neighbors are additionally scaled by a factor λ relative to those toward its session neighbors:

ŵ(v, u) = λ · w(v, u) / (λ · Σ_{u′∈Γ_v∩U} w(v, u′) + Σ_{s∈Γ_v∩S} w(v, s)),
ŵ(v, s) = w(v, s) / (λ · Σ_{u′∈Γ_v∩U} w(v, u′) + Σ_{s′∈Γ_v∩S} w(v, s′)),  (5)

while the outgoing weights of user and session nodes are simply normalized by their sums. A larger λ indicates that users' long-term preferences play a more important role in preference propagation. Given a target user u_i, the basic idea of preference propagation is to first inject an initial preference into both the user node u_i and his latest session node and then propagate the preference to candidate movie nodes through various paths in the graph. As defined in [4], the preference propagated along a path p is the product of the initial preference Φ(v_0) assigned to the source node (the target user node u_i or the latest session node s_{u,n}) and the weights of all edges on the path:

w(p) = Φ(v_0) · Π_{e_{v,v′}∈p} ŵ(v, v′).  (6)

Φ(v_0) depends on the node type: Φ(v_0) = η if v_0 is the user node and Φ(v_0) = 1 − η if v_0 is the session node, so η = 0 means no preference is injected into the user node, while η = 1 means no preference is injected into the session node. Similar to the previous work [4], we consider only the shortest paths (distance = 3) from the source nodes to unknown movie nodes, which can be obtained efficiently by Breadth-First Search. Consequently, we use P_{i,j} to represent the set of shortest paths from the source nodes to an unknown movie node v_j for user u_i, and the estimated preference of user u_i for movie v_j is then measured as

r̂_{i,j} = Σ_{p∈P_{i,j}} w(p),  (7)

where w(p) is the weight of path p defined in (6). The top-ranked movies sorted by preference value are then recommended.
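The decayed edge weights and the path-based propagation can be sketched as follows. The default α and β values are illustrative placeholders, not tuned values from the paper, and paths are represented as plain node tuples.

```python
from math import exp

def edge_weight(norm_score, rating_time, target_time, positive,
                alpha=0.005, beta=0.02):
    """Temporal weight of a rating edge: the normalized score decayed
    exponentially, with separate decay factors for positive (alpha)
    and negative (beta) ratings."""
    decay = alpha if positive else beta
    return norm_score * exp(-decay * (target_time - rating_time))

def propagate(paths, weights, eta):
    """Score an unknown item by summing, over all shortest paths from
    the source nodes, the injected preference times the product of the
    edge weights. Each path is a node tuple starting at the user node
    ('u', ...) or the latest session node ('s', ...)."""
    score = 0.0
    for path in paths:
        # Injected preference: eta on the user node, 1 - eta on the session node.
        w = eta if path[0][0] == "u" else 1.0 - eta
        for a, b in zip(path, path[1:]):
            w *= weights[(a, b)]
        score += w
    return score
```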

Evaluation Metrics.
In order to predict users' future preferences based on past interaction records, all the records are sorted in ascending order of rating time. For all data sets, we take the records that occurred in the latest 30 days as the probe set and the remaining records as the training set. The training set is treated as known information, while no information from the probe set is allowed to be used for recommendation. Moreover, we denote the latest time in the training set as the target time T. In this paper, four typical metrics are employed to evaluate the accuracy, diversity, novelty, and coverage of the recommendation results.

Accuracy.
Accuracy is one of the most important evaluation metrics of a recommender system. Both Precision and Recall can be used to measure the accuracy of recommendation. Precision P is the fraction of recommended items that are relevant, while Recall R is the ratio of the number of relevant items in the recommendation list to the number of preferred items in the probe set. However, Precision and Recall behave like the two ends of a seesaw; that is, given a fixed length of the recommendation list, when one rises, the other falls. The F1 measure is used to find a suitable trade-off between Precision and Recall, defined as F1 = 2 · P · R / (P + R), where P = (1/|U|) Σ_{i=1}^{|U|} (h_i / L) and R = (1/|U|) Σ_{i=1}^{|U|} (h_i / p_i). Here, |U| is the number of users, L is the length of the recommendation list, h_i is the number of relevant items in the recommendation list of u_i, and p_i is the number of all preferred items in the probe set of user u_i. Generally speaking, for a given length L of the recommendation list, the method with the higher F1 value is the better one.
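The macro-averaged Precision, Recall, and F1 above can be sketched as follows; the dictionary-based input format is an assumption, and users with an empty probe set are skipped here for simplicity.

```python
def f1_at_L(rec_lists, probe, L):
    """Macro-averaged Precision, Recall, and F1 at list length L.
    rec_lists: {user: ranked item list}; probe: {user: set of held-out items}."""
    users = [u for u in rec_lists if probe.get(u)]
    hits = {u: len(set(rec_lists[u][:L]) & probe[u]) for u in users}
    precision = sum(hits[u] / L for u in users) / len(users)
    recall = sum(hits[u] / len(probe[u]) for u in users) / len(users)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```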

Diversity.
Diversity measures the difference between the recommendation lists of different users. An excellent algorithm should recommend items that are as widely distributed as possible, because people are glad to get personalized suggestions. We use the Hamming distance H_{ij} = 1 − C_{ij}/L to measure the diversity of recommendation lists, where C_{ij} is the number of common items in the recommendation lists of users u_i and u_j. H_{ij} = 0 if u_i and u_j get identical recommendation lists of L items. Diversity is defined as the mean Hamming distance over all user pairs.

Novelty.
Novelty quantifies the capacity of a method to generate novel and unexpected recommendations, which may be contributed largely by less popular items (i.e., items of low degree) that are unlikely to be known previously.
It can be simply measured as the average degree of the recommended items. Specifically, for a target user u_i whose top-L recommendation list is denoted by L_i, his novelty is defined as [21] Novelty_i = (1/L) Σ_{v∈L_i} k_v, where k_v is the degree of item v. Averaging the novelty over all users, we obtain the novelty of the system.
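The diversity and novelty metrics above can be sketched together; `degree` maps each item to its number of ratings in the training set, an assumed input format.

```python
from itertools import combinations

def diversity(rec_lists, L):
    """Mean inter-user Hamming distance H_ij = 1 - C_ij / L, where C_ij
    is the number of items shared by the lists of users i and j."""
    pairs = list(combinations(rec_lists, 2))
    if not pairs:
        return 0.0
    return sum(1 - len(set(rec_lists[i]) & set(rec_lists[j])) / L
               for i, j in pairs) / len(pairs)

def novelty(rec_lists, degree):
    """Average degree (popularity) of recommended items, averaged over
    users; a lower value means more novel recommendations."""
    per_user = [sum(degree[v] for v in items) / len(items)
                for items in rec_lists.values()]
    return sum(per_user) / len(per_user)
```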

Coverage.
Coverage measures the percentage of items that an algorithm is able to recommend to users. It is calculated as the ratio of the number of distinct items in the users' lists to the total number of items in the system, Coverage = (1/|V|) Σ_{j=1}^{|V|} δ_j, where |V| is the number of items in the system, δ_j = 1 if item v_j is recommended to at least one user (i.e., v_j appears in at least one user's list), and δ_j = 0 otherwise. Undoubtedly, recommending more popular items results in lower coverage.
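Coverage reduces to a short set computation (Python sketch; the input format matches the previous metric sketches):

```python
def coverage(rec_lists, all_items):
    """Fraction of catalogue items appearing in at least one user's
    recommendation list."""
    recommended = set()
    for items in rec_lists.values():
        recommended.update(items)
    return len(recommended) / len(all_items)
```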

Parameter Adjustment.
Before comparing the proposed model with the baseline methods, we investigate the impact of the parameters α, β, η, and λ on the performance of the STeuP model. As we saw in Section 2.3, the temporal effects of positive and negative opinions may differ across online websites. Thus, we first examine the effect of the parameters α and β, which govern the decay rates of the temporal influence of positive and negative opinions on users' future preference. The bigger the parameters α and β, the less future behavior is affected by users' past positive and negative opinions. Without loss of generality, we set λ = 1 and η = 0.5 when tuning α and β.
Figures 4(a), 4(b), and 4(c) plot the heat map of F1 against the parameters α and β for MovieLens, Netflix, and MovieTweetings, respectively. log α is along the x-axis, while the y-axis is for log β; the different F1 values are indicated by different colors. First, we observe that F1 is more sensitive to α than to β; that is, given a fixed value of β, the range over which F1 changes is much bigger when traversing the parameter α. Second, the results on all data sets show an obvious "ridge" along the β-axis where the optimal F1 value is obtained. Hence, we can first fix β to a small value (10^−2 to 10^−3) and tune α to find a locally optimal value, and then fix α and adjust β to find the globally optimal accuracy.
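The two-stage tuning suggested by the ridge can be sketched as a simple coordinate search; here `evaluate` stands in for a full train-and-score run, and the grids and initial value are illustrative.

```python
def two_stage_tuning(evaluate, alpha_grid, beta_grid, beta_init=1e-3):
    """Two-stage search over the (alpha, beta) grid: first tune alpha
    with beta fixed to a small value, then tune beta with the best
    alpha. evaluate(alpha, beta) returns the F1 score."""
    best_alpha = max(alpha_grid, key=lambda a: evaluate(a, beta_init))
    best_beta = max(beta_grid, key=lambda b: evaluate(best_alpha, b))
    return best_alpha, best_beta
```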
By setting both α and β to 0, we obtain the recommendation results without temporal influence, which are presented in Table 2. We can see that considering temporal influence, by weighting users' positive and negative opinions with different temporal decay rates, leads to performance improvements. From the values of α and β at the optimal F1, we find that the decay rate of the temporal effect of positive opinions is much smaller than that of negative opinions on MovieLens, but bigger than that of negative opinions for MovieTweetings. For Netflix, the decay rates of positive and negative opinions are almost the same. This again validates our inference in Section 2.3 regarding the different temporal influences of positive and negative opinions on the three data sets.
In our STeuP model, the parameter η controls the ratio of preference injected into the user node versus the session node. If η equals 0, no preference is injected into the user node; if η equals 1, no preference is injected into the session node. Thus, η balances the effect of long-term and short-term interests in the initial phase: the larger η is, the stronger the influence of long-term preferences. The results of how accuracy changes with η for the three data sets are shown in Figure 5. First, the results show that ignoring long-term preferences (η = 0) does not generate good results. Second, the sparser the data set, the bigger the η value at the optimal F1. Generally speaking, optimal results can be obtained by combining long-term and short-term interests. In the following discussion, we fix η to 0.5, 0.9, and 1.0 for the MovieLens, Netflix, and MovieTweetings data sets, respectively.
The parameter λ balances the influence of long-term and short-term preferences during preference propagation: λ = 0 means that item nodes propagate preference only to session nodes, so item-item similarity depends only on users' short-term preferences, and vice versa. Figures 6(a), 6(b), and 6(c) plot the change of F1 against λ on the three data sets. As the x-axis is the logarithmic value of λ, we can see that, for the MovieLens and Netflix data sets, a λ close to 1 corresponds to the optimal F1, while the optimal λ for MovieTweetings is close to 10^2. This observation verifies that both users' long-term and short-term opinions are important for measuring item similarity. Furthermore, users' long-term opinions are more important than short-term opinions on sparse data sets.

Comparison of Methods.
We compare our proposed model with five other models, listed in Table 3. The mark √ in the table indicates whether a model distinguishes user ratings into positive and negative opinions and/or considers temporal influence. UCF is a classical collaborative filtering method that calculates the similarity between users based on rating information, and NBI is a network-based inference method on the user-item bipartite graph; the remaining models are described in Table 3. In addition, we also checked the performance of a well-known matrix factorization recommendation algorithm [22], which is very successful in rating prediction with the help of time information. However, the F1 values of this method on the aforementioned three data sets are all less than 0.001, perhaps because it does not fit the setting of binary preference prediction. Hence, we do not present the results of the matrix factorization method for comparison in this paper.
Given the recommendation length L = 10, the recommendation performance of these six methods for MovieLens, Netflix, and MovieTweetings is reported in Tables 4, 5, and 6, respectively.

Notations:
r_{u,t}^max, r_{u,t}^min: The highest and lowest scores in session s_{u,t}
E_{X,Y}: The edge sets from node set X to node set Y
h_i: The number of relevant items (namely, the items collected by user u_i in the probe set) in the recommendation list of u_i
p_i: The number of selected items in user u_i's probe set
t_{u,v}: The time stamp when user u rates movie v
α, β: The decay factors controlling the extent of the temporal influences of positive and negative ratings
Γ_v, Γ_v∩S, Γ_v∩U: The neighbor set, neighbor session set, and neighbor user set of item v
λ: The parameter used to adjust the preference propagation of an item node to its user neighbors or session neighbors
v, p: A node and a path on the session-based graph
ŵ(v, v′): The weight of edge e_{v,v′} in path p
Φ(v_0): The value of the injected preference on the source node v_0
η: The parameter used to tune the ratio of injected preferences on the user node against the session node.

Figure 1 :
Figure 1: The distributions of standard deviation of average rating scores and users' rating scales for MovieLens, Netflix, and MovieTweetings data sets, respectively.

Figure 2 :
Figure 2: The temporal influence of a user's positive and negative opinions for MovieLens, Netflix, and MovieTweetings. The black and red curves represent the user's positive and negative opinions, respectively.

Figure 3 :
Figure 3: An example of the session-based temporal graph.

Figure 4 :
Figure 4: The heat map of F1 against parameters α and β on MovieLens, Netflix, and MovieTweetings.

Table 1 :
Basic statistics of the three data sets. The sparsity is defined as |R|/(M · N), where |R| is the number of ratings and M and N denote the numbers of users and items.

Table 2 :
Performance comparison of the STeuP model with and without the temporal influence of positive and negative ratings on MovieLens, Netflix, and MovieTweetings.

Table 3 :
The six recommendation models compared in this paper. The mark √ under PN or T indicates whether the model distinguishes ratings as positive and negative opinions and/or considers temporal influence. SNBI is an enhanced version of NBI that assigns two different weights to positive and negative opinions. UOS and SNBI distinguish users' ratings as positive and negative opinions but do not utilize temporal influence. IPF is the recommendation model proposed together with the session-based temporal graph, which is based on binary data and distinguishes users' long-term and short-term preferences. STeuP is our proposed model based on the two assertions in Section 2, which takes into account both the temporal variation of users' rating criteria and the different temporal effects of users' positive and negative ratings.
r_i^max, r_i^min: The highest and lowest rating scores made by user u_i
m_i: The middle value of user u_i's ratings
P_{u,t}: The positive item set in s_{u,t}
N_{u,t}: The negative item set in s_{u,t}