Normalizing Item-Based Collaborative Filter Using Context-Aware Scaled Baseline Predictor

Item-based collaborative filter algorithms play an important role in modern commercial recommendation systems (RSs). To improve recommendation performance, normalization is commonly used as a basic component of the predictor models. Among the many normalization methods, subtracting the baseline predictor (BLP) is the most popular. However, the BLP uses a statistical constant that does not take the context into account. We found that slightly scaling the different components of the BLP separately can substantially improve performance. This paper proposes several normalization methods based on scaled baseline predictors derived from different kinds of context information. The experimental results show that using a context-aware scaled baseline predictor for normalization indeed yields better recommendation performance in terms of RMSE, MAE, precision, recall, and nDCG.


Introduction
The abundance of information available on the Internet makes it increasingly difficult for people to find what they want, especially in the electronic commerce domain. As a consequence, building personalized information selection models is becoming crucial. Among the many information selection technologies, recommendation systems have developed rapidly because they are deployed by most of the well-known online shopping companies [1,2].
Algorithms for recommending items have been studied extensively, and most of them belong to two main categories. Content-based recommendation systems try to recommend items according to the users' past preferences [3][4][5], whereas collaborative recommendation systems make recommendations based on the preferences of similar neighbors [6][7][8][9]. Recommendation systems based purely on content tend to suffer from limited content analysis and overspecialization. Defining appropriate item features is very difficult in many situations, and these features depend heavily on the users' history, so latent profiles cannot be discovered for recommendation.
Collaborative filter (CF) approaches overcome some of the limitations of content-based ones. Items for which the content is not available or is difficult to obtain can still be recommended to users through the feedback of other users. CF approaches can also recommend items with very different content, as long as other users have already shown interest in these items. Among collaborative recommendation approaches, methods based on nearest neighbors still enjoy great popularity, due to their simplicity, their efficiency, and their ability to produce accurate and personalized recommendations [10][11][12]. CF models try to capture the interactions between users and items that produce the different rating values. However, many of the observed rating values are due to effects associated with either users or items, independently of their interaction. A principal example is that typical CF data exhibit large user and item biases, that is, systematic tendencies for some users to give higher ratings than others and for some items to receive higher ratings than others.
Item-based collaborative filter [13,14] is considerably more accurate than the user-based one [15,16] when the number of items is larger than the number of users. Electronic commerce businesses usually offer a huge number of products, far exceeding the number of users. However, the average number of common ratings is very small, because most users are interested in only a few items. User-based collaborative filter systems easily suffer from overfitting in this situation. Item-based collaborative filter algorithms therefore play an important role in modern commercial recommendation systems (RSs). This paper intends to improve the recommendation performance using a novel rating normalization strategy.
When it comes to assigning a rating to an item, each user has his or her own personal scale. Even if an explicit definition of each possible rating is supplied, some users might be reluctant to give high or low scores to items they liked or disliked. Different rating normalization schemes have been designed for different reasons [17][18][19]. Also, many of the observed rating values are due to effects associated with either users or items, independently of their interaction. We therefore not only convert individual ratings to a more universal scale but also consider the user and item biases.
The baseline predictor (BLP), which combines the overall average rating with the user or item biases, involves these factors for normalization. However, for item-based collaborative filter systems, the BLP is always a statistical constant that cannot be adaptively changed according to the context [20][21][22][23]. We found that the recommendation performance can be improved if we slightly scale the different parts of the BLP within a limited range. In this paper, we provide several novel context-aware scaled baseline predictors (CASBLP) for item-based collaborative filter normalization, considering different context information. The experimental results show that CASBLP can significantly improve the prediction performance in terms of Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), precision, recall, and Normalized Discounted Cumulative Gain (nDCG).
The rest of this paper is organized as follows. We present the details of CASBLP in Section 2 and show experimental results in Section 3. Finally, we conclude the paper in Section 4.

Baseline Predictor for Item-Based CF Normalization.
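A general neighborhood-based collaborative filter recommendation using BLP normalization is defined as follows:

$$\hat{r}_{ui} = b_{ui} + \Upsilon_k(u, i), \tag{1}$$

where $\Upsilon_k(u, i)$ is the rating predictor based on the $k$ nearest neighbors and $b_{ui}$ is the baseline predictor, which is usually defined as

$$b_{ui} = \mu + b_u + b_i. \tag{2}$$

Denote by $\mu$ the average rating. The parameters $b_i$ and $b_u$ indicate the observed deviations of item $i$ and user $u$, respectively, from the average.
For item-based CF, we do not use the user biases, because similar items are used as neighbors. So the BLP in item-based CF is

$$b_{ui} = \mu + b_i, \tag{3}$$

and $\Upsilon_k(u, i)$ is replaced by the following formula:

$$\Upsilon_k(u, i) = \frac{\sum_{j \in S(i,k) \cap I(u)} w_{ij}\,(r_{uj} - b_{uj})}{\sum_{j \in S(i,k) \cap I(u)} |w_{ij}|}, \tag{4}$$

where $S(i, k)$ is the set of the $k$ most similar items to the item $i$, $I(u)$ is the set of items the user $u$ has rated, and $w_{ij}$ is the similarity weight between items $i$ and $j$.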
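The prediction rule described above can be sketched in Python as follows. This is only a minimal illustration, assuming the standard form in which the prediction is the item baseline plus a similarity-weighted average of baseline-normalized neighbor ratings. The function names, the `neighbors` structure (for each item, a list of `(item, weight)` pairs sorted by decreasing similarity), and the fallback to the baseline when no neighbor is available are assumptions made for the example, not details taken from the paper.

```python
from collections import defaultdict


def item_baselines(ratings):
    """ratings: iterable of (user, item, rating) triples.

    Returns (mu, b) where mu is the global mean rating and b[i] is the
    mean deviation of item i's ratings from mu (the item bias).
    """
    ratings = list(ratings)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    sums, counts = defaultdict(float), defaultdict(int)
    for _, i, r in ratings:
        sums[i] += r - mu
        counts[i] += 1
    return mu, {i: sums[i] / counts[i] for i in sums}


def predict(u, i, user_ratings, neighbors, mu, b, k=20):
    """Item-based prediction with baseline normalization.

    user_ratings: dict user -> {item: rating}
    neighbors:    dict item -> [(other_item, weight), ...], most similar first
                  (hypothetical precomputed structure)
    """
    base = mu + b.get(i, 0.0)  # item-based baseline: mu + b_i
    rated = user_ratings.get(u, {})
    # The k most similar items to i, restricted to items that u has rated.
    used = [(j, w) for j, w in neighbors.get(i, [])[:k] if j in rated]
    denom = sum(abs(w) for _, w in used)
    if denom == 0.0:
        return base  # assumption: fall back to the baseline alone
    num = sum(w * (rated[j] - (mu + b.get(j, 0.0))) for j, w in used)
    return base + num / denom
```

Restricting the neighbor list before intersecting with the user's rated items follows the wording of the text; some implementations instead pick the $k$ most similar items among those the user has rated.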
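There are many different similarity weight functions. In this paper, we use two popular ones, Cosine and Pearson's Correlation, which are defined, respectively, as

$$w_{ij}^{\cos} = \frac{\sum_{u \in U_{ij}} r_{ui}\, r_{uj}}{\sqrt{\sum_{u \in U_{ij}} r_{ui}^2}\,\sqrt{\sum_{u \in U_{ij}} r_{uj}^2}}, \qquad w_{ij}^{\mathrm{pcc}} = \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)^2}\,\sqrt{\sum_{u \in U_{ij}} (r_{uj} - \bar{r}_j)^2}}, \tag{5}$$

where $U_{ij}$ is the set of users who have rated both items $i$ and $j$ and $\bar{r}_i$ is the mean rating of item $i$.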
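The two similarity weights can be sketched in plain Python as below. The sketch assumes that both measures are computed over the users who rated both items and that the Pearson means are taken over those same co-rating users; the source does not state which variant is used, so these choices (and the function names) are assumptions.

```python
import math


def cosine(ri, rj):
    """Cosine similarity between two items.

    ri, rj: dicts mapping user -> rating for items i and j.
    Only users who rated both items contribute (an assumption).
    """
    common = ri.keys() & rj.keys()
    if not common:
        return 0.0
    dot = sum(ri[u] * rj[u] for u in common)
    ni = math.sqrt(sum(ri[u] ** 2 for u in common))
    nj = math.sqrt(sum(rj[u] ** 2 for u in common))
    return dot / (ni * nj) if ni and nj else 0.0


def pearson(ri, rj):
    """Pearson correlation between two items over their co-rating users."""
    common = ri.keys() & rj.keys()
    if len(common) < 2:
        return 0.0  # correlation is undefined for fewer than two points
    mi = sum(ri[u] for u in common) / len(common)
    mj = sum(rj[u] for u in common) / len(common)
    cov = sum((ri[u] - mi) * (rj[u] - mj) for u in common)
    si = math.sqrt(sum((ri[u] - mi) ** 2 for u in common))
    sj = math.sqrt(sum((rj[u] - mj) ** 2 for u in common))
    return cov / (si * sj) if si and sj else 0.0
```

Pearson's Correlation removes each item's mean before comparing, whereas Cosine compares the raw rating vectors, which is why the two can rank neighbors differently on the same data.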

Motivation of Scaling Baseline Predictor.
The baseline predictor can introduce information that is independent of the neighborhood influence, but it is usually fixed as a constant. However, we found that slightly scaling the baseline predictor can yield better prediction accuracy, although using a single scaling factor for $\mu$, $b_u$, and $b_i$ is not a good idea. Figure 1 shows an example in which the RMSE decreases when $b_{ui}$ is scaled by a single factor on a small MovieLens dataset.
From Figure 1, the best scaling factor is 0.6, at which we get the lowest RMSE. However, from another perspective, such as the top-N measures, using the same scaling factor of 0.6 is not a good choice. Figure 2 shows that scaling the BLP in this way does not improve precision or recall.
For recommendation systems, the top-N measures are more important than RMSE. To improve both RMSE and the top-N measures, we should not use the same scaling factor for all the parameters in $b_{ui}$:

$$b_{ui} = \alpha\mu + \lambda_u b_u + \lambda_i b_i. \tag{6}$$

Determining these three parameters is very difficult, and, unlike matrix factorization models, NBCFs cannot learn unknown parameters by training. In this paper, we provide several context-aware scaling factors. Before describing the details, we first change (6) to another representation. Actually, the item bias in the baseline predictor can also be described as

$$b_i = \frac{1}{|U_i|}\sum_{v \in U_i} (r_{vi} - \mu), \tag{7}$$

where $U_i$ is the set of users rating item $i$. The scaled version of the baseline predictor can be considered as

$$b_{ui} = \alpha\mu + \frac{\sum_{v \in U_i} (r_{vi} - \mu)}{|U_i| + \beta}. \tag{8}$$

Here, we use the denominator term $\beta$ to control the scaling factor of the item bias, and hence $\lambda_i = |U_i|/(|U_i| + \beta)$. In fact, $\beta$ is the Bayesian mean damping term [24]. It biases the means toward the global mean $\mu$. Our task is to determine $\alpha$ and $\beta$ according to the context information.
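A small Python sketch may help to see how the two controls act on the item-based baseline. It assumes a scaled baseline of the form "scaled global mean plus damped item bias", where the symbol names `alpha` (factor applied to the global mean) and `beta` (Bayesian damping term added to the rating count) are chosen for this illustration, as is the function name.

```python
def scaled_item_baseline(item_ratings, mu, alpha=1.0, beta=0.0):
    """Scaled baseline for one item.

    item_ratings: list of ratings the item has received
    mu:           global mean rating
    alpha:        scaling factor applied to mu (assumed name)
    beta:         damping term added to the rating count (assumed name)

    With alpha = 1 and beta = 0 this reduces to mu + b_i.
    """
    n = len(item_ratings)
    if n + beta == 0:
        return alpha * mu  # no ratings and no damping: no bias available
    damped_bias = sum(r - mu for r in item_ratings) / (n + beta)
    return alpha * mu + damped_bias


# Worked example: three ratings of an item, global mean 3.5.
plain = scaled_item_baseline([5.0, 5.0, 4.0], mu=3.5)             # mu + b_i
damped = scaled_item_baseline([5.0, 5.0, 4.0], mu=3.5, beta=20)   # bias shrunk
```

In the worked example the undamped bias is 3.5/3, while with `beta=20` it becomes 3.5/23, that is, the bias is multiplied by 3/(3+20). Items with few ratings are therefore pulled strongly toward the global mean, and items with many ratings are barely affected.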
Recommendation is a rather special machine learning problem, because the user-item matrix is usually very sparse. When data is sparse, other sources of knowledge are needed to help the learning algorithm. Mining the context information is one way of adding such knowledge to recommendation algorithms.

Context-Aware Scaled Baseline Predictors.
We consider several kinds of context to determine the scaled baseline predictors: the ratings distribution, the categories distribution, the timestamp distribution, and the links distribution. First, we denote by $I$ the set of all the items and by $U$ the set of all the users.
The rating distribution aware (RDA) method scales the baseline predictor in terms of the ratings distribution. The values of ratings are usually discrete. Denote by $V = \{v_1, v_2, \ldots, v_p\}$ the set of possible rating values. Denote by $R(v_j)$ the set of rating records whose value is $v_j$:

$$R(v_j) = \{\langle u, i, r \rangle \mid r = v_j\},$$

where $u$ is the user, $i$ represents the item, and $r$ is the rating given to $i$ by $u$. Also, $U(v_j)$ denotes the set of users whose ratings contain $v_j$, and $I(v_j)$ denotes the set of items that are rated with the value $v_j$. Now we sort all the sets $R(v_j)$ in descending order of size. Denote by $R$ the set of all the rating records. The scaling factors of RDA are evaluated from the $n$ largest sets $R(v_j)$. If some sets have equal sizes, so that the number of candidates is larger than $n$, we select randomly among the sets of the same size.

Like RDA, the category distribution aware (CDA) method scales the baseline predictor in terms of the category distribution. The items in a recommendation system usually carry labels that indicate particular attributes. In MovieLens, the movies are labeled with genres. Each label corresponds to a category, and each item belongs to at least one category.
Suppose we have $m$ categories, and denote by $C = \{c_1, c_2, \ldots, c_m\}$ the set of these $m$ different categories. Denote by $I(c_j)$ the set of items belonging to $c_j$, and sort these sets in descending order of size. For CDA, the scaling factors are evaluated from the largest sets in this order. Note that one of the factors uses $\sum_j |I(c_j)|$ as the numerator and $|\bigcup_j I(c_j)|$ as the denominator. The difference from RDA is that an item can belong to multiple categories.
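The grouping and ranking steps that RDA and CDA rely on can be sketched as follows. The sketch covers only the preprocessing described in the text (grouping, sorting by set size, truncating, and the sum-over-union quantities mentioned for CDA); it does not compute the scaling factors themselves. The function names are hypothetical, and ties are resolved by first-seen order here rather than randomly.

```python
from collections import defaultdict


def largest_rating_value_sets(ratings, n):
    """Group (user, item, rating) records by rating value and return the
    n largest groups as (value, records) pairs, largest first."""
    groups = defaultdict(list)
    for u, i, r in ratings:
        groups[r].append((u, i, r))
    ranked = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    return ranked[:n]


def largest_category_sets(item_categories, n):
    """item_categories: dict item -> iterable of category labels.

    Returns (ranked, total, union_size) for the n largest categories:
    total is the sum of the set sizes, union_size the number of distinct
    items they cover. The two differ because an item may carry several labels.
    """
    groups = defaultdict(set)
    for item, cats in item_categories.items():
        for c in cats:
            groups[c].add(item)
    ranked = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)[:n]
    total = sum(len(s) for _, s in ranked)
    union_size = len(set().union(*(s for _, s in ranked)))
    return ranked, total, union_size
```

For rating values the groups are disjoint, so the sum of their sizes equals the size of their union; for categories the sum can exceed the union, which is the difference the text points out.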
There is always a timestamp record for each rating. The timestamp distribution aware (TDA) method scales the baseline predictor in terms of the timestamp distribution. Suppose that each element of $R$ is a 4-tuple $\langle u, i, r, t \rangle \in R$. The meanings of $u$, $i$, and $r$ are the same as in $R(v_j)$, and $t$ is the timestamp at which $u$ rated $i$ with the score $r$. The format of $t$ is usually a Unix timestamp. We change $t$ to a "yy-mm-dd" format $\tilde{t}$, so that the base unit of time is the day and each record becomes $\langle u, i, r, \tilde{t} \rangle$. Let $R(\tilde{t})$ be the set of rating records whose reduced timestamp is $\tilde{t}$. As in the previous two methods, we sort the sets $R(\tilde{t})$ in descending order of size, so that $|R(\tilde{t})_1| \geq |R(\tilde{t})_2| \geq \cdots$, and we select the first $n$ elements of this ordered collection to compose a truncated set.

The links distribution aware (LDA) method scales the baseline predictor in terms of the links distribution. The links are the relationships between users and items, which make up a rating network. No pair of users is linked, and no pair of items is linked. Equation (13) and Figure 3 show an example of a rating network. A user $u$ and an item $i$ are connected only when the rating of $i$ by $u$ is larger than or equal to a given threshold. The degree of the user $u$ is denoted by $d_U(u)$, and the degree of the item $i$ by $d_I(i)$. We create two descending-ordered sets $U_{dsc}$ and $I_{dsc}$ according to the degrees. It is obvious that $U_{dsc} = U$ and $I_{dsc} = I$ as sets, but, for convenience, we use different symbols. That is, $U_{dsc} = \{\hat{u}_1, \hat{u}_2, \ldots, \hat{u}_{|U|}\}$ and $I_{dsc} = \{\hat{i}_1, \hat{i}_2, \ldots, \hat{i}_{|I|}\}$. There is a unique mapping from $u_j \in U$ to $\hat{u} \in U_{dsc}$ and from $i_j \in I$ to $\hat{i} \in I_{dsc}$. For $U_{dsc}$ and $I_{dsc}$, we have $d_U(\hat{u}_1) \geq d_U(\hat{u}_2) \geq \cdots \geq d_U(\hat{u}_{|U|})$ and $d_I(\hat{i}_1) \geq d_I(\hat{i}_2) \geq \cdots \geq d_I(\hat{i}_{|I|})$. We put the ordered degrees of users and items into two sets, respectively: $D_U = \{d_U(1), d_U(2), \ldots, d_U(|U|)\}$ and $D_I = \{d_I(1), d_I(2), \ldots, d_I(|I|)\}$, where $d_U(1) \geq d_U(2) \geq \cdots \geq d_U(|U|)$ and $d_I(1) \geq d_I(2) \geq \cdots \geq d_I(|I|)$.
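The day-level bucketing used by TDA can be sketched with the Python standard library. The sketch assumes that the Unix timestamps are interpreted in UTC, which the source does not specify; the function names are hypothetical, and only the bucketing and ranking steps are shown, not the scaling factors derived from them.

```python
from collections import Counter
from datetime import datetime, timezone


def day_key(unix_ts):
    """Reduce a Unix timestamp to a 'yy-mm-dd' day string (UTC assumed)."""
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime("%y-%m-%d")


def busiest_days(ratings, n):
    """ratings: iterable of (user, item, rating, unix_timestamp) 4-tuples.

    Returns the n days with the most rating records as (day, count) pairs,
    largest first.
    """
    counts = Counter(day_key(t) for _, _, _, t in ratings)
    return counts.most_common(n)
```

For example, `busiest_days(records, 6)` would return the six most active days of a hypothetical `records` list, matching the idea of keeping only the first few elements of the size-ordered collection.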
Unlike the other methods, LDA controls the two scaling factors using different distributions: one factor is computed from top-ranked degrees only, whereas the other is computed from top-ranked degrees together with the average degree.
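The rating network and its degree lists can be sketched as below. The sketch builds the bipartite user-item network by linking a user and an item whenever the rating reaches a threshold, and then lists the degrees in descending order. The threshold value, the function names, and the choice to omit nodes with no links are assumptions for illustration; the way LDA turns these degrees into scaling factors is not reproduced here.

```python
from collections import Counter


def rating_network_degrees(ratings, threshold):
    """ratings: iterable of (user, item, rating) triples.

    A link joins user u and item i only if rating >= threshold.
    Returns (user_degrees, item_degrees) as Counters; nodes without any
    link do not appear (their degree is implicitly 0).
    """
    d_user, d_item = Counter(), Counter()
    for u, i, r in ratings:
        if r >= threshold:
            d_user[u] += 1
            d_item[i] += 1
    return d_user, d_item


def descending_degrees(degrees):
    """Degrees sorted from largest to smallest."""
    return sorted(degrees.values(), reverse=True)


def average_degree(degrees):
    """Mean degree over the nodes that have at least one link."""
    return sum(degrees.values()) / len(degrees) if degrees else 0.0
```

Because the network is bipartite, the user degrees and the item degrees always sum to the same number of links, even though their distributions can look very different.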

Experiments
3.1. Experimental Settings. We use a MovieLens latest dataset in our experiments, including 100,000 ratings and 6,100 tag applications applied to 10,000 movies by 700 users [25]. There are four files for each dataset: links, movies, ratings, and tags. We use these files to get different context information. We compare several different methods in our experiments, the names and meanings of which are shown in Table 1.
There are two similarity weight functions in our experiments: Cosine and Pearson's Correlation. The neighborhood sizes of the item-based models are all set to 20, while those of the user-based models are set to 100. The parameter appearing in (11)∼(15) is the same for all methods and is set to 6 by default. A further parameter is likewise shared by these methods and is set to 20 by default. We randomly split the dataset into 5 parts and use cross-validation to train and test the models.
For the top-N metrics (e.g., precision and recall), we randomly select 100 items from the test set as candidates, excluding those appearing in the training set. Only the items rated 3.5 or above are recommended. Neighborhood-based collaborative filter models usually incur a high memory cost, so we run the different NBCF algorithms on a machine with 16 GB of RAM.

Experimental Metrics.
Five metrics are used in our experiments: precision, recall, RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and nDCG (Normalized Discounted Cumulative Gain).
For a test dataset $T$, denote by TP the set of recommended items in which the users are really interested, by FP the set of recommended items in which the users are not interested, by FN the set of non-recommended items in which the users are interested, and by TN the set of non-recommended items in which the users are not interested. The metrics of precision and recall are defined, respectively, as follows:

$$\text{precision} = \frac{|TP|}{|TP| + |FP|}, \qquad \text{recall} = \frac{|TP|}{|TP| + |FN|}.$$

The recommendation system generates predicted ratings $\hat{r}_{ui}$ for a test set $T$ of user-item pairs $(u, i)$ for which the true ratings $r_{ui}$ are known. The RMSE and MAE between the predicted and actual ratings are given by

$$\text{RMSE} = \sqrt{\frac{1}{|T|}\sum_{(u,i) \in T} (\hat{r}_{ui} - r_{ui})^2}, \qquad \text{MAE} = \frac{1}{|T|}\sum_{(u,i) \in T} |\hat{r}_{ui} - r_{ui}|.$$

Recommendation systems usually present to the user a list of recommendations, imposing a certain natural browsing order. In many cases, we are not interested in predicting an explicit rating or selecting a set of recommended items, as above; rather, we are interested in ordering items according to the user's preferences. nDCG is a measure from information retrieval, in which positions are discounted logarithmically. Assuming that each user $u$ has a "gain" $g_{ui}$ from being recommended an item $i$, the average Discounted Cumulative Gain (DCG) for a list of $J$ items is defined as

$$\text{DCG} = \frac{1}{N}\sum_{u=1}^{N}\sum_{j=1}^{J} \frac{g_{u i_j}}{\max(1, \log_b j)},$$

where $N$ is the number of users, $i_j$ is the item at position $j$ in the list, and the logarithm base $b$ is a free parameter, typically between 2 and 10. A logarithm with base 2 is commonly used to ensure that all positions are discounted. nDCG is just the normalized version of DCG:

$$\text{nDCG} = \frac{\text{DCG}}{\text{DCG}^*},$$

where $\text{DCG}^*$ is the ideal DCG. The value of nDCG ranges from 0 to 1; the larger the value is, the better the performance is.
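The five metrics can be sketched in a few lines of Python. The sketch assumes the usual set-based definitions of precision and recall, RMSE and MAE over (predicted, actual) pairs, and a DCG in which the gain at position $j$ is divided by $\max(1, \log_b j)$; it computes DCG and nDCG for a single ranked list, so averaging over users is left to the caller. All function names are illustrative.

```python
import math


def precision_recall(recommended, relevant):
    """recommended: items shown to the user; relevant: items the user likes."""
    tp = len(set(recommended) & set(relevant))
    precision = tp / len(recommended) if recommended else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall


def rmse(pairs):
    """pairs: list of (predicted, actual) ratings."""
    return math.sqrt(sum((p - r) ** 2 for p, r in pairs) / len(pairs))


def mae(pairs):
    """pairs: list of (predicted, actual) ratings."""
    return sum(abs(p - r) for p, r in pairs) / len(pairs)


def dcg(gains, base=2):
    """gains: gains of the items in ranked order (position 1 first)."""
    return sum(g / max(1.0, math.log(pos, base))
               for pos, g in enumerate(gains, start=1))


def ndcg(gains, base=2):
    """DCG divided by the DCG of the ideal (gain-sorted) ordering."""
    ideal = dcg(sorted(gains, reverse=True), base)
    return dcg(gains, base) / ideal if ideal else 0.0
```

With base 2 the first two positions have a discount of 1 and every later position is divided by a value larger than 1, so a list whose gains are already in descending order reaches an nDCG of 1.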

Experimental Results
We slightly change the format of the MovieLens dataset and import it into a MySQL database. The coefficients of the BLP can be conveniently calculated using SQL statements. All of the coefficients of the CASBLP methods are shown in Table 2.
The experimental comparison results are shown in Table 3 (using Cosine similarity) and Table 4 (using Pearson's Correlation). It seems that using Cosine is better than using Pearson's Correlation in our experiments. A possible reason is that, although each user has a different personal rating scale, the rating matrix is so sparse that the difference in scales does not become the major issue. When data is sparse, Cosine is usually a good choice.
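From Table 3, we can see that when no normalization scheme is used (NoBP), all of the metrics are much worse.

An important question is where the optimal values of the coefficients we used lie. So we change $\alpha$ from 0 to 1 and $\beta$ from 0 to 200 to see how the performance changes. Figures 4-6 show the impact of the scaling factors on RMSE, precision, and recall, respectively.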
For all these three metrics, the optimum of $\beta$ is near 20, at which the RMSE is the lowest and the precision and recall are the highest. Interestingly, any shrinking of $\alpha$ improves precision and recall, even if we set $\alpha$ to zero. However, shrinking $\alpha$ causes a slightly higher RMSE, except at values near 0.8. This means that $\alpha$ controls the accuracy of the rating prediction, but once $\alpha$ has shrunk, $\beta$ plays a crucial role in item recommendation. A possible cause of this phenomenon is that the mean rating is computed over all the users and thus carries global information, while the biases are computed from only a few similar neighbors and thus carry local information. For personalized recommendation systems, the local information is much more important, and an ordinary average prediction has little meaning. That is why, even if we set $\alpha$ to 0 and use only the item biases, we still get a passable prediction performance.
The neighbor size is an important factor in neighborhood-based recommendation systems, whether item-based or user-based. We increase the neighbor size geometrically from 5 to 320. Figures 7, 8, and 9 show the change of recommendation performance in terms of precision, recall, and RMSE, respectively.
What we can see from Figure 9 is consistent with what we have concluded from Tables 3 and 4. Whether the scaled or the unscaled BLP is used, we get similar RMSE values, all of which are much lower than that of the NoBP scheme. As the neighbor size grows, all the RMSE values tend to stabilize.
What surprised us is the behavior of precision and recall. Both metrics increase with the neighbor size until they reach stable values, except for the NoBP scheme, whose precision and recall decrease to stable values. A possible explanation is that, without normalization, the prediction lacks personalization and leaves too many decoys to choose from.
Figures 7 and 8 also show results that are consistent with Tables 3 and 4. By only slightly changing the coefficients of the BLP, we get higher precision and recall than with the unscaled BLP scheme and NoBP, especially when a larger neighbor size is used.

Conclusions
Rating normalization is an important step when designing collaborative filter recommendation systems, especially for the item-based ones, which play a key role in online commercial business. Using the baseline predictor for normalization takes both global and local information into account. Although we found that balancing them can improve the recommendation performance, there is no clear way of determining the weights of these two sources of information. In this paper, we proposed several context-aware scaled BLP schemes, which compute the weights of the mean rating and the biases, respectively, in terms of different context information. The experiments not only verified the advantage of the scaled BLP but also pointed out the different roles of each part of the BLP. This paper only studied the BLP normalization of item-based collaborative filter systems on a single MovieLens dataset. User-based and matrix factorization models are quite different from item-based ones, and we will explore them in future work using different and larger recommendation datasets.

Figure 1: Impact of unified scaling factor on RMSE.

Figure 2: Impact of unified scaling factor on precision and recall.

Figure 3: Example of rating network.

Figure 8: Impact of neighborhood size on recall.

Table 1: The meanings of different methods' names.

Table 2: Values of scaling factors.

Table 3: Experimental results using Cosine.

Table 4: Experimental results using Pearson's Correlation.