Improving Top-N Recommendation Performance Using Missing Data

Recommender systems have become increasingly significant in solving the information explosion problem. Data sparsity is a main challenge in this area: massive numbers of unrated items constitute missing data, with only a few observed ratings. Most studies consider missing data as unknown information and use only observed data to learn models and generate recommendations. However, data are missing not at random. Part of the missing data results from users choosing not to rate the items, and this part constitutes negative examples of user preferences. Utilizing this information is expected to improve the performance of recommendation algorithms. Unfortunately, negative examples are mixed with unlabeled positive examples in missing data, and they are hard to distinguish. In this paper, we propose three schemes to utilize the negative examples in missing data. The schemes are then adapted with SVD++, a state-of-the-art matrix factorization recommendation approach, to generate recommendations. Experimental results on two real datasets show that our proposed approaches achieve better top-N performance than the baseline ones on both accuracy and diversity.


Introduction
In the current age of information overload, it is becoming increasingly hard for people to find relevant content. Recommender systems have been introduced to help people retrieve potentially useful information from a huge set of choices. Conventional recommendation methods are based on users' rating values, which are considered indications of users' preference levels towards the rated items. Recommender systems estimate the ratings of items that have not been rated by the target user based on the rating history and recommend the top-N items with the highest predicted ratings. This kind of rating prediction approach has gained significant success. Recently, there has been growing interest in improving recommender systems in terms of ranking performance, as it seems to better approximate the true task [1,2]. As a result, some researchers consider the recommendation problem as a ranking prediction problem and directly optimize a ranking goal to learn their recommendation algorithms.
Most of these approaches, whether rating prediction or ranking prediction ones, are trained and tested on observed ratings only. The effectiveness of these approaches rests on an implicit assumption that the ratings in the available data are missing at random. If the assumption is not satisfied, the missing data mechanism cannot be ignored in general and has to be modeled precisely to obtain correct results. Indeed, some recent works find that the data are missing not at random [3][4][5]. Marlin et al. [3] provide evidence that low ratings are much more likely to be missing from the observed data than high ratings in the Yahoo! LaunchCast data. This may be a consequence of the fact that users are free to choose which items to rate. Steck [4] works on training and testing recommender systems on data missing not at random and illustrates that accounting for missing ratings can improve the top-N performance of a simple matrix factorization model. Therefore, missing data, which have not been rated by the active users, carry useful information about user preferences.

Related Work
In this section, the literature review is divided into four parts. The first covers conventional rating prediction recommendation algorithms. The second covers studies on ranking prediction recommendation approaches. The third covers recent works on nonrandom missing data. The last focuses on one-class collaborative filtering, whose idea is similar to our proposed schemes.

Rating Prediction Approaches.
Recommendation techniques have been studied for several years. Conventional recommendation approaches are based on rating prediction. They are used to provide personalized recommendations that help people cope with the information explosion problem. Collaborative filtering (CF) is a very popular technique, since it does not need to analyze the content of the candidate items and relies on swarm intelligence instead. Furthermore, it can be easily adapted from one domain to another. CF algorithms can be divided into two classes: memory-based and model-based [8,9].
Memory-based algorithms are heuristic methods that make rating predictions based on the entire collection of items previously rated by users [10,11]. They rest on the basic assumption that people who agreed in their preferences for certain items in the past tend to agree again in the future [12]. The level of agreement can be measured by similarity. Based on the similarity calculation, recommender systems predict ratings for unknown items using an adjusted weighted sum of known ratings and recommend items with high predicted values [11].
Model-based CF is another typical kind of CF method. Model-based algorithms use the collection of ratings to learn a model, typically with statistical machine-learning methods, which is then used to make rating predictions. These approaches design appropriate loss functions and optimization procedures to learn their models by minimizing the error between predicted ratings and actual ones. Examples of such techniques include Bayesian clustering [9], matrix factorization [7], and topic models [13].
SVD++ [7] is a model-based CF approach using the matrix factorization technique. It considers implicit feedback as a complement to explicit feedback and utilizes both to build recommendation models by minimizing prediction errors. It is a state-of-the-art rating prediction approach and is used as the foundation of our improvement.

Ranking Prediction Approaches.
Different from rating prediction approaches, some studies directly consider the recommendation problem as a ranking problem. They propose models for ranking prediction by directly modeling user preferences with respect to a set of items rather than the rating scores on individual items.
Weimer et al. [14] present a method (CofiRank) which uses Maximum Margin Matrix Factorization and takes maximizing NDCG as the optimization target. The approach is adaptable to different scores. Since the optimization target of CofiRank is a listwise one, the approach scales well on collaborative filtering tasks.
Liu and Yang [15] measure the similarity between users based on the correlation between their rankings of the items rather than the rating values. Based on the preferences of similar users, they propose collaborative filtering algorithms for ranking items with either a greedy strategy or a random walk model. Liu et al. [2] propose a probabilistic latent preference analysis (pLPA) model to make ranking predictions. From a user's observed ratings, they extract his/her preferences in the form of pairwise comparisons of items, which are modeled by a mixture distribution based on the Bradley-Terry model. An EM algorithm for fitting the corresponding latent class model, as well as a method for predicting the optimal ranking, is described.
Koren and Sill [16] propose a collaborative filtering recommendation framework (OrdRec), which is based on viewing user feedback on products as ordinal, rather than the more common numerical view. Their approach is based on a pointwise ordinal model, which allows it to scale linearly with data size. OrdRec is also an improvement of SVD++. It is used as a comparison approach in our experiments to verify the effectiveness of our proposed approaches in the top-N recommendation task.

Nonrandom Missing Data.
Most conventional collaborative filtering approaches use observed ratings only and expect that a model optimized on observed ratings alone is an unbiased estimate of one optimized on the entire data. These approaches are based on an implicit assumption that the ratings not in the observed data are missing at random. However, this assumption may not be satisfied. Some recent works have found that data are not missing at random [3][4][5].
Marlin et al. [3] find that low ratings are much more likely to be missing from observed data than high ratings in the Yahoo! LaunchCast data. This is evidence that data are missing not at random. Steck [4,5] works on training and testing recommender systems on data missing not at random. He assumes that the relevant rating values are missing at random, while the other ratings are missing with higher probability. Based on this assumption, he presents two performance measures that can be estimated, under mild assumptions, without bias from data even when ratings are missing not at random. In addition, he proposes an appropriate surrogate measure for training models, referred to as AllRank, in which both observed and missing data are considered. It improves the top-N performance of a simple matrix factorization model by accounting for missing ratings.
Cremonesi et al. [1] propose an improvement of matrix factorization that treats all missing values in the user rating matrix as 0, which is referred to as PureSVD. This approach achieves better top-N performance than even more detailed and sophisticated latent factor models. The result demonstrates that treating missing data as 0 is much more effective than just ignoring them, which is further evidence that data are missing not at random.

One-Class Collaborative Filtering.
The authors of [17] propose two frameworks to solve one-class collaborative filtering (OCCF) problems. One is based on weighted low rank approximation; the other is based on negative example sampling. Li et al. [18] exploit rich user information to improve recommendation accuracy in OCCF problems. They propose two ways to incorporate such user information into OCCF models: one is to linearly combine scores from different sources, and the other is to embed user information into collaborative filtering. Rendle et al. [19] consider missing data as a mixture of real negative feedback and missing positive values and present a generic optimization criterion (BPR) for personalized ranking, which is the maximum posterior estimator derived from a Bayesian analysis of the OCCF problem.
The schemes that we propose in the next section, which deal with the missing data of recommender systems by weighting or sampling, are similar to the idea of OCCF. However, the recommendation context differs between our schemes and those of OCCF. OCCF focuses on binary recommendation problems with implicit feedback, while our proposed schemes focus on classical recommendation problems with explicit feedback. Furthermore, the neighbor-based sampling scheme proposed in Section 3.3 can exploit the advantages of kNN methods while sampling negative examples.

Schemes to Deal with Missing Data
In the context of recommender systems, users are free to choose which items to rate. As a result, the observed rating data can indicate users' preferences. In the survey of Marlin et al. [3] using Yahoo! LaunchCast data, 93.9% of users report that they very often rate an item which they love, while only 36.5% of users report that they rate an item about which they are "neutral" with the same frequency. The survey concerns ratings for songs, a context in which rating is only a little time-consuming. If the context changes to a very time-consuming or costly one, such as movies or e-commerce, the ratio of users choosing to rate an item about which they are "neutral" should be lower. Therefore, there are two types of items for a given user. One is the items that the user wants to rate; they form a set $I^+$. The other is the items that the user does not care about and does not want to rate; they form a set $I^-$. The observed data contain the rated items and are a part of $I^+$. The rest of $I^+$ combines with $I^-$ to form the missing data. In this paper, we consider the items in $I^+$ as positive examples and the items in $I^-$ as negative examples.
Based on the partition, Steck considers that the rating distribution differs between $I^+$ and $I^-$ [5]. He tries to model the difference to improve a simple matrix factorization approach in the top-N recommendation task [4]. In his opinion, negative ratings with low values have a high probability of being missing. Therefore, he imputes a small value ($r_m$) for all missing data and uses a weighting parameter ($w_m$) to control the effect of missing data. In this way, the improved models using missing data can achieve better top-N performance than the original matrix factorization model using observed data only. In that work, AllRank-Regression with $w_m = 0.05$ and $r_m = 2$ achieves the best top-N performance; it is used as a comparison approach in our experiments.
The main idea of Steck [4,5] is that most missing data are negative ratings, and the difference between $I^+$ and $I^-$ lies in the rating distribution. In contrast, in our opinion, most missing data are negative examples. Different from [4], our $r_m$ is used to represent negative examples, which actually belong to a different item set from positive examples. Therefore, the value of $r_m$ should lie outside the range of the rating scale in order to distinguish negative examples from positive ones by rating value. The typical value of $r_m$ is 0; the impact of different values of $r_m$, including values within the range of the rating scale, is examined in Section 6.
In the rest of this section, three schemes are introduced to deal with missing data.
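
Weighting Scheme
The weighting scheme (WS) considers all missing data as negative examples and assigns a weight to every user-item pair:
$$w_{ui} = \begin{cases} 1, & r_{ui} \in R, \\ \delta, & \text{otherwise}, \end{cases} \tag{1}$$
where $w_{ui}$ is the weighting value for user $u$ on item $i$, $R$ is the observed data, and $\delta$ is a uniform confidence threshold for all missing data. If user $u$ has rated item $i$, it is a positive example, and the weighting value is set to 1. Otherwise, $\langle u, i \rangle$ is considered a negative example with confidence level $\delta$.
In this scheme, all missing data are imputed with $r_m$. It can be formalized as
$$r^*_{ui} = \begin{cases} r_{ui}, & r_{ui} \in R, \\ r_m, & \text{otherwise}, \end{cases} \tag{2}$$
where $r^*_{ui}$ is the data for learning recommendation models and $r_{ui}$ is the rating value in the observed data.
With WS, a recommendation approach aims at finding a prediction model that minimizes a weighted Frobenius loss function:
$$\min \sum_{r^*_{ui} \in R^*} w_{ui} \left( r^*_{ui} - \hat{r}(u, i) \right)^2, \tag{3}$$
where $R^*$ is the reconstructed matrix, which contains both observed data and imputed ratings, and $\hat{r}(u, i)$ is the rating predicted by the recommender system.
Broadly speaking, WS can be considered the same as AllRank-Regression in [4]. The main difference lies in how missing data are interpreted: as negative ratings (AllRank-Regression) or as negative examples (WS). In addition, PureSVD, which is proposed in [1], is a special case of using WS in the SVD approach with $\delta = 1$.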
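The weighting scheme can be illustrated with a minimal Python sketch. It assumes that ratings are stored in a dictionary keyed by (user, item) pairs; the function names `ws_reconstruct` and `weighted_squared_loss` and the default values of `r_m` and `delta` are illustrative choices, not part of the original method description.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[int, int]  # (user index, item index)


def ws_reconstruct(observed: Dict[Pair, float], n_users: int, n_items: int,
                   r_m: float = 0.0, delta: float = 0.2
                   ) -> Tuple[List[List[float]], List[List[float]]]:
    """Build the imputed matrix R* and the weight matrix W of the weighting scheme.

    Observed pairs keep their rating with weight 1; every missing pair is
    imputed with r_m and weighted by delta (hypothetical default values).
    """
    r_star: List[List[float]] = []
    w: List[List[float]] = []
    for u in range(n_users):
        r_row, w_row = [], []
        for i in range(n_items):
            if (u, i) in observed:
                r_row.append(observed[(u, i)])
                w_row.append(1.0)
            else:
                r_row.append(r_m)
                w_row.append(delta)
        r_star.append(r_row)
        w.append(w_row)
    return r_star, w


def weighted_squared_loss(r_star: List[List[float]], w: List[List[float]],
                          predict: Callable[[int, int], float]) -> float:
    """Weighted squared error summed over all user-item pairs."""
    return sum(w[u][i] * (r_star[u][i] - predict(u, i)) ** 2
               for u in range(len(r_star)) for i in range(len(r_star[u])))
```

Because every user-item pair enters the sum, the cost of evaluating this loss grows with the full size of the rating matrix rather than with the number of observed ratings.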

Random Sampling Scheme
WS considers all missing data as negative examples. This assumption roughly holds in most cases. However, its main drawback is that the computational cost is very high, especially since the target problem of recommender systems is information overload, which implies a massive set of missing data. A sampling scheme can alleviate this problem to a certain degree by considering only some of the missing data as negative examples, which differs considerably from WS.
In this subsection, we propose a random sampling scheme (RSS), which samples negative examples from missing data with a stochastic method. In RSS, a fraction $\rho$ of the missing data is randomly selected as negative examples ($S$). These negative examples are combined with the rating matrix $R$ to form the reconstructed matrix $R^*$ for RSS. It can be formalized as
$$r^*_{ui} = \begin{cases} r_{ui}, & r_{ui} \in R, \\ r_m, & \langle u, i \rangle \in S. \end{cases} \tag{4}$$
RSS uses $R^*$ to optimize the recommendation model and generate recommendations. Therefore, the size of $R^*$ is a major factor in the computational cost of different recommendation approaches. As the size of $R$ is constant, the computational cost mainly depends on the size of $S$. When $\rho$ is 1, $S$ is the entire set of missing data, and the computational cost of RSS is similar to that of WS. When $\rho$ is 0, $S$ is an empty set, and the computational cost of RSS is similar to that of the original recommendation approach, which does not consider the effect of missing data. When $\rho$ is between 0 and 1, the computational cost of RSS decreases as $\rho$ decreases. The experimental results will show that RSS achieves the best performance when $\rho$ is 0.2. This indicates that the computational cost of RSS is much less than that of WS.
RSS mainly focuses on utilizing missing data without changing the training process of recommendation approaches. Therefore, RSS learns the prediction model by minimizing an unweighted Frobenius loss function, as most recommendation approaches do. It can be written as
$$\min \sum_{r^*_{ui} \in R^*} \left( r^*_{ui} - \hat{r}(u, i) \right)^2. \tag{5}$$
It is notable that $S$ should be rebuilt in each learning step, since the sampling scheme is stochastic, in order to reduce the effect of randomness.
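The random sampling scheme can be sketched in a few lines of Python. This is only an illustration under assumptions: ratings are held in a dictionary keyed by (user, item), the names `rss_negatives` and `rss_training_set` are hypothetical, and the sketch enumerates all missing pairs for clarity, which a large-scale implementation would avoid.

```python
import random
from typing import Dict, List, Optional, Tuple

Pair = Tuple[int, int]  # (user index, item index)


def rss_negatives(observed: Dict[Pair, float], n_users: int, n_items: int,
                  rho: float, rng: random.Random) -> List[Pair]:
    """Randomly select a fraction rho of the missing pairs as negative examples."""
    missing = [(u, i) for u in range(n_users) for i in range(n_items)
               if (u, i) not in observed]
    return rng.sample(missing, int(round(rho * len(missing))))


def rss_training_set(observed: Dict[Pair, float], n_users: int, n_items: int,
                     rho: float = 0.2, r_m: float = 0.0,
                     rng: Optional[random.Random] = None
                     ) -> List[Tuple[int, int, float]]:
    """Observed ratings plus sampled negatives imputed with r_m.

    Calling this once per learning step redraws the negative set, as the
    text recommends.
    """
    rng = rng or random.Random()
    examples = [(u, i, r) for (u, i), r in observed.items()]
    negatives = rss_negatives(observed, n_users, n_items, rho, rng)
    examples += [(u, i, r_m) for (u, i) in negatives]
    return examples
```

With `rho = 0` the training set reduces to the observed ratings, and with `rho = 1` it covers every user-item pair, matching the two limiting cases discussed above.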

3.3. Neighbor-Based Sampling Scheme.
A sampling scheme can reduce the computational cost of the weighting scheme. However, with a stochastic method, the missing positive examples and the negative ones have the same chance of being selected as negative examples. In this subsection, we propose a neighbor-based sampling scheme (NSS) to increase the chance of negative examples being selected and to decrease that of positive ones. Different from RSS, which samples with a purely stochastic method, NSS samples negative examples from missing data using swarm intelligence.
Algorithm 1: The neighbor-based sampling scheme.
Input: the rating matrix $R$, the random ratio $\rho$, the neighbor size $k$
Output: the neighbor-based sampling matrix $S_h$
(1) for each user $u \in U$ do
(2)     Find $U_k(u)$: the top-$k$ most similar users of $u$;
(3)     Find $I(u)$: the item set in which items have not been rated by user $u$;
(4)     Find $C(u)$: the candidate item set, a subset of $I(u)$, in which items have been rated by none of the users in $U_k(u)$;
(5)     Randomly select a fraction $\rho$ of the items in $C(u)$ into $S_h$;
(6) end for
NSS is based on the assumption that similar users have similar tendencies regarding negative examples. As in neighbor-based CF, in NSS, the items that have rarely been rated by a user's neighbors are very likely to be negative examples for that user. Accordingly, NSS searches for the $k$ most similar users as neighbors of each user and then selects the items which have not been rated by these neighbors. In this way, negative examples have a bigger chance of being selected than positive ones. After that, NSS randomly samples some of the selected items as negative examples. The sampled result is a negative example set ($S_h$). The details of NSS are described in Algorithm 1.
With the sampling result, the elements of the reconstructed matrix for NSS can be written as
$$r^*_{ui} = \begin{cases} r_{ui}, & r_{ui} \in R, \\ r_m, & \langle u, i \rangle \in S_h. \end{cases} \tag{6}$$
Recommendation approaches with NSS learn their models with the same loss function as RSS. Since the similar users in NSS are used to share opinions about which items (not) to rate, similarity functions that regard two users who often (do not) rate the same items as similar are suitable. The Jaccard index is such a similarity function, and it is used for measuring user similarity in this paper.
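The steps of the neighbor-based sampling scheme can be sketched as follows. The sketch assumes that each user's rated items are available as a Python set; the names `jaccard` and `nss_sample`, the tie-breaking among equally similar users, and the rounding of the sample size are illustrative assumptions rather than details given in the text.

```python
import random
from typing import Dict, List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard index between two sets of rated items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nss_sample(rated: Dict[int, Set[int]], all_items: Set[int], k: int,
               rho: float, rng: random.Random) -> Dict[int, List[int]]:
    """For each user, sample negatives among items that neither the user
    nor any of the k most similar users has rated."""
    sampled: Dict[int, List[int]] = {}
    for u, items_u in rated.items():
        # Rank the other users by Jaccard similarity to u and keep the top k.
        others = [v for v in rated if v != u]
        others.sort(key=lambda v: jaccard(items_u, rated[v]), reverse=True)
        neighbors = others[:k]
        # Candidate items: unrated by u and unrated by every neighbor.
        rated_by_neighbors: Set[int] = set()
        for v in neighbors:
            rated_by_neighbors |= rated[v]
        candidates = sorted(all_items - items_u - rated_by_neighbors)
        # Randomly keep a fraction rho of the candidates as negatives.
        sampled[u] = rng.sample(candidates, int(round(rho * len(candidates))))
    return sampled
```

Sorting all other users for each target user is quadratic in the number of users; it is kept here for readability only.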

Recommendation Approaches
The proposed schemes in Section 3 can be adapted with many recommendation approaches to utilize missing data. In this section, we take a matrix factorization approach known as SVD++ [7] as the basic model and adapt the schemes with it in order to improve its top-N recommendation performance by using missing data.
The SVD++ approach has been demonstrated to yield superior accuracy by considering implicit feedback (including rating behavior) as a complement to explicit feedback and using both to build recommendation models by minimizing prediction errors. The prediction model of SVD++ is
$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \left( p_u + |N(u)|^{-1/2} \sum_{j \in N(u)} y_j \right), \tag{7}$$
where $\mu$ is the average rating value of the known data. In addition, $b_u$ and $b_i$ indicate the observed deviations of user $u$ and item $i$, respectively, from the average. $p_u$ and $q_i$ are the user and item factor vectors, respectively. $N(u)$ represents the item set rated by user $u$. $y_j$ is an item factor vector that captures the impact of implicit feedback.
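The approach learns the values of the involved parameters with a stochastic gradient descent technique by minimizing the regularized squared error function
$$\min \sum_{r_{ui} \in R} \left[ \left( r_{ui} - \hat{r}_{ui} \right)^2 + \lambda_6 \left( b_u^2 + b_i^2 \right) + \lambda_7 \left( \|q_i\|^2 + \|p_u\|^2 + \sum_{j \in N(u)} \|y_j\|^2 \right) \right], \tag{8}$$
where $\lambda_6$ and $\lambda_7$ are the regularization parameters. The learning process runs on the rating matrix $R$, which contains all the rating values given by users. The predicted ratings can be calculated by (7) using the learned parameters. SVD++ learns its parameters from $R$ only; it considers all missing data as unknown information and ignores them.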
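As an illustration of the SVD++ prediction model, the following Python sketch computes a single predicted rating from already learned parameters. The function name `svdpp_predict` and the plain-list representation of the factor vectors are assumptions made for this example.

```python
import math
from typing import List, Sequence


def svdpp_predict(mu: float, b_u: float, b_i: float,
                  p_u: Sequence[float], q_i: Sequence[float],
                  y_rated: Sequence[Sequence[float]]) -> float:
    """Predicted rating: mu + b_u + b_i + q_i . (p_u + |N(u)|^(-1/2) * sum_j y_j).

    y_rated holds the implicit factor vectors y_j of the items rated by the user.
    """
    n_factors = len(q_i)
    implicit: List[float] = [0.0] * n_factors
    if y_rated:
        norm = 1.0 / math.sqrt(len(y_rated))
        for y_j in y_rated:
            for f in range(n_factors):
                implicit[f] += norm * y_j[f]
    interaction = sum(q_i[f] * (p_u[f] + implicit[f]) for f in range(n_factors))
    return mu + b_u + b_i + interaction
```

For a user with no rated items, the implicit term vanishes and the prediction falls back to the biases plus the plain factor product.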
However, as mentioned above, missing data should not simply be ignored. Therefore, we adapt the proposed schemes with SVD++ to deal with missing data in order to improve the recommendation performance. The improved approaches are introduced in the rest of this section.

Improvement of SVD++ with Weighting Scheme.
In this subsection, WS is adapted with SVD++ to deal with missing data. This approach is referred to as WSVD++. In WSVD++, we change only the learning process of SVD++, without any change to the model structure. As a result, the prediction model of WSVD++ is consistent with SVD++, as shown in (7). On the other hand, the loss function is converted to a weighted function, as shown in (3). In WSVD++, the learning process runs through all the user-item pairs with different weighting values, as defined in (1).
Let us denote the prediction error, $r^*_{ui} - \hat{r}_{ui}$, by $e_{ui}$. WSVD++ loops through all the user-item pairs. For a given case $r^*_{ui}$, we modify the parameters by moving in the opposite direction of the gradient of the weighted loss in (3), so that each update is driven by the weighted error $w_{ui} e_{ui}$. In this way, the learned parameters converge to minimize the loss function. However, $b_u$ and $b_i$ are parameters that indicate user and item rating bias, respectively. They should not be influenced by missing data, which are not explicit feedback from users on items. In addition, $y_j$ is an indicator of implicit feedback, and the negative examples are negative evidence for implicit feedback. It should not be influenced by missing data, either. Therefore, we use a different weighting function for these three parameters, which can be written as
$$w'_{ui} = \begin{cases} 1, & r_{ui} \in R, \\ 0, & \text{otherwise}. \end{cases}$$
Correspondingly, the learning process is revised so that the updates of $b_u$, $b_i$, and $y_j$ use $w'_{ui}$ in place of $w_{ui}$. There is an important parameter ($\delta$) in WSVD++. It determines the confidence of missing data being negative examples. We will discuss its impact in Section 6.2.
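One possible form of a single WSVD++ learning step is sketched below in Python. Several details are assumptions made for illustration and are not specified in the text: the model is a dictionary of plain lists, the updates of the biases and of the implicit factors are skipped entirely for missing pairs, the regularization of the user and item factors is left unweighted, and the names `wsvdpp_step`, `gamma1`, `gamma2`, `lam6`, and `lam7` are hypothetical.

```python
import math
from typing import Dict, List


def wsvdpp_step(model: Dict, u: int, i: int, target: float, is_observed: bool,
                rated_items: List[int], delta: float = 0.2,
                gamma1: float = 0.002, gamma2: float = 0.002,
                lam6: float = 0.05, lam7: float = 0.05) -> float:
    """Apply one SGD update for the pair (u, i) and return the error e_ui.

    target is the observed rating, or the imputed value r_m for a missing pair.
    """
    w = 1.0 if is_observed else delta  # weight applied to the p_u and q_i updates
    p_u, q_i = model["p"][u], model["q"][i]
    n_factors = len(q_i)
    norm = 1.0 / math.sqrt(len(rated_items)) if rated_items else 0.0
    # z = p_u + |N(u)|^(-1/2) * sum of y_j over the items rated by u
    z = [p_u[f] + norm * sum(model["y"][j][f] for j in rated_items)
         for f in range(n_factors)]
    pred = (model["mu"] + model["b_u"][u] + model["b_i"][i]
            + sum(q_i[f] * z[f] for f in range(n_factors)))
    e = target - pred

    if is_observed:
        # Assumption: biases and implicit factors change only on observed ratings.
        model["b_u"][u] += gamma1 * (e - lam6 * model["b_u"][u])
        model["b_i"][i] += gamma1 * (e - lam6 * model["b_i"][i])
        for j in rated_items:
            y_j = model["y"][j]
            for f in range(n_factors):
                y_j[f] += gamma2 * (e * norm * q_i[f] - lam7 * y_j[f])
    for f in range(n_factors):
        # Both updates use the values of p_u, q_i, and z from before this step.
        grad_p = w * e * q_i[f] - lam7 * p_u[f]
        grad_q = w * e * z[f] - lam7 * q_i[f]
        p_u[f] += gamma2 * grad_p
        q_i[f] += gamma2 * grad_q
    return e
```

A training epoch would call this step for every user-item pair, passing the observed rating for observed pairs and the imputed value for missing ones.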

Improvement of SVD++ with Sampling Scheme.
In this subsection, the sampling scheme is adapted with SVD++ to deal with missing data. The scheme can be either RSS or NSS. Depending on the chosen scheme, the approach is referred to as RSSVD++ (with RSS) or NSSVD++ (with NSS). As in WSVD++, we change only the learning process of SVD++, without any change to the model structure. The prediction models of RSSVD++ and NSSVD++ are consistent with SVD++, since they are all unweighted functions. Furthermore, the loss functions are still consistent with SVD++.
In addition, as in WSVD++, the negative examples are not relevant to the parameters $b_u$, $b_i$, and $y_j$. Therefore, in order to guarantee this irrelevance, an indicator function is defined. It can be written as
$$c_{ui} = \begin{cases} 1, & r_{ui} \in R, \\ 0, & \text{otherwise}. \end{cases}$$
The learning process of RSSVD++ or NSSVD++ runs on the reconstructed matrix $R^*$, and the updates of $b_u$, $b_i$, and $y_j$ are applied only when the indicator equals 1. Both RSSVD++ and NSSVD++ are based on sampling schemes. They share a similar loss function and a similar learning process. The main difference between them is the way in which they sample negative examples from missing data. RSSVD++ has an important parameter ($\rho$), which is the ratio of the negative examples randomly selected from missing data. Its impact will be discussed in Section 6.3. For NSSVD++, there are two important parameters, namely, $\rho$ and $k$. The former is the ratio of negative examples randomly selected from the candidate item set. The latter determines the number of neighbors for each user. We will discuss their impact in Section 6.4.

Evaluation Metrics
In this paper, we focus on the top-N recommendation task, where a recommender system tries to pick the best N items for people [1,6,8]. The Normalized Discounted Cumulative Gain (NDCG) [20] metric is a popular metric for evaluating the relevance of top-N results in information retrieval, where documents are assigned graded rather than binary relevance judgments. As rating values indicate the levels of users' preferences for items in recommender systems, the NDCG metric is suitable for evaluating recommendation quality in the top-N recommendation task. In this paper, we use NDCG as our main evaluation metric. It is an accuracy metric, which can be written as
$$\mathrm{NDCG@}N = \frac{1}{|U|} \sum_{u \in U} \frac{\mathrm{DCG@}N(u)}{\mathrm{IDCG@}N(u)}, \qquad \mathrm{DCG@}N(u) = \sum_{j=1}^{N} \frac{2^{r(u, j)} - 1}{\log_2(1 + j)},$$
where $r(u, j)$ is the rating value given by user $u$ to the item at the $j$th position of the recommendation list, DCG@N(u) represents the Discounted Cumulative Gain value at the Nth position for user $u$, and IDCG is the maximum possible DCG, which is used for normalizing the NDCG value.
However, in recommender systems, the recommended items are always unknown ones (the target user has not rated the items yet). Therefore, some researchers consider the recommendation problem as a ranking problem and use NDCG to evaluate the algorithms while ranking the items in the test set [2]. In this paper, we refer to this as the NDCG+ metric and consider it a comparative evaluation metric. The main difference between these two metrics is that NDCG+ ranks the items contained in the test set of each user, whereas NDCG ranks all the possible items (all items except the ones that the current user has rated in the training set) for each user.
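A per-user NDCG computation can be sketched as follows. The exponential gain `2**r - 1` with a base-2 logarithmic discount is a commonly used form and is an assumption here, as is the convention that recommended items without a known test rating contribute a gain of 0; the function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_n(ratings_in_rank_order: Sequence[float], n: int) -> float:
    """DCG over the first n positions, using gain 2**r - 1 and a log2 discount."""
    return sum((2 ** r - 1) / math.log2(j + 1)
               for j, r in enumerate(ratings_in_rank_order[:n], start=1))


def ndcg_at_n(ranked_ratings: Sequence[float],
              test_ratings: Sequence[float], n: int) -> float:
    """NDCG@n for one user.

    ranked_ratings: test rating of each recommended item in list order,
                    with 0 for items the user has no test rating for.
    test_ratings:   all test ratings of the user, used to build the ideal list.
    """
    ideal = dcg_at_n(sorted(test_ratings, reverse=True), n)
    return dcg_at_n(ranked_ratings, n) / ideal if ideal > 0 else 0.0
```

Averaging `ndcg_at_n` over all users gives the overall score. For NDCG+, the same function would be applied to a ranking restricted to each user's test items.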
In addition, 1-call at top-N recommendations is used as another accuracy metric. It reflects the ratio of users who have at least one relevant item in their top-N recommendation lists [21]. This paper is set in the context of data missing not at random. Recall is a popular evaluation metric in this context, as it can be estimated without bias from observed data, whether or not the relevant ratings are missing at random [5]. Recall is defined as
$$\mathrm{Recall@}N = \frac{1}{|U|} \sum_{u \in U} \frac{\mathrm{REL}(u, N)}{\mathrm{REL}(u)},$$
where REL(u, N) is the number of relevant items among the top-N items for user $u$ and REL(u) is the number of all relevant items for the user. As used in [4,5], the relevant items are the items rated by the current user with value 5.
Furthermore, a number of recent studies find that beyond accuracy there are other quality factors that are also important to users, for example, diversity and novelty [8,22]. Diversity in recommender systems refers to how different the recommended items are from each other. It is an important complement to accuracy, since a recommender system which recommends relevant items has little value to a user if the recommendations cannot expand his/her interests. Coverage is one of the most popular diversity metrics. It measures the percentage of items that an algorithm is able to recommend to users in the system. Denoting the set of distinct items in the top-N places of all recommendation lists as $I_N$, the N-dependent coverage is defined as
$$\mathrm{COV@}N = \frac{|I_N|}{|I|},$$
where $I$ is the set of all items. Furthermore, just recommending popular items is not sufficient for users, which is considered a lack of novelty. Therefore, the coverage of recommendations in the long tail of the items is also a significant evaluation metric, which indicates the novelty of recommendations to a certain degree. It can be written as
$$\mathrm{CIL@}N = \frac{|I_N^{L}|}{|I^{L}|},$$
where $I_N^{L}$ represents the intersection of $I_N$ and the long tail item set $I^{L}$. In this paper, we consider that the long tail item set contains all the items which are not in the top 20% popular item set.
In summary, 6 evaluation metrics are used to evaluate our proposed approach. NDCG, 1-call, and Recall are used to evaluate the top-N recommendation quality. COV is used to evaluate the diversity of recommendations, whereas CIL is mainly for evaluating novelty. NDCG+ is a metric to evaluate the ranking prediction quality.
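The remaining metrics can be sketched in Python as below. The sketch assumes that recommendation lists and relevant item sets are stored per user in dictionaries. It also assumes that Recall is averaged over users with at least one relevant item and that the long-tail coverage is normalized by the size of the long tail; the function names are illustrative.

```python
from typing import Dict, List, Set


def recall_at_n(top_n: Dict[int, List[int]], relevant: Dict[int, Set[int]]) -> float:
    """Mean over users of (relevant items in the top-N list) / (all relevant items)."""
    users = [u for u in top_n if relevant.get(u)]
    if not users:
        return 0.0
    return sum(len(set(top_n[u]) & relevant[u]) / len(relevant[u])
               for u in users) / len(users)


def one_call_at_n(top_n: Dict[int, List[int]], relevant: Dict[int, Set[int]]) -> float:
    """Fraction of users with at least one relevant item in their top-N list."""
    if not top_n:
        return 0.0
    hits = sum(1 for u in top_n if set(top_n[u]) & relevant.get(u, set()))
    return hits / len(top_n)


def coverage_at_n(top_n: Dict[int, List[int]], n_items: int) -> float:
    """Fraction of the catalog appearing in at least one top-N list."""
    distinct = {i for items in top_n.values() for i in items}
    return len(distinct) / n_items


def long_tail_coverage_at_n(top_n: Dict[int, List[int]], long_tail: Set[int]) -> float:
    """Fraction of long-tail items appearing in at least one top-N list."""
    distinct = {i for items in top_n.values() for i in items}
    return len(distinct & long_tail) / len(long_tail) if long_tail else 0.0
```

The long-tail set itself would be built beforehand, for example as all items outside the 20% most frequently rated ones.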

Experiment
6.1. Experiment Setup.
The proposed recommendation approaches are evaluated on the MovieLens and EachMovie datasets, which are both widely used in the field of recommender systems. The MovieLens dataset, denoted as ML, contains 100 thousand ratings assigned by 943 users to 1682 movies. The collected ratings are on a 1-to-5 scale. The EachMovie dataset, denoted as EM, contains 2.8 million ratings from 72916 users on 1628 movies. The original ratings from EM are on a 0-to-1 scale. In order to be consistent with ML, they are converted to a 0-to-5 scale, and then ratings with value 0 are excluded. In addition, the EM dataset is very sparse: some users rate only a few items, and some items are rated by only a few users. These data may reduce the recommendation quality. Therefore, we exclude the users who rated no more than 20 items and the items which are rated no more than 10 times.
We use 5-fold cross validation for the evaluation. Starting from the initial data set, five distinct splits of training and test data are generated. For each data split, 80% of the original set is included in the training data and 20% of it is included in the test data. Users' rating history in the training set is used to generate recommendations according to different algorithms. The test set is then used to evaluate the recommendation results. We further split the test set randomly into two disjoint sets of equal size. One of them is used to determine the tuning parameters. The other is for final evaluation of the trained model.
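The data split described above can be sketched in Python as follows. The function name `five_fold_splits`, the fixed seed, and the representation of ratings as a flat list are assumptions made for this illustration.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def five_fold_splits(ratings: Sequence, n_folds: int = 5, seed: int = 0
                     ) -> Iterator[Tuple[List, List, List]]:
    """Yield (train, tuning, final_test) for each fold.

    Each fold holds out 1/n_folds of the ratings as test data, which is then
    divided into two disjoint halves: one for parameter tuning and one for
    the final evaluation.
    """
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    folds = [shuffled[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [x for g in range(n_folds) if g != f for x in folds[g]]
        half = len(test) // 2
        yield train, test[:half], test[half:]
```

With 100 ratings, each fold yields 80 training ratings, 10 tuning ratings, and 10 ratings for the final evaluation.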
In order to demonstrate the effectiveness of our proposed approaches, we compare them with the original SVD++ approach and other benchmark approaches. User-based CF (UserCF) and Slope-one are two classic rating prediction approaches. UserCF [11] is a classic neighbor-based CF approach, which is based on the assumption that users like the items liked by similar users. Slope-one [23] is a memory-based CF approach based on the average rating differential. The original SVD++ is also a rating prediction approach. OrdRec [16] is a ranking prediction approach based on a pointwise ordinal model. As an improvement of SVD++, it is used to compare the level of improvement obtained by using missing data with that obtained by optimizing ranking performance. In addition, PureSVD [1] and AllRank-Regression [4] are two existing approaches dealing with missing data, which have been introduced in Section 2. They are both used for comparison. In order to minimize the impact of different original models, their ideas of using missing data are adapted with SVD++, and the resulting approaches are referred to as Pure and AllRank, respectively. All of the above-mentioned approaches are evaluated by NDCG, Recall, 1-call, COV, and CIL, with NDCG+ used for comparison.
In addition, some approaches need user-specified parameters. The details of the parameter assignments for the different approaches are as follows: the number of nearest neighbors for UserCF is 50; SVD++, Pure, and AllRank have 50 features and 25 iteration steps with $\lambda_6 = \lambda_7 = 0.05$ and $\gamma_1 = \gamma_2 = 0.002$; the $r_m$ and $w_m$ for AllRank are 2 and 0.05, respectively, as suggested in [4]; OrdRec has 50 features and 60 iteration steps with $\lambda_6 = 0.0005$, $\lambda_7 = 0.0001$, $\gamma_1 = \gamma_2 = 0.05$, and $\gamma_3 = \gamma_4 = 0.006$. Our proposed approaches are improvements of SVD++. The parameters that are already included in SVD++ are set to the same values as in SVD++. Therefore, the effectiveness of the proposed approaches is independent of the impact of these parameters. The impact of the newly added parameters is analyzed in the next subsections.

Parameters of WSVD++.
WSVD++ is an improvement of SVD++ with a weighting scheme. There are two newly added parameters, namely, $\delta$ and $r_m$. One is $\delta$, which determines the weight of the negative examples. The other is $r_m$, which is the imputed value of missing data. In order to analyze their impact, experiments are carried out to analyze the performance of WSVD++ with different values of $\delta$ and $r_m$. When one of them is analyzed, the other one remains unchanged. Here, we use only the NDCG value, which is our main evaluation metric, to analyze the performance.
Figure 1 shows the performance of WSVD++ as a function of $\delta$ with $r_m = 0$ on ML. The trends on EM are similar, so we do not show them here. Furthermore, we use the same data in the experiments of this and the next two subsections when analyzing the performance of the proposed approaches with different parameters, and we show only the results on ML. The value of $\delta$ is varied from 0 to 1 in steps of 0.1.
When $\delta$ is 0, WSVD++ degenerates to SVD++. When $\delta$ is from 0.1 to 0.2, WSVD++ performs better than SVD++. This indicates that improving SVD++ with a weighting scheme can increase the capability to recommend relevant items. When $\delta$ is bigger than 0.2, the performance of WSVD++ is not better than that of SVD++: if the missing data are weighted too much, the performance may decrease. When $\delta$ is 1, WSVD++ can be considered as Pure, and its performance is similar to that of the original SVD++. When $\delta$ is 0.2, WSVD++ achieves the best performance. As a result, the best value of $\delta$ is chosen as 0.2 in WSVD++.
Figure 2 shows the performance of WSVD++ as a function of the imputed value, with the weight set to 0.2, the best value found above. The imputed value is varied from −5 to 5 in steps of 1. When the imputed value is less than 1 (outside the rating scale), WSVD++ clearly performs better than when the imputed value lies between 1 and 5. This result verifies the effectiveness of using an imputed value outside the rating scale to model negative examples with a weighting scheme. In addition, it demonstrates that treating missing values as negative examples is more appropriate than treating them as negative ratings. The best performance is obtained at an imputed value of 0. When the imputed value is smaller than 0, the performance declines. This may be due to the low confidence of the weighted examples: an overly small imputed value biases the trained model.
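The tuning procedure described in this subsection (vary one parameter while the other is fixed, then fix the best value and vary the second) can be sketched as follows. The `evaluate` callback is a hypothetical stand-in for training a model and measuring NDCG; the toy scoring function below is invented purely to make the sketch runnable and does not reproduce any result from the experiments.

```python
def one_at_a_time_sweep(evaluate, weights, imputed_values, fixed_imputed=0.0):
    """Tune the weight first, then the imputed value, one at a time.

    evaluate(weight, imputed_value) -> score (higher is better); hypothetical.
    """
    # Step 1: vary the weight while the imputed value stays fixed.
    weight_scores = {w: evaluate(w, fixed_imputed) for w in weights}
    best_weight = max(weight_scores, key=weight_scores.get)
    # Step 2: fix the best weight and vary the imputed value.
    imputed_scores = {r: evaluate(best_weight, r) for r in imputed_values}
    best_imputed = max(imputed_scores, key=imputed_scores.get)
    return best_weight, best_imputed


# Grids matching the ranges in the text; integer steps avoid float drift.
weights = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
imputed_values = list(range(-5, 6))     # -5, -4, ..., 5


def toy_score(weight, imputed_value):
    # Invented scoring function with an arbitrary peak; NOT experimental data.
    return -((weight - 0.3) ** 2) - 0.01 * (imputed_value - 1) ** 2


best = one_at_a_time_sweep(toy_score, weights, imputed_values)
```

A one-at-a-time sweep is cheaper than a full grid search, at the cost of possibly missing interactions between the two parameters.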

Parameters of RSSVD++.
RSSVD++ is an improvement of SVD++ with a random sampling scheme. It introduces two new parameters. One is the sampling ratio, that is, the ratio of negative examples randomly selected from the missing data; the other is the imputed rating for missing data. To analyze their impact, we run experiments that measure the performance of RSSVD++ under different values of the sampling ratio and the imputed value, respectively. As in the previous subsection, we use only the NDCG value to assess performance.
Figure 3 shows the performance of RSSVD++ as a function of the sampling ratio with the imputed value set to 0. The sampling ratio is varied from 0 to 1 in steps of 0.1.
When the sampling ratio is 0, RSSVD++ degenerates to SVD++. When the sampling ratio is between 0.1 and 0.2, RSSVD++ performs better than SVD++. This indicates that improving SVD++ with a random sampling scheme can increase its capability to recommend relevant items. When the sampling ratio is larger than 0.2, RSSVD++ performs no better than SVD++. This indicates that treating too much of the missing data as negative examples does not improve recommendation quality, since some missing data are not negative examples. As a result, a small sampling ratio leads RSSVD++ to good performance. RSSVD++ achieves its best performance when the sampling ratio is 0.2. Therefore, 0.2 is chosen as the sampling ratio in RSSVD++.
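A minimal sketch of the random sampling step is given below. It assumes that the sampling ratio is applied per user to that user's set of unrated items; the function name and this interpretation of the ratio are assumptions for illustration, and the sampling procedure used in RSSVD++ may differ in detail.

```python
import random


def sample_negatives(all_items, rated_items, ratio, rng):
    """Randomly pick a fraction `ratio` of a user's unrated items.

    Assumption: the ratio is taken relative to this user's missing items.
    The sampled items would then be added to training with the imputed value.
    """
    missing = sorted(set(all_items) - set(rated_items))
    k = round(ratio * len(missing))
    return rng.sample(missing, k)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
negatives = sample_negatives(range(10), {0, 1}, 0.25, rng)
no_negatives = sample_negatives(range(10), {0, 1}, 0.0, rng)
```

With a ratio of 0 no negative examples are added, so training uses only the observed ratings, which corresponds to the degenerate case noted above.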
Figure 4 shows the performance of RSSVD++ as a function of the imputed value with the sampling ratio set to 0.2. The imputed value is varied from −5 to 5 in steps of 1. When the imputed value is less than 1, RSSVD++ performs better than when the imputed value lies between 1 and 5. When the imputed value is 0, RSSVD++ gains the best performance.

When  is 0, NSSVD++ degenerates to SVD++. When  is larger than 0, NSSVD++ performs better than SVD++ regardless of the value of . This is very different from WSVD++ and RSSVD++. The reason is that the candidate item set is heuristically selected by a neighbor-based algorithm, so most items in the set are likely to be negative examples. Furthermore, the three figures share a similar phenomenon: the NDCG performance remains stable once  reaches a certain value. For  = 20, this value is 0.2; for  = 50, it is 0.5; and for  = 80, it is 0.6. Overall, NSSVD++ achieves the best recommendation quality when  is 0.5 and  is 50. This pair of parameter values is chosen for NSSVD++.
Figure 8 shows the performance of NSSVD++ as a function of the imputed value with  = 0.5 and  = 50. The imputed value is varied from −5 to 5 in steps of 1. Similar to WSVD++ and RSSVD++, when the imputed value is less than 1, NSSVD++ performs better than when the imputed value lies between 1 and 5. However, when the imputed value is between −5 and 0, the performance is almost the same. This suggests that, for NSSVD++, a small imputed value does not bias the trained model. To be consistent with WSVD++ and RSSVD++, the imputed value chosen for NSSVD++ is still 0.

Comparison with Baselines.
In this subsection, we compare the accuracy and diversity of our proposed approaches with those of the baselines on the ML and EM datasets. For each approach, we report the NDCG and Recall values at the 1st, 3rd, and 5th positions in the recommendation list and the 1-call, COV, and CIL values at the 5th position, and compare them with the NDCG+ values at the 5th position.
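For reference, two of the accuracy metrics named above can be sketched under their commonly used definitions. The function names and the binary-relevance assumption are illustrative; the precise definitions used in the experiments (and those of COV, CIL, and NDCG+, which are not sketched here) are given elsewhere in the paper.

```python
def recall_at_n(ranked_items, relevant_items, n):
    """Fraction of a user's relevant items that appear in the top n."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_items[:n] if item in relevant)
    return hits / len(relevant)


def one_call_at_n(ranked_items, relevant_items, n):
    """1.0 if at least one relevant item appears in the top n, else 0.0."""
    relevant = set(relevant_items)
    return 1.0 if any(item in relevant for item in ranked_items[:n]) else 0.0
```

Both functions score a single user; averaging over all test users would give the reported value at a given position.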
Filtering. In some recommendation contexts, the training data consist simply of binary data reflecting a user's actions. Researchers treat these problems as one-class collaborative filtering (OCCF) problems. In these problems, users' action data are usually extremely sparse (only a small fraction are positive examples); therefore, ambiguity arises in the interpretation of the nonpositive examples. Negative examples and unlabeled positive examples are mixed together and usually cannot be distinguished. Pan et al.
are negative examples. The positive examples and the negative examples form two distinct item sets. The goal of a recommender system is to identify the unrated positive examples. As a result, it is necessary to determine which item set an item belongs to. Unfortunately, only positive examples are explicit in the recommendation context; negative examples are mixed with some positive examples in the missing data. To solve this problem, we try to distinguish the negative examples and use them together with the positive ones to learn the recommendation model. Following the idea in [4], we use an imputed value for negative examples so that both positive and negative examples are modeled in a single model. Different from Steck

3.1. Weighting Scheme.
The Weighting Scheme (WS) considers all missing data to be negative examples, with confidence levels that differ from those of the positive examples. The weighting value indicates the confidence level, which determines to what degree missing data are considered negative examples. The weighting function can be written as:

Figure 1: NDCG of WSVD++ as a function of the weight, with the imputed value set to 0, on ML.

Figure 2: NDCG of WSVD++ as a function of the imputed value, with the weight set to 0.2, on ML.

Figure 4: NDCG of RSSVD++ as a function of the imputed value, with the sampling ratio set to 0.2, on ML.

Figure 5: NDCG of NSSVD++ as a function of the value of  with  = 20 and   = 0 on ML.

Figure 7: NDCG of NSSVD++ as a function of the value of  with  = 80 and   = 0 on ML.