Eliminating the effect of rating bias on reputation systems

The rapid ongoing development of e-commerce and interest-based websites makes it increasingly pressing to evaluate objects' quality accurately before recommendation by employing an effective reputation system. An object's quality is often calculated from its historical information, such as selection records or rating scores, to help visitors make decisions before watching, reading or buying. Usually, high quality products obtain higher average ratings than low quality products regardless of rating biases or errors. However, many empirical cases demonstrate that consumers may be misled by rating scores added by unreliable users or by deliberate tampering. In this case, users' reputation, i.e., the ability to rate reliably and precisely, makes a big difference in the evaluation process. Thus, one of the main challenges in designing reputation systems is eliminating the effects of users' rating bias on the evaluation results. To give an objective evaluation of each user's reputation and uncover an object's intrinsic quality, we propose an iterative balance (IB) method to correct users' rating biases. Experiments on datasets from two online video Web sites, MovieLens and Netflix, show that the IB method is a highly self-consistent and robust algorithm that can accurately quantify movies' actual quality and users' rating stability. Compared with existing methods, the IB method has a greater ability to find the "dark horses", i.e., not so popular yet good movies, in the Academy Awards.


Introduction
The fast development of the Internet and related infrastructures has created vast opportunities for people to date, read, shop, and enjoy entertainment online [1,2,3]. As people come to rely more and more on the Internet, they expose themselves to additional risk. Disinformation and rumors mislead people into making wrong decisions. For example, some sellers on e-commerce Web sites manipulate information in order to present low quality products in a good light. How to effectively disentangle truth from falsehood and protect individuals from malicious deception is a critical problem, especially for companies that provide information services or products online [4,5,6,7]. Reputation systems arose from the need of Internet users to gain trust in the individuals they transact with online [8,9]. Additionally, reputation systems enable users and customers to better understand the provided information, products, and services [10,11]. Reputation systems may help users decide whether or not to buy specific services or goods that they have never used or purchased before [12,13,14].
Reputation systems use a collection of historical rating records and attributes of users and items to calculate their reputation or quality levels, which are usually represented as scores. Most e-commerce and interest-based websites employ some kind of reputation system to differentiate the quality of services, products or entities before recommendation or information push. For example, Netflix, which provides a DVD rental service, allows users to vote on movies and then computes a reputation score for each movie. Since ratings have a large influence on users' online purchasing decisions and on online digital content distribution, various algorithms have been proposed to give objective evaluations. Laureti et al. [15] proposed an iterative refinement (IR) method in which a user's reputation, i.e., rating stability, is inversely proportional to the difference between the user's ratings and the corresponding objects' estimated quality. The estimated quality of each object and the reputation of each user are iteratively updated until convergence is reached. Zhou et al. [16] proposed a robust ranking algorithm in which a user's reputation is calculated from the Pearson correlation between the user's ratings and the objects' estimated quality. Compared with the IR method, this method shows higher robustness against spammer attacks. More recently, Liao et al. [17] added a reputation redistribution process to the iterative ranking measurement, which effectively increases the weight of votes cast by highly reputable users and reduces the weight of users with a low reputation when estimating the quality of objects. There are also other algorithms built on Bayesian theory [18], belief theory [19], the flow model [20], or fuzzy logic concepts [21]. Most of the previous methods are based directly on ratings and neglect the fact that users may have a personal bias when they give a score to an object.
We have empirically investigated four benchmark datasets obtained from two video Web sites, MovieLens [23] and Netflix [22], and found that each user has a certain magnitude of rating error, which decreases the prediction accuracy of ratings [24]. In order to eliminate the effects of this rating error on the evaluation results, we propose a new algorithm called the iterative balance (IB) method. Experiments on the MovieLens and Netflix datasets show that the IB method is a highly self-consistent and robust algorithm that can accurately quantify a user's reputation and a movie's quality. Compared with state-of-the-art methods, the IB method has a greater ability to find the "dark horses" for the Academy Awards.
This paper is organized as follows. In section 2, we introduce the representation of rating systems and the general framework of iterative ranking algorithms. Next, we describe our IB method and some well-known iterative algorithms which will be used for comparisons. In section 3, four benchmark datasets and several evaluation metrics are described. In section 4, we show the performance of the IB method in terms of accuracy and robustness. Conclusions and discussions are drawn in the last section where the potential relevance and applications of the IB method are discussed.

Bipartite network representation of rating systems
Bipartite networks are commonly used to represent the relationships between two groups of entities, such as the relationships between actors and movies, goods and customers, books and readers, publications and authors, etc. Only links between the two groups of entities are allowed. Here, we use bipartite networks to represent rating systems, which include the set of users (denoted by $U$), the set of objects (denoted by $O$) and the ratings between users and objects (denoted by $R$). A link in the bipartite network connecting user $i$ and object $\alpha$ represents a historical rating $r_{i\alpha} \in R$. We give a simple example in Fig. 1 to show how to construct a bipartite network from a set of rating data. The original data shown in Fig. 1(a) has seven rating records made by four users on four movies. The ratings are given on an integer scale from 1 star to 5 stars (i.e., worst to best). Fig. 1(b) shows the corresponding bipartite network, where users are represented by circles and objects are represented by squares. Users are connected with the movies that they have rated. All the users who have rated object $\alpha$ form the set $U_\alpha$, while all the objects that have been rated by user $i$ form the set $O_i$. The degree $k_\alpha$ of object $\alpha$ is the number of users in the set $U_\alpha$, and the degree $k_i$ of user $i$ is the number of objects in the set $O_i$.
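The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the rating records are hypothetical (they are not the data of Fig. 1), and the names `build_bipartite`, `U_alpha`, `O_i`, `k_alpha` and `k_i` are chosen here to mirror the notation of the text.

```python
from collections import defaultdict

# Hypothetical rating records (user, object, rating); NOT the data of Fig. 1.
ratings = [
    ("u1", "m1", 5), ("u1", "m2", 3),
    ("u2", "m1", 4), ("u3", "m2", 2),
    ("u3", "m3", 1), ("u4", "m3", 4), ("u4", "m4", 5),
]

def build_bipartite(records):
    """Return (R, U_alpha, O_i) for a list of (user, object, rating) records."""
    R = {}                       # (user, object) -> rating
    U_alpha = defaultdict(set)   # object -> set of users who rated it
    O_i = defaultdict(set)       # user -> set of objects the user rated
    for user, obj, score in records:
        R[(user, obj)] = score
        U_alpha[obj].add(user)
        O_i[user].add(obj)
    return R, dict(U_alpha), dict(O_i)

R, U_alpha, O_i = build_bipartite(ratings)
k_alpha = {obj: len(users) for obj, users in U_alpha.items()}  # object degrees
k_i = {user: len(objs) for user, objs in O_i.items()}          # user degrees
```

Every rating record contributes exactly one link, so the user degrees and the object degrees both sum to the number of records.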

Iterative ranking framework
As a matter of fact, items have a set of qualities, based on a set of N traits. A user's aggregate rating reflects the quality of those traits, plus the individual weighting that reflects the user's value system. A user's reputation is the accuracy with which the user rates those traits, independent of the user's individual weighting of the traits. For convenience, this paper deals with the case where N = 1. $Q_\alpha$ and $R_i$ denote the quality of object $\alpha$ and the reputation of user $i$, respectively. Note that, when users' biases and mistakes are absent, i.e., $R_i = 1$ for every user, any two users would give any object the same rating, determined by the intrinsic quality of the object. The most straightforward way to quantify an object's quality is to consider the historical ratings that the object has received. Averaging over all ratings (abbreviated as AR) is the simplest method, which mathematically reads $\bar{Q}_\alpha = \frac{1}{k_\alpha} \sum_{i \in U_\alpha} r_{i\alpha} \quad (1)$. Obviously, in this form the ratings from different users contribute equally to $\bar{Q}_\alpha$. However, the ratings of users with higher reputation are more reliable than the ratings of low reputation users. Therefore, a weighted form for calculating the quality $Q_\alpha$ of an object was proposed, $Q_\alpha = \frac{\sum_{i \in U_\alpha} R_i \, r_{i\alpha}}{\sum_{i \in U_\alpha} R_i} \quad (2)$,
where $R_i$ is usually the normalized reputation score of user $i$. KR is "a very crude approach" to evaluating the reputation of users in a system. Its basic assumption is that a user with more experience, i.e., one who has rated more items before, has a higher ability to rate reliably and precisely. In KR, the reputation of a user is directly proportional to the number of items he or she has rated. However, because this assumption is unreliable, KR is not discussed further in this paper.
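The difference between plain averaging (AR) and reputation-weighted averaging can be illustrated with a short sketch. It assumes that the weighted form divides by the sum of the reputations of the users who rated the object, and it uses a KR-style reputation (proportional to the number of rated items) purely as an example weight. The data and the function names are hypothetical.

```python
def average_rating(ratings_by_object):
    """AR: plain mean of the ratings each object received."""
    return {obj: sum(rs.values()) / len(rs)
            for obj, rs in ratings_by_object.items()}

def weighted_quality(ratings_by_object, reputation):
    """Reputation-weighted mean (assumed normalization: sum of weights)."""
    quality = {}
    for obj, rs in ratings_by_object.items():
        total_weight = sum(reputation[user] for user in rs)
        quality[obj] = sum(reputation[user] * r for user, r in rs.items()) / total_weight
    return quality

# Hypothetical data: object -> {user: rating}.
ratings_by_object = {"m1": {"u1": 5, "u2": 3}, "m2": {"u1": 4}}

# KR-style weights: reputation proportional to the number of rated objects.
degree = {"u1": 2, "u2": 1}

ar = average_rating(ratings_by_object)
wq = weighted_quality(ratings_by_object, degree)
```

In this toy example the more active user u1 pulls the weighted quality of m1 above its plain average (13/3 versus 4).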
There are also three iterative ways to calculate each user's reputation score $R_i$. Laureti et al. [15] presented an iterative refinement method (abbreviated as IR), which takes a user's reputation score to be inversely proportional to the mean squared error between the user's rating records and the quality of the rated objects, namely $IR_i \propto \left[ \frac{1}{k_i} \sum_{\alpha \in O_i} (r_{i\alpha} - Q_\alpha)^2 \right]^{-1}$, followed by normalization. Zhou et al. [16] proposed a correlation-based iterative method (abbreviated as CR), in which a user's reputation is calculated from the Pearson correlation [25] between the user's rating records and the corresponding objects' quality. The score $TR_i$ is defined as this correlation, and normalizing $TR_i$ yields $CR_i$. More recently, Liao et al. [17] proposed a reputation redistribution process (abbreviated as IARR) to improve the validity by enhancing the influence of highly reputed users. IARR rewrites equation (7) with a tunable parameter θ that controls the influence of reputation. Obviously, when θ = 0, $IARR_i$ is a constant value for all users; when θ = 1, IARR reduces to the CR method. In this paper, we set θ = 3, as suggested by the proposers [17]. At the same time, Liao et al. also presented a similar algorithm, called IARR2, which introduces a penalty factor into IARR. IARR2 assumes that a user is more reliable if the user has rated many objects and still has a high reputation, and likewise for objects. Accordingly, IARR2 revises both equation (2) and the $TR_i$ in equation (8). In summary, under the framework of iterative models, there are four steps to achieve the final results through the four different algorithms:
(i) Initialize the reputation of users. Specifically, we set $IR_i(0) = 1/|O|$, $CR_i(0) = k_i/|O|$, $IARR_i(0) = k_i/|O|$ and $IARR2_i(0) = k_i/|O|$ for the IR, CR, IARR and IARR2 methods, respectively.
(ii) Estimate the quality of each object from the current reputation scores of the users.
(iii) Update the reputation of each user from the current quality estimates.
(iv) Continue the iteration process according to (ii) and (iii) until the change of the quality estimates is less than a threshold ε, then terminate the iteration. In our experiments, we set $\varepsilon = 10^{-6}$.
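As an illustration of the iterative framework, the following sketch implements an IR-style loop. It is not a faithful reproduction of any of the cited algorithms. It assumes (a) the quality estimate is a reputation-weighted mean, (b) the raw reputation is the inverse of the user's mean squared error, normalized to sum to one, and (c) the change of the quality estimates is measured by the sum of squared differences. The parameters `tiny` (to avoid division by zero) and `max_iter` (a safety cap) are additions made here.

```python
def iterative_refinement(ratings, eps=1e-6, max_iter=1000, tiny=1e-12):
    """IR-style sketch. ratings: dict mapping (user, object) -> rating."""
    users = {i for i, _ in ratings}
    objects = {a for _, a in ratings}
    # Step (i): uniform initial reputation 1/|O|.
    rep = {i: 1.0 / len(objects) for i in users}
    quality = {}
    for _ in range(max_iter):
        # Step (ii): reputation-weighted quality estimate (assumed form).
        num = dict.fromkeys(objects, 0.0)
        den = dict.fromkeys(objects, 0.0)
        for (i, a), r in ratings.items():
            num[a] += rep[i] * r
            den[a] += rep[i]
        new_quality = {a: num[a] / den[a] for a in objects}
        # Step (iii): reputation inversely proportional to the user's MSE.
        err = dict.fromkeys(users, 0.0)
        cnt = dict.fromkeys(users, 0)
        for (i, a), r in ratings.items():
            err[i] += (r - new_quality[a]) ** 2
            cnt[i] += 1
        raw = {i: 1.0 / (err[i] / cnt[i] + tiny) for i in users}
        total = sum(raw.values())
        rep = {i: raw[i] / total for i in users}
        # Step (iv): stop once the quality estimates barely change (assumed measure).
        change = sum((new_quality[a] - quality.get(a, 0.0)) ** 2 for a in objects)
        quality = new_quality
        if change < eps:
            break
    return quality, rep

# Hypothetical data: u1 and u2 agree with each other, u3 deviates.
sample = {
    ("u1", "m1"): 5, ("u1", "m2"): 1, ("u1", "m3"): 3,
    ("u2", "m1"): 5, ("u2", "m2"): 1, ("u2", "m3"): 3,
    ("u3", "m1"): 1, ("u3", "m2"): 5, ("u3", "m3"): 3,
}
ir_quality, ir_rep = iterative_refinement(sample)
```

On this toy input the two agreeing users end up with a higher reputation than the deviating user, and the quality estimates move toward the agreeing users' ratings.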

Iterative balance model
The above methods neglect the fact that the ratings of different users may be biased by personal interests and criteria. This bias can be measured by the standard deviation and the skewness of the user's rating records. Let us consider $|U|$ users and $|O|$ objects. Each user $i$ has a certain magnitude of rating error $\delta_i$, and each object $\alpha$ has an intrinsic quality $Q_\alpha$ which is unknown to users. The magnitude of rating error $\delta$ indicates the degree of inaccuracy of the rating score and can shift the rating either downward or upward. The rating of user $i$ on object $\alpha$, namely $r_{i\alpha}$, can then be written as $r_{i\alpha} = Q_\alpha + \delta_i$. Here, we assume that the distribution of the magnitude of rating error $\delta$ has zero mean. For an arbitrary user $i$, the magnitude $\delta_i$ can be measured by the standard deviation (SD), which reads $SD_i = \sqrt{\frac{1}{k_i} \sum_{\alpha \in O_i} (r_{i\alpha} - \bar{r}_\alpha)^2}$, where $\bar{r}_\alpha$ is the average score of all ratings on object $\alpha$. Furthermore, we also consider the skewness $SK_i$ of the rating records, which refers to the asymmetry of the real distribution of a user's rating records about its mean. $SK_i$ can be negative or positive, depending on whether the user's rating records are skewed to the left (negative skew) or to the right (positive skew) of the average rating. The datasets are summarized in Table 1, and a detailed introduction of the datasets can be found in section Materials and Methods.
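The two per-user statistics can be computed as in the sketch below. It assumes that both are taken over the deviations of a user's ratings from the average rating of each rated object, and that the skewness is the usual third standardized moment of those deviations. The exact skewness formula is not given in the text, so this form is an assumption, as are the function name and the toy data.

```python
import math

def user_sd_and_skewness(ratings):
    """ratings: dict (user, object) -> rating. Returns (sd, sk) dicts per user."""
    # Average rating of each object.
    totals, counts = {}, {}
    for (_, obj), r in ratings.items():
        totals[obj] = totals.get(obj, 0.0) + r
        counts[obj] = counts.get(obj, 0) + 1
    mean = {obj: totals[obj] / counts[obj] for obj in totals}

    # Deviations of each user's ratings from the object averages.
    deviations = {}
    for (user, obj), r in ratings.items():
        deviations.setdefault(user, []).append(r - mean[obj])

    sd, sk = {}, {}
    for user, devs in deviations.items():
        k = len(devs)
        sd[user] = math.sqrt(sum(d ** 2 for d in devs) / k)
        # Assumed form: third standardized moment; defined as 0 when SD is 0.
        sk[user] = (sum(d ** 3 for d in devs) / k) / sd[user] ** 3 if sd[user] > 0 else 0.0
    return sd, sk

# Hypothetical data: u1 rates m1 one star above the object average, u2 one below.
toy = {
    ("u1", "m1"): 5, ("u1", "m2"): 3, ("u1", "m3"): 3,
    ("u2", "m1"): 3, ("u2", "m2"): 3, ("u2", "m3"): 3,
}
sd, sk = user_sd_and_skewness(toy)
```

Here both users have the same SD, while the sign of SK separates the user who rates above the object averages from the one who rates below.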
We empirically analyze four benchmark user-movie datasets: three of them are samples from MovieLens, named M1, M2 and M3, and the fourth is from Netflix, named NF (see Table 1 for basic statistics of the datasets). For each dataset, we investigate the distributions of SD and SK over users, shown in Fig. 2(a) and Fig. 2(b), respectively. Both SD and SK follow a normal distribution, whose parameters are estimated by the maximum likelihood method. Because of this personal rating bias, we propose an iterative balance model that eliminates the bias in order to better quantify user reputation. The model assumes that the user's error magnitude satisfies equation (9), and its process can be described as follows:
(i) Initialize the quality of each object according to equation (1), i.e., $Q_\alpha(0) = \bar{Q}_\alpha$.
(ii) Update the reputation of each user by computing $IBR_i$, which measures the rating bias of user $i$. Obviously, the lower $IBR_i$ is, the higher the reputation of user $i$.
(iii) Update the quality of each object according to the update equation, where sgn(x) is the sign function, which returns 1 if x > 0, −1 if x < 0, and 0 if x = 0. Note that if $Q_\alpha(t) < 0$, then $Q_\alpha(t)$ is set to 0.
(iv) Continue iterating (ii) and (iii) until the change of the quality estimates, $\sum_{\alpha \in O} (Q_\alpha(t) - Q_\alpha(t-1))$, is less than a threshold $\varepsilon = 10^{-6}$, then terminate the iteration. The final stable values $Q_\alpha(t_c)$ and $IBR_i(t_c)$ are used to quantify the intrinsic quality of object $\alpha$ and the reputation of user $i$, respectively.
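The four steps can be sketched in Python as follows. The update equations for steps (ii) and (iii) are not reproduced in the text above, so the forms used here are assumptions and should not be read as the authors' exact method: $IBR_i$ is taken to be the root mean squared deviation of user $i$'s ratings from the current quality estimates, and each rating is shifted toward the current quality estimate by $IBR_i$ (using the sign function) before averaging. The squared-change stopping measure and the `max_iter` cap are also additions made here.

```python
import math

def sgn(x):
    """Sign function: 1 if x > 0, -1 if x < 0, 0 if x == 0."""
    return (x > 0) - (x < 0)

def iterative_balance(ratings, eps=1e-6, max_iter=1000):
    """IB-style sketch. ratings: dict mapping (user, object) -> rating."""
    by_user, by_object = {}, {}
    for (i, a), r in ratings.items():
        by_user.setdefault(i, {})[a] = r
        by_object.setdefault(a, {})[i] = r
    # Step (i): start from the plain average rating of each object.
    quality = {a: sum(rs.values()) / len(rs) for a, rs in by_object.items()}
    ibr = {}
    for _ in range(max_iter):
        # Step (ii), ASSUMED form: RMS deviation from the current quality.
        ibr = {i: math.sqrt(sum((r - quality[a]) ** 2 for a, r in rs.items()) / len(rs))
               for i, rs in by_user.items()}
        # Step (iii), ASSUMED form: shift each rating by the user's bias, then average.
        new_quality = {}
        for a, rs in by_object.items():
            corrected = [r - sgn(r - quality[a]) * ibr[i] for i, r in rs.items()]
            new_quality[a] = max(0.0, sum(corrected) / len(corrected))  # clip at 0
        # Step (iv): stop when the estimates barely change (assumed measure).
        change = sum((new_quality[a] - quality[a]) ** 2 for a in quality)
        quality = new_quality
        if change < eps:
            break
    return quality, ibr

# Hypothetical data: u3 deviates more from the consensus than u1 and u2.
sample = {
    ("u1", "m1"): 5, ("u1", "m2"): 1,
    ("u2", "m1"): 5, ("u2", "m2"): 1,
    ("u3", "m1"): 4, ("u3", "m2"): 2,
}
ib_quality, ib_bias = iterative_balance(sample)
```

Under these assumptions a lower `ib_bias` value corresponds to a higher reputation, in line with the description of $IBR_i$ in step (ii).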

Datasets
To test the performance of our IB method, we consider four benchmark datasets, which are sampled from MovieLens [23] and Netflix [22]. MovieLens is an online movie recommendation Web site that invites users to rate movies. Netflix offers a DVD rental service, and its users can vote on movies. The first three datasets, named M1, M2 and M3, are samples of different sizes from MovieLens. The fourth dataset is a random sample of all user activity records on Netflix.com. The rating scale for both MovieLens and Netflix ranges from one (i.e., worst) to five (i.e., best). Based on the users' historical records, we can construct a user-movie bipartite network. If user i selects movie α and rates it, a link between user i and movie α is established. The statistical features of the four networks constructed from the four datasets are summarized in Table 1. In this paper, we consider only users and objects with degrees greater than 20.

Evaluation metrics
To evaluate the performance of the IB method, we employ the mean squared error (MSE) to measure the algorithm's accuracy in quantifying users' reputation, and the precision to evaluate the algorithm's accuracy in identifying good movies. Besides accuracy, we also investigate the robustness of our method, which is measured by the MSE and Kendall's tau ($\tau$) coefficient [26]. A good method should give a higher reputation score to users with a lower error magnitude. $MSE(i)$ represents the scoring stability of user $i$, which reads $MSE(i) = \frac{1}{k_i} \sum_{\alpha \in O_i} (r_{i\alpha} - Q_\alpha)^2$, where $Q_\alpha$ is the intrinsic quality of object $\alpha$, i.e., the final quality value $Q_\alpha(t_c)$. Usually, comparisons focus on the top-ranked users; therefore, we consider the average MSE value of the $L$ users with the highest reputation.
A lower MSE value indicates higher accuracy. The accuracy of measuring object quality is evaluated by comparison with the movies nominated at the annual Academy Awards [27] and Golden Globe Awards [28]. These nominated movies are the benchmark good movies in the evaluation. A good algorithm will rank the benchmark movies higher than others; therefore, we apply precision to evaluate the ability of an algorithm to find good movies. Instead of considering all movies, we focus on the top-$L$ places. Precision is then defined as $P(L) = m/L$, where $m$ is the number of benchmark movies in the top-$L$ places of the ranking list. Higher precision corresponds to better performance. The robustness is measured by Kendall's tau ($\tau$) coefficient [26]. For a dataset, each method gives a ranked list of objects. If movie A is better than movie B in dataset M1, then a robust algorithm will also rank movie A higher than movie B in dataset M2 (or M3). To measure the robustness, we consider the common objects in two datasets (i.e., M1 and M2, M1 and M3, M2 and M3), and extract the sub-ranking list of the common objects from each original ranked list. Assume there are $N$ common objects in the two lists, where the quality scores of object $i$ are denoted by $Q_i$ and $Q'_i$, respectively. Kendall's tau rank correlation coefficient counts the difference between the number of concordant pairs and the number of discordant pairs, which reads $\tau = \frac{2}{N(N-1)} \sum_{i<j} \mathrm{sgn}\left[(Q_i - Q_j)(Q'_i - Q'_j)\right]$, where sgn(x) is the sign function, which returns 1 if x > 0, −1 if x < 0, and 0 if x = 0. Here a positive value of $(Q_i - Q_j)(Q'_i - Q'_j)$ means the pair is concordant, and a negative value means it is discordant. The higher the $\tau$ value is, the more robust the algorithm is. In the ideal case, $\tau = 1$ indicates that the two ranking lists are exactly the same.
We also present other representative algorithms for comparison. However, the penalty factor in IARR2 greatly amplifies the values of the users' reputation and the objects' quality, which makes the MSE value of IARR2 much larger than that of the other methods. If we plotted the curve of IARR2 in Fig. 3, all the other curves would become nearly linear, so the MSE result for IARR2 is not presented here. We observe that as $L$ increases, the MSE value of the IB method is always the lowest, indicating that the IB method is a good measure for quantifying user reputation. Besides, we also investigate the correlation between the users' reputation scores and their personal MSE values. Table 2 shows Kendall's tau correlation coefficient between the two ranking lists generated by ranking users in decreasing order of their reputation scores (the higher the better) and in increasing order of their MSE values (the lower the better). For all four datasets, the IB method yields the highest value, indicating that our IB method is highly self-consistent.
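The evaluation metrics used in this section (per-user MSE, precision in the top-L places, and Kendall's tau over common objects) can be written compactly as below. This is a minimal sketch: the function names and the toy inputs are illustrative, and the tau implementation is the simple O(N²) pairwise form rather than an optimized routine.

```python
def sgn(x):
    """Sign function: 1 if x > 0, -1 if x < 0, 0 if x == 0."""
    return (x > 0) - (x < 0)

def user_mse(user_ratings, quality):
    """Mean squared error between one user's ratings and the object qualities."""
    return sum((r - quality[a]) ** 2 for a, r in user_ratings.items()) / len(user_ratings)

def precision_at(ranked, benchmark, L):
    """Fraction of the top-L ranked objects that are benchmark objects (m / L)."""
    m = sum(1 for obj in ranked[:L] if obj in benchmark)
    return m / L

def kendall_tau(q1, q2):
    """Kendall's tau over the objects that appear in both quality dicts."""
    common = sorted(set(q1) & set(q2))
    n = len(common)
    if n < 2:
        raise ValueError("need at least two common objects")
    total = 0
    for x in range(n):
        for y in range(x + 1, n):
            a, b = common[x], common[y]
            total += sgn((q1[a] - q1[b]) * (q2[a] - q2[b]))
    return 2 * total / (n * (n - 1))

# Toy inputs (hypothetical).
quality = {"a": 4.0, "b": 3.0, "c": 2.0}
reversed_quality = {"a": 2.0, "b": 3.0, "c": 4.0}
example_mse = user_mse({"a": 5, "b": 3}, quality)
example_precision = precision_at(["a", "b", "c", "d"], {"a", "c"}, 2)
tau_same = kendall_tau(quality, quality)
tau_reversed = kendall_tau(quality, reversed_quality)
```

Identical rankings give τ = 1 and fully reversed rankings give τ = −1, which is a quick sanity check for any implementation of the coefficient.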

Accuracy for identifying good objects
Firstly, how do you define good objects? More specifically, how do you define good films? This is a well-known and highly controversial issue, and opinions on it vary from person to person. According to a collection of answers on Quora.com, many people define a good film by how much it entertains and/or moves the audience, how much it relates to the audience, or how strongly it makes the audience emote. Just as the saying goes, "Each reader creates his own Hamlet". Here we want to adopt the movies that are most interesting, most appealing and most exciting as the benchmark good films, and we believe that selecting the movies nominated for either the Academy Awards [27] or the Golden Globe Awards [28] is an authoritative choice. We adopt precision to calculate the accuracy of identifying good movies.
In Table 3, we summarize the number of nominated movies in the three MovieLens datasets. Note that users' behavior on movie rating websites changes over time, particularly before and after a movie receives an award at a famous event such as the Academy Awards or the Golden Globe Awards. The Academy Awards were first presented in 1929, while the Golden Globe Awards were first presented in 1943. However, the two data sources used in this manuscript, Netflix and MovieLens, were created in recent decades. This means that most of the rating scores were given after the movies were awarded. The datasets we obtained therefore do not allow us to explore the rating dynamics over time in this paper. We will try to study this problem in future work. Fig. 4 shows the precision of five methods, namely the IB, AR, CR, IR and IARR methods, in identifying good movies. For all methods, the precision decreases as L increases. Generally speaking, our IB method performs about average, and in some cases it performs well. For example, in the M3 dataset the IB method performs best when evaluated with the Academy Awards, but is outperformed by the IR method when evaluated with the Golden Globe Awards. Each method generates a ranked list in which the top-ranked movies are predicted to be the nominated movies. After comparing the nominated movies correctly predicted by the different methods, we find that our IB method is good at finding niches (i.e., unpopular yet good movies). This ability to find novel movies is important, since finding popular movies is much easier than digging out niches. Usually the niches constitute the so-called "long tail" market, which is considered to be promising and profitable. For instance, Netflix finds that in aggregate, "unpopular" movies are rented more than popular movies, and it provides a large number of niche movies on its Web site. The novelty of a movie can be measured by its degree, namely how many users have rated it.
An algorithm's novelty is defined as the average degree of the nominated movies in its ranking list; the lower, the better. We compare the novelty scores of the five methods. The results are shown in Fig. 5. In all presented cases, the IB method yields the lowest novelty score, indicating that the IB method has a higher ability to find "dark horses" (i.e., niches, not so popular yet good movies). Table 4 shows the movies nominated for an Academy Award that are identified by our IB method in the top-100 places, but not in the lists of other six methods, in the M2 dataset. The average number of ratings of these 13 movies is 455, much lower than the average number of ratings of all nominated movies in the M2 dataset (i.e., 717, see Table 3). Besides, among the 13 movies, only three have been rated more than 717 times. We have also checked that the results of the other four methods overlap strongly, while our IB method yields results that are considerably different from the rest. The results on the other datasets are similar, so we do not present them in detail. In the M1 dataset, there are 27 nominated movies that are correctly predicted by the IB method but cannot be identified by the other four methods. Their average number of ratings is 132, which is smaller than the average value of all nominated movies in the M1 dataset (i.e., 175, see Table 3). In the M3 dataset, there are 23 nominated movies that cannot be identified by the other four methods. Their average number of ratings is 3245, which is smaller than the average value of all nominated movies in the M3 dataset (i.e., 3942, see Table 3).
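The novelty score described above (the average degree of the nominated movies that appear in an algorithm's top-L list) can be computed as in this short sketch. The function name, the handling of an empty intersection, and the toy numbers are assumptions made for illustration.

```python
def novelty(ranked, nominated, degree, L):
    """Average degree of the nominated movies found in the top-L places.

    Returns None when no nominated movie appears in the top-L list.
    """
    hits = [movie for movie in ranked[:L] if movie in nominated]
    if not hits:
        return None
    return sum(degree[movie] for movie in hits) / len(hits)

# Hypothetical ranking, nominations and degrees (numbers of ratings).
ranked = ["a", "b", "c"]
nominated = {"a", "c"}
degree = {"a": 100, "b": 50, "c": 20}

novelty_top3 = novelty(ranked, nominated, degree, 3)
novelty_top1 = novelty(ranked, nominated, degree, 1)
```

A lower value means that the nominated movies an algorithm finds tend to be less popular ones.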

Robustness
Besides accuracy, robustness is another important aspect to consider when selecting algorithms. Robustness usually refers to an algorithm's ability to counteract malicious activities. Here we consider the algorithm's robustness across different datasets. The intrinsic quality of an object should not change between different sampled datasets. If an algorithm says that object A is better than B based on sampled dataset 1, but says that object B is better than A based on sampled dataset 2, then this algorithm is not robust, because it generates inconsistent results on different sampled datasets. Therefore, instead of adding artificial ratings to investigate the algorithm's robustness, we apply the MSE and Kendall's tau ($\tau$) coefficient to measure the consistency of the results on different sampled datasets. M1, M2 and M3 are ready-made sampled datasets for this experiment. Firstly, we form three pairs of datasets (M1 and M2, M1 and M3, M2 and M3). We consider the objects common to the two datasets in each pair, and then calculate the difference between their two quality scores. Let $Q^i_\alpha$ and $Q^j_\alpha$ denote the quality scores of object $\alpha$ in the two datasets $i$ and $j$ ($i \neq j$), respectively; then $MSE = \frac{1}{N_s} \sum_{\alpha} (Q^i_\alpha - Q^j_\alpha)^2$, where the sum runs over the common objects and $N_s$ is the number of objects common to datasets $i$ and $j$. The results are shown in Table 5. In all three cases, the IB method has the lowest MSE value. Moreover, we use Kendall's tau ($\tau$) coefficient to analyze the correlation between the two ranked lists of common objects in each pair of datasets. Table 6 shows that the Kendall's tau ($\tau$) of the IB method is the highest among all five methods. In other words, the two ranked lists of the same objects given by the IB method in different datasets are more consistent than those given by the other four methods, indicating that IB is more robust.
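The cross-dataset consistency check can be sketched as follows. The sketch assumes that the MSE is the mean of the squared differences between the two quality scores of each common object. The function name and the toy quality scores are hypothetical.

```python
def cross_dataset_mse(quality_i, quality_j):
    """Mean squared difference of quality scores over objects in both datasets."""
    common = set(quality_i) & set(quality_j)
    if not common:
        raise ValueError("the two datasets share no objects")
    return sum((quality_i[a] - quality_j[a]) ** 2 for a in common) / len(common)

# Hypothetical quality scores produced by one method on two sampled datasets.
scores_1 = {"a": 4.0, "b": 3.0, "c": 2.0}
scores_2 = {"a": 4.5, "b": 3.0, "d": 1.0}

pair_mse = cross_dataset_mse(scores_1, scores_2)
```

Only objects "a" and "b" are shared in this example, so objects that appear in a single dataset do not affect the result.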

Conclusions
Building online reputation systems is important to companies that provide services or products online (e.g., the Taobao e-business platform for goods [29], Netflix for movies, Amazon for books and other products, Pandora for music [30]). Since the reputation scores generated by the system's algorithm are usually used to assist users who want to buy or select something that they have no prior experience with, finding a good ranking method is important. A good method should be both effective (i.e., reflect the intrinsic values) and efficient (i.e., simple to calculate). Additionally, it must be robust against tampering. Users' rating bias greatly harms an algorithm's performance in terms of these three criteria. Motivated to eliminate user bias for better evaluation, we proposed an iterative balance (IB) method to identify each user's reputation and each object's quality in online rating systems. Firstly, we empirically studied the standard deviation and the skewness of users' rating scores and found that each user has a certain magnitude of rating error. Then, we introduced an equation to correct this magnitude of rating error during the iterative process. We applied the mean squared error (MSE) to measure the algorithm's accuracy in quantifying each user's reputation, and the precision to evaluate the algorithm's accuracy in identifying good objects. The algorithm's robustness is measured using both the MSE and Kendall's tau coefficient. Experiments on four benchmark datasets show that the IB method is a highly self-consistent and robust algorithm. Compared with other state-of-the-art methods, the IB method has a higher ability to identify niche items (i.e., unpopular yet good objects). For example, results on the MovieLens datasets show that the IB method is good at finding the "dark horses" for the Academy Awards.
We believe our study may find wider practical applications, such as helping online e-business platforms to identify tampering, integrating objects' quality scores into recommender systems to improve the accuracy of recommendations, and generally improving user experience. Furthermore, it may also help generate higher quality evaluation reports for sellers' reference.