A Smart Privacy-Preserving Learning Method by Fake Gradients to Protect Users' Items in Recommender Systems

In this paper, we study the problem of protecting privacy in recommender systems. We focus on protecting the items rated by users and propose a novel privacy-preserving matrix factorization algorithm. In our algorithm, the user submits a fake gradient so that the central server cannot distinguish which items the user has selected. We make the Kullback-Leibler distance between the real and fake gradient distributions small, so the two are hard to distinguish. Through theory and experiments, we show that our algorithm reduces to a time-delayed SGD, which provably converges well, so the accuracy does not decline. Our algorithm thus achieves a good tradeoff between privacy and accuracy.


Introduction
Recommender systems, which help electronic commerce websites give more useful suggestions, are becoming increasingly important. However, to provide users with appropriate options, the server collects users' data, which includes much sensitive information.
Data in electronic commerce, economics, supply chains, financial systems [1][2][3][4][5][6][7][8][9][10], etc., are generally very sensitive. In the electronic commerce case, many studies, such as [11, 12], show that user data in recommender systems (shopping records, movies a user has watched, ratings for restaurants) contain a great deal of private information, such as political attitudes and sexual orientation. In this paper, we study the privacy-protection problem for electronic commerce data. Privacy has been an important issue for a long time, not only in recommender systems but also in almost all algorithms in data mining and machine learning.
Differential privacy [13] is a popular method for protecting privacy in machine learning algorithms. For recommender systems, there are many works applying differential privacy, such as [14][15][16]. Differentially private matrix factorization algorithms are introduced in [17, 18], among others. The traditional differential privacy setting is centralized; in other words, it relies on a trustworthy data collector. When we want the central server itself to be unable to obtain private information, local differential privacy (LDP) should be used: every user adds noise to their private data on their own device before submitting it to the central server. Recommender systems with LDP are studied in [19][20][21]. LDP has been used in Google's Chrome browser [22] and Apple iOS 10 [23] to collect user data.
In local differential privacy, there are two important things to protect. The first is which items the user has rated, and the second is the user's ratings themselves. In some situations, which items have been rated is much more sensitive than the ratings: for example, a shopping record contains a lot of private information, while the ratings only reflect the quality of the goods. The work in [19] can only protect the ratings, not both. Shin et al. [17] proposed a novel LDP matrix factorization algorithm that protects both kinds of private information, based on the work in [24]. Their method is to let the user submit a noisy gradient whose value is either B or -B. The algorithm is ε-LDP in each round of the training process, and since the output is binary, the adversary cannot learn which items are rated from a single iteration.
However, if the adversary can observe noisy gradients over multiple iterations, then, since the noisy gradients for unrated items obey a Bernoulli distribution with mean 0, those items can be identified by a statistical test. The strength of the privacy protection for the ratings and items after multiple iterations can be guaranteed by composition theorems for LDP [25, 26]: if every iteration is ε-LDP, then after k iterations the overall algorithm is at most kε-LDP. But these analyses are not a direct guarantee for protecting the items rated by the users. We can take a new perspective on this question. After k iterations, consider a length-k sequence y^j = (y^j_1, . . . , y^j_k), where y^j_i is the gradient submitted in iteration i. Let P_real(y^j) be the probability that y^j is a real gradient sequence and P_fake(y^j) the probability that it is fake. Using these two probabilities, we can consider a hypothesis test between the two alternatives "the sequence is real" and "the sequence is fake". The question then becomes: how can we make the two situations difficult to distinguish?
In order to improve the privacy protection, we want the probability of error in this test to be large. Note that the average negative log probability of error is given by the Chernoff-Stein lemma.
Theorem 1 (Theorem 11.8.3 in [27]). Let X ∼ Q be a random variable and consider the hypothesis test between two alternatives, Q = P_1 and Q = P_2, where the K-L distance D(P_1 ‖ P_2) is finite. Then the average negative log probability of error of this hypothesis test is D(P_1 ‖ P_2).
Using this result, although we cannot obtain the distribution of the real sequence, we will show in Section 4 that, for the Gaussian-noise-based differential privacy algorithm, we can estimate the mean K-L distance and optimize the value of the fake gradient so that the two distributions are difficult to distinguish.
In this paper, we propose a novel algorithm in which, if an item has not been rated by the user, the user submits a fake gradient; otherwise, the user submits the real one. All submitted data eventually have noise added. The paper is organized as follows. In Section 2, we briefly introduce differential privacy as preliminaries. In Section 3, we introduce the framework of the general differentially private matrix factorization algorithm. In Section 4, we show that our algorithm reduces the average K-L distance between the fake and real gradient distributions, thereby strengthening the protection of the users' items. Meanwhile, we prove that our algorithm has the form of SGD with time delay, for which convergence can be proved, so the accuracy of the model is not reduced by our updating rules; thus our algorithm achieves a tradeoff between accuracy and privacy. In Section 5, we use experiments to show the effectiveness of our algorithm. Related work is reviewed in Section 6. In the final section, we conclude.

Preliminaries
In this paper, the notations we use are listed in Table 1.

Differential Privacy.
Differential privacy was first introduced by Dwork et al. [13]; its aim is to make it difficult for an attacker to extract private information from the output data by adding noise.

Definition 1. A randomized algorithm M: D → R with domain D and range R is (ε, δ)-differentially private if, for any two adjacent datasets d, d′ ∈ D and any subset S of the range R, it holds that

Pr[M(d) ∈ S] ≤ e^ε · Pr[M(d′) ∈ S] + δ.

Note that this definition compares the two probabilities. If δ = 0, it can be expressed as

Pr[M(d) ∈ S] ≤ e^ε · Pr[M(d′) ∈ S].

If ε is small, it is hard to distinguish whether the output comes from d or d′. As in [28], one can link differential privacy with mutual information.
Another way to describe differential privacy is through distances between distributions. We say a randomized algorithm M is (α, ε)-Rényi differentially private if, for any two adjacent datasets d, d′,

D_α(M(d) ‖ M(d′)) ≤ ε,

where D_α is the Rényi divergence of order α. When α → 1, D_1 is the Kullback-Leibler distance, and when α = ∞, Rényi differential privacy is equal to (ε, 0)-differential privacy. So we can see that differential privacy makes the output distributions under different inputs indistinguishable (the distributions have small distances).
One may ask how to achieve (ε, δ)-differential privacy in a machine learning process. A basic paradigm for achieving ε-differential privacy is to examine a query's L_2-sensitivity [29].
Definition 2. Let f be a map from the data in the dataset D to a vector. The L_2-sensitivity of f is

Δ_2 f = max_{adjacent d, d′} ‖f(d) − f(d′)‖_2.

Using this definition, we have the following theorem from [29].

Theorem 2. Let f have L_2-sensitivity Δ_2 f. For ε ∈ (0, 1), the Gaussian mechanism M(d) = f(d) + N(0, σ²I) with σ ≥ √(2 ln(1.25/δ)) · Δ_2 f / ε is (ε, δ)-differentially private.

This theorem provides a basic method for differentially private machine learning.
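As a minimal sketch of this paradigm (assuming the standard calibration σ = √(2 ln(1.25/δ)) · Δ_2 f / ε; the function name and scalar setting are ours, for illustration only):

```python
import math
import random

def gaussian_mechanism(value, l2_sensitivity, eps, delta):
    """Release a query answer with (eps, delta)-differential privacy by
    adding Gaussian noise calibrated to the query's L2-sensitivity."""
    sigma = math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / eps
    return value + random.gauss(0.0, sigma)

# Example: privatize a count query with sensitivity 1.
noisy_count = gaussian_mechanism(100.0, 1.0, eps=0.5, delta=1e-5)
```

The released value is unbiased; only its variance grows as ε and δ shrink.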

The Framework of the Perturbed Matrix Factorization Algorithm

The matrix factorization algorithm with privacy protection has been studied by many authors, such as [17, 19].
When minimizing the cost function

L(U, V) = Σ_{(i,j): y_ij = 1} (r_ij − u_i^T v_j)² + λ(Σ_i ‖u_i‖² + Σ_j ‖v_j‖²),

we can use gradient descent:

u_i ← u_i − η ∂L/∂u_i, v_j ← v_j − η ∂L/∂v_j.

The vector u_i is the profile vector for user i, and v_j is the profile vector for item j.
Note that we have

∂L/∂v_j = −2 Σ_{i: y_ij = 1} g_ij + 2λ v_j,

where

g_ij = u_i (r_ij − u_i^T v_j).    (8)

In this type of program, the user profile vectors u_i are saved and updated on the users' own devices. As for the item profile vectors, every user sends their gradient to the central server, and each user perturbs their gradient g_ij with a random mechanism M. The central server then sums all these gradients to update the item profile vectors v_j. Through this random perturbation, ε-differential privacy can be achieved by adjusting the distribution of the noise. The whole process is shown in Algorithm 1. Note that there are two types of private information: one is the users' ratings, and the other is the items the users have rated.
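For concreteness, the local step of this framework can be sketched as follows (a sketch under our notation; the constant factor in the gradient is dropped, as in (8), and the function name is ours):

```python
import numpy as np

def perturbed_user_gradient(u_i, v_j, r_ij, sigma, rng):
    """Compute the real gradient g_ij = u_i * (r_ij - u_i^T v_j) on the
    user's device and perturb it with Gaussian noise before submission."""
    g_ij = u_i * (r_ij - u_i @ v_j)
    return g_ij + rng.normal(0.0, sigma, size=g_ij.shape)
```

Setting sigma = 0 returns the exact gradient, which is convenient for testing the surrounding training loop.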
In order to protect the items, one way is to use the random response mechanism introduced in Section 4.1 of [17]. In this method, we generate y′_ij such that y′_ij = 1 with probability p, and if the original y_ij = 0, we set a fake rating r_ij = 0, so the fake gradient is u_i(0 − u_i^T v_j) by (8). Gaussian noise is then added to the final gradient sent to the central server to protect the users' ratings.
However, the discussion in Section 4.1 of [17] shows that the error caused by these fake ratings is not small, which affects the final model accuracy. The main reason is that there are many fake gradients, which introduce a large error into the expectation of the sum of gradients.
One way to solve this problem is to set the fake gradient F_ij to zero: if y_ij = 0, the user sends a random variable M(0) to the central server. This method is used in [17], where M(x) is a Bernoulli random variable with mean x. However, its disadvantage is that the distribution of gradients in the y_ij = 0 case is very different from the distribution of the real gradients. For example, an adversary can collect some of the gradients g_ij sent by user i and run a statistical test for whether these data follow a fixed distribution with mean 0; if so, the adversary learns that y_ij = 0.
All in all, we need to strike a balance between privacy and accuracy: we need a fake gradient that does not greatly affect accuracy while making the two distributions, the fake one and the real one, as statistically indistinguishable as possible.

The Main Results
In this paper, since we are concerned with the users' items, we focus on the statistical distance between the y_ij = 0 and y_ij = 1 distributions. We propose a novel algorithm to protect the users' items, in which the user submits a noise-added fake gradient in the y_ij = 0 case. The K-L distance between the real and fake distributions will be small, so that they are hard to distinguish. On the other hand, we study how the fake gradients influence the model accuracy. We will show that, in our algorithm, the updating rules reduce to a time-delayed SGD, which does not harm the accuracy.
In our algorithm, the random mechanism M we choose is the Gaussian mechanism, M(d) = N(d, σ²). One of its advantages is that there is a very good composition theorem [26], which gives a much tighter estimate of the multi-iteration privacy loss for Gaussian-mechanism-based differentially private gradient descent.
Theorem 3 (Theorem 1 in [26]). Let C be the gradient bound in private gradient descent and q the sampling ratio per iteration. There exist constants c_1 and c_2 such that, after k iterations, the Gaussian noisy private gradient descent algorithm is (ε, δ)-differentially private for any δ > 0 and any ε < c_1 q² k if we choose

σ ≥ c_2 · q √(k ln(1/δ)) / ε.

Generally, C is chosen as a prior bound on the gradient norm, so we do not write it explicitly in the algorithm description.

Complexity 3
In the case of the Gaussian random mechanism, it is easy to calculate the K-L distance between distributions. In the following section, we will show that we can find a good choice of the fake gradient.

Estimating the K-L Distance between Two Distributions.
Given a gradient sequence y^j of length k whose iterations are independent, the probability of y^j can be represented in the following product form:

P(y^j) = ∏_{t=1}^k P_t(y^j_t).

Using this form we can calculate the K-L distance: given two probability measures P_1 and P_2 on the space of length-k sequences, we have

D(P_1 ‖ P_2) = Σ_{t=1}^k D(P_{1,t} ‖ P_{2,t}).

In each iteration, the user sends a perturbed gradient g′_ij to the central server, which has the following forms:

g′_ij = g_ij + N(0, σ²) if y_ij = 1,
g′_ij = F_ij + N(0, σ²) if y_ij = 0.

Each per-iteration term is therefore the K-L distance between two Gaussian distributions with the same σ. We can show that

D(N(μ_1, σ²) ‖ N(μ_2, σ²)) = ‖μ_1 − μ_2‖² / (2σ²).    (11)

From equation (11), if we want to optimize the K-L distance, we need to consider the mean of ‖g_ij − F_ij‖² over the users with real gradients. Although we do not know the distribution of the real gradients, this mean value can be estimated by sampling. Let S be the set of users i such that y_ij = 1.
In our algorithm, for a given item j, all users use the same F; in other words, F_ij is independent of i, so we write F_j. The above expression is then a quadratic function of F_j.
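The two facts used here, the closed-form K-L distance between equal-variance Gaussians and the sample mean as the minimizer of the quadratic form, can be checked numerically (a sketch; function names are ours):

```python
import numpy as np

def kl_same_sigma(mu1, mu2, sigma):
    """D(N(mu1, sigma^2 I) || N(mu2, sigma^2 I)) = ||mu1 - mu2||^2 / (2 sigma^2)."""
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    return float(diff @ diff) / (2.0 * sigma ** 2)

def best_fake_gradient(real_grads):
    """The F minimizing the average ||g - F||^2 over the real gradients g,
    and hence the average K-L distance above, is the sample mean."""
    return np.mean(real_grads, axis=0)
```

Any other choice of F strictly increases the average squared distance, and therefore the average per-iteration K-L distance.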
Input: Random mechanism M, learning rate η, and predefined iteration number k
Output: Item profile matrix V
Randomly initialize u_i(0), v_j(0) for all i and j.
for t = 1, 2, 3, . . . do
  Initialize G_j = 0 for all j in the central server.
  for i = 1, 2, 3, . . . , m do
    On user i: sample j uniformly from {1, 2, 3, . . . , n}.
    if y_ij = 1 then
      Update u_i on the local device by gradient descent.
    end
    Send the perturbed gradient M(g_ij) to the central server and add it to G_j.
  end
  Update v_j using G_j in the central server.
end
ALGORITHM 1: Perturbed matrix factorization algorithm.

In order to minimize this K-L distance, we should set

F_j = (1/|S|) Σ_{i∈S} g_ij,

the mean of the real gradients. However, at time t, user i cannot access the current gradient ∇_{v_j} L(v_j(t)). In the following section, we show that in our algorithm it can be estimated from the previous gradients g_ij(t − 1).

Algorithmic Description.
In Algorithm 1 with the Gaussian mechanism, the central server receives the gradients submitted by the users, whose summation is as follows:

G_j = Σ_{i: y_ij = 1} (g_ij + N(0, σ²)) + Σ_{i: y_ij = 0} (F_j + N(0, σ²)).

Suppose F_j = 0; then G_j is just a Langevin stochastic gradient [30] whose expected value is the total gradient. When F_j ≠ 0, using G_j to update the parameters will generally harm the accuracy of the model. One way to solve this problem is to subtract a correction value N_j F_j in the central server.
In order to determine the value of N_j and make the F_j part small, we can use the Random Response mechanism. The Random Response mechanism [31] is a well-known method for obtaining statistical information on sensitive issues, e.g., the proportion of people with AIDS. In our algorithm, we use the Random Response mechanism to count the number of y_ij = 0 items, which the central server uses to correct the sum of the gradients. The procedure is that the responder gives the true answer with probability p > 0.5, and with probability 1 − p gives the opposite answer.
Theorem 4 (Warner, 1965, in [31]). Suppose the number of y = 0 answers is n_1 and the total number of responders is n. Then

θ̂ = (n_1/n + p − 1) / (2p − 1)

is an unbiased estimate of the true ratio θ, with variance O(1/n). So if the total number of users is large enough, with high probability θ̂ ≈ θ.
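A minimal simulation of Warner's estimator (function names are ours):

```python
import random

def randomized_response(truth, p, rng=random):
    """Answer truthfully with probability p (> 0.5), otherwise flip the bit."""
    return truth if rng.random() < p else 1 - truth

def estimate_ratio(answers, p):
    """Warner's unbiased estimate of the true fraction of 1-answers:
    theta_hat = (n1/n + p - 1) / (2p - 1)."""
    n1 = sum(answers)
    n = len(answers)
    return (n1 / n + p - 1.0) / (2.0 * p - 1.0)
```

With p near 0.5 each individual answer reveals little, while the aggregate estimate still concentrates around the true ratio at rate O(1/√n).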
The whole process is shown in Algorithm 2. It is easy to see that, in the central server, the update process has the following form:

v_j(t + 1) = v_j(t) + η (∇_{v_j} L(v_j(t), z) + ΔV_j + noise),    (21)

where ∇_{v_j} L(v_j(t), z) is the sampling stochastic gradient, ΔV_j is the residual of the fake-gradient correction, and num_j is the number of y_ij = 0 terms in the sample.
As for ΔV_j, note that since the regularization term bounds the norms of the matrices U and V, there exists a small constant β that makes the loss function L(u, v) β-smooth, that is,

‖∇L(w) − ∇L(w′)‖ ≤ β ‖w − w′‖ for all parameter pairs w, w′.

Since F_j is the mean of the previous-round real gradients, the β-smoothness bounds ΔV_j in terms of the parameter change between rounds, so the residual behaves like a gradient evaluated at the previous iterate. One can easily prove that the variances of all these estimates are O(1/n).
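The server-side correction described above can be sketched as follows (a sketch under our reading of the correction step; the function name and bookkeeping are ours):

```python
import numpy as np

def corrected_gradient_sum(submitted, theta_hat, F_j):
    """Subtract the estimated fake-gradient mass theta_hat * n * F_j from the
    raw sum of submitted gradients, leaving (in expectation) only the sum of
    the real gradients plus zero-mean noise."""
    n = len(submitted)
    G_j = np.sum(submitted, axis=0)
    return G_j - theta_hat * n * np.asarray(F_j)
```

When the Random Response estimate θ̂ is accurate, the correction removes the systematic bias contributed by the fake submissions.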

The Influence on Model Accuracy.
We can see that the updating rule (21) is a stochastic gradient descent with time delay. It can be shown that, even if the delay is not small, time-delayed SGD still has good convergence. The convergence of SGD with time delay is proved in [32], where Lian et al. prove the convergence of asynchronous stochastic gradient descent, which has the same form as equation (21).
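To illustrate why a small time delay is harmless, here is a toy delayed gradient descent on f(x) = ‖x‖²/2 (purely illustrative; not the paper's experiment):

```python
import numpy as np

def delayed_gd(x0, grad, eta, tau, steps):
    """Gradient descent whose step at time t uses the gradient evaluated at
    the iterate from tau steps earlier, as in the time-delayed update (21)."""
    xs = [np.asarray(x0, dtype=float)]
    for t in range(steps):
        x_delayed = xs[max(t - tau, 0)]
        xs.append(xs[-1] - eta * grad(x_delayed))
    return xs[-1]

# With f(x) = ||x||^2 / 2 (so grad(x) = x), a delay of tau = 1 and a small
# step size still drive the iterate to the minimizer at the origin.
x_final = delayed_gd([4.0, -3.0], lambda x: x, eta=0.1, tau=1, steps=300)
```

The delayed recursion x_{t+1} = x_t − η x_{t−1} has both characteristic roots inside the unit circle for small η, so it converges geometrically despite the stale gradient.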
Theorem 5 (Theorem 1 in [32]). Assume the loss function f is β-smooth, η is the learning rate, B is the batch size, and T is the time delay. If K is large compared with T², then after K iterations we have, with high probability,

(1/K) Σ_{k=1}^K ‖∇f(x_k)‖² ≤ O( √( (f(x_1) − f(x*)) β σ² / (B K) ) ),

where f(x*) is the global minimum of f and σ is the standard deviation of the stochastic gradients.
Proof of Theorem 5. In this case, the stochastic gradient G_{m,t} sent by node m at time t can be written as G_{m,t} = ∇f(x_{t − τ_{m,t}}) + ζ_{t,m}, where τ_{m,t} is the time delay of the gradient and ζ_{t,m} is the noise (including the noise from the stochastic gradients and the Gaussian noise we added). In our case, ζ_{t,m} is a sub-Gaussian random variable; to simplify the description, we assume ζ_{t,m} is σ-sub-Gaussian.
In order to estimate T_2 = T_{2,a} + T_{2,b}, we can use lemmas from [33].
Let ζ_k = (1/B) Σ_{m=1}^B ζ_{k,m}. With probability 1 − e^{−ι}, the averaged noise satisfies ‖ζ_k‖ ≤ O(σ √(ι/B)); this follows from Lemma 30 in [33]. With high probability the terms T_{2,a} and T_{2,b} are then both bounded, and the theorem follows. This theorem has the same form as the convergence theorem of general SGD, and in our case T = 1, so this time delay does not affect convergence. At the start of our algorithm, we need to use the Random Response mechanism to estimate the ratio of y_ij, which causes a privacy loss. However, since the machine learning algorithm requires a large number of iterations, this initial privacy loss is insignificant.
It is easy to prove that the Random Response mechanism is ln(p/(1 − p))-differentially private. We know from Theorem 3 that ε ∼ O(√(k ln(1/δ))) after k iterations. If n is large enough, we can choose p near 0.5, and when k is large, ln(p/(1 − p)) will be much less than ε.
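As a quick numerical check (the values of p, k, and δ here are illustrative, and the constant in Theorem 3 is dropped):

```python
import math

p = 0.55                                          # Random Response truth probability
rr_eps = math.log(p / (1.0 - p))                  # privacy cost of the initial step
k, delta = 400, 1e-5
iter_eps = math.sqrt(k * math.log(1.0 / delta))   # O(sqrt(k ln(1/delta))), constant dropped
# rr_eps is about 0.2, while the k-iteration term is two orders of magnitude larger.
```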
Noting that the K-L distance for a length k sequence is O(k), the discussion on the K-L distance is the same.

Experiments
We now show the performance of our algorithm. We evaluate three types of privacy gradient descent algorithms:
(i) Noisy gradient descent with fake ratings: for unrated items, the user submits the gradient of a fake rating r_ij = 0 plus Gaussian noise.
(ii) Noisy gradient descent with a zero-mean fake gradient, F_ij = ζ (pure noise).
(iii) Our algorithm in this paper.
In the F_ij = 0 case, the only noise in the total gradient is the Gaussian noise added on the users' devices. This algorithm is accurate but has no ability to protect the items' privacy. We will show that the performance of our algorithm is very close to the F_ij = 0 case and much better than the algorithm using fake ratings.

Input: Predefined iteration number k, learning rate η, probability p for Random Response, and standard deviation σ of the Gaussian distribution
Output: Item profile matrix V
For all items j, use the probability-p Random Response method to estimate the ratio of users with y_ij = 0 as θ_j.
Randomly initialize u_i(0), v_j(0) for all i and j.
for t = 1, 2, 3, . . . do
  Initialize G_j = 0, n_j = 0 for all j = 1, 2, . . . , n in the central server.
  for i = 1, 2, 3, . . . , m do
    On user i: sample B items S = {S_1, S_2, . . . , S_B} uniformly from {1, 2, 3, . . . , n}.
    for j in S do
      n_j = n_j + 1
      if y_ij = 1 then
        Update u_i on the local device by gradient descent.
        Send the perturbed real gradient to the central server and add it to G_j.
      else
        Send the fake gradient F_j plus Gaussian noise to the central server and add it to G_j.
      end
    end
  end
  In the central server, correct G_j using θ_j and update v_j.
end
ALGORITHM 2: Noisy matrix factorization with fake gradient.

We test on the MovieLens 100k dataset [34]. This version contains 100k ratings of 1682 movies submitted by 943 users.
This dataset is very sparse. In order to test performance under different levels of sparsity, for every user we choose a set F of items on which fake gradients are provided. We consider the cases #F = #S (50% fake gradient density) and #F = 3#S (75% fake gradient density). We set the profile vector dimension d = 15, regularization parameter λ = 0.001, learning rate η = 0.1, σ² = 1, and use AdaDelta to optimize. The test RMSE is shown in Figures 1 and 2.
After 400 iterations, the test RMSE is listed in Table 2. We see that when the density of fake ratings increases, the test RMSE of the fake rating algorithm grows rapidly, while the performance of our algorithm remains very close to that of the zero-mean fake gradient algorithm.

Related Work
Differential privacy, introduced by Dwork [13], provides a very strong privacy guarantee. The original version of differential privacy considers a trusted server that answers queries on the data, and the aim is to prevent queriers from accessing user privacy.
Local differential privacy algorithms, such as RAPPOR [22], ensure that the central server cannot access the users' private data. The main technique is to add noise before submitting the data to the server. In the Chrome browser, Google uses a randomized response mechanism to collect data on users' clicks. There are also many works applying local differential privacy to machine learning algorithms; for example, Google uses locally differentially private federated learning [35] to train a language model that improves its input method.
One of the difficulties in differentially private machine learning is that when a model is trained over many iterations, the privacy guarantee degrades rapidly. Differential privacy over multiple iterations is studied in [25, 26], where much tighter composition theorems are given.
Private recommender systems have been studied by many authors, such as [17-20, 36, 37]. References [17, 18] are based on matrix factorization recommender systems; their algorithms add noise locally on the users' devices to protect privacy. The algorithm in [17] can protect both the ratings and the items of the user. Their work is based on [24], which proposes a new randomization mechanism and shows that it performs better when the data dimension is large.

Conclusion
In this paper, we propose a novel private matrix factorization algorithm. In our algorithm, we use the Random Response method to estimate the selection ratios of the items, and then use the average of the gradients from the previous round as the fake gradient sent to the central server. Our method improves the indistinguishability of the real and fake gradient distributions, thereby strengthening the protection of users' private items. Meanwhile, we show that our algorithm does not reduce the accuracy of the model, since the updating rule reduces to SGD with time delay, which provably converges to stationary points.

Conflicts of Interest
The authors declare that they have no conflicts of interest.