Horizontally Partitioned Data Publication with Differential Privacy

In this paper, we study the privacy-preserving data publishing problem in a distributed environment. The data contain sensitive information; hence, directly pooling and publishing the local data will lead to privacy leaks. To solve this problem, we propose a multiparty horizontally partitioned data publishing method under differential privacy (HPDP-DP). First, to make the noise level of the published data in the distributed scenario the same as in the centralized scenario, we use the infinite divisibility of the Laplace distribution to design a distributed noise addition scheme that perturbs the locally shared data, and we use Paillier encryption to transmit the locally shared data to the semitrusted curator. Then, the semitrusted curator obtains the estimator of the covariance matrix of the aggregated data with Laplace noise, computes the principal components of the aggregated data, and returns them to each data owner. Finally, each data owner utilizes the generative model of probabilistic principal component analysis to generate a synthetic data set for publication. We conducted experiments on different real data sets; the experimental results demonstrate that the synthetic data set released by the HPDP-DP method can maintain high utility.


Introduction
The ability of people to collect and analyze data is gradually improving with the development of artificial intelligence. Sometimes the data are stored at different sites (data owners), and each site holds only a small number of samples. For example, in Figure 1, there are three hospitals; the patients in each hospital are different from each other, but the data features of each patient are the same. To better mine the useful information behind the data, a large number of samples is needed. Pooling data in one central location enables efficient data analysis and mining, but the data contain sensitive private information; directly sharing or pooling the data will lead to privacy leakage [1,2], which discourages people from sharing data. That is to say, data face serious privacy leakage risks in the process of data sharing, network transmission, and storage [3]. It is important to protect the privacy of shared data and to balance the security and availability of data [4,5]. Therefore, it is desirable to propose an efficient distributed algorithm that provides utility close to the centralized case while protecting the privacy of the data. In recent years, there has been some research on privacy-preserving data publishing and sharing, for example, the k-anonymity [6] technique and encryption techniques such as lattice-based cryptography [7] and quantum cryptography [8,9]. Differential privacy [10] has been widely used for privacy-preserving data publishing, and privacy-preserving data publishing based on differential privacy has become a research hot spot [11][12][13][14][15].
However, there are still some challenges when using the differential privacy technique to protect the privacy of the published data. One is that the data are stored by different data owners; directly pooling and publishing the data will lead to privacy leakage. The other is that, when data are stored by multiple data owners and each owner independently uses differential privacy to add noise to its locally shared data, the utility of the published data decreases as the number of data owners increases. In view of this, we propose a horizontally partitioned data publication approach with differential privacy. We make the following contributions:
(1) We propose a method for horizontally partitioned data publication with differential privacy (HPDP-DP). In a distributed environment, data are owned by multiple parties. We use the weighted average of the noised covariance matrices of the local data to estimate the covariance matrix of the pooled data. The data owners and a semitrusted curator collaborate to get the principal components of the pooled data and generate a synthetic data set for publishing.
(2) In the distributed scenario, the HPDP-DP method utilizes the infinite divisibility of the Laplace distribution and Paillier homomorphic encryption to alleviate the effects of noise, so that the aggregated data carry the same noise level as in the centralized scenario.
(3) We evaluate the performance of the HPDP-DP method through experiments on real data sets, and the experimental results show that the HPDP-DP method can generate synthetic data with high utility.

Related Work
In this section, we introduce the research status of privacy-preserving data release based on differential privacy in the centralized and distributed scenarios, respectively.

Privacy-Preserving Data Publishing in Centralized Environment.
In recent years, there has been much research on privacy-preserving data publishing based on differential privacy. Jiang et al. [16] proposed a method that adds Laplace noise to the covariance matrix and the projection matrix and then uses the noisy projection matrix to restore the data and generate a synthetic data set for publishing. Zhang et al. proposed the PrivBayes method in [17]; they used the relationships between the features to build a Bayesian network. They added Laplace noise to the low-dimensional marginal distributions to make the Bayesian network satisfy differential privacy, and then they used the Bayesian network to generate a synthetic data set for publishing. Chen et al. proposed the Jtree method in [18]. First, they proposed a sampling-based testing framework that explores pairwise dependencies while satisfying differential privacy. Then, they applied the junction tree algorithm to construct an inference mechanism that infers the joint data distribution. Finally, they efficiently generated a synthetic data set by using the noisy marginal tables and the inference model. Xu et al. [19] proposed the DPPro scheme, which releases high-dimensional data by using random projection. They projected the original high-dimensional data into a randomly selected low-dimensional subspace and added noise to the low-dimensional projected data. They theoretically demonstrated that the data published by the DPPro method have squared Euclidean distances similar to those of the original data. To address the curse of dimensionality in high-dimensional data publishing, Zhang et al. [20] presented the PrivHD method based on the junction tree. First, they used the exponential mechanism to construct a Markov network; to reduce the candidate space, a high-pass filtering technique is used in sampling. Then, they used the maximum spanning tree method to build a better junction tree. At last, a high-dimensional synthetic data set is generated for publication. Zhang et al. [21] presented the PrivMN method. They first constructed a Markov model to express the relationships between features. Then, they used the Laplace mechanism to add noise to the marginal distributions to generate a noisy marginal distribution table. Finally, they used the noisy marginal distributions to generate a synthetic data set for publishing. Gu et al. [22] proposed the PPCA-DP method; they first used principal component analysis to reduce the dimensionality of high-dimensional data and then added Laplace noise to the low-dimensional projection data; finally, they used the generative model of probabilistic principal component analysis to generate a synthetic data set for publishing.
The above are all studies on privacy-preserving data publishing in centralized scenarios.

Privacy-Preserving Data Publishing in Distributed Environment.
At present, most of the existing privacy-preserving data publishing works focus on the centralized scenario; there are fewer studies on privacy-preserving data publishing in the distributed scenario. The multiparty data release scenario studied in this paper is one in which each data owner owns a data set and uses differential privacy to protect the privacy of the local data set, rather than one in which multiple individuals keep their own data locally. The latter typically utilizes local differential privacy [23] techniques to protect the privacy of individual data [24,25]. In the following, we introduce the research status of privacy-preserving data release in the multiparty setting, where each data owner owns a data set.
Alhadidi et al. [26] proposed the first noninteractive two-party horizontally partitioned data publication method that satisfies differential privacy and secure multiparty computation. The data set published by this method is suitable for classification tasks. Hong et al. [27] constructed a framework (the CELS protocol) that enables distributed parties to securely generate outputs while satisfying differential privacy. The security and differential privacy guarantees of the protocol are proved. Ge et al. [28] presented the DPS-PCA algorithm, in which data owners collaborate to compute the principal components while protecting the privacy of the data. The DPS-PCA algorithm can trade off the accuracy of the estimated principal components against the degree of privacy protection, but this method only outputs a low-dimensional subspace of high-dimensional sparse data. An efficient and scalable distributed PCA protocol was proposed by Wang et al. [29] for computing the principal components of horizontally partitioned data in a distributed environment. First, the shared data are encrypted and sent to a semitrusted third party. Second, the shared data are aggregated by the semitrusted third party, and the aggregated result is sent to the data consumer. Finally, the data consumer performs a principal component analysis and obtains the principal components of the pooled data. Cheng et al. [30] presented the DP-SUBN$^3$ approach; the data owners build a Bayesian network with the assistance of a semitrusted curator, and then the Bayesian network is used to generate a synthetic data set. In the DP-SUBN$^3$ approach, the four stages of correlation quantification, structure initialization, structure update, and parameter learning all need to access the local data set, and each stage satisfies differential privacy, which in turn makes the DP-SUBN$^3$ approach satisfy differential privacy. For the privacy protection of data publishing with arbitrary partitions between two parties, Wang et al. [31] presented the first distributed algorithm, which generates anonymous data from two parties. To prevent both parties from leaking private information, the anonymization process satisfies both differential privacy and secure two-party computation. Gu et al. [32] presented the PPCA-DP-MH approach. The data owners collaborate with a semitrusted curator to reduce the dimensionality of the data, and then the data owners use the generative model of probabilistic principal component analysis to generate a published data set. In the PPCA-DP-MH method, since multiple data owners add noise to the data locally and independently, the utility of the published data gradually decreases as the number of data owners increases. In response to this challenge, we propose the HPDP-DP method in this paper. We design a scheme for generating and adding correlated noise, so that the utility of the published data does not decrease as the number of data owners increases and can even gradually increase.

Probabilistic Principal Component Analysis (PPCA).
Principal component analysis (PCA) is one of the most commonly used dimensionality reduction methods. It is a statistical analysis method that converts multiple variables into a few hidden variables. These low-dimensional, uncorrelated hidden variables are also called principal components, and they reflect most of the information in the original variables. The main process of finding the principal components is as follows. First, compute the covariance matrix $\Sigma$ of the data. Then perform the eigenvalue decomposition $\Sigma = U \Lambda U^{T}$, where $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ of $\Sigma$. The corresponding eigenvectors $u_1, u_2, \ldots, u_p$ are called the principal components, and $U$ is the orthogonal matrix consisting of these eigenvectors. Usually, the number $k$ of retained principal components is determined by the cumulative contribution rate $c = \sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{p} \lambda_i$. PCA itself is not a generative model; however, Michael et al. [33] showed that PCA has a generative counterpart called probabilistic principal component analysis (PPCA). The most common model for associating low-dimensional latent variables with high-dimensional observable variables is the factor analysis model, i.e., $x = Ws + \mu + \xi$, where $x$ is a $p$-dimensional observation vector consisting of the $p$ original variables, $s$ is a $k$-dimensional vector consisting of $k$ latent variables, $\xi \sim N(0, \Psi)$, and the matrix $W$ associates the vector $x$ with the vector $s$. The vector $\mu$ allows the model to have a nonzero mean vector.
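The PCA steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `top_k_components`, the $1/N$ covariance convention, and the synthetic test data are assumptions made here and are not taken from the paper.

```python
import numpy as np

def top_k_components(X, c=0.8):
    """Eigen-decompose the covariance of X and keep the smallest k whose
    cumulative contribution rate reaches the threshold c."""
    Xc = X - X.mean(axis=0)                    # center each attribute
    sigma = Xc.T @ Xc / X.shape[0]             # covariance matrix (1/N convention)
    eigvals, eigvecs = np.linalg.eigh(sigma)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    rate = np.cumsum(eigvals) / eigvals.sum()  # cumulative contribution rate
    k = min(int(np.searchsorted(rate, c)) + 1, len(eigvals))
    return eigvals, eigvecs[:, :k], rate, k

# Hypothetical data: five attributes with very different variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
eigvals, U_k, rate, k = top_k_components(X, c=0.8)
print(k, rate.round(3))
```

The columns of `U_k` play the role of $u_1, \ldots, u_k$, and `rate[k - 1]` is the cumulative contribution rate $c$ actually achieved.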
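Theorem 1 [33]. From Figure 2 and the latent variable model $x = Ws + \mu + \xi$, when $\xi \sim N(0, \sigma^2 I)$ and $s \sim N(0, I_k)$, then $x \mid s \sim N(Ws + \mu, \sigma^2 I_p)$, $\sigma > 0$, $W \in \mathbb{R}^{p \times k}$, where the maximum likelihood estimates of $\mu$, $\sigma^2$, and $W$ are $\mu_{ML} = \bar{\mu}$, $\sigma^2_{ML} = \frac{1}{p-k} \sum_{i=k+1}^{p} \lambda_i$, and $W_{ML} = U_k (\Lambda_k - \sigma^2_{ML} I_k)^{1/2}$ (up to an arbitrary rotation of the latent space). Here $\bar{\mu}$ is the sample mean vector, the column vectors of $U_k$ are the eigenvectors corresponding to the top $k$ eigenvalues of the covariance matrix, and $\Lambda_k$ is the diagonal matrix of those eigenvalues.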

Differential Privacy.
Differential privacy is a strong privacy protection model that is independent of background knowledge. If the output of a privacy-preserving algorithm is insensitive to small changes in the input, the algorithm satisfies differential privacy. The essence of differential privacy is to randomly perturb the query results, so that an observer cannot infer the original input information from the query results.
Definition 1 (Differential Privacy) [10]. A randomized algorithm $\mathcal{M}$ satisfies $\varepsilon$-differential privacy if, for any two neighboring data sets $D$ and $D'$ (only one record differs between the two data sets) and for any $S \subseteq \mathrm{Range}(\mathcal{M})$, $\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \Pr[\mathcal{M}(D') \in S]$, where $\varepsilon$ is a small positive real number, also called the privacy budget.
In Definition 1, $\varepsilon$ controls the ratio of the probabilities that the randomized algorithm $\mathcal{M}$ produces the same output on the two neighboring data sets $D$ and $D'$; it reflects the level of privacy protection that the algorithm $\mathcal{M}$ can provide.
Definition 2 (Sensitivity) [10]. Let $f$ be a function that maps a data set into a fixed-size vector of real numbers, $f: D \to \mathbb{R}^d$. For any neighboring data sets $D$ and $D'$, the sensitivity of $f$ is defined as $\Delta f = \max_{D, D'} \| f(D) - f(D') \|_1$, where $\| \cdot \|_1$ denotes the $L_1$ norm.
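Definitions 1 and 2 are usually combined in the Laplace mechanism, which adds noise of scale $\Delta f / \varepsilon$ to a query result. The following minimal sketch illustrates this for a counting query; the helper names `laplace_noise` and `laplace_mechanism` and the toy data are hypothetical, and the sensitivity of 1 assumes that neighboring data sets differ in a single record.

```python
import random

def laplace_noise(scale, rng=random):
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    # Perturb the query answer with Laplace noise of scale sensitivity / epsilon.
    return true_value + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)
ages = [34, 51, 29, 62, 45]                    # toy data set
count_over_40 = sum(a > 40 for a in ages)      # counting query, sensitivity 1
noisy_count = laplace_mechanism(count_over_40, sensitivity=1.0, epsilon=0.5, rng=rng)
print(count_over_40, noisy_count)
```

A smaller privacy budget gives a larger noise scale and therefore stronger protection at the cost of accuracy.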

Paillier Encryption and Decryption.
In this paper, we use the Paillier encryption scheme [36] to encrypt the locally shared data before aggregation. The Paillier encryption scheme is described as follows:
(1) Key generation: $n = pq$, where $p$ and $q$ are large primes, and $\lambda = \mathrm{lcm}(p - 1, q - 1)$. The Euler function is $\Phi(n) = (p - 1)(q - 1)$, and $g \in \mathbb{Z}^{*}_{n^2}$. The pair $(n, g)$ is the public key and $\lambda$ is the private key.
(2) Encryption: for a plaintext $m < n$, randomly select $r < n$; the ciphertext is $c = g^{m} \cdot r^{n} \bmod n^2$.
Paillier encryption is additively homomorphic. We use $[[m]]$ to represent the ciphertext of $m$. Then, $[[m_1]] \cdot [[m_2]] \bmod n^2 = [[m_1 + m_2]]$.
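The additive homomorphism can be checked with a toy implementation. This is a sketch for illustration only: the primes are far too small to be secure, the choice $g = n + 1$ is one common valid choice assumed here, and the decryption step is the standard Paillier decryption $m = L(c^{\lambda} \bmod n^2) \cdot L(g^{\lambda} \bmod n^2)^{-1} \bmod n$ with $L(u) = (u - 1)/n$, which is not spelled out in the description above.

```python
import math
import random

def keygen(p, q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lambda = lcm(p-1, q-1)
    g = n + 1                                          # assumed choice of g
    return (n, g), lam

def L(u, n):
    return (u - 1) // n

def encrypt(pub, m, rng):
    n, g = pub
    while True:                                        # pick r coprime to n
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pub, lam, c):
    n, g = pub
    mu = pow(L(pow(g, lam, n * n), n), -1, n)          # modular inverse (Python 3.8+)
    return (L(pow(c, lam, n * n), n) * mu) % n

rng = random.Random(0)
pub, lam = keygen(1009, 1013)                          # toy primes, insecure
n2 = pub[0] ** 2
c1, c2 = encrypt(pub, 123, rng), encrypt(pub, 456, rng)
total = decrypt(pub, lam, (c1 * c2) % n2)              # product of ciphertexts
print(total)                                           # sum of the plaintexts
```

Multiplying ciphertexts adds the plaintexts, which is what allows a curator to aggregate encrypted values element-wise without seeing any individual contribution.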

The HPDP-DP Method
There exist $M$ ($M \ge 2$) data owners; the $m$-th data owner $P_m$ holds a local data set $X_m$ with $N_m$ records and $p$ attributes, $m = 1, 2, \ldots, M$. The data sets $X_1, X_2, \ldots, X_M$ can be viewed as a horizontal split of the integrated data set $X = \bigcup_{m=1}^{M} X_m$ among the $M$ data owners. That is, all the local data sets have the same attributes and do not intersect with each other. Our goal is to design an algorithm that can publish these horizontally partitioned data sets privately; specifically, with the assistance of a semitrusted curator, the $M$ data owners and the curator collaborate to publish a synthetic data set $\tilde{X} = \bigcup_{m=1}^{M} \tilde{X}_m$, which has the same scale and statistical properties as the data set $X = \bigcup_{m=1}^{M} X_m$. We assume that the data owners and the curator are honest-but-curious; that is, they will follow the protocol but try to find out as much secret information as possible.
In view of the above scenario, we propose a horizontally partitioned data publishing method with differential privacy (HPDP-DP). Algorithm 1 depicts the HPDP-DP algorithm. First, each data owner perturbs its local scatter matrix with random noise that obeys the Gamma distribution and sends it to the semitrusted curator. Then the semitrusted curator aggregates all the local scatter matrices to get the noisy estimator of the covariance matrix of the pooled data. The semitrusted curator performs eigenvalue decomposition on this covariance matrix to get the principal components, and the top $k$ principal components are sent to each data owner. At last, each data owner uses the top $k$ principal components and the generative model of probabilistic principal component analysis to generate a synthetic data set.
To reduce the impact of noise on the availability of the published data, the HPDP-DP algorithm employs a distributed Laplace mechanism to add noise to the local scatter matrix. According to Theorem 2 (the infinite divisibility of the Laplace distribution), we perturb each local scatter matrix with noise that follows a Gamma distribution, so that the estimator of the covariance matrix of the pooled data contains the same level of noise as in the centralized scenario. Since perturbing the local scatter matrix with Gamma-distributed noise does not by itself satisfy differential privacy, we follow [37] and use the Paillier encryption scheme to encrypt the perturbed scatter matrix to protect the privacy of the local data. The HPDP-DP algorithm mainly consists of the following stages.
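The infinite divisibility property can be illustrated numerically: if each of $M$ parties contributes the difference of two independent Gamma variables with shape $1/M$ and scale $b$, the sum over all parties follows $\mathrm{Lap}(b)$. The sketch below checks this by simulation; the helper names are hypothetical, and treating the second Gamma parameter as a scale equal to the target Laplace scale is an assumption about the parametrization rather than something stated in the text.

```python
import random

def local_noise_share(M, scale, rng):
    # One data owner's share: difference of two Gamma(shape=1/M, scale=scale) draws.
    return rng.gammavariate(1.0 / M, scale) - rng.gammavariate(1.0 / M, scale)

def aggregated_noise(M, scale, rng):
    # Sum of the M local shares; distributed as Laplace(0, scale).
    return sum(local_noise_share(M, scale, rng) for _ in range(M))

rng = random.Random(7)
M, scale = 5, 2.0
samples = [aggregated_noise(M, scale, rng) for _ in range(20000)]
mean_abs = sum(abs(s) for s in samples) / len(samples)  # E|Lap(b)| = b
second_moment = sum(s * s for s in samples) / len(samples)  # Var Lap(b) = 2 b^2
print(round(mean_abs, 2), round(second_moment, 2))
```

The empirical values should be close to $b = 2$ and $2b^2 = 8$, independent of $M$, which is why the aggregate carries only one share of Laplace noise no matter how many owners participate.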

Security and Communication Networks
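Each data owner $P_m$ perturbs its local scatter matrix $L_m = (l^{m}_{ij})_{p \times p}$ to obtain $\hat{L}_m = L_m + B^{m1} - B^{m2}$, where $B^{m1} = (b^{m1}_{ij})_{p \times p}$ and $B^{m2} = (b^{m2}_{ij})_{p \times p}$ are symmetric random matrices whose entries $b^{m1}_{ij}$ and $b^{m2}_{ij}$ are sampled from $\mathrm{Gamma}((1/M), (p + p^2)/M\varepsilon)$. Using the public key $(n, g)$ and $\theta_m$, the data owner encrypts each element of $\hat{L}_m$ to get the encrypted matrix $C_m = (\theta_m \cdot g^{\,l^{m}_{ij} + b^{m1}_{ij} - b^{m2}_{ij}} \cdot r^{n} \bmod n^2)_{p \times p}$, which is sent to the curator, $m = 1, 2, \ldots, M$.
Aggregation and decryption phase: After receiving the encrypted matrices $C_1, C_2, \ldots, C_M$, the curator computes their Hadamard (element-wise) product, denoted by $\circ$: $C = C_1 \circ C_2 \circ \cdots \circ C_M = \left( \prod_{m=1}^{M} \theta_m \cdot g^{\,l^{m}_{ij} + b^{m1}_{ij} - b^{m2}_{ij}} \cdot r^{n} \bmod n^2 \right)_{p \times p}$. Because Paillier encryption is additively homomorphic, the exponent of $g$ in each entry of $C$ is $\sum_{m=1}^{M} (l^{m}_{ij} + b^{m1}_{ij} - b^{m2}_{ij})$, where $\sum_{m=1}^{M} (b^{m1}_{ij} - b^{m2}_{ij}) \sim \mathrm{Lap}((p + p^2)/\varepsilon)$ holds due to Theorem 2. The curator decrypts the above result to get the sum of the local scatter matrices with Laplace noise, $\hat{L} = \sum_{m=1}^{M} \hat{L}_m = \sum_{m=1}^{M} L_m + (\mathrm{Lap}((p + p^2)/\varepsilon))_{p \times p}$, which is used as an estimate of the scatter matrix of the pooled data; the estimate of the covariance matrix of the pooled data is then $\hat{\Sigma} = (1/N)\hat{L}$.
In this stage, our idea is to use the weighted average of the local covariance matrices to estimate the covariance matrix of the pooled data. Assuming that the covariance matrix of data owner $P_m$ is $\hat{\Sigma}_m$, its relationship with the scatter matrix is $\hat{\Sigma}_m = \hat{L}_m / N_m$, and the estimate of the covariance matrix of the pooled data is $\hat{\Sigma} = \sum_{m=1}^{M} \frac{N_m}{N} \hat{\Sigma}_m = \frac{1}{N} \sum_{m=1}^{M} \hat{L}_m = \frac{1}{N} \hat{L}$, where $N = \sum_{m=1}^{M} N_m$.
Principal component analysis phase: The curator performs eigenvalue decomposition on the matrix $\hat{\Sigma}$, obtains the eigenvectors $u_1, u_2, \ldots, u_k$ (the top $k$ principal components), and sends them to each data owner.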
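The aggregation and principal component steps can be sketched as follows. To keep the example short, encryption is omitted and the symmetric Laplace noise is added once to the summed scatter matrix, which simulates what the curator would hold after decryption; this shortcut, the helper names, and the random toy partitions are assumptions made for illustration.

```python
import numpy as np

def local_scatter(X_m):
    # Scatter matrix of one owner's data around its local mean.
    Xc = X_m - X_m.mean(axis=0)
    return Xc.T @ Xc

def symmetric_laplace(p, scale, rng):
    # Draw independent noise for entries with i <= j and mirror it.
    upper = np.triu(rng.laplace(0.0, scale, size=(p, p)))
    return upper + np.triu(upper, 1).T

rng = np.random.default_rng(1)
p, eps = 4, 1.0
parts = [rng.random((n_m, p)) for n_m in (200, 300, 500)]  # data in [0, 1]
N = sum(len(X_m) for X_m in parts)

L_sum = sum(local_scatter(X_m) for X_m in parts)
# Weighted average of local covariances equals the summed scatter divided by N.
weighted = sum(len(X_m) / N * (local_scatter(X_m) / len(X_m)) for X_m in parts)

L_hat = L_sum + symmetric_laplace(p, (p + p**2) / eps, rng)  # noisy aggregate
sigma_hat = L_hat / N                                        # covariance estimate

eigvals, eigvecs = np.linalg.eigh(sigma_hat)
top = np.argsort(eigvals)[::-1][:2]                          # indices of top k = 2
U_k = eigvecs[:, top]
print(np.allclose(weighted, L_sum / N), U_k.shape)
```

Because the noise scale does not depend on the number of parts, adding more owners (and hence more records) makes the noise term in `sigma_hat` relatively smaller.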
Synthetic data set generation phase: Each data owner uses the returned top $k$ principal components and the generative model of probabilistic principal component analysis in Theorem 1 to generate a synthetic data set.
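A minimal sketch of the generation step under the PPCA model $x = Ws + \mu + \xi$ is given below. The function `ppca_generate` is hypothetical; it takes the top-$k$ eigenvectors, their eigenvalues, the noise variance $\sigma^2$, and the mean vector as inputs, since how each owner obtains $\sigma^2$ and $\mu$ is not detailed here. Clipping $\lambda_i - \sigma^2$ at zero is an added safeguard (an assumption) for eigenvalues distorted by noise.

```python
import numpy as np

def ppca_generate(U_k, lam_k, sigma2, mu, n_samples, rng):
    # W = U_k (Lambda_k - sigma^2 I)^{1/2}; scaling each column of U_k.
    W = U_k * np.sqrt(np.clip(lam_k - sigma2, 0.0, None))
    s = rng.standard_normal((n_samples, U_k.shape[1]))   # latent s ~ N(0, I_k)
    xi = np.sqrt(sigma2) * rng.standard_normal((n_samples, U_k.shape[0]))
    return s @ W.T + mu + xi                             # x = W s + mu + xi

rng = np.random.default_rng(0)
U_k = np.eye(3)[:, :2]                 # hypothetical principal components
lam_k = np.array([4.0, 2.0])           # hypothetical top-2 eigenvalues
mu = np.array([1.0, 2.0, 3.0])         # hypothetical mean vector
synthetic = ppca_generate(U_k, lam_k, 0.5, mu, 1000, rng)
print(synthetic.shape)
```

The covariance of the generated records is $WW^{T} + \sigma^2 I$, so the synthetic data reproduce the variance along the retained principal directions and isotropic noise elsewhere.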

Security Analysis
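Theorem 5. The data set owned by $P_m$ is $X_m$ and its corresponding scatter matrix is $L_m = (l^{m}_{ij})_{p \times p}$, $m = 1, 2, \ldots, M$. Define the query function $f(X) = \sum_{m=1}^{M} L_m$, the output intended to be protected. If $B^{m1} = (b^{m1}_{ij})_{p \times p}$ and $B^{m2} = (b^{m2}_{ij})_{p \times p}$ are symmetric random matrices that are added to $L_m$, with $b^{m1}_{ij}$ and $b^{m2}_{ij}$ sampled from $\mathrm{Gamma}((1/M), (p + p^2)/M\varepsilon)$ for $1 \le i \le j \le p$, and the algorithm is $\mathcal{M}(X) = \sum_{m=1}^{M} (L_m + B^{m1} - B^{m2})$, then the algorithm $\mathcal{M}$ satisfies $\varepsilon$-differential privacy.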
Proof. According to Theorem 2, each element of $\sum_{m=1}^{M} (B^{m1} - B^{m2})$ obeys $\mathrm{Lap}(p(1 + p)/\varepsilon)$. So, next we prove that if $\mathcal{M}(X) = L + B$, where $B = (b_{ij})_{p \times p}$ is a symmetric random matrix and $b_{ij}$ is sampled from $\mathrm{Lap}(p(1 + p)/\varepsilon)$, $1 \le i \le j \le p$, then the algorithm $\mathcal{M}$ satisfies $\varepsilon$-differential privacy. We denote the two neighboring data sets as $X = \bigcup_{m=1}^{M} X_m$ and $X' = \bigcup_{m=1}^{M} X'_m$, which differ in only one individual; without loss of generality, suppose the differing individuals are in $X_M$ and $X'_M$. We denote the two differing individuals as $x^{M}_{N_M} \in X_M$ and $x'^{M}_{N_M} \in X'_M$. Assume that all individual data have been normalized to the $[0, 1]$ interval. The estimates of the scatter matrices of $X$ and $X'$ are $L = (l_{ij})_{p \times p} = \sum_{m=1}^{M} L_m$ and $L' = (l'_{ij})_{p \times p} = \sum_{m=1}^{M-1} L_m + L'_M$, respectively. Let $B = (b_{ij})_{p \times p}$ and $B' = (b'_{ij})_{p \times p}$ be two independent symmetric random matrices, where $b_{ij}$ and $b'_{ij}$ are sampled from $\mathrm{Lap}(p(1 + p)/\varepsilon)$, $1 \le i \le j \le p$.
Let $S = L + B$ and $S' = L' + B'$; then the log ratio of the probabilities of $S$ and $S'$ at a point $H = (h_{ij})_{p \times p}$ is given by $\ln \frac{\Pr[S = H]}{\Pr[S' = H]} = \frac{\varepsilon}{p(1 + p)} \sum_{1 \le i \le j \le p} \left( |h_{ij} - l'_{ij}| - |h_{ij} - l_{ij}| \right)$. According to the definition of differential privacy (Definition 1), we need to prove that the following inequality holds: $\left| \ln \frac{\Pr[S = H]}{\Pr[S' = H]} \right| \le \varepsilon$. The mean vectors of $X_M$ and $X'_M$ are $\mu_M = \frac{1}{N_M} \sum_{i=1}^{N_M} x^{M}_{i}$ and $\mu'_M = \frac{1}{N_M} \left( \sum_{i=1}^{N_M - 1} x^{M}_{i} + x'^{M}_{N_M} \right)$. Hence, since all data lie in $[0, 1]$, we have $\sum_{1 \le i \le j \le p} |l_{ij} - l'_{ij}| \le p(1 + p)$. Therefore, by the triangle inequality, $\left| \ln \frac{\Pr[S = H]}{\Pr[S' = H]} \right| \le \frac{\varepsilon}{p(1 + p)} \sum_{1 \le i \le j \le p} |l_{ij} - l'_{ij}| \le \varepsilon$. So the conclusion of Theorem 5 holds.
Security against external attacks: An external attacker may eavesdrop on the data sent by the local data owners to the curator. According to the semantic security of Paillier encryption against chosen-plaintext attacks, an external attacker is unable to decrypt the data $(\theta_m \cdot g^{\,l^{m}_{ij} + b^{m1}_{ij} - b^{m2}_{ij}} \cdot r^{n} \bmod n^2)_{p \times p}$ without knowing the private key $\lambda$ and $\theta_m$, $1 \le m \le M$. An external attacker may also eavesdrop on the aggregated value of the data owners, $(g^{\sum_{m=1}^{M} (l^{m}_{ij} + b^{m1}_{ij} - b^{m2}_{ij})} \cdot r^{Mn} \bmod n^2)_{p \times p}$, but is unable to decrypt it without knowing the private key $\lambda$. Even if the external attacker obtains the sum of the scatter matrices with noise, $(\sum_{m=1}^{M} (l^{m}_{ij} + b^{m1}_{ij} - b^{m2}_{ij}))_{p \times p}$, the local data are still safe according to Theorem 5, because the sum contains Laplace noise.
Security against internal attacks: Internal adversaries are the data owners and the curator. The data owner $P_m$ holds $\theta_m$ secretly; the rest of the data owners and the curator cannot decrypt $(\theta_m \cdot g^{\,l^{m}_{ij} + b^{m1}_{ij} - b^{m2}_{ij}} \cdot r^{n} \bmod n^2)_{p \times p}$ without the private key $\lambda$ and $\theta_m$, unless the curator colludes with the other $M - 1$ data owners. The curator can use the private key $\lambda$ and $\theta_0$ to decrypt the aggregated value $(\prod_{m=1}^{M} \theta_m \cdot g^{\,l^{m}_{ij} + b^{m1}_{ij} - b^{m2}_{ij}} \cdot r^{n} \bmod n^2)_{p \times p}$, but the curator can only get the aggregated value with Laplace noise, so the local data are safe according to Theorem 5.
Communication cost analysis. There exist three stages that incur communication costs.
In the first stage, the $M$ data owners send their local scatter matrices to the curator; the size of the message sent by each data owner is $p^2$, so the total size of the messages sent in this stage is $Mp^2$. In the second stage, the curator sends the top $K$ eigenvalues and their corresponding eigenvectors to each data owner; the total size of the messages sent in this stage is $MpK^2$. In the third stage, each data owner sends its synthetic data set to the curator; the size of the message sent by data owner $P_m$ is $n_m p$, $m = 1, 2, \ldots, M$, so the total size of the messages sent during this stage is $np = (n_1 + n_2 + \cdots + n_M)p$.

Experiment
In this section, we experimentally evaluate the performance of the HPDP-DP algorithm by comparing it with the DP-SUBN$^3$ algorithm [30]. We conduct experiments on two real data sets, NLTCS [38] and Adult [39]. The NLTCS data set contains 21574 individuals, each with 16 attributes. The Adult data set contains 45222 individuals, each with 15 attributes. We use the method in [30] to preprocess the Adult data set; after processing, the number of attributes in the Adult data set is 52. We use SVM classification accuracy to evaluate the performance of the HPDP-DP algorithm, and we train multiple classifiers on the published synthetic data sets. For the NLTCS data set, we predict whether a person is unable to go outside and whether a person is unable to manage money. For the Adult data set, we predict whether a person holds a postsecondary degree and whether a person earns more than 50K. In each classification task, we use 20% of the individuals as the test set and 80% of the individuals as the training set. Each experiment is run five times, and the average results are reported. The number of retained principal components is determined by the cumulative contribution rate $c$, which is set to 0.8 for the NLTCS data set and 0.95 for the Adult data set. To measure the performance of the HPDP-DP algorithm more clearly, the same SVM classifiers are also trained on the original data set; we label the SVM classification accuracy on the original data set with "No Privacy."

The Impact of the Number of Principal Components Retained on the SVM Classification Accuracy.
In this section, we train multiple classifiers to study the influence of the number of principal components retained on the SVM classification accuracy. In this set of experiments, the number of data owners is set to 3, and the privacy budget $\varepsilon$ is set to 0.5.
For the Adult data set, Figures 3(a) and 3(c) show the cumulative contribution rate and individual contribution rate of the principal components. Because the Adult data set has many attributes after preprocessing, we only marked the corresponding SVM classification accuracy when the number of retained principal components $k$ is 5, 10, 15, 20, 25, 30, 35, and 40 in Figures 3(b) and 3(d). For the NLTCS data set, it can be seen from Figures 3(e) and 3(g) that the first principal component alone contributes more than 30%. The cumulative contribution rate of the top seven principal components reaches 80%, and it can be seen from Figures 3(f) and 3(h) that the corresponding SVM classification accuracy reaches more than 80%. The common conclusion is that when the cumulative contribution rate increases (that is, the number of principal components retained increases), the SVM classification accuracy increases accordingly. This phenomenon is consistent with the principle of principal component analysis. The principal components are not correlated with each other and contain the information of the original data. The more principal components are retained, the more information of the original data is contained in the published data, and the better the performance of the published data set.
HPDP-DP and DP-SUBN$^3$ with Different Privacy Budgets.
In this part of the experiments, we fixed the number of data owners to three while making the privacy budget $\varepsilon$ take different values. Figure 4 shows the impact of the privacy budget on the HPDP-DP and DP-SUBN$^3$ algorithms. Figures 4(a) and 4(b) show the SVM classification accuracy of the HPDP-DP and DP-SUBN$^3$ algorithms on the Adult data set. Figures 4(c) and 4(d) show the SVM classification accuracy of the HPDP-DP and DP-SUBN$^3$ algorithms on the NLTCS data set. From Figure 4, except for the salary classifier of the Adult data set, the performance of the HPDP-DP algorithm is significantly better than that of the DP-SUBN$^3$ algorithm. Even for the salary classifier of the Adult data set, the SVM classification accuracy of the HPDP-DP algorithm is still not lower than that of the DP-SUBN$^3$ algorithm. The experimental results in Figure 4 also show that the SVM classification accuracy of the synthetic data sets released by both the HPDP-DP and DP-SUBN$^3$ algorithms increases with the privacy budget. This is because, according to the definition of differential privacy, when the privacy budget $\varepsilon$ increases, the degree of privacy protection decreases and the availability of the released data increases.

The Impact of the Number of Data Owners on the SVM Classification Accuracy.
To study the effect of the number of data owners on the performance of the HPDP-DP algorithm, in this section we set the number of data owners to 2, 4, 6, 8, and 10, and we fix the privacy budget $\varepsilon$ to 0.2. The results in Figure 5 show that the performance of the HPDP-DP algorithm is better than that of the DP-SUBN$^3$ algorithm. We can observe that when the number of data owners increases, the SVM classification accuracy of the synthetic data sets released by the HPDP-DP and DP-SUBN$^3$ algorithms increases accordingly. For the DP-SUBN$^3$ algorithm, the reason is that when the number of data owners increases, the number of update iterations in the algorithm increases, which helps to obtain a better Bayesian network. For the HPDP-DP algorithm, we use the weighted average of the local covariance matrices as an estimate of the covariance matrix of the pooled data, and the estimate gets better as the number of data owners increases. At the same time, we use the distributed Laplace mechanism to add noise to the shared data, so even when the number of data owners increases, the aggregated result still contains only one share of random noise (the same level as in the centralized scenario). The scale of the random noise is determined only by the privacy budget and the sensitivity. Therefore, the SVM classification accuracy of the synthetic data set released by the HPDP-DP algorithm increases as the number of data owners increases.

Conclusion
In this paper, to privately publish horizontally partitioned data owned by multiple parties, we present a multiparty horizontally partitioned data publishing method with differential privacy. We use the weighted average of the covariance matrices of the local data to estimate the covariance matrix of the pooled data and then obtain the principal components of the pooled data. To protect the privacy of the local data while improving the utility of the published data, we exploit the infinite divisibility of the Laplace distribution to add noise to the locally shared data. The experimental results show that the synthetic data set released by the HPDP-DP algorithm can maintain high utility. However, this paper also has limitations. (1) Principal component analysis is only suitable for linear dimensionality reduction and not for nonlinear dimensionality reduction. (2) The HPDP-DP algorithm is only suitable for horizontally partitioned data publishing, not for vertically partitioned data publishing. We will conduct research on these aspects in the future.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.