Multiparty Data Publishing via Blockchain and Differential Privacy

Data are often distributed among different parties. Collecting data from multiple parties for analysis and mining can serve people better; however, it also brings unprecedented privacy threats to the participants. Safe and reliable data publishing among multiple data owners is therefore an urgent problem. We study the problem of privacy protection in data publishing. For the centralized scenario, we propose the LDA-DP algorithm. First, the within-class mean vectors and the pooled within-class scatter matrix are perturbed with Gaussian noise. Second, the optimal projection direction vector satisfying differential privacy is obtained via the Fisher criterion. Finally, the low-dimensional projection of the original data is released. For the distributed scenario, we propose the Mul-LDA-DP algorithm based on blockchain and differential privacy technology. First, each data owner perturbs the local within-class mean vectors and within-class scatter matrices with Gaussian noise and uploads them to the blockchain network. Second, the projection direction vector is computed in the blockchain network and returned to the data owners. Finally, each data owner uses the projection direction vector to generate the low-dimensional projection of its original data and uploads it to the blockchain network for publishing. Furthermore, for the distributed scenario we propose a correlated noise generation scheme that uses the additivity of the Gaussian distribution to mitigate the effect of noise and achieves the same noise level as the centralized scenario. We measure the utility of the published data by the SVM misclassification rate and conduct comparative experiments with similar algorithms on different real data sets. The experimental results show that the data released by the two algorithms maintain good utility in SVM classification.


Introduction
With the development of science and technology, effective data collection and analysis can help people make better decisions in production. For example, analyzing patient information can help doctors improve the accuracy of diagnosis and the level of medical services, and analyzing trajectory data can alleviate city traffic congestion. The data contain sensitive information and need to be processed for privacy protection before publishing [1,2]. There have been several lines of research on privacy preserving data publishing, for example, k-anonymity [3], encryption [4,5], blockchain technology [6-8], and differential privacy [9-11]. Differential privacy has been widely used for privacy protection in recent years. Its principle is to add random noise to the data so that an attacker cannot distinguish the original input data. Differential privacy can quantitatively measure the degree of privacy protection and can resist attacks from adversaries with background knowledge. Privacy preserving data publishing based on differential privacy has become a research hot spot [12-15].
However, in the distributed scenario, data are held by multiple data owners. Data from a single owner may not be sufficient for statistical learning, and aggregating all data at a single owner may not be possible. For example [16], in Table 1 the data are held by three data owners. Each row in Table 1 represents the information of one individual, where records 1 to 4 are from data owner 1, records 5 to 8 are from data owner 2, and records 9 to 10 are from data owner 3. Simply integrating and publishing the data from each owner would cause a serious privacy leakage, so sharing and exchanging data in a distributed environment requires security guarantees. To solve this problem, we make the following contributions: (1) We propose two algorithms, LDA-DP and Mul-LDA-DP. The LDA-DP algorithm provides privacy protection for data publishing in the centralized scenario, and the Mul-LDA-DP algorithm in the distributed scenario. (2) In the distributed scenario, the data owners cooperate to publish a projection data set that satisfies differential privacy. To improve the utility of the published data, we propose a correlated noise generation scheme that uses the additivity of the Gaussian distribution to mitigate the effect of noise and achieves the same noise level as the centralized scenario. (3) We conduct experiments on different data sets. The results show that the data released by the LDA-DP and Mul-LDA-DP algorithms maintain good utility in SVM classification.

Related Work
In this section, we introduce the state of research on privacy preserving data publishing in the centralized and distributed scenarios, respectively.

Privacy Preserving Data Publishing in Centralized Scenario.
Blum et al. [17] proposed the sublinear query (SULQ) input perturbation framework, which adds noise to the covariance matrix; the framework can only be used for querying the projected subspace. Chaudhuri et al. [18] proposed the PPCA algorithm, an improvement of SULQ. The PPCA algorithm randomly samples a k-dimensional subspace that ensures differential privacy and is biased toward high utility. Both SULQ and PPCA are differentially private approximations to the top-k subspace. Zhang et al. [19] proposed the PrivBayes algorithm: they first constructed a Bayesian network with differential privacy and then used it to generate a data set for publication. Chen et al. [20] presented the JTree algorithm: they first explored the relationships between attributes based on sparse vector sampling, then constructed a Markov network satisfying differential privacy and generated a synthetic data set for publication. Zhang et al. [21] proposed the PrivHD algorithm based on JTree. They used high-pass filtering techniques to speed up the construction of the Markov network and built a better junction tree for generating the synthetic data set for publication. Xu et al. [22] proposed the DPPro algorithm: they first randomly projected the original high-dimensional data into a low-dimensional space, then added noise to the projection vector and the low-dimensional projection data, and finally released the low-dimensional projection data. Zhang et al. [23] presented the PrivMN method. They constructed a Markov model with differential privacy and then used it to generate a synthetic data set for publication. The algorithms mentioned above are mainly used for privacy preserving data publishing in centralized scenarios.

Privacy Preserving Data Publishing in Distributed Scenario.
There has been less research on privacy protection for publishing horizontally partitioned data. Ge et al. [24] proposed a distributed principal component analysis (DPS-PCA) algorithm with differential privacy: data owners collaborate to analyze the principal components while protecting private information, and then release low-dimensional subspaces of high-dimensional sparse data. Wang et al. [25] proposed an efficient and scalable protocol for computing principal components in a distributed environment. First, each data owner encrypts the shared data and sends them to a semitrusted third party; the third party then runs a private aggregation algorithm on the encrypted data and sends the aggregated data to the data user for computing the principal components. Imtiaz et al. [26] presented a distributed differentially private principal component analysis algorithm (DPdisPCA). Each data owner uses Gaussian noise to perturb the local covariance matrix and, with the assistance of a semitrusted third party, the principal components are computed while local data privacy is preserved. Alhadidi et al. [27] proposed a two-party data publishing algorithm with differential privacy. They first presented a two-party protocol for the exponential mechanism that can be used as a subprotocol; the data released by this algorithm are suitable for classification tasks. Cheng et al. [28] proposed a differentially private sequential update of the Bayesian network algorithm called DP-SUBN3, in which the data owners collaboratively construct a Bayesian network, treating intermediate results as prior knowledge, and then use the network to generate a data set for publication. Wang et al. [29] proposed a distributed differentially private anonymization algorithm and guaranteed that each step of the algorithm satisfies the definition of secure two-party computation.
It is the first work on differentially private data publishing for arbitrarily partitioned data. In our previous work [16], we proposed the PPCA-DP-MH algorithm. First, the data owners and a semitrusted third party cooperate to reduce the dimension of the high-dimensional data and obtain the top-k principal components satisfying differential privacy; then each data owner uses the generative model of probabilistic principal component analysis to generate a data set of the same scale as the original data for publication. Different from the prior work [16], this paper uses linear discriminant analysis to publish projection data with differential privacy. Linear discriminant analysis retains the class information of the data while reducing the dimension, which helps maintain the utility of the published data for classification.

Linear Discriminant Analysis (LDA).
Linear discriminant analysis, proposed by Fisher, is one of the most widely used and effective methods in dimensionality reduction and pattern recognition. Typical applications include face recognition, target tracking and detection, credit card fraud detection, and speech recognition. The idea of linear discriminant analysis for binary classification is to choose the projection direction so that, after projection, samples of different classes are as far apart as possible and samples within each class are as tightly clustered as possible. We denote the data set as $X = X^{(1)} \cup X^{(2)}$, where $X^{(k)}$ ($k = 1, 2$) contains the $N_k$ samples of class $k$. The within-class mean vector of the samples in the original sample space is

$$\mu^{(k)} = \frac{1}{N_k} \sum_{x \in X^{(k)}} x, \quad k = 1, 2. \tag{1}$$

The between-class scatter matrix is

$$S_b = \big(\mu^{(1)} - \mu^{(2)}\big)\big(\mu^{(1)} - \mu^{(2)}\big)^{T}. \tag{2}$$

The within-class scatter matrix of class $k$ is

$$S_k = \sum_{x \in X^{(k)}} \big(x - \mu^{(k)}\big)\big(x - \mu^{(k)}\big)^{T}. \tag{3}$$

Then, the pooled within-class scatter matrix is

$$S_w = S_1 + S_2. \tag{4}$$

It can also be expressed as

$$S_w = \sum_{k=1}^{2} \sum_{x \in X^{(k)}} x x^{T} - \sum_{k=1}^{2} N_k\, \mu^{(k)} \mu^{(k)T}. \tag{5}$$

The criterion of Fisher is

$$J(w) = \frac{w^{T} S_b w}{w^{T} S_w w}. \tag{6}$$

Using the Lagrange multiplier method to find the optimal projection direction vector, we obtain

$$w \propto S_w^{-1}\big(\mu^{(1)} - \mu^{(2)}\big). \tag{7}$$

Linear discriminant analysis only gives the optimal projection direction; it does not directly give a classification result.
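To make the computation above concrete, the following is a minimal NumPy sketch of standard (non-private) two-class Fisher LDA; the function and variable names are illustrative and not from the paper.

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Fisher projection direction w proportional to S_w^{-1} (mu1 - mu2)
    for two classes given as rows of X1 and X2."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter of each class, then pooled
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2
    # Solve S_w w = (mu1 - mu2) rather than inverting S_w explicitly
    w = np.linalg.solve(Sw, mu1 - mu2)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(100, 2))
X2 = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(100, 2))
w = fisher_lda_direction(X1, X2)
# Projections of the two classes onto w should be well separated
```

Solving the linear system instead of forming $S_w^{-1}$ is the usual numerically stable choice when $S_w$ is well conditioned.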

Differential Privacy.
Differential privacy provides rigorous privacy protection for sensitive information and can be quantified mathematically. The essence of differential privacy is to randomly perturb the output with noise so that it is difficult to infer the original input data from the output.

Definition 1 [30]. A randomized algorithm $M$ is $\varepsilon$-indistinguishable if, for any two neighboring databases $D$ and $D'$ differing in a single entry and for all $O \subseteq \mathrm{Range}(M)$,

$$\Pr[M(D) \in O] \le e^{\varepsilon} \Pr[M(D') \in O], \tag{8}$$

where $\varepsilon$ is a small positive real number.
The parameter $\varepsilon$ controls the probability ratio of algorithm $M$ obtaining the same output on two neighboring databases and thus reflects the level of privacy protection that $M$ can provide; the smaller $\varepsilon$ is, the stronger the protection.
Definition 2 [30]. A randomized algorithm $M$ satisfies $(\varepsilon, \delta)$ differential privacy if, for any two neighboring databases $D$ and $D'$ differing in a single entry and for any $O \subseteq \mathrm{Range}(M)$,

$$\Pr[M(D) \in O] \le e^{\varepsilon} \Pr[M(D') \in O] + \delta, \tag{9}$$

where $\varepsilon$ is a small positive real number called the privacy budget and $\delta$ is a small positive real number. This is also called $\delta$-approximate $\varepsilon$-indistinguishability.
Definition 2 is the relaxed version of differential privacy. When $\delta = 0$, it reduces to Definition 1, the strict version. Formula (9) means that the bound of formula (8) is allowed to be broken with a small probability $\delta$.

Security and Communication Networks
Theorem 4 (post-processing) [31]. Let $M: D \rightarrow R$ be a randomized algorithm that satisfies $(\varepsilon, \delta)$ differential privacy, and let $f: R \rightarrow R'$ be an arbitrary mapping. Then $f \circ M: D \rightarrow R'$ satisfies $(\varepsilon, \delta)$ differential privacy.

Proposed Methods
In this section, we propose two algorithms, LDA-DP and Mul-LDA-DP. The LDA-DP algorithm is used for privacy protection of data publishing in the centralized scenario, and the Mul-LDA-DP algorithm in the distributed scenario. Without loss of generality, we assume that all individual data in this paper are normalized to p-dimensional unit vectors.

LDA-DP Algorithm.
In this section, we propose the LDA-DP algorithm for centralized data publishing.

Problem Statement and Algorithm Proposed.
The data set $X$ contains two classes of data individuals, denoted as $X^{(1)}$ and $X^{(2)}$. Our goal is to publish the projection data of the original data while preventing the private information of the original data from being leaked.
To solve this problem, we propose the LDA-DP algorithm, which has two stages. First, we use the Gaussian mechanism of differential privacy to perturb the within-class mean vectors $\mu^{(k)}$ ($k = 1, 2$). Second, we use the Gaussian mechanism to perturb the pooled within-class scatter matrix $S_w$. Finally, we obtain the projection direction vector $w$ that satisfies $(\varepsilon, \delta)$ differential privacy and publish the low-dimensional projection of the original data. The specific details are given in Algorithm 1.
Suppose the neighboring data sets $X$ and $X'$ differ in a single individual in $X^{(1)}$. We denote $a = \sum_{x \in X^{(1)}} x$ and $a' = \sum_{x \in X'^{(1)}} x$, and let $c = a + g^{(1)}$ and $c' = a' + g'^{(1)}$, where each entry of $g^{(1)}$ and $g'^{(1)}$ is sampled from $N(0, \sigma_1^2)$. The log ratio of the probabilities of $c$ and $c'$ at a point $h$ is $|\ln(\Pr\{c = h \mid X\} / \Pr\{c' = h \mid X'\})|$; the numerator is the probability of seeing $h$ when the data set is $X$, and the denominator is the probability of seeing the same value when the data set is $X'$.
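The paper derives $\sigma_1$ from this log-ratio bound. As an illustrative stand-in (not the paper's derivation), the sketch below calibrates the noise with the classical Gaussian-mechanism constant $\sigma \ge \Delta_2 \sqrt{2\ln(1.25/\delta)}/\varepsilon$, which need not equal the $\sigma_1$ the paper obtains; since individuals are unit vectors, replacing one record changes the class sum $a$ by at most 2 in $L_2$ norm.

```python
import math
import numpy as np

def gaussian_sigma(l2_sensitivity, eps, delta):
    """Classical Gaussian-mechanism calibration:
    sigma >= sensitivity * sqrt(2 ln(1.25/delta)) / eps, for eps in (0, 1)."""
    return l2_sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / eps

# Individuals are normalized to unit vectors, so swapping one record
# changes the class sum a = sum(x) by at most 2 in L2 norm.
sensitivity = 2.0
sigma1 = gaussian_sigma(sensitivity, eps=0.5, delta=0.0005)

rng = np.random.default_rng(1)
X1 = rng.normal(size=(100, 8))
X1 /= np.linalg.norm(X1, axis=1, keepdims=True)   # rows are unit vectors
a = X1.sum(axis=0)
c = a + rng.normal(0.0, sigma1, size=a.shape)     # noisy class sum
```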
We denote $B = \sum_{k=1}^{2} \sum_{x \in X^{(k)}} x x^{T}$ and $B' = \sum_{k=1}^{2} \sum_{x \in X'^{(k)}} x x^{T}$, and let $C = B + G$ and $C' = B' + G'$, where $G$ and $G'$ are two independent symmetric random matrices whose upper-triangle (including the diagonal) entries are sampled from $N(0, \sigma_2^2)$ and whose lower-triangle entries are set equal to their symmetric counterparts. The log ratio of the probabilities of $C$ and $C'$ at a point $H$ is $|\ln(\Pr\{C = H \mid X\} / \Pr\{C' = H \mid X'\})|$. By Theorem 1, we need to find a value of $\sigma_2$ such that the inequality $|\ln(\Pr\{C = H \mid X\} / \Pr\{C' = H \mid X'\})| \le \varepsilon_2$ holds with probability at least $1 - \delta_2$.
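The symmetric perturbation of $B$ described above can be sketched as follows (a hypothetical helper, assuming NumPy): the upper triangle, diagonal included, is drawn i.i.d. from $N(0, \sigma_2^2)$ and mirrored, so the noisy matrix $C$ remains symmetric.

```python
import numpy as np

def symmetric_gaussian_noise(p, sigma, rng):
    """Draw a symmetric p x p noise matrix: sample the upper triangle
    (including the diagonal) i.i.d. from N(0, sigma^2) and mirror it."""
    G = np.zeros((p, p))
    iu = np.triu_indices(p)
    G[iu] = rng.normal(0.0, sigma, size=len(iu[0]))
    # Mirror the upper triangle to the lower one (diagonal counted once)
    return G + G.T - np.diag(np.diag(G))

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 6))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-vector individuals
B = X.T @ X                                     # B = sum of x x^T
C = B + symmetric_gaussian_noise(6, sigma=0.8, rng=rng)  # noisy scatter term
```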

By using the Lagrange multiplier method and the inequality in [18], the corresponding inequalities hold; the rest of the proof is similar to that of Theorem 5, and we obtain the required bound. We have proven that the within-class mean vectors $\mu^{(k)}$ ($k = 1, 2$) satisfy $(\varepsilon_1, \delta_1)$ differential privacy and that the pooled within-class scatter matrix $S_w$ satisfies $(\varepsilon_2, \delta_2)$ differential privacy. By the sequential composition property of differential privacy, the projection direction vector in Algorithm 1 satisfies $(\varepsilon, \delta)$ differential privacy, where $\varepsilon = \varepsilon_1 + \varepsilon_2$ and $\delta = \delta_1 + \delta_2$. For the published projection data $\widetilde{X} = X w$, with $X \in \mathbb{R}^{N \times p}$, $w \in \mathbb{R}^{p \times 1}$, and $p < N$, we can regard $\widetilde{X} = X w$ as an underdetermined system of equations: the number of unknowns exceeds the number of equations, so the system has infinitely many solutions, that is, it is impossible to infer the original data $X$ from the published projection data $\widetilde{X}$.
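The "infinitely many solutions" claim can be checked numerically: any $X' = X + v n^{T}$ with $n$ orthogonal to $w$ yields exactly the same published projection. A small sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 20, 5
X = rng.normal(size=(N, p))
w = rng.normal(size=(p, 1))

# Build a direction n orthogonal to w, and a second data set X2 != X
n = rng.normal(size=(p, 1))
n -= (n.T @ w) / (w.T @ w) * w        # remove the component along w
v = rng.normal(size=(N, 1))
X2 = X + v @ n.T                      # differs from X in every entry

# Both data sets publish identical projections: X2 w = X w + v (n^T w) = X w
same = np.allclose(X @ w, X2 @ w)
```

Since the published value reveals only $Xw$, every data set of the form $X + v n^{T}$ is consistent with it.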

Mul-LDA-DP Algorithm.
In this section, we propose the Mul-LDA-DP algorithm for distributed data publishing. The mathematical notations used in this section are summarized in Table 2.

Problem Statement and Algorithm Proposed.
In the distributed scenario, data are stored by multiple data owners rather than a single owner, and the data owners do not trust each other. Data at a single site may not be sufficient for statistical learning. One solution is for each data owner to use the LDA-DP algorithm of Section 4.1 to publish its projection data independently. Another is for the data owners to cooperate to publish the projection data of the integrated data. Comparing the two solutions, the latter clearly improves the utility of the published data. Based on the idea of the second solution and [32], we propose the Mul-LDA-DP algorithm for distributed data publishing. The entities of the model are described as follows.
(1) Data owner. Each data owner $P_m$ ($m = 1, 2, \ldots, M$) has a data set $X_m$. Each data owner can generate random vectors and matrices to perturb the within-class mean vectors and within-class scatter matrices locally. (2) Data publisher. The data publisher is a data publishing platform based on blockchain. It aggregates the noisy local within-class mean vectors and within-class scatter matrices, obtains the projection vector satisfying differential privacy, and publishes the projection data of the pooled data.
(3) Random number generator. It can generate random vectors and random matrices and send them secretly to the data owners and the data publisher.

Threat Model. In our setting, we assume that the data owners and the data publisher are honest-but-curious, that is, they follow the protocol but may try to deduce information about other data owners from the received messages. Two types of adversaries are considered: external attackers and internal attackers. An external attacker (an external eavesdropper) may gain access to information such as the data sent by the data owners to the data publisher. Internal adversaries can be the data owners and the data publisher: each data owner aims to extract information it does not own, while the data publisher aims to extract information from each data owner.

Distributed Within-Class Mean Vectors and Pooled Within-Class Scatter Matrix Computation. When the data are owned by $M$ data owners, the within-class mean vectors (1) can be decomposed as

$$\mu^{(k)} = \frac{1}{N_k} \sum_{m=1}^{M} a_m^{(k)}, \quad \text{where } a_m^{(k)} = \sum_{x \in X_m^{(k)}} x.$$

The pooled within-class scatter matrix (5) can be decomposed as

$$S_w = \sum_{m=1}^{M} B_m - \sum_{k=1}^{2} N_k\, \mu^{(k)} \mu^{(k)T}, \quad \text{where } B_m = \sum_{k=1}^{2} \sum_{x \in X_m^{(k)}} x x^{T}.$$

This result allows each data owner to compute and perturb a partial result locally and simultaneously. Therefore, we use the additivity of the Gaussian distribution to propose a correlated noise generation scheme. We design the noise generation procedure such that (i) the data output by each data owner satisfy differential privacy and (ii) the aggregated noise reaches the same level as in the pooled-data scenario.

Scheme for Perturbing Shared Data by Correlated Noise. To prevent the data publisher and the other data owners from learning the private local data, each data owner perturbs its local within-class mean vectors and within-class scatter matrices with both noise generated by itself and noise generated by the random number generator.
Through our correlated noise design scheme, the data aggregated by the data publisher contain the same level of noise as in the centralized scenario. The scheme is described as follows: (1) Initialization stage. The random number generator generates $p$-dimensional random vectors $g_m^{(k)}$, each entry sampled from $N(0, ((M-1)/M)\sigma_1^2)$, and $p \times p$ random matrices $G_m$, each a symmetric matrix whose upper-triangle (including the diagonal) entries are sampled from $N(0, ((M-1)/M)\sigma_2^2)$ and whose lower-triangle entries equal their symmetric counterparts, for $m = 0, 1, 2, \ldots, M$ and $k = 1, 2$. These random vectors and matrices are generated so that they cancel in aggregate. Then $g_m^{(k)}$ ($k = 1, 2$) and $G_m$ are sent secretly to data owner $P_m$, while $g_0^{(k)}$ ($k = 1, 2$) and $G_0$ are sent secretly to the data publisher.
(2) Perturbation stage. Each data owner $P_m$ generates its own $p$-dimensional random vectors $\widetilde{g}_m^{(k)}$ ($k = 1, 2$), perturbs the local shared statistics with both its own noise and the noise received from the random number generator, and sends the results to the data publisher.
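A sketch of the correlated-noise idea, under the variance split implied above: generator noise with per-entry variance $((M-1)/M)\sigma^2$ that cancels in aggregate, plus local noise with variance $\sigma^2/M$, so each owner's output carries full noise of variance $\sigma^2$ while the publisher's sum carries only $N(0, \sigma^2)$ noise, the centralized level. This is a simplified variant in which the correlated shares are distributed among the $M$ owners only (without the publisher's extra share $g_0$); the names are illustrative, not from the paper.

```python
import numpy as np

def correlated_noise(M, p, sigma, rng):
    """Generator-side noise: M vectors that sum to zero, each entry with
    marginal variance ((M-1)/M) * sigma^2, obtained by centering i.i.d.
    N(0, sigma^2) draws."""
    e = rng.normal(0.0, sigma, size=(M, p))
    return e - e.mean(axis=0)          # rows now sum exactly to zero

M, p, sigma = 4, 3, 1.0
rng = np.random.default_rng(4)
a = rng.normal(size=(M, p))            # each owner's true local statistic a_m

g = correlated_noise(M, p, sigma, rng)               # from the generator
h = rng.normal(0.0, sigma / np.sqrt(M), size=(M, p)) # each owner's own noise

shared = a + g + h                     # what each owner sends (variance sigma^2)
aggregate = shared.sum(axis=0)         # publisher's pooled value
# g cancels in the sum, so aggregate = sum(a) + sum(h), noise ~ N(0, sigma^2)
```

Per owner, the total noise variance is $((M-1)/M)\sigma^2 + \sigma^2/M = \sigma^2$, while in the publisher's sum only the $M$ independent local terms $h_m$ survive, again giving variance $\sigma^2$.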
Due to the post-processing property of differential privacy (Theorem 4), the within-class mean vector in Algorithm 2 satisfies $(\varepsilon_1, \delta_1)$ differential privacy.
There are three opportunities for an attacker to steal the data transmitted between the data owners and the data publisher. The first is when a data owner sends the within-class mean vectors to the data publisher; the second is when a data owner sends the within-class scatter matrices to the data publisher. From Theorems 7 and 8, the within-class mean vectors and the within-class scatter matrices satisfy differential privacy, so the attacker cannot infer the original data from the eavesdropped data. The third is when a data owner sends the projection data to the data publisher; as analyzed in Section 4.1.2, it is impossible to infer the original data from the published projection data.

Experiment
To measure the usability of the LDA-DP and Mul-LDA-DP algorithms proposed in this paper, we conduct experiments on two real data sets, Adult and NLTCS. The Adult data set is extracted from the 1994 US Census; it contains 45222 individuals, each with 15 attributes. The NLTCS data set is extracted from the National Long Term Care Survey and records the daily activities of 21574 disabled persons at different time periods; each individual has 16 attributes. We use the SVM misclassification rate to measure the availability of the published data. For the Adult data set, the tasks are to predict whether a person (1) holds a post-secondary degree and (2) earns more than 50K. For the NLTCS data set, the tasks are to predict whether a person (1) is unable to get outside, (2) is unable to manage money, (3) is unable to travel, and (4) is unable to bathe. In our experiments, we keep δ = 0.001 fixed and let ε take different values. We uniformly divide the privacy parameters into two portions (ε₁ = ε₂ = ε/2, δ₁ = δ₂ = δ/2). Each experiment is repeated 50 times, and the mean value is taken as the experimental result. We use "No Privacy" to denote the SVM misclassification rate on the original data set.
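The evaluation metric itself is straightforward; the sketch below shows the misclassification-rate computation with the repeat-and-average protocol described above. The names are illustrative, and the toy predictor stands in for the SVM classifiers used in the paper, which are omitted here.

```python
import numpy as np

def misclassification_rate(y_true, y_pred):
    """Fraction of test individuals whose predicted label is wrong."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

def averaged_rate(run_once, repeats=50, seed=0):
    """Repeat a randomized experiment `repeats` times and average,
    mirroring the paper's 50-repetition protocol. `run_once(rng)` must
    return (y_true, y_pred) for one repetition."""
    rng = np.random.default_rng(seed)
    return float(np.mean([misclassification_rate(*run_once(rng))
                          for _ in range(repeats)]))

# Toy stand-in for one repetition: a predictor wrong ~10% of the time
def toy_run(rng):
    y = rng.integers(0, 2, size=200)
    flip = rng.random(200) < 0.1
    return y, np.where(flip, 1 - y, y)

rate = averaged_rate(toy_run, repeats=50)   # close to 0.10 on average
```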

Comparing the Performance of LDA-DP, PrivBayes, and PrivHD Algorithms under Different Privacy Budgets.
The LDA-DP, PrivBayes, and PrivHD algorithms are all suitable for the centralized data publishing scenario, so in this set of experiments we set the number of data owners to 1 and let the privacy budget ε take different values. As can be seen from Figure 1, for both the Adult and NLTCS data sets, the SVM classification utility of the data published by the LDA-DP algorithm outperforms the PrivBayes algorithm. The LDA-DP algorithm also outperforms the PrivHD algorithm on the NLTCS data set, although its SVM classification utility on the Adult data set is slightly lower than that of PrivHD. We can also observe a common trend: for all three algorithms, the SVM misclassification rate decreases as the privacy budget ε increases. This is consistent with the theory that as ε increases, privacy protection weakens and the availability of the data increases.

Comparing the Performance of Mul-LDA-DP and DP-SUBN3 Algorithms under Different Privacy Budgets.
The Mul-LDA-DP algorithm proposed in this paper is suitable for the distributed data publishing scenario, so in this set of experiments we set the number of data owners to 3 and let the privacy budget ε take different values. We train classifiers on the published data sets to compare the efficacy of the Mul-LDA-DP and DP-SUBN3 algorithms. From Figure 2, we can see that the SVM classification utility of the data published by the Mul-LDA-DP algorithm outperforms the DP-SUBN3 algorithm. In particular, on the money classifier of NLTCS and the education classifier of Adult, the misclassification rate of the Mul-LDA-DP algorithm is significantly lower than that of DP-SUBN3.

Comparing the Performance of Mul-LDA-DP and DP-SUBN3 Algorithms under Different Numbers of Data Owners.
This experiment studies the relationship between the SVM misclassification rate and the number of data owners. The number of data owners is set to 2, 4, 6, 8, and 10, and the privacy budget ε is set to 0.2. We train two classifiers on the Adult data set: an education classifier and a salary classifier. The results in Figure 3 show that the SVM misclassification rate of the Mul-LDA-DP algorithm remains stable as the number of data owners changes. The reason is that we perturb the local shared data with correlated noise generated using the additivity of the Gaussian distribution, which ensures that the level of Gaussian noise added in the distributed scenario is similar to that in the centralized scenario; therefore, the misclassification rate remains stable as the number of data owners increases. The SVM misclassification rate of the DP-SUBN3 algorithm decreases as the number of data owners increases, because more data owners mean more update iterations when constructing the Bayesian network, so the constructed network is closer to the distribution of the original data. However, Figure 3 shows that the Mul-LDA-DP algorithm still outperforms DP-SUBN3 when the number of data owners is no more than 10.

Conclusion
In this paper, we propose two algorithms for privacy preserving data publishing: the LDA-DP algorithm for data publishing in the centralized scenario and the Mul-LDA-DP algorithm for multiparty horizontally partitioned data publishing. We use the additivity of the Gaussian distribution to alleviate the effect of noise and achieve the same noise level as the centralized scenario. The experimental results show that the projection data released by the two algorithms maintain high utility in SVM classification. However, this research also has limitations. (1) We only study the privacy protection problem for binary-class data, whereas real data are often multiclass. (2) The data released by the two algorithms are low-dimensional projections of the original data, which limits the analysis and mining of the released data in many respects. In the future, we will continue to conduct research on these issues.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.