Multi-Party Verifiable Privacy-Preserving Federated k-Means Clustering in Outsourced Environment

As a commonly used algorithm in data mining, clustering has been widely applied in many fields, such as machine learning, information retrieval, and pattern recognition. In practice, the data to be analyzed are often distributed across multiple parties. Moreover, the rapidly increasing data volume puts heavy computing pressure on data owners. Thus, data owners tend to outsource their data to cloud servers and obtain data analysis results for the federated data. However, the existing privacy-preserving outsourced k-means schemes cannot verify whether participants share consistent data. Considering scenarios with multiple data owners and the security of sensitive information in an outsourced environment, we propose a verifiable privacy-preserving federated k-means clustering scheme. In this article, cloud servers and participants perform the k-means clustering algorithm over encrypted data without exposing private data or intermediate results in each iteration. In particular, our scheme can verify the shares from participants when updating the cluster centers, based on secret sharing, a hash function, and blockchain, so that it can resist inconsistent share attacks by malicious participants. Finally, security and experimental analyses are carried out to show that our scheme can protect private data and obtain high-accuracy clustering results.


Introduction
Data mining technology can be used to analyze and extract potentially valuable information from large collections of data. Clustering algorithms are widely used in data mining and play an important role in medical, scientific, and commercial applications. In brief, clustering algorithms [1,2] divide data items into groups according to their features and attributes, such that the data items in the same group are sufficiently similar. As a well-known clustering algorithm, the k-means clustering algorithm [3] has the advantages of a simple process and good clustering results, and it assigns data to k clusters based on the distances from the cluster centers.
Data analysis can extract a lot of useful information, but in the process, a large amount of private personal data is collected and analyzed, such as living habits, criminal records, and medical records. Furthermore, privacy breaches often result in financial losses and panic for society and companies, so data privacy has gained more attention than before. Research on privacy preservation can be found in [4][5][6], and many privacy-preserving data mining schemes have been proposed in recent years [7][8][9]. Traditional privacy-preserving k-means schemes are primarily achieved through the interaction of participants. Vaidya and Clifton [10] first proposed a multi-party privacy-preserving k-means clustering protocol on vertically partitioned data, where the secure distance computation and comparison are supported by a secure permutation scheme and homomorphic encryption. Jha et al. [11] introduced a two-party privacy-preserving k-means clustering protocol based on oblivious polynomial evaluation and homomorphic encryption. Bunn and Ostrovsky [12] presented an efficient two-party clustering protocol on arbitrarily partitioned data, where the intermediate values are not disclosed thanks to a division protocol and a random value protocol. Yi and Zhang [13] introduced an equally contributory privacy-preserving k-means clustering protocol based on ElGamal encryption, a plaintext equivalence test protocol, and mix networks. Xing et al. [14] proposed a mutual privacy-preserving k-means scheme for social scenarios, where the parties are grouped with the help of a data analyst and the scheme can resist collusion attacks. Zhang et al. [15] combined secure multi-party computation and differential privacy to train a privacy-preserving k-means clustering model. Subsequently, privacy-preserving k-means schemes in the malicious model were proposed in [16,17].
Recently, the explosive growth of data poses a challenge for data owners in storage and computation, and a cloud server with high storage capacity and strong computing power is a good solution to this problem. Privacy protection and auditing in the cloud environment have also been studied in [18][19][20][21][22]. Accordingly, several privacy-preserving k-means schemes have been proposed for the outsourced environment. Liu et al. [23], following the framework in [24], presented a privacy-preserving outsourced k-means clustering protocol in which one party outsources the distance computation to a cloud server without revealing either the data or the clustering results to any other party or to the cloud server. Jiang et al. [25] introduced an efficient two-party privacy-preserving k-means clustering protocol that computes distances securely using the subprotocols in [26] and updates cluster centers using the garbled circuit proposed in [27]. Zou et al. [28] proposed a highly secure privacy-preserving outsourced k-means clustering scheme using BCP homomorphic encryption and AES encryption under multiple keys. Sakellariou and Gounaris [29] introduced a privacy-preserving outsourced k-means scheme with low client-side load based on switch key and Paillier encryption. However, the existing privacy-preserving outsourced k-means schemes cannot verify whether participants share consistent data. In this article, we propose a multi-party verifiable privacy-preserving federated k-means scheme for horizontally partitioned data.
Our main contributions can be summarized as follows:
(1) We propose a privacy-preserving k-means scheme based on the Paillier cryptosystem, secret sharing, a hash function, and blockchain. In the multi-party scenario, we outsource the main computing task to the cloud server and reduce the computing overhead of participants.
(2) Our scheme protects the participants' information and avoids leaking the cluster centers to participants in each iteration. Furthermore, malicious participants can be detected using the hash function and blockchain in the process of updating cluster centers.
The rest of the article is organized as follows. Section 2 introduces the preliminaries on k-means clustering and cryptography. Section 3 presents the framework and specific entities in our scheme. The basic secure protocols and our scheme are detailed in Section 4. Security analysis is carried out in Section 5. Performance analysis is presented in Section 6. Finally, we conclude this article in Section 7.

Preliminaries
For better elaboration, the notations used in this article and their semantic meanings are presented in Table 1.

k-Means Clustering Algorithm.
The k-means clustering algorithm is one of the best-performing unsupervised clustering algorithms. It proceeds as follows. Assume that there is a set of samples and that each sample $a_i$ is an $\ell$-dimensional vector. Suppose that the samples need to be grouped into $k$ clusters $C_1, \ldots, C_k$, where the cluster center of the $j$th cluster is denoted by $\mu_j$. Initially, $k$ samples $\mu_1, \ldots, \mu_k$ are selected at random as the initial cluster centers.
The algorithm then iterates, measuring the distances between each sample and the $k$ cluster centers. In this article, we adopt the Euclidean distance as the criterion. Sample $a_i$ belongs to cluster $C_j$ if the cluster center $\mu_j$ is the closest to $a_i$. In each iteration, each sample is reassigned to the nearest cluster, and the cluster centers are recomputed as the mean of their members, as in equation (1):

$$\mu_j = \frac{1}{|C_j|} \sum_{a_i \in C_j} a_i, \quad 1 \le j \le k. \tag{1}$$

The iteration terminates when there is no or little change in the cluster centers. The specific description of the k-means clustering algorithm is shown in Algorithm 1.
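As a point of reference for the encrypted protocol, the plaintext procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `tol` and `seed` parameters, and the choice to keep the old center when a cluster becomes empty are assumptions made here.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(samples, k, max_iter=100, tol=1e-9, seed=0):
    """Plaintext k-means: returns (centers, clusters)."""
    rng = random.Random(seed)
    # Randomly select k samples as the initial cluster centers.
    centers = [list(s) for s in rng.sample(samples, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assign each sample to the cluster with the closest center.
        clusters = [[] for _ in range(k)]
        for s in samples:
            j = min(range(k), key=lambda c: squared_distance(s, centers[c]))
            clusters[j].append(s)
        # Replace each center with the mean of its cluster.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        shift = max(squared_distance(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:  # little or no change in the centers
            break
    return centers, clusters
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)` separates the two obvious groups and returns their means as centers.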

Homomorphic Encryption.
Homomorphic encryption allows certain computations over encrypted data. The Paillier cryptosystem [30] is a popular homomorphic encryption scheme based on the decisional composite residuosity problem. Furthermore, the Paillier cryptosystem provides fast encryption and decryption, and it is widely used in privacy-preserving data mining. We adopt the Paillier cryptosystem in our scheme. It is briefly introduced as follows (the modulus $n$ is also written $N$ in the protocol descriptions):
(i) Key generation: an entity selects two large primes $p$ and $q$ and computes $n = pq$ and $\lambda = \mathrm{lcm}(p - 1, q - 1)$. It then randomly chooses an integer $g \in \mathbb{Z}^*_{n^2}$ and checks whether $\gcd(L(g^{\lambda} \bmod n^2), n) = 1$, where $L(x) = (x - 1)/n$. The public key is $pk = (n, g)$ and the secret key is $sk = (\lambda)$.
(ii) Encryption: let $m \in \mathbb{Z}_n$ be a message and $r \in \mathbb{Z}^*_n$ be a random number. The ciphertext of $m$ is computed by

$$c = E(m) = g^{m} \cdot r^{n} \bmod n^2,$$

where $E(\cdot)$ denotes encryption with $pk = (n, g)$.
(iii) Decryption: the ciphertext $c$ of $m$ is decrypted by

$$m = D(c) = \frac{L(c^{\lambda} \bmod n^2)}{L(g^{\lambda} \bmod n^2)} \bmod n,$$

where $D(\cdot)$ denotes decryption with $sk = (\lambda)$.
(iv) Homomorphism: the Paillier cryptosystem is additively homomorphic, that is, it satisfies the following equation:

$$E(m_1) \cdot E(m_2) \bmod n^2 = E(m_1 + m_2 \bmod n).$$

As a consequence, $E(m)^{k} \bmod n^2 = E(k \cdot m \bmod n)$ for any integer $k$.
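The following toy Python sketch illustrates the key generation, encryption, decryption, and additive homomorphism described above. It is a hedged illustration only: the function names are hypothetical, the primes are far too small for real use, the generator is fixed to the common choice $g = n + 1$ rather than chosen at random, and the inverse of $L(g^{\lambda} \bmod n^2)$ is precomputed and stored alongside $\lambda$ for convenience.

```python
import math
import random

def L(x, n):
    """The function L(x) = (x - 1) / n used by Paillier."""
    return (x - 1) // n

def keygen(p, q):
    """Toy key generation from two primes (assumption: g = n + 1)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1
    # Precompute the inverse of L(g^lam mod n^2) modulo n.
    mu = pow(L(pow(g, lam, n * n), n), -1, n)
    return (n, g), (lam, mu)

def encrypt(pk, m, rng=random):
    """c = g^m * r^n mod n^2 with a random r coprime to n."""
    n, g = pk
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, sk, c):
    """m = L(c^lam mod n^2) * mu mod n."""
    n, _ = pk
    lam, mu = sk
    return (L(pow(c, lam, n * n), n) * mu) % n

# Usage: multiply ciphertexts to add plaintexts, exponentiate to scale.
pk, sk = keygen(1009, 1013)
n2 = pk[0] ** 2
c1, c2 = encrypt(pk, 20), encrypt(pk, 22)
sum_plain = decrypt(pk, sk, (c1 * c2) % n2)
scaled_plain = decrypt(pk, sk, pow(c1, 3, n2))
```

Multiplying the two ciphertexts yields an encryption of 20 + 22, and raising `c1` to the power 3 yields an encryption of 3 · 20, which is the property the later subprotocols rely on.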

Blockchain.
Blockchain is the underlying technology of Bitcoin [31] and is essentially a distributed database. It is a relatively new form of network that combines cryptography, hash functions, and proof of work (PoW). Blockchain has several attractive features, such as decentralization, tamper resistance, and transparency.
(i) Decentralization: the data on the blockchain are maintained by all nodes in the peer-to-peer network. Moreover, all nodes compete to generate a block without relying on a centralized third party to record transactions.
(ii) Tamper resistance: each node in the peer-to-peer network saves a copy of the data on the blockchain, so it is practically infeasible to tamper with the data once they have been recorded on the blockchain.
(iii) Transparency: the records on the blockchain are transparent to all nodes, and anyone can access the data on the blockchain.
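The tamper-resistance property can be illustrated with a minimal hash-chained ledger in Python. This sketch deliberately omits consensus, proof of work, and the peer-to-peer network, and the block layout and function names are assumptions made for illustration; it only shows why changing a recorded entry is detectable.

```python
import hashlib

GENESIS = "0" * 64  # assumed placeholder for the hash before the first block

def block_hash(prev_hash, payload):
    """SHA-256 over the previous block hash concatenated with the payload."""
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()

def build_chain(payloads):
    """Link each payload to the hash of the block before it."""
    chain, prev = [], GENESIS
    for payload in payloads:
        h = block_hash(prev, payload)
        chain.append({"prev": prev, "payload": payload, "hash": h})
        prev = h
    return chain

def is_valid(chain):
    """Recompute every hash; any modified payload breaks the links."""
    prev = GENESIS
    for block in chain:
        if block["prev"] != prev:
            return False
        if block["hash"] != block_hash(prev, block["payload"]):
            return False
        prev = block["hash"]
    return True
```

Altering the payload of any block after the chain is built makes `is_valid` return `False`, because the stored hash no longer matches the recomputed one.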

Scheme Model
The scheme model is illustrated in Figure 1, where the multi-party verifiable privacy-preserving federated k-means model includes three entities. The first entity is the participants, that is, the data owners who provide the original data for the k-means algorithm, such as hospitals, scientific research companies, and government agencies. The second entity is a cloud server with adequate storage and computing resources, which is responsible for storing the encrypted samples of the participants and undertakes the main computational task in privacy-preserving k-means clustering. The last entity is the blockchain, which is used to store the hash values of secret shares. Because of the tamper-resistant nature of blockchain, once the hash values have been uploaded to the blockchain, it can be guaranteed that they will not change.
In the multi-party verifiable privacy-preserving federated k-means scheme, the specific descriptions of entities are shown below.

Table 1: Notations.
Notation    Meaning
hv          Hash value
S.sum       The sum of samples
S.no        The number of samples

Randomly select k cluster centers μ_1, ..., μ_k from the dataset
Repeat
  (1) Calculate the distances between each sample and the k cluster centers μ_1, ..., μ_k
  (2) Assign each sample to the closest cluster
  (3) Replace each cluster center μ_i with the mean of the ith cluster
until the cluster centers do not change
ALGORITHM 1: k-means clustering.

Security and Communication Networks
(i) Participants: the sample data to be analyzed are horizontally distributed among the participants, where the n participants are denoted by $P_1, \ldots, P_n$. Each participant $P_i$ holds $d_{p_i}$ samples, where the samples are $\ell$-dimensional. In Figure 1, participants in red represent malicious participants. Malicious participants may not follow the protocol and may share inconsistent data. In our scheme, participants generate their public and secret keys and upload the encrypted samples to the cloud server. In addition, participants need to share data with a secret sharing scheme when updating the cluster centers. To make the secret shares verifiable, participants compute the hash values of the secret shares and upload them to the blockchain in advance.
(ii) Cloud Server: all encrypted samples are stored on the cloud server. The cloud server is responsible for interacting with participants and for calculating and comparing the distances between encrypted samples and cluster centers. Then, the cloud server assigns samples to different clusters.
(iii) Blockchain: the blockchain is responsible for storing the hash values generated by the participants.
The aim of our scheme is to build a multi-party verifiable federated k-means scheme. In our scheme, participants only get the final k-means clustering results, and the cluster centers in each iteration are known only to the cloud server. Furthermore, participants and the cloud server cannot learn or infer private information about the other side. In the process of updating the cluster centers, participants send secret shares to other participants and verify the secret shares received from other participants against the hash values. If a secret share fails verification, a malicious participant has sent inconsistent secret shares; the malicious participant can thus be detected in time, avoiding losses due to incorrect k-means results.

Our Construction
We construct a multi-party verifiable privacy-preserving federated k-means scheme. In our scheme, we use Paillier encryption to protect the information of participants. Furthermore, secret sharing, a hash function, and blockchain technology are used to verify the shares and update the cluster centers.

Basic Secure Protocol.
We present a set of subprotocols that will be used in constructing the multi-party verifiable privacy-preserving federated k-means scheme.
(i) Secure Multiplication (SM) Protocol. In this protocol, cloud server C holds the input $(E_{pk}(x), E_{pk}(y))$ and participants P hold sk; the output $E_{pk}(x * y)$ is returned to C, where neither P nor C knows $x$ and $y$. Furthermore, no information concerning $x$ and $y$ is leaked to P or C.
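Figure 1: Scheme model (figure labels: participants $P_1, P_2, P_3, \ldots, P_n$; cloud server; Step 1.1: upload hash values; Step 1.3: choose cluster centers; Step 2.2: compute new cluster centers; Step 3: stop iteration).

(ii) Secure Squared Euclidean Distance (SSED) Protocol. Let $E(X) = \langle E(x_1), \ldots, E(x_\ell) \rangle$ and $E(Y) = \langle E(y_1), \ldots, E(y_\ell) \rangle$ denote the encrypted component sets of $X$ and $Y$. The cloud server C, with the input $(E(X), E(Y))$, and participants P, with sk, securely compute the encrypted value of the squared Euclidean distance between vectors $X$ and $Y$.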
(iii) Secure Bit-Decomposition (SBD) Protocol. In the SBD protocol [32], cloud server C, with input $E_{pk}(z)$, and participants P securely compute encrypted values of the individual bits of $z$, where $0 \le z \le 2^{\alpha} - 1$ and $z$ is not known to C or P. No information regarding the output $[z] = \langle E_{pk}(z_1), \ldots, E_{pk}(z_\alpha) \rangle$ is revealed to P. Here, $z_1$ and $z_\alpha$ are the most and least significant bits of $z$, respectively.
Participant $P_i$ has a dataset. In this protocol, $P_i$ securely shares samples using a secret sharing scheme. Firstly, [...] is not known to other participants. $P_i$ sends $F(x_i)$ to cloud server C, and C solves the set of $F(x)$ using Lagrange interpolation to recover the sum of the secret values.

e Proposed Scheme.
This section presents the detailed steps of the multi-party verifiable privacy-preserving federated k-means scheme.

Step 1: Assign Samples to Clusters.
This step aims to assign each sample to its nearest cluster, where the cluster centers are initialized by cloud server C.
(i) Cloud server encrypts the k cluster centers using the pk of each participant to get the encrypted cluster centers $C^{i}_{\mu,c}$.
(iv) Step 1.4: cloud server and participants compute distances. Cloud server C has $P_i$'s encrypted samples $CT_i$ and the cluster centers $C^{i}_{\mu,c}$, and $P_i$ has the secret key sk. C and $P_i$ run the SSED protocol to compute the encrypted distances between each sample and the k cluster centers. The distances are represented as $dis^{i}_{j} = \langle dis^{i}_{j,1}, \ldots, dis^{i}_{j,k} \rangle$, where $dis^{i}_{j}$ denotes the distances between $P_i$'s sample $S_{i,j}$ and the k cluster centers in encrypted form.
(v) Step 1.5: cloud server assigns samples to k clusters. To compare the encrypted distances, C and $P_i$ run the SMINk protocol to get the minimum distance $min_{i,j}$. The sample $S_{i,j}$ is assigned to the $c$th cluster if the minimum distance $min_{i,j}$ is equal to $dis^{i}_{j,c}$. Finally, the cloud server gets the clustering results of each participant's samples $C = \{C^{i} \mid 1 \le i \le n\}$, where $C^{i} = \{C^{i}_{j} \mid 1 \le j \le k\}$ and $C^{i}_{j}$ denotes the set of $P_i$'s samples in the $j$th cluster. Then, the cloud server sends the clustering results to the corresponding participants. In other words, participants only know the clustering results of their own samples, and they do not know the clustering results of other participants' samples.

Step 3: Stop Iteration.
This step aims to determine whether to terminate the privacy-preserving federated k-means algorithm. After Step 1 and Step 2, the cloud server compares the new cluster centers $\mu'$ with the previous cluster centers $\mu$. If they are close enough (e.g., the difference is no more than a threshold set in advance), the k-means algorithm ends, and the clustering results have already been sent to participants in Step 1.5. Otherwise, the cloud server encrypts the k new cluster centers using each participant's public key and returns to Step 1.4 for the next iteration.
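The termination test performed by the cloud server on the plaintext centers can be sketched as follows. The source does not fix how the "difference" is measured, so this sketch assumes the Euclidean distance between each old center and its updated counterpart; the function name and `threshold` parameter are likewise hypothetical.

```python
def centers_converged(old_centers, new_centers, threshold):
    """Return True if every center moved by at most `threshold`.

    Assumption: movement is measured as the Euclidean distance between
    the previous center and the new center of the same cluster.
    """
    for old, new in zip(old_centers, new_centers):
        moved = sum((a - b) ** 2 for a, b in zip(old, new)) ** 0.5
        if moved > threshold:
            return False
    return True
```

If the function returns `False`, the cloud server would re-encrypt the new centers and continue with Step 1.4.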

Security Analysis
In this section, we discuss the privacy protection capabilities offered by our scheme. Firstly, we define the goals in privacy protection for different entities.
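Each participant in the federated k-means scheme should not have access to the following information:
(1) Sample information from other participants
(2) Clustering results for other participants' samples
(3) Cluster centers in each iteration

Require: C has $E_{pk}(x)$, $E_{pk}(y)$; P has sk
(1) C:
  (a) Pick two random numbers $r_x, r_y \in \mathbb{Z}_N$
  (b) $x' \leftarrow E(x) * E(r_x)$, $y' \leftarrow E(y) * E(r_y)$
  (c) Send $x', y'$ to P
(2) P:
  (a) $h_x \leftarrow D(x')$, $h_y \leftarrow D(y')$
  (b) $h \leftarrow h_x * h_y \bmod N$, $h' \leftarrow E(h)$
  (c) Send $h'$ to C
(3) C:
  (a) $E(x * y) \leftarrow h' * E(x)^{N - r_y} * E(y)^{N - r_x} * E(r_x * r_y)^{N - 1}$
ALGORITHM 2: SM $(E_{pk}(x), E_{pk}(y)) \to E_{pk}(x * y)$.

Require: C has $E_{pk}(X)$, $E_{pk}(Y)$; P has sk
(1) C: for $1 \le i \le \ell$, compute $E(x_i - y_i) \leftarrow E(x_i) * E(y_i)^{N - 1}$
(2) C and P: for $1 \le i \le \ell$, run SM$(E(x_i - y_i), E(x_i - y_i))$ to obtain $E((x_i - y_i)^2)$
(3) C: $E(|X - Y|^2) \leftarrow \prod_{i=1}^{\ell} E((x_i - y_i)^2)$
ALGORITHM 3: SSED $(E_{pk}(X), E_{pk}(Y)) \to E_{pk}(|X - Y|^2)$.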
The cloud server can only obtain the cluster centers in each iteration and cannot obtain the participants' sample information.
The blockchain is only used to store hash values, which are publicly available, and it does not interact with the participants or the cloud server, so it does not access any private information about participants or the cloud server.

Privacy-Preserving Analysis for Assigning Samples.
In the stage of assigning samples, each participant encrypts its samples with the Paillier cryptosystem. Because the Paillier cryptosystem is semantically secure, other participants and the cloud server cannot decrypt the encrypted data or deduce additional information from it.
According to our scheme, the distances between samples and cluster centers are computed with the SM and SSED protocols. In the SM protocol, the cloud server has the encrypted values $E(x)$ and $E(y)$, and the participants hold the secret key sk. The aim of the SM protocol is to return the encrypted value of $x * y$ (i.e., $E(x * y)$) to the cloud server through interaction between the cloud server and participants.
The idea of the SM protocol is based on the following property, which holds for any $x, y, r_x, r_y \in \mathbb{Z}_N$:

$$x * y = (x + r_x) * (y + r_y) - x * r_y - y * r_x - r_x * r_y.$$

The detailed SM protocol is shown in Algorithm 2. Firstly, the cloud server chooses two random numbers $r_x, r_y$, which are known only to the cloud server. Then, the cloud server computes $x' = E(x + r_x) = E(x) * E(r_x)$ and $y' = E(y + r_y) = E(y) * E(r_y)$ and sends $x', y'$ to the participant. After receiving $x'$ and $y'$, the participant decrypts $h_x = D(x')$, $h_y = D(y')$ and computes $h = h_x * h_y = (x + r_x) * (y + r_y)$.
Then, the participant sends $h' = E(h)$ to the cloud server. Finally, the cloud server computes

$$E(x * y) = h' * E(x)^{N - r_y} * E(y)^{N - r_x} * E(r_x * r_y)^{N - 1} = E((x + r_x) * (y + r_y)) * E(x)^{-r_y} * E(y)^{-r_x} * E(r_x * r_y)^{-1}.$$

Note that $N - x$ is equivalent to $-x$ for any $x \in \mathbb{Z}_N$. During the SM protocol, no information about $x$ and $y$ except for $E(x * y)$ is revealed to the cloud server.
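The message flow of the SM protocol can be sketched in Python on top of a toy Paillier instance. This is an illustrative sketch under stated assumptions, not the authors' code: the primes are tiny, $g = N + 1$ is assumed, both roles run in one process, and all function names are hypothetical.

```python
import math
import random

# Toy Paillier parameters (assumption: tiny primes, illustration only).
P_PRIME, Q_PRIME = 1009, 1013
N = P_PRIME * Q_PRIME
N2 = N * N
G = N + 1
LAM = math.lcm(P_PRIME - 1, Q_PRIME - 1)
MU = pow((pow(G, LAM, N2) - 1) // N, -1, N)

def enc(m):
    """Paillier encryption under the public key (N, G)."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(G, m % N, N2) * pow(r, N, N2)) % N2

def dec(c):
    """Paillier decryption; only the participant holds LAM and MU."""
    return ((pow(c, LAM, N2) - 1) // N * MU) % N

def participant_step(x_masked, y_masked):
    """P: decrypt the masked values, multiply them, and re-encrypt."""
    h = (dec(x_masked) * dec(y_masked)) % N
    return enc(h)

def secure_multiply(ex, ey):
    """C: obtain E(x*y) from E(x) and E(y) with the participant's help."""
    rx, ry = random.randrange(N), random.randrange(N)
    x_masked = (ex * enc(rx)) % N2          # E(x + r_x)
    y_masked = (ey * enc(ry)) % N2          # E(y + r_y)
    h_enc = participant_step(x_masked, y_masked)
    # Remove the cross terms x*r_y, y*r_x, and r_x*r_y homomorphically.
    s = (h_enc * pow(ex, N - ry, N2)) % N2
    s = (s * pow(ey, N - rx, N2)) % N2
    return (s * pow(enc(rx * ry % N), N - 1, N2)) % N2
```

Here the participant only ever sees the masked values $x + r_x$ and $y + r_y$, while the cloud server only handles ciphertexts; decrypting `secure_multiply(enc(6), enc(7))` recovers the product of the two plaintexts.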
In the SSED protocol shown in Algorithm 3, the cloud server has the encrypted vectors $E(X) = \langle E(x_1), \ldots, E(x_\ell) \rangle$ and $E(Y) = \langle E(y_1), \ldots, E(y_\ell) \rangle$, and the participants hold the secret key sk. The goal of the SSED protocol is to compute the encrypted squared distance between $X$ and $Y$. The cloud server computes $E((x_i - y_i)^2)$ using the SM protocol and then computes $E(|X - Y|^2) = \prod_{i=1}^{\ell} E((x_i - y_i)^2)$, because the Paillier cryptosystem is additively homomorphic. Since the SM protocol does not reveal private information to the cloud server, SSED does not leak information either.
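A self-contained Python sketch of SSED built from the secure multiplication step is given below. As before, this is a toy illustration under assumptions: tiny primes, $g = N + 1$, both parties simulated in one process, and hypothetical function names.

```python
import math
import random

# Toy Paillier parameters (assumption: tiny primes, illustration only).
P_PRIME, Q_PRIME = 1009, 1013
N = P_PRIME * Q_PRIME
N2 = N * N
G = N + 1
LAM = math.lcm(P_PRIME - 1, Q_PRIME - 1)
MU = pow((pow(G, LAM, N2) - 1) // N, -1, N)

def enc(m):
    """Paillier encryption under the public key (N, G)."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(G, m % N, N2) * pow(r, N, N2)) % N2

def dec(c):
    """Paillier decryption with the participant's secret key."""
    return ((pow(c, LAM, N2) - 1) // N * MU) % N

def secure_multiply(ea, eb):
    """SM: the cloud masks, the participant multiplies, the cloud unmasks."""
    ra, rb = random.randrange(N), random.randrange(N)
    a_masked, b_masked = (ea * enc(ra)) % N2, (eb * enc(rb)) % N2
    # Participant side: decrypt the masked values, multiply, re-encrypt.
    h_enc = enc((dec(a_masked) * dec(b_masked)) % N)
    # Cloud side: strip the mask terms homomorphically.
    s = (h_enc * pow(ea, N - rb, N2)) % N2
    s = (s * pow(eb, N - ra, N2)) % N2
    return (s * pow(enc(ra * rb % N), N - 1, N2)) % N2

def ssed(enc_x, enc_y):
    """Return E(|X - Y|^2) from component-wise encrypted vectors."""
    total = enc(0)
    for ex, ey in zip(enc_x, enc_y):
        diff = (ex * pow(ey, N - 1, N2)) % N2   # E(x_i - y_i)
        squared = secure_multiply(diff, diff)   # E((x_i - y_i)^2)
        total = (total * squared) % N2          # homomorphic sum
    return total
```

For $X = (3, 7, 1)$ and $Y = (1, 2, 5)$, decrypting the result gives $2^2 + 5^2 + (-4)^2 = 45$; negative differences are represented modulo $N$ and square correctly.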
After computing the distances between samples and each cluster center, each sample is assigned to the closest cluster with the SBD, SMIN2, and SMINk protocols. In the SBD protocol, for a random $z$ $(0 \le z \le 2^{\ell})$, the cloud server can only obtain the encryptions of the individual bits of the binary representation of $z$. During the process, the cloud server has $E(z)$ and the participants have sk, and $z$ is not known to either of them. The detailed SBD protocol is described in [32]. The SMIN2 protocol is described in Algorithm 4; it helps the cloud server obtain the encrypted individual bits of $\min(u, v)$. The participant cannot directly obtain $u$ or $v$; it only sends an identifier $\alpha$ to the cloud server. Then, the cloud server can obtain the encrypted $\min(u, v)$ without decrypting or learning $u$ and $v$. Similarly, the SMINk protocol shown in Algorithm 5 is private and secure.

Privacy-Preserving Analysis for Updating Cluster Centers.
In the step of updating the cluster centers, we mainly use Shamir's secret sharing. The SSVS protocol is described in Algorithm 6. In the SSVS protocol, each participant generates a polynomial $f(x) = a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$ for each sample. Participants compute the shares and the hash values of the shares, and then upload the hash values to the blockchain. The VRS protocol is described in Algorithm 7. In the VRS protocol, participants send shares to and receive shares from each other. The secret values of a participant cannot be revealed even if all remaining participants pool their shares, because each participant executes Shamir's secret sharing algorithm with a random polynomial of degree $n - 1$. To compute the coefficients of such a polynomial, at least $n$ values of the polynomial are needed. In the VRS protocol, each participant actually computes $n$ values of its polynomial but only sends $n - 1$ of them to the other participants, keeping one for itself. Thus, as long as a participant does not reveal the polynomial value it holds, its secret value cannot be deduced even if the remaining participants combine their shares. So the VRS protocol can protect the privacy of the participants.
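The way the cloud server recovers only the sum of the secrets can be sketched in Python. This is a hedged illustration: the source does not specify the underlying field, so a prime field modulo $2^{61} - 1$ is assumed, the secret is placed in the constant term $a_0$ as in standard Shamir sharing, and all names are hypothetical.

```python
import random

PRIME = 2**61 - 1  # assumed field modulus (a Mersenne prime)

def make_shares(secret, xs, rng):
    """Evaluate a random degree-(n-1) polynomial with f(0) = secret at xs."""
    n = len(xs)
    coeffs = [secret % PRIME] + [rng.randrange(PRIME) for _ in range(n - 1)]
    return [
        sum(c * pow(x, e, PRIME) for e, c in enumerate(coeffs)) % PRIME
        for x in xs
    ]

def lagrange_at_zero(xs, ys):
    """Interpolate the polynomial through (xs, ys) and return its value at 0."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total

def secure_sum(secrets, xs, seed=0):
    """Simulate all participants and the cloud server in one process."""
    rng = random.Random(seed)
    # Row i holds participant P_i's shares f_i(x_1), ..., f_i(x_n).
    share_rows = [make_shares(s, xs, rng) for s in secrets]
    # P_j adds the shares it holds for x_j, giving F(x_j) = sum_i f_i(x_j).
    combined = [sum(row[j] for row in share_rows) % PRIME for j in range(len(xs))]
    # The cloud server interpolates F and reads off F(0), the sum of secrets.
    return lagrange_at_zero(xs, combined)
```

With three participants holding 5, 11, and 26 and public points 1, 2, 3, `secure_sum([5, 11, 26], [1, 2, 3])` returns 42; the cloud server sees only the combined values $F(x_j)$, never an individual polynomial.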

Resisting Attacks from Malicious Participants.
In our multi-party verifiable privacy-preserving federated k-means scheme, we do not assume that all participants are semi-honest. Thus, there may be malicious participants in the privacy-preserving k-means scheme. In the following, we show that our scheme can resist inconsistent sharing attacks from malicious participants. We assume that the cloud server is semi-honest and follows the proposed protocol.
A malicious participant may send inconsistent shares (e.g., shares of samples that do not belong to it) to other participants in the process of updating cluster centers. When participants receive shares from others, they compute the hash values of the shares and verify whether these hash values are on the blockchain. Furthermore, once information has been agreed upon and added to the blockchain, it is recorded by all nodes together and is cryptographically linked to the preceding and following blocks, making tampering very difficult and costly. Therefore, if a participant receives a share that cannot be verified, this indicates that the sender has sent an inconsistent share. We can then conclude that the sender does not follow the proposed protocol, and it is considered a malicious participant. In this way, incorrect k-means results due to inconsistent shares from malicious participants can be avoided.
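The commit-then-verify check can be sketched in Python with SHA-256. This is a minimal sketch under assumptions: a Python `set` stands in for the blockchain, the hash input is assumed to bind the sender, the receiver, and the share value (the source only states that hash values of shares are uploaded), and the function names are hypothetical.

```python
import hashlib

def share_hash(sender_id, receiver_id, share):
    """SHA-256 over an assumed encoding of (sender, receiver, share)."""
    data = f"{sender_id}|{receiver_id}|{share}".encode()
    return hashlib.sha256(data).hexdigest()

def publish(ledger, sender_id, shares):
    """Sender uploads the hash of every share before distributing them.

    `shares` maps each receiver id to the share intended for it.
    """
    for receiver_id, share in shares.items():
        ledger.add(share_hash(sender_id, receiver_id, share))

def verify(ledger, sender_id, receiver_id, received_share):
    """Receiver recomputes the hash and checks that it was committed."""
    return share_hash(sender_id, receiver_id, received_share) in ledger
```

If participant 1 commits to shares `{2: 111, 3: 222}` and later sends participant 2 the value 999 instead of 111, `verify` returns `False` and the inconsistency is detected before the cluster centers are updated.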

Performance Analysis
Experimental Analysis.
We use two datasets of different sizes for the experiments. The first one is the 2-dimensional synthetic dataset S1 [33], which is a clustering benchmark. S1 contains 2000 samples and 7 cluster centers. The second dataset is HCV [34], which contains laboratory values of blood donors and hepatitis C patients. HCV consists of 588 10-dimensional samples and 4 cluster centers.
To show that our scheme is suitable for multi-party scenarios, we assume that samples are randomly distributed among three participants, and we choose SHA256 as the required hash function. In the process of calculating metadata, the blockchain needs to run a consensus algorithm. The consensus time depends on the specific blockchain system selected. Therefore, this section only conducts experimental analysis on the proposed scheme and does not consider the execution cost of the blockchain itself.
The running times of the experiments on datasets S1 and HCV are shown in Tables 2 and 3, respectively, where n, k, C, and P represent the data size, the number of clusters, the cloud server, and the participants. The unit of time is seconds. We record the running time of one iteration under different data sizes, and the running time increases roughly in proportion to the data size.

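Require: C has a random number generator; P has sk
(1) C:
  (a) Choose n publicly known random numbers $x_1, \ldots, x_n$
  (b) Send $x_i$ to $P_i$
(2) P:
  (a) Generate a polynomial $f(x) = a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$ for each sample
  (b) Compute the shares and their hash values hv, and upload hv to blockchain B
ALGORITHM 6: SSVS (S ⟶ f(x), hv).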

Require: P has the clustering results
In the $c$th cluster, $P_1, P_2, \ldots, P_n$ have the sets of samples $C^{1}_{c}, C^{2}_{c}, \ldots, C^{n}_{c}$. For ease of expression, we assume here that $C^{i}_{c} = S_{i,1}$. In fact, $C^{i}_{c}$ may contain many samples.
(1) P: while hv is not on B do: [...]
(2) C: Solve the set of $F(x)$ using Lagrange interpolation to find the sum of samples S.sum, where S.sum $= S_{1,1} + S_{2,1} + \cdots + S_{n,1}$.

We calculate the respective time proportions of the cloud server and the participants to judge the performance of our scheme. In our scheme, the cloud server undertakes the main computing tasks, and its running time accounts for 77% of the total.

Accuracy Analysis.
We use the percentage of correctly classified samples as the measure of the accuracy of our scheme. For dataset S1, we compare the result of our privacy-preserving k-means clustering with that of plaintext k-means clustering. The result of privacy-preserving k-means clustering on S1 is shown in Figure 2, and the accuracy on the S1 dataset is 99.25%. In the same way, we calculated the accuracy on the HCV dataset to be 99.14%. We can therefore conclude that our scheme obtains high-accuracy clustering results.
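The accuracy measure described above can be sketched as a simple agreement percentage between two label assignments. This sketch assumes that the cluster indices of the privacy-preserving run and the plaintext run are already aligned (for instance, because both start from the same initial centers); the function name is hypothetical.

```python
def agreement_percentage(private_labels, plaintext_labels):
    """Percentage of samples assigned to the same cluster in both runs.

    Assumption: cluster index j denotes the same cluster in both lists.
    """
    if not private_labels or len(private_labels) != len(plaintext_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(private_labels, plaintext_labels))
    return 100.0 * matches / len(private_labels)
```

For instance, if three of four samples receive the same cluster index in both runs, the function returns 75.0.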

Functional Analysis.
We analyze the functionality of our scheme and compare it with the schemes proposed in [13,14,25] in terms of several properties, including the number of parties, protection of participants' information, protection of cluster centers, and verifiability. The result is shown in Table 4. In [25], Jiang et al. combined homomorphic encryption and garbled circuits to design an outsourced two-party privacy-preserving k-means clustering scheme. Compared with the scheme in [25], our scheme extends to the multi-party setting. Furthermore, our scheme can verify the shares from other participants with secret sharing, a hash function, and blockchain technology. In the step of updating cluster centers, the shares are verified against hash values calculated in advance, and the blockchain ensures that these hash values cannot be changed, so that once a share is verified, the participant is guaranteed to have shared consistent data. From Table 4, the significant advantage of our scheme over the other three schemes is verifiability.

Conclusion
In this article, we propose a multi-party verifiable privacy-preserving federated k-means scheme that provides a validation mechanism in an outsourced environment. By computing the hash values of secret shares in advance, we can detect malicious participants that send inconsistent shares and avoid incorrect k-means clustering results. The security and experimental analyses show that our scheme can protect privacy and obtain high-accuracy clustering results.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this manuscript.

Figure 2: Result of privacy-preserving k-means clustering on S1 dataset.