Privacy-Preserving k-Means Clustering under Multiowner Setting in Distributed Cloud Environments

With the advent of the big data era, clients who lack computational and storage resources tend to outsource data mining tasks to cloud service providers in order to improve efficiency and reduce costs. It is also increasingly common for clients to perform collaborative mining to maximize profits. However, due to the rise of privacy leakage issues, the data contributed by clients should be encrypted using their own keys. This paper focuses on privacy-preserving k-means clustering over joint datasets encrypted under multiple keys. Unfortunately, existing outsourcing k-means protocols are impractical because not only are they restricted to a single-key setting, but they are also inefficient and nonscalable for distributed cloud computing. To address these issues, we propose a set of privacy-preserving building blocks and an outsourced k-means clustering protocol under the Spark framework. Theoretical analysis shows that our scheme protects the confidentiality of the joint database and mining results, as well as access patterns, under the standard semihonest model with relatively small computational overhead. Experimental evaluations on real datasets also demonstrate its efficiency improvements compared with existing approaches.


Introduction
With the tremendous amount of data collected each day, it is increasingly difficult for resource-constrained clients to perform computationally intensive tasks locally. There are numerous cases of this, such as mobile phones or sensors in IoT systems with limited battery and small companies that lack hardware and software infrastructure. Thus, it is a viable option to outsource data mining tasks to a Cloud Service Provider (CSP), which provides massive storage and computation power in a cost-efficient way [1]. By leveraging cloud platforms, a great many giant IT companies have offered machine learning services to help clients train and deploy their own models, for example, Amazon Machine Learning [2], Google Cloud Machine Learning Engine [3], and IBM Watson [4]. Despite the advantages, privacy issues are the critical concern for cloud users employing these services. Since many data records may contain sensitive information, such as health conditions, financial records, and location information, outsourcing them in plain form inevitably reveals personal privacy to the CSP, which may be untrustworthy or even malicious. For instance, it is favorable to improve diagnosis accuracy by utilizing data mining techniques over medical records gathered from multiple patients [5], while releasing such information to the public directly is prohibited by laws in many countries, for example, HIPAA [6]. Thereby, appropriate mechanisms should be designed to guarantee that the outsourced mining tasks are executed in a privacy-preserving manner.
In this paper, we focus on privacy protection techniques for outsourced k-means clustering, which is a widely used data mining algorithm in the fields of image analysis, information retrieval, pattern recognition, and so on. The outsourced datasets are contributed by multiple data owners who are willing to collaborate in outsourced clustering in order to obtain more accurate results. Normally, the data records are encrypted via cryptographic tools to prevent them from being disclosed to other parties. The goal of our solution is to let cloud servers perform clustering over the encrypted data.
Traditional privacy-preserving clustering schemes cannot be directly adopted to address the privacy issues of outsourcing. They aim at computing clusters through interactions among different data holders without revealing respective data to others [7][8][9], whereas, in our case, the data are stored and processed by the cloud rather than by the clients themselves.
Most existing works on outsourced privacy-preserving clustering require cloud clients to utilize the same key for data encryption [10][11][12][13][14][15]. In practice, sharing the same encryption key has some disadvantages: (1) for a symmetric encryption scheme, compromised data owners can easily decrypt other owners' encrypted data if they launch eavesdropping attacks, which suits the case in [10][11][12][13][14]; (2) for asymmetric encryption, if the datasets are encrypted under the cloud's public key, data owners cannot decrypt their uploaded data because they do not know the private key [15]. One way to solve this is to allow data owners to encrypt their data with their own keys, but this calls for computation over encrypted data under multiple keys (denoted by multikey). The only work [16] concerning multikey clustering was constructed on the multiplicative transformation method proposed in [17], whereas their proposed solutions expose all owners' keys to the query user, which causes security risks if the user is compromised.
Another issue with current research is that their threat models do not account for higher-level attacks. The essence of the underlying schemes in [10, 14, 16] is to apply a random matrix to encrypt data. These are secure against the known-sample attack [17] (namely, the attacker only knows some instances). But if the attacker also knows the corresponding encrypted values of some data (i.e., a chosen plaintext attack), the remaining instances may be recovered by setting up enough equations. In addition, the fully homomorphic encryption (FHE) used in [12, 13] is not secure according to [18]. Furthermore, access patterns (assignment of data objects, etc.) are disclosed to cloud servers in [14, 16]. They may be used to derive sensitive information regardless of data encryption, as indicated by [19, 20]. Last but not least, few works consider combining their privacy-preserving techniques with large-scale data processing frameworks to boost efficiency.
To address these challenges, we propose a novel solution for Privacy-Preserving k-means Clustering Outsourcing under Multikeys (PPCOM), which enables distributed cloud servers to perform clustering collaboratively over the aggregated datasets with no privacy leakage. Specifically, the major contributions of this paper are fourfold: (i) Firstly, our solution allows the cloud to perform arithmetic computations over encrypted data under multiple keys. This is achieved by transforming ciphertexts under different keys into ones under a unified key through a double decryption cryptosystem. Since the encryption scheme is only partially homomorphic, we propose a secure addition subprotocol under the noncolluding two-server model, which reveals nothing about the inputs or the output, including the input ratio. Based on these, the cloud can compute Euclidean distances between records and cluster centers.
(ii) Secondly, we propose two secure primitives to evaluate equality tests and to compare ciphertexts, addressing the problem that encrypted data are incomparable because of their probabilistically random distribution. These two are further utilized in other privacy-preserving building blocks, including minimum Euclidean distance computation and encrypted bitmap conversion, as well as cluster center updating.
(iii) Thirdly, on the basis of those privacy-preserving building blocks, we design the PPCOM protocol by integrating the Spark framework to accelerate the outsourcing process, which works in distributed cloud environments. Moreover, PPCOM requires no client participation after the clients upload their datasets encrypted under their own private keys.
(iv) Fourthly, theoretical analysis demonstrates that the proposed protocol not only protects the privacy of aggregated data records and cluster centers, but also hides access patterns under the semihonest model. Experimental results on a real dataset show that PPCOM is much more efficient than existing methods in terms of computational overhead.
The rest of the paper is organized as follows. Section 2 reviews the related works. In Section 3, we describe the preliminaries required for understanding our proposed PPCOM. In Section 4, we formalize the system model, threat model, and design objectives. The design details of the proposed privacy-preserving building blocks as well as the outsourcing protocol are presented in Section 5. We provide security analysis in Section 6. Section 7 shows the theoretical and experimental evaluation results. Finally, we conclude the paper and outline future work in Section 8.

Related Work
There have been many works on privacy-preserving distributed k-means clustering [7][8][9]. These works have different security requirements and design goals compared with ours. In the distributed setting, where data are partitioned among multiple parties, the clustering task is undertaken by the data holders instead of centralized servers. Generally, their schemes exploit secure multiparty computation (SMC) techniques so as to preclude one party's data from being disclosed to the others, except for the final results. In contrast, for clustering outsourcing, data owners intend to transfer the major workloads to cloud servers for the sake of reducing costs and improving efficiency. During the entire outsourcing process, all the inputs and outputs, as well as intermediate results, are supposed to be encrypted to ensure confidentiality.
As for outsourced k-means clustering, Liu et al. [12] first leveraged an FHE technique to perform outsourced clustering. To compare encrypted distances, their approach requires the data owner to provide trapdoor information during each iteration, which entails heavy overhead on clients. To reduce the amount of data owner participation, Almutairi et al. [13] presented an efficient mechanism using the concept of an Updatable Distance Matrix (UDM). Nevertheless, both works reveal partial privacy to cloud servers, such as the size of each cluster and the distance between a data object and a centroid. Moreover, the encryption scheme adopted in [12, 13] is not secure according to [18].
Another line of research for outsourced clustering is to use distance-preserving data perturbation or transformation techniques to encrypt the dataset [21]. Keeping distances after encryption enables the cloud to update clusters independently without the data owner's involvement, which achieves efficiency approaching that of unencrypted data. However, as [17, 22] pointed out, these solutions are weak in security. Specifically, if attackers obtain some original instances (i.e., the known-sample attack (KSA) [17]), the rest may be recovered by identifying the corresponding encrypted ones and setting up enough equations. The work by Lin [11] focused on kernel k-means instead of standard k-means to avoid the distance-preserving vulnerability of random transformation. For outsourced collaborative data mining, Huang et al. [16] utilized the asymmetric scalar-product-preserving encryption (ASPE) proposed in [17], which is resilient to KSA, to compare distances. However, Yao et al. [23] demonstrated that ASPE is vulnerable to the linear analysis attack (LAA) [23]. To accelerate clustering efficiency, much research [24, 25] has been done to integrate MapReduce into the k-means algorithm and optimize its performance, while few of these works take privacy protection into account. Most recently, a secure scheme based on MapReduce to support large-scale datasets was proposed by Yuan and Tian [14]. By preserving the sign of the encrypted distance difference like ASPE, their approach enables the cloud to assign each data object to its closest cluster. It is resistant against both KSA and LAA. Unfortunately, none of the previously mentioned encryption schemes were formally proved to be secure against chosen plaintext attack (CPA); meanwhile, some sensitive information, such as the assignment of data objects and the size of clusters, is directly disclosed to cloud servers.
To further reinforce security, Rao et al. [15] proposed a semantically secure scheme for outsourced distributed clustering over the aggregated encrypted data from multiple users. Based on the Paillier cryptosystem, their solution protects not only the confidentiality of data contents, but also access patterns, from cloud servers and other users. In addition, participation of users is no longer required during outsourcing. Their design objective is similar to ours, except that their protocol supports neither computation under multikeys nor the Spark framework. Besides, the cost of their secure comparison is heavy, since each input has to be decomposed into encrypted bits by calling the SBD subroutine [26]. This will be demonstrated in the experimental evaluations in Section 7. In regard to computation under the multikey setting, López et al. [27] proposed a new FHE; however, its efficiency suffers from complex key-switching and heavy interactions among users. There are other works that utilize a double decryption mechanism [28] or proxy reencryption technique [29] to convert ciphertext keys, allowing two servers to conduct addition and multiplication operations under multikeys. Nevertheless, these basic operations still cannot fulfill the need to perform more sophisticated data mining computations, for example, similarity measurement.

Preliminaries
In this section, we briefly introduce the typical k-means clustering algorithm and the public key cryptosystem with double decryption mechanism, which serve as the basis of our solution.
k-Means Clustering.

Initially, k records are selected randomly as the cluster centers c_1, . . ., c_k. Then the algorithm executes in an iterative fashion. For each record t_i, the algorithm computes the Euclidean distance between t_i and every centroid c_j for 1 ≤ j ≤ k and updates the membership matrix V according to arg min_j ‖t_i − c_j‖², that is, assigns t_i to the closest cluster c_j. Later, each centroid c_j is derived by computing the mean values of the attributes of the records belonging to c_j. With the updated c_1, . . ., c_k, the clustering algorithm begins the next iteration. Finally, the algorithm terminates if the matrix V does not vary any more or if a predefined maximum count of iterations is reached [10].
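The iteration described above can be sketched in plain, unencrypted Python (a minimal illustration only; parameter names such as `max_iters` are ours, and the convergence test on the membership follows the description above):

```python
import random

def kmeans(records, k, max_iters=100):
    """Plain (non-private) k-means as described above."""
    # Initially, k records are selected randomly as cluster centers.
    centers = random.sample(records, k)
    assignment = None
    for _ in range(max_iters):
        # Assign each record to the closest center (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda j: sum((t - c) ** 2 for t, c in zip(rec, centers[j])))
            for rec in records
        ]
        # Terminate if the membership does not vary any more.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update each centroid as the mean of its members.
        for j in range(k):
            members = [rec for rec, a in zip(records, assignment) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment
```

The privacy-preserving protocol in Section 5 performs exactly these steps, but over ciphertexts and split across noncolluding servers.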

Public Key Cryptosystem with Double Decryption.
A public key cryptosystem with a double decryption mechanism (denoted by PKC-DD) allows an authority to decrypt any ciphertext by using the master secret key, without the consent of the corresponding owner. In this paper, we use the scheme proposed by Youn et al. [30] as our secure primitive, which is more efficient than the scheme in [31] in that Youn et al.'s approach applies a smaller modulus in cryptographic operations. The major steps are shown in the following.
(i) Key generation (KeyGen(κ) → N, g, msk, pk, sk): given a security parameter κ, the master authority chooses two primes p and q (|p| = |q| = κ) and defines N = p²q. Then it chooses a random number g in Z*_N such that the order of g_p = g^{p−1} mod p² is p. The master secret key msk = (p, q) is known only to the authority. The public parameters are N, g. A cloud user picks a random integer sk ∈ {0, 1, . . ., 2^{κ−1} − 1} as secret key and computes pk = g^{sk} mod N as public key.

(ii) Encryption (Enc(pk, m) → C): the encryption algorithm picks a random r and outputs the ciphertext C = (C₁, C₂) = (g^r mod N, pk^r · m mod N).

(iii) Decryption with user key (uDec(sk, C) → m): the decryption algorithm takes the ciphertext C and sk as inputs and outputs the message m by computing m ← C₂/C₁^sk mod N.
By applying the general conversion method in [32], the scheme was claimed to be IND-CCA2 secure under the hardness of solving the -DH Problem [30]. However, Galindo and Herranz [33] constructed an attack by generating invalid public keys and querying for the master decryption, which may lead to the factorization of N. To solve this, we adopt a slight modification of the scheme by checking the validity of the secret key, as proposed in [33]. If sk ≥ 2^{κ−1}, the master entity outputs a rejection message; otherwise, the decryption proceeds as usual.
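Ignoring the double-decryption trapdoor and the N = p²q structure, the user-side algorithms behave like multiplicative ElGamal. A toy sketch over a small prime modulus (all parameters are illustrative stand-ins and deliberately insecure):

```python
import random

# Toy stand-ins: the real scheme works modulo N = p^2 * q with
# sk < 2^(kappa-1); a small prime keeps this sketch readable.
P = 467   # toy modulus (NOT secure)
g = 2

def keygen():
    sk = random.randrange(1, P - 1)
    pk = pow(g, sk, P)                               # pk = g^sk mod N
    return pk, sk

def enc(pk, m):
    r = random.randrange(1, P - 1)
    return (pow(g, r, P), pow(pk, r, P) * m % P)     # C = (g^r, pk^r * m)

def udec(sk, c):
    c1, c2 = c
    return c2 * pow(pow(c1, sk, P), -1, P) % P       # m = C2 / C1^sk

def hmul(ca, cb):
    # Componentwise product: Enc(m1) x Enc(m2) encrypts m1 * m2.
    return (ca[0] * cb[0] % P, ca[1] * cb[1] % P)
```

The componentwise product `hmul` is the multiplicative homomorphism that the building blocks in Section 5 rely on.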

Problem Statement
In this section, we formally describe our system model, threat model, and design objectives.
4.1. System Model. In our system model depicted in Figure 1, there are five types of entities, described as follows.

(1) Data owner (DO): each DO holds its own dataset, which is encrypted under its own key before being outsourced to the cloud. There are n DOs, that is, DO_1, . . ., DO_n.

(2) Query client (QC): QC is the party that issues the clustering request and retrieves the final mining results.

(3) Executing worker (EW): an EW server is a cluster node within the Storage and Computation Provider, which is responsible for storing DOs' datasets and performing computation over them. There are EW_1, . . ., EW_e in the system. Together they constitute a parallel Spark cluster, working on the same distributed file system, such as HDFS, and providing cloud users with massive storage and computing power.
(4) Key authority (KA): KA belongs to Cryptographic Service Provider, which is assigned with distribution and management of public parameters and public/private keys, as well as the master key of the cryptosystem.
(5) Assistant worker (AW): AW is also part of the Cryptographic Service Provider. An AW server holds the public/private keys generated by KA, with which AW is able to assist EW in executing a series of privacy-preserving building blocks. We assume that there are a AWs, that is, AW_1, . . ., AW_a. All AWs and KA constitute the cluster of the Cryptographic Service Provider. Note that they offer sufficient computing power for temporary tasks, while they do not store the combined database.
Previous studies have shown that it is impossible to implement a noninteractive protocol in the single-server setting under a partially homomorphic encryption scheme [34]. So at least two servers are required to complete the outsourced computation [35]. In the design of the system model, we take into account the fact that there are usually a large number of servers in one CSP. Moreover, it is feasible to build secure outsourcing protocols through the cooperation of cloud servers from different CSPs. ∀i ∈ [1, n], DO_i generates its own key pair pk_i/sk_i using the parameters produced by KA and encrypts D_i with pk_i before uploading it to EW. With the joint datasets as inputs, the distributed cloud servers are scheduled to perform the k-means clustering algorithm in a privacy-preserving manner.

Threat Model.
In our threat model, all cloud servers and clients are assumed to be semihonest, which means that they strictly follow the prescribed protocol but try to infer private information from the messages they receive during protocol execution. This assumption is consistent with existing works [11][12][13][14][15][16] on privacy-preserving clustering in cloud environments. Furthermore, the cloud servers have some prior knowledge regarding the distribution of the owners' datasets, which may be used to launch inference attacks by analyzing access patterns [19]. DOs, QCs, EWs, AWs, and KA are also interested in learning plain data belonging to other parties. Therefore, a CPA adversary A is introduced in the threat model. The goal of A is to decrypt the ciphertexts from the challenge DO and the challenge QC, with the following abilities: (i) A may compromise all the EWs to guess the plaintexts of the ciphertexts received from DOs and AWs during the execution of the protocol.
(ii) A may compromise all the AWs and KA to guess the plaintext values of ciphertexts sent from EWs during the protocol interactions.
(iii) A may compromise one or more DOs and QCs except the challenge DO and the challenge QC to decrypt the ciphertexts belonging to the challenge party.
Nevertheless, we assume that the adversary A cannot compromise EWs and AWs and KA simultaneously; otherwise, A would be able to decrypt any ciphertext stored on the Storage and Computation Provider with the keys from the Cryptographic Service Provider. In other words, there is no collusion between these two cloud providers, whereas servers within a provider may collude with each other. We remark that such assumptions are typical in adversary models used in cryptographic protocols (e.g., [15, 28, 36]), in that cloud providers are mostly competitors and economically driven by different business models.

Design Objectives. Our design should achieve the following objectives.

(i) Correctness: if the cloud users and servers both follow the protocol, the final decrypted result should be the same as in the standard k-means algorithm.

(ii) Data confidentiality: nothing regarding the contents of the datasets D_1, . . ., D_n and the cluster centers c_1, . . ., c_k, as well as the size of each cluster, should be revealed to the semihonest cloud servers.

(iii) Access pattern hiding: access patterns of the clustering process, such as which records are assigned to which clusters, should not be revealed to the cloud, to prevent any inference attacks [37].

(iv) Efficiency: most computation should be processed by the cloud in an efficient way, while DOs and QCs are not required to be involved in the outsourced clustering.

The PPCOM Solution
In this section, we first discuss a set of privacy-preserving building blocks.Then the complete protocol of PPCOM is presented.
Recall that, in Section 3.1, the semihonest but noncolluding cloud servers need to cooperate to perform computation over encrypted data under the PKC-DD scheme. At first, KA takes a security parameter κ as input and generates the public parameters (N, g) and the master secret key msk by executing KeyGen(κ). Also, KA generates a key pair pk_u/sk_u used to unify the ciphertext encryption key. After that, pk_u/sk_u and msk are sent to the AWs, while (N, g) is distributed to the DOs and QC, which use them to produce their own key pairs pk_i/sk_i for i ∈ [1, n]. The generated public keys are sent back to KA for management. Hereafter, let Enc_pk(·) denote the underlying encryption, and let uDec_sk(·) and mDec_msk(·) denote user-side decryption and master-side decryption, respectively. |x| represents the bit length of x.

Privacy-Preserving Building Blocks.
We present eight privacy-preserving building blocks under multikeys.Five of them aim at solving basic operations over ciphertexts, including ciphertext transformation, multiplication, addition, equality test, and minimum, while the rest are especially designed for outsourced clustering.
It can be observed that the underlying encryption scheme is multiplicatively homomorphic due to the following equation:

Enc_pk(m_1) × Enc_pk(m_2) = (g^{r_1+r_2}, pk^{r_1+r_2} · m_1 · m_2) = Enc_pk(m_1 · m_2),

where Enc_pk(m_i) = (g^{r_i}, pk^{r_i} · m_i), for i = 1, 2. This property is critical in that multiplication over ciphertexts can be evaluated by one EW server independently, as long as the encryptions are under the same public key. Hereafter, "×" denotes multiplication in the encrypted domain while "·" represents multiplication in the plaintext domain. For secure addition (SA), a straightforward solution is as follows. To preclude AW from obtaining m_1 and m_2, EW blinds the inputs with a random value a by multiplying them with Enc_pk_u(a), where a ∈_R Z_N and GCD(a, N) = 1. Then the encrypted randomized data are sent to AW. Since AW holds the secret key, it is able to obtain a · m_1 and a · m_2 through decryption. The randomized addition (denoted by S) is computed as S = a · m_1 + a · m_2 mod N. After that, AW encrypts S and sends it back to EW, which can get the desired output by running Enc_pk_u(S) × Enc_pk_u(a^{−1}). Nevertheless, partial privacy is still revealed to AW. This is because the ratio of the inputs can be calculated via m_1/m_2 ← (a · m_1)/(a · m_2), which may be utilized to distinguish the inputs. As our threat model assumes that the semihonest servers have some background knowledge of the dataset distribution, it is effortless for AW to find correlations between encrypted records and known samples. Therefore, to achieve the privacy-preserving guarantees, disclosing the input ratio must be prohibited during SA execution.
We propose an enhanced SA protocol, still under the two-server model, which protects the confidentiality of inputs and outputs as well as intermediate results. There are five steps in SA, the details of which are presented in Algorithm 2.
The desired output Enc_pk_u(m_1 + m_2) is recovered in the final step by stripping the blinding factor from the returned value S_3. The correctness of the SA protocol can be verified by tracing the randomization and derandomization steps in Algorithm 2. Executing Algorithm 2 requires two rounds of interactions between EW and AW, which incurs more computational and communication overhead than the simple solution. However, it reveals no privacy to either cloud server. The formal security analysis of SA is given in Section 6.
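For contrast with the enhanced protocol, the straightforward, ratio-leaking variant described above can be sketched with the same toy ElGamal-style parameters (an illustrative assumption, not the paper's Algorithm 2; `sa_naive` and the toy modulus are ours):

```python
import random

P = 467   # toy prime modulus standing in for N = p^2 * q (NOT secure)
g = 2

def enc(pk, m):
    r = random.randrange(1, P - 1)
    return (pow(g, r, P), pow(pk, r, P) * m % P)

def dec(sk, c):
    return c[1] * pow(pow(c[0], sk, P), -1, P) % P

def hmul(ca, cb):
    return (ca[0] * cb[0] % P, ca[1] * cb[1] % P)

def sa_naive(pk, sk, c1, c2):
    """Straightforward secure addition: correct, but leaks m1/m2 to AW."""
    # EW blinds both inputs with the same random a (invertible mod P).
    a = random.randrange(1, P - 1)
    b1, b2 = hmul(c1, enc(pk, a)), hmul(c2, enc(pk, a))
    # AW (holding sk) decrypts the blinded values and adds them.
    s = (dec(sk, b1) + dec(sk, b2)) % P        # s = a*m1 + a*m2 mod P
    # Note: AW also sees (a*m1)/(a*m2) = m1/m2 -- the leaked input ratio.
    # EW removes the blinding: Enc(s) x Enc(a^-1) = Enc(m1 + m2).
    return hmul(enc(pk, s), enc(pk, pow(a, -1, P)))
```

The enhanced SA of Algorithm 2 achieves the same result while also hiding this ratio from AW.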

Secure Equality Test (SET) Protocol.
Given that EW holds two encrypted values Enc_pk_u(m_1) and Enc_pk_u(m_2) while AW holds the secret key sk_u, the goal of SET is to test whether m_1 and m_2 are equal without revealing them to the cloud servers. The detailed steps are presented in Algorithm 3.
At the first step, EW computes the fraction of the two input ciphertexts and blinds it with a random exponent r, where r ∈_R Z_N, yielding an encryption of (m_1/m_2)^r. During Step (2), AW decrypts the result using sk_u and obtains t = (m_1/m_2)^r mod N. Since r is randomly selected in Z_N, t is a random value if and only if m_1 ≠ m_2. For m_1 = m_2, it is obvious that t = 1. Thus, if AW obtains t = 1, the returned value is set to true; otherwise, it is set to false. It is worth noting that neither m_1, m_2 nor the intermediate result m_1/m_2 is revealed to the cloud during the execution of SET.
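The blinded-ratio test can be sketched with the same toy ElGamal-style scheme (illustrative only; in this sketch we additionally pick the blinding exponent coprime to the group order, so exponentiation is a bijection and the test never gives a false match — the real protocol works over Z_N):

```python
import random, math

P = 467   # toy prime standing in for N (NOT secure)
g = 2

def enc(pk, m):
    r = random.randrange(1, P - 1)
    return (pow(g, r, P), pow(pk, r, P) * m % P)

def dec(sk, c):
    return c[1] * pow(pow(c[0], sk, P), -1, P) % P

def set_protocol(pk, sk, c1, c2):
    # EW: form the ciphertext fraction, which encrypts m1/m2 ...
    inv = (pow(c2[0], -1, P), pow(c2[1], -1, P))
    frac = (c1[0] * inv[0] % P, c1[1] * inv[1] % P)
    # ... and blind it with a random exponent r coprime to the group order.
    r = random.randrange(1, P - 1)
    while math.gcd(r, P - 1) != 1:
        r = random.randrange(1, P - 1)
    blinded = (pow(frac[0], r, P), pow(frac[1], r, P))  # encrypts (m1/m2)^r
    # AW: decrypt; the plaintext is 1 iff m1 == m2, random otherwise.
    return dec(sk, blinded) == 1
```

AW learns only whether the blinded value equals 1, never m_1, m_2, or their ratio.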

Secure Squared Euclidean Distance (SSED) Protocol.
For the k-means algorithm, we use the squared Euclidean distance to measure the distance between a data record and a cluster centroid, denoted by ‖t_i − c_j‖², supposing that EW holds the ciphertext of the i-th data record t_i and the ciphertext of the j-th cluster centroid c_j, while AW holds the secret key sk_u, for 1 ≤ i ≤ m and 1 ≤ j ≤ k.
Note that c_j is a vector composed of d attributes, which may be rational numbers. However, the ring Z_N does not support rational division, so a new form of expression is required to represent the cluster center. Let ⟨α_j, β_j⟩ denote the new form of the cluster center, where α_j and β_j represent the sum of the attribute values and the size of cluster c_j, respectively, so that c_j = α_j/β_j.

Secure Squared Distance Comparison (SSDC) Protocol.

Apart from the two encrypted squared distances, EW also has encrypted secrets associated with the distances, that is, Enc_pk_u(s_a) and Enc_pk_u(s_b). The output of SSDC is the encrypted minimum squared distance and its corresponding secret. Since our encryption scheme is probabilistic and does not preserve the order of messages, EW and AW should jointly compute the minimum without revealing ⟨Enc_pk_u(Ω_{i,a}), Enc_pk_u(β_a)⟩, ⟨Enc_pk_u(Ω_{i,b}), Enc_pk_u(β_b)⟩, s_a, or s_b to either server.
Our key idea is to compute the fraction of the two squared Euclidean distances, based on which AW is able to judge their relationship and return to EW an encrypted identifier that indicates the minimum value. The fraction between the two squared Euclidean distances can be calculated as follows:

Φ = (Ω_{i,a} · β_b²)/(Ω_{i,b} · β_a²).    (6)

Since Ω_{i,a} · β_b² and Ω_{i,b} · β_a² are integers within Z_N, the ratio Φ may be a rational value in Q, according to (6). It can be observed that if ⌊Φ⌋ < 1 (⌊·⌋ truncates the decimal fraction while keeping the integer part), we deduce that ‖t_i − c_a‖² < ‖t_i − c_b‖²; otherwise, ‖t_i − c_a‖² ≥ ‖t_i − c_b‖². The overall steps of SSDC are given in Algorithm 4.
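In the plaintext domain, the test in (6) amounts to a cross-multiplied comparison of two rational distances; a minimal sketch (the function name is ours, and this shows only the arithmetic, not the encrypted protocol):

```python
from fractions import Fraction

def dist_a_is_smaller(omega_a, beta_a, omega_b, beta_b):
    """Plaintext view of the SSDC test: each squared distance is the
    rational Omega / beta^2 (centers are kept as <sum, size> pairs).
    Phi = (Omega_a * beta_b^2) / (Omega_b * beta_a^2), as in Eq. (6);
    Phi < 1 (i.e. floor(Phi) < 1 for positive Phi) exactly when
    ||t_i - c_a||^2 < ||t_i - c_b||^2."""
    phi = Fraction(omega_a * beta_b ** 2, omega_b * beta_a ** 2)
    return phi < 1
```

Because only the sign of the comparison matters, the protocol can reveal an encrypted indicator of ⌊Φ⌋ < 1 without disclosing the distances themselves.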
One may prefer to use Enc_pk_u(1) and Enc_pk_u(0) to represent the indicator, which is more straightforward. However, it is not secure to utilize an encryption of "0", because in Enc_pk_u(0) = (C_1, C_2) we have C_2 = 0 · pk_u^r mod N = 0. In that case, EW can obviously infer that the encrypted message is zero by observing the ciphertext part.
During Step (3), EW takes the received indicator and the encrypted squared distances as well as the secrets as inputs and computes the target minimum values. It invokes a secure subroutine called ComputeMin, as shown in Algorithm 5, which requires EW to have the two encrypted squared distances along with their corresponding encrypted secrets. The correctness of the SSDC protocol is proven as follows. Let us take Enc_pk_u(Ω_min) as an example: if θ mod 2 = 1, EW and AW jointly execute ComputeMin; it can then be observed that if s = 2, we have Enc_pk_u(Ω_min) = Enc_pk_u(Ω_{i,a}); otherwise, Enc_pk_u(Ω_min) = Enc_pk_u(Ω_{i,b}). Likewise, Enc_pk_u(s_min) and Enc_pk_u(β_min) are calculated in a similar way.

Secure Minimum among k Squared Distances (SMkSD)
Protocol. We assume that EW holds a set of k encrypted squared Euclidean distances [d(i, 1)], . . ., [d(i, k)], and AW holds the secret key sk_u. Besides, EW also has the encryptions of the secrets corresponding to these distances, that is, Enc_pk_u(s_1), . . ., Enc_pk_u(s_k). The goal of SMkSD is to compute the encryption of the shortest squared distance along with its encrypted secret, denoted by [d_min] and Enc_pk_u(s_min), respectively. To execute SMkSD, we compute the minimum by utilizing SSDC with two inputs each time in a sequential fashion. The computational complexity of this algorithm is O(k). Alternatively, it can be executed in a binary tree hierarchy like SMINn in [15], which takes at most ⌈log₂ k⌉ iterations.
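The binary-tree schedule can be sketched on plaintext values (illustrative only; `pair_min` stands in for one SSDC invocation on two encrypted inputs):

```python
def tree_min(values, pair_min=min):
    """Pairwise-minimum schedule arranged as a binary tree:
    ceil(log2 k) rounds instead of k - 1 sequential comparisons.
    Returns the minimum and the number of rounds used."""
    rounds = 0
    while len(values) > 1:
        # One round: compare disjoint pairs; comparisons within a round
        # are independent and could run in parallel.
        nxt = [pair_min(values[i], values[i + 1])
               for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:            # an unpaired element advances as-is
            nxt.append(values[-1])
        values = nxt
        rounds += 1
    return values[0], rounds
```

The total number of pairwise comparisons is still k − 1, but the round count drops to ⌈log₂ k⌉, which matters when each comparison is an interactive protocol.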

Secure Index to Bitmap Conversion (SIBC) Protocol.
Given that EW has the encrypted index of the closest cluster, denoted by Enc_pk_u(ν) (ν ∈ [1, k]), and AW holds the private key sk_u, the output of SIBC is a bitmap vector Λ composed of k encrypted elements: the ν-th element is Enc_pk_u(2), while all the other elements are Enc_pk_u(1). During the execution of SIBC, neither the index nor the bitmap should be revealed to either server. The position of Enc_pk_u(2) in Λ is thus the index of the nearest cluster. The typical form of Λ is as follows:

Λ = (Enc_pk_u(1), . . ., Enc_pk_u(2), . . ., Enc_pk_u(1)).

The complete steps are presented in Algorithm 6.
During Step (2), AW decrypts the permuted set Γ′ and computes the fraction for each part. It is easy to infer that, for each position, the decrypted value t_j equals 1 if the underlying index j equals ν, and is a random number in Q otherwise. Note that AW cannot learn the relationship between t_x and t_y for x ≠ y, since they are randomized by blinding factors; besides, the ratio t_x/t_y is also a random number. Thus, as long as the permutation π is kept confidential, the index of the closest cluster is not revealed to AW. In the end, EW recovers the true sequence of Λ by running the inverse permutation π^{−1}(·). Furthermore, it should be emphasized that the method based on (10) can also be used in other scenarios where an equality test is required.
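The whole index-to-bitmap flow can be sketched with the same toy ElGamal-style scheme (illustrative assumptions throughout: the toy modulus, the coprime blinding exponents, and the function names are ours; Algorithm 6's exact blinding may differ):

```python
import random, math

P = 467   # toy prime standing in for N (NOT secure)
g = 2

def enc(pk, m):
    r = random.randrange(1, P - 1)
    return (pow(g, r, P), pow(pk, r, P) * m % P)

def dec(sk, c):
    return c[1] * pow(pow(c[0], sk, P), -1, P) % P

def sibc(pk, sk, c_nu, k):
    """EW holds c_nu = Enc(nu); returns the bitmap Lambda of k ciphertexts."""
    inv = (pow(c_nu[0], -1, P), pow(c_nu[1], -1, P))   # encrypts 1/nu
    blinded = []
    for j in range(1, k + 1):
        ratio = (inv[0], j * inv[1] % P)               # encrypts j/nu
        r = random.randrange(1, P - 1)
        while math.gcd(r, P - 1) != 1:                 # keep x -> x^r bijective
            r = random.randrange(1, P - 1)
        blinded.append((pow(ratio[0], r, P), pow(ratio[1], r, P)))
    # EW permutes the blinded set before handing it to AW.
    perm = list(range(k))
    random.shuffle(perm)
    gamma = [blinded[p] for p in perm]
    # AW decrypts: plaintext 1 marks the hidden index, so it re-encrypts
    # 2 at that position and 1 elsewhere (0 would be insecure, see above).
    marks = [enc(pk, 2 if dec(sk, c) == 1 else 1) for c in gamma]
    # EW inverts the permutation to recover the bitmap Lambda.
    bitmap = [None] * k
    for pos, p in enumerate(perm):
        bitmap[p] = marks[pos]
    return bitmap
```

AW sees only a shuffled set in which exactly one blinded value decrypts to 1, so it learns nothing about which cluster index that position corresponds to.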

Secure New Cluster Computation (SNCC) Protocol.
Given that EW holds the assignment membership matrix V, the encrypted dataset, and the target cluster c_j, the goal of SNCC is to calculate the new cluster centroid, denoted by [c_j], where [c_j] = ⟨Enc_pk_u(α_j), Enc_pk_u(β_j)⟩. During the execution of SNCC, nothing regarding the data records, the sums of attributes, or the cluster sizes should be disclosed to the cloud servers. The complete steps are shown in Algorithm 7.
During Step (2), EW and AW jointly compute the encryption of the cluster size Enc_pk_u(β_j) by invoking the SA subprotocol. It can be verified that (12) sums up those records whose membership value V_{i,j} equals 2 and discards those with V_{i,j} = 1, through which the final result is the size of the cluster c_j.
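The plaintext effect of this aggregation can be sketched as follows (a minimal, unencrypted analogue; the function name is ours):

```python
def new_center(records, membership_col):
    """Plaintext analogue of SNCC for one cluster: membership_col holds
    V_{i,j} in {1, 2}; rows marked 2 belong to the cluster. Returns the
    pair <alpha_j, beta_j>: per-attribute sums and the cluster size."""
    members = [rec for rec, v in zip(records, membership_col) if v == 2]
    beta = len(members)
    alpha = ([sum(col) for col in zip(*members)]
             if members else [0] * len(records[0]))
    return alpha, beta
```

Keeping the centroid as the pair ⟨α_j, β_j⟩ rather than the quotient α_j/β_j is what lets the encrypted domain avoid rational division, as discussed in Section 5.1.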

The Complete PPCOM Protocol.
In this subsection, we present our proposed PPCOM protocol for the standard k-means algorithm, which works in distributed cloud environments.
The primary goal of PPCOM is to schedule a group of cloud servers to perform the clustering task over the joint datasets encrypted under multiple keys, while no privacy of data records, intermediate results, or final clusters is revealed to the semihonest servers. In order to improve the overall performance, we leverage a large-scale data processing engine called Spark [38]. It develops a data structure called the resilient distributed dataset (RDD) for data parallelism and fault tolerance, which facilitates iterative algorithms in machine learning. Though it provides a scalable machine learning library, MLlib, which includes the k-means algorithm [39], it does not take privacy protection into account and cannot process encrypted data either. Therefore, it is essential to combine our proposed building blocks from Section 5.1 with the Spark computing framework to design PPCOM.
PPCOM is composed of four stages, that is, Data Uploading, Ciphertext Transformation, Clustering Computation, and Result Retrieval, the details of which are described in the following.

Data Uploading Stage.
We assume that all data records have been preprocessed by the data owners. One essential preprocessing step is to normalize data values, because different attributes have different value domains, which possibly leads to the case that attributes with large domains have greater impacts on the accuracy of distance-based clustering. Normalization enables records to fall into a common range by endowing all attributes with equal weights. In this paper, we adopt Min-Max Normalization [40]. Suppose that attribute A has observed values v_1, v_2, . . ., v_m, [min_A, max_A] is the range of A, and [new_min_A, new_max_A] is the target range. Min-Max Normalization maps v_i into v'_i in [new_min_A, new_max_A] by calculating the following equation:

v'_i = ((v_i − min_A)/(max_A − min_A)) · (new_max_A − new_min_A) + new_min_A.    (13)

After the preprocessing step, DO_i encrypts its dataset D_i with pk_i, where D*_i denotes the encryption of D_i. After all DOs complete uploading their datasets to the EWs, the cloud aggregates the distributed datasets into a joint database D* = ⋃_{i=1}^{n} D*_i. Under this circumstance, DO_i is still able to retrieve its data and decrypt it with its private key sk_i, whereas DO_j cannot decrypt D*_i without the corresponding sk_i for j ≠ i.
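Equation (13) in code (a direct sketch; the default target range [0, 1] is an illustrative choice, as is the handling of constant attributes):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-Max Normalization as in Eq. (13): map each observed value v
    from [min_A, max_A] into the target range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant attribute: map to new_min
        return [new_min] * len(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]
```

Each attribute column is normalized independently before encryption, so that no single attribute dominates the Euclidean distances.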

Ciphertext Transformation Stage.
Upon receiving a clustering request from QC, the EWs initiate the ciphertext transformation procedure, which converts ciphertexts under pk_i into encryptions under the unified key pk_u, for i ∈ [1, n]. EW first replicates D_i into D_i' to preserve the DOs' access to their original datasets. Then the SCT subprotocol is executed to output the converted dataset (denoted by D̂_i). This stage is important for two reasons: (1) EW can perform homomorphic operations only on ciphertexts under the same public key; (2) user decryption is much more efficient than master decryption.
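The message flow of this stage can be mocked as below. This is a structural sketch only: `ToyDoubleDecryption` is a stand-in we invented for the interfaces of a double decryption scheme (it provides no actual secrecy), and the blind / master-decrypt / re-encrypt / unblind round-trip is the generic pattern such transformations follow; the real SCT subprotocol is specified in Section 5.1.1.

```python
import random

P = 2_147_483_647  # toy public prime modulus, for illustration only

class ToyDoubleDecryption:
    """Interface mock of a PKC-DD-style scheme: each ciphertext records the
    public key it is under, and the master secret can open any ciphertext.
    This models message flow only; it has none of the real scheme's security."""
    def keygen(self):
        sk = random.randrange(2, P - 1)
        return sk, pow(3, sk, P)              # (secret, public) pair
    def enc(self, pk, m):
        return {"pk": pk, "body": m % P}      # placeholder: no real hiding
    def dec(self, sk, pk, c):
        assert c["pk"] == pk                  # user decryption under one key
        return c["body"]
    def master_dec(self, c):                  # msk opens any ciphertext
        return c["body"]

def sct(scheme, c_i, pk_u):
    """SCT flow: EW multiplicatively blinds the ciphertext, AW master-decrypts
    the blinded value and re-encrypts it under pk_u, then EW removes the blind."""
    r = random.randrange(1, P)                      # EW's blinding factor
    blinded = dict(c_i, body=c_i["body"] * r % P)   # EW -> AW
    m_blinded = scheme.master_dec(blinded)          # AW sees only m*r mod P
    c_u = scheme.enc(pk_u, m_blinded)               # AW -> EW, now under pk_u
    r_inv = pow(r, -1, P)                           # EW strips the blinding
    return dict(c_u, body=c_u["body"] * r_inv % P)
```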

Clustering Computation Stage.
With all the converted records Enc_pk_u(x_ij), for i ∈ [1, n] and j ∈ [1, d], the objective of this stage is to compute the encrypted cluster centroids [c_1], ..., [c_k] and the membership matrix Γ without compromising privacy. The outsourcing process is not only protected by the proposed secure building blocks, but also accelerated by the Spark framework. The stage comprises four steps, namely, Job Assignment, Map Execution, Reduce Execution, and Update Judgement, as shown in Figure 2.
Step 1 (Job Assignment). Firstly, the CSPs select θ minimum computing units (MCUs). Each MCU is composed of one EW and one AW; that is, MCU = {EW, AW}, so an MCU can execute the cryptographic building blocks on its own. Since the workload of an AW is light compared with that of an EW, one AW may be shared among several MCUs to maximize resource usage. We assume that every cloud node possesses sufficient storage and computational power for its assigned job. The set {MCU_1, ..., MCU_θ} is divided into two disjoint sets, a Map set and a Reduce set. Without loss of generality, Map consists of MCU_1, ..., MCU_w, while Reduce comprises the remaining MCUs. Then the converted dataset D̂ is divided into w uniformly sized partitions P_1, ..., P_w, which are transferred to their corresponding MCU nodes in the Map set. In this paper, we assume that the k initial cluster centers are chosen by the DOs in advance and are also encrypted. For every i ∈ [1, w], Map[i] is given the initial cluster centroid set C = {[c_1], ..., [c_k]}, where [c_j] = ⟨Enc_pk_u(s_j), Enc_pk_u(n_j)⟩ for 1 ≤ j ≤ k.
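A minimal sketch of the partitioning logic is given below. All names are illustrative; the real assignment also pairs each EW with an AW and is executed on Spark.

```python
def assign_jobs(mcus, dataset, num_map):
    """Sketch of Step 1 (Job Assignment): split the MCUs into disjoint Map
    and Reduce sets, then cut the converted dataset into uniformly sized
    partitions, one per Map MCU."""
    map_set, reduce_set = mcus[:num_map], mcus[num_map:]
    size = -(-len(dataset) // num_map)  # ceiling division: records per partition
    partitions = [dataset[i:i + size] for i in range(0, len(dataset), size)]
    return map_set, reduce_set, partitions

# e.g. four MCUs, ten records, two Map nodes -> two partitions of five records
m, r, parts = assign_jobs(["MCU1", "MCU2", "MCU3", "MCU4"], list(range(10)), 2)
```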
Step 2 (Map Execution). Given P_i and C as inputs, Map[i] (i ∈ [1, w]) outputs a key-value table T_i, in which the key is a record and the value is the encryption of a bitmap indicating the closest cluster. Suppose that P_i includes s_i data records. As presented in Algorithm 8, each MCU in Map executes the following steps in parallel: (1) it computes the encrypted squared Euclidean distance between each record and each cluster center [c_h], for 1 ≤ h ≤ k (Steps (1)-(5)); (2) it computes the encrypted index of the minimum of the k distances for each record by calling SMkSD (Step (6)); (3) it converts that index into an encrypted bitmap via the SIBC scheme (Step (7)). The final output of Map[i] is the table T_i with s_i entries, each consisting of a data record as the key and its assignment bitmap as the value.
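Over plaintexts, the Map step reduces to the following sketch. In PPCOM, every arithmetic and comparison operation below runs on ciphertexts through SSED, SMkSD, and SIBC; the function and variable names here are ours.

```python
def map_assign(partition, centroids):
    """Plaintext analogue of Map Execution: for each record, compute the
    squared Euclidean distance to every centroid, find the index of the
    minimum, and emit a one-hot 'bitmap' marking the closest cluster."""
    table = []
    k = len(centroids)
    for rec in partition:
        # squared Euclidean distance to each of the k centers (SSED)
        dists = [sum((x - c) ** 2 for x, c in zip(rec, cen)) for cen in centroids]
        h = min(range(k), key=dists.__getitem__)          # index of minimum (SMkSD)
        bitmap = [1 if j == h else 0 for j in range(k)]   # one-hot bitmap (SIBC)
        table.append((rec, bitmap))
    return table

# two records, two centers: each record is marked with its nearest center
out = map_assign([(0, 0), (10, 10)], [(1, 1), (9, 9)])
```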
Step 3 (Reduce Execution). Taking {T_1, ..., T_w} from the Map MCUs and the aggregated dataset D̂ as inputs, Reduce[j] computes the cluster center of cluster C_j based on the assignment membership matrix Γ. The major steps are presented in Algorithm 9. Each MCU in Reduce concurrently executes the following: (1) it merges the assignment vectors in {T_1, ..., T_w} into the complete membership matrix Γ (Step (4)); (2) it computes the new center of its target cluster by invoking SNCC (Step (7)). The final output of Reduce[j] is the encrypted centroid [c_j] of C_j and the matrix Γ.
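The corresponding plaintext analogue of the Reduce step is sketched below; the real protocol keeps the per-cluster sums and counts encrypted and combines them via SNCC.

```python
def reduce_center(tables, target):
    """Plaintext analogue of Reduce Execution: merge the per-partition
    assignment tables into one membership matrix, then compute the target
    cluster's center as a (sum of member records, member count) pair."""
    merged = [row for table in tables for row in table]   # full matrix
    members = [rec for rec, bitmap in merged if bitmap[target] == 1]
    count = len(members)
    sums = [sum(col) for col in zip(*members)] if members else []
    return sums, count   # the center is recovered later as sums / count

# two Map outputs, cluster 0 holds records (0,0) and (2,4)
t1 = [((0, 0), [1, 0])]
t2 = [((2, 4), [1, 0]), ((9, 9), [0, 1])]
```

Returning the pair rather than the quotient mirrors the protocol, where each centroid is kept as ⟨Enc(sum), Enc(count)⟩ and the division happens only at Result Retrieval.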
Step 4 (Update Judgement). This step determines whether the predefined termination condition of the k-means algorithm is satisfied; in this paper, the condition is that the membership matrix no longer changes. Given that EW holds the previous matrix Γ and the current matrix Γ' and AW holds the key sk_u, our strategy is to compare the elements of Γ and Γ' one by one using the SET subprotocol. Once a mismatch is detected, EW replaces Γ with Γ' and returns to Step 2 of the Clustering Computation Stage; otherwise, the servers continue comparing until the end of the matrix. If Γ = Γ', meaning that the assignment of clusters no longer varies, EW terminates the iteration and activates the Result Retrieval Stage.
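The termination check is a short-circuit element-wise comparison; in PPCOM each inequality test below corresponds to one SET invocation between EW and AW.

```python
def matrices_equal(prev, curr):
    """Plaintext analogue of Update Judgement: compare the previous and
    current membership matrices element by element and stop at the first
    mismatch; full equality terminates the k-means iteration."""
    for row_p, row_c in zip(prev, curr):
        for a, b in zip(row_p, row_c):
            if a != b:
                return False   # mismatch: run another Map/Reduce round
    return True                # unchanged: activate Result Retrieval
```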

Result Retrieval Stage.
Since the cluster center set C and the assignment matrix Γ are encrypted under pk_u, QC cannot decrypt them without sk_u. Firstly, the SCT scheme is invoked to transform the encryption key of C and Γ from pk_u to pk_c. After that, QC downloads them and decrypts the final result with sk_c. The final center of cluster C_j is recovered as c_j ← s_j / n_j, for j ∈ [1, k].

Security Analysis
We first analyze the security of the privacy-preserving building blocks. Since all parties are semihonest, security in this model can be proven under the "real-versus-ideal" framework [41]: in the ideal model, all computations are performed by a trusted third party, and a protocol is secure if every adversarial behavior in the real world can be simulated in the ideal world. Because the proofs of the proposed building blocks are essentially the same, we take the formal proof of the SA subprotocol as an example.
Theorem 1. The SA protocol described in Section 4.1 securely computes addition over ciphertexts using the PKC-DD cryptosystem in the presence of two semihonest but noncolluding cloud servers.
Proof. Since the algorithm is completed collaboratively by EW and AW, we must show that SA is secure both against an adversary A_EW corrupting EW and against an adversary A_AW corrupting AW. During the outsourcing process, the cloud servers invoke the proposed building blocks as subroutines, and all transmitted data are encrypted at every step. The data records are held by cloud parties that lack the decryption key; the assistant parties that hold the key can decrypt the received data, but the underlying values are randomized. Since PKC-DD is semantically secure and the blinding factors are chosen at random, nothing about the data contents or the computed clusters is revealed to the servers. Moreover, the access patterns, namely which encrypted input is the minimum Euclidean distance (as shown in SSDC, Algorithm 4) and which record is assigned to which cluster, are hidden from the cloud by the encryption of Γ (as shown in SNCC, Algorithm 7). By the Composition Theorem [41], the sequential composition of the four stages of PPCOM is secure under the semihonest model.
Discussions. Note that the order of the additions computed via SA inside ComputeMin (called by SSDC) must not be altered, even though the final outcome would be the same. Suppose the clouds instead chose to compute Enc_pk_u(v · u − v) or Enc_pk_u(2 · u − v · u); if the result were Enc_pk_u(0), it would inevitably reveal v = 1 or v = 2 to both servers. Similarly, the order of the computation steps of Algorithm 7 in SNCC must be kept unchanged.

Performance Analysis
In this section, we analyze the performance of PPCOM protocol from both theoretical and experimental perspectives.
7.1. Theoretical Analysis. Let Exp and Mul denote the modular exponentiation and multiplication operations, respectively, and let |N| denote the key size of the double decryption scheme. Encryption under the underlying cryptosystem costs 2Exp + 1Mul; normal (user) decryption costs 1Exp + 1Mul, while authority (master) decryption costs 2Exp + 2Mul. Recall that d is the dimension of a record and n is the size of the joint dataset. The computational and communication overheads of the major building blocks and of one clustering iteration are given in Table 1. The addition and comparison operations incur many Exp operations, which is the price of hiding access patterns. The Data Uploading Stage costs each DO 2Exp + Mul in computation and 2|N| bits in communication, while the Ciphertext Transformation Stage costs 8Exp + 9Mul in computation and 4|N| bits in communication. We stress that Stage 1 and Stage 2 of PPCOM are executed only once, so their overheads are amortized over the iterations. Furthermore, the number of MCUs in the Map set is closely related to the Map cost, while k affects the performance of Reduce more significantly (usually n ≫ k). The larger the computing cluster, the fewer tasks are distributed to each unit, because the k-means jobs are parallelized under Spark. As for the Update Judgement step, the worst case compares every element of the matrix, so its computational and communication costs depend on n and k.
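For a back-of-the-envelope view, the following sketch plugs the operation counts quoted above into assumed per-operation timings. The 50 µs and 0.5 µs figures are illustrative placeholders, not measurements from the paper.

```python
# Per-operation counts from the analysis above
# (Exp = modular exponentiation, Mul = modular multiplication).
ENC  = {"Exp": 2, "Mul": 1}   # encryption cost
UDEC = {"Exp": 1, "Mul": 1}   # user (normal) decryption
MDEC = {"Exp": 2, "Mul": 2}   # authority (master) decryption

def total_cost(counts, t_exp_us=50.0, t_mul_us=0.5):
    """Rough wall-clock estimate (microseconds) for an operation mix,
    given assumed per-Exp and per-Mul timings (illustrative values)."""
    return counts["Exp"] * t_exp_us + counts["Mul"] * t_mul_us

# e.g. the Ciphertext Transformation Stage: 8 Exp + 9 Mul per record,
# paid once and amortized over all later iterations
stage2 = {"Exp": 8, "Mul": 9}
print(total_cost(stage2))   # 8*50 + 9*0.5 = 404.5
```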

7.2. Experimental Analysis.
The experiments are conducted on our local cluster, in which each server runs CentOS 6.5 on an Intel Xeon E5-2620 @ 2.10 GHz with 12 GB of memory. We implemented all the outsourcing protocols using the Crypto++ 5.6.3 library and the Spark framework. The key size used by the encryption schemes in [15] and by the BCP encryption in [28] is 1024 bits, which is commonly considered acceptable. To achieve the same security level, |N| of PKC-DD should be 500~600 bits larger than an RSA modulus [42]. During all tests, we choose the security parameter κ = 512, so the key size is |N| = 1536 bits.
To facilitate comparisons, we use the KEGG Metabolic Reaction Network dataset [43], which includes 65554 instances and 29 attributes. Before clustering, all records are first normalized into integers in [0, 1000] to prevent the impact of attributes with large value ranges, as mentioned in the Data Uploading Stage of Section 5.2. Note that the first attribute is excluded from the tests, since it is merely the pathway identifier. All testing records are randomly selected from the KEGG dataset.

Privacy-Preserving Building Blocks' Performance.
We first measure the execution time of each privacy-preserving building block on a single server over 1000 runs; the average costs are shown in Table 2. The costs of the compound protocols built from basic primitives (e.g., SSED, SMkSD, and SNCC) are relatively high, since they are composed of several SA operations, which involve many encryptions and decryptions as well as rounds of interaction to preserve data privacy. This is consistent with the theoretical analysis.
We then evaluate the performance of the SCT scheme. Table 3 shows the ciphertext transformation time for varying dataset size (n) in SCT and in KeyProd of [28]. The cloud running time grows with increasing n. Our scheme executes about 4 times faster than KeyProd because PKC-DD works more efficiently than their cryptosystem. Besides, we remark that the schemes in [28] are designed for basic arithmetic operations under multiple keys rather than for complex mining tasks such as the k-means algorithm.
We also compare the cloud running time of SSED and SMkSD with their counterpart methods in PPODC [15]. As shown in Figures 3(a) and 3(b), the computation time of both grows with the dataset size, and our schemes outperform PPODC's. Let ℓ denote the bit length of a plaintext message. In Figure 3(b), the computation time of PPODC grows more rapidly with ℓ, because in [15] ciphertexts must be decomposed into bitwise encryptions before comparison. Next, we evaluate the overhead on the cloud servers with varying k and d when n = 2000 and θ = 8, in comparison with the optimized PPODC with 8 parallelized server pairs. The results are given in Figures 4(a) and 4(b). The computation costs of both protocols grow almost linearly with k, and PPCOM outperforms PPODC; for example, when k = 10 and d = 12, the cloud computation time of PPODC is 381.798 min, that is, 4.33 times that of our scheme. The efficiency is gained not only from the improved secure primitives but also from the Spark framework. Nevertheless, the communication overhead of PPCOM is relatively high, mainly because of the frequent interactions in SA. Furthermore, growth of the dimension size also increases the computational and communication overheads of both protocols.

Factors Affecting PPCOM's Performance. There are three major factors affecting the outsourced clustering performance: (1) the number of clusters (k); (2) the number of parallelized MCUs (θ); (3) the size of the aggregated dataset (n).
Next, we evaluate the overhead on the cloud servers with varying θ when k = 4 and d = 20. As shown in Figure 5(a), the computation time decreases as θ grows. We can conclude that (1) scaling out the parallelized servers accelerates the outsourced clustering task and (2) PPCOM requires less computation than PPODC to accomplish the same amount of work. Figure 5(b) shows that the communication cost of both schemes remains unchanged regardless of θ, because the total amount of clustering work is fixed. However, PPCOM incurs a heavier communication overhead in order to protect data privacy and access patterns.
Moreover, we evaluate the impact of n on the cloud servers' performance with k = 4 and θ = 10. From Figure 6(a), we observe that the running time of both protocols increases with n, as more data need to be clustered, and that the cost of PPODC grows more sharply than ours. Figure 6(b) shows the computation overheads of the Map stage and the Reduce stage during the execution of PPCOM: the Map stage takes a larger share of the total cost than Reduce, and both scale linearly with n. In addition, doubling the parallelized MCUs saves almost 40% of the Map execution time.

Comparative Summary. As shown in Table 4, we summarize qualitative comparisons with existing outsourced k-means protocols. All protocols claim to protect input data privacy. The encryption schemes of the first three are built on random transformations, such as a randomized kernel matrix [11] and random invertible matrices [14, 16]. They were proven secure against the known-sample attack (KSA), in which the attacker knows a set of plaintext data objects in the dataset but not the corresponding encrypted values. Yuan and Tian's work [14] can also defeat the linear analysis attack (LAA) introduced by [23]. However, these schemes are weak when the attacker obtains both some data objects and their encryptions. The latter four outsourcing protocols are based on homomorphic encryption techniques and can resist the chosen-plaintext attack (CPA). References [12, 13] adopt Liu's FHE as the underlying encryption scheme, which may not be sufficiently secure, as illustrated by [18]. Rao et al.'s scheme [15] and ours achieve semantic security, relying on the hardness of Decisional Composite Residuosity [44] and of the Diffie-Hellman problem [30], respectively.
From this table, it can be seen that only [16] and ours support computation over data encrypted under multiple keys, whereas a drawback of [16] is that it reveals either the data owners' keys to the query user or the query user's key to the owners. Rao et al.'s scheme [15] and ours hide access patterns by executing clustering obliviously, which prevents cloud servers from learning the cluster membership of encrypted records, knowledge that could otherwise be used to launch inference attacks.
Several schemes require data owners to participate in the mining process to update cluster centroids or to assist similarity comparison; those of [11, 15] and ours do not. Almost all works let the cloud compare encrypted distances, while the approach in [13] compares via a plaintext updatable matrix. What is more, only Yuan and Tian's work [14] and ours consider how to integrate a big data processing framework into privacy-preserving protocols. Consequently, our solution achieves the most comprehensive combination of security guarantees and feasibility for clustering outsourcing among current works.

Conclusion
In this paper, we proposed an efficient privacy-preserving protocol for outsourced k-means clustering over joint datasets encrypted under multiple data owners' keys. By utilizing a double decryption cryptosystem, we designed a series of privacy-preserving building blocks to transform ciphertexts and to evaluate addition, multiplication, equality, comparison, and other operations over encrypted data. Our protocol not only protects the privacy of the aggregated database but also hides access patterns under the semihonest model. A further improvement is that the outsourced clustering runs under a big data processing framework and therefore scales to large datasets. Experiments on a real dataset show that our scheme is more efficient than existing approaches. However, the computation and communication costs of PPCOM remain heavy for large datasets; our future work will focus on reducing these overheads.

Figure 3: Performance of SSED and SMkSD with varying dataset size (n).

Figure 4: Experimental analysis with varying number of clusters (k) on the real dataset.

Figure 5: Experimental analysis with varying number of parallelized MCUs (θ) on the real dataset.

Figure 6: Experimental analysis with varying size of the joint dataset (n) from the real dataset.
5.1.1. Secure Ciphertext Transformation (SCT) Protocol. Given that EW holds Enc_pk_i(m) and AW holds (msk, pk_u), the goal of the SCT protocol is to transform the encryption of message m under public key pk_i into another ciphertext under public key pk_u. During the execution of SCT, the plaintext m must not be revealed to EW or AW; meanwhile, the output must be a valid encryption of m under pk_u.
5.1.2. Secure Addition (SA) Protocol. It takes Enc_pk_u(m_1) and Enc_pk_u(m_2) held by EW and sk_u held by AW as inputs. The output is the encrypted sum of m_1 and m_2, that is, Enc_pk_u(m_1 + m_2), which becomes known only to EW. As the encryption scheme is not additively homomorphic, SA requires interaction between EW and AW.
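The paper's SA internals are not reproduced here, but a standard way to add under a multiplicatively homomorphic scheme is a blind / decrypt / unblind round-trip, sketched below over a toy ElGamal instance. All parameters are illustrative, and this simplified version leaks more than the real SA (AW sees both blinded operands and hence their ratio); it is meant only to show why interaction between EW and AW is required.

```python
import random

# Toy ElGamal over a small prime: multiplicatively homomorphic, i.e.
# componentwise Enc(x) * Enc(y) = Enc(x * y mod p). Not secure parameters.
p = 2_147_483_647   # toy prime; real deployments use >= 1536-bit moduli
g = 7               # toy group element

def keygen():
    sk = random.randrange(2, p - 2)
    return sk, pow(g, sk, p)

def enc(pk, m):
    y = random.randrange(2, p - 2)
    return (pow(g, y, p), m * pow(pk, y, p) % p)

def dec(sk, c):
    a, b = c
    return b * pow(a, p - 1 - sk, p) % p

def hmul(c, d):
    """Homomorphic multiplication: Enc(x) (*) Enc(y) = Enc(x*y mod p)."""
    return (c[0] * d[0] % p, c[1] * d[1] % p)

def secure_addition(c1, c2, pk, sk):
    """Toy SA flow: EW blinds both ciphertexts with the same random r via the
    multiplicative homomorphism, AW decrypts and adds the blinded plaintexts
    and re-encrypts, then EW strips r homomorphically."""
    # -- EW: blind both inputs with the same random factor r --
    r = random.randrange(2, p - 1)
    b1, b2 = hmul(c1, enc(pk, r)), hmul(c2, enc(pk, r))
    # -- AW: decrypt blinded values, add them, re-encrypt --
    s = (dec(sk, b1) + dec(sk, b2)) % p        # = r * (m1 + m2) mod p
    c_s = enc(pk, s)
    # -- EW: remove the blinding factor homomorphically --
    return hmul(c_s, enc(pk, pow(r, -1, p)))
```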