Securely Outsourcing ID3 Decision Tree in Cloud Computing

With the wide application of the Internet of Things (IoT), a huge amount of data is collected from IoT networks and needs to be processed, for example by data mining. Although it is popular to outsource storage and computation to the cloud, doing so may compromise the privacy of participants' information. Cryptography-based privacy-preserving data mining has been proposed to protect the privacy of the participating parties' data in this process. However, handling ciphertext computation and analysis across multiple participants is still an open problem. Moreover, existing algorithms rely on the semihonest security model, which requires all parties to follow the protocol rules. In this paper, we address the challenge of outsourcing the ID3 decision tree algorithm in the malicious model. In particular, to securely store and compute on private data, a two-participant symmetric homomorphic encryption scheme supporting addition and multiplication is proposed. To guard against malicious behavior of the cloud computing server, secure garbled circuits are adopted to construct a privacy-preserving weighted average protocol. Security and performance are analyzed.


Introduction
In the modern Internet of Things (IoT), huge amounts of data are collected from sensor networks and need to be analyzed with highly effective techniques, such as data mining. This process requires enormous computation and storage resources, which cloud computing technology can provide. However, this process may leak the participants' private information. Privacy-preserving data mining (PPDM) based on encryption has emerged as a solution to this problem.
Privacy-Preserving Data Mining Framework. Based on different frameworks and theories, PPDM was originated independently by Lindell et al. [1] and Agrawal et al. [2] in 2002. Lindell's framework is essentially a secure cryptography-based two-participant computation protocol without outsourcing. In other words, two parties can interactively compute $(x_1 + x_2)\ln(x_1 + x_2)$ on their private inputs $x_1$ and $x_2$. Agrawal's framework is essentially a single-participant perturbation-based data storage and computation outsourcing algorithm. In particular, one party can upload perturbed data to a server for private computation. With the development of cloud computing and IoT, a multiparty storage and computation outsourcing framework is preferred.
Cryptography-based privacy-preserving data mining supporting one-party outsourcing has been studied [3,4] using homomorphic encryption. However, multiple-key homomorphic encryption is an open problem when multiple parties are involved in the outsourcing framework. For example, how can ciphertext addition and multiplication be executed on ciphertexts encrypted under different public keys?
Security Models. We usually consider two different security models: the semihonest model and the malicious model. The semihonest model requires that all users follow the rules of the protocol, but it allows dishonest users to obtain the internal states of the other users. In the malicious model, in contrast, corrupted users are allowed to deviate from the specified protocol. The adversary succeeds if it can obtain the results of these protocols.

Wireless Communications and Mobile Computing
Data Distribution. Three types of distributed datasets are defined in related works: horizontally distributed datasets, vertically distributed datasets, and arbitrarily distributed datasets. In a horizontally distributed dataset, each user holds a subset of the records over the same attributes. In a vertically distributed dataset, users hold different attributes. In an arbitrarily distributed dataset, the data can be divided among and stored by the users in any way.
In a real environment, malicious participants may not follow the protocol. For example, they can intentionally tamper with the data or suspend the protocol at any time during its execution. To solve this problem, this paper combines noninteractive commitment with the garbled circuit mechanism, studies an average computing protocol based on garbled circuits, and then proposes the framework of a secure cryptography-based two-participant protocol with data storage and computation outsourcing. The framework consists of two data owners and two cloud servers (cloud storage server (CSS) and cloud computing server (CCS)). Each data owner has a horizontally distributed private database that is encrypted before being outsourced to the cloud for storage and computation.
For counting, we propose the Secure Equivalent Testing (SET) protocol to calculate the number of items for each attribute value based on the encrypted data.
To perform addition and multiplication over ciphertexts, we adopt the Paillier encryption system and implement the Secure Multiplication (SM) protocol.
To perform comparison over ciphertexts, we adopt the Secure Minimum out of 2 Numbers (SMIN2) protocol.

Related Work
Distributed PPDM without Outsourcing. Distributed PPDM without outsourcing mainly targets data that are stored and processed locally by the participants. The data mining methods applied to the distributed data can be decomposed into basic operations, such as average computation, logarithm computation, and vector inner product computation. Cryptography-based techniques are then used to design privacy-preserving protocols for these operations. In 2002, Lindell and Pinkas [1] proposed a secure ID3 decision tree algorithm over horizontally partitioned data. They decomposed the distributed ID3 algorithm into logarithm computation, polynomial evaluation, and data comparison and then designed a secure logarithm protocol, a polynomial evaluation protocol, and a secure comparison protocol, thereby achieving privacy preservation in the distributed ID3 algorithm. In 2007, Emekci et al. [5] implemented a secure addition protocol based on secret sharing and extended the secure logarithm protocol from two parties to multiple parties, thus realizing a multiparty privacy-preserving ID3 method. However, the complexity of the algorithm increases exponentially as the number of participants grows. In 2012, Lory et al. [6] used Chebyshev polynomial expansion to replace the Taylor expansion in [1], further improving the computational efficiency of the secure logarithm protocol. However, the efficiency of these privacy-preserving protocols remains limited.
In contrast, in 2003, Vaidya et al. [7,8] designed a multiparty privacy-preserving ID3 algorithm for vertically distributed data sets. They vectorized all attribute value information by constructing constraint sets and then processed it with a secure intersection protocol, thus obtaining a privacy-preserving ID3 for vertically distributed data sets.
In 2007, Han and Ng [9] proposed a multiparty distributed privacy-preserving ID3 method for arbitrarily distributed data sets. First, each participant's data set is vectorized, and the attribute value information is computed using a secure intersection protocol. Then, the entropy value of each attribute is computed using a secure logarithm protocol. In this way, a privacy-preserving ID3 decision tree classification method for arbitrarily distributed data sets is obtained. However, as the number of participants increases, the computation load of the client increases exponentially.
Li et al. [10] and Gao et al. [11] addressed the Naive Bayes Learning for aggregated arbitrary distributed databases.
PPDM with Computation Outsourcing. Cryptography-based privacy-preserving data mining involves many encryption and decryption operations during computation. Therefore, it is difficult to apply to large-scale data processing. As a measure for solving resource-restricted problems, the outsourcing technique has been widely used in cloud computing applications, such as data sharing [12,13], data storing [14,15], data updating [16,17], and social network analysis [18,19]. In this context, we rely on secure outsourcing technology to outsource the computing or storage tasks of all participants to the cloud, thus greatly reducing the computing and storage load of each participant. In 2014, Liu et al. [3] adopted a new encryption scheme that supports both addition and multiplication over ciphertexts. In this scheme, most of the computations are performed on the cloud, which reduces the computation workload of the data owner. However, the scheme is limited to a single party's data mining operation. Chen et al. [20] designed new algorithms for secure outsourcing of modular exponentiations. In 2015, Bost et al. [21] proposed privacy-preserving hyperplane decision, Naive Bayes, and decision tree classification algorithms and proved, in the semihonest secure two-party computation model, that these schemes satisfy semantic security. The related protocols can also be combined with adaptive boosting to further enhance the accuracy of the classifiers. The resulting library of privacy-preserving classifier building blocks lays a solid foundation for the further development of privacy-preserving classification algorithms.

PPDM with Multiparticipant Data Storage and Computation Outsourcing. In 2013, Peter et al. [22] proposed a new solution for the outsourcing of multiparty computation. Such a technique can be used in our setting. However, like the previous works, it achieves security only in the semihonest model. In [23], a new protocol was proposed to achieve data mining for two parties. In [24], association rule mining was addressed in the malicious model. In [25], privacy-preserving KNN classification was addressed. In [26], the deep learning task was addressed. Besides the above related work, several fundamental secure algorithms have been proposed and also considered in the malicious model, such as dynamic homomorphic encryption [27,28], authentication [29,30], and lightweight multiparty computation [31]. However, to the best of our knowledge, no existing study has considered a method for outsourcing computation in the malicious model.
In this study, the secure outsourcing of ID3 data mining is considered in the malicious model for the cloud environment. We show how to solve the outsourcing problem for the ID3 protocol over horizontally partitioned data.

Preliminaries
In this section, we present a brief overview of the preliminaries used in this paper, including the ID3 decision tree algorithm, Paillier's homomorphic encryption scheme, and the other related protocols.

Distributed ID3 Decision Tree Algorithm.
The ID3 algorithm is described as follows. It builds a decision tree in a top-down manner from the information in the samples. Starting at the root, the attribute that best classifies the objects is selected at each node. The best attribute is determined by the information gain. The information gain of an attribute $A_j$ is defined as
$$\mathrm{Gain}(S, A_j) = \mathrm{Entropy}(S) - \sum_{a_j \in A_j} \frac{|S_{a_j}|}{|S|}\,\mathrm{Entropy}(S_{a_j}).$$
Then each party can calculate the $\mathrm{Gain}(S, A_j)$ value at its own side.
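As a plaintext illustration of the gain computation, the following minimal Python sketch evaluates the standard ID3 entropy and information gain on a small data set. The toy rows and the attribute names (`outlook`, `windy`, `play`) are hypothetical and are not taken from the paper; no privacy protection is involved here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, class_attribute):
    """Gain(S, A) = Entropy(S) - sum over values a of |S_a|/|S| * Entropy(S_a)."""
    labels = [row[class_attribute] for row in rows]
    gain = entropy(labels)
    for value in {row[attribute] for row in rows}:
        subset = [row[class_attribute] for row in rows if row[attribute] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

# Hypothetical toy data set (assumption, for illustration only).
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "yes"},
]
gain_outlook = information_gain(rows, "outlook", "play")
gain_windy = information_gain(rows, "windy", "play")
# ID3 picks the attribute with the largest gain as the next split.
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
```

In this toy example `outlook` separates the classes perfectly, so it is the attribute ID3 would choose at the root.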

Paillier's Homomorphic Encryption Scheme.
Homomorphic encryption is a special type of encryption in which the result of applying an algebraic operation to plaintexts can be obtained by applying another algebraic operation (which may be different or the same) to the corresponding ciphertexts. Thus, even when the user does not know the plaintexts, he/she can still obtain the result of applying that algebraic operation to them.
The Paillier encryption scheme [32] is described as follows:
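As a minimal sketch of the textbook Paillier construction (the common variant with generator $g = n + 1$), the following Python code shows key generation, encryption, decryption, and the additive homomorphism. The primes 1009 and 1013 are illustrative assumptions and far too small to be secure; the function names are hypothetical and not the paper's.

```python
import math
import random

def keygen(p, q):
    """Toy key generation from two distinct primes, using g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lambda mod n (Python 3.8+)
    return n, (lam, mu)

def encrypt(n, m):
    """c = (1 + n)^m * r^n mod n^2, with r random and coprime to n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    # (1 + n)^m equals 1 + m*n modulo n^2 by the binomial theorem.
    return ((1 + m * n) % n2) * pow(r, n, n2) % n2

def decrypt(n, sk, c):
    """m = L(c^lambda mod n^2) * mu mod n, where L(x) = (x - 1) / n."""
    lam, mu = sk
    x = pow(c, lam, n * n)
    return ((x - 1) // n) * mu % n

n, sk = keygen(1009, 1013)        # illustrative small primes, NOT secure
c1, c2 = encrypt(n, 17), encrypt(n, 25)
sum_ct = c1 * c2 % (n * n)        # multiplying ciphertexts adds plaintexts
scaled_ct = pow(c1, 3, n * n)     # exponentiation multiplies by a constant
```

Decrypting `sum_ct` yields 17 + 25 and decrypting `scaled_ct` yields 3 · 17, which is the additive property the protocols in this paper rely on.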

Li's Symmetric Homomorphic Encryption Scheme.
The description of symmetric homomorphic encryption scheme proposed by Li et al. [33] is as follows.
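(i) KeyGen($\lambda$): generates the key for users as $sk = (s, q)$, together with a public parameter $p$. Here $p$ and $q$ are primes with $p \gg q$, and $s$ is chosen from $\mathbb{Z}_p^*$.
(ii) Enc$_{sk}$($m$, $d$): encrypts a message $m$, where $d$ is a small positive integer, which is denoted as the ciphertext degree in this paper.
(iii) Dec$_{sk}$($c$, $d$): decrypts a ciphertext $c$ of degree $d$.
The scheme supports homomorphic addition and multiplication over ciphertexts. Readers may refer to [18] for details on the scheme.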
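To make the roles of the primes, the secret value, and the ciphertext degree concrete, the sketch below implements one symmetric construction that is consistent with the parameters listed above. The concrete formulas, Enc(m, d) = s^d (r q + m) mod p and Dec(c, d) = (c s^{-d} mod p) mod q, are an assumption made for illustration and should be checked against the cited reference; the parameter sizes and all names are hypothetical.

```python
import secrets

P = 2**127 - 1   # public prime modulus p (a Mersenne prime), illustrative size
Q = 2**31 - 1    # secret prime q with p >> q, illustrative size

def keygen():
    """Return the secret key (s, q) with s drawn from Z_p^*."""
    s = secrets.randbelow(P - 2) + 2   # s in [2, p - 1]
    return (s, Q)

def enc(sk, m, d=1):
    """Assumed form: Enc(m, d) = s^d * (r*q + m) mod p, with random r."""
    s, q = sk
    r = secrets.randbits(30)           # small enough that products stay below p
    return pow(s, d, P) * (r * q + m) % P

def dec(sk, c, d=1):
    """Assumed form: Dec(c, d) = (c * s^(-d) mod p) mod q."""
    s, q = sk
    s_inv_d = pow(pow(s, d, P), -1, P)
    return (c * s_inv_d % P) % q

sk = keygen()
ca, cb = enc(sk, 17), enc(sk, 25)
added = (ca + cb) % P        # same degree: the sum stays at degree 1
multiplied = ca * cb % P     # degrees add: the product has degree 2
```

Under these assumptions, `dec(sk, added, 1)` recovers 17 + 25 and `dec(sk, multiplied, 2)` recovers 17 · 25, which shows why the ciphertext degree must be tracked alongside each ciphertext.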

Garbling Scheme.
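A garbling scheme [34] consists of four algorithms, denoted by $\mathcal{G} = (\mathrm{Gb}, \mathrm{En}, \mathrm{De}, \mathrm{Ev})$. A function $f$ can be transformed by Gb into a triple $(F, e, d)$. Note that $F$ is the garbled circuit. The encoding and decoding information are denoted by $e$ and $d$, respectively. The garbled input $X = \mathrm{En}(e, x)$ can be evaluated to obtain the garbled output $Y = \mathrm{Ev}(F, X)$, which is decoded as $y = \mathrm{De}(d, Y)$.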
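The four algorithms can be illustrated with a toy garbling of a single AND gate. This is a sketch under stated assumptions, not the construction used in the paper: SHA-256 serves as the key-derivation function, a 16-byte zero tag marks the correct table row, and the function names `gb`, `en`, `ev`, and `de` simply mirror Gb, En, Ev, and De.

```python
import hashlib
import secrets

TAG = bytes(16)  # zero tag used to recognise the correctly decrypted row

def _pad(ka, kb):
    """Derive a 32-byte one-time pad from two wire labels."""
    return hashlib.sha256(ka + kb).digest()

def _xor(x, y):
    return bytes(i ^ j for i, j in zip(x, y))

def gb():
    """Garble one AND gate; returns (F, e, d)."""
    e = {w: (secrets.token_bytes(16), secrets.token_bytes(16)) for w in ("a", "b")}
    out = (secrets.token_bytes(16), secrets.token_bytes(16))
    # One encrypted row per input combination, hiding the output label.
    F = [_xor(_pad(e["a"][va], e["b"][vb]), out[va & vb] + TAG)
         for va in (0, 1) for vb in (0, 1)]
    secrets.SystemRandom().shuffle(F)  # hide which row belongs to which inputs
    d = {out[0]: 0, out[1]: 1}
    return F, e, d

def en(e, a, b):
    """Encode plaintext input bits as wire labels."""
    return e["a"][a], e["b"][b]

def ev(F, X):
    """Evaluate: only the row matching the held labels decrypts to a valid tag."""
    pad = _pad(*X)
    for row in F:
        plain = _xor(pad, row)
        if plain[16:] == TAG:
            return plain[:16]
    return None

def de(d, Y):
    """Decode an output label; unknown labels decode to None (i.e., ⊥)."""
    return d.get(Y)

F, e, d = gb()
result = de(d, ev(F, en(e, 1, 1)))  # AND of 1 and 1
```

The evaluator holding only `F` and the two input labels learns one output label and nothing about the other inputs; a forged output label is rejected by `de`, which is the authenticity property used later in the security analysis.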

Noninteractive Commitment.
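A noninteractive commitment scheme [35] is also required in our paper, denoted by (Com, Chk). For a message $x$ and randomness $r$, Com($x$; $r$) outputs a commitment to $x$. The distribution of Com($x$; $r$) over the choice of $r$ is determined by the value of $x$ and is written as Com($x$).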
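A minimal sketch of a commit-and-check pair is given below. It assumes a simple hash-based commitment built on SHA-256, which is the hash function the experiments in this paper report using for the commitment protocol; the exact construction and the names `com` and `chk` are assumptions for illustration.

```python
import hashlib
import hmac
import secrets

def com(x, r=None):
    """Commit to the byte string x; returns (commitment, opening randomness)."""
    if r is None:
        r = secrets.token_bytes(32)      # fresh randomness hides x
    return hashlib.sha256(r + x).digest(), r

def chk(c, x, r):
    """Check that (x, r) is a valid opening of the commitment c."""
    return hmac.compare_digest(c, hashlib.sha256(r + x).digest())

c, r = com(b"wire-label-0")
opens_correctly = chk(c, b"wire-label-0", r)
opens_to_other = chk(c, b"wire-label-1", r)
```

The committer sends `c` first and reveals `(x, r)` later; a receiver accepts only if `chk` succeeds, so the committed value cannot be changed after the fact.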

Basic Cryptographic Subprotocols.
In this section, we present a set of cryptographic subprotocols that will be used as subroutines when constructing the proposed protocol.

Outsourcing Secure Comparison Protocol (OSCP).
The value of  is kept secure from the cloud and users.The value of  = ( mod ) is computed. is kept by the data owner (Algorithm 1).

Secure Equivalent Testing Protocol (SET).
With two ciphertexts $c_1 = \mathrm{Enc}_{sk}(m_1)$ and $c_2 = \mathrm{Enc}_{sk}(m_2)$, SET computes $f$ and decides whether the plaintexts are identical ($m_1 = m_2$) (Algorithm 2).

Secure Multiplication Protocol (SMP).
The algorithm is described as in Algorithm 3.

Secure Minimum out of 2 Numbers Protocol (SMIN2).
The algorithm is described as in Algorithm 4.

Secure Circuit Protocol (SCP).
We denote the three parties of the protocol by CSS$_1$, CSS$_2$, and CCS and their respective inputs by $x_1$, $x_2$, and $x_3^*$. Their goal is to securely compute the function $y$ [34] (Algorithm 5). For simplicity, we assume that $|x_i| = |y| = m$. All communication between parties is via private point-to-point channels. Next, we assume that CSS$_1$ and CSS$_2$ learn the same output $y$, while CCS gets only the garbled values for the portion of the output wires corresponding to its own outputs. CCS cannot recover the output $y$ from these garbled values. This protocol uses a garbling scheme, a four-tuple of algorithms $\mathcal{G} = (\mathrm{Gb}, \mathrm{En}, \mathrm{De}, \mathrm{Ev})$, as the underlying primitive. Gb is a randomized garbling algorithm that transforms a function into a triple. En and De are encoding and decoding algorithms, respectively. Ev is an algorithm that produces a garbled output.

Two ciphertexts are computed by the cloud as $c_1 = \mathrm{Enc}_{sk}(m_1)$ and $c_2 = \mathrm{Enc}_{sk}(m_2)$.
(1) The cloud computes $c_{00} = \mathrm{Enc}_{sk}(m_{00}) = c_1 - c_2$ and $c_{01} = \mathrm{Enc}_{sk}(m_{01}) = c_2 - c_1$.
(2) Check whether $m_{00} \ge 0$ or $m_{01} \ge 0$ and compute: if $m_{00} \ge 0 \wedge m_{01} < 0$, then $f_0 = 1$, $f_1 = 0$; else if $m_{00} < 0 \wedge m_{01} \ge 0$, then $f_0 = 0$, $f_1 = 1$.

The value of 𝑓 is computed as follows
Algorithm 2: Secure equivalent testing protocol (SET).
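The comparison logic of the equivalence test can be sketched at the plaintext level as follows. In the protocol the two differences are formed over ciphertexts and the sign checks are performed without revealing the plaintexts to the cloud; here everything is in the clear. The flag names `f0` and `f1` and the flag values returned in the equal case are assumptions; only the fact that two non-negative differences imply equality is certain.

```python
def set_flags(m1, m2):
    """Return the two sign flags derived from m1 - m2 and m2 - m1."""
    m00 = m1 - m2
    m01 = m2 - m1
    if m00 >= 0 and m01 < 0:
        return 1, 0          # m1 > m2
    if m00 < 0 and m01 >= 0:
        return 0, 1          # m1 < m2
    return 1, 1              # both differences are zero (assumed encoding)

def plaintexts_equal(m1, m2):
    """m1 == m2 exactly when both differences are non-negative."""
    f0, f1 = set_flags(m1, m2)
    return f0 == 1 and f1 == 1

# Counting how many encrypted attribute values match a target value would
# amount to summing this test over the records (shown here on plaintexts).
matches = sum(plaintexts_equal(v, 3) for v in [3, 1, 3, 2, 3])
```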

Outsourcing Privacy-Preserving ID3 Decision Tree Algorithm in Malicious Model
In this section, we present our secure outsourcing ID3 decision tree in cloud computing using the homomorphic encryption scheme and subprotocols proposed in Section 2 as building blocks.

Main Concept.
The aim is to privately compute ID3 over encrypted databases, and the key is to privately find the attribute $A$ for which Gain is maximum. From the above description, the key value that needs to be calculated jointly with the other party is Entropy($S_{a_i}$). Since all the data are encrypted and sent to the cloud, the cloud server can count $|S(c_k)|_i$ and $|S|_i$ using the SET protocol described in Section 2. Now, (3) can be evaluated as $\frac{x_1 + x_2}{a_1 + a_2} \log_2 \frac{x_1 + x_2}{a_1 + a_2}$, and the logarithmic operation can be performed in CSS. The value to be calculated is $c_1 = (x_1 + x_2)/(a_1 + a_2)$, which can be determined using our SCP protocol as explained in Section 2. Then, all the parties can calculate the value of Entropy($S$) independently.
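As a worked instance of this term, suppose (purely as an illustrative assumption) that the two parties hold local counts $x_1 = 3$, $x_2 = 1$ and local totals $a_1 = 5$, $a_2 = 3$. The weighted average and the resulting entropy term are then:

```latex
\[
c_1 = \frac{x_1 + x_2}{a_1 + a_2} = \frac{3 + 1}{5 + 3} = \frac{1}{2},
\qquad
c_1 \log_2 c_1 = \frac{1}{2}\,\log_2 \frac{1}{2} = -\frac{1}{2}.
\]
```

Only the pooled ratio $c_1$ is needed to evaluate the term, which is why the protocol targets the weighted average rather than the individual counts.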

System Model.
The system model is shown in Figure 1. It includes two data owners and the cloud servers (cloud storage servers {CSS$_1$, CSS$_2$} and cloud computing server CCS). Each data owner owns a private data set that is encrypted and outsourced to the cloud storage servers. Data owners can request the cloud servers to run ID3 data mining on the encrypted data. The CSS and CCS servers jointly execute the steps of the outsourced privacy-preserving ID3 data mining algorithm; after the algorithm terminates, the final results are sent to the data owners. We assume that the data owners and the CSS servers are semihonest participants and that CCS is a malicious participant.

Details of the Proposed Algorithm.
Our securely outsourcing ID3 decision tree (SOID3) algorithm is detailed as follows:
(1) P$_1$ and P$_2$ run KeyGen() to generate the secret keys SK$_i$, $i = 1, 2$, and a public parameter $p$ of Li's homomorphic encryption scheme. Each party shares $p$ with the other party and the cloud but keeps SK$_i$ private.
(2) Each party uses its key SK i to encrypt every attribute value of its database, and then outsources the encrypted database to the CSS (CSS 1 and CSS 2 ).
(3) The CSS 1 and CSS 2 use the SET protocol to calculate the value of |S a j | i and |S a j (c k )| i for each attribute with each party P i .
(4) Each party generates its Paillier public and private keys (pk i , sk i ), i = 1, 2, and sends the public keys to the CSS 1 and CSS 2 .
(6) Each party decrypts the received information, applies the logarithmic operation to obtain $\frac{x_1 + x_2}{a_1 + a_2} \log_2 \frac{x_1 + x_2}{a_1 + a_2}$, and then encrypts the result with its public key. Then, it sends the ciphertext back to the cloud.
(7) After getting the result, CSS$_1$ and CSS$_2$ use the SMIN2 protocol to select the ciphertext with the minimum value, then select the attribute label with the maximum information gain and return it to each participant.

(8) The participants divide the data sets and build tree nodes.Then, go to Step (3) until termination.
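For orientation, the plaintext computation that these steps emulate over encrypted data can be sketched as an ordinary ID3 recursion. The sketch below is not the secure protocol: it works on unencrypted rows, and the data set and attribute names are hypothetical. Selecting the attribute with the minimum weighted entropy is equivalent to selecting the one with the maximum information gain, which matches the use of a minimum-selection step followed by choosing the maximum-gain attribute.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def weighted_entropy(rows, attribute, target):
    """Sum over attribute values a of |S_a|/|S| * Entropy(S_a)."""
    result = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        result += (len(subset) / len(rows)) * entropy(subset)
    return result

def id3(rows, attributes, target):
    """Build a decision tree as nested dicts; leaves are class labels."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:
        return labels[0]                              # pure node
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # majority class
    # Minimum weighted entropy corresponds to maximum information gain.
    best = min(attributes, key=lambda a: weighted_entropy(rows, a, target))
    rest = [a for a in attributes if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], rest, target)
                   for value in sorted({r[best] for r in rows})}}

# Hypothetical toy data set (assumption, for illustration only).
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "yes"},
]
tree = id3(rows, ["outlook", "windy"], "play")
```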

Security Analysis
In this section, we prove that the secure outsourcing ID3 decision tree (SOID3) algorithm can offer protection against the malicious cloud server.
Theorem 1. The SOID3 algorithm can achieve privacy for each party and the semihonest cloud storage server.
Proof. We mainly consider security in the noncolluding semihonest model with a semihonest cloud server. Suppose there are two parties, P$_1$ and P$_2$, and a cloud storage server CSS.
Let $\mathcal{P}$ = (P$_1$, P$_2$, CSS) be the set of all protocol participants. Consider three types of attackers ($A_{P_1}$, $A_{P_2}$, and $A_{CSS}$) that can corrupt P$_1$, P$_2$, and CSS, respectively. In the real model, P$_1$ and P$_2$ hold data sets $D_x$ and $D_y$, respectively, and CSS holds the encrypted data sets Enc($D_x$) and Enc($D_y$). Let $H \subset \mathcal{P}$ be the set of honest participants. For every P$_i \in H$, out$_{P_i}$ denotes the output of P$_i$. If P$_i$ is corrupted, out$_{P_i}$ represents the entire view of participant P$_i$ in running protocol $\Pi$.
For each P$^* \in \mathcal{P}$, we consider the view of the attacker $A = (A_{P_1}, A_{P_2}, A_{CSS})$ in the real execution of protocol $\Pi$. In the ideal model, there exists an ideal functionality $F$ for function $f$, and all participants can interact with $F$. That is, challenger DP$_a$ and participant P$_i$ can send data $D_x$ and $D_y$ to $F$. If $D_x$ or $D_y$ is $\perp$, $F$ returns $\perp$. Finally, $F$ returns $f(D_x, D_y)$ to challenger DP$_a$. As mentioned earlier, $H \subset \mathcal{P}$ is the set of honest participants. For each participant P$_i \in H$, out$_{P_i}$ is the output that $F$ returns to P$_i$. If P$_i$ is corrupted by a semihonest attacker, out$_{P_i}$ is still consistent with the output of P$_i$ in the real model.
For all P$^* \in \mathcal{P}$, the view of P$^*$ in the ideal model is defined in the presence of independent simulators Sim = (Sim$_{P_1}$, Sim$_{P_2}$, Sim$_{CSS}$). The protocol $\Pi$ is therefore considered secure in the presence of noncolluding semihonest attackers if these two views cannot be distinguished, as stated in the following definition.
Definition 2. Let $f$ be a deterministic functionality among the parties in $\mathcal{P}$. Let $H \subset \mathcal{P}$ be the subset of honest parties in $\mathcal{P}$. We say that $\Pi$ securely realizes $f$ if there exists a set Sim = {Sim$_{P_1}$, Sim$_{P_2}$, Sim$_{CSS}$} of PPT transformations (where Sim$_{P_1}$ = Sim$_{P_1}$($A_{P_1}$) and so on) such that for all semihonest PPT adversaries $A$ = {$A_{P_1}$, $A_{P_2}$, $A_{CSS}$}, for all inputs $D_x$, $D_y$ and auxiliary inputs $z$, and for all parties P $\in \mathcal{P}$, the view of P in the real execution of $\Pi$ is $\equiv$ to its view in the ideal execution with $F$ and Sim, where $\equiv$ denotes computational indistinguishability.
Theorem 2. The SOID3 algorithm is secure with the semihonest cloud storage server and the malicious cloud computing server.
Proof. First consider the case where CSS$_1$ or CSS$_2$ is corrupted. It is necessary to prove that, in the SCP protocol, the ideal model and the real model are indistinguishable. That is, in the following interactions, it is impossible to distinguish the messages exchanged and the outputs of the participants in the ideal model from those in the real model.
(1) In the real model, assume that there is a simulator that can simulate the behavior of a semihonest participant CSS$_1$ (or CSS$_2$) and that receives the inputs $(x_1, a_1)$ and $(x_2, a_2)$ from the execution environment of the protocol. At the same time, the simulator can simulate the functionality $F_f$: it sends all inputs $(x_1, a_1)$ and $(x_2, a_2)$ to the simulated $F_f$. Since the simulator does not use anything computed by $F_f$, there is no difference between the real $F_f$ and the simulated $F_f$ from the point of view of the execution environment.
(2) In Step 2, CSS$_1$ and CSS$_2$ uniformly select the seed $r$ of the pseudorandom function (PRF), so the security of the PRF shows that the real model in Step 2 is indistinguishable from the ideal model.
(3) In Step 3, we modify the simulator so that it knows in advance which commitments will be opened when it generates the commitments $C$. First, the simulator selects random values $o_1$, $o_2$ that mark which commitments are to be opened and calculates the corresponding values of $b$. At this point, the simulator has obtained the values of $x_1$, $x_2$ and $a_1$, $a_2$. Then, the simulator can submit the commitments that are marked as not to be opened. In this process, due to the hiding property of the commitment scheme, the real model and the ideal model are again indistinguishable.
(4) In Step 6a, the simulated CSS$_1$ and CSS$_2$ stop executing when $\mathrm{De}(d, \tilde{Y}) = \perp$. We change the simulator to stop when $\tilde{Y} \neq \mathrm{Ev}(F, X)$. By the authenticity of the garbling scheme, CCS has only negligible probability of producing $\tilde{Y} \neq \mathrm{Ev}(F, X)$ with $\mathrm{De}(d, \tilde{Y}) \neq \perp$. Therefore, in this step, the real model and the ideal model are again indistinguishable.
(5) In Step 6b, the correctness of the garbling scheme guarantees that the simulated CSS$_1$ and CSS$_2$ both produce the correct output. Therefore, if there is no abort in Step 6a, we can modify the simulator to generate a simulated garbled circuit $(F, X, d)$. We can simulate the output of CSS$_1$ and CSS$_2$ by simulating the instructions of $F_f$. By the security of the garbling scheme, the real model is also indistinguishable from the ideal model in this step.
Therefore, in this protocol, the execution environment cannot distinguish between the real model and the ideal model, and the protocol is secure when CCS is a malicious participant.

Performance Analysis
In this paper, we consider that CSS has a strong calculation ability and we ignore its computation time.Each data owner does not need to store the ciphertext but can just use the public key to encrypt the message and the private key to decrypt the ciphertext.
In each iteration, first, each data owner will execute the SBD protocol and SMIN2 protocol with the cloud.There are two interactions in the SBD protocol and 2k interactions in the SMIN2 protocol.Then, CSS 1 , CSS 2 , and CCS will execute 6 interactions in the SCP protocol.Finally, each data owner will execute 1 interaction when it goes to the new iteration.We assume that t is the iteration time, so the communication traffic is at most O(k * t).
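The interaction count in this paragraph can be tallied with a few lines of Python. This is only a bookkeeping sketch of the figures stated above (2 for SBD, 2k for SMIN2, 6 for SCP, and 1 to start the next iteration); the function names are hypothetical, and `k` and `t` are used as in the text.

```python
def interactions_per_iteration(k):
    """Interactions in one iteration, using the counts stated in the text."""
    sbd = 2              # SBD protocol
    smin2 = 2 * k        # SMIN2 protocol
    scp = 6              # SCP protocol among CSS_1, CSS_2, and CCS
    next_iteration = 1   # moving on to the next iteration
    return sbd + smin2 + scp + next_iteration

def total_interactions(k, t):
    """Total over t iterations; grows as O(k * t)."""
    return t * interactions_per_iteration(k)

example_total = total_interactions(k=8, t=4)
```

Since the per-iteration count is 2k + 9, the total is linear in both k and t, which agrees with the O(k * t) bound.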
In this paper, a secure average computing protocol based on SCP is implemented. The server used in the experimental environment has two Intel(R) Xeon(R) E5-2620 v3 CPUs @ 2.40 GHz and 32 GB of memory and runs Ubuntu 16.04.4 LTS. In the experiment, AES-128 is chosen as the basic encryption method of the garbled circuit, the open source code of JustGarble is modified accordingly, and the commitment protocol is implemented based on SHA-256. The average values obtained from the experiments are reported in Table 1.
In our secure outsourcing ID3 decision tree (SOID3) algorithm experiment, two participants were tested with different numbers of records.The experimental results are shown in Figure 2.
As Figure 2 shows, since the client is only responsible for encrypting the uploaded data, its time consumption is very low. In the cloud, the CCS and CSS servers need to run the SCP protocol, which results in considerable time consumption (Table 1). The main reason is that a large number of bit commitments must be processed in the garbled circuit; performance improvement in follow-up work will focus on this issue.

Conclusion
In this paper, we proposed a secure outsourcing ID3 decision tree algorithm for two parties in the malicious model. Our algorithm can preserve the privacy of the users' data as well as that of the data mining scheme for the cloud servers. The parties can get only the result trees and have no knowledge about the data mining scheme. Moreover, the cloud servers cannot get any private information about the parties. In summary, our protocol offers protection against malicious cloud servers.
In the future, we intend to extend our algorithm to vertical and arbitrary partitioning in the malicious model. In addition, we plan to extend our algorithm to a general multiparty privacy-preserving framework suitable for other useful schemes, such as random decision tree, Bayes, SVM, and other data mining methods, which can also be extended for use in wireless sensor networks [36,37].

Figure 2: Performance measurements for our SOID3 with 2 participants.
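Without considering privacy, each party $P_i$ shares its own $|S_{a_j}(c_k)|_i$, $|S_{a_j}|_i$, and $|S(c_k)|_i$ with all other parties. As a result, any party can calculate Entropy($S$) and Entropy($S_{a_j}$). $S(c_k)$ is the subset of $S$ with tuples that have value $c_k$ for class attribute $C$. $|S(c_k)|_i$ equals the number of transactions with class attribute $C$ set to $c_k$ in database $D_i$. Then the value of Entropy($S_{a_j}$) can be calculated as
$$\mathrm{Entropy}(S_{a_j}) = -\sum_{k} \frac{\sum_i |S_{a_j}(c_k)|_i}{\sum_i |S_{a_j}|_i} \log_2 \frac{\sum_i |S_{a_j}(c_k)|_i}{\sum_i |S_{a_j}|_i}, \quad (3)$$
where $S_{a_j}(c_k)$ is the subset of $S$ with tuples that have values $a_j$ for attribute $A_j$ and $c_k$ for class attribute $C$. Therefore, (3) can be easily computed by party $P_i$ once the parties $P_l$ ($l \neq i$) send it all of the values $|S_{a_j}|_l$ and $|S_{a_j}(c_k)|_l$ from their databases. Party $P_i$ then sums these together with the values $|S_{a_j}|_i$ and $|S_{a_j}(c_k)|_i$ from its own database and completes the computation.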
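The non-private pooling described here can be sketched in a few lines: each party contributes its local class counts for one attribute value, the counts are summed, and the entropy is computed from the pooled totals. The per-party counts and class names below are hypothetical, and the function name `pooled_entropy` is an assumption for illustration.

```python
from math import log2

def pooled_entropy(counts_by_party):
    """Entropy of S_{a_j} from per-party class counts.

    counts_by_party is a list with one dict per party, mapping a class
    value c_k to that party's local count |S_{a_j}(c_k)|_i.
    """
    classes = set().union(*counts_by_party)
    pooled = {c: sum(party.get(c, 0) for party in counts_by_party) for c in classes}
    total = sum(pooled.values())          # equals the sum of |S_{a_j}|_i
    return -sum((n / total) * log2(n / total) for n in pooled.values() if n)

# Hypothetical local counts for two horizontally partitioned databases.
party_1 = {"yes": 3, "no": 1}
party_2 = {"yes": 1, "no": 3}
pooled = pooled_entropy([party_1, party_2])
local_only = pooled_entropy([party_1])
```

The pooled value differs from what either party would compute on its own data, which is why the counts must be combined before the logarithm is applied.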

Table 1: Time cost of the SCP protocol.