Balancing Access Control and Privacy for Data Deduplication via Functional Encryption

Data deduplication is an effective way to reduce storage occupation and bandwidth consumption in the cloud. Because data are outsourced, users' privacy and accessibility are the foremost security concerns of any deduplication mechanism. However, redundancy removal and the indistinguishability of deduplication labels are naturally incompatible, which brings about many threats to data security. Besides, access control over shared copies may infringe on users' attributes and incur cumbersome query overheads. To balance the usability and the confidentiality of deduplication labels and to securely realize an elaborate access structure, a novel data deduplication scheme is proposed in this paper. Briefly, we draw on learning with errors (LWE) to ensure that the deduplication labels are distinguishable only during the duplication check process. Instead of authority matching, the proof of ownership (PoW) is implemented under the paradigm of the inner product. Since the deduplication label is lightweight and the inner product is easy to compute, our scheme is more efficient in terms of computation and storage. Security analysis also indicates that the deduplication labels are distinguishable only for duplication check and that the probability of falsifying a valid ownership is negligible.


Introduction
As a flourishing service mode, cloud computing adopts load balancing, distributed computing, and other technologies to conveniently provide computation and storage functions for remote users, thus saving local resources and promoting work efficiency. However, if users outsource their data to the cloud without restraint, a serious problem may occur due to massive duplicated data. As reported in [1], almost half of the cloud storage is wasted because of data redundancy. Consequently, the budget for managing duplicate data rises to as much as eight times that of source data maintenance [2,3]. With the explosive growth of data nowadays, the tremendous storage requirements and the exorbitant administrative expenses have put enormous pressure on cloud service providers. Therefore, how to store and manage data economically and efficiently has become a serious challenge for these enterprises.
To cut down the costs caused by redundant data, deduplication technology has been widely used by cloud service providers [4]. In such a technology, duplication check and proof of ownership are two key problems. Till now, the problem of how to balance the conflict between comparability and confidentiality for secure duplication check remains unsolved [5]. Meanwhile, the problems of how to efficiently validate the access authority and how to achieve complex access structures are also urgent to address, considering that the mechanism of query matching is cumbersome and the downloading certificates may be abused to launch various attacks.
Data deduplication is a research hotspot, and much attention has been paid to its efficiency and security. In the published literature, Li et al. [6] suggested carrying out deduplication by directly comparing the fingerprint of the outsourced file with those of the uploaded ones. However, this method is deficient since the communication and comparison of those fingerprints are inefficient and the contents of data are exposed. To reduce the traffic of deduplication labels and conceal the data, Puzio et al. [7] used a hash function to code the same plaintexts into identical values, which serve as the labels for duplication check. Although this method achieved the goals of transmission efficiency and storage saving, it is vulnerable to dictionary attacks since the hash values are overt.
In order to ensure the confidentiality of deduplication labels, Chen et al. [8] utilized message-locked encryption (MLE) to encrypt those hash values of data. However, the traditional MLE scheme is not semantically secure and is vulnerable to quantum attacks [9].
Fortunately, cryptographers have in recent years devoted themselves to designing secure, efficient, and effective cryptosystems that resist quantum attacks. In 2005, Regev et al. [10] proposed a novel paradigm as an underpinning of cryptography, namely, learning with errors. They proved that the difficulty of solving it is equivalent to the hardness of the shortest vector problem (SVP) over lattices, and thus, it can resist attacks based on quantum computing. Besides, it supports homomorphic and linear computation. Therefore, we exploit it in our scheme to ensure the functionality, efficiency, and security of deduplication labels.
As for the proof of ownership, the best solutions to date are all based on the Merkle hash tree (MHT) [11,12]. In detail, the cloud and the user independently hold an MHT computed from the outsourced data. Thus, the user can upload the same MHT to the cloud for comparison. The disadvantages of such a scheme are not only high storage and communication overheads but also low computation efficiency. Therefore, Chen et al. [13] improved it by asking the cloud to randomly select some leaf nodes of the MHT to challenge the user. The user must trace the path from the root to these leaves as a reply to prove that he possesses the same tree. Although this method does not require the transmission of the whole MHT for comparison, it demands that the user and the cloud construct and store a complete MHT for each file. Moreover, the challenge-response mode implies a long delay.
In order to improve the performance of PoW, the advantages of the inner product predicate have gradually attracted researchers' attention [14-16]. Roughly speaking, the user is granted permission to access the corresponding file only if the inner product evaluates to 0. The most significant merit of this method is that it uses computation instead of comparison to perform ownership proof efficiently. Therefore, we adopt it in our scheme to balance the conflict between the variety of access structures and the security of users' privacy.
Aiming at checking replication over semantically secure deduplication labels and achieving fine-grained access control, this paper proposes a novel cloud data deduplication scheme that exploits LWE (learning with errors) together with the inner product predicate. Our contributions are summarized as follows:
(i) Though designed for the purpose of deduplication, the deduplication labels are indistinguishable to any process except duplication check. This property is achieved by virtue of semantically secure and homomorphic LWE, which is also resistant to quantum attacks.
(ii) The proof of ownership is carried out by an inner product, which is computationally efficient. Besides, we tie the accessibility of users to their attributes, which enables an elaborate access structure and ownership transfer.
(iii) For each file, only one lightweight downloading certificate needs to be stored by the cloud, while the clients only need to compute and upload the corresponding proof on demand. That is to say, both storage and bandwidth are economical for cross-user access.
The rest of this paper is organized as follows. In Section 2, some formal definitions related to LWE and the inner product predicate are given. Section 3 depicts our deduplication scheme, including the detailed procedures for duplication check and ownership proof. The correctness of our scheme is formally validated in Section 4, followed by security and performance analysis in Sections 5 and 6. Section 7 concludes the paper.

Preliminaries
For better understanding of our scheme, the concepts related to learning with errors and inner product predicate [2,17] will be introduced in advance.
Definition 1 (Integer lattice). An integer lattice $\Lambda$ is the set of integer linear combinations of vectors $a_1, a_2, \ldots, a_k$ over $\mathbb{Z}^m$, expressed as

$$\Lambda = \left\{ \sum_{i=1}^{k} x_i a_i \;\middle|\; x_i \in \mathbb{Z} \right\}.$$

Definition 2 (LWE hardness assumption). On parameters $n, m, q, \alpha$ and a discrete Gaussian distribution $\chi$ over $\mathbb{Z}_q$, we select a noise $e$ from $\chi^m$ and uniformly sample a vector $s \in \mathbb{Z}_q^n$ together with a matrix $P \in \mathbb{Z}_q^{n \times m}$. Based on the value of

$$b = [sP + e]_q, \quad (3)$$

two versions of LWE hardness can be defined as follows:
(a) LWE-Search hardness: Given multiple pairs $(P, b)$ on constant $P$ and $s$, searching for the value of $s$ is difficult.
(b) LWE-Determination hardness: For uniformly sampled $b' \in \mathbb{Z}_q^m$, the tuples $(P, b)$ and $(P, b')$ are computationally indistinguishable. It means that it is difficult to tell whether the second term of those tuples is randomly chosen or computed from formula (3).
In fact, the LWE-search hardness is equivalent to the problem of finding a short enough vector in a lattice (GapSVP), and the LWE-determination hardness can be reduced to the shortest independent vectors problem (SIVP) of a lattice in the worst case. Therefore, the LWE assumption can be used to guarantee the one-way property for encryption with semantic security.
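To make formula (3) concrete, the following minimal Python sketch generates one LWE sample $b = [sP + e]_q$. It is an illustration only: the toy parameters, the function name `lwe_sample`, and the use of bounded uniform noise in place of the discrete Gaussian $\chi$ are all assumptions and are far too small to be secure.

```python
import random

def lwe_sample(s, P, q, noise_bound, rng):
    """Return (b, e) with b = [sP + e]_q for an n-vector s and an n x m matrix P."""
    n, m = len(P), len(P[0])
    # Assumption: small uniform noise stands in for the discrete Gaussian chi.
    e = [rng.randint(-noise_bound, noise_bound) for _ in range(m)]
    b = [(sum(s[i] * P[i][j] for i in range(n)) + e[j]) % q for j in range(m)]
    return b, e

rng = random.Random(0)
n, m, q = 4, 8, 97                                   # toy parameters, not secure
s = [rng.randrange(q) for _ in range(n)]             # secret vector in Z_q^n
P = [[rng.randrange(q) for _ in range(m)] for _ in range(n)]  # public matrix
b, e = lwe_sample(s, P, q, noise_bound=2, rng=rng)
```

Without knowledge of `s`, the vector `b` is meant to look like a uniform element of $\mathbb{Z}_q^m$; with `s`, subtracting `sP` leaves only the small noise.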
Definition 3 (Inner product predicate). The inner product predicate $P_{n,q}$ is defined on the Cartesian product $K \times I$ such that

$$P_{n,q}(\vec{v}, {}^t\vec{w}) = \begin{cases} 1, & \langle \vec{v}, {}^t\vec{w} \rangle = 0 \bmod q, \\ 0, & \text{otherwise.} \end{cases}$$

From the perspective of functional encryption (FE), $I$ can be deemed the space of ciphertexts, and $K$ is composed of secret keys. Once a correct key $\vec{v}$ is known, we are able to learn the output of the function $P_{n,q}(\vec{v}, {}^t\vec{w})$. To construct an attribute-based access control policy, the access structure is coded as a vector $\vec{w}$; thus, the access authority can be verified by checking its consistency with the authorization certificate $\vec{v}$.
To avoid confusion, the symbols used in this paper are listed in advance in Table 1.
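The predicate can be sketched in a few lines of Python. The helper `certificate_for` is hypothetical: it shows one simple way, assumed here for illustration, to sample a vector $\vec{w}$ that is orthogonal to a given key $\vec{v}$ modulo a prime $q$ by choosing the first $n-1$ entries at random and solving for the last one.

```python
import random

def inner_product_predicate(v, w, q):
    """Return 1 when <v, w> = 0 (mod q), and 0 otherwise."""
    return 1 if sum(a * b for a, b in zip(v, w)) % q == 0 else 0

def certificate_for(v, q, rng):
    """Sample w with <v, w> = 0 (mod q). Assumes q is prime and v[-1] != 0 (mod q)."""
    w = [rng.randrange(q) for _ in range(len(v) - 1)]
    partial = sum(a * b for a, b in zip(v, w)) % q   # zip stops at the shorter list
    # Solve partial + v[-1] * w_last = 0 (mod q) using the modular inverse of v[-1].
    w.append((-partial * pow(v[-1], -1, q)) % q)
    return w

q = 101                       # toy prime modulus
v = [3, 14, 15, 92]           # illustrative key vector
w = certificate_for(v, q, random.Random(7))
accepted = inner_product_predicate(v, w, q)
```

The verifier only evaluates one inner product, which is the computational advantage over comparison-based matching that the text refers to. The three-argument `pow` with exponent `-1` requires Python 3.8 or later.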

Duplication Check Based on LWE
To prevent dictionary attacks caused by the exposure of deduplication labels, we intended to make them indistinguishable except for the process of duplication check.
Therefore, LWE is adopted to randomize the hash value of the file, which ensures the indistinguishability of deduplication labels and resists attacks based on quantum computation. In addition, we exploit the inner product predicate to control the accessibility of clients, which is flexible enough to support functions such as cross-user sharing and ownership transfer. The logical idea of our scheme is shown in Figure 1.

File Upload
A user, denoted as A, who possesses a file $M_A$ and expects to upload it is not aware of its existence in the cloud at the very beginning. To avoid unnecessary storage and bandwidth, he is supposed to check whether a copy is already held by the server.
Drawing on any strongly collision-resistant hash function, the user computes the hash value of file $M_A$ and encodes it as a vector $h_A$ of $\ell$ elements. On a fixed public matrix $P \in \mathbb{Z}_q^{n \times m}$ and a pseudorandom sequence generator (PSRG), he produces a vector and exploits LWE to obtain the deduplication label $\vec{v}_A$. Herein, $b_A = [\mathrm{PSRG}(h_A)P + e_A]_q$ stands for the last row of $(P, b_A)$, where $\mathrm{PSRG}(h_A)$ is considered as an $n$-dimensional vector and $r_A$ is randomly chosen from $\{-1, 0, 1\}^m$. At this point, the user is ready to take the $(n+1)$-dimensional vector $\vec{v}_A$ as a deduplication label and upload it to the cloud. Since the subsequent actions he should take depend directly on the result of duplication check, we discuss the situations of the original uploader and the repeated uploader separately; they are denoted as A and B for clarity.
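The deterministic part of this step, from file to the $n$-dimensional vector $\mathrm{PSRG}(h_A)$, can be sketched as follows. The concrete choices are assumptions made only for illustration: SHA-256 plays the role of the hash function, and SHAKE-256 output reduced modulo $q$ plays the role of the PSRG; the paper does not fix either primitive.

```python
import hashlib

def hash_file(data):
    """Hash value of the file contents (assumption: SHA-256)."""
    return hashlib.sha256(data).digest()

def psrg(seed, n, q):
    """Deterministically expand a seed into n values in Z_q.

    Assumption: SHAKE-256 is used as the generator; reducing 32-bit words
    modulo q introduces a small bias that a real design would have to handle.
    """
    stream = hashlib.shake_256(seed).digest(4 * n)
    return [int.from_bytes(stream[4 * i:4 * i + 4], "big") % q for i in range(n)]

h_a = hash_file(b"contents of file M_A")
s_a = psrg(h_a, n=8, q=97)    # identical files always yield the same vector
```

Because both functions are deterministic, two users holding the same file derive the same vector, while the randomness of $e_A$ and $r_A$ is what makes the uploaded labels differ.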

The Process of Original Uploader
Then, the cloud preserves the uploaded ciphertext $C_A$ for storage and the deduplication certificate $\vec{s}_A$ for duplication check. To further retrieve the file, user A ought to upload a downloading certificate as well, as follows.
Assume that the attributes of user A correspond to a secret vector $\vec{\mu}_A = (\mu_{A,0}, \mu_{A,1}, \ldots, \mu_{A,n-1}) \in \mathbb{Z}_q^n$, which can also be regarded as a polynomial $\mu_A(x) = \mu_{A,0} + \mu_{A,1}x + \cdots + \mu_{A,n-1}x^{n-1}$. It is worth mentioning that the user is aware of the elements of $\vec{\mu}_A$ only if he holds those attributes. To realize a functional encryption that reflects the access structure in a covert manner, he uniformly samples two vectors, where a vector is equivalent to a cyclic matrix under the correspondence between polynomials and cyclic matrices.
In order to construct the correct downloading certificate,

After that, the user uploads $\vec{w} = (w_0, w_1, \ldots, w_{n-1})$ as the downloading certificate, submitting it to the cloud for further expansions on the access structure. At the end, the user preserves the hash value $sk_A$, the essential elements

It can be seen that, if the two files are identical, only $\langle -e_B, r_B \rangle$ will remain in formula (14). Therefore, when the result satisfies

the cloud can confirm the duplication of file $M_B$ with negligible false positive probability.
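The correspondence between polynomials and cyclic matrices can be checked numerically. The sketch below assumes that the polynomials are multiplied modulo $x^n - 1$, which is the convention under which a cyclic (circulant) matrix represents multiplication by a fixed polynomial; the function names are illustrative.

```python
def circulant(u):
    """Cyclic matrix C with C[i][j] = u[(i - j) mod n]."""
    n = len(u)
    return [[u[(i - j) % n] for j in range(n)] for i in range(n)]

def cyclic_poly_mul(u, mu, q):
    """Coefficients of u(x) * mu(x) modulo (x^n - 1) and modulo q."""
    n = len(u)
    out = [0] * n
    for i, a in enumerate(u):
        for j, b in enumerate(mu):
            out[(i + j) % n] = (out[(i + j) % n] + a * b) % q
    return out

def mat_vec(C, v, q):
    """Matrix-vector product modulo q."""
    return [sum(c * x for c, x in zip(row, v)) % q for row in C]

u, mu, q = [1, 2, 3], [4, 5, 6], 7
via_matrix = mat_vec(circulant(u), mu, q)
via_polynomial = cyclic_poly_mul(u, mu, q)   # both equal [3, 3, 0]
```

Row $i$ of the product is $u_i\mu_0 + u_{(i-1) \bmod n}\mu_1 + \cdots + u_{(i+1) \bmod n}\mu_{n-1}$, so storing a single row of the cyclic matrix is enough to rebuild the whole matrix.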
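The duplication check can be illustrated with a Regev-style sketch. Since the defining equations of the label are not reproduced here, the construction below is an assumption: the label is taken to be $(P r, \langle b, r \rangle)$ with $b = [sP + e]_q$, and the cloud is assumed to hold $s_A = \mathrm{PSRG}(h_A)$ as the deduplication certificate. Under these assumptions, the difference computed by the cloud reduces to $\langle -e_B, r_B \rangle$ exactly when the two files coincide.

```python
import random

def make_label(s, P, q, noise_bound, rng):
    """Assumed label: (P r mod q, <b, r> mod q) with b = sP + e and r in {-1,0,1}^m."""
    n, m = len(P), len(P[0])
    e = [rng.randint(-noise_bound, noise_bound) for _ in range(m)]
    r = [rng.choice((-1, 0, 1)) for _ in range(m)]
    b = [(sum(s[i] * P[i][j] for i in range(n)) + e[j]) % q for j in range(m)]
    head = [sum(P[i][j] * r[j] for j in range(m)) % q for i in range(n)]
    tail = sum(b[j] * r[j] for j in range(m)) % q
    return head + [tail]

def is_duplicate(s_stored, label, q, threshold):
    """Accept when s_stored . (P r) - <b, r> is small after centering modulo q."""
    n = len(s_stored)
    d = (sum(s_stored[i] * label[i] for i in range(n)) - label[n]) % q
    centered = d if d <= q // 2 else d - q
    return abs(centered) <= threshold

rng = random.Random(1)
n, m, q, bound = 4, 16, 2**61 - 1, 2          # illustrative parameters only
P = [[rng.randrange(q) for _ in range(m)] for _ in range(n)]
s_a = [rng.randrange(q) for _ in range(n)]    # stands in for PSRG(h_A)
s_c = [rng.randrange(q) for _ in range(n)]    # derived from a different file
same = is_duplicate(s_a, make_label(s_a, P, q, bound, rng), q, m * bound)
other = is_duplicate(s_a, make_label(s_c, P, q, bound, rng), q, m * bound)
```

For identical files the residue is bounded by $m$ times the noise bound, so the check always succeeds; for different files the residue is close to uniform over $\mathbb{Z}_q$, so a false positive occurs only with probability on the order of the threshold divided by $q$.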
To validate his accessibility, user B should also compute the downloading right of the corresponding file. However, it is more reasonable to reuse the existing download certificate $\vec{w}$ held by the cloud server for the purpose of storage saving. Based on this, user B can use the following subprotocol to obtain the download right of the duplicate file, and the cloud will simply send the link back to him for further retrieval.

The Subprotocol for Access Expansion. Denoting the secret corresponding to the attributes of repeated uploader
To bind the access structure with his own attributes, he should also compute a cyclic matrix $U_B$, which can be used to compute his proof of ownership as follows:

Though the downloading certificate $\vec{w}$ cannot be exposed, in order to prevent unauthorized access, the cloud can provide user B with the values of $(w_{n-3}, w_{n-2}, w_{n-1})$ and $y$ to help him calculate the correct cyclic matrix $U_B$. Thus, the downloading right can be computed by user B as in Algorithm 1.

Proof of Ownership
Once any legal user obtained his downloading right, he should be authorized to retrieve the corresponding file from the cloud. To improve the efficiency of ownership proof, access authorization is executed in a computational way.
After uploading, the legal user A will be provided with the last row $\vec{u}_A = (u_{A,n-1}, \ldots, u_{A,1}, u_{A,0})$ of the cyclic matrix.
Therefore, he only needs to form the cyclic matrix $U_A$ and combine it with his attribute vector $\vec{\mu}_A$ to compute the downloading right. Based on the resulting vector, the cloud can easily verify his accessibility by functional encryption. The complete process of PoW is given in Algorithm 2.
After obtaining the ciphertext $C_A$, user A can decrypt the file by computing $\mathrm{Dec}_{sk_A}(C_A) = M_A$ because he is aware of the secret key $sk_A = H(M_A)$.
In fact, the ownership proof process for user B is similar to that of user A. User B can also decrypt the file $C_A$ because the plaintexts $M_A$ and $M_B$ are equal: since $M_A = M_B$, the keys coincide, that is, $sk_B = H(M_B) = H(M_A) = sk_A$.

Downloading Right Transfer
Note that, without the secret vectors corresponding to the attributes of legal users, other users are incapable of computing the downloading right even if the last row of the cyclic matrix is known. With the access control subprotocol, any legal user could directly transfer the resulting downloading right to other users, avoiding redundant operations such as peer-to-peer transmission. However, this may lead to abuse of the downloading right and violate the confidentiality of the user's attributes. In practice, legal users tend to transfer the downloading right of their files to others who share part of their attributes. Therefore, we designed a protocol by which any legal user can update the downloading right and transfer it to a group of users with the same set of attributes. In this way, the owner does not have to download the file from the cloud and only needs to transfer the downloading right to other users to complete file sharing, which effectively reduces the consumption of communication bandwidth.
Then, the common attributes vector $\vec{\mu}_{team} = (\mu_{team,0}, \ldots, \mu_{team,n-1})$ can be defined through a partial ordering relation such that $\mu_{team,i} = \mu_{A,i}$ if $\mu_{A,i} \in \{\mu_{tall,j} \mid j = 0, \ldots, Q(n)\}$ and $\mu_{team,i} = 0$ otherwise. Specifically, the process by which user A constructs the common attribute vector $\vec{\mu}_{team} = (\mu_{team,0}, \mu_{team,1}, \ldots, \mu_{team,n-1})$ is detailed in Figures 2 and 3. As shown there, user A retains the secret attributes shared by the same group and sets the attributes that are distinct within the user group to 0. Finally, he outputs the common attribute vector $\vec{\mu}_{team}$.
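The masking rule that defines the common attribute vector is simple enough to state as code. In the sketch below, `shared_values` stands for the set $\{\mu_{tall,j}\}$ of attribute values shared by the group; the function name and the sample values are illustrative assumptions.

```python
def common_attribute_vector(mu_a, shared_values):
    """Keep mu_A[i] when it belongs to the shared set; otherwise set it to 0."""
    shared = set(shared_values)
    return [x if x in shared else 0 for x in mu_a]

mu_a = [5, 12, 7, 30]                  # illustrative attribute vector of user A
shared_values = {12, 30, 99}           # illustrative values common to the group
mu_team = common_attribute_vector(mu_a, shared_values)   # [0, 12, 0, 30]
```

Only the positions whose values are shared survive, so a group member who knows the shared attributes can reproduce `mu_team` without learning the remaining attributes of user A.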

Proof of Ownership.
User A performs the following steps to realize the PoW and retrieve $(w_{n-3}, w_{n-2}, w_{n-1})$ and $y$. If the downloading right is valid, the inner product will result in 0, meaning that user A is authorized to retrieve the file. Therefore, the cloud server returns $C_A$, $(w_{n-3}, w_{n-2}, w_{n-1})$, and $y$ to him. Similarly, the values of $(w_{n-3}, w_{n-2}, w_{n-1})$ and $y$ can be used to update the downloading right for a group of users. Specifically, the process of PoW is shown in Algorithm 2, which is the same for any valid user even if the updated downloading right is used.

Update the Downloading Right.
To share the file with a group, user A can carry out the downloading right update process as follows. The process by which user A calculates the downloading right for a group of users is shown in Algorithm 3. The resulting vector $(\ldots, u'_{team,1}, u'_{team,0})$ and the secret key $sk_A$ are then passed to all users who hold the same attribute set. In this way, a group of users is provided with the downloading right, which is valid only if the common attributes vector $\vec{\mu}_{team}$ is known.
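Step (5) of Algorithm 3 divides by $w_{n-1}$ inside $\mathbb{Z}_q$, which in practice means multiplying by a modular inverse. The following sketch, which assumes a prime modulus $q$ and $w_{n-1} \neq 0$, resamples the last two coordinates so that $w_{n-2}x_{n-2} + w_{n-1}x_{n-1} = y \pmod q$ still holds; the function name and the sample values are hypothetical.

```python
import random

def resample_last_two(w, y, q, rng):
    """Return (x_n2, x_n1) with w[-2]*x_n2 + w[-1]*x_n1 = y (mod q).

    Assumes q is prime and w[-1] is nonzero modulo q, so the inverse exists.
    """
    x_n2 = rng.randrange(q)                              # fresh random coordinate
    x_n1 = ((y - w[-2] * x_n2) * pow(w[-1], -1, q)) % q  # solve for the last one
    return x_n2, x_n1

q = 101                                # toy prime modulus
w_tail = [7, 11, 13]                   # illustrative (w_{n-3}, w_{n-2}, w_{n-1})
y = 42                                 # illustrative value returned by the cloud
x_n2, x_n1 = resample_last_two(w_tail, y, q, random.Random(3))
```

Because one coordinate is sampled afresh, every update yields a different downloading right, while the contribution of the last two coordinates to the inner product stays fixed at $y$.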

Correctness Proof
The previous section is mainly composed of three parts, namely, file uploading, proof of ownership, and downloading right transfer. To verify the correctness of our design, this section proves that file duplication can be effectively eliminated and that only authenticated users can access the file.
Firstly, the correctness of the deduplication label is given by Theorem 1.

Mathematical Problems in Engineering
Since $\mathrm{PSRG}(\cdot)$ is a deterministic algorithm, when

where $\vec{\mu}_A = (\mu_{A,0}, \mu_{A,1}, \ldots, \mu_{A,n-1})$ are the attributes of user A. After that, user A sends the download right $\vec{X}_A$ to the cloud. Finally, the cloud calculates the inner product of

Based on the last element of the download certificate $\vec{w}$

Figure 2: Common attributes.

Then, he also obtains the download right $\vec{X}_B$ and sends it to the cloud. Moreover, the inner product $\langle \vec{w}, {}^t\vec{X}'_B \rangle$ is calculated as follows:

In a word, all legal users who hold the download right corresponding to file $M_A$ can pass the PoW.

Security Analysis
This part proves that the deduplication label is indistinguishable except during the duplication check process and that the downloading right is resistant to forgery. To begin with, the security of the deduplication label is given in Theorem 3.

Theorem 3 (Security of deduplication label).
For legitimate users, whether uploading the same or different files to perform deduplication, the deduplication labels are only distinguishable to the duplication check process.
Proof. The following analysis is divided into two cases, according to whether the deduplication labels correspond to the same file or to different files.

Case 1 (same file). Since the algorithm $\mathrm{PSRG}(\cdot)$ is deterministic, we have $\mathrm{PSRG}(h_A) = \mathrm{PSRG}(h_B)$. Moreover, for the common matrix $P$, it is obvious that $\mathrm{PSRG}(h_A) \cdot P = \mathrm{PSRG}(h_B) \cdot P$. However, $e_A, e_B$ and $r_A, r_B$ are randomly sampled from $\chi_q^m$ and $\{-1, 0, 1\}^m$, respectively. The probability that the deduplication labels are identical is $1/(3q)^m < 1/Q(m)$, which is negligible. Therefore, we claim that the resulting labels of the same file differ.

Case 2 (different files). Similarly, since $\mathrm{PSRG}(h_A) \neq \mathrm{PSRG}(h_B)$, the probability that the deduplication labels are the same is $1/((n+1)(3q)^m) < 1/Q(m)$, which is indistinguishable from the distribution of Case 1.

Therefore, since the deduplication labels of the same file are different, the distribution of Case 1 is indistinguishable from that of Case 2, and the deduplication labels are semantically secure. In summary, the deduplication tags corresponding to the same file and to different files are indistinguishable.

Proof. According to the inner product predicate, user A's downloading right $\vec{X}_A$ makes the inner product $\langle \vec{X}_A, {}^t\vec{w} \rangle$ output 0. However, the download certificate $\vec{w}$ is calculated by user A, who samples $w_i$ $(i = 0, \ldots, n-2)$ and sets the last element $w_{n-1}$ of the download certificate $\vec{w}$ accordingly, modulo $q$. Then, when user A uploads for the first time, the cloud obtains the completed download certificate $\vec{w}$ corresponding to A's secret attributes. Now, if an illegal user tries to falsify the download certificate by sampling $\vec{w} \xleftarrow{\$} \mathbb{Z}_q^n$ to cheat the PoW system, his advantage is negligible.

In detail, since the value of $w_{n-3}$ is known, the result of $x'_{B,n-3}$ can be calculated. However, because of the rank of formula (25), the probability of recovering the remaining elements is negligible. Therefore, our scheme will not expose the remaining elements of the download certificate $\vec{w}$.
In terms of Lemmas 1 and 2, it can be seen that no user can forge a valid downloading right, since the complete download certificate and the attributes vector $\vec{\mu}_A$ are never exposed.

Performance Analysis
The performance of our scheme is now analysed in comparison with other main technologies. The notation is given in Table 1. Functional properties, such as the necessity of a third party, the deduplication level, the participants, and the necessity of key fusion, are compared in Table 2.
Compared with the schemes from [2,3,9], our scheme does not require any third party, which avoids extra trust relationships and saves considerable computation and communication resources. Moreover, our scheme executes file deduplication amongst multiple users, implying that it is more flexible and more adaptive to various cloud environments. From the perspective of key fusion, unlike the schemes of [2,3,8,9], our scheme needs no key fusion process, so it can be applied even if user resources are limited. Then, we compare the computation overheads for deduplication incurred by the client, the third party, and the cloud in the above schemes. The details are given in Table 3.
Compared with the cost on the client side in scheme [3], that of our scheme is O(f)Hash + O(1)PSRG, where a pseudorandom number sequence is generated instead of N convergence keys. In fact, this means that our scheme is more efficient, since the PSRG can be generated iteratively from small numbers, not to mention that our scheme is free of any third party. Moreover, the hash value of the file can be secretly used as the encryption key in this paper. Therefore, there is no need for multiple users to reconstruct the convergence key, which further outperforms the scheme of [3] by avoiding the cost of key distribution and fusion.
Compared with the schemes in [8,9], our method does not need to construct a Bloom filter or an attribute binary tree on the client side, so the computational cost is slightly lower. In addition, since our scheme does not involve any third party, the computational cost of the TTP can be neglected. As for the overhead on the cloud side, our scheme does not have to initialize any ownership data structure, unlike the schemes of [8,9]. Therefore, the calculation is reduced to O(g), since it is not related to the file size but only to the length of the hash value. Now, we compare the computational overhead for PoW on the client, third-party, and cloud sides, respectively. The results are shown in Table 4.
It can be seen from Table 4 that users have to preserve and search the Bloom filter or attribute binary tree to accomplish PoW in [2,3,8,9], so there is an additional cost of O(kL) or O(N log N) on the client side. However, our scheme does not require this process, so the calculation cost is only O(f)Hash + O(n)Add, where the second term is just n addition operations. On the cloud side, our scheme also outperforms those of [2,3,8,9], with a cost of O(n)Add. The reason is similar: the calculation cost on the cloud side has nothing to do with the file size but only with the number of attributes. Finally, taking a file of 256 bits as an example, we compare the communication overhead for deduplication and PoW amongst the same set of schemes. The details are shown in Figure 4.

ALGORITHM 3: Calculation process chart of common downloading right.
(1) Samples $\vec{u}_{team} = (u_{team,0}, u_{team,1}, \ldots, u_{team,n-1})$, where $u_{team,0}$ is irreversible, with coefficients belonging to $\mathbb{Z}_q$
(3) Computes $y' \leftarrow w_{n-2} \cdot x_{team,n-2} + w_{n-1} \cdot x_{team,n-1}$
(4) Computes $x'_{team,n-3} \leftarrow x_{team,n-3} - ((y - y')/w_{n-3})$
(5) Samples $x'_{team,n-2} \xleftarrow{\$} \mathbb{Z}_q^n$, sets $x_{team,n-2} \leftarrow x'_{team,n-2}$, and computes $x_{team,n-1} \leftarrow (y - w_{n-2} \cdot x'_{team,n-2})/w_{n-1}$
(6) For all $k \in \{1, \ldots, n-1\}$: $u_{team,i} \cdot \mu_{team,0} + u_{team,[i+(n-1)] \bmod n} \cdot \mu_{team,1} + \cdots + u_{team,(i+1) \bmod n} \cdot \mu_{team,n-1} \leftarrow x_{team,i}$
Output: $\vec{u}_{team} = (u_{team,0}, u_{team,n-1}, \ldots, u_{team,1})$

According to Figure 4, our scheme has an obvious advantage in communication overhead compared with the other schemes. Our solution can effectively reduce bandwidth usage as well as time delay. Moreover, since all duplication check and ownership proof processes are independent, our scheme is capable of parallel processing, which makes it better suited to batch implementation.

Conclusions
This paper proposed a novel deduplication scheme based on LWE and FE to balance the conflict between the accessibility and the indistinguishability of data. For the purpose of duplication check, LWE is exploited to construct deduplication labels that are distinguishable only if their deduplication certificates are known. To realize more efficient and flexible access control, the inner product predicate is used so that data can be retrieved only if both the user's downloading right and attributes vector are possessed. Thanks to the separation of the downloading right from the user's attributes, the downloading right can be recalculated for repeated uploading and authorization transfer without changing the corresponding deduplication label or download certificate in the cloud. Correctness and security analyses proved that deduplication can be accomplished only by the duplication check process with negligible false positive probability, and that it is almost impossible for any adversary to fabricate a legal downloading right. Compared with other main technologies, our scheme is more applicable to multiuser environments and is free from any trusted third party. Since both duplication check and ownership proof are realized by inner products, the computation and communication performance of our method is more advantageous, not to mention its capacity for batch processing due to parallelism.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.