Security Analysis and Improvements on a Remote Integrity Checking Scheme for Regenerating-Coding-Based Distributed Storage

Enabling remote data integrity checking with failure recovery has become critical in distributed cloud systems. Because they lower repair bandwidth while preserving fault tolerance, regenerating coding and network coding (NC) have received much attention in the coding-based storage field. Recently, an outsourced auditing scheme named NC-Audit was proposed for regenerating-coding-based distributed storage. The scheme claimed that it can effectively achieve lightweight privacy-preserving remote data verification for these networked distributed systems. However, our algebraic analysis shows that NC-Audit can be easily broken due to a defect in its design. That is, an adversarial cloud server can forge illegal blocks to cheat the auditor with high probability when the coding field is large. From the perspective of algebraic security, we propose a remote data integrity checking scheme, RNC-Audit, which hides part of the critical information from the server without compromising system performance. Our evaluation shows that the proposed scheme has significantly lower overhead compared to the state-of-the-art schemes for distributed remote data auditing.


Introduction
Distributed cloud storage provides an on-demand data outsourcing service and has become a popular research topic due to its elasticity and low maintenance cost. However, some storage nodes in such a system could be untrustworthy, which makes data integrity and reliability increasingly troublesome for data owners. Regenerating coding, a technology designed for these networked storage systems, can reduce data repair bandwidth, but it presupposes that the integrity of the outsourced data is ensured. Therefore, enabling remote data integrity verification becomes fundamental and crucial for regenerating-coding-based cloud storage systems [1][2][3].
Many solutions realizing outsourced integrity checking without a local copy have been presented under several system and security models, such as provable data possession (PDP) [2][3][4][5][6] and proof of retrievability (PoR) [7][8][9][10]. PDP approaches resort to homomorphic authentication schemes to ensure possession of files on untrusted storage, while PoR approaches combine spot-checking and error correcting codes to ensure both possession and retrievability of files on archive service systems. To enable integrity verification, both solutions require a cloud server to return a proof response for particular hosted data blocks specified by the auditor (or the user itself). If the returned proof cannot pass the auditor's verification, the auditor concludes that the data hosted on that server are corrupted. PDP and PoR are the earliest solutions for data integrity auditing in a single-cloud environment, in which only one copy of user data is stored in the cloud. Considering that files are usually striped and redundantly stored across distributed systems (i.e., multiple servers or multiple clouds), the works in [7][9][10][11][12] explore integrity verification suitable for such distributed settings with different redundancy schemes, such as replication, erasure codes, and regenerating codes. In this paper, we focus on the remote data integrity verification problem in regenerating-coding-based distributed storage systems.
Most traditional PoR approaches cannot support popular cloud storage applications because of the random permutations they use for auditing. Although some PDP schemes incorporate error correction codes by preprocessing the outsourced data, decoupling error correction from auditing in this way is inefficient.
Therefore, the state-of-the-art approaches can only partially solve cloud data integrity checking, and they still face usability challenges in practical scenarios due to their low efficiency for regenerating-coding-based distributed cloud storage [13][14][15]. Usability requires that a cloud data integrity scheme be practically secure, support real-time applications, and run fast. Most of the existing solutions either involve large-scale encoding computation over the stored data or need to generate a large amount of authenticated parity data. When these costs are incurred in regenerating-coding-based systems [5,8,9], the distributed storage performance degrades dramatically and falls far short of usability.
Only a few works have been devoted to usable regenerating-coding-based applications. To achieve a lightweight implementation cost, Le et al. [11] proposed a symmetric-key-based privacy-preserving auditing scheme called NC-Audit, which offers relatively efficient performance. The authors claimed that it can realize remote privacy-preserving data auditing along with failure repair for cloud storage nodes. However, as illustrated in this paper, the scheme has a fatal security weakness that breaks the integrity checking protocol. Recently, Lakshmi and Deepthi [10] proposed a homomorphic encryption scheme based on channel coding for regenerating-coding-based storage systems, which realizes verifiable computation and error correction with a very small amount of bandwidth. However, large-scale matrix multiplication operations are involved during auditing and error correction, which brings in heavy online computation overhead.
In parallel, another kind of solution for maintaining remote system security is presented in [16][17][18], which targets malware detection or tamper resistance in storage networks. However, these works only focus on software or hardware security, which is orthogonal to the data security considered in this paper. Different from these active attack detection mechanisms, we focus only on remote data integrity verification in distributed cloud storage. The contributions of this paper are threefold as listed below.
(1) We point out that the scheme NC-Audit for regenerating-coding-based storage is actually insecure because it cannot satisfy fundamental auditing security: an adversarial storage node can successfully forge an illegal response that passes the auditing verification even if the storage node has deleted the user's whole file.
(2) We further propose an improved algebraic-security-based remote data auditing scheme named RNC-Audit (Revised NC-Audit) to fill the security gap between coding reliability and the usability of integrity checking. The methodology behind this work is to scramble part of the key parameters to strengthen the security.
(3) The proposed scheme is practically secure and does not rely on any computation-heavy arithmetic. It supports existing distributed cloud storage applications and works over computation-efficient finite fields.
The rest of the paper is organized as follows. In Section 2, we formulate the system model and the threat model. In Section 3, we describe the execution of NC-Audit between the user and a single storage node. In Section 4, we present the security analysis of NC-Audit. In Section 5, we put forward an improved algebraic-security-based scheme, RNC-Audit. In Section 6, we explain the correctness of RNC-Audit. In Section 7, we evaluate the communication and computational efficiency. In Section 8, we conclude the paper.

System Model.
Similar to [11], we consider a cloud storage service involving a user, a third-party auditor (TPA), and some regenerating-coding-based storage nodes which make up the cloud storage provider (CSP). The user uploads his data to the storage nodes and resorts to the TPA to check the integrity of the outsourced data at each node. In particular, the user does not want the TPA to learn the content of his data. The auditing system model is shown in Figure 1.
Before data uploading, the user encodes the file using a regenerating code and uploads the encoded data to $N$ storage nodes $N_1, N_2, \ldots, N_N$ in a distributed way. The detailed procedure is as follows.
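(1) Divide the file into a sequence of initial message vectors $v_1, v_2, \ldots, v_m$ and augment each of them as $(v_i, e_i)$ for $i = 1, 2, \ldots, m$, where $e_i$ is an $m$-dimensional unit vector whose $i$-th element is 1.
(2) For each storage node $N_s$, compute the coded blocks $y_{sj} = \sum_{i=1}^{m} \alpha_{sij} (v_i, e_i)$ for $j = 1, 2, \ldots, M$, where $\alpha_{sij}$ is randomly chosen in $\mathbb{F}_q$, and $g_{y_{sj}}$ is termed the encoding coefficient of $y_{sj}$ and is composed of the last $m$ elements of $y_{sj}$.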
When an $(N, k, d, M, \beta)$ regenerating code is adopted in a distributed storage system with $N$ storage nodes (each node stores $M$ blocks), the data file distributed in this system can be restored by accessing the $M$ data blocks of any $k$ healthy nodes. When a server node fails, the data stored in the failed server can be reconstructed by retrieving $\beta$ ($\beta \le M$) data blocks from each of any $d$ ($k \le d \le N - 1$) healthy servers, and therefore, the repair bandwidth is $\gamma = d\beta$. In the example given in Figure 2, $N = 4$, $k = 2$, $d = 3$, $M = 2$, $\beta = 1$, and $\gamma = 3$. The notation described above will continue to be used in the following text. We refer the interested reader to the literature on regenerating code construction.
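As a quick check of the parameters in the Figure 2 example, the following minimal sketch computes the repair bandwidth $\gamma = d\beta$ and, for comparison, the number of blocks downloaded if the whole file were first restored from $k$ nodes. The function names and the "naive" baseline are illustrative assumptions and are not part of the scheme.

```python
def repair_bandwidth(d: int, beta: int) -> int:
    """Blocks downloaded to repair one failed node: gamma = d * beta."""
    return d * beta


def naive_repair_cost(k: int, M: int) -> int:
    """Blocks downloaded if the file is first restored from k nodes (M blocks each).

    Hypothetical baseline used only for comparison.
    """
    return k * M


# Parameters of the Figure 2 example.
N, k, d, M, beta = 4, 2, 3, 2, 1

# Sanity constraints on a regenerating code: k <= d <= N - 1 and beta <= M.
assert k <= d <= N - 1 and beta <= M

print("repair bandwidth gamma =", repair_bandwidth(d, beta))
print("naive restore-then-repair cost =", naive_repair_cost(k, M))
```

With these parameters the regenerating repair downloads 3 blocks, whereas restoring the file first would download 4.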

Threat Model.
We consider semitrusted storage nodes that are faithful and do not deviate from the auditing protocol. However, they may deliberately delete rarely accessed user data to reduce storage costs; they may also try to hide data corruption caused by either internal or external factors to maintain their reputation. For clarity, we concentrate our discussion on the interaction between a single storage node and the TPA.
We assume that the TPA, who is in charge of remote data integrity checking (i.e., data auditing), is independent and reliable. The TPA has no incentive to collude with the storage nodes but has a strong desire to extract or leak the user's secret keys. This is a general assumption when relying on a TPA for data auditing to reduce the user's burden [5,6,8].
As is the usual practical requirement for cryptographic protocols, both the TPA and the storage node are fully aware of the protocols used.

Auditing Model.
Generally, a remote data auditing scheme for cloud storage always includes a three-stage process, i.e., initialization, outsourcing upload, and data auditing.
The initialization stage mainly generates the system parameters and the protocol security keys; in outsourcing upload, a user generates authentication tags for all the outsourced blocks and outsources them to the storage nodes; in the data auditing stage, the cloud server computes a response to the TPA's challenge (i.e., a linear combination of the given blocks and its authentication tag), and the TPA verifies whether the tag is a valid tag of the combined block.

Description of NC-Audit
For clarity, this section focuses on the discussion of NC-Audit execution between the user and a single CSP storage node.
In NC-Audit, every initial message block $v_i$ consists of $n$ characters in $\mathbb{F}_q$, of which the last two characters are padded randomly.
The scheme introduces three pseudo-random functions (PRFs), in whose definitions $\mathbb{Z}^+$ indicates the set of positive integers, $K$ denotes the PRF key set, and $ID$ is the set of file identifiers. NC-Audit consists of three phases as follows.
(1) Initialization: the user computes the required parameters for $i = 1, 2, \ldots, n - 2$ and then sends $p_1, p_2, \ldots, p_{n-1}$ to $N_s$. Here the $j$-th element of each vector $p_i$ is derived from the PRFs.
(2) Outsourcing upload: For $j = 1, 2, \ldots, M$, the user works as follows.
(a) Generate $y_{sj}$ as in Section 2.1 and compute the authentication tag $t_{sj}$ of the vector $y_{sj}$.
(b) Retain the secret keys and the encoding coefficient vector $g_{y_{sj}}$, send $g_{y_{sj}}$ to the TPA, and upload the vector $z_{sj} = (y_{sj}, t_{sj})$ to $N_s$. After that, the user can delete all local data of the file denoted by $id$.
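(3) Data auditing: the TPA challenges $N_s$ and verifies the returned response. If the verification equation holds, $N_s$ passes the audit for that round and the TPA outputs 1; otherwise, it outputs 0.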

Security Analysis on NC-Audit
The authors in [11] have shown that NC-Audit achieves ciphertext indistinguishability under chosen plaintext attacks. However, this is not enough to guarantee the security of the auditing mechanism. As claimed, the vector $r$ (or $\bar{r}$) may only be shared between the user and the TPA, which means that once $r$ is leaked to an adversarial storage node, that node is able to forge at least one illegal vector that passes the auditing verification when $q$ is large. The following analysis shows how an adversary can easily deduce the private vector $r$. According to equation (3), we can construct a system of linear equations in the vector $r$ (with $n - 2$ unknowns), whose coefficient matrix we denote by $A$. Since $p_i \in \mathbb{F}_q^{n-2}$, $A$ is clearly an $(n - 2) \times (n - 1)$ matrix over $\mathbb{F}_q$. Meanwhile, every element of $A$ can be considered random in $\mathbb{F}_q$ because it is generated by a PRF, so the probability that the rank of $A$ is $n - 2$ approaches 1 when $q$ is large enough. This means that the adversary can solve for the vector $r$ with high probability.
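The claim that $A$ has rank $n - 2$ with probability close to 1 can be made concrete under the idealized assumption that the entries of $A$ are independent and uniform over $\mathbb{F}_q$ (the PRF outputs are only modeled this way). For such a matrix with $a \le b$ rows and columns, the full-rank probability is $\prod_{i=0}^{a-1} (1 - q^{-(b-i)})$. The sketch below evaluates this standard formula; the function name and the small demo size are illustrative assumptions.

```python
from fractions import Fraction


def full_rank_probability(q: int, rows: int, cols: int) -> Fraction:
    """Probability that a uniformly random rows x cols matrix over F_q has full rank.

    Uses the counting formula prod_{i=0}^{a-1} (1 - q^-(b - i)),
    where a = min(rows, cols) and b = max(rows, cols).
    """
    small, large = sorted((rows, cols))
    prob = Fraction(1)
    for i in range(small):
        prob *= 1 - Fraction(1, q ** (large - i))
    return prob


# Demo with a small block length n = 10, i.e. an (n - 2) x (n - 1) = 8 x 9 matrix.
# The dominant factor (1 - q^-2) does not depend on n, so larger n behaves alike.
for q in (2, 16, 256):
    print(q, float(full_rank_probability(q, 8, 9)))
```

For $q = 2^8$ the probability is already above 0.9999, which is consistent with the statement that the adversary recovers $r$ with high probability for large fields.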
Equations (10) and (11) then follow from equation (8). As stated above, the adversary can solve for the vector $r$ with high probability when $q$ is large enough. Moreover, he knows the public vectors $c$. Therefore, the adversary can easily deduce up to $q^{n-3} - 1$ vectors $c'$ ($\neq c$) satisfying

$$r \cdot c' = r \cdot c. \quad (12)$$

Since the vector $e$ is fixed during each encryption in NC-Audit, it holds that

$$r \cdot (e + m') = r \cdot (e + m). \quad (13)$$

According to equations (10) and (11), the verification equation therefore still holds for the forged vector $m'$. That is, the adversary can deliberately disguise an illegal plaintext vector $m'$ ($\neq m$) with the specific vector $e$ and the tag $t$, such that the forged response message $resp' = (\langle e + m', (r, p) \rangle, e_{n-1}, e_n, t)$ succeeds in passing the auditing verification.
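The count $q^{n-3} - 1$ in equation (12) can be checked by brute force on a toy instance. The sketch below is an illustration only: it assumes a small prime field $\mathbb{F}_5$ (so that arithmetic is plain modular arithmetic, unlike the $\mathbb{F}_{2^8}$ used in the experiments), a nonzero $r$ of dimension $n - 2 = 3$, and hypothetical values for $r$ and $c$.

```python
from itertools import product


def colliding_vectors(r, c, q):
    """Enumerate all c' != c over F_q (q prime) with r . c' = r . c (mod q)."""
    target = sum(a * b for a, b in zip(r, c)) % q
    return [
        v
        for v in product(range(q), repeat=len(c))
        if v != tuple(c) and sum(a * b for a, b in zip(r, v)) % q == target
    ]


q = 5            # toy prime field (assumption for illustration)
r = (1, 2, 3)    # recovered secret vector, dimension n - 2 = 3
c = (4, 0, 1)    # a legitimate vector known to the adversary

forged = colliding_vectors(r, c, q)
# For nonzero r the solutions form a hyperplane with q^(n-3) points; removing c
# itself leaves q^(n-3) - 1 candidates.
print(len(forged), q ** (len(r) - 1) - 1)
```

Both printed numbers are 24 for this instance, so every one of these vectors yields the same inner product with $r$ as the legitimate $c$.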

An Improved Remote Data Integrity Checking Scheme for Distributed Storage
According to the algebraic analysis in Section 4, we can conclude that the security of NC-Audit depends on the security of the secret vector $r$, and vice versa. Inspired by the notion of algebraic security in [20], a feasible method of realizing secure integrity checking is to prevent an adversary from deducing the vector $r$. For example, if the adversary can obtain at most $n - 3$ of the values $p_i$ ($i = 1, 2, \ldots, n - 1$), he has no way to launch the attack described in Section 4. Accordingly, we present an algebraic-security-based auditing scheme called RNC-Audit, in which a special randomization is performed to protect part of the critical security parameters. This makes the auditing protocol satisfy the algebraic security criterion, i.e., the adversary is unable to solve the linear system of equations constructed from the information he knows. To this end, without loss of generality, we assume that the values $p_{n-2}$ and $p_{n-1}$ in equation (9) are the ones randomized to defeat the adversary's analysis.
Besides the PRFs $F_1$ and $F_2$ used in NC-Audit, RNC-Audit also introduces another PRF, $F_4$, in whose definition $WID$ is the set of identifiers of auditing tasks. Since the system parameters of the protocol differ from those of NC-Audit, RNC-Audit performs a different auditing computation. The detailed execution of RNC-Audit between the user and a single CSP storage node is as follows.
(1) Initialization:
(a) After setting the security parameter $\lambda$ and the PRFs, the user shares a unique key $k_e$ with the storage node and a unique key $k_v$ with the TPA, where $k_e$ is used for encryption at the storage node and $k_v$ is used for verification at the TPA.
(b) Both the user and the TPA compute the vectors $r \in \mathbb{F}_q^{n-2}$ and $\bar{r} \in \mathbb{F}_q^{n+m}$ as in NC-Audit.
(c) The user generates the vectors $p_i \in \mathbb{F}_q^{n-2}$ as in NC-Audit, then computes and sends the parameters $p_1, p_2, \ldots, p_{n-3}, \alpha_1, \alpha_2$ to $N_s$ (rather than the former parameters $p_1, p_2, \ldots, p_{n-1}$ used in NC-Audit), and simultaneously sends $\theta_1, \theta_2$ to the TPA. Here $\alpha_1$ and $\alpha_2$ are both selected randomly in $\mathbb{F}_q^*$.
(2) Outsourcing upload: For $j = 1, 2, \ldots, M$, the user works as follows.
(a) Generate $y_{sj}$ as in Section 2.1 and compute the authentication tag $t_{sj}$ of the vector $y_{sj}$.
(b) Retain his secret key and $g_{y_{sj}}$, send the vector $g_{y_{sj}}$ to the TPA, and upload the vector $z_{sj} = (y_{sj}, t_{sj})$ to $N_s$. After that, the user can delete all local data of the file denoted by $id$.

Security and Communication Networks
(3) Data auditing:
(a) The TPA generates and sends the challenge message $chal = \{\langle i, \varepsilon_i \rangle \mid i \in \Delta\}$.
(b) $N_s$ computes the aggregate vector of the challenged blocks, i.e.,

$$\sum_{i \in \Delta} \varepsilon_i z_{si} = (e, e_{n-1}, e_n, g_e, t), \quad (19)$$

and then performs the following operations:
Step 1: generating $\beta_i = F_4(k_e, wid, i)$ for $i = 1, 2, \ldots, n - 1$ and computing a mask vector, where $wid \in WID$ is the identifier that labels the current audit task.
Step 2: computing the remaining components of the response.
Step 3: sending the response message $resp$, which consists of the masked block together with $e_{n-1}$, $e_n$, $t$, $c_1$, and $c_2$, to the TPA.
(c) The TPA computes $g_e$ according to $chal$, extracts $c = (e, e_{n-1}, e_n, g_e)$ from $resp$, and then verifies whether the verification equation holds.
If so, $N_s$ passes the audit for that round and the TPA outputs 1; otherwise, it outputs 0.

Correctness of RNC-Audit
The correctness of RNC-Audit is guaranteed when the stored file is intact, which follows from expanding the verification equation.

Security Guarantee.
Similar to the analysis of Theorems 2 and 4 in [11], it can be easily proven that RNC-Audit provides a data possession proof and a privacy-preserving guarantee.
In particular, RNC-Audit effectively overcomes the security weakness of NC-Audit described in Section 4. The user is able to protect the values $p_{n-2}$ and $p_{n-1}$ against both the honest-but-curious server and the TPA, so that the adversary can obtain at most a linear system of $n - 3$ equations in $n - 2$ unknowns. When $q$ is sufficiently large, the adversary cannot solve for the vector $r$ except by brute-force guessing, which effectively resists the analysis in Section 4 and thus guarantees the auditing security of RNC-Audit. Note that the proposed scheme therefore realizes algebraic security as defined in [20].
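The effect of withholding equations can be illustrated on a toy instance. The sketch below assumes a generic linear system $P r = b$ over a small prime field $\mathbb{F}_5$ with $n - 3 = 2$ equations and $n - 2 = 3$ unknowns; the matrix $P$, the secret $r$, and the assumption that the adversary also knows $b$ are hypothetical choices for illustration and do not reproduce the exact equations of RNC-Audit.

```python
from itertools import product


def consistent_candidates(P, b, q):
    """Return every r over F_q (q prime) with P r = b (mod q), by brute force."""
    dim = len(P[0])
    return [
        r
        for r in product(range(q), repeat=dim)
        if all(
            sum(p_j * r_j for p_j, r_j in zip(row, r)) % q == rhs
            for row, rhs in zip(P, b)
        )
    ]


q = 5                          # toy prime field (assumption)
secret_r = (2, 4, 1)           # n - 2 = 3 unknowns
P = [(1, 0, 2), (0, 1, 3)]     # only n - 3 = 2 equations are visible
b = [sum(p_j * r_j for p_j, r_j in zip(row, secret_r)) % q for row in P]

candidates = consistent_candidates(P, b, q)
print(len(candidates))         # number of vectors the adversary cannot rule out
```

Here exactly $q = 5$ vectors remain consistent with what the adversary sees, so a single guess succeeds with probability $1/q$, which becomes small as the field grows.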

Computation and Communication.
Under the same security level, the scheme NC-Audit and the scheme in [10] were considered efficient in computation among the current data auditing schemes for regenerating-coding-based storage. Therefore, we compare the performance of RNC-Audit with that of the scheme in [10] and NC-Audit. The system-level features of the three schemes are shown in Table 1. The scheme in [10] is specifically constructed for online user auditing and differs in system model from the other schemes because no TPA is needed.
This design simplifies the protocol and eliminates the security risks from the TPA, but as a side effect the user itself must always be online and incurs a high processing burden. Furthermore, the scheme involves large-scale error correction encoding and channel decoding operations, which makes its computation overhead higher than that of RNC-Audit and NC-Audit.
Although NC-Audit is also efficient in computation cost, it cannot guarantee the security of the user's data. In contrast, RNC-Audit ensures data security and achieves implementation performance similar to that of NC-Audit. In fact, RNC-Audit performs only two more multiplications in $\mathbb{F}_q$ than NC-Audit during each audit. In addition, it does not increase the storage overhead of any entity in the system while the protocol is running.
For each audit round, the main communication overhead is the transmission of the response from the storage node to the TPA (or the user), which is dominated by the size of the (encrypted) data block. NC-Audit and RNC-Audit have equivalent communication efficiency. By comparison, the scheme in [10] has lower communication overhead because of its simplified auditing method.

Implementation.
We implemented the schemes in C to compare the online computation performance of RNC-Audit with two typical schemes, i.e., NC-Audit and the scheme in [10]. The experimental result reported for each of the three schemes is the average of 1000 runs on a computer with an Intel(R) Core(TM) i7-8650U 1.9 GHz CPU and 16 GB of RAM.
For a fair comparison, we set $\lambda = 80$, $q = 2^8$, $n = 2^{12}$ (i.e., 4 kB blocks), $m = 200$, and $M = 300$. The experiment ignores addition operations and focuses on the online processing time of multiplications over $\mathbb{F}_q$, which are performed with the help of a lookup table. Table 2 shows the average computation performance of the TPA and one CSP server (both have the same configuration) during one audit, when $|\Delta| = 200$ and $300$. Here $t_{i,j}$ represents the average online computing time of entity $i$ when $j = |\Delta|$.
The result shows that RNC-Audit outperforms the other two schemes in execution efficiency. Since the scheme in [10] additionally needs to perform complex decoding operations, its user-side computation overhead is relatively large, but its server-side overhead is small. For all three schemes under comparison, the computation time increases with $|\Delta|$.
Taking the above discussion into account, the proposed scheme RNC-Audit provides proof of retrievability and privacy-preserving auditing without compromising data security, while remaining efficient enough for real-time applications and thus usable for regenerating-coding-based distributed storage.
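The lookup-table multiplication over $\mathbb{F}_{2^8}$ mentioned in the experimental setup can be sketched as follows. This is a minimal illustration in Python rather than the C implementation used in the experiments; the primitive polynomial 0x11D and generator 2 are assumptions chosen because they are common in coding applications, and the source does not state which field representation was used.

```python
# Log/antilog tables for GF(2^8) built from an assumed primitive polynomial.
PRIM_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumption, not from the paper)

EXP = [0] * 510    # EXP[i] = 2^i in GF(2^8); doubled so indices need no modulo
LOG = [0] * 256    # LOG[a] = i such that 2^i = a, for a != 0

x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1                 # multiply by the generator 2
    if x & 0x100:           # reduce modulo the primitive polynomial
        x ^= PRIM_POLY
for i in range(255, 510):
    EXP[i] = EXP[i - 255]


def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements with two table lookups and one addition."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_dot(u, v) -> int:
    """Inner product over GF(2^8); field addition is bitwise XOR."""
    acc = 0
    for a, b in zip(u, v):
        acc ^= gf_mul(a, b)
    return acc


print(gf_mul(2, 128), gf_dot([1, 2], [3, 128]))
```

Each multiplication costs a constant number of table lookups, which is why counting multiplications, as done in the evaluation, is a reasonable proxy for online computation time.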

Conclusion
Privacy-preserving data auditing is one of the key issues in distributed cloud storage applications. This paper first points out, based on our algebraic analysis, that there exists a fatal security flaw in the scheme NC-Audit [11]. Inspired by algebraic security, an improved PoR-based scheme called RNC-Audit is presented, which not only effectively prevents the algebraic analysis but also maintains competitive implementation efficiency in coding-based storage systems. Our analysis and evaluation results demonstrate that RNC-Audit is more efficient and usable than the state-of-the-art schemes in practical resource-constrained scenarios.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.