To protect data in cloud storage, fault tolerance and efficient recovery are essential. Recent studies have developed numerous solutions based on erasure coding that address this problem with functional repair. However, two limitations remain. The first is consistency, since the Encoding Matrix (EM) differs among clouds. The second is repair bandwidth, which is a concern in most deployments. We address these two problems from both theoretical and practical perspectives. We developed BMCloud, a new cloud storage system that aims to reduce both repair bandwidth and maintenance cost. The system employs both functional repair and exact repair and inherits the advantages of both. We propose the JUDGE_STYLE algorithm, which decides whether the system should adopt exact repair or functional repair. We implemented a networked storage system prototype and demonstrated our findings. Compared with existing solutions, BMCloud can be used in engineering practice to significantly reduce repair bandwidth and maintenance cost.
With the rapid growth of data production in companies, the demand for storage space grows rapidly as well. This growth has led to the emergence of cloud storage, a concept that extends and develops cloud computing. A cloud storage system brings application software together to provide data storage and business access features through grids or distributed file systems [
According to forecasts from the International Data Corporation (IDC), the size of the global cloud computing and cloud storage market has increased from $16 billion in 2008 to $42 billion in 2012 [
In order to improve the reliability of cloud storage, some companies have developed their own solutions, such as HDFS [
We focus here on the recovery problem for a family of network codes. Currently, coding across clouds is a popular way to tolerate large-scale cloud node failures. Because bandwidth consumption is an important performance indicator in a cloud storage system, we always want to repair data using the minimum bandwidth and at the fastest repair speed. The model for a cloud file system using erasure codes is inspired by NCCloud [
In this paper, we propose the Bandwidth and Maintenance minimum Cloud system (BMCloud), which has low bandwidth consumption and low maintenance overhead. The system supports both functional repair and exact repair when data loss or data errors occur. When part of the data in a cloud is lost, the system can recover it with exact repair, which reduces future maintenance overhead. More importantly, it consumes less bandwidth during recovery. When almost all the data in a cloud is lost, the system recovers it with functional repair. We have developed the JUDGE_STYLE rule to decide whether the system should use exact repair for a given recovery.
The system includes a proxy which can calculate and transfer data through the clouds and can maintain the system consistency. In the exact repair function, the proxy itself does not require arithmetic processing of the data or the data cache, and it does not need to provide calculation and storage capabilities. Since most of the calculation work is loaded on the cloud nodes in this function, which will be described in Section
The contributions of this paper are as follows. First, we have developed the E Code algorithm to recover data with exact repair. It improves recovery bandwidth performance and ensures data integrity. Second, we have developed the JUDGE_STYLE algorithm, which decides whether the system should use exact repair or functional repair. Third, we have implemented the BMCloud system. When the number of failed strips in the cloud is at most 4, or is between 4 and 10 and the data can still be recovered by exact repair, BMCloud improves repair bandwidth by 53.9% and 41.8% compared to current solutions (RDP, NCCloud, etc.). In addition, it reduces maintenance cost.
The rest of this paper is organized as follows. Section
Since Patterson et al. [
Maximum Distance Separable (MDS) codes tolerate the maximum number of failures for a given amount of redundancy. They are widely used in RAID-6, so most RAID-6 codes are MDS codes. The most important feature of MDS codes is that the length of the chain equals the number of disks, which means that all disks are involved in checking, so the maximum disk utilization can be reached. There are also codes that differ from MDS, in which the length of the chain is less than the number of disks, such as in M-code [
With the development of networks, especially the development of the cloud in recent years, erasure codes have also moved from the disk array level to the cloud level. In network coding, there are three kinds of repair: exact repair, functional repair, and exact repair of systematic parts.
Functional repair has been researched to the greatest extent of the three [
After Rashmi et al. [
Exact repair is also proven theoretically. With the idea of interference alignment [
Different from the theory analysis above, NCCloud has proposed the implementable design for the functional repair. The system has divided each cloud into two parts and uses its algorithm, so that time for data reads can be reduced by 25% compared to time taken to reconstruct the whole file. There are cloud storage systems that provide a scalable platform for storing massive data over multiple storage nodes via erasure coding [
In this paper, we want to develop a cloud system applied as a deep archive and make some progress on the base of an existing system. We specifically design BMCloud under a thin cloud assumption—that the remote data center storing the backups does not provide any special backup services, as Cumulus [
Because bandwidth consumption is so important, we put it first on this list. Network coding has made some advances on this issue, but we believe that there is still plenty of room for improvement. That is why bandwidth is one of the most important factors to consider. We wanted to develop an extra layer on top of functional repair to create a hybrid system, and we finally chose E Code, which is a RAID-6 code with excellent I/O properties. In BMCloud, we carefully apply E Code on the cloud platform and preserve its excellent I/O properties (see details in Section
In BMCloud, we want to improve the functional repair model in some aspects, such as bandwidth and computing cost.
Existing systems such as NCCloud, which use functional repair models, need to regenerate the EM to repair failed nodes, which may cause problems that should not be ignored. Since each fault leads to regeneration of a whole node's data and an EM update, the maintenance overhead of the EM may rise significantly, because generating a new EM takes many computing resources. After a certain number of faults, the maintenance overhead of the whole system may become unacceptable.
In BMCloud, E Code can deal with faults the size of one or two F-MSR blocks and repair them exactly. A fault repaired by E Code does not require regenerating and updating the EM. This property guarantees that the EM stays stable for a relatively long time. Furthermore, the computing cost of repair with E Code is much less than with F-MSR, so the system does not spend as many computing and bandwidth resources to maintain stability. Since small-scale faults constitute the majority of regular faults, E Code can be very helpful.
From that, we know that even though we added an extra layer in our system, there is still room for improvement in terms of better computing and bandwidth performance.
Though faults on a large scale seldom occur, cloud systems must have a mechanism to deal with these lethal problems. F-MSR code is an excellent approach to a solution and has an acceptable bandwidth cost. We preserve the fault tolerance requirement and repair traffic with F-MSR (with up to a small constant overhead) as compared to the conventional repair method in erasure codes.
In BMCloud, we improve the performance of bandwidth, stability, and other aspects on the premise of the preservation of the properties of F-MSR code.
In BMCloud, we carefully designed the repair model so that it can provide the most economical and stable solution for faults on different scales. Furthermore, BMCloud has excellent expandability. The number of nodes in E Code can be any number larger than 4, and this property helps the E Code layer connect to F-MSR seamlessly. If a better system using the functional repair model appeared, it would be very convenient to deploy E Code on it, so the mechanism of BMCloud can be applied in many situations.
For an (
In E Code, we define any two cloud nodes adjacent to each other as an E Code area. An exception occurs for the first cloud node, which is grouped with the last node in an E Code area. To a specific file, an E Code area contains two F-MSR chunks.
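As a minimal sketch of this grouping, the areas can be enumerated as a ring of adjacent node pairs, with the last node wrapping around to the first. The function name `ecode_areas` and the zero-based node indices are illustrative assumptions, not part of BMCloud.

```python
def ecode_areas(num_nodes: int) -> list[tuple[int, int]]:
    """Pair each cloud node with its next neighbor, wrapping around the ring.

    Node indices are zero-based here (an assumption made for illustration).
    """
    if num_nodes < 2:
        raise ValueError("an E Code area needs at least two cloud nodes")
    # The final pair (num_nodes - 1, 0) groups the last node with the first.
    return [(i, (i + 1) % num_nodes) for i in range(num_nodes)]


print(ecode_areas(4))
```

With four nodes this yields four areas, and every node takes part in exactly two of them.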
A stripe is a concept borrowed from traditional RAID codes: it is an independent coding group of strips. Every stripe in E Code belongs to a specific E Code area.
After E Code encoding, an additional strip is appended to each stripe, so each F-MSR chunk holds more strips. We define the code chunk after extension as an extended F-MSR chunk. Table
Notations of BMCloud.
Notations | Description |
---|---|
F-MSR chunk | A data unit in F-MSR code |
Stripe | A coding group containing a collection of strips in E Code |
Strip | A data unit of a stripe in E Code |
E Code area | Two neighboring cloud nodes in one stripe |
Encode matrix (EM) | Coefficient matrix in the F-MSR code |
Extended F-MSR chunk | A code chunk after E Code encoding and extension |
In order to solve the concerns mentioned in Section
Software architecture of BMCloud.
BMCloud consists of three modules (coding, storage, and protocol) and defines four workflows: download, update (upload), delete, and repair. Basic file system functions and data structure definitions, as well as consistency control and other utility functions, are also included in the system.
BMCloud is mounted on Linux with FUSE (filesystem in userspace), and the basic filesystem functions are supported by user-defined code. Our system implements only reading and writing, rename, link, creating a new folder, changing file attributes, and other basic functions, because a fully featured filesystem is not the point of BMCloud. BMCloud uses Hadoop ZooKeeper to provide coordination services and ensure the system's strict consistency. The workflow part mainly defines the encoding, decoding, update (upload), download, repair, and delete operations; it implements the encoding and decoding operations and the repair strategy of FMSR and E Code.
The underlying part has three modules. The coding module provides a variety of basic encoding methods, including FMSR, RAID0, RAID1, and RS coding, of which we focus on FMSR, a functional repair coding method. The storage module adapts to different network environments and cloud vendors and provides an interface for basic I/O. The protocol module is a specialized extension module for E Code that defines the generation and transmission of the parity blocks in E Code.
Figure
An example of an E Code encoded process.
Define the row sequence and the column sequence in a strip in the left cloud node as
Therefore, we know that
We use an algorithm to improve recovery bandwidth performance and ensure data integrity. The details are given in Algorithm
Requirements:
Native chunks;
Encoding request;
Main parameters (
(a) Generate random encoding coefficient vectors;
(b) If
(c) Compute the product of encoding matrix and native
(a) Divide each F-MSR chunk into stripes and strips;
(b) For stripe
(c) Consolidate the strips into extended F-MSR chunks
F-MSR code is a code using the functional repair method, which is an important foundation of BMCloud. In F-MSR code, the system utilizes EM to record the mathematical relation between the original data and encoded data. From the EM, the system can retrieve the original data. When faults occur, the system can repair them by calculating a linear transformation on EM and regenerating new encoded data on the cloud nodes.
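The role of the EM can be illustrated with a small sketch in which each code chunk is a linear combination of the native chunks, with coefficients taken from one row of the EM. The finite field GF(2^8), the reduction polynomial 0x11D, and the names `gf_mul` and `encode` are assumptions chosen for illustration; this is not the BMCloud or NCCloud implementation.

```python
def gf_mul(a: int, b: int, poly: int = 0x11D) -> int:
    """Multiply two bytes in GF(2^8): carry-less multiply, reduced by poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:  # degree reached 8, so reduce by the field polynomial
            a ^= poly
        b >>= 1
    return result


def encode(em: list[list[int]], native_chunks: list[bytes]) -> list[bytes]:
    """Return one code chunk per EM row, each a combination of native chunks."""
    length = len(native_chunks[0])
    coded = []
    for row in em:
        chunk = bytearray(length)
        for coef, native in zip(row, native_chunks):
            for i, byte in enumerate(native):
                chunk[i] ^= gf_mul(coef, byte)  # addition in GF(2^8) is XOR
        coded.append(bytes(chunk))
    return coded


native = [b"\x01\x02\x03", b"\x10\x20\x30"]
em = [[1, 0], [0, 1], [1, 1], [2, 3]]  # hypothetical coefficients
code_chunks = encode(em, native)
```

Because the EM records exactly which combination each code chunk holds, any set of rows that forms an invertible matrix is enough to retrieve the native chunks.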
The encoding process preserves the F-MSR code block. It just adds a parity block of E Code to the data, so that in daily use, there is no need for the system to decode the block twice. The added parity block will only be used in the repair process.
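The following sketch illustrates why an added parity strip leaves the data strips untouched yet still allows a lost strip to be rebuilt. It assumes the parity is the bytewise XOR of the strips on one data-link, as is common for RAID-6 array codes; the actual E Code parity equations are not reproduced here, and the strip contents are made up.

```python
from functools import reduce
from operator import xor


def xor_strips(strips: list[bytes]) -> bytes:
    """Bytewise XOR of equally sized strips."""
    return bytes(reduce(xor, column) for column in zip(*strips))


# Three hypothetical data strips cut from an F-MSR chunk; they are stored as-is.
data_strips = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_strips(data_strips)

# Lose the middle strip, then rebuild it from the survivors plus the parity.
rebuilt = xor_strips([data_strips[0], data_strips[2], parity])
```

Normal reads use only `data_strips`, so the parity is consulted solely during repair.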
We choose E Code because it introduces redundancy by adding parity blocks, which leaves the contents of the data unchanged.
The E Code layer can improve the stability of the system while not delaying the response time of the system.
The dark gray squares are the parity strips of E Code. The other 12 strips are data strips which are partitions of F-MSR chunks. In Figure
This dividing method gives the system repair abilities at a smaller granularity, which enables BMCloud to avoid repairing whole nodes in most situations. In other words, BMCloud avoids the cost of recomputing the data of a whole node and the bandwidth cost of updating the EM.
The prototype system of BMCloud is constructed of a proxy and several cloud servers (cloud nodes) in heterogeneous environments. Figure
The communication ring of cloud servers.
As a controller, the proxy coordinates data transmission between the cloud nodes and maintains the system's consistency. In the idealized mode, the proxy has no need to compute or store any data, which strengthens the system's scalability. In a large-scale archive environment, the requests processed by the proxy remain limited and affordable for current servers or cloud servers. Therefore, the proxy will not become the bottleneck in this prototype.
Data packages could be directly transmitted between cloud nodes, cutting the bandwidth cost from multiple transmissions. Cloud nodes may need computing capacity to encode and decode the data, undertake the recoveries, and respond to instructions from proxies or requests from other cloud nodes.
We employ a distributed structure in which all the cloud nodes and the proxy can compute independently, so a unified protocol is needed to control and maintain the system.
As with single-fault recovery, there are two equivalent strategies, recovery with the last node or recovery with the next node. In the former case, there will be two kinds of orders from the proxy, the ones to the previous node and the ones to the recovering node. In order to prompt this, we will send an order formatted as follows for prenode (
Encode the original data with F-MSR code. In this process, the encoded data is divided and distributed to several cloud nodes.
Every F-MSR chunk is divided into (
Calculate the data of the parity strips and insert it into the reserved space in
Furthermore, since BMCloud is built on the foundation of F-MSR code, the system has cloud-level fault tolerance. When one cloud node fails, the system can repair the data from the surviving nodes.
Every neighboring cloud node pair is defined as an E Code area. Any stripe belongs to one and only one E Code Area. In an E Code area, the cloud node on the left owns (
In BMCloud, most strips are located on two data-links. In order to improve the recovery ability of BMCloud, we added a parity strip
E Code in our system has excellent recovery abilities. When the number of the failed strips is less than or equal to 4, BMCloud will recover the data with an exact recovery policy. All faults in this scale can be repaired by the E Code layer exactly. In extended-F-MSR chunks, we divide the strips into two families: data strips and parity strips. In these cases, we use vectors (
S(1,3), S(1,6), S(1,10), S(1,13): these can be easily repaired because each lies on a data-link that has only one failed strip. S(1,3), S(1,6), S(1,7), S(1,10): S(1,3) and S(1,10) can be repaired first; then the recovery trace of S(1,6) and S(1,7) is obvious. S(1,2), S(1,3), S(1,4), S(1,7): S(1,7) is repaired first, and from S1 we can repair S(1,4); then S(1,2) can be recovered by a data-link-square, and S(1,3) by a data-link-upside-down triangle. S(1,3), S(1,4), S(1,6), S(1,7): from the four involved data-links, we get the equation set as follows:
When the number of failed strips is between 4 and 10, the E Code layer can handle the faults except in some special situations. BMCloud first scans all the failed strips and checks their relationships. Then, the system tries to repair the fault with a greedy algorithm. The details are in Algorithm
Requirements:
Native chunks;
Repair request;
Main parameters (
(a) Scan failed strip
(b) If
then restore
else
(c) List the equation set of the rest of the failed strips from the related data-links
(d) If the equation set is soluble
then restore all the failed strips
else got
(a) Repair the whole node with F-MSR code;
(b) Recalculate the related parity strip and update all the EMs on every cloud server;
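The greedy step can be sketched as a peeling loop: repeatedly find a data-link with exactly one failed strip, repair that strip, and rescan. The sketch tracks only the repair order, not the strip contents. The function name `greedy_repair_order`, the strip labels, and the link layout are hypothetical and do not reproduce the real E Code geometry.

```python
def greedy_repair_order(data_links, failed):
    """Return (repair order, strips the greedy pass could not repair).

    data_links: iterable of collections of strip labels, each link protected
    by one parity relation. failed: labels of the failed strips.
    """
    links = [set(link) for link in data_links]
    remaining = set(failed)
    order = []
    progress = True
    while remaining and progress:
        progress = False
        for link in links:
            missing = remaining & link
            if len(missing) == 1:  # a single unknown on this link is solvable
                strip = missing.pop()
                order.append(strip)
                remaining.discard(strip)
                progress = True
    return order, remaining


# Hypothetical layout: two overlapping links, each with its own parity strip.
links = [{"a", "b", "p1"}, {"b", "c", "p2"}]
order, stuck = greedy_repair_order(links, {"a", "b"})
```

Strips left in the second return value are the ones for which an equation set must be listed and solved, or which force a fallback to the F-MSR layer.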
When the number of failed strips is larger than 10, the scale of the fault exceeds the upper bound of the repair ability of the E Code layer, so BMCloud repairs the fault with the F-MSR layer. The data of the whole node is regenerated by F-MSR code from the data on the surviving cloud servers. After regenerating the new F-MSR chunk, BMCloud recalculates the parity strips and adds them to the F-MSR chunk to restore and extend it on a new cloud server. Meanwhile, the related parity strip on the related cloud server is updated.
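The three cases described above can be summarized in a short decision sketch. It follows the thresholds stated in the text (at most 4, between 4 and 10, more than 10) but is only an illustration of the rule, not the authors' JUDGE_STYLE implementation; the function name `judge_style` and the boolean `ecode_solvable` are assumptions.

```python
def judge_style(num_failed: int, ecode_solvable: bool = True) -> str:
    """Pick a repair style from the number of failed strips.

    ecode_solvable stands in for the outcome of the greedy scan and the
    equation-set check in the 5..10 range (an assumption for illustration).
    """
    if num_failed < 1:
        raise ValueError("nothing to repair")
    if num_failed <= 4:
        return "exact"       # the E Code layer repairs all faults of this size
    if num_failed <= 10:
        return "exact" if ecode_solvable else "functional"
    return "functional"      # beyond the E Code layer: regenerate via F-MSR


for n, ok in [(3, True), (7, True), (7, False), (12, True)]:
    print(n, ok, judge_style(n, ok))
```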
Table
Monthly price plans (in US dollars) for Amazon S3 (US Standard) and Windows Azure Storage, as of January, 2013.
Pricing item | S3 | Azure |
---|---|---|
Storage (per GB) | $0.064 | $0.062 |
Data transfer in (per GB) | free | free |
Data transfer out (per GB) | $0.120 | $0.119 |
PUT, POST (per 10 K requests) | $0.100 | $0.010 |
GET (per 10 K requests) | $0.010 | $0.010 |
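To make the price plan concrete, the sketch below turns the table into a small cost function. The prices are copied from the table; the function name `monthly_cost` and the workload figures in the example call are hypothetical.

```python
# Prices in US dollars, as listed in the table (January 2013).
PRICES = {
    "S3": {"storage_gb": 0.064, "out_gb": 0.120, "put_10k": 0.100, "get_10k": 0.010},
    "Azure": {"storage_gb": 0.062, "out_gb": 0.119, "put_10k": 0.010, "get_10k": 0.010},
}


def monthly_cost(provider, stored_gb, out_gb, put_requests=0, get_requests=0):
    """Monthly bill for one provider; inbound transfer is free in both plans."""
    p = PRICES[provider]
    return (
        stored_gb * p["storage_gb"]
        + out_gb * p["out_gb"]
        + put_requests / 10_000 * p["put_10k"]
        + get_requests / 10_000 * p["get_10k"]
    )


# Hypothetical workload: 100 GB stored and 5 GB read back during repairs.
for name in PRICES:
    print(name, round(monthly_cost(name, stored_gb=100, out_gb=5), 3))
```

Since outbound transfer is the only traffic that is billed, data downloaded during repair feeds directly into the `out_gb` term.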
However, in the analysis, we have ignored three practical considerations: the computing cost, the size of metadata, and the number of requests issued during repair, because we considered these values negligible in real-life applications.
Computing cost: in real-life applications, the MTTF of a commercial cloud is very long (the quoted figure is 99.999999999%), so repairs are rare and guaranteeing high availability costs few computing resources.
Metadata size: in BMCloud, regardless of the size of the data files, the F-MSR metadata size is always within 160 B. In the evaluation, we used a 512 MB file to test the response time of BMCloud; compared to the size of the test file, the metadata size is negligible. In real-life applications, the size of a data file usually exceeds 1 GB, so the effect of metadata on the system is too small to need evaluation.
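A quick calculation shows how small the metadata is relative to the test file. It assumes 1 MB = 2^20 bytes, which the text does not state.

```python
METADATA_BYTES = 160             # upper bound on F-MSR metadata from the text
FILE_BYTES = 512 * 2**20         # 512 MB test file, assuming 1 MB = 2**20 bytes

ratio = METADATA_BYTES / FILE_BYTES
print(f"metadata is {ratio:.2e} of the file size")
```

Under this assumption the metadata is well below one millionth of the file size.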
In this part, we deploy our BMCloud prototype in a real local cloud environment to evaluate the system performance of the response time. This cloud storage environment is chosen to carry out this analysis in order to evaluate the performance without the effects of network fluctuations. We may continue the analysis on commercial clouds as a future work. All results are averaged over 50 runs.
The experiment is implemented on a storage platform based on OpenStack Swift. The proxy of the system is installed on a laptop with an Intel Core i5-580 and 8 GB RAM. This machine is connected to an OpenStack Swift platform attached to a number of storage servers with Xeon E5606 CPUs and 8 GB DDR3 RAM. We create 6 containers on this platform: 4 containers act as the 4 cloud nodes and the other 2 act as spare nodes, constituting the testing environment of F-MSR (
We test the response time of the three most important operations in the workflow of BMCloud: file upload, file download, and file recovery. For each operation workflow, we keep a record of the detailed time costs in every type of operation. We use random files sized from 1 MB to 512 MB as the test data set. RAID-6 Reed-Solomon code is chosen as a control group. There are two types of recovery situations, considering whether the failed node is native or parity.
Figures
File upload response times of BMCloud on a local cloud.
File download response times of BMCloud on a local cloud.
Recovery response times of BMCloud on a local cloud.
Figures
512 MB file response time detailed analysis (Functional).
512 MB file response time detailed analysis (exact).
In the recovery process, BMCloud shows a shorter response time than RAID-6. BMCloud needs to download less data during repairs than RAID-6 and NCCloud. There are two kinds of recovery methods in BMCloud: exact repair and functional repair. Exact repair sharply curtails repair bandwidth and hence repair response time, which is our main advantage. In repairing a 512 MB file, NCCloud spends 9.593 s in download, while the native-chunk repair of RAID-6 spends 12.124 s and BMCloud spends only 5.583 s for all the data. The response time of BMCloud is 41.8% and 53.9% better than that of NCCloud and RAID-6, respectively. On the other hand, BMCloud spends 11.467 s in functional repair, which is a little less than NCCloud and RAID-6.
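The reported percentages can be checked directly from the three timings, taking the improvement as the relative reduction in time against each baseline. The helper name `improvement` is illustrative.

```python
def improvement(baseline_s: float, bmcloud_s: float) -> float:
    """Relative reduction in response time, in percent."""
    return (1 - bmcloud_s / baseline_s) * 100


BMCLOUD_EXACT_S = 5.583
vs_nccloud = improvement(9.593, BMCLOUD_EXACT_S)
vs_raid6 = improvement(12.124, BMCLOUD_EXACT_S)
print(f"vs NCCloud: {vs_nccloud:.2f}%  vs RAID-6: {vs_raid6:.2f}%")
```

The first value comes out at about 41.80% and the second at about 53.95%, which matches the reported 53.9% if the figure is truncated rather than rounded.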
We have not tested our system on a commercial cloud since our test environment is limited. But from NCCloud [
In this paper, we developed a low repair bandwidth, low maintenance cost cloud storage system named BMCloud. It uses the exact repair algorithm E Code to reduce repair bandwidth, and it also provides functional repair to recover data in any condition. The JUDGE_STYLE algorithm helps BMCloud decide in which cases exact repair is used and in which cases functional repair is used.
We implemented the system and conducted experiments which show that BMCloud is effective in reducing repair bandwidth and maintenance costs. The results show that the response time of BMCloud is 41.8% and 53.9% better than those of NCCloud and RAID-6, respectively, when the system recovers with exact repair.
Much work remains to be done in the future, mainly in two directions. First, we will focus on the recovery of a single-disk failure and run experiments to verify the performance of BMCloud. Second, we will take a twin-code model into account and make the system more suitable for cloud storage.
In conclusion, we believe that BMCloud is an attractive cloud storage system: one that offers low repair bandwidth, while achieving low maintenance cost.
This work is supported in part by the National Basic Research Program of China under Grant no. 2011CB302303, the NSF of China under Grant no. 60933002, and the National High Technology Research and Development Program of China under Grant no. 2013AA013203.