In order to save energy and reduce the total cost of ownership, green storage has become a first priority for data centers. Detecting and deleting redundant data is key to reducing CPU energy consumption, and a high-performance, stable chunking strategy is the groundwork for detecting redundant data. Existing chunking algorithms suffer severe performance degradation when confronted with big data, wasting a great deal of energy. This paper analyzes and discusses the factors affecting chunking performance and implements a new fingerprint signature calculation. Furthermore, a Bit String Content Aware Chunking Strategy (BCCS) is put forward. This strategy reduces the cost of signature computation in the chunking process to improve system performance and cuts down the energy consumption of the cloud storage data center. On the basis of the test scenarios and test data of this paper, the advantages of the chunking strategy are verified.
With the development of next-generation network computing technologies such as the Internet and cloud computing, the scale of data centers has grown explosively over the past 10 years. The total amount of global information doubles every 2 years: it was 1.8 ZB in 2011, was projected to reach 8 ZB in 2015, and within 5 years (by 2020) is expected to be 50 times higher than today's [
According to the 2005 annual report of the well-known international consulting firm Gartner [
How is electrical energy wasted in the data center? References [
Green storage technology refers to reducing data storage power consumption, carbon emissions, and construction and operation costs, while improving storage equipment performance, in the interests of environmental protection and energy saving. Studies show that the growing data contains a large amount of redundancy, and the proportion of redundancy rises as time goes on, leading to a sharp decline in the utilization of storage space. Low utilization of storage space wastes both storage resources and a great deal of energy.
In order to improve the utilization of storage space, a large number of intelligent data management technologies, such as virtualization, thin provisioning, data deduplication, and hierarchical storage, have been widely applied in the cloud data center. Deduplication, now a hot research topic, achieves energy conservation, efficient use of customers' investment, and reduced emissions and operating costs by saving large amounts of storage space, improving data read and write performance, and effectively reducing bandwidth consumption.
Research shows that a large proportion of the growing data is duplicated. Therefore, the key to reducing the data in the system is to find and delete the redundant data, and the basis for detecting it is a high-performance, stable chunking strategy.
Most of the existing literature takes advantage of signature calculation to partition data objects [
Detecting chunk boundaries through signature calculation is CPU intensive. For a 100 MB file with an expected average chunk length of 8 KB and a minimum chunk length of 2 KB, the file is expected to be divided into 12,800 chunks, and 6,144 signature calculations and comparisons are required for each chunk boundary.
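These figures follow directly from the stated parameters (a back-of-the-envelope check, assuming one signature calculation per byte position between the minimum and the expected boundary):

\[
\frac{100\ \mathrm{MB}}{8\ \mathrm{KB}} = \frac{104{,}857{,}600}{8{,}192} = 12{,}800\ \text{chunks}, \qquad
8\ \mathrm{KB} - 2\ \mathrm{KB} = 8{,}192 - 2{,}048 = 6{,}144\ \text{positions per boundary}.
\]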
The key to improving chunking performance therefore lies in cutting CPU resource consumption by reducing the number of unproductive signature calculations and comparisons.
In this paper, we build BCCS into a prototype storage system. The primary contributions of our work are as follows. First, the total signature computation of the Rabin fingerprinting algorithm is very large, so dividing data into chunks is CPU-demanding and decreases system performance. A new strategy, BCCS, is presented that reduces the overhead of generating fingerprints and converts the problem of partitioning file data stably into the matching of two binary bit strings. It samples one bit from each text byte to constitute the data fingerprint, converting signature computation into binary string comparison: only 1/8 of the original text is read, and bit operations take the place of traditional comparisons. Second, in place of the comparison operation, BCCS uses bitwise operations to optimize each matching step and rule out non-matching positions as early as possible, obtaining the maximum jumping distance to quicken the matching of the binary strings. It also reduces calculation and comparison costs by exploiting the bit feature information produced by each failed match. These measures reduce the cost of signature computation in the chunking process and lower CPU resource consumption, improving system performance.
BCCS also divides the chunking process into two steps: preprocessing and partitioning. BCCS adopts a parallel processing mode, handling preprocessing at a preprocessing node that runs concurrently with partitioning at the front-end node. This hides the preprocessing time while the system is working. Together, these measures minimize the total time consumed by data partitioning.
The rest of this paper is organized as follows. Section
In order to solve the stability problem of chunking, Muthitacharoen et al. [
Eshghi and Tang [
Bobbarjung et al. [
Much research has addressed how to avoid overly long chunks, how to choose the optimal chunk size for the best deduplication effect, and so on; nevertheless, the literature seldom mentions how to reduce the overhead of fingerprint calculation for the data in the sliding window during chunking. When the volume of data becomes large, the CPU overhead of these calculations increases greatly, and the chunking algorithm comes under enormous performance pressure.
Most existing chunking strategies [
Two key problems must be solved in order to reduce signature calculation and save CPU resources: the first is to reduce the cost of each fingerprint calculation as far as possible; the second is to minimize the number of signature computations.
Working on binary strings, BCCS borrows the incremental fingerprint calculation method of the Rabin fingerprint algorithm to convert the problem of chunking a file stably into the matching of two binary strings, reducing the overhead of chunk fingerprint generation. BCCS also adopts the leap-forward matching of the BM [
The Rabin algorithm is based on the principle that a pattern string whose length is
The Rabin algorithm is comparatively efficient, but it consumes a large amount of CPU resources on modulo operations. Therefore, the key to reducing CPU overhead is to improve the algorithm so as to minimize modulo operations.
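For reference, the following minimal sketch shows the modulo-heavy inner loop of a rolling Rabin-style fingerprint over a byte window; the radix and modulus here are illustrative assumptions, not the paper's parameters:

```c
#include <stdint.h>

#define PRIME 257ULL            /* radix (illustrative) */
#define MOD   1000000007ULL     /* modulus (illustrative) */

/* Slide the window one byte: remove 'out', append 'in'.
 * pow_top = PRIME^(window_size - 1) % MOD, precomputed once.
 * Each text byte costs two multiplications and three modulo
 * reductions; this per-byte modulo work is what BCCS avoids. */
static uint64_t rabin_roll(uint64_t fp, uint8_t out, uint8_t in,
                           uint64_t pow_top)
{
    fp = (fp + MOD - (out * pow_top) % MOD) % MOD; /* drop oldest byte */
    return (fp * PRIME + in) % MOD;                /* shift in new byte */
}
```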
The basic comparison unit in the Rabin algorithm is the byte. The new algorithm instead selects one binary bit as the basic comparison unit to accelerate matching. The process can be completed with bitwise shift instructions and no modulo operations, which greatly reduces the CPU overhead of fingerprint generation.
Therefore, the new algorithm selects the low-order bit of each basic comparison unit to form its fingerprint. If the fingerprint
Let us assume the instruction sequence SHR AL, 1; RCL BX, 1: SHR shifts the low-order bit of AL into the carry flag, and RCL then rotates the carry into the low end of BX.
Then the data stored in BX is the fingerprint of
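In C, the same bit-sampling idea can be sketched as follows (an illustration, not the paper's exact implementation): the low-order bit of every text byte is packed into a bit string, so an n-byte text yields an n-bit fingerprint string, 1/8 of the original size.

```c
#include <stdint.h>
#include <stddef.h>

/* Pack the low-order bit of each text byte into a bit string,
 * MSB first. 'bits' must hold at least (n + 7) / 8 zeroed bytes. */
static void sample_bits(const uint8_t *text, size_t n, uint8_t *bits)
{
    for (size_t i = 0; i < n; i++) {
        /* (text[i] & 1) is the sampled bit, mirroring the SHR/RCL
         * register sequence described above. */
        bits[i >> 3] |= (uint8_t)((text[i] & 1) << (7 - (i & 7)));
    }
}
```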
Assume that the length of the original pattern string is
This paper considers the following two approaches to pattern matching.
Matching a substring of the text against the pattern.
As a bit string comparison, BCCS-P differs slightly from BM in that each step compares a whole substring through a single bitwise operation. Two cases are considered separately in this paper.
As shown in Figure
The perfect matching of the text bit string.
If the match fails, the good-suffix matching of the text bit string comes into play.
As shown in Figure
Successful perfect matching of the text bit string.
As shown in Figure
The good suffix matching of the text bit string.
The pattern's jumping distance to the right is
Matching a substring of the pattern against the text.
The following two cases are considered in this paper.
As shown in Figure
The perfect matching of the pattern bit string.
Once the match fails, the good-suffix matching of the pattern bit string comes into play.
This suggests that the perfect matching of the pattern bit string is successful if the text string matches the substring of the pattern successfully
Similar to Figure
As shown in Figure
The good suffix matching of the pattern bit string.
The pattern's jumping distance to the right is still
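To make the bit string comparison concrete, here is a minimal sketch that assumes a 64-bit pattern and the MSB-first packing of sample_bits above. It shows the shift-and-compare inner loop, with no modulo operations; for simplicity it advances one bit per step, whereas BCCS-P and BCCS-T additionally apply the BM-style jumps discussed above:

```c
#include <stdint.h>
#include <stddef.h>

/* Scan the sampled bit string for a 64-bit pattern by keeping a
 * 64-bit window of the most recent text bits: one shift, one OR,
 * and one comparison per position. Returns the first matching bit
 * offset, or (size_t)-1 if there is none. */
static size_t match64(const uint8_t *bits, size_t nbits, uint64_t pattern)
{
    uint64_t window = 0;
    for (size_t i = 0; i < nbits; i++) {
        uint64_t bit = (bits[i >> 3] >> (7 - (i & 7))) & 1u;
        window = (window << 1) | bit;     /* slide the window one bit */
        if (i >= 63 && window == pattern)
            return i - 63;                /* match starts here */
    }
    return (size_t)-1;
}
```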
Through the above analysis we know that, in order to reduce the comparison cost, BCCS obtains the maximum jumping distance
The following items concern the range of
BCCS-P selects a bit substring whose length is
Compared with BCCS-P, BCCS-T selects the bit substring whose length is
The detailed discussion of the values range of the jumping distance
Assuming that
The
From [
Here,
Equation (
Compared with the traditional pattern matching algorithm, the speedup of
Through the above analysis, the bigger
This paper establishes a prototype system for big data storage based on deduplication technology. It evaluates the performance of the Bit String Content Aware Chunking Strategy and analyzes the effects of different target bit string lengths on chunking speed, chunk size, and chunk compression ratio.
The prototype system consists of two server nodes. Each node has two quad-core Xeon E5420 CPUs running at 2 GHz, with 8 GB of DDR RAM and a 6144 KB cache per core. The chunks are stored in RAID 0 on two disks (Seagate Barracuda, 7200 RPM, 2 TB each). Each node is equipped with an Intel 80003ES2LAN Gigabit card connected to Gigabit Ethernet; one node is the main server and the other is the mirror server.
Four chunking algorithms, the Rabin algorithm, the FSP algorithm, BCCS-P, and BCCS-T, are compared in the experiment.
Two test data sets are set up to test how the chunking speed of the different algorithms is influenced by file type. As shown in Table
Chunking test data sets.
Data set | File type | Set size (MB) | File number | Average file length (B) | Minimum file length (B) | Maximum file length (B)
---|---|---|---|---|---|---
Modified | Linux source codes 2.6.32, 2.6.34, 2.5.36 | 999 | 76,432 | 13,083 | 6 | 145,793
Unmodified | rmvb | 883 | 2 | 464,507,195 | 439,263,182 | 489,751,207
An adaptive chunking control scheme for the two test data sets is proposed: the prototype system automatically chooses the optimal algorithm according to the type of the data set.
The prototype system determines a file's type from its file name suffix. Multimedia files and compressed files are classified as the unmodified type, and the system maintains a static text file containing the suffixes of unmodified files; all other documents are classified as the modified type. When a file belongs to the unmodified type, the system first analyzes the suffix of the file name and then chooses the FSP algorithm or the BCCS algorithm according to whether the given parameters call for faster deduplication speed or a higher deduplication rate. Otherwise, when the file belongs to the modified data set, the system uses the BCCS algorithm directly.
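A sketch of this selection logic follows; the suffix list and the prefer_speed parameter are illustrative assumptions standing in for the static suffix file and the "given parameters" mentioned above:

```c
#include <stdbool.h>
#include <string.h>

/* Suffixes treated as "unmodified" (multimedia/compressed files);
 * the real system reads them from a static text file. This list
 * is illustrative. */
static const char *unmodified_sfx[] = { "rmvb", "mp4", "zip", "gz", "rar" };

static bool is_unmodified(const char *name)
{
    const char *dot = strrchr(name, '.');
    if (dot == NULL)
        return false;
    for (size_t i = 0; i < sizeof unmodified_sfx / sizeof *unmodified_sfx; i++)
        if (strcmp(dot + 1, unmodified_sfx[i]) == 0)
            return true;
    return false;
}

typedef enum { ALG_FSP, ALG_BCCS } alg_t;

/* 'prefer_speed' stands in for the given parameters: FSP for raw
 * chunking speed, BCCS for a higher deduplication rate. */
static alg_t choose_algorithm(const char *name, bool prefer_speed)
{
    if (is_unmodified(name))
        return prefer_speed ? ALG_FSP : ALG_BCCS;
    return ALG_BCCS;   /* modified files always use BCCS */
}
```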
BCCS divides the chunking process into two steps: the first is preprocessing, and the second is binary string matching, namely, the chunking itself.
In the preprocessing step, the system processes the input text data, extracting one bit from each text byte to generate fingerprints, which supports the subsequent chunking before the chunking module is called.
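The overlap between the two steps can be sketched with a double buffer and two threads (a minimal illustration; the buffer size and names are assumptions, and the prototype's actual node-to-node protocol is not shown):

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define BUFSZ (1 << 16)            /* 64 KB per fingerprint buffer (illustrative) */

static uint8_t  fp_buf[2][BUFSZ];  /* double buffer of sampled fingerprint bits */
static size_t   fp_len[2];
static int      fp_ready[2];       /* 1 = filled, waiting to be chunked */
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

/* Preprocessing thread: wait until a slot has been drained before
 * refilling fp_buf[slot] (e.g., with sample_bits above). */
static void wait_free(int slot)
{
    pthread_mutex_lock(&mu);
    while (fp_ready[slot])
        pthread_cond_wait(&cv, &mu);
    pthread_mutex_unlock(&mu);
}

/* Preprocessing thread: publish a freshly filled slot. */
static void publish(int slot, size_t n)
{
    pthread_mutex_lock(&mu);
    fp_len[slot] = n;
    fp_ready[slot] = 1;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mu);
}

/* Chunking thread: wait for a filled slot, return its length,
 * and hand the slot back to the preprocessor. */
static size_t consume(int slot)
{
    pthread_mutex_lock(&mu);
    while (!fp_ready[slot])
        pthread_cond_wait(&cv, &mu);
    size_t n = fp_len[slot];
    fp_ready[slot] = 0;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mu);
    return n;
}
```

The preprocessing thread alternates slots: wait_free(s), fill fp_buf[s] with sampled bits, publish(s, n). While the chunking thread consumes one slot, the other is being refilled, so the preprocessing time is hidden behind chunking.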
The experiment first compares the preprocessing time and the chunking time of BCCS-T and BCCS-P on the modified data set. The time overhead ratios are shown in Figures
Ratio of chunking time and preprocessing time for modified data set of BCCS-T.
Ratio of chunking time and preprocessing time for modified data set of BCCS-P.
BCCS proposes a minimum chunk size of 2 KB to avoid producing too many short chunks, which reduces metadata overhead and cuts unnecessary computation.
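A sketch of how the 2 KB floor interacts with the boundary scan (same assumptions as the matching sketch above: a 64-bit pattern and one sampled bit per text byte, MSB first):

```c
#include <stdint.h>
#include <stddef.h>

#define MIN_CHUNK 2048   /* 2 KB minimum chunk size proposed by BCCS */

/* Find the next boundary after 'last_cut'. Scanning resumes MIN_CHUNK
 * positions past the previous cut, so the fingerprint comparisons for
 * those positions are never performed at all. */
static size_t next_boundary(const uint8_t *bits, size_t nbytes,
                            size_t last_cut, uint64_t pattern)
{
    uint64_t window = 0;
    for (size_t i = last_cut + MIN_CHUNK; i < nbytes; i++) {
        uint64_t bit = (bits[i >> 3] >> (7 - (i & 7))) & 1u;
        window = (window << 1) | bit;
        if (i >= last_cut + MIN_CHUNK + 63 && window == pattern)
            return i;             /* cut after text byte i */
    }
    return nbytes;                /* no match: the tail forms one chunk */
}
```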
As shown in Figure
Comparison of chunking throughput on the modified data set.
The unmodified test data set is chunked by the FSP and BCCS-T algorithms, respectively. It is shown in Table
Comparison of BCCS-T and FSP algorithms for unmodified data set.
Chunking algorithm | Total time cost (ms) | Chunking time cost (ms) | Time saved (ms) | Deduplication rate
---|---|---|---|---
BCCS-T-11 | 16453 | 5218 | 10642 | 1.2061
BCCS-T-12 | 18303 | 5503 | 11588 | 1.2050
BCCS-T-13 | 19336 | 6485 | 11851 | 1.0952
BCCS-T-14 | 19627 | 5743 | 13942 | 1.0920
BCCS-T-15 | 20045 | 7956 | 13920 | 1.0920
FSP | 16167 | 686 | 15456 | 0.9771
Rapidly and stably dividing data objects into chunks of suitable size is an essential prerequisite for finding redundant data and improving the deduplication rate. Most existing chunking algorithms determine chunk boundaries from fingerprints computed over a sliding window with the Rabin algorithm, which consumes a large amount of CPU computing resources. The CPU cost of fingerprint computation rises greatly as the processed data grows huge, and the chunking algorithm comes under heavy performance pressure.
This paper proposes a novel chunking strategy, the Bit String Content Aware Chunking Strategy (BCCS), built on a study of the respective advantages of the Rabin and BM algorithms.
The primary contributions of our work are as follows: first, exploiting the characteristics of the binary bit string, BCCS simplifies the CPU-intensive fingerprint calculation into simple shift operations, following the incremental calculation method of the Rabin fingerprint algorithm; second, it converts the problem of stable data object chunking into the matching of two binary bit strings.
According to the matching subject, this paper proposes two chunking strategies based on the Bit String Content Aware Chunking Strategy, namely, BCCS-P and BCCS-T, built on the bad-character and good-suffix matching rules of BM. BCCS-P is centered on pattern matching, so the jumping distance after each mismatch is limited: the maximum jump cannot exceed the length of the pattern, and the benefit is therefore bounded. BCCS-T is centered on text matching: it rules out non-matching positions and obtains the maximum jumping distance, reducing intermediate calculation and comparison costs by making the best use of the bit feature information produced by each failed match. This measure reduces the cost of signature computation during chunking and improves system performance.
In a large data center that uses deduplication to store data, sharing data objects or chunks among multiple files reduces storage system reliability to a certain extent and also brings limitations in storage overhead and system performance. How to compensate for this defect, ensure the safety and reliability of big data storage, and provide QoS remains a problem for further study.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This research was supported by the Natural Science Foundation of Hubei Province (no. 2013CFB447).