Replication technology is commonly used to improve data availability and reduce data access latency in cloud storage systems by providing users with different replicas of the same service. Most current approaches focus largely on improving system performance and neglect management cost when deciding the number of replicas and their storage locations, which places a great financial burden on cloud users: in a pay-as-you-go paradigm, the cost of replica storage and consistency maintenance can grow into high overhead as new replicas are added. In this paper, towards achieving an approximate minimum data set management cost benchmark in a practical manner, we propose a replica placement strategy from a cost-effective view, on the premise that system performance meets requirements. First, we design data set management cost models, including storage cost and transfer cost. Second, we use access frequency and average response time to decide which data sets should be replicated. Then, a method for calculating the number of replicas and their storage locations with minimum management cost is proposed based on the location problem graph. Both theoretical analysis and simulations show that the proposed strategy offers lower management cost with fewer replicas.
Today, several cloud providers offer storage as a service, such as Amazon S3 [
Compared with the definitions of conventional computing paradigms such as cluster [
It is obvious that client access latency can be reduced as the number of replicas increases, and every client wants to access its data set from a replica that is as close as possible in order to minimize delay. However, although replicating all data sets to all data centers can ensure low-latency access, there are at least two challenges [
In practice, CSPs supply a pool of resources, such as hardware (storage, network), development platforms, and services, at a cost. Data set storage and transfer costs are the two most important components of data management cost, caused by storage resource and bandwidth consumption, respectively. As the number of replicas increases, the transfer cost declines because data sets can be transferred more efficiently, but the storage cost grows because of the new replicas' existence. That is to say, too many replicas in the cloud may lead to high storage cost, and without a suitable replica placement strategy the increased storage cost may exceed the reduced transfer cost. Therefore, it is urgent to strike a balance in deciding whether or not to replicate popular data sets.
Based on the analysis above, at least three important issues must be solved to achieve a minimum-cost data set replica placement scheme: (1) whether or not to create a replica in the cloud computing environment; (2) how many data set replicas should be created in the cloud; (3) where the new replicas should be placed to meet the system's task success rate and bandwidth consumption requirements.
Therefore, in this paper, towards achieving the minimum-cost replica distribution benchmark in a practical manner, we propose a replica placement strategy model, including a way to identify the necessity of creating a replica, and design a replica placement algorithm that can substantially reduce the total cost in the cloud.
The main contributions of this paper include (1) proposing data set management cost models, involving storage cost and transfer cost; (2) presenting a novel global data set replica placement strategy from a cost-effective view, named MCRP, which is an approximate minimum-cost solution; (3) evaluating replica placement algorithms by analysis and simulations.
The remainder of this paper is organized as follows: Section
In this section, we present the background related to data set replica placement and management cost models in the cloud.
Data set replication is considered an important technique in cloud computing environments for speeding up data access, reducing bandwidth consumption and users' waiting time, and increasing data availability [
As a result, the above-mentioned replication technologies do not take data set storage and transfer costs into account, although these are the most important factors for clients, especially small businesses, in deciding whether or not to use a cloud storage system. Therefore, in this paper we consider data set management cost as the basis for replica placement, in order to minimize storage and transfer costs on the premise that system performance satisfies data set availability requirements.
In a pay-as-you-go paradigm, all resources in the cloud carry certain costs, so the more replicas we keep, the more we have to pay for the corresponding resources. Some replicas may be reused often while others may not. Therefore, once we decide to create a replica, we need to evaluate its access frequency as well as its management cost, especially since large data sets, or "big data", are common in the cloud. In [
Based on this analysis, it is necessary to design a data set replica placement strategy from a cost-effective view. This research is significant for business, especially for small businesses that use big data on cloud computing platforms.
In this section, we first present some concepts of the cloud environment; then we propose the storage cost model and the single transfer cost model, respectively. Finally, we present the data set management cost model in the cloud.
In a cloud storage system, distributed data centers store the data sets. Each data center has properties such as storage capacity, CPU speed, network bandwidth, and read/write speed. Different data center configurations lead to different quality of service (QoS).
Cloud computing environment (CCE) can be regarded as a set of distributed data centers, written as
Figure
Architecture of cloud environment.
In CCE, each data center
Data set
In the following section, we assume that the architecture of CCE and data set
In a commercial cloud computing environment, service providers have their own cost models for charging users. For example, Amazon cloud service's prices are as follows: $0.15 per gigabyte per month for the storage resource [
Data set
That is to say, the total storage cost is the CSP's storage cost ratio function multiplied by the size of the data set and its storage time. For example, using Amazon S3's storage pricing, suppose a 0.5 TB (512 GB) data set has been stored for 6 months. The storage cost is $0.15 × 512 × 6 = $460.80.
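The storage cost model above can be sketched as a short program. This is an illustrative sketch, not the paper's exact notation: it assumes a flat storage cost ratio, using the $0.15 per GB per month Amazon S3 rate quoted in the text.

```python
# Sketch of the storage cost model: cost = rate * size * duration.
# The flat $0.15/GB/month rate follows the Amazon S3 price in the text;
# a real CSP ratio function may vary with volume or region.

STORAGE_RATE = 0.15  # USD per GB per month (flat-rate assumption)

def storage_cost(size_gb: float, months: float, rate: float = STORAGE_RATE) -> float:
    """Storage cost of one data set replica held for `months` months."""
    return rate * size_gb * months

# Example from the text: a 0.5 TB (512 GB) data set stored for 6 months.
cost = storage_cost(512, 6)  # ≈ 460.8 USD
```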
In the cloud, data set transfers are necessary once a request arrives, and transfer cost is inevitably generated in the process by network bandwidth consumption. In this model, input data set transfers are free, whereas output transfer cost varies with data set volume and the CSP's atomic transfer cost ratio function.
Data set
It is noted that the transfer time
In this paper, we facilitate a data set
Data set
Let us introduce a simple example: a 500 GB data set is stored in the cloud for one month, so the storage cost is $0.15 × 500 = $75.
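Putting the two components together, the management cost model can be sketched as below. The $0.15/GB/month storage rate comes from the text; the $0.12/GB outbound transfer rate is an assumed placeholder for the CSP's transfer cost ratio, since the source elides the exact figure.

```python
# Minimal sketch of the data set management cost model:
# total cost = storage cost + transfer cost. Inbound transfers are free
# in this model; only outbound transfers are charged.

STORAGE_RATE = 0.15   # USD per GB per month (from the text)
TRANSFER_RATE = 0.12  # USD per GB transferred out (assumed placeholder)

def management_cost(size_gb, months, n_transfers,
                    s_rate=STORAGE_RATE, t_rate=TRANSFER_RATE):
    storage = s_rate * size_gb * months
    transfer = t_rate * size_gb * n_transfers  # outbound requests served
    return storage + transfer
```

With more replicas, `n_transfers` per replica drops (requests are served nearer to users) while the storage term is paid once per replica, which is exactly the tradeoff the strategy balances.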
In this section, we present the replica scarce resource model, which determines whether or not to create a new replica in the cloud.
Many data sets are stored in the data centers of the cloud environment, and it is not necessary to replicate every data set on all data centers. It is wise to replicate popular data sets with high access frequencies in order to reduce data set transfer delay. Accordingly, we define the replica scarce degree as the criterion for adding replicas.
For a data set
If a data set
Response time (rt) is the time that elapses from when a service requests a data set until the user receives the complete data set.
In other words, the average response time is calculated from the initial time when the request is submitted until the final response is received from the target node.
Average response time (art) is the ratio of the total response time to the number of requests per unit of time and can be represented as follows: art =
Average response time (art) is a basic parameter for determining the number of replicas and their storage places, because art can be reduced by placing replicas on data centers. However, art is not the only valid parameter for creating replicas, because average response time depends on a number of factors, such as bandwidth and data set size: the bigger the data set, the longer the average response time. Therefore, we characterize replica scarcity by also taking data set size into account.
A data set
To sum up, it is necessary to create replicas for scarce data sets, and two important factors must be considered before creating a new replica: (1) longer average response time and (2) higher request frequency.
For data sets with low request frequency, and for those with high request frequency but short response time, there is no need to place replicas from a cost-effective view. Algorithm
(01) count the data set
(02) set
(03) sum the data set
(04) set
(05) set
(06) if ((
(07) return true;
(08) else
(09) return false;
(10) End.
Algorithm
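The replica-necessity test can be sketched in a few lines. This is a hedged reconstruction of the structure of the algorithm above: a data set is flagged for replication only when both its access frequency and its average response time exceed thresholds over the observation period. The threshold values and the log-record shape are illustrative assumptions, not the paper's exact parameters.

```python
# Sketch of the replica-necessity check: count the data set's requests
# over the observation period, compute the average response time, and
# return True only when both exceed their thresholds.

def needs_replica(access_log, freq_threshold, art_threshold):
    """access_log: list of (timestamp, response_time) records for one
    data set over the observation period (an assumed format)."""
    freq = len(access_log)                  # requested times in the period
    if freq == 0:
        return False                        # never requested: no replica
    total_rt = sum(rt for _, rt in access_log)
    art = total_rt / freq                   # average response time
    return freq > freq_threshold and art > art_threshold
```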
Once the decision to create a replica has been made, the most urgent problem is where to place it. In this subsection, we present a replica placement algorithm from a cost-effective view.
It is obvious that the replica's candidate storage places are not unique, but rather form a set of data centers. The most economical choice is the data center with lower storage cost and lower transfer cost to the other data centers, on the premise of a shorter average response time. Algorithm
primitive data set
(01) set
(02) set
(03) for each data center
(04) begin
(05) //Assuming the replica stores on
(06) set
(07) set
(08) for data center
(09) begin
(10) set
(11) set
(12) end
(13) set
(14) if (
(15) continue;
(16) else
(17) begin
(18) calculate the storage cost
(19) set
(20) set
(21) for data center
(22) begin
(23) calculate transfer cost
(24) calculate transfer cost
(25) set
(26) set
(27) end
(28) if
(29)
(30) end
(31) end
(32) return
In Algorithm
Here, we will analyze the time complexity. Suppose there are
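The single-replica placement search can be condensed into a short sketch following the structure of the algorithm above: every data center is a candidate; candidates violating the average-response-time requirement are skipped; among the rest, the one with minimum storage-plus-transfer cost wins. The input record shape is an assumption for illustration.

```python
# Sketch of the single-replica placement search: filter candidates by the
# response-time requirement, then pick the minimum-cost survivor.

def place_replica(candidates, art_limit):
    """candidates: list of dicts with keys
       'name'     - data center identifier,
       'art'      - average response time if the replica is stored here,
       'storage'  - storage cost of the replica at this data center,
       'transfer' - total transfer cost to the other data centers.
       (This record shape is an illustrative assumption.)"""
    best, best_cost = None, float("inf")
    for dc in candidates:
        if dc["art"] > art_limit:          # performance requirement not met
            continue
        cost = dc["storage"] + dc["transfer"]
        if cost < best_cost:
            best, best_cost = dc["name"], cost
    return best, best_cost
```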
In the previous sections, we tentatively placed one replica on a data center and obtained high system performance with the lowest data set management cost. However, the number of replicas and their storage places are still urgent problems to be solved from a cost-effective view in practice. In this section, we present an approximate minimum-cost replica placement algorithm based on the location problem (LP).
In order to formulate the minimum-cost replica placement problem, we make the following assumptions: (1) The cloud computing environment is a customer-to-server system, in which the data sets themselves travel to the facilities to be served; that is, the user requests the data set itself for further analysis, not a result computed or queried from it. (2) Each data center represents a candidate replica location as well as a data set demand point, because clients request data via data centers. (3) Only one replica may be located per data center. (4) The replica service is uncapacitated; that is, a replica may serve an unlimited number of data set requests.
In its simplest form, the minimum-cost replica placement problem is as follows: we are given a set of data centers, which represent demand points as well as candidate replica placements, and a set of connections between each pair of data centers. Each connection has a transfer cost per unit data set, and each data center is associated with a charge for data set storage. All demands must be routed over the connections to the nearest replica. The problem is to find the set of replica placements that minimizes the total management cost: the sum of data set storage and transfer costs.
Ideally, the optimal minimum-cost replicas placement is shown in Figure
Data set access domain with minimum management cost.
Next, we model the minimum-cost replica placement problem. Conventionally, a cloud environment is represented by a graph in which two nodes share an edge if and only if the corresponding data centers can communicate with each other. To describe such a circumstance, we transform the environment into a graph
Each data center
Each connection line between two data centers is classified according to the type of network used: (i) lines within one domain from a node with a replica to the others are transformed into edges with weight 0; (ii) lines within one domain between nodes without replicas are transformed into edges whose weight is the minimum transfer cost between the corresponding data centers; (iii) cross-domain lines are transformed into edges between the corresponding nodes, whose weight is the minimum transfer cost.
The storage cost of each data center should be mapped to the first property of corresponding node.
The product of access frequency and time of period
In order to describe the nodes, we define a 3-tuple (
In this way, we wish to find optimal locations at which we place replicas to serve a given set of
The minimum-cost replicas placements problem is defined as follows.
Given a connected undirected and weighted graph
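As a sketch, the problem can be written in the standard uncapacitated-facility-location form. The symbols here are our assumptions chosen to match the mapping discussed below: \(f_i\) is the storage cost at data center \(i\), \(c_{ij}\) the minimum transfer cost between data centers \(i\) and \(j\), and \(d_j\) the number of requests at \(j\):

```latex
\min \sum_{i \in V} f_i \, y_i \;+\; \sum_{i \in V} \sum_{j \in V} d_j \, c_{ij} \, x_{ij}
\quad \text{s.t.} \quad
\sum_{i \in V} x_{ij} = 1 \;\; \forall j \in V, \qquad
x_{ij} \le y_i \;\; \forall i, j \in V, \qquad
x_{ij}, y_i \in \{0, 1\},
```

where \(y_i = 1\) if a replica is stored at data center \(i\), and \(x_{ij} = 1\) if the requests at \(j\) are served by the replica at \(i\). The first sum is the total storage cost and the second the total transfer cost.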
Several aspects of this formulation are worth noting. First, we observe that if all transfer costs are set equal, the problem simplifies to an alternate formulation of the uncapacitated facility location problem (UFLP) in which link additions are disallowed. Thus, the UFLP is a special case of the UFLNDP in which link additions are disallowed. Since the UFLP is NP-hard (in the parlance of computational complexity) [
MCRP is NP-hard.
In the original problem, we need to decide
Then, it can be regarded as a UFLP using mapping rules shown in Table
Mapping rules from the minimum-cost replica placements strategy to the UFLP.

Index  Element in minimum-cost replica placements strategy   Element in UFLP
(1)    The transfer cost ratio between data centers          Transportation cost between a facility and a demand point
(2)    The storage cost of each data center                  Fixed cost of opening a facility
(3)    The user requested access times                       Demand at a demand point
(4)
Note that the storage cost for the node and cost of the edges should be expressed in comparable units; for example, the storage cost for each node can be expressed in dollars for each replica, while the cost of each edge can be represented in dollars per request.
An optimal solution to the MCRP consists of
This property quantifies our intuition about the tradeoff between constructing facilities and links: as we build more facilities, fewer links are needed. The property also has implications for identifying polynomially solvable cases, as discussed in [
In this way, the minimum-cost replica placement problem is NP-hard; therefore, polynomial-time algorithms for solving it are unlikely to exist. Hence, it is of practical importance to obtain approximation methods whose costs are close to optimal.
In this subsection, we introduce an approximation algorithm for MCRP. The idea is to first decompose the transfer cost ratio into edge weights and then find the candidate replica placement data centers using graph
First, we will construct a graph
(01) Initialize an edgeweighed graph
(02) For each (
(03) Assign the weight of this edge as
(04)
(05) EndFor
(06) Initialize an edgeweighed graph
(07) For each
(08) Assign the weight of edge from
(09)
(10) Output
In Algorithm
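The graph construction step can be sketched as follows: data centers become nodes carrying their storage cost and demand, and each pair of connected data centers gets an undirected edge weighted by the minimum transfer cost between them. The input and adjacency representations are illustrative assumptions.

```python
# Sketch of the graph construction: nodes carry (storage_cost, demand),
# and each undirected edge keeps the minimum transfer cost seen in
# either direction between the two data centers.

def build_graph(datacenters, transfer_cost):
    """datacenters: {name: (storage_cost, demand)}  (assumed format)
       transfer_cost: {(a, b): cost per request from a to b}"""
    nodes = dict(datacenters)                # node -> (storage_cost, demand)
    edges = {}
    for (a, b), c in transfer_cost.items():
        key = tuple(sorted((a, b)))          # undirected edge
        edges[key] = min(c, edges.get(key, float("inf")))
    return nodes, edges
```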
Next, we propose an approximate minimum-management-cost replica placement algorithm based on the graph
(01) Generate a minimum-cost spanning tree from
(02)
(03) While
(04) Begin
(05)
(06) Print
(07) For
(08) If
(09) Begin
(10) Delete
(11) If
(12) EndIF
(13) Delete
(14)
(15) EndWhile
(16) End.
In Algorithm
Here, we will analyze the time complexity of Algorithm
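The heuristic above can be sketched compactly: build a minimum-cost spanning tree (Kruskal's algorithm here), repeatedly pick the node of maximum degree as a replica location and delete its edges, and finally, for each remaining isolated edge, keep the endpoint with the smaller storage cost. The data shapes and tie-breaking are illustrative assumptions.

```python
# Sketch of the MST-based placement heuristic: replicas go to the
# most-connected nodes of the minimum-cost spanning tree; leftover
# degree-1 pairs are resolved by smaller storage cost.

def kruskal(nodes, edges):
    """Minimum-cost spanning forest via Kruskal with union-find."""
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x
    tree = []
    for (a, b), w in sorted(edges.items(), key=lambda e: e[1]):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((a, b))
    return tree

def mst_placement(nodes, edges):
    """nodes: {name: storage cost}; edges: {(a, b): transfer cost}."""
    tree = kruskal(nodes, edges)
    placements = []
    degree = lambda n: sum(n in e for e in tree)
    while any(degree(n) > 1 for n in nodes):
        v = max(nodes, key=degree)           # most-connected node hosts a replica
        placements.append(v)
        tree = [e for e in tree if v not in e]
    for a, b in tree:                        # isolated pairs that remain
        placements.append(min((a, b), key=lambda n: nodes[n]))
    return placements
```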
The data center that hosts the original data set is responsible for managing the data set and its replicas, including when to create replicas and where to place them. When data set requests increase (e.g., the number of requests reaches 5000 in a month), the minimum-management-cost replica placement algorithm starts up on that data center, and the data set replicas are transferred to other data centers according to the computed results. A replica distribution table, containing important information such as the locations of the original data set and its replicas, their sizes, and the most recent update times, is sent to each data center, so that all data centers know where the data sets are placed. In this model, a user connected to a data center accesses a data set as follows. First, the user tries to locate a replica locally. If no replica is present, the user checks the data placement directory residing on each data center, which stores the replica distribution structure. The request then goes to the nearest data center holding a replica, and the data set is transferred to the user via that data center.
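The lookup procedure just described can be sketched as follows. The replica table layout and the distance function are illustrative assumptions standing in for the paper's replica distribution table and network proximity measure.

```python
# Sketch of the data access lookup: serve locally if possible, otherwise
# consult the replica distribution table and pick the nearest holder.

def locate_replica(local_dc, dataset, replica_table, distance):
    """replica_table: {dataset: set of data centers holding a replica}
       distance: function (a, b) -> network distance between data centers
       (both shapes are assumed for illustration)."""
    holders = replica_table.get(dataset, set())
    if local_dc in holders:
        return local_dc                      # served from the local replica
    if not holders:
        return None                          # data set unknown to the system
    return min(holders, key=lambda dc: distance(local_dc, dc))
```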
In this section, we first present the experimental setup and then discuss the tradeoff between storage cost and transfer cost. Next, we describe the whole procedure of the MCRP approximation algorithm step by step, using an example. Finally, we compare the results under different circumstances to demonstrate how our replica placement strategy works.
The experiments were conducted on a cloud computing simulation environment built on the computing facilities at the Network & Information Security Lab, Shandong University of Finance and Economics (SDUFE), China, which is constructed based on SwinDeW [
Figure
Architecture of cloud environment.
In the analysis, we observe and study the running conditions over a period of one month. The usage frequency follows a Poisson distribution. Table
Data sets access configurations (eight data centers).

Users number        1    0.6  0.8  1.1  0.6  0.7  0.9  1
Access frequency    2    3    3    5    4    6    5    4
We define the cost of a solution in MCRP as the sum of storage cost and transfer cost. To compare the total costs under different numbers of replicas, we computed the storage and transfer costs separately. The result is shown in Figure
Tradeoff between storage and transfer costs.
From Figure
Similarly, Figure
Reduced cost comparisons of data centers.
In this section, we will analyze the MCRP algorithms proposed in Section
First, we need to generate a minimum-cost spanning tree from Figure
Minimum-cost spanning tree.
Then, we select the node with the maximum number of linked edges, for example,
Result after first deletion.
It is obvious that there still exist nodes with degree greater than two, for example,
Result after second deletion.
In this case, the maximum degree of all nodes is only one, so the node with the smaller storage cost is the suitable place to store the replica, because the transfer cost is the same regardless of where the replica is stored. Figure
Result of replicas' storage places.
The random simulations were conducted on randomly generated data sets of different sizes, generation times, and usage frequencies. In the simulations, we use 50 data sets, each with a random size from 100 GB to 1 TB and a random usage frequency from 1 to 10 times.
Elapsed time in measuring the replica possibility.
Figure
Comparisons of replicas numbers with different usage frequency.
Figure
Comparisons of total cost among different strategies.
From the above experimental and simulation results, the following conclusions can be drawn: (1) the proposed data set replica placement strategy effectively reduces the cost of application data sets; (2) it reduces the number of replicas; and (3) it can effectively achieve system load balance by placing popular data files according to cost and user access history.
In this paper we have investigated a model that optimizes replica placement from a cost-effective view in the cloud. This model has a number of important applications in replication technology. Our current research, including experiments and simulations, is based on the Amazon cloud's cost model, which can be reused by replacing the corresponding cost ratios. The data set replica placement strategy proposed in this paper is generic and dynamic, and it can be used in any data-intensive application with a different cloud service price model. As presented in Section
The strategy proposed in this paper mainly focuses on a single data set
Considering the dynamic nature of the cloud computing system, in which metrics such as transfer and storage costs and user access frequency change over time, the minimum-cost replica strategy remains available and effective, since we focus on a period of time
Replica deletion and maintenance strategies are also easy to derive from the minimum-cost replica strategy. The basic idea is to compare the total cost with replicas to the cost without replicas, and to delete replicas once the cost with replicas becomes greater. Additionally, we can update the replica storage places according to the minimum-cost replica strategy at scheduled time intervals.
Furthermore, experimental results and analysis show that the proposed strategy is feasible, effective, and universal in the cloud environment. Hence, we deem it highly practical as a replica strategy. However, this paper presents a first attempt to apply the technique to the problem of placing data set replicas in the most appropriate data centers in the cloud from the minimum-cost view, and these findings are the results of a preliminary study. To be more useful in practice, future work can be conducted in the following directions:
The current work assumes that data set usage frequencies are obtained from system log files. Models for forecasting data set usage frequency can be studied further, with which our benchmarking approaches and replica strategies can be adapted more widely to different types of applications.
The replica placement strategy should incorporate data set generation and deduplication technology, especially content-based deduplication, for which there is a strong and growing business demand to manage big data more cost-effectively on cloud computing platforms.
The author declares that there are no competing interests.
Some experiments in this paper were done on the SwinDeW-C cloud platform at Swinburne University of Technology during Xiuguo Wu's visiting period, which was sponsored by the Shandong Provincial Education Department, China. We also thank Dr. Dong Yuan and Professor Yun Yang, Swinburne University of Technology, Australia, for their valuable feedback on earlier drafts of this paper. In addition, the work presented in this paper is partly supported by the Project of Shandong Province Higher Educational Science and Technology Program (no. J12LN33), China; the Doctor Foundation of Shandong University of Finance and Economics under Grant no. 2010034; and the Project of Jinan High-Tech Independent Innovation (no. 201303015), China.