Data Placement for Privacy-Aware Applications over Big Data in Hybrid Clouds

1 School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China
2 Jiangsu Engineering Centre of Network Monitoring, Nanjing University of Information Science and Technology, Nanjing, China
3 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
4 Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
5 School of Information and Control, Nanjing University of Information Science and Technology, Nanjing, China


Introduction
The rapid development of science and technology has made network information grow exponentially, and the continuous accumulation of network data brings both opportunities and challenges for big data. Big data benefits many fields, including networking, health care, transportation, finance, military, and politics. Recommendation, prediction, and computing services can be realized through big data storage and analysis [1][2][3]. As a huge amount of data could crash a system built on traditional data storage techniques, dedicated big data storage is essential [4,5]. Distributed file systems and databases are beneficial to big data storage. The emergence and development of cloud computing contributes to big data storage, access, and processing, as cloud computing provides ubiquitous and varied resources to respond to the explosive growth of accumulated data.
Cloud computing is a powerful technology that provides numerous cloud services to customers everywhere through the Internet; it aggregates geodistributed resources to achieve higher throughput and computing ability [6]. Customers benefit from public cloud services, as they do not need to build the infrastructure or manage the data center [7]. Currently, data privacy issues have received a lot of attention due to increasing concern about privacy and data value protection, since individuals often suffer heavy blows from privacy leaks [8,9]. How to protect data and improve the security level of private data has become a hot topic in cloud computing [10]. Generally, placing such datasets in the private cloud is an effective approach; thus a hybrid cloud for big data storage should be taken into consideration for privacy-aware applications [11,12].

Table 1: Key terms and descriptions.
$AT^{i}_{m,m'}$: The data access time for $v_{m'}$ ($1 \le m' \le M$) extracting the dataset $d_i$ from $v_m$
$T$: The tasks that need to be performed, $T = \{t_1, t_2, \ldots, t_N\}$
$t_n$: The $n$th ($1 \le n \le N$) task in $T$
$y_{n,m}$: The binary variable to judge whether $t_n$ is placed on $v_m$
$ET_{n,i}$: The time cost for the task $t_n$ to extract the dataset $d_i$
$f_m$: The binary variable to judge whether $v_m$ is on the public cloud
$PT$: The total access time of public datasets
$RT$: The access time for obtaining the datasets in the private cloud
$PE$: The baseline power consumed by the active PMs
$VE$: The power consumed by the active VMs
$IE$: The power consumed by the VMs in the idle mode
$TE$: The power consumed by the switches due to data access
$E$: The total power consumption to perform the tasks

Nowadays, an increasing number of applications, especially scientific workflows such as weather forecasting, are deployed in the hybrid cloud. Data placement has a direct impact on data access efficiency and on the cost of data storage, as the locations of big data affect the overhead of service renting and the access time for data extraction. Therefore, reasonable and efficient data placement methods are essential to the performance of big data processing [13][14][15]. For data placement in hybrid clouds, the payment for public cloud services and the energy consumption generated in the private cloud are key factors in determining the locations of the datasets for the execution of privacy-aware applications.
With the above observations, it is still a challenge to realize data placement for privacy-aware applications over big data in the hybrid cloud while considering both cost saving in the public cloud and energy saving in the private cloud. In view of this challenge, we design an efficient data placement method. Our main contributions are threefold. Firstly, we conduct cost, access time, and energy analysis over big data in the hybrid cloud. Secondly, a corresponding cost and energy aware data placement method, named CEDP, is designed to address the resource provisioning problem for privacy-aware applications over big data in the hybrid cloud. Finally, a series of experiments is conducted to validate the efficiency and the effectiveness of our proposed method.
The rest of this paper is organized as follows. In Section 2, formalized concepts are presented for cost, access time, and energy analysis over big data in the hybrid cloud. Section 3 specifies our proposed method. The comparison analysis and performance evaluation are described in Section 4. Section 5 presents the related work, and Section 6 concludes the paper and gives an outlook on future work.

Cost, Access Time, and Energy Analysis over Big Data in Hybrid Cloud
In this section, cost and access time for data placement in the public cloud are analyzed. Besides, the access time and the energy consumption analysis for data placement in the private cloud are also presented. Table 1 specifies the key terms and descriptions for cost, access time, and energy analysis over big data in the hybrid cloud.

Cost and Access Time Analysis in Public Cloud.
In the cloud environment, the datasets and the tasks both need to be hosted in the form of VMs. Suppose there are $M$ VM instances available for hosting tasks and datasets across the public clouds and the private cloud data centers, denoted as $V = \{v_1, v_2, \ldots, v_M\}$. Suppose there are $I$ datasets that need to be stored in the hybrid cloud platforms, denoted as $D = \{d_1, d_2, \ldots, d_I\}$.
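Let $x_{i,m}$ be a binary variable to judge whether $d_i$ ($1 \le i \le I$) is stored on $v_m$ ($1 \le m \le M$), which is measured by

$$x_{i,m} = \begin{cases} 1, & \text{if } d_i \text{ is stored on } v_m, \\ 0, & \text{otherwise.} \end{cases}$$

The reserved datasets need to be extracted from one VM to another. When the VM $v_{m'}$ ($1 \le m' \le M$) needs to extract $d_i$ from $v_m$, the data access time is denoted as $AT^{i}_{m,m'}$, which is calculated by

$$AT^{i}_{m,m'} = \sum_{l=1}^{L_{m,m'}} \frac{|d_i|}{b_l},$$

where $L_{m,m'}$ is the number of links between $v_m$ and $v_{m'}$, $|d_i|$ represents the data size of $d_i$, and $b_l$ ($1 \le l \le L_{m,m'}$) is the bandwidth of the $l$th link.

Suppose there are $N$ tasks that need to be performed in the hybrid cloud environment, denoted as $T = \{t_1, t_2, \ldots, t_N\}$. The tasks are executed on VMs in either the public cloud or the private cloud. Thus, these tasks have placement relationships with the VM instances. Let $y_{n,m}$ be the binary variable to judge whether $t_n$ ($1 \le n \le N$) is placed on $v_m$, which is measured by

$$y_{n,m} = \begin{cases} 1, & \text{if } t_n \text{ is placed on } v_m, \\ 0, & \text{otherwise.} \end{cases}$$

Thus, the time cost for the $n$th ($1 \le n \le N$) task $t_n$ to extract the dataset $d_i$, denoted as $ET_{n,i}$, is calculated by

$$ET_{n,i} = \sum_{m=1}^{M} \sum_{m'=1}^{M} x_{i,m}\, y_{n,m'}\, AT^{i}_{m,m'}.$$

As big data is now expanding explosively in both academia and industry, the execution of one task may need the support of several datasets. For the data extraction in the public cloud, we mainly focus on the bandwidth cost for data transferring.

Let $f_m$ be the binary variable to judge whether $v_m$ is placed on the public cloud, which is measured by

$$f_m = \begin{cases} 1, & \text{if } v_m \text{ is in the public cloud}, \\ 0, & \text{otherwise.} \end{cases}$$

The datasets could be extracted by the tasks in both the public cloud and the private cloud. The access time by the tasks in the public cloud is calculated by

$$PT^{pub} = \sum_{n=1}^{N} \sum_{i=1}^{I} \sum_{m=1}^{M} \sum_{m'=1}^{M} r_{n,i}\, x_{i,m}\, y_{n,m'}\, f_m\, f_{m'}\, AT^{i}_{m,m'},$$

where $r_{n,i}$ is a binary variable to judge whether $d_i$ is necessary for the execution of $t_n$.

The access time for the datasets in the public cloud by the tasks in the private cloud is calculated by

$$PT^{pri} = \sum_{n=1}^{N} \sum_{i=1}^{I} \sum_{m=1}^{M} \sum_{m'=1}^{M} r_{n,i}\, x_{i,m}\, y_{n,m'}\, f_m\, (1 - f_{m'})\, AT^{i}_{m,m'}.$$

Then the total access time for extracting the datasets from the public cloud is calculated by

$$PT = PT^{pub} + PT^{pri}.$$

The bandwidth cost for the data transferring in the public cloud is calculated by

$$C = \sum_{m=1}^{M} c_m\, f_m \sum_{n=1}^{N} \sum_{i=1}^{I} \sum_{m'=1}^{M} r_{n,i}\, x_{i,m}\, y_{n,m'}\, AT^{i}_{m,m'}, \tag{10}$$

where $c_m$ is the expenditure of $v_m$ for data transferring per unit time.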
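The access time and bandwidth cost described above can be illustrated with a minimal Python sketch. It assumes that the access time is the sum of the data size divided by the bandwidth of each link on the path, and that a transfer is billed at the per-unit-time price of the public VM that stores the dataset. The names `access_time`, `Extraction`, and `bandwidth_cost`, as well as all numeric values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence, Tuple


def access_time(data_size: float, link_bandwidths: Sequence[float]) -> float:
    """Sum of data_size / bandwidth over every link on the path between two VMs."""
    if any(b <= 0 for b in link_bandwidths):
        raise ValueError("link bandwidths must be positive")
    return float(sum(data_size / b for b in link_bandwidths))


@dataclass(frozen=True)
class Extraction:
    """One task reading one dataset (hypothetical record type)."""
    host_vm: str                         # VM that stores the dataset
    host_is_public: bool                 # True if that VM is rented in the public cloud
    data_size: float                     # size of the dataset
    link_bandwidths: Tuple[float, ...]   # bandwidth of each link on the path to the task


def bandwidth_cost(extractions: Sequence[Extraction],
                   price_per_time_unit: Mapping[str, float]) -> float:
    """Bill every extraction from a public VM at that VM's price per unit time."""
    total = 0.0
    for ex in extractions:
        if ex.host_is_public:
            duration = access_time(ex.data_size, ex.link_bandwidths)
            total += price_per_time_unit[ex.host_vm] * duration
    return total


demo = [
    Extraction("v1", True, 4000.0, (100.0, 100.0, 200.0, 200.0)),  # four links
    Extraction("v1", True, 4000.0, ()),                            # same PM, no link
    Extraction("v7", False, 4000.0, (100.0, 100.0)),               # private host, not billed
]
cost = bandwidth_cost(demo, {"v1": 0.5, "v7": 0.0})
```

Only extractions whose dataset sits on a public VM contribute to the cost, which mirrors the role of the public-cloud indicator in the cost formula.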

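Access Time and Energy Consumption Analysis in Private Cloud.
For a private cloud data center, the cloud providers need to take into account the time cost and the power consumption while allocating the datasets. The access time for obtaining the datasets in the private cloud is calculated by

$$RT = \sum_{n=1}^{N} \sum_{i=1}^{I} \sum_{m=1}^{M} \sum_{m'=1}^{M} r_{n,i}\, x_{i,m}\, y_{n,m'}\, (1 - f_m)\, (1 - f_{m'})\, AT^{i}_{m,m'},$$

where $x_{i,m}$, $y_{n,m'}$, $r_{n,i}$, and $f_m$ are the binary variables for dataset placement, task placement, data requirement, and public cloud membership, respectively, and $AT^{i}_{m,m'}$ is the data access time between $v_m$ and $v_{m'}$ for $d_i$.

Suppose there are $J$ PMs, denoted as $P = \{p_1, p_2, \ldots, p_J\}$, that are available to host the private datasets. The tasks in the private cloud are also deployed on the PMs in $P$.
The energy consumption in the private cloud for the execution of the privacy-aware applications mainly refers to the energy consumed by the PM base power, the active VMs, and the unused VMs, together with the energy consumed by data transferring. The PMs in the sleep mode also consume a certain amount of power, but it is orders of magnitude lower than the energy consumed by the active PMs and can therefore be neglected [16,17].
The baseline power consumed by the active PMs is calculated by

$$PE = \sum_{j=1}^{J} \beta_j\, \tau_j,$$

where $\beta_j$ ($1 \le j \le J$) and $\tau_j$ are the baseline power consumption rate and the total running time for $p_j$. The power consumed by the active VMs is calculated by

$$VE = \sum_{m=1}^{M} \gamma_m\, \rho_m,$$

where $\gamma_m$ and $\rho_m$ are the power consumption rate and the running time for $v_m$, respectively. The power consumed by the VMs in the idle mode is calculated by

$$IE = \sum_{m=1}^{M} \gamma^{idle}_m\, \rho^{idle}_m,$$

where $\gamma^{idle}_m$ and $\rho^{idle}_m$ are the power rate and the idle time of $v_m$, respectively.
The power consumed by the switches due to data access is calculated by

$$TE = \sum_{n=1}^{N} \sum_{i=1}^{I} r_{n,i}\, \sigma_{n,i}\, \delta\, ET_{n,i},$$

where $\sigma_{n,i}$ is the number of switches between $t_n$ and $d_i$, $\delta$ is the active power rate for each switch, and $ET_{n,i}$ is the time cost for $t_n$ to extract $d_i$.
Then the total power consumption to perform the tasks with data extraction processes is calculated by

$$E = PE + VE + IE + TE. \tag{16}$$

Then the objectives for data placement over big data in the hybrid cloud are $\min C$ and $\min E$.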
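To make the four energy terms concrete, the following Python sketch adds them up for a small, made-up private cloud. It assumes that each term is a power rate multiplied by a duration, as described in the text; the function name `total_energy`, the tuple layouts, and the numbers are hypothetical.

```python
from typing import Sequence, Tuple


def total_energy(pms: Sequence[Tuple[float, float]],
                 vms: Sequence[Tuple[float, float, float, float]],
                 transfers: Sequence[Tuple[int, float]],
                 switch_rate: float) -> float:
    """Add up baseline PM, active VM, idle VM, and switch energy.

    pms:       (baseline power rate, running time) per active PM
    vms:       (active rate, running time, idle rate, idle time) per VM
    transfers: (number of switches on the path, transfer time) per data access
    """
    pe = sum(rate * time for rate, time in pms)
    ve = sum(rate * time for rate, time, _, _ in vms)
    ie = sum(idle_rate * idle_time for _, _, idle_rate, idle_time in vms)
    te = sum(n_switches * switch_rate * time for n_switches, time in transfers)
    return pe + ve + ie + te


energy = total_energy(
    pms=[(100.0, 2.0)],               # one active PM
    vms=[(10.0, 2.0, 2.0, 1.0)],      # one VM, active for 2 and idle for 1 time unit
    transfers=[(5, 0.5)],             # one access crossing five switches
    switch_rate=4.0,
)
```

In this toy setting the baseline PM term dominates, which is consistent with the later design choice of keeping the number of running PMs small.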

Cost and Energy Aware Data Placement Method for Privacy-Aware Applications
In this section, a cost and energy aware data placement method is proposed for privacy-aware applications over big data in the hybrid environment. In this method, we aim to reduce the cost for renting cloud services and achieve energy savings in the private cloud.

Method Overview.
In this paper, a cost and energy aware data placement method is proposed to address the challenges of data placement problem for the privacy-aware applications in the hybrid cloud environment.
Figure 1 shows the specification of our proposed method. The input of our method is the privacy-aware applications with task distribution in the hybrid cloud, and the datasets that need to be placed in the hybrid cloud. Our method consists of four main steps, i.e., VM identification for data placement, VM selection for data placement in public cloud, cost aware data placement in public cloud, and energy aware data placement in private cloud.
For each dataset, the VMs that need to access it are identified in Step 1. Then, in Step 2, we choose available VMs in the public cloud to host the datasets that need to be placed there. For the VMs obtained in Step 2, we conduct cost aware data placement in Step 3, so that the data placement strategies with minimum cost are designed for the datasets that need to be placed in the public cloud. The datasets with privacy preservation requirements must be placed in the private cloud. Energy aware data placement is designed in Step 4 to achieve energy savings while allocating VMs to store these datasets. The ultimate output of our method is the set of data placement strategies.

VM Identification for Data Placement.
In the hybrid cloud, both the datasets and the tasks combined in the privacy-aware applications require physical resources from cloud platforms for hosting, and these resources are provided in the form of VMs. Generally, the resource capacity of PMs and the resource requirements of tasks and datasets are specified by the amount of resource units, that is, the VM instances [16]. Many public cloud vendors, such as Amazon, provide many types of VM instances, including CPU-intensive instances and I/O optimized instances.
Definition 1 (resource requirement of $d_i$). The resource requirement of $d_i$ mainly refers to the VM instance type and the number of VM instances, which is denoted as $rr_i = \{vt_i, vn_i\}$, where $vt_i$ and $vn_i$ are the VM instance type and the required total amount of VM instances of $d_i$, respectively.
To satisfy the requirements of a dataset that needs to be stored, one or more VM instances with the same specification are requested, and these instances could be regarded as a special VM.
Definition 2 (special VM). The VM instances that are deployed to perform the same task or store the same dataset could be treated as a special VM.
Currently, in the big data era, large-scale datasets can be shared by multiple tasks, and one task may need several different datasets for execution. To place a dataset efficiently, the special VMs rented for hosting the tasks that require the dataset for execution should be identified. For the dataset $d_i$ in $D$, the special VM set is denoted as $SV_i$; then $D$ has a corresponding special VM set $SV = \{SV_1, SV_2, \ldots, SV_I\}$. Figure 2 shows an example of special VM identification. In this example, there are three datasets (i.e., $d_1$, $d_2$, and $d_3$) that need to be stored in the hybrid cloud. $d_1$ needs to be accessed by tasks $t_1$ and $t_3$, $d_2$ needs to be accessed by $t_2$, and $d_3$ needs to be accessed by $t_4$. $t_1$ requires $v_1$ and $v_2$ for execution, $t_2$ requires $v_3$, $t_3$ requires $v_4$, and $t_4$ requires $v_5$ and $v_6$. In this example, the two VM instances $v_1$ and $v_2$ occupied by $t_1$ are treated as a special VM $sv_1$, $v_3$ is treated as special VM $sv_2$, $v_4$ is treated as special VM $sv_3$, and $v_5$ and $v_6$ are treated as special VM $sv_4$.
The VMs identified according to the task distribution and the dataset access requirements should be specified as the special VMs. For the dataset $d_i$, the corresponding special VMs are put in the VM set $SV_i$. Then the special VM sets for all datasets are recorded as $SV = \{SV_1, SV_2, \ldots, SV_I\}$.
Algorithm 1 specifies the key idea of special VM identification for data access. The input is the dataset $D$. This algorithm traverses all the datasets (Line (1)) and all the tasks (Line (2)). For each dataset, we find the VMs of the tasks that need to access the dataset (Lines (3) to (12)). Finally, the output is the special VM set $SV$.
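The identification step can be sketched in Python as follows. This is one reading of the description above, not the authors' listing: each task's VM instances are grouped into a tuple that plays the role of a special VM, and the function name `identify_special_vms` and the dictionary layout are assumptions. The sample data follows the Figure 2 example, with $t_3$ assumed to run on $v_4$.

```python
from typing import Dict, List, Sequence, Set, Tuple

SpecialVM = Tuple[str, ...]


def identify_special_vms(datasets: Sequence[str],
                         task_needs: Dict[str, Set[str]],
                         task_vms: Dict[str, Sequence[str]]
                         ) -> Dict[str, List[SpecialVM]]:
    """For every dataset, collect the special VMs of the tasks that access it."""
    special: Dict[str, List[SpecialVM]] = {}
    for dataset in datasets:                       # traverse all datasets
        group: List[SpecialVM] = []
        for task, needed in task_needs.items():    # traverse all tasks
            if dataset in needed:
                sv = tuple(task_vms[task])         # instances of one task form one special VM
                if sv not in group:
                    group.append(sv)
        special[dataset] = group
    return special


sv_sets = identify_special_vms(
    datasets=["d1", "d2", "d3"],
    task_needs={"t1": {"d1"}, "t2": {"d2"}, "t3": {"d1"}, "t4": {"d3"}},
    task_vms={"t1": ["v1", "v2"], "t2": ["v3"], "t3": ["v4"], "t4": ["v5", "v6"]},
)
```

Representing a special VM as a tuple keeps it hashable and comparable, so duplicates are easy to detect when several tasks share the same instances.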

VM Selection for Data Placement in Public Cloud.
The datasets should be placed on VMs; thus the available VMs in the cloud should be identified to store the datasets.
For the PMs and VMs in the private cloud, the resource scheduler is aware of the mapping between PMs and VMs. However, in the public cloud, the resource scheduler can only select among the available VMs that the cloud vendors provide.
In the public cloud platforms, when renting VMs for storing the datasets, we would like to choose the VM instances with the lowest cost. Generally, the longer the bandwidth is rented, the more the users need to pay. Thus, the access time for tasks extracting the datasets should be taken into consideration. FatTree is a typical network topology for cloud datacenters. For most VMs connected to different switches, the data access time is almost the same. In this section, we conduct a VM selection process to select the VMs that could store the datasets in the public cloud. These VMs are sorted by the distances between the selected VMs and the VMs holding the tasks, identified in Algorithm 1. Figure 3 shows an example of VM distribution in a FatTree-based data center network. There are seven switches (i.e., $s_1 \sim s_7$), distributed as a tree network. Among these switches, $s_1$ is the core switch, $s_2$ and $s_3$ are the aggregation switches, and $s_4 \sim s_7$ are the edge switches. There are four PMs (i.e., PM$_1$ $\sim$ PM$_4$) connected to the edge switches. In this example, there are five running VMs distributed on these PMs, where $v_1$ and $v_2$ are placed on PM$_1$, $v_3$ is placed on PM$_2$, $v_4$ is placed on PM$_3$, and $v_5$ is placed on PM$_4$.
The data access time is a key objective for users, and it is closely related to the distance between the VM that hosts the task and the VM where the dataset is located. The distance calculation relies on the locations in the FatTree network. The distance between two VMs on the same PM is 0. For example, as shown in Figure 3, $v_1$ and $v_2$ are placed on the same PM, so the distance between $v_1$ and $v_2$ is 0. The distance between two VMs on different PMs depends on the number of links between these two VMs. For example, the distance between $v_1$ and $v_3$ is 2. Furthermore, if the PMs are connected to two different edge switches that share the same aggregation switch, the distance is double the distance between two VMs connected to the same edge switch. For example, the distance between $v_4$ and $v_5$ is 4.
Besides, if the VMs are connected to different aggregation switches, their distance is triple the distance between VMs connected to the same edge switch. For example, the distance between $v_1$ and $v_4$ is 6, and the distance between $v_3$ and $v_4$ is also 6.
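The distance rules above (0, 2, 4, or 6 links) can be written as a short Python sketch. The `Location` record and the switch assignments are assumptions made for illustration: PM$_1$ and PM$_2$ are taken to hang off edge switch $s_4$ under aggregation switch $s_2$, PM$_3$ off $s_6$, and PM$_4$ off $s_7$, both under $s_3$, which is one layout that reproduces every distance quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Where a VM sits in the three-level tree (hypothetical record type)."""
    pm: str
    edge: str
    aggregation: str


def distance(a: Location, b: Location) -> int:
    """Number of links on the shortest path between two VMs."""
    if a.pm == b.pm:
        return 0          # same PM: no switch involved
    if a.edge == b.edge:
        return 2          # PM -> edge switch -> PM
    if a.aggregation == b.aggregation:
        return 4          # up to the shared aggregation switch and down again
    return 6              # path goes through the core switch


vm = {
    "v1": Location("PM1", "s4", "s2"),
    "v2": Location("PM1", "s4", "s2"),
    "v3": Location("PM2", "s4", "s2"),
    "v4": Location("PM3", "s6", "s3"),
    "v5": Location("PM4", "s7", "s3"),
}
pairs = {
    ("v1", "v2"): distance(vm["v1"], vm["v2"]),
    ("v1", "v3"): distance(vm["v1"], vm["v3"]),
    ("v4", "v5"): distance(vm["v4"], vm["v5"]),
    ("v1", "v4"): distance(vm["v1"], vm["v4"]),
    ("v3", "v4"): distance(vm["v3"], vm["v4"]),
}
```

Because the tree has a fixed depth, comparing the PM, edge switch, and aggregation switch in that order is enough; no graph search is needed.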
Based on the process of distance calculation, the identified VMs, which are available for hosting the datasets, can be sorted in increasing order of the distance values. The datasets placed in the public cloud can also be accessed by the tasks deployed in the private cloud. In the private cloud datacenter, the network is also built based on FatTree; thus the data access between VMs in these two kinds of cloud platforms needs to pass through the edge switch, the aggregation switch, and the core switch in both the public and the private cloud, which is an edge-to-edge communication across clouds and platforms. So, in this section, we mainly focus on the access time within the public cloud platform.
For the dataset $d_i$, which is arranged to be stored in the public cloud, several tasks in the public cloud may need to access $d_i$; the special VMs in $SV_i$ should be updated by removing the special VMs in private clouds. The corresponding VMs selected to hold $d_i$ are put in the VM set $CV_i$. For all the datasets, the VM set list is denoted as $CV = \{CV_1, CV_2, \ldots, CV_I\}$.
Algorithm 2 specifies the key process of VM selection for data placement. The input is the VM node set $SV$. This algorithm traverses all the VM sets in $SV$ (Line (1)), and, for each VM set, the VMs in the private cloud are removed (Line (2)). For each VM in the VM set, we select the VMs in the public cloud and calculate the distances between the selected VM and the VM in the VM set (Lines (3) to (12)). Then the selected VMs are put in the VM set $CV$, which is sorted in increasing order of distance (Line (13)). The final output is the identified VM set $CV$.
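The selection and sorting step can be sketched in Python as below. This is a simplified reading rather than the authors' algorithm: it keeps, for every public VM, the smallest distance to any public task-hosting VM (an assumption, since the paper's listing records the distance when a VM is first added), and it breaks ties by VM name. The function name `select_public_vms` and the toy distance function are hypothetical.

```python
from typing import Callable, Dict, List, Mapping, Sequence


def select_public_vms(task_vms: Sequence[str],
                      all_vms: Sequence[str],
                      is_public: Mapping[str, bool],
                      dist: Callable[[str, str], int]) -> List[str]:
    """Return the public VMs sorted by increasing distance to the task-hosting VMs."""
    targets = [v for v in task_vms if is_public[v]]    # drop task VMs in the private cloud
    best: Dict[str, int] = {}
    for target in targets:
        for vm in all_vms:
            if not is_public[vm]:
                continue
            d = dist(vm, target)
            if vm not in best or d < best[vm]:
                best[vm] = d                           # keep the closest task VM
    return sorted(best, key=lambda vm: (best[vm], vm))


# Toy line topology: the distance is the gap between made-up positions.
position = {"a": 0, "b": 2, "c": 6, "p": 1}
public = {"a": True, "b": True, "c": True, "p": False}


def line_distance(u: str, v: str) -> int:
    return abs(position[u] - position[v])


near_a = select_public_vms(["a", "p"], list(position), public, line_distance)
near_c = select_public_vms(["c"], list(position), public, line_distance)
```

Passing the distance function as a parameter keeps the sketch independent of the topology, so a FatTree distance could be plugged in without changing the selection logic.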

Cost Aware Data Placement in Public Cloud.
After the VM selection in Algorithm 2, the VMs that could be allocated to store the datasets in the public cloud are obtained. In the public cloud, the cost and the access time are closely related, especially in the FatTree network; therefore, in this section, we mainly focus on the cost of the public cloud services.

Algorithm 2: VM selection for data placement.
Input: The VM set $SV$
Output: The identified VM set $CV$
(1) for $i$ = 1 to $|SV|$ do
(2)   Remove the VMs in private cloud from $SV_i$
(3)   for $j$ = 1 to $|SV_i|$ do
(4)     for $m$ = 1 to $M$ do
(5)       if $v_m$ is in the public cloud then
(6)         if $v_m$ is not in $CV_i$ then
(7)           Add $v_m$ to $CV_i$
(8)           Calculate the distance between $v_m$ and $sv_{i,j}$
(9)         end if
(10)      end if
(11)    end for
(12)  end for
(13)  Sort the VMs in $CV_i$ in the increasing order of distance
(14) end for
(15) Return $CV$
The cost mainly depends on the service time and the unit payment fee for VM renting. Cloud vendors provide different VM instances, and the costs of these VM instances vary; thus, to achieve cost efficiency, we should select the data placement strategy with minimum cost for the datasets that need to be placed in the public cloud.
Definition 3 (data placement strategy of $d_i$). The data placement strategy of $d_i$ consists of the VM instances that need to be rented for storing $d_i$, denoted as $s_i$.
For all the datasets in $D$, the relevant data placement strategy set is denoted as $S = \{s_1, s_2, \ldots, s_I\}$. After the processing by Algorithm 1, we get the special VMs for each dataset access, which are used to hold the tasks that need to access the dataset. Although the datasets in the public cloud could be accessed by tasks running in both the public cloud and the private cloud, these datasets can only use the public cloud services for storage, due to the resource limit in the private cloud. The VMs that could be employed to meet the resource requirements are obtained by Algorithm 2.
Then for each dataset in the public cloud, we try to select a suitable data placement policy to reduce the expenditure on cloud service renting. As there are multiple data placement policies for each dataset, the placement policy with the minimum cost, calculated by formula (10), is selected as the final data placement strategy.
Algorithm 3 specifies the key idea of cost aware data placement. The input for this algorithm is the dataset $D$ that needs to be placed in the hybrid cloud. The special VMs for each dataset are identified by Algorithm 1 (Line (1)). Then we traverse all the datasets (Line (2)) and select the datasets that need to be placed in the public cloud (Line (3)). The VM instances are selected to meet the resource requirements of each dataset in the public cloud (Line (5)). Then multiple iterations are performed to find the data placement policy with the lowest cost (Lines (7) to (16)). The output of this algorithm is the data placement strategy $S$.
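The minimum-cost search at the core of this step can be sketched in Python as follows. It is a simplified illustration rather than the authors' listing: the candidate list is scanned once, infeasible VMs are skipped, and the cheapest feasible VM is kept. The helper names `cheapest_placement`, `can_hold`, and `cost_of`, as well as the capacities and prices, are hypothetical; in the paper the cost would come from formula (10).

```python
from typing import Callable, Optional, Sequence, Tuple


def cheapest_placement(candidates: Sequence[str],
                       can_hold: Callable[[str], bool],
                       cost_of: Callable[[str], float]
                       ) -> Tuple[Optional[str], float]:
    """Scan the candidate VMs once and keep the feasible one with the lowest cost."""
    best_vm: Optional[str] = None
    best_cost = float("inf")              # plays the role of the MAX initial value
    for vm in candidates:
        if not can_hold(vm):
            continue                      # VM cannot satisfy the resource requirement
        cost = cost_of(vm)
        if cost < best_cost:
            best_vm, best_cost = vm, cost
    return best_vm, best_cost


capacity = {"v1": 1, "v2": 4, "v3": 4}          # free VM instances (made-up)
price = {"v1": 1.0, "v2": 7.5, "v3": 6.0}       # total cost if chosen (made-up)
needed = 2

choice = cheapest_placement(["v1", "v2", "v3"],
                            lambda vm: capacity[vm] >= needed,
                            lambda vm: price[vm])
no_choice = cheapest_placement(["v1"],
                               lambda vm: capacity[vm] >= needed,
                               lambda vm: price[vm])
```

Returning `None` with an infinite cost when no candidate fits lets the caller distinguish an infeasible dataset from a genuinely expensive placement.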

Energy Aware Data Placement in Private Cloud.
After the data placement in Section 3.4, the data placement strategies for the datasets that need to be placed in the public cloud are all designed. For the privacy-aware applications of users, some of the tasks they contain are deployed in the users' own datacenter, that is, the private cloud constructed by themselves. In this scenario, the resource scheduler knows the task distribution and the mapping between VMs and PMs. Similar to the network topology of the public cloud, the private cloud data center is also constructed based on the FatTree network. Thus, when allocating VMs to store the datasets in the private cloud, the access time is not the key issue. In the private cloud, we mainly focus on reducing the energy consumption due to data access and task execution.
In Section 2, the energy consumption is specified as the energy consumed by the running PMs, the active VMs, the idle VMs, and the switches due to data transferring. The energy consumed due to data access can be divided into the following three scenarios:
(1) The datasets could be placed on the PMs hosting the tasks that need to access them. In this case, the energy consumption of switches due to data transferring can be neglected. For example, the VMs $v_1$ and $v_2$ in Figure 4 share the data storage of PM$_1$; thus no data is transferred through any switch.
(2) The datasets could also be placed on VMs connected to the same edge switch as the tasks that need to access them. Then the data access crosses only one switch, and the energy for data transferring is consumed only in this switch. For example, the VMs $v_1$ and $v_3$ are placed on PM$_1$ and PM$_2$ separately, and the data transferring between these two VMs only employs the switch $s_4$.
(3) The datasets could also be placed on PMs connected to different switches from the PMs hosting the tasks that need to access them. In this situation, whichever PM the datasets are placed on, the energy consumed due to data transferring is the same, as the data access uses five switches, that is, two edge switches, two aggregation switches, and one core switch. For example, in Figure 4, the data access between $v_1$ and $v_4$ uses the edge switch $s_4$, the aggregation switch $s_2$, the core switch $s_1$, the aggregation switch $s_3$, and the edge switch $s_6$.

Algorithm 3: Cost aware data placement.
Input: The dataset $D$
Output: The data placement strategy $S$
(1) Identify the special VMs $SV$ by Algorithm 1
(2) for $i$ = 1 to $I$ do
(3)   if $d_i$ needs to be placed in the public cloud then
(4)     …
(5)     Select the VM instances that meet the resource requirements of $d_i$
(6)     Classify $CV$ as special VMs, denoted as $CV_i$
(7)     $C$ = MAX, $j$ = 1
(8)     while $j \le |CV_i|$ do
(9)       if $cv_{i,j}$ can hold $d_i$ then
(10)        Calculate the total cost TC by (10)
(11)        if TC < $C$ then
(12)          $C$ = TC
(13)        else $j$ = $j$ + 1
(14)        end if
(15)      end if
(16)    end while
(17)    Update $S$ with cost $C$
(18)  end if
(19) end for
(20) Return $S$
From the above analysis, occupying the VMs that are close to the VMs hosting the tasks that need to access the dataset causes less energy consumption. Besides, for the energy consumed by running PMs, the main idea for saving energy is to make full use of the running PMs and to reduce the number of running PMs as far as possible. If the VMs are placed on the PMs that host the tasks, energy is saved from the perspective of both data access and PM running. Thus, the PMs are sorted in decreasing order of the distances between the VM hosting the tasks and the VM identified for hosting the dataset. Then we evaluate the PMs through multiple iterations and finally select the PM that hosts the dataset with the minimum energy consumption, calculated by formula (16).
The data placement strategy for the datasets in the private cloud could be refined as $s_i = \{vn_i, vl_i\}$, where $vn_i$ and $vl_i$ are the amount of VM instances and the VM location of $d_i$, respectively.
Algorithm 4 shows the key idea of energy aware data placement in the private cloud. In this algorithm, the input is the dataset $D$. The special VMs for hosting the tasks in the private cloud are identified by Algorithm 1 (Line (1)). Then all the datasets are traversed to check whether each dataset needs to be placed in the private cloud (Line (2) and Line (3)). For each such dataset, we traverse the PM list to select the PMs that can hold it (Lines (4) to (8)). The PM list is sorted in decreasing order of VM distance between the VMs selected by Algorithm 1 and the PMs (Line (9)). We find the PM with the minimum energy consumption for data placement through multiple iterations (Lines (10) to (20)). The final output of this algorithm is the updated data placement strategy set $S$.
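The PM selection can be illustrated with the following Python sketch. As a simplifying assumption, it compares only the parts of the total energy that change with the choice of PM: the baseline energy of a PM that would have to be switched on, and the switch energy of the data transfers. The `PM` record, the function `pick_pm`, and all numbers are hypothetical; the paper evaluates the full formula (16) for each candidate.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class PM:
    """Candidate physical machine (hypothetical record type)."""
    name: str
    running: bool          # already active, so no extra baseline energy
    free_vms: int          # spare VM instances
    base_rate: float       # baseline power rate if the PM must be switched on
    switches_to_task: int  # switches on the path to the task-hosting VM


def pick_pm(pms: Sequence[PM], needed_vms: int, run_time: float,
            transfer_time: float, switch_rate: float
            ) -> Tuple[Optional[PM], float]:
    """Return the feasible PM with the lowest additional energy."""
    best: Optional[PM] = None
    best_energy = float("inf")
    for pm in pms:
        if pm.free_vms < needed_vms:
            continue                                   # PM cannot hold the dataset
        extra_base = 0.0 if pm.running else pm.base_rate * run_time
        transfer = pm.switches_to_task * switch_rate * transfer_time
        energy = extra_base + transfer
        if energy < best_energy:
            best, best_energy = pm, energy
    return best, best_energy


candidates = [
    PM("pm1", True, 0, 100.0, 0),    # hosts the task but has no spare capacity
    PM("pm2", True, 2, 100.0, 1),    # same edge switch as the task
    PM("pm3", False, 8, 100.0, 1),   # same edge switch, but currently asleep
    PM("pm4", True, 4, 100.0, 5),    # reachable only through the core switch
]
chosen, extra = pick_pm(candidates, needed_vms=2, run_time=10.0,
                        transfer_time=2.0, switch_rate=3.0)
```

In this toy case the already running PM under the same edge switch wins, which matches the intuition in the text that nearby, active PMs save energy on both data access and PM running.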

Experimental Evaluation
In this section, we use the cloud simulator CloudSim to simulate the hybrid cloud environment and the data placement method CEDP.

Experimental Context.
In this paper, datasets at 4 different scales, with VM distributions and task distributions, are generated to validate our proposed method. Besides, the big datasets are also provided at 4 different scales. The above datasets are stored on Google Drive (https://drive.google.com/open?id=0B0T819XffFKrQVFoOHM2TU1zZHM). Our method is validated on a physical node equipped with an Intel Core i5-5300U CPU @ 2.30 GHz and 8.00 GB of memory.
The parameters used in our simulation are specified in Table 2. We use 4 types of PMs (150 PMs for each type) to construct our private cloud platform. The energy consumption rate settings are similar to our previous work in [16][17][18].
We use 4 different scales of datasets that need to be placed in the hybrid cloud, and 20% of them are privacy-aware data, which should be placed in the private cloud. For the public cloud, there are 4 types of VMs available for data placement.

Algorithm 4: Energy aware data placement in private cloud.
Input: The dataset $D$
Output: The data placement strategy set $S$
(1) Identify the special VMs $SV$ by Algorithm 1
(2) for $i$ = 1 to $I$ do
(3)   if $d_i$ needs to be placed in the private cloud then
(4)     for $j$ = 1 to $J$ do
(5)       if $p_j$ can hold $d_i$ then
(6)         Add $p_j$ to $P_i$
(7)       end if
(8)     end for
(9)     Sort $P_i$ in the decreasing order of VM distance between the VM in $SV_i$ and the VM in $P_i$
(10)    Calculate the energy consumption $ec_1$ after allocating $d_i$ to $p_{i,1}$ by (16)
(11)    …
(12)    while num ≤ $|P_i|$ do
(13)      Calculate the energy consumption $ec_{num}$ after allocating $d_i$ to $p_{i,num}$ by (16)
(14)      if $ec_{num}$ < MC then
(15)        MC = $ec_{num}$
(16)      end if
(17)      num = num + 1
(18)    end while
(19)    Update $S$ according to MC and the relevant PM
(20)  end if
(21) end for
(22) Return $S$

Performance Evaluation.
The performance evaluation is conducted from two aspects, that is, the public cloud and the private cloud. For the private cloud, we mainly focus on the energy consumption and the access time. For the public cloud, we mainly validate the method performance through comparison analysis on the cost of VM renting. As our work is the first to propose a data placement policy for privacy-aware applications over big data in the hybrid cloud, two benchmark methods are employed for comparison analysis. One is a resource utilization aware data placement method, named DP_RU, which aims to optimize the resource utilization of the cloud datacenters. The other is a distance-aware data placement method, named DP_DIS, which aims to place the datasets near the tasks that need to access them.
(1) Evaluation on Energy Consumption in Private Cloud. The energy consumption is closely related to the number of employed PMs. Figure 4 shows the comparison of the PMs employed by CEDP, DP_RU, and DP_DIS with 4 different scales of datasets for data placement in the private cloud. Figure 4 shows that our proposed method employs the same number of PMs as DP_DIS. This is because our method considers the data access time, which also depends on the distance between the tasks and the datasets. From Figure 4, we can find that, in most cases, DP_RU uses fewer PMs than CEDP and DP_DIS, because DP_RU is a greedy algorithm that aims at high resource usage, regardless of the data access time.
Although we employ more PMs than DP_RU, it does not mean that DP_RU is more energy efficient than CEDP, as the data access processes also consume a certain amount of energy.
Figure 5 shows the comparison of the total energy consumption with different scales of datasets by using CEDP, DP_RU, and DP_DIS. As shown in Figure 5, CEDP and DP_DIS achieve the same energy consumption after data placement in the private cloud. These two methods achieve better energy efficiency than DP_RU, although DP_RU employs fewer PMs than CEDP. This indicates that, with DP_RU, more energy is consumed by the switches due to data access.
(2) Evaluation on Access Time in Private Cloud. As the applications need big data for processing, the tasks need to access the placed data frequently. The access time is a key attribute for measuring the quality of cloud service. To better show the comparison analysis, we sort the experimental results in decreasing order of the access time achieved by DP_RU. From Figure 6, we can find that our method CEDP obtains better access time than DP_RU in most cases. For example, in Figure 6(c), when the total number of datasets is 1500, there are 300 privacy-aware datasets that should be placed in the private cloud, and 296 of these 300 datasets obtain better access time with CEDP than with DP_RU. Obviously, there are still some accidental cases in which DP_RU achieves better access time than CEDP. As multiple datasets could be provided for the same task, a dataset may have been placed in advance so that no spare PMs connected to the same edge switch or aggregation switch remain. Our proposed method CEDP is a global optimization method that achieves better time efficiency than DP_RU from a global perspective. Therefore, it is reasonable that such accidental cases occur for some of the datasets. Overall, CEDP obtains better time efficiency than DP_RU.
Figure 7 shows the comparison of the total access time with different scales of datasets by using CEDP, DP_RU, and DP_DIS. Figure 7 shows that CEDP gets the same access time as DP_DIS, and both of them are superior to DP_RU. For example, when the number of datasets is 2000, CEDP and DP_DIS get an overall access time near $8 \times 10^5$ seconds, whereas DP_RU needs nearly $9.5 \times 10^5$ seconds. CEDP and DP_DIS are both distance-aware data placement methods; thus they are time-sensitive.
(3) Evaluation on Cost in Public Cloud.For the performance evaluation in the public cloud, the cost for VMs renting is one of the most key metrics.The renting fee is closely relevant to the VM instance type.Thus, we analyze the number of employed VMs for data placement in public cloud.Four figures in Figure 8 show the comparison analysis of the number of employed VMs by CEDP, DP-RU, and DP-DIS with different scale of datasets placed in public cloud.From Figure 8, we can find that CEDP employs cheaper VM instances (i.e., type 1 and type 2) than DP RU and DP DIS.Besides, CEDP employs fewer expensive VM instances (i.e., type 3 and type 4) than DP RU and DP DIS.For example, in Figure 8(a), CEDP employs over 100 VM instances with type 1 and type 2, whereas DP RU and DP DIS both employ less than 50.But CEDP employs fewer VMs with respect to type 3 and type 4 VMs.
Then we compute the total cost for these 3 methods. Figure 9 shows the comparison of the total cost with different scales of datasets by using CEDP, DP-RU, and DP-DIS. In Figure 9, we can find that our method achieves cost savings compared to DP-RU and DP-DIS, as it is a cost-sensitive method for data placement in the public cloud.
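The link between the VM mix and the total cost can be sketched as a weighted sum over instance types. Everything numeric below is hypothetical: the prices, the instance counts, and the names `PRICE_CENTS_PER_HOUR` and `total_cost` are invented for illustration and are not taken from the experiments. Whether a mix with more cheap instances is actually cheaper depends on the real price gap between types.

```python
# Hypothetical hourly prices (in cents) for the four VM instance types.
PRICE_CENTS_PER_HOUR = {1: 5, 2: 10, 3: 20, 4: 40}


def total_cost(vm_counts, rental_hours=1):
    """Total renting cost in cents for a {type: count} mix of VMs."""
    return rental_hours * sum(
        PRICE_CENTS_PER_HOUR[vm_type] * count
        for vm_type, count in vm_counts.items()
    )


# Hypothetical mixes: one favours cheap types 1-2, the other types 3-4.
cheap_heavy_mix = {1: 60, 2: 50, 3: 10, 4: 5}
expensive_heavy_mix = {1: 20, 2: 20, 3: 30, 4: 25}

print(total_cost(cheap_heavy_mix))      # cost of the cheap-heavy mix
print(total_cost(expensive_heavy_mix))  # cost of the expensive-heavy mix
```

In this made-up setting the cheap-heavy mix rents more VMs in total (125 versus 95) yet costs less, which is the kind of trade-off a cost-sensitive placement tries to exploit.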

Related Work
Big data needs a huge mass of computing and storage resources, which promotes the development of cloud technology [19][20][21]. Data placement in the cloud environment has received wide attention as a way to improve the quality of cloud services.
Data Placement over Cloud. Due to the necessity and importance of data placement, there exist multiple methods to place users' data over multiple clouds [13][14][15][22][23][24]. Fan et al. [13] constructed a tripartite graph in the GBDP (genetic-based data placement) scheme and demonstrated the validity of the scheme. Jiao et al. [14] proposed an optimization approach leveraging graph cuts to optimize multiobjective data placement in multicloud for socially aware services. Yu and Pan [15] presented a location-aware associated data placement scheme that improves the co-location of associated data and localized data serving while ensuring the balance between nodes. Agarwal et al. [22] presented a system named Volley, which analyzes the logs of data center requests and outputs migration recommendations to address the data placement problem. Yu and Pan [23] proposed sketch-based data placement (SDP) to lower the overhead while keeping the benefits of data placement. Su et al. [24] argued that multicloud storage can provide better features and presented a systematic model, Triones, to formulate data placement in multicloud storage by using erasure coding.
Although cloud computing can provide rich computing and storage resources, and data placement can maximize the efficiency of cloud usage and reduce expenditure, data privacy has become the greatest concern for more and more people [25]. Hybrid clouds that combine public and private clouds can protect user privacy by placing private data on the private cloud [26].
Hybrid Cloud. Studies of hybrid cloud are also increasing, and some progress has been made on task scheduling, virtual machine scheduling, privacy, and other related topics. Some works discussed separating private data from public data and placing them in the trusted private cloud and the untrusted public cloud, respectively [27][28][29][30]. Zhou et al. [27] presented a set of techniques for privacy-aware data retrieval by splitting data and storing it on hybrid cloud. Huang and Du [28] proposed a scheme to achieve image data privacy over hybrid cloud efficiently, together with a one-to-one mapping function for image encryption. Wang and Jia [29] described several methods for protecting data security in hybrid cloud and discussed an authentication intercloud model. Abrishami et al. [30] presented a scheduling algorithm to protect data privacy while minimizing the cost and satisfying the users' limitations. The scheduling of tasks, datasets, and virtual machines in hybrid cloud to maximize the benefits and minimize the cost was studied in [31][32][33][34]. Zhou et al. [31] proposed a three-stage framework to explore the benefits of uploading applications to hybrid cloud. Qiu et al. [32] described a model for heterogeneous workload scheduling and an online algorithm for preemptive task scheduling. Zinnen and Engel [33] used HGP to estimate task execution times and proved that the former result is the same as optimization with unknown generating distributions. Bakshi [34] introduced a secure hybrid cloud approach and the virtual switching technologies. Although these papers aim to improve the overall efficiency of the hybrid cloud through scheduling, they do not consider the effect of the data storage location on overall efficiency once the tasks have been properly allocated in the public and private clouds. Our work considers the impact of data placement in a hybrid cloud environment, paying attention to the energy consumption in the private cloud and the rental price in the public cloud.
To the best of our knowledge, there are few works which focus on the data placement problem in the hybrid cloud for privacy-aware applications over big data, considering both the cost in public cloud and the energy consumption in private cloud.

Conclusion and Future Work
In the big data era, data placement becomes increasingly important for data access and analysis, as the datasets are often too large to be hosted together with the computing tasks. Cloud platforms have proved to be powerful hosts for data-intensive tasks. Besides, data privacy is also a key concern for both academia and industry; thus it is necessary to perform data placement in the hybrid cloud. In this paper, we propose an energy- and cost-aware data placement method driven by the requirements of privacy-aware applications in the hybrid cloud. Our method aims to reduce the energy consumption in the private cloud and save the cost of renting VMs in the public cloud.
For future work, we will try to apply our method to real-world workflow applications, such as weather forecasting, where the raw data should be stored in the private cloud and the intermediate data could be stored in the public cloud.

2. VM selection for data placement in public cloud

Figure 1: Specification of our proposed method.

Figure 3: An example of VM distribution in a FatTree-based data center network with core switch 1, aggregation switches 2 and 3, and edge switches 4∼7.

Figure 4: Comparison of the number of employed PMs by CEDP, DP-RU, and DP-DIS with 4 different scales of datasets placed in private cloud.

Figure 6 (including 4 subfigures) shows the comparison of the access time by CEDP, DP-RU, and DP-DIS with different scales of datasets placed in the private cloud. For these 4 scales, there are 100, 200, 300, and 400 privacy-aware datasets, respectively.

Figure 5: Comparison of total energy consumption with different scales of datasets by using CEDP, DP-RU, and DP-DIS.

Figure 7: Comparison of total access time with different scales of datasets by using CEDP, DP-RU, and DP-DIS.

Figure 8: Comparison of the number of employed VMs by CEDP, DP-RU, and DP-DIS with different scales of datasets placed in public cloud.

Figure 9: Comparison of total cost with different scales of datasets by using CEDP, DP-RU, and DP-DIS.

Table 1: Key terms and descriptions for cost, access time, and energy analysis in hybrid cloud.
Notation and description: the set of all available VM instances, {V1, V2, ...}; the mth VM in this set; D, the set of datasets that need to be placed; each dataset in D; and the binary variable that judges whether a dataset is placed on a VM.

Figure 2: An example of special VM identification with tasks (1∼4) and datasets (1∼3) deployed on VMs (V1∼V6) in the hybrid cloud.