
In hybrid cloud environments, reasonable data placement strategies are critical to the efficient execution of scientific workflows. Due to varying loads, bandwidth fluctuations, and network congestion between different data centers, as well as the dynamics of hybrid cloud environments, the data transmission time is uncertain, which poses huge challenges to the efficient data placement of scientific workflows. However, most traditional data placement solutions focus on deterministic cloud environments, which leads to excessive data transmission time for scientific workflows. To address this problem, we propose an adaptive discrete particle swarm optimization algorithm based on the fuzzy theory and genetic algorithm operators (DPSO-FGA) to minimize the fuzzy data transmission time of scientific workflows. The DPSO-FGA can rationally place scientific workflow data while meeting the requirements of data privacy and the capacity limitations of data centers. Simulation results show that the DPSO-FGA can effectively reduce the fuzzy data transmission time of scientific workflows in hybrid cloud environments.

With the widespread applications of Big Data technologies, the amount of data generated by modern network environments is greatly increasing. Therefore, traditional distributed computing modes such as grid computing may not meet the requirements of massive data processing. In recent years, cloud computing has emerged as a research hotspot [

Due to the complexity of the work process and increasing data volumes, scientific research studies with strict work steps cannot be managed manually. To address this problem, the workflow technology was proposed [

Some researchers have contributed to addressing the problem of data placement for scientific workflows. Yuan et al. [

Moreover, most traditional data placement strategies are based on deterministic environments. However, uncertainty is an essential feature of network environments, which may have a significant impact on data transmission [

To address the above problems, we propose an effective data placement strategy for scientific workflows in hybrid cloud environments. The main contributions of this paper are summarized as follows:

We define and model the data placement problem for scientific workflows in hybrid cloud environments. Specifically, we fuzzify the data transmission time into triangular fuzzy numbers and regard it as the optimization objective of the proposed model.

Based on the problem definition and model, we propose the DPSO-FGA to reduce the fuzzy data transmission time while considering the uncertainty of data transmission, the different numbers and capacities of private data centers, and network bandwidth limitations, so that it adapts well to real-world network environments.

We validate the effectiveness of the proposed DPSO-FGA method by using various scientific workflows in hybrid cloud environments, showing that it outperforms the classic CFRA and CFGA methods in terms of fuzzy data transmission time.

The rest of this paper is organized as follows. Section

Hybrid cloud environment

A hybrid cloud environment consists of public and private data centers, where each private data center has a certain capacity, while each public data center has no capacity limitation. Thus, a hybrid cloud environment is defined as a set of data centers, where dc_i represents the i-th data center and cap_i indicates the maximum capacity of a data center. Specifically, the capacity of a public data center is unlimited, while a private data center may reserve some storage space with an upper limit cap_i. If dc_i is public, it can be used to store only public data. If dc_i is private, it can be used to store both public and private data. For any two data centers dc_i and dc_j, band_ij represents the network bandwidth between them, which is assumed to be known and to fluctuate within a certain range.

Scientific workflow

The scientific workflow is a data-intensive application consisting of tasks and datasets, where a task may be related to multiple datasets and a dataset may also be related to multiple tasks. There is a data dependency relationship between the tasks, where the output datasets of a task may be the input datasets of other tasks. Meanwhile, there is also a sequential relationship between the tasks, where a task may only be executed after all its predecessor tasks have been executed. After all the tasks are completed, the scientific workflow ends. In particular, a task without a predecessor is a beginning task and a task without a successor is an ending task. Moreover, datasets can be divided into initial and generated datasets, where the original input datasets of a scientific workflow are the initial datasets and the datasets generated during the running process are the generated datasets. Also, datasets can be divided into private and public datasets, where private datasets can only be stored in private data centers and the tasks using them as input datasets must also be scheduled to the same data centers. By contrast, public datasets have no restriction on storage locations. Therefore, a scientific workflow is defined as a directed acyclic graph (DAG), where t_c represents the c-th task and the edge e_ij indicates the data dependency between tasks t_i and t_j, with e_ij = 1 indicating that t_i is the direct predecessor task of t_j. Moreover, d_l is the l-th dataset, I_i is the input dataset set of task t_i, O_i is the output dataset set of t_i, and dc(t_i) is the data center for executing t_i. Furthermore, each dataset d_i is associated with the number of the task generating d_i (which is 0 for an initial dataset) and the serial number of the data center storing d_i.

It should be noted that the settings of privacy datasets in scientific workflows need to satisfy three logical rules. Specifically, for each task t_i in the hybrid cloud environment whose dataset set (denoted by {I_i, O_i}) contains a privacy dataset, the following rules hold:

Rule 1.

Rule 2.

Rule 3. For each privacy dataset

According to Definition

Fuzzy data transmission time

When optimizing uncertainty problems, there are commonly three types of theories, including the probability theory, gray theory, and fuzzy theory. Specifically, the probability theory can be applied in sampling problems with massive samples, the gray theory is suitable for the problems with fewer samples, and the fuzzy theory can be used to solve the problems with unclear extensions of concepts [

In past research, the data transmission time was usually defined as the ratio of the dataset size to the bandwidth between data centers, without considering other essential factors such as bandwidth fluctuations. However, the data transmission time is uncertain in real-world network environments. In response to this uncertainty, triangular fuzzy numbers from the fuzzy theory are introduced to represent the data transmission time. For each independent data transmission process, the mapping {dc_i, d_k, dc_j} denotes that the dataset d_k is transmitted from the data center dc_i to dc_j. The fuzzy data transmission time of such a process is then defined as a triangular fuzzy number.

Calculation of fuzzy number

The model involves addition and comparison operations between fuzzy numbers. For the triangular fuzzy numbers a = (a_1, a_2, a_3) and b = (b_1, b_2, b_3), the operations are defined as follows.

Addition operation (calculating the fuzzy data transmission time):

Comparison operation (comparing the fuzzy completion time and choosing suitable values).

For

According to the literature [
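For concreteness, the two operations can be sketched in Python, representing a triangular fuzzy number as a 3-tuple (a_1, a_2, a_3). The addition rule is the standard component-wise one; since the paper's exact comparison criterion is elided here, a centroid-style ranking value is assumed for the comparison.

```python
# Triangular fuzzy numbers as 3-tuples (a1, a2, a3).

def fuzzy_add(x, y):
    """Standard component-wise addition of two triangular fuzzy numbers."""
    return tuple(xi + yi for xi, yi in zip(x, y))

def rank(x):
    """Scalar ranking value for comparison; the centroid-style form
    (a1 + 2*a2 + a3) / 4 is an assumed convention."""
    a1, a2, a3 = x
    return (a1 + 2 * a2 + a3) / 4

def fuzzy_less(x, y):
    """True if fuzzy number x ranks below y."""
    return rank(x) < rank(y)
```

The ranking collapses each fuzzy number to one scalar, so any two fuzzy times can be totally ordered when choosing the smaller completion time.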

The model involves the addition, subtraction, multiplication, division, fuzzification, and defuzzification operations between fuzzy and real numbers. For a triangular fuzzy number a = (a_1, a_2, a_3) and a real number r, the operations are defined as follows.

Addition and subtraction operations:

Multiplication and division operations:

Fuzzification and defuzzification operations.

On the one hand, the fuzzification operation, according to the literature [

On the other hand, the defuzzification operation is commonly used to quantitatively compare fuzzy numbers and analyze results. Li [
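A minimal sketch of the two operations, assuming a random-perturbation fuzzification (with illustrative bounds delta1 and delta2) and a graded-mean-style defuzzification with weight parameter d; both concrete forms are assumptions, not the paper's exact formulas.

```python
import random

def fuzzify(t, delta1=0.85, delta2=1.2):
    """Fuzzify a crisp time t into a triangular fuzzy number (t1, t, t2).
    delta1/delta2 are illustrative perturbation bounds (assumed values)."""
    t1 = random.uniform(delta1 * t, t)   # optimistic bound, at most t
    t2 = random.uniform(t, delta2 * t)   # pessimistic bound, at least t
    return (t1, t, t2)

def defuzzify(x, d=1.0):
    """Graded-mean-style defuzzification with weight parameter d
    (assumed form); d = 1 gives (a1 + 2*a2 + a3) / 4."""
    a1, a2, a3 = x
    return (a1 + 2 * d * a2 + a3) / (2 + 2 * d)
```

With d set to 1, as in the experiments below, the defuzzified value reduces to the same centroid-style scalar used for ranking.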

Data placement strategy

The purpose of effective data placement is to reduce the data transmission time while meeting the order of task execution, the proportion of dataset privacy, and the capacity constraints of data centers. A task can be executed only when all the datasets it requires have been transmitted to the same data center. Moreover, the time of scheduling a task to a data center is much shorter than the data transmission time, so it is ignored in the model. The mapping {dc_i, d_k, dc_j} indicates that the dataset d_k is transmitted from the data center dc_i to dc_j. In addition, the binary variable g_ijk ∈ {0, 1} indicates whether the transmission {dc_i, d_k, dc_j} occurs during this time (g_ijk = 1 for yes and g_ijk = 0 for no).

According to the above definitions, the data placement problem for scientific workflows is modeled based on the fuzzy theory, with the objective of minimizing the fuzzy data transmission time while considering the capacity constraints of data centers. In the problem model, the binary variable h_ij ∈ {0, 1} indicates whether the dataset d_j is stored in the data center dc_i (h_ij = 1 for yes and h_ij = 0 for no).

In light of the advantages of the particle swarm optimization (PSO), genetic algorithm (GA), and fuzzy theory, we propose an adaptive discrete particle swarm optimization algorithm based on the fuzzy theory and genetic algorithm operators (DPSO-FGA) to implement the effective data placement for scientific workflows, with the goal of minimizing the fuzzy data transmission time.

The PSO algorithm was derived from the literature [

Speed update:

Location update:

where the detailed definitions of the symbols can be found in [

A fitness function needs to be defined for particles to track the optimal solution during the update process. As the optimization goal, the fuzzy data transmission time is used to define the fitness function of each particle X_i.

If the total size of the datasets placed in each data center does not exceed its maximum capacity, the particle is a feasible solution; otherwise, it is infeasible. When selecting between a feasible and an infeasible solution, the feasible one is directly selected. When selecting between two feasible solutions, the particle with the smaller fitness value is selected. When selecting between two infeasible solutions, the particle with the smaller fitness value is also selected, because it is more likely to become a feasible solution in subsequent operations.
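This selection rule can be sketched as follows; the particle representation (a dict with a feasibility flag and a scalarized fitness value) is an illustrative assumption.

```python
def select(p1, p2):
    """Selection rule between two candidate particles.
    Each particle is a dict with 'feasible' (bool) and 'fitness'
    (defuzzified transmission time; lower is better)."""
    if p1["feasible"] != p2["feasible"]:
        # A feasible solution always beats an infeasible one.
        return p1 if p1["feasible"] else p2
    # Same feasibility status: keep the smaller fitness value.
    return p1 if p1["fitness"] <= p2["fitness"] else p2
```

Note that the same fitness comparison is applied to two infeasible particles, which steers the swarm toward regions where feasibility is easier to recover.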

The particle encoding needs to meet three principles, including completeness, nonredundancy, and soundness [

The following is an example of the particle encoding, where the particle number is 3, the current iteration number is 10, the number of datasets is 10, and the number of data centers is 4. Moreover, the datasets with underlines indicate that they are privacy datasets, where the data centers used for storing them cannot be changed during the subsequent update process:
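A sketch of such an encoding in Python, where position i stores the serial number of the data center holding dataset i; the concrete privacy mapping below is an assumed example, not the paper's instance.

```python
import random

# Illustrative particle encoding: position i holds the serial number
# (1..4) of the data center storing dataset i. Privacy datasets keep a
# fixed private data center; the mapping here is an assumed example.
n_datasets, n_centers = 10, 4
privacy = {2: 3, 7: 2}   # dataset index -> fixed private data center

particle = [privacy.get(i, random.randint(1, n_centers))
            for i in range(n_datasets)]
```

Positions listed in `privacy` are never touched by the update operators, which keeps every generated particle consistent with the privacy rules.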

The traditional update of the PSO algorithm is shown in equations (

For the inertial part of the traditional update, the mutation operation M_u() is introduced, which randomly changes one position in the encoded particle within its range of values. It should be noted that the positions of privacy datasets cannot be mutated. Moreover, an infeasible particle should mutate a position that makes it infeasible; thus, the selected position should be the location of an overloaded data center. For instance, the following mutation operation is taken based on equation (
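A hedged sketch of the mutation operation; the function name and argument layout are illustrative, not the paper's exact definition.

```python
import random

def mutate(particle, privacy, n_centers, overloaded=None):
    """Mutation M_u(): randomly reassign one mutable position to a
    different data center. Positions of privacy datasets are skipped;
    for an infeasible particle, the position is chosen among datasets
    placed in an overloaded data center (per the rule above)."""
    if overloaded:
        candidates = [i for i, c in enumerate(particle)
                      if c in overloaded and i not in privacy]
    else:
        candidates = [i for i in range(len(particle)) if i not in privacy]
    child = particle[:]
    pos = random.choice(candidates)
    # Reassign to any other data center, so exactly one position changes.
    child[pos] = random.choice([c for c in range(1, n_centers + 1)
                                if c != child[pos]])
    return child
```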

For the individual and population cognitions in the traditional update, the crossover operation C_p() is introduced, which crosses the current particle X_i(k) with its individual best position pBest_i(k) or the population best position gBest(k) to generate the updated particle X_i(k + 1).
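A hedged sketch of the crossover operation; the segment-copy form is an assumption modeled on classic discrete PSO-GA hybrids, and privacy positions again stay fixed.

```python
import random

def crossover(particle, guide, privacy):
    """Crossover C_p(): copy a randomly chosen segment of the guide
    particle (pBest or gBest) into the particle, leaving the positions
    of privacy datasets unchanged (segment form is an assumption)."""
    lo, hi = sorted(random.sample(range(len(particle) + 1), 2))
    child = particle[:]
    for i in range(lo, hi):
        if i not in privacy:
            child[i] = guide[i]
    return child
```

Applying the operation once with pBest and once with gBest mirrors the individual and population cognition terms of the continuous PSO update.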

In summary, the process of the particle update is defined as

Algorithm: calculation of the particle fitness.

(1) Initialize the used capacity cap_cur(i) of each data center to 0 and the fuzzy data transmission time to (0, 0, 0).

(2) For each initial dataset d_i, place it into the data center X[i] specified by the particle encoding: cap_cur(X[i]) += size(d_i).

(3) If cap_cur(X[i]) > cap(X[i]), mark the particle as infeasible, stop, and return.

(4) For each task t_j in execution order, select the data center dc_j with the least fuzzy data transmission time for gathering the input datasets of t_j.

(5) If cap_cur(j) + size(I_j) + size(O_j) > cap(j), mark the particle as infeasible, stop, and return.

(6) Place the output dataset O_j of t_j into the corresponding data center and update the used capacity.

(7) Accumulate the fuzzy data transmission time of t_j into the total fuzzy data transmission time.

The execution steps of the algorithm are described as follows. First, the used capacity cap_cur(i) of each data center is initialized to 0, along with the fuzzy data transmission time. Next, each initial dataset is placed into the data center specified by the particle encoding and the used capacity cap_cur(i) is updated. If cap_cur(i) exceeds the maximum capacity of the data center, the solution corresponding to the particle is infeasible and the current operation is stopped and returned. Then, for each task t_j, the data center dc_j with the smallest fuzzy data transmission time is always selected to place the task t_j. If the capacity constraint is violated (i.e., the sum of cap_cur(j), size(I_j), and size(O_j) exceeds the maximum capacity of the data center), the current operation is stopped and returned. Otherwise, the output dataset O_j of the task t_j is placed into a corresponding data center and the storage capacity of data centers is updated.
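The fitness evaluation above can be condensed into a Python sketch. The data structures (sizes, tasks, cap, bw), the fixed ±10% fuzzification bounds, and the centroid-style ranking are illustrative assumptions, and the placement of output datasets is simplified away.

```python
def fuzzy_add(x, y):
    return tuple(a + b for a, b in zip(x, y))

def rank(x):  # centroid-style ranking of a triangular fuzzy number (assumed)
    return (x[0] + 2 * x[1] + x[2]) / 4

def gather_time(center, inputs, place, sizes, bw):
    """Fuzzy time to move a task's input datasets to `center`.
    Bandwidth fluctuation is modeled by fixed +-10% bounds (assumption)."""
    t = (0.0, 0.0, 0.0)
    for i in inputs:
        src = place[i]
        if src != center:
            mid = sizes[i] / bw[(src, center)]
            t = fuzzy_add(t, (0.9 * mid, mid, 1.1 * mid))
    return t

def fitness(place, sizes, tasks, cap, bw):
    """Condensed fitness sketch: place initial datasets, check
    capacities, schedule each task to its cheapest data center, and
    accumulate the fuzzy transmission time. Returns (feasible, time)."""
    used = {c: 0.0 for c in cap}
    total = (0.0, 0.0, 0.0)
    for i, c in enumerate(place):          # step (2)-(3): initial placement
        used[c] += sizes[i]
        if used[c] > cap[c]:
            return False, total            # infeasible particle
    for inputs in tasks:                   # step (4): cheapest data center
        best = min(cap, key=lambda c: rank(
            gather_time(c, inputs, place, sizes, bw)))
        total = fuzzy_add(total, gather_time(best, inputs, place, sizes, bw))
    return True, total
```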

The inertial weight w is adaptively adjusted during the update process of the particle X_i(k), decreasing from its initial value to its final value over the iterations to balance global exploration in the early stage and local exploitation in the later stage.

Moreover, the individual and population cognition factors (i.e., c_1 and c_2) are defined by using the gradient descent method [
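The coefficient ranges from the parameter table (w: 0.9 to 0.4, c_1: 0.9 to 0.2, c_2: 0.4 to 0.9) can be sketched as simple linear schedules over the iterations; the linear form is an assumption, since the paper's adaptive definitions are not fully reproduced here.

```python
def weights(k, k_max, w_rng=(0.9, 0.4), c1_rng=(0.9, 0.2), c2_rng=(0.4, 0.9)):
    """Iteration-dependent coefficients (linear schedules assumed):
    inertia w decays 0.9 -> 0.4, individual cognition c1 decays
    0.9 -> 0.2, and population cognition c2 grows 0.4 -> 0.9,
    matching the ranges in the parameter table."""
    f = k / k_max                                  # progress in [0, 1]
    lin = lambda rng: rng[0] + (rng[1] - rng[0]) * f
    return lin(w_rng), lin(c1_rng), lin(c2_rng)
```

Early iterations thus favor the inertial (mutation) and individual-cognition parts, while later iterations favor the population-cognition part.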

The scientific workflow model comes from five different scientific fields [

Environment and parameter settings.

Parameter | Value |
---|---|
Population size | 100 |
Maximum number of iterations | 1000 |
Inertial weight w (initial, final) | 0.9, 0.4 |
Individual cognition factor c_1 (initial, final) | 0.9, 0.2 |
Population cognition factor c_2 (initial, final) | 0.4, 0.9 |
Hybrid cloud environment | DC = {DC_pub, DC_pri}, where DC_pub = {dc_1} and DC_pri = {dc_2, dc_3, dc_4} |
Software environment | Windows 10 64-bit, Python 3.7 |
Hardware environment | Intel® Core™ i7-6700HQ CPU @ 2.60 GHz, 8.00 GB RAM |

Moreover, some extra settings are shown as follows:

Maximum capacity: the datum capacity is set to

Bandwidth (M/s) between data centers: the bandwidth between dc_1 and {dc_2, dc_3, dc_4} is set to {10, 20, 30}, the bandwidth between dc_2 and {dc_3, dc_4} is set to {150, 150}, and the bandwidth between dc_3 and dc_4 is set to 100.

Proportion of privacy datasets: due to the difference of datasets among various workflows, the proportions of privacy datasets in the scientific workflows, including CyberShake, Epigenomics, Inspiral, Montage, and Sipht, are set to {0.25, 0.2, 0.2, 0.2, 0.02}, respectively.

Fuzzy parameter: based on the fuzzy theory, the data transmission time

The proposed DPSO-FGA is compared with the constraint fuzzy randomized algorithm (CFRA) and the constraint fuzzy greedy algorithm (CFGA), which improve the performance of the randomized algorithm (RA) and greedy algorithm (GA) in data placement. The CFRA and CFGA rely on the fuzzy theory while considering some essential conditions, including the application scenarios of scientific workflows, privacy settings, and capacity constraints. These conditions require meeting the maximum capacity of data centers and the proportion of private datasets during the data placement process.

The steps of CFRA

The steps of CFGA

To avoid the randomness of results, 10 independent experiments are carried out on five scientific workflows under different environment settings. Table

Average fuzzy data transmission time of different algorithms under various scientific workflows.

Workflow | Algorithm | Average fuzzy data transmission time (s) |
---|---|---|
CyberShake | DPSO-FGA | (375683.37, 415977.97, 448166.12) |
 | CFRA | (613685.49, 658202.55, 728115.64) |
 | CFGA | (760212.74, 819721.35, 890377.31) |
Epigenomics | DPSO-FGA | (297570.10, 300909.35, 326971.76) |
 | CFRA | (561274.26, 606573.37, 654489.23) |
 | CFGA | (1192066.03, 1242136.17, 1290256.56) |
Inspiral | DPSO-FGA | (2356942.36, 2521230.04, 2734877.94) |
 | CFRA | (5859201.90, 6335216.52, 7023988.75) |
 | CFGA | (8025939.27, 8547697.91, 9737463.34) |
Montage | DPSO-FGA | (794955.76, 840543.65, 863448.55) |
 | CFRA | (1242644.26, 1331192.71, 1445410.25) |
 | CFGA | (1920877.78, 2058800.54, 2311869.58) |
Sipht | DPSO-FGA | (10733197.95, 12051002.69, 1289802.65) |
 | CFRA | (11543263.40, 12494008.57, 13725762.53) |
 | CFGA | (13105310.17, 14687168.03, 1591557.53) |

In the subsequent experimental results, the fuzzy data transmission time is defuzzified to make the comparison between the algorithms more intuitive, where ∂ is set to 1.

Figure

Average fuzzy data transmission time of different algorithms under various scientific workflows.

From the perspective of algorithms, the DPSO-FGA outperforms the CFRA and CFGA. This is because the CFGA may easily fall into a local optimum by using the GA during execution and thus ignores the global performance. Moreover, the overall performance of the CFRA is better than that of the CFGA, since the search space of the CFRA is larger and it does not fall into a local optimum; thus, the CFRA can obtain a good solution when the algorithm runs for a long time. However, the CFRA does not consider the fitness of the current particle when a solution is generated, and thus its performance is worse than that of the DPSO-FGA. From the perspective of workflows, the data transmission time of the same algorithm in various scientific workflows differs significantly. Although all these scientific workflows contain about 50 tasks, the number of dataset usages varies greatly. For example, CyberShake uses datasets only about 70 times, while Sipht uses datasets up to 4000 times, which results in the different data transmission times between them.

As the number of private data centers in a hybrid cloud environment sometimes changes, the performance of DPSO-FGA needs to be evaluated with different numbers of private data centers. Thus, we change the number of private data centers without modifying other default settings. Specifically, these three algorithms are tested when the number of private data centers is set to {3, 5, 6, 8, 10}, where the bandwidth between newly added private data centers and public data centers is set to 20 M/s and the bandwidth between other private data centers is set to 120 M/s. The experimental results are shown in Figure

Average data transmission time of different algorithms with various numbers of private data centers: (a) Cybershake; (b) Epigenomics; (c) Inspiral; (d) Montage; (e) Sipht.

From the perspective of algorithms, the DPSO-FGA outperforms the CFRA and CFGA, and the reasons have been analyzed in Figure

Since the maximum capacity of private data centers is regarded as a constraint, the sensitivity of the DPSO-FGA to this constraint needs to be evaluated. Specifically, the CyberShake is selected as the scientific workflow for experiments, the multiple of datum capacity is set to {2, 2.6, 3, 5, 8}, and the rest of the settings remain default. The experimental results are shown in Figure

Average fuzzy data transmission time of different algorithms with various capacities of private data centers.

When the maximum capacity of private data centers increases and the bandwidth between data centers remains the same, each data center is able to store more datasets and the datasets required for executing tasks become more concentrated. Therefore, the data transmission time of the DPSO-FGA is reduced. Specifically, the fastest decline in data transmission time happens when the maximum capacity is 2∼3 times the datum capacity, and the slowest decline happens when it is 5∼8 times the datum capacity. This is because when the maximum capacity of data centers is small, the available space is limited and the placement locations of datasets are restricted. Thus, the maximum capacity has a significant impact on the data transmission time. When the maximum capacity of data centers becomes larger, each data center can store more datasets, and the operational requirements of scientific workflows are easily met. Therefore, the maximum capacity has little effect on the data transmission time.

Finally, the performance of the DPSO-FGA is evaluated under different bandwidths between data centers. Specifically, CyberShake is selected as the scientific workflow for experiments, the multiple of the bandwidth between data centers relative to the default one is set to {0.5, 0.8, 1.5, 3, 5}, and the rest of the settings remain default. The experimental results are shown in Figure

Average fuzzy data transmission time of different algorithms with various bandwidths between data centers.

In this paper, we propose a DPSO-FGA-based data placement method for scientific workflows in hybrid cloud environments. Based on the fuzzy theory, the DPSO-FGA fuzzifies the data transmission time to adapt to real-world network environments while considering the characteristics of hybrid cloud environments, bandwidth fluctuations, capacity limitations of private data centers, and dependencies between different scientific workflow tasks. Simulation results demonstrate the effectiveness of the proposed DPSO-FGA method. In the future, we will study the impact of other essential factors on the proposed method, such as different proportions of private datasets in scientific workflows and capacities of various private data centers. Moreover, in scenarios where data transmission time is not critical, such as business network environments, the data transmission cost between different clouds should also be regarded as a prioritized optimization goal. Therefore, a comprehensive model for minimizing both the fuzzy data transmission time and costs will be researched.

The data used to support the findings of this study are produced by a public workflow generator available at

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Zheyi Chen and Xu Zhao contributed equally to this work. Zheyi Chen and Xu Zhao developed the model, carried out the parameter estimations, and planned as well as performed the experiments. Zheyi Chen wrote the main part of the manuscript, while Xu Zhao and Bing Lin provided support for the writing materials. Bing Lin also took part in the design and evaluation of the model. Zheyi Chen and Bing Lin reviewed the manuscript. All the authors read and approved the final manuscript.

This work was supported by the National Key R&D Program of China (Grant no. 2018YFB1004800), Natural Science Foundation of China (Grant nos. 61672159, 41801324, and 61972165), Natural Science Foundation of Fujian Province (Grant nos. 2019J01286, 2019J01244, and 2018J01619), Young and Middle-Aged Teacher Education Foundation of Fujian Province (Grant no. JT180098), Open Foundation of Engineering Research Center of Big Data Application in Private Health Medicine, Fujian Province University (Grant no. KF2020001), Talent Program of Fujian Province for Distinguished Young Scholars in Higher Education, and China Scholarship Council (no. 201706210072). The authors sincerely thank Dr. Jia Hu and Dr. Geyong Min for providing useful advice that greatly improved this paper.