Research on Optimal Path of Data Migration among Multisupercomputer Centers

Data collaboration between supercomputer centers requires a large amount of data migration. To increase the efficiency of data migration, it is necessary to design an optimal path for data transmission among multisupercomputer centers. Because a target center that has finished receiving data can be regarded as a new source center that migrates data to others, we present a parallel scheme for data migration among multisupercomputer centers with different interconnection topologies, using graph theory analysis and calculations. Finally, we verify via numerical simulation that this method is effective.


Introduction
The level of supercomputing development is an important manifestation of a country's comprehensive power and has become a strategic high ground for which every country competes. Supercomputers can be employed to solve a series of challenging problems in economic construction, social development, technological innovation, industrial upgrading, national security, and other areas.
The computing power and storage capacity of supercomputer centers help enterprises and scientific research institutions carry out large-scale data-intensive computing tasks more easily, and they have been increasingly applied in numerous fields, including global climate simulation and modeling, gene mapping, gamma ray bursts, financial investment decisions, economic policy simulations, aerospace monitoring and control, and electronic very long baseline interferometry [1][2][3][4][5][6][7]. In particular, petascale computing enables detailed numerical simulation of multiphysics, cosmological evolution, molecular dynamics, and biomolecules [8]. It is important to note that some data-intensive computing tasks, such as the Large Hadron Collider experiment, require supercomputer centers to work cooperatively. However, because supercomputer centers are located in different regions, such cooperation inevitably results in data migration between centers. For TB-class or even PB-class data, the limited bandwidth of supercomputer centers inevitably leads to longer data migration times, which prolong the overall task completion time [9,10]. In the case of multifile transfer, time delay also increases the data migration time and therefore the completion time of the overall task. Therefore, when there is a transmission task between different centers, optimizing the transmission path so that it is the shortest reduces the time for which bandwidth resources are occupied and avoids affecting the migration of other data, which ultimately improves the efficiency of supercomputer centers [11].
Data migration among supercomputer centers is a kind of shortest path problem in graph theory. There are several classic algorithms for this problem, such as Dijkstra, Floyd, Bellman-Ford, and SPFA. However, these algorithms cannot be used directly to search for the shortest path of data migration in the scenario where the source node can transfer data to a plurality of nodes connected to it and where a node that has obtained data from the previous one can also be used as a source node. In computer science, a lot of related research has been done. Zhu  Path Searching (APS) for constructing the broadcast index at mobile clients with soft arrival times to destinations for this problem [12]. Ward and Wiegand studied complexity results on labeled shortest path problems from wireless routing metrics [13]. Xie et al. found alternative shortest paths in spatial networks [14]. Buchholz and Felko presented a new approach to modeling weighted graphs with correlated weights at the edges. Such models are meaningful in describing many real-world problems, such as routing in computer networks or finding shortest paths in traffic models under realistic assumptions [15]. Sommer addressed shortest path queries in static networks [16]. Many other researchers have also done a lot of work on this issue. However, for the shortest path of data migration among multisupercomputer centers with different interconnection topologies, the relevant research and algorithms are very limited. Therefore, we carry out research on this issue.

Factors of Data Migration
Assume that there are five supercomputer centers located in different regions (or countries), denoted by A, B, C, D, and E, whose distribution is shown in Figure 1. Data migration needs to be carried out between them because of business collaborations. In engineering practice, the factors affecting data migration time are mainly physical factors, network link factors, the transmission protocols used, and so forth. However, in this paper, we assume that the physical factors and transmission protocols are the same, so the main factors affecting migration time are the following network link factors.
(a) Bandwidth. The network links used by supercomputer centers are provided by network operators rather than by private networks because of the high cost, so the transmission medium and bandwidth are deterministic.
(b) Delay. Delay includes Processing Delay ($D_{\mathrm{proc}}$), Transmission Delay ($D_{\mathrm{trans}}$), Propagation Delay ($D_{\mathrm{prop}}$), and Queuing Delay ($D_{\mathrm{queue}}$). $D_{\mathrm{proc}}$ and $D_{\mathrm{queue}}$ are determined by the computing capability and the hardware performance of each node (physical device). $D_{\mathrm{trans}}$ is determined by bandwidth. $D_{\mathrm{prop}}$ is determined by the length of the link [17,18]. When the migrated data is $M$ bits, the delay is

$$D(M) = D_{\mathrm{proc}} + \frac{M}{Bw} + D_{\mathrm{prop}} + D_{\mathrm{queue}},$$

where $D(M)$ is the total delay, $Bw$ is the bandwidth, and $M/Bw$ is the transmission delay $D_{\mathrm{trans}}$. The optimal path of data migration among supercomputer centers is obtained when

$$T = \min D(M),$$

where $T$ is the data migration time. Supercomputer centers are very advanced in terms of physical devices (such as network cards), so $D_{\mathrm{proc}}$ and $D_{\mathrm{queue}}$ can be ignored. The transmission speed of an electrical signal is approximately equal to the speed of light and the route between supercomputer centers is less than 300000 kilometers, so $D_{\mathrm{prop}}$, which depends on these two factors, is very short (at most about one second) and can also be ignored. In this case, the optimal path depends mainly on $D_{\mathrm{trans}}$, which is determined by the size of the migrated data and the bandwidth. Assuming that the migrated data is $M$ bits and the bandwidth between two centers is $Bw_{ij}$, the data migration time determined by these two parameters is as given in Table 1.
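As a small illustration of the simplification above, the following sketch computes the transmission delay $M/Bw$ in hours. The function name `migration_time_hours` and the example figures (10 TB over a 1 Gbit/s link) are assumptions chosen for illustration and are not values from Table 1.

```python
def migration_time_hours(data_bits: float, bandwidth_bps: float) -> float:
    """Transmission delay M / Bw in hours.

    Processing, queuing, and propagation delays are ignored,
    as assumed in the text.
    """
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth must be positive")
    return data_bits / bandwidth_bps / 3600.0


# Hypothetical example: 10 TB (decimal units) over a 1 Gbit/s link.
data_bits = 10 * 10**12 * 8
hours = migration_time_hours(data_bits, 1e9)
print(round(hours, 2))  # 22.22
```

At this scale the transfer takes many hours, which is why a propagation delay of about one second can reasonably be neglected.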

Optimal Path Planning for Data Migration
3.1. Main Theory. The shortest path problem is a classical problem in graph theory, which aims to find the shortest path between two nodes in a graph. Therefore, this paper carries out the relevant research using graph theory.
(1) A graph is a data structure consisting of vertices and edges, usually expressed as $G(V, E)$.
(2) A value $w(e)$ can be assigned to each edge $e$ in $G(V, E)$. $w(e)$ is called the weight, which represents the delay of each link.
(3) Given two vertices in a weighted graph, the path with the minimum weight is the shortest path between them [19].
(4) For a weighted graph, the weighted adjacency matrix $A = (a_{ij})$ can be expressed as

$$a_{ij} = \begin{cases} w_{ij}, & (v_i, v_j) \in E, \\ 0, & i = j, \\ \infty, & \text{otherwise}, \end{cases}$$

where $v_i$ and $v_j$ represent the vertices of the weighted graph, $w_{ij}$ denotes the weight of the edge determined by $v_i$ and $v_j$, $a_{ij}$ is the value of the weighted adjacency matrix, and $E$ denotes the set of all edges.
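The weighted adjacency matrix in item (4) can be built as in the following minimal sketch. The helper name `adjacency_matrix`, the assumption that links are symmetric, and the three-center example weights are illustrative and are not taken from Table 1.

```python
import math


def adjacency_matrix(n, edges):
    """Build the weighted adjacency matrix of an undirected graph.

    `edges` maps index pairs (i, j) to weights. The diagonal is 0 and
    missing links are represented by math.inf.
    """
    a = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), weight in edges.items():
        a[i][j] = weight
        a[j][i] = weight  # assumption: migration time is the same in both directions
    return a


# Hypothetical three-center example (weights in hours).
A = adjacency_matrix(3, {(0, 1): 4.0, (1, 2): 2.5})
for row in A:
    print(row)
```

Representing a missing link by infinity lets the same matrix describe both fully connected and nonfully connected topologies.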

3.2. Searching Method: Floyd. The Floyd algorithm is one of the simplest shortest path algorithms and obtains the shortest path between any two nodes in a graph. To make it more convenient to discuss the shortest path between supercomputer centers under a variety of scenarios, this paper takes the Floyd algorithm as an example. The constructor method of Floyd is as follows.
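Assume that the vertices of the weighted graph are $v_1, v_2, \ldots, v_\nu$ and that $A = (a_{ij})$ is its weighted adjacency matrix. Then

$$D^{(0)} = \bigl(d_{ij}^{(0)}\bigr) = A, \qquad R^{(0)} = \bigl(r_{ij}^{(0)}\bigr), \quad r_{ij}^{(0)} = j,$$

$$D^{(k)} = \bigl(d_{ij}^{(k)}\bigr), \quad d_{ij}^{(k)} = \min\bigl\{ d_{ij}^{(k-1)},\; d_{ik}^{(k-1)} + d_{kj}^{(k-1)} \bigr\},$$

$$R^{(k)} = \bigl(r_{ij}^{(k)}\bigr), \quad r_{ij}^{(k)} = \begin{cases} k, & d_{ij}^{(k-1)} > d_{ik}^{(k-1)} + d_{kj}^{(k-1)}, \\ r_{ij}^{(k-1)}, & \text{otherwise}, \end{cases} \qquad k = 1, 2, \ldots, \nu,$$

where $d_{ij}^{(k)}$ denotes the length of the shortest path among all paths from $v_i$ to $v_j$ whose intermediate vertices are taken from $v_1, \ldots, v_k$. $D^{(k)}$ is the distance matrix. $R^{(k)}$ is the path matrix, and $r_{ij}^{(k)}$ is the number of the node through which the shortest path from $v_i$ to $v_j$ passes.
$R^{(\nu)}$ is obtained at the same time as $D^{(\nu)}$, and the shortest path between any two nodes can be found from $R^{(\nu)}$ [20,21].

3.3. Searching Process. According to the data migration time between any two centers as assumed in Table 1, the constraint network graph with weight $w(e)$ is shown in Figure 2.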
In this way, following the constructor method and the constraint network graph with weight $w(e)$, we obtain the result after the insertion of each node ($v_1, v_2, \ldots, v_\nu$) through an iterative process ($D^{(1)}$ represents $k = 1$, which means that $v_1$ is inserted. Similarly, $D^{(2)}$ represents $k = 2$, which means that $v_2$ is inserted. The rest can be done in the same manner. Moreover, "=" in the matrix marks an element that has changed after the iteration, and A, B, C, D, and E are the five nodes). This yields the corresponding distance matrices $D^{(k)}$ and path matrices $R^{(k)}$, $k = 1, \ldots, 5$. The shortest path of data migration between any two supercomputer centers can be obtained from the distance matrix, and the route of data migration between any two supercomputer centers can be traced from the path matrix. As we can see from the matrix $R^{(5)}$, to find the shortest path from A to D, we first go to node B. We then search for the shortest path from B to D and find that B connects directly to D, so the migration path is A → B → D. We can see from the matrix $D^{(5)}$ that the length of this shortest path is 22.4.
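The iterative process can be sketched as follows. This is a minimal illustration of the Floyd algorithm with a path matrix that stores an intermediate node, not the authors' implementation. The function names `floyd` and `trace_path` and the four-node weights are hypothetical (Table 1 is not reproduced here), and nodes are indexed from 0.

```python
import math

INF = math.inf


def floyd(a):
    """Return (distance matrix, path matrix) for adjacency matrix `a`.

    r[i][j] == j means the best path from i to j is the direct link;
    otherwise r[i][j] is an intermediate node on the shortest path.
    """
    n = len(a)
    d = [row[:] for row in a]
    r = [[j for j in range(n)] for _ in range(n)]
    for k in range(n):  # insert node k as a possible intermediate
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
                    r[i][j] = k
    return d, r


def trace_path(r, i, j):
    """Expand the path matrix into the node sequence from i to j.

    Assumes j is reachable from i and all weights are positive.
    """
    if i == j:
        return [i]
    k = r[i][j]
    if k == j:
        return [i, j]
    return trace_path(r, i, k) + trace_path(r, k, j)[1:]


# Hypothetical symmetric weights for four nodes.
a = [
    [0, 3, 10, INF],
    [3, 0, 4, 9],
    [10, 4, 0, 2],
    [INF, 9, 2, 0],
]
d, r = floyd(a)
print(d[0][3], trace_path(r, 0, 3))  # 9 [0, 1, 2, 3]
```

As in the A → B → D example, the route is recovered by looking up the intermediate node and then resolving the two remaining segments in the same way.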

Application Model
We can easily obtain the shortest path between any two nodes from the searching process above. That is, if one supercomputer center is fixed, the other supercomputer center that can perform large-scale data-intensive computing tasks with it is also determined. However, the algorithm above cannot be applied to searching for the shortest path when the data is distributed from one node to all nodes.

4.1. TSP Model That Does Not Return to the Source Node.
Typically, data migration between supercomputer centers is interpreted as selecting a node as the source node and migrating data from this source node to the other supercomputer centers, with each node routed through only once. This scenario is the TSP (Traveling Salesman Problem) model that does not return to the source node [22,23]. TSP is the following problem: given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city? A lot of research has been done in this field to solve the optimal data migration path problem based on the TSP model, so this paper does not investigate this scenario further. Taking Figure 2 as an example, if the source node is 3 (C), the shortest path is 39.2 and the migration path is C → D → B → A → E.
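For a small number of nodes, the TSP model that does not return to the source node can be solved by brute force, as in the following sketch. The function name `shortest_open_tour` and the four-node weights are hypothetical and are not the data of Figure 2.

```python
from itertools import permutations


def shortest_open_tour(w, source):
    """Cheapest path that starts at `source`, visits every other node
    exactly once, and does not return to the source (brute force)."""
    others = [v for v in range(len(w)) if v != source]
    best_cost, best_path = float("inf"), None
    for order in permutations(others):
        path = (source,) + order
        # Data is relayed node by node, so the times add up along the path.
        cost = sum(w[u][v] for u, v in zip(path, path[1:]))
        if cost < best_cost:
            best_cost, best_path = cost, path
    return best_cost, best_path


# Hypothetical symmetric weights for four fully connected nodes.
w = [
    [0, 3, 10, 8],
    [3, 0, 4, 9],
    [10, 4, 0, 2],
    [8, 9, 2, 0],
]
cost, path = shortest_open_tour(w, 0)
print(cost, path)  # 9 (0, 1, 2, 3)
```

The number of candidate orders grows factorially with the number of nodes, so this enumeration is only practical for small graphs.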

4.2. Data Migration in Parallel.
The scenario introduced in Section 4.1 is typical, but it does not consider the case in which a node that is migrating data to another node also migrates data to other nodes at the same time. Moreover, a node that has obtained data from a previous node can also be used as a source node, and all source nodes can migrate data in parallel to the plurality of nodes connected to them.
To solve this problem, this paper presents an optimal path algorithm, which is as follows.
The shortest path is obtained when all nodes have been reached ($k = n - 1$). Of course, the shortest path and the migration path can be obtained by enumeration when the number of nodes is small. To verify the correctness of this algorithm, we compared its results with those of enumeration. Taking Figure 2 as an example, the shortest path obtained by enumeration with each node as the source point is shown in Table 2.
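One possible way to carry out such an enumeration is sketched below. It is an assumption about the procedure, not the authors' method: every simple path starting at the source is enumerated, and the smallest accumulated time at each node is kept. The function name `enumerate_arrival` and the weights are hypothetical.

```python
import math


def enumerate_arrival(w, source):
    """Earliest arrival time at each node, found by enumerating every
    simple path that starts at `source` (exponential; small graphs only)."""
    n = len(w)
    best = [math.inf] * n

    def visit(u, elapsed, seen):
        best[u] = min(best[u], elapsed)
        for v in range(n):
            if v not in seen and w[u][v] != math.inf:
                visit(v, elapsed + w[u][v], seen | {v})

    visit(source, 0.0, {source})
    return best


# Hypothetical symmetric weights for four fully connected nodes.
w = [
    [0, 3, 10, 8],
    [3, 0, 4, 9],
    [10, 4, 0, 2],
    [8, 9, 2, 0],
]
best = enumerate_arrival(w, 2)
print(best, max(best))  # [7.0, 4.0, 0.0, 2.0] 7.0
```

In this reading, the time to reach all nodes is the largest of the earliest arrival times, which is 7.0 in the example.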
However, as the number of nodes increases, the difficulty of enumeration increases rapidly. The proposed algorithm can easily obtain the shortest path results, but carrying it out by hand is error-prone. We therefore developed a function based on this algorithm to simplify the calculation, which has been tested in MATLAB. The function is given as Algorithm 1.
For example, selecting 3 (C) as the source point, the simulation result is shown in Figure S.1 in Supplementary Material available online at http://dx.doi.org/10.1155/2016/5018213.
The result shows that the shortest path is 28.8 and the data migration path is {C → D → B, C → E → A}. The simulation result is the same as the result of enumeration, so the correctness of the algorithm is verified.
It can be concluded that this algorithm is accurate and friendly (machine-executable). Although the time complexity is $O(n^2)$, the algorithm is feasible for solving these complex problems.
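The following is a minimal sketch of one way to compute the parallel migration time. It is not the MATLAB function of Algorithm 1: it swaps in a standard Dijkstra-style search and assumes that every node starts forwarding to all of its neighbours as soon as it has received the data and that concurrent transfers do not share bandwidth. The function name `parallel_migration` and the weights are hypothetical.

```python
import heapq
import math


def parallel_migration(w, source):
    """Earliest arrival time and predecessor of every node when each node
    forwards the data in parallel as soon as it has received it."""
    n = len(w)
    arrival = [math.inf] * n
    parent = [None] * n
    arrival[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if t > arrival[u]:
            continue  # stale queue entry
        for v in range(n):
            if v != u and t + w[u][v] < arrival[v]:
                arrival[v] = t + w[u][v]
                parent[v] = u
                heapq.heappush(heap, (arrival[v], v))
    return arrival, parent


# Hypothetical symmetric weights for four fully connected nodes.
w = [
    [0, 3, 10, 8],
    [3, 0, 4, 9],
    [10, 4, 0, 2],
    [8, 9, 2, 0],
]
arrival, parent = parallel_migration(w, 2)
completion = max(arrival)  # time at which the last node has the data
print(arrival, parent, completion)
```

In this example node 0 is reached through node 1 (2 → 1 → 0) rather than over the slower direct link, which mirrors the way a center that has received the data becomes a new source.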

4.3. Data Migration among Nonfully Connected Centers.
The above discussion covers the ideal case in which all nodes are connected to each other. If the given nodes are not all connected to each other, can this algorithm still be used to search for the shortest path? To verify the correctness of the algorithm in this case, this paper again takes Figure 2 as an example. We assume that the links between node A and node C, node B and node C, node B and node D, and node B and node E are removed (of course, any links between the nodes could be removed); the entire constraint network graph is then as shown in Figure 3. First, get the shortest path and the migration path through the enumeration method, just as in Section 4.2. The results are shown in Table 3.
Second, get the shortest path and the migration path through simulation.
Finally, compare the two sets of results and determine whether the results are identical.
In addition, setting 3 (C) as the source point, the simulation result is shown in Figure S.2 in Supplementary Material.
The result shows that the shortest path is 48.8 and the data migration path is {C → E → A → B, C → D}. The simulation result is the same as the result of enumeration, so this algorithm can be used to search for the shortest path in this case.
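The effect of removing links can be illustrated with the following sketch, in which a removed link is modelled by an infinite weight. The arrival times are computed by simple repeated relaxation, which is an assumption made for illustration. The function names `arrival_times` and `remove_links` and the weights are hypothetical and are not the data of Figure 3.

```python
import math


def arrival_times(w, source):
    """Earliest arrival time at each node by repeated relaxation."""
    n = len(w)
    t = [math.inf] * n
    t[source] = 0.0
    for _ in range(n - 1):  # a path visits at most n - 1 links
        for u in range(n):
            for v in range(n):
                if t[u] + w[u][v] < t[v]:
                    t[v] = t[u] + w[u][v]
    return t


def remove_links(w, links):
    """Return a copy of `w` with the given undirected links removed."""
    w2 = [row[:] for row in w]
    for u, v in links:
        w2[u][v] = w2[v][u] = math.inf
    return w2


# Hypothetical symmetric weights for four fully connected nodes.
w = [
    [0, 3, 10, 8],
    [3, 0, 4, 9],
    [10, 4, 0, 2],
    [8, 9, 2, 0],
]
full = arrival_times(w, 2)
partial = arrival_times(remove_links(w, [(2, 1), (2, 0)]), 2)
print(max(full), max(partial))  # 7.0 11.0
```

With fewer links the data has to take longer detours, so the time to reach all nodes grows from 7.0 to 11.0 in this example, in the same way that the result in the text grows from 28.8 to 48.8.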

4.4. Data Migration When Nodes Are Increasing.
It is easy to find the shortest path when the number of nodes is small. However, what happens as the number of nodes increases? To verify that this algorithm can be applied to an arbitrary weighted graph, we carried out further experiments. The number of nodes and the weights between them can be chosen at random. For example, we randomly chose ten nodes and the weights between them; the entire constraint network graph is shown in Figure 4.
Setting 3 (C) as the source point, the simulation result is shown in Figure S.3 in the Supplementary Material. The result shows that the shortest path is 55 and the data migration path is {C → D → B → F → H → G, C → E → A, D → J → I}. Note, however, that it would be very troublesome to verify this result directly by enumeration as in Sections 4.2 and 4.3, so we borrowed the calculating process of Floyd. Searching for the shortest paths in accordance with the process described in Section 3.2, the result shows that the shortest path C → G has the largest weight and that the corresponding data migration path is {C → D → B → F → H → G}. As the goal is to migrate data from node C to all the other nodes, this path {C → D → B → F → H → G} with the largest weight is certainly one of the paths we need. After this path is determined, the entire constraint network graph shown in Figure 4 can be simplified to Figure 5(a) (the paths between the nodes that have already been reached are removed). To simplify the calculation, we changed the form of Figure 5(a) into Figure 5(b) and marked the arrival time of the data at each node (as shown in the boxes).
We have determined one path above ({C → D → B → F → H → G}), so 4 nodes, A, E, I, and J, have not yet been reached. As shown in Figure 5(b), it is easy to calculate the shortest paths to the remaining nodes, which are {C → E → A, D → J → I}. After these two steps, every node has been reached, and the shortest data migration path from node C to all the other nodes is {C → D → B → F → H → G, C → E → A, D → J → I}. The weights of these migration paths are, respectively, {55, 28.8, 48}, so the shortest path is the largest of them, which is 55. This agrees with the result obtained by the simulation. Therefore, this algorithm can be used to search for the shortest path when data migration is in parallel, and it can be applied to arbitrary weighted graphs.
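The final step of this verification can be sketched as follows. The tree structure and the three weights are those reported in the text. The helper name `branches` is hypothetical, and the weights {55, 28.8, 48} are interpreted here as the arrival times at the end nodes G, A, and I, which is an assumption. Each branch is listed from the source C, so the branch written as D → J → I in the text appears as C → D → J → I.

```python
def branches(parent, source):
    """Source-to-leaf migration paths of a tree given as child -> parent."""
    inner = set(parent.values())
    leaves = [v for v in parent if v not in inner]
    paths = []
    for leaf in leaves:
        path = [leaf]
        while path[-1] != source:
            path.append(parent[path[-1]])
        paths.append(path[::-1])
    return paths


# Migration tree from the text: {C→D→B→F→H→G, C→E→A, D→J→I}.
parent = {"D": "C", "B": "D", "F": "B", "H": "F", "G": "H",
          "E": "C", "A": "E", "J": "D", "I": "J"}
leaf_arrival = {"G": 55, "A": 28.8, "I": 48}  # weights reported in the text

paths = branches(parent, "C")
completion = max(leaf_arrival.values())
for p in paths:
    print(" → ".join(p))
print(completion)  # 55
```

Because the branches are served in parallel, the overall migration time is governed by the slowest branch rather than by the sum of the branch weights.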

Conclusions and Further Work
Based on graph theory calculations, we present a parallel method for migrating data among multisupercomputer centers with different interconnection topologies when the centers are required to work cooperatively. Specifically, this paper gives a method of node selection and a method of searching for the shortest path and the migration path. The correctness of the method has also been verified. This method can serve as a reference for data migration between different supercomputer centers. The calculation process is given in this paper. However, the influence of link load balancing, large data migration, and the number of files (many or few) on data migration time and migration path selection is not considered. Moreover, other real-world factors in data transfers across data centers, such as data availability and security issues, are also not taken into account. Therefore, in future work we will take these factors into account, according to the actual circumstances of the application, to explore more accurate optimal path selection.

Figure 5: Entire constraint network graph with the paths between the nodes already reached removed.

Table 1: Data migration time between any two centers (units: hours).

Table 2: Shortest path and migration path obtained by enumeration method in case 1.

Definition 2. $S_k$ is the set of nodes reached at step $k$, and $\bar{S}_k$ denotes the nodes not reached yet, $k \in \{1, 2, \ldots, n - 1\}$. $w^{k}_{v_i v_j}$ is the weight between $v_i$ and $v_j$ at step $k$. $s_k$ is the source node of step $k$ and $e_k$ is the end node of step $k$.

Table 3: Shortest path and migration path obtained by enumeration method in case 2.