DMGA: A Distributed Shortest Path Algorithm for Multistage Graph

The multistage graph problem is a special kind of single-source single-sink shortest path problem. It is difficult or even impossible to solve large-scale multistage graphs on a single machine with sequential algorithms. There are many distributed graph computing systems that can solve this problem, but they are usually designed for general large-scale graphs and do not consider the special characteristics of multistage graphs. This paper proposes DMGA (Distributed Multistage Graph Algorithm) to solve the shortest path problem according to the structural characteristics of multistage graphs. The algorithm first allocates the graph to a set of computing nodes, storing the vertices of the same stage on the same computing node. Next, DMGA calculates the shortest paths between each pair of starting and ending vertices within a partition by the classical dynamic programming algorithm. Finally, the global shortest path is calculated iteratively by exchanging subresults between computing nodes. Our experiments show that the proposed algorithm effectively reduces the time to solve the shortest paths of multistage graphs.


Introduction
With the continuous development of big data and information technology, graphs have been widely applied in many applications, and various graph structures and algorithms have been proposed. Among them, the multistage graph is a special kind of weighted directed graph, which is widely used in engineering technology, concurrency control, transportation, task scheduling in high-performance computing, and other fields. Many coordination or dynamic scheduling problems can be transformed into multistage graph problems [1,2].
Recently, the scale of graph data has grown tremendously, so it is difficult or even impossible to store and process such large-scale graphs on a single computer or with a sequential processing method [3]. At this point, a distributed computing scheme becomes a must, and many dedicated graph-processing systems have appeared [4][5][6], such as Pregel [7], PowerGraph [8], GraphX [9,10], GraphLab [11], and PowerLyra [12]. These graph processing systems scale the computation by dividing the graph into multiple partitions and processing them on multiple computing nodes in parallel. A high-quality partition can reduce the communication cost and achieve load balance [13][14][15], so the processing time can be minimized. Current distributed graph processing systems and algorithms are usually designed for general graphs and do not consider the special structural properties of multistage graphs, so applying them to multistage graphs brings disadvantages such as high communication cost and long solution time. The purpose of this paper is to present a distributed algorithm, DMGA (Distributed Multistage Graph Algorithm), for the shortest path problem of multistage graphs that makes full use of their characteristics. The main contributions are as follows: (1) It presents a partitioning method for multistage graphs on distributed computing systems, which makes the best use of the characteristics of multistage graphs to achieve load balance and reduce the communication cost. (2) It designs a distributed algorithm for the shortest path problem of multistage graphs based on the dynamic programming idea. (3) It performs extensive experiments to verify the performance of the proposed algorithm, compared to the classical parallel Dijkstra algorithm and the SSSP (single-source shortest path) algorithm on Pregel. Table 1 gives an overview of the notations used in this paper.
The organization of the rest of the paper is as follows. Section 2 introduces the related works, and Section 3 presents the statement of the shortest path problem of multistage graphs. Section 4 describes the proposed DMGA algorithm, and Section 5 introduces the experiments and analysis. Section 6 concludes the paper.

Related Works

The Shortest Path Algorithms.
Finding the shortest path is a classical problem in graph theory, and the well-known sequential algorithms are the Dijkstra, Floyd, and Bellman-Ford algorithms [16], which perform well in centralized computing. However, large-scale graphs need distributed algorithms to obtain the shortest paths quickly. The single-source shortest path (SSSP) problem is one of the most important shortest path problems. Peng et al. [1] defined a new graph model named the single-source-weighted multilevel graph and presented a parallel SSSP algorithm by constructing a vector-matrix multiplication model, dividing the computation into parallel tasks, and setting the data communication method. A. Davidson et al. [17] developed three parallel SSSP algorithms for GPUs (Graphics Processing Units): Workfront Sweep, Near-Far, and Bucketing. These algorithms use different approaches to balance the trade-off between saving work and organizational overhead. S. Maleki et al. [18] introduced a partially asynchronous parallel DSMR (Dijkstra Strip-Mined Relaxation) algorithm for SSSP on shared- and distributed-memory systems. Busato and Bombieri [19] proposed a parallel Bellman-Ford algorithm based on frontier and active vertices that exploits the characteristics of GPU architectures. Huang [20] gave a distributed Las Vegas algorithm based on the classic scaling technique for all-pairs shortest paths on distributed networks. For dynamic and stochastic graph models of transportation networks, Liu et al. [21] proposed an improved adaptive genetic algorithm that adjusts the encoding parameters to obtain the dynamic random shortest path. Ghaffari and Li [22] provided a distributed SSSP algorithm with lower complexity, which constitutes the first sublinear-time algorithm for directed graphs. For the SSPP-MPN (Shortest Simple Path Problem with Must-Pass Nodes), Su et al. [23] proposed a multistage metaheuristic algorithm based on k-opt moves, candidate path search, conflicting node promotion, and connectivity relaxation.
The above algorithms do not consider the structural characteristics of multistage graphs, so they produce a large amount of communication overhead, resulting in long execution times.

Graph Partitioning Algorithms.
The basis of a distributed graph processing system is to partition the entire graph across a set of computing nodes. Graph partitioning algorithms are classified into vertex-cut and edge-cut approaches. Edge-cut partitioning assigns each vertex to a unique partition, and edges spanning partitions are called cut edges. As shown in Figure 1(a), edge <b, d> is cut, and its two endpoints b and d are assigned to different partitions. Vertex-cut partitioning assigns each edge uniquely to a certain partition, which results in vertices cut across multiple partitions [24]. As shown in Figure 1(b), vertex d is cut, both partitions P1 and P2 have copies of vertex d, and its replicas in each partition are also called mirrors [25]. Distributed graph processing systems often use the vertex-centric programming model [26,27], where a computing node repeatedly operates on its active vertices according to a user-defined graph function. Each vertex reads the statuses of its adjacent vertices or edges and updates its own status accordingly. In the iterative calculation over a graph, the partitions exchange intermediate results along edges. To some extent, the number of cut edges or mirror vertices reflects the network communication overhead, which in turn affects the calculation efficiency. At the same time, the load among computing nodes should be balanced to ensure that the computing nodes finish their work synchronously. Hence, both edge-cut and vertex-cut approaches aim to minimize cross-partition dependencies and achieve load balance [25]. Existing graph partitioning heuristics are basically divided into offline and online partitioning strategies. The offline partitioning strategy divides the graph into several subgraphs before the graph is loaded by the distributed system. F. Rahimian et al. [28] introduced the JA-BE-JA algorithm, which uses local search and simulated annealing techniques for graph partitioning.
The algorithm only needs part of the information to process the graph. Akhremtsev et al. [29] presented a multilevel shared-memory parallel graph partitioning algorithm that uses parallel label propagation for both coarsening and refinement, and it balances the speed and quality of parallel graph partitioning. The online partitioning strategy partitions the graph during the data loading process, where the input data is usually a vertex stream or an edge stream. Tsourakakis et al. [30] proposed the FENNEL algorithm based on locality-centric measures and balancing goals. Its core idea is to interpolate between maximizing the co-location of neighbouring vertices and minimizing that of non-neighbours. Petroni et al. [15] proposed the high-degree replicated first (HDRF) algorithm according to the characteristics of power-law graphs, which replicates the vertices with high degrees first. Zhang et al. [31] proposed the AKIN algorithm based on a vertex similarity index, which exploits the similarity of vertex degrees to collect structure-related vertices in the same partition and further reduce the edge-cut rate. Wang et al. [32] analysed the locality of the graph and proposed a target-vertex-sensitive hash algorithm. The algorithm logically predivides the target vertices of the edges and then partitions the graph in parallel according to the target vertices. Ji et al. [33] proposed a two-stage local partitioning algorithm which introduces the concept of local partitions, emphasizing the impact of changes in the graph structure on the quality of partitions. Slota et al. [34] introduced XtraPuLP based on the scalable label propagation community detection technique. It can solve the multiple-constraint and multiple-objective graph partitioning problem on tera-scale graphs. Zhou et al. [35] proposed Geo-Cut, which uses a cost-aware streaming heuristic and two partition refinement heuristics to reduce the cost and data transfer time of geo-distributed data centres. The above graph partitioning algorithms are all designed for general graphs and do not consider the special characteristics of multistage graphs, so it is necessary to design a graph partitioning algorithm for multistage graphs to accelerate distributed processing.

Problem Statements
A multistage graph G = (V, E, W) is a directed single-source and single-sink weighted connected graph, where V and E are, respectively, the sets of vertices and edges, and W is the set of edge weights. The vertices are divided into disjoint stages, and each edge can only point from a vertex of one stage to a vertex of the succeeding stage. Formally, a multistage graph G = (V, E, W) should satisfy V = V_1 ∪ V_2 ∪ ⋯ ∪ V_m with V_i ∩ V_j = ∅ for i ≠ j, every edge (u, v) ∈ E satisfies u ∈ V_i and v ∈ V_{i+1} for some i, and n_1 = n_m = 1, where v_{1,1} and v_{m,1} are, respectively, the source vertex and the sink vertex. Figure 2 is an example of a multistage graph, where the blue numbers above the graph are the numbers of edges.

Table 1: Notation overview.

Symbol | Definition
G = (V, E, W) | Graph with vertex set V, edge set E, and edge weight set W
m | Number of stages in a multistage graph
V_i | Set of vertices of stage i of a multistage graph
n_i | Number of vertices of stage i of a multistage graph
v_{i,j} | The jth vertex in the ith stage of a multistage graph

This paper supposes that the multistage graphs are dense, which means

|E_i| > max(n_i, n_{i+1}),  i = 1, 2, ..., m − 1,   (1)

where E_i = {(v_{i,j}, v_{i+1,k}) | j = 1, 2, ..., n_i; k = 1, 2, ..., n_{i+1}} is the set of edges between V_i and V_{i+1} and |E_i| is the number of edges in E_i. The shortest path problem of a multistage graph is to find the minimum-cost path from the source vertex to the sink vertex. Let c_{k,i,l,j} be the cost of the shortest path from vertex v_{k,i} to v_{l,j}. Obviously, c_{1,1,m,1} is the cost of the shortest path from the source vertex to the sink vertex, and it satisfies the dynamic programming recurrence

c_{1,1,i+1,j} = min_{1 ≤ k ≤ n_i} { c_{1,1,i,k} + w(v_{i,k}, v_{i+1,j}) },   (2)

where w(v_{i,k}, v_{i+1,j}) is the weight of edge (v_{i,k}, v_{i+1,j}). Given a large-scale multistage graph G = (V, E, W), we need to partition it over a cluster of computing nodes. Each computing node stores a part of G, and each part is called a partition. Let p be the number of partitions, so G is divided into partitions CN_1, CN_2, ..., CN_p, and CN_i is located on the ith computing node.
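The recurrence above can be illustrated in code. The following is a minimal sketch of the classical dynamic programming solution for a multistage graph; the stage-list representation and the edge-weight dictionary are hypothetical conveniences, not data structures from the paper:

```python
INF = float("inf")

def multistage_shortest_path(stages, weights):
    """stages: list of vertex lists, one per stage; stages[0] and stages[-1]
    hold the single source and sink. weights: dict (u, v) -> edge weight."""
    source = stages[0][0]
    cost = {source: 0}   # cost[v] = cost of the shortest path from the source to v
    pred = {source: None}
    for i in range(len(stages) - 1):
        for v in stages[i + 1]:
            best, best_u = INF, None
            for u in stages[i]:
                w = weights.get((u, v), INF)
                if cost.get(u, INF) + w < best:
                    best, best_u = cost[u] + w, u
            cost[v], pred[v] = best, best_u
    # reconstruct the path by walking predecessors back from the sink
    sink = stages[-1][0]
    path, v = [], sink
    while v is not None:
        path.append(v)
        v = pred[v]
    return cost[sink], path[::-1]
```

For example, with stages [["s"], ["a", "b"], ["t"]] and weights {("s","a"): 1, ("s","b"): 4, ("a","t"): 2, ("b","t"): 1}, the function returns cost 3 and path ["s", "a", "t"].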

DMGA: The Proposed Algorithm
DMGA runs on a homogeneous cluster, which means all computing nodes have the same performance in terms of CPU, memory, and bandwidth. The algorithm first partitions the entire graph over the given cluster, and then each computing node computes the shortest paths of its own partition. Finally, the computing nodes communicate with each other to obtain the shortest path of the whole graph.
Algorithm 1 gives the framework of DMGA. The details of each step of Algorithm 1 are given in the following sections.

Multistage Graph Partition.
In order to determine the graph partitioning strategy, we should analyse its impact on the communication overhead after partitioning. According to the features of multistage graphs, it is a better scheme to place the vertices of the same stage in the same partition because this makes it easy to achieve load balance and a parallel shortest path solution. Suppose V_i and V_{i+1} are divided into two different partitions. If we use the vertex-cut strategy, the number of mirror vertices is either n_i or n_{i+1}. If we use the edge-cut strategy, the number of cut edges is |E_i|. According to (1), n_i < |E_i| and n_{i+1} < |E_i|, which indicates that the vertex-cut strategy has less communication overhead than the edge-cut strategy, so we adopt the vertex-cut strategy to partition the graph. Figure 3 is an example. The edge-cut strategy produces 9 cut edges (Figure 3(a)), while the vertex-cut strategy only produces 3 mirror vertices (Figure 3(b)).
Since the multistage graphs studied in this paper are dense, we use the number of edges to represent the load of a partition. Let Cap be the capacity of each computing node, which is also the maximum number of edges that can be stored by a partition; then the number of partitions of a given G can be estimated as

p = ⌈|E| / Cap⌉.   (3)

The above equation gives the lower limit of the number of computing nodes. It may leave the load of the last computing node far lower than those of the other computing nodes. For example, if |E| = 10100 and Cap = 1000, then p = 11. If the first 10 computing nodes are fully loaded, then the last computing node only has 100 edges, so the load is imbalanced. Hence, the maximum load of each partition is redefined as

ML = c · ⌈|E| / p⌉,   (4)

where c is a predefined parameter to keep the load balanced for different multistage graphs. The idea of multistage graph partitioning is to assign the vertices of the same stage to the same partition. Figure 4 presents the flow diagram, and Algorithm 2 presents the pseudocode. In this algorithm, Sum records the number of edges stored in the current partition. Lines 1 and 2 initialize the variables. Lines 3-16 divide G over the cluster of computing nodes. If Sum ≤ ML (line 5), lines 6-8 assign the edges of E_i to computing node CN_k, and lines 9 and 10, respectively, update e_k and i. If Sum > ML (line 11), which means the current partition CN_k would be overloaded if the edges of E_i were assigned to it, lines 12-14 update the variables to prepare for the succeeding partition.
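The stage-wise greedy assignment described above can be sketched as follows. This is a simplified illustration in the spirit of Algorithm 2, not the paper's exact pseudocode; the function name and the list-of-edge-counts input are assumptions for the sketch:

```python
import math

def partition_multistage(edge_counts, cap, c=1.0):
    """Greedy stage-wise partitioning (sketch of Algorithm 2's idea).
    edge_counts[i] holds |E_i|, the number of edges between stages i and i+1;
    cap is the capacity Cap of a computing node; c is the balance parameter.
    Returns assignment[i] = index of the partition that stores E_i."""
    total = sum(edge_counts)                # |E|
    p = math.ceil(total / cap)              # number of partitions, as in (3)
    ml = math.ceil(c * total / p)           # maximum load ML, as in (4)
    assignment, k, load = [], 0, 0
    for e in edge_counts:
        if load + e > ml and load > 0:      # CN_k would be overloaded
            k, load = k + 1, 0              # open the next partition
        assignment.append(k)
        load += e
    return assignment
```

For instance, four stage gaps with 6 edges each and Cap = 12 give p = 2 and ML = 12, so the first two gaps go to partition 0 and the last two to partition 1.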

Local Shortest Path Calculation.
After partitioning the graph, each computing node calculates the shortest paths of the subgraph stored on it. The shortest path of each partition is referred to as a local shortest path, and the shortest path of the whole graph is referred to as the global shortest path.
Theorem. Suppose {v_{1,1}, ..., v_{k,i_k}, v_{k+1,i_{k+1}}, ..., v_{l,i_l}, ..., v_{m,1}} is a shortest path from the source vertex v_{1,1} to the sink vertex v_{m,1}. Then its subpath {v_{k,i_k}, v_{k+1,i_{k+1}}, ..., v_{l−1,i_{l−1}}, v_{l,i_l}} is a shortest path from v_{k,i_k} to v_{l,i_l}.

Proof. We prove it by contradiction. Suppose {v_{k,i_k}, v_{k+1,i_{k+1}}, ..., v_{l−1,i_{l−1}}, v_{l,i_l}} is not a shortest path from v_{k,i_k} to v_{l,i_l}. Then there must exist a shorter path P′ from v_{k,i_k} to v_{l,i_l}. Replacing the subpath by P′ yields a path {v_{1,1}, ..., v_{k,i_k}} ∪ P′ ∪ {v_{l,i_l}, ..., v_{m,1}} whose cost is less than that of {v_{1,1}, ..., v_{m,1}}, which implies that {v_{1,1}, ..., v_{m,1}} is not a shortest path from v_{1,1} to v_{m,1}. This is a contradiction.
The above theorem shows that any part of a shortest path is also a shortest path, so the global shortest path is composed of the local shortest paths of all partitions, and

c_{1,1,k+1,j} = min_{1 ≤ i ≤ n_k} { c_{1,1,k,i} + c_{k,i,k+1,j} },   (5)

where j = 1, 2, ..., n_{k+1}. Subsequently, letting s_k and e_k denote the first and last stages stored in partition CN_k, we have

c_{1,1,e_k,j} = min_{1 ≤ i ≤ n_{s_k}} { c_{1,1,s_k,i} + c_{s_k,i,e_k,j} },   (6)

where j = 1, 2, ..., n_{e_k}. Based on the above equations, it is necessary to calculate the shortest paths between every pair of vertices of the first and last stages of each partition. Specifically, each computing node uses the dynamic programming idea shown in (2) to solve the local shortest paths.
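The per-partition computation can be sketched as follows: for each vertex of the partition's first stage, run the stage-by-stage dynamic programming forward and record the cost to every vertex of the last stage. The representation (stage lists, weight dictionary) is an illustrative assumption, not the paper's data layout:

```python
INF = float("inf")

def local_shortest_paths(stages, weights):
    """For one partition, compute the shortest-path cost between every pair of
    first-stage and last-stage vertices by dynamic programming (sketch)."""
    table = {}
    for src in stages[0]:
        cost = {src: 0}
        for i in range(len(stages) - 1):
            for v in stages[i + 1]:
                # relax over all predecessors in the previous stage
                cost[v] = min(
                    (cost.get(u, INF) + weights.get((u, v), INF) for u in stages[i]),
                    default=INF,
                )
        for dst in stages[-1]:
            table[(src, dst)] = cost.get(dst, INF)
    return table
```

The resulting table maps each (first-stage vertex, last-stage vertex) pair to its local shortest-path cost, which is exactly what the merging phase consumes.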

Global Shortest Path Calculation.
After finding the shortest paths of each partition, the global shortest path is calculated by message exchanging among the computing nodes. Figure 6 depicts a sketch of the merging procedure of local shortest paths. This is an iterative procedure. In each iteration, pairs of computing nodes communicate: the "left" computing node sends its local shortest paths to the "right" computing node, and the "right" computing node merges the two sets of local shortest paths. Finally, CN_p gets the global shortest path. Figure 7 presents the flow diagram, and Algorithm 4 presents the pseudocode. The set R records the indices of the computing nodes participating in subresult combination in each iteration. Initially, R contains all computing nodes (line 1). Lines 2 to 5 initialize two sets for each computing node. Lines 6 to 30 calculate the global shortest path by message exchanging among computing nodes, with the basic idea shown in Figure 6. Firstly, the "left" computing nodes CN_{R_{2i−1}} send SPL_{R_{2i−1}} to their "right" neighbour computing nodes CN_{R_{2i}} (lines 7 to 9), and this can run in parallel for each pair of computing nodes. Secondly, the "right" computing nodes merge the two subresults to get a longer subresult (lines 10 to 28). Given a local shortest path from SPL_{R_{2i−1}} (line 11) and one from SPL_{R_{2i}} (line 12), if they can be merged, that is, the last vertex of the first path is the same as the starting vertex of the second path (line 13), the algorithm tries to merge them into a longer path from v_{a1,b1} to v_{c2,d2}. If such a shortest path does not exist yet, the algorithm generates one and appends it to SPL′_{R_{2i}} (lines 14 to 17). If the shortest path exists but is longer than the current one, the algorithm updates it (lines 18 to 21). After the two innermost loops (lines 11 to 25), the algorithm replaces SPL_{R_{2i}} with SPL′_{R_{2i}} (line 26) and sets SPL′_{R_{2i}} to be empty (line 27) to prepare for the next iteration. Line 29 removes the "left" computing nodes from R.
Finally, CN_p returns c_{1,1,m,1} and sp_{1,1,m,1} as the global shortest path (line 31). We take Figure 2 as an example to demonstrate the process of the above algorithm. Given Cap = 20, we have p = 3 according to (3). Set ML = 20 according to (4). Based on Algorithm 2, the graph is partitioned into 3 partitions, as shown in Figure 8.
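The combine step performed by a "right" computing node can be sketched as follows. This is an illustrative reading of the merge in Algorithm 4, assuming each side holds a table mapping (start vertex, end vertex) pairs to local shortest-path costs; the function name and table layout are assumptions:

```python
INF = float("inf")

def merge_tables(left, right):
    """Merge two adjacent partitions' local shortest-path tables (sketch of
    the combine step in Algorithm 4). Paths join at the shared boundary
    stage: left maps (u, mid) -> cost, right maps (mid, v) -> cost."""
    merged = {}
    for (u, mid1), c1 in left.items():
        for (mid2, v), c2 in right.items():
            if mid1 == mid2:                  # paths meet at a mirror vertex
                c = c1 + c2
                if c < merged.get((u, v), INF):
                    merged[(u, v)] = c        # keep only the cheapest join
    return merged
```

Applied pairwise and iteratively, ⌈log₂ p⌉ rounds of such merges leave the last computing node holding the table whose single (source, sink) entry is the global shortest-path cost.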

Experiments and Analysis

Experimental Setup.
Because there are no public multistage graph datasets, we synthesized 5 datasets using Java on IntelliJ IDEA, where the number of vertices of each stage and the weights of edges are random values satisfying (1). Table 2 presents the basic data of these 5 datasets.
The shortest path algorithms are run on Hadoop [36] in conjunction with the Spark [37] computing engine. Spark is a fast, general-purpose computing engine designed for large-scale data processing [38]. The cluster consists of 8 computers, and each computer has a 4-core Intel processor, 8 GB of memory, and 1 TB of storage. The operating system is CentOS 7, and the distributed environment is built using Hadoop 2.7 and Spark 2.1.
In order to compare the performance of DMGA with existing algorithms, all graphs are partitioned into 8 partitions, which differs slightly from Algorithm 2, whose number of partitions depends on the scale of the graph. Hence, the ML of each partition in the experiments is

ML = c · ⌈|E| / 8⌉,   (7)

and line 2 of Algorithm 2 uses (7) to replace (4).

Partitioning Quality.
At present, there is no partitioning algorithm dedicated to multistage graphs, so we compare the partition quality of the DMGA algorithm with the Hash partitioning algorithm. Hash is the default partitioning algorithm in many distributed graph processing systems, and it is the basis of most existing distributed algorithms for solving the shortest path.
For the vertex-cut partitioning method, the number of mirror vertices reflects the communication overhead. The fewer the mirror vertices, the less the communication overhead, and the shorter the corresponding calculation time. Figure 9 shows the number of mirror vertices generated by the two graph partitioning algorithms running on the above 5 datasets. It can be seen from the results that the numbers of mirror vertices of the two graph partitioning algorithms are almost the same for small datasets. With the increase of the scale of the dataset and the number of stages, the number of mirror vertices of the DMGA algorithm becomes significantly less than that of the Hash algorithm. Specifically, the numbers of mirror vertices produced by the Hash algorithm for Graph_3, Graph_4, and Graph_5 are, respectively, 37.25, 24.97, and 69.9 times those of the DMGA algorithm.
This shows that the DMGA algorithm produces a better partitioning result for multistage graphs than the Hash algorithm. It can be deduced that the number of mirror vertices would be reduced further if (4) instead of (7) were used to guide the partitioning in Algorithm 2.
For a homogeneous cluster consisting of computing nodes with the same configuration, the load of each computing node should be balanced as much as possible in order to reduce the calculation time. Table 3 shows the number of edges on each computing node for each dataset generated by the two graph partitioning algorithms. The Hash algorithm partitions the graph according to a Hash function that maps each vertex v_{i,j} to a partition CN_k. Thus, the load of each partition is almost the same. The DMGA algorithm assigns all vertices of the same stage to the same partition, and the numbers of vertices in each stage are different, so the edge distribution is not balanced. The average deviation and standard deviation are further analysed to check the loads among partitions, and they are shown in Figure 10. In this figure, AVEDEV and STDEV, respectively, represent the average deviation and the standard deviation. We can see that the average and standard deviations of DMGA are 2.3 to 7.5 times those of Hash. Graph_3 has the maximum difference among all graphs: the average deviation of DMGA is 6.65 times that of Hash, and the standard deviation of DMGA is 7.51 times that of Hash. Graph_1 has the minimal difference among all graphs: the average deviation of DMGA is 2.79 times that of Hash, and the standard deviation of DMGA is 2.36 times that of Hash.
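The two balance metrics can be computed from the per-partition edge counts as follows; this is a small sketch of the standard definitions of average deviation and (sample) standard deviation, not code from the paper:

```python
import statistics

def load_balance_metrics(edge_loads):
    """Return (AVEDEV, STDEV) of a list of per-partition edge counts:
    the mean absolute deviation from the mean, and the sample standard
    deviation, as used to compare partition balance in Figure 10."""
    mean = statistics.fmean(edge_loads)
    avedev = sum(abs(x - mean) for x in edge_loads) / len(edge_loads)
    stdev = statistics.stdev(edge_loads)   # sample standard deviation
    return avedev, stdev
```

A more imbalanced edge distribution yields larger values for both metrics, which is why DMGA's stage-aligned partitions score higher than Hash's near-uniform ones.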
These results also show that Hash produces a more balanced partition.
Running Time.

The single-machine algorithm uses the sequential dynamic programming algorithm. The experiments are run 10 times, and the results are averaged over these runs. Table 4 presents the results, and Figure 11 shows the graphical comparison. It can be seen that: (ii) Graph_2: the running times of these two algorithms are almost the same. (iii) Graph_3 and Graph_4: the running time of the sequential algorithm is, respectively, 1.72 and 3.13 times that of DMGA.
Furthermore, parallel Dijkstra has a relatively longer running time than DMGA and SSSP due to its high time and space complexity. SSSP uses Pregel's default graph partitioning method, which does not take the structural features of multistage graphs into account, so it incurs a large amount of communication. SSSP needs longer time for larger-scale graphs, which implies that the communication overhead increases significantly with the scale of the graph. Specifically:
(i) Graph_2: SSSP has the least running time, and DMGA only needs 4 seconds longer than SSSP, but parallel Dijkstra needs more than 30 times the time of DMGA. (ii) Other graphs: DMGA needs the least running time.
The larger the scale of the graph, the more obvious the advantage of DMGA. For example, the running time of DMGA is, respectively, 25.8% and 8.8% of that of SSSP and parallel Dijkstra for Graph_5. The above results show that DMGA makes full use of the special structural characteristics of multistage graphs and has extremely low communication overhead.

Conclusion and Future Work
Nowadays, graph models are widely applied in many fields, and the scale of graphs has increased significantly. Existing distributed graph computing systems cannot make full use of the special characteristics of multistage graphs. To this end, this paper proposes DMGA, which solves the shortest path problem of large-scale multistage graphs on a distributed computing system. DMGA consists of three phases: graph partitioning, local shortest path calculation, and global shortest path calculation. The experimental results demonstrate its high performance. However, the load of DMGA is not balanced, and it only considers the vertex-cut partitioning method. In the future, we will focus on reducing the number of mirror vertices and the solution time as much as possible under the premise of load balance and propose a special algorithm based on the edge-cut partitioning idea.

Data Availability
The experimental data used to support the findings of this study are available from the corresponding author upon request and are also provided in the supplementary information files.