Routing Optimization Algorithms Based on Node Compression in Big Data Environment

The shortest path problem is a classic problem, and it becomes harder in a big data environment. Current research mainly focuses on finding the shortest path from a given starting point to a given destination; studies of shortest paths that must be found within a limited time and must pass through a specified set of nodes are few, although such problems are very common in real life. In this paper we propose several time-dependent optimization algorithms for this problem. Building on traditional backtracking and on different node compression methods, we first propose an improved backtracking algorithm for one condition in the big data environment and then three optimization algorithms based on node compression for large data, in order to select a path that goes from the starting point through a given set of nodes and reaches the end within a limited time. Consequently, problems of different data volume and network complexity can be solved by adopting the appropriate algorithm.


Introduction
Single-source shortest path problems in graph theory are typical problems with wide applications in real life, such as network routing path selection, vehicle navigation, and travel routes. The classic algorithm for such problems is Dijkstra's Algorithm [1], proposed by Dijkstra in 1959, and many researchers work in this area [2][3][4]. However, Dijkstra's Algorithm does not solve problems where a route is required to go from the starting point, pass the specified intermediate nodes, and finally reach the destination. Such problems are far more practical, as the following examples show: (1) "Postman problem": the postman starts from the post office, delivers letters to residents, and returns home, where we need to find the postman a shortest path within a given time.
(2) "Limited time problem": within a limited time, activities designed for staff members who tracked consent using depth sensors were proposed and they were carefully reminded of noncompliant activities [5], and a collaborative smartphone task model is proposed, which is called Collaboration-Based Intelligent Perception Task Model (CMST) [6].
(3) "Traveler problem": calculate a travel route for the traveler within the specified time, who needs to go from a designated location, pass a designated scenery spot, and visit a given place. The total distance should be the shortest or the total expense should be the lowest [7,8].
(4) "Compression problem": a new compression method for large data environment is proposed, which can effectively reduce the data compression of single nodes and ensure the quality of data [9]. Due to the large amount of web service data, a data-driven scheme is based on kernel least mean squares (KLMS) algorithm [10]. In order to compress the input to further improve the learning effect, a new QKLMS is based on entropy-guided learning [11].
(5) "Network routing problem": find an efficient routing algorithm to solve the problem of path optimization of wireless sensor network, considering the influences of some practical factors such as the consumption of the energy of the nodes and recovery time of routing [12][13][14].
(6) "Laguerre neural network" [15]: it intends to propose a novel automatic learning scheme to improve the tracking efficiency while maintaining or improving the data tracking accuracy. A core strategy in the proposed scheme is the design of Laguerre neural network (LaNN) based approximate dynamic programming (ADP).
(7) "Energy of the sensor nodes" [16]: a novel prediction-based data fusion scheme using grey model (GM) and optimally pruned extreme learning machine (OP-ELM) is proposed. The proposed data fusion scheme, called GM-OP-ELM, uses a dual prediction mechanism to keep the prediction data series at the sink node and sensor node synchronous.
These problems can be summarized as one graph theory problem: in a weighted directed graph, a route goes from a starting point, passes through the designated intermediate nodes, and reaches a destination. It is required to find valid paths within a specified time, calculate the weight of these paths, and select the path with the lowest weight as the final result.
To solve this kind of problem, we may traverse the whole graph to find a shortest path. Theoretically this traversal eventually finds the optimal solution, but its time complexity is high. In view of this, this paper proposes node compression routing algorithms that take the time limit into account. The study focuses on node compression and also feeds useful information obtained during path finding back into the search conditions, readjusts the order of subnodes, and applies other methods. In this way the high time complexity of the traditional algorithm is reduced, offering an effective solution to this type of problem.

In the weighted directed graph $G = (V, E)$, let $w_{ij}$ denote the weight of the edge from $v_i$ to $v_j$, with $w_{ij} = \infty$ when no such edge exists; $w_{ij}$ and $w_{ji}$ may be unequal, and $V' = \{v'_1, v'_2, \ldots, v'_k\} \subseteq V$. We need to find the sequence $P = \{p_1, p_2, p_3, \ldots, p_n\}$ within a given time, where $s$ is the starting point and $t$ is the destination, $s, t \in V$, and $s, t$ do not belong to $V'$. All of the elements in $V'$ must appear in sequence $P$, the sum of the weights of all edges of the path formed by sequence $P$ must be minimal, and no loop is allowed in any path. The mathematical model of the problem is defined as follows.

Problem Description
Under the condition Time = $T$, solve $\min Z = \sum_{i \neq j} w_{ij} x_{ij}$. To fix the starting point $s$ and the destination $t$, and to make sure that every vertex on the path other than $s$ and $t$ has exactly one in-edge and one out-edge, we impose the following constraints, where $x_{ij}$ is an integer equal to 0 or 1: 1 means that edge $e_{ij}$ is on the result path, 0 means that edge $e_{ij}$ is not on the result path, and $w_{ij}$ is used to calculate the weight of the resulting path.
Here $i \neq j$ means that the result path cannot contain an edge whose starting node and end node are the same node. Each node of the intermediate node set must occur on the result path exactly once.
The formula requires that an edge beginning with the starting node $s$ appear in the result path; the starting node of that edge cannot also be its end node.
The formula restricts the starting node $s$ to being the starting node of an edge; it cannot play any other role, such as the end node of an edge or an intermediate node.
The formula requires that the result path contain an edge that ends at the end node $t$; that edge cannot start at $t$.
The formula requires that the resulting path contain no edge beginning with the end node $t$; that is, $t$ can only be the final node on the resulting path.
This formula sets the number of edges on the resulting path to the number of nodes on the path minus one; that is, the resulting path cannot contain unrelated edges or loops.
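The constraint formulas themselves are not legible in this copy, only their verbal descriptions. The block below is a hedged reconstruction from those descriptions, not the authors' exact notation. The symbols $Z$, $x_{ij}$, $w_{ij}$, $s$, $t$, and $V'$, and the use of $|P|$ for the number of nodes on the result path, are assumptions made for illustration.

```latex
\begin{aligned}
\min Z &= \sum_{i \neq j} w_{ij}\, x_{ij}, \qquad x_{ij} \in \{0, 1\}, \\
\sum_{i \neq k} x_{ik} &= 1, \quad \sum_{j \neq k} x_{kj} = 1
  \qquad \forall k \in V' \quad \text{(each key node is entered and left exactly once)}, \\
\sum_{j \neq s} x_{sj} &= 1, \quad \sum_{i \neq s} x_{is} = 0
  \qquad \text{(the path leaves } s \text{ and never enters it)}, \\
\sum_{i \neq t} x_{it} &= 1, \quad \sum_{j \neq t} x_{tj} = 0
  \qquad \text{(the path enters } t \text{ and never leaves it)}, \\
\sum_{i \neq j} x_{ij} &= |P| - 1
  \qquad \text{(no unrelated edges or loops)}.
\end{aligned}
```

Read this way, each verbal description corresponds to one line of the block.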
For the convenience of subsequent description, the following two definitions are given.
Definition 1 (key nodes). The nodes in $V'$, that is, the must-pass nodes other than the starting point $s$ and the destination $t$.
Definition 2 (free nodes). All nodes other than the key nodes.

Improved Backtracking Algorithm: IBA
If the backtracking method is used to solve this problem, we can theoretically obtain the optimal solution along with other solutions. However, the backtracking method does not make effective use of the information built up during the search, or of the best solution found so far, to tighten the conditions of the next search step. In this section, an improved backtracking method (OPT-Backtrack Algorithm) is proposed based on the traditional backtracking method. The new IBA retrieves known information and valid results from the previous search and adds them to the search rules before searching from other nodes. In this way the search becomes more efficient, since existing information and candidate results are taken into account. The additional rules of the improved backtracking algorithm are shown below.

Rule 1. If the next node happens to be the destination, yet the current path has not gone through every must-pass node in the node set, the path will track back and continue with the next candidate node. This rule avoids generating many invalid solutions and thus improves the efficiency of the algorithm.
Rule 2. If the current path weight plus the weight of the edge to the next node is greater than or equal to the minimum weight among the solutions already found, the path will track back and continue with the next candidate node. Once such a partial path is no lighter than the best existing solution, there is no need to extend it, because the problem is to find the path with the smallest possible weight.
Rule 3. Nondestination nodes with zero child nodes should not be entered by the search. If a node is not the destination and has no child nodes, the path cannot continue from it; therefore, it is not necessary to search at such nodes, and they can simply be deleted from the graph.
The key pseudocode of the improved backtracking algorithm is shown in Algorithm 1.
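The three rules can be illustrated with a short Python sketch. This is not the paper's Algorithm 1: the function names, the adjacency-list layout (`node -> [(child, weight), ...]`), and the optional `time_limit` parameter in seconds are assumptions made for illustration.

```python
import math
import time


def prune_dead_ends(graph, dest):
    """Rule 3: repeatedly delete non-destination nodes that have no children."""
    g = {u: list(edges) for u, edges in graph.items()}
    for edges in graph.values():
        for v, _ in edges:
            g.setdefault(v, [])  # make sure every node has an entry
    while True:
        dead = {u for u, edges in g.items() if u != dest and not edges}
        if not dead:
            return g
        g = {u: [(v, w) for v, w in edges if v not in dead]
             for u, edges in g.items() if u not in dead}


def improved_backtracking(graph, start, dest, key_nodes, time_limit=None):
    """Return (path, weight) of the lightest loop-free path found, or (None, inf)."""
    g = prune_dead_ends(graph, dest)
    deadline = math.inf if time_limit is None else time.monotonic() + time_limit
    best = {"weight": math.inf, "path": None}

    def search(node, path, visited, weight, remaining):
        for child, w in g.get(node, []):
            if time.monotonic() > deadline:
                return  # out of time: keep the best solution found so far
            if child in visited:
                continue  # loops are not allowed
            new_weight = weight + w
            if new_weight >= best["weight"]:
                continue  # Rule 2: cannot beat the best known solution
            if child == dest:
                if remaining:
                    continue  # Rule 1: destination reached too early
                best["weight"], best["path"] = new_weight, path + [child]
                continue
            visited.add(child)
            search(child, path + [child], visited, new_weight, remaining - {child})
            visited.remove(child)

    search(start, [start], {start}, 0, frozenset(key_nodes))
    return best["path"], best["weight"]
```

For example, with the edges 0→1 (1), 0→2 (5), 0→3 (1), 0→5 (1), 1→2 (1), 2→3 (1), start 0, destination 3, and key node 2, the direct edge 0→3 is rejected by Rule 1, node 5 is deleted by Rule 3, and the branch through 0→2 is cut by Rule 2 once the path 0, 1, 2, 3 of weight 3 is known.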

Node Compression Based Search Algorithm
Although the improved backtracking algorithm enhances search efficiency to a certain degree, its complexity still grows as the scale of the graph and the solution domain expand. To reduce algorithm complexity, this paper proposes a new algorithm, the node compression based search algorithm: NCSA.
As the scale of the graph increases, the number of paths expands accordingly. The problem remains the same: find a path from a starting point that passes the intermediate nodes and finally reaches the destination. To reduce the algorithm complexity, we may preprocess the graph. The method is to compress the total number of nodes, remove useless nodes and low-value path fragments, and save only the paths that are necessary, which simplifies the entire graph; the goal is to compress the solution domain and ultimately improve search efficiency.

Node Compression Algorithm (NCA).
The algorithm applies to the following circumstance: a node is relatively remote and reaches only one other node, that is, it is followed by a single child node. In this case the search can only continue along that child, and it repeats the same step every time it meets such a node. We want to avoid these simple, repeated calculations.
The solution to this problem is the Node Compression Algorithm (NCA). The first time the algorithm is applied, NCA records the paths through such nodes, then removes the nodes but retains the path information; when a later search arrives at this point, only the stored path information is used, which avoids duplicated computation. As a result, the total number of nodes is reduced, making it easier to search for a better solution.
The process is shown in Figure 2.
In Figure 2, node 1 is followed by its only child node 2, and the weight from node 1 to node 2 is 2, marked as path 1. The compression process transfers the information of node 1 to node 2, so that node 2 becomes a direct child node of node 0. After compression, the weight from node 0 to node 2 is 3, and the path from node 0 to node 2 is "0 | 1." This means that node 1 is removed while the path information from node 1 to node 2 is retained solely in node 2. When a later search reaches node 0, the information retained in node 2 can be used directly without going back to node 1. So the number of nodes is reduced and the path will not be searched again.
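The compression step of Figure 2 can be sketched in Python as follows. The function name, the `protected` argument (starting point, destination, and key nodes, which must never be removed), and the edge layout `(child, weight, via)` are assumptions for illustration; `via` lists the removed nodes that a compressed edge stands for, playing the role of the stored path information.

```python
def compress_single_child_nodes(graph, protected):
    """Remove free nodes that have exactly one child, folding them into their parents.

    graph: node -> [(child, weight), ...], with every node present as a key.
    Returns node -> [(child, weight, via), ...].
    """
    g = {u: [(v, w, ()) for v, w in edges] for u, edges in graph.items()}
    changed = True
    while changed:
        changed = False
        for u in list(g):
            # Only free nodes with a single (non-self-loop) child are compressed.
            if u in protected or len(g[u]) != 1 or g[u][0][0] == u:
                continue
            child, w_out, via_out = g[u][0]
            for p in g:
                if p == u:
                    continue
                rerouted = []
                for v, w, via in g[p]:
                    if v != u:
                        rerouted.append((v, w, via))
                        continue
                    new_via = via + (u,) + via_out
                    # Skip merged edges that would revisit a node (a loop).
                    if child != p and len(set(new_via)) == len(new_via):
                        rerouted.append((child, w + w_out, new_via))
                g[p] = rerouted
            del g[u]
            changed = True
    return g
```

With the weights implied by the Figure 2 description (0→1 of weight 1 and 1→2 of weight 2), `compress_single_child_nodes({0: [(1, 1)], 1: [(2, 2)], 2: []}, {0, 2})` leaves a single edge from node 0 to node 2 with weight 3 that remembers node 1 as its hidden intermediate node.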

Complete Compression Algorithm: CCA.
Since the Node Compression Algorithm (NCA) mainly handles free nodes with only one child node, the algorithm efficiency improves significantly when the graph contains many such nodes. However, if such nodes are few, the basic compression algorithm has little or no effect, which limits the effectiveness of the compression search algorithm.
In view of the problem of NCA, this paper proposes a more efficient compression strategy, which compresses all free nodes in the graph to reduce the complexity of the graph, improving the search efficiency.
The problem is to find a loop-free path from the starting node to the destination node that passes through the intermediate node set, so that the total weight of the edges on the path is as small as possible. When the reachability of nodes is complex, there are many possible paths from one node to another. The problem requires that the intermediate node set $V'$ be passed, and there may be multiple reachable paths between two nodes of the set, yet only one of them will be selected as a fragment of the final solution. Therefore, we should find all reachable paths between such nodes while saving only the path with the smallest weight. When the search algorithm reaches a corresponding node, the valid path is retrieved from the stored information, and the original nodes on the path can be removed from the graph, reducing useless nodes and repeated computation. With this compression method, only the starting point, the destination, the intermediate node set, and their interconnecting path information remain, which simplifies the entire graph to a large extent.
Figure 1 can be seen as such a simplified graph, in which only the starting point, destination, and intermediate node set are preserved. In this way, we can achieve good compression efficiency by selecting the reachable path with the smallest weight.
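One possible way to realize the complete compression is sketched below in Python. It runs Dijkstra's algorithm from every preserved node and lets the search pass only through free nodes, so that each pair of preserved nodes keeps its lightest connecting fragment. The use of Dijkstra here, the function name, and the data layout are assumptions made for illustration, not necessarily the authors' procedure.

```python
import heapq
import math


def compress_complete(graph, start, dest, key_nodes):
    """Keep only start, dest, and key nodes, linked by their lightest free-node paths.

    graph: node -> [(child, weight), ...] with non-negative weights.
    Returns src -> {target: (weight, via)}, where via lists the hidden free nodes.
    """
    keep = {start, dest} | set(key_nodes)
    compressed = {}
    for src in keep:
        if src == dest:
            continue  # no edge may leave the destination
        dist = {src: 0}
        via = {src: ()}
        heap = [(0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, math.inf):
                continue  # stale heap entry
            if u != src and u in keep:
                continue  # fragments must not pass through another preserved node
            for v, w in graph.get(u, []):
                nd = d + w
                if nd < dist.get(v, math.inf):
                    dist[v] = nd
                    via[v] = via[u] + ((u,) if u != src else ())
                    heapq.heappush(heap, (nd, v))
        compressed[src] = {v: (dist[v], via[v])
                           for v in keep
                           if v != src and v != start and v in dist}
    return compressed
```

Because each fragment is stored together with the free nodes it hides, a later search over the compressed graph can still reject two fragments that share a free node, which the no-loop requirement demands.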

Improved Complete Compression Algorithm: ICCA.
In order to further improve compression efficiency, this section adjusts and improves node compression through the following three steps.

Adjusting Child Nodes Order by Weight.
In the search process, pruning can be based on the weight of feasible solutions (see Rule 2 of IBA). First, the subnodes of each node are sorted by weight from small to large. When the algorithm searches for a path, subnodes carrying smaller weight are searched first, so that paths with smaller weight are obtained early. As a result of this search strategy, other paths with larger weight can be skipped, which removes unnecessary search steps and improves efficiency.

Adjusting Child Nodes Order by the Sequence of Passing Nodes (from Small to Large).
From the perspective of probability, when a new node is inserted into a graph, the more nodes a path passes, the more likely it is that a repeated path will be generated. Therefore, when weights are equal, nodes with fewer subnodes are given priority, since the paths that follow make fewer repeated attempts, making it easier to find the solution path.
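The two ordering steps can be combined into a single sort key, as in the following Python sketch. The function name and the adjacency-list layout are illustrative assumptions: children are ordered by edge weight first, and ties are broken in favor of the child with fewer subnodes of its own.

```python
def order_children(graph):
    """Sort each node's children by (edge weight, number of subnodes of the child)."""
    def out_degree(node):
        return len(graph.get(node, []))

    return {u: sorted(edges, key=lambda e: (e[1], out_degree(e[0])))
            for u, edges in graph.items()}
```

A depth-first search that visits children in this order tends to reach light solutions early, which in turn makes the weight-based pruning of Rule 2 cut more branches.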

Removing Child Nodes with Larger Weight.
This strategy is only applicable to high-complexity graphs. After compression, the remaining nodes connect to one another to form paths, and the complexity of the graph might still be high. A path may be a valid solution, yet the nodes it passes carry so much weight that it will never be chosen as the final solution. In this case, removing large-weight nodes lowers the graph complexity and improves search efficiency. In addition, it saves time and helps to find a better solution with a lower path weight.
By analysis, the space complexity of IBA is $O(n)$, while the space complexity of NCA, CCA, and ICCA is $O(n^2)$, where $n$ is the total number of nodes in the graph. ICCA selects short paths quickly by visiting nodes with smaller weights first and deleting nodes with larger weights, which makes the compression of large networks efficient.
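The removal of child nodes with larger weight can be sketched as below. The text does not state how "larger" is decided, so the cutoff used here (keep only the `keep` lightest children of every node) is a hypothetical choice, as are the function and parameter names.

```python
def drop_heavy_children(graph, keep):
    """Keep only the `keep` lightest outgoing edges of every node.

    graph: node -> [(child, weight), ...]. The cutoff rule is an assumption.
    """
    return {u: sorted(edges, key=lambda e: e[1])[:keep]
            for u, edges in graph.items()}
```

Such pruning is heuristic: it can discard the optimal path, or every feasible path, so it is best tried only when the compressed graph is still too complex to search within the time limit.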

Experimental Analysis
5.1. Data Description and Analysis. Without loss of generality, the experiment data are taken from the cases of the 2016 Huawei Software Elite Competition; these examples are based on the network topological graph of Huawei's network routers, switches, and other network elements when Huawei established its own network facilities.

Problem Description.
Given a weighted graph $G = (V, E)$, where $V$ is the vertex set and $E$ is the directed edge set, each directed edge carries a weight. For given vertices $s$ and $t$ and a subset $V'$ of $V$, find a loop-free directed path $P$ from $s$ to $t$ within a given time so that $P$ passes through all vertices in $V'$ (in any order), making the total weight of all directed edges on path $P$ as small as possible.
(2) The starting point of any directed edge is not the destination.
(3) There may be more than one directed edge from one vertex to another, and their weights may or may not be the same.
(4) The total number of vertices of the directed graph will not exceed 600, and the out-degree of each vertex (the number of directed edges with this vertex as the starting point) does not exceed 8.
(5) The number of elements in $V'$ does not exceed 50.
(6) The loop-free directed path $P$ from $s$ to $t$ is a directed connected path consisting of a series of directed edges from $s$ to $t$, with no repeated path allowed.
(7) The weight of a path is the sum of the weights of the directed edges on the path.
Each directed edge is described by LinkID, SourceID, DestinationID, and Cost, where LinkID is the index of the directed edge, SourceID is the index of the starting vertex of the directed edge, DestinationID is the index of the destination vertex of the directed edge, and Cost is the weight of the directed edge. The indexes of vertices and of directed edges are numbered from 0 (not necessarily continuously, but each case ensures that no index repeats).
Figure 3 shows the experimental result from Experiment 1: IBA is more efficient than the backtracking method. The efficiency gain of NCA and CCA is not obvious, because the compression process itself takes time and the gain becomes even less visible when the complexity of the graph is low. Figure 6 shows the experimental result from Experiment 4: the backtracking method is inefficient when the complexity of the graph is higher still; in contrast, CCA performs reasonably well.
The experimental results show that IBA is more efficient than the backtracking method, judged by either path weight or search time. NCA shows only a slight advantage over IBA because remote nodes in the graph are very limited. CCA, judged on all dimensions, finds good results more efficiently than the other algorithms, indicating its effectiveness in solving such problems.

CCA and ICCA Comparison.
The previous four experiments show that the efficiency of the backtracking method, IBA, and NCA decreases drastically as the number of nodes increases. Therefore, there is little value in testing these three algorithms on graphs with even more nodes. This section continues with a comparison between CCA and ICCA.
The experiment environment remains the same as in Experiments 1-4, and the experiment gradually increases the total number of nodes. Figure 7 shows the experimental results, which indicate that ICCA obtains better solutions than CCA. Therefore, the improved strategy in Section 4.3 is shown to be effective.

Conclusion
Problems such as the postman problem, the traveler problem, bus line design, and the network routing problem can be abstracted as the path-finding graph model discussed in this study. IBA and NCA are applicable to medium-sized problems. NCA is recommended for graphs that contain many remote nodes, while CCA and ICCA are more efficient on large-scale problems with high algorithm complexity. Additionally, ICCA improves search efficiency by readjusting the order of subnodes.
As the size of the problem becomes larger, CCA and ICCA may not be able to search the whole solution space and find the optimal solution within a given time. In this case, the compression idea will be integrated into heuristic algorithms such as the genetic algorithm and the ant colony algorithm, with the aim of obtaining a far more efficient search algorithm for routing problems of larger scale.

Figure 1: A simple example of the problem.

Figure 2: The basic idea of compression search algorithm.

Figure 4 shows the experimental result from Experiment 2: IBA, NCA, and CCA are all more efficient than the backtracking method.

Figure 5 shows the experimental result from Experiment 3: the superiority of CCA becomes obvious as graph complexity gradually increases.

Path information includes {SourceID, DestinationID, IncludingSet}, where SourceID is the starting point of the path, DestinationID is the destination of the path, and IncludingSet represents the must-pass vertex set $V'$, with different vertex indexes separated by "|".
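A minimal Python sketch of reading these records follows. The field names come from the description, but the exact file layout is not given in this copy; the sketch assumes one record per line with comma-separated fields, which is an assumption, and all function names are illustrative.

```python
def parse_edge(line):
    """Parse 'LinkID,SourceID,DestinationID,Cost' (comma delimiter is assumed)."""
    link_id, source, dest, cost = (int(x) for x in line.strip().split(","))
    return link_id, source, dest, cost


def parse_demand(line):
    """Parse 'SourceID,DestinationID,IncludingSet' with '|' between set members."""
    source, dest, including = line.strip().split(",")
    key_nodes = [int(x) for x in including.split("|") if x]
    return int(source), int(dest), key_nodes


def build_graph(edge_lines):
    """Build node -> [(child, weight), ...]; parallel edges are all kept."""
    graph = {}
    for line in edge_lines:
        _, source, dest, cost = parse_edge(line)
        graph.setdefault(source, []).append((dest, cost))
        graph.setdefault(dest, [])  # every vertex gets an entry, even sinks
    return graph
```

Parallel edges between the same pair of vertices are kept rather than merged, since the problem statement allows several edges with different weights between two vertices.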