A Hierarchical Load Balancing Strategy Considering Communication Delay Overhead for Large Distributed Computing Systems

Load balancing technology can effectively exploit the enormous potential compute power available on distributed systems and achieve scalability. Communication delay overhead on distributed systems, which is time-varying and is usually ignored or assumed to be deterministic in traditional load balancing strategies, can greatly degrade load balancing performance. Considering communication delay overhead and its time-varying nature, a hierarchical load balancing strategy based on a generalized neural network (HLBSGNN) is presented for large distributed systems. The novelty of the HLBSGNN is threefold: (1) a hierarchy with optimized communication is employed to reduce load balancing overhead for large distributed computing systems, (2) node computation rate and the randomness in communication delay imposed by the communication medium are considered, and (3) communication and migration overheads are optimized by forecasting delay. Comparisons with traditional strategies, such as centralized, distributed, and random delay strategies, indicate that the HLBSGNN is more effective and efficient.


Introduction
In traditional load balancing strategies, the task grain size and the number of hops to traverse are likely to be relatively small, so the communication overhead between any pair of processors in the computing system is commonly ignored or assumed to be nearly the same [1][2][3][4]. In large distributed systems, the topology diameter, task grain size, and scale of data to transfer are likely to be large. For example, in a cloud computing environment the network topology diameter is commonly large, the scheduled objects are generally virtual machine resources, and the task grain size and data scale may be large [5,6]. Transmitting such large-scale information can cause long communication delays, which reduce the accuracy of scheduling strategies and cause the obtained information to age.
In centralized load balancing strategies, a dedicated "central" computer gathers global information about the state of the entire system and uses it to make global load balancing decisions. Centralized strategies are inherently nonscalable and have the following limitations when applied to large distributed systems [7][8][9][10]: (1) collecting global information may become prohibitively expensive; (2) the memory needed to store the global state information on the central node can be prohibitively high; (3) the single central node can become a communication bottleneck when there are a large number of processors; (4) the execution overhead of a centralized strategy's decision-making algorithm can be very high, given the large number of processors; (5) greedy-based centralized strategies tend to migrate almost all tasks away from their current location, which can be very expensive in a large system.
On the other hand, in a distributed strategy, each processor exchanges state information with other processors in its neighborhood. Compared to centralized strategies, a distributed load balancing strategy is designed to be scalable on large distributed systems [7,11]. However, several problems arise in practice. For example, distributed schemes can suffer from the aging of load information. This is largely due to the nonpreemptive message scheduler: while one message is being processed, other messages have to wait in the message queue. Thus critical messages that contain load information may be delayed in the queue and become out of date when the method being processed has a long execution time. This aging of load information may lead the load balancing runtime to make poor load balancing decisions. Delay in processing the load messages may also delay the invocation of the load balancing strategies, because load balancing can only be triggered once all load information has been received from neighboring processors. Any delay in receiving these messages may therefore slow down the invocation of the load balancing strategies and lengthen response time. The user experience of a cloud service depends heavily on the communication delay between the end user and the service instance the user accesses, which is mainly caused by Internet delay between the user and the data center hosting the instance [12].
Moreover, the overall performance of large-scale distributed data processing is known to degrade when the system contains nodes that have a large communication delay [13]. For loosely coupled applications such as mpiBLAST, a considerable speedup can usually be observed, because the computation gained from increased parallelism outweighs the communication overhead. Nevertheless, if the dataset is relatively small and the number of VMs used is too large, the benefit of parallelism can be overwhelmed by communication. When more than 32 VMs from three different domains were included, no further speedup was obtained, mainly because of the higher communication delay: the computation time was overwhelmed by the communication time [14]. Current virtual machine (VM) resource scheduling in cloud computing environments mainly considers the current state of the system but seldom considers system variation and historical data, which always leads to load imbalance of the system [5]. Yet CPU speed and communication delay are two factors that influence platform performance in high performance computing systems [15].
Hajjat et al. [16] show that communication delay is time-varying. This is attributable to uncertainties associated with the amount of traffic, congestion, and other unpredictable factors within the network [17,18]. How to perceive and predict the communication state quickly and accurately during load balancing is a key problem to be solved in large distributed computing systems. Existing algorithms for this problem include RW (Random Walk) [19], HA (Historical Average) [20], and IHA (Informed Historical Average) [21]. RW judges the communication state based only on the current network delay, HA relies only on the average of historical data, and IHA combines RW and HA. However, none of them considers the nonlinear and uncertain characteristics of communication delays, and the influence of random factors cannot be avoided. As a result, the prediction accuracy of these methods becomes very low as the time interval gets shorter.
In large-scale distributed computing systems in which the computational elements (CEs) are physically or virtually distant from each other, a number of inherent time-delay factors can seriously alter the expected performance of load balancing policies that do not account for such delays [17]. Dhakal et al. [18] presented a random delay forecasting formula that combines RW and HA. However, the value of its forgetting factor can only be chosen from experience and is very difficult to determine for a changing network, so the formula has little practical usability. In addition, it does not provide a complete prediction model: it can only predict the average delay of the next time interval, not the delay after the next time interval.
In this paper, a hierarchical dynamic load balancing strategy based on a generalized neural network (HLBSGNN) is presented that considers the time-varying characteristics of communication delays in large distributed computing systems. The strategy reduces load balancing overhead in large distributed computing systems by using a communication-optimized hierarchy. It considers the computation rate of each node and the time-varying characteristics of communication delay, and it constructs a delay prediction model based on generalized neural network (GNN) theory. It provides an effective optimization method for load balancing strategies that consider delay overhead in large distributed systems.
The rest of this paper is organized as follows. Section 2 describes the hierarchical load balancing strategy (HLBSGNN), Section 3 presents comparison experiments, and Section 4 concludes the paper.

The Hierarchical Load Balancing Strategy (HLBSGNN)
2.1. Intelligent Neuron Model. In traditional neural network models, the neuron's structure is very simple and its transfer function is fixed, so the neuron has only information processing ability and the information storage ability of the whole neural network is limited. When dealing with large-scale problems, a traditional neural network is hard to converge. To greatly increase the information storage ability, the generalized neural network (GNN) model is presented. The neurons of a GNN can be the simple neurons of a traditional neural network, intelligent neurons that have information storage ability, or neurons that themselves consist of a neural network (multi-input/multi-output). The neuron is constructed based on sample functions. In this paper, a new intelligent neuron model is constructed based on linearly independent functions. A GNN formed from these intelligent neurons can greatly improve the performance of the neural network and can be applied, with high practicability, to predicting the communication delays of large distributed computing systems.
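The intelligent neuron has information storage ability and adjusts its transfer function within a set of functions by means of a training algorithm. In previous research, Eck and Shih [22] used linearly independent functions to preprocess the neural network's inputs, which gave the network a better mapping effect. Without importing new inputs, these functions effectively increase the dimension of the input vectors and can therefore greatly accelerate the network's convergence. This paper brings linearly independent functions into the interior construction of intelligent neurons: the neuron's input $x$ is expanded into the linearly independent functions $x, x^2, x^3, \ldots, x^n$ ($n$ is odd), giving the new intelligent neuron model shown in Figure 1, where $x$ is the neuron's input, $w_i$ ($i = 1, 2, 3, \ldots, n$) is the connection weight, and $f(x)$ is the neuron's output, which can be formulated as

$$f(x) = w_1 x + w_2 x^2 + \cdots + w_n x^n. \tag{1}$$

If the neuron has $n$ inputs $x_1, x_2, \ldots, x_n$, the mapping relation is

$$\Upsilon = X W, \tag{2}$$

where $\Upsilon$ is the neuron's output, $W = (w_1, w_2, \ldots, w_n)^{T}$, and $X$ can be represented as

$$X = \begin{pmatrix} x_1 & x_1^2 & \cdots & x_1^n \\ x_2 & x_2^2 & \cdots & x_2^n \\ \vdots & \vdots & & \vdots \\ x_n & x_n^2 & \cdots & x_n^n \end{pmatrix}. \tag{3}$$

Drawing the common factors $x_1, x_2, \ldots, x_n$ out of the rows of matrix $X$, the following formulation can be gained:

$$\det X = x_1 x_2 \cdots x_n \det \begin{pmatrix} 1 & x_1 & \cdots & x_1^{n-1} \\ 1 & x_2 & \cdots & x_2^{n-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_n & \cdots & x_n^{n-1} \end{pmatrix}. \tag{4}$$

By the Vandermonde determinant we have

$$\det \begin{pmatrix} 1 & x_1 & \cdots & x_1^{n-1} \\ \vdots & \vdots & & \vdots \\ 1 & x_n & \cdots & x_n^{n-1} \end{pmatrix} = \prod_{1 \le j < i \le n} (x_i - x_j), \tag{5}$$

so the value of the determinant of $X$ is

$$\det X = \prod_{i=1}^{n} x_i \prod_{1 \le j < i \le n} (x_i - x_j). \tag{6}$$

If $x_i \neq x_j$ for all $i \neq j$ (and every $x_i \neq 0$), then $\det X \neq 0$ and the rank of $X$ is $n$. This shows that, by expanding functions in the interior of the intelligent neuron, the dimension of the input vectors can be increased and the neuron's information storage ability consequently improved without increasing the number of inputs. All of this gives the neural network a good mapping effect.

The sine and cosine functions ($x, \sin x, \cos x, \sin 2x, \cos 2x, \ldots$) can also be used to expand a neural network's inputs, which can greatly improve the network's performance [23]. This paper introduces sine and cosine functions into the construction of intelligent neurons and constructs a new intelligent neuron model as follows:

$$f(x) = \sum_{i=1}^{n} w_i \, \varphi(i, x). \tag{7}$$

In formulation (7), $x$ is the neuron's input, $f(x)$ is the neuron's output, and $\varphi(i, x)$ is the linearly independent function in the intelligent neuron, which can be formulated as

$$\varphi(i, x) = \begin{cases} x, & i = 1, \\ \sin\left(\tfrac{i}{2} x\right), & i \text{ even}, \\ \cos\left(\tfrac{i-1}{2} x\right), & i \text{ odd},\ i \ge 3. \end{cases} \tag{8}$$

The functions $x, \sin x, \cos x, \sin 2x, \cos 2x, \ldots$ are linearly independent. A neuron formed from these functions has good function mapping ability and is superior to one formed from the functions $x, x^2, \ldots, x^n$ in terms of prediction accuracy and convergence rate [24].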
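As a minimal sketch of the sine and cosine expansion described above, the following Python code evaluates an intelligent neuron whose output is a weighted sum of the functions x, sin x, cos x, sin 2x, cos 2x, and so on. The names `phi` and `neuron_output`, the 1-based ordering of the basis functions, and the example weights are assumptions made for illustration; they are not taken from the paper.

```python
import math

def phi(i, x):
    """Return the i-th basis function (1-based) in the order
    x, sin x, cos x, sin 2x, cos 2x, ... (the indexing is an assumption)."""
    if i == 1:
        return x
    k = i // 2  # harmonic number: i = 2, 3 -> 1; i = 4, 5 -> 2; ...
    return math.sin(k * x) if i % 2 == 0 else math.cos(k * x)

def neuron_output(weights, x):
    """Weighted sum of the basis functions: sum_i weights[i-1] * phi(i, x)."""
    return sum(w * phi(i, x) for i, w in enumerate(weights, start=1))

# Hypothetical weights for a neuron with five basis functions.
weights = [0.5, 1.0, 0.0, 0.0, 2.0]
y = neuron_output(weights, 0.0)  # only the cos 2x term is nonzero at x = 0
```

A single scalar input is thus mapped to a higher-dimensional feature vector inside the neuron, which is the property the text relies on when it says the input dimension grows without adding new inputs.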

Delay Prediction Model and Its Learning Algorithm Based on GNN.
Generally speaking, time prediction methods can be divided into two main kinds: data models and analysis models. A data model is data driven, taking the historical and current delay time series as inputs. In Figure 2, we assume that the current time is $t$ and that the historical data are $d(t-1), d(t-2), \ldots, d(t-n)$ at times $t-1, t-2, \ldots, t-n$, respectively. We can forecast the future time sequence $d(t+1), d(t+2), \ldots$ by analyzing the historical data samples.
In this paper, a GNN method based on the intelligent neuron model is used to predict the delays of nodes in future time intervals. A GNN can take many structural forms. Here the input layer is composed of ordinary neurons, and the hidden layer and the output layer are composed of intelligent neurons. For real-time prediction of node delay, there is a relationship between the delay at the current time node and the delays at the last several time nodes, which can therefore be used to predict the delay of the node in the next time interval. We choose the delays at times $t-6, t-5, t-4, t-3, t-2, t-1$ as the input of the GNN and take the delay of the predicted node at time $t+1$ as the output. The number of hidden layer nodes is set to 6, and the resulting delay prediction model is shown in Figure 3.
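The input and output layout described above can be illustrated with a short sliding-window sketch. The function name `make_samples` and the `horizon` parameter are hypothetical; the default `horizon=2` is one literal reading of the text (the last input is one step before the current time and the target is one step after it), and the sample delay values are made up.

```python
def make_samples(delays, window=6, horizon=2):
    """Split a delay series into (inputs, target) training pairs.

    inputs: `window` consecutive delay samples
    target: the sample `horizon` steps after the last input
    (horizon=2 is an assumed reading of the text, not a confirmed setting)
    """
    samples = []
    for start in range(len(delays) - window - horizon + 1):
        inputs = delays[start:start + window]
        target = delays[start + window + horizon - 1]
        samples.append((inputs, target))
    return samples

# Made-up delay measurements, one per sampling interval.
delays = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 18.0, 17.0, 19.0]
samples = make_samples(delays)
```

Each pair feeds six past delays to the six input neurons and uses the later delay as the training target for the single output neuron.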
The connection weights between layers can be learned by the error back propagation algorithm, and the adjustable parameters of the hidden layer and output layer can be learned with the LMS algorithm [25]. A practical and complete learning process of the GNN involves the following steps:
(1) Initializing the neural network.
(2) Calculating the actual output and the state of the neurons in each layer for different samples.
(3) Computing the error of each neuron and the back propagation error in the output layer and hidden layer.
(4) Amending the connection weights and thresholds between the input layer and hidden layer.
(5) Recalculating the state of the GNN, the error of each node, and the back propagation error in each layer.
(6) Amending the nonlinear transfer function of the nodes in the hidden layer.
(7) Recalculating the state of the GNN, the error of each node, and the back propagation error in each layer.
(8) Amending the connection weights and thresholds between the output layer and hidden layer.
(9) Recalculating the state of the GNN, the error of each node, and the back propagation error in each layer.
(10) Amending the nonlinear transfer function of the nodes in the output layer.
(11) Calculating the whole neural network's error and verifying whether this error meets the requirements. If so, the training algorithm ends; otherwise go to (2).
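The full layered procedure above is not reproduced here. As a much smaller illustration of the LMS part only, the sketch below adjusts the transfer-function parameters of a single intelligent neuron built on the basis x, sin x, cos x, sin 2x, cos 2x. The function names, the learning rate, the epoch count, and the synthetic target function are all assumptions chosen for the example.

```python
import math

def phi(i, x):
    """i-th basis function (1-based): x, sin x, cos x, sin 2x, cos 2x, ..."""
    if i == 1:
        return x
    k = i // 2
    return math.sin(k * x) if i % 2 == 0 else math.cos(k * x)

def expand(x, n):
    """Expand a scalar input into n basis-function values."""
    return [phi(i, x) for i in range(1, n + 1)]

def predict(weights, x):
    return sum(w * f for w, f in zip(weights, expand(x, len(weights))))

def lms_train(samples, n=5, rate=0.05, epochs=200):
    """Per-sample LMS rule: w <- w + rate * error * feature."""
    weights = [0.0] * n
    for _ in range(epochs):
        for x, target in samples:
            feats = expand(x, n)
            error = target - sum(w * f for w, f in zip(weights, feats))
            weights = [w + rate * error * f for w, f in zip(weights, feats)]
    return weights

def mean_squared_error(weights, samples):
    return sum((t - predict(weights, x)) ** 2 for x, t in samples) / len(samples)

# Synthetic target (an assumption): a function the neuron can represent exactly.
xs = [k / 10.0 for k in range(-15, 16)]
samples = [(x, 0.5 * x + math.sin(x)) for x in xs]
trained = lms_train(samples)
```

In the complete algorithm this update would alternate with back propagation of the layer weights, as steps (4) through (10) describe; the sketch isolates one neuron so that the LMS rule itself is easy to follow.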

The Hierarchical Load Balancing Algorithm Based on GNN.
The computing nodes in large distributed systems can be mapped to a hierarchical tree using the tree model [26] to take advantage of the architecture. The hierarchical tree can be built according to the machine's topological hierarchy to minimize load balancing communication overhead. Assume that the transmission delay is low in a large distributed computing system, so that the transmission delay between any two adjacent nodes can be taken as approximately equal. Let $\tau_{i,j}(0)$ be the transmission delay of idle load between node $i$ and node $j$ at time $t$; we have

$$\tau_{i,j}(0) = \delta \, h_{i,j},$$

where $t = 0$, $\delta$ is the idle-load delay constant between adjacent nodes, and $h_{i,j}$ is the distance (in number of hops) of the shortest path from node $i$ to node $j$. For a system with $N$ nodes, we set an idle-load delay threshold for the nodes in each layer. From this threshold we can determine the height of the load balancing tree and construct the tree. In our load balancing strategy, an intermediate node at level $l$ and its immediate child nodes at level $l-1$ form a load balancing domain, with the root node of the domain as group leader (manager). Load balancing group leaders control the load balancing process inside their domain, playing a role similar to that of the central node in a centralized load balancing scheme. Load balancing domains periodically exchange the load of their processors. This process is triggered by the leaf processors at level 0 of the tree, which send their local load up to the domain group leader processors at level 1. The same process continues up the tree to the top level. The manager load balancing algorithm is described in Algorithm 1.
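The two ideas in this paragraph, an idle-load delay proportional to hop count and a bottom-up exchange of load through group leaders, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's Algorithm 1: the 2D mesh without wraparound, the dictionary-based tree, the node names, and the load values are all hypothetical.

```python
def hops_2d_mesh(a, b):
    """Shortest-path hop count between nodes a and b, given as (row, col)
    coordinates on a 2D mesh without wraparound (an assumed topology)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def idle_load_delay(a, b, delay_per_hop):
    """Idle-load delay modeled as a per-hop constant times the hop count."""
    return delay_per_hop * hops_2d_mesh(a, b)

def aggregate_loads(tree, leaf_loads, root):
    """Sum leaf loads bottom-up; returns {node: total load of its subtree}.

    `tree` maps each group leader to its immediate children; nodes that do
    not appear as keys are treated as leaves.
    """
    totals = {}

    def visit(node):
        children = tree.get(node, [])
        if not children:
            totals[node] = leaf_loads[node]
        else:
            totals[node] = sum(visit(child) for child in children)
        return totals[node]

    visit(root)
    return totals

# Hypothetical three-level tree: leaves at level 0, leaders at levels 1 and 2.
tree = {"root": ["g0", "g1"], "g0": ["p0", "p1"], "g1": ["p2", "p3"]}
leaf_loads = {"p0": 4.0, "p1": 6.0, "p2": 1.0, "p3": 9.0}
totals = aggregate_loads(tree, leaf_loads, "root")
delay = idle_load_delay((0, 0), (2, 3), delay_per_hop=0.5)
```

With these totals, each group leader sees only the loads of its own domain, which is what keeps the per-leader state small compared with a single central node.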

Load Balancing Process
(1) Load Balancing Initiation. At a synchronization time, the group leader collects information from each leaf node. Each leaf node reports its predicted task completion time $T_i$ to the group leader. For example, a group leader that manages $n$ leaf nodes maintains a state vector $S = (s_1, s_2, \ldots, s_n)$, $s_i \in \{0, 1\}$. The vector is initialized to $(0, 0, \ldots, 0)$. When the group leader receives $T_i$ from node $i$, it sets $s_i$ to 1.
(4) Load balancing strategy GreedyCommLB is invoked to complete task migration.
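The bookkeeping in the initiation step can be sketched as a small class. The class name `GroupLeader` and its methods are hypothetical, leaf nodes are indexed from 0 for convenience, and the migration step itself (GreedyCommLB) is not modeled here.

```python
class GroupLeader:
    """Tracks which leaf nodes have reported their predicted completion time."""

    def __init__(self, num_leaves):
        # State vector: 0 = not yet reported, 1 = reported.
        self.state = [0] * num_leaves
        self.completion_times = [None] * num_leaves

    def receive(self, leaf, predicted_time):
        """Record the report from one leaf and mark it in the state vector."""
        self.completion_times[leaf] = predicted_time
        self.state[leaf] = 1

    def all_reported(self):
        """True once every leaf in the domain has reported."""
        return all(flag == 1 for flag in self.state)

    def reset(self):
        """Clear the state vector before the next load balancing round."""
        self.state = [0] * len(self.state)
        self.completion_times = [None] * len(self.state)

leader = GroupLeader(num_leaves=3)
leader.receive(0, 2.5)
leader.receive(2, 4.0)
ready_before = leader.all_reported()  # leaf 1 has not reported yet
leader.receive(1, 3.1)
ready_after = leader.all_reported()
```

Checking `all_reported()` before invoking the balancing step mirrors the requirement that load balancing is triggered only after all load information has arrived.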

Comparison to Traditional Centralized Load Balancing Strategy.
The experiments were run on a 64-node Lenovo DeepComp 1800 installed at Dalian University of Technology, an SMP cluster in which each node has two 2.8 GHz Intel Xeon processors with 1 MB of L2 cache and 4 GB of physical RAM, is connected by a Myrinet network, and runs the RedHat 9.0 operating system. We simulated BlueGene/L (with 32 K to 64 K nodes) on the DeepComp 1800 with the BigSim emulator [27] using 8 real nodes and used a load balancing benchmark program (lb test) to simulate the running of parallel programs. The program generates a number of communicating objects (tasks) in a system with a 2D mesh topology, and each object performs a number of iterations. For the centralized and hierarchical load balancing strategies, the load balancing overhead and memory overhead at each iteration of the simulation on 32 K and 64 K emulated processors are shown in Tables 1 and 2.
Tables 1 and 2 compare the centralized and hierarchical load balancing strategies on the same lb test. In the centralized strategy, the single central node can become a communication bottleneck in a computing system with a very large number of processors. In the hierarchical strategy, however, the height of the load balancing tree can be reduced through the threshold setting, so the load balancing overhead, the memory overhead, and the idle time of the leaf nodes can all be greatly reduced.

Comparison to Traditional Distributed Load Balancing Strategy.
To compare the performance of the HLBSGNN strategy with that of a traditional distributed load balancing strategy, we ran the algorithm proposed in this paper and the nearest neighbor load balancing algorithm [28,29] (which migrates tasks from an overloaded node to its underloaded neighboring nodes) and used the Projections tool to track the load balancing process. The results are shown in Figures 4(a) and 4(b). As shown in Figure 4(a), the average CPU utilization of the HLBSGNN algorithm is about 30%, while that of the nearest neighbor load balancing algorithm is about 10%, as shown in Figure 4(b).

Comparison to Random Delay Strategies.
To verify the communication prediction mechanism of the hierarchical strategy, we ran the Charm++ communication test program Commbench on the DeepComp 1800 machine using 2 nodes to sample delays (Δ = 1 m) and tested the delay prediction performance of the IHA, GNN, and BP (back propagation) methods. The relationship between the square root error and the forgetting factor (shown in Figure 5) indicates that the prediction results are best when the forgetting factor is 0.7, so the forgetting factor is set to 0.7 in this paper.
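The exact forecasting formula that uses the forgetting factor is not given in this text. As an illustration of how such a factor could be tuned, the sketch below assumes a simple blend of the latest delay (the RW view) and the historical average (the HA view), then picks the factor with the lowest root-mean-squared one-step error. The blend itself, the function names, the `warmup` parameter, and the sample series are all assumptions.

```python
import math

def blended_forecast(history, alpha):
    """Assumed blend: alpha * latest sample + (1 - alpha) * historical mean."""
    latest = history[-1]
    average = sum(history) / len(history)
    return alpha * latest + (1.0 - alpha) * average

def rmse_for_alpha(delays, alpha, warmup=3):
    """Root-mean-squared one-step-ahead error over the series for one alpha."""
    squared = []
    for t in range(warmup, len(delays)):
        prediction = blended_forecast(delays[:t], alpha)
        squared.append((prediction - delays[t]) ** 2)
    return math.sqrt(sum(squared) / len(squared))

def best_alpha(delays, candidates):
    """Candidate factor with the smallest one-step RMSE on the given series."""
    return min(candidates, key=lambda a: rmse_for_alpha(delays, a))

# Made-up, steadily rising delay series: the latest sample is the best guide.
rising = [float(v) for v in range(1, 11)]
chosen = best_alpha(rising, [0.0, 0.5, 1.0])
```

Sweeping the candidates in this way gives a curve of error against the factor, which is the kind of relationship the text refers to in Figure 5; the value found on a real trace depends on the data.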
To compare the convergence time and number of learning iterations of the GNN and a traditional BP neural network, we ran the two algorithms under the same conditions, including initial weight values, training samples, and parameters. The results are shown in Table 3.
As shown in Table 3, the prediction accuracy of BP is lower than that of GNN. BP also takes longer to converge than GNN, and it even fails to converge when high precision is required. To compare the prediction performance of the BP, IHA, and GNN algorithms, we used 75 sets of data for training and forecasted 100 sets of data. The results are shown in Figures 6 and 7.
It can be seen from Figures 6 and 7 that the delay prediction of the generalized neural network is better than that of the IHA and BP algorithms. To evaluate prediction performance, we introduce the following metrics: (1) RME: relative mean error; (2) RMSE: root-mean-squared error; (3) EC: error change rate; (4) mxarer: maximum absolute relative error; (5) mrerr: mean relative error. The prediction errors of the algorithms are compared in Table 4, which shows that the GNN algorithm predicts more accurately than the IHA and BP algorithms when the delay changes sharply. In the load balancing process of the HLBSGNN strategy, the prediction results improve the accuracy of the makespan estimate when communication cost is considered, as shown in Table 5.
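The text names these metrics without giving their formulas. The sketch below computes three of them using their common textbook definitions, which is an assumption about what the paper used; RME and EC are left out because their intended definitions are not clear from the text. The function names and the sample values are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def relative_errors(actual, predicted):
    """Absolute relative error per sample; actual values must be nonzero."""
    return [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]

def mean_relative_error(actual, predicted):
    """Assumed definition of mrerr: mean of the absolute relative errors."""
    errors = relative_errors(actual, predicted)
    return sum(errors) / len(errors)

def max_abs_relative_error(actual, predicted):
    """Assumed definition of mxarer: largest absolute relative error."""
    return max(relative_errors(actual, predicted))

# Made-up measured and predicted delays.
actual = [10.0, 20.0, 40.0]
predicted = [11.0, 18.0, 40.0]
scores = {
    "RMSE": rmse(actual, predicted),
    "mrerr": mean_relative_error(actual, predicted),
    "mxarer": max_abs_relative_error(actual, predicted),
}
```

RMSE penalizes large absolute misses, while the two relative metrics put short and long delays on the same scale, which matters when delays change sharply.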
To investigate the effect of the GNN method on load balancing, we assumed a distributed system with a 2D mesh topology in which each node holds 10 tasks (objects) and each object communicates with its nearest neighbor node 100 times. We tested the GreedyCommLB algorithm with GNN-based and IHA-based delay prediction and compared the two. The results are shown in Figure 8. It can be seen that the GNN algorithm is better than the IHA algorithm and improves the prediction precision of the makespan.

Conclusions
In this paper, a hierarchical load balancing strategy based on a generalized neural network is proposed, considering the load balancing overhead and the time-varying characteristics of network delay in large distributed systems. The proposed strategy reduces load balancing overhead in large-scale systems through a communication-optimized hierarchy and optimizes the delay of communication and migration, thereby improving load balancing performance. Experimental results show that it is more effective and efficient than traditional centralized and distributed load balancing strategies in large distributed computing systems.
In fact, there is no "one size fits all" solution when it comes to load balancing. The performance of load balancing depends heavily on the particular application, system architecture, and numerous other variables, and there may be one or more viable solutions. This is also true of the work in this paper, which is mainly suited to cases in which the delay is large and time-varying. Research on load balancing for applications with strict real-time requirements will be our future work.

Figure 3: Delay forecasting model based on GNN.
Figure 4(b): CPU utilization Projections graph of the nearest neighbor strategy.

Figure 5: The function relation between forgetting factor and square root error of prediction delay.

Figure 6: Comparison of GNN and IHA algorithm.

Figure 7: Comparison of GNN and BP algorithm.

Table 3: Comparison of convergence time and learning times between BP and GNN.

Table 4: Prediction error comparison of various algorithms.

Table 5: Error comparison of IHA and GNN to predict makespan of load balancing.