Adaptive Dynamic Process Scheduling on Distributed Memory Parallel Computers

One of the challenges in programming distributed memory parallel machines is deciding how to allocate work to processors. This problem is particularly important for computations with unpredictable dynamic behaviors or irregular structures. We present a scheme for dynamic scheduling of medium-grained processes that is useful in this context. The adaptive contracting within neighborhood (ACWN) is a dynamic, distributed, load-dependent, and scalable scheme. It deals with dynamic and unpredictable creation of processes and adapts to different systems. The scheme is described and contrasted with two other schemes that have been proposed in this context, namely the randomized allocation and the gradient model. The performance of the three schemes on an Intel iPSC/2 hypercube is presented and analyzed. The experimental results show that even though the ACWN algorithm incurs somewhat larger overhead than the randomized allocation, it achieves better performance in most cases due to its adaptiveness. Its feature of quickly spreading the work helps it outperform the gradient model in performance and scalability.


Introduction
Large distributed memory parallel machines are becoming increasingly available. To efficiently use such large machines to solve an application problem, the computation must first be divided into parallel actions. These parallel actions are then mapped and scheduled onto processors.
Static, compile-time allocation is one way to accomplish this. As a rather simple example, consider the problem of multiplying two 64 x 64 matrices on 16 processors. One may decide that each processor will compute a 16 x 16 submatrix of the result matrix by using appropriate rows and columns from the original matrices. This leads to 16 sub-computations, as desired, and either an automatic scheduler or a programmer can specify the appropriate data movement and computations.
Such static scheduling schemes cannot be used when the size of sub-computations cannot be accurately determined. In fact, in many computations, the sub-computations themselves are not known at compile time. Combinatorial search problems encountered frequently in AI provide an extreme example. Exploring a node in the search tree may lead to a large sub-tree search, may quickly lead to a dead-end, or may lead to a solution. Even with deterministic computations, data dependencies and variable computational costs of operations lead to programs in which the detailed structure of computation cannot be predicted in advance. In such computations, one cannot divide the work into N equal parts, where N is the number of processing elements (pes) in the system, because the computational costs of subtasks cannot be predicted accurately. A reasonable strategy for such computations is to divide the work at runtime into many (much more than N) smaller granules, and attempt to dynamically distribute them across the processors of the system. The grainsize must be large enough to offset the overhead of parallelization. There are systems, such as the chare kernel described in the next section, which can support a grainsize as small as a few milliseconds. Partitioning an application with small grainsize provides a large pool of work. Thus, even if the amount of computation within individual granules varies unpredictably, it at least becomes possible to move these granules among processors to balance the load.
A scheduling scheme in such a context must deal with dynamic creation of work. It must cope with work generation and consumption rates that vary from processor to processor and from time to time. It cannot be a centralized scheme, as it must work with a large number of processors and must scale up to larger future systems. Rather, it must be a distributed scheme, in which each processor participates in realizing the load balancing objectives.
Obviously, a static scheduling scheme cannot be used for a computation that involves dynamic creation of work. However, a dynamic scheduling scheme can also be used for statically allocatable computations, such as the matrix multiplication problem mentioned above. In fact, a good dynamic scheduler may perform better than static schedulers even in some statically schedulable computations, because it will automatically adapt to variable speeds of processors and to variable numbers of processors.
In this paper we describe a dynamic and distributed scheduling scheme called Adaptive Contracting Within Neighborhood (ACWN). The next section discusses the background and context in which the scheme is to operate, and outlines basic issues. Section 3 describes some algorithms with similar objectives. Section 4 presents the ACWN algorithm and compares the three different scheduling algorithms. Performance evaluation is given in Section 5, showing that ACWN maintains good load balance with low overhead. In Section 6, we discuss why the ACWN algorithm outperforms the other algorithms.

Background
The chare kernel is a runtime support system that is designed to support machine independent parallel programming [1, 2, 3, 4]. The kernel is responsible for dynamically managing and scheduling parallel actions, called chares. A chare (the word stands for a small chore or task) is a process with some specific properties. Programmers use kernel primitives to create instances of chares and send messages between them, without concerning themselves with mapping these chares to processors, or deciding which chare to execute next. Chares have some properties that distinguish them from processes in general. Upon creation, and upon receipt of a message, chares usually execute for a relatively short time. They may create other chares or send messages to existing ones. Having processed a message, the chare suspends, to be awakened by another message meant for it. These characteristics simplify the scheduling of chares considerably.
We will use the chare kernel concepts and terminology in discussing dynamic scheduling strategies. However, it should be clear that the scheduling strategies that are applicable in this context can also be used in other contexts which involve dynamic creation of small-grained tasks. For example, the REDIFLOW system [5] for applicative programming, other parallel implementations of functional languages, rewrite systems and logic languages, and actor-based languages such as Cantor [6], can all benefit from such strategies.
Many previous research efforts have been directed towards task allocation in distributed systems [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. Although some basic ideas can be shared, we cannot simply apply these strategies to multicomputer networks. A recent comparative study of dynamic load balancing strategies on highly parallel computers is given by Willebeek-LeMair and Reeves [18]. Work with assumptions similar to ours includes the gradient model developed by Lin and Keller [19]. Athas and Seitz also point out that random placement can be a quite simple and effective strategy [20, 21]. These strategies are discussed in the next section.
A chare instance goes through three phases in its life-cycle: the allocating phase, the standing phase, and the active phase. It is said to be in the allocating phase from its creation until it enters a pool of chares at some pe, and in the standing phase until it starts execution for the first time. Then the active phase begins. Opportunities for chare scheduling exist in all three phases, but with different cost and effectiveness. The allocating-phase strategies as well as the standing-phase strategies are instances of placement strategies. The active phase can also be used for scheduling. Strategies that move a chare in the active phase are called migration strategies. Since the grainsize of chares is not large, migration is expensive and not necessary for load balance. Hence, this strategy is not considered in this paper.
Scheduling strategies can also be classified based on the amount of load information they use. The "load" measure may include the number of messages waiting to be processed, the number of active chares, available memory, etc., possibly in a weighted combination. For the following discussion, the specific load measure is unimportant. The scheduler at a pe may periodically collect information from other pes to calculate its own "status" information, on which the scheduling decision is based. The strategies can be classified as follows: type-i strategies use no status information; type-ii strategies calculate the status information using local load information only; type-iii strategies calculate the status information by collecting load information from neighbors; type-iv strategies calculate the status information by collecting status information from neighbors; and type-v strategies calculate the status information by collecting load information from all the pes in the system.
Type-i and type-ii strategies typically have low overhead. The randomized allocation to be discussed in Section 3 is an example of a type-i strategy. It is believed that a strategy that adapts to variations in the system is necessary, and that local information alone is not sufficient to judge such variations. Type-v strategies, on the other hand, are expensive in large systems, and not scalable.
The algorithm developed in this paper is a type-iii strategy, in which the status information of a pe is determined from load information from itself and from its neighbors. The gradient model to be described in the next section is a type-iv strategy: the status information of a pe is determined from its neighbors' status information. Thus, the status of a pe depends on its neighbors, and theirs, in turn, depends on their neighbors. However, the time required to exchange information causes the status to be based on possibly outdated information.

Randomized Allocation and Gradient Model
Athas and Seitz have proposed a global randomized allocation algorithm [20, 21]. The randomized allocation is an allocating-phase scheduling strategy; no standing-phase action is involved. A randomized allocation algorithm dictates that each pe, when it generates a new chare, should send it to a randomly chosen pe. One advantage of this algorithm is simplicity of implementation. No local load information needs to be maintained, nor is any load information sent to other pes. Statistical analysis shows that the randomized allocation has respectable performance as far as the number of chares per pe is concerned. However, a few factors may degrade the performance of the randomized allocation. First, the grainsize of chares may vary. Even if each pe processes about the same number of chares, the load on each pe may still be uneven. Second, the lack of locality leads to large overhead and communication traffic. Only 1/N of the subtasks stay on the creating pe, where N is the number of pes in the system. Thus, most messages between chares have to cross processor boundaries. The average distance traveled by messages is the same as the average internode distance of the system. This leads to a higher communication load on large systems. Since the bandwidth consumed by a long-distance message is certainly larger, the system is more likely to be communication bound compared to a system using other load balancing strategies that encourage locality. Eager et al. [8] have modified the naive randomized allocation algorithm. They use a threshold, a kind of local load information, to determine whether to process a chare locally or to place it randomly.
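As a concrete sketch, the per-chare decision of the randomized allocation fits in a few lines. The `send` and `enqueue_local` hooks are hypothetical runtime primitives, not chare kernel calls:

```python
import random

def allocate_randomized(chare, my_pe, num_pes, send, enqueue_local):
    """Randomized allocation: on creation, draw a target pe uniformly;
    keep the chare only if the draw happens to be the local pe.
    `send` and `enqueue_local` are hypothetical runtime hooks."""
    target = random.randrange(num_pes)
    if target == my_pe:
        enqueue_local(chare)   # about 1/N of chares stay local
    else:
        send(target, chare)    # the rest cross processor boundaries
```

Note that no load information of any kind is consulted; this is what makes the strategy type-i.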
The gradient model [19] is mainly a standing-phase scheduling strategy. As stated by Lin [22], instead of trying to allocate a newly generated chare to other pes, the chare is queued at the generating pe and waits for some pe to request it. A separate, asynchronous process on each pe is responsible for balancing the load. This process periodically updates the state function and proximity on each pe. The state of a pe is decided by two parameters, the low water mark and the high water mark. If the load is below the low water mark, the state is idle. If the load is above the high water mark, the state is abundant. Otherwise, it is neutral. The proximity of a pe represents an estimate of the shortest distance to an idle pe. An idle pe has a proximity of zero. For all other pes, the proximity is one more than the smallest proximity among the nearest neighbors. If the calculated proximity is larger than the network diameter, the pe is in saturation and the proximity is set to the network diameter + 1, to avoid unbounded increase in proximity values. If the calculated proximity is different from the old value, it is broadcast to all the neighbors. Based on the state function and the proximity, this strategy is able to balance the load between pes. When a pe is not in saturation and its state is abundant, it sends a chare from its local queue to the neighbor with the least proximity.
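The periodic proximity update described above can be sketched as follows (the function and parameter names are ours, not from the original gradient model code):

```python
def update_proximity(state, neighbor_prox, diameter):
    """One proximity update in the gradient model. `state` is 'idle',
    'neutral', or 'abundant'; `neighbor_prox` holds the latest
    proximity values heard from the nearest neighbors."""
    if state == 'idle':
        return 0                    # an idle pe has proximity zero
    p = 1 + min(neighbor_prox)      # one more than the smallest neighbor value
    if p > diameter:                # saturation: clamp at diameter + 1
        return diameter + 1
    return p
```

If the returned value differs from the previously broadcast one, the pe would rebroadcast it to its neighbors.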
The gradient model may cause load imbalance. For a tree-structured computation, this behavior could cause the upper-level nodes to cluster together near the root pe. When the results need to be collected at the root of the computation tree, the computation slows down. Furthermore, the proximity information may be inaccurate because of communication delays and the nature of the proximity update algorithm: by the time the proximity information from an idle pe propagates through the majority of pes in a system, the state of the original pe may have changed.

Adaptive Contracting Within Neighborhood
Adaptive Contracting Within Neighborhood (ACWN) is a scheduling algorithm using the type-iii strategy. Here, each pe calculates its own load function by combining various factors that indicate its current load. A simple measure may be the number of messages waiting to be processed. Adjacent pes exchange their load information periodically by sending a small load message or by piggybacking the load information on regular messages. Thus, each pe maintains load information on all its nearest neighbors.
For pe k, its own load function is denoted by F(k), and its neighbors' load functions are denoted by a set of values F'(i), where dist(k, i) = 1. The value of F(k) is calculated periodically.
The load information can then be used to determine a system state. For each pe k, a function B(k) is defined as B(k) = min {F'(i) : dist(k, i) = 1}, which represents how heavily its neighbors are loaded. Two predefined parameters, low mark and high mark, are compared with B(k) to ascertain the current system state, as shown in Table I. If B(k) < low mark, the system is considered to be in the light-load state. If B(k) >= high mark, it is in the heavy-load state. Otherwise, it is in the moderate-load state. The allocating-phase strategy is called contracting and the standing-phase strategy is called redistributing.
As mentioned before, a chare is in its allocating phase from the time it is created until it enters the local queue at a pe. The allocating-phase strategy of the ACWN algorithm is shown in Figure 1. During this phase, a newly created chare is contracted m hops, where 0 <= m <= d and d is the network diameter. We set an upper limit of traveling distance d for each allocating chare to prevent unbounded message oscillation. The contracting decision is based on the system state of each pe. The number of hops traveled so far by each chare c is recorded as c.hops. Thus, at each pe k, for an allocating chare c, which either is created by pe k or received from another pe, there exist the following cases: if the system is in the heavy-load state or c.hops >= d, chare c is retained locally and added to the local pool of messages, terminating its allocating phase; if the system is in the light-load state and c.hops = 0, pe k contracts chare c to its least-loaded neighbor no matter what its own load is. Otherwise, the chare is contracted conditionally: if the load on the least-loaded neighbor is smaller than pe k's own load, the chare is contracted out to that neighbor. In this way, the newly generated chare c travels along the steepest load gradient to a local minimum.
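The case analysis above can be summarized in a small decision function. This is a sketch with illustrative names; the real kernel would also increment c.hops and enqueue the chare at the chosen pe:

```python
def contract(k_load, neighbor_loads, state, hops, diameter):
    """Allocating-phase decision at pe k. Returns None to retain the
    chare locally, or the id of the neighbor to contract it to.
    `state` is 'light', 'moderate', or 'heavy'; `neighbor_loads`
    maps each neighbor id to its last reported load F'(i)."""
    if state == 'heavy' or hops >= diameter:
        return None                              # retain locally
    least = min(neighbor_loads, key=neighbor_loads.get)
    if state == 'light' and hops == 0:
        return least                             # contract regardless of own load
    if neighbor_loads[least] < k_load:
        return least                             # follow the steepest load gradient
    return None
```

Repeated application of this rule at successive pes is what moves the chare to a local load minimum.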
The standing-phase strategy of the ACWN algorithm is shown in Figure 2. Load imbalances may appear even though the allocating-phase strategy is applied. Such imbalance may appear either due to limitations of the underlying load contracting scheme, which finds only a local minimum, or due to the different rates of consumption of chares. Moreover, since each pe has its own system state, it is possible that pes in the light-load, moderate-load, and heavy-load states exist at the same time in a system. During the heavy-load state, pes accumulate chares without sending them to any other pes. Thus, after a pe leaves the heavy-load state, it may own many more chares than other pes. These chares need to be redistributed to other pes, as the allocation of new chares alone may not be sufficient to correct the load imbalance. Notice that the redistributing is active only when a pe is not in its heavy-load state. In the heavy-load state, since all neighbors of the pe have sufficient work to do, it is not necessary to balance load between them.
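A minimal sketch of one redistributing step follows, assuming the transfer condition is simply that pe k is strictly more loaded than its least-loaded neighbor; the exact condition in Figure 2 may differ:

```python
def redistribute(k_load, neighbor_loads, state, pop_chare, send):
    """One standing-phase step at pe k, run once per time interval.
    When not in the heavy-load state, pick the least-loaded neighbor
    and, if pe k is more loaded, hand over one standing chare.
    `pop_chare` and `send` are hypothetical queue/transport hooks."""
    if state == 'heavy' or not neighbor_loads:
        return                                   # inactive while heavily loaded
    i = min(neighbor_loads, key=neighbor_loads.get)
    if neighbor_loads[i] < k_load:
        chare = pop_chare()                      # may be None if the queue is empty
        if chare is not None:
            send(i, chare)
```

Running this periodically drains the backlog a pe accumulated while it was in the heavy-load state.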
The behavior of both the contracting and redistributing scheduling strategies is affected by the system state, which is determined by the load information as well as the predefined parameters, low mark and high mark. Low mark is used to switch states between light-load and moderate-load. If it is too high, chares are contracted out frequently, and the overhead of moving chares becomes higher. If it is too low, chares spread out slowly and load imbalance may occur. High mark is used to decide whether the system is heavily loaded, i.e. in saturation. If this mark is too high, the scheduling algorithm keeps moving chares among pes even when they all have sufficient work, leading to higher overhead. However, if high mark is set too low, the heavy-load state is reached prematurely, which may cause load imbalance. Experiments suggested that performance is not sensitive to these parameters as long as they are in a reasonable range. As shown in Figures 3 and 4, the low mark could be about 2 to 5, and the high mark about 8. These experiments used the number of messages waiting to be processed as the measure of load. In the rest of the experiments with ACWN, we chose values of low mark and high mark of 2 and 8, respectively.
Scheduling strategies without migration can be summarized in a general model. The model consists of three functions. In the allocating phase, whether a chare is sent out depends on an allocating-phase function. If the function is true, the chare will be sent out; otherwise it is kept local. In the standing phase, whether chares are moved depends on a standing-phase function. If the function is true, the chares will be redistributed between pes. The third function is a destination function that determines which pe to send a chare to when the chare is to be allocated or redistributed. Different scheduling strategies can set different values for each of the three functions. If a scheduling strategy sets the allocating-phase function always false, it is considered to be inactive during the allocating phase. Similarly, if a scheduling strategy sets the standing-phase function always false, it is said to be inactive in the standing phase. To compare the randomized allocation, gradient model, and ACWN algorithms under this general model, we list the three functions for each of them in Table II. For the gradient model, P(k) represents the proximity at pe k and d is the network diameter. For the randomized allocation, whenever pe k generates a new chare, a random number m is obtained to determine its allocating-phase function as well as its destination function, where 0 <= m < N and N is the number of pes in the system. If m is equal to k, the allocating-phase function is false. Otherwise, the chare will be sent to pe m as its destination. The gradient model has virtually no allocating-phase action. When a chare is generated, it is put in the local queue. This leads to slow spreading of the load. On the other hand, the randomized allocation has no standing-phase action. It usually generates a good distribution of the load. However, when the sizes of chares vary over a wide range, this strategy is unable to redistribute the load even if some pes are busy and others are idle. ACWN conducts actions in both phases, resulting in more reliable performance.
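The general model can be phrased as an interface with the three functions, here instantiated for the randomized allocation (the class and field names are illustrative):

```python
import random

class Strategy:
    """The three functions of the migration-free scheduling model."""
    def should_allocate(self, pe, chare):        # allocating-phase function
        raise NotImplementedError
    def should_redistribute(self, pe):           # standing-phase function
        raise NotImplementedError
    def destination(self, pe, chare):            # where a moved chare goes
        raise NotImplementedError

class Randomized(Strategy):
    """Randomized allocation under the model: one random draw fixes
    both the allocating-phase function and the destination; there is
    no standing-phase action."""
    def __init__(self, num_pes):
        self.num_pes = num_pes
    def should_allocate(self, pe, chare):
        chare['dest'] = random.randrange(self.num_pes)
        return chare['dest'] != pe               # false iff the draw is local
    def should_redistribute(self, pe):
        return False
    def destination(self, pe, chare):
        return chare['dest']
```

The gradient model and ACWN would fill in the same three slots with their own predicates, as listed in Table II.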

Performance Studies
We have tested several examples on an Intel iPSC/2 hypercube to study the effectiveness of dynamic scheduling schemes on multicomputers. The machine used has a 32-node configuration with 4 megabytes of memory at each node. Three algorithms, randomized allocation, gradient model, and ACWN, were implemented. They shared most subroutines, except for the allocating-phase function, the standing-phase function, and the destination function. Notice that the programs were chosen not because they are good parallel algorithms for the problems they solve, but because they are suitable for illustrating the different computation patterns handled by dynamic scheduling. For each program, the best sequential program written in C was also tested without changing the algorithm.
In general, the sum of execution times of all pes can be broken into three parts: computation time, overhead, and idle time. Computation time is spent on problem solving and should be equal to the sequential execution time. This time is invariant with different scheduling strategies, different numbers of pes, and different grainsizes. Overhead includes the work of bookkeeping, communication, and load balancing. Idle time is the time in which pes have no work to do. The overhead and idle time depend on the granularity of partitioning as well as the scheduling strategy. Experiments with different grainsizes of the 10-Queen problem were conducted to analyze the effect of granularity. In Figure 5, we show the efficiency of this problem for different numbers of pes with different grainsizes. The performance of the largest grainsize slumps as the number of pes increases because the pool of available chares is not large enough to keep all the pes busy. The poor performance of the curve with the smallest grainsize is due to overhead. For example, in Figure 6 we give the components of the execution time for different grainsizes with 16 pes. Here, a small grainsize imposes a large amount of overhead. On the other hand, a large grainsize reduces overhead, but may result in longer processor idle time because of load imbalance. For each individual chare, the system maintains a chare block, and for each message there is a message header including its source and destination chare information. The overhead of bookkeeping is about 250-400 microseconds whenever a new chare is created or a message is sent. The communication overhead consists of the time spent by the processor dealing with the sending and receiving of messages. The actual transmission time is overlapped with computation and does not need to be considered. The overhead for each communication is about 450 microseconds. The granularity also affects communication overhead, because the number of messages exchanged between pes tends to increase when the
grainsize becomes smaller. Not all the messages between chares introduce communication overhead; only those going to pes other than the source pe do. Thus, the load-balancing strategies also influence the communication overhead, as different strategies have different effects on what fraction of the messages will be between local chares. Scheduling overhead can be divided into two parts: updating load information and chare placement. Time spent on chare placement is proportional to the number of chares and is determined by granularity. System load information can be exchanged periodically. As shown in Figure 7 for the 10-Queen problem on 16 pes, too short a period increases communication overhead, and too long a period leads to inaccurate load information due to sluggish updates. With a long exchange period, the system behaves unstably; for periods beyond 256 milliseconds we therefore give both the worst and best times from many repetitions of the experiment. In Figure 7, we show two curves, with and without piggybacking, for different exchange periods. Piggybacking load information on regular outgoing messages can reduce the number of load information messages exchanged. The curve with piggybacking behaves better than the one without, since with every message we update load information at a negligible additional cost. In Figure 8, we pick one instance with piggybacking to show the sum of overhead and the sum of idle time over all 16 pes. With a short exchange period, load information is updated more frequently than necessary, adding overhead. However, if the period is too long, the load becomes highly unbalanced, with long idle times. From the curves, it can be seen that the best period is between 50 and 150 milliseconds. In the rest of the experiments, piggybacking is applied to both the ACWN and the gradient model algorithms. The period of load information exchange is set to 100 milliseconds for ACWN, and the best value of the exchange interval is also selected for the gradient model.
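Piggybacking amounts to attaching the sender's current load to every regular message; a sketch with an illustrative message layout:

```python
def send_message(dest, payload, my_load, raw_send):
    """Wrap every regular outgoing message with the sender's current
    load, so neighbors stay up to date at negligible extra cost
    (the message field names are illustrative)."""
    raw_send(dest, {'load': my_load, 'payload': payload})

def on_receive(msg, src, neighbor_loads):
    """On receipt, refresh the cached load of the sending neighbor
    before handing the payload to the application."""
    neighbor_loads[src] = msg['load']
    return msg['payload']
```

Explicit load messages then need to be sent only to neighbors with which no regular traffic occurred during the exchange period.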
We now discuss the influence of scheduling strategies. A good scheduling algorithm must be able to balance load for different application problems. At the same time, it has to keep scheduling overhead small. Furthermore, it must maintain good locality so that most chares can be executed locally to reduce communication overhead. Here we compare the three scheduling algorithms: randomized allocation, gradient model, and ACWN. In Figure 9, for Fibonacci-32 on 8 pes, we list the chare distribution at each pe with the different scheduling algorithms. Each chare processed at pe k is either generated by pe k itself or received from other pes. The ACWN has the most locally generated chares and a few from other pes. At the other extreme, the randomized allocation has few local chares (about 1/N), and most chares come from other pes.
The only scheduling overhead for the randomized allocation is generating random numbers whenever a chare is created. However, its communication overhead is high, since most chares are sent to other pes irrespective of whether the system is heavily or lightly loaded. For the same problem as in Figure 9, we illustrate the percentage of computation time, overhead, and idle time in Figure 10. To compare the algorithms, the overhead time can be subdivided further into three sub-categories: bookkeeping overhead (T_B), communication overhead (T_C), and load balancing overhead (T_L). Figure 11 extracts the overhead parts from Figure 10 and illustrates each kind of overhead for the different algorithms. The randomized allocation has large overhead spent on communication, although its scheduling overhead is negligible. The gradient model utilizes the system status information to balance loads among pes so that the idle time is reduced. More importantly, the gradient model sends chares away only when necessary. Due to this locality property, the gradient model does not incur high communication overhead compared to the randomized allocation. However, the gradient model must exchange load information more frequently to balance the load, resulting in large load balancing overhead. The ACWN exhibits better locality than the gradient model. Therefore, it has less communication overhead. Its scheduling overhead is also small, due to a low frequency of load information exchange.
In Table III and Figures 12-15, we give the performance comparison of the randomized allocation, the gradient model, and the ACWN algorithms. Here, one instance of each program has been chosen for execution, that is, 10-Queens, Fibonacci-32, one configuration of the 15-puzzle, and the Romberg integration with 14 iterations. Characteristic features of the different problems are shown in Table IV. The granularity is between 1 and 100 milliseconds, resulting from the medium-grained partitioning. Coarse granularity causes serious load imbalance and fine granularity leads to large overhead. The Fibonacci problem is a regular tree-structured computation. The grainsizes of leaf chares are roughly the same. In the Queen problem, the grainsize is not even, since whenever a new queen is placed, the search either successfully continues to the next row or fails. The 15-puzzle, solved here with an iterative search, is a good example of an AI search problem; performance on this problem is therefore not as good as on the others. In the Romberg integration, the evaluation of function points at each iteration is performed in parallel. As we can see, ACWN is better than both the randomized allocation and the gradient model in all the cases.

Discussion
The ACWN algorithm outperforms the randomized allocation and the gradient model partly due to its two-phase scheduling strategy and partly due to its adaptive locality. Its good locality reduces communication overhead, whereas the randomized allocation incurs it heavily. Besides the standing-phase strategy, the allocating-phase strategy of ACWN allows load to spread out faster than in the gradient model. The ACWN can adapt to different chare sizes too. Assume that at some moment pe i and pe j each have m messages waiting to be processed. It happens that pe i gets a message with a large amount of computation. After a while, pe i still holds m - 1 messages while pe j may have no messages left. At this time, ACWN is able to schedule messages from pe i to pe j to balance the load. In contrast, the randomized allocation cannot adapt to such a case.
For a small number of pes, the gradient model achieves better load balance than the randomized allocation. However, since the gradient model was designed around good locality to reduce communication overhead, it does not spread the load very fast. For a large number of pes, the gradient model leads to more load imbalance than the randomized allocation does. As shown in Figure 16 for the 10-Queen problem, the idle time of the gradient model at 16 and 32 pes is longer than that of the randomized allocation. A similar conclusion is also reached by Grunwald [24]. ACWN achieves the most even load distribution among the three scheduling algorithms.
From the experiments, the overhead for a local chare, which involves no communication overhead, is about 0.3 to 0.4 milliseconds, and for a remote chare, which does, it is about 1.2 to 1.3 milliseconds. Thus, performance may not suffer much from bookkeeping and communication overhead if the grainsize of a chare is much larger than that. A few tens of milliseconds can therefore be counted as a reasonable grainsize. Due to its large number of remote chares, the communication overhead for the randomized allocation is large, which in turn demands a larger grainsize. Does the overhead of a complicated scheduling algorithm always overwhelm the benefit it achieves? Certainly, a complex algorithm (as an extreme example, one that looks for the least loaded processor across the entire system at every scheduling decision) loses its uniform distribution advantage to its high overhead. The randomized allocation algorithm bears negligible overhead for load balancing decisions, but the communication overhead is high and the suspension is large. We have shown that a good load balance can be obtained by a simple algorithm with low scheduling overhead. Even though ACWN pays more scheduling overhead compared to the randomized allocation, it still achieves better performance in most cases.
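The grainsize argument can be made concrete with a small calculation using the midpoints of the per-chare costs quoted above. The formula itself is our back-of-the-envelope model, not a measured result:

```python
def overhead_fraction(grain_ms, remote_share, local_oh=0.35, remote_oh=1.25):
    """Fraction of a chare's lifetime spent on overhead, given its
    grainsize in milliseconds and the fraction of chares that are
    remote. The 0.35 and 1.25 ms defaults are midpoints of the
    measured per-chare cost ranges quoted in the text."""
    per_chare = (1 - remote_share) * local_oh + remote_share * remote_oh
    return per_chare / (grain_ms + per_chare)
```

For a 10 ms grainsize with all chares local the overhead fraction is a few percent, while a 1 ms grainsize with all chares remote spends over half its time on overhead, which is why a larger grainsize is needed under the randomized allocation.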
Overhead can be reduced by using co-processors. A co-processor attached to the main processor in each pe could handle all bookkeeping, load balancing, and communication activities. In the iPSC/2 hypercube, each pe has a communication co-processor which takes over part of the communication overhead. Since we are not able to program the co-processors, the overhead of bookkeeping, load balancing, and part of the communication must be handled by the main processor. If the ACWN scheduling were applied to a system with programmable co-processors, the frequency of load information exchange could be increased and more communication activity could take place to improve load balance, as long as the load of the co-processor does not exceed the load of the main processor. The randomized allocation and the gradient model may benefit more from co-processors than ACWN does, since the randomized allocation has more communication overhead and the gradient model has more scheduling overhead.

Conclusion
We described a scheme for dynamic scheduling of medium-grained processes on multicomputers. The scheme, called Adaptive Contracting Within Neighborhood, employs two substrategies: an allocating-phase strategy and a standing-phase strategy. The allocating-phase strategy moves a new piece of work along the steepest load gradient to a local minimum within a neighborhood. It estimates the system state and ensures that pieces of work are moved only when the system requires it. The standing-phase strategy corrects load imbalance by redistributing pieces of work that were initially allocated by the allocating-phase strategy. Every processor maintains load information about its neighbors only, and such information is often exchanged by piggybacking it on regular messages. Thus, the scheme incurs low load balancing overhead. As it manages to retain many pieces of work on the processor that produced them, it has low communication overhead.
ACWN was compared with two other schemes, the randomized allocation and the gradient model. The randomized allocation incurs negligible load balancing overhead and achieves a reasonably uniform distribution of work. However, it incurs much communication overhead. The gradient model, on the other hand, enforces locality at the expense of agility in spreading work out quickly to processors. All these schemes were implemented in a system called the chare kernel running on the Intel iPSC/2 hypercube. The experimental results demonstrate that ACWN performs better than the other two algorithms for many computation patterns.

Figure 1: The allocating-phase strategy for the ACWN algorithm.

Figure 2: The standing-phase strategy for the ACWN algorithm.

Figure 3: Low mark effect on the performance for the 10-Queen problem.
Figure 4: High mark effect on the performance for the 10-Queen problem.

Figure 7: Comparison of different periods for exchanging load information.

Figure 8: Total overhead and idle time at all 16 pes.

Table I: System States

Table II: The Allocating-phase, Standing-phase, and Destination Functions at pe k