Run-Time and Compiler Support for Programming in Adaptive Parallel Envirornnents *

For better utilization of computing resources, it is important to consider parallel programming environments in which the number of available processors varies at run-time. In this article, we discuss run-time support for data-parallel programming in such an adaptive environment. Executing programs in an adaptive environment requires redistributing data when the number of processors changes, and also requires determining new loop bounds and communication patterns for the new set of processors. We have developed a run-time library to provide this support. We discuss how the run-time library can be used by compilers of high-performance Fortran (HPF)-Iike languages to generate code for an adaptive environment. We present performance results for a Navier-Stokes solver and a multigrid template run on a network of workstations and an IBM SP-2. Our experiments show that if the number of processors is not varied frequently, the cost of data redistribution is not significant compared to the time required for the actual computation. Overall, our work establishes the feasibility of compiling HPF for a network of nondedicated workstations, which are likely to be an important resource for parallel programming in the future.


INTRODUCTION
In most existing parallel programming systems, each parallel program or job is assigned a fixed number of processors in a dedicated mode.Thus, the job is executed on a fixed number of processors, and its execu-tion is not affected by other jobs on any of the processors.This simple model often results in relatively poor use of available resources.A more attractive model would be one in which a particular parallel program could use a large number of processors when no other job is waiting for resources, as well as a smaller number of processors when other jobs need resources.Setia et al. [1,2] have shown that such a dynamic scheduling policy results in better utilization of the available processors.
There has been an increasing trend toward using a networkk of workstations for parallel execution of programs.A workstation usually has an individual owner or small set of users who would like to have sole use of the machine at certain times.However, when the individual users of workstations are not logged in.these workstations can lw used for executin~ a parallel application.When the individual user of a workstation returns.the application must be adjusted either not to use the workstation at all or to use very few cvcles on the workstation.The idea is that individual users of the workstation do not want the execution of a large parallel application to slow down the processes they want to execute.
We refer to a parallel programming environment in which the number of processors available for a given application varies with time as an adaptive parallel programming erwironment.The major difficulty in us- ing an adaptive parallel programming environment is in developing applications for execution in such an environment.In this article, we address this problem for distributed memory parallel machines and networks of workstations, neither of which support shared memory.In these machines, communication between processors has to be explicitly scheduled by a compiler or bv the user.
A commonly used model for developing parallel applications is the data-parallel programming model. in which parallelism is achieved by dividing large data sets between processors and having each processor work only on its local data.High-performance fortran (HPF) [3], a language proposed by a consortium from industry and academia and is being adopted by a number of vendors, targets the data-parallel programming model.In compiling HPF programs for execution on distributed memory machines, two major tasks are dividing work or loop iterations across processors.and detecting.inserting, and optimizing communication between processors.To the best of our knowledge, all existing work on compiling data-parallel applications assumes that the number of processors available for execution does not vary at run-time [ 4-6].If the number of processors varies at run-time, run-time routines need to be inserted for determining work partitioning and communication during the execution of the program.
We have developed a run-time library for developing data-parallel applications for execution in an adaptive environment.There are two major issues in executing applications in an adaptive environment: 1. Redistributing data when the number of available processors changes during the execution of the program 2. Handling work distribution and communication detection, insertion, and optimization when the number of processors on which a given parallel loop will be executed is not known at compile-time.
Executing a program in an adaptive environment can potentially incur a high overhead.If the number of available processors is varied frequently.then the cost of redistributing data can become significant.Because the number of available processors is not known at compile-time.work partitioning and communication need to be handled by run-time routines.This can result in a significant overhead if the run-time routines an~ not efficient or if the run-time analysis is applied too often.
Our run-time library, called Adaptive Multiblock PARTI (A\1P).includes routines for handling the two tasks we have described.This run-time library can be used by compilers for data-parallellanguages or it can be used by a programmer parallelizing an application by hand.This article describes our run-time library and discusses how it can be used by a compiler.We restrict our work to data-parallel languages in which parallelism is specified through parallel loop constructs like forall statements and array expressions.We present experimental results on two applications parallelized for adaptive execution by inserting our run-time support by hand.Our experimental results show that if the number of available processors does not vary frequently, the cost of redistributing data is not significant as compared to the total execution time of the program.Overall, our work establishes the feasibility of compiling HPF -like data-parallel languages for a network of nondedicated workstations.
The rest of this article is organized as follows.Section 2 discusses the programming model and model of execution we are targeting.Section 3 describes the run-time library we have developed.We briefly discuss how this run-time library can be used by a compiler in Section 4. Section 5 presents experimental results we obtained by using the library to parallelize two applications and running them on a network of workstations and an IBM SP-2.In Section 6, we compare our work with other efforts on similar problems.We conclude in Section 7.

MODEL FOR ADAPTIVE PARALLELISM
This section discusses the programming model and model of program execution targeted by our run-time library.A parallel programming system in which the number of available processors varies during the execution of a program is called an adaptive programming environment.A program executed in such an environment is referred to as an adaptive program.These programs should adapt to changes in the number of available processors.The number of processors available to a parallel program changes when users log in or out of individual workstations, or when the load on processors change for various reasons (such as from other parallel jobs in the systPm).We refer to remapping as the activity of a program adjusting to the change in the number of available processors.\Ve have chosen our model of program execution with two main concerns: 1. We want a model which is practical for dPveloping and running common scientific and engineering applications.2. We want to develop adaptive programs that are portable across many existing parallel programming systems.This implies that thP adaptive programs and the run-time support developed for them should require minimal operating system support.
We restrict our work to parallel programs using the single-program multiple-data (SPMD) modPl of execution.In this modeL the same program text is run on all the processors and parallelism is achieved by partitioning data structures (typically arrays) between processors.This model is frequently used for scientific and engineering applications, and most of the existing work on developing languages and compilers for programming parallel machines uses the SPMD model [3].An example of a simple data-parallel program that can be easily transformed into a parallel program that can be executed in SPMD mode is shown in Figure 1.The only change required to turn this program into an SPMD parallel program for a static environment would be to change the loop bounds of the forall loop appropriately so that each processor only executes on the part of array A that it owns and then to determine and place the communication between processors for array B.
We are targeting an environment in which a parallel program must adapt according to the system load.A program may be required to execute on a smaller rnnnber of processors because an individual user logs in on a workstation or because a new parallel job requires resources.Similarly.it may be desirable for a parallel program to execute on a larger number of processors because a user on a workstation has logged out or because another parallel job executing in the parallel system has finished.In such scenarios.it is acceptable if: 1.The adaptive program does not remap immediately when the system load changes.2. TllP program remaps from a larger number of processors to a smaller number of processors.However. it mav continue to usE' a small number of cycles on the processors it no longer uses for computation.
This kind of flexibility can significantly easr remapping of data-parallel applications, with minimal operating system support.If an adaptive program has to be remapped from a larger number of processors to a smaller number of processors.this can be done by redistributing the distributed data so that processors which should no longer he executing the program do not own any part of the distributed data.The SPMD program will continue to exf'cute on all processors.We refer to a process that owns distributed data as an active process and a process from which all data have been removed as a skeleton process.A processor owning an active process is referred to as an activE' processor and similarly.a processor owning a skeleton process is referred to as a skeleton processor.A skeleton processor will still execute each parallel loop in the program.However.after evaluating the local loop bounds to restrict execution to local data.a skeleton processor will determine that it does not need to execute any itf'rations of the parallel loop.All computations involving writing into scalar variables or replicated arrays will continue to be executed on all processors.The parallel program will use some cycles in the skeleton processors, in the evaluation of loop bounds for parallel loops, and in the computations involving writing into scalar variables or replicated arrays.However, for data-parallel applications involving large arrays this is not likPly to cause any noticeable slowdown for other processes executing on the skeleton processors, especially because replicated arrays are rarely used in significant computations.This model substantially simplifies remapping when a skeleton processor again becomes available for executing the parallel program.A skeleton processor can be made active simply by redistributing the data so that this processor owns part of the distributed data.~ew processes do not need to be spawned when skeleton processors become available, hence no operating system support is required for remapping to start execution on a larger number of processors.In this modeL a maximal possible set of processors is specified before starting execution of a program.The program text is executed on all these processors, although some of these may not own any portions of the distributed data at any given point in the program execution.W'e believe that this is not a limitation in practice, because the set of workstations or processors of a parallel machine that can possibly be used for running an application is usually known in advance.
Figure 2 represents three different states of five processors (workstations) executing a parallel program using our model.In the initial state, the program data are spread across all five processors.In the second state, two users have logged in on processors 0 and 2, so the program data are remapped onto processors 1, 3, and 4.After some time.those users log off and another user logs in on processor 1.The program adapts itself to this new configuration by remapping the program data onto processors 0, 2, 3, and 4.
If an adaptive program needs to be remapped while it is in the middle of a parallel loop, much effort may be required to ensure that all computations restart at the correct point on all the processors after remapping.The main problem is ensuring that each iteration of the (parallel) loop is executed exactly once, either before or after the remapping.Keeping track of which loop iterations have been completed before the remapping, and only executing those that haven't already been completed after the remapping, can be expensive.However, if the program is allowed to execute for a short time after detecting that remapping needs to be done, the remapping can be substantially simplified.Therefore, in our model, the adaptive program is marked with remap points.These remap points can be specified by the progranmH:'r if the program is parallelized by hand, or they may be determined by the compiler if the program is compiled from a single program specification (e.g., using an HPF compiler).
We allow remapping when the program is not executing a data-parallel loop.The local loop hounds of a data-parallel loop are likely to be modified when the data are redistributed, because a processor is not likely to own exaetlv the same data both before and after remapping.Also.remapping can be done at a point only if none of the nodf's arc in the middle of any I/0 activity.We will further discuss how the compiler can determine placement of remap points in Section 4.
At each remap point, the program must determine if there is a reason to remap.\Ve assume a detection mechanism that determines if the load needs to be shifted away from any of the processors which are currently active, or if any of thf' skeleton processors can be made active.This detection mechanism is the only operating system support our model assumes.All the proct'ssors synchronize at the remap point and, if the detection mechanism determines that remapping is required.data redistribution is done.
Two main considerations arise in choosing remap points.If the remap points are too far apart, i.e .. if the program takes too much time between remap points, this may not be acceptable to the users of the machine(s ).If remap points are too dose together.the overhead of using the detection mechanism may start to lwcome significant.Support for the detection mechanism can be easily implemented in one of the two ways.The operating system can check if a user is logged in (or if the user has been idle for a long time).Alternatively, users of the individual workstations can change a variable to let the system know whether or not they want their workstation to be used for parallel programs.
Our model for adaptive parallel programming is closest to the one presented by Konuru et al. [7].They also consider data-parallel programming in an adaptive environment, including a network of heterogeneous workstations.The main difference in their approach is that the responsibility for data repartitioning is given to the application programmer.We have concentrated on developing run-time support that can perform data repartitioning, work partitioning, and communication after remapping.Our model satisfies the three requirements stated by Konuru et al. [7], namely.withdrawal (the ability to withdraw computation from a processor within a reasonable time), expansion (the ability to expand into newly available processors), and redistribution (the ability to redistribute work onto a dynamic number of processors so that no processor becomes a bottleneck).

RUN-TIME SUPPORT
This section discusses the run-time library we have df'veloped for adaptive programs.The run-time library has been developed on top of an existing runtime library for structured and block -structured applications.This librarv is called Multiblock PARTI [8.9], because it was initially used to parallelize multiblock applications.We have developed our run-time support for adaptive parallelism on top of Y1ultiblock PARTI because this run-time library provides much of the run-time support required for forallloops and array expressions in data-parallel languages like HPF.This library was also integrated with the HPF /Fortran 90D compiler developed at Syracuse (;niversity [ 4, 10,11].
We discuss the functionality of the existing library and then present the extensions that were implemented to support adaptive parallelism.We refer to the new library.with extensions for adaptive parallelism, as AMP.

Multiblock PARTI
This run-time library can be used in optimizing communication and partitioning work for HPF codes in which data distribution, loop bounds, and/ or strides are unknown at compile-time and indirection arrays are not used.Consider the problem of compiling a data-parallel loop, such as a forallloop in HPF, for a distributed memory parallel machine or network of workstations.If all loop bounds and strides are known at compile-time and if all information about the data distribution is also known.then the compiler can perform work partitioning and can also determine the Sf'ts of data elements to be communicated between processors.However.if all this information is not known, then these tasks may not be possible to perform at compile-time.Work partitioning and communication generation become especially difficult if there are svmbolic strides or if the data distribution is not known at compile-time.In such cases, run-time analysis can be used to determine work partitioning and generate communication.The Multiblock PARTI library has been developed for providing the required run-time analvsis routines.
In summary, the run-time library has routines for three sets of tasks: A key consideration in using run-time routines for work partitioning and communication is to keep the overhead of run-time analysis low.For this reason, the run-time analysis routines must be efficient and it should be possible to reuse the results of run-time analysis whenever possible.In this run-time system, communication is performed in two phases.First, a subroutine is called to build a communication schedule that describes the required data motion . .and then another subroutine is called to perform the data motion (sends and receives on a distributed memory parallel machine) using a previously built schedule.Such an arrangement allows a schedule to be used multiple times in an iterative code.
To illustrate the functionalitv of the run-time routines for communication analysis.consider a single statement forall loop as specified in HPF.This is a parallel loop in which loop bounds and strides associa ted with any loop variable cannot be functions of any other loop variable [3].If there is only a single array on the right-hand side.and all subscripts are affine functions of the loop variables.then this forall loop can be thought as copying a rectilinear section of data from the right-hand side array into the left-hand array.potentially involving changes of offsets and strides and index permutation.We refer to such communication as a regular section move [ 12].The library includes a regular section move routine.Regular_ SectiorLMove_Sched.that can analvze the communication associated with a copy from a right-hand side array to left-hand side array when data distribution.loop bounds, and/or strides are not known at compile-time.
A regular section move routine can be invoked for analyzing the communication associated with any forall loop.but this may result in unnecessarily high runtime overheads for both execution time and memory usage.Communication resulting from loops in many real codes has much simpler features that make it easier and less time-consuming to analyze.For example, in many loops in mesh-based codes.only ghost (or overlap) cells [ 13] need to be filled along certain dimension( s).If the data distribution is not known at compile-time, the analysis for communication can be much simpler if it is known that only overlap cells need to be filled.The Multiblock PART! library includes a comrmmication routine, Overlap_CelLFilL Sched, which computes a schedule that is used to direct the filling of overlap cells along a given dimension of a distributed array.The schedules produced by Overlap_CelLFilLSched and Regular_ SectiorLMove_Sched are employed by a routine called Da ta..Move that carries out both interprocessor communication (sends and receives) and intra processor data copying.The final form of support provided by the Ylultiblock PARTI library is to distribute loop iterations and transform global distributed array references into local references.In distributed memory compilation, the owner-computes rule is often used for distributing loop iterations [5].Owner computes means that a particular loop iteration is executed by the processor owning the left-hand side array element written into during that iteration.Two routines, LocaLLower_ Bound and LocaLUpper_Bound, are provided by the library for transforming loop bounds (returning, respectively.the local lower and upper bounds of a given dimension of the referenced distributed array) based on the owner-computes rule.
An example of using the library routines to parallelize the program from Figure 1 is shown in Figure 3.The library routines are used for determining work partitioning (loop hounds) and for determining and optimizing communication between the processors.In this example.the data distribution is known only at run-time and therefore, the DAD is filled in at runtime.Work partitioning and communication are determined at run-time using the information stored in the DAD.The function Compute_Transpose_Schedule () is shorthand for a call to the Regular_Sec-tiorLMove_ Sched routine.with the parameters set to do a transpose for a two-dimensional distributed array.The schedule generated by this routine is then used by the Da ta..Move routine for transposing the array B and storing the result in the array Temp.Functions LocaLLower_Bound and LocaLUpper _Bound are used to partition the data -parallel loop across processors, using the DAD.The sizes of the arrays A, B.. and T~mp on each proc~ssor dt>pend on th~ data distribution and are known onlv at run-time.
Therefor~.arrays A. B. and T~mp are allocated at runtime.The calls to the memory management routines are not shown in Figure 3. Tht> codt> could be optimized further by writing specialized routin~s to perform the transpose operation.but the library routin~s are also applicable to mort> gt>neral forallloops.
Tlw \1ultiblock PARTI library is currently impl~ rnented on the Intel iPSC/860 and Paragon.tht> Thinking \1achines CM-5.th~ IBM SP1 /2, and tlw PVM message-passing environment for a n~twork of workstations [ 14 J.The d~sign of the library is archi-t~cture independent and, thert>for~. it can be easily port~d to any distributed memory parallel machine or any environment that supports message passing ( t>.g .. Express).The current implementation of the library is restricted to handling only block-distributed arrays.

AMP
The existing functionality of the \lultiblock PARTI library was useful for d~veloping adaptive programs in several ways.If the numb~r of processors on \vhich a data-parallel loop is to be execut~d is not known at compil~-time.it is not possible for the compil~r to analyze the communication, and in some cases.ev~n th~ work partitioning.This holds tru~ ~ven if all other information, such as loop bounds and strid~s. is known at compile-time.Thus, run-time routin~s ar~ r~quired for analyzing communication (and work partitioning) in a program written for adaptive execution, even if the same program written for static execution on a fixed number of processors did not r~quire any runtime analnis.
Several ext~nsions were r~quired to the existing library to provide th~ required functionality for adaptive programs.When th~ set of processors on which the program ~xecutes changes at run-time.all activ~ processors must obtain information about which processors are active and how the data are distributed across th~ set of activ~ processors.To dt>al with only some of th~ processors b~ing active at any tim~ during execution of the adaptiv~ program, the implementation of AMP uses the notion of physical and logical numbering of processors.If p is the number of processors that can possibly be actiw during the execution of the program, each such processor is assigned a unique physical processor number between 0 andp -1 before starting program execution.If we let c be the number of processors that are active at a given point during execution of a program, then each of these active processors is assigned a unique logical proc~ssor number between 0 and c -1.For active processors.the map-ping b~tw~f'Il physical and logical processor numlwrs is updatt>d at rt>map points.The use of a logical processor numbering is similar in conc~pt to the scheme used for processor groups in the mt>ssag~-passing interface standard (\IPI) [15].
Information about data distributions is available at each proc~ssor in the DADs.However, DADs only stor~ the total siz~ in each dimension for each distributt>d array.Tht> t>xact part of the distributed array owrwd by an active proc~ssor can be determined using tlw logical processor numb~r.Each proc~ssor maintains information about what physical processor corrt>sponds to each logical proc~ssor number at any timf'.The mapping from logical processor number to physical proc~ssor is used for communicating data between proct>ssors.
In summary, the additional functionality implt>mented in AMP over that available in Multiblock PARTI is as follows: 1. Routines for consistently updating the logical proc~ssor numbering when it has bet>n detected that rt>distribution is required.2. Routines for redistributing data at remap points.:1.~Iodified communication analYsis and data movf' routin~s to incorporate information about the logical processor numbering.
Tht> communication required for rNlistributing data at a remap point depends on tlw logical processor numberings b~for~ and after redistribution.Therefore.
after it has been d~cid~d that remapping is required.all processors must obtain tlw new logical processor numb~ring.The detection routine.after determining that data redistribution is required, decid~s on a new logical processor numb~ring of the processors which will bt> active.The detection routine informs all tht> processors which were either active befor~ remapping or will be active aft~r remapping of the new logical numbering.It also informs th~ processors which will be activ~ after remapping about tht> existing logical numbering (processors that are active both before and after remapping will already have this information).These processors need this information for determining what portions of the distributed arrays th~y will receive from which physical proct>ssors.The communication analysis requir~d for redistributing data was impl~mented by modifying the Multiblock PARTT Regular_SectioiLMove_Sched routine.Th~ new routine takes both the new and old logical numbering as parameters.The analysis for dc-t~rmining the data to he s~nt by each processor is done using the new logical numbering (because data will be sent to processors with the n~w logical numbering) and the analysis for determining tlw data to be received is done using the old logical numbering (because data will be reeeived from processors with the old logical numbering).Modifications to the Ylultiblock PARTI communication functions were also required for incorporating information about logical processor numberings.This is because the data distribution infom1ation in a DAD only determines which logical processor owns what part of a distributed array.To actually perform communication, these functions must use the mapping between logical and physical processor numberings.

COMPILATION ISSUES
The example shown previously illustrates how A~lP can be used by application programmers to develop adaptive programs by hand.We now briefly describe the major issues in compiling programs written in an HPF-like data-parallel programming language for an adaptive environment.We also discuss some issues in expressing adaptive programs in HPF.As we stated earlier_ our work is restricted to data-parallel languages in which parallelism is specified explicitly.Incorporating adaptive parallelism in compilation systems in which parallelism is detected automatically [5] is beyond the scope of this article.
In previous work, we successfully integrated the Multihloek PARTI library with a prototype Fortran 90D/IIPF compiler developed at Syracuse University [4.10. 11].Routines provided by the library wert' inserted for analyzing work partitioning and communication at run-time, whenever compile-time analysis was inadequate.This implementation can be extended to use AD.\1 and compile HPF programs for adaptive execution.The major issues in compiling a program for adaptive execution are determining remap points, inserting appropriate actions at remap points, and ensuring reuse of the results of run-time analysis to minimize the cost of such analysis.

1 Remap Points
In our model of execution of adaptive programs.remapping is considered only at certain points in the program text.If our run-time library is to ht> used, a program cannot be remapped inside a data-parallel loop.Thr rrason is that the local loop bounds of a data-parallelloop are determined based on the current data distribution.and in general it is very difficult to ensure that all iterations of the parallel loop are executed by exactly one processor, either before or after remapping.
There are (at least) two possibilities for determining remap points.They may be specified by the programmer in the form of a directive.or they may be determined automatically by the eompiler.For the dataparallel language HPF, parallelism can only be explicitly specified through certain constructs (e.g .•fora II statement, forall construct.independent statement [3]).Inside any of these constructs.tlw only functions that can lw callt>d are those explicitly markt>d as pure functions.Thus. it is simplt> to determine.solely from the syntax.what points in the program art> not inside any data-parallel loop and therefore can be remap points.Making all such points rPmap points may, however.lead to a large number of remap points.1\aturally.this will lead to significant overhead from Pmploying thP detection mPchanism (and synchronization of all processors at each remap point).
Alternatively. a programmer may specify cPrtain points in the program to be remap points, through an explicit dirPctive.This.however.makes adaptiw execution less transparent to thP programmer.
Once remap points are known to the compiler.it can insert calls to the detection mechanism at those points.The compiler also needs to insert a conditional based on the result of the detection mechanism, so that if the detection mechanism determines that remapping needs to be done, then calls are made both for building new DADs and for redistributing the data as specified by the new DADs.The resulting code looks very similar to the code shown in the example from Section 3, except that the compiler will not explicitly regenerate schedules after a remap.The compiler generates schedules anywht>re they will be needed.and relies on the run-time library to cache schedules that may be reused, as described in the next section.

Schedule Reuse in the Presence of Remapping
As we discussed in Section :3, a very important consideration in using run-time analysis is the ability to reuse the results of run-time analysis whenever possible.This is relatively straightforward if a program is parallelized by inserting the run-time routines by hand.When the run-time routines are automaticallv inserted by a compiler, an approach based on additional runtime bookkeeping can be used.In this approach.all schedules generated are stored in hash tables by the run-time library, along with their input parameters.Whenever a call is made to generate a schedule.the input parameters specified for this call are matched against those for all existing schedules.If a match is found, the stored schedule is returned by the library.This approach was successfully used in the prototype HPF IF ortran 90D compiler that used the Multihlock PARTI run-time library.Our previous experiments have shown that saving schedules in hash tables and searching for existing schedules result in less than 10% overhead, as compared to a hand implementation that reuses schedules optimally [ 1 0 J.This approach easily extends to programs which include remapping.One of the parameters to the schedule call is the DAD.After remapping.a call for building a new DAD for each distributed array is inserted by the compiler.For the first execution of any parallel loop after remapping, no schedule having the new DADs as parameters will be available in the hash table.)/ew schedules for communication will therefore be generated.The hash tables for storing schedules can also be cleared after remapping to reduce the amount of memory used by the library.

Relationship to HPF
In HPF. the Processor directive can be used to declare a processor arrangement.An intrinsic function, Number_oLProcessors, is also available for determining the number of physical processors available at run-time.HPF allows the use of the intrinsic function Number_oLProcessors in the specification of a processor arrangement.Therefore, it is possible to write HPF programs in which the number of physical processors available is not known until runtime.The Processor directive can appear only in the specification part of a scoping unit (i.e., a subroutine or main program).There is no mechanism available for changing the number of processors at run-time.
Most of the existing work on compiling data-parallel languages for distributed memory machines assumes a model in which the number of processors is statically known at compile-time [ 4-6].Therefore, several components of our run-time library are also useful for compiling HPF programs in which a processor arrangement has been specified using the intrinsic function Number_oLProcessors. HPF also allows Redistribute and Realign directives, which can he used to change the distribution of arrays at runtime.Our redistribution routines would be useful for implementing these directives in an HPF compiler.

EXPERIMENTAL RESULTS
To study the performance of the run-time routines and to determine the feasibility of using an adaptive environment for data-parallel programming, we have experimented with a multihlock ~avier-Stokes solver template [ 16 J and a multi grid template [ 17].The multiblock template was extracted from a computational fluid dynamics application that solves the thinlayer Navier-Stokes equations over a three-dimensional (3D) surface (multihlock TLl\S3D).The sequential Fortran 77 code was developed by V atsa et al. [16] at l\ASA Langley Research Center, and consists of nearly 18,000 lines of code.The multihlock template, which was designed to indude portions of the entire code that are representative of the major computation and communication patterns of thP original code.consists of nearly 2000 lines of F77 code.
The multigrid code we expt>rimented with was developed by Overman and Van Rosendale [ 17] at J';ASA Langley.In earlier work.we hand parallelized these codes using Muhibloek PARTI and also parallelized Fortran 90D versions of these codes using the prototype HPF/Fortran 90D compiler.lu Loth these codes, the major computation is performed inside a ( sequential) time-step loop.For each of the parallel loops in the major computational part of the code, the loop hounds ami communication patterns do not change across iterations of the time-step loop when the code is run in a static environment.Thus, communication schedules can lw gNteratefl before the first iteration of the time-step loop and can be used for all time steps in a static environment.
We modified the hand-parallelized versions of these codes to usc the A,\IIP routines.For both these codes.we chose the beginning of an iteration of the timestep loop as the remapping point.If remapping is done, the data distribution changes and the schedules used for previous time steps can no longer be used.For our experiments, we used two parallel programming environments.The first was a network of workstations, connected through an Ethernet, and using PVM for message passing.We had up to 12 workstations available for our experiments.The second environment was a H)-processor IBM SP-2.
In demonstrating the feasibility of using an adaptive environment for parallel program execution, we considered the following factors: 1.The time required for remapping and eomputing a new set of schedules,.as compared to the time required for each iteration of the time-step loop.2. The numlwr of tim<' steps that the code must execute aftpr remapping to a greater number of processors to effectively amortize the cost of remapping.3. The effect of skeleton processes on the performance of their host processors.
On the network of Sun workstations, we considered executing the program on 12, 8. 4, or l workstations at any time.Remapping was possible from any of these configurations to any other configuration.We measured the time required for one iteration of the time-step loop and the cost of remapping from one configuration to another.The experiments were conducted at a time when none of the workstations had any other jobs executing.
Table 1 presents the time required per iteration for each configuration and the time required for remapping from one configuration to another for the ,\llultihlock code.The code was executed on a single mesh of size 49 X 9 X 9.In Table  Note that on a network of workstations connected by an Ethernet, it takes much longer to remap from a larger number of processors to a smaller number of processors than from a smaller number of processors to a larger number of processors.For example, the time required for remapping from 8 processors to 1 processor is significantly higher than the time required for remapping from 1 processor to 8 processors.This is because if several processors try to send messages simultaneously on an Ethernet, contention occurs and none of the messages may actually be sent.leading to significant delays overall.Instead, if a single processor is sending messages to many other processors, no such contention occurs.
We performed the same experiment on a 16-processor IBM SP-2.The results are shown in Table 2.The program could execute on either 16, 8, 4, 2, or 1 processor and we considered remapping from any of these configurations to any other configuration.The templates obtain significantly better speed-up and the time required for remapping is much smaller.The superlinear speed-up noticed in going from one to two processors is because on one processor, all data cannot fit into the main memory of the machine.
Table 3 shows the results from the execution of the multigrid template on the IBM SP-2.The code was run on an 8 X 8 X 8 mesh.Again, the remapping time for this code is reasonably small.Another interesting tradeoff occurs when additional processors become available for running the program.Running the program on a greater number of processors can reduce the time required for completing the execution of the program, but at the same time remapping the program onto a new set of processors causes additional overhead for moving data.A useful factor to determine is the number of iterations of the timestep loop that must still be executed so that it will be profitable to remap from fewer to a greater number of processors.Using the timings from Table 1, we show the results in Table 4. Table 4 shows that if the program will continue to run for several more time steps, remapping from almost any configuration to any other larger configuration is likely to be profitable.Because the remapping times are even smaller on the SP-2, the number of iterations required for amortizing the cost of remapping will be even smaller.
In our model of adaptive parallel programming, a program is never completely removed from any processor.A skeleton process steals some cycles on the host processor, which can potentially slow down other processes that want to use the processor (e.g., a workstation user who has just logged in).The skeleton processes do not perform any communication and do not synchronize, except at the remap points.In our examples, the remap point is the beginning of an iteration of the time-step loop.We measured the time required per iteration on the skeleton processors.Our experiments show that the execution time on skeleton processors is always less than 10% of the execution time on active processors.For the multiblock code.the time required per iteration for the skeleton processors was 4.7 and :10 ms on the IBM SP-2 and Sun-4 work-stations.respectively.The multigrid code took 11 ms per iteration on the IBM SP-2.We expecL therefore, that a skeleton process will not slow down any other job run on that processor significantly (assuming that the skeleton process gets swapped out by the operating system when it reaches a remap joint).

RELATED WORK
In this section.we compare our approach to other efforts on similar problems.Condor [18] is a system that supports transparent migration of a process (through check pointing) from one workstation to another.It also performs detection to determine if the user of the workstation on which a process is being executed has returned, and also looks out for other idle workstations.However, this system does not support parallel programs; it considers only programs that will be executed on a single processor.
Several researchers have addressed the problem of using an adaptive environment for executing parallel programs.However, most of these consider a task parallel model or a master-slave model.In a version of PVM called ~igratable PV~ (~PVM) [ 19], a process or a task running on a machine can be migrated to other machines or processors.However.MPVM does not provide any mechanism for redistribution of data across the remaining processors when a data-parallel program has to be withdrawn from one of the processors.
Another system called user level processes (CLP) [ 17] has also been developed.This system provides light-weight user-level tasks.Each of these tasks can be migrated from one machine to another.Again.there is no way of achieving load balance when a parallel program needs to be executed on a smaller number of processors.Piranha [20] is a system developed on top of Linda [21 J. ln this system.the application programmer has to write functions for adapting to a change in the number of available processors.Programs written in this system use a master-slave model and the master coordinates relocation of slaves.There is no clear way of writing data-parallel applications for adaptive execution in all these systems.
Data-parallel C and its compilation system [22] have been designed for load balancing on a network of heterogeneous machines.The system requires continuous monitoring of the progress of the programs executing on each machine.Experimental results have shown that this involves a significant overhead, even when no load balancing is required [22].

CONCLUSIONS AND FUTURE WORK
l11 this article we have addressed the problem of developing applications for execution in an adaptive parallel programming environment.meaning an environment in which the number of processors available varies at run-time.We have defined a simple model for programming and program execution in such an environment.In the SPMD model supported by IIPF. the same program text is run on all the processors.Remapping a program to include or exclude processors only involves remapping the (parallel) data used in the program.The only operating system support required in our model is for detecting the availability (or lack of availability) of processors.This makes it easier to port applications developed using this model onto many parallel programming systems.
We have presented the features of AYIP, which provides run-time support that can be used for developing adaptive parallel programs.We described how the run-time library can be used by a compiler to compile programs written in HPF -like data-parallellanguages for adaptive execution.We have present,ed experimental results on a hand-parallelized 1\"avier-Stokes solver template and a multigrid template run on a network of workstations and an IBYI SP-2.Our experimental results show that adaptive execution of a parallel program can be provided at relatively low cost, if the number of available processors does not vary frequently.
We plan to experiment with several other scientific codes.We would also like to integrate our run -time library with a compiler for an HPF -like language.which would allow HPF -like codes to be parallelized to take advantage of an adaptive environment.

FIGURE 1
FIGURE 1Example of a data-parallel program.

FIGURE 2
FIGURE 2 An adaptive programming environment.
FIGURE 4 Adaptive SPMD program using AY"IP.

Figure 4
Figure 4 shows the example from Figure :3 parallelized using AMP.The only difference from the nonadaptive parallel program is the addition of the detection and remap calls at the beginning of the time-step loop.The initial computation of the loop bounds and communication schedule are the same as in Figure 3.The remap point is the beginning of the time-step loop.If remapping is to be performed at this point, the function Remap is invoked.Remap determines the new logical processor numbering.after it is known what processors are available, and creates a new Data Access Descriptor (DAD).The Redistribute_Data routine redistributes arrays A and B. using both the old and new DADs.After

Table 1 .
Cost of Remapping (in ms): Multihlock Code on Network of Workstations

Table 3 .
Cost of Remapping (in ms): Multigrid Code on IBM SP-2

Table 4 .
Number of Time Steps for Amortizing Cost of Remapping: Multiblock Code on Network of Sun Workstations