Data-Parallel Programming in a Multithreaded Environment*

Research on programming distributed memory multiprocessors has resulted in a well-understood programming model, namely data-parallel programming. However, data-parallel programming in a multithreaded environment is far less understood. For example, if multiple threads within the same process belong to different data-parallel computations, then the architecture, compiler, or run-time system must ensure that relative indexing and collective operations are handled properly and efficiently. We introduce a run-time-based solution for data-parallel programming in a distributed memory environment that handles the problems of relative indexing and collective communications among thread groups. As a result, the data-parallel programming model can now be executed in a multithreaded environment, such as a system using threads to support both task and data parallelism. © 1997 John Wiley & Sons, Inc.


INTRODUCTION
Data-parallel programming has emerged as the premiere programming model for distributed memory multiprocessors (DMMPs). This is primarily due to the fact that the parallelism stems from simultaneously performing the same (or similar) operations on different portions of a data set. This implies that the amount of parallelism that can be exploited increases with the size of the data sets. The result is that it is fairly easy to find and exploit data parallelism in most numerically based applications.
Central to the data-parallel programming model is the idea that, on each processor, some agent (usually a process) is responsible for executing parallel operations on the section of data elements that reside in local memory. To implement this model, each agent is given a relative index from 0 to n - 1, assuming there are n agents. The relative index is used to identify each of the participating agents, and to designate which portions of code are to be executed by which agents. For example, consider the following data-parallel code for computing the sum of the elements of a vector that is evenly distributed over n processes:

      global real sum
      real lsum, vector(N)
      integer i
      if (my_Rank .eq. 0) sum = 0.0
      lsum = 0.0
      do i = lower_bound(my_Rank), upper_bound(my_Rank)
        lsum = lsum + vector(i)
      end do
      sum = sum + lsum
      if (my_Rank .eq. 0) print *, sum

In this example, the relative index is used to specify which process will initialize and display the global sum. The statement sum = sum + lsum corresponds to a global reduction, in which each process must communicate its local sum value, lsum, to a designated lead process, who then combines the intermediate results. This operation demonstrates the second major requirement for implementing data-parallel programs: collective operations. Therefore, an implementation of the data-parallel programming model must provide for both relative indexing and collective operations.
Lightweight, user-level threads are becoming increasingly useful for supporting parallelism and asynchronous events in applications and language implementations. In particular, many recent languages for parallel and distributed computing employ lightweight threads to represent functional parallelism, to overlap computations with communications, or to simplify resource management [1-4]. In response to this increasing demand for parallel language support, several projects have emerged with the goal of providing standard lightweight thread support [5-7], and a committee has been formed to establish standard interfaces for such a run-time system [8].
In this article, we describe a run-time solution to the problem of supporting data-parallel execution using lightweight threads as the data-parallel agents. Thus, we describe the ability to provide relative indexing and collective operations among a group of threads, called a rope. Again, consider a simple data-parallel algorithm for computing the sum over a distributed array. To execute this example as a set of distributed threads in the midst of other thread activity, and without involving the other threads, a scoping mechanism is needed for identifying the "member" threads that will contribute to the global reduction. Also, because the thread agents can (and likely do) have different thread identifiers on each process, a translation scheme is needed to map the local thread identifier for each thread agent to a relative index for the entire data-parallel computation. The concept of ropes as described in this article provides support for these mechanisms.
Relative indexing allows the programmer to specify spatial relationships among the parallel execution units, which express the natural "neighboring" relationships in data-parallel algorithms. Also, with proper support for mapping threads to processes, and processes to processors, relative indexing can be used to optimize performance by ensuring that an algorithm is correctly mapped onto the underlying communication topology.
Collective operations are typically supported at the process level by the underlying communication system [9] or by standard communication interfaces [10, 11].
For example, MPI [10] provides a mechanism for process scoping called process groups. However, support for grouping threads within processes is not currently provided by either MPI or by the new thread-based run-time systems, yet such support is clearly needed if threads are to perform collective operations on a subset of the threads in the system. We describe our design for data-parallel support among lightweight threads in the context of Chant [7], a run-time system which supports both intra- and interprocessor communication between lightweight threads in a distributed system. However, the design issues we present are applicable to any thread-based run-time system that supports some form of communication between threads. Our contribution is to provide a detailed examination of the issues that arise in supporting relative indexing and collective operations among lightweight threads. Additionally, we provide an implementation of these concepts atop Chant and report some performance results.
The remainder of the article is organized as follows: Section 2 provides background on Chant, a system supporting communication between threads, and Section 3 outlines the design of a run-time approach to support data-parallel programming in a multithreaded environment. Section 4 addresses the issues of interfacing with ropes, particularly from the perspective of a data-parallel compiler. Section 5 presents performance results evaluating our initial implementation. Section 6 outlines related research projects, and we conclude in Section 7.

CHANT
The Portable Operating System Interface for Computer Environments (POSIX) committee has recently established a standard for the interface and functionality of lightweight threads within an operating system process, called pthreads [12]. Because threads are defined within the context of a process, they share a single address space, and communication among threads is only defined in terms of shared memory primitives, such as events and locks. Thus, the interaction of pthreads in a distributed environment is undefined. Likewise, the Message-Passing Interface Forum (MPI) has recently established a standard for communication between processes [10]. Although various extensions to the standard have already been proposed [13, 14], communication between lightweight threads within processes has yet to be supported by MPI. Therefore, Chant was designed to provide a simple mechanism for combining lightweight threads with interprocessor communication.
Chant [7] is designed as a layered system (as shown in Figure 1), where efficient point-to-point communication provides the basis for implementing remote service requests and, in turn, remote thread operations. Each layer is accessible to the user so that the proper amount of support and performance can be obtained. Chant relies on a system interface to achieve a high degree of portability, where the underlying thread and communication systems are pthreads and MPI, respectively.
Chant supports point-to-point communication (i.e., send/recv) between any two threads in the system by utilizing the underlying message-passing system (MPI) and providing solutions to the problems of naming global threads in the system, avoiding intermediate copies for message buffers, and efficient thread-level polling for outstanding messages. It uses the concept of a context to represent an addressing space within a processor. Chant assumes that a linear ordering of contexts in the system is maintained by the underlying communication system. For example, MPI uses rank within MPI_COMM_WORLD to linearly order all processes (addressing spaces) within a system. Therefore, threads within Chant are globally identified using the doublet <context_id, thread_id>.

Atop efficient point-to-point message passing, Chant supports remote service requests by instantiating, in each context, a service thread which is responsible for handling all incoming remote service requests (asynchronous messages) and delivering any necessary replies. Using the remote service request mechanism, Chant can support remote thread operations, such as remote thread create, by invoking the specified request on the desired process and, possibly, by adding some software "glue" to make it work. For example, implementing a remote join operation, in which the calling thread will block until the specified thread in a different context has finished its execution, is not as simple as invoking a local join in the remote context. Instead, the exit handler of the desired thread must be modified to send a message to the calling context upon exit. The effect of the message will be to awaken the calling thread from its suspended state.
Finally, Chant provides a user interface that is an extension of the pthreads standard, where access to each of the underlying layers can be made directly or indirectly. Thus, it is still possible to access the underlying MPI or pthreads interfaces from within a Chant thread.

DESIGN
In this section we outline a run-time-level design that supports data-parallel programming among threads.
After reviewing concepts (Section 3.1) and the requirements for such support (Section 3.2), we discuss the issues of rope servers (Section 3.3), rope creation (Section 3.4), relative indexing (Section 3.5), and collective operations (Section 3.6).

Concepts
First, we give definitions for common terms so as to avoid any confusion over their meaning:

1. A processor is a central processing unit capable of executing instructions. Examples include the MIPS R4400, the Intel Pentium, and the Sun SPARC.
2. A process is a Unix process, complete with its own address space, register set values, and execution stack. It provides a single, sequential execution stream, and time-shares the processor with other processes. The execution order of processes is controlled by the operating system.
3. A context is an address space, and is typically mapped to a Unix process. We use this term to capture the concept of a single addressing space without the connotation of a single execution unit, as is the case with a Unix process.
4. A thread is a sequential execution stream that executes within the address space of a process. Multiple threads can execute within the same address space, or context. Lightweight, or user-level, threads are controlled explicitly by the user rather than the operating system. Mediumweight, or kernel-level, threads are controlled by the operating system. For the purposes of this article, we are only concerned with lightweight threads.

FIGURE 2 An example of three contexts containing seven threads.
We define a rope to be a collection of threads, capable of spanning context boundaries, that defines a scope for collective operations and supports relative indexing (renaming). Figure 3 depicts the same set of contexts and threads as are presented in Figure 2, except that threads <C1, T2>, <C2, T1>, <C2, T3>, and <C3, T1> have been organized into a rope. A rope translation table, which translates between the relative index of a rope member and its global thread identifier, is also given.
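The rope translation table of Figure 3 can be modeled in a few lines. The following Python sketch is purely illustrative (Chant itself is a C run-time system); the assignment of ranks 0 through 3 to the four member threads is an assumption, since Figure 3 is not reproduced here:

```python
# Illustrative model of a rope translation table, not Chant's actual code.
# Each relative index (rank) maps to a global thread identifier
# <context_id, thread_id>, as described for Figure 3.
rope_table = {
    0: ("C1", "T2"),
    1: ("C2", "T1"),
    2: ("C2", "T3"),
    3: ("C3", "T1"),
}

def translate(rank):
    """Translate a relative index within the rope to a global thread id."""
    return rope_table[rank]

def rank_of(context_id, thread_id):
    """Reverse mapping: global thread id back to relative index."""
    for rank, tid in rope_table.items():
        if tid == (context_id, thread_id):
            return rank
    raise KeyError((context_id, thread_id))
```

Because the table is one-to-one, both directions of the mapping are well defined, which is exactly the renaming property the rope must preserve.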

Requirements
A system for implementing collections among distributed threads must satisfy the following requirements:

1. The collections are entities whose members can span contexts, and thus their identifiers must be unique within the system.
2. Each collection must keep track of its constituent contexts and threads, and operations to add and delete from this list must be performed atomically.
3. Thread ranks within a collection must be unique so that there exists a one-to-one mapping between the thread identifier with respect to the context (global thread id) and the thread identifier with respect to the rope (relative index). That is, for each rope thread index t_r in the set of rope identifiers R, there exists a corresponding context thread identifier t_c in the set of context identifiers C, and a mapping function MAP that provides the translation:

      MAP: R -> C,  MAP(t_r) = t_c

We now describe the design of such a thread collection that satisfies these requirements. Although we describe our design as an additional Chant layer, the same principles can be applied to any system of lightweight threads supporting communication between threads in separate address spaces.

Rope Servers
The requirement that each rope possess a unique identifier spanning all contexts is typically satisfied by having a centralized name server responsible for allotting rope identifiers and for performing atomic updates to the internal data structures. Although distributed algorithms for name servers [15] and atomic operations [16] are well known, their added overhead and implementation complexity are often unwarranted in an initial design. However, a completely centralized solution for naming and updating ropes will certainly cause hot-spots. Therefore, our initial design for a name server utilizes a two-level approach, as depicted in Figure 4, consisting of a single global name server and a number of local name servers. This design is derived from the idea of two-level page management schemes used in distributed shared memory systems [17].

FIGURE 4 The rope name-server mechanism; rope lists are expanded in Figure 5.
The operation of the rope name server mechanism proceeds in two levels as follows:

1. The global name server simply provides the unique identifier when a rope is created, and keeps track of which context will host the local name server responsible for the rope configuration and proper execution of rope operations. This server is needed so that any thread can find out which context is responsible for serving a particular rope. For this reason, the global server is always on a fixed, known context.
2. The local name servers are responsible for translating a relative index based on a rope identifier (<rope_id, rank>) into the global address of a thread (<context_id, thread_id>). The local name server is also responsible for processing queries regarding the state of a given rope, such as the number of total or local threads that are in the rope. All of this information is kept in a data structure as depicted in Figure 5.
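The division of labor between the two levels can be sketched as follows. This Python sketch is illustrative only: class and field names are our own, and message exchanges between contexts are replaced by direct method calls:

```python
# Sketch of the two-level rope name-server scheme (illustrative; the real
# mechanism exchanges messages between contexts rather than calling methods).

class GlobalNameServer:
    """Hands out unique rope identifiers and records which context hosts
    the local server for each rope."""
    def __init__(self):
        self.next_rope_id = 0
        self.server_of = {}  # rope_id -> context hosting the rope's server

    def create_rope(self, creating_context):
        rid = self.next_rope_id
        self.next_rope_id += 1
        # The creating thread's context is designated as the rope server.
        self.server_of[rid] = creating_context
        return rid

class LocalNameServer:
    """Holds the configuration of the ropes it serves, including the
    translation table from rank to global thread address."""
    def __init__(self):
        self.tables = {}  # rope_id -> {rank: (context_id, thread_id)}

    def register(self, rope_id, rank, context_id, thread_id):
        self.tables.setdefault(rope_id, {})[rank] = (context_id, thread_id)

    def translate(self, rope_id, rank):
        return self.tables[rope_id][rank]
```

A lookup for an unknown rope would first consult the global server (to learn which context serves the rope) and then query that context's local server, mirroring the two-level page-management schemes cited above.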

Rope Creation
Because a rope is a group of threads that defines a scope for collective operations, creating a rope is tantamount to specifying this set of member threads. In some instances, it may be useful to create a set of new threads which will define a rope. For example, a host program may create a set of new threads as node programs for a data-parallel computation, and the new threads are to employ collective operations and should thus comprise a rope. This would correspond to a task-parallel model of computation, in which a single controlling task, the leader, is responsible for creating worker tasks, each with a certain amount of work to do. In other instances, it may be useful to add existing threads into an extant rope. This would correspond to a situation in which extant threads enter a data-parallel phase of the computation and then add themselves into a rope. For example, a threaded system may start with a single thread on each processor, and each of these threads may add itself into a rope representing the global set of threads. If all newly created threads were also added to this rope, then this would be the thread-level equivalent of MPI's MPI_COMM_WORLD. Therefore, the rope creation mechanism must be capable of both creating new threads to comprise a rope, and adding existing threads to a rope. This is accomplished by separating the tasks of creating a rope and specifying membership in a rope. Creating a rope is done using the rope_create* call, resulting in a message being sent from the source thread to the global name server, which returns the next available rope identifier. To avoid further messages and a more complicated protocol, the context of the calling thread is designated as the rope server for the new rope. Thus, distribution of the rope servers is accomplished by having different threads invoke the rope_create routine, which is under direct control of the user. The global name server keeps track of which context is the server for each active rope so that any thread in the system can always find out who the server is for a particular rope (via the global name server).
A newly created rope is initially empty, and the user can use the following two mechanisms for specifying membership:

1. rope_addnew, which creates a specified number of threads on a set of contexts and adds them to a rope.
2. rope_addself, which adds the calling thread to a rope.

*The interface for all the rope calls, as currently implemented in Chant, is given in Figure 6.
In the case of rope_addnew, the calling thread sends a message to the server for the specified rope, indicating how many threads are to be created and on which contexts. The rope server assigns the ranks and sends messages to the specified contexts informing them to create the required number of local threads, and to update their thread lists with the ranks of the new threads. More details on how the addnew operation works are given in Section 3.5, after discussing how the translation tables work.
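The rank assignment performed by the rope server can be sketched as below. This is an illustrative Python fragment, not Chant code; the function name and the request format (a mapping from context to thread count) are our own assumptions:

```python
# Illustrative sketch of rank assignment in rope_addnew: the rope server
# hands each requested context the next contiguous block of ranks.

def assign_ranks(next_rank, request):
    """request maps context -> number of new threads to create there.
    Returns (context -> list of assigned ranks, updated next_rank)."""
    assignment = {}
    for context, count in request.items():
        assignment[context] = list(range(next_rank, next_rank + count))
        next_rank += count
    return assignment, next_rank
```

Keeping rank assignment on the server is what guarantees that ranks within a rope are unique and contiguous, satisfying requirement 3 of Section 3.2.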

Relative Indexing
Spatial relationships play an important role in data-parallel algorithms. Thus, most communication systems, such as MPI, NX, etc., provide a linear ordering of the participating processes, which allows relative indexing of the processes independent of their actual system address. Ropes provide a similar relative ordering for a set of threads that is independent of their actual global address. Thus, we say that each thread within a rope is assigned a unique rank, starting from zero and linearly increasing. This makes it possible to send a message from thread i to thread i + 1 within a rope, without regard to the physical location of those threads. Spatial ordering can also be used to gain performance by exploiting the underlying connectivity of the architecture. However, for this to happen, the user must be able to specify a mapping of threads to processes (allowed in Chant) and processes to processors (currently not allowed in MPI).
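The thread-i-to-thread-i+1 pattern above depends only on rope ranks. As a minimal illustration (Python, names our own), the destination rank for nearest-neighbor communication can be computed without any knowledge of where the threads physically reside:

```python
# Neighbor ranks within a rope of rope_size threads, ranks 0..rope_size-1.
# No wraparound at the boundaries, as in a simple shift computation.

def right_neighbor(rank, rope_size):
    """Rank of thread i + 1, or None at the upper boundary."""
    return rank + 1 if rank + 1 < rope_size else None

def left_neighbor(rank, rope_size):
    """Rank of thread i - 1, or None at the lower boundary."""
    return rank - 1 if rank > 0 else None
```

A rope_send addressed to right_neighbor(i, n) reaches the correct thread whether the neighbor lives in the same context or on another processor; the translation table described next resolves the physical address.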
To support relative indexing, the system must provide a one-to-one mapping between the rank within a rope (<rope_id, rank>) and the global address of a thread (<context_id, thread_id>). This is accomplished via a rope translation table, which stores the associated global identifier for each relative index within a rope (refer to Figure 3). If the translation table were kept in a centralized location, then remote references would be necessary for translating all relative indices, which would be prohibitively expensive. Therefore, we replicate this information and keep a copy of the table on each participating context for the rope. Figure 5 depicts the data structure for the local rope list, including the rope translation table.
Again, borrowing from earlier work in the area of page coherence for distributed shared memory systems [18], we adopt two options for keeping the distributed translation tables consistent: new information is broadcast so that all tables are kept up-to-date at all times (strong consistency), or tables are allowed to remain out-of-date until a reference for a thread is generated, causing the information to be retrieved and stored (cached) in the local table (weak consistency).
If each thread in a rope communicates with only a small number of other threads in the rope, then the weak consistency model should result in better performance, because the creation cost is so much less. If, on the other hand, each thread in a rope will communicate with many other threads in the rope, the strong consistency model should result in better performance. Determining the crossover point for a given application is an open question depending on the overheads of the two approaches (see Section 5). Therefore, the system supports both strong and weak consistency on a per-rope basis by providing an argument to the rope_create routine to specify the consistency requirement. When a rope_addnew is performed, the individual contexts send the thread identifiers for the new threads back to the rope server so that the rope server can update the master copy of the translation table.
If the rope is using the strong consistency model, then an image of the new rope translation table is propagated to all member contexts.
To translate a relative index (<rope_id, rank>) into a Chant global thread identifier (<context_id, thread_id>), the following steps are taken:

1. If the local rope table does not have an entry for the given rope identifier (i.e., the calling thread is in a context which is not a member of the specified rope), a message is sent to the server for the rope (via the global name server) requesting that the global thread id of the relative index within this rope be returned.

2. If the local rope table does have an entry for the given rope identifier, the translation table for the rope is accessed to determine if the specified rank has a valid entry. If not, then either it is an error (strong consistency) or the entry is requested from the rope server (weak consistency), returned, and cached for future requests.
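The weak-consistency path of the translation steps above amounts to a demand-filled cache. The following Python sketch is illustrative only (a dictionary stands in for the remote rope server, and the fetch counter is ours, added to make the caching behavior observable):

```python
# Sketch of weak-consistency translation: a miss in the local table
# triggers a (simulated) request to the rope server, and the reply is
# cached so later lookups of the same rank are purely local.

class LocalRopeTable:
    def __init__(self, rope_server_table):
        self.cache = {}                  # rank -> (context_id, thread_id)
        self.server = rope_server_table  # stands in for the remote server
        self.remote_fetches = 0

    def translate(self, rank):
        if rank not in self.cache:       # entry not valid locally
            self.remote_fetches += 1     # one round-trip to the rope server
            self.cache[rank] = self.server[rank]
        return self.cache[rank]
```

Under strong consistency the cache would instead be filled eagerly whenever the server's master table changes, and a miss would be reported as an error.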

Collective Operations
MPI provides the group facility for specifying which processes will participate in a collective operation, and ropes extend this idea to the thread level. To do this, each context participating in a rope must know the other contexts in the rope as well as the list of local threads in the rope, and so this information is maintained for each rope in a rope table (refer to Figure 5).
To take advantage of system-specific optimizations for collective operations among processors, all collective operations among threads are performed in two steps: at the thread level and at the context level. For example, consider the rope_barrier operation, which performs a barrier synchronization among all threads in a rope. The barrier is performed first at the thread level within a context, and then at the context level, as described by the following algorithm:

1. Each context will maintain an accumulator thread per rope which is responsible for accumulating the number of local threads that have participated in a global operation.
2. Each thread, upon executing the barrier command, will send a message to the accumulator thread for the local context, which will accumulate the count for the number of messages received. After sending the message, the calling thread is blocked on an appropriate event.
3. When the count reaches the number of local threads in the rope on this context (this information is stored in the rope table), a message is sent to the rope server for this rope. The accumulator thread then waits for a reply from the rope server.
4. When the rope server has collected a message from each context in the rope, a message is returned to the accumulator threads on the participating contexts, informing them that the barrier is complete.
5. The accumulator threads then trigger the events for the local waiting threads, thus completing the barrier.
Ideally, we would like to utilize the context-level primitives from MPI, such as MPI_BARRIER, for replacing steps 2 and 3 in our algorithm. However, the MPI_BARRIER call invoked by the local accumulator threads would block the entire process, including any other threads in that context not related to the rope, until all participating contexts had invoked the MPI_BARRIER call. This would remove one of the key features of a multithreaded system: the ability to overlap useful computation (in the form of ready, waiting threads) with long-latency, blocking operations. As a result, our design does not use the MPI_BARRIER call, but rather a simple message-combining scheme that allows other ready threads to execute while the barrier operation proceeds. Whenever possible, we utilize the MPI collective operations, and should the MPI committee see fit to extend the standard with a nonblocking barrier operation, we would certainly incorporate it into the design as mentioned.
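The two-level barrier can be simulated compactly. The Python sketch below is illustrative only: shared memory, a lock, and an event stand in for Chant's inter-context messages and per-thread events, but the accumulator-per-context counting pattern follows the five steps above:

```python
import threading

# Shared-memory simulation of the two-level rope barrier (illustrative;
# real Chant exchanges messages between contexts instead).

class TwoLevelBarrier:
    def __init__(self, threads_per_context):
        self.threads_per_context = dict(threads_per_context)
        self.counts = {c: 0 for c in threads_per_context}  # accumulators
        self.contexts_done = 0                             # rope server state
        self.lock = threading.Lock()
        self.complete = threading.Event()                  # server's reply

    def wait(self, context):
        with self.lock:
            # Step 2: the calling thread reports to its local accumulator.
            self.counts[context] += 1
            if self.counts[context] == self.threads_per_context[context]:
                # Step 3: all local threads arrived; notify the rope server.
                self.contexts_done += 1
                if self.contexts_done == len(self.threads_per_context):
                    # Steps 4-5: server releases every waiting thread.
                    self.complete.set()
        self.complete.wait()  # the calling thread blocks until released
```

Because only the accumulator bookkeeping holds the lock, other ready threads remain free to run while the barrier is in progress, which is precisely the property that a blocking MPI_BARRIER would destroy.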
Other collective communication operations, such as reduction functions, are implemented in a similar two-level fashion.

INTERFACING WITH ROPES
In this section we address the issues of interfacing with ropes from the perspective of a data-parallel compiler, such as a compiler for High Performance Fortran (HPF) [19], targeting a multithreaded system, such as Chant. For reference, Figure 6 gives the interface for the rope calls as currently implemented in Chant.
The primary goal of a run-time-based implementation of ropes is to allow an off-the-shelf HPF compiler to generate data-parallel code that can be executed in a multithreaded environment. For scientific computing, this allows the same HPF program to execute in a variety of environments, including one that supports task parallelism.

FIGURE 6 The Chant rope interface. The operations are: add the calling thread to the specified rope; send a message to the thread specified by the relative index <rid, rank>; participate in a barrier for the specified rope; participate in a broadcast for the specified rope, originating from the specified thread within the rope; initiate exit, with the rope terminating when all member threads have invoked this function; wait for the specified rope to exit; return the rope identifier for the calling thread; return the rank of the calling thread in the specified rope; and return the maximum rank (number of threads) for the specified rope.

To examine the issues involved, let us consider the HPF fragment shown in Figure 7, in which an array, A, is distributed by block across a set of processors. The number of processors is to be determined externally at run-time. The forall statement shifts the array one element to the left. The scalar variable, X, is assumed to be replicated, and thus the value of A(1) has to be broadcast to all other processors.
Most HPF compilers would convert the above code into SPMD code to be executed by a context on each processor. A typical version of the code, using Intel's NX communication calls, is shown in Figure 8. In this code, each context determines the total number of contexts participating in the execution and allocates enough memory for the local portion of the array A, along with an overlap area (in this case 1) to accommodate the required nonlocal data. The forall statement is translated into a strip-mined do loop such that each context loops over its local portion of the array. Before the loop, each context (except context 0) communicates its left boundary element to its neighbor on the left and then receives the value sent by its neighbor and places it in the overlap area. The assignment to the replicated scalar variable X, on the other hand, turns into a broadcast from the context owning element A(1) to all other contexts.
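The block distribution and strip-mined shift described above can be sketched in a few lines. The Python fragment below is an illustrative model (not Figure 8's Fortran/NX code): indices are 0-based, N is assumed divisible by P, and the boundary exchange is simulated by reading the neighbor's first element directly:

```python
# Illustrative model of the compiler's block distribution and the
# strip-mined left shift with a one-element overlap area.

def block_bounds(rank, nprocs, n):
    """Global index range [lo, hi) of the block owned by this rank,
    assuming n is divisible by nprocs for simplicity."""
    size = n // nprocs
    return rank * size, (rank + 1) * size

def shift_left(a, nprocs):
    """Simulate A(i) = A(i+1) computed blockwise: each block's overlap
    area holds the left boundary element of its right neighbor."""
    n = len(a)
    result = []
    for rank in range(nprocs):
        lo, hi = block_bounds(rank, nprocs, n)
        local = a[lo:hi]
        # Overlap: right neighbor's boundary value (last block keeps its own).
        overlap = a[hi] if hi < n else local[-1]
        result.extend(local[1:] + [overlap])
    return result
```

Note that the rightmost block has no right neighbor, so its final element is left unchanged, matching the usual treatment of the upper boundary in a shift.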
Targeting a thread-based run-time system would require that the compiler generate code for threads rather than processes. If support for ropes is not provided, it becomes the compiler's responsibility to generate code which keeps track of all the threads participating in the data-parallel computation, including the translation table required for relative indexing. Thus, before a communication, the thread would have to determine the global thread id of the thread it is communicating with. Similarly, for collective communications, such as the broadcast in the above case, the compiler would have to generate code to multicast to the set of threads participating in the execution.
The design of ropes, as described here, provides run-time support for the translation table and the multicasting in the context of a multithreaded environment. The code produced by a compiler targeting such an interface would have two parts. First, the rope would have to be initialized by creating the appropriate threads, and then the user code would be executed as an SPMD function in each of the threads. Part of the initialization code is shown in Figure 9.
We assume here that the P contexts participating in the execution have already been set up. The lead context creates a rope with weak consistency and adds one thread per context in all the contexts. The lead context then waits for the rope to exit before continuing.

      ! Initialization code executed in the lead context
      call pthread_rope_create(rope_id, WEAK)
      call pthread_rope_addnew(rope_id, NULL, UserF, NULL, 0, P, ALL, 1)
      call pthread_rope_join(rope_id)

FIGURE 9 Sample initialization code for ropes.

The user code is executed by each thread in the rope using the subroutine UserF depicted in Figure 10.
In comparing the two codes (Figure 8 and Figure 10), we see that the differences are minor. This implies that a compiler has to do very little work to re-target from a process-based system to a thread-based system. The rope-based code has some extra setup code. Also, the individual thread code has to determine its rope before it starts execution, and the message send and broadcast have to provide an extra rope id. Each thread also calls pthread_rope_exit at the end of its execution so that the lead context can be notified when the data-parallel computation has finished. Overall, an HPF compiler has to put in only a few extra calls to run-time routines to target a ropes interface. This offers the advantage of re-targeting a data-parallel compiler for a thread-based run-time system with very little modification.

EXPERIMENTAL RESULTS
Having described the design of a run-time-based approach for supporting data-parallel programming in a multithreaded environment, we now discuss performance figures for our ropes implementation atop Chant. We will briefly discuss the performance of the inherent design, including rope creation, rope addition, and message passing using relative indexing, as well as the performance of a data-parallel program for a Jacobi-like computation. The target machine is a 64-node Intel Paragon located at the NASA Langley Research Center.

Rope Creation
Creating a rope results in a remote service request being sent to the global name server, and its reply, which requires about 375 microseconds. After the rope server has returned the new rope identifier to be used, local data structures are initialized and the rope creation is complete. If only a single rope_create operation is being executed on all contexts in the system, then the creation time is independent of the number of contexts; otherwise, contention for the global rope server will degrade rope creation time depending on the number of simultaneous rope_create calls.
Adding new threads to a rope, using the rope_addnew call, requires sending a message to the rope server, indicating how many threads are to be created on the specified list of contexts. The rope server must then broadcast a request to those contexts, informing them to create the new threads. After creating the threads, the participating contexts send a message back to the rope server, detailing the thread identifiers of the new threads so that the rope server can complete the rope translation table. Finally, if the consistency mode for the rope is strong, then the rope server must broadcast the new rope table to the participating contexts. Therefore, the total number of message exchanges required to add threads on n contexts is 2n + 1 for weak consistency and 2n + N + 1 for strong consistency, where N is the total number of contexts involved in this rope. Using the exchange time shown in Figure 11, it is a straightforward calculation to determine the cost of creating new threads on n contexts. Table 1 shows the times (in microseconds) for creating four new threads on each of two to five contexts and adding them to an existing rope.
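The message counts above follow directly from the protocol: one request to the server, one request from the server to each of the n contexts, one reply from each, and, under strong consistency, one table broadcast to each of the N member contexts. A small check (illustrative Python, our own function name):

```python
# Message-exchange count for rope_addnew on n contexts, where
# total_contexts (N) is the number of contexts in the rope.

def addnew_messages(n, total_contexts, strong):
    msgs = 1         # calling thread -> rope server
    msgs += n        # server asks each of the n contexts to create threads
    msgs += n        # each context returns its new thread identifiers
    if strong:
        msgs += total_contexts  # broadcast of the updated translation table
    return msgs
```

For the experiment in Table 1 (four threads on each of n = 2..5 contexts), this gives 2n + 1 exchanges for weak consistency and 2n + N + 1 for strong, matching the formulas in the text.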

Relative Indexing
Figure 11 depicts the time (averaged over 10 runs) required to exchange a message (i.e., a send/receive pair) on the Intel Paragon using four different mechanisms for message passing:
1. The bottom line corresponds to MPI communication directly between processes.
2. The next line corresponds to Chant thread-to-thread communication using global thread identifiers.
3. The next line represents Chant thread-to-thread communication using the relative rank within a rope, which must first be translated into the global thread identifier, <context, thread_id>, before invoking the normal chant_send function. These numbers assume that the rope table entry containing the global thread identifier is valid (either strong consistency, or weak consistency with the entry already cached).
4. The top line represents Chant thread-to-thread communication using the relative rank within a rope (as with the previous line), but assuming that the rope table entry for this rank is not valid (weak consistency, not cached) and must therefore be fetched from the server for this rope, which resides on a different context from the calling thread. Thus, the exchange time is double that of the previous line, accounting for the exchange needed to obtain the translation information from the rope server. This is the worst-case situation for rope_send.
The results from Figure 11 indicate that relative indexing adds an insignificant overhead to the cost of sending a message between two threads when the translation information is present, and doubles the cost when the translation information must be retrieved from the server.
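The cached-versus-uncached behavior behind these two cases can be sketched as a simple translation cache (an illustrative model; the function name and placeholder translation are ours):

```python
def rope_send_exchanges(rank, rope_table):
    """Count the message exchanges needed to send to a relative rank.
    A valid rope-table entry costs one exchange; under weak consistency,
    a missing entry first costs an extra exchange with the rope server
    to fetch the <context, thread_id> translation, which is then cached."""
    exchanges = 0
    if rank not in rope_table:               # weak consistency, not cached
        exchanges += 1                       # fetch translation from server
        rope_table[rank] = ("ctx", "tid")    # placeholder translation, cached
    exchanges += 1                           # the actual thread-to-thread send
    return exchanges
```

The first send to a given rank costs two exchanges (the worst case above); every subsequent send finds the translation cached and costs one.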

A Data-Parallel Computation
For this experiment, we implement a data-parallel algorithm for smoothing a two-dimensional array of values (see Figure 12).
For a balanced Jacobi iteration (cf. Figure 13), all threads would get a similar number of array elements, and the computation required to smooth each element would be the same for all elements. With this approach, each thread takes about the same amount of time, so a parallel execution of two ropes would not reveal how the threads can take advantage of wasted processor cycles. However, if we implement an unbalanced Jacobi computation, where the smoothing function varies for each array element, then some threads in the rope will complete their execution before other threads, resulting in wasted computation gaps. An unbalanced Jacobi corresponds to real-world situations where, for example, the iterative solution of an element within a grid varies depending on its location within the grid. In our experiment, we implement the unbalanced Jacobi, hence there is a difference between the sequential and parallel rope executions. The difference, in fact, corresponds to the gap between the longest-running thread in the first rope and the thread in the first rope that is mapped to the same processor as the longest-running thread in the second rope. This difference is illustrated in Figure 14. Were both long-running threads mapped to the same processor, the gap would be zero, and thus the execution time for the ropes in parallel would be the same as the execution time for the ropes in sequence.
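The gap argument can be made concrete with a small scheduling model. This is our own illustrative sketch, not the paper's measurement code; it assumes thread i of each rope is mapped to processor i, with per-thread running times t1 and t2.

```python
def sequential_time(t1, t2):
    """Run rope 1 to completion, then rope 2: each rope's makespan is
    its longest thread, so the total is the sum of the two maxima."""
    return max(t1) + max(t2)

def parallel_time(t1, t2):
    """Run both ropes concurrently: each processor interleaves its two
    threads, so the makespan is the largest per-processor workload."""
    return max(a + b for a, b in zip(t1, t2))
```

With unbalanced times t1 = [4, 1] and t2 = [1, 4], the sequential execution takes 8 units while the parallel execution takes only 5: the short thread of each rope fills the other rope's idle gap. If instead the two long-running threads share a processor (t2 = [4, 1]), both schedules take 8 units and the gap is zero, as stated above.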

RELATED RESEARCH
There are several systems that support distributed threads, such as Nexus [6] and Panda [3], although these systems do not currently support the notion of ropes.
The term rope was first coined in the pthreads++ system [20], in which a rope is a C++ class that provides support for data-parallel execution of a task in a shared memory environment; this was later extended to a distributed memory environment.
A rope is the thread-level analog of the process-level scoping mechanisms provided by most communication packages, such as process groups in MPI [10]. MPI does not currently support the notion of threads as addressable entities within a process, nor the ability to group such threads. However, there has been considerable recent research activity combining MPI and threads.
Besides Chant, which was outlined in Section 2, other projects include [21], which addresses the issue of making MPI (using P4) "thread-safe" with respect to internal worker threads designed to improve the efficiency of MPI, but not intended to be user-accessible entities (i.e., they cannot execute user code); and [14], which addresses many possible extensions to the MPI standard, including the addition of long-lived threads capable of executing user code. Suggestions for altering the role and functionality of communicators would allow for multiple threads per communicator, thus permitting collective operations among the threads.

FIGURE 14  Sequential and parallel unbalanced Jacobi using two ropes.
The contribution of this article is to provide a design for a thread-level scoping mechanism that is not predicated on MPI constructs and their extensions, but rather on a simple communicating-thread model that supports remote service requests. Thus, until the MPI community sees fit to extend the standard with user-accessible, long-lived threads, our approach provides a clean and efficient mechanism for supporting data-parallel execution in a multithreaded environment.

CONCLUSIONS
Recently, several run-time systems have been designed to support interprocessor communication between lightweight threads within a process.Although collective operations and relative indexing are common operations for most message-passing systems, support for these operations at the thread level has received little attention.
This article addresses the issues of supporting collective operations and relative indexing among threads in a distributed memory environment. We provide a design for ropes in the context of the Chant system, where a rope defines the scope of collective operations with respect to threads. Our design builds on Chant's support for point-to-point communication between threads in a distributed memory environment.
We plan to utilize this extension to the Chant run-time system to support data-parallel codes in a multithreaded environment, and will report on the results of this effort in future work.
FIGURE 4  The rope name-server mechanism; rope lists are expanded in Figure 5.

FIGURE 5  Data structure for the local rope list.


FIGURE 8  Resulting data-parallel code from an HPF fragment.

FIGURE 12  Execution times for an N × N × N unbalanced Jacobi running on a 2 × 2 × 2 Intel Paragon.

Table 2.  Execution Times for an N × N × N Unbalanced Jacobi Running on a 2 × 2 × 2 Intel Paragon

The smoothing operation is

    A_{i,j} = A'_{i,j}                                                       if i = 1, i = n, j = 1, or j = n
    A_{i,j} = A'_{i,j}/2.0 + (A'_{i-1,j} + A'_{i+1,j} + A'_{i,j-1} + A'_{i,j+1})/8.0   otherwise

where A represents the current iteration and A' represents the previous iteration. Two copies of the algorithm are mapped onto two ropes over a set of 2 × 2 × 2 processors, where the solution grid size of N × N × N varies from N = 2
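A minimal Python transcription of the boundary-preserving smoothing step above (our sketch, not the paper's code; it assumes a square n × n grid stored as a list of lists):

```python
def jacobi_smooth(prev):
    """One smoothing iteration: boundary elements keep their previous
    value A'_{i,j}; interior elements get A'_{i,j}/2 plus one eighth of
    the sum of their four neighbors from the previous iteration."""
    n = len(prev)
    new = [row[:] for row in prev]  # copy handles the boundary cases
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = prev[i][j] / 2.0 + (
                prev[i - 1][j] + prev[i + 1][j]
                + prev[i][j - 1] + prev[i][j + 1]
            ) / 8.0
    return new
```

Note that a uniform grid is a fixed point of this update (1/2 + 4/8 = 1), which is a quick sanity check on the weights.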