Overlapping Communication and Computation with OpenMP and MPI

Machines composed of a distributed collection of shared memory or SMP nodes are becoming common for parallel computing. OpenMP can be combined with MPI on many such machines. Motivations for combining OpenMP and MPI are discussed. While OpenMP is typically used for exploiting loop-level parallelism, it can also be used to enable coarse grain parallelism, potentially leading to less overhead. We show how coarse grain OpenMP parallelism can also be used to facilitate overlapping MPI communication and computation for stencil-based grid programs, such as a program performing Gauss-Seidel iteration with red-black ordering. Spatial subdivision or domain decomposition is used to assign a portion of the grid to each thread. One thread is assigned a null calculation region so that it is free to perform communication.
 
Example calculations were run on an IBM SP using both the Kuck & Associates and IBM compilers.


Introduction
This paper discusses combining MPI and OpenMP on collections or clusters of shared memory processors, or SMPs. For this class of machine, individual nodes are connected together with some type of network, while the processors within a node share memory.
Many new large machines fall into the clusters of shared memory nodes classification. The machine used for this study was the IBM SP at the San Diego Supercomputer Center known as Blue Horizon. It contains 1,152 processors in 144 nodes. Each of the nodes contains 8 processors that share memory. The nodes are connected with an IBM proprietary network. The machine supports a hybrid programming model, with OpenMP used among processors within a node and MPI for passing messages between nodes. The SP is one instance of this class of machines. Other machines will have different network, processor, operating system, and compiler characteristics. The specific results given in this paper comparing run times from various tests might not apply to other machines. Most of the motivations and techniques discussed for combining OpenMP and MPI can be used on other machines.
We are interested in using OpenMP to facilitate overlapping communication and computation in hybrid programming for stencil-based grid programs. Several tests leading to this goal were performed using a simple three-dimensional stencil-based program, RB3d. RB3d does Gauss-Seidel iteration with red-black ordering. It uses a combination of C++ and Fortran, with a Fortran kernel. The grid is distributed across the nodes of the machine, with MPI used for communication.
Various experiments were performed, changing the way OpenMP was applied to the program. OpenMP was first applied at the lower loop level, within the Fortran kernel. For the next test, the directives were moved to a higher level within the program, outside of the kernel: the OpenMP directives were moved to a routine that calls the Fortran kernel. For this case there is a second level of domain decomposition, and the OpenMP threads are assigned regions of this second-level decomposition.
Moving OpenMP to a higher level facilitates overlapping communication and computation in hybrid programming. One of the OpenMP threads is assigned a small or null region in the second level of domain decomposition. This thread is then freed to perform the MPI communication. For programs with the appropriate computation to communication ratio, this improves performance.
One of our motivations for studying hybrid programming is for its potential application to the KeLP (Kernel Lattice Parallelism) framework. KeLP is a class library, developed primarily by Scott Baden and Steve Fink at UCSD [1,2,6], for the implementation of portable applications in distributed computing. KeLP is based on the concept of regions. Users who write applications using KeLP supply subroutines to perform calculations on regions containing sections of a data grid. KeLP handles communication for regions on different processors. KeLP researchers are looking for ways to exploit multi-tiered machines, and are interested in using threads for various region calculations. The KeLP framework may also benefit from overlapping communication and computation with OpenMP and MPI.
Regular grid stencil-based programs are important in many areas of computational physics such as fluid dynamics, heat transfer, and electrodynamics. See LeVeque [11] for a discussion of the numerical solution of conservation laws on regular grids. The solution of Euler's hydrodynamics equations is discussed by Toro [16]. LeVeque [12] also shows that regular grid routines can be used as the basis for adaptive grid methods with his Conservation Law Package, CLAWPACK. Balls [3] discusses an innovative method of solving Poisson's equation in parallel, using KeLP.

Additional motivation for combining OpenMP and MPI
The primary motivations for adopting most new programming paradigms are increased capability, efficiency, and ease of programming. Adding MPI to OpenMP programs allows users to run on larger collections of processors. Pure shared memory machines are limited in numbers of processors. Adding message passing can increase the number of processors that are available for a job. Adding OpenMP to MPI programs can also increase efficiency, and for some systems increase capability.
There can be multiple reads and writes of data to memory when data is sent using an MPI library. In the extreme, data might be copied from the sending array, to a temporary buffer, to the network, to a buffer on the other end, and finally to the receiving array. Even if shared memory is used to pass messages, the data must be copied from the send buffer to the receive buffer. With OpenMP, there is no implicit copying of data. This can lead to greater efficiency.
At the time the tests described in this paper were performed, there were additional limitations on Blue Horizon that encouraged the use of OpenMP along with MPI. While the machine has 8 processors per node, the communications hardware only supported 4 MPI tasks per node when using the best communications protocol. To get access to all 8 processors, users were required to use some type of threads package such as OpenMP or Pthreads.
The implementation of MPI available on the SP has another limitation. The memory required by each MPI task suffers from P² scaling, where P is the total number of MPI tasks in a job. This can be a problem for large jobs because MPI can use significant amounts of memory, leaving little for users. Accessing P processors using N threads and P/N MPI tasks cuts the memory required by a factor of N². For example, using 8 threads per MPI task reduces this memory term by a factor of 64.

Combining OpenMP and MPI
The method for combining OpenMP and MPI is clearly program specific. This paper concentrates on an important class of problems, stencil-based or grid programs.
The grid program model is discussed below, followed by a description of the typical way to parallelize stencil-based grid applications on distributed memory machines. Then, two methods for creating hybrid programs by adding OpenMP are presented. The first uses loop-level OpenMP and the second uses OpenMP at a higher level to do coarse grain parallelism.
Data for stencil-based applications is stored in a 2- or 3-dimensional array or grid. At each iteration or time step of the calculation, the grid values are updated. The values for a cell at iteration N+1 are a function of some set of the surrounding cells. Example problems include a Jacobi iteration solution of Stommel's equation, or the solution of Euler's hydrodynamics equations as discussed by Toro [16].
Domain decomposition is often employed to solve these problems on distributed memory parallel machines. A portion of the grid is allocated on each processor and each processor is responsible for updating its portion of the grid. Processors do not have access to values from the entire grid. The information required to update a particular cell might be on a different processor. The required information is sent to/from the various processors using MPI messages.
Consider a simple two-dimensional case with a total grid of interest of 100 × 100 and using two processors. Processor one could perform the calculation for the portion of the grid with indices (1-100) × (1-50) and processor two could perform the calculation for (1-100) × (51-100). If the values for a calculation are a function of the 4 surrounding cells, then processor one requires information from the cells (1-100) × (51). These values are sent from processor two. The values are stored in cells called ghost cells. One simple way to handle ghost cells is to allocate the grid larger than the region for which the calculation is being performed; the extra storage is used for ghost cells. So in this case, processor one would allocate its grid with size (1-100) × (1-51) but only calculate for the region (1-100) × (1-50), and the region (1-100) × (51) contains the ghost cells. Information about additional complexities of parallel grid programs can be found in many references, including Kaiser [9].
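As a concrete illustration, a minimal sketch of this two-processor ghost-cell layout is shown below. It is not code from RB3d: the split here is along the first index so that the exchanged ghost row is contiguous in memory, and the array names, sizes, and the use of MPI_Sendrecv are assumptions made for the example.

    /* Minimal sketch (not RB3d code): a 100 x 100 grid split between two MPI
       ranks along the first index.  Each rank owns 50 rows and allocates one
       extra row of ghost cells that is refreshed every iteration. */
    #include <mpi.h>

    #define NROWS_LOC 50                 /* rows owned by this rank          */
    #define NCOLS     100                /* full grid width on every rank    */

    static double grid[NROWS_LOC + 1][NCOLS];   /* +1 row for ghost cells    */

    static void exchange_ghosts(int rank)
    {
        MPI_Status st;
        if (rank == 0)                   /* owns global rows 1-50            */
            MPI_Sendrecv(grid[NROWS_LOC - 1], NCOLS, MPI_DOUBLE, 1, 0,
                         grid[NROWS_LOC],     NCOLS, MPI_DOUBLE, 1, 0,
                         MPI_COMM_WORLD, &st);
        else                             /* rank 1 owns global rows 51-100   */
            MPI_Sendrecv(grid[1], NCOLS, MPI_DOUBLE, 0, 0,
                         grid[0], NCOLS, MPI_DOUBLE, 0, 0,
                         MPI_COMM_WORLD, &st);
    }

Each rank updates only the rows it owns; the extra row holds the neighbor's boundary values, exactly as in the ghost-cell description above.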

OpenMP loop level parallelism
Stencil-based grid programs often update values using one or more nested loops. A simple example is a Jacobi iteration with a five-point stencil and a source term. Consider the subroutine do_jacobi.
Here we have old grid values stored in the array psi. These psi values are used to calculate new grid values placed in new_psi. To "OpenMP" the code section, a parallel do directive is added.
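The do_jacobi listing itself is not reproduced here. The sketch below is a C analogue of the routine described above (the paper's kernel is Fortran, where the directive would be !$OMP PARALLEL DO); the array names psi and new_psi follow the text, while the dimensions and the source array are assumptions.

    /* Hedged C sketch of a loop-level OpenMP Jacobi update with a five-point
       stencil and a source term.  Only the names psi and new_psi come from the
       text; everything else is illustrative. */
    void do_jacobi(int nx, int ny, double psi[nx][ny],
                   double new_psi[nx][ny], double source[nx][ny])
    {
        #pragma omp parallel for        /* C form of the parallel do directive */
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++)
                /* each new value depends on the four surrounding old values */
                new_psi[i][j] = 0.25 * (psi[i - 1][j] + psi[i + 1][j] +
                                        psi[i][j - 1] + psi[i][j + 1])
                                - source[i][j];
    }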
Pseudocode for a kernel of a Jacobi iteration program in MPI and OpenMP would be

    do k = 1, num_iterations
       call MPI_transfer_routine(...)
       call do_jacobi(...)
       psi = new_psi
    enddo

MPI_transfer_routine passes ghost cells between nodes and do_jacobi contains the OpenMP directives as shown above.
The algorithm given above has two distinct phases. There is a communication phase followed by a computation phase. All processors must wait for the communication to complete before they perform any computation. This is true even though some of the computation is not dependent on the communication.
Stencil-based grid programs suffer from a problem: when stepping through the grid using straightforward nested loops, there is very little data reuse. Data is used once or twice and then it is flushed from cache. This leads to slower performance. It is possible to do cache blocking, but doing so often destroys the clean nature of the code that is exploited by a programmer using OpenMP directives. When doing cache blocking there are additional loop levels, and the application of OpenMP directives becomes more problematic, as the sketch below illustrates.
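The following is a hedged sketch of the same Jacobi update with cache blocking added; the block sizes si and sj echo the variable names used in the kernel later in the paper, while the loop bounds and array names are assumptions. With the extra loop levels there are now several candidate positions for the directive, and the best choice depends on the block sizes.

    /* Hedged sketch: the Jacobi update above with cache blocking.  The block
       sizes si and sj mirror the kernel variables discussed later; the rest is
       illustrative.  The parallel for shown here splits the outer block loop,
       which may expose only a handful of iterations to the threads. */
    void do_jacobi_blocked(int nx, int ny, int si, int sj,
                           double psi[nx][ny], double new_psi[nx][ny],
                           double source[nx][ny])
    {
        #pragma omp parallel for
        for (int ii = 1; ii < nx - 1; ii += si)              /* row blocks    */
            for (int jj = 1; jj < ny - 1; jj += sj)          /* column blocks */
                for (int i = ii; i < ii + si && i < nx - 1; i++)
                    for (int j = jj; j < jj + sj && j < ny - 1; j++)
                        new_psi[i][j] = 0.25 * (psi[i - 1][j] + psi[i + 1][j] +
                                                psi[i][j - 1] + psi[i][j + 1])
                                        - source[i][j];
    }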

Higher level coarse grain OpenMP parallelism
Consider the algorithm discussed above. For some cells, the data for the stencil calculation is dependent on communication and for some it is independent of communication. A natural thought is to break the grid into communication dependent and independent regions. With this done, an algorithm can be written to allow overlap of communication and computation. For this algorithm, communication of data for the dependent regions is started first. Calculation is done on the independent regions while communication continues. After communication completes, calculation can proceed for the dependent regions.
We can further decompose the regions into a collection of smaller regions. Threads can be assigned to blocks of data using OpenMP. This concept leads to the OpenMP directives being placed at a higher level in the program, as shown below.
This is of interest to the KeLP developers because the concept fits well with the KeLP model, with a collection of regions assigned to a node and individual regions assigned to threads. It also allows the exploration of overlapping communication and computation without rewriting the KeLP framework.
There are other advantages to moving the OpenMP directives up to a higher level in the program. For some problems, moving the OpenMP directives to a higher level allows a looser level of synchronization, with a reduction in loop-level synchronization cost. Also, programmers are freer to do additional optimizations, such as cache blocking, that are difficult if OpenMP is applied at the loop level.
We define an abstract data type "region" that holds indices for a portion of the grid and an "empty" flag. For each MPI task we break the calculation grid into two sets of regions. The number of regions in each of the sets is equal to the number of threads. If we wish to overlap computation and communication, one of the regions in the first set is empty. The thread that has an empty region handles communication. The second set of regions contains portions of the grid that are affected by communication. This is illustrated in Fig. 1 for a case where we have four regions in each set. The fourth region in set one is empty; thus, the thread assigned to this region performs communication.
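A minimal C sketch of such a region type and the two region sets is shown below. The field names and the comments describing how the sets are used are assumptions made for illustration, not the RB3d or KeLP interfaces.

    /* Hedged sketch of the "region" abstract data type: index bounds for a
       portion of the grid plus an "empty" flag.  Field names are illustrative. */
    typedef struct {
        int lo[3];     /* lowest  i, j, k indices of the region                */
        int hi[3];     /* highest i, j, k indices of the region                */
        int empty;     /* nonzero: no calculation; the thread does MPI instead */
    } region;

    /* Per MPI task: one region per thread in each of two sets.  set1 covers
       cells that do not depend on communication (one entry may be empty so its
       thread is free to communicate); set2 covers the cells that need ghost
       values and is processed after communication completes. */
    region set1[8], set2[8];               /* sized here for 8 threads per node */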
After we have broken the grid into regions, the algorithm proceeds as follows:

    for each timestep
       parallel region over set one
          if my region is empty
             start MPI communication
             finish MPI communication
          else
             perform calculation in kernel
       end parallel
       parallel region over set two
          perform calculation in kernel
       end parallel
    end timestep
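A hedged C/OpenMP sketch of this time step is given below. do_kernel, the region type, and the two region sets follow the description and the sketch above; the helper names for starting and finishing the MPI exchange are assumptions made for the illustration.

    /* Hedged sketch of one time step with overlap.  The thread that draws the
       empty region performs the ghost-cell exchange while the other threads
       update the communication-independent regions; the communication-
       dependent regions are updated afterwards. */
    void start_mpi_communication(void);    /* assumed: posts MPI_Isend/MPI_Irecv */
    void finish_mpi_communication(void);   /* assumed: MPI_Waitall on requests   */
    void do_kernel(const region *r);       /* Fortran kernel called per region   */

    void timestep(region *set1, region *set2, int nthread)
    {
        #pragma omp parallel for schedule(dynamic, 1)
        for (int ig = 0; ig < nthread; ig++) {
            if (set1[ig].empty) {
                start_mpi_communication();
                finish_mpi_communication();
            } else {
                do_kernel(&set1[ig]);      /* cells independent of communication */
            }
        }                                  /* implicit barrier                   */

        #pragma omp parallel for schedule(dynamic, 1)
        for (int ig = 0; ig < nthread; ig++)
            do_kernel(&set2[ig]);          /* cells that needed the ghost values */
    }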
We can compare this to the algorithm shown below when OpenMP is applied in the kernel.

    for each timestep
       start MPI communication
       finish MPI communication
       perform calculation in OpenMP enabled kernel
    end timestep
This appears simpler, but applying OpenMP in the kernel might be difficult. Also, this does not allow overlapping of computation and communication.
When computation and communication are overlapped, a processor is given up to do the communication. When is it beneficial to overlap computation and communication? Consider a single iteration step of a calculation containing both computation and communication, but no overlapping. Normalize the run time for the step to 1, and assume the communication takes time C. Computation is spread across P processors and takes time 1 − C, so the work associated with the computation is P(1 − C). If we use one of the threads for communication, then this work must be spread across P − 1 processors. The run time when sparing one thread for communication is T = max(C, P(1 − C)/(P − 1)). There is speedup when the run time, T, is less than 1. This occurs if C > 1/P. If C > P/(2P − 1) then communication time is dominant, and the minimum possible run time is T = P/(2P − 1), reached at the crossover point. If P = 8 then there is speedup if C > 1/8, and the speedup peaks at C = 8/15. That is, there is speedup using overlap if the communication takes over 1/8 of the time for a single iteration without overlap, and the maximum speedup is S = 15/8, obtained when C = 8/15.
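For reference, the model just described can be written compactly, with all symbols as defined above:

    T(C,P) = \max\left( C,\; \frac{P(1-C)}{P-1} \right), \qquad S(C,P) = \frac{1}{T(C,P)},

with speedup (S > 1) whenever C > 1/P and a maximum speedup of S = (2P − 1)/P at C = P/(2P − 1).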
We clearly need asynchronous communication routines to do the overlap discussed above. MPI contains many routines to support asynchronous communication. The typical way to do this is for the sending process to "post" a send by doing an MPI_Isend. This starts the message going. The receiving process does an MPI_Irecv. After some time, both processes do an MPI_Wait to block until the communication completes. There are versions of MPI_Wait that only test to see whether messages have completed and don't block. MPI also has the ability to define derived data types. These can be used to efficiently send noncontiguous blocks of data, such as rows of a matrix. These routines are discussed in most MPI texts; two good ones have been given above, Pacheco [15] and Gropp, Lusk and Skjellum [7].
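A hedged sketch of this exchange pattern is shown below. The calls themselves (MPI_Isend, MPI_Irecv, MPI_Wait, MPI_Type_vector) are standard MPI; the buffer layout, counts, and the choice to exchange a strided column of a row-major array are assumptions made for the example.

    /* Hedged sketch of an asynchronous ghost exchange.  A derived datatype
       describes a noncontiguous column of a row-major nx x ny array; the send
       and receive are posted, useful work can be done in between, and MPI_Wait
       blocks until the transfers complete.  Names and sizes are illustrative. */
    #include <mpi.h>

    void exchange_column(double *grid, int nx, int ny, int neighbor)
    {
        MPI_Datatype column;
        MPI_Request  sreq, rreq;
        MPI_Status   st;

        /* nx blocks of 1 double, separated by a stride of ny doubles. */
        MPI_Type_vector(nx, 1, ny, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        MPI_Isend(&grid[ny - 2], 1, column, neighbor, 0, MPI_COMM_WORLD, &sreq);
        MPI_Irecv(&grid[ny - 1], 1, column, neighbor, 0, MPI_COMM_WORLD, &rreq);

        /* ... computation that does not need the ghost column can go here ... */

        MPI_Wait(&sreq, &st);      /* send buffer may be reused after this      */
        MPI_Wait(&rreq, &st);      /* ghost column has arrived after this       */

        MPI_Type_free(&column);
    }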

Application for testing hybrid program methodologies
The test code used to validate the concepts discussed in this paper was an iterative 3d Laplace's equation solver, RB3d, that uses the Gauss-Seidel method with red-black ordering. The original program has been used as a research tool in association with KeLP and is part of the KeLP distribution. The version that was used does not contain any of the KeLP system. RB3d is a simple program with a 7-point stencil. It contains a C++ driver with a Fortran kernel. When the OpenMP was placed outside of the kernel, the Fortran routines were not modified from their original form. Minimal modifications were made to the rest of the program.
This program was primarily used to see if overlapping communication with computation using OpenMP is effective. IBM's C++ compiler does not support OpenMP, but there is support in the IBM C and Fortran compilers, and C or Fortran routines containing OpenMP can be called from C++. Thus, the C++ routines that contained OpenMP were rewritten in C.
The program was designed with several options that are invoked based on input. Overlapping of communication and computation can be turned on or off. Blocking for cache in the Fortran kernel can also be toggled.
There is also a version that does loop-level OpenMP. This version is simply the original program with directives inserted in the Fortran kernel. The kernel is shown below. The top part is used if cache blocking is enabled. Si and sj are the variables that affect cache blocking. Note the added complexity of this block of code. Most importantly, the "best" place for the Parallel Do directive is a function of the blocking parameters.
The variable rb determines if the updated values are put into the red or black portion of the grid. That is, on one iteration of the kernel half of the grid is updated using values from the other half of the grid. This makes all updates to u(i, j, k) independent of each other.
    if(do_blocking) then
    !$OMP PARALLEL DO firstprivate(c,c2)
    loop a:  do jj = wl1, wh1, sj
    loop b:    do ii = wl0, wh0, si
    loop c:      do k = wl2, wh2
    loop d:        ...

The program was run using 4 MPI tasks spread onto 4 nodes. A grid of 120 × 120 × 240 was used on each node.
The program was run using both the native IBM compilers and the Kuck and Associates (KAI) compilers. Also, the programs were run with cache blocking enabled and disabled. Cache blocking for grid programs is discussed by Douglas, Hu, Kowarschik, Rude, and Weiss [5].

Results
The best result among all tests occurred using the IBM compiler, with OpenMP applied in the Fortran kernel and without cache blocking. Table 2 shows the run times when the OpenMP directives were placed in the Fortran kernel. For the IBM compiler with cache blocking turned off, the run time is reduced by 20% going from 4 to 7 threads, and it starts to go back up at 8 threads. For the KAI compiler, the run time is flat.
Why did the program perform relatively well when cache blocking was turned off? The IBM SP has a large L2 cache and each processor has its own cache. By using the OpenMP directives to spread the grid across all processors, most of the grid stayed in cache. In this case, using OpenMP gave us the effect of cache blocking for free. Note that this would not hold for larger grids.
When using OpenMP directives with the cache-blocked section of code, the performance is dependent on where the directive is placed. The "cache blocking enabled" data from Table 2 was obtained with the OpenMP directive placed before loop a, as shown in Section 3. In that position the compiler can only provide work for two threads, when jj = 1 and jj = 65. This is why the performance was flat for additional threads and why using the "cache blocking enabled" section of code did not work well.
The directive can instead be placed ahead of one of the other do loops. This was tried using 7 threads and the same input parameters. Placing the directive above loop b and loop d produced the same results. The best results were obtained with the directive above loop c: the run time was reduced to a little over 12 seconds, which is comparable to the non cache blocked code. This worked well because of the large range of indices for the loop and the unit stride. Placing the directive in front of loop e was disastrous; the run time jumped to 938.9 seconds.
The optimum placement is a function of the size of the grid and the cache blocking parameters. Moving the OpenMP directives to a higher level in the program removes these concerns.
A primary objective of this work is to move the OpenMP directives to a higher level and then compare using all threads for computation to using a thread to overlap communication and computation. Did reserving a thread for communication improve performance over using all threads to do computations? It did, but the best results from doing this were not quite as good as the best results discussed above.
When the directives were placed at the higher level to do coarse grain parallelism, using a thread to overlap computation and communication reduced the run time. Unlike the previous results, with high level OpenMP, turning cache blocking on improves performance. OpenMP was used to enable each thread to call its own copy of the Fortran kernel and pass it a region for which to calculate. The OpenMP directive that distributed the regions was

    #pragma omp parallel for schedule(dynamic, 1)
    for (ig = 0; ig < nthread; ig++) {
        ...
        do_kernel(region[ig]);
        ...
    }

Table 3 shows run times with and without a thread reserved for doing communication. Notice that using a thread for communication improves the run time of the program even though this reduces the number of computational threads by 1. The speedup is almost 22% for the KAI compiler using 6 calculation threads and 1 communication thread, compared to using 7 threads for computation.
Another important trend in this data is that scalability improves by reserving a thread to do communication. For the IBM compiler, doubling the number of processors and not using a thread for communication results in a 20% reduction in run time. Doubling the number of processors and using a thread for communication gives a 31% reduction in run time. For the KAI compiler there is a 23% reduction in run time while not using a thread for communication and a 34% reduction when a thread is used for communication. Table 4 expresses this in terms of scalability relative to 4 threads. When a thread is not used for communication the scaling drops to about 0.64. Scaling drops less with a thread used for communication, to about 0.75. This is important for the Blue Horizon machine with 8 processors per node, and can be very important on machines that have more processors per node, such as the Livermore ASCI White machine, which has 16 processors per node. The general trend for new hybrid machines is to add additional shared memory processors per node.
Table 5 contains the run times of the program with and without using a thread to do communication, but with the cache blocking in the Fortran kernel disabled. Notice that the best run times are about twice those obtained when cache blocking was enabled. Moving the OpenMP directives out of the Fortran kernel to a higher level enables the cache blocking optimizations to be used without interference from the OpenMP directives.

Conclusion
OpenMP combined with MPI can be used to program machines containing a distributed collection of shared memory nodes. Adding MPI to OpenMP programs allows users to run on larger collections of processors. Adding OpenMP to MPI programs can also increase efficiency and capability for some systems.
OpenMP and MPI were combined in a simple three-dimensional stencil-based program using different strategies. Domain decomposition was used to distribute the grid across the nodes of a distributed memory machine containing 8 processors per node. Loop-level OpenMP was first added to the nested do loop kernel of the program. The kernel contained two different options for performing the calculation. One option used a cache blocking algorithm and the other used a serial progression through the points of the grid. OpenMP performed well when cache blocking was not used. With cache blocking, the performance varied widely depending on the placement of the directives.
OpenMP was then moved to a higher level of the program. A second level of domain decomposition was done and each thread was assigned a portion of the grid to calculate. This can potentially add to the simplicity of a program because no modifications are needed to the kernel. Users are free to use optimization strategies, such as cache blocking, in the kernels without interference from OpenMP directives. This strategy also has the advantage that a thread can be assigned a null calculation region and be reserved for performing communication. When doing coarse grain parallelism, reserving a thread for communication significantly increased the performance and scalability of the program.

Table 1
The normalized communication time is greater than 1/P, so improvement is expected if communication and computation are overlapped.

These runs were done to determine if improvements in run time could be expected using overlapping. As discussed in Section 3.2, improvement is expected if the normalized communication time is greater than 1/P. Table 1 shows that this program has sufficient communication to improve performance with overlap. For example, with four threads and 1/P = 0.25 we have a normalized communication time of 0.55. The program was next run with overlapping turned on. There were actually two cases. In the first case, the regions were distributed to the threads with a "parallel for" directive. In the second case, we allow each thread to grab a region to work on; this requires that each thread execute the parallel region once and only once, which might not work on some compilers. Also, in our test codes we found no significant improvement by using a simple "parallel" directive and allowing each thread to grab a region to work on, instead of "parallel for".

Table 2
The run times when the OpenMP directives were placed in the Fortran kernel show poor scaling. (Red-Black 3d times; 4 MPI tasks; grid 120 × 120 × 240 on each node; OpenMP applied in the Fortran kernel.)

Table 3
Shows the speedup of using a thread to perform MPI communication with coarse grain OpenMP. (Red-Black 3d times; 4 MPI tasks; grid 120 × 120 × 240 on each node; OpenMP applied at a higher level outside of the Fortran kernel; cache blocking enabled.)

Table 4
Shows improved scaling by using a thread to perform MPI communication with coarse grain OpenMP

Table 5
Program run times are faster with cache blocking enabled; moving OpenMP to higher levels to do coarse grain parallelism facilitated cache blocking. (Red-Black 3d times; 4 MPI tasks; grid 120 × 120 × 240 on each node; OpenMP applied at a higher level outside of the Fortran kernel; cache blocking disabled compared to cache blocking enabled.)