Automatic Multilevel Parallelization Using OpenMP

In this paper we describe the extension of the CAPO parallelization support tool to support multilevel parallelism based on OpenMP directives. CAPO generates OpenMP directives with extensions supported by the NanosCompiler to allow for directive nesting and the definition of thread groups. We report first results for several benchmark codes and one full application that have been parallelized using our system.

programming model design as well as implementation. In addition, extensions to the standard are being proposed and evaluated in order to widen the applicability of OpenMP to a broad class of parallel applications without sacrificing portability and simplicity.
What has not been clearly addressed in OpenMP is the exploitation of multiple levels of parallelism. The lack of compilers able to exploit further parallelism inside a parallel region has been the main cause of this problem, and it has favored the practice of combining several programming models to exploit multiple levels of parallelism and scale applications to a large number of processors. The nesting of parallel constructs in OpenMP is a feature that requires attention.

Since the dependence analysis is interprocedural, the parallel regions can be defined as high up in the call tree as possible. This provides an efficient placement of the directives.

2) Optimization of parallel regions and parallel loops - the fork-and-join overhead (associated with starting a parallel region) and the synchronization cost are greatly lowered by reducing the number of parallel regions required. This is achieved by merging together parallel regions where there is no violation of data usage. In addition, the synchronization between successive parallel loops is removed if it can be proved that the loops can correctly execute asynchronously (using the NOWAIT clause).
3) Code transformation and insertion of OpenMP directives - this includes the search for and insertion of possible THREADPRIVATE common blocks. There is also special treatment for private variables in non-threadprivate common blocks. If there is a usage conflict, the routine is cloned and the common block variable is added to the argument list of the cloned routine. Finally, the call graph is traversed to place OpenMP directives within the code. This includes the identification of necessary variable types, such as SHARED, PRIVATE, and REDUCTION.

Extension to multilevel parallelization
Our extension of OpenMP to multilevel parallelism is based on parallelism at different loop nests. Multilevel parallelism can also be exploited with task parallelism, but this is not considered here, partly because task parallelism is not well defined in the current OpenMP specification. Currently, we limit our approach to two-level loop parallelism, which is of more practical use. The approach to automatically exploiting two-level parallelism extends the single-level parallelization and is illustrated in Figure 1. Besides the data dependence analysis in the beginning, the approach can be summarized in the following four steps.

Step 1. These parallel loops and parallel regions are then optimized as before, but limited to the scope defined by the first level.

3) Second-level directive insertion.
This includes code transformation and OpenMP directive insertion for the second level. The step performed before inserting any first-level directives is to ensure that a consistent picture is maintained for any variables and code that may be changed or introduced during the code transformation. Compared to single-level parallelization, the two-level parallelization process requires the additional steps indicated in the dashed box in Figure 1.

Implementation considerations
In order to maintain consistency during the code transformations that occur during the parallelization process, we need to update the data dependences properly. Other effects of the transformations have to be considered as well. This is illustrated by the following example.
Assume we have a nest containing two loops: DO ...

A study of the effects of single-level OpenMP parallelization of the NAS Parallel Benchmarks can be found in [12]. In our experiments we started out with the same serial implementation of the codes that was the basis for the single-level OpenMP implementation described in [12]. We ran class A (64x64x64 grid points), class B (102x102x102 grid points), and class C (162x162x162 grid points) for the BT and SP benchmarks.

The outer-level parallel code generated by the NanosCompiler runs somewhat slower than the code generated by the SGI compiler, but its relative performance improves with an increasing number of threads. When increasing from 64 to 128 threads, the multilevel parallel code still shows a speed-up, provided the number of groups is chosen in an optimal way. We observed a speed-up of up to 85% for 128 threads. In Figure 3 we show the speed-up resulting from nested parallelization for three problem classes of the SP and BT benchmarks. We denote by:
• SGI OpenMP: the time for outer loop parallelization using just the native SGI compiler,
• Nanos Outer: the time for outer loop parallelization using the NanosCompiler,
• Nanos Minimal: the minimal time for nested parallelization using the NanosCompiler.
For the BT benchmark CAPO parallelized 28 loops, 13 of which were suitable for nested parallelization.
The reason that multilevel parallelism has a positive effect on the performance of these loops is mainly that load balancing between the threads is improved. For class A, for example, the number of iterations is 62. If only the outer loop is parallelized, using more than 62 threads will not improve the performance any further. In the case of 64 threads, 2 of them will be idling. If, however, the second loop level is also parallelized, all 64 threads can be put to use. Our experiments show that choosing the number of groups too small will actually decrease the performance. Setting the number of groups to 1 effectively moves the parallelism completely to the inner loop, which will in most cases be less efficient than parallelizing the outer loop.

[Figure 5: pipeline synchronization code, in which a thread busy-waits on its left neighbor's flag: do while (isync(iam-1) .eq. 0).]

There is, however, a problem in setting up a directive-based two-dimensional pipeline for the Loop_Body structure depicted in the figure. A brief overview of this work is given in Section 6.

Unsuitable loop structure in ARC3D
ARC3D uses an implicit scheme to solve the Euler and Navier-Stokes equations on a three-dimensional (3D) rectilinear grid. The main component is an ADI solver, which results from the approximate factorization of the finite difference equations. The actual implementation of the ADI solver (subroutine STEPF3D) in the serial ARC3D is illustrated in Figure 6. It is very similar to the SP benchmark.

Related work

There are a number of papers reporting experiences in combining multiple programming paradigms (such as MPI and OpenMP) to exploit multiple levels of parallelism. However, there is not much experience in the parallelization of applications with multiple levels of parallelism using only OpenMP. Implementation of nested parallelism by means of controlling the allocation of processors to tasks in a single-level parallelism environment is discussed in [5]. The authors show the improvement due to nested parallelization.
Other experiences using nested OpenMP directives with the NanosCompiler are reported in [2]. In the examples discussed there, the directives have not been automatically generated.

Project Status and Future Plans
We have extended the CAPO automatic parallelization support tool to automatically generate nested OpenMP directives. We used the NanosCompiler to evaluate the efficiency of our approach. We conducted several case studies which showed that:
• Nested parallelization was useful to improve load balancing.
• Nested parallelization can be counter productive when applied without considering workload distribution and memory access within the loops.
• Extensions to the OpenMP standard are needed to implement nested parallel pipelines.
We are planning to enhance the CAPO directives browser to allow the user to view loops, which are ...

OpenMP extensions are currently being implemented in the framework of the NanosCompiler to easily specify the precedence relations that drive pipelined execution. These extensions are also valid in the scope of nested parallelism. They are based on two components:
• The ability to name work-sharing constructs (and therefore reference any piece of work coming out of them).
• The ability to specify predecessor and successor relationships between named work-sharing constructs (PREC and SUCC clauses).
This avoids the manual transformation of the loop to access data slices and the manual insertion of synchronization calls. From the new directives and clauses, the compiler automatically builds synchronization data structures and inserts synchronization actions following the predecessor and successor relationships defined [8]. These relationships can cross the boundaries of parallel loops and therefore avoid the problems that CAPO currently has in implementing two-dimensional pipelines.
We plan to conduct further case studies to compare the performance of parallelization based on nested OpenMP directives with hybrid and pure message passing parallelism.