OpenMP Issues Arising in the Development of Parallel BLAS and LAPACK libraries

Dense linear algebra libraries need to cope efficiently with a range of input problem sizes and shapes. Inherently this means that parallel implementations have to exploit parallelism wherever it is present. While OpenMP allows relatively fine grain parallelism to be exploited in a shared memory environment, it currently lacks features that make it easy to partition computation over multiple array indices or to overlap sequential and parallel computations. The inherently flexible nature of shared memory paradigms such as OpenMP poses other difficulties when it becomes necessary to optimise performance across successive parallel library calls. Notions borrowed from distributed memory paradigms, such as explicit data distributions, help address some of these problems, but the focus on data rather than work distribution appears misplaced in an SMP context.


Introduction
The BLAS and LAPACK libraries [1] are widely used by scientists and engineers to obtain good levels of performance on today's cache-based computer systems. Distributed memory analogues, the PBLAS and ScaLAPACK [2], have been developed to serve users of that class of parallel system. Shared memory (SMP) variants tend to be available only from hardware vendors (e.g. Intel's Math Kernel Library [3]) or from library companies such as NAG Ltd. or Visual Numerics Ltd.
Consistent with this pattern, Fujitsu will soon release its first SMP version of the parallel BLAS and LAPACK libraries for its PRIMEPOWER series. What makes this release interesting is that it was written exclusively using OpenMP, rather than a special-purpose thread library. In the course of developing these libraries, several issues arose concerning OpenMP. Some of these issues can be handled by careful use of OpenMP directives. Others reveal weaknesses in the current specification that might be addressed in future versions, and still others reveal weaknesses that are inherent in this approach to parallelisation and may be difficult to resolve directly.
In the rest of this paper, we present a brief overview of the SMP environment on the Fujitsu PRIMEPOWER. We then discuss some of the basic library issues surrounding the BLAS and how we have resolved these using OpenMP. Parallel performance in LAPACK routines is often obtained through a sequence of calls to parallel BLAS and by masking sequential computations with parallel ones. The latter requires splitting the set of threads into groups. The current OpenMP specification does not support this particularly well, but some attractive extensions have been proposed (e.g. the suggestions in [4]) that make this approach simpler. Finally, many LAPACK routines have kernels that consist of a sequence of consecutive BLAS calls within a loop. When these calls operate on just vectors, or perform matrix-vector type operations, they are sensitive to the migration of data from one processor's cache to another and to the overheads that result from making each BLAS call a separate parallel region. Avoiding such overheads is not always possible and the paper concludes by examining some of the limitations that are inherent to OpenMP.

SMP programming on the Fujitsu PRIMEPOWER
The Fujitsu PRIMEPOWER is an SMP system that supports up to 128 processors in a Solaris environment [6]. The current processor employed is the SPARC64 IV. This is a SPARC V9 architecture compliant processor that is similar to Sun's UltraSPARC processor series. The SPARC64 IV contains extensive support for out-of-order and speculative execution. It has a fused floating-point multiply-add as well as a separate floating-point add pipeline. Each processor has 128 Kbytes of data cache and a similarly sized instruction cache. There is also a unified second level cache of 8 Mbytes. In July 2001, the clock speed of these processors was 562.5 MHz, so the achievable peak performance is 1125 Mflop/s. Multi-processor systems are built from system boards that have up to 4 processors and 16 Gbytes of memory. There is a maximum of 8 such boards in a node (cabinet) and up to 4 nodes can be connected together via a high-speed crossbar. The system has nearly uniform memory access across its potential 512 Gbytes of memory. As the SPEC OpenMP benchmarks [7] show, it is possible to obtain parallel speed-ups using the full 128 processor configuration on non-trivial applications.
The parallel programming environment is provided by Fortran and C compilers that support OpenMP Version 1.1 [8]. Both compilers also have extensive support for the automatic parallelisation of user codes. Fujitsu's Parallelnavi batch environment, accessible via NQS, binds threads to processors, binds processors to jobs and provides support for 4 Mbyte pages. These all reduce performance variations caused by system and other user activity. Therefore, provided there are one or two processors available for system use and for handling basic interactive facilities, user jobs run on effectively dedicated processors.

Designing OpenMP parallel BLAS
One of the challenges in providing parallel BLAS and LAPACK routines is that most of the BLAS routines contain assembler kernels. OpenMP parallelism must therefore lie outside these kernels, which effectively introduces yet another level of blocking within the routines. The practical aspects of this and related issues are illustrated by the general matrix by matrix multiplication routine dgemm. This family of multiplications also forms the kernel around which all the other matrix-matrix BLAS operations are constructed, see [9].
The basic operation that dgemm supports is

  C ← αAB + βC,

where C is a general m by n matrix, A is a general m by k matrix and B is a general k by n matrix; both α and β are scalars. In addition either A or B can be transposed, with a consistent change in dimensionality. Each member of this family of four operations is highly parallel. When m and n are sufficiently large, an effective solution is to partition the problem into an appropriate number of sub-matrices and perform each sub-matrix multiplication in parallel. With 4 threads one partition would be to split A into two block rows, A₁ and A₂, and B into two block columns, B₁ and B₂. This then leads to

  Cᵢⱼ ← αAᵢBⱼ + βCᵢⱼ, for i, j = 1, 2,

where these sub-matrix operations are independent of one another and can be performed by separate calls to the sequential dgemm on different threads. The only challenge is to ensure that the number of threads allocated to a dimension is proportional to m and n and that the sub-blocks are large enough that near peak sequential performance is obtained. Since OpenMP has no equivalent to the High Performance Fortran (HPF) notion of processor arrays with shape [10], the library writer must map the 2-D thread partitioning onto the 1-D array of thread identifiers. This is not difficult, but the mapping clutters the code and makes it slightly harder to maintain. This is particularly relevant when one recalls that this mapping must be performed separately for each variant of the operation because the partitioning of the matrices A and B across sequential calls depends on whether they are transposed or not.
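As an illustration of the mapping, the following sketch (our own, not Fujitsu's implementation) assigns each OpenMP thread one tile of C via a hypothetical nrow by ncol thread grid and a call to the sequential dgemm; it assumes the non-transposed case, that nrow*ncol equals the number of threads, and that nrow divides m and ncol divides n:

*     Sketch: map the 1-D OpenMP thread id onto a hypothetical
*     nrow x ncol thread grid; each thread computes one tile of C
*     with the sequential dgemm.
      subroutine pdgemm_sketch(m, n, k, alpha, a, lda, b, ldb,
     $                         beta, c, ldc, nrow, ncol)
      integer m, n, k, lda, ldb, ldc, nrow, ncol
      double precision alpha, beta, a(lda,*), b(ldb,*), c(ldc,*)
      integer tid, ti, tj, i0, j0, mb, nb
      integer omp_get_thread_num
      mb = m/nrow
      nb = n/ncol
*$OMP PARALLEL private(tid, ti, tj, i0, j0)
      tid = omp_get_thread_num()
*     Position of this thread in the 2-D grid (column-major order)
      ti = mod(tid, nrow)
      tj = tid/nrow
      i0 = ti*mb + 1
      j0 = tj*nb + 1
*     This thread computes C(i0:i0+mb-1, j0:j0+nb-1)
      call dgemm('N', 'N', mb, nb, k, alpha, a(i0,1), lda,
     $           b(1,j0), ldb, beta, c(i0,j0), ldc)
*$OMP END PARALLEL
      end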
Performance of dgemm on the PRIMEPOWER is good. Single processor performance using a 562.5 MHz system on 512 by 512 to 1024 by 1024 matrices is around 1 Gflop/s. On the same sized problems, the performance is around 12 Gflop/s on 16 processors and around 32 Gflop/s on 64 processors.
The strategy of partitioning BLAS operations into a series of independent sequential BLAS calls has proven effective. However, the performance of the matrix-vector and vector-vector BLAS routines is sensitive to whether the matrix is already in cache (the "hot-cache" case) or not (the "cold-cache" case). This will be discussed in more detail at the end of this paper.

Building OpenMP LAPACK routines on top of OpenMP BLAS
One of the design decisions in LAPACK was to make extensive use of the matrix-matrix BLAS in order to block computations and thereby make better use of data in cache [1]. It was also felt, with some justification, that the performance of major LAPACK computation routines would improve simply from the use of SMP versions of the BLAS routines. While the operations performed between matrix blocks tend to parallelise well, the operations performed within blocks tend to be sufficiently fine grain that performance is mediocre sequentially and scales poorly.
A classical way to remove such sequential bottlenecks is to overlap the sequential computations on one processor with different parallel computations performed by the remaining processors.
Consider the pseudo-code for the main block of the LU-decomposition routine dgetrf as shown in Figure 1.
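The structure in question follows the standard blocked right-looking factorisation; a minimal sketch of it, modelled on the reference LAPACK implementation of dgetrf with blocking factor nb, is:

*     Blocked right-looking LU factorisation (sketch modelled on
*     the reference LAPACK dgetrf).
      subroutine dgetrf_sketch(m, n, a, lda, ipiv, info)
      integer m, n, lda, ipiv(*), info
      double precision a(lda,*)
      integer j, jb, nb, i, iinfo
      parameter (nb = 64)
      info = 0
      do j = 1, min(m,n), nb
         jb = min(min(m,n)-j+1, nb)
*        Factor the panel A(j:m, j:j+jb-1) with partial pivoting
         call dgetf2(m-j+1, jb, a(j,j), lda, ipiv(j), iinfo)
         if (info.eq.0 .and. iinfo.gt.0) info = iinfo + j - 1
*        Make the pivot indices relative to the whole matrix
         do i = j, min(m, j+jb-1)
            ipiv(i) = ipiv(i) + j - 1
         end do
*        Apply the interchanges to the columns outside the panel
         call dlaswp(j-1, a, lda, j, j+jb-1, ipiv, 1)
         if (j+jb .le. n) then
            call dlaswp(n-j-jb+1, a(1,j+jb), lda, j, j+jb-1,
     $                  ipiv, 1)
*           Compute the block row of U
            call dtrsm('Left', 'Lower', 'No transpose', 'Unit',
     $                 jb, n-j-jb+1, 1.0d0, a(j,j), lda,
     $                 a(j,j+jb), lda)
*           Update the trailing matrix A(j+jb:m, j+jb:n)
            if (j+jb .le. m) then
               call dgemm('No transpose', 'No transpose',
     $                    m-j-jb+1, n-j-jb+1, jb, -1.0d0,
     $                    a(j+jb,j), lda, a(j,j+jb), lda,
     $                    1.0d0, a(j+jb,j+jb), lda)
            end if
         end if
      end do
      end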
The operations performed within dgetf2 and the pivot updates are best performed on a single processor. The routines dtrsm and dgemm are BLAS routines that operate on large parts of the matrix and tend to perform well in parallel. Observe that the first nb columns of the trailing matrix will be the panel used for factorisation at the next value of j. This leads to the observation that the factorisation of that panel could be overlapped with the remainder of the update of the trailing matrix. Indeed, it is possible to do better than this, as shown in Figure 2, which gives a segment of a variant of dgetrf containing OpenMP directives.
It is now necessary to distinguish between the names of the sequential BLAS called from within a parallel region (as in Figure 2) and the parallel BLAS called from a sequential region, although the functionality of the routines is identical. The pseudo-code in Figure 2 allows the factorisation of the second and subsequent panels to be overlapped with the updating of the remainder of the matrix. The code will only work if thread 0 updates at least A(j:m,j:j+jb-1) prior to factoring this same block. Notice further that the partitioning in the dl_gemm call is only over columns, which will limit scalability on small to medium problems.
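The overlapped structure can be sketched as follows (a reconstruction for illustration, not the shipped code: dl_trsm and dl_gemm are the sequential variants named above, pivot application is elided, and jprev marks the first column of the previously factored panel):

      subroutine dgetrf_step(m, n, j, jb, a, lda, ipiv)
      integer m, n, j, jb, lda, ipiv(*)
      double precision a(lda,*)
      integer k, kb, jprev, iinfo
      jprev = j - jb
*     schedule(static) hands out contiguous chunks, so the chunk
*     containing k = j, and hence the panel factorisation, falls
*     to thread 0.
*$OMP PARALLEL DO schedule(static) private(k, kb, iinfo)
      do k = j, n, jb
         kb = min(n-k+1, jb)
*        Update the block row of U for columns k:k+kb-1
         call dl_trsm('Left', 'Lower', 'No transpose', 'Unit',
     $                jb, kb, 1.0d0, a(jprev,jprev), lda,
     $                a(jprev,k), lda)
*        Update the trailing rows of columns k:k+kb-1
         call dl_gemm('No transpose', 'No transpose',
     $                m-j+1, kb, jb, -1.0d0, a(j,jprev), lda,
     $                a(jprev,k), lda, 1.0d0, a(j,k), lda)
         if (k .eq. j) then
*           The new panel has just been updated; factor it while
*           the other threads continue with the trailing update
            call dgetf2(m-j+1, jb, a(j,j), lda, ipiv(j), iinfo)
         end if
      end do
*$OMP END PARALLEL DO
      end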
Improving scalability further runs into limitations of the current OpenMP standard. Effectively, three work groups of threads are desired. The first group contains thread 0. The second group (empty in the code of Figure 2, but probably just one or two threads) consists of threads that update a part of A(j:m,j:j+jb-1) and then proceed to update a part of A(j:m,j+jb:n). The third group (the majority of the threads) just updates a portion of A(j:m,j+jb:n). The goal is to distribute the matrix update across all threads subject to the constraint that thread 0 has additional processing to perform when factoring A(j:m,j:j+jb-1).
If there were a large number of threads (say 16 or more), it would be desirable to partition the matrix multiply performed by the third thread group over both rows and columns. This is only possible if the operations can be synchronised properly. For instance, before a part of the matrix multiplication can be performed, all earlier operations (e.g. the call to dl_trsm to update the sub-matrix that will form B in the subsequent matrix multiply) must have updated all the relevant parts of the sub-matrices. A block of columns that is partitioned among several threads for the matrix multiply will be composed of several column blocks that were updated independently in the previous call to dl_trsm. Therefore explicit synchronisation is required.

Figure 1 - Pseudo-code for dgetrf
With the current OpenMP specification, this is only possible through the use of several sets of locks. Thread groups, as suggested in [4], provide a cleaner solution. Groups 1 and 2 would work over a range of the matrix independent of that of the third group. The first two groups would be kept as one group to update A(j:m,j:j+jb-1). They would then be split after a barrier synchronisation so that group 1 performed the factorisation and group 2 proceeded with another portion of the matrix update. A barrier synchronisation just for the third thread group would ensure that its matrix multiplication could be partitioned optimally.
A simple enhancement to OpenMP would be to allow the programmer to specify the first thread id in a parallel loop as well as the chunk size in a static partitioning. Threads are then allocated to chunks in a wrap-around fashion. This would make it easier to bind threads to array indices in applications such as LU decomposition. An alternative to the parallel loop in Figure 2 with this addition is given in the short loop fragment below:

*     Update A(j:m, k:min(k+nb-1,n))
      if (k .eq. j) then
*        Factor A(j:m, j:j+nb-1)
      end if

Similar problems to those in the dgetrf routine relating to scalable performance can also be found in many other LAPACK routines. Essentially the difficulty is that the current OpenMP standards, including OpenMP Fortran Version 2.0, do not offer sufficient flexibility in the way in which thread groups can be defined. At present, synchronising the activity of a subset of the threads in a parallel region requires the use of locks, which can become very cumbersome. This problem becomes worse with recursive algorithms, which are becoming increasingly important in linear algebra. These include formulations of basic linear algebra operations such as factorisation, see [11], as well as new algorithms, such as the divide and conquer algorithm for the symmetric tridiagonal eigenvalue problem included in LAPACK Release 3.0. Therefore something like the thread groups proposed in [4] or the work queues proposed in [5] will be required in a future OpenMP specification.
Another difficulty with writing efficient OpenMP LAPACK routines relates to the overheads associated with several successive calls to parallel BLAS routines within one loop of an LAPACK routine. For instance, in the main loop of the symmetric tridiagonalisation routine dlatrd there is a sequence of 5 calls to matrix-vector BLAS, followed by 3 calls to vector BLAS. Each of these creates its own parallel region and each decides how many threads are appropriate for the operation and how work will be partitioned among those threads. In a given call to dlatrd, this sequence of BLAS calls is executed about 64 times, so that at least 512 different parallel regions are created. Even though the overheads associated with creating a new parallel region are low, the accumulated overheads of this many regions impact the performance on smaller problems.
The current solution to this problem is to create special "in-parallel" versions of the relevant BLAS routines. These routines are written assuming that a parallel region has already been created. It is then possible to have only one parallel region for the entire calling routine. While this reduces the overheads of creating the parallel regions, there is no mechanism within OpenMP by which the partitioning of work among threads within these various routines can be organised to maximise the reuse of data that is already in the cache of particular processors. This can be a major performance difficulty and it is one for which no solution is currently available.
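The pattern can be illustrated with a toy "in-parallel" axpy (a sketch under our own naming, following the dl_ prefix used earlier; the real library routines are more elaborate). The routine assumes it is already inside a parallel region and partitions its index space by thread id, so a loop of such calls creates only one parallel region in total:

      program inparallel_sketch
      integer n, nsteps, j
      parameter (n = 1000, nsteps = 64)
      double precision x(n), y(n)
      do j = 1, n
         x(j) = 1.0d0
         y(j) = 0.0d0
      end do
*$OMP PARALLEL private(j)
      do j = 1, nsteps
*        Every thread executes the loop; each call partitions the
*        vector internally.  The barrier stands in for the
*        synchronisation a real kernel needs between dependent calls.
         call dl_axpy(n, 0.5d0, x, y)
*$OMP BARRIER
      end do
*$OMP END PARALLEL
      print *, y(1)
      end

*     "In-parallel" axpy: assumes the caller has already opened a
*     parallel region and computes only this thread's block of 1:n.
      subroutine dl_axpy(n, alpha, x, y)
      integer n
      double precision alpha, x(*), y(*)
      integer tid, nthr, len, lo, hi, i
      integer omp_get_thread_num, omp_get_num_threads
      tid  = omp_get_thread_num()
      nthr = omp_get_num_threads()
      len = (n + nthr - 1)/nthr
      lo  = tid*len + 1
      hi  = min(n, lo + len - 1)
      do i = lo, hi
         y(i) = y(i) + alpha*x(i)
      end do
      end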

Desired: OpenMP standards to support cache reuse
Cache reuse is related to the discussion of data distribution directives for NUMA systems in OpenMP, see [12] and [13], but it is not identical.For example, the notion of data distributions among processes is not particularly helpful on a uniform memory access system like the PRIMEPOWER.
Effectively, such approaches assume that distribution is an attribute of the data. This is a useful model in a distributed memory / shared index space environment for languages such as HPF. Latencies to access remote data elements are orders of magnitude higher than those to access local data elements, and the memory hierarchy is such that there are substantial performance benefits from communicating one large data block rather than many small ones.
In a shared memory environment, there is little benefit in copying one large data block from the cache of one processor to that of another rather than copying many small data blocks, provided the small blocks are larger than a cache line.
Rather than distribution being an attribute of the data, it might be more useful to regard the partitioning of index spaces among threads as an attribute of the operator, with data residing in the cache of a particular processor being a side effect. The objective in this setting would be to minimise the differences in index space partitions between successive parallel loops. Alternatively, if the differences in index space partitions between parallel loops were known, a less demanding objective would be to define a prefetch strategy that allowed the cache-resident data to be loaded consistently with the second index space partition while executing over the first.
The performance problems due to different partitionings, and hence data lying in the wrong cache, can be severe. Consider the code fragment below for a parallel rank-1 update. This is the core operation performed in the BLAS routine dger.
*$OMP PARALLEL DO schedule(static)
*$OMP& default(shared) private(i)
      do j = 1, n
         do i = 1, m
            a(i,j) = a(i,j) + x(i)*y(j)
         end do
      end do
*$OMP END PARALLEL DO

This is a highly parallel operation that should scale well for a range of problem sizes. However, parallel performance is heavily dependent on what precedes this parallel region. For instance, if the array a is defined immediately beforehand using a single thread and the array is small enough to fit into that processor's Level 2 cache, then parallel performance will be terrible; it will be faster to perform the update on the original processor.
If the problem is large enough that it does not fit into a single processor's cache, then parallel scalability will be better because cache locality is less important. If the array is defined in an earlier parallel region using a partitioning similar to that used in the code fragment above, then parallel performance will be excellent.
Optimal cache use cannot be determined from just local information about the current data locality and the next operation to be performed. For instance, suppose that several consecutive rank-1 updates were performed after the array had been initialised in a sequential section, and that the array was small enough to fit into the collective Level 2 cache of the processors involved. A local decision to maximise cache reuse by limiting parallelism to a single thread would be the right decision for one rank-1 update, but it would certainly be the wrong decision if there were 50 updates.
The rank-1 update also provides an example of how information passed from the calling program to the called routine can improve cache reuse. The rank-1 update can be parallelised across both array dimensions, so the actual partitioning used could be chosen to fit well with the partitionings in earlier and subsequent parallel sections while still using all available threads.
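For instance, a rank-1 update partitioned over both dimensions might look like the following sketch (our own illustration; the thread grid nrow by ncol is a hypothetical parameter the caller would choose to match neighbouring parallel sections, with nrow*ncol equal to the number of threads):

*     Rank-1 update partitioned over both array dimensions: each
*     thread updates its own tile of a.
      subroutine dger2d(m, n, x, y, a, lda, nrow, ncol)
      integer m, n, lda, nrow, ncol
      double precision x(*), y(*), a(lda,*)
      integer tid, ti, tj, i, j, ilo, ihi, jlo, jhi, mb, nb
      integer omp_get_thread_num
      mb = (m + nrow - 1)/nrow
      nb = (n + ncol - 1)/ncol
*$OMP PARALLEL private(tid, ti, tj, i, j, ilo, ihi, jlo, jhi)
      tid = omp_get_thread_num()
      ti  = mod(tid, nrow)
      tj  = tid/nrow
      ilo = ti*mb + 1
      ihi = min(m, ilo + mb - 1)
      jlo = tj*nb + 1
      jhi = min(n, jlo + nb - 1)
*     This thread updates a(ilo:ihi, jlo:jhi)
      do j = jlo, jhi
         do i = ilo, ihi
            a(i,j) = a(i,j) + x(i)*y(j)
         end do
      end do
*$OMP END PARALLEL
      end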
The adaptability of the parallel rank-1 update (as well as that of matrix multiplication and several other operations) suggests that HPF-style data distribution directives would be useful. If the data has been distributed among processes sensibly, then many parallel BLAS routines should work well just by inheriting this distribution. However, this thinking misses the critical point - these routines are called from within larger applications, and it is when defining effective data distributions for those applications that the limitations of this approach become clear.
Consider LU decomposition as discussed in Section 4. This application requires a doubly block-cyclic data distribution, see [2] for a justification, where the blocking factor is consistent with that required for good performance from the single processor matrix multiplication. The cyclic distribution is required to achieve a degree of load balance among the processes. The block-cyclic distribution is performed over both rows and columns of the matrix in order to obtain scalable matrix multiply performance. However, the induced 2-D process grid forces the "in-block" factorisation (corresponding to calling dgetf2 in Figure 1) to be performed over multiple processes. This will reduce performance except on very large systems. The block-cyclic data distribution also makes it difficult to overlap this factorisation with updates to the rest of the matrix. While the data blocks are the same size, the amount of computation required over a sub-group of them is about 33% larger.
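For reference, in the standard block-cyclic scheme of [2], a 0-based global row index g with blocking factor nb on a grid with P process rows is owned by process row

  p = mod( g/nb , P )   (integer division),

and the analogous formula with Q process columns applies to the column index; applying both gives the doubly block-cyclic distribution.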
Compare these difficulties with those involved in performing LU decomposition with OpenMP. If the in-block factorisation is not overlapped with other computation, then the code in Figure 1 becomes parallel simply by providing parallel BLAS. The matrix multiply will be partitioned over both row and column indices in the call to dgemm. If this factorisation is to be overlapped with other computation, this can be done relatively simply, as shown in Figure 2. The difficulties arise in trying to combine the overlap with an optimal matrix multiplication while keeping the work distributed evenly among threads, but these could be handled by making thread management in OpenMP more flexible.
Therefore, data distribution directives are useful on distributed memory systems. Something like the dynamic page management discussed in [13] might be a good compromise on NUMA systems. Neither approach appears useful on a uniform memory access system such as the PRIMEPOWER.
However, difficulties with cache reuse remain. If explicit data distributions are not a solution, then what is? Perhaps compilers should become "BLAS-aware" so that potential performance problems can be flagged and possibly fixed during compilation. Clear documentation about the parallelisation strategy used in each routine is one essential way to avoid pitfalls such as alternating between sequential and parallel sections. It would be useful if library providers could agree upon a standard format and terminology for index partitioning information.
There may also be a need for information to be available at run time. One possibility would be for a library of parallel routines to include a "partitioning inquiry" function. Given a routine name, a valid set of input arguments and the number of threads, this function could return a descriptor that defined how the index space of the input and output arrays would be partitioned among the threads. Notice that the intention is to provide this information for a specific instance of a routine invocation. For example, the way in which the array indices are partitioned among threads in a call to dgemm depends not only on the values of the arguments N, M and K but also on whether array A or array B is transposed and on how many threads are available. Given this information, it might be possible for the writer of the calling program to organise the computation done at this level to reduce the amount of cache migration that results from calls to a particular routine.
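Such an inquiry function might look like the following sketch (a purely hypothetical interface and descriptor layout, invented here to illustrate the proposal; a real implementation would have to reproduce the partitioning logic of dgemm itself rather than the fixed column partition used below):

*     Hypothetical partitioning inquiry function for dgemm.  For
*     thread t, desc(1,t):desc(2,t) are the rows and
*     desc(3,t):desc(4,t) the columns of C assigned to it.
      subroutine dgemm_partition(transa, transb, m, n, k,
     $                           nthreads, desc)
      character transa, transb
      integer m, n, k, nthreads, desc(4, nthreads)
      integer t, nb
*     Illustration only: a block partition over columns of C.  The
*     real routine would branch on transa/transb and the problem
*     shape exactly as the parallel dgemm does.
      nb = (n + nthreads - 1)/nthreads
      do t = 1, nthreads
         desc(1, t) = 1
         desc(2, t) = m
         desc(3, t) = min(n, (t-1)*nb + 1)
         desc(4, t) = min(n, t*nb)
      end do
      end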
While this idea has merit, there are many difficulties with it. Firstly, all of the partitioning algorithms employed in a routine such as dgemm need to be accessible from the inquiry function, which also implies that the control structure of each routine is reproduced. When a parallel library routine is written, would it be possible to generate automatically a "shadow" routine that produces the information required by the inquiry function? How general is the issue of cache migration; does it extend much beyond linear algebra? Would inquiry functions provide the information required by users? How complex does a library routine have to become before this type of information cannot be provided, or is of limited assistance in improving performance? Is there merit in standardising the format of the descriptors, possibly to the extent that they become part of the OpenMP specification?

Conclusions
OpenMP provides a convenient means by which users can exploit SMP parallelism. However, obtaining good SMP performance on more than a handful of processors requires careful attention to all of the standard parallelisation issues. OpenMP provides mechanisms to address most of these issues, but the current standard leads to code that is more cumbersome and harder to maintain than is desirable. While OpenMP provides the flexibility and low overheads needed to exploit loop parallelism, it lacks facilities to optimise the performance of a sequence of such parallel loops by exploiting data cache-locality.
There is an argument for including index partitioning as an attribute of the arguments in calls to routines containing parallel loops. There might also be benefits in organising parallel routines so that one call option was to determine the index partitioning but not perform any further computation. However, when parallel routines have a fixed Fortran 77 interface, the problems become more difficult, and it is incumbent upon the library developer to provide users with sufficient information so that they can make informed choices about the best way in which to use such parallel routines.

Figure 2 - OpenMP overlapped dgetrf