Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines

This article discusses the core factorization routines included in the ScaLAPACK library. These routines allow the factorization and solution of a dense system of linear equations via LU, QR, and Cholesky. They are implemented using a block cyclic data distribution, and are built using de facto standard kernels for matrix and vector operations (BLAS and its parallel counterpart PBLAS) and message passing communication (BLACS). In implementing the ScaLAPACK routines, a major objective was to parallelize the corresponding sequential LAPACK using the BLAS, BLACS, and PBLAS as building blocks, leading to straightforward parallel implementations without a significant loss in performance. We present the details of the implementation of the ScaLAPACK factorization routines, as well as performance and scalability results on the Intel iPSC/860, Intel Touchstone Delta, and Intel Paragon System.


INTRODUCTION
Current advanced architecture computers are non-uniform memorv access (NU.VlA) machines.They possess hierarchical memories, in which accesses to data in the upper levels of the memory hierarchy (registers, cache, and/ or local memory) are faster than those in lower levels (shared or offprocessor memory).One technique to more efficiently exploit the power of such machines is to develop algorithms that maximize reuse of data in the upper levels of memory.This can be done by partitioning the matrix or matrices into blocks and by performing the computation with matrix-vector or matrix-matrix operations on the blocks.A set of BLAS (Level 2 and 3 BLAS) [15,16] were proposed for that purpose.The Level 3 BLAS have been successfully used as the building blocks of a number of applications.including LAPACK [1,2], which is the successor to UNPACK [14] and EISPACK [23].LAPACK is a software library that uses block-partitioned algorithms for performing dense and banded linear algebra computations on vector and shared memory computers.
The scalable library we are developing for distributed memory concurrent computers will also use block-partitioned algorithms and be as compatible as possible with the LAPACK library for vector and shared memory computers.It is therefore called ScaLAPACK ("Scalable LAPACK") [ 6], and can be used to solve "grand challenge" problems on massively parallel, distributed memory, concurrent computers [5,18].
The basic linear algebra communication subprograms (BLACS) [;3] provide ease-of-use and portability for message passing in parallel linear algebra applications.The parallel BLAS (PBLt\S) assume a block cyclic data distribution and are functionally an extended subset of the Level L 2, and :3 BLAS for distributed memorv svstems.
They are based on previous work with the parallel block BLAS (PB-BLAS) [8].The current model implementation relies internally on the PB-BLAS . .as well as the BLAS and the BLACS.The ScaLA-PACK routines consist of calls to the sequential BLAS, the BLACS, and the PBLAS modules.Sca-LAPACK can therefore be ported to any machine on which the BLAS and the BLACS are available.
This article presents the implementation details, performance, and scalability of the ScaLA-PACK routines for the LL QR, and Cholesky factorization of dense matrices.These routines have been studied on various parallel platforms by many other researchers [12, 13, 19j.\Ve maintain compatibility between the ScaLAPACK codes and their LAPACK equivalents by isolating as much of the distributed memory operations as possible inside the PBLAS and ScaLAPACK auxiliary remtines.Our goal is to simplify the implementation of complicated parallel routines while still maintaining good performance.
Currentlv the ScaLAPACK librarv contains . .Fortran 77 subroutines for the analysis and solution of systems of linear equations, linear least squares problems, and matrix eigenvalue problems.ScaLAPACK routines to reduce a real 1-!eneral matrix to Hessenberg or bidiagonal form.and a symmetric matrix to tridiagonal form are considered in [ 11].
The design philosophy of the ScaLAPACK librarv is addressed in Section 2. In Section 3. we describe the ScaLAPACK factorization routines by comparing them with the corresponding LA-PACK routines.Section 4 presents more details of the parallel implementation of the routines and performance results on the Intel family of computers: the iPSC/860.the Touchstone Delta.and the Paragon.In Section 5. the scalability of the algorithms on the systems is demonstrated.Conclusions and future work are presented in Section 6.

DESIGN PHILOSOPHY
In ScaLAPACK, algorithms are presented in terms of processes.rather than the proc:Pssors of the physical hardware.A process is an independent thread of control with its own distinct memory.Processes communicate by pairwise point-topoint communication or by collective communication as necessary.In general there may be several processes on a physical processor.in which case it is assumed that the run-time svstem handles the scheduling of processes.For example.execution of a process waiting to receive a message may be suspended and another process scheduled, thereby overlapping communication and computation.In the absence of such a sophisticated operating system, ScaLAPACK has been developed and tested for the case of one process per processor.

Block Cyclic Data Distribution
The way in which a matrix is distributed over the processes has a major impact on the load balance and communication characteristics of the concurrent algorithm, and hence largely determines its performance and scalability.The block cyclic distribution provides a simple, yet general-purpose way of distributing a block-partitioned matrix on distributed memory concurrent computers.The block cyclic data distribution is parameterized by the four numbers P, Q, mh. and nh, where P X Q is the process grid and mt, X n!J is the block size.Blocks separated by a fixed stride in the column and row directions are assigned to the same process.
Suppose we have JJ objects indexed by the integers 0, 1. . . ., :VI -1.In the block cyclic data distribution the mapping of the global index, m. can be expressed as m ~ (p, b. i).where p is the logical process number, b is the block number in process p, and i is the index within block b to which m is mapped.Thus.if the number of data objects in a block is m!J. the block cyclic: data distribution may be written as follows: m ~(~mod P . .l~J. m mod mh) where s = l m/ m 1 ,J and P is the number of processes.The distribution of a block-partitioned matrix can be regarded as the tensor product of two such mappings: One that distributes the rows of the matrix over P processes.and another that distributes the columns over (J processes.That is. the matrix element indexed globally by (m.n) can be written as (m, n) ~ ((p.q) . .(b. d).(i.j)).FIGURE 1 Example of a block cyclic data distribution.Figure 1a shows an example of the block cyclic data distribution, where a matrix with 12 X 12 blocks is distributed over a 2 X 3 grid.The numbered squares represent blocks of elements, and the number indicates at which location in the process grid the block is stored-all blocks labeled with the same number are stored in the same process.The slanted numbers, on the left and on the top of the matrix, represent indices of a row of blocks and of a column of blocks, respectively.Figure 1b reflects the distribution from a process point of view.Each process has 6 X 4 blocks.The block cyclic data distribution is the only distribution supported by the ScaLAPACK routines.The block cyclic data distribution can reproduce most data di~tributions used in linear algebra computations.For example, one-dimensional distributions over rows or columns are obtained by choosing P or Q to be 1.
The nonscattered decomposition (or pure block distribution) is just a special case of the cyclic distribution in which the block size is given by mh = fM/Pl and nb = fN!Ql.That is, (m, n) ~ ( (l,:J l:,,J)' (0, 0), (m mod mb, n mod nb)).
Similarly a purely scattered decomposition (or two-dimensional wrapped distribution) is another special case in which the block size is given by ~ ((m mod P, n mod Q), (l; J, l~J), (0, 0)) •

Building Blocks
The ScaLAPACK routines are composed of a small number of modules.The most fundamental of these are the sequential BLAS, in particular the Level2 and 3 BLAS, and the BLACS, which perform common matrix-oriented communication tasks.ScaLAPACK is portable to any machine on which the BLAS and the BLACS are available.
The BLACS comprise a package that provides ease-of-use and portability for message passing in a parallel linear algebra program.The BLACS _efficientlv support not only point-to-point operatiOns betw~en processes on a logical two-dimensional process grid, but also collective communications on such grids, or within just a grid row or column.
Portable software for dense linear algebra on multiple-instruction, multiple-data (Ml.\1D) platforms may consist of calls to the BLAS for computation and calls to the BLACS for communication.Because both packages will have been optimized for each particular platform, good performan~e should be achieved with relatively little effort.We have implemented the BLACS for the Intel family of computers, the TMC CM-5, and IR\1 SP1 a_nd SP2. and PVM.Several vendors are producmg optimized versions of the BLACS (e.g., Cray.IB~1, and Meiko).We plan to produce an MPI version of the BLACS in the near future.
The PBLAS are an extended subset of the BLAS for distributed memory computers and operate on matrices distributed according to a block cyclic data distribution scheme.These restrictions p.ermit certain memory access and communication optimizations that would not be possible (or would be difficult) if general-purpose distributed Level 2 and Level 3 BLAS were used [7,9].The sequential BLAS, the BLACS.and the PBLAS are the modules from which the hiaherlevel ScaLAPACK routines are built.The PikAS ~re used as the highest level building blocks for I~plementing the ScaL-\PACK library and provide the same ease-of-use and portabilitv for Sca-LAPACK that the BLAS provide for LAPACK.Most of the Level 2 and 3 BLAS routines in LA-PACK routines can be replaced with the corresponding PBLAS routines in ScaLAPACK.so the source code of the top software laver of ScaLA-PACK looks very similar to that. of LAPACK.Thus, the ScaLAPACK code is modular, deaL and easy to read.
Figure 2 shows a hierarchical view of ScaLA-PACK.Main ScaLAPACK routines usuallv call only the PBLAS, but the auxiliary ScaLAPACK routines may need to call the BLAS directlv for local computation and the BLACS for com~uni cation among processes.In manv cases the Sca-LA~ACK library will be sufficier{t to build appli-catiOns.However, more expert users mav make use of the lower-level routines to build cus;omized routines not provided in ScaLAPACK.

Design Principles
ScaLAPACK is a message-passing version of LA-PACK and is designed to be efficient across a wide range of architectures.The performance of the routines relies ultimatelv on the architectural characteristics of the ~achine.HoweveL the codes are flexible in that they allow the tuning of certain parameters such as block size and the size of the process grid.This flexibilitv ensures that the ScaLAPACK routines will be able to achieve good performance.
The ScaLAPACK routines.like their LAPACK equivalents, are designed to perform correctlv for a wide range of inputs.Whenever practicaL .thevbehave appropriately when overflow and under~ flow problems are encountered or a routine is used incorrectly.Examples of error handling are giYen in Sections 4.2 and 4.3.

FACTORIZATION ROUTINES
In this section, we first brieflv describe the sequential, block-partitioned ve;sions of the dense LC, QR, and Cholesky factorization routines of the LAPACK library.B~cause we also wish to discuss the parallel factorizations, we describe the right_-looking versions of the routines.The rightlooking variants minimize data communication and distribute the computation across all processes [ 17].After describing the sequential factorizations, the parallel versions will be discussed.
For the implementation of the parallel blockpartitioned algorithms in ScaLAPACK.we assume that a matrix A is distributed over a P X Q process grid with a block cvclic distribution and a block size of nb X nh matching the block size of the algorithm.Thus, each nvwide column (or row) panel lies in_ one column (row) of the process grid.
In the Lu, QR, and Choleskv factorization routines, in which the distributio~ of work becomes uneven as the computation progresses, a larger block size results in greater local imbalance.but reduces the frequency of communication between processes.There is, therefore, a tradeoff between load imbalance and communication startup cost that can be controlled by varying the block size.
In addition to the load imbalance that arises as distributed data are eliminated from a computation, load imbalance may also arise due to computational "hot spots" ~here certain processes have more work to do between synchronization points than others.This is the cas~, for example, in the LU factorization algorithm where partial pivoting is performed over rows in a sinale column "' of the process grid while the other processes are idle.Similarly, the evaluation of each block row of the U matrix requires the solution of a lower triangular system across processes in a single row of the process grid.The effect of this type of load imbalance can be minimized through the choice of P and Q.

LU Factorization
The LC factorization applies a sequence of Gaussian eliminations to form A = PLU, where A and L are M X N matrices, and U is anN X N matrix.L is a unit lower triangular (lower triangular with 1 's on the main diagonal), U is an upper triangular, and Pis a permutation matrix, which is stored in a min(M, N) vector.
At the k-th step of computation (k = 1. 2, . . .), it is assumed that them X n submatrix of where the block A21 is (m-nh) X nb, and A22 is (m-nh) X (nnb) 0 £11 is a unit lower triangular matrix and u11 is an upper triangular matrix.At first, a sequence of Gaussian eliminations is performed on the first m X nh panel of A k: (i.e .. A11 andA 21 ).Once this is completed, the matrices £11, £21 , and U11 are known, and we can rearrange the block equations, The LlJ factorization can be done bv recursivelY 0 • applving the steps outlined above to the (mnh) X (n -nh) matrix A22. Figure 3

. . nh)]
-IDAMAX: Find the (absolute) maximum element of the i-th column and its location.
-DSWAP: Interchange the i-th row with the row that holds the maximum.
-DSCAL: Scale the i-th column of the matrix.
-DGER: Cpdate the trailing submatrix.The corresponding parallel implementation of the ScaLAPACK routine.PDGETRF.proceeds as follows: 1. PDGETF2: The current column of processe:o performs the LU factorization on an m X nh panel of A (i.e., A11 and A21 ).

. nh)]
-PDAMAX: Find the (absolute) maximum value of the i-th column and its location (pivot information will be stored on the column of processes).
-PDLASWP: Interchange the i-th row with the row that holds the maximum.
-PDSCAL: Scale the i-th column of the matrix.
-PDGER: Broadcast the i-th row columnwise ( (n~; -i) elements) in the current column of processes and update the trailing submatrix.• Every process in the current process column broadcasts the same pivot information rowwise to all column:,; of processes.
2. PDLASWP: All processes apply row interchanges to the left and the right of the current panel.3. PDTRSM: £ 11 is broadcast along the current row of processes, which converts the row panel A12 to U12. 4. PDGEMM: The column panel £ 21 is broadcast rowwise across all columns of processes.
The row panel U 12 is broadcast columnwise down all rows of processes.Then, all processes update their local portions of the rnatrix, A22.

QR Factorization
Given an J/ X N matrix A, we seek the factorization A = QR, where Q is an J;J X :Vf orthogonal matrix and R is an J;f X :V upper triangular matrix.
At the k-th step of the computation, we partition this factorization to the m X n submatrix of A k as A2 1 is (m -nh) X nh, and An i,; (mnh) X (n -niJ).A 1 is an m X nh matrix containing the first nh columns of the matrix A "• andA 2 is an m X (nnh) matrix containing the last !n-nh) columns of ( ( A,,) (A12)) A" i.e., A, A

A22 R22
A snapshot of the block QR factorization is shown in The corresponding steps of the ScaLAPACK routine, PDGEQRF, are as follows: 1. PDGEQR2: The current column of processes performs the QR factorization on an m X nh panel of A ik) (i.e., A 1 ) • [Repeat nb times (i = 1, . . ., nb)] -PDLARFG: Generate elementary reflector v; and T;.

PDLARFT:
The current column of processes, which has a sequence of the Householder vectors V, computes T only in the current process row.3. PDLARFB: Apply Q T to the rest of the matrix from the left • PDGEMM: Vis broadcast rowwise across all columns of processes.The transpose of V is locally multiplied by A 2 .then the products are added to the current process row (W~ vrAz).
• PDTRMM: T is broadcast rowwise in the current process row to all columns of processes and multiplied with the sum UV ~ TTW).• PDGEMM: W is broadcast columnwise down all rows of processes.Now, processes have their own portions of V and W. then they update the local portions of the matrix

Cholesky Factorization
Cholesky factorization factors anN X N, symmetric, positive-definite matrix A into the product of a lower triangular matrix L and its transpose, i.e., A = LLT (or A = U'~'U, where U is upper triangular).It is assumed that the lower triangular portion of A is stored in the lower triangle of a two-dimensional array and that the computed elements of L overwrite the given elements of A. At the k-th step, we partition then X n matrices Ark:, L and £ik'T, and write the system as where the block be done by recursively applying the steps ?utlined above to the (n -nt,) X (n -nh) matrix A22.
In the right-looking version of the LAPACK routine, the computation of the above steps involves the following operations: 1. DPOTF2: Compute the Cholesky factorization of the diagonal block A 11 .

DSYRK: Cpdate the rest of the matrix,
The parallel implementation of the corresponding ScaLAPACK routine, PDPOTRF, proceeds as follows: 1. PDPOTF2: The process P,, which has the nb X nh diagonal block A 11 , performs the Choleskv factorization of A 11 .
• P; performs A 1 1 ~ L 11 L{~ , and sets a flag if A 11 is not positive definite.
• P, broadcasts the flag to all other processes so that the computation can be stopped if A 11 is not positive definite.

PERFORMANCE RESULTS
We have outlined the basic parallel implementation of the three factorization routines.In this section, we provide performance results on the Intel iPSC/860, Touchstone Delta, and Paragon systems.We also discuss specific implementation details to improve performance and possible variations of the routines that might yield better performance.
The Intel iPSC/860 is a parallel architecture with up to 128 processing nodes.Each node consists of an i860 processor with 8 Mbyte of memory.The system is interconnected with a hypercube structure.The Delta svstem contains .512i860-based computational nodes with 16 .\1byte/node, connected with a two-dimensional (2-D) mesh communication network.The Intel Paragon located at the Oak Ridge 1\"ational Laboratory has 512 computational nodes, interconnected with a 2-D mesh.Each node has 32 .\1bvte of memorv . .and two i860XP processors, one for computation and the other for communication.The Intel iPSC/ 860 and Delta machines both use the same 40 MHz i860 processor, but the Delta has a higher communication bandwidth.Significantly higher performance can be attained on the Paragon system, because it uses the faster 50 .\1Hzi860XP processor and has a larger communication bandwidth.
On each node all computation was performed in double precision arithmetic, using assemblycoded BLAS (Level 1, 2, and 3), provided by Intel.Communication was performed using the BLACS package, customized for the Intel systems.Most computations by the BLAS and communication by the BLACS are hidden within the PBLAS.
A good choice for the block size, mb X nh, was determined experimentally for each factorization on the given target machines.For all performance graphs, results are presented for square matrices with a square block size mh = nh.The numbers of floating-point operations for an N X N matrix were assumed to be 2/3 JV'l for the LL factorization, 4/3 N"' for the QR factorization, and 1/3JY 1 for the Cholesky factorization.

LU Factorization
Figure 6 shows the performance of the ScaLA-PACK LL factorization routine on the Intel iPSC/ 860, the Delta, and the Paragon in Gflop (Gflop or a billion floating-point operations per second) as a function of the number of processes.The selected block size on the iPSC I 860 and the Paragon was mb = nh = 8, and on the Delta was mb = nt 1 = 6, and the best performance was attained with a process aspect ratio, 1/4::::; P/Q::::; 1/2.The LL routine attained 2.4 Gflop for a matrix size of;\/ = 10,000 on the iPSC/860; 12.0 Gflop for N = 26,000 on the Delta: and 18.8 Gflop for N = 36,000 on the Paragon.The LL factorization routine requires pivoting for numerical stability.Many different implementations of pivoting are possible.In the paragraphs below, we outline our implementation and some optimizations that we chose not to use in order to maintain modularity and clarity in the library.
In the unblocked LL factorization routine (PDGETF2), after finding the maximum value of the i-th column (PDAMAX), the i-th row will be exchanged with the pivot row containing the maximum value.Then the new i-th row is broadcast columnwise ((nb -i) elements) in PDGER.A slightly faster code may be obtained by combining the communications of PDLASWP and PDGER.That is, the pivot row is directly broadcast to other processes in the grid column and the pivot row is replaced with the i-th row later.
The processes apply row interchanges (PDLASWP) to the left and to the right of the column panel of A (i.e., A 11 and A 21 ).These two row interchanges involve separate communications, which can be combined.
Finally, after completing the factorization of the column panel (PDGETF2), the column of processes, which has the column panel.broadcasts rowwise the pivot information for PDLASWP, £ 11 for PDTRSM, and L 21 for PDGEMM.It is possible to combine the three messages to save the number of communications (or combine L11 and L2 1 ) and broadcast rowwise the combined message.
Notice that nonnegligible time is spent broadcasting the column panel of L across the process grid.It is possible to increase the overlap of communication to computation by broadcasting columns ro"\\rwise as soon as thev are evaluated.rather than broadcasting all of the panel across after factoring it.~With these modified communication schemes, the performance of the routine may he increased, but in our experiments we have found the improvement to be less than 3% and, therefore, not worth the loss of modularity.

QR Factorization
To obtain the elementary Householder vector v;, the Euclidean norm of the vector.A,;, is required.
The sequential LAPACK routine, DLARFG, calls the Level 1 BLAB routine.DNRM2, which computes the norm while guarding against underflow and overflow.In the corresponding parallel Sca-LAPACK routine, PDLARFG, each proces::; in the column of processes, which holds the vector A,;.
computes the global norm safely using the PDNRM2 routine.
For consistency with LAPACK, we have chosen to store T and V and generate T when necessary.Although storing T might save us some redundant computation.we believed that consistency was more important.Them X nb lower trapezoidal part of V. which is a sequence of the nb Householder vectors.will be accessed in the form.
where V1 is nb X n,, unit lower triangular and V1. is (mnh) X nb.In the sequential routine, the multiplication involving Vis divided into two steps: DTRMM with V1 and DGEMM with V2..However, in the parallel implementation.Vis contained in one column of processes.Let V be a unit lower trapezoidal matrix containing the strictly lower trapezoidal part of V. P is broadcast ro"\\rwise to the other process columns so that every column of processes has its own copy.This a_llows us to perform the operations involving V in one step (DGEMM), as illustrated in Figure 7

Cholesky Factorization
The PDSYRK routine performs rank-nh updates on an (nnh) X (n -n,) "ymmetrie matrix A  an (n -ni>) X nh column of blocks L 21 .After broadcasting L 21 rowwise and transposing it, each process updates its local portion of A 22 with its own copy of L 21 and L~1 .The update is complicated by the fact that the globally lower triangular matrix A 22 is not necessarily stored in the lower triangular form in the local processes.For details see [8].The simplest way to do this is to repeatedly update one column of blocks of A2:z.However, if the block size is smalL this updating process will not be efficient.It is more efficient to update several blocks of columns at a time.The PBLAS routine, PDSYRK efficiently updates An by combining several blocks of columns at a time.For details, see [8].
The effect of the block size on the performance of the Choleskv factoriz:Hion is shown in Figure 9 on 8 X 16 and 16 X 16 processors of the Intel Delta.The best performance was obtained at the block size of nb = 24, but relativelv good performance could be expected with the block size of nb ::=::: 6, because the routine updates multiple column panels at a time.
Figure 10 shows the performance of the Cholesky factorization routine.The best performance was attained with the aspect ratio of 1/2 :5 P!Q :S 1.The routine ran at 1.8 Gflop for N = 9600 on the iPSC/860; 10.5 Gflops for N = 26,000 on the Delta; and 16.9 Gflop for N = 36,000 on the Paragon.Because it requires fewer floating-point operations (1/3 N 3 ) than the other factorizations, it is not surprising that its flop rate is relatively poor.
If A is not positive definite, the Cholesky factorization should be terminated in the middle of the computation.As outlined in Section 3.3, a process P; computes the Cholesky factor L 11 from A 11 .After computing L11 , process P; broadcasts a flag to all other processes to stop the computation if A 11 is not positive definite.If A is guaranteed to be positive definite, the process of broadcasting the flag can be skipped, leading to a corresponding increase in performance.

SCALABILITY
The performance results in Figures 6, 8, and 10 can be used to assess the scalabilitv of the factorization routines.In general, concurrent efficiency, E, is defined as the concurrent speedup per process.That is, for the given problem size, N . . on the number of processes used, N 1 ,.
where Tp(N, Np) is the time for a problem of size N to run on Nl' processes and T., (N) is the time to nm on one process using the best sequential algorithm.
Another approach to investigate the efficiency is to see how the performance per process degrades as the number of processes increases for a fixed grain size, i.e., by plotting isogranularity curves in the (l\~,, G) plane, where G is the performance.Because g. the scalability for memory-constrained problems can readilv be accessed bv the extent to which the . .isogranularity curves differ from linearity.!sogranularity was first defined in [ 24]. and later explored in [ 2L 20].
Figure 11 shows the isogranularity plots for the ScaLAPACK factorization routines on the Parag(m.The matrix size per process is fixed at 5 and 20 Mbyte on the Paragon.Refer to Figures 6. 8, and 10 for block size and process grid size characteristics.The near-linearity of these plots shows that the ScaLAPACK routines are quite scalable on this system.

CONCLUSIONS
We have demonstrated that the LAPACK factorization routines can be parallelized fairly easily to the corresponding ScaLI\PACK routines with a small set of low-level modules, namely the sequential BLAS, the BLACS, and the PBLAS.We have seen that the PBLAS are particularly useful for developing and implementing a parallel dense linear algebra library relying on the block cyclic data distribution.In general, the Level 2 and 3 BLAS routines in the LAPACK code can be replaced on a one-for-one basis by the corresponding PBLAS routines.Parallel routines implemented with the PBLAS obtain good performance, because the computation performed by each process within PBLAS routines can itself be performed using the assembly-coded sequential BLAS.

FACTORIZATIOI'\ ROUTI~ES 183
In designing and implementing software libraries, there is a tradeoff between performance and software design considerations, such as modularity and clarity.As described in Section 4.1, it is possible to combine messages to reduce the communication cost in several places, and to replace the high-level routines, such as the PBLAS, by calls to the lower-level routines, such as the sequential BLAS and the BLACS.However, we have concluded that the performance gain is too small to justify the resulting loss of software modularity.
We have shown that the SeaLAPACK factorization routines have good performance and scalability on the Intel iPSC/860, Delta.and Paragon systems.Similar studies may be performed on other architectures to which the BLACS have been ported, including PVM, T.\1C C:M-5.Cray T3D, and IB.\1 SP1 and SP2.
The ScaLAPACK routines are currentlv available through netlib for all numeric data types, single precision real, double precision real, single precision complex, and double precision complex.To obtain the routines, and the ScaLAPACK Reference Manual [10], send the message "send index from scalapack" to netlib@ornl.gov.
FIGURE 3 A snapshot of block LU factorization.

2 .
DLASWP: Apply row interchanges to the left and the right of the panel.3. DTRSM: Compute the nh X (n -nt,) row panel of U, 4. DGEMM: Cpdate the rest of the matrix.A22,

21 and A 2
= An .R 11 is a nh X nh upper triangular matrix.A QR factorization is performed on the first m X nh panel of A 'k (i.e., A 1 ).ln practice, Q is computed by applying a series of Householder transformations to A, of the form, H; =I-T;v;v/"where i = 1, . . ., nh.The vector V; is of length m with O's for the first i -1 entries and 1 for the i-th entry, and T; = 2/(v!v;).During the QR factorization, the vector v; overwrites the entries of A below the diagonal and T; is stored in a vector.Furthermore, it can be shown thatQ = H 1 lh • • • IIn,, =I-VTVT, where Tis n 1 , X nh upper triangular and the i-th column of V equals V;.This is indeed a block version of the QR factorization[ 4,22] and is rich in matrix-matrix operations.The block equation can be rearranged as ( A",2) (R,.,)
FIGURE 5 A snapshot of block Cholesky factorization.

2 .
PDTRSM: L 11 is broadcast columnwise by P; down all rows in the current column of processes, which computes the column of blocks of £21 .3. PDSYRK: The column of blocks £ 21 is broadcast rowwise across all columns of processes and then transposed ..[\.;ow, processes have their own portions of £21 and Lf1 .They update their local portions of the matrix A 22 .

15 .FIGURE 6
FIGURE 6 Performance of the LL factorization on the Intel iPSC/860, Delta.and Paragon.

FACTORIZATIOlFIGURE 7
FIGURE 7The storage scheme of the lower trapezoidal matrix v in ScaLAPACK QR factorization.

Figure 8
Figure 8 shows the performance of the QR factorization routine on the Intel familv of concurrent computers.The block size of mb = n 1 , = 6 was used on all of the machines.Best performance was attained with an aspect ratio of 1 I 4 5: PI Q 5: 1 I 2.The highest performances of 3.1 Gflop for j\/ 10,000 was obtained on the iPSCI860: 14.6 Gflop for N = 26,000 on the Delta: and 21.0 Gflop for N = 36,000 on the Paragon.Generally, the QR factorization routine has the best performance of the three factorizations because the updating process of Q TA = (/ -VTVT)A is rich in matrix-matrix operation and the number of floating-point operations is the largest (4:13 N 1 ).

FIGURE 8
FIGURE 8 Performance of the QR factorization on the Intel iPSC/860, Delta, and Paragon.

~FIGURE 9
FIGURE 9 Performance of the Choleskv factorization as a function of the block size on 8 X 16 and 16 X 16 processes of the Intel Delta (/V = 10,000).