A Static Approach for Compiling Communications in Parallel Scientific Programs

On most massively parallel architectures, the actual communication performance remains far below the hardware capabilities. The main reason for this gap lies in dynamic routing, because the software mechanisms for managing the routing represent a large overhead. This article presents experimental studies on benchmark programs from scientific computing; the results show that most communication patterns in application programs are predictable at compile-time. An execution model is proposed that exploits this knowledge: predictable communications are directly compiled, and dynamic communications are emulated by scheduling an appropriate set of compiled communications. The performance of the model is evaluated, showing that performance is better in static cases and degrades gracefully with the growing complexity and dynamic character of the communication patterns. © 1995 by John Wiley & Sons, Inc.

that have the lowest remote data access to floating-point operations ratio.
Although communication seems to be the bottleneck for parallel architectures, not much is known about the characteristics of the communications used by parallel programs. The first objective of this article is to give some experimental results about the statistical distribution of the communication patterns. The communications that are known at compile-time will be called static, and those that can only be determined at run-time will be called dynamic. To obtain satisfactory statistics, a significant benchmark set has been studied: this set amounts to around 25,000 lines of code written in various dialects of parallel Fortran. The set is composed of two parts: The first is a set of scientific parallel codes, partially handwritten and partially generated by automatic parallelization; the second is a subset of library routines from LAPACK. The dynamic (run-time) occurrences of both static and dynamic communication schemes have been gathered. The main result is that static communications are nearly exclusive in parallelized code and dominant in user programs, whereas the situation is much more complex in library routines.
We are interested in this taxonomy (static/dynamic) not for classification purposes but because a considerable speedup in parallel computations can be achieved by a careful exploitation of the compile-time information about static communications. In fact, a parallel execution model where the communications are computed at compile-time can achieve the hardware's raw performance for the most frequently used static communication schemes. This contrasts with the actual communication performance of most parallel architectures, which is dominated by the communication protocol overhead. However, the overall speedup must take into account the contributions of all communication types, both static and dynamic (Amdahl's law). The task is then to assess the penalty of compiling the dynamic communications. This is very difficult, because many factors are involved, and it is almost impossible to quantify their respective impacts and interactions. Nevertheless, meaningful results can be derived by evaluating, for broad classes of communication schemes, the speedup achieved on each class by the static execution model. As a testbed, we compare the CM-5 communication figures with the expected performance of the static model. The speedup is significant, even in the dynamic case.
The rest of this article is organized as follows. The first section discusses dynamic routing, the basic communication mechanism of almost all parallel architectures, and the background of compiled communications. The second section presents a classification of communication schemes. The third section is devoted to the experiments, methodology, and results. Finally, we assess the cost of emulating dynamic communications in the static model and present the expected performance.

Dynamic Routing
Almost all massively parallel architectures use asynchronous dynamic routing, which means that the routing circuits in each network node determine the path of each message at run-time. This requires extra hardware (the routing circuits) and network bandwidth (the address header carried by each message). The routing is asynchronous in the sense that the latency of the messages depends on the network load, and thus is unknown: a processor/network interface is necessary to synchronize the message and the computing threads. The overhead of this interface is large: For instance, it costs more than 90% of the latency of the Paragon machine [13], and it is from 3 to 90 μs for the CM-5 [20, 23].
One could expect that, for large data transfers, this overhead would ultimately vanish. In fact, a significant part of the effort in practical parallel programming is careful data organization in order to pack the data such that the transfers are of the appropriate size; a lot of research is devoted to sophisticated compilation techniques, such as message vectorization, with the same goal [28].
However, the startup penalty is so high that effective use of the network is extremely difficult. For instance, to use half of the peak bandwidth of the network, the message size must be more than 1 kilobyte for the CM-5; to reach full use of the bandwidth, the message size must be more than 8 kilobytes [6].
Moreover, parallel scientific programs are highly synchronous, because communications come from parallel array statements: in general, consecutive communications must proceed only in lockstep fashion. Thus, the major opportunity to enlarge the message size comes from virtualization. In a data-parallel language, the parallelism is not limited: For instance, the FORALL instruction has the semantics of evaluating first the right-hand side of an assignment, then performing the assignment. However, the available parallelism on a particular computer is clearly limited by the number of processors. To take into account the limitation of the actual parallel computer, the unlimited parallelism of the source code is folded onto the limited parallel computer by automatic or user-defined distributions such as cyclic, block, or block-cyclic. This is virtualization. For instance, consider the parallel assignment Forall (i = 0:14) a(i) = b(i+1) on a four-processor machine. Each processor has to iterate sequentially over its own piece of arrays a and b to exchange data and compute. In particular, each processor sends to another one from three to four array elements; sending one piece of data per message is highly inefficient, and aggregating the data to be sent to one processor into one message is known as message vectorization [16]. However, message vectorization is limited by the virtualization ratio (roughly speaking, the ratio between the size of a FORALL index set and the machine size). A high startup penalty limits the efficiency of massively parallel architectures on huge problems. This overhead can be greatly reduced if analyzing the communications at compile-time provides some knowledge of the communication behavior at run-time. The hardware design and software tools that provide efficient means to use this knowledge have been developed in the PTAH project. They are beyond the scope of this article: the architecture is described in [4] and the principles of the compiler in [10].
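The aggregation in the Forall example above can be sketched as follows. This is a minimal counting model, not the PTAH compiler's actual code; the function name and the cyclic distribution (element i owned by processor i mod P) are assumptions made for illustration:

```python
def vectorized_messages(n, P, shift):
    """For Forall (i = 0:n-1) a(i) = b(i + shift) with arrays distributed
    cyclically (element i on processor i mod P), count how many elements
    each processor sends to each other one when the per-element transfers
    are aggregated into one message per (source, destination) pair."""
    msgs = {}
    for i in range(n):
        src, dst = (i + shift) % P, i % P  # owner of b(i+shift), owner of a(i)
        if src != dst:
            msgs[(src, dst)] = msgs.get((src, dst), 0) + 1
    return msgs

# The 15-iteration, 4-processor example: each processor ends up sending
# a single message of three or four elements to one neighbor.
print(vectorized_messages(15, 4, 1))
```

With vectorization, each processor thus issues one message instead of three or four, which is exactly the saving limited by the virtualization ratio.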
The results presented in this article indicate that, at least in scientific programs, a large part of the communications can be determined from analysis of the source code. Moreover, almost all other programs provide information that can be used to limit the communications overhead. In fact, the idea that a lot of communication patterns in scientific programs can be determined at compile-time is the cornerstone of vectorizers and automatic parallelizers. In the following sections, we consider a number of parallel programs and quantify this idea.

Compiled Communications
In the static execution model, all the parameters of the communications are computed at compile-time. This model has been exemplified in the IBM GF11 [17], in the iWarp ConSet [25], and by the Communication Compiler of the TMC CM-2 [7]. The model assumes an off-line routed network. Off-line means that the message paths are computed in the back-end compiler, by a "communication generator" that is, for communication, the equivalent of the code generator for computation. All the physical parameters of a communication are then computed at compile-time. At run-time, the switch settings are simply scheduled under program control. This is the opposite of the on-line routing model, where the message paths are determined at run-time, the network routing circuits acting on the addresses as an interpreter. The compilation problem is to embed the communication graph into the physical network.
Off-line routing improves the network throughput by removing the overhead of address headers encapsulated within each message. As no more routing decisions have to be made, the latency can ultimately be reduced to the hardware propagation delay. Finally, shifting the routing task from run-time to compile-time allows more complex routing algorithms, resulting in better resource (links and buffers) utilization. Theoretical studies [15, 21, 22] show that, for some interconnection networks, off-line routing is feasible in the sense that the off-line routing algorithm has acceptable complexity, and may be asymptotically optimal [19]. The practical experiments on the CM-2 [7] show that a one order of magnitude speedup can be achieved by off-line routing on the hypercube, without any additional hardware; the simulated annealing algorithm provides global optimization of the link allocation.
Off-line routing supposes that the communication generator may be fed with the communication graph, which has been constructed by the compiler. This issue is beyond the scope of this article; however, recent research in the message-passing framework [14, 28] and in the static framework [10] provides techniques to tackle it. Moreover, these techniques remove the potential drawback of the first experiments on the CM-2, which was the long compilation time: As a formal description of the graph can be exhibited, the complexity of the off-line routing process can be simplified in many cases.

COMMUNICATION PATTERNS
As our benchmarks are written in data-parallel Fortran (CM Fortran, Fortran 90, High Performance Fortran [HPF]), the following discussion uses HPF syntax. However, this only exemplifies the main data-parallel communication feature: The communications are implicit, derived from operations on parallel data structures (arrays in Fortran). In HPF, parallel data operations come either from FORALL loops, or array notations, or intrinsics that summarize multiple parallel data operations. As each of these structures involves parallel array references, our taxonomy begins with a classification of parallel references.

Parallel References
A typical parallel construct is a nest of FORALL loops, as illustrated next:

Forall (i1 = a1:b1:c1, ..., in = an:bn:cn) A(M I + V) = ...

where M is an m × n integer matrix, I the vector of FORALL indices, and V a vector in Z^m. We give an example from Jacobi's method for the Laplace solver:

Forall (i=2:9, j=2:9)
  A(i,j) = (A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1)) * 0.25
endforall

Here, there are five parallel references to A (1 store and 4 fetches); the first one (A(i-1,j)) may be expressed with M the identity matrix and V = (-1, 0). Affine references where M and V include only numerical constants are called static, and nonstatic affine references are called parametric. For example, the parallel reference A(i-1,j) is static,
whereas a reference such as A(i+k, j) will be parametric if k is a variable that is not a FORALL index, as in the following assignment. This scheme is dominant in LAPACK routines. In fact, a finer classification would be possible: If the vector V is a scalar variable, the reference can occasionally be determined at compile-time, for instance if V linearly depends on sequential loop subscripts, as in the previous example. However, using this information in the static execution model would require unrolling the sequential loop to compute the communication patterns. As the sequential index set is almost always too large to allow this optimization, there is no point in using a finer classification. In our benchmarks, nonlinear references were represented by gather and scatter operations, where the array subscripts are themselves array elements, the generic form being A(L(I)).
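The static/parametric distinction reduces to a test on the free variables of a reference. The sketch below is illustrative only; the function name and the representation of a reference by its set of subscript variables are assumptions, not the paper's actual analyzer:

```python
def classify_reference(subscript_vars, forall_indices):
    """A parallel reference A(M*I + V) is static when M and V contain only
    numerical constants, i.e., every variable appearing in the subscripts
    is a FORALL index; otherwise it is parametric."""
    return "static" if set(subscript_vars) <= set(forall_indices) else "parametric"

print(classify_reference({"i", "j"}, {"i", "j"}))       # A(i-1, j)
print(classify_reference({"i", "k", "j"}, {"i", "j"}))  # A(i+k, j), k sequential
```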

FORALL Communications
In the typical parallel instruction Forall (I in S) A(...) = B(...), the communication patterns depend on the computation location rule and on the mapping. We consider the Owner Computes Rule, which is used by most existing parallel compilers and assumed by many researchers in this field: it means that the computing processor is the destination processor. The mapping between arrays is created by the ALIGN directives. If an array is compressed along one dimension, the corresponding FORALL subscript must not be considered for classification, because it is not a parallel dimension. For instance, if A is of dimension 2 and compressed along its second dimension, then A(i, j) is located on the same processor as A(i, 0). With these assumptions, a communication occurs for each array in the right-hand side of the parallel assignment if combining the mapping and the Owner Computes Rule does not result in an intraprocessor assignment. The communication is labeled by the worse case of the two references; e.g., left and right members both affine static will result in a static communication, but one nonaffine member will result in a nonaffine communication, and so on.
A typical use of the FORALL notation is to describe partial permutations of the index set. Although the FORALL syntax does not preclude more complex schemes, efficient programming would encapsulate such patterns in intrinsics to take advantage of the global communication features of the target architecture.

Intrinsic Communications
In data-parallel Fortran languages, complex data transfers can be described by special functions that are part of the intrinsics. The most important communication intrinsics implement multireduction (multiple many-to-one communication), multibroadcast (multiple one-to-many), special permutations, and gather/scatter operations.
The reduction intrinsics are SUM, ALL, ANY, MAXVAL, MINVAL, MAXLOC, and MINLOC. They compute the result of applying an associative operator to all the elements of their array argument. The respective operators are sum, logical and, logical or, max, and min; MAXLOC (resp. MINLOC) returns the location of the maximal (resp. minimal) value. The reduction intrinsics have three parameters: for instance, SUM(ARRAY, DIM, MASK) adds the elements of ARRAY along the dimension DIM, selecting the elements described by MASK. We considered that a reduction intrinsic is static as soon as the ARRAY parameter is a static reference and the DIM parameter is a constant: The unit element of the operator (e.g., 0 or 0.0 for a SUM, or the IEEE infinity for a floating-point MINVAL) can replace the masked references, and this local test can be done at run-time.
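The run-time mask handling described above can be sketched as follows (an illustrative helper, not actual library code): each processor locally substitutes the operator's unit element for its masked-out elements, so the compiled static reduction pattern itself never changes:

```python
def masked_sum(array, mask):
    """Emulate SUM(ARRAY, MASK=...): masked-out elements are replaced
    locally by the unit element of + (zero), then the static reduction
    pattern runs unchanged over all elements."""
    return sum(x if m else 0 for x, m in zip(array, mask))

print(masked_sum([1, 2, 3, 4], [True, False, True, True]))  # 8
```

The same trick works for MINVAL or MAXVAL with the appropriate infinity as the unit element.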
The syntax is SPREAD(SOURCE, DIM, NCOPIES); to compute the communication scheme at compile-time, the SOURCE parameter must be a static reference and DIM must be a constant (in this case, the pattern is considered static). In the following, we call broadcast a one-to-many pattern, multibroadcast a segmented broadcast, reduction a reduction that results in a scalar, and multireduction a segmented reduction.
Examples of special permutation intrinsics are the cyclic and noncyclic SHIFTs and TRANSPOSE.
All these intrinsics summarize a FORALL permutation and require the same analysis. More complex intrinsics, such as MATMUL and DOT_PRODUCT, are intended to allow an optimal implementation of basic linear algebra operators. These intrinsics will be considered static if their parameters are static or scalar constants.

The Benchmark Set
Three benchmark sets have been analyzed (Table 1). Recent work [3] focuses on the dynamic-to-static transformation; hence, statistics in this field may not be significant at the present time.

Methodology
The tool used for analysis is a parser built from the Tiny tool set [29]; it consists of an intraprocedural constant-propagation package and a program for automatic reference analysis, based on the abstract syntactic representation, that we developed. The output of these tools is a characterization of

Results
Tables 2 to 5 present the statistics. Tables 2 and 4 give the formal expression as a function of the parameters, respectively, for static and dynamic communication patterns; Tables 3 and 5 give the numerical percentages of the total communication patterns. The first column is the benchmark name (almost all the point-to-point communications are translations, apart from LAPACK block-QR, where the scheme is a matrix transpose); the "Broadcast" and "Reduction" columns are, in general, multibroadcasts and multireductions; the column "Special" gathers all the instances of the intrinsics MATMUL and DOT_PRODUCT and, for the Weather Climate benchmark, calls to the fast Fourier transform (FFT) library routine. The column "Total" in Tables 3 and 5 is the partial total of each broad class, static and dynamic.
Most of the application benchmarks have a high percentage of static communications, the exceptions being Cluster Spin and Simplex. However, the dynamic references there involve a sequential index J; as J ranges over the matrix linear size, no loop unrolling may be considered. On the other hand, although the 2D FFT seems fully parametric, this is mostly an implementation artefact: The communication patterns of an FFT are the folding onto the processor set of the well-known butterflies, and are known at compile-time, at least if the array argument of the FFT is static.

PERFORMANCE EVALUATIONS
The previous results indicate that the static communications are frequent enough to deserve specific optimizations, such as the static execution model. However, Amdahl's law requires a comparison with the speedup expected from these optimizations, and the penalty when executing dynamic communications. This evaluation needs to take into account details of the hardware and software underlying the static execution model. The basic assumptions are the following:

1. The overall architecture is distributed-memory MIMD, with P processors.

2. The network is strictly synchronous and controlled in a lockstep fashion. In some sense, this is the single-program multiple-data (SPMD) execution model, but as an assumption at the hardware level.
3. For each communication, the data incoming from each processor has a fixed size.
4. The routing is off-line, which means that the routing switches do no processing at all. They only orient the messages according to a configuration given by the processors before sending the whole data set. The configuration of the switches for one data set is called a communication pattern. All the useful patterns (those that the network can use in a run) are compiled.

5. The network can realize any permutation in constant time. This time is the basic unit of the network operations and is called an elementary step in the following.
Among general-purpose commercial parallel machines, none has an interconnection network with these properties. However, such a network has been successfully built for the GF11, a research prototype of IBM. The iWarp network may be used in this manner, although the fact that it is primarily intended for message passing raises the time cost of its static use; many research studies, especially in the field of optical interconnection networks, consider off-line routed networks [27]. For an in-depth presentation of such networks, see [5, 9, 17]. We must stress that, as the network cannot do any on-line routing, dynamic patterns have to be emulated by a sequence of static (i.e., compile-time computed) patterns. The size of such a sequence is the emulation cost of dynamic communications.
In the following, we assume that the shape of the processor set matches exactly the shape of the arrays, and that each processor owns only one datum, which has the prescribed size. The issues of generating code for cyclically or block-cyclically distributed arrays have been successfully treated in the PTAH compiler and are not described here. The impact of virtualization on performance will be outlined in a later section.

Permutations
We first consider the simplest parametric permutations (shifts, cyclic shifts, transpositions) and study the case of gather/scatter operations later.

Parametric Shifts
A one-dimensional parametric shift may be defined by three parameters: the domain bounds and the value of the shift. The following example shows a parametric shift where the domain is limited by s and f and the shift value is k.

Forall (i = s:f) A(i) = B(i+k)
To cope with the domain parameters, the communication pattern is extended to all processors (using a temporary array) and the final store is conditioned by membership in the domain. Without virtualization, the previous code becomes:

Forall (i = 0:P-1)
  Where (s <= i and i <= f)
    A(i) = Temp(i)
  endwhere
endforall

Now, parametric shifts depend on only one parameter, the value of the shift. It is possible to define all the communication patterns corresponding to all the shifts inside the processor set, and to use k (or k mod P in the case of virtualization) to select at run-time the appropriate communication pattern. However, each pattern has a significant storage cost, for instance O(P log P) bits for a Benes network, leading to O(P^2 log P) for the P possible shifts (log means log_2). A reasonable solution is to use only power-of-two shifts, and to emulate the k-shift by the following procedure, where V is the array to be shifted, P the number of processors, s and f the limits of the domain of V, u the value of the shift, and AND the bitwise operation. In this case, the actual value of u will be k, or k mod P if virtualization occurs. Thus, the emulation cost, which is the number of patterns to be scheduled, is log P.
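The power-of-two decomposition can be sketched as follows. This is a minimal sequential model of the scheduling, assuming P is a power of two and cyclic 2^j-shift patterns; the function name and the None-for-outside-domain convention are illustrative, not the paper's actual procedure:

```python
import math

def emulate_k_shift(V, u, s, f):
    """Emulate A(i) = V(i + u) for s <= i <= f by scheduling only the
    log2(P) compiled power-of-two cyclic-shift patterns: for each set
    bit j of u, apply the compiled 2^j-shift; finish with a conditional
    store restricted to the domain s:f."""
    P = len(V)                        # P assumed to be a power of two
    temp = list(V)
    for j in range(int(math.log2(P))):
        if u & (1 << j):              # schedule the compiled 2^j-shift pattern
            d = 1 << j
            temp = [temp[(i + d) % P] for i in range(P)]
    # conditional store: only processors inside the domain keep a result
    return [temp[i] if s <= i <= f else None for i in range(P)]
```

At most log P patterns are scheduled, matching the emulation cost stated above.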
For multidimensional shifts like A(i, j) = B(i + k1, j + k2), where A and B are matrices, the same method holds, except that we have to define the input parameter u as a vector. Assuming that the n-dimensional processor geometry (two-dimensional in this example) is linearly mapped to a numbering of the processor set, in row (or column) major order, the (u1, u2) vector shift ultimately produces a shift with value p·u1 + u2, where p is the extent of the processor geometry in the first dimension.
Parametric cyclic shifts are split into two shifts, the modulo part and the nonmodulo part. A priori, 2 log P steps are needed, but as we can interleave the two patterns, the number is only log P steps.

Parametric Transpositions
The general form is Forall (i=s1:f1, j=s2:f2) A(i,j) = B(j,i). The only parameter required is the domain of the transposition. One solution is first to do a parametric shift of B so that the source corner of the domain goes to (0, 0). This can be done in log P steps. The result of this first shift is stored in a temporary array. Then the transposition of the temporary array takes only one step. Finally, the result is stored in A with a parametric shift. The whole operation takes 2 log P + 1 steps.
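The three phases can be sketched as follows. The assignment form A(i,j) = B(j,i) and a power-of-two square processor geometry are assumptions of this sketch; the shifts are written as whole-array index arithmetic standing for the compiled patterns:

```python
def parametric_transpose(B, s1, f1, s2, f2):
    """Sketch of the three-phase parametric transposition for
    Forall (i=s1:f1, j=s2:f2) A(i,j) = B(j,i); phase costs are
    log P, 1, and log P elementary steps respectively."""
    P = len(B)
    # Phase 1 (log P steps): parametric shift so the source corner reaches (0,0).
    T = [[B[(x + s2) % P][(y + s1) % P] for y in range(P)] for x in range(P)]
    # Phase 2 (1 step): static, compile-time computed transposition pattern.
    T = [[T[y][x] for y in range(P)] for x in range(P)]
    # Phase 3 (log P steps): parametric shift into place, with conditional store.
    A = [[None] * P for _ in range(P)]
    for i in range(s1, f1 + 1):
        for j in range(s2, f2 + 1):
            A[i][j] = T[(i - s1) % P][(j - s2) % P]
    return A
```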

Gather and Scatter Operations
These are the most difficult communications for the static paradigm. The data referenced are in an array that is dynamically computed. The scatter operation sorts an array B according to the indices L: Forall i ...

A(L(i)) = B(i)
And the gather operation is: Forall i ...

A(i) = B(L(i))
A parallel gather operation makes sense only if the mapping of the index set onto itself is a one-to-one operation. Let the array K be defined by K(L(i)) = i; the gather operation may then be written as the scatter operation A(K(i)) = B(i). Building K at run-time requires one scatter operation. Hence, a gather operation is amenable to two scatter operations.
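The reduction of a gather to two scatters can be sketched as follows (assuming L is one-to-one; the function name is illustrative):

```python
def gather_as_scatters(B, L):
    """Realize the gather A(i) = B(L(i)) with two scatter operations:
    first scatter the identity through L to build K = L^-1 (K(L(i)) = i),
    then scatter B through K (A(K(i)) = B(i))."""
    n = len(B)
    K = [None] * n
    for i in range(n):          # scatter 1: K(L(i)) = i
        K[L[i]] = i
    A = [None] * n
    for i in range(n):          # scatter 2: A(K(i)) = B(i)
        A[K[i]] = B[i]
    return A

print(gather_as_scatters([10, 20, 30], [2, 0, 1]))  # [30, 10, 20]
```

Indeed, with j = K(i) we have i = L(j), so A(j) = B(L(j)) as the gather requires.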
Usually the gather operation is used to pack an array into a smaller one, whereas the scatter operation expands an array. We assume first that the arrays have the same size and that there is no conflict while reading or storing elements. We study array size differences and conflicts later.
To emulate dynamic routing, the key idea [18] is to sort the destination addresses of the data to be routed. The sorting algorithm uses the principle of the odd-even merge sorting network. Figure 1 shows this principle, where the list L is to be sorted: if each message follows the number of its receiver, the network realizes the scatter operation A(L(i)) = B(i). At each stage of the sorting network, crossing links symbolize the comparison of two values and possibly their exchange.
As the switches do not have any logic, the network cannot perform the comparisons. We simulate each stage of the odd-even network by a crossing of our network and a comparison inside the processors. As the links between the stages are static, it is possible to compile each corresponding permutation. The number of patterns to schedule is log P (log P + 1)/2, i.e., O(log^2 P).
Using an odd-even merge sorting network to realize a scatter operation communication.
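Batcher's odd-even merge network that underlies this emulation can be sketched sequentially as follows; the grouping into (p, k) rounds exposes the stage count, and the code is an illustrative model of the comparator stages, not the compiled patterns themselves:

```python
def oddeven_sort(x):
    """Batcher's odd-even merge sorting network applied in place.
    Each (p, k) round below is one parallel comparator stage, so for
    n = 2^t inputs there are t(t+1)/2 stages — exactly the number of
    compiled permutation patterns to schedule."""
    n, stages = len(x), 0
    p = 1
    while p < n:
        k = p
        while k >= 1:
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    # comparators never cross a merge boundary of width 2p
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        if x[i + j] > x[i + j + k]:
                            x[i + j], x[i + j + k] = x[i + j + k], x[i + j]
            stages += 1
            k //= 2
        p *= 2
    return x, stages

print(oddeven_sort([3, 5, 1, 0, 4, 7, 6, 2]))  # ([0, 1, 2, 3, 4, 5, 6, 7], 6)
```

For P = 8 this gives the 3·4/2 = 6 stages announced by the log P (log P + 1)/2 formula.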
Consider the case where A is larger than B. In the example, let L be equal to 3, 5, 1, 0, 4, 7, 6, and assume that the network sorts the values into the sorted list 0, 1, 3, 4, 5, 6, 7, but the values are not all located at their destinations. However, sending them to their destinations is a monotone routing problem. Monotone means that the source-to-destination map is a monotone function. We can realize monotone routing using the greedy routing algorithm on the butterfly network; monotone routing of a sorted list on hypercubic networks is conflict free [18]. Figure 2 presents the example of monotone routing in the butterfly network. On stage k of the butterfly, the network transmits the data according to bit k of the destination address. Each stage of the butterfly is emulated by one permutation in our network and by the test of bit k (for stage k) by the processors. The number of permutations scheduled is O(log P). As monotone routing is conflict free, the routing process remains very simple for the processing elements (no buffering or priority management).
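The greedy butterfly routing can be sketched as follows. The representation (one optional (destination, value) item per node) and the most-significant-bit-first stage order are assumptions of this sketch:

```python
def monotone_route(slots):
    """Greedy routing of a sorted (monotone) destination list on a butterfly:
    at the stage for bit k, each item moves to the node whose bit k matches
    bit k of its destination (MSB first here, an illustrative choice).
    Monotonicity guarantees the routing is conflict free, so the nodes
    need no buffering or priority management."""
    P = len(slots)                    # slots[i] = (dest, value) or None
    k = P.bit_length() - 2            # highest bit index; P a power of two
    while k >= 0:
        nxt = [None] * P
        for i, item in enumerate(slots):
            if item is not None:
                dest = item[0]
                j = (i & ~(1 << k)) | (dest & (1 << k))  # fix bit k of position
                assert nxt[j] is None  # conflict-free on monotone input
                nxt[j] = item
        slots = nxt
        k -= 1
    return slots
```

Each `while` iteration is one compiled permutation plus a local bit test, giving the O(log P) schedule.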
Storing conflicts are prohibited for a scatter operation, but reading conflicts are possible for a gather operation. In this case, the communications must be partially sequentialized. First, the odd-even sorting network sorts the destinations that can be realized without conflict. The sorted list shows repetitions at contiguous stages. These repetitions lead to conflicts while executing the monotone routing. If two identical references are located on the same processor, it stores one of them in a temporary buffer and carries on with the routing; then a second stage is started for the buffered messages. After that, a second scatter operation takes place. This procedure is expensive; however, the most complex case is where a multicast is hidden in the gather operation, and thus it will also be expensive with any routing mechanism.

Broadcasts
Broadcasts and multibroadcasts have two possible origins: one-to-many gather operations and the SPREAD intrinsic. Assume the network is a Benes network [18]. Benes networks are rearrangeable: Any permutation may be routed without conflict. Hence, an elementary step is one network crossing in this particular case. However, the results may be extended, up to a constant factor, to any network emulating the well-known butterfly network in a finite number of steps, because a Benes network may be considered as two back-to-back butterfly networks [13]. In particular, Omega and Inverse Omega networks are topologically equivalent to the butterfly network.
Consider simple broadcasts; any static broadcast can be completed in one step and any parametric broadcast in log P + 1 steps. If the broadcast source is a program scalar, the broadcast costs nothing, because all processors own the data (by parallel execution of the scalar code or any other way). Thus, we need only consider the case of broadcasting an element of a parallel array. Any input of the Benes network is the root of a P-leaf complete binary tree. Thus, the static broadcast costs one step.
A parametric broadcast cannot use the same technique. Even though the broadcasting tree does exist, the exact setting of the switches is not known at compile-time, because the position of the root is a program variable. The simplest means to perform a parametric broadcast is to shift the source to a fixed position (e.g., processor 0) and to use a static broadcast. Shifting the data is a parametric point-to-point communication, and has the same cost as a parametric translation. Significant results have been obtained about the implementation of the most general multibroadcast patterns on butterfly and other hypercubic networks [18]. However, their implementation in the static execution model incurs extremely high costs, because they invoke irregular segmented prefix operations. Thus, the problem of compiling multibroadcast patterns must be carefully stated.
Consider the following legal HPF code: With L not one-to-one, there are only two ways to compile such patterns: serializing the FORALL loop, as shown previously, or using the multibroadcast algorithms of [18]. We only outline the proof. To avoid a lot of subscripts, we consider a generic example whose result is a two-dimensional array B, with B(i,j) = A(k,j) for all i and j, 1 ≤ i ≤ n and a ≤ j ≤ b.
Figure 3 gives an example, with p = 4 and q = 2. With DIM equal to 2, we would have to consider the reverse butterfly. More general dimensions come under the same analysis, as it depends only on the division of a processor address into log p bits for the fixed dimensions plus log q bits for the parallel dimensions. If the dimension of an array is not a power of 2, we embed the array in an array of power-of-2 size, execute the multispread on the temporary array, and conditionally store the result according to the real size.
As the Benes network includes two back-to-back butterfly networks, it can emulate this action in one step, so that the multibroadcast using the SPREAD intrinsic takes one step.
Thus, row k of A will be copied onto the first row of B, and a static spread can take place in one step. Finally, we have to move the result to the correct position with another parametric translation requiring log P steps. As a remark, if DIM is a variable, we can compile the static spread for each dimension, because the number of dimensions is generally low. Moreover, if the domain of the multispread is variable, again a global multispread can be performed on a temporary array, conditionally storing the data according to the real domain. These figures may seem quite high; however, all the available parallelism is exploited. Moreover, for static multibroadcasts, the solution is optimal in the sense that there is only one step. This contrasts, for instance, with the CM-5 broadcasting capabilities, which are limited to one processor at a time.

Multireduction
The (multi)reduction differs from the (multi)broadcast in the sense that the network has to combine values. Combining values means that each network switch can forward a unique result computed from its inputs by an associative operator (sum, max). We can realize the static (multi)reduction by combining the butterfly with our network: Each stage of the butterfly is executed by a crossing of our network, and the combining operation is realized on the processors. Thus, the number of routing steps is equal to the number of stages in the butterfly, i.e., log P.
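The butterfly combining can be sketched as follows (an illustrative sequential model: the exchange at each stage stands for a compiled pattern, and the combining is done "by the processors" as in the text):

```python
def butterfly_reduction(values, op):
    """Static reduction on a butterfly: log2(P) stages; at the stage for
    bit k, every processor combines its value with that of the processor
    whose number differs in bit k. The network only realizes the compiled
    exchange pattern; op is applied locally by the processors."""
    P = len(values)                  # P assumed to be a power of two
    vals = list(values)
    k = 1
    while k < P:
        vals = [op(vals[i], vals[i ^ k]) for i in range(P)]
        k *= 2
    return vals   # after log P stages, every processor holds the result

print(butterfly_reduction(list(range(8)), lambda a, b: a + b))
```

After the log P stages every processor holds the reduction result, which also covers the broadcast of the scalar result for free.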
In the case of a parametric (multi)reduction, again we perform a parametric shift to move the data to a fixed position (for instance, beginning at processor 0); then we apply the static (multi)reduction with a conditional store, and perform a parametric shift to move the result to the correct position. Thus, it takes 3 log P steps.

C     matrix conditioning
      For all (i ...
      ...
      end do
C     the *product is a pointwise product
This algorithm was first designed for reasons that are similar to our objective, i.e., to get the best performance from a grid network and to avoid general communications. The grid network may, in turn, be emulated under the general assumptions stated at the beginning of this section, with one step for each of the grid NEWS (North East West South) directions.
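A Cannon-style grid multiply is one concrete instance of the kind of algorithm described here; the sketch below is our own illustration, not the paper's code. After a pre-skew (the "matrix conditioning"), each of the n iterations performs a pointwise product followed by one West shift of A and one North shift of B, so only grid (NEWS) communications are ever needed.

```python
# Hedged illustration: Cannon-style matrix multiply on an n-by-n grid,
# using only pointwise products and NEWS shifts.
def cannon_multiply(A, B):
    n = len(A)
    # conditioning: skew row i of A left by i, column j of B up by j
    A = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    B = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):
        for i in range(n):
            for j in range(n):
                C[i][j] += A[i][j] * B[i][j]   # pointwise (*) product
        # one West shift of A and one North shift of B per iteration
        A = [[A[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        B = [[B[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C

print(cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```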

Comparison
Comparisons between theoretical studies and actual machines are both presumptuous and unrealistic. Thus, the following results are not intended to compare what would be the execution of any program on the CM-5 and on a possible static machine. We consider the figures from the CM-5 network only as a testbed, i.e., as giving the orders of magnitude for the performance of a recent dynamic routing network. Two parameters characterize the performance of a network: let r_∞ be the maximal network bandwidth per node and s the time to transmit a zero-sized message. To an approximation, r_∞ depends on the network bandwidth and on the source and destination memory bandwidth. With pipelined communications, the latency of a data transfer is t(L) = s + L/r_∞ (formula 1), where L is the data transfer size. With careful optimization, in the infinite data-size limit, the performance will be limited only by the processor's performance if the communication-to-computation ratio is lower than 1, and by the asymptotic network performance (r_∞) if this ratio is larger than 1. In fact, assuming equal bandwidth performance, being better on "little" problems is the only advantage that one model has over the other. We consider two characteristic figures for this comparison: τ and L_1/2, the size for which the network reaches half of its maximal bandwidth. L_1/2 is the communication analog of the so-called n_1/2 for vector computations [12]. τ measures the performance for programs where significant data transfer pipelining is not possible. The reason may be a very low virtualization ratio or the peculiar characteristics of the algorithm. For instance, a blocked algorithm with block data distribution will produce few communications; if the communications are not overlapped with the computations, τ will give the actual performance in most practical cases. On the other hand, L_1/2 gives one estimate of what would be an effective size for a problem if the communications dominate the computations but can be arranged to exploit fully the network bandwidth in the asymptotic limit.
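The pipelined latency model and the half-bandwidth size can be checked with a short computation. The numbers below (1 µs startup, 100 MB/s peak) are purely hypothetical placeholders, not measurements from the paper; under the model t(L) = s + L/r_∞, the achieved bandwidth L/t(L) reaches exactly half of r_∞ at L_1/2 = s · r_∞.

```python
# Sketch of the two figures of merit under the pipelined transfer model.
def latency(L, s, r_inf):
    """Formula 1: startup time plus pipelined transfer time."""
    return s + L / r_inf

def l_half(s, r_inf):
    """Transfer size at which achieved bandwidth is half of peak."""
    return s * r_inf

s, r_inf = 1e-6, 1e8            # hypothetical: 1 us startup, 100 MB/s peak
L = l_half(s, r_inf)            # about 100 bytes for these parameters
achieved = L / latency(L, s, r_inf)
print(L, achieved / r_inf)      # achieved bandwidth is half of peak
```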
For the static network, we wanted to assess two speedups separately. The first comes from the static execution model, assuming off-the-shelf technology for the network design. The second comes from the fact that a network intended for this model can be designed with a more aggressive technology than a message-passing network, because its functionalities are simpler. Hence, we consider two cases: equal bandwidth performance, and the network we are currently designing [4] (fast network in the following). For the equal bandwidth network, we have to assess raw hardware latency for a 512-processor machine, for which the figures of the CM-5 cannot be used because they involve the routing delay. We consider a 600 ns latency; this figure was reached by the GF11 using 1983 technology [17]. Table 6 shows the estimates for the translation patterns using formula 1. For the CM-5, the results do not depend on the distinction static or parametric. For the static network, we use the results of Section 3.1; thus, the parametric value for τ is nine times its value for the static case (using log 512 = 9); this comes from the fact that the consecutive translations must proceed in a lockstep fashion. Both implementations of the static model outperform the CM-5 network with the vendor message-passing library by one to two orders of magnitude. With active messages, both static networks are better for the static translations, but only the fast network remains better for the parametric ones.
As no data concerning broadcasts and reductions were available to the authors, we had to limit our numerical comparisons to the translation case. Nevertheless, we must stress the following: for the CM-5, broadcasts and reductions use the control network; as it is a usual binary tree [20], no multioperations are allowed. Thus, even if multioperations incur a high penalization in our model, this may be lower than pure sequentialization.

CONCLUSION
The key idea of the static model is to adapt the RISC principle to communications, i.e., to be optimal in the most frequent cases and correct in the others. Both the experimental results and the gross performance evaluations developed in this article show that the static model provides a significant speedup over dynamic routing. However, these figures isolate the network behavior, whereas the static model has consequences in other parts of a parallel architecture. With synchronous communications, all the processors have to be synchronized at each network cycle. This synchronization may be realized either by synchronization barriers or by a dedicated processor architecture. Synchronization barriers are the simplest solution, but may create overhead, because they preclude efficient network pipelining.
For the second solution, the superscalar design and complex memory hierarchy of recent microprocessor architectures create many pipeline hazards. As adjusting the instruction threads by the compiler may be impossible, a VLIW-style architecture is recommended. More generally, the current situation in parallel architectures is unbalanced. Many detailed studies are available about the performance of the processor's different parts (functional units, caches, etc.). However, experimental data about communications are sparse and, except in a very few cases, mainly concern simple and synthetic situations. Our future research in this area will gather other experimental data about applications; in particular, the development of HPF should provide richer semantics than previous parallel Fortrans and better communication statistics. In addition, we want to investigate a possible softening of the static model; e.g., using synchronous on-line routing in multistage networks would allow the direct execution of a set of dynamic communications.
endforall

The assignment creates communication patterns where, for each I, the source is the processor owning the reference B[g(I)] and the destination is the processor owning the reference A[f(I)].
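This owner rule can be sketched concretely. The snippet below is our own illustration, assuming a hypothetical block distribution and example subscript functions f and g (none of these names come from the paper): the pattern is the set of (source, destination) pairs, and when f and g are affine with the same linear part, every pair differs by the same processor offset, i.e., the pattern is a translation.

```python
# Hedged sketch: the communication pattern of A[f(I)] = B[g(I)]
# under a hypothetical block distribution.
def owner(index, block):
    """Block distribution: element `index` lives on processor index // block."""
    return index // block

def pattern(f, g, n, block):
    """Set of (source, destination) processor pairs over the index domain."""
    return {(owner(g(I), block), owner(f(I), block)) for I in range(n)}

f = lambda I: I            # A[I]      (illustrative subscript functions)
g = lambda I: I + 8        # B[I + 8]: constant offset -> a translation
print(sorted(pattern(f, g, 24, 8)))   # [(1, 0), (2, 1), (3, 2)]
```

Each destination processor receives from the processor one block to its right, which a static translation realizes in a single step.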

benchmarks are particular implementations of an application and have another version (Conventional Spin and Revised Simplex), which is much better for the static model. The IFP benchmark is especially interesting: from the sequential version, it was possible and even easy to write a fully static HPF version of the benchmark, without any change in the initial algorithm. The category Algorithms presents much more diverse results: 50% static communications for the No-Block Gaussian Elimination, but 0% for LAPACK block-LU. The reason is that in the LAPACK subset, the applications are matrix decompositions, but the implementations are block algo-

where a_k and b_k may depend on i_l for l < k. An affine reference can be written A[MI + C].
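The affine form can be made explicit with a minimal sketch; M and C below are illustrative values of our own, not taken from the paper. The subscript vector is a constant matrix M applied to the loop index vector I, plus a constant vector C.

```python
# Minimal sketch of evaluating an affine reference A[M I + C].
def affine_subscript(M, I, C):
    """Return the subscript vector M @ I + C for index vector I."""
    return [sum(m * i for m, i in zip(row, I)) + c
            for row, c in zip(M, C)]

M = [[1, 0], [0, 2]]       # subscript (i, 2j): stride 2 in the 2nd dimension
C = [0, 1]                 # shifted by the constant vector (0, 1)
print(affine_subscript(M, [3, 4], C))   # [3, 9]
```

Since M and C are compile-time constants, the resulting communication scheme can be classified (translation, stride, etc.) without running the program, which is exactly what makes such references static.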

Table 1. The Analyzed Benchmarks

each reference and intrinsic in the source code, following the classification of Section 2. Next, we evaluated the dynamic (run-time) frequencies of each communication type by manual examination of the code.

Table 2. Formal Expression of Static Communications

name. The column labeled "Loop Parameters" in Tables 2 and 4 gives the names of the program parameters that are used as sequential loop subscripts. For instance, Cluster Spin shows three nested sequential loops: the indices are M, the number of measures, and I and J, which are internal to the algorithm. The numbers in parentheses are the parameter values used for Tables 3 and 5, if necessary; most of them were indicated by the benchmark. The following columns give the total number of occurrences of each communication scheme, for a complete execution of the benchmark; the column labeled "Affine and Cyclic" describes affine communications (all these com-

Table 4. Formal Expression of Dynamic Communications

Table 5. Dynamic Communications as a Percentage of the Total Communications

rithms. As stated in [25], the target architectures were multiple instruction multiple data (MIMD) shared memory, and blocking increases performance in this case by reducing memory traffic. The No-Block version of the LU decomposition (the routine SGETF2) is fully parametric but with a much lower communication count: 2N parametric MATMUL and N parametric translations. However, the applications are inherently dynamic, because they are sequential in either the rows or the columns of the basic matrix. A typical communication is