Performance Issues in High Performance Fortran Implementations of Sensor-Based Applications

Applications that get their inputs from sensors are an important and often overlooked application domain for High Performance Fortran (HPF). Such sensor-based applications typically perform regular operations on dense arrays, and often have latency and throughput requirements that can only be achieved with parallel machines. This article describes a study of sensor-based applications, including the fast Fourier transform, synthetic aperture radar imaging, narrowband tracking radar processing, multibaseline stereo imaging, and medical magnetic resonance imaging. The applications are written in a dialect of HPF developed at Carnegie Mellon, and are compiled by the Fx compiler for the Intel Paragon. The main results of the study are that (1) it is possible to realize good performance for realistic sensor-based applications written in HPF and (2) the performance of the applications is determined by the performance of three core operations: independent loops (i.e., loops with no dependences between iterations), reductions, and index permutations. The article discusses the implications for HPF implementations and introduces some simple tests that implementers and users can use to measure the efficiency of the loops, reductions, and index permutations generated by an HPF compiler. © 1997 John Wiley & Sons, Inc.


INTRODUCTION
There is an important class of computer applications that manipulate inputs from the physical environment. The inputs are continuously collected by one or more sensors and then passed on to the computer, where they are manipulated and interpreted. The sensors are devices like cameras, antennas, and microphones. The manipulation of the sensor inputs is variously referred to as signal or image processing, depending on the dimensionality of the inputs. We refer to the entire class of applications as sensor-based applications to emphasize this common quality of processing inputs from the natural world.
Sensor-based applications have traditionally been found in military domains like radar and sonar, and there is an increasing interest in commercial domains, such as medical imaging, surveillance, and real-world modeling. For example, a real-world modeling application could use a stereo algorithm to acquire depth information from multiple cameras and then use the information to build realistic three-dimensional (3D) models of the environment. The models could then be used for applications like virtual 3D conferencing, building walk-throughs, or experiencing a sporting event from the point of view of one of the players.
O'HALLARON, WEBB, AND SUBHLOK
Sensor-based applications are an interesting and often overlooked application domain for a parallel language like High Performance Fortran (HPF) [1]. The computations, which typically consist of regular operations on dense arrays, are naturally expressed in HPF. Furthermore, there are often stringent latency and bandwidth requirements that demand parallel processing. For example, a stereo program that extracts depth information from multiple cameras can process only a few frames per second on a powerful RISC workstation, which is well below the standard video rate of 30 frames per second. If the results of a sensor-based computation are used to control some process, then there is also a limit on the latency that can be tolerated. For example, an online medical imaging application that gathers and processes multiple images might automatically adjust the scanner to compensate for movement by the patient. The importance of minimizing latency, rather than just maximizing throughput, is one of the key properties that distinguishes sensor-based applications from batch-oriented scientific computations.
This article describes the results of an empirical study of the performance of a set of HPF sensor-based applications on a commercial parallel computer. The results were obtained using a prototype compiler, developed at Carnegie Mellon, for a dialect of HPF running on an Intel Paragon. The article makes several main points: First, contrary to the fears of many in the HPF community, performance for the HPF applications we studied is good. Second, a few core computational patterns (parallel DO loops, reductions, and index permutations) dominate sensor-based applications. HPF implementors can realize great benefits by focusing on these patterns. Third, there are some simple tests that HPF programmers and developers can use to evaluate the efficiency of the parallel DO loops, reductions, and index permutations that are crucial to the effective execution of sensor-based applications. Fourth, since the data sets in sensor-based applications are often fixed by properties of the sensors, scalability is an important issue. Finally, the same patterns that appear in sensor-based computations also appear in scientific applications. In particular, we will briefly examine an Fx regional air quality modeling code and an Fx earthquake ground motion modeling code based on the method of boundary elements.
In Section 2 we give a brief overview of the prototype HPF compiler (the Fx compiler) that was used in the study. Section 3 describes the applications that we implemented in Fx and their performance on the Intel Paragon. Sections 4, 5, and 6 describe some key issues in generating efficient code for HPF DO loops, reductions, and permutations, and introduce some simple tests for measuring the efficiency of these operations. Section 7 discusses the issue of scalability in sensor-based applications. Finally, Section 8 shows how the same DO loops, reductions, and permutations that are crucial to sensor-based applications also appear in scientific computations.

FX OVERVIEW
The Fx project was started in Fall 1991 with the goal of learning how to generate efficient code for programs written in the emerging HPF standard. The input language is a dialect of subset HPF and consists of F77 with HPF data layout statements, array assignment statements with support for general CYCLIC(k) distributions in an arbitrary number of array dimensions

FIGURE 1 The structure of sensor-based applications.

SENSOR-BASED APPLICATIONS
Sensor-based applications typically have the high-level two-stage pipelined structure shown in Figure 1. The front-end accepts a stream of inputs from one or more sensors and manipulates these inputs into some desired form. The back-end interprets the results of the front-end and either displays them or initiates some action. For example, in a radar-tracking application, the front-end might transform input phase histories from an antenna array into an image in the spatial domain, and the back-end would manipulate this image to name, identify, and track objects of interest.
The front-end processing typically consists of numerous, regular, data parallel operations on dense arrays, requires high MFLOPS rates, and the operations performed are usually data independent. Computations such as the FFT, convolution, scaling, thresholding, data reduction, and histogramming are common operations. The back-end processing is typically more dynamic, irregular, and data dependent, with real-time scheduling of processes. In this article, we are concerned with the front-end processing, where HPF on a parallel system is most appropriate. For the remainder of the article, when we refer to sensor-based applications we are referring to the front-end.
One of the nice qualities of sensor-based computing is that many applications have similar computational patterns. The similarities allow us to focus on a few small application kernels, with the assurance that anything that we learn about compiling these small programs will accrue benefits in larger, more realistic programs. Two examples that capture many of the key computational patterns, and were of tremendous help in the development of the Fx compiler, are the 2D FFT (FFT2) and the image histogram (HIST). The high-level parallel structure of these computations is shown in Figure 2.
Figure 2 depicts the coarse-grained parallelism that is available in FFT2 and HIST. The vertical lines depict independent operations on array columns and the horizontal lines depict independent operations on rows. The FFT2 program is a collection of independent local 1D FFTs on the columns of an array, followed by a collection of independent local 1D FFTs on the rows. The Fortran 90 TRANSPOSE intrinsic is important because it allows each 1D FFT to operate locally and in place on a contiguous vector. The same effect could also be achieved with the Fortran 90 RESHAPE intrinsic, using the ORDER argument to swap the array indices. An alternative is to distribute one array by columns and the other array by rows, and then redistribute from columns to rows using an array assignment statement. This approach results in rows that are stored locally but not contiguously. Since it would require extra copies to get the rows in contiguous form, this approach will not be considered here.
The HIST example consists of a collection of independent local histograms on the columns of an integer array, followed by a plus-reduction operation that adds the local histogram vectors to form the final result. In HPF, the declarations for this computation might be written as:
      INTEGER a(N,N), h(M,N), r(M)
!HPF$ DISTRIBUTE (*,BLOCK) :: a
The FFT2 and HIST examples capture the core computational patterns in sensor-based computing: parallel DO loops, reductions, and index permutations. FFT2 is a pair of parallel DO loops, each of which is followed by an index permutation (the TRANSPOSE intrinsic). HIST is a parallel DO loop followed by a reduction. These two patterns occur repeatedly in the sensor-based applications we have studied.
A striking aspect of Figure 3 is the number of parallel DO loops that operate independently along one dimension or another of the array. Each application contains at least one of these loops. The pointwise scaling operation in RADAR is another form of parallel DO loop, which is usually expressed as an array assignment. Another common pattern from the FFT2 example is to (1) operate along one dimension, then (2) operate along another. This pattern, which occurs in FFT2, ABI, RADAR, SAR, and MR, is typically implemented with a TRANSPOSE intrinsic between (1) and (2). Reductions are found in HIST, STEREO, ABI, and RADAR. The point is that FFT2 and HIST capture the basic computational structure of a wide range of sensor-based computations.
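As a rough illustration of the HIST pattern described above, the following Fortran 90 program is a minimal sketch, not the Fx source. The array sizes, the test data, the loop body, and the distribution of h are assumptions made for the example; only the declarations of a, h, and r and the distribution of a come from the text. The HPF directives are ordinary comments to a Fortran 90 compiler, so the sketch also runs sequentially.

```fortran
! Sketch of HIST: independent per-column histograms, then a plus-reduction.
! Sizes, data, and loop body are illustrative assumptions.
program hist_sketch
  implicit none
  integer, parameter :: n = 8     ! image is n x n (assumed size)
  integer, parameter :: m = 4     ! number of histogram bins (assumed)
  integer :: a(n,n), h(m,n), r(m)
  integer :: i, k
!HPF$ DISTRIBUTE (*,BLOCK) :: a
! Assumption: h is distributed by columns in the same way as a.
!HPF$ DISTRIBUTE (*,BLOCK) :: h

  ! Fill the image with pixel values in the range 1..m.
  do k = 1, n
    do i = 1, n
      a(i,k) = mod(i + k, m) + 1
    end do
  end do

  ! Parallel DO loop: iteration k reads column k of a and
  ! writes only column k of h, so the write sets are disjoint.
  h = 0
!HPF$ INDEPENDENT, NEW(i)
  do k = 1, n
    do i = 1, n
      h(a(i,k),k) = h(a(i,k),k) + 1
    end do
  end do

  ! Plus-reduction: add the local histogram vectors into the result.
  r = sum(h, dim=2)

  print *, 'histogram:', r
  print *, 'pixels counted:', sum(r), ' expected:', n*n
end program hist_sketch
```

The two phases map directly onto the patterns named in the text: the k loop is the parallel DO loop, and SUM along the second dimension is the reduction.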
There has been some concern about the performance that can be expected from HPF programs [14]. However, in our experience, the performance of HPF-like programs compiled by the Fx compiler for the Paragon is good, even for moderately sized problems. Figure 4 shows the absolute performance of representatively sized FFT, HIST, and SAR programs. FFT1 is a parallel 1D FFT program, FFT2 is the FFT exemplar, and FFT3 is a parallel 3D FFT program; each is computed in a way similar to FFT. The programs in Figure 4 scale reasonably well (although not linearly) and running time is not dominated by communication overhead. In the case of the SAR program, communication accounts for less than 10% of the running time.
While certainly not exhaustive, Figure 4 offers some hope that good performance can be expected from sensor-based applications written in HPF. In the remaining sections, we discuss the issues involved in achieving good performance.

PARALLEL LOOPS
DO loops in typical sensor-based applications can be efficiently parallelized using a variation on the simple Fortran D copy-in copy-out model [15]. The computation in the main body of the program is modeled as a single thread operating on a global data space. Each iteration in a loop is modeled as a separate thread operating on its own local data space. When control reaches the loop, the contents of the global data space are (conceptually) copied to each of the local data spaces. Each loop iteration then works independently on its local copy of the global data space. When all of the loop iterations have terminated, the contents of the local data spaces are (conceptually) copied out of the local data spaces back into the global data space. If multiple iterations write to the same address in the local address space, then the values are merged with a user-defined binary associative reduction operator before copying back to the global address space. This loop model, called the PDO model, is described in detail in [5].
A parallel DO loop based on the PDO model can be characterized in terms of the addresses that it references. The most common form of loop in sensor-based applications has disjoint reads and writes. Every application in Figures 2 and 3 has at least one loop with disjoint reads and writes, and FFT2 (Fig. 2a), SAR (Fig. ...
Each invocation of the fft() subroutine performs an in-place 1D FFT on the kth column of an array, reading and writing only elements in the kth column.
Although fft() is a complicated subroutine with a complex pattern of array references (and might even be an assembly language library routine), the pattern of array references between loop iterations is extremely simple: The kth iteration references the kth column.
This dichotomy of complicated intra-iteration reference patterns and simple inter-iteration reference patterns is a recurring theme in sensor-based computing, with important implications for HPF implementations.
Another important form of DO loop has overlapped reads and disjoint writes. Loops of this form are typically used to perform convolution operations such as the error computation in STEREO (Fig. 3a). A similar pattern occurs in relaxation algorithms from scientific computing. A simple 1D convolution, in which each output element is computed from a small neighborhood of input elements, is one example.
Finally, loops with overlapped writes are typically used by sensor-based applications to implement reductions. We discuss this important class of loops in Section 5. The remainder of this section discusses only loops with disjoint writes.
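To make the overlapped-read, disjoint-write pattern concrete, here is a minimal sketch of a 1D convolution in Fortran 90. The three-point kernel w, the array size, the spike input, and the block distribution are assumptions chosen for illustration, not code from the applications studied. Iteration i reads a(i-1), a(i), and a(i+1), so neighboring iterations share reads, but each iteration writes only b(i).

```fortran
! Sketch of a 1D convolution: overlapped reads, disjoint writes.
! The kernel, sizes, and input data are illustrative assumptions.
program conv1d_sketch
  implicit none
  integer, parameter :: n = 10
  real, parameter :: w(-1:1) = (/ 0.25, 0.5, 0.25 /)   ! assumed kernel
  real :: a(n), b(n)
  integer :: i
!HPF$ DISTRIBUTE (BLOCK) :: a, b

  ! Input: a single spike of height 4 at position 5.
  a = 0.0
  a(5) = 4.0

  ! Boundary elements are simply copied in this sketch.
  b = a

  ! Each iteration reads a three-element neighborhood of a
  ! and writes exactly one element of b.
!HPF$ INDEPENDENT
  do i = 2, n - 1
    b(i) = w(-1)*a(i-1) + w(0)*a(i) + w(1)*a(i+1)
  end do

  print *, b
end program conv1d_sketch
```

With this input the spike is spread over three elements: b(4) = 1, b(5) = 2, and b(6) = 1, with zeros elsewhere. On a block-distributed array, only the elements at block boundaries would have to be fetched from a neighboring node.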

Implications for HPF Implementations
Generating efficient code inside parallel loops is the key to achieving good performance with sensor-based applications written in HPF. Since parallel loops with disjoint writes are so common, occurring in every application we have studied, generating efficient code for these loops is especially important.
Although loops with disjoint writes are often dismissed as "embarrassingly parallel," it is nontrivial in general to generate efficient parallel code for them. There are a number of reasons, all complicated by the fact that loop bodies of real applications typically contain a lot of code, with complex intra-iteration reference patterns, calls to external library routines, and even inlined assembly language inserts.
First, the HPF compiler must somehow determine that the write sets are disjoint and that addresses that are written by one iteration are not read by another iteration. The HPF INDEPENDENT directive is very helpful here. The INDEPENDENT directive informs the compiler that no address is written by one iteration of a DO loop and then read or written by another iteration.
Second, the HPF compiler must ensure that the read and write sets are aligned with the loop iterations before the iterations are executed. By aligned, we mean that the read and write sets of each loop iteration are available in local memory. Aligning the data sets before executing a loop is key to achieving good performance because this allows the programmer to use arbitrary sequential code in the loop body, including calls to efficient sequential math libraries written in assembly language. For example, in a Paragon HPF implementation, the fft() routine called by the FFT2 loop might be an assembly language routine handcrafted for the i860 microprocessor.
Finally, the HPF compiler must compute local loop bounds and translate global array indices in the loop body to local indices. If not handled properly, these computations can be a significant source of runtime overhead.
The Fx compiler relies on a new PDO keyword to assert that write sets are disjoint and that addresses that are written by one iteration are not read by another. (The HPF INDEPENDENT directive conveys the same information and is more compatible with standard F90 compilers.) Also, since the bodies of loops with disjoint writes can be arbitrarily complex, consisting of calls to externally compiled library routines or hand-coded assembly routines, it is not always possible to rely on compile-time analysis to align read and write sets with loop iterations. Thus, the Fx compiler relies on hints from the user that describe the read and write sets for each loop iteration. This is an important point because it allows Fx to generate efficient code for loops with disjoint writes, regardless of the complexity of the loop body.

Loop Efficiency Test
There is a simple test, called the loop efficiency test, for measuring the overhead of the parallel loops that an HPF compiler generates. The parallel loop overhead consists of a few statements before the loop that compute the local loop bounds, and a function call that computes the initial local index value. The only overhead in the loop body is a statement that increments the local index value. A similar approach to index conversion is first described in [16].
It is hard to imagine a tighter loop.
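The structure of that overhead can be sketched as follows. This is a hedged illustration, not code generated by Fx: it assumes a BLOCK distribution of the loop iterations with block size ceiling(N/P), simulates the P nodes with an outer loop, and uses hypothetical names (blk, lo, hi, kl, owner_count). Each simulated node computes its local bounds once before the loop, sets an initial local index, and then only increments that index inside the loop body.

```fortran
! Sketch of local loop bounds and local index handling for a
! BLOCK-distributed loop. The node loop simulates P nodes; all
! names and sizes are illustrative assumptions.
program local_bounds_sketch
  implicit none
  integer, parameter :: n = 10    ! global loop runs k = 1..n
  integer, parameter :: p = 4     ! number of nodes (assumed)
  integer :: node, blk, lo, hi, k, kl
  integer :: owner_count(n)

  owner_count = 0
  blk = (n + p - 1) / p           ! block size: ceiling of n/p

  do node = 0, p - 1
    ! Executed once before the loop: local loop bounds.
    lo = node*blk + 1
    hi = min(n, (node + 1)*blk)
    ! Initial local index value.
    kl = 1
    do k = lo, hi
      ! A real loop body would reference local column kl here.
      owner_count(k) = owner_count(k) + 1
      ! The only per-iteration overhead: increment the local index.
      kl = kl + 1
    end do
    print *, 'node', node, ' global', lo, '..', hi, ' local count', kl - 1
  end do

  print *, 'every iteration executed exactly once:', all(owner_count == 1)
end program local_bounds_sketch
```

With n = 10 and p = 4 the block size is 3, so the nodes own global iterations 1..3, 4..6, 7..9, and 10..10, and the final check reports that every iteration ran exactly once.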
In summary, the loops in sensor-based applications can often be implemented as independent parallel loops, without any communication between loop iterations. A primary goal of an HPF implementation should be to generate parallel loop bodies that are as efficient as their sequential counterparts. In particular, implementers should focus on minimizing the overhead of parallel loops with disjoint reads and writes. The loop efficiency test we have introduced in this section provides a simple way to characterize these overheads.

REDUCTIONS
In sensor-based applications, loops with overlapped writes are used primarily to implement reductions.
A common pattern in sensor-based applications is to operate independently on the columns of an array, and then reduce the columns into a single column by adding them together. The HIST, STEREO, ABI, and RADAR programs all perform this type of simple reduction. However, there are other sensor-based applications that require a mechanism for the programmer to define generalized reduction operations. For example, a connected components algorithm can be written as a parallel loop over the rows of the image, where each iteration computes a segment table for its row. This is followed by a generalized reduction step that merges the segment tables.

Implications for HPF Implementations
For most sensor-based applications the HPF SUM intrinsic is sufficient. However, the first version of HPF provides no support for operations like connected components that require generalized reductions. Achieving good performance in these cases will require sophisticated compiler analysis to recognize the reductions [17]. In Fx, we avoid this analysis by incorporating a mechanism for defining arbitrary binary associative reductions into the parallel loop construct [5].

Reduction Efficiency Test
A user or implementer can measure the quality of the parallel reduction loops generated by an HPF compiler using a test similar to the loop efficiency test in Section 4.2. Consider the following loop that adds the columns of an N x N array.

INDEX PERMUTATIONS
As we saw in Section 3, the following computational pattern occurs in many sensor-based applications: (1) operate independently along one dimension of an array, then (2) operate independently along another.
The FFT2, ABI, RADAR, SAR, and MR programs all exhibit this pattern. For example, FFT2 performs a local FFT on each column of an array, then performs a local FFT on each row. In order to exploit locality, this pattern is usually implemented with an index permutation (also referred to as a transpose or corner turn) between steps 1 and 2. After such a permutation of a d-dimensional array a into a d-dimensional array b, b(i_1, ..., i_d) = a(π(i_1, ..., i_d)), where each i_k is a unique array index and π is some permutation operator. For example, after the TRANSPOSE step in the FFT2 program, b(i,j) = a(j,i), for 1 <= i, j <= N.

Although most of the sensor-based applications that we have studied transpose 2D arrays, there are important cases where index permutations of higher-dimensional matrices are necessary. In particular, 2D arrays of complex variables are often implemented as 3D arrays of real variables, and for d > 1, a d-dimensional FFT must permute two indices of a d-dimensional complex array between each local FFT step.

Implications for HPF Implementations
Efficient index permutation is crucial to achieving good performance in sensor-based applications. In general, an index permutation induces a complete exchange (i.e., an all-to-all personalized communication), where each node sends data to every other node. The Fortran 90 TRANSPOSE and RESHAPE intrinsics [9] adopted by HPF provide an opportunity to optimize this important operation, but unfortunately TRANSPOSE is only defined for 2D arrays. So for the general case, an HPF implementation must either provide a highly tuned RESHAPE intrinsic, or provide a generalized TRANSPOSE extrinsic (this is being considered for HPF-2), or be able to generate efficient code for index permutations that are implemented with a combination of array assignments and DO loops.
The Fx compiler provides an index permutation intrinsic with the same basic functionality as a Fortran 90 RESHAPE intrinsic with an ORDER argument. The advantage of using an intrinsic is that an intrinsic can leverage off of the existing code in the compiler for generating array assignment statements. Writing an extrinsic with the same functionality means duplicating the compiler's array assignment code in the runtime library. Furthermore, capturing the index permutation in an intrinsic allows the compiler to exploit significant optimizations on systems with toroidal interconnects [18].
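The relationship between TRANSPOSE and RESHAPE with an ORDER argument can be checked with a small standard Fortran 90 program. This is only a sketch of the language-level semantics; it uses the standard intrinsics rather than the Fx permutation intrinsic, and the array sizes and the particular 3D permutation are assumptions chosen for illustration. The 2D case confirms that RESHAPE with ORDER = (2,1) reproduces TRANSPOSE, and the 3D case swaps the last two indices, which TRANSPOSE cannot express.

```fortran
! Sketch: index permutations written with RESHAPE and ORDER.
! Sizes and the chosen 3D permutation are illustrative assumptions.
program permute_sketch
  implicit none
  integer, parameter :: n1 = 2, n2 = 3, n3 = 4
  integer :: a2(n1,n2), b2(n2,n1)
  integer :: a3(n1,n2,n3), b3(n1,n3,n2)
  integer :: i, j, k
  logical :: ok

  ! 2D case: b2(j,i) = a2(i,j), the same result as TRANSPOSE.
  a2 = reshape( (/ (i, i = 1, n1*n2) /), (/ n1, n2 /) )
  b2 = reshape(a2, (/ n2, n1 /), order = (/ 2, 1 /))
  print *, '2D RESHAPE matches TRANSPOSE:', all(b2 == transpose(a2))

  ! 3D case: b3(i,k,j) = a3(i,j,k), swapping the last two indices.
  a3 = reshape( (/ (i, i = 1, n1*n2*n3) /), (/ n1, n2, n3 /) )
  b3 = reshape(a3, (/ n1, n3, n2 /), order = (/ 1, 3, 2 /))
  ok = .true.
  do k = 1, n3
    do j = 1, n2
      do i = 1, n1
        if (b3(i,k,j) /= a3(i,j,k)) ok = .false.
      end do
    end do
  end do
  print *, '3D permutation b(i,k,j) = a(i,j,k):', ok
end program permute_sketch
```

Both checks are expected to print T. ORDER lists the result dimensions from fastest-varying to slowest-varying as the source is traversed in array element order, which is why (1,3,2) leaves the first index in place and exchanges the other two.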

Permutation Efficiency Test
Just as with loops and reductions, there is a simple test for measuring the efficiency of HPF index permutations. The execution time of a parallel index permutation of an N x N array (using either the TRANSPOSE intrinsic, a permutation extrinsic, or an assignment statement and a DO loop) is bounded from below by the time to sequentially permute an N x N/P array on a single node. The ratio of the sequential time to the parallel time is the permutation efficiency: the percentage of effective local memory bandwidth that is realized by the parallel permutation. Like reduction efficiency, permutation efficiency is influenced by overheads due to the compiler as well as overheads due to the underlying communication system. Figure 7 shows the results for a 2D transpose of an N x N array using the Fx compiler on Paragon. The graph provides a couple of interesting insights. There is substantial overhead even for the single-node version of the parallel permutation, which achieves only 85% of the effective local memory bandwidth for large problems. The inefficiency is due to the fact that Fx-generated code unnecessarily checks for communications that never occur (because the transpose is on a single node).
The multiple-node versions of the parallel permutation use only 30% of the effective local memory bandwidth for large problem sizes. This suggests that the parallel permutation on the Paragon is communication bound, and that further improvements will require a new message-passing layer. The inefficiency is largely due to overhead from the underlying communication system and it is tempting for us to wash our hands of responsibility for its performance. However, in our experience, significant performance benefits can be realized in compiler-generated code by tailoring the run-time communication libraries [4, 19]. HPF implementers need to be aware of the communication overheads for a particular target machine. The reduction and permutation efficiency tests are a simple and useful way to expose the performance impact of these overheads.
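The sequential half of the permutation efficiency test might be measured along the following lines. This is a sketch under stated assumptions rather than the test program used in the study: the problem size, node count, repetition count, and 4-byte default reals are all assumed, and the timing uses the standard SYSTEM_CLOCK intrinsic. It times the local transpose of an N x N/P array and reports the resulting effective local memory bandwidth. The parallel time has to come from a separate run of the compiler-generated permutation on P nodes.

```fortran
! Sketch of the sequential baseline for the permutation efficiency test.
! Sizes, repetition count, and 4-byte reals are assumptions.
program permute_baseline_sketch
  implicit none
  integer, parameter :: n = 512       ! global array is n x n (assumed)
  integer, parameter :: p = 4         ! number of nodes (assumed)
  integer, parameter :: reps = 100    ! repetitions to lengthen the timing
  real :: sa(n, n/p), sb(n/p, n)
  integer :: c0, c1, rate, r
  real :: t_seq, checksum

  call random_number(sa)
  checksum = 0.0

  call system_clock(c0, rate)
  do r = 1, reps
    ! Sequential permutation of the n x n/p local array.
    sb = transpose(sa)
    ! Use one element per pass so the work stays observable.
    checksum = checksum + sb(1, 1 + mod(r, n))
  end do
  call system_clock(c1)

  ! Average time per permutation, in seconds.
  t_seq = real(c1 - c0) / real(rate) / real(reps)
  print *, 'checksum:', checksum
  print *, 'sequential time per permutation (s):', t_seq
  if (t_seq > 0.0) then
    ! Each element is read once and written once (4 bytes each, assumed).
    print *, 'effective bandwidth (MB/s):', &
             2.0*4.0*real(n)*real(n/p) / t_seq / 1.0e6
  else
    print *, 'timer resolution too coarse; increase reps'
  end if
end program permute_baseline_sketch
```

Dividing this sequential time by the measured running time of the parallel permutation on P nodes gives the permutation efficiency described above.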

SCALABILITY
Sensor-based applications are typically implemented as collections of functions that process continuous streams of data sets. The sizes of the data sets are determined by external factors such as the type of sensor, the number of sensors, and the frequencies of interest. For example, the image size of the STEREO application is fixed at 240 x 256 by the camera system and cannot be modified by the programmer. The magnetic resonance scanner used by the magnetic resonance application processes 512 x 512 images (oversampled from 256 x 256 input). The radar subsystem used by the RADAR application produces 512 x 10 data sets.
The fixed size of the data sets is an important property of sensor-based applications that distinguishes them from typical scientific applications. Since the data set sizes are fixed, the amount of available computational work on each node decreases as the number of nodes increases. If a data parallel function performs a nontrivial amount of internal communication, then the efficiency of the function will tend to decrease as the number of nodes increases. This behavior is shown in Figure 8 for a 512 x 512 local FFT loop, a 512 x 512 image histogram, and a 512 x 512 transpose. The local FFT function contains no communication, and thus scales perfectly with the number of nodes. However, the histogram and transpose functions contain internal communication and their efficiency decreases significantly as the number of nodes increases.
If efficient use of processing nodes is a goal (as it is in embedded systems where additional nodes increase the cost, size, power, and weight of the system) then we want to use a smaller number of nodes for functions like the histogram and transpose. But if we have a large parallel system with many nodes, how can we effectively use the remaining nodes? One approach that has been proposed is to use a mix of task and data parallelism [6, 7, 20, 21]. Task parallelism can significantly improve the performance of applications with functions that do not scale well. For example, using a mix of task and data parallelism doubled the throughput (compared to the most efficient data parallel code) of the 240 x 256 Fx STEREO program so that it was able to run in real-time [8]. Since HPF does not currently support task parallelism (although it is being considered by the HPF-2 committee) there is the risk that HPF sensor-based codes with smaller data sets will not run efficiently. This puts additional pressure on HPF developers to maximize the loop, reduction, and permutation efficiencies identified earlier.

RELATION TO SCIENTIFIC CODES
Although this article does not specifically address scientific HPF codes, other groups have used Fx to implement two nontrivial physical simulations: QUAKE, a 3D earthquake ground motion simulation (based on the method of boundary elements) and AIR, a regional air quality modeling program [22]. Both codes are legacy F77 codes of about 10,000 lines that were ported to the Paragon version of Fx. QUAKE is especially interesting because it was ported in a few weeks by a seismologist from the Southern California Earthquake Center who had never written a parallel program.
Figure 9 shows the computational structure of the QUAKE and AIR programs. QUAKE is a single perfectly parallel DO loop. AIR is a sequence of DO loops that operate on different dimensions of an array, with 3D transposes interspersed between the loops. The interesting thing is that at a high level, the structure of these two moderately large scientific codes is almost identical to the FFT! So again, we see that parallel loops and index permutations are important operations for HPF developers to optimize.
The QUAKE and AIR programs reinforce an important point that we touched on in Section 4: Complicated programs with complicated inner loops can nonetheless have a simple data parallel structure that is straightforward to parallelize. The AIR program takes this to extremes: Each iteration of the parallel DO loop in each of the horizontal transport steps solves an independent sparse and irregular finite element problem. We normally assume that HPF is not a good platform for sparse irregular codes, but AIR is an example of a sparse code that is quite well suited for HPF.

CONCLUSIONS
We have identified sensor-based applications as an important class of applications that are generally well suited to HPF. Sensor-based applications operate on dense arrays of data and they reference these data in regular ways that are naturally expressed using HPF loops and array assignment statements. Typical sensor-based applications include FFT, radar and sonar processing, computer vision, and medical imaging.
In the process of studying realistic sensor-based applications compiled by a subset HPF compiler developed at Carnegie Mellon, we observed that performance is determined by the efficiency of three key operations: independent DO loops, reductions, and index permutations. Of course, good performance on these operations is important for many types of applications. The interesting point is that for sensor-based applications, good performance for a small set of operations is generally sufficient for achieving good overall performance. Implementing an efficient compiler for all of HPF is difficult and expensive. We believe that focusing attention and effort on the performance of independent DO loops, reductions, and index permutations will allow for the rapid development of practically efficient compilers. Based on our experience with the Fx system, HPF developers who focus on these operations will reap large rewards in performance.
Given the importance of loops, reductions, and index permutations for sensor-based applications, we have introduced some simple and useful tests that users and implementers can use to measure the efficiency of the code generated by an HPF compiler for these key operations. The tests are written entirely at the user level, but can provide good insight into the quality of code generated by an HPF compiler. For example, the Fx compiler, as well as several commercial HPF compilers, exhibits a measured loop efficiency of close to 100% for the canonical DO loop. In other words, the loop bodies generated by these compilers are nearly as good as their sequential counterparts. On the other hand, several other commercial HPF compilers exhibit a maximum loop efficiency of only 50% for the same DO loop, which suggests a problem in the way these compilers generate loops.
Finally, although HPF is a good match in general with sensor-based applications, scalability can be an issue because of the fixed sizes of the data sets. If the data sets are small, then a mix of task and data parallelism is often required to get good performance.

Figure 3 shows a collection of sensor-based applications. All but ABI (Fig. 3b) have been implemented in Fx, and could be ported to HPF with small changes. The STEREO program, developed at Carnegie Mellon, extracts depth information using the images from multiple video cameras [10]. The RADAR program was adapted from a C program developed by MIT Lincoln Labs to measure the effectiveness of various multicomputers for their radar applications [11]. The SAR program was adapted from a Fortran 77 program developed by Sandia National Laboratories [12]. The MR program was developed from an algorithm by Doug Noll at Pitt Medical Center [13].
FIGURE 3 Other Fx/Paragon sensor-based applications.
... 3d), and MR (Fig. 3c) consist exclusively of these kinds of loops. For example, the FFT2 exemplar consists of HPF DO loops of the form:
!HPF$ INDEPENDENT
      DO k=1,N
         CALL fft(a(:,k))
      END DO
The running time of the parallel reduction loop on P nodes is bounded from below by the running time of the following sequential HPF DO loop:
      REAL sa(N,N/P), v(N)
      v = 0.0
      DO k=1,N/P
         v = v + sa(:,k)
      END DO
If R_p(N,P) is the running time of the parallel reduction of an N x N array on P nodes and R_s(N,P) is the running time of the corresponding sequential reduction of an N x N/P array, then E_reduce(N,P) = R_s(N,P) / R_p(N,P) is the reduction efficiency of the parallel reduction generated by the compiler.* Reduction efficiency exposes the run-time overheads that are incurred by performing the reduction in parallel. Unlike loop efficiency, which is completely determined by the compiler, reduction efficiency is a function of overheads due to the compiler as well as overheads due to the underlying communication system. Thus, in general it may be impossible to achieve a reduction efficiency of unity.
* An alternative formulation of the reduction efficiency test is to use the HPF SUM intrinsic for the parallel reduction. In this case, the sequential reduction must use the same local computational kernel as the SUM intrinsic.
Figure 6 shows the results for a simple plus-reduction using the Fx compiler on Paragon. As with the loop efficiency graph, reduction efficiency is bounded from above by the curve for 1 node and from below by the curve for 32 nodes. Surprisingly, the reduction efficiency for a single node decreases as the problem size increases. This suggests a problem in the Fx implementation of the reduction. Ideally, the efficiency on a single node should be close to unity. Another point of concern is the slow convergence of the curves for 16 and 32 nodes. Since the local computation step of each parallel reduction grows as roughly N^2/P and the communication step grows as roughly N log P, we might expect these curves to converge faster than they do. Yet even for a relatively large N, the reduction efficiency is below 50%. While the reduction efficiency does not pinpoint the source of the overhead, it does point out an opportunity for improvement in the Fx implementation.
      DO k=1,N
         CALL fft(a(:,k))
      END DO
      b = TRANSPOSE(a)
      DO k=1,N
         CALL fft(b(:,k))
      END DO
      a = TRANSPOSE(b)
In general, an index permutation is an assignment from a d-dimensional array a to a d-dimensional array b.