DDT: A Research Tool for Automatic Data Distribution in High Performance Fortran

This article describes the main features and implementation of our automatic data distribution research tool. The tool (DDT) accepts programs written in Fortran 77 and generates High Performance Fortran (HPF) directives to map arrays onto the memories of the processors and to parallelize loops, as well as executable statements to remap these arrays. DDT works by identifying a set of computational phases (procedures and loops). The algorithm builds a search space of candidate solutions for these phases, which is explored in search of the combination that minimizes the overall cost; this cost includes data movement cost and computation cost. The movement cost reflects the cost of accessing remote data during the execution of a phase and the remapping costs that have to be paid in order to execute the phase with the selected mapping. The computation cost includes the cost of executing a phase in parallel according to the selected mapping and the owner computes rule. The tool supports interprocedural analysis and uses control flow information to identify how phases are sequenced during the execution of the application.

Automatic data distribution maps arrays into the physically distributed memories of the processors according to the array access patterns and parallel execution of operations within computationally intensive phases. This mapping can be either static or dynamic. In a static mapping, the layout of the arrays does not change during the execution of the program; in a dynamic mapping, remapping operations are performed in order to change the layout of arrays in different computational phases.
The main differences between the previous methods are the kind of structure selected to represent the problem and the way used to formulate and solve it. For the alignment step, Li and Chen [1] define and use the Component Affinity Graph (CAG) to represent alignment preferences, and they use a heuristic algorithm to solve it. Gupta [2] also uses the CAG, but weighted with data movement costs; to do this, a default distribution has to be assumed. This proposal is more accurate than the previous one when weighting the edges of the CAG, but the solver is based on the same heuristic algorithm. Wholey [14] uses the preference graph defined by [16] in the framework of single-instruction multiple-data (SIMD) machines. This graph includes alignment preferences to preserve parallelism, but the resolution when the graph is in conflict is also based on heuristics. Sheffler et al. [4] define the alignment distribution graph, where nodes represent program operations and edges connect definitions of array objects to their uses in these operations. Edges are weighted with the number of data items communicated along the edge. The alignment is found using a greedy algorithm as a heuristic to determine the minimum cost, and applying graph contraction operations to reduce the complexity of the problem. Kennedy and Kremer [15] design a framework to be used inside a data layout assistant tool for HPF. It is based on the CAG, and they solve the alignment problem using a 0-1 integer programming model, thus avoiding the use of heuristics.
Once the alignment has been decided, the distribution step decides which dimension(s) of the template are distributed and the number of processors assigned to each of them. A good distribution maximizes the potential parallelism of the code and offers the possibility of further reducing data movement by serializing. This goal could be trivially satisfied by assigning a datum to each processor, which maximizes parallelism. Li and Chen [17] match the aligned reference patterns with a predefined set of data movement routines. Each routine has an architecture-dependent cost parameterized in terms of the number of processors involved in the data motion and the amount of data being moved. The cost function for all the patterns is minimized by selecting the appropriate distribution strategy. Gupta [2] decides the dimensions to distribute (maximum two dimensions), assuming a default number of processors in each one and minimizing the total data movement plus computation time. When more than one dimension is distributed, he decides the number of processors to assign to each dimension by generating all possible combinations. Wholey [14] uses a hill-climbing search method which initially assigns all array elements to one processor and estimates the cost. Then it doubles the number of processors and chooses the dimension to assign the new ones to, until all available processors are utilized or the total cost is not further reduced.
In large problems where different computationally intensive phases occur, remapping actions between phases can increase the efficiency of the solution. In this case, a good solution is independently found for each phase, and realignment and/or redistribution statements are inserted where necessary. Data remapping is also one of the topics of current research in this subject area. Some of the proposals presented in the literature about array remapping [5, 13-20] are summarized in the rest of this section.
The D-System, currently under development at Rice University, considers the profitability of dynamic data remapping by exploring a search space of reasonable alignment and distribution spaces [18]. In their work, each phase has a set of candidate mapping schemes. Selecting a mapping scheme for each phase in the entire program is done by representing the problem with the data layout graph. Each possible mapping for a phase is represented with a node. Edges between two nodes in different phases represent the remapping that has to be carried out to execute each phase with the corresponding mapping. Nodes and edges have weights representing the overall cost of executing a phase with a mapping and the remapping costs, respectively, in terms of execution time. The problem is translated into a 0-1 integer programming problem suitable to be solved by a state-of-the-art general-purpose integer programming solver.
The FCS system [19] considers the problem in the framework of a data distribution tool for Fortran 90 source codes. In this scope, array-syntax assignment statements and WHERE masks are examined to determine candidate data mappings. A phase is basically a DO-loop containing array-syntax assignment statements or WHERE masks in its body. Instead of looking for the optimal solution, it uses a tree-exhaustive algorithm with some heuristics to prune the search space. A conflict table storing the conflicts between the mappings of the arrays from one phase to the next is the basis of the remapping algorithm. This table determines which redistribution options are worth considering at each transition. From this information, a tree showing all the different alternatives of remapping is built. The aim is to determine the path in the tree with the lowest cost. The full remapping tree can easily grow to intractable proportions.
Chatterjee et al. [5] use a divide-and-conquer approach to the dynamic mapping problem. They initially assign a static mapping valid for all the nodes and then recursively divide them into regions which are assigned different mappings. Two regions are merged when the cost of the dynamic mapping is worse than the static mapping, taking computation, data movement, and remapping costs into account. Palermo and Banerjee [20] also use a divide-and-conquer approach in which the program is recursively decomposed into a hierarchy of candidate phases. Then, taking into account the cost of remapping between the different phases, the sequence of phases and phase transitions with the lowest cost is selected. They use [2] to assign mappings to the phases generated.
The Data Distribution Tool (DDT) is a research tool designed to generate both static and dynamic solutions. Since it is a research tool, it can use techniques that may be too computationally expensive to be included in a production compiler; however, this allows us to explore a rich set of solutions. The static module is based on the CAG but extended with some information regarding parallelism. We have also modified the original algorithms in [1, 17] to improve the quality of the mappings generated [21]. The current version of the static module generates both inter- and intradimensional alignments and BLOCK and CYCLIC distributions. The dynamic module explores a rich set of combinations; it is not exhaustive thanks to mechanisms included to cut down the search space [22]. The dynamic analysis is interprocedural and considers control flow to determine where the remapping actions have to be performed (between computational phases or across procedure boundaries).
The rest of the article is organized as follows. In the next section we give an overview of the whole data-mapping process in DDT. Section 3 details how the static solutions are found. Section 4 describes the algorithm which finds dynamic solutions and inserts remapping actions if they are found profitable. Sections 5 and 6 describe the extensions to the previous algorithm to handle control flow and interprocedural analyses, respectively. In Section 7 we present the main results from a set of experiments to test the validity and quality of the solutions generated by DDT. Finally, we present our conclusions.

AN OVERVIEW OF THE DATA DISTRIBUTION PROCESS IN DDT
Our research tool (DDT) analyzes Fortran 77 code and annotates it with a set of HPF directives and executable statements that specify (1) how arrays are aligned to a set of template arrays and how the dimensions of these templates are distributed among processors; (2) the set of realignment and redistribution statements, if the solution found is dynamic; and (3) the parallelization strategies for the loops that access distributed arrays. These decisions are made so that the amount of remote accesses is reduced as much as possible, while maximizing the parallelism achieved.
The external shell of DDT is the interprocedural analysis module; this module is based on the call graph for the entire program. In a bottom-up pass over the call graph, each procedure is analyzed when all the procedures called by it have already been processed.
Notice that Fortran 77 does not allow recursion, so no cycles can be found in the call graph. From the analysis of a procedure, a set of candidate mappings is generated for it and stored in the DDT interprocedural database.
For each procedure, DDT can generate two different kinds of solutions: static and dynamic. A static solution defines an initial mapping for each array, and it does not change during the execution of the whole procedure. In a dynamic solution, the statements in the original source code are grouped into a collection of phases; each one may have different mappings for the arrays accessed, so remapping operations might be necessary to execute each phase with its mapping. Notice that static solutions are a particular case of dynamic solutions where no remapping operations are needed. A phase is either the outermost loop in a nest whose control variable is used to subscript an array or a call to a procedure. The source code is shown in Figure 1 for completeness. DDT identifies nine phases in this program. Each phase corresponds to one of the nested loops (labeled from 1 to 9) in Figure 1. For each phase, DDT estimates the data movement and the execution costs for different data-mapping and loop parallelization alternatives. For instance, while analyzing phase 4, the two possible mappings shown in Table 1 are taken into account. In this case, DDT suggests a perfect alignment of all the arrays used in the phase and two possible distributions. Table 2 shows the costs of the candidate mappings for all the phases within the iterative loop do iter.
Table 2 shows that the favorite solution for phases {4, 5, 6} and {7, 8, 9} is not the same. Therefore, they are not compatible in their mapping and parallelization strategies. Three main alternatives are evaluated by the intraprocedural algorithm:

1. Assign Solution 1 to all the phases. In this case the cost per iteration is 0.547944 and the estimated cost for the outer iterative loop is 5.47944.

2. Assign Solution 2 to all the phases. In this case the cost per iteration is 0.57559 and the estimated cost for the outer iterative loop is 5.7559.

3. Assign the preferred solution to each phase. In this case the computation time per iteration is 0.064886 and we have to remap the arrays between incompatible phases. In particular, arrays a, b, and x will be remapped from row to column distribution before the execution of phase 7, which has an approximate cost of (256 * 256)/16 = 4,096 array elements each (or 32,768 bytes each). This remapping action is performed 10 times during the execution of the iterative loop. Due to the same loop, phase 4 is executed again after executing phase 9. So we have to consider the compatibility between these two phases and the possible remapping costs if their mappings are not compatible. In particular, arrays a, b, and x have to be remapped from column to row distribution before the execution of phase 4 with the same estimated cost. This remapping action is performed nine times (since the last iteration of the iterative loop forces the execution to exit it). The total cost for the sequence of phases within the iterative loop is estimated as: (10 * 0.064886) + (10 * 3 * 0.032768) + (9 * 3 * 0.032768) = 2.516636.
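The arithmetic behind the three alternatives can be checked with a short sketch (Python is used here purely as a calculator; the constants are the example's estimated costs, and the variable names are ours, not DDT's):

```python
# Cost comparison for the ADI example: two static alternatives versus the
# dynamic one with remapping. All cost constants come from the example above.

ITERS = 10            # iterations of the outer "do iter" loop
REMAPPED_ARRAYS = 3   # arrays a, b, and x

# Static alternative 1: Solution 1 for all phases, no remapping.
static1 = ITERS * 0.547944

# Static alternative 2: Solution 2 for all phases, no remapping.
static2 = ITERS * 0.57559

# Dynamic alternative: preferred solution per phase plus remapping actions.
compute = ITERS * 0.064886
remap_cost = 0.032768                                     # one array, one switch
row_to_col = ITERS * REMAPPED_ARRAYS * remap_cost         # before phase 7
col_to_row = (ITERS - 1) * REMAPPED_ARRAYS * remap_cost   # before phase 4
dynamic = compute + row_to_col + col_to_row

print(round(static1, 5), round(static2, 4), round(dynamic, 6))
# 5.47944 5.7559 2.516636
```

The dynamic alternative wins because the per-iteration computation saving (about 0.48 per iteration) outweighs the cost of the 57 remapping actions.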
In this example, the cost of the dynamic alternative is lower than the costs of the two static alternatives. In Section 7 we analyze and evaluate the validity of the different solutions when changing architectural parameters such as data movement cost and the number of processors.

DATA DISTRIBUTION FOR A PHASE
The basic compilation steps for a computational phase are described next. First of all, a weighted graph called the Dimension Alignment Graph (DAG) is constructed from the analysis of the array references in the source program, and it records preferences for alignment. The DAG is similar to the CAG but it includes preferences for alignment based on parallelization in addition to data movement. Then, an array alignment phase follows. In this step, all dimensions of the arrays in the program are related to each other by (1) mapping each array dimension into a dimension of a template array (interdimensional alignment) and (2) applying an offset between them (intradimensional alignment). Data movement requirements for those nonaligned references and loop parallelization strategies are analyzed in order to decide the dimensions of the template to distribute, the number of processors allocated, and the kind of distribution applied to them.

Reference Patterns and the DAG
The DAG is a weighted undirected graph built from the analysis of array reference patterns in loop statements. In this section we review how reference patterns are defined and analyzed to detect affinity, and how the DAG is built from this analysis.
Reference patterns have the general form A(i_1, ..., i_m) = ... B(j_1, ..., j_n) ..., where A is an array that appears in the left-hand side (lhs) of an assignment statement located inside the loop and B is an array in the right-hand side (rhs) of the same assignment statement. If the assignment is under control of conditional statements, then all the arrays in the expressions that evaluate the conditions are considered as if they were in the rhs of the assignment statement.
An affinity relation can appear between two dimensions of the data arrays in a reference pattern. Dimension B_q is said to be affine with dimension A_p (denoted (A_p, B_q)) if j_q and i_p are linear functions of the same loop control variable.
From the analysis of reference patterns, the DAG is built. Nodes of the DAG represent dimensions of data arrays and edges represent affinity relations between array dimensions obtained by examining cross-reference patterns (patterns in which the rhs and lhs arrays are different). Self-reference patterns are not considered in the DAG building step. Nodes in the DAG are grouped in columns; each column contains those nodes representing dimensions from the same data array. An edge (A_p, B_q) in the DAG shows a preference for alignment of dimensions A_p and B_q.
According to [1], edges in the DAG are weighted in two ways. On the one hand (and to solve the interdimensional alignment problem), each edge is weighted depending on whether it is competing or noncompeting with another edge (ε if it is competing and 1 if not). Two edges are said to be competing if they are generated by the same reference pattern and are incident on the same node. A DAG so defined may contain multiple edges between a pair of nodes, since there might be several reference patterns involving two data arrays; each set of multiple edges can be replaced with a single edge whose weight is the sum of their weights. On the other hand (and to solve the intradimensional alignment problem), each edge is weighted with the offset between the two subscripts in the array dimensions involved. In this case, multiple edges are not merged into a single edge because each one may store information about a different shift preference. Preferences for stride alignment are not recorded in the DAG in the current version of DDT.
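As an illustration of the interdimensional weighting rule, the following sketch builds the merged edge weights from a list of reference patterns. The data layout and the value of ε are our assumptions; DDT's internal representation is certainly richer:

```python
# Minimal sketch of interdimensional edge weighting: edges generated by the
# same reference pattern that are incident on the same node are "competing"
# and receive a small weight EPSILON; all other edges receive weight 1;
# parallel edges between the same pair of nodes are merged by summing.

from collections import Counter, defaultdict

EPSILON = 0.01  # assumed small value for competing edges

def weight_and_merge(patterns):
    """patterns: list of edge lists, one per reference pattern; an edge is a
    pair of (array, dimension) nodes. Returns {frozenset(edge): weight}."""
    merged = defaultdict(float)
    for edges in patterns:
        # Count how many edges of this pattern touch each node.
        touch = Counter(node for edge in edges for node in edge)
        for a, b in edges:
            competing = touch[a] > 1 or touch[b] > 1
            merged[frozenset((a, b))] += EPSILON if competing else 1.0
    return dict(merged)

# Two patterns relating arrays x and a; the repeated (x_1, a_1) preference
# accumulates into a single heavier edge.
g = weight_and_merge([
    [(("x", 1), ("a", 1)), (("x", 2), ("a", 2))],
    [(("x", 1), ("a", 1))],
])
print(g[frozenset((("x", 1), ("a", 1)))])  # 2.0
```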
DDT also performs a set of well-known optimizations such as expression substitution, subscript substitution, and induction variable detection. In [2] the authors evaluate the effectiveness of these optimizations in terms of the number of new reference patterns analyzed and affinity relations obtained. They also analyze the complexity of the DAG in real codes in terms of the number of nodes, edges, and offsets.

Including Parallelization Constraints in the DAG
Loop parallelization is not independent of the way arrays in the phase are aligned and distributed. It is desirable to have a clear relationship between loop levels in a phase that is parallelized and dimensions of arrays that are aligned and distributed: this can ease the application of the owner computes rule and a more efficient generation of parallel code. For each loop in a phase eligible for parallel execution (according to dependence analysis), a set of edges linking dimensions of arrays (in the lhs of assignment statements) subscripted by its loop control variable is added to the DAG. Note that these edges are different than the independence antipreference edges defined in [16]: they characterize potentially parallelizable loop levels.

Figure 2 shows the DAG for phase 1 in ADI. Notice that in addition to the edges that show affinity between dimensions in array references (thin edges (x_1, a_1), (x_2, a_2), (a_1, b_1), (a_2, b_2), (x_1, b_1), and (x_2, b_2)), a new edge between (x_1, b_1) is added. This edge has a weight big enough to ensure that the DAG partitioning algorithm described in the next section will align all the nodes linked by it. Nodes in the same subset correspond to dimensions to be aligned. As a consequence, we want to partition the DAG so as to minimize the total weight of edges that are between nodes in different subsets. The problem stated above is NP-complete, and [1] proposes a heuristic (greedy) algorithm to solve it. In this algorithm, a single data array is randomly chosen at each step for alignment with the template (which is chosen among the data arrays that have maximum dimensionality). The algorithm applied to a graph G is described below.

In our implementation, several optimizations of the heuristic have been done in order to obtain better alignments. They are described below: 4. In Form_Bipartite_Graph, the weight of an edge between two nodes is set to the min-cut instead of the sum of the weights of all edges that compose the paths. With min-cut, the weight is set to the minimum sum of edge weights in G that we would have to eliminate to isolate the two nodes. This represents the minimal cost of not aligning the two nodes. For instance, consider the DAG shown in Figure 3a (it corresponds to procedure TRED2 from the EISPACK library). Figure 3b shows a step of Form_Bipartite_Graph when a path between T_2 (of C_i) and Z_1 (of C_j in that step) is looked for. In this case, there are two paths: T_2, D ...

DAG Partitioning and Array Alignment
Function Form_Direct_Bipartite returns the bipartite graph between two columns in a graph that results from direct edges. Functions Best_Alignment and Worst_Alignment return, for a bipartite graph, the total weight of nonaligned edges with the best and worst possible alignments. The use of min-cut increases the execution time of the algorithm with respect to the sum alternative. However, the use of a heuristic to choose the next column decreases the execution time of the min-cut solution because at each step the complexity of the remaining graph is lower and the algorithm proceeds faster. In [21] the authors evaluate the usefulness of these optimizations oriented toward improving the output of the DAG partitioning algorithm.
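A minimal sketch of what Best_Alignment and Worst_Alignment could return for a bipartite graph between two columns, assuming the graph is given as a small weight matrix (brute force over permutations is adequate because array dimensionality is small; the function name and representation are ours, not DDT's):

```python
# For a bipartite graph between the dimensions of two columns, enumerate all
# one-to-one alignments and report the total weight of edges left non-aligned
# under the best (minimum) and worst (maximum) alignment.

from itertools import permutations

def alignment_costs(weights):
    """weights[i][j]: edge weight between dim i of one column and dim j of
    the other (0 if no edge). Returns (best, worst) non-aligned weight."""
    n = len(weights)
    total = sum(map(sum, weights))
    costs = [total - sum(weights[i][p[i]] for i in range(n))
             for p in permutations(range(n))]
    return min(costs), max(costs)

# Two 2-D columns: aligning dim 1 with dim 1 and dim 2 with dim 2 keeps
# weight 5 + 4, leaving only weight 1 non-aligned.
best, worst = alignment_costs([[5, 1], [0, 4]])
print(best, worst)  # 1 9
```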

Intradimensional Alignment
The algorithm we propose to find shifts among aligned dimensions is described next. For each dimension of the template, a directed graph G_t is created. Nodes in this graph correspond to the array dimensions that are aligned with that dimension of the template. This algorithm is basically the same as the one proposed in [23] to solve the statement alignment problem in order to reduce synchronization costs in a shared-memory execution model. The main steps of the algorithm are described below:

1. Pick_Up_Node. It returns an unmarked node of G_t connected with a marked node of G_t. If such a node is not found, then an unmarked node is randomly selected.

2. Find_Shift. This function returns the offset S (with respect to the template) that has to be applied to the node N currently analyzed. This value is obtained from the offsets of all the edges between node N and any marked node of G_t. If several edges between node N and the template node appear, the one that is repeated more times is selected. If several edges are candidates, then the one with the minimum value is chosen. We have observed that selecting a value different than zero when zero is one of the candidates leads to poor solutions.
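Our reading of the Find_Shift selection rule can be sketched as follows (the candidate offsets would come from the edges between node N and the marked nodes; the helper is illustrative, not DDT's code):

```python
# Offset selection: among the candidate offsets, pick the most frequent one;
# break ties by taking the minimum value, except that zero is preferred
# whenever it is among the tied candidates.

from collections import Counter

def find_shift(candidate_offsets):
    counts = Counter(candidate_offsets)
    top = max(counts.values())
    tied = [off for off, c in counts.items() if c == top]
    return 0 if 0 in tied else min(tied)

print(find_shift([1, 1, -2]))    # 1  (most repeated offset)
print(find_shift([0, 3, 3, 0]))  # 0  (tie broken in favor of zero)
```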

Communication Analysis
Once data arrays are aligned, each reference pattern that is not aligned (inter- or intracomponent alignment) after the previous phase represents data movement that has to be carried out. In this phase, a matching of reference patterns to a predefined set of data movement routines is done.
In the current implementation, DDT considers simple data movement routines (routines that perform data movement in a single dimension of the template).
If the reference pattern requires data movement in more than one dimension, then the reference pattern is decomposed into subpatterns and each subpattern is matched with a single data movement routine (each one performing data movement in a single dimension of the arrays).
For each reference pattern (or subpattern if decomposed), the data movement routine that performs the data movement with the least cost is chosen. Table 3 shows the set of data movement routines considered in DDT and its matching with reference patterns. In this table,
p is the dimension where data movement takes place and i_p and j_p are the subscripts in dimension p of the reference pattern. Function const(exp) returns true if exp contains constants only.
The matching between data movement routines and reference patterns is performed in order to obtain an estimation of the overhead due to remote accesses. Each data movement routine has an estimated cost. This cost depends on the architecture of the system, the size of the block of data to be transferred, and the number of processors involved in the data movement. The size of the block B is estimated by DDT from the sizes of the array dimensions involved and the number of processors among which they are distributed.
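The concrete cost tables are architecture dependent and not reproduced here, so the following is only an illustrative linear model of the kind described (a per-message start-up time plus volume over bandwidth); every parameter value is an assumption of ours, not a figure from DDT:

```python
# Illustrative linear cost model for one data movement routine:
# time = processors * (startup + bytes_per_processor / bandwidth).

def movement_cost(elements, element_bytes=8, startup=1e-4,
                  bytes_per_sec=100e6, processors=16):
    """Estimated time for a routine moving `elements` elements per processor
    among `processors` processors (all parameters are assumed values)."""
    block = elements * element_bytes  # bytes moved per processor
    return processors * (startup + block / bytes_per_sec)

# e.g. one remapping action on a 256x256 array distributed over 16 processors:
cost = movement_cost((256 * 256) // 16)
print(round(cost, 6))
```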

INTRAPROCEDURAL REDISTRIBUTION
In this section we describe the implementation of the algorithm that performs the intraprocedural data redistribution. For the sake of clarity, we consider the problem when the application is composed of a single module (main program); the presence of procedure calls is described in Section 6. In this section we also consider that a simple control flow between phases exists (phases are executed lexicographically). Section 5 describes the main control-flow structures considered and how they modify the functionality of the main algorithm.
The intrapnlcf'dural remapping algorithm imple-rnPnted in DDT is shtm n in Figure ;).Tlw main parts of this algorithm arf' desaihed nf'xt.

Identification of Phases and Iterative Generation of Candidate Mappings
Function Identify_Phases tags each loop in the main data structure of DDT as a phase or not, according to the following definition of phase by [8]:

A phase is a loop nest such that for each induction variable occurring in a subscript position of
Once phases are defined, procedure Generate_Candidate_Mappings generates a set of candidate local mappings for each phase as explained in Section 3 and stores them in the DDT internal data structure. It is important to keep suboptimal solutions in the list of candidate mappings because sometimes it is better to execute a phase with one of them instead of the optimal one. In fact, a solution for a phase is better than another only when the sum of its execution cost and the remapping cost needed to execute it with the corresponding mapping is smaller. Each of the candidate mappings specifies only the relative alignment and distribution between the variables referenced within the phase, but not an absolute alignment over a global virtual template array for the application.
Finally, the static solution cost is computed in function Static_Solution_Cost; this cost is used as the initial cost bound for the remapping process in procedure Analyze_Compatibility. The static solution is obtained by applying the algorithm described in Section 3, assuming all loops in the procedure to be a single phase.

Compatibility between Phases
Procedure Analyze_Compatibility builds a search tree composed of the candidate mappings for the different phases in the procedure under analysis. In progressing from one phase to another, we are faced with the problem of deciding which arrays are remapped and which ones are kept with the same mapping. We say that two phases are compatible when they have preferences for the same data mapping. Assume a sequence of phases {p_0, p_1, ...}; at each transition, three alternatives are considered. The philosophy behind the first alternative is that each phase should be executed with its preferred local mapping, and data should be redistributed if necessary before executing the phase. The cost is estimated as the cost of executing phase p_i with LM_i plus the cost of remapping. In the second alternative, some of the arrays are not remapped and, as a consequence, the phase is not executed with the preferred mapping. In addition to the cost of remapping some arrays, the cost of executing phase p_i with a noncandidate local mapping LM_i' should be evaluated. In the third alternative, the assumption is that it is not worth trying to adapt to the preferred distribution of a phase, and thus the cost of executing phase p_i with mapping GM_i has to be evaluated.
The second alternative has not been considered in the implementation, since the search space can easily grow to intractable proportions. This alternative is considered by [19], and they recognize this problem. One heuristic they propose is to cut the search by limiting the number of arrays that can be simultaneously remapped between two phases.

When trying to adapt the actual global mapping GM_i to a solution LM_i^k in p_i, one of the following actions will take place for each array used in p_i.

Example: Consider the following reaching global mapping GM_i to phase p_i. The first row reflects that the first and second dimensions of array A are BLOCK distributed, with four processors allocated to each one. In addition, there is a perfect alignment between the dimensions of array A and the dimensions of the virtual global template (each column of the table represents a dimension of the template). Array B is mapped with the same alignment and distribution as array A, and array C is transposed with respect to the dimensions of the template.
Array D has the first dimension distributed among the 16 processors and the second dimension internalized. Assume that the candidate local mapping LM_i shown next is suggested for this phase. If we compare this local mapping with the reaching global mapping, we can see that array D has to be redistributed, since the number of distributed dimensions differs.

GM_{i+1}:

2. Keep arrays B and C as they are in GM_i and then transpose array A according to LM_i. In this case all arrays must keep their relative alignment with respect to each other as specified in LM_i. The resulting global mapping GM_{i+1} would be the following. Notice that although the second alternative is locally worse (it requires more data movement), it is possible that it leads to a lower overall cost for the whole sequence of phases. So the algorithm must keep it as valid and go deeper in the search tree in order to see which one is the best for the whole sequence of phases.
In the current implementation, the initial global mapping GM_0 is considered empty; however, it is possible to initialize it with either a mapping specified by the user in the source code or a mapping inherited from a caller procedure. The algorithm in Figure 5 performs a recursive exploration of all the different alternatives of candidate solutions LM_i^k for each phase p_i and remapping alternatives between each LM_i^k and GM_i. The actual algorithm reduces the search space based on the cost of the different combinations of all these alternatives. Initially we compute the cost of the static solution. This cost is used to abandon the exploration of a (complete or incomplete) sequence of phases if we detect that its current cost is worse than the cost of the static solution. Every time we know the cost of a complete sequence of phases (better than the static solution), its cost is used to update the cost bound for the process.
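The pruned recursive exploration can be sketched as follows (the phase costs, the remapping function, and the data structures are illustrative; DDT additionally tracks per-array mappings and the richer remapping alternatives described above):

```python
# Branch-and-bound over candidate mappings per phase. The static-solution
# cost is the initial bound; it is tightened whenever a cheaper complete
# sequence of phases is found, and partial sequences that already exceed the
# bound are abandoned.

def best_dynamic_cost(phases, remap, static_cost):
    """phases: list of {mapping_name: exec_cost}; remap(m1, m2) -> cost of
    switching between mappings. Returns the best overall cost found."""
    best = static_cost

    def explore(i, prev_mapping, cost_so_far):
        nonlocal best
        if cost_so_far >= best:          # prune: worse than best known bound
            return
        if i == len(phases):
            best = cost_so_far           # complete sequence: tighten bound
            return
        for mapping, exec_cost in phases[i].items():
            extra = remap(prev_mapping, mapping) if prev_mapping else 0.0
            explore(i + 1, mapping, cost_so_far + exec_cost + extra)

    explore(0, None, 0.0)
    return best

# Two phases preferring opposite mappings; remapping costs 1.5.
phases = [{"rows": 1.0, "cols": 4.0}, {"rows": 4.0, "cols": 1.0}]
remap = lambda a, b: 0.0 if a == b else 1.5
print(best_dynamic_cost(phases, remap, static_cost=5.0))  # 3.5
```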

Realignment and Redistribution Costs
Remapping costs are estimated by DDT from the specification of the global mapping GM_i and the local mapping LM_i. Function Compute_Cost in Figure 5 creates a dummy self-reference pattern for each array whose mappings GM_i and LM_i differ. Then the cost of this reference pattern is estimated by matching it with one of the data movement routines described in Section 3.4.

CONTROL FLOW
In this section we describe the aspects that have to be considered when control statements (like conditional or iterative statements) appear in the source code. These statements provoke a sequencing of the phases in the program different than the lexicographical order. In this section we present how the algorithm in Figure 5 is used when these statements appear.
The Phase Control Flow Graph (PCFG) is built for each procedure and main program analyzed. In this graph, nodes are phases and edges link nodes when there is a flow of control between the associated phases. There are other nodes in the PCFG that represent statements in the source code that provoke changes in the flow of phases. From the information in the control flow graph, the different sequences of phases that might appear during the execution of a procedure are generated. For each sequence, the same algorithm described in Figure 5 is applied.
In the rest of this section we detail how iterative loops and conditional statements are handled by DDT and how they influence the generation of sequences of phases. Other control flow structures (such as entry points and multiple exits) and nesting of all of them are also handled by DDT but are not explained in this article.

5.1 Iterative Loops
Phases might be included within loops whose loop control variables, or induction variables generated by them, are not used to subscript arrays. In this case, control flow indicates that after executing the last phase inside the loop, the first phase inside it will be executed again.
For instance, Figure 6a shows the control flow graph between the phases that appear in the ADI program shown in Figure 1. Notice that compatibility has to be analyzed between phases {9, 4} since there is a flow of control due to the outer do iter loop. When an outer iterative loop is found in the source code, DDT generates a sequence of phases that tries to represent what happens during the actual execution. As shown in Figure 6c, DDT repeats twice the phases in the body of the outer loop. Phases {4, 5, 6, 7, 8, 9} are assumed to be executed once, but phases {4', 5', 6', 7', 8', 9'} are assumed to be executed N-1 times, where N is the number of times the outer loop is executed. Notice that now possible remapping between phases {3, 4} is accounted for once, remapping between any pair of phases within the loop body is accounted for N times, and remapping between phases {9, 4'} is accounted for N-1 times. The algorithm ensures that the same solution is selected for each pair of phases p_i and p_i'.
If the loop only contains one phase, then it is not necessary to duplicate the phase, since remapping between a phase and itself will never occur. In our running example (ADI), DDT would choose a dynamic data layout that changes twice at every iteration of the iterative loop. An outline of the solution generated by DDT is shown in Figure 6b.
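The loop modeling above (charge the entry remapping once, body remappings N times, and the back-edge remapping N-1 times) can be sketched as follows; the edge-list representation and function name are assumptions for illustration:

```python
# Hedged sketch of how an outer iterative loop executed N times is modeled:
# the loop-body phases are duplicated, so remapping into the loop is charged
# once, remapping between consecutive body phases N times, and the
# back-edge remapping (last body phase to the duplicated first one) N-1 times.

def unroll_loop_phases(before, body, n_iters):
    """Return ((phase_a, phase_b), weight) pairs for remapping edges.
    E.g. before=[3], body=[4..9] gives {3,4}x1, body edges xN, {9,4'}x(N-1)."""
    edges = []
    if before:
        edges.append(((before[-1], body[0]), 1))                 # entry, once
    for a, b in zip(body, body[1:]):
        edges.append(((a, b), n_iters))                          # body, N times
    edges.append(((body[-1], str(body[0]) + "'"), n_iters - 1))  # back edge
    return edges
```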

5.2 Conditional Statements
Conditional statements generate alternative phase sequences that are executed depending on the condition evaluated in the statement. The probability of taking one of the alternative branches has initially been obtained by profiling, and it is used to compute the probability of each phase sequence.
The different sequences are analyzed iteratively, starting from the most probable one. Since sequences may have phases in common, different solutions may be suggested for a given phase in different sequences. The algorithm we propose ensures that each phase is always executed with the same solution. To ensure that, the solution for a phase is chosen in the first sequence in which it appears (i.e., the most probable sequence where the phase is used). Other less probable sequences where the same phase appears have the solution for that phase fixed.
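This most-probable-first policy can be sketched as follows; the interface (a probability per sequence and a pluggable choose_mapping decision) is an assumption for illustration, not DDT's actual code:

```python
# Hedged sketch of handling conditional phase sequences: sequences are
# visited from most to least probable, and once a phase is assigned a
# mapping in some sequence, later (less probable) sequences keep it fixed.

def assign_mappings(sequences, choose_mapping):
    """sequences: list of (probability, [phase, ...]) pairs;
    choose_mapping(phase, fixed): best mapping for phase given the
    already-fixed mappings of other phases."""
    fixed = {}
    for _, seq in sorted(sequences, key=lambda s: -s[0]):  # most probable first
        for phase in seq:
            if phase not in fixed:            # decide only on first appearance
                fixed[phase] = choose_mapping(phase, dict(fixed))
    return fixed
```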
To illustrate this aspect, we analyze the main program in the SPEC swm256 benchmark. As shown in Figure 7a, there is a conditional statement in the main program that selects either the execution of one phase (call to calc3) or the execution of another one (call to calc3z). The control flow (Fig. 7b) for the main program generates two sequences of phases. In this case,
if the then path has less probability than the else path, then compatibility will first be analyzed for the phases of the else sequence {1, …}, whose mappings are then kept invariant. As a result, a single mapping for phase 4 is selected according to the previously fixed mappings of phases 1, 2, and 3.

6 INTERPROCEDURAL DATA DISTRIBUTION
The main aspects considered in the interprocedural data distribution analysis performed by DDT are described in this section.
The algorithms used for static data distribution and for dynamic redistribution in the case of interprocedural analysis are basically the same as those used for the intraprocedural case. The definition of phase is extended to consider some procedure calls as phases, besides the loops described in Section 4. If the procedure call is outside a loop which is considered a phase itself, then the procedure is considered a phase. Otherwise, some information obtained from the previous analysis of the called procedure is used to estimate the effects of the mapping of this procedure while deciding the mapping for the phase.
The analysis is based on the call graph, in which nodes represent procedures and edges represent call sites. This graph contains representations of the formal and actual parameters and their dimensions associated with each procedure and call site. It is traversed by DDT to decide the order in which procedures will be analyzed. The approach that has been considered is a bottom-up traversal of the call graph: those procedures that are deeper in the call graph are analyzed first. This bottom-up traversal ensures that when a call to a procedure is found, the procedure has already been analyzed, and therefore its information is already in the interprocedural DDT database.
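The bottom-up traversal amounts to a post-order walk of the call graph, which can be sketched as follows (the successor-map representation is an assumption; the graph is treated as acyclic, which holds for Fortran 77 since it has no recursion):

```python
# Hedged sketch of the bottom-up traversal: callees are analyzed before
# their callers, so a procedure's candidate mappings are in the database
# whenever a call to it is encountered.

def bottom_up_order(calls, main):
    """calls: procedure -> list of called procedures (acyclic call graph).
    Returns procedures in the order they should be analyzed."""
    order, visited = [], set()

    def visit(proc):
        if proc in visited:
            return
        visited.add(proc)
        for callee in calls.get(proc, []):
            visit(callee)        # analyze callees first
        order.append(proc)       # then the caller itself

    visit(main)
    return order
```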
When the mapping specified for a dummy argument or global variable differs from that of its actual argument or global variable, HPF requires implicit data redistribution and/or realignment. Additionally, upon return to the caller program, the original mapping must be reestablished, so two possible remappings for each argument and global variable should be considered. Other programming models based on HPF, such as the one offered by FORGE [11], allow an output mapping different from the input one. In this case, the information stored in the internal DDT database after deciding the mapping for the procedure will be its initial and final mapping (in addition to the realignment actions performed inside it and its execution cost with the selected strategy). Notice that this latter model is a generalization of the definition of HPF. DDT actually supports both alternatives.
A procedure may have several candidate mappings stored in the interprocedural database. If different mappings are preferred in different invocations of the same procedure, then procedure cloning is applied in order to generate versions of the same procedure with different mapping and parallelization alternatives.

6.1 The Procedure Call is a Phase
A procedure call is considered a phase when it is not placed inside a loop tagged as a phase. The remapping algorithm used in this case is mainly the one described in Section 4.1. However, function Generate_Local_Mappings reads the internal DDT database in order to obtain the different candidate mappings for the called procedure, instead of computing them from scratch.
Phases due to procedure calls have candidate mappings composed of an initial and a final mapping. In this case, the cost of remapping is determined by the mapping differences between the actual global mapping and the corresponding initial mapping for the phase. However, after the execution of the phase, the global mapping is updated with the final mapping. This means that when generating different permutations of the local mapping in order to estimate different realignment options, both the initial and the final mappings should be permuted. This analysis can be extrapolated to any kind of phase (loop and call), assuming that the local mapping for a loop has identical initial and final mappings, whereas the call has its corresponding initial and final mappings.
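The initial/final distinction can be sketched as follows; the phase and mapping representations are assumptions for illustration:

```python
# Hedged sketch of costing a phase that carries an initial and a final
# mapping (a procedure call): remapping is charged against the initial
# mapping, and the global mapping is then updated with the final one.
# A loop phase is the special case in which initial == final.

def apply_phase(global_map, phase, remap_cost):
    """phase: dict with 'initial' and 'final' mappings (array -> layout);
    remap_cost(array, from_layout, to_layout): cost of one remapping."""
    cost = sum(remap_cost(array, global_map.get(array), layout)
               for array, layout in phase["initial"].items()
               if global_map.get(array) != layout)
    new_global = dict(global_map)
    new_global.update(phase["final"])   # phase leaves arrays in final layout
    return cost, new_global
```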
The next example is used to illustrate the aspects described above. Assume that some phases of a procedure have been analyzed, and that at this point the global mapping GM_i contains the following information (the mapping tables of this example are not legible in the source): all arrays used before that phase are perfectly aligned, and their first and second dimensions are distributed with four processors assigned to each one. Assume also that the next phase to be analyzed is a procedure call whose local mappings in LM_i+1 (obtained from the interprocedural database) consist of an initial mapping and a final mapping. When analyzing this phase, arrays used in it must be tagged according to the distribution differences between the global mapping and the initial local mapping. We can see that array E is new, so it will be included in the global mapping with its desired distribution (note that it will also be included in the initial mapping for this procedure). … But if the second alternative is selected, then arrays B and C are kept as they are in the global mapping, so both the initial and the final local mappings are transposed, which yields the corresponding transposed global mapping.

Note that the mapping for all arrays (except for D) in GM_i+1 in this last alternative is different (transposed) from their mapping in GM_i, and apparently only array A has been realigned. Attention must be paid to the local mappings of phase p_i+1: the difference between the initial local mapping and the final one means that, at least, arrays B, C, and D have been realigned inside the procedure, and thus their cost has already been assumed within the cost of executing the procedure.
If the first alternative is selected, then arrays B and C are realigned before the procedure call and inside it as well, so GM_i+1 remains unchanged with respect to GM_i. At this point it is not possible to say which alternative is the best because it depends on the phases not yet analyzed, so the analysis must continue with the two alternatives.

6.2 The Procedure Call is Inside a Phase
When the procedure call is placed inside a loop which is tagged as a phase, then the call is not considered a phase. In this case, the initial and final mappings assigned to the procedure may affect the choice of the candidate mappings for the phase. When a call statement is found, DDT imports from the interprocedural DDT database all the information associated with each possible candidate mapping for the called procedure.
During the alignment step, when a call statement is found, DDT imports from the corresponding file in the interprocedural DDT database the information regarding ALIGN directives for the global variables and the actual parameters. This information is included in the DAG of the phase as additional edges. In fact, this is an approximation to the problem; a more accurate model would have to weight these new edges with their corresponding realignment cost.
During the distribution step, and for each call to a procedure, the global variables and the actual parameters must be remapped (if necessary) before and after the call. The mapping that minimizes the overall cost, including remapping, is selected among the candidate ones.
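This selection rule can be sketched as a minimum over the candidates; the function names and cost interfaces are assumptions for illustration:

```python
# Hedged sketch of the selection rule above: among the candidate mappings
# of a called procedure, pick the one minimizing remapping before the call,
# execution under that mapping, and remapping back on return (HPF restores
# the caller's mapping on return, as described in Section 6).

def best_candidate(candidates, current, remap_cost, exec_cost):
    """candidates: list of mappings; current: caller's mapping;
    remap_cost(a, b): cost of remapping from a to b; exec_cost(m): cost of
    executing the procedure under mapping m."""
    return min(candidates,
               key=lambda m: remap_cost(current, m)      # remap before the call
                             + exec_cost(m)              # run the procedure
                             + remap_cost(m, current))   # restore on return
```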

7 EXPERIMENTAL RESULTS
The main components of the data distribution environment we are using and developing are shown in Figure 8. We have analyzed three programs: ADI (shown in Fig. 1 and analyzed in Section 2), routine RHS from the APPBT, APPLU, and APPSP NAS benchmarks, and swm256 from the xHPF benchmark set.* We will see for the examples how changes in some architectural parameters, such as number of processors and remote access time, lead to changes in the solution generated by DDT. The tool is useful for the characterization of programs as well as the study of the effects of these architectural parameters.

*Available by anonymous ftp at ftp.informall.org in directory tenants/apri/Bench.

7.1 Alternate Direction Implicit (ADI)
In this section we further analyze ADI and compare the performance predicted by DDT against the performance obtained when simulating the execution of the message-passing code generated by xHPF. In addition, we also show the usefulness of the tool to predict the performance of different mapping strategies when changing architectural parameters.
Figure 9 shows the predicted and the measured speedups for the program for different numbers of processors (ranging from 1 to 32) for two possible solutions: the static solution where all arrays are column distributed (adi2 in all plot labels) and the dynamic solution (adiD in all plot labels) as shown in Section 2. For this plot we considered a remote access time of 1 μs. We can draw the following conclusions:

1. The prediction performed by DDT (solid lines) is very close to the actual speedup (dashed lines).
In the dynamic solution we have noticed a small difference due to the estimation of redistribution costs. The model that we consider to estimate these costs (see Section 4.3) is not very accurate and overestimates the number of data elements moved.

2. The speedup of the static solution grows from one (for one processor) to two (for machine configurations with a large number of processors). This is due to the fact that about one half of the program is executed in a synchronized way with an execution time close to the sequential execution time.

3. The speedup of the dynamic solution is lower than one for configurations with less than four processors but then grows with an efficiency close to four. This is due to the fact that remapping overheads decrease with the number of processors involved in the data movement.

FIGURE 9 ADI speedup vs. number of processors for the static and dynamic solutions. Comparison of the predicted and measured speedups (remote access time = 1 μs). Plot labels: adiD (predicted, measured) and adi2 (predicted, measured).

To further compare the dynamic and static solutions, Table 5 shows the breakdown of the predicted execution time into computation and data movement times. Notice that for the static solution the data movement overhead is constant (it is due to shifts where the number of elements moved is independent of the number of processors). However, in the dynamic solution the redistribution overheads decrease with the number of processors involved in the data movement. The following conclusions are drawn:

1. For very low access latencies, the speedup tends to 16 in the dynamic solution and 2 in the static solution.
2. The static solution is less sensitive to the memory latency than the dynamic solution. This is due to the fact that the volume of data transferred in the static solution is small while in the dynamic solution it is large. For large latencies, any gain due to parallel execution is offset by the data movement overhead.

3. For this number of processors, DDT chooses the dynamic solution when the remote access time is less than 5 μs and the static solution otherwise.

7.2 rhs Routine from NAS and swm256 Benchmark
In this section we analyze the behavior of the solution suggested by DDT for the other two benchmarks: the rhs routine from NAS and the swm256 program. For each of them we compare the performance predicted by DDT against the performance obtained in the simulated execution. We assume that the remote access time is 1 μs and that the system has from one to eight processors. Figure 11a shows the behavior for rhs. In this case DDT suggests a dynamic solution where three arrays have to be remapped. The dynamic solution implies that the outer loop in each phase runs in parallel. In this case the prediction is close to the actual performance because DDT performs an accurate estimation of both data movement and parallel computation times. The main program in swm256 includes an iterative loop and conditional statements that validate the correct behavior and estimations of the control flow module in DDT. Notice a small difference between the predicted and measured speedups. The difference is due to an overestimation of the data movement overhead; to get a better estimation, we have to improve the module that detects redundant data motion either within a phase or between phases.

8 CONCLUSIONS AND REMARKS
In this article we have presented the key modules in our automatic data distribution tool, DDT. DDT generates both static and dynamic HPF data distributions for a given Fortran 77 routine and for a whole application with interprocedural analysis. In the static solutions, the mapping (alignment and distribution) of each array in the program does not change during the execution. The static module is based on the CAG but is extended with some information regarding parallelism. We have also modified the original algorithms in [1, 17] to improve the quality of the mappings generated [21].
Dynamic solutions include executable statements in the source code that change the mapping of specific arrays when necessary between computational phases. DDT performs a cost analysis of profitability in order to include them. This analysis of profitability is based on the following steps:

1. Detection of phases, or computationally intensive portions of code, which mainly correspond to nested loops and calls to procedures. Remapping is only allowed between phases.
2. Generation of candidate mappings for the previously detected phases and estimation of their cost (including data movement and execution time costs).

3. Analysis of compatibility among phases, selection of mappings for them, and remapping actions to be performed between consecutive phases. This selection is done by analyzing the cost in terms of data movement due to redistribution and its benefits in the cost of successive phases.
Control flow information is used to identify the sequencing of phases. The algorithm explores a rich set of combinations although it is not exhaustive. It includes mechanisms to cut down the search space.
DDT is a research tool which is currently used in our group to support different research aspects. Since it is a research tool, it can use techniques that may be too computationally expensive to be included in a final compiler; however, this allows us to explore a rich set of solutions.
We have evaluated the quality of the solutions generated by DDT. We are currently porting this technology to generate efficient code for hierarchical global shared memory architectures. In these architectures a number of central processing units can simultaneously access data anywhere in the system. However, the nonuniformity of the memory accesses is still an important issue to consider and may require a higher programming effort in order to achieve performance; trying to access those levels in the hierarchy closer to the processor will increase execution efficiency. The technology developed to study the profitability of dynamic data remapping can be used to track the movement of data during program execution and thus parallelize loops accordingly, so that the access to data is done locally as much as possible.

FIGURE 4 Heuristic to choose a new column in the graph for alignment. (a) Intermediate graph. (b) Bipartite graph with domain E. (c) Bipartite graph with domain D.

The offset δ obtained in the previous step is subtracted from all incoming arcs into node N, and added to all outgoing arcs from node N. At the end of the algorithm, each node in G_i has an associated shift with respect to the template node. If G_i is acyclic, a perfect intradimensional alignment results. In this case, all the edges are aligned and no data movement is needed. If cycles are present in G_i, then some edges may not be aligned and, therefore, data movement may be required for them.

FIGURE 6 (a) Control flow graph for ADI. (b) Source code for ADI with the directives specifying mapping and remapping of arrays generated by DDT. (c) Sequence of phases analyzed by DDT.

FIGURE 7 (a) Outline of the main program in the SPEC swm256 benchmark. (b) Control flow graph.

FIGURE 8 Main components of our automatic data distribution platform: DDT, xHPF compiler, and simulator from APR Inc.

Figure 10 shows the predicted speedup when the remote access time changes for the static and dynamic solutions. The aim of this graph is to show the influence of remote access latencies on these solutions.

Figure 11b shows the behavior for swm256. In this case DDT suggests a static solution where all the arrays are distributed by columns. This static solution implies that almost all the loops run in parallel. We have evaluated the quality of the solutions generated by DDT by comparing predicted performance against the actual performance when the parallel program is executed. We have also shown the usefulness of the tool for the characterization of the programs as well as the study of the effects of architectural parameters. We have shown how the predicted speedups are close to the actual ones obtained when the program is executed. DDT also accepts HPF directives in the source Fortran 77 program; in this case DDT is useful as a support tool for the developer of HPF codes in estimating the effect of user-selected data mappings and parallelization strategies on the final performance of the parallel program.

2.1 An Example: Alternate Direction Implicit

If the phase is a loop nest, the candidate mappings for it are obtained by performing an analysis of reference patterns within the nest. If the phase is a call, the candidate mappings are imported from the DDT interprocedural database. The control flow mod… …processors through the interconnection network. Data movement costs are estimated as the number of remote accesses multiplied by the remote access time.

Table 1. Array Mapping Alternatives and Associated Loop Parallelization Strategies Analyzed by DDT for Phase 4 in ADI

…pings for a phase, a set of them are selected as candidate mappings (this selection is done based on cost…

Table 2. Data Movement and Computation Costs (in seconds) for Phases 4 through 9 in ADI for Two Candidate Solutions

Let n be the maximum number of nodes in a column of G. Partition the node set of G into n disjoint subsets T_1, …, T_n, with the restriction that no two nodes belonging to the same data array are allowed to be in the same subset.

…mensionality of all the arrays analyzed. Each dimension of each array is aligned with a dimension of the template. Second, the intradimensional alignment problem tries to decide how all the array dimensions aligned into a dimension of the template are shifted with respect to each other. Although this includes offset and stride alignments and reflections (stride -1), only offset alignments have been implemented in the current version of DDT.

Table 3. Communication Routines and Their Matching with Reference Patterns
…arrays in LM_j need a different permutation of their distributed dimensions to fit into the GM_i. In this case we propose to keep one of them in turn in the GM_i, and realign the rest in order to maintain the relative alignment that is specified in the LM_j. All these alternatives are stored…

3. If the array is already included in GM_i, and its mapping in GM_i has the same distributed dimensions as in LM_j (no matter whether they are transposed or not), then the array is eligible to be realigned. If only one array fits in this case, then no realignment is necessary. Realignment is only necessary when two or more arrays in the … therefore it can directly be added. Arrays A, B, and C are candidates to be realigned since the number of dimensions and the dimensions actually distributed are the same. Since arrays B and C have the same relative alignment in both GM_i and LM_j, two different alternatives can be considered: to keep array A as it is in GM_i and to transpose arrays B and C according to LM_j (in this case we would obtain the corresponding transposed global mapping), or to keep arrays B and C as they are in the global mapping and realign array A.

Table 5. Breakdown of the Total Execution Time for ADI (Remote Access Time = 1 μs)