Towards Architecture-Adaptable Parallel Progrannning

Parallel processing is facing a software crisis. The primary reasons for this crisis are the short life span and small installation base of parallel architectures. In this article, we propose a solution to this problem in the form of an architecture-adaptable programming environment. Our method is different from high-level procedural programming languages in two ways: ( 1) our system automatically selects the appropriate parallel algorithm to solve the given problem efficiently on the specified architecture; (2) by using a divide-and-<:onquer template as the basic mechanism for achieving parallelism, we considerably simplify the implementation of the system on a new platform. There is a trade-off, however: the loss of generality. From a pragmatic point of view, this is not a major liability since our strategy will be useful in building domain-specific problem solving environments and application-oriented compilers, which can be easily and effectively ported to diverse architectures. We give preliminary results from a case study in which our method is used to adapt the parallel implementations of the conjugate gradient algorithm on a multiprocessor, a multicomputer, and a workstation network. © 1997 John Wiley & Sons, Inc.


INTRODUCTION
The most efficient parallel algorithm for solving a problem often depends on the target architecture.Thus, unless a parallel programming system has the ability to adapt the algorithm to the architecture, it will not be truly machine independent.
In the traditional approaches to machine-independent parallel programming, the user encodes an algorithm as a parallel program using a high-level programming language.Using a combination of compilers and run-time systems, this program can be executed on a variety of platforms, but the algorithm embedded in the program may not execute efficiently on all the platforms.Hence, only limited machine independence is achieved.
In this article, we present a new scheme for machine-independent parallel programming.Our scheme is built on the following three key ideas: ( 1) the use of a database of parameterized algorithmic templates to represent computable functions; (2) frame-based representation of processing environments; and (3) the use of an analytical performance prediction tool for automatic algorithm design.
By automating the detailed design of an algorithm and the generation of a parallel program, our approach relieves the user from much of the burden of parallel programming.There is a trade-off, however: the set of problems that can be solved efficiently using our approach is limited by the contents of the template database.However, we believe our strategy will be useful in FIGL:RE 1 (a) Traditional algorithm-oriented approach to parallel proct>ssing.(b) Our prohlem-oriPnted approach to parallel prot't'ssinl!.
Figur<> 1 contra~ts our approach with the traditional approach.In our problt>m-oriented approach . . the user descrilws the problem to lw solved.rather than an algorithm to solw tlw problem.The set of problems that can be solved using a system may be called the scope of tlw system.\~•e restrict tlw scope of our system to provide a portable.easy to use. and high-perfonnanc<> proce~sing environment.In contrast.the traditional approach maximizes the scope to include all Turing computable problt>ms at the expense of restricting portability.programmability.and performance.
To see that limited scope is not a major liability.one onlv need to look at the recent historv of the . .computer industry.1.The massive surge in the popularity of personal computers is primarily due to the availability of domain-specific software packages with r<>stricted scope.prime examples being word processors and spreadsheets.2. In the realm of scientific: computing.users are increasingly moving towards problem-solving environments such as .\1ATLAB.abandoning the traditional programming languages.such as Fortran. 3. The biggest challenge to parallel computing comes from the .. killer workstations:• simply because tlw improvement in performanc<> resulting from using the parallel computer is not enough to justify the additional cost and effort.
Pragmatically, this implies that it is fruitless to parallelize all applications.Those that benefit from parallelization form a subsf't, and programming models with enough t>xpressive power to cover a reasonable number of applications will have just as much practical usc as a Turing equivalent model.
We use a computational modt>l based on divideand-conquer to dt>sign the algorithm templates.In the next few sectiom.wt> dt>scrihe this model and tlw details of our scheme to automatically g<>rwrate architt>cture-adaptahle parallel programs.We haw applied our scheme to dt>vt>lop efficient parallel programs for several scientific applications on diverse architectur<>s.Included in this article is a case study describing tlw application of our strategy to parallelize the conjugate gradit>nt (CG) method 011 a shart>d memory multiprocessor.a distributed memory multicomputer.and a network of workstations.\\~<> believe the diversitv of the target platforms and the complexity of the application make this case study a good test of the validity of our approach.

COMPUTATIONAL MODEL
There is only one basic mechanism for parallelism in our model: a meta-function c.allt>d parallel dil'ide-rwdconquer (PDC).Divide-and-conquer is a well-known problem-solving strategy in which a problem is solved by dividing it into a number of smaller subproblt>ms and then solving tlw subproblems by the recursive application of the same procedure.Infinite recursion is prevented by using a base predicate which triggers a base jimction.The solutions to the subproblems form partial results.which arc combint>d to form the final result.In PDC the subproblems art> solwd in parallel.providing an easy opportunity for exploiting parallelism in architecture.
In our modt>L a program is represented as a se-qu<>nce of divide-and-conquer.Figure 2a shows the graphical representation of such a program in the form of a DAG, comprising thrc<> divide-and-conquer operations.The shaded squares dt>note the base cases.1\'ote that the number of subprograms generated and the depth of recursion change for each invocation of the operation.Essentially, each operation has a well-defined top-level structure.but the details can change for each invocation.We use the notion of a parameterized template to represent these operations: the template describes the top-level structure and the parameters are used to add the details.The lowest laver of our template database is made up of such templates.""fetatemplates, consisting of cascaded divide-and-conquer operations such as the one shown in Figure :2a.are formed from these base templates.
Another important aspect of our computational model is the mapping of the subproblems to the pro-ct>ssing nodes.We combine the divide-and-conquer paradigm with the singl<>-program multiple-data (SP\'ID) style of programming to obtain an efficient implementation of the cascaded divide-and-conquer explained above.The subproblems at each level of the DAG are mapped to all or a subset of the processors.This is in contrast to the conventional approa(~h to divide-and-conquer programming where each subproblem gets mapped to a single processor.Figure 2b shows a possible implementation of the program in Figure 2a using two processors for the first divideand-conquer operation and four processors for the rest of the computation.
A meta -template is an abstracc high-level representation of a generic method to solve a problem.There may be a large number of plausible implementations for a meta-template.\Ve generate an efficient program to solve a problem on a given architecture by choosing the implementation that performs best on that architecture.Thus.we see that there is a search space associated with each meta-template and the problem of generating an efficient program reduces to a search problem.The size of this search space is determined by the number of constituent base templates, the number of parameters in each one of them.and the number of permissible values for each of these parameters.:\'ote that there could be several meta-templates to solve the same problem, adding one more dimension to the search space.

METHODOLOGY
\\-e begin with a collection of meta-templates for the problem and an abstract description of the architecture.The templates represent methods for solYing the problem.The number of templates in the collection is problem dependent-some problems will have only a single template.whereas others may have two or more.Our goal is to generate an efficient algorithm to solve the problem on the specified an:hitecture.
To achieve this goaL we traverse the path from a generic method to an algorithm by adding the necessary details.This means customizing the template by determining the appropriate values for the parameters.
If the search space is small.we can exhaustively search for the best set of values for the parameters.provided we have a good objective function.The role of the analytical performance prediction tool is to provide this objective function.Given a set of parameter values and the relevant specifications of a target platform.the tool predicts the performance of the implementation on the specific platform.
What kind of details do we need to add to the template to make it an efficient algorithm?Here is a partial list: 1. Structure of the divide-and-conquer tree: This will vary based on the processing environment for the sample template.2. ""lapping of the processing nodes to the leaves of the tree: The mapping that minimizes the communication overhead is desired.3. Depth of the tree: This determines the granularity of the resulting parallel program.4. Optimal subset mapping: Sometimes performance may be enhanced by using only a subset of the resources.;).""lachine-specific data decomposition: There are several ways grid data can be decomposed, and based on the problem instance and the architecture.a particular decomposition may be superior.6. \-lachine-specific solution method: When there are several candidate templates, the one that maximizes the performance needs to be selected.
A combinatorial explosion of search space is conceivable for complex applications, making exhaustic search impractical.For these cases, there are two ways to prune the search tree: 1. We can make use of application-specific knowledge to limit the number of parameters and their permissible values.This kind of pruning is done manually at the template design stage.We will show an example of this in the case study.2. We can use branch-and-bound algorithms to eliminate the fruitless searching of unproductive branches.The system can automatically perform this pruning while searching for the best implementation.
If there is more than one template for a problem, then each one of them will be customized and the best one selected using the performance prediction tool.
Converting the detailed template to a message-passing program or a shared memory program can be accomplished using current compiler technology [ 1]. Figure 3 shows the method schematically.
It is important to note that divide-and-conquer is bing used in this system merely as a methodology for designing the templates.Users do not write divideand-conquer functions-they call higher-level functions like matrix-vector multip{y or dot product.Emitted code is not a divide-and-conquer program.It is an SPMD program in which every processor is active throughout the execution of the program and doing useful work:

Divide•and•Conquer Template
Our template is based on the algebraic model of divideand-conquer proposed by Mou and Hudak in [2].The template encapsulates problem solving using divideand-conquer in three phases: a divide phase, a conquer phase, and a combine phase.An overview of the template is given below.
• Data distribution dedarations: This explains how data points are distributed among the pro:.cessing units for distributed memorv machines: for shared memory systems, this represents the logical division of data points among processing nodes.Using grid problems as an example, row decomposition, column decomposition, and block decomposition can all be captured using appropriate data distribution declarations.These declarations can also be thought of as pre-conclitions and post-conditions on the template.The distribution of input data is a necessary pre-condition for the invocation of the template; the result of the invocation (post-condition) is the output data distributed in the specific layout.• Divide phase: Actions in this phase can be npressed using two functions: 1. Divide function: As shown in Figure 2, the SPMD implementation of the parallel divide-and-conquer requires a mapping from the subproblems to the processing nodes.
The purpose of the divide function is to specify this mapping.For example, consider the first divide-and-conquer operation in Figure 2. The original problem is mapped to a pool of two processors.In the divide phase, two subproblems are generated.To map these subproblems to the processing nodes, we split the processing nodes into two partitions: a left partition and a right partition.The first subproblem is allocated to the left partition and the second problem to the right partition.This is an example of a frequently used simple divide function: binary, equal, one-dimensional, left-right division.Thus, the domain of the divide function is the set of processing nodes of the target environment.2. Pre-adjust function: Notice that the divide function implies the partitioning of the domain of the function computed by the template as well.For example, if the function is computing the product of a banded matrix and a vector, the left-right divide will cut the input vector and the matrix into two chunks as shown in Figure 4.In addition to partitioning data, the divide phase may need to modify the partitioned data sets.This is accomplished using a pre-adjust function which is applied to the partitions before the subproblems contained in these partitions can be solved.The divide phase of the banded-matrix-vector multiplication template is illustrated schematically in Figure 4, where wP show how the divide function splits the data sets and the pre-adjust function modifies them.
• Conquer phase: We need to specify only a sequential base function and a base predicate in this phase, because everything dse rt>duces to recursive applications of the previously defined template.The base predicatP is used to specify the terminating condition of the recursion.Because we use the divide function to partition the set of processing nodes, the recursion will have to terminate when there is only a single node in a partition.Thus, the default base predicate checks the number of nodes in the partition and returns true if there is only one.In this case, the base function is simply a sequential program.
In some architectures, it might be advantageous to terminate the recursion early with a group of processing nodes in each partition, instead of single-node partitions.The base function will be a parallel program in such cases, but less complex than a program designed to run efficiently on the entire platform.An interesting special case is when the base functions are dataparallel programs, giving nested data parallelism.
• Combine phase: When the subproblems are solved, we will have partial results distributed among the processors.In the combine phase, we wish to combine them to produce the final answer.Similar to the divide phase, we use two functions to accomplish this: 1. Post-adjust function: Subproblem solutions arc modified using this function.We can use the powt>r of recursion and invoke the template itself from within the postadjust function, if necessary.The example tt>mplate for matrix multiplication, given below.illustrates this.A simpler example will be the computation of the dot product of two vectors.In the combine phase.we use the post-adjust function to add the partial results to form the global sum.

Combine function: The combine function
is merelv the inverse of the divide function.
As an example, the fpjf-right combine merges the left and right partitions into a single partition.
~otice that the divide and combine functions do not operate directly on the application data.These functions merely change the state of a set of registerswhich we call system data-maintained by each processor.The system data collectively determine the position of a processor within the DAG representing the divide-and-conquer operation.For example, consider how the position of the processor Pl changes during the last divide-and-conquer operation in Figure 2. The first application of the divide function changes the system data of Pl to make it the second processor in the left partition.The second divide makes it the first processor in the right partition of the first left-right pair.During the combine phase, the first application of the combine function makes it once again the second processor in the left partition.The adjust functions, which operate directly on the application data, use the position information to determine the exact nature of communications and computations at each processor.
Figure 5a shows the execution of the generic PDC on n processors.Figure Sb shows the execution on a single processor, the base case.
Figure 6 shows an example template for matrix multiplication.C = AB.This is one of three templates we have for matrix multiplication in the system database.
Figure 7a shows the execution of the matrix multiplication template on n processors.The unrolled execution sequence for four processors is shown in Figure 7b. Figure 8 shows how the data structures at each processor change as execution proceeds.
Our current design of the system uses a higherorder function to implement the template with the arguments of this function representing the fields of the template.All communications appear exclusively in the adjust functions.Additionally, these templates Data distribution: have the benf'fit of having only regular and well-dPfined communication patterns.

Representation of the Processing Environment
The computing envirormlf'nt is described using a frame structure.The slots in the frame rPpresent attributf's.values of which may be represemed by otlwr frames.The collection of frames.thus formed.holds all the information we need to design a program that will executf' efficiently on the reprt'sentecl environment.The information stored in the frame includes the number of processors, tlw processing power of the nodes.the interconnection network, and the mf'nwry hif'rarchy.Figure 9 shows the frame rf'prt'sentation of a typical high-performance computing environment.

Performance Prediction
Performance prediction plays an important rolf' in the development of a detailed algorithm from a generic template.as pointed out in Section 3. (In distributed-memory machines, this would automatically imply the division of data structures as well.) Pre-adjust function: None.
Base function: Sequential Matrix Multiplication.
Post-adjust function: Swap columns of B between partitions.
Apply Matrix Multiplication Template to both partitions.

Combine function:
Combine the LEFT and RIGHT partitions.' ' ' ' ' ' ' ' Frame representation of a typical processing environment.
to estimate their run-time on the specific processing environment.
We begin by introducing the terminology used for describing the model as shown in Table 1.
The input size is a vector, because some problems may have more than one input parameter.An example is the banded linear system solver, whieh has three input parameters: the number of unknowns, the bandwidth, and the tolerance.The input size vector of the subproblems can be formulated as a function of the original vector and the number of subproblems (n,.= g(n, k) ).This function is very simple in most cases: The input size of the subproblems is obtained by simply dividing the input size of the original problem by k.For some applications, such as the banded-matrixvector multiplication shown in Figure 4, a slightly more complex function is required.
We make several simplifying assumptions in forming the performance prediction model: • The divide tree is complete and balanced.
• The processing environment is homogeneous.
• The base functions are sequential.
• There is no overlapping between computation and communication.
• The effect of cache on the speedup is negligible.
The first three assumptions have no bearing on most processing environments, including the three used for the ease study in the next section.Nevertheless, we plan to improve the model in the future to include arbitrary trees, heterogeneous processing environments, and nonsequential base functions.
The SPMD programs generated by our system currently do not support the overlapping of communication and computation.In our computational model, the opportunity for such overlap is limited, because it an be done only in the adjust functions.We do not consider this to he a liability as there is empirical and analytical evidence to suggest that communication-computation overlap has limited benefits [4,5].
Can we ignore the impact of cache on the performance?Because we are using performance prediction to compare implementations, we are interested in relative perfom1ance rather than absolute performance.As long as cache effects do not change the relative performance of the implementations we are comparing, it is safe to ignore them.Our experience with the system, including the case study presented in this paper, validates this assumption.However, cache effects could be significant for certain applications and architectures.Refining the performance prediction model to include memory effects is one of our future goals, especially since our system has access to the required information-sueh as the input size, the memory access pattern of the application, and the memory hierarchy of the architecture.
Predictor fundions Attached to f'ach template io; a function to compute its predictf'd performance.\'\' f' will call this the predictor function.Parameters of tlw template are arguments of this function.Additionally.the prrdictor function has an extra argumf'nt deuotin;r the type of the architecture.There are only thrf'f' permissible vahlf's for this argument.These valuf's and tlw associatrd architrcture typf' arr listt>d below: 1. S\1: Shart>d nwmon• machines.The equations above nwrf'ly reflect tht• structure of the divide-and-conquer template and hrnce remain the smnf' for all trmplates.The predictor function of each template will have additional expressions to compute the application-specific details. ;\ext we discuss these details and the methods used for computing them.1. f(n).We nePd the number of scalar floatingpoint operations.the number of vectorizable loops.and the number of vector operations in each such loop of the secpwntial base function to computef(n).The predir•tor function of eaeh template computes thesr numbers using the input size wctor n.
We assume that the vectorizahle loops in the base function can be identifit>d without prior knowledge of the target platform.In reality.a loop that vt>ctorizes on one vector machine may not vectorize on another machine.primarily due to variations in the compiler technology.Because the templates would be dewloped by experts rather than novice usf'rs.we assume the base functions are coded in such a wav that most compilers can vectorize tlwm.
2. TJ,, 1111 ,(n.p) and C,,,"(n.p ).Just as in the previous case.the predictor function computes the scalar and vector operation counts using the input size veetor and the number of processors.These numbers correspond to computations in the adjust functions of the template.(p = 1) requirt>d in tlw adjtbt functions in a shart>d Illt'liiOrv environnwnt a~ the \a! ues of these expressions.
.\lotice that the prPdictor function is not computinp; the execution time~ directly.This is atTomplished by the Performance Predictor thing machine-,.;pecificdetaik \Xt> will show hm\• this is dmw for tlw <'OIIliiiliiiication time component.Because communication printitivt's are onlY fpw in number.the cost function for t>ach such primitive is stored explicitly in the machinP datahast>.BPcause the Pt>rfonnaii<'t' Predictor knm\s tlw specific targf't machine of the template.it invokt's tlw appropriate cost function of this machine for each entry in tlw list it receives from tlw prt'dictor function.Computation and synchronization timPs art' cmnputPd similarly.by combining the expressions returnt>d hy the prt>dictor function with machine parameters.
Tlw two-step computation of predicted performance dt>scrilwd above has thP advantage of decoupling the t<>mplatt's from the architectural dt'tails.whiJP maintaining grPat flexibility in anaJyticaJ JWrformance prPdict ion.

CASE STUDY: CONJUGATE GRADIENT
\\'e pr<>sPnt an example in which the scheme described in the pr<>,~ious sections is used to dewlop efficient parallel impl<>nwntation~ of the CC nwthod for three diverse architectures.

1 Mathematical Description of the CG Method
The ( :c method rs an itf'rativP scht>mf' for soh'ing linear systPms of P<pwtions.Civ<>n a symmetric.positiw definitf'.ccwfficit>nt matrix A. and a vt>ctor b. it compute~ the solution vt>ctor of the linPar systt>m 4.r = b using the algorithm ~hown in Figure 11 [(J].
The llH'ta-template is parameterized using the following three parameters: 1. PnH'Pssors ust'd for matrix-n•<•tor multiplication (Pl ).The matrix-vector multiplication is tlw most compute-inten~e task in the CC itt>ration.I IPnct'.it willlw lwiwficial to use all the aYailablt> processors for this operation.Thus.tlw size of the target platform e~sentially detPnnint>s this paranwtn.:2.ProcesS<>rs ust'd for tlw J't'st of the operations (P:2).The poor granularity of dot product ran affect the owrall performanc<> of tlw CG imple-nH'ntation.Thi~ paramt>tf'r would iPt us improve tlw granularity by computing the dot product 011 a subset of the a\'ailable processors.Tlw systf'm nsf's JWrformance prediction to decidt> tht> optimum granularity dt>pt>nding on tlw machine characteristicb and problem size.The computational complexity of the CG iteration is concentrated in tht> matrix-Yector multiplication.Thib is an O(n 2 ) operation.wlwrt'as tht> rest of tlw <'Oillputation has only lirwar tim<> complt>xity.By tying together the granularitit>s of all linear-time op<>rations.we reduce the numi = 0; 9i =hi = b-Axi; while (not converged) do: ber of paranwtrrs and thrrrby limit the size of the search space.In contrasL.if Wf' allow tlw granularity of f'ach individual opt'ration to change indqwndently .. the search space will have a combinatorial f'xplosion.But this will not necessarily lead to a bettf'r solution, because the cost of redistributing the data (in thf' distributed memory machines) or synchronizations (in the shared memory machines) will eventually force the search f'nginc to choose an implementation with the same granularity for these operations.This is an example of the pruning of tlw Sf'arch space using applieation-specifie knowledge.As mentioned in Section 3. an alternate method is to start with a full set of paranwters and then use branch-and-bound algorithms to eliminate unproductive branches of the search tree.
The distribution of tlw ctwfficient matrix among the processors is an important parameter.because the adjust functions, the divide function.and the combine function will be determined by this distribution.w•e consider three different distributions: row-contiguous.column-contiguous . .and block -contiguous.
In Figure 12. we present a parameterized metatemplate for a single iteration of CG using pseudocode.
The CG-template in Figure 12 is called a metatemplate because it is built by the composition of a number of simple divide-and-conquer templatf's.This nample illustrates our layered approach to building adaptable parallrl programs usmg divide-andconquer.
The base tt>mplates used for building meta-templates fall into two categories on thP basis of their fum:t ionalit\•: 1. Basic linear algebra templates.Examples indude matrix multiplication, dot product, and matrix-vector multiplication.2. Data redistribution templates.Each linPar algebra template has a set of pre-conditions and post-conditions.which are specifif'rl in terms of the distribution of the input and output data structures.respectively.W'hen two templates are concatenated to form a mPta-template, it might he rwcessary to retlistribute the output data from the first to mef't the prr-eonditions on the second.Data redistribution ternplatf's are used for this purpose.1\otice that these will be required only for distributed memory machines.
All constituent base templates of the CG template utilize eitlwr the Left-Right divide (LH.) or the LEFT-RlGHT-TOP-BOTT0.\1 divide (LH.TB).the two standan!divide functions.The LH. divide splits the processor pool into two equal partitions, a LEFT partition and a RIGHT partition.It impost's a linear ordPring on the processors.Figure 1:3 shows the LH.divide of f'ight processors.The LRTB divide splits the processors into four partitions.LEFT-TOP.LEFT-BOTT0.\1,RIGHT -TOP, and H.IGHT-BOTT0.\1.The processors are arranged logically as a squarP two-dimensional (2D) mesh as shown in FigurP 14.where the LH.TB dividr is applied on a pool of 16 processors.

Linear Algebra Templates
Below we describe the linear algebra templates used for building the CG template:

Adapting the Template to a Shared .Memory Machine
Fignrcs Vi. 16. and 1? shr)\\ the predicted and obsf'rved spef'dups for an input size of 102-t for J'O\Y.
column.and block distribution of the watrix.rcspt>ctively.The predictions tend to be more optimistic than the actual performance.hut in terms of the rdatiw performance.the prt'dictrd Yalues match with observations.In gt>neraL tlw St>arch Engine can make the following decisions hast>d on the performance prediction:

Adapting the Template to a Workstation Network
Figurt> 19 shows the prt>dictn!and observed performance using row decomposition of the matrix for vary-  all messagrs to be serialized.This explains the selection of column decomposition for the workstation network.The extremelY eoarsc nature of the network discourages parallel execution of the dot product computation.

Adapting the Template to a Multicomputer
Figures 22, 23, and 24 show the predicted and observed speedups on a CM<) u;o;ing row,.column, and block decomposition, respectively.To see how predictions help in selecting the best values of the parameters, we will look closely at the data for a 16-proeessor machine and a 32-proeessor machine.
Figure 25 shows the predictt>d and observed perfor-

FIGURE 21
Predicted and observed performance of CG on worbtatiou network using BLOCK decomposition of the matrix.

Predicted Speedup
Observed Speedup FIGURE 22 Predicted and observed rwrfonnance of CG on CM-5 using ROW decomposition of the matrix.PI is the number of processors used for matrix-vector multiplication and P2 is the number of processors used for the rt'st of the computation.The fine granularity of thf' machine justif1<•s tlw spreading of the entire computation among all the available processors.

Comparison with Other Parallel Programming Systems
In the prf'Yious section, we gave JWrformance figures for a parallel CG solver on diverse arehitectnrf's programmed based on our approach.How does our system compare with other commonly used parallel programming tools in terms of pt>rforrnanee?To answer this question.wf' implemelltPd the CG nwthod using C.\1 Fortran on a C\f-.) and PowPr Fortran on an SGl.Our lwst implenwntation on the .

Conclusions from the Case Study
Three key i,.;sues in parallt>l procPssing are perfor-nwneP.portability.ami programmability.Wt> heliew that our nwthod addre,.;,.;es all three of tht>m.at tlw expense of g<'Iwrality.The loss in generality is not a sPriou,.; dnt\\back if tlw <'XfH"f'Ssihility of tlw system is snfficit>nt Pnough to cover most problems of practical intcre,.;t.By using tlw CG method a,.; an example.we have shown that an algorithm of significant practical intt>rt>st can be repre,.;entedusing our system.
As dt>tailt>d in the previou,.;,.;ection, the ca,.;e study corroborates our tht>sis that performance prediction along with the dividt>-and-conqut>r paradigm can provide architectur<> adaptability.Tlw ea,.;e study shows that the algorithms g<>nt>rated by tlw syst<>m can have pfficit>nt implementations on diwrs<> proct>ssing Pnvironnwnts.TlwrP are two reasons for this.First.the system is able to st>arch tlw paramf'tPr space exhaustiwly.because this space is relatively small.Th<> other reason for good pfficiPncy is the low overhead of our implementation of the dividt>-and-conquer templates [?]. "' e have seen similar results from anotlwr case study involving a finite element ocean circulation model [8].

CODE GENERATION
Hatcher and Quinn describe a compiler for a dataparallel language targeted towards multiple-instruction multiple-data (YllYID) computt>rs in [ 1].Tlwir compiler generates SPYID programs on distribut<>d memorv and shared-memorv machines from data-.
. parallel specifications.The task of our code generator is similar.albeit much simpl<>r.In this section.we outline the impl<>m<>ntation of the tt>mplates, and the mechanisms for handling communication calls and data distribution primitives.

Template Implementation
\\'e use a set of higher-order functions to implement the tt>mplates.The base templates fall into four cat ego-riPs based on the presence or absence of the adjust functions.Templates in each category art> implt>mented using a function associated with that category.Below we describe these categories and the associated functions: Additionally.the number of processors and the dimt>nsion of the processor grid, along with a pointer to the application data and a pointer to the system data.are also passed as aqruments.The dimension of the processor grid refers to the logical 1 D or 2D mesh embedded in the topology of the machine.Divide-and-combine functions assume the Pxistence of such an ern-b<>dding.The system data structurt>s remain the same for all templates.They simulatt> the trawrsal up and down the divide-and-conquer tre<>.The application data will change for each template.becaus<> they are spt>cific to the problem being solved.
The routine first computes the depth of tlw divide-and-conquer tree.Here we assume the default bast> predicate; tht> recursion terminates wh<>n th<>re is only a single processor in each part1t10n.Because the depth of the divide tree is known a priori, we replace the recursion with two iterative loops with the call to the base function placed between them.The first loop simulates going down the divide tree from the root to the leaves.At the leaves, we invoke the base function.The second loop simulates going up the tree from the leaves to the root.

Communication
The interprocessor communication is limited to the adjust functions.These communications appear as calls to a handler function in the pre-adjust and postadjust functions.The handler functions are responsible for generating machine-specific communication memory, and the target node reads them.But unlike in messagr--passing architectures.processors should synchronize to ensure that a read does not happen before a write.It is the responsibility of the handler function to insert the necessary synchronization calls.

Data Distribution Primitives
A meta -trmplate may include several data distribution primitives to ensure compatibility between the constituent base templates.\Ve hae usPd t\vo such primitives in the CC template.vector concatenation and vector distribution.These primitives themselves are implemented as divide-and-conqut>r templates.But unlike other base templates, thest> arc required only on distributed memory machint>s.On platforms with shared memory, there is clearly no need for explicit data redistribution and the Code Generator suppresses these calls.

USER INTERFACE
Our goal is to isolate the computational scientist from the details of parallel hardware and algorithms.This is achieved by having a very dear delineation of roles for the participants.We briefly describe these roles below: 1. Enviroment builder: The role of the environment builder is to develop the enabling technology.This indudes the user interface, search engine, and the performance prediction tools.2. Environment maintainer: The primary role of the environment maintainer is to extend the problem-solving capabilities of the system by gf'nerating appropriate algorithmic templates and by providing the necessary parameters to drive the performance predictor.3. Computational scientist: At the highest level, the users interact with the system by specifying the problem to be solved using a high-level notation.Thus, the computational scientist is able to concentrate on the science, without worrying about the details of the underlying computing environment.
The high-level notation mentioned above serves as the user interface.There are three important criteria for the selection of this notation: 1.The computational scientist should be familiar and comfortable with the notation.

2.
It should be possible to represent computational science problems easily using the notation .
.  The algebraic model of divide-and-conquer, introduced by Mou and Hudak in [2], has influenced the design of our templates.Chandy's group at Caltech [11] and Dongarra's group at Tennessee [12] are also investigating the use of "templates,, for high-performance scientific computing.Our method differs from these and other work on templates by introducing a novel approach to architecture adaptability, combining parameterized templates with analytical performance prediction.
A vast amount of research in parallel programming has been motivated by the desire to make parallel programming easier and portable.Here we attempt to categorize these efforts and comment on their impact in the real world.

Architecture-Independent Programming Languages and Systems
A high-level parallel programming language f'Hn provide limited architecture independence and programmability if efficient compilers are available on several machines.Data-parallel C [1] and Fortran 90 [13] are two examples.Chameleon [ 14] is a shared memory library designed for architecture independence.Sev-eral messagt'-passing libraries -most notably PYI\T [ 1 S J. \IPI [ 16 J. and p-t [ 1 7 ] -art' in Pxi,;;tencP to facilitate portabk parallPl programming on distributed ntemory machines.Crowr s \latroshka system prt>sPnts the framework of a parallel llrOf!Tamnting langwtgP which has tlw ability to adapt to difft>rent architectures [18].A \latro,;;hka-basPd program expo,;;es all tluavailahle parallPlism in an application.L :sing annotations.a subset of this parallelism is sckcted for execution on a specific machine.
SelPction of tlw appropriatP algorithm is still up to thP user when a genPral-purpose.procedural language is used for proble111 solving in parallel a11d distributed Pnvironments.This implies that effective portability is not achiPved.For sPquential machines.because there is only one machine rnodel.this is not a problt>m.For distributed computing.this difficulty impPdPs thP drvt>lopment of Pasily portable problem-solving environnwnts.Tlw mPssagt>-passing libra riPs (such as PV\L \IPI,p4. and Illinois Fast .\lessages [ 19]) mPrel~ become accessories to our system because our goal is the automation of algorithm design and irnplt>mentation using the availablt> programming tools.
Another approach to designing machine-independent parallel programming relies on implicit parallf'lism.Tlw parallel prograrmning languages based on thf' functional paradigm falls into this category.A munbcr of such languages have been proposed. including EPL [20].Id [24 J. and a parallel dialect of Haskell [2:> J.A functional language can providr a highrr lrvPl of abstraction to tlw prograrnnwr.cmnparPd to Pxplicitly paralld procedural languages.But the objective of our work is to providP to users a much higher level of abstraction.We are striving for a parallel-processing Pnvironment where tlw ust>rs art' able to concentrate on the problems being solved.rather than the ~election of tlw algorithms and their implementation in sorrw programming paradigm.The choice of thP programming paradigm.whether it is functional or proceduraL is of little help to the computational scientist in designing the algorithm.

Algorithm Architecture Mapping
Rt>prt>sPnting the algorithms aml architt'cturt>s as task graphs.graph embedding nm lw used to generate parallel programs that adapt to machine topologit>s.Tlw OrPgami project [.'l<J] i~ an example of tool dt>wlop-mPnt using thi~ approach.
In practice.graphical rPpresentations of parallel program~ are complt>x and may contain ~event!communication patterns intPnningled.Further.with the adnmce of wormhole routing.communication ovt>rlwads are dominated by me~~age startup tinws rather than the distance hetwePn tlw communicating proce~ ~ors or link congt>stion.In short.complex task graphs makt> this approach impractical.and tlw change in technology renders it itTPlPvant.

Multiprocessor Scheduling
Another attempt at solving the problem resulted from looking at parallel pnwt>ssing as precedence-constrained multiprocessor scheduling with interproces-~or cornnnmication delays.The toob developed by El-RPwini and Lewis [ -±0] for ••scheduling parallPl program tasks onto arbitrary targf't machines•• are Pxamples of this approach.
As in tlw mapping problem.parallel algorithnt'i art> reprPsentPd as ta~k graphs.Tlw sizP and complexity of the task graphs of rt>al-life applications make this scheme impractical.

FUTURE WORK
\X'e plan to use thi~ method to develop domain-specific problem-solving Pnvironments (PSE) and application-oriPnted compilers targeted to linear algebra and partial differential equations (PDE).PSEs and compilers based on this nwthod will be architecture independent hecausP the dPseription of the processing environment is an independent parameter.
Although we focused on templatPs based on PDC in this article.the nwthodology presPnted here could be applied to other models as well.A case in point is the sequential diYide-and-conquer (SDC).where deperHIPnciP~ Pxist lwtwf'Pn ~ubproblem~ [ -t 1 J. Tem-platPs basPd on SDC may be usPd to g<'nPra1P ptjJe!ined parallel programs.TlwrP art' architccture-prohlPm combinations wherP this approach will givP tlw lwst re~ults.Dynamic programming (DP).like tlw diYidPand-corHpwr mPthod . . is a problem-solving strategy that ~olvPs problem~ by cornbining tlw ,o;olutiotb to ~ubprohiPms [ -t:2].But the stru<"turP of a DP-bast•d tPmplate will be wry difft•rf'tlt from that of a PDC template.There art' application domains that can lwnefit from tlw automatic parallelization that uses DP tPmplate~ as the seeds.,,.e plan to extend the ~cope and solying powf'r of our methodology by designing templates based on othn paradigms.such as DP and SDC:.\lost of the future work will he focused on three areas: performance prrdictiotl.the databasr of algorithrn templates.and code generation.Thr pcrforrnance prPdiction needs to be improYed to take into account memory cffech.Automatic generation of code for a specific target machine based on tlw tPmplate with hest predicted performarl<•e i~ an area which requires further work.
The database of algorithm templates will be organized with a hirrarchical structurP.At the bottom.we will haw basic linear algelmr kernels.data di,.;trihution and t•omrmurication primiti\t~s.and divide/combine functions.On top of this layer.nontriYial applications-~uch as CC met !rod.:20 FFT.banded-sy,.;temsolwr.and rigensystem solwr-will be built.:\rmwrical models and PDE solvrrs will form yet another layer.\\e belie\t' thi~ layered approach will haw tlw expn•ssihility to soht• most problems in scientific computing.. including irregular and un,.;t ructured problems.

FIGURE 2
FIGURE 2 \lapping of the divide-and-conquer tree to tlw processing nodes.Task graph shown on the Jpft is mapped as shown on tl1e right.The shaded squares denote the basP case.'lute that the first divide-and-conquPr opPratiou is pPrfornwd using only two processon;.whereas the rest of the computation uses four processors.

FIGURE 5
FIGURE 5 Schematic viPw of the ext'cution of the generic PDC template.(a) PDC on n processors.(L) PDC on one processor: tllP haoe case.
Our analytical performancE' prediction tool is built on the model df'veloped by Clement and Quinn [3].It exploits tlw algebraic structure of divide-and-conquer algorithms matrix A distributed row-wise among the processors matrix B distributed column-wise among the processors matrix C distributed row-wise among the processors Divide function:Divide the processor pool into two equal partitions, a LEFT partition and a RIGHT partition.

FIGURE 8
FIGURE 8  Snapshots of thP oat a strucwres and processor partitions at the states labeled in Figure7h.Shaded areas show the matrix blocks stored at a processor at tlw indieatf'd state.

1 .
Row-oriented matrix-vector multiplication: Ab = c.All data structures are distributed among the processors with row-contiguous distribution used for the matrix.BBBB LEFT RIGHT FICLRE 1:3 Left-Rif!ht division of right processors.

!FIGURE 19
FIGURE 19Predicted and ohscrwd performanct> of CG on workstation nf'twork using HOW decomposition of the matrix.Pl i~ tlw HUI!lber of proces,;ors used for matrix vector multiplication and P2 is the munlwr of processors ust>d for the rest of the computation.

FIGURE 20
FIGURE 20 Predicted and ohst'rved performance of CG on workstation netv.-ork using COL-t\J\: decomposition of the matrix.
FIGURE 25 Predicted awl ohsern•d performance of CG on a 16-processor C.\1-S for row.cohmuL and hlnck dt'compo~ition of the matrix.

FIGURE 26 C
FIGURE 26 C function PDC:.

FIGURE 27
FIGURE 27 Correspondence communication with LEFT-RIGHT divison.
2.Templates with only pre-adjust functions.We call the associated function prePDC.The vector distribution template presented earlier is an example of this category.This function is derived from PDC by replacing the second iteration by a single function call, which has the effect of transforming the state from the leaves to the root in a single operation.

3.
Templates with only post-adjust functions.Similar to the previous case, we replace the first iteration by a function call to derive postPDC from PDC to represent this category.Most templates presented earlier in this article are examples of this type.4. Templates with no adjust functions.We call the associated function purePDC.The SAXPY operation used earlier in the CG template is an example of this category.

FIGURE 28 FIGURE 29 FIGURE 30
FIGURE 28 Mirror image communication with LEFT-RIGHT division.

Table 1 .
Terminology Four on the Left and four on the Right) ThP prt'dictor functio11 simply ITturns a list of commtmication OJWrations in the adjust functions of the template.These opera-tion~ arc well-defint'd .-;ystemprimitives.Exam-piPs include rorrespomlence commtlftication and mirror irnage commwtiration.In addition to the idPntifier of an operation . . the list ell1rirs will also include the values of the paramt'tt'r,; of this orwratioll ..\n cxarnplP is shown in Figure10.-±.D,,,,(ii.p)andC,.,,(ii.p).Tiw predictor function simply returns the munlwr of synchro11ization~ .\HCI!lTECTl HE-.\Il\PT\BI.E P\H\LI.EI.PHOCHA\1\11'\C 17:3 Template to distribute the product, vector from Pl nodes to P2 nodes.Pl as the number of processor~> to use.Invoke DC-Template to distribute the product vedor from one node to P2 nodes.caseBlock-contiguous:InvokeDC-Templatefor blo\k-or~ented matrix vector multiplication with Pl as the number of processors to use.Invoke DC-Template to distribute the product vector from one node to P2 nodes.End sw1tch (MAT)Invoke DC-Template to distribute the vector h from Pl nodes to P2 nodes Invoke a senes of DC-Templates br dot product .andSAXPY with P2 as the number of processors to use lnvoke DC-Template to redistribute the vector h from P2 nodes to Pl nodes.
3.It should be possible to extract templates easily from •'programs" written in the notation.