Parallel Performance of a Combustion Chemistry Simulation

We used a description of a combustion simulation's mathematical and computational methods to develop a version for parallel execution. The result was a reasonable performance improvement on small numbers of processors. We applied several important programming techniques, which we describe, in optimizing the application. This work has implications for programming languages, compiler design, and software engineering.


Introduction
Numerical simulations of reactive o w are widely used for problems such as controlling combustiongenerated pollutants, reducing knocking in internal combustion engines, studying the environmental impact of compounds emitted from combustion, and disposing of toxic wastes 1 .These simulations require extensive computation.Many can only be served by the advanced capabilities of a parallel supercomputer.In this paper we describe an e ort to optimize the parallel performance of a reactive o w simulation written for serial execution.Speci cally, w e examine Premix 2 , which simulates combustion, an important subclass of reactive o w.
Reactive o w modeling problems are governed by equations conserving mass, energy, and momentum.They are coupled with a hydrodynamic system driven by the energy released or absorbed from the chemical reactions.Researchers seek to understand the chemical kinetics behavior of large chemical reaction systems and the associated convective and di usive transport of mass, momentum, and energy.
Complicating the numerical simulation of reactive o w i s n umerical sti ness.Sti equations have one or more rapidly decaying solutions and usually require special treatment.In the context of chemical kinetics Curtiss and Hirschfelder 3 rst identi ed the problem of sti ness in ordinary di erential equations in 1952.In reactive o w, sti ness often arises as a result of the di ering time scales of the chemical kinetics and the hydrodynamics 4 .Chemical reactions occur on the order of picoseconds, while the convective o w occurs on the order of seconds.Sti ness also results where large temperature gradients occur.To o vercome these numerical di culties researchers often employ time-implicit algorithms and adaptive gridding schemes.
A group at Sandia National Laboratories has developed a number of software tools that facilitate simulation of reactive o w.Three basic packages lie at the heart of their e ort.The Chemkin library 5 is used to analyze gas-phase chemical kinetics.The Transport 6 library is used for evaluating gas-phase multicomponent transport properties.Surfkin 7 is a package for analyzing heterogeneous chemical kinetics at a solid-surface gas-phase interface.These three combustion libraries undergo continual revision as part of an ongoing e ort to provide the numerical combustion community with standardized software.This approach is successful because the governing equations for each reactive o w application must share a n umber of features.A general discussion of this structured approach to simulating reactive o w is found in 1 .
Several codes have been built by Sandia to exploit Chemkin, Transport, and Surfkin.One of them is Premix, which is used to predict the steady state temperature and species concentrations in one-dimensional burner-stabilized and freely propagating premixed laminar ames.This combustion is chemically interesting because the large energy release associated with burning gives rise to high temperatures and many exotic chemical species.The high temperatures resulting from the transfer of chemical energy to heat lead to rapid expansion of the gases which in turn a ect convective o w.
The goal of this paper is to describe experiences in an e ort to improve the performance of the Premix application.The machine architectures we considered are shared memory multiprocessors with a modest number of CPU's, such as the Alliant FX series, the Convex C2 series, and the high ends of the Sun SPARCstation, HP Apollo, IBM RS 6000, and Silicon Graphics Iris series.Such machines are becoming less expensive and more widely available.
Only one version of the FORTRAN 77 source for Premix is distributed by Sandia.This code executes without signi cant modi cation on all machines from a personal computer to a Cray.T o insure the software can still be used by the large established user base, modi cations to the code are strictly backward compatible; that is, the subroutine interfaces are xed.Our main concern, then, was with extracting parallelism from the chemical and thermodynamic computations performed by the Chemkin and Transport libraries.
We approached Premix with a simple goal: Reduce the actual time a program requires to produce a solution to a given problem through e cient use of multiprocessing hardware.To accomplish this, independence must be present in the code so that di erent subproblems can be executed by separate processors concurrently.Often the desired independence, if it exists, is apparent from the mathematical description of the physical problem.This conceptual independence may not, however, be expressed in the actual code.Two factors contribute to the absence of conceptual independence in the nal program: 1 the computational method chosen to approximate the mathematical problem may sequentialize formerly independent tasks; 2 the speci c implementation of the computational method adds unnecessary synchronizations.
We therefore make a reasonably sharp distinction between the mathematical model of a problem, the computational method for its solution and the particular implementation of the method.We begin in the next section with a brief overview of Premix.In Section 3, we observe h o w w ell the original version of Premix expresses parallelism inherent to the mathematical model and computational method.In Section 4 we describe the program transformation techniques applied to produce an optimized version of Premix.I n Section 5 we exhibit the resulting performance improvement, and in Section 6 we o er the conclusions drawn from this work.2 The Premix application Premix is a typical example of a library-oriented production FORTRAN code.It is a exible program developed to analyze general problems involving combustion of premixed gases in a ame.Premix consists of a driver and four libraries: Chemkin 5 , used to analyze gas-phase chemical kinetics; Transport 6 , used to evaluate gas-phase multicomponent transport properties; Twopnt 8 , a two point boundary value problem solver; and Linpack 9 , a popular numerical linear algebra package.Each is a standardized, extensible library intended for use on a wide variety of platforms.The code, approximately thirty thousand lines of standard FORTRAN 77, is highly modular, robust, and portable.The program can be stopped at any of several checkpoints and restarted with a new con guration.
Our testing environment w as a shared-memory MIMD machine, an Alliant FX 80 10 with eight processing units.The processors are register-based with chained functional units and memory port.The computational processors are connected by a concurrency bus, which k eeps the overhead for concurrency small.A sequential pro le for an execution of the nitrogen combustion simulation mentioned earlier appears in Figures 1 and 2. For the test problem the program tracks 34 chemical species and 151 chemical reactions through three simulated burns.The one-dimensional grid begins with 19 grid points and is ultimately re ned to 61 grid points.
The program spends most of its execution time in routines from the Chemkin and Transport libraries.Approximately 65 of the sequential execution time is consumed performing chemical kinetics computations in Chemkin routines ckytx, ckmmwy, ckwyp, ckrat, ckhml, ckcpbs, ckrhoy, and ckcpms.These subprogram names are de ned in Table 1.Another 20 of the execution time is consumed by transport computations in Transport routines mtrnpr, ckytx, mcadif, mcedif, mceval, and mcacon.Solving systems of linear   equations consumes most of the remaining time.The Twopnt library simply controls the ow o f t h e computations and thus contributes little to the execution time.

Description of the algorithm
We rst give a description of the mathematical model and the computational method, which assisted us in discovering which level of outer loop parallelism is best to obtain a granularity su cient to saturate available processors with reasonably sized parcels of independent w ork 11 .A mathematical description of the general problem appears in several references e.g., 2 .We review them brie y here.We then consider the computational methods employed to solve the combustion problem and explore the potential for parallelism in these methods.Finally, w e describe the particular implementation of these methods and explore the remaining potential for parallelism in the actual program.

Mathematical model
Premix computes the steady state temperature and species concentrations in one-dimensional burnerstabilized and freely propagating premixed laminar ames.The steady state is de ned by the following conservation equations 2 : 3 where K is the number of chemical species.Thus, K + 2 conservation equations govern the steady state of the system.The symbols appearing in these equations are de ned in Table 1.
The chemical kinetics computations occur in evaluating the molar rates of species production _ !k , the speci c form of which is determined by the input dataset according to the equation, where the k;i are user-speci ed integer stoichiometric coe cients and the q i are the computed reaction rates.Determining the value of q i is computationally intensive, consisting of numerous exponentials, logarithms, and reductions, both multiplicative and additive.
The heat generated or absorbed by these reactions strongly a ects the gas ow.In Premix, the chemical kinetics are computed rst from the input data; then the hydrodynamic system governed by the conservation equations 1 3 is solved in the presence of the chemical reactions.
Equations 2 and 3 are discretized using nite di erence approximations.A grid is numbered from 1 at the cold input boundary to J at the hot output boundary.The convective terms, _ M dT dx from the energy equation and _ M d Y k dx from the momentum equation, are modeled by either rst order windward or central di erences as necessary.The other derivatives are approximated by rst and second order central di erences.The di usive term of the species conservation equation, d dx AZ k , is approximated in the same manner.Appropriate boundary conditions are implemented for both the cold and hot boundaries, yielding a t wo point boundary value problem.See equations 10 21 in 2 and discussion therein for a detailed description.The nitrogen combustion problem is solved rst using windward di erences for the convective terms.Then the initial solution is used as a starting condition for a run using central di erences for the convective terms.
The nite di erence approximations reduce the sti two point boundary value problem to a system of nonlinear algebraic equations.The boundary value problem is modeled rst on a coarse mesh.When necessary, new grid points are added nonuniformly in regions where the solution or its gradients change rapidly.Assuming a unique solution exists, this process ends when the solution has been resolved to a speci ed degree.
The nonlinear system is solved using the modi ed Newton-Raphson algorithm.We seek a vector which satis es F = 0 : 5 We begin with a usually poor approximation ^ to .It is clear that F ^ is not zero.The quantity y = F ^ 6 is called the residual.
In order to obtain a block-tridiagonal structure in the Jacobian, the mass ow rate, _ M, is treated as an independent v ariable _ M j at each grid point, and the additional equation stating that they are all equal, d _ M j dx j = 0 j = 1 ; : : : ; J 7 is added with a suitable boundary condition.This mass conservation equation, coupled with the energy conservation equation 2 and the K equations of momentumconservation 3 yield a total of K+2 equations.The approximate solution vector ^ has the form, ^ = ^ 1 ; ^ 2 ; : : : ; ^ J 8 where ^ j = T j ; Y j;1 ; Y j;2 ; : : : ; Y j;K ; _ M j : 9 Equation 9 corresponds to the independent v ariables for temperature, species concentration, and mass ow rate for each grid point, j.
The modi ed Newton-Raphson algorithm produces a sequence f n g, n+1 = n , n J n ,1 F n : 10 In the equation, is a damping parameter and J is a nite di erence approximation to the Jacobian matrix.The sequence converges to the solution of the nonlinear equations F given a su ciently good starting estimate 0 .It is rejected if it does not converge.
Should the Newton algorithm fail to converge, a user-speci ed number of arti cial time integrations are performed to improve the conditioning of the nonlinear system.The discretized time integration is again a system of nonlinear equations.The modi ed Newton-Raphson method is employed to solve the nonlinear system, but in this case it is much more likely to converge.See the discussion in 2 for more details.
Independence inherent to the computational method Each Newton or time-stepping iteration depends directly on the result of the previous iteration, so we will not discover independence necessary for parallelization outside the computations within a single iteration.We will show, however, that Jacobian evaluation contains considerable independence, in that all residual di erences can be computed simultaneously.Additionally, many of the properties evaluated for each species and reaction within a single residual evaluation are independent in principle.Others are not independent, but many h a ve the form of a reduction, a computation amenable to partial parallel optimization.
Let n represent the vector of independent v ariables after Newton iteration n.It has been shown in 12 that y = F n depends only on the partial vectors, n j,1 ; n j ; n j+1 ; n,n0 j ,1 ; n,n0 j ; n,n0 j +1 : 11 The dependence on some previous evaluation n , n 0 arises from the fact that the transport coe cients are not recomputed for each iteration.It follows that y depends only on solution vectors n and n,n0 , both of which are available at the beginning of Newton iteration n+1.That is, y = F n is a completely explicit computation.Thus, the computations for each grid point sectioning of y can be performed simultaneously.
It follows that all the residuals needed to approximate the Jacobian can be computed concurrently.We see that there exists the potential for several levels of signi cant parallelism in Premix.Note, however, the hierarchy is not strict.For e ciency, the Jacobians are often reused.Thus, a signi cant number of residual evaluations occur which are not part of Jacobian evaluation.In the nitrogen combustion simulation we used for testing, one third of the residual evaluations occur independent of Jacobian evaluation.This suggests that if a single level of parallelism is to be exploited, it should be done at the level of residual evaluation.

Speci c implementation
The control ow o f Premix can be viewed as in Figure 3.The Chemkin Interpreter 5 and Transport P r operty Fitting Code 6 are each external modules which access databases to create linking" les to be read during execution.The Chemkin and Transport libraries require access to many problemspeci c constants, such as the molecular weights of the species.In addition, each library requires some scratch space, or memory locations used to store values needed only temporarily.T racking the use of these scratch arrays is signi cant when analyzing for parallelism.
Because the libraries are general-purpose and used in a wide variety of applications, these work arrays must be of arbitrary size.Thus, a dynamic" memory allocation scheme is used.Both Chemkin and Transport implement dynamic memory allocation in a way common to scienti c programs written in FORTRAN.F or each data type employed by one of the program libraries character, integer, double precision oating point, a single, large array is carved into sections by a sequence of integer o sets computed at runtime.The indices are computed during initialization and stored in COMMON blocks for future use.They are never modi ed after initialization.The work arrays for each of the libraries are passed as arguments down the calling tree.A COMMON block for each of the libraries encapsulates the pointers into their respective integer and oating point w ork arrays.It is important to note that the COMMON blocks for a particular library are declared only in procedures within that library.
Returning to Figure 3, we see that each time the outer control loop iterates, either the Newton solver or time stepping is invoked.The Newton solver is always invoked rst; time stepping is only performed when the Newton solution phase fails to converge.A single Newton iteration consists of the following steps 2 :  calculate the residual fun, if necessary, e v aluate jacob and factor dgbco the Jacobian matrix, and backsolve dgbsl.Because chemical computations involve only a grid block and its immediate neighbors equation 11, the chemistry is local.As the residual evaluations are independent of one another, no conceptual reason exists that the residuals cannot be computed e ciently in parallel.
Computing the residual requires numerous chemical and thermodynamic property e v aluations at each grid point.The computation has three distinct steps.First, the transport coe cients are evaluated, if necessary.Then the di usion velocities are computed.Finally, the chemical kinetics terms are evaluated and the residuals of the governing equations 2, 3, and 7 are determined.
However, the speci c implementation of the computational methods hides some of the potential for parallelism.Concurrent e v aluation of the residuals is hampered by the presence of shared local variables and work arrays.The chemical and thermodynamic computations for each grid point, which w e also identi ed as independent in principle, cannot be executed concurrently either.In addition to shared local variables and work arrays, the nearest-neighbor communication of density and area data forces a sequentializing synchronization.The next section describes techniques to overcome some of these problems.

Programming techniques and optimization
In this section we describe the program transformation techniques we applied to the speci c implementation of Premix and the program analysis that was necessary to do this.We compare these techniques to those applied in other application programs and discuss some implications on programming languages, compiler design, and software engineering issues.

Transformation and analysis techniques
The basic program modi cation that enabled multiple processors to participate in the parallel execution of the program was to declare a number of time-consuming loops to be executable concurrently.Simply speaking, in order to do this we rst had to recognize that the iterations of these loops were potentially independent, then perform some transformations to make them truly independent, and nally insert a directive informing the compiler that the loops shall be executed in parallel.
By far the most important transformation in this process was the privatization of arrays Figure 4 that are used as temporary work spaces within loop iterations.In the original program all such loop iterations use the same arrays for storing temporary results.In a parallel execution of the unmodi ed program, every iteration would have t o w ait before using this array u n til the previous iteration is done using it, which e ectively would serialize the loop.However, by giving each iteration a separate copy of the array, w e can avoid these dependences.The di culty of this transformation is in making sure that it is a truly temporary array where no array element passes information from one loop iteration to the next.This is usually done by an array de nition use analysis of the program.
An additional technique the parallelization of reduction operations we h a ve found to be applicable in our program.However, we h a ve not done this because we exploited an outer level of parallelism.The transformation will become important o n m a c hine architectures that support the exploitation of multiple levels of parallelism, for example machines that have cluster structure so that the outer parallel loops can be spread across clusters while the inner loops exploit the parallel resources within the cluster.
For both the de nition use analysis and the detection of independence of the loops we had to analyze the program interprocedurally.Often, array sections were de ned i.e., written in one subroutine and used i.e., read in another subroutine.Even more di cult was the analysis of accessed array sections that involved program input data.Sometimes it was only knowledge of the application that could ensure that, in all reasonable executions of the program, input variables would relate so that the de ned array ranges would always cover the used ranges or that the ranges accessed in di erent loop iterations would never overlap.
The dynamic memory allocation scheme, mentioned in Section 3.2, further complicated the situation.We had to track array subscripts which w ere themselves subscripted array elements in order to determine which sections of the original, large array are read or written.Since the subscript arrays are read-only after their initialization, it is possible to determine temporary arrays and parallel loops from the analysis of the program code.However, this process is tedious and it makes the interesting question of whether such techniques could be automated in a compiler quite challenging.

Tools, languages, and programming methodology
A pro le facility that identi ed the most time-consuming loops in the program was the basic instrument for our program analysis.In addition, the most helpful tool was an array section analysis facility that determined the array sections read and written in each subroutine and loop.This information was then propagated up the calling tree so that the summary of all accessed arrays could be seen at each loop.
The actual transformations were done in a conventional text editor.Compared to the time consumed by the program analysis this task was not overly expensive, although the mechanics of array privatization could be somewhat tedious as described below.
We restructured our program by explicitly specifying parallel activities, rather than changing the program so that the compiler could recognize the parallelism automatically.The language we used is FORTRAN 77 plus directives.The only directive w e used is CNCALL, which speci es that the loop shall be executed in parallel.Private arrays were speci ed in two forms, both using available FORTRAN 77 constructs.One form is to declare the array local to a subroutine that is called inside the parallel loop and the other is to expand the array b y one dimension and index this dimension with the loop variable.The second form is usually called array expansion.Sometimes, subroutine parameter lists had to be modi ed in order to pass expanded arrays from calling to the called routine.
Common extensions to FORTRAN 77 are constructs for dynamic array declaration.Arrays of arbitrary size and dimension can be declared locally, within a subprogram.Had we used this extension, we w ould not have had to modify any of the subprogram parameter lists, leaving the Chemkin and Transport libraries backward compatible.
The availability of a directive that declares variables priva t e t o a l o o p w ould have been very useful for our purposes because it would have allowed us to leave the existing program text unchanged.Such a directive would also have to support the privatization of a partial array.W e encountered situations where part of an array w as read-only and another part was used for temporary storage.To handle this situation we split the arrays into di erent parts and privatized the temporarily used sections.The need for a PRIVATE directive i s an important conclusion of our work, and it corresponds to ndings of related work.
The method of program optimization we h a ve applied consists of identifying the time-consuming loops in the program, analyzing array sections that are read and written in these loops, and deriving privatizable and independent array sections.The parallel loops in our program could then be determined from this information.The actual transformations necessary to express the parallelism were straightforward.This programming scheme seems generally applicable and may be used as a programming methodology that can be applied in a systematic way.Although we h a ve found this to be useful for optimizing other programs as well, we should note that there are time-consuming optimization steps for which w e don't know generally applicable methods.Such steps are the gathering of knowledge about the application that goes beyond the analysis of the program text.We h a ve found this to be important in some cases for our program optimization.

Comparison to ndings of related projects
In a related project of optimizing application programs for parallel computers similar results were found.Such projects include the Cedar Fortran project 13,14 which w as completed at our center in 1992, and the follow-on Polaris project 15 .Both projects studied transformation techniques that are needed to speed up real programs.This was done by hand-parallelizing a suite of codes, including the Perfect Benchmarks and some applications of relevance to the users at the NCSA 1 .
The most important transformations identi ed were the same as in our project.Array privatization was most e ective, followed by the parallelization of reduction operations.Interprocedural de nition use analysis was a crucial technique to determine the applicability of the transformations.The transformations yielded fully parallel loops whose iterations could be executed independently on multiple processors.
Our application is relevant for these other projects in that it con rms the results and thus shows that they carry over from the sample benchmark suite to new programs.One di erence seems worth noting.The ultimate goal of the above related projects was to nd techniques that can be automated in a parallelizing compiler, and in fact most of the transformations identi ed were reported to be automatable.In our program we h a ve found that some crucial informationfor determining the applicability of the parallelization techniques is known only from the input les and thus is not available at compile time.Although there are compilation techniques that are able to parallelize such situations at runtime 16 , our ndings indicate that it will be at least di cult to detect the parallelism automatically.A full discussion of this point i s b e y ond the scope of this paper and is the object of future projects.
A related approach to methodologies for parallel programming is described in 17 .Our ndings largely agree with this approach.One di erence is that 17 envisions a program-level" optimization, in which all necessary information for transforming the program can be gathered from the program text.As we h a ve mentioned, for optimizing Premix there was sometimes a need to use knowledge about the mathematical and physical properties of the problem that could not easily be gathered from the speci c implementation of the program.
Our ndings can also be compared with the parallel programming methodology that envisions the design of application programs from optimized libraries.The parallelism would be hidden in these libraries and the programming method for the user of these libraries would be no di erent from sequential programming.A further advantage of this approach is that the libraries could be optimized speci cally for each machine and the application program would be portable.Because Premix uses standard libraries, it would be a natural candidate for such an approach.However, we h a ve found that exploiting parallelism within the libraries does not lead to signi cant speedup.The parallelism we exploited is at a higher loop level and the libraries themselves execute on one processor each.

Results
We gathered performance data on the Alliant FX 80 for four versions of Premix: Original Sequential the original Premix code compiled with sequential optimizations fortran -Og; Original Parallel the original code optimized for parallel execution by the FX FORTRAN automatic compiler fortran -Ogc; Optimized Parallel Original Parallel with explicit parallel constructs added, as described in Section 4; and Optimized Sequential Optimized Parallel compiled for sequential execution fortran -Og.The pro ling option -pg w as disabled for these experiments.We also excluded vectorization optimizations from our performance tests because the vectors were too short to be useful with the FX 80 architecture.Enabling vectorization consistently resulted in greater execution times.
The performance improvement can be seen in Figure 5.The third group of bars shows total execution times for the four versions of Premix.W e see that the Optimized Parallel version of the code executes approximately 4.4 times faster than Original Sequential.The added overhead of the manual parallelization, seen by comparing the execution time of Optimized Sequential to Original Sequential, is minimal less than 0.3.Automatic compiler optimizations, isolated in the Original Parallel version of the code, are responsible for about half the performance improvement.This result can also be seen in Figure 6, which exhibits the inverse execution times of the parallel versions of the code for varying numbers of processors.
We separated the linear algebra and chemistry computations in Figure 5     the chemical computations improved signi cantly, b y a factor of almost six, the gain for the linear algebra routines, which w e w ere only able to parallelize partially, w as a more modest 2.3.Linear algebra commanded only about 13 of the Original Sequential execution time.In the Optimized Parallel version linear algebra is responsible for about 27.As the number of processors grows, linear algebra computations will increasingly dominate execution time.
Formerly the chemistry was so expensive that the time spent in linear algebra could be easily ignored.Now that the chemistry can be made relatively cheap, the algorithmic trade-o s have c hanged.Alternatives to the overall solution strategy should be reviewed.Some discussion of parallel methods for solving two point boundary value problems can be found in 18 .The Lapack e ort 19 o ers parallel versions of banded system solvers, exploiting parallelism in multiple right-hand sides and blocking algorithms.We did not, however, obtain any performance improvement when we replaced the Linpack linear algebra routines with their Lapack counterparts.Premix has no multiple right-hand sides to exploit, but blocking should have yielded some improvement.The reason it did not do so is still under investigation.
Table 2 shows the execution times of the four versions of Premix.The three loops we manually parallelized constitute nearly all the signi cant c hemical computations in the code.As these loops are explicitly parallel in the Optimized Parallel version of the code, we h a ve successfully modi ed the implementation to express the parallelism inherent to the computational method.However, the execution times of these loops exhibit only a six-fold improvement on eight processors.We believe this is largely due to an imbalance in the work load.The program spends much of its time working with a 19-point grid.If we assume each iteration of the loop over the grid points executes in the same amount of time, 19 iterations are completed in the time that the eight processors could execute 24.This roughly 79 e ciency would reduce the performance improvement factor to 6.3.A further source of ine ciency is the limited memory bandwidth of the FX 80 machine, which provides only a four-way path between the eight processors and the shared memory, and penalizes applications with poor cache hit ratios.Factoring in these ine ciences models the measured performance with good accuracy, so that we can characterize the scalability of our application as follows.
The Premix application runs with reasonable e ciency on machines with small numbers of processors.It is potentially scalable to larger numbers of processors for the solution of larger problems with signi cantly more than 65 grid points.As the chemistry component of the application speeds up, the linear algebra part becomes speed limiting.We h a ve not investigated possible improvements to this component of the application; however, it seems possible to resolve this limitation through appropriate changes in the used algorithms.

Conclusion
We performed a detailed analysis of the mathematical model used in Premix coupled with a study of the computational methods to gain a picture of a hierarchy of parallelism inherent to the problem being solved.A manual analysis of the code followed, from which w e determined to what extent the parallelism inherent to the implementation was expressed in the original version.We then chose an outer loop level appropriate to our target machine and applied a handful of manual parallelizations.In all, we modi ed less than 100 lines of code.The result was a greater than four-fold improvement in the simulation's execution time on an Alliant FX 80 with eight processors.
In this work we h a ve found that the Premix combustion chemistry application runs with reasonable speed on small numbers of processors and potentially scales up to more highly parallel systems.The most important program transformation to achieve our performance improvement w as the privatization of arrays.To determine the applicability of this transformation we had to do a careful, interprocedural analysis of de ned and used array sections.The available language constructs were not always adequate for expressing dynamically-sized loop-private arrays, and we suggest that such constructs be included in future language designs.The method used for optimizing our program seems generally applicable and, with the provision of supporting tools, we believe they represent a step toward the understanding and improvement of the process of optimizing large application codes for high-performance computers.

Figure 1 :
Figure 1: Execution Pro le of Sequential Program.Times were obtained on an Alliant FX 80 with serial optimizations compile command: fortran -Og -pg.Elapsed times in seconds are superscripted and the number of events is subscripted.Procedure times include time spent in called subprocedures.Total elapsed time is 9305 seconds.Times for the two i n vocations of fun" are combined in Figure 2.

Figure 2 :
Figure 2: Execution Pro le for Procedure fun.

Figure 3 :
Figure 3: Flow Diagram for Premix.The nonlinear discretized system is solved using the modi ed Newton-Raphson algorithm.Should the Newton algorithm fail to converge, a user-speci ed number of arti cial time integrations are performed to improve the conditioning of the nonlinear system.The time stepping algorithm also uses the Newton method.

Figure 4 :
Figure 4: Privatization of Arrays.In the second code fragment, each iteration of the outer loop is provided a separate copy o f w ork array temp".

Figure 5 :
Figure 5: Comparative P erformance of Four Versions of Premix.Times were obtained using an eight-processor Alliant FX 80 with the FX FORTRAN parallelizing compiler.

Figure 6 :
Figure 6: I n verse Execution Times Versus Number of Processors.Times for original and optimized parallel versions of Premix were obtained on an Alliant FX 80.An ideal performance improvement line is included for comparison.

Table 1 :
Symbols Appearing in the Premixed Flame Equations 1 4.More detail is available from the Chemkin and Transport documentation 5, 6 .

Table 2 :
Elapsed Execution Times of Four Versions of Premix.The parallel versions were executed using eight processors.