An Algebraic Programming Style for Numerical Software and its Optimization

The abstract mathematical theory of partial differential equations (PDEs) is formulated in terms of manifolds, scalar fields, tensors, and the like, but these algebraic structures are hardly recognizable in actual PDE solvers. The general aim of the Sophus programming style is to bridge the gap between theory and practice in the domain of PDE solvers. Its main ingredients are a library of abstract datatypes corresponding to the algebraic structures used in the mathematical theory and an algebraic expression style similar to the expression style used in the mathematical theory. Because of its emphasis on abstract datatypes, Sophus is most naturally combined with object-oriented languages or other languages supporting abstract datatypes. The resulting source code patterns are beyond the scope of current compiler optimizations, but are sufficiently specific for a dedicated source-to-source optimizer. The limited, domain-specific, character of Sophus is the key to success here. This kind of optimization has been tested on computationally intensive Sophus style code with promising results. The general approach may be useful for other styles and in other application domains as well.


Introduction
The purpose of the Sophus approach to writing partial differential equation (PDE) solvers, originally proposed in [12], is to close the gap between the underlying coordinate-free mathematical theory and the way actual solvers are written. The main ingredients of Sophus are: 1. A library of abstract datatypes corresponding to manifolds, scalar fields, tensors, and the like, figuring in the abstract mathematical theory.
2. Expressions involving these datatypes written in a side-effect free algebraic style similar to the expressions in the underlying mathematical theory.
Because of the emphasis on abstract datatypes, Sophus is most naturally combined with object-oriented languages or other languages supporting abstract datatypes. Hence, we will be discussing high-performance computing (HPC) optimization issues within an object-oriented or abstract datatype context, using abstractions that are suitable for PDEs. Sophus is not simply object-oriented scientific programming, but a much more structured approach dictated by the underlying mathematics. The object-oriented numerics paradigm proposed in [8,23] is related to Sophus in that it uses abstractions corresponding to familiar mathematical constructs such as tensors and vectors, but these do not include continuous structures such as manifolds and scalar fields. The Sophus approach is more properly called coordinate-free numerics [13]. A fully worked example of conventional vs. coordinate-free programming of a computational fluid dynamics problem (wire coating for Newtonian and non-Newtonian flows) is given in [11].
Programs in a domain-specific programming style like Sophus may need additional optimization in view of their increased use of expensive constructs. On the other hand, the restrictions imposed by the style may lead to new high-level optimization opportunities that can be exploited by dedicated tools. Automatic selection of high-level HPC transformations (especially loop transformations) has been incorporated in the IBM XL Fortran compiler, yielding a performance improvement for entire programs of typically less than a factor of 2 [14, p. 239]. We hope Sophus style programming will allow high-level transformations to become more effective than this.
In the context of Sophus and object-oriented programming this article focuses on the following example. Object-oriented languages encourage the use of self-mutating (self-updating, mutative) objects rather than a side-effect free algebraic expression style as advocated by Sophus. The benefits of the algebraic style are considerable. We obtained a reduction in source code size of up to 30% using algebraic notation vs. an object-oriented style in selected procedures of a seismic simulation code, with a correspondingly large increase in programmer productivity and maintainability of the code as measured by the Cocomo technique [4], for instance. On the negative side, the algebraic style requires a lot of temporary data space for (often very large) intermediate results to be allocated and subsequently recovered. Using self-mutating objects, on the other hand, places some of the burden of variable management on the programmer and makes the source code much more difficult to write, read, and maintain. It may yield much better efficiency, however. Now, by including certain restrictions as part of the style, a precise relationship between self-mutating notation and algebraic notation can be achieved. Going one step further, we see that the natural way of building a program from high-level abstractions may be in direct conflict with the way current compilers optimize program code. We propose a source-to-source optimization tool, called CodeBoost, as a solution to many of these problems. Some further promising optimization opportunities we have experimented with but not yet included in CodeBoost are also mentioned. The general approach may be useful for other styles and other application domains as well.
This paper is organized as follows. After a brief overview of tensor-based abstractions for numerical programming and their realization as a software library (Section 2), we discuss the relationship between algebraic and self-mutating expression notation, and how the former may be transformed into the latter (Section 3). We then discuss the implementation of the CodeBoost source-to-source optimization tool (Section 4), and give some further examples of how software construction using class abstractions may conflict with efficiency issues as well as lead to new opportunities for optimization (Section 5). Finally, we present conclusions and future work (Section 6).

A Tensor Based Library for Solving PDEs
Historically, the mathematics of PDEs has been approached in two different ways. The solution-oriented approach uses concrete representations of vectors and matrices, discretisation techniques, and numerical algorithms, while the abstract approach develops the theory in terms of manifolds, scalar fields, tensors, and the like, focusing more on the structure of the underlying concepts than on how to calculate with them (see [15] for a good introduction). The former approach is the basis for most of the PDE software in existence today. The latter has very promising potential for the structuring of complicated PDE software when combined with template class based programming languages or other languages supporting abstract datatypes. As far as notation is concerned, the abstract mathematics makes heavy use of overloaded infix operators. Hence, user-definable operators and operator overloading are further desirable language features in this application domain. C++ [18] comes closest to meeting these desiderata, but, with modules and user-definable operators, Fortran 90/95 [1,2] can also be used. In its current form Java [10] is less suitable. It has neither templates nor user-definable operators. Also, Java's automatic memory management is not necessarily an advantage in an HPC setting [16, Section 4]. Some of these problems may be alleviated in Generic Java [6]. The examples in this article use C++.

The Sophus Library
The Sophus library provides the abstract mathematical concepts from PDE theory as programming entities. Its components are based on the notions of manifold, scalar field and tensor field, while the implementations are based on the conventional numerical algorithms and discretisations. Sophus is currently structured around the following concepts: Basic n-dimensional mesh structures. These are like rank n arrays (i.e., with n independent indices), but with operations like +, - and * mapped over all elements (much like Fortran 90/95 array operators) as well as the ability to add, subtract or multiply all elements of the mesh by a scalar in a single operation. There are also operations for shifting meshes in one or more dimensions. Parallel and sequential implementations of mesh structures can be used interchangeably, allowing easy porting between architectures of any program built on top of the mesh abstraction.
Manifolds. These represent the physical space where the problem to be solved takes place. Currently Sophus only implements subsets of R^n. Scalar fields. These may be treated formally as functions from manifolds to reals, or as arrays indexed by the points of the manifold with reals as data elements. Scalar fields describe the measurable quantities of the physical problem to be solved. As the basic layer of "continuous mathematics" in the library, they provide the partial derivation operations. Also, two scalar fields on the same manifold may be pointwise added, subtracted or multiplied. The different discretisation methods provide different designs for the implementation of scalar fields. A typical implementation would use an appropriate mesh as underlying discrete data structure, use interpolation techniques to give a continuous interface, and map the +, -, and * operations directly to the corresponding mesh operations. In a finite difference implementation partial derivatives are implemented using shifts and arithmetic operations on the mesh.
Tensors. These are generalizations of vectors and matrices and have scalar fields as components. Tensors define the general differentiation operations based on the partial derivatives of the scalar fields, and also provide operations such as componentwise addition, subtraction and multiplication, as well as tensor composition and application (matrix multiplication and matrix-vector multiplication). A special class are the metric tensors. These satisfy certain mathematical properties, but their greatest importance in this context is that they can be used to define properties of coordinate systems, whether Cartesian, axisymmetric or curvilinear, allowing partial differential equations to be formulated in a coordinate-free way. The implementation of tensors relies heavily on the arithmetic operations of the scalar field classes.
A partial differential equation in general provides a relationship between spatial derivatives of tensor fields representing physical quantities in a system and their time derivatives. Given constraints in the form of the values of the tensor fields at a specific instance in time together with boundary conditions, the aim of a PDE solver is to show how the physical system will evolve over time, or what state it will converge to if left by itself. Using Sophus, the solvers are formulated on top of the coordinate-free layer, forming an abstract, high-level program for the solution of the problem.

Sophus Style Examples
The algebraic style for function declarations can be seen in Figure 1, which shows specifications of some operations for multidimensional meshes, the lowest level in the Sophus library. The mesh class is parameterized by a class T, so all operations on meshes likewise are parameterized by T. Typical parameters would be a float or scalar field class. The operations declared are defined to behave like pure functions, i.e., they do not update any internal state or modify any of their arguments. Such operations are generally nice to work with and reason about, as their application will not cause any hidden interactions with the environment.
Selected parts of the implementation of a continuous scalar field class are shown in Figure 2. The mesh operations it uses are implemented as self-mutating operations (Section 3), but are used in an algebraic way for clarity. It is easy to see that the partial derivation operation is implemented by shifting the mesh longer and longer distances, and gradually scaling down the impact these shifts have on the derivative, yielding what is known as a four-point finite difference partial derivation algorithm. The addition and multiplication operations are implemented using the element-wise mapped operations of the mesh.
The meshes used in a scalar field tend to be very large. A TorusScalarField may typically contain between 0.2 and 2MB of data, perhaps even more, and a program may contain many of them. [Figure 2: some operations on a scalar field implemented using the finite difference method; class TorusScalarField with private members Mesh<real> msf (data values for each grid point of the mesh) and real delta (resolution, the distance between grid points), and with self-mutating implementations of a partial derivation algorithm, a scalar field addition, and a scalar multiplication algorithm. The code itself uses algebraic notation for the mesh operations.] An optimizing C or C++ compiler would translate such an expression into a sequence of compound assignments c = a; c *= 4.0; c += b; c += a, which repeatedly uses variable c to store intermediate results.
We would like to be able to do a similar optimization of the mesh expression above, as well as other expressions involving n-ary operators or functions of a suitable nature for user-defined types, as proposed in [9]. In an object-oriented language, it would be natural to define self-mutating methods (i.e., methods mutating this) for the mesh operations in the above expression. These would be closely similar to the compound assignments for predefined types in C and C++, which return a pointer to the modified data structure. Sophus demands a side-effect free expression style close to the underlying mathematics, however, and forbids direct use of self-mutating operations in expressions. Note that with a self-mutating += operator returning the modified value of its first argument, the expression (a += b) += a would yield 2(a + b) rather than (2a) + b.
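The pitfall at the end of the previous paragraph is easy to demonstrate with a toy wrapper type (ours, purely for illustration): a compound assignment that returns a reference to its target makes the second += in (a += b) += a see the already-updated a.

```cpp
// Toy value type (illustrative) whose += mimics the C/C++ convention of
// returning a reference to the updated first argument.
struct Val {
    int x;
    Val& operator+=(const Val& o) { x += o.x; return *this; }
};

// With a = 1 and b = 2:
//   a += b        updates a to 3 and returns a reference to a;
//   (...) += a    then adds the *updated* a to itself, giving 6 = 2*(a + b),
//                 not (2a) + b = 4.
```

This is precisely why Sophus forbids self-mutating operations inside expressions: aliasing between an operand and the target silently changes the value computed.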
By allowing the user to define self-mutating operations and providing a way to use them in a purely functional manner, their direct use can be avoided. There are basically two ways to do this, namely, by means of wrapper functions or by program transformation. These will be discussed in the following sections.

Wrapper Functions
Self-mutating implementations can be made available to the programmer in non-self-mutating form by generating appropriate wrapper functions. We developed a C++ preprocessor SCC that does this. It scans the source text for declarations of a standard form and automatically creates wrapper functions for the self-mutating ones. This allows the use of an algebraic style in the program, and relieves the programmer of the burden of having to code the wrappers manually.
A self-mutating operator op= is related to its algebraic analog op by the basic rule

x = y op z;  ≡  x = copy(y); x op= z;  (1)

or, if the second argument is the one being updated, by the rule

x = y op z;  ≡  x = copy(z); y op= x;  (2)

where ≡ denotes equivalence of the left- and right-hand sides, x, y, z are C++ variables, and copy makes a copy of the entire data structure. Now, the Sophus style does not allow aliasing or sharing of objects, and the (overloaded) assignment operator x = y is always given the semantics of x = copy(y) as used in (1) and (2). Hence, in the context of Sophus (1) can be simplified to

x = y op z;  ≡  x = y; x op= z;  (3)

and similarly for (2). We note the special case

x = x op z;  ≡  x op= z;  (4)

and the obvious generalizations of these rules to expression arguments, where e, e1, and e2 are expressions. SCC uses rules such as (6) to obtain purely functional behavior from the self-mutating definitions in a straightforward way. Figure 4 shows the wrappers created by SCC for the self-mutating mesh operations of Figure 3. The case of n-ary operators and functions is similar (n ≥ 1). We note that, unlike C and C++ compound assignments, Sophus style self-mutating operators do not return a reference to the variable being updated and cannot be used in expressions. This simpler behavior facilitates their definition in Fortran 90/95 and other languages of interest to Sophus.
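The wrapper scheme can be sketched as follows (a minimal reconstruction; the Mesh type and operation names are ours, not the actual Figure 3/4 code): a hand-written self-mutating += that, following the Sophus convention, returns nothing, plus the non-mutating + wrapper that SCC would generate from it via rule (3).

```cpp
#include <cstddef>
#include <vector>

// Sketch of a self-mutating mesh operation in Sophus style: it updates
// *this and, unlike C++ compound assignment, returns nothing, so it
// cannot be used inside expressions.
struct Mesh {
    std::vector<double> v;
    void operator+=(const Mesh& o) {
        for (std::size_t i = 0; i < v.size(); ++i) v[i] += o.v[i];
    }
};

// Generated wrapper (rule (3): x = y op z  ≡  x = y; x op= z):
// copy the first argument, then apply the self-mutating form.
// Sophus assignment semantics make the copy a deep copy, so the
// wrapper behaves as a pure function.
Mesh operator+(const Mesh& a, const Mesh& b) {
    Mesh r = a;
    r += b;
    return r;
}
```

The wrapper leaves both arguments untouched, so expressions such as a + b + a are safe, at the cost of one full copy per operator application.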
The wrapper approach is superficial in that it does not minimize the number of temporaries introduced for expression evaluation, as illustrated in Section 3.1. We therefore turn to a more elaborate transformation scheme.

Program Transformation
Transformation of algebraic expressions to self-mutating form with simultaneous minimization of temporaries requires a parse of the program, the collection of declarations of self-mutating operators and functions, and matching them with the types of the operators and functions actually used after any overloading has been resolved. Also, declarations of temporaries have to be added with the proper type. Such a preprocessor would be in a good position to perform other source-to-source optimizations as well. In fact, this second approach is the one implemented in CodeBoost, with promising results. Figure 5 shows an optimized version of the partial derivation operator of class TorusScalarField (Figure 2) that might be obtained in this way. In addition to the transformation to self-mutating form, an obvious rule for ushift was used to incrementalize shifting of the mesh.
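The kind of rewrite involved can be sketched for the compound-assignment sequence quoted in Section 2 (the Mesh type and the particular expression 4.0*a + b + a are our illustration, not CodeBoost output): the algebraic form materializes a temporary per subexpression, while the transformed form reuses the result variable and needs no extra temporaries at all.

```cpp
#include <cstddef>
#include <vector>

// Minimal mesh with self-mutating operations only (illustrative).
struct Mesh {
    std::vector<double> v;
    void operator+=(const Mesh& o) {
        for (std::size_t i = 0; i < v.size(); ++i) v[i] += o.v[i];
    }
    void operator*=(double s) { for (double& x : v) x *= s; }
};

// Algebraic form of 4.0*a + b + a, spelled out with the temporaries a
// naive wrapper-based evaluation would allocate: one per subexpression.
Mesh algebraic(const Mesh& a, const Mesh& b) {
    Mesh t1 = a;  t1 *= 4.0;   // 4*a
    Mesh t2 = t1; t2 += b;     // 4*a + b
    Mesh t3 = t2; t3 += a;     // 4*a + b + a
    return t3;
}

// Transformed self-mutating form: c = a; c *= 4.0; c += b; c += a.
// The result variable c doubles as the only intermediate store.
Mesh transformed(const Mesh& a, const Mesh& b) {
    Mesh c = a;
    c *= 4.0;
    c += b;
    c += a;
    return c;
}
```

Both functions compute the same meshes; the difference is purely in how many full-size temporaries are allocated, which is what matters when each mesh is megabytes large.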

Benchmarks

Two Kernels
Consider C++ procedures F and P shown in Figure 6. F computes x^2 + 2x using algebraic notation, while P computes the same expression in self-mutating form using a single temporary variable temp1. Both were run with meshes of different sizes. The corresponding timing results are shown in Figures 7, 8, and 9. The mesh size is given in the leftmost column. Mesh elements are single precision reals of 4B each. The second column indicates the benchmark procedure (F or P) or the ratio of the corresponding timings (F/P). The columns NC, NS, OC, and OS give the time in seconds of several iterations over each mesh so that a total of 16 777 216 elements were updated in each case. This corresponds to 32 768 iterations for mesh size 8^3, 64 iterations for mesh size 64^3, 1 iteration for mesh size 256^3, and so forth. In columns C (conventional style) the procedure calls are performed for each element of the mesh, while in columns S (Sophus style) they are performed as operations on the entire mesh variables.
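Since Figure 6 is not reproduced here, the following is a reconstruction of the two kernels in the spirit described above (the Mesh type and exact bodies are our assumptions): F evaluates x*x + 2.0*x algebraically, allocating a temporary per operator, while P computes the same value with explicit self-mutating updates and a single named temporary.

```cpp
#include <cstddef>
#include <vector>

// Minimal mesh (illustrative) with self-mutating operations and the
// algebraic wrappers built on top of them.
struct Mesh {
    std::vector<double> v;
    void operator+=(const Mesh& o) {
        for (std::size_t i = 0; i < v.size(); ++i) v[i] += o.v[i];
    }
    void operator*=(const Mesh& o) {
        for (std::size_t i = 0; i < v.size(); ++i) v[i] *= o.v[i];
    }
    void operator*=(double s) { for (double& d : v) d *= s; }
};
Mesh operator*(const Mesh& a, const Mesh& b) { Mesh r = a; r *= b; return r; }
Mesh operator*(double s, const Mesh& a)      { Mesh r = a; r *= s; return r; }
Mesh operator+(const Mesh& a, const Mesh& b) { Mesh r = a; r += b; return r; }

// F: algebraic notation; each operator application copies a whole mesh.
Mesh F(const Mesh& x) {
    return x * x + 2.0 * x;
}

// P: self-mutating form of the same computation with one temporary.
Mesh P(const Mesh& x) {
    Mesh result = x;
    result *= x;        // x*x
    Mesh temp1 = x;
    temp1 *= 2.0;       // 2*x
    result += temp1;    // x*x + 2*x
    return result;
}
```

The two kernels are value-equivalent; the benchmark differences discussed below come entirely from temporary allocation and data movement, not from arithmetic.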
Columns N give the time for unoptimized code (no compiler options), while columns O give the time for code optimized for speed (compiler option -fast for the SUN CC compiler and -Ofast for the Silicon Graphics/Cray CC compiler). The timings represent the median of 5 test runs. These turned out to be relatively stable measurements, except in columns NS and OS, rows 256^3 F and P of Figure 7, where the time for an experiment could vary by more than 100%. This is probably due to paging activity on disk dominating the actual CPU time. Also note that the transformations done by the optimizer are counterproductive in the P case, yielding an NS/OS ratio of 0.8.
When run on the SUN the tests were the only really active processes, while the Cray was run in its normal multi-user mode but at a relatively quiet time of the day (Figure 10). As can be seen, the load was moderate (around 58) and although fully utilized, resources were not overloaded.
In the current context, only columns NS and OS are relevant; the other ones are explained in Section 5.1. As expected, the self-mutating form P is a better performer than the algebraic form F when the Sophus style is used. Disregarding the cases with disk paging mentioned above, we see that the self-mutating mesh operations are 1.8–2.4 times faster than their algebraic counterparts, i.e., the CodeBoost transformation roughly doubles the speed of these benchmarks.

Full Application: SeisMod
We also obtained preliminary results on the Silicon Graphics/Cray Origin 2000 for a full application, the seismic simulation code SeisMod, which is written in C++ using the Sophus style. It is a collection of applications using the finite difference method for seismic simulation. Specific versions of SeisMod have been tailored to handle simulations with simple or very complex geophysical properties. (SeisMod is used and licensed by the geophysical modelling company UniGEO A.S., Bergen, Norway.) We compared a version of SeisMod implemented using SCC-generated wrapper functions and a self-mutating version produced by the CodeBoost source-to-source optimizer: The algebraic expression style version turned out to give a 10–30% reduction in source code size and greatly enhanced readability for complicated parts of the code. This implies a significant programmer productivity gain as well as a significant reduction in maintenance cost as measured by the Cocomo technique [4], for instance. A 30% speed increase was obtained after 10 selected procedures out of 150 procedures with speedup potential had been brought into self-mutating form. This speedup turned out to be independent of C++ compiler optimization flag settings. This shows that although a more user-friendly style may give a performance penalty compared to a conventional style, it is possible to regain much of the efficiency loss by using appropriate optimization tools. Also, a more abstract style may yield more cost-effective software, even without these optimizations, if the resulting development and maintenance productivity improvement is taken into account.

Implementation of CodeBoost
CodeBoost is a dedicated C++ source-to-source transformation tool for Sophus style programs. It has been implemented using the ASF+SDF language prototyping system [20]. ASF+SDF allows the required transformations to be entered directly as conditional rewrite rules whose right- and left-hand sides consist of language (in our case C++) patterns with variables and auxiliary transformation functions. The required language-specific parsing, rewriting, and pretty-printing machinery is generated automatically by the system from the high-level specification. Program transformation tools for Prolog and the functional language Clean implemented in ASF+SDF are described in [7,19]. An alternative implementation tool would have been the TAMPR program transformation system [5], which has been used successfully in various HPC applications. We preferred ASF+SDF mainly because of its strong syntactic capabilities, enabling us to generate a C++ environment fairly quickly given the complexity of the language.
Another alternative would have been the use of template metaprogramming and/or expression templates [21,22]. This approach is highly C++ specific, however, and cannot be adapted to Fortran 90/95. Basically, the ASF+SDF implementation of CodeBoost involves the following two steps: 1. Specify the C++ syntax in SDF, the syntax definition formalism of the system.
2. Specify the required transformation rules as conditional rewrite rules using the C++ syntax, variables, and auxiliary transformation functions. As far as the first step is concerned, specification of the large C++ syntax in SDF would involve a considerable effort, but fortunately a BNF-like version is available from the ANSI C++ standards committee. We obtained a machine-readable preliminary version [3] and translated it largely automatically into SDF format. The ASF+SDF language prototyping system then generated a C++ parser from it. The fact that the system accepts general context-free syntax rather than only LALR or other restricted forms of syntax also saved a lot of work in this phase, even though the size of the C++ syntax taxed its capabilities.
With the C++ parser in place, the required program transformation rules were entered as conditional rewrite rules. In general, a program transformer has to traverse the syntax tree of the program to collect the context-specific information used by the actual transformations. In our case, the transformer needs to collect the declaration information indicating which of the operations have a self-mutating implementation. Also, in Sophus the self-mutating implementation of an operator (if any) need not update this but can indicate which of the arguments is updated using the upd flag. The transformer therefore needs to collect not only which of the operations have a self-mutating implementation but also which argument is being mutated in each case. As a consequence, CodeBoost has to traverse the program twice: once to collect the declaration information and a second time to perform the actual transformations. Two other issues have to be taken into account: C++ programs cannot be parsed before their macros are expanded. Some Sophus-specific language elements are implemented as macros, but are more easily recognized before expansion than after. An example is the upd flag indicating which argument of an operator or function is the one to be updated.
Compared to the total number of constructs in C++, there are relatively few constructs of interest. Since all constructs have to be traversed, this leads to a plethora of trivial tree traversal rules. As a result, the specification gets cluttered up by traversal rules, making it a lot of work to write as well as hard to understand. One would like to minimize or automatically generate the part of the specification concerned with straightforward program traversal. Our approach to the above problems is to give the specification a two-phase structure as shown in Figure 11. Under the reasonable assumption that the declarations are not spoiled by macros, the first phase processes the declarations of interest prior to macro expansion using a stripped version of the C++ grammar that captures the declaration syntax only. We actually used a Perl script for this, but it could have been done in ASF+SDF as well. It yields an ASF+SDF specification that is added to the specification of the second phase. The effect of this is that the second phase is specialized for the program at hand in the sense that the transformation rules in the second phase can assume the availability of the declaration information and thus can be specified in a more algebraic, i.e., context-independent manner. [Figure 11: Two-phase specification of CodeBoost.] As a consequence, they are easy to read, consisting simply of the rules for the constructs that may need transformation and using the ASF+SDF system's built-in innermost tree traversal mechanism. In this way, we circumvented the last-mentioned problem. As CodeBoost is developed further, it will have to duplicate more and more functions already performed by any C++ preprocessor/compiler. Not only will it have to do parsing (which it is already doing now), but also template expansion, overloading resolution, and dependence analysis. It would be helpful if CodeBoost could tap into an existing compiler at appropriate points rather than redo everything itself. One of the candidates we are considering is the IBM Montana C++ compiler/programming environment [17].

It would be easy for an optimizer to partition the loops in such a way that the number of cache misses is reduced by a factor of 8.
In the abstract case aggressive in-lining is necessary to expose the actual loop nesting to the optimizer. Even though most existing C++ compilers do in-lining of abstractions, this would be non-trivial since many abstraction layers are involved, from the programmer's notation on top of the library of abstractions down to the actual traversals being performed.
Consider once again the timing results shown in Figures 7, 8, and 9. As was explained in Section 3.4, the procedure calls in columns C (conventional style) are performed for each element of the mesh, while they are performed as operations on the entire mesh variables in columns S (Sophus style). Columns OS/OC for row P give the relevant figures for the performance loss of optimized Sophus style code relative to optimized conventional style code as a result of Sophus operating at the mesh level rather than at the element level. The benchmarks show a penalty of 1.1–5.3, except for data structures of less than 128kB on the SUN, where a speedup of up to 1.4 (penalty of 0.7) can be seen in Figure 9. As is to be expected, for large data structures the procedure calls in column OC are more efficient than those in column OS, as the optimizer is geared towards improving the conventional kind of code consisting of large loops with procedure calls on small components of data structures. Also, cache and memory misses become very costly when large data structures have to be traversed many times.
The figures for P in column OS of Figure 9 are somewhat unexpected. In these cases OS is the fastest alternative up to a mesh size somewhere between 32^3 and 64^3. This may be due to the smaller number of procedure calls in the OS case than in the OC case. In the latter case F and P are called once per element, i.e., 16 777 216 times, while in the OS case they are called only once and the self-mutating operations are called only 4 times.
Another interesting phenomenon can be seen in column NC of Figure 7 and Figure 8. Here the self-mutating version takes longer than the algebraic version, probably because the compiler automatically puts small temporaries in registers for algebraic expressions, but cannot do so for self-mutating forms. The OC column shows that the optimizer eliminates the difference.

New Opportunities for Optimization
The same abstractions that were a source of worry in the previous section at the same time provide the specificity and typing that make the use of high-level optimizations possible. Before they are removed by inlining, the information the abstractions provide can be used to select and apply appropriate datatype-specific optimization rules. Sophus allows application of such rules at very high levels of abstraction. Apart from the expression transformation rules discussed earlier, candidates include the following.
The laws of tensor algebra. In Sophus the tensors contain the continuous scalar fields as elements (Section 2.1), thus making the abstract tensor operations explicit in appropriate modules.
Specialization of general tensor code for specific coordinate systems. A Cartesian coordinate system gives excellent simplification, and axisymmetric ones also give good simplification compared to general curvilinear code.
Optimization of operations on matrices with many symmetries. Such symmetries offer opportunities for optimization in many cases, including the seismic modelling application mentioned in Section 3.4.2.
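As a small illustration of the last point (our own example, not taken from SeisMod): for a symmetric matrix, a matrix-vector product can visit each off-diagonal entry once, letting it contribute to two result components, which roughly halves the element accesses compared to the general-purpose product.

```cpp
#include <cstddef>
#include <vector>

// Matrix-vector product specialized for a symmetric matrix a (a[i][j] ==
// a[j][i]): only the diagonal and the upper triangle are read, and each
// off-diagonal entry feeds both y[i] and y[j].
std::vector<double> symm_matvec(const std::vector<std::vector<double>>& a,
                                const std::vector<double>& x) {
    std::size_t n = x.size();
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a[i][i] * x[i];                    // diagonal entry
        for (std::size_t j = i + 1; j < n; ++j) {  // upper triangle only
            y[i] += a[i][j] * x[j];
            y[j] += a[i][j] * x[i];                // mirrored contribution
        }
    }
    return y;
}
```

A datatype-specific rewrite rule could select such a specialized routine automatically whenever the matrix abstraction is known to be symmetric, which is exactly the kind of high-level, type-driven optimization discussed above.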

Conclusions and Future Work
The Sophus class library in conjunction with the CodeBoost expression transformation tool shows the feasibility of a style of programming PDE solvers that attempts to stay close to the abstract mathematical theory in terms of both the datatypes and the algebraic style of expressions used.
Our preliminary findings for a full application, the Sophus style seismic simulation code SeisMod, indicate significant programmer productivity gains as a result of adopting the Sophus style.
There are numerous further opportunities for optimization by CodeBoost in addition to replacement of appropriate operators and functions by their self-mutating versions. Sophus allows datatype-specific rules to be applied at very high levels of abstraction.