Designing for Geometrical Symmetry Exploitation

Symmetry-exploiting software based on the generalized Fourier transform (GFT) is presented from a practical design point of view. The algorithms and data structures map closely to the relevant mathematical abstractions, which primarily are based upon representation theory for groups. Particular care has been taken in the design of the data layout of the performance-sensitive numerical data structures. 
 
The use of a vanilla strategy is advocated for the design of flexible mathematical software libraries: An efficient general-purpose routine should be supplied, to obtain a practical and useful system, while the possibility to extend the library and replace the default routine with a special-purpose - even more optimized - routine should be supported. 
 
Compared with a direct approach, the performance results show the superiority of the GFT-based approach for so-called dense equivariant systems. The GFT application is found to be well suited for parallelism.


Introduction
Many engineering problems in the real world exhibit a significant amount of geometrical symmetry.Just to mention a few examples, we note that groups of conductors used for high voltage electricity transport are often placed symmetrically in order to minimize disturbances, Buckminster balls are based on the symmetry of an icosahedron in order to maximize volume with a minimum of material, and propellers, used in technical applications ranging from turbines to airplanes, exhibit a cyclic symmetry.
In this paper, we are interested in the numerical simulation of partial differential equations (PDEs) that evolve in symmetrical regions, and we are concerned with the exploitation of this symmetry in the numerical algorithms.Our approach is based on the generalized Fourier transform (GFT), and has its origin in work by Allgower et al. [6] The approach can be particularly useful for a class of dense linear algebra systems that stem from geometrically symmetric domains.As explained below, systems in this class are referred to as dense equivariant systems, and they may arise for instance when the boundary element method is used [14].
The GFT approach can also be useful for geometries that are almost symmetrical, or partly symmetrical.An almost symmetric problem can be approximated as symmetrical, and the solution of this problem can be used to derive a preconditioner [32].Block-circulant preconditioners, see e.g.[19,20], can be seen as a special case of this strategy.Another approach can be applied for partly symmetric problems.Here, it is possible to use domain decomposition methods and treat the symmetric part in isolation, thereby exploiting the symmetry [9].
However, our impression is that the GFT and its potential are relatively unknown in the scientific computing community.This is partly explained by the fact that the underlying mathematics, in particular representation theory for groups, is not common knowledge among scientific computing researchers.With adequate software support, however, we believe that usage of the GFT for symmetric problems could increase.
Software for group theory has a strong tradition with the group theory language Cayley from 1976 as a noteworthy landmark [12].At present, "there are two systems which are particularly well suited for computations with groups: GAP and MAGMA" [28].In the related research area of fast GFTs [26], GAP has been used e.g. by Egner and Püschel [13].However, for the numbercrunching applications that we are interested in, we think stand-alone software is more appropriate.
In this paper, we present a software package that is designed particularly with the solution of dense equivariant systems in mind.We are aware of two other packages dedicated to this application [15,21].The first of theses packages handles so called fix points, while the second focusses on a design based upon generic programming.In comparison with these projects, we have paid particular attention to the layout of the data structures that affect performance the most.We use LAPACK [7] for efficient matrix computations.The ANSI-C software abstractions are based closely on the mathematical abstractions, in order to achieve understandability and flexibility.
The outline is as follows.In Section 2, we give an overview of the mathematical machinery required for the GFT.In Sections 3 and 4, we outline how the software is designed and implemented, and we also discuss implementation variants, both general-purpose "vanilla" routines and special-purpose, fast routines.Section 5 shows performance results for the variants mentioned, and makes comparisons between a symmetry-exploiting solve and a direct solve of a dense equivariant system.These comparisons clearly show the importance of exploiting symmetries, as concluded in Section 6.Here, we also discuss the vanilla strategy for mathematical software.

Background
Consider the geometries of Figure 1.The propeller shape in (a) is clearly invariant under rotations by 90, 180, and 270 degrees, as well as under the identity transformation (a rotation by 0 or 360 degrees).The cylinder shape in (b) is invariant under the same rotations, as well as under four reflections.We notice the connection between symmetries and groups; the rotations in (a) form the cyclic group C 4 with four elements whereas the rotations plus the reflections in (b) form D 4 , the dihedral group with 8 elements.The transformations under which a symmetric shape is invariant is referred to as the shape's symmetry group.Symmetric shapes and their corresponding symmetry groups are important in many applications.A linear operator L, such as the Laplacian or an integral operator, is equivariant if it commutes with every transformation g in the symmetry group, i.e., Lg = gL.
In this section, we give an overview of how a corresponding discrete operator is block-diagonalized by the GFT.The underlying mathematical theory is based on representation theory for groups.We summarize the theory in order to motivate our design and to make the paper self-contained.The notation is based on a recent introduction to the subject [3].

Groups and representations
First, we need some basic group theory.A group G is a set of elements that is closed under an associative binary operation, that has an identity element e, and for which every element g has an inverse g −1 .We point out that the groups we consider do not need to be abelian (commutative).The dihedral groups, for example, are nonabelian because a reflection followed by a rotation is not the same as a rotation followed by a reflection.We are concerned with finite groups, i.e., the number of elements |G| < ∞.Actually, since we are interested in symmetric shapes in three dimensions, we are interested in relatively small groups.For the applications that we have in mind, the largest group of interest is the symmetry group of the icosahedron, which has 120 elements [5].
We consider groups acting on indices, thus partitioning the index set into orbits.An orbit of an index i consists of all indices ig obtained by letting all group elements g act on i.In this paper, we consider the case where all orbits have |G| elements, i.e., the case when the action is free.
Second, we need some representation theory.Given a group G, (1) We focus on complex representations here, but real representations are be defined analogously.
Two representations ρ and σ are equivalent if they are related via a similarity transform, i.e., ρ(g) = T σ(g)T −1 for all g ∈ G and some nonsingular T ∈ C d ρ ×d ρ .A representation is reducible if it is equivalent to a block-diagonal representation.This means that ρ is reducible if there exists a nonsingular T such that where the representations σ and τ have dimensions d σ and d τ that are strictly between 0 and d ρ .An irreducible representation cannot be block-diagonalized as above.For example, any representation of dimension 1 is irreducible.Every representation can be reduced into irreducible representations.For every finite group G, there exists a complete list R of nonequivalent irreducible representations [29].The list of representations can be understood as a way to represent the elements of a group G as block-diagonal matrices, a point of view that we have discussed elsewhere [4].The number of nonequivalent irreducible representations (the number of blocks in the block-diagonal matrices) and the dimension d ρ of each representation (the dimension of the corresponding block) are properties of the group in question.These properties also depend upon whether the representation is real or complex.For complex representations and finite groups, it holds that Third, we introduce the generalized Fourier transform as an isomorphism between the group algebra and its corresponding Fourier algebra.The group algebra CG is a vector space C |G| where the group elements are basis vectors and for u ∈ CG we use the notation u = g∈G u(g)g.The group algebra CG is equipped with a convolution product u, v → u * v where The corresponding Fourier algebra is a block-diagonal matrix algebra CG = ρ∈R C d ρ ×d ρ .The generalized Fourier transform (GFT) is a mapping CG → CG defined by û where û = ρ∈R û(ρ).We remark that the GFT depends upon the choice of bases for the representations in R. The inverse GFT (IGFT) is defined by

Exploiting equivariance
We will now discuss how the basic concepts introduced above can be used to design efficient algorithms for the kinds of problems that we are interested in solving.Thus, we consider geometries that are invariant under a symmetry group G of transformations, and we study the discretization matrix A of an equivariant PDE operator.Provided that the discretization is symmetry respecting, A becomes equivariant [1].In the context of matrices, this means that where i, j are discretization indices acted upon by g ∈ G. Equivariant discretization matrices are encountered in the context of many different numerical methods, e.g., the finite emelement method [10, 1], finite differences [33], discretization via radial basis function [17,18], or as we mentioned in the introduction, the boundary element method [6,9].In this paper, we assume that the discretization of the PDE leads to a dense equivariant linear system of equations Ax = b.Furthermore, we assume that the group action partitions the index set I of size n into m orbits of size |G|.By picking one index from each orbit we may represent the orbits by a suitable selection S ⊂ I [6].
An overview of the symmetry exploitation strategy is given in the commutative diagram of Figure 2. The diagram shows that an original matrix vector multiplication Ax in vector space C n , where A is equivariant, can be reformulated as a convolution A * x in the vector space C m G or as a multiplication Âx in the vector space C m G.As discussed in more detail below, we refer to C m G and C m G as the group space and the Fourier space, respectively.We point out that the key to exploiting symmetry and equivariance for our kind of applications is to do the expensive work in the appropriate space.
Figure 2: Relationships between the different spaces.
Group space formulation It is evident that the equivariance property (5) implies that A contains a number of duplicates.The group space formulation avoids this by the following invertible mappings, for i, j ∈ S and g ∈ G.
Here, x and b belong to the vector space C m G = C m ⊗ CG.They contain the same data as their vector space counterparts x and b.A difference is that while Similarly, A ∈ C m×m G = C m×m ⊗ CG, and by the same notational convention we have that A i,j ∈ CG and A(g) ∈ C m×m .The advantage with the above mappings is that duplicates in A are avoided, and A contains only a fraction 1/|G| of the elements in A. This is beneficial not only in terms of storage but also when the discrete operator itself is constructed, since duplicate elements need not be computed.
By generalizing the group algebra convolution, it is simple to show that the original linear system of equations can be expressed as A * x = b.The group space convolution A * x is given by where a standard matrix vector product We conclude that the original equation can be solved by solving its counterpart in group space instead.Note that x and b are vectors according to the vector space axioms, even though they are not the kind of vectors that computational scientists usually deal with.In the same spirit, we will refer to A as a matrix in group space, even though it is not a standard matrix.
Fourier space formulation In order to solve the problem in group space, it is reformulated as a linear system of equations in Fourier space.This is achieved by applying the GFT in the group algebra element-wise.The vector GFT is given by xi (ρ) = Here, x ∈ C m G = C m ⊗ CG can be understood as a block-diagonal matrix where block ρ, i.e. x(ρ), has dimensions md ρ × d ρ .The matrix GFT is given by Âi The matrix Â ∈ C m×m G = C m×m ⊗ CG is a block-diagonal matrix where block ρ, i.e.Â(ρ), is md ρ × md ρ .
It is easy to show that A * x = b corresponds to Âx = b, a matrix matrix equation where all matrices are block-diagonal.We refer to Â as the matrix in Fourier space, while we refer to x and b as Fourier space vectors, even though they actually are block-diagonal matrices.In summary, the formulation in Fourier space is a block-diagonal version of the original linear systems of equations.It is much cheaper to solve the system in Fourier space, since each block is smaller than the original matrix and each block equation can be solved separately.For applications where dense equivariant systems are to be solved, this block-diagonalization is the most important way to exploit equivariance.
Regarding the GFT formulas ( 6) and ( 7), they are generic with respect to different groups and choices of irreducible representations.By studying the structure of the groups and their representations, it is possible to derive fast versions of the GFT, so called fast GFTs [23,25].The standard FFT is actually a prominent example of this.The FFT is a fast version of the DFT, which may be interpreted as a GFT for a cyclic group.In Section 4, we describe the implementation both of general vanilla versions and of two fast GFT versions for D 4 .

Data structures
We strive to design data structures and algorithms that match the mathematical concepts, in order to get software which is easy to use and maintain.The UML [24] class diagram in Figure 3 is motivated by the theory in the previous section.It shows the static associations between the key concepts corresponding to the group G, a list R of nonequivalent irreducible representations, and vectors and matrices in group space and Fourier space.The dynamic interactions among the participating objects are visible in the commutative diagram in Figure 2, and the overall GFT approach for solving an equivariant system Ax = b is shown in Algorithm 1.Note that it is not necessary to form all of A and we therefore assume that the mapping into the group space equation A * x = b has already been taken into account.
The design challenge, however, is to address computational efficiency and efficient memory handling as well.In the following subsections, we describe design considerations for the data structures for groups and irreducible representations, as well as vectors and matrices in group space and Fourier space.We refer to Yamba Yamba for a more detailed description [34].

Group
The concept of a group is clearly an important abstraction, which must supply the group order, the group operation, the inverse operation and so forth.There are different ways to implement groups, taking advantage of special properties of different groups or exploiting reuse between groups.A cyclic group C N , for example, can be represented as integers and the group operation can Algorithm 1 GFT approach for solving an equivariant system A * x = b.
be implemented as addition modulo N .Another structure that is simple to exploit is the direct product G 1 × G 2 , which is the Cartesian set of groups G and H, and where the group operation is (g Previously, we have discussed how C++ template mechanisms can be used to obtain a clean group interface and efficient variant implementations of different groups, since compile time polymorphism can be exploited [21].In that design, cyclic groups are objects of the parametrized class Cyclic<N>, and the direct product of for example C 4 with C 8 is represented by the class typedef DirectProduct< Cyclic<4>, Cyclic<8> > C4timesC8 ; In the present project, we have chosen to implement in C and to keep the design at a low abstraction level.In the spirit of the vanilla design philosophy, we have chosen to implement the group by representing each group element by an integer, and to supply the group operation and the inverse operation as arrays of integers.In our context, there are three issues that we want to emphasize: 1.The implementation of the group is not time-critical.The important task is to design a useful API, and to have a working implementation.If a future application requires a more sophisticated solution, it is possible to supply special-purpose implementations for special groups.
2. We focus on symmetry groups of geometries in 2D and 3D.The most complicated group for the applications that we have in mind is the symmetry group of the icosahedron which only has 120 elements [5].It is therefore not a serious restriction to use arrays for maintaining the group operation.
3. Our implementation allows a user to input the group operation and inverse operation as tables from a file, generated by hand or perhaps by another program.Since this solution is rather error-prone, we provide a routine which checks that the group axioms hold.
A group is represented by the data structure GftGroup, see the UML diagram in Figure 4 and the description of the application program interface (API) given below.Note that it is simple to implement variations of GftGroup data structure, for instance to support cyclic groups without having tables for the group operation.We also stress that the number of irreducible representations and their dimensions depend upon the underlying field, supplied by the user.

Group API
The operations of GftGroup are: gftGroupDelete Destructor.
gftGroupOperation Return the group operation gh.
gftGroupInverse Return the inverse element g −1 .
We also supply get operations for the name, the order, the number of irreducible representations and their dimensions.The data members of GftGroup are: name String containing the name of the group.
order Positive integer containing the group order.
num reps Positive integer containing the number of irreducible representations.
table order × order matrix containing the group operation table.
inverse Array containing the inverses.
dims Array containing the dimensions of the irreducible representations.

Irreducible Representations
The list R of nonequivalent irreducible representations is maintained by the GftGroupRep data structure.There are several more or less sophisticated ways of implementing this important abstraction.Here, we focus on the interface and we provide a vanilla implementation, in analogy with GftGroup.
Let us first formalize the notation in more detail.Let q be the number of representations in R, and enumerate the nonequivalent irreducible representations from R 0 to R q−1 .We discuss complex representations here, and the r'th irreducible representation is thus R r : G → C dr×dr , where d r = dim R r .For a given representation R r and a given group element g, we denote an element in the matrix R r (g) by R r (g) µ,ν where µ, ν ∈ [0, . . ., d r − 1].Equation (2) shows that R contains |G| elements R r (g) µ,ν in total.
In our vanilla implementation of GftGroupRep, all |G| 2 elements of R are organized in a straight-forward array.In order to obtain fast access to individual representations R r , the d 2 r |G| elements of R r are stored consecutively.To determine an appropriate ordering of the remaining indices µ, ν ∈ [0, . . ., d r − 1] and g ∈ G, we examine how R r is accessed during the GFT (3).The transform x → x is carried out as a sum over g ∈ G. Therefore, we store R r (•) µ,ν , i.e., the |G| elements R r (g) µ,ν , g ∈ G, consecutively.As a consequence, we notice that the transform (3) actually can be expressed as a matrix vector multiplication.Let each row of a |G| × |G| matrix R contain R r (•) µ,ν , and let col(x) denote a "columnization" of û, i.e., organize the elements in the block-diagonal matrix û ∈ CG as a column vector with |G| elements.Then we see that computes the GFT, provided that the indexing of R matches the indexing of u and û.Thus, by storing elements of R as a |G| × |G| matrix R, the GFT can be achieved by a call to LAPACK.In addition, we see that a convenient way to implement a vanilla version of the IGFT (4) is by computing R −1 by another LAPACK call, and carry out the matrix vector multiplication u = R −1 col(û).
The data structure GftGroupRep is illustrated in Figure 5 and described below.Notice that the number of irreducible representations as well as their respective dimensions are available through the GftGroup member.Regarding precision, both single precision complex and double precision complex are supported via a link time option.We also support single precision and double precision real numbers.This means that the vanilla implementation supports real GFTs for groups where the dimensions of the real and complex nonequivalent irreducible representations coincide.This is the case for many important groups, e.g., all dihedral groups, the symmetry group of the cube, and the symmetry group of the icosahedron [22].gftGroupRepCheck Check that each representation R r is a representation, i.e., check that R r (gh) = R r (g)R r (h) for all g, h ∈ G and all R r ∈ R.

Group Representation API
gftGroupRepInverse Return R −1 , obtained via a Lapack call.The outparameter ipiv contains pivot information.
The data members of GftGroupRep are: group An association to the corresponding group.
data The matrix R, containing all elements of R.
blockIndices An array containing the array indices of the first element of each of the different irreducible representations in the data buffer.This array is maintained in order to facilitate a fast computation of the addresses of the elements of R.

Group space vectors and matrices
When transforming matrices and vectors of an equivariant system Ax = b, where A ∈ C n×n and x, b ∈ C n , into their corresponding elements in the group space, A ∈ C m×m G and x, b ∈ C m G, the computational effort can be minimized through careful selection of the numbering system used for the indices.Different numberings lead to different but equivalent equivariant systems.The transformation into the group space can be simplified by the selection of a numbering scheme which corresponds to the order in which the group elements are listed.With such a selection, the transformation of a vector into its group space counterpart is reduced to a reinterpretation of the contents of the data structures, cf. the discussion in connection with Algorithm 1.
Let G be the group describing the symmetry, let m be the number of orbits of the discretization.Under a free action of G a symmetry respecting discretization of m orbits has a number of elements equal to n = m |G|.For all orbits, let i ∈ 0, . . ., m − 1 be the orbit index.Assign the subset of element indices {i |G| , . . ., (i + 1) |G| − 1} to the elements of orbit i according to the following procedure.Pick o i as the first element of the orbit, i.e., let o i correspond to index i |G|, and let g k for k = 0, . . ., |G| − 1 denote group element of index k, and assign to o i g k index i |G| + k. Figure 6 illustrates this numbering for a symmetry respecting discretization with two orbits for the case of group D 4 generated by a rotation a and a reflection b.The group elements are enumerated in the order e, a, a 2 , a 3 , b, ab, a 2 b, a 3 b.Element 0 is chosen as the first elements of orbit {0, . . ., 7} and element 8 is chosen as the first element of orbit {8, . . ., 15}.With this numbering the vector x is stored in the following order: where γ = |G| − 1.
A group space vector x is represented by the data structure GftVector and a group space matrix A is represented by the data structure GftMatrix.See Figures 7 and 8 and the API descriptions below.Notice that the original vector x can be represented by the same data structure as x, thanks to our choice of index ordering.gftVectorElement Return address of element x i (g).

Group space vector API
The data members of GftVector are: group An association to the corresponding group.
num orbits The number of orbits.
data The elements of x.

Group space matrix API
The operations of GftMatrix are: gftMatrixCreate Constructor.Create a group space matrix by supplying the requested inparameters.
gftMatrixOrbit Return address of A i,j .
gftVectorElement Return address of element A i,j (g).
The data members of GftMatrix are: group An association to the corresponding group.
num orbits The number of orbits.
data The elements of A.

Fourier space vectors and matrices
The data types representing elements in the Fourier spaces C m G and C m×m G were designed with the primary goals of achieving efficient solution of the system A x = b and to obtain efficient computation of the GFTs.To this end, we organized the data structures as follows.
The choice of index ordering for the vectors determines the ordering of the Fourier space matrices.Fourier space matrices are organized with the indices in the following order: This data layout allows us to solve Âx = b via standard Lapack calls.Other layouts are possible, but it is essential to store blocks separately and it is important that the matrix layout and the vector layout match each other.
A Fourier space vector x is represented by the data structure GftTrans-formedVector, see Figure 9 and the API description below.A Fourier space matrix Â is represented by the data structure GftTransformedMatrix, see Figure 10.The API is not further described, since operations and data members are almost identical to those for Fourier space vectors.

Fourier space vector API
The operations of GftTransformedVector are: gftTransformedVectorCreate Constructor.Create a Fourier space vector by supplying the requested inparameters.
gftTransformedVectorBlock Return address of x(R r ).
The data members of GftTransformedVector are: group An association to the corresponding group.
num orbits The number of orbits.
data The elements of x.
blockIndices An array that maintains the first index of each block.

Algorithms
Using the data structures described in the previous section, we have developed several implementations of the GFT algorithm.The purpose is to study how different strategies perform in the context of solving dense equivariant linear systems of equations with the GFT approach.The first variants, Vanilla, are general and map closely to ( 6) and (7).The second Matrix Multiplication variant is based upon the notion of computing the GFT via a matrix vector multiplication (8).This makes it possible to use LAPACK.Finally we have implemented Fast variants for the case of D 4 .
Here, the structure of the specific group and its list of nonequivalent irreducible representations are used to reduce the number of arithmetic operations in the GFT.

Vanilla variants
The general Vanilla implementations map closely to ( 6) and (7).Six nested loops are used for the matrix GFT C m×m G gft → C m×m G, and five loops are used for the vector GFT C m G gft → C m G. Apart from roundoff errors, the ordering of the loops has no implications for the final result, but the performance of the algorithm depends on the loop ordering [16].
We have implemented two loop orderings, shown for the case of the matrix GFT in Algorithms 2 and 3.The first ordering is chosen to match the access pattern of Â, whereas the second ordering accesses the elements of A more efficiently.The computational complexity of both algoritms is |G| 2 multiplications and |G| 2 additions per group algebra GFT CG → CG.Algorithm 3 Vanilla 2 algorithm for the matrix GFT.

Matrix multiplication variant for the matrix GFT
Since the GFT can be understood as a matrix vector multiplication, we developed a version where the GFTs are carried out via calls to LAPACK.The matrix GFT is shown in Algorithm 4. The layout of A makes it possible to interpret A as a G × m 2 matrix, and the GFT is carried out by a single matrix matrix multiplication.Since the result must be permuted in order to fit the prescribed layout of Â, this variant uses a temporary memory buffer T .

Algorithm 4 Matrix multiplication matrix GFT.
Interpret A as a G × m 2 matrix Compute T = RA Insert T into col( Â)

Fast variants
By exploiting the structure of R, fast GFTs can be developed.It is beyond the scope of this paper to discuss fast GFTs in detail, and we refer to Maslen and Rockmore for a survey [23,25].Fast GFTs for dihedral groups have been studied by e.g.Stiller [30].We are interested in studying the importance of fast GFTs for our kind of applications, and we have developed two fast versions for the case of D 4 , where the GFT corresponds to multiplication with In the case of D 4 , the structure of R is simple enough to allow development of fast GFTs by inspection.For the conventional vanilla approach, the application of the GFT requires 2|D for j = 0, m − 1 do for i = 0, m − 1 do Âi,j = fgft v (A i,j ) end for end for

Results
In order to evaluate our implementations we have carried out several numerical experiments.These tests were done on a Sun Fire 15k server using the Sun WorkShop 6.2 compiler.Compiler flags were -fast -xarch=v8plusb -dalign -xc99=%all -xO3.The reported timings are obtained as the fastest result of ten consecutive runs, in order to avoid random effects.We have chosen maximum problem sizes in the order of 10000 unknowns, which is quite large on this platform.
The first study focuses on the serial performance of the GFT operation.Particularly, we are interested in comparing the flexible Vanilla implementations with the Fast versions.The second study examines the use of the GFT for solving dense equivariant systems.Third, we investigate the parallel performance of our GFT implementation.The group D 4 is used for all performance results, but the Vanilla algorithms have also been tested for other groups.

Different implementations of the generalized Fourier transform
Figure 11 shows the serial performance of the different implementations described in Section 4. The best performing version is Fast 1 and the worst is Vanilla 1.We note that even though Fast 1 uses more arithmetic operations than Fast 2, it still performs better.We believe that the reason for this is that the aggressive reuse of temporary variables in the Fast 2 algorithm, see Appendix A, makes it impossible for the superscalar processor to pipeline the computations [31].
The bad performance of the Matrix Multiplication variant indicates that the cost of using intermediate, temporary storage is high.An interesting result is that Vanilla 2 variant performs quite well.This general purpose variants perform almost as well as the Fast 2 group specific implementation.

Solving equivariant systems
The execution time for solving a D 4 -equivariant system Ax = b using the GFT, was compared to that of direct solution of the system.For each m ∈ {64, 128, 256, 512, 1024} a randomly generated equivariant system Ax = b, with A ∈ C n×n , n = 8m, was solved using the symmetry exploiting method on one hand and direct solution on the other.For the symmetry exploiting case, the Fast 1 algorithm was used for the GFT.We have measured the time taken to compute A = gft(A), b = gft(b), and x = igft( x) and for solving A x = b, and we compare the sum of these times to the time used to solve Ax = b directly.In both cases LAPACK routines were used when solving the systems.Figure 12 shows a comparison between the times used by the two methods.The time complexity is actually O(n 3 ) in both cases, but the constant is much lower for the GFT approach.
The O(n 3 ) growth of the GFT-based approach is seen more clearly in Figure 13, where the time taken by each step of the symmetry exploiting method is shown.The most time consuming part of the GFT method is the solution of the transformed, block diagonal system.Compared to this time, the times for the GFT operations are almost insignificant.In particular the GFT of b and the IGFT of x take approximately 0.005% of the total time, and can therefore not be distinguished in the graph.

Parallelism and speedup
The performance of an OpenMP parallelization of the GFT approach was also examined.an n × n matrix with n = 24576.We computed A = gft(A), using 1, 2, 4, 8, 16 and 24 CPUs.Figure 14 shows the relative speedup, calculated by dividing the time for one CPU by each of the times measured.The results show that this application is well suited for shared memory parallelism, and we think that the reason for this is that the GFT of different orbits are independent.

Conclusions
We have presented the design of a GFT-based symmetry-exploiting software.The basic idea is to transform a dense equivariant system of equations to a corresponding system in Fourier space, which can be solved much cheaper.Major design goals have been to facilitate efficient equation solving in Fourier space and to simplify the various mappings involved, including the GFT.Our results show that the GFT-based approach is very efficient compared to a direct approach.We have also shown that the matrix GFT is well suited for shared memory parallelism.
We would like to explicitly draw the attention to our design philosophy, since it has a broader applicability than just symmetry exploitation.Most importantly, our software is based closely upon the mathematical abstractions, which makes the software flexible and easy to understand [2].We have verified the flexibility by extending the software to support other groups as well.In this context, we stress the role of the "vanilla strategy."By supplying a generalpurpose GFT routine, a useful software is obtained.If, however, the vanilla version is not fast enough, it is possible to add special-purpose, fast routines.It is well-known that it is much easier to verify an optimized routine once a first version is in place [27].Our point, however, is that the vanilla routine supplied by the framework should be optimized as far as possible while still being general.
Regarding the interesting topic of fast GFTs, we would like to make two observations.First, for the applications we are interested in, the most time  consuming part is the system solving in Fourier space, and therefore we think that an optimized vanilla version is often good enough.Second, an interesting point is noted in Section 5.1.The performance of the Fast 1 and Fast 2 illustrate the fact that it is not only the number of arithmetical operations that is important.The underlying computer architecture, memory accesses, possibility of pipe-lining and so forth are also important factors that determine the performance of an algorithm.It would be an interesting challenge to develop fast GFT algorithms that take issues such as these into account.
For future work, we will address more groups and applications.One application that we have studied is the Black-Scholes equations [8] discretized with radial basis functions [11] on a square.By suitable transformations, the equations are transformed into the heat equation, leading to a dense D 4 -equivariant space discretization matrix.Even though time discretization and boundary treatment destroy equivariance, techniques similar to those presented in [9] can be used to obtain promising results [17,18].

Figure 3 :
Figure 3: Class diagram for the key concepts GftGroup, GftGroupRep, GftVector, GftMatrix, GftTransformedVector, and GftTransformed-Matrix, representing G, R, C m G, C m×m G, C m G, and C m×m G, respectively.

Figure 11 :
Figure 11: Comparison between the different GFT implementations.

Figure 12 :
Figure 12: Comparison between direct and symmetry-exploiting solutions for different matrix sizes.

Figure 13 :
Figure 13: Time consumption of the different steps of the GFT-based approach.
4 | 2 = 128 arithmetic operations.The Fast 1 variant u = fgft 1 (u) is computed by an algorithm that requires 12 temporary variables and 20 arithmetic operations per orbit.The algorithm avoids unnecessary multiplications and reuses temporary results, in a straight-forward fashion.The Fast 2 variant reuses temporary results more agressively, and computes u = fgft 2 (u) with an algorithm that requires 6 temporary variables and 17 arithmetic operations per orbit.The details of the Fast algorithms are accounted for in Appendix A. The matrix GFT is carried out as shown in Algorithm 5, and the vector GFT is carried out analogously.Algorithm 5 Fast matrix GFT variants v = 1, 2 for D 4 .