Implementing a High Performance Tensor Library

Template methods have opened up a new way of building C++ libraries. These methods allow libraries to combine the seemingly contradictory qualities of ease of use and uncompromising efficiency. However, libraries that use these methods are notoriously difficult to develop. This article examines the benefits reaped and the difficulties encountered in using these methods to create a friendly, high performance tensor library. We find that template methods mostly deliver on this promise, though they require moderate compromises in usability and efficiency.


Introduction
Tensors are used in a number of scientific fields, such as geology, mechanical engineering, and astronomy. They can be thought of as generalizations of vectors and matrices. Consider the rather prosaic task of multiplying a vector P by a matrix T, yielding a vector Q. If we write out the equations explicitly, then

$Q_x = T_{xx} P_x + T_{xy} P_y + T_{xz} P_z$,
$Q_y = T_{yx} P_x + T_{yy} P_y + T_{yz} P_z$,
$Q_z = T_{zx} P_x + T_{zy} P_y + T_{zz} P_z$.

Alternatively, we can write it as $Q_z = \sum_{j=x,y,z} T_{zj} P_j$, or even more simply as $Q_i = \sum_{j=x,y,z} T_{ij} P_j$, where the index i is understood to stand for x, y, and z in turn. In this example, $P_j$ and $Q_i$ are vectors, but could also be called rank 1 tensors (because they have one index). $T_{ij}$ is a matrix, or a rank 2 tensor. The more indices, the higher the rank. So the Riemann tensor in General Relativity, $R_{ijkl}$, is a rank 4 tensor, but can also be envisioned as a matrix of matrices. There are more subtleties involved in what defines a tensor, but it is sufficient for our discussion to think of them as generalizations of vectors and matrices.

Einstein introduced the convention that if an index appears in two tensors that multiply each other, then that index is implicitly summed. This mostly removes the need to write the summation symbol $\sum_{j=x,y,z}$. Using this Einstein summation notation, the matrix-vector multiplication becomes simply

$Q_i = T_{ij} P_j$.

Of course, now that the notation has become so nice and compact, it becomes easy to write much more complicated formulas, such as the definition of the Riemann tensor (in one common sign convention)

$R^i{}_{jkl} = \partial_k \Gamma^i{}_{jl} - \partial_l \Gamma^i{}_{jk} + \Gamma^i{}_{mk} \Gamma^m{}_{jl} - \Gamma^i{}_{ml} \Gamma^m{}_{jk}$.

There are some subtle differences between tensors with indices that are upstairs (like $T^i$) and tensors with indices that are downstairs (like $T_i$), but for our purposes we can treat them the same. Now consider evaluating a formula like this on an array of N points, where N is much larger than the cache size of the processor. We could use multidimensional arrays and start writing lots of loops, doing all of the summing and iterating over the grid by hand; what we would like is for the computer to do all of that automatically.
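For concreteness, here is a sketch of that hand-written, C-tran style (C written like Fortran: bare arrays and explicit loops) for the matrix-vector product at each of N grid points. The array layout is our assumption, not the library's:

// Q, T, and P hold grid data, e.g. double Q[3][N], T[3][3][N], P[3][N].
for(int n=0; n<N; ++n)
  for(int i=0; i<3; ++i) {
    Q[i][n] = 0;
    for(int j=0; j<3; ++j)
      Q[i][n] += T[i][j][n] * P[j][n];
  }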
There are a number of libraries with varying amounts of tensor support [1][2][3][8][9][10][11]. With one exception, they are all either difficult to use (primarily, not providing implicit summation) or not efficient. The exception, GRPP [8], solves this conundrum with a proprietary mini-language, making it difficult to customize and extend.
We have written a program to simulate neutron star collisions in General Relativity. It uses tensors extensively, so we have developed a library to simplify their use. In this paper, we start by describing our first, simple design for a tensor class within C++. We progressively refine the overall design to improve the performance, while keeping most of the usability intact. Then we describe the details of implementing natural notation for tensor arithmetic within this final design. We follow with a survey of different compilers, testing how proficient they are at compiling and optimizing the library. We end with a look at a more generic version of the library and how it affects performance.

Design choices for tensor libraries
There are a few different ways that a tensor library can be constructed. Ideally, we want a solution that is easy to implement, easy to use, and efficient.

Simple classes
The most straightforward way to proceed is to make a set of classes (Tensor1, Tensor2, Tensor3, etc.) which simply contain arrays of doubles of size N. Then we overload the operators +, - and * to perform the proper calculation and return a tensor as a result. The well known problem with this is that it is slow and a memory hog. For example, the expression

A = B + C*(D*E);

(that is, $A_i = B_i + C_i (D_j E_j)$) will generate code equivalent to something like

double *temp1 = new double[N];
for(int n=0; n<N; ++n)
  temp1[n] = D[0][n]*E[0][n] + D[1][n]*E[1][n] + D[2][n]*E[2][n];
double *temp2[3];
for(int i=0; i<3; ++i) {
  temp2[i] = new double[N];
  for(int n=0; n<N; ++n)
    temp2[i][n] = C[i][n]*temp1[n];
}
double *temp3[3];
for(int i=0; i<3; ++i) {
  temp3[i] = new double[N];
  for(int n=0; n<N; ++n)
    temp3[i][n] = B[i][n] + temp2[i][n];
}
for(int i=0; i<3; ++i)
  for(int n=0; n<N; ++n)
    A[i][n] = temp3[i][n];
delete[] temp1;
delete[] temp2[0];
delete[] temp2[1];
delete[] temp2[2];
delete[] temp3[0];
delete[] temp3[1];
delete[] temp3[2];

This requires three temporaries occupying 7N doubles of storage. None of these temporaries disappear until the whole expression finishes. For expressions with higher rank tensors, even more temporary space is needed. Moreover, these temporaries are too large to fit entirely into the cache, where they could be quickly accessed. The temporaries have to be moved to main memory as they are computed, even though they will be needed again for the next step of the calculation. With current architectures, the time required to move all of this data back and forth between main memory and the processor is much longer than the time required to do all of the computations.

Expression templates
This is the sort of problem for which template methods are well suited. Using expression templates [13], we can write

A = B + C*(D*E);

and have the compiler transform it into something like

for(int n=0; n<N; ++n)
  for(int i=0; i<3; ++i)
    A[i][n] = B[i][n] + C[i][n]*(D[0][n]*E[0][n]
              + D[1][n]*E[1][n] + D[2][n]*E[2][n]);

The important difference here is that there is only a single loop over the N points. The large temporaries are no longer required, and the intermediate results (like the product of D and E) can stay in the cache. This is a specific instance of a more general code optimization technique called loop fusion. It keeps variables that are needed for multiple computations in the cache, which has much faster access to the processor than main memory.
This gives us both nice notation and efficiency for a single expression. What about a group of expressions? For example, consider inverting a symmetric, 3 x 3 matrix (rank 2 tensor) A. Because it is small, a fairly good method is to do it directly

det = A(0,0)*(A(1,1)*A(2,2) - A(1,2)*A(1,2))
    - A(0,1)*(A(0,1)*A(2,2) - A(1,2)*A(0,2))
    + A(0,2)*(A(0,1)*A(1,2) - A(1,1)*A(0,2));
I(0,0) = (A(1,1)*A(2,2) - A(1,2)*A(1,2))/det;
I(0,1) = (A(0,2)*A(1,2) - A(0,1)*A(2,2))/det;
I(0,2) = (A(0,1)*A(1,2) - A(0,2)*A(1,1))/det;
I(1,1) = (A(0,0)*A(2,2) - A(0,2)*A(0,2))/det;
I(1,2) = (A(0,2)*A(0,1) - A(0,0)*A(1,2))/det;
I(2,2) = (A(0,0)*A(1,1) - A(0,1)*A(0,1))/det;

Through the magic of expression templates, this will then get transformed into something like

for(int n=0; n<N; ++n)
  det[n] = A[0][0][n]*(A[1][1][n]*A[2][2][n] - ...);
for(int n=0; n<N; ++n)
  I[0][0][n] = (A[1][1][n]*A[2][2][n] - A[1][2][n]*A[1][2][n])/det[n];
for(int n=0; n<N; ++n)
  I[0][1][n] = ...;
// and so on...

Once again, we have multiple loops over the grid of N points. We also have a temporary, det, which will be moved between the processor and memory multiple times and can not be saved in the cache. In addition, each of the elements of A will get transferred four times. If we instead manually fuse the loops together

for(int n=0; n<N; ++n) {
  double det = A[0][0][n]*(A[1][1][n]*A[2][2][n] - A[1][2][n]*A[1][2][n])
             - A[0][1][n]*(A[0][1][n]*A[2][2][n] - A[1][2][n]*A[0][2][n])
             + A[0][2][n]*(A[0][1][n]*A[1][2][n] - A[1][1][n]*A[0][2][n]);
  I[0][0][n] = (A[1][1][n]*A[2][2][n] - A[1][2][n]*A[1][2][n])/det;
  // and so on for the other indices...
}

then det and the elements of A at a particular n can fit in the cache while computing all six elements of I. After that, they won't be needed again. For N = 100,000 this code takes anywhere from 10% to 50% less time (depending on architecture) while using less memory. This is not an isolated case. In General Relativity codes, there can be over 100 named temporaries like det. Unless the compiler is omniscient, it will have a hard time fusing all of the loops between statements and removing extraneous temporaries. It becomes even more difficult if there is an additional loop on the outside which loops over multiple grids, as is common in codes that deal with multiple processors or adaptive grids.
As an aside, the Blitz library [11] uses this approach. On the benchmark page for the Origin 2000/SGI C++ [12], there are results for a number of loop kernels. For many of them, Blitz compares quite favorably with the Fortran versions. However, whenever there is more than one expression with terms common to both expressions (as in loop tests #12-14, 16, 23-24), there are dramatic slowdowns. The page even mentions explicitly (after loop test #14) that "The lack of loop fusion really hurts the C++ versions." The POOMA library [9] uses an approach which should solve some of these problems. It splits up calculations into chunks that fit neatly into the cache. Then multiple loops are no longer a problem, because all of the variables in each loop fit into the cache. Doing this correctly is quite difficult, because the library writer must be able to deduce how many different variables are involved in each execution block. In addition, it still requires storage for named temporaries. Does all this mean that we have to go back to C-tran for performance?

Expression templates + manual loop fusion
The flaw in the previous method is that it tried to do two things at once: implicitly sum indices and iterate over the grid. Iterating over the grid while inside the expression necessarily meant excluding other expressions from that iteration. It also required temporaries to be defined over the entire grid. To fix this, we need to manually fuse all of the loops, and provide for temporaries that are not defined over the entire grid. We do this by making two kinds of tensors. One of them just holds the elements (so a Tensor1 has three doubles, and a Tensor2 has nine doubles). This is used for the local named temporaries. The other kind holds pointers to arrays of the elements. Making it a simple double * allows us to use any sort of contiguous storage format for the actual data. The data may be managed by other libraries, giving us access to a pointer that may change. In that sense, the Tensor is not the sole owner of the data, and all copies of the Tensor have equal rights to access and modify the data.
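As an illustration, here is a minimal sketch of the pointer-holding kind, called Tensor1_iter below as in the next paragraph. The member names and the increment operator are our assumptions, not necessarily the library's exact interface:

class Tensor1_iter {
  mutable double *data0, *data1, *data2;
public:
  // Returns a value, not a reference: the data stay logically const
  // even though we can advance the (mutable) pointers.
  double operator()(int N) const {
    return N==0 ? *data0 : (N==1 ? *data1 : *data2);
  }
  // Step all three component pointers to the next grid point.
  void operator++() const { ++data0; ++data1; ++data2; }
};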
We make the pointers mutable so that we can iterate over const Tensor1_iter's. The indexing operators for const Tensor1_iter return a double, not a double * or a double &, so the actual data can't be changed. This is different from how iterators in the Standard Template Library work. It keeps the data logically const, while allowing us to look at all of the points on the grid for that const Tensor1_iter.
For our specific application (numerical General Relativity), the need to iterate explicitly was not a serious problem, because most of the logic of our program is in the manipulation of local named variables. Only a few variables (the input and output) need to be explicitly iterated over the grid.
However, this may not be the right kind of solution for generic arrays, which correspond to rank 0 tensors (tensors without any indices). This approach is a win for higher rank tensors because most of the complexity is in the indices, but for generic arrays there are no indices, so a solution like this would look almost identical to C-tran.

Implementing natural notation efficiently
The previous section described the design choices required for efficiency. In this section, we describe how we achieve a natural tensor notation with minimal overhead.

Basic classes
To illustrate our basic design, we start with rank 1 tensors. The code is slightly more clear for Tensor1's than for Tensor1_iter's, so we will concentrate on them. However, the techniques are the same, and almost all of the code is used by both types.
We define a class Tensor1 with three elements corresponding to the x, y, and z components. We define operator()(int) to return these elements, so if we have a Tensor1 named A, then A(0) gives the x element, A(1) gives the y element, and A(2) gives the z element. The outline of this class so far is

class Tensor1 {
  double data0, data1, data2;
public:
  double & operator()(int N) {
    return N==0 ? data0 : (N==1 ? data1 : data2);
  }
};

Note that there is no range check on the index, so A(1221) will return the same result as A(2). We also have a checked version selected at compile time with #ifdef DEBUG macros, but we omit it here for clarity.
We want to support the notation A(i)=B(i), where i is implicitly summed over 0, 1, and 2. To do this, we use expression templates [13], because they transparently provide high performance. We define two auxiliary classes, Index and Tensor1_Expr. Index is used to tell the compiler what the index of the Tensor1 is. It uses a template parameter to store this information, so it is otherwise empty. The definition of Index is thus rather simple

template<char i> class Index {};

On the other hand, Tensor1_Expr is designed to hold any kind of expression that eventually simplifies to a rank 1 tensor. For example, the expressions A(i) and B(j)*T(j,i) (which has an implicit summation over j) both simplify to a tensor with one index. To accomplish this, Tensor1_Expr has two template parameters that tell it 1) what kind of object it contains, and 2) what its index is. This class is analogous to the DExpr class in the original expression templates paper [13]. The definition of Tensor1_Expr is then

template<class A, char i>
class Tensor1_Expr {
  A iter;
public:
  Tensor1_Expr(const A &a): iter(a) {}
  double operator()(int N) const { return iter(N); }
};

Here, the template parameter class A is the object contained in the expression, and char i is the tensor index. We overload the member function Tensor1::operator()(Index) to return a Tensor1_Expr

template<char i>
Tensor1_Expr<Tensor1,i> operator()(Index<i> index) {
  return Tensor1_Expr<Tensor1,i>(*this);
}

An example of its use is

Index<'i'> i;
Tensor1 A;
A(i);

Here, the statement A(i); creates a Tensor1_Expr<Tensor1,'i'>. This just illustrates the simplest case, where a Tensor1_Expr holds a Tensor1. To assign one tensor to another, we create a partial specialization of Tensor1_Expr for the case when it contains a Tensor1

template<char i>
class Tensor1_Expr<Tensor1, i> {
  Tensor1 &iter;
public:
  Tensor1_Expr(Tensor1 &a): iter(a) {}
  double & operator()(int N) { return iter(N); }
  double operator()(int N) const { return iter(N); }
  template<class B>
  const Tensor1_Expr<Tensor1,i> &
  operator=(const Tensor1_Expr<B,i> &result) {
    iter(0) = result(0);
    iter(1) = result(1);
    iter(2) = result(2);
    return *this;
  }
};

This is almost the same as the general Tensor1_Expr. The only differences are that it defines the equals operator, and it takes a reference to the object that it contains (Tensor1 &iter), instead of a copy (A iter). The second change is needed in order for assignment to work. Our example now becomes

Index<'i'> i;
Tensor1 A, B;
A(i) = B(i);

The last statement creates two Tensor1_Expr<Tensor1,'i'>'s, one for A and one for B. It then assigns the elements of B to the elements of A. If we had tried something like

Index<'i'> i;
Index<'j'> j;
Tensor1 A, B;
A(i) = B(j);

then the compiler would not have found a suitable operator=(). The second Tensor1_Expr<> template parameter (char), which was obtained from Tensor1::operator()(Index<i> index), would not match. This provides strong compile-time checking of tensor expressions.

Arithmetic operators
Now we want to do something really useful: add two Tensor1's together. This is where expression templates really come into play. We do this by creating a helper class, Tensor1_plus_Tensor1. In the original expression templates paper [13], there was a generalized DBinOpExpr class. However, the action of * is very different from + and -, and / is not well defined for tensors of rank 1 or higher, so there is no real advantage to a generalized BinOpExpr class. The helper class is defined as

template<class A, class B, char i>
class Tensor1_plus_Tensor1 {
  const Tensor1_Expr<A,i> iterA;
  const Tensor1_Expr<B,i> iterB;
public:
  Tensor1_plus_Tensor1(const Tensor1_Expr<A,i> &a,
                       const Tensor1_Expr<B,i> &b)
    : iterA(a), iterB(b) {}
  double operator()(int N) const { return iterA(N) + iterB(N); }
};

Note that the indices of the two Tensor1_Expr's have to match up, or they won't have the same char template parameter. This is another example of strict compile-time checking for the validity of tensor expressions.
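The addition operator itself then just wraps this helper in a new Tensor1_Expr. A minimal sketch of what it might look like (the exact signature in the library may differ):

template<class A, class B, char i>
inline Tensor1_Expr<Tensor1_plus_Tensor1<A,B,i>, i>
operator+(const Tensor1_Expr<A,i> &a, const Tensor1_Expr<B,i> &b) {
  typedef Tensor1_plus_Tensor1<A,B,i> TensorExpr;
  return Tensor1_Expr<TensorExpr,i>(TensorExpr(a,b));
}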
When the assignment operator asks for element N of the result, the outer Tensor1_Expr passes the call to its contained object, the Tensor1_plus_Tensor1. The Tensor1_plus_Tensor1 object returns the sum of the calls to the two objects (Tensor1_Expr's) it contains. The Tensor1_Expr's pass the call to the Tensor1's they contain, and we finally get the results.
The code for subtraction is exactly the same, with + replaced by - and _plus_ replaced by _minus_. The * operator has a very different meaning, which depends on what the indices are. For example, A(i)*B(i) contracts the two tensors together, implicitly summing the indices and yielding a double, while A(i)*B(j) creates a new Tensor2 with indices i and j. Implicit summation is described in the next section; the solution for the outer product A(i)*B(j) is quite similar to the addition operator described above. We first need a helper class, sketched below.
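A minimal sketch of the outer-product helper and its operator*. We assume a Tensor2_Expr class analogous to Tensor1_Expr but carrying two indices; the class and parameter names here are ours:

template<class A, class B, char i, char j>
class Tensor1_times_Tensor1 {
  const Tensor1_Expr<A,i> iterA;
  const Tensor1_Expr<B,j> iterB;
public:
  Tensor1_times_Tensor1(const Tensor1_Expr<A,i> &a,
                        const Tensor1_Expr<B,j> &b)
    : iterA(a), iterB(b) {}
  // Element (N1,N2) of the outer product A(i)*B(j).
  double operator()(int N1, int N2) const { return iterA(N1)*iterB(N2); }
};

// Wrap the helper in a rank 2 expression carrying both indices.
template<class A, class B, char i, char j>
inline Tensor2_Expr<Tensor1_times_Tensor1<A,B,i,j>, i, j>
operator*(const Tensor1_Expr<A,i> &a, const Tensor1_Expr<B,j> &b) {
  typedef Tensor1_times_Tensor1<A,B,i,j> TensorExpr;
  return Tensor2_Expr<TensorExpr,i,j>(TensorExpr(a,b));
}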

Implicit summation
The preceding work is not really that interesting. Blitz [11] already implements something almost like this. What really distinguishes this library from others is its natural notation for implicit summation, or contraction. There are two kinds of contraction: external and internal.

External contraction
External contraction is when the index of one tensor contracts with the index of another tensor. This is the most common case. Consider the simple contraction of two rank 1 tensors

double result = A(i)*B(i);

To accomplish this, we specialize operator*(Tensor1_Expr, Tensor1_Expr), as sketched below. Because the function is templated on the parameter i, which comes from the Index when the Tensor1_Expr is created, it will only be called for operands that have the same index (i.e. A(i)*B(i), not A(i)*B(j)).
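A minimal sketch of this specialization, assuming the three-dimensional Tensor1_Expr interface from above:

template<class A, class B, char i>
inline double operator*(const Tensor1_Expr<A,i> &a,
                        const Tensor1_Expr<B,i> &b) {
  // Implicitly sum over the repeated index i.
  return a(0)*b(0) + a(1)*b(1) + a(2)*b(2);
}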
We also want to contract tensors together in a way that results in a tensor expression, such as a Tensor1 contracted with a Tensor2 (A(i)*T(i,j)). As with the addition and subtraction operators, we use a helper class, Tensor1_times_Tensor2_0. The 0 appended to the end of the class name is a simple way of distinguishing the classes, since we will need a similar class for the case of A(i)*T(j,i) (as opposed to A(i)*T(i,j), which we have here). We then specialize operator*(Tensor1_Expr, Tensor2_Expr); a sketch of both follows.
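A minimal sketch of the helper and the specialization, again assuming a Tensor2_Expr class analogous to Tensor1_Expr (the exact library code may differ):

template<class A, class B, char i, char j>
class Tensor1_times_Tensor2_0 {
  const Tensor1_Expr<A,i> iterA;
  const Tensor2_Expr<B,i,j> iterB;
public:
  Tensor1_times_Tensor2_0(const Tensor1_Expr<A,i> &a,
                          const Tensor2_Expr<B,i,j> &b)
    : iterA(a), iterB(b) {}
  // Sum over the contracted (first) index of T for free index N.
  double operator()(int N) const {
    return iterA(0)*iterB(0,N) + iterA(1)*iterB(1,N) + iterA(2)*iterB(2,N);
  }
};

// The result is a rank 1 expression indexed by the remaining index j.
template<class A, class B, char i, char j>
inline Tensor1_Expr<Tensor1_times_Tensor2_0<A,B,i,j>, j>
operator*(const Tensor1_Expr<A,i> &a, const Tensor2_Expr<B,i,j> &b) {
  typedef Tensor1_times_Tensor2_0<A,B,i,j> TensorExpr;
  return Tensor1_Expr<TensorExpr,j>(TensorExpr(a,b));
}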

Internal contraction
Contraction can also occur within a single tensor. The only requirement is that there are two indices to contract against each other. A simple example would be

Index<'i'> i;
Tensor2 T;
double result = T(i,i);

The last line is equivalent to

double result = T(0,0) + T(1,1) + T(2,2);

This internal contraction is simply implemented by specializing Tensor2::operator()(Index,Index)

template<char i>
double operator()(const Index<i> index1, const Index<i> index2) {
  return data00 + data11 + data22;
}

There is also a more complicated case where there is an internal contraction, but the result is still a tensor. For example, a rank 3 tensor W can contract to a rank 1 tensor, as in W(i,j,j). For this, we define a helper class

template<class A, char i>
class Tensor3_contracted_12 {
  const A iterA;
public:
  Tensor3_contracted_12(const A &a): iterA(a) {}
  // Sum over the second and third indices for a given first index N.
  double operator()(const int N) const {
    return iterA(N,0,0) + iterA(N,1,1) + iterA(N,2,2);
  }
};

Then we define a specialization of operator()(Index,Index,Index) to create one of these objects

template<char i, char j>
inline Tensor1_Expr<const Tensor3_contracted_12<Tensor3_dg,i>,i>
operator()(const Index<i> index1, const Index<j> index2,
           const Index<j> index3) const {
  typedef const Tensor3_contracted_12<Tensor3_dg,i> TensorExpr;
  return Tensor1_Expr<TensorExpr,i>(TensorExpr(*this));
}

Now, if we ask for the x component of W(i,j,j), the compiler will automatically sum over the second and third indices, returning W(0,0,0)+W(0,1,1)+W(0,2,2).

Reduced rank tensors
Expressions like A(i)=T(0,i) can sometimes pop up. To handle this, we make a helper class

template<class A>
class Tensor2_numeral_0 {
  const A iterA;
  const int N;  // the fixed value of the first index
public:
  Tensor2_numeral_0(const A &a, const int NN): iterA(a), N(NN) {}
  double operator()(const int N1) const { return iterA(N, N1); }
};

The end result of all of this is that when we write

Index<'i'> i;
Tensor1 A;
Tensor2 T;
A(i) = T(0,i);

we create a Tensor1_Expr<Tensor2_numeral_0<Tensor2>,'i'>, which then gets assigned to the Tensor1_Expr<Tensor1,'i'> created by A(i).
Unfortunately, a syntax like T(0,i) is inefficient, because the value of the first index (0) is stored in a runtime variable, which is difficult, even in simple cases, for compilers to deduce and apply at compile time. To aid the compiler in the case where the programmer does know the value of the index at compile time, we introduce a new auxiliary class, Number. Like Index, there is very little to the actual class; because of its conversion operator, it can be used anywhere an int can be used. Simple tests show that a good compiler can use the template arguments to optimize an expression written with Number as well as if it were written with simple arrays, while it can't optimize expressions with simple int's as indices.
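A minimal sketch of such a Number class, consistent with the description above (the usage line is our assumption about the intended syntax):

template<int N>
class Number {
public:
  Number() {}
  // The index value is part of the type, so the compiler can
  // resolve it at compile time wherever an int is expected.
  operator int() const { return N; }
};

// Hypothetical usage: A(i) = T(Number<0>(), i);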

Symmetric tensors
It turns out that implementing a symmetric tensor is quite simple. First, we define a class (e.g. Tensor2_symmetric) with the minimum number of elements. Then we write the indexing operators (operator()(int,int,...)) so that, if an element that is not stored is requested, the operator uses the symmetry and returns the equivalent stored one. For example, for a symmetric rank 2 tensor, we only define data00, data01, data02, data11, data12, and data22. Then, if element (2,1) is requested, we just return data12.
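A minimal sketch of such an indexing operator, following the element naming above (the exact library code may differ):

class Tensor2_symmetric {
  double data00, data01, data02, data11, data12, data22;
public:
  double & operator()(int N1, int N2) {
    // Use the symmetry: a request for (2,1) returns the stored (1,2).
    if (N1 > N2) { const int tmp = N1; N1 = N2; N2 = tmp; }
    return N1 == 0 ? (N2 == 0 ? data00 : (N2 == 1 ? data01 : data02))
         : N1 == 1 ? (N2 == 1 ? data11 : data12)
         : data22;
  }
};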

Antisymmetric tensors
Implementing antisymmetric tensors is a bit more tricky.
The same kinds of changes are made to the definitions of Tensor and Tensor_Expr, but it is not clear what should be returned when operator()(int,int,...) asks for an element that is identically zero (such as A(0,0)) or is the negative of the value that we store (such as A(1,0)). The imperfect solution that we have is to rename T & operator()(int,int,...) to T & unsafe(int,int,...). However, T operator()(int,int,...) const is still defined, and returns the appropriate value. This allows a library user to access the elements of the antisymmetric tensor using the natural syntax, while assignment is made obviously unsafe. However, if only Index's are used, then the library will automatically sum over the correct elements. It uses unsafe internally, so most of the time the library user will not even need to know about unsafe. In those cases where the user does have to use unsafe, debug builds should catch any illegal assignment at runtime.

Implementation and testing
We have implemented the (inefficient) "Simple Classes" method (Section 2.1) and the (efficient) "Expression Templates + Manual Loop Fusion" method (Section 2.3). We did not attempt to implement "Expression Templates" (Section 2.2), because it was clear that it could not be as efficient as the method with manual loop fusion, while still being a difficult chore to implement.

Compiler compatibility
Not all compilers support enough of the C++ standard to compile expression templates, while simple classes work with almost any compiler. A comparison of twelve combinations of compiler and operating system is shown in Table 1.
The template support of these compilers is actually quite good. A year ago, many of these compilers would not have been able to compile the efficient library. We tested a slightly out of date version of SGI's compiler, the current version being 7.3.1.3. Even so, its only problem is that it does not make <cmath> available, and it is easy to work around that. IBM's compiler seems to be immature, with a remarkable number of Internal Compiler Errors (ICE's). Version 6.0 is currently in beta testing, which may rectify that. The Sun compiler is the only one that was completely unable to compile the efficient library. Sun has also come out with a new version recently, which may fix those problems.
The C++ standard specifies that compliant programs can rely on only 17 levels of template instantiation; otherwise, it would be difficult to detect and prevent infinite recursion. However, the intermediate types produced by expression template techniques can exceed this limit. Most compilers allowed us to override the limit on the number of pending instantiations, with the exception of the SGI, Portland Group, and IBM compilers (although the Intel command line option, "-Qoption,cpp,-pending_instantiations,N", is not documented). The SGI and Portland Group compilers would not compile any program with too many levels. The IBM compiler did not honor the limit and happily compiled programs with more than 50 levels of template instantiation. This is not a complete list of C++ compilers. Notably, it does not include the Microsoft, Borland, Metrowerks, or HP compilers. The Microsoft compiler probably can not compile the efficient library, since it does not support partial specialization. As for Metrowerks on other platforms, as well as the Borland and HP compilers, it is impossible to say anything without actually trying them out. This template library tends to flush out obscure bugs in compilers that claim full compliance. It is useful enough in that regard that it has become part of the official release criteria for GNU gcc [25], in addition to the POOMA [9] and Blitz [11] template libraries.

Benchmarks
We have three tests to measure the efficiency of the various methods relative to each other and to C-tran. We have not attempted a direct comparison with other tensor libraries, because most do not support implicit summation and none of them support the wide range of tensor types needed (ranks 1, 2, 3 and 4 with various symmetries). This makes replicating the functionality in the tests extremely time consuming.

Small loop kernel: Implementation test
To make sure that we didn't make any gross errors in implementing expression templates, we use a small loop kernel to compare the efficient library against C-tran style. The tensor version is simply

Tensor1 x(0,1,2), y(3,4,5), z(6,7,8);
for(int n=0; n<1000000; ++n) {
  Index<'i'> i;
  x(i) += y(i) + z(i)
        + (y(i)+z(i)) - (y(i)+z(i))
        + ...;
}

The complexity of the expression is determined by how many (y(i)+z(i)) - (y(i)+z(i)) terms there are in the final expression. Note that since we're adding and subtracting the same amounts, the essential computation has not changed. We also coded a version in C-tran style using ordinary arrays, and compared the execution speeds of the two versions.
For large expressions, KCC was the only compiler that could fully optimize away the overhead from the expression templates, although we had to turn off exceptions. This is good evidence that we didn't make any serious optimization errors in the implementation. For the other compilers, the slowdown increased with the number of terms, becoming more than 100 times slower than the C-tran version.
This benchmark may be deceiving, though. The C-tran versions all run at the same speed regardless of how many terms we add. An examination of the assembler output shows that the compiler removes the identically zero subexpressions. This wouldn't be possible in most production codes, so the relative slowdown may not be as great there.

Small loop kernel: Compiler optimization test
To get a better handle on how well the compilers optimize more realistic code, we wrote several versions of a code to compute an infinite sum whose terms are built from $(a1_i)$, $(2 \cdot a2_i)$, $(3 \cdot a3_i)$, and so on, with coefficients that fall off rapidly enough for the sum to converge. The simplest version computes the sum with only the $(a1_i)$ term, the second version has both the $(a1_i)$ and the $(2 \cdot a2_i)$ terms, and so on. This gives the compiler a more and more complicated expression to optimize. The code for the tensor version is then, schematically,

Tensor1 y(0,1,2);
Tensor1 a1(2,3,4);
Tensor1 a2(5,6,7);
Tensor1 a3(8,9,10);
Tensor1 a4(11,12,13);
Tensor1 a5(14,15,16);
Index<'i'> i;
for(int n=0; n<1000000; ++n) {
  y(i) += a1(i) + 2*a2(i) + ...;
}

with complexity determined by how much we fill in the ellipses. After n gets to about fifteen, the sum converges to machine precision, although current compilers can not deduce that. We vary the number of iterations (1000000 here) so that the run finishes in a reasonable time. We also laboriously coded a C-tran version and compared the execution speed of the two. Figure 1 plots the relative execution times of the Tensor1 and C-tran versions versus the number of operators (+ and *) in the expressions.
The specific compiler options used to create this plot are listed in the appendix. The performance of some of the compilers may be a little overstated, since they don't optimize the C-tran code as well as some other compilers. On Linux x86, the fastest compiler for the C-tran code was either the Intel or the KAI compiler, and on AIX it was IBM's xlC. So in Figs 2 and 3 we plot the relative execution time of the fastest C-tran codes versus the various Tensor codes for Linux and AIX.
These are busy plots, but the main thing to notice is that sometimes the compilers can optimize the expressions well, though often not. Some compilers do well with small expressions, and some do better with larger expressions. However, even for the best compilers, the Tensor1 class can run much slower than C-tran. In particular, the non-x86 compilers seem to fare worse than their x86 cousins.

Complete application: Numerical relativity
To gauge how much the lack of optimization matters in real codes, we used our application for simulating neutron star collisions in General Relativity [5]. This code has to compute many complicated formulas on grids much too large to fit into processor caches. We found that, when compiled with KAI's KCC 4.0d compiler on an IBM SP2 running AIX, the "Expression Templates + Manual Loop Fusion" library (Section 2.3) runs about twice as fast and uses a third of the memory of the "Simple Classes" library (Section 2.1). This was after we attempted to optimize the "Simple Classes" code by manually constructing expression trees to reduce temporaries [26]. Unfortunately, the expression trees quickly became too numerous and varied to create by hand.
When compiled with GNU gcc 2.95.2 or IBM's xlC 5.0.1.0, the "Expression Templates + Manual Loop Fusion" library code was 10-20% slower than when compiled with KCC. The logic was far too complicated to create a C-tran version.
Because these tests were run with different compiler versions on different machines than those used for the loop kernels, we plot results analogous to Fig. 3 in Fig. 4. KAI's compiler does well with small expressions, which might suggest that the application spends most of its time in small tensor expressions. It is difficult to tell for sure, because the compiler inlines almost everything, making direct measurement tricky. But if that were the case, the difference between KCC and gcc would be much larger than the 10-20% we see. This suggests that the differences lie elsewhere, and that expression templates are not the bottleneck, though it is impossible to say anything for sure.

Extending the library
An experienced reader may have looked at the rough declaration of Tensor1 and thought that hard coding it to be made up of double's is rather short sighted. It is not so difficult to envision the need for tensors made up of int's or complex<double>'s. It might also be nice to have two or four dimensional tensors (so a Tensor1 would have 2 or 4 elements, a Tensor2 would have 4 or 16 elements). The obvious answer is to make the type and the dimension into template parameters. We can use arrays for the elements, in which case the class becomes

template<class T, int Dim>
class Tensor1 {
  T data[Dim];
  ...
};

There is a minor wrinkle if we want to keep a simple constructor syntax like

Tensor1<double,3> A(1.0, 2.0, 3.0);

because a user could then call, say, a three-argument constructor on a two-dimensional Tensor1 and write off of the end of the array. This could be a source of subtle bugs. Since compile-time checks are better than run-time checks, we solve this with a helper class. For Tensor1, this class is Tensor1_constructor, sketched below. In the Tensor1 class, we define the constructors to just call Tensor1_constructor's constructor

Tensor1(T d0, T d1) {
  Tensor1_constructor<T,Dim>(data, d0, d1);
}
Tensor1(T d0, T d1, T d2) {
  Tensor1_constructor<T,Dim>(data, d0, d1, d2);
}

Now, if someone tries to give too many or too few arguments to the constructor, the compiler will not be able to find the correct constructor for Tensor1_constructor. The partially specialized versions of Tensor1_constructor only have constructors for the correct number of arguments.
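A minimal sketch of the Tensor1_constructor helper, consistent with the calls above; only the two and three dimensional specializations are shown, and the exact library code may differ:

// No generic definition: wrong argument counts fail to compile.
template<class T, int Dim>
class Tensor1_constructor;

template<class T>
class Tensor1_constructor<T,2> {
public:
  Tensor1_constructor(T data[], T d0, T d1) {
    data[0] = d0;
    data[1] = d1;
  }
};

template<class T>
class Tensor1_constructor<T,3> {
public:
  Tensor1_constructor(T data[], T d0, T d1, T d2) {
    data[0] = d0;
    data[1] = d1;
    data[2] = d2;
  }
};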
Indexing is also much simpler when using arrays, although symmetric and antisymmetric tensors require some attention. A rank 2 symmetric tensor in Dim dimensions has Dim(Dim+1)/2 independent elements, while an antisymmetric tensor has Dim(Dim-1)/2 independent elements. We store them in a one-dimensional array and translate between the tensor indices and the array index. For Tensor2_symmetric, this means that operator()(int,int) becomes something like

T & operator()(int N1, int N2) {
  // Map the two tensor indices onto the packed one-dimensional array.
  return N1 > N2 ? data[N1 + (N2*(2*Dim - N2 - 1))/2]
                 : data[N2 + (N1*(2*Dim - N1 - 1))/2];
}

We also modify the TensorN_Expr classes so that they carry information about their dimension and type. We can use traits [7] to automatically promote types (e.g. from int to double, or from double to complex<double>). We can also make the arithmetic operators dimension agnostic with some template meta-programming [14]. It turns out that the only places where the dimension comes in are assignment and implicit summation. For the case of assignment to a Tensor1_Expr, we use templates to do a simple compile-time loop from Dim-1 down to 0, as sketched after this paragraph. Defining assignment operators for higher rank tensors, as well as the implicit summation functions, uses similar loops. Now, if we're trying to follow Buckaroo Banzai across the 8th dimension, we only have to define the constructors for the Tensor1, Tensor2, Tensor3, etc. classes for eight dimensions, and all of the Tensor_Expr classes and arithmetic operators are ready to use.
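A minimal sketch of such a compile-time assignment loop, reusing the Number auxiliary class introduced earlier; the helper's name and exact signature are ours, not necessarily the library's:

// Recursive case: assign element Current_Dim-1, then recurse downward.
template<class A, class B, int Current_Dim>
inline void T1_equals_T1(A &iter, const B &result,
                         const Number<Current_Dim> &) {
  iter(Current_Dim - 1) = result(Current_Dim - 1);
  T1_equals_T1(iter, result, Number<Current_Dim - 1>());
}

// Base case: assigning element 0 ends the recursion.
template<class A, class B>
inline void T1_equals_T1(A &iter, const B &result, const Number<1> &) {
  iter(0) = result(0);
}

// The assignment operator starts the unrolled loop at Dim:
// T1_equals_T1(iter, result, Number<Dim>());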
In this framework, we can also define Index to have a dimension

template<char i, int Dim> class Index {};

When creating a Tensor_Expr, we can use the dimension of the Index, rather than the dimension of the Tensor, to determine the Tensor_Expr's dimension. Then if we have a four-dimensional tensor, it becomes simple to manipulate only the lower three dimensional parts by using only three dimensional Index's. There is a danger, though: what if we use a four-dimensional Index with a three dimensional Tensor? Range-checked builds should catch this kind of error.
We have implemented this generalization [6]. IBM's xlC can not compile it, always aborting with an Internal Compiler Error. Also, KCC can't fully optimize complicated expressions in the first benchmark as it could with the simpler version of the library, leading to code that runs hundreds of times slower. Interestingly enough, the TinyVector classes in Blitz [11] are also templated on type and dimension, and complicated expressions can not be fully optimized in that kind of benchmark either.
However, the performance in the second benchmark is not affected in the same way. Figure 5 shows the relative execution times for C-tran versus the more general Tensor's, and Fig. 6 directly compares the two versions of the Tensor library.
The results are mixed: the KAI, Portland Group, and Intel compilers generally do better, while the Comeau compiler does worse. Once again, the non-x86 compilers do not perform very well. However, the overall conclusions from the last section are unchanged. This is nice, in the sense that making the library more general doesn't necessarily cause another hit in performance.

Conclusion
We have described the design of a freely available, high performance tensor library [6]. It uses a number of modern C++ template tricks, supported by most, but not all, compilers, to produce generic, flexible, and fast code. These tricks include expression templates [13], template metaprograms [14], and traits [7], mixed in with a number of helper classes. They allow the library to express indices, tensors, tensor expressions, binary operators on these expressions, internal and external contractions, reduced rank tensors, and tensor symmetries in an efficient, type-safe framework, generalized for any dimension or type, using natural notation.
However, the original promise of expression templates as a way to get away from C-tran is not completely fulfilled. Although the syntax is much improved, there are still cases where a programmer must resort to at least some manual loops in order to get maximum performance. Even with this work, there are still performance penalties, sometimes severe, which vary from problem to problem, although in some cases making a library more general and using template techniques in more places can improve performance. In particular, non-x86 compilers seem to do a bad job of optimizing complicated template expressions. This is probably just a reflection of the relative effort that has been put into optimizing for the x86 platform. Despite these caveats, for our General Relativity application expression templates were a huge win. Compared to simple tensor classes, template techniques enabled faster, smaller executables, though at the cost of longer compilation times and stringent compiler requirements. Even compared to C-tran, it is not clear that there is a significant speed penalty. What is clear is that the program would never have been finished without the syntactic simplicity afforded by tensor classes of some sort. Furthermore, compiler requirements have become much less onerous as C++ compiler technology has caught up with the ISO standard.

Fig. 1. Relative execution times of C-tran and Tensor1's. Any points less than one mean that the Tensor code is slower than C-tran.

Fig. 3. Relative execution times of fastest C-tran and Tensor's on AIX.

Fig. 4. Relative execution times of fastest C-tran and Tensor1 on AIX for different compiler versions.

Fig. 5. Relative execution time of C-tran and more general Tensor's.

Fig. 6. Relative execution time of simple and more general Tensor's.

Table 1
Compilers and operating systems tested. The Metrowerks compiler is actually the GNU gcc 2.95.2 compiler with a fancy GUI; that version of gcc will compile this library, but it can't compile the more general version described in Section 5.