Speeding up N-body Calculations on Machines without Hardware Square Root

The most time consuming part of an N-body simulation is computing the components of the accelerations of the particles. On most machines the slowest part of computing the acceleration is in evaluating $(r^2)^{-3/2}$, which is especially true on machines that do the square root in software. This note shows how to cut the time for this part of the calculation by a factor of 3 or more using standard Fortran.


INTRODUCTION
Many phenomena in astrophysics and chemistry are being simulated using N-body methods [1, 2]. The most time consuming part of such simulations is computing the accelerations on each particle due to all the others. This is true for the simple $N^2$ methods, tree-based methods [3], or those using neighbor lists [4].
If the potential being used has an odd power of the particle separation in it, computing the orthogonal components of the acceleration will involve taking a square root. Although some machines do square root in hardware, many do not.
It is not unusual to find that half the run time of an N-body calculation is spent in the square root subroutine.
In our case, we want to evaluate the acceleration on each particle in a system of self-gravitating bodies. For example, the $x$-component of the acceleration of particle $j$ is

$$a_x = \hat{i} \cdot G \sum_{k \neq j} \frac{m_k \, (\mathbf{r}_k - \mathbf{r}_j)}{r^3},$$

where $\hat{i}$ is the unit vector in the $x$ direction, $G$ is the gravitational constant, $m_k$ is the mass of particle $k$, and $r$ is the separation between the particles.
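As a point of reference, a minimal Fortran sketch of the direct evaluation of this sum might look like the following; the variable and routine names are illustrative and are not taken from the code in the Appendix.

      subroutine accelx(j, n, x, y, z, xm, g, ax)
*     Direct evaluation of the x-component of the acceleration on
*     particle j due to the other n-1 bodies.  The square root and
*     the division in the r3inv line are the costly operations.
      integer j, n, k
      double precision x(n), y(n), z(n), xm(n), g, ax
      double precision dx, dy, dz, r2, r3inv
      ax = 0.0d0
      do 10 k = 1, n
         if (k .eq. j) go to 10
         dx = x(k) - x(j)
         dy = y(k) - y(j)
         dz = z(k) - z(j)
         r2 = dx*dx + dy*dy + dz*dz
         r3inv = 1.0d0/(r2*sqrt(r2))
         ax = ax + g*xm(k)*dx*r3inv
   10 continue
      end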
The system square root routine, not knowing how its result will be used, computes the square root with a divide-free Newton iteration for the inverse of the square root, followed by a multiplication by the input value to get the final result. The compiler then multiplies the result by $r^2$ and divides the product into the numerator. Because both divisions and square roots are usually slow, this operation takes a long time. The machine gets much of its performance by pipelining arithmetic operations, so unrolling loops frequently speeds things up; the square root and division, however, do not benefit. It is possible to do considerably better than the direct computation. This note shows how to use standard Fortran to evaluate the acceleration in about one third the time taken by the direct evaluation.

ALGORITHM
The only way we can beat the efficiency of the system routines is to use our extra knowledge of the problem. In this case, we will not compute $r$ itself. Instead, we will compute $r^{-3}$ directly from $r^2$.
The simplest approach is to approximate this function with a polynomial. Chebychev polynomials are frequently used because they minimize the maximum error of the approximation on some interval [5]. One difficulty is that the approximation is accurate only if the arguments are limited to a relatively small range. With a range reduction the procedure for an input argument $r^2$ becomes (a numerical example follows):

1. Find a $u$ such that $\alpha \le u r^2 < \beta$, where $\alpha$ and $\beta$ are numbers of order unity.
2. Evaluate the polynomial approximation of $(u r^2)^{-3/2}$.
3. Get the correct result by multiplying the approximation by $t = u^{3/2}$.
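For example, for an input $r^2 = 10 = 1.25 \times 2^3$, a suitable scale factor is $u = 2^{-3}$, so that $u r^2 = 1.25$ lies in $[1, 2)$. The polynomial gives approximately $(1.25)^{-3/2} \approx 0.7155$, and multiplying by $t = u^{3/2} = 2^{-4.5} \approx 0.0442$ recovers $10^{-3/2} \approx 0.0316$.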
Because I want to keep my algorithm entirely in Fortran, I decided to use two tables for u and t. The entries in u are simply the power of 2 such that $1 \le u r^2 < 2$. The entries in t are $u^{3/2}$. The only problem is to figure out which table entries to use for a given input value.
The range reduction I use is based on the IEEE double precision number format [6]. Each number consists of 64 bits: 1 sign bit, 11 exponent bits, and 52 fraction bits. In addition, there is an implicit 1 bit for normalized numbers.
If I know the argument is positive, as it must be for the function I am interested in, I can extract the exponent by shifting the high order 32 bits of the floating-point number 20 bits to the right. In Fortran, this procedure requires that I EQUIVALENCE the double precision number to an integer or pass a double precision argument to a subroutine that uses it as an integer. Shifting the integer gives the index in the tables. Because there are only 11 bits to represent the exponent, I know my tables need only 2,048 entries.
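A minimal sketch of this extraction, assuming big-endian word order (as on the RS/6000) so that the sign and exponent sit in the first of the two EQUIVALENCEd integer words, and assuming the ISHFT intrinsic for the shift:

      integer function itab(arg)
*     Table index for a positive double precision argument.
*     Assumes big-endian word order, so the sign and exponent
*     occupy the high-order 32 bits, i.e. the first integer word.
      double precision arg, r2
      integer ir2(2)
      equivalence (r2, ir2)
      r2 = arg
*     For a positive argument the sign bit is 0, so shifting the
*     high word right by 20 bits leaves just the 11-bit biased
*     exponent: an index between 0 and 2047.
      itab = ishft(ir2(1), -20)
      end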
I could have coded my tables as DATA statements in the program, but I decided to ask the user to make a single call to set them up, as is frequently done with Fourier transform routines. The code to build the tables is contained in the program in the Appendix.
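A sketch of what the set-up code computes for the two tables; the actual routine is in the Appendix, and the index here is the 11-bit biased exponent extracted above.

      subroutine settab(u, t)
*     Build the scaling tables indexed by the biased exponent e of
*     the input r2.  For exponent e the input lies in
*     [2**(e-1023), 2**(e-1022)), so u(e) = 2**(1023-e) puts u*r2
*     in [1,2), and t(e) = u(e)**1.5 undoes the scaling at the end.
*     At the extreme exponents these powers overflow or underflow,
*     which gives the behavior described in the Conclusions.
      double precision u(0:2047), t(0:2047)
      integer e
      do 10 e = 0, 2047
         u(e) = 2.0d0**(1023 - e)
         t(e) = u(e)*sqrt(u(e))
   10 continue
      end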
l\"ow that I have scaled the input to a mode,;t range, I can do the approximation. The coefficients of the fit are ea;;y to compute [S:. If I wriw the approximation of (ur 2 )-:lt2 a,.; where x = 21:ur 2 ) -3. the coefficiPnts ck can be calculated from where the :tj are the zero,; of T\(.r;. x 1 = cosf rr.j-· 1/2 )IS]. The change of variable from ~ur 2 ; to .r i,;; needed because the Chebyche,· polynomial,; are orthogonal on the in ten al [ -1. 1:. If we choo,;p S P m. the approximation will be wry close to the minimax polynomial.
It is important to use knowledge of the hardware in writing the code. I made my runs on an IBM RISC System/6000 Model 540. RISC machines nominally do all operations in one machine cycle, but in practice complicated operations are pipelined. On this machine, all floating-point additions and multiplications are treated as compound multiply/add operations [7]. An isolated operation takes two cycles, but a sequence of operations produces one result per cycle after a delay of two cycles. Thus, our goal is to produce compound operations that can be pipelined.

Figure 1 shows part of a function that the user invokes to do the Chebychev fit. The input value is r2. The statement EQUIVALENCE (r2, ir2) is needed because the shift function will only work on an integer argument. Only four terms are shown, but the extension to more is obvious.

FIGURE 1   Code to evaluate a four-term Chebychev fit. The input is r2. The tables u, t, and c were calculated in the set-up routine.
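As a concrete illustration of the structure Figure 1 describes, here is a minimal four-term version; the function and variable names, the explicit recurrence, and the halved leading coefficient follow the sketches above and may differ in detail from the routine in the Appendix.

      double precision function r3m1(arg)
*     Approximate arg**(-1.5), arg = r**2 > 0, with a four-term
*     Chebychev fit.  The tables u and t and the coefficients c
*     are assumed to have been filled by a set-up call; a common
*     block is used here only to keep the sketch self-contained.
      double precision arg, u(0:2047), t(0:2047), c(4)
      common /chebtb/ u, t, c
      double precision r2, x, twox, s, t01, t02, t03
      integer ir2(2), it
      equivalence (r2, ir2)
      r2 = arg
*     Range reduction: it indexes the tables by the biased exponent
*     (assumes big-endian word order), and x maps u*r2 from [1,2)
*     onto [-1,1).
      it   = ishft(ir2(1), -20)
      x    = 2.0d0*(u(it)*r2) - 3.0d0
      twox = x + x
*     Chebychev recurrence T_{k+1} = 2x*T_k - T_{k-1}; c(1) is the
*     leading coefficient and enters with weight one half.
      t01 = 1.0d0
      t02 = x
      s   = 0.5d0*c(1) + c(2)*t02
      t03 = twox*t02 - t01
      s   = s + c(3)*t03
      t01 = t02
      t02 = t03
      t03 = twox*t02 - t01
      s   = s + c(4)*t03
*     Undo the scaling: t(it) = u(it)**1.5.
      r3m1 = s*t(it)
      end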
After some set-up code to do the range reduction and shift the arguments into the range of the Chebychev polynomials, all operations but the last are multiply/adds. Unfortunately, these operations are dependent on each other, so we are not making optimal use of the arithmetic pipeline. For example, the multiply/add that updates s cannot be started until the previous t is ready. Also, computation of the next t cannot be started until the previous one is done. However, we can overlap these two calculations, so we expect each order of approximation to take an additional 3 cycles.

Table 2 summarizes the timing and accuracy results. The relative errors are measured using the direct calculation as the correct value. These errors are identical to the bounds computed by summing the absolute values of the dropped coefficients [5]. The times are given in machine cycles per element. In each case I measured the elapsed time with a clock accurate to a few nanoseconds. The times reported are the smallest of 20 runs of 10,000 random inputs. Although there are some anomalies, most of the time it takes 3 cycles to add one more order to the approximation. The anomalies are caused by running out of registers and the performance of the memory when loading the coefficients.
One way to improve the overlap is to do more than one element at a time by unrolling the loop. Table 2 shows that adding one more term to the approximation costs less when the loop is unrolled, about 2 cycles per term.

Is it worth using this approximation? It depends on the accuracy needed. The time stepping scheme will have some truncation error. Clearly, there is no point making the function evaluation more than an order of magnitude more accurate than this value.

A second approach is Newton's method. It is based on finding the roots of some function, in this case

$$f(y) = \frac{1}{y^2} - (u r^2)^3 .$$
The iteration is then

$$y_{n+1} = \tfrac{1}{2}\, y_n \left[ 3 - (u r^2)^3\, y_n^2 \right],$$

where $n$ is the iteration index. Newton's method is quadratically convergent when applied to a convex function such as the one in which we are interested [8]. This means that each iteration doubles the number of correct bits in the result. We only need to get a reasonably accurate first guess. If we use the same range reduction as before, a reasonable first guess would be the function evaluated near the midpoint of the range. In fact, I chose to use a zeroth order Chebychev fit for the first guess.

Figure 2 shows the key part of the function invoked by the user. We see that this code will not use the hardware as effectively as the Chebychev code. Each Newton iteration has a multiplication, a multiply/add, and a final multiplication for a total of 6 cycles. Table 3 shows the convergence and time for rolled and unrolled loops. If the loop is not unrolled, Newton's method takes 6 cycles per iteration as predicted; it takes only about 3 if the loop is unrolled. Single precision accuracy is achieved in about 29 cycles per element and double precision in 34 cycles per element.

Is it worth using Newton's method? Yes it is, unless you need the last few bits correct. Without doing arithmetic in a higher precision, the loss of a few bits of accuracy is inevitable. However, the simplicity of the code and its speed are in its favor.

We can do considerably better by making two changes to the Newton's method code. First of all, note the small improvement in the first few iterations. A better first guess would reduce the number of Newton iterations dramatically. I chose a six-term Chebychev fit that results in single precision accuracy with one Newton iteration and double precision with two.
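A minimal sketch of a routine built around this iteration; the number of iterations and the first guess are passed in here only to keep the sketch short, and the arrangement may differ in detail from Figure 2 and the Appendix.

      double precision function r3newt(arg, niter, guess)
*     Newton refinement of (u*r2)**(-1.5).  With a = (u*r2)**3
*     formed once, each step y <- y*(1.5 - 0.5*a*y*y) costs one
*     multiply, one multiply/add, and one final multiply.  The
*     first guess would come from a low-order Chebychev fit of the
*     scaled function; it is an argument here for brevity.
      double precision arg, guess, u(0:2047), t(0:2047)
      common /newttb/ u, t
      double precision r2, a, halfa, y, w, s
      integer ir2(2), it, n, niter
      equivalence (r2, ir2)
      r2 = arg
*     Same range reduction as before (assumes big-endian words).
      it = ishft(ir2(1), -20)
      a  = u(it)*r2
      a  = a*a*a
      halfa = 0.5d0*a
      y = guess
      do 10 n = 1, niter
         w = y*y
         s = 1.5d0 - halfa*w
         y = y*s
   10 continue
*     Undo the range reduction: t(it) = u(it)**1.5.
      r3newt = y*t(it)
      end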
A code using this approach should run considerably faster.
The complete subroutine, including the set-up code, is shown in the Appendix; the coefficients are shown in Table 5.

CONCLUSIONS
How can I beat the performance of a highly tuned system routine with Fortran code? Simple: I cheat. I cheat in a number of ways. First of all, I evaluate the function directly rather than in pieces. Second, I cheat by not getting the last few bits right. Finally, I cheat by not doing any error checking. (I could take the absolute value of the input at the cost of one additional cycle.) However, the output value is accurate for any floating-point input. Very large input values produce denormalized results; very small input values produce floating point infinity, as they should.
If your machine supports the IEEE double extended format [6], a format with at least 64 bits in the fraction that is usually reserved for data kept in the registers, you can get the last few bits right using a simple trick. Compute the array t in extended precision, but store it as two double precision numbers, t(1,i) and t(2,i). Then the final scaling becomes s*(t(1,i) + t(2,i)). I could not check the accuracy because the RISC System/6000 does not support the double extended format, but the change added only 1 cycle per result to the time of the unrolled loop.
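A sketch of one way to apply the split table, assuming arrays thi and tlo (illustrative names) hold the two double precision parts of the extended precision $u^{3/2}$; the grouping here multiplies each part separately, and on this machine the second statement compiles to a compound multiply/add, which is where the single extra cycle comes from.

*     Apply the high and low parts of the extended precision
*     u**1.5 separately; the second statement is a compound
*     multiply/add on the RISC System/6000.
      result = s*tlo(it)
      result = s*thi(it) + result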
Is it worth the effort and worry to use this new approach? If your calculation is typical and spends 3/4 of its time evaluating the acceleration, speeding up this one line of code by a factor of 3 will cut your total run time in half.

ACKNOWLEDGMENTS

I would like to thank Vivek Sarkar for helping me understand the RS/6000 instruction scheduling, Rad Olson and Bill Swope for trying to convince me I could not beat the system functions (I love a challenge), and the referee for several good ideas.