A Parallel Implementation of Particle Tracking with Space Charge Effects on an Intel iPSC / 860

Particle-tracking simulation is one of the scientific applications that is well suited to parallel computations. At the Superconducting SuperCollider, it has been theoretically and empirically demonstrated that particle tracking on a designed lattice can achieve very high parallel efficiency on a MIMD Intel iPSC/860 machine. The key to such success is the realization that the particles can be tracked independently without considering their interaction. The perfectly parallel nature of particle tracking is broken if the interaction effects between particles are included. The space charge introduces an electromagnetic force that will affect the motion of tracked particles in three-dimensional (3-D) space. For accurate modeling of the beam dynamics with space charge effects, one needs to solve 3-D Maxwell field equations, usually by a particle-in-cell (PIC) algorithm. This will require each particle to communicate with its neighbor grids to compute the momentum changes at each time step. It is expected that the 3-D PIC method will degrade parallel efficiency of particle-tracking implementation on any parallel computer. In this paper, we describe an efficient scheme for implementing particle tracking with space charge effects on an INTEL iPSC/860 machine. Experimental results show that a parallel efficiency of 75% can be obtained. © 1994 by John Wiley & Sons, Inc.


INTRODUCTION
The superconducting super collider (SSC) under design and construction near \Vaxahachie.Texas, will be the largest and most powerful scientific in-eHAKG ET AL. sse (54 Miles) FIGURE 1 Schematic layout of the SSe complex.The 54-mile sse will be the largest and most powerful particle accelerator ever built.pensive (expected to exceed 58 billion), there is a demand to reduce the construction cost by design refinement.To reach such a goal, computer simulation of particle motion in each accelerator is essential.For a simulation to be useful, many particles must be tracked in a design lattice* for hundreds of thousands of turns [ 1].If hundreds of thousands of particles are tracked in an sse lattice with 14,000 magnet elements for 10 5 turns in a computer, then more than a thousand Tflops (10 12 floating-point operations) will be performed.This requirement makes computer simulation difficult without the use of high-performance, scalable, parallel computers.Parallel programming for a large scientific application requires one to understand the characteristics of the problem and to redesign the program to take advantage of hardware features.At the SSC, thin-element accelerator program for optics and tracking (TEA-POT) [2] particle-tracking code has been implemented in a 64-node iPSC/860 machine.In the absence of space charge, it achieves about 98.3% parallel efficiency, because each node can track different groups of particles independently and * A lattice is a detailed description of how an array of elements such as dipole (bending), quadrupole (focus/defocus), sextupole, and rf cavities are arranged to form an accelerator ring at sse.only very little communication is required between nodes [3].
The perfectly parallel nature of particle tracking is due to the fact that the interaction effects between particles can be ignored when the beam energy is high.ln practice, when the beam energy is low, the space charge effects dominate the dynamics as in the LEB and the MEB (at transition) at the SSC.Readers who are interested in the space charge effects are referred to :Yiachida et al. [4'.From the computational point of view.the computational model and characteristics of space charge should be of interest to computational scientists.To model the beam dvnamics at low energy correctly, the momentum changes of each particle caused by the space charge force must be calculated.This requires one to solve three-dimensional (3-D) :Yiaxwell field equations by a particle-in-cell (PIC)* fashion, an approach widely used in the simulation of plasma physics.A characteristic of the PIC method is that each particle needs to allocate charge to its nearest neighboring 3-D grids and to retrieve field information from its nearest grids by an interpolating method at each time step.Because particles will change their relative spatial position during tracking, communication between particles and their surrounding grids will be costly for parallel implementation.An efficient and reliable scheme is needed to reduce the communication cost of the space charge calculation and to keep the load balance of particle tracking and space charge computations among nodes as even as possible.
In this paper, we will show a reliable and efficient scheme to implement parallel particle tracking under the influence of space charge on an IN-TEL iPSC/860 MI~D computer.
The rest of the paper is organized as follows: Section 2 describes the computational model and algorithm of particle tracking under the influence of space charge.Section 3 addresses our parallel implementation techniques.Section 4 presents empirical results on the iPSC/860 and CRAY.Section 5 concludes the paper and provides future research direction.

SIMULATION MODEL
Particle simulation based on a single-particle model is perfectly matched for scalable parallel * The computation of the electromagnetic field is dependent on the charge density to which all particles must contribute in a nonlocal manner.architectures.An example is the ZTRACK code by Y an [ 1].However, it cannot simulate the operation of the LEB and the :\1EB because their dynamics are dominated by space charge.To integrate space charge into an existing tracking code such as TEAPOT, it is necessary to calculate charge densities and fields consistent with numerical stability requirements.
For computer simulation, we will use an explicit, fixed-time advance.The motion of particles is tracked for k(t) magnetic elements followed by a kick (space charge force), where k (t) is the number of elements tracked by a particle within the time t and t + At, and may be a fraction less than 1.At each element, the motion of particles is calculated in 6-D phase space.At each kick, we compute a 3-D space charge force on each particle and update its position and momentum.The choice of ll.t is determined by numerical stability and strongly affects the speed of the calculation.Figure 2 shows the computational flow of the fixed-timestep sampling method.
In the following, we briefly discuss the models of space charge calculation.Our intention is to provide the minimum accelerator background needed for understanding the parallelization methods used for particle tracking with a space charge code.The tracking algorithm used is based on a 4-D symplectic procedure [21 and later extended to 6-D bv .\Iachidaet al. [5].

3-D PIC Formulas for Space Charge
To calculate a 3-D space charge field.we employ the PIC method.At each time step, a 3-D cylindrical grid is first constructed, then the electromagnetic field for particles is computed using the discretized version of 3-D .\Iaxwellequations.
We assume that the 3-D cylindrical grid has coordinates (r, e. z), and the beam pipe has a circular cross section with radius b, the perfect conductivity a-= CIO,~and the scalar potential cf>""' 0 at r = b.Let n and J be the cha}ge density and the current, respectively.and let A be the vector potential.
We further simplify the current density, where Vz is the average velocity of the beam.This is commonly referred to as the ultrarelativistic approximation, and it means that particles translate I!S a rigid body.Frpm the Lorentz force equation , the electromagnetic force is F = -q/y 2 Vcf>, and we need only solve the scalar potential Eq. ( 1).The charge density and scalar potential are Fourier transformed in 8: n(r, 0, z, t) = 2: nm(r, z, t) exp(im6), (5)   m with the inverse transforms nm(r, z, t) = 2 7T Jo n(r, 0, z, t) exp(-imO)dO, (6) and the same for cp.
For each m (referred to as mode number), Eq.
( 1) assumes the form The general solution for the equation for m 2::

Algorithm
The space charge algorithm proceeds as follows: 1. Construct the bounding cylinder of particles.The cylinder is then decomposed into 3-D grids [see Fig. 3(a)].Each grid has index (r, (), z), which corresponds to cylindrical coordinates (rdr, ()d(), zdz).2. For each particle, we allocate charge to each grid nearest the particle by the trilinear interpolation method based on the relevant volume ratio [see Fig. 3(b)j.For example, the grid point at index h, 01, z 1 ) has volume , where (r, (), z) are the cylindrical coordinates of the particle and (r 2 , () 2 , z 2 ) is the index of the opposite grid point of (r 1 , () 1 , z1).
5. Compute the momentum changes for each particle from its surrounding fields and update its coordinates and other attributes.

PARALLEL IMPLEMENTATION
The key to a parallel implementation of a computational model into a ~IMD hypercube parallel computer is to distribute the computation and data into the separate nodes such that each node has an equal share of computations, while communication between nodes is minimized.Although the principle is simple, the practice is more complicated, and a given implementation must take advantage of the available hardware features and take care of subtle issues with each parallel scheme-I/O and memory problems, numerical stability, and hardware failures-to achieve high performance for a large scientific application.
To take advantage of scalable parallel computers, it is necessary to understand the characteristics of the problem so that the program can be implemented correctly and efficiently (with performance scalable to the number of computing nodes available).The characteristics of particle tracking with space charge effects are summarized below.

Characteristics
• Particles are tracked in 6-D phase space, and they can be tracked independently in BASE3D( ).• Particles are lost when they collide with the wall of the machine.• The 3-D PIC method requires a large number of field quantities defined on a 3-D mesh.
The memory requirements of a complete problem exceed the limit of 8 Mbytes of memory of a single node, so domain decomposition is required.• Communication between a particle and its eight nearest grid points is required to allocate charge and to interpolate the fields.• A large data set must be produced for the visualization of tracking results and for restart capability (primarily to permit recovery from hardware failure).

Implementation Schemes and Techniques
For reasons that will be explained in the next section, the main task of parallel implementation is to ensure that each computing node has the same number of particles in order to achieve load balancing.This is very important, because tracking the noninteracting particles in the magnetic lattice occupies most of the computational time.The second primary task is to decompose the 3-D grids in KICK3D( ).The constraints are (1) the space charge code should fit into an 8-.VIbyte node: (2) each node should have the same amount of workload in computing the loops over particles (for allocating charge to grids and retrieving field information from grids) as well as the loops over 3-D grids (for computing electromagnetic force): and (3) communication among nodes should be minimized.
Below we consider partition schemes for particles in the tracking phase and 3-D grids in space charge.

J Partition Scheme for Particle Tracking
So that each computing node has the same workload, particles are assigned equally into computing nodes by the block or cyclic method.*Because particles are frequently lost during tracking when they run into a wall, a load imbalance situation will develop.That is, some nodes might have many more particles to track than the others.The cyclic method is usually a better approach to deal with a load-imbalance situation.However, such an approach is not adequate when space charge is introduced.Figure 4(b) demonstrates a case using a cyclic approach, which will produce busy communications among all nodes because it violates the data locality principle.In practice, we found that a block decomposition is a proper way to deal * Assume that the index set of all particles is [0.L .15] and the index set of all nodes is [0, L 2, 3:.In the block method.the particle index set [ "1, 5. 6, :: is assigned to node 1. whereas in the cyclic method, the particle index set [L 5, 9, 13 j is assigned to node L with our problem as long as particles are not lost dramatically.

3-D Grid Decomposition Schemes lor Space Charge
There are several ways to map the 3-D grids into computing nodes.It depends on how the hypercube is connected as a ring, a 2-D mesh, or a 3-D hypercube.For programming simplicity.we will use block mapping from 3-D grids into a 1-D array of computing nodes.The cylinder is partitioned in the longitudinal direction [see Fig. 3 (c)].
The partition method can be based on the equal number of grids (equal-grid) strategy or equal number of particles (equal-particle) in a subcylinder strategy.
In the equal-grid approach, each node gets a nearly equal size of subcylinder (or the same number of mesh points).If particle distribution is uniform in the longitudinal direction, then each subcylinder will contain the same amount of particles.Therefore, there is little or no communication between nodes.Figure 4(a) shows the best-case situation, in which all the particles and their grid neighbors belong to the same node; therefore, little or no communication is needed.In practice, the distribution of particles tends to be nonuniform during simulation.Figure 4(c) shows a nonuniform particle distribution case in which not only is communication necessary, but some nodes have to update grid information for the other nodes as well.As a result, load balancing among nodes is uneven.
In the equal-particle approach, the cylinder is partitioned into subcylinders, each of which contains a nearly equal number of particles.When the particle distribution is nonuniform, each node will have an unequal subcylinder [see Fig. 4(d)].This strategy gains a performance advantage by keeping communication minimal at the expense of uneven grids in each node's domain.To achieve high parallel efficiency, an effective mechanism is necessary to maintain the "equal-particle" struc-tures and to minimize the load-imbalance effect of uneven grids.
Both approaches have their advantages and drawbacks (see Table 1 for a comparison).In general, there is a trade-off between speed and memory.Because a memory upgrade for iPSC/860 is relatively expensive, it would be very desirable to combine both approaches to compromise parallel efficiency and memory to achieve high parallel efficiencv within the limit of node memorv.

First Try
Because we want to keep the code size as small as possible to fit into 8 .\1bytes of available memory, we have chosen the simplest strategy to implement particle tracking with space charge code.That is, particles are partitioned using the block method, and 3-D grids are decomposed using block mapping with the equal-grid approach in the z direction.This approach can be implemented more quickly than alternative methods.\Ve made no assumption about spatial relationships between particles and their surrounding grids.Particles can move anywhere (e.g., across several donwins [ subgrids]) between calculations.This approach is very general and could be implemented with moderate effort should a parallel compiler, which can effectively solve irregular communication within a parallel loop, become available in the future.

Communication Patterns and Programming Techniques
In the following, we discuss briefly the techniques used to solve irregular communications in space charge.We consider the case where a particle needs to allocate charge to its eight nearest neighbor grid points (referred to as allocating process).The inverse process of interpolating field inform ation to the particle location (referred to as interpolating process) can be treated similarly.For clarity, only one grid point with index (0, 0, 0) is shown in the following sequential code.Note that • e " The current parallel programming tool availabl.e to us is the Mimdizer (Pacific Sierra, Inc.), whiCh has an automatic decomposition tool at the loop level.However, the performance we obtained with this tool has not been acceptable.Another tool reported by Hiranandani et al. [ 6] as being able to transform this kind of code into explicit message-passing routines without much programming effort is Fortran D by Rice University.However, Fortran Dis still under development and was not available to us when we developed the code.The communication strategy, therefore, had to be developed by hand.
The strat~gy that we use is similar to the approach proposed by Saltz et al. [7].The idea is based on block l/0 transfer to minimize the communication between nodes.That is.all information that a node needs to communicate with other nodes is accumulated into a buffer.A global communication table that describes how a pair of nodes should communicate with one another is computed first.Each node then sends out selfdescriptive information to the other nodes.The information received by a node includes the position and fractional density for each grid that should be updated by this node.An advantage of this approach is that the global communication table needs to be computed only once in KICK3D( ) at each time step.Therefore, it results in a reduction of communication time that would be very difficult to achieve even using future automatic parallel compilers, because such an automatic parallelizer will not be able to plan ahead and collect operations as a human programmer can.This strategy provides a reliable and effective communication mechanism for Fortran implementation.

Performance Tune-Up
A particle can move from one grid to another grid between space charge calculations and is in fact unlikely to keep the perfect spatial position seen in Figure 4(a) and (d) all the.time.It is probable that a situation like that in Figure 4(c) will happen during a long nm.One way to keep the particles and their associated grid points in the same memory is to sort the particles in the z direction and to re~ap into computing nodes.The best sorting algorithm requires order of (n log n)lp operations [O(n)/p if bucket sort is used:, where p is the number of nodes.•when n is large, the overhead will be exceptionally high.Because particles change their relative position and surrounding grids in the longitudinal direction gradually, it is necessary to sort these particles only occasionally (about e~erv 50 turns).
• Another approach is to have subcylinder guards for each node.Here, sorting is red~ced to subcylinder guard communication between two neighbor nodes.This approach usually assumes that particles can move only from a sub~vlinder to the next neighbor subcyiinder betwe~n space charge calculations.Such a constraint is imposed by numerical stability considerations for anv explicit time advance algorithm.Therefore, ~om munication is performed only between neighboring nodes.A combination of the above tune-up strategies with our current scheme makes it possible to provide better performance than either of the above approaches with only a little extra memory expense.

Code Development
The particle tracking with space charge code was first written for the CRA Y-YMP by Machida et al. [5].The CRAY code that can handle 10,000 particles in an LEB lattice utilized about 19 Mbvtes of memory, which includes 11 Mbytes space ~barge (KICK3D) code.This code was analyzed with FORGE (Pacific Sierra, Inc. [8]) to identify the most computationally intensive portion.The most time-consuming routine is the particle-tracking code (BASE3D) subroutine.The code is very complex and not well suited for pipelining.It uses about 61% total time when the time step llt = 10 ns.The next most time-consuming routine is the KICK3D code, which takes about 25%.The code is tuned using the optimization option of Fortran compiler only, because the code is very complex and not well suited to pipelining.Parallel implementation was based on the strategies addressed in Section 3.2.3.

Hardware Platform
The SSC iPSC/860 has 64 computing nodes, 62 of which have 8 Mbytes of memory and 2 of which have 32 Mbytes.The MIMD architecture allows one to run different programs on different nodes simultaneously, although the programming paradigm at this machine tends to favor the single program multiple data (SPMD) style.For particle tracking with space charge we combine both paradigms, wherein the master node (node 0) runs a different program from the other nodes (workers).Because the worker nodes are utilized for tracking and kicking, they can execute in 8-Mbyte memory nodes.The master node, which has 32 Mbytes, is also utilized to deal with 1/0 and to input data.

EMPIRICAL RESULTS AND DISCUSSION
We have tested the program on our iPSC/860 machine, running 10,000 particles for 500 turns in an LEB lattice that contains 693 elements.For our applications, a cylindrical grid size of 40 X 20 X 32 is appropriate.This suggests the maximum number of nodes used in this study is 32.Additional nodes will provide redundant computation using our domain decomposition strategy.As mentioned previously, the fields are Fourier decomposed in the azimuthal direction, with a maximum mode number of 2 utilized in this case.Using 32 nodes, the simulation took about 26.3 hours to finish.The results of the parallel implementation were checked against the CRA Y version by starting a run on the CRA Y and tracking it for six turns.The two codes were exercised in tandem from this point and tracked for 500 turns.Differences in the random number generator required this type of start-up procedure.Figure 5   shows the emittance growth in the longitudinal S, X, andY directions, respectively.Readers can see that the emittance evolution in the S direction is identical, but it differs slightly in both the X and Y direction after about 0.2 msec (100 turns).However, they almost converge at 1. 2 msec (500 turns) in both the X and Y direction.The differences in numerical results between the two supercomputers are small and are probably due to differences in word length.Although our goal in using parallel computers is to reduce the computation time in tracking study, readers are often interested in the scalabilitY issues such as whether the performance of i~ple mentation is scalable to the number of processors.From what we learned in using massively parallel computers, such issues can be observed in the following ways.First. in the absence of space charge force, our problems have a natural granularity •  that makes it "embarrassingly parallel," one simply distributes the particles over available nodes and track.The number of nodes should not exceed the number of particles tracked and the overall performance of the calculation is dominated by single node performance.In this context, obviously, the scalability is limited by the number of particles studied.Second, in the space charge case, there is also a natural granularity of the grid size (32 in our test case) that limits its scalability to the number of processors as explained previously.It is also obvious that the space charge is communication code intensive, which will be the bottleneck eventually as the number of processors increases.To show the performance of our parallel implementation, a speed-up performance graph and a parallel efficiency figure are shown in Figure 6a and 6b, respectively.The speed-up comparison is based on the performance of the original code (nonparallelized version) running on a big (32 Mbytes) node versus 8-node, 16-node, and 32-node parallel version.Both figures represent the performance of the overall loop, the performance of the BASE3D routine for particle track-ing, and the performance of the KICK3D routine for space charge at a time step (Fig. 2).The parallel efficiency of the tracking code is 80-92%, whereas the parallel efficiency of the space charge code is about 75-88%.The overall loop performance is slightly lower than the performance of tracking and of space charge code because we need to check the particle loss situation and collect emittance information at each time step.The above facts indicate that our parallel algorithm does not provide the optimal solution, but it does a fairly reasonable job.Using 32 nodes, we are able to obtain a speed-up in overall performance by a factor of 22.A more significant fact is that the 32-node performance makes space charge simulation feasible, which otherwise would be impossible using SUN-SPARC II, and eliminates a month of computations.
The use of advanced visualization techniques as an aid to understanding coherent wave motions in plasma simulation is well accepted.Part of the parallel space charge simulation effort is to develop high bandwidth visualization techniques capable of displaying simulation results from the hypercube.To this end, we have integrated a Silicon Graphics Crimson/VGX to the hypercube and have begun the software development task.
Figure 7 shows the motion of 6000 particles in the first six consecutive time steps.The transverse dimension is scaled by a factor of 1 00 relative to the longitudinal direction.Eight different particles are shown in the figure; particles with the same number are assigned into the same nodes.An optimal algorithm must maintain the same identification number of particles in the same contiguous slices of cylinders to minimize communication among nodes and to maintain workload balance.

CONCLUSION AND FUTURE RESEARCH DIRECTION
We have successfully implemented particle tracking with space charge effects using the 3-D PIC method with an explicit time advance on our iPSC/860 parallel computer.The numerical results are compared with CRA Y.We show that the new version of iPSC/860 code does the right physics and is very effective and scalable for our applications.The current implementation is very effective and can be implemented quickly to suit our operational needs.
For future research directions, we are investigating the use of 3-D visualization techniques in
FIGURE 3 3-D grid decomposition.(a) 3-D grids of bounding cylinder of particles; (b) communication between a particle and its neighbor grids; (c) grid decomposition in longitudinal direction. {a

FIGURE 5
FIGURE 5 Emittance growth (a) in the S direction: (b) in the X direction: (c) in theY direction.

FIGURE 7
FIGURE 7The motion of 6000 particles at the first six time frames.

Table 1 . Comparison of Equal-Grid and Equal-Particle Partition
Note: Nonuniform refers to the distribution of particles.