Parallel Algoritlnns for Molecular Dynamics Simulation of Irradiation Effects in Crystals

We present new parallel algorithms for solving the problem of many body interactions in molecular dynamics (MD). Such algorithms are essential in the simulation of irradiation effects in crystals, where the high energy of the impinging particles dictates computing with large numbers of atoms and for many time cycles. We realized the algorithms using two parallelization methods and compared their performance. Experimental results obtained on a Meiko machine demonstrate that the new algorithms exploit parallelism effectively and can be used to simulate large crystals. © 1995 by John Wiley &

\10 methods integrate numerically the 1\"ewton equations of motion with forceo; derived from the potential.Computing the forces of each atom (and its state) is computationally expensive particularly when the computational crystal contains a large number of atoms.In fact.most of the computation time is spent on the calculation of the forces at every time step.A standard way to reduce the computation time is the neighbor list method.In this method, one compiles for each atom the list of neighbors which are contributing to the force acting on it.This neighbor list has to be updated periodically after a constant time interval !lt which depends on the physical problem.
In this article we introduce new parallel algorithms for calculating and managing the neighbor list; in addition, we provide a new algorithm for computing the forces and the coordinates of an atom in a distributed manner.The first improvement is based on the fact that not all atoms have to be considered when building the neighbor list of an atom.The second improvement relies on the fact that no updates for the neighbor lists are needed at all in areas with low energy atoms.
We describe the programming techniques and present numerical results for two parallelization methods.In the first one the main data structures are partitioned between the processors and the "heavy" procedures of the sequential algorithms are parallelized whenever possible.In such an approach e\•ery processor computes only part of the values but has all the information at each time step.In this scheme every two processors exchange information.In the second scheme, the physical domain is statically divided among the processors and the algorithm is changed accordingly.In this scheme every processor communicates only with some neighboring processors to receive the required information.
In previous work, Li et al. [10] used a Transputer-based ~1eiko machine (no i860 processors) in a ring configuration in order to compute the forces •with a parallelized version of the standard N-body algorithm.
In Section 2 we describe the application problem and the physical model used.The new algorithms are presented in Section 3. In Section -i we discuss the parallel algorithms and the programming technique.W~e provide simulation and performance results of runs in Section 5 and finallv.we draw our conclusions in Section 6.

OVERVIEW OF APPLICATION AND PHYSICAL MODEL
The ~•ID calculation used in this work is derivt'd from the cascade calculations in copper [12], and was adapted for treatment on nuclear-stimulated desorption in palladium [7].The calculations are carried out on a cubic microcrystal, with a variable number of atoms.The atoms are arranged at f.c.c.lattice, and the interaction was choseu to be a many-body potential as used in earlier studies [6,7].
Trajectories of the atoms are determined by the numerical solution of the Newtonian equations of motion.The integration scheme uses the central difference method for computing the positions and velocities of the atoms that are calculated for successive time-steps (this is known as Verlet method [13]*).Let]\' be the number of atoms in the microcrystallite and !1t is a given time-step.
In (3) V!j is the !!radiant operator with respt'ct to coordinate '1• In !5) and (b) ry is tlw distance between atom,; i andj.and r 0 is the nt'art'st nPil!hbor distance: the parameters A. p. q. and~ haYe been determined by experiment.
In the present nwtlwd the partide pott'ntial is computed under the assumption that tlw atom interacts only with its Ilt'il!hboring atoms.Tlw neighbors of an atom are those partides "•hich art' locatt'd within a splwre of a prt'snilwd <'lltoff radius R,. around it.The cutoff R, was <'host'n between the third and fourth order Iwighbors so as to a\oid problems arising from finite temperature simulation in conjunction with short ranl!e interactions [11].
Starting from a gi \•en cq uilibri um confil!uration, one srwcifie atom in a prt'dt'termined site is given an additional kinetic eneri!Y• The resultant cascade is then followed in phase spa<'e for 1.000
Due to the high eneq . . .~• proce,-,;e;, in the irradiation damal!e problem.tht> numlwr of particle;, in the computational crystal is of the order of 1 ()h.In this type of application the distribution of enert-ries across tlw computational cry:ital is not homogeneou;,.This sugl!ests the use of a nPil!hiJOr list method (for computing the forces) which is le:-;s global in nature compared to other methods.

Existing Neighbor List Methods
The idea of the neighbor list method can be summarized as follow;; [1:3): tlw neil!hhor list of ewry particle includt>;; all tlw atoms that art> inside the potential radius R ..
In the \"erlet method the distance to all the other atom;, i,-calculated for each atom.Those atoms within the splwre of the radius R, are in-cludt>d in the appropriate neil!h!Jor li,.;t.In this method dw time for t•omputinl!tht• IWif.diiJOr list.
In the link t't>ll method [9:.dw crystal is divided into a number of cells (sub-cells).whose size is determined by tiH' pott>ntial cutoff (R,.;.In thi,; method, checking for nei~hbors of a given atom i,; limited to atonl:" within the ,-amP cell or in adjacent cells.Cnmputinl! the list in this cast' i=-faster thhn in the ••Verlet-list'' method anrl is ~iven by: where Cis the number of neighbor cells multiplied by n-the number of atoms per sub-cell.
In the next section we de;,cribe two modifications of the sequential .\10al~orithm.

The Relational Algorithm
The relational method that we propose can be used to improve both the Verlet and the link cell methods.It is based on a recursive algorithm that uses the neighbor list at time t -At to create the neighbor list at time t.The idea is to reduce the number of atoms considered during the creation of the neighbor lists.For every atom we check the neighbors in a sphere given by the potential cutoff R,. and the neighbors of these atoms.This method is based on the physical assumption that new neighbors cannot enter into the potential radius of the checked atom during a constant time-step without first becoming neighbors of some of that atom• s existing neighbors.For the initial time-step we use the Yerlet method to calculate the primary neighbor list.The computation time of the neighbor list is given by: where ;\j", J\'."' are the numbers of first and second order neighbors.respectively.For an f.c.c lattice, v.ith the commonlv used interaction cutoff radius.one has ;\j-, = 50 'and X.m = 2: 1 • :\[,.One advantage of the new scheme for calculating the neighbor list is its potential suitability for parallelization.

The Refreshment Algorithm
The inhomogeneous distribution of atom energies is normally such that the atoms with high energy are mainly concentrated in small regions of the crystal.On the other hand the areas of low energy atoms are not changing dramatically.Thus.there is no need to update the neighbor lists for atoms v.ith low ener!!y in the same frequency as for the neighbor lists of high energy atoms.In the standard methods the update of the neighbor lists is done after a constant number of time-steps.This is a compromise between the need to calculate the neighbor lists every time-step and the computation cost.Our conclusion is that we need a partial update of the neighbor lists.The criterion for updating the neighbor list of an atom may vary from one problem to the other.
For the high energy damage calculation we find that several criteria can be used."•e differentiate between two approaches.
In the first approach one checks the high energy atoms of the crystal.In this case we only update the neighbor list every time-step for atoms with energy higher than a given energy E,-rit.E,•rit depends on the time-step size (which is a constant of the problem) and the maximum kinetic energy at the crystal.In order to take into account low energy atoms, and the overall thermal motion in the system, the full neighbor list has to be updated periodically after an appropriate number of timesteps.The second approach is based on the displacement of the atom since the most recent updating of its list of neighbors: if the atom moves more than LlX in that period, its neighbor list is updated.LlX is a constant distance, which is determined as a function of the time-step size and the maximum kinetic energy of the atoms in a similar way as for the first approach.
The time needed for updating the neighbor list essentially depends only on the number of high energy atoms and the number of their ueighLur,.The partial updating of the neighbor list calculation should reduce enormouslv the CPC time.especially when simulating large systems [ 6].

PARALLELIZING THE ALGORITHMS
The following parallelizations of the algorithms assume many interacting processors.Each processor is an independent computational unit simulating part of the crystal.This corresponds to the ML\10 model with shared or distributed memon,.
To generate the parallel algorithms it is necessary to overcome three problems.The first involves the distribution of the panicles among the different processors in a way that minimizes the cost of data transfer (or the sharing of data) during the computation.The second problem is the modification of the algorithm for the parallel machine.Finally, there is a need to presen•e the load balance between the computing processors.For our specific problem, wherever an equal number of atoms is assigned to different processors the computation loads are the same.Since during the ~im ulation (move between processors) process onl~, a few atoms change their locations the computation load is kept in balance in these cases.

Domain Decomposition
The domain decomposition include~ both a ~un ple "particle parallelism" [ 1 OJ distribution and a special "geometric parallelism'' [10] partitioning for the algorithms presented above.
In the particle parallelism method atoms are assigned to processors in an arbitrary way.For this case we define the neighborhood of a computing processor to be those atoms within a distance Rr from its own atoms and which were assigned to other processors.,,,e refer to those procc;-;sors as the neighboring processors.
"'e view the computation cells as arranged in a torus (V and 1';+ 1 modP are neighbors)."'e define the neighborhood of a computation cell to Le the atoms in its neighboring cells which are within distance Rr from the common boundaries (see the overlapping boundaries in Fig. 1 ).:\"ote that the number of cells needs to Le small for ,.;mall crystals: this number depends on R, .Thus. one can employ only a small number of processors for ;-;imulating small crystals using this method.

Parallel Implementation of the Algorithms
In this section we modify the two algorithms-relational and refreshment-to run with the partidP and geometric distribution methods.""e tbe the following notation: 2. SF.\ID is dw sPquential refreshnwnt all!oritlun.

189
where a distinct processor collects the local information (from all thP processors).computes the global characteristics, and then sends the information to all the other processors.In our parallel program thi,.; job is done by the control program ('•master'" processor).

EXPERIMENTAL RESULTS ON THE MEIKO PARALLEL COMPUTER
In thi,; section we report timinl!result,; of a few experiment,.;with our all!orithm,;runnintr on a .\leikomachine.
From the experimenh WP l1,arn that when the number of proces,;ors i,.; 8. thP temporal perfor-GLIK:\1:\:\' ET AL. mance is maximized.This can he t>xplained l1y the communication times as seen in Figure 3. \Vhen there are more than 8 proces,.;ors the total execution time grows in ,.;pite of the fact that each processor has a smaller crystal.
In addition, we conducted sen•ral experiments with more than 5.600 atoms (up to 22.-tOOi and four processors.This was the largest c-ry!'ital that can be computed due to the lack of mPmnry on our units.
~•e can see that the total tilllt' ;!aint>d in CF.\lD decreases wlwn the number of atoms increast's but CF\ID is always fastt'r than GR.\ID.This result is explained hy the fact that for more atoms there is a net'd for more communication.F a,.;ter communication channels can improYe this pllt'nomenon.Clearly, the geometric JHlrallt"lization approach is better for large cry,..,tak   1 and 2 _ The re:-;ults an• measured in sPconds per time-swp.Tlw partidt• paralldization method impro\P:i on the sequential alwJrithm by a factor of two.This was achieYed in spit<:> of the slow communication chanrwls that consume about :3;) 0 /.1 to -±O'Yo of the total <:>xecution time.The !!eometric parallelization m<:>tlwd wa,.; ,.;upposed to reducP the communication load.Thi,.; did not happen duP to the u,.;e of the host collljllilPr to run the mastPr proce~s.l"eYertheless.this parallelization method improve,; on tlw particle parallelization by a factor of two.

PARALLEL .\LGORITII\IS 191
ThP refreshment algorithm improv<:>s on the relational in both the parallel and the sequential cases.The refreshment altroritlun removes the need to compute tlw n<:>ighbor list in every few time-step~.In the parallel case this improYement almo~t eliminates th~ overhead generated by tlw neil!hbor list procedure.

DISCUSSION AND CONCLUSIONS
"~e developed n<:>w algorithms for simulating high energy irradiation damage u~ing .\10calculation:;.The altrorithms were parallelized for a messal!e passinl!-ba,o;ed archit~ctur<-' and were t<:>~t<-'d on dw Meiko machine.From our experiment,; WP conclude that in e\•ery case the parallelization decreases the execution time in spite of the slow communication channels.
The modification~ of the molecular dn1amic algorithm focused on the computation of the neil!hbor list.Since this component has a major rol<:> in the computation proce~,; the improvPnwnts were significant.
The success of the geometric parallelization method compared to the particle method can he explained by the reduction in the size of the local subcry,.;_taland that of the communication time.
For future work we plan to graft the principle,; and ideas applied in the refreshnwnt algorithm,; into th<:> parallel computation of the interactinl!forces.
the relation bet ween tlw communication and the calculation times is characterized by the performance o..-erhead: f,!P) T,., 111 !P)I T, 0 t!P) where Tmt, T,., 111 are the total calculation time and communication time.respeeti..-ely.spent by tlw slowest processor.The performance overhead is compared in Figure 4 for configurations of 2.

FIGURE 5
FIGURE 5  TPmporal performance of the CH\ID and GF:\10 algorithms for three cry,.;tals.