Domain Decomposition Method for Molecular Dynamics Simulation on a Scalable Multiprocessor

A molecular dynamics algorithm for performing large-scale simulations using the Parallel C Preprocessor (PCP) programming paradigm on the BBN TC2000, a massively parallel computer, is discussed. The algorithm uses a linked-cell data structure to obtain the near neighbors of each atom as time evolves. Each processor is assigned to a geometric domain containing many subcells and the storage for that domain is private to the processor. Within this scheme, the interdomain (i.e., interprocessor) communication is minimized. © 1993 by John Wiley & Sons, Inc.

In general, a three-dimensional system of N atoms of mass m in a simulation cell of sides Lx, Ly, and Lz is the basic building block of an MD system. We employ an embedded-atom method (EAM) [8] to express the interaction between the atoms in a simple metal. The total potential energy is written as:

E_{tot} = \frac{1}{2} \sum_i \sum_{j \ne i} \phi(r_{ij}) + \sum_i F(\rho_i)    (1)

with

\rho_i = \sum_{j \ne i} f(r_{ij})    (2)

The first term is the usual two-body interaction energy and the second term (F(\rho_i)) is the energy required to embed the atoms into the local electronic charge density (\rho_i), which is written as a superposition of charge densities due to neighboring atoms (f(r_{ij})). The Newtonian equations of motion for the EAM are

m \ddot{\mathbf{r}}_k = -\nabla_k E_{tot} = -\sum_{j \ne k} \left[ \phi'(r_{kj}) + \left( F'(\rho_k) + F'(\rho_j) \right) f'(r_{kj}) \right] \hat{\mathbf{r}}_{kj}    (3)

These equations are inherently nonlocal: they depend on the embedding densities both at atom-k (\rho_k) and at atom-j (\rho_j). They must be solved in a two-step manner: the embedding density at all atomic sites is evaluated first, then the forces acting on each atom may be calculated.
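For illustration, the following sketch shows the two-pass structure of an EAM force evaluation over all pairs; the function names (phi_prime, F_prime, f_prime, f_dens) and the flat arrays are assumptions made for this sketch, not the routines of our production code, which restricts the pair loops to neighboring subcells.

#include <math.h>

/* Sketch of a two-pass EAM force evaluation (cf. Eqs. (1)-(3)).
 * Pass 1 accumulates the embedding density rho_i at every atomic site;
 * pass 2 uses rho at both atoms of each pair to evaluate the force.
 * phi_prime, F_prime, f_prime, and f_dens stand for the derivatives of
 * the pair potential and embedding function and for the density
 * function; they are placeholders, not the actual EAM routines. */
void eam_forces(int n, const double (*r)[3], double (*force)[3], double *rho,
                double (*phi_prime)(double), double (*F_prime)(double),
                double (*f_prime)(double), double (*f_dens)(double))
{
    /* Pass 1: embedding density at every atomic site. */
    for (int i = 0; i < n; i++) {
        rho[i] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = r[i][0] - r[j][0];
            double dy = r[i][1] - r[j][1];
            double dz = r[i][2] - r[j][2];
            rho[i] += f_dens(sqrt(dx*dx + dy*dy + dz*dz));
        }
    }

    /* Pass 2: forces, which need the embedding density at both atoms. */
    for (int i = 0; i < n; i++) {
        force[i][0] = force[i][1] = force[i][2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = r[i][0] - r[j][0];
            double dy = r[i][1] - r[j][1];
            double dz = r[i][2] - r[j][2];
            double rij = sqrt(dx*dx + dy*dy + dz*dz);
            double dEdr = phi_prime(rij)
                        + (F_prime(rho[i]) + F_prime(rho[j])) * f_prime(rij);
            force[i][0] -= dEdr * dx / rij;
            force[i][1] -= dEdr * dy / rij;
            force[i][2] -= dEdr * dz / rij;
        }
    }
}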
Domain decomposition is a natural strategy for the parallel implementation of problems with localized data. The local nature of the short-range potential provides a good opportunity to apply the domain decomposition scheme to our MD simulation on a scalable multiprocessor machine. In general, the physical MD cell is divided into geometrically separate subdomains, each containing many subcells. Each processor is assigned to a subdomain and is responsible for updating all atoms contained therein. We employ the shared memory on the TC2000 as a "hub" for data communication between subdomains, although the same communication/decomposition algorithm may also be implemented using the message-passing programming model on distributed memory machines [11-13].
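As an illustration of how a processor may be assigned a geometric subdomain of subcells, the sketch below divides a uniform nx x ny x nz grid of subcells among px x py x pz processors; the even-division assumption and all names are illustrative and not taken from our code.

/* Sketch: map a processor rank to a contiguous block of subcells.
 * Assumes nx, ny, nz are divisible by px, py, pz respectively. */
typedef struct {
    int ix0, ix1;   /* subcell index range owned in x: [ix0, ix1) */
    int iy0, iy1;   /* ... in y */
    int iz0, iz1;   /* ... in z */
} Subdomain;

Subdomain assign_subdomain(int rank, int px, int py, int pz,
                           int nx, int ny, int nz)
{
    /* Decompose the processor rank into a 3-D processor coordinate. */
    int cx = rank % px;
    int cy = (rank / px) % py;
    int cz = rank / (px * py);

    Subdomain d;
    d.ix0 = cx * (nx / px);  d.ix1 = (cx + 1) * (nx / px);
    d.iy0 = cy * (ny / py);  d.iy1 = (cy + 1) * (ny / py);
    d.iz0 = cz * (nz / pz);  d.iz1 = (cz + 1) * (nz / pz);
    return d;
}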
The outline of this paper is as follows. In Section 2 we describe the basic data structures in our MD code and the programming model used on the BBN TC2000. In Section 3 we discuss our implementation of the linked-cell list method and the domain decomposition scheme on the BBN TC2000. Performance benchmark results are presented and discussed in Section 4.

THE MD PROGRAM AND THE PARALLEL C PREPROCESSOR (PCP) PROGRAMMING MODEL
Our MD program is designed to study various types of tribological systems (e.g., friction and wear). Details concerning the applications of MD simulations to these problems can be found in the papers by Belak and Stowers [14, 15]. Figure 1 illustrates the schematic geometry of a typical system. For the simulation of the orthogonal metal cutting process, the MD simulation cell is a fixed window in the reference frame of the tool.

FIGURE 1 The geometry of our steady-state variable-particle molecular dynamics model of orthogonal metal cutting. The calculation is performed within the reference frame of the tool. The thermostat atoms are maintained at room temperature and the boundary atoms are used to impose the cutting speed; they propagate to the right at the cutting speed.
The boundary atoms are used to impose the cutting speed: they move to the right at the cutting speed. In order to produce a steady-state flow, new atoms are continuously inserted from the left, while atoms that leave the top or the right of the cell are discarded: the system is open. Next to the boundary, we place a thermostat region. A time-dependent viscous damping (ζ) is added to the equation of motion for the atoms in this region [16, 17]:
m \ddot{\mathbf{r}}_i = \mathbf{F}_i - \zeta m \dot{\mathbf{r}}_i    (5)

with

\dot{\zeta} = \frac{1}{\tau} \left( \frac{T_{kinetic}}{T_{desired}} - 1 \right)    (6)

where τ is the relaxation time for ζ, T_kinetic is the kinetic temperature of the system, and T_desired is the desired (room) temperature of the thermostat region. The purpose of the thermostat is to remove the heat generated by doing work at the tool tip.
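For illustration, the fragment below sketches one time-step of the damped equation of motion; the simple explicit update (our code uses a central-difference scheme), the variable names, and the explicit form of the ζ relaxation are assumptions made for this sketch, not the actual update routine.

/* Sketch of one step of the damped equation of motion
 * m a = F - zeta * m * v  (cf. Eq. (5)) for the thermostat atoms.
 * The explicit Euler update and the relaxation of zeta toward the
 * desired temperature are illustrative assumptions. */
void damped_update(int n, double m, double dt,
                   double (*x)[3], double (*v)[3], const double (*f)[3],
                   double *zeta, double t_kinetic, double t_desired,
                   double tau)
{
    for (int i = 0; i < n; i++)
        for (int d = 0; d < 3; d++) {
            v[i][d] += dt * (f[i][d] / m - (*zeta) * v[i][d]);
            x[i][d] += dt * v[i][d];
        }

    /* Drive zeta so that the kinetic temperature relaxes toward the
     * desired thermostat temperature with time constant tau. */
    *zeta += dt * (t_kinetic / t_desired - 1.0) / tau;
}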
Our MD computer simulation code is written in the C programming language. Unlike Fortran, C provides the advantages of allowing complex data structures, pointer arithmetic, and dynamic allocation of memory, all of which are necessary for performing variable-atom simulations. We employ PCP [18] as our programming model on the BBN TC2000. PCP provides an extension of the single-program-multiple-data (SPMD) programming model in the familiar C programming environment. Each processor executes the same code and the path through that code is determined by the data that the processor encounters. PCP introduces the concept of a "team" of processors. Each team has one master processor that is used for performing serial work such as data accumulation and initialization. Flow synchronization is obtained through the barrier statement. Every processor reaching a barrier waits until all members of its team (including the master) reach that barrier. Additional flow control for critical sections is accomplished with locks. PCP provides the lock(&lock_variable) and unlock(&lock_variable) functions to isolate critical sections. The lock variable is stored in shared memory. The first processor entering the critical section sets the lock variable to locked and proceeds with the calculation. Meanwhile, the remaining processors test the lock variable to see whether it is locked. When the first processor finishes the calculation, it sets the lock variable to unlocked. The next processor to find it unlocked immediately locks it and proceeds with the calculation.

Parallelism is exploited via domain decomposition. Each processor does the work for its domain and the interprocessor communication is performed through the shared memory. In effect, each processor is performing a separate MD simulation and obtaining boundary data from neighboring processors. Locality of data is exploited by explicitly declaring variables as private or shared with the private and shared storage class modifiers. Private and shared memory are dynamically allocated using the prmalloc and shmalloc functions.
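As a concrete illustration of these PCP idioms, the fragment below sketches a critical section guarded by a lock together with team-wide barriers, written in C extended with the shared/private storage modifiers and the barrier(), lock(), unlock(), and shmalloc() primitives described above. The shared histogram, the i_am_master() predicate, and NBINS are invented for the sketch and are not part of the actual MD code.

#define NBINS 64

shared  int *histogram;          /* visible to all processors              */
shared  int  lock_variable;      /* lock word kept in shared memory        */
private int  local_count[NBINS]; /* filled elsewhere by each processor     */

void accumulate_histogram(void)
{
    /* Serial setup done by the team's master processor only. */
    if (i_am_master())
        histogram = (int *) shmalloc(NBINS * sizeof(int));
    barrier();                   /* wait until the allocation is visible   */

    /* Critical section: one processor at a time adds its private
       counts into the shared histogram. */
    lock(&lock_variable);
    for (int b = 0; b < NBINS; b++)
        histogram[b] += local_count[b];
    unlock(&lock_variable);

    barrier();                   /* all contributions are now in place     */
}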
The majority of our parallel MD code consists of routines from a standard serial MD code based on the linked-cell method. We use the linked-cell method in our parallel domain decomposition. Before going into our domain decomposition implementation, we list the most important functions in our MD code as follows:

1. Initialization
   (a) Read input file
   (b) Initialize positions and velocities, build the linked-cell lists
2. Impose the domain decomposition scheme to divide the work for each processor
3. Main simulation loop
   (a) force: calculate interatomic forces
   (b) update: obtain new positions using the central difference
   (c) kinetic: calculate the kinetic energy and the temperature of the system
   (d) output: accumulate data and output
   (e) celler: apply boundary conditions and update the linked-cell lists

The initialization and output blocks represent a small amount of serial work that is performed by the master processor. The heart of the work throughout the program consists of loops of the following type:

for (i = 0; i < nx; i++) {
  for (j = 0; j < ny; j++) {
    for (k = 0; k < nz; k++) {
      a_ptr = cell[i][j][k].start_ptr;
      while (a_ptr != NULL) {
        /* private work */
        a_ptr = a_ptr->cell_list_ptr;
      }
    }
  }
}

where cell[i][j][k].start_ptr is the pointer to the first atom in each subcell and cell_list_ptr points to the next atom within that subcell; NULL indicates the end of a linked-cell list. One aspect of this task we want to emphasize is that all data are private to each processor; thus, memory latency is minimized.
One may note that in the scheme we have described above, we perform the domain (and therefore the work) allocation to the processors once at the start of the job. This is suitable for a problem wherein the dynamics of the simulation do not change the load balance while the problem is executing. If our test problem did exhibit large changes in processor loading as execution proceeded, we would periodically repeat step 2 in order to maintain a good load balance for the processors.

PARALLEL IMPLEMENTATION
The most important aspects of a parallel implementation of the MD algorithm are the assignment of the processors to a subset of atoms and the computation of the forces on those atoms. In the following, we discuss the linked-cell list method and the domain decomposition scheme that we used in our parallel implementation.

The Linked-Cell List Method
For finite-range interactions the amount of computation is reduced from O(N^2) to O(N), where N is the total number of atoms within the simulation cell. For example, within a simple metal, each atom has on the order of 100 neighbors that contribute to the force. We use the linked-cell list method to keep track of the distribution of atoms in each cell. Each cell has a pointer to the first atom it contains. Each atom has a pointer to the next atom in the same linked cell. The advantage of using the linked-cell list approach in our MD code is that storage requirements are minimized. The disadvantage is that the list must be renewed at each time-step and memory is addressed in a random manner. However, the time spent updating the cell list is O(N) and is found to be small relative to the force calculation.
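The sketch below shows one way to realize these pointers in C; the start_ptr and cell_list_ptr names follow the loop shown in Section 2, while the remaining names, the uniform subcell edge length, and the index arithmetic are assumptions made for illustration.

#include <stddef.h>

/* Sketch of the linked-cell data structures and the O(N) list build. */
typedef struct atom {
    double       x, y, z;        /* current position                     */
    struct atom *cell_list_ptr;  /* next atom in the same subcell        */
} Atom;

typedef struct {
    Atom *start_ptr;             /* first atom in this subcell (or NULL) */
} Cell;

/* Rebuild the cell lists from scratch for a subdomain of nx*ny*nz
 * subcells of edge length h whose origin is at (x0, y0, z0). */
void build_cell_lists(Cell *cell, int nx, int ny, int nz, double h,
                      double x0, double y0, double z0,
                      Atom *atoms, int natoms)
{
    for (int c = 0; c < nx * ny * nz; c++)
        cell[c].start_ptr = NULL;

    for (int a = 0; a < natoms; a++) {
        /* Cell index computed from the current position. */
        int i = (int)((atoms[a].x - x0) / h);
        int j = (int)((atoms[a].y - y0) / h);
        int k = (int)((atoms[a].z - z0) / h);
        Cell *c = &cell[(i * ny + j) * nz + k];

        /* Push the atom onto the top of that cell's list. */
        atoms[a].cell_list_ptr = c->start_ptr;
        c->start_ptr = &atoms[a];
    }
}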
A linked-cell list is built according to the coordinates of the atoms inside the MD simulation box. For each atom, we compute a cell index according to the current position and add the atom's pointer to the top of the list for that cell. We compute both the force on atom-i due to atom-j and the force on atom-j due to atom-i, thus doubling the work to assure scalability. At the end of the time-step, any atom that migrates from one cell to another cell is removed from its cell list and is added to the other cell list. In Figure 3 we illustrate how an atom is removed from the middle of a cell list and is then added to the beginning of another cell list.

The message-passing programming model may be used for performing the interdomain communication on distributed memory machines like the TC2000 [12, 13]. Each processor knows the identity of its nearest neighbors. The data from the inner border subcells of each processor are sent to its neighboring processors. Each processor then receives messages from its neighbors and scatters the data appropriately to the outer border subcells. To ensure that the corner data are passed correctly, data are passed to neighboring processors in one direction first, then in another direction. The forces are calculated independently within each processor.
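As a companion to Figure 3, the following sketch shows how a migrating atom can be unlinked from the middle of one cell list and pushed onto the head of another; the Atom and Cell layouts mirror the previous sketch, and the helper itself is illustrative rather than the actual celler routine.

#include <stddef.h>

typedef struct atom {
    struct atom *cell_list_ptr;   /* next atom in the same subcell */
} Atom;

typedef struct {
    Atom *start_ptr;              /* first atom in this subcell    */
} Cell;

/* Move atom a from old_cell's list to the head of new_cell's list. */
void migrate_atom(Cell *old_cell, Cell *new_cell, Atom *a)
{
    /* Unlink: walk the old list keeping a pointer to the link field. */
    Atom **link = &old_cell->start_ptr;
    while (*link != a)
        link = &(*link)->cell_list_ptr;
    *link = a->cell_list_ptr;

    /* Push onto the head of the new cell's list. */
    a->cell_list_ptr = new_cell->start_ptr;
    new_cell->start_ptr = a;
}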
In our approach, we use the scalable shared memory facility of the BBN TC2000 to avoid the complexity of data packing and communications management that would have been required by a strict message-passing implementation. To carry out the interdomain communication, we use the shared memory as a "hub" for temporary storage of new atomic positions in the border subcells. First, the atomic positions of the interior border subcells from each processor are copied into shared memory. Then, each processor updates its exterior border subcells by copying those atomic positions from shared memory to local memory (Fig. 5). Finally, the linked-cell lists for each subdomain are renewed to account for the migration of atoms from one subcell to another.
To make our procedure more instructive, we list four steps that are essential to this approach:

1. Update all interior linked-cell lists.
2. Copy the atomic positions of the interior border subcells into shared memory.
3. Copy the updated atomic positions of the exterior border subcells from shared memory into local memory.
4. Renew the border linked-cell lists.
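A minimal sketch of steps 2 and 3 follows, assuming PCP-style shared storage and a barrier() primitive; the per-subcell slot layout, the fixed capacity, and all names are assumptions made for illustration.

#define MAX_ATOMS_PER_SUBCELL 32   /* assumed capacity; about 8 atoms per subcell is typical */

typedef struct {
    int    natoms;
    double pos[MAX_ATOMS_PER_SUBCELL][3];
} SubcellSlot;

/* One slot per border subcell, allocated elsewhere with shmalloc(). */
shared SubcellSlot *hub;

void exchange_border_subcells(const SubcellSlot *my_interior, const int *interior_id,
                              int n_interior,
                              SubcellSlot *my_exterior, const int *exterior_id,
                              int n_exterior)
{
    /* Step 2: publish this processor's interior border subcells. */
    for (int s = 0; s < n_interior; s++)
        hub[interior_id[s]] = my_interior[s];

    barrier();   /* every processor has published its border data */

    /* Step 3: refresh the exterior (ghost) border subcells. */
    for (int s = 0; s < n_exterior; s++)
        my_exterior[s] = hub[exterior_id[s]];
}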

PERFORMANCE RESULTS
Our parallel MD algorithm has been implemented on the BBN TC2000 using the PCP programming paradigm, and we have performed some of the largest MD simulations for the longest periods of time to date. Here we present timing comparisons for three-dimensional EAM molecular dynamics simulations containing 4,032, 32,256, and 258,048 atoms. All of our calculations are performed with 64-bit floating-point precision. Table 1 is a summary of the simulation parameters that define our MD simulation. The benchmark calculations are performed for 20 MD time-steps, though we have performed simulations for as long as several million time-steps. The timings from the first 10 time-steps are discarded and the last 10 time-steps are averaged to obtain CPU times for one MD time-step. On the BBN TC2000, the performance of the startup of a job is severely degraded by the thrashing of the virtual memory system as page tables are constructed for each processor. This effect has been observed on other parallel systems implementing demand-paged virtual memory and seems to be inherent in such systems. Although the startup cost does not affect long problems that run for many thousands of time-steps, the chaotic timing effects that arise from the virtual memory system during startup are very inconvenient when attempting to time portions of an application.

FIGURE 5 The data communication through the shared memory. The data structure within each processor is updated in a two-step manner. The data from the interior border subcells are copied to the shared memory. The updated data from the shared memory belonging to the exterior border subcells are then copied to each processor.

Shown in Tables 2 and 3 are timings for the code executing on the TC2000 using the linked-cell domain decomposition scheme described in this paper. The present code executing on a single processor uses memory local to the processor only; there is no shared-memory overhead. The shared memory is used for interprocessor communication only. There are about eight atoms per subcell and each processor has a private copy of all data that define the calculation. The celler and io routines are the only routines that operate directly on shared data. From the tables, we obtain an average threefold speed-up by using the domain decomposition scheme. Most of this performance increase is attributable to the increase in performance of the force routine. Foremost, we have decreased the computational work by a factor of two. Furthermore, we have minimized the interprocessor communication by confining the computational task to local memory. Figure 6 is a plot of the parallel efficiency of various routines in our code simulating the motion of 32,256 atoms. The parallel efficiency (η) is defined to be the time to execute on one processor (t_1) divided by the time to execute on n processors (t_n), divided by the total number of processors.

That is,

η = t_1 / (n t_n).
The most inefficient routine in our code is the celler routine, where border subcell information is copied to/from shared memory. The second most inefficient section of the code is the io routine, which we performed once at every time-step in the timings presented here. In a production run, io is executed every 100 time-steps and its effect on the overall performance is small. The overall performance is somewhat surprising, providing an efficiency that is greater than 1 and reaching a plateau of 1.2 at around 64 processors. Parallel efficiencies of greater than 1 are unusual, but have been documented before on multiprocessors employing cache-equipped microprocessors as their basic computational elements. There are essentially two places where this effect can arise: the first is in the cache used to store virtual-to-physical address translations and the second is in the cache used to store previously referenced data. The BBN TC2000 is composed of processors possessing individual 16-Kbyte data caches, and memory management units capable of mapping 56 4-Kbyte pages without TLB misses.
As the number of processors is scaled, the total available cache memory scales as well. For a 32K-atom problem size, the heavily used portion of the data set begins to fit in the available cache memory as the processor count reaches 64, and no further speed increase results from the cache effect. This effect is important for all currently available microprocessors, but these microprocessors lack hardware monitors that would allow us to pin down the source of the efficiency improvement in a more quantitative manner. We expect that future microprocessors will possess hardware monitors that are capable of counting cache misses, and these will be very useful when tracking down anomalous performance results.


FIGURE 3 An atom is removed from the middle of one cell list and added to the beginning of another cell list.

Table 2. CPU Times (Seconds) to Simulate One MD Time-Step

Table 3. Total CPU Times (Seconds) to Simulate One MD Time-Step of a Three-Dimensional Material Containing 4,032, 32,256, and 258,048 Atoms on the BBN TC2000