On the Parallel Elliptic Single / Multigrid Solutions about Aligned and Nonaligned Bodies Using the Virtual Machine for Multiprocessors

Parallel elliptic single/multigrid solutions around an aligned and nonaligned body are presented and implemented on two multi-user and single-user shared memory multiprocessors (Sequent Symmetry and MOS) and on a distributed memory multiprocessor (a Transputer network). Our parallel implementation uses the Virtual Machine for MultiProcessors (VMMP), a software package that provides a coherent set of services for explicitly parallel application programs running on diverse multiple instruction multiple data (MIMD) multiprocessors, both shared memory and message passing. VMMP is intended to simplify parallel program writing and to promote portable and efficient programming. Furthermore, it ensures high portability of application programs by implementing the same services on all target multiprocessors. The performance of our algorithm is investigated in detail. It is seen to fit well the above architectures when the number of processors is less than the maximal number of grid points along the axes. In general, the efficiency in the nonaligned case is higher than in the aligned case. Alignment overhead is observed to be up to 200% in the shared-memory case and up to 65% in the message-passing case. We have demonstrated that when using VMMP, the portability of the algorithms is straightforward and efficient. © 1994 by John Wiley & Sons, Inc.


INTRODUCTION
The introduction of parallel computer architectures in recent years has challenged existing serial methods such as finite differences, finite elements, and multigrid solutions of PDEs. See, for example, [1] through [7] for the implementation of multigrid methods on message-passing architectures.
Physical problems often require the solution of the corresponding partial differential equations (PDEs) around bodies of various shapes. It appears that in such cases there is no simple method to map the body surface onto the grid points. Some methods generate grid points in such a way that the body surface is aligned with the grid, by using local grids [8,9]. Another method is based on using Cartesian grids [8,9] such that the body surface is not aligned with the grid, but its existence within the grid has to be taken into consideration. This method eliminates some problems that are involved in generating grid points for multigrid solutions of flow problems over shapes with complex geometry. However, it requires using dummy points (points that are on the body boundary or in its interior) and raises the problem of how to use the corresponding boundary conditions. A potential solution to this problem is to use CDC CYBER 860 [8,9].
In this article, as a model problem, we investigate the parallel implementation of the Poisson single/multigrid solution on two shared-memory and shared bus multiprocessors (multi-user and single-user machines), as well as on a distributed memory multiple instruction multiple data (MIMD) multiprocessor, using the Virtual Machine for MultiProcessors (VMMP) [10,11]. For the multigrid solution [8,9,12], we apply a set of grids with a decreasing order of refinement, all approximating the same domain to be considered. We start with the finest grid and apply the weighted Jacobi relaxation algorithm. When relaxation begins to stall, we move to a coarser grid, on which relaxation becomes more efficient. After reaching a coarser grid with a satisfactory rate of convergence, we return to the finest grid and correct the solution approximation accordingly.
Unlike previous investigations, we consider two distinct cases: (1) when the body boundary is aligned with the grid and (2) when the body is not aligned with the grid so that dummy points are introduced. We consider both shared-memory and message-passing architectures and compare their performance.
This article is organized in the following way: In Section 2 we present a brief background concerning the weighted Jacobi relaxation method, the multigrid method, and the VMMP software package. In Section 3 we introduce our model problem and its specific parallel implementation. Numerical results are presented in Section 4 and processing time analysis is presented in Section 5. Finally, Section 6 summarizes our results.

BACKGROUND
In this section we briefly present some of the methods and software tools used in the solution of our model problem. In particular, we present the weighted Jacobi relaxation method, the multigrid method, and the VMMP package.

The Weighted Jacobi Relaxation Method
Let $Au = f$ be a system of linear algebraic equations representing a finite difference solution that approximates an elliptic partial differential equation, where $A$ is nonsingular. Let $D = \operatorname{diag} A$, $B = D^{-1}(D - A)$, and $c = D^{-1}f$. Then we can replace the linear system by the equivalent system
$$u = Bu + c. \qquad (1)$$
In the weighted Jacobi method [13], which will be considered here, a real parameter $\omega$ is introduced as follows:
$$u^{(k+1)} = B_\omega u^{(k)} + \omega c, \qquad (2)$$
where
$$B_\omega = \omega B + (1 - \omega) I. \qquad (3)$$
For the case in which the body surface is aligned with the grid, the algorithm consists of using in each iteration the values of the real grid points that participate in the grid relaxation. The boundary conditions exist only on the grid boundary.
For the case where the body surface is not aligned with the grid, one may need to know the solution at points that are in the body interior or on its surface. These points are called dummy points. For example, consider the problem of Figure 1, in which the value at the point C in a given iteration is computed from the values of $A_1$, $A_2$, $A_3$, and D, where D is located inside the body. In such a case we proceed as follows: We first consider the point that is the mirror image of D along the tangent line to the body boundary, namely the point D''. The value at the point D is then replaced by a weighted average of the values in the neighborhood of D''. Similarly, in order to compute the value at B, the value at D is replaced by a weighted average of the values in the neighborhood of D'. Note that the value of the dummy point D is different in the computation of A and B because D is reflected to different points in the grid. Body boundary points are replaced by a weighted average of their neighbors' values. Note that the value of inner points is not computed.
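The weighted Jacobi update, $u \leftarrow (1 - \omega)u + \omega(Bu + c)$ with $B = D^{-1}(D - A)$ and $c = D^{-1}f$, can be sketched on a small dense system as follows. This is only an illustration under stated assumptions: the function name `weighted_jacobi`, the zero initial guess, the value $\omega = 0.8$, and the 3 x 3 test system are hypothetical and are not taken from the article.

```python
def weighted_jacobi(A, f, omega, iterations):
    """Apply u <- (1 - omega) * u + omega * (B u + c) repeatedly for A u = f."""
    n = len(A)
    u = [0.0] * n  # zero initial guess (an assumption of this sketch)
    for _ in range(iterations):
        new_u = []
        for i in range(n):
            # (B u + c)_i = (f_i - sum over j != i of A_ij * u_j) / A_ii
            off_diagonal = sum(A[i][j] * u[j] for j in range(n) if j != i)
            jacobi_value = (f[i] - off_diagonal) / A[i][i]
            new_u.append((1.0 - omega) * u[i] + omega * jacobi_value)
        u = new_u
    return u


# Hypothetical diagonally dominant system whose exact solution is (1, 2, 3).
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
f = [2.0, 4.0, 10.0]
u = weighted_jacobi(A, f, omega=0.8, iterations=200)
```

With $\omega = 1$ the update reduces to the plain Jacobi iteration, so a single step from the zero guess returns $f_i / A_{ii}$ in each component.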

The Multigrid Method
Consider a partial differential equation and let $A^h u^h = f^h$ be the corresponding linear system to be solved iteratively on some domain $G^h$ with grid spacing $h$. In the multigrid method we consider a set of grids with a decreasing order of refinement, $G^h$ being the finest grid. We start with the finest grid and apply some relaxation algorithm. When relaxation begins to stall, indicating the predominance of smooth (low frequency) error modes, we move to a coarser grid in which these smooth error modes appear more oscillatory (corresponding to high frequency modes) and relaxation becomes more efficient. After reaching a coarser grid with a satisfactory rate of convergence, we return to the finest grid and correct the solution approximation accordingly [13].
Let $I_{2h}^{h}$ be the interpolation operator from the coarser grid $G^{2h}$ to the fine grid $G^h$ and let $I_{h}^{2h}$ be the one from $G^h$ to $G^{2h}$. In general, the order of coarse to fine interpolation should be no less than the order of the differential equation to be considered. The most popular multigrid scheme is the $\mu$-cycle multigrid scheme. It solves the system $u^h = M\mu^h(u^h, f^h, A^h)$ recursively as follows:
1. Relax $\gamma_1$ times on $A^h u^h = f^h$ (according to some relaxation algorithm).
The remaining steps compute the residual (step 2), treat the coarse grid problem on $G^{2h}$ by $\mu$ recursive applications of the scheme, correct $u^h$ with the interpolated coarse grid result, and relax $\gamma_2$ times. Hence, for instance, $\mu = 1$ corresponds to the V-cycle algorithm, and $\mu = 2$ corresponds to the W-cycle algorithm (Fig. 2).
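The difference between the V-cycle and the W-cycle is easiest to see from the order in which the grid levels are visited. The sketch below generates that order for a $\mu$-cycle, assuming the usual textbook recursion in which each level calls the next coarser level $\mu$ times and the coarsest level is treated directly. The function name `cycle_schedule` and the level numbering (0 for the finest grid) are hypothetical choices made for this illustration.

```python
def cycle_schedule(levels, mu):
    """Return the sequence of grid levels visited by a mu-cycle.

    Level 0 is the finest grid and level (levels - 1) is the coarsest.
    """
    def visit(level):
        if level == levels - 1:
            return [level]  # coarsest grid: no further recursion
        inner = []
        for _ in range(mu):
            for visited in visit(level + 1):
                # Post-relaxation of one recursive call and pre-relaxation
                # of the next happen on the same grid, so merge them.
                if not inner or inner[-1] != visited:
                    inner.append(visited)
        return [level] + inner + [level]

    return visit(0)


v_cycle = cycle_schedule(levels=4, mu=1)
w_cycle = cycle_schedule(levels=4, mu=2)
```

For four levels the V-cycle descends once and climbs once, whereas the W-cycle revisits the coarser grids several times before returning to the finest grid.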

VMMP
VMMP [10,11] is a software package developed at the Computer Science Department of Tel-Aviv University. VMMP provides a coherent set of services for explicitly parallel application programs running on diverse MIMD multiprocessors, both shared memory and message passing. It is intended to simplify parallel program writing and to promote portable and efficient programming. Furthermore, it ensures high portability of application programs by implementing the same services on all target multiprocessors. VMMP provides high level services, such as the creation of a cooperating set of processes (called crowd computations in VMMP), creation of a tree of lightweight processes (called tree computations), data distribution and collection, barrier synchronization, topology independent message passing, shared objects memory, etc. All of these services may be implemented efficiently on a variety of target machines. These high level services ensure that their implementation details can be hidden effectively from the user. VMMP does not provide low level services, such as semaphores, because they are difficult to implement efficiently on all target machines (e.g., on a message-passing multiprocessor).
VMMP may be used directly by the application programmer as an abstract parallel machine to aid portability without sacrificing efficiency. VMMP may also be used as a common intermediate language between high level programming tools, such as parallelizers and specification language processors, and the target machines. In particular, VMMP is used as the intermediate language in P3C (Portable Parallelizing Pascal Compiler) [14], which translates serial programs in Pascal or Fortran to explicitly parallel code for a variety of target multiprocessors. VMMP usage is illustrated in Figure 3. This article includes only a short description of VMMP services and capabilities related to our multigrid problem. Full details and other types of problems can be found in Gabber [10,11].

VMMP Services
VMMP supports the communication and synchronization requirements of parallel algorithms, and especially the requirements of tree and crowd computations. It provides a virtual tree machine to support tree computations and crowd services to support crowd computations. Shared objects memory provides an asynchronous and flexible communication mechanism for all computations. VMMP also supports several data distribution strategies using shared objects or by special crowd operations. A crowd may create a tree computation and vice versa. Figure 4 illustrates the relationship among VMMP services.
Tree Computations. Tree computations are a general computation strategy, which solves a problem by breaking it into several simpler subproblems, which are solved recursively. The solution process looks like a tree of computations, in which the data flows down from the root into the leaves and solutions flow up towards the root.
Crowd Computations. Many parallel algorithms are defined in terms of a set of cooperating processes. Each of these processes is an independent computational entity, which communicates and synchronizes with its peers. The processes communicate and synchronize mainly by message passing, which may be either blocking or nonblocking.
The communication structure of crowd computations normally follows some regular graph, such as a ring, mesh, torus, hypercube, or tree. Some computations change their communication structure during the computation and some require unlimited communication between any pair of processes.
Message passing forces an implicit synchronization between the sending and the receiving processes, because the receiver normally waits for a message from the sender. Blocking send and receive operations imply a much stricter synchronization between the sender and the receiver.

Shared Objects Memory. VMMP implements a shared objects memory with the same semantics on all target multiprocessors. Shared objects may be regarded as virtually shared-memory areas, which contain any type of data, and can be accessed only by secure operations. Shared objects can be created, deleted, passed by reference, read, and written.
The shared objects are intended to distribute data, collect results, and collect global results by some associative and commutative operations. Other uses of the shared objects, such as random updates to parts of a large object, are discouraged, because the implementation of these operations increases overhead on a message-passing multiprocessor.
Concurrent writes to the same shared object are permitted. The value of an object is the last value that has arrived at the master copy of the object. This value is then broadcast to all other copies of the object (if any). A read following a write to the same object will return either the value written or a more recent value. It is possible that two distinct processors reading the same object at the same time will access different versions of the object value. This definition is necessary to guarantee consistent behavior also on message-passing multiprocessors. A safe way to update shared objects is to apply an associative function, which guarantees the correct result regardless of the application order.
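The reason an associative and commutative update is safe while a plain overwrite is not can be illustrated with a few lines of serial code. The sketch below does not use VMMP; it merely enumerates every arrival order of four hypothetical per-process values (for example, local maximum residuals) and compares a maximum-combine with a last-writer-wins overwrite. The names `combine` and `local_maxima` are assumptions made for this illustration.

```python
from functools import reduce
from itertools import permutations


def combine(values, op):
    """Fold the values with a binary operation, in the order given."""
    return reduce(op, values)


# Hypothetical values contributed by four processes.
local_maxima = [0.3, 0.07, 0.5, 0.11]

# Combining with max gives one answer for every arrival order.
combined_results = {combine(order, max) for order in permutations(local_maxima)}

# Overwriting keeps whichever value happened to arrive last.
overwrite_results = {order[-1] for order in permutations(local_maxima)}
```

Here `combined_results` contains a single value, whereas `overwrite_results` contains all four candidates, which is the order dependence that the shared objects semantics described above would expose.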
The user must manipulate the shared objects only through VMMP services, which guarantee the correctness of the operations. The program must call VMMP explicitly after each change to the local copy of a shared object in order to broadcast the new value to other processors.

VMMP Support for Grid Computations
VMMP provides several high level services that support grid algorithms. These services allow a crowd process to access a slice of the grid and exchange overlapping boundary elements with the neighboring processes.
VMMP supports several grid distribution strategies to ensure low communication overhead and load balancing. They are illustrated in Figure 5.
The Vgrid_part service computes the boundaries of a slice of the grid that is assigned to the process. The program can use a different distribution strategy by changing a single parameter to the Vgrid_part routine. Additional distribution strategies can be programmed explicitly by a combination of data access services. The following C code fragment illustrates grid update operations using VMMP services. It is a part of a program for a solution of a Laplace PDE by relaxation. The multigrid algorithm is implemented by a similar code. The VMMP services used in this code fragment will be explained in the following paragraphs.
#define BOUNDARY
The Vget_rect and Vset_rect services are used to read and write a slice of a shared object containing the grid. The slice boundaries were computed previously by Vgrid_part according to the distribution strategy.

VMMP Implementation
VMMP has been implemented on the following multiprocessors:
1. Sequent Symmetry: a 26-processor Sequent Symmetry running the DYNIX operating system.
5. Transputers: a distributed memory multiprocessor consisting of a network of eight T800 transputers running the HELIOS distributed operating system [18].
VMMP simulators are available on SUN3, SUN4, IBM PC/RT, and VME532 serial computers. The same application program will run unchanged on all target machines after a recompilation.
VMMP has been used in the implementation of many other problems, including a parallel generalized in frequency FFT (GFFT) [19], parallel solution of underwater acoustic implicit finite difference (IFD) schemes [20], and a portable and parallel version of the basic linear algebra subroutines (BLAS-3) [21]. VMMP was also used as the intermediate language of the Portable Parallelizing Pascal Compiler (P3C) [14], which translates serial application programs in Pascal into portable and efficient explicitly parallel code.

The Model Problem and Its Solution Scheme
As our model problem we consider the Poisson equation
$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = F(x, y) \qquad (4)$$
on a rectangular domain $G$, which without loss of generality is taken to be the unit square ($0 \le x \le 1$, $0 \le y \le 1$), with the Dirichlet boundary condition
$$u(x, y) = g(x, y) \quad \text{on } \partial G. \qquad (5)$$
As indicated earlier, when the body is aligned with the grid, its boundary coincides with the grid boundaries. In the case of an unaligned body we consider a rectangular body as illustrated in Figure 6.
Note that although we consider a specific elliptic problem, our analysis to follow provides qualitative results for the more general cases and may be easily modified to fit other boundary-value problems and more complicated domains. We use a grid $G^h$ where $h = 1/(N + 1)$ is the uniform grid spacing along the x and y axes and $N$ is the number of interior grid points. Approximating the partial derivatives by the corresponding second order central difference equation we obtain
$$\frac{V_{i+1,j} - 2V_{i,j} + V_{i-1,j}}{h^2} + \frac{V_{i,j+1} - 2V_{i,j} + V_{i,j-1}}{h^2} = f_{i,j} \qquad (6)$$
where
$$f_{i,j} = F(x_i, y_j), \quad V_{i,j} = u(x_i, y_j), \quad x_i = ih \ (1 \le i \le N - 1), \quad y_j = jh \ (1 \le j \le N - 1),$$
$$V_{i,j} = g(x_i, y_j) \text{ on } \partial G \ (i = 0, N;\ 1 \le j \le N - 1), \ (j = 0, N;\ 0 \le i \le N). \qquad (7)$$
The truncation error is $O(h^2)$.
The weighted Jacobi method applied to the Poisson equation becomes
$$V_{i,j}^{(k+1)} = (1 - \omega) V_{i,j}^{(k)} + \frac{\omega}{4}\left(V_{i+1,j}^{(k)} + V_{i-1,j}^{(k)} + V_{i,j+1}^{(k)} + V_{i,j-1}^{(k)} - h^2 f_{i,j}\right). \qquad (8)$$
In our analysis to follow, we consider general $F(x, y)$, $g(x, y)$, and $\omega$. However, for the sake of simplicity of intermediate convergence tests, in our practical implementation we take $F(x, y) = g(x, y) = 0$. In this case, the exact solution is identically zero and the numerical results are actually the corresponding errors. The convergence parameter $\omega$ is held fixed. In the case where the body is not aligned with the grid, the values $V_{i,j}^{(k)}$ at dummy and boundary points are evaluated as discussed in Section 2.1. Note that because we consider a rectangular body, only reflections like D'' in Figure 1 (corresponding to the point A) are to be considered. In our implementation, we simplify the average to be taken such that the value of such a reflection point is taken as the single value of the real grid point with which it coincides. For instance, when evaluating the value of the point A in Figure 1, the value of D is replaced by the value of its reflection D'', which in turn is replaced by the value of A.
The corresponding interpolation and restriction operators that are implemented here are as described in detail in Hackbusch [13] (pp. 60, 65). If the grid point in $G^h$ coincides with a grid point of $G^{2h}$ in the interpolation procedure, then the value is unchanged. If, however, the new fine grid point of $G^h$ is along a partition line of $G^{2h}$, then its value is taken as the symmetric average of its two neighbors along this line. Otherwise, the value is taken as the symmetric average of its four neighbors at the corners of the appropriate square (Fig. 7).
In the restriction procedure, the weight of the neighbor points of $G^h$ along a partition line of $G^{2h}$ is twice that of the neighbors at the corners of the corresponding square. The weight of the point itself is four times as much (Fig. 8).
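The two transfer operators described above can be sketched for a square grid stored as a list of rows. The weights follow the text (1, 1/2, and 1/4 for interpolation; 4, 2, and 1 divided by 16 for restriction). Everything else is an assumption of this illustration: the function names, the storage layout with an even number of fine intervals, and the choice to copy boundary values unchanged during restriction.

```python
def interpolate(coarse):
    """Coarse-to-fine transfer on an (m + 1) x (m + 1) grid of values."""
    m = len(coarse) - 1
    n = 2 * m
    fine = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(n + 1):
            I, J = i // 2, j // 2
            if i % 2 == 0 and j % 2 == 0:
                fine[i][j] = coarse[I][J]  # coincides with a coarse point
            elif i % 2 == 0:
                # on a coarse grid line: average of the two neighbors
                fine[i][j] = 0.5 * (coarse[I][J] + coarse[I][J + 1])
            elif j % 2 == 0:
                fine[i][j] = 0.5 * (coarse[I][J] + coarse[I + 1][J])
            else:
                # cell center: average of the four corner neighbors
                fine[i][j] = 0.25 * (coarse[I][J] + coarse[I][J + 1]
                                     + coarse[I + 1][J] + coarse[I + 1][J + 1])
    return fine


def restrict_full_weighting(fine):
    """Fine-to-coarse transfer with weights 4, 2, 1 (divided by 16)."""
    n = len(fine) - 1
    m = n // 2
    # Boundary values are simply copied (an assumption of this sketch).
    coarse = [[fine[2 * I][2 * J] for J in range(m + 1)] for I in range(m + 1)]
    for I in range(1, m):
        for J in range(1, m):
            i, j = 2 * I, 2 * J
            edges = fine[i - 1][j] + fine[i + 1][j] + fine[i][j - 1] + fine[i][j + 1]
            corners = (fine[i - 1][j - 1] + fine[i - 1][j + 1]
                       + fine[i + 1][j - 1] + fine[i + 1][j + 1])
            coarse[I][J] = (4.0 * fine[i][j] + 2.0 * edges + corners) / 16.0
    return coarse


# Hypothetical coarse data that vary linearly, so both operators are exact.
coarse_values = [[float(I + 2 * J) for J in range(3)] for I in range(3)]
fine_values = interpolate(coarse_values)
round_trip = restrict_full_weighting(fine_values)
```

For linearly varying data the interpolation reproduces the linear function on the fine grid and the restriction returns the original coarse values, which is a convenient sanity check on the weights.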
We use the multigrid V-cycle ($\mu = 1$) as discussed in Section 2.2 with $\gamma_1 = \gamma_2 = 100$. Note that usually the main advantage of the multigrid methods is the fact that the needed number of iterations (which is determined by the problem residual) is relatively very small as compared with the single grid case. However, unlike other articles considering parallelization of multigrid solutions, the main purpose of our article is to investigate parallelization of multigrid solutions of a model problem around an aligned and a nonaligned body. Therefore, as a first step, we have chosen fixed and relatively large values for $\gamma_1$ and $\gamma_2$ (for $N > 1$). The case where the number of iterations is determined by the problem residual in the aligned and nonaligned cases involves a great deal of load balancing and is under current investigation.

Domain Decomposition
The performance of algorithms implemented on shared-memory and shared bus architectures has been of concern, as compared with the case of message-passing architectures, due to the need for global memory access. This access is done through a bus that is much slower than the local memory bus. Multigrid algorithms implement in each level a relaxation algorithm and then a fine-to-coarse grid, or coarse-to-fine grid, data transfer. The locality of multigrid computations enables an efficient implementation on message-passing machines as well as on shared bus and shared-memory machines, because computing on local data avoids frequent use of the global memory. The communication overhead is minimized and there is less traffic on the bus, so that less synchronization is required. On message-passing architectures, the algorithm and the problem are scalable. In other words, the performance stays the same if we increase the number of processors, which means that we have linear scalability. However, for shared-memory architectures, the availability of more processors will enhance the level of parallelism on one hand, but will increase synchronization and communication overhead on the other hand. Therefore, on shared bus architectures, there is no linear scalability.
Our parallel implementation of a grid relaxation is based on grid partitioning; the grid is divided into several subgrids, each of which is assigned to a single processor. The partitioning in our case is done by splitting $G$ into $p$ horizontal strips of equal areas, where $p$ is the number of processors and $N$ is the number of interior grid points along each axis. This is illustrated in Figure 9. If the body shape causes a significant difference in the number of grid points between the strips, the grid will be partitioned into nonequal size strips such that each contains an equal number of nondummy grid points. Note that this partition is different from that of Taylor and Markel [4], who consider only the aligned case and use a partition into square elements. The relaxation step is performed on each subgrid independently. However, calculations of values at interior subgrid points cannot be done until values from neighboring subgrids are received. For this reason, and in order to avoid frequent access to the global memory, we use overlap areas that include the needed values from the neighbors' subgrids. Data from these overlap areas are transmitted and stored at the beginning of a relaxation step, simultaneously for all subgrid boundaries, and before the calculations of the subgrid interior points are performed. For example, consider a relaxation with two processors. The domain is divided into two subdomains, such that processor 1 is assigned the upper subdomain, whereas processor 0 is assigned the lower subdomain (see Fig. 10a). At the beginning of the relaxation, and before the evaluation of subgrid interior points, all values of the last row of the upper subdomain are transmitted simultaneously and stored in the overlap area held in the local memory of the neighboring processor (Fig. 10b). The same grid partitioning method is applied at all multigrid levels, and data transfers from fine grid to coarse grid (see Fig. 11) and vice versa are performed as described earlier in Section 3.1.
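The strip partitioning can be sketched as a function that cuts the grid rows into $p$ contiguous strips so that each strip carries roughly the same number of nondummy points. The greedy rule below is only one simple way to do this and is not claimed to be the rule used in the article or in VMMP; the name `strip_bounds` and the per-row weights are hypothetical.

```python
def strip_bounds(row_weights, p):
    """Cut rows 0..n-1 into p contiguous strips of nearly equal total weight.

    row_weights[r] is the number of nondummy grid points in row r.
    Returns a list of (first_row, end_row) pairs, end_row exclusive.
    """
    n = len(row_weights)
    if not 1 <= p <= n:
        raise ValueError("need between 1 and len(row_weights) strips")
    total = sum(row_weights)
    bounds, start, running = [], 0, 0
    for k in range(1, p + 1):
        end = start
        # Always take one row, then keep adding rows while the cumulative
        # weight stays within k/p of the total and enough rows remain
        # for the strips that follow.
        while end < n - (p - k) and (
            end == start or (running + row_weights[end]) * p <= total * k
        ):
            running += row_weights[end]
            end += 1
        bounds.append((start, end))
        start = end
    return bounds


# Aligned case: every row has the same number of real points.
aligned = strip_bounds([1] * 18, 3)

# Hypothetical nonaligned case: the body removes half the points in rows 2..5.
nonaligned = strip_bounds([8, 8, 4, 4, 4, 4, 8, 8], 2)
```

In the aligned example the strips have equal height, while in the nonaligned example both strips still have four rows only because the hypothetical body is centered; an off-center body would shift the cut toward the rows with fewer real points.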
Note that the number of grid points per processor in the coarser grid is decreasing, thereby causing a relatively high cost of data transmission. Therefore the number of processors to be utilized in each step of the multigrid algorithm should depend strongly on the size of the grid. Because moving one step down in the V-cycle reduces the number of grid points by a factor of 4, at some step it would obviously be more economical to utilize only some of the available processors. Numerical tests for our problem have shown that $f(p, m)$, the optimal number of utilized processors, is given by Equation 11, where $p$ is the number of processors, $m$ is the number of grid points along each axis, and $\lfloor q \rfloor$ denotes the largest integer not exceeding $q$. This way, with the overlapping rows and unless $N = 1$, each processor is assigned at least two rows. For example, in a multiprocessor with 12
FIGURE 11 Transfer of data from fine grid level to a coarse grid level.

NUMERICAL RESULTS AND PERFORMANCE ANALYSIS
The same parallel single grid relaxation and multigrid algorithms have been implemented on MOS [17], a single-user shared-memory machine, on a Sequent Symmetry, a multi-user shared-memory multiprocessor, and on a distributed memory multiprocessor (a Transputer network) by using the VMMP software package.

MOS Performance
This section describes the implementation of the algorithms on the MOS single-user shared-memory multiprocessor. This multiprocessor has eight DB532 processor cards connected to shared memory via a VME bus. Each processor card contains a 25 MHz NS32532 CPU and FPU, 64 KB of write-through cache, and 4 MB of local memory. The on-board cache minimizes global memory accesses. The multiprocessor runs the MOS [17] operating system, which is a distributed version of UNIX.
Figure 12 illustrates the MOS multiprocessor structure.
For the reference purpose of parallel speedup measurements, a serial program was executed on a single NS32532 processor, without the overhead of system calls to VMMP and of parallelization. This serial algorithm is the fastest we have been able to derive. The results of our implementation for up to eight processors with VMMP calls are given for the aligned and nonaligned cases in Figures 13 and 14. The speedup, defined as serial time/parallel time, is given for each parallel processing time, whereas the single processor time and the serial time are given as no. processors = 0. These measurements do not include I/O time. The start-up and process creation overhead from the use of VMMP is best demonstrated when we examine the performance of the parallel algorithm on a single processor, relative to the corresponding serial performance in these figures. For example, when the body surface is aligned with the grid and N = 16, the processing time for the serial algorithm is 700 seconds whereas the processing time for the parallel algorithm utilizing one processor is 682 seconds. The difference of 18 seconds is due to VMMP calls, and in general for one processor the overhead is about 3% of serial time.
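The speedup defined above, together with the efficiency figures quoted later, can be computed as in the following sketch. The efficiency is taken here as speedup divided by the number of processors, which is the usual definition but is an assumption, because the text does not spell it out. The function names and the timings are hypothetical and are not measurements from the article.

```python
def speedup(serial_time, parallel_time):
    """Speedup as defined in the text: serial time / parallel time."""
    return serial_time / parallel_time


def efficiency(serial_time, parallel_time, processors):
    """Speedup per processor (assumed definition of efficiency)."""
    return speedup(serial_time, parallel_time) / processors


# Hypothetical timings in seconds for a run on 8 processors.
example_speedup = speedup(800.0, 110.0)
example_efficiency = efficiency(800.0, 110.0, 8)
```

With these made-up timings the speedup is a little over 7 and the efficiency is roughly 0.91, which is the kind of value meant by an efficiency of over 90%.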
We observe from Figures 13 and 14 that the speedup increases with the number of points, and the performance is dramatically better for very fine grids and a relatively small number of processors. This is due to the locality of the involved calculations, so that for large grid sizes, the communication overhead becomes negligible compared with the processing time of each processor.
FIGURE 13 MOS speedup vs. number of processors graphs: lower graphs, single grid relaxation; upper graphs, multigrid relaxation; left, when the body surface is aligned with the grid; right, when the body surface is not aligned with the grid.
Note also that the performance in the nonaligned case for small N and large p is relatively poor due to the enhanced overhead caused by the dummy points. Alignment overhead is shown to be up to about 200% of the aligned grid processing time in both the single grid and multigrid cases.

Sequent Symmetry Performance
The multi-user Sequent Symmetry machine contains 26 i386 processors with 64 KB of cache for each processor. Global memory size is 32 MB.
The Sequent runs the DYNIX operating system, which is a parallel variant of UNIX. Figure 15 illustrates the Sequent Symmetry structure. The results for the single grid and multigrid in the aligned and nonaligned cases are presented in Figures 16 and 17. The performance here is shown to be similar to that of the MOS when the processor's load is large (i.e., for a large number of grid points and a small number of processors), and an efficiency of over 90% is achieved. However, some variations are observed for a small number of grid points. Alignment overhead is again up to about 200%. We note, however, that poor efficiency is expected for small N and relatively large p, when the processor's load is very small and communication overhead becomes significant.

Message-Passing Performance
Our algorithm has also been implemented on a distributed memory multiprocessor comprising eight T800 Transputer nodes. Each node contains a T800 Transputer and 2 or 4 MB of local memory. The Transputer clock speed is 20 MHz and memory access time is 80 nsec. The nodes are connected by high-speed bidirectional serial links. There is no shared memory and the Transputers run the HELIOS distributed operating system. Figure 18 illustrates the Transputer system structure. Figures 19 and 20 show the speedup and processing time, respectively. When the processor's load is large enough, an efficiency of over 90% is attained. However, for a small processor's load the message-passing performance of our algorithm is degraded significantly as compared with the shared-memory results, mainly due to the increased communication overhead. Note also that the alignment overhead in the message-passing case is reduced to about 20% in the single grid case and to about 65% in the multigrid case.

The Single Grid Solutions
For each fixed $N = 16, 32, 64, 128, 256$ we consider the following processing time model:
$$T_{N,p} = n\left(t\,\frac{N^2}{p} + cN\right) \qquad (12)$$
where $T_{N,p}$ is the processing time for single grid relaxation, $n$ is the number of relaxation iterations, $t$ is the computation time of one grid point in a relaxation step, and $c$ is the proportionality factor of the overhead time for data transmission among the processors. Using Equation 12 we assume equal time delays for all processors when they try to transmit data on a shared bus while another processor is occupying the bus. Under this assumption, and due to the specific choice of our body shape, neither $t$ nor $c$ should depend on $N$ or $p$.
In our computations we used n = 100.The values of c and t of Equation 12 that provide the least squares fit to the performance results for different values of N are listed in Table 1.
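Because the processing time model is linear in the two unknowns, the least squares fit reduces to a 2 x 2 system of normal equations. The sketch below assumes that Equation 12 reads $T_{N,p} = n(tN^2/p + cN)$, which is a reconstruction from the symbol definitions, and it fits synthetic timings generated from made-up values of $t$ and $c$ rather than the measurements behind Table 1. The names `model_time` and `fit_t_c` are hypothetical.

```python
def model_time(N, p, n, t, c):
    """Assumed form of Equation 12: n * (t * N^2 / p + c * N)."""
    return n * (t * N * N / p + c * N)


def fit_t_c(samples, n):
    """Least squares estimate of (t, c) from (N, p, measured_time) samples."""
    saa = sab = sbb = say = sby = 0.0
    for N, p, measured in samples:
        a = N * N / p        # coefficient of t
        b = float(N)         # coefficient of c
        y = measured / n     # time per relaxation iteration
        saa += a * a
        sab += a * b
        sbb += b * b
        say += a * y
        sby += b * y
    det = saa * sbb - sab * sab  # nonzero when at least two p values differ
    t = (say * sbb - sby * sab) / det
    c = (saa * sby - sab * say) / det
    return t, c


# Synthetic data for one fixed N, generated from made-up parameters.
true_t, true_c, n_iter = 2e-5, 1e-4, 100
samples = [(64, p, model_time(64, p, n_iter, true_t, true_c)) for p in (1, 2, 4, 8)]
t_fit, c_fit = fit_t_c(samples, n_iter)
```

On noise-free synthetic data the fit recovers the generating parameters; with real timings the residual of the fit indicates how well the assumption of constant $t$ and $c$ holds for a given $N$.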
We can see from Equation 12 that the maximal number of processors to be utilized in our algorithm is the square root of the number of grid points (namely $p = N$). In this case, each processor is assigned one horizontal line (Fig. 9).

The Multigrid Solutions
The multigrid processing time model involves the following quantities: $A_{p,N}$ is the multigrid processing time on $N \times N$ grid points using $p$ processors, and $T_{a,b}$ is the relaxation processing time on $a \times a$ grid points using $b$ processors. In this processing time model we made the same assumptions as for Equation 12, as well as equal time for grid interpolation and for grid projection. Because we are using a full weighted projection we may assume nearly equal times for grid interpolation and for grid projection. Then, neglecting the residual computation (step 2 in the $\mu$-cycle algorithm), $F_{N,p}$ is given by
$$F_{N,p} = z\,\frac{N^2}{p},$$
where $z$ is the time needed to receive one grid point value, interpolate (or project) it on the finer (coarser) grid, and transmit the point value to the new grid, $N^2$ is the number of interior grid points, and $p$ is the number of processors. Using the multigrid processing time identity, in which $T_{N,p}$ is defined by Equation 12, we may evaluate the least squares values of $z$ corresponding to the performance results. They are presented in Table 2. Note that for the Transputers $z = O(10^{-1})$, whereas for the MOS and for the Sequent machines it varies by a factor of 10 to 20. These large variations are probably due to large variation in bus contention for different runs on the MOS and Sequent machines and due to the limited amount of cache memory in those machines. We can now use Equation 13 and predict the processing time if the number of processors is the same as the square root of the number of grid points (namely $p = N$). This is the maximal number of processors to be utilized in our algorithm. We then obtain the information shown in Table 3.
We can see that in this case processing time depends on $N$ in a nonlinear manner. The average increase of processing time when $N$ is doubled is a factor of about 2.5 in both the aligned and the unaligned cases. The nonlinearity in the speedup measurements can be explained by the change in the number of processors being utilized in each V-cycle level in order to increase efficiency, as explained earlier (Table 3).

SUMMARY
We have presented and implemented an efficient algorithm for single grid and multigrid relaxation of a boundary value model problem on both shared bus and shared-memory MIMD and message-passing machines by using the VMMP software package. We do emphasize, however, that our algorithm can be modified accordingly for more complicated boundary value problems. The parallel single/multigrid algorithm is observed to fit these architectures well when $p$, the number of processors, is less than $N$, the maximal number of grid points along the axes.
For a large number of points the efficiency is over 90%. In general, efficiency is higher in the nonaligned case as compared with the aligned case. In the case of a single grid relaxation in the two-dimensional case when $p = N$, processing time increases nearly linearly with $N$. However, in the multigrid case, this dependence is observed to be nonlinear ($O(N^{1.3})$), mainly because of the change in the number of processors being utilized in each multigrid level (in order to increase efficiency).
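Both growth rates quoted in this summary can be checked with a short calculation. The first line assumes that the single grid model of Equation 12 has the form $T_{N,p} = n(tN^2/p + cN)$, which is a reconstruction from the symbol definitions, and sets $p = N$. The second line converts the observed factor of about 2.5 per doubling of $N$ into an exponent $\alpha$, a symbol introduced here only for this illustration.

```latex
\begin{align*}
T_{N,N} &= n\left(t\,\frac{N^2}{N} + cN\right) = n\,(t + c)\,N
  && \text{(linear in } N\text{)} \\
2^{\alpha} &\approx 2.5
  \;\Longrightarrow\;
  \alpha = \log_2 2.5 = \frac{\ln 2.5}{\ln 2} \approx 1.32
  && \text{(consistent with } O(N^{1.3})\text{)}
\end{align*}
```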

FIGURE 2 Schedule of grids: (a) V-cycle and (b) W-cycle, all on four levels.

FIGURE 5 Supported grid distribution strategies.

2. MMX: an experimental shared bus multiprocessor containing four processors running a simple run time executive [15].
3. ACE: an experimental multiprocessor developed at IBM T. J. Watson Research Center running the MACH operating system [16].
4. MOS: a shared-memory multiprocessor containing eight NS32532 processors running MOSIX, a distributed UNIX [17].
The Vgrid_update routine is used to exchange the overlapping boundary elements of the grid between neighboring processes. The Vcombine routine is used to compute an associative and commutative operation, such as maximum, on values provided by all processes.

FIGURE 6 The body shape in the aligned and unaligned case.

FIGURE 7 Coarse-to-fine interpolation; numbers indicate the corresponding weights.

FIGURE 8 Fine-to-coarse grid interpolation: the corresponding weights are the shown numbers divided by 16.

FIGURE 10 Overlap areas for 6 X 6 grid: (a) grid partitioning, (b) overlap data stored in processor local memory; internal rectangle resembles the body shape.
Multigrid Program:
    /* Init */
    Vprocessors()      /* get number of processors */

Relaxation function:
    /* get local part of grid and dummy points */
    Vcrowd_part()      /* compute local part */
    Vget_slice()       /* get local part of grid and dummy points structure */
    /* Relaxation */
    Repeat N times
        Compute relaxation result on local grid part
        Vsend()        /* send local grid boundary values to the adjacent processor */
        Vrecv()        /* receive local grid boundary values from the adjacent processor */
    Endrepeat
    Vset_slice()       /* prepare result of local grid relaxation for Vget operation */
End relaxation function
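The exchange of overlap rows in the relaxation function can be emulated serially to check that a strip-wise relaxation reproduces the result of relaxing the whole grid at once. The sketch below is not VMMP code: plain list copies stand in for the Vsend and Vrecv calls, the right-hand side is zero as in the article's test problem, and the function names, the value $\omega = 0.8$, and the test grid are hypothetical.

```python
def jacobi_sweep(block, omega):
    """One weighted Jacobi sweep (zero right-hand side) on a 2D block.

    The first and last rows and columns are treated as fixed data
    (physical boundary or overlap rows) and are copied unchanged.
    """
    rows, cols = len(block), len(block[0])
    new = [row[:] for row in block]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            average = 0.25 * (block[i - 1][j] + block[i + 1][j]
                              + block[i][j - 1] + block[i][j + 1])
            new[i][j] = (1.0 - omega) * block[i][j] + omega * average
    return new


def relax_in_strips(grid, bounds, omega, iterations):
    """Relax each strip separately, refreshing overlap rows every step.

    bounds holds (first_row, end_row) pairs of owned interior rows.
    """
    # Each strip stores its own rows plus one overlap row on each side.
    strips = [[row[:] for row in grid[lo - 1:hi + 1]] for lo, hi in bounds]
    for _ in range(iterations):
        # Emulated exchange of boundary rows between adjacent strips.
        for k in range(len(strips) - 1):
            this_strip, next_strip = strips[k], strips[k + 1]
            this_strip[-1] = next_strip[1][:]   # neighbor's first owned row
            next_strip[0] = this_strip[-2][:]   # this strip's last owned row
        strips = [jacobi_sweep(strip, omega) for strip in strips]
    result = [row[:] for row in grid]
    for (lo, hi), strip in zip(bounds, strips):
        result[lo:hi] = [row[:] for row in strip[1:-1]]
    return result


# Hypothetical 6 x 6 interior with a zero boundary, split into two strips.
grid = [[0.0] * 8 for _ in range(8)]
for i in range(1, 7):
    for j in range(1, 7):
        grid[i][j] = float((7 * i + 3 * j) % 5)

reference = grid
for _ in range(10):
    reference = jacobi_sweep(reference, 0.8)
strip_result = relax_in_strips(grid, [(1, 4), (4, 7)], 0.8, 10)
```

Because every owned point sees exactly the same neighbor values as in the undivided sweep, the two results agree, which is the property that lets each processor work on local data between exchanges.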

FIGURE 14 MOS processing time vs. number of processors graphs (0 processors stands for serial computation).
$F_{a,b}$ is the time needed to transfer a grid with $a \times a$ points to one with $(a/2) \times (a/2)$ points (or $(a/2) \times (a/2)$ to $a \times a$) using $b$ processors, and $f(p, l)$ is defined by Equation 11.

FIGURE 20 Message-passing processing time vs. number of processors graphs (0 processors stands for serial computation).
Domain partition for N = 18 and p (internal rectangle resembles the body shape).

In Equation 12, $N^2$ is the number of interior grid points, $p$ is the number of processors, $t$ is the computation time of one grid point in a relaxation step, and $c$ is the proportionality factor of the overhead time for data transmission among the processors (exchange of overlapped areas).

Table 1. c and t Values (in Seconds)
Table 2. z Values
FIGURE 18 Transputer message-passing multiprocessor structure.

Table 3. Predicted Processing Time Where p = N
For a body that is embedded in and unaligned with the grid net, processing time obviously increases as compared with the aligned case. In our case it is tripled.