Comparison of message-passing and shared memory implementations of the GMRES method on MIMD computers

Joanna Płażek a, Krzysztof Banaś a and Jacek Kitowski b,c,∗
a Section of Applied Mathematics FAPCM, Cracow University of Technology, Warszawska 24, 31-155 Cracow, Poland
b Institute of Computer Science, University of Mining and Metallurgy, al. Mickiewicza 30, 30-059 Kraków, Poland
c Academic Computer Center CYFRONET, ul. Nawojki 11, 30-950 Kraków, Poland
∗ Tel.: +48 12 617 3964; Fax: +48 12 338 054; E-mail: kito@uci.agh.edu.pl


Introduction
Architectural details of parallel computers make it possible to define a diversity of machine taxonomies (see for example [10,28,51]); however, one of the most important factors is the organization of the address space. Between the two extremes, i.e., the shared address space organization and the distributed memory architecture, there is a wide class of machines with virtually shared, physically distributed memory organization (often called Distributed Shared Memory, DSM, machines). DSM can be implemented in software or hardware; the latter offers better characteristics and comes in several typical classes, like cc-NUMA (cache coherent non-uniform memory access), COMA (cache-only memory architecture) and RMS (reflective memory systems). At present cc-NUMA implementations are commercially the most popular. According to Flynn's taxonomy [17] they belong to Multiple Instruction/Multiple Data (MIMD) computers. Examples come from HP (Exemplar and SuperDome with two interconnection layers) and from SGI (Origin2000/3000 with fat hypercube topology). A similar approach is implemented in IBM RS/6000 SP computers with SMP nodes.
Parallel computing adopts two classical programming models - message passing and the shared address space. The former (explicit) is often treated as an assembler for parallel programming, while the latter (implicit) simplifies the programming process and has recently been applied to networks of workstations [29]. For the experienced user it is easier to obtain better parallel efficiency with the message passing model. Since advanced multiprocessors and clusters are constructed from SMP nodes, the choice between the programming models is not obvious, and some integration of multiprocessing and multithreading would be profitable in the future. Due to faster communication within the SMP nodes than between the nodes, implementing DSM in the entire program would probably not be the best choice (since DSM systems are specially designed for implicit programming). A better way is to adopt this kind of programming within the SMP nodes, while using explicit programming between the nodes. Such an approach is consistent with present trends in high performance computing, in which the dominant architecture will be clusters of SMPs in their many variants [46].
Most programming problems have several possible parallel solutions. A popular design methodology is to consider distinct stages: machine-independent stages including partitioning and communication, and machine-specific stages that deal with agglomeration and mapping. This kind of design is often called the PCAM methodology [18].
For the Finite Element (FE) Method applied to the numerical approximation of nonlinear partial differential equations, both implicit and explicit approaches can be adopted, combined with iterative solvers for systems of linear equations and domain decomposition preconditioners. The solvers are usually implemented with imperative programming, which offers high performance. Newly appearing approaches make use of the object-oriented paradigm implemented in the C++ language and/or the network paradigm represented by Java (e.g. [22,31,42,52]).
Overlapping domain decomposition methods are iterative in nature, thus solutions on the individual subdomains can be combined together to form the overall solution. Iterative solvers can also be applied on any individual subdomain, which leaves room for further development of hierarchical or multilevel algorithms, some of them published to date, e.g. [32,34,47].
In this paper we compare different parallel implementations of the same algorithm for solving nonlinear simulation problems on unstructured meshes using the adaptive finite element approach. For this purpose we developed a parallel adaptive finite element algorithm in a version designed to solve the compressible Euler equations of inviscid fluid flow. The general purpose part consists of finite element data structure routines, including mesh modification procedures and solvers. For solving systems of linear equations, which in the case of fluid flow problems are nonsymmetric, the GMRES method combined with an overlapping domain decomposition preconditioner is applied [37].
To take full advantage of parallelization, efficient implementations of domain decomposition (including load balancing and minimizing the communication cost) and of the solver should be taken into account. In this paper, however, we are interested in different parallel implementations of the same algorithm rather than in sophisticated methods of domain decomposition (which are kept simple for this study), thus dealing with the machine-specific issues of the PCAM methodology.
We introduce two levels of parallelization to the algorithm. With one-level parallelism, either explicit or implicit programming can be applied. With two-level parallelism, a hybrid model is considered to use SMP nodes effectively. The paper describes the essential extensions to the algorithm, which has already been partially presented [34][35][36].
The goal of the paper is to compare the performance of different parallel models that implement the same mathematical algorithm.

Finite element algorithm
As a test bed for parallel implementations of GMRES, an h-adaptive finite element code designed to solve the compressible Euler and Navier-Stokes equations has been used. The code implements standard stabilized finite element algorithms (Streamline Diffusion [1,20,21,44] with pseudo time stepping for steady state problems or modified Taylor-Galerkin [12] for transient problems). The approximation to the Euler equations is split into a sequence of solutions to implicit finite element problems, with a system of non-symmetric linear equations solved at each time step. Adaptivity is based on a refinement/derefinement technique [11] with the use of explicit residual error indicators for compressible flow problems [13]. The details of the finite element algorithm can be found in [3,4,6,7]. This rather simple algorithm, still of practical significance, demonstrates flexibility for explicit, implicit and hybrid implementations. It contains the most important ingredients found in modern finite element solvers for CFD, like error based adaptivity, Krylov space solvers and domain decomposition preconditioning. From the point of view of parallelization, the techniques developed for this particular algorithm can be easily transferred to other solvers.
The standard finite element procedures for solving an implicit, linear problem consist of creating element stiffness matrices and load vectors, assembling them into a global stiffness matrix and a global load vector, and then solving the resulting system of linear equations. The latter task is often performed by a separate, general purpose library procedure or by specialized algorithms (e.g. [40]). The parallelization of such a solver is done independently of the finite element mesh and the problem to be solved.
In the reported case the solvers are built into the algorithm and use a particular data structure related to the finite element mesh. It is assumed that the finite element module of the code provides the solvers with element stiffness matrices and load vectors, together with information on element connectivities. The connectivity information, in the form of lists of element nodes and, for each node, the elements sharing that node, is sufficient for assembling the global stiffness matrix and the load vector (also in the case of irregular meshes with hanging nodes). The code assembles and stores the global stiffness matrix in a special data structure related to the decomposition of the computational domain into small subdomains. These small subdomains, called patches of elements, consist of only several elements (with one or more internal nodes). There is a patch data structure created for each patch, consisting of the corresponding blocks of the stiffness matrix and the load vector. When each node of the finite element mesh belongs to the interior of only one small subdomain (for standard meshes this corresponds to a one element overlap between subdomains), then the blocks of the stiffness matrix in patch structures do not overlap. In that case the storage scheme is just a version of block compressed row storage. Otherwise, it provides a useful extension of it for the case of overlapping subdomains.
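A patch structure of the kind described above can be sketched in C as follows; the field and function names are assumptions for illustration only, since the paper does not list the actual declarations:

```c
#include <stdlib.h>

/* Hypothetical sketch of a patch structure: each patch stores the lists of
   its nodes and elements together with the corresponding block of the
   global stiffness matrix and load vector. */
typedef struct {
    int n_nodes;      /* nodes of the patch (internal + interface)      */
    int n_internal;   /* internal nodes owned by the patch              */
    int *nodes;       /* global node numbers                            */
    int n_elems;      /* elements contributing to the patch             */
    int *elems;       /* global element numbers                         */
    int ndof;         /* degrees of freedom of the local problem        */
    double *stiff;    /* dense ndof x ndof block of the stiffness matrix */
    double *load;     /* local load vector, length ndof                 */
} patch_t;

patch_t *patch_create(int n_nodes, int n_internal, int n_elems, int ndof)
{
    patch_t *p = malloc(sizeof *p);
    p->n_nodes = n_nodes;
    p->n_internal = n_internal;
    p->n_elems = n_elems;
    p->ndof = ndof;
    p->nodes = calloc(n_nodes, sizeof(int));
    p->elems = calloc(n_elems, sizeof(int));
    /* calloc zero-initializes the matrix block, ready for assembly */
    p->stiff = calloc((size_t)ndof * ndof, sizeof(double));
    p->load  = calloc(ndof, sizeof(double));
    return p;
}
```

With non-overlapping patches, the collection of such blocks is exactly a block compressed row storage of the global matrix; with overlap, the same layout simply stores the overlapping blocks redundantly.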
The patch data structure is directly used in the solver algorithm, which is built around block iterative methods. Patch stiffness matrices form the stiffness matrices of the local, subdomain problems, the building blocks of the algorithm.
The GMRES algorithm [39,41,48,50] is one of the most successful and widely used iterative methods for solving nonsymmetric systems of linear equations. It is especially suited for nonlinear or time dependent problems where the whole simulation is split into a sequence of separate step problems. The solution from the previous step problem then forms a perfect candidate for the initial guess in GMRES iterations.
The performance of GMRES depends on the conditioning of the system of equations, so usually it is used with left or right preconditioning. In the case of left preconditioning, instead of solving the original system of equations Ax = b (with A the global stiffness matrix, x the vector of unknown degrees of freedom and b the global load vector), the preconditioned system M^-1 Ax = M^-1 b is solved (see e.g. [6]). The preconditioning matrix M should be easily invertible and designed in such a way that the preconditioned system has better convergence properties than the original one (the condition number of the product M^-1 A should be close to one). The preconditioned GMRES algorithm is presented schematically in Fig. 1.
Since the convergence of GMRES was not the subject of investigation, the number of Krylov space vectors N_ksv was set to 10, a small arbitrary number often found in large scale simulations. The initial guess x_0 was formed by the solution from the previous time step.
Preconditioning of the GMRES algorithm can be efficiently achieved using any of the basic iterative methods, such as the standard Jacobi and Gauss-Seidel or the block Jacobi and block Gauss-Seidel methods [19].
We have implemented the preconditioning using block Jacobi and block Gauss-Seidel algorithms where all operations in the GMRES involving the global stiffness matrix A are performed by means of loops over local (block) problems.
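One sweep of the block Jacobi preconditioner can be sketched as below. This is a minimal illustration, not the paper's code: the blocks are given as explicit index lists, the global matrix is kept dense for brevity, and each local problem is solved by Gaussian elimination as described in the next section.

```c
#include <stdlib.h>

/* Solve a small dense system from the augmented matrix a (n rows, n+1
   columns, last column is the right-hand side) by Gaussian elimination. */
static void gauss_solve(int n, double *a, double *x)
{
    for (int k = 0; k < n; ++k)
        for (int i = k + 1; i < n; ++i) {
            double f = a[i*(n+1)+k] / a[k*(n+1)+k];
            for (int j = k; j <= n; ++j)
                a[i*(n+1)+j] -= f * a[k*(n+1)+j];
        }
    for (int i = n - 1; i >= 0; --i) {
        double s = a[i*(n+1)+n];
        for (int j = i + 1; j < n; ++j) s -= a[i*(n+1)+j] * x[j];
        x[i] = s / a[i*(n+1)+i];
    }
}

/* One block Jacobi sweep z = M^-1 r: for every block (patch) b the
   diagonal block A_bb is extracted and A_bb z_b = r_b is solved; the
   per-block contributions are combined additively (additive Schwarz).
   Block b has bsize[b] unknowns with global indices bidx[b]. */
void block_jacobi_sweep(int n, const double *A, const double *r,
                        int nblocks, const int *bsize, int **bidx, double *z)
{
    for (int i = 0; i < n; ++i) z[i] = 0.0;
    for (int b = 0; b < nblocks; ++b) {
        int m = bsize[b];
        double *loc = malloc((size_t)m * (m + 1) * sizeof(double));
        double *zb  = malloc(m * sizeof(double));
        for (int i = 0; i < m; ++i) {
            for (int j = 0; j < m; ++j)
                loc[i*(m+1)+j] = A[bidx[b][i]*n + bidx[b][j]];
            loc[i*(m+1)+m] = r[bidx[b][i]];
        }
        gauss_solve(m, loc, zb);
        for (int i = 0; i < m; ++i)
            z[bidx[b][i]] = zb[i];   /* non-overlapping blocks assumed */
        free(loc); free(zb);
    }
}
```

A block Gauss-Seidel variant would differ only in using the already updated entries of z while looping over the blocks (the multiplicative Schwarz case discussed below).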
In the default setting of the solver (used in all reported computational examples) the local problems have minimal dimension, with patches having only one internal node. This option corresponds to blocks in standard iterative methods related to the unknowns at each finite element node and has proven to be the most efficient for the Taylor-Galerkin method [3].

Parallelization with explicit programming model
The GMRES algorithm preconditioned by standard iterative methods can be interpreted as a domain decomposition algorithm [30,47]. It reflects the general framework of Schwarz methods: divide the computational domain into subdomains, solve the problem for each subdomain separately and combine the solutions. The role of subdomains is played by patches of elements, and single iterations of the iterative methods accelerated by GMRES are just inexact solutions of subdomain problems. Depending on the preconditioner we have either an additive Schwarz method (block Jacobi preconditioner - contributions from different subdomains are summed up) or a multiplicative Schwarz method (block Gauss-Seidel preconditioner - contributions are taken into account while looping over patches).
Since the number of patches is usually much larger than the number of processors, a second level of domain decomposition is introduced, equivalent for finite elements to a suitable mesh partitioning. The number of subdomains at this level (submeshes) is equal to the number of computing nodes of the parallel machine. The subdomains possess a one element overlap, so each node in the finite element mesh belongs to the interior of only one subdomain. Using the message passing programming model, data corresponding to a particular subdomain are sent to and stored in the memory of one computing node. The computing node corresponding to a given subdomain (storing its data) is responsible for all computations related to the subdomain (computing element stiffness matrices, assembling them into patch matrices, solving local problems).
Having distributed the data structure, the computations of the GMRES algorithm begin. Some vector operations, such as the scalar product or normalization, require communication between computing nodes and are done by standard message passing procedures using a selected computing node to gather the data. Other operations, like subtraction or multiplication by a scalar, are done entirely in parallel. The GMRES minimization problem is solved on a chosen computing node.
Iterations of the block Jacobi method are done in two steps. First, the loop over all internal FE nodes of the corresponding subdomain is executed by each computing node separately. Patches of elements are created and for each patch (i.e. finite element node) a local problem is solved by a direct solver (e.g. Gaussian elimination).
Then all subdomains exchange data on interface nodes. Due to the one element overlap between subdomains, each node is updated during computations by only one computing node (corresponding to the subdomain owning the node). Hence the exchange of data after iteration loops is simplified - the computing node owning a given interface node just broadcasts its data to the neighboring subdomains.

Mesh partition algorithm
The optimal partition of the mesh should minimize the execution time of the whole problem solved by the finite element method. An equal load across processors is usually imposed on the domain decomposition algorithm. The partition determines two important factors: the convergence properties of the iterative method used to solve the system of linear equations [38,47] and the amount of data exchanged between computing nodes during the solution procedure. The former factor favors subdomains with regular shape and bigger overlap. The influence of the latter depends additionally on the architecture of the parallel system, especially the bandwidth of data transfer between processing nodes.
As noted in the introduction, in this paper we are interested in the parallel implementation of the solver rather than in domain decomposition methods. For this purpose we introduce a simple heuristic partitioning algorithm that, nevertheless, has several advantages. It enables the creation of subdomains with an arbitrary number of nodes and handles h-adaptive meshes with hanging nodes efficiently. In the code it is used for creating both types of subdomains: small subdomains for block iterative preconditioning and large subdomains for parallel execution. A slightly simplified version of the algorithm, with all technical complications caused by the presence of constrained (hanging) nodes omitted, is presented in Fig. 2. The principle, standard for greedy-type algorithms, of adding the nearest neighbor to the created subdomain is combined with an intermediate step of creating a front (group) of nodes that are candidates for addition to the subdomain. The presence of heuristic weights at nodal points allows for different mesh partition strategies, based on geometrical principles. The first node in a given subdomain becomes a reference point, and subdomains are then created trying to minimize the distance of the subdomain boundary from this point. Choosing different definitions of distance in 2D space gives different strategies for mesh partition, different shapes of subdomains and, as a consequence, different sizes of inter-subdomain boundaries. The choice of the strategy which results in the minimal number of elements shared by the processors follows from qualitative knowledge of the solution and the local density of mesh points. In Fig. 3(b)-(g) several partitions of the mesh from Fig. 3(a) are presented with the corresponding distance definitions. The expanse of white between the partitions represents the shared elements. We also implemented the greedy Farhat criterion [14,15] of adding the node which belongs to the minimum number of elements. In all examples the total number of 8441 nodes is distributed almost uniformly among the subdomains, each of which possesses more than one thousand internal nodes. The initial front for all cases is the same and consists of the lower left corner node.
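The greedy front-based scheme of Fig. 2 can be sketched as follows. This is a simplified illustration under stated assumptions: names are invented, hanging-node handling and the overlap construction are omitted, and the node weight is taken as the squared Euclidean distance from the subdomain's reference point (one of the strategies discussed above).

```c
#include <stdlib.h>

/* Simplified greedy front-based mesh partitioner (sketch, assumed names).
   Nodes are assigned to subdomains of at most ns_max nodes. The node-to-node
   adjacency is stored in CSR form (adj_ptr, adj); x, y are node coordinates.
   On return part[i] holds the subdomain number of node i. */
void greedy_partition(int n_nodes, const double *x, const double *y,
                      const int *adj_ptr, const int *adj,
                      int ns_max, int *part)
{
    for (int i = 0; i < n_nodes; ++i) part[i] = -1;
    int assigned = 0, sub = 0;
    while (assigned < n_nodes) {
        /* first unassigned node seeds the subdomain and is its reference point */
        int seed = 0;
        while (part[seed] != -1) ++seed;
        double rx = x[seed], ry = y[seed];
        int *front = malloc((adj_ptr[n_nodes] + 1) * sizeof(int));
        int nf = 0, ns = 0;
        front[nf++] = seed;
        while (ns < ns_max && nf > 0) {
            /* pick the front node nearest to the reference point */
            int best = 0;
            for (int i = 1; i < nf; ++i) {
                double di = (x[front[i]] - rx) * (x[front[i]] - rx)
                          + (y[front[i]] - ry) * (y[front[i]] - ry);
                double db = (x[front[best]] - rx) * (x[front[best]] - rx)
                          + (y[front[best]] - ry) * (y[front[best]] - ry);
                if (di < db) best = i;
            }
            int v = front[best];
            front[best] = front[--nf];
            if (part[v] != -1) continue;      /* duplicate front entry */
            part[v] = sub; ++ns; ++assigned;
            /* unassigned neighbors become candidates */
            for (int e = adj_ptr[v]; e < adj_ptr[v + 1]; ++e)
                if (part[adj[e]] == -1) front[nf++] = adj[e];
        }
        free(front);
        ++sub;
    }
}
```

Replacing the squared Euclidean distance with another metric (or with Farhat's element-count criterion) changes only the selection step, which is how the different strategies of Fig. 3(b)-(g) arise.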
We developed a simple strategy for load balancing [7] which is used together with the heuristic strategy for the definition of distance in 2D space. The load balancing strategy exploits the fact that in the mesh partition algorithm we can specify the number of nodes for each subdomain. In the presented GMRES implementation, the workload for a given computing node is proportional to its number of FE nodes. Load balance is then achieved by ascribing to each computing node a subdomain with a number of FE nodes proportional to the power of the computing node. As an estimate of computing node performance, the inverse of the average time for computing local stiffness matrices is taken. The information on the performance is sent as input to the procedure which specifies the numbers of nodes for the particular subdomains. These numbers form the input for the mesh partition algorithm. Combining the strategy for load balancing with the strategy of distance definition, one is able to get a decomposition which is useful for solver testing.
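The balancing rule above reduces to a short computation; the sketch below (assumed names, not the paper's code) turns the measured stiffness-matrix assembly times into per-subdomain node counts, with the rounding remainder given to the last computing node:

```c
/* Sketch of the load-balancing rule: computing node k receives a subdomain
   with a number of FE nodes proportional to its power, estimated as the
   inverse of the average time t_stiff[k] for computing local stiffness
   matrices. */
void balance_node_counts(int n_fe_nodes, int n_procs,
                         const double *t_stiff, int *n_sub)
{
    double total_power = 0.0;
    for (int k = 0; k < n_procs; ++k)
        total_power += 1.0 / t_stiff[k];
    int assigned = 0;
    for (int k = 0; k < n_procs; ++k) {
        n_sub[k] = (int)(n_fe_nodes * (1.0 / t_stiff[k]) / total_power);
        assigned += n_sub[k];
    }
    /* hand the integer-rounding remainder to the last computing node */
    n_sub[n_procs - 1] += n_fe_nodes - assigned;
}
```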
To retain the efficiency of the algorithm for adaptive meshes, for which the number of nodes and elements grows considerably, the loops in Fig. 2 can be performed for initial mesh elements and the corresponding nodes only. In that case adding an element to a subdomain consists of adding the element and all its ancestors with all their nodes. Additional techniques are introduced to maintain the one element overlap between subdomains. As a result, the speed of the algorithm depends only on the parameters of the initial mesh.

Implementation
The code is written in the C language and uses typical C features like structures and dynamic memory allocation. The structures are used for representing node, element and patch data. Access to the data is realized through an array of pointers to the structures. A characteristic feature of the present algorithm is the existence of patch structures, each storing all data (including lists of nodes and elements, assembled stiffness matrices and load vectors) for the corresponding patch.
The exchange of data between subdomains is based on arrays storing, for each subdomain, lists of nodes for which data is sent to or received from the neighboring subdomains.
The practical realization of the whole computation in the explicit model uses the master-slave paradigm. There exists one master process that controls the solution procedure and slave processes that perform most of the calculations in parallel. The flow diagram of the whole simulation is presented in Fig. 4. Since the implementation of the explicit programming model makes use of the PVM system for message passing between different processing nodes, it is denoted by PVM in the following sections.
Adaptation of the mesh can be performed between any two time steps of the algorithm. Usually, for simulations of transient problems adaptations of the mesh are frequent, while for steady state problems only several adaptations are required in the whole solution procedure. Partitioning of the mesh is performed after mesh adaptation. This may mean that different mesh partition strategies are optimal for these two types of problems.
In the current setting of mesh adaptation all the domain information must fit in the memory of the master processor. To mitigate this limitation, a fast refinement-derefinement technique is used for mesh modification, reducing the time for adaptation to the range of several percent of the total computing time. Nevertheless, despite the fact that the expensive error estimation is done perfectly in parallel, the algorithm is expected to scale only up to the range of several dozen processors. Beyond that, special parallel adaptation strategies have to be designed.

Parallelization with implicit programming model
Compilers for parallel computers parallelize codes automatically by analyzing loop dependencies. Such an approach results in sufficient efficiency for simple loops only. For example, a subroutine call inside a loop prevents its automatic parallelization. To overcome these problems, compiler directives/pragmas are additionally used to increase the degree of parallelization and to enable manual control of many aspects of execution.
In the implicit programming model, the GMRES algorithm sketched in Fig. 1 (written in the C language in a similar way to the explicit one) is parallelized using compiler directives whenever a loop over patches, nodes, or individual degrees of freedom is encountered. In particular the parallelization is applied for:
- loops over blocks of unknowns at the step of block construction,
- computation of element stiffness matrices and assembly into patch stiffness matrices,
- iterations of the block method.
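The directive-based approach can be illustrated with a minimal sketch. The paper used the vendor-specific directives of the Convex/HP and SGI compilers; here OpenMP syntax stands in for them, and the function and its placeholder local solve are assumptions for illustration:

```c
/* Sketch of a directive-parallelized block iteration: the iterations of
   the loop over patches are independent, so a single pragma distributes
   them over the threads of an SMP node. A trivial diagonal solve stands
   in for the real local factorization. */
void block_iteration(int n_patches, int ndof_per_patch,
                     const double *r, double *z)
{
    #pragma omp parallel for
    for (int p = 0; p < n_patches; ++p) {
        for (int i = 0; i < ndof_per_patch; ++i)
            z[p * ndof_per_patch + i] = r[p * ndof_per_patch + i] / 2.0;
    }
}
```

Without the pragma the loop runs serially and produces the same result, which is exactly why loops over patches, nodes, and degrees of freedom are convenient targets for this model.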
Since the implicit programming model makes use of distributed shared memory of the multiprocessors, its practical implementation is denoted by DSM in the following sections.

Parallelization with hybrid programming model
A hybrid programming model is a combination of the explicit and implicit models. In this model, suitable for a machine with SMP nodes (or for a cluster of multiprocessor workstations), we introduce two levels of parallelization:
- first-level parallelization with the pure explicit programming model, making use of geometrical data decomposition into subdomains; the number of subdomains is equal to the number of SMP nodes coordinated by a message passing library,
- second-level parallelization with the implicit programming model for each SMP node, which makes use of the shared memory for performing iterations of the block method.
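The two-level structure can be sketched schematically as below. This is only an illustration with assumed names: the outer loop stands in for the message-passing level (in the real code each subdomain is handled by a separate PVM process on its own SMP node), while the inner directive-parallel loop is the shared-memory level over the patches of that subdomain; a trivial placeholder again replaces the local solve.

```c
#include <stddef.h>

/* Schematic two-level hybrid iteration (sketch, assumed names):
   outer loop = explicit (message-passing) level over subdomains,
   inner loop = implicit (shared-memory) level over patches. */
void hybrid_iteration(int n_subdomains, int patches_per_sub,
                      const double *r, double *z)
{
    for (int s = 0; s < n_subdomains; ++s) {      /* explicit level */
        const double *rs = r + (size_t)s * patches_per_sub;
        double *zs = z + (size_t)s * patches_per_sub;
        #pragma omp parallel for                   /* implicit level */
        for (int p = 0; p < patches_per_sub; ++p)
            zs[p] = rs[p] / 2.0;                   /* placeholder local solve */
    }
}
```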

Results
We tested our parallel versions of GMRES (preconditioned with the block Jacobi method) on an example flow simulation - a well known transient benchmark problem - the ramp problem [54]. A Mach 10 shock traveling along a channel, perpendicular to its walls, meets at time t = 0 s a ramp forming an angle of 60 degrees with the channel walls. A complicated flow pattern develops, with double Mach reflection and a jet of denser fluid along the ramp behind the shock. The mesh after initial adaptation is shown in Fig. 3(a), which qualitatively reflects the solution.
The problem is solved with the Taylor-Galerkin method. Adaptations of the mesh are done every five steps. Since the flow is inviscid, a simplified version of the element error indicator is used:

e = h || (U^{n+1} - U^n)/∆t + ∂f^E_k/∂x_k ||_H

where h is the linear element size, U^n and U^{n+1} are the vectors of conservation variables at times t_n and t_{n+1} (∆t = t_{n+1} - t_n) and f^E_k are the vectors of Eulerian fluxes. The Euclidean vector norm is weighted by the Hessian of the entropy function H with respect to U [4].
The time step length is controlled by the constant CFL number equal to 2 [5].
In the explicit parallel programming model (denoted by PVM) we used a version of domain decomposition (mesh partition) algorithm that ensures vertical alignment of subdomain interfaces.
The results refer to the wall-clock execution time, T, for one time step chosen as representative of the whole simulation, with different meshes (having N = 4474, 16858, 18241 and 73589 nodes, denoted in the following figures by s4t, s16t, s18t and s73t respectively). To verify the performance of the algorithm for different meshes and different programming models, we fixed the number of restarted GMRES iterations, l, that keeps the convergence index, ε, at an acceptable level. We used ε = ||r_l||, where r_l = x_{l-1} - x_l and x_{l-1}, x_l are the solution approximations in iterations l-1 and l respectively.
In Fig. 5 the dependence of the convergence index on the number of restarted GMRES iterations, l, is presented for three meshes. Since ε ≤ 10^-10 for l ≥ 5, l = 5 was adopted.
To get statistically more reliable results, the measurements were collected three times from 5 subsequent time steps, since fluctuations in T of several percent were observed.

Performance of the mesh partition algorithm
The final partition produced by the algorithm of Fig. 2 depends, for a given computational domain, on the initial front and on the way nodes are selected from the front. On the basis of knowledge of the solution and the local density of mesh nodes, from the many possibilities presented in Section 3 we implemented three for explicit programming, to compare their efficiency.
The initial front for all cases is the same and consists of the lower left corner node. Table 1 shows the number of interface nodes (representing the communication cost) for the strategies presented in Figs 3(b) and (d), supplemented with Farhat's greedy strategy [14,15]. These values were obtained for the ramp problem with 18241 nodes.
It can be noticed that, for the particular domain considered, the heuristic strategy presented in Fig. 3(d) results in the smallest number of interface nodes, minimizing the communication requirements. This result remains valid also if the mesh is refined, because remeshing is done mainly in the region of high mesh density, which does not influence the mesh density profiles much.

Performance of the parallel GMRES
Three parallel machines have been used: HP SPP1600 (32 PA7200/120MHz processors organized in 4 SMP hypernodes, with SPP-UX 4.2, Convex C 6.5 and Convex PVM 3.3.10), HP S2000 (16 PA8000/180MHz processors in one hypernode, using SPP-UX 5.2, HP C 2.1 and PVM 3.3.11) and SGI Origin2000 (32 R10000/250MHz processors, IRIX 6.5, SGI C 7.3 and SGI PVM 3.1.1). We ran the code on an isolated SPP1600 subcomplex consisting of 16 processors from four SMP nodes. During the measurements on S2000 and on Origin2000 (subsequently called S2K and O2K respectively) no other users were allowed to use the machines.
In Figs 6(a) and (b) CXpa profiles of the wall-clock execution time, T, for one simulation step with the implicit parallel programming model (denoted by DSM) are reported. CXpa is a performance analysis tool for Convex and HP parallel computers [55] for monitoring program performance at user-selectable source code regions, such as routines, loops, and compiler-generated parallel loops. Collected data include wall-clock/CPU time, execution counts, cache miss counts and latency of memory access. In our case the profiles represent the time spent by the program in the most time consuming program modules. Seven execution threads were declared for a rather small problem with 4474 FE nodes. For execution with one thread, T = 48 s (see Fig. 6(a)), compared to T = 19 s obtained for seven threads (cf. Fig. 6(b)). Since the results concern program threads with procedures including children (e.g. the main() procedure includes time spent in all other routines), it can be noticed that sufficient parallel performance is obtained. The most involved modules are the main() procedure, dumpin() for reading input data, adapt() for performing mesh adaptation and gmressol() for solving a finite element problem using the preconditioned GMRES method. The last mentioned procedure calls other modules, like assebj(), which computes element stiffness matrices and assembles them into patch stiffness matrices using the celm(), elem(), mcoeff(), kijnat() and mat4prodffc() modules, and bijacit(), which performs one block Jacobi iteration. Due to the rather high CXpa overhead the results can be interpreted only qualitatively.
In Fig. 7(a) the wall-clock execution time, T, for different numbers of processors, K, for different machines and programming models is presented for the mesh with N = 16858 nodes. The architecture of SPP1600 has influenced the results of the DSM implementation, while results from the explicit model remained undistorted by the lower bandwidth between the hypernodes. Better scalability is observed for the explicit model in comparison with DSM, although the latter results are obtained with relatively small programmer's effort. For 30 Origin2000 processors and the PVM implementation a characteristic saturation becomes evident, while the timing remains monotonic for DSM. Results for the hybrid implementation (denoted by PVM+DSM) lie between those obtained for PVM and DSM. The difference in performance between the explicit and implicit models is visible also for the bigger mesh (Fig. 7(b)), where no characteristic saturation is observed for the PVM implementation, due to the higher computation to communication ratio resulting from higher data locality. In all cases the DSM implementation shows worse performance in comparison with PVM. Although the machines are shared memory computers, the DSM hardware exhibits different access times between different levels of memory (i.e. non-uniform memory access). This fact introduces differences in latency and memory bandwidth that reduce the performance.
The PVM implementation takes advantage of the PVM system adopted by the vendors.No additional knowledge of the architecture has been included.The architecture details influence the hybrid implementation (PVM+DSM), which is more flexible in choice of communication and computation granularities, according to the surface-to-volume ratio [18].
Relative speedup values, S, are depicted in Fig. 7(c). Good scalability is obtained for the message-passing model, with better characteristics for Origin2000. Apart from the interval 16 < K < 28 (with the biggest mesh, N = 73589, influenced again by the different levels of memory), the DSM implementation demonstrates higher speedup on Origin2000 than on S2000. In Fig. 7(d) we present the wall-clock execution time normalized by the number of mesh nodes, T/N. Since the characteristics are very close to each other, this confirms experimentally the linear computational complexity O(N) of the algorithm.
In Fig. 8(a) timings for Origin2000 are collected. For a small mesh (N = 4474), a minimum is observed due to the small computation to communication ratio. The scalability gets better as the ratio increases. The wall clock time distribution for each of the first 5 time steps with the PVM implementation is presented in Fig. 8(b). It reflects the uncertainty in wall clock time measurements (due to operating system overhead), which confirms the need for multiple measurements.

Conclusions
In this paper we have focused on parallel implementations of an algorithm employing the GMRES method. The DSM implementation demonstrates useful and scalable parallelization. No PCAM methodology is applied, making the development easier, at the price of a suitable choice of data structures and program control flow. In particular, this implementation is adequate for a rather small number of processing nodes, although this limitation depends on the architecture of the computer system and the problem size. In our case, a practical limit is K = 4 processors (for SPP1600 and N = 16858) and K = 16 (for Origin2000 and N = 73589). No performance degradation is observed on S2000, because only one (SMP) hypernode was available.
The PVM implementation still demonstrates higher performance and speedup characteristics closer to the ideal case, at the price of a more complicated code structure. In the development stage, application of the PCAM methodology is useful for obtaining an efficient implementation. In our case, in spite of the use of the simple partitioning method, no essential limitation of this implementation is encountered. The biggest production run, with N ≈ 375000, completed for the PVM implementation on the 32-processor Origin2000 in a multiuser environment, proved sufficient practical performance.
The PVM implementation is less sensitive to the communication bandwidth than DSM. Changing from execution on one SMP node to execution on multiple SMP nodes, the former is only weakly affected, while the latter suffers performance degradation. The PVM implementation is suitable also for clusters of computers. In this implementation the PCAM methodology should be carefully applied, considering partitioning and communication models as well as implementation issues of atomic task agglomeration and mapping onto the architecture [18]. On machines with a vast number of processors used for solving big problems, difficulties with load balancing and with keeping the computation to communication ratio high could be encountered. On the other hand, the DSM implementation, although easy to develop and efficient for a small number of processors, would suffer from low performance and excessive resource consumption as the problems get bigger. Thus neither of the specific implementations is universal.
Since most modern machines with a vast number of processors implement different layers of memory, and also because the dominant architecture for the future is expected to be clusters of SMPs [46], the hybrid model, which makes use of both implementations, would be the choice for those kinds of machines. If necessary, in this model both domain and functional partitioning methods can be implemented simultaneously. The present study shows the possibility and feasibility of using the hybrid approach for parallelization of finite element, and hopefully other, scientific codes. To fully assess the advantages of this approach against the explicit model, more numerical studies on systems consisting of a large number of SMP nodes should be performed.
Future work could also consider several topics, like tuning the solvers for each type of architecture, and coding in parameters that estimate processor performance, memory access speed and the interconnection bandwidth, to be used later for further solver optimization. Other possibilities include the implementation of more sophisticated partitioning methods, load-balancing procedures and parallel adaptation strategies.
In practice, to take advantage of parallel implementations on tightly coupled architectures, more processors have to be available. Otherwise, in multiuser environments with a vast number of users, performance loss is observed.

Fig. 2. The mesh partition algorithm (NS_n - the current number of nodes in a subdomain, NS_max - the maximal allowable number of nodes in a subdomain).

Fig. 4. Diagram of the whole simulation in the explicit programming model (indicating master and slave processes).

Fig. 5. Dependence of the convergence index on the number of restarted GMRES iterations.

Fig. 6. CXpa profiles (SPP1600) of wall-clock execution time for one simulation step with the implicit parallel programming model for: (a) one and (b) seven threads with children, N = 4474.

Fig. 7. Wall-clock execution time for one simulation step, different programming models and machines: (a) the mesh with N = 16858 nodes, (b) with N = 73589 nodes. (c) Speedup for different meshes (N = 16858 and 73589 nodes). (d) Normalized T for N = 16858, 18241 and 73589 nodes.
Fig. 8. (a) Timings for Origin2000; (b) wall clock time distribution for the first 5 time steps (PVM implementation).

Table 1. Number of interface nodes for different mesh partition strategies.