A Parallel Adaptive Newton-Krylov-SchwarzMethod for 3 D Compressible Inviscid Flow Simulations

A parallel adaptive pseudo transient Newton-Krylov-Schwarz (αααNKS)method for the solution of compressible �ows is presented. Multidimensional upwind residual distribution schemes are used for space discretisation, while an implicit time-marching scheme is employed for the discretisation of the (pseudo)time derivative.e linear system arising from the Newton method applied to the resulting nonlinear system is solved by the means of Krylov iterations with Schwarz-type preconditioners. A scalable and efficient data structure for the αααNKS procedure is presented. e main computational kernels are considered, and an extensive analysis is reported to compare the Krylov accelerators, the preconditioning techniques. Results, obtained on a distributedmemory computer, are presented for 2D and 3D problems of aeronautical interest on unstructured grids.


Introduction
e aim of this paper is to provide an overview of the methods required for an efficient parallel solution of the compressible Euler equations (CEE) on unstructured 2D and 3D grids.e ingredients include space and time discretisation schemes for the underlying partial differential equations (PDEs), a nonlinear solver based on the Newton's method, and a parallel Krylov accelerator with domain decomposition preconditioners.
Nowadays, unstructured grids are of particular interest for industrial applications.With respect to structured grids, it is oen easier to produce unstructured grids of good quality in domains of complex shape, especially if the unstructured grid generator can be coupled to a CAD system.is can provide-at least in principle-a fast process to solve the problem at hand, with minimal intervention of the user, once the geometry of the domain and the boundary conditions have been speci�ed.However, such techniques imply an almost perfect geometric representation (CAD level) and automatic grid generation remains a challenge in computational engineering, especially for moving boundaries.
e framework presented here can be successfully applied to the solution of sets of PDEs problems discretised on unstructured grids.e focus in this paper is restricted to the solution of the CEE.In particular, the space discretisation technique considered is based on the so-called multidimensional upwind residual distribution (MURD) schemes; see for instance [1][2][3][4][5].ese schemes are supposed to render higher accuracy and less spurious numerical anomalies.is is due to the fact that they take into account the multidirectionality of the wave structure, rather than being based upon a summation of one-dimensional Riemann problems, [6,7].Firstly developed for a scalar advection equation, MURD schemes have been extended to homogeneous �ux hyperbolic systems such as the Euler equations.ey can be interpreted as Petrov-�alerkin �nite element methods, with compactstencil basis functions.is turns out to be of particular advantage for their parallel implementation, since both �rstorder and second-order in space schemes share the same communication pattern.
For compressible �ow simulations, we will consider here the Euler equations.e steady state solutions are found as the steady-state of their time-dependent version.e time derivative is discretised using the backward Euler method.
is leads to a nonlinear system to be solved at each time step.Hence, for these cases, time accuracy is not an issue.In fact, the pseudo-time derivative is added as a mean to obtain a "good" initial iterate for the Newton's method (which can even diverge for initial iterates too far-away from the solution).Moreover, only few steps of the Newton's method are considered, and the linear system with the Jacobian matrix is solved only approximatively.e Newton's method for the solution of this nonlinear system results in a large, ill-conditioned linear system with the system's Jacobian matrix.Direct methods appear to be too computationally expensive, especially in 3D.For our kind of applications, preconditioned iterative solvers of Krylov type are a prime choice and can also be associated with multigrid techniques, another challenging method, especially for compressible �ow simulations.A possible way to derive the preconditioner is to adopt a domain decomposition approach [8,9], and more in particular to use a Schwarz preconditioner.is means that the computational domain Ω is decomposed into  overlapping subdomains Ω  ,   1, … , , such that ⋃  1 Ω   Ω. en, a Dirichlet problem is solved in each subdomain.Since typically each subdomain is assigned to a different processor, this leads to a parallel preconditioner.Further, a coarse level is added, resulting in an overall-although cheap-communication among the subdomains.Here, to construct the coarse level, we consider algebraic procedures based on the concept of aggregation.
e resulting framework is usually addressed to as the Newton-Krylov-Schwarz framework, a quite generic procedure that can be applied to nonlinear systems of PDEs [10][11][12][13][14][15][16].e idea is to obtain (asymptotically) a secondorder convergence in time through the Newton iterations and a reasonable parallelism thanks to the use of a Krylov solver complemented with a Schwarz-type preconditioner.e resulting algorithm will be addressed to as NKS.
For an efficient and scalable implementation of NKS methods, parallel codes have to be developed, extending their sequential counterparts.Data, like grid structures, the Jacobian matrix, state variable, and work vectors, must be partitioned among the processors in order to minimise the amount of intraprocessor communications.e de�nition of a distributed data structure, wellsuited for all the computational phases, is the �rst problem to be addressed.�o that aim, all the main computational kernels of the NKS algorithm are analysed, and suitable algorithms are presented and implemented in order to make use of as much local data as possible, thus minimising communication.Note that robust and accurate methods are necessary to treat the large variety of regimes which may typically arise in aeronautics which rely on solving the Euler equations.e interest of these techniques is that many soware packages, such as PetSc and ISL++, offer these functionalities in a user-friendly way.
e paper is organised as follows.Section 2 describes the numerical solution of the compressible Euler equations and the NKS scheme.In Section 3 the data structure used for parallelisation is introduced.Section 4 outlines the main computational kernels of the NKS scheme.Section 5 gives some numerical results obtained on a SGI Origin 3800 for academic test cases as well as typical full aircra problems.Although this machine is nowadays replaced by Linux clusters, the scaling of the results for multiple processor clusters is signi�cant.Indeed, all techniques and results described here scale in exactly the same way on Linux clusters under MPI.Finally, Section 6 outlines the conclusions.

Numerical Solution of the Compressible Euler Equations
e compressible Euler equations describing nonviscous and nonheat conducting compressible �uid �ows in a domain Ω ⊂ ℝ 3 can be written as where   (,  1 ,  2 ,  3 , ) is the vector of conservative variables.For the applications of interest, (1) are posed on a bounded domain and completed by appropriate boundary conditions and initial condition (  )    (see for instance [17]).
In (1),  is the density,  1 ,  2 , and  3 are the  1 -,  2 -, and  3 -components of the velocity vector , and  is the speci�c total energy.e quantities    (  ,  +  , ,   ℎ),   {1, 2, 3} are homogeneous functions of , called �uxes.Here  , is the Kronecker symbol,  is the (static) pressure, and ℎ is the speci�c enthalpy.e gas is considered to be a calorically perfect gas [18,Chapter 2], therefore with C  and C  being the speci�c heat coefficient at constant volume and constant pressure, respectively.e space discretisation of (1) is conducted using MURD schemes, which are based on the quasilinear form as with e basis idea of MURD schemes is to compute the residual at element level, de�ned as en, the main idea is to redistribute this quantity to the nodes of the element, according to the choice of the scheme.We will not describe MURD schemes, referring the reader to, for example, [19][20][21][22][23][24] and the references therein.Here, we just give a very broad overview of the schemes.
At their beginning, MURD schemes have been developed for scalar equations.e extension to system schemes is straightforward if the system is diagonalisable.In this case, scalar schemes can be used for each scalar equation of the diagonalised system.For nondiagonalisable systems, like the CEE, three main orientations are identi�ed.One consists in the formal extension of the scalar scheme.is leads, for instance, to the system N-scheme [4].A second orientation was introduced by Roe in 1986 [25] and consists in decomposing the initial residual as a sum of simple wave solutions.e last one, initially proposed by Deconinck and coworkers, see [26], is an elliptic/hyperbolic splitting, which decompose the CEE into an acoustic subsystem of three equations (for 3D problems), plus two advection equations, one of them being the entropy advection.e last approach is followed here.It was shown in [27] that this decomposition can lead to numerical methods with minimal spurious entropy production.
MURD schemes are targeted to the solution of advectiondominated systems on unstructured grids because they are not constructed by concentrating on any particular direction of the grid.An advantage is that, at least for scalar equations, one can construct a fully second-order accurate scheme on triangular grids with a very compact stencil: only the nodes in the triangle are used in the evaluation of the �uctuation.Note that these schemes can be interpreted as Petrov-Galerkin �nite-element schemes, see [28,Section 4.1].
e spatial discretisation of (1) results in a system of nonlinear equations of type with a convenient initial condition ( = ) =   .In (6),  is a nonsingular (lumped) mass matrix.In particular, we seek for the the steady-state solution of system (6), which will be one of the roots of Note that the Newton method applied directly to system (7) does not suffice, since an initial iterate sufficiently near to the root is generally not available.is is particularly true for complex �ows containing shock waves and contact discontinuities, like the ones considered in this paper.Framing the steady-state equation into a time-dependent setting constitutes a sort of continuation process, which can be considered as an attempt to widen the domain of convergence of Newton method, or as a procedure to obtain sufficiently close starting points.System (6) is discretised using a backward Euler scheme.Starting from a given   , the solution at the pseudotime step  =    is found by solving the nonlinear system of equations or, equivalently, where   is the time increment at step .
A strategy for the de�nition of   is needed.As a general rule, one should keep the time step small until all the main �ow features are well resolved, then large time steps may be taken near to obtain superlinear or quadratic convergence of Newton's method.In this paper we have adopted the so-called exponential rule [29], that is,   = min   × ()   CFL max   (10) where   is the initial time step and    is a prescribed growing factor.Other strategies have been proposed in the literature, like the so-called switched evolution relation rule (see [29]) or the expert rule (see [30]).
A Newton iteration at time level  for the solution of (9) would then read with  =    being the index associated to the Newton procedure, and   =  − .e Newton iteration should stop when either a certain tolerance level is reached, or aer a �xed number of iterations.Although it is possible in principle to take more Newton correction iterates, convergence results from [31] show that, for steady-state computations, quadratic convergence can be eventually achieved using only one step of process (11), provided that a suitable time-step evolution strategy for   is found.erefore, in the following we are supposed to take just one Newton iteration.is leads to the following linear problem: that can be written, dropping the index , as where    × and      .More precisely, A preliminary analysis of the memory requirements and the computational time required by for direct  techniques suggests the use of an iterative solver of Krylov type for the solution of (13).
e problem with iterative methods is that their convergence rate depends on the spectral properties of the coefficient matrix, in particular the condition number () = ‖‖  ‖ − ‖, where ‖  ‖ is a matrix norm.To obtain a more favourable condition number, one can solve the modi�ed system  −  =  − , where  is a nonsingular matrix. should approximate  − as closely as possible, so that  −1   , while still being reasonably cheap to compute both in memory and CPU time requirements.Also, it should be optimal, that is,  −1  does not depend on the problem dimension, and scalable, that is,  −1  does not depend on the number of processor used.Indeed, one of the main aims of a preconditioner is to transform the system into a system such that the matrix  −1  is close to an identity matrix.
Many parallel preconditioners have been proposed in the literature.Here, we have resorted to preconditioners based on the domain decomposition (DD) approach.e basic idea of DD methods is to decompose the computational domain Ω into  smaller parts Ω  ,   1   , called subdomains, such that ⋃  1 Ω   Ω. Next, the original problem can be reformulated within each subdomain Ω  , of smaller size.is family of subproblems is coupled one to the other through the values of the unknown solution at subdomain interface.is coupling is then removed at the expense of introducing an iterative process which involves, at each step, solutions on the Ω  with additional interface conditions on Ω  ⧵ Ω.
We will focus on overlapping domain decomposition methods, also known as Schwarz methods.In these methods, the computational domain is subdivided into overlapping subdomains, and local Dirichlet-type problems are then solved on each subdomain.e communication between the solutions on the different subdomains is here guaranteed by the overlapping region.
ese DD methods are usually rather inefficient when used as solvers of the linear problem; however, they can be reformulated as efficient parallel preconditioners.To express the preconditioner, we need to �x some notation.�et   ∶ Ω → Ω  be a rectangular matrix which returns a vector de�ned on the nodes internal to Ω  from a global vector de�ned on Ω,         the local matrix, and �nally        −1    the correction on subdomain Ω  .Using this notation, the one-level (additive) Schwarz method can be written as is preconditioner is particularly well suited for the class of problems considered here.It is quite easy to implement and has reduced memory requirements.However, since   acts only locally, its scalability is hindered by the weak coupling between far away subdomains.Its performances degrade rapidly as the number of subdomains increases.
is behaviour is typical of all one-level preconditioners, which, by construction, are composed only by local corrections on the subdomains.In fact, all the state-ofart DD preconditioners consist of local correction and a global components.e local part, acting at the subdomain level, captures the strong couplings that appear between neighbouring subdomains, while the global part provide an overall-although inexpensive-communication among the subdomains.e global component is usually referred to as a "coarse space correction, " since it is usually de�ned on a space that is coarse with respect to the �ne space containing the solution.e complexity of this auxiliary problem is much lower than that of the original problem, and its role is to diffuse information among the subdomains.In an analogous manner to multigrid methods, this coarse space is used to correct the "smooth" part of the error, whereas the local preconditioner is used to dump the "high-frequency" part of the error.
In all generality, the global correction term has the form  0    0  −1 0  0 , where  0 corresponds to the discretisation of the original PDE problem on a coarse space  0 , and  0 is the restriction operator from the �ne space to the coarse space.e resulting preconditioner reads as is preconditioner is fully additive since all the corrections on the subdomains and on the coarse space are added together.Alternatively, an hybrid preconditioner can be used as Roughly speaking, preconditioner ( 17) is called hybrid because the corrections on the subdomains are treated in an additive way (the term  −1  +  0 ), as well as in a multiplicative way (the term  −1   0 ).For more details, the reader is referred to [9].
e key element to obtain a scalable and efficient preconditioner is the proper de�nition of the coarse space.e general approach is to discretise the original problem on a coarse grid.However, the construction of the coarse grid and of the corresponding restriction operator  0 can be difficult or computationally expensive for problems de�ned on unstructured grids in domain of complex shape, as typical in aeronautical applications.For this reason, here we consider agglomeration procedures to construct the coarse matrix for a two-level Schwarz preconditioner.e procedure is completely algebraic and very wellsuited for unstructured grids.
We now detail the procedure used to de�ne the coarse space  0 , referring the reader for more details to [28,32].Our approach is based on the concept of aggregation [33][34][35].On each subdomain Ω  we build  0 aggregates   ,   1   ,   1    0 .An aggregate is a set of contiguous vertices, see a 2D examples in Figure 1.e value  0  ∑  1  0 represents the dimension of the coarse space.is means that we split the degrees of freedom  0 among the  subdomains, for example, simply setting  0   0 /.Finally, inside Ω  , each aggregate is associated to a vector   ,   1    0 , whose elements are built following the rule e algorithm uses a simple procedure to build the operators  0 and   0 and to form the coarse matrix as F 1: Examples of aggregates for a 2D domain.e nodes marked with "■, " "○", and "•" belong to three different aggregates.
For a comparison between the de�nition of the coarse space using a coarse grid and using aggregation, the reader is referred to [36], or [28,Chapter 3], where the authors compare the two approaches for a model problem.
e development of a scalable parallel NKS code requires the de�nition of an efficient distributed data structure.is data structure must efficiently handle the preconditioning phase, which is probably the most important aspect of the NKS solver, as well as all the other computational kernels of the scheme.roughout our description, we make the following assumptions.
(i) e parallel computer is of multiple instruction multiple data (MIMD) architecture.
(ii) e parallel computer is assumed to use distributed memory architecture.In fact, this is not a restriction at all since it can be emulated on almost every type of parallel computers including shared memory computers.
(iii) Communications are achieved using a message passing interface, like MPI [37].is is not a limitation since message passing libraries are currently available on almost every type of architecture.
We also suppose that the starting grid, of moderate size, say, up to a million of elements for an inviscid solution, has been generated on a sequential computer.More re�ned grids will be obtained by the means of parallel adaptation techniques, as described in paper [38].e �rst step is the partition of the grid among the  processors.To that aim, we have used the -way graph partitioning algorithms [39].Given a graph     with ||  , we partition  into  subsets  1   2  …    such that   ∩    ∅, |  |   on average, and ⋃     , and the number of edges of  whose incident vertexes belong to different subsets is minimised.A k-way partition is commonly represented by a partition vector  of length , such that for every vertex   ,  is an integer between 1 and , indicating the partition to which vertex  belongs to.Given a partition , the number of edges whose incident vertexes belong to different subsets  is called the edge-cut of the partition.
e -way algorithm can be used to decompose the direct graph or the dual graph of the grid, leading to vertexoriented (VO) or element-oriented (EO) decompositions.In the former, each vertex of the grid is assigned to a different processor, while in the latter the decomposition is at element wise.A numerical comparison between the two approaches for the compressible Euler equations using explicit time-marching scheme can be found in [40], while results concerning implicit time-marching schemes can be found in [41].
We have followed a VO decomposition.As a result of the VO decomposition, each vertex of the computational grid is assigned to a unique processor, whereas there exist elements shared by more processors.ese elements are called cut elements-since their edges have been intersected-and they form a layer of elements that actually belong to more than one processor.e cut elements are stored on all processors which have at least one node of the element in their update set.For example, in Figure 2 the shaded elements are stored on both processor  1 and processor  2 .e nodes assigned to processor  form its update set, whose size is   .Clearly, ∑  1    ,  being the global dimension of the problem.e nodes belonging to the subdomain, but not to the update set, are called external nodes.e size of this set is   .
From the point of view of the parallel application, grid generation and graph partitioning result in  input �les [40]; each input �le contains a list of elements (grid connectivity and spatial coordinates of each vertex) and a list of the (physical) boundary nodes, grouped in the so-called patches.Each patch corresponds to a part of the boundary, like, for instance, the nacelle, the wing, or the symmetry plane, on which the same boundary conditions are imposed.e grid is represented in terms of element-based data structure.Also, a local-to-global mapping must be provided, to enable communication among the processors.
Aer convergence of the NKS solver, each processor writes the solution corresponding to its update set as a part of the global data.Oen, these partitioned data are F 2: Vertex-oriented decomposition.Example of the overlapping stripe and respective mapping onto processors p and q, update and external nodes.For processor  1 , "○" represents the internal nodes, "□" the border nodes, and "■" the external nodes.For processor  2 , "•" represents the internal nodes, "■" the border nodes, and "□" the external nodes.
merged into a single �le for global solution analysis and visualisation.However, certain packages also work directly on the partitioned solution and can also monitor the solution.We will not consider automatic visualisation issues in the following, since we have used stand-alone soware tools.

𝛼𝛼𝛼NKS Solver Main Operations Analysis
e main computational kernels of our solver are as follows: (1) construction of the distributed matrix; (2) vector updates and AXPY operations; (3) vector-vector products; (4) matrix-vector products; (5) application of the preconditioner.
In this section we analyse these operations.Let us focus on the construction of the distributed matrix.As typical, the �nite-element matrix is constructed by looping over all the elements of the local grid, to compute the elemental matrix  ( , corresponding to each element .On processor , the contribution of  ( will be considered only if the corresponding row is contained in the update set of .is means that  is constructed in a distributed way, that is, each row of  is stored on a unique processor.In an analogous way, every vector, say , is distributed: each processor will store only   , the vector composed by the components of  which correspond to its update set, and of size   . An AXPY operations (i.e.,      for  and  two given vectors and  a given real number) are done as follows.First, one has to verify that all processors have the same value of ; then, for all the elements of the local vector   , one do         .Apart from the synchronisation of the value of  (step which is oen avoided), no communication occurs in vector updates and AXPY operations.
e vector-vector product     on processor  reads as follows: (1) compute the local value         ; (2) sum up the local contributions of all the processors:     1   .
Note that communication is required only in the second step.�efore computing the �nal value of , all processors must have �nished the computations in the �rst step.For this reason, vector-vector product always act, in a parallel environment as synchronisation points and require, global communications.
In the matrix-vector product   , a processor  computes only the elements of  in its update set, that is, in   .Hence, the local parallel matrix-vector product reads as              (20) where   and   are the values of  corresponding to its update and external set, respectively,   is the submatrix representing the in�uence of update nodes on themselves, while   , the action of external nodes on update ones.From (20), the steps required to perform the parallel matrix product are (1) the exchange of boundary data, that is, the update the value of the external nodes   ; (2) the computation of the the local matrix-vector product        ; (3) the computation of the external matrix-vector product           .
Steps ( 1) and ( 2) can be overlapped.Note that an efficient implementation of step ( 1) is particularly important to obtain a scalable matrix-vector product.As a general rule, data to be sent to a processor should be packed and sent using only one MPI call to avoid latency problems occurring within the network.However, since the grid discretisation is unstructured, it is not trivial to determine the owner of an external set and, moreover, all the nodes belonging to a certain processor are not necessarily stored contiguously in the vector   .is operation of packing data into a buffer creates contiguous messages from irregularly spaced data structure and requires a preprocessing of the matrix, plus the de�nition of a certain number of data structures.
We want to underline the fact that for both �rst-order and second-order MURD schemes, the computational stencil of a generic vertex  is compact, that is, it only depends on information from vertexes that are one edge-distance away in the grid structure.erefore, the Jacobian matrix will have nonzero elements only for the contributions corresponding to the �rst-order neighbouring nodes even for second-order schemes.is simpli�es considerably the communication and allows the use of the same data structure for �rst-order and second space discretisation schemes.
Finally, let us consider the application of the Schwarz preconditioner.Using the VO decomposition outlined in Section 3, it is particularly easy to implement a one-level Schwarz preconditioner   with minimal overlap, that is, with an overlap among the subdomains of one element only.Referring again to Figure 2, the overlapping region among the subdomains is represented by the elements in shaded colour.
On processor , the computation of the preconditioned residual    −1    simply reads as Matrix   is already stored and formed in the local memory of processor .Note that Schwarz preconditioners with wider overlaps will require communication.In general, the optimal amount of overlap is clearly a compromise between the effectiveness of the preconditioner and its computational complexity.For this class of problems, we have found it convenient to use the minimal overlap version.is is equivalent to a block-Jacobi procedure, where the blocks have size   .An ILU(0) or BILU(0) factorisation (using a reverse Cuthill-McKee reorder) of the block matrices gives satisfactory results (in terms of CPU times) with respect to other "richer" incomplete factorisations [41].In fact, compared with other variants, ILU(0) is computationally fast and memory efficient.Now, let us focus on the two-level Schwarz preconditioner described in Section 2, with a coarse space based on aggregation.We show that this preconditioner makes use of information at subdomain level.To obtain a representation of matrix  0 , we order the set of nodes putting before the nodes in Ω 1 , and then the nodes of the domain subdomains in the other subdomains.Analogously, we order the degree of freedom of  0 .As a result,   0 ∈ ℝ  0 × has the following block structure: Note that each vector  , can be constructed at the subdomain level, using only the local information about the grid of Ω  (hence no communications).Very oen, the graph of the local matrix   can furnish the information required to produce the aggregates, and henceforth grid data are not required by the algorithm.is is notably the case of �niteelement codes with ℙ 1 basis function, as used in this paper.Note moreover that the vectors  , are not explicitly required by the algorithm.We use one integer vector , so that   represents the (local) degree of freedom associated to the aggregate  , .e vector  will be of size   +  , .First, communications occur to exchange the actual values of  , among the processors.en, in the aggregation procedure to compute the coarse matrix as in (19), no communication will be required to compute, on each subdomain, the rows of  0 corresponding to the degree of freedom of  0 contained in that subdomain.

Numerical Results
In this section the NKS solver presented previously is validated.e underlying Euler code we have used is called THOR, developed within the BRITE-EURAM project IDeMAS.e basic code was provided by the project coordinator, the von Karman Institute VKI (Belgium), and includes contributions from the partners, DASSAULT AVI-ATION (France), ALENIA (Italy), DASA/EADS (Germany), INRIA (France), and EPFL (Switzerland).THOR involves several high-level numerical schemes aiming to produce highly accurate multidimensional upwinding techniques, see [5].e graph partitioning is based on the public domain soware METIS [42] and its parallel counterpart ParMETIS [43].e linear systems involved are solved using the library AZTEC [44], developed at the Sandia National Laboratories (USA).e Schwarz algorithms and the Krylov solvers have been implemented by the authors.e numerical results have been obtained on a SGI-Origin 3800, equipped with 128 MIPS R14000 500 Mhz processors.Each processor has 512 Mbytes of RAM, with 32 Kbytes and 8 Kbytes of �rst level and second level cache memory, respectively.is type of machine can be replaced by Linux cluster-type machines, the results being signi�cant for both types of architectures.e parallel communication paradigm used is MPI [37].
ree test cases taken of aeronautical interest have been selected to illustrate the variation of the performance with the various parameters within the algorithm and validate the full NKS cycle.
In all the numerical tests,   indicates the one-level Schwarz preconditioner, with ILU(0) factorisations and minimal overlap among the subdomains. ,hybrid denotes the two-level Schwarz preconditioner, as detailed in Section 2. e local dimension of the coarse space (i.e., the number of aggregates assigned to each processor) is  0 /.e coarse problem is solved redundantly on each processor, using a direct LU decomposition.

Falcon
Aircra at  ∞  0.45 and   1 ∘ .We have studied the in�uence of the Krylov accelerator on a small 3D test case, namely, a Falcon aircra with a computational  grid composed of 45387 nodes and 255944 elements, with �ight conditions of  ∞ = 0.45 and  =  ∘ .is test case is representative of a complete aeronautical con�guration of an industrial airplane, with body, wing, and nacelles.is is the only subsonic test case.e mesh is quite coarse, and we have used this test case to validate the Krylov accelerators and the basic properties of the preconditioners (see also [23,45] for a more extensive analysis on this subject).Slip boundary conditions are imposed on the airplane.Since the problem is symmetric, as well as the boundary conditions, the mesh is represented by half a sphere, where we have imposed far-�eld and symmetry-plane boundary conditions.Referring to equation (10), we have used CFL 0 = .0, = , and CFL max = 0 6 .For this test case, only the �rst-order N-scheme for systems was used, since the aim of this test case was to validate the choice of the Krylov accelerator.
Figure 3 reports the norm of the residual (scaled by the norm of the initial residual) versus the linear system solver iterations, for the 10th time step.e accelerators considered are, Bi-CGSTAB, CGS, TFQMR, GMRES (25), and GMRES(60).e preconditioner is   .Figure 4 reports the iterations to converge to a tolerance on the relative residual of 10 −5 at each time iteration, using   ,  add , and  hybrid .From the �gure, it is evident that the CFL number has a strong in�uence on the iterations to converge.In fact, as the CFL number increases, the linear system becomes more skew-symmetric and less diagonaldominant, and hence more iterations are required to converge to the desired accuracy.Note that, for low values of the CFL number, the number of iterations to converge is almost the same with one-level and two-level methods.is may suggest to adapt the preconditioner for the problem to be solved, introducing the coarse problem only for certain values of the CFL number.

Modelling and Simulation in Engineering
For any combination of solver and preconditioner, convergence of the linear system solver is reached at any time step.GMRES is slightly faster than the other methods; it takes more iterations to converge, but the single iteration is faster, since it requires only one matrix-vector product, instead of two as in the other methods.However, it is quite sensitive to the dimension of the Krylov space, especially as the number of processor used in the computation increases.e convergence history at the 10th time step (when the CFL number is 10240) reveals that TFQMR has a quite erratic convergence, while the one provided by CGS is (unexpectedly) quite regular.Bi-CGSTAB performs quite well but it is still erratic.e differences among the Krylov accelerators are particularly evident as the number of processors grows.is may re�ect the fact that the Schwarz preconditioners with no coarse correction worsen their performances with many subdomains.
As a result of this analysis, GMRES appear to be the most interesting among the iterative schemes for our class of problems.is con�rms what was already pointed out in [23].e only drawback of GMRES is represented by the memory requirements.erefore, if it is the case to use Krylov subspaces of "low" dimension, one may consider other iterative schemes, like TFQMR or Bi-CGSTAB.
e results also show a reasonable scalability as illustrated in Figure 5. e whole convergence process is autoregulating; however, in general, 20-200 linear system iterations are performed 4 to 16 time levels as illustrated in Figure 6.
e initial grid is composed by 136767 nodes and 726713 elements.e �rst-order N-scheme for systems was used on the nonadapted grid, while a second-order blended scheme was used on the adapted grids.In all the computations, the �acobian matrix is formed starting from �rst-order space approximation to save computational time, because the evaluation of second order distribution residual is more time consuming with respect to �rst-order method.
In Table 1, the cost of the following adaptation sequence is reported.e complete solution is required at least 8 hours on 16 processors, and less than one hour on 64 processors.is result is mainly due to the speci�c architecture of the SGI-Origin, which allows a processor to use (when required) the memory of another processor.is is notably the case of the X29 test case when run with only 16 processors, due to the size of the problem; while by using 64 processors, then only the local processor memory was used.e NKS cycle is as follows.Starting from an initial grid  (0) , 2 steps of re�nement and dere�nement and structural optimisation gave the grid  (1) .en smoothing optimisation is performed on  (1) to obtain  (2) .en on  (2) large scale, re�nement and optimisation gave a �nal grid of 1.47 million nodes and 8.7 million elements,  (3) , which is a very interesting performance test case.
e effect of the smoothing optimisation can be seen in Figure 7. Figure 8 details the solutions on these �rst re�nements.e mesh re�nement criteria taken here are segment length and Mach gradients and are detailed in [38].
A comparison of iterations to converge and CPU times is reported in Table 2 for the �rst 14 iterations of the NKS procedure.In the table, iters indicates the sum of Krylov iterations at each backward Euler time step, and CPU time the sum of the (elapsed) CPU times (in seconds) to solve the Jacobian linear system.More detailed results about one-level and two-level Schwarz preconditioners are reported in Table 3.As previously noted, the CFL number has a strong in�uence on the convergence.One may note that  hybrid performs better than  add .For small problems, large dimension of  0 have to be preferred, while bigger problems re�uired a �ne tuning of  0 / to obtain the best numerical results.It is worth to note that the efficiency of the  hybrid preconditioner clearly outperforms   in terms of iterations, while as concerns CPU-times there is an improvement of about 10%.is is due to the following facts: (i)  hybrid requires a matrix-vector product to be performed, and this is computationally expensive; (ii) the solution of the coarse problem requires intensive communications among theprocessors.e �rst-order N-scheme for systems is used to obtain the solution on the �rst (nonadapted) grid, while the secondorder blended scheme is used for the adapted grids.As for the X29 test case, the �acobian matrix is formed using the �rstorder scheme.
e far �eld is represented by a half sphere with a radius of 12.5 root-chord length.e NKS procedure has been F 9: ONERA M6 wing.Pressure coefficient at 20% of wing span, using �rst-order and second-order schemes on a nonadapted grid (a), and second-order schemes on nonadapted and adapted grid (b).
performed as follows.e starting nonoptimised grid is composed by 94493 nodes and 666569 elements, and an initial constant solution is used.en, two steps of re�ning are performed.Each adaptation phase corresponds to at least two passes of adaptation (re�nement coarsening and optimisation).e �nal grid (aer two adaptation cycles) is composed by 585725 nodes and 3477090 elements.e pressure coefficient at 20% of the wing span is reported in Figure 9. e picture on the le shows the improvement from �rst-order to second-order schemes.
Second-order scheme captures the �rst shoc� wave, near the leading edge, with much greater precision than �rst-order schemes.e over-and under-shoots are sensibly reduced by using adapted grids, as one can notice from the picture on the right.
An analysis of the �ow �eld on the upper part of the wing is reported in Figure 10, while Figure 11 reports the nonadapted grid and the adapted grid.
As regards the convergence of the NKS scheme, as one can seen from Figure 12 that in the case of 16 processors,  the �rst-order N-scheme rapidly converges to steady-state solution very rapidly.In this phase, CFL 0 = 10,  = 0, CFL max = 10 6 , and analytical Jacobian are used.ese initial iterations are needed to place the shocks in a more or less correct position, so that they do not have to move consistently when second-order schemes are employed.For second-order schemes, we have used CFL 0 = 0,  = 0, and CFL max = 1000.
As already reported in [23,46], the iterative convergence of second-order schemes is poor; basically, for nonadapted meshes, the   density residuals stagnate around  to 3 orders of magnitude, where the residual is compared to the �rst iteration.e situation is slightly improved on adapted meshes, even if it was not possible to reduce the residual more than a certain factor, here about 10 −5 with respect to the �rst iteration on the nonadapted grid.

Conclusions
Multidimensional upwind schemes have reached a certain degree of maturity for the solution of the steady-state compressible Euler equations on unstructured grids made up of triangles (in 2D) and tetrahedrons (in 3D).ey can be used for complex aerodynamic �ows, in the subsonic, transonic, and supersonic regime.Nevertheless, it is clear that the the full potential of the approach may be revealed only if these methods could effectively lead to scalable parallel codes.In particular, this requires, among others, the de�nition and the implementation of a scalable preconditioner and of a parallel grid adaption technique to improve the solution accuracy whilst optimising the computational resources.To that aim, we have here presented the adaptive pseudotransient Newton-Krylov-Schwarz framework.
Most parallel implementation of MURD schemes are based on one-level Schwarz preconditioner and do not include automatically grid adaptation procedure.is paper aims to present a complete framework which couples these two aspects with the characteristic aspects of MURD schemes.
e main kernels of this algorithm have been analysed.A data structure, based on a vertex-oriented decomposition of the grid, has been described.is data structure allows an efficient implementation of all the main computational kernels.Of particular importance is the de�nition of the preconditioner.One-level and two-level Schwarz preconditioners have been presented.e coarse matrix is constructed automatically and for any computational grid with no input from the user (except for the matrix  and the dimension of the coarse space).is simplicity has its roots mainly in the way the restriction and interpolation operators are de�ned.e results show an improvement of performance with respect to the one-level Schwarz method, which is the preconditioner usually adopted in the literature for multidimensional upwind residual distribution schemes.
Numerical results, obtained on distributed memory computers, are presented for large scale computations in the sense of complexity and mesh adaptation on parallel machines.
Results show that the conjunction of the Krylov-Schwarz approach with a grid adaptation procedure leads to an effective solution of the compressible Euler equations on unstructured grids.e same techniques apply to the Navier-Stokes system of equations solved by similar equivalent �nitevolume techniques on unstructured grids [18,47].

8 F 3 :
FALCON aircra.Convergence history at the 10th time level using 32 processors and various Krylov accelerators.

F 4 :
FALCON aircra.Iterations to converge at each time iterations, using   ,  add and  hybrid , and 16 processors. 0 denotes the dimension of the coarse space for the aggregates.

F 5 :
FALCON aircra.Relative speedup curves taken at the 10th time level.

F 6 :
FALCON aircra.Comparison of number of iterations per time level for 32 processors.

F 7 :
X29 aircra.Before and aer optimisation and smoothing.In particular, the symmetry plane grid on the le picture shows an excessive neighbour number elements.is is no longer present in the right picture, aer optimisation and smoothing.

F 8 :
X29 aircra.�oom over part of the re�nement and Mach number distribution on the body for the grid ℳ 1 .T 1: X29 aircra.

F 10 :
ONERA M6 wing.From le� to right, Mach number contours for: a �rst-order solution and a second-order solution on the nonadapted grid, and a second-order solution on the adapted grid.

F 11 :F 12 :
ONERA M6 wing.Nonadapted grid (a) and adapted grid (b).ONERA M6 wing.Convergence history of the NKS scheme.On the -axes, CPU time, in seconds.On the -axes, the logarithm of the (Euclidean) norm of the density residual.
Time for successive adaptations using 8 and 16 processors.X29 aircra.Comparison of   and  hybrid using different number of processors for the non-adapted grid, is made up of 136767 nodes and 726713 elements.Both CPU-time and iterations refer to the �rst 14 time iterations of the backward Euler method.For the coarse correction  0 /  .
ONERA M6 Wing at  ∞  0. and   3.0 ∘ .e last test case is an ONERA M6 wing at  ∞  0. and   3.0 ∘ .It represents a ONERA M6 wing, a widely used test case for 3� �ows from subsonic to transonic regimes.e selected transonic case is  ∞  0. and   3.0, which results in several shock waves over the wing.In the �ow-�eld, a shock originates at approximatively 0% of the wing span.A second leading-edge shock is caused by the 30% sweep angle of the M6 wing.Near the tip of the wing, a lambda-shock structure is observed.
T 3: X29 aircra.Comparison among different preconditioners for different values of the CFL number.16, 32 and 64 processors have been used in the computations.