An MPI-OpenMP Hybrid Parallel H -LU Direct Solver for Electromagnetic Integral Equations

the


Introduction
Computational electromagnetics (CEM), which is driven by the explosive development of computing technology and vast novel fast algorithms in recent years, has become important to the modeling, design, and optimization of antenna, radar, high-frequency electronic device, and electromagnetic metamaterial.Among the three major approaches for CEM, finitedifference time-domain (FDTD) [1], finite element method (FEM) [2], and method of moments (MoM) [3], MoM has gained widely spread reputation for its good accuracy and built-in boundary conditions.Generally, MoM discretizes the electromagnetic integral equations (IEs) into linear systems and solves them via numerical algebraic methods.Although the original MoM procedure forms a dense system matrix which is computationally intensive for solving large-scale problems, a variety of fast algorithms have been introduced to accelerate MoM via reducing both of its CPU time and memory cost.Well-known fast algorithms for frequency domain IEs, such as multilevel fast multipole method (MLFMM) [4][5][6], multilevel matrix decomposition algorithm (MLMDA) [7][8][9], adaptive integral method (AIM) [10], hierarchical matrices (H-matrices, H 2 -matrices) method [11][12][13][14][15], multiscale compressed block decomposition (MLCBD) [16,17], and adaptive cross-approximation (ACA) [18,19], have remarkably increased the capacity and ability of MoM to analyze radiation/scattering phenomena for electric-large objects.
Traditionally, iterative solvers are employed and combined with fast algorithms to solve the MoM.Despite of the availability of many efficient fast algorithms, there are still some challenges in the iterative solving process for discretized IEs.One major problem is the slow-convergence issue.Due to various reasons, such as complex geometrical shapes of the targets, fine attachments, dense discretization, and/or nonuniform meshing, the spectrum condition of the impedance matrix of discretized IEs can be severely deteriorated.Therefore, the convergence speed of iteration will be slowed down significantly so that we are not able to obtain an accurate solution within a reasonable period of time.In order to overcome this difficulty, preconditioning techniques are usually employed to accelerate the convergence.There are some popular preconditioners, including diagonal blocks inverse [20], incomplete Cholesky factorization [21], incomplete LU factorization (ILU) [21,22], and sparse approximate inverse (SAI) [23], used widely.However, the effectiveness of preconditioning techniques is problem dependent and the convergence still cannot be guaranteed.For some extreme cases preconditioning shows little alleviation to the original problem.
2 International Journal of Antennas and Propagation By contrast, direct solving like Gaussian elimination or LU factorization [24] spends fixed number of operations, whose impedance/system is only related to the size of impedance/ system matrix, namely, the number of unknowns , to obtain the accurate solution of MoM.Although these standard direct solvers are not favored over iterative solvers because of their costly computational complexity ( 3 ), a number of fast direct solvers (FDS) are introduced recently [14][15][16][17][18][19]25], which inherit the merit of direct solving for avoiding slow-convergence issue, and have significantly reduced the required CPU time and memory.
So far, there are some existing literatures discussing the parallelization of direct solvers for electromagnetic integral equations [26,27].It is notable that the implementation of parallelizing FDS is not trivial.Since the algorithms of major FDS share a similar recursive factorization procedure in a multilevel fashion, their implementations are intrinsically sequential in nature.However, they can be parallelized with good scalability by elaborate dissection and distribution.In this paper, we proposed an MPI-OpenMP hybrid parallel strategy to implement the H-matrices based direct solver with hierarchical LU (H-LU) factorization [15,28].This parallel direct solver is designed for running on a computer cluster with multiple computing nodes with multicore processors for each node.OpenMP shared memory programming [29] is utilized for the parallelization on multicore processors for each node, while MPI message passing [30] is employed for the distribution computing among all the nodes.The proposed parallel FDS shows good scalability and parallelization efficiency, and numerical experiments demonstrate that it can solve electrical-large IE problems involving nearly 4 million unknowns within dozens of hours.
The rest of this paper is organized as follows.Section 2 reviews the discretized IE we aim to solve and basic MoM formulas.Section 3 outlines the construction of H-matrices based IE system, including spatial grouping of basis functions and the partition of the impedance matrix, as well as the direct solving procedure of H-LU factorization.Section 4 elaborates the parallelizing strategy for the construction and LU factorization of H-matrices based IE system, with the discussion of its theoretical parallelization efficiency.Section 5 gives some results of numerical tests to show the efficiency, capacity, and accuracy of our solver.Finally, the summary and conclusions are given in Section 6.

Electromagnetic Integral Equation and Method of Moments
In this section, we give a brief review of surface IEs and their discretized forms [3] that our parallel FDS deals with.
In the case of 3D electromagnetic field problem, electric field integral equation (EFIE) is written as in which  is the wavenumber,  is the wave impedance, t is the unit tangential component on the surface of the object, and (r, r  ) is the free space Green function.E inc (r) is the incident electric field and J(r  ) is the excited surface current to be determined.Correspondingly, the 3D magnetic field integral equation (MFIE) is written as in which n is the normal component on the surface of the object, H inc (r) is the incident magnetic field, and P.V. stands for the Cauchy principal value integration.For time harmonic electromagnetic plane wave, E inc and H inc have the relationship where k is the unit vector of wave propagating direction.In order to cancel the internal resonance that occurred in the solution of individual EFIE or MFIE, the two equations are usually linearly combined.Specifically, we have the combined field integral equation (CFIE) written as To solve this IE numerically, first we need to mesh the surfaces of the object with basic patches.For electromagnetic IEs, triangular and quadrilateral patches are two types of frequently used patches, which are associated with Rao-Wilton-Gliso (RWG) basis function [31] and roof-top basis function [32], respectively.By using these basis functions, the undetermined J(r) in ( 1) and ( 2) can be discretized by the combination of basis functions f  (r), ( = 1, 2, . . ., ) as in which   is the unknown coefficients and  is the total number of unknowns.By using Galerkin test, we can form the  ×  linear systems for (1) and (2) as in which the system matrices Z EFIE and Z MFIE are also called impedance matrices in IEs context.The entries of the system matrices and right-hand side vectors can be calculated numerically through the following formulas: Consequently, the discretized CFIE system matrix and excitation vector are By solving the linear system we can get the excited surface current J and corresponding radiation/scattering field.

Fast Direct Solver and Hierarchical LU Factorization
Generally, we can solve the linear system (9) iteratively or directly.For iterative solvers, the Krylov subspace methods are frequently used, such as CG, BiCG, GMRES, and QMR [33].Because iterative methods often suffer from slow-convergence issues for complex applications, direct methods have become popular recently.Several researchers proposed a variety of FDS [14][15][16][17][18][19]25], which aim at avoiding convergence problems for iterative solvers and utilize similar hierarchical inverse/factorization procedure and low rank compression techniques to reduce the computational and storage complexity.
H-matrices method [11,12,15] is a widely known mathematical framework for accelerating both finite element method (FEM) and IE method.In the scenario of electromagnetic IE, H-matrices method is regarded as the algebraic counterpart of MLFMM [4][5][6].The most advantage of Hmatrices technique over MLFMM is that it can directly solve the IE system by utilizing its pure algebraic property.In this section, we briefly review the primary procedures of IE-based H-matrices method and the algorithm of hierarchical LU factorization which will be parallelized.
In order to implement H-matrices method, firstly the impedance matrix Z CFIE needs to be formed block-wisely and compressed by low rank decomposition scheme.We subdivide the surface of the object hierarchically and, according to their interaction types, form the impedance/forward system matrix in a multilevel pattern.This procedure is very similar to the setting-up step in MLFMM.By contrast to the octree structure of MLFMM, the structure in H-matrices is a binary tree.The subdivision is stopped when the dimension of smallest subregions covers a few wavelengths.The subblocks in the impedance matrix representing the far-field/weak interactions are compressed by low rank (LR) decomposition schemes.Here we use the adaptive cross-approximation (ACA) algorithm [18,19] as our LR decomposition scheme, which only requires partial information of the block to be compressed and is thus very efficient.
After the H-matrices based impedance matrix is generated, the hierarchical LU (H-LU) factorization [12,26,28] is employed on it and finally we get an LU decomposed system matrix LU (Z CFIE ).During and after the factorization process, subblocks in LR format are kept all the time.Having the LU (Z CFIE ), we can easily solve the IE with given excitations by a recursive backward substitution procedure.Figure 1 illustrates the overall procedure of the H-matrices based FDS introduced above.
From Z CFIE to LU (Z CFIE ), the hierarchical LU factorization is employed.The factorization algorithm is elaborated below.Consider the block LU factorization of the level-1 partitioned Z CFIE matrix; namely, We carry out the following procedures in a recursive way: ( Recursive implementations are participated among all procedures (i)-(v).Procedures (ii) and (iii) (and similar operations under the hierarchical execution of (i) and (v)) are performed via partitioned forward/backward substitution for hierarchical lower/upper triangular matrices [12,28].Procedure (iv) is a vital operation that contributes most of the computing time cost.This subblocks addition-multiplication procedure occurs not only in one or several certain levels, but pervades in all of the recursive executions.Figure 1 illustrates the overall procedure of H-matrices based FDS.Implementation details about Hmatrices compression and recursive LU factorization can be found in [12,14,15].In addition, a similar recursive direct solving process for noncompression IE system matrix is discussed in [26].
Based on the result of H-LU factorization, we can easily get the solution for any given excitation, namely, right-hand side (RHS) vector by hierarchical backward substitution.The CPU time cost for backward substitution is trivial compared to that of H-LU factorization.

Parallelization of H-Matrices Based IE Fast Direct Solver
The parallelization of H-matrices based IE FDS contains two parts: (a) the parallelization (of the process) of generating the forward/impedance matrix with H-matrices form; (b) the parallelization of H-LU factorization and backward substitution.In this section, we will elaborate the approaches for implementing these two parts separately.

Parallelization of Generating Forward/Impedance Matrix.
Since our FDS is executed on a computer cluster, the computing data must be stored distributively in an appropriate way, especially for massive computing tasks.The main data of H-matrices based IE FDS are the LR compressed forward/impedance and LU factorized matrices.In principle, for forward/impedance matrix, we divide it into multiple rows, and each computing node stores one of these row-blocks.This memory distribution strategy is shown in Figure 2. The width of each row is not arbitrary but in correspondence to the hierarchical partition of the matrix in certain level .Therefore, no LR formatted block will be split by the division lines that cross the entire row.In practice, the width of each row is approximately equal and the total number of rows is the power of 2, namely, 2  . is an integer which is less than the depth of the binary tree structure of H-matrices, which is denoted by .Each row-block is computed and filled locally by each MPI node it is assigned to.Since every subblock is an independent unit, the filling process for these row-blocks can be parallelized for all MPI nodes without any communication or memory overlap.Furthermore, for each MPI node with multicore processor, the filling process of its corresponding blocks can be accelerated by utilizing the OpenMP memory share parallel computing.In the case of MoM, OpenMP is very efficient to parallelize the calculation of impedance elements, since this kind of calculation only depends on the invariant geometrical information and electromagnetic parameters, which are universally shared among all nodes.

Parallelization of H-LU Factorization.
The H-LU factorization is executed immediately after the forward/impedance matrix is generated.Compared with the forward/impedance matrix filling process, the parallelization of H-LU factorization is not straightforward because its recursive procedure is sequential in nature.In order to put H-LU factorization into a distributed parallel framework, here we use an incremental strategy to parallelize this process gradually.
During the H-LU factorization, the algorithm is always working/computing on a certain block which is in LR form or full form or the aggregation of them, so that we can set a criterion based on the size of the block for determining whether current procedure should be parallelized.Specifically, assuming that the (total) number of unknowns is , we have about 2  nodes for computing; then if the maximal dimension of ongoing processing block is larger than /2  , MPI parallel computing with multiple nodes is employed; otherwise the process will be dealt with by single node.It should be noted that OpenMP memory share parallel is always employed for every node with multicore processor participating in computing.In the H-LU factorization process, OpenMP is in charge of intranode parallelization of standard LU factorization, triangular system solving for full subblocks, and SVD or QR decomposition for LR subblocks addition and multiplication, and so forth, which can be implemented through highly optimized computing package like LAPACK [34].
According to the recursive procedure of H-LU factorization, at the beginning, the algorithm will penetrate into the finest level of H-matrices structure, namely, level , to do the LU factorization for the top-left most block with the dimension of approximately (/2  ) × (/2  ).Since  ≥ , this process is handled by one node.After that the algorithm will recur to level  − 1 to do the LU factorization for the extended top-left block of approximately (/2 −1 )×(/2 −1 ) upon the previous factorization.If  − 1 ≥  remains true, this process is still handled by one node.This recursion is kept on when it returns to level  that  < , for which MPI parallel with multiple nodes will be employed.
For the sake of convenience, we use H-LU() to denote the factorization on certain block in the level  of Hmatrices structure.Recall the H-LU factorization procedure in Section 3. Procedure (i) is H-LU itself, so its parallelization is realized by other procedures on the finer levels.For the procedure (ii) and procedure (iii) we solve the triangular linear system as in which L, U, A, and B are the known matrices and L, U are lower and upper triangular matrices, respectively.X and Y need to be determined.There two systems can actually be solved recursively through finer triangular systems.For instance, by rewriting (11) as we carry out the following procedures: (i) get After solving processes of triangular systems is done, the updating for Z 22 , namely, the procedure (iv) of H-LU, is taking place.We here use UPDATE() to denote any of these procedures on level .Rewriting it as apparently, this is a typical matrix addition and multiplication procedure, which can be parallelized through the addition and multiplication of subblocks when it is necessary: In the final form of ( 16), the matrix computation for each of the four subblocks is an independent process and can be dealt with by different nodes simultaneously.When Z 22 is updated, the LU factorization is applied to Z 22 for getting L 22 and U 22 .This procedure is identical to the H-LU factorization for Z 11 , so any parallel scheme employed for factorizing Z 11 is also applicable for factorizing Z 22 .
The overall strategy for parallelizing H-LU factorization is illustrated in Figure 3 and ℋ-LU: recurring to upper levels  During and after H-LU factorization, the system matrix is kept in the form of distributive format as the same as that for forward/impedance matrix shown in Figure 2.Under the perspective of data storage, those row-blocks are updated by H-LU factorization gradually.It should be noted that, for the nodes participating in parallel computing, the blocks they are dealing with may not relate to the row-blocks they stored.Therefore, necessary communication for blocks transfer occurs among multiple nodes during the factorization.

Theoretical Efficiency of MPI Parallelization.
To analyze the parallelization efficiency under an ideal circumstance, we assume that the load balance is perfectly tuned and the computer cluster is equipped with high-speed network; thus there is no latency for remote response and the internode communication time is negligible.Suppose there are  nodes and each node has a constant number of OpenMP threads.The parallelization efficiency is simply defined as in which   is the processing time with  nodes.Apparently, for the process of generating forward/impedance matrix,   could be 100% if the number of operations for filling each row-block under -nodes parallelization O rowfill  is 1/ of the number of operations for filling the whole matrix O fill 1 by single node, since  1 ∝ O fill 1 and   ∝ O rowfill  = O fill / are true if the CPU time for taking single operation is invariant.This condition can be easily satisfied to a high degree when  is not too large.When  is large enough, to divide the system matrix into  row-blocks may force the depth of H-matrices structure inadequately that consequently increases the overall complexity of H-matrices method.In other words, Under this situation, the parallelization becomes inefficient.
For the process of H-LU factorization, by assuming that  = 2  , the total number of operations under -nodes parallelization O LU  is where H-LU(), H-TriSolve(), and H-AddProduct() denote the processes for LU factorization, triangular system solving, and matrices addition and multiplication in Hformat in the th level of H-matrices structure, respectively.Their computational complexity is (  ) (2 ≤  < 3), in which  = /.However, because of parallelization, some operations are done simultaneously.Hence, the nonoverlap number of operations O   that represents the actual wall time of processing is Additionally, the number of operations without parallelization for H-LU factorization O LU 1 is definitely no larger than O LU  for the potential increased overall complexity by parallelization.Therefore, the parallelization efficiency is When  goes larger, the result is approximately According to our analysis, we should say that, for a given size of IE problems, there is an optimal  for best implementation.If  is too small, the solving process cannot be fully parallelized.On the other hand, larger  will make the width of the row-blocks narrower that eliminate big LR subblocks and cause H-matrices structure to be flatter, which will consequently impair the low complexity feature of Hmatrices method.
Similarly, the number of OpenMP threads  has an optimal value too.Too small  will restrict OpenMP parallelization, and too large  is also a waste because the intranode parallel computing only deals with algebraic arithmetic for subblocks of relatively small size; excessive threads will not improve its acceleration and lower the efficiency.

Numerical Results
The cluster we test our code on has 32 physical nodes in total; each node has 64 GBytes of memory and four quad-core Intel Xeon E5-2670 processors with 2.60 GHz clock rate.For all tested IE cases, the discrete mesh size is 0.1, in which  is the wavelength.

Parallelization Efficiency.
First, we test the parallelization scalability for solving scattering problem of PEC spheres with different electric sizes.The radii of these spheres are  = 2.5,  = 5.0, and  = 10.0,respectively.The total numbers of unknowns for these three cases are 27,234, 108,726, and 435,336, respectively.Figure 4 shows the efficiency for the total time solved by 1, 2, 4, 8, 16, 32, 64, and 128 MPI nodes, with 4 OpenMP threads for each node.The definition of parallelization efficiency is the same as formula (17); thus in this test we only consider the efficiency of pure MPI parallelization.Unsurprisingly, for parallelization with large number of nodes, bigger cases have higher efficiencies because of better load balance and less distortion to the Hmatrices structure.Besides, since the H-LU factorization dominates the solving time consumption, these tested efficiencies of bigger cases are close to the maximum efficiency of theoretical one  LU  = 2/3, which demonstrates the good parallelizing quality of our code.
Next, we investigate the hybrid parallelization efficiency for different MPI nodes ()/OpenMP threads () proportion.We employ all 512 cores in cluster to solve the 3 PEC sphere cases above in four scenarios, from pure MPI parallelization (1 OpenMP thread for each MPI node, large /) to heavy OpenMP parallelization embedded in MPI ( 16OpenMP threads for each MPI node, small /).Here we slightly change the definition of parallelization efficiency as  nodes, 4 threads per node, the parallelization efficiency in this test is 0.737, which is higher than that of 0.562 in previous test.This is because, by our new definition of efficiency, the OpenMP parallelization is taken into account.These results also indicate the superiority of MPI-OpenMP hybrid parallelization over pure MPI parallelization.Figures 6 and 7 show the memory cost and solving time for different problem sizes under different MPI nodes/ OpenMP threads proportion.The memory cost complexity generally fits the ( 1.5 ) trend, which has also been demonstrated in sequential FDS [15,17,19].Since the parallelization efficiency gradually rises along with the increase of problem size as shown above, the solving time complexity is about ( 1.84 ) instead of ( 2 ) exhibited in sequential scenario [15,17,19].

PEC Sphere.
In this part, we test the accuracy and efficiency of our proposed parallel FDS and compare its results to analysis solutions.A scattering problem of one PEC sphere with radius  = 30.0 is solved.After discretization, the total number of unknowns of this case is 3,918,612.For hybrid parallelizing, 128 MPI nodes are employed, with 4 OpenMP threads for each node.The building time for forward system matrix is 4,318.7 seconds, H-LU factorization time is 157,230.2seconds, and solving time for each excitation (RHS) is only 36.6 seconds.The total memory cost is 987.1 GBytes.The numerical bistatic RCS results agree with the analytical Mie series very well, as shown in Figure 8.

Airplane Model.
Finally, we test our solver with a more realistic case.The monostatic RCS of an airplane model with the dimension of 60.84 × 35.00 × 16.25 is calculated.After discretization, the total number of unknowns is 3,654,894.The building time for forward system matrix is 3548.5 N 2 14  2   seconds; H-LU factorization time is 110,453.4seconds; and the peak memory cost is 724.3GBytes.With the LU form of the system matrix, we directly calculate the monostatic RCS for 10,000 total vertical incident angles (RHS) by backward substitution.The average time cost for calculating each incident angle is 31.2seconds.The RCS result is compared with that obtained by FMM iterative solver of 720 incident angles.Although FMM iterative solver only costs 112.4 GBytes for memory, the average iteration time of solving back scattering for each angle is 938.7 seconds with SAI preconditioner [23], which is not practical for solving thousands of RHS.From Figure 9 we can see that these two results agree with each other well.

Conclusion
In this paper, we present an H-matrices based fast direct solver accelerated by both MPI multinodes and OpenMP multithreads parallel techniques.The macrostructural implementation of H-matrices method and H-LU factorization is parallelized by MPI, while the microalgebraic computation for matrix blocks and vector segments is parallelized by OpenMP.Despite the sequential nature of H-matrices direct solving procedure, this proposed hybrid parallel strategy shows good parallelization efficiency.Numerical results also demonstrate excellent accuracy and superiority in solving massive excitations (RHS) problem for this parallel direct solver.

Figure 1 :
Figure 1: Overall procedure of H-matrices based FDS.

Figure 2 :
Figure 2: Data distribution strategy for forward/impedance matrix and also for LU factorized matrix.
. Before the LU factorization on level -H-LU() is finished, only one node  1 takes the computing job.Afterwards, two nodes  1 and  2 are used to get U 12 and L 21 simultaneously for H-LU( − 1).Then Z 22 is updated and factorized for getting U 22 and L 22 by the process H-LU().For triangular solving process on the  − 2 level, International Journal of Antennas and Propagation Update (K − 1) and Update (K − 2)

Figure 6 :
Figure 6: Memory cost for different MPI nodes/OpenMP threads proportion.

Figure 7 :
Figure 7: Solving time for different MPI nodes/OpenMP threads proportion.

Figure 8 :
Figure 8: Bistatic RCS of PEC sphere with  = 30.0solved by proposed hybrid parallel FDS.

Figure 9 :
Figure 9: Monostatic RCS of airplane model solved by proposed hybrid parallel FDS and FMM iterative solver.
11 and X 12 by solving lower triangular systems A 11 = L 11 X 11 and A 12 = L 11 X 12 , respectively; (ii) get X 21 and X 22 by solving lower triangular system A 21 − L 21 X 11 = L 22 X 21 and A 22 − L 21 X 12 = L 22 X 22 , respectively.From the above procedures we can clearly see that getting columns [X 11 , X 21 ]  and [X 12 , X 22 ]  represents two independent processes, which can be parallelized.Similarly, by refining Y as getting rows [Y 11 , Y 12 ] and [Y 21 , Y 22 ] represents two independent processes.Therefore, if the size of X or Y, namely, U 12 or L 21 in the H-LU context is larger than /2  , their solving processes will be dealt with by multiple nodes.
in which  (,) denotes the total solving time parallelized by  MPI nodes with  OpenMP threads for each node.The results are presented in Figure5.We can clearly see that both pure MPI parallelization and heavy OpenMP hybrid parallelization have poorer efficiency compared to that of moderate OpenMP hybrid parallelization.Ideally, larger / should give better efficiency for bigger cases.But, in reality, since larger / can result in more frequent MPI communication, more time is required for communication and synchronization, and so forth.Besides, large  also impairs the low complexity feature of H-matrices For all cases of this test, the size of smallest blocks is between 50 and 200, so their optimal number of OpenMP threads per node turns out to be 4.However, we can expect the optimal number of OpenMP threads to be larger (smaller) if the size of the smallest blocks is larger (smaller).For the case of solving PEC sphere of  = 10.0 with 128 thread per node 128 nodes, 4 threads per node 64 nodes, 8 threads per node 32 nodes, 16 threads per node O (N 2 ) O (N 1.84 )