An Efficient GPU-Based Out-of-Core LU Solver of Parallel Higher-Order Method of Moments for Solving Airborne Array Problems

The parallel higher-order method of moments (HoMoM) with a GPU accelerated out-of-core LU solver is presented for analysis of radiation characteristics of a 1000-element antenna array over a full-size airplane. A parallel framework involving MPI and CUDA is adopted to ensure that the procedures run on a hybrid CPU/GPU cluster. An efficient two-level out-of-core scheme is designed to break the bottleneck of bothGPUmemory and physical memorywhen solving electrically large and complex problems. To hide communication time between CPU and GPU, asynchronous communications are chosen to enable overlapping between communication and computation. For large problems that cannot fit in GPU memory or physical memory, the two-level out-ofcore LU solver is able to achieve a speedup of about 1.6x over the traditional out-of-core LU solver based on a highly optimized math library.


Introduction
The method of moments (MoM) can provide highly accurate results for a wide variety of complex electromagnetic (EM) problems [1,2].However, it requires a lot of memory and calculation time when solving large dense matrix equations using the lower/upper (LU) decomposition based direct solver.Because of these restrictions, the application of MoM for simulation of extremely large problems, such as an airborne array, is seriously limited.To break these difficulties, researchers have been employing several approaches.One is the fast algorithms based on MoM, for example, the fast multipole method (FMM) [3,4], multilevel fast multipole algorithm (MLFMA) [5,6], and adaptive integral method (AIM) [7].Generally speaking, these approaches reduce the memory requirement and the computation complexity compared with MoM and get more accurate results compared with the hybridization of MoM with high frequency methods [8,9].However, the fast algorithms may not be suitable for the ill-conditioned equations because of the use of iterative solvers, which may converge very slowly for models including complex structures and various materials.An alternative approach is to employ the higher-order basis based integral equation solver, which can greatly reduce the number of unknowns compared to use of the traditional piecewise basis functions, such as the Rao-Wilton-Glisson basis functions (RWGs) [2,10].Meanwhile, the out-of-core technique of the higher-order MoM (HoMoM) has been adopted to improve the capability of MoM to deal with larger complex EM problems, which is limited only by the capacity of hard-disk storage [2].
Along with the latest developments on computer technology, the use of high performance computing and graphics processing units (GPUs) acceleration techniques can significantly improve the computing speed.Early GPUs were mainly for image processing, until the compute unified device architecture (CUDA) programming was proposed in 2006 [11].Then the GPUs were usually utilized to improve the performance of the EM methods, such as MoM and its fast algorithms [12][13][14][15] and the Finite Difference Time Domain (FDTD) [16,17].However, there is a serious imbalance between GPU memory and the storage required by MoM for solving electrically large problems.(k) E (k)  inc , H (k)   inc inc , H (i) (j) In [18,19], an out-of-core scheme between GPU memory and CPU memory (RAM) was presented to break the limitation of the GPU memory and to improve the capability of the parallel LU solver.Compared with a single core, a speedup of 15 can be obtained.In [20], an efficient out-of-GPU memory scheme for solving matrix equations generated by MoM was presented, and it also obtains a good speedup on a single CPU/GPU platform.In [21,22], an out-of-core scheme between RAM and hard-disk drives (HDD) using RWGs and higher-order basis functions (HOBs) was adopted to break the limitation of RAM and improve the capability of parallel MoM to solve larger EM problems.In [23], the parallel MoM was utilized to solve a full-size airplane with nearly one million unknowns using the out-of-core LU solver on a high performance cluster with 1 TB RAM and 16 TB hard-disk space.In [24], a GPU-based parallel out-of-core LU solver of HoMoM is presented, but the procedure was developed on a single CPU/GPU platform.
High performance computing and GPU acceleration are an effective way to improve the computational capability of the MoM.They were analyzed for the parallel HoMoM in our previous studies [25][26][27][28][29].In this paper, an efficient GPU-based out-of-core parallel LU solver of the HoMoM using message passing interface (MPI), which runs on a high performance cluster with multiple CPU/GPU computing nodes, is presented to solve the complex EM problems.The radiation patterns of a wire antenna array with about 1000 elements over an airplane are simulated to demonstrate the acceleration performance of the GPU-based solver.The proposed solver has two distinguishing characteristics as follows: (1) an efficient two-level out-of-core scheme is presented to overcome the restriction of the RAM and GPU memory; (2) an overlapping scheme based on the asynchronous data transfer technique and CUDA streams is adopted to hide the communication time between CPUs and GPU cards.

Higher-Order Method of Moments
The Poggio-Miller-Chang-Harrington-Wu (PMCHW) integral equations [2,30,31] are employed for modeling metallic and dielectric structures, and higher-order polynomials are used as the basis functions, which will be briefly reviewed.

Integral Equations.
Consider that a model consists of +1 types of materials, each of which is assigned to a region.In the th region ( = 1, . . ., ), the permittivity and the permeability are  () and  () , and the incident electric and magnetic fields are E ()  inc and H () inc , respectively, as shown in Figure 1(a).Note that region 0 is reserved for perfect electric conductors (PECs), in which the electromagnetic field is zero.According to the equivalence theorem, the PMCHW formulation [2,26] at the boundary surface between regions  and  is expressed as where, as shown in Figure 1(b), n is the unit normal vector; E ()  scat and H ()  scat are the scattered electric and magnetic fields in region , which can be computed by using the  and  integral operators [2]; J sik and M sik are the electric and magnetic currents on the boundary surface between regions  and .In region , the fields and currents are similarly represented.On boundary surfaces of PECs, the magnetic currents vanish, and (1) degenerates into the electric field integral equation (EFIE) [32].the higher-order basis functions (HOBs) are defined in the polynomial form [2],

Higher-Order Basis
where p and  are local coordinates of the patch,  and  are orders of polynomials, and r(, ) is the parametric equation of the bilinear patch, which is illustrated in Figure 2.
The polynomial orders can be adjusted according to the electrical size of the geometric element.The orders become higher as the element increases larger.The electrical size of the geometric element can be as large as two wavelengths.Compared with the widely used RWG basis functions, the use of HOBs can reduce the number of unknowns by a factor of 5-10 and thus requires much less computational amount and memory.This kind of basis functions is also applicable for wire structures modeled by truncated cones.

GPU-Based Out-of-Core LU Solver
The parallel LU solver and the out-of-core scheme between CPU memory and HDD are briefly introduced firstly.Then the GPU-based two-level out-of-core scheme is discussed in detail.Finally, an overlapping scheme is designed to save the data transfer time between GPU memory and CPU memory.

Parallel LU Solver.
The LU solver is a numerical accurate method in solving matrix equations.There are two decomposition algorithms: left-looking and right-looking algorithms, where the left-looking algorithm requires less I/O operation between RAM and HDD [2].For the parallel LU solver, the data should be distributed to multiple processes in a certain way and a load balanced approach called two-dimensional (2D) block-cyclic matrix distribution, which is similar to the ScaLAPACK, is adopted.
As shown in Figure 3, consider a matrix A with 9 × 9 elements, which is distributed using a 2 × 2-size blocks to 6 processes in a 2 × 3 process grid using 2D cyclic boundary condition.The outermost numbers denote the row and column indices of the process coordinates.The left-looking algorithm is mainly considered in this paper, and there are two primary phases in the left-looking LU solver with pivoting neglected: the active block column updates and its factorization.When it is factorized, next block column will become the active column.And this process is repeated until the LU decomposition is completed.Note that the block columns (slabs) to the left of the active column are required in the update process, as denoted by gray color in Figure 4(a).
In Figure 4(a), the green part is called an active block column and the gray part has been factorized already.The data dependence of the active block column on its left block columns is illustrated by red arrows in Figure 4.When the RAM cannot store the whole matrix, an efficient solution is to use the storage of the HDD, as shown in Figure 4(b).

Out-of-Core Scheme between CPU Memory and HDD.
In the case where the RAM cannot meet the storage requirement, the whole matrix should be decomposed into a set of smaller matrices or slabs, and the size of which is determined by the RAM.In the process of the out-of-core LU decomposition, each slab is read into the RAM from the HDD in order, and then the LU decomposition is started.The decomposed results of each slab are written back to the HDD when its factorization is done.The solver then proceeds with the next slab until the whole matrix is decomposed, as shown in Figure 5.
From Figure 5 we can see that CPUs start computing after the data are transferred from the HDD to RAM.In this phase, L1 and L2 cache are used frequently to help the CPUs exchange data with RAM, and there are no I/O operations to the HDD.The right-looking algorithm [2] is adopted in this phase, because it has the larger computation destiny than the left-looking algorithm.In contrast to the leftlooking algorithm, the active block column depends on its right block columns in the right-looking algorithm, as shown in Figure 6.
There is no doubt that the performance of the CPU-HDD out-of-core procedures is reduced by the data-movement between the HDD and CPU RAM.To accelerate the parallel out-of-core MoM, GPUs are used to implement the LU decomposition procedures.

GPU-Based Two-Level Out-of-Core Scheme.
To implement the GPU-based two-level out-of-core scheme, the multilevel storage architecture should be defined, as shown in Figure 7, where the dashed line frame represents the CPU and the GPU card, respectively.
The data should be transferred to GPU to implement computing tasks.When the GPU memory cannot store the data, an efficient out-of-core scheme between RAM and GPU  (1, 0) (0, 0) (0, 1) (0, 2) The data transfer between RAM and HDD

CPU calculation
The control of CPU over RAM to read data from HDD Calculation Calculation L1\L2 cache L1\L2 cache memory must be utilized, which is similar to the out-ofcore scheme between RAM and HDD.As shown in Figure 8, the block column in RAM (active block column in Figure 4) should be divided into finer block columns.
Combine the RAM to HDD and GPU memory to CPU out-of-core schemes in a certain way as shown in Figure 9, which is the software and hardware architecture of parallel out-of-core left-looking algorithm.To make one GPU card used by several CPU cores at the same time, the GPU context technique [33] is used.In Figure 9, each node has two CPU cores and one GPU card.The context technique is used to partition one GPU card into two virtual GPU cards.And one MPI process is assigned to one CPU core and one virtual GPU card.Each MPI process can use its own GPU resources to accelerate the computing tasks.
In Figure 9, the CPUs first read data from the HDD to the RAM and then upload them into GPU, then GPU and CPUs start computation, and finally all the data will be written to the HDD.
Figure 10 is the implementation scheme of the GPUbased out-of-core LU solver of HoMoM.The left column of Figure 10(a) is the out-of-core left-looking algorithm for the LU solver, where the load represents the data transferred from RAM to HDD.
The most time-consuming functions of the algorithm are FSub  and LU  blk , which are given in the right column of Figure 10(a) and the left column of Figure 10(b), respectively.FSub  is the hot spot of FSub  , and GPUFSub GEMM is the hot spot of LU  blk .FSub  is given in the right column of Figure 10(b), where GPUFSub GEMM is the hot spot of FSub  .Obviously, GPUFSub GEMM is the hot spot of the whole process of the LU solver, which is the matrix-matrix multiplication function in the GPU-based out-of-core LU solver.

Overlapping of Communication and Computation.
In the GPU-based out-of-core scheme, communication can be hidden by computation by using the CUDA asynchronous technology and CUDA stream technology [34,35].As shown in Figure 9, each MPI process opens several CUDA streams on GPU context.The CUDA stream is similar to a CPU pipeline operation queue.The overlapping scheme consists of three kinds of operations: data transfer from RAM to GPU memory, GPUs calculation, and data transfer from GPU memory to RAM, which are performed by different hardware units.The operations in different CUDA streams can be executed asynchronously, and the data transfer and calculation can avoid resource confliction.
The working order of the three operations in different CUDA streams is shown in Figure 11.From the figure, one can see that the communication is hidden by computation.

Numerical Results and Discussion
The computational platform used in the following examples is a hybrid CPU/GPU cluster, which is equipped with 2 compute nodes connected by 1000 Mbps network cards.Each node has two six-core Intel Xeon E5-2620 2.0 GHz CPUs, 64 GB RAM, and one Tesla K20c GPU card with 4.6 GB memory available.The parallel hybrid CPU/GPU codes are programmed by hybrid languages based on MPI and OpenMP techniques.In this paper, the performance of the GPU-based parallel outof-core LU solver is compared with the CPU version.

Performance Tuning. A new parameter cuRatio is introduced to demonstrate the task ratio of GPUs calculation.
To obtain high acceleration performance, we firstly tune the cuRatio value.Note that the speed of the CPU and GPU card is different, so the optimal value of cuRatio should be testing to ensure the optimal performance of GPU-based out-of-core parallel LU solver.
The test is carried out on a single CPU/GPU node.In the calculation, one GPU card should be partitioned into twelve virtual GPU cards, and each virtual GPU card contains 1/12 resources of the whole GPU card.
Because the matrix equation solving contributes more than 80% of the computation time for a large scale dense complex matrix, the matrix equation solving time is mainly considered in this paper.A problem involving 67,552 unknowns is simulated.The cuRatio values are chosen from 0.0 to 1.0, and the step size is 0.05.The testing results are shown in Figure 12.
From the results, one can see that the optimal value of cuRatio is 0.75.[A], [p], [B])

Correctness of the
where ] ] where where ] where Continue with endwhile ] Continue with endwhile endwhile Algorithm: where ] Continue with ] where element are 30 mm × 35.6 mm, and the gaps between any two neighboring elements are 14.0 mm along both directions.The entire array is discretized into 121 wires for the feeds and 6,490 bilinear patches for the microstrip surfaces, corresponding to a total of 14,956 unknowns.Figure 14 shows the gain patterns of microstrip patch phased array, and the results from the proposed method and the RWG MoM in FEKO agree with each other very well.15.The dimensions of the wire array are 10.8 m × 2.9 m and the distances between any two neighboring elements are 150 mm and 50 mm along the length and width directions, respectively.The amplitude at the feed of the array is designed by a −35 dB Taylor distribution in xoy plane.The operation frequency is 1.0 GHz and the number of unknowns is 12,166.The three-dimensional (3D) gain pattern is given in Figure 16 and the 2D gain patterns are given in Figure 17.The 2D gain patterns computed by CPU alone are also given for comparison.The side-lobe level in xoy plane is below −35 dB, which meets the design target.From the comparisons one can see that the results using the parallel out-of-core HoMoM and the GPU-based out-ofcore HoMoM agree with each other very well.The calculation parameters are given in Table 1, and the time in Table 1 is matrix equation time.The memory column shows the memory requirements of double complex dense matrices.In this simulation, there is no effect of the out-of-core scheme, because the GPU memory is enough to deal with this problem.According to Table 1, take the 24 CPU cores as a reference, the speedup of the hybrid CPU/GPU version is 2.309.But compared with the sequential HoMoM, a speedup of over 55 times can be obtained using the GPU-based version.
Install the 72 × 14 wire antenna array over an airplane, as shown in Figure 18.The dimensions of the airplane are 36 m × 40 m × 10.5 m.The simulation frequency is also 1.0 GHz and      the number of unknowns is 259,128.Here this airborne problem is simulated using two approaches, the CPU version outof-core HoMoM and the GPU-based out-of-core HoMoM, to demonstrate the acceleration performance.The 3D and 2D gain patterns obtained by the proposed GPU-based out-ofcore algorithm are shown in Figures 19 and 20, in which the corresponding results of the array alone are also given.
From the comparison one can see that the results using parallel out-of-core HoMoM and GPU-based out-of-core HoMoM agree with each other very well.Meanwhile, we can see that the mainlobe region of gain patterns (i.e., 178 ∘ ∼182 ∘ in the xoy plane) of the airborne wire antenna array shows slight changes, while the side-lobe region of gain patterns (i.e., 0 ∘ ∼45 ∘ , 135 ∘ ∼225 ∘ , and 315 ∘ ∼360 ∘ in the xoy plane) seriously deteriorates compared with the result of the array alone.When the main beam is towards the tail fin of the airplane, its gain patterns get much worse mainly due to the reflection effect of the airplane tail fin.
The computing resources and solving time are given in Table 2.According to Table 2, the memory requirement is about 1000 GB, but the available RAM and GPU memory are

Conclusion
Using GPU-based two-level out-of-core scheme can solve larger EM problems at faster speed.In this paper, an airborne array problem with 259,128 unknowns is solved on a hybrid CPU/GPU cluster with about 10 GB GPU memory, 128 GB RAM, and 1 TB storages of HDD, and a speedup of 1.6 can be obtained compared with the parallel CPU version.The technique presented in this paper can also be used to accelerate the solution of radiation patterns from onboard antenna systems including complex antennas and large platforms.

2 International
Journal of Antennas and Propagation

Figure 1 :
Figure 1: Surface equivalence theorem: (a) the original problem and (b) the equivalent problem for region .
Functions.The boundary surfaces are discretized into bilinear quadrilateral patches, over which

Figure 2 :
Figure 2: Bilinear quadrilateral patch with its local coordinates  and .

Figure 3 :Figure 4 :
Figure 3: 2D block-cyclic distribution scheme: (a) matrix A divided into 2 × 2-sized blocks, (b) the matrix A is consisting of 5 × 5 blocks, and (c) the rearrangement of   in (b) for each process.

Figure 5 :
Figure 5: Operation sequence of the out-of-core scheme between RAM and HDD.

Figure 6 :
Figure 6: The data dependence in right-looking algorithm.
Implementation.The accuracy of the proposed GPU-based out-of-core HoMoM is validated in this section.As shown in Figure13, an 11 × 11 microstrip patch phased array, which is housed in a 520 mm × 580 mm × 7 mm cavity in a ground plane, is considered.The parameters of substrate are   = 2.67 and   = 1.0.The dimensions of the

Figure 9 :
Figure 9: Software and hardware architecture of parallel out-of-core left-looking scheme.

Figure 10 :
Figure 10: The scheme of the GPU-based out-of-core LU solver.

Figure 11 :Figure 12 :
Figure 11: Overlapping between communication and computation on GPU context.

Figure 13 :Figure 14 :
Figure 13: Model of the microstrip patch phased array.

Figure 15 :
Figure 15: Model of the wire antenna array.

Figure 16 :
Figure 16: 3D gain patterns of the wire array.

Figure 17 :Figure 18 :
Figure 17: 2D results of the wire array in xoy plane.

Table 1 :
Computational parameters for the array.

Table 2 :
Computational parameters for the array installed on an airplane.It represents how the proposed GPU-based solver breaks the restriction of the RAM and GPU memory.Meanwhile, taking the 24 CPU cores as a reference, a speedup of 1.611 can be obtained using the GPU-base version.And compared with the sequential HoMoM, the speedup of the GPU-based version is over 38 times.