High Performance Computing of Complex Electromagnetic Algorithms Based on GPU / CPU Heterogeneous Platform and Its Applications to EM Scattering and Multilayered Medium Structure

The fast and accurate numerical analysis for large-scale objects and complex structures is essential to electromagnetic simulation and design. Comparing to the exploration in EM algorithms frommathematical point of view, the computer programming realization is coordinately significant while keeping up with the development of hardware architectures. Unlike the previous parallel algorithms or those implemented by means of parallel programming on multicore CPU with OpenMP or on a cluster of computers with MPI, the new type of large-scale parallel processor based on graphics processing unit (GPU) has shown impressive ability in various scenarios of supercomputing, while its application in computational electromagnetics is especially expected.This paper introduces our recent work on high performance computing based on GPU/CPU heterogeneous platform and its application to EM scattering problems and planarmultilayeredmedium structure, including a novel realization of OpenMP-CUDA-MLFMM, a developed ACA method and a deeply optimized CG-FFTmethod.With fruitful numerical examples and their obvious enhancement in efficiencies, it is convincing to keep on deeply investigating and understanding the computer hardware and their operating mechanism in the future.


Introduction
Demand boosting in high performance computing algorithms has been one of the most significant topics in computational electromagnetics (CEM).With the well-known merit of much fewer unknowns than finite-difference timedomain (FDTD) and finite element method (FEM), the method of moments (MoM) has been widely used in EM scattering problems of large-scale complex objects and fullwave analysis in multilayered structures during the past decades [1].With the development in computer science and computational mathematics, many innovative algorithms have been established to accelerate the solving procedure of matrix equation.Several methods have been widely used in CEM, for example, multilevel fast multipole method (MLFMM) [2], adaptive integral method (AIM) [3,4], fast Fourier transform based method (p-FFT, IE-FFT) [5][6][7], adaptive cross approximation method (ACA) [8][9][10], and conjugate gradient fast Fourier transform method (CG-FFT) [11].Some of them have been successfully implemented in commercial software.Naturally, parallelization is the obvious way to make a further enhancement to efficiency [12].In early stage, the implementation of parallel computing was mainly based on CPU platforms, such as PC-cluster with MPI and multicore CPU workstation with OpenMP.The methodology of the parallelization based on a single PC platform is the massive use of threads while sharing the same memory.The bottleneck appears immediately as the explosive growth of communication load among threads.Therefore, the theoretical speedup when using multiple processors can be predicted by the famous Amdahl's Law [13].Thanks to the upgrading in hardware architecture of GPU, it is possible to allocate much more transistors devoted to data processing, that is, arithmetic logic unit (ALU), rather than data caching or flow control [14].This makes the GPU most involved in highly parallel, multiprocess, and many-core computing with very high memory bandwidth.The combination of GPU and CPU can be realized by the concept of heterogeneous computing, which includes other popular hardware modules, that is, field-programmable gate array (FPGA) and digital signal processing (DSP).The programming combination between GPU and CPU can fully depend on the easy-to-use language CUDA-C/PTX.To the authors' knowledge, the GPU/CPU heterogeneous platform is the most popular choice especially in CEM, such as the GPU-based FDTD [15], MLFMM [16][17][18][19], AIM [20], P-FFT [21], MoM [22,23], and higher-order MoM [24].In 2013, an impressive implementation of MLFMM by OpenMP-CUDA was realized on Fermi architecture (NVIDIA Tesla C2050), which achieved much higher performance than those before [16].
In this paper, our recent works on high performance computing based on GPU/CPU heterogeneous platform are introduced with the following: (1) A novel realization of OpenMP-CUDA-MLFMM method for EM scattering problem, in which the nearfield matrix filling and the sparse matrix-vector production (MVP) are optimized, together with a warplevel parallel scheme for aggregation/disaggregation and the use of texture memory for 2D local interpolation/anterpolation (2) A developed ACA method for EM scattering problem, in which the near-field matrix filling realizes a 100% efficiency enhancement and a thread-block-level parallel and a register-reusable schemes are applied to matrix compression and far-field MVP, respectively; also, a double-buffer technique for double precision is proposed to further enhance the efficiency of MVP computation [25] (3) A deeply optimized CG-FFT method for planar multilayered structure, in which the hardware instruction "__shufl_down" instead of shared memory was firstly used in the summation of MVP [26] and the current distribution of a large-scale flat lens was obtained efficiently.
It is worth emphasizing that the domain decomposition method (DDM) is not considered in this paper, by which means all the optimization comes from arithmetic programming.Corresponding to numerical examples, the detailed parameters of GPU platform can be found in Table 1, from which one can also discover the changes in computing ability brought by hardware's upgrading.The paper is organized as follows.With a brief introduction in numerical algorithms for MLFMM, ACA, and CG-FFT, Section 2 introduces the proposed and optimized methods from the programming point of view.Section 3 deals with the analysis of fruitful numerical examples.Section 4 presents the conclusions.Some technical terms used in this paper relating to NVIDIA GPU architectures and CUDA-C language, such as thread, threadblock, grid, warp, kernel, instruction, streaming multiprocessor (SMX), and stream processor (SP), are essential for understanding and can be found in [14].

Numerical Analysis of MLFMM,
ACA, and CG-FFT ( By using the addition theorem of Green's function, formula (1) can be rewritten in the form named as "aggregation-translation-disaggregation" for EFIE and MFIE: where  =  or , ℎ (2)   (⋅), and   (⋅) stand for the l-order Hankel function of the second kind and the l-order Legendre function, respectively.  is the truncation length and related to the diameter of the cubic in th layer and the numerical accuracy.
Since the generation of near-field matrix is a highly intensive work in computation of MLFMM, the implementation of GPU or multi-GPU can significantly speedup the procedure even with RWG-to-RWG rather than triangle-totriangle scheme.All the RWG functions are stored as two copies, namely, testing chain and basis chain, and each thread on GPU deals with one or more nonzero matrix elements by accessing corresponding data from the chains.This scheme can easily extend to multi-GPU case by splitting the whole task into subtasks.For example, if there are 2 GPUs, namely, GPU-0 and GPU-1, the chain can be split as {   } /2−1 =0 and {   }  =/2 , which makes each GPU has a /2 ×  sparse matrix to deal with.Considering the asynchronous launching scheme of kernel functions in CUDA, the control returns to CPU after launch immediately, which means a duty waste The near matrix for CPU when a thread is working on GPU.Therefore, it is necessary to establish a heterogeneous scheme for coordination between CPU and GPU.For this reason, each near matrix is further divided into two parts with N G and N C rows, respectively, as shown in Figure 1.All the tasks can be executed concurrently while satisfying the asynchronous mechanism of GPU/CPU heterogeneous platform.This kind of subdivisions can accelerate the sparse matrix-vector productions (SpMVP).The determination of N G and N C cannot be easily derived from mathematics; however, the empirical value can be obtained from numerical simulation and curve fitting, in which the optimum solution can be refined by parameter scanning, as shown in Figure 2. Considering the coexistence of multicore CPU and multi-GPU, n C and n G stand for the number of cores in CPU and number of GPU, respectively.The sampling points were obtained from sphere scatterings with radius changing from one to six wavelengths.As a result, for the case of n C = 4 and n G = 2, the optimum solution of N C /(N/2) equals 0.016 and 0.083 for single and double precision, respectively.
The implementation of aggregation and disaggregation at finest level on GPU was proposed by means of allocating a thread to each spectrum point [16].To increase the utilization of GPU on Kepler architecture (GK110) further, which has a maximum value of 32 threads in one warp, a smart scheme is designed with two steps.Firstly, all the cubes are grouped into each GPU evenly.Secondly, one warp is assigned to a certain cube, in which one thread is assigned to one or more RWG functions immediately.Taking the advantage of the hardware architecture in Kepler GK110, the shuffle instruction "__shulf()" can efficiently calculate the reductive summation in register rather than shared memory [14].Meanwhile, all the 32 threads in a warp can read the data from constant memory according to a certain spectrum point through the read-only cache by using the instruction "__ldg()" or "__restrict__".Unlike the strategy of thread-based task assignment proposed for aggregation and disaggregation at coarser level [16], we make a further step in data storage by using the four times larger texture memory in Kepler than Fermi.Since the local inter-/anterpolation accesses neighboring data frequently, it had better store data in a certain pattern form looking like the geometric topology in texture memory.As is pointed out in [27], the texture cache is optimized for 2D spatial locality.Therefore, the threads in one warp reading close-set texture addresses can acquire hardware acceleration and achieve the best performance.With texture memory, the process from coarser level towards finest level can also be executed efficiently.In detail, after all the cubes at child level are grouped evenly in every device (i.e., GPU), a thread will be assigned to give a cube for a spectrum point at this level.By fetching the data from global memory, the information can be organized into a 3D CUDA array that is bound to the layer of texture memory.Then, the cube receives data disaggregated from its parent level by applying local anterpolation within texture memory.All this kind of tasks for cubes at child level can be executed concurrently and all the layers of texture memory assigned can be flushed for the subsequent requests.

Principle of ACA and Its Optimization on GPU.
The ACA is a purely algebraic algorithm [8,9], aiming at independence from the integral kernel, that is, Green's function, which makes it different from MLFMM.The main idea of ACA is based on the compressing of a rank deficient system, which corresponds to the submatrix representing interactions between two well-separated groups of RWGs.Therefore, the MVP of this kind of submatrices can be efficiently calculated by using the factorial forms [9].In ACA, the MVP is divided into near-and far-field interactions, which leads to the same data structure as that in MLFMM, that is, the Octree structure, and forms a hierarchical representation of far-field submatrices.Hereby we use   , to denote the  ×  submatrix for interaction between group m and group n at the th-level with indicating a well-separated situation.Then, it can be approximated as where ‖ ⋅ ‖  is the Frobenius norm of a matrix and  ACA is a given threshold.The workflow of ACA can be depicted as in the following steps with initial value of k = 1 and i k = 1.
Step 1. Generate V () as the th row of   , and update it.
Step 5.The feedback system will keep updating by  ⇐  + 1 and   ⇐   + 1 until the condition in ( 5) is satisfied.
Obviously, it is convenient to employ both threadgrid and threadblock to take charge of one dimension.The total number of threads in each block is selected as 256, by means of 8 warps.The RWG functions, no matter what it is seen as, basis or test function, can be stored in shared memory during the generation of factorial form of a submatrix.Meanwhile, the submatrices U and V are stored in columnand row-major order, respectively.For the task assignment, 2 l warps perform the ACA calculation in one submatrix at the ( − )th level.However, this arrangement has no worry about insufficient blocks or logic error for l = 0, 1, 2, 3 until l = 4, because there is no synchronization mechanism between different threadblocks.In this paper, we realize one threadblock to one submatrix even at (l ≥ 4) by launching a child CUDA kernel function during the generation of new row or column vector [25].The number of threads enabled in this so-called child kernel is equal to the length of the row or column vector and they calculate only one entry of the vector.It is worth emphasizing that an explicit synchronization between parent and child kernels should be adjusted in advance, which ensures that all the tasks assigned to child kernel are ultimately completed before logic and data operates on parent kernel.This treatment can be realized by the specific "dynamic parallelism" [28] in Kepler GK110 (Table 1) and a vivid flowchart is shown in Figure 3.
After the ACA operation, the MoM matrix is converted into an H-matrix that greatly reduces the storage and speeds up the MVPs.The upcoming procedure is to calculate massive MVPs, which is still time consuming even though the Hmatrix is used.In this paper, the batched matrix-vector productions (bMVP) are considered to deal with the massive computation [25].The main idea of bMVP is to apply the 2D form of threadblock with dim3(pth, gridDimy) to calculate all the MVPs.The column in U and the row in V are performed concurrently, such that all segments (column or row) are with the same size except the last part, as shown in Figure 4.The result of  =   , () is stored in the shared memory as an intermediate buffer and then the result of  =   , () is accumulated into the result by reductive summation in shared memory.If we let s be equal to 256, there will be 64 threads in a group with the same ID in y-direction to do multiplication and the vector  can be reused in registers.Therefore, each threadblock can deal with 8 rows of V matrix and at most 32 threadblocks (gridDimy = 32) are needed to calculate the MVP.

Principle of CG-FFT, Multilayered Green's Functions, and
Its Optimization on GPU.The CG-FFT [11] is one of the most classical methods with iterative procedure and many applications have been found in different areas such as algebra, signal processing, computational geometry, and CEM, especially in International Journal of Antennas and Propagation  MoM based algorithms [29].The pseudocode of traditional CG-FFT can be depicted as where   is the conjugate transpose of matrix A and k stands for the iteration round.Obviously, the MVP operation dominates the most computation source and needs to be optimized.As we mentioned before, after a multiprocessor is assigned one or more threadblocks to execute, it will divide them into warps and each warp will be scheduled by a warp scheduler.The full efficiency can be realized when all 32 threads of a warp agree on their execution path.Considering the Kepler GK110 platform, the maximum threadblocks in one stream multiprocessor are 16 and each SMX contains 192 ALUs.An optimized parallelization can be realized by the socalled "single instruction multiple thread (SIMT)" [14].The flowchart of this optimized CG-FFT on GPU is depicted in Figure 5, in which the "RHT" stands for right-hand term in MoM matrix equation.
As the matrix is partitioned into N submatrices with at least ( − 1) of them having the same size, each submatrix can also be partitioned into pieces and each piece occupies a certain thread.Just like the realization in OpenMP-CUDA-MLFMM, the hardware instruction "__shufl()" is used instead of the shared memory to do reductive summation.This makes the performance even more efficient than the famous CUBLAS library.That is because the shuffle instruction is more efficient and reduces the usage of shared memory with increasing the occupancy.
A brief review of dyadic Green's functions in planar multilayered medium structure is also necessary in this part.The most important one is the mixed potential integral equation (MPIE), based on which the MoM can be established [30][31][32][33][34][35].
where the dyadic Green's functions can be presented as One of the most efficient ways to evaluate the Green's functions through Sommerfeld Integral (SI) from the spectral to spatial domain is the discrete complex image method (DCIM).According to the analysis in [30], the DCIM is suitable for near-field region ( ≤ 0.05) by calculating three terms, namely, the quasi-static term ( Q-S ), the surface wave term ( SWP ), and the complex image term ( CIM ), as shown below.
where  (2)  0 stand for the zero-order Hankel function of the second kind and Res i stands for the residues of surface wave poles on the complex spectrum plane.For far-field region ( ≥ 0.05), both the surface and leaky wave should be considered, together with a branch cut integral contribution, as shown below.
where SIP stands for the Sommerfeld integral path,  SWP and  LWP stand for the number of extracted surface and leaky wave poles, respectively, D stands for the integral path along the branch cut, and G+ and G− are the spectral Green's functions along the top and bottom Riemann sheets of the branch cut, respectively [30].With accurate evaluation of all the components in dyadic Green's functions, the MPIE can be solved by MoM with RWG functions, in which the matrix element is calculated as where  ⇀   and  ⇀   stand for the basis and testing RWG functions, respectively.The right-hand term of the MoM matrix equation will be different from radiating, transmitting to scattering cases.

Numerical Examples
To verify the efficiency enhancement of the proposed optimization based on GPU/CPU heterogeneous platform, this section will demonstrate fruitful numerical examples, respectively, corresponding to the OpenMP-CUDA-MLFMM, the GPU-based ACA, and the deeply optimized CG-FFT.All the programming codes are compiled in CUDA-C environment, and the hardware configuration can be found in Table 2. Similar to Section 2, the numerical results are also separated into 3 groups.The correctness of proposed methods with MLFMM and ACA have been verified by the very good agreement between bistatic RCS and standard Mie solutions of PEC spheres, which will not be demonstrated in this section.

Numerical Results of OpenMP-CUDA-MLFMM.
As a benchmark result to verify the robustness of algorithms, the scattering from the PEC NASA Almond at 7 GHz for VV polarization and that from the PEC NASA Ogive at 9 GHz for HH polarization are considered.These two PEC surfaces are discretized into refined triangles with edge length of about 0.1, which produce 8416 and 15516 RWG unknowns, respectively.The monostatic RCS curves with the incident angle as argument are obtained by the proposed method, as shown in Figure 6.From the comparison with measurement results, very good agreement can be found.Note that the interpretations labeled as "mid VV," "high VV," and "VV FERM" refer to Figure 5 in [36], and those labeled as "HH," "VV," "HH Cicero," and "VV Cicero" refer to Figure 9 in [36].
To demonstrate the efficiency performance of the proposed OpenMP-CUDA-MLFMM, we need to define some variables to record time consumption in different procedures, as shown in Table 3.By sampling on different number of unknowns, an approximate curve of (  ;   ) and ( AF +  DF ;  AS +  DS ) can be fitted in a broad range, as shown in Figure 7.In can be clearly seen that the proposed method has higher efficiency than that reported in [16] based on the same Kepler GK110 architecture.It is worth pointing out that the translation is not included because both the proposed method and that in [16] have almost the same performance in this procedure.

Numerical Results of GPU-Based ACA.
To verify the robustness of the proposed GPU-based ACA, the bistatic RCS from a rotor with four blades and a depressed cylindrical object are considered, as shown in Figure 8. Very good agreement can be seen between the proposed method and the traditional MoM on CPU.These two PEC surfaces are discretized into refined triangles with edge length of about 0.1, which produce 6636 and 4817 RWG unknowns, respectively.
To demonstrate the efficiency performance of the proposed GPU-based ACA, the scattering from a PEC sphere with radius of n (n = 1, 2, 3, and 4) is considered.The number of generated RWG unknowns is 4774, 18468, 40950, and 72120, respectively.In this paper, the threshold in formula ( 6) is selected as  ACA = 10 −3 .The efficiency comparison between GPU and CPU platform in both the bMVP operations and the whole ACA scheme with single and double precision are shown in Figure 9.It can be seen that the speedup ratio of matrix compression by using ACA can achieve about 50∼100 and 25∼50 for single and double precision, respectively.Meanwhile, the speedup ratio of bMVP can achieve about 10∼30 and 6∼17 for single and double precision, respectively.

Numerical Results of Deeply Optimized CG-FFT.
To validate the accuracy and efficiency of the proposed method, the quasiperiodical structures of flat lens on substrate ( r = 2.2,   = 1.0) with a copper ground ( = 5.98 × 10 7 S/m) are considered, which consist of 15 × 15 and 33 × 33 Jerusalem crosses and generate 19561 and 85849 RWG unknowns, respectively, as shown in Figure 10.The flat lenses are illuminated by an orthogonal incident plane wave with polarization in the  direction, where the incident information are  = ,  = 0, E 0x = 1, and E 0y = E 0z = 0.
Since the MVP operation is the most time consuming part in CG-FFT, direct optimization from arithmetic programming point of view is a challenging topic and should be tested before applying to complicated structure.Therefore, both square and nonsquare matrices MVP are tested, from which a significant speedup ratio is observed comparing to the famous CUBLAS, as shown in Tables 4 and 5.For this experiment on arithmetic realization, the GPU occupancy of CUBLAS and that of the proposed method is 25% and 75%, respectively, while the usage of shared memory is 288 B and 10240 B, respectively.This can be seen as a great enhancement in hardware performance, which makes a speedup ratio of 1.4 in average on single GPU.
As is mentioned in Section 2, the right-hand term in MoM of multilayered structure is different from radiating, transmitting, and scattering cases.With an orthogonal incident wave, the right-hand term can be calculated as below.
where the calculation of reflection coefficient refers to [37] and d in the exponential term is the thickness of substrate.The relative error of iteration in CG-FFT for the case with 15 × 15 elements is shown in Figure 11  Number unknowns of N t M , using the scheme in [16] t M , using the proposed scheme t S , using the proposed scheme t S , using the scheme in [16]  t M , using the scheme in [16] t M , using the proposed scheme t S , using the proposed scheme t S , using the scheme in [16]  t + t , using the scheme in [16] t + t , using the proposed scheme t + t , using the scheme in [16] t + t , using the proposed scheme Number unknowns of N (c) 10000 100000 200000 t + t , using the scheme in [16] t + t , using the proposed scheme t + t , using the scheme in [16] t + t , using the proposed scheme on GPU/CPU heterogeneous platform [38], which will be implemented to the MPIE-MoM in future work.

Conclusions
In this paper, our recent works on high performance computing based on GPU/CPU heterogeneous platform and its application to EM scattering problems and planar multilayered medium structures are introduced.There are three typical applications analyzed, including a novel implementation of OpenMP-CUDA-MLFMM, a developed GPU-based ACA method, and a deeply optimized CG-FFT method.
Comparing to the MLFMM, the ACA method starts to accelerate computing with Green's functions; however, it is still worth mentioning that the data storage strategy of ACA needs to be further developed.This paper is to explain the coordination between GPU and CPU on a heterogeneous platform.In addition, the methodology of optimization through hardware instructions is also adopted.Following the trends of both parallel and distributed computing, the OpenMP-CUDA-MLFMM and ACA are realized on a 2-GPU platform.With fruitful numerical examples for the three kinds of algorithms, we can find obvious enhancement in efficiency.As the development in hardware world, it International Journal of Antennas and Propagation     is worthy keeping an eye on different kinds of modules, including GPU, FPGA, and DSP.Meanwhile, it is also worthy keeping one on investigating and understanding the operating mechanism of heterogeneous computing.

2 International
Journal of Antennas and Propagation

Figure 1 :
Figure 1: The subdivisions of the near matrix and the task assignments.

Figure 2 :
Figure 2:   as a function of N/  with   = 2.

Figure 5 :
Figure 5: Flowchart of matrix partition for threads assignment.

Figure 6 :
Figure 6: (a) The monostatic RCS at 7 GHz for VV polarization.(b) The monostatic RCS at 9 GHz for HH polarization.
and the current distribution in centre of the case with 33 × 33 elements is shown in Figure12.To enhance the efficiency further, we have realized the fast calculation of multilayered Green'

Figure 7 :
Figure 7:   and   as function of N with (a) single and (b) double precision;  AF +  DF and  AS +  DS as function of N with (c) single and (d) double precision.

Figure 10 :
Figure 10: Geometry of flat lens on substrate and its meshing scheme.

Figure 11 :
Figure 11: Relative error of iterations of the flat lens with 15 × 15 elements.

Figure 12 :
Figure 12: Current distribution in centre of the flat lens with 33 × 33 elements.

Table 1 :
Detailed parameters of GPU platform.

Table 2 :
Configurations of GPU/CPU platform of numerical examples.

Table 3 :
Variables definition of time consumption.

Table 4 :
Comparison in MVP for square matrix.

Table 5 :
Comparison in MVP for nonsquare matrix.