Implementation and Optimization of a CFD Solver Using Overlapped Meshes on Multiple MIC Coprocessors

In this paper, we develop and parallelize a CFD solver that supports overlapped meshes on multiple MIC coprocessors using multithreading techniques. We optimize the solver through several considerations, including vectorization, memory arrangement, and an asynchronous strategy for data exchange among multiple devices. We compare different vectorization strategies and report the performance of the core functions of the solver. Experiments show that a speedup of about 3.16x can be achieved for the six core functions on a single Intel Xeon Phi 5110P MIC card, and a speedup of 5.9x can be achieved using two cards, compared to an Intel E5-2680 processor, for a case with two ONERA M6 wings.


Introduction
Computing with accelerators such as the graphics processing unit (GPU) [1] and the Intel many integrated core (MIC) architecture [2] has become attractive in computational fluid dynamics (CFD) in recent years because it gives researchers the possibility of accelerating or scaling their numerical codes with various parallel techniques. Meanwhile, the fast development of computer hardware and emerging techniques requires researchers to explore suitable parallel methods for engineering applications. The Intel MIC architecture consists of processors that inherit many key features of Intel CPU cores, which makes code migration less expensive and has made the architecture popular for the development of parallel algorithms.
Many CFD-based codes or solvers have been studied on the Intel MIC architecture. Gorobets et al. [3] used various accelerators, including AMD GPUs, NVIDIA GPUs, and Intel Xeon Phi coprocessors, to conduct direct numerical simulations of turbulent flows and compared the results from these accelerators. Farhan et al. [4] utilized the native and offload modes of the MIC programming model to parallelize the flux kernel of PETSc-FUN3D, and they obtained about 3.8x speedup with offload mode and 5x speedup with native mode by exploring a series of shared-memory optimization techniques. Graf et al. [5] ran their PDE codes on a single Intel Xeon Phi Knights Landing (KNL) node and on multiple KNL nodes using the MPI + OpenMP programming model with different thread affinity types. Cai et al. [6] solved nonlinear dynamic problems on Intel MIC by employing offload mode and an overlapped data transfer strategy, and they obtained a 17x speedup on MIC over the sequential version on the host for the simulation of a bus model. Saini et al. [7] investigated various numerical codes on MIC, comparing different host-coprocessor computing modes to seek performance improvements, and they also presented a load-balancing approach for symmetric use of MIC coprocessors. Wang et al. [8] reported the large-scale computation of a high-order CFD code on the Tianhe-2 supercomputer, which consists of both CPUs and MIC coprocessors. Other CFD-related works on the Intel MIC architecture can be found in references [9][10][11][12]. Working as coprocessors, GPUs have also been popular in CFD. Many researchers [13][14][15][16][17] have studied GPU computing on structured meshes, which involved coalesced computation techniques [13], heterogeneous algorithms [15,17], numerical methods [16], etc. Corrigan et al. [18] investigated an Euler solver on GPUs by employing unstructured grids and gained a significant speedup over CPUs.
Then, many results followed on data structure optimization [19,20], numerical techniques [21], and applications [22] based on unstructured meshes on the GPU platform. For GPU simulations on overlapped (overset) meshes, Soni et al. [23] developed a steady CFD solver on unstructured overset meshes using the GPU programming model, which accelerated both the grid (mesh) assembly procedure and the numerical calculations. Then, they extended their solver to unsteady problems [24,25] to make their GPU implementation capable of handling dynamic overset meshes. More CFD-related computing on GPUs can be found in an overview reference [26].

However, most of the existing works either used consistent structured or unstructured meshes without mesh overlapping over the computational domain on the MIC architecture or studied overlapped meshes on GPUs. Over the past several decades, the majority of CFD simulations involving overlapped meshes were implemented or developed on distributed systems through the message passing interface (MPI) [27] without using coprocessors. Specifically, Djomehri and Jin [28] reported the parallel performance of an overset solver using a hybrid programming model with both MPI [27] and OpenMP [29]. Prewitt et al. [30] conducted a review of parallel implementations using overlapped mesh methods. Roget and Sitaraman [31] developed an effective overlapped mesh assembler and investigated unsteady simulations on nearly 10000 CPU cores. Zagaris et al. [32] discussed a range of problems regarding parallel computing for moving-body problems using overlapped meshes and presented the preliminary performance of parallel overlapped grid assembly. Other overlapped mesh-related works can be found in [33][34][35][36].

Although the work in [7] conducted tests using a solver that is compatible with overlapped meshes on MIC coprocessors, it accessed multiple MIC coprocessors through the native mode or symmetric mode of MIC. In this paper, we focus on the offload mode of MIC and investigate the parallelization of a solver with overset meshes on a single host node with multiple MIC coprocessors. The contributions of this work are as follows:
(i) We parallelize an Euler solver using overlapped meshes and propose an asynchronous strategy for calculations on multiple MIC coprocessors within a single host node.
(ii) We investigate the performance of the core functions of the solver by employing offload mode with different thread affinity types on the MIC architecture, and we make detailed comparisons between the results obtained by Intel MIC vectorization and those obtained by Intel SSE vectorization.
(iii) A speedup of 5.9x can be obtained on two MIC coprocessors over a single Intel E5-2680 processor for the case with two M6 wings.

The remainder of the paper is organized as follows. We first introduce the MIC architecture and programming model. This is followed by the equations and numerical algorithms that have been implemented in the solver. In Section 4, we discuss implementation and optimization aspects, including data transfer, vectorization, and the asynchronous data exchange strategy on multiple MIC coprocessors. The performance of the core functions with different thread affinity types is reported, and comparisons are made, in Section 5. The last section summarizes our work.

MIC Architecture and Programming Model
Anyone who is familiar with the C, C++, or Fortran programming language can develop codes on MIC coprocessors without major revision of their source codes. MIC provides very flexible programming models, including native host mode, native MIC mode, offload mode, and symmetric mode [38]. Coprocessors are not used in native host mode, and programmers can run their codes on CPUs just as they did before the MIC architecture was introduced. By contrast, in native MIC mode, codes run only on coprocessors when they are compiled with the "-mmic" option. Symmetric mode allows programmers to run codes on both CPU cores and coprocessors. Offload mode is most commonly used on a single coprocessor or on multiple coprocessors within a single host node. The basic use of offload mode is to write offload directives that make a code segment run on MIC coprocessors. To take full advantage of the computational resources on MIC, the code segment can be executed in parallel by employing multithreading techniques, such as OpenMP [29].

Compressible Euler Equations.
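The three-dimensional time-dependent compressible Euler equations over a control volume $\Omega$ can be expressed in integral form as

$$\frac{\partial}{\partial t}\int_{\Omega} W \, d\Omega + \oint_{\partial\Omega} F_c \, dS = 0, \tag{1}$$

where $W = [\rho, \rho u, \rho v, \rho w, \rho E]^T$ represents the vector of conservative variables and $F_c = [\rho V, \rho u V + n_x p, \rho v V + n_y p, \rho w V + n_z p, \rho H V]^T$ denotes the vector of convective fluxes, with $V = n_x u + n_y v + n_z w$. Here $\rho$ is the density, $(u, v, w)$ are the velocity components, $p$ is the pressure, $E$ and $H$ are the total energy and total enthalpy per unit mass, and $(n_x, n_y, n_z)$ is the unit normal of the control-volume surface $\partial\Omega$.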
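As a small illustration of the definitions of $F_c$ and $V$, the following Python sketch evaluates the convective flux vector at one face from primitive variables. The function name, the perfect-gas closure, and the value gamma = 1.4 are assumptions made for this example only; the text above does not state the equation of state used by the solver.

```python
def convective_flux(rho, u, v, w, p, n, gamma=1.4):
    """Return F_c at a face with unit normal n = (nx, ny, nz).

    Assumes a calorically perfect gas (hypothetical closure for this sketch).
    """
    nx, ny, nz = n
    V = nx * u + ny * v + nz * w                     # velocity normal to the face
    E = p / ((gamma - 1.0) * rho) + 0.5 * (u * u + v * v + w * w)  # total energy
    H = E + p / rho                                  # total enthalpy
    return [
        rho * V,
        rho * u * V + nx * p,
        rho * v * V + ny * p,
        rho * w * V + nz * p,
        rho * H * V,
    ]
```

For a face with zero normal velocity, only the pressure terms survive, which is the property used by inviscid wall boundary conditions.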

Numerical Algorithms.
The flux-difference splitting (FDS) [39] technique is employed to calculate the spatial derivative of the convective fluxes. In this method, the flux at a cell interface (for example, in the i direction), denoted $F_{i\pm 1/2,j,k}$, is computed by solving an approximate Riemann problem whose left and right states, $q_L$ and $q_R$, are constructed by the Monotonic Upstream-Centered Scheme for Conservation Laws (MUSCL) [40] with a min-mod limiter. Equation (1) is advanced in time in this work by an implicit approximate-factorization method [41], which achieves first-order accuracy in steady-state simulations.
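To make the reconstruction step concrete, here is a minimal one-dimensional sketch of a MUSCL-type reconstruction with a min-mod limiter. It shows one common second-order variant; the exact MUSCL parameters used by the solver are not given in the text, so the formulas and the function names below are assumptions for illustration.

```python
def minmod(a, b):
    """Return the argument of smaller magnitude if both share a sign, else 0."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b


def muscl_states(q, i):
    """Left and right states at interface i+1/2; needs q[i-1] .. q[i+2]."""
    q_left = q[i] + 0.5 * minmod(q[i] - q[i - 1], q[i + 1] - q[i])
    q_right = q[i + 1] - 0.5 * minmod(q[i + 1] - q[i], q[i + 2] - q[i + 1])
    return q_left, q_right
```

On smooth (here linear) data the two states coincide at the interface value, while at a local extremum the limiter returns zero slope and the reconstruction falls back to the cell values, which is what suppresses spurious oscillations near shocks.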
Euler wall boundary conditions (also called inviscid surface conditions), inflow/outflow boundary conditions, and symmetry plane boundary conditions are applied to equation (1) by using the ghost cell method [42].
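The ghost cell idea for an inviscid wall can be sketched as follows: the ghost state mirrors the interior state with the normal velocity component reversed, so that the average normal velocity at the wall face is zero. This is one common realization and is given only as an assumption; the paper does not spell out its ghost-cell formulas, and the function name is hypothetical.

```python
def wall_ghost_state(rho, vel, p, n):
    """Mirror an interior state across an inviscid wall with unit normal n.

    Density and pressure are copied; the normal velocity is reflected.
    """
    vn = sum(vi * ni for vi, ni in zip(vel, n))            # normal velocity
    ghost_vel = tuple(vi - 2.0 * vn * ni for vi, ni in zip(vel, n))
    return rho, ghost_vel, p
```

Because the boundary condition is expressed purely as values in ghost cells, the interior flux routines can treat boundary faces like any other face.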

Mesh Technique.
In this work, we aim to solve the 3D Euler equations on multiple MIC devices using the overlapped mesh technique. In this technique [33,42], each entity or component is meshed with a multiblock mesh system, and different multiblock mesh systems are allowed to overlap with each other. During the numerical calculations, overlapped regions need to receive interpolated information from each other through interpolation methods [43,44]. The process of identifying interpolation information among overlapped regions, termed mesh (grid) assembly [31][32][33][34][45], has to be carried out before the numerical calculations start.
This technique reduces the difficulty of generating meshes for complex geometries in engineering because an independent mesh system with high mesh quality can be designed for a component without considering other components. However, it adds complexity to parallel calculations (via MPI [27], for example) on this kind of mesh system. As mesh blocks are distributed over separate processors where data cannot be shared directly, more detailed work is needed on data exchange among mesh blocks. There are two types of data communication for an overlapped mesh system when calculations are conducted on distributed computers. One is the data exchange on the shared interfaces where one block connects to another, and the other is the interpolation data exchange where one block overlaps with other blocks. Both types of communication are discussed in this work.

Calculation Procedure.
The procedure of solving equation (1) depends mainly on the mesh technique employed. In the overlapped mesh system, the steps of the calculation are organized as in Listing 1. As described in Section 3.3, performing mesh assembly is a necessary step (Listing 1, line 1) before conducting numerical computing on an overlapped mesh system. Mesh assembly identifies the cells that need to receive interpolation data (CRIs) as well as the cells that provide interpolation data (CPIs) for the CRIs in overlapped regions, and it creates a map that records where the data associated with each block should be sent to or received from. This data map stays unchanged during the steady-state procedure and is used for data exchange (Listing 1, line 8) within each iteration. Then, the mesh blocks that connect with other blocks need to share the solutions at their common interfaces. When the data from the blocks that provide interpolation and from the neighbouring blocks are ready, a loop (Listing 1, lines 10-13) is launched to compute fluxes and update solutions on each block one by one.
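The role of the data map can be sketched in a few lines of Python. The layout below (a list of receiver entries, each with weighted donor cells) is a hypothetical structure chosen for illustration; the paper does not describe the actual data structure or the interpolation weights produced by mesh assembly.

```python
def exchange_interpolated_data(solution, data_map):
    """Apply W_s(CRI) = f(W_t(CPI)) using a fixed data map.

    solution: {block_id: {cell_id: value}}
    data_map: list of (recv_block, recv_cell, donors), where donors is a
              list of (donor_block, donor_cell, weight) tuples.
    """
    # Gather all interpolated values first, so that every CRI is computed
    # from donor values belonging to the same iteration.
    new_values = []
    for recv_block, recv_cell, donors in data_map:
        value = sum(weight * solution[blk][cell] for blk, cell, weight in donors)
        new_values.append((recv_block, recv_cell, value))
    # Then scatter them into the receiving cells.
    for recv_block, recv_cell, value in new_values:
        solution[recv_block][recv_cell] = value
    return solution
```

Since the map is built once and reused in every iteration, the per-iteration cost reduces to a gather, a weighted sum, and a scatter.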
Figure 1 illustrates how we exploit multithreading to perform all operations in Listing 1 on a single computing node with multiple MIC coprocessors. Even though more than one MIC coprocessor can be installed on the same node, the coprocessors are unable to communicate with each other directly. They must communicate through the host node, and moving data between MIC devices and the host node is expensive because of the limited PCI-E bandwidth. As two types of communication occur in every single iteration, this requires us to work out a data transfer strategy in order to achieve better performance on MIC coprocessors.
Our first consideration is to place each mesh block cluster (MBC) that is associated with a component or entity on a specific MIC coprocessor. This benefits from the natural feature of the overlapped mesh system (Section 3.3) employed in this work. Since an MBC consists of one-to-one matched blocks only, no interpolation is involved among the mesh blocks it contains. So, this way of distributing the computational workload to MIC coprocessors avoids one-to-one block data exchange across MIC coprocessors, which reduces the frequency of host-device communication. However, the data transfer from one block to all the blocks it overlaps across MIC devices is inevitable. For data transfer over overlapped regions across different MIC devices, we propose a communication optimization algorithm, which is introduced and discussed in Section 4.4.
As shown in Figure 1, a group of OpenMP threads is created on the host node, and each thread is associated with a MIC coprocessor and is responsible for the data movement and computation of at least one MBC. The operations of physical boundary conditions (lines 5-7), the data exchange on one-to-one block interfaces (line 9) within an MBC, and the most time-consuming part (lines 10-13) in Listing 1 are conducted on the corresponding MIC device. For the steady calculations in this work, the host calls the mesh assembly process [46,47] only once to identify the CRIs and CPIs for each mesh block and to prepare the data structure in overlapped regions. At the end of each steady cycle, the updated solutions at the CPIs are used to compute the new interpolation data, which are then copied to the host memory. The host collects the data, redistributes them for the CRIs, and copies the CRIs with the new interpolated values back into the MIC memory. When the interpolated data are ready, a cycle of spatial and temporal calculations can be performed on the mesh blocks one by one without any data dependence.

Memory Arrangement.
A MIC coprocessor has its own memory space, and independent memory allocation and arrangement have to be performed for the computations that are offloaded to a MIC device.More importantly, manipulating memory frequently and dynamically might have a negative effect on the overall efficiency, and it is the programmer's responsibility to control and optimize the memory usage.
In the computational procedure of solving equation (1), the core calculations, including boundary conditions, spatial discretization, and advancement in time, only update the values in arrays that can be used in different subroutines during the repeated cycles. Therefore, the whole procedure conducts calculations on allocated variables during the cycle and then outputs them at the end of the process. This fact inspires us to use !dec$ offload begin target(mic:i) and !dec$ end offload to enclose all the variables that need to be allocated on MIC or copied from the host before the cycle starts.
The clause in(varname:length(len)) is used with the alloc_if(.true.) free_if(.false.) options to allocate memory for a variable and initialize it with the values on the host by a memory copy. However, many variables are used as temporary space and do not have to be initialized. In this case, we use the keyword nocopy, as in nocopy(tmp:length(len_tmp) alloc_if(.true.) free_if(.false.)), to avoid extra memory copies between the host and a MIC coprocessor. When all the calculations are completed, an out clause is declared with the alloc_if(.false.) free_if(.true.) options and placed between the offload begin and end offload directives to copy the newest conservative flow variables to the host memory and free the variables in the device memory. The memory arrangement described above is illustrated in Listing 2.

(1) Conduct mesh assembly to obtain interpolation information over the mesh system (MS)
(2) Read flow configuration file and overlapped mesh data
(3) Initialization
(4) repeat
(5)   for each iblock in MS do
(6)     Apply boundary conditions to iblock
(7)   end for
(8)   W_s(CRI) = f(W_t(CPI)): exchange interpolated data between overlapped block pair (s, t), where CRI: cells receiving interpolation data and CPI: cells providing interpolation data
(9)   W_i(ghost cells) ⟵ W_j and W_j(ghost cells) ⟵ W_i: exchange each one-to-one block data at block interface connecting block i and block j
(10)  for each iblock in MS do
(11)    Calculate F_c^i, F_c^j, and F_c^k fluxes on iblock
(12)    AF time advancement and update solutions on iblock
(13)  end for
(14) until convergence
(15) Output flow data
LISTING 1: Procedure of solving equation (1) on overlapped mesh system.

Inside the cycle, many MIC kernels are launched by the host, and the nocopy keyword is used again to keep the data declared outside the cycle persistent across offloads. For example, as shown in Listing 2, Fluxj_mic is declared as a function that can be called on MIC directly, and it reuses the array mbc_n, which is declared and allocated outside the cycle, without any extra memory operations during the cycle. At the end of the cycle, mbc_n can either be copied out to host memory or simply deallocated on the coprocessor. This concept of memory arrangement, which has been applied in other applications [6,7], prevents frequent data copies inside the cycle from affecting the overall efficiency.

Vectorization.
Vectorization is an effective way to improve computing speed on CPUs by using the Streaming SIMD (single instruction, multiple data) Extensions (SSE). Similarly, MIC coprocessors support 512-bit-wide Knights Corner instructions, which allow 16 single-precision or 8 double-precision floating-point operations at the same time. In this section, we explore vector optimization to make full use of the MIC computational resources.
Generally, there are two different ways to implement the flux calculations. The first method is to perform a nested loop over all mesh cells (or mesh nodes). It calculates all the fluxes that go in or out of a cell's interfaces and accumulates them for the cell. Vector operations can be applied to the innermost loop by using Intel autovectorization or manual SIMD directives. However, this algorithm involves redundant computations, because the flux of an interface shared by two cells is computed twice. The literature [18] has shown that redundant computing is not harmful to a GPU implementation because it can hide the latency of global memory accesses on GPUs. An alternative way of computing fluxes, which is more popular for CPU computing and is employed in this work, is to evaluate the flux of every edge only once by performing three nested loops along the three directions. The fluxes of the edges are then accumulated into the cells that contain them. However, there are still different coding strategies for the latter technique. We investigate two of them and compare their effects on vectorization.
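The difference between the two ways of computing fluxes can be seen in a one-dimensional toy example. The sketch below is illustrative only: the function names and the simple upwind flux are assumptions, and the real solver works on 3D blocks with an FDS flux.

```python
def residual_face_based(q, flux):
    """Evaluate each interior face flux once and scatter it to its two cells."""
    n = len(q)
    res = [0.0] * n
    evaluations = 0
    for f in range(n - 1):              # face f lies between cells f and f+1
        face_flux = flux(q[f], q[f + 1])
        evaluations += 1
        res[f] += face_flux             # leaves the left cell
        res[f + 1] -= face_flux         # enters the right cell
    return res, evaluations


def residual_cell_based(q, flux):
    """Let each cell recompute the fluxes of both of its faces (redundant)."""
    n = len(q)
    res = [0.0] * n
    evaluations = 0
    for c in range(n):
        if c + 1 < n:
            res[c] += flux(q[c], q[c + 1])
            evaluations += 1
        if c - 1 >= 0:
            res[c] -= flux(q[c - 1], q[c])
            evaluations += 1
    return res, evaluations


def upwind_flux(q_left, q_right):
    """Toy first-order upwind flux for unit advection speed."""
    return q_left
```

Both variants produce the same residuals, but the cell-based loop evaluates every interior face twice. The face-based loop halves the flux evaluations at the price of a scatter into two cells, which is the trade-off discussed above.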
One of the coding strategies (CS-1) is to conduct the flux computation along each direction in turn. For example, when $F_i$ is considered, each thread is responsible for visiting a piece of the solution array along the i direction, denoted as w(j, k, 1 : id, n). As discontinuous memory accesses occur when higher dimensions of w are accessed, loading the piece of w into a one-dimensional temporary array at the beginning is an effective way to reduce cache misses in the subsequent process. The remaining flux calculations can then work entirely on the temporary array. Two directives, !dir$ ivdep or !dir$ simd, can be declared to vectorize all the loops over the temporary array. Figure 2(a) shows the pseudocode of this strategy. The advantage of this approach is that the code for flux computation along different directions stays almost the same, which reduces the workload of redevelopment.
Another method (CS-2) performs the $F_j$, $F_k$, and $F_i$ flux computations along the lowest dimension of w. Fluxes along different directions are computed by nested loops with the same innermost loop, which guarantees contiguous memory accesses no matter which direction is under consideration. However, the loop bodies inside the nested loops vary from one direction to another. In this circumstance, vectorization directives are either applied to the innermost (lowest-dimension) loop, with the two outer loops unrolled manually to increase the parallelism [7], or applied to the two inner loops after merging them into one loop to make full use of the VPU on MIC. The method that merges the two inner loops is employed in the present paper, and the pseudocode is shown in Figure 2(b). It should be noted that the Knights Corner instructions [38], which can be applied to perform vectorization more elaborately, are not used in this work, because this assembly-level optimization may result in poor code readability and increase the overhead of code maintenance.
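The loop-merging idea of CS-2 can be illustrated on a flat, column-major array, which mimics how a Fortran array w(i, j, k) is laid out in memory. The sketch below takes a simple difference along k as a stand-in for a flux computation; the function names and the zero-based indexing are choices made for this example, not the solver's code.

```python
def index(i, j, k, ni, nj):
    """Column-major (Fortran-like) offset of element (i, j, k), zero-based."""
    return i + ni * (j + nj * k)


def k_differences_merged(w, ni, nj, nk):
    """Difference along k with the i and j loops merged into one loop.

    For a fixed k, the ni * nj elements of a plane are contiguous, so the
    merged loop runs over a single long unit-stride range.
    """
    plane = ni * nj
    out = [0.0] * (plane * (nk - 1))
    for k in range(nk - 1):
        lo = k * plane
        hi = (k + 1) * plane
        for m in range(plane):          # merged (i, j) loop, unit stride
            out[lo + m] = w[hi + m] - w[lo + m]
    return out
```

A merged loop of length ni * nj gives the vector unit a much longer trip count than a loop over i alone, which matters when one block dimension is not a multiple of the vector width.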

Host-Device Communication Optimization.
As described in Section 4.1, our task allocation strategy is to distribute workload by MBCs, which avoids the data exchange on one-to-one interfaces across MIC coprocessors but inevitably leads to communication across MIC coprocessors because of interpolation among MBCs.
Although workloads located on different MIC devices are performed in parallel, they are still dispatched one by one when they are on the same coprocessor. As an MBC corresponds to a specific MIC device, the mesh blocks in the MBC are calculated and updated one by one. Therefore, a mesh block that contains CPIs does not have to wait for the computation of other mesh blocks to complete before copying its data to the host. More specifically, when the solution of a mesh block has been updated successfully during a block loop (Listing 1, line 10), the block can immediately start a communication request to copy the interpolation data provided by its CPIs to the host. The overhead of copying data from a mesh block to the host can then be hidden by the flow computing of other mesh blocks on the same device. To implement this idea, we take advantage of the asynchronous transfer of the offload mode. Because an overlapped mesh that consists of 2 MBCs is the most common case, we present the pseudocode of our communication optimization algorithm in Algorithm 1 using 2 MBCs and 2 MICs.
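The overlap of copying and computing can be mimicked on a CPU with a background worker. The sketch below is only an analogy in Python: `compute` and `transfer` are hypothetical callables standing in for the MIC kernels and the device-to-host copy, and the single worker thread models one copy channel, which is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def run_cycle(blocks, compute, transfer):
    """Update blocks one by one and start each copy as soon as a block is done."""
    pending = []
    with ThreadPoolExecutor(max_workers=1) as channel:    # one copy channel (assumed)
        for block in blocks:
            result = compute(block)                        # spatial + temporal step
            # Non-blocking request, analogous to posting a transfer with a signal.
            pending.append(channel.submit(transfer, result))
        # Analogous to waiting on all signals before the interpolation exchange.
        return [future.result() for future in pending]
```

While the worker handles the copy of one block, the main loop has already moved on to the next block, so only the copy of the last block cannot be hidden.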
Outside the main procedure of numerical calculations, a loop over all MIC devices is performed. To make each device responsible for a part of the whole computation, we place the OpenMP directive omp parallel do before the loop to start a number of threads, each of which maps to a device. According to the memory arrangement discussed in Section 4.2, memory allocation and initialization have been done on the devices before the computation starts. Therefore, with this thread-device mapping, each device can access its own device memory, start computing, and use shared host memory for the necessary data exchange. At the end of the time advancement, the offload_transfer directive is used with the signal option to post a request for data transfer from the device to the host; it returns immediately and allows the device to keep running the next MIC kernels. So, the data transfer and the block computing can be conducted at the same time. Once all the MIC kernels in a steady step have finished, the CPU checks the signals by using offload_wait to make sure that all the transfers have completed before exchanging the interpolation data. Then, the CPU uses the master thread to distribute the interpolation data from the CPIs to the corresponding positions of the CRIs. Vectorization can be employed in this part because only assignment operations are involved. After that, the CPU starts data transfer requests to copy the data back to each device. At the same time, the next cycle begins, and the MIC kernel set_boundary_condition is expected to run on MIC, overlapping with the overhead of transferring the interpolation data back to the MIC devices.
In some cases, the last mesh block residing on a MIC device may have the largest number of CPIs, and it is difficult to overlap the data transfer from the device to the host with computation because there are no more blocks left to compute. As the overlapped region involves only a part of the mesh blocks, we sort the mesh blocks in descending order of the number of CPIs to avoid transferring data with insufficient overlapped computation.
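The effect of the block ordering can be estimated with a simple timeline model. The sketch below assumes that blocks are computed one after another, that a copy can start only after its block has been updated, and that copies are serialized on one channel; the time values in the usage note are made up for illustration and are not measurements.

```python
def cycle_time(blocks):
    """Finish time of one cycle for blocks given as (compute_time, transfer_time)."""
    compute_end = 0.0
    transfer_end = 0.0
    for compute_time, transfer_time in blocks:
        compute_end += compute_time                 # blocks run one by one
        start = max(compute_end, transfer_end)      # copies are serialized
        transfer_end = start + transfer_time
    return max(compute_end, transfer_end)


def sort_by_cpi(blocks):
    """Put the block with the largest transfer (most CPIs) first."""
    return sorted(blocks, key=lambda b: b[1], reverse=True)
```

With three blocks of equal compute time 1.0 and transfer times 0.0, 0.2, and 0.9, the ascending order leaves the 0.9 copy fully exposed after the last block, whereas the descending order hides it behind the remaining computation.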

Testing Environment.
Our experiments were conducted on the YUAN cluster at the Computer Network Information Center of the Chinese Academy of Sciences. The cluster has a hybrid architecture that consists of both MIC and GPU nodes. Each MIC node has two Intel E5-2680 V2 (Ivy Bridge, 2.8 GHz, 10 cores) CPUs and two Intel Xeon Phi 5110P MIC coprocessors. The memory capacity of the host and of the coprocessors is 64 GB and 8 GB, respectively. Two environment variables, MIC_OMP_NUM_THREADS and MIC_KMP_AFFINITY, were set to investigate the performance under different numbers of OpenMP threads and different affinity types. The Intel Fortran compiler (version 2013_sp1.0.080) was used with the highest optimization level -O3 and the -openmp option to compile the solver.

Results
We used two ONERA M6 wings (shown in Figure 3), each of which was configured with four 129 × 113 × 105 subblocks. The lower wing and its mesh system were formed by translating the upper wing down along the Y-axis by the length of the wing, so that the two mesh systems overlapped with each other. Figure 4 shows a closer view of the overlapped mesh system at the symmetry plane before mesh assembly. Figure 5 shows a closer view of the mesh system after mesh assembly was performed.
The mesh assembly procedure automatically formed an optimized region where the interpolation relationship between the two mesh systems can be identified, which is shown in the middle of the upper and lower wings.
The initial condition for this case is as follows: Mach number $M = 0.84$, angle of attack $\alpha = 3.06°$, and slip angle $\beta = 0°$. As the host node has two Intel MIC coprocessors, we made each MIC coprocessor responsible for the workload of one mesh system.
The results presented below start with the performance of six kernels (three spatial and three temporal) running on MIC. Tables 1-3 show the wall clock times for one cycle obtained with different combinations of the number of threads and the thread affinity type, together with those obtained on CPUs. The wall clock time for one cycle was evaluated by averaging the times of the first 100 cycles. The second coding strategy for vectorization of the flux computation was employed in this test. In each table, the first column lists the number of threads from 4 to 118, which is followed horizontally by two three-column blocks. Each block corresponds to a specified MIC kernel and lists wall clock times for different thread affinity types. A column in such a block gives the wall clock times on MIC as the number of threads varies under a given thread affinity type. In order to observe the scalability of the MIC kernels as the number of threads increases, we report the relative speedup $t_4/t_p$, where $t_4$ is the wall clock time measured with 4 threads and $t_p$ is that measured with $p$ threads. The relative speedups are listed in each column to the right of the wall clock times. Furthermore, to compare the performance on the MIC architecture with that on CPUs, we also show the corresponding fully vectorized CPU time for each kernel in the last row of each table and calculate the absolute speedup by dividing the CPU time by the minimum of the three values obtained with 118 threads under the different affinity types. The results in all three tables show that the wall clock times obtained by the scatter and balanced modes have an advantage over those obtained by the compact mode.
This can be explained by the fact that the compact mode places adjacent threads on the same physical core as far as possible in order to maximize utilization of the L2 cache when adjacent threads need to share data with each other, but this can result in load imbalance among physical cores. Our implementation makes each thread load independent data into a temporary array and operate on the array locally, which is more suitable for both scatter and balanced modes because the local operations in each thread make better use of the L1 cache on a core without any interference from other threads.
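The relative and absolute speedups used in the tables can be computed as in the following sketch. The function names and the numbers in the usage note are hypothetical and serve only to show the arithmetic; they are not values from Tables 1-3.

```python
def relative_speedups(times_by_threads, base_threads=4):
    """Return {p: t_base / t_p} for wall clock times keyed by thread count."""
    base = times_by_threads[base_threads]
    return {nt: base / t for nt, t in times_by_threads.items()}


def absolute_speedup(cpu_time, mic_times_at_max_threads):
    """CPU time divided by the best MIC time over the affinity types."""
    return cpu_time / min(mic_times_at_max_threads.values())
```

For example, made-up times of 2.0, 1.0, and 0.5 seconds at 4, 8, and 16 threads give relative speedups of 1, 2, and 4, and a CPU time of 3.0 seconds against MIC times of 1.5, 1.0, and 1.2 seconds for the three affinity types gives an absolute speedup of 3.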
As the balanced mode distributes threads to cores in the same way as the scatter mode does when NT ≤ 59, the results obtained from these two modes are expected to be the same. However, slight differences exist between them in Tables 1-3 because of timing variations between different runs. When NT = 118, each core was responsible for two threads in both the balanced and scatter modes, but the modes differed in that the balanced mode kept two adjacent threads on the same core whereas the scatter mode kept them on different cores. However, the results obtained by these two modes did not show an obvious gap in this case. This might be caused by cache contention among threads on a single MIC core when it ran more than one thread.
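The placement rules of the three affinity types can be modelled in a few lines. The sketch below is a simplified model, not the OpenMP runtime's actual algorithm; the function name is hypothetical, and the default of 59 usable cores with 4 hardware threads each is an assumption inferred from the thread counts quoted above.

```python
def core_of_thread(tid, n_threads, n_cores=59, threads_per_core=4, mode="balanced"):
    """Return the core index that thread `tid` is placed on (simplified model)."""
    if mode == "compact":
        # Fill one core completely before moving to the next.
        return tid // threads_per_core
    if mode == "scatter":
        # Deal threads out to cores round-robin.
        return tid % n_cores
    if mode == "balanced":
        # Spread threads evenly, keeping consecutive thread ids on the same core.
        base, extra = divmod(n_threads, n_cores)
        if base == 0:
            return tid                       # fewer threads than cores
        boundary = extra * (base + 1)        # first `extra` cores hold one more
        if tid < boundary:
            return tid // (base + 1)
        return extra + (tid - boundary) // base
    raise ValueError("unknown affinity mode: " + mode)
```

In this model, 118 threads occupy all 59 cores in both scatter and balanced modes, with threads 0 and 1 sharing a core only in balanced mode, while compact mode packs the same 118 threads onto 30 cores.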
(1) !$omp parallel do private(idev, ib, ...)
(2) do idev = 0, 1
(3)   repeat
(4)     offload target(mic:idev): set_boundary_condition for each block
(5)     if (icycle > 1) offload target(mic:idev) wait(sgr(idev)): set_CRI_to_domain
(6)     offload target(mic:idev): exchange_interface_data
(7)     do ib = 1, nb(idev)
(8)       offload target(mic:idev): spatial_step
(9)       offload target(mic:idev): temporal_step
(10)      offload target(mic:idev): compute_CPI
(11)      offload_transfer target(mic:idev) out(CPI_ib) signal(sgp_idev(ib))
(12)    end do
(13)    master thread: offload_wait all sgp_idev related to each device
(14)    ...
ALGORITHM 1: Communication optimization algorithm using 2 MBCs and 2 MIC devices.

It is noticed that, except for $F_j$, the wall clock times of the other five MIC kernels no longer decrease effectively when more than 59 threads are used in both balanced and scatter modes. This can be explained by the fact that the OpenMP directives in these kernels work on the outermost loop, and the maximum number of workload pieces equals 128 (or 112, 104); so, it was difficult to balance the workload among threads when 118 threads were launched. For $F_j$, we vectorized along the innermost direction and manually combined the two nested loops into one loop to increase the parallelism by slightly modifying the code. As a result, $F_j$ scaled better than $F_k$ and $F_i$ did as the number of threads increased, whereas $F_k$ and $F_i$ have an advantage in wall clock time because they maximize the use of the VPU on MIC, at the expense of scalability.
A speedup of 2.5x was obtained on the MIC coprocessor for $TA_j$, which is the lowest speedup among the three kernels. That was mainly caused by cache misses and poor vectorization when each thread loaded the j-k plane of the solution array (w(j, k, i, n)) into the temporary space and reshaped it along the k direction. The same reason explains the results shown in Figure 6, which compares the two coding strategies (CS) for the flux computation of $F_k$ and $F_i$ in balanced mode. The wall clock times for the four kernels were evaluated by performing the first 100 steps of the solver. CS-1 conducted 10 more discontinuous array loadings than CS-2 did, which made vectorization far from efficient on MIC coprocessors. Although the discontinuous memory accesses were forcibly vectorized by the simd clause, they turned out to be less effective on MIC coprocessors. Also, we can observe clear efficiency drops when more than 16 threads were used in CS-1, because we have done 2D vectorization in CS-1, which made it hard for MIC to keep the workload balanced among threads.
Then, we split each of the original blocks into 4 subblocks along both the i and j directions to form a mesh system of 16 blocks for each wing, and we report the wall clock times for 500 steps obtained by running the case on the CPU and on two MIC devices, respectively. Figure 7 shows wall clock time comparisons for different block sizes. Solving equation (1) on 8 overlapped mesh blocks using two Intel Xeon Phi 5110P MIC cards achieves a 5.9x speedup compared to sequential calculations with full optimizations, whereas solving the same problem on 32 overlapped mesh blocks using two MIC cards achieves only 3.6x. This is to be expected, because threads are unlikely to stay busy with a balanced workload when the dimensions of the blocks are relatively small, as stated above, and the six core functions can achieve only about 2x speedup on a single MIC device. Furthermore, it is observed that the sequential calculations using 32 mesh blocks spent about 10% more time than those using 8 mesh blocks. This can be explained by the fact that larger mesh blocks make better use of Intel SSE vectorization.
To show the correctness and accuracy of the parallel solver on the MIC architecture, we plot the pressure contours at the symmetry plane in Figure 8 and compare the corresponding pressure coefficients on the airfoils calculated by MICs with those calculated by CPUs in Figure 9. We can see clearly from Figure 9 that our solver running on MICs captures each shockwave that is expected on each airfoil for this transonic problem and produces pressure coefficients in agreement with those calculated by CPUs.
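The load-imbalance argument for kernels parallelized over the outermost loop can be checked with simple arithmetic. The sketch below assumes that whole loop iterations are divided as evenly as possible among threads and that all iterations cost the same; both are simplifying assumptions, and the function name is hypothetical.

```python
import math


def static_schedule_efficiency(n_iterations, n_threads):
    """Return (parallel efficiency, iterations on the busiest thread)."""
    busiest = math.ceil(n_iterations / n_threads)
    efficiency = n_iterations / (n_threads * busiest)
    return efficiency, busiest
```

With 128 iterations, 59 threads leave the busiest thread with 3 iterations, while 118 threads still leave it with 2, the same as 64 threads would. Doubling the thread count from 59 to 118 can therefore shorten the loop by at most a factor of 1.5 in this model, and the efficiency drops from about 72% to about 54%.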

Summary
We have developed and optimized a CFD solver that supports overlapped meshes on multiple MIC coprocessors. We demonstrated and compared different coding strategies for vectorization using a case with two M6 wings and analysed the results in detail. An asynchronous method has been employed to keep the data exchange from degrading the overall efficiency. Calculations on the case achieved a 5.9x speedup using two MIC devices compared to the same case using an Intel E5-2680 processor. Our future work includes extending this solver to unsteady flow calculations that involve relative motion among overlapped mesh blocks on multiple MIC devices.

Figure 1: Flow chart of calculations on multiple MIC coprocessors.

Figure 5: Closer view of two overlapped mesh systems after mesh assembly.

Figure 4: Closer view of two overlapped mesh systems before mesh assembly.

Figure 7: Wall clock times for 500 steps on CPU and MIC.

Table 1: Wall clock times for $F_j$ and $F_k$ (in seconds). NT: number of threads.

Table 2: Wall clock times for $F_i$ and $TA_j$ (in seconds). NT: number of threads.

Table 3: Wall clock times for $TA_k$ and $TA_i$ (in seconds). NT: number of threads.