The Immersed Boundary-Lattice Boltzmann Method Parallel Model for Fluid-Structure Interaction on Heterogeneous Platforms

The immersed boundary-lattice Boltzmann method (IB-LBM) has become a popular method for studying fluid-structure interaction (FSI) problems. However, the performance of the IB-LBM has to be considered when simulating practical problems. Graphics Processing Units (GPUs) from NVIDIA offer one route to parallel computing, while the multicore CPU can also improve parallel performance. This paper proposes a parallel algorithm for the IB-LBM on a CPU-GPU heterogeneous platform, in which the CPU not only controls the launch of the kernel functions but also performs calculations. According to the relatively local calculation characteristics of the IB-LBM and the features of the heterogeneous platform, the flow field is divided into two parts: a GPU computing domain and a CPU computing domain. CUDA and OpenMP are used for parallel computing on the two computing domains, respectively. Since the calculation time is shorter than the data-transmission time, a buffer is set at the junction of the two computational domains. The size of the buffer determines how many evolutions of the flow field are performed before each data exchange; the number of communications can therefore be reduced by increasing the buffer size. The performance of the method is investigated and analyzed using the traditional MFLUPS metric. The new algorithm is applied to the computational simulation of red blood cells (RBCs) in Poiseuille flow and through a microchannel.


Introduction
In the immersed boundary-lattice Boltzmann method (IB-LBM), the flow field is solved by the lattice Boltzmann method (LBM), and the interaction between the fluid and the immersed object is solved by the immersed boundary method (IBM) [1]. Since Feng and Michaelides [2] first successfully applied the IB-LBM to simulate the motion of rigid particles, the IB-LBM has become a popular method for studying fluid-structure interaction (FSI) problems. Eshghinejadfard et al. [3] used the IB-LBM to study turbulent channel flow in the presence of spherical particles. Cheng et al. [4] developed an IB-LBM for simulating multiphase flow with solid particles and liquid drops. The IB-LBM is also widely used to study red blood cells (RBCs) in blood flow, such as the interaction between plasma flow and RBC movement [5] and the aggregation and deformation of RBCs in an ultrasound field [6]. Simulations of practical FSI problems, like blood flow, demand considerable computational resources [7]; parallel computing can greatly reduce the simulation time.
The architecture of the Graphics Processing Unit (GPU) offers high floating-point performance and high memory-to-processor-chip bandwidth [8]. The LBM simulation has the characteristics of regular, static, and local calculations [9], which can be greatly accelerated by means of the GPU [10,11]. Shang et al. [12] used MPI to show that the LBM has good parallel acceleration performance. By proposing two new implementation methods, LBM-Ghost and LBM-Swap, Valero-Lara [13] reduced the huge memory requirements of simulating the LBM on the GPU and proposed [14] three different parallel approaches for grid refinement based on a multidomain decomposition. The homogeneous GPU and heterogeneous CPU + GPU approaches [15] for mesh refinement based on multidomain and irregular meshing can achieve better performance. In terms of the IB-LBM's parallel computing, Valero-Lara et al. [16] proposed a numerical approach parallelizing the IB-LBM on both GPUs and a heterogeneous GPU-multicore platform. Boroni et al. [17] presented a fully parallel GPU implementation of the IB-LBM, in which the GPU kernel performs the fluid-boundary interactions and accelerates each code execution. Recently, Beny and Latt [9] proposed a new method for implementing the IB-LBM on GPUs, which allows handling substantially larger immersed surfaces. Other parallel approaches, compiler-directive open multiprocessing (OpenMP) [18] and distributed-memory clusters [19], can achieve multicore parallel processing with shared memory on the CPU; for instance, a nonuniform staggered Cartesian grid approach [20] considerably reduces the complexity of communication and memory management between different refinement levels.
In this paper, a new parallel algorithm for the IB-LBM is proposed on a CPU-GPU heterogeneous platform. Conventionally, while the GPU is performing the simulation, the CPU is in a waiting state, which wastes computing resources. The new algorithm avoids this problem. On the one hand, the flow field is divided into a CPU domain and a GPU domain, which are calculated with the Compute Unified Device Architecture (CUDA) and OpenMP, respectively. According to the calculation process of the IB-LBM, after each evolution of the flow field, data is exchanged at the junction of the two calculation domains. On the other hand, the calculation time is shorter than the communication time, so buffers are introduced to reduce the number of data communications. The size of the buffer determines the number of times the flow field evolves before data is exchanged; a buffer of proper size can reduce data communication and improve computing performance. As expected, this strategy greatly improves calculation efficiency without affecting the calculation accuracy of the IB-LBM. The remainder of this paper is organized as follows. Section 2 introduces the IB-LBM. Section 3 illustrates the parallel algorithm on the CPU-GPU heterogeneous platform that makes full use of computing resources. Section 4 shows the benchmark tests and analyzes the parallel performance. RBCs in Poiseuille flow and through a microchannel are simulated in Section 5. Finally, Section 6 concludes this study.

Immersed Boundary-Lattice Boltzmann Method
The IB-LBM is a flexible and efficient method for simulating fluid flow through moving or complex boundaries. The computational area of the IB-LBM consists of two types of points: Lagrangian points and Eulerian points (see Figure 1). The Euler points represent the entire flow field, and the Lagrangian points describe the boundary of the immersed object.
The body force of the object is calculated from the deformation of the boundary and is then spread to the Euler points of the flow field using the Dirac delta function. Finally, the macroscopic density and velocity of the flow field are solved.
The DnQm LBGK is the most widely used model for solving the flow field in the IB-LBM, where n is the spatial dimension and m is the number of discrete velocities [21,22]. In this paper, the most commonly used model, D2Q9 (see Figure 2), is employed. The corresponding discrete velocities $e_i$ are

$$e_i = \begin{cases} (0,0), & i = 0,\\ c\left(\cos\frac{(i-1)\pi}{2}, \sin\frac{(i-1)\pi}{2}\right), & i = 1,\dots,4,\\ \sqrt{2}\,c\left(\cos\frac{(2i-9)\pi}{4}, \sin\frac{(2i-9)\pi}{4}\right), & i = 5,\dots,8, \end{cases}$$

where $c = \Delta x/\Delta t$ is the lattice velocity and $\Delta x$ is the lattice spacing. The LBGK model can be expressed as

$$f_i(x + e_i\Delta t, t + \Delta t) - f_i(x,t) = -\frac{1}{\tau}\left[f_i(x,t) - f_i^{eq}(x,t)\right] + \Delta t\,F_i(x,t),$$

where $f_i(x,t)$ is the particle distribution function, $x$ is the current lattice location, $t$ is the current time, $\Delta t$ is the lattice time step, and $\tau$ is the relaxation time, determined by the kinematic viscosity $\upsilon$ of the fluid:

$$\upsilon = c_s^2\left(\tau - \frac{1}{2}\right)\Delta t,$$

where $c_s$ is the speed of sound. $f_i^{eq}(x,t)$ is the equilibrium distribution function, which can be written as

$$f_i^{eq}(x,t) = \omega_i\,\rho\left[1 + \frac{e_i\cdot u}{c_s^2} + \frac{(e_i\cdot u)^2}{2c_s^4} - \frac{u\cdot u}{2c_s^2}\right],$$

where $u$ is the macroscopic velocity at the current lattice point and $\omega_i$ is the weight factor of the discrete velocity. In the D2Q9 model, the weights are $\omega_0 = 4/9$, $\omega_i = 1/9$ for $i = 1,\dots,4$, and $\omega_i = 1/36$ for $i = 5,\dots,8$. For boundary conditions, this paper adopts the nonequilibrium extrapolation method [23], which can be expressed as

$$f_i(x_b, t) = f_i^{eq}\big(\rho(x_f,t), u_b\big) + \left[f_i(x_f,t) - f_i^{eq}(x_f,t)\right],$$

where $\rho(x_f,t)$ is the density at the lattice point $x_f$ adjacent to the boundary lattice point $x_b$, and $u_b$ is the velocity of the boundary lattice point $x_b$. In the IB-LBM, the key problem is the calculation and spreading of forces at the Lagrangian points. The penalty force method [2] is one way to calculate the restoring force. Its basic idea is to apply Hooke's law of elastic deformation: regarding the boundary of the immersed object as an elastic boundary composed of Lagrangian points, the object is deformed by the force, and each Lagrangian point is displaced from the position of its reference point (see Figure 3).
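As a concrete illustration, the D2Q9 equilibrium distribution and LBGK collision described above can be sketched in a few lines. This is a minimal Python sketch in lattice units (c = 1), not the paper's CUDA/OpenMP implementation; the function names are illustrative.

```python
import numpy as np

# D2Q9 discrete velocities e_i and weights w_i (standard values).
E = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]], dtype=float)
W = np.array([4/9] + [1/9] * 4 + [1/36] * 4)
CS2 = 1.0 / 3.0  # speed of sound squared, c_s^2 = c^2/3 with c = 1

def equilibrium(rho, u):
    """Equilibrium distribution f_i^eq at one lattice point.

    rho: scalar density; u: macroscopic velocity, shape (2,).
    """
    eu = E @ u                      # e_i . u for each direction
    usq = u @ u                     # |u|^2
    return rho * W * (1 + eu / CS2 + eu**2 / (2 * CS2**2) - usq / (2 * CS2))

def bgk_collide(f, rho, u, tau):
    """Single-relaxation-time (LBGK) collision: relax f toward f^eq."""
    return f - (f - equilibrium(rho, u)) / tau
```

Two useful sanity checks follow directly from the moment conditions: the nine equilibrium populations sum to the density, and their first moment recovers the momentum.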
The restoring force due to the deformation is calculated by Hooke's law:

$$F_b(x_{l_i}, t) = -\varepsilon\left(x_{l_i} - x_{r_i}\right),$$

where $x_{l_i}$ is the Lagrangian point, $x_{r_i}$ is the reference point after the corresponding Lagrangian point has moved, and $\varepsilon$ is the stiffness coefficient, taken as 0.03.
In the calculation process, the positions of the Lagrangian points are held constant; after each lattice time step, the positions of the reference points are moved. The body force $F(X,t)$ in the flow field is obtained by interpolation of the restoring force:

$$F(X,t) = \sum_{i=1}^{n} F_b(x_i,t)\,\delta(X - x_i)\,ds_i,$$

where $X$ is the Euler point in the flow field, $x_i$ is the $i$-th Lagrangian point on the boundary, $F_b$ is the corresponding restoring force, $n$ is the total number of Lagrangian points on the boundary, and $ds_i$ is the length of the boundary element.
Here $\delta$ is a Dirac delta function used to interpolate the restoring force at the boundary points to the Euler points of the flow field. Dirac delta functions have a variety of expressions; the interpolation kernel used in this paper is

$$\delta(x,y) = \frac{1}{16h^2}\left(1 + \cos\frac{\pi x}{2h}\right)\left(1 + \cos\frac{\pi y}{2h}\right), \qquad |x| \le 2h,\ |y| \le 2h,$$

and zero otherwise, where $h$ is the lattice spacing. After interpolating the restoring force from the boundary points of the object to the Euler points of the flow field, the body force on each Euler point is discretized according to the DnQm model [24]:

$$F_i = \left(1 - \frac{1}{2\tau}\right)\omega_i\left[\frac{e_i - u}{c_s^2} + \frac{e_i\cdot u}{c_s^4}\,e_i\right]\cdot F.$$

Accordingly, the macroscopic density and velocity at each lattice point are

$$\rho = \sum_i f_i, \qquad \rho u = \sum_i e_i f_i + \frac{\Delta t}{2}F.$$

After obtaining the macroscopic velocity at each lattice point, the velocity of each boundary point is calculated by interpolation, and the deformation displacement of the boundary point is then obtained. The calculation steps of the IB-LBM are summarized in Figure 4.
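The force-spreading step above can be sketched as follows. This is an illustrative Python sketch using the tensor-product form of the cosine kernel with unit lattice spacing; `spread_force` and its array layout are hypothetical choices, not taken from the paper.

```python
import numpy as np

def dirac_delta(r, h=1.0):
    """1-D cosine kernel: (1/(4h)) * (1 + cos(pi*r/(2h))) for |r| <= 2h, else 0."""
    r = np.abs(r)
    return np.where(r <= 2 * h, (1 + np.cos(np.pi * r / (2 * h))) / (4 * h), 0.0)

def spread_force(F_lag, X_lag, shape, ds=1.0, h=1.0):
    """Spread Lagrangian forces onto the Eulerian grid:
    F(X) = sum_i F_b(x_i) * delta(X - x_i) * ds_i.

    F_lag: (n, 2) forces; X_lag: (n, 2) Lagrangian positions;
    shape: (nx, ny) Eulerian grid size.
    """
    nx, ny = shape
    F_eul = np.zeros((nx, ny, 2))
    xs, ys = np.arange(nx), np.arange(ny)
    for (px, py), fb in zip(X_lag, F_lag):
        wx = dirac_delta(xs - px, h)     # 1-D weights along x
        wy = dirac_delta(ys - py, h)     # 1-D weights along y
        w = np.outer(wx, wy)             # 2-D tensor-product delta
        F_eul += w[..., None] * fb * ds
    return F_eul
```

A useful property of this kernel on an integer grid is that its discrete sum is one, so the total spread force equals the total Lagrangian force.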

IB-LBM Parallel Model on Heterogeneous Platforms
For the development of a parallel IB-LBM algorithm, it is important to make the most of the computing resources on CPU-GPU heterogeneous platforms. The traditional parallel algorithm on such platforms divides the work into a GPU part, which computes the LBM, and a CPU part, which executes the IBM. However, the amounts of calculation in the IBM and LBM are unbalanced, which may cause the GPU or the CPU to wait for the other.
On the other hand, after the evolution of each time step, some GPU and CPU data need to be transferred, which adds extra time consumption. This paper attempts to provide a new approach that makes full use of the computing resources of both GPU and CPU and improves on the traditional algorithms. A GPU contains multiple multiprocessors, and each multiprocessor contains multiple processors, called cores. Therefore, in single-instruction multiple-data mode, the GPU can process a large amount of data at the same time.
The solution of fluid-structure interaction problems by the IB-LBM is relatively local. This feature matches the GPU architecture well and can yield good performance. To simulate the IB-LBM on the GPU, the flow field needs to be divided into a grid of blocks.
In the GPU programming model, threads are grouped into blocks, and all blocks form a grid, which can take a three-dimensional shape. In the three-dimensional case, a thread within a block is identified by the thread coordinates threadIdx.x, threadIdx.y, and threadIdx.z. Similarly, the position of a block is identified by the block coordinates blockIdx.x, blockIdx.y, and blockIdx.z. The same scheme applies to the two-dimensional and one-dimensional cases.
As shown in Figure 5, the flow field is mapped to a two-dimensional grid containing n × m blocks, and each block contains i × j threads. Each thread calculates one lattice point in the flow field. The index of the array can be obtained by

$$index = (blockIdx.y \cdot blockDim.y + threadIdx.y)\cdot width + blockIdx.x \cdot blockDim.x + threadIdx.x,$$

where $width = gridDim.x \cdot blockDim.x$ is the width of the flow field in lattice points; blockDim.x and blockDim.y are the dimensions of the block, i.e., the numbers of threads in the X and Y directions; and threadIdx.x, threadIdx.y, blockIdx.x, and blockIdx.y are the coordinates of the thread and of the block, respectively. When the entire flow field is divided into two-dimensional subregions of a specified size, that is, blocks, the total number of threads must match the total number of lattice points in the flow field, which avoids unnecessary resource consumption and extra calculation.

[Figure 4: IB-LBM calculation steps — calculate the restoring force using equation (7); interpolate the force from the Lagrangian points to the Euler points by equation (8); discretize the force into the LBGK model on the basis of equation (10); compute the equilibrium distribution function using equation (4); perform collision and streaming with equation (1); compute the macroscopic variables by means of equations (11) and (12).]
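The index mapping described above can be emulated outside CUDA to check that every lattice point is assigned to exactly one thread. The following Python sketch mirrors the CUDA built-ins; the function name is illustrative.

```python
def global_index(threadIdx, blockIdx, blockDim, gridDim):
    """Map (block, thread) coordinates to a flat array index, mirroring
    the CUDA expression
        x     = blockIdx.x * blockDim.x + threadIdx.x
        y     = blockIdx.y * blockDim.y + threadIdx.y
        width = gridDim.x * blockDim.x
        index = y * width + x
    All arguments are (x, y) tuples standing in for the CUDA built-ins.
    """
    x = blockIdx[0] * blockDim[0] + threadIdx[0]
    y = blockIdx[1] * blockDim[1] + threadIdx[1]
    width = gridDim[0] * blockDim[0]
    return y * width + x
```

Enumerating all (block, thread) pairs on, say, a 2 × 2 grid of 4 × 4 blocks should cover the indices 0..63 exactly once, i.e., one thread per lattice point with no gaps or duplicates.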
Although the efficiency of the IB-LBM simulation on the GPU is greatly improved, computing resources are still wasted: after a kernel function is launched, the calculations proceed on the GPU while control returns to the CPU, which sits in a waiting state. This paper proposes an algorithm on a CPU-GPU heterogeneous platform to avoid this situation. We divide the flow field into two parts: a GPU computing domain and a CPU computing domain. When control returns to the host, the CPU immediately performs the IB-LBM simulation on the CPU computing domain. The specific execution steps are shown in Figure 6. The launch time of a kernel function is very short and can be ignored, so the GPU and CPU can be regarded as evolving their respective computing domains at the same time. Multiple kernel functions can be launched from the CPU; after launch, they are queued for execution on the GPU and control returns to the CPU. The CPU launches the kernel functions of one time step and then performs a time-step simulation on the CPU, so that, once the calculations on both CPU and GPU are complete, one time step of the entire flow field has been simulated. To ensure that the calculation has finished before data transmission, GPU synchronization must be set: execution waits for the GPU-side calculation to end when the synchronization point is reached. The data exchange is completed on the CPU, so the CPU side needs no explicit synchronization. After synchronization, only a small amount of data needs to be exchanged: the density, velocity, and distribution functions at the boundary of the computing domains.
It can be seen from Figure 6 that the GPU and the CPU need to exchange data after each evolution of the flow field to ensure that the flow field evolves correctly. While the data is being exchanged, both the GPU and the CPU must wait; moreover, frequent data exchange adds extra time consumption. To mitigate these problems, this paper introduces the concept of a buffer at the junction of the two computational domains (see Figure 7).
In Figure 7, the blue area is the buffer, with the same number of lattice points on each side of the junction of the GPU and CPU computing domains. With the buffer, there is no need to exchange data after every evolution: the number of evolutions executed between exchanges corresponds to the size of the buffer. The red area in Figure 7 is the area that needs to be copied. The algorithm overwrites the CPU buffer-zone with the GPU copy-zone and then overwrites the GPU buffer-zone with the CPU copy-zone. In this way, the CPU and GPU computing domains both evolve correctly. By introducing a buffer, the number of data exchanges and CPU-GPU communications is greatly reduced, cutting the additional time consumption. The parallel algorithm on CPU-GPU heterogeneous platforms is illustrated as Algorithm 1.
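The buffer exchange can be sketched as follows. This Python sketch assumes a one-dimensional decomposition along the row axis, with each domain carrying a copy-zone and a buffer-zone of B rows at the junction; the layout and names are illustrative, not the paper's exact memory layout.

```python
import numpy as np

def exchange_buffers(gpu_dom, cpu_dom, B):
    """Exchange halo data after B evolutions (B = buffer width in rows).

    Assumed layout: the GPU domain ends with [copy-zone | buffer-zone]
    and the CPU domain starts with [buffer-zone | copy-zone], so the
    two domains overlap by 2*B rows at the junction.
    """
    # CPU buffer-zone <- GPU copy-zone (last owned rows of the GPU domain)
    cpu_dom[:B] = gpu_dom[-2 * B:-B]
    # GPU buffer-zone <- CPU copy-zone (first owned rows of the CPU domain)
    gpu_dom[-B:] = cpu_dom[B:2 * B]
```

After the exchange, each side's stale buffer rows hold the neighbor's freshly computed values, so both domains can evolve another B steps independently.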
The IB-LBM simulation contains four different calculations: the IBM part, the entire flow field, boundary processing, and buffer assignment. According to these four kinds of calculations, different grid divisions Grid_1, Grid_2, Grid_3, and Grid_4 and different block divisions Block_1, Block_2, Block_3, and Block_4 are used in Algorithm 1. Proper division makes better use of the performance of the GPU. The GPU can start multiple kernel functions simultaneously and then execute them sequentially. The start-up time of a kernel function is extremely short and can be ignored; therefore, Algorithm 1 approximates that the GPU and CPU start performing calculations at the same time.
The key problem of the proposed algorithm is how to divide the GPU computing domain and the CPU computing domain. Optimal performance can be achieved with appropriate mesh sizes in the two computational domains. Before the computing domain is divided, an IB-LBM simulation with the same mesh size is performed separately on the GPU and on the CPU, and the flow field is then divided in proportion to the GPU and CPU time consumption. Through the above design, a new algorithm on a CPU-GPU heterogeneous platform is obtained to simulate the IB-LBM (see Figure 8). The GPU and CPU simulate at the same time and exchange data through the buffer. The CUDA programming model is used on the GPU side and OpenMP on the CPU side to parallelize the IB-LBM.
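The proportional division can be sketched in a few lines. This is an illustrative Python sketch assuming the split is made along one axis in whole lattice rows; the paper additionally allows a small deviation of E lattice points around this ideal split.

```python
def split_domain(n_total, mflups_gpu, mflups_cpu):
    """Divide n_total lattice rows between GPU and CPU in proportion to
    their measured single-device throughputs (MFLUPS), so that both
    devices finish a time step at roughly the same moment."""
    frac_gpu = mflups_gpu / (mflups_gpu + mflups_cpu)
    n_gpu = round(n_total * frac_gpu)
    return n_gpu, n_total - n_gpu
```

For example, if the GPU alone measures 300 MFLUPS and the CPU 100 MFLUPS, the GPU receives three quarters of the lattice rows.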

Performance Analysis
This section tests the performance of the IB-LBM on a CPU-GPU heterogeneous platform through experiments on flow around a cylinder. The main characteristics of the experimental platform are shown in Table 1.

[Figure 6: IB-LBM calculation process and data exchange process on heterogeneous platforms.]

This paper uses the conventional MFLUPS metric (millions of fluid lattice updates per second) to assess performance. The following formula shows how MFLUPS is calculated for a specific simulation [25]:

$$MFLUPS = \frac{s \times N_{fl}}{10^6 \times T_s},$$

where $s$ is the total number of evolutions, $N_{fl}$ is the number of lattice points in the flow field, and $T_s$ is the execution time. The rule for dividing the flow field into the GPU computing domain and the CPU computing domain, equation (14), divides the lattice so that $N_{gpu} : N_{cpu} \approx MFLUPS_{gpu} : MFLUPS_{cpu}$, allowing a deviation of $E$ lattice points, where $N_{gpu}$ and $N_{cpu}$ are the numbers of lattice points in the GPU and CPU computing domains, $MFLUPS_{gpu}$ and $MFLUPS_{cpu}$ are the MFLUPS values measured on a single GPU and on the multicore CPU, and $E$ is the tolerated lattice error, in the range $-10^4 \le E \le 10^4$. The choice of buffer size is a key issue, and an appropriate buffer size can greatly improve calculation efficiency. After a large number of experiments and analysis of the results, the buffer size is chosen according to equation (15), which relates $N_{buffer}$, the number of buffer lattice points, to $N_{cpu}$, the number of lattice points in the CPU computing domain; the value of $E$ is the same as in equation (14).
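The MFLUPS metric itself is a one-liner; the following Python sketch implements the formula above directly.

```python
def mflups(steps, n_fluid_nodes, seconds):
    """MFLUPS = s * N_fl / (1e6 * T_s): millions of fluid-lattice
    updates per second over the whole run."""
    return steps * n_fluid_nodes / (1e6 * seconds)
```

For example, 1000 evolutions of a one-million-node flow field completed in 10 seconds correspond to 100 MFLUPS.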
[Algorithm 1 (excerpt):
// The data in the array is assigned to the buffer in the flow field.
buffer_to_field<<<Grid_4, Block_4>>>()
// The GPU performs the IB-LBM simulation.]

The size of the buffer has a relatively small impact on the performance of the simulation on the GPU side and a relatively large impact on the CPU side; therefore, the buffer size can be calculated from the size of the CPU computing domain. Buffers of different sizes were tried in multiple experiments, and the optimal results were compared against the CPU computing domain; these experiments were conducted at different flow-field scales, leading to the conclusion of equation (15). It can be seen from that expression that the size of the buffer is related to the number of lattice points in the CPU computing domain, which in turn depends on the platform equipment and the problem scale. Therefore, the size of the buffer depends on the platform features and the problem scale. The single-precision and double-precision performance tests of the algorithm are shown in Tables 2 and 3.
As can be seen from Tables 2 and 3, using a single GPU to simulate the IB-LBM already achieves good performance. The parallel performance of the CPU is far below that of the GPU, but the CPU is still a computing resource worth utilizing. On some hardware, only double precision is implemented natively; single precision is then emulated by converting to double precision, and the conversion costs time.
This can make double-precision performance faster than single-precision performance. Using a single GPU greatly improves computing performance; however, owing to the computing characteristics of the CUDA model, the CPU only controls the kernel functions, so its computing resources are not fully utilized. In the proposed algorithm, the CPU both controls the launch of the kernel functions and performs multithreaded calculations, so no resources are wasted. In addition, compared with the traditional heterogeneous method, communication between the CPU and the GPU is reduced. It can be concluded from Tables 2 and 3 that the performance of the new algorithm is improved, although, because of the small flow-field size, the improvement is not yet pronounced.
The experimental results also show that double-precision performance is much lower than single-precision performance on the GPU. This has two causes: on the one hand, the GPU's single-precision floating-point throughput is higher than its double-precision throughput; on the other hand, double-precision numbers increase memory traffic. Figure 9 shows the performance for enlarged flow-field scales in the single-precision case, where the performance improvement can be observed more clearly.
It can be seen from Figure 9 that, once the scale of the flow field reaches 6 million lattice points, the MFLUPS of the CPU-GPU heterogeneous platform with a buffer approaches the sum of the GPU and CPU multithreading values. When the number of lattice points is 1 million, a single GPU performs best; this is because the CPU computing domain assigned by the heterogeneous division is then relatively small, so the time spent creating and destroying threads has a larger impact on the total CPU simulation time. In parallel simulation on CPU-GPU heterogeneous platforms, not only communication but also synchronization affects performance; the buffer reduces the number of communications and synchronizations at the same time. Figure 9 also shows that using the buffer outperforms not using it.
Practical FSI studies often involve large problem scales and require many evolution steps, and parallel algorithms can effectively reduce the computation time. Most clusters now include both GPU and CPU devices. The algorithm proposed in this paper can make full use of such resources and can reduce the simulation time in studies of RBC aggregation and deformation in an ultrasonic field [6], viscous flow in large distensible blood vessels [26], the effect of suspending viscosity on RBC dynamics and blood flow behavior in microvessels [27], and other problems.

Application
RBCs are an important component of blood [28]. Recently, the IB-LBM has been used to simulate the deformation and motion of elastic bodies immersed in fluid flow, including RBCs [29]. Esmaily et al. [30] used the IB-LBM to investigate the motion and deformation of both healthy and sick RBCs in a microchannel with stenosis. Tan et al. [31] presented a numerical study of nanoparticle (NP) transport and dispersion in RBC suspensions with an IB-LBM fluid solver.
In this paper, the RBCs are suspended in blood plasma with density ρ = 1 and Reynolds number Re < 1. For boundary processing, the nonequilibrium extrapolation method is used. The cross-section profile of an RBC in the x-y plane [32] is given by

$$y(x) = \pm\frac{1}{2}\sqrt{1 - x^2}\left(c_0 + c_1 x^2 + c_2 x^4\right),$$

with $c_0 = 0.207$, $c_1 = 2.002$, and $c_2 = 1.122$, where $x$ is normalized by the cell radius. The two-dimensional RBC shape is shown in Figure 10: the red points are the Lagrangian points on the RBC, and the black points are Euler points. The force at the Lagrangian points of the RBC consists of two parts,

$$F = F_s + F_b,$$

where $s$ is the arc length, $F_s$ is the stretching/compression force, and $F_b$ is the bending force; $F_s$ and $F_b$ are obtained following [33]. Here, $N$ is the total number of Lagrangian points on the RBC, $k = 1, 2, \dots, N$; $(F_s)_k$ and $(F_b)_k$ are the elastic Lagrangian forces associated with node $k$; the stiffness values for $F_s$ and $F_b$ are $1.0 \times 10^{-10}$ and $1.0 \times 10^{-12}$, respectively; $X(s,t)$ denotes the Lagrangian points; and $\delta_{j,k}$ is the Kronecker delta.
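The cross-section profile and a Lagrangian discretization of the cell boundary can be sketched as follows. This Python sketch assumes the Evans-Fung-type form y(x) = ½√(1−x²)(c₀ + c₁x² + c₂x⁴) with the coefficients quoted above; the point count and radius in `lagrangian_points` are assumed values for illustration, not the paper's settings.

```python
import numpy as np

# Coefficients of the RBC cross-section profile quoted in the text.
C0, C1, C2 = 0.207, 2.002, 1.122

def rbc_profile(x):
    """Half-thickness y(x) of the RBC cross section for |x| <= 1
    (x normalized by the cell radius)."""
    x2 = np.clip(x, -1.0, 1.0) ** 2
    return 0.5 * np.sqrt(1.0 - x2) * (C0 + C1 * x2 + C2 * x2**2)

def lagrangian_points(n, radius=8.0):
    """n Lagrangian points on the closed RBC contour, scaled to `radius`
    lattice units (an assumed discretization)."""
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    x = np.cos(t)
    y = np.sign(np.sin(t)) * rbc_profile(x)   # upper/lower half of the contour
    return radius * np.stack([x, y], axis=1)
```

At the cell center the half-thickness is 0.5·c₀ of the radius, giving the characteristic biconcave dimple, and the thickness vanishes at the rim (|x| = 1).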
Blood flow in microvessels is better approximated by a Poiseuille flow than by a shear flow, which makes the study of RBCs in Poiseuille flow more physiologically realistic [34]. Therefore, this paper first applies the algorithm to study the dynamical behavior of an RBC in Poiseuille flow; the simulation domain is 30 × 100 grid points (see Figure 11). The RBC is initially perpendicular to the direction of fluid flow in Figure 11, and its geometry gradually changes from a biconcave to a parachute shape. The curvature after RBC deformation is closely related to the deformability of the boundary [35]. The RBC recovers its initial shape, associated with minimal elastic energy, when the flow stops [36-38].
Many diseases narrow the blood vessels, and a high degree of stenosis impairs the circulatory function of the human body. The study of RBCs passing through a stenotic region of a blood vessel is therefore of wide interest, and this paper applies the algorithm to this situation. Figure 12 shows the geometry of the microchannel.
In the geometry of the microchannel, X and Y are the numbers of Eulerian points in the x and y directions, with values 30 and 150; D is the diameter of the vessel and h is the diameter of the constriction, with D = 3h. Figure 13 shows the motion and deformation of an RBC through a microchannel with stenosis.
It can be seen from Figure 13 that the velocity of the fluid increases in the stenosis area, resulting in larger deformation of the RBC [39]. After passing through this area, the RBC returns to its initial shape owing to the decrease in fluid velocity; the RBC is very flexible yet resilient enough to recover its biconcave shape whenever the cell is quiescent [40].

Conclusions
In this paper, a new parallel algorithm on CPU-GPU heterogeneous platforms is proposed to implement the coupled lattice Boltzmann and immersed boundary method, and good performance is obtained on the heterogeneous platform investigated. The main contribution of this paper is to make full use of both CPU and GPU computing resources: the CPU not only controls the launch of the kernel functions but also performs calculations, and a buffer is set in the flow field to reduce the amount of data exchanged between the CPU and the GPU. The CUDA programming model and OpenMP are applied to the GPU and CPU, respectively, for the numerical simulation.
This method approximates simultaneous simulation of different regions of the flow field by the GPU and the CPU. The final experiments obtained relatively good performance: the MFLUPS value on the CPU-GPU heterogeneous platform is approximately the sum of the GPU value and the CPU multithreaded value. In addition, the proposed algorithm is used to simulate the movement and deformation of RBCs in Poiseuille flow and through a microchannel with stenosis, and the simulation results agree well with the literature. Our future work is to implement the method on large-scale GPU clusters and large-scale CPU clusters to simulate three-dimensional FSI problems.

Data Availability
All data, models, and codes generated or used during the study are included within the article.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.

[Figure 13: An RBC through a microchannel with stenosis at different times.]