
Key aspects of cardiac electrophysiology, such as slow conduction, conduction block, and saltatory effects, have been the topic of many studies, since they are strongly related to cardiac arrhythmia, reentry, fibrillation, or defibrillation. However, to reproduce these phenomena, numerical models need to use subcellular discretization for the solution of the PDEs, together with nonuniform, heterogeneous tissue electric conductivity. Due to the high computational costs of simulations that reproduce the fine microstructure of cardiac tissue, previous studies have considered tissue experiments of small or moderate sizes and used simple cardiac cell models. In this paper, we develop a cardiac electrophysiology model that captures the microstructure of cardiac tissue by using a very fine spatial discretization (8

Heart diseases are responsible for one third of all deaths worldwide [

Mathematical models for cell electrophysiology are a key component of cardiac modeling. They serve both as standalone research tools, to investigate the behavior of single cardiac myocytes, and as an essential component of tissue and organ simulation based on the so-called bidomain or monodomain models [

On the tissue level, the bidomain model [

However, the demands of subcellular discretization for the solution of the PDEs and of nonuniform, heterogeneous tissue electric conductivity have prevented the study of the aforementioned phenomena in large-scale tissue simulations. In addition, due to the high computational costs associated with the simulation of these microscopic models of cardiac tissue, previous works have adopted simple myocyte models instead of modern Markov chain (MC-)based models [

In this work, we present a solution to this problem based on multi-GPU platforms (clusters equipped with graphics processing units) that allows fast simulations of microscopic tissue models combined with modern and complex myocyte models. The solution merges two different high-performance techniques we have previously investigated for cardiac modeling: cluster computing based on message passing communications (MPI) [

We developed a two-dimensional model that is based on the previous work of Spach and collaborators [

Basic unit of cardiac myocyte distribution based on a total of 32 cells. Cells are displayed in different alternating colors along the

This basic unit was created in such a way that larger tissue preparations can be generated by connecting multiple instances of it. Figure

Six basic units being combined to form a larger tissue.

Figure

In this work, there are only five possible types of connections between neighboring volumes: membrane, which indicates no flux between neighboring volumes; cytoplasm, which indicates that the neighboring volumes lie within the same cell; and three possible types of gap junctions: plicate, interplicate, and combined plicate. For the simulations presented in this work, the gap junction distribution of the basic template unit was manually chosen to reproduce the distribution presented before in [

Action potentials propagate through the cardiac tissue because the intracellular space of cardiac cells is electrically coupled by gap junctions. In this work, we do not consider the effects of the extracellular matrix. Therefore, the phenomenon can be described mathematically by a reaction-diffusion type partial differential equation (PDE) called monodomain model, given by
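In its standard form, with conventional symbols (the original notation may differ), the monodomain model reads:

```latex
\beta \left( C_m \frac{\partial V}{\partial t} + I_{\mathrm{ion}}(V, \boldsymbol{\eta}) \right)
  = \nabla \cdot (\sigma \nabla V) + I_{\mathrm{stim}},
```

where \(V\) is the transmembrane potential, \(\beta\) the surface-to-volume ratio, \(C_m\) the membrane capacitance, \(\sigma\) the (heterogeneous) intracellular conductivity tensor, \(I_{\mathrm{ion}}\) the total ionic current computed by the cell model with state variables \(\boldsymbol{\eta}\), and \(I_{\mathrm{stim}}\) an applied stimulus current.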

In this work, the modern and complex Bondarenko et al. model [
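In MC-based formulations, each channel is described by transitions among discrete states rather than by independent gates. A minimal sketch, for a hypothetical two-state (closed/open) channel with constant rates alpha and beta (the Bondarenko et al. model uses far larger state diagrams and voltage-dependent rates):

```python
def two_state_channel(o, alpha, beta, dt, steps):
    """Explicit Euler integration of dO/dt = alpha*(1 - O) - beta*O,
    the simplest Markov-chain channel: Closed <-> Open.
    `o` is the open probability; alpha/beta are transition rates."""
    for _ in range(steps):
        o += dt * (alpha * (1.0 - o) - beta * o)
    return o

# The open probability relaxes toward the steady state alpha / (alpha + beta)
o_inf = two_state_channel(0.0, alpha=2.0, beta=1.0, dt=0.001, steps=20000)
```

With many states, the same idea becomes a linear system of ODEs whose stiffness is one reason the ODE solution dominates the total run time.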

The finite volume method (FVM) is a numerical method used to obtain a discrete version of partial differential equations. The method is suitable for the simulation of various types of conservation laws (elliptic, parabolic, or hyperbolic) [

This section presents a brief description of the FVM application to the time and spatial discretization of the heterogeneous monodomain equations. Detailed information about the FVM applied to the solution of monodomain can be found in [

The reaction and diffusion parts of the monodomain equations were split by employing the Godunov operator splitting [
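In Godunov splitting, each PDE time step first advances the reaction (ODE) part, possibly with several smaller substeps, and then the diffusion part, starting from the state the other produced. A one-dimensional toy sketch (with an explicit diffusion step and a linear decay reaction purely for brevity; the paper solves the PDE with an unconditionally stable method):

```python
import numpy as np

def godunov_step(v, dt, dx, sigma, n_ode_substeps, reaction):
    """One Godunov operator-splitting step:
    (1) integrate dv/dt = reaction(v) with small ODE substeps,
    (2) take one diffusion step dv/dt = sigma * d2v/dx2 (no-flux ends)."""
    dt_ode = dt / n_ode_substeps
    for _ in range(n_ode_substeps):          # reaction part
        v = v + dt_ode * reaction(v)
    lap = np.zeros_like(v)                   # diffusion part
    lap[1:-1] = v[2:] - 2 * v[1:-1] + v[:-2]
    lap[0] = v[1] - v[0]
    lap[-1] = v[-2] - v[-1]
    return v + dt * sigma * lap / dx**2

v = np.zeros(51)
v[25] = 1.0                                  # "stimulated" central volume
for _ in range(100):
    v = godunov_step(v, dt=0.01, dx=0.1, sigma=0.1,
                     n_ode_substeps=4, reaction=lambda u: -0.5 * u)
```

All parameters above are illustrative; the point is only the order of operations and the smaller substeps used for the stiff reaction part.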

Since the spatial discretization of our model,

For the discretization of the nonlinear system of ODEs, we note that its stiffness demands very small time steps. For simple models based on the Hodgkin-Huxley formulation, this problem is normally overcome by using the Rush-Larsen (RL) method [
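The RL method exploits the quasi-linear form of Hodgkin-Huxley gating equations, dw/dt = (w_inf(V) - w)/tau_w(V): holding w_inf and tau_w fixed over a step, the ODE is integrated exactly, which keeps the gate in [0, 1] for any step size. A minimal sketch:

```python
import math

def rush_larsen_step(w, w_inf, tau_w, dt):
    """Exact update of dw/dt = (w_inf - w)/tau_w with w_inf and tau_w
    held fixed over the step (the Rush-Larsen scheme)."""
    return w_inf + (w - w_inf) * math.exp(-dt / tau_w)

# Even a step much larger than tau_w keeps the gate bounded in [0, 1]
w = rush_larsen_step(w=0.0, w_inf=1.0, tau_w=0.1, dt=10.0)
```

This is why RL tolerates much larger time steps than explicit Euler for gating variables; it does not, however, apply directly to the Markov-chain states of MC-based models, which is the difficulty the text goes on to discuss.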

However, as already indicated above, we use different time steps for the solution of the two different uncoupled problems, the PDE and the ODEs. Since we use an unconditionally stable method for the PDE, the time step

The diffusion term in (2) expresses the density of intracellular current flow, and (3) is a volumetric current that corresponds to the left-hand side of (

For the space discretization, we will consider a two-dimensional uniform mesh, consisting of regular quadrilaterals (called “volumes”). Located in the center of each volume is a node. The quantity of interest

After defining the mesh geometry and dividing the domain into control volumes, the specific equations of the FVM can be presented. Equation (

Finally, assuming that

For this particular two-dimensional problem, consisting of a uniform grid of quadrilaterals with side

For the case in which we have defined a conductance value at face

Using centered finite difference, we have for (

Equations for

Rearranging and substituting the discretizations of (
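On this uniform grid, the resulting discretization is a five-point stencil in which each face flux is weighted by the conductance assigned to that face; membrane faces simply carry zero conductance, which yields the no-flux condition automatically. A minimal sketch (illustrative conductance arrays, not the paper's values):

```python
import numpy as np

def fvm_diffusion(v, g_x, g_y, h):
    """Five-point FVM stencil on a uniform grid of square volumes of
    side h. g_x[i, j] is the conductance of the face between volumes
    (i, j) and (i, j+1); g_y[i, j] between (i, j) and (i+1, j).
    A zero face conductance models a membrane (no-flux) connection."""
    fx = g_x * (v[:, 1:] - v[:, :-1]) / h    # fluxes across x-faces
    fy = g_y * (v[1:, :] - v[:-1, :]) / h    # fluxes across y-faces
    div = np.zeros_like(v)
    div[:, :-1] += fx / h                    # flux balance per volume
    div[:, 1:] -= fx / h
    div[:-1, :] += fy / h
    div[1:, :] -= fy / h
    return div

rng = np.random.default_rng(0)
v = rng.random((4, 4))
g_x, g_y = np.ones((4, 3)), np.ones((3, 4))  # all-cytoplasm faces
div = fvm_diffusion(v, g_x, g_y, h=1.0)
```

Because every interior flux appears once with each sign, the scheme is conservative: the fluxes cancel pairwise and only boundary (here no-flux) faces can change the total.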

Large-scale simulations, such as those resulting from the fine spatial discretization of a tissue, are computationally expensive. For example, when an 8

To deal with this high computational cost, two distinct tools for parallel computing were used together: MPI and GPGPU.

The

To solve the non-linear systems of ODEs, the explicit Euler method was used. This is an embarrassingly parallel problem. No dependency exists between the solutions of the different systems of ODEs of each finite volume
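Because the volumes are uncoupled at this stage (coupling enters only through the PDE), one explicit Euler step can update every volume's state independently, i.e., as a vectorized or parallel map. A sketch with a hypothetical two-variable "cell model" standing in for the much larger BDK system:

```python
import numpy as np

def euler_step_all_volumes(states, dt, rhs):
    """One explicit Euler step for every volume at once. Each row of
    `states` holds one volume's ODE state; rows never read each other,
    which is what makes the problem embarrassingly parallel."""
    return states + dt * rhs(states)

def rhs(s):
    # Hypothetical model: dv/dt = -v*w, dw/dt = v - w (illustration only)
    v, w = s[:, 0], s[:, 1]
    return np.column_stack([-v * w, v - w])

states = np.tile([1.0, 0.5], (1000, 1))      # 1000 identical volumes
states = euler_step_all_volumes(states, 0.01, rhs)
```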

In our

However, we have accelerated the solution of the systems of ODEs by using multiple GPUs. This is a different strategy from those we have used before, when the full bidomain equations (elliptic PDE, parabolic PDE, and systems of ODEs) were completely implemented on a single GPU [

The motivation for choosing a different strategy is based on several reasons. As presented in [

Our

In order to run an application, the programmer must create a parallel function called a kernel. A kernel is a special C function callable from the CPU but executed on the GPU simultaneously by many threads. Each thread is run by a GPU stream processor. Threads are grouped into blocks of threads, or simply blocks. Blocks can be one-, two-, or three-dimensional. A set of blocks of threads forms a grid, which can be one- or two-dimensional. When the CPU calls the kernel, it must specify how many blocks and threads will be created on the GPU to execute the kernel. The syntax that specifies the number of threads that will be created to execute a kernel is formally known as the execution configuration and is flexible to support CUDA's hierarchy of threads, blocks of threads, and grids of blocks. Since all threads in a grid execute the same code, a unique set of identification numbers is used to distinguish threads and to define the appropriate portion of the data they must process. These threads are organized into a two-level hierarchy composed of blocks and grids, and two unique coordinates, called
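For a one-dimensional configuration, the unique index each thread computes is blockIdx.x * blockDim.x + threadIdx.x. The arithmetic, and the fact that it assigns each data element to exactly one thread, can be sketched (in Python, for illustration only) as:

```python
def global_thread_id(block_idx, block_dim, thread_idx):
    """1D CUDA-style index: which array element this thread owns."""
    return block_idx * block_dim + thread_idx

def covered_indices(n_blocks, block_dim):
    """Every (block, thread) pair in a 1D grid, mapped to data indices."""
    return [global_thread_id(b, block_dim, t)
            for b in range(n_blocks) for t in range(block_dim)]

# 4 blocks of 256 threads cover elements 0..1023, each exactly once
ids = covered_indices(4, 256)
```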

Some additional steps must be followed to use the GPU: (a) the device must be initialized; (b) memory must be allocated on the GPU and data transferred to it; (c) the kernel is then called. After the kernel has finished its execution, the results are transferred back to the CPU.

Two kernels were developed to solve the systems of ODEs related to the BDK model. The first kernel is responsible for setting the initial conditions of the systems of ODEs, whereas the second one integrates the systems of ODEs at each time step.

Both kernel implementations were optimized in many different ways. The state variables of
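A common optimization in this setting, shown here as an illustrative assumption rather than a description of our exact memory map, is a structure-of-arrays layout: all cells' copies of one state variable are stored contiguously, so consecutive threads access consecutive addresses (coalesced reads). The contrast with an array-of-structures layout, for a hypothetical 40-variable model:

```python
def soa_index(var, cell, n_cells):
    """Structure-of-arrays: all cells' copies of one state variable are
    contiguous, so consecutive threads (cells) read consecutive
    addresses -- the coalesced-access pattern GPUs favor."""
    return var * n_cells + cell

def aos_index(var, cell, n_vars):
    """Array-of-structures: one cell's variables are contiguous;
    consecutive threads then access strided addresses."""
    return cell * n_vars + var

# For variable 0, cells 0..3 are adjacent in SoA but strided in AoS
soa = [soa_index(0, c, n_cells=1000) for c in range(4)]
aos = [aos_index(0, c, n_vars=40) for c in range(4)]
```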

Pure domain decomposition was used for parallelism. The tissue domain was linearly decomposed on

Linear parallel decomposition of tissue. Example for the case of two nodes (each with 8 cores and 2 GPU devices, that is, a total of 16 CPU cores and 4 GPU devices). Each CPU core processes one task. Four tasks are assigned to each GPU device. For instance, at node 0, GPU 0 processes tasks
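Under the configuration in the figure (8 cores and 2 GPUs per node, one task per core; the grouping of consecutive tasks onto one GPU is our assumption about the ordering), the task-to-node and task-to-GPU mapping is plain integer arithmetic:

```python
def task_to_node_and_gpu(task, cores_per_node=8, gpus_per_node=2):
    """Maps a task (one per CPU core) to its node and to the GPU,
    within that node, that integrates its ODE systems."""
    node = task // cores_per_node
    tasks_per_gpu = cores_per_node // gpus_per_node
    gpu = (task % cores_per_node) // tasks_per_gpu
    return node, gpu

# Node 0 / GPU 0 serves tasks 0-3, node 0 / GPU 1 serves tasks 4-7, etc.
```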

For the solution of the ODEs, both sequential and parallel (CUDA) codes used single precision. For the solution of the PDE we have used double precision. For the case of monodomain simulations, we have shown in [

The simulations were performed using the microscopic model with spatial discretization of 8 ^{−1} and 1.0 ^{2}, respectively. The time step used to solve the linear system associated with (

Three different tissue setups were used to test our model and parallel implementations: a 0.5 cm × 0.5 cm cardiac tissue, stimulated in the center and simulated for 10 ms; a 1.0 cm × 1.0 cm cardiac tissue, stimulated in the center and simulated for 10 ms; and a 1.0 cm × 1.0 cm cardiac tissue stimulated using the S1-S2 protocol to generate a spiral wave, a form of self-sustained reentrant activity strongly associated with cardiac arrhythmia.

Our experiments were performed on a cluster of 8 SMP computers. Each computer contains two quad-core Intel Xeon E5620 processors and 12 GB of RAM. All nodes run Linux kernel 2.6.18-194.17.4.el5. The codes were compiled with gcc 4.1.2 and CUDA 3.2. Each node contains two Tesla C1060 cards; the Tesla C1060 has 240 CUDA cores and 4 GB of global memory.

All tests were performed three times. The average execution time, in seconds, is then used to calculate the speedup, defined as the sequential execution time divided by the parallel execution time.
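The speedups in the tables follow directly from this definition; a quick check against the measured times for the 0.5 cm × 0.5 cm tissue:

```python
def speedup(t_seq, t_par):
    """Speedup of a parallel run relative to the 1-core run."""
    return t_seq / t_par

# 0.5 cm x 0.5 cm tissue: the 1-core total was 137,964 s
runs = {8: 18_492, 16: 9_922, 32: 4_198, 64: 2_283}
speedups = {cores: round(speedup(137_964, t), 1) for cores, t in runs.items()}

# The multi-GPU total of 401.84 s gives a speedup of about 343
```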

Figure

Action potential propagation (transmembrane potential) after a central stimulus on a tissue of size 1 cm × 1 cm for different time instants. (a)

(a) Transmembrane potential at

Table

Average execution time and speedup of parallel implementations for a tissue of 0.5 cm × 0.5 cm. The execution times are presented in seconds.

Parallel implementation | Cores | Total time | ODE time | PDE time | Speedup
------------------------|-------|------------|----------|----------|--------
Cluster | 1 | 137,964 | 132,590 | 5,264 | —
Cluster | 8 | 18,492 | 17,210 | 1,262 | 7.5
Cluster | 16 | 9,922 | 9,316 | 595 | 13.9
Cluster | 32 | 4,198 | 3,884 | 311 | 32.9
Cluster | 64 | 2,283 | 2,087 | 191 | 60.4
Multi-GPU | 64 + 16 GPUs | 401.84 | 209.4 | 187 | 343

Table

Average execution time and speedup of parallel implementations for a tissue of 1.0 cm × 1.0 cm. The execution times are presented in seconds.

Parallel implementation | Cores | Total time | ODE time | PDE time | Speedup
------------------------|-------|------------|----------|----------|--------
Cluster | 1 | 546,507 | 523,331 | 23,177 | —
Cluster | 64 | 8,934 | 8,313 | 607 | 61.2
Multi-GPU | 64 + 16 GPUs | 1,302 | 682 | 611 | 420

As a result, using this very fast parallel implementation, we were able to simulate the formation of spiral waves, a form of self-sustained reentrant activity strongly associated with cardiac arrhythmia; see Figure

Spiral wave formation after an S1-S2 stimulus protocol at different time instants: (a)

Our results show that our

Nevertheless, we believe our parallel implementation can be further improved. For instance, in the current implementation, the CPU cores are idle while waiting for the results of the nonlinear ODEs that are being computed by the GPU devices. For future work, we intend to evaluate different load balancing techniques to better distribute the parallel tasks between GPU devices and CPU cores and make a more efficient use of all the computational resources. Another possible improvement is related to the multilevel parallelism introduced for the solution of the bidomain equations [

Recent studies that focus on the discrete or discontinuous nature of AP propagation have avoided the computational challenges that arise from microscopic models via the development and use of discrete models, where each cardiac myocyte is represented by a single point connected with the neighboring myocytes by different conductivities [

In this paper, we developed a cardiac electrophysiology model that captures the microstructure of cardiac tissue by using a very fine spatial discretization and a very modern and complex cell model based on Markov chains for the characterization of ion channel structure and dynamics. To cope with the computational challenges, the model was parallelized using a hybrid approach: cluster computing and GPGPUs. Different

B. G. de Barros and R. S. Oliveira contributed equally to this paper; therefore, both can be considered first authors, appearing in alphabetical order.

The authors would like to thank CAPES, CNPq, FAPEMIG, FINEP, UFMG, UFSJ, and UFJF for supporting this work.