To reduce the cost of designing new specialized FPGA boards for a direct-summation MOND (Modified Newtonian Dynamics) simulator, we propose RP-ring (reconfigurable processor ring), a heterogeneous architecture built from existing FPGA boards. The design can be extended with any available FPGA board and requires only low communication bandwidth between boards. The communication protocol is simple and can be implemented with limited hardware or software resources. To keep the slowest board from limiting overall performance, we build a mathematical model that decomposes the workload among the FPGAs according to the logic resources, memory access bandwidth, and communication bandwidth of each FPGA chip. Our accelerator achieves a speedup of two orders of magnitude over a CPU implementation.

N-body simulations have been widely used in scientific and engineering applications. Problems in astrophysics, semiconductor device simulation, molecular dynamics, plasma physics, and fluid mechanics require efficient N-body simulation methods [

Computational solutions for N-body simulation can be categorized as CPU, GPU, ASIC, and FPGA according to the computing unit. Furthermore, these technologies vary in their cost, programming abstraction level, and power consumption [

Figure

Basic idea of GRAPE and its top-level structure.

Figure

GRAPE computing cluster.

Our goal is an accurate numerical computation method based on MOND theory. MOND (Modified Newtonian Dynamics) is an alternative to the popular Dark Matter (DM) theory; it successfully explains the distribution of force in an astronomical object from the observed distribution of baryonic matter [

MOND’s numerical algorithm differs from that of traditional N-body simulation, so GRAPE is not suitable for our purpose. MOND theory is based on potential calculation and can be described as follows: given the density distribution of baryonic matter

Finally,

Unlike the traditional direct-summation N-body algorithm, MOND requires a more time-consuming potential calculation, whose computational complexity is

With a limited project budget, we chose an FPGA-based direct-summation algorithm. Instead of designing new specialized boards, we reuse existing ones to reduce cost. The scale of a MOND simulation is limited by the available computational resources. A single FPGA chip does not provide enough logic resources, so a multi-FPGA solution appears to be the only choice. The major contributions of our work are as follows:

The remainder of this paper is organized as follows: Section

MOND numerical simulation is a variant of N-body simulation; the calculation can be described in the following five steps [

With the known baryonic matter distribution

Calculate the phantom dark matter distribution

Solve the Poisson equation (

Calculate the acceleration and velocity with the final potential by

Calculate the location of each particle in the next time step.

Steps

Therefore, in the rest of this article we use FPGAs to construct the potential calculation pipeline and propose the RP-ring solution to build a larger multi-FPGA system. Note that this work is not limited to MOND numerical simulation; it can be extended readily to other direct-summation N-body simulations.

As (

Figure

The architecture of RP-ring.

The input of potential pipeline.

As shown in Figure

Obtain the results from the previous FPGA board and put data into the Input-FIFO.

A DMA engine reads local particle information from on-board memory and puts it into the DMA-FIFO.

The pipelines take data from the Input-FIFO and the DMA-FIFO, calculate the potential, and write the result into the Output-FIFO.

Read data from Output-FIFO, and send it to the next FPGA board through output connection.
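The data flow described in the steps above can be sketched in software. The following is a minimal simulation, not the authors' implementation: it assumes a softened Newtonian point-mass kernel for the potential (the paper's exact equation is not reproduced here), and the names `pair_potential`, `ring_potential`, `direct_potential`, and the softening length `eps` are hypothetical. Hops that run concurrently on real boards are serialized in the loop.

```python
import math

G = 1.0  # gravitational constant in simulation units (assumption)


def pair_potential(xi, xj, mj, eps=1e-3):
    """Potential at xi due to mass mj at xj (assumed softened Newtonian kernel)."""
    r2 = sum((a - b) ** 2 for a, b in zip(xi, xj)) + eps * eps
    return -G * mj / math.sqrt(r2)


def ring_potential(boards):
    """Accumulate potentials by passing each board's working set around a ring.

    boards is a list with one entry per board; each entry is a list of
    (position, mass) tuples held in that board's local memory.
    """
    n = len(boards)
    acc = [[0.0] * len(b) for b in boards]
    for hop in range(n):
        for owner in range(n):
            # After `hop` transfers, the owner's working set sits on this board.
            host = (owner + hop) % n
            for i, (xi, _mi) in enumerate(boards[owner]):
                for j, (xj, mj) in enumerate(boards[host]):
                    if host == owner and i == j:
                        continue  # a particle does not act on itself
                    acc[owner][i] += pair_potential(xi, xj, mj)
    return acc


def direct_potential(particles):
    """Reference O(N^2) direct summation on a single device."""
    return [
        sum(pair_potential(xi, xj, mj)
            for j, (xj, mj) in enumerate(particles) if j != i)
        for i, (xi, _mi) in enumerate(particles)
    ]
```

After `len(boards)` hops every working set has visited every board exactly once, so the ring result agrees with single-device direct summation up to floating-point rounding, regardless of how unevenly the particles are distributed.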

Figure

The data flow of RP-ring.

The RP-ring solution we propose in this paper avoids the problem mentioned in Section

In RP-ring, when the whole working set flows through the ring network once, the calculation of interaction is completed. The amount of data that needs to be transported between FPGA boards is reduced, and the demand for communication bandwidth is also reduced.

The ring network topology is simpler than the tree network in GRAPE cluster. There is no need for additional network board.

The interconnection protocol is quite simple. Implementing it requires little overhead in either software or hardware, which leaves more resources for the potential calculation pipelines.

Potential pipeline is designed based on Poisson equation (

Figure

Potential pipeline’s detail.

From the above, the ring network may result in the slowest board dragging the overall performance down, so it is important to balance the time-consumption

The purpose of this mathematical model is to determine, given multiple FPGA boards with known parameters, how to decompose the workload among them and how to choose the parameters of their potential calculation pipelines so that the whole system reaches its maximum throughput.

Assume that the scale of simulation is

In order to maximize the whole system’s throughput, we just need to allocate the workload among the FPGA boards in a proportional way according to their processing capacity, so the problem is converted to how to choose

Therefore, in this model,

Furthermore, there are three constraints in this model:

FPGA logic resource constraint,

memory access bandwidth constraint,

communication bandwidth constraint.

FPGA logic resource constraint: in each FPGA, the total logic resource consumption of the FIFOs, DMA, memory controller, input/output interconnection, and potential pipelines must not exceed the resources that the FPGA provides. Suppose that

Apparently,

Memory access bandwidth constraint: the input data of the potential pipelines come from the previous board’s results and from the particle information in local memory. We hold the data from the previous board fixed and traverse the local particle information, so half of the potential pipelines’ input bandwidth is supplied by memory. That is to say,

In (

Communication bandwidth constraint: as described in memory access bandwidth constraint, the

In conclusion, once the board parameters and the required functional relations are given, solving the optimization problem defined by the target function and the three constraints tells us how to decompose the workload among the boards and how to choose the parameters of their potential calculation pipelines.
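The proportional allocation at the heart of the model can be illustrated with a short sketch. This is not the paper's optimization procedure: it assumes the per-board processing capacity is already known (here, the MPair/s figures reported for three of the boards in the pipeline table) and only shows the final split. The function name `split_workload` and the particle count are hypothetical, and Gemini-1 is left out because its per-board figure is not listed.

```python
def split_workload(n_particles, capacity):
    """Split n_particles among boards in proportion to their capacity.

    capacity maps a board name to its throughput (any consistent unit).
    Largest-remainder rounding keeps the shares integral and summing to
    n_particles.
    """
    total = sum(capacity.values())
    exact = {b: n_particles * c / total for b, c in capacity.items()}
    share = {b: int(e) for b, e in exact.items()}
    leftover = n_particles - sum(share.values())
    # Hand the remaining particles to the boards with the largest fractions.
    by_fraction = sorted(exact, key=lambda b: exact[b] - share[b], reverse=True)
    for b in by_fraction[:leftover]:
        share[b] += 1
    return share


# Throughput in MPair/s as reported for these three boards.
capacity = {"Zedboard": 866.6, "KC705": 5977.8, "XUPV5": 242.9}
share = split_workload(100_000, capacity)
```

Because every local particle interacts with the full working set, a board's time per step scales with its share divided by its capacity, so a proportional split lets all boards finish at nearly the same time.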

In this section, we describe our implementation of the RP-ring solution and report its performance in our MOND numerical simulation project.

Table

based on the boards’ feature, select software or hardware to implement the interconnection protocol,

according to the boards’ resource, choose different interconnection media.

The parameters of the boards.

Board | Main chip | Hard CPU | Logic cells | Flip flop | Block RAM | DSP slice | Memory access bandwidth (theoretical) | Connection |
---|---|---|---|---|---|---|---|---|

Zedboard | XC7Z020 | Cortex A9 Dual Core | 85,000 | 106,400 | 560 Kb | 220 | 8.5 GB/s | Ethernet |

KC705 | XC7K325T | No | 326,080 | 407,600 | 4000 Kb | 840 | 12.8 GB/s | Ethernet and SMA |

XUPV5 | XC5VLX110T | No | 46,080 | 28,800 | 1728 Kb | 48 | 3.2 GB/s | Ethernet and SMA |

Gemini-1 | XC6VLX365T ×2 | No | 364,032 each | 455,040 each | 14,976 Kb each | 576 each | 12.8 GB/s each | SMA |

Jetson-TK1 (host computer) | Tegra K1 | Cortex A15 Quad Core | — | — | — | — | — | Ethernet |

Thus, this solution is flexible and scalable and is compatible with heterogeneous multi-FPGA systems.

In Table

Top-level view of Gemini-1.

Figure

The experiment’s connection.

Boards.

The particle information format used by RP-ring is shown in Figure

The data structure of particle information.

For an FPGA with an integrated CPU, such as the Zynq-7000, the interconnection protocol can be implemented in software, as shown in Figure

Protocol’s software implementation.

For an FPGA without an integrated CPU, such as the XC7K325T, the interconnection protocol can be implemented in hardware, as shown in Figure

Protocol’s hardware implementation.

Table

Resource consumption.

Module | Gemini-1 LUT | Gemini-1 REG | Gemini-1 BRAM | Gemini-1 DSP | KC705 LUT | KC705 REG | KC705 BRAM | KC705 DSP |
---|---|---|---|---|---|---|---|---|
FIFO | 265 | 129 | 4 | 0 | 265 | 112 | 4 | 0 |
DMA | 2753 | 4078 | 8 | 0 | 1322 | 1680 | 8 | 0 |
IN/OUT | 450 | 530 | 0 | 0 | 1479 | 2082 | 0 | 0 |
MEM | 5889 | 5882 | 0 | 0 | 14016 | 9019 | 2 | 0 |
PIPELINE | 209493 | 301565 | 5 | 480 | 185509 | 273149 | 5 | 736 |

Module | Zedboard LUT | Zedboard REG | Zedboard BRAM | Zedboard DSP | XUPV5 LUT | XUPV5 REG | XUPV5 BRAM | XUPV5 DSP |
---|---|---|---|---|---|---|---|---|
FIFO | 265 | 112 | 4 | 0 | 47 | 57 | 4 | 0 |
DMA | 0 | 0 | 0 | 0 | 777 | 516 | 0 | 0 |
IN/OUT | 0 | 0 | 0 | 0 | 2226 | 2307 | 0 | 0 |
MEM | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
PIPELINE | 45014 | 80403 | 2 | 171 | 10552 | 15945 | 0 | 64 |

In order to measure the communication bandwidth consumption between boards, we add counters to record the data traffic on the interconnection. Figure

Counters’ location and interconnection’s notation.

Table

Communication bandwidth consumption.

Interconnection | Data traffic (MB) | Time (ms) | Measured bandwidth (MB/s) | Theoretical bandwidth (MB/s) |
---|---|---|---|---|

Ethernet 0 | 15.733 | 1297.2 | 12.128 | 1000 |

Ethernet 1 | 15.641 | 1297.2 | 12.057 | 1000 |

Ethernet 2 | 15.847 | 1297.2 | 12.216 | 1000 |

SMA 0 | 2.911 | 1297.2 | 2.244 | 3125 |

SMA 1 | 2.926 | 1297.2 | 2.256 | 3125 |

PCB trace | 2.903 | 1297.2 | 2.238 | 6400 |
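The measured bandwidth column follows directly from the recorded traffic and the elapsed time. The short sketch below recomputes it from the table rows and adds the fraction of the theoretical link bandwidth actually used; the helper name `measured_bandwidth` is hypothetical, and the utilization figure is a derived illustration rather than a number reported in the paper.

```python
def measured_bandwidth(traffic_mb, time_ms):
    """Average bandwidth in MB/s from traffic in MB and elapsed time in ms."""
    return traffic_mb / (time_ms / 1000.0)


ELAPSED_MS = 1297.2

# (link, data traffic in MB, reported MB/s, theoretical MB/s) from the table
links = [
    ("Ethernet 0", 15.733, 12.128, 1000),
    ("Ethernet 1", 15.641, 12.057, 1000),
    ("Ethernet 2", 15.847, 12.216, 1000),
    ("SMA 0", 2.911, 2.244, 3125),
    ("SMA 1", 2.926, 2.256, 3125),
    ("PCB trace", 2.903, 2.238, 6400),
]

utilization = {}
for name, traffic_mb, _reported, peak in links:
    bw = measured_bandwidth(traffic_mb, ELAPSED_MS)
    utilization[name] = bw / peak
    print(f"{name:10s} {bw:7.3f} MB/s  {100 * utilization[name]:.3f}% of peak")
```

Every link stays below 2% of its theoretical bandwidth, which is consistent with the claim that RP-ring places only a light demand on the interconnection.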

Table

The number of potential pipelines.

Board | Pipeline | Frequency | GFlops | MPair/s |
---|---|---|---|---|

Zedboard | 8 | 200 MHz | 32.9 | 866.6 |

KC705 | 32 | 344.9 MHz | 227.2 | 5977.8 |

XUPV5 | 2 | 224.3 MHz | 9.234 | 242.9 |

Gemini-1 | 32 × 2 | 266.1 MHz | | |

Total (theory) | 106 | — | 620.1 | 16311.3 |

Total (experiment) | 106 | — | 503.5 | 13244.7 |
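Two quantities can be derived from the rows above as a consistency check: the fraction of theoretical throughput reached in the experiment, and the number of floating-point operations counted per particle pair. The sketch below computes both from the table values only. The variable names are hypothetical, and the per-pair operation count is inferred from the ratio of the two columns rather than stated in the text.

```python
# (GFlops, MPair/s) for the rows that list both values
rows = {
    "Zedboard": (32.9, 866.6),
    "KC705": (227.2, 5977.8),
    "XUPV5": (9.234, 242.9),
    "Total (theory)": (620.1, 16311.3),
    "Total (experiment)": (503.5, 13244.7),
}

# Floating-point operations per pair: (GFlops * 1e9) / (MPair/s * 1e6)
flops_per_pair = {name: g * 1e3 / mp for name, (g, mp) in rows.items()}

# Fraction of the theoretical total reached in the experiment
efficiency = rows["Total (experiment)"][0] / rows["Total (theory)"][0]

# Pipeline counts per board; Gemini-1 carries 32 on each of its two chips
pipelines = {"Zedboard": 8, "KC705": 32, "XUPV5": 2, "Gemini-1": 32 * 2}
total_pipelines = sum(pipelines.values())

for name, value in flops_per_pair.items():
    print(f"{name:20s} {value:.2f} flops per pair")
print(f"efficiency: {100 * efficiency:.1f}%")
```

All rows give roughly 38 operations per pair, and the measured total is about 81% of the theoretical one.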

We choose CPU and GPU solutions as the control groups of our work. Fabian’s RAMSES code is a widely used method for MOND simulation [

Software performance comparison.

Table

Some FPGA solutions are also listed in Table

Hardware performance comparison.

Implementation | Main chip | GFlops |
---|---|---|

GRAPE-4 cluster [ | ASIC | 1080 |

GRAPE-6 cluster [ | ASIC | 1349 |

GRAPE-8 board [ | ASIC | 960 |

Lienhart et al.’s work [ | FPGA | 3.9 |

Spurzem et al.’s work [ | FPGA | 4.3 |

Hamada et al.’s Bioler-3 [ | FPGA | 324.2 |

GPU cluster [ | GPU | 781 |

Sozzo et al.’s work [ | FPGA | 46.55 |

Our work | FPGA | 503.5 |

In this paper, we proposed RP-ring, an extensible solution for heterogeneous multi-FPGA direct-summation N-body simulation, together with a model that decomposes the workload among the FPGAs. RP-ring reduces cost by reusing existing FPGA boards rather than designing new specialized ones. The solution can be extended with any heterogeneous FPGA boards, and its communication bandwidth requirement is low, so the communication protocol can be kept simple and consumes few resources. The model takes into account each FPGA’s logic resources, memory access bandwidth, and communication bandwidth to divide the workload and optimize the performance of the whole system. We also built a heterogeneous multi-FPGA system based on RP-ring and used it for MOND numerical simulation. The experimental results show that this low-cost multi-FPGA system is 193 times faster than a high-end CPU implementation and achieves performance similar to that of a high-performance GPU.

An earlier version of this work was presented as a poster at 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).

The authors declare that they have no conflicts of interest.

This research was sponsored by Huawei Innovation Research Program (YB2015090102); support from Huawei Technologies Co., Ltd., is gratefully acknowledged.
