RP-Ring : A Heterogeneous Multi-FPGA Accelerator

To reduce the cost of designing new specialized FPGA boards as direct-summation MOND (Modified Newtonian Dynamics) simulator, we propose a new heterogeneous architecture with existing FPGA boards, which is called RP-ring (reconfigurable processor ring). This design can be expanded conveniently with any available FPGA board and only requires quite low communication bandwidth between FPGA boards. The communication protocol is simple and can be implemented with limited hardware/software resources. In order to avoid overall performance loss caused by the slowest board, we build a mathematical model to decompose workload among FPGAs.The dividing of workload is based on the logic resource, memory access bandwidth, and communication bandwidth of each FPGA chip. Our accelerator can achieve two orders of magnitude speedup compared with CPU implementation.


Introduction
N-body simulations have been widely used in scientific and engineering applications.Problems in astrophysics, semiconductor device simulation, molecular dynamics, plasma physics, and fluid mechanics require efficient N-body simulation methods [1].The problem can be described as follows.The topic gives the initial positions and velocities of N particles, demanding updating their positions and velocities every  time steps.Nonetheless, the size of the N-body simulation () is always limited by the available computational resources, and the increasing need for larger system simulations requires more efficient computational methods.So many researchers have been interested in faster algorithms for large-scale particle simulation and invented some efficient algorithms, such as Barnes and Hut algorithms that reduce the computation complexity to ( log ) and FMM algorithms which have a computation complexity of () [2].However, these algorithms use some approximation and are complex to parallelize.Direct-summation N-body algorithm computes the interaction between particles in an accurate way and is quite convenient for parallelization.What is more, direct-summation is a fundamental building-block for other algorithms [3], so a lot of high performance directsummation computational platforms emerge these years.In the modified Newton dynamics simulation project of Yunnan Observatories, Chinese Academy of Sciences, we want to work out such a platform that can make use of all the resources that we have in lab and to meet the demand for power and performance at the same time.
1.1.Background.Computational solutions for N-body simulation can be categorized as CPU, GPU, ASIC, and FPGA according to the computing unit.Furthermore, these technologies vary in their cost, programming abstraction level, and power consumption [4].There has been one thorough study on x86-based or power-PC-based tuning [5] and several papers on GPU cluster implementation [2,6].The GRAPE ("GRAvity piPE") project built ASIC-based high performance computing solutions for gravitational force calculations where the calculation of particle interactions was calculated by an ASIC chip in the form of a fully pipelined hardwired processor dedicated to gravitational force calculation [7,8].Hamada et al. used the Bioler-3 system to implement an FPGA-based gravitational force computing accelerator [4].These studies have shown that CPUs' performance is limited and ASIC offers no advantages; GPUs are competitive in performance and performance per cost; the performance per Watt figure favoured FPGA [4].

Related Work and Challenge.
Figure 1 shows a basic structure of a hardware accelerated solution for N-body simulations.It consists of a host computer and an acceleration coprocessor for potential calculation.The host computer performs all other calculations, for example, position and velocity upgrade.More specifically, the coprocessor consists of a number of pipelines and the particle information stored in particle memory.The control/interface unit receives instruction from host computer and controls the pipeline to acquire particle information from particle memory to calculate potential [7,9].Figure 2 shows how to build a computing cluster with GRAPE.To build a cluster with 16 GRAPE boards, we need one host-interface board (HIB), five network boards (NBs), and 16 processor boards (PBs).Each NB has one uplink and four downlinks.Thus, the 16 PBs are connected to the host computer through two-level tree network of NBs.NB and HIB handle the communication between PB and the host computer [9].With the increase in the amount of PBs, the demand for NBs increases rapidly.It means that the interconnection overhead of building a large computing cluster is unacceptable and the interconnection problem becomes a challenge of building large computing cluster.

Motivation.
We want to work out an accurate numerical computation method based on MOND theory.MOND (Modified Newtonian Dynamics) theory is an alternative for the popular Dark Matter (DM) theory, which successfully explains the distribution of force in an astronomical object from observed distributions of baryonic matters [10].
MOND's numerical algorithm is different from traditional N-body simulation's method, so the GRAPE is not suitable for our mission.MOND theory is based on potential calculation and can be described as follows: given the density distribution of baryonic matters   (), try to figure out the final potential Φ().The final potential is influenced by the distribution of two kinds of matter:   () and  ph (), where   () is the density distribution of baryonic matters including stars and gasses and  ph () is the density distribution of phantom dark matter, which is the theoretical hypothesis of MOND [11].The final gravity potential Φ() is given by the classical Poisson equation: ph () is given by the   ()'s potential () as in the following equation: Finally,   ()'s potential () is given by the classical Poisson equation: Different from the traditionally direct-summation Nbody algorithm, MOND requires a more time-consuming potential calculation, whose computation complexity is ( 2 ).
With limited project budget, we choose to use the FPGAbased direct-summation algorithm.Instead of designing new specialized boards, we reuse existing ones in order to reduce the overhead.The scale of MOND simulation is limited by the available computational resources.A single FPGA chip does not provide enough logical resource, so the multi-FPGA solution seems to be the only choice.Major contributions of our work are as follows: (1) In order to accelerate direct-summation N-body simulation we propose an extensible heterogeneous multi-FPGA solution called RP-ring (reconfigurable processor ring) to utilize existing multiple different FPGA boards.The experiment shows that our implementation achieved two orders of magnitude speedup compared with high-end CPU implementation.
(2) In order to prevent the slowest board from dragging the overall performance down, we propose a model about how to decompose workload among FPGAs and optimize logic resource allocation.To improve the whole system's performance, this model shall divide workload based on the logic resource, memory bandwidth, and communication bandwidth of each FPGA board and allocate logic resource among potential calculation pipeline, DMA/FIFO, and other modules.
The remainder of this paper is organized as follows: Section 2 presents background information on the algorithm of MOND theory numerical simulation.The following section will present the architecture of RP-ring.Then we will present the model, which guides the decomposition of workload and logic resource's allocation.After that, we will present our hardware implementation result on heterogeneous multi-FPGA and compare it with other implementation on various GPU boards and CPUs as well as ASIC implementation.These results will then be discussed before conclusions are drawn.

Direct-Summation Algorithm
MOND numerical simulation is a variant of N-body simulation; the calculation can be described in the following five steps [11]: (1) With the known baryonic matter distribution   (), calculate gravity potential () according to (3).
(4) Calculate the acceleration and velocity with the final potential by Φ() finite difference.
(5) Calculate the location of each particle in the next time step.
Steps (2), (4), and (5) have a computation complexity of (), so using CPU to do the tasks serially will not influence the performance.Steps (1) and (3) have a computation complexity of ( 2 ), so we focus on accelerating them.In direct-summation algorithm, Steps (1) and (3), which calculate gravity potential, we can use the solution of the Poisson equation: Therefore, in the following article we use FPGA to construct potential calculation pipeline and propose the RPring solution to build a larger multi-FPGA system.It should be pointed out that this work is not limited to MOND theory numerical simulation.It can be extended conveniently to other direct-summation N-body simulations.

Architecture
As (4) shows, the accumulation of the potential acting on each particle by all other particles in the system is mutually independent.We can calculate several particle-pairs' potential simultaneously, so the existing ASIC implementation represented by GRAPE puts several potential calculating pipeline in the chip to find the best degree of parallelism.In Section 1, we have analyzed the existing work and find out their bottleneck.Now we come to the RP-ring.

RP-Ring Solution.
Figure 3 illustrates a ring network consisting of  FPGA boards and a host computer.Each FPGA board has on-board memory and is connected with previous/next FPGA board with cable.There are  pipelines, DMA, memory controller, and other modules in an FPGA chip.All these modules are controlled by protocol controller.The potential pipelines have two input ports, one connected to Input-FIFO and the other connected to DMA-FIFO.In order to reuse the local particle information, we fix the data from the Input-FIFO, which is received from previous board, and traverse the local particle information as Figure 4 shows.In the following paragraphs, we will explain RP-ring's control flow and data flow.

Control Flow.
As shown in Figure 3, each FPGA board's control flow has the following features: Compute interaction FPGA board 0 FPGA board 1

Compute interaction
On-board memory: On  (1) Obtain the results from the previous FPGA board and put data into the Input-FIFO.
(2) There is DMA on-board memory to get local particle information and put it into the DMA-FIFO.
(3) The pipelines gain data from Input-FIFO and DMA-FIFO, calculate the potential, and then write the result into Output-FIFO (4) Read data from Output-FIFO, and send it to the next FPGA board through output connection.
The information of the th particle consists of its location x = (  ,   ,   ), mass   , and potential   which is initialized to zero.When the th particle's information reaches the zero board, the FPGA board calculates the interaction between the particle and the local particles work set 0 and stores the intermediate results  0 + 0 into the potential field.When the th particle's information reaches the first board, the FPGA board calculates the interaction between the particle and the local particles work set 1 and stores the intermediate results  1 +  0 + 0 into the potential field.Therefore, when the th particle's information flows through the whole ring and returns to the host computer, the potential field reaches the final result ∑ −1    .Then the host computer can continue to follow up operation.
The RP-ring solution, we propose in this paper, can avoid the problem we mentioned in Section 1.It implements an extensible heterogeneous multi-FPGA solution.Different from the GRAPE's tree network, RP-ring utilizes ring topology network.Each FPGA board's on-board memory stores a portion of particle information.During the process, each board receives particle information from the previous board, then calculates the interaction with local particle information, and finally sends the result to the next board in the ring network.For each particle's information, when it flows through the whole boards in the ring, the calculation of interaction with all other particles is finished.The advantages of this solution are as follows: (1) In RP-ring, when the whole working set flows through the ring network once, the calculation of interaction is completed.The amount of data that needs to be transported between FPGA boards is reduced, and the demand for communication bandwidth is also reduced.
(2) The ring network topology is simpler than the tree network in GRAPE cluster.There is no need for additional network board.(3) The interconnection protocol is quite simple.It requires little overhead to implement protocols no matter the software or hardware.Thus, we can save more resources to construct potential calculation pipeline.

Potential Pipeline. Potential pipeline is designed based on
Poisson equation (4).To make each stage of the pipeline have similar latency, ( 4) is rewritten as International Journal of Reconfigurable Computing Figure 6 shows the design of potential pipeline.According to the complexity of different operation, the addition, substraction, and multiplication units are set to the same latency cycle as the division and square root units.In our design, , , ,  are presented in IEEE 754 floating-point format.

System Optimization.
From the above, the ring network may result in the slowest board dragging the overall performance down, so it is important to balance the timeconsumption (  ) of particle's information flowing through each board.Processing capacity of each board varies, but by decomposing the workload based on their own capacity we can adjust   to improve the whole system's performance efficiently.(See suitable work set  in Figure 5.) The following section will discuss this problem in detail with a mathematical model.

Model
The purpose of this mathematical model is, given multiple FPGA boards with known parameters, how to decompose the workload among them and choose their parameters of potential calculation pipeline, so that the whole system's maximum throughput can be obtained.

Symbol Conventions.
Assume that the scale of simulation is , and we have  FPGA boards.
is the workload assigned to the th FPGA board.It can be seen from the RP-ring's architecture that each board's processing capacity depends on the number of potential calculation pipelines and their operating frequency.Suppose that the th FPGA board contains   pipelines and their operating frequency is   ; then the performance of the board   has In order to maximize the whole system's throughput, we just need to allocate the workload among the FPGA boards in a proportional way according to their processing capacity, so the problem is converted to how to choose   and   , s.t. = ∑    maximum.Additionally,   is a function of   ; that is, Therefore, in this model,   is the only free variable.Finally, the problem is rewritten as 4.2.Constraint.Furthermore, there are three constraints in this model: (1) FPGA logic resource constraint, (2) memory access bandwidth constraint, (3) communication bandwidth constraint.
FPGA logic resource constraint: in each FPGA, the logic resource consumption of FIFO, DMA, memory controller, input/output interconnection, and potential pipeline is smaller than the maximum amount of resources that FPGA can provide.Suppose that LUT  , FF  , BRAM  , and DSP  are the amounts of LUT, Flip-Flop, BRAM, and DSP Slice that the th FPGA provides; we use vector R i to present them, R i = (LUT  , FF  , BRAM  , DSP  ), R i FIFO is the resource consumption of the th board's FIFO, and so on.Then we have Gemini-1 board

and R i
IOCtrl depend on the pipeline needed data bandwidth and can be seen as a function of   .That is to say, Memory access bandwidth constraint: the input data of potential pipeline come from previous board's result and local memory's particle information.We fixed the data from the previous board and traverse the local particle information, so half of the potential pipelines' input bandwidth is borne by the memory access bandwidth.That is to say, In (12), BW  Mem max is the maximum memory access bandwidth of the th board, and BW  pipeline is the input data bandwidth summation of all the potential pipelines in the th board.Apparently, BW  pipeline is proportional to the number of potential pipelines.That is to say, BW  pipeline ∝   , so (12) can be written as where  is a constant.Communication bandwidth constraint: as described in memory access bandwidth constraint, the th board's communication bandwidth between previous board and next board should be greater than one over   of the half of BW  pipeline .That is to say, In conclusion, based on the target function and the three constraints, when the parameter of the boards and the needed functional relation are given, solving the optimization problem can guide us on how to decompose the work load among the boards and how to choose their parameters of potential calculation pipeline.

Implementation
In this section, we will demonstrate our implementation under RP-ring solution and its performance parameters in our MOND theory numerical simulation project.
Table 1 shows our existing boards and their parameters.The experiment demonstrates that under RP-ring solution, we can, (1) based on the boards' feature, select software or hardware to implement the interconnection protocol, (2) according to the boards' resource, choose different interconnection media.
Thus, this solution has good flexibility and scalability and is compatible with heterogeneous multi-FPGA.
In Table 1, Jetson-TK1 is a NVIDIA Application Processor board, which is used as host computer.Zedboard, KC705, and XUPV5 are Xilinx Evaluation Kits for Zynq-7000, Kintex-7, and Virtex-5.Gemini-1 is our design FPGA board for prototyping.Figure 7 shows the top-level structure of Gemini-1.Gemini-1 has two pieces of XC6VLX365T Virtex-6 FPGA; they are connected through PCB trace.Each Virtex-6 FPGA chip has SMA connector to transport data.8 shows connection between the boards and connection between chips.In the ring network, Tegra K1 is used as host computer, and XC7Z020, XC7K325T, XC5VLX110T, and XC5VLX365T are connected through ethernet, SMA cable, or PCB trace.Figure 9 is the picture of real product.

Data Structure.
The provisions of the RP-ring's particle information format are as shown in Figure 10.The location, mass, and potential field store the particle's three-dimensional coordinates, mass, and the provisional result of potential, as well as the field of Tag record FPGA boards that the particle information has passed.Once the information passes an FPGA board, the corresponding position in Tag filed is set.When all of the bits are set, the potential field contains the final result.

Software Implementation.
For the FPGA with integrated CPU, like Zynq-7000, the interconnection protocol can be implemented with software as Figure 11 shows.Figure 11 shows its system architecture: multiple pipelines are instantiated in FPGA; two input ports of each pipeline are connected to Input-FIFO and DMA-FIFO; the output port of each pipeline is connected to Output-FIFO.The CPU creates three processes: Input Process, Computing Process, and Output Process.
Input Process.Control Gigabyte Ethernet receives particle information from previous board and stores it to Input-FIFO.Moreover, Input Process handles the situation of retransmission, Input-FIFO's fullness, and so on.
Computing Process.Control DMA traverses local particle information from DDR3.Control pipelines compute the potential based on the particle information in Input-FIFO and local information in DMA-FIFO and then write the result to Output-FIFO.
Output Process.Control Gigabyte Ethernet Sends the result in Output-FIFO to the next board.Moreover, Output Process handles the situation of retransmission, next board's Input-FIFO's fullness, and so on.

Hardware Implementation.
For the FPGA without integrated CPU, like XC7K325T, the interconnection protocol can be implemented with hardware as Figure 12 shows.Because the interconnection media can be ethernet cable, SMA cable, or PCB trace, the input/output connection can be ethernet controller, SMA controller, or SelectIO controller.
In Figure 12, a protocol FSM replaces the role of CPU.It controls the input connection, receives the data from previous board, stores the message in the Input-FIFO, and controls the DMA traverse local data, and pipelines finish the calculation, control the output connection, and send the result in Output-FIFO to the next board.

Communication Bandwidth Consumption.
In order to measure the communication bandwidth consumption between boards, we add counters to record the data traffic on the interconnection.Figure 13 shows the counters in the system and the interconnection's notation.These counters will work after each packet passes through the path.At the    end of the whole experiment, we can read out the value of each counter and divide it by the time of calculation.By this way, we can confirm that our data has no loss in the communication and give out the specific throughput bandwidth of each FPGA boards.Table 3 shows the communication bandwidth consumption result during calculating 131,072 particles' potential, the time of which is 1297.113ms.Because of the solution's ring topologies, the consumption of communication bandwidth is very low. 4 shows the number of potential pipelines in each FPGA board, operation frequency, and their performance.The conversion between GFlops and MPair/s is based on Atsushi Kawai's work [16].Based on the above results, we calculate the whole system's theoretical parameters.Finally, we list the system's experimental result.

6.3.1.
Comparison with Software.We choose CPU and GPU solutions as the control groups of our work.Fabian's RAMSES code is a widely used method for MOND simulation [11].Therefore, we choose their method as the reference software implementation and use Intel Xeon E5-2660 to run different scale test.Based on RAMSES's QUMOND method and Nitin's direct N-body kernels to simulate the same work set on NVIDIA's Tesla K80 GPU, Figure 14 shows the time taken to calculate potential in different platform.The benefit from the structure of FPGA is that RP-ring is faster than other solutions at the beginning.In the right-end, the NVIDIA results appear to be better than the RP-ring, because as the number of particles increases, our platform is approaching its theoretical maximum performance (503.5 of 620.1 GFlops), and Tesla K80 has not reached its limits.As for the CPU solution, it uses much more time compared to RP-ring and GPU, limited by its poor capability of parallel computing.When simulating a system with 131072 particles, our work is 193 times faster than Xeon E5-2660 CPU and can achieve similar performance to Tesla K80.  4 shows a number of hardware implementation details of N-body simulation.Junichiro's GRAPE-4 cluster uses 1692 pipeline and achieves 1.08 TFlops [7].And he builds a compute cluster with GRAPE-6 chips whose peak performance is 1.349 TFlops [9].Furthermore, his GRAPE-8 achieves 960 GFlops with just one board [8].Reference [14] showed a solution with GPU whose performance is 781 GFlops.Some FPGA solutions are also listed in the

Conclusion and Discussion
In this paper, we proposed an extensible solution: RP-ring, which is used for heterogeneous multi-FPGA-based directsummation N-body simulation, and a model to decompose workload among each FPGA.RP-ring tries to use existing FPGA boards rather than designing new specialized boards to reduce cost.The solution can be expanded conveniently with any heterogeneous FPGA boards and the communication bandwidth requirement is quite low, so that the communication protocol could be designed to be simple and consume few resource.The model considers the constraint of FPGA's logic resource, memory access bandwidth, and communication bandwidth to divide workload reasonably and optimize the whole system's performance.We also build a heterogeneous multi-FPGA system based on RP-ring and it for MOND theory's numerical simulation.The experimental result shows that the low cost multi-FPGA system is 193 times faster than high-end CPU implementation and achieves similar performance to high performance GPU.
-；ＭＭ = m i Particle i information (ii) -；ＭＭ = m i Particle i information

Figure 5 :
Figure 5: The data flow of RP-ring.

Table 4 :
The number of potential pipelines.