
Recent advances in Field-Programmable Gate Array (FPGA) technology make reconfigurable computing with FPGAs an attractive platform for accelerating scientific applications. We develop a deeply pipelined and parallel architecture for Quantum Monte Carlo simulations using FPGAs. Quantum Monte Carlo simulations enable us to obtain the structural and energetic properties of atomic clusters. We experiment with different pipeline structures for each component of the design and develop a deeply pipelined architecture that provides the best performance in terms of achievable clock rate while making modest use of the FPGA resources. We discuss the details of the pipelined, generic architecture used to obtain the potential energy and wave function of a cluster of atoms.

Reconfigurable Computing (RC) using Field-Programmable Gate Arrays (FPGAs) is emerging as a computing paradigm to accelerate the computationally intensive functions of an application [

Our design goals are performance, numerical accuracy, and flexibility. To quantify performance, we measure the speedup achieved by our hardware implementation over the optimized software application.

The paper is organized as follows. In Section

Quantum Monte Carlo (QMC) methods are widely used in physics and physical chemistry to obtain the ground-state properties of atomic or molecular clusters [

We use a flavor of QMC called the Variational Monte Carlo (VMC) algorithm [

Figure

Data movement in the QMC algorithm.

Table

Total execution time versus compute time for 1 iteration.

Components | Time (s) or %
---|---
Potential energy calculation | 0.7615 s
Wave function calculation | 2.739 s
Compute time for 1 iteration | 3.500 s
Total time for 1 iteration | 3.715 s
Fraction of total time spent on compute | 94%
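As a quick sanity check on the profile above, the two profiled kernels account for essentially all of the per-iteration compute time, and compute dominates the iteration (values taken from the table; variable names are ours):

```python
# Profiling numbers from the table above (seconds).
potential_energy_s = 0.7615   # potential energy calculation
wave_function_s = 2.739       # wave function calculation
compute_s = 3.500             # compute time for one iteration
total_s = 3.715               # total time for one iteration

# The two kernels sum to the compute time, and compute is ~94% of the total.
kernel_sum = potential_energy_s + wave_function_s
compute_fraction = compute_s / total_s
print(f"kernels: {kernel_sum:.4f} s, compute fraction: {compute_fraction:.1%}")
```

This is why the hardware effort targets exactly these two kernels: by Amdahl's law, accelerating them bounds the residual serial fraction at about 6% of the iteration.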

Figure

Top-level Block Diagram.

In addition to the above components, lookup tables implemented in on-chip Block RAMs (BRAMs) store the coordinate positions of the atoms and the interpolation coefficients for the function evaluations. A state machine controller generates the addresses to the BRAMs that store the coordinate positions and transfers the (

The 18-kbit embedded BRAMs on the Xilinx FPGA are used as lookup tables to store the (

BlockRAM Usage by the Lookup Tables.

| | 16 KB (8) | 18 KB (10) | 18 KB (10) |
|---|---|---|---|
| | 16 KB (8) | 18 KB (10) | 18 KB (10) |
| | 16 KB (8) | 18 KB (10) | 18 KB (10) |
| Total | 48 KB (24) | 54 KB (30) | 54 KB (30) |

Memory Layout.

From our initial experiments varying the order of interpolation, we infer that quadratic interpolation with coefficients (
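Table-based quadratic interpolation of the kind described here can be sketched in software as follows. Each bin stores three coefficients fit to the target function over that bin, and evaluation costs two multiplies and two adds. The fitting method (a Lagrange fit through each bin's endpoints and midpoint), the example function, and all names below are illustrative assumptions, not the paper's exact scheme:

```python
import math

def build_table(f, lo, hi, nbins):
    """Fit a quadratic a + b*t + c*t**2 per bin, t in [0, 1) bin-local."""
    width = (hi - lo) / nbins
    table = []
    for i in range(nbins):
        x0 = lo + i * width
        y0, y1, y2 = f(x0), f(x0 + width / 2), f(x0 + width)
        # Lagrange fit through (0, y0), (0.5, y1), (1, y2)
        c = 2 * y0 - 4 * y1 + 2 * y2
        b = -3 * y0 + 4 * y1 - y2
        a = y0
        table.append((a, b, c))
    return table, lo, width

def interpolate(table, lo, width, x):
    i = int((x - lo) / width)       # bin index (linear binning, region I style)
    t = (x - lo) / width - i        # position within the bin, in [0, 1)
    a, b, c = table[i]
    return a + t * (b + t * c)      # two multiplies, two adds (Horner form)

tab, tlo, tw = build_table(math.exp, 0.0, 4.0, 256)
print(interpolate(tab, tlo, tw, 1.234), math.exp(1.234))
```

The Horner-form evaluation maps directly onto a multiplier-adder-multiplier-adder chain, which is why the interpolation pipeline described below needs exactly two multipliers and two adders per function.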

For region I coefficients, we require 6 KB of memory, and for region II coefficients, we require 48 KB of memory for a total of 54 KB. For potential energy and wave function calculations, the coefficient memories consume a total of 60 BRAMs on the FPGA. The amount of memory needed to store the interpolation coefficients is given in Table

We use dual-port BRAMs, such that Port

State machine generates addresses to read

The ground-state properties such as potential energy and wave function are functions of the coordinate distance,

Squared distance calculation,

We use the Xilinx IP cores from the CoreGen library [

Data Widths and latencies of

Signal/Core | Data width | Latency (clock cycles)
---|---|---
Input coordinates | 32-bit (signed 12.20) | —
Subtractor | 32-bit i/p, 33-bit o/p | 1
Multiplier | 33-bit i/p, 66-bit o/p | 7
Adder1 | 65-bit i/p, 66-bit o/p | 1
Adder2 | 66-bit i/p, 67-bit o/p | 1
Output (squared distance) | 53-bit (unsigned 27.26) | —
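A bit-accurate software model of this squared-distance datapath may help clarify the widths in the table above. Coordinates are signed 12.20 fixed point, each difference grows to 33 bits, squaring yields 40 fractional bits, and the sum of the three squared terms is truncated to the unsigned 27.26 output format. The function names are ours, and the final truncation (rather than rounding) is an assumption:

```python
FRAC_IN = 20    # signed 12.20 coordinate format
FRAC_OUT = 26   # unsigned 27.26 squared-distance format

def to_fixed(x, frac):
    """Convert a float to a fixed-point integer with `frac` fractional bits."""
    return int(round(x * (1 << frac)))

def squared_distance_fixed(p, q):
    """p, q: 3-tuples of signed 12.20 fixed-point integers."""
    acc = 0
    for a, b in zip(p, q):
        d = a - b        # 33-bit signed difference
        acc += d * d     # 66-bit product with 2*FRAC_IN = 40 fractional bits
    # Truncate from 40 to 26 fractional bits for the 27.26 output.
    return acc >> (2 * FRAC_IN - FRAC_OUT)

p = tuple(to_fixed(v, FRAC_IN) for v in (1.5, -2.25, 0.75))
q = tuple(to_fixed(v, FRAC_IN) for v in (0.5, 0.25, -0.25))
r2 = squared_distance_fixed(p, q) / (1 << FRAC_OUT)
print(r2)  # (1.0)**2 + (-2.5)**2 + (1.0)**2 = 8.25
```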

The schemes to look up the interpolation coefficients differ between the two regions of each function, as the functions exhibit different numerical behavior in each region. The squared distance r^{2}

Region I Bin Calculation.

We employ a logarithmic binning scheme in region II, dividing it into 21 regimes of 64 bins each, for a total of 1344 bins. For regions I and II, we have a total of [256 + 21

Region II Bin Calculation (First stage to obtain the regime).

Region II Bin Calculation (Second stage to obtain the bin address).
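The two binning schemes can be modeled as below: region I uses equal-width (linear) bins, while region II selects a regime from the power-of-two interval that r^{2} falls in and then bins linearly within that regime. The 256 region I bins are inferred from the total quoted in the text; the region boundary value and the use of power-of-two regimes are our assumptions (in hardware the regime would likely come from the leading-one position of r^{2}, and the bin from the next six bits):

```python
import math

R1_BINS = 256           # region I bins (from the 256 + 21*64 total in the text)
REGIMES, BINS_PER = 21, 64
R1_MAX = 1.0            # assumed boundary between regions I and II (illustrative)

def region1_bin(r2):
    """Linear binning: equal-width bins over [0, R1_MAX)."""
    return int(r2 / R1_MAX * R1_BINS)

def region2_bin(r2):
    """Logarithmic binning: power-of-two regime, then 64 linear bins within it."""
    regime = min(int(math.log2(r2 / R1_MAX)), REGIMES - 1)
    lo = R1_MAX * (1 << regime)              # lower edge of the regime
    bin_in_regime = int((r2 - lo) / lo * BINS_PER)
    return regime * BINS_PER + bin_in_regime

print(region1_bin(0.5))   # middle of region I
print(region2_bin(3.0))   # regime 1, partway through
```

Logarithmic binning keeps the relative bin width roughly constant, so a smooth, slowly decaying function is interpolated with comparable accuracy across a wide dynamic range without storing an enormous number of uniform bins.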

Figure

Data widths and latencies of

Signal/Core | Data width | Latency (clock cycles)
---|---|---
Interpolation coefficients | — | —
Multiplier1 | 52-bit i/p, 104-bit o/p | 7
Adder1 | 52-bit i/p, 52-bit o/p | 1
Multiplier2 | 52-bit i/p, 104-bit o/p | 7
Adder2 | 52-bit i/p, 52-bit o/p | 1
Pairwise Potential Energy/Wave Function | signed 0.52 | —

Potential energy accumulation (

Figure

Wave Function accumulation (

We perform successive multiplications of region I potential values and region I and II wave function values. The distances we sample are such that most values of potential energy and wave function are close to one. Hence, repeated multiplication of a number of these values (especially for large
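The precision hazard of repeated multiplication can be seen in a toy model. The 0.52 format is the design's pairwise-value format; the value 0.99 and the counts are purely illustrative. A long product of values just below one decays exponentially, and once it drops below 2^-52 the truncating fixed-point product collapses to exactly zero while the true product is merely tiny:

```python
from math import prod

FRAC = 52  # the design's signed 0.52 pairwise-value format

def fixed_product(values):
    """Multiply in 0.52 fixed point, truncating after every multiplication."""
    acc = 1 << FRAC                                  # 1.0 in 0.52 fixed point
    for v in values:
        acc = (acc * int(v * (1 << FRAC))) >> FRAC   # truncate back to 0.52
    return acc / (1 << FRAC)

print(fixed_product([0.99] * 4000))   # underflows to exactly 0.0
print(prod([0.99] * 4000))            # true product is tiny but nonzero
```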

There are also overflow issues associated with the accumulator while accumulating region II potential values. To overcome this problem, we accumulate the region II potentials in a large register that can hold
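One way to see the width requirement for such an overflow-safe accumulator (the concrete widths here are illustrative, not the paper's register sizes): summing N nonnegative addends, each below 1.0 in an unsigned .26 fixed-point format, needs ceil(log2(N)) integer guard bits so the running sum can never overflow.

```python
import math

FRAC = 26       # assumed fractional width of each addend
N = 100_000     # assumed number of accumulated pairwise values

guard_bits = math.ceil(math.log2(N))     # integer guard bits required

# Worst case: every addend is the largest representable value below 1.0.
acc = 0
for _ in range(N):
    acc += (1 << FRAC) - 1

print(guard_bits, acc < (1 << (FRAC + guard_bits)))
```

In hardware this is cheap: the guard bits are just extra register width on the accumulator, with no change to the per-cycle adder logic.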

We target the FPGA implementation to the Cray XD1 high performance reconfigurable computing system [

The design modules are developed in VHDL and synthesized using the Xilinx XST tools. The implementation flow, consisting of translation, mapping, and place-and-route, is carried out using the Xilinx ISE tools. The user design is clocked at 100 MHz. Table

Resource usage on the Virtex-4 FPGA.

Resource type | Virtex-4 LX160 FPGA utilization
---|---
Slices | 50%
Block RAMs | 51%
DSP48s | 52%

Plot of relative error versus number of atoms.

Speed up versus number of atoms.

We present a pipelined reconfigurable architecture to obtain the potential energy and wave function of clusters of atoms. We outline the goals that are critical to our design and carefully evaluate the available design choices for the building blocks of our architecture to obtain optimal performance, numerical accuracy, and resource usage. Our design choices, including a pipelined architecture, fixed-point representation, and a quadratic interpolation scheme for function evaluation, enable us to achieve significant performance gains over the software version running on the processor. Our hardware design strategy for the Quantum Monte Carlo simulation offers a speed up of 40

This work was supported by the National Science Foundation Grant NSF CHE-0625598, and the authors gratefully acknowledge prior support for related work from the University of Tennessee Science Alliance. The authors also thank Mr. Dave Strenski, Cray Inc., for providing them access to the Cray XD1 platform that was used to target our implementation.