Exploring Many-Core Design Templates for FPGAs and ASICs



Introduction
Direct hardware implementations, using platforms such as FPGAs and ASICs, possess a huge potential for exploiting application-specific parallelism and performing efficient computation. As a result, the overall performance of custom hardware-based implementations is often higher than that of software-based ones [1,2]. To attain bare-metal performance, however, programmers must employ hardware design principles such as clock management, state machines, pipelining, and device-specific memory management, all concepts well outside the expertise of application-oriented software developers.
These observations raise a natural question: does there exist a more productive abstraction for high-performance hardware design? Based on modern programming disciplines, one viable approach would (1) allow programmers to express parallelism through some API defined in a high-level programming language, (2) support coarse-grain multithreading and fine-grain threading while permitting bit-level resource control, and (3) reduce the effort required to repurpose the implemented hardware platform for different algorithms or different applications. This paper proposes an abstraction that constrains the design to a microarchitectural template, accompanied by an API, that meets these programmer requirements.
Intuitively, constraining the design to a template would likely result in performance degradation compared to fully customized solutions. Consider the high-level chart plotting designer effort versus performance, shown in Figure 1. We argue that the shaded region in the figure is attainable by template-based designs and warrants a systematic exploration. To that end, this work attempts to quantify the performance/area tradeoff, with respect to designer effort, across template-based, hand-optimized, and programmable approaches on both FPGA and ASIC platforms. From our analysis, we show how a disciplined approach with architectural constraints, without resorting to manual hardware design, may reduce design time and effort while maintaining acceptable performance.
Figure 1: Landscape of modern computing platforms. Ease of application design and implementation versus performance (GPP stands for general purpose processor). Platforms plotted: GPP, GPU, FPGA (HDL), ASIC, and MARC.
In this paper, we study microarchitectural templates in the context of a compute-intensive data-parallel Bayesian inference application. Our thesis, therefore, is that we can efficiently map our application to hardware while being constrained to a many-core template and parallel programming API. We call this project MARC, for Many-core Approach to Reconfigurable Computing, although template-based architectures can be applied outside the many-core paradigm.
We think of a template as an architectural model with a set of parameters to be chosen based on characteristics of the target application. Our understanding of which aspects of the architecture to parameterize continues to evolve as we investigate different application mappings. However, obvious parameters in the many-core context are the number of processing cores, core arithmetic width, core pipeline depth, richness and topology of an interconnection network, and customization of cores (from addition of specialized instructions to fixed-function datapaths), as well as details of the cache and local store hierarchies. In this study we explore a part of this space and compare the performance/area between MARC and hand-optimized designs in the context of a baseline GPGPU implementation.
The rest of the paper is organized as follows: Section 2 introduces the Bayesian network inference application, the case study examined in this paper. Section 3 describes the execution model used by OpenCL, the high-level language used to describe our MARC and GPGPU implementations. The application mapping for the GPGPU platform is detailed in Section 4. We discuss the hand-optimized and MARC implementations in Section 5. Section 6 covers hardware mappings for both the hand-optimized and MARC designs as well as a comparison between FPGA and ASIC technology. Finally, in Section 7, we compare the performance/area between the MARC, hand-optimized, and GPGPU implementations.
1.1. Related Work. Numerous projects and products have offered ways to ease FPGA programming by giving developers a familiar C-style language in place of HDLs [3]. Early research efforts including [4][5][6] formed the basis for recent commercial offerings: Catapult C from Mentor Graphics, ImpulseC, Synfora Pico from Synopsys, and AutoESL from Xilinx, among others. Each of these solutions requires developers to understand hardware-specific concepts and to program using a model that differs greatly from standard C; in a sense, they are using an HDL with a C syntax. Unlike these approaches, the goal of MARC is to expose software programming models applicable to the design of efficient hardware.
There has been a long history of mapping conventional CPU architectures to FPGAs. Traditionally, soft processors have been used either as a controller for a dedicated computing engine, or as an emulation or prototyping platform for design verification. These efforts have primarily employed a single or small number of processor cores. A few FPGA systems with a large number of cores have been implemented, such as the RAMP project [7]. However, the primary design and use of these machines have been as emulation platforms for custom silicon designs.
There have been many efforts, both commercial and academic, on customization of parameterized processor cores on an application-specific basis. The most widely used is Xtensa from Tensilica, where custom instructions are added to a conventional processor architecture. We take processor customization a step further by allowing the instruction processors to be replaced with an application-specific datapath that can be generated automatically via our C-to-gates tool for added efficiency and performance. We leverage standard techniques from C-to-gates compilation to accomplish the generation of these custom datapaths.
More recently, there have been several other efforts in integrating the C-to-gates flow with parallel programming models [8,9]. These projects share with MARC the goal of exploiting progress in parallel programming languages and automatically mapping to hardware.

Application: Bayesian Network Inference
This work's target application is a system that learns Bayesian network structure from observation data. Bayesian networks (BNs) and graphical models have numerous applications in bioinformatics, finance, signal processing, and computer vision. Recently they have been applied to problems in systems biology and personalized medicine, providing tools for processing ever-increasing amounts of data provided by high-throughput biological experiments. BNs' probabilistic nature allows them to model uncertainty in real-life systems as well as the noise that is inherent in many sources of data. Unlike Markov Random Fields and undirected graphs, BNs can easily learn sparse and causal structures that are interpretable by scientists [10][11][12].
We chose to compare MARC in the context of Bayesian inference for two primary reasons. First, Bayesian inference is a computationally intensive application believed to be particularly well suited for FPGA acceleration, as illustrated by [13]. Second, our group, in collaboration with Stanford University, has expended significant effort over the previous two years developing several generations of a hand-optimized FPGA implementation tailored for Bayesian inference [13,14]. Therefore, we have not only a concrete reference design but also well-corroborated performance results for fair comparisons with hand-optimized FPGA implementations.

Statistics Perspective.
BNs are statistical models that capture conditional independence between variables via the local Markov property: a node is conditionally independent of its nondescendants, given its parents. Bayesian inference is the process by which a BN's graph structure is learned from the quantitative observation, or evidence, that the BN seeks to model. Once a BN's structure (a set of nodes {V_1, ..., V_N}) is determined, and the conditional dependence of each node V_i on its parent set Π_i is tabulated, the joint distribution over the nodes can be expressed as
$$P(V_1, \ldots, V_N) = \prod_{i=1}^{N} P(V_i \mid \Pi_i).$$
Despite significant recent progress in algorithm development, computational inference of a BN's graph structure from evidence is still NP-hard [15] and remains infeasible except for cases with only a small number of variables [16].
The algorithm surveyed in this paper is the union of two BN inference kernels, the order and graph samplers (jointly called the "order-graph sampler"). An order is given by a topological sort of a graph's nodes, where each node is placed after its parents. Each order can include or be compatible with many different graphs. The order sampler takes a BN's observation data and produces a set of "high-scoring" BN orders (orders that best explain the evidence). The graph sampler takes this set of high-scoring orders and produces a single highest-scoring graph for each order. The observation data is generated in a preprocessing step and consists of N (for each node) sets of P local-score/parent-set pairs (which we will refer to as "data" for short). A local score describes the likelihood that a given parent set is a node's true parent set, given an order. A postprocessing step is performed after the order-graph sampler to normalize scores and otherwise clean up the result before it is presented to the user.
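The relationship between an order and the graphs it admits can be made concrete with a small sketch. The node names and the four-node graph below are hypothetical, and the helper names `compatible` and `order_is_topological` are illustrative only; the sketch assumes that a parent set is compatible with an order exactly when every parent precedes the node.

```python
def compatible(node, parent_set, order):
    """Return True if every parent in parent_set appears before node in order."""
    position = {v: i for i, v in enumerate(order)}
    return all(position[p] < position[node] for p in parent_set)


def order_is_topological(order, parents):
    """Return True if order places each node after all of its parents."""
    return all(compatible(v, parents[v], order) for v in order)


# Hypothetical graph: A -> C, B -> C, C -> D
parents = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}

# Two different orders are compatible with the same graph.
print(order_is_topological(["A", "B", "C", "D"], parents))
print(order_is_topological(["B", "A", "C", "D"], parents))
# Placing C before its parent B breaks compatibility.
print(order_is_topological(["A", "C", "B", "D"], parents))
```

Conversely, one order is compatible with many graphs, since any subset of a node's predecessors is an admissible parent set.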
In this work, we only study the order and graph sampler steps for two reasons. First, the order-graph sampler is responsible for most of the algorithm's computational complexity. Second, the pre- and postprocessing phases are currently performed on a GPP (general purpose processor) platform regardless of the platform chosen to implement the order-graph sampler.
Following [14,16,17], the order-graph sampler uses Markov chain Monte Carlo (MCMC) sampling to perform an iterative random walk in the space of BN orders. First, the algorithm picks a random initial order. The application then iterates as follows: (1) the current order is modified by swapping two nodes to form a "proposed order," which is (2) scored, and (3) either accepted or rejected according to the Metropolis-Hastings rule. The scoring process itself (1) breaks the proposed order into N disjoint "local orders," and (2) iterates over each node's parent sets, accumulating each local score whose parent set is compatible with the node's local order. For a network of N nodes, the proposed order's score can be efficiently calculated by [18]
$$\mathrm{Score}(O_p \mid D) = \prod_{i=1}^{N} \; \sum_{\Pi_i \text{ compatible with } O_p} \mathrm{LocalScore}(V_i, \Pi_i; D),$$
where D is the set of raw observations that are used by the preprocessor to generate local scores, and O_p is the proposed order. This iterative operation continues until the score has converged.
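One MCMC iteration as described above (swap, score, Metropolis-Hastings test) might be sketched as follows. This is a minimal illustration, not the implementation evaluated in this paper: it assumes local scores are stored as natural logarithms, that the swap proposal is symmetric, and that `data[node]` is a list of (log local score, parent set) pairs. The function names are hypothetical.

```python
import math
import random


def log_add(a, b):
    """Return log(exp(a) + exp(b)), computed as max + log1p(exp(-|a - b|))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = (a, b) if a >= b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))


def score_order(order, data):
    """Log-score of an order: sum over nodes of log-sum of compatible local scores."""
    position = {v: i for i, v in enumerate(order)}
    total = 0.0
    for node in order:
        acc = -math.inf
        for log_local_score, parent_set in data[node]:
            # A parent set is compatible if all parents precede the node.
            if all(position[p] < position[node] for p in parent_set):
                acc = log_add(acc, log_local_score)
        total += acc
    return total


def mcmc_step(order, current_score, data, rng):
    """Propose a two-node swap and accept it by the Metropolis-Hastings rule."""
    i, j = rng.sample(range(len(order)), 2)
    proposed = list(order)
    proposed[i], proposed[j] = proposed[j], proposed[i]
    proposed_score = score_order(proposed, data)
    # 1 - rng.random() lies in (0, 1], so the logarithm is always defined.
    if math.log(1.0 - rng.random()) < proposed_score - current_score:
        return proposed, proposed_score
    return order, current_score


# Hypothetical two-node data set: b may have no parent or parent a.
data = {
    "a": [(0.0, frozenset())],
    "b": [(0.0, frozenset()), (math.log(3.0), frozenset({"a"}))],
}
rng = random.Random(0)
order, score = ["b", "a"], score_order(["b", "a"], data)
for _ in range(10):
    order, score = mcmc_step(order, score, data, rng)
```

Working in the log domain turns the product over nodes into a sum and the inner sum into repeated `log_add` calls, which is where an operation of the form log1p(exp(d)) arises.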
To decrease time to convergence, C orders (together called a chain) can be dispatched over a temperature ladder and exchanged per iteration in a technique known as parallel tempering. Additionally, R independent chains (called multiple restarts) can be dispatched to increase confidence that the optimum score is a global optimum. [...] which equates to 36457 and 66712 (N = 32 and N = 37, respectively) for the design points that we study (see Section 7).
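The expression that these two values evaluate is not reproduced above. As a consistency check only, both values coincide with the number of subsets of at most four nodes drawn from the other N - 1 nodes. The bound of four parents per node is an inference from the numbers and is an assumption here, as is the function name `parent_set_count`.

```python
from math import comb


def parent_set_count(n_nodes, max_parents=4):
    """Count candidate parent sets of size 0..max_parents among the other nodes."""
    return sum(comb(n_nodes - 1, k) for k in range(max_parents + 1))


for n in (32, 37):
    print(n, parent_set_count(n))
```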

Compute
We classify the reformulated loop nest as compute intensive for two reasons. First, a relatively small amount of input (a local order) is needed for the score() function to compute per-node results over the R * C orders. Second, D[n] (shown in Algorithm 1) depends on N and not R * C. Since N ≈ 37 and R * C = 512 (i.e., R * C ≫ N), in practice there is a large amount of compute time between changes of n in D[n].

The OpenCL Execution Model
To enable highly productive hardware design, we employ a high-level language and execution model well suited for the paradigm of applications we are interested in studying: data-parallel, compute-bound algorithms. Due to its popularity, flexibility, and good match with our goals, we employ OpenCL (Open Computing Language) as the programming model used to describe applications.
OpenCL [19] is a programming and execution model for heterogeneous systems containing GPPs, GPGPUs, FPGAs [20], and other accelerators, designed to explicitly capture data and task parallelism in an application. The OpenCL model distinguishes control thread(s) (to be executed on a GPP host) from kernel threads (data-parallel loops to be executed on a GPGPU, or similar, device). The user specifies how the kernels map to an n-dimensional dataset, given a set of arguments (such as constants or pointers to device or host memory). The runtime then distributes the resulting workload across available compute resources on the device. Communication between control and kernel threads is provided by shared memory and OpenCL system calls such as barriers and bulk memory copy operations.
A key property of OpenCL is its memory model. Each kernel has access to three disjoint memory regions: private, local, and global. Global memory is shared by all kernel threads, local memory is shared by threads belonging to the same group, while private memory is owned by one kernel thread. This alleviates the need for a compiler to perform costly memory access analysis to recognize dependencies before the application can be parallelized. Instead, the user specifies how the application is partitioned into data-parallel kernels. With underlying SIMD principles, OpenCL is well suited for data-parallel problems and maps well to the parallel thread dispatch architecture found in GPGPUs.
Algorithm 1: The reformulated order-graph sampler loop nest. {S_o, S_g} and G are the current {order, graph} scores and graph associated with an order. initialize() generates a random order, swap() exchanges nodes in an order, and save() saves a result for the postprocessing step.

Baseline GPGPU Implementation
To implement the order-graph sampler on the GPGPU, the application is first divided into different parts according to their characteristics. The scoring portion of the algorithm, which exhibits abundant data parallelism, is partitioned into a kernel and executed on the GPGPU, while the less parallelizable score accumulation is executed on a GPP. This ensures that the kernel executed on the GPGPU is maximally parallel and exhibits no interthread communication, an approach we experimentally determined to be optimal. Under this scheme, the latencies of the control thread and score accumulation phases of the application, running on a GPP, are dominated by the latency of the scoring function running on the GPGPU. Moreover, the score() kernel (detailed in the following section) has a relatively low bandwidth requirement, allowing us to offload accumulation to the GPP, lowering total latency. The GPP-GPGPU implementation is algorithmically identical to the hardware implementations, aside from minor differences in the precision of the log1p(exp(d)) operation, and yields identical results up to the random sequence used for Monte Carlo integration.
4.1. Optimization of Scoring Kernel. We followed four main strategies in optimizing the scoring unit kernel: (1) minimizing data transfer overhead between the control thread and the scoring function, (2) aligning data in device memory, (3) allocating kernel threads to compute units on the GPGPU, and (4) minimizing latency of a single kernel thread.
First, we minimize data transfers between the GPP and GPGPU by only communicating changing portions of the data set throughout the computation. At application startup, we statically allocate memory for all arrays used on the GPGPU, statically set these arrays' pointers as kernel arguments, and copy all parent sets and local scores into off-chip GPGPU memory to avoid copying static data each iteration. Each iteration, the GPP copies R * C proposed orders to the GPGPU and collects R * C * N proposed order/graph scores, as well as R * C graphs from the GPGPU. Each order and graph is an N × N matrix, represented as N 64-bit integers, while partial order and graph scores are each 32-bit integers (additional range is introduced when the partial scores are accumulated). The resulting bandwidth requirement per iteration is 8 * R * C * N bytes from the GPP to the GPGPU and 16 * R * C * N bytes from the GPGPU back to the GPP. In the BNs surveyed in this paper, this bandwidth requirement ranges from 128 to 256 KB (GPP to GPGPU) and from 256 to 512 KB (GPGPU to GPP). Given these relatively small quantities and the GPGPU platform's relatively high transfer bandwidth over PCIe, the transfer latency approaches a minimal value. We use this to our advantage and offload score accumulation to the GPP, trading significant accumulation latency for a small increase in GPP-GPGPU transfer latency. This modification gives us an added advantage by avoiding intra-kernel communication altogether (which is costly on the GPGPU because it does not offer hardware support for producer-consumer parallelism).
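The per-iteration transfer sizes follow directly from the data widths given above and can be checked with a few lines of arithmetic. The sketch assumes R * C = 512, the value quoted earlier for the design points; the function name `transfer_bytes` is hypothetical. Note that the upper ends of the quoted ranges correspond to N = 64, which would be consistent with N being padded to a power of two, although the text does not state this.

```python
def transfer_bytes(r_times_c, n):
    """Return (GPP-to-GPGPU, GPGPU-to-GPP) bytes moved per iteration."""
    # Each proposed order is N rows of one 64-bit (8-byte) integer.
    to_gpgpu = 8 * r_times_c * n
    # Per order and node: a 32-bit order score and a 32-bit graph score,
    # plus one 64-bit graph row.
    from_gpgpu = (4 + 4) * r_times_c * n + 8 * r_times_c * n
    return to_gpgpu, from_gpgpu


for n in (32, 64):
    down, up = transfer_bytes(512, n)
    print(n, down // 1024, "KB to GPGPU,", up // 1024, "KB to GPP")
```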
Second, we align and organize data in memory to maximize access locality for each kernel thread. GPGPU memories are seldom cached, while DRAM accesses are several words wide, comparable to GPP cache lines. We therefore coalesce memory accesses to reduce the memory access range of a single kernel and of multiple kernels executing on a given compute unit. No thread accesses (local scores and parent sets) are shared across multiple nodes, so we organize local scores and parent sets by [N][P]. When organizing data related to the R * C orders (the proposed orders, graph/order scores, and graphs), we choose to maximally compact data for restarts, then chains, and finally nodes ([N][C][R]). This order is based on the observation that a typical application instance will work with a large number of restarts relative to chains. When possible, we align data in memory, rounding R, C, and P up to the next powers of two to avoid false sharing in wide-word memory operations and to improve alignment of data in memory.
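The layout described above can be expressed as flat-index arithmetic. The following sketch is illustrative: the helper names are hypothetical, and the padded sizes `C_pad`, `R_pad`, and `P_pad` stand for the dimensions after rounding up to a power of two.

```python
def next_pow2(x):
    """Round x up to the nearest power of two (1 for x <= 1)."""
    return 1 if x <= 1 else 1 << (x - 1).bit_length()


def ncr_index(n, c, r, C_pad, R_pad):
    """Flat offset into an [N][C][R] array; restarts are the innermost dimension."""
    return (n * C_pad + c) * R_pad + r


def np_index(n, p, P_pad):
    """Flat offset into an [N][P] array of local scores or parent sets."""
    return n * P_pad + p


# Hypothetical sizes: 4 chains and 128 restarts, already powers of two.
C_pad, R_pad = next_pow2(4), next_pow2(128)
# Threads handling consecutive restarts touch consecutive addresses.
print(ncr_index(3, 2, 10, C_pad, R_pad), ncr_index(3, 2, 11, C_pad, R_pad))
```

With restarts innermost, threads that differ only in r read and write adjacent words, which is the property that coalescing relies on.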
Third, allocating kernel threads to compute units is straightforward given the way we organize data in device memory; we allocate multiple threads with localized memory access patterns. Given our memory layout, we first try dispatching multiple restarts onto the same compute unit. If more threads are needed than restarts available, we dispatch multiple chains as well. We continue increasing the number of threads per compute unit in this way until we reach an optimum: the point where overhead due to multithreading overtakes the benefit of additional threads. Many of the strategies guiding our optimization effort are outlined in [21].
Finally, we minimize the scoring operation latency over a single kernel instance. We allow the compiler to predicate conditionals to avoid thread divergence. Outside the inner loop, we explicitly precompute offsets to access the [N][P] and [N][C][R] arrays to avoid redundant computation. We experimentally determined that loop unrolling the score() loop has minimal impact on kernel performance, so we allow the compiler to unroll freely. We also evaluated a direct implementation of the log1p(exp(d)) operation versus the use of a lookup table in shared memory (which mirrors the hand-optimized design's approach). Due to the low utilization of the floating point units by this algorithm, the direct implementation tends to perform better than a lookup table given the precision required by the algorithm.
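The two alternatives for log1p(exp(d)) can be contrasted in a short sketch. The table range, table size, and nearest-entry lookup below are assumptions chosen for illustration, not the parameters used in either implementation; the sketch also assumes d <= 0, as is the case when d is the difference between the smaller and larger of two log-domain scores.

```python
import math


def softplus_direct(d):
    """Direct evaluation of log1p(exp(d)), intended for d <= 0."""
    return math.log1p(math.exp(d))


def build_table(d_min=-16.0, entries=1024):
    """Precompute log1p(exp(d)) on a uniform grid over [d_min, 0]."""
    step = -d_min / (entries - 1)
    table = [math.log1p(math.exp(d_min + i * step)) for i in range(entries)]
    return table, d_min, step


def softplus_lut(d, table, d_min, step):
    """Nearest-entry table lookup; inputs at or below d_min are treated as 0."""
    if d <= d_min:
        return 0.0
    idx = round((d - d_min) / step)
    return table[min(idx, len(table) - 1)]


table, d_min, step = build_table()
worst = max(
    abs(softplus_lut(-k / 100.0, table, d_min, step) - softplus_direct(-k / 100.0))
    for k in range(1601)
)
print("largest absolute error on the test grid:", worst)
```

The table trades arithmetic for a memory access and a bounded approximation error; which side wins depends on how busy the floating point units are, as noted above.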

4.2. Benchmarking the GPGPU Implementation. To obtain GPGPU measurements, we mapped the data-parallel component to the GPGPU via OpenCL and optimized the resulting kernel as detailed in Section 4.1. We measured the relative latency of each phase of the algorithm by introducing a number of GPP and GPGPU timers throughout the iteration loop. We then computed the latency of each phase of computation (scoring, accumulation, MCMC, etc.) and normalized to the measured latency of a single iteration with no profiling syscalls. To measure the iteration time, we ran the application for 1000 iterations with no profiling code in the loop and then measured the total time elapsed using the system clock. We then computed the aggregate iteration latency.

Architecture on Hardware Platforms
As with the GPGPU implementation, when the Bayesian inference algorithm is mapped to hardware platforms (FPGA/ASIC), it is partitioned into two communicating entities: a data-parallel scoring unit (a collection of Algorithmic-cores or A-cores) and a control unit (the Control-core, or C-core). The A-cores are responsible for all iterations of the score() function from Algorithm 1, while the C-core implements the serial control logic around the score() calls. This scheme is applied to both the hand-optimized design and the automatically generated MARC design, though each of them has different interconnect networks, memory subsystems, and methodologies for creating the cores.

Hand-Optimized Design.
The hand-optimized design mapping integrates the jobs of the C-core and A-cores on the same die and uses a front-end GPP for system initialization and result collection. At the start of a run, network data and a set of R * C initial orders are copied to a DRAM accessible by the A-cores, and the C-core is given a "Start" signal. At the start of each iteration, the C-core forms R * C proposed orders, partitions each by node, and dispatches the resulting N * R * C local orders as threads to the A-cores. As each iteration completes, the C-core streams results back to the front-end GPP while proceeding with the next MCMC iteration.
The hand-optimized design is partitioned into four clock domains. First, we clock A-cores at the highest frequency possible (between 250 and 300 MHz) as these have a direct impact on system performance. Second, we clock the logic and interconnect around each A-core at a relatively low frequency (25-50 MHz) as the application is compute bound in the cores. Third, we set the memory subsystem to the frequency specified by the memory (∼200 MHz, using a DRAM DIMM in our case). Finally, the C-core logic is clocked at 100 MHz, which we found to be ideal for timing closure and tool run time given a performance requirement (the latency of the C-core is negligible compared to the A-cores).

Scoring Unit.
The scoring unit (shown in Figure 2) consists of a collection of clustered A-cores, a point-to-point interface with the control unit, and an interface to off-chip memory.
A scoring unit cluster caches some or all of a node's data, takes a stream of local orders as input, and outputs order/graph scores as well as partial graphs. A cluster is made up of {A-core, RAM} pairs, where each RAM streams data to its A-core. When a cluster receives a local order, it (a) pages in data from DRAM as needed by the local orders and strip-mines that data evenly across the RAMs, and (b) dispatches the local order to each core. Following Algorithm 1, R * C local orders can be assigned to a cluster per DRAM request. Once data is paged in, each A-core runs P_f / U_c iterations of the score() inner loop (from Algorithm 2), where P_f is the subset of P that was paged into the cluster, and U_c is the number of A-cores in the cluster.
A-core clusters are designed to maximize local order throughput. A-cores are replicated to the highest extent possible to maximize read bandwidth achievable by the RAMs. Each A-core is fine-grained multithreaded across multiple iterations of the score() function and uses predicated execution to avoid thread divergence in case of non-compatible (!compatible()) parent sets. To avoid structural hazards in the scoring pipeline, all scoring arithmetic is built directly into the hardware.
Mapping a single node's scoring operation onto multiple A-cores requires increased complexity in accumulating partial node scores at the end of the score() loop. To maximally hide this delay, we first interleave cross-thread accumulation with the next local order's main scoring operation (shown in Figure 3). Next, we chain A-cores together using a dedicated interconnect, allowing cross-core partial results to be interleaved into the next local order in the same way as threads. Per core, this accumulation scheme adds T cycles of accumulation overhead to the scoring process, for a T-thread datapath, and a single additional cycle for cross-core accumulation. To simplify the accumulation logic, we linearly reduce all threads across an A-core and then accumulate linearly across A-cores. The tradeoff here is that the last local order's accumulation is not hidden by another local order being scored and takes T^2 + T * U_c cycles to finish, where U_c is the number of A-cores in the cluster.
Given sufficient hardware resources, more advanced A-core clusters can be built to further increase system throughput. First, the number of A-cores per RAM can be increased to the number of read ports each RAM has. Second, since a given node's data (D[n]) does not change over the R * C local orders, A-core chains can be replicated entirely. In this case, we say that the cluster has been split into two or more lanes, where A-cores from each lane are responsible for a different local order. In this setup, the cluster's control strip-mines local orders across lanes to initiate scoring. While scoring, corresponding A-cores (the first A-core in each of several lanes, e.g.) across the lanes (called tiles) read and process the same data from the same RAM data stream. An example of an advanced A-core cluster is shown in Figure 4.
The following analytic model can be used to estimate the parallel completion time to score O_l local orders over the P_f subset of the data (for a single cluster): where Cycles_DRAM is the number of cycles (normalized to the core clock) required to initialize the cluster from DRAM, U_c is the number of A-cores per lane (doubles when two SRAM ports are used, etc.), U_l is the number of lanes per cluster, and T is the number of hardware threads per A-core.

Memory Subsystem.
The scoring unit controls DRAM requests when an A-core cluster requires a different subset of the data. Regardless of problem parameters, data is always laid out contiguously in memory. As DRAM data is streamed to a finite number of RAMs, there must be enough RAM write bandwidth to consume the DRAM stream. In cases where the RAM write capability does not align to the DRAM read capacity, dedicated alignment circuitry built into the scoring unit dynamically realigns the data stream.

Control Unit.
We implemented the MCMC control unit directly in hardware, according to Figure 5. The MCMC state machine, node swapping logic, parallel tempering logic, and Metropolis-Hastings logic are mapped as hardware state machines. Furthermore, a DSP block is used for multiplicative factors, while log(rand(0, 1)) is implemented as a table lookup. The random generators for row/column swaps, as well as Metropolis-Hastings and parallel tempering, are built using free-running LFSRs.
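A free-running LFSR of the kind mentioned above can be modeled in software as follows. The register width, tap positions, and seed are assumptions for illustration (a widely used 16-bit maximal-length configuration); the polynomial used in the actual control unit is not specified in the text.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps at bits 16, 14, 13, 11) by one shift."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def lfsr16_period(seed=0xACE1):
    """Count shifts until the register returns to its (nonzero) seed."""
    state, count = lfsr16_step(seed), 1
    while state != seed:
        state = lfsr16_step(state)
        count += 1
    return count


# A maximal-length 16-bit LFSR visits every nonzero state exactly once.
print(lfsr16_period())
```

In hardware the register simply shifts every clock cycle, so a consumer samples whatever state is present when it needs a random value; the all-zero state must be avoided because the register would never leave it.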
At the start of each iteration, the control unit performs node swaps for each of the R * C orders and schedules the proposed orders onto available compute units. To minimize control unit time when R * C is small, orders are stored in row order in RAM, making the swap operation a single-cycle row swap, followed by an N-cycle column swap. Although the control unit theoretically has cycle-accurate visibility of the entire system and can therefore derive optimal schedules, we found that using a trivial greedy scheduling policy (first come, first served) negligibly degrades performance with the benefit of significantly reducing hardware complexity. To minimize A-core cluster memory requirements, all R * C local orders are scheduled to compute units in bulk over a single node.
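The row-then-column swap can be sketched on a bit matrix stored as one integer per row. The matrix semantics are an assumption made for this illustration: bit b of row a is taken to mean that node b precedes node a in the order. The function name `swap_nodes` is hypothetical.

```python
def swap_nodes(rows, i, j):
    """Swap nodes i and j in an N x N bit matrix held as one integer per row."""
    rows = list(rows)
    # Row swap: one exchange, regardless of N.
    rows[i], rows[j] = rows[j], rows[i]
    # Column swap: exchange bits i and j in each of the N rows.
    for k, row in enumerate(rows):
        diff = ((row >> i) ^ (row >> j)) & 1  # 1 only if the two bits differ
        rows[k] = row ^ ((diff << i) | (diff << j))
    return rows


# Hypothetical 4-node order [2, 0, 3, 1] encoded as a precedence matrix.
rows = [0b0100, 0b1101, 0b0000, 0b0101]
print([format(r, "04b") for r in swap_nodes(rows, 0, 3)])
```

The loop over rows mirrors the N-cycle column swap: each row needs its own bit exchange, whereas the row swap touches only two memory words.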
When each iteration is underway, partial scores received from the scoring unit are accumulated as soon as they are received, using a dedicated A-core attached to a buffer that stores partial results. In practice, each A-core cluster can only store data for a part of a given node at a time. This means that the A-core, processing partial results, must perform both the slower score() operation and the simpler cross-node "+" accumulations. We determined that a single core dedicated to this purpose can rate match the results coming back from the compute-bound compute units.
At the end of each iteration, Metropolis-Hastings checks proceed in [R][C] order. This allows the parallel tempering exchange operation for restart r to be interleaved with the Metropolis-Hastings check for restart r + 1.

Many-Core Template.
The overall architecture of a MARC system, as illustrated in Figure 6, resembles a scalable, many-core-style processor architecture, comprising one Control Processor (C-core) and multiple Algorithmic Processing Cores (A-cores). Both the C-core and the A-cores can be implemented as conventional pipelined RISC processors. However, unlike embedded processors commonly found in modern SOCs, the processing cores in MARC are completely parameterized with variable bit width, reconfigurable multithreading, and even aggregate/fused instructions. Furthermore, A-cores can alternatively be synthesized as fully customized datapaths. For example, in order to hide global memory access latency, improve processing node utilization, and increase the overall system throughput, a MARC system can perform fine-grained multithreading through shift register insertion and automatic retiming. Finally, while each processing core possesses a dedicated local memory accessible only to itself, a MARC system has a global memory space implemented as distributed memories accessible by all processing cores through the interconnect network. Communication between a MARC system and its host can be realized by reading and writing global memory.
Figure 4: The hand-optimized A-core cluster. This example contains four tiles and three lanes and uses two RAM read ports per tile. "GS" stands for graph sampler.

Execution Model and Software Infrastructure.
Our MARC system builds upon both LLVM, a production-grade open-source compiler infrastructure [22], and OpenCL. Figure 7 presents a high-level schematic of a typical MARC machine. A user application runs on a host according to the models native to the host platform, a high-performance PC in our study. Execution of a MARC program occurs in two parts: kernels that run on one or more A-cores of the MARC devices and a control program that runs on the C-core. The control program defines the context for the kernels and manages their execution. During the execution, the MARC application spawns kernel threads to run on the A-cores, each of which runs a single stream of instructions as SPMD units (each processing core maintains its own program counter).

Application-Specific Processing Core.
One strength of MARC is its capability to integrate fully customized application-specific processing cores/datapaths so that the kernels in an application can be more efficiently executed. To this end, a high-level synthesis flow depicted by Figure 8 was developed to generate customized datapaths for a target application.
The original kernel source code in C/C++ is first compiled through llvm-gcc to generate the intermediate representation (IR) in the form of a single static assignment graph (SSA), which forms a control flow graph where instructions are grouped into basic blocks.Within each basic block, the instruction parallelism can be extracted easily as all false dependencies have been removed in the SSA representation.Between basic blocks, the control dependencies can then be transformed to data dependencies through branch predication.In our implementation, only memory operations are predicated since they are the only instructions that can generate stalls in the pipeline.By converting the control dependencies to data dependencies, the boundaries between basic blocks can be eliminated.This results in a single data flow graph with each node corresponding to a single instruction   in the IR.Creating hardware from this graph involves a one-to-one mapping between each instruction and various predetermined hardware primitives.To utilize loop level parallelism, our high-level synthesis tool also computes the minimal interval at which a new iteration of the loop can be initiated and subsequently generates a controller to pipeline loop iterations.Finally, the customized cores have the original function arguments converted into inputs.In addition, a simple set of control signals is created to initialize a C-core and to signal the completion of the execution.For memory accesses within the original code, each nonaliasing memory pointer used by the C function is mapped to a memory interface capable of accommodating variable memory access latency.The integration of the customized cores into a MARC machine involves mapping the input of the cores to memory addresses accessible by the control core, as well as the addition of a memory handshake mechanism allowing cores to access global and local memories.For the results reported in this paper, the multithreaded customized cores are created by 
manually inserting shift registers into the single-threaded, automatically generated core.
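The text does not state how the synthesis tool computes the minimal interval between loop iterations. As a minimal sketch, the following uses the standard recurrence bound from modulo scheduling, which is an assumption here and not necessarily the tool's method. The function name and the example dependence cycles are hypothetical.

```python
from math import ceil

def recurrence_min_interval(cycles):
    """Lower bound on the initiation interval from loop-carried dependence cycles.

    Each entry is (total_latency, total_distance): the summed operation latency
    around one dependence cycle and the number of loop iterations it spans.
    """
    if not cycles:
        return 1  # no loop-carried dependence: one new iteration per clock
    return max(ceil(latency / distance) for latency, distance in cycles)

# Hypothetical loop: a 1-cycle accumulator carried across 1 iteration and
# a 6-cycle recurrence that spans 2 iterations.
print(recurrence_min_interval([(1, 1), (6, 2)]))  # 3
```

An interval of 1 corresponds to the case reported later in the paper, where each iteration of the inner loop of score() takes one cycle on an A-core.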

Host-MARC Interface.
Gigabit Ethernet is used to implement the communication link between the host and the MARC device. We leveraged the GateLib [23] project from Berkeley to implement the host interface, allowing the physical transport to be easily replaced by a faster medium in the future.

Memory Organization.
Following OpenCL, A-core threads have access to three distinct memory regions: private, local, and global. Global memory permits read and write access to all threads within any executing kernels on any processing core (ideally, reads and writes to global memory may be cached depending on the capabilities of the device; however, in our current MARC machine implementation, caching is not supported). Local memory is a section of the address space shared by the threads within a computing core. This memory region can be used to allocate variables that are shared by all threads spawned from the same computing kernel. Finally, private memory is a memory region that is dedicated to a single thread. Variables defined in one thread's private memory are not visible to another thread, even when they belong to the same executing kernel.
Physically, the private and local memory regions in a MARC system are implemented using on-chip memories. Part of the global memory region also resides on-chip, but we allow external memory (i.e., through the DRAM controller) to extend the global memory region, resulting in a larger memory space.

Kernel Scheduler.
To achieve high throughput, kernels must be scheduled to avoid memory access conflicts. The MARC system allows for a globally aware kernel scheduler, which can orchestrate the execution of kernels and control access to shared resources. The global scheduler is controlled via a set of memory-mapped registers, which are implementation specific. This approach allows for a range of schedulers, from simple round-robin or priority schedulers to complex problem-specific scheduling algorithms.

International Journal of Reconfigurable Computing
The MARC machine optimized for Bayesian inference uses the global scheduler to dispatch threads at a coarse grain (ganging up thread starts). The use of the global scheduler is therefore rather limited, as the problem does not greatly benefit from a globally aware approach to scheduling.
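To illustrate the simplest policy in the range mentioned above, the following is a minimal sketch of round-robin dispatch. It models only the assignment order; the memory-mapped register interface is implementation specific and is not modeled, and the function name and kernel identifiers are hypothetical.

```python
from collections import deque

def round_robin_dispatch(pending, num_cores):
    """Assign pending kernel IDs to cores in round-robin order.

    Returns (core, kernel) pairs in the order they would be dispatched.
    """
    queue = deque(pending)
    schedule = []
    core = 0
    while queue:
        schedule.append((core, queue.popleft()))
        core = (core + 1) % num_cores  # wrap around to the first core
    return schedule

print(round_robin_dispatch(["k0", "k1", "k2"], num_cores=2))
```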

System Interconnect.
One of the key advantages of reconfigurable computing is the ability to exploit application-specific communication patterns in the hardware system. MARC allows the network to be selected from a library of various topologies, such as mesh, H-tree, crossbar, or torus. Application-specific communication patterns can thus be exploited by providing low-latency links along common routes.
The MARC machine explores two topologies: a pipelined crossbar and a ring, as shown in Figure 9. The pipelined crossbar makes no assumptions about the communication pattern of the target application: it is a nonblocking network that provides uniform latency to all locations in the global memory address space. Due to the large number of endpoints on the network, the crossbar is limited to 120 MHz with 8 cycles of latency.
The ring interconnect only implements nearest-neighbor links, thereby providing very low-latency access to some locations in global memory, while requiring multiple hops for other accesses. Nearest-neighbor communication is important in the Bayesian inference accumulation phase and helps reduce overall latency. Moreover, this network topology is significantly more compact and can be clocked at a much higher frequency, approaching 300 MHz in our implementations. The various versions of our MARC machine, therefore, made use of the ring network because of the advantages it has shown for this application.
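The latency tradeoff between the two topologies can be sketched numerically. The crossbar figures (8 cycles at 120 MHz) and the ring clock (300 MHz) come from the text; the ring size, the one-cycle-per-hop cost, and bidirectional links are assumptions made only for illustration.

```python
def ring_hops(src, dst, n, bidirectional=True):
    """Number of nearest-neighbor links traversed between two ring endpoints."""
    forward = (dst - src) % n
    return min(forward, n - forward) if bidirectional else forward

def latency_ns(cycles, clock_mhz):
    """Convert a cycle count at a given clock frequency to nanoseconds."""
    return cycles * 1000.0 / clock_mhz

CYCLES_PER_HOP = 1  # assumption: per-hop latency is not stated in the text

crossbar_ns = latency_ns(8, 120)  # uniform latency quoted for the crossbar
for dst in (1, 9, 18):
    hops = ring_hops(0, dst, n=36)  # assumption: 36 endpoints on the ring
    ring_ns = latency_ns(hops * CYCLES_PER_HOP, 300)
    print(dst, hops, round(ring_ns, 1), round(crossbar_ns, 1))
```

Under these assumptions, a nearest-neighbor access on the ring costs a few nanoseconds, while every crossbar access pays the full uniform latency.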

Mapping Bayesian Inference onto the MARC Machine.
The order-graph sampler comprises a C-core for the serial control logic and A-cores to implement the score() calls. Per iteration, the C-core performs the node swap operation, broadcasts the proposed order, and applies the Metropolis-Hastings check. These operations consume a negligible amount of time relative to the scoring process.
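The per-iteration work of the C-core can be sketched as follows. This is a minimal illustration that assumes a uniformly random pair swap (a symmetric proposal) and log-space scores; `mh_step` and `score_fn` are hypothetical names, and `score_fn` stands in for the scoring that the A-cores perform.

```python
import math
import random

def mh_step(order, score_fn, rng):
    """One sampler iteration: swap two nodes, rescore, then accept or reject."""
    i, j = rng.sample(range(len(order)), 2)
    proposal = list(order)
    proposal[i], proposal[j] = proposal[j], proposal[i]
    old, new = score_fn(order), score_fn(proposal)  # log-space scores (assumption)
    # Metropolis-Hastings check for a symmetric proposal: accept with
    # probability min(1, exp(new - old)).
    if new >= old or rng.random() < math.exp(new - old):
        return proposal
    return list(order)

rng = random.Random(0)
print(mh_step([0, 1, 2, 3], lambda o: 0.0, rng))
```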
Scoring is composed of (1) the parent set compatibility check and (2) an accumulation across all compatible parent sets. Step 1 must be performed over every parent set; its performance is limited by how many parent sets can be simultaneously accessed. We store parent sets in on-chip RAMs that serve as A-core private memory, so Step 1 is limited by the number of A-cores and the attainable A-core throughput.
Step 2 must first be carried out independently by each A-core thread, then across A-core threads, and finally across the A-cores themselves. We serialize cross-thread and cross-core accumulations. Each accumulation is implemented with a global memory access.
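The two scoring steps and the three-level accumulation can be sketched in software. Several details below are assumptions, not taken from the text: parent sets and the set of nodes preceding the scored node are encoded as bitmasks, the order score is accumulated with a log-space sum, and the graph score is the maximum local score. All function names are hypothetical.

```python
import math

def logaddexp(a, b):
    """Stable log(exp(a) + exp(b)); -inf acts as the identity element."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))

EMPTY = (-math.inf, -math.inf, None)  # (order score, graph score, best parent set)

def thread_partial(pairs, allowed):
    """Steps 1 and 2 over one thread's slice of (parent_set_mask, local_score)."""
    total, best, best_ps = EMPTY
    for ps, ls in pairs:
        if (ps & ~allowed) == 0:  # Step 1: every parent precedes the node
            total = logaddexp(total, ls)
            if ls > best:
                best, best_ps = ls, ps
    return total, best, best_ps

def combine(a, b):
    """Merge two partial results; reused across threads and across cores."""
    winner = a if a[1] >= b[1] else b
    return logaddexp(a[0], b[0]), winner[1], winner[2]

def score(cores, allowed):
    """cores: one list per A-core, each holding one slice per thread."""
    result = EMPTY
    for threads in cores:            # serialized cross-core accumulation
        core_acc = EMPTY
        for pairs in threads:        # serialized cross-thread accumulation
            core_acc = combine(core_acc, thread_partial(pairs, allowed))
        result = combine(result, core_acc)
    return result
```

The sequential loops mirror the serialized cross-thread and cross-core accumulations; in hardware, each `combine` corresponds to a global memory access.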
The larger order-graph sampler benchmark we chose (see Section 7) consists of up to 37 nodes, where each of the nodes has 66712 parent sets. We divide these 66712 elements into 36 chunks and dedicate 36 A-cores to work on this data set. After completion of the data processing for one node, data from the next node is paged in, and we restart the A-cores.
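As a worked example of this partitioning, the sketch below splits the 66712 parent sets into 36 chunks. How the remainder is distributed is not stated in the text; giving one extra element to the first few chunks is an assumption.

```python
def chunk_sizes(total, chunks):
    """Split `total` elements into `chunks` near-equal parts."""
    base, extra = divmod(total, chunks)
    # The first `extra` chunks receive one additional element (assumption).
    return [base + 1] * extra + [base] * (chunks - extra)

sizes = chunk_sizes(66712, 36)
print(max(sizes), min(sizes), sum(sizes))  # 1854 1853 66712
```

Each A-core therefore holds at most 1854 parent sets of the current node in its private memory.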

Hardware Prototyping
For this research, both the hand-optimized design and the MARC machines are implemented targeting a Virtex-5 (XC5VLX155T-2) of a BEEcube BEE3 module for FPGA prototyping. We also evaluate how each design performs when mapped through a standard ASIC design flow targeting a TSMC 65 nm CMOS process. A design point summary, which we will develop over the rest of the paper, is given in Table 1.
The local memories, or "RAMs", used in each design point were implemented using block RAMs (BRAMs) on an FPGA and generated as SRAM (using foundry-specific memory generators) on an ASIC. All of our design points benefit from as much local memory read bandwidth as possible. We increased read bandwidth on the FPGA implementation by using both ports of each BRAM block and exposing each BRAM as two smaller single-port memories. For the ASIC platform, the foundry-specific IP generator gives us the capability to create small single-ported memories suitable for our use.
In addition to simple memories, our designs used FIFOs, arbiters, and similar hardware structures to manage flow control and control state. On an FPGA, most of these blocks were available on the Virtex-5 through Xilinx Coregen, while the rest were taken from the GateLib library. On an ASIC, all of these blocks were synthesized from GateLib Verilog or generated using foundry tools.
To obtain all FPGA measurements, we designed in Verilog RTL and mapped the resulting system using Synplify Pro (Synopsys) and the Xilinx ISE flow for placement and routing (PAR). To obtain ASIC measurements, we used a standard-cell-based Synopsys CAD flow including Design Compiler and IC Compiler.
No manual placement or hierarchical design was used for our studies. We verified the resulting system post-PAR by checking (a) timing closure and (b) functionality of the flattened netlist. The tools were configured to automatically retime the circuit to assist timing closure, at the expense of hardware resources. It is worth noting that the automatic retiming did not work as well with the MARC multithreaded cores because of a feedback path in the core datapath. Therefore, manual retiming was required for performance improvement with the MARC multithreaded design points.

Hand-Optimized Configurations.
On the FPGA platform, the best performing configurations were attained when using 48 cores per FPGA running at a 250 MHz core clock and 36 cores at 300 MHz (the former outperforming the latter by a factor of 1.1 on average). Both of these design points used between 65% and 75% of available device LUTs and exactly 95% of device BRAMs. We found that implementing 48 cores at 300 MHz was not practical due to routing limitations, and we use the 48-core version at 250 MHz for the rest of the paper.
For the ASIC implementation, because performance is a strong function of the core clock's frequency, we optimize the core clock as much as possible. When the Verilog RTL is supplied to Synopsys Design Compiler with no prior optimization, the cores can be clocked at 500 MHz. Optimizing the core clock requires shortening the critical path, which is in the datapath. By increasing the number of threads from 4 to 8 and performing manual retiming for the multithreaded datapath, the core clock achieves 1 GHz.
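The reported factor of 1.1 between the two FPGA configurations is consistent with a simple aggregate-throughput estimate. The sketch below assumes that every core completes one inner-loop iteration per cycle and that the workload is compute bound; the function name is hypothetical.

```python
def aggregate_rate(cores, clock_mhz):
    """Inner-loop iterations per microsecond, at one iteration per core per cycle."""
    return cores * clock_mhz

ratio = aggregate_rate(48, 250) / aggregate_rate(36, 300)
print(round(ratio, 2))  # 1.11
```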

MARC Configurations.
The MARC implementation comprises one C-core and 36 A-cores. While the C-core in all MARC machines is a fully bypassed 4-stage RISC processor, MARC implementations differ in their implementation of the A-cores. For example, fine-grained multithreaded RISC cores, automatically generated application-specific datapaths, and multithreaded versions of the generated cores are all employed to explore different tradeoffs in design effort and performance. To maintain high throughput, the better performing A-cores normally execute multiple concurrent threads to saturate the long cycles in the application dataflow graph.

Memory System.
As in other computing platforms, memory accesses significantly impact the overall performance of a MARC system. In the current MARC implemen-

Table 1: A-core counts for all design points, and a naming convention for all MARC configurations used in the study. If only one core count is listed, it is the same for both the 32 and 37 node (32n and 37n) problems (see Section 7). All A-core counts are given for area-normalized designs, as discussed in Section 7.

With respect to area, the overhead of multithreading is more pronounced on an FPGA relative to an ASIC. For the 37 node benchmark, the MARC machines with single-threaded, two-way, and four-way multithreaded customized A-cores utilize 47%, 65%, and 80% of the flip-flops on the Virtex-5, respectively. Since they operate on the same amount of data, 85% of BRAMs are used for each of the three design points. Meanwhile, on an ASIC we only observe an area increase from 6.2 mm² in the single-threaded case to 6.4 mm² for the four-way multithreaded design. This is because the ASIC implementation exposes the actual chip area, where the increase in the number of registers is dwarfed by the large SRAM area.

Performance and Area Comparisons
We compare the performance of the hand-optimized design and the MARC machines on FPGA as well as ASIC platforms. For both the hand-optimized and the MARC implementations on an ASIC, we normalize our area to the FPGA's die area. FPGA die area was obtained by X-ray imaging the packaged dies and estimating the die area from the resulting photographs. For the remainder of the paper, all devices whose die areas and process nodes are relevant are given in Table 2.
For the FPGA designs, we packed the device to its limits without performance degradation. Effectively, the designs consume the entire area of the FPGA. We then performed a similar evaluation for the ASIC platform by attempting to occupy the same area as an FPGA. This is achieved by running the design for a small number of cores and then scaling up. This technique is valid as the core clock is not distributed across the network, and the network clock can be slow (50-100 MHz) without adversely affecting performance.
The specific Bayesian network instances we chose consist of 32 and 37 nodes, with datasets of 36457 and 66712 elements, respectively. The run times on each hardware platform are shown in Tables 4 and 5 for the 32 and 37 node problems, respectively. The execution time for each platform is also normalized to the fastest implementation, the hand-optimized design on ASIC, to show the relative performance of every design point.

Benchmark Comparison.
The large gap between the amount of data involved in the two problems gives each distinct characteristics, especially when mapped to an ASIC platform. Because data for the 32 node problem can fit on an ASIC for both MARC and the hand-optimized design, the problem is purely compute bound. The hand-optimized solution benefits from the custom pipelined accumulation and smaller and faster cores, resulting in its 2.5x performance advantage over the best MARC implementation. The 37 node problem, on the other hand, could not afford to have the entire dataset in the on-chip SRAMs. The required paging of data into the on-chip RAMs becomes the performance bottleneck. Having exactly the same DRAM controller as the MARC machines, the hand-optimized design only shows a small performance advantage over MARC, which can be attributed to its clever paging scheme. For the FPGA platform, both the 32 and 37 node problems involve paging of data, but as the run time is much longer, data transfer only accounts for a very small fraction of the execution time (i.e., both problems are compute bound).

MARC versus Hand-Optimized Design.
For compute-bound problems, it is clear that MARC using RISC instruction processors to implement A-cores achieves less than 2% of the performance exhibited by the hand-optimized implementation, even with an optimized interconnect topology (a ring network versus a pipelined crossbar). Customizing the A-cores, however, yields a significant gain in performance, moving MARC to within a factor of 4 of the performance of the hand-optimized implementation. Further optimizing the A-cores through multithreading pushes the performance even higher. The best performing MARC implementation is within a factor of 2.5 of the hand-optimized design and corresponds to two-way multithreaded A-cores. As on the FPGA platform, a further increase to four threads offers diminishing returns and is outweighed by the increase in area; the area-normalized performance therefore actually decreases.

Cross-Analysis against GPGPU.
We also benchmark the various hardware implementations of the order-graph sampler against the GPGPU reference solution, running on NVIDIA's GeForce GTX 580.
As the GTX 580 chip has a much larger area than the Virtex-5 FPGA and is also on a 40 nm process rather than 65 nm, we scaled its execution time according to the following equations, following Table 2:
To make sure the comparison is fair, the technology scaling [24] takes into account the absolute area difference between the GPU and FPGA, as well as the area and delay scaling (i.e., S, the technology scaling factor) due to different processes. Our first assumption is that the performance scales linearly with area, which is a good approximation given our Bayesian network problem and device sizes. Second, we assume zero wire slack across both process generations for all designs. The original and scaled execution times are displayed in Table 3.
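Under the two assumptions stated above, one plausible form of such a scaling can be written down. This is a hedged reconstruction for illustration only, not the equations used in the paper; the symbols A_GPU, A_FPGA, t_GPU, and the primed quantities are introduced here, and the die areas themselves are those listed in Table 2.

```latex
% Assumed form: port the GPU from 40 nm to 65 nm, then normalize to FPGA area.
\begin{align}
  S &= \frac{65\,\text{nm}}{40\,\text{nm}} = 1.625, \\
  A'_{\text{GPU}} &= S^{2}\, A_{\text{GPU}}, \\
  t'_{\text{GPU}} &= S \cdot \frac{A'_{\text{GPU}}}{A_{\text{FPGA}}} \cdot t_{\text{GPU}}.
\end{align}
```

Here the factor S models the delay increase on the older process, and the area ratio models the loss in performance when the scaled GPU is restricted to the FPGA's die area, assuming performance is linear in area.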
It can be seen from Tables 4 and 5 that MARC on FPGA can achieve the same performance as the GPGPU when application-specific A-cores are used. With multithreading, the best MARC implementation on FPGA can achieve more than a 40% performance advantage over the GPGPU. Hand-optimized designs, with more customization at the core and network level, push this advantage even further to 3.3x. The reason for this speedup is that each iteration of the inner loop of the score() function takes 1 cycle for A-cores on MARC and the hand-optimized design, but 15 to 20 cycles on GPGPU cores. It is apparent that the benefit from exploiting loop-level parallelism at the A-cores outweighs the clock frequency advantage that the GPGPU has over the FPGA.
When an ASIC is used as the implementation platform, the speedup is affected by the paging of data, as explained in Section 7.1. For the 32 node problem, where paging of data is not required, the best MARC implementation and the hand-optimized design achieve 156x and 412x performance improvement over the GPGPU, respectively. For the 37 node problem, which requires paging, we observe a 69x and 84x performance advantage from the best MARC variant and hand-optimized implementation, respectively. Using only a single dual-channel DRAM controller, we have about 51.2 Gb/sec of memory bandwidth for paging. However, the GPGPU's memory bandwidth is 1538.2 Gb/sec, 30x that of our ASIC implementations. As a result, the GPGPU solution remains compute bound while our ASIC implementations are constrained by the memory bandwidth. Thus, the performance gap between the 32 and 37 node problems is due to the memory-bound nature of our ASIC implementations.
It is also interesting that the MARC implementation with RISC A-cores on ASIC is about 6 times faster for the 32 node problem and 8 times faster for the 37 node problem, compared to the GPGPU. With both MARC RISC A-cores and GPGPU cores, the kernel is executed as a sequence of instructions rather than by a custom datapath. In addition, the clock frequency gap between MARC on ASIC and the GPGPU is small. We claim that the performance gap is due to the application-specific nature of the MARC design: MARC is able to place more cores per unit area (see Table 1) while still satisfying the requirements of local caching. In addition, the network structure in MARC machines is also optimized to the Bayesian inference accumulation step. The combined effect results in a significantly better use of chip area for this application.
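The 30x bandwidth gap follows directly from the two figures quoted above. The sketch below reproduces that ratio; `transfer_time_s` is a hypothetical helper that ignores DRAM latency and protocol overhead.

```python
ASIC_GBPS = 51.2    # single dual-channel DRAM controller, from the text
GPU_GBPS = 1538.2   # GPGPU memory bandwidth, from the text

def transfer_time_s(bits, gbps):
    """Idealized time to move `bits` at `gbps` gigabits per second."""
    return bits / (gbps * 1e9)

ratio = GPU_GBPS / ASIC_GBPS
print(round(ratio, 1))  # 30.0
```

For the same paged data, the idealized transfer time on the ASIC is therefore about 30 times that on the GPGPU, which is why paging dominates only on the ASIC.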

Conclusion
MARC offers a methodology to design FPGA- and ASIC-based high-performance reconfigurable computing systems. It does this by combining a many-core architectural template, a high-level imperative programming model [19], and modern compiler technology [22] to efficiently target both ASICs and FPGAs for general-purpose, computationally intensive data-parallel applications. The primary objective of this paper is to understand whether a many-core architecture is a suitable abstraction layer (or execution model) for designing ASIC- and FPGA-based computing machines from an OpenCL specification. We are motivated by recently reemerging interest and efforts in parallel programming for newly engineered and upcoming many-core platforms, and feel that if we can successfully build an efficient many-core abstraction for ASICs and FPGAs, we can apply the advances in parallel programming to high-level automatic synthesis of computing systems. Of course, constraining an execution template reduces degrees of freedom for customizing an implementation using application-specific detail. However, we work under the hypothesis that much of the potential loss in efficiency can be recovered through customization of a microarchitectural template designed for a class of applications using application-specific information. The study in this paper represents our initial effort to quantify the loss in efficiency incurred for a significant gain in design productivity for one particular application.
We have demonstrated via the use of a many-core microarchitectural template for OpenCL that it is at least sometimes possible to achieve competitive performance relative to a highly optimized solution and to do so with a considerable reduction in development effort (days versus months). This approach also achieves a significant performance advantage over a GPGPU approach, a natural platform for mapping this class of applications. In this study, the most significant performance benefit came from customization of the processor cores to better fit the application kernel, an operation within reach of modern high-level synthesis flows.
Despite these results, the effectiveness of MARC in the general case remains to be investigated. We are currently limited by our ability to generate many high-quality hand-optimized custom solutions in a variety of domains to validate and benchmark template-based implementations. Nonetheless, we plan to continue this study, exploring more application domains, extending the many-core template tailored for OpenCL, and exploring template microarchitectures for other paradigms. We are optimistic that a MARC-like approach will open new frontiers for rapid prototyping of high-performance computing systems.

Algorithm 2: The score(D, O_l) function takes the data D (made of parent set (ps) and local score (ls) pairs) and a local order (O_l) as input. The scoring function produces an order score (s_o), graph score (s_g), and graph fragment (g).

Figure 3: Thread accumulation over a 4-thread/stage core for two adjacent local orders.

Figure 6: Diagram of key components in a MARC machine.

Table 2: Device die areas and process nodes.