Designing hardware cores for FPGAs can quickly become a complicated task, difficult even for experienced engineers. With the addition of more sophisticated development tools and maturing high-level language-to-gates techniques, designs can be rapidly assembled; however, when the design is evaluated on the FPGA, the performance may not be what was expected. Therefore, an engineer may need to augment the design to include performance monitors to better understand the bottlenecks in the system or to aid in the debugging of the design. Unfortunately, identifying what to monitor and adding the infrastructure to retrieve the monitored data can be a challenging and time-consuming task. Our work alleviates this effort. We present the Hardware Performance Monitoring Infrastructure (HwPMI), which includes a collection of software tools and hardware cores that can be used to profile the current design, recommend and insert performance monitors directly into the HDL or netlist, and retrieve the monitored data with minimal invasiveness to the design. Three applications are used to demonstrate and evaluate HwPMI’s capabilities. The results are highly encouraging as the infrastructure adds numerous capabilities while requiring minimal effort by the designer and low resource overhead to the existing design.
As hardware designers develop custom cores and assemble Systems-on-Chip (SoCs) targeting FPGAs, the challenge of the design meeting timing, fitting within the resource constraints, and balancing bandwidth and latency can lead to significant increases in development time. When a design does not meet a specific performance requirement, the designer typically must go back and manually add more custom logic to monitor the behavior of several components in the design. While this performance information can be used to better understand the inner workings of the system, as well as the interfaces between the subcomponents of the system, identifying and inserting infrastructure can quickly become a daunting task. Furthermore, the addition of the monitors may change the original behavior of the system, potentially obfuscating the identified performance bottleneck or design bug.
In this work, we focus on an extensible set of tools and hardware cores to enable a hardware designer to insert a minimally invasive performance monitoring infrastructure into an existing design, with little effort. The monitors are used in an introspective capacity, providing feedback about the design’s performance under real workloads, while running on real devices. This paper presents our Hardware Performance Monitoring Infrastructure (HwPMI), which is designed to ease the identification, insertion, and retrieval of performance monitors and their associated data in developed systems.
The motivation for the creation and evaluation of this infrastructure stems from the inherent need to insert monitors into existing designs and to retrieve the data with minimal invasiveness to the system. Over the last several years we have been assembling a repository of performance monitors as new designs are built and tested. To increase a designer’s productivity, we have put together a suite of software tools aimed at profiling existing designs and recommending and/or inserting performance monitors and the necessary hardware and software infrastructure. This work also leverages existing open-source work by including Torc (Tools for Open Reconfigurable Computing) to provide an efficient backend for reading, writing, and manipulating designs in EDIF format [
Included in HwPMI are the specific monitors, such as timers, state-machine trackers, component utilization, and so forth, along with a sophisticated monitoring network to retrieve each monitor’s data all while requiring little user intervention. To evaluate HwPMI, three use cases show how existing designs can utilize HwPMI to quickly integrate and retrieve monitoring data on running systems. Moreover, HwPMI is flexible enough to support both high performance reconfigurable computing (HPRC), running on the Spirit cluster as part of the Reconfigurable Computing Cluster (RCC) [
The remainder of this paper is organized into the following sections: in Section
As FPGA resources increase in number and diversity with each device generation, researchers are exploring architectures to outperform previous implementations and to investigate new designs that were previously limited by the technology. Unfortunately, a designer trying to exploit the inherent parallelism of FPGAs is often faced with the nontrivial task of identifying system bottlenecks, performance drains, and design bugs in the system. The result is often the inclusion of custom hardware cores tasked with monitoring key locations in the system, such as interfaces to the network, memory, finite-state machines, and other resources such as FIFOs or custom pipelines.
This is especially true in the rapidly growing field of High Performance Reconfigurable Computing (HPRC). There are several research projects underway that investigate the use of multiple FPGAs in a high-performance computing context: RAMP, Maxwell, Axel, and Novo-G [
Tools of some form are needed to help the designer manage the complexities associated with hardware design, such as timing requirements, resource limitations, routing, and so forth. FPGA vendors provide many tools beyond synthesis and implementation that reduce development time, including component generators [
There are also several projects investigating ways to monitor FPGA systems. The Owl system monitoring framework presented in [
Torc (Tools for Open Reconfigurable Computing) is a C++ open-source framework designed to simplify custom tool development and enable new research [
Torc consists of four core Application Programming Interfaces (APIs) and a collection of CAD tools built upon them. The APIs are the Generic Netlist API, the Physical Netlist API, the Device Architecture API, and the Bitstream Frames API. The main CAD tools are the router and placer. For HwPMI, the appeal of Torc’s generic netlist is that it allows us to insert performance monitors without having to modify the design’s HDL source. The Generic Netlist API supports netlists that are not mapped to physical hardware and provides full EDIF 2.0.0 reading, writing, and editing capability. The internal object model is flexible enough to support other netlist formats, if parsers and exporters are provided. Because the Generic Netlist API supports generic EDIF, it is usable for Xilinx FPGAs, non-Xilinx FPGAs, ASICs, or even circuit boards.
Early versions of HwPMI interacted with VHDL source to identify and insert performance monitors and the necessary infrastructure into existing designs. Torc is being added to expand beyond VHDL and to ensure that the original source remains unmodified after it has been profiled and evaluated—only the resulting synthesized netlists are modified. Another reason for migrating to Torc is its efficiency and scalability: a plot of EDIF read and write performance is provided in Figure
Torc’s Generic Netlist API I/O Performance. This log-log plot shows reasonably linear I/O performance for EDIF file sizes ranging from 1 KB to 175 MB. On a 2.8 GHz quad-core Xeon, the API reads 45-thousand lines per second, and 2.3 Megabytes per second on average. The largest of these files contains over 150,000 instances and over 200,000 nets. Shape differences between the two curves are most likely due to different name lengths in the EDIF files.
Another well-established CAD tool for reconfigurable computing is VPR [
The performance monitoring infrastructure assembled as part of this work builds upon our previous research in the area of resilient high-performance reconfigurable computing. In [
Unlike conventional approaches where a designer must manually create and insert monitors into their design, including the mechanisms to extract the monitored results, this work analyzes existing designs and generates the necessary infrastructure automatically. The result is a large repository of predesigned monitors, interfaces, and tools to aid the designer in rapidly integrating the monitoring infrastructure into existing designs. This work also provides the designer with the necessary software infrastructure to retrieve the performance data at user-defined intervals during the execution of the system.
The HwPMI tool flow, which will be discussed within this section, consists of several stages, as depicted in Figure
Hardware Performance Monitoring Infrastructure's Tool Flow.
Block diagram of Spirit cluster’s HwPMI.
Each worker node is running an application-specific SoC which includes the HwPMI hardware cores. These cores can be seen in Figure
Block diagram of FPGA node’s HwPMI.
Initial HwPMI development was targeted to support high-performance reconfigurable computing systems, such as Spirit. However, the tools and techniques are also easily adapted to support more traditional embedded system development with FPGAs. In fact, the sideband network can be replaced with a bus interface to give an embedded system access to its own performance monitoring data. While this introspective monitoring does add to the runtime overhead of the system, designers can now specify the interface mechanism to HwPMI. PowerPC 405, PowerPC 440, and MicroBlaze systems are supported through interfaces with the Processor Local Bus. An embedded system example is shown in Figure
Block diagram of HwPMI System-on-Chip running on embedded device under test.
The process of identifying performance monitors to be inserted into an existing design begins with Static HDL Profiling, shown in Figure
The Static HDL Profiler comprises three software tools written in Python that profile the original design with little manual intervention. Presently, HwPMI supports the Xilinx Synthesis Tool (XST) and designs written in VHDL; however, work is underway to extend beyond XST and VHDL through the use of Torc. The first tool, HwPMI Core Parser, provides parsing capabilities for VHDL files. The designer invokes the parser on the design from the command line and can name a particular VHDL source file to evaluate. The parser analyzes the VHDL source file and uses pattern matching to decompose the component into its basic blocks, identifying the entity’s structure in terms of ports, signals, finite-state machines, and instantiated components. The parser is only responsible for the identification of the VHDL component’s structure. The results are then serialized with Python’s pickle module for rapid integration with the remaining tools.
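The pattern-matching approach described above can be illustrated with a minimal Python sketch. This is not the HwPMI Core Parser itself: the regular expressions, the example entity `timer_core`, the function name `parse_vhdl`, and the layout of the resulting dictionary are all assumptions made for illustration, and the patterns cover only a small subset of VHDL.

```python
import pickle
import re

# Hypothetical VHDL input used only for this illustration.
VHDL_SOURCE = """
entity timer_core is
  port (
    clk   : in  std_logic;
    rst   : in  std_logic;
    start : in  std_logic;
    count : out std_logic_vector(31 downto 0)
  );
end entity timer_core;

architecture rtl of timer_core is
  type state_type is (IDLE, RUN, DONE);
  signal state   : state_type;
  signal counter : unsigned(31 downto 0);
begin
  u_fifo : entity work.sync_fifo port map (clk => clk);
end architecture rtl;
"""

FLAGS = re.IGNORECASE | re.MULTILINE
ENTITY_RE = re.compile(r"\bentity[ \t]+(\w+)[ \t]+is\b", FLAGS)
PORT_RE = re.compile(
    r"^[ \t]*(\w+)[ \t]*:[ \t]*(in|out|inout)[ \t]+([^;\n]+?)[ \t]*;?[ \t]*$", FLAGS)
SIGNAL_RE = re.compile(
    r"^[ \t]*signal[ \t]+(\w+)[ \t]*:[ \t]*([^;\n]+?)[ \t]*;", FLAGS)
ENUM_RE = re.compile(r"\btype[ \t]+(\w+)[ \t]+is[ \t]*\(([^)]*)\)", FLAGS)
INSTANCE_RE = re.compile(
    r"^[ \t]*(\w+)[ \t]*:[ \t]*(?:entity[ \t]+)?([\w.]+)[ \t]+(?:generic|port)[ \t]+map",
    FLAGS)


def parse_vhdl(source):
    """Decompose one VHDL source into ports, signals, FSMs, and instances."""
    entity = ENTITY_RE.search(source)
    # Enumerated types are candidate FSM state types.
    enums = {name.lower(): [s.strip() for s in body.split(",")]
             for name, body in ENUM_RE.findall(source)}
    signals = [{"name": n, "type": t} for n, t in SIGNAL_RE.findall(source)]
    # A signal declared with an enumerated type is treated as an FSM state register.
    fsms = {s["name"]: enums[s["type"].lower()]
            for s in signals if s["type"].lower() in enums}
    return {
        "entity": entity.group(1) if entity else None,
        "ports": [{"name": n, "dir": d.lower(), "type": t}
                  for n, d, t in PORT_RE.findall(source)],
        "signals": signals,
        "fsms": fsms,
        "instances": [{"label": l, "component": c}
                      for l, c in INSTANCE_RE.findall(source)],
    }


structure = parse_vhdl(VHDL_SOURCE)
# Serialize the result so that later tools can load it without re-parsing.
pickled = pickle.dumps(structure)
```

A downstream tool would then call `pickle.loads(pickled)` (or read the pickle from a file) to recover the same dictionary. A production parser would need to handle comments, multi-line declarations, and component declarations, which this sketch ignores.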
Next, the HwPMI System Analyzer tool iteratively parses the design to identify the different interfaces, such as bus slaves, bus masters, direct memory access, and Xilinx LocalLink. This is done at a higher level than the HwPMI Core Parser which more specifically analyzes individual IP Cores. Figure
Sample output of HwPMI System Analyzer tool on the Collatz Design, identifying the components, registers, statements, and interfaces to the core.
To support Xilinx Platform Studio (XPS) IP core development, the HwPMI Parse PCORE tool is used to parse Xilinx PCORE directory files: The Microprocessor Description (MPD), Peripheral Analysis Order (PAO), and Black Box Description (BBD) files, along with any Xilinx CoreGen (XCO) project files. This enables a designer to migrate profiled cores to other XPS systems with minimal effort. Furthermore, monitors can be written based on the Xilinx CoreGen project files to provide monitoring of components such as generated FIFOs, memory controllers, or floating point units, if so desired.
The next stage is Component Synthesis where the original hardware design is synthesized prior to any insertion of performance monitors. The purpose of synthesizing the design at this point is to retrieve additional design details from the synthesis reports including subcomponents, resource utilization, timing requirements, and behavior. This leverages the synthesis tool output to supplement the Static HDL Profiling stage by more readily identifying finite state machines and flip-flops in the design. All of the configuration information and synthesis results are available for performance monitor recommendation/insertion.
Three tools have been developed to specifically support the designer in the Component Synthesis stage. These tools automatically synthesize, parse, and aggregate the individual component utilization, resource utilization, and timing information data for the designer. The first tool is the Iterative Component Synthesis tool which runs the synthesis scripts for each of the components in the design. The second tool is the Parse Component Synthesis Reports tool: it runs after all of the system components have been synthesized, and collects a wealth of information about each component from the associated synthesis report file (SRP). This information includes the registers, FIFOs, Block RAMs, and finite-state machines (FSM), in addition to all subcomponents. The third tool is the Aggregate System Synthesis Data tool which is used to aggregate all of the data collected as part of the Parse Component Synthesis Reports tool. These tools collectively identify the system’s interconnects, processors, memory controllers, and network interfaces, in addition to the designer’s custom compute cores.
At this point the design has been analyzed in order to recommend specific performance monitors for insertion. The designer can choose to accept any number of these recommendations, from a report like that shown in Figure
Sample output of performance monitor recommendation tool.
Block diagram of performance monitor interface connecting to a simple configurable timer monitor.
It is important to emphasize that the HwPMI flow does not intelligently insert a subset of monitors when the available resources are depleted. Future work is looking into ways to weight monitors such that HwPMI can insert more important monitors; however, presently HwPMI recommends the available monitors that can be inserted into the design and it is up to the designer to choose the subset that will provide the best feedback versus resource availability trade-off.
When a performance monitor is created there is a set of criteria that must also be included to allow the recommendation to take place. For example, there is a PLB Slave Interface performance monitor which specifically monitors reads and writes to the hardware core’s slave registers. All signals are identified during profiling, but until these signals are matched against a list of predetermined signals, there is no specific way to identify when those signals are being written to. Another example considers finite-state machines: once an FSM has been identified by the system, it is trivial for the respective performance monitor to be recommended for insertion. The actual insertion of the performance monitors is done at the HDL level. Each performance monitor’s entity description and instance are automatically generated and inserted in the HDL.
The initial development of HwPMI focused on parsing designs written in VHDL and inserting monitors directly into the VHDL source. The advantage of this approach is the portability of the monitored design. However, by leveraging tools such as Torc, the insertion can be performed at the netlist level. The HwPMI tool flow has now been augmented to support inserting the monitoring infrastructure into either the VHDL source or into synthesized netlists, based on a user parameter. Continued work is underway to perform the netlist profiling with Torc instead of relying on the HDL parsing tools. Specifically, Torc inserts the monitors into the EDIF design. Figure
Insertion of Torc into Hardware Performance Monitoring Infrastructure’s Tool Flow.
Once the design is running, it is necessary to retrieve the performance monitoring data with minimal invasion to the system. This is accomplished through the use of the sideband network in the HPRC system or the HwPMI SoC in an embedded system. The head node issues requests to retrieve data from a node, core, or a specific hardware monitor anytime the application is running. To aid in the retrieval, the Performance Monitor Collection tool assembles the entire system’s performance monitoring data structure for the head node to use for runtime data collection. This data is stored in a
Three applications are used to demonstrate our HwPMI tool flow: single precision matrix-matrix multiplication, a hardware implementation of the Smith/Waterman
Matrix-Matrix Multiplication (MMM) is a basic algebraic operation where two matrices,
Three sets of performance monitors were selected and inserted into the MMM core. The first monitors the PLB slave IPIF to measure the efficiency of the transfers from the processor. The second monitors the eighteen FIFOs to track their capacity and determine whether more buffer resources should be allocated in the future to improve performance. The third monitors the utilization of the core itself to determine how much time the core spends on actual computation versus I/O. The results indicate that the largest bottleneck in the current design is the PLB slave IPIF: the processor spends over 98% of the total execution time transferring the matrix data and results into and out of the MMM hardware core. Furthermore, the results for the FIFOs showed very low overall utilization, which indicates that the FIFO depth can be reduced.
The second design evaluated with HwPMI is a hardware-accelerated implementation of the Smith/Waterman algorithm, commonly used in protein and nucleotide sequence alignments [
From the Static HDL Profiling and Component Synthesis stages, six performance monitors were identified for inclusion into the Smith/Waterman hardware core. Figure
Smith/Waterman core’s performance monitors.
The first performance monitor identifies the number of writes to the software-addressable registers in the Smith/Waterman hardware core via the PLB Slave interface (PLB SLV IPIF), the results of which are listed in Table
Performance monitor results for PLB SLV IPIF.
Register name | Original # Reads | Original # Writes | Modified # Reads | Modified # Writes |
---|---|---|---|---|
Control_reg | 0 | 186 | 0 | 186 |
Core_status | 29540 | 0 | 29540 | 0 |
aa1 | 0 | 2095745 | 0 | 2095745 |
n1 | 0 | 2095838 | 0 | 93 |
n0 | 0 | 2095838 | 0 | 93 |
GG | 0 | 2095838 | 0 | 93 |
HH | 0 | 2095838 | 0 | 93 |
f_str_waa_s | 0 | 2095838 | 0 | 93 |
score | 186 | 0 | 186 | 0 |
ssj | 0 | 2095838 | 0 | 93 |
Counter_idle | 186 | 0 | 186 | 0 |
Counter_work | 186 | 0 | 186 | 0 |
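To make the effect of the modification easier to see, the following Python sketch totals the read and write counts from the table above. The counts are copied from the table; the dictionary layout and the helper names `total_reads` and `total_writes` are illustrative assumptions only.

```python
# (reads, writes) per software-addressable register, copied from the table.
ORIGINAL = {
    "Control_reg": (0, 186),
    "Core_status": (29540, 0),
    "aa1": (0, 2095745),
    "n1": (0, 2095838),
    "n0": (0, 2095838),
    "GG": (0, 2095838),
    "HH": (0, 2095838),
    "f_str_waa_s": (0, 2095838),
    "score": (186, 0),
    "ssj": (0, 2095838),
    "Counter_idle": (186, 0),
    "Counter_work": (186, 0),
}

# Only the six registers n1, n0, GG, HH, f_str_waa_s, and ssj change.
MODIFIED = dict(ORIGINAL)
for name in ("n1", "n0", "GG", "HH", "f_str_waa_s", "ssj"):
    MODIFIED[name] = (0, 93)


def total_reads(counts):
    """Sum the read transactions over all registers."""
    return sum(reads for reads, _ in counts.values())


def total_writes(counts):
    """Sum the write transactions over all registers."""
    return sum(writes for _, writes in counts.values())


writes_before = total_writes(ORIGINAL)
writes_after = total_writes(MODIFIED)
writes_removed = writes_before - writes_after
fraction_removed = writes_removed / writes_before

print(f"writes before: {writes_before}")
print(f"writes after:  {writes_after}")
print(f"removed:       {writes_removed} ({fraction_removed:.1%})")
```

Under these table values, the modified software issues roughly one seventh of the original slave-register writes, while the number of reads is unchanged.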
Modification made to original dropgsw2.c.
Also identified by HwPMI were additional performance monitors to evaluate the PLB Master interface (PLB MST IPIF), which showed that the core performed only off-chip memory transactions. Moreover, these off-chip memory transactions were 118,144 read-only requests, which is a significant number of transfers and warrants the evaluation of a DMA interface. The designer could also adapt the core to leverage an alternative interface, such as the Native Port Interface (NPI), to reduce memory access latency, increase bandwidth, and reduce PLB contention.
An FSM profiler performance monitor was added that provides feedback in the form of a histogram, to identify the percentage of time each FSM state is active. Figure
Smith/Waterman core’s FSM profiler monitor results.
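The histogram produced by an FSM profiler can be derived from per-state cycle counts, as in the minimal Python sketch below. The state names and cycle counts are hypothetical placeholders, not measurements from the Smith/Waterman core, and `state_activity` is an illustrative helper rather than part of HwPMI.

```python
def state_activity(cycle_counts):
    """Convert per-state cycle counts into the percentage of time per state."""
    total = sum(cycle_counts.values())
    if total == 0:
        # No cycles observed yet: report every state as inactive.
        return {state: 0.0 for state in cycle_counts}
    return {state: 100.0 * cycles / total
            for state, cycles in cycle_counts.items()}


# Hypothetical counters, as a monitor might report one value per FSM state.
example_counts = {"IDLE": 250, "LOAD": 500, "COMPUTE": 1000, "STORE": 250}
example_activity = state_activity(example_counts)

for state, percent in example_activity.items():
    print(f"{state:<8} {percent:5.1f}%")
```

A state that dominates the histogram, such as a wait or load state, points the designer to the part of the core where time is being spent.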
Another useful feature of HwPMI is its ability to evaluate designs with different resource utilizations. Designers often find themselves adding buffers or FIFOs into designs without knowing a priori how large they should be. In these cases, HwPMI can collect run-time data as a designer modifies the FIFO depth. Sometimes these modifications can reveal interesting design tradeoffs, such as those shown in Figure
Smith/Waterman’s performance with varying FIFO depths.
The third application used in our HwPMI evaluation is Collatz. The Collatz Conjecture states that given a natural number, it is possible to reduce that number to one, by either dividing by two when even, or multiplying by three and adding one when odd [
The performance monitor data is collected with the assistance of HwPMI. In addition to the utilization and interface performance monitors, an additional monitor was added that yielded a highly beneficial result. This is an example of supplemental data collection. When a designer needs to collect additional information, HwPMI offers the ability to add custom monitors without the need to augment how the system will retrieve the data. This can be especially useful for quick one-off data that might only be useful for a short period of time. Rather than manually adding the infrastructure to the original core, only to remove it later—or retain it and waste resources—HwPMI can collect the data quickly and efficiently. Figure
Collatz core’s histogram of steps.
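For reference, the quantity collected by the step histogram can be modeled in software. The Python sketch below counts the number of Collatz steps needed to reach one and bins the results; it is a plain software model for illustration, not a description of the hardware core, and the function names and input range are assumptions.

```python
from collections import Counter


def collatz_steps(n):
    """Return the number of Collatz steps needed to reduce n to 1."""
    if n < 1:
        raise ValueError("n must be a natural number")
    steps = 0
    while n != 1:
        # Halve when even; multiply by three and add one when odd.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


def steps_histogram(limit):
    """Histogram of step counts for every starting value in 1..limit."""
    return Counter(collatz_steps(n) for n in range(1, limit + 1))


histogram = steps_histogram(16)
for steps in sorted(histogram):
    print(f"{steps:3d} steps: {histogram[steps]} starting value(s)")
```

In hardware, a histogram monitor would increment one bin per completed starting value, so the software model is mainly useful for checking the retrieved bins against known step counts.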
Another interesting monitor is the processor interrupt monitor. For latency sensitive applications with processor-to-hardware core communication, an interrupt is often used. However, configuring the interrupt or optimizing the interrupt service routines is critical. In the case of Collatz the time for the processor to respond to a single interrupt was measured as
Finally, we present the resource utilization of HwPMI. Our goal is to be minimally invasive both in terms of processing overhead and resource utilization overhead. Listed in Table
Example of HwPMI's resource utilization on V4FX60.
Component | Configuration | FFs (%) | LUTs (%) |
---|---|---|---|
Performance monitor hub | 1 port | 14 (0.03%) | 70 (0.14%) |
Performance monitor hub | 2 ports | 17 (0.03%) | 78 (0.15%) |
Performance monitor hub | 4 ports | 21 (0.04%) | 153 (0.30%) |
Performance monitor hub | 8 ports | 21 (0.04%) | 250 (0.49%) |
Performance monitor hub | 16 ports | 23 (0.05%) | 419 (0.83%) |
Timer monitor | 1 32-bit timer | 37 (0.07%) | 96 (0.19%) |
Match counter monitor | 1 64-bit counter | 67 (0.13%) | 109 (0.22%) |
Match counter monitor | 2 64-bit counters | 132 (0.26%) | 207 (0.41%) |
Match counter monitor | 16 64-bit counters | 1034 (2.05%) | 1593 (3.15%) |
FIFO monitor | 1 32-bit FIFO | 402 (0.80%) | 594 (1.17%) |
Histogram monitor | 512 Bins | 20 (0.04%) | 3207 (6.34%) |
Finite state machine monitor | 12 states | 775 (1.53%) | 1266 (2.50%) |
Finite state machine monitor | 64 states | 4116 (8.14%) | 6332 (12.52%) |
System monitor hub | 1 port (1 Hw Core) | 212 (0.42%) | 513 (1.01%) |
System monitor hub | 2 ports (2 Hw Cores) | 213 (0.42%) | 565 (1.12%) |
System monitor hub | 4 ports (4 Hw Cores) | 216 (0.43%) | 691 (1.37%) |
System monitor hub | 8 ports (8 Hw Cores) | 224 (0.44%) | 911 (1.80%) |
System monitor hub | 16 ports (16 Hw Cores) | 230 (0.45%) | 1369 (2.71%) |
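Because the designer chooses which recommended monitors to insert, a quick estimate of the combined overhead can help with that decision. The Python sketch below sums the per-component costs from the table for one hypothetical selection. The device capacity of 50,560 flip-flops and 50,560 LUTs is an assumption inferred from the percentages in the table, and simply adding per-component numbers is only an approximation, since synthesis may optimize a combined design differently.

```python
# Assumed V4FX60 capacity, inferred from the table's percentages.
DEVICE_FFS = 50560
DEVICE_LUTS = 50560

# (flip-flops, LUTs) per component configuration, copied from the table.
MONITOR_COST = {
    ("System monitor hub", "1 port"): (212, 513),
    ("Performance monitor hub", "4 ports"): (21, 153),
    ("Timer monitor", "1 32-bit timer"): (37, 96),
    ("Match counter monitor", "1 64-bit counter"): (67, 109),
    ("FIFO monitor", "1 32-bit FIFO"): (402, 594),
    ("Finite state machine monitor", "12 states"): (775, 1266),
    ("Histogram monitor", "512 Bins"): (20, 3207),
}


def overhead(selection):
    """Return the summed (flip-flops, LUTs) for the selected components."""
    ffs = sum(MONITOR_COST[item][0] for item in selection)
    luts = sum(MONITOR_COST[item][1] for item in selection)
    return ffs, luts


def fits(selection, ff_budget, lut_budget):
    """Check whether the selection stays within a designer-chosen budget."""
    ffs, luts = overhead(selection)
    return ffs <= ff_budget and luts <= lut_budget


# Hypothetical selection: one core with four monitors behind a 4-port hub.
selection = [
    ("System monitor hub", "1 port"),
    ("Performance monitor hub", "4 ports"),
    ("Timer monitor", "1 32-bit timer"),
    ("Match counter monitor", "1 64-bit counter"),
    ("FIFO monitor", "1 32-bit FIFO"),
    ("Finite state machine monitor", "12 states"),
]

total_ffs, total_luts = overhead(selection)
ff_percent = 100.0 * total_ffs / DEVICE_FFS
lut_percent = 100.0 * total_luts / DEVICE_LUTS
print(f"FFs:  {total_ffs} ({ff_percent:.2f}%)")
print(f"LUTs: {total_luts} ({lut_percent:.2f}%)")
```

The `fits` helper takes budgets chosen by the designer, for example the resources left unused by the original design, and reports whether the selection is likely to fit.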
The Hardware Performance Monitoring Infrastructure (HwPMI) presented in this work expedites the insertion of a minimally invasive performance monitoring network into existing hardware designs. The goal is to increase designer productivity by analyzing the existing design and automatically inserting monitors with the necessary infrastructure to retrieve the monitored data from the system. As a result of HwPMI, the designer can focus on the development of the hardware core rather than trying to include front-end application support to monitor performance. Toward this goal, a collection of hardware cores has been assembled, and a series of software tools has been written to parse the existing design and recommend and/or insert hardware monitors directly into the source HDL.
HwPMI also integrates with an existing sideband network to retrieve the performance monitor results in High Performance Reconfigurable Computing systems without requiring modifications to the original application. Embedded systems can leverage HwPMI through a dedicated System-on-Chip controller, which reduces run-time overhead on existing processors in the system. This work demonstrated HwPMI’s capabilities across three applications, highlighting several unique features of the infrastructure.
This work also leverages Torc to provide netlist manipulations quickly and efficiently, in place of the original HDL modifications [
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001-11-C-0041. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). DARPA Distribution Statement A. Approved for Public Release, Distribution Unlimited.