
Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly structured communication workloads from messages propagating along graph edges. We can statically expose this structure to traffic compilers and optimization tools that reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for complex runtime NoC hardware and enables lightweight, scalable NoCs. We perform load balancing, placement, fanout routing, and fine-grained synchronization to optimize our workloads for large networks of up to 2025 parallel elements for the BSP model and 25 parallel elements for Token Dataflow. This allows us to demonstrate speedups between 1.2× and 22× (3.5× mean), area reductions (number of Processing Elements) between 3× and 15× (9× mean), and dynamic energy savings between 2× and 3.5× (2.7× mean) over a range of real-world graph applications in the BSP compute model. We deliver speedups of 0.5–13× (geomean 3.6×) for Sparse Direct Matrix Solve (Token Dataflow compute model) applied to a range of sparse matrices when using a high-quality placement algorithm. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow.

Real-world communication workloads exhibit structure in the form of locality, sparsity, fanout distribution, and other properties. If this structure can be exposed to automation tools, we can reshape and optimize the workload to improve performance, lower area, and reduce energy. In this paper, we develop a traffic compiler that exploits structural properties of Bulk-Synchronous Parallel communication workloads. This compiler provides insight into performance tuning of communication-intensive parallel applications. The performance and energy improvements made possible by the compiler allow us to build the NoC from simple hardware elements that consume less area, eliminating the need for complex, area-hungry adaptive hardware. We now introduce the key structural properties exploited by our traffic compiler.

When the natural communicating components of the traffic do not match the granularity of the NoC architecture, applications may end up being poorly load balanced. We discuss

Most applications exhibit sparsity and locality; an object often interacts regularly with only a few other objects in its neighborhood. We exploit these properties by

Data updates from an object should often be seen by multiple neighbors, meaning the network must route the same message to multiple destinations. We consider

Applications that use barrier synchronization can minimize node idle time induced by global synchronization between the parallel regions of the program by using

We show the compilation flow for the NoC in Figure

development of a traffic compiler for fine-grained applications described using the BSP, Token Dataflow, and Static SIMD compute models,

use of communication workloads extracted from ConceptNet (BSP), Sparse Matrix-Vector Multiply (BSP), Bellman-Ford (BSP), and Sparse Direct Matrix Solve (Token Dataflow) running on range of real-world circuits and graphs,

quantification of cumulative benefits of each stage of the compilation flow (performance, area, energy) compared to the unoptimized case.

NoC traffic compilation flow (annotated with

We consider two compute models and associated applications: (1) Graphstep [

Parallel graph algorithms are well suited for concurrent processing on FPGAs. We describe graph algorithms in a Bulk-Synchronous Parallel (BSP) compute model [

Lightweight processing of sparse dataflow graphs can be efficiently accelerated using FPGAs. In [

Applications in the compute models we consider generate traffic with a variety of communication characteristics (e.g., locality, sparsity, multicast) that also occur in other applications and compute models. Our traffic compiler exploits the

In [

We organize our FPGA NoC as a bidirectional 2D mesh [

NoC architecture and organization.

Architecture of the NoC

Graphstep PE

Dataflow PE

Mesh Switch (DOR, Implicit PE connections)

For the BSP PE shown in Figure

For the Token Dataflow PE shown in Figure

Each switch in the bidirectional 2D mesh supports fully-pipelined operation using composable

We measure network performance as the number of clock cycles required to route all messages; speedup is then the ratio Cycles_{unoptimized}/Cycles_{optimized}.

In this section, we describe a set of optimizations performed by our traffic compiler. The compiler accepts the graph structure from the application and maps it to the NoC architecture. It suitably modifies the graph structure (replacing nodes and edges) and generates an assignment of graph nodes to the PEs of the NoC. The traffic compiler also selects the type of synchronization implemented in the PEs. It is a fully automated flow that sequences the different graph optimizations to generate an optimized mapping.
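The phase sequencing described above can be sketched as follows. This is an illustrative outline only: the function names, the trivial phase bodies (a round-robin placer standing in for the real placement algorithm, degree tagging standing in for node decomposition), and the graph encoding are assumptions, not the compiler's actual implementation.

```python
# Illustrative sketch of the traffic compiler's phase ordering.
# All names and the placeholder phase bodies are assumptions.

def decompose(graph, arity=4):
    # Placeholder: a real pass would split nodes whose degree exceeds
    # what one PE can stream; here we only tag them.
    graph["oversized"] = [n for n, deg in graph["degree"].items() if deg > arity]
    return graph

def place(graph, num_pes):
    # Placeholder: round-robin assignment standing in for a
    # locality-aware placement algorithm.
    return {n: i % num_pes for i, n in enumerate(graph["degree"])}

def compile_traffic(graph, num_pes):
    """Sequence the optimizations: decompose, place, then pick sync."""
    graph = decompose(graph)
    placement = place(graph, num_pes)
    # Local synchronization is only legal when the node update allows it.
    sync = "local" if graph.get("associative_update") else "barrier"
    return placement, sync

g = {"degree": {"a": 2, "b": 6, "c": 1}, "associative_update": True}
placement, sync = compile_traffic(g, num_pes=2)
```

The point of the sketch is the ordering: decomposition changes the graph before placement sees it, and the synchronization choice is a property of the application's update operator rather than of the mapping.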

Ideally, for a given application, as the PE count increases, each PE holds a smaller and smaller portion of the workload. For graph-oriented workloads, unusually large nodes with a large number of edges (i.e., nodes that send and receive many messages) can prevent the smooth distribution of the workload across the PEs. As a result, performance is limited by the time spent sending and receiving messages at the largest node (streamlined message processing in the PEs implies work

Application graphs.

Graph | Nodes | Edges | Max Fanin | Max Fanout

BSP Compute Model [ | ||||

ConceptNet | ||||

| 14556 | 27275 | 226 | 2538 |

| 224876 | 553837 | 16176 | 36562 |

Matrix-Multiply | ||||

| 2395 | 17319 | 124 | 124 |

| 1473 | 17857 | 27 | 30 |

| 19716 | 218308 | 18 | 18 |

| 9152 | 765944 | 255 | 255 |

| 4929 | 33185 | 27 | 28 |

| 17758 | 126150 | 574 | 574 |

| 3200 | 18880 | 6 | 6 |

| 5940 | 83842 | 30 | 20 |

Bellman-Ford | ||||

| 12752 | 36455 | 33 | 93 |

| 29347 | 97862 | 9 | 109 |

| 69429 | 222371 | 137 | 170 |

| 161570 | 529215 | 267 | 196 |

| 183484 | 588775 | 163 | 257 |

| 210613 | 617777 | 85 | 209 |

Token Dataflow Compute Model [ | ||||

753 | 985 | 3 | 6 | |

1037 | 1395 | 3 | 8 | |

2883 | 3866 | 3 | 4 | |

9814 | 13356 | 3 | 10 | |

43000 | 67265 | 3 | 10 | |

22259 | 30108 | 3 | 11 | |

40400 | 55765 | 3 | 8 | |

40400 | 55765 | 3 | 8 | |

43055 | 62067 | 3 | 11 | |

221807 | 391654 | 3 | 53 | |

70928 | 106247 | 3 | 13 | |

70666 | 103314 | 3 | 12 | |

73914 | 108888 | 3 | 14 | |

81060 | 119475 | 3 | 16 | |

90288 | 133901 | 3 | 16 | |

100637 | 151868 | 3 | 20 | |

220092 | 380930 | 3 | 54 | |

146442 | 228017 | 3 | 26 | |

212474 | 348453 | 3 | 39 | |

124720 | 178396 | 3 | 8 | |

416454 | 747587 | 3 | 172 |

In general, when the output from the graph node is a result which must be multicast to multiple outputs, we can easily build an output fanout tree to decompose output routing. However, input edges to a graph node can only be decomposed when the operation combining inputs is associative. ConceptNet and Bellman-Ford (discussed later in Section

We implement the decomposition phase by constructing an n-ary tree that replaces the high-fanin or high-fanout node. As an example, consider node 6 in Figure
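The n-ary tree construction can be sketched as below. The node-naming scheme and edge-list representation are illustrative assumptions; the idea is simply that intermediate pass-through nodes bound the fanout of every node by the tree arity.

```python
# Sketch of fanout decomposition: a node with many outgoing edges is
# replaced by an n-ary tree of intermediate nodes so that no single
# node must inject more than `arity` messages. Naming is illustrative.

def build_fanout_tree(source, sinks, arity):
    """Return the edge list of an n-ary tree routing `source` to `sinks`."""
    edges, level, fresh = [], list(sinks), 0
    # Repeatedly group the current level under fresh intermediate nodes.
    while len(level) > arity:
        next_level = []
        for i in range(0, len(level), arity):
            parent = f"{source}_t{fresh}"
            fresh += 1
            for child in level[i:i + arity]:
                edges.append((parent, child))
            next_level.append(parent)
        level = next_level
    # Finally, the source drives the root level directly.
    for child in level:
        edges.append((source, child))
    return edges

# Node "n6" with nine destinations, decomposed with arity 3.
edges = build_fanout_tree("n6", [f"d{i}" for i in range(9)], arity=3)
```

Fanin decomposition is symmetric (reverse the edge directions), but, as noted above, it is only legal when the combining operation is associative.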

Fanin

While

Clustering cost function.

Object communication typically exhibits locality. A random placement ignores this locality resulting in more traffic on the network. Consequently, random placement imposes a greater traffic requirement which can lead to poor performance, higher energy consumption, and inefficient use of network resources. We can
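A toy illustration of this effect: under dimension-ordered routing on a 2D mesh, the total message-hops of a mapping is the sum of Manhattan distances over all edges, so a placement that keeps communicating neighbors adjacent injects strictly less traffic than a scattered one. The mesh encoding and example graph below are assumptions for illustration.

```python
# Why placement matters: total traffic on a 2D mesh is the sum of
# Manhattan distances over all edges of the application graph.

def mesh_coords(pe, width):
    """Map a linear PE index to (x, y) on a mesh of the given width."""
    return pe % width, pe // width

def total_hops(edges, placement, width):
    """Sum of Manhattan distances for all edges under `placement`."""
    hops = 0
    for src, dst in edges:
        sx, sy = mesh_coords(placement[src], width)
        dx, dy = mesh_coords(placement[dst], width)
        hops += abs(sx - dx) + abs(sy - dy)
    return hops

# A 4-node chain a-b-c-d mapped onto a 2x2 mesh two different ways.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
local = {"a": 0, "b": 1, "c": 3, "d": 2}       # neighbors stay adjacent
scattered = {"a": 0, "b": 3, "c": 0, "d": 3}    # every hop crosses the mesh
```

Here the locality-aware mapping needs 3 hops in total versus 6 for the scattered one; on real workloads the gap also translates into less bisection traffic.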

Some applications may require multicast messages (i.e., single source, multiple destinations). Our application graphs contain nodes that send the exact same message to their destinations. Routing redundant messages is a waste of network resources. We can use the network more efficiently with

In parallel programs with multiple threads, synchronization between the threads is sometimes implemented with a global barrier for simplicity. However, the global barrier may artificially serialize computation. Alternately, the global barrier can be replaced with local synchronization conditions that avoid unnecessary sequentialization. Techniques for eliminating such barriers have been previously studied [
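One common local-synchronization scheme, sketched below, lets a node fire as soon as all of *its own* inputs for the current step have arrived, instead of waiting on a system-wide barrier. The counter-based mechanism is an assumption for illustration, not necessarily the exact scheme used in the PEs.

```python
# Sketch of barrier elimination via a per-node local condition:
# a node becomes ready when its own fanin is satisfied, independent
# of unrelated nodes elsewhere in the system. (Counter-based scheme,
# shown for illustration.)

class Node:
    def __init__(self, fanin):
        self.fanin = fanin
        self.arrived = 0

    def receive(self):
        """Count one input; return True when the node may fire locally."""
        self.arrived += 1
        if self.arrived == self.fanin:
            self.arrived = 0   # reset for the next step
            return True
        return False

n = Node(fanin=3)
fired = [n.receive() for _ in range(3)]  # only the final arrival fires
```

With a global barrier, the same node would idle until every message in the epoch had been delivered; locally it becomes ready as soon as its third input lands.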

We generate workloads from a range of applications mapped to the BSP compute model and the Token Dataflow model. We choose applications that cover different domains including AI, Scientific Computing, and CAD optimization that exhibit important structural properties.

ConceptNet [

Iterative Sparse Matrix-Vector Multiply (SMVM) is the dominant computational kernel in several numerical routines (e.g., Conjugate Gradient, GMRES). In each iteration, a set of dot products between the vector and matrix rows is performed to calculate new values for the vector to be used in the next iteration. We can represent this computation as a graph where nodes represent matrix rows and edges represent the communication of the new vector values. The graph captures the sparse communication structure inherent in the dot-product expression. In each iteration, messages must be sent along all edges; these edges are multicast as each vector entry must be sent to each row graph node with a nonzero coefficient associated with the vector position. We use sample matrices from the Matrix Market benchmark [
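One iteration of this graph formulation can be sketched as follows, with each node holding one matrix row and each edge carrying the vector entry for a nonzero column. The dictionary-of-dictionaries encoding is an illustrative assumption.

```python
# Sketch of one SMVM iteration in graph form: nodes are matrix rows,
# incoming edges deliver the vector entries for the row's nonzero
# columns, and each node reduces with a dot product.

def smvm_step(rows, x):
    """rows: {row: {col: coeff}}; returns the next vector as a dict."""
    y = {}
    for r, nonzeros in rows.items():
        # Each (col -> r) edge delivers x[col]; the node combines with +.
        y[r] = sum(coeff * x[col] for col, coeff in nonzeros.items())
    return y

rows = {0: {0: 2.0, 1: 1.0}, 1: {1: 3.0}}   # matrix [[2, 1], [0, 3]]
x = {0: 1.0, 1: 2.0}
y = smvm_step(rows, x)                       # [[2,1],[0,3]] applied to [1,2]
```

Note the multicast structure: x[1] is needed by both rows, so the same value travels along two edges, which is exactly what fanout routing exploits.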

The Bellman-Ford algorithm solves the single-source shortest-path problem, identifying any negative edge weight cycles, if they exist. It finds application in CAD optimizations like Retiming, Static Timing Analysis, and FPGA Routing where the graph structure is a representation of the physical circuit. Nodes represent gates in the circuit while edges represent wires between the gates. The algorithm simply relaxes all edges in each step until quiescence. A relaxation consists of computing the minimum at each node over all weighted incoming message values. Each node then communicates the result of the minimum to all its neighbors to prepare for the next relaxation. Again, we capture this computation in the BSP compute model and implement it on a Graphstep PE (Figure
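The relaxation loop described above can be sketched directly in BSP style: every step, messages flow along all edges, each node min-reduces its weighted inputs, and iteration stops at quiescence. The edge-list encoding is an illustrative choice.

```python
# Sketch of BSP-style Bellman-Ford: each step relaxes all edges
# (messages along every edge, a min-reduction at every node) and the
# computation stops when no distance changes.

def bellman_ford_bsp(edges, source, num_nodes):
    """edges: list of (u, v, w). Returns shortest distances from `source`."""
    INF = float("inf")
    dist = [INF] * num_nodes
    dist[source] = 0.0
    changed, steps = True, 0
    while changed and steps <= num_nodes:   # bound guards negative cycles
        changed = False
        new = dist[:]                       # double-buffer one BSP epoch
        for u, v, w in edges:               # messages along all edges
            if dist[u] + w < new[v]:        # min-reduce at node v
                new[v] = dist[u] + w
                changed = True
        dist = new
        steps += 1
    return dist

d = bellman_ford_bsp([(0, 1, 5.0), (1, 2, 2.0), (0, 2, 9.0)], 0, 3)
```

The double-buffered `new` vector mirrors the epoch semantics of the BSP model: all messages of one step are consumed before any result of that step is visible.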

Matrix Solve computation on sparse matrices is a key repetitive component of many applications like the SPICE circuit simulator. For SPICE, we prefer sparse direct solver techniques over SMVM-based ones (see Section

All our experiments use a single-lane, bidirectional-mesh topology that implements a Dimension-Ordered Routing function. The network for Matrix-Vector Multiply and Sparse Direct Matrix Solve experiments is 84-bit wide (64-bit double-precision data, 20-bit header/address) while the network for ConceptNet and Bellman-Ford experiments is 52-bit wide (32-bit integer data, 20-bit header/address). The switch is internally pipelined to accept a new packet on each cycle (see Figure

NoC timing model.

Mesh Switch | Latency |
---|---|

2 | |

4 | |

6 | |

2 | |

5 | |

Processing Element | Latency |

1 | |

1 | |

9 | |

8 | |

10 | |

57 |

NoC dynamic power model.

Datawidth (Application) | Block | Dynamic power (mW) at different activity factors | | | |

| | 0% | 25% | 50% | 75% | 100%

52 (ConceptNet, Bellman-Ford) | Split | 0.26 | 1.07 | 1.45 | 1.65 | 1.84 |

Merge | 0.72 | 1.58 | 2.1 | 2.49 | 2.82 | |

84 (Matrix-Vector Multiply, Sparse Matrix Solve) | Split | 0.32 | 1.35 | 1.78 | 2.02 | 2.26 |

Merge | 0.9 | 1.87 | 2.45 | 2.88 | 3.25 |

We use a Java-based cycle-accurate simulator that implements the timing model described in Section

We now examine the impact of the different optimizations on various workloads to quantify the cumulative benefit of our traffic compiler. Our performance baseline is an unoptimized, unprocessed, barrier-synchronized graph workload which is randomly distributed across the NoC PEs. We order the optimizations appropriately to analyze their additive impact. We first show relative scaling trends for the total routing time for the

Ideally, as PE counts increase, application performance scales accordingly (

Explaining performance scaling of total cycles for

This metric measures the number of cycles spent injecting or receiving messages at the NoC-PE interface. We measure this as follows:

In Equation (

We expect that, for ideal scaling, the number of serialization cycles decreases with increasing PE count, since we distribute both the computation and the communication over more PEs. However, communication from very large graph nodes (i.e., nodes with many edges) will cause a serial bottleneck at the PE-NoC interface. In Section
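A hedged sketch of the serialization metric follows. The exact equation from the text is not reproduced here; the natural form, assuming each PE can inject or receive one message per cycle, is a lower bound set by the busiest PE-NoC interface.

```python
# Sketch of the serialization-cycles metric: assuming one message per
# cycle at each PE-NoC interface, an epoch cannot finish faster than
# the busiest PE can send and receive its messages. (Assumed form of
# the metric, shown for illustration.)

def serialization_cycles(edges, placement):
    """Lower bound: max over PEs of messages sent plus received."""
    sent, recv = {}, {}
    for src, dst in edges:
        if placement[src] != placement[dst]:   # on-PE edges need no NoC
            sent[placement[src]] = sent.get(placement[src], 0) + 1
            recv[placement[dst]] = recv.get(placement[dst], 0) + 1
    pes = set(sent) | set(recv)
    return max((sent.get(p, 0) + recv.get(p, 0) for p in pes), default=0)

edges = [("a", "b"), ("a", "c"), ("b", "c")]
placement = {"a": 0, "b": 1, "c": 1}
cycles = serialization_cycles(edges, placement)
```

The sketch also shows why decomposition helps: splitting node "a" across PEs would spread its injected messages over several interfaces.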

This metric measures the number of cycles required for messages to cross the chip bisection. If the volume of NoC traffic crossing the chip bisection is larger than the number of physical wires (NoC channels in the bisection × Channel Width), then the bisection must be reused

The top-level bisection may not be the largest bottleneck in the network. Hence, we consider several hierarchical cuts (horizontal and vertical cuts for a mesh topology) and identify the most limiting cut (

For applications with high locality, the amount of traffic crossing the bisection is low (when placed properly) and bandwidth does not become a bottleneck. Conversely, for applications with low locality, more messages must cross the bisection and bisection bandwidth can become a bottleneck. In Section
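The bisection bound can be sketched as below: when more messages must cross a cut than there are channels across it, those channels are reused over multiple cycles. The cut description and channel count in the example are illustrative assumptions.

```python
# Sketch of the bisection-cycles bound: messages crossing a cut,
# divided by the channels available across that cut, rounded up.
# The example cut and channel count are illustrative.
from math import ceil

def bisection_cycles(edges, placement, left_pes, channels):
    """ceil(messages crossing the cut / channels across the cut)."""
    crossing = sum(
        1 for src, dst in edges
        if (placement[src] in left_pes) != (placement[dst] in left_pes)
    )
    return ceil(crossing / channels)

edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
placement = {"a": 0, "b": 1, "c": 2, "d": 3}
# Cut separates PEs {0, 1} from {2, 3} with one channel across it.
cycles = bisection_cycles(edges, placement, left_pes={0, 1}, channels=1)
```

Evaluating this bound over all hierarchical cuts and taking the maximum identifies the most limiting cut described above.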

This metric measures the sum of switch latencies and wire latencies along the worst-case message path in the NoC assuming no congestion

For barrier-synchronized workloads, all data is routed from sources to sinks in an epoch. At small PE counts, the number of cycles required to cross the network will be small compared to serialization or bisection cycles. However, as the PE counts increase, the latency in the network will also increase and eventually dominate both serialization and bisection. In the high latency regime, latency hiding techniques like

In Figure

In Figure

In Figure

In Figure

As we can see in Figure

Impact of

We show performance scaling with increasing PEs for the Bellman-Ford

In Figure

We look at cumulative speedup contributions and relative scaling trends of all optimizations for all workloads at 25 PEs, 256 PEs, and 2025 PEs. The relative impact and importance of these optimizations shift as a function of system size. In some cases, a particular optimization is irrelevant at a particular PE count point in the NoC design space; for example, fanout routing is most useful at small system sizes and placement is important at larger system sizes.

At 25 PEs, Figure

Performance ratio at 25 PEs.

At 256 PEs, Figure

Performance ratio at 256 PEs.

At 2025 PEs, Figure

Performance ratio at 2025 PEs.

Overall, we find ConceptNet workloads show impressive speedups up to 22×. These workloads have decomposable nodes that allow better load balancing and have high locality. They also have the greatest need for

For some low-cost applications (e.g., embedded), it is important to minimize NoC implementation area and energy. The optimizations we discuss are equally relevant when cost is the dominant design criteria.

To compute the area savings, we compare the smallest PE count at which the unoptimized workload achieves its best performance (PE_{unopt}) against the smallest PE count at which the optimized workload matches that performance (PE_{opt}), as illustrated in Figure

How we compute area savings.

Area ratio to baseline.

To compute energy savings, we use the switching activity factor and network cycles to derive dynamic energy reduction in the network. Switching activity factor is extracted from the number of packets traversing the
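A hedged sketch of this estimate: activity is derived from packets per cycle per switch block, a per-block power is looked up against an activity table (cf. the power table above, from which the 52-bit Split row is reused), and energy is power multiplied by routing time. The linear interpolation between table points and the clock period are assumptions for illustration.

```python
# Sketch of the dynamic-energy estimate. The interpolation scheme and
# the 2.5 ns clock period are assumptions; the activity-to-power table
# reuses the 52-bit Split figures from the power table above.

def block_power_mw(activity, table):
    """Linearly interpolate per-block power (mW) from an activity table."""
    pts = sorted(table.items())
    for (a0, p0), (a1, p1) in zip(pts, pts[1:]):
        if a0 <= activity <= a1:
            return p0 + (p1 - p0) * (activity - a0) / (a1 - a0)
    return pts[-1][1]

split_52 = {0.0: 0.26, 0.25: 1.07, 0.5: 1.45, 0.75: 1.65, 1.0: 1.84}

def dynamic_energy_nj(packets, cycles, num_blocks, table, period_ns=2.5):
    """Energy (nJ) = per-block power x block count x total routing time."""
    activity = packets / (cycles * num_blocks)      # packets/cycle/block
    power_mw = block_power_mw(activity, table) * num_blocks
    return power_mw * cycles * period_ns * 1e-3     # mW x ns -> nJ
```

Because both the activity factor and the cycle count fall after optimization, the energy savings compound: fewer packets lower the per-block power, and fewer cycles shorten the time that power is drawn.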

Dynamic energy savings at 25 PEs.

Large, on-chip networks that support highly-parallel, fine-grained applications will be required to handle heavy message traffic. Load balancing, communication bandwidth, IO serialization, and synchronization costs will play a key role in determining the performance and scalability of such systems. We develop a traffic compiler for sparse graph-oriented workloads to automatically optimize network traffic and minimize these costs. We demonstrate the effectiveness of our traffic compiler over a range of real-world workloads with performance improvements between 1.2× and 22× (3.5× mean), PE count reductions between 3× and 15× (9× mean), and dynamic energy savings between 2× and 3.5× (2.7× mean) for the BSP workloads. We also show speedups of 0.5–13× (geomean 3.6×) for Sparse Matrix Solve (Token Dataflow) workloads when performing a high-quality placement of the dataflow graphs. For large workloads like