The Cost of Adaptivity and Virtual Lanes in a Wormhole Router

We examine the cost in router complexity of adaptivity and virtual lanes in wormhole routers, using f-flat adaptive routers (based on a generalization of planar-adaptive routing), a family that includes routers with a range of routing freedom. Our studies show that adaptivity is expensive because it requires additional virtual channels and much larger crossbar switches for both adaptivity and deadlock prevention. Increases of 50 to 100% in channel utilization are required to justify additional degrees of routing freedom. Three internal router architectures for virtual lanes are examined, and the fully expanded crossbar is found to be most effective because it gives the simplest control and minimal internal blocking. Examining router designs with 1 to 16 virtual lanes indicates that 30% improvements in channel utilization are required to justify each additional virtual lane. These studies, combined with published simulation results, indicate that only modest numbers of virtual lanes are likely to be cost effective.


INTRODUCTION
In concurrent computers, interconnection networks are used by the processing nodes to exchange data and synchronize with each other. Network performance is often critical, as the performance of large-scale parallel machines is sensitive to network latency and throughput. While multicomputers have been touted as scalable parallel architectures, their scalability is limited by the performance of their interconnection networks.
An interconnection network is defined by its topology, routing, and flow control. The topology is the pattern of network node interconnection via physical communication channels. The routing algorithm specifies how packets choose paths through the network. Flow control deals with the allocation of channel and buffer resources to packets as they proceed through the network. This paper focuses on evaluating the cost of a variety of router features involving routing and flow control.
Deterministic, dimension-order routers are used in a variety of multicomputers. Such routers use only one path through the network, deterministically routing each packet from source to destination. Deterministic routers are attractive because they are exceedingly simple and provide low latency and high bandwidth. However, deterministic routers have significant disadvantages: poor performance under non-uniform traffic loads and poor fault tolerance [16,32,25]. Adaptive routing is a promising approach to alleviate these problems, but adaptive routers can be more complex, and this complexity leads to legitimate concerns about their speed and tangible benefits. Adaptive routers allow many paths between source and destination to be used, generally choosing among them based on network load conditions. Recently, a number of dramatically simpler adaptive routing algorithms have been proposed [32,9,5]. This breakthrough makes adaptive routing feasible, but not without cost. Deciding whether or not to incorporate adaptive routing is still a complex cost-performance tradeoff with the cost side of the equation still largely undefined.
Virtual lanes provide multiple lanes for messages along any particular path in a routing algorithm [14,8]. As with adaptive routing, virtual lanes can improve router performance by increasing channel utilization. However, they also increase router complexity, slowing implementations.
In this paper, we examine the cost of adaptivity and virtual lanes in one family of deterministic and adaptive routers based on a series of gate-level router designs. While study of a wider range of adaptive routing algorithms is of interest, it is beyond the scope of the paper. This paper makes two significant contributions. First, it gives a detailed description of an adaptive wormhole router, characterizing the functionality and speed of each module and providing a basis for estimating router speeds. Second, it examines the speed of a baseline deterministic router and a family of enhanced routers with increased routing freedom and numbers of virtual lanes. This not only allows the speed of adaptive routers to be compared to existing deterministic router designs, it also provides a basis for assessing the cost of adaptivity and virtual lanes, admitting a cost-performance tradeoff.
To assess the cost of adaptivity, we examine a series of router designs with a range of adaptivity. In this context, we define adaptivity as the maximum number of routing choices at an intermediate routing node. We examine routers with one to eight degrees of routing freedom. Our router designs show that the cost of a few degrees of routing freedom can be modest. However, higher degrees of adaptivity incur much greater costs; 50% or greater increases in channel utilization are required to justify each additional degree of routing freedom beyond two.
To assess the cost of virtual lanes, we examine router designs with one to sixteen virtual lanes. Several router architectures have been proposed for virtual lanes, so we first examine each of these and then select the most attractive, a fully expanded crossbar. Our design studies of this architecture show that while virtual lanes are expensive, they are less expensive than increased adaptivity. Each additional virtual lane requires an increase in channel utilization of 30% to be cost effective. The majority of the increased cost is in larger crossbars and much larger virtual channel controllers. Given published studies of the benefits of virtual lanes, a few virtual lanes may give enough of a throughput increase to justify this cost, but large numbers of virtual lanes are unacceptably expensive.
Overview. The remainder of the paper is organized as follows. Section 2 describes the context for this work, defining wormhole routing and planar-adaptive routing, and describing previous router implementation studies. Section 3 describes our base router design, a planar-adaptive router. Section 4 presents cost-performance metrics for router designs and applies them to the base router. With a baseline established, Sections 5 and 6 consider the cost of adaptive routing and virtual lanes. The overall performance results are summarized in Section 7. Finally, Section 8 summarizes the paper and discusses several possible directions for future research.

BACKGROUND
Communication performance depends critically on a network's topology, flow control, and routing algorithm. We focus on k-ary n-cubes, direct networks with radix k and dimension n [13]. By varying the choice of k and n, this family of networks represents a wide range of choices in density of interconnection. We also focus only on routers that use wormhole routing, a low cost approach to flow control that allows small, simple routers. Wormhole routers for k-ary n-cubes have been used in a variety of commercial and research machines [22,31,15,2,1,27,3]. In recent years, an increasing number of interconnection networks have made use of wormhole routing, a fine-grained flow control technique which requires only small amounts of hardware [18]. Consequently, wormhole routers are small, cheap, and fast. The basic idea behind wormhole routing is to begin forming the path from source to destination, sending the data right behind the message header. If the message is blocked, all of the data flits (flow control units) are stopped in place in the network. Because the flits are stopped in place, wormhole routers require only a modest amount of storage and can support messages of arbitrary size. However, stopping the flits in place requires holding the channels along the path, giving rise to a plethora of possible deadlocks and conflicts. Addressing these issues effectively has been the subject of much research [12,29,19,9,5,32].
Routing approaches can be divided into two categories: deterministic and adaptive routing. In deterministic routers, each message is routed along a fixed path, determined by the source and destination of the message. Deterministic routing's main advantage, hardware simplicity, is directly tied to its primary disadvantage, a lack of routing flexibility which limits network performance and fault tolerance. Any particular fixed choice of routes will produce poor performance for some communication patterns.
Adaptive routing can alleviate such problems by mapping communications to paths flexibly, based on network loading. The flexibility in routing improves performance on non-uniform workloads [25] and can provide a measure of fault tolerance [9,23]. The major disadvantage of adaptive routing is the greater complexity required to support the additional routing flexibility while assuring deadlock-freedom. This increase in hardware complexity can significantly reduce router speed, decreasing total network performance. To reduce the cost of adaptive routing, many approaches based on limited adaptivity have been proposed.
Virtual lanes can also be added to deterministic or adaptive routers. The idea behind virtual lanes is to increase the utilization of physical channels by decoupling two types of network resources: buffers and physical channels [14]. If a message holding a buffer is blocked, the message releases the physical channel so that other messages can use it. Adaptive routing and virtual lanes complement each other: adaptive routing attempts to distribute traffic uniformly over the physical channels and virtual lanes attempt to maximize the utilization of each physical channel.
In this paper, we focus on one family of adaptive routing algorithms which encompass a range of adaptivity and hardware complexity. This family allows routing freedom to be traded off against router speed while assuring deadlock-freedom. The simplest adaptive router in this family, a planar-adaptive router, is described below. In this context, we evaluate the cost of adaptivity and virtual lanes.

Planar-adaptive Routing
Planar-Adaptive Routing (PAR) is a limited adaptivity routing algorithm. PAR has many implementation advantages, most notably hardware simplicity. PAR uses only three virtual channels for deadlock prevention and small crossbar switches regardless of the number of dimensions [9,24,4].
The idea in planar-adaptive routing is to provide limited adaptivity by routing adaptively in a series of two-dimensional planes. As the packet progresses towards its destination, it passes through a series of adaptive planes and eventually, the packet completes routing in all dimensions and is delivered to the destination. By limiting adaptivity to two dimensions and structuring the passage from one adaptive plane to the next, network cost can be reduced while maintaining deadlock-freedom. Within each network, traffic is routed adaptively towards its destination in any of the productive channels. When the d_i address is correct, routing is complete in plane A_i, and the packet proceeds to the next high-level step.
[Footnote 1: The order of dimensions is arbitrary.]
In the high-level routing, the basic idea is to route successively in the adaptive planes. Routing in adaptive plane A_i reduces the distance in d_i to zero. After routing in all of the adaptive planes, the packet has reached its destination. For the final dimension, there cannot be any adaptivity left for a minimal router, so the packet is routed directly to its destination. In the low-level routing, the scheme is adaptive, as multiple paths can be chosen within each adaptive plane.
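The two routing levels described above can be illustrated with a short sketch. It assumes that adaptive plane A_i spans dimensions d_i and d_(i+1), as in the planar-adaptive routing literature [9]. The function name productive_dims and the representation of a packet by its per-dimension offsets are hypothetical and are not part of the router design.

```python
def productive_dims(offsets):
    """Return the dimensions in which a packet may take a productive hop.

    offsets[i] is the remaining signed distance in dimension d_i.
    Assumption: adaptive plane A_i spans dimensions d_i and d_(i+1).
    """
    n = len(offsets)
    # High-level step: the current plane starts at the first dimension
    # whose address is not yet correct.
    for i in range(n):
        if offsets[i] != 0:
            break
    else:
        return []  # all offsets are zero: the packet has arrived
    if i == n - 1:
        return [i]  # final dimension: no adaptivity left for a minimal router
    # Low-level step: any productive channel within plane A_i may be used.
    choices = [i]
    if offsets[i + 1] != 0:
        choices.append(i + 1)
    return choices


print(productive_dims([2, -1, 3]))  # plane A_0: d_0 or d_1 may be taken
print(productive_dims([0, 0, 3]))   # only the final dimension remains
```

A packet leaves plane A_i as soon as its d_i offset reaches zero, so the number of choices returned never exceeds two, which is the routing freedom of a planar-adaptive router.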

F-flat Adaptive Routing
The planar-adaptive routing algorithm can be generalized to support higher degrees of adaptivity. The basic idea is to increase the degree of routing freedom at each low-level routing step (adaptive planes to adaptive cubes, etc.), producing the class of f-flat routers. F-flat adaptive routers are deadlock-free and allow a range of routing freedom choices. An f-flat adaptive router allows routing in the f-flat subspace of the n-dimensional space, giving f degrees of adaptivity. Thus, a planar-adaptive router is a 2-flat adaptive router. The composition of adaptive spaces is handled in an analogous fashion to planar-adaptive routing. Any deadlock-free adaptive routing algorithm can be used within the f-flat. Increasing the routing freedom by increasing f can improve channel utilization, but each such increase requires additional hardware, incurring increases in router latency and router clock periods. The dimension-order router is a 1-flat adaptive router.

Related Work
In this section, we survey the related router design studies. All of the work surveyed here involves deterministic routers. While differences in router functionality and implementation technology make direct comparisons difficult, the comparison shows that our router design is competitive. Recently, others have studied the complexity of adaptive routers [6], but these routers use virtual cut-through and misrouting. While their results are consonant with ours, the different context makes the results somewhat incomparable.
Caltech Routing Chips. The Torus routing chip (TRC) is a dimension-order router for k-ary n-cube networks [18] which implemented wormhole routing and used virtual channels to prevent deadlock. Channel throughput was 8 MB/s with byte-wide self-timed communication channels (8 MHz). This was about an order of magnitude better than contemporary communication networks used by the Caltech Cosmic Cube or Intel iPSC. The router setup latency was 150 ns per routing step. The mesh routing chip (MRC) is the second generation Caltech routing chip [21]. The MRC supports mesh networks with dimension-order routing. The MRC is a self-timed circuit which is based on 3 × 3 crossbar switches (one port each for the positive direction, the negative direction, and the processor element) in each dimension. Router latency is approximately 50 ns and channel data rates are as high as 90 MB/s, using byte-wide links. Derivatives of MRCs are used in several research machines [1,28,34].
The J-Machine Router. The J-Machine is a fine-grained concurrent computer developed at MIT [17,33]. The J-Machine network is a three-dimensional mesh, with bidirectional 9-bit channels and dimension-order, wormhole routing. The J-Machine network uses two virtual channels to support two logically independent message priorities and a globally synchronous clock. The data throughput is 36 MB/s (32 MHz). The routing latency is 62.5 ns per hop.
Recent Router Designs. The latest Caltech EMRC routing chips are also dimension-order, wormhole routers [35]. These chips are self-timed and use byte-wide channels to achieve 66 MB/s. The typical path formation latency for the head of a packet is approximately 30 ns. The Intel Paragon router is descended from the original Caltech MRCs [22,20].
The Paragon router is a deterministic router comparable to our designs: it is implemented in a similar technology (0.8 micron CMOS gate array) and gives similar performance. Published figures for its delay and channel bandwidth are 40 ns and 200 MB/s, respectively.

BASE ROUTER DESIGN
In this section, we describe the basic router design which is used as a baseline for our cost studies. We describe the architecture of the router, describing the functionality of each module in detail. The goal is to provide insight as to how and why router changes affect performance. In the following sections, the cost of each additional router feature is calculated by comparing the performance of the enhanced router to the baseline design. Our baseline router is a planar-adaptive router (PAR) as described in Section 2.1.
A planar-adaptive router consists of a series of composable modules, one for each adaptive plane.
The external interface of one such plane consists of four bidirectional links and two ports which are typically connected to the local processor (see Figure 2). These two ports can also be used to compose routers for a higher dimensional network. The bidirectional connections to neighbors, denoted L1, L2, L3, and L4, are each implemented with dual unidirectional channels, and a similar interface is used for the processor ports. PAR requires two virtual channels in the y-dimension to support deadlock-free adaptive routing (labeled iyp and iyn for increasing x and dyp and dyn for decreasing x, in the positive and negative y directions respectively).
Design and Technology Assumptions. Because pin limitations are a concern affecting router throughput, our designs use data channels with 16 bits of data and 7 additional control signals. This produces a router with well below 250 data pins, well within the range feasible for a pin grid array or more advanced packaging technology. Our designs are based on a 0.8 micron CMOS gate array library from Mitsubishi Electric Corporation. All timing estimates are based on conservative routing estimates, nominal processing, and nominal operating temperature.

Overall Design
A complete block diagram showing router internals can be found in Figure 3. Packets flow from left to right. Incoming links are attached on the left and outgoing links on the right. For each link there are data lines and control lines in the forward direction (left to right, shown with thick lines) and flow control signals in the reverse direction (right to left, shown with thin lines). The upper crossbar and routing decision logic support the increasing x subnetwork and the lower crossbar and routing decision logic the decreasing x subnetwork (see Figure 1). The basic function of each block is described below.
A complete description of the router can be found in Aoyama's thesis [4].
External Flow Controller (XFC) supports asynchronous internode communication by synchronizing inputs to the local node clock.
Address Decoder (AD) decodes the packet header, generating requests for permissible outputs.
Internal Flow Controller (IFC) controls data flow across the switch and updates packet headers (relative addressing) based on the routing decision signals from the RD. Header update is overlapped with address decoding in the AD to decrease router latency.
Routing Decision block (RD) receives request signals from all of the AD's and arbitrates amongst these signals, generating routing decisions which control the IFC's and the crossbar switch.
Crossbar Switch (CB) connects input channels to output channels.
Virtual Channel controller (VC) multiplexes virtual channels onto the physical channels. Because the router interfaces are asynchronous, the VC also synchronizes the flow control signal. The VC and the IFC manage intranode data flow cooperatively to minimize internal delays.
[Footnote 2: A few more control signals are needed for each additional virtual channel.]
The router is internally synchronous, but externally asynchronous. Internal synchrony makes internal coordination, particularly fair arbitration and selection for virtual channels, inexpensive. An asynchronous interface between routers is a consequence of the difficulty of distributing a high speed, low skew clock to a large system (see footnote 3). We assume that all routers operate with a clock of identical frequency but with differing and even slightly variable skew. Differences in clock phase between nodes are handled by synchronizers.
[Footnote 3: An alternative would be to use phase-locked loops, which does not change the synchronization cost. It simply fixes the synchronization penalty to an integral number of clock periods between routers.]
The critical internal router operations which affect router setup latency and achievable clock rate in the network router are path setup and data through, respectively. Path setup is the delay incurred by each packet as it is forming a path to the destination. Data through is the delay incurred by each flit as it moves through the router. A path setup operation involves the following steps: the AD generates requests based on the header, the RD assigns a path and sets up the switch, and data flows through the switch to the output (VC if appropriate). The data through operation determines the achievable channel bandwidth because it defines the rate at which flits can move through the router. A data through operation consists of the following steps: the IFC sends data forward, the data moves through the CB, and the data is accepted for transmission at the VC.
To ensure correct operation, our planar-adaptive wormhole router must manage the following tasks: flow control, routing, and virtual channel multiplexing. In the following sections, we discuss how each of these tasks is accomplished.

Flow Control
The major benefit of wormhole routing is that routers can have extremely low buffer requirements. However, one consequence of this is that flow control is performed on small units of data, and must be performed extremely rapidly to prevent buffer overflow.
A competing goal is to maximize the channel bandwidth usable by a single packet. To achieve both of these goals, our design fully pipelines flow control operations, allowing a single packet to use the full bandwidth of a physical channel.
Between routers, both the data and flow control signals are asynchronous, so the synchronization time increases the effective delay in both directions. The XFC synchronizes the incoming data, and the VC synchronizes backward flow control signals using a synchronizer based on a Muller C element [30] to sample the input signal.
Pipelined flow control, synchronization penalties, channel delay, and clock skew all increase the buffer requirements of wormhole routing. If the channel delay is one clock cycle, synchronous pipelined flow control with unit delay channels requires two flit buffers, and the synchronization delay increases the buffer requirement to four flits. Thus, the minimum configuration for unit delay channels is four flit buffers per channel. Adding one more flit buffer dramatically simplifies buffer control, so our designs all include five flit buffers.

Routing Decision
In an adaptive router, routing decisions are based on the packet destination address and current router state. If several messages arrive simultaneously, all of those packets will have their paths set up in a single cycle. For each channel, the AD decodes the message header and generates requests for the permissible paths; this is trivial for a planar-adaptive router. The RD arbitrates between simultaneous requests and enforces resource constraints (no more than one packet connected to each output). Our RD design uses the straight-first selection policy [24] when there is no contention, giving the packets going straight priority over those turning. Fair arbitration is used to prevent starvation when a packet has already been forced to wait.

Multiplexing Physical Channels
Virtual channels are implemented by multiplexing the physical channels. The VC, virtual channel controller, manages this multiplexing, preventing starvation for any virtual channel and attempting to utilize the physical channel efficiently. However, to achieve these goals, a VC must coordinate the movement of data from the router inputs, through the crossbar, as well as the scheduling of virtual channels onto the physical channel. Figure 4 shows three virtual channels sharing a single physical channel.
At the left side there are three IFC's (Internal Flow Controllers) and at the right side there is a single VC. Between them is the crossbar for this subnetwork.
The VC and IFCs cooperate to move data through the crossbar, attempting to keep the physical channel busy while fairly allocating resources to all virtual channels. When the channel is about to become idle, the VC requests data from all virtual channels which have empty downstream buffers (based on the E1-3 signals). Each such virtual channel sends data across the crossbar, where it is buffered and then sequenced over the physical channel. To achieve this collection and sequencing without losing cycles, the VC needs IFC buffer status and flow control information: ready signals from the IFCs indicate the upstream buffer states, and empty signals indicate the downstream buffer states, allowing the VC to schedule only those virtual channels which can use the physical channel.
Once the data flits have reached the VC, they are sequenced across the physical channel. In Figure 5, the VC inputs come from the left, and the physical channel is on the right. Our VC uses two levels of arbitration: one in the collection phase and one in the delivery phase. First, during the collection phase (when data is accepted from all of the virtual channels into the staging buffers), arbitration decides which virtual channel sends first. Second, during the delivery phase, arbitration sequences the data from the staging buffers (losers in the first round arbitration) over the channel. Using fixed priority arbitration in both cases simplifies the VC. Starvation is prevented by assuring that all collected flits are transmitted before the next collection phase.
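The collection and delivery phases can be sketched in a few lines of Python. This is a behavioral model under simplifying assumptions, not the gate-level VC: flits are strings, fixed priority is taken to be the virtual channel index, every downstream buffer is assumed empty, and the name vc_schedule is hypothetical.

```python
from collections import deque


def vc_schedule(queues):
    """Return the order in which flits cross the physical channel.

    queues[v] holds the flits waiting on virtual channel v.
    Assumptions: a lower index means higher fixed priority, and every
    downstream buffer is empty, so any non-empty channel may send.
    """
    sent = []
    while any(queues):
        # Collection phase: accept one flit from every channel with data.
        collected = [(v, q.popleft()) for v, q in enumerate(queues) if q]
        # Delivery phase: transmit all collected flits, in fixed priority
        # order, before the next collection phase begins.
        sent.extend(collected)
    return sent


queues = [deque(["a0", "a1"]), deque(["b0"]), deque(["c0", "c1", "c2"])]
for vc, flit in vc_schedule(queues):
    print(vc, flit)
```

In this model the lowest priority channel is still served once per collection round, which is why fixed priority arbitration does not lead to starvation.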

Switching
The router must switch packets, conveying data from the appropriate inputs to outputs. This is done by the two crossbars (CB), which form not only the forward paths for data but also the reverse paths for flow control signals.

PERFORMANCE OF THE BASE ROUTER
To evaluate the performance impact of adaptive routing on multiprocessor routing networks, we first define two router performance metrics. These metrics are affected by topology, routing algorithm, and implementation technology. We first focus on the internal router issues where speed is determined largely by routing algorithm and technology, producing a characterization of the internal router delay for a variety of router designs. These estimates quantify the performance impact of including advanced router features such as adaptivity and virtual lanes. Subsequently, we consider the system level issues (clocking scheme, clock synchronization, and channel delay) and how to relate them to the internal delay measures for our base router design. Router setup latency includes both internode and intranode delay, each of which can be broken down into component delays. In this section, we estimate both contributions to router setup latency. The achievable clock rate depends on the clocking scheme and the data through delay, which characterizes the basic rate at which flits can move internally.

Cost and Performance Metrics
Router performance can be characterized by several metrics: channel utilization, router setup latency, and achievable clock rate. These metrics are defined below.

Performance Analysis
Internode Delay. Internode delay contributes to router setup latency. Internode delay includes the time to get off chip, across the wires, and onto the destination chip (buffer, propagation, input latch, synchronizer and synchronization delays). For standard output buffers and input latches, the nominal performance of the gate array library gives the delays shown in Figure 6. The output buffer delay includes line charging time, characterized by the loading. Our analysis makes no attempt to account for long channel delays. The synchronizer delay is due to gate delay in the synchronizer. This is in addition to the synchronization delay, which depends on the clock skew.
Based on these numbers, we can estimate the best case and worst case skew. Based on the fixed components of internode delay, if skew is less than T - 4.9 ns along the forward path (where T is one half a clock period and is greater than 4.9 ns), this is the best case, and the channel crossing takes only one cycle. If the skew is greater than T - 4.9 ns but less than 2T - 4.9 ns, the crossing will take two cycles. If T is less than 4.9 ns, the channel crossing may take two or more cycles. Even for clock rates of several hundred megahertz, 2T - 4.9 ns is an achievable skew for a large scale system.
Intranode Delay. There are two important types of intranode delay: path setup and data through delay, which contribute to router setup latency and achievable clock rate respectively. While the internode delay depends primarily on topology and packaging, the intranode delay depends strongly on router features. The data through delay determines the flow control rate, thereby affecting the maximum achievable clock rate. In this section, we characterize the intranode delay for the base router, using these delays as a point of reference for the remainder of the paper. Figure 7 (a) and (b) show the critical path and timing of a planar-adaptive router at path setup. The critical path starts at the entry to the AD, passes through the RD, through the IFC, CB, and finally the VC. Figure 7 (b) breaks the overall delay down into constituent module delays. The majority of the path setup delay is in the AD, which latches the header from the XFC then generates route requests. Most of the delay in the AD is due to the data latch, L. The RD arbitrates the request signals, generates crossbar control signals, and tells the IFC which path was chosen. With knowledge of which path will be taken, the IFC selects the appropriate new header (all possible updated headers are waiting). Simultaneously, the CB is set up, and a data ready signal from the IFC passes through the CB, arriving at the VC. The crossbar setup and data ready signal operations are not on the critical path; their delay is masked by the larger header selection time. The updated header flit passes through the CB, arriving at the VC where it is latched at ph1.
Figure 8 shows the critical path and components of delay of a planar-adaptive router at data through (see footnote 4). The data from the XFC is latched inside the IFC (in L), and then sent through the CB to the VC. In the VC, the data must wait for the arbitration amongst the virtual channels, even if no others are trying to send at this time. The arbiter's output controls selector S. After passing through the selector, the data is latched in L in the VC by ph1. It can now be transmitted to the next node.
[Figure 8: The critical path and delay at data through in the base router. As before, L denotes latches and CL1-3 denote combinational logic blocks.]

Discussion
Determining the router setup latency is fairly straightforward, but determining achievable clock rate based on data through delay involves consideration of internode and intranode delays as well as clock skew margins. In our design, a flit crosses the channel and the router in a single clock period (channel and synchronization delay is a half period from ph1 to ph2, and router delay is the other half period from ph2 to ph1). If we assume a two-phase clock with equal length phases, then whichever delay is larger, the internode delay or intranode delay, determines the network clock rate. A more thorough description of the assumptions for determining achievable clock rate is given in Section 7.
For our baseline router, the data through delay of 5.7 ns dominates likely internode delays and skew margins, and thus a clock period of 2 × 5.7 = 11.4 ns is achievable. Such a clock period would allow a generous clock skew margin of 0.8 ns. Channel delays of two cycles would allow an easily achievable skew margin of 6.5 ns. While our adaptive router can sustain high speed and has low latency, it is slower than a comparable deterministic router. Based on the same assumptions, a dimension-order router (DOR) design has delays of 5.68 ns and 3.0 ns at path setup and data through, respectively. The major reasons the deterministic router is faster are the lack of serialization between routing decision and header selection (routing choices and header updates are fixed) and of virtual channel controllers. In terms of intranode speed, the DOR is nearly twice as fast as our baseline adaptive router. However, since intranode delay at data through in the DOR (3.0 ns) is less than internode delay (4.9 ns), the router clock rate can be dominated by internode delay, limiting the achievable clock period for the DOR to 9.8 ns, only 16% faster than the planar-adaptive router.
[Footnote 4: In our simplified reporting of performance figures, the VC appears to have a different delay in the two situations: path setup and data through. This is because the overlap of operations is slightly different in each case.]
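The arithmetic in this discussion can be checked with a short sketch. It assumes only what is stated above: a two-phase clock with equal length phases, a fixed internode delay of 4.9 ns, and the quoted data through delays. The function names are hypothetical.

```python
INTERNODE_NS = 4.9  # fixed component of internode delay quoted in the text


def clock_period(data_through_ns, internode_ns=INTERNODE_NS):
    """Clock period for a two-phase clock with equal length phases.

    Each half period T must cover the larger of the two delays.
    """
    return 2 * max(data_through_ns, internode_ns)


def skew_margin(period_ns, cycles=1, internode_ns=INTERNODE_NS):
    """Largest skew for which a channel crossing fits in `cycles` cycles.

    With T equal to half a clock period, the bound is cycles * T - internode.
    """
    half = period_ns / 2
    return cycles * half - internode_ns


par = clock_period(5.7)  # planar-adaptive baseline, limited by data through
dor = clock_period(3.0)  # dimension-order router, limited by internode delay
print(f"PAR period {par:.1f} ns, DOR period {dor:.1f} ns")
print(f"PAR skew margin: {skew_margin(par, 1):.1f} ns (one cycle), "
      f"{skew_margin(par, 2):.1f} ns (two cycles)")
print(f"DOR clock is {100 * (par / dor - 1):.0f}% faster")
```

The sketch makes the limiting term explicit: the baseline router is bounded by its own data through delay, while the DOR is bounded by the channel.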

THE COST OF ADAPTIVITY
In this section, we characterize the cost of adaptivity by examining router designs with a range of adaptivity. These routers are all taken from the class of f-flat adaptive routers [10]. The f-flat adaptive routing framework can be used with any deadlock-free adaptive routing algorithm within each f-flat; we assume a Linder-Harden router [29] with fully adaptive minimal routing.
Increasing routing freedom not only increases the complexity of individual router modules; it also requires many more modules. The hardware module requirements generalize to f-flat routers. To give the reader an idea of how the resource requirements (crossbars and virtual channels) increase, consider a 3-flat adaptive router. Each 3-flat (or cube) corresponds to a plane in planar-adaptive routing and is divided into four virtual subnetworks (x+, y+, z), (x+, y-, z), (x-, y+, z) and (x-, y-, z) for 3-flat deadlock-free routing. This requires two virtual channels in the first two dimensions and four virtual channels in the third dimension. If a series of 3-flats is composed for a higher-dimensional network, eight virtual channels per physical channel are needed.
As routing freedom is increased, not only does the number of router modules required increase, but some of the router modules also become more complex and therefore slower. We consider each router module in turn. IFC and AD delays do not change because their designs require only modest changes for higher degrees of routing freedom.
The delay in RD with f-flat adaptivity may be estimated as the base delay T_RD plus two terms, each a multiple of T_gate, where T_gate denotes the basic gate delay. The basic structure of the RD consists of f + 2 connection controllers, whose inputs feed into (f + 2)-input priority encoders. Each controller controls the ith priority signal, and the outputs of the priority controller are used to determine the cnct signals. The first term comes from the lowest priority connection controller, and is proportional to f because the controllers are daisy-chained together. The second term arises from the combining logic, which grows in proportion to the number of cnct signals.
The CB delay increases because even with partitioned crossbars, f + 2 ports are required per CB. The CB delay is described below:

$$T_{CB}^{f\text{-}flat} = T_{CB} + \left( \left\lceil \frac{f+2}{4} \right\rceil - 1 \right) \cdot T_{gate}$$

The VC delay for f-flat routers increases because the number of virtual channels to be multiplexed on each physical channel increases, requiring larger (deeper) arbitration circuits and selectors. T_VC(3) denotes the basic delay of a virtual channel controller for three virtual channels. In the VC, there are two arbiters. The delay of each arbiter increases in discrete jumps with the number of virtual lanes in a VC. This increase is represented by the last term in the equation. For path setup, the overlap of the IFC delay causes the last term to be irrelevant when f is smaller than four.

5 The cnct signals are used to grant output port requests from the ADs.
Combining these terms gives overall formulas for router delay with f-flat adaptivity. At path setup: $T_{f\text{-}flat} = T_{AD} + T_{RD} + \cdots$

Based on our cost model we can estimate the speed of routers with a range of adaptivity (see Figure 9). From these estimates, it is clear that router delay increases significantly with adaptivity, although the crossbar contribution does not grow very rapidly. This is because each f-flat can be partitioned, requiring much smaller crossbars. For example, the CB in the 3-flat router is only 5 × 5. Instead, the number of virtual channels required for deadlock prevention is the primary source of increased delay. The 3-flat adaptive router requires eight virtual channels just to prevent deadlock. The increased crossbar sizes and large numbers of virtual channels make routers with higher adaptivity much slower.
The 3-flat router is 50% and the 4-flat is 190% slower than the planar-adaptive router at data through, with much of the additional delay coming from exponential increases in the numbers of virtual channels for deadlock prevention. These increases are much larger than the throughput benefits claimed by most adaptive routers, reducing or eliminating the overall performance benefits of high degrees of adaptivity.

[Figure: Router delays with up to 8 degrees of routing freedom. Delay ratios are with respect to the base router (two-flat adaptive).]

THE COST OF VIRTUAL LANES
In this section, we characterize the cost of virtual lanes by examining router designs with from one to sixteen virtual lanes. Virtual lanes can increase channel utilization in a network by multiplexing the physical channels, allowing packets to pass one another [14,26]. Though both virtual channels and virtual lanes use additional hardware buffers, virtual lanes require greater connectivity in the router, as each virtual lane is interchangeable within its virtual channel class. Essentially, this means that the crossbars cannot be partitioned. In this section, we first consider the pros and cons of several proposed architectures for virtual lanes, then estimate the speed and cost of the most attractive architecture.

Architectural Alternatives for Virtual Lanes
In [14], Dally proposes three alternatives for implementing virtual lanes, which differ primarily in the size of the crossbar switch and how it is multiplexed (Figure 10). Option B can suffer from internal blocking (see Figure 11 for an example). Blocking on option B arises from the interaction of flow control and crossbar multiplexing, and can cause performance losses. Because of the importance of minimizing internal blocking, we rule out option B.

Switching Speed

Switch multiplexing affects the critical path length for data through and thus the achievable router speed. Option B has been eliminated on the basis of blocking, so we consider options A and C. As shown in Figure 12, option A requires only one pass through the CB for each data transmission while option C requires several passes. In option A, the crossbar configuration is fixed, and the fixed connections operate identically to the base router. In option C, because the outputs are shared, the switch settings for each cycle are determined by which virtual channels will use the physical channels in that cycle. Thus, the VCs and IFCs must collaborate to control the switch based on data status information from the IFCs as well as the empty signals from downstream nodes. This approach requires three passes through the switch: the first to get the data status information to the VCs, the second to set up the switch and send the enable signals to the chosen IFCs, and the third for the data to pass through the crossbar.
In addition, option C requires extra arbiters to manage the switch multiplexing: one for each switch output, managing the virtual lanes in a single virtual channel class (see Figure 13). These additional arbitration steps are more expensive for larger numbers of virtual lanes and increase not only the path setup time, but also the data through delay (switching speed), directly reducing network throughput.
Because latency and bandwidth are first priorities, option A is the most attractive. Gate count is not a major constraint for most router designs, and for networks of modest dimension with modest numbers of virtual lanes, the required crossbar switches are feasible. For example, going from one virtual lane (base router) to two virtual lanes produces a router design with 8 x 8 crossbar switches and 6-input virtual channel controllers (see Figure 14). Alternatives which multiplex the crossbar switch may be more attractive for routers with large numbers of virtual lanes.
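To give a feel for how the fully expanded crossbar of option A grows, the following Python sketch tabulates crossbar size and virtual channel controller inputs against the number of virtual lanes. It is a rough illustration under stated assumptions, not a description of the actual design: it assumes the planar-adaptive base router has a 4 x 4 crossbar and three virtual channels per controller, and that both scale linearly with the number of lanes, which is consistent with the 8 x 8 crossbar and 6-input controllers quoted for two lanes. The function name and the crosspoint count as a proxy for switch size are our own.

```python
# Illustrative sketch only: linear scaling of option A resources is an assumption
# inferred from the one-lane and two-lane designs quoted in the text.

def option_a_resources(lanes: int, base_ports: int = 4, base_vcs: int = 3) -> dict:
    """Estimate crossbar and VC controller sizes for a planar-adaptive router
    with the given number of virtual lanes (fully expanded crossbar)."""
    if lanes < 1:
        raise ValueError("a router needs at least one virtual lane")
    ports = base_ports * lanes  # every lane gets its own crossbar port
    return {
        "crossbar": (ports, ports),
        "crosspoints": ports * ports,        # crude proxy for switch size
        "vc_inputs": base_vcs * lanes,       # virtual channels per controller
    }


for lanes in (1, 2, 4, 8, 16):
    r = option_a_resources(lanes)
    print(f"{lanes:2d} lanes: {r['crossbar'][0]} x {r['crossbar'][1]} crossbar, "
          f"{r['crosspoints']} crosspoints, {r['vc_inputs']}-input VC controllers")
```

Under these assumptions the number of crosspoints grows with the square of the number of lanes, which illustrates why multiplexed alternatives may become more attractive for large numbers of virtual lanes.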

Performance of Routers with Virtual Lanes
In this section, we characterize the speed of routers supporting virtual lanes with architectural alternative option A. Adding virtual lanes requires minor modifications to the RD, CB and VC. First, the RD must connect messages to virtual lanes, not just physical channels. Second, adding one virtual lane requires VCs that can support six virtual channels (the former three virtual channels multiplied by two virtual lanes). Finally, the crossbar size is increased. A modified base router with two virtual lanes is shown in Figure 14. Comparing it to Figure 3 shows the significant increase in complexity. Based on our designs, the speed of a router with m virtual lanes can be estimated. The CB and VC delays increase slowly with the number of virtual lanes due to the increasing depth of the switching and arbitration circuits. The second term in the VC delay for path setup is a factor of two smaller than that for data through because of the different overlap at path setup (see Section 5). These delays are summarized for a planar-adaptive router with a range of virtual lanes from one to sixteen in Figure 15. Clearly, adding virtual lanes significantly increases router setup and data through latency. For example, if data through latency determines throughput, going from one virtual lane to two virtual lanes requires more than a 30% throughput improvement to be worthwhile. In conjunction with measured perform-

We convert the router delays to achievable network clock rates based on four assumptions: two-phase clocking, equal length phases (approximately 50% duty cycle clocks), path setup in one and a half clock cycles, and the achievable clock rate determined by the delay at data through. The first two conditions match the intranode delay (at data through) and internode delay. The third condition assumes we can achieve a router setup latency of two clock periods. The final condition assumes a simple router architecture with identical flit and phit (physical transfer unit) size.
Figures 16 and 17 summarize the intranode delay, achievable clock rate, delay per hop and required channel utilization improvement rate of routers with f-flat adaptivity and m virtual lanes, respectively. The required channel utilization figures show the performance increase required to justify addition of the feature. The numbers in Figure 17 are based on adding virtual lanes to a planar-adaptive (2-flat) router.
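The break-even reasoning behind the required channel utilization figures can be sketched as follows. This is a simplified model of our own, not the exact procedure used to produce the figures: it assumes that delivered throughput is proportional to channel utilization divided by the clock period, so a feature that slows the router by some ratio must raise utilization by at least the same ratio to pay for itself. The function names and the 20% utilization gain in the usage example are hypothetical.

```python
# Simplified break-even model (an assumption): throughput ~ utilization / clock period.

def required_utilization_gain(delay_ratio: float) -> float:
    """Fractional utilization improvement needed to offset a router that is
    delay_ratio times slower than the base router at data through."""
    return delay_ratio - 1.0


def relative_throughput(delay_ratio: float, utilization_gain: float) -> float:
    """Throughput relative to the base router, given the slowdown and the
    fractional utilization gain the feature actually delivers."""
    return (1.0 + utilization_gain) / delay_ratio


# Slowdowns quoted in the text: 3-flat is 50% slower, 4-flat is 190% slower.
for name, ratio in (("3-flat", 1.5), ("4-flat", 2.9)):
    need = required_utilization_gain(ratio)
    # Hypothetical 20% utilization gain, chosen only for illustration.
    net = relative_throughput(ratio, 0.20)
    print(f"{name}: needs {need:.0%} more utilization; "
          f"a 20% gain yields {net:.2f}x base throughput")
```

With the slowdowns quoted in the text, the model reproduces the 50% and 190% thresholds, and shows that a gain smaller than the threshold leaves the router with lower net throughput than the base design.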
Using the collected information, one can compare the cost of adaptivity and virtual lanes. Previous simulation studies show that network channel utilization benefits from a mix of the two features [25]. Our results show that adaptivity based on the Linder-Harden algorithm is more expensive than virtual lanes, and higher degrees of adaptivity based on this approach are probably not feasible. A 3-flat adaptive router incurs an increase in delay as large as a router with three virtual lanes. A 4-flat router's delay is nearly as large as that of a router with seven virtual lanes. In summary, routers with modest adaptivity and larger numbers of virtual lanes are most attractive. Further, higher degrees of adaptivity or large numbers of virtual lanes are probably not viable, as their cost-effectiveness depends on four-fold increases in channel utilization.

SUMMARY AND FUTURE WORK
In this paper, we have described the design of a planar-adaptive router, and used that design to analyze the cost of a basic adaptive router. Our router is internally synchronous and externally asynchronous. Based on a 0.8 micron gate array technology, we characterized the speed of our design. The intranode router delay is 10.3 ns and 5.7 ns for path setup and data through respectively, supporting a maximum signalling rate of 87 MHz or 174 MB/s per physical channel (sixteen-bit channels). Our design could be improved by using a single-phase clock and edge-triggered latches, potentially raising performance to 348 MB/s.
Using the planar-adaptive router design as a baseline, we explored the cost of adaptivity and virtual lanes. Our studies show that higher degrees of adaptivity can be extremely expensive. Justifying the increased cycle time due to adaptivity requires that it deliver huge increases in channel utilization. For example, to justify an increase in adaptivity from a two-flat to a 3-flat router requires a 50% improvement in channel utilization. To justify an increase from a two-flat to four-flat router the improvement must be extremely large, 190%. Simulation studies show that improvements in channel utilization due to adaptive routing are likely to be more modest. Consequently, only low degrees of adaptivity are attractive.
Our studies show that virtual lanes are less expensive than adaptivity, but still quite expensive. To justify the increased cycle time, the first additional virtual lane must provide at least a 30% increase in channel utilization. Justifying the second and third virtual lanes requires a further 30% increase in channel utilization for each. Published simulation studies show that such increases are possible for a modest number of virtual lanes, but performance increases sufficient to justify larger numbers of virtual lanes appear unlikely [14,25]. Consequently, small numbers of virtual lanes should produce the best performance. In summary, routers with modest adaptivity and larger numbers of virtual lanes are most attractive.
By examining the implementation complexity of adaptive routing and virtual lanes, we seek to balance their cost and benefit. While much research has been published on the advantages of these features, we hope to provoke debate on their cost and real benefits. For example, some proponents of adaptive routing have claimed lower latency at low loads as a performance advantage. Our design studies show that increases in router complexity and intrarouter latency are likely to overwhelm such benefits. Both adaptive routing and virtual lanes can increase network throughput, but our design studies show that their complexity can produce compensating reductions in network bandwidth. Measuring the cost of network features allows us to weigh their benefit against their cost and make informed tradeoffs.

There are still many avenues open for future work in this area. Though our design presents a basic evaluation of the cost of adaptivity and virtual lanes, it is based on a single technology point and a particular class of router architectures. Other technology points and router architectures should be examined to see if they give qualitatively different results. This study also examined essentially one approach to routing; other studies of this type will certainly explore the cost of alternative approaches to adaptive routing. The ultimate goal is an integrated optimization that includes routing algorithm, network topology, routing freedom, and virtual lanes, allowing the major choices which affect network cost and performance to be related in a global perspective on network design.

Figure 17: The delay, achievable clock rate, delay per hop and required channel utilization improvement for planar-adaptive (2-flat) routers for a range of virtual lanes.