Contango: Integrated optimization of SoC clock networks

On-chip clock networks are remarkable in their impact on the performance and power of synchronous circuits, in their susceptibility to adverse effects of semiconductor technology scaling, as well as in their strong potential for improvement through better CAD algorithms and tools. Our work offers new algorithms and a methodology for SPICE-accurate optimization of clock networks, coordinated to satisfy slew constraints and achieve best trade-offs between skew, insertion delay, power, as well as tolerance to variations. Our implementation, called Contango, is evaluated on 45nm benchmarks from IBM Research and Texas Instruments with up to 50K sinks.


I. INTRODUCTION
Accurate distribution of clock signals is a major limiting factor for high-performance integrated circuits when unintended clock skew narrows down the useful portion of the clock cycle. Semiconductor scaling in the 1990s made clock optimization more challenging. While transistors continued scaling, interconnect lagged in performance [6]. This phenomenon boosted demands for repeaters in clock networks, raised their power profile, and complicated their synthesis.
Clock networks were among the first circuits to suffer the impact of process, voltage and temperature variations. Systematic variations can affect paths to different sinks in different ways, making effective skew higher than nominal skew. Intra-die variations may be stronger on some paths than on others, which would further increase effective skew. These challenges have motivated research at the device, circuit and algorithm levels [9]. In general, smaller sink latencies and shorter tree paths decrease exposure to variations.
Our work focuses on clock-network synthesis for ASICs and SoCs, where clock frequencies are not as aggressive as in high-performance CPUs, but power is limited, especially for portable applications. In this context, tree topologies remain the most popular choice, potentially with further tuning and enhancements. The SoC context introduces another twist: layout obstacles. A typical SoC includes numerous predesigned blocks (CPUs, RAMs, DSPs, etc.) and datapaths. While it may be possible to route wires over such obstacles, buffer insertion over them is typically not allowed. One can fathom the difficulty of such optimization through comparison to signal-net routing, where obstacle-avoiding Steiner trees currently remain an active area of research [10].
We make the following contributions.
• Notions of slow-down & speed-up slack for clock trees
• Tree optimizations driven by accurate delay models
• A simple and robust technique for obstacle avoidance in clock trees subject to slew constraints
• A provably-good sink-polarity correction algorithm
• A methodology for clock-tree optimizations that outperforms the best results at the ISPD'09 contest on every benchmark by 2.15 to 3.99 times, while reducing skew to 2.2 to 4.6ps. On newer Texas Instruments benchmarks with up to 50K sinks, skew remains < 11ps. Further optimization is possible by selecting the best parameters for each benchmark, at the cost of increased runtime, but skew < 20ps is considered negligible in industrial practice.

II. BACKGROUND AND PRIOR WORK
DME algorithms. Traditionally, clock trees have been constructed with respect to simple delay models: geometric pathlength or Elmore delay. In this context, the results in [1], [2], [4], [14] show how to build zero-skew trees (ZSTs) with minimal wirelength, improving upon H-trees and fishbones.
The Deferred Merge Embedding (DME) algorithm, which uses the concept of merging segments [1], [4] to construct zero-skew trees, was extended to the bounded-skew tree (BST) problem. BST/DME algorithms [3], [7] generalize merging segments to merging regions. When BST/DME algorithms were introduced in the early 1990s, many chip designs included one large central buffer to drive clock signals through the entire chip. However, SPICE simulations indicate that traditional clock trees would not satisfy slew constraints in modern designs because the maximal length of unbuffered interconnect has decreased significantly due to technology scaling [6].
Obstacle-avoiding clock trees. The concept of merging regions in BST/DME was extended to obstacle-avoiding trees in [8], where (i) obstacles were assumed rectangular, (ii) no routing over obstacles was allowed, and (iii) buffering was not considered. The authors observed that obstacle processing slowed down their BST/DME algorithm and hinted at more sophisticated geometric data structures. In contrast to [8], the ISPD'09 contest allowed routing but not buffering over obstacles, with modern SoCs in mind.
Fast buffer insertion. L. van Ginneken introduced an algorithm for buffering RC-trees [5], which minimizes Elmore delay and runs in O(n^2) time, given n possible buffer locations. It was not intended for clock trees: it minimizes worst-case delay rather than skew. The O(n log n)-time variant of van Ginneken's algorithm proposed in [12] is more appropriate for large trees. Furthermore, it spares buffers on fast paths and results in low skew if the initial tree was balanced.
The ISPD'09 clock-network synthesis contest was organized by IBM Austin Research Laboratory and based on a 45nm technology. Sink latencies and clock skew were evaluated by SPICE. The main objective was the difference between the least sink latency @1.2V (supply) and the greatest sink latency @1V (supply). This Clock Latency Range (CLR) metric was intended to capture the impact of variations, but nominal skew was also recorded. The 10%-90% slew rate and total power were strictly limited.
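The CLR metric described above can be sketched in a few lines of Python. This is only an illustration of the definition in the text; the function names and the sample latencies are hypothetical, and the latencies would in practice come from SPICE runs at the two supply voltages.

```python
def clock_latency_range(latencies_high_vdd, latencies_low_vdd):
    """CLR: greatest sink latency at the low supply (1.0V in the contest)
    minus least sink latency at the high supply (1.2V in the contest)."""
    return max(latencies_low_vdd) - min(latencies_high_vdd)


def nominal_skew(latencies):
    """Nominal skew at a single corner: T_max - T_min."""
    return max(latencies) - min(latencies)


# Hypothetical sink latencies in ps, listed in the same sink order.
at_1v2 = [400.0, 410.0, 405.0]
at_1v0 = [520.0, 535.0, 528.0]
print(clock_latency_range(at_1v2, at_1v0))  # CLR across the two corners
print(nominal_skew(at_1v2))                 # skew at the 1.2V corner
```

Because CLR compares the fastest sink at one corner with the slowest sink at the other, it is never smaller than the skew at either corner.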

III. PROBLEM ANALYSIS
The design of a clock network offers a large amount of freedom in topology selection, spacing and sizing of inverters, as well as the sizing of individual wires. Traditionally, network topology is decided first. Trees offer unparalleled flexibility in optimization because latency from the root to each sink can be tuned individually, while large groups of sinks can be tuned by altering nodes and edges high up in the tree.

A. Optimization objectives & delay modeling
Accurate clock network design is complicated by the fact that the optimization objectives are not available in closed form and take significant CPU resources to evaluate. Skew optimization requires much higher accuracy than popular Elmore-like delay models provide. For example, a 5ps error represents only 1% of a 500ps sink latency, but 50% of a 10ps skew. Closed-form models do not capture resistive shielding in long wires, do not propagate slew with sufficient accuracy, and do not account well for the impact of slew on delay. Newer, more sophisticated models are laborious to implement and only available in modern commercial tools. Our strategy is to use simple analytical models at the first steps of the proposed flow, (1) to construct zero-skew clock trees and (2) to perform initial fast buffer insertion, but to drive further optimizations by SPICE runs, Arnoldi approximation, or any other available timing analysis tool/model.

B. Nominal skew optimization
An initial buffered clock tree is constructed early in the design flow. Assuming no slew violations, the latency of each sink is known from SPICE simulations, at which point minimal and maximal latencies (T_min and T_max) can be found. Since absolute sink latencies are not as important as skew (T_max − T_min), skew can be improved by either decreasing T_max (speeding up the slowest sinks) or increasing T_min (slowing down the fastest sinks).  = 0. It is important to note that the validity of slacks-related calculations does not depend on the use of specific delay models or SPICE simulations. When visualizing clock trees, we color their edges with a red-green gradient, indicating low slack with red and high slack with green, as shown in Figure 3.
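A minimal Python sketch of slack computation follows. The formal definitions are not reproduced in this text, so the sketch assumes that the slow-down slack of a sink is T_max minus its latency, that its speed-up slack is its latency minus T_min, and that the slack of an edge is the minimum over its downstream sinks. The tree encoding (a dictionary of children, with each edge named by its lower node) and all names are hypothetical.

```python
def edge_slacks(children, latency, root):
    """Return (slow_down, speed_up) slack maps keyed by node.

    The slack stored for a node is the slack of the edge entering it,
    i.e. the minimum slack over all sinks below that edge (assumed definition).
    """
    t_max = max(latency.values())
    t_min = min(latency.values())
    slow_down, speed_up = {}, {}

    def visit(node):
        kids = children.get(node, [])
        if not kids:  # a sink: slack measured against the extreme latencies
            pair = (t_max - latency[node], latency[node] - t_min)
        else:         # an internal node: the tightest downstream sink dominates
            results = [visit(k) for k in kids]
            pair = (min(r[0] for r in results), min(r[1] for r in results))
        slow_down[node], speed_up[node] = pair
        return pair

    visit(root)
    return slow_down, speed_up


# Hypothetical tree with three sinks; latencies in ps.
children = {"root": ["a", "b"], "a": ["s1", "s2"], "b": ["s3"]}
latency = {"s1": 100, "s2": 104, "s3": 110}
slow, fast = edge_slacks(children, latency, "root")
print(slow["a"], fast["b"])
```

Under these assumptions, the edge into "a" can be slowed by 6ps and the edge into "b" can be sped up by 10ps without increasing skew.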
Lemma 2 suggests that, instead of changing the delay of an edge, one can change the delay of its downstream edges by an equal amount, as long as only one delay change is applied on each root-to-sink path. When choosing between tree edges on the same path, we prefer (at early stages of optimization) to tune edges as high in the tree as possible, so as to minimize (i) the amount of change, (ii) the risk of introducing slew violations and (iii) power overhead. However, in a highly optimized tree, we tune bottom-level edges where we can better predict the impact on skew. The preference for high-level tree edges can be formalized as follows. To avoid such conflicts, one can perform rounds of speed-up and rounds of slow-down, separated by SPICE-based analysis and slack update. In practice, it is usually much easier to slow down an edge (e.g., by wire snaking) than to speed it up. If any speed-up is possible, e.g., by using stronger buffers, it is performed first. Rounds of speed-up and slow-down are more conveniently performed top-down, so that when an edge cannot be tuned by the desired amount, the remainder is passed to its downstream edges.
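One round of top-down slow-down with remainder passing can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes that each edge has a known slow-down slack and a hypothetical `capacity`, the largest delay that can actually be added on that edge (for example, by snaking). The delay already added upstream is subtracted from the slack of each downstream edge.

```python
def slow_down_round(children, slack, capacity, root):
    """Assign added delay to edges top-down (each edge is named by its lower node).

    An edge receives as much of its remaining slack as its capacity allows;
    whatever it cannot absorb is left for its downstream edges.
    """
    added = {}

    def visit(node, upstream):
        want = slack[node] - upstream  # delay still allowed below this point
        delta = max(0.0, min(want, capacity.get(node, 0.0)))
        added[node] = delta
        for kid in children.get(node, []):
            visit(kid, upstream + delta)

    for kid in children.get(root, []):
        visit(kid, 0.0)
    return added


# Hypothetical data: slacks and capacities in ps.
children = {"root": ["a", "b"], "a": ["s1", "s2"], "b": ["s3"]}
slack = {"a": 6, "b": 0, "s1": 10, "s2": 6, "s3": 0}
capacity = {"a": 4, "b": 5, "s1": 10, "s2": 10, "s3": 10}
print(slow_down_round(children, slack, capacity, "root"))
```

In this example the edge into "a" absorbs only 4 of its 6ps, and the remainder is made up on the edges into "s1" and "s2". Since an edge's slack is never larger than that of any edge below it, no sink is slowed beyond its own slack. A real flow would re-run SPICE and update slacks after the round.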
We found that after nominal skew is sufficiently optimized, both rising and falling transitions can individually limit speed-up and slow-down slacks. We handle the two transitions separately and define edge slacks as the smaller of rise-slack and fall-slack. Furthermore, speed-up and slow-down slacks can be computed for each given process corner (two in the ISPD'09 contest). In order to improve the multi-corner CLR objective, a tree edge can be sped up conservatively by the minimum of its speed-up slacks over all corners, and can be slowed down by the minimum of its slow-down slacks.
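The conservative combination of slacks across transitions and corners amounts to taking a minimum, as in this small Python sketch. The data layout (a dictionary from corner name to a rise/fall pair) and the corner labels are hypothetical.

```python
def conservative_slack(slack_by_corner):
    """Smallest slack over all corners and both transitions.

    slack_by_corner maps a corner name to (rise_slack, fall_slack) for one edge.
    """
    return min(min(rise, fall) for rise, fall in slack_by_corner.values())


# Hypothetical slow-down slacks (ps) of one edge at two supply-voltage corners.
edge = {"1.0V": (5.0, 3.0), "1.2V": (4.0, 6.0)}
print(conservative_slack(edge))
```

Tuning the edge by no more than this value keeps the change safe for every corner and for both transitions.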

C. Coordinating multiple optimizations
We found that different clock-tree optimizations exhibit different strength/range and different accuracy (see Table III). Our strategy in coordinating clock-tree optimizations is to start with optimizations that offer the greatest range, and then transition to optimizations with greater accuracy.

IV. PROPOSED SOC CLOCK-SYNTHESIS METHODOLOGY
Our proposed clock-network synthesis methodology and its major algorithmic steps are shown in Figure 1. Contango first builds an initial tree using a ZST/DME algorithm [3] and alters it to avoid obstacles. It then uses an O(n log n)-time variant of van Ginneken's buffer insertion algorithm [12] to ensure small insertion delay and satisfy slew constraints. A series of novel clock-tree optimizations is applied next.

A. Obstacle-avoiding clock trees
As we pointed out in Section II, obstacle-avoiding clock trees can be built by repairing obstacle violations in ZSTs. This approach is attractive when large obstacles abut the chip's periphery because ZSTs naturally avoid areas without clock sinks. This approach is also attractive when obstacles are small or thin enough that a buffer inserted immediately before the obstacle can drive the wire over the obstacle, so that no rerouting is necessary. A third convenient case occurs when a wire can be rerouted around the obstacle without an increase in length. Most obstacles are rectangular in shape, but such rectangles may abut, creating rectilinear-shaped obstacles. When two obstacles abut, we cannot place a buffer between them, and therefore handle them as one compound obstacle. Contango detours wires using the following algorithm, illustrated in Figure 2 for a composite obstacle.
Step 1. Identify all wires that intersect obstacles. For each point-to-point connection, perform shortest-path maze routing around the obstacles. For subtrees that cross an obstacle, find L-shaped segments that link points inside and outside the obstacle. For each L-shape, choose one of the two possible configurations that minimizes overlap with the obstacle.
Step 2. When a wire crosses an obstacle, Contango captures an entire subtree enclosed by the obstacle (see Figure 2). The total capacitance of the subtree is then measured and compared to the capacitance that can be driven by a single buffer without risking slew violations (slew-free capacitance). Sub-trees that can be driven by one buffer do not require detours.
Step 3. For obstacles crossed by a subtree that cannot be safely driven by a single buffer, Contango establishes a detour along the contour of the obstacle. This is accomplished by first considering the entire contour as a detour, and then removing one segment between tree sinks adjacent along the contour, so as to ensure that the clock network remains a tree. If we were to minimize total capacitance, we would remove the longest segment of the contour between two adjacent tree sinks. However, we minimize the longest detoured source-to-sink path, and therefore remove the segment furthest from the tree source (counting distances along the contour). In other words, we first find the sink most distant from the source along the contour, and include in the detour the entire shortest path to the source. The other segment incident to the sink is removed, but the shortest path from its other end to the source is included (see Figure 2).
Detours may significantly increase skew, but electrical correction can compensate for that.
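Steps 2 and 3 can be illustrated with a simplified one-dimensional Python sketch. It assumes that the obstacle contour is parameterized by distance, that the path from the tree source reaches the contour at position 0, and that sinks of the captured subtree attach to the contour at known positions strictly between 0 and the perimeter. The function names and this parameterization are hypothetical simplifications of the geometry in Figure 2.

```python
def needs_detour(subtree_cap, slew_free_cap):
    """Step 2: a detour is needed only if one buffer cannot drive the subtree."""
    return subtree_cap > slew_free_cap


def contour_detour(perimeter, sink_positions):
    """Step 3: choose the contour segment to remove.

    Returns (removed_segment, longest_path), where removed_segment is a pair of
    contour positions and longest_path is the longest source-to-sink distance
    along the remaining contour.
    """
    pts = sorted(sink_positions)

    def dist(p):  # shorter of the two ways around the contour
        return min(p, perimeter - p)

    far = max(range(len(pts)), key=lambda i: dist(pts[i]))
    if pts[far] <= perimeter - pts[far]:
        # The farthest sink is reached in the increasing direction; cut just past it.
        nxt = pts[far + 1] if far + 1 < len(pts) else perimeter
        removed = (pts[far], nxt)
    else:
        # The farthest sink is reached in the decreasing direction; cut just before it.
        prev = pts[far - 1] if far > 0 else 0.0
        removed = (prev, pts[far])
    return removed, dist(pts[far])


print(contour_detour(100.0, [10.0, 40.0, 70.0, 90.0]))
```

In this example the sink at position 40 is farthest from the source, so the segment between positions 40 and 70 is removed and every remaining sink is reached along its shorter side of the contour.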

B. Composite inverter/buffer analysis
Most technology libraries support dedicated clock buffers or inverters that are larger and more reliable than those for signal nets. Parallel composition of buffers increases driver strength, helping with slew constraints and improving robustness to variations. Yet, buffer sizes must be moderated to satisfy total power limits. For a given buffer library, we consider many possible composite buffers. Using dynamic programming, we select several non-dominated configurations that can be further evaluated during buffer insertion. Algorithmic details are omitted here because the ISPD'09 contest used only two inverter types: large and small. Table I shows that eight parallel small inverters exhibit smaller output resistance than one large inverter, and smaller input/output capacitance. Hence Contango used 8× small inverters instead of large inverters, in batches of 16×, 24×, etc. This benchmark-independent optimization, along with buffer sizing, plays an important role in our methodology.
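The idea of non-dominated composite inverters can be sketched in Python. The sketch assumes an idealized parallel composition (k identical inverters have 1/k of the output resistance and k times the capacitances) and uses a brute-force dominance filter in place of the dynamic programming mentioned above. The electrical values are illustrative placeholders, not the entries of Table I.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inverter:
    name: str
    r_out: float  # output resistance (ohm)
    c_in: float   # input capacitance (fF)
    c_out: float  # output capacitance (fF)


def parallel(inv, k):
    """Idealized composite of k identical inverters connected in parallel."""
    return Inverter(f"{k}x{inv.name}", inv.r_out / k, inv.c_in * k, inv.c_out * k)


def dominates(a, b):
    """a is no worse than b in every parameter and strictly better in one."""
    no_worse = a.r_out <= b.r_out and a.c_in <= b.c_in and a.c_out <= b.c_out
    better = a.r_out < b.r_out or a.c_in < b.c_in or a.c_out < b.c_out
    return no_worse and better


def non_dominated(candidates):
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Placeholder library values, chosen only to illustrate the dominance test.
small = Inverter("small", r_out=400.0, c_in=4.0, c_out=6.0)
large = Inverter("large", r_out=60.0, c_in=35.0, c_out=80.0)
candidates = [parallel(small, k) for k in range(1, 9)] + [parallel(large, 1)]
print([c.name for c in non_dominated(candidates)])
```

With these placeholder values the single large inverter is dominated by eight small ones in parallel, which mirrors the qualitative observation drawn from Table I.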

C. Initial inverter insertion with sizing
Given a clock tree with buffers, it is easy to increase the latency of a given sink, but it is difficult to speed up a sink. Therefore, our strategy is to first make sinks as fast as possible, and then reduce skew with wiresnaking and wiresizing. When buffers are inserted into an Elmore-balanced tree, source-to-sink paths contain practically the same numbers of buffers.
We adapted the O(n log n)-time variant of van Ginneken's algorithm from [12]. Due to its speed, it can be launched with different inverter configurations, effectively performing simultaneous optimization across multiple parameters. Our experiments indicate that driver strength is a major factor in moderating the impact of supply-voltage variations. Therefore Contango performs fast buffer insertion with different composite buffers until it finds the best-performing solution with the strongest composite buffers within 90% of the power limit. We reserve γ = 10% of the power budget to facilitate more accurate optimizations.
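The sweep over composite buffers can be sketched as a simple loop in Python. The callable `evaluate_power` is a hypothetical stand-in for a complete fast buffer-insertion run followed by power evaluation, and the power model in the example is invented for illustration only.

```python
def pick_strongest(configs, evaluate_power, power_limit, reserve=0.10):
    """Return the strongest configuration whose power fits the reduced budget.

    configs: composite-buffer strengths (e.g. number of parallel inverters).
    evaluate_power: hypothetical callable that runs buffer insertion with the
        given configuration and returns the total power of the resulting tree.
    reserve: fraction of the power limit kept for later optimizations.
    """
    budget = (1.0 - reserve) * power_limit
    for cfg in sorted(configs, reverse=True):  # try the strongest first
        if evaluate_power(cfg) <= budget:
            return cfg
    return None  # nothing fits within the budget


# Invented linear power model, for illustration only.
print(pick_strongest([8, 16, 24, 32], lambda k: 10 + 2.5 * k, power_limit=80.0))
```

The sketch selects on strength alone; a fuller version would also compare the quality of the resulting trees, as the text describes.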

D. Sink-polarity correction
The O(n log n) variant of van Ginneken's algorithm [12] used in our work assumes that all available clock buffers preserve polarity. However, when polarity-changing inverters are used, as in the ISPD'09 contest, it will typically produce trees with incorrect sink polarity (inverted sinks).
While the algorithm can be extended to account for sink polarity, we found this unnecessary. Even a simple patch, placing an additional inverter at each of the n_× inverted sinks, works reasonably well, because the skew introduced by new inverters can be fixed by downstream optimizations. This technique inserts inverters at half the sinks (n/2) on average. To reduce the added capacitance in cases when n_× > n/2, Contango inserts one inverter at the top of the tree, leaving only n − n_× < n/2 sinks with wrong polarity. The average number of inserted inverters would now be (n + 2)/4. Instead, Contango traverses the tree bottom-up and marks each node (i) all of whose sinks have equal polarity, but (ii) whose parent does not satisfy (i). An inverter is inserted at each marked node whose downstream sinks have incorrect polarity. As a result, the number of added inverters is significantly reduced, as shown in Table II.
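The bottom-up marking step can be sketched in Python as follows. The tree encoding and names are hypothetical; `inverted[s]` is True when sink s currently receives the wrong polarity, and an inverter "at a node" is understood as an inverter on the edge entering that node.

```python
def polarity_fix_nodes(children, inverted, root):
    """Return the nodes at which a corrective inverter should be inserted.

    A node is marked when all of its sinks share one polarity while its parent's
    sinks do not; an inverter is needed only at marked nodes with inverted sinks.
    """
    fixes = []

    def visit(node):
        # Returns True/False if every sink below has that state, None if mixed.
        kids = children.get(node, [])
        if not kids:
            return inverted[node]
        states = [visit(k) for k in kids]
        if all(s is True for s in states):
            return True
        if all(s is False for s in states):
            return False
        # Mixed polarity here: uniform children are maximal, hence marked.
        fixes.extend(k for k, s in zip(kids, states) if s is True)
        return None

    if visit(root) is True:  # the whole tree is inverted: fix once at the root
        fixes.append(root)
    return fixes


# Hypothetical tree: four of five sinks are inverted.
children = {"r": ["a", "b"], "a": ["s1", "s2"], "b": ["s3", "c"], "c": ["s4", "s5"]}
inverted = {"s1": True, "s2": True, "s3": False, "s4": True, "s5": True}
print(sorted(polarity_fix_nodes(children, inverted, "r")))
```

In this example two inverters (at "a" and "c") correct four inverted sinks, whereas the simple per-sink patch would need four.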

E. Iterative top-down wiresizing
After the initial SPICE run, Contango computes slow-down slacks at every edge as described in Section III, along with the Δ_e^slow parameters. These parameters indicate the amount by which a given tree edge can be slowed down before skew would be negatively affected. Since fast sinks often cluster together, skew can be lowered by slowing down either many bottom-level wires or few wires higher in the tree. Our top-down algorithm pursues the latter, seeking to minimize tree modifications.
We build an ad hoc linear model based on the impact of downsizing a unit-length wire segment. Contango chooses several independent wire segments in the middle of the tree and downsizes them to observe the impact on latencies of downstream sinks. This requires a single SPICE run and produces a single parameter T_ws, the maximum latency increase (over all sinks). When downsizing a wire, we multiply T_ws by its length to estimate the impact on downstream sink latencies. To understand why this linear model works well in practice, assume that delay is modeled by a sum of RC terms. When a short wire segment is sized, the affected Rs and Cs do not appear in the same term; thus, the impact on delay is linear.
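The linear model can be applied in a top-down pass, as in the Python sketch below. It assumes that each wire is either downsized as a whole or left alone, that T_ws has already been calibrated by a SPICE run, and that the slow-down slack of each edge is known. The tree encoding, units, and numbers are hypothetical.

```python
def plan_wiresizing(children, length, slack, t_ws, root):
    """Choose wires to downsize, top-down (each wire is named by its lower node).

    A wire is downsized when its estimated latency increase, t_ws * length,
    fits within the slack that remains after upstream changes.
    """
    chosen = []

    def visit(node, used):
        cost = t_ws * length[node]        # linear estimate of added latency
        if cost <= slack[node] - used:
            chosen.append(node)
            used += cost
        for kid in children.get(node, []):
            visit(kid, used)

    for kid in children.get(root, []):
        visit(kid, 0.0)
    return chosen


# Hypothetical data: slack in ps, length in arbitrary units, t_ws in ps per unit.
children = {"root": ["a", "b"], "a": ["s1", "s2"], "b": ["s3"]}
length = {"a": 8, "b": 2, "s1": 10, "s2": 6, "s3": 1}
slack = {"a": 6, "b": 0, "s1": 10, "s2": 6, "s3": 0}
print(plan_wiresizing(children, length, slack, t_ws=0.5, root="root"))
```

Because the estimate is approximate, such a plan would be verified by SPICE and followed by a slack update before the next round.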

F. Iterative top-down wiresnaking
Wiresizing can reduce large skew by applying small changes, which is appropriate after the initial tree construction. An experienced clock-network designer suggested to us that a small amount of wiresnaking is often used to improve clock skew, as long as added capacitance does not significantly affect power. Therefore, we developed an accurate top-down wiresnaking process, to be invoked after top-down wiresizing. This step uses the same slow-down slack computation we described earlier. A SPICE simulation is performed (another accurate delay model can be used instead) to measure T_wn, the worst-case delay of wiresnaking with unit length l_wn. The value of l_wn affects the accuracy of the wiresnaking algorithm: smaller l_wn offers greater accuracy but typically leads to more SPICE runs, since skew reduction in each round of top-down wiresnaking is smaller. The value of l_wn was set based on empirical data.
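The effect of the unit length on accuracy can be seen in a small Python sketch. It assumes, as a simplification, that each snaking unit of length l_wn adds at most T_wn of delay, so the number of units that fit into a slack is obtained by rounding down. The function name and numbers are hypothetical.

```python
import math


def snaking_units(slack, t_wn):
    """Number of whole snaking units whose worst-case delay fits in the slack."""
    return max(0, math.floor(slack / t_wn))


slack = 7.0  # ps of slow-down slack on some edge (hypothetical)
for t_wn in (2.0, 0.5):  # worst-case delay per unit for a long and a short unit
    units = snaking_units(slack, t_wn)
    print(t_wn, units, slack - units * t_wn)  # unit delay, units, unused slack
```

A shorter unit leaves less slack unused on each edge, at the cost of finer steps per round.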

G. Bottom-level fine-tuning
After two top-down skew reduction phases, skew becomes small enough to perform bottom-level optimizations. Bottom-level wiresizing and wiresnaking optimize the wires directly connected to sinks. Contango performs SPICE-driven bottom-level wiresizing and wiresnaking until the results stop improving. Typically the gain of bottom-level tuning is under 2ps, but it can be a significant fraction of the remaining skew.
We found that when skew is under 5ps, the corner sinks of the rising transition and the falling transition are often different. This rise-fall divergence makes further improvements to the clock tree very difficult. Indeed, reducing rising skew by slowing down a sink that is fast for the rising transition may increase falling skew due to excessive slowdown of a sink that is slow for the falling transition. The average skew after bottom-level tuning is 3.21ps on ISPD'09 CNS contest benchmarks.

H. Buffer sliding and interleaving
Optimization techniques covered so far focus on skew, possibly in combination with the CLR objective. We now discuss targeted improvement of robustness to variations in device performance. Extensive experiments suggest that the impact of variations on skew is best reduced by (i) decreasing sink latency (insertion delay), and (ii) using the strongest possible buffers. Each measure must be applied to achieve balance over all source-to-sink paths, or else skew will increase. Recall from Section IV-C that Contango minimizes total wirelength and initially minimizes insertion delay by using strongest possible buffers, subject to power limit and a 10% reserve for downstream optimizations.
Sizing up a single inverter increases its input pin capacitance and can lead to slew violations. To prevent such violations, it is often possible to slide the inverter up the tree to reduce upstream wire capacitance and interleave an inverter when two inverters move too far apart after sliding. The increase in downstream wire capacitance is balanced with the increase in the inverter's driving strength. Sizing a single inverter may increase the skew and require further correction. Therefore, we focused on the top-most levels of the tree, whose impact on skew is relatively small. Given a clock source at the chip boundary, DME algorithms generate a long wire leading to the center of the chip, and the tree branches out from the center. This long wire -the tree trunk -is later populated with a chain of inverters, which can be upsized without significant impact on skew because this equally affects all sinks. However, since roughly 1/3 to 1/2 of sink latency is due to the tree trunk, it accounts for a large fraction of variational impact on latency.

I. Iterative buffer sizing
After sliding and interleaving top-level buffers, we invoke iterative buffer sizing. First, this algorithm sizes up buffers in the tree trunk. At the i-th iteration of buffer sizing, Contango sizes up the composite inverters by at most p_i = 100/(i + 3)%. The iterations continue as long as results improve without slew violations. Buffer sizing in tree branches incurs a greater capacitance penalty. To compensate, Contango borrows capacitance by downsizing bottom-level buffers.
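The sizing schedule and stopping rule can be sketched in Python. The callable `evaluate` is a hypothetical stand-in for a SPICE-based evaluation that returns an objective (for example, CLR) and a flag indicating the absence of slew violations. For simplicity the sketch applies the full step p_i at each iteration, although the text allows any increase up to p_i.

```python
def sizing_step_percent(i):
    """Upper bound on the size increase at iteration i: p_i = 100/(i + 3) percent."""
    return 100.0 / (i + 3)


def iterative_sizing(size, evaluate, max_iters=20):
    """Size up while the objective improves and slew constraints hold.

    evaluate(size) -> (objective, slew_ok); a hypothetical stand-in for SPICE.
    """
    best_obj, _ = evaluate(size)
    for i in range(1, max_iters + 1):
        trial = size * (1.0 + sizing_step_percent(i) / 100.0)
        obj, slew_ok = evaluate(trial)
        if not slew_ok or obj >= best_obj:
            break  # no improvement, or a slew violation: keep the previous size
        size, best_obj = trial, obj
    return size


print([round(sizing_step_percent(i), 2) for i in range(1, 5)])
# Invented objective with its optimum at size 2.0 and a slew limit at size 3.0.
print(iterative_sizing(1.0, lambda s: (abs(s - 2.0), s <= 3.0)))
```

The steps shrink from 25% at the first iteration, so later iterations make progressively finer adjustments.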
However, sizing up buffers after the trunk often makes the tree unbalanced in terms of skew and results in more load for the skew optimization algorithms. For better performance of skew optimizations, typically 4 or 5 levels after the first branch are sized up by the capacitance-borrowing buffer sizing algorithm. Table III shows the improvement of CLR by each optimization algorithm. Buffer sizing increases skew, but subsequent skew optimizations bring it back down.

V. EMPIRICAL VALIDATION
ISPD'09 benchmarks include seven 45nm chips up to 17mm × 17mm in size, with up to 330 selected clock sinks [13]. Contango runs faster than NTU and NCTU on most benchmarks (we measured runtimes on a 2.4GHz Intel QuadCore CPU, similar to CPUs used at the contest). A detailed breakdown of Contango optimizations is given in Table III. A clock tree produced by Contango is illustrated in Figure 3.
Scalability studies. The ISPD'09 contest was limited to unrealistically small numbers of sinks due to limitations of the open-source ngSPICE software it relied upon. To evaluate the scalability of our optimizations, we replaced ngSPICE with HSPICE. Working with a recent Texas Instruments chip sized 4.2mm × 3.0mm, we identified locations of 135K sinks and randomly sampled them to create a family of benchmarks. For this experiment, our algorithm used groups of large inverters instead of groups of 8 parallel small inverters, improving runtime eightfold at the cost of increasing CLR and skew by 1-2ps and increasing capacitance by 15%. It produced highly optimized clock trees with up to 50K sinks.

VI. CONCLUSIONS
Existing literature on clock networks offers several highly successful algorithms, but does not detail end-to-end solutions to clock-network synthesis that can handle modern interconnect. Our work makes several contributions to this end. First, we develop specialized optimization algorithms necessary to bridge the gaps between well-known point-optimizations. Our emphasis is on robust techniques that do not require tuning and are amenable to embedding into design flows. Second, we develop an EDA methodology for integrating clock-network optimization steps. Third, we describe a robust software implementation, called Contango, that outperforms the best results from the ISPD'09 contest [13] by a factor of two. Fourth, we scale our implementation to handle large industrial clock networks.
Our work relies in many ways on tree topologies and, by achieving strong empirical results, can make it difficult to justify the insertion of cross-links, advocated in previous literature. On the other hand, trees synthesized by our techniques can be integrated with meshes, as is common in modern CPU design [11]. In CPUs, better trees allow using smaller meshes, thereby reducing power of high-performance CPUs, increasing performance of embedded CPUs, and improving battery life of portable applications.

Fig. 3. The clock tree produced by Contango on ispd09fnb1. Sinks are indicated by crosses, buffers are indicated by blue rectangles. L-shapes are drawn as "diagonal wires" to reduce clutter. Wires are colored by a red-green gradient to reflect slow-down slacks, as described in Section III-B.