Tuning Strategies for Global Interconnects in High-Performance Deep-Submicron ICs

Interconnect tuning is an increasingly critical degree of freedom in the physical design of high-performance VLSI systems. By interconnect tuning, we refer to the selection of line thicknesses, widths and spacings in multi-layer interconnect to simultaneously optimize signal distribution, signal performance, signal integrity, and interconnect manufacturability and reliability. This is a key activity in most leading-edge design projects, but has received little attention in the literature. Our work provides the first technology-specific studies of interconnect tuning in the literature. We center on global wiring layers and interconnect tuning issues related to bus routing, repeater insertion, and choice of shielding/spacing rules for signal integrity and performance. We address four basic questions. (1) How should width and spacing be allocated to maximize performance for a given line pitch? (2) For a given line pitch, what criteria affect the optimal interval at which repeaters should be inserted into global interconnects? (3) Under what circumstances are shield wires the optimum technique for improving interconnect performance? (4) In global interconnect with repeaters, what other interconnect tuning is possible? Our study of question (4) demonstrates a new approach of offsetting repeater placements that can reduce worst-case cross-chip delays by over 30% in current technologies.


Introduction
With technology scaling, on-chip interconnect becomes an increasingly critical determinant of performance, manufacturability and reliability in high-end VLSI designs. Current and future designs are generally interconnect-limited, and the available routing resource must be carefully balanced among signal distribution, power/ground distribution, and clock distribution. Local interconnect layers (e.g., M1-M3) should generally remain at near-minimum dimensions and pitch to achieve acceptable routing density (an example analysis of interconnect density in 0.25 µm processes is given in [6]). For short lines (e.g., several hundred microns or less), thinner metal offers less lateral coupling capacitance and driver loading, and thus locally improves circuit performance. At the same time, maximum wire width is limited by the aspect ratio upper bound.
The resulting thin and narrow wires are highly resistive and also subject to reliability concerns; they are hence unsuitable for global interconnects, power distribution, etc.
Layers M2-M3 (and maybe M4) will support a mix of local and "semi-global" wiring, e.g., long wires within a single block. In general, shorter wires are better routed on thinner metal. Thus, the distribution of lengths and performance goals for signals in a given design, as well as design-specific objectives (circuit robustness, guardbanding against manufacturing variation, etc.), will affect the interconnect tuning.
Power distribution layers (e.g., M6-M7, maybe M5), which typically also support the top-level clock distribution (mesh or balanced tree), should be as thick as possible for reliability. IR drop and clock skew, as well as robustness under process variations, also suggest the use of thick wire on these layers. Thick wire additionally conserves area, but can suffer from increased lateral capacitive coupling.
Global interconnect layers (e.g., M4-M6) support inter-block signal runs with length on the order of 3000 µm to 15000 µm. To satisfy delay and signal integrity constraints, at least three degrees of freedom are available: line width and spacing, repeater insertion, and shield wiring. Repeater insertion shields downstream capacitance and is the canonical means of converting "quadratic" RC delay into "near-linear" delay; this technique also improves edge rates and hence noise immunity. When lateral coupling capacitances are large, worst-case "Miller coupling" begins to dominate noise and delay calculations; this is alleviated by increasing the line spacing and/or adding shield wiring (i.e., wires connected to ground, with future techniques possibly including dedicated ground and power planes interleaved with signal layers [9]). Another technique to reduce the lateral coupling capacitance is to interleave signal lines which do not switch during the same signal transition period. The bus-dominated nature of global interconnects in building-block and high-performance designs only worsens the effects of coupling, since it results in longer parallel runs.
All layers are subject to mutual pitch-matching, via sizing, and similar considerations. Hence, widths and spacings on one layer cannot be chosen independently of the widths and spacings on a second layer.
The above are only a few of the applicable design considerations; the net effect is that balancing interconnect resources is now extremely difficult as designs move into and beyond the quarter-micron regime.

Interconnect Strategies
Interconnect tuning is the selection by a design team of line thicknesses, widths and spacings in multilayer interconnect to simultaneously achieve: (i) distribution (available wiring density for local signals, global signals, clock, power and ground); (ii) performance (signal propagation delay, particularly on global interconnects); (iii) noise immunity (signal integrity, again particularly on global interconnects); and (iv) manufacturability and reliability (e.g., required margins for AC self-heat or DC electromigration on interconnects, short-circuit power in attached devices, etc.). Today, interconnect tuning is a key activity in most leading-edge microprocessor projects. It is clearly an option whenever the design and fabrication are owned by a single entity (in which case there is overlap with "interconnect process optimization"); however, for high-volume projects even fabless design houses exercise increasing influence on vendors' processes [6]. Nevertheless, this topic has received little attention in the literature, with only a few high-level treatments available. For example, [11] describes a characterization and analysis methodology and the need to break ideal scaling in deep submicron interconnect. [14] is another work that centers on analysis of a given multi-layer interconnect process, as opposed to the underlying interconnect tuning. [5] and [10] are examples of system-level treatments based on Rent's rule for interconnect length distribution. To our knowledge, the most notable work is the seminal paper of Rahmat et al. [12], which plots the constraints imposed by material, circuit performance and reliability requirements, e.g., crosstalk noise, electromigration, and signal propagation delay. The paper studies such questions as: (i) maximum interconnect length that can be switched in a clock period; (ii) delay and noise envelopes for given values of horizontal and vertical pitch; (iii) coupling capacitance as a function of feature size; and (iv) maximum length of local interconnect as limited by crosstalk noise.
We believe that our work is the first in the literature to attempt a wide-ranging study of interconnect tuning with respect to degrees of freedom (repeater insertion, choice of pitch, etc.) that are most applicable in the high-end design context. We center on global wiring layers (e.g., M4 and M5 in a 6LM process), and interconnect tuning issues related to bus routing, repeater insertion, and choice of shielding/spacing rules for signal integrity and performance. Even though the results presented in this paper are for aluminum interconnects with SiO2 dielectric, similar techniques can be applied for copper interconnects and low-K dielectrics. Several other parameters, notably wire tapering and choice of wire thickness, are not applicable in our design methodology and thus are not part of the present study.
We address four basic questions.
1. How should width and spacing be allocated to maximize performance for a given line pitch?
2. For a given line pitch, what criteria affect the optimal interval at which repeaters should be inserted into global interconnects?
3. Under what circumstances are shield wires the optimum technique for improving interconnect performance?
4. In global interconnect with repeaters, what other interconnect tuning is possible?
We answer these questions using technology parameters from a representative 0.25 µm CMOS process; this matches the process technology context for many current- and next-generation microprocessors. Coupling capacitance studies are performed with the commercial QuickCap 3-D field solver, and interconnect delay and noise coupling studies are performed with the commercial HSPICE simulator. Of particular interest is our study of question 4: we demonstrate that a new methodology for offsetting repeater placements can reduce worst-case cross-chip delays by over 30% in current technologies, versus traditional repeater insertion methodology. All parameters used in this paper are obtained using drawn dimensions of the transistors. Actual transistor widths and interconnect length/width/spacing values correspond to a 64% shrink of drawn dimensions (of course, the 0.25 µm process itself refers to actual dimension).

Allocation of Width and Spacing for Given Pitch
Our first study examines how to choose a set of pitches for wires used in routing. To choose the best pitches for a given layer, we plot the decrease in pure interconnect delay against the increase in pitch, with respect to some default or minimum pitch. Ideally, if the decrease in delay matches the increase in pitch, it is beneficial to go for higher pitches. However, if the curve starts to flatten (i.e., for every given percentage increase in pitch a lesser percentage decrease in delay results), this indicates diminishing returns. Using such delay-pitch plots we have chosen three optimal pitches for routing: (i) default, (ii) fast pitch, and (iii) super-fast pitch.
Figure 1: Decrease in pure interconnect delay (i.e., without any load at the end of the line) as pitch for M3 wire is increased. We see that the curve starts to flatten, i.e., the decrease in delay saturates when the pitch increase goes beyond 80% of nominal.
Our next study seeks to determine how width and spacing should be optimally allocated for a given line pitch. In practice, the actual line width used is considerably greater than the minimum line width achievable in lithography. Thus, there is freedom to tune the width and spacing once assumptions are in place for line thickness and target line length. We note that because very long inter-block lines will have repeaters inserted regularly (see Section 3 below), the maximum line length of interest is equal to the optimum interval between repeaters; this length ranges between 2500 µm and 5000 µm for global interconnect layers in leading-edge technologies.
We have performed detailed studies of "fast" M3 interconnect with 3.2 µm pitch, assuming that M2 crossunders are dense (i.e., can be approximated as a ground plane) and explicitly modeling M4 crossovers. Dielectric modeling is based on actual layer data for a representative 0.25 µm CMOS process. QuickCap was used to extract coupling and area capacitances, summarized in Table 1. As is typical in such analyses, we assume worst-case coupling, i.e., a total coupling factor of 4.0 (worst-case coupling factor of 2.0 to each of the left and right neighbors of the victim line under analysis).
Table 3: Delay estimates for various M3 line configurations. Driver and receiver buffer sizes: wp = 100 µm, wn = 50 µm. Delay is computed from input of driver to input of receiver.
Table 3 shows HSPICE-computed line delays for M3 line lengths ranging from 4000 µm to 6000 µm. Again, dense M2 is assumed to be a ground plane, and M4 crossovers are modeled explicitly. The table shows that (width, spacing) = (1.2, 2.0) µm gives the best performance for the given line pitch.
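The trade-off behind this study can be illustrated with a first-order sketch: at a fixed pitch, widening the line lowers its resistance but narrows the spacing and so raises the coupling capacitance. The Python fragment below sweeps the width/spacing split of a 3.2 µm pitch under a simple Elmore model with the worst-case coupling factor of 4.0. The model form and every coefficient (sheet resistance, capacitance coefficients, driver resistance, load) are assumptions chosen for illustration; they are not the QuickCap extractions or HSPICE results of this study, and the function names are hypothetical.

```python
# First-order sketch of the width/spacing trade-off at a fixed pitch.
# All coefficients below are hypothetical placeholders, not process data.

def line_delay_ps(width_um, spacing_um, length_um=5000.0,
                  sheet_res_ohm_sq=0.05,  # assumed sheet resistance (ohm/square)
                  c_area_ff_um2=0.03,     # assumed area capacitance (fF/um^2)
                  c_fringe_ff_um=0.04,    # assumed fringe capacitance (fF/um)
                  c_couple_ff=0.05,       # assumed coupling to ONE neighbor (fF/um) at 1 um spacing
                  coupling_factor=4.0,    # worst case: 2.0 to each of two neighbors
                  r_driver_ohm=100.0,     # assumed driver output resistance
                  c_load_ff=50.0):        # assumed receiver input capacitance
    """Elmore delay (ps) of driver + distributed RC line + receiver load."""
    r_wire = sheet_res_ohm_sq / width_um * length_um
    c_per_um = (c_area_ff_um2 * width_um + c_fringe_ff_um
                + coupling_factor * c_couple_ff / spacing_um)
    c_wire = c_per_um * length_um
    delay_fs = (r_driver_ohm * (c_wire + c_load_ff)
                + r_wire * (c_wire / 2 + c_load_ff))
    return delay_fs / 1000.0  # ohm * fF = fs, so divide by 1000 for ps


def best_split(pitch_tenths=32, min_tenths=8):
    """Sweep width in 0.1 um steps; spacing takes the rest of the pitch."""
    candidates = [(w / 10, (pitch_tenths - w) / 10)
                  for w in range(min_tenths, pitch_tenths - min_tenths + 1)]
    return min(candidates, key=lambda ws: line_delay_ps(*ws))


if __name__ == "__main__":
    w, s = best_split()
    print(f"width={w:.1f} um, spacing={s:.1f} um, "
          f"delay={line_delay_ps(w, s):.1f} ps")
```

Because the driver term penalizes total capacitance while the wire term penalizes resistance, the minimum of such a model falls strictly inside the sweep range rather than at either extreme; where exactly it falls depends entirely on the assumed coefficients.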

Bounding the Interval Between Repeaters
A very basic study (in some sense a prerequisite to all other interconnect tuning) asks how often repeaters should be inserted into global interconnects. This is of course a chicken-and-egg problem, in that the optimum repeater interval depends on the interconnect tuning, and the interconnect tuning depends on the maximum run ever made without an intervening repeater. However, the following can be noted.
A body of studies shows that repeaters should be inserted at uniform intervals. In other words, there should be a constant interconnect length or interconnect delay between each pair of adjacent repeaters; the first and last segments of the path are exceptions because in practice the driver and receiver sizes may not be the same as the repeater size. Actually, such theoretical results deviate from real-life practice. On any source-destination path the repeater sizes need not be the same. It may also be better to add repeaters in parallel in order to drive larger wire lengths. This is not just for performance: repeaters locally affect device area and routing constraints. However, our studies have not yet addressed such layout issues. Using the same principle, and with certain types of methodology and chip planning constraints, it can be better to increase the size of the drivers inside the block as much as possible, which would increase the first segment length.
Table 4: Summary of M3 interconnect slew times. M4 is top layer; M1 is bottom layer. Two combinations of width/spacing are shown, along with three different coupling factor assumptions. The input slew time is 400 ps and the output slew times are computed as 10%-90% for rise time and 90%-10% for fall time.
Assuming that the driver size and the receiver size are the same as the size of the repeaters inserted along the path, we calculate the total delay, optimal number of repeaters and optimal distance between the repeaters.
The total delay for a path with K repeaters is the sum of the delays of its K+1 stages, $T_{tot} = \sum_{i=1}^{K+1} T_{stage,i}$. The delay of the first stage is the total delay from the output of the driver to the input of the first repeater, i.e., $T_{first\ stage} = T_{gd} + T_{int}$, where the gate load delay is $T_{gd} = R_{rep}(C_{eff,int} + C_{rep})$, the interconnect delay is $T_{int} = R_{int}(C_{int}/2 + C_{rep})$, and $R_{rep}$, $C_{rep}$ are the repeater output resistance and input gate capacitance. The effective capacitance at the gate output can be approximated as $C_{eff,int} = \alpha C_{int}$, where $\alpha$ is a constant between 1/6 and 1/8. Let $L_p$ be the interconnect path length between driver and receiver. Then for optimal placement of repeaters the interconnect length between repeaters is $L_p/(K+1)$. Therefore, the total delay for the path is
$$T_{tot} = (K+1)\left[ R_{rep}\left(\frac{\alpha c L_p}{K+1} + C_{rep}\right) + \frac{r L_p}{K+1}\left(\frac{c L_p}{2(K+1)} + C_{rep}\right) \right],$$
where r, c are the resistance and capacitance per unit length of the interconnect line. We compute the optimal number of repeaters that minimizes total delay by setting $\partial T_{tot}/\partial K = 0$, and obtain
$$K = L_p \sqrt{\frac{rc}{2 R_{rep} C_{rep}}} - 1.$$
To minimize total delay, gate load delay and interconnect delay should be equal. If effective capacitance is not considered in the gate load delay computation, and with current technology trends, gate load delay will always be greater than interconnect delay. Under these conditions, to minimize total delay one can increase the time of flight (or wire length) between repeaters until slew time constraints become tight. In the current range of 0.35 µm and 0.25 µm process generations, global interconnects have repeaters inserted with periods ranging from 2500 µm to 10000 µm.
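The stage delay model defined above can be evaluated numerically. The sketch below assumes identical stages (driver and receiver sized like the repeaters), uses the stage definitions T_gd = R_rep (α C_int + C_rep) and T_int = R_int (C_int/2 + C_rep), and derives the continuous optimum K = L_p sqrt(rc / (2 R_rep C_rep)) - 1 that follows from setting the derivative of the total delay to zero. The function names and the repeater values in the example (R_rep = 200 Ω, C_rep = 250 fF) are hypothetical; the wire values reuse the 50 Ω and 200 fF per 1000 µm figures of Technology I with no Miller factor applied.

```python
import math


def total_delay_fs(k, path_len_um, r_per_um, c_per_um, r_rep, c_rep, alpha=1 / 6):
    """Total delay (fs) of a path with k identical, uniformly spaced repeaters.

    Units: ohm and fF, so products are in femtoseconds.
    """
    seg = path_len_um / (k + 1)
    r_int = r_per_um * seg
    c_int = c_per_um * seg
    t_gd = r_rep * (alpha * c_int + c_rep)   # gate load delay of one stage
    t_int = r_int * (c_int / 2 + c_rep)      # interconnect delay of one stage
    return (k + 1) * (t_gd + t_int)


def optimal_repeater_count(path_len_um, r_per_um, c_per_um, r_rep, c_rep):
    """Continuous optimum from d(T_tot)/dK = 0 (clamped at zero)."""
    k = path_len_um * math.sqrt(r_per_um * c_per_um / (2 * r_rep * c_rep)) - 1
    return max(k, 0.0)


def best_integer_count(path_len_um, r_per_um, c_per_um, r_rep, c_rep, alpha=1 / 6):
    """T_tot is convex in K, so the integer optimum is floor or ceil of the above."""
    k = optimal_repeater_count(path_len_um, r_per_um, c_per_um, r_rep, c_rep)
    candidates = {math.floor(k), math.ceil(k)}
    return min(candidates, key=lambda n: total_delay_fs(
        n, path_len_um, r_per_um, c_per_um, r_rep, c_rep, alpha))


if __name__ == "__main__":
    # Hypothetical repeater: 200 ohm output resistance, 250 fF input capacitance.
    args = dict(path_len_um=10000.0, r_per_um=0.05, c_per_um=0.2,
                r_rep=200.0, c_rep=250.0)
    k = best_integer_count(**args)
    print(k, total_delay_fs(k, **args) / 1000.0, "ps")
```

Note that α only adds a term that is independent of K, so under this model it changes the total delay but not the optimal number of repeaters.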
Repeater insertion is also driven by pure interconnect delay, since larger time of flight implies larger slew time on the transition seen at the receiver. Edges with large slew times cause much larger gate delays, are more susceptible to noise, are more susceptible to process-distribution-influenced delay variations, and also increase the short-circuit power dissipation. Even in today's designs, slew times above 600-700 ps cannot be tolerated. Thus, even without the delay minimization objective, edge rate control will force insertion of repeaters. In fact, some of the functionality of "post-layout optimization" tools for gate sizing and repeater insertion is driven by edge rate checks as opposed to signal delay reduction.
In practice, repeaters will be implemented using inverters whenever possible, due to performance and area efficiency.
Rule3 and the Double-VSS rule (Rule1 width/spacing, but every other line grounded) both allow three signal lines per 13.2 µm.
Table 4 summarizes M3 interconnect slew times for line width 1.0 µm and line spacing 1.2 µm (corresponding to a "dense" M3 routing pitch), and input slew time of 400 ps. All capacitance extractions were performed with QuickCap, and correspond to M4 and M1 as the top and bottom ground planes, respectively. Switching factors range from 4 (both neighbors switching in the opposite direction from the victim) to 2 (both neighbors quiet, or one neighbor switching in the opposite direction and one neighbor switching in the same direction with respect to the victim). We see that the M3 distance between repeaters has an upper bound of 5000 µm due to edge rate considerations alone. Separate studies show that this upper bound on distance between repeaters is essentially unaffected by changes to the driver/receiver sizing or the input slew time.
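The switching factors quoted above follow from a per-neighbor Miller factor. As a minimal sketch, the Python fragment below assumes the usual convention of 2 for a neighbor switching in the opposite direction, 1 for a quiet neighbor and 0 for a neighbor switching in the same direction; this convention is consistent with the totals of 4 and 2 stated in the text, but the names `NEIGHBOR_FACTOR` and `switching_factor` are hypothetical.

```python
from itertools import product

# Assumed per-neighbor Miller factor, relative to the victim's transition.
NEIGHBOR_FACTOR = {"opposite": 2, "quiet": 1, "same": 0}


def switching_factor(left, right):
    """Total switching factor seen by the victim for two neighbor states."""
    return NEIGHBOR_FACTOR[left] + NEIGHBOR_FACTOR[right]


if __name__ == "__main__":
    for left, right in product(NEIGHBOR_FACTOR, repeat=2):
        print(f"{left:>8} / {right:<8} -> {switching_factor(left, right)}")
```

Under this convention both "both neighbors quiet" and "one opposite, one same" give a total factor of 2, which is why the text treats them as the same case.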

Benefits of Shield Wiring
Our third study addresses the question of whether shield wiring is an effective means of improving delay and signal integrity performance of long global interconnects. We consider various width-spacing rules for M3 interconnect, in order to evaluate the utility of spacing vs. shielding techniques. The pitch-matched rules considered are shown in Figure 2. Again, QuickCap was used to extract capacitive couplings of a given victim line to its neighbor lines and the neighboring top/bottom layers; these results are shown in Table 5. Notice that the Rule1, Rule2 and Rule3 rules have worst-case coupling factors = 4. On the other hand, the Single-VSS rule has worst-case coupling factor = 3, and the Double-VSS rule has worst-case coupling factor = 2. Table 6 shows the delay performance for a 4000 µm M3 line, under various bottom ground and top plane configurations. We observe:
The Rule3 rule provides a 37% decrease in total delay, but since C_eff was not used in the gate load delay computation, actual delay reductions could be even greater.
The Single-VSS rule is less effective than the Rule2 rule; note that the two rules are equivalent in terms of effective routing density. Our studies have not yet addressed the routing interactions that can potentially affect this analysis. In particular, shield lines may be added to bring power and ground connections to repeater blocks.
The Double-VSS rule gives improved total delays compared with the Rule3 rule, with the rules being equivalent in terms of effective routing density. However, the Rule3 rule yields smaller interconnect delays, so that driver size reductions have greater potential for delay improvement. Thus, the Rule3 rule seems preferable. When two buses have activity patterns such that each is quiet when the other is active, then their lines can be interleaved such that they effectively follow the Double-VSS rule. In such a case, interleaving is clearly superior to the Rule3 rule, since the effective routing density is doubled.
Gate load delays are larger than interconnect delays, suggesting that it is preferable to decrease line widths and increase line spacings.We also note that a dense M4 top layer decreases total delay, and a dense M2 bottom ground plane layer decreases total delay for smaller line widths only.
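The routing densities and worst-case coupling factors of the five rules compared above can be checked with a short sketch. It models each rule as a periodic pattern of signal (S) and grounded (G) tracks inside the 13.2 µm window and assumes a per-neighbor factor of 2 for a signal neighbor switching in the opposite direction and 1 for a grounded neighbor. The track counts are those stated for Figure 2; the function `analyze_rule` and the periodic-pattern representation are illustrative assumptions.

```python
def analyze_rule(tracks_per_window, ground_every=None):
    """Return (signal lines per window, worst-case coupling factor).

    tracks_per_window: wire tracks that fit in the 13.2 um window.
    ground_every: if n, every n-th track is grounded; None means no shields.
    """
    pattern = []
    for i in range(tracks_per_window):
        grounded = ground_every is not None and (i + 1) % ground_every == 0
        pattern.append("G" if grounded else "S")

    n = len(pattern)
    worst = 0
    for i, kind in enumerate(pattern):
        if kind != "S":
            continue
        # The pattern repeats, so neighbors wrap around the window.
        factor = sum(2 if pattern[(i + d) % n] == "S" else 1 for d in (-1, 1))
        worst = max(worst, factor)
    return pattern.count("S"), worst


if __name__ == "__main__":
    rules = {
        "Rule1": (6, None),
        "Rule2": (4, None),
        "Rule3": (3, None),
        "Single-VSS": (6, 3),  # Rule1 width/spacing, every third line grounded
        "Double-VSS": (6, 2),  # Rule1 width/spacing, every other line grounded
    }
    for name, (tracks, ground_every) in rules.items():
        print(name, analyze_rule(tracks, ground_every))
```

The sketch reproduces only the counting argument (Single-VSS matches Rule2 in density, Double-VSS matches Rule3); the delay comparison itself depends on the extracted capacitances in Table 5.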

New Repeater Offset Methodology for Global Buses
Finally, we study another form of tuning that is possible for global interconnects. Our motivations are three-fold: (i) global interconnect is increasingly dominated by wide buses; (ii) present methodology designs global interconnects for worst-case Miller coupling; and (iii) present methodology routes long global buses using repeater blocks, i.e., blocks of co-located inverters spaced every, say, 4000 µm.
We have proposed a simple method to improve global interconnect performance. The idea is to reduce the worst-case Miller coupling by offsetting the inverters on adjacent lines (see Figure 3). In the previous methodology (Figure 3(a)), the worst-case switching of a neighbor line (i.e., simultaneously and in the opposite direction to the switching of the victim line) persists through the entire chain of inverters. However, with offset inverter locations (Figure 3(b)), any worst-case simultaneous switching on a neighbor line persists only for half of each period between consecutive inverters, and furthermore becomes best-case simultaneous switching for the other half of the period.
Table 7: HSPICE delays (ns) for three lines of length 10000 µm, using Technology I, for all combinations of rising (R) and falling (F) initial transition on the input waveform. We show delays for inverter phases (0,0) and (0.5,0.5) on the left and right neighbors of the middle line (phase 0).
To confirm the advantages of this method, the following experimental methodology was used.
We study systems of three parallel interconnect lines, with lengths either 10000 µm or 14000 µm. These lines are stimulated by a waveform with risetime = falltime = 200 ps. The middle line is considered the "victim" for analysis purposes. We model two "technologies" representative of M3 and M4 in a 0.25 µm CMOS process. In each technology, line resistance is 50 Ω per 1000 µm. In Technology I, capacitive couplings to left neighbor, ground and right neighbor per 1000 µm are respectively 60 fF, 80 fF and 60 fF. In Technology II, capacitive couplings to left neighbor, ground and right neighbor per 1000 µm are respectively 80 fF, 160 fF and 80 fF.
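To give a feel for how strongly Miller coupling loads the victim in these two technologies, the sketch below lumps the neighbor activity into per-neighbor factors (2 for opposite switching, 1 for quiet, 0 for same-direction switching) and evaluates the effective capacitance per 1000 µm, together with the wire-only Elmore term 0.5·R·C for one 4000 µm stage. The lumped-factor approximation and the wire-only delay term are simplifying assumptions; the HSPICE experiments in this study model the coupled lines explicitly. Function names are hypothetical.

```python
# Per-1000-um parasitics as stated in the text: (left, ground, right) in fF.
TECH = {"I": (60.0, 80.0, 60.0), "II": (80.0, 160.0, 80.0)}
R_OHM_PER_MM = 50.0  # line resistance per 1000 um


def effective_cap_ff_per_mm(tech, left_factor, right_factor):
    """Victim capacitance per 1000 um with lumped Miller factors."""
    c_left, c_ground, c_right = TECH[tech]
    return c_ground + left_factor * c_left + right_factor * c_right


def stage_wire_delay_ps(tech, left_factor, right_factor, stage_len_um=4000.0):
    """Wire-only distributed RC term 0.5*R*C for one repeater stage (ps)."""
    mm = stage_len_um / 1000.0
    r = R_OHM_PER_MM * mm
    c = effective_cap_ff_per_mm(tech, left_factor, right_factor) * mm
    return 0.5 * r * c / 1000.0  # ohm * fF = fs


if __name__ == "__main__":
    for tech in TECH:
        worst = effective_cap_ff_per_mm(tech, 2, 2)
        best = effective_cap_ff_per_mm(tech, 0, 0)
        print(tech, worst, best, stage_wire_delay_ps(tech, 2, 2))
```

For Technology I the worst-case effective capacitance is four times the best-case value under this approximation, which indicates how much headroom there is for any technique that avoids sustained opposite-direction switching.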
We assume a period between inverters (repeaters) of 4000 µm. So that HSPICE cannot introduce any error in its RC analysis, we manually distributed the line and coupling parasitics into 40 µm segments, i.e., repeaters occurred every 100 segments, and line lengths were 250 or 350 segments. Each segment is modeled as a double-pi model. This segmenting is chosen such that any finer-grain representation does not change the HSPICE-computed delays.
We always place the inverters on the middle line with "phase = 0", i.e., at positions 4000, 8000, ... microns along the line. Inverters on the left and right neighbors are placed according to all combinations of phase = 0, 0.1, 0.2, ..., 0.9 (again with respect to the period of 4000 µm). There are 100 different phase combinations. Figure 3 shows the three-line configurations with left/right neighbor phase combinations of (0,0) and (0.5,0.5).
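The intuition behind the phase offset can be sketched with an idealized model. Each inverter flips the polarity of the transition travelling down its line, so the relative switching direction of victim and neighbor in a given 40 µm segment depends only on the parity of the number of inverters upstream on each line. The fragment below assumes that a neighbor with phase p has inverters at (p + k)·4000 µm strictly inside the line, and that transitions on adjacent segments are simultaneous; it ignores propagation delay and waveform shape, which the HSPICE experiments capture. All names are hypothetical.

```python
PERIOD_SEG = 100  # 4000 um repeater period / 40 um segments


def inverter_positions(phase_tenths, n_seg):
    """Segment indices at which inverters sit, for a phase given in tenths."""
    offset = phase_tenths * PERIOD_SEG // 10
    return [p for p in range(offset, n_seg, PERIOD_SEG) if p > 0]


def average_miller_factor(phase_tenths, n_seg, inputs_opposite=True):
    """Average per-neighbor Miller factor along the victim (phase 0).

    A segment contributes 2 when victim and neighbor switch in opposite
    directions there, and 0 when they switch in the same direction.
    """
    victim = inverter_positions(0, n_seg)
    neighbor = inverter_positions(phase_tenths, n_seg)
    total = 0
    for seg in range(n_seg):
        v_inv = sum(1 for p in victim if p <= seg)
        n_inv = sum(1 for p in neighbor if p <= seg)
        same_parity = (v_inv - n_inv) % 2 == 0
        # Equal parity preserves the input relationship; unequal flips it.
        opposite_here = same_parity == inputs_opposite
        total += 2 if opposite_here else 0
    return total / n_seg


if __name__ == "__main__":
    n_seg = 250  # 10000 um line
    for phase in (0, 5):
        worst = max(average_miller_factor(phase, n_seg, opp)
                    for opp in (True, False))
        print(f"phase {phase / 10:.1f}: worst average factor {worst}")
```

In this idealized picture, co-located inverters (phase 0) sustain a factor of 2 along the whole line for opposite-direction inputs, whereas a half-period offset alternates between worst-case and best-case coupling, so no input combination sees worst-case coupling over the full length.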
We stimulate the three lines with the periodic waveform, with the first transition either rising (R) or falling (F). There are eight combinations of directions for the first transitions, i.e., RRR, RRF, ..., FFF.
Finally, we may offset the input waveforms of the left and right neighbors by -100 ps, 0 ps or +100 ps with respect to the input waveform of the middle line. There are nine combinations of these input offsets.
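The size of the experimental space described above can be enumerated directly. The sketch below builds the phase, direction and input-offset combinations with `itertools.product` and derives the segment counts from the 40 µm segmentation; the variable names are hypothetical, and the final count simply assumes the full cross product is simulated for each technology and line length.

```python
from itertools import product

SEGMENT_UM = 40
PERIOD_UM = 4000
LINE_LENGTHS_UM = (10000, 14000)

# Segment bookkeeping: repeater period and line lengths in 40 um segments.
PERIOD_SEGMENTS = PERIOD_UM // SEGMENT_UM
LINE_SEGMENTS = {length: length // SEGMENT_UM for length in LINE_LENGTHS_UM}

# Left/right neighbor inverter phases 0, 0.1, ..., 0.9 (victim fixed at 0).
PHASES = [i / 10 for i in range(10)]
PHASE_COMBOS = list(product(PHASES, repeat=2))

# First-transition directions on (left, middle, right): RRR, RRF, ..., FFF.
DIRECTIONS = ["".join(t) for t in product("RF", repeat=3)]

# Input waveform offsets of (left, right) neighbors relative to the victim.
OFFSETS_PS = (-100, 0, 100)
OFFSET_COMBOS = list(product(OFFSETS_PS, repeat=2))

RUNS_PER_CONFIGURATION = len(PHASE_COMBOS) * len(DIRECTIONS) * len(OFFSET_COMBOS)

if __name__ == "__main__":
    print(PERIOD_SEGMENTS, LINE_SEGMENTS)
    print(len(PHASE_COMBOS), len(DIRECTIONS), len(OFFSET_COMBOS))
    print(RUNS_PER_CONFIGURATION)
```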
Table 7 shows HSPICE delays for systems of three lines of length 10000 µm, using Technology I, for all combinations of rising (R) and falling (F) initial transition on the input waveform. The table shows delays for inverter phases (0,0) and (0.5,0.5) on the left and right neighbors of the middle line (phase 0). The effect of Miller coupling is clearly shown.
Table 8 shows the worst-case delays (with respect to all eight possible combinations of rising and falling inputs) for the middle line, for each combination of phases for the inverter locations on the left and right neighbor lines. Input offsets are all 0, i.e., the waveforms start at the same time. All four combinations of technology and line length are shown. In every case, the optimum phase combination is (0.5,0.5), while the traditional phase combination of (0.0,0.0) is actually the worst possible. The worst-case delay is reduced by anywhere from 25% to 30% when the repeaters are placed with optimum phase. Finally, Table 9 shows the same worst-case delays for the middle line, this time taken over all eight rise/fall combinations and all nine combinations of input waveform offsets. Again, even when the inputs do not switch perfectly simultaneously, the best phase combination is (0.5,0.5) and the worst phase combination is the traditional (0.0,0.0) methodology.

Conclusions
To our knowledge, this work has provided the first technology-specific studies of interconnect tuning in the literature. We have described experimental approaches to interconnect tuning issues related to bus routing, repeater insertion, and choice of shielding/spacing rules for signal integrity and performance. In particular, four questions have been addressed: allocation of width and spacing to maximize performance for a given pitch, finding the optimal interval for repeater insertion, assessing the potential benefits of shield wiring, and optimizing the insertion of repeaters in global buses. Our answers to these questions are at times surprising: in answering (3), we demonstrate that current shielding methodologies may be suboptimal when compared with alternate width/spacing rules, and in answering (4), we propose a new repeater offset technique that can reduce worst-case cross-chip delays by over 30% in current technologies. Ongoing efforts extend our interconnect tuning research to encompass layer thicknesses, more detailed analyses of noise coupling and tuning to meet noise margins, and the delay/noise behavior in emerging technology regimes (Cu interconnect and low-K dielectrics or air-gaps). Finally, we seek to develop more complete full-chip interconnect tuning approaches based on analyses of the interconnect structure, speed target, and power dissipation target for a given design.
A. Line length 10000 µm, Technology I.
Figure 1 plots the decrease in delay versus the increase in pitch for M3 wire in a representative 0.25 µm CMOS process.

Figure 2: Pitch-matched width-spacing rules. Rule1 allows six lines per 13.2 µm; Rule2 and the Single-VSS rule (Rule1 width/spacing, but every third line grounded) both allow four signal lines per 13.2 µm.

Figure 3: Reduction of worst-case Miller coupling by offsetting inverters. In (a), inverters on the left and right neighbor lines are at phase = 0 with respect to the inverters on the middle line. In (b), inverters on the left and right neighbors are at phase = 0.5.

Table 1 reproduces several technology projections from the 1997 SIA National Technology Roadmap for Semiconductors [15]. The implications of technology scaling, particularly for interconnects, are very complicated. Example considerations for a 7-layer metal (7LM) process might include (cf. [16]):
Table 1: Selected technology projections from the 1997 SIA NTRS.

Table 5: M3 coupling capacitances extracted using QuickCap for various interconnect tuning rules and combinations of bottom and top planes.

Table 6: Delay estimates for a 4000 µm M3 line, under various interconnect tuning configurations. Driver and receiver buffer sizes: wp = 100 µm, wn = 50 µm. Delay is computed from input of driver to input of receiver.

Table 9: Worst-case delays with all combinations of input offsets.