Area and Power Modeling for Networks-on-Chip with Layout Awareness

Networks-on-Chip (NoCs) are emerging as scalable interconnection architectures, designed to support the increasing number of cores that are integrated onto a silicon die. Compared to traditional interconnects, however, NoCs still lack well-established CAD deployment tools to tackle the large number of available degrees of freedom, starting from the choice of a network topology. “Silicon-aware” optimization tools are now emerging in the literature; they select an NoC topology taking into account the tradeoff between performance and hardware cost, that is, area and power consumption. A key requirement for the effectiveness of these tools, however, is the availability of accurate analytical models for power and area. Such models are unfortunately not as available and well understood as those for traditional communication fabrics. Further, simplistic models may turn out to be totally inaccurate when applied to wire-dominated architectures; this observation calls at least for a model validation step against placed and routed devices. In this work, given an NoC reference architecture, we present a flow to devise analytical models of area occupation and power consumption of NoC switches, and propose strategies for coefficient characterization which have different tradeoffs in terms of accuracy and of modeling activity effort. The models are parameterized on several architectural, synthesis-related, and traffic variables, resulting in maximum flexibility. We finally assess the accuracy of the models, checking whether they can also be applied to placed and routed NoC blocks.


INTRODUCTION
Current and future Systems-on-Chip (SoCs) achieve increasingly better functionality and performance by integrating larger amounts of processing elements. This growth in computing resources must be matched by a corresponding evolution of the interconnection infrastructure. Traditional communication fabrics exhibit scalability issues in terms of performance and physical circuit design. The Network-on-Chip (NoC) paradigm, which brings networking concepts to the on-chip domain, answers such concerns with a scalable design, at both the architectural and physical levels.
NoC design is a discipline with a large number of degrees of freedom. While this is an advantage, in that the interconnect can be optimally tailored to match the application at hand, it creates a critical issue when exploring the design space from the hardware overhead point of view. Given the flexibility of NoCs, an exhaustive exploration would require an impractical number of synthesis runs, and a thorough characterization of switching activity in every candidate topology to properly assess power consumption.
An answer to the NoC customization complexity lies in the deployment of robust CAD tools, introducing the ability for automated design space exploration according to some fast optimization algorithm. In turn, however, such an approach requires the availability of accurate analytical models of NoC area and power consumption to drive the optimization algorithms. Models allow the designer to quickly preestimate the area requirements and power consumption overhead introduced by the candidate interconnects. However, it must be noted that, as with any hardware component, the hardware cost of an NoC switch depends on several kinds of parameters, including (i) architectural (e.g., amount of buffering), (ii) synthesis tool-related (e.g., target operating frequency), (iii) operating (e.g., traffic flows).
It is also sometimes easy to underestimate the complexity of synthesis flows, which involve multiple tools, increasingly complex libraries, a large number of heuristics, and several approximations or models of the behavior of physical on-chip devices. Experience shows that one wrong assumption may severely impact the properties of a whole design. For example, in recent years, the increasing importance of wiring resources has sometimes impacted the final performance of circuits in unexpected ways, for example, by introducing higher parasitic capacitances and therefore power consumption, or by forcing redesign iterations due to higher transition latencies. This scenario calls for careful assessment of the accuracy of any predictive model at the lowest available level of abstraction; if possible, designers should try to validate their assumptions on placed and routed netlists. This obviously requires a large effort and may be impractical.
As a major contribution of this work, we propose an NoC modeling methodology which takes advantage of the designer's knowledge of the target architecture and synthesis library. It is of course impossible to devise an accurate yet fully generic model for the hardware cost, in power and area, of any given NoC. Our focus is instead on how such a model can be built for a specific NoC instance; we will illustrate the challenges and opportunities involved in this flow, in terms of accuracy and characterization time.
For our analysis, we choose the ×pipes NoC switch as a case study due to its parameterizability (Section 3). Key properties of our approach include accuracy and explicit modeling on several parameters of the design, like switch cardinality, flit width, buffering, traffic, and synthesis parameters. These properties make the approach suitable for fast exploration of large parts of the fabric design space, flexible and applicable in real life, for example, by accounting for the behavior of the synthesis tools when the target operating frequency approaches the limits of the design. The characterization is dependent on the target technology library, but can be easily scripted and automated. A major novel feature of our study, improving on our previous work [1], is that we explore the accuracy of our modeling style against placed and routed test instances; we feel that this is an essential step given the uncertainties intrinsic in today's technology processes. We also show that model coefficients can be made even more accurate by using a placed and routed training set for characterization, albeit at a modeling effort cost, and that the remaining inaccuracies can mostly be attributed to the intrinsic variations induced by synthesis tools.
Our approach starts from an existing RTL description of the NoC components, which are then synthesized and characterized under multiple architectural configurations and traffic conditions. A mathematical formulation of the area and power models is derived from empirical evidence and from the designer's knowledge of the NoC. Eventually, the coefficients of the model are fitted to the experimental results, guaranteeing accurate results for the given architecture. We present two different ways of characterizing the coefficients, with varying accuracy/effort tradeoffs, and two models to account for the dependency of synthesis results on the target synthesis frequency. The synthesis process can optionally include the placement and routing step for maximum thoroughness of the assessment.

RELATED WORK
NoCs have recently been proposed as a way to overcome the scalability issues of traditional interconnects [2][3][4]. Research has focused on multiple design levels. From the architectural point of view, a complete scheme is presented for example in [5], while specific topics are tackled in several works, such as quality of service (QoS) provisions [6] and asynchronous implementations [7]. The complexity in tackling the configurability of NoCs has been made clear by [8], where synthesis results for switches show widely different hardware requirements. Hence, the need for the development of algorithms and CAD tool-chains for NoC instantiation and optimization, as found for example in [9,10]. A requirement of such works is the availability of reliable power and area models.
Power models and simulators for processors and memories have been proposed in an extremely large body of research [11,12]. Interconnects have also become the focus of research [13], due to their increasing role in the hardware budget of recent and future systems; for example, the on-chip network of the MIT RAW chip multiprocessor is taking 36% of the chip power budget on average [14].
Some models of NoC hardware cost have already been proposed in previous literature. Results in [15] are derived from a mix of results on template circuits and from technology trends, and are specifically aimed at wide applicability. Therefore, even though they have been used for design space exploration [16] and in association with high-level traffic injection models [17], they do not guarantee maximum accuracy within an architecture-specific CAD flow. The main advantage of these techniques is flexibility and fast deployment. We see them as complementary to our approach, especially for initial exploration when the NoC component library is not available yet.
The approach in [18], on the other hand, attempts to build a cycle-accurate power model of a target router instance. However, several major points differentiate our approach. First, we build a model which is parametric not only on traffic-related events, but also on the architectural knobs of the design. Second, we include an area model in the exploration. Third, our model can be more readily adopted within a CAD mapping flow; this is both because we express the model as a function of architectural parameters, and because we provide a high-level dependence on traffic variables, instead of a cycle-by-cycle one. Fourth, we strive to make our approach as applicable as possible in real-world conditions, including the hard-to-model peculiarities of the behavior of synthesis tools when aiming for maximum frequency operation, and placement and routing issues. Fifth, we propose a fast characterization mechanism, by means of which model coefficients can be quickly derived with a minimal amount of synthesis runs.
In [19], a framework for NoC exploration is presented; the framework includes a power modeling flow. The power model features very limited dependence on architectural parameters and does not seem to account for the configuration knobs of synthesis tools. No area model is provided. In [20], a bit energy modeling flow is proposed to compare different switch fabrics in IP network routers. The approach is focused on the cost for transmitting bits from input to output ports, and while bit pattern-accurate, it is only focused on comparing router topologies against each other. The authors of [21] propose a model based on transistor count, while in [22], which is focused on FPGAs, switch cardinality is the main parameter. None of these models is meant for simultaneously accurate, parametric, and fast representation of power consumption, that is, suitable for design space exploration within a CAD environment.

THE ×PIPES SWITCH ARCHITECTURE
We choose the switch architecture defined in the ×pipes NoC component library [8,23] as a case study, due to its customizability. The ×pipes switch (Figure 1) is output buffered; FIFOs of configurable depth are instantiated at each output, while inputs feature a single register. The flit width can be arbitrarily set. The number of input and output ports is also a parameter; full connectivity is provided in the central crossbar. An arbiter is attached to each output port to handle contention issues. We test the switch with its default ACK/NACK flow control mechanism, which leverages the output buffer resources. Since ×pipes performs source routing, the switch does not include routing LUTs.

PROPOSED MODELING METHODOLOGY
Our modeling activity is composed of five main phases. First, we devise a set of parameters that are relevant to the accuracy of any model which aims at practical applicability. Second, we define a general model formula for area and for power, relying on the knowledge of the target architecture as explained in Section 3. Third, we synthesize several configurations (training set) of the target switch architecture in a 0.13 µm technology library with Design Compiler [24], and measure the corresponding area and power consumption. The configurations are chosen so as to uniformly but sparsely cover the design space of interest, therefore allowing for an accurate yet quick construction of the model. Fourth, we use experimental results to numerically quantify the coefficients of the model. As outlined later, we propose two different ways of performing this step. Fifth, we assess the quality of our models against configurations (test set) outside of the training set. The first four steps will be covered in Sections 4.1, 4.2, 4.2.3, 4.3, while the fifth will be discussed in Section 5. An outline of how we handle steps 3-5 is provided in Figure 2.

Parameters of interest
A key phase of the approach is devising a model that matches the architecture under consideration and its properties. However, considering the architecture alone does not guarantee that the model will be applicable and accurate enough in practice. For example, synthesis tools play a primary role in defining the area and power efficiency of a component. Therefore, we first summarize the parameters of interest when assembling our model.

Architectural Parameters
The main parameters are (i) switch cardinality (number of ports); to account for rectangular switches, we separately consider the number of input ports ($np_i$) and output ports ($np_o$); (ii) amount of buffering devoted to flow control handling and performance optimization, also called buffer depth ($bd$) (expressed in terms of single-flit buffering elements); (iii) number of bits of the incoming and outgoing elementary data blocks, also called flit width ($fw$).

[Figure: area as a function of target frequency, with $f_n$ and $f_{max}$ marked on the frequency axis.]

Implementation flow parameters
It is possible to tune synthesis tools, among other things, for (i) target frequency of operation; (ii) target area; (iii) target power consumption.
Tuning these parameters differently in the synthesis tools yields, as expected, a widely different quality of results. For example, when performance demands are extreme, synthesis tools are forced to create netlists containing large amounts of buffers and fast gates, which are not area- and power-efficient. To mimic a typical industrial flow, where an application performance constraint must be satisfied, we impose as the primary objective a certain target operating frequency (which is a parameter of our model), while area and power minimization are given to the tool as secondary optimization objectives. As a result, area and power requirements, expressed as a function of the target operating frequency, exhibit a characteristic flat behavior followed by a steeply rising trend after an inflection point. This trend is well known, and can be explained by the fact that, above some target operating frequency which can be achieved with minimal circuitry, synthesis backends are forced to insert extra gates to comply with increasing performance demands. Figure 3 shows a linearized and a parabolic approximation of this trend, and at the same time summarizes the ways we modeled this effect. For each device configuration (e.g., 4 × 4 32-bit switches with 6-deep FIFO buffers), a "native" frequency $f_n$ can be identified. This is the frequency achieved by the synthesizer with relaxed timing constraints. Under this condition, the tool is free to fully pursue its secondary objectives, hence creating minimum area ($A(f_n)$) and power ($P(f_n)$) netlists. Configuring the tools for target frequencies lower than $f_n$ does not result in further decreases of area or power dissipation. For each switch instance, it is also possible to find a frequency $f_{max}$ that corresponds to the fastest achievable synthesis result. Under this timing constraint, the module has $A(f_{max})$ area and $P(f_{max})$ power consumption.
We approximate the dependency of area and power overheads as linear or parabolic in the range $(f_n; f_{max})$. This assumption allows us to characterize devices only twice, at $f_n$ and $f_{max}$ (under various combinations of the other architectural parameters), while being able to estimate results over the whole range of frequencies achievable by the module. Since this analysis is not correlated to other model parameters, in the following, for simplicity of notation, we will not explicitly mention the dependency of coefficients on the synthesis target frequency; the characterization of this parameter will be implicitly assumed.
The linearized or parabolic approximation is a way of abstracting away from low-level details of the logic synthesis process, which are impossible to capture in a high-level model. The experimental results that will be shown in Section 5 will be based on a test set which is also spread in terms of target operating frequency, therefore providing a metric of the accuracy of such a model. Section 5.5 will compare the accuracy of the linear versus the parabolic models.
Please note that developing area and power models which are a function of the target frequency of operation up to $f_{max}$ also implies making available a model of the timing properties of the switches.
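The two-point frequency characterization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and arguments are hypothetical, and the parabolic variant assumes a parabola with its vertex at the native frequency, which is one plausible shape and not necessarily the one used in this work.

```python
def scale_with_frequency(f, f_n, f_max, cost_n, cost_max, shape="linear"):
    """Estimate area or power at target frequency f from the two
    characterized points (f_n, cost_n) and (f_max, cost_max)."""
    if not f_n < f_max:
        raise ValueError("expected f_n < f_max")
    if f > f_max:
        raise ValueError("target frequency exceeds f_max")
    if f <= f_n:
        # Below the native frequency the cost stays at its minimum.
        return cost_n
    x = (f - f_n) / (f_max - f_n)  # normalized position in (0, 1]
    if shape == "linear":
        return cost_n + (cost_max - cost_n) * x
    if shape == "parabolic":
        # Assumption: parabola with its vertex at f_n.
        return cost_n + (cost_max - cost_n) * x * x
    raise ValueError("unknown shape: " + shape)
```

Both variants agree at the two characterized points and differ only in between, which is where a test set spread over target frequencies can discriminate them.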

Traffic condition parameters
These parameters are only relevant to power models, since area models are clearly static. They include downstream congestion and internal congestion (i.e., arbitration conflicts). They will be explained in more detail in Section 4.2.2.

Area model
In general, the area equation must be of the form:

$$A = f(bd, fw, np_i, np_o) \qquad (1)$$

We identify as suitable the area model:

$$A = A_1 \cdot bd \cdot fw \cdot np_o + A_2 \cdot fw \cdot np_i + A_3 \cdot np_i \cdot np_o + A_4 \cdot fw \cdot np_i \cdot np_o \qquad (2)$$

The rationale of this formula is that the area of the target switch can be rendered as the sum of four contributions (Section 3): (i) output buffers, (ii) input buffers, (iii) arbitration and flow control logic, (iv) crossbar. Each contribution strongly depends on a known combination of architectural parameters as follows.
(i) One FIFO of depth $bd$ and width $fw$ is instantiated at each output port, so the output buffer area scales with the product $bd \cdot fw \cdot np_o$.
(ii) Each input port features a single register of width $fw$, so the input buffer area scales with $fw \cdot np_i$.
(iii) Since a distributed arbitration technique is used in the target switch, one arbiter is instantiated at each output port. Each arbiter has a complexity proportional to the number of candidate input ports $np_i$, therefore the overall contribution is the product of the input and output cardinalities. The arbiter logic is clearly independent of datapath parameters such as flit width and buffer depth.
(iv) The area overhead due to the crossbar must have a linear dependency on flit width, must be independent of the buffering resources, and must have a linear dependency on the product of input and output cardinalities.
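As a minimal sketch, the four-contribution area model can be evaluated as follows. The arbitration and crossbar terms follow the dependencies stated above; the output buffer term (proportional to $bd \cdot fw \cdot np_o$) and the input buffer term (proportional to $fw \cdot np_i$) are assumptions derived from the switch description in Section 3. The function name and the coefficient names `a1` to `a4` are illustrative.

```python
def switch_area(bd, fw, np_i, np_o, a1, a2, a3, a4):
    """Evaluate the switch area as the sum of four contributions."""
    # Assumption: one bd-deep, fw-wide FIFO per output port.
    output_buffers = a1 * bd * fw * np_o
    # Assumption: one fw-wide register per input port.
    input_buffers = a2 * fw * np_i
    # One arbiter per output, each proportional to the number of inputs.
    arbitration = a3 * np_i * np_o
    # Crossbar: linear in flit width and in the input/output product.
    crossbar = a4 * fw * np_i * np_o
    return output_buffers + input_buffers + arbitration + crossbar
```

With all coefficients set to 1 (a purely hypothetical choice), a 4 × 4, 32-bit switch with 6-deep buffers evaluates to 768 + 128 + 16 + 512 = 1424 area units.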

Power model
The power consumption of a module depends on the switching activity of the cells, so, to express the power consumption of an NoC switch, a term that accounts for traffic conditions must be present. The most general way to model the power consumption thus becomes

$$P = f(bd, fw, np_i, np_o, T) \qquad (3)$$

with $T$ being a generic variable that summarizes the traffic conditions. Since sequential components exhibit a power consumption even if they are not performing computation, due to the clock switching, a static (traffic-independent) term must appear. After analyzing the possible traffic flows in the ×pipes router, we propose (4) as a general power model:

$$P = P_A(\ldots) + \sum_{j=1}^{np_o} \left[ P_B(\ldots) \cdot T_{O_j} + P_C(\ldots) \cdot T_{OC_j} \right] + \sum_{j=1}^{np_i} P_D(\ldots) \cdot T_{IC_j} \qquad (4)$$

where the dots express dependencies on $bd$, $fw$, $np_o$, $np_i$ which will be analyzed in more depth in the following. The first term models the power dissipated by inactive, but still clocked, registers. The remaining terms depend on traffic conditions. An accurate representation of the traffic conditions requires a separate analysis of the state of each input and output port. Therefore, we define $np_o$ traffic variables $T_{O_j}$ and $T_{OC_j}$, to model the lack or presence of external congestion, and $np_i$ traffic variables $T_{IC_j}$, to model internal contention for resources. More specifically, we define the following.
(i) $T_{O_j}$: percentage of time during which the output port $j$ is successfully transmitting flits. This coefficient models traffic in absence of congestion.
(ii) $T_{OC_j}$: percentage of time during which the output port $j$ is trying to transmit, but flits are rejected. This coefficient models external congestion due to traffic spikes.
(iii) $T_{IC_j}$: percentage of time during which the input port $j$ of the switch is trying to transmit flits through one of the output ports, but arbitration is denied by the switch logic. This coefficient models the contention for the same output port inside of the switch.
This set of traffic percentages is linearly independent, since the complex arbitration and flow control patterns within an NoC switch make it very easy for some of these time windows to overlap. Please consider the following.
Example 1. A 4 × 2 switch (see Figure 5) may feature one established input-to-output connection where traffic is freely flowing (which is expressed by the condition $T_{O_1}$), another established input-to-output connection which is stuck due to congestion in the downstream switch (modeled within $T_{OC_2}$), while the third input port is unsuccessfully trying to transmit to one of the two output ports, which in this example are already busy ($T_{IC_3}$), and the fourth is simply idle (this contribution is therefore included in the coefficient $P_A$).
The coefficients $P_A$, $P_B$, $P_C$, $P_D$ depend on architectural parameters, as for the area model. They account for the power consumption in the traffic states described above, as follows.
(i) $P_A$ accounts for the static power dissipated by the switch; it is due to the noncombinational logic in the switch, whose dependencies on architectural parameters are summarized in Table 1.
(ii) $P_B$ accounts for the dynamic power dissipated by flowing packet streams, due to the enabled registers and to the switching activity of combinational logic. We identify four contributions to the power dissipation: (a) output buffers: in these buffers, during every cycle, one of the flit registers ($fw$ bits wide) samples a new piece of data; a $bd \times 1$ multiplexer then brings a flit to the output port; therefore, this contribution is itself the sum of two terms; (b) input buffers; (c) control logic; (d) selected crossbar branch.
The dependencies of these contributions on the architectural parameters are summarized in Table 2.
(iii) $P_C$ accounts for the dynamic power dissipated by the switch under a scenario where downstream congestion is preventing a free flow of packets. Although numerically different, the $P_C$ coefficient is similar to $P_B$, in that it still involves an established input-to-output channel, and therefore its dependency on architectural parameters is the same (see Table 2).
(iv) $P_D$ accounts for the power dissipated by the switch when an incoming stream requires access to an output port, but the arbitration is denied. The contributions to this portion of the power consumption are related to the following logic blocks: (a) input buffers; (b) control logic.
We can summarize the dependencies of these contributions as shown in Table 3.
The dependencies of the power coefficients are thus summarized in Table 4.
We would like to stress that some coefficients, which could be intuitively expected to quadratically depend on parameters, are instead linearly dependent, because they characterize a single input or output port. The quadratic behavior is indirectly restored by the summation symbols in (4).
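The structure of the power model (a traffic-independent term plus per-port terms weighted by the traffic variables) can be sketched as below. This is an illustrative reading of the description above, not a definitive implementation: the function name is hypothetical, the four coefficients are assumed to be already evaluated for the architectural configuration at hand, and the traffic variables are expressed as fractions in [0, 1] instead of percentages.

```python
def switch_power(p_a, p_b, p_c, p_d, t_o, t_oc, t_ic):
    """Combine power coefficients with per-port traffic fractions.

    t_o, t_oc: one entry per output port (free flow, downstream congestion).
    t_ic: one entry per input port (arbitration denied).
    """
    if len(t_o) != len(t_oc):
        raise ValueError("t_o and t_oc need one entry per output port")
    for fraction in list(t_o) + list(t_oc) + list(t_ic):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("traffic fractions must lie in [0, 1]")
    # Summation over output ports.
    output_terms = sum(p_b * free + p_c * blocked
                       for free, blocked in zip(t_o, t_oc))
    # Summation over input ports.
    input_terms = sum(p_d * denied for denied in t_ic)
    return p_a + output_terms + input_terms
```

For the 4 × 2 switch of Example 1, with hypothetical coefficients `p_a=10.0`, `p_b=2.0`, `p_c=1.5`, `p_d=0.5` and each state held for the whole observation window, `switch_power(10.0, 2.0, 1.5, 0.5, [1.0, 0.0], [0.0, 1.0], [0.0, 0.0, 1.0, 0.0])` gives 10.0 + 2.0 + 1.5 + 0.5 = 14.0.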

Choice of a relevant training set
To characterize the coefficients of our area and power models, we define a training set, composed of switch configurations chosen in such a way as to uniformly cover the relevant design space for the particular NoC under study. In the case of the ×pipes NoC, which is focused on the highest customizability of topologies, we choose to study a design space spanning a large variety of cardinalities ($np_i$ and $np_o$ of 4, 10, 16, and 20). Since ×pipes is also focused on the best performance/overhead tradeoff point [23], and therefore on low hardware cost, we consider moderate buffer depths $bd$ of 5 and 7 FIFO locations and flit widths $fw$ of 21, 28, and 38 bits.
In the modeling approach called full factorial design, all the possible combinations of the values of the independent design parameters should be studied to create the training set. This is often impractical due to the quick rise in the number of instances as soon as new design knobs are added, leading to approaches that select only a subspace of the characterization set (fractional factorial design). In our case, based on the knowledge of the target architecture, we choose a very simple way of pruning the training set. The rationale is based on the observation that rectangular switches add a smaller amount of information to the training set; for example, when studying the power consumption, a rectangular switch is by design unable to simultaneously feature traffic flows on all of its input and output ports (see Figure 5), and therefore behaves similarly to a square switch of smaller cardinality. Our preliminary internal testing confirms this property, at least for the ×pipes NoC. Therefore, we simply choose to coalesce the $np_i$ and $np_o$ axes for the generation of the training set, and only include 4 × 4, 10 × 10, 16 × 16, and 20 × 20 instances.
We finally combine all the possible parameter values, resulting in 24 (4 cardinalities times 2 buffer depths times 3 flit widths) configurations being synthesized.
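The pruned training set can be enumerated with a short script such as the following sketch. The parameter values are those listed above; the constant names, the function name, and the dictionary layout are illustrative choices.

```python
from itertools import product

CARDINALITIES = (4, 10, 16, 20)   # square switches only: np_i == np_o
BUFFER_DEPTHS = (5, 7)            # bd, in FIFO locations
FLIT_WIDTHS = (21, 28, 38)        # fw, in bits


def training_set():
    """Enumerate every combination of the training parameter values."""
    return [
        {"np_i": n, "np_o": n, "bd": bd, "fw": fw}
        for n, bd, fw in product(CARDINALITIES, BUFFER_DEPTHS, FLIT_WIDTHS)
    ]
```

Coalescing the two cardinality axes is what keeps the set at 4 × 2 × 3 = 24 instances; treating $np_i$ and $np_o$ independently would yield 4 × 4 × 2 × 3 = 96.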

Fitting area model coefficients
To estimate $A_1$, $A_2$, $A_3$, $A_4$, we propose two different methods.
(i) Methodology 1: coefficients can be derived directly from synthesis reports, which hierarchically list every switch subblock. For example, once the area cost of an output buffer which is $bd_0$ flits deep and $fw_0$ bits wide is gathered from one report, it can be called $A_{obuf}|_{bd_0, fw_0}$. Since $A_1$ is expected to increase linearly with both $bd$ and $fw$, it can be approximately derived as:

$$A_1 \approx \frac{A_{obuf}|_{bd_0, fw_0}}{bd_0 \cdot fw_0}$$

Other coefficients can be similarly computed.
Advantages: with this methodology, each contribution in the formula keeps a strict physical meaning. Only one synthesis run is needed to extrapolate coefficients for any switch instance; we arbitrarily choose a 10 × 10, 28-bit switch as a reference. This instance is close to the center of the design space of interest (see the previous subsection); its choice will be further discussed in Section 5.
Disadvantages: this simplified approximation discards any constant offset that may be present in the coefficients. Further, the nature of synthesis tools introduces unpredictable fluctuations in the netlist area and power trends under different architectural configurations. This noise does not have any easily characterizable property. Thus, the model incurs a nonnegligible error when compared against actual switch instances. Moreover, the choice of the specific switch instance for characterization might skew the computed coefficient values.
(ii) Methodology 2: coefficients can be derived by leveraging the multivariate nonlinear regression algorithms natively provided by several mathematical and statistical packages. In this case, the input is a set of characterization syntheses (the training set described in the previous subsection). The target polynomial for the regression is chosen based on insight into the dependency of area on the architectural design parameters (see (2)).
Advantages: the model fits actual synthesis results better.
Disadvantages: longer characterization time; with a thorough characterization set like that chosen in Section 4.2.3, experiments must be performed on 24 device instances, against just one. The actual improvement in accuracy depends on the smoothness of the native behavior of the synthesis tools. Some coefficients may lose their physical meaning (e.g., they may become negative).
Both methodologies can be readily adapted to any parameterizable NoC architecture.
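A regression-based fit in the spirit of Methodology 2 can be sketched in plain Python. Note the substitution: instead of the regression routines of a mathematical package, this sketch uses ordinary least squares solved through the normal equations, which is sufficient when the target polynomial is linear in its coefficients. The four regressors (output buffers, input buffers, arbitration, crossbar) are assumed to be $bd \cdot fw \cdot np_o$, $fw \cdot np_i$, $np_i \cdot np_o$, and $fw \cdot np_i \cdot np_o$; all function names are hypothetical.

```python
def area_features(bd, fw, np_i, np_o):
    """Regressors of the assumed four-term area polynomial."""
    return [bd * fw * np_o, fw * np_i, np_i * np_o, fw * np_i * np_o]


def solve_linear(matrix, rhs):
    """Solve a square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: degenerate training set")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_area_coefficients(samples):
    """Least-squares fit; samples are ((bd, fw, np_i, np_o), area) pairs."""
    rows = [area_features(*config) for config, _ in samples]
    areas = [area for _, area in samples]
    k = len(rows[0])
    # Normal equations: (X^T X) a = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, areas)) for i in range(k)]
    return solve_linear(xtx, xty)
```

Because nothing constrains the sign of the fitted values, a fit on noisy synthesis data can return negative coefficients, which is the loss of physical meaning mentioned above.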

Fitting power model coefficients
To characterize the $P_A$, $P_B$, $P_C$, $P_D$ coefficients, we first inject traffic into the switch netlists under test, one at a time. This is achieved by ModelSim [25] simulation of the Verilog netlists (please refer to Figure 2), to which traffic generators are attached. The traffic generators are configured to inject into the switch one of the four patterns described above (idle, free flow, downstream congestion, internal contention) at a time. The switching activity is logged and fed as an input to Synopsys PrimePower [26], which provides a hierarchical report of the power consumption of the switch subblocks. For each netlist of the training set, four hierarchical reports are therefore generated.
At this point, the power model coefficients are determined by using either of the techniques just outlined for the area models. The $P_A$, $P_B$, $P_C$, $P_D$ scenarios are separately accounted for; the fitting polynomials are directly derived from Table 4. For each of them, either by direct derivation or by nonlinear regression, we extract the coefficients modeling the dependency on architectural parameters.

EXPERIMENTAL RESULTS
To evaluate the accuracy of the proposed techniques, we first randomly choose a test set of 70 switch configurations spread across the design space of interest (both in terms of architectural parameters and target synthesis frequencies), and not overlapping with the training set previously used for characterization. Each switch is synthesized with Design Compiler to extract its area requirements, then stimulated with traffic streams within ModelSim and studied in PrimePower to evaluate its power consumption (Figure 2). A reference set of experimental results is therefore collected. The area and power consumption of the same set of switches is then estimated according to our methodology, and the statistical distribution of the resulting error is plotted to study the behavior of both coefficient fitting strategies.
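The error analysis amounts to comparing each model estimate with its measured reference and counting how many test instances fall into each error range. A minimal sketch follows; the function names and the 5% bin width are assumptions made for illustration.

```python
def relative_error(estimated, measured):
    """Signed modeling error, as a percentage of the measured value."""
    return 100.0 * (estimated - measured) / measured


def error_histogram(errors_pct, bin_width=5.0):
    """Count occurrences of absolute errors per range.

    Keys are the lower bounds of the ranges, e.g. 5.0 stands for [5%, 10%).
    """
    histogram = {}
    for error in errors_pct:
        lower = int(abs(error) // bin_width) * bin_width
        histogram[lower] = histogram.get(lower, 0) + 1
    return dict(sorted(histogram.items()))
```

Keeping the sign in `relative_error` makes it possible to check separately whether a model tends to underestimate or overestimate, while the histogram works on absolute values.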
The implementation flow of any circuit design is composed of several steps, among which the two major ones are logic synthesis (i.e., mapping functionality onto elementary cells from a technology library), which results in a netlist, and placement and routing (i.e., placing and interconnecting the netlist within a target floorplan), generating a layout as the outcome. Netlists can be generated in a relatively short time, but they do not include any information about the placement of the cells, and thus do not give any information about the length of the wires needed for the interconnections. This is a key missing piece of information, especially as designs become wire-dominated. Therefore, logic synthesis tools try to model the effect of wires by means of predictive models provided by the vendor of the technology library. These models are necessarily simplistic, and therefore may impact the accuracy of any area and power evaluation at the netlist level. On the other hand, creating the layout of a complex circuit provides more accurate estimations of its area and power cost, but this extra step is at least as time-consuming as the initial logic synthesis. Therefore, designers would clearly like to avoid performing this extra phase repeatedly during a modeling activity, if at all possible.
To assess the usefulness of our models, we investigate their inference and their application to both netlists and layouts. This can be seen in Figure 2, where the placement and routing step is optional.

Experiments with netlist-based models and a netlist-level test set
In this subsection, we generate our models starting from synthesized (but not placed and routed) switch instances, and check their accuracy against a test set which is also at the netlist level. The results are depicted in Figure 6, where the vertical axis reports the number of occurrences of inaccuracies comprised in the ranges listed on the horizontal axis. As can be seen, in around 80% of the cases, our models result in an error margin less then 10% of the actual value. Sporadically, relatively higherror rates of up to 20% are detected; however, as can be seen for example in Figure 7, the distribution of the errors is quite randomly spread over the design space, and comprises both under-and overestimations. The figure reports modeling inaccuracy for a subspace having as axes the flit width and the switch cardinality; these numbers are thus only a subset of the whole test set. Similar plots can be derived for varying buffer depths and target synthesis frequencies, and we omit them due to space constraints. Therefore, we can attribute inaccuracies to the unpredictability which is intrinsic in the behavior of synthesis tools, and not to a problem of our modeling approach. Comparing the results of the two techniques for coefficient fitting presented in Section 4.3, we see that the tails of the inaccuracy distributions drop more sharply for Methodology 2, indicating a lower chance of large modeling errors. However, Methodology 1 exhibits just marginally worse average inaccuracy rates: 6.26% against 5.30% for power models and 5.97% against 5.45% for area models. In terms of characterization effort, in our experience, we can roughly assume that one hour may be needed in average for the analysis of an instance of the training set; therefore, Methodology 1 requires one hour of runtime, while Methodology 2 needs 24 hours to provide numerical values of coefficients (the actual time depends on how thoroughly the design space Paolo Meloni et al. is covered). 
Due to the drastically lower effort, Methodology 1 becomes a natural candidate for fast yet accurate modeling. However, this approach relies on a single switch instance to characterize all the coefficients. The choice of the reference switch configuration is therefore key, and may impact the robustness of the flow. Internal testing, which we omit due to space constraints, shows that coefficients are quite accurately rendered under a wide range of possible choices of the reference switch. However, when manually picking an "outlier" instance as the reference, errors over the whole design space turn out to be large. As a possible workaround, Methodology 1 could be applied to multiple switch instances to minimize the chance of choosing bad references; outliers could be effectively discarded. This hybrid approach provides better reliability, but requires a modeling effort which comes progressively closer to that of Methodology 2 as its robustness is increased. Methodology 2 remains the most accurate and reliable, and its characterization time can still be assumed to be fully acceptable for both academic and industrial environments.
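The hybrid approach mentioned above can be illustrated with a minimal sketch. It assumes that the same model coefficient has already been estimated from several reference switches, and it discards estimates that lie far from the median. The function name, the 25% threshold, and the sample values are hypothetical and only serve the illustration; they are not part of our flow.

```python
from statistics import median_low

def robust_coefficient(estimates, max_rel_dev=0.25):
    """Combine several single-reference estimates of one coefficient.

    estimates: values of the same coefficient, each obtained from a
        different reference switch instance.
    max_rel_dev: hypothetical threshold; estimates farther than this
        fraction from the median are treated as outliers.
    Returns (combined value, kept estimates, discarded estimates).
    """
    # median_low always returns an element of the list, so at least
    # that element survives the filtering below.
    med = median_low(estimates)
    limit = max_rel_dev * abs(med)
    kept = [e for e in estimates if abs(e - med) <= limit]
    dropped = [e for e in estimates if abs(e - med) > limit]
    combined = sum(kept) / len(kept)
    return combined, kept, dropped

# Hypothetical estimates from five reference switches; one is an outlier.
combined, kept, dropped = robust_coefficient([1.02, 0.98, 1.05, 1.60, 0.99])
print(combined, dropped)
```

Each additional reference instance costs roughly one more characterization run, which is why the effort of such a scheme approaches that of Methodology 2 as more references are added.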

Test case: a complete NoC topology
To further validate the most complex part of our methodology, that is, the power modeling, we study a whole NoC topology, namely a 5 × 3 mesh. The mesh includes switches with three different cardinalities of 4 × 4, 5 × 5, and 6 × 6. We then inject functional traffic into the topology, namely, that required to drive a multimedia application from the MPARM [27] suite, and compare the resulting power consumption against that predicted by our model (characterized with Methodology 2). Traffic patterns in the mesh are irregular, due to application needs, causing the switches to spend variable amounts of time in each possible state. The results are plotted in Figure 8. The average inaccuracy is 5%, with only two switches out of fifteen (about 13%) exhibiting inaccuracies greater than 10%. Since the power consumption of some switches is overestimated while that of others is underestimated, the margin of error on the consumption of the whole mesh is as low as 1.3%. This result confirms the usefulness of our modeling strategy for integration within a CAD mapping and design space exploration flow.
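The partial cancellation between over- and underestimated switches can be illustrated with a short sketch. The power values below are hypothetical and are not taken from Figure 8; the function name is likewise an assumption made for the example.

```python
def error_summary(predicted, measured):
    """Per-switch and whole-topology relative errors, in percent."""
    # Signed error of each switch: positive means overestimation.
    per_switch = [100.0 * (p - m) / m for p, m in zip(predicted, measured)]
    # Average inaccuracy: mean of the absolute per-switch errors.
    mean_abs = sum(abs(e) for e in per_switch) / len(per_switch)
    # Error on the total power: signed errors partially cancel out.
    total = 100.0 * (sum(predicted) - sum(measured)) / sum(measured)
    return per_switch, mean_abs, total

# Hypothetical power figures (mW) for four switches.
measured = [10.0, 12.0, 8.0, 10.0]
predicted = [10.5, 11.4, 8.4, 10.1]
per_switch, mean_abs, total = error_summary(predicted, measured)
print(per_switch, mean_abs, total)
```

With these made-up figures the average per-switch inaccuracy is about 4%, while the error on the total is about 1%, mirroring the behavior observed on the mesh.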

Experiments with netlist-based models and layout-level test set
We try to apply the previously mentioned models, which are based on netlist-level analyses, to a layout-level test set, by placing and routing the test set described above. This activity generates a very realistic test set, and is a demanding metric for the accuracy of the models, since extra unpredictable noise is added. The results we get are presented in Figure 9, which should be compared to Figure 6(b). The two plots exhibit a comparable trend and errors of roughly the same magnitude, even though the average modeling error for the layout-level test set is about 3% higher. This means that models developed by only taking netlists into account still show good accuracy even with respect to layout-level power evaluation. The added noise also blurs the accuracy difference between Methodology 1 and Methodology 2, both in maximum error (26% versus 23%) and average error (8.8% versus 8.35%). While Methodology 2 remains marginally more accurate, these results seem to suggest that the unpredictability introduced by the logic synthesis process is somewhat unrelated to that introduced by the placement and routing phase. In other words, even though Methodology 2, thanks to its interpolation of results, can compensate for some of the nonidealities of the logic synthesis process better than Methodology 1, this compensation is less effective when trying to predict the power consumption after placement and routing.

Experiments with layout-based models and a layout-level test set
In an attempt to check whether more accurate models can be built, we recompute the numerical coefficients starting from a layout-level version of the training set and applying Methodology 2. This model is very close to an ideal reference point, since it is derived from a regression on experimental results which already encompass most of the unpredictable elements of the synthesis flow. However, the time required to build the model coefficients is noticeably longer. In our experience, both the logic synthesis and the placement steps require a computation time which is not easy to predict, as it depends on many factors, such as the switch cardinality and the target operating frequency. As a rule of thumb, however, the two steps are about equally time consuming; therefore, the modeling time is approximately doubled. Figure 10 shows the error distribution obtained when the model coefficients derived from a layout-level training set are validated against the layout-level test set.
As can be noticed, and as expected, the average error and the maximum error values both noticeably decrease when compared to Figure 9. However, the decrease is not huge. We attribute the remaining inaccuracies in Figure 10 to the intrinsic unpredictability of the synthesis tools. Even after taking into account all the systematic behaviors in the synthesis flow, the trend is the result of residual instance-to-instance variations due to heuristics in the CAD tools and to degrees of freedom which can only vary in a discrete fashion.
The accuracy improvement guaranteed by a layout-level characterization comes at the cost of a doubled runtime overhead, and still does not completely eliminate the presence of some "outlier" instances. The designer may certainly choose to adopt our methodology to characterize devices at the layout level for maximum accuracy. However, our experiments suggest that, at least at the 0.13 µm technology node, it is still feasible to use accurate netlist-based models in order to save characterization time.

Experiments with a parabolic model for the dependency on the target synthesis frequency
We conclude our experiments by checking whether a linear model is accurate enough to characterize the dependency of synthesis results on the target synthesis frequency (see Figure 3). We leverage a parabolic model as a potentially more accurate approximation of the actual dependency of model coefficients on the target frequency, then recheck the accuracy of the resulting models; the results are reported in Table 5.
These results do not seem to indicate a strong bias towards any of the alternatives. The linear approximation seems to cope much better with a netlist-level or layout-level test set when the model is derived from experiments on a training set at the same level, but the parabolic model is quite a bit better at predicting layout-level results starting from netlist-level models. We attribute this behavior to the impact of noise. In other words, although synthesis results do clearly change depending on the target frequency, the choice of a linear or parabolic model to describe this trend does not matter much, since the nonidealities introduced by the synthesis flow induce enough noise to blur the distinction. Overall, the usage of the linear model, which is simpler, seems to be justified.
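The comparison between the two approximations can be sketched as follows. This is a minimal illustration, assuming that a coefficient has been sampled at a few target frequencies; it uses a plain least-squares polynomial fit, and the function names and sample values are hypothetical rather than taken from our characterization runs.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns [c0, c1, ...] for c0 + c1*x + ..."""
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x^(i+j)), b[i] = sum(y*x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            f = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= f * a[col][k]
            b[row] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs

def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

def mean_abs_pct_error(coeffs, xs, ys):
    """Average relative error of the fit over the given points, in percent."""
    errs = [abs(evaluate(coeffs, x) - y) / abs(y) for x, y in zip(xs, ys)]
    return 100.0 * sum(errs) / len(errs)

# Hypothetical coefficient samples at five target frequencies (GHz).
freqs = [0.2, 0.4, 0.6, 0.8, 1.0]
values = [1.00, 1.03, 1.09, 1.12, 1.21]
linear = polyfit(freqs, values, 1)
parabolic = polyfit(freqs, values, 2)
print(mean_abs_pct_error(linear, freqs, values))
print(mean_abs_pct_error(parabolic, freqs, values))
```

On the training points a parabola can never have a larger squared residual than a line; the relevant question, as in Table 5, is how the two fits behave on a separate test set, where noise may erase that advantage.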

CONCLUSIONS AND FUTURE WORK
We presented a methodology for the characterization of NoC switch area and power requirements. The approach we propose is based on thorough parameterization on several architectural, deployment, and runtime variables. This guarantees excellent applicability within an NoC CAD flow for topology mapping and/or design space exploration. The area and power models for the ×pipes case study turn out to be very accurate within the limits allowed by the nonidealities of synthesis tools, even when applied to a whole NoC topology with irregular traffic flows. Our experiments show that, at least at the 0.13 µm node, applying our methodology to netlist-level devices yields an acceptable approximation of the actual behavior even after placement and routing, but that even greater precision can be achieved, if desired, by applying the same technique at the layout level. We also show that another tradeoff between accuracy and modeling effort exists: coefficients can be extracted either from a single device instance (or just a few), by normalization of the synthesis report, or from several of them, by an interpolation process.
Future work includes minimizing the characterization effort, testing how well our technique scales to finer process technologies, and creating similar models for other NoC components, such as network interfaces (NIs). At the 0.13 µm node, we observe that the power consumption of NoCs is still largely dominated by dynamic switching activity [23]; therefore, we do not consider wire loads in our modeling approach for switches. Since this is expected to change in future technologies, techniques to establish models for NoC links may also become important.
Besides the absolute accuracy of the models, we also plan on quantifying the accuracy of our model when used from within a CAD flow to establish relative cost assessments of alternative NoC topologies; early results in this activity are encouraging, with good agreement between CAD expectations and actual measurements.