Shrinking silicon technologies, together with increasing logic densities and clock frequencies, lead to a rapid rise in power density. Increased power density results in higher on-chip temperature, which creates numerous problems tightly coupled to reliability degradation. Since typical low-power design has proved insufficient to tackle the temperature increase by itself, device architects face the challenge of developing new methodologies to guarantee the timing, power, and thermal integrity of the chip. In this paper, we propose a thermal-aware exploration framework that targets the elimination of temperature hotspots through the efficient exploration of multiple microarchitecture selections over the temperature-area trade-off curve. By carefully planning at design time which resources of the initial microarchitecture should be replicated, the proposed methodology optimizes the system's thermal profile and flattens on-chip temperature under various design constraints. The introduced framework does not impose any architectural or compiler modification, and it is orthogonal to any other thermal-aware methodology. For evaluation purposes, we employ a software-defined radio executed on a thermal-aware instance of the LEON3 processor. Experimental results show that our methodology leads to an architecture that exhibits a temperature reduction of 17 K, which improves resilience against aging phenomena by about 14%, with a controllable silicon area overhead of about 15% compared to the initial LEON3 instance.
Communication has become one of the central uses of computing technology over the years. Architectures that facilitate communication, such as mobile phones and wireless networks, have been primary factors in driving the evolution of microprocessors and computer systems. With the evolution of wireless mobile communications, the problem emphasis has shifted to networking protocols and signal processing that are required to sustain the necessary bandwidth of these applications. In recent years, we have seen the emergence of an increasing number of wireless protocols (e.g., 2G, 3G, GPRS, WiFi, etc.) that are applicable to different types of networks.
Software-defined radio (SDR) technology was created to improve interoperability between different wireless networks, field radios, and devices [
Since the majority of these systems exhibit high-throughput, low-power requirements, and short time-to-market, previous studies proposed the usage of System-on-Chip (SoC) architectures to support the efficient implementation of SDR [
For this purpose, thermal management has recently received a lot of attention from design architects. The goal of thermal management is to meet maximum operating temperature constraints while respecting timing specifications. Moreover, thermal management can achieve further temperature reduction in order to mitigate the reliability degradation of SoCs.
Previous studies have shown that thermal stress is tightly coupled to reliability issues [
A typical instantiation of this solution is the use of dynamic voltage and frequency scaling (DVFS) [
Another technique for eliminating temperature hotspots, especially in multicore architectures, is based on load balancing [
A similar approach is discussed in [
A common drawback of the techniques discussed so far is that they do not incorporate any mechanism for handling the thermal history of the cores. Thermal history provides useful guidance about the future behavior of the system and can be exploited to improve the results of the migration. In addition, existing approaches mainly provide thermal-aware application mapping onto SoC devices based on exploration driven by simulation results. These approaches assume that the target platform is fixed, ignoring potential improvements achievable through architecture-level optimizations [
The proposed approach aims at temperature optimization, while it can be considered as a proactive strategy that alleviates thermal stress at run-time. The introduced framework does not impose any architectural or compiler modification, whereas it is orthogonal to any other thermal-aware methodology discussed above, since it is based on new architectural schemes to eliminate the consequences posed by temperature hotspots. Thus, existing work on thermal aware application mapping and dynamic thermal management can be used in a modular manner to extend the proposed methodology.
Specifically, we target the development of an automated design space exploration framework that extracts and evaluates a large number of architectural solutions. Every solution exploits selective block replication. Based on the software-supported automatic exploration, we are able to compute Pareto curves of higher thermal quality, in contrast to many similar existing optimization approaches that retrieve only a single architecture [
Previous works introduced the use of parallelism in order to achieve power savings, which in turn leads to temperature reduction [
The main difference between the proposed work and this approach is that our solution does not assume that replica blocks of the same type work in parallel. More specifically, in our methodology, only one of the available replica blocks is active at any time. The selection of this active block is based on its thermal condition, as described in upcoming sections. The contributions of this paper can be summarized as follows.
We show the optimization potential of thermal-aware exploration that exploits selective replication of specific architectural blocks.
We introduce a novel methodology that provides (i) elimination of thermal hotspots in SoCs targeting SDR architectures and (ii) alleviation of temperature gradients.
Rather than providing only one architectural solution, our methodology retrieves a number of Pareto architectural solutions, each of which trades off different design constraints/criteria.
We propose a novel design methodology that is orthogonal to the existing approaches found in relevant literature [
We provide CAD support by developing a software-supported thermal-aware exploration framework, which is publicly available for additional extensions through [
We apply the proposed methodology to a real-case SoC design consisting of a synthesized LEON3 processor [
Experimental results demonstrate the efficiency of the proposed methodology, showing that the selected architecture leads to a temperature reduction of about 8% (from 380 Kelvin to 363 Kelvin), with a controllable silicon area increase of 15%. As we show later, apart from reducing cooling cost, such a temperature reduction also achieves a notable improvement of about 14% against the consequences of aging phenomena.
The rest of the paper is organized as follows: Section
During the last years, a number of different SDR-based architectures have been developed, whereas a typical instantiation is depicted in Figure
The block diagram of employed SoC-based SDR.
The LEON3 processor consists of the integer unit, the cache subsystem, the memory management system, and the AMBA interface. The instruction unit is fully compatible with the SPARC V8 instruction set, and the pipeline consists of 7 stages. The integer unit has configurable separate instruction and data caches (Harvard architecture), each with a size of 1 KByte. Furthermore, the integer unit includes a configurable register file with 8 register windows. The L1 caches are managed by a cache controller that is interfaced to the system's AMBA AHB bus. Communication with the LEON3 peripherals is performed through two bus controllers, referred to as the AHB (Advanced High-performance Bus) and APB (Advanced Peripheral Bus) controllers. The AHB controller connects the high-speed components (e.g., integer unit, memory controller), whereas the APB controller provides control of the low-speed peripherals (e.g., UARTs, I/Os). Finally, the LEON3 processor contains configurable separate local data (2 KByte) and instruction (2 KByte) memories.
In this section, we discuss which hardware blocks should be considered critical for thermal stress. This problem becomes even more important for high-end processor architectures, such as superscalar organizations, and for multicore SoC designs, where multiple hardware components with different area and power values are combined into a single device. Hence, one of the challenges that architects face today is to identify the hardware components that most strongly affect thermal stress.
In order to show how different hardware block thermal profiles affect the thermal stress of the entire IC, Figure
(a) Power consumption and (b) power density pies for LEON3 architecture.
We select such an embedded processor because it is widely used in numerous commercial and/or research products. However, apart from the selected target platform, the methodology we follow in this paper is also applicable to any other digital architecture.
Since embedded cores are usually designed with low power as a criterion, many researchers have so far tried to reduce maximum temperature values by identifying blocks that dissipate a large share of the power budget. Regarding the LEON3 processor, the local data/instruction memories, the L1 data/instruction caches, and the register file are found to be the most power-hungry blocks. More specifically, these blocks account on average for 57%, 31%, and 8% of the total power dissipation, respectively.
Even though Figure
A candidate metric for this scope is power density, which denotes the ratio of power consumption for each hardware block per the area occupied by this block. Figure
Next, we show that power density is a much more important criterion than power consumption. For this purpose, Figures
Thermal profile for LEON3: (a) without considering replica blocks, (b) with replica blocks (2 × local data/instruction memories, 2 × L1 data/instruction caches, 2 × register file), and (c) with replica blocks (2 × instruction unit, 2 × cache controller, 2 × AHB controller).
As a reference point for this study, we use the thermal map for a LEON3-based SoC SDR architecture when no replica blocks are assumed. This map, shown in Figure
Similar to previous conclusion, the floor-plan that leads to architecture instantiation where components with increased power consumption are replicated (Figure
Note that, for the sake of comparability, the temperature scale is the same for all the thermal maps depicted in Figures
For supporting selective block replication, the processor microarchitecture has to be properly enhanced. We mention that in this paper, we focus mainly on the exploration methodology developed for the automatic evaluation of opportunities delivered through selective replication. Thus, in this section, we briefly introduce some micro-architectural considerations that enable the design of processor architectures with replicated components.
In the general case, the data flow and the control flow of the original processor architecture have to be modified towards two directions: (i) enabling mutual exclusiveness between the replicated units, and (ii) permitting run-time management of the replicated resources according to the run-time thermal state of the processor. We focus our analysis on the RISC microarchitecture of LEON3 [
We target lightweight enhancements of the original LEON3 datapath to avoid extensive area, organization, and control overheads with respect to the original datapath. For this purpose, we apply selective replication in a coarse-grained manner, that is, replicating at the level of the instruction unit rather than at the level of the ALU or the instruction fetch stage. Furthermore, we avoid replicating the actual memory components (e.g., register file, data cache), since their replication would require additional control mechanisms to establish data coherency among the various replicas.
Although selective replication at a finer granularity than the proposed one is a valid design option, we show that coarse-grained component replication can achieve significant temperature reduction and hotspot elimination, which in turn improves, among other things, the device's resilience against aging phenomena. The proposed approach is not a restrictive one. As shown in Figure
According to previous analysis, we propose the adoption of a microarchitectural extension similar to the one depicted in Figure
Proposed microarchitectural enhancement.
Furthermore, for each replicated component proper pairs of multiplexing and demultiplexing logic are added to the original datapath, regarding the lightweight control and data flow extension. Specifically, the inputs of each replicated component are driven by the demultiplexer that properly guides the input data to the active module. Accordingly, the output signals from each replicated component are multiplexed in order to propagate to the next level. We recognized two data-flow paths inside the processor datapath, namely, the memory-to-instruction unit and the instruction unit-to-memory data-flow paths. Figure
The original LEON3 architecture is also enhanced with a thermal-aware runtime controller module that distributes the workload to the available units at runtime. The thermal-aware workload distribution is performed by issuing the selection signal to the added multiplexing and operand isolation logic; the same signal configures both of these components. Since only the selection signals to the extra logic are issued, the thermal-aware controller works transparently with respect to the control logic of the rest of the LEON3 architecture. The controller decides which replicated unit to turn on or off according to the thermal state of the processor. It is assumed that runtime thermal data are available, for example, through thermal sensors.
We consider a reactive scheme of the thermal aware controller. Thus, the controller alters its state whenever an upper temperature threshold,
The thermal controller reacts to the temperature readings provided by the onchip thermal sensing infrastructure.
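The reactive behavior described above can be sketched as follows for a single replicated block type. This is a minimal illustration, not the implemented LEON3 controller: the class name, the threshold value, and the policy of handing the workload to the coolest replica are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ReactiveReplicaController:
    """Keeps exactly one replica active; reacts when it reaches t_upper."""
    num_replicas: int
    t_upper: float   # upper temperature threshold in Kelvin (assumed value)
    active: int = 0  # index of the replica that currently receives the workload

    def step(self, temps: Sequence[float]) -> int:
        """Return the index of the replica to activate for the given readings."""
        if len(temps) != self.num_replicas:
            raise ValueError("one temperature reading per replica is expected")
        if temps[self.active] >= self.t_upper:
            # Assumed policy: hand the workload to the coolest replica.
            self.active = min(range(self.num_replicas), key=lambda i: temps[i])
        return self.active


ctrl = ReactiveReplicaController(num_replicas=3, t_upper=370.0)
print(ctrl.step([365.0, 350.0, 355.0]))  # active replica is below the threshold
print(ctrl.step([371.0, 350.0, 355.0]))  # active replica reached the threshold
```

The returned index plays the role of the selection signal that drives the multiplexing and operand isolation logic; the controller never changes state while the active replica stays below the threshold.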
The state transitions of the thermal aware runtime controller for a single type of unit, that is, the instruction unit, are depicted in Figure
Employed thermal-aware runtime controller per replicated block.
This section describes in detail the proposed methodology for reducing temperature hotspots through selective replication of some hardware modules of the target architecture. Note that throughout this methodology we do not aim at redesigning the whole microarchitecture; we focus only on critical components. The goals of this methodology are (i) to provide a proactive thermal-aware approach targeting microarchitecture designs and (ii) to support the rapid exploration/evaluation of different architectural selections in terms of thermal stress. Note that the architectural modifications applied with our methodology are transparent to the compilation flow (they do not affect existing tools), and they speed up the development of new products, since end-users (e.g., programmers) do not have to consider thermal issues.
This methodology is shown graphically in Figure
The proposed methodology for replication-aware thermal management.
The inputs to this replication-aware thermal management methodology are the architecture description (in VHDL/Verilog), as well as the selected CMOS technology. Initially, the design is synthesized with Synopsys Design Compiler [
The output of the synthesis task is encoded into an XML format so that it can be manipulated by the tools of our framework. The granularity of the system's description in this XML format is tunable: a finer description yields a more accurate thermal analysis but requires the highest computational effort, whereas a coarser description requires less computational effort but imposes a penalty in terms of hotspot elimination. Our methodology is applied at design time, so the additional computational complexity causes almost no performance degradation of the final system; nevertheless, the increased number of functionalities inside an SoC usually makes a fine-grained description of the target system undesirable.
The derived system description is profiled in order to compute the power density of the functionalities described in the XML file. For this purpose, the power and area information already derived from postsynthesis simulation is employed.
Based on this analysis, it is possible to decide which of the architecture's hardware blocks have to be replicated. For this step, the power density of the hardware blocks has to be measured. Note that the total number of replica blocks is limited by the area constraints imposed by the designer.
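The profiling step can be illustrated with a short sketch that ranks blocks by power density (power divided by occupied area). The function name and the numeric values in the usage example are hypothetical and are not taken from the LEON3 measurements.

```python
from typing import Dict, List, Tuple


def rank_by_power_density(
    blocks: Dict[str, Tuple[float, float]]
) -> List[Tuple[str, float]]:
    """Map name -> (power in W, area in mm^2) to (name, W/mm^2), densest first."""
    densities = []
    for name, (power_w, area_mm2) in blocks.items():
        if area_mm2 <= 0:
            raise ValueError(f"block {name!r} has non-positive area")
        densities.append((name, power_w / area_mm2))
    return sorted(densities, key=lambda item: item[1], reverse=True)


# Hypothetical numbers: a power-hungry but large memory versus a small logic unit.
example = {
    "local_memory": (0.057, 0.30),
    "instruction_unit": (0.004, 0.01),
}
for name, density in rank_by_power_density(example):
    print(f"{name}: {density:.2f} W/mm^2")
```

In this made-up example the small block ranks first even though it dissipates far less power, which is the situation that makes power density, rather than power consumption, the replication criterion.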
Since we try to alleviate thermal stress mainly at hardware blocks with increased power densities, these are the blocks that should be replicated (as we have shown in Section
In this procedure, an extra architectural parameter needs to be defined. More specifically, apart from the blocks that need to be replicated, we also need to define the maximum number of replicas allowed for each of these blocks. Since more replicas mean better thermal management at the expense of area and delay overheads, this choice requires careful study. In this study, we evaluate solutions with up to five replica blocks. This selection was based on our observation that architectures with more replicas do not achieve additional temperature reduction (due to a saturation effect). However, constraints posed by the architecture specifications might reduce this number.
After defining the type of replica blocks, as well as how many times they should be replicated, the next tool in our framework performs this task automatically by annotating the design's description appropriately. Apart from inserting the new (replica) blocks into the design, this task has to provide the appropriate connectivity through the routing infrastructure and to insert the thermal-aware runtime controllers in the XML description. Moreover, during this annotation we keep the same connectivity among hardware blocks, and we ensure that all connections to (and from) replica blocks are also replicated. This is an important difference between the proposed solution and similar approaches found in relevant references [
The outcome of this step is the set of all candidate architecture instantiations that meet the area constraints. Such a criterion eliminates from the design space those solutions that lead to unacceptable overheads in device area due to an excessive number of replica blocks. Also, by allowing a designer-defined overhead in this metric, it is possible to explore and evaluate different architectural solutions. In our exploration, we set this area overhead to 35%, since larger values lead to excessive area penalties.
The employed criterion allows blocks with increased power densities to be replicated more times than blocks with smaller power densities. This occurs because the area occupied by these blocks is usually smaller, and hence more of them fit into the given (affordable) percentage of area overhead. Since only one of the replica blocks is active at any time (based on the approach discussed in Section
Next, we proceed to the second criterion for evaluating the efficiency of the derived architecture instantiations, which concerns the timing constraint. For this purpose, the solutions that pass area filtering are floor-planned using a thermal-aware floor planner [
The optimization goal during this procedure is to reduce the thermal stress of each architecture subject to the timing constraint. Thermal stress is alleviated by spreading apart, as much as possible, the hardware blocks that contribute most to high on-chip temperature (e.g., modules with increased power densities). Similarly, minimizing the perimeter of the bounding box that surrounds all the modules connected to a single bus improves the delay of this bus. Hence, blocks that are connected through a bus have to be floor-planned in spatially close locations.
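A minimal sketch of the area filtering step, under simplifying assumptions: every replicable block may be instantiated between one and five times (our reading of the bound stated earlier), and the extra area is just the sum of the added copies, ignoring multiplexers, controllers, and routing. The function and parameter names are hypothetical.

```python
from itertools import product
from typing import Dict, Iterator


def candidate_architectures(
    block_area: Dict[str, float],  # area of each replicable block
    base_area: float,              # total area of the design without replicas
    max_copies: int = 5,           # instances allowed per block (assumed reading)
    max_overhead: float = 0.35,    # designer-defined area overhead limit
) -> Iterator[Dict[str, int]]:
    """Yield every copies-per-block assignment whose extra area fits the budget."""
    names = sorted(block_area)
    for copies in product(range(1, max_copies + 1), repeat=len(names)):
        extra = sum((c - 1) * block_area[n] for n, c in zip(names, copies))
        if extra / base_area <= max_overhead:
            yield dict(zip(names, copies))


# Hypothetical areas in arbitrary units.
for arch in candidate_architectures({"ahb_ctrl": 10.0, "iu": 20.0}, base_area=100.0, max_copies=3):
    print(arch)
```

Because the budget is expressed in area, small blocks naturally admit more copies than large ones before the limit is reached.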
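The bus-related term of the floor-planning objective can be sketched as the perimeter of the axis-aligned bounding box around all blocks attached to one bus. The rectangle representation and the function name are assumptions made for illustration; the actual cost function of the floor planner is not reproduced here.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height) of a placed block


def bus_bounding_box_perimeter(blocks: Iterable[Rect]) -> float:
    """Perimeter of the smallest axis-aligned box enclosing all blocks on one bus."""
    rects = list(blocks)
    if not rects:
        raise ValueError("a bus must connect at least one block")
    x_min = min(x for x, _, _, _ in rects)
    y_min = min(y for _, y, _, _ in rects)
    x_max = max(x + w for x, _, w, _ in rects)
    y_max = max(y + h for _, y, _, h in rects)
    return 2.0 * ((x_max - x_min) + (y_max - y_min))


# Two unit-sized blocks placed far apart give a large perimeter.
print(bus_bounding_box_perimeter([(0.0, 0.0, 1.0, 1.0), (2.0, 3.0, 1.0, 1.0)]))
```

A floor planner would weigh this term against the thermal term, which pushes high-power-density blocks apart, so the two objectives pull in opposite directions.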
Regarding our methodology we allow all the blocks of the chip to be “soft” blocks, that is, their aspect ratio can change (in a controlled manner) in each annealing movement of
The derived solutions are then evaluated in terms of delay degradation, as compared to the delay estimate retrieved from postsynthesis simulation. For this purpose, we use the Elmore delay model [
The output of this step is the set of thermal-aware floor-planned solutions whose timing degradation meets the design specification (as derived from postsynthesis simulation, before any replica block is considered).
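For illustration, the timing filter can be sketched with the textbook Elmore delay of an RC ladder, where each capacitor is charged through the total resistance between the driver and its node. How the wires of the floor-planned design are segmented into resistances and capacitances is not specified here, so the inputs and both function names are hypothetical.

```python
from typing import Sequence


def elmore_delay(resistances: Sequence[float], capacitances: Sequence[float]) -> float:
    """Elmore delay at the far end of an RC ladder (segment i: series R_i, then C_i to ground)."""
    if len(resistances) != len(capacitances):
        raise ValueError("need one capacitance per resistance")
    delay = 0.0
    upstream_r = 0.0
    for r, c in zip(resistances, capacitances):
        upstream_r += r          # total resistance between the driver and this node
        delay += upstream_r * c  # this capacitor charges through its upstream resistance
    return delay


def meets_timing(delay: float, reference_delay: float, max_degradation: float) -> bool:
    """True if the delay stays within the allowed relative degradation."""
    return delay <= reference_delay * (1.0 + max_degradation)


print(elmore_delay([1.0, 2.0], [3.0, 4.0]))  # 1*3 + (1+2)*4
print(meets_timing(105.0, reference_delay=100.0, max_degradation=0.07))
```

Longer wires between spread-out blocks add segments to the ladder, which is why spreading blocks for thermal reasons has to be checked against the timing specification.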
Finally, in the last task in the proposed methodology, the different architectural solutions are evaluated against thermal constraint. This task is automated with a new tool, named
Specifically, the statistical characterization is performed in a window-based manner (the window denotes the temporal granularity of the thermal analysis) by computing the statistical mean values of the primitive operations executed by the processor, that is, the number and type of ALU instructions, memory accesses, cache hits/misses, communication load, and so forth. For this purpose, power traces for all the replica units are generated. These traces describe the power activity of all the architecture's units (replicated or not) within a certain amount of time. Using these statistics, the utilization ratio per component is extracted in each examined window. The per-component utilization ratios are correlated with accurate postsynthesis power measurements to generate the power traces. For the examined power traces, the thermal filtering tool extracts the Pareto frontiers under various design objectives (e.g., power density of the chip, delay, area, maximum temperature, maximum thermal gradient).
In this paper, we study a uniform workload incorporating the key elements of the baseband processing domain, where these optimizations against thermal stress can be applied. More specifically, the employed applications, obtained from [
Adaptive differential pulse-code modulation (ADPCM): a variant of differential pulse-code modulation (DPCM) that varies the size of the quantization step to allow further reduction of the required bandwidth for a given signal-to-noise ratio.
Cyclic redundancy check (CRC): an error-detecting code designed to detect accidental changes to raw computer data, commonly used in digital networks.
Fast Fourier transform (FFT): an efficient algorithm to compute the discrete Fourier transform (DFT) and its inverse.
GSM 06.10: a digital speech coding standard used in the GSM digital mobile phone system.
These telecommunication applications exhibit increased bandwidth requirements. Regarding our architecture, the components with the highest power density values when such applications are executed are the instruction unit, the cache and memory controller, the DSU (debug support unit), and the AMBA AHB/APB bus controllers.
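The two ideas in this step, window-based power traces and Pareto extraction, can be sketched as follows. The linear mapping from utilization ratio to power is an assumption made for illustration, as are all function names; the Pareto filter assumes that every objective is minimized.

```python
from typing import List, Sequence, Tuple


def power_trace(utilization: Sequence[float], active_power_w: float,
                idle_power_w: float = 0.0) -> List[float]:
    """Per-window power, assuming power scales linearly with the utilization ratio."""
    return [u * active_power_w + (1.0 - u) * idle_power_w for u in utilization]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Return the non-dominated points (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (area overhead, maximum temperature) pairs for four candidates.
candidates = [(1.0, 5.0), (2.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
print(pareto_front(candidates))
print(power_trace([0.0, 0.5, 1.0], active_power_w=2.0))
```

The quadratic-time filter is adequate for a few hundred candidate architectures; a sort-based sweep would be preferable for much larger design spaces.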
The output of
These time periods are retrieved by incorporating info from application's simulation. The employed utilization ratios are averaged over hardware components of LEON3 system in order to determine the active/idle time slots accurately. The output from utilization is fed as input to
Furthermore, the area derived in this stage may be different from the one computed during area filtering, due to additional free space inserted to design after floor plan (which is not occupied by any hardware block) in order to model the white space between hardware components.
The workload distribution, in conjunction to the packaging constraints, is fed to the
Thermal characteristics of the employed processors.
Parameters | Model value |
---|---|
Sampling interval | 20 ms |
Die Thickness | 0.15 mm |
Core Area (no replication) | 0.426213 mm² |
Cache area (L1 + Local I + D) | 0.370561 mm² |
Convection Resistance | 0.1 K/W |
Convection Capacitance | 140.4 J/K |
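
As a rough reading of the convection parameters listed above, a single lumped RC node (an assumption made here; the thermal simulator solves a full RC network) gives a time constant equal to resistance times capacitance and a steady-state temperature rise equal to power times resistance. The function names and the 2 W figure are hypothetical.

```python
def convection_time_constant(resistance_k_per_w: float, capacitance_j_per_k: float) -> float:
    """Time constant in seconds of a single lumped thermal RC node."""
    return resistance_k_per_w * capacitance_j_per_k


def steady_state_rise(power_w: float, resistance_k_per_w: float) -> float:
    """Steady-state temperature rise in Kelvin across a thermal resistance."""
    return power_w * resistance_k_per_w


tau = convection_time_constant(0.1, 140.4)  # values from the table above
print(f"convection time constant: {tau:.2f} s")
print(f"rise for a hypothetical 2 W load: {steady_state_rise(2.0, 0.1):.2f} K")
```

Under this simplification the package-level time constant spans many 20 ms sampling intervals, so hotspots are shaped mainly by the much faster on-die dynamics.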
Based on the derived thermal profile, it is possible to evaluate the architecture instantiation in terms of different criteria tightly coupled to on-chip temperature. For the scope of this paper, all the solutions that do not meet the selected thermal constraints (maximum temperature and temperature gradient) are eliminated from the exploration space, whereas typical packaging for embedded processors is assumed [
This section provides a number of experimental results derived from the proposed exploration methodology that demonstrate the efficiency of our approach in terms of reducing the consequences of thermal stress. For this purpose, a LEON3-based design is employed [
Note that the functionality of the underlying LEON3 processor is not affected by the additional (replica) blocks.
The majority of aging phenomena are tightly coupled to on-chip temperature values. Hence, higher maximum temperatures lead, among other things, to devices with increased failure rates. For this reason, the first criterion employed in our methodology for selecting the architecture of the target platform is to study how temperature values are spatially distributed over the target device.
Figure
Temperature variation for different instantiations of the target architecture.
The following conclusions can be derived from this figure. More specifically, as we increase the area of the target architecture, the temperature values are reduced (almost monotonically). However, this temperature reduction is not constant, since merely replicating blocks does not guarantee alleviation of thermal stress (as we have already shown in Figure
Apart from area, the maximum operation frequency also affects the onchip temperature values. Based on Figure
Another interesting conclusion might be derived from Figure
Based on this figure, it is possible to select an architecture that better trades off the design constraints. A balanced design solution under the aforementioned criteria (area, maximum temperature, and delay) is the one that replicates four AHB controllers, three integer units, and two cache controllers. This architectural instantiation, referred to as “selected architecture” in upcoming figures, belongs to the solutions marked as valid during the area, timing, and thermal filtering. This architecture is selected for further evaluation because it belongs to the Pareto front for reliability improvement, as discussed in more detail in Section
More specifically, the area and delay overheads for our selected replication-aware LEON3 design are 15% and 7%, respectively, compared to the initial implementation (without any replica blocks). Even though these penalties are not negligible for ASIC designs, they come with considerable gains in terms of maximum temperature (about 17 Kelvin or 8%), which in turn lead to higher reliability. Furthermore, the proposed methodology for selective replication of blocks with increased power densities can also be applied to multicore architectures, where the performance degradation is more affordable.
Apart from the selected architecture, any other architecture instantiation can be chosen without affecting the efficiency of our proposed methodology if different constraints are applied. Note that our framework mainly intends to enable the thermal improvement of architectures through the insertion of replica blocks.
In order to show the importance of proper identification for hardware blocks that have to be replicated, Figure
Results about power density versus maximum temperature.
In order to plot this graph, architectures are grouped into three categories based on the area they occupy, as follows:
those with area smaller than 33% of the maximum area among all the solutions;
those with area between 33% and 66% of the maximum area among all the solutions;
those with area larger than 66% of the maximum area among all the solutions.
Note that this classification with respect to area occupied by different architectures is also applied in upcoming figures (Figures
Results about area versus maximum temperature.
Evaluation in terms of
Since different architectures consist of different replicated blocks, their power densities also vary. As we can conclude from Figure
The last conclusion is very interesting since it shows that even architectures with increased power densities can achieve considerable onchip temperature reduction. This point verifies the argument discussed in Section
In Figure
In order to study the correlation between areas occupied by target architectures and the maximum temperature values, Figure
Based on Figure
Furthermore, the temperature values for architectures with few replicas (less than 33% area overhead) are about 3x higher than those of the remaining solutions. Such high temperatures usually result in increased packaging and cooling costs and run counter to the goal of alleviating hotspots; hence, these solutions are not desirable and can also be eliminated from the exploration space.
This conclusion is very important for determining the number of blocks that have to be replicated. In other words, based on Figure
Reliability is defined as the probability that a device will perform its required function under stated conditions for a specific period of time. Predicting reliability with some degree of confidence depends strongly on defining a number of parameters.
Accelerated life testing employs a variety of high-stress test methods that shorten the life of a product or quicken the degradation of the product's performance. The goal of such testing is to efficiently obtain performance data that, when properly analyzed, provide reasonable estimates of the product's life or performance under normal conditions. This induces early failures that would sometimes manifest themselves in the early years of a product's life, and also allows issues related to design tolerances to be discovered before volume manufacturing. Both the type of stressor and the time under test are used to determine the normal lifetime. Regarding SoC designs, the majority of these aging degradation mechanisms are tightly coupled to on-chip temperature values.
The effect of these stressors can be mathematically determined. Next we model aging acceleration due to thermal stress with the usage of Arrhenius equation (
Based on the values depicted in this figure, we can conclude that the selected architectural instantiation achieves almost the minimum value for
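As a hedged illustration of the Arrhenius-based aging acceleration discussed above, the sketch below uses the standard textbook form of the acceleration factor between two absolute temperatures. The activation energy is an assumed placeholder, so the output is not expected to reproduce the improvement reported in this paper; the function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_ref_k: float, t_stress_k: float,
                           activation_energy_ev: float) -> float:
    """Ratio of the thermally activated failure rate at t_stress_k to that at t_ref_k."""
    if t_ref_k <= 0 or t_stress_k <= 0:
        raise ValueError("temperatures must be positive absolute temperatures")
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_ref_k - 1.0 / t_stress_k)
    return math.exp(exponent)


# 380 K and 363 K are the maximum temperatures reported for the initial and the
# selected architecture; the 0.7 eV activation energy is an assumption.
factor = arrhenius_acceleration(t_ref_k=363.0, t_stress_k=380.0, activation_energy_ev=0.7)
print(f"aging acceleration at 380 K relative to 363 K: {factor:.2f}x")
```

The factor is strongly sensitive to the activation energy, which depends on the failure mechanism under study, so any comparison between architectures should use the same value throughout.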
Apart from the Arrhenius equation, we also evaluate the different architectures derived during our exploration under time-dependent dielectric breakdown (TDDB) [
Figure
Evaluation of different architectures under TDDB.
Based on Figure
This subsection describes the results obtained by applying the proposed methodology to the design of chip multiprocessor architectures. For demonstration purposes, the target multiprocessor consists of four instances of LEON3 (this number is a parameter of our framework and can be tuned based on the designer's requirements), while the replica modules are the same across all LEON3 processors of a given multiprocessor instantiation. As a reference for this study, we employ a multiprocessor architecture composed of the LEON3 instance that was marked as “selected” in the previous figures. In the following figures, this solution is denoted as “reference solution.”
Figure
Thermal profile for 2 × 2 CMP LEON3-based architecture.
Next, we will quantify the efficiency of the above solution when this is used in multiple instantiations of a multiprocessor architecture.
Figure
Normalized power density versus area overhead for multiprocessor LEON3.
In contrast to our conclusion about power density, area has a great impact on the maximum on-chip temperature values. This is also depicted in Figure
Normalized maximum temperature versus area overhead for multiprocessor LEON3.
More specifically, based on this figure we can conclude that a controllable area overhead (e.g., a 20% increase compared to the multiprocessor solution composed of the previously selected LEON3 components) reduces the maximum temperature to almost 0.85x of the corresponding previous value.
Notice that architecture instantiations of the same area exhibit temperature variations due to the different components that are replicated.
In this paper, we propose the adoption of selective replication techniques in order to optimize the thermal behavior of synthesized microprocessor systems targeting an SDR system. We developed an automated exploration methodology that permits the thermal-aware evaluation of various microarchitectural instantiations.
We show that by using selective replication, we can deliver optimized architectural solutions with minimum thermal stress, affordable delay, and user-constrained area overheads. Experimental results have shown a reduction of the maximum on-chip temperature by about 8% (17 Kelvin), at the cost of a 7% delay overhead and a 15% area overhead. Moreover, they have shown that our approach improves resilience against aging phenomena by 14%. These two results show that our approach compares favorably to the conventional design techniques for SoC-based SDR architectures.