
Shortening a product’s time to market and accelerating its development have become vital concerns in embedded system design. Hardware/software (HW/SW) partitioning has therefore become one of the mainstream techniques of embedded system development, since it affects overall system performance. Today’s systems must combine high efficiency with high speed, and our algorithm is designed to meet these demands. In this paper we describe a HW/SW partitioning algorithm that aims to find the best tradeoff between the power and the latency of a system while taking the dark silicon problem into consideration. The algorithm has been tested and has shown its efficiency compared with three well-known heuristic algorithms: Simulated Annealing, Tabu Search, and Genetic algorithms.

The exponential rise of embedded systems, together with the persistent quest for higher performance, has made efficient embedded circuits a necessity. Embedded systems have become pervasive and have penetrated everyday life to a very large extent. They also play a vital role in industrial and military applications, which demand faster and better-performing systems. Unfortunately, most current technologies have increased system capacity and processing speed only at the cost of a considerable increase in power consumption. Excessive power consumption may damage integrated circuits through overheating, limit the degree of transistor integration on a chip, cause signal integrity problems, shorten battery life in portable devices, and require expensive cooling and packaging. Moreover, the strong dependence of leakage power on threshold voltage has limited further threshold and supply voltage scaling. Power consumption therefore rises with technology scaling, to the point where a chip can no longer be cooled economically given the physical limits of cooling and packaging technologies. This gives rise to the dark silicon problem [

Traditionally, partitioning was carried out manually, which requires designers to have detailed knowledge of the circuit’s operation. Such manual approaches were limited to small designs with a small number of constituent blocks [

The rest of the paper is organized as follows: Section

Recently, a new technology that combines logic elements and memory with an intellectual property (IP) processor core has emerged to meet the growing need for better-performing systems. This technology, called System on Programmable Chip (SoPC), facilitates HW/SW partitioning.

Embedded systems generally consist of a programmable software (SW) part and an application-specific hardware (HW) part. The software part is much easier to develop and modify, and it costs less and consumes less power than the hardware part, but it takes longer to produce a result. The hardware part, in contrast, provides better performance because it executes faster. The purpose of HW/SW partitioning is therefore to design a balanced system that satisfies all system constraints [

We consider applications that can be modeled using a data flow graph (DFG). A data flow graph, which is used to create a preliminary overview of the system, is denoted as

A start or an end node

A node that includes simple code

A node that contains the beginning of a control-construct

A node that contains the end of a control-construct

The graph partitioning is to cut the graph into possible partitions

There exist two kinds of partitions:

A control-construct partition that includes a whole construct such as

A mix partition that contains either two or more control-constructs, or one or more control-constructs combined with a simple node (one that contains a simple construct such as an addition operation).

To facilitate the search for control-construct partitions, we use a link parameter. If a node is the beginning or the end of a control-construct, its link value equals 1. For all other node types, the link value equals 0. The link is defined as follows:
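One way to write this definition, consistent with the description above, is given below. The symbols $\mathrm{link}$ and $v_i$ (a node of the graph) are chosen here for illustration and are assumptions about the notation:

$$
\mathrm{link}(v_i) =
\begin{cases}
1, & \text{if } v_i \text{ is the beginning or the end of a control-construct},\\
0, & \text{otherwise.}
\end{cases}
$$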

A task’s execution time and power consumption differ depending on whether it is realized in hardware or in software. We define the following functions

Given a path

The power consumption of the system under a given partitioning is the sum of the power consumption of every node, each realized either in software or in hardware. It can be written as follows:
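The summation described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper’s implementation: the function name `system_power`, the node names, and all power values are hypothetical, and a partitioning is represented simply as the set of nodes moved to hardware.

```python
from typing import Dict, Set

def system_power(sw_power: Dict[str, float],
                 hw_power: Dict[str, float],
                 hw_nodes: Set[str]) -> float:
    """Sum per-node power: hardware value for migrated nodes, software otherwise."""
    return sum(hw_power[n] if n in hw_nodes else sw_power[n] for n in sw_power)

# Hypothetical three-node example (values are illustrative only).
sw_power = {"v1": 0.5, "v2": 0.25, "v3": 0.5}
hw_power = {"v1": 2.0, "v2": 1.0, "v3": 2.0}

all_software = system_power(sw_power, hw_power, set())
v2_in_hardware = system_power(sw_power, hw_power, {"v2"})
print(all_software, v2_in_hardware)
```

Moving a node to hardware replaces its software power term with its hardware power term, so the total rises each time a partition migrates.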

Our algorithm partitions the graph in order to find the best compromise between power and execution time. As noted above, software consumes less power than hardware but takes longer to respond, whereas hardware addresses the timing problem at the cost of higher power consumption. The approach starts with a system implemented entirely in software, which consumes the least power but is the slowest. Whenever a partition of the system migrates to hardware, the system consumes more power and becomes faster. As mentioned previously, our algorithm handles two kinds of partitions. Its first function is to search for all control-construct partitions (Algorithm

// generate the mix partitions

Compute

When all nodes are simple, the partitions are simply all possible combinations of the nodes. Our algorithm (Algorithm

Find all paths

Calculate

Find Cp

Compute

Generate all partitions

Find

Compute

update

Compute function

// choose the best partitions//

TDP

Compute function

To avoid the dark silicon problem, we introduce a constraint called thermal design power (TDP). It refers to the maximum amount of power that can be supplied to a chip while keeping the chip temperature below the thermally safe temperature. If the system power under a specific partition exceeds the TDP, that partition is excluded from the suggested ones, which ensures the good performance and the reliability of the system.
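The TDP constraint amounts to a simple filter over the candidate partitions. The sketch below assumes that the system power of each candidate has already been computed; the function name `filter_by_tdp`, the partition names, and the power values are hypothetical.

```python
from typing import Dict, List

def filter_by_tdp(partition_power: Dict[str, float], tdp: float) -> List[str]:
    """Return the names of partitions whose system power does not exceed the TDP."""
    return [name for name, power in partition_power.items() if power <= tdp]

# Hypothetical system power of four candidate partitions (illustrative values).
partition_power = {"P1": 14.5, "P2": 20.0, "P3": 23.0, "P4": 25.25}

print(filter_by_tdp(partition_power, tdp=26))  # every candidate is kept
print(filter_by_tdp(partition_power, tdp=22))  # P3 and P4 are dropped
```

Lowering the TDP shrinks the set of admissible partitions, so the best tradeoff is then chosen among fewer candidates.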

To further clarify our algorithm, it has been applied on the graph shown in Figure

Node’s parameters.

| | | | | | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

| 3 | 2 | 1 | 2 | 1 | 2 | 3 | 1 | 1 | 3 | 3 | 1 | 5 | 4 | 2 |

| 6 | 4 | 2 | 4 | 2 | 4 | 7 | 2 | 2 | 7 | 6 | 2 | 10 | 9 | 5 |

| 2 | 1 | 2 | 1 | 1 | 2 | 2 | 1 | 1 | 3 | 3 | 1 | 4 | 4 | 5 |

| 0.5 | 0.25 | 0.5 | 0.25 | 0.25 | 0.5 | 0.5 | 0.25 | 0.25 | 1 | 1 | 0.25 | 1 | 1.5 | 2 |

The data flow graph.

The first step consists in finding all possible paths and calculating the software latency of all paths in order to get the critical path. In our case
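The path enumeration and critical path search can be sketched as follows. This is an illustration under stated assumptions: the graph is a small hypothetical directed acyclic graph, not the graph of the example, the latencies are invented, and the critical path is taken to be the path with the largest total software latency.

```python
from typing import Dict, List, Tuple

def all_paths(graph: Dict[str, List[str]], node: str, end: str) -> List[List[str]]:
    """Enumerate every path from `node` to `end` in a directed acyclic graph."""
    if node == end:
        return [[end]]
    return [[node] + rest
            for succ in graph[node]
            for rest in all_paths(graph, succ, end)]

def critical_path(graph: Dict[str, List[str]], sw_time: Dict[str, float],
                  start: str, end: str) -> Tuple[List[str], float]:
    """Return the path with the largest total software latency, and that latency."""
    paths = all_paths(graph, start, end)
    best = max(paths, key=lambda p: sum(sw_time[n] for n in p))
    return best, sum(sw_time[n] for n in best)

# Hypothetical graph and software latencies (illustrative values only).
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
sw_time = {"a": 6, "b": 4, "c": 2, "d": 4}

path, latency = critical_path(graph, sw_time, "a", "d")
print(path, latency)
```

Exhaustive enumeration is exponential in the worst case, which is acceptable for small graphs but would call for a longest-path computation in topological order on larger ones.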

The links of nodes.

| | | | | | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Link | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |

In our case, there exist eleven partitions

The generated partitions.

Partitions | Nodes | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


For each generated partition the latency as well as the consumed power will be calculated using the functions

The calculation results.

| | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|

| 45 | 41 | 45 | 41 | 45 | 39 | 35 | 39 | 39 | 34 | 34 |

| 14.5 | 15.5 | 15.25 | 18.5 | 16.75 | 20 | 23 | 20.75 | 22.25 | 23.75 | 25.25 |

| 0.32 | 0.38 | 0.39 | 0.45 | 0.37 | 0.51 | 0.66 | 0.53 | 0.57 | 0.7 | 0.74 |

| | | | | | | | | | | |

| 0.62 | ||||||||||

Best partition | |

The pace of functions

Note that the TDP value used in the previous example is 26. If the TDP value were only 22, the suggested partitions would be restricted to the set of

To evaluate the efficiency of our algorithm, we carried out a comparative study of the power consumed by a given application and of its execution time under our algorithm and under three existing heuristic algorithms: Simulated Annealing (SA), Tabu Search (TS), and Genetic algorithms (GA). To this end, we applied our approach to the 8-point Discrete Cosine Transform (DCT) [

The characterization of each application.

Application | Number of nodes | Composition
---|---|---
8-point DCT | 42 | 16 multiplication operations, 26 addition operations
16-point DCT | 224 | 128 multiplication operations, 96 addition operations
H.264 | 28 |

The hardware power values are based on the results provided in [

Table

Design result provided by our algorithm.

Application | Item | Value
---|---|---
8-point DCT | Number of partitions | 458
8-point DCT | Best partition, hardware | 12 operations of (×), 19 operations of (+)
8-point DCT | Best partition, software | 4 operations of (×), 7 operations of (+)
8-point DCT | Total power consumed (mW) | 82.35
8-point DCT | System latency (ns) | 453
16-point DCT | Number of partitions | 12512
16-point DCT | Best partition, hardware | 68 operations of (×), 91 operations of (+)
16-point DCT | Best partition, software | 60 operations of (×), 5 operations of (+)
16-point DCT | Total power consumed (mW) | 525.876
16-point DCT | System latency (ns) | 2780
H.264 | Number of partitions | 24
H.264 | Total power consumed (W) | 6.850
H.264 | System latency (ns) | 1530

Tables

Design result provided by Simulated Annealing algorithm.

Application | Item | Value
---|---|---
8-point DCT | Best partition, hardware | 14 operations of (×), 4 operations of (+)
8-point DCT | Best partition, software | 2 operations of (×), 22 operations of (+)
8-point DCT | Total power consumed (mW) | 91.55
8-point DCT | System latency (ns) | 586
16-point DCT | Best partition, hardware | 60 operations of (×), 48 operations of (+)
16-point DCT | Best partition, software | 68 operations of (×), 48 operations of (+)
16-point DCT | Total power consumed (mW) | 487.78
16-point DCT | System latency (ns) | 3024
H.264 | Total power consumed (W) | 6.250
H.264 | System latency (ns) | 1740

Design result provided by Tabu Search algorithm.

Application | Item | Value
---|---|---
8-point DCT | Best partition, hardware | 13 operations of (×), 0 operations of (+)
8-point DCT | Best partition, software | 3 operations of (×), 26 operations of (+)
8-point DCT | Total power consumed (mW) | 86.8
8-point DCT | System latency (ns) | 493
16-point DCT | Best partition, hardware | 40 operations of (×), 44 operations of (+)
16-point DCT | Best partition, software | 88 operations of (×), 52 operations of (+)
16-point DCT | Total power consumed (mW) | 393.712
16-point DCT | System latency (ns) | 3776
H.264 | Total power consumed (W) | 5.276
H.264 | System latency (ns) | 1990

Design result provided by Genetic algorithm.

Application | Item | Value
---|---|---
8-point DCT | Best partition, hardware | 12 operations of (×), 22 operations of (+)
8-point DCT | Best partition, software | 4 operations of (×), 4 operations of (+)
8-point DCT | Total power consumed (mW) | 84.4
8-point DCT | System latency (ns) | 630
16-point DCT | Best partition, hardware | 60 operations of (×), 90 operations of (+)
16-point DCT | Best partition, software | 68 operations of (×), 6 operations of (+)
16-point DCT | Total power consumed (mW) | 501.652
16-point DCT | System latency (ns) | 2896
H.264 | Total power consumed (W) | 6.050
H.264 | System latency (ns) | 1650

To compare our algorithm to the previously mentioned algorithms, we have introduced a comparison metric called

According to (

Design results.

| | Simulated Annealing | Tabu Search | Genetic algorithm | Our algorithm |
---|---|---|---|---|---|---|
8-point DCT | 73.482 | 53.648 | 54.648 | 40.206 | 37.304 | 8.379 |
16-point DCT | 3335 | 1475.04 | 1486.65 | 1558 | 1461.9 | 531.831 |
H.264 | 27462 | 10875 | 10499.24 | 10560 | 10480.5 | 3432.75 |
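As a hedged reading of the table, one can assume that the comparison metric corresponds to the product of total power and latency. This is an assumption, since only the resulting values are available here, and the function name `power_latency_product` is hypothetical. Under that assumption, the total power and latency reported for our algorithm give the following values, expressed in nanojoules:

```python
def power_latency_product(power_w: float, latency_ns: float) -> float:
    """Product of power (W) and latency (ns); the resulting unit is nanojoules (nJ)."""
    return power_w * latency_ns

# Total power and latency reported for our algorithm on each application.
reported = {
    "8-point DCT": (82.35e-3, 453),       # 82.35 mW, 453 ns
    "16-point DCT": (525.876e-3, 2780),   # 525.876 mW, 2780 ns
    "H.264": (6.850, 1530),               # 6.850 W, 1530 ns
}

products = {app: power_latency_product(p, t) for app, (p, t) in reported.items()}
for app, value in products.items():
    print(f"{app}: {value:.3f} nJ")
```

The three products are approximately 37.305, 1461.935, and 10480.5, which agree with the entries 37.304, 1461.9, and 10480.5 in the table to within the last displayed digit.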

Based on Table

Given today’s demand for systems that combine low power consumption with high speed, the need for more efficient embedded systems persists. HW/SW partitioning is one of the most elegant ways to optimize such systems. We have therefore developed a new HW/SW partitioning algorithm that seeks the best tradeoff between power and latency while taking the dark silicon problem into account. The algorithm has been tested against Simulated Annealing, Tabu Search, and Genetic algorithms, and the results indicate that it is the best suited of these for achieving the desired combination of high speed and low power in core-based embedded systems.

The authors declare that they have no competing interests.