Power and Execution Time Optimization through Hardware Software Partitioning Algorithm for Core Based Embedded System

Shortening the marketing cycle of a product and accelerating its development have become vital concerns in embedded system design. Hardware/software (HW/SW) partitioning has therefore become one of the mainstream technologies of embedded system development, since it affects overall system performance. Given today's demand for high efficiency together with high speed, we describe in this paper an HW/SW partitioning algorithm that aims to find the best tradeoff between the power and the latency of a system while taking the dark silicon problem into consideration. The algorithm has been tested and has shown its efficiency in comparison with three well-known heuristic algorithms: Simulated Annealing, Tabu Search, and Genetic algorithms.


Introduction
The exponential rise of embedded systems, along with the persistent quest for higher performance, has made efficient embedded circuits a necessity. Embedded systems have become a leading technology worldwide, since they have penetrated human life to a very large extent. They also play a vital role in industrial and military applications, which demand faster and better performing systems. Unfortunately, most current technologies have increased system capacity to obtain faster processing only at the cost of a considerable increase in power. Excessive power consumption may damage integrated circuits through overheating, limit the degree of transistor integration on a chip, cause signal integrity problems, shorten battery life in portable devices, and require expensive cooling and packaging systems. Moreover, the strong dependence of leakage power on threshold voltage has limited further threshold and supply voltage scaling. Power consumption therefore rises with technology scaling, to the point where the chip can no longer be cooled economically given the physical limits of cooling and packaging technologies. This gives rise to the dark silicon problem [1][2][3]. The concept of dark silicon is based on the constraint that a significant fraction of the transistors on a chip cannot be powered on at nominal voltage for a given thermal design power (TDP) budget and must be power-gated or simply remain dark. The TDP is the maximum amount of power that can be provided to a chip while keeping the chip temperature under the thermally safe temperature. If the TDP is exceeded, the chip temperature rises beyond the cooling capacity and the chip is throttled. Previous studies [1,2] have predicted that 50% to 80% of the chip area will be dark for GPU and CPU based systems.
To overcome these problems, designers have increased their efforts to produce systems that consume less power. In this context, some research groups have focused on creating new architectures at the material level [4], while others have focused on extending battery life [5]. Yet such solutions require substantial resources that many research groups do not have. Other methods have therefore appeared to obtain a system that consumes less power, such as hardware/software partitioning [6,7].
Traditionally, partitioning was carried out manually, which requires designers to have detailed knowledge of the circuit's operation. Such manual approaches were limited to small designs with a small number of constituent blocks [8,9]. Since digital systems have become much more sophisticated, automatic HW/SW partitioning has become a necessity. Many research groups have adopted HW/SW partitioning to increase system performance, as in [10,11]; most of these approaches aim to meet performance constraints while keeping the system cost (area) as low as possible. Unfortunately, none of them took power consumption and execution time into consideration. Hence, we present in this paper an algorithm that searches for an HW/SW partitioning of a data flow graph offering a tradeoff between power and latency while taking the dark silicon problem into account.
The rest of the paper is organized as follows: Section 2 reviews the related literature; the proposed partitioning algorithm is addressed in Section 3, followed by an illustrative example; the numerical experiments and discussion are presented in Section 5; finally, the conclusion summarizes the present findings and outlines future research on this theme.

Related Work
Recently, an alternative technology that combines logic elements and memory with an intellectual property (IP) processor core has emerged to meet the growing need for better performing systems. This technology, called System on Programmable Chip (SoPC), facilitates HW/SW partitioning.
Embedded systems generally consist of a programmable software part (SW) and an application-specific hardware part (HW). The software part is much easier to develop and modify and consumes less power than the hardware part, but it takes longer to produce the final response. Conversely, the hardware is more expensive in terms of cost and power consumption but provides better performance because it processes data faster. For that reason, the purpose of HW/SW partitioning is to design a balanced system that satisfies all system constraints [12]. Most formulations of the HW/SW partitioning problem have been proven to be NP-hard [13,14]. Many exact algorithms have been proposed, such as Branch-and-Bound [15], dynamic programming [16], and integer linear programming [17]. However, these exact algorithms tend to be quite slow for larger inputs. Hence, for larger partitioning problems, most research has relied on heuristic algorithms such as the Genetic algorithm (GA) [18], Tabu Search [19,20], Simulated Annealing [21], Particle Swarm Optimization [22,23], the Ant algorithm [24,25], the shuffled frog leaping algorithm [26], and the greedy algorithm [27]. Other designers have combined two algorithms to solve the HW/SW partitioning problem: the authors of [28] used a hybrid of the Genetic algorithm (GA) and Tabu Search, while the authors of [29] combined Discrete Particle Swarm Optimization (DPSO) with Branch-and-Bound (B&B) for the same purpose. The authors of [30] proposed a heuristic solution based on HW/SW partitioning that aims to reduce the execution time of the overall circuit. The authors of [31] presented IVA-HD, a programmable, true multistandard, full HD video coding engine that adopts HW/SW partitioning to meet the low power and area requirements of the OMAP 4 processor. With the same goal of power optimization, [32] proposed a minimization approach based on mapping clusters of instructions to a core, which yields a high resource utilization rate and thus minimizes power consumption. This method produced a system that consumes less power, at the cost of additional hardware overhead.
The previously mentioned works either optimize one parameter at the expense of another important constraint or focus on only one constraint, such as power or execution time. Moreover, none of them addresses the dark silicon problem. Dark silicon has become a critical issue for designers, since it can decrease reliability in the nano era [33][34][35] and lead to soft errors, aging, and even process variations [36,37]. Recent works have explored the dark silicon problem by applying a very low voltage in order to power on more cores [38] and by proposing new accelerator architectures [39,40]. The majority of works have handled the dark silicon problem at a low level of codesign, which requires good knowledge of the target circuit and lengthens the marketing cycle of the product. Other designers have proposed new architectures that exploit architectural heterogeneity [41][42][43]. However, such solutions require substantial resources that many research groups do not have. In the literature, only a few works have combined high-level synthesis (HLS) with the dark silicon problem, owing to its complexity [44,45]. The dark silicon problem generally appears in multiprocessor systems-on-chip (MPSoC). However, with the rapid growth of SoCs, it must be taken into consideration even for single-core embedded systems [46]. Motivated by this fact and by the shortcomings of earlier research, we developed a new algorithm that aims to produce a faster system that consumes less power without compromising system reliability.

Problem's Definitions
We consider applications that can be modeled using a data flow graph (DFG). A data flow graph, which provides a preliminary overview of the system, is denoted as $G(V, E)$, where $V = \{v_1, v_i, \ldots, v_n\}$ is the set of vertices (nodes), interconnected by the edges $E = \{\{v_i, v_j\}, \{v_j, v_{j+1}\}, \ldots\}$. The edges of the graph represent the dependencies between the components of the system. In general, a node of the graph can represent a basic block [47], a short sequence of instructions [48], a procedure or a function [49], and so on. In this paper, we use four different types of nodes:
(i) A start or an end node $v_i^{startend}$; $v_i^{startend} \in V^{startend}$ and $V^{startend} \subseteq V$.
(ii) A node that includes simple code $v_i^{s}$; $v_i^{s} \in V^{s}$ and $V^{s} \subseteq V$.
(iii) A node that contains the beginning of a control-construct $v_i^{b}$; $v_i^{b} \in V^{b}$ and $V^{b} \subseteq V$.
(iv) A node that contains the end of a control-construct $v_i^{e}$; $v_i^{e} \in V^{e}$ and $V^{e} \subseteq V$.
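As a minimal illustration of the definitions above, the following Python sketch models a DFG with the four node types. The names `NodeKind`, `DFG`, `add_node`, `add_edge`, and `pred` are hypothetical and do not come from the paper, and storing edges as directed dependencies is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeKind(Enum):
    START_END = "startend"  # start or end node of the graph
    SIMPLE = "simple"       # node that holds simple code
    BEGIN = "begin"         # beginning of a control-construct
    END = "end"             # end of a control-construct


@dataclass
class DFG:
    kinds: dict = field(default_factory=dict)  # node id -> NodeKind
    succ: dict = field(default_factory=dict)   # node id -> list of successors

    def add_node(self, node, kind):
        self.kinds[node] = kind
        self.succ.setdefault(node, [])

    def add_edge(self, src, dst):
        # Assumption: an edge means that dst depends on src.
        self.succ[src].append(dst)

    def pred(self, node):
        # Predecessors are recovered by scanning the successor lists.
        return [u for u, vs in self.succ.items() if node in vs]


# A four-node toy graph (hypothetical, not the graph of Figure 1).
g = DFG()
for name, kind in [("v1", NodeKind.START_END), ("v2", NodeKind.BEGIN),
                   ("v3", NodeKind.SIMPLE), ("v4", NodeKind.END)]:
    g.add_node(name, kind)
g.add_edge("v1", "v2")
g.add_edge("v2", "v3")
g.add_edge("v3", "v4")
```

Keeping the node type next to the adjacency lists makes it straightforward to select the nodes that open or close a control-construct when partitions are built.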

Partitions' Types.
Graph partitioning cuts the graph into the possible partitions $P_{all} = \{P_1, P_i, \ldots, P_m\}$, where $P_{all}$ is the set of all possible partitions, $P_i$ is a possible partition, and $m$ is the number of possible partitions.
There exist two kinds of partitions:
(i) A control-construct partition, which includes a whole construct such as if to end if, case to end case, and so on.
(ii) A mix partition, which contains either two or more control-constructs or one or more control-constructs combined with a simple node (a node that contains a simple construct such as an addition operation).

Node's Links.
To facilitate the search for control-construct partitions, we use a link parameter. If a node is the beginning or the end of a control-construct, its link value equals 1. For all other node types, the link value equals 0. The link is defined as follows:
$$\mathrm{Link}(v_i) = \begin{cases} 1 & \text{if } v_i \text{ is the beginning or the end of a control-construct,} \\ 0 & \text{otherwise.} \end{cases} \quad (1)$$
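The link rule can be written in a few lines of Python. This is only a sketch: the string labels used for the node types and the toy node list are hypothetical.

```python
def link(kind):
    """Return 1 for the beginning or end of a control-construct, else 0."""
    return 1 if kind in ("begin", "end") else 0


# Hypothetical node types for a four-node graph.
kinds = {"v1": "startend", "v2": "begin", "v3": "simple", "v4": "end"}
links = {node: link(kind) for node, kind in kinds.items()}
```

Here only `v2` and `v4` receive a link value of 1, since they open and close a control-construct.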

Related Statements.
When a task is realized in hardware or in software, its execution time and power consumption take different values. We define the functions $L_h(P_i)$, $L_s(P_i)$, $Pw_h(P_i)$, and $Pw_s(P_i)$ to represent the hardware latency, the software latency, the hardware power, and the software power, respectively, of a given partition $P_i$. Obtaining exact values of the execution time and power consumption is a challenging problem, but it is beyond the scope of this article; we focus instead on the algorithmic issues of partitioning.
Given a path $p_k : v_i \to v_j$ and a hardware/software partitioning $x$ of all the nodes in $p_k$, the completion time of $p_k$ under $x$ is the sum of all the latencies incurred on $p_k$, taking into consideration the parallel execution of some tasks. The system completion time is defined as the completion time of the critical path Cp of the DFG.
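One common way to obtain the critical path of a DAG is a longest-path computation in topological order, sketched below in Python. The function name `critical_path`, the toy graph, and the latency values are hypothetical; the paper does not state how its critical path is computed.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def critical_path(succ, latency):
    """Return (completion time, node list) of the longest path in a DAG."""
    # graphlib expects a mapping node -> predecessors.
    preds = {v: set() for v in succ}
    for u, vs in succ.items():
        for v in vs:
            preds[v].add(u)
    best = {}  # node -> largest total latency of a path ending at node
    prev = {}  # node -> predecessor on that path
    for v in TopologicalSorter(preds).static_order():
        p = max(preds[v], key=lambda u: best[u], default=None)
        best[v] = latency[v] + (best[p] if p is not None else 0)
        prev[v] = p
    node = max(best, key=best.get)
    total = best[node]
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return total, path[::-1]


# Branches b and c run in parallel, so only the slower one counts.
succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
latency = {"a": 2, "b": 5, "c": 3, "d": 1}
time, path = critical_path(succ, latency)
```

In this toy graph the path a, b, d dominates with a completion time of 8, while the parallel branch through c does not contribute.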
The hardware latency $L_h(P_i)$ and the software latency $L_s(P_i)$ of a target partition are obtained from the latencies of the nodes it contains. For a given $G(V, E)$, we define a vector $x$ that indicates whether each task is realized in hardware or in software: for a node $v_i$, $x(i)$ equals 0 if the task is executed in software and 1 if it is realized in hardware.
The power consumption of the system under a given partitioning is the sum of the power consumption of every node, using its hardware power if the node is realized in hardware and its software power otherwise. To recapitulate, we define the hardware/software partitioning problem as follows: given $G(V, E)$ and a thermal design power, find a partitioning that offers the best tradeoff between the total power and the execution time of the system.
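The power summation described above can be sketched in Python as follows. The function name `total_power`, the dictionary representation of the mapping vector, and all numeric values are assumptions made for illustration.

```python
def total_power(x, hw_power, sw_power):
    """Sum per-node power: hardware value if x[v] == 1, software value if 0."""
    return sum(hw_power[v] if x[v] == 1 else sw_power[v] for v in x)


# Hypothetical per-node power values.
hw_power = {"v1": 4.0, "v2": 6.0, "v3": 5.0}
sw_power = {"v1": 0.5, "v2": 0.5, "v3": 0.5}

# Mapping vector: v1 stays in software, v2 and v3 move to hardware.
x = {"v1": 0, "v2": 1, "v3": 1}
power = total_power(x, hw_power, sw_power)  # 0.5 + 6.0 + 5.0
```

Comparing `power` with the thermal design power then tells whether this mapping is admissible.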

Proposed Algorithm
Our algorithm partitions the graph in order to find the best compromise between power and execution time. Software generally consumes less power than hardware but takes longer to respond, while hardware solves the timing problem at the cost of higher power consumption. The approach starts with a system implemented entirely in software, which consumes the least power but is the slowest. Whenever a partition of the system migrates to hardware, the system consumes more power and becomes faster. As mentioned previously, our algorithm handles two kinds of partitions. It first searches for all control-construct partitions (Algorithm 2) and then builds the mix partitions. After that, it forms all possible combinations of the generated partitions (Algorithm 1).
If all nodes are simple, the partitions are simply all possible combinations of the nodes. Our algorithm (Algorithm 3) is based on three functions. $F_1$ provides the total latency of the system for each generated partition, while $F_2$ computes the total power consumed by the system under a given partition. When the algorithm comes close to the best solution, it returns an interval that contains several suggested partitions. In the definitions of $F_1$ and $F_2$, the value associated with a partition $P_i$ is the sum of the corresponding per-node values over all $v_j \in P_i$, and $L_s(Cp(i))$ and $L_h(Cp(i))$ are the software and the hardware latency of the critical path of the partition $i$, respectively.

// generate the mix partitions
For all $P_i \in P_{all}$
  If $\forall v_j \in$ Link(pred($v_{start}$) = 0) then // $v_{start}$ is the start node of the control-construct partition

Begin
  For all $v_i \in V$
    Compute Link($v_i$)
  End for
  $k = 1$; $k \in \mathbb{N}$
  // generation of all control-construct partitions
  For all $v_i \in$ …:
  End if
  End for
Algorithm 2: Generation of control-construct partitions.
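The effect of migrating one partition to hardware can be sketched in Python as follows. The sketch assumes that the latency function is the critical-path latency once the nodes of the partition run in hardware and that the power function is the sum of per-node powers; the function name `evaluate`, the toy graph, and all values are hypothetical.

```python
from functools import lru_cache


def evaluate(partition, succ, sw_lat, hw_lat, hw_pow, sw_pow):
    """Return (latency, power) when only `partition` runs in hardware."""
    lat = {v: hw_lat[v] if v in partition else sw_lat[v] for v in succ}

    @lru_cache(maxsize=None)
    def longest_from(v):
        # Longest latency of any path that starts at v.
        return lat[v] + max((longest_from(w) for w in succ[v]), default=0)

    latency = max(longest_from(v) for v in succ)
    power = sum(hw_pow[v] if v in partition else sw_pow[v] for v in succ)
    return latency, power


succ = {"a": ("b", "c"), "b": ("d",), "c": ("d",), "d": ()}
sw_lat = {"a": 2, "b": 5, "c": 3, "d": 1}
hw_lat = {"a": 1, "b": 2, "c": 1, "d": 1}
hw_pow = {"a": 3, "b": 4, "c": 2, "d": 1}
sw_pow = {"a": 0, "b": 0, "c": 0, "d": 0}  # software power taken as negligible

all_software = evaluate(set(), succ, sw_lat, hw_lat, hw_pow, sw_pow)
off_critical = evaluate({"c"}, succ, sw_lat, hw_lat, hw_pow, sw_pow)
on_critical = evaluate({"b"}, succ, sw_lat, hw_lat, hw_pow, sw_pow)
```

Moving `c`, which lies off the critical path, raises the power without reducing the latency. Moving `b` shortens the latency, and the critical path then switches to the branch through `c`.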
To avoid the dark silicon problem, we introduce a constraint called the thermal design power (TDP). This constraint is the maximum amount of power that can be provided to a chip while keeping the chip temperature under the thermally safe temperature. If the system power under a specific partition exceeds the TDP, that partition is excluded from the suggested ones, which preserves the performance and the reliability of the system. A third function, $F_3$, is introduced to help decide which of the suggested partitions is the best one. The best solution is the one closest to a target value that depends on $s$, the number of suggested solutions.
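The TDP constraint acts as a simple filter on the candidate partitions, as the Python sketch below shows. The function name `suggest`, the partition labels, and the (latency, power) pairs are hypothetical and are not taken from the paper's tables; the final choice among the suggested partitions is not covered here.

```python
def suggest(candidates, tdp):
    """Keep the partitions whose total power stays below the TDP."""
    return {name: (lat, pw)
            for name, (lat, pw) in candidates.items()
            if pw < tdp}


# Hypothetical (latency, power) pairs for four candidate partitions.
candidates = {
    "P_a": (40, 12),
    "P_b": (31, 21),
    "P_c": (27, 25),
    "P_d": (24, 30),
}
loose = suggest(candidates, tdp=26)
tight = suggest(candidates, tdp=22)
```

Lowering the budget from 26 to 22 removes `P_c` from the suggestions, which illustrates how the TDP value shapes the set from which the best partition is picked.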

Illustrative Example
To further clarify our algorithm, we apply it to the graph shown in Figure 1. The constraints of the nodes $V = \{v_1, v_i, \ldots, v_{15}\}$ are presented in Table 1. The first step consists in finding all possible paths and calculating the software latency of each path in order to obtain the critical path. In our case, the critical path has a software latency of 45. The second step is to compute the link of each node, as presented in Table 2. Then, all possible partitions are generated.
For each generated partition, the latency and the consumed power are calculated using the functions $F_1$ and $F_2$, respectively (Table 4). For instance, $P_1$ does not belong to Cp, so the latency of the system remains 45. However, for partitions whose nodes belong to Cp, such as $P_2$, the latency of the graph can change, and in such cases the critical path may change as well. The final decision is taken using $F_3$. The evolution of the functions $F_1$ and $F_2$ and the best partition are shown in Figure 2.
Table 3: The generated partitions.
The TDP value used in the previous example is 26. If the TDP value were only 22, the suggested partitions would be the set $\{P_6, P_8\}$ and the best partition would be $P_6$ instead of $P_7$.

Experiment Results
To demonstrate the efficiency of our algorithm, we carried out a comparative study of the power consumed by a given application and its execution time under our algorithm and under three existing heuristic algorithms: Simulated Annealing (SA), Tabu Search (TS), and Genetic algorithms (GA). We applied our approach to the 8-point Discrete Cosine Transform (DCT) [50] (Figure 6), the 16-point DCT [51] (Figure 7), and H.264 [52]. The DCT is the most computationally intensive part of the CLD algorithm. H.264 is a video coding format and one of the most widely used formats for compressing, recording, and distributing video content. The characteristics of these three applications are presented in Table 5.
The hardware power values are based on the results provided in [53] and [54] for the DCT and H.264, respectively. The software power is almost negligible compared with the hardware power. The latency of the hardware (FPGA) is one-third to one-fifth of the latency of the software (processor) [55]. The TDP value is 7 W for the rest of the comparison.
Table 6 illustrates the design results provided by our approach for the 8-point DCT, 16-point DCT, and H.264 applications.

Comparison with Other Algorithms and Discussion.
Tables 7, 8, and 9 summarize the design results provided by the Simulated Annealing, Tabu Search, and Genetic algorithms, respectively. To compare our algorithm with these algorithms, we introduce a comparison metric called Ω. The worst case $C_{worst}$ occurs when the system consumes the highest power (purely hardware) and takes the longest time (purely software); we assume that $C_{worst} = Pw_{max} \times L_{max}$. The best case $C_{best}$ occurs when the system consumes the least power and responds the fastest: $C_{best} = Pw_{min} \times L_{min}$.
With these bounds, $C_{best} < \Omega < C_{worst}$. According to (9), the closer Ω is to $C_{best}$, the better the solution. Table 10 presents the different values of Ω.
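The bounds and the comparison can be sketched in Python as follows. The sketch assumes that Ω is the product of the power and the latency of a solution and that a lower value is better. That assumption is consistent with the two bounds, but the functions `omega` and `omega_bounds` and the design points are hypothetical.

```python
def omega_bounds(powers, latencies):
    """Return (best, worst) as min*min and max*max of power and latency."""
    best = min(powers) * min(latencies)
    worst = max(powers) * max(latencies)
    return best, worst


def omega(power, latency):
    # Assumption: the metric is the power-latency product of one solution.
    return power * latency


# Hypothetical design points: (power in W, latency in ms).
designs = {"A": (3.0, 40.0), "B": (5.0, 20.0), "C": (6.5, 18.0)}
powers = [p for p, _ in designs.values()]
lats = [t for _, t in designs.values()]

best, worst = omega_bounds(powers, lats)
scores = {name: omega(p, t) for name, (p, t) in designs.items()}
winner = min(scores, key=scores.get)  # the score closest to the best case
```

Since every score lies between the two bounds, the solution with the smallest score is also the one closest to the best case.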
Based on Table 10, our algorithm yields the lowest values of Ω, which are the closest to the best case, whereas the values of Ω obtained with the Simulated Annealing, Tabu Search, and Genetic algorithms are higher and farther from the best case. We therefore conclude that our algorithm offers the best tradeoff between the two critical parameters: power consumption and system execution time.

Conclusion
Given today's requirements for systems that consume less power while running at high speed, the need for more efficient embedded systems persists. HW/SW partitioning is one of the most elegant ways to optimize a system. We have therefore developed a new algorithm based on HW/SW partitioning that seeks the best tradeoff between power and latency while taking the dark silicon problem into account. Our algorithm has been compared with the Simulated Annealing, Tabu Search, and Genetic algorithms, and, as the results illustrate, it is well suited to achieving the desired combination of high speed and low power in core based embedded systems.

Figure 1: The data flow graph.

Figure 2: The evolution of the functions $F_1$ and $F_2$.
// $v_{end}$ is the end node of the control-construct partition
$P_k \leftarrow \{\ldots;\ \mathrm{succ}(v_{end})\}$

Table 2: The links of nodes.
new) is the set of the new latency of each path
For all $P_i$ do
  Compute …; // the new latency of the path
  … = $F_1(P_i)$;
  TDP ← the dark power
For all $P_i \in P_{all}$ do
  For all $v_h \in P_i$ do
    if ($v_h \in Cp$ and $v_{h+1} \notin Cp$) then
    if $F_2(P_i) <$ TDP then $P_{suggested}$ ←

Table 4: The calculation results.

Table 5: The characterization of each application.

Table 6: Design result provided by our algorithm.

Table 7: Design result provided by Simulated Annealing algorithm.

Table 8: Design result provided by Tabu Search algorithm.

Table 9: Design result provided by Genetic algorithm.