An Efficient Technique for Hardware/Software Partitioning Process in Codesign

Codesign methodology deals with the problem of designing complex embedded systems, where automatic hardware/software partitioning is one key issue. Research efforts on this issue have focused on exploring new automatic partitioning methods that consider only binary or extended partitioning problems. The main contribution of this paper is to propose a hybrid FCMPSO partitioning technique, based on the Fuzzy C-Means (FCM) and Particle Swarm Optimization (PSO) algorithms, suitable for mapping embedded applications onto both binary and multicore target architectures. Our FCMPSO optimization technique has been evaluated using different graphical models with a large number of instances. Performance analysis reveals that FCMPSO outperforms the PSO algorithm as well as the Genetic Algorithm (GA), Simulated Annealing (SA), Ant Colony Optimization (ACO), and FCM standard metaheuristic-based techniques, and also hybrid solutions including PSO then GA, GA then SA, GA then ACO, ACO then SA, FCM then GA, FCM then SA, and finally ACO followed by FCM.


Introduction
The hardware/software partitioning process is a crucial task of the codesign methodology. It is concerned with deciding which functions are to be implemented in hardware components and which ones in software components. This partitioning process aims at finding an optimal trade-off between conflicting requirements to improve system performance.
Recently, different optimization methods have been undertaken to automate the hardware/software partitioning process.
These optimization methods can be split into exact and heuristic methods. The exact methods, such as Integer Linear Programming (ILP) [1], dynamic programming [2,3], and branch-and-bound [4], work effectively for smaller graphs with several tens of nodes. The heuristic methods, however, produce near-optimal solutions even for larger inputs. The heuristic methods can also be iterative or constructive. The iterative methods, such as PSO [5], the Genetic Algorithm (GA) [6], Ant Colony Optimization (ACO) [7], Simulated Annealing (SA) [8], Fiduccia-Mattheyses [9], Kernighan-Lin [10], and Tabu Search (TS) [11], attempt to modify a given solution until no improvement can be made. In contrast, the constructive methods, such as greedy and hierarchical clustering [12], generate a small number of solutions starting from an initial partitioning by selecting and adding components to the partial solution until a complete solution is obtained.
Designers of embedded systems aim at achieving better partitioning solutions by combining existing optimization methods. They propose to combine partitioning algorithms in order to generate optimal partitioning solutions in a reduced time. In the literature, designers have combined the GA and TS algorithms [13], the PSO and branch-and-bound algorithms [14], the GA and SA algorithms [15], and the GA and PSO algorithms [16]. The reported results prove that these combinations produce more accurate solutions than the classical algorithms in terms of cost and execution time. In [17], the authors consider reliability, in addition to the cost and time metrics, as a factor when solving the partitioning problem. They propose to combine the recursive and linear programming algorithms.

Scientific Programming
Constructive algorithms are usually suggested to be integrated with iterative algorithms to increase the quality of the generated solution. For example, the authors in [18] propose an algorithm based on clustering to make the GA algorithm perform better on larger-scale embedded systems. The proposed algorithm overcomes the shortcoming that the GA execution time is too long to achieve good results in system partitioning.
In this work, a new hybrid method combining clustering FCM algorithm and the PSO algorithm called "FCMPSO" algorithm is proposed. Experimental results indicate that the FCMPSO algorithm is superior to GA, SA, ACO, FCM, and PSO standard algorithms and PSO-GA, GA-SA, GA-ACO, ACO-SA, FCM-GA, FCM-SA, and ACO-FCM hybrid techniques for both binary and extended partitioning approaches.
This paper is organized as follows. In Section 2, related work on hardware/software partitioning techniques is introduced. The benchmarking scenario model constructed for the partitioning problem is described in Section 3. In Sections 4 and 5, the formulation of the hardware/software partitioning problem as a binary and then as an extended approach is presented. Experimental results and the comparison of the proposed FCMPSO algorithm with standard and hybrid partitioning techniques are discussed in Section 6. Finally, Section 7 concludes the paper with a brief summary of the present work.

Optimization Algorithms Used to Solve Partitioning Problems
This section provides some detailed notations and definitions of the PSO algorithm, the FCM algorithm, and our proposed FCMPSO algorithm.

PSO Algorithm.
PSO is a stochastic, iterative, population-based evolutionary optimization algorithm. It was originally introduced by Kennedy and Eberhart in 1995 and later refined with an inertia weight by Shi and Eberhart [19]. It uses swarm intelligence based on social-psychological and biological principles. By analogy with swarm intelligence, each swarm member (particle) exploits a private memory and a degree of randomness in its movement, as well as the knowledge gained by the whole swarm, to discover the best available food source. The food-search problem is solved by optimizing a fitness function. The communication structure (or social network) is obtained by assigning neighbors to each particle. All particles in the search space have fitness values, evaluated by the fitness function to be optimized, and velocities which direct their motion in the multidimensional search space. Each particle remembers its best solution and its position in the search space, and both are available to its neighbors. To update its position and velocity appropriately, each particle holds in memory its personal best position pbest_i, the best solution the particle has seen by itself, and the global best position gbest, which the particle acquires by communicating with a subset of the swarm. The velocity v_i and position x_i of the i-th particle are updated with the following equations:

v_i(t + 1) = w·v_i(t) + c_1·r_1·(pbest_i(t) − x_i(t)) + c_2·r_2·(gbest(t) − x_i(t))  (1)
x_i(t + 1) = x_i(t) + v_i(t + 1)  (2)

where w is the inertia factor, which takes linearly decreasing values from 1 down to 0 over a predefined number of iterations, v_i(t) is the velocity, x_i(t) is the current solution (or position), r_1 and r_2 are uniform random numbers in the range [0, 1], and c_1 and c_2 are positive constant parameters called "acceleration coefficients."
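One PSO iteration following equations (1) and (2) can be sketched in Python as follows; this is a minimal illustration, not the paper's Matlab implementation, and the default parameter values are assumptions:

```python
import random

def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=2.0, c2=2.0):
    """One PSO iteration: update each particle's velocity, then its position.

    positions, velocities, pbest: lists of equal-length float lists;
    gbest: the best position seen so far by the whole swarm.
    """
    for i in range(len(positions)):
        for d in range(len(positions[i])):
            r1, r2 = random.random(), random.random()
            # velocity update, equation (1)
            velocities[i][d] = (w * velocities[i][d]
                                + c1 * r1 * (pbest[i][d] - positions[i][d])
                                + c2 * r2 * (gbest[d] - positions[i][d]))
            # position update, equation (2)
            positions[i][d] += velocities[i][d]
    return positions, velocities
```

Note that when a particle already sits at both its personal best and the global best with zero velocity, the update leaves it unchanged, which is the expected fixed point of equations (1) and (2).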

FCM Algorithm.
The FCM algorithm is a deterministic, constructive optimization algorithm. Different studies prove that FCM outperforms several existing clustering algorithms, namely, the Self-Organizing Map (SOM) neural network algorithm [20], the K-means algorithm [21], and hierarchical clustering [20]. It is the most popular fuzzy clustering method, originally proposed by Dunn [22] and later improved by Bezdek [23]. The FCM algorithm is efficient, straightforward, and easy to implement. It is based on fuzzy behavior and provides a natural technique for producing clusterings in which membership weights have a natural, but not probabilistic, interpretation. The main goal of FCM is to minimize an objective function that takes into account the similarity between elements and cluster centers.
Suppose the FCM algorithm aims at finding a prototype matrix V and a membership degree matrix U that minimize the objective function J(U, V), called the "fitness function":

J(U, V) = Σ_{i=1..n} Σ_{j=1..c} u_ij^m · ||x_i − v_j||²  (3)

The prototypes that minimize the objective function are updated using the following equation:

v_j = Σ_{i=1..n} u_ij^m · x_i / Σ_{i=1..n} u_ij^m  (4)

The membership degrees that minimize the objective function are updated according to the following equation:

u_ij = 1 / Σ_{k=1..c} (||x_i − v_j|| / ||x_i − v_k||)^{2/(m−1)}  (5)

where m > 1 is the level of cluster fuzziness. In the limit m → 1, the membership degrees converge to 0 or 1, which implies a crisp partitioning.
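A minimal sketch of these two update rules, assuming Euclidean distances and NumPy arrays (an illustration, not the paper's implementation):

```python
import numpy as np

def fcm_update(X, V, m=2.0, eps=1e-9):
    """One FCM iteration: recompute memberships U, then prototypes V.

    X: (n, d) data points; V: (c, d) cluster prototypes; m > 1 is the
    fuzziness level (m -> 1 drives memberships toward crisp 0/1 values).
    """
    # distance from every point to every prototype, shape (n, c)
    dist = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + eps
    # membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
    ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
    U = 1.0 / ratio.sum(axis=2)
    # prototype update: v_j = sum_i u_ij^m x_i / sum_i u_ij^m
    Um = U ** m
    V_new = (Um.T @ X) / Um.sum(axis=0)[:, None]
    return U, V_new
```

By construction each row of U sums to 1, and points receive higher membership in the cluster whose prototype is nearer.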
The FCM algorithm is effective. It is faster than the PSO algorithm because it requires fewer function evaluations, but it is sensitive to initial values and usually falls into local optima. The complementary weaknesses of these two algorithms motivate the proposal of an alternative approach based on the combination of the FCM and PSO algorithms to form a novel FCMPSO algorithm which maintains the merits of both.

FCMPSO Algorithm.
In this work, the FCMPSO algorithm is proposed to solve both the binary and the extended hardware/software partitioning problems. This algorithm combines the advantages of both the PSO and FCM algorithms: the PSO algorithm has a strong global search capability, while the FCM algorithm produces approximate solutions faster but easily falls into local optima. Hence, we integrated the FCM algorithm with the PSO algorithm to provide a near-optimal solution at a faster speed.
First, we apply the FCM algorithm to create the fuzzy initial partitioning solutions in order to reduce (limit) the search space of the PSO algorithm. Then, we execute the PSO algorithm to obtain a near-optimal partitioning solution.
The pseudocode of our FCMPSO algorithm is presented in Algorithm 1.

Algorithm 1: FCMPSO pseudocode.
  Run the FCM algorithm to produce the initial swarm
  Repeat until the stopping criteria are met (t ≥ T):
    For each particle i:
      (1) Update the particle's velocity as in (1)
      (2) Update the particle's position as in (2)
      If F.F.(x_i(t)) < F.F.(pbest_i):
        (1) Update the best known position of particle i: pbest_i = x_i(t)
        If F.F.(pbest_i) < F.F.(gbest):
          (1) Update the swarm's best position: gbest = pbest_i

The complexity analysis of the original PSO and FCM algorithms is as follows:

(i) The PSO algorithm is O(S·T), where S is the swarm population size and T is the maximum number of iterations.
(ii) The FCM algorithm is O(n·c·T), where n is the number of objects, T is the maximum number of iterations, and c is the number of cluster centers. In our case, c = 1.
(iii) The hybrid FCMPSO algorithm is O(n + D·T), where D is the dimension of the population issued from the FCM algorithm, n is the number of objects, and T is the maximum number of iterations.
As can be seen, the computational complexity of the FCMPSO algorithm is mainly affected by the number of objects and by the dimension of the population issued from the FCM algorithm. The disadvantages of the PSO algorithm are that it easily falls into local optima when the population dimension is high and that it has a low convergence rate in the iterative process. It can also be observed that the computational complexity of the hybrid FCMPSO remains acceptable when it is applied to solve high-dimensional and complex problems.
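The two-phase flow described above (FCM seeding, then PSO refinement) can be sketched as follows. The fitness signature, the clamping of positions to [0, 1], and the exact inertia schedule are assumptions made for illustration; only the overall structure follows the algorithm described in the text:

```python
import random

def fcmpso(fitness, seed_positions, iters=100, w=0.9, c1=2.0, c2=2.0):
    """Two-phase FCMPSO sketch: positions produced by the FCM phase seed
    the PSO swarm, then a standard PSO loop refines them.

    fitness: maps a position (list of floats in [0, 1]) to a cost to minimize.
    seed_positions: initial positions issued from the FCM clustering phase.
    """
    swarm = [list(p) for p in seed_positions]
    vel = [[0.0] * len(p) for p in swarm]
    pbest = [list(p) for p in swarm]
    pcost = [fitness(p) for p in swarm]
    g = min(range(len(swarm)), key=lambda i: pcost[i])
    gbest, gcost = list(swarm[g]), pcost[g]
    for t in range(iters):
        wt = w * (1.0 - t / iters)          # linearly decreasing inertia
        for i, x in enumerate(swarm):
            for d in range(len(x)):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (wt * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - x[d])
                             + c2 * r2 * (gbest[d] - x[d]))
                x[d] = min(1.0, max(0.0, x[d] + vel[i][d]))
            c = fitness(x)
            if c < pcost[i]:                # personal best update
                pbest[i], pcost[i] = list(x), c
                if c < gcost:               # swarm best update
                    gbest, gcost = list(x), c
    return gbest, gcost
```

Because the global best is only ever replaced by a cheaper solution, the returned cost is never worse than the best FCM seed, which mirrors the claim that FCM narrows the PSO search space without degrading solution quality.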
Our proposed FCMPSO algorithm will be tested on two kinds of partitioning approaches: binary and extended. The main difference between these two kinds of architectures lies in the number and types of the devices used. In the binary partitioning approach, the target architecture includes a single hardware processing unit and a single software processing unit, or a reconfigurable architecture. In the extended partitioning approach, however, the target architecture includes multiple hardware processing components and several software processing components.
Before starting the hardware/software partitioning process, it is necessary to transform the initial specification into a formal specification. The benchmarking scenario model used to validate our proposed hardware/software partitioning approach is presented in the next section.

Benchmarking Scenario: Task Graphs
Different benchmarks and applications are used to validate hardware/software partitioning approaches. These applications and benchmarks vary from one another. The embedded application to be partitioned is generally given as a Directed Acyclic Graph (DAG) that represents the sequence of nodes in the embedded system application.
In this work, we use the Task Graphs for Free (TGFF) tool to generate a set of graphs of 20, 50, 100, 200, 500, 1000, and 2000 nodes. Each graph is denoted by G(V, E), where V = {v_1, ..., v_n} is the set of tasks and E = {e_ij | 1 ≤ i, j ≤ n} is the set of edges representing the data dependencies between nodes. The partitioning process aims to find a partition P = (V_H, V_S) such that V_H ∪ V_S = V and V_H ∩ V_S = ∅. It generates a decision vector X = {x_1, x_2, ..., x_n} representing the implementation of the task nodes.
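The partition P = (V_H, V_S) and its two consistency conditions can be expressed directly from the decision vector; this small helper is an illustrative sketch, not part of the TGFF tool:

```python
def partition(tasks, decision):
    """Split the task set V into (V_H, V_S) from a 0/1 decision vector X:
    x_i = 1 assigns task i to hardware, x_i = 0 assigns it to software.
    """
    v_h = {t for t, x in zip(tasks, decision) if x == 1}
    v_s = {t for t, x in zip(tasks, decision) if x == 0}
    # the two conditions from the text: V_H ∪ V_S = V and V_H ∩ V_S = ∅
    assert v_h | v_s == set(tasks) and not (v_h & v_s)
    return v_h, v_s
```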
Such an approach is necessary to avoid dependency on a particular system architecture by varying parameters. The variations in the number of input/output nodes, the node metrics (i.e., execution time, cost, area, and power), and the number of software/hardware processors are randomly assigned in the TGFF input configuration file.
The obtained graphs are used as a system specification to validate the efficiency of our proposed partitioning algorithm to solve both binary and extended partitioning problems.

Hardware/Software Partitioning as a Binary Problem
This section provides the formal description of the binary partitioning problem, especially the target architecture used and the mathematical model of the objective function and the related constraints. The target architecture consists of one software processor and one hardware unit that communicate with each other through a shared bus, as shown in Figure 1.

Model Formulation.
In the binary partitioning approach, each node must take a value of 0 or 1: the value is 1 if the node is assigned to a hardware component; otherwise, it is 0, indicating that the task is assigned to a software component. Each node in the DAG is associated with cost parameters, namely, (i) software costs (software execution time T_S, memory requirement M), (ii) hardware costs (hardware execution time T_H, used slices rate constraint S_H), and (iii) communication cost (C_SH). The last cost refers to the delay required to transfer data from a hardware node to a software node and vice versa. Given identical parameters and seeds, TGFF generates identical task graphs. These random DAGs represent applications using the following input parameters.
(1) Target Architecture. One software processor and one hardware processor connected over a communication bus.
(2) Software Constraints. They are given as a "software execution time" constraint (T_S), fixed between 200 and 400 μs, and a "memory" constraint (M), fixed between 0 and 20 MB.
(3) Hardware Constraints. They are given as a "hardware execution time" constraint (T_H), fixed between 75 and 225 μs, and a "used slices rate" constraint (S_H), fixed between 50 and 150 slices.
The parameters generated from the TGFF input files for our DAG graphs are presented in Table 1.
In Table 1, AllTimeSw means the time when all nodes are implemented in software, while AllTimeHw means the time when all nodes are implemented in hardware. AllCostSw means the memory requirement when all nodes are implemented in software. AllCostHw means the hardware resource utilization when all nodes are implemented in hardware.
The communication costs required between hardware and software tasks are much smaller than the task processing times, so they are neglected for simplicity.

Table 1: Parameters of the generated task graphs.

Nodes      20    50     100    200    500     1000    2000
AllTimeSw  5761  18526  28476  58162  4230    283800  564796
AllCostSw  190   403    949    1745   142338  8463    16719
AllTimeHw  2821  7151   13842  28626  45920   139221  272813
AllCostHw  1862  4532   9727   18737  69260   92294   183506

Mathematical Constraints

(i) Execution Time Constraint. T_S and T_H present the execution times of, respectively, the software-implemented and hardware-implemented solutions. They can be expressed as follows:

T_S = Σ_{i=1..n} (1 − x_i)·t_S,i,  T_H = Σ_{i=1..n} x_i·t_H,i

where n is the total number of tasks in the system and t_S,i and t_H,i are the software and hardware execution times, respectively, of the i-th task.
x_i presents a binary variable: its value is 0 if the i-th task is assigned to a software component and 1 if it is assigned to a hardware component.
(ii) The Memory Requirement. M presents the memory requirement only for the components assigned to software. The total memory is obtained as follows:

M = Σ_{i=1..n} (1 − x_i)·C_Sw,i

where C_Sw,i is the software cost of the i-th task.
(iii) The Used Slices Rate. S_H presents the number of slices that a partitioning solution will use in a particular hardware component. The total used slices rate is obtained as follows:

S_H = Σ_{i=1..n} x_i·C_Hw,i

where C_Hw,i is the hardware cost of the i-th task. Partitioning algorithms try to find a trade-off between these conflicting constraints to improve the system performance. This can be modeled as the minimization of the objective function (F.F.):

F.F. = T_S/AllTimeSw + T_H/AllTimeHw + M/AllCostSw + S_H/AllCostHw

The target architecture has generally been assumed to consist of only one software and only one hardware unit. In recent years, many researchers have committed to solving the extended hardware/software partitioning problem for multiprocessor systems with high-quality solutions. In the next section, we introduce the formal description of the extended hardware/software partitioning problem proposed in this work.
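The binary objective can be sketched as follows, assuming each constraint is normalized by its all-software or all-hardware bound from Table 1 with equal weights (the exact weighting is an assumption):

```python
def binary_fitness(x, t_s, t_h, c_sw, c_hw, norms):
    """Evaluate a binary-partitioning objective: sum each constraint over
    the tasks, then normalize by the corresponding AllTimeSw / AllTimeHw /
    AllCostSw / AllCostHw bound. x[i] is 1 for hardware, 0 for software.
    """
    n = len(x)
    T_S = sum((1 - x[i]) * t_s[i] for i in range(n))   # software time
    T_H = sum(x[i] * t_h[i] for i in range(n))         # hardware time
    M = sum((1 - x[i]) * c_sw[i] for i in range(n))    # memory requirement
    S_H = sum(x[i] * c_hw[i] for i in range(n))        # used slices
    return (T_S / norms["AllTimeSw"] + T_H / norms["AllTimeHw"]
            + M / norms["AllCostSw"] + S_H / norms["AllCostHw"])
```

For an all-software solution only the T_S and M terms contribute; for an all-hardware solution only T_H and S_H do, so the objective rewards balancing the two sides.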

Hardware/Software Partitioning as an Extended Problem
This section provides the formal description of the extended architecture.

Architecture Representation.
The architectural model of the heterogeneous multiprocessor considered in this section consists of two general-purpose processors and two application-specific components, denoted by W = {Sw1, Sw2, Hw1, Hw2}. Each processor has a local memory (LM) used for intertask communication on the same processor. Data in a shared memory is accessed by both the hardware and the software parts, as presented in Figure 2.

Model Formulation.
In the extended partitioning problem, a node must take a value between 0 and N − 1, where N presents the number of processing units. Software nodes are performed by Sw1 and Sw2. The data communication time between tasks is much smaller than the task processing time, so it can be omitted for simplicity. Hardware nodes, however, are implemented on an FPGA or an ASIC. Each node is associated with the following cost parameters: (i) hardware/software execution times (T_Hw1/T_Hw2/T_Sw1/T_Sw2) and (ii) used slices rate constraints (S_H1/S_H2).
A 20-node random task graph is generated with TGFF using the parameters denoted in Table 2.
In this work, the communication time (C_SH) between two different components is neglected.

Mathematical Constraints Formulation.
In this work, the fitness function (F.F.) to minimize is defined as follows:

F.F. = T_Sw1/AllTimeSw1 + T_Sw2/AllTimeSw2 + T_Hw1/AllTimeHw1 + T_Hw2/AllTimeHw2 + M_S/AllCostSw + S_H/AllCostHw

where we define the following.

(i) Execution Time. For each processing unit p in W, T_p is the sum of the execution times t_p,i of the tasks assigned to p, where N is the number of processors (here N = 4) and x_i presents a two-bit variable whose value is "00" if the i-th task is assigned to the Sw1 component, "01" if it is assigned to the Sw2 component, "10" if it is assigned to the Hw1 component, and "11" if it is assigned to the Hw2 component.
(ii) Used Slices Rate. S_H presents the number of slices that a partitioning solution will use in a particular hardware component. The total used slices rate is obtained by summing, over the tasks assigned to the hardware components, the hardware costs C_Hw,i, where n is the total number of tasks in the system and C_Hw,i is the hardware cost of the i-th task.
(iii) The Memory Requirement. M_S presents the memory requirement only for the components assigned to software. The total memory is obtained by summing, over the tasks assigned to the software components, the software costs C_Sw,i, where C_Sw,i is the software cost of the i-th task.
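The extended objective can be sketched by summing normalized time and cost contributions per processing unit; the per-unit AllTime/AllCost normalizing bounds and the equal weighting are assumptions made for illustration:

```python
def extended_fitness(assign, exec_time, cost, norms):
    """Extended-partitioning objective sketch for the four-unit architecture
    W = {Sw1, Sw2, Hw1, Hw2}. assign[i] names the unit running task i;
    exec_time[u][i] and cost[u][i] give the task's time and cost on unit u;
    norms holds the assumed AllTime<u> / AllCost<u> normalizing bounds.
    """
    units = ("Sw1", "Sw2", "Hw1", "Hw2")
    total = 0.0
    for u in units:
        # execution time and cost accumulated on unit u only
        T = sum(exec_time[u][i] for i, a in enumerate(assign) if a == u)
        C = sum(cost[u][i] for i, a in enumerate(assign) if a == u)
        total += T / norms["AllTime" + u] + C / norms["AllCost" + u]
    return total
```

Each task contributes to exactly one unit's time and cost terms, so spreading the load across units lowers every normalized fraction at once.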

Empirical Results and Discussions
In this section, we evaluate the performance of the proposed FCMPSO algorithm on both the binary and the extended architectures. The experiments were performed on an Intel Celeron CPU with a 2.16 GHz processor and 2 GB of RAM. The hardware/software optimization algorithms were coded in the Matlab environment and executed under the Windows 7 operating system.

Empirical Results and Discussion for Binary Partitioning Approach.
The binary partitioning problem was solved considering task graphs ranging from 20 to 2000 nodes. The values of the software execution time (T_S), hardware execution time (T_H), memory requirement (M), and used slices rate (S_H) were generated randomly as described in Section 4.2. First, the standard PSO, ACO, GA, FCM, and SA algorithms were simulated. The aim is to determine, for each algorithm, the parameters that give the best solutions and to compare the partitioning results. We consider the following heuristic algorithm parameters:

(i) For the SA algorithm: initial temperature = 10, final temperature = 0, and cooling rate α ≥ 0.93.
(ii) For the GA algorithm: a population of 100 individuals, selection rate = 0.5, mutation rate = 0.2, and crossover rate = 0.5.
(iii) For the PSO algorithm: a swarm of 50 particles.
(iv) For the ACO algorithm: an evaporation parameter of 0.95 and positive and negative pheromones of 0.06 and 0.3, respectively.
(v) For the FCM algorithm: an initialization interval of [0, 2n], one cluster center (c = 1), and a fuzziness scalar m = 2, where n presents the number of nodes in the task graph.
Each algorithm was executed 10 times. In each execution, the objective function was evaluated 100 times, and the best solution was always retained. Performance analyses of the processing time and of the quality of the cost solutions found by the considered algorithms are given in Figures 3 and 4, respectively.
The processing time comparison of the standard partitioning algorithms, presented in Figure 3, proves that the FCM algorithm is faster than the PSO, SA, GA, and ACO algorithms because it requires fewer function evaluations. Figure 3 also demonstrates that the PSO algorithm takes considerable time compared to the FCM algorithm to discover and evaluate the input random particles because of the large input swarm (50 particles).
Moreover, Figure 4 reveals that the quality of the results, measured as the fitness function cost percentage of the solution, is best for the PSO algorithm.
The results also demonstrate that the PSO algorithm has a strong global search capability, while the FCM algorithm generates approximate solutions faster. Hence, a combination of the FCM and PSO algorithms allows the generation of a near-optimal solution at a faster speed. The simulation results of the proposed FCMPSO algorithm as well as the original FCM, PSO, ACO, GA, and SA algorithms are presented in Figures 5 and 6. Figure 5 reveals that the FCMPSO algorithm improves the processing time convergence of the PSO algorithm by combining it with the FCM clustering algorithm, which minimizes its input population size and limits its local search space.
The cost comparison provided in Figure 6 reveals the efficiency of the proposed algorithm in generating optimal solutions compared with the original PSO and FCM algorithms. We can also observe that the costs of the generated solutions become closer as the number of nodes decreases: the variation between the FCMPSO cost and the FCM cost is 11.11% for the 2000 nodes' graph, while it is less than 4% for the 20 nodes' graph. This variation is due to the role the FCM algorithm plays in improving the convergence of the PSO by minimizing its local search input population to generate an improved optimal solution. Overall, the FCMPSO algorithm obtains significant improvements over the FCM and PSO algorithms in terms of processing time and generated cost solution in the binary partitioning approach.
To measure the efficiency of combining the FCM and PSO algorithms, we compared FCMPSO to hybrid combinations including PSO then GA, GA then SA, GA then ACO, ACO then SA, FCM then GA, FCM then SA, and finally ACO followed by FCM. The simulated results are shown in Figures 7 and 8. The processing time of FCMPSO is 0.033 seconds while that of FCM-SA is 0.24 seconds. This result shows a speed improvement of around 86% in favor of FCMPSO.
As shown in Figures 9, 10, 11, and 12, the best cost of FCMPSO is 170.19 while that of FCM-GA is 170.61 for the 20 nodes' graphs. This result represents around 0.25% improvement in result quality in favor of FCMPSO.
Recently, many approaches have been committed to researching how to improve the performance of hardware/software partitioning in multiprocessor systems, and different solutions have been developed. We now demonstrate the efficiency of our proposed algorithm in the extended partitioning approach. The experiments were executed on a task graph containing 20 nodes whose parameters are described in Section 5.2. These nodes are partitioned on a multiprocessor target architecture composed of two software units and two hardware units. The results were obtained from 10 executions. In each execution, the objective function was evaluated 100 times, and each time the best solution was used. The simulation of the PSO, GA, ACO, SA, FCM, and our proposed FCMPSO algorithms allows the generation of individual solutions, as illustrated in Table 3.
To demonstrate the efficiency of our proposed algorithm, we also compare it to hybrid combinations including PSO then GA, GA then SA, GA then ACO, ACO then SA, FCM then GA, FCM then SA, and finally ACO followed by FCM. The simulated results are shown in Figures 11 and 12. As expressed in Figures 11 and 12, the best cost was achieved by FCMPSO, representing around 8% improvement in result quality over ACO-GA. For the processing time, the FCMPSO algorithm needs 0.051 sec while the ACO-GA algorithm needs 4.57 sec, a substantial speed improvement in favor of FCMPSO. The simulation results of both the binary and the extended partitioning problems demonstrate the efficiency of the FCMPSO algorithm in terms of processing time and generated solution quality.

Conclusion
In this paper, the FCM algorithm was integrated with the PSO algorithm to form a clustering PSO called "FCMPSO." This solution maintains the merits of both the PSO and FCM algorithms to solve both the binary and the extended partitioning problems. The FCMPSO algorithm applies FCM to the particles in the swarm to improve the generated solution (i.e., the quality of the input swarm) in a fast processing time. We have demonstrated that the FCMPSO algorithm can produce better solutions with a quicker search speed than both standard heuristic algorithms and other hybrid partitioning techniques for both the binary and the extended partitioning problems. Our future work will consider the scheduling problem and the communication between different components.