Performance Estimation Based Multicriteria Partitioning Approach for Dynamic Dataflow Programs

Partitioning a dataflow program onto a target architecture is a difficult challenge for any application design. Since the problem is NP-complete, it generally consists of searching for high-quality solutions that maximize the achievable data throughput. The difficulty lies in exploring the design space, which is extremely large for parallel platforms. The paper describes a heuristic partitioning methodology applicable to dynamic dataflow programs. The methodology is based on two elements: an execution model of the dynamic dataflow program, which is used to estimate performance during the exploration of the large design space, and several partitioning algorithms that compete to deliver specific high-quality solutions. Experimental results are validated with executions on a virtual platform.


Introduction
An interesting alternative to the classical sequential programming methods for signal processing system implementations is the approach based on dataflow programming. Dataflow programs are characterized by high analyzability and platform independence, and they explicitly expose the potential parallelism. For these reasons, they can be used for exploring a variety of parallel implementation options and for providing an extensive and systematic implementation analysis [1,2]. Hence, they have been investigated in several research works [3-5].
Dataflow programs are in general structured as (possibly hierarchical) networks of communicating computational kernels, called actors. Actors are connected by directed, lossless, order-preserving point-to-point communication channels, and data exchanges are only permitted by sending data packets (called tokens) over those channels. A general dataflow Model of Computation (MoC) is known in the literature as "Dataflow Process Network (DPN) with firings" [6]. A DPN network evolves as a sequence of discrete steps (called firings) corresponding to the executions of actions that may consume and/or produce a finite number of tokens and modify the internal actor state. At each step, according to the current actor internal state, only one action can be fired. The processing part of the actors is thus encapsulated in atomic executions that completely abstract from time. Figure 1 illustrates the construction of a sample dataflow program and the underlying structure of an actor.
An important property of a dataflow program is the composability of its components (actors). Porting a program onto a target architecture requires determining three settings: the partitioning (the assignment of dataflow actors to the processing units), the scheduling (the execution order inside each unit, where, depending on the internal nature of actors, a static scheduling may or may not exist), and the dimensioning of buffers (a finite size for each communication channel in the network). An essential aspect of the problem is that although the configurations are applied at the level of actors, the program execution (i.e., in terms of data dependencies) is described at the level of action firings.
Depending on the architecture and on the dataflow program itself, the size of the solution space of admissible partitioning and scheduling configurations can be, in general, extremely large. Therefore, an important design challenge is to find a set of configurations that optimizes the desired objective function. This problem has been proven to be NP-complete even for the case of only two processors [7]. Among the different objective functions that might be considered, this work focuses on the maximization of the data throughput of a dataflow program. This particular objective function is often an appropriate choice for stream programs, since in many cases it also contributes to the optimization of other criteria, such as resource utilization and energy consumption [8].
An efficient exploration of the multidimensional design space has two important applications. First, exploration of feasible regions leads to determining a close-to-optimal set of configurations according to the desired objective function. Second, it enables the identification of unreachable regions of the design space that could become reachable by applying refactorization stages to the considered design. For instance, a different implementation of an algorithm might be required for obtaining higher performance if its currently exposed parallelism is lower than the potential parallelism offered by the processing platform.
Among the different dimensions of the design space, more attention is usually paid to the partitioning, since its impact on the overall performance is considered to be dominant. Furthermore, the room for improvement left, for instance, to scheduling strongly depends on the quality of the applied partitioning (i.e., low-quality partitioning configurations cannot be improved much by applying efficient scheduling). This property has also been confirmed for other, non-dataflow domains [9,10]. The purpose of this work is to describe a new partitioning methodology to perform the design space exploration of dataflow program implementations.
The original contribution of this paper covers different stages of the partitioning process: starting from presenting and discussing different aspects of the problem and the possible impact of a considered architecture, through the modeling approach, up to the description of several partitioning solution methods with different quality and complexity. An important property of the methodology is the construction of partitioning algorithms based on highly accurate models of the program executions. The execution model is used in order to evaluate each solution and to extract some execution properties which are further considered as optimization criteria. The approach can be implemented with limited memory requirements (i.e., it can be implemented on a standard PC also for complex designs) and provides solutions with quasi-constant evaluation times across the different configurations considered. Unlike most of the existing approaches, the methodology can be used for analyzing static as well as fully dynamic dataflow programs. The paper has the following structure. First, Section 2 presents and formulates the partitioning and scheduling problem explicitly for the case of dynamic dataflow programs. In this context, the related work is discussed in Section 3. Then, Section 4 describes the execution model used to derive the estimations and Section 5 reports in detail the proposed partitioning algorithms built upon this model. The algorithms are analyzed for complexity, quality of solution, and usability, based on the experimental results presented in Section 6. Finally, conclusions and future work are discussed in Section 7.

Dataflow Partitioning and Scheduling
The design problem considered here can be applied to any dataflow MoC.

Problem Presentation.
Following the terminology commonly used in the production field [11], the objective is to find an assignment of jobs (representing action firings) to a set of parallel machines (corresponding to processors). The objective function to minimize is the overall makespan (i.e., the completion time of the last performed job among all processors). Assuming that the processors do not allow parallel execution, only one job can be performed at a time on each machine. Therefore, when all jobs are assigned to the machines, it must be decided in which order the jobs are going to be performed on each machine. The literature often refers to the first problem as mapping in the spatial domain (binding) and to the second as mapping in the temporal domain (scheduling) [12]. In this work, partitioning and scheduling refer to these problems, respectively, and are illustrated in Figure 2.
Each job $j$ has an associated processing time (weight) $p_j$ and a group (actor) $g_j$. There are $k$ possible groups, and each one can be divided into subgroups where all jobs have the same processing time. This division can be easily identified with actions and their executions (firings). A communication time $c_{jj'}$ is associated with some pairs $\{j, j'\}$ of incompatible jobs (i.e., with $g_j \neq g_{j'}$). The communication time is subject to a fixed quantity $q_{jj'}$ of information (or the number of tokens) that needs to be transferred. The size of this data is fixed for any subgroup (i.e., an action always produces/consumes the same amount of data). Due to the structure of dataflow programs, the partitioning and scheduling must take into account the following constraints: (i) Group constraint: all jobs belonging to the same group have to be processed on the same machine (i.e., an actor must be entirely assigned to one processing unit). Assuming that job $j$ is assigned to the machine $m_j$, the actual value of the communication time $c_{jj'}$ is a product of two elements: the number of tokens $q_{jj'}$ and the variable time $\tau(m_j, m_{j'})$ needed to transfer a single unit of information from $m_j$ to $m_{j'}$. For two given jobs $j$ and $j'$, the largest $\tau(m_j, m_{j'})$ can be significantly larger than the smallest one. In theory, every connection $(j, j')$ can have as many different values of $c_{jj'}$ as the number of different possible assignments to the machines. But in practice, this number can usually be reduced to a few different values of latency, depending on the memory level serving a given communication (see Chapter 4 of [13]).
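The product form of the communication time can be illustrated with a minimal sketch. The three memory levels, their latency values, the clustering of machines, and all function names below are hypothetical and chosen only for illustration; they are not taken from the paper or from [13].

```python
# Hypothetical per-token latencies (arbitrary time units) for three memory levels.
LATENCY = {"same_processor": 0.0, "shared_cache": 2.0, "main_memory": 10.0}


def memory_level(machine_a, machine_b, cluster_of):
    """Pick the (assumed) memory level serving a transfer between two machines."""
    if machine_a == machine_b:
        return "same_processor"
    if cluster_of[machine_a] == cluster_of[machine_b]:
        return "shared_cache"
    return "main_memory"


def communication_time(num_tokens, machine_a, machine_b, cluster_of):
    """Communication time = number of tokens x time to transfer one token."""
    return num_tokens * LATENCY[memory_level(machine_a, machine_b, cluster_of)]


# Machines 0 and 1 share a cluster; machine 2 sits in another one.
cluster_of = {0: "A", 1: "A", 2: "B"}
for target in (0, 1, 2):
    print(target, communication_time(4, 0, target, cluster_of))
```

The same four tokens cost nothing, 8 units, or 40 units depending only on where the consumer is mapped, which is why only a few distinct latency values need to be profiled in practice.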
The case of heterogeneous platforms (typically hardware or software families of processors) may introduce some further constraints to the problem, such as the following:
(i) Eligibility constraint: not all jobs can be processed on each machine (i.e., each machine has a collection of unsupported operations).
(ii) Capacity constraint: the memory size is limited for a given family of processors. This implies that the memory requirements of all jobs assigned to processors of this family cannot exceed the given size.
Different figures of merit can also be introduced for the communication time, especially for communication occurring between jobs partitioned to machines belonging to different families.

Related Work
Due to the NP-completeness of the partitioning problem for realistic instances, it is only possible to develop methods providing close-to-optimal solutions [14]. They can be obtained by applying constructive heuristics, where a solution is generated from scratch by sequentially adding components to the current partial solution according to some criteria until the solution is complete [15]. Another possibility is metaheuristics, formally defined as an iterative generation process which guides a subordinate heuristic by intelligently combining different concepts for exploring and exploiting the search space [16]. Metaheuristics (e.g., simulated annealing, tabu search, variable neighborhood search, and guided local search) can usually lead to solutions of higher quality, but in general they require much longer computing times [17,18]. There are several examples of approaches based on metaheuristics used for partitioning or, more generally, for the design space exploration of dataflow programs. In [19], simulated annealing is employed for estimating the bounds of the partitioning program. Various optimization stages (including the selection of a target architecture, partitioning, scheduling, and design space exploration) are applied in [20] in order to identify feasible solutions. The optimizations are performed using an evolutionary algorithm. Multiobjective evolutionary algorithms used for performing an automatic design space exploration are also the subject of the work discussed in [21]. An interesting transition from simple heuristics to advanced metaheuristics (such as genetic algorithms) is also described in [22], where more advanced methods act as a refinement to the less advanced ones.
An important aspect of metaheuristic approaches for design space exploration, which requires explicitly providing an execution model, is performance estimation that can be used as an evaluation of a solution. The level of detail considered in the model usually determines the accuracy of the estimation as well as the evaluation time. The more detailed the model, the more accurate the estimation, but at the cost of a much longer evaluation time [23-25]. There are multiple frameworks which use the concept of performance estimation in order to build optimization methods on top of it.
An interesting example, demonstrating some similarities with the methodology described in this paper, is the MPSoCs Application Programming Studio (MAPS) [26]. This design space exploration framework performs an estimation of a program execution, based on the analysis of the execution trace. Chapter 6 of [27] introduces a trace replay module employing a discrete-event simulator in order to simulate the scheduling of the segments of the trace. Several optimization heuristics are built on the estimation, with an emphasis on the advantages of light-weight heuristics over evolutionary methods, and on the composability analysis for the purpose of simultaneously executing multiple applications on a given platform without interference [4]. In terms of modeling differences, the framework addresses only KPN dataflow programs and supports a limited set of dependencies. Another framework supporting only KPN programs is the Sesame system-level simulation framework [28]. Similar to the work presented in this paper, it introduces a concept of an application model which is independent from the architecture model and the partitioning configuration. The objective of the framework is to find the most suitable and efficient target MPSoC platform for a given program. Hence, the exploration process involves simulating and analyzing the candidate architectures.
Unlike the two approaches mentioned earlier, Metropolis is a framework that does not impose a specific language or flow model on the design [29]. It introduces a design description at different levels of abstraction and provides an infrastructure based on meta-modeling that remains generic enough to support existing MoCs while also accommodating new ones. The meta-model allows capturing the architecture, the functionality, and the mapping between different abstraction levels. The simulation and formal analysis allow the user to determine how well an implementation satisfies the specified requirements. With its generic features, the framework responds only partially to the demands of DPN, because it does not support a simultaneous analysis at the level of partitioning, possibly dynamic scheduling, and buffer dimensioning.
In the mentioned examples, the partitioning methodologies have been designed explicitly for the purpose of dataflow partitioning, which is a specific instance of a graph partitioning problem. The research field of graph partitioning is thoroughly covered by different algorithms proposed in the literature [30] as well as by some software packages, such as METIS [31] or SCOTCH [32]. Such general purpose partitioning algorithms cannot, however, be easily applied to the case of dataflow programs, since they are not aware of the semantics related to the nodes and edges of a dataflow graph. An attempt to apply METIS for the purpose of run-time actor mapping of dataflow programs has been made in [33]. This approach explores the results of profiling and extracts some optimization criteria (connectivity between actors, critical path). It is, however, difficult to evaluate whether the obtained solutions are close-to-optimal, or to point to possible optimizations in the design, since no execution model is provided. The considered partitioning graph is the program network itself and not the execution trace, which can provide elements and measures of the execution properties of the dataflow program. Furthermore, such a combinatorial approach, which might operate quite effectively for small instances of the problem, cannot be successfully applied to the exploration of design problems of larger size.
In summary, a comparison with existing approaches highlights some important differences. First of all, the approach can be applied to all dataflow models of execution including fully dynamic dataflow programs [34]. Hence, it also covers other, less expressive dataflow variants such as SDF and CSDF [35]. Furthermore, due to the demands of dynamic programs, such as dynamic scheduling influenced by the applied buffer configuration, the methodology can be used directly for the optimization of other design configurations, without imposing any particular order of optimizations. Finally, the long-term objective of the methodology is to find a close-to-optimal (close-to-the-potential-parallelism) design configuration on a given platform, in order to assess the maximum performance of a dataflow program, identify its bottlenecks, and hence identify new regions of the design space to be reached after applying some refactorization to the design.

Dataflow Program Execution Modeling
A general dataflow MoC, DPN with firings, is considered. It allows the implementation of fully dynamic programs, efficiently capturing all classes of signal processing applications (e.g., video and audio codecs, packet switching in communication networks). For this reason, a representation of a dataflow program must capture its entire dynamic behavior. Such a representation can be built by generating a directed, acyclic graph G, called Execution Trace Graph (ETG), where the nodes of the graph represent a collection of action executions, called firings. Such a representation has already been extensively studied to solve some optimization problems of dynamic dataflow implementations [36].
It is possible to explicitly characterize various intrinsic dependencies between two firings, which are used to describe the program execution. They can be grouped into two types. The first type includes internal dependencies related to the FSM, internal variables, and ports (in/out in Figure 1) and describes the relations between two firings of the same actor. The second type (token dependencies) describes the relations between two firings of different actors that, respectively, produce and consume at least one token. Defining dependencies between executed firings establishes precedence orders: if one firing depends on another, then the latter has to be executed and completed (including the communication, if applicable) before the former can be started.
An ETG has to consider a sufficiently large and statistically meaningful set of input stimuli in order to cover the whole span of dynamic behavior of the application considered. The size of G varies according to the type of application and to the size of the input stimuli set. In fact, for some applications, it can be large. For instance, a very complex algorithm, such as HEVC video compression operating on an input signal of a few tens of frames, may result in G containing a few billion nodes. Still, it can be processed in reasonable time using standard PC platforms, as demonstrated by the examples described in Chapter 9 of [36].
Considering the program execution on a given architecture, the ETG is weighted at the nodes and at the edges, and the weights correspond to the processing times and the communication times defined in Section 2, respectively. These values are normally obtained by profiling the program on the target architecture. A weighted ETG allows simulating the program execution for a given partitioning, scheduling, and buffer dimensioning configuration. This task is accomplished by an event-driven performance estimation tool that processes the firings following the constraints and properties presented in Section 2.
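To make the role of the weighted ETG concrete, the following is a deliberately simplified sketch of a makespan estimate for a given partitioning. It is not the performance estimation tool used in this work: it assumes unbounded buffers, replays the firings in the fixed order in which the trace lists them (instead of applying a dynamic scheduling policy), and charges a single hypothetical per-token latency for every cross-processor dependency. The data layout and the name `estimate_makespan` are illustrative assumptions.

```python
def estimate_makespan(firings, partition, token_latency=1.0):
    """Replay a weighted trace and return the completion time of the last firing.

    firings: list of dicts in topological order, each with 'id', 'actor',
             'weight' and 'deps' (list of (predecessor_id, num_tokens) pairs;
             internal dependencies carry 0 tokens).
    partition: dict mapping each actor to a processor.
    """
    finish = {}      # firing id -> completion time
    actor_of = {}    # firing id -> actor
    proc_free = {}   # processor -> time at which it becomes free
    for f in firings:
        proc = partition[f["actor"]]
        ready = 0.0
        for pred, tokens in f["deps"]:
            comm = 0.0
            if partition[actor_of[pred]] != proc:
                comm = tokens * token_latency  # tokens x per-token time
            ready = max(ready, finish[pred] + comm)
        # One job at a time per processor: wait until the processor is free.
        start = max(ready, proc_free.get(proc, 0.0))
        finish[f["id"]] = start + f["weight"]
        actor_of[f["id"]] = f["actor"]
        proc_free[proc] = finish[f["id"]]
    return max(finish.values(), default=0.0)


trace = [
    {"id": "a1", "actor": "A", "weight": 2, "deps": []},
    {"id": "b1", "actor": "B", "weight": 3, "deps": [("a1", 1)]},
    {"id": "a2", "actor": "A", "weight": 2, "deps": [("a1", 0)]},
    {"id": "b2", "actor": "B", "weight": 3, "deps": [("a2", 1), ("b1", 0)]},
]
print(estimate_makespan(trace, {"A": 0, "B": 0}))  # both actors on one processor
print(estimate_makespan(trace, {"A": 0, "B": 1}))  # one actor per processor
```

In this toy trace the two-processor mapping is only slightly faster than the single-processor one, because the token dependencies and the communication cost leave little room for overlap.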

Partitioning Algorithms
This section presents the algorithmic details of different partitioning solution methods, namely, greedy constructive procedures, the descent local search heuristics, and the tabu search metaheuristic. Each method takes the weighted ETG as input data.

Greedy Constructive Procedures.
In order to construct a solution, the greedy procedures require specifying only the target number of processors. The solution is generated in a negligible time frame, since no performance estimation needs to be performed.

Workload Balance (𝑊𝐵).
The concept of balancing the workload in order to minimize the bottleneck of the program and hence maximize the throughput has already been successfully employed for partitioning systems of different types [37]. Inspired by such approaches, the very first constructive heuristic has been designed. The algorithm starts by calculating the total workload of each group throughout the program execution. It is expressed as the sum of the processing times $p_j$ of all jobs (firings) $j$ belonging to one group (actor) $a$. The actors are then sorted in decreasing order of the sum of weights (workload) $wl(a) = \sum_{j \in a} p_j$. The partitioning decision is based on the sum of the workloads of the actors already partitioned to one processor $m$: $wl(m) = \sum_{a \in m} wl(a)$. The next actor on the list is always partitioned to the processor with the smallest sum of workloads $wl(m)$. In this way, a balance of the overall workload of each partition should be achieved and the bottleneck (understood as the workload of the most occupied processor) is likely to be minimized.
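The procedure described above can be sketched in a few lines. The input format (one `(actor, weight)` pair per firing), the tie-breaking rule (alphabetical actor name, lowest processor index), and the function name are assumptions made for this example only.

```python
from collections import defaultdict


def workload_balance(firing_weights, num_processors):
    """Greedy Workload Balance sketch.

    firing_weights: iterable of (actor, weight) pairs, one per firing.
    Returns (partition, load): actor -> processor, and the workload per processor.
    """
    # Total workload of each actor: sum of the weights of its firings.
    workload = defaultdict(float)
    for actor, weight in firing_weights:
        workload[actor] += weight

    load = [0.0] * num_processors
    partition = {}
    # Heaviest actors first; each goes to the currently least loaded processor.
    for actor in sorted(workload, key=lambda a: (-workload[a], a)):
        target = min(range(num_processors), key=lambda p: load[p])
        partition[actor] = target
        load[target] += workload[actor]
    return partition, load


firings = [("A", 5), ("B", 7), ("A", 5), ("C", 4), ("D", 3)]
partition, load = workload_balance(firings, 2)
print(partition)  # {'A': 0, 'B': 1, 'C': 1, 'D': 0}
print(load)       # [13.0, 11.0]
```

Note that no makespan estimation is involved: the decision uses only the profiled weights, which is why the construction time is negligible.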

Balanced Pipeline (𝐵𝑃).
The algorithm starts by giving each actor a dedicated processor. Next, the number of processors is iteratively reduced and the members of the least occupied processors are attached to the remaining processors. The optimization criteria of the algorithm involve equalizing the average preceding workload (understood as the maximal sum of weights of each firing of each actor that precedes the given actor in the network in terms of topological order) and maximizing the number of common predecessors (understood as the number of actors appearing on the topological list of predecessors for a given pair of actors) within each partition. The simulation of the execution time might be incorporated into the algorithm in order to estimate the optimal number of processors from the perspective of processor utilization. The Balanced Pipeline algorithm, extended with some additional optimization procedures for communication volume and idle time, has been shown to outperform existing partitioning algorithms for dataflow programs. For a more detailed description of the algorithm, its metrics, and its results, the reader is referred to [38].

Descent Local Search Heuristics (𝐷𝐿𝑆).
As described in [39], a local search starts from an initial solution and then explores the solution space by moving from the current solution to a neighbor solution. A neighbor solution is usually obtained by making a slight modification of the current solution, called a move. The neighborhood $N(s)$ of a solution $s$ is the set of solutions obtained from $s$ by performing each possible move. In a descent local search (DLS), the best solution $s' \in N(s)$ (according to the considered objective function $f$) is generated at each iteration. The main drawback of this method is that it stops in the first local optimum. Two DLS approaches are proposed below: the Idle DLS and the Communication Frequency DLS.
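A generic descent local search can be sketched as follows. The toy objective (the load of the most occupied of two processors) and the neighborhood (moving one actor to the other processor) are stand-ins chosen so that the example is self-contained; in the methodology described here, the objective would instead be the makespan estimated from the ETG.

```python
def descent_local_search(initial, neighbors, objective):
    """Move to the best neighbor until no neighbor improves the objective."""
    current, current_cost = initial, objective(initial)
    while True:
        candidates = list(neighbors(current))
        if not candidates:
            return current, current_cost
        best = min(candidates, key=objective)
        best_cost = objective(best)
        if best_cost >= current_cost:
            return current, current_cost  # first local optimum: stop here
        current, current_cost = best, best_cost


# Toy instance: four actors with fixed workloads, two processors (0 and 1).
WEIGHTS = (10, 7, 4, 3)


def max_load(solution):
    loads = [0, 0]
    for weight, proc in zip(WEIGHTS, solution):
        loads[proc] += weight
    return max(loads)


def reinsert_neighbors(solution):
    # Each neighbor moves exactly one actor to the other processor.
    for i in range(len(solution)):
        yield solution[:i] + (1 - solution[i],) + solution[i + 1:]


print(descent_local_search((0, 0, 0, 0), reinsert_neighbors, max_load))
# ((1, 0, 0, 1), 13)
```

Starting from all actors on one processor (cost 24), the search takes two improving moves and then stops, since every remaining neighbor is worse.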

Idle 𝐷𝐿𝑆 (𝐼𝐷𝐿𝑆).
Representing the program execution with the ETG, and simulating it for a given partitioning, scheduling, and buffer dimensioning configuration using the performance estimation tool, may provide important information related to actor states throughout the execution. The following states may occur for an actor that is currently not processing and has not yet terminated.
(i) Blocked reading considers the situation where an actor has not yet received the required input tokens and therefore cannot be executed.
(ii) Blocked writing takes into account the situation where the buffer an actor is expecting to write to is full, so it has to wait for the available space.
(iii) Idle corresponds to the situation where although an actor has the necessary tokens and required space in the buffers, it cannot be fired because another actor is currently processing in the same processor (as previously mentioned, only one job can be executed on each processor at a time).
Assuming a design space exploration in the dimension of partitioning and scheduling, it is particularly important to minimize the occurrences of the idle state. In order to achieve that, in IDLS, all actors are sorted according to their idle times in decreasing order (idle time list). A new solution $s'$ is generated by moving a single actor to the most idle partition, where the idleness of a partition is defined as the overall time during the execution when none of its actors could be executed because they were blocked reading/writing or had already terminated. In each iteration, the possible moves are prioritized according to the position of the considered actor on the idle time list. A move is evaluated by estimating the makespan of the new solution. After a successful move, the statistics on the idle times of the actors and the corresponding idle time list are regenerated. Since the moves are prioritized, there is a risk that a high-priority move that does not improve the solution will be unnecessarily repeated in each iteration. To prevent that from happening, a simple release mechanism is implemented: a (once unsuccessful) move, understood as an actor-partition pair, may be repeated only if the content of the target partition has been modified by applying another move.

Communication Frequency 𝐷𝐿𝑆 (𝐶𝐹𝐷𝐿𝑆).
Another piece of information that can be extracted from the ETG is the number of token dependencies between the firings of different actors. Accumulating these numbers for all firings leads to an actor-actor communication frequency map. This map is independent of the partitioning configuration, but taking the partitioning into consideration, it can be easily transformed into an actor-partition map. This map is taken as an optimization criterion by another local search. For each actor, the algorithm calculates the internal communication frequency (token exchange with actors partitioned to the same processor) and the external communication frequency (token exchange with actors partitioned to different processors). As described previously in Section 2.2, the partitioning of actors may strongly influence the values of the communication cost and therefore the makespan. If for any actor the external communication frequency with one processor exceeds the internal communication frequency, this actor-partition pair is considered as a move. The moves are prioritized according to the overall communication frequency of the actors, and a release mechanism (similar to that of IDLS) is implemented. A move can be evaluated in two ways: by estimating the execution time of the new solution or by analyzing whether the overall external communication frequency (calculated collectively for all partitions) has decreased. In this work, for the purpose of consistency with the other solution methods, only the first method is used.
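The construction of the frequency map and the derivation of candidate moves can be sketched as below. The input format (one `(producer, consumer)` pair per token dependency), the function names, and the alphabetical tie-break among equally frequent actors are assumptions for illustration; the release mechanism and the makespan-based evaluation of each move are omitted.

```python
from collections import Counter


def communication_frequency(token_deps):
    """Count token dependencies per unordered pair of distinct actors."""
    freq = Counter()
    for producer, consumer in token_deps:
        if producer != consumer:
            freq[frozenset((producer, consumer))] += 1
    return freq


def candidate_moves(freq, partition):
    """Return (actor, processor) moves, most communicating actors first."""
    scored = []
    for actor, home in partition.items():
        # Actor-partition view: tokens exchanged with each processor.
        per_proc = Counter()
        for pair, count in freq.items():
            if actor in pair:
                (other,) = pair - {actor}
                per_proc[partition[other]] += count
        internal = per_proc[home]
        total = sum(per_proc.values())
        for proc, external in per_proc.items():
            if proc != home and external > internal:
                scored.append((-total, actor, proc))
    return [(actor, proc) for _, actor, proc in sorted(scored)]


deps = [("A", "B")] * 3 + [("B", "C")] + [("C", "D")] * 2
freq = communication_frequency(deps)
moves = candidate_moves(freq, {"A": 0, "B": 1, "C": 1, "D": 0})
print(moves)  # [('B', 0), ('A', 1), ('C', 0), ('D', 1)]
```

Actor B exchanges three tokens with processor 0 but only one within its own processor, so moving it to processor 0 is proposed first.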

Tabu Search (𝑇𝑆).
Tabu search, as introduced by Glover [40], is still among the most cited and used local search metaheuristics for combinatorial optimization problems. It avoids the problem of getting stuck in the first local optimum by making use of recent memory with a tabu list. More precisely, it forbids performing the reverse of the moves done during the last $t$ (parameter) iterations, where $t$ is called the tabu tenure. At each iteration of TS, the neighbor solution $s'$ is obtained from the current solution $s$ by performing on the latter the best nontabu move (ties are broken randomly). The process stops, for instance, when a time limit $T$ (parameter) is reached. In most TS implementations, if the neighborhood size is too big, only a proportion is explored in each iteration. This proportion can be, for instance, a random sample involving $p$% (parameter) of the neighbor solutions.
TS has proven to have a good balance between intensification (i.e., the capability to focus on specific regions of the solution space) and diversification (i.e., the ability to visit diverse regions of the solution space). In addition, it has a good overall behavior according to the following measures [18]: (1) quality of the obtained results (according to a given objective function $f$ that has to be optimized); (2) quickness (time needed to get competitive results); (3) robustness (sensitivity to variations in data characteristics); (4) simplicity (facility of adaptation); and (5) flexibility (possibility to integrate properties of the considered problem). To adapt TS to the studied problem, the following elements have to be designed: the representation of any solution $s$, the neighborhood structure (i.e., what is a move), the tabu list structure (i.e., what type of information is forbidden), and a stopping criterion (i.e., what is the most appropriate time limit).

Solution Encoding and Neighborhood Structure.
A solution for partitioning is represented as a map of actors and processors, where the number of processors is fixed. Each actor can be mapped to only one processor at a time, and each processor must host at least one actor. Hence, leaving empty processors is not allowed. The following basic types of moves are possible: (1) REINSERT: moving an actor to another processor; (2) SWAP: exchanging two actors belonging to two different processors. For the purpose of swapping, the term complementary move is introduced. Assume that a move $(a, p_s, p_t)$ consists of relocating an actor $a$ from source partition $p_s$ to target partition $p_t$. A move $(a', p_t, p_s)$ is complementary to $(a, p_s, p_t)$ if it involves moving any actor $a'$ from source partition $p_t$ to target partition $p_s$. In this work, the neighborhood structures are generated by performing REINSERT and SWAP moves according to four different criteria, presented below:
(1) $N^{(B)}$ (for balancing):
(i) REINSERT: choose randomly an actor from the most occupied processor and move it to the least occupied processor.
(ii) SWAP: choose randomly two actors in different partitions so that swapping the actors decreases the relative workload imbalance between the two partitions.
(2) $N^{(I)}$ (for idle):
(i) REINSERT: for each actor whose idle time is larger than its processing time, find the most idle processor, different from the one to which it is currently mapped, where idleness is defined as in Section 5.2.1.
(ii) SWAP: generate a set of moves on the REINSERT basis, but allow actors to be moved to any partition except for the least idle one (search for complementary pairs of moves).
(3) $N^{(CF)}$ (for communication frequency):
(i) REINSERT: check the internal and external communication frequency of each actor and consider the moves, as described in Section 5.2.2.
(ii) SWAP: generate a set of moves on the REINSERT basis (search for complementary pairs of moves).
(4) $N^{(R)}$ (for random):
(i) REINSERT: choose randomly an actor and move it to a different processor (randomly chosen).
(ii) SWAP: generate a set of moves on the REINSERT basis (search for complementary pairs of moves).

Parameters.
Any time an actor $a$ is moved from a processor $p$ to another processor, it is forbidden to put $a$ back on $p$ for $t$ iterations, where $t$ is an integer uniformly generated in the interval $[t_{\min}, t_{\max}]$, and the values of the parameters $t_{\min}$ and $t_{\max}$ were tuned to 5 and 15, based on preliminary experiments. Smaller values do not allow escaping from local optima, whereas larger values do not allow intensifying the search around promising solutions. There are two other sensitive parameters that have to be tuned for TS, namely, $p$ (the proportion of neighbor solutions explored during each iteration) and $T$ (the time limit). Reaching the time limit $T$ results in immediate termination of the search and returning the best solution found so far. Usually, $T$ is set so that the improvement potential is poor (i.e., the percentage of improvement is below a threshold during a predefined time interval) if the method is run for larger time limits. Next, the smaller $p$ is, the more iterations are performed but the fewer neighbors are investigated in each iteration. A large value of $p$ contributes to the intensification ability of the method (indeed, all the solutions around the current one are explored), whereas a small value plays a diversification role (indeed, no focus is put on the neighborhood of each solution). Finally, a small (resp., large) value of $t$ strengthens the intensification (resp., diversification) ability of the search.
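The interplay of the tabu tenure, the sampled proportion of the neighborhood, and the stopping criterion can be illustrated with a minimal tabu search over REINSERT moves. Several simplifications are assumed here and are not part of the described method: an iteration budget `max_iters` stands in for the time limit so that the run is reproducible, the objective is a toy workload balance rather than the estimated makespan, and the rule forbidding empty processors is not enforced. All names are hypothetical.

```python
import random
from collections import Counter


def tabu_search(initial, objective, num_procs, max_iters=200,
                tenure_range=(5, 15), sample_ratio=1.0, seed=0):
    rng = random.Random(seed)
    current = dict(initial)
    best, best_cost = dict(current), objective(current)
    tabu_until = {}  # (actor, processor) -> last iteration at which it is forbidden
    for it in range(max_iters):
        moves = [(a, p) for a in current for p in range(num_procs)
                 if p != current[a] and tabu_until.get((a, p), -1) < it]
        if not moves:
            continue
        # Explore only a random proportion of the neighborhood.
        sample = rng.sample(moves, max(1, int(sample_ratio * len(moves))))
        scored = []
        for actor, proc in sample:
            neighbor = dict(current)
            neighbor[actor] = proc
            scored.append((objective(neighbor), actor, proc))
        cost = min(s[0] for s in scored)
        # Best nontabu move; ties are broken randomly.
        _, actor, proc = rng.choice([s for s in scored if s[0] == cost])
        # Forbid putting the actor back on its previous processor for a
        # tenure drawn uniformly from tenure_range.
        tabu_until[(actor, current[actor])] = it + rng.randint(*tenure_range)
        current[actor] = proc
        if cost < best_cost:
            best, best_cost = dict(current), cost
    return best, best_cost


WORKLOAD = {"A": 10, "B": 7, "C": 4, "D": 3}


def max_load(partition):
    loads = Counter()
    for actor, proc in partition.items():
        loads[proc] += WORKLOAD[actor]
    return max(loads.values())


start = {actor: 0 for actor in WORKLOAD}
best, best_cost = tabu_search(start, max_load, num_procs=2)
print(best, best_cost)
```

Unlike the descent local search, the loop keeps moving after the best neighbor stops improving; the best solution seen so far is stored separately and returned at the end.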

Advanced Variants of Tabu Search.
Since each of the used neighborhood structures relies on different properties, a more advanced version of the TS involves a consolidation of all neighborhood structures. It is applied in two different variants:
(i) Joint Tabu Search (JTS): at each iteration, the neighborhood structure includes moves obtained according to all types. Therefore, the used neighborhood structure is the union of the four neighborhood structures described above. This variant should have more flexibility, because it involves various types of moves. The proportion of the set sizes for different types of moves can be freely tuned.
(ii) Probabilistic Tabu Search (PTS): at each iteration, the search assigns a probability to the selection of each of the four neighborhood structures. This probability is tuned based on the history of the search during the considered run. As a result, the search is guided by the success rate of each type of move (where a success corresponds to an improvement of the current solution).
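One possible way of turning success rates into selection probabilities for PTS is sketched below in Python. The paper does not give the exact update rule, so the smoothed success rate used here, as well as the names `NeighborhoodSelector`, `pick`, and `update`, are assumptions made for illustration only.

```python
import random


class NeighborhoodSelector:
    """Picks a neighborhood type with probability driven by its past success rate."""

    def __init__(self, names, rng=None):
        self.trials = {name: 0 for name in names}
        self.successes = {name: 0 for name in names}
        self.rng = rng or random.Random()

    def probabilities(self):
        # Smoothed success rate (assumption): keeps every neighborhood selectable
        # even before it has produced any improvement.
        rates = {
            name: (self.successes[name] + 1) / (self.trials[name] + 2)
            for name in self.trials
        }
        total = sum(rates.values())
        return {name: rate / total for name, rate in rates.items()}

    def pick(self):
        probs = self.probabilities()
        names = list(probs)
        weights = [probs[name] for name in names]
        return self.rng.choices(names, weights=weights, k=1)[0]

    def update(self, name, improved):
        """Record one use of `name`; `improved` means the current solution got better."""
        self.trials[name] += 1
        if improved:
            self.successes[name] += 1
```

Before any history is available all neighborhoods are equally likely; as the run progresses, neighborhoods whose moves improved the current solution more often are selected more frequently.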

Experimental Results
6.1. Experimental Setup. The focus of this work is to explore the design space along the partitioning dimension. Hence, the other dimensions need to be fixed. The dynamic scheduling policy used in the experiments is nonpreemptive, which means that the same actor is continuously scheduled, as long as its firing conditions (i.e., available input tokens and output space) are fulfilled. When any of the conditions is no longer satisfied, the next actor is chosen (on a round-robin basis) among all actors with satisfied input/output conditions. Regarding the buffers, an infinite size would, ideally, be considered. Since this is not possible in practical implementations, the buffer size must be sufficiently large, so that deadlocks are avoided and the overall blocked writing time of the actors, as defined in Section 5.2.1, is kept at a minimum level. In this way, the influence of a buffer size that is too small remains negligible for the overall performance. For the dataflow programs considered here, it has been sufficient to set all buffers to the size of 512 tokens.
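The scheduling policy described above can be sketched for a single partition as follows. This is a simplified Python illustration, not the scheduler used in the experiments: the `Actor` class, its `budget` field (a cap on the number of firings, used only to make the toy example terminate), and `run_partition` are hypothetical names, and each firing is assumed to consume and produce exactly one token per channel.

```python
from collections import deque


class Actor:
    def __init__(self, name, inputs, outputs, capacity=512, budget=None):
        self.name = name
        self.inputs = inputs      # list of deques read by this actor
        self.outputs = outputs    # list of deques written by this actor
        self.capacity = capacity  # bounded buffer size, in tokens
        self.budget = budget      # optional cap on firings (toy-example device)

    def can_fire(self):
        if self.budget is not None and self.budget <= 0:
            return False
        has_input = all(len(q) > 0 for q in self.inputs)
        has_space = all(len(q) < self.capacity for q in self.outputs)
        return has_input and has_space

    def fire(self):
        for q in self.inputs:
            q.popleft()
        for q in self.outputs:
            q.append(self.name)
        if self.budget is not None:
            self.budget -= 1


def run_partition(actors, max_firings=10_000):
    """Nonpreemptive policy: keep firing the same actor while it is fireable,
    then move to the next fireable actor in round-robin order."""
    trace = []
    current = 0
    while len(trace) < max_firings:
        if actors[current].can_fire():
            actors[current].fire()
            trace.append(actors[current].name)
            continue
        for step in range(1, len(actors)):
            candidate = (current + step) % len(actors)
            if actors[candidate].can_fire():
                current = candidate
                break
        else:
            break  # no actor in this partition can fire
    return trace
```

With a source limited to three firings, a sink, and a channel of capacity 2, the source fires until the channel is full, the sink then drains it, and control returns to the source for its last firing.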

Experimental Designs.
The quality of a partitioning algorithm is, in principle, assessed by the quality of the solution it provides. For each solution, it can be evaluated how close it is to the objective function and how it approaches the potential parallelism of a given dataflow program implementation, for instance, how far a given throughput is from the best throughput achievable by the dataflow program under consideration. Hence, it is essential to perform the experiments with an application that in principle can provide a sufficient level of parallelism. If this condition is not satisfied, it is likely that no partitioning algorithm, whichever is used, can provide satisfactory performance. For this reason, finding appropriate dataflow programs for validating partitioning algorithms is not a trivial task. The application used in the experiments is an MPEG-4 SP decoder design that consists of 41 actors in total and has an upper bound on the potential parallelism of around 6.28. This upper bound is evaluated as a proportion of the critical executions in the overall execution time, assuming a fully parallel execution and an unbounded buffer size configuration (as described in Chapter 5 of [36]). The MPEG-4 SP decoder network is an implementation of a full MPEG-4 4:2:0 Simple Profile decoder standard written in the CAL Actor Language [41]. The main functional blocks include a parser, a reconstruction block, a 2D inverse discrete cosine transform (IDCT) block, and a motion compensator. These functional units are themselves hierarchical compositions of actors. The decoding starts from the parser (the most complicated actor in the network, consisting of 71 actions), which extracts data from the incoming bitstream, continues through the reconstruction blocks exploiting the correlation of pixels, and ends at the motion compensator performing a selective adding of blocks. In order to give an overview of the complexity of the design, a schematic representation of the network is presented in Figure 3.
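As an illustration of how such an upper bound can be obtained, the Python sketch below computes the ratio between the total weight of all firings and the weight of the critical (longest) path in a precedence graph. Interpreting the bound as this ratio is an assumption; the exact definition is the one given in [36]. The function name and the graph representation are hypothetical.

```python
def potential_parallelism(weights, edges):
    """weights: dict job -> execution weight; edges: list of (predecessor, successor).

    Returns total work divided by the critical path length of the acyclic graph.
    """
    successors = {job: [] for job in weights}
    in_degree = {job: 0 for job in weights}
    for pred, succ in edges:
        successors[pred].append(succ)
        in_degree[succ] += 1

    # Earliest finish time of each job under unlimited parallelism
    earliest_finish = {job: weights[job] for job in weights}
    ready = [job for job, degree in in_degree.items() if degree == 0]
    visited = 0
    while ready:
        job = ready.pop()
        visited += 1
        for succ in successors[job]:
            earliest_finish[succ] = max(
                earliest_finish[succ], earliest_finish[job] + weights[succ]
            )
            in_degree[succ] -= 1
            if in_degree[succ] == 0:
                ready.append(succ)
    if visited != len(weights):
        raise ValueError("precedence graph contains a cycle")

    critical_path = max(earliest_finish.values())
    return sum(weights.values()) / critical_path
```

For a diamond-shaped graph with weights 2, 3, 3, and 2, the total work is 10 and the critical path is 7, so no partitioning can achieve a speed-up above 10/7 under this interpretation.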

Target Platform.
The platform used for the experiments is an array of Transport Triggered Architecture (TTA) processors [42]. Several features of this platform allow easy modeling and precise simulation for the purpose of partitioning and scheduling design space exploration. The most significant one is a cacheless architecture with negligible interprocessor communication cost, which makes the profiling results independent of the mapping configuration. In fact, as confirmed by our previous experiments on the platform, the results of a single profiling remain valid for multiple partitioning and scheduling configurations [43]. Hence, such a platform is a good choice for an initial validation of the effectiveness of the methodology.

Tuning of Parameters.
As described in Section 5.3, apart from the length of the tabu tenure, TS is sensitive to two parameters that need to be properly tuned: the time limit and the percentage of explored neighbor solutions. For that purpose, 3 runs on a set of initial partitioning configurations have been performed for each of the four neighborhood structures. First, with the percentage of explored neighbor solutions fixed at 0.5, each TS variant was performed 3 times on the initial set of partitioning configurations. For each run, TS was stopped as soon as 5 minutes had elapsed without improving the best encountered solution (during the current run) by at least 1%. The time limit has been set as the largest encountered stopping time (minus 5 minutes) among all these experiments.
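The stopping rule used for tuning the time limit can be sketched in Python as follows. The function name, the representation of a run as time-ordered (elapsed seconds, best cost) pairs, and the choice to measure the 1% gain against the last significant improvement are assumptions made for illustration.

```python
def stopping_time(improvements, window=300.0, min_gain=0.01):
    """Return the time (in seconds) at which a run would be stopped.

    improvements: time-ordered (elapsed_seconds, best_cost) pairs; the first
    pair is the initial solution. The cost is minimized.
    """
    ref_time, ref_cost = improvements[0]
    for elapsed, cost in improvements[1:]:
        if elapsed - ref_time > window:
            break  # the run was already stopped before this improvement
        if cost <= ref_cost * (1.0 - min_gain):
            ref_time, ref_cost = elapsed, cost  # significant gain: restart the clock
    return ref_time + window


def tuned_time_limit(runs, window=300.0, min_gain=0.01):
    """Largest stopping time over all runs, minus the idle window."""
    return max(stopping_time(run, window, min_gain) for run in runs) - window
```

In this sketch, an improvement smaller than 1% does not restart the 5-minute clock, so a run whose last significant improvement occurred at 60 s is stopped at 360 s and contributes 60 s to the tuned time limit.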
Next, with the selected time limit, all TS variants were tested with the percentage of explored neighbor solutions taken from {0.2, 0.4, 0.6, 0.8} in order to deduce the best value for each neighborhood type. The value 0.4 has been chosen as the one providing the best average results among all instances in the test set. It was also observed that if a method is performed several times on the same instance, it gets similar results. This indicates the robustness of the proposed approach.
Proper tuning of parameters is important in order to reliably compare all of the iterative methods. All stages of parameter tuning in the proposed methodology have been performed automatically. The time limit tuned for TS has also been used as a time limit for the DLS methods. Additionally, since these methods tend to get quickly stuck in local optima, a restarting procedure has been implemented. If DLS finishes before the given time limit has elapsed, it is restarted with a new random solution. At the end, the best solution found (among the restarts) is returned.
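The restarting procedure can be illustrated with the following Python sketch. Several simplifications are assumptions and not part of the paper: the time limit is replaced by a budget of solution evaluations (to keep the example deterministic), the cost of a solution is the maximal processor workload instead of the estimated execution time, and only single-actor REINSERT moves are considered.

```python
import random


def cost(assignment, workloads, n_proc):
    """Toy objective: workload of the most loaded processor."""
    loads = [0] * n_proc
    for actor, proc in enumerate(assignment):
        loads[proc] += workloads[actor]
    return max(loads)


def descent(assignment, workloads, n_proc, budget):
    """Best-improvement descent; stops at a local optimum or when the budget is used."""
    current = list(assignment)
    current_cost = cost(current, workloads, n_proc)
    used = 1
    while used < budget:
        best_move, best_cost = None, current_cost
        for actor in range(len(current)):
            for proc in range(n_proc):
                if proc == current[actor] or used >= budget:
                    continue
                candidate = current[:actor] + [proc] + current[actor + 1:]
                candidate_cost = cost(candidate, workloads, n_proc)
                used += 1
                if candidate_cost < best_cost:
                    best_move, best_cost = (actor, proc), candidate_cost
        if best_move is None:
            break  # local optimum reached
        current[best_move[0]] = best_move[1]
        current_cost = best_cost
    return current, current_cost, used


def dls_with_restarts(workloads, n_proc, budget, seed=0):
    """Restart the descent from random solutions until the budget is exhausted."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    remaining = budget
    while remaining > 0:
        start = [rng.randrange(n_proc) for _ in workloads]
        solution, solution_cost, used = descent(start, workloads, n_proc, remaining)
        remaining -= used
        if solution_cost < best_cost:
            best, best_cost = solution, solution_cost
    return best, best_cost
```

The best solution over all restarts is returned, mirroring the procedure described above; replacing the evaluation budget by a wall-clock check would recover the time-limited variant.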

Methodology of Experiments.
Most of the tools used for the experiments are components of the Turnus codesign framework [44]. They include the generation of an ETG for a given statistically meaningful input stimulus, the performance estimation tool exploiting the ETG and the results of the platform profiling, and the generation of partitioning configurations using different algorithms. Complementary units in this workflow are the profiling of the TTA architecture and a TTA cycle-accurate simulator [45] that allows a verification of the estimated results in terms of a real execution time obtained on the platform. The complete workflow is presented in Figure 4. The partitioning configurations were generated using each of the described algorithms for a number of processors between 2 and 8. Considering the choice of application, 8 units should already approach its potential parallelism. For the local search methods that require specifying an initial solution, in each case, two sets were tested: the random one and the one generated by the WB algorithm. Such a choice was made in order to provide the algorithms with possibly good, as well as bad, initial configurations and also to observe their sensitivity to the quality of an initial solution. The evaluation of the solutions generated by each algorithm was accomplished by the performance estimation tool that calculated the total execution time expressed in clock-cycles. Based on those values, the speed-up versus the mono-core execution was calculated in each case. The results presented in this section target the calculated values of speed-up in order to relate them easily to the potential parallelism of the application. Finally, the results obtained by estimation were also consistently verified by the platform execution.

Greedy and Descent Local Search Heuristics.
Table 1 contains the speed-up values of partitioning configurations generated with the WB and BP algorithms along with the values estimated for a random set of configurations. Tables 2 and 3 contain the results obtained for the IDLS and CFDLS heuristics for the two sets of initial partitioning configurations.
Since the purpose of a greedy constructive method is to build a solution from scratch, an important property is the scalability of the performance. In this case, both algorithms scale; however, the BP reaches saturation already at around 5 processors, unlike the WB, which scales further. The maximal speed-up obtained for BP configurations is similar to that of the random configurations but is achieved on a smaller number of processors (5 versus 8). Applying the IDLS and CFDLS methods improves the initial solution in all cases, but the improvement is bigger for CFDLS. The quality of the solution provided by the DLS heuristics also depends strongly on the quality of the solution provided as a starting configuration.

Tabu Search.
The first experiment aimed at confirming the most beneficial types of moves. It was performed separately for each type of neighborhood structure. In the first execution, only the REINSERT moves were allowed, whereas in the second one, the SWAP moves were also included. SWAP moves were not considered alone, since they do not lead to any change of the initial size of each partition (resulting in a nonconnected solution space). The results of this comparison for each neighborhood type are presented in Tables 4-11. Admitting the SWAP moves brought a significant improvement only to the workload-balancing neighborhood. In fact, the performance of this neighborhood based on REINSERT only was very poor, and a slight improvement has been introduced only for certain initial configurations. This is due to the fact that the space of possible moves is very narrow in this case (only actors from the most occupied partition are taken into consideration) and the tabu list can be very restrictive. Since it also aims at balancing the workload, for a higher number of processors, it is not rare to encounter a solution where the heaviest bottleneck actor is placed alone on a processor. Due to the solution definition described in Section 5.3.1, the algorithm cannot proceed from that point. Relative balancing of the workload between two partitions instead of an overall balancing seems to be a much more effective approach.
For the other neighborhood structures, allowing SWAP moves decreased the quality of the final solution in the vast majority of the cases. This might be because SWAP moves unnecessarily increase the set of neighbor solutions and reduce the diversification ability of the method. Comparing the neighborhood structures, some conclusions can be drawn. First, the idle-based neighborhood seems to outperform the other variants, including the communication-frequency (CF) one. This observation is contrary to what has been previously observed for the DLS heuristics, where the search based on the communication frequency outperformed the idle optimization. It confirms that determining a local optimization criterion is one challenge, whereas employing an appropriate exploration strategy (i.e., the TS framework) is another. Finally, the results obtained for the idle-based and CF neighborhoods also indicate that a guided choice of moves outperforms a random selection. In other words, the complete freedom of choice when choosing a move, as for the random neighborhood, does not necessarily lead to competitive solutions.
The final part of the experiments with TS involves a comparison of its two advanced variants. Taking into consideration the previous observations, the analysis involved the workload-balancing neighborhood based on the SWAP moves and the idle-based, CF, and random neighborhoods based on REINSERT moves. Since different types provide different sizes of the neighborhood sets, such sizes have been equalized according to the averaged values. For this reason, another parameter, namely, the admission rate, has been introduced for each neighborhood structure. The admission rate expresses the percentage of moves that is generated at each iteration. For the idle-based and CF neighborhoods, a given percentage of moves is extracted according to the priorities (i.e., most idle or most communicative actors, resp.). For the workload-balancing and random neighborhoods, since there are no priorities, the solutions are extracted randomly. The values of the admission rate have been tuned as follows: 0.9 for  (), 0.48 for  (CF), 0.16 for  (), and 0.08 for  ().
Tables 12 and 13 contain the results of the analysis of the advanced variants of TS. In almost all cases, PTS performed better than JTS and provided results that, considering the previously mentioned upper bound on the potential parallelism of the application, can be considered close-to-optimal. PTS and JTS were also less sensitive to the quality of the initial configuration. In fact, in a few cases, a random initial solution leads to better results than a balanced initial configuration.
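The role of the admission rate can be illustrated with a short Python sketch. The function name `admit_moves`, the representation of a move as a tuple, and the rounding up of the number of admitted moves are assumptions made for illustration.

```python
import math
import random


def admit_moves(moves, rate, priority=None, rng=None):
    """Keep only a fraction `rate` of the candidate moves.

    With a `priority` function, the highest-priority moves are kept (as for
    neighborhoods driven by idle time or communication frequency); without
    one, the admitted moves are drawn at random.
    """
    count = math.ceil(rate * len(moves))
    if priority is not None:
        return sorted(moves, key=priority, reverse=True)[:count]
    rng = rng or random.Random()
    return rng.sample(moves, count)
```

For example, with four candidate moves annotated with an idle time and an admission rate of 0.5, the two moves involving the most idle actors are kept:

```python
moves = [("a", 0, 0.1), ("b", 1, 0.7), ("c", 1, 0.4), ("d", 0, 0.9)]
kept = admit_moves(moves, 0.5, priority=lambda move: move[2])
```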

Consistency Verification.
In order to verify the consistency of the obtained estimated results and their relation to the platform execution, a comparison of execution times was performed on a representative fraction of the generated partitioning configurations covering each of the considered numbers of processors. Figure 5 presents a chart indicating the consistency of the estimation and platform execution results. Finally, Table 14 summarizes the best solutions obtained with PTS for each number of processors. Apart from the values of execution times expressed in clock-cycles and the speed-up, the distance between the execution time and the length of the critical path, expressed in %, is also highlighted. This value indicates how far a given solution is from the maximal parallelism of the application. It might be particularly valuable information for the application designer for tracking how closely a solution approaches the potential parallelism of an application. The last column contains the value of the estimation discrepancy for this particular solution.
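The quantities reported for each solution can be computed as in the following Python sketch. The normalizations are assumptions, since the exact formulas are not stated here: the distance is taken relative to the critical path length and the discrepancy relative to the measured platform execution time. The function names are hypothetical.

```python
def speed_up(mono_core_cycles, multi_core_cycles):
    """Speed-up versus the mono-core execution."""
    return mono_core_cycles / multi_core_cycles


def distance_to_critical_path(execution_cycles, critical_path_cycles):
    """Distance (in %) between the execution time and the critical path length."""
    return 100.0 * (execution_cycles - critical_path_cycles) / critical_path_cycles


def estimation_discrepancy(estimated_cycles, measured_cycles):
    """Discrepancy (in %) between estimated and measured execution times."""
    return 100.0 * abs(estimated_cycles - measured_cycles) / measured_cycles
```

A distance of 0% would mean that the execution time equals the critical path length, that is, the solution fully exploits the potential parallelism of the application.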
6.6. Discussion. Comparing the results obtained for all implemented algorithms, the first observation is that according to the decreasing quality of the output solutions, the algorithms can be ordered as follows: advanced TS variants, TS, DLS, and the greedy constructive procedures. This ranking is consistent, as a more refined approach outperforms a simpler one. It highlights that the specific ingredients belonging to a more refined method are relevant. Additionally, finding a good partitioning configuration for a small number of processors (i.e., 2 or 3) is relatively easy and the differences between the solutions provided by different algorithms are minor. For instance, for the case of two processors, the difference between the solutions provided by the best and the worst algorithms is less than 6%. With the increasing number of processors, the differences become more significant. For the case of the WB algorithm, the biggest difference of 30% with respect to PTS can be observed at around 5 processors, whereas for the BP algorithm, on 8 processors, the difference goes up to 100%. The comparison of different variants of TS leads to a conclusion that the resulting solution might benefit from varying the definition of the neighborhood. In fact, both JTS and PTS outperformed the variants where only one type of neighborhood was taken into consideration. Among the advanced variants, the success of PTS over JTS might rely on two factors: (1) using the history of local search, which allows an adaptation of the search to the properties of the test case and (2) a much smaller size of the neighborhood in each iteration, which contributes to a diversification of the search.
13 min;  () 44 min;  (CF) 84 min;  () 318 min; JTS 308 min; PTS 276 min
An important aspect that must also be taken into account when evaluating the algorithms is the time required for their completion. It involves the evaluation time for all considered solutions in all iterations, the extraction of the optimization criteria, and the computation of new solutions. For DLS and TS, the upper bound on the time is defined by the user. However, for each algorithm, the time at which the last improving move (before termination at the specified point) was performed has been recorded. The averaged values among different instances are summarized in Table 15. For the TS, a big difference is visible between  () and the other variants. In fact, in this work, it is the time for  () that enforces the time limit for all other algorithms, but in the case of this particular variant, it does not necessarily correspond to the quality of the final solution. A promising observation can be made for the advanced variants of TS, since PTS not only provides the best results, but also succeeds in ca. 10% shorter time than JTS. In all cases, the most impacting factor is the number of iterations performed, since the cost of the performance estimation and of the simultaneous extraction of optimization criteria far outweighs the cost of computing the new solution.
Since the described partitioning methodology relies on an estimation of the actual performance, an important question is the estimation quality in terms of precision. This specific issue occurs in various fields [46]. For the experiments performed in this work, the average estimation discrepancy is between 3.35% (in total) and 10.31% (for the subset of best solutions). These values can be considered acceptable, since they still allow the DLS and TS algorithms to iteratively improve an initial solution. In order to explain the reasons for this discrepancy, the first factor that must be taken into account is the general uncertainty of profiling [47]. The estimation of the execution time must rely on a limited set of measurements coming from the platform, which can be burdened with some errors. On the other hand, there are some properties of the program execution that are not present in the proposed execution model, because they are hard to model tractably. A good example is the overhead introduced by a partition scheduler [48], which might depend on multiple factors, such as the number of actors in one partition, the properties of the scheduling policy, the number of conditions to be checked before an actor is executed, or even the order of appearance of actors on the list representing each partition.

Conclusion
This paper presents a partitioning methodology for dynamic dataflow programs that is based on a program execution model and uses multiple solution methods belonging to different classes of optimization algorithms: greedy constructive procedures, descent local search, and the tabu search metaheuristic. The algorithms differ in terms of the time needed for generating a solution, the quality of the final solution, and the way they explore different properties of a dataflow application and its execution on the target platform. The best performing algorithms have been verified to approach the full potential parallelism of dataflow programs and hence to be capable of efficiently exploring the design space in all the partitioning dimensions.
The algorithms are based on the performance estimation of a program execution on a target platform. The estimation takes into account the partitioning, scheduling, and buffer size configuration and is capable of evaluating performance in terms of the total execution time. During the evaluation, the execution properties are also tracked and extracted in order to provide optimization criteria to the algorithms. The estimation has been experimentally verified to be highly accurate and consistent with real executions on the considered platform.
A direction for future improvements is to investigate the opportunities of extending the model and the methodology with further properties in order to minimize the estimation discrepancy. The methods might include tracking the variance for different executions of the same action in order to provide the model with individual weights for each job and profiling of the scheduling cost, which might be used for accurate scheduling and buffer size optimization. Further study is also required in order to determine the level of influence of these configurations on each other in order to effectively explore the design space in multiple dimensions. After validating the methodology on a simple platform with a small level of uncertainty, an important objective is to consider higher-performance platforms, which could turn out to be more difficult to model. Studying the methodology of profiling the communication cost occurring on such platforms is currently ongoing work.

Figure 1: Construction of a sample dataflow program.
). A fixed relative order is decided within each group. It can be assumed that this order is established based on the program's input data.
(ii) Precedence constraint: (,   ) means that a job  must be completed before job   is allowed to start.
(iii) Setup constraint: it requires that, for each existing connection (,   ) involving jobs from different groups, a setup (or communication) time    occurs.
(iv) Communication channel capacity constraint: the size of a communication channel (buffer) through which the information (tokens) is being transmitted is bounded by . That is, the sum of    's assigned to this buffer (  ) cannot exceed . If it occurs, it might affect the executability of  and   and introduce serious delays in the overall makespan.
2.2. Architecture Impact. Executing the dataflow program on a given architecture requires introducing the notion of time to its originally time-abstracting definition. It is achieved by assigning a concrete value to the weight   of each job and to the communication time    . The types and ranges of values for   's and    's fully depend on the properties of the targeted architecture. For the case of homogeneous platforms,   is constant no matter how a group (actor) is actually partitioned. It is not the case for heterogeneous platforms, where this value can vary according to the processor family that the processing unit belongs to (i.e., software or hardware).

Table 1: Speed-up: greedy constructive procedures.

Table 12: PTS and JTS speed-up: balanced start.

Table 13: PTS and JTS speed-up: random start.

Table 15: Averaged time of the final improvement.