Meta-algorithms for Scheduling a Chain of Coarse-grained Tasks on an Array of Reconfigurable FPGAs

This paper considers the problem of scheduling a chain of n coarse-grained tasks on a linear array of k reconfigurable FPGAs with the primary objective of minimizing reconfiguration time. A high-level meta-algorithm along with two detailed meta-algorithms (GPRM and SPRM) that support a wide range of problem formulations and cost functions is presented. GPRM, the more general of the two schemes, reduces the problem to computing a shortest path in a DAG; SPRM, the less general scheme, employs dynamic programming. Both meta-algorithms are linear in n and compute optimal solutions. GPRM can be exponential in k but is nevertheless practical because k is typically a small constant. The deterministic quality of this meta-algorithm and the guarantee of optimal solutions for all of the formulations discussed make this approach a powerful alternative to other meta-techniques such as simulated annealing and genetic algorithms.


Introduction
In this paper, we consider the problem of scheduling a chain of coarse-grained tasks on a linear array of reconfigurable FPGAs. This scheduling problem arose in the development of the CHAMPION software environment for mapping image processing applications onto an adaptive computing system (ACS). CHAMPION [1], which was developed at the University of Tennessee, maps a high-level data-flow diagram developed using the Cantata graphical programming software [2,3] onto an ACS. CHAMPION maps the workspace to a netlist and performs data width matching and synchronization, followed by partitioning. Each resulting subnetlist is translated into structural VHDL, which is merged with VHDL files specifying the ACS architecture and synthesized using commercial synthesis tools. The partitioning step, which is described in detail in [4], takes as input a directed acyclic graph (DAG) and computes a partition $P = \{P_1, P_2, \ldots, P_n\}$ such that each $P_i$ can be implemented on a single target FPGA. The partition algorithm must consider (1) the capacity of the partition, (2) the number of I/O pins, and (3) the limit on RAM access modules. If $n$ is greater than the number of FPGAs on board ($k$), multiple configurations of the FPGA (i.e., temporal partitioning) will be necessary. To reduce design complexity, a design decision was made to use a constrained configuration of the boards such that all signals pass in one direction along the array of FPGAs. The signal flow must accordingly be unidirectional from $P_1$ to $P_n$ on the task chain. The input DAG can be linearized easily using topological sorting. However, the partitioning step must maintain the unidirectional signal flow when it allocates multiple DAG nodes to individual tasks on the task chain. The DARPA image processing application was successfully mapped onto the Wildforce-XL, the SLAAC, and MSP FPGA boards. Examples of commercial boards with multiple state-of-the-art Xilinx and Altera FPGAs can be found at http://www.dinigroup.com/new/products.php. The number of FPGAs on these boards is typically six or fewer, although the DN7020K10 and DNDPBS327 boards have 20 and 27 FPGAs, respectively.
The result of the partitioning algorithm described above (and the input to the formulations described in this paper) is a directed chain of coarse-grained tasks such that exactly one task fits on an FPGA. The objective of this paper is to design scheduling meta-algorithms that optimally map tasks to FPGAs under some measure of cost. Our two algorithmic schemes are the general purpose reconfiguration meta-algorithm (GPRM) and the special purpose reconfiguration meta-algorithm (SPRM). Both are meta-algorithms in that they can each address several design scenarios. (The meta-algorithm concept is discussed in more detail in Section 2.) Our focus in this paper is primarily on minimizing reconfiguration time (although it is possible to incorporate execution time into our cost formulations) because reconfiguration times typically dominate execution times by several orders of magnitude. For example, Natarajan et al. [1] reported that the hardware execution time to process one image using an ACS implementation of ATR, an automatic target recognition algorithm, was 33 milliseconds, as compared to nearly seven seconds needed for the entire execution. Gajjala Purna and Bhatia [5] reported that the hardware execution time for a set of four applications was on the order of tens of microseconds, while a single reconfiguration requires 242 milliseconds on the RACE architecture. More recently, Birla and Vikram [6] reported that an integral image computation was executed in 12.36 s with reconfiguration times ranging from 3.309 to 52.944 ms (depending on the configuration clock frequency). A feature extraction and classification computation had an execution time of 8.66 s with reconfiguration times ranging from 3.392 to 54.268 ms.
Although the original motivation for this work was to configure a board with several FPGAs, the underlying abstraction employed by our meta-algorithms can also, in principle, capture partial reconfiguration that allows "specific regions of the FPGA to be reprogrammed with new functionality while applications continue to be run in the remainder of the device" [7]. The performance of partial reconfiguration continues to be an active area of research, with recent work being focused on developing cost models that characterize reconfiguration times accurately [8-10]. Accurate cost models provide reconfiguration time data that is vital to scheduling algorithms such as SPRM and GPRM presented in this paper.
Section 3 formulates the problem and relates it to existing research. In Sections 4 and 5, respectively, we present the SPRM and GPRM meta-algorithms along with several specific instantiations. Section 6 combines the individual formulations into detailed SPRM and GPRM meta-algorithms. Section 7 contains experimental results and Section 8 concludes the paper.

New Algorithmic Techniques Used
A meta-algorithm is a high-level algorithmic strategy that is somewhat independent of the detailed algorithms used for solving precise instantiations of the problem. One may view techniques such as simulated annealing and genetic algorithms as meta-algorithms. These techniques have to be customized to the precise problem being solved. For example, simulated annealing may be customized by instantiating the concept of state, cost, and move for a given problem. Because of their stochastic nature (the use of randomness is an integral component of these techniques), simulated annealing and genetic algorithms have been widely used to solve a very large and diverse set of problems. However, the stochastic nature of these techniques leads to some difficulties in their deployment: (1) experimental results are not guaranteed to be reproducible and (2) the selection of parameters (such as the cooling factor in simulated annealing) has an appreciable impact on the runtime of the algorithm and the quality of solution. Although there are guidelines and best practices on choosing parameters, the practitioner does not know a priori what parameters will work well; considerable effort has to be expended in trial-and-error. It is also possible that different parameters are needed for different inputs.
The idea of a meta-algorithm can also refer to high-level strategies for a specific problem such as hierarchical web caching [11]: that paper experiments with challenging the high-level implicit strategy used in hierarchical caching that "a hit for a document at an $l$-level cache leads to the caching of the document in all intermediate caches (levels $l - 1, \ldots, 1$) on the path towards the leaf cache that received the initial request." These alternative strategies for being more selective in choosing the caches that store a local copy of the requested document are considered to be meta-algorithms because they operate independently of the actual replacement algorithm running in each individual cache.
Another scenario where meta-algorithms can be useful is when there are several detailed algorithms available to solve different variations of a problem. In this case, the meta-algorithm provides a procedure that navigates the available options and chooses which detailed algorithm to deploy. A simple example of this occurs when a difficult problem might be solved using an exhaustive algorithm for small input sizes (e.g., $n < 20$) and a heuristic or an approximation algorithm for large input sizes (e.g., $n \ge 20$). In this case, the (simple) meta-algorithm chooses which detailed algorithm to deploy. If a better heuristic or approximation algorithm is discovered for $n \ge 20$, the original one can be replaced without impacting the meta-algorithm. This approach has been used to help a user automatically select the most suitable algorithms during the data mining model-building process [12].
To the best of our knowledge, the use of meta-algorithms in VLSI design automation (and in configuration minimization problems, in particular) is novel.
There are two respects in which we describe meta-algorithms for this problem.
(1) This paper describes two algorithms (SPRM and GPRM) to minimize reconfiguration time of a sequence of coarse-grained tasks that are to be executed in a linear array of FPGAs. SPRM is based on dynamic programming while GPRM is based on a shortest-paths formulation. SPRM is faster but is concerned with a special case of the problem (where cost is a function of adjacent tasks in the chain), while GPRM is concerned with the general case (where cost is a function of all tasks in adjacent configurations). The SPRM and GPRM methods are themselves meta-algorithms that can be customized to address several detailed problem formulations and cost specifications. These formulations include reconfiguration time minimization, dynamic task generation, repeated and similar tasks, implementation libraries, and limited cycles, all of which are discussed later.

    Algorithm 1: Metaheuristic that provides a high-level framework for choosing an appropriate algorithm based on the task, implementation, and cost models being considered.
        (2) then Use optimal LIU-based algorithm [13]
        (3) else if (task model is a Chain or a Loop and implementation model is a single unit/unstructured)
        (4) then Use Dynamic Programming [14]
        (5) else if (task model is a Chain and implementation model is a Linear Array)
        (6) then if (cost is a function of adjacent tasks)
        (7) Use SPRM
        (8) else // cost is a function of all tasks in a configuration
        (9) Use GPRM.
        end
(2) Along with Ghiasi et al. [13] and Bondalapati and Prasanna [14], SPRM and GPRM represent a novel use of these techniques in the configuration of coarse-grained tasks with the goal of minimizing reconfiguration time on reconfigurable logic. Following the approach of Fan and Lei [12], we present a framework within which these algorithmic techniques can be deployed. We present a metaheuristic (Algorithm 1) that provides a high-level framework for choosing an appropriate algorithm based on the task, implementation, and cost models being considered.

Previous Work and Problem Formulation
3.1. Previous Work. Some of the key differences between the work presented in this paper and early dynamic reconfiguration algorithms [5, 15–19] are as follows: (1) our tasks are more coarse-grained, whereas the previous algorithms operate at the netlist level; (2) we assume that an array of FPGAs is available, whereas the previous work assumes that there is a single dynamically reconfigurable FPGA; and (3) we assume that the application represented by a directed acyclic graph (DAG) of tasks has been converted into a linear chain of tasks in a precomputation step. The mapping of a DAG to a linear chain and the rationale and implications of working with linear chains are discussed in the next subsection.
In addition, there are two relatively recent papers that consider coarse-grained tasks. However, neither of these allows for a linear array of FPGAs. We describe their work in greater detail below.
Bondalapati and Prasanna [14] consider a formulation where a chain of tasks is implemented on a single configurable logic unit. A single task is executed on the configurable logic unit at a time. When its execution is complete, the logic unit is reconfigured to execute the next task in the list and so on. Each task can be implemented using multiple configurations. The choice of configurations for consecutive tasks affects the reconfiguration time. A dynamic programming algorithm optimizes the total reconfiguration time in $O(nm^2)$ time, where $n$ is the number of tasks and $m$ is the number of configurations for each task. Our formulation fundamentally differs from theirs in that our chain of tasks will be implemented on a linear pipelined array of $k$ configurable logic units. Indeed, their dynamic programming solution may be viewed as a special case of our shortest-paths solution to our GPRM formulation when $k = 1$. (We note that Bondalapati and Prasanna further consider the case where their chain of tasks is contained in a loop requiring repeated execution and unrolling, which is outside the scope of our work.)
Ghiasi et al. [13] consider a formulation where a directed acyclic graph (DAG) of tasks has to be scheduled and executed on partially reconfigurable hardware with capacity $k$; that is, at most $k$ tasks may be allocated at a time. Different tasks in the DAG may be identical, so savings in reconfiguration time may be obtained by caching a task in the partially reconfigurable hardware (PRH). Their algorithms are based on extending solutions to paging problems. This is possible because, in their formulation, any task can be placed in any of the $k$ locations in the PRH, which gives them much more flexibility. Our tasks must be placed on a linear pipelined array in the order in which they appear in the chain to facilitate signal flow. Further, they do not address some of the formulations addressed in this paper (such as dynamic task generation and implementation libraries).

Discussion of Design Context.
In this section we discuss the rationale and implications for the following design decisions.
(1) Model of computation: the DAG has traditionally been used as a general model of computation in the literature. As described earlier, our approach is to preprocess the DAG, transforming it into a linear chain of tasks and then applying our algorithms to the task chain. In principle, this preprocessing can be achieved by performing a topological sort on the DAG. Edges connecting vertices that are not adjacent in the chain can be accommodated by including simple tasks in the intermediate chain vertices that facilitate the passing-through of the data. The existence of many such edges will add overhead, possibly making this approach impractical. Other considerations during this transformation are discussed in [4].
(2) Implementation model: FPGA boards contain a routing switch which permits a portion of the I/O of any FPGA to be interconnected to a portion of the I/O of any other FPGA on the board. In this work, we are using a constrained configuration of the boards such that signals flow in one direction along the array of FPGAs.
The rationale for these decisions is that many applications (e.g., image processing) can be naturally decomposed into a linear sequence of steps which map directly into a chain. For these applications, the linear flow of data results in communication paths between all of the FPGAs that are identical and predictable. This greatly simplifies system design and makes it more amenable to automation. Some of the implications and consequences of these decisions are listed below.
(1) The traditional advantage of using a DAG computation model over a linear chain is that it permits tasks to be executed in parallel on different processors. In current FPGA systems, the response time is dominated by FPGA reconfiguration delays and not computation time [1,5,6]. This, along with the limited number of FPGAs available on a board, minimizes the traditional advantage of working directly with a DAG. Note, however, that our approach does not preclude the use of fine-grained parallelism in the implementation of a task. In other words, the implementation of that task in an FPGA could require several configurable logic blocks (CLBs) to execute in parallel. In the event that computation time is significant, our approach permits reconfiguring earlier FPGAs in parallel with ongoing computations in later FPGAs [20] and also permits reconfiguring FPGAs in parallel.
(2) Since FPGA technology results in fast execution times for complex algorithms, it is important to ensure that the scheduling algorithm itself does not consume an excessive amount of time. An advantage of linear chains is that scheduling algorithms are significantly faster than those for DAGs.
(3) A linear chain implementation of an algorithm makes it more amenable to pipelining. Thus, several images can be pipelined through an FPGA board configuration and the partial results stored. The board can then be reconfigured and the partial results pipelined through a new board configuration. This amortizes reconfiguration time over several images.

Consolidated Problem Formulation.
As mentioned previously, the meta-algorithms described here capture several scenarios. Each such scenario will require a subtly different problem formulation. We begin by presenting two abstracted formulations that capture the essence of the problems solved by SPRM and GPRM: we describe (1) input, (2) output, and (3) an abstracted cost function.
Input is a chain of tasks labeled 1 through $n$ as shown in Figure 1 and an integer $k$ denoting the number of FPGAs on a board.
Output is an optimal assignment of each task $i$ to a temporal board configuration and an FPGA denoted by $(t_i, f_i)$. $t_i$ is an integer between 1 and the total number of board configurations that denotes the temporal board configuration that Task $i$ is assigned to. $f_i$ is an integer between 1 and $k$ (the number of FPGAs on the board) that denotes the FPGA that Task $i$ is assigned to. The output may be visualized as the two-dimensional array shown in Table 1. In this example, $t_5 = 2$ and $f_5 = 3$ because Task 5 is assigned to FPGA 3 in Board Configuration 2. Note that some entries may be left blank (e.g., (1, 4) and (2, 1) in Table 1) in the optimal solution found by our algorithms. As we will see later, this could happen as a result of scheduling repeated tasks on the same FPGA in consecutive board configurations in order to minimize reconfiguration time.
Linear Task Order Constraint. Any pair of tasks $i$ and $j$ with $i < j$ must satisfy the following constraints: either $t_i < t_j$ (i.e., task $i$ appears in an earlier board configuration than task $j$) or $t_i = t_j$ and $f_i < f_j$ (i.e., tasks $i$ and $j$ appear in the same board configuration but $i$ is assigned to an FPGA that appears earlier in the pipeline than the FPGA allocated to $j$).
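The constraint above says that the (configuration, FPGA) pairs must increase lexicographically along the chain. A minimal sketch of a checker follows; the function name and the list-of-pairs input format are assumptions made for illustration and do not come from the paper.

```python
def satisfies_linear_task_order(assignment):
    """Check the linear task order constraint.

    assignment[i] is a hypothetical (configuration, fpga) pair for the
    (i+1)-th task of the chain, with both entries numbered from 1.
    """
    # Python compares tuples lexicographically, so (c1, f1) < (c2, f2) holds
    # exactly when c1 < c2, or c1 == c2 and f1 < f2. Checking adjacent tasks
    # is sufficient because this ordering is transitive.
    return all(a < b for a, b in zip(assignment, assignment[1:]))


# Illustrative data only: five tasks, with Task 5 on FPGA 3 of configuration 2.
valid = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3)]
invalid = [(1, 2), (1, 1)]  # the second task sits earlier in the pipeline
print(satisfies_linear_task_order(valid), satisfies_linear_task_order(invalid))
```

Because blank entries are allowed, the check only requires strictly increasing pairs and does not require consecutive FPGA numbers.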
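We have not yet defined optimality. We consider two cost scenarios that lead to a fundamental distinction between the SPRM and GPRM strategies. In the first scenario, the cost of moving from one board configuration to the next is a function only of the adjacent tasks in the chain that are separated into different configurations (i.e., the last task of one configuration and the first task of the next). The SPRM scheme is used to compute an optimal solution for this scenario. In the second scenario, the reconfiguration cost is a function of all tasks assigned to the two consecutive configurations. Note that this scenario subsumes the SPRM scenario. The GPRM scheme is used to compute an optimal solution for this more general scenario.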

SPRM
We present three formulations of the SPRM problem. The first is a basic formulation, the second permits dynamic task generation, and the third permits limited use of bidirectional edges and cycles. The three scenarios are quite different, but all of them share the property that the problem is fully specified by specifying costs between adjacent tasks in the chain.

SPRM Formulation 1: Basic.
Given a chain of $n$ tasks and a cost $c_{i+1} = c(i, i+1)$ of separating tasks $i$ and $i + 1$ into different configurations, compute a set of cuts of minimum total cost such that each configuration (represented by the tasks between adjacent cuts) has $k$ or fewer tasks.
Application. In a chain of tasks, data is propagated along the pipeline from task $i$ to the next task $i + 1$. When these tasks belong to FPGAs in the same board configuration, the transfer can take place directly. However, when they are not in the same board configuration, the data must be stored in memory by task $i$ and read from memory by task $i + 1$ after the board is reconfigured so that task $i + 1$ may proceed with its computation. $c_{i+1}$ denotes this cost and could be substantial in case the data consists of images or video. For the example of Figure 2, the optimal solution consists of making cuts at edges 2, 6, and 10 with costs 3, 50, and 5, respectively, resulting in a cost of 58. The cuts are shown in Figure 3.
The dynamic programming algorithm is described below. We employ two one-dimensional arrays cost and firstCut of size $n$ (where $n$ is the number of tasks in the chain). The element cost[$i$] will, upon completion of the algorithm, contain the cost of an optimal set of cuts separating the tasks in the subchain from $i$ to $n - 1$. Note that this value will be zero if $n - i \le k$, since no cuts are required to separate a chain consisting of $k$ or fewer tasks. The element firstCut[$i$] will contain the location of the first cut in the subchain from $i$ to $n - 1$ in an optimal solution. The algorithm is outlined in Algorithm 2.
The elements of the cost and firstCut arrays are computed in reverse order. Lines 3 and 4 consider the case where the task chain from $i$ to $n - 1$ contains $k$ or fewer tasks, whereas lines 5-9 consider the general case. The general case is addressed by considering all the ways in which the first cut in the subchain starting at $i$ can be made. For each possibility $j$, we use the value of cost[$i + j + 1$] that was computed in a previous iteration and the cut cost $c_{i+j+1}$.
The cost array upon completion of execution of the algorithm on the input of Figure 2 is shown in Table 2 and the firstCut array in Table 3.
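The dynamic program described above can be sketched in Python as follows. This is a minimal illustration, not the paper's Algorithm 2 verbatim: the function name, the 0-based task numbering, and the convention that `cut_cost[p]` is the cost of cutting the edge between tasks `p - 1` and `p` are assumptions, and the input values are made up because the data of Figure 2 is not reproduced here.

```python
def sprm_basic(cut_cost, n, k):
    """Return (minimum total cut cost, list of cut positions).

    cut_cost[p] (1 <= p <= n-1) is the assumed cost of cutting the edge
    between tasks p-1 and p; cut_cost[0] is unused.
    """
    cost = [0] * n            # cost[i]: optimal cost for subchain i..n-1
    first_cut = [None] * n    # first_cut[i]: first cut in that subchain
    for i in range(n - 1, -1, -1):
        if n - i <= k:
            continue          # the whole subchain fits in one configuration
        best, best_cut = float("inf"), None
        for j in range(k):    # first configuration holds tasks i..i+j
            p = i + j + 1     # cut the edge just before task p
            candidate = cut_cost[p] + cost[p]
            if candidate < best:
                best, best_cut = candidate, p
        cost[i], first_cut[i] = best, best_cut
    # Follow first_cut pointers to list the cuts of an optimal solution.
    cuts, i = [], 0
    while first_cut[i] is not None:
        cuts.append(first_cut[i])
        i = first_cut[i]
    return cost[0], cuts


# Hypothetical instance: 6 tasks, 3 FPGAs per board configuration.
print(sprm_basic([0, 4, 1, 7, 2, 6], n=6, k=3))
```

The two nested loops give the $O(nk)$ running time, and following the `first_cut` pointers plays the role of the separate cut-output routine.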

Theorem 1. Algorithm COST (Algorithm 2) computes an optimal solution to SPRM Formulation 1.
Proof. Clearly, the algorithm returns the correct value (zero) when $n - i \le k$, since all the tasks can be accommodated in one configuration and there is no need to incur reconfiguration costs. If $n - i > k$, it must be the case that at least one cut is needed. Our dynamic programming solution considers all $k$ possibilities for the leftmost cut and uses previously computed optimal cost values for the subchains.

Theorem 2. Algorithm COST has time complexity 𝑂(𝑛 log 𝑘) and space complexity 𝑂(𝑛).
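Proof. There are $O(n)$ elements in the cost array, each of which is computed in $O(k)$ time. This results in a complexity of $O(nk)$. We note that this can be improved to $O(n \log k)$ as follows: observe that in order to compute any element cost[$i$], we obtain the minimum from a set of $k$ quantities of the form cost[$i + j + 1$] $+ c_{i+j+1}$. The sets for consecutive values of $i$ differ by one inserted and one deleted quantity, so keeping them in a heap of size at most $k$ that supports deletion gives $O(\log k)$ time per element.

    Algorithm 2: COST() {Compute minimum cost using dynamic programming}
        (1) for $i = n - 1$ down to 0 do // $n$ is number of tasks in chain
        …
        (6) else // General case
        (7) for $j$ in $[0, k - 1]$ do // Try $k$ cuts, determine which is best
        …

The algorithm to output cuts is given in Algorithm 3.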
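The improved bound rests on the fact that the candidate sets for consecutive subchain starts overlap almost entirely. The sketch below illustrates the idea with a monotonic deque (a standard sliding-window-minimum technique swapped in here for illustration; it is not necessarily the data structure the authors intended, and it runs in $O(n)$ total time, which is within the stated bound). The function name and the `cut_cost` indexing convention (cost of cutting between tasks `p - 1` and `p`, 0-based) are assumptions.

```python
from collections import deque


def sprm_cost_fast(cut_cost, n, k):
    """Minimum total cut cost, computed with a sliding-window minimum."""
    cost = [0] * n
    # The deque holds candidate cut positions p; their values
    # cut_cost[p] + cost[p] increase from the front to the back.
    window = deque()

    def value(p):
        return cut_cost[p] + cost[p]

    for i in range(n - 1, -1, -1):
        p = i + 1                       # new candidate entering the window
        if p <= n - 1:
            while window and value(window[-1]) >= value(p):
                window.pop()            # dominated by the new candidate
            window.append(p)
        while window and window[0] > i + k:
            window.popleft()            # too far: configuration would exceed k tasks
        if n - i > k:                   # general case: at least one cut needed
            cost[i] = value(window[0])
    return cost[0]


print(sprm_cost_fast([0, 4, 1, 7, 2, 6], n=6, k=3))
```

Each position enters and leaves the deque at most once, so the per-element work is constant on average.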

SPRM Formulation 2: Dynamic Task Generation.
The dynamic task generation problem is a modification of the formulation described above. As before, the input is a linear chain of tasks. However, in addition to the $c_{i+1}$ costs associated with each edge, a boolean parameter is also associated with each node. If this parameter is set to true, a cut on either side of this node will require the creation of an additional node. This additional node will occupy one of the available FPGAs and therefore directly affects the calculation of the optimum cost and cut-set in the algorithm.
Application. When a task is placed in the first FPGA in a configuration, it will need to read intermediate data from memory. Similarly, when a task is placed in the last FPGA in a configuration, it will need to write intermediate data to memory. This additional read/write functionality must be implemented on the FPGA. However, if the additional logic required does not fit on the FPGA or if the existing functionality on the FPGA already contains memory accesses and more memory accesses are not permitted by the architecture, an additional task must be created and accommodated on another FPGA. We use the same example as before, except that along with the reconfiguration cost there is a boolean flag associated with each vertex (Figure 4).
Our algorithm obtains cuts at 3, 5, 6, and 10 of costs 10, 4, 50, and 5, resulting in an optimal cost of 69. Note that the addition of extra task nodes affects the cut location and therefore the overall cost. The optimum cuts would break the task list as can be seen in Figure 5.
The cost function, which is the main modification to the Straight-Cut algorithm, is displayed in Algorithm 4.
The main difference between the dynamic task and the straight cut versions is that configurations are analyzed to determine how many of their end-nodes are marked (i.e., an additional node is required if the cut is at this location).This quantity (0, 1, or 2) is subtracted from the possible number of FPGAs available for a configuration.
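A minimal sketch of this capacity adjustment is given below. It is written under explicit assumptions that the text does not settle: a marked node needs an extra FPGA only when a cut is actually adjacent to it (so the two ends of the whole chain never trigger one), a marked single-task configuration with cuts on both sides needs two, and `cut_cost[p]` is the cost of cutting between tasks `p - 1` and `p` with 0-based numbering. The function name and the data are hypothetical.

```python
def sprm_dynamic(cut_cost, flag, k):
    """Return (minimum cost, cuts) when marked end-nodes consume extra FPGAs."""
    n = len(flag)
    inf = float("inf")
    cost = [inf] * n + [0]          # cost[i]: best cost for subchain i..n-1
    first_cut = [None] * (n + 1)
    for i in range(n - 1, -1, -1):
        for e in range(i, min(n, i + k)):       # configuration holds tasks i..e
            # Extra nodes required by marked tasks at the configuration ends.
            extra = int(i > 0 and flag[i]) + int(e < n - 1 and flag[e])
            if (e - i + 1) + extra > k:
                continue                        # does not fit on k FPGAs
            candidate = 0 if e == n - 1 else cut_cost[e + 1] + cost[e + 1]
            if candidate < cost[i]:
                cost[i] = candidate
                first_cut[i] = None if e == n - 1 else e + 1
    cuts, i = [], 0
    while first_cut[i] is not None:
        cuts.append(first_cut[i])
        i = first_cut[i]
    return cost[0], cuts


costs = [0, 4, 1, 7, 2, 6]
print(sprm_dynamic(costs, [False] * 6, k=3))
print(sprm_dynamic(costs, [False, False, True, True, False, False], k=3))
```

With no nodes marked the sketch reduces to the basic formulation; marking the two middle tasks makes the previously optimal middle configuration infeasible and forces additional cuts, which mirrors the effect described in the text.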

SPRM Formulation 3: Linear Chain with Limited Cycles.
Recall that the partitioning algorithm of [4] partitions a DAG into a chain of $n$ tasks such that the signal flows from left to right. However, this may not be possible or advantageous in some applications and may result in bidirectional edges or cycles as shown in Figure 6.
The interconnect on FPGA boards can be configured to permit bidirectional or cyclic signal flow, so the only limitation is that all tasks involved in this type of signal flow must be placed in the same board configuration. (Otherwise, a task in an earlier board configuration will need data produced by a task in the future.) Although SPRM is designed for a linear chain with a unidirectional signal flow, it can be modified for situations with a limited number of bidirectional edges or cycles. This constraint can be accommodated by assigning a large cost M to each bidirectional edge and to all edges in a cycle as shown in Figure 7. This will discourage our dynamic programming algorithm from cutting these edges, forcing the algorithm to place the relevant tasks on the same board, provided this is feasible.

GPRM
Recall that GPRM is designed for situations where the cost of reconfiguring Configuration $i$ into Configuration $i + 1$ is a function of all tasks assigned to those two configurations, whereas SPRM is designed for situations where the reconfiguration cost is a function of the last task assigned to Configuration $i$ and the first task assigned to Configuration $i + 1$, while ignoring all of the other tasks in the two configurations. Thus GPRM subsumes SPRM and can handle all of the cases discussed in the previous section. In this section, we present six formulations of the GPRM problem. The first three consider scenarios where reconfiguration time can be saved by reusing logic through partial reconfiguration. The fourth and fifth consider dynamic task generation and linear chains with limited cycles (which can be handled by SPRM Formulations 2 and 3, resp.), while the sixth describes how cost metrics can be extended to include execution time.

GPRM Formulation 1: Repeating Tasks.
To illustrate the GPRM strategy, we consider an example consisting of a board with $k = 5$ FPGAs and a task chain with $n = 9$ tasks, some of which are repeated. (Tasks can repeat if the same image processing transformation or the same memory management step is used several times in the computation.) The task chain consists of the tasks A, B, C, C, A, B, D, E, and C. Repeating tasks are represented by using the same letter of the alphabet. Table 4 shows four possible configuration sequences. Within each of the four sequences, a row represents a configuration and indicates which task is placed in each FPGA in that configuration.
To further simplify the presentation, we assume that the cost of reconfiguring an FPGA is 1 or 0 depending on the tasks that are allocated to the FPGA in consecutive configurations. The cost is 0 in the following two cases. (1) The FPGA remains empty in consecutive configurations. (2) The FPGA contains the same task in consecutive configurations. The cost is 1 in the following three cases. (1) The FPGA is empty in one configuration and occupied by a task in the next configuration. (2) The FPGA is occupied in one configuration and empty in the next configuration (reconfiguration is needed to prevent unwanted side effects). (3) The FPGA contains different tasks in the two consecutive configurations. Note that the reconfiguration cost is a function of ALL tasks in consecutive configurations, which is precisely when the GPRM strategy is applicable. Note also that the SPRM strategy would not apply in this scenario.
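The 0/1 cost rule above can be written as a short function. In this sketch a configuration is represented as a tuple with one entry per FPGA, using `None` for an empty FPGA; that representation and the function name are assumptions made for illustration.

```python
def reconfiguration_cost(prev_config, next_config):
    """Count the FPGAs whose contents differ between consecutive configurations.

    An FPGA costs 0 if it stays empty or keeps the same task, and 1 if it is
    loaded, emptied, or given a different task.
    """
    return sum(1 for a, b in zip(prev_config, next_config) if a != b)


# Transition from [1, 11011] to [5, 11111] for the chain A, B, C, C, A, B, D, E, C.
print(reconfiguration_cost(("A", "B", None, "C", "C"), ("A", "B", "D", "E", "C")))
# Transition from [1, 11100] to [4, 00111] for the same chain.
print(reconfiguration_cost(("A", "B", "C", None, None), (None, None, "C", "A", "B")))
```

The cost of the initial configuration can be obtained from the same function by passing an all-empty tuple as the previous configuration.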
Our GPRM strategy consists of modeling the problem as a shortest-path problem in a directed acyclic graph with nonnegative costs assigned to the edges of the DAG. Each node in the graph represents a possible configuration, that is, an assignment of tasks to FPGAs. Nodes of the graph are denoted by $[i, b_1 b_2 \cdots b_k]$, where $i$ denotes the index of the first task included in the configuration and $b_j$, $1 \le j \le k$, is a bit which is set to 1 if FPGA $j$ is occupied by a task and 0 if it is left unoccupied. Thus, we refer to $b_1 b_2 \cdots b_k$ as the occupancy bit string.
The total number of tasks included in a configuration is $\sum_{j=1}^{k} b_j$ and the index $l$ of the last task in the configuration is $i + \sum_{j=1}^{k} b_j - 1$. Note that empty configurations ($\sum_{j=1}^{k} b_j = 0$) or configurations where $l > n$ need not be considered. (Recall that $n$ is the total number of tasks.) Configuration node $[i, b_1 b_2 \cdots b_k]$ has an out-edge to all nodes of the form $[l + 1, *]$, where $l$ denotes the index of the last task node in $[i, b_1 b_2 \cdots b_k]$ and $*$ denotes any occupancy bit pattern that represents a valid configuration. Edges connect every possible pair of consecutive configurations. We assign a cost to each edge which denotes the cost of reconfiguring the FPGA array with the new set of tasks. Since this quantity is a function of the two configurations joined by the edge, it can be easily computed. We also include a dummy source and a dummy sink node to facilitate the shortest-path computation. The source represents an "empty" configuration and has an out-edge to all nodes of the form $[1, *]$. The cost of the out-edge is the cost of initially configuring the FPGA array and is trivially computed. All nodes representing a configuration that contains task $n$ have an out-edge to the sink node of cost 0.
The DAG corresponding to our example is partially shown in Figure 8.
The four configuration sequences of Table 4 correspond to four different paths from source to sink in the DAG. Thus, Solution 1 contains the following sequence of intermediate nodes (using our notation for nodes): ([1, 11100], [4, 00111], [7, 11100]), while Solution 4 contains the sequence ([1, 11011], [5, 11111]). In this notation, the first entry is the starting location within the task string, and the number of entries in the bit string represents the number of FPGAs: a "1" in a particular location signifies a filled FPGA and a "0" signifies an empty FPGA. There are two sequences that give a minimum cost and therefore an optimum solution.
The three steps of the shortest-paths technique are summarized in Algorithm 5.
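The shortest-path construction can be sketched end to end for the repeating-tasks example. The code below is an illustrative reading of the approach, not the paper's Algorithm 5: names are hypothetical, task indices are 0-based (so the paper's node [1, 11011] appears with start 0), the 0/1 per-FPGA cost model of this formulation is assumed, and the initial configuration is charged as a transition from an all-empty board.

```python
from itertools import product


def layout(tasks, start, bits):
    """Place tasks[start:], in chain order, on the FPGAs whose bit is 1."""
    placed, t = [], start
    for b in bits:
        placed.append(tasks[t] if b else None)
        t += b
    return tuple(placed)


def change_cost(prev, nxt):
    """0/1 cost per FPGA: 1 whenever its contents change."""
    return sum(a != b for a, b in zip(prev, nxt))


def gprm(tasks, k):
    """Return (minimum total cost, list of (start, bits) nodes on a best path)."""
    n = len(tasks)
    patterns = [p for p in product((0, 1), repeat=k) if any(p)]
    empty = (None,) * k
    best = {}  # node -> (cheapest cost from the source, predecessor node)
    for start in range(n):  # increasing start index is a topological order
        for bits in patterns:
            if start + sum(bits) > n:
                continue  # configuration would run past the end of the chain
            cur = layout(tasks, start, bits)
            if start == 0:
                best[(0, bits)] = (change_cost(empty, cur), None)
                continue
            for pbits in patterns:  # predecessors end just before `start`
                pred = (start - sum(pbits), pbits)
                if pred not in best:
                    continue
                c = best[pred][0] + change_cost(layout(tasks, *pred), cur)
                if (start, bits) not in best or c < best[(start, bits)][0]:
                    best[(start, bits)] = (c, pred)
    # Nodes containing the last task connect to the sink with cost 0.
    finals = [v for v in best if v[0] + sum(v[1]) == n]
    node = min(finals, key=lambda v: best[v][0])
    total, path = best[node][0], []
    while node is not None:
        path.append(node)
        node = best[node][1]
    return total, path[::-1]


total, path = gprm(list("ABCCABDEC"), k=5)
print(total, path)
```

Edges are generated on the fly from the predecessor's bit pattern rather than stored, so only the per-node table is kept in memory.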

Lemma 3. The total number of configuration nodes for a chain of $n$ tasks on an array of $k$ FPGAs is bounded by $n2^k$.
Proof. The variable $i$ can take on $n$ values and each of the $k$ elements of the occupancy string can take on one of two values. Multiplying these quantities gives the desired result.

Lemma 4. The total number of edges for a chain of $n$ tasks on an array of $k$ FPGAs is bounded by $O(n4^k)$.
Proof. There are $O(n2^k)$ nodes, each of which has at most $2^k$ out-edges, yielding the desired result.
Theorem 5.The shortest-paths in DAG algorithm (Algorithm5) computes an optimal cost solution to GPRM Formulation 1.
Proof.Note that the only constraint is that the chain of tasks must be executed in order.Any path from source to sink in our dag gives a valid sequence of configurations since nodes are only joined by edges if they contain consecutive task sequences.Similarly, any valid sequence of configurations is represented by a path in our graph since each valid configuration is represented by a node and since edges are constructed for all possible pairs of consecutive configurations.The total cost for a path is the sum of the cost of the edges, which represent individual reconfiguration costs.Consequently, the shortest-path from source to sink gives an optimal-cost sequence of configurations.Theorem 6. GPRM requires (4  ) time and (2  ) space.
Proof. The number of nodes in the DAG is $O(n2^k)$ and the number of edges is $O(n4^k)$. Edge costs are computed in $O(k)$ time per edge since a pair of tasks needs to be examined for each of the $k$ FPGAs on the board. Thus, the total time for graph creation is $O(nk4^k)$. Since the graph is a DAG, a technique based on topological ordering can be used to compute the shortest source-sink path in $O(n4^k)$ time. Thus, the total time complexity is $O(nk4^k)$. The space required by the graph is proportional to the number of edges and is $O(n4^k)$. Next, we propose a data structure to represent the DAG that reduces the memory requirements of GPRM to $O(n2^k)$. We use an $n \times 2^k$ two-dimensional array to represent the nodes of the DAG. The element in position $(i, j)$ corresponds to the node $[i, b_1 b_2 \cdots b_k]$, where the bit string $b_1 b_2 \cdots b_k$ is the binary representation of $j$. Note that it is not necessary to explicitly maintain the edges leading out of the node corresponding to element $(i, j)$. These edges can be implicitly computed during the topological traversal of the DAG, as there will be an outgoing edge to each of the (at most) $2^k$ nodes of the form $(i + \sum_{l=1}^{k} b_l, *)$. This reduces the storage complexity to $O(n2^k)$.
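The implicit-edge traversal described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' C implementation: the function names (`assign`, `edge_cost`, `gprm`) are hypothetical, tasks are given as a string of type labels, and the edge cost charges one unit for every FPGA whose contents change (following the convention of Table 4, extended by the assumption that replacing one task by a different task also costs one unit).

```python
def assign(start, mask, k):
    """Return, for each of the k FPGAs, the 0-based task index it holds (or None)."""
    slots, task = [], start
    for j in range(k):
        if (mask >> (k - 1 - j)) & 1:      # leftmost bit corresponds to FPGA 1
            slots.append(task)
            task += 1
        else:
            slots.append(None)
    return tuple(slots)

def edge_cost(prev, nxt, tasks):
    """Assumed cost model: one unit per FPGA whose contents change."""
    label = lambda s: None if s is None else tasks[s]
    return sum(label(a) != label(b) for a, b in zip(prev, nxt))

def gprm(tasks, k):
    """Shortest source-sink path over nodes [i, mask]; edges are never stored."""
    n, INF = len(tasks), float("inf")
    ones = lambda m: bin(m).count("1")
    best = [[INF] * (1 << k) for _ in range(n)]   # the n x 2^k node array
    parent = {}
    empty = (None,) * k
    for mask in range(1, 1 << k):                 # edges leaving the source
        if ones(mask) <= n:
            best[0][mask] = edge_cost(empty, assign(0, mask, k), tasks)
    answer, last = INF, None
    for i in range(n):                            # increasing i is a topological order
        for mask in range(1, 1 << k):
            c = best[i][mask]
            if c == INF:
                continue
            nxt_start = i + ones(mask)
            if nxt_start == n:                    # node contains the last task: 0-cost edge to sink
                if c < answer:
                    answer, last = c, (i, mask)
                continue
            cur = assign(i, mask, k)
            for m2 in range(1, 1 << k):           # implicit out-edges to (nxt_start, *)
                if nxt_start + ones(m2) > n:
                    continue
                w = c + edge_cost(cur, assign(nxt_start, m2, k), tasks)
                if w < best[nxt_start][m2]:
                    best[nxt_start][m2] = w
                    parent[(nxt_start, m2)] = (i, mask)
    path, node = [], last
    while node is not None:
        path.append(node)
        node = parent.get(node)
    path.reverse()
    return answer, [(i + 1, format(m, f"0{k}b")) for i, m in path]

# Four distinct tasks on two FPGAs: every task must be loaded once, so the cost is 4.
print(gprm("ABCD", 2))
```

Only the node array `best` (and a parent pointer per node for path recovery) is kept in memory, which mirrors the $O(n2^k)$ storage argument; the inner loop over `m2` regenerates the out-edges on demand.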
An anonymous referee asked us to consider how we might modify GPRM when the FPGA interconnection network permits signals between any pair of FPGAs, in which case the linear task order constraint assumed in Table 1 need not be satisfied. While this is beyond the scope of the current paper, we briefly mention how this could be achieved.
Our notation of [4, 01110] currently means that Tasks 4, 5, and 6 must be assigned to FPGAs 2, 3, and 4, respectively. Under the referee's assumption, any permutation of the three tasks would also be acceptable, leading to 3! = 6 configurations. The notation could be modified by replacing the "1"s in the bit string with the appropriate permutation information. Thus, [4, 01230] would mean that Tasks 4, 5, and 6 are placed in FPGAs 2, 3, and 4, respectively, as before, while [4, 02310] would mean that Tasks 4, 5, and 6 are placed in FPGAs 4, 2, and 3, respectively. In the worst case, this will expand the number of nodes in the DAG by a factor of $k!$ and the number of edges by a factor of $(k!)^2$.
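To make the extended notation concrete, the following short Python sketch decodes a permutation-labelled node into a task-to-FPGA placement. The function name `decode` and the single-digit string encoding (which limits the sketch to at most nine FPGAs) are assumptions made for illustration only.

```python
def decode(start, pattern):
    """Map each task index to its FPGA for a node such as [4, "02310"].

    A digit d > 0 at position p means that the d-th task of the
    configuration (task start + d - 1) is placed in FPGA p.
    """
    placement = {}
    for fpga, digit in enumerate(pattern, start=1):
        if digit != "0":
            placement[start + int(digit) - 1] = fpga
    return dict(sorted(placement.items()))

print(decode(4, "01230"))  # {4: 2, 5: 3, 6: 4}
print(decode(4, "02310"))  # {4: 4, 5: 2, 6: 3}
```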

GPRM Formulation 2: Similar Tasks.
Formulation 1 assigned a reconfiguration time of 0 if the identical task was allocated to an FPGA in consecutive configurations. In Formulation 2, we consider the scenario where two tasks are not identical but are similar, because some subtasks are identical (Figure 9). Specifically, if two tasks located in the same FPGA in consecutive configurations have identical subtasks in the same subareas of the FPGA, it may be possible to obtain the second configuration from the first by partially reconfiguring the FPGA. This would consume less time than a full reconfiguration and can be modeled by the DAG construction described in the previous section by simply modifying edge costs before executing the shortest-path algorithm. Note that actual reconfiguration costs can be modeled fairly accurately using approaches discussed in [8-10].
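A modified edge cost of this kind might look as follows. This is a minimal sketch, assuming that each task is described by a tuple of subtask labels, one per equally sized subarea of the FPGA, and that the partial reconfiguration cost is simply the fraction of subareas that differ; the names `fpga_cost` and `edge_cost` and the example layouts are hypothetical.

```python
def fpga_cost(old, new):
    """Cost of turning one FPGA's contents from `old` into `new`.

    `old` and `new` are tuples of subtask labels (one per subarea), or None
    for an empty FPGA. Assumed model: loading or emptying costs 1, otherwise
    the cost is the fraction of subareas that must be rewritten.
    """
    if old is None and new is None:
        return 0.0
    if old is None or new is None:
        return 1.0
    changed = sum(a != b for a, b in zip(old, new))
    return changed / len(new)

def edge_cost(prev_cfg, next_cfg):
    """Edge cost between two board configurations (one entry per FPGA)."""
    return sum(fpga_cost(o, n) for o, n in zip(prev_cfg, next_cfg))

# One shared subtask in the same subarea (as in Figure 9): 25% similarity.
print(fpga_cost(("A", "B", "C", "D"), ("E", "F", "G", "D")))  # 0.75
```

Only the edge weights change; the node set and the shortest-path computation are the same as in Formulation 1.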

GPRM Formulation 3: Implementation Libraries.
A task can be implemented in several ways in an FPGA depending on how its subtasks are laid out within the FPGA. This gives rise to several implementations of a task that can be stored in a library. An intelligent scheduler would choose task implementations from the library that minimize reconfiguration cost. We illustrate these ideas in Figure 10. Suppose that an FPGA originally contains a task consisting of subtasks {A, B, C, D}. Suppose this FPGA is to be reconfigured with a task consisting of subtasks {C, D, E, F}. The figure shows three implementations of this task. Clearly, the choice of an implementation affects the reconfiguration cost. For example, implementation 1 results in a 0% match (and therefore a complete reconfiguration), implementation 2 results in a 25% match (and therefore a partial reconfiguration cost of 75%), and implementation 3 results in a 50% match (and therefore a partial reconfiguration cost of 50%). Assume that the maximum number of implementations of any task that are available in an implementation library is $m$. Then, we can construct a graph consisting of nodes of the form $[i, b_1 b_2 \cdots b_k]$, where $i$ denotes the index of the first task included in the configuration and $b_j$, $1 \le j \le k$, is an integer which is set to 0 if FPGA $j$ is unoccupied or to $l$, $1 \le l \le m$, if FPGA $j$ is occupied by implementation number $l$ of a task.
Configuration node $[i, b_1 b_2 \cdots b_k]$ has an out-edge to all nodes of the form $[t + 1, *]$, where $t$ denotes the index of the last task in $[i, b_1 b_2 \cdots b_k]$ and $*$ denotes any occupancy pattern that represents a valid configuration. As before, dummy source and sink nodes are added to the graph. The source has an out-edge to all nodes of the form $[1, *]$. All nodes representing a configuration that contains task $n$ have an out-edge of cost 0 to the sink node.
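The effect of the implementation choice on a single edge cost can be illustrated with a small Python sketch. The layouts below are hypothetical stand-ins chosen to reproduce the 0%, 25%, and 50% matches quoted for Figure 10 (the actual layouts in the figure are not reproduced here), and the names `match_fraction` and `reconfig_costs` are assumptions.

```python
def match_fraction(resident, candidate):
    """Fraction of subareas whose subtask is already in place."""
    same = sum(a == b for a, b in zip(resident, candidate))
    return same / len(candidate)

def reconfig_costs(resident, library):
    """Partial reconfiguration cost of each library implementation."""
    return {impl: 1.0 - match_fraction(resident, layout)
            for impl, layout in library.items()}

resident = ("A", "B", "C", "D")        # task currently on the FPGA
library = {                            # hypothetical layouts of the task {C, D, E, F}
    1: ("C", "D", "E", "F"),           # no subtask stays in place
    2: ("E", "D", "C", "F"),           # C stays in place
    3: ("E", "F", "C", "D"),           # C and D stay in place
}
print(reconfig_costs(resident, library))  # {1: 1.0, 2: 0.75, 3: 0.5}
```

In the graph formulation, these values become the weights of the edges entering the nodes that use the corresponding implementation numbers; the shortest-path computation, rather than a per-FPGA greedy choice, then decides which implementation appears in the schedule.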
As before, the shortest path from source to sink results in an optimal configuration schedule.
Lemma 7. The total number of nodes in the graph is bounded by $n(m + 1)^k$.
Proof. The starting task index can take on $n$ values and each of the $k$ elements of the occupancy string can take on one of $m + 1$ values (0, or one of at most $m$ implementation numbers). Multiplying these quantities gives the desired result.
Lemma 8. The total number of edges for a chain of $n$ tasks on an array of $k$ FPGAs is bounded by $O(n(m + 1)^{2k})$.

GPRM Formulation 4: Dynamic Task Generation.
Recall that additional functionality may be required in the first and last FPGAs to, respectively, read in and write out data in each configuration. This additional functionality may not fit in an FPGA when combined with the task assigned to that FPGA from the task chain. Recall also that this was addressed by SPRM Formulation 2. Since GPRM is a generalization of SPRM, we expect to be able to capture this formulation within the framework of GPRM. The GPRM algorithm can be modified by simply deleting all nodes (and incident edges) in the graph that contain tasks in the first or last FPGA that cannot accommodate the additional functionality. A shortest-path computation on the remainder of the graph solves this formulation.
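The node-deletion step can be expressed as a simple filter on configuration nodes. The sketch below is illustrative only: `node_allowed` is a hypothetical helper, nodes use the bit-string notation of Formulation 1, and `flagged` is assumed to be the set of task indices whose Boolean flag indicates that they cannot share an FPGA with the additional I/O functionality.

```python
def node_allowed(start, bits, flagged):
    """Return False if a flagged task would land in the first or last FPGA.

    start   -- index of the first task in the configuration
    bits    -- occupancy string such as "00111" (one character per FPGA)
    flagged -- set of task indices that do not fit next to the I/O logic
    """
    task = start
    last = len(bits) - 1
    for pos, bit in enumerate(bits):
        if bit == "1":
            if pos in (0, last) and task in flagged:
                return False
            task += 1
    return True

print(node_allowed(4, "00111", {6}))  # False: Task 6 would occupy the last FPGA
print(node_allowed(4, "01110", {6}))  # True: the first and last FPGAs stay empty
```

Nodes for which the filter returns `False` are skipped during the traversal, which has the same effect as deleting them and their incident edges.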

GPRM Formulation 5: Linear Chain with Limited Cycles.
We again consider the situation described in SPRM Formulation 3 and depicted in Figure 7 (task chain with a limited number of bidirectional edges and small cycles). In SPRM, this was addressed by imposing a large cost on bidirectional edges or between adjacent tasks that were part of a cycle. This was done to prevent the scenario where a board configuration requires the results of a future configuration. The same approach can be used in GPRM: if there is a pair of consecutive configurations (represented by vertices joined by a directed edge in the shortest-path graph) such that the output of a task in the later configuration is the input of a task in the earlier configuration, the cost of the directed edge connecting the two configurations is made arbitrarily large. The shortest-path algorithm used to compute the optimal schedule will then find a path that (implicitly) ignores the high-cost edge.
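As a rough illustration of this penalty, the following Python sketch adds a large constant to the cost of any edge whose two configurations are separated by a backward dependency. The constant `BIG_M`, the function name, and the representation of backward dependencies as (producer, consumer) pairs are all assumptions made for this example.

```python
BIG_M = 10**9  # stands in for an "arbitrarily large" cost

def penalized_cost(base_cost, earlier_tasks, later_tasks, feedback):
    """Edge cost between two consecutive configurations.

    feedback -- set of (producer, consumer) task pairs where the producer's
                output is an input of the consumer and producer > consumer
    """
    for producer, consumer in feedback:
        if producer in later_tasks and consumer in earlier_tasks:
            return base_cost + BIG_M   # the cut would separate a cycle
    return base_cost

feedback = {(5, 4)}                    # Task 5 feeds back into Task 4
print(penalized_cost(3, {3, 4}, {5, 6}, feedback) >= BIG_M)  # True
print(penalized_cost(3, {2, 3}, {4, 5}, feedback))           # 3
```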

GPRM Formulation 6: Alternative Cost Functions.
Although the preceding discussion has focused on minimizing reconfiguration time, our model can handle alternative cost formulations that (1) include execution time and (2) account for systems in which reconfiguration of the FPGAs may be sequential or parallel. To make this concrete, we consider the two scenarios shown in Figure 11. In both scenarios, each of the four FPGA tasks is decomposed into a reconfiguration component ($R$) and an execution component ($E$). An arrow from the reconfiguration component to the execution component of each task represents the dependency resulting from the observation that a task can only be executed after it is configured. In (a), the four reconfigurations are assumed to be sequential; this is denoted by the arrows from $R_1$ to $R_2$ to $R_3$ to $R_4$. In (b), the four reconfigurations are assumed to be carried out in parallel. Thus, there are no arrows among $R_1$, $R_2$, $R_3$, and $R_4$. In both cases we assume that there is a sequential dependency between the executions (so there are arrows from $E_1$ to $E_2$ to $E_3$ to $E_4$).
Example. Assume that reconfiguration times $R_1$, $R_2$, $R_3$, and $R_4$ are 100 units each and that execution times $E_1$, $E_2$, $E_3$, and $E_4$ are 5, 7, 3, and 10 units, respectively. Since execution times are smaller than reconfiguration times, the total cost of the formulation in Figure 11(a) is 400 (4 sequential reconfigurations) + 10 (execution time $E_4$) = 410. The cost of the formulation in Figure 11(b) is 100 (to configure the FPGAs in parallel) + 25 (the sum of the execution times) = 125. More generally, the formal cost function that captures the dependency in Figure 11(b) is $\max(\max(\max(R_1 + E_1, R_2) + E_2, R_3) + E_3, R_4) + E_4$. Substituting, we can confirm that this evaluates to $\max(\max(\max(100 + 5, 100) + 7, 100) + 3, 100) + 10 = 125$. The formal cost function for Figure 11(a) can be written analogously. Both of these are functions of the tasks in the board configuration and fall under the framework of the GPRM model. In the GPRM shortest-path formulation, incoming edges to the vertex corresponding to this board configuration can be assigned costs computed as above. Other complex cost functions involving reconfiguration and execution times can similarly be captured by our cost model.
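Both cost functions can be checked numerically with a short Python sketch. The parallel version is a direct transcription of the nested-max expression above. The sequential version is our own reading of the dependencies in Figure 11(a), in which each reconfiguration becomes ready only after all earlier reconfigurations have finished; it is an assumption rather than a formula quoted from the paper. The function names are hypothetical.

```python
from itertools import accumulate

def parallel_cost(R, E):
    """Completion time when all FPGAs are reconfigured in parallel."""
    finish = 0
    for r, e in zip(R, E):
        # A task starts once it is configured and its predecessor has finished.
        finish = max(finish, r) + e
    return finish

def sequential_cost(R, E):
    """Completion time when the FPGAs are reconfigured one after another."""
    finish = 0
    for ready, e in zip(accumulate(R), E):
        # `ready` is the time at which this FPGA's reconfiguration completes.
        finish = max(finish, ready) + e
    return finish

R = [100, 100, 100, 100]
E = [5, 7, 3, 10]
print(sequential_cost(R, E))  # 410
print(parallel_cost(R, E))    # 125
```

Either function could be used to compute the cost assigned to the incoming edges of a configuration node, since both depend only on the tasks in that configuration.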

Detailed SPRM and GPRM Metaheuristics
We have previously shown a high-level metaheuristic in Algorithm 1 that uses the cost model to determine whether to use SPRM or GPRM. We have also argued that, since the GPRM cost model subsumes the SPRM cost model, any formulation that can be addressed by SPRM can also be solved using GPRM. However, SPRM is faster, making it the better choice when either approach can be used.
For ease of presentation, we described three SPRM and six GPRM formulations separately in the preceding sections.However, in some cases these formulations can be combined with each other in complex ways.This is illustrated in Algorithms 6 and 7.
Finally, we note that although we have only proved optimality for SPRM Formulation 1 and GPRM Formulation 1, all of the scenarios described in the previous sections including the combinations described in Algorithms 6 and 7 can be solved optimally by using appropriate extensions of SPRM and GPRM.

Experimental Results
All algorithms were implemented in the C programming language on a 500 MHz Dell OptiPlex GX1 with 256 MB of RAM. The automatic target recognition (ATR) algorithm that was implemented on CHAMPION was the only application that required board reconfigurations. Our algorithms obtained an identical (optimal) solution to that of the manual cut.
In addition to this, we designed some examples to "stress-test" our code. The results are presented below. The results for GPRM Formulation 1 (repeated tasks) with cost functions as defined in Section 5.1 are shown in Table 5. Elapsed time was reported in seconds using the clock() function. A naive heuristic that was used for comparison purposes simply made cuts at regular intervals of $k$ tasks along the task chain. Note that in all cases the runtime was not more than a few seconds.
GPRM Meta Algorithm
(1) if (Implementation Libraries Used)
(2)   then DAG Structure = Basic DAG Structure defined in GPRM 3
(3)   else DAG Structure = Basic DAG Structure defined in GPRM 1
(4) if (Dynamic Task Generation Needed)
(5)   then Update DAG Structure as described in GPRM 4
(6) Cost Model = SPRM 1 Cost Model // baseline cost model
(7) if (Repeated Tasks Present)
(8)   then Update Cost Model as described in GPRM 1
(9) if (Similar Tasks Present)
(10)   then Update Cost Model as described in GPRM 2
(11) if (Limited Cycles Present)
(12)   then Update Cost Model as described in GPRM 5
(13) if (Alternative Cost Functions Needed)
(14)   then Modify Cost Model as described in GPRM 6
(15) Run Shortest Paths Algorithm.
end
Algorithm 7: Detailed GPRM metaheuristic: this elaborates on the GPRM outline provided in Algorithm 5 by first determining the appropriate DAG structure. It then computes the correct cost model, which can be a combination of any of the cost models used in SPRM 1, GPRM 1, GPRM 2, GPRM 5, and GPRM 6.

The naive heuristic was modified to add task nodes depending on the values of the Boolean flag associated with each node. (If the flag is true, the task is not permitted to be placed in the first or last FPGA; instead it is placed in the next available FPGA.) (An anonymous reviewer suggested using a greedy heuristic for the SPRM problem formulations that would traverse the task chain, greedily making a cut at the smallest-cost arc from among the next $k$ arcs at each stage. We expect that this will perform much better than the naive heuristic. However, we note that this greedy approach cannot guarantee optimal costs and, moreover, will have essentially the same linear runtime as our dynamic programming algorithm for small constant $k$.)
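The difference between the naive heuristic, the reviewer's greedy heuristic, and dynamic programming can be seen on a toy instance. The Python sketch below is an illustration under stated assumptions, not the code used in the experiments: it assumes the basic SPRM cost model in which `c[j]` is the cost of cutting between consecutive tasks `j` and `j + 1` (0-based), each configuration holds at most `k` tasks, and the total cost is the sum of the costs of the chosen cuts. All function names and the arc costs are hypothetical.

```python
def naive_cost(c, k):
    """Cut after every k-th task."""
    n = len(c) + 1
    return sum(c[a] for a in range(k - 1, n - 1, k))

def greedy_cost(c, k):
    """Repeatedly cut at the cheapest of the next k arcs."""
    n, j, total = len(c) + 1, 0, 0
    while n - j > k:                       # remaining tasks do not fit on one board
        arc = min(range(j, j + k), key=lambda a: c[a])
        total += c[arc]
        j = arc + 1                        # next configuration starts after the cut
    return total

def dp_cost(c, k):
    """Optimal total cut cost by dynamic programming over start positions."""
    n = len(c) + 1
    cost = [0] * (n + 1)                   # cost[j] = 0 when tasks j..n-1 fit on one board
    for j in range(n - k - 1, -1, -1):
        # Try every configuration length L = 1..k starting at task j.
        cost[j] = min(c[j + L - 1] + cost[j + L] for L in range(1, k + 1))
    return cost[0]

c = [5, 1, 9, 6, 7, 2, 6]                  # arc costs for a chain of 8 tasks
print(naive_cost(c, 3), greedy_cost(c, 3), dp_cost(c, 3))  # 11 9 8
```

On this instance the greedy heuristic improves on the naive one but still misses the optimum, which is consistent with the observation that greedy cuts cannot guarantee optimal costs.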

Discussion and Conclusion
This paper presented a high-level meta-algorithm and detailed meta-algorithms for SPRM and GPRM for scheduling FPGA board configurations. This is a powerful strategy that encapsulates a wide range of problem formulations (including combinations of problem formulations) and algorithms into a single framework. The deterministic quality of this meta-algorithm and the guarantee of optimal solutions for all of the formulations discussed make this a viable alternative to traditional stochastic meta-algorithms such as simulated annealing and genetic algorithms.

Figure 2: Initial task list with reconfiguration costs.

Figure 3: Task list with optimum cuts.
Algorithm 3: Output-Cuts algorithm.

Also, observe that in the computation of each cost entry, $k - 1$ of these quantities are identical to those used for the next entry. Thus, it is possible to use a min-heap data structure in combination with a queue of pointers to elements in the min-heap to (1) compute the minimum, (2) delete one element, and (3) insert one element per iteration. Each of these can be accomplished in $O(\log k)$ time, giving the result.

Figure 4: Initial task list with reconfiguration costs and boolean flag.

Figure 6: Allocating multiple DAG nodes to a task can lead to a bidirectional edge: (a) original DAG, (b) allocation of DAG nodes to FPGA tasks, (c) bidirectional edge resulting from node merging.

Figure 7: (a) task chain with one cycle and one bidirectional edge.(b) Cost M assigned to each bidirectional or cycle edge.

Figure 9: Similar Tasks Illustration: subtask D is repeated, giving a 25% similarity or a 0.75 partial reconfiguration cost instead of a full reconfiguration cost of 1.

Figure 11: Both scenarios include execution time.(a) assumes sequential reconfiguration and (b) assumes parallel reconfiguration.

Table 1: Two-dimensional array $T$ representing the assignment of tasks to FPGAs and board configurations. Each row represents a board configuration and each column represents an FPGA. (Thus, Task 5 is assigned to FPGA 3 in Configuration 2.) Note that tasks are listed in row-major order but are not necessarily consecutive. Note also that $r$, the number of temporal board configurations, is not known a priori.
(1) The cost to reconfigure from board Configuration $i$ to board Configuration $i + 1$ is purely a function $f$ of the last task, denoted $l_i$, in Configuration $i$ (the highest numbered task appearing in Row $i$ of the output table) and the first task in Configuration $i + 1$ (the lowest numbered task appearing in Row $i + 1$ of the output table), which is $l_i + 1$. In the example, the cost to reconfigure from Configuration 1 to Configuration 2 would be $f(3, 4)$ because Task 3 is the last task in Configuration 1 and Task 4 is the first task in Configuration 2. In this restricted model, the other tasks allocated to the configurations, such as Task 2 in Configuration 1 and Task 5 in Configuration 2, are assumed to have no bearing on the cost. This scenario arises when reconfiguration time is dominated by the time needed to store intermediate results between two adjacent configurations. The total cost is formally defined as $\sum_{i=1}^{r-1} f(l_i, l_i + 1)$.
(2) The cost to reconfigure from Configuration $i$ to Configuration $i + 1$ is a function $g$ of all of the tasks assigned to Configurations $i$ and $i + 1$. This captures the scenario when CLB reconfiguration dominates reconfiguration time. In this scenario, repeated tasks placed on the same FPGA in consecutive configurations result in significant savings. The total cost is formally defined as $\sum_{i=1}^{r-1} g(T[i, \cdot], T[i + 1, \cdot])$, where $T[i, \cdot]$ denotes all tasks in Configuration $i$ (i.e., Row $i$ in the output table).

Table 2: Contents of array cost on completion of algorithm.

Table 3: Contents of array first cut on completion of algorithm.

Table 4: Four possible solutions to the GPRM sample problem. The first row shows the initially empty configurations. The second row shows the FPGAs with their initially loaded tasks (with the cost of loading included in the cost column in Row 2). We describe the example in detail for Solution 1: the cost of Row 2 is 3 because 3 of the 5 empty FPGAs are loaded with tasks while the other 2 remain empty. The cost of reconfiguring Row 2 into Row 3 is 4 (four FPGAs go from loaded to empty or vice versa while FPGA 3 retains the same task C). The cost of reconfiguring Row 3 into Row 4 is 4 (again four FPGAs go from loaded to empty or vice versa while FPGA 3 continues to retain task C). We ignore the cost of emptying out the FPGAs at the end, giving a total cost of 3 + 4 + 4 = 11. Solutions 3 and 4 are the best, each with a total cost of 6.