An FPGA Task Placement Algorithm Using Reflected Binary Gray Space Filling Curve

With the arrival of partial reconfiguration technology, modern FPGAs support tasks that can be loaded into (or removed from) the FPGA individually without interrupting other tasks already running on the same FPGA. Many online task placement algorithms designed for such partially reconfigurable systems have been proposed to provide efficient and fast task placement. This paper presents a new approach for online placement of modules on reconfigurable devices that manages the free space using a run-length based representation. This representation allows the algorithm to insert or delete tasks quickly and to calculate the fragmentation easily. In the proposed FPGA model, the CLBs are numbered according to the reflected binary Gray space filling curve. The search algorithm quickly identifies a placement for the incoming task using either a first fit mode or a fragmentation-aware best fit mode. Simulation experiments indicate that the proposed techniques result in a low task rejection ratio and high FPGA utilization compared to existing techniques.


Introduction
Reconfigurable devices with partial reconfiguration capabilities allow multitasking applications on a single chip. Embedded applications such as cryptography, video communication, and image processing can exploit this capability. Efficient placement and scheduling algorithms can improve FPGA resource utilization and the overall execution time of applications.
One of the most interesting problems is deciding where to locate the bitmap of a new task in the FPGA when it must be run. A data structure is required to keep track of the available free area, and the algorithm must find the best location for the arriving task while using the reconfigurable area as efficiently as possible. In an online placement system, the dynamic addition and deletion of tasks causes the empty area of the FPGA to become highly fragmented, so the FPGA area cannot be utilized efficiently.
In this paper, a new data structure based on one-dimensional run-length encoding is developed to manage the empty area. Using this data structure, the placement algorithm can quickly find a suitable location for the incoming task. A new fragmentation metric gives an indication of the continuity of free space. The FPGA surface is modeled by a matrix coded according to the reflected binary Gray curve. The results show significant improvement over placement using well-known algorithms such as bottom left, 2D adjacency based placement, the least interference fit technique, and the CLook algorithm.
This paper is organized as follows. Section 2 presents an overview of the problem of scheduling and placement in dynamic reconfigurable devices. A brief review of various placement and scheduling techniques is given in Section 3. In Section 4, a new technique called reflected binary Gray curve based placement is proposed. Section 5 describes the experimental setup used for performance analysis. Results for average device utilization, task rejection ratio, average task waiting time, and related metrics are discussed in Section 6. Finally, conclusions are presented in Section 7.

Problem of Scheduling and Placement in Dynamic Reconfigurable Devices
The proposed online placement system model consists of a host CPU and a partially reconfigurable FPGA. The reconfigurable resources on the FPGA are a set of CLBs organized in a two-dimensional array. The placement module running on the host CPU consists of a scheduler, a placer, and a loader. The scheduler determines which of the tasks in the module library should be loaded and executed next. The placer manages the free space and finds the best placement for the task. The loader loads the configuration data of tasks into the FPGA. When a task is completed, the resources occupied by it are released.
The system assumes that the tasks arrive online. As long as free area is available in the FPGA, the incoming task is placed in an unoccupied area on the FPGA. If there is no free space and the task cannot be delayed, then the task is rejected. A good placement algorithm should reduce the rejection rate.
The tasks are nonpreemptive. Once a task is loaded into the FPGA, it runs to termination. The tasks are independent, without any precedence constraints. The task parameters are defined as follows: for a task $T_i = (h_i, w_i, x_i, y_i, a_i, e_i, d_i)$, $h_i$ and $w_i$ represent its height and width, respectively, measured in number of cells, and $a_i$, $e_i$, and $d_i$ are the task arrival time, execution time, and deadline. The rectangular area is assigned to the task by its top left corner $(x_i, y_i)$, where $x_i$ is the row number and $y_i$ is the column number. The size, arrival time, execution time, and deadline are uniformly distributed in a predefined region and are not known a priori.

Related Works
Bazargan et al. [1] proposed an algorithm for managing free space by keeping track of nonoverlapping rectangles. The main disadvantage is that the number of empty rectangles increases quickly as more tasks are inserted. This can lead to a task being rejected even though there is adequate space to accommodate it, because that space is divided between two nonoverlapping rectangles. To solve this problem, they presented the idea of allowing the empty rectangles to overlap, specifically by tracking overlapping maximal empty rectangles (MERs). For $n$ tasks, there can be $O(n)$ nonoverlapping rectangles and, in the case of MERs, $O(n^2)$ rectangles.
Walder et al. [2] proposed three partition algorithms based on the Bazargan method: enhanced Bazargan, on the fly, and enhanced on the fly. The third is based on a 2D hashing table to find a feasible task placement with a run time complexity of $O(1)$, but they did not account for the reconfiguration time or for the time needed to update the hashing table.
Ahmadinia et al. [3][4][5][6] proposed a horizontal line algorithm in which two horizontal lines are used: one above and another below the placed tasks. They also presented a free space management scheme based on the contour of a union of rectangles. Handa and Vemuri [7][8][9] proposed a staircase algorithm for finding the maximal empty rectangles. The bottleneck is the time for constructing the staircase and finding the MERs. Tabero et al. [10][11][12] used vertex lists to store free space, where each vertex is a possible location for an input task. Tomono et al. [13] proposed a method in which module connectivity to the remainder of the system is taken into account. Jin et al. [14] proposed a set of algorithms called the scan line algorithm, but finding the maximum key elements and the MERs is time consuming. Marconi et al. [15,16] proposed an intelligent merging technique to speed up the Bazargan algorithm without losing its placement quality. It is a combination of three techniques selected based on the task characteristics: merge only if needed, partial merging, and direct combine. Deng et al. [17] proposed the 2D and 3D adjacency methods, which pack tasks densely. Lee et al. [18,19] proposed the CLook and CSAF methods as well as a multistrategy fit algorithm. Bassiri and Shahhoseini [20] considered reconfiguration time by classifying tasks as significant or nonsignificant. Steiger et al. [21][22][23] proposed stuffing techniques for combined placement and scheduling. Belaid et al. [24] proposed an offline algorithm for placement of tasks. ELfarag et al. [25] and Esmaeildoust et al. [26] proposed various fragmentation-aware techniques. Lu et al. [27,28] proposed a flow scan algorithm for placement of online tasks.

Proposed Work
The proposed work is based on a novel representation for vacant space inside the FPGA. A data structure called the run-length matrix is introduced to describe the FPGA area. The run-length representation consists of a list of tuples. Each tuple $(s, l)$ indicates an empty slot, where $s$ and $l$ are the starting location and size of the empty slot, respectively. In Figure 1, the area inside the dark shaded box indicates the tasks already placed. In this figure, three tasks $T_1$, $T_2$, and $T_3$ are already placed at locations 12, 54, and 24, respectively, on an FPGA of size 8 × 8. The remaining free area can be described using the free space run-length matrix FRL = {(0, 12), (16, 8), (32, 16), (56, 8)}. This representation is possible because the FPGA cells are labeled using the reflected binary Gray space filling curve. This space filling curve has excellent spatial locality. Therefore, when the two-dimensional array is mapped into a one-dimensional array, the run-length representation is very compact. Moreover, the size of the run-length list depends only on the fragmentation level; it is independent of the size of the FPGA and of the number of tasks running.
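The labeling and the free space run-length list can be illustrated with a short sketch. The labeling rule used here (Gray-encode the row and column indices, interleave their bits, then Gray-decode the result) is an assumption chosen because it reproduces the labels and the FRL quoted above; the function names, the row/column bit order, and the task rectangles are hypothetical and are not taken from the paper.

```python
def gray_encode(n):
    """Binary number -> reflected binary Gray code."""
    return n ^ (n >> 1)


def gray_decode(g):
    """Reflected binary Gray code -> binary number."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def cell_label(row, col, bits):
    """Label of cell (row, col) on a 2**bits x 2**bits grid.

    Assumed convention: the row supplies the more significant bit of
    each interleaved pair. The opposite choice is equally plausible.
    """
    gr, gc = gray_encode(row), gray_encode(col)
    z = 0
    for i in range(bits):
        z |= ((gr >> i) & 1) << (2 * i + 1)
        z |= ((gc >> i) & 1) << (2 * i)
    return gray_decode(z)


def free_run_lengths(occupied, total):
    """Scan labels 0..total-1 and return free runs as (start, size)."""
    runs, start = [], None
    for cell in range(total):
        if cell not in occupied:
            if start is None:
                start = cell
        elif start is not None:
            runs.append((start, cell - start))
            start = None
    if start is not None:
        runs.append((start, total - start))
    return runs


BITS = 3  # 8 x 8 FPGA
# Hypothetical task rectangles (row, col, height, width) whose top left
# cells carry the labels 12, 54, and 24 under the assumed convention.
tasks = [(2, 0, 4, 2), (4, 2, 2, 2), (0, 4, 2, 4)]
occupied = {
    cell_label(r, c, BITS)
    for (r0, c0, h, w) in tasks
    for r in range(r0, r0 + h)
    for c in range(c0, c0 + w)
}
frl = free_run_lengths(occupied, 4 ** BITS)
print(frl)  # [(0, 12), (16, 8), (32, 16), (56, 8)]
```

A linear scan is used only for clarity; the point of the representation is that the list itself stays short when the free space is well clustered.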
The width and height of the incoming task are assumed to be even. The algorithm first scans the run-length list and identifies probable candidates for placement. For example, if the incoming task size is 4 × 4, it first searches the run-length list for a vacant space of 16 or more cells. The idea is that a 4 × 4 task placed at this location will occupy a single contiguous region. If it is not able to find such a location, it tries to obtain a location which is a multiple of 8, so that the selected region can be represented by two runs of 8 cells which are adjacent in 2D, and so on. To avoid checking the same place repeatedly for the same task, the checked locations are stored in a list. For each probable location, the algorithm extracts a region of width and height equal to those of the incoming task (in this example 4 × 4). The region can be slid in the horizontal and vertical directions to obtain other possible locations. The extracted regions are analyzed to check whether they are vacant. In the above example, the algorithm finds two positions for placing the incoming task, shown as A and B in Figure 1. Placing at A requires the vacant space TRL = {(32, 16)}, and placing at B requires the vacant space TRL = {(16, 8), (40, 8)}. Based on the resulting fragmentation, one of these is selected for placing the incoming task. If location A is selected, the FRL is updated to {(0, 12), (16, 8), (56, 8)}.
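The FRL update for a chosen placement can be sketched as removing each TRL run from the free run that contains it. This is a minimal illustration using the numbers from the example; the function name `allocate` and the failure convention (returning `None`) are hypothetical, not the paper's implementation.

```python
def allocate(frl, trl):
    """Remove the runs in trl from the free list frl.

    Both arguments are lists of (start, size) tuples. Returns the
    updated free list, or None if some run in trl is not entirely
    free (overlap with a placed task or outside the FPGA).
    """
    free = list(frl)
    for start, size in trl:
        end = start + size
        for i, (fs, fl) in enumerate(free):
            fe = fs + fl
            if fs <= start and end <= fe:
                # Keep whatever is left before and after the block.
                pieces = []
                if fs < start:
                    pieces.append((fs, start - fs))
                if end < fe:
                    pieces.append((end, fe - end))
                free[i:i + 1] = pieces
                break
        else:
            return None  # no free run contains this block
    return free


frl = [(0, 12), (16, 8), (32, 16), (56, 8)]
after_a = allocate(frl, [(32, 16)])          # place at A
after_b = allocate(frl, [(16, 8), (40, 8)])  # place at B
print(after_a)  # [(0, 12), (16, 8), (56, 8)]
print(after_b)  # [(0, 12), (32, 8), (56, 8)]
```

Because a block can split a free run into at most two pieces, each placed block adds at most one entry to the list.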
Let $r$ and $c$ indicate the row and column of the candidate location cell $G$. A loop of the curve can be U shaped or inverted U shaped. The loop direction can be determined by checking the positions of cells $G + 1$ and $G + 3$ using a lookup reflected binary Gray matrix. Each loop has an entry which can be vertical or horizontal. This can be found by examining the row and column of cell $G - 1$. The U shaped loops at locations 12 and 48 have vertical entries of distance 4 and 8 rows, respectively. The U shaped loops at 24 and 40 have horizontal entries of length 4 columns. Similarly, there are inverted U loops with vertical entry at 16, 44, and so forth and with horizontal entry at 32 and 56. This information is useful while sliding a task. When the task $T_1$ placed at position 12 expires, the blocks to be released are again found as a run-length list, TRL = {(12, 4), (48, 4)}. Continuing from the FRL obtained after placing at A, the run-length matrix is updated to FRL = {(0, 24), (48, 4), (56, 8)}, since cells 52 to 55 remain occupied.
In algorithms based on area matrix methods, the cells have to be recalculated whenever a task is added or deleted, which takes a considerable amount of time. The run-length list is smaller in size (in the worst case, one eighth of the number of CLBs), and hence fewer entries need to be checked. Updating the run-length list also has lower complexity.
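Releasing a finished task is the reverse update: its runs are returned to the free list and merged with any neighbouring free runs. The following is a small sketch under that reading; the name `release` and the sort-then-merge strategy are illustrative choices, not the paper's code.

```python
def release(frl, trl):
    """Return the runs in trl to the free list frl, merging adjacent runs.

    Both arguments are lists of (start, size) tuples and are assumed
    not to overlap each other.
    """
    merged = []
    for start, size in sorted(list(frl) + list(trl)):
        if merged and merged[-1][0] + merged[-1][1] == start:
            # This run begins exactly where the previous one ends.
            prev_start, prev_size = merged[-1]
            merged[-1] = (prev_start, prev_size + size)
        else:
            merged.append((start, size))
    return merged


# Undo the placement at A: the original free list is recovered.
restored = release([(0, 12), (16, 8), (56, 8)], [(32, 16)])
print(restored)  # [(0, 12), (16, 8), (32, 16), (56, 8)]

# Freeing cells 12..15 joins two neighbouring free runs into one.
joined = release([(0, 12), (16, 8)], [(12, 4)])
print(joined)  # [(0, 24)]
```

Only the list entries are touched, so the cost depends on the number of runs rather than on the number of CLBs.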
The quality of the placement algorithm can be improved by finding all feasible solutions and then selecting one based on fragmentation. Best fit finds the fragmentation index of all the feasible solutions and places the task in the position that minimizes the resulting fragmentation. Owing to the run-length representation, we make use of a new method to measure the continuity of free space. Compared to other methods proposed in the literature, this is faster and gives better results. Fragmentation is calculated using the method given by Gehr and Schneider [29], with its parameter taken as 2. If the entire space is free, then the fragmentation is 0. In the worst case of a checkerboard pattern, it is almost 1.
The first fit method places the task in the first available location that can accommodate it. Best fit places the task in the location which reduces the overall fragmentation. It does not guarantee optimal results, because it is a heuristic and the future inputs are unpredictable.
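The difference between the two modes can be sketched as follows. The fragmentation measure below is only a stand-in (one minus the share of the free area held by the largest run); it is not the formula of Gehr and Schneider [29], although it also gives 0 for a fully free device and a value close to 1 for a checkerboard. All function names and the candidate order are hypothetical.

```python
def allocate(frl, trl):
    """Remove the runs in trl from frl; return None if any run is not free."""
    free = list(frl)
    for start, size in trl:
        end = start + size
        for i, (fs, fl) in enumerate(free):
            if fs <= start and end <= fs + fl:
                pieces = []
                if fs < start:
                    pieces.append((fs, start - fs))
                if end < fs + fl:
                    pieces.append((end, fs + fl - end))
                free[i:i + 1] = pieces
                break
        else:
            return None
    return free


def fragmentation(frl):
    """Stand-in metric (assumption): 1 - largest free run / total free area."""
    total = sum(size for _, size in frl)
    if total == 0:
        return 0.0
    return 1.0 - max(size for _, size in frl) / total


def first_fit(frl, candidates):
    """Return the first candidate TRL that fits, or None."""
    for trl in candidates:
        if allocate(frl, trl) is not None:
            return trl
    return None


def best_fit(frl, candidates):
    """Return the feasible candidate TRL with the lowest resulting fragmentation."""
    best, best_frag = None, 1.0
    for trl in candidates:
        remaining = allocate(frl, trl)
        if remaining is None:
            continue
        frag = fragmentation(remaining)
        if best is None or frag < best_frag:
            best, best_frag = trl, frag
    return best


frl = [(0, 12), (16, 8), (32, 16), (56, 8)]
# Hypothetical candidates for an 8-cell task; (12, 4) is occupied.
candidates = [[(12, 4)], [(32, 8)], [(16, 8)]]
print(first_fit(frl, candidates))  # [(32, 8)]
print(best_fit(frl, candidates))   # [(16, 8)]
```

In this example best fit prefers the candidate that consumes a whole 8-cell run, because it leaves the 16-cell run intact, whereas first fit stops at the first feasible candidate.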
Mapping a task with odd dimensions onto a reflected binary Gray space increases the fragmentation. To reduce complication, the size of the task is rounded up to the nearest even number. Therefore, the space allotted to the task is slightly more than the space actually required. This leads to internal fragmentation. In this paper, the tasks are assumed to have even dimensions. The pseudocode is given below:

Input: incoming task T_i, free space run-length FRL
Set best_frag = 1, found = 0
Select k such that 2^k ≤ w * h, where w and h are the width and height of the incoming task
While k > 0 do
    Check FRL for a vacant space of size more than 2^k
    Find a feasible location G inside the vacant space
    Select a region sufficient to hold the incoming task and including G, and represent it in run-length form TRL
    Try to insert TRL into FRL
    If the region overlaps an existing task or exceeds the FPGA boundary, the insertion fails
    If the insertion fails, slide the region and repeat the previous two steps
    If there is no overlap, the insertion succeeds
        If first fit, report G as the location for the incoming task and quit
        If best fit, update best_frag if the new fragmentation is lower, set found = 1, and continue
    Whether the insertion succeeds or fails, store the location in a list to avoid checking the same location again
    Decrement k
End
If found = 1
    Report the location of the incoming task
Else
    Return fail
End

4.1. Complexity Analysis. Let $n$ be the number of empty slots in FRL and let $m$ be the number of blocks to be inserted, as listed in TRL. To find the number of empty slots examined by the algorithm when placing all the blocks, we consider the worst case, which occurs when each placed block splits an empty slot into two. Suppose all the blocks fall inside the last slot. We examine $n - 1$ slots without fixing any block. While examining the $n$th slot, we place the first block, creating a new slot. We place the second block in this new slot, creating another slot, and so on. Therefore, by placing $m$ blocks, we generate $m$ new slots. The total number of slots becomes $n + m$, of which the last slot created need not be examined, because all $m$ blocks have already been placed. Hence, the loop needs to run for a maximum of $n + m - 1$ iterations. In the best case, the loop needs to run for only $m$ iterations. The complexity of finding the fragmentation is $O(n)$. The clustering property ensures that $n$ and $m$ will be small. Selecting regions and representing them in run-length form has complexity $O(1)$. The worst case complexity of sliding the region is $O(w \times h)$, but $w$ and $h$ are the width and height of the incoming task and are small compared to the size of the FPGA.
To show that $m$ is small, we calculated the size of TRL for blocks of all possible widths and heights at all possible locations. The histogram in Figure 2 is plotted for a 16 × 16 FPGA based on the size of TRL. From the figure, it is clear that in 90% of cases the size of TRL is less than 5, and the average value is 3.905. This also holds for bigger FPGAs. The maximum TRL sizes for an 8 × 8, a 16 × 16, and a 32 × 32 block on a 64 × 64 FPGA are 10, 22, and 46, respectively.
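An enumeration of this kind can be sketched as below. The cell labeling (Gray-encode row and column, interleave, Gray-decode) is an assumption that reproduces the labels of the 8 × 8 example in Section 4, and the function names are hypothetical. The sketch only shows how such a histogram could be collected; it makes no claim about reproducing the figures reported above.

```python
from collections import Counter


def gray_encode(n):
    return n ^ (n >> 1)


def gray_decode(g):
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def cell_label(row, col, bits):
    """Assumed labeling: row bits are the high bit of each interleaved pair."""
    gr, gc = gray_encode(row), gray_encode(col)
    z = 0
    for i in range(bits):
        z |= ((gr >> i) & 1) << (2 * i + 1)
        z |= ((gc >> i) & 1) << (2 * i)
    return gray_decode(z)


def task_run_lengths(row, col, height, width, bits):
    """TRL of a height x width block whose top left cell is (row, col)."""
    labels = sorted(
        cell_label(r, c, bits)
        for r in range(row, row + height)
        for c in range(col, col + width)
    )
    runs = []
    for lab in labels:
        if runs and runs[-1][0] + runs[-1][1] == lab:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((lab, 1))
    return runs


def trl_size_histogram(bits):
    """Count TRL sizes over every block size and position on the grid."""
    side = 1 << bits
    hist = Counter()
    for h in range(1, side + 1):
        for w in range(1, side + 1):
            for r in range(side - h + 1):
                for c in range(side - w + 1):
                    hist[len(task_run_lengths(r, c, h, w, bits))] += 1
    return hist


print(task_run_lengths(4, 4, 4, 4, 3))  # [(32, 16)]
print(task_run_lengths(2, 4, 4, 4, 3))  # [(16, 8), (40, 8)]
hist = trl_size_histogram(3)            # 8 x 8 grid keeps the run short
print(sorted(hist.items()))
```

Passing `bits=4` gives the 16 × 16 case; the loop then visits 18496 blocks, which is still inexpensive.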

Experimental Setup
The simulation framework was implemented in Matlab 7.8 running on a 2.2 GHz Intel Core i3 processor. The algorithm is evaluated using randomly generated data. This approach has also been used in previous work, because real workload data for future technology are not available. In this section, we present two methods: the first is a fast placement technique (GFF) and the other is a fragmentation-aware placement technique (GBF). These techniques are compared with standard placement techniques: bottom left, 2D adjacency based placement, the least interference fit technique, and the CLook algorithm. Bottom left (BL) is a classical bin packing algorithm which places the incoming task in the first empty slot available, starting from the bottom left corner of the FPGA. The 2D adjacency based technique (Deng) chooses the location of the incoming task so that tasks are placed "densely," in order to leave a larger continuous free area. The 2D adjacency of a candidate cell is equal to the number of adjoining tasks or boundaries of the incoming task if the base cell of the incoming task is placed there. The least interference fit technique (LIF) selects the location which minimizes the number of columns disturbed, in order to minimize the number of running tasks that are halted during reconfiguration. The CLook method is explained in Trong [14].
In order to evaluate the effectiveness of the algorithm, simulation is performed for FPGAs with 16 × 16, 32 × 32, and 64 × 64 CLBs. The space filling curve requires the FPGA to be square, with a side length that is a power of two. To demonstrate the impact of various parameters on the rejection rate, we have used a 16 × 16 FPGA. This model is adopted because the previous studies most relevant to this work used FPGAs of similar size for their simulations. Sixty sets of 500 tasks each are randomly generated for each experimental environment, and the results shown in the next section are averages over these sets. The height and width of the tasks are chosen randomly between 1 and a maximum value of 8 CLBs. The lifetime of the tasks is generated randomly between 1 and 500 time units. The delay between two consecutive tasks is chosen between 1 and a user-defined upper bound, in time units. The workload can be controlled by varying this upper bound. A smaller bound means that task arrivals are more frequent and FPGA area utilization is higher. All parameters are assigned by sampling a uniform random distribution over their respective validity intervals. The proposed work uses a simple scheduling algorithm which can place tasks from a waiting list. The experiment was repeated for the 32 × 32 and 64 × 64 sizes; the results appear to be similar, and the run-length size does not increase as the FPGA is scaled.
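A workload of this kind could be generated along the following lines. This is a sketch in Python rather than the Matlab setup used for the experiments; the class and parameter names, the default gap bound of 20, the zero default slack, and the fixed seed are all assumptions made for illustration. The deadline is computed as arrival time plus execution time plus slack, as described in Section 6.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    width: int       # in CLBs
    height: int      # in CLBs
    arrival: int     # time units
    execution: int   # lifetime in time units
    deadline: int    # arrival + execution + slack


def generate_tasks(count=500, max_side=8, max_life=500,
                   max_gap=20, slack=0, seed=0):
    """Draw one task set from uniform distributions.

    max_gap is the user-defined upper bound on the delay between two
    consecutive arrivals; a smaller value gives a heavier load.
    """
    rng = random.Random(seed)
    tasks, clock = [], 0
    for _ in range(count):
        clock += rng.randint(1, max_gap)      # inter-arrival delay
        execution = rng.randint(1, max_life)
        tasks.append(Task(
            width=rng.randint(1, max_side),
            height=rng.randint(1, max_side),
            arrival=clock,
            execution=execution,
            deadline=clock + execution + slack,
        ))
    return tasks


# Sixty independent sets of 500 tasks, one seed per set.
task_sets = [generate_tasks(seed=s) for s in range(60)]
print(len(task_sets), len(task_sets[0]))
```

Using one seeded generator per set keeps each set reproducible, which makes it easier to compare placement algorithms on identical inputs.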
The following assumptions are used in this work. The tasks are independent and nonpreemptive. A nonpreemptive task, once started, cannot be stopped before it expires. For this reason, relocation of tasks is also not permitted. Since the tasks are independent, they can be scheduled in any order. Rotation of tasks is not used.
The following parameters are measured to test the effectiveness of the proposed algorithm. Suppose that during the simulation interval $[0, T]$, $N$ tasks arrived and $R$ tasks were rejected. For a reconfigurable area of size $W \times H$, consider the following:
(1) Average task rejection ratio: a task may be rejected if sufficient contiguous area is not currently available and it cannot meet its deadline if scheduled at a later time: Average task rejection ratio $= (R / N) \times 100\%$.
(2) Total waiting time for tasks: if the online placement cannot find a feasible space, the task is added to a waiting list; when a task that is currently running completes, new space is created and the waiting list is examined to place tasks that can still meet their deadlines.
Penalty ratio is the ratio of the volume of rejected tasks to the total volume of all tasks. When a task is rejected, the total free area in the reconfigurable device at that time is called wasted area. A good placement algorithm will have higher utilization, a lower penalty ratio, less wasted area, and a lower rejection ratio.
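Two of these metrics can be computed as in the sketch below. The function names are hypothetical, and the "volume" of a task is assumed here to mean width × height × execution time, which the text does not define explicitly.

```python
def rejection_ratio(arrived, rejected):
    """Average task rejection ratio in percent."""
    return 100.0 * rejected / arrived


def task_volume(width, height, execution):
    """Assumed definition of volume: occupied area times execution time."""
    return width * height * execution


def penalty_ratio(tasks, rejected_indices):
    """Volume of the rejected tasks divided by the volume of all tasks.

    tasks is a list of (width, height, execution) tuples and
    rejected_indices holds the positions of the rejected tasks.
    """
    total = sum(task_volume(*t) for t in tasks)
    lost = sum(task_volume(*tasks[i]) for i in rejected_indices)
    return lost / total


# Three tasks with volumes 40, 80, and 80; the second one is rejected.
tasks = [(2, 2, 10), (4, 4, 5), (2, 4, 10)]
print(rejection_ratio(arrived=3, rejected=1))  # about 33.3 percent
print(penalty_ratio(tasks, [1]))               # 80 / 200
```

The two ratios can differ noticeably: rejecting one large, long-running task affects the penalty ratio much more than the rejection ratio.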

Results and Discussion
A snapshot of the simulation output at a particular instant is shown in Figure 3. The coloured boxes correspond to tasks that are currently running. Tasks that have completed are not shown. The white region indicates the empty area, which is already becoming fragmented due to the placement and removal of tasks. The experiment is also repeated with a skewed probability distribution of task width and height to study the impact of task size on placement quality. The results of our placement method match those of conventional methods. The rejection rate was higher for larger tasks, as shown in Figure 4.
In the next experiment, the intertask arrival time range was varied from 5% to 20% of the execution time range. The rejection rate increases as the intertask arrival time range decreases. When tasks arrive in quick succession, more tasks are running on the FPGA, leaving less room for newly arriving tasks. This is illustrated in Figure 5.
In order to examine the impact of the deadline on performance, we repeated the experiments with different values of slack. The deadline is calculated as the sum of arrival time, execution time, and slack.
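As a small worked example of this definition, the sketch below computes the deadline and the latest start time at which a waiting task can still finish in time. The function names are hypothetical, and the latest start is taken to be the deadline minus the execution time, which is the usual as-late-as-possible (ALAP) reading.

```python
def deadline(arrival, execution, slack):
    """Deadline = arrival time + execution time + slack."""
    return arrival + execution + slack


def latest_start(arrival, execution, slack):
    """Latest (ALAP) start time that still meets the deadline."""
    return deadline(arrival, execution, slack) - execution


def can_still_meet_deadline(now, arrival, execution, slack):
    """True if a task started at time `now` would finish by its deadline."""
    return now <= latest_start(arrival, execution, slack)


# A task arriving at t = 10 that runs for 50 time units with slack 5.
print(deadline(10, 50, 5))                     # 65
print(latest_start(10, 50, 5))                 # 15
print(can_still_meet_deadline(15, 10, 50, 5))  # True
print(can_still_meet_deadline(16, 10, 50, 5))  # False
```

With zero slack the latest start equals the arrival time, so a task that cannot be placed immediately is rejected.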
When the deadline is tight, more tasks are rejected. If the deadline is loose, tasks can wait until their ALAP time and are placed whenever a free slot becomes available. When the slack becomes very large, none of the tasks is rejected. Again, the proposed method matches existing methods, as shown in Figure 6. Other results show that the average utilization for the reflected binary Gray method is marginally better, with a lower execution time than the others. Table 1 gives the performance of the various algorithms. The waiting time is zero for the 32 × 32 and 64 × 64 FPGAs and hence is not shown in the table. Even though BL, Deng, and LIF seem to be faster, their speed reduces when the size of the FPGA is increased. CLook has a longer execution time, but its rejection rate is better than that of the others. The proposed methods achieve a rejection rate equal to that of the CLook algorithm with a shorter execution time. Another feature of the proposed technique is that its execution time increases less rapidly when the FPGA size is scaled up. CLook becomes very slow for bigger FPGAs.

International Journal of Reconfigurable Computing
Table 2 lists the average algorithm execution time, average number of tasks rejected, average waiting time for the tasks, average utilization ratio, average penalty ratio, average waste area, average size of FRL, and average size of TRL obtained by simulating a 64 × 64 FPGA. The test dataset load05 means that the intertask time interval is [1 to 5] time units. The results show that the utilization ratio increases with load but flattens beyond a particular value. Waste area decreases as the utilization ratio increases. Waiting time and algorithm execution time increase with load. Another important finding is that the average sizes of the FRL and of the task run-length (TRL) are very small, even though their theoretical values are high.

Conclusions
In this paper, a new approach for scheduling and placement of tasks on a dynamically reconfigurable device, based on the reflected binary Gray space filling curve, is presented with the goal of minimizing the task rejection ratio and increasing FPGA utilization. The free space is managed using a one-dimensional run-length based representation. A new method to find the fragmentation is also used. The algorithm does not consider routability, I/O communication, or heterogeneous FPGAs. The algorithm can be improved to reduce the total reconfiguration overhead by reusing some of the task locations. Hence, considerable opportunities exist for further research in this area.

Figure 1: FPGA with some tasks already placed.

Figure 2: Cumulative graph of distribution of TRL size.

Figure 5: Rejection ratio for different values of load.

Table 1: Utilization factor, average waiting time, and execution time.

Table 2: Performance metric for RBG code in first fit mode.