Modern embedded systems are being modeled as Reconfigurable High Speed Computing Systems (RHSCS), in which reconfigurable hardware, that is, a Field Programmable Gate Array (FPGA), and softcore processors configured on the FPGA act as computing elements. As system complexity increases, efficient task distribution methodologies are essential to obtain high performance. A dynamic task distribution methodology based on the Minimum Laxity First (MLF) policy (DTD-MLF) distributes the tasks of an application dynamically onto the RHSCS and utilizes the available RHSCS resources effectively. The DTD-MLF methodology takes advantage of the runtime design parameters of an application represented as a DAG and considers the attributes of the tasks in the DAG and of the computing resources to distribute the tasks onto the RHSCS. In this paper, we describe the DTD-MLF model and verify its effectiveness by distributing several real-life benchmark applications onto an RHSCS configured on a Virtex-5 FPGA device. The benchmark applications are represented as DAGs and are distributed to the resources of the RHSCS based on the DTD-MLF model. The performance of the MLF-based dynamic task distribution methodology is compared with that of static task distribution. The comparison shows that the dynamic task distribution model with the MLF criterion outperforms the static task distribution techniques in terms of schedule length and effective utilization of the available RHSCS resources.
Microprocessors are at the core of high-performance computing systems, and they provide flexibility for a wide range of applications at the expense of performance [
The remainder of the paper is organized as follows. The literature review is presented in Section
This section reviews task distribution methodologies for reconfigurable heterogeneous computing systems that have multiple dissimilar processing elements. A computing platform called the MOLEN polymorphic processor described in [
The main objective of task distribution is to map a given application, represented as a Directed Acyclic Graph (DAG), to the resources of the RHSCS computing platform so as to minimize the total execution time of the application while utilizing the resources effectively. This section describes the task graph representation and the targeted computing architecture, gives an overview of the task distribution model, and finally demonstrates dynamic and static task distribution with an example.
Applications can be represented as a Directed Acyclic Graph (DAG)
The RHSCS consists of a MicroBlaze processor (available as a softcore IP in the Xilinx Embedded Development Kit) configured in part of the FPGA as the softcore PE and multiple RLUs configured in the remaining part of the FPGA as hardcore PEs. The hardcore PEs in the RHSCS act as reconfigurable computing area and support dynamic reconfiguration for hardware tasks. The softcore PE and the hardcore PEs are used to execute the software tasks and hardware tasks of an application, respectively. The RHSCS is also equipped with shared memory and cache memory to store task executable files and data. The cache memory serves the softcore PE for both instructions and data, whereas the shared memory stores the task executables and input/output data for both the softcore PE and the hardcore PEs. The resources in the targeted architecture are interconnected through high speed communication protocols that support data interchange between memory and PEs. The memory and communication protocols are configured on the same chip as the PEs. In the RHSCS, the RLU size is kept constant, and tasks are assigned to RLUs based on the area required for their execution. In this research, the resource reconfiguration latency is assumed constant and is not accounted for in performance calculations.
The RHSCS offers a cost-effective solution for computationally intensive applications through hardware reuse, so the potentially parallel tasks of an application need to be mapped to the resources of the RHSCS. An overview of the steps in distributing the tasks of an application onto the RHSCS platform is demonstrated in Figure
Overview of task distribution flow.
Initially, an application is represented as a Directed Acyclic Graph (DAG), and the tasks of the DAG are sent to the prioritization module and then to the HW/SW resource mapping module. The prioritization module assigns priorities to the tasks of the DAG based on their attributes in such a way that schedulability is ensured. The HW/SW resource mapping module partitions the tasks into three types, called software tasks (ST), hardware tasks (HT), and hybrid tasks (HST), based on their attributes and preemption nature, as stated below.
The set of tasks which can be preempted and cannot be allocated an RLU of the required area on the RHSCS is treated as the software task set (ST):
The set of tasks which cannot be preempted and can be allocated an RLU of the required area on the RHSCS is treated as the hardware task set (HT):
The set of tasks which can be preempted and can be allocated an RLU of the required area on the RHSCS is treated as the hybrid task set (HST):
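The three-way partition can be sketched in Python; the task attributes (a preemptable flag and the required RLU area) and the free-area check are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    preemptable: bool   # can the task be preempted?
    rlu_area: int       # RLU area the task needs (e.g., in slices)

def partition(tasks, free_rlu_area):
    """Split tasks into software (ST), hardware (HT), and hybrid (HST) sets."""
    st, ht, hst = [], [], []
    for t in tasks:
        fits = t.rlu_area <= free_rlu_area
        if t.preemptable and not fits:
            st.append(t)    # preemptable, no RLU area available -> software task
        elif not t.preemptable and fits:
            ht.append(t)    # non-preemptable, RLU area available -> hardware task
        elif t.preemptable and fits:
            hst.append(t)   # preemptable and fits -> hybrid task
        else:
            # non-preemptable tasks that do not fit are not covered by the
            # paper's three sets; treating them as software tasks here is an assumption
            st.append(t)
    return st, ht, hst
```

A hybrid task can later be dispatched to either the softcore PE or an RLU, whichever frees up first.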
The partitioned tasks are then sent to the task distribution stage, where a task distribution list is prepared for the resources of the RHSCS based on the task distribution policy and resource availability. The task distribution can be done statically or dynamically, as stated below.
The static task distribution and dynamic task distribution methodologies are demonstrated with an example in the next subsection.
A hypothetical sample task graph [
Hypothetical sample task graph [
Generally, the execution time of a task graph depends on the computing resources on which its tasks are executed. The various configurations of the computing platform RHSCS for execution of the hypothetical sample task graph (HTG), shown in Figure
RHSCS configurations for static and dynamic task distribution.
The HTG execution on single core microprocessor configured in FPGA is shown in Figure
The task distribution methodology dynamically decides task execution on the resources of the RHSCS. The proposed DTD methodology determines an optimal task execution sequence and speeds up application execution.
In order to achieve high efficiency in hardware utilization and speed up the application execution, the DTD model is described in three levels as shown in Figure
Dynamic task distribution model.
The Application Decode Module (ADM) loads and stores the tasks of DAGs into DAG Queue. The Task Annotation Module arranges the tasks in DAG Queue based on their level in DAG. The HW/SW Task Partitioning Module maps the tasks in DAG Queue to the resources of computing platform RHSCS and stores them into ST Queue, HT Queue, and HST Queue. Dynamic Task Prioritization Module assigns priorities dynamically based on MLF distribution policy to the tasks in ST Queue, HT Queue, and HST Queue. The Task Load Module loads the task executable files for execution onto softcore PE of RHSCS. Similarly, the Task Configuration Module configures the task bit-stream files for execution onto hardcore PEs, that is, RLUs of RHSCS. The pseudo codes for reading the tasks of DAG and tasks level annotation in level 1, HW/SW resource mapping in level 2, and dynamic task distribution in level 3 are discussed in the coming subsections.
In level 1, the Application Decode Module loads the applications described as DAG and computes the adjacency matrix for the DAG that describes dependency of the tasks in a DAG. The adjacency matrix also holds the level [
/* Algorithm 1: Task level annotation.
   Output: Level annotated tasks of the DAGs. */
(1)  Read the number of DAGs
(2)  for each DAG do
(3)      Read the number of tasks in the DAG
(4)      for each task in the DAG do
(5)          Read the number of tasks which depend on the task
(6)      end for
(7)      Compute the level of each task in the DAG
(8)      for each task in the DAG do
(9)          Assign its level to the task
(10)     end for
(11)     Sort the tasks in the DAG Queue by level
(12) end for
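Assuming the adjacency-matrix representation described above (adj[i][j] = 1 when task i immediately precedes task j), the level-annotation step might look like the following sketch:

```python
def annotate_levels(adj):
    """Assign each task its level: entry tasks get level 0, and every
    other task sits one level below its deepest predecessor."""
    n = len(adj)
    level = [0] * n
    # iterate until levels stabilize; for an acyclic graph this
    # terminates after at most n passes
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for j in range(n):
                if adj[i][j] and level[j] < level[i] + 1:
                    level[j] = level[i] + 1
                    changed = True
    return level

def sort_by_level(levels):
    """Return task indices ordered by level (the DAG Queue order)."""
    return sorted(range(len(levels)), key=lambda t: levels[t])
```

For a three-task chain 0 → 1 → 2 with an extra edge 0 → 2, the levels come out as [0, 1, 2].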
/* Algorithm 2: HW/SW resource mapping.
   Output: ST and HT partitioned tasks of the DAG. */
(1) Read the level annotated tasks of the DAG and the number of tasks in the DAG from Algorithm 1
(2) Initialize the HT Queue and the ST Queue
(3) for each task in the DAG do
(4)     if the task cannot be preempted and the required RLU area is available then
(5)         assign the task to the HT Queue
(6)     else
(7)         assign the task to the ST Queue
(8)     end if
(9) end for
/* Algorithm 3: Dynamic task distribution.
   Output: Resource assignment and dynamic task execution order for the tasks in the DAG. */
(1)  Read the partitioned tasks of the DAG and the number of tasks in the DAG from Algorithm 2
(2)  Initialize the RLU Implementation Queue and the CPU Implementation Queue
(3)  for each partitioned task do
(4)      Compute the cost function MLF for the task
(5)  end for
(6)  Assign a priority to the partitioned tasks in the queues according to their MLF
(7)  Sort the tasks of the DAG in increasing order of assigned priority
(8)  for each task do
(9)      if the task is a hardware task then
(10)         assign the task to the RLU Implementation Queue
(11)     else
(12)         assign the task to the CPU Implementation Queue
(13)     end if
(14) end for
(15) while ((RLU Implementation Queue != empty) or (CPU Implementation Queue != empty)) do
(16)     if an RLU is available then
(17)         Assign the next task from the RLU Implementation Queue to the available RLU
(18)     end if
(19)     if the MicroBlaze is available then
(20)         Assign the next task from the CPU Implementation Queue to the available MicroBlaze
(21)     end if
(22) end while
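Reading the MLF cost function as laxity = deadline − current time − execution time (the usual minimum-laxity definition), the dynamic prioritization step can be sketched as follows; the task attributes are hypothetical:

```python
import heapq

def mlf_dispatch(tasks, now=0):
    """Order ready tasks by Minimum Laxity First.
    Each task is (name, deadline, exec_time); the task with the least
    laxity (deadline - now - exec_time) is dispatched first."""
    heap = [(deadline - now - exec_time, name)
            for name, deadline, exec_time in tasks]
    heapq.heapify(heap)
    order = []
    while heap:
        laxity, name = heapq.heappop(heap)  # smallest laxity first
        order.append(name)
    return order
```

In a full scheduler this selection would run each time an RLU or the MicroBlaze becomes free, with `now` advanced to the current simulation time so that laxities shrink as deadlines approach.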
The time complexity of the task level annotation algorithm depends on the number of DAGs and the maximum number of tasks in a DAG. The time complexity for task level annotation would be
In this stage, the level annotated tasks in DAG are mapped to the resources of RHSCS and partitioned [
The time complexity of the task resource mapping algorithm depends on the maximum number of tasks in a DAG. The time complexity for resource mapping would be
Task distribution is demonstrated in two phases as combination of dynamic task prioritization and resource management. The partitioned tasks in Algorithm
The Task_Distribution function in Algorithm
Time complexity of the proposed DTD methodology depends on time complexity of task level annotation, HW/SW resources mapping, and task distribution algorithms. The time complexity of DTD model is
This section presents the implementation scheme, the experimental results obtained, the performance evaluation of the DTD-MLF methodology, and the utilization of RHSCS resources.
The modelling of the RHSCS environment and the methods followed for application execution on the RHSCS are discussed in this subsection.
In this research, RHSCS platform is realized on Virtex-5 FPGA (Virtex-5 XC5VLX110T), as shown in Figure
On-chip Reconfigurable High Speed Computing System.
The RLU configures its custom hardware for hardware tasks and also supports interfacing hardware tasks with off-chip peripherals. The on-chip BRAM of size 64 KB acts as shared memory for the MicroBlaze and the RLUs to store executable files and input and output data. A BRAM memory controller is configured along with the BRAM to load task executables and input and output data from the external environment, and it also controls data interchange between the BRAM and the MicroBlaze. Data interchange between the BRAM and the custom hardware is done through the MicroBlaze with the help of communication protocols. The functional blocks (MicroBlaze, RLUs, BRAM, and the instruction and data cache memories) are interconnected through communication protocols such as the Processor Local Bus (PLB), the Local Memory Bus (LMB), and the Fast Simplex Link (FSL). The PLB provides the interface between the MicroBlaze and the BRAM through BRAM controllers that load instructions and input data and store back output data after computation. The LMB supports interfacing the cache memory with the MicroBlaze to minimize memory access overheads. The FSL is used to interface the custom hardware configured in an RLU with the MicroBlaze, and it has a 32-bit FIFO implemented on BRAM to support data streaming between the MicroBlaze and the custom hardware. Of the total 69120 bit slices, 148 BRAM cells, and 64 DSP cells available on the Virtex-5 FPGA (Virtex-5 XC5VLX110T) for custom logic reconfiguration, the on-chip RHSCS utilized 3825 bit slices, 4 BRAM cells, and 3 DSP cells for the various functional blocks and communication protocols. The configured MicroBlaze runs at 125 MHz.
The behavior of DTD methodology has been demonstrated in Figure
In the literature many researchers have developed methods to enhance execution speed, schedulable bound, and resource utilization. This paper is aimed at improving upon the schedule length, that is, execution speed of an application and effective utilization of RHSCS resources.
In a DAG, a task without any predecessor is an entry task and a task without any successor is an exit task. The time taken to execute the tasks from the entry task to the exit tasks of a DAG is called the schedule length of the DAG. The schedule length depends on the computing resources on which the tasks run and has to be minimized to achieve optimum execution time for an application.
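As an illustration, the critical (longest) entry-to-exit path of a DAG can be computed as below; this gives the best schedule length achievable with unlimited resources, not the schedule length actually obtained on the RHSCS:

```python
def schedule_length_lower_bound(exec_time, succ):
    """Length of the longest entry-to-exit path in a DAG.
    exec_time[i]: execution time of task i; succ[i]: list of successors of i."""
    n = len(exec_time)
    memo = {}

    def longest(i):
        # longest path starting at task i, memoized
        if i not in memo:
            memo[i] = exec_time[i] + max((longest(j) for j in succ[i]), default=0)
        return memo[i]

    # the schedule length bound is the longest path from any entry task;
    # taking the max over all tasks covers every entry task
    return max(longest(i) for i in range(n))
```

For the chain 0 → 1 → 2 with execution times 2, 3, and 4 (and a shortcut edge 0 → 2), the bound is 2 + 3 + 4 = 9.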
The resource utilization of the computing platform is estimated from the tasks allocated to its individual resources and the time spent executing those tasks. An expression to calculate resource utilization is as follows:
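One standard formulation, assumed here for illustration, divides the total busy time of all PEs by the total PE capacity over the schedule:

```python
def resource_utilization(busy_times, schedule_len):
    """Percentage utilization of the platform.
    busy_times[p]: time PE p spends executing tasks during the schedule;
    schedule_len: total schedule length; capacity = PEs * schedule length."""
    capacity = len(busy_times) * schedule_len
    return 100.0 * sum(busy_times) / capacity
```

For example, two PEs busy for 50 and 30 time units over a schedule of length 100 give 40% utilization.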
The dynamic task distribution model based on the MLF criterion (DTD-MLF) distributes the tasks of an application to the resources of the computing platform RHSCS dynamically, based on the MLF cost function of the tasks in the DAG. Initially, the DTD-MLF methodology is applied to an HTG [
JPEG task graph.
The tasks in HTG are distributed to the resources of RHSCS based on DTD-MLF model as well as static task distribution [
Schedule length and resource utilization of HTG and JPEG based on the STD-MLF and DTD-MLF distribution policies.
Task graph | Number of tasks | Schedule length (ns): STD-MLF | Schedule length (ns): DTD-MLF | Resource utilization (%): STD-MLF | Resource utilization (%): DTD-MLF
---|---|---|---|---|---
HTG | 10 | 72.0 | 63.0 | 35.10 | 40.00
JPEG | 7 | 40.8 | 39.1 | 27.60 | 28.90
HTG + JPEG | 17 | 96.0 | 73.0 | 38.60 | 50.10
Figures
Performance improvement of HTG, JPEG task graphs on RHSCS (a) schedule length and (b) resource utilization.
The DTD-MLF and STD-MLF methodologies are further applied to a few real-life benchmark applications, summarized in the first column of Table
Benchmark applications and their tasks distribution to RHSCS.
Task graph | Number of tasks | Schedule length (ns): STD-MLF | Schedule length (ns): DTD-MLF | Resource utilization (%): STD-MLF | Resource utilization (%): DTD-MLF
---|---|---|---|---|---
DCT | 43 | 96.25 | 80.53 | 58.86 | 70.93
Diffeq. | 15 | 40.75 | 28.15 | 45.50 | 65.00
Ellip. | 38 | 93.43 | 80.47 | 49.27 | 57.23
FIR | 15 | 52.85 | 34.37 | 37.28 | 57.32
IIR | 16 | 45.4 | 31.54 | 45.27 | 65.17
Lattice | 23 | 59.59 | 51.61 | 48.33 | 55.80
Nc. | 61 | 129.02 | 115.16 | 64.63 | 72.40
Voltera | 29 | 72.36 | 61.26 | 54.20 | 64.02
Wavelet | 43 | 88.04 | 78.04 | 63.32 | 71.43
Wdf7 | 53 | 103.92 | 95.57 | 63.53 | 69.09
As stated in Section
Schedule length of the benchmark applications on RHSCS in both STD-MLF and DTD-MLF scenario.
RHSCS resource utilization of benchmark application in both STD-MLF and DTD-MLF scenario.
From the results, the presented DTD-MLF methodology boosted the application execution over STD-MLF by 16.33% for DCT, 30.92% for Diffeq., 13.97% for Ellip., 34.96% for FIR, 30.52% for IIR, 13.39% for Lattice, 10.74% for Nc., 15.34% for Voltera, 11.36% for Wavelet, and 8.04% for Wdf7. The DTD-MLF enhanced the RHSCS resource utilization over STD-MLF model by 20.51% for DCT, 42.86% for Diffeq., 16.16% for Ellip., 53.76% for FIR, 43.96% for IIR, 15.46% for Lattice, 12.02% for Nc., 18.11% for Voltera, 12.08% for Wavelet, and 8.75% for Wdf7.
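These percentages follow from the table entries as relative differences; for example, for DCT the schedule-length improvement is (96.25 − 80.53)/96.25 ≈ 16.33% and the utilization gain is (70.93 − 58.86)/58.86 ≈ 20.51%:

```python
def speed_improvement(std_len, dtd_len):
    """Percentage reduction in schedule length of DTD-MLF relative to STD-MLF."""
    return 100.0 * (std_len - dtd_len) / std_len

def utilization_gain(std_util, dtd_util):
    """Percentage increase in resource utilization of DTD-MLF relative to STD-MLF."""
    return 100.0 * (dtd_util - std_util) / std_util
```

Applying these to the DCT row of the table reproduces the 16.33% and 20.51% figures quoted above.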
In this paper, we have presented the DTD-MLF methodology for the on-chip heterogeneous reconfigurable computing platform RHSCS and estimated its effectiveness in executing selected benchmark applications. The RHSCS has been realized on a Virtex-5 FPGA device. The RHSCS contains a MicroBlaze as the softcore PE and multiple RLUs configured on the FPGA as hardcore PEs. A few benchmark applications have been represented as DAGs, and the design attributes of the tasks in each DAG were obtained offline by executing them on the resources of the RHSCS. These design attributes were used to compute the cost function, Minimum Laxity First (MLF), which acts as the criterion for task distribution. The benchmark applications represented as DAGs were distributed onto the resources of the RHSCS based on the DTD-MLF and STD-MLF methodologies. Compared to STD-MLF, the DTD-MLF model boosted the execution speed of the benchmark applications by up to 34.96% and enhanced RHSCS resource utilization by up to 53.76% for the chosen benchmark applications.
The authors declare that there is no conflict of interests regarding the publication of this paper.