The high integration density in today's VLSI chips offers enormous computing power to be utilized by the design of parallel computing hardware. The implementation of computationally intensive algorithms represented by
Today’s reconfigurable SoCs (RSoCs) feature processing elements (PEs) with a significant amount of programmable logic fabric on the same die. Managing complexity and tapping the full potential of these RSoC architectures present many challenges [
The main contribution of this paper is an augmented approach to the heuristic search. We propose a new method of identifying the subspace to which the PE array is to be assigned, based on the directional index of the computational expression, as explained in Section
A modified heuristic search is implemented using the proposed procedure to determine the optimal solution to the
The paper is organized as follows: in Section
The multidimensional (
Considering the input data set to the algorithm, the input data is represented using letter
Do Do Do or End Do End Do End Do
The input data set and computation in the first (
The output data is represented as
The vector representing the update direction in this example is given as
The form of representation of the
The functions
A general
First an (
The
Any node in the iteration space is
In an alternate generalization, we represent the
(a) The (
The collection of indexed node positions in (
A part of the output expression termed as the computational expression is assumed to be computed in the inner loop formed by the (
Following this, the PE array assignment is done to next
In the previous section, the (
In general, the update direction of the computational expression is defined as a vector termed the Computational Trail Expression (CTV) of the
The mapping methodology used in the heuristic search of the mapping transformation matrix
Generate the iteration space for the
Find the data dependencies in the algorithm and formulate the dependence vector
The causality constraint is checked using (
Dependence vectors for 2D filtering.
| | LHS assignment | RHS assignment | Dependence vector |
|---|---|---|---|
| Image data | | | |
| Image data | | | |
| Window coefficient | | | |
| Window coefficient | | | |
| Output | | | |
| Output | | | |
Generate or modify the search space for the
Choose a candidate
Save the candidate
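The search steps above can be sketched as a small enumeration loop. This is a minimal sketch under stated assumptions: the causality test used here (the dot product of a candidate vector with every dependence vector is at least 1) is the textbook systolic-array condition and may differ from the exact constraint used in this paper, whose equation is not reproduced here. The dependence matrix is copied from the 2D filtering dependence table; `is_causal`, `search_candidates`, and `bound` are hypothetical names.

```python
from itertools import product

# Dependence vectors dv1..dv6 for 2D filtering, one per column;
# the four rows correspond to the four index variables.
D = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
]

def is_causal(s, dep):
    """Assumed causality test: s . d >= 1 for every dependence vector d."""
    n_dep = len(dep[0])
    return all(
        sum(s[i] * dep[i][j] for i in range(len(s))) >= 1
        for j in range(n_dep)
    )

def search_candidates(dep, bound):
    """Generate every vector with entries in 0..bound and save the causal ones."""
    n = len(dep)
    return [s for s in product(range(bound + 1), repeat=n) if is_causal(s, dep)]

valid = search_candidates(D, 2)
print(len(valid))  # number of saved candidates for entries in 0..2
```

The exhaustive `product` call is what makes the search space grow so quickly with the bound and the loop dimension, which is the motivation for pruning the candidates before evaluating them.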
The steps of our modified heuristic search follow; they are based on the optimal allocation method developed in Section
Identify the scheduling direction. Once a layer of PEs is assigned to the (
The scheduling vector representing the scheduling direction represented by the
Prune down the valid
Evaluate the cost function as given in (
The plots of Figure
The delay edge is calculated by the direct method as explained in Section
Delay-edge determination—Step
Case (i) window size = |
Case (ii) |
| |
Image size = |
|
Image size = one window size |
|
|
|
1 0 1 0 0 0 | 1 0 1 0 0 0 |
0 1 0 1 0 0 | 0 1 0 1 0 0 |
1 0 0 0 0 1 | 1 0 0 0 0 1 |
0 1 0 0 1 0 | 0 1 0 0 1 0 |
To determine delays use | Delays |
Sdd vector = |
Sdd vector = |
sdd * |
sdd * |
Delays = |
sdd * |
To determine edge_connectivity use | sde = |
Sde vector = |
|
sde * |
= sde * |
ans = 3 1 3 1 0 0 | ans = 4 1 0 0 1 4 |
The delay edge matrix in Table
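The row-vector-times-dependence-matrix product used in the delay-edge table can be checked with a few lines of code. This is a sketch, not the authors' script: the two row vectors below are not given legibly in the table and are assumptions, chosen because their products with the dependence matrix reproduce the two `ans` rows; `row_times_matrix` is a hypothetical helper.

```python
# Dependence matrix for 2D filtering (rows: index variables, columns: dv1..dv6).
D = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
]

def row_times_matrix(v, m):
    """Return the row vector v * m, where m is a list of rows."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

# Assumed direction vectors (inferred so that the products match the table).
case_i = row_times_matrix([3, 1, 0, 0], D)
case_ii = row_times_matrix([0, 0, 4, 1], D)

print(case_i)   # [3, 1, 3, 1, 0, 0]
print(case_ii)  # [4, 1, 0, 0, 1, 4]
```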
Dependence vectors for each variable for 2D filtering.
| 2D filtering index variables | dv1 | dv2 | dv3 | dv4 | dv5 | dv6 | *** |
|---|---|---|---|---|---|---|---|
| | 1 | 0 | 1 | 0 | 0 | 0 | Next row window |
| | 0 | 1 | 0 | 1 | 0 | 0 | Next window-column wise |
| | 1 | 0 | 0 | 0 | 0 | 1 | PE array along |
| | 0 | 1 | 0 | 0 | 1 | 0 | |
*
**index variables.
The main objective is to find the
First we take the boundaries of the search space between which the
Overall, the implementation of the mapping methodology consists of two parts. The first is the heuristic search for the mapping, which yields near-optimal solutions; a feasible architecture is then selected by pruning these solutions based on Steps
The input to a high-level synthesis system is the problem expressed as a behavioural description in a high-level language. Optimization in high-level synthesis is performed at a level above the Boolean optimization done by RTL synthesis tools. This is suitable for hardware optimization of DSP and image processing algorithms [
The algorithm is described in C, and this high-level description serves as the input design specification to the high-level synthesis tool. The tool is used to obtain the Control Data Flow Graph (CDFG), which allows the designer to verify the design at a later stage and to trace data values as live variables in the registers associated with the PE hardware. The tool is also used to obtain the design space exploration results, which give the area versus latency tradeoff.
The problem formulation of Section
Window for the 2D filtering algorithm.
Digital convolution can be thought of as a moving window of operations, where the window, that is, the mask, is moved from left to right and from top to bottom.
The 2D image filtering problem is a representative example of a 4D nested loop involving 2D convolution, as in Listing
For( For(
{
For ( For ( //Output one window function evaluation Ends do End End
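As an illustration of the 4D nested loop structure, the following minimal sketch slides a window over an image and accumulates one output value per window position. It is not the authors' listing: the names `filter2d`, `image`, and `window` are hypothetical, only fully overlapping window positions are evaluated, and the window is applied without flipping (both are simplifying assumptions).

```python
def filter2d(image, window):
    """Slide `window` over `image` and return the sum of products at each position."""
    rows, cols = len(image), len(image[0])
    w_rows, w_cols = len(window), len(window[0])
    out = [[0] * (cols - w_cols + 1) for _ in range(rows - w_rows + 1)]
    for i in range(rows - w_rows + 1):          # window position, top to bottom
        for j in range(cols - w_cols + 1):      # window position, left to right
            acc = 0
            for m in range(w_rows):             # row inside the window
                for n in range(w_cols):         # column inside the window
                    acc += image[i + m][j + n] * window[m][n]
            out[i][j] = acc                     # one window function evaluation
    return out

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
window = [[1, 1], [1, 1]]
print(filter2d(image, window))  # [[12, 16], [24, 28]]
```

The two outer loops select the window position and the two inner loops evaluate one window, which gives the four loop indices of the iteration space.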
The nested loop formulation for the 2D filtering algorithm for image size
For
//[Dvw1 = 1 0 0 0] // next old data is moved in the ( // pixel and old data is moved in the (
The SAS of the 4D edge detection algorithm is in Listing
Listing
The delay edge mapping is obtained by the product of dependence matrix
Step
Mapping results for 2D filtering are given in Table
Heuristic search results for 2D filtering
NPE | Ncyc | | Reg. cost |
---|---|---|---|
12 | 12 | 1 0 1 3; 1 1 1 2 | 10 |
12 | 14 | 1 0 1 3; 1 1 1 4 | 14 |
12 | 18 | 1 0 1 3; 1 1 1 3 | 12 |
12 | 16 | 1 0 1 3; 1 1 1 1 | 8 |
12 | 12 | 1 0 1 3; 1 1 2 1 | 10 |
12 | 15 | 1 0 1 3; 1 1 2 2 | 12 |
12 | 17 | 1 0 1 3; 1 1 2 0 | 8 |
12 | 13 | 1 0 1 3; 1 1 2 4 | 16 |
12 | 21 | 1 0 1 3; 1 1 2 3 | 14 |
12 | 19 | 1 0 1 3; 1 1 2 1 | 10 |
12 | 15 | 1 0 1 3; 1 1 0 4 | 12 |
12 | 15 | 1 0 1 3; 1 1 0 3 | 10 |
12 | 13 | 1 0 1 3; 1 1 4 1 | 14 |
12 | 21 | 1 0 1 3; 1 1 4 2 | 16 |
12 | 23 | 1 0 1 3; 1 1 4 0 | 12 |
12 | 19 | 1 0 1 3; 1 1 4 4 | 20 |
12 | 27 | 1 0 1 3; 1 1 4 3 | 18 |
12 | 25 | 1 0 1 3; 1 1 4 1 | 14 |
12 | 21 | 1 0 1 3; 1 1 3 1 | 12 |
Mapping results for 2D filtering using the modified heuristic search
Window size = 3 × 3; 2D result arrived by using Step |
Window size = 4 × 3 | ||||
---|---|---|---|---|---|
[pe_arr, Ncyc_arr, |
[pe_arr Ncyc_arr |
||||
NPE | Ncyc |
|
NPE | NCYC |
|
9 | 9 | 1 0 0 4; 1 1 2 1 | 12 | 12 | 1 0 1 4; 1 1 3 1 |
9 | 9 | 1 0 0 4; 1 3 0 4 | 12 | 12 | 1 0 1 4; 1 3 1 4 |
9 | 9 | 1 0 0 4; 1 3 2 1 | 12 | 12 | 1 0 1 4; 1 3 3 1 |
9 | 9 | 1 0 0 4; 1 2 0 4 | 12 | 12 | 1 0 1 4; 1 2 1 4 |
9 | 9 | 1 0 0 4; 1 2 2 1 | 12 | 12 | 1 0 1 4; 1 2 3 1 |
9 | 9 | 1 0 0 4; 1 4 0 4 | 12 | 12 | 1 0 1 4; 1 4 1 4 |
9 | 9 | 1 0 0 4; 1 4 2 1 | 12 | 12 | 1 0 1 4; 1 4 3 1 |
9 | 9 | 1 0 0 4; 1 1 0 4 | 12 | 12 | 1 0 1 4; 1 1 1 4 |
9 | 9 | 1 0 0 4; 1 1 2 1 | 12 | 12 | 1 0 1 4; 1 1 3 1 |
9 | 9 | 1 0 0 4; 0 1 0 4 | 12 | 12 | 1 0 1 4; 0 1 1 4 |
9 | 9 | 1 0 0 4; 0 1 2 1 | 12 | 12 | 1 0 1 4; 0 1 3 1 |
9 | 9 | 1 0 0 4; 0 3 0 4 | 12 | 12 | 1 0 1 4; 0 3 1 4 |
9 | 9 | 1 0 0 4; 0 3 2 1 | 12 | 12 | 1 0 1 4; 0 3 3 1 |
9 | 9 | 1 0 0 4; 0 2 0 4 | 12 | 12 | 1 0 1 4; 0 2 1 4 |
9 | 9 | 1 0 0 4; 0 2 2 1 | 12 | 12 | 1 0 1 4; 0 2 3 1 |
9 | 9 | 1 0 0 4; 0 4 0 4 | 12 | 12 | 1 0 1 4; 0 4 1 4 |
9 | 9 | 1 0 0 4; 0 4 2 1 | 12 | 12 | 1 0 1 4; 0 4 3 1 |
9 | 9 | 1 0 0 4; 0 1 0 4 | |||
9 | 9 | 1 0 0 4; 0 1 2 1 | |||
9 | 9 | 1 0 0 4; 2 1 0 4 | |||
9 | 9 | 1 0 0 4; 2 1 2 1 |
*Search space for
The mapping was performed for a 1D array. The generalized form of the space-time mapping matrix
The delay edge mapping is obtained by the product of dependence matrix
In the direct method, the delay edge connectivity is derived from the dependence vectors, as given in Table
The delay edge matrix based on the heuristic search is used to calculate the cost, as given in Table
With the proposed modified search algorithm, the number of PEs and the number of clock cycles are (9, 9) or (12, 12), as in Table
As mentioned above, the delay edge connectivity is obtained directly from the dependence matrix by considering the scheduling directions for the delays and the PE directions for the edges, as discussed in Section
The cost function is defined as (
Figure
Architecture for 2D filtering algorithm for window size
(a) Plot for Table
Figure
The main objective is to find the
Dependence vectors formulations have been presented for a reduced index space 4D FSBM algorithm [
The mapping results after the search are presented here.
The heuristic search results of Tables
4D FSBM—heuristic search.
Mmat | NPE (I) | Ncyc (II) | Reg. cost (III) | Total cost = 0.4 * I + 0.4 * II + 0.2 * III |
---|---|---|---|---|
0 1 0 1 | 9 | 24 | 16 | 15.35 |
0 1 0 0 | 9 | 27 | Edge | |
0 1 1 1 | 9 | 24 | 19 | 15.5 |
0 1 0 0 | 9 | 27 | Edge | |
1 1 0 1 | 9 | 16 | 68 | 14.75 |
0 1 0 0 | 9 | 19 | Edge | |
1 1 1 1 | 9 | 24 | 71 | 18.1 |
0 1 0 0 | 9 | 27 | Edge | |
1 0 0 1 | 9 | 16 | 52 | 13.95 |
0 1 0 0 | 9 | 19 | Edge | |
1 0 1 1 | 9 | 24 | 54 | 17.25 |
0 1 0 0 | 9 | 27 | Edge | |
3 1 0 1 | 9 | 16 | 172 | 19.95 |
0 1 0 0 | 9 | 19 | Edge | |
3 1 1 1 | 9 | 24 | 174 | 23.25 |
0 1 0 0 | 9 | 27 | Edge | |
3 0 0 1 | 9 | 16 | 158 | 19.25 |
0 1 0 0 | 9 | 19 | Edge | |
3 0 1 1 | 12 | 24 | 160 | 24.2 |
0 1 0 0 | 12 | 27 | Edge | |
9 1 1 1 | 12 | 16 | 500 | 38 |
0 1 0 0 | 12 | 19 | Edge |
Results of modified method for 4D FSBM algorithm for
>> [pe_arr Ncyc_arr, Mmat] | Reg. | Total cost = 0.4 * I + 0.4 |
---|---|---|
0 0 1 1 4 0 | 35 | 17 |
9 16 1 0 0 1 | 10 | |
0 0 1 3 2 0 | 56 | 21.2 |
9 16 1 0 0 1 | 10 | |
0 0 1 2 3 0 | 43 | 18.6 |
9 16 1 0 0 1 | 10 | |
0 0 1 4 1 0 | 69 | 23.8 |
9 16 1 0 0 1 | 10 | |
0 0 1 1 4 0 | 30 | 16 |
9 16 1 0 0 1 | 10 | |
0 0 0 1 4 0 | 28 | 15.6 |
9 16 1 0 0 1 | 10 | |
0 0 0 3 2 0 | 54 | 20.8 |
9 16 1 0 0 1 | 10 | |
0 0 0 2 3 0 | 41 | 18.2 |
9 16 1 0 0 1 | 10 | |
0 0 0 4 1 0 | 67 | 23.4 |
9 16 1 0 0 1 | 10 | |
0 0 0 1 4 0 | 28 | 15.6 |
9 16 1 0 0 1 | 10 | |
0 0 2 1 4 0 | 31 | 16.2 |
9 16 1 0 0 1 | 10 | |
0 0 2 3 2 0 | 58 | 21.6 |
9 16 1 0 0 1 | 10 | |
0 0 2 2 3 0 | 45 | 38.5 |
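The weighted total cost used in the table headers (0.4 * I + 0.4 * II + 0.2 * III, with I the number of PEs, II the number of cycles, and III the register cost) can be sketched as follows. The function and parameter names are hypothetical, and exposing the weights as parameters is a design choice of this sketch; the sample triples are taken from rows of the table above.

```python
def total_cost(npe, ncyc, reg_cost, w_pe=0.4, w_cyc=0.4, w_reg=0.2):
    """Weighted sum of PE count, cycle count, and register cost."""
    return w_pe * npe + w_cyc * ncyc + w_reg * reg_cost

# (NPE, Ncyc, register cost) triples from the modified-method results.
candidates = [(9, 16, 35), (9, 16, 56), (9, 16, 28)]

for npe, ncyc, reg in candidates:
    print(npe, ncyc, reg, round(total_cost(npe, ncyc, reg), 2))

# The candidate with the lowest weighted cost is the preferred mapping.
best = min(candidates, key=lambda c: total_cost(*c))
print(best)  # (9, 16, 28)
```

Because the PE and cycle counts are identical across these candidates, the register cost alone decides the ranking.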
Graph-search results for reduced 4D FSBM heuristic search-cost function versus (normalized area and cycles) for Table
The final delay edge is given as follows:
The second row gives the edges, and the first row gives the connected registers, obtained as the highest nonzero value in the
The architecture arrived at on this basis is shown in Figure
FSBM architecture after design space exploration.
The design space exploration results for the resulting architecture are presented below.
The architecture in Figure
CDFG of the FSBM architecture in high-level synthesis tool.
The high-level synthesis tool allows the designer to enter the timing constraint as cadency values and to obtain the hardware allocation tradeoff, as shown in Table
Design space exploration of the FSBM for
Cadency | Operators, stages | Area | % use rate | Number of muxes | FF | Latency |
---|---|---|---|---|---|---|
40 | 22, 2 | 88 | 100 | 48 | 336 | 60 |
50 | 8, 2 | 64 | 100 | 96 | 288 | 80 |
100 | 5, 2 | 40 | 60, 90, 10, 10 | 160 | 224 | 120 |
150 | 2, 1 | 16 | 60 | 128 | 144 | 140 |
200 | 2, 1 | 16 | 45 | 128 | 144 | 140 |
Search results of MATLAB for
Npe | Ncyc | Reg. | Total cost |
---|---|---|---|
25 | 16 | 8 | 18 |
40 | 4 | 99 | 37 |
25 | 16 | 8 | 18 |
40 | 4 | 290 | 75 |
1 | 379 | 393 | 231 |
40 | 4 | 291 | 75 |
16 | 28 | 27 | 23 |
25 | 19 | 293 | 76 |
40 | 19 | 293 | 82 |
Design space exploration GAUT-FSBM for
|
Area | Number of operators | Latency |
---|---|---|---|
50 | 144 | 18 | 90 |
60 | 152 | 19 | 80 |
70 | 112 | 14 | 100 |
80 | 112 | 14 | 100 |
100 | 104 | 13 | 120 |
150 | 64 | 8 | 170 |
200 | 32 | 4 | 180 |
300 | 40 | 5 | 320 |
400 | 16 | 2 | 340 |
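One way to read the area versus latency tradeoff in this table is to keep only the design points that are not dominated, that is, points for which no other point has both smaller-or-equal area and smaller-or-equal latency. This Pareto filter is an illustration added here, not a step described by the authors; `pareto_front` is a hypothetical helper and the (area, latency) pairs are copied from the table above.

```python
def pareto_front(points):
    """Return the non-dominated (area, latency) points, sorted by area."""
    unique = sorted(set(points))
    return [
        p for p in unique
        if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in unique)
    ]

# (area, latency) pairs from the design space exploration table.
dse = [(144, 90), (152, 80), (112, 100), (112, 100), (104, 120),
       (64, 170), (32, 180), (40, 320), (16, 340)]

front = pareto_front(dse)
print(front)
# [(16, 340), (32, 180), (64, 170), (104, 120), (112, 100), (144, 90), (152, 80)]
```

Under this reading, the point with area 40 and latency 320 is dominated by the point with area 32 and latency 180.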
The search range
The results of the above are shown in Figure
Design space exploration using HLS tool (Tables
The merit of the modified heuristic algorithm is measured in terms of the search space complexity.
In heuristic search procedures, the loop bounds are generally taken as the maximum values for searching. However, as the loop bounds and the nested loop dimension increase, the search space becomes huge if vectors are generated exhaustively. A graphical representation of search space expansion with respect to the different values of
Plot showing the search space size and FSBM algorithm parameter
The “a” bars show the search space obtained by taking the loop bounds, say
Tables
6D problem—full search block motion estimation (FSBM) problem
|
S&K-2D array | Our work |
Use of direct |
2D array considered as 1D array |
---|---|---|---|---|
|
Image file—image size— |
-do-* | -do- | -do- |
| ||||
|
|
-do- | -do- | -do- |
| ||||
|
0, 1,−1, |
-do- | -do- | -do- |
| ||||
CTV | Nil |
|
|
-do- |
| ||||
Scheduling direction = sd | Nil |
|
-do- | -do- |
| ||||
Search space complexity— |
|
Pruned down using |
Pruned down using |
|
| ||||
|
|
Pruned down using |
Nil*** | Nil |
| ||||
Example | $66^{12} + 66^{6}$ = | $66^{8} + 66^{4}$ | $66^{8}$ | $66^{4}$ |
*-do-entry same as in previous column,***nil: not defined/not applicable.
Reduced index space
|
S&K-2D array | Our work-2D array |
Use of |
2D array considered as 1D array |
---|---|---|---|---|
|
Image file—image size— |
-do- | -do-* | -do- |
| ||||
|
|
-do- | -do- | -do- |
| ||||
|
0, 1, –1, |
-do- | -do- | -do- |
| ||||
CTV | Nil*** |
|
|
-do- |
| ||||
Scheduling direction = sd | Nil*** |
|
-do- | -do- |
| ||||
Search space complexity— |
|
Pruned down using |
Pruned down using |
|
| ||||
|
|
Pruned down using |
Nil*** | Nil |
| ||||
Example | $17^{8} + 17^{4}$ = | $17^{4} + 17^{2}$ | $17^{4}$ | $17^{2}$ |
**note 4
***nil: not defined; *do entry same as in previous column.
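The scale of the reduction in the Example row of the reduced index space table can be made concrete with a short calculation. This sketch assumes the entries are powers of 17, where the exponent is the number of unknown vector or matrix entries and 17 is the number of values tried per entry; that reading of the notation is an assumption, and `search_space_size` is a hypothetical helper.

```python
def search_space_size(values_per_entry, n_entries):
    """Number of candidates when every entry is tried exhaustively."""
    return values_per_entry ** n_entries

# Exhaustive search: 8 matrix entries plus 4 vector entries (assumed reading).
full = search_space_size(17, 8) + search_space_size(17, 4)
# Pruned search: 4 entries plus 2 entries (assumed reading).
pruned = search_space_size(17, 4) + search_space_size(17, 2)

print(full)            # 6975840962
print(pruned)          # 83810
print(full // pruned)  # integer reduction factor
```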
Table
The reduction in search space by modifying the 6D algorithm to 4D as reported in [
Many of the computationally intensive algorithms are of