Design Space Exploration of Deeply Nested Loop 2D Filtering and 6 Level FSBM Algorithm Mapped onto Systolic Array

The high integration density in today’s VLSI chips offers enormous computing power to be utilized by the design of parallel computing hardware. The implementation of computationally intensive algorithms represented by n-dimensional (n-D) nested loop algorithms, onto parallel array architecture is termed as mapping. The methodologies adopted for mapping these algorithms onto parallel hardware often use heuristic search that requires a lot of computational effort to obtain near optimal solutions. We propose a new mapping procedure wherein a lower dimensional subspace (of the n-D problem space) of inner loop is identified, in which lies the computational expression that generates the output or outputs of the n-D problem. The processing elements (PE array) are assigned to the identified sub-space and the reuse of the PE array is through the assignment of the PE array to the successive sub-spaces in consecutive clock cycles/periods (CPs) to complete the computational tasks of the n-D problem. The above is used to develop our proposed modified heuristic search to arrive at optimal design and the complexity comparisons are given. The MATLAB results of the new search and the design space trade-off analysis using the high-level synthesis tool are presented for two typical computationally intensive nested loop algorithms—the 6D FSBM and the 4D edge detection alternatively known as the 2D filtering algorithm.


Prelude to the New Search Method.
Today's reconfigurable SoCs feature processing elements (PEs) with significant amount of programmable logic fabric present on the same die. The management of complexity and tapping the full potential of these RSoC architectures present many challenges [1]. A large number of heuristic algorithms have been used in developing many novel scheduling and mapping algorithms [2][3][4][5]. However, these approaches face difficulties in dealing with large execution times.
n-dimensional (n-D) nested loop representations are used in the formulation of numerous computationally intensive multimedia computing/image processing and signal processing algorithms. Systolic array design style can effectively exploit parallelism inherent in the nested loop algorithm and, therefore, reduce processing time [2,3]. Often heuristic procedures are used to search for the mapping transformations that are used to map the nested loop algorithms onto array architectures [4,5]. Since the effort that goes into heuristic search is large and complex, the challenge lies in improving the process to reduce the computational effort in getting the mapping results.
Our main contribution in this paper is that we propose an augmented approach to the heuristic search. A new method of identifying the subspace to which the PE array is to be assigned is proposed based on the directional index of the computational expression that is explained in Section 2. The new vectors and terminologies used in the procedure are defined and elaborated in Section 2.
A modified heuristic search is implemented using the proposed procedure to determine the optimal solution to the n-D problem. The complexity analysis is performed by comparing the search space used in our method with the search space in [4]. The high-level synthesis tool GAUT is used to plot the design space trade-off curves to obtain the design space exploration curves.

VLSI Design
The paper is organized as follows: in Section 3, mapping steps used in the heuristic method and our proposed modified search method are described. The 4D nested loop formulation of the 2D filtering problem is explained in Section 4. The methodology and the implementation of the above approach for the 2D filtering algorithm and the mapping results are presented in Section 4. The mapping process for 6D FSBM is elaborated in Section 5 followed by the results of the heuristic search for the reduced 4D FSBM and modified heuristic search for the same in Section 6. The design trade-off results using the high-level synthesis tool GAUT are presented in Section 6. Section 7 discusses the complexity considerations and comparisons. Section 8 gives the conclusion and future work.

Data Representations.
Considering the input data set to the algorithm, the input data is represented using letter A with subscript z. The input data set consists of collection of data A 1 , A 2 , . . . , A k where k is some constant integer number. Each of this type of data is associated with the axis vector I. For example, for A 1 , we call it as A 1 (I). Now every such data is associated with a particular axis component i z in I. i z is the axis vector along which the data A 1 (I) is read into the n-D multidimensional algorithm using a set of ports. The input data is represented as A 1 (i 1 , I 2 , i 3 , . . . .i n ). This means that input data A 1 (I) is fed along i 2 axis. The corresponding word size is k 2 , and the port size required to feed this data is k 2 . The input data is reused either within the same computation or in different computations within iteration (depending on the application considered). If the reuse is within the same clock cycle/clock period CP, it is made possible by propagating the data (with zero delay) termed as data broadcast. The reuse direction of each data is represented by the directional vector termed as the "dependence vector"-D v . D v is determined as follows: as shown in Listing 1, the data A 1 on the LHS is assigned from the data A 1 (i 1 , I 2 , i 3 − 1, . . . , i n ) on the RHS in equation (1a) in Listing 1. This means that it is broadcast within the same iteration in the i 3 direction and fed along the i 2 axis using k 2 ports (Figure 1).
The output data is represented as C(i 1 , i 2 , i 3 , . . . , I n − 1) which means that the data is output along i n axis and propagated along i n direction. When we consider the output data, the word "propagation" is replaced by the term "update direction." The vector associated with update direction is termed as the Computational Trail Vector (CTV). The updation of CTV may be with delay or without delay as demanded by the application.
The vector representing the update direction in this example is given as The form of representation of the n-D algorithm in Listing 1 wherein the broadcast direction and computations are shown with the complete detail is termed as the uniform recurrence relation or the URE form of the n-D nested loop algorithm. In the expression (2) for CTV in Listing 1, the computational output data C is represented as C[ I] (arrow line on top of the symbol) which indicates that it is associated with an update direction. The corresponding vector d in the RHS of (2) represents the CTV defined in (1).
The functions f 1 and f 2 in (3) in Listing 1 are simple commutative operators which are executed independent of any other output component computations of C. These are assumed to be operators with no precedence constraint. f 2 especially is an operator that has no precedence constraint. It needs not wait for any past computations. It can proceed independently provided as much parallel hardware is available. There is only one output computation expression in Listing 1. Listing 1 is said to have a single CTV with no precedence constraint.

n-D Nested Loop Problem.
A general n-D nested loop algorithm is illustrated in Listing 1. i 1 , i 2 , and i 3 , . . . , i n are the loop indices. Together they form the n-D (iteration) index space. Representation of the n-D loop computations as a dependence graph (DG) leads to each point in the index space corresponding to a single node in the DG. Theoretically each node can be assigned a processing element (PE). The n-D iteration space is constructed as follows.

An n-D Iteration Space Computation in
Terms of (n−1)-D Subspace. First an (n−1)-D dependence graph (DG) as in Figure 1 with an (n−1) multidimensional indexed positions given by is constructed showing the data input directions and data broadcast directions. Here we show one of the data input directions and data broadcast directions for the sake of illustration. The data specifications or the dependence relations within each cell in the iteration space show the different data broadcast directions as shown in Figure 1.
The n-D iteration space is constructed by replicating the (n−1)-D iterationspace along the i n direction. Each (n−1)-D subspace is termed as a cell (or iteration). An array of PE is assigned to this cell, and the computation of the cell is completed in 1 clock period (CP). In the next CP, the PE array is assigned to the next cell along the i n direction. The direction of PE array assignment to consecutive subspaces is termed as the scheduling direction VLSI Design 3 Do i 1 = 1 to k 1 ; Do i 2 = 1 to k 2 ; Do i 3 = 1 to k 3 ; Listing 1: n-D multidimensional algorithm in URE form. represented as the scheduling vector sd. As per Listing 1, the CTV is also updated along the same i n direction. The CTV is partially updated in CP1, and the updation continues as the scheduling advances along the i n direction in every CP till the completion of computation in k n CPs.

Mapping and Scheduling.
Any node in the iteration space is N[i 1 , i 2 , i 3 , . . . , 1] and is mapped onto the PE array assigned to the iteration subspace. This is termed as mapping. The time "t" at which the node N[i 1 , i 2 , i 3 , . . . , 1] is mapped on the PE in the PE array is termed as scheduling. The mapping and scheduling are derived for each application in detail in the corresponding sections.

Computation of n-D Iteration Space Using an (n−2)-D Subspace.
In an alternate generalization, we represent the n-D nested loop problem as identified to have an iterative (n−2)-D subspace as shown in Figure 2. An (n−2)-D dependence graph (DG) with an (n−2)-D multidimensional indexed positions is given by The collection of indexed node positions in (3) is termed as the (n−2)-D subspace or hyperplane, which is represented showing the data input directions and data broadcast directions in Figure 2(a). The n-D iteration space computation is completed by replicating the (n−2)-D DG. We expand the iteration space along the i n−1 direction, followed by its expansion along i n direction. Each (n−2)-D subspace is termed as a cell or iteration cell. An array of PE is assigned to this cell, and the computation of the cell is completed in 1 CP.
A part of the output expression termed as the computational expression is assumed to be computed in the inner loop formed by the (n−2)-D iteration space as depicted in Figure 2(a). The directional index representing the propagation direction or the update direction of the computational expression is termed as the Computational Trail Vector (CTV). The CTV is partially updated in CP1, and the updation continues as the scheduling advances along the i n−1 , showing that in the next CP the PE array is assigned to the next iteration cell along the i n−1 direction (as shown in Figure 2(b)) to complete the first row of computation in k n−1 CPs. The sequence direction of subspace assigned to the PE array in consecutive CPs is termed as the scheduling direction represented by the scheduling vector sd 1 , which is along the i n−1 direction, and CTV is also updated along the same i n−1 direction.
Following this, the PE array assignment is done to next i n giving the scheduling vector sd 2 as i n as in Figure 2(b). The total number of CPs used to complete the computation is k n−1 × k n .

n-D to (n − x)-D with CTV and Scheduling Directions.
In the previous section, the (n−1)-D subspace is built using a sequence of (n−2)-D subspaces by scheduling along Figure 2: (a) The (n−2)-D iteration cell. (b) (n−2)-D iteration space with scheduling sd 1 along i n−1 direction CTV is also updated along the same i n−1 direction, followed by sd 2 along i n , and CTV is also updated along the same i n direction.

VLSI Design
the appropriate (n−1)th dimension followed by scheduling along the appropriate nth dimension-say along i n with an assumption that CTV has the same direction as the scheduling vector which may not be true always. There are two approaches to complete the n-D computation using the (n−2)-D subspaces. The PE array assignment to the (n−2)-D subspace is one order closer to the physical realization. For a practical implementation, this process has to be continued down to 2D level. In general, the direction of updation of the computational expression is defined as a vector termed as the Computational Trail Expression (CTV) of the n-D problem. We identify the corresponding (n − x)-D computational hyperplane in which the CTV lies, forming an (n − x)-D subspace in the n-D space. The PE array is assigned to this plane. This is followed by the reuse of the (n − x)-D plane along the scheduling direction/s.

Methodology of Mapping
The mapping methodology used in the heuristic search of the mapping transformation matrix M is explained hereafter. In general, the mapping matrix M is constituted of the timing vector or hyperplane S and the space matrix or vector also called the space hyperplane P [6,7]. Any node in the iteration

Heuristic Method [4]
Step 1. Generate the iteration space for the n-D nested loop application under consideration.
Step 2. Find the data dependencies in the algorithm and formulate the dependence vector D v .
Step 3. The causality constraint is checked for using (5), that is, whether the condition for all dependencies is satisfied, where D v is dependence vector for each data variable (Table 1). Choose those s elements of S which satisfy the condition.
Step 4. Generate or modify the search space for the M matrix (M set ) to satisfy the rank condition [4].
Step 5. Chose a candidate M matrix from the above set.
Step 6. Save the candidate M matrix in M result .

The Proposed Modified Heuristic Method.
The following are the steps in our approach for modification of the heuristic search based on the optimal allocation method evolved in Section 2. Identify the scheduling direction. Once a layer of PEs is assigned to the (n − x)-D subspace, the same array of PEs is to be used in the next computation. This reuse direction is known as the scheduling direction in Section 2. All these conditions are used in the modified heuristic search procedure in the following steps.  Step 7. The scheduling vector representing the scheduling direction represented by the sd 1 vector is used to prune down the valid M matrices.
Step 8. Prune down the valid M matrices by choosing the (n − x)-D subspace to which the PE plane is discussed. This is done by identifying the iterative subspace. To summarise, the selected M mat is obtained by pruning down the M result using the CTV and PE plane assignment done as discussed in Section 2.
Step 9. Evaluate the cost function as given in (10) in Section 5.2. If Cost actual < Cost required , proceed to Step 6 else to Step 3.
The plots of Figure 5 show the comparison of heuristic method of Section 3.1 with the modified heuristic search method described in Section 3.2.

Direct Method
Step 10. The delay edge is calculated by the direct method as explained in Section 4.5. The results are presented in Table 2.
Step 11. The delay edge matrix in Table 2 is determined using the expression D v defined in Tables 1 and 3 for 2D filtering algorithm.

Mapping Process.
The main objective is to find the M matrix which consists of the processor allocation vector (P t ) and the scheduling vector (S t ).
array represented as a 1D array. * * index variables.
First we take the boundaries of the search space between which the P t and S t are to be searched. The selection of search space is an important factor, because there is an exponential growth in both area and time complexity of the mapping methodology. Consider that U i , U j , U k , . . . , U n are the upper bounds of an n-dimensional nested loop algorithm. The heuristic followed in this work is to generate the search space that can be obtained by the following element set

Methods and Resources Used in Obtaining the Mapping
Methodology. As a whole, the implementation of the mapping methodology consists of two parts. The first is the heuristic search for the mapping. The heuristic search allows us to obtain the near optimal solutions and then pick up the feasible architecture by pruning the solutions based on Steps 4-9 as described in Section 3.3. The new mapping methodology is explained with respect to the 2D filtering algorithm in Section 4. The modified heuristic method based on the new method followed after implementing the steps in Section 3.1 are implemented using MATLAB to obtain the results of the search procedure of Sections 3.1 and 3.2. Also the comparative results between the heuristic and the modified heuristic method for the 6D full search block motion estimation (FSBM) algorithm are given. The second part is the design space exploration of resultant architecture. It is obtained as explained in the next section.

High-Level Synthesis (HLS)
. The input to high-level synthesis system is the problem represented in behavioural description in a high-level language. The optimization in a high-level synthesis is done at a level higher than the boolean optimization done by the RTL synthesis tools. This is suitable for hardware optimization of DSP and image processing algorithms [8]. This is followed by scheduling and allocation [9]. The GAUT [10] tool used incorporates all the above features and allows the design space exploration.
The algorithm is described in a high-level description in C, and this is used as the input design specification to the high-level synthesis tool. The high-level synthesis tool is used to obtain the Control Data Flow Graph (CDFG). The CDFG allows the designer to verify the design required at a later stage. It allows the tracing of data values as live variables in 6 VLSI Design registers associated with the PE hardware. Also the high-level synthesis tool is used to obtain the design space exploration results which give the area Versus latency tradeoff.

2D Filtering for Image Processing: A 4D
Problem. The problem formulation of Section 2 and the methodology in Section 3 are applied to the 2D filtering problem. 2D filtering or convolution is one of the essential operations in digital image processing required for image enhancements. The grey levels are usually represented with a byte or 8bit unsigned binary number, ranging from 0 to 255 in decimal. Equation (6) shows the two dimensional discrete convolution algorithm, where I[x, y] represents the input pixel data image, W is the window coefficient, and O is the output image. The movement of the mask window function to calculate the window function value for the whole image region is shown in Figure 3: Digital convolution can be thought of as a moving window of operations, where the window that is, mask, is moved from left to right and from top to bottom.
The 2D image filtering problem is a representative example of a 4D nested loop involving 2D convolution, as in Listing 2 and Figure 3. The computation is highly redundant and requires high data reuse. This is considered here for systolic mapping. An image of size 0 to +k 1 ; 0 to +k 2 is considered convolved with a mask of size 0 to +k 3 ; 0 to +k 4 . The mask coefficients are stored in memory. The significant features of the algorithm are listed in the following section.

Nested Loop Formulation.
The nested loop formulation for the 2D filtering algorithm for image size k 1 × k 2 and window function size k 3 × k 4 is given in Listing 2-the same is represented in uniform recurrence equation form (URE) in Listing 3.

Single Assignment Statement Formulation or Uniform Recurrence Equation (URE) Form of 2D
Filtering. The SAS of the 4D edge detection algorithm is in Listing 3, and the dependence vectors for the four level algorithms have 4 indices and the index space is generated by varying the four index values till the upper limit of each index as in Listing 3. The dependencies give the propagation direction of the input variables and update direction of the output data. In Listing 3, W new represents the mask values in 2D filtering algorithm that are to be input at the fresh windowing and I new to indicate the loading of pixel values for a new frame of image.

Dependence Vectors for 2D Filtering Algorithm.
Listing 3 is well commented to bring out the formulation of the following dependence vectors in Table 1.

Delay-Edge Matrix-Direct Method of Determining Delay and Edge
Connectivity. The delay edge mapping is obtained by the product of dependence matrix (D v ) and M matrix as shown in Table 2.
Step 11 in the mapping process uses the dependence matrix to compute the edges and delays as follows: D v = [0 1 0 1; 1 0 1 0; 0 0 0 1; 0 0 1 0; 0 0 01; 0 0 1 0] (Table 1); the first half in each vector in D v stands for the scheduling direction and the second half for the PE array directions. The first half (termed as sdd vector-sdd) gives the delays associated with the corresponding edges given by the second half (sde vector = sde): This is computed and presented in Table 2.
Mapping results for 2D filtering are given in Table 4(a) for heuristic method, and Table 4(b) gives the modified heuristic method.

Space-Time Mapping Matrix (M)
Illustration. The mapping was performed for 1D array. The generalized form of space time mapping matrix M is given here as shown in (3); if P t = 0 0 3 1 ; S t = 4 1 5 1 .

Direct Method-Edge Connectivity and Delay Registers.
The direct method in deriving the delay edge connectivity is obtained from the dependence vector as given in Table 6.
(1) The delay edge matrix based on the heuristic search is used to calculate the cost as given in Table 4(a) and the above can be used to pick up the good solution based on minimum cost, but does always guarantee the feasibility. So we do not consider the delay edge obtained from this method.
(2) Using the proposed modified search algorithm, 9, 9 or 12, 12 are the number of PEs and number of clock cycles in Table 4(b) (for assumed window size in Table 4(b)) are arrived at after pruning down the search results using the PE-plane subspace based on CTV.
(3) As mentioned above, the delay edge connectivity is obtained directly from the dependence matrix directly by considering the scheduling directions for delays and considering the PE directions for the edges as discussed in Section 4.5 and as shown in Table 2 and the architecture is obtained using the mapping results and direct delay edge connectivity.

Mapping Results.
The cost function is defined as (10) and is used as an additional constraint mentioned to Step 9 in Section 3.2 for selecting architecture according to the modified heuristic method heuristic search Here a, b, c are the scalar coefficients which represent weights for the corresponding costs to minimize the overall cost function.

4.9.
Architecture. Figure 4 shows the architecture for edgedetection algorithm. It consists of 2 ports, one for accessing the image data and the other for the output. The architecture consists of w 1 × w 2 PEs, where w 1 × w 2 is the size of the window used. The intermediate output is propagated to the successive PEs within a row but has to be passed through a line buffer when passing the intermediate output between rows of PEs. The buffer width is equal to the number of pixels per row. The final output is at the w 1 × w 2 PE. Figure 5(a) shows the search results giving the possible solutions including the register cost. Registers represent the delays in the connecting edges which are the result of heuristic search, but which may not be feasible or realizable.
The Pareto optimal and near optimal solutions are shown in the plots Figures 5(a) and 5(b) based on the heuristic search and the modified heuristic search, respectively. The modified heuristic search developed by us picks up the good solutions with respect to the number of PEs and cycles concerned, but we see that the register cost does not reflect the Pareto optimal solution and does not guarantee feasibility. The delay-edge connectivity is obtained directly from the dependency vectors as explained in Section 3.3 and in Table 2 for 2D filtering and leads to the feasible architecture in Figure 4.

Mapping of 6D FSBM
The main objective is to find the M matrix which consists of the processor allocation vector (P t ) and the scheduling vector (S t ). The method used is same as explained in Section 3.

Dependencies for 6-Level FSBM Algorithm.
Dependence vectors formulations have been presented for a reduced index space 4D FSBM algorithm [11]. Due to lack of space, it is not presented.

Results of Modified Method for FSBM Algorithm.
The mapping results after the search are presented here. The heuristic search results of Tables 5 and 8(a) (using MATLAB) for p = 1 and 2, respectively, are shown in the graph in Figure 6. Table 5 Results This is obtained as a good solution from Table 5 by selecting the optimum cost taking into consideration the feasibility.

Delay-Edge Connectivity for FSBM Algorithm Using
(2) The final delay edge is given as follows: The second row is the edge, and the first row is the registers connected obtained as the highest nonzero value, in the D v values along other indices other than p-direction in Listing 2. p-direction is the direction of orientation of the systolic array (PE array) in the n-D problem space. The above gives a minimal cost connectivity and register delay elements simultaneously satisfying the feasibility and implementability checked by the direct method.

Architecture of the FSBM Algorithm
The architecture is arrived at, based on above is in Figure 7.

Design Space Exploration Using High-Level Synthesis.
The design space exploration results are presented in the following based on the architecture arrived at. Figure 7 is input using a behavioural description using a C type language to the GAUT tool, and it generates the control data flow graph (CDFG) architecture as in Figure 8 and also integrates into ModelSim and Xilinx ISE.    Table 7 for p = 1 for FSBM algorithm.

CDFG of the Design. The architecture in
6.4. Design Space Exploration for p = 2. The search range p in FSBM algorithm is increased to p = 2, and the design space exploration is done in MATLAB for the modified heuristic and also using the HLS GAUT tool. The results of the above are shown in Figure 9.  Table 5 for search range p = 1 and

Complexity Analysis
The merit of the modified heuristic algorithm is measured in terms of the search space complexity.

Search Space Complexity.
In general, in heuristic search procedures, the loop bounds are considered as the maximum values for searching. But as the loop bounds and the nested loop dimension increase, the search space will be huge if vectors are exhaustively generated. A graphical representation of search space expansion with respect to the different values of n for n-level nested loop algorithms is given in Figure 10.
The "a" bars show the search space obtained by taking the loop bounds, say −U i to +U i , as the limit for each variable, and the "b" bars are obtained by using our proposed modified heuristic elaborated in Section 3, where   Latency  50  144  18  90  60  152  19  80  70  112  14  100  80  112  14  100  100  104  13  120  150  64  8  170  200  32  4  180  300  40  5  320  400  16  2  340 it is observed from the plot in Figure 10 that the increase in cost is not high. Tables. Tables 9(a) and 9(b) show the complexity calculations for 6D FSBM and 4D FSBM and the proposed modified heuristic method whose results are in Tables 4(b) and 5(b). Table 9(a) shows the complexity calculations for varying values of n and gives a comparison between the general heuristic method and the method presented in this paper. [11] and 4D Problem-2D FIR Filtering Problem. The reduction in search space by Table 9 (a) 6D problem-full search block motion estimation (FSBM) problem n = 6 S&K-2D array Our work 2D array-use of sd Use of direct determination of S vector of expression

6D Problem Reduced to 4D FSBM
2D array considered as 1D array   modifying the 6D algorithm to 4D as reported in [11] and also the benefit of the modified heuristic are reflected by the last entry in Table 9(b).

Conclusion and Future Work
Many of the computationally intensive algorithms are of n-D deeply nested loop type. The methodology of mapping of algorithms involves heuristic search wherein the search complexity is large. The search space of the 2D filtering and 4D FSBM has been pruned down using the scheduling vector sd and the constraints imposed by it. The search has been performed using MATLAB, for the PE array assigned to the identified (n − x)-D subspace evolved with the nature of the CTV. The resultant mapping matrix is useful in determining the PE assignment and the exact clock cycle at which a particular node in n-D space represented by the DG is mapped onto a PE in the PE array. The search results are presented for 2 computationally intensive applications-2D filtering and the reduced index space 4D FSBM algorithm. The graph in Figure 5(a) corresponds to Table 4(a) showing the heuristic search results that show the distribution of PEs and cycles and cost. Figure 5(b) corresponds to Table 5(b) that gives the number of PEs and cycles pruned down after applying the modified heuristic algorithm. The delay edge connectivity is determined by the proposed direct approach as described in Sections 3.3 and 4.5 using Tables 2 and 4, instead of using the Mapping Transformation Matrix M or Tmat in Tables 4(a) and 5(a) as in [4]. The use of high-level synthesis tool is to obtain the CDFG. Also the design space exploration results obtained using high-level synthesis tool GAUT have been presented. The search have been performed for varying search ranges of P values P = 1 and P = 2 and the number of resources used, and latency for different input cadency values gives the design trade-off results presented in Tables 7 and 8(b) shown in the graph in the Figure 9. The output file of the GAUT tool could be used to interface with simulation tools and synthesis tools to build the RTL design and map it onto target FPGA architecture in the future for elaborate timing verification. The complexity comparison of our method with heuristic method is given in Tables 9(a) and 9(b).