Using PDM on Multiport Memory Allocation in Data Path

A data path consists of memory elements (e.g., registers), data operators (e.g., ALUs) and interconnection units (e.g., buses) to control the data transfers in the digital system. Many approaches to memory synthesis have been proposed in the literature. However, only single port memory is considered for register allocation, and there is no efficient approach to multiport memory synthesis. In this paper, an efficient method, the Partitioned Dependence Matrix (PDM), is presented for memory synthesis. It deals not only with single port memory synthesis but also with multiport memory synthesis according to the design constraints. With suitable modifications, the proposed technique can also be applied to multiport memory synthesis in which the maximum number of read ports is different from the maximum number of write ports. Therefore, the design space can be explored, and the method is capable of handling early architectural design exploration, so that the quality of designs produced by an automatic synthesis tool is more adequate for production use in comparison with manual design. Illustrations of applying this method to different synthesis examples are presented. Results and improvements over previous techniques are demonstrated. A key element in our approach is the successful adoption of techniques originally developed for problems in test generation to the field of memory synthesis.

INTRODUCTION
From the input specification, the synthesis system produces a description of a register-transfer (RT) level structure that realizes a specified behavior. The structure may be divided into two parts, a data path and a control unit. The data path consists of memory elements (e.g., registers), data operators (e.g., ALUs) and interconnection units (e.g., buses) to control the data transfers in the digital system. The control unit is a finite state machine capable of generating the control signals to evoke a data transfer in the data path so as to produce the specified behavior. High-level data path synthesis is concerned with the automatic generation and allocation of registers, ALUs and buses. Many approaches to automated data path synthesis have been proposed in the literature [1][2][3][4][5][6][7][8][9][10][11][17][18][19][20][21][22]. However, current high-level synthesis tools lack the capability to handle early architectural design exploration, so the quality of designs produced by an automatic synthesis tool is not completely adequate for production use in comparison with manual design.
An interesting and powerful approach to all phases of data path synthesis was proposed by Tseng and Siewiorek [1,3]. In their approach, the problems of register, operator and bus synthesis are cast into the "Generalized Clique Partitioning" (GCP) problem, in which it is necessary to partition the nodes of a compatibility graph into a set of disjoint clusters. Their objective is to partition the nodes of this graph in such a way that the minimal number of disjoint clusters is obtained. In terms of the corresponding data path element, this implies that the minimal hardware cost for a given degree of system concurrency is thereby achieved.
One of the drawbacks of their approach is that the time complexity of the GCP algorithm is large and involves following multiple paths with backtracking; this occurs especially in bus system synthesis. To overcome these problems, a more efficient approach, Weighted Cluster Partitioning (WCP), was presented in [12], which eliminates the need for backtracking. WCP, and a variation called WCP II, were shown to produce designs requiring the fewest number of buses in large synthesis examples. These algorithms have polynomial time complexity and also yield excellent results when applied to the test generation problem of Built-In Self-Test design in [13].
With suitable modification, WCP and WCP II can be applied to the other two phases of resource allocation. However, WCP and WCP II, like GCP, can only allocate registers with disjoint access requirements to single port memory modules. There are several advantages to merging registers into multiport memories, such as saving interconnections, reducing the number of multiplexers, minimizing the chip area, and improving testability because there are fewer design modules. Moreover, multiport memory synthesis can be applied to many applications according to the architectural design. In [11], it was shown how the problem of finding the maximal set of registers which can be grouped into a multiport memory module can be treated as a 0-1 integer programming problem. A branch-and-bound strategy [14] was then used to obtain the solution. As stated in [11], this technique uses a sequential approach (i.e., generating the memory modules in sequence), which generates a locally optimal solution and may not generate the globally optimal solution with the minimum number of multiport memory modules.
In this paper, we exploit and modify the algorithm that has been proposed to solve the test generation problem in Refs. [15,16] to generate a more globally optimal solution for multiport memory synthesis than the previous technique [11]. It will be shown how these algorithms can be modified and adapted to multiport memory synthesis. The limits of the problems we aim to solve and the definitions and algorithms of the proposed PDM techniques will be discussed. The efficiency of this approach is illustrated through some explicit synthesis examples. Section 5 describes how the PDM techniques can be adapted and modified for the synthesis of multiport memory with ports of different type. Simulation results are also presented.
In the following, Lemma 1 and Lemma 3 are the design constraints of multiport memory synthesis with ports of the same and of different type, respectively. They are adapted from Ref. [11] for the convenient discussion of the algorithm proposed in this paper.
Lemma 1: Let R1, R2, ..., Rm be registers. R1, R2, ..., Rm can be allocated to the same memory module with K ports if and only if no more than K of these registers are accessed simultaneously.
The ports in the multiport memory module may not all be of the same type, i.e., some may be read only while others may be write only or read/write. First, we will consider only multiport memory modules with ports of the same type (i.e. read/write). Second, with suitable modification, the technique can be applied to the synthesis of memory modules with ports of different type; this part of the discussion is presented in a later section.
Definition 1: A dependence matrix DM(C) for a code sequence C has s rows and n columns. Each row represents one of the control steps in the code sequence C and each column represents one of the registers. An entry is 1 if and only if the corresponding register is accessed in the specified control step. All other entries are 0. The dependence matrix DM(C) is easily calculated from a given code sequence C. In the following discussion, we use the code sequence of Ref. [1], shown in Table I, as an example. The DM(C) for the code sequence C in Table I is shown in Table II.
Definition 2: Given a code sequence C and a memory module with K ports, a partitioned dependence matrix PDM(C,K) corresponding to DM(C) partitions the columns of DM(C) into p sets such that:
(1) each row of a set has at most K 1-entries.
(2) the number of sets p is a minimum.
From the partitioned dependence matrices PDM(C,K) for the code sequence in Table I, we need eight single-port memory modules, four 2-port memory modules, three 3-port memory modules, or two 4-port memory modules.
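As a rough illustration of the dependence matrix and of partition condition (1), the following Python sketch builds a 0/1 matrix from a list of control steps and checks a candidate partition of its columns. The code sequence, the register names, and the helper names `dependence_matrix` and `is_valid_partition` are hypothetical; they are not taken from Table I or from the original implementation.

```python
from typing import List, Sequence, Set

def dependence_matrix(steps: Sequence[Set[str]],
                      registers: Sequence[str]) -> List[List[int]]:
    """One row per control step, one column per register.
    An entry is 1 iff the register is accessed in that step."""
    return [[1 if reg in accessed else 0 for reg in registers]
            for accessed in steps]

def is_valid_partition(dm: List[List[int]],
                       partition: Sequence[Sequence[int]],
                       k: int) -> bool:
    """Check that the sets cover every column exactly once and that
    each row of each set has at most k 1-entries (condition (1)).
    Minimality of the number of sets is NOT checked here."""
    n = len(dm[0]) if dm else 0
    covered = sorted(col for group in partition for col in group)
    if covered != list(range(n)):
        return False
    return all(sum(row[c] for c in group) <= k
               for group in partition for row in dm)

# Hypothetical code sequence: the registers accessed in each control step.
steps = [{"R1", "R2", "R3"}, {"R2", "R4"}, {"R1", "R3", "R4"}]
registers = ["R1", "R2", "R3", "R4"]
dm = dependence_matrix(steps, registers)

# Columns are 0-indexed: {R1, R2} and {R3, R4} as two 2-port modules.
print(is_valid_partition(dm, [[0, 1], [2, 3]], k=2))   # True
# A single 2-port module fails: step 1 accesses three registers.
print(is_valid_partition(dm, [[0, 1, 2, 3]], k=2))     # False
```

The check mirrors Lemma 1 directly: a set of registers can share a K-port module only if no control step accesses more than K of them.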

PDM: ALGORITHM, IMPLEMENTATION AND RESULTS
In this section, a detailed discussion of the algorithm used to form PDM(C,K) from DM(C) is presented.
We will make use of the technique proposed to solve the graph coloring problem posed in Ref. [16] and show that, with a suitable modification of the technique, it can be applied to solve the memory synthesis problem. The basic components of the algorithm are as follows. The columns of the DM(C) are ordered from 1 to n where n is the number of registers. An integer number label is assigned to each column. The first column is labeled as 1. The other columns are sequentially labeled with as small a label number as possible with some constraints (for details, see Definition 4 below.) After that, we check if the current maximum label equals the lower bound (for details, see Lemma 2 below). If it does, then that maximum label is the minimum number of sets p in PDM(C,K).
Those columns having the same label are allocated to the same set. In other words, the registers corresponding to columns having the same label are allocated to the same memory module. Otherwise, there is an attempt to decrease the label of the column which has the maximum label number and which is the lowest in the ordering. It can be shown that an efficient way to do this is to change the labels of the columns which are lower than the maximum label column in the ordering, so that the maximum label number can be decreased (i.e., the number of partitioned sets, p, is decreased). We continue this process until a minimal labeling scheme is found. Before we discuss the full details of the algorithm, we need to define some basic operations.
Definition 3: ColumnParents(CRi): the columns CRj which have a 1-entry in the same row as CRi and whose order (the ordinal of the column associated with a register) is less than that of CRi in the ordering (i.e. j < i).
Example: In Table II ...
Definition 4: Label(CRi): CRi is a column to be labeled. Assume that the columns CR1 through CR(i-1) have been labeled. The column CRi is labeled with as small a positive integer as possible, with the constraint that no more than K columns in ColumnParents(CRi) have the same label and the number is greater than zero.
Definition 5: ColumnAncestors(CRi): CRi is a column whose ancestors are to be determined. Every column in ColumnParents(CRi) is defined as a member of ColumnAncestors(CRi). Every column which is a ColumnParent of a ColumnAncestor of CRi is also a member of ColumnAncestors(CRi).
Example: In Table II ...
Definition 6: Relabel(s, maxorder, maxlabel): Let s be the order of the column where the relabel operation begins. Let CRmaxorder be the column which has the maximum label and is lowest in order (more than one column may have the maximum label) after the relabeling procedure, and let maxlabel be the maximum label before the relabeling process. The order of CRmaxorder is returned in maxorder. Starting from s + 1 to n, we label CRi by Label(CRi). If a column is labeled as maxlabel, or if the last column is labeled, then the procedure is completed. If Relabel is called with s = 1, then the label of CR1 is set to 1 and the regular procedure is executed.
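The ColumnParents and ColumnAncestors operations can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: columns are 0-indexed (so column 0 corresponds to CR1), the matrix is a hypothetical example rather than Table II, and the function names are not from the original C implementation.

```python
from typing import List, Set

def column_parents(dm: List[List[int]], i: int) -> Set[int]:
    """Columns j < i that have a 1-entry in some row where column i
    also has a 1-entry."""
    return {j for j in range(i) if any(row[i] and row[j] for row in dm)}

def column_ancestors(dm: List[List[int]], i: int) -> Set[int]:
    """Parents of column i, parents of those parents, and so on."""
    ancestors: Set[int] = set()
    stack = [i]
    while stack:
        for parent in column_parents(dm, stack.pop()):
            if parent not in ancestors:
                ancestors.add(parent)
                stack.append(parent)
    return ancestors

# Hypothetical dependence matrix: 3 control steps, 4 registers.
dm = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
print(column_parents(dm, 3))    # {2}
print(column_ancestors(dm, 3))  # {0, 1, 2}
```

Because parents always have a lower order than the column itself, the ancestor search only moves toward lower-ordered columns and terminates.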
Definition 7: Backtrack(s1, s2, flag, maxlabel): CRs1 is the column whose label is being decreased. If the backtrack procedure cannot make any improvement in the labeling, then it returns false in the flag. Otherwise, it returns true in the flag, which indicates that there exists a column CRs2 whose label may be increased so that the maximum label can be decreased. Let maxlabel be the maximum label before the backtracking process (i.e. the label of CRs1). Let V be the set of the columns whose order is lower than that of CRs1, and S be the set of ColumnAncestors of CRs1. The detailed procedure of backtrack includes two steps:
Step 1: If S is empty, then set flag to false and exit. Otherwise, set flag to true and find the largest ordered column CRs2 which belongs to both S and V. Let S = S - {CRs2}.
Step 2: The label of CRs2 should be increased as little as possible, with the constraints that no more than K columns in ColumnParents(CRs2) have the same label and that the label is less than maxlabel. Then, call the procedure Relabel(s2, maxorder, maxlabel). After the relabeling process, if the maximum label is decreased, then flag is set to true and the backtrack procedure exits. Otherwise, go to Step 1.
Example: In the code sequence of Example 1 (single port memory synthesis) shown in Table IV, after the application of the Label procedure the column registers CR1 and CR2 are labeled '1', CR3 is labeled '2', CR4 is labeled '3' and CR5 is labeled '4'. Therefore, maxorder is '5' and maxlabel is '4'. The Backtrack procedure is then applied to decrease the maxlabel number among all the column registers. In Table IV(c), Backtrack(5,3,true,4) indicates that CR5 is the column register whose label is being decreased and CR3 is the column register whose label is being increased (the true flag is returned, which indicates that there exists a column register CR3 whose label can be increased so that the maxlabel '4' may be decreased). The procedure Relabel(3,5,4) indicates that '3' is the order of the column register CR3 where the label operation begins. CR5 is the column register which has the maximum label '4' and is lowest in the order. So, starting from 4 to 5, we label CRi by the operation Label(CRi).

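Lemma 2:
The lower bound on the number of multiport memory modules, LowerBound, can be derived from the number of ports K of a memory module. Since each row of a set has at most K 1-entries, LowerBound = ⌈Nmax / K⌉, where Nmax is the maximum number of 1-entries in any row of DM(C).
The complete algorithm of forming PDM(C,K) from DM(C) can be stated as follows.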
Step 1: Order the columns CRi of DM(C) from 1 to n, where n is the number of registers.
Step 2: Set the label of CR1 to 1 and label the remaining columns sequentially by Label(CRi). Let maxlabel be the maximum label and maxorder the order of the lowest-ordered column having that label.
Step 3: If the label of the column whose order is maxorder is equal to the LowerBound, then go to Step 4. Otherwise, call Backtrack(maxorder, s2, flag, maxlabel). If flag is false, then go to Step 4. Otherwise, call Relabel(s2, maxorder, maxlabel). If the label of the column whose order is maxorder is less than maxlabel, then maxlabel is set to the lower label. Return to Step 3.
Step 4: The value of maxlabel is the minimum number of sets p in PDM(C,K). The columns of DM(C) that have the same labels belong to the same partition.
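The goal of the steps above can be sketched in Python as follows. This is not the Backtrack/Relabel procedure itself: the sketch labels the columns greedily in order and then substitutes a plain depth-first branch-and-bound over labels, which is simpler to state but exponential in the worst case. It also assumes the lower bound ⌈Nmax / K⌉ (Nmax being the largest number of 1-entries in a row), which follows from the requirement that each row of a set has at most K 1-entries. The matrix and all function names are hypothetical, and the matrix is assumed to be non-empty.

```python
import math
from typing import List

Matrix = List[List[int]]

def lower_bound(dm: Matrix, k: int) -> int:
    """Assumed bound: the busiest control step needs ceil(count / k) modules."""
    return math.ceil(max(sum(row) for row in dm) / k)

def fits(dm: Matrix, labels: List[int], col: int, label: int, k: int) -> bool:
    """True if giving the (currently unlabeled) column `col` this label keeps
    every row of that set within k 1-entries. Label 0 means 'unlabeled'."""
    for row in dm:
        if row[col]:
            used = sum(1 for j, lab in enumerate(labels)
                       if lab == label and row[j])
            if used + 1 > k:
                return False
    return True

def greedy_labels(dm: Matrix, k: int) -> List[int]:
    """Label columns in order with the smallest label that fits."""
    if k < 1:
        raise ValueError("a memory module needs at least one port")
    labels = [0] * len(dm[0])
    for col in range(len(labels)):
        label = 1
        while not fits(dm, labels, col, label, k):
            label += 1
        labels[col] = label
    return labels

def min_labels(dm: Matrix, k: int) -> List[int]:
    """Greedy labeling followed by a branch-and-bound search that only
    accepts labelings using fewer labels than the best one found so far."""
    best = greedy_labels(dm, k)
    bound = lower_bound(dm, k)
    n = len(best)
    labels = [0] * n

    def search(col: int, used: int) -> None:
        nonlocal best
        if col == n:
            if max(labels) < max(best):
                best = labels[:]
            return
        # Try existing labels, then at most one new label (symmetry breaking).
        for label in range(1, used + 2):
            if label >= max(best) or max(best) == bound:
                break
            if fits(dm, labels, col, label, k):
                labels[col] = label
                search(col + 1, max(used, label))
                labels[col] = 0

    search(0, 0)
    return best

# Hypothetical 4-step, 5-register dependence matrix.
example_dm = [
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 0, 0, 1],
]
print(greedy_labels(example_dm, 1))    # [1, 1, 2, 3, 4]
print(max(min_labels(example_dm, 1)))  # 3
```

On this matrix the in-order labeling alone needs four single-port modules, while the search reaches the assumed lower bound of three, which is the kind of improvement the backtracking step is meant to deliver.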
Example 1: Consider the single port memory synthesis of the registers in the code sequence shown in Table IV. Using the 0-1 integer programming technique proposed in [11], the problem reduces to

Max (x1 + x2 + x3 + x4 + x5)

The formulation of the algorithm in Ref. [11] attempts to include the largest number of registers in the current memory module, a desirable result. However, our overall objective is to find the minimum number of memory modules that will cover all the registers which do not have any conflict in use during memory access. Thus, one may expect that the local optimization produced by the 0-1 integer programming technique will act to limit the degree of global optimization that can be achieved. Besides, the formulation leads to the arbitrary selection of one of several equally valid choices at each step of the algorithm. Thus, many possible solutions may be obtained. In the above problem, the algorithm may produce the first grouping of registers to be {R1, R2} (that is, x1 = x2 = 1). Our procedure, in contrast, obtains a final result of three memory modules at its final step. Thus, the lack of any backtracking step in the 0-1 integer programming approach leads to the selection of a locally optimal but globally sub-optimal solution to the problem.
For completeness, we also used the proposed technique for multiport memory synthesis on the example code sequence in Ref. [1]. After applying the technique to the dependence matrix DM(C) in Table II, we find the partitioned dependence matrices PDM(C,K) for K = 1, 2, 3, 4 as shown in Tables III(a)-(d).

Lemma 3: Let R1, R2, ..., Rn be registers. They can be allocated to the same memory module with m ports, where r ports are read only, w ports are write only, and the remaining m - r - w ports are read/write, if and only if:
(1) no more than m - w of these registers are accessed by read instructions simultaneously.
(2) no more than m - r of these registers are accessed by write instructions simultaneously.
(3) no more than m of these registers are accessed by read/write instructions simultaneously.
Definition 8: A dependence matrix DM(C,R,W) for a code sequence C has s rows and n columns. Each row represents one of the control steps in the code sequence C and each column represents one of the registers. An entry is R (or W) if and only if the corresponding register is accessed by a read (or write) instruction in the specified control step. All other entries are 0.
The dependence matrix DM(C,R,W) is easily calculated from a given code sequence C. In the following discussion, we use the code sequence in Table I as an example. The DM(C,R,W) for the code sequence C in Table I is shown in Table VI.
The problem of finding the minimum number of memory modules with ports of different type for register allocation can still be solved by partitioning the dependence matrix DM(C,R,W) to form a partitioned dependence matrix PDM(C,m,r,w).
Definition 9: Given a code sequence C and a memory module M with m ports, where r ports are read only, w ports are write only, and the remaining m - r - w are read/write ports, a partitioned dependence matrix PDM(C,m,r,w) corresponding to DM(C,R,W) partitions the columns of DM(C,R,W) into p sets such that:
(1) each row of a set has at most m - w R-entries.
(2) each row of a set has at most m - r W-entries.
(3) each row of a set has at most m R/W-entries.
(4) the number of sets p is a minimum.
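The row conditions for ports of different type can be checked mechanically. The Python sketch below builds a matrix with entries 'R', 'W' or 0 and tests a candidate partition against the three per-row limits: at most m - w reads, at most m - r writes, and at most m accesses in total. The code sequence and the helper names are hypothetical, and the sketch assumes that a register is either read or written, but not both, within one control step.

```python
from typing import List, Sequence, Set, Tuple, Union

Entry = Union[str, int]  # 'R', 'W', or 0

def dependence_matrix_rw(steps: Sequence[Tuple[Set[str], Set[str]]],
                         registers: Sequence[str]) -> List[List[Entry]]:
    """steps holds one (reads, writes) pair of register sets per control step."""
    dm: List[List[Entry]] = []
    for reads, writes in steps:
        if reads & writes:
            raise ValueError("assumed: no register is read and written in one step")
        dm.append(["R" if reg in reads else "W" if reg in writes else 0
                   for reg in registers])
    return dm

def is_valid_partition_rw(dm: List[List[Entry]],
                          partition: Sequence[Sequence[int]],
                          m: int, r: int, w: int) -> bool:
    """Check the per-row limits for every set; minimality is not checked."""
    for group in partition:
        for row in dm:
            reads = sum(1 for c in group if row[c] == "R")
            writes = sum(1 for c in group if row[c] == "W")
            if reads > m - w or writes > m - r or reads + writes > m:
                return False
    return True

# Hypothetical two-step code sequence over three registers.
steps = [({"R1", "R2"}, {"R3"}), ({"R3"}, {"R1", "R2"})]
dm_rw = dependence_matrix_rw(steps, ["R1", "R2", "R3"])

# 3-port module with 2 read-only ports and 1 write-only port.
print(is_valid_partition_rw(dm_rw, [[0, 1, 2]], m=3, r=2, w=1))    # False
print(is_valid_partition_rw(dm_rw, [[0, 2], [1]], m=3, r=2, w=1))  # True
```

With all three ports read/write (r = w = 0) the single-module partition passes, which shows how the port mix, and not only the port count, decides whether registers can share a module.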
In the following a detailed discussion of the algorithm used to form PDM(C,m,r,w) from DM(C,R, W) is presented. We will make use of the technique for memory synthesis with ports of same type which was proposed in the last section and show that, with a suitable modification, it can be applied to solve the multiport memory synthesis problem with ports of different type.
Definition 10 (a): ColumnParents(CRi) for multiport memory with READ and WRITE ports, but without READ/WRITE ports: Let (CRi)r be the entry in the i-th column and r-th row of the dependence matrix DM. CRj is considered to be a parent of (CRi)r if and only if (i) CRj has an R-entry in the r-th row where CRi has an R-entry in the r-th row; or (ii) CRj has a W-entry in the r-th row where CRi has a W-entry in the r-th row.
ColumnParents of CRi are then defined as the union of all parents of (CRi)r for 1 ≤ r ≤ s.
Example: In Table VI ...
Definition 10 (b): ColumnParents(CRi) for multiport memory with READ/WRITE ports: CRj is considered to be a parent of (CRi)r if and only if CRj has an R-entry or W-entry in the r-th row where CRi has an R-entry or W-entry in the r-th row. ColumnParents of CRi are then defined as the union of all parents of (CRi)r for 1 ≤ r ≤ s.
Example: In Table VI, (CR7)1, (CR7)4 and (CR7)5 have no parents. CR3, CR4, CR5 and CR6 are parents of (CR7)2. CR1, CR3 and CR5 are parents of (CR7)3.
Therefore, ColumnParents of CR7 are the union of all parents of (CR7)r for 1 ≤ r ≤ 5, i.e. {CR1, CR3, CR4, CR5, CR6}.
Definition 11: Label(CRi): CRi is a column to be labeled. Assume that the columns CR1 through CR(i-1) have been labeled. The column CRi is labeled with as small a positive integer as possible with the following constraints:
(1) no more than m - w columns in ColumnParents(CRi) have the same label and have an R-entry in a row where CRi has an R-entry in the same row.
(2) no more than m - r columns in ColumnParents(CRi) have the same label and have a W-entry in a row where CRi has a W-entry in the same row.
(3) no more than m columns in ColumnParents(CRi) have the same label and the number is greater than zero.
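A labeling pass for ports of different type can be sketched in Python as below. One simplification is assumed: instead of counting parents with equal labels, the sketch enforces the row limits of the partitioned matrix directly (at most m - w reads, m - r writes and m accesses per row among columns with the same label, including the column being labeled). The matrix and the function names are hypothetical, and no backtracking is performed.

```python
from typing import List, Union

Entry = Union[str, int]  # 'R', 'W', or 0

def fits_rw(dm: List[List[Entry]], labels: List[int], col: int,
            label: int, m: int, r: int, w: int) -> bool:
    """True if the (currently unlabeled) column `col` can take `label`
    without breaking a row limit. Label 0 means 'unlabeled'."""
    for row in dm:
        if row[col] == 0:
            continue
        reads = sum(1 for j, lab in enumerate(labels)
                    if lab == label and row[j] == "R")
        writes = sum(1 for j, lab in enumerate(labels)
                     if lab == label and row[j] == "W")
        if row[col] == "R":
            reads += 1
        else:
            writes += 1
        if reads > m - w or writes > m - r or reads + writes > m:
            return False
    return True

def greedy_labels_rw(dm: List[List[Entry]], m: int, r: int, w: int) -> List[int]:
    """Label columns in order with the smallest label that fits."""
    n = len(dm[0])
    labels = [0] * n
    for col in range(n):
        label = 1
        while not fits_rw(dm, labels, col, label, m, r, w):
            label += 1
            if label > n:
                # Even a fresh module cannot serve this column's accesses.
                raise ValueError("port configuration cannot serve this access pattern")
        labels[col] = label
    return labels

# Hypothetical DM(C,R,W): 2 control steps, 3 registers.
dm_rw = [["R", "R", "W"],
         ["W", "W", "R"]]

# 3-port modules with 2 read-only ports and 1 write-only port.
print(greedy_labels_rw(dm_rw, m=3, r=2, w=1))  # [1, 2, 1]
# 3-port modules whose ports are all read/write.
print(greedy_labels_rw(dm_rw, m=3, r=0, w=0))  # [1, 1, 1]
```

The second register is pushed to a new module in the first call because two simultaneous writes exceed the single write-capable port, even though the total port count would allow three accesses.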
Definition 12: Relabel(s, maxorder, maxlabel): same as Definition 6.
Definition 13: Backtrack(s1, s2, flag, maxlabel): CRs1 is the column whose label is being decreased. If the backtrack procedure cannot make any improvement in the labeling, then it returns false in the flag. Otherwise, it returns true in the flag, which indicates that there exists a column CRs2 whose label may be increased so that the maximum label can be decreased. Let maxlabel be the maximum label before the backtracking process (i.e. the label of CRs1). Let V be the set of the columns whose order is lower than that of CRs1, and S be the set of ColumnAncestors of CRs1. The detailed procedure of backtrack includes two steps:
Step 1: If S is empty, then set flag to false and exit. Otherwise, set flag to true and find the largest ordered column CRs2 which belongs to both S and V. Let S = S - {CRs2}.
Step 2: The label of CRs2 should be increased as little as possible with the following constraints:
(1) no more than m - w columns in ColumnParents(CRs2) have the same label and have an R-entry in a row where CRs2 has an R-entry in the same row.
(2) no more than m - r columns in ColumnParents(CRs2) have the same label and have a W-entry in a row where CRs2 has a W-entry in the same row.
(3) no more than m columns in ColumnParents(CRs2) have the same label.
Moreover, the increased label should be less than maxlabel. Then, call the procedure Relabel(s2, maxorder, maxlabel). After the relabeling process, if the maximum label is decreased, then flag is set to true and the backtrack procedure exits. Otherwise, go to Step 1.
The complete algorithm of forming PDM(C,m,r,w) from DM(C,R,W) can be stated as follows.
Step 1: Order the columns CRi of DM(C,R,W) from 1 to n, where n is the number of registers.
Step 2: Set the label of CR1 to 1 and label the remaining columns sequentially by Label(CRi). Let maxlabel be the maximum label and maxorder the order of the lowest-ordered column having that label.
Step 3: If the label of the column whose order is maxorder is equal to the LowerBound, then go to Step 4. Otherwise, call Backtrack(maxorder, s2, flag, maxlabel). If flag is false, then go to Step 4. Otherwise, call Relabel(s2, maxorder, maxlabel). If the label of the column whose order is maxorder is less than maxlabel, then maxlabel is set to the lower label. Return to Step 3.
Step 4: The value of maxlabel is the minimum number of sets p in PDM(C,m,r, w). The columns of DM(C,R, W) that have the same labels belong to the same partition.
Example 3: Consider the 3-port memory synthesis (2 ports are read only, 1 port is write only) of the registers in the code sequence as shown in Table I.

SIMULATION RESULTS
The fifth order elliptic wave filter was chosen as a benchmark example, for which the force-directed schedule [17] is shown in Figure 1. The available hardware consists of two adders and one pipelined multiplier, where a multiplication requires an execution time twice as long as that of an addition. One input to the multiplier is always a constant and has been omitted in Figure 1.
Using the PDM with 2-port memory synthesis for the wave filter example, five memory modules are obtained. The register to memory port mapping for different control steps, generated by PDM, is listed in Table IX. The data path design is shown in Figure 2. For completeness, we also give the results produced by six previously proposed methods ("HAL" [17], "SPLICER" [18], "CATREE" [19], "SPAID" [20], "EMUCS" [21], and "Grant and Denyer" [22]) for the wave filter example. Table X compares the cost of the PDM approach with that obtained using the six previously proposed methods. As can be seen from Table X, the PDM requires fewer MUX inputs, register files, and buses.
Using the PDM with 2-port memory synthesis for the code sequence shown in Table I, four memory modules are obtained. The register to memory port mapping is listed in Table XI. The data path design is shown in Figure 3. ALU1 and ALU3 perform no operations in control step 5 but pass data to registers.

COMPLEXITY ANALYSIS
The PDM algorithm described in the previous sections has been implemented in C, using a two-dimensional linked list data structure.

SUMMARY
This paper presents an efficient algorithm to explore the design space for memory allocation in data path synthesis. The technique can be applied not only to single-port memory synthesis but also to multiport memory synthesis. It avoids certain locally optimal solutions to achieve more globally optimal solutions than were obtained with the previously proposed 0-1 integer programming technique. With suitable modifications, the proposed technique can also be applied to multiport memory synthesis in which the maximum number of read ports is different from the maximum number of write ports. Thus, an alternative data path design which requires less hardware in multiport memory synthesis can be achieved. The proposed memory synthesis techniques are applied to the code sequence generated by a compiler. The results of memory synthesis can be affected by manipulating the code sequence. Therefore, investigating the interface between the compiler and memory synthesis to achieve a better design is an interesting and important topic in this area. Finally, with suitable modifications, the partitioning techniques should also be applicable to the other phases of data path synthesis, namely the allocation of data operators and interconnection units. These issues are currently under investigation.