Stepwise Transformation of Algorithms into Array Processor Architectures by the DECOMP

UWE VEHLIES

A formal approach for the transformation of computation intensive digital signal processing algorithms into suitable array processor architectures is presented. It covers the complete design flow from algorithmic specifications in a high-level programming language to architecture descriptions in a hardware description language. The transformation itself is divided into manageable design steps and implemented in the CAD-tool DECOMP which allows the exploration of different architectures in a short time. With the presented approach data independent algorithms can be mapped onto array processor architectures. To allow this, a known mapping methodology for array processor design is extended to handle inhomogeneous dependence graphs with nonregular data dependences. The implementation of the formal approach in the DECOMP is an important step towards design automation for massively parallel systems.


1. INTRODUCTION
Progress in VLSI technology allows more and more transistors to be integrated on a single chip. Thus, microelectronic systems of increasing complexity can be realized. But this also results in a large quantity of design work that is manageable only with efficient support by design tools.
At the same time the algorithms developed in digital signal processing (DSP) grow in complexity, thereby requiring more and more computational power and higher throughput rates. This is in particular the case in the area of image and video processing, where algorithms for high definition television (HDTV) and video telephony have to be applied under real-time conditions. Application specific integrated circuits (ASICs) for such systems can be realized only using special purpose architectures (cf. [1]). One possible architecture type is the array processor [2], because it meets the requirements by a massive application of pipelining and parallel processing. In addition, due to their regularity and modularity, array processors are well suited for a design process automated by design tools. These trends influence the design methodology for microelectronic systems. In the past, design work mainly consisted of logic design and layout synthesis. Today these design tasks are well supported by commercial tools. But there is a need to extend these tools because increasing emphasis is given to decisions at the architecture level.
Today the derivation of architectures is performed manually in an intensive and error-prone process. In most cases only a few different architectures are examined. As a result, an unsuitable architecture may be derived and it becomes impossible to fulfill the requirements of a given algorithm. Thus, design methodologies supporting the architecture level must be developed and implemented in CAD-tools which enable designers to explore different architectures in a short time.
A lot of research is performed in this direction. But due to the complexity of the design process the solutions inevitably are restricted to small and/or regular design problems. Exploiting regularity, several methodologies for mapping algorithms onto architectures have been developed [2, 3, 4, 5, 6, 7, 8, 9, 10, 11] and partly implemented in design tools (see references in [12]). A disadvantage of these methodologies and tools is that most of them are restricted to special architecture types (e.g. array processors consisting of one type of processing element (PE) connected by regular data dependences) or to a special class of algorithms (e.g. regular algorithms representable by nested loop programs). Furthermore, they do not support the complete design flow starting with the specification of the algorithms and ending with a netlist description at the gate level.
For these reasons the CAD-tool DECOMP has been developed to support the mapping of algorithms onto array processor architectures [13, 14]. The DECOMP requires PASCAL descriptions [15] of the algorithms as input and produces EDIF netlists [16] at the register transfer level as output. Later developments led to a new implementation of the frontend in the DECOMP, which is now able to compile data independent algorithms [17] into dependence graphs (DGs) [12, 18]. The resulting DGs consist of different node types connected by nonregular data dependences. Thus, they cannot be mapped onto array architectures by the known design methodologies. To allow the mapping of these DGs, a procedure for combining nodes of different types into one PE has been derived [19], and in addition the mapping procedure proposed in [2] has been extended to handle nonregular data dependences. Currently the new mapping is being implemented in the DECOMP.
The design process captured by the DECOMP cannot be performed in one step. Because of its complexity it has to be split into manageable design tasks, each of them performing a specific design step. This results in a method referred to as stepwise transformation. A similar technique is known from high-level synthesis, where it is applied to transform a behavioural description step by step into hardware (cf. [20]). The purpose of this paper is to outline the formal approach underlying the stepwise transformation and its implementation. Furthermore, two data representations are defined, one assigned to the algorithm level and the other assigned to the architecture level. Based on them, one of the main design steps of the transformation is explained in more detail. With the presented approach data independent algorithms can be mapped onto highly parallel array processor architectures. The main advantages of the presented transformation are its ability to process nonregular algorithms and its degree of automation.
In Section 2 of this paper the stepwise transformation is outlined, and in Section 3 the data representations are introduced. One of the main design steps is explained in more detail in Section 4. The implementation in the CAD-tool DECOMP is described in Section 5, and finally a design example is given in Section 6.

2. THE STEPWISE TRANSFORMATION
The design process of mapping a given algorithm onto an array processor architecture is performed in four phases. These are
1. a specification phase,
2. a compilation phase,
3. a mapping phase, and
4. an optimization phase.
The phases themselves are divided into smaller design tasks, each of them performing a correctness preserving transformation, that is, a transformation which does not change the I/O behaviour of the algorithm. Thus, a given algorithm is transformed step by step into an array processor architecture. The phases and their design tasks are depicted in Fig. 1.
The specification phase consists of only one step, which is the program development. In this step the given algorithm is manually specified in a high-level language which is executable using standard compilers. Besides the algorithm the specification may contain an interface description specifying how the input data is provided and how the output data is required. In addition, design constraints like maximum chip area and maximum delay times can be specified for the array processor or its PEs.
In the four steps of the compilation phase the description of the algorithm is modified in such a way that a dependence graph can be built from it. First, by application of compiler techniques [21] the given specification is symbolically executed [17] and the performed assignments are listed in the so-called run time protocol (RTP) [12]. For example the symbolic execution of the statements a[1] := 0; …
A unique extraction of the data dependences needed to build a DG is only possible if the RTP is given in single assignment code (SAC). In this code every variable is assigned a value only once during the execution of the algorithm. Thus, in the second compilation step SAC is automatically introduced by adding an additional index where necessary. The RTP …
The allocation of the assignments to nodes in the DG is performed according to the indices of the variables on the left side of the assignments. Consequently, all left side variables must have the same number of indices. Furthermore, more regularity in the DG can be achieved if the nodes for the assignments are allocated with regard to the loop counters in the input program or the similarity of the right sides of the assignments. Thus, in the third compilation step different placement algorithms can be applied to the RTP in order to achieve the highest possible degree of regularity and parallelism.
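The introduction of single assignment code can be illustrated with a minimal sketch. It is not the DECOMP implementation; the tuple layout `(target, operands)` and the function name `to_single_assignment` are assumptions made for this illustration only. Every write to a variable receives a fresh version index, and every read refers to the latest version written so far.

```python
from collections import defaultdict


def to_single_assignment(assignments):
    """Rename variables so that each (name, version) pair is assigned once.

    `assignments` is a list of (target, operands) pairs in execution order
    (hypothetical layout). Operands never written before are treated as
    external inputs and keep the version None.
    """
    version = defaultdict(lambda: -1)  # last version written per variable
    sac = []
    for target, operands in assignments:
        renamed = []
        for op in operands:
            v = version[op]
            renamed.append((op, v if v >= 0 else None))
        version[target] += 1           # additional index for the new value
        sac.append(((target, version[target]), renamed))
    return sac


# Repeated assignment to a[1], as produced by a small accumulation loop.
rtp = [
    ("a[1]", []),                  # a[1] := 0
    ("a[1]", ["a[1]", "b[1]"]),    # a[1] := a[1] + b[1]
    ("a[1]", ["a[1]", "b[0]"]),    # a[1] := a[1] + b[0]
]
for entry in to_single_assignment(rtp):
    print(entry)
```

In the output every left side carries a distinct version, so each data dependence can be traced back to exactly one producing assignment.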
At last a localisation can be performed to avoid global data dependences in the resulting DG. During the localisation additional assignments, which propagate the variables via neighbouring nodes only, are introduced in the RTP. In the given example c(0) is propagated via c(2 0) and c(3 0) to b(4 0). Finally the RTP may be as follows:

a(1 0) := 0
b(2 0) := c(0)
c(2 0) := c(0)
c(3 0) := c(2 0)
a(1 1) := (+ a(1 0) b(1))
b(4 0) := c(3 0)
a(1 2) := (+ a(1 1) b(0))        (4)

The mapping phase consists of two steps, which are the DG-derivation and the mapping onto signal flow graphs (SFGs). In the DG-derivation the RTP is transformed into a DG consisting of nodes and arcs (see example in Sec. 6). Then a mapping which is based on the multiprojection method proposed in [2] is applied, and the DG is mapped onto a SFG, thereby reducing the number of dimensions by one. The mapping step can be applied to the resulting SFGs again to reduce the number of dimensions to a maximum of two. For a realization this has the advantage that the architectures can be implemented using only short and local connections between the PEs. Global connections on the chip should be avoided because their delays may dominate the delays of the gates.
In contrast to most of the known methodologies, the proposed mapping is not restricted to homogeneous DGs with regular data dependences. It is also capable of mapping different nodes into the same PE by merging their internal structure as described in [19]. In addition, the method proposed in [2] is extended to handle DGs with irregular data dependences [22]. Furthermore, in the mapping step different projections can be applied. They result in a number of possible SFGs which differ in the number of PEs, the connections between the PEs, the data I/O, the time needed for a complete computation of the algorithm, and other criteria. A suitable SFG fulfilling given design constraints is then selected and passed to the next phase.
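A single projection step can be sketched as follows. The sketch follows the general idea of projecting a DG along a projection vector d under a linear schedule vector s; it is an illustration under simplifying assumptions, not the extended mapping of the DECOMP. It supports only unit projection vectors along one axis, and the function name `project` and the data layout are hypothetical.

```python
def project(nodes, arcs, d, s):
    """Project a DG along unit vector d using the linear schedule vector s.

    nodes: iterable of index tuples; arcs: iterable of (start, end) tuples.
    Returns the PE index set, the time step of every node, and the SFG arcs
    as (start PE, end PE, number of delays).
    """
    axes = [k for k, c in enumerate(d) if c != 0]
    if len(axes) != 1 or d[axes[0]] != 1:
        raise ValueError("this sketch supports unit projection vectors only")
    k = axes[0]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def drop(p):
        return tuple(c for i, c in enumerate(p) if i != k)

    if dot(s, d) <= 0:
        raise ValueError("nodes mapped to one PE must run at different times")

    pes = {drop(p) for p in nodes}
    schedule = {p: dot(s, p) for p in nodes}
    sfg_arcs = set()
    for start, end in arcs:
        r = tuple(e - b for b, e in zip(start, end))  # dependence vector
        delay = dot(s, r)                             # registers on the arc
        if delay < 0:
            raise ValueError("schedule violates a data dependence")
        sfg_arcs.add((drop(start), drop(end), delay))
    return pes, schedule, sfg_arcs


# Hypothetical 3 x 3 DG with arcs in both index directions.
nodes = [(i, j) for i in range(3) for j in range(3)]
arcs = [((i, j), (i, j + 1)) for i in range(3) for j in range(2)]
arcs += [((i, j), (i + 1, j)) for i in range(2) for j in range(3)]
pes, schedule, sfg_arcs = project(nodes, arcs, d=(1, 0), s=(1, 1))
print(sorted(pes))
print(sorted(sfg_arcs))
```

The arcs that run parallel to d collapse into feedback loops with one register each, which is how the SFG acquires cycles that the DG does not have.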
In the optimization phase the selected SFGs are modified into array processor architectures. Because the SFGs are represented at the word level, first of all word widths for all arcs of the SFGs must be introduced. For this, the minimum and maximum values of the input signals are specified as a number range at the inputs of the SFG. Then a simulation function runs over the SFG, calculating the minimum and maximum value for every arc in the graph as well as for every internal connection of the PEs.
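The range propagation described above can be sketched with simple interval arithmetic. This is an illustration only: the choice of signed two's-complement widths, the helper names, and the example ranges (8-bit inputs, nine summands) are assumptions, not values taken from the DECOMP.

```python
def add_range(a, b):
    """Value range at the output of an adder."""
    return (a[0] + b[0], a[1] + b[1])


def sub_range(a, b):
    """Value range at the output of a subtractor computing a - b."""
    return (a[0] - b[1], a[1] - b[0])


def abs_range(a):
    """Value range at the output of an absolute-value block."""
    lo, hi = a
    if lo >= 0:
        return (lo, hi)
    if hi <= 0:
        return (-hi, -lo)
    return (0, max(-lo, hi))


def word_width(value_range):
    """Smallest two's-complement width holding every integer in the range."""
    lo, hi = value_range
    width = 1
    while lo < -(1 << (width - 1)) or hi > (1 << (width - 1)) - 1:
        width += 1
    return width


# Hypothetical input range: unsigned 8-bit samples.
sample = (0, 255)
diff = sub_range(sample, sample)   # (-255, 255)
mag = abs_range(diff)              # (0, 255)
acc = (0, 0)
for _ in range(9):                 # accumulate nine magnitudes
    acc = add_range(acc, mag)
print(diff, word_width(diff))      # (-255, 255) 9
print(acc, word_width(acc))        # (0, 2295) 13
```

Running such rules over every arc in schedule order yields a width per connection; a design tool could then round these up to the widths offered by its building block generators.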
Thereafter the derived architecture can be adapted to given design constraints. For example, it is not always possible to derive an architecture which requires the input data in the same way as it is provided by the external input interface. Therefore, register-multiplexer circuits for sorting the data coming from the input interface can be synthesized and put in front of the derived array processor. The problem of data supply for array processors has been studied in [23, 24].
The last step, the extraction of a netlist in a hardware description language, is performed by a direct conversion of the data structure used. The netlist is given at the register transfer level. The smallest blocks at this level are registers and arithmetic building blocks like adders and multipliers, which can be generated using building block generators [25, 26] or other synthesis tools.
For implementation purposes it would be convenient to have the same data structure between all design steps. Then the order of the steps could be changed or single steps could be left out. On the other hand, the requirements from the algorithmic side of the transformation are totally different from those of the architectural side. Consequently, as shown in Fig. 1, the stepwise transformation is based on two different data representations, the RTP and a graph representation. The RTP is only used in the compilation phase. It is set-oriented and recursively defined (see Sec. 3.1) because it describes a sequence of assignments. In contrast, the graph representation is block-oriented and hierarchically defined (see Sec. 3.2) because it describes the interconnections of hierarchically organized blocks.
3. REPRESENTATION OF DESIGN DATA

Algorithm Based Data Structure
The run time protocol, which is produced by a symbolic program execution, describes a given algorithm in a maximum expanded form. It holds one entry for every performed assignment and lists them in the given execution order. Furthermore, a discrete time step is assigned to each successive entry.
Formally a RTP can be described by grouping the variables of the algorithm into different sets. Besides the set of external input variables ($\mathcal{IV}$) and the set of constants ($\mathcal{C}$), two more sets are defined at every discrete time step. One set ($\mathcal{KV}_n$) contains all variables which are known at a specific time step, the other set ($\mathcal{UV}_n$) contains all variables which are used by the assignment performed at that time step. For the definition of the RTP the sets and elements are named as given in Table 1. The RTP itself is defined in Def. D.1.

TABLE 1 Sets and elements used in the run time protocol
$n$                          denotes the n-th processing step
$iv$, $\mathcal{IV}$         external input variable, set of all external input variables
$c$, $\mathcal{C}$           constant, set of all constants
$kv$, $\mathcal{KV}_n$       known variable, set of all known variables in the n-th processing step
$uv$, $\mathcal{UV}_n$       used variable, set of all used variables in the n-th processing step
$pv_n$                       variable produced (assigned) in the n-th processing step

Def.: Structure of the run time protocol (D.1)

$$\mathcal{KV}_0 := \{ kv \mid kv \in \mathcal{IV} \lor kv \in \mathcal{C} \}$$
$$pv_n := f_n(\mathcal{UV}_n) \mid \mathcal{UV}_n \subseteq \mathcal{KV}_n \quad \forall n \geq 0$$
$$\mathcal{KV}_{n+1} := \mathcal{KV}_n \cup \{ pv_n \}$$

This definition expands to the run time protocol shown in Table 2. Based on the external input variables $iv$ and the used constants $c$, the RTP is initialized at time step $n = 0$ with the set $\mathcal{KV}_0$. All input variables and constants can be collected in $\mathcal{KV}_0$ because in ordinary programs they are known at the beginning. Then, starting at $n = 0$, for every time step $n$ the produced variable $pv_n$ is calculated using the set $\mathcal{UV}_n$ of used variables, which is a subset of the set $\mathcal{KV}_n$ containing all variables known at that time step. At the same time, the set $\mathcal{KV}_{n+1}$ is calculated anew for the following time step. For that the variable $pv_n$ and the set $\mathcal{KV}_n$ from the current time step are used. Because in every time step the calculations use values from the directly preceding time step only, the RTP is said to be defined recursively.

Architecture Based Data Structure
At the register transfer level array processors are represented by DGs and SFGs. These graphs consist of interconnected PEs, which themselves consist of building blocks like adders and multipliers representing the basic operations of the implemented algorithm. The interconnections in DGs are arcs with zero delay, and in addition the DG is free of loops and cycles by definition. Interconnections of SFGs may contain registers (delays). Thus, SFGs may have loops and cycles with at least one register on them.
A design hierarchy in which DGs, SFGs, PEs, and building blocks can be represented as blocks is given by the following 3 levels:
1. Graph-level: Representation of DGs and SFGs by PEs and their interconnections (also referred to as arcs).
2. Processor-level: Representation of PEs by building blocks and their interconnections.
3. Operator-level: Representation of building blocks. These are the smallest blocks at the register transfer level. Thus, their internal structure is not considered here.
DGs and SFGs are represented on the same level. Thus, in the following the statements for DGs are valid for SFGs as well unless otherwise mentioned. Furthermore, nodes of DGs and PEs of SFGs are always referred to as PEs.
The blocks on each level are formally defined as 4-tuples as follows:

Def.: Block description (D.2)
$$bb := (A, \mathcal{BB}, \mathcal{S}, \mathcal{F})$$
with
$A$ as a 2-tuple describing the attributes of the block,
$\mathcal{BB}$ as the set of subblocks used in the block,
$\mathcal{S}$ as the set of signals which represent the interconnections of the subblocks, as well as the inputs and outputs of the block,
$\mathcal{F}$ as the set of functions which represent the functionality of the block. The functions also represent the netlist of the block.

This definition was originally developed in [19], in a slightly modified form, as a model for complex processing elements. The attributes $A$ in $bb$ are defined as follows:

Def.: Attribute (D.3)
$$A := (typ, \mathcal{TSA})$$
with
$typ$ as the identifier for the type of the block. It is one of the symbols DG, SFG, PE, or BB (building block),
$\mathcal{TSA}$ as a set of type specific attributes $tsa$.

Each type specific attribute $tsa$ is a 2-tuple of the following form:

Def.: Type specific attribute (D.4)
$$tsa := (att, val)$$
with
$att$ indicating the attribute type. $att$ is also used as keyword to distinguish different attributes,
$val$ as the value of the attribute.

Type specific attributes are, for example, $TSA_{DG} = (dim\ (2\ 4))$ describing the index space of a DG, or $TSA_{PE} = (ind\ (1\ 1))$ as the index point at which a PE is located in a DG. Different types of blocks may also have the same type of attribute. For example, a building block may have the attribute $TSA_{BB} = (area\ 42)$, and a PE which contains three of these building blocks may have the attribute $TSA_{PE} = (area\ 126)$.
For describing the reference to the attributes the following two functions are used:
$$f_{att}(TSA) = att, \qquad f_{val}(TSA) = val$$
As an example:
$$f_{att}(TSA_{PE}) = f_{att}((ind\ (1\ 1))) = ind \quad (5)$$

Each function $F \in \mathcal{F}$ references a subblock together with its input and output signals:

Def.: Function (D.5)
$$F := (bb, \mathcal{S}_{in}, \mathcal{S}_{out})$$

The given definition implicitly assumes that every block has at least one input and one output. Furthermore, the subblocks $bb$ and signals $s$ must have unique names inside a block because the netlist is given by referencing these names. The order of the signals in the sets $\mathcal{S}_{in}$ and $\mathcal{S}_{out}$ is not important except when the functions are used for a symbolic verification at the operator level. Then the order is important for operators (blocks) which are not commutative (e.g., subtraction or division).
In addition, the signals $s \in \mathcal{S}$ optionally can be defined as a 3-tuple as follows:

Def.: Signal (D.6)
$$s := (s_i, bb_{in}, bb_{out})$$
with $bb_{in}$ as the block at which the signal starts and $bb_{out}$ as the block at which it ends. Then the description of a block contains redundant information which, for example, can be used for a consistency check of the netlist.
A problem arises if internal signals of a block are allowed to be an output, too. In this case the output cannot be recognized automatically. In the presented model this is handled by fork-elements, which have one input and more than one output. The function for a fork-element is, for example, $F_{fork} = (fork_1, \{s_1\}, \{s_2, s_3\})$.
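The hierarchical block description can be mirrored in a small data structure. The following is a sketch under stated assumptions, not the LISP data structure of the DECOMP: the class names `Block` and `Function`, the helper functions, and the example PE with its signal names are all hypothetical. A fork-element would simply be a `Function` with more than one output.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Function:
    block: str      # name of the subblock implementing the function
    inputs: tuple   # ordered input signals (order matters for "-" and "/")
    outputs: tuple  # output signals


@dataclass
class Block:
    typ: str                                        # "DG", "SFG", "PE" or "BB"
    attributes: dict = field(default_factory=dict)  # type specific attributes
    subblocks: dict = field(default_factory=dict)   # unique name -> Block
    signals: set = field(default_factory=set)       # unique signal names
    functions: list = field(default_factory=list)   # netlist of the block


def check_netlist(block):
    """Return a list of consistency problems found in the netlist."""
    problems = []
    for f in block.functions:
        if f.block not in block.subblocks:
            problems.append(f"unknown subblock {f.block}")
        for s in f.inputs + f.outputs:
            if s not in block.signals:
                problems.append(f"unknown signal {s} at {f.block}")
    drivers = Counter(s for f in block.functions for s in f.outputs)
    problems += [f"signal {s} has {n} drivers"
                 for s, n in drivers.items() if n > 1]
    return problems


def total_area(block):
    """Own 'area' attribute if present, else the sum over all subblocks."""
    if "area" in block.attributes:
        return block.attributes["area"]
    return sum(total_area(b) for b in block.subblocks.values())


# Hypothetical PE built from three building blocks of area 42 each.
pe = Block(
    typ="PE",
    subblocks={n: Block("BB", {"area": 42}) for n in ("sub1", "abs1", "add1")},
    signals={"x", "y", "d", "a", "acc_in", "acc_out"},
    functions=[
        Function("sub1", ("x", "y"), ("d",)),
        Function("abs1", ("d",), ("a",)),
        Function("add1", ("a", "acc_in"), ("acc_out",)),
    ],
)
print(check_netlist(pe))  # []
print(total_area(pe))     # 126
```

Because functions refer to subblocks and signals by name only, the netlist stays a flat list while the hierarchy lives in `subblocks`, which matches the requirement that names be unique inside a block.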
As an example for the presented data representation the PE shown in Fig. 2 is given. This PE could be used in a DG as block $bb_{0\_0}$.

4. DG-DERIVATION
Due to space limitations a detailed explanation of the complete stepwise transformation does not fit into this paper. For this reason, the main design step DG-derivation has been selected for explanation because it is the interface between the two data representations. It transforms the run time protocol into a graph based, object-oriented data structure which can be easily converted into a netlist description.
Normally graphs are described by a 2-tuple $G = (\mathcal{N}, \mathcal{A})$ consisting of a set of nodes $\mathcal{N}$ and a set of arcs $\mathcal{A}$. The arcs are described by $a = i \to j$ in which $i$ and $j$ are index points out of $\mathcal{N}$ (cf. [2]). For the mapping of algorithms onto VLSI architectures the external interfaces must be described in addition. The arcs are therefore described as dependences of the form
$$dep := (start, end, name, delay)$$
Herein $start$ and $end$ denote the index points of the start and end node of the dependence, $name$ is the individual name of the dependence, and $delay$ is the number of delays associated with the dependence. It should be noticed that the arcs in a DG always have zero delay by definition. The delay is specified because the definitions apply to SFGs as well. Furthermore, the values of such dependences are accessed by four functions, among them $f_s(dep)$ and $f_e(dep)$, which return the start and the end index point.
For the derivation of a DG, first the set $\mathcal{PV}$ of produced variables and the set $\mathcal{UV}$ of used variables are built from a maximum expanded RTP (see Table 2). With these sets the external input and output arcs of the DG are given by Eq. 8 and 9, respectively. The symbol $\epsilon$ means that there is no value at this place. The functions $f_{ind}$ and $f_{nam}$ return the index and the name of a variable.
$$\mathcal{A}_{in} := \{ dep_{in} \mid dep_{in} = (\epsilon, f_{ind}(pv_i), f_{nam}(uv), 0) \text{ with } pv_i = f_i(\mathcal{UV}_i) \land uv \in \mathcal{UV}_i \land uv \in (\mathcal{UV} \setminus \mathcal{PV}) \} \quad (8)$$
$$\mathcal{A}_{out} := \{ dep_{out} \mid dep_{out} = (f_{ind}(pv_i), \epsilon, f_{nam}(pv_i), 0) \text{ with } pv_i = f_i(\mathcal{UV}_i) \land pv_i \in (\mathcal{PV} \setminus \mathcal{UV}) \} \quad (9)$$
The dependences between the nodes are determined by Eq. 10. Therein $i \prec k$ means that the assignment $i$ is performed before the assignment $k$.
$$\mathcal{A} := \{ dep \mid dep = (f_{ind}(pv_i), f_{ind}(pv_k), f_{nam}(pv_i), 0) \text{ with } pv_i \in \mathcal{UV}_k \land i \prec k \} \quad (10)$$
The orientation of a dependence in the n-dimensional index space of the derived DG is given by the vector $\vec{r}$ as follows:
$$\vec{r} := f_e(dep) - f_s(dep) \quad (11)$$
Due to the placement according to the indices of the produced variables, dependences with $\vec{r} = \vec{0}$ ($\vec{0}$ is the zero vector) may occur for the moment. Because such dependences are not allowed in a DG, they are later transformed into node internal signals.
After determination of the dependences the internal structure of the nodes is derived. Because this is a complicated operation, first the index set $\mathcal{I}$ is developed with Eq. 12. It contains all index points of the n-dimensional index space at which a node of the DG is located.
$$\mathcal{I} := \{ ind \mid \exists\, pv \in \mathcal{PV} \text{ with } ind = f_{ind}(pv) \} \quad (12)$$
Then for every $ind \in \mathcal{I}$, i.e. for every node, a set $\mathcal{AS}_{ind}$ is developed by application of Eq. 13. This set contains all assignments of the given run time protocol which produce a variable with the index $ind$. Thus, all these assignments are placed in the same node.
$$\mathcal{AS}_{ind} := \{ pv_i = f_i(\mathcal{UV}_i) \mid f_{ind}(pv_i) = ind \} \quad (13)$$
The sets $\mathcal{AS}_{ind}$ contain assignments of arbitrary complexity described in prefix form, for example $(\circ\ (\Box\ uv_1[j_1 j_2]\ uv_2[k_1 k_2])\ uv_3[l_1 l_2])$, in which $\circ$ and $\Box$ denote any operators. For the translation of these assignments into functions of a block as defined in Def. D.5, they first must be split into simple assignments containing only one operator, which can be implemented by an arithmetic building block. This is performed by introducing additional variables, which later on lead to node internal signals.
After the splitting the simple assignments can be directly translated into functions as shown in Eq. 14, in which $s_h$ denotes the signal of the additional variable:
$$F_{\Box} := (\Box, \{ s_{uv_1}, s_{uv_2} \}, \{ s_h \}), \qquad F_{\circ} := (\circ, \{ s_h, s_{uv_3} \}, \{ s_{pv} \}) \quad (14)$$
As the result of these transformations all sets $\mathcal{AS}_{ind}$ are mapped into sets $\mathcal{AS}^{*}_{ind}$ which describe the functionality of the nodes in the architecture based data structure (see Eq. 15).
$$\mathcal{AS}^{*}_{ind} := \{ F \mid F = (bb, \mathcal{S}_{in}, \mathcal{S}_{out}) \text{ with } \exists\, a \in \mathcal{AS}_{ind} \text{ with } a \to F_i \land F_j \land F_k \land \dots \} \quad (15)$$
From the sets $\mathcal{AS}^{*}_{ind}$ a set of node descriptions can be derived as follows:
$$\mathcal{N} := \{ n \mid n = (A_n, \mathcal{BB}_n, \mathcal{S}_n, \mathcal{F}_n) \ \dots \}$$
The derived sets completely describe the DG, which becomes
$$DG = (\mathcal{N}, \mathcal{A}, \mathcal{A}_{in}, \mathcal{A}_{out}) \quad (18)$$
Finally the description of the DG required for the hierarchically defined, architecture based data representation (see Def. D.2) is
$$bb_{DG} := (A_{DG}, \mathcal{BB}_{DG}, \mathcal{S}_{DG}, \mathcal{F}_{DG}) \quad (19)$$
with
$$A_{DG} = (DG, \{ (dim\ dim) \})$$
$$\mathcal{BB}_{DG} = \mathcal{N}$$
$$\mathcal{S}_{DG} = \mathcal{A} \cup \mathcal{A}_{in} \cup \mathcal{A}_{out}$$
$$\mathcal{F}_{DG} = \{ F \mid F = (pe, \mathcal{S}^{in}_{pe}, \mathcal{S}^{out}_{pe}) \text{ with } pe \in \mathcal{N} \land pe = (A_n, \mathcal{BB}_n, \mathcal{S}_n, \mathcal{F}_n) \}$$
In Eq. 19 the two sets $\mathcal{S}^{in}_{pe}$ and $\mathcal{S}^{out}_{pe}$ are derived from the functions of the PE as follows:
$$\mathcal{S}^{in}_{pe} := \{ s \mid \exists F \in \mathcal{F}_n \text{ with } s \in f_{in}(F) \land \nexists F \in \mathcal{F}_n \text{ with } s \in f_{out}(F) \} \quad (20)$$
$$\mathcal{S}^{out}_{pe} := \{ s \mid \exists F \in \mathcal{F}_n \text{ with } s \in f_{out}(F) \land \nexists F \in \mathcal{F}_n \text{ with } s \in f_{in}(F) \} \quad (21)$$
Furthermore, the dimension $dim$ of the DG is given by
$$dim := \max_{\forall i = 1 \dots N} (n(pv_i)) \quad (22)$$
where $n(pv_i)$ denotes the number of indices of the variable $pv_i$. In the stepwise transformation the dimension is already determined during the placement, in which the assignments of the RTP are assigned to index points of an n-dimensional index space by consistently changing the indices of the produced variables.
A synthesis algorithm for the derivation of DGs from the RTP has to apply the given equations in the order described above. The main advantage of this approach is that the basic sets $\mathcal{PV}$, $\mathcal{UV}$ and the respective $\mathcal{AS}_{ind}$ are derived by processing the run time protocol linearly and only once. The other sets are calculated from these basic sets without time consuming search operations. Furthermore, if regular algorithms like matrix-matrix multiplication are processed, the basic sets could be derived immediately during the symbolic execution in the compilation phase. This is very fast, but only possible if no special placement is required. Another point is that the placement according to the indices of the produced variables may result in DGs with cycles, which are not allowed by definition.
But these cycles only exist at the graph level. Due to the single assignment property, which is automatically introduced during the compilation phase, they do not exist at the processor level. Therefore, they can be broken up easily by assigning the operations concerned to other nodes in the DG. Furthermore, in the presented approach the existence of such cycles is already checked during the placement in the compilation phase. Thus the derived DGs are free of loops and cycles.

5. THE CAD-TOOL DECOMP
To show the feasibility of the formal approach presented in this paper the stepwise transformation has been implemented in the CAD-tool DECOMP. COMMON LISP [27] has been used as implementation language because it offers various possibilities for the implementation of object-oriented data structures as required by the presented formal approach. In addition, it is well suited for rapid prototyping. The program structure of the DECOMP is shown in Fig. 3.
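The core of the DG-derivation of Sec. 4 (index set, assignments per node, dependences, and external arcs) can be sketched in a few lines. The sketch is written in Python rather than COMMON LISP for brevity, and it is not the DECOMP component itself: the RTP layout `(produced, used)` with variables as `(name, index)` pairs and the function names are assumptions, and the splitting of complex assignments into building block functions is omitted.

```python
def derive_dg(rtp):
    """Derive DG sets from an RTP given in single assignment code.

    rtp: list of (pv, uvs); pv and each uv are (name, index tuple) pairs.
    Dependences are (start, end, name, delay) tuples; None stands for
    "no value at this place".
    """
    produced = {pv for pv, _ in rtp}
    used = {uv for _, uvs in rtp for uv in uvs}
    node_assignments = {}                 # index point -> its assignments
    arcs, internal, inputs = [], [], []
    for pv, uvs in rtp:
        node_assignments.setdefault(pv[1], []).append((pv, uvs))
        for uv in uvs:
            if uv in produced:            # dependence between assignments
                dep = (uv[1], pv[1], uv[0], 0)
                (internal if uv[1] == pv[1] else arcs).append(dep)
            else:                         # never produced: external input
                inputs.append((None, pv[1], uv[0], 0))
    outputs = [(pv[1], None, pv[0], 0) for pv, _ in rtp if pv not in used]
    return {
        "index_set": set(node_assignments),
        "node_assignments": node_assignments,
        "arcs": arcs,
        "internal": internal,             # zero-vector dependences
        "inputs": inputs,
        "outputs": outputs,
    }


def dependence_vector(dep):
    """Orientation of a dependence: end index minus start index."""
    start, end = dep[0], dep[1]
    return tuple(e - s for s, e in zip(start, end))


# Hypothetical RTP of a short accumulation, already in SAC.
rtp = [
    (("a", (1, 0)), []),
    (("a", (1, 1)), [("a", (1, 0)), ("b", (1,))]),
    (("a", (1, 2)), [("a", (1, 1)), ("b", (0,))]),
]
dg = derive_dg(rtp)
print(sorted(dg["index_set"]))
print(dg["arcs"])
print([dependence_vector(d) for d in dg["arcs"]])
```

The sketch makes two linear passes over the RTP, one to collect the produced and used variables and one to place assignments and classify dependences, so no search over the graph is needed afterwards.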
The transformation of a given algorithm starts with an input description in the high-level language PASCAL. Besides the algorithm this input description may also contain a specification of the external interfaces and given design constraints. The first transformation step in the compilation phase, the symbolic execution, is performed by application of compiler techniques [21]. More precisely, the frontend compiler of the DECOMP analyses the input description based on a grammar describing the input language. Thus, by changing this grammar, other input languages can be implemented easily without changing the source code of the compiler.
Together with the symbolic execution, the introduction of single assignment code (SAC), the localisation, and the placement produce the RTP as output. This RTP is input to the program component DG-derivation, which translates it into the internal data structure. The internal data structure is able to represent graphs (DGs and SFGs) as well as PEs and their building blocks. Furthermore, to avoid confusion with the design data the internal data structure is only accessible via a macro-shell.

FIGURE 3 The structure of the DECOMP
The second step of the mapping phase, the mapping onto SFG, is implemented in a separate program component providing various functions for projecting a given DG onto different SFGs. Additionally, it provides functions for comparison and verification of the SFGs. For example, the verification function simulates the data flow in the SFGs by reading input data from a file (the same file as used by the original PASCAL program) into the SFG and then calculating the output values of each PE. The PEs are processed in the order given by the inherent schedule of the SFG as many times as necessary to calculate all output data. The output data can then be compared to the data produced by running the PASCAL program.
From the optimization phase the two steps 'word width assignment' and 'extraction' are implemented in two successive program components. The word width assignment takes a SFG from the internal data structure and, after introducing the maximum needed word widths, passes the SFG to the extraction. The extraction generates a special design interchange format (DIF) which is used to communicate with other design tools.
Via the DIF a building block generator [25, 26] is connected to the DECOMP. Thereby, the synthesis of the required building blocks can be performed outside the DECOMP and the performance data of the generated building blocks can be written back into the DIF. Furthermore, different netlist converters are implemented, allowing the translation of the DIF into the standards EDIF and VHDL. Thus, further design steps like simulation and layout synthesis can be performed with commercially available CAD-tools.
The performance data of the generated building blocks can be transferred back into the internal data structure of the DECOMP. Then, based on this data, the created architectures, or rather SFGs, can be interactively analysed with respect to the specified design constraints. The program component interactive analysis provides functions, e.g., for the calculation of area and delay parameters of the architectures. If an architecture does not fulfill required constraints it can be modified and extracted again. This design cycle implements the optimization step 'adaptation to design constraints'.
In addition, the program component data I/O-adaptation provides functions for the generation of register-multiplexer circuits which adapt the input interface of a designed architecture to a given external interface. The generated circuits are represented in the same data structure as the SFGs. Therefore, the same design steps can be applied to them.
The program components of the DECOMP allow a straightforward conversion of an algorithm into an architecture as intended by the formal approach of the stepwise transformation. The components of the DECOMP can be used interactively or in an automated way. In the latter case the program component control/strategy performs an experience based heuristic search to find architectures close to given design constraints. To allow this the control/strategy has access to all data structures.

6. A DESIGN EXAMPLE
In this section a design example for the stepwise transformation of an algorithm into an array processor architecture is given. Here the blockmatching algorithm from the area of image processing has been chosen (cf. [28]). It is shown in Eq. 23.

FIGURE 4 Algorithmic part of the input description
n and m can only take integer values of the specified interval. In the following only the calculation of U is considered. First of all, the algorithm has to be formulated in the high-level language PASCAL, which is required as input for the DECOMP. The algorithmic part of the input description, which has been formulated for N = 3, is shown in Fig. 4. It consists of four nested loops, two of them for the calculation of the sums over i and k and two of them for calculating the sums over the variable displacements in m and n. The input description is neither localized nor given in single assignment code. It should be noticed that the declaration of all variables x and y as arrays of length one is necessary due to the prototype implementation of the DECOMP compiler, which otherwise cannot distinguish between variables used for the calculation of data and variables used as loop counters. Further, for verification purposes the input description can be executed using a standard PASCAL compiler.
With the DG-derivation as described in Sec. 4 a 4-dimensional DG is derived. It consists of 81 nodes of 4 different types. For example, the assignments 116 to 119 are placed in the same node because the indices of their produced variables are the same. Due to the regularity of the blockmatching algorithm, the nodes of the DG are connected in a regular manner, allowing several different projections. With the program component mapping onto SFG the multiprojection described in [2] is applied and the DG is projected three times. The projection and schedule vectors used are as follows:
1. projection: $d_1 = (1\ 0\ 0\ 0)$, $s_1 = (1\ 0\ 0\ 0)$
2. projection: … $(1\ 0\ 0)$
3. projection: $d_3 = (1\ 0)$, $s_3 = (1\ 1)$
Every projection reduces the dimension of the graph by one. Thus, the resulting SFG, which is shown in Fig. 5, has one dimension and consists of three different PEs. The numbers at the connections denote the number of delays (registers) associated with them. Besides the two inputs x_in and y_in for the image data, every PE has several inputs for intermediate values calculated while an image is processed.
In addition, every PE has control inputs s which determine whether a signal (e.g., x_i) is taken from the feed-back loop or from an external input. The external inputs are used for the input of start values (e.g., 0 or maxim) for the intermediate sums. The resulting minimum U for Eq. 23 is produced by the output x_m of the third PE after 43 clock periods. The PE at the index (1) is shown in Fig. 6. It consists of two multiplexers and five arithmetic building blocks. The path from the inputs x_in and y_in via the blocks -, ABS, +, +, and MIN to the output represents the calculation of Eq. 23. The two multiplexers are used to select an external input value or one stored in the registers of the feed-back loops. The sequences of control signals for the multiplexers …

It is also possible to verify the SFG at the word level. For this purpose, a verification function is implemented in the DECOMP which simulates the SFG using the same input data as the PASCAL description. The SFG represents an architecture for the given algorithm if the output data of the verification function is equal to the output data of the PASCAL program.
In the final design steps, word widths for all connections are automatically introduced based on specifications given in the input description. Such specifications contain a number range or a concrete word width for the external input signals. After this, the SFG is translated into an intermediate design interchange format from which netlists in EDIF or VHDL can be generated.
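How word widths can follow from a number range of the external inputs may be sketched with simple interval propagation in Python. This is one common way to derive maximum needed word widths and is not claimed to be the procedure implemented in the DECOMP. The input range 0..255, the sum over nine terms (a 3 x 3 block), and all function names are assumptions for this illustration.

```python
def sub(a, b):
    """Value range of a - b for ranges a = (lo, hi) and b = (lo, hi)."""
    return (a[0] - b[1], a[1] - b[0])


def absolute(a):
    """Value range of |a|."""
    lo, hi = a
    if lo >= 0:
        return a
    if hi <= 0:
        return (-hi, -lo)
    return (0, max(-lo, hi))


def add(a, b):
    """Value range of a + b."""
    return (a[0] + b[0], a[1] + b[1])


def width(r):
    """Bits needed for range r: unsigned if lo >= 0, else two's complement."""
    lo, hi = r
    if lo >= 0:
        return max(1, hi.bit_length())
    n = 1
    while -(1 << (n - 1)) > lo or hi > (1 << (n - 1)) - 1:
        n += 1
    return n


pixel = (0, 255)              # assumed number range of x_in and y_in
diff = sub(pixel, pixel)      # output range of the "-" block
ad = absolute(diff)           # output range of the ABS block
acc = (0, 0)
for _ in range(9):            # assumed: nine terms are accumulated
    acc = add(acc, ad)

print(width(diff), width(ad), width(acc))  # prints 9 8 12
```

Since a minimum never exceeds its operands, the MIN block needs no more bits than the accumulated sum in this sketch.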
The example has been processed with a prototype implementation of the DECOMP on a VAXstation 4000/60. The compilation into a DG and the straightforward multiprojection onto the SFG together were performed in approximately 40 CPU seconds, the verification took approximately 17 CPU seconds, and the introduction of the maximum needed word widths approximately 27 CPU seconds.

CONCLUSIONS
The presented formal method performs an automatic mapping of data independent digital signal processing algorithms onto array processor architectures. It covers the complete design flow from algorithmic specifications in a high-level language to the generation of netlist descriptions at the register transfer level. The presented approach extends known design methodologies because it can handle inhomogeneous dependence graphs with nonregular data dependences. Thereby, irregular algorithms can also be mapped onto an architecture which is as regular as the given algorithm. Array processors have been chosen as the basic architecture type because they provide high computation power due to a massive application of pipelining and parallel processing. In addition, due to their regularity, they are well suited for an automatically performed design process. The presented stepwise transformation divides the complete design task into smaller parts which can be handled independently by a CAD-tool.
The feasibility of the approach has been shown by the implementation in the CAD-tool DECOMP. With the DECOMP it is possible to translate an executable PASCAL-description of a data independent algorithm into a dependence graph. Single assignment code is automatically introduced and global connections are localized. The dependence graphs can be mapped onto different signal flow graphs, which are modified to architectures by the introduction of maximum needed word widths. The DECOMP supports comparison of the derived architectures to each other and to given design constraints. Thus, an optimal architecture can be selected and translated into an intermediate design interchange format from which netlists at the register transfer level can be generated in EDIF or VHDL. Furthermore, the interfaces of the synthesized architectures can be adapted to an external data supply by generation of register-multiplexer circuits. By supporting designers at the architecture level, which is currently not sufficiently supported by commercial design tools, the DECOMP is an important step towards design automation. In addition, the given blockmatching example shows that the derived architectures are comparable to those known from the literature.
With the DECOMP, designers are able to explore different architectures interactively in a short time. The problem here, which is also present in other design methodologies, is the existence of a very large number of different design solutions. It is not possible to control an automatically performed mapping process in such a way that it maps the algorithm straightforwardly onto an optimal architecture. Therefore, the decisions in the design steps (e.g., which placement algorithm has to be chosen, how the localisation has to be performed, and which projection vectors have to be applied) are based on heuristic and empirical knowledge. For example, the projection directions are chosen parallel to the coordinate axes. Therefore, in the future, criteria must be developed which allow estimation of the performance data of the architectures (e.g., area, time, number of inputs) at a very early design stage. Currently, a design strategy is being developed which, based on such criteria, allows the further automation of the interactively performed design process in the DECOMP.
FIGURE Transformation steps and representation of design data

3.2 Architecture Based Data Structure

Def.: Type specific attribute (D.4)
TSA = (att, val)

The set in Def. D.2 contains functions F, which are defined as follows:

Def.: Function (D.5)
F = (bb, in, out)
with
bb as a block which represents an operation like addition or, more complex, the operation of a PE (the blocks in a PE are called operators, too),
in as the ordered set of input signals for the block,
out as the ordered set of output signals for the block.

FIGURE 2 Processing element and its formal representation

Due to this, the DGs derived during the stepwise transformation are described by a 4-tuple as given in Def. D.7.

Def.: Dependence graph (and also signal flow graph) (D.7)
DG = ( … ), a 4-tuple consisting of, in this order,
the set of nodes represented by their indices,
the set of dependences connecting the nodes, usually referred to as arcs,
the set of external input dependences,
the set of external output dependences.

The dependences are given by a 4-tuple as follows:

Def.: Dependences (D.8)
dep = (start, end, name, delay)


FIGURE 5 Architecture for the blockmatching example

FIGURE 6 Internal structure of processing element at the index (1)

TABLE 3 Parts of the RTP for the blockmatching example