We address the automatic synthesis of DSP algorithms on FPGAs. Optimized fixed-point implementations are obtained by considering (i) a multiple wordlength approach; (ii) a complete datapath formed of wordlength-wise resources (i.e., functional units, multiplexers, and registers); (iii) an FPGA-wise resource usage metric that enables an efficient distribution of logic fabric and embedded DSP resources.
The paper shows (i) the benefits of applying a multiple wordlength approach to the implementation of fixed-point datapaths and (ii) the benefits of a wise use of embedded FPGA resources. The use of a complete fixed-point datapath leads to improvements of up to 35%, and the wise mapping of operations to FPGA resources (logic fabric and embedded blocks), enabled by the proposed resource usage metric, leads to improvements of up to 54%.
1. Introduction
This paper addresses the architectural synthesis (AS) of digital signal processing (DSP) algorithms implemented on modern FPGAs. High levels of optimization are achieved through multiple wordlength (MWL) fixed-point descriptions of the algorithms and through the combined use of LUT-based and embedded FPGA resources. The former notably reduces implementation costs, and the latter minimizes area in FPGA implementations.
The MWL implementation of fixed-point DSP algorithms [1–4] has proved to provide significant cost savings when compared to the traditional uniform wordlength (UWL) design approach. The introduction of MWL issues in AS increases optimization complexity, but it opens the door to significant cost reductions [2, 3, 5, 6].
FPGA devices have been extensively used in the implementation of DSP algorithms, especially due to the recent introduction of specialized embedded blocks (i.e., memory blocks, DSP blocks, etc.). Traditional approaches to estimate FPGA resource usage do not apply to modern FPGAs, which present a heterogeneous architecture composed of both logic fabric and embedded blocks, since they only account for lookup table- (LUT-) based resources [7]. This situation calls for new resource usage metrics that can be integrated as part of automatic synthesis techniques to fully exploit the possibilities that embedded resources offer [8–10].
Current approaches to MWL-oriented architectural synthesis are either not tuned to modern FPGAs [2, 3] or do not apply an efficient distribution between logic fabric and specialized embedded blocks [11, 12]. Also, the resource set used during the optimization process does not include the multiplexers necessary to transfer data from memory elements to arithmetic resources.
The main contributions of this paper are the following.
(i) A novel resource usage metric that guarantees minimum resource usage for heterogeneous FPGA implementations when integrated within an optimization framework.
(ii) An architectural synthesis procedure tuned to fixed-point implementations that handles a complete datapath (functional units, multiplexers, and registers).
(iii) A novel strategy for fixed-point data multiplexing.
The paper is divided as follows. In Section 2, the architectural synthesis of DSP datapaths using multiple wordlength systems and modern FPGAs is introduced. Section 3 deals with the implementation results from synthesizing several DSP benchmarks for different latency constraints and output noise constraints. Finally, in Section 4, the conclusions are drawn.
2. Synthesis of Fixed-Point Datapaths

2.1. Formal Description
This work focuses on the time-constrained resource minimization problem [13]. The notation used is based on [13], and it is similar to that in [2, 4, 6].
Given a sequencing graph GSEQ(O,S), a maximum latency λ, and a set of resources R (e.g., functional units RFU, registers RREG, and steering logic RMUX), it is the goal of AS to find the time step when each operation is executed (scheduling), the types and number of resources forming R (resource allocation), and the binding between operations and variables to functional units and registers (resource binding) that comply with the constraints, while minimizing cost (i.e., area). As a result, a datapath able to compute the algorithm's operations (see Figure 1) as well as the required control logic is generated.
Figure 1: Datapath architecture: FUs, registers, and multiplexers.
GSEQ(O,S) is a formal representation of a single iteration of an algorithm, where O is the set of operations and S⊂O×O is the set of signals that determine the data flow. We consider O=OM∪OG∪OA∪OD∪OI∪OO, composed of typical DSP operations: multiplications, gains (multiplication by a constant value), additions, unit delays, and input and output nodes. Signals are in two's complement fixed-point format, defined by the pair (p,n), where p is the number of integer bits [4] and n is the wordlength of the signal not including the sign bit. The values of the pairs (p,n) have been computed during a previously performed wordlength optimization (WLO) [1, 14–16]; see Section 2.5 for details on the wordlength optimization process.
Functional units (RFU) are in charge of executing the set of operations from O. Registers (RREG) store the data produced by FUs and some intermediate values. Finally, steering logic (RMUX) interconnects FUs and registers by means of multiplexers. The set of functional units RFU=RALUT∪RMLUT∪RMEMB is composed of LUT-based adders, LUT-based generic multipliers, and embedded multipliers. This set of FUs covers a representative set of modern FPGA devices. An FU r∈RFU is defined by its type type(r)∈{AdderLUT, MultiplierLUT, MultiplierEMB} and by its size, which depends on the input wordlengths. An operation is compatible with an FU if they have compatible types and if the size of the operation is smaller than or equal to the size of the FU [4, 6].
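For illustration, the compatibility rule above can be sketched as follows; the type names and the scalar "size" abstraction are hypothetical simplifications, not taken from the paper:

```python
# Sketch of the FU compatibility rule: types must be compatible and the
# operation must fit within the FU size. Type/operation names are illustrative.
COMPATIBLE_TYPES = {
    ("add", "AdderLUT"),
    ("mul", "MultiplierLUT"),
    ("mul", "MultiplierEMB"),
}

def is_compatible(op_type, op_size, fu_type, fu_size):
    # Compatible if the types match and the operation's size is smaller
    # than or equal to the FU's size.
    return (op_type, fu_type) in COMPATIBLE_TYPES and op_size <= fu_size
```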
Scheduling is expressed by means of function φ:O→Z+, which assigns a start time to each operation. Resource binding is divided into FU binding and register binding. FU binding makes use of the compatibility graph GCOMP(O∪R,C) [2], which indicates the compatible resources for each o∈O by means of the set of edges C⊂O×R. The binding between operations and resources is expressed by means of function β:O→R×Z+, where β(o)=(r,i) indicates that operation o is bound to the ith instance of resource r. The compatibility rules impose that (o,r)∈C. In a similar fashion, register binding links variables d∈D to registers r∈RREG by means of function γ:D→RREG×Z+. The set of variables D is extracted from O considering that there is a variable assigned to the output of each operation from the subset OM∪OG∪OA and to each delay oD connected to another delay. Registers have an associated size nr that determines the maximum allowed wordlength of the variables bound to them.
The steering logic consists of multiplexers connected at the inputs of FUs and registers. They are in charge of sending data to and from these two types of resources. RMUX is determined by φ, β, and γ, since φ determines when data are generated, β when data are used by FUs, and γ where data are stored.
2.2. Handling Resource Heterogeneity
The recent appearance of specialized blocks in FPGAs calls for new design methods to efficiently exploit their advantages. In [8], it is proposed to use a normalized resource usage vector. Given an FPGA with M different types of resources Ri (i=0⋯M-1), each type with a maximum number of |Ri| resources, the resource requirements of a particular design implementation d can be expressed as the following normalized area vector:
$$\hat{A} \equiv \left\langle \frac{\#r_0}{|R_0|}, \frac{\#r_1}{|R_1|}, \ldots, \frac{\#r_{M-1}}{|R_{M-1}|} \right\rangle,$$
where #ri is the number of resources of type Ri used. Two useful norms are the ∞-norm and the 1-norm:
$$\|\hat{A}\|_\infty = \max\left\{ \frac{\#r_0}{|R_0|}, \frac{\#r_1}{|R_1|}, \ldots, \frac{\#r_{M-1}}{|R_{M-1}|} \right\}, \qquad \|\hat{A}\|_1 = \sum_{i=0}^{M-1} \frac{\#r_i}{|R_i|}.$$
The inverse of the ∞-norm represents the number of times that the same implementation of design d can be replicated within the FPGA device (see [8]), and the 1-norm gives information about the overall resource usage of the implementation. Each norm is interesting on its own, but both have pitfalls. On the one hand, if two implementations have the same ∞-norm they can be replicated the same number of times, but there is no way to know which implementation requires fewer resources. On the other hand, the 1-norm can tell whether one design implementation requires fewer resources than another, but that does not guarantee that the implementation with fewer resources can be replicated more times. In the work presented here a combination of the ∞-norm and the 1-norm, called the +-norm (plus-norm), is proposed and applied. A metric ∥·∥+ that exploits the benefits of both norms while avoiding their drawbacks should fulfill the following condition:
$$\forall i,j:\; \|\hat{A}_i\|_+ < \|\hat{A}_j\|_+ \Rightarrow \left( \|\hat{A}_i\|_\infty < \|\hat{A}_j\|_\infty \right) \vee \left( \left( \|\hat{A}_i\|_\infty = \|\hat{A}_j\|_\infty \right) \wedge \left( \|\hat{A}_i\|_1 < \|\hat{A}_j\|_1 \right) \right).$$
This can be expressed by means of a combination of the two norms:
$$\|\hat{A}\|_+ = K \cdot \|\hat{A}\|_\infty + \|\hat{A}\|_1.$$
A feasible value of K can be found by enforcing the condition above for two areas A1 and A2, such that A1 requires only one type of resource, ∥Â1∥∞ > ∥Â2∥∞, and ∥Â2∥1 takes the largest value that ∥Â1∥∞ allows:
$$\|\hat{A}_1\|_+ > \|\hat{A}_2\|_+.$$
First, let us find upper bounds for ∥Â2∥∞ and ∥Â2∥1:
$$\|\hat{A}_2\|_\infty \leq \|\hat{A}_1\|_\infty - \frac{1}{\max(|R_i|)}, \qquad \|\hat{A}_2\|_1 \leq M \cdot \left( \|\hat{A}_1\|_\infty - \frac{1}{\max(|R_i|)} \right).$$
Substituting these bounds into the previous inequality yields
$$K \cdot \|\hat{A}_1\|_\infty + \|\hat{A}_1\|_1 > (K + M) \cdot \left( \|\hat{A}_1\|_\infty - \frac{1}{\max(|R_i|)} \right).$$
Since ∥Â1∥1 = ∥Â1∥∞ and ∥Â1∥∞ ≤ 1, a range of values of K that complies with the condition above can be expressed in terms of the number of types of resources (M) and the maximum number of resources of any type (max(|Ri|)):
$$K > (M-1) \cdot \max(|R_i|) - M.$$
This value of K guarantees that for any two implementations di and dj: (i) if ∥Âi∥+ < ∥Âj∥+ then di can be replicated more times than dj; (ii) if ∥Âi∥+ ≤ ∥Âj∥+ then di can be replicated more times than dj, or the same number of times while consuming fewer resources. Therefore, minimizing the +-norm implies that the design can be replicated within the FPGA the maximum possible number of times while using the minimum possible number of resources.
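As a minimal sketch, the +-norm can be computed as follows; the device capacities in the usage example are hypothetical, and K is chosen just above the bound derived above:

```python
# Sketch of the +-norm for a heterogeneous device. `used` and `capacity` are
# per-resource-type counts; K is set slightly above (M-1)*max(|Ri|) - M.
def plus_norm(used, capacity):
    fractions = [u / c for u, c in zip(used, capacity)]
    inf_norm = max(fractions)      # replication-oriented term
    one_norm = sum(fractions)      # overall-usage term
    M = len(capacity)
    K = (M - 1) * max(capacity) - M + 1
    return K * inf_norm + one_norm
```

On a hypothetical device with 256 slices and 4 embedded multipliers, comparing by +-norm first favors the implementation with the smaller ∞-norm and breaks ties with the 1-norm, exactly as the condition above requires.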
The metric +-norm has a low computational cost and it is suitable for integer linear programming approaches [4, 15] and heuristic approaches [6, 17].
2.3. Resource Modeling
Resources are divided into three types: functional units (RFU), registers (RREG), and steering logic (RMUX). The area and latency of FUs and registers (i.e., A(r) and l(r)) are expressed as functions of the input and output wordlength information (p and n). They are obtained by applying curve fitting to hundreds of synthesis results. The use of accurate delay cost functions has proved to provide significant performance improvements compared to other, naive approaches (from 12% to 63%; see [6]). Registers are assumed to have zero latency in terms of clock period, which holds as long as the clock frequency allows setup and hold times to be met.
Note that A is a vector with as many components as there are types of FPGA resources. Thus, it is possible to apply the +-norm to A in order to optimize the total datapath area. Multiplexer and wiring latencies are neglected; this could easily be overcome by multiplying the clock period by an empirical factor [18].
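The curve-fitting step can be sketched as an ordinary least-squares fit of a linear area model; the sample data below are made up for illustration, standing in for the hundreds of synthesis results mentioned above:

```python
# Illustrative least-squares fit of an area model A(n) ~ a*n + b from
# (wordlength, slices) samples. The sample data are hypothetical.
def fit_linear(xs, ys):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx   # slope and intercept

# Hypothetical results: an n-bit LUT-based adder costing about n/2 slices.
slope, intercept = fit_linear([8, 16, 24, 32], [4, 8, 12, 16])
```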
2.3.1. MWL Multiplexers
The area of multiplexers in UWL systems is only affected by the data wordlength, which sets the multiplexer size, and by the number of different data sources (e.g., registers or FUs), which determines the multiplexer width. An estimation of the area of an N-input multiplexer of wordlength M for Virtex-II devices is given by
$$A_{MUX} = \frac{M \cdot N}{4}\ \text{slices}.$$
This estimation is specific to Virtex-II, Spartan-3, and Virtex-4, since their implementation of multiplexers relies on the combination of 4-input LUTs and dedicated multiplexers. Other FPGA architectures (e.g., Altera's Stratix-II) that make use only of 4-input LUTs would require a different estimation.
In MWL systems, data must be aligned before being processed by FUs or stored by registers. In [19] the problem of data alignment and multiplexing is tackled by means of alignment blocks introduced before multiplexers. In this work, multiplexers are used for both data multiplexing and data alignment, since the combination of these two tasks leads to a reduction in the number of control signals, and therefore, control logic. In addition, the chances for logic optimization are greater than if two separate blocks (an alignment block and a multiplexer) are used.
Alignment is required at the inputs of adders and at the outputs of both adders and multipliers. On one hand, adders require the alignment of their inputs in order to obtain a meaningful result. If an adder is shared to compute several additions (i.e., a1 and a2), an alignment block is required to arrange the MSB of the inputs in the right position for each operation (different alignments will be necessary for a1 and a2). On the other hand, the outputs of the different arithmetic operations—both additions and multiplications—in an algorithm can have their MSBs in different positions. Again, if the FUs are shared, the output's MSB changes its position depending on the operation executed; therefore, it is necessary to dynamically align the FU's output in order to store the data in a register.
Figure 2 presents three different types of alignment for a 4-input multiplexer with input signals a, b, c, and d and output o: arbitrary alignment (see Figure 2(a)), least significant bit (LSB) alignment (see Figure 2(b)), and most significant bit (MSB) alignment (see Figure 2(c)). Note that sign extension (see Figures 2(a) and 2(b)) does not offer any opportunity for logic optimization, while zero padding (see Figures 2(a) and 2(c)) does, due to the reduction in the number of signals and the introduction of constant bits (zeros) that can be hard-wired into the multiplexer logic. In fact, MSB alignment (see Figure 2(c)) is the option that allows the greatest logic reduction. Therefore, it is recommended to apply this alignment whenever possible.
Figure 2: Signal alignment: (a) arbitrary, (b) LSB, (c) MSB.
A lower bound on the multiplexers' area when MSB alignment is adopted can be computed as
$$\underline{A_{MUX}} = \frac{1}{4} \sum_{i=0}^{N-1} (n_i + 1)\ \text{slices},$$
where ni is the wordlength of input signal i.
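Both estimates can be sketched directly from the formulas above; the wordlengths in the usage example are hypothetical:

```python
# Multiplexer area estimates from Section 2.3.1 (Virtex-II-style slices).
def mux_area_uwl(M, N):
    # UWL estimate: N-input multiplexer of wordlength M.
    return M * N / 4.0

def mux_area_msb_aligned(wordlengths):
    # Lower bound with MSB alignment: input i contributes n_i + 1 bits
    # (wordlength plus sign); constant padding bits are hard-wired away.
    return sum(n + 1 for n in wordlengths) / 4.0
```

For instance, a 4-input, 16-bit UWL multiplexer is estimated at 16 slices, while MSB-aligning inputs of wordlengths 15, 7, 3, and 11 lowers the bound to 10 slices.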
2.4. Optimization Procedure
In this subsection we extend the work presented in [6, 17], where the optimization was steered by the ∞-norm and registers and multiplexers were not considered. The optimization procedure is based on Simulated Annealing (SA) [20] and is shown in Algorithm 1. The inputs are the sequencing graph GSEQ(O,S) and the total latency constraint λ. The optimization procedure determines the set of resources of the datapath R=RFU∪RREG∪RMUX, the scheduling φ, the FU binding β, and the register binding γ, which define the datapath, the steering logic, and the timing of the control signals.
Algorithm 1: Optimization procedure.
Input: GSEQ(O,S), λ
Output: R = RFU ∪ RREG ∪ RMUX, φ, β, γ
(1) Extract GCOMP(O, RFU), RFU, RREG
(2) Find initial mapping m0
(3) Compute initial area A0 from m0
(4) Amin = Alast = A0
    mmin = mlast = m0
    T = T0
    iterations = accepted = exit = 0
(5) while ¬ exit condition do
(6)   m = mlast
(7)   iterations = iterations + 1
(8)   Perform change to current m
(9)   Compute area A from m (Algorithm 2)
(10)  ΔA = (∥Â∥n − ∥Âlast∥n) / ∥Â0∥n
(11)  if ΔA < 0 then
(12)    Amin = Alast = A
        mmin = mlast = m
        accepted = accepted + 1
(13)  else
(14)    p = e^(−ΔA/T), r = rand[0⋯1)
(15)    if r ≤ p then
(16)      Alast = A, mlast = m
          accepted = accepted + 1
(17)    end if
(18)  end if
(19)  if equilibrium state then
(20)    T = αT · T
(21)    iterations = accepted = exit = 0
(22)  else if frozen state then
(23)    T = αT · T
(24)    iterations = accepted = 0
(25)    exit = exit + 1
(26)  end if
(27)  if restart condition then
(28)    Alast = Amin
        mlast = mmin
(29)  end if
(30) end while
2.4.1. Simulated Annealing
First, the set of functional units RFU, the set of registers RREG, and the compatibility graph GCOMP(O,RFU) are extracted (line 1). An initial resource mapping m0 is selected by mapping each operation to the fastest LUT-based resource among the available compatible resources for that operation (line 2), and the area A0 occupied by the resulting datapath is used as the initial area (line 3). From this point (lines 5–30), the optimization proceeds following the typical SA behavior: the algorithm iterates while producing changes (line 8)—also referred to as movements—that modify the value of the cost function (i.e., area) until a certain exit condition is reached. If these changes lead to a cost reduction, they are accepted (line 11); if not, they are accepted with a certain probability that depends on the current temperature T (line 15). The temperature starts at a high value and decreases with time. Most movements are accepted at the beginning of the process, thus enabling a wide design-space exploration. As the temperature decreases, only those movements which produce small cost deviations are accepted. The temperature is decreased when the equilibrium state is reached (line 19). Sporadic restarting [21] is also allowed (line 27), which repositions the optimization variables at the last minimum state found.
A summary of the simulated annealing parameters and conditions is given in Table 1. The annealing factor of 0.95 was chosen empirically, aiming to balance the tradeoff between optimality and solving time.
Table 1: Simulated annealing parameters and conditions.

λ: latency constraint (in clock cycles)
αT: annealing factor (0.95)
|O|: number of operations in the algorithm
Equilibrium state: accepted > |O| × |O|
Frozen state: iterations > k · |O| × |O|, k > 1.0, 3 consecutive times
Restart condition: Alast > 1.5 · Amin
The variation in cost ΔA is normalized with respect to the initial area A0 (line 10). This is a simple way to ensure that the behavior of SA is not affected by the complexity of the algorithm [22], which is approximated by ∥Â0∥n. The value of n must be set to 1 (or ∞) for homogeneous-architecture FPGAs, and to “+’’ for heterogeneous-architecture FPGAs.
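The acceptance test of lines 10–16 follows the standard Metropolis rule; a minimal sketch, assuming ΔA has already been normalized:

```python
import math
import random

def accept_move(delta_a, temperature, rng=random):
    # Standard Metropolis acceptance: always accept improvements
    # (delta_a < 0); otherwise accept with probability exp(-delta_a / T).
    if delta_a < 0:
        return True
    return rng.random() <= math.exp(-delta_a / temperature)
```

At high temperature nearly all movements pass this test, which is what enables the wide design-space exploration described above.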
The changes to the cost function (line 8) are performed by applying, with equal probability, one of the following movements to the resource mapping function m:
MA: map an operation o∈O to an unmapped resource,
MB: map an operation o to another, already mapped resource,
MC: swap the mapping of two compatible operations mapped to different resources.
2.4.2. Area Computation
The computation of the area cost is shown in Algorithm 2. First, it is checked whether the current resource mapping m complies with latency λ (lines 1–4). If it does not, the actual latency λ′ is computed. Later on (line 26), any deviation from the design constraints is penalized by means of increasing the area cost of the solution. Thus, solutions that do not meet the latency constraint are included within the design space exploration [23]. Even though these solutions are never accepted as valid, their inclusion allows a wider architecture exploration than rejecting solutions that do not comply with λ.
Algorithm 2: Computation of area cost.
Input: GSEQ(O,S), λ
Output: R = RFU ∪ RREG ∪ RMUX, φ, β, γ, A
(1) Compute minimum latency λ′ for mapping m
(2) if λ′ < λ then
(3)   λ′ = λ
(4) end if
(5) Find set of functional units RFU′ = {r0′, …, rN−1′} with mapped operations
(6) for all r′ ∈ RFU′: compute instances lower bound inst̲(r′) [24] and upper bound inst̄(r′)
Then, the resource allocation and resource binding that minimize FU area are sought by means of a loop where several list-based scheduling operations are performed (lines 5–18). The purpose of the loop is to check different combinations of the number of instances of the resources. Both lower [24] and upper bounds on the number of instances for each resource are computed (line 6). All combinations of possible instances are computed and stored in the set of vectors I. The list-based scheduling performs an ASAP scheduling of the operations sorted by mobility in ascending order, providing a fast way to find a valid solution. Note that the size of I is pruned as the loop iterates; all combinations of FU instances that require areas greater than the minimum found so far are removed (line 15). Thus, resource allocation is sped up.
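The ASAP step can be sketched on a small, hypothetical data-flow graph with unit latencies assumed for every operation:

```python
from graphlib import TopologicalSorter

def asap_schedule(preds, latency):
    # ASAP scheduling on a DAG: each operation starts as soon as all of its
    # predecessors have finished. preds maps op -> list of predecessor ops.
    start = {}
    for op in TopologicalSorter(preds).static_order():
        start[op] = max(
            (start[p] + latency[p] for p in preds.get(op, [])), default=0
        )
    return start
```

For a hypothetical graph with two multiplications feeding a chain of two additions, the multiplications start at cycle 0 and the additions at cycles 1 and 2.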
Once the scheduling with minimum FU area is found, the datapath is defined. The tasks of register binding and multiplexer allocation are not commonly included within the optimization loop, in spite of their impact on the final architecture. In this work, these two tasks are part of the optimization procedure.
Register binding is performed by applying a left-edge algorithm [13]. Input signals are assumed to be available during all λ cycles and do not require storing. Each variable assigned to a delay is initially assigned a register, and after that, the left-edge algorithm is applied as usual.
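A minimal sketch of the left-edge binding, assuming variable lifetimes are given as [birth, death) intervals in clock cycles:

```python
def left_edge(lifetimes):
    # Left-edge algorithm: process variables by increasing birth time and
    # pack each into the first register whose last lifetime already ended.
    order = sorted(range(len(lifetimes)), key=lambda i: lifetimes[i][0])
    reg_free_at = []    # per register: cycle at which it becomes free
    binding = {}
    for v in order:
        birth, death = lifetimes[v]
        for r, free_at in enumerate(reg_free_at):
            if free_at <= birth:
                binding[v] = r
                reg_free_at[r] = death
                break
        else:
            binding[v] = len(reg_free_at)   # open a new register
            reg_free_at.append(death)
    return binding
```

Three variables with lifetimes [0,2), [1,3), and [2,4) fit in two registers, the third variable reusing the first register as soon as it is free.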
From sets RFU and RREG and functions φ, β, and γ it is possible to extract the steering logic resources RMUX. Registers have a single multiplexer (see Figure 1), while FUs have two. A goal of multiplexer definition is to maximize the use of MSB alignment. This alignment can be applied directly to registers and multipliers. However, adders require that their inputs be aligned to each other. Thus, if MSB alignment is applied to the multiplexer connected to one of the inputs, it is not possible to do so for the remaining multiplexer, and vice versa. Finally, the control signals can be easily extracted from the scheduling contained in φ.
The area vector is computed by adding the area of each resource multiplied by the number of instances required (line 25). If λ′>λ the area is penalized by means of factor αλ. If the implementation does not comply with the latency constraint and the resulting penalized area is smaller than Amin, the area is forced to be bigger than Amin (see line 28).
Summarizing, the optimization procedure is controlled by iteratively changing the mapping between operations and FUs. These changes impact the structure of the datapath and, therefore, its area cost, which is the function to be minimized. This method provides a robust way to simultaneously perform the tasks of scheduling, resource allocation, and resource binding for multiple wordlength systems. This procedure was satisfactorily applied in [25].
2.5. Wordlength Optimization: A Case Study
Let us introduce this section through a simple LTI case study (Algorithm 3).
Algorithm 3: Case study.
Input: a, b, c ∈ (−1/2, 1/2), uniformly distributed
Output: d
(1) while true do
(2)   Get new value of a, b, and c
(3)   m1 = a * 2.384
(4)   m2 = b * 0.0036
(5)   s1 = m1 + m2
(6)   s2 = s1 + c
(7)   New value of output: d = s2
(8) end while
Algorithm 3 performs the weighted summation of three signals. The operations involved are two constant multiplications (i.e., gains) and two additions. There are a total of 8 signals.
The goal of WLO is to define, for each signal, the fixed-point format that enables a hardware implementation of the algorithm. The fixed-point format, as mentioned in Section 2.1, is composed of the pair (p,n). Thus, the ultimate goal of WLO is to find the proper set of (p,n) pairs to optimize the hardware realization of an algorithm. Figure 3 depicts the meaning of these parameters: p is the distance in bits from the binary point to the MSB (a zero distance implies that there is no integer part in the number); n is the number of bits used to represent the number, not considering the sign bit. A common way to address WLO is to split it into two sequential subtasks: scaling, where the values of p are selected, and wordlength selection, where the values of n are chosen.
Figure 3: Fixed-point format.
Scaling is performed by running a floating-point simulation, gathering the maximum absolute value of each signal s, and computing
$$p_s = \lfloor \log_2 (\max |s|) \rfloor + 1.$$
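A sketch of the scaling step, gathering max|s| from a trace of simulated samples (the sample values below are hypothetical):

```python
import math

def scale(samples):
    # p = floor(log2(max |s|)) + 1, from a floating-point simulation trace.
    return math.floor(math.log2(max(abs(s) for s in samples))) + 1
```

For a signal whose samples stay within (−1/2, 1/2), this yields p = −1, matching the MWL scaling of signals a, b, and c in Table 2.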
Once scaling is accomplished, the values of p are fixed and the values of n are obtained through an optimization process (wordlength selection). The number of bits assigned to a signal (i.e., n) determines the quantization noise that the signal introduces and, therefore, has a high impact on the final precision of the system, producing an error in the output signal. During the optimization process, different combinations of n are tried in order to find a particular set that minimizes cost (i.e., area, speed, power) while complying with the output error constraint. The error of the system is typically measured in terms of the peak error value [5, 26], the signal-to-quantization-noise ratio (SQNR) [11, 27], or the variance of the output error [15]. In this work, we adopt the variance of the output error (σ2).
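The effect of choosing n can be sketched with a simple quantizer: a format (p,n) implies a quantization step of 2^(p−n), and the classic uniform-noise model assigns rounding a noise variance of step²/12. This standard model is sketched here for illustration:

```python
import math

def quantize(x, p, n):
    # Round x to the fixed-point grid of format (p, n); the quantization
    # step is 2^(p - n).
    step = 2.0 ** (p - n)
    return round(x / step) * step

def rounding_noise_variance(p, n):
    # Classic uniform-noise model: sigma^2 = step^2 / 12.
    step = 2.0 ** (p - n)
    return step ** 2 / 12.0
```

Note the consistency with the benchmark noise formula used later in Section 3, σ2 = 2^(2p−2n)/12.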
Table 2 contains the fixed-point formats (i.e., (p,n)) of the signals of Algorithm 3 for both the UWL and MWL WLO approaches and for different error constraints (σ2 = 10-5 and σ2 = 10-6). The UWL synthesis is achieved by computing the minimum values of p and n such that, if applied to all signals, the fixed-point realization complies with the noise constraint [15]. The MWL synthesis is achieved by means of an SA-based approach, which minimizes the area of a resource-dedicated implementation (with no resource sharing) [28].
Table 2: UWL versus MWL wordlength optimization.

         σ2 = 10-5               σ2 = 10-6
Signal   UWL (p,n)   MWL (p,n)   UWL (p,n)   MWL (p,n)
a        (1,9)       (-1,8)      (1,11)      (-1,10)
b        (1,9)       (-1,2)      (1,11)      (-1,5)
c        (1,9)       (-1,7)      (1,11)      (-1,8)
m1       (1,9)       (1,9)       (1,11)      (1,10)
m2       (1,9)       (-5,3)      (1,11)      (-5,4)
s1       (1,9)       (1,9)       (1,11)      (1,10)
s2       (1,9)       (1,8)       (1,11)      (1,10)
d        (1,9)       (1,8)       (1,11)      (1,10)
Let us focus on the results for σ2 = 10-5. The UWL approach clearly requires longer wordlengths than the MWL approach, mainly because the UWL optimization must apply the worst-case format to every signal. Also, note that some signals' wordlengths are reduced considerably in the MWL approach (b and m2). This is due to the fact that signal b is multiplied by a small constant, so the quantization noise introduced is also small. Similar results are obtained for σ2 = 10-6; in this case the values of n are larger since the error constraint is more restrictive.
Summarizing, the MWL approach enables the generation of fixed-point realizations that require fewer bits. The only drawback is that the complexity of the design process increases, and techniques such as the one proposed in this section are required.
3. Results
Here, the implementation results are presented. The following benchmarks are used:
ITU RGB to YCrCb converter (ITU) [15],
3rd-order lattice filter (LAT3) [29],
4th-order IIR filter (IIR4) [30],
8th-order linear-phase FIR filter (FIR8).
All algorithms are assigned 8-bit inputs and 12-bit constant coefficients. The algorithm implementations have been tested under different latency and output noise constraint scenarios, assuming a system clock of 125 MHz. In particular, the noise constraints were σ2 = {10-k, 10-(k+1), 10-(k+2)}, where k is the minimum number that makes 10-k as close as possible to the variance of the quantization noise that the output of the benchmark would present if quantized to 8 bits (σ2 = 2^(2p-2n)/12, evaluated at n = 7).
The target devices belong to the Xilinx Virtex-II family. The area results are normalized with respect to the XC2V40 device (256 slices, 4 embedded 18×18 multipliers) and expressed according to (2). For instance, an area vector with an ∞-norm equal to or smaller than 1.0 implies that the XC2V40 is the smallest-cost device able to hold the design, whereas an ∞-norm greater than 1.0 and equal to or smaller than 2.0 implies that the smallest-cost device able to hold the design is the XC2V80, and so on.
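The normalization can be sketched as follows, assuming (as the text's examples suggest) that each step up in the family doubles the XC2V40's capacities; that doubling is an illustrative assumption, not a datasheet fact:

```python
import math

XC2V40 = (256, 4)   # slices, embedded 18x18 multipliers

def normalized_area(used):
    # Area vector normalized to the XC2V40, as in the vector of Section 2.2.
    return tuple(u / c for u, c in zip(used, XC2V40))

def device_multiple(used):
    # Ceiling of the inf-norm: 1 -> the XC2V40 fits the design, 2 -> a
    # device with twice the capacity (XC2V80, per the text) is needed, etc.
    return math.ceil(max(normalized_area(used)))
```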
Before AS, each algorithm is translated into a fixed-point specification by means of two wordlength optimization procedures, which follow a UWL approach and an MWL approach, respectively.
The area results in this section are computed using the resource model explained in Section 2.3, which provides a good estimation of actual synthesis results.
3.1. Uniform Wordlength versus Multiple Wordlength Synthesis: Homogeneous Architectures
Figures 4 and 5 display results of the comparison of UWL versus MWL synthesis using a homogeneous-resource architecture (i.e., only LUT-based resources). Note that the subfigures are arranged in pairs, each pair related to the same benchmark. The left subfigures depict the area versus latency curves for a particular output noise constraint (see Figures 4(a), 4(c), 5(a), and 5(c)), while the right subfigures contain the detailed resource distribution graph for a particular point of their counterpart (see Figures 4(b), 4(d), 5(b), and 5(d)). Let us define λminUWL-HOM as the minimum latency attainable for a UWL homogeneous implementation of an algorithm, and λminMWL-HOM as the equivalent for an MWL homogeneous implementation. The latency used for the experiments ranges from λminMWL-HOM to λminUWL-HOM + 10.
Figure 4: UWL versus MWL: homogeneous implementations (I). (a) ITU, σ2 = 10-1; (b) ITU, σ2 = 10-1, λ = 9; (c) LAT3, σ2 = 10-5; (d) LAT3, σ2 = 10-5, λ = 22.
Figure 5: UWL versus MWL: homogeneous implementations (II). (a) IIR4, σ2 = 10-3; (b) IIR4, σ2 = 10-3, λ = 17; (c) FIR8, σ2 = 10-4; (d) FIR8, σ2 = 10-4, λ = 8.
Figures 4(a) and 4(b) contain the implementation results of the ITU benchmark with an output noise variance of 10-1. Figure 4(a) depicts how both the UWL and MWL areas decrease as the latency increases. This is expected, since the greater the latency, the greater the chance of FU reuse. The comparison of the two implementation curves shows that the improvement obtained by means of the MWL approach ranges from 51% to 77%. Also, the minimum latency that each implementation achieves differs considerably. The fine-grain tradeoff between area and quantization noise performed by the MWL approach allows important area reductions when compared to the UWL approach. Figure 4(b) displays the detailed resource distribution for the ITU UWL and MWL implementations corresponding to σ2 = 10-1 and λ = 9. The overall area savings are 77%, due to the fact that the wordlengths of the majority of signals, which impact FUs, multiplexers, and registers, have been greatly reduced: the FUs' area is reduced by 83%, the FUs' multiplexers by 59%, the registers by 62%, and the registers' multiplexers by 39%. It is important to highlight that the area due to multiplexers and registers, although smaller than the FUs' area, makes up a significant part of the total area (20% for UWL and 39% for MWL); hence the importance of including its cost within the optimization loop, which is analyzed in Section 3.4.
The other benchmarks also show large area improvements: LAT3 up to 49% (see Figure 4(c)), IIR4 up to 49% (see Figure 5(a)), and FIR8 up to 28% (see Figure 5(c)). As observed in the detailed resource distribution subfigures (see Figures 4(d), 5(b), and 5(d)), the area of the majority of the resources has been greatly decreased. Also, note that the percentage of area devoted to multiplexing and data storage is high in proportion to the overall implementation area. The minimum latency is also improved (see Figures 4(a), 4(b), and 5(a)).
In Figure 4(c) the MWL area does not decrease as the latency increases. This is due to the fact that the wordlengths are small enough to allow maximum resource sharing at all latencies, hence the coincidence of the area results for the MWL implementations. This situation might change if a different error constraint (σ2) is applied during WLO.
Table 3 contains the implementation results for all the benchmarks corresponding to three different quantization noise scenarios. For each quantization scenario the latency ranges from λminUWL-HOM to λminUWL-HOM + 10, and the minimum, maximum, and mean values of the area improvements obtained by the MWL implementations with respect to the UWL implementations are computed. The first column of the table contains the name of the benchmark, the second the output noise variance, and the third the area improvement values. The last row holds the minimum, maximum, and average improvements considering all results simultaneously.
Table 3: UWL versus MWL for homogeneous architectures.

Bench.   σ2      Area improvement (%)
                 Min      Max      Mean
ITU      10-1    51.36    77.66    68.32
         10-2    48.44    76.31    66.41
         10-3    46.51    75.40    65.13
LAT3     10-3    44.07    44.07    44.07
         10-4    33.66    33.66    33.66
         10-5    33.62    49.89    45.42
IIR4     10-3    37.85    49.86    39.69
         10-4    34.30    63.55    50.65
         10-5    37.08    64.99    52.72
FIR8     10-3    40.28    47.68    43.54
         10-4    24.16    28.10    25.14
         10-5    22.04    25.73    22.82
All              22.04    77.66    46.46
The area improvements obtained are remarkable: they range from 22% to 78%, with an overall average of 46%. Note that the minimum improvements obtained for all benchmarks are quite close to both the maximum and the mean. The results clearly show that an MWL-based AS approach achieves significant area reductions.
Regarding latency, the minimum latency achievable by UWL implementations is reduced on average by 22% by means of MWL AS.
3.2. Uniform Wordlength versus Multiple Wordlength Synthesis: Heterogeneous Architectures
Figures 6 and 7 contain results on the comparison of UWL versus MWL synthesis using a heterogeneous-resource architecture (i.e., both LUT-based and embedded resources present). The arrangement of figures is similar to that of the previous subsection. Now, the latency ranges from λminMWL-HET to λminUWL-HET+10 (HET indicates heterogeneous implementations).
Figure 4: UWL versus MWL, homogeneous implementations (I): (a) ITU, σ² = 10⁻¹; (b) ITU, σ² = 10⁻¹, λ = 9; (c) LAT3, σ² = 10⁻⁵; (d) LAT3, σ² = 10⁻⁵, λ = 22.
Figure 5: UWL versus MWL, homogeneous implementations (II): (a) IIR4, σ² = 10⁻³; (b) IIR4, σ² = 10⁻³, λ = 17; (c) FIR8, σ² = 10⁻⁴; (d) FIR8, σ² = 10⁻⁴, λ = 8.
Figures 6(a), 6(c), 7(a), and 7(c) contain the implementation area versus latency curves. The graphs clearly show how the area is reduced by means of MWL synthesis: ITU up to 79% (see Figure 6(a)), LAT3 up to 35% (see Figure 6(c)), IIR4 up to 40% (see Figure 7(a)), and FIR8 up to 26% (see Figure 7(c)). The detailed resource distribution in Figures 6(b), 6(d), 7(b), and 7(d) shows how the majority of resources are reduced; in particular, the embedded multipliers and the FUs' multiplexers are clearly optimized. For instance, the ITU resource distribution for σ² = 10⁻² and λ = 10 (see Figure 6(b)) shows an overall area reduction of 72%. The LUT-based resources are reduced by 59% (the LUT-based FUs' area has been reduced by 32%, the FUs' multiplexers by 74%, the registers by 32%, and the registers' multiplexers by 36%), while the embedded FUs are reduced by 75%.
Note that the area of embedded resources in Figures 6(d) and 7(d) is the same for both the UWL and MWL approaches; in fact, a single multiplier is being used (1 out of 4). This happens because the wordlengths involved in multiplications, though not the same, are small enough in both the UWL and MWL approaches to enable the use of a single embedded multiplier. However, the LUT-based areas are quite different and, as a result, the overall resource usage is much smaller for the MWL implementation.
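This sharing condition can be sketched as a simple wordlength check. The 18-bit operand width below is an assumption chosen for illustration (actual embedded multiplier widths are device-dependent and are not specified here):

```python
EMB_MULT_WIDTH = 18  # assumed embedded multiplier operand width (device-dependent)

def fits_embedded(wl_a, wl_b, width=EMB_MULT_WIDTH):
    """True if a wl_a x wl_b multiplication fits a single embedded
    multiplier block without cascading."""
    return wl_a <= width and wl_b <= width

# Two multiplications with different (MWL) wordlengths can still share one
# embedded block as long as both fit, e.g. 12x10 and 14x9 both map to 18x18,
# which is why the UWL and MWL embedded areas coincide in Figures 6(d) and 7(d).
```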
In Figure 6(c) the UWL and MWL areas do not decrease as the latency increases. Again, this is due to the fact that the particular wordlengths obtained allow maximum resource sharing for all latencies. Different error constraints (σ²) might change this situation.
Again, the figures show how the minimum latency can be greatly improved by means of an MWL approach. Also, it can be seen that the LUT-based resources are devoted almost entirely to data multiplexing and storage.
Table 4 contains the implementation results of all the benchmarks corresponding to three different quantization noise scenarios. For each quantization scenario the latency ranges from λminUWL-HET to λminUWL-HET+10, and the minimum, maximum, and mean area improvements obtained by the MWL implementations in comparison to the UWL implementations are computed considering the ∞-norm area, the LUT-based area, and the embedded FUs' area. The first column of the table contains the name of the benchmark; the second, the output noise variance applied; the third, the minimum, maximum, and mean ∞-norm area improvement values; the fourth, the corresponding values for the LUT-based area; and the last, the corresponding values for the embedded FUs' area.
Table 4: UWL versus MWL for heterogeneous architectures.

Bench.   σ²      ∥Â∥∞ (%)                  A_LUT (%)                 A_EMB (%)
                 Min     Max     Mean      Min     Max     Mean      Min     Max     Mean
ITU      10⁻¹    54.85   80.77   73.69     55.56   81.07   63.19     50.00   75.00   72.73
ITU      10⁻²    52.37   79.71   71.37     52.37   79.71   60.41     50.00   75.00   72.73
ITU      10⁻³    47.80   77.76   66.71     47.80   77.76   56.33     50.00   75.00   72.73
LAT3     10⁻³    48.87   48.87   48.87     48.87   48.87   48.87     0.00    0.00    0.00
LAT3     10⁻⁴    35.84   35.84   35.84     35.84   35.84   35.84     0.00    0.00    0.00
LAT3     10⁻⁵    31.70   31.70   31.70     31.70   31.70   31.70     0.00    0.00    0.00
IIR4     10⁻³    37.47   41.27   38.33     37.47   45.10   39.05     0.00    0.00    0.00
IIR4     10⁻⁴    32.97   40.70   34.74     32.97   40.70   34.74     0.00    0.00    0.00
IIR4     10⁻⁵    40.32   65.13   49.57     40.32   65.13   48.55     50.00   83.33   71.67
FIR8     10⁻³    32.45   38.01   33.97     38.79   43.83   39.68     0.00    0.00    0.00
FIR8     10⁻⁴    27.53   32.83   28.62     27.53   32.83   28.62     0.00    0.00    0.00
FIR8     10⁻⁵    19.53   26.12   21.17     19.53   26.12   21.17     0.00    0.00    0.00
All              19.53   80.77   44.88     19.53   81.07   42.76     0.00    83.33   24.03
The area improvements obtained are considerable: ITU obtains up to 80.77%, LAT3 up to 48.87%, IIR4 up to 65.13%, and FIR8 up to 38.01%. Note that the minimum improvements obtained for most of the benchmarks are again quite close to both the maximum and the mean.
The LUT-based area reductions are up to 81.07% for ITU, up to 48.87% for LAT3, up to 65.13% for IIR4, and up to 43.83% for FIR8. The embedded resources are only reduced for benchmarks ITU (up to 75.00%) and IIR4 (up to 83.33%). Benchmarks LAT3 and FIR8 use the minimum possible number of embedded resources (one embedded multiplier), hence the 0% improvement.
Area improvements up to 80.77% are achieved. The average improvement is 44.88% for the overall area, 42.76% for the LUT-based resources, and 24.03% for the embedded resources. The results clearly show that an MWL-based approach for AS leads to significant area reductions.
As a final note regarding these area results, the authors would like to emphasize that the plus-norm has been used during the optimization process, but it is not used to present the results, as it cannot be directly related to the percentage of occupation of the FPGA; the ∞-norm is used instead.
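As an illustration of why the ∞-norm maps directly to device occupation, it reports the usage fraction of the most heavily loaded resource type, which is the one that limits how much of the FPGA remains available. The resource figures in the comment are hypothetical:

```python
def inf_norm_usage(used, capacity):
    """∞-norm of the normalized resource-usage vector: the occupation of
    the most heavily used resource type (e.g. LUT-based resources and
    embedded multipliers). This is directly a fraction of the device."""
    return max(u / c for u, c in zip(used, capacity))

# e.g. hypothetically using 4000 of 10000 LUTs and 3 of 4 embedded
# multipliers gives an occupation of 0.75: the embedded multipliers,
# not the LUTs, bound the design.
```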
The latency analysis shows that the minimum UWL latency is reduced by an average of 19% by means of MWL AS.
3.3. MWL Synthesis: Heterogeneous versus Homogeneous
Table 5 contains the implementation results of all the benchmarks corresponding to three different quantization noise scenarios. For each quantization scenario the latency ranges from λminMWL-HOM to λminMWL-HOM+10, and the minimum, maximum, and mean values of the area improvements, in terms of the ∞-norm, obtained by the heterogeneous MWL implementations in comparison to the homogeneous MWL implementations are computed. The first column of the table contains the name of the benchmark; the second, the output noise variance applied; and the third, the minimum, maximum, and mean area improvement values.
Table 5: MWL synthesis: homogeneous versus heterogeneous architectures.

Bench.   σ²      ∥Â∥∞ (%)
                 Min      Max      Mean
ITU      10⁻¹    40.73    52.69    51.60
ITU      10⁻²    43.29    53.98    53.01
ITU      10⁻³    50.45    54.76    54.32
LAT3     10⁻³    42.68    43.09    42.72
LAT3     10⁻⁴    39.36    39.36    39.36
LAT3     10⁻⁵    38.73    39.74    38.82
IIR4     10⁻³    34.83    44.77    36.04
IIR4     10⁻⁴    32.92    48.23    35.96
IIR4     10⁻⁵    33.21    48.79    35.79
FIR8     10⁻³    21.42    41.46    28.37
FIR8     10⁻⁴    27.02    44.68    33.24
FIR8     10⁻⁵    27.46    44.62    33.52
All              21.42    54.76    40.23
The area improvements obtained are remarkable: ITU obtains up to 54.76%, LAT3 up to 43.09%, IIR4 up to 48.79%, and FIR8 up to 44.68%. Note that, again, the minimum improvements obtained for all benchmarks are quite close to both the maximum and the mean. Area improvements of up to 54.76% are achieved, with an average improvement of 40.23%. The results clearly show that the inclusion of embedded resources within AS leads to highly optimized DSP implementations.
Regarding latencies, the minimum latency achievable by both homogeneous and heterogeneous implementations is the same for the experiments performed. This is due to the fact that the latencies of the resources are very similar under the particular conditions used for the tests. When the same experiments were repeated with the constant wordlength increased to 16 bits, the heterogeneous implementations reduced the minimum latency by 7% in contrast to the homogeneous implementations.
3.4. Effect of Registers and Multiplexers
In this subsection the effect of including the cost of registers and multiplexers within the optimization loop is investigated. As in the previous experiments, the analysis is performed by implementing the benchmarks under different noise and latency constraints. Before AS is applied, a gradient-descent quantization [28] is performed according to the given noise constraint. The comparison is done by using Algorithm 1 to perform the AS with two different area cost estimation solutions: (i) Algorithm 2, which is referred to as the complete area estimation algorithm, and (ii) a simplified version of Algorithm 2 (the simplified area estimation algorithm), where the cost of registers and multiplexers is neglected. When the simplified area estimation is used, the cost of registers and multiplexers is added after the optimization loop has finished its execution, using the complete area estimation (Algorithm 2).
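The difference between the two estimators can be sketched as follows; the per-resource area figures are hypothetical, and the detailed per-resource models of Algorithm 2 are not reproduced here:

```python
def datapath_area(fu_areas, mux_areas, reg_areas, complete=True):
    """Datapath area estimate. The complete estimate (Algorithm 2 style)
    sums FUs, multiplexers, and registers; the simplified variant
    (complete=False) counts FUs only, so data-steering and storage costs
    are invisible to the optimizer and are only added back afterwards."""
    area = sum(fu_areas)
    if complete:
        area += sum(mux_areas) + sum(reg_areas)
    return area

# With hypothetical figures, a binding that looks cheaper under the
# simplified estimate can turn out worse once mux/register costs are added:
# datapath_area([100, 50], [20], [30]) -> 200 (complete)
# datapath_area([100, 50], [20], [30], complete=False) -> 150 (simplified)
```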
Table 6 contains the results of this analysis. The latencies range from λminARCH to λminARCH+10, where ARCH refers to the type of FPGA architecture used (homogeneous or heterogeneous). The noise constraints are the same as in the previous subsections (three values of σ² for each benchmark), though the results have been combined into a single row. The first column contains the type of FPGA architecture; the second, the benchmark used; and the third, the minimum, maximum, and average area improvements obtained by the complete area estimation synthesis in contrast to the simplified area estimation synthesis. The last row includes the minimum, maximum, and mean improvements for all benchmarks.
Table 6: Complete versus simplified cost estimation: area improvement (%).

Arch.   Bench.   Min     Max      Mean
HOM     ITU      0.00    0.95     0.30
HOM     LAT3     0.71    3.50     1.53
HOM     IIR4     0.00    5.35     1.26
HOM     FIR8     0.00    1.77     0.31
HET     ITU      0.00    25.85    1.89
HET     LAT3     1.15    5.77     2.52
HET     IIR4     0.00    35.57    8.09
HET     FIR8     0.00    8.11     0.83
All              0.00    35.57    2.09
The average improvements for the different benchmarks range from 0.30% to 8.09%, with an overall average improvement of 2.09%. The maximum improvement found is 35.57%. These results clearly show that failing to include the cost of registers and multiplexers during the optimization procedure can lead to unwanted area penalties.
4. Conclusions
In this paper an architectural synthesis procedure able to produce optimized fixed-point implementations using modern FPGA devices has been presented. The key to success is the use of highly accurate models of the datapath resources, a complete datapath resource set that includes multiplexers and registers, a novel method to handle fixed-point data alignment and multiplexing, and the introduction of a novel resource usage metric that can cope with both LUT-based and embedded FPGA resources.
The AS procedure produces area improvements of up to 80% when compared to uniform-wordlength implementations, and latency improvements of up to 22%. The efficient use of embedded resources achieves area improvements of up to 54% when compared to homogeneous implementations. Also, the inefficiency of current FPGA architectures at implementing data steering was exposed.
These results are intended to be further improved by tightly combining the fixed-point refinement process with the architectural synthesis [4, 31]. The inclusion of the control logic in the resource model is also regarded as a future research line.
Acknowledgment
This work was supported by the Spanish Ministry of Education and Science under Research Project TEC2006-13067-C03-03.
References

[1] Kum K.-I., Sung W., "Combined word-length optimization and high-level synthesis of digital signal processing systems," vol. 20, no. 8, pp. 921–930, 2001. doi:10.1109/43.936374
[2] Constantinides G., Cheung P., Luk W., "Heuristic datapath allocation for multiple wordlength systems," in Proc. Conference on Design, Automation and Test in Europe (DATE '01), Munich, Germany, 2001, pp. 791–796.
[3] Cong J., Fan Y., Han G., "Bitwidth-aware scheduling and binding in high-level synthesis," in Proc. Asia and South Pacific Design Automation Conference (ASP-DAC '05), Shanghai, China, 2005, pp. 856–861.
[4] Caffarena G., Constantinides G. A., Cheung P. Y. K., Carreras C., Nieto-Taladriz O., "Optimal combined word-length allocation and architectural synthesis of digital signal processing circuits," vol. 53, no. 5, pp. 339–343, 2006. doi:10.1109/TCSII.2005.862175
[5] Wadekar S. A., Parker A. C., "Accuracy sensitive word-length selection for algorithm optimization," in Proc. IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD '98), San Jose, Calif, USA, 1998, pp. 54–61.
[6] Caffarena G., López J. A., Carreras C., Nieto-Taladriz O., "High-level synthesis of multiple word-length DSP algorithms using heterogeneous-resource FPGAs," in Proc. International Conference on Field Programmable Logic and Applications (FPL '06), Madrid, Spain, 2006, pp. 675–678. doi:10.1109/FPL.2006.311288
[7] Nayak A., Haldar M., Choudhary A., Banerjee P., "Accurate area and delay estimators for FPGAs," in Proc. 39th Design Automation Conference (DAC '02), New Orleans, La, USA, June 2002, pp. 862–869.
[8] Bouganis C.-S., Constantinides G. A., Cheung P. Y. K., "A novel 2D filter design methodology for heterogeneous devices," in Proc. 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '05), Napa, Calif, USA, April 2005, pp. 13–22. doi:10.1109/FCCM.2005.10
[9] Liang X., Vetter J. S., Smith M. C., Bland A. S., "Balancing FPGA resource utilities," in Proc. International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA '05), Las Vegas, Nev, USA, June 2005, pp. 156–162.
[10] Smith A. M., Constantinides G. A., Cheung P. Y. K., "Fused-arithmetic unit generation for reconfigurable devices using common subgraph extraction," in Proc. International Conference on Field Programmable Technology (FPT '07), Kitakyushu, Japan, December 2007, pp. 105–112. doi:10.1109/FPT.2007.4439238
[11] Rocher R., Menard D., Herve N., Sentieys O., "Fixed-point configurable hardware components," vol. 2006, Article ID 23197, 2006. doi:10.1155/ES/2006/23197
[12] Hervé N., Ménard D., Sentieys O., "About the importance of operation grouping procedures for multiple word-length architecture optimizations," in Proc. International Workshop on Applied Reconfigurable Computing (ARC '07), March 2007, pp. 191–200.
[13] De Micheli G., Synthesis and Optimization of Digital Circuits, Electrical and Computer Engineering Series, McGraw-Hill, New York, NY, USA, 1994.
[14] Cantin M.-A., Savaria Y., Prodanos D., Lavoie P., "An automatic word length determination method," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS '01), vol. 5, Sydney, Australia, May 2001, pp. 53–56.
[15] Constantinides G. A., Cheung P. Y. K., Luk W., "Wordlength optimization for linear digital signal processing," vol. 22, no. 10, pp. 1432–1442, 2003. doi:10.1109/TCAD.2003.818119
[16] Holzer M., Knerr B., Belanović P., Rupp M., "Efficient design methods for embedded communication systems," vol. 2006, Article ID 64913, 2006. doi:10.1155/ES/2006/64913
[17] Caffarena G., López J. A., Carreras C., Nieto-Taladriz O., "Optimized implementation of DSP cores on FPGAs using logic-based and embedded resources," in Proc. International Symposium on System-on-Chip (SoC '06), Tampere, Finland, November 2006, pp. 103–106.
[18] Enzler R., Jeger T., Cottet D., Tröster G., "High-level area and performance estimation of hardware building blocks on FPGAs," in Proc. International Conference on Field Programmable Logic and Applications (FPL '00), Villach, Austria, August 2000, pp. 525–534.
[19] Schoofs K., Goossens G., De Man H., "Bit-alignment in hardware allocation for multiplexed DSP architectures," in Proc. Conference on Design, Automation and Test in Europe (DATE '93), October 1993, pp. 289–293.
[20] Kirkpatrick S., Gelatt C. D. Jr., Vecchi M. P., "Optimization by simulated annealing," vol. 220, no. 4598, pp. 671–680, 1983.
[21] Benvenuto N., Marchesi M., Uncini A., "Applications of simulated annealing for the design of special digital filters," vol. 40, no. 2, pp. 323–332, 1992. doi:10.1109/78.124942
[22] Orsila H., Kangas T., Salminen E., Hämäläinen T. D., "Parameterizing simulated annealing for distributing task graphs on multiprocessor SoCs," in Proc. International Symposium on System-on-Chip (SoC '06), Tampere, Finland, November 2006, pp. 1–4. doi:10.1109/ISSOC.2006.321971
[23] López-Vallejo M., Grajal J., López J., "Constraint-driven system partitioning," in Proc. Conference on Design, Automation and Test in Europe (DATE '00), Paris, France, March 2000, pp. 411–416.
[24] Ohm S. Y., Kurdahi F. J., Dutt N. D., "A unified lower bound estimation technique for high-level synthesis," vol. 16, no. 5, pp. 458–472, 1997.
[25] Caffarena G., López J. A., Leyva G., Carreras C., Nieto-Taladriz O., "Optimized architectural synthesis of fixed-point datapaths," in Proc. International Conference on Reconfigurable Computing and FPGAs (ReConFig '08), Cancun, Mexico, 2008, pp. 85–90.
[26] López-Vallejo M., López J. C., "On the hardware-software partitioning problem: system modeling and partitioning techniques," vol. 8, no. 3, pp. 269–297, 2003.
[27] López J. A., Caffarena G., Carreras C., Nieto-Taladriz O., "Fast and accurate computation of the round-off noise of linear time-invariant systems," vol. 2, no. 4, pp. 393–408, 2008. doi:10.1049/iet-cds:20070198
[28] Cantin M.-A., Savaria Y., Lavoie P., "A comparison of automatic word length optimization procedures," in Proc. IEEE International Symposium on Circuits and Systems (ISCAS '02), vol. 2, May 2002, pp. 612–615.
[29] Parhi K. K., VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, New York, NY, USA, 1999.
[30] Kum K.-I., Kang J., Sung W., "AUTOSCALER for C: an optimizing floating-point to integer C program converter for fixed-point digital signal processors," vol. 47, no. 9, pp. 840–848, 2000. doi:10.1109/82.868453
[31] Caffarena G., Universidad Politécnica de Madrid, Madrid, Spain, 2008.