Modern FPGAs contain embedded DSP blocks, which can be configured as multipliers of more than one size. FPGA-based designs using these multigranular embedded blocks become more challenging when high speed and reduced area utilization are required. This paper proposes an efficient design methodology for implementing large size signed multipliers using small multigranular embedded blocks. The proposed approach has been implemented and tested targeting Altera's Stratix II FPGAs with the aid of the Quartus II software tool. The multipliers have been implemented for operand sizes ranging from 40 to 256 bits. Experimental results demonstrate that our design approach outperforms the standard scheme used by the Quartus II tool in terms of speed and area. On average, the delay reduction is about 20.7% and the area saving, in terms of ALUTs, is about 67.6%.
Modern FPGAs offer highly
sophisticated resources in the form of embedded blocks. These blocks vary in
complexity from small size multipliers to core processors. Some of these blocks
offer a high degree of flexibility to cover a wide range of applications. A
typical example is the DSP blocks in Altera's FPGAs. These blocks can be
configured to operate as
Arithmetic computations are needed in a wide range of applications
and products. Some of these arithmetic computations deal with large size
operands. Typical applications include scientific computation, cryptography, and
data intensive systems. For instance, in climate modeling and computational
physics, high-precision floating point processing is needed [
Various techniques
presented in the literature deal with the efficient realization of
signed array multiplication through optimization of partial product generation
and partial product addition. The focus is primarily to reduce the delay of the
critical path, and sometimes to meet other design objectives such as power
dissipation and chip area [
For large size multiplication using FPGA devices, known algorithms
normally segment the input operands based on single size embedded blocks [
In our previous work, we developed an efficient design approach for
the implementation of large size unsigned multipliers [
The remainder of this paper is organized as follows. Section
In this section, we describe the decomposition method of large size multiplications based on single size embedded blocks, and a new sign-extension scheme to sum the generated partial products.
To implement a large size multiplication using single size embedded
multipliers in FPGAs, the input operands are decomposed based on the size of
the embedded blocks [
By multiplying the segmented inputs presented in (
Organization of the partial products of a large size signed multiplier.
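The decomposition described above can be illustrated with a minimal Python sketch. It assumes that every segment except the most significant one is treated as an unsigned value and that only the most significant segment carries the sign. The function names `split_signed` and `multiply_by_segments` and the 17-bit segment width in the usage lines are hypothetical choices made for illustration and are not taken from the paper.

```python
def split_signed(x: int, width: int, seg: int) -> list[int]:
    """Split a width-bit two's-complement value into seg-bit segments.

    Segments are returned least significant first. All segments except the
    last are unsigned; the last (most significant) one keeps the sign.
    """
    assert -(1 << (width - 1)) <= x < (1 << (width - 1)), "x out of range"
    n_seg = -(-width // seg)          # ceiling division
    mask = (1 << seg) - 1
    segments = [(x >> (i * seg)) & mask for i in range(n_seg - 1)]
    # Python's >> on a negative int is an arithmetic shift, so the sign
    # is preserved in the most significant segment.
    segments.append(x >> ((n_seg - 1) * seg))
    return segments


def multiply_by_segments(a: int, b: int, width: int, seg: int) -> int:
    """Multiply two signed operands by summing shifted partial products."""
    total = 0
    for i, a_i in enumerate(split_signed(a, width, seg)):
        for j, b_j in enumerate(split_signed(b, width, seg)):
            # Each partial product is weighted by 2**(seg * (i + j)).
            total += (a_i * b_j) << (seg * (i + j))
    return total


# Illustrative check with hypothetical parameters (64-bit operands, 17-bit segments).
a, b = -123456789012345, 987654321098765
assert multiply_by_segments(a, b, width=64, seg=17) == a * b
```

In hardware, each `a_i * b_j` term would map to one embedded multiplier, and the final summation corresponds to the adder levels discussed in the text.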
After all operands shown
in Figure
To save area and reduce
the execution delay, our new sign-extension scheme proposed in [
After the first level addition, all operands to be added further are
2's complement and have different sizes. Then, at the second level and subsequent
levels of addition, our proposed new sign-extension scheme extends the sign bits
of the larger size operand, as shown in Figure
Structure of the adders in the new sign-extension scheme.
In this section, we describe our proposed multilevel decomposition approach for the implementation of large size signed multiplications using multigranular embedded multipliers.
We assume that
multigranularity is based on three types of building blocks. They are of
different bit widths:
To optimize the design of
the large size multipliers, the decomposition is first processed based on the largest
size building blocks. Figure
Decomposition based on the largest size of embedded blocks.
By multiplying the
segmented inputs, three types of multipliers are required to generate the
partial products. They are
The first type of
multiplication can be implemented using
The second type of
multiplication,
The last type of
multiplication is
Decomposition of input bit sizes for large size multipliers.
| Segments | Range 1: double decomposition | Range 2: single decomposition |
|---|---|---|
| | 37 to 52 | 53 to 71 |
| | 72 to 86 | 87 to 106 |
| | 107 to 120 | 121 to 141 |
| | 142 to 154 | 155 to 176 |
| | 177 to 188 | 189 to 211 |
| | 212 to 222 | 223 to 246 |
| | 247 to 256 | 257 to 281 |
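As a quick aid for reading the table, the following sketch encodes its bit-size ranges literally and returns which decomposition applies to a given operand width. The names `RANGE_TABLE` and `decomposition_for` are hypothetical; only the numeric bounds come from the table.

```python
# Each row: ((Range 1 low, Range 1 high), (Range 2 low, Range 2 high)),
# copied from the table of input bit sizes.
RANGE_TABLE = [
    ((37, 52), (53, 71)),
    ((72, 86), (87, 106)),
    ((107, 120), (121, 141)),
    ((142, 154), (155, 176)),
    ((177, 188), (189, 211)),
    ((212, 222), (223, 246)),
    ((247, 256), (257, 281)),
]


def decomposition_for(width: int) -> str:
    """Return 'double' for Range 1 widths and 'single' for Range 2 widths."""
    for (r1_lo, r1_hi), (r2_lo, r2_hi) in RANGE_TABLE:
        if r1_lo <= width <= r1_hi:
            return "double"
        if r2_lo <= width <= r2_hi:
            return "single"
    raise ValueError(f"width {width} is outside the tabulated ranges")


for w in (40, 60, 128, 256):
    print(w, decomposition_for(w))
```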
In Range 1, smaller size embedded
multipliers can be used for the implementation based on double-level
decomposition. To do this, the size of
the first level decomposition is based on
First level decomposition for a large size multiplication in Range 1.
The second-level decomposition splits each 34-bit operand into
two 17-bit segments. Thus, the
On the other hand, for the
cases in Range 2, double-level decomposition will not lead to an optimized
solution since the size of the most significant segment,
In this section, we
describe the implementation approaches for the realization of large size
multipliers for the scenarios presented in Section
Double decomposition is used for bit sizes located in Range 1. The
first-level decomposition splits each input operand,
Once all the segmented partial products are generated, the required
additions can be performed following the design rules presented in [
In the case of Range 2, single decomposition is
performed since the most significant segment of the input operands contains
more than
The special cases
refer to multiplications in which the most significant segment of each
operand has
The ranges of input bit size with special cases for decompositions.
| Range/segments | Range 1 | Special cases of Range 1 | Range 2 | Special cases of Range 2 |
|---|---|---|---|---|
| | 40 to 52 | 53 to 55 | 56 to 71 | 72 to 74 |
| | 75 to 86 | 87 to 89 | 90 to 106 | 107 to 109 |
| | 110 to 120 | 121 to 123 | 124 to 141 | 142 to 144 |
| | 145 to 154 | 155 to 157 | 158 to 176 | 177 to 179 |
| | 180 to 188 | 189 to 191 | 192 to 211 | 212 to 214 |
| | 215 to 222 | 223 to 225 | 226 to 246 | 247 to 249 |
| | 250 to 256 | 257 to 259 | 260 to 281 | 282 to 284 |
In this special
case, the most significant segment,
Generation of the partial product for the special case of Range 1.
In addition, Algorithm
For these cases, the
size of the most significant segments of the input operands,
We also let
Generation of the partial product for special cases of Range 2.
In the special cases of Range 2, the partial product
As a design example,
this section summarizes the implementation of a
Since the size of
the operands is in Range 1, two levels of decomposition are required. The first
level decomposition is based on
After the first-level decomposition,
these segmented input operands are multiplied and the organization of all the
partial products is shown in Figure
The segmented partial products of a
In Figure
At the second-level
decomposition, each 34-bit operand is split into two 17-bit segments. Then, the 18 × 34-bit
multiplication is implemented using the process shown in Figure
In this implementation, one 18 × 18-bit, forty-nine 34 × 34-bit, and fourteen 18 × 34-bit multipliers are required. On Altera's FPGAs, the total number of embedded blocks used (in terms of 9-input DSP elements) is 49 × 8 + (14 × 2) × 2 + 1 × 2 = 450. Without the second-level decomposition, however, each 18 × 34-bit multiplication requires a 36 × 36-bit embedded multiplier, and the total number of 9-input DSP elements used becomes 49 × 8 + 14 × 8 + 1 × 2 = 506, an increase of about 12.4%.
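The element counts above can be reproduced with a short Python sketch. It assumes, based on the figures quoted in the text, that a 36 × 36-bit multiplier costs 8 DSP elements and an 18 × 18-bit multiplier costs 2, and that each operand consists of seven 34-bit segments plus one 18-bit most significant segment (inferred from the count of forty-nine 34 × 34-bit products). The function names, and the treatment of the 18-bit factor as signed and the 34-bit factor as unsigned in `mul_18x34`, are illustrative assumptions.

```python
ELEMENTS_PER_36X36 = 8   # DSP elements for one 36 x 36 multiplier (from the text)
ELEMENTS_PER_18X18 = 2   # DSP elements for one 18 x 18 multiplier (from the text)


def mul_18x34(a: int, b: int) -> int:
    """Signed 18-bit a times unsigned 34-bit b, built from two 18 x 18 products."""
    assert -(1 << 17) <= a < (1 << 17)
    assert 0 <= b < (1 << 34)
    b_lo = b & ((1 << 17) - 1)   # lower 17 bits, fits an 18-bit signed input
    b_hi = b >> 17               # upper 17 bits, fits an 18-bit signed input
    return ((a * b_hi) << 17) + a * b_lo


def multiplier_counts(n_low: int) -> dict[str, int]:
    """Multiplier counts for n_low 34-bit segments plus one 18-bit top segment."""
    return {"34x34": n_low * n_low, "18x34": 2 * n_low, "18x18": 1}


def dsp_elements(counts: dict[str, int], second_level: bool) -> int:
    """Total DSP elements with or without the second-level decomposition."""
    cost_18x34 = 2 * ELEMENTS_PER_18X18 if second_level else ELEMENTS_PER_36X36
    return (counts["34x34"] * ELEMENTS_PER_36X36
            + counts["18x34"] * cost_18x34
            + counts["18x18"] * ELEMENTS_PER_18X18)


counts = multiplier_counts(7)                     # assumed: seven 34-bit segments
with_split = dsp_elements(counts, second_level=True)
without_split = dsp_elements(counts, second_level=False)
increase = 100 * (without_split - with_split) / with_split
print(counts, with_split, without_split, f"{increase:.1f}%")
```

The `mul_18x34` helper shows why the second-level split pays off: both 17-bit halves fit into 18 × 18-bit multipliers, so no 36 × 36-bit block is needed for these products.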
The last step is to sum
all partial products shown in Figure
Flow diagram of the additions required for a
The multigranular signed multipliers
were implemented using Altera's FPGAs. The synthesis tool used is Quartus II
version 7.2 targeting the device EP2S180F1508C3
from the Stratix II family [
The proposed approach and the
traditional scheme are compared based on the following metrics extracted from
the implementation summary and the timing analyzer summary: (1) the clock
period, (2) the number of adaptive look-up tables (ALUTs) used, and (3) the number of
embedded blocks used, in terms of 9-input DSP elements. All these results are
presented in Figure
Comparison of the results: delay, area, and DSP blocks.
The Delay-ALUT product and the Delay-DSP element product.
Compared to the standard scheme, the proposed
multigranular multiplier method yields considerable improvements in
timing and area. On average, the delay
is reduced by 20.7%, and the
multigranular approach uses 67.6% fewer ALUTs
than the standard scheme.
Although our approach outperforms the standard method, in
roughly 25% of the cases it requires more 9-input DSP
elements than the standard scheme, as shown in Figure
Moreover, the implementation results
of the multiplications can be improved using the algorithms for the special cases
explained earlier, which reduce the number of embedded blocks. These
results will be presented later. Also, considering the product of the delay and
the number of ALUTs as well as the product of the delay and the number of
embedded blocks, a significant improvement has been achieved, as can be
seen from Figure
For the case of
Result of a
| | Delay (ns) | Number of ALUTs | Number of DSP elements | Delay-ALUT product | Delay-DSP-element product |
|---|---|---|---|---|---|
| Standard | 41.530 | 16050 | 449 | 666557 | 18647 |
| Multigranular | 32.487 | 4577 | 450 | 148693 | 14619 |
| Saving (%) | 21.775 | 71.483 | | 77.692 | 21.600 |
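The derived columns of this table can be rechecked from the raw delay, ALUT, and DSP-element figures. The sketch below assumes that each product is simply the delay multiplied by the corresponding resource count and that each saving is computed relative to the standard scheme; the variable and function names are hypothetical.

```python
# Raw figures taken from the table.
standard = {"delay_ns": 41.530, "aluts": 16050, "dsp": 449}
multigranular = {"delay_ns": 32.487, "aluts": 4577, "dsp": 450}


def saving_percent(base: float, new: float) -> float:
    """Relative saving of `new` with respect to `base`, in percent."""
    return 100.0 * (base - new) / base


def products(row: dict) -> dict:
    """Delay-ALUT and delay-DSP-element products for one table row."""
    return {
        "delay_alut": row["delay_ns"] * row["aluts"],
        "delay_dsp": row["delay_ns"] * row["dsp"],
    }


std_prod = products(standard)
mg_prod = products(multigranular)

savings = {
    "delay": saving_percent(standard["delay_ns"], multigranular["delay_ns"]),
    "aluts": saving_percent(standard["aluts"], multigranular["aluts"]),
    "delay_alut": saving_percent(std_prod["delay_alut"], mg_prod["delay_alut"]),
    "delay_dsp": saving_percent(std_prod["delay_dsp"], mg_prod["delay_dsp"]),
}

for name, value in savings.items():
    print(f"{name}: {value:.3f}%")
```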
For the special cases, Figure
Result comparison for special cases: delay, number of ALUTs used, and number of DSP elements used.
The focus of this paper is the realization of large size signed multipliers using DSP blocks with multigranular embedded signed multipliers in FPGAs. Multiple decompositions are used to make efficient use of the multigranularity offered in modern FPGAs. The effectiveness of the proposed design approach has been tested using various benchmarks and compared with a standard approach using a commercial tool. Although this tool has complete access to the features available in the DSP blocks and in the 6-LUTs of the target FPGA, our methodology has always outperformed the standard scheme.