A Convolve-And-MErge Approach for Exact Computations on High-Performance Reconfigurable Computers

This work presents an approach for accelerating arbitrary-precision arithmetic on high-performance reconfigurable computers (HPRCs). Although faster and smaller, fixed-precision arithmetic has inherent rounding and overflow problems that can cause errors in scientific or engineering applications. This recurring phenomenon is usually referred to as numerical nonrobustness. Therefore, there is an increasing interest in the paradigm of exact computation, based on arbitrary-precision arithmetic. There are a number of libraries and/or languages supporting this paradigm, for example, the GNU multiprecision (GMP) library. However, the performance of computations is significantly reduced in comparison to that of fixed-precision arithmetic. In order to reduce this performance gap, this paper investigates the acceleration of arbitrary-precision arithmetic on HPRCs. A Convolve-And-MErge approach is proposed, that implements virtual convolution schedules derived from the formal representation of the arbitraryprecision multiplication problem. Additionally, dynamic (nonlinear) pipeline techniques are also exploited in order to achieve speedups ranging from 5x (addition) to 9x (multiplication), while keeping resource usage of the reconfigurable device low, ranging from 11% to 19%.


Introduction
Present-day computers built around fixed-precision components perform integer and/or floating point arithmetic operations using fixed-width operands, typically 32 and/or 64 bits wide.However, some applications require larger precision arithmetic.For example, operands in public-key cryptography algorithms are typically thousands of bits long.Arbitrary-precision arithmetic is also important for scientific and engineering computations where the roundoff errors arising from fixed-precision arithmetic cause convergence and stability problems.Although many applications can tolerate fixed-precision problems, there is a significant number of other applications, such as finance and banking, in which numerical overflow is intolerable.This recurring phenomenon is usually referred to as numerical nonrobustness [1].In response to this problem, exact computation, based on exact/arbitrary-precision arithmetic, was first introduced in 1995 by Yap and Dube [2] as an emerging numerical computation paradigm.In arbitrary-precision arithmetic, also known as bignum arithmetic, the size of operands is only limited by the available memory of the host system [3,4].
In the earlier days of computers, there were some machines that supported arbitrary-precision arithmetic in hardware.Two examples of these machines were the IBM 1401 [6] and the Honeywell 200 Liberator [7] series.Nowadays, arbitrary-precision arithmetic is mostly implemented in software, perhaps embedded into a computer compiler.Over the last decade, a number of bignum software packages have been developed.These include the GNU multiprecision (GMP) library, CLN, LEDA, Java.math,BigFloat, BigDigits, International Journal of Reconfigurable Computing and Crypto++ [4,5].In addition, there exist stand-alone application software/languages such as PARI/GP, Mathematica, Maple, Macsyma, dc programming language, and REXX programming language [4].
Arbitrary-precision numbers are often stored as largevariable-length arrays of digits in some base related to the system word-length.Because of this, arithmetic performance is slower compared to fixed-precision arithmetic which is closely related to the size of the processor internal registers [2].There have been some attempts for hardware implementations.However, those attempts usually amounted to specialized hardware for small-size discrete multiprecision and/ or to large-size fixed-precision [8][9][10][11][12] integer arithmetic rather than to real arbitrary-precision arithmetic.
High-performance reconfigurable computers (HPRCs) have shown remarkable results in comparison to conventional processors in those problems requiring custom designs because of the mismatch with operand widths and/or operations of conventional ALUs.For example, speedups of up to 28,514 have been reported for cryptography applications [13,14], up to 8,723 for bioinformatics sequence matching [13], and up to 32 for remote sensing and image processing [15,16].Therefore, arbitrary-precision arithmetic seemed to be a good candidate for acceleration on reconfigurable computers.
This work explores the use of HPRCs for arbitrary-precision arithmetic.We propose a hardware architecture that is able to implement addition, subtraction and multiplication, as well as convolution, on arbitrary-length operands up to 128 ExibiByte.The architecture is based on virtual convolution scheduling.It has been validated on a classic HPRC machine, the SRC-6 [17] from SRC Computers, showing speedups ranging from 2 to 9 in comparison to the portable version of the GMP library.This speedup is in part attained due to the dynamic (nonlinear) pipelining techniques that are used to eliminate the effects of deeply pipelined reduction operators.
The paper is organized as follows.Section 2 presents a short overview of HPRC machines.The problem is formulated in Section 3. Section 4 describes the proposed approach and architecture augmented with a numerical example for illustrating the details of the proposed approach.Section 5 shows the experimental work.Implementation details are also given in Section 5, as well as performance comparison to the SW version of GMP.Finally, Section 6 presents the conclusions and future directions.

High Performance Reconfigurable Computing
In the recent years, the concept of high-performance reconfigurable computing has emerged as a promising alternative to conventional processing in order to enhance the performance of computers.The idea is to accelerate a parallel computer with reconfigurable devices such as FPGAs where a custom hardware implementation of the critical sections of the code is performed in the reconfigurable device.Although the clock frequency of the FPGA is typically one order of magnitude less than the one of high-end microprocessors, significant speedups are obtained due to the increased parallelism of hardware.This performance is especially important for those algorithms not matching the architecture of conventional microprocessors, because of either the operand lengths (e.g., bioinformatics) or the operations performed (e.g., cryptography).Moreover, the power consumption is reduced in comparison to conventional platforms, and the use of reconfigurable devices brings flexibility closer to that of SW, as opposed to other HW-accelerated solutions such as ASICs.In other words, the goal of HPRC machines is to achieve the synergy between the low-level parallelism of hardware with the system-level parallelism of high-performance computing (HPC) machines.
In general, HPRCs can be classified as either nonuniform node uniform systems (NNUSs) or uniform node nonuniform systems (UNNSs) [13].NNUSs consist of only one type of nodes.Nodes are heterogeneous containing both FPGAs and microprocessors.FPGAs are connected directly to the microprocessors inside the node.On the other hand, UNNS nodes are homogeneous containing either FPGAs or microprocessors which are linked via an interconnection network.The platform used in this paper, SRC-6 [17], belongs to the second category.
SRC-6 platform consists of one or more general-purpose microprocessor subsystems, one or more MAP reconfigurable processor subsystems, and global common memory (GCM) nodes of shared memory space [17].These subsystems are interconnected through a Hi-Bar Switch communication layer; see Figure 1.Multiple tiers of the Hi-Bar Switch can be used to create large-node count scalable systems.Each microprocessor board is based on 2.8 GHz Intel Xeon microprocessors.Microprocessors boards are connected to the MAP boards through the SNAP interconnect.The SNAP card plugs into the memory DIMM slot on the microprocessor motherboard to provide higher data transfer rates between the boards than the less efficient but common PCI solution.The peak transfer rate between a microprocessor board and the MAP board is 1600 MB/sec.Hardware architecture of the SRC-6 MAP processor is shown in Figure 1.The MAP Series C board is composed of one control FPGA and two user FPGAs, all Xilinx Virtex II-6000-4.Additionally, each MAP unit contains six interleaved banks of on-board memory (OBM) with a total capacity of 24 MB.The maximum aggregate data transfer rate among all FPGAs and on-board memory is 4800 MB/s.The user FPGAs are configured in such a way that one is in the master mode and the other is in the slave mode.The two FPGAs of a MAP are directly connected using a bridge port.Furthermore, MAP processors can be chained together using a chain port to create an array of FPGAs.

Problem Formulation
Exact arithmetic uses the four basic arithmetic operations (+, −, ×, ÷) over the rational field Q to support exact computations [1][2][3][4][5].Therefore, the problem of exact computation is reduced to implementing these four basic arithmetic operations with arbitrary-sized operands.The asymptotic computational complexity of each operation depends on the bit length of operands [5,18], see Table 1.Our formal representation of the problem will consider only the multiplication operation.This is based on the fact that multiplication is a core operation from which the other basic arithmetic operations can be easily derived, for example, division as a multiplication using Newton-Raphson's approximation or Goldschmidt algorithm [19,20].Our proposed arithmetic unit can perform arbitraryprecision addition, subtraction, multiplication, as well as convolution operations.We decided to follow the Basecase/Schoolbook algorithm.Although this algorithm is not the fastest algorithm having a complexity of O(n 2 ), see Table 1, it is the simplest and most straightforward algorithm with the least overhead.In addition, this algorithm is usually the starting point for almost all available software implementations of arbitrary-precision arithmetic.For example, in the case of GMP, the Basecase algorithm is used up to a predetermined operand size threshold, 3000-10,000 bit length depending on the underlying microprocessor architecture, beyond which the software adaptively switches to a faster algorithm [20,21].
The main challenge of implementing arbitrary-precision arithmetic in HPRC machines is the physical/spatial limitations of the reconfigurable device.In other words, the reconfigurable device (FPGA) has limited physical resources, which makes it unrealistic to accommodate for the resource requirements of arbitrarily large-precision arithmetic operators.Therefore, the problem can be formulated as given a fixed-precision arithmetic unit, for example, p-digit by pdigit multiplier, how to implement an arbitrary-precision arithmetic unit, for example, arbitrary large-variable-size m 1 -digit by m 2 -digit multiplier.Typically, p is dependent on the underlying hardware word-length, for example, 32-bit International Journal of Reconfigurable Computing or 64-bit.In achieving this objective, our approach is based on leveraging previous work and concepts that were introduced for solving similar problems.For example, Tredennick and Welch [22] proposed architectural solutions for variablelength byte string processing.Similarly, Olariu et al. [23] formally analyzed and proposed solutions for the problem of sorting arbitrary large number of items using a sorting network of small fixed I/O size.Finally, ElGindy and Ferizis [24] investigated the problem of mapping recursive algorithms on reconfigurable hardware.

Formal Problem Representation. An arbitrary-precision
m-digit number in arbitrary numeric base r can be represented by It can also be interpreted as an n-digit number with base r p , where p is dependent on the underlying hardware wordlength, for example, 32-bit or 64-bit.This is represented by (2) as follows: where ( Multiplication, accordingly, can be formulated as shown in Figure 2 and expressed by (3).In other words, as implied by (4a), multiplication of high-precision numbers can be performed through two separate processes in sequence.The first is a low fixed precision, that is, p-digits, multiply-accumulate (MAC) process for calculating the coefficients/partial products C i s as given by (4b).This is followed by a merging process of these coefficients/partial products into a final single high-precision product as given by (4a).Equation (4b) shows that the coefficients C i s can be represented at minimum by 2p-digit precision.
The extra digits are due to the accumulation process.Therefore, C i s can be expressed as shown by (4c); where (3) p digits p digits p digits

C(x)
where

Multiplication as a Convolve-And-MErge (CAME) Process.
It can be easily noticed that the coefficients C i s given by (4b) are in the form of a convolution sum.This led us to believe that virtualizing the convolution operation and using it as a scheduling mechanism will be a straightforward path for implementing multiplication and hence the remaining arithmetic operations.The different sub operands, that is, As and Bs, being stored in the system memory, will be accessed according to the convolution schedule and passed to the MAC process.The outcome of the MAC process is then delivered to the merging process which merges the partial products, according to another merging schedule, into the final results.The final results are then scheduled back into the system memory according to the same convolution schedule.The convolution schedule, on one hand, can be derived from (4b).It is simply a process that generates the addresses/ indexes for As, Bs, and Cs governed by the rules given in (4b).On the hand, the merging schedule can be derived from (5) which results from substituting (4c) into (4a).Figure 3 shows the merging schedule as a high-precision addition of three components.The first component is simply a concatenation of all the first p-digits of the MAC output.The second component is a p-digit shifted concatenation of all the second p-digits of the MAC output.Finally, the third component is a 2p-digit shifted concatenation of all the third p-digits of the MAC output: where ( The merging schedule, as described above, is a high-precision schedule which will work only if the merging process is performed after the MAC process has finished completely.Given the algorithm complexity O(n 2 ) and allowing the two processes to work sequentially one after another would dramatically impact the performance.However, modifying the merging process to follow a small-fixed-precision scheduling scheme that works in parallel and in synchrony with the MAC process would bring back the performance to its theoretical complexity O(n 2 ).The modified merging scheme can be very easily derived either from (5) or Figure 3 resulting in (6): This would mean that, as soon as the MAC process finishes one partial result, C i in 3p-digit precision, the merging process, in-place, produces a final partial result S i in p-digit precision, see Figure 4.This precision matches the wordlength of the supporting memory system which allows easy storage of the final result without stalling either the MAC or the merging process.The merging process registers the remaining high-precision digits for use in subsequent calculations of S i s.
In addition to performing multiplication, the derived architecture in Figure 4 can also natively perform the convolution operation for sequences of arbitrary size.This is because the MAC process generates the coefficients in (4b) according to a convolution schedule and in fact they are the direct convolution result.In other words, only the MAC process is needed for the convolution operation.Furthermore, the same unit can be used to perform addition and/or Accum  subtraction by passing the input operands directly to the merging process without going through the MAC process.

Illustrative Example.
In order to show the steps of the proposed methodology, we will consider multiplying two decimal numbers, A = 987654321 and B = 98765.We will assume that the underlying hardware word-length is 2 digits.The formal representation for this example is As described in Section 4.2, multiplication is performed as a two-step CAME process, as shown in Figure 5.The MAC process is first applied to calculate the convolution of the two numbers, see Figure 5(a), after which the partial results are then merged, as shown in Figure 5(b), to obtain the final result.

Precision of the Arithmetic Unit.
One challenge of implementing an arbitrary-precision arithmetic unit is to ensure that it is possible to operate with any realistic size of operands.It is therefore necessary to investigate the growth of the MAC process as it represents an upper bound for the unit precision.As discussed earlier, shown in Figure 2, and given by (4c), the outcome of the MAC process, C i , consists of three parts: the multiply digits, , and the accumulation digits, Δ i .The corresponding number of digits, , n Δi , can be expressed as given by (8a), (8b), (8c), and shown in Figure 6: Controlling the growth of the MAC process by keeping the accumulation/carry digits less than or equal to p-digits, (8c) gives the upper bound on the precision of the input operands.This is in terms of the hardware unit word-length, that is, p-digits, and the numeric system base r.For example, for a binary representation, that is, r = 2, and a 64-bit arithmetic unit, that is, p = 64, the accommodated operand precision is 128 ExibiByte which is beyond any realistic storage system.In other words, a hardware unit with such parameters can provide almost infinite-precision arithmetic.

Experimental Work
Our experiments have been performed on one representative HPRC systems, SRC-6 [17], previously described in Section 2. The proposed architecture was developed partly in Xilinx System Generator environment as well as in VHDL.In both environments, the architectures were highly parameterized.
The hardware performance was referenced to one of the most efficient [21] software libraries supporting arbitraryprecision arithmetic, namely, GMP library on Xeon 2.8 GHz.We considered two versions of GMP.The first was the compiled version of GMP which is a precompiled and highly optimized version for the underlying microprocessor.The other was the highly portable version of GMP without any processor-specific optimizations.

Implementation Issues.
Large-precision reduction operations used in both the MAC (i.e., 3p-digit accumulation) and the merging processes proved to be a challenge due to critical-path issues.For example, the accumulation of a stream of integers can be impractical for FPGA-based implementations when the number of values is large.The resultant circuit can significantly reduce the performance and consume an important portion of the FPGA.
To eliminate those effects of reduction operations, techniques of deep pipelining [25,26] and those of nonlinear pipelining [27,28] were considered.The buffering mechanism, presented in [25,26], showed either low throughput and efficiency, or high latency and resources usage for our case, see Figure 7(a).Therefore, we leveraged the techniques of nonlinear pipelines [27,28] which proved to be effective, see Figure 7(b).Furthermore, we derived a generalized architecture for nonlinear pipelined accumulation; refer to (9).The main component of this structure is a p-digit accumulation stage in which delay elements are added for synchronization purposes, see Figure 8.The p-digit stages are arranged in a manner such that overflow digits from a given stage j are passed to the subsequent stage j + 1.The total number of stages n s depends on the size of the input operand m A as well as on the number of accumulated operands N, see (9).In doing so, we have proved (a) Deeply pipelined accumulator  where that large-size accumulation operations which are prone to deeply-pipelined effects and hardware critical-path issues can be substituted with multiple and faster smaller-size accumulations.In other words, a single large-size (p • n s )digit accumulator can efficiently be implemented using n s faster p-digit accumulators, see Figure 8.We analyzed the pipeline efficiency, as defined in [27,28], of our proposed architecture.This is expressed by (10a), (10b), and shown in Figure 10.The pipeline efficiency η as expressed by (10a) can be easily derived by considering the pipeline reservation table [27,28].As shown in Figure 9, the example reservation table is used to calculate the pipeline efficiency η.We implemented two versions of the arithmetic unit, that is, 32-bit and 64-bit.For very large data, the efficiency for the 32-bit unit was lower bounded to 80% while the efficiency for the 64-bit unit was lower bounded to 84.62%, see (10b) and Figure 10: where L mac ≡ Latency of the MAC − process, L merger ≡ Latency of the merging − process, International Journal of Reconfigurable Computing Z −1 ≡ one-unit delay element    2, the proposed architecture consumes relatively low hardware resources requiring at maximum 19% of the reconfigurable device for the 64-bit unit.These implementation results allow taking advantage of the inherent parallelism of the FPGA and adding more than one operational unit upper bounded by the number of available memory banks.For example, in SRC-6, there are six memory banks which make it possible to allocate two units per FPGA, see Figure 4(a).Next, we show the results of the 64-bit Basecase algorithm for addition/subtraction, and the 32-bit and 64-bit Basecase algorithm for multiplication.
The arbitrary-precision addition/subtraction performance was measured on SRC-6 and compared to both the compiled and portable versions of GMP.As shown in Figure 11, T COMP HW, that is, the total computation time of the hardware on SRC-6, is lower than the execution time of both the compiled and portable versions of GMP.The performance speedup is shown in Figure 12.The hardware implementation asymptotically outperforms the software, by a factor of approximately 5, because of the inherent parallelism exploited by the hardware.We can also notice that, for small-precision addition/subtraction, the speedup factor starts from approximately 25.This is due to the large overhead, relative to the data size, associated with the software, while the only overhead associated with the hardware is due to the pipeline latency.This latency is independent on the data size.It is also worth to notice the linear behavior, O(n), of both the software and the hardware.This is because both execute the same algorithm, that is, Bascase addition/ subtraction [20,21], see Table 1.
In the case of multiplication, we notice a nonlinear behavior O(n 1+e ), 0 < e < 1; see Figure 13 and Table 1.We notice also a similar behavior to the addition/subtraction for small-size operands.Figures 13(a version of GMP.This is because this version of GMP uses the same algorithm as ours, that is, Basecase with O(n 2 ) see Table 1, independent of the data size [20,21].As shown in Figure 14, the hardware behavior asymptotically outperforms the portable GMP multiplication by a factor of approximately 2 for the 32-bit multiplication, see Figure 14(a), and 9 for the 64-bit multiplication, see Figure 14(b).However, this is not case with compiled GMP multiplication which is optimized and adaptive.Compiled GMP uses four multiplication algorithms, and adaptively switches from a slower to a faster algorithm depending on the data size and according to predetermined thresholds [20,21].For these reasons, the hardware, as can be seen from Figure 13(c), outperforms the compiled GMP up to a certain threshold, approximately 10 Kbits, beyond which the situation reverses.

Conclusions and Future Work
This paper shows the feasibility of accelerating arbitraryprecision arithmetic on HPRC platforms.While exact computation presents many benefits in terms of numerical robustness, its main drawback is the poor performance that is obtained in comparison to fixed-precision arithmetic.The results presented in this work show the possibility of reducing the performance gap between fixed-precision and arbitraryprecision arithmetic using HPRC machines.The proposed solution, the Convolve-And-MErge (CAME) methodology for arbitrary-precision arithmetic, is derived from a formal representation of the problem and is based on virtual convolution scheduling.For the formal analysis, only the multiplication operation was considered.This decision was made due to the fact that multiplication is a core operation from which other basic arithmetic operations, such as division and square root, can be easily derived.
Our proposed arithmetic unit can perform arbitraryprecision addition, subtraction, multiplication, as well as arbitrary-length convolution operations.Our approach in implementing the CAME process was based on leveraging previous work and concepts that were introduced for solving similar problems.Dynamic (nonlinear) pipelines techniques were exploited to eliminate the effects of deeply pipelined reduction operators.The use of these techniques allowed us reaching a minimum of 80% pipeline utilization for 32-bit units and reaching 84.6% efficiency for 64-bit units.This implementation was verified for both correctness and performance in reference to the GMP library on the SRC-6 HPRC.The hardware outperformed GMP by a factor of 5x speedup for addition/subtraction, while the speedup factor was lower bounded to 9x compared to the portable version of GMP multiplication.
Future directions may include investigating hardware support for floating-point arbitrary precision, considering faster algorithms than the Basecase/Schoolbook presented in this paper, as well as adopting methods for adaptive algorithm switching based on the length of the operands.In addition to, full porting of an arbitrary-precision arithmetic library such as GMP to HPRC machines might also be favorable.

Figure 7 :
Figure 7: Accumulator of the MAC process.

Table 2 :
FPGA resource utilization.L mac = 4, L merger = 1=⇒ η ∞ = 80.00%, p = 64 bits, L mac = 11, L merger = 2=⇒ η ∞ = 84.62%.Our experiments were performed for two cases, that is, 32-bit and 64-bit units.FPGA resources usage and clock frequency are shown in Table2.While resource usage in the 64-bit unit is larger, as expected, clock frequency is similar in both cases due to the clock frequency requirement imposed by SRC-6 (100 MHz).As it can be seen in Table