Simple Hybrid Scaling-Free CORDIC Solution for FPGAs

COordinate Rotation DIgital Computer (CORDIC) is an effective method used in digital signal processing applications for computing various trigonometric, hyperbolic, linear, and transcendental functions. This paper presents the theoretical basis and practical implementation of a circular (sine-cosine) CORDIC-based generator. Synthesis results for this generator on an Altera Stratix III FPGA (EP3SL340F1517C2) using Quartus II version 9.0 show that the proposed hybrid FPGA architecture significantly reduces latency (42% reduction) with a small area overhead, compared to the conventional version. The proposed algorithm has been simulated for sine and cosine function evaluation, and it has been verified that its accuracy is comparable to that of the conventional algorithm.


Introduction
COordinate Rotation DIgital Computer (CORDIC) is an effective method for vector rotation in circular and hyperbolic coordinate systems. CORDIC was described by Jack Volder in 1959. Since then, a great deal of work on the subject has been published [1]. The method remains popular because it is easy to implement in both software and hardware. Indeed, the classical CORDIC method needs only a small amount of memory and a few basic operations: loading from memory, addition, subtraction, and shift. The disadvantage of classical CORDIC is its linear convergence: to obtain $N$ correct fractional bits of the result, one must perform all $N$ iterations.
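The classical iteration mentioned above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `cordic_sincos`, the use of floating point instead of fixed-point shifts, and the default of 16 iterations are assumptions made only for illustration.

```python
import math

def cordic_sincos(phi, n_iter=16):
    """Conventional rotation-mode CORDIC for phi in [-pi/2, pi/2] (floats for clarity)."""
    # Elementary angles atan(2^-i); in hardware these come from a small table.
    angles = [math.atan(2.0 ** -i) for i in range(n_iter)]
    # Constant compensation of the accumulated scale factor sqrt(1 + 2^(-2i)).
    gain = 1.0
    for i in range(n_iter):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, phi
    for i in range(n_iter):
        d = 1.0 if z >= 0 else -1.0          # rotation direction
        # Multiplications by 2^-i are right shifts in hardware.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]                   # residual angle
    return x, y                              # approximately (cos phi, sin phi)
```

Each pass through the loop contributes roughly one more correct bit, which is the linear convergence referred to in the text.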
A comprehensive review of the current state of the theory and hardware realization of the CORDIC method is provided in [1][2][3]. To achieve high performance, hybrid algorithms such as LUT + CORDIC or CORDIC + final multipliers have been attempted [1,4,5]. However, no hybrid algorithm combines all three components (LUT, CORDIC, and final multipliers), and no practical realization of such an algorithm is known. An interesting method for increasing productivity and reducing the average computation time, known as the scaling-free CORDIC method, was proposed in [6,7]. However, this approach has a narrow range of input angles and a complex structure, which makes it impossible to design CORDIC calculators with an accuracy of more than 16 correct bits.
In this paper we propose a hybrid high-performance architecture with a simple structure that yields more than 16 correct bits of the result.

General Approach
The known methods for improving the performance of conventional [1,2,8,9] and scaling-free [6,7,10,11] CORDIC methods are the table (LUT) calculation method and the residual multiplication method, described in [4,10,12]. The table method is used to process the most significant bits of the input angle (argument), while the residual multiplication handles the least significant bits.
The first method creates a precalculated table of values for the most significant bits of the function argument. For this method, [10] proposes an approach based on the scaling-free CORDIC, which combines memory units (LUT) with the corresponding iterative processing. The algorithms in [4,12] compute sine and cosine functions by performing the first iterations with the classical CORDIC and then applying output multipliers, which allows significant savings in hardware resources compared to the classical approach described in [13]. We propose combining both approaches. The combination of memory units, iterative scaling-free CORDIC stages, and multipliers provides a method that does not deform the modulus of the vector, which significantly saves hardware resources in FPGA implementation and reduces delays. The theoretical basis for the proposed method and its 16-bit microcontroller implementation are described in [11,14]. However, no hardware implementation on FPGA has been reported until now. This paper proposes an algorithm for a hybrid FPGA calculator of the trigonometric sine and cosine functions.

Proposed Algorithm
We give an improved and expanded version of the hybrid CORDIC algorithm [11,14] as follows.
Let $N$ be the given number of bits of the CORDIC, which determines the absolute error $\Delta_N = 2^{-N}$ for calculating sine and cosine functions. We assume that the input angle $\varphi$ is in the range $0$ to $\pi/4$ (the range of values of $\varphi$ will be discussed below) and is represented as

$$\varphi = \sum_{i=1}^{N} a_i 2^{-i},$$

where $a_i$ are the coefficients of the binary representation of the angle $\varphi$.
To construct the sine-cosine generator, we use three elements: a LUT, simple scaling-free CORDIC stages, and output multipliers. Accordingly, the bits of the input angle $\varphi$ are divided into three groups, which are processed sequentially by the LUT, the scaling-free CORDIC, and the output multipliers. Therefore

$$\varphi = \sum_{i=1}^{n_1} a_i 2^{-i} + \sum_{i=n_1+1}^{n_2} a_i 2^{-i} + \sum_{i=n_2+1}^{N} a_i 2^{-i}.$$

We must set the numbers $n_1$ and $n_2$ to determine the size of each of these construction elements. Using $n_1$ and $n_2$, the bit indices of each group can be written as $i_{\mathrm{LUT}} = 1, \ldots, n_1$; $i_{\mathrm{CORDIC}} = (n_1 + 1), \ldots, n_2$; $i_{\mathrm{Multiplier}} = (n_2 + 1), \ldots, N$.
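The three-way split of the angle bits can be sketched as follows. The helper name `split_angle_bits` is hypothetical, and the defaults of 16 bits with group boundaries 4 and 8 are an assumption inferred from the 4-bit LUT input and the iterations 5 to 8 mentioned later in the text.

```python
def split_angle_bits(phi, n=16, n1=4, n2=8):
    """Split phi in [0, 1) into the LUT, CORDIC and multiplier bit groups."""
    word = int(phi * (1 << n))               # truncate to n fractional bits
    # bits[i - 1] holds a_i, where a_1 is the most significant fractional bit
    bits = [(word >> (n - i)) & 1 for i in range(1, n + 1)]

    def value(lo, hi):
        # Partial sum of a_i * 2^-i over i = lo..hi
        return sum(bits[i - 1] * 2.0 ** -i for i in range(lo, hi + 1))

    return value(1, n1), value(n1 + 1, n2), value(n2 + 1, n)
```

For example, `split_angle_bits(0.7071)` quantizes the angle to the 16-bit word 0xB504, so the LUT group is 11/16, the CORDIC group is 5/256, and the multiplier group is 4/65536.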
To do this, we first consider the possibility of using the CORDIC method in this computational scheme. On every iteration $i$, the scale factor is $\sqrt{1 + 2^{-2i}}$ for the conventional CORDIC and $\sqrt{1 + 2^{-4i-2}}$ for the scaling-free CORDIC [11,14]. For a given value of $N$, the influence of the scale factor on the results of calculations is canceled for the conventional CORDIC when $\sqrt{1 + 2^{-2i}} - 1 < \Delta_N$, which holds beginning from a certain iteration [1, 4], and for the scaling-free CORDIC when $\sqrt{1 + 2^{-4i-2}} - 1 < \Delta_N$. This is because the scale factor then becomes equal to 1 to within $\Delta_N$.
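The two conditions on the scale factor can be checked numerically. The sketch below simply searches for the smallest iteration index that satisfies each inequality; the function name `first_negligible_iteration` is hypothetical and the search is an illustration, not a procedure stated in the paper.

```python
import math

def first_negligible_iteration(n_bits, scaling_free):
    """Smallest i whose per-iteration scale factor differs from 1 by less than 2^-n_bits."""
    delta = 2.0 ** -n_bits
    i = 0
    while True:
        # Exponent of the term under the square root for the chosen CORDIC variant
        exponent = -(4 * i + 2) if scaling_free else -2 * i
        if math.sqrt(1.0 + 2.0 ** exponent) - 1.0 < delta:
            return i
        i += 1
```

With `n_bits = 16` this returns 8 for the conventional factor and 4 for the scaling-free factor, so the scaling-free condition is reached after about half as many iterations.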
Examples of the distribution of iterations between the LUT, scaling-free (SF) CORDIC, and multipliers for different numbers of digits are shown in Table 1.
Step 1. Preprocessing (or range reduction). Calculation of the sine and cosine of any angle $\theta$ in the range $0$ to $2\pi$ can be reduced to the calculation of the sine and cosine of an angle $\varphi$ in the first octant, $0$ to $\pi/4$. Such a reduction of the range is permitted by the 8-point symmetry of the circle.
Methods for transforming the input angle $\theta$ into $\varphi$ are well known (for example, [8,10]), so we do not discuss them here.
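Step 2. For $N = 16$ the angle $\varphi$ is divided into the following three angles:

$$\varphi_{\mathrm{LUT}} = \sum_{i=1}^{4} a_i 2^{-i}, \quad \varphi_{\mathrm{CORDIC}} = \sum_{i=5}^{8} a_i 2^{-i}, \quad \varphi_{\mathrm{M}} = \sum_{i=9}^{16} a_i 2^{-i}.$$

Step 3. The initial value is selected from the LUT, and then the scaling-free CORDIC iterations are made for $i = 5, 6, 7$, followed by the iteration for $i = 8$. Step 3 can also be written in vector-matrix form.
Step 4. Final multiplication. The final multiplication can be realized with the shift-add operations typical of CORDIC.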
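Steps 2 to 4 can be sketched in Python under explicitly stated assumptions, since the iteration formulas themselves are not reproduced here. The sketch assumes the usual scaling-free rotation by $2^{-i}$ with $\cos \approx 1 - 2^{-(2i+1)}$ and $\sin \approx 2^{-i}$ (this choice gives exactly the scale factor $\sqrt{1 + 2^{-4i-2}}$ quoted earlier), a final multiplication with $\cos \varphi_M \approx 1$ and $\sin \varphi_M \approx \varphi_M$, and floating-point arithmetic instead of fixed point. The names `LUT` and `hybrid_sincos` are hypothetical.

```python
import math

N, N1, N2 = 16, 4, 8   # assumed split: bits 1..4 LUT, 5..8 CORDIC, 9..16 multiplier

# Table of (cos, sin) for every value of the top N1 bits of the angle
LUT = [(math.cos(k * 2.0 ** -N1), math.sin(k * 2.0 ** -N1)) for k in range(1 << N1)]

def hybrid_sincos(phi):
    """Sketch of steps 2-4 for phi in [0, pi/4]; returns approximately (cos phi, sin phi)."""
    word = int(phi * (1 << N))                             # N fractional bits
    a = [(word >> (N - i)) & 1 for i in range(1, N + 1)]   # a[i - 1] = a_i
    x, y = LUT[word >> (N - N1)]                           # step 3: table lookup
    for i in range(N1 + 1, N2 + 1):                        # step 3: scaling-free stages
        if a[i - 1]:
            # Rotate by 2^-i only when bit a_i is set (assumed iteration form)
            x, y = (x - 2.0 ** -(2 * i + 1) * x - 2.0 ** -i * y,
                    y - 2.0 ** -(2 * i + 1) * y + 2.0 ** -i * x)
    phi_m = (word & ((1 << (N - N2)) - 1)) * 2.0 ** -N     # step 4: low-order bits
    return x - phi_m * y, y + phi_m * x                    # final multiplication
```

Because the rotation is applied only when a bit is set, no direction decision and no global gain compensation are needed in this sketch, which is the practical attraction of the scaling-free form.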
Step 5. Postprocessing. The computed sine and cosine of the angle $\varphi$ must be converted to the sine and cosine of the angle $\theta$ by the 8-point symmetry of the circle (everything reduces to the appropriate choice of the signs of $x_9$ and $y_9$).
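The preprocessing of step 1 and the postprocessing of step 5 can be sketched together. This is one common way to use the octant symmetry, not necessarily the method of [8,10]; the function names `reduce_to_first_octant` and `restore_from_octant` are hypothetical, and in odd octants the sketch also swaps the two values in addition to changing signs.

```python
import math

def reduce_to_first_octant(theta):
    """Map any theta to (phi, octant) with phi in [0, pi/4]."""
    theta %= 2 * math.pi
    octant = int(theta // (math.pi / 4))
    r = theta - octant * (math.pi / 4)       # offset inside the octant
    # In odd octants the angle is measured back from the octant's upper boundary.
    phi = r if octant % 2 == 0 else math.pi / 4 - r
    return phi, octant

def restore_from_octant(c, s, octant):
    """Turn c = cos(phi), s = sin(phi) into (cos theta, sin theta)."""
    table = [(c, s), (s, c), (-s, c), (-c, s),
             (-c, -s), (-s, -c), (s, -c), (c, -s)]
    return table[octant]
```

Any first-octant sine-cosine routine can be placed between the two calls.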

[Figure 1 block labels: LUT, CORDIC, Shift Add; input bits $a_1, a_2, \ldots, a_{16}$.]

FPGA Implementation Results
To show the effectiveness of the proposed algorithm, we implemented steps 3 and 4 of the algorithm on an Altera Stratix III FPGA (EP3SL340F1517C2) using Quartus II version 9.0. Figure 1 shows the architecture of the algorithm implemented on the FPGA. In this figure, the LUT and CORDIC blocks realize step 3 of the algorithm, and the block labeled "Shift Add" realizes step 4. To achieve high-throughput computation, the architecture is pipelined with 9 stages: 1 stage for the LUT block, 4 stages for the CORDIC block, and 4 stages for the Shift Add block.
The LUT block is implemented as a ROM with a 4-bit input and two 16-bit outputs. In the CORDIC block, each iteration of the algorithm from $i = 5$ to $i = 8$ is implemented as a pipeline stage in logic elements, and the block has two 20-bit outputs. The Shift Add block is implemented with dedicated DSP blocks on the FPGA, which have 4 pipeline stages. The outputs of the Shift Add block have 28 bits, and they are rounded to 16 bits.
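The contents of such a ROM could be generated offline, for instance as in the sketch below. The number format is an assumption: the text gives only the 16-bit output width, so the sketch uses an unsigned format with 15 fractional bits so that $\cos 0 = 1$ fits in 16 bits. The names `FRAC_BITS` and `build_rom` are hypothetical.

```python
import math

FRAC_BITS = 15   # assumed unsigned format with 15 fractional bits (1.0 -> 32768)

def build_rom(addr_bits=4):
    """Return a list of (cos, sin) fixed-point words indexed by the top angle bits."""
    rom = []
    for k in range(1 << addr_bits):
        angle = k * 2.0 ** -addr_bits        # angle encoded by the address bits
        rom.append((round(math.cos(angle) * (1 << FRAC_BITS)),
                    round(math.sin(angle) * (1 << FRAC_BITS))))
    return rom
```

Since the reduced angle does not exceed $\pi/4 \approx 0.785$, only addresses 0 to 12 are reached in this sketch; the remaining entries are filled for completeness.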
In the best-known FPGA implementation of the conventional CORDIC algorithm for high throughput, all iterations of the algorithm from $i = 1$ to $i = 16$ are unrolled, and each of them is implemented as a pipeline stage. Such an implementation therefore has a long latency and requires many logic elements. In contrast, our implementation of the hybrid CORDIC requires fewer logic elements and has a shorter latency because it has fewer pipeline stages.
Table 2 shows the FPGA implementation results of the conventional CORDIC algorithm and our hybrid CORDIC algorithm. As the table shows, our hybrid algorithm is much more efficient than the conventional one in terms of hardware cost. Compared to the conventional CORDIC, the proposed algorithm has approximately the same bandwidth. However, there is a considerable reduction in latency, as well as in the number of logic elements. Moreover, the number of pipeline stages is also reduced. Therefore, the proposed method is easier to implement and uses fewer logic elements, while it has similar bandwidth. Although our hybrid algorithm requires not only logic elements but also DSP blocks, this is efficient for modern FPGAs, because most modern FPGAs have DSP blocks and memory blocks, and a balanced use of them together with logic elements is more efficient. Therefore, we can conclude that our hybrid algorithm is suitable for modern FPGA implementation.

Concluding Remarks
This paper describes the theoretical basis and a practical pipelined FPGA implementation of a new hybrid scaling-free CORDIC algorithm. The logical combination of three construction elements of modern FPGAs (LUT, simple scaling-free CORDIC stages, and multipliers) allowed a considerable improvement in the efficiency of calculating sine and cosine functions without loss of accuracy, compared to conventional CORDIC algorithms. The proposed method reduces the FPGA hardware resources by more than 30%, while providing the same throughput and accuracy of calculations.

Figure 1: Architecture of the hybrid scaling-free CORDIC algorithm.

Table 2: FPGA implementation results of conventional and hybrid algorithms.