Numerically controlled oscillators (NCOs) are widely used in radar, digital receivers, and software radio systems. This
paper first introduces the traditional CORDIC algorithm. Then, to improve computing speed and save resources, it
proposes a hybrid CORDIC algorithm based on phase rotation estimation for use in an NCO.
By estimating the direction of part of the phase rotations, the algorithm removes some rotation stages and add-subtract units and
thereby decreases delay. The NCO is simulated and implemented with Quartus II and
Modelsim. Simulation results indicate that, compared with the traditional CORDIC algorithm, the proposed algorithm improves
ease of computation, resource utilization, and computing speed/delay while maintaining precision. It is suitable for
high-speed, high-precision digital modulation and demodulation.
1. Introduction
A numerically controlled oscillator (NCO) is an important part of digital downconversion. It is widely used in radar, wireless transceiver systems, and software radio systems [1–3]. The main function of an NCO is to produce two mutually orthogonal, discrete-time sample streams, sine and cosine, with variable frequency. It offers high frequency precision and fast response.
NCOs are traditionally implemented with the lookup table method or the polynomial expansion method. The data accuracy of the lookup table method depends on the size of the lookup table ROM. The memory size grows exponentially with the phase precision, which increases resource consumption and reduces the processing speed of the system. The work in [4] addresses this problem by exploiting odd-even symmetry in the mapping of stored content, which optimizes the storage unit and reduces the storage resources to 12.5%. However, when high precision is required, it still consumes a lot of resources. The polynomial expansion method is a real-time computing method that needs multiplier resources, which restricts the complexity and speed of the hardware. With either method it is hard to trade off speed, accuracy, and resources. The coordinate rotation digital computer (CORDIC) algorithm was proposed to solve this problem. It replaces complex operations with one simple basic operation and is easy to implement in hardware. It requires no hardware multiplier, and all operations are shifts and accumulations, which meets the hardware requirements of modular and regular algorithms.
With the advent of high-speed broadband receivers, the requirements on data accuracy and processing speed have risen. In this setting the traditional CORDIC algorithm has some inherent drawbacks, such as a limited coverage angle and too many pipeline stages, which increase resource consumption and limit the data processing speed. To address these shortcomings, this paper puts forward a CORDIC algorithm with an efficient pipeline architecture for NCO design.
2. Traditional CORDIC Algorithm
Volder proposed the CORDIC algorithm in 1959, and in 1971 Walther unified the form of the algorithm [5, 6]. Meyer-Baese realized the algorithm on an FPGA for the first time. The CORDIC algorithm has been applied in many fields, such as direct digital frequency synthesizers, fast Fourier transforms, discrete cosine transforms, digital modulators/demodulators, and stream processors [7–10]. The algorithm rotates a starting point $(x_i, y_i)$ through a sequence of fixed phases so that it gradually approaches the final point $(x_j, y_j)$. The rotation vector diagram is shown in Figure 1.
Figure 1: CORDIC vector rotation diagram.
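From Figure 1, it follows that
$$\begin{bmatrix} x_j \\ y_j \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x_i \\ y_i \end{bmatrix}. \tag{1}$$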
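The rotation from the start position to the end position can be carried out in several steps, each of which rotates by a fixed phase:
$$\begin{bmatrix} x_{n+1} \\ y_{n+1} \end{bmatrix} = \begin{bmatrix} \cos\theta_n & -\sin\theta_n \\ \sin\theta_n & \cos\theta_n \end{bmatrix} \begin{bmatrix} x_n \\ y_n \end{bmatrix}. \tag{2}$$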
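After factoring out $\cos\theta_n$, formula (2) can be expressed as follows:
$$\begin{bmatrix} x_{n+1} \\ y_{n+1} \end{bmatrix} = \cos\theta_n \begin{bmatrix} 1 & -\tan\theta_n \\ \tan\theta_n & 1 \end{bmatrix} \begin{bmatrix} x_n \\ y_n \end{bmatrix}. \tag{3}$$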
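To simplify the hardware implementation, the magnitude of each rotation phase is set to $\theta_n = \arctan(2^{-n})$, and its direction is given by $S_n = \pm 1$. The total rotation phase is $\theta = \sum_n S_n\theta_n$, so $\tan(S_n\theta_n) = S_n 2^{-n}$. Formula (3) can then be expressed as follows:
$$\begin{bmatrix} x_{n+1} \\ y_{n+1} \end{bmatrix} = \cos\theta_n \begin{bmatrix} 1 & -S_n 2^{-n} \\ S_n 2^{-n} & 1 \end{bmatrix} \begin{bmatrix} x_n \\ y_n \end{bmatrix}. \tag{4}$$
From formula (4), apart from the coefficient $\cos\theta_n$, each step consists only of shifts and additions.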
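In the final result, the $\cos\theta_n$ factors can be eliminated by multiplying by a known constant $K$. For example, when the number of iterations $P$ is 16 and $|\theta| \le \pi/4$, $K$ can be expressed as follows:
$$K = \prod_{n=1}^{16} \cos\theta_n = \prod_{n=1}^{16} \cos\bigl(\arctan(2^{-n})\bigr) = \prod_{n=1}^{16} \bigl(1 + 2^{-2n}\bigr)^{-1/2}. \tag{5}$$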
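As a quick numerical check of formula (5), the following Python sketch evaluates the product in both of its forms. The function name `cordic_gain` is introduced here for illustration only and is not part of the original design.

```python
import math


def cordic_gain(first: int, last: int) -> float:
    """Product of (1 + 2^(-2n))^(-1/2) for n = first..last, as in formula (5)."""
    k = 1.0
    for n in range(first, last + 1):
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * n))
    return k


# Scale factor for 16 iterations starting at n = 1.
K = cordic_gain(1, 16)

# Cross-check against the cosine form of the same product.
K_direct = math.prod(math.cos(math.atan(2.0 ** -n)) for n in range(1, 17))

print(f"K (closed form)  = {K:.9f}")
print(f"K (cosine form)  = {K_direct:.9f}")
```

Both forms agree to floating-point precision, which is why the constant can be precomputed once and stored.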
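With the scale factor removed, the approximate iterative rotation formula is
$$\begin{bmatrix} x_{n+1} \\ y_{n+1} \end{bmatrix} = \begin{bmatrix} 1 & -S_n 2^{-n} \\ S_n 2^{-n} & 1 \end{bmatrix} \begin{bmatrix} x_n \\ y_n \end{bmatrix}. \tag{6}$$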
The parameter $z$ tracks the residual phase and determines the rotation direction: $z_{n+1} = z_n - S_n\theta_n$ with $z_0 = \theta$. When $z_n < 0$, $S_n = -1$; when $z_n \ge 0$, $S_n = +1$. If the initial value is $(x_i, y_i) = (x_0, y_0) = (K, 0)$, then $(x_n, y_n)$ converges to $(\cos\theta, \sin\theta)$ after the $P$th iteration. The phase convergence satisfies the CORDIC convergence theorem [6]. The constant scaling factor $K$ is fixed and can be precomputed once the precision $N$ is determined. From an analysis of the calculation accuracy of the traditional CORDIC algorithm, the iteration number and the phase precision are related as follows:
$$P \ge -\log_2\bigl[\tan(\Delta\theta_{\min})\bigr], \tag{7}$$
where $\Delta\theta_{\min} = 2\pi/2^N$ and $N$ is the input phase data width.
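The iteration described by formulas (5) to (7) can be sketched in floating-point Python as follows. This is a behavioural model only, not the hardware implementation; the names `cordic_rotate` and `min_iterations` are illustrative assumptions, and the direction rule used is $S_n = +1$ when $z_n \ge 0$ and $S_n = -1$ otherwise.

```python
import math


def cordic_rotate(theta: float, iterations: int = 16):
    """Rotation-mode CORDIC with angles arctan(2^-n), n = 1..iterations.

    Returns an approximation of (cos(theta), sin(theta)) for |theta| <= pi/4.
    """
    # Precomputed scale factor K from formula (5).
    k = math.prod(1.0 / math.sqrt(1.0 + 2.0 ** (-2 * n))
                  for n in range(1, iterations + 1))
    x, y, z = k, 0.0, theta
    for n in range(1, iterations + 1):
        s = 1.0 if z >= 0.0 else -1.0      # rotation direction S_n
        t = 2.0 ** -n                      # a right shift by n bits in hardware
        x, y = x - s * t * y, y + s * t * x
        z -= s * math.atan(t)              # residual phase update
    return x, y


def min_iterations(n_bits: int) -> int:
    """Smallest integer P with P >= -log2(tan(2*pi / 2^N)), formula (7)."""
    return math.ceil(-math.log2(math.tan(2.0 * math.pi / 2 ** n_bits)))


c, s = cordic_rotate(math.pi / 6)
print(f"cos = {c:.6f}, sin = {s:.6f}")
print("minimum iterations for N = 16:", min_iterations(16))
```

For a 16-bit phase word the bound of formula (7) evaluates to 14 iterations, which matches the relation $P \ge N - 2$ used later in the paper.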
3. Hybrid CORDIC Algorithm Based on Phase Rotation Estimation
Common CORDIC structures are the iterative, pipelined, and differential structures. The iterative structure occupies fewer hardware resources, but its data processing efficiency is low. The pipelined structure occupies more hardware resources but improves the throughput. Implementation schemes based on these two structures include parallel pipelines, hybrid rotation CORDIC, the angle encoding method, and so forth [11, 12]. The work in [13] puts forward a way of predicting the rotation direction. That algorithm, applied to error analysis and elimination, is fast, but it does not optimize the hardware structure. Using the structure of the parallel hybrid CORDIC algorithm, the prediction scheme of [14] is more regular and simpler than previous approaches and can reduce the number of iterations by more than 50 percent. However, the judgment of the rotation direction is not optimized, which increases latency and resource use and therefore limits the throughput. The work in [15] puts forward a modified hybrid CORDIC algorithm that improves the precision of the output data, but the method is more complex. Weighing the disadvantages of the above methods against the advantages of the pipelined and iterative structures, this paper simplifies the CORDIC algorithm further. By using a property of the arctangent function, it reduces the number of rotation direction judgments and add-subtract operations.
In this paper, attention is focused mainly on techniques that reduce the number of iterations while keeping the latency low. This section presents the hybrid CORDIC algorithm based on phase rotation estimation, which can be realized with digit-on-line pipelined CORDIC circuits and a repetitive multiple accumulations architecture.
3.1. Rotation Phase Estimation
Assuming that the input phase length of the CORDIC algorithm is $N$ and the number of pipeline stages is $P$, the rotation phase $\theta$ can be represented as follows:
$$\theta = \sum_{n=1}^{P} S_n\theta_n = \sum_{n=1}^{P} S_n \arctan(2^{-n}), \tag{8}$$
where $S_n = \pm 1$. The index $n$ starts at 1 because the rotation angle is restricted to the range $|\theta| \le \pi/4$ in the NCO application.
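As the rotation index $n$ increases, $\arctan(2^{-n})$ approaches $2^{-n}$. For $n \ge 1$, $2^{-n} > \arctan(2^{-n})$, and the error is $\varepsilon_n = 2^{-n} - \arctan(2^{-n})$. Expanding the arctangent function in a Taylor series gives
$$\varepsilon_n = 2^{-n} - \Bigl[2^{-n} - \tfrac{1}{3}(2^{-n})^3 + \tfrac{1}{5}(2^{-n})^5 - \cdots\Bigr] = \tfrac{1}{3}(2^{-n})^3 - \tfrac{1}{5}(2^{-n})^5 + \cdots, \tag{9}$$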
where $\varepsilon_n < (1/3)2^{-3n}$. The minimum phase value is $2^{-N}$. In the process of phase rotation, when the error bound satisfies $(1/3)2^{-3n} \le 2^{-N}$, the error introduced by using the estimated value $2^{-n}$ can be ignored. The range of $n$ is therefore
$$n \ge \frac{N - \log_2 3}{3}. \tag{10}$$
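When $n \ge m = \lceil (N - \log_2 3)/3 \rceil$, $\arctan(2^{-n}) \approx 2^{-n}$. By (7), the number of pipeline stages of the CORDIC algorithm satisfies $P \ge N - 2$. The fewer the pipeline stages, the higher the speed. With $P = N - 2$, we define the hybrid radix set:
$$\theta = \sum_{n=1}^{N-2} S_n\theta_n = \sum_{n=1}^{m} S_n \arctan(2^{-n}) + \sum_{n=m+1}^{N-2} S_n 2^{-n}. \tag{11}$$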
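The threshold of formula (10) can be checked numerically. The sketch below computes $m$ for several word lengths and tests whether the substitution error $\varepsilon_n = 2^{-n} - \arctan(2^{-n})$ is below the minimum phase value $2^{-N}$ at $n = m$ and at $n = m - 1$. The function names and the choice of word lengths are illustrative.

```python
import math


def switch_index(n_bits: int) -> int:
    """m = ceil((N - log2(3)) / 3), the first index where arctan(2^-n) ~ 2^-n."""
    return math.ceil((n_bits - math.log2(3.0)) / 3.0)


def substitution_error(n: int) -> float:
    """epsilon_n = 2^-n - arctan(2^-n), the error of replacing the arctangent."""
    return 2.0 ** -n - math.atan(2.0 ** -n)


for n_bits in (16, 24, 32):
    m = switch_index(n_bits)
    ok_at_m = substitution_error(m) <= 2.0 ** -n_bits
    ok_before = substitution_error(m - 1) <= 2.0 ** -n_bits
    print(f"N = {n_bits}: m = {m}, negligible at m: {ok_at_m}, at m - 1: {ok_before}")
```

For these word lengths the error is negligible at $n = m$ but not at $n = m - 1$, so the threshold is tight.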
After iterating $m+1$ times, the sum of the residual rotation phases is $\sum\theta_n$, as shown in formula (12):
$$\sum\theta_n = \sum_{n=m+1}^{N-2} S_n 2^{-n} = S_{m+1}2^{-m-1} + S_{m+2}2^{-m-2} + \cdots + S_{N-2}2^{-N+2} < 2^{-m}. \tag{12}$$
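The actual residual phase is $z_{m+1}$. According to the traditional CORDIC algorithm theory, $z_{m+1} \approx \sum\theta_n$, so $|z_{m+1}| < 2^{-m}$. When the $(m+1)$th rotation begins, the new rotation phase $\tilde\theta_{m+1}$ is $\phi_{m+1}$, where $\phi_{m+1} = |z_{m+1}|$. Because this phase is small, $\tan(\tilde S_{m+1}\tilde\theta_{m+1}) \approx \tilde S_{m+1}\tilde\theta_{m+1}$. When $z_{m+1} < 0$, $\tilde S_{m+1} = -1$; when $z_{m+1} \ge 0$, $\tilde S_{m+1} = +1$. Substituting into formula (3) gives
$$\begin{bmatrix} x_{m+2} \\ y_{m+2} \end{bmatrix} = \cos\phi_{m+1} \begin{bmatrix} 1 & -\tilde S_{m+1}\tilde\theta_{m+1} \\ \tilde S_{m+1}\tilde\theta_{m+1} & 1 \end{bmatrix} \begin{bmatrix} x_{m+1} \\ y_{m+1} \end{bmatrix}. \tag{13}$$
After this rotation, the residual phase is 0, so $x_{m+2}$ and $y_{m+2}$ are the cosine and sine outputs.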
3.2. Rotation Function Optimization and Error Analysis
To obtain cosine data from the new pipeline process, we put forward a unidirectional rotation method that removes the comparator and the selection between addition and subtraction. First, $\tilde\theta_{m+1}$ must be expressed. The $N$-bit input phase is iterated $m$ times, and the result is denoted $W$. The residual phase at this point is $z_{m+1}$. $\tilde\theta_{m+1}$ is expressed in terms of the integer $T$:
$$T = \sum_{i=1}^{N-2} A_i 2^i = A_{N-2}2^{N-2} + A_{N-3}2^{N-3} + \cdots + A_1 2^1, \tag{14}$$
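where $A_i = 1$ or $0$. When $W[N-2] = 0$, $A_i = W[i]$. When $W[N-2] = 1$, the $A_i$ are obtained by inverting every bit of $W$ and adding 1 (two's complement negation). Then
$$\tilde\theta_{m+1} = \frac{2\pi T}{2^N} = \frac{\tilde T}{2^N}. \tag{15}$$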
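In formula (15), $\tilde T$ is still unknown. In the hardware implementation, $\tilde T$ is expressed as follows:
$$\tilde T = \sum_{i=1}^{N} B_i 2^i = 2\pi T \approx \Bigl(\sum_{i=1}^{N-2} A_i 2^i\Bigr)\bigl(2^2 + 2 + 2^{-2} + 2^{-5} + 2^{-9}\bigr), \tag{16}$$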
where $B_i = 1$ or $0$. Substituting the bits of $\tilde T$ into (15) gives
$$\tilde\theta_{m+1} = \sum_{i=1}^{N-m-1} B_i 2^{i-N} = B_{N-m-1}2^{-m-1} + \cdots + B_2 2^{2-N} + B_1 2^{1-N}. \tag{17}$$
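The binary approximation of $2\pi$ in formula (16) lets the multiplication be done with shifts and additions only. The following sketch illustrates this; the fixed-point scaling of 9 fractional bits and the names `TWO_PI_EXPONENTS`, `FRAC_BITS`, and `times_two_pi` are assumptions made for the example, not details taken from the original design.

```python
import math

# Exponents of the powers of two in formula (16): 2^2 + 2 + 2^-2 + 2^-5 + 2^-9.
TWO_PI_EXPONENTS = (2, 1, -2, -5, -9)

# Number of fractional bits kept in the result (assumed for this sketch).
FRAC_BITS = 9


def two_pi_binary() -> float:
    """Value of the five-term binary approximation of 2*pi."""
    return sum(2.0 ** e for e in TWO_PI_EXPONENTS)


def times_two_pi(t: int) -> int:
    """Multiply the integer t by ~2*pi using shifts and adds only.

    The result is a fixed-point number with FRAC_BITS fractional bits.
    """
    acc = 0
    for e in TWO_PI_EXPONENTS:
        acc += t << (e + FRAC_BITS)   # every shift amount is non-negative
    return acc


approx = two_pi_binary()
print(f"binary 2*pi = {approx}, error = {approx - 2.0 * math.pi:.3e}")
print("1 * 2*pi in fixed point:", times_two_pi(1))   # 3217 = 6.283203125 * 512
```

The five-term constant equals 6.283203125, which differs from $2\pi$ by less than $2\times10^{-5}$.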
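Combining formulas (13) and (17) gives
$$\begin{bmatrix} x_{m+2} \\ y_{m+2} \end{bmatrix} = \begin{bmatrix} x_{m+1} - \tilde S_{m+1} \cdot \sum_{i=1}^{N-m-1} B_i 2^{i-N} \cdot y_{m+1} \\ \tilde S_{m+1} \cdot \sum_{i=1}^{N-m-1} B_i 2^{i-N} \cdot x_{m+1} + y_{m+1} \end{bmatrix}, \tag{18}$$
where $\tilde S_{m+1} = \pm 1$ and $B_i = 1$ or $0$.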
The efficient pipelined CORDIC algorithm uses the rotation phases $\{\arctan 2^{-1}, \ldots, \arctan 2^{-m-1}, \tilde\theta_{m+1}\}$ instead of the traditional rotation phases $\{\arctan 2^{-1}, \ldots, \arctan 2^{-m-1}, \ldots, \arctan 2^{-P}\}$. The last rotation phase can be expressed in binary, and its rotation direction is obtained directly from it. One additional shift-and-add operation replaces $N-m-3$ rotations. While preserving phase and data accuracy, this reduces the resource consumption and improves the operation speed. Finally, with fewer pipeline stages, the constant coefficient is as follows:
$$\hat K = \prod_{n=1}^{m+1} \cos\bigl(\arctan(2^{-n})\bigr). \tag{19}$$
The initial value $(x_0, y_0)$ is now set to $(\hat K, 0)$. Following the above process, $(x_{m+2}, y_{m+2})$ converges to $(\cos\theta, \sin\theta)$.
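The complete hybrid scheme can be modelled in a few lines of floating-point Python: $m+1$ traditional rotations followed by a single rotation by the whole residual phase. This is a behavioural sketch under stated assumptions, not the hardware design. It works in radians, uses the residual $z$ directly instead of the binary word of formula (17), and drops the factor $\cos\phi_{m+1}$ of formula (13), which is close to 1. The name `hybrid_cordic` is illustrative.

```python
import math


def hybrid_cordic(theta: float, n_bits: int = 16):
    """Hybrid CORDIC: m + 1 traditional rotations, then one estimated rotation.

    Returns an approximation of (cos(theta), sin(theta)) for |theta| <= pi/4.
    """
    m = math.ceil((n_bits - math.log2(3.0)) / 3.0)      # formula (10)
    stages = m + 1
    # Reduced scale factor K-hat of formula (19).
    k_hat = math.prod(math.cos(math.atan(2.0 ** -n)) for n in range(1, stages + 1))

    x, y, z = k_hat, 0.0, theta
    for n in range(1, stages + 1):
        s = 1.0 if z >= 0.0 else -1.0
        t = 2.0 ** -n
        x, y = x - s * t * y, y + s * t * x
        z -= s * math.atan(t)

    # Final rotation, formula (18): tan(z) is replaced by z itself.
    # The sign of z plays the role of S-tilde, so no comparison is needed.
    x, y = x - z * y, y + z * x
    return x, y


worst = 0.0
for step in range(-100, 101):
    angle = step * (math.pi / 4) / 100
    c, s = hybrid_cordic(angle)
    worst = max(worst, abs(c - math.cos(angle)), abs(s - math.sin(angle)))
print(f"worst-case error over [-pi/4, pi/4]: {worst:.2e}")
```

In this model the dominant error is the small gain of the last rotation, roughly $z^2/2$, which stays below the $\pm 5\times10^{-4}$ band reported in the simulation section.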
The cosine error of this algorithm can be divided into three parts:
(i) quantization error, caused by the limited word length;
(ii) approximation error, caused by the limited phase word length;
(iii) rotation estimation error, caused by the phase estimation.
Quantization error is inversely related to the word length, and the output word length is set by the number of pipeline stages. The more pipeline stages there are, the lower the quantization error is, but more stages also consume more resources. The number of pipeline stages must therefore be traded off against the quantization error according to the required data width. Considering hardware consumption, computing speed, and precision, [7] proposes a method for optimizing the data bits and pipeline stages. According to [16], the quantization error consists of two parts: the error accumulated in previous iterations and the error produced in the current iteration. It can be expressed as follows:
$$|E_n| \le \left|\, |e_n| + \sum_{i=0}^{n-1} \Bigl(\prod_{j=i}^{n-1} S_j\Bigr) |e_i| \,\right|, \tag{20}$$
where $E_n$ is the total quantization error, $e_n$ is the quantization error of the $n$th phase rotation, and $S_j = \pm 1$.
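When the output data width is $N$ and $e_i = [e_{xi}\ e_{yi}]^T$ with $|e_{xi}| \le \varepsilon$, $|e_{yi}| \le \varepsilon$, and $\varepsilon = 2^{-N-1}$, $|e_i|$ can be bounded as follows:
$$|e_i| = \sqrt{e_{xi}^2 + e_{yi}^2} \le \sqrt{2} \times 2^{-N-1}. \tag{21}$$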
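According to formula (7), when the phase length is $N$, the phase resolution is $\varphi = 2\pi/2^N$. The approximation error produced by the limited phase word length can be expressed as follows:
$$|A| = \frac{|V - V'|}{|V'|} \le 2\sin\Bigl(\frac{\Delta\theta}{2}\Bigr) \le \Delta\theta \le \varphi, \tag{22}$$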
where $V$ is the actual value and $V'$ is the value with error. $\Delta\theta$ is the difference between the real phase and the approximate phase. In the final rotation phase estimate, $\arctan(2^{-n})$ is replaced by $2^{-n}$. Similarly, the value $2\pi$ can only be represented by a finite binary expansion, as in formula (16). The resulting error can be expressed as follows:
$$|B| = \sum_{i=m+2}^{N-3} \bigl(2^{-i} - \arctan(2^{-i})\bigr) + |2\pi - 6.2832| \approx \sum_{i=m+2}^{N-3} \bigl(2^{-i} - \arctan(2^{-i})\bigr) + 2\times10^{-5}. \tag{23}$$
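To get a feel for the size of the terms in formula (23), the sketch below evaluates them for $N = 16$ and $m = 5$. It assumes that the binary constant behind the rounded value 6.2832 is the five-term expansion of formula (16); the function name is illustrative.

```python
import math


def estimation_error(n_bits: int, m: int) -> tuple:
    """Return the two terms of formula (23): the arctangent part and the 2*pi part."""
    arctan_part = sum(2.0 ** -i - math.atan(2.0 ** -i)
                      for i in range(m + 2, n_bits - 2))   # i = m+2 .. N-3
    two_pi_binary = 2.0 ** 2 + 2.0 + 2.0 ** -2 + 2.0 ** -5 + 2.0 ** -9
    two_pi_part = abs(2.0 * math.pi - two_pi_binary)
    return arctan_part, two_pi_part


arctan_part, two_pi_part = estimation_error(16, 5)
print(f"arctangent substitution term: {arctan_part:.3e}")
print(f"2*pi approximation term:      {two_pi_part:.3e}")
print(f"total |B|:                    {arctan_part + two_pi_part:.3e}")
```

For this word length the $2\pi$ term dominates, and the total is close to the $2\times10^{-5}$ quoted in formula (23).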
4. The FPGA Design and Implementation of NCO
4.1. High Speed and Precision NCO Structure
This paper adopts the efficient pipelined CORDIC algorithm for a high-speed, high-precision NCO, whose structure is shown in Figure 2. We take 16-bit phase control words as an example. First, the inputs are a 16-bit phase control word and a 16-bit frequency control word. Second, the phase accumulator and phase adder produce a 16-bit phase value, and the phase map reduces it to the range $[0, \pi/4]$. Third, the shift-add efficient pipeline processes the phase data. Finally, according to the phase mapping relation, the 16-bit sine and cosine data are generated.
Figure 2: High speed and precision NCO structure.
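The front end of this structure, the phase accumulator followed by the phase adder, can be modelled as below. The sketch assumes a 16-bit accumulator that wraps modulo $2^{16}$; the function names are illustrative. The example values, a frequency control word of 16′h1999 at a 100 MHz clock, are the ones used in the simulation section of this paper.

```python
ACC_BITS = 16                      # width of the phase accumulator
ACC_MASK = (1 << ACC_BITS) - 1     # wraps the accumulator modulo 2^16


def output_frequency(fcw: int, f_clk_hz: float) -> float:
    """Output frequency of the NCO: f_out = FCW / 2^ACC_BITS * f_clk."""
    return fcw * f_clk_hz / (1 << ACC_BITS)


def phase_sequence(fcw: int, pcw: int, count: int) -> list:
    """First `count` phase values: accumulated FCW plus the phase control word."""
    acc = 0
    phases = []
    for _ in range(count):
        phases.append((acc + pcw) & ACC_MASK)   # phase adder
        acc = (acc + fcw) & ACC_MASK            # phase accumulator
    return phases


print(f"f_out = {output_frequency(0x1999, 100e6) / 1e6:.4f} MHz")
print([hex(p) for p in phase_sequence(0x1999, 0, 4)])
```

With these values the output frequency comes out just under 10 MHz, because 16′h1999 is the closest 16-bit word to one tenth of $2^{16}$.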
The rotation angle $\theta$ covers the range $[-44.855^\circ, 44.855^\circ]$, which approximates $[-\pi/4, \pi/4]$ and does not cover the full phase range $[0, 2\pi]$. Before a 16-bit phase value is sent into the algorithm, the symmetry of the sine and cosine functions is therefore exploited by examining the highest, second highest, and third highest bits. According to the mapping relation, the highest 3 bits of the 16-bit phase value are reduced to 3′b000 and the phase is reduced to $[0, \pi/4]$. The highest bit controls the sign of the sine data: if it is 1, the algorithm inverts the sine data and adds 1 (two's complement negation); otherwise, the sine data is left unchanged. The highest and second highest bits control the sign of the cosine data: if they differ, the algorithm inverts the cosine data and adds 1; otherwise, the cosine data is left unchanged. The second highest and third highest bits control the placement of the cosine and sine data: if they differ, the algorithm exchanges the cosine data and sine data; otherwise, they are left unchanged.
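The octant mapping described above can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: the reduced phase is mirrored within the octant when the third highest bit is 1, and the first-octant generator is a floating-point stand-in (`first_octant_core`) for the CORDIC pipeline so that the mapping can be checked on its own.

```python
import math

PHASE_BITS = 16
OCTANT_SPAN = 1 << (PHASE_BITS - 3)      # number of phase codes per octant


def first_octant_core(reduced: int):
    """Stand-in for the CORDIC core: (cos, sin) for a phase code in [0, pi/4]."""
    angle = 2.0 * math.pi * reduced / (1 << PHASE_BITS)
    return math.cos(angle), math.sin(angle)


def nco_sample(phase: int):
    """Map a 16-bit phase to (cos, sin) using only the first-octant core."""
    b2 = (phase >> (PHASE_BITS - 1)) & 1     # highest bit
    b1 = (phase >> (PHASE_BITS - 2)) & 1     # second highest bit
    b0 = (phase >> (PHASE_BITS - 3)) & 1     # third highest bit
    low = phase & (OCTANT_SPAN - 1)

    # Assumed step: mirror the phase inside odd octants.
    reduced = OCTANT_SPAN - low if b0 else low
    c, s = first_octant_core(reduced)

    if b1 ^ b0:          # second and third highest bits differ: swap
        c, s = s, c
    if b2 ^ b1:          # highest and second highest bits differ: negate cosine
        c = -c
    if b2:               # highest bit set: negate sine
        s = -s
    return c, s


c, s = nco_sample(0x6000)    # 0x6000 / 2^16 of a full turn = 135 degrees
print(f"cos = {c:.6f}, sin = {s:.6f}")
```

In hardware the negations would be two's complement operations on the 16-bit outputs, as described in the text; the floating-point negation here only models the sign.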
4.2. Internal Architecture Design and the Major Implementation Steps
According to formula (10), for $N = 16$ our algorithm needs $m = \lceil (16 - \log_2 3)/3 \rceil = 5$ traditional phase rotations and one rotation based on phase estimation. For example, if $\tilde\theta_6 = 2^{-9} + 2^{-10} + 2^{-13}$, the pipeline structure is as shown in Figure 3. Each stage needs only three adder-subtractors, two or six phase shift registers, and a phase coefficient memory, and the design eliminates more than half of the rotation direction judgments and shift operations. To shorten the critical path in the pipelined implementation of the traditional CORDIC, the differential CORDIC (D-CORDIC) algorithm based on digit-on-line pipelined CORDIC circuits [17] can be used to achieve higher throughput and lower pipeline latency. The D-CORDIC algorithm is equivalent to the usual CORDIC in terms of accuracy and convergence. The system architecture uses a parallel, pipelined differential CORDIC architecture to reduce latency and improve throughput. Digit-on-line pipelined CORDIC circuits take the place of the continuous phase accumulation in Figure 3.
Figure 3: Phase rotation estimation based on pipeline structure.
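For the example value $\tilde\theta_6 = 2^{-9} + 2^{-10} + 2^{-13}$, the final stage of formula (18) reduces to three shifts and a few additions per output. The integer sketch below illustrates this; the Q14 scaling of the inputs and the function name `final_stage` are assumptions made for the example.

```python
def final_stage(x: int, y: int, sign: int, shifts) -> tuple:
    """Last pipeline stage of formula (18) in integer arithmetic.

    `shifts` lists the positions of the set bits of the estimated phase,
    so multiplying by the phase becomes a sum of right-shifted copies.
    """
    dx = sum(y >> k for k in shifts)     # estimated phase times y
    dy = sum(x >> k for k in shifts)     # estimated phase times x
    return x - sign * dx, y + sign * dy


# Example phase 2^-9 + 2^-10 + 2^-13 from the text.
SHIFTS = (9, 10, 13)

# Inputs in Q14 fixed point (assumed scaling): x = 1.0, y = 0.0.
x7, y7 = final_stage(1 << 14, 0, +1, SHIFTS)
print(x7, y7)          # 16384 50, and 50 / 16384 equals the example phase
```

Python's `>>` on integers is an arithmetic shift, so negative operands behave like sign-extended hardware registers.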
From what has been discussed above, the major steps of our algorithm are as follows.
Step 1.
The phase rotation is limited to the range $[-\pi/4, \pi/4]$.
Step 2.
The traditional or differential CORDIC algorithm implements part of the phase rotation.
Step 3.
Using a relatively simple prediction scheme, we divide the original CORDIC rotations into a lower part and a higher part.
Step 4.
The differential CORDIC or traditional architecture is used to compute the rotation direction. The lower part is computed by continuous accumulation or by the online architecture [18] based on differential CORDIC, and the higher part is predicted by rotation phase estimation.
Step 5.
According to the phase mapping relationship, the required high-precision, high-speed sine and cosine data are produced.
4.3. Simulation Results
Table 1 compares the delay of several CORDIC rotation methods. The proposed algorithm has the lowest delay at each word length listed.
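Table 1: Comparison of delay.

| Algorithm | Delay, N=16 | Delay, N=24 | Delay, N=32 |
|---|---|---|---|
| Traditional pipeline structure | $64T_{FA}$ | $110T_{FA}$ | $160T_{FA}$ |
| Hybrid CORDIC algorithm | $43T_{FA}$ | $72T_{FA}$ | $96T_{FA}$ |
| Ours | $26T_{FA}$ | $37T_{FA}$ | $72T_{FA}$ |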
To compare our pipelined CORDIC algorithm fairly with previously proposed methods, we assume that carry-save adders (CSA) are the general adder in all algorithms and that fast carry-propagate adders (CPA) are used in the last stage to convert carry-save results back to the format of the initial input phase value.
In [13], the first $m$ iterations use the traditional sequential comparison method, the same as the traditional CORDIC, and their delay increases logarithmically with the maximum number of shifts. If the delay of a carry-propagate adder (CPA) is $\lceil \log_2 N \rceil \cdot T_{FA}$, the latency of the remaining $(N-m)$ iterations increases linearly with the word length, and the delay is $(4N/3) \cdot T_{FA}$.
By the same calculation, the traditional CORDIC based on a pipeline architecture has a delay of $\lceil \log_2 N \rceil \cdot N \cdot T_{FA}$.
Unlike the above methods, our proposed method reduces the number of iterations and simplifies the $z$ datapath. The first iterations still adopt the traditional CORDIC algorithm, for which a delay of $\lceil \log_2 N \rceil \cdot T_{FA}$ is assumed for an $N$-bit CPA. The accumulations of the final iteration use the repetitive multiple accumulations architecture [19], which has much higher throughput and less delay than a serial accumulator or a pipelined adder based on carry-save addition. The delay of the last iteration increases linearly and is $(4K/3) \cdot T_{FA}$, where $K$ here denotes the number of full adders used for the accumulations in the adder-tree architecture.
According to the structure shown in Figure 2, the traditional pipeline structure and the efficient pipeline structure based on rotation phase estimation are both implemented in Verilog. The hardware platform is a Cyclone II series EP2C8Q208C8 chip, and the software platform is Altera Quartus II. Modelsim 10.0 is used to verify the experimental results. First, the input frequency control word, phase control word, and clock frequency are set to 16′h1999, 16′d0, and 100 MHz, respectively, which gives an output frequency of 10 MHz. The resource use of the algorithms is compared in Table 2.
Table 2: Algorithm resource use comparison.

| Algorithm | Logic unit | Register | Storage size |
|---|---|---|---|
| Traditional pipeline structure | 1177 | 754 | 63 |
| Hybrid CORDIC algorithm | 1034 | 576 | 26 |
| Ours | 819 | 393 | 26 |
The comparison in Table 2 shows that our proposed algorithm uses noticeably fewer resources: 819 logic units and 393 registers, against 1177 and 754 for the traditional pipeline structure.
The precision of this algorithm is the same as that of the traditional CORDIC algorithm, $\Delta\hat\theta_{\min} = 2\pi/2^N$. The input frequency control word and phase control word are set to 16′h00B6 and 16′d0, respectively. The output frequency is 0.3125 MHz. The error between the theoretical and experimental values is shown in Figures 4 and 5. Figure 6 shows that the simulation runtime of our proposed algorithm is shorter than that of the traditional CORDIC algorithm.
Figure 4: Sine and cosine error statistics of traditional pipeline structure. Panels: sine value error; cosine value error.
Figure 5: Sine and cosine error statistics of ours. Panels: sine value error; cosine value error.
Figure 6: Runtime comparison of the algorithms.
Comparing Figures 4 and 5, our proposed algorithm shows larger error fluctuation, but the errors of both algorithms stay within $(-5\times10^{-4}, 5\times10^{-4})$.
Although our algorithm structure reduces the number of logic units, it still guarantees the accuracy of the cosine data. Figure 7 shows the NCO simulation waveform of the efficient pipeline structure.
Figure 7: NCO simulation waveform.
It is necessary to determine the effective bits of the phase, the optimum iteration number, and the data width. We repeat the above experiment 200 times with a random angle between 0 and 45°. With iteration numbers from 5 to 8 and data widths of 15, 16, 18, and 21, we obtain the effective bits. The relationship of the effective bit number with the iteration number and data width is shown in Table 3. The data are given in degrees.
Table 3: Relationship of effective bit number with iteration number and data width.

| Data width (iteration number) | 14 (5) | 16 (6) | 18 (7) | 21 (8) |
|---|---|---|---|---|
| Estimated value | 22.137 | 22.137 | 22.137 | 22.137 |
| Simulated value | 23.672 | 22.458 | 22.281 | 22.132 |
The algorithm error stays within $(-5\times10^{-4}, 5\times10^{-4})$ when the iteration number is greater than 6. The experimental results show that the effective bit number is 13. From the minimum number of microrotations, the effective bit number is generally seven greater than the iteration number. The total quantization error can be calculated by this method.
5. Conclusion
In this paper, a hybrid CORDIC algorithm based on phase rotation estimation is proposed for NCO design. While assuring high-precision output, the efficient CORDIC algorithm eliminates more than half of the rotation direction judgments and shift operations. Its resource consumption, operation speed, and system delay are considerably better than those of the traditional CORDIC algorithm. It is of practical value in electronic countermeasures. The algorithm has been successfully used in a high-speed broadband ADS-B receiver, where it shows good performance.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (Grant no. 61172159) and the Fundamental Research Funds for the Central Universities (HEUCFT1101).
References
[1] L. Guo, S. Tian, Z. Wang, and J. Luo, "Study of NCO realization in parallel digital down conversion," vol. 33, no. 5, pp. 998–1004, 2012.
[2] Q. Zhang, Y. Luo, S. Chen, and J. Yan, "Design and implementation of NCO based on phase rotation," vol. 32, no. 5, pp. 908–911, 2010. doi:10.3969/j.issn.1001-506X.2010.05.007
[3] X.-N. Yang, Y.-C. Lou, and J.-L. Xu, Beijing Institute of Technology Press, Beijing, China, 2010.
[4] W.-B. Qin, L.-Y. Luo, and T.-Y. Li, "Study on the efficient technology applied to high precision and high resolution storage in high speed NCO," vol. 39, no. 1, pp. 156–159, 2007.
[5] J. Volder, "The CORDIC trigonometric computing technique," vol. 8, no. 3, pp. 330–334, 1959. doi:10.1109/TEC.1959.5222693
[6] J. Walther, "A unified algorithm for elementary functions," in Proceedings of the Spring Joint Computer Conference, vol. 38, pp. 379–385, 1971.
[7] S.-Q. Wan, W.-F. Chen, S.-R. Huang, H. Ji, and Z. Yu, "Implementation of a high-speed direct digital frequency synthesizer based on improved CORDIC algorithm," vol. 31, no. 11, pp. 2586–2591, 2010.
[8] S. Y. Park and Y. J. Yu, "Fixed-point analysis and parameter selections of MSR-CORDIC with applications to FFT designs," vol. 60, no. 12, pp. 6245–6256, 2012. doi:10.1109/TSP.2012.2214218
[9] S. Aggarwal, P. K. Meher, and K. Khare, "Scale-free hyperbolic CORDIC processor and its application to waveform generation," vol. 60, no. 2, pp. 314–326, 2013. doi:10.1109/TCSI.2012.2215778
[10] H. Huang and L. Xiao, "CORDIC based fast radix-2 DCT algorithm," vol. 20, no. 5, pp. 483–486, 2013. doi:10.1109/LSP.2013.2252616
[11] T. Juang, "Low latency angle recoding methods for the higher bit-width parallel CORDIC rotator implementations," vol. 55, no. 11, pp. 1139–1143, 2008. doi:10.1109/TCSII.2008.2002566
[12] P. K. Meher and S. Y. Park, "CORDIC designs for fixed angle of rotation," vol. 21, no. 2, pp. 217–228, 2013. doi:10.1109/TVLSI.2012.2187080
[13] S. Wang, V. Piuri, and E. E. Swartzlander Jr., "Hybrid CORDIC algorithms," vol. 46, no. 11, pp. 1202–1207, 1997. doi:10.1109/12.644295
[14] S.-F. Hsiao, Y.-H. Hu, and T.-B. Juang, "A memory-efficient and high-speed sine/cosine generator based on parallel CORDIC rotations," vol. 11, no. 2, pp. 152–155, 2004. doi:10.1109/LSP.2003.821705
[15] X. Zhang, R. Xin, Q. Wang, and H. Li, "Design of direct digital frequency synthesizer based on improved hybrid CORDIC algorithm," vol. 36, no. 6, pp. 1144–1148, 2008.
[16] Y. H. Hu, "The quantization effects of the CORDIC algorithm," vol. 40, no. 4, pp. 834–844, 1992. doi:10.1109/78.127956
[17] H. Dawid and H. Meyr, "The differential CORDIC algorithm: constant scale factor redundant implementation without correcting iterations," vol. 45, no. 3, pp. 307–318, 1996. doi:10.1109/12.485569
[18] M. D. Ercegovac and T. Lang, "Redundant and on-line CORDIC: application to matrix triangularization and SVD," vol. 39, no. 6, pp. 725–740, 1990. doi:10.1109/12.53594
[19] P. K. Meher, "New approach to scalable parallel and pipelined realization of repetitive multiple accumulations," vol. 55, no. 9, pp. 902–906, 2008. doi:10.1109/TCSII.2008.924376