A ROM-Less Direct Digital Frequency Synthesizer Based on Hybrid Polynomial Approximation

In this paper, a novel design approach for a phase-to-sinusoid amplitude converter (PSAC) is investigated. Two segments are used to approximate the first sine quadrant. A first linear segment fits the region near the zero point, while a second fourth-order polynomial segment approximates the rest of the sine curve. The phase sample at which the polynomial changes was chosen so as to achieve the maximum spurious-free dynamic range (SFDR). The proposed direct digital frequency synthesizer (DDFS) was coded in VHDL, and gate-level simulation was carried out. The synthesized architecture exhibits an SFDR of 90 dBc. The structure is expected to offer a noticeable reduction in hardware resources and power consumption, as well as high clock speeds.


Introduction
Recent applications of digital communication impose stringent specifications on frequency synthesizers, including very fine frequency increments, low spur levels, fast switching speed, and efficient power usage. Among frequency synthesizer types, direct digital frequency synthesizers (DDFS) offer the greatest flexibility in satisfying these rapidly growing needs.
A classical DDFS architecture, introduced by Tierney et al. [1] and shown in Figure 1, is no longer used. Extensive research efforts, during the last four decades, have led to major modifications in Tierney's architecture, even to the extent of introducing alternative architectures that no longer employ the concept of a lookup table (LUT). The aim is to imitate the proficiency of ROM-based DDFS in terms of signal integrity, using the lowest computational cost.
One of the most interesting concepts that have been explored is based on polynomial approximation, in which the sine amplitude is computed by arithmetic operations. The additional circuits that are needed occupy a large amount of die area, consume significant power, and slow down the whole system. To circumvent this problem, segmentation, with an adequate polynomial approximation for each segment interval, appears to be an attractive solution. In this case, an appropriate polynomial can be fitted individually to each sine curve segment. Therefore, in this paper, we propose a DDFS architecture based on a hybrid polynomial approximation. It is expected that this combination can precisely approximate the sine curve with only minimal segmentation, thereby achieving excellent spectral purity with less circuitry while working at high clock rates with low power consumption.
This paper is structured as follows. Section 2 describes the proposed DDFS algorithm. Efficient polynomial arrangements are discussed in Section 3. Polynomial coefficient digitization is analyzed in Section 4. The DDFS architecture design, including the radian phase accumulator, constant coefficient multipliers, squarer, and variable coefficient multiplier architectures, is explained in detail in Section 5. Section 6 describes the gate-level simulation results, Section 7 presents the experimental findings, and Section 8 concludes the paper.

Proposed DDFS Algorithm
For a sine curve approximation, it is desirable to partition the first quadrant interval $[0, \pi/2]$ into small subintervals and then approximate each subinterval by a polynomial of a certain degree, rather than employing a single polynomial approximation. The more finely the interval is divided, the more accurate the resulting approximation will be, and the higher the polynomial order, the better the approximation that is achieved. However, increasing the number of subintervals enlarges the coefficient ROM, and raising the polynomial degree increases the number of polynomial coefficients, so extra arithmetic circuitry is required. This study exploits the nature of the sine curve, where the portion near the zero point can be represented by a single linear segment and the remaining part by another segment with a polynomial of adequate degree. As a result, the approximation employs the least number of segments and a lower polynomial order. The process should balance the improvement in spectral purity against the hardware complexity. Figure 2 shows this principle, where the phase sample $x_s$ (the point at which the polynomial type switches) can be anywhere on the phase input axis from zero to $\pi/2$. A fourth-order polynomial has been chosen to approximate the second segment, as a compromise between the accuracy of a high-degree polynomial approximation and digital implementation requirements. The approximated function can be expressed as follows:

$$f(x) = \begin{cases} a_{10}\,x, & 0 \le x \le x_s, \\ a_{24}x^4 + a_{23}x^3 + a_{22}x^2 + a_{21}x + a_{20}, & x_s < x \le \pi/2, \end{cases} \tag{1}$$

where $a_{ij}$ ($i = 1, 2$; $j = 0, 1, 2, 3, 4$) represents the coefficients of the polynomials and $x_s$ is the linear segment upper bound. The extreme points $x_s = 0$ and $x_s = \pi/2$ represent a single fourth-order and a single first-order polynomial approximation, respectively.
Next, we have to determine the value of the switching point $x_s$ that corresponds to the maximum SFDR. To do so, we first find the optimal sets of polynomial coefficients for certain points $x_s \in [0, \pi/2]$, and then, for each set of real-valued coefficients, we determine the corresponding SFDR level. For this purpose, we employed the MAPLE optimization package to apply the Minimum-Mean Square Error (MMSE) criterion, that is, to minimize the mean squared difference between $\sin(x)$ and the approximated function $f(x)$ over the first quadrant. The values of SFDR are found with the aid of MATLAB and are depicted in Figure 3. Next, we examine the behavior of the approximated function with respect to $x_s$. The aim is to determine the maximum achievable SFDR. From the plot, we observe that for $x_s = 0$, $f(x)$ reduces to a single fourth-order polynomial approximation, $f(x) = a_{24}x^4 + a_{23}x^3 + a_{22}x^2 + a_{21}x + a_{20}$, $0 \le x \le \pi/2$, and the approximated function has an SFDR of 83.75 dBc. Over the interval $0 \le x_s \le 7\pi/128$, the SFDR gradually increases until it reaches its maximum level of 91.244 dBc at $x_s = 7\pi/128$. This is in line with expectations: the value of $\sin(x)$ is almost equal to $x$ over this interval, and therefore the linear segment fits the sine curve precisely. Beyond this point, the SFDR decreases until it reaches its lowest level at $x_s = \pi/2$, where the approximated function becomes $f(x) = a_{10}x$, $0 \le x \le \pi/2$, which represents a single first-order polynomial approximation. Substituting the phase sample point $x_s = 7\pi/128$ in (1), the optimal set of polynomial coefficients is obtained and is presented in Table 1. Figure 4 shows the spectrum of $f(x)$ for this set of real-valued coefficients. The largest unwanted frequency component, which is also indicated, has an amplitude of −91.244 dB with respect to the target sinusoid. Figure 5 shows the residual error of the approximated sinusoidal wave. The maximum absolute error (MAE) is equal to $1.25 \times 10^{-4}$ ($0.000125 < 2^{-12}$).
To show the contribution of the linear segment, the residual error of the approximated sine curve based on a one-segment fourth-order approximation is also shown (dashed red line), with an MAE of $2.2 \times 10^{-4}$. The residual error is clearly lower for the same polynomial approximation when it is combined with the linear segment.
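The fitting step described above can be sketched numerically as follows. This is a minimal illustration, not the MAPLE procedure used in this work: it replaces the continuous MMSE optimization with a discrete least-squares fit on a uniform phase grid, and the function names, the grid size, and the variable `x_s` for the switching point are assumptions introduced here. The coefficients and errors it produces are therefore not expected to reproduce Table 1 or Figure 5 exactly.

```python
import math

def solve(a, b):
    """Solve a small dense system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def phase_grid(samples):
    """Uniform midpoint grid over the first quadrant (assumed sampling scheme)."""
    return [(i + 0.5) * (math.pi / 2) / samples for i in range(samples)]

def fit_hybrid(x_s, samples=4096):
    """Least-squares fit of a10*x on [0, x_s] and a quartic on (x_s, pi/2].

    Requires 0 < x_s < pi/2 so that both segments contain grid points.
    """
    grid = phase_grid(samples)
    lin = [x for x in grid if x <= x_s]
    quart = [x for x in grid if x > x_s]
    # One-parameter least squares for the linear segment through the origin.
    a10 = sum(x * math.sin(x) for x in lin) / sum(x * x for x in lin)
    # Normal equations for the quartic segment, coefficients [a20 .. a24].
    normal = [[sum(x ** (i + j) for x in quart) for j in range(5)] for i in range(5)]
    rhs = [sum(x ** i * math.sin(x) for x in quart) for i in range(5)]
    return a10, solve(normal, rhs)

def hybrid_sine(x, x_s, a10, a2):
    """Evaluate the two-segment approximation at phase x (radians)."""
    if x <= x_s:
        return a10 * x
    return a2[0] + x * (a2[1] + x * (a2[2] + x * (a2[3] + x * a2[4])))

def max_abs_error(x_s, a10, a2, samples=4096):
    return max(abs(math.sin(x) - hybrid_sine(x, x_s, a10, a2))
               for x in phase_grid(samples))

X_S = 7 * math.pi / 128          # switching point reported in the text
A10, A2 = fit_hybrid(X_S)
ERR = max_abs_error(X_S, A10, A2)
```

Sweeping `X_S` over several candidate values and recording the error (or the SFDR of the resulting waveform) mirrors the search described above; a constrained or weighted fit would give different coefficients.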

Efficient Polynomial Arrangement
Before quantizing the polynomial coefficients, we have to simplify the approximated function. The aim is to produce the targeted sine output with a minimum of arithmetic evaluation. The hard part is the fourth-order polynomial evaluation, which requires more careful handling. Many arrangements have been proposed to simplify the computation of high-order polynomials [9, 10]. Most of these embody the Horner form. Forming a polynomial in the Horner arrangement results in efficient computation, and for this reason, the Horner arrangement has been extensively used in high-order polynomial-based sine approximations. Thus, the usual way to simplify this type of polynomial is to use a nested multiplication algorithm (NMA) as follows:

$$f(x) = a_{20} + x\,(a_{21} + x\,(a_{22} + x\,(a_{23} + a_{24}\,x))). \tag{4}$$

Evaluation of such a polynomial requires four multipliers, three of which have variable operands that cannot be simplified. One of the most interesting techniques for simplifying the evaluation of the variable coefficient multiplier is to replace it with its squarer counterpart. Following this idea, [9] employed the arrangement (5), in which the implementation requirement was reduced to two squarers and two constant multipliers. Accordingly, the hardware complexity was significantly reduced. Now, in our design and for further simplification, we choose to remain close to the ideas of replacing the variable multiplier and using the nested form. To do so, we apply the arrangement presented in [14]. Press et al. [14] stated that a polynomial of order $n$ greater than 3 can be evaluated using fewer than $n$ multiplications if some auxiliary coefficients are computed in advance, usually with the penalty of an extra addition. Following [14], the quartic polynomial

$$f(x) = a_{24}x^4 + a_{23}x^3 + a_{22}x^2 + a_{21}x + a_{20}$$

can be evaluated using three multiplications and five additions as follows:

$$f(x) = \left[(c_1 x + c_2)^2 + c_1 x + c_3\right]\left[(c_1 x + c_2)^2 + c_4\right] + c_5, \tag{6}$$

where $c_i$, $i = 1, 2, 3, 4, 5$, represents the new polynomial coefficients. Using this arrangement, we can easily deduce the new coefficients of (6) from the standard arrangement and Table 1. The resulting values are reported in Table 2. According to (6), the implementation cost is reduced to one squarer, one constant multiplier, and one variable operand multiplier. Table 3 summarizes the required arithmetic circuits for the proposed arrangement in comparison to the prior reported arrangements.

Table 3: Required arithmetic circuits.
Arrangement | Equation | Multipliers | Adders
[9] | (5) | 4 (2 fixed coefficients and 2 squarers) | 4
Proposed rule | (6) | 3 (1 fixed coefficient, 1 squarer, and 1 variable coefficient) | 5
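The three-multiplication evaluation can be checked against the nested (Horner) form with a short sketch. The conversion formulas below are the ones given by Press et al. for a quartic with a positive leading coefficient; the names `A` to `E` follow that reference, and how they map onto the coefficient symbols of Table 2 is an assumption. The numeric coefficients are illustrative placeholders, not the values of Table 1.

```python
def horner(a, x):
    """Nested evaluation of a0 + a1*x + ... + a4*x**4 (four multiplications)."""
    a0, a1, a2, a3, a4 = a
    return a0 + x * (a1 + x * (a2 + x * (a3 + x * a4)))

def press_coefficients(a):
    """Precompute the five auxiliary constants; requires a4 > 0."""
    a0, a1, a2, a3, a4 = a
    if a4 <= 0:
        raise ValueError("this arrangement needs a positive leading coefficient")
    A = a4 ** 0.25
    B = (a3 - A ** 3) / (4 * A ** 3)
    D = 3 * B ** 2 + 8 * B ** 3 + (a1 * A - 2 * a2 * B) / A ** 2
    C = a2 / A ** 2 - 2 * B - 6 * B ** 2 - D
    E = a0 - B ** 4 - B ** 2 * (C + D) - C * D
    return A, B, C, D, E

def press_eval(coeffs, x):
    """Evaluate the quartic with three multiplications and five additions."""
    A, B, C, D, E = coeffs
    t = A * x                            # constant-coefficient multiplier
    s = (t + B) * (t + B)                # squarer
    return (s + t + C) * (s + D) + E     # variable-operand multiplier

# Illustrative quartic, listed as [a0, a1, a2, a3, a4]; not taken from Table 1.
DEMO = [0.0, 1.0, 0.02, -0.2, 0.03]
DEMO_AUX = press_coefficients(DEMO)
```

For any phase value `x`, `press_eval(DEMO_AUX, x)` and `horner(DEMO, x)` agree up to floating-point rounding, which confirms that the rearrangement changes only the operation count, not the polynomial.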

Polynomial Coefficients Digitization
To complete the design, in the following we quantize the optimal real-valued coefficients (detailed in Table 2) as well as the coefficient of the linear segment, $a_{10}$, presented in Table 1.
Reducing the coefficient word length is highly desirable: the lower the coefficient's word length, the lower the hardware computational cost. In contrast, excessive quantization may further decrease the SFDR level. The design has to balance circuit complexity against quantization accuracy [15].
To satisfy the targeted SFDR level, each coefficient detailed in Table 2 is quantized with sufficient finite precision as follows:

$$\hat{c} = \frac{\lfloor c \cdot 2^{W} + 0.5 \rfloor}{2^{W}},$$

where $\lfloor \cdot \rfloor$ denotes the floor function, $W$ is the coefficient word length, and the term 0.5 ensures that the halfway values ($2^{-(W+1)}$) are rounded up. The resulting coefficients are shown in Table 4. The coefficients $c_1$, $c_2$, $c_3$, and $c_4$ of (6) are quantized to 13 bits, and the coefficient $c_5$ is quantized to 14 bits. The coefficient $a_{10}$ is quantized to 10 bits. The phase boundary value ($\pi/2$) is quantized to 13 bits.
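The rounding rule can be illustrated with a few lines of code. This is a sketch under the assumption that the word length counts fractional bits; the function names are introduced here for illustration. The three inputs are constants quoted elsewhere in the text (the linear coefficient 0.9970862, the coefficient 0.419941, and $\pi/2$).

```python
import math

def quantize_to_int(c, frac_bits):
    """Round-half-up quantization: floor(c * 2**frac_bits + 0.5)."""
    return math.floor(c * (1 << frac_bits) + 0.5)

def quantize(c, frac_bits):
    """Quantized value expressed again as a real number."""
    return quantize_to_int(c, frac_bits) / (1 << frac_bits)

A10_Q = quantize_to_int(0.9970862, 10)       # numerator over 2**10
C1_Q = quantize_to_int(0.419941, 13)         # numerator over 2**13
HALF_PI_Q = quantize_to_int(math.pi / 2, 11) # numerator over 2**11
```

With these inputs the integer numerators come out as 1021, 3440, and 3217, matching the fractions $1021/2^{10}$, $3440/2^{13}$, and $3217/2^{11}$ used for the constant multipliers.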
Next, we have to analyse the spurs level in the presence of quantization error. Figure 6 shows the resulting spectrum where the largest unwanted frequency component has an amplitude of −90.58 dBc, and for comparison purposes, the spurs distributions of the digitized and nondigitized coefficients have been depicted together in Figure 7.
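For readers who want to reproduce this kind of spur measurement, the SFDR of a sampled waveform can be estimated from its discrete Fourier transform. The sketch below is not the MATLAB script used in this work; it assumes coherent sampling (an integer number of carrier cycles in the record), applies no window, and uses a direct DFT so that it needs only the standard library. The function name is hypothetical.

```python
import cmath
import math

def sfdr_dbc(samples):
    """SFDR in dBc: strongest bin versus the next strongest, DC excluded."""
    n = len(samples)
    mags = []
    for k in range(1, n // 2):           # bins between DC and Nyquist
        acc = sum(samples[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        mags.append(abs(acc))
    carrier = max(mags)
    carrier_idx = mags.index(carrier)
    spur = max(m for j, m in enumerate(mags) if j != carrier_idx)
    return 20 * math.log10(carrier / spur)

# Synthetic check: a unit sine plus one spur at 1/1000 of its amplitude.
N = 256
DEMO_SIGNAL = [math.sin(2 * math.pi * 7 * i / N)
               + 1e-3 * math.sin(2 * math.pi * 21 * i / N) for i in range(N)]
DEMO_SFDR = sfdr_dbc(DEMO_SIGNAL)
```

The synthetic signal has a spur 60 dB below the carrier, so the function should report approximately 60 dBc for it. A direct DFT costs $O(n^2)$ operations; for long records an FFT would be the practical choice.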

DDFS Architecture Design
The schematic of the sine generator based on the hybrid polynomial approximation is shown in Figure 8. The proposed architecture is composed of mainly two parts, the linear part (the shaded rectangle), which has one constant multiplier, and the fourth-order polynomial part, which has one constant multiplier, one squarer, one variable coefficient multiplier, and five adders.
For proper operation, only one part must be selected at a time. For this purpose, a multiplexer is used to switch between the two polynomials at a specific phase sample. Furthermore, the architecture employs the normalized phase accumulator (PA) in conjunction with a simple constant coefficient multiplier instead of the complex modulo-$\pi/2$ phase accumulator. The issue of choosing the appropriate phase accumulator is discussed further in the following section.

Radian Phase Accumulator.
In the LUT-based phase-to-sine amplitude converter (PSAC), the phase accumulator is a simple $N$-bit binary adder followed by a clocked phase register. At each clock cycle, the phase accumulator updates its phase register with a new phase sample by incrementing the previous output by the frequency instruction word (FIW), a unique, predefined binary number for a given output frequency. An overflow occurs whenever the sum of the adder operands exceeds its capacity ($2^{N} - 1$), which coincides with one period of the synthesized waveform. The resulting phase sample is then used for indexing the sine-amplitude lookup table (LUT). However, for our case, this phase accumulator architecture is not directly applicable, and a more complicated modification is required. The contents of the phase register represent the input phase $[0, \pi/2]$, normalized by $\pi/2$ to the interval $[0, 1]$, whereas the 4th-order approximation defined in the above equations needs to evaluate the sine amplitude sample from the actual input phase sample measured in radians. Therefore, the normalized phase sample needs to be mapped into an equivalent radian value in the interval $[0, \pi/2]$ before being applied to the input of the sine generator.
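The behaviour of the standard accumulator can be sketched in a few lines. This is a behavioural model only, with hypothetical function names; the 15-bit width used in the example is an assumption, chosen because it is consistent with the frequency ratio of 0.275 quoted for FIW = 9012 in the simulation results.

```python
def phase_accumulator(fiw, n_bits, n_samples, start=0):
    """Return successive phase register values of an n_bits-wide accumulator."""
    mask = (1 << n_bits) - 1
    phase = start
    out = []
    for _ in range(n_samples):
        out.append(phase)
        phase = (phase + fiw) & mask     # overflow wraps, marking a new period
    return out

def output_to_clock_ratio(fiw, n_bits):
    """Synthesized frequency divided by the clock frequency."""
    return fiw / (1 << n_bits)

DEMO_PHASES = phase_accumulator(9012, 15, 5)
DEMO_RATIO = output_to_clock_ratio(9012, 15)
```

In this example the fourth increment overflows the 15-bit register, so the fifth phase value wraps around to a small number, and the frequency ratio evaluates to about 0.275.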
To the best of our knowledge, two approaches have been proposed to convert the normalized phase accumulator to its radian counterpart. The first approach uses modulo-$\pi/2$ arithmetic. In this case, the $N$-bit phase accumulator needs to be truncated at the nearest integer to $(\pi/2) \cdot 2^{N-2} = \pi \cdot 2^{N-3}$, and to generate the second quadrant, the common simple negation circuit has to be replaced by a $(\pi \cdot 2^{N-3} - \phi)$ two's complement adder and multiplexer, where $\phi = n \times \mathrm{FIW}$ represents the instantaneous accumulated phase sample. An extra gate for controlling the sine symmetry is also required. This technique was used in [4, 9, 10] but suffers from amplitude mismatch between successive quadrants at the extreme points $(0, \pi/2)$. In other words, since the step increment FIW can be any quantity in the range $[1, 2^{N-1}]$, the truncated point will be $\pi \cdot 2^{N-3} - (\mathrm{FIW} \bmod \pi \cdot 2^{N-3})$, with $0 \le (\mathrm{FIW} \bmod \pi \cdot 2^{N-3}) < \mathrm{FIW}$. The next quadrant (as designed) then starts from $\pi \cdot 2^{N-3}$ rather than from $\pi \cdot 2^{N-3} - (\mathrm{FIW} \bmod \pi \cdot 2^{N-3})$, and, therefore, an amplitude discontinuity of $(\mathrm{FIW} \bmod \pi \cdot 2^{N-3})$ occurs.
The second approach uses the normalized phase accumulator and multiplies its phase output by a quantized $\pi/2$ constant coefficient. A similar hardwired multiplier has been used in CORDIC-based DDS architectures as a radian converter, as presented in [3].
By using the standard PA, the algorithm is free from the amplitude discontinuity and the controlling of quadrant symmetry is quite simple. At first glance, the alternative solution may seem slightly costly due to the added constant coefficient multiplier, but in the following section, we show that this can be performed with a simple arrangement, which reduces its required hardware resources significantly.
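A behavioural sketch of this second approach is given below. The quadrant handling shown (the two most significant bits select the quadrant, odd quadrants mirror the phase, and the second half-period negates the output) is the usual quarter-wave symmetry scheme and is an assumption here, since the control logic is not spelled out in the text. `math.sin` stands in for the polynomial sine generator, a 15-bit accumulator is assumed, and the radian constant is the quantized value $3217/2^{11}$ quoted for the radian multiplier.

```python
import math

HALF_PI_Q = 3217 / 2 ** 11    # quantized pi/2 used by the radian multiplier

def phase_to_sine(phase, n_bits=15, quarter_sine=math.sin):
    """Map a normalized accumulator phase to a sine sample via one quadrant."""
    frac_bits = n_bits - 2
    quadrant = phase >> frac_bits                 # 0..3
    low = phase & ((1 << frac_bits) - 1)          # position inside the quadrant
    if quadrant & 1:                              # quadrants 1 and 3: mirror
        low = (1 << frac_bits) - low
    u = low / (1 << frac_bits)                    # normalized phase in [0, 1]
    x = u * HALF_PI_Q                             # radians in [0, ~pi/2]
    s = quarter_sine(x)                           # first-quadrant generator
    return -s if quadrant & 2 else s              # quadrants 2 and 3: negate
```

Because the accumulator itself stays a plain binary counter, the quadrant boundaries always fall on exact phase values, which is the property that avoids the discontinuity of the modulo-$\pi/2$ scheme. The only error introduced by the mapping is the quantization of $\pi/2$, a few parts per million.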

Constant Coefficient Multipliers.
The complexity of the architecture, as is seen in Figure 8, is heavily dominated by the complexity of the squarer and variable multiplier circuits. The other constant coefficient multipliers are rather simple and they can be significantly simplified as follows.
For the linear part, which has one constant multiplier, we can apply the concept introduced by [13], in which the coefficient is expressed using the following canonic signed digit (CSD) representation:

$$c = \sum_{k=1}^{K} s_k \, 2^{p_k},$$

where $s_k \in \{-1, +1\}$, $p_k \in \mathbb{Z}$, $\mathbb{Z}$ denotes the set of all integers, and $K$ is a fixed number, which has to be as small as possible for efficient realization. In this way, the digital multiplier can be realized by summing hardwired shifted versions of the phase sample $x$. Following this concept, with 15-bit phase resolution, the coefficient $a_{10}$ can be approximated by nine nonzero digits: $0.9970862 \approx 1021/2^{10} = 2^{-1} + 2^{-2} + 2^{-3} + 2^{-4} + 2^{-5} + 2^{-6} + 2^{-7} + 2^{-8} + 2^{-10}$. The implementation of such a multiplier requires nine partial products, which need further simplification to be practical. Applying the analogy of Booth's encoding reduces the partial products substantially. With Booth's encoding, a binary number is represented as the difference of two binary numbers, which is efficient in hardware as long as the binary string contains long runs of consecutive ones. Following this concept, the coefficient $a_{10}$ can be expressed by only three nonzero digits: $0.9970862 \approx 1021/2^{10} = 1 - (2^{-9} + 2^{-10})$. The error is acceptable ($0.000016 < 2^{-15}$). In this case, the multiplier can be replaced by a simple two-operand adder, as shown in Figure 9(a). It should be noted that the required hardwired right shifts do not involve any logic gates. Furthermore, the range of the phase is limited to $7\pi/128$. Therefore, we can reduce the adder word length to 11 bits when the phase boundary value ($\pi/2$) is quantized to 14 bits. The same procedure can be applied to the coefficient $c_1$: $0.419941 \approx 3440/2^{13} = 2^{-2} + 2^{-3} + 2^{-5} + 2^{-7} + 2^{-8} + 2^{-9}$, which has six partial products; with Booth's encoding, the partial products are reduced to four, $2^{-1} - (2^{-4} + 2^{-6} + 2^{-9})$, as depicted in Figure 9(b). The resulting error is satisfactory ($0.000019125 < 2^{-15}$). Following the same procedure, the radian multiplier with a constant coefficient of $3217/2^{11} = 1.57080078125$ can be expressed by five nonzero digits, $1 + 2^{-1} + 2^{-4} + 2^{-7} + 2^{-11}$, as shown in Figure 9(c). The error due to the quantization process is small ($0.00000445 < 2^{-17}$), which is more than sufficient. Even though the constant multipliers are apparently in their simplest form, a further improvement can be achieved. The architecture displayed in Figure 8 can be further simplified by merging the cascaded multipliers in each path, as depicted in Figure 10.
Here, $c_{r} = c_{\mathrm{radian}} \times a_{10} = 3209/2^{11}$ is the merged multiplier of the radian and linear segment multipliers, and $c_{r1} = c_{\mathrm{radian}} \times c_1 = 5404/2^{13}$ is the merged multiplier of the radian and 4th-order segment multipliers. The restructured architecture has one multiplier fewer than the aforementioned arrangement.
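The signed-digit decompositions quoted in this section can be verified exactly, and the corresponding shift-and-add multiplication can be modelled, with a short sketch. The list-of-pairs encoding and the function names are assumptions made for illustration; they do not describe the hardware of Figure 9.

```python
from fractions import Fraction

def csd_value(terms):
    """Exact value of a signed-digit constant given as (sign, exponent) pairs."""
    return sum(Fraction(sign) * Fraction(2) ** exp for sign, exp in terms)

def shift_add_multiply(x, terms, frac_bits):
    """Multiply integer x by the constant using only shifts and add/subtract.

    The result is scaled by 2**frac_bits; frac_bits must cover the most
    negative exponent so that every shift amount is non-negative.
    """
    acc = 0
    for sign, exp in terms:
        shift = frac_bits + exp
        if shift < 0:
            raise ValueError("frac_bits too small for this constant")
        acc += sign * (x << shift)
    return acc

# Decompositions as quoted in the text.
A10_TERMS = [(+1, 0), (-1, -9), (-1, -10)]                    # 1021 / 2**10
C1_TERMS = [(+1, -1), (-1, -4), (-1, -6), (-1, -9)]           # 3440 / 2**13
RADIAN_TERMS = [(+1, 0), (+1, -1), (+1, -4), (+1, -7), (+1, -11)]  # 3217 / 2**11
```

Each term corresponds to one hardwired shift of the phase word, so the number of pairs in a list is the number of partial products that must be summed.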

Squarer Architecture.
Next, we need to turn our attention to the design of the squarer and the variable multipliers, which represent the main sources of computational cost. For our design, it is important to keep the internal data path word length as small as possible to accommodate the data input word length for the subsequent arithmetic operations.
As shown in Figure 8, a 13-bit fixed-width squarer has been used instead of the regular 26-bit full-length squarer to satisfy the word length restriction. The simplest way to achieve such a fixed-width squarer is to omit the less significant part of the partial product array (direct truncation), which saves significant area and power. With such truncation, however, a considerable portion of useful information is lost, resulting in high arithmetic error. Another type of truncation occurs at the squarer output (post truncation). This type offers the most accurate fixed-width squarer [16], since the full set of partial products is realized, but the required hardware occupies significant die area. Alternatively, one can reduce the partial products before applying post truncation. In this case, only a small portion of the partial products needs to be realized, resulting in an accurate fixed-width squarer with reasonable die area and power consumption. In designing the 13-bit squarer, following the above guideline, the popular folding technique based on the symmetry of the partial products matrix has been used in conjunction with a divide-and-conquer approach.
The two terms of the last equation have 7 common LSB zeros, which can be truncated to give (10). The first term in (10) represents the 7-bit primitive squarer, which can be heavily simplified by exploiting the symmetry of the partial products matrix. Figure 11 shows the reduced partial products matrix. The structure exhibits a partial products reduction of 50% in comparison with the standard multiplier. In an attempt to achieve an accurate fixed-width squarer, the 7 × 6 multiplier has been realized with its full partial products, which is also depicted in the same figure.
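The divide-and-conquer split can be illustrated with a short sketch. The equations for this step are not reproduced above, so the following is an assumption pieced together from the quantities the text does give (a 7-bit primitive squarer, a 7 × 6 multiplier, a 19-bit sum, and a 13-bit output): the 13-bit input is split into a 7-bit upper part and a 6-bit lower part, the square of the lower part is dropped, the common factor $2^{7}$ is removed, and the 19-bit sum is post-truncated to 13 bits. The function names are hypothetical.

```python
def split_operand(x):
    """Split a 13-bit operand into its 7-bit upper and 6-bit lower parts."""
    return x >> 6, x & 0x3F

def full_square_from_parts(x):
    """Exact identity: x**2 = xh**2 * 2**12 + xh*xl * 2**7 + xl**2."""
    xh, xl = split_operand(x)
    return (xh * xh << 12) + (xh * xl << 7) + xl * xl

def reduced_sum(x):
    """19-bit sum after dropping xl**2 and the 7 common LSB zeros."""
    xh, xl = split_operand(x)
    return (xh * xh << 5) + xh * xl      # 7-bit squarer term + 7x6 product

def fixed_width_square(x):
    """13-bit squarer output obtained by post-truncating the 19-bit sum."""
    return reduced_sum(x) >> 6

def reference_square(x):
    """Upper 13 bits of the exact 26-bit square, for comparison."""
    return (x * x) >> 13
```

Under these assumptions the dropped term is smaller than $2^{12}$, so the fixed-width result never differs from the upper 13 bits of the exact square by more than one least significant bit.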
To satisfy the 13-bit fixed-width squarer, the 19-bit adder input has to be truncated. As a consequence, part of the realized partial products needs to be truncated at the squarer output (post truncation). Accordingly, some die area and power consumption are wasted as the price of achieving high accuracy. One could instead apply a fixed-width multiplier, at the cost of high arithmetic error. Therefore, the architecture in Figure 11 has been adopted as a compromise between accuracy and computational cost. A block diagram of the resulting 13-bit squarer, with two pipelining stages, is presented in Figure 12.

Multiplier Architecture.
As seen in Figure 8, with the 15-bit amplitude resolution required at the final stage, the 28-bit full-length multiplier output should be reduced to a 14-bit word length. By exploiting the fixed-width property, one can simplify the multiplier architecture such that only the most significant product bits are generated. Dropping the less significant partial products causes a substantial arithmetic error that has to be compensated for. Many error compensation methods for fixed-width multipliers have been proposed [12, 16-18].
For our design, to implement the 14-bit multiplier, the fixed-width multiplier with the linear compensation function introduced in [18] is employed. To do so, we first partition the partial products matrix into MSP and LSP, where MSP and LSP are the most significant and the least significant parts, respectively. The LSP is then partitioned into LSP major and LSP minor subsets. Following [18], the LSP minor part is discarded, and an appropriate compensation function is then introduced to alleviate the impact of the dropped partial products.

Gate Simulation Results
To validate the proposed algorithm, we coded the pipelined version of the architecture, seen in Figure 13, in VHDL using Altera Quartus II 12.1 software. The design included the arithmetic blocks shown in Figure 10, among others.
Figure 11: The reduced partial products matrix.
Figure 14 shows the synthesized output waveform observed in ModelSim with FIW = 255, amplitude resolution = 15 bits, and clock = 125 MHz. The data stream was then imported into MATLAB to evaluate the spurious level. Figure 15 shows the output spectrum for an output-to-clock frequency ratio of 0.275, with FIW set to 9012. An SFDR of 90 dBc is achieved here as well.

Experimental Result and Comparison
The next step in verifying the designed DDFS is to program the targeted EP4SGX230KF40C2 device, as shown in Figure 16. Note that this Stratix IV GX FPGA device is part of the Altera DE4 development board. At this step, the Quartus II Programmer is used to configure the EP4SGX230KF40C2 device. The generated project file is uploaded to the FPGA platform, and the intended DDFS circuit is implemented in a physical FPGA chip. At this point, the functionality of the DDFS can be tested on a circuit board.
Instead of using an external logic analyzer, we used the Signal Tap II embedded logic analyzer (ELA) to observe the output waveform. The Signal Tap II ELA is a system-level debugging tool integrated with the Quartus II software and capable of monitoring real-time signal behavior in the FPGA design [19]. Figure 17 shows the output observed with the Signal Tap II ELA.
For further validation, the spurious level and the analog sine waveform can be observed experimentally using a Rohde & Schwarz FSIQ3 Signal Analyzer and an Agilent DSO3202A Digital Storage Oscilloscope, respectively.
An available 14-bit digital-to-analog converter, the DAC5672 from Texas Instruments, which is integrated on the Terasic AD/DA data conversion card, was used in this work [20]. This High Speed Mezzanine Card (HSMC) is an add-on to the FPGA host board (Altera DE4 Development Board), on which the targeted DDFS is implemented, and converts the digital sine data stream into an analogue waveform. Using the 14-bit unipolar DAC5672 requires two modifications to our architecture, as follows.
First, we have to modify the architecture to have a 14-bit amplitude resolution.
Second, we have to use the offset binary format instead of two's complement for the digital data output. The system under test (SUT) is shown in Figure 18.
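The format change in the second modification amounts to inverting the most significant bit of the two's complement word. The sketch below illustrates this for a 14-bit sample; the function name is hypothetical, and how the 15-bit output is reduced to 14 bits is not described here, so only the format conversion is shown.

```python
def twos_complement_to_offset_binary(sample, bits=14):
    """Convert a signed sample to offset binary by flipping the MSB.

    sample must lie in [-2**(bits-1), 2**(bits-1) - 1].
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= sample <= hi:
        raise ValueError("sample out of range for the given word length")
    word = sample & ((1 << bits) - 1)    # two's complement bit pattern
    return word ^ (1 << (bits - 1))      # invert the sign bit
```

The most negative input maps to the all-zeros code, zero maps to mid-scale, and the most positive input maps to the all-ones code, which is the ordering a unipolar converter input expects.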
The output spectrum for the DDFS is shown in Figure 19 for $f_{out}$ = 6.1 MHz and a clock frequency of 50 MHz; it shows a spurious component of −78.7 dBc, attributed to the 14-bit resolution of the DAC5672 used in this test. The analogue waveform is shown in Figure 20 for $f_{out}$ = 0.9804 MHz.
The characteristics of the proposed work are summarized in Table 5 and compared with previously published algorithms. The power required has been estimated by the PowerPlay Power Analyzer tool using a relative toggle rate of 25%.
As stated in the literature [13], it is difficult to achieve fair comparison between different DDFS circuits in terms of performances because of different implementation techniques, fabrication processes, frequency resolution, spurious level, and so on. One of the most interesting parameters that can aid in fair comparisons is normalized area. In the following, we introduce a simple method to obtain the approximate die area.
By using the migration compatibility features in Quartus II 12.1 software, one can migrate the current FPGA device to the compatible HardCopy IV ASIC device to find the equivalent 40-nm TSMC cells for the current FPGA logic utilization. In the same Quartus II project, the FPGA and a HardCopy companion device have been designed using the FPGA-first design flow.
We know that the HardCopy IV has a 0.9 V core voltage using the 40-nm TSMC process, and each H-cell has 24 transistors [21]. For our device EP4SGX230KF40C2, the compatible HardCopy device is HC4GX35FF1517, and after full compilation, we found that the current design can fit within the targeted HardCopy device utilizing 938 H-cells, so the total number of employed transistors is 938 × 24 = 22512.
According to TSMC 40-nm technology [22], the static RAM cell size for a 40-nm process node is 0.242 µm², and each SRAM cell has 6 transistors. Thus, the equivalent number of SRAM cells can be found by dividing the total number of employed transistors by 6. The number of SRAM cells = 22512/6 = 3752, and the total area = 3752 × 0.242 µm² ≈ 908 µm².
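The area estimate above is a chain of three simple conversions, which the following sketch reproduces so that it can be reapplied to other utilization figures. The function name and the default arguments are illustrative; the defaults are the constants quoted in the text.

```python
def estimate_area_um2(h_cells, transistors_per_hcell=24,
                      transistors_per_sram_cell=6, sram_cell_um2=0.242):
    """Approximate die area from an H-cell count via SRAM-cell equivalents."""
    transistors = h_cells * transistors_per_hcell
    sram_cells = transistors / transistors_per_sram_cell
    return transistors, sram_cells, sram_cells * sram_cell_um2

TRANSISTORS, SRAM_CELLS, AREA_UM2 = estimate_area_um2(938)
```

This is a normalization device rather than a layout-accurate figure: it treats every transistor as if it occupied the same area as one in a dense SRAM cell, so it is best used for relative comparisons between designs.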
As illustrated in Table 5, if we exclude the work of [10], the proposed work demonstrates the best performance in terms of area, power consumption, and speed. In comparison with the design presented in [10], the DDFS in this paper also exhibits low power consumption, about one twenty-fourth that of the architecture in [10], and a noticeable reduction in silicon area, to about one-fifth of that of the comparable design in [10], but it runs at 0.86 times the speed and shows an SFDR 5 dB lower than that architecture. In the best case, the design from [10] offers a slight improvement in speed and SFDR, while it consumes much more power and occupies a much larger die area.
However, in contrast, our design shows an excellent merit in almost all aspects.
To show the significance of the hybrid technique, a comparison with the work in [9] is helpful and important. The mentioned work used two segment fourth-order approximations, whereas this paper used one segment fourth-order and one segment first-order approximation. The comparison indicates that the proposed design exhibits the best performance in terms of area, switching speed, and power consumption while maintaining the same SFDR level of 90 dBc.

Conclusions
In this paper, we have presented a new DDFS architecture that uses a combination of two carefully chosen polynomials to approximate the first sine quadrant. An exhaustive search was conducted to determine the segment transition point that corresponds to the minimum approximation error. A simplified fourth-order polynomial architecture with low computational cost was introduced using only three multipliers. The squarer and the multiplier circuits were minimized, resulting in lower hardware implementation cost. The proposed DDFS was simulated at the gate level, and the synthesized sinusoid achieved a spurious-free dynamic range of 90 dBc. The design was compared with equivalent approaches in terms of computation, speed, and power consumption. The comparison shows significant improvement in all of these respects.