Power and Area Efficient Cascaded Effectless GDI Approximate Adder for Accelerating Multimedia Applications Using Deep Learning Model

Approximate computing is an emerging technique that accelerates processing with less computational effort while keeping acceptable accuracy in error-tolerant applications such as multimedia and deep learning. The inherent error tolerance of the deep learning process allows the designer to simplify the circuitry and to increase computation speed at the cost of result accuracy. The high computational complexity and low-power requirements of portable devices in the dark silicon era call for a suitable alternative to Complementary Metal Oxide Semiconductor (CMOS) technology. Gate Diffusion Input (GDI) logic is one of the promising alternatives to CMOS logic for reducing transistor count and enabling low-power design. In this work, a novel 1-bit GDI-based full swing Energy and Area efficient Full Adder (EAFA) with minimum error distance is proposed. The proposed architecture is constructed to mitigate the cascaded effect problem in GDI-based circuits. This is demonstrated by extending the proposed 1-bit GDI-based adder to different 16-bit Energy and Area Efficient High-Speed Error-Tolerant Adders (EAHSETA) segmented into accurate and inaccurate adder circuits. The proposed adder's design metrics in terms of delay, area, and power dissipation are verified through simulation using the Cadence tool. As a case study, the proposed logic is deployed to accelerate the convolution process in the Low-Weight Digit Detector neural network for real-time handwritten digit classification on an Intel Cyclone IV Field Programmable Gate Array (FPGA). The results confirm that the proposed EAHSETA occupies fewer logic elements and improves operation speed by a speed-up factor of 1.29 over other similar techniques while producing 95% classification accuracy.


Introduction
The basic building blocks of any computational system are arithmetic circuits. In recent years, resource-constrained, sensor-enabled, low-power devices such as mobile phones and embedded processors have been used in various real-time machine learning and deep learning applications. Generally, the human visual and auditory systems can tolerate some errors. Hence, for signal processing applications such as speech, audio, image, and video processing, exact computation is not necessarily required. Moreover, integrated circuits for multimedia applications, particularly image and video processing that uses deep learning, take a large circuit area and high power due to the size of the data and the computational complexity. Approximate computing is one of the most straightforward solutions to reduce circuit complexity and power consumption [1]. Therefore, this kind of inexact computing circuit is beneficial in low-power, resource-constrained devices, particularly for IoT and error-resilient applications such as multimedia and deep learning, where exact computation is not critical. In addition to area and power, delay is also reduced by using inexact computational circuits. Based on this idea, a variety of approximate (inexact) adders [2][3][4][5][6][7][8][9] and multiplier circuits [10][11][12][13][14][15] have been described in the literature for signal processing, image processing, and deep learning applications.
Carry-free arithmetic initiated with a modified XOR-based approximate adder is presented in [2], and it is applied in image processing. Approximate mirror adders [3] of various configurations and truncated adders [4] are developed and used for image compression applications to verify their performance. Other similar types of inexact adders, such as the Significance Approximation Error-Tolerant Carry Select Adder (SAET-CSLA) based on the Approximate Full Adder (AFA) [5], the Modified Full Adder (MFA)-based High-Speed Error-Tolerant Adder (HSETA) [6], and the MUX-based High-Performance Error-Tolerant Adder (HPETA) [7], are designed, and better performance is demonstrated with image blending applications. The Carry Truncate Adder (CTA) [8] is deployed in a Convolution Neural Network (CNN) structure for accelerating the computation of the softmax layer. The Generic Accuracy Configurable Adder (GeAr) is modified and implemented in the CNN-based Caffe network to accelerate the image to column conversion process [9].
Inexact adders are extended to form multipliers for various applications. Bioinspired Imprecise Computational (BIC) block adders and multipliers [10] are implemented for soft computing-based face recognition. Other types of multipliers, namely, the dynamic segmented multiplier [11], partial product perforated multiplier [12], compressor-based approximate multiplier [13], and truncation-based multiplier [14], are developed and evaluated for a variety of multimedia applications. The possible extent of accuracy loss in the multiplier of a neural network accelerator is analyzed with respect to energy consumption in [15].
Most of the works mentioned above are extended to 16-bit operation. Fixed-point 16-bit arithmetic suits multimedia and deep learning applications well, so optimized circuit structures for 16-bit arithmetic are of particular interest. Generally, inexact computation uses a segmented approach. A 16-bit adder for 16-bit inexact computation can be divided into two halves: an 8-bit Least Significant Bit (LSB) portion and an 8-bit Most Significant Bit (MSB) portion. As per the design convention, the LSB part, where the lower-weight information resides, is designed using approximate adder circuits, and the MSB part is designed using an accurate adder [5-7, 11, 16-19]. This segmented methodology minimizes the error and improves the circuit's performance in terms of area, power, and speed.
There is a limit to the area, power, and speed performance that deep learning systems can attain in the CMOS dark silicon era. Hence, a new technology that balances these trade-offs and achieves the required performance is the need of the hour [20].
One such promising technology, which we propose in this work to satisfy the requirements of real-time portable deep learning systems, is Gate Diffusion Input (GDI) logic. The GDI logic is popular because it produces full swing output, which reduces the power consumption in digital circuits [21]. Hence, Gate Diffusion Input-based logic cells are a trending and suitable alternative to CMOS-based cells, especially for low-power applications. Also, a GDI logic-based design reduces the number of transistors used in the circuits [22] compared to the conventional CMOS logic design. Using GDI logic, many functions, including logic gate circuits, can be realized. The primary GDI cell, which is shown in Figure 1, can perform six operations, four fundamental and two special functions, with only two MOSFET transistors, as listed in Table 1 [21].
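The six operations of the primary GDI cell can be illustrated with an idealized Boolean model. The sketch below is not taken from Table 1, which is not reproduced in this text; it assumes the commonly used terminal assignment in which the gate input G selects between the P-side input (G = 0) and the N-side input (G = 1), and it ignores voltage swing entirely. The function names are hypothetical.

```python
from itertools import product


def gdi_cell(g: int, p: int, n: int) -> int:
    """Idealized GDI cell (assumption): G=0 passes P, G=1 passes N."""
    return n if g else p


# Six configurations obtained by tying P and N to constants or signals.
def f1(a, b):      # A'B   : P = B, N = 0
    return gdi_cell(a, b, 0)

def f2(a, b):      # A' + B: P = 1, N = B
    return gdi_cell(a, 1, b)

def gdi_or(a, b):  # A + B : P = B, N = 1
    return gdi_cell(a, b, 1)

def gdi_and(a, b):  # AB    : P = 0, N = B
    return gdi_cell(a, 0, b)

def gdi_mux(a, b, c):  # A'B + AC: P = B, N = C
    return gdi_cell(a, b, c)

def gdi_not(a):    # A'    : P = 1, N = 0
    return gdi_cell(a, 1, 0)


# Check every configuration against its Boolean expression.
for a, b, c in product((0, 1), repeat=3):
    assert f1(a, b) == ((1 - a) & b)
    assert f2(a, b) == ((1 - a) | b)
    assert gdi_or(a, b) == (a | b)
    assert gdi_and(a, b) == (a & b)
    assert gdi_mux(a, b, c) == (c if a else b)
    assert gdi_not(a) == 1 - a
```

The model only captures logic values; the swing degradation discussed in the text is an electrical effect that a Boolean model cannot show.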
GDI cells can be used to construct full adder circuits. However, when multiple GDI cells are cascaded to produce the sum and carry, loss of full swing is inevitable. To overcome this, a few more MOS transistors must be added to restore the full swing [22], at the expense of area and power.
One-bit adders are the building blocks of the 16-bit High-Speed Error-Tolerant Adders. Plenty of configurations of GDI-based full adders (1-bit adders) for low-power computing are found in the literature [23][24][25][26]. A GDI-based full adder for inexact computing is presented in [27]. The performance of an inexact full adder depends on its erroneous outputs and error distance [5][6][7]. Designing an inexact circuit with minimal error and minimal error distance (ED) while using fewer resources is challenging, and it is what serves the purpose of inexact computation. Our research contributes a novel architecture that mitigates cascaded effects and addresses those challenges. Significant contributions in this work are as follows:
(v) The proposed EAHSETA logic is implemented in the CNN-based Light Weight Digit Detector (LWDD) for accelerating the handwritten digit classification application.
(vi) FPGA implementation of the proposed logic-based accelerator is done on an Intel Cyclone IV FPGA to verify the practicability of the proposed EAHSETA, and its performance as an accelerator is validated by comparing resource utilization, power, and speed of operation with other similar adders.
The rest of the manuscript is organized as follows. The methods and materials are given in Section 2. The design procedure of the proposed approximate adder circuits is detailed in Section 3. Experimental results and the performance comparison of the proposed design with other similar recent methods are presented in Section 4. Section 5 describes the proposed adder logic as the CNN accelerator. Section 6 concludes the paper.

Methods and Materials
There are numerous Approximate Full Adder circuits presented in the literature. Out of several inexact adders, the more recent and minimum error distance-based adders are considered for discussion and comparison. An inexact adder, namely, the Significance Approximate Error-Tolerant Adder (SAETA), is proposed in [5] to minimize the number of logic gates. That approximate adder circuit produces two errors in the sum output and no error in the carry output. The SAETA is used to design a 16-bit error-tolerant circuit that uses a common CSLA for the accurate part and the SAETA in the inaccurate part. The performance of the adder in terms of area, power, and delay is compared with that of common and other inexact adders. In addition, the designed adder was tested using image processing applications.
In [6], the authors proposed a 1-bit Modified Full Adder (MFA) with less error distance. The one-bit MFA is extended to design a 16-bit HSETA that uses a common CSLA for the higher-order 8 bits (i.e., the 8 MSBs) and the MFA for the lower-order 8 bits (i.e., the 8 LSBs). The presented design is justified and compared experimentally with standard and recent related works based on area, power, and delay performance. Though it gives better results than the others, its normalized error distance is higher.
Recently, to overcome the voltage swing problem of CMOS logic, the author of [27] presented the Modified Full Adder of [6] in GDI logic using 14-transistor and 12-transistor designs. It claims better area and power performance compared to a common adder at the cost of increased error distance. The circuit is simulated in the Cadence Design suite, and the logic is realized in FPGA.
A hybrid method has been deployed by combining multithreshold voltage transistor logic with GDI logic to overcome the full swing issue [28]. The presented full adder uses 14 transistors to produce accurate results, and it is extended to a 32-bit adder. Even though the adder's accuracy is much better, the area occupied by the design is still higher than that of other recent works.
In order to compare the results of our proposed full swing inexact adder with accurate adder results, the low-power full adder circuit presented in [22] with GDI logic is taken along with the AFA and MFA. The truth tables of these adders are listed in Table 2, and the injected errors are highlighted. In this work, primary gate circuits such as AND, Multiplexer, and OR using two different functions based on GDI logic are designed first and are shown in Figures 2-4, respectively. Two different full swing full adder circuits are then designed from these primary blocks with fewer transistors.

Proposed GDI-Based Adders
In this section, two proposed error-tolerant EAFA designs featuring GDI with full swing logic are discussed with the aim of reducing the circuit area and power and improving the speed of multibit addition. The proposed EAFA design minimizes the error distance with reduced circuit area (lower transistor count), power, and delay compared to the similar two-error designs presented in [5,6].
In Table 3, the Boolean expressions of the common and 1-bit Error-Tolerant Adders with two errors are listed. From the expressions, it is evident that the sum expressions of the common adder, AFA, and MFA use cascaded logic gates for the adder logic realization. Cascaded logic gates reduce the voltage swing level in a GDI logic implementation. This eventually requires a full swing implementation for a proper sum output, which in turn increases the implementation area through a higher transistor count.
Except for the MFA, the carry expressions of the existing adders also use cascaded logic gates, which leads to the aforesaid problem. Our proposed full swing EAFA adder carefully avoids cascaded logic: it uses the AND, OR, and Multiplexer (MUX) functionalities of the GDI logic cell to realize the sum and directly takes input A as the carry output. This design implements the same kind of two-error 1-bit adder with minimal transistor count, delay, and power compared to the others. Here, the Multiplexer plays a vital role in all of these reductions. EAFA Design 1 is implemented with 10 transistors using full swing AND and OR gates, as shown in Figure 5. EAFA Design 2 is deployed with 6 transistors; it uses standard AND and OR gates along with a Multiplexer to produce the full swing output. This ability to achieve full swing with fewer transistors comes from the noncascaded structure of the proposed circuit, which is presented in Figure 6.
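The behaviour described above can be explored with a small truth-table check. The sum expression of the EAFA is not reproduced in this text (Table 3 is missing), so the function below is an assumption: it is one noncascaded function that fits the description, with the sum selected by A between B OR Cin (A = 0) and B AND Cin (A = 1), and the carry taken directly from A. It should be read as an illustration of how erroneous rows and error distance are counted, not as the authors' circuit.

```python
from itertools import product


def eafa_assumed(a: int, b: int, cin: int) -> tuple[int, int]:
    """Hypothetical 1-bit inexact adder: MUX on A selects OR or AND of (B, Cin)."""
    sum_bit = (b & cin) if a else (b | cin)
    carry = a  # carry output taken directly from input A
    return sum_bit, carry


def error_rows(adder):
    """Return ((a, b, cin), error_distance) for every erroneous input row."""
    rows = []
    for a, b, cin in product((0, 1), repeat=3):
        s, c = adder(a, b, cin)
        ed = abs((2 * c + s) - (a + b + cin))  # distance from the exact result
        if ed:
            rows.append(((a, b, cin), ed))
    return rows


ROWS = error_rows(eafa_assumed)
TOTAL_ED = sum(ed for _, ed in ROWS)
print(ROWS)      # [((0, 1, 1), 1), ((1, 0, 0), 1)]
print(TOTAL_ED)  # 2
```

Under this assumed function only two of the eight input rows are erroneous, each with an error distance of 1, which is consistent with the two-error behaviour stated in the text.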

Segmented Approximate Adders.
Approximate adders are the core and essential part of any approximate circuits used for processing the signals or data. Segmented approximate addition is the most widely used method for its accuracy and error trade-off [5-7, 11, 16-19]. In segmented approximate adders, half of the binary data from the MSB and half of the binary data from the LSB are segmented separately. Upper MSB segment data is added by the accurate adder to preserve the quality of the results, and lower LSB segment data is added by an inaccurate adder, which aids in the speed and energy efficiency of computation. In the existing 16-bit GDI High-

Proposed Segmented Approximate Adders.
In our work, we propose two 16-bit GDI Energy and Area Efficient High-Speed Error-Tolerant Adders (GDI-EAHSETA), as described in Figure 7, based on our 1-bit adders EAFA Design 1 and EAFA Design 2. In the proposed 16-bit adder, we modified the accurate 8-bit adder, which uses a combination of a 4-bit CSLA and a 4-bit RCA [29][30][31][32], for the upper 8-bit MSB segment. The lower 8-bit LSB segment uses our proposed EAFA designs. From the detailed block diagram presented in Figure 8, it is evident that our proposed architecture has only 12 common full adders and 5 multiplexers in the accurate 8-bit adder; the optimized design portion is highlighted in the same diagram. Our modified accurate 8-bit adder in the 16-bit GDI-EAHSETA uses 25% fewer FAs and 50% fewer multiplexers than the existing 16-bit GDI-HSETA. EAFA Designs 1 and 2 used in the inaccurate 8-bit adder consist of 10 and 6 transistors, respectively. These proposed EAFA designs occupy 16.67% and 50% less area, respectively, in comparison with the efficient existing MFA Design 2 inaccurate 8-bit adders.
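A behavioural sketch of the segmented scheme is given below. It rests on several assumptions that the text does not confirm: the 1-bit cell is a hypothetical function (sum = B OR Cin when A = 0, B AND Cin when A = 1, carry = A), the carry out of bit 7 is fed into the accurate upper segment, and the upper segment is modelled as plain exact addition rather than as the CSLA/RCA structure of Figure 8.

```python
def inexact_cell(a: int, b: int, cin: int) -> tuple[int, int]:
    """Hypothetical 1-bit inexact cell (assumption, not the published circuit)."""
    sum_bit = (b & cin) if a else (b | cin)
    return sum_bit, a  # carry is simply input A


def segmented_add16(x: int, y: int) -> int:
    """16-bit add: inexact lower 8 bits, exact upper 8 bits (17-bit result)."""
    carry = 0
    low = 0
    for i in range(8):  # lower (LSB) segment, one inexact cell per bit
        a = (x >> i) & 1
        b = (y >> i) & 1
        s, carry = inexact_cell(a, b, carry)
        low |= s << i
    # Upper (MSB) segment: exact addition, taking the last cell's carry (assumed).
    high = (x >> 8) + (y >> 8) + carry
    return (high << 8) | low


# Operands whose lower bytes are zero are added exactly.
assert segmented_add16(0x1200, 0x3400) == 0x1200 + 0x3400
# A lower-byte case where the inexact cell introduces an error of 1.
print(segmented_add16(1, 0), 1 + 0)  # 2 1
```

Because each inexact cell's carry depends only on its own A input, the lower segment has no rippling carry chain in this model, which is the property the text credits for the speed of the design.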

Experimental Results and Discussion
All the circuits presented in this work have been simulated with the Cadence EDA tools with 90 nm PTM.

Proposed 1-Bit Full Swing EAFA.
Both of the proposed full adders exhibit full swing performance. In EAFA Design 1, full swing AND and OR GDI logic gates are used to maintain the voltage level, and their outputs are passed through the GDI MUX, which selects the proper sum value for the given input. The full swing structure of the AND and OR gates itself keeps the signal levels within the voltage ranges that represent logic 0 and 1. The MUX simply passes the logic value through to the sum output. Its simulated results for all the input combinations are shown in Figure 9. From the simulation results of Figures 9 and 10, it is evident that there is no voltage level degradation in the sum and carry outputs of either adder. Our proposed 6-transistor EAFA Design 2 generates full swing output by itself, without any extra transistors for restoring the voltage levels, on par with the common adder, AFA, MFA, and EAFA Design 1.

Power Consumption and Delay.
The average power consumption is computed through the measurement tool in the SPICE simulation. The maximum of the average power is taken as the consumed power of the adder circuits [27]. The delay of each adder is measured as the time difference between the points at which the input and output voltage transitions cross 50% of the maximum value. The maximum delay obtained over the various input and output combinations is taken as the worst-case delay [24]. The power and delay results of the various simulated adders are presented in Table 4.
From the simulated results of all the 1-bit full adders, it is evident that our proposed EAFA Design 1 consumes 47.62% less power than the best of the common adders (Design 1), 46.15% less power than the best of the AFA adders (Design 2), and 44.42% less power than the best of the MFA adders (Design 2).
In the same way, our proposed EAFA Design 2 outperforms other adders with 99.99% less power consumption.
This power performance is achieved in our proposed structure by controlling the switching activity involved in producing the sum and carry terms. For any combination of inputs, at a given time, only three transistors are in the ON state for producing the sum output, and generating the carry costs no switching energy since it is taken directly from one of the inputs. The speed performance of the proposed EAFA adders is good, with a minimal delay compared to the other simulated adders. Our EAFA Design 1 reduces the delay by 38.79%, 41.16%, and 30.66% compared to the corresponding best performing adders in their groups, namely, common adder Design 1, AFA Design 2, and MFA Design 2, respectively. Its worst-case delay of 459.447 ps is measured between input A and the sum output. The proposed EAFA Design 2 has 86.7%, 87.21%, and 84.93% less delay than the common adder Design 1, AFA Design 2, and MFA Design 2, respectively, and the same is illustrated in Figure 11. The worst-case delay of this adder occurs between input A and the sum output, and its value is 99.866 ps. The reason behind the minimal delay is that neither proposed design spends time on carry calculation, and hence the carry has no effect on the delay. The sum output dominates the delay. In our proposed Design 2, to generate the sum for any combination of inputs, the signal has to travel through only two transistors. This accounts for the speed of operation of our proposed circuit architecture.

Performance Evaluation of 16-Bit Adders.
In order to demonstrate the cascaded effectless operation of the proposed adder, 1-bit EAFA Design 1 and Design 2 are extended to form 16-bit adders and are compared with various configurations of 16-bit adders of different types, as listed in Table 5. Among the listed adders, our proposed 16-bit GDI-EAHSETAI occupies 12 × 18 = 216 transistors for 12 common full adders (Design 1), 5 × 6 = 30 transistors for 5 multiplexers, and 8 × 10 = 80 transistors for the inaccurate 8-bit adder based on EAFA Design 1. All put together, it occupies only 326 transistors (216 + 30 + 80). That is 33.74%, 31.51%, and 26.57% less area than the relatively well-performing adders, namely, GDI-CMBAI, GDI-AMBAII, and GDI-HSETAII, respectively. Similarly, our proposed 16-bit GDI-EAHSETAII occupies a total of 294 transistors, summed up from 12 × 18 = 216 transistors for the common full adders, 5 × 6 = 30 transistors for the multiplexers, and 8 × 6 = 48 transistors for the inaccurate adder based on EAFA Design 2. The area occupied by the GDI-EAHSETAII adder is 40.24%, 38.23%, 33.78%, and 9.8% less than the area needed to implement the GDI-CMBAI, GDI-AMBAII, GDI-HSETAII, and GDI-EAHSETAI, respectively. A comparison of the area occupied by the proposed high-speed adder with the other adders is illustrated in Figure 11. The average power has been measured for all the 16-bit adders using Cadence SPICE. The substantial reduction in the number of transistors, together with the circuit structure, leads to the lower power consumption of our proposed 16-bit GDI-EAHSETAI and II. In comparison with the other 16-bit adders, GDI-EAHSETAI consumes 87.5%, 78.92%, and 75.78% less power, and GDI-EAHSETAII consumes 88.31%, 80.27%, and 77.34% less power, than GDI-CMBAI, GDI-AMBAII, and GDI-HSETAII, respectively; this is illustrated in Figure 12. Our proposed GDI-EAHSETAII outperforms all the listed adder types in terms of area and power while producing distortion-free outputs.
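The transistor counts quoted above follow from simple arithmetic, which the short sketch below reproduces. Only the figures stated in the text are used (18 transistors per common full adder, 6 per multiplexer, 10 or 6 per EAFA cell); the function and variable names are hypothetical.

```python
FA_TRANSISTORS = 18   # common full adder, Design 1
MUX_TRANSISTORS = 6   # multiplexer
NUM_FA, NUM_MUX, NUM_INEXACT = 12, 5, 8


def total_transistors(inexact_cell_transistors: int) -> int:
    """Transistor count of a 16-bit adder for a given inexact cell size."""
    return (NUM_FA * FA_TRANSISTORS
            + NUM_MUX * MUX_TRANSISTORS
            + NUM_INEXACT * inexact_cell_transistors)


eahseta_1 = total_transistors(10)  # EAFA Design 1 cells
eahseta_2 = total_transistors(6)   # EAFA Design 2 cells
saving_pct = 100 * (eahseta_1 - eahseta_2) / eahseta_1

print(eahseta_1, eahseta_2, round(saving_pct, 1))  # 326 294 9.8
```

The 9.8% figure is the relative saving of GDI-EAHSETAII over GDI-EAHSETAI; the savings against the other adders depend on transistor counts from Table 5, which are not restated here.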

Performance Evaluation of the Adders in FPGA.
To assess practicability and to evaluate the performance of the proposed logic, we implemented the best-performing inexact 16-bit adder logic among all ten types; the designs are synthesized using the Quartus Prime 18.0 tool for the Intel Cyclone IV FPGA platform, and the parameters are listed in Table 6. On this hardware platform as well, our proposed adder outperforms the other adders by occupying fewer LUTs, consuming less power, and operating at a higher speed.

Error Characteristic of Approximate Adders.
Approximate adders are characterized by the deviation of their calculated results from the precise results. Various error parameters such as error distance (ED) and mean error distance (MED) are used to properly characterize approximate adders for practical applications. Error distance is the absolute difference between the approximate result and the exact result. Mean error distance is calculated as the average of all the error distances of an inaccurate adder circuit, and it is given in equation (1). In order to make an additional comparison with similar adders, one more parameter, called normalized mean error distance (NMED), is derived from MED. NMED is calculated as in equation (2) as the ratio of MED to the maximum error value (D) of that specific adder.
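Equations (1) and (2) do not appear in this text. Based only on the verbal definitions above, they can be written as follows; the symbols N (number of evaluated input combinations) and the superscripts on S are notation assumed here for illustration.

```latex
\begin{align}
\mathrm{ED}_i &= \left| S_i^{\mathrm{approx}} - S_i^{\mathrm{exact}} \right| \notag \\
\mathrm{MED}  &= \frac{1}{N} \sum_{i=1}^{N} \mathrm{ED}_i \tag{1} \\
\mathrm{NMED} &= \frac{\mathrm{MED}}{D} \tag{2}
\end{align}
```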
Error characteristics are useful to validate approximate adders for their suitability for application deployment. The ideal MED, that of an accurate adder, is 0. So, an approximate adder whose MED value is close to this ideal value is considered good for use in computation applications. Similarly, NMED is the parameter derived from MED to express the overall error distribution with respect to the maximum possible error in the designed approximate adder.
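As a minimal sketch of how these metrics can be computed, the code below evaluates ED, MED, and NMED exhaustively for a small adder. The adder used for the demonstration is a simple truncation-based stand-in chosen only because its errors are easy to verify by hand; it is not one of the adders in Table 7, and all names are hypothetical. NMED follows the definition in the text, MED divided by the maximum error D.

```python
from itertools import product


def error_metrics(approx_add, width: int) -> tuple[float, int, float]:
    """Exhaustively compute (MED, D, NMED) over all width-bit operand pairs."""
    eds = [abs(approx_add(a, b) - (a + b))
           for a, b in product(range(1 << width), repeat=2)]
    med = sum(eds) / len(eds)          # mean error distance
    d = max(eds)                       # maximum error value D
    nmed = med / d if d else 0.0       # an exact adder gives MED = NMED = 0
    return med, d, nmed


def truncated_add(a: int, b: int, dropped: int = 2) -> int:
    """Stand-in inexact adder: ignores the lowest `dropped` bits of each operand."""
    return ((a >> dropped) + (b >> dropped)) << dropped


MED, D, NMED = error_metrics(truncated_add, width=4)
print(MED, D, NMED)  # 3.0 6 0.5
```

Passing an exact adder such as `lambda a, b: a + b` returns a MED of 0, the ideal value mentioned above.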
Based on equations (1) and (2), the error metrics have been calculated and tabulated in Table 7. From this, our proposed adder exhibits better error characteristics than the other existing adders, and it is very well suited for real-time applications.

Convolutional Neural Network Accelerator for the Handwritten Digit Classification Inference: A Case Study
Convolutional neural networks are drawing major attention in deep learning-based applications, especially classification and detection [33,34]. CNNs are even used to predict high-frequency details lost in low-resolution images to create superresolution images [35]. CNNs inherit error tolerance from their learning process, in which randomly initialized weights are repeatedly updated. In this work, we took advantage of this error-resilient property of CNNs and focused on developing the most common block used in the convolution computation. The fundamental block involved in nearly all the computations of a CNN is the adder. The adder contributes to the whole system's performance and influences the total energy consumption. Thus, introducing inexactness in the addition process curtails the power, area, and delay while improving the whole system's performance [8,9].
Since CNN is the most commonly used method in any deep learning process, we attempted to accelerate the computationally intensive convolutional blocks in it. In this work, an accelerator to accelerate the CNN for handwritten digit classification application is developed through dedicated fixed-point approximate arithmetic blocks for the convolution operation. A similar kind of accelerator work can be found in the literature, but for accelerating softmax regression [8] and image to column operation [9] alone.

CNN System Architecture for Digit Classification.
A CNN is a multilayer filter specially designed to extract information from data through preprocessing operations. In a CNN, the input parameter size decreases layer by layer while the size of the filter increases. The general system architecture of a CNN for digit identification is shown in Figure 13. As the depth of the network increases, that is, as the number of layers between the input and the output increases, the accuracy of the digit classification also increases [36].

VGG Net.
In this work, we have concentrated only on VGG-based Low-Weight Digit Detector (LWDD) CNN system architecture, which is deployed in real-time handwritten digit recognition applications to detect digits from 0 to 9 [37].
In LWDD, the authors optimized various layers and represented the weights as 16-bit fixed-point values to minimize the total size and count of the weight parameters. These modifications greatly reduce the storage requirements of the parameters, which makes them fit easily in the FPGA. The structure of the different layers in the LWDD and the corresponding sizes of the images and parameters are given in Table 8.
This LWDD architecture takes 28 × 28 handwritten digit images as input and classifies each as a number between 0 and 9. Handwritten digit images from the open MNIST dataset are used for training the network, and the trained parameters are used for inferencing digit classification.

Performance Evaluation of the Proposed Adder in CNN Accelerator.
In order to accelerate the real-time handwritten digit recognition process, we replaced the conventional adders used in the convolution layers of the LWDD architecture with our proposed GDI-EAHSETA approximate adder logic and evaluated the result.
For a fair comparison of the acceleration process, the existing competing adders GDI-AMBA and GDI-HSETA are also implemented in the accelerator. The hardware complexity for processing the convolution layers is much lower in our proposed architecture. The time taken for calculation at different layers, in terms of clock cycles, is compared in Table 9, and its graphical representation is presented in Figure 14.
From the comparison, it is evident that our proposed GDI-EAHSETAII greatly reduces the number of clock cycles needed for the convolution compared to the conventional accurate adder GDI-CMBAI and speeds up the process by a factor of 1.29. In comparison with the related approximate adders, the proposed adder exhibits relatively better performance by taking fewer clock cycles.
Comparison of the clock cycles taken by the different convolution layers for the LWDD systems deployed with different approximate adders.
The FPGA implementation of the LWDD accelerator system based on the various adders is done on an Intel Cyclone IV EP4CE22F17C6N FPGA, and the resources utilized for the deployment of the accelerator are listed in Table 10 along with the digit classification accuracy. GDI-AMBAII utilizes more resources than the other approximate adders and achieves a speed-up factor of 1.04, while producing a moderate accuracy of 88%. GDI-HSETAII consumes moderate resources and speeds up the computation by a factor of 1.23, but it shows the lowest accuracy among the adders, 80%. The proposed GDI-EAHSETAII utilizes fewer resources while producing 95% accuracy, and it speeds up the computation by a factor of 1.29. Since the proposed adder outperforms all the other adders in terms of resource utilization, speed, and accuracy, it is best suited for the CNN inference accelerator.
Figure 13: Proposed approximate convolution in the typical CNN architecture.

Conclusion
In this work, we designed and evaluated transistor-level 16-bit Energy and Area Efficient High-Speed Error-Tolerant Adders based on the proposed GDI-based full swing energy and area efficient error-tolerant full adders, which handle the cascading effect. In comparison with similar inexact full adders, the proposed EAFA Design 1 and EAFA Design 2 occupy less area, consume less power, and operate faster while giving better reliability with the minimum error distance. The efficiency of the proposed EAHSETA is compared with various multibit adders based on common CSLA designs, AFA, MFA, and HSETA. The improvements in the speed, area, and power of EAHSETAI and EAHSETAII are achieved through the cascaded effectless architecture of the EAFA designs. The proposed EAHSETAII has a relatively smaller area and consumes 88.31%, 80.27%, and 77.34% less power than the in-group best performing GDI-CMBAI, GDI-AMBAII, and GDI-HSETAII, respectively. These features of EAHSETAII satisfy the area and power requirements of resource-constrained, high-speed, low-power deep learning applications. The proposed EAHSETAII logic and the AMBAII and HSETAII logics are deployed to accelerate the convolution computation of the LWDD network, which is used in real-time handwritten digit recognition. All the systems are implemented in the Intel Cyclone IV FPGA, and their performance has been evaluated. The proposed EAHSETAII outperforms the other similar logics by producing 95% classification accuracy with a speed-up factor of 1.29 while consuming fewer resources. Experimental results also show that AMBAII produces moderate accuracy with a lower speed-up and that HSETAII exhibits poor accuracy with a moderate speed-up factor. In the future, the proposed logic can be extended to an inexact multiplier design for the multiply and accumulate unit for further acceleration.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.