An FPGA-Based Hardware Accelerator for CNNs Using On-Chip Memories Only: Design and Benchmarking with Intel Movidius Neural Compute Stick
International Journal of Reconfigurable Computing

In recent years, convolutional neural networks have been used for different applications, thanks to their ability to carry out tasks with a reduced number of parameters when compared with other deep learning approaches. However, power consumption and memory footprint constraints, typical of edge and portable applications, usually collide with accuracy and latency requirements. For such reasons, commercial hardware accelerators have become popular, thanks to their architecture designed for the inference of general convolutional neural network models. Nevertheless, field-programmable gate arrays represent an interesting alternative, since they offer the possibility to implement a hardware architecture tailored to a specific convolutional neural network model, with promising results in terms of latency and power consumption. In this article, we propose a full on-chip field-programmable gate array hardware accelerator for a separable convolutional neural network, which was designed for a keyword spotting application. We started from the model implemented in a previous work for the Intel Movidius Neural Compute Stick. For our goals, we appropriately quantized such a model through a bit-true simulation, and we realized a dedicated architecture exclusively using on-chip memories. A benchmark was carried out comparing the results on different field-programmable gate array families by Xilinx and Intel with the implementation on the Neural Compute Stick. The analysis shows that the FPGA solution achieves better inference time and energy per inference with comparable accuracy, at the expense of a higher design effort and development time.


Introduction
In recent years, convolutional neural networks (CNNs) have found application in many different fields such as object detection [1,2], object recognition [3,4], and KeyWord Spotting (KWS) [5,6]. Although they have shown excellent results in the cloud, their applicability to portable systems is challenging because of the additional constraints in terms of memory footprint and power consumption, which generally conflict with latency and accuracy requirements. In particular, in general-purpose solutions based on a microcontroller, the limited available memory constrains the complexity of the network, with a possible impact on the accuracy of the system [7]. In the same way, microcontroller-based systems feature the worst trade-off between power consumption and timing performance [8].
For this reason, commercial hardware accelerators for CNNs such as the Neural Compute Stick (NCS) [9], the Neural Compute Stick 2 (NCS2) [9], and Google Coral [10] were produced. Such products feature optimized hardware architectures that allow the inference of CNN models with low latency and reduced power consumption. Standard communication protocols, such as Universal Serial Bus (USB) 3.0, are generally exploited for communication purposes.
Nevertheless, since they were designed for the implementation of generic CNNs, their architectures are extremely flexible at the expense of the optimization of the single model.
For such a reason, hardware accelerators customized for a specific application might offer an interesting alternative for accelerating CNNs. In particular, field-programmable gate arrays (FPGAs) represent an interesting trade-off between cost, flexibility, and performance [11], especially for applications whose architectures have been changing too rapidly to rely on application-specific integrated circuits (ASICs) and whose production volumes might not be sufficient. At the same time, FPGAs offer high flexibility, which permits the implementation of different models with a high degree of parallelism [8] and the possibility of customizing the architecture for a specific application. The aim of this paper is to investigate the use of custom FPGA-based hardware accelerators to realize a CNN-based KWS system, analysing their performance in terms of power consumption, number of hardware resources, accuracy, and timing. A KWS system represents an example of an application whose porting to the edge requires much effort, owing to the hard design trade-offs.
The study involves the use of different FPGA families by Xilinx and Intel, analysing design portability on devices with different sizes and performance. This allowed us to realize a benchmark that compares the obtained results with the ones presented in our previous work for the full-SCNN (separable convolutional neural network) model [12], which implements the same architecture exploiting an NCS (version 1, mounting the Myriad 2 Vision Processing Unit (VPU)).
To realize the architecture implemented on the FPGA, a bit-true simulation was performed to appropriately quantize the model, reducing the number of resources used, saving power, and increasing throughput when compared with a floating-point approach.
The remainder of the paper is structured as follows: the Keras model used to describe the KWS system is presented in Section 2. Section 3 presents the approach used to quantize and compress the model to optimize its implementation on FPGAs. In Section 4, the results of the quantization analysis are provided and discussed. The preferred FPGA-based accelerator architecture is then described in Section 5, focusing on the analysis of design trade-offs. Results of the implementation on the different FPGA families are presented in Section 6. In Section 7, results in terms of maximum achievable clock frequency, hardware resources, and power consumption are presented and compared with the NCS solution. In Section 8, the usability of FPGA devices to accelerate the inference of CNNs is discussed with respect to the presented solution and similar applications. Finally, in Section 9, conclusions are given.

Architecture of the KWS System
KWS systems are a common component in speech-enabled devices: they continuously listen to the surrounding environment with the task of recognizing a small set of simple commands in order to activate or deactivate specific functionalities. Commercial examples of KWS systems include "OK Google" and "Hey Siri." The proposed KWS system is designed to operate inside a domotic installation for improving the quality of life of people with disabilities. In particular, it is able to recognize 10 different commands: "yes," "no," "up," "down," "left," "right," "on," "off," "stop," and "go." Moreover, it identifies two additional classes: "silence," when no word is pronounced, and "unknown," when the pronounced word does not belong to any class.
The KWS system was pretrained in the Python framework Keras [13], using the Google Speech Command dataset.
The proposed architecture is based on the SCNN described in [12], whose architecture is shown in Figure 1.
The input of the network is a 63 × 13 mel frequency spectral coefficient (MFSC) matrix [14]. The bin (n, k) of the matrix contains information on the spectral content at frequency f, as given by equation (1), where $f_{sample} = 16$ kHz is the sample rate and $N = 512$ (32 ms) is the number of points used to calculate the fast Fourier transform (FFT), measured at the instant $n/f_{sample}$, with $n \in [0, N - 1]$. Every N-sample window is weighted through a Hann window and overlapped with the previous N/2 samples for the calculation of the FFT. The input layer provides the 63 × 13 MFSC input matrix. Then, three separable convolutional (SC) layers follow, and their generic structure is shown in Figure 2.
SC layers improve standard convolutional layers by reducing the number of parameters used to process the inputs [12]. For this reason, SCNNs are particularly interesting for the realization of FPGA-based hardware accelerators because they reduce memory and computation requirements in comparison with the classic CNN approach.
A standard convolutional layer contains $c_{out}$ filters of size $(w_f \times h_f)$ that are convolved over $c_{in}$ input channels of size $(w_{c_{in}} \times h_{c_{in}})$, producing $c_{out}$ output channels of size $(w_{c_{in}} - w_f + 1) \times (h_{c_{in}} - h_f + 1)$. On the contrary, a separable convolution is realized through two distinct convolutions performed by means of filters whose dimensions are, respectively, $(f_w \times 1)$ and $(1 \times f_h)$. Figure 3 better illustrates the difference between these two approaches.
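To make the parameter saving concrete, the following minimal sketch counts the weights of a standard layer and of a separable layer. The channel counts and filter sizes are made-up values (Table 1 is not reproduced here), and `c_mid`, the number of channels between the two one-dimensional convolutions, is a hypothetical parameter introduced only for illustration. Biases are ignored.

```python
def standard_conv_params(c_in, c_out, w_f, h_f):
    """Weights of a standard layer: c_out filters of size w_f x h_f per input channel."""
    return c_out * c_in * w_f * h_f


def separable_conv_params(c_in, c_mid, c_out, f_w, f_h):
    """Weights of a (f_w x 1) convolution followed by a (1 x f_h) convolution.

    c_mid (channels between the two convolutions) is an assumption of this sketch.
    """
    first_conv = c_mid * c_in * f_w
    second_conv = c_out * c_mid * f_h
    return first_conv + second_conv


def conv_output_size(w_in, h_in, w_f, h_f):
    """Output size of an unpadded, stride-1 convolution, as in the formula above."""
    return (w_in - w_f + 1, h_in - h_f + 1)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper's Table 1.
    print(standard_conv_params(16, 32, 3, 3))
    print(separable_conv_params(16, 16, 32, 3, 3))
    print(conv_output_size(63, 13, 3, 3))
```

With these illustrative numbers the separable layer needs half the weights of the standard one; the actual saving depends on the filter sizes and channel counts of each layer.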
Considering the structure of the MFSC input matrix, each SC layer performs two separate convolutions, realizing a "time" convolution followed by a "frequency" convolution.
A batch normalization (BN) layer, which has the role of accelerating deep network training by reducing internal covariate shift [15], follows the frequency convolution. Finally, the rectified linear unit (ReLU) is the activation function of each SC layer. ReLU is defined in equation (2) as

$$f(x) = \max(0, x). \quad (2)$$

A classic convolutional layer follows the three SC layers. Table 1 summarizes the dimension of time/frequency filters, number of input channels ($C_{in}$), output channels ($C_{out}$), and input/output matrix dimensions for each convolutional layer of the network. Time_0 and freq_0 are, respectively, the temporal and frequency convolutional layers of hidden layer 0, and similarly time_1/freq_1 for hidden layer 1 and time_2/freq_2 for hidden layer 2. Final_conv refers to the last convolutional layer of the network. The average pooling layer computes the average value of each output channel of the final_conv layer, condensing them into 12 values, one for each class of the KWS system. Finally, a Softmax (or normalized exponential function) activation layer follows. It takes a vector $Z$ as input and produces an output vector in which each element $f_{softmax}(Z_j)$ is normalized in the interval [0, 1] and can be interpreted as the probability that the input belongs to class j.
The standard Softmax function is described by equation (3):

$$f_{softmax}(Z_j) = \frac{e^{Z_j}}{\sum_{k=1}^{K} e^{Z_k}}, \quad (3)$$

where $K$ is the number of elements of the input vector. In this network, the Softmax input vector is composed of 12 elements, one for each class of the KWS system. The proposed SCNN model was implemented on the Intel Movidius NCS, showing an accuracy of 87.77%. The number of parameters necessary for its implementation is 15000, including biases, weights, and batch normalization parameters.

Keras Model Optimization toward the Hardware Implementation
In the next sections, methods to map the Keras-Python model of the KWS system onto an FPGA are analysed. In fact, this model is implemented in a high-level language and its parameters are based on the floating-point representation.
The main issue in implementing a CNN-based model on an FPGA is the limited amount of hardware resources (combinatorial elements, sequential elements, Digital Signal Processors (DSPs), RAM blocks, etc.) available on such devices [11,16,17]. CNN algorithms are based on Multiply-and-ACcumulate (MAC) operations that require a large amount of combinatorial logic elements or DSPs. Furthermore, CNNs are characterised by a great number of parameters that must be stored in off-chip memories if they exceed the available on-chip memory. The use of off-chip memory may then be inevitable, complicating the design and increasing the inference time. For these reasons, the architecture of the hardware accelerator was carefully designed considering the trade-off between inference time and available resources.
3.1. Model Quantization. Before realizing the FPGA implementation, a quantization of the SCNN model was performed. In the literature, there are many examples of quantization applied to CNNs [18][19][20][21]. The main advantage offered by a fixed-point representation is the possibility to shrink the model dimension and complexity with a negligible loss in accuracy [22]. In addition, fixed-point arithmetic requires simpler calculations than floating-point arithmetic, with advantages in terms of complexity and power consumption [23].
The quantization of the original floating-point model was performed through a bit-true simulation. The aim of the simulation is to determine the number of bits necessary to represent numbers in every internal node of the network while limiting the loss in accuracy.
The fixed-point representation of the model weights (or filter elements) $w_q$ was calculated by using the approach described by equation (4), where $w$ is the floating-point representation of the weight and $lsb_w$ is the value of the least significant bit (lsb). The latter is calculated by dividing $|w|_{max}$, which represents the absolute value of the maximum weight over each layer, by $2^{b_w - 1}$, where $b_w$ is the number of bits used to represent weights, as required by the 2's complement format. In particular, since the range of weight amplitudes is roughly the same for every layer, the same value of $|w|_{max}$ was used for each layer. Such choices are due to the necessity to reduce the large number of degrees of freedom in the simulation. Furthermore, in order to reduce the number of operations to implement in hardware, the effects of the BN are included in the weight and bias values (BN simply consists of algebraic operations). In formulas, each weight $w(i)$ and each bias $b(i)$ belonging to a frequency convolutional layer or to final_conv was modified as described by equations (5) and (6), where $\gamma$ and $\beta$ are the scaling factor and bias of the BN, respectively, $\sigma$ is the standard deviation, and $\mu$ is the average of the weights of a given input channel.
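The two steps above (absorbing BN into weights and biases, then mapping to fixed point) can be sketched as follows. This is a minimal illustration and not the authors' code: the folding uses the textbook form of BN with $\gamma$, $\beta$, $\mu$, and $\sigma$, and the rounding and clipping in `quantize` are assumptions of this sketch, since the exact form of equations (4) to (6) is not reproduced here. Only the definition of `lsb_value` follows the text directly.

```python
def fold_batch_norm(w, b, gamma, beta, mu, sigma):
    """Fold y = gamma * ((w * x + b) - mu) / sigma + beta into a single (w', b').

    Textbook BN folding, used here as an assumption.
    """
    scale = gamma / sigma
    return w * scale, (b - mu) * scale + beta


def lsb_value(w_abs_max, b_w):
    """lsb_w = |w|_max / 2^(b_w - 1), as described in the text."""
    return w_abs_max / 2 ** (b_w - 1)


def quantize(w, lsb, b_w):
    """Map a float weight to a signed b_w-bit integer code.

    Rounding to nearest (Python's round, ties to even) and clipping to the
    2's complement range are assumptions of this sketch.
    """
    q = round(w / lsb)
    q_min, q_max = -(2 ** (b_w - 1)), 2 ** (b_w - 1) - 1
    return max(q_min, min(q_max, q))


if __name__ == "__main__":
    lsb = lsb_value(1.0, 8)            # hypothetical |w|_max = 1.0 and b_w = 8
    print(lsb, quantize(0.5, lsb, 8), quantize(1.0, lsb, 8))
    print(fold_batch_norm(2.0, 1.0, gamma=0.5, beta=0.25, mu=1.0, sigma=2.0))
```

Note that the largest positive weight maps to $2^{b_w - 1}$ before clipping, one step above the largest representable code, which is why the sketch clips the result.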
At the end of each layer, the acceptable number of truncated bits $b_{tr,i}$ and a saturation (truncation of the most significant bits) of $b_{sat,i}$ bits were also studied through the bit-true simulation to reduce the complexity of the hardware. In terms of formulas, truncation consists of changing the value of the lsb, as described by equation (7). Saturating $b_{sat,i}$ bits instead means discarding the $b_{sat,i}$ most significant bits. Such an operation does not affect $lsb_w$. For this purpose, the worst case (largest absolute value) of each layer output was considered so as to eliminate unused bits that were previously added to avoid overflow in arithmetic operations. To sum up, the accuracy of the [...]
A possible optimization of the model consists of quantizing the weights of the last convolutional layer separately, by using $b_{w,last}$ bits. Indeed, the coefficients of final_conv may be divided by the divisor of the average pooling, saving hardware operations. This optimization significantly changes the range of weights for the last layer, and a different quantization should be applied to it. For this reason, a second model to evaluate the overall accuracy takes into account this separate quantization of the final_conv weights.
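The word-length reduction at the output of a layer can be sketched on integer codes as follows. This is an illustration under stated assumptions rather than the authors' implementation: truncation is modelled as an arithmetic right shift, and the removal of most significant bits is modelled with clamping so that an out-of-range value does not wrap around (the text only states that unused bits are dropped after a worst-case analysis).

```python
def truncate_lsbs(value, b_tr):
    """Drop the b_tr least significant bits (arithmetic shift, rounds toward -inf)."""
    return value >> b_tr


def lsb_after_truncation(lsb, b_tr):
    """Each dropped bit doubles the weight of the new least significant bit."""
    return lsb * 2 ** b_tr


def drop_msbs(value, width, b_sat):
    """Reduce a signed `width`-bit code to (width - b_sat) bits.

    Clamping to the narrower range is an assumption of this sketch.
    """
    new_width = width - b_sat
    lo, hi = -(2 ** (new_width - 1)), 2 ** (new_width - 1) - 1
    return max(lo, min(hi, value))


if __name__ == "__main__":
    acc = 45                            # hypothetical accumulator value
    print(truncate_lsbs(acc, 2), lsb_after_truncation(0.25, 2))
    print(drop_msbs(100, 12, 4), drop_msbs(300, 12, 4))
```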

Pruning.
Another technique to reduce the complexity of the hardware accelerator is pruning. It consists of dropping the least important connections of the network [24,25] by identifying the weights or biases with a magnitude smaller than a given threshold. In this network, the biases of the temporal convolutional layers have magnitudes in the order of $10^{-9}$ to $10^{-7}$. Considering their small values with respect to the other network parameters, they were pruned to reduce the model size. Indeed, it is possible to eliminate the temporal bias terms without significantly affecting accuracy, while reducing the number of sums to be computed.
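Magnitude-based pruning as described above can be sketched in a few lines. The threshold and the sample values below are hypothetical and chosen only to mirror the orders of magnitude mentioned in the text.

```python
def prune(values, threshold):
    """Zero every parameter whose magnitude is below the threshold."""
    return [0.0 if abs(v) < threshold else v for v in values]


def count_nonzero(values):
    """Number of parameters that survive pruning."""
    return sum(1 for v in values if v != 0.0)


if __name__ == "__main__":
    # Hypothetical values: three tiny temporal biases and two ordinary weights.
    params = [3e-9, -5e-8, 1e-7, 0.12, -0.4]
    pruned = prune(params, threshold=1e-6)
    print(pruned, count_nonzero(pruned))
```

In hardware, a bias that is pruned to zero removes the corresponding adder altogether, which is the saving the text refers to.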

Results of the Quantization Analysis
In this section, the results obtained from the quantization process are presented and discussed. The SCNN model of this network has many degrees of freedom. For this reason, the first simulation step is aimed at identifying a starting point for a more complex analysis, and it only focuses on the quantization of input layer words and weights.
Figure 4 shows simulation results, in terms of accuracy and mean square error (MSE) in relation to the floating-point model, when only the number of bits for the representation of input words is quantized. A word length of 4 or 5 bits optimizes accuracy and minimizes the MSE. This first analysis suggests a possible optimization for the input layer: in particular, the number of bits of every input can be forced to be a multiple of 4 bits, so that several inputs can be contained in buses such as the Advanced eXtensible Interface 4 (AXI4) bus [26], whose size is usually a multiple of 8 bits.
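As an illustration of why a 4-bit input word is convenient on a byte-oriented bus, the sketch below packs two input words into each byte. It assumes signed 2's complement 4-bit words and places the first word in the low nibble; both choices are assumptions of this example and are not specified in the text.

```python
def pack_nibbles(samples):
    """Pack signed 4-bit values (-8..7) two per byte, first value in the low nibble."""
    samples = list(samples)
    if len(samples) % 2:
        samples.append(0)               # pad to a whole number of bytes
    out = bytearray()
    for low, high in zip(samples[0::2], samples[1::2]):
        out.append(((high & 0xF) << 4) | (low & 0xF))
    return bytes(out)


def unpack_nibbles(data, count):
    """Inverse of pack_nibbles: recover `count` signed 4-bit values."""
    values = []
    for byte in data:
        for nibble in (byte & 0xF, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]


if __name__ == "__main__":
    words = [1, -1, 7, -8, 3]
    packed = pack_nibbles(words)
    print(packed.hex(), unpack_nibbles(packed, len(words)))
    print(len(pack_nibbles([0] * (63 * 13))))   # bytes for one 63 x 13 input matrix
```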
Figure 5 shows simulation results, in terms of accuracy and MSE, when only the number of bits for the representation of weights is quantized. In this case, accuracy grows rapidly between 8 and 12 bits, reaching even higher values than the original ones at 11 and 12 bits. Finally, accuracy saturates for 16 or more bits. This parameter is crucial because it influences the number of bits necessary to represent the result of MAC operations and, consequently, the complexity of the entire network.
This first analysis was the starting point for a more detailed design exploration, involving the number of bits for the representation of the output of each layer.
Table 2 reports the best results obtained in terms of accuracy. Only the number of bits of SC layers and final_conv outputs are presented in the table, whereas data regarding temporal convolutional sublayers are omitted. The parameters listed in the table are as follows:
(i) b_in: number of bits for the representation of input words.
(ii) b_filter: number of bits for the representation of filters.
(iii) bit_out_0: number of bits for the representation of the outputs of the first hidden layer.
(iv) bit_out_1: number of bits for the representation of the outputs of the second hidden layer.
(v) bit_out_2: number of bits for the representation of the outputs of the third hidden layer.
(vi) bit_out_fc: number of bits for the representation of the outputs of the last convolutional layer.
Collected data show that it is possible to increase model accuracy through quantization.In fact, the best accuracy obtained for the floating-point model is 87.77%, whereas for the fixed-point representation, the highest accuracy is 90.23%.
The second part of the simulation considers a different quantization for the final_conv layer due to the inclusion of the average pooling effects, as explained in Section 3.1. The results of this simulation are summarized in Table 3. These models show smaller hardware requirements than the single-quantization versions presented in Table 2. The sets of data are the same as those in Table 2, except for b_last, which represents the number of bits for the representation of final_conv layer weights, whereas b_filter refers only to SC layer weights. This second model allows the weight representation to be shrunk for all convolutional layers, significantly reducing the impact of MAC operations on hardware resource requirements. Furthermore, several quantized models have an accuracy score higher than the original one (87.77%).
The model chosen for the FPGA implementation considers both accuracy and the possibility to shrink parameter representations. Model number (7) from Table 3 was selected: it has a higher accuracy than the Keras-Python model (88.09% versus 87.77%), and it minimizes the number of bits necessary for the representation of layer outputs and weights. The input layer is compatible with AXI4 because input words are represented on 4 bits. The number of bits for the representation of temporal convolutional outputs of model (7) is 10 for time_0, 8 for time_1, and 10 for time_2.

FPGA Hardware Architecture
This section describes the architecture of the hardware accelerator that was implemented on different FPGA families.
Thanks to the reduced number of parameters of the SCNN investigated in our previous work [12], it was possible to realize a full on-chip design with significant advantages in terms of latency and energy per inference, avoiding accesses to off-chip memories [11,21].
Figure 6 shows the block diagram of the accelerator. The number of bits of the words read from and written into the input memory and RAMs is related to our preferred model, described in the previous section.
An input memory is used as an interface between the hardware accelerator and the system that records and processes the audio samples. The input memory stores 4-bit input data. The time/frequency layers and the final_conv layer perform convolutional operations and store the results into a RAM memory, used as a buffer. Once the previous layer completes an entire convolution, the next one starts reading out its input matrix from the memory.
Each of the seven convolutional layers has its own MAC module to perform multiply-accumulate operations. Figure 7 shows the structure of the MAC module, designed to compute one element of the output matrix per clock cycle.
It reads $n_{elem}$ elements from the RAM memory, where the value of $n_{elem}$ is given by the following equation:

$$n_{elem} = C_{in} \cdot f_{elem}, \quad (8)$$

where $C_{in}$ is the number of input channels and $f_{elem}$ is the number of elements composing a channel filter. The adder-tree structure is used for accumulation, and it was chosen to reduce the overall latency of the circuit. Considering this configuration of the MAC module, the total number of clock cycles needed to complete an entire convolution for each convolutional layer is $N_{clk}$, as shown in the following equation:

$$N_{clk} = C_{out} \cdot W_{out} \cdot H_{out}, \quad (9)$$

where $C_{out}$ is the number of output channels and $W_{out}$ and $H_{out}$ are the dimensions of the output matrix. Table 4 shows $N_{clk}$ of each convolutional layer of the network considering the values of $C_{out}$, $W_{out}$, and $H_{out}$ listed in Table 1. Finally, 819 clock cycles must be added in order to store the 63 × 13 input matrix in the input memory. A total of 90278 clock cycles are required to complete an inference.
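A behavioural sketch of the MAC module may help to see why an adder tree keeps latency low: the $n_{elem}$ products are summed pairwise, so the number of addition levels grows with the logarithm of $n_{elem}$ rather than linearly. The model below is a software illustration only, with made-up operand values; it says nothing about the pipelining or word lengths of the actual circuit.

```python
def adder_tree_sum(values):
    """Sum values pairwise, level by level. Returns (total, number_of_levels)."""
    level = list(values)
    if not level:
        return 0, 0
    depth = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)             # pad odd levels with a zero operand
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth


def mac(inputs, weights):
    """One output element: n_elem multiplications followed by the adder tree."""
    products = [x * w for x, w in zip(inputs, weights)]
    return adder_tree_sum(products)


if __name__ == "__main__":
    c_in, f_elem = 4, 3                 # hypothetical layer: n_elem = C_in * f_elem = 12
    n_elem = c_in * f_elem
    print(mac(list(range(1, n_elem + 1)), [2] * n_elem))
```

For 12 products the tree has 4 levels, whereas a sequential accumulator would need 11 dependent additions.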
A higher degree of parallelization of MAC operations would offer the opportunity to speed up the accelerator, reducing the inference time. On the other hand, it is not generally possible to perform an arbitrary number of operations per clock cycle because of the limited number of FPGA resources (combinatorial logic, DSPs, etc.). Furthermore, if the level of parallelism is too high, routing can become the bottleneck of the implementation.
It is possible to boost MAC module operations by increasing the number of output elements $n$ computed per clock cycle. In particular, for $n > 1$, $N_{clk}$ is reduced by a factor $1/n$, as described by the following equation:

$$N_{clk} = \frac{C_{out} \cdot W_{out} \cdot H_{out}}{n}. \quad (10)$$

Whilst this strategy leads to better timing optimization, it increases the design effort necessary to find the best combination that can fit on a specific FPGA device. Indeed, the appropriate value of $n$ for each layer should be tuned depending on the size of the target FPGA, in order to guarantee that the design can be implemented. Furthermore, parallelizing every layer brings negligible advantages in terms of inference time when the number of operations necessary to carry out an entire convolution differs strongly from layer to layer. Considering the limitation of FPGA hardware resources, it is appropriate to parallelize MAC operations only for the layers with the highest values of $N_{clk}$. In this specific case, the freq_2 layer contributes 60480 of the 90278 total clock cycles due to the very high number of output channels (192). For this reason, the MAC module of the freq_2 layer was customized so that it calculates 4 values of the output matrix per clock cycle. According to equation (10), this drastically reduces the freq_2 $N_{clk}$ from 60480 to 15120 and consequently the total inference time from 90278 to 44918 clock cycles, approximately halving the inference time. If a similar parallelization were realized for the other convolutional layers of the network, it would increase hardware resources without a significant improvement in timing performance because of their limited effect on the overall inference time.
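The cycle accounting above can be checked with a few lines of arithmetic. The sketch uses only the figures quoted in the text (90278 total cycles and 60480 cycles for freq_2 with one output element per clock); the function name and the rounding up for values of $n$ that do not divide the count evenly are choices of this example.

```python
TOTAL_CYCLES_N1 = 90278    # whole inference, one output element per clock in every layer
FREQ2_CYCLES_N1 = 60480    # freq_2 layer alone, one output element per clock


def total_cycles(n_freq2):
    """Total clock cycles when only freq_2 computes n_freq2 outputs per clock."""
    freq2 = -(-FREQ2_CYCLES_N1 // n_freq2)          # ceiling division
    return TOTAL_CYCLES_N1 - FREQ2_CYCLES_N1 + freq2


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        cycles = total_cycles(n)
        print(n, cycles, round(TOTAL_CYCLES_N1 / cycles, 2))
```

Going from n = 4 to n = 8 saves far fewer cycles than going from n = 1 to n = 4, because the cycles of the other layers are not reduced and start to dominate the total.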
As previously specified, batch normalization operations were absorbed into the frequency convolutional layer of each SC layer. The average pooling layer was included in final_conv, which provides 12 outputs, corresponding to the sum of all the elements belonging to the output matrix of each output channel. Finally, the Softmax layer can be omitted. Indeed, to provide a direct decision on the pronounced word, it is sufficient to select the maximum value among the twelve outputs of final_conv.
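The reason Softmax can be dropped is that it is monotonic: the largest input always yields the largest probability, so the selected class does not change. The same holds for dividing all outputs by a positive constant, such as the divisor of the average pooling. A small numerical check, with twelve made-up output values:

```python
import math


def softmax(z):
    """Numerically stable softmax of a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


if __name__ == "__main__":
    # Twelve hypothetical final_conv sums, one per class.
    sums = [12.0, -3.5, 40.25, 7.0, 0.0, 39.0, -11.0, 5.5, 2.0, 18.0, -1.0, 9.75]
    averages = [s / 63.0 for s in sums]   # any positive divisor preserves the ordering
    print(argmax(sums), argmax(averages), argmax(softmax(averages)))
```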
This architecture was chosen because its simplicity increases the possibility of fitting the hardware accelerator in a target FPGA, reducing design time and increasing design portability among devices with different sizes.

Hardware Implementation Results
This section describes the performance of the hardware accelerator on different FPGA families. The presented architecture was implemented on several Xilinx and Intel devices to analyse its design portability on FPGAs with different sizes and performance. Results are presented in terms of hardware resource occupation, maximum achievable clock frequency, inference time, and power consumption. Finally, an analysis of how MAC module parallelization affects the portability of the design is provided.
Table 5 shows the hardware resources needed for the implementation of the accelerator on Xilinx FPGAs. Results are presented in terms of combinatorial elements, sequential elements, BRAMs, and LUTRAMs (LRAMs). The percentage of used resources out of the total is also indicated. Table 6 shows the hardware resources needed for the implementation of the accelerator on Intel FPGAs. In this case, results are presented in terms of combinatorial elements, sequential elements, BRAMs, and DSPs.
All the implementations refer to the version of the accelerator in which the MAC module of the freq_2 layer was parallelized to compute 4 elements of its output matrix per clock cycle. The structure and the number of combinatorial/sequential elements, as well as memory dimensions and types, are specific to each device. Please refer to FPGA datasheets for more information about the architecture of Xilinx devices [27][28][29][30][31] and Intel devices [32][33][34].
Figures 8 and 9 show the maximum achievable clock frequency and the inference time for Xilinx and Intel FPGAs, respectively. The minimum inference time for each layer can be calculated taking into consideration the MAC module optimizations for the freq_2 layer and the $N_{clk}$ values listed in Table 4. The best result is obtained for the Zynq UltraScale+ with a maximum clock frequency of 116.2 MHz and a corresponding inference time of less than 0.4 ms.
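The inference time follows directly from the number of clock cycles and the clock frequency. The short sketch below reproduces the Zynq UltraScale+ figure from the 44918 cycles of the parallelized design; the function name is a choice of this example.

```python
def inference_time_ms(n_cycles, f_clk_mhz):
    """Inference time in milliseconds for a given cycle count and clock in MHz."""
    return n_cycles / (f_clk_mhz * 1e6) * 1e3


if __name__ == "__main__":
    # 44918 cycles (freq_2 computing 4 outputs per clock) at 116.2 MHz.
    print(round(inference_time_ms(44918, 116.2), 3))
```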
A power analysis was performed for both Xilinx and Intel FPGAs. To obtain a more accurate estimation of the power consumption for Xilinx devices, a postimplementation timing simulation was carried out by using Questa® Advanced Simulator to extract information about the switching activity of the internal nodes of the circuit. Since Intel devices do not support postlayout simulation, only an RTL-level estimation of the switching activity has been included in the power consumption analysis, as suggested by Intel guidelines [35]. Results are shown in Table 7.
In general, Xilinx devices show a lower power consumption than Intel devices for both static and dynamic power. The only exception is the Arria 10, featuring a power consumption of 1 W, which makes it the second best device after the Kintex-7 lv.

Design Portability. An analysis of the hardware accelerator portability has been carried out in order to investigate how the proposed design fits in smaller FPGAs. In particular, the freq_2 layer has been customized to compute 1, 2, 4, and 8 elements (n_out) of a given output channel per clock cycle. Results are presented in terms of hardware resources, maximum clock frequency, and inference time.
Two FPGAs with different sizes belonging to the same family were selected: Tables 8 and 9 show the results in terms of hardware resource occupation for the Zynq-7000 FPGAs and for the Zynq-US+ FPGAs, respectively. The xc7z030 and the xczu3eg have a limited number of hardware resources, and only the versions of the accelerator with n_out equal to 1, 2, and 4 can be implemented in these devices; the version with n_out equal to 8 fits only in the xc7z045 and in the xczu9eg. Owing to the limited number of LUTs available on the xc7z030 and the xczu3eg, DSPs are included to perform MAC operations. In addition, xczu3eg implementations exploit all the available BRAMs, and LRAMs have to be included. Versions of the hardware accelerator with a lower level of MAC parallelization have worse performance in terms of inference time but can fit in smaller devices because their requirements in terms of combinatorial elements are more relaxed. Unfortunately, the number of RAMs required does not change, because neither the intralayer RAM dimensions nor the number of parameters of the network change, and it can represent a bottleneck for the implementation of the not-customized version of the accelerator on smaller FPGAs.
Figures 10 and 11 show the performance in terms of clock frequency and inference time for Zynq-7000 and Zynq-US+ FPGAs, respectively. For the xc7z045 and the xczu9eg, the maximum achievable clock frequency does not vary much as the level of MAC parallelism increases. For the xc7z030 and the xczu3eg, the maximum achievable clock frequency tends to decrease because the limited size of these devices leads to a less optimized routing and consequently to worse timing performance. For this reason, the inference times for the xc7z030 with n_out equal to 4 and n_out equal to 2 are almost the same.
Similarly, when n_out is equal to 4, the timing performance of the xc7z030 solution features an implementation loss of 31% with respect to the solution on the xc7z045, and the xczu3eg solution features an implementation loss of 47% with respect to the one on the xczu9eg.

Comparison with Intel Movidius Neural Compute Stick
In this section, the FPGA-based accelerator is compared with a commercial hardware accelerator for machine learning on the edge: the Intel Movidius Neural Compute Stick. The same SCNN keyword spotting model was implemented on the NCS in our previous work [12], and a direct comparison between the performance of the two solutions, in terms of inference time, power consumption, and energy per inference, is now presented. The NCS is a commercial deep learning hardware accelerator hosting the Myriad 2 VPU by Intel Movidius [9].
The VPU includes the following:
(i) 4 Gb of LPDDR3 DRAM
(ii) 12 very long instruction word (VLIW) streaming hybrid architecture vector engine (SHAVE) processors optimized for machine vision, used to run parts of a neural network in parallel
(iii) 2 MB of on-chip memory shared between SHAVE processors and fixed-function accelerators
(iv) 2 Leon microprocessors that coordinate the reception of the network graph file and of inputs via USB connection
The Myriad 2 VPU supports fully connected, convolutional (with arbitrarily sized kernels), and depthwise convolutional layers.
The NCS implements the floating-point version of the SCNN model with a maximum accuracy of 87.77%. Quantization makes it possible to increase this value to 90.23%, even if our preferred implementation has an accuracy of 88.09%. The inference time for the SCNN implemented on the NCS is approximately 10 ms. The FPGA-based accelerator has a lower inference time for all the FPGA implementations presented, ranging from 1.45 ms for the Cyclone V to 0.39 ms for the Zynq-US+. Finally, the NCS power consumption is 0.81 W. This result is obtained by considering the hardware setup of our previous work [12], featuring a Raspberry Pi 3B [36] connected to the NCS. Power consumption can be estimated by subtracting the Raspberry Pi 3B power consumption in the absence of the NCS (1.3 W) from the total power consumption of the system during an inference (2.11 W).
As shown in Table 7, power consumption for the design implemented on-board all FPGAs is higher than that for the NCS one.Nevertheless, for all the implementations, the energy dissipated during an inference (E inf ) is lower than the one of NCS.In fact, it is possible to calculate E inf as shown in the following equation:  10 International Journal of Reconfigurable Computing where P is the average power consumption during an inference and t inf is the inference time.Indeed, even if Xilinx and Intel devices show a higher power consumption, the significantly lower t inf leads to a reduced E inf .
Table 10 shows a comparison among the FPGAs and the NCS in terms of inference time, power, and energy, using model (7) of Table 3. The power analysis was performed at the maximum achievable clock frequency ($f_{clk}$) of each FPGA in order to minimize the inference time.
The results show that FPGAs offer great design flexibility, since inference time and power consumption can be tuned through the choice of platform. FPGAs are promising devices for the implementation of CNN-based hardware accelerators for portable applications, in particular for those requiring low latency and high accuracy. Indeed, in the investigated cases, the inference time is reduced by a factor of approximately 7 to 25 and the energy per inference by a factor of 2.5 to 9 with respect to the NCS.
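The inference-time factors quoted above can be checked directly from the values given in the text (about 10 ms for the NCS, 1.45 ms for the Cyclone V, and 0.39 ms for the Zynq-US+). The sketch below also derives, from the energy relation alone, the power below which an FPGA still dissipates less energy per inference than the NCS. This break-even power is an illustrative quantity introduced here, not a measured result, and the function names are hypothetical.

```python
T_NCS_MS = 10.0   # approximate NCS inference time from the text
P_NCS_W = 0.81    # NCS power consumption from the text

# Fastest and slowest FPGA inference times quoted in the text.
fpga_t_inf_ms = {"Cyclone V": 1.45, "Zynq-US+": 0.39}


def speedup(t_ref_ms: float, t_ms: float) -> float:
    """Ratio between the reference inference time and the new one."""
    return t_ref_ms / t_ms


def break_even_power_w(p_ref_w: float, t_ref_ms: float, t_ms: float) -> float:
    """Largest average power for which P * t_ms stays below
    p_ref_w * t_ref_ms, i.e. the energy per inference stays lower."""
    return p_ref_w * t_ref_ms / t_ms


for name, t_ms in fpga_t_inf_ms.items():
    s = speedup(T_NCS_MS, t_ms)
    p_max = break_even_power_w(P_NCS_W, T_NCS_MS, t_ms)
    print(f"{name}: speedup {s:.1f}x, break-even power {p_max:.2f} W")
```

Since the energy ratio equals the speedup divided by the power ratio, an energy gain that is smaller than the speedup is consistent with the higher power drawn by the FPGAs.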
Finally, Figure 12 provides a graphical representation of the power consumption and inference time results shown in Table 10. The results make evident that all the FPGA solutions feature a reduced inference time with respect to the NCS implementation, at the expense of a higher power consumption, which is nevertheless comparable for some devices.

Discussion
It is necessary to underline that these results were possible thanks to the use of a CNN model optimized for resource-constrained devices [12], featuring a reduced number of parameters and layers. As a consequence, a full on-chip design was achievable, with strong advantages in terms of latency and power consumption. The results are therefore pertinent for applications requiring relatively small models, such as digit and letter recognition systems [37,38], audio [39], and mobile vision applications [40].
Finally, the proposed full on-chip design guarantees a straightforward processing architecture (i.e., no data scheduling from external memories and no management of shared inference processing elements), further reducing the overall system design time. However, when compared with the NCS and other plug-and-play solutions, the use of an FPGA still requires much more design effort and expertise, given the larger number of heterogeneous design steps (i.e., model quantization and architecture definition) and the broader design space.

Conclusions
This article presents a full on-chip FPGA-based hardware accelerator for on-the-edge keyword spotting. The KWS system is described focusing on its realization through a machine-learning algorithm and on the AI-on-the-edge paradigm.
Starting from a Keras-Python model of a KWS system based on an SCNN, the parameters of the network were quantized in order to reduce the hardware resources needed for its realization. CNNs have a large number of parameters and are characterized by multiply-and-accumulate operations, which make their implementation on an FPGA device challenging. The quantization analysis shows that a fixed-point representation does not significantly affect the model accuracy. On the contrary, the accuracy can even increase for particular combinations of input word, weight, and layer output representations.
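To make the idea of fixed-point quantization concrete, the following minimal sketch quantizes a few floating-point values to a signed fixed-point format and reports the mean squared error for different numbers of fractional bits. It is only an illustration under stated assumptions: the rounding mode, the saturation behavior, the 8-bit word length, and the sample weights are all hypothetical and are not taken from the bit-true simulation used in this work.

```python
def quantize(x: float, total_bits: int, frac_bits: int) -> float:
    """Map x to a signed fixed-point value with total_bits bits,
    frac_bits of which are fractional. Assumes rounding to nearest
    (Python's round, ties to even) and saturation at the range limits."""
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    q = round(x * scale)
    q = max(q_min, min(q_max, q))  # saturate instead of wrapping around
    return q / scale


def mse(reference: list, approx: list) -> float:
    """Mean squared error between two equally long sequences."""
    return sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference)


# Hypothetical weights, used only to exercise the functions.
weights = [0.75, -0.3, 0.1234, -1.5, 0.02]

for frac_bits in (2, 4, 6):
    quantized = [quantize(w, total_bits=8, frac_bits=frac_bits) for w in weights]
    print(f"8-bit word, {frac_bits} fractional bits: MSE = {mse(weights, quantized):.6f}")
```

For values that stay within the representable range, increasing the number of fractional bits reduces the quantization error, while reducing the integer part increases the risk of saturation; this is the kind of trade-off a quantization analysis has to explore.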
Then, the accelerator architecture is described, focusing on the design effort made to exploit the intrinsic parallelism of FPGA devices.
The SCNN accelerator was implemented on several Xilinx and Intel FPGAs to analyse the portability of the design across different families. The results obtained are presented in terms of maximum achievable clock frequency, hardware resources needed for the network implementation, energy per inference, and power consumption. Finally, the proposed accelerator was compared with a commercial solution for on-the-edge AI applications: the Intel Movidius NCS.
This analysis shows that, with an FPGA-based solution, it is possible to outperform the NCS in terms of inference time and energy per inference.

Figure 4: Accuracy and MSE as a function of the number of bits for input layer words.

Figure 8: Maximum clock frequency and inference time for different Xilinx FPGA families.

Figure 10: Max frequency and inference time for the Zynq-7000 family.

Figure 11: Max frequency and inference time for the Zynq-US+ family.

Table 1: Convolutional parameters for the network.

Table 2: Results of the first quantization analysis.

Table 3: Results of the second quantization analysis.

Table 4: N clk values for the various layers.

Table 5: Hardware accelerator implementation on Xilinx FPGAs.

Table 6: Hardware accelerator implementation on Intel FPGAs.

Table 7: Power consumption for Xilinx and Intel FPGAs.
The results presented in this work highlight the value of FPGA solutions for accelerating CNN inference. They offer a remarkable trade-off between power consumption and inference time, resulting in interesting solutions for on-the-edge computing.

Table 10: Performance comparison between Xilinx FPGAs, Intel FPGAs, and NCS.