
In recent years, convolutional neural networks have been used in many applications, thanks to their ability to carry out tasks with fewer parameters than other deep learning approaches. However, the power consumption and memory footprint constraints typical of edge and portable applications usually collide with accuracy and latency requirements. For this reason, commercial hardware accelerators have become popular, thanks to architectures designed for the inference of general convolutional neural network models. Nevertheless, field-programmable gate arrays represent an interesting alternative, since they offer the possibility of implementing a hardware architecture tailored to a specific convolutional neural network model, with promising results in terms of latency and power consumption. In this article, we propose a full on-chip field-programmable gate array hardware accelerator for a separable convolutional neural network designed for a keyword spotting application. We started from the model implemented in a previous work for the Intel Movidius Neural Compute Stick. For our goals, we appropriately quantized the model through a bit-true simulation and realized a dedicated architecture exclusively using on-chip memories. A benchmark comparing results across different field-programmable gate array families by Xilinx and Intel with the implementation on the Neural Compute Stick was carried out. The analysis shows that the FPGA solution achieves better inference time and energy per inference with comparable accuracy, at the expense of a higher design effort and longer development time.

In recent years, convolutional neural networks (CNNs) have found application in many different fields, such as object detection [

For this reason, commercial hardware accelerators for CNNs such as Neural Compute Stick (NCS) [

Nevertheless, since they were designed for the implementation of generic CNNs, their architectures are extremely flexible at the expense of the optimization of each single model.

For this reason, hardware accelerators customized for a specific application may offer an interesting alternative for accelerating CNNs. In particular, field-programmable gate arrays (FPGAs) represent an interesting trade-off between cost, flexibility, and performance [

The aim of this paper is to investigate the use of custom FPGA-based hardware accelerators to realize a CNN-based KWS system, analysing their performance in terms of power consumption, number of hardware resources, accuracy, and timing. A KWS system is an example of an application whose porting to the edge requires considerable effort, owing to hard design trade-offs.

The study involves different FPGA families by Xilinx and Intel, analysing design portability across devices with different sizes and performance. This allowed us to realize a benchmark that compares the obtained results with the ones presented in our previous work for the full-SCNN (separable convolutional neural network) model [

To realize the architecture implemented on the FPGA, a bit-true simulation was performed to appropriately quantize the model, reducing the number of resources used, saving power, and increasing throughput when compared with a floating-point approach.

The remainder of the paper is structured as follows: the

KWS systems are a common component in speech-enabled devices: they continuously listen to the surrounding environment with the task of recognizing a small set of simple commands in order to activate or deactivate specific functionalities. Commercial examples of KWS systems include “

The KWS system was pretrained in the Python framework called

The proposed architecture is based on the SCNN described in [

SCNN architecture.

The input of the network is a 63 × 13 mel frequency spectral coefficient (MFSC) matrix, computed from audio acquired at a sample rate f_{sample} = 16 kHz.

The input layer provides the 63 × 13 MFSC input matrix. Then, three separable convolutional (SC) layers follow, and their generic structure is shown in Figure

SC hidden layer architecture.

SC layers improve standard convolutional layers by reducing the number of parameters used to process the inputs [

A standard convolutional layer contains

Convolutional layers: (a) classic CNN network and (b) SCNN network.
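As a rough illustration of the parameter saving described above, the following sketch compares the weight count of a standard convolutional layer with that of the time/frequency separated version, using hidden layer 0's shapes from the parameter table (the helper function names are ours, not from the paper):

```python
def standard_conv_params(kh, kw, c_in, c_out):
    # weight count of a standard convolutional layer (biases omitted)
    return kh * kw * c_in * c_out

def separable_tf_params(kh, kw, c_in, c_mid, c_out):
    # temporal (kh x 1) convolution into c_mid channels,
    # followed by a frequency (1 x kw) convolution into c_out channels
    time_params = kh * 1 * c_in * c_mid
    freq_params = 1 * kw * c_mid * c_out
    return time_params + freq_params

# hidden layer 0 of the SCNN: 5 x 1 temporal filter, 1 x 3 frequency
# filter, 1 input channel, 8 output channels
standard = standard_conv_params(5, 3, 1, 8)      # 120 weights
separable = separable_tf_params(5, 3, 1, 1, 8)   # 5 + 24 = 29 weights
```

Even for this small layer, the separated form needs roughly a quarter of the weights; the saving grows with the channel counts.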

Considering the structure of the MFSC input matrix, each SC layer performs two separate convolutions, realizing a “

A batch normalization (BN) layer, which accelerates deep network training by reducing internal covariate shift [

A classic convolutional layer follows the three SC layers.

The table below reports the filter dimensions, the number of input channels (N_{in}) and output channels (N_{out}), and the input/output matrix dimensions for each convolutional layer of the network. Time_0 and freq_0 are, respectively, the temporal and frequency convolutional layers of hidden layer 0, and similarly time_1/freq_1 for hidden layer 1 and time_2/freq_2 for hidden layer 2. Final_conv refers to the last convolutional layer of the network.

Convolutional parameters for the network.

Layer | Sublayer | Input matrix | Filter | N_{in} | N_{out} | Output matrix |
---|---|---|---|---|---|---|

Hidden layer 0 | Time_0 | 63 × 13 | 5 × 1 | 1 | 1 | 59 × 13 |

Freq_0 | 59 × 13 | 1 × 3 | 1 | 8 | 59 × 11 | |

Hidden layer 1 | Time_1 | 59 × 11 | 5 × 1 | 8 | 8 | 55 × 11 |

Freq_1 | 55 × 11 | 1 × 3 | 8 | 16 | 55 × 9 | |

Hidden layer 2 | Time_2 | 55 × 9 | 11 × 1 | 16 | 16 | 45 × 9 |

Freq_2 | 45 × 9 | 1 × 3 | 16 | 192 | 45 × 7 | |

Final_conv | 45 × 7 | 1 × 1 | 192 | 12 | 45 × 7 |
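The output sizes in the table follow the usual unpadded ("valid") convolution rule, out = in − k + 1 along each axis. A minimal check against a few rows (function name ours):

```python
def valid_conv_out(h, w, kh, kw):
    # output size of an unpadded ("valid") convolution with a kh x kw filter
    return h - kh + 1, w - kw + 1

# a few rows of the table above
assert valid_conv_out(63, 13, 5, 1) == (59, 13)   # time_0
assert valid_conv_out(59, 13, 1, 3) == (59, 11)   # freq_0
assert valid_conv_out(55, 9, 11, 1) == (45, 9)    # time_2
assert valid_conv_out(45, 7, 1, 1) == (45, 7)     # final_conv
```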

The average pooling layer computes the average value of each output channel of the final_conv layer, condensing them into 12 values, one for each class of the KWS system. Finally, a Softmax (or normalized exponential function) activation layer follows. It takes a vector

In this network, the Softmax input vector is composed of 12 elements, one for each class of the KWS system.
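A hedged numeric sketch of the Softmax stage (function name and logit values are illustrative, not from the paper):

```python
import math

def softmax(z):
    # subtract the maximum before exponentiating for numerical stability
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# an illustrative 12-element logit vector, one entry per KWS class
logits = [0.1, 2.3, -1.0, 0.4, 0.0, 1.1, -0.5, 0.2, 0.9, -2.0, 0.3, 0.6]
probs = softmax(logits)
# probs sums to 1, and the largest logit keeps the largest probability,
# which is why selecting the argmax before Softmax gives the same decision
```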

The proposed SCNN model was implemented on the Intel Movidius NCS, showing an accuracy of 87.77%. Its implementation requires about 15,000 parameters, including biases, weights, and batch normalization parameters.

In the next sections, methods to map the Keras–Python model of the KWS system onto an FPGA are analysed. This model is implemented in a high-level language, and its parameters are based on a floating-point representation.

The main issue concerning the implementation of a CNN-based model on an FPGA is the limited amount of hardware resources (combinatorial elements, sequential elements, digital signal processors (DSPs), RAM blocks, etc.) available on such devices [

Before the FPGA implementation, a quantization of the SCNN model was performed. In the literature, there are many examples of quantization applied to CNNs [

The quantization of the original floating-point model was performed through a bit-true simulation. The aim of the simulation is to determine the number of bits necessary to represent numbers in every internal node of the network while limiting the loss in accuracy.
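The elementary operation of such a bit-true simulation can be sketched as follows: each value is rounded to a signed fixed-point grid with a given total and fractional bit width, saturating out-of-range values instead of wrapping (a minimal sketch; the function name and parameter split are our assumptions, not the paper's exact procedure):

```python
def quantize(x, bits, frac_bits):
    # fixed-point quantization: round to the nearest multiple of 2^-frac_bits,
    # then saturate to the signed range representable with `bits` bits
    scale = 1 << frac_bits
    q = round(x * scale)
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    q = max(lo, min(hi, q))  # saturate instead of wrapping around
    return q / scale

print(quantize(0.3, 8, 6))    # 0.296875 (nearest representable value)
print(quantize(10.0, 8, 6))   # 1.984375 (saturated at the positive limit)
```

Sweeping `bits` per layer and re-running inference on the test set yields accuracy-versus-bit-width curves like those discussed below.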

The fixed-point representation of the model weights (or filter elements)

At the end of each layer, the acceptable number of truncated bits

Instead, saturating

A possible optimization of the model consists in quantizing the weights of the last convolutional layer separately, by using

Another technique to reduce the complexity of the hardware accelerator is pruning. The bias terms of the temporal convolutional layers have values in the order of 10^{−9}–10^{−7}. Considering their small values with respect to the other network parameters, they were pruned to reduce the model size. Indeed, it is possible to eliminate the temporal bias terms without significantly affecting accuracy, while reducing the number of sums to be computed.
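This kind of magnitude-based pruning amounts to zeroing parameters below a threshold; a minimal sketch (function name, threshold, and the example values are illustrative):

```python
def prune_small(params, threshold=1e-6):
    # zero out parameters whose magnitude falls below the threshold
    return [0.0 if abs(p) < threshold else p for p in params]

# illustrative bias vector: the ~1e-9..1e-7 entries mimic the temporal biases
biases = [3e-9, -8e-8, 0.12, 5e-8, -0.07]
print(prune_small(biases))  # [0.0, 0.0, 0.12, 0.0, -0.07]
```

In hardware, a pruned bias simply removes the corresponding adder from the MAC datapath.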

In this section, the results obtained from the quantization process are presented and discussed.

The SCNN model of this network has many degrees of freedom. For this reason, the first simulation step aims to identify a starting point for a more complex analysis, focusing only on the quantization of the input layer words and of the weights.

Figure

Accuracy and MSE to the change of the number of bits for input layer words.

Figure

Accuracy and MSE to the change of the number of bits of filter elements.

This first analysis was the starting point for a more detailed design exploration, involving the number of bits for the representation of the output of each layer.

Table

b_in: number of bits for the representation of input words.

b_filter: number of bits for the representation of filters.

bit_out_0: number of bits for the representation of the outputs of the first hidden layer.

bit_out_1: number of bits for the representation of the outputs of the second hidden layer.

bit_out_2: number of bits for the representation of the outputs of the third hidden layer.

bit_out_fc: number of bits for the representation of the outputs of the last convolutional layer.

Results of the first quantization analysis.

b_in | b_filter | bit_out_0 | bit_out_1 | bit_out_2 | bit_out_fc | Accuracy (%) |
---|---|---|---|---|---|---|

5 | 12 | 10 | 8 | 10 | 10 | 90.23 |

5 | 12 | 8 | 8 | 10 | 10 | 90.14 |

5 | 12 | 8 | 8 | 10 | 8 | 89.74 |

5 | 11 | 8 | 8 | 10 | 10 | 88.91 |

4 | 12 | 8 | 8 | 10 | 12 | 88.87 |

4 | 12 | 8 | 8 | 10 | 10 | 88.78 |

5 | 11 | 8 | 8 | 10 | 8 | 88.60 |

4 | 12 | 8 | 8 | 10 | 8 | 88.46 |

5 | 11 | 8 | 8 | 10 | 8 | 88.40 |

4 | 11 | 8 | 8 | 10 | 12 | 87.84 |

4 | 11 | 8 | 8 | 10 | 10 | 87.61 |

The collected data show that quantization can even increase model accuracy: the best accuracy obtained for the floating-point model is 87.77%, whereas for the fixed-point representation the highest accuracy is 90.23%.

The second part of the simulation considers a different quantization for the final_conv layer due to the inclusion of the average pooling effects, as explained in Section

Results of the second quantization analysis.

No. | b_in | b_filter | b_last | bit_out_0 | bit_out_1 | bit_out_2 | bit_out_fc | Accuracy (%) |
---|---|---|---|---|---|---|---|---|

1 | 5 | 8 | 6 | 8 | 8 | 10 | 12 | 89.88 |

2 | 5 | 8 | 6 | 8 | 8 | 10 | 10 | 89.74 |

3 | 4 | 12 | 6 | 8 | 8 | 10 | 12 | 88.87 |

4 | 4 | 12 | 6 | 8 | 8 | 10 | 10 | 88.75 |

5 | 4 | 12 | 6 | 8 | 8 | 10 | 12 | 88.21 |

6 | 4 | 12 | 6 | 8 | 8 | 10 | 10 | 88.09 |

7 | 4 | 8 | 6 | 8 | 8 | 10 | 10 | 88.09 |

8 | 4 | 8 | 6 | 8 | 8 | 10 | 10 | 87.61 |

9 | 4 | 11 | 6 | 8 | 8 | 10 | 12 | 87.55 |

10 | 4 | 8 | 6 | 8 | 8 | 10 | 10 | 87.43 |

11 | 5 | 8 | 6 | 8 | 8 | 8 | 10 | 87.39 |

12 | 4 | 12 | 6 | 8 | 8 | 10 | 8 | 87.36 |

13 | 4 | 11 | 6 | 8 | 8 | 10 | 10 | 87.29 |

14 | 5 | 8 | 6 | 8 | 8 | 8 | 8 | 86.93 |

These models show smaller hardware requirements than the single-quantization versions presented in Table

The model chosen for the FPGA implementation considers both accuracy and the possibility of shrinking parameter representations. Model number (7) from Table

This section describes the architecture of the hardware accelerator that was implemented on different FPGA families. Thanks to the reduced number of parameters of the SCNN investigated in our previous work [

Figure

SCNN architecture for FPGA implementation.

An input memory is used as an interface between the hardware accelerator and the system that records and processes the audio samples. The input memory stores the 4-bit input data. The time/frequency layers and the final_conv layer perform convolution operations and store the results into a RAM used as a buffer. Once a layer completes an entire convolution, the next one starts reading its input matrix from that memory.

Each of the seven convolutional layers has its own MAC module to perform multiply-accumulate operations. Figure

MAC module architecture.

Each MAC module requires N_{clk} clock cycles to compute its layer's output, as shown in the following equation: N_{clk} = N_{out} × H_{out} × W_{out}, where N_{out} is the number of output channels and H_{out} and W_{out} are the dimensions of the output matrix. The table below reports N_{clk} for each convolutional layer of the network, considering the values of N_{out}, H_{out}, and W_{out} listed in the convolutional-parameters table.

N_{clk} values for the various layers.

Layer | N_{clk} |
---|---|

Input memory | 819 |

Time_0 | 767 |

Freq_0 | 5192 |

Time_1 | 4840 |

Freq_1 | 7920 |

Time_2 | 6480 |

Freq_2 | 60480 |

Final_conv | 3780 |

Total | 90278 |
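The cycle counts above are the product of each layer's output channel count and output matrix dimensions; a minimal sketch reproducing them (the dictionary and names are ours, the shapes come from the convolutional-parameters table):

```python
# (N_out, H_out, W_out) for each stage, taken from the tables above;
# the input memory is read once per input element (1 x 63 x 13)
stages = {
    "input_memory": (1, 63, 13),
    "time_0": (1, 59, 13),
    "freq_0": (8, 59, 11),
    "time_1": (8, 55, 11),
    "freq_1": (16, 55, 9),
    "time_2": (16, 45, 9),
    "freq_2": (192, 45, 7),
    "final_conv": (12, 45, 7),
}
n_clk = {name: n * h * w for name, (n, h, w) in stages.items()}
total = sum(n_clk.values())
print(n_clk["freq_2"], total)  # 60480 90278
```

This reproduces both the per-layer values and the total of 90278 cycles per inference.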

A greater parallelization of MAC operations would speed up the accelerator, reducing the inference time. On the other hand, it is generally not possible to perform an arbitrary number of operations per clock cycle because of the limited number of FPGA resources (combinatorial logic, DSPs, etc.). Furthermore, if the level of parallelism is too high, routing can become the bottleneck of the implementation.

It is possible to boost MAC module operations by increasing the number of output elements computed per clock cycle: if P elements are computed in parallel, N_{clk} is reduced by a factor of P.

Whilst this strategy leads to better timing, it increases the design effort necessary to find the best combination that fits on a specific FPGA device. Indeed, the appropriate level of parallelism depends on each layer's contribution to the total N_{clk}. In this specific case, the freq_2 layer contributes 60480 of the 90278 total clock cycles due to its very high number of output channels (192). For this reason, the MAC module of the freq_2 layer was customized to calculate 4 values of the output matrix per clock cycle. This reduces its N_{clk} from 60480 to 15120 and, consequently, the total inference time from 90278 to 44918 clock cycles, roughly halving it. A similar parallelization of the other convolutional layers would increase hardware resources without a significant improvement in timing performance, because of their limited effect on the overall inference time.
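The arithmetic above can be sketched directly (helper name ours): only the parallelized layer's cycle count shrinks, so the payoff depends on that layer's share of the total.

```python
def total_cycles(total, layer_cycles, p):
    # only the parallelized layer's cycle count is divided by p;
    # the remaining layers are unaffected
    return total - layer_cycles + layer_cycles // p

print(total_cycles(90278, 60480, 4))  # 44918
print(total_cycles(90278, 60480, 1))  # 90278 (no parallelization)
```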

As previously specified, the batch normalization operations were absorbed into the frequency convolutional layer of each SC layer. The average pooling layer was included in final_conv, which provides 12 outputs, corresponding to the sum of all the elements belonging to the output matrix of each output channel. Finally, the Softmax layer can be omitted: to provide a direct decision on the pronounced word, it is sufficient to select the maximum value among the twelve outputs of final_conv.
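Absorbing batch normalization into a convolution relies on the standard folding identity: since BN applied to an affine output is again affine, its scale and shift can be merged into the convolution's weights and bias. A minimal scalar sketch (symbols and function name ours, following the usual BN definition):

```python
import math

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # BN(w*x + b) = gamma * (w*x + b - mean) / sqrt(var + eps) + beta,
    # which is again an affine map w'*x + b'
    s = gamma / math.sqrt(var + eps)
    return w * s, (b - mean) * s + beta

w2, b2 = fold_bn(2.0, 1.0, gamma=0.5, beta=0.1, mean=0.2, var=1.0, eps=0.0)
x = 3.0
# folded layer matches conv followed by BN
assert abs((w2 * x + b2) - (0.5 * (2.0 * x + 1.0 - 0.2) / 1.0 + 0.1)) < 1e-12
```

In hardware, this removes the BN stage entirely: only the pre-scaled weights and biases are stored on-chip.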

This architecture was chosen because its simplicity increases the likelihood of fitting the hardware accelerator in a target FPGA, reducing design time and increasing design portability among devices of different sizes.

This section describes the performance of the hardware accelerator on different FPGA families. The presented architecture was implemented on several Xilinx and Intel devices to analyse its design portability on FPGAs with different sizes and performance characteristics. Results are presented in terms of hardware resource occupation, maximum achievable clock frequency, inference time, and power consumption. Finally, an analysis of how MAC module parallelization influences design portability to smaller FPGAs is provided.

The devices included in the analysis are as follows:

Zynq UltraScale+ (US+), xczu9eg-ffvb1156-2-e [

Virtex UltraScale+, xcvu3p-ffvc1517-2-e [

Virtex UltraScale (US), xcvu065-ffvc1517-2-e [

Zynq-7000, xc7z045ffg900-1 [

Virtex-7, xc7vx330tffg1157-2 [

Kintex-7 low voltage (lv), xc7k160tfbg484-2L [

Artix-7 low voltage (lv), xc7a200tfbg484-2 [

Arria 10 GX, 10AX027H3F35E2SG [

Stratix V GS, 5SGSMD4E1H29C1 [

Stratix V GX, 5SEE9F45C2 [

Stratix V E, 5SEE9H40C2 [

Cyclone V, 5CEFA9U19C8 [

All the implementations were realized by using

Table

Hardware accelerator implementation on Xilinx FPGAs.

FPGA family | Comb. elem. | Comb. elem. (%) | Seq. elem. | Seq. elem. (%) | BRAM | BRAM (%) | LRAM | LRAM (%) |
---|---|---|---|---|---|---|---|---|

Zynq US+ | 81345 | 30 | 860 | <1 | 228 | 25 | 2560 | 2 |

Virtex US+ | 81367 | 21 | 864 | <1 | 228 | 32 | 2560 | 1 |

Virtex US | 81427 | 23 | 952 | <1 | 228 | 18 | 2560 | 3 |

Zynq-7000 | 76283 | 35 | 632 | <1 | 244 | 45 | 0 | 0 |

Virtex-7 | 76163 | 37 | 632 | <1 | 244 | 33 | 0 | 0 |

Kintex-7 lv | 81737 | 81 | 633 | <1 | 244 | 75 | 0 | 0 |

Artix-7 lv | 87406 | 86 | 1081 | <1 | 228 | 70 | 0 | 0 |

Hardware accelerator implementation on Intel FPGAs.

FPGA family | Comb. elem. | Comb. elem. (%) | Seq. elem. | Seq. elem. (%) | BRAM | BRAM (%) | DSP | DSP (%) |
---|---|---|---|---|---|---|---|---|

Arria 10 GX | 23722 | 23 | 296 | <1 | 344 | 46 | 323 | 39 |

Stratix V GS | 25532 | 18 | 2851 | <1 | 344 | 36 | 323 | 31 |

Stratix V GX | 23370 | 7 | 1860 | <1 | 344 | 13 | 323 | 92 |

Stratix V E | 23099 | 7 | 1843 | <1 | 344 | 13 | 323 | 92 |

Cyclone V | 24111 | 21 | 2911 | <1 | 392 | 32 | 323 | 94 |

All the implementations refer to the version of the accelerator in which the MAC module of the freq_2 layer was parallelized to compute 4 elements of its output matrix per clock cycle. The structure and number of combinatorial/sequential elements, as well as the memory dimensions and types, are specific to each device. Please refer to the FPGA datasheets for more information about the architecture of Xilinx devices [

The figures below report the maximum clock frequency and the corresponding inference time for each device, computed considering the total N_{clk} value listed in the table above.

Maximum clock frequency and inference time for different Xilinx FPGA families.

Maximum clock frequency and inference time for different Intel FPGA families.

A power analysis was performed for both Xilinx and Intel FPGAs. To obtain a more accurate estimation of the power consumption for Xilinx devices, a postimplementation timing simulation was carried out by using

Power consumption for Xilinx and Intel FPGAs.

Device | Static power (W) | Dynamic power (W) | Total power (W) |
---|---|---|---|

Artix 7 | 0.151 | 0.892 | 1.043 |

Kintex-7 lv | 0.110 | 0.859 | 0.969 |

Zynq-7000 | 0.215 | 1.172 | 1.387 |

Virtex 7 | 0.204 | 1.147 | 1.351 |

Virtex-US | 0.626 | 1.235 | 1.861 |

Virtex-US+ | 0.839 | 1.302 | 2.141 |

Zynq-US+ | 0.627 | 1.532 | 2.259 |

Cyclone V | 0.570 | 1.731 | 2.301 |

Stratix V E | 1.607 | 2.150 | 3.757 |

Stratix V GS | 0.857 | 3.153 | 4.010 |

Arria 10 | 0.272 | 0.730 | 1.002 |

Stratix V GX | 1.244 | 2.141 | 3.385 |

An analysis of the hardware accelerator portability has been carried out in order to investigate how the proposed design fits in smaller FPGAs. In particular, the freq_2 layer has been customized to compute 1, 2, 4, and 8 elements (

Two FPGAs with different sizes belonging to the same family were selected:

xc7z045ffg900-2 (xc7z045) and xc7z030fbg484-2 (xc7z030) for the Zynq-7000 family [

xczu9eg-ffvb1156-2-e (xczu9eg) and xczu3eg-sfva625-2L-e (xczu3eg) for the Zynq UltraScale+ family [

Tables

Design portability analysis for Zynq-7000 family.

Zynq-7000 family | Num mac. | Comb. elem. (%) | Seq. elem. (%) | BRAM (%) | DSP (%) |
---|---|---|---|---|---|

xc7z030 | 1 | 85 | <1 | 92 | 0 |

2 | 88 | <1 | 92 | 0 | |

4 | 55 | <1 | 92 | 100 | |

8 | — | — | — | — | |

xc7z045 | 1 | 33 | <1 | 45 | 0 |

2 | 35 | <1 | 45 | 0 | |

4 | 38 | <1 | 45 | 0 | |

8 | 44 | <1 | 45 | 0 |

Design portability analysis for Zynq-US+ family.

Zynq-US+ family | Num mac. | Comb. elem. (%) | Seq. elem. (%) | BRAM (%) | LRAM (%) | DSP (%) |
---|---|---|---|---|---|---|

xczu3eg | 1 | 54 | <1 | 100 | 16 | 92 |

2 | 64 | <1 | 100 | 16 | 100 | |

4 | 70 | <1 | 100 | 16 | 100 | |

8 | — | — | — | — | — | |

xczu9eg | 1 | 25 | <1 | 27 | 0 | 0 |

2 | 26 | <1 | 27 | 0 | 0 | |

4 | 30 | <1 | 27 | 0 | 0 | |

8 | 36 | <1 | 27 | 0 | 0 |

The xc7z030 and the xczu3eg have a limited number of hardware resources and only the version of the accelerator with

Figures

Max frequency and inference time for the Zynq-7000 family.

Max frequency and inference time for the Zynq-US+ family.

Similarly, when

In this section, the FPGA-based accelerator is compared with a commercial hardware accelerator for machine learning on the edge: the Intel Movidius Neural Compute Stick.

The same model of SCNN keyword spotting was implemented on the NCS in our previous work [

The NCS is a commercial deep learning hardware accelerator hosting the Myriad 2 VPU by Intel Movidius [

4 Gb of LPDDR3 DRAM

12 very long instruction word (VLIW) streaming hybrid architecture vector engine (SHAVE) processors optimized for machine vision used to run parts of a neural network in parallel

2 MB of on-chip memory shared between the SHAVE processors and fixed-function accelerators

2 Leon microprocessors that coordinate the reception of the network graph file and of inputs via USB connection

The Myriad 2 VPU supports fully connected, convolutional (with arbitrary sized kernel), and depthwise convolutional layers.

The NCS implements the floating-point version of the SCNN model with a maximum accuracy of 87.77%. Quantization allows this value to increase to 90.23%, even if our preferred implementation has an accuracy of 88.09%.

The inference time for the SCNN implemented on the NCS is approximately 10 ms. The FPGA-based accelerator has a lower inference time in all the FPGA implementations presented, ranging from 1.43 ms for the Cyclone V to 0.39 ms for the Zynq-US+. Finally, the NCS power consumption is 0.81 W. This result was obtained considering the hardware setup of our previous work [

As shown in the table below, for every FPGA device the energy per inference (E_{inf}) is lower than that of the NCS. In fact, E_{inf} can be calculated as shown in the following equation: E_{inf} = P_{total} × t_{inf}, where P_{total} is the device's total power consumption and t_{inf} its inference time.

Indeed, even if Xilinx and Intel devices show a higher power consumption, the significantly lower inference time results in a lower energy per inference.

The table below summarizes the comparison, considering the maximum clock frequency (f_{clk}) of each FPGA in order to minimize the inference time.

Performance comparison between Xilinx FPGAs, Intel FPGAs, and NCS.

Device | f_{clk} (MHz) | Inference time (ms) | Total power (W) | Energy (mJ) |
---|---|---|---|---|

Xilinx FPGA families | ||||

Artix 7 | 47.6 | 0.94 | 1.043 | 0.98 |

Kintex-7 lv | 48.2 | 0.93 | 0.969 | 0.90 |

Zynq-7000 | 67.8 | 0.65 | 1.387 | 0.90 |

Virtex 7 | 63.5 | 0.71 | 1.351 | 0.96 |

Virtex-US | 78.4 | 0.57 | 1.861 | 1.01 |

Virtex-US+ | 104.2 | 0.43 | 2.141 | 0.92 |

Zynq-US+ | 116.4 | 0.39 | 2.259 | 0.88 |

Intel FPGA families | ||||

Cyclone V | 31.4 | 1.43 | 2.301 | 3.29 |

Stratix V E | 57.4 | 0.78 | 3.757 | 2.9 |

Stratix V GS | 60.3 | 0.74 | 4.010 | 2.96 |

Arria 10 | 61 | 0.73 | 1.002 | 0.73 |

Stratix V GX | 80 | 0.56 | 3.385 | 1.9 |

Intel movidius neural compute stick | ||||

NCS | 600 | 10 | 0.810 | 8.1 |
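The inference time and energy columns are consistent, up to rounding, with t_{inf} = N_{clk}/f_{clk} and E_{inf} = P_{total} × t_{inf}. A minimal check for two devices from the table (helper names ours; small discrepancies are rounding artifacts):

```python
N_CLK = 44918  # total cycles with the freq_2 MAC computing 4 outputs/cycle

def inference_ms(f_clk_mhz):
    # inference time in milliseconds at a given clock frequency
    return N_CLK / (f_clk_mhz * 1e6) * 1e3

def energy_mj(power_w, f_clk_mhz):
    # energy per inference in millijoules: watts times milliseconds
    return power_w * inference_ms(f_clk_mhz)

# Zynq-US+ (116.4 MHz, 2.259 W) and Artix 7 (47.6 MHz, 1.043 W)
print(inference_ms(116.4), energy_mj(2.259, 116.4))
print(inference_ms(47.6), energy_mj(1.043, 47.6))
```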

Results show that FPGAs offer great design flexibility, allowing inference time and power consumption to be tuned through the choice of platform. FPGAs are promising devices for the implementation of CNN-based hardware accelerators for portable applications, in particular those requiring low latency and high accuracy. Indeed, in the investigated cases, inference time is reduced by a factor of approximately 7 to 25, and energy per inference by a factor of approximately 2.5 to 9.

Finally, Figure

Inference time/power consumption trade-off analysis.

The results presented in this work highlight the value of FPGA solutions for accelerating CNN inference. They offer a remarkable trade-off between power consumption and inference time, making them interesting solutions for edge computing.

It is necessary to underline that these results were possible, thanks to the use of a CNN model optimized for resource-constrained devices [

Finally, the proposed full on-chip design guarantees a straightforward processing architecture (i.e., no data scheduling from external memories and no management of shared inference processing elements), further reducing the overall system design time. However, when compared with NCS and other

This article presented a full on-chip FPGA-based hardware accelerator for keyword spotting on the edge. The KWS system was described, focusing on its realization through a machine-learning algorithm and on translating the AI-on-the-edge paradigm into practice.

Starting from a

The data used to support the findings of this study are included within the article.

The authors declare that there are no conflicts of interest regarding the publication of this paper.