Dedicated hardware implementations of artificial neural networks promise faster, lower-power operation than software implementations executing on microprocessors, but these implementations rarely have the flexibility to adapt and train online under dynamic conditions. A typical design process for artificial neural networks involves offline training using software simulations, followed by offline synthesis and hardware implementation of the obtained network. This paper presents a design of block-based neural networks (BbNNs) on FPGAs capable of dynamic adaptation and online training. Specifically, the network structure and the internal parameters, the two components of the multiparametric evolution of BbNNs, can be adapted intrinsically, in the field, under the control of the training algorithm. This ability enables deployment of the platform in dynamic environments, thereby significantly expanding the range of target applications, deployment lifetimes, and system reliability. The potential and functionality of the platform are demonstrated using several case studies.
Artificial Neural Networks (ANNs) are popular among the machine intelligence community and have been widely applied to problems such as classification, prediction, and approximation. These are fully or partially interconnected networks of computational elements called artificial neurons. An artificial neuron is the basic processing unit of the ANN that computes an output function of the weighted sum of its inputs and a bias. Depending on the interconnection topology and the dataflow through the network, ANNs can be classified as feedforward or recurrent. The design process with ANNs involves a training phase during which the network structure and synaptic weights are iteratively tuned under the control of a training algorithm to identify and learn the characteristics of the training data patterns. The obtained network is then tested on previously unseen data. ANNs are effective at identifying and learning nonlinear relationships between input and output data patterns and have been successfully applied in diverse application areas such as signal processing, data mining, and finance.
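To make the neuron computation concrete, the following minimal Python sketch implements a single artificial neuron with a logistic sigmoid activation. The function names and example values are illustrative only and do not come from the paper.

```python
import math

def sigmoid(z):
    """Logistic sigmoid activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Output of one artificial neuron: the activation of the
    weighted sum of its inputs plus a bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Example: a 3-input neuron with arbitrary weights and bias
print(neuron_output([0.5, -1.0, 2.0], [0.8, 0.2, -0.4], bias=0.1))
```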
The explicit computational parallelism in these networks has generated significant interest in accelerating their execution using custom neural hardware circuitry implemented on technologies such as application-specific integrated circuits (ASICs) and field programmable gate arrays (FPGAs). The design process for ANNs typically involves offline training using software simulations, and the obtained network is then implemented and deployed offline. Hence the training and design processes have to be repeated offline for every new application of ANNs. This paper is an extended version of [
The rest of the paper is organized as follows. Section
There has been significant interest in custom ANN implementations and many have been reported in the literature over the years. Dedicated hardware units for artificial neural networks are called neurochips or neurocomputers [
Neural network hardware classification.
Digital neural network implementations offer high computational precision, reliability, and programmability. They are targeted to either ASICs or FPGAs. The synaptic weights and biases for these implementations are typically stored in digital memories or registers, either on- or off-chip, as dictated by the design tradeoffs between speed and circuit size. ASIC neurochips can achieve higher processing speeds, lower power, and higher density than corresponding FPGA implementations but have significantly higher design and fabrication costs. FPGAs are commercial off-the-shelf (COTS) chips and can be reconfigured and reused for different ANN applications, significantly lowering implementation costs for low-volume production. The last decade has seen considerable advancement in reconfigurable hardware technology. FPGA chips with built-in RAMs, multipliers, gigabit transceivers, on-chip embedded processors, and faster clock speeds have attracted many neural network designers. Compared to analog implementations, however, digital implementations have relatively larger circuit sizes and higher power consumption.
Digital neural implementations represent real-valued data such as weights and biases using fixed point, floating point, or specialized representations such as pulse stream encoding. The choice of a particular representation is a tradeoff among arithmetic circuit size and speed, data precision, and the available dynamic range for the real values. Floating point arithmetic units are slower, larger, and more complicated to design than their fixed point counterparts.
Generally, floating point representations of real-valued data for neural networks are found in custom ASIC implementations. Aibe et al. [
For FPGA implementations the preferred implementation choice has been fixed point due to chip capacity restrictions, but advances in FPGA densities may make floating point representations practical. Moussa et al. demonstrate implementations of MLP on FPGAs using fixed and floating point representations [
As a third alternative, many researchers have proposed specialized data encoding techniques that can simplify arithmetic circuit designs. Marchesi et al. proposed special training algorithms for multilayer perceptrons that restrict weight values to powers of two, allowing multiplications to be implemented as simple bit shifts [
Chujo et al. have proposed an iterative calculation algorithm for the perceptron type neuron model that is based on a multidimensional, binary search algorithm [
An important design choice for neural network hardware implementations is the degree of structure adaptation and synaptic parameter flexibility. An implementation with fixed network structure and weights can only be used in the recall stage of each unique application, thus necessitating a redesign for different ANN applications. For ASIC implementations this can be quite expensive due to high fabrication costs. An advantage of FPGA ANN implementations is the capability of runtime reconfiguration to retarget the same FPGA device for any number of different ANN applications, substantially reducing implementation costs. There are several different motivations for using FPGAs in ANN implementations, such as prototyping and simulation, density enhancement, and topology adaptation. The purpose of using FPGAs for prototyping and simulation is to thoroughly test a prototype of the final design for correctness and functionality before retargeting the design to an ASIC. This approach was used in [
Runtime reconfiguration provides the flexibility to retarget the FPGA for different ANN designs but is impractical for the dynamic adaptations required by online training. The overheads associated with runtime reconfiguration are on the order of milliseconds, so the cost of the repetitive reconfigurations required by iterative training procedures may outweigh any benefits of online adaptation. The design presented in this paper is an online trainable ANN implementation on FPGAs that supports dynamic structure and parameter updates without requiring any FPGA reconfiguration.
ASIC implementations of flexible neural networks that can be retargeted for different applications have been reported in the literature. The Neural Network Processor (NNP) from Accurate Automation Corp. was a commercial neural hardware chip that could be adapted online [
ANN training algorithms iteratively adapt network structure and synaptic parameters based on an error function between expected and actual outputs. Hence an on-chip trainable network design should have the flexibility to dynamically adapt its structure and synaptic parameters. Most reported ANN implementations use software simulations to train the network, and the obtained network is targeted to hardware offline [
The following references discuss analog and hybrid implementations that support on-chip training. Zheng et al. have demonstrated a digital implementation of the backpropagation learning algorithm along with an analog transconductance-model neural network [
Activation functions, or transfer functions, used in ANNs are typically nonlinear, monotonically increasing sigmoid functions; examples include the hyperbolic tangent and logistic sigmoid functions. Digital ANN implementations typically realize piecewise linear approximations of these functions in hardware, either as a direct circuit implementation or as a look-up table. Omondi et al. show an implementation of piecewise linear approximation of activation functions using the CORDIC algorithm on FPGAs [
Analog artificial neurons are more closely related to their biological counterparts. Many characteristics of analog electronics can be helpful for neural network implementations. Most analog neuron implementations use operational amplifiers to directly perform neuron-like computations, such as integration and sigmoid transfer functions. These can be modeled using physical processes such as summing of currents or charges. Also, the interface to the environment may be easier as no analog-to-digital and digital-to-analog conversions are required. Some of the earlier analog implementations used resistors for representing free network parameters such as synaptic weights [
Hybrid implementations combine analog, digital, and other strategies such as optical communication links with mixed mode designs in an attempt to get the best that each can offer. Typically hybrid implementations use analog neurons taking advantage of their smaller size and lower power consumption and use digital memories for permanent weight storage [
Due to the large number of interconnections, routing quickly becomes a bottleneck in digital ASIC implementations. Some researchers have proposed hybrid designs using optical communication channels. Maier et al. [
Custom neural network hardware implementations can best exploit the inherent parallelism of computations in artificial neural networks. Many implementations have relied on offline training of neural networks using software simulations; the trained neural network is then implemented in hardware. Although these implementations achieve good recall speedups, they are not directly comparable to the implementation reported here, which supports on-chip training of neural networks. On-chip trainable neural hardware implementations have also been reported in the literature. Most of the reported ones are custom ASIC implementations such as the GRD chip by Murakawa et al. [
A block-based neural network (BbNN) is a network of neuron blocks interconnected in the form of a grid as shown in Figure
Block-based Neural Network topology.
Four different internal configurations of a basic neuron block: (a) 1/3, (b) 2/2 (left), (c) 2/2 (right), and (d) 3/1 configurations.
Outputs of the neuron block are a function of the summation of weighted inputs and a bias, as shown below.
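The referenced equation is not reproduced in this excerpt. A standard formulation consistent with the description above, using assumed notation, is

$$ y_k = g\Bigl( b_k + \sum_i w_{ik}\, x_i \Bigr), $$

where \(x_i\) are the block inputs, \(w_{ik}\) the synaptic weights on the connections to output \(k\), \(b_k\) the bias of that output, and \(g(\cdot)\) the activation function.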
Three different
Moon and Kong [
BbNN training is a multiparametric optimization problem involving simultaneous structure and weight optimization. Because the search space is multimodal and nondifferentiable, global search techniques such as genetic algorithms are preferred over local search techniques for exploring suitable solutions. Genetic algorithms (GAs) are evolutionary algorithms inspired by the Darwinian evolutionary model, wherein a population of candidate solutions to a problem (individuals or phenotypes), encoded in abstract representations (called chromosomes or genotypes), is evolved over multiple generations towards better solutions. The evolution process applies genetic operators such as selection, crossover, mutation, and recombination to the chromosomes to generate successive populations, with selection pressure against the least fit individuals. Figure
Flowchart of the genetic evolution process.
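The evolutionary loop in the flowchart can be summarized by the Python sketch below. The specific operator choices (truncation-style selection, one-point crossover, uniform per-gene mutation, and two-individual elitism) are assumptions for illustration, not the exact operators used for BbNN training.

```python
import random

def evolve(pop, fitness, n_generations, cx_rate=0.8, mut_rate=0.05):
    """Generic GA loop: evaluate, select, recombine, mutate, repeat.
    Assumes real-valued genes and a population of at least four."""
    for _ in range(n_generations):
        scored = sorted(pop, key=fitness, reverse=True)
        next_pop = scored[:2]                      # elitism: keep the two fittest
        while len(next_pop) < len(pop):
            p1, p2 = random.sample(scored[:len(pop) // 2], 2)  # fitter half only
            child = list(p1)
            if random.random() < cx_rate:          # one-point crossover
                cut = random.randrange(1, len(p1))
                child = list(p1[:cut]) + list(p2[cut:])
            for i in range(len(child)):            # uniform per-gene mutation
                if random.random() < mut_rate:
                    child[i] = random.uniform(-1.0, 1.0)
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)
```

For BbNN training, each chromosome would carry both the structure gene and the weight genes described in this section.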
BbNN encoding.
Many FPGA ANN implementations are static, targeted and configured offline for individual applications. The main design objective of this implementation is to enable intrinsic adaptation of network structure and internal parameters, so that the network can be trained online without relying on runtime FPGA reconfiguration. Previous versions of this implementation were reported in [
BbNN logic diagram.
The ability to use variable bit-widths for arithmetic computations gives FPGAs significant performance and resource advantages over competing computational technologies such as microprocessors and GPUs. Numerical accuracy, performance, and resource utilization are inextricably linked, and the ability to exploit this relationship is a key advantage of FPGAs. Higher precision comes at the cost of lower performance, higher resource utilization, and increased power consumption; at the same time, lower precision may increase round-off errors, adversely impacting circuit functionality.
The inputs, outputs, and internal parameters such as synaptic weights and biases in a BbNN are all real valued. These can be represented either as floating point or as fixed point numbers. Floating point representations offer a significantly wider dynamic range and higher precision than fixed point representations. However, floating point arithmetic circuits are more complicated, have a larger silicon footprint, and are significantly slower than their fixed point counterparts. On the other hand, fixed point representations suffer higher round-off errors when operating on data with a large dynamic range. Although escalating FPGA device capacities have made floating point arithmetic circuits feasible for many applications, their use in our application would severely restrict the size of the network (i.e., the number of neuron blocks per network) that can be implemented on a single FPGA chip. Thus our implementation uses fixed point arithmetic for representing real-valued data. Also, [
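As a rough illustration of the fixed point choice, the sketch below quantizes real values into a 16-bit signed fixed point format. The Q4.11 split (1 sign bit, 4 integer bits, 11 fractional bits) is an assumed example format, not one specified by the paper.

```python
FRAC_BITS = 11          # assumed fractional bits (Q4.11 format)
SCALE = 1 << FRAC_BITS  # 2**11 = 2048 quantization steps per unit

def to_fixed(x):
    """Quantize a real value to 16-bit signed fixed point, saturating on overflow."""
    v = int(round(x * SCALE))
    return max(-(1 << 15), min((1 << 15) - 1, v))

def to_real(v):
    """Convert a fixed point value back to a real number."""
    return v / SCALE

w = 0.73
print(to_fixed(w), to_real(to_fixed(w)))  # round-off error bounded by 1/(2*SCALE)
```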
Activation functions used in ANNs are typically nonlinear, monotonically increasing sigmoid functions. A custom circuit implementation of these functions may be more area efficient than piecewise linearly approximated values stored in lookup tables (LUTs) but is inflexible and involves complicated circuitry. In our design we have chosen to use the internal FPGA BRAMs to implement LUTs, which can be preloaded and reloaded as necessary with activation function values. The size of the required lookup table is directly determined by the data widths used: a 16-bit fixed point representation requires an LUT that is 16 bits wide and 2^16 (65,536) entries deep.
Illustrating activation function implementation in LUT.
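A software model of the LUT approach is sketched below: the activation function is sampled once for every representable 16-bit input pattern, so evaluation reduces to a table index. The hyperbolic tangent and the Q4.11 interpretation reuse the assumptions of the previous sketch.

```python
import math

FRAC_BITS = 11
SCALE = 1 << FRAC_BITS

# Precompute one activation value per 16-bit input pattern (2**16 entries).
# The table can be reloaded to switch activation functions, mirroring how
# the design reloads its BRAM-based LUTs.
LUT = [int(round(math.tanh(((v - 65536) if v >= 32768 else v) / SCALE) * SCALE))
       & 0xFFFF
       for v in range(1 << 16)]

def activate(v):
    """Look up the activation of a 16-bit fixed point value (two's complement)."""
    return LUT[v & 0xFFFF]
```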
Kothandaraman designed a core library of neuron blocks with different internal block configurations for implementation on FPGAs [
A simplistic design could combine all cores in the library into one larger neuron block and use a multiplexor to select the individual core with the correct internal configuration, but the area and power overheads of such an approach render it impractical. Instead, a smarter block design is presented here that can emulate all internal block configurations dynamically as required and is less than a third the size of the simplistic larger block. For this reason the block design is called the Smart Block-based Neuron (SBbN). An internal configuration register within each SBbN, called the Block Control and Status Register (BCSR), regulates the configuration settings for the block. The BCSR is a 16-bit register and is part of the configuration control logic section of the neuron block that defines the state and configuration mode of the block. All bits of this register except bits 8 through 11 are read-only, and the register can be memory- or I/O-mapped into the host system's address space for read and write operations. Figure
Block control and status register (BCSR).
SBbN emulation of the internal block configurations based on BCSR settings.
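Since only bits 8 through 11 of the BCSR are host-writable, any host write must be masked so the read-only status bits are preserved. A minimal model of that behavior, with an assumed mask constant, is:

```python
BCSR_WRITABLE_MASK = 0x0F00  # bits 8-11 are host-writable; all others read-only

def bcsr_write(current, value):
    """Model a host write to a BCSR: only bits 8-11 are updated;
    the remaining status bits keep their hardware-driven values."""
    return (current & ~BCSR_WRITABLE_MASK) | (value & BCSR_WRITABLE_MASK)

# Example: request configuration field 0b1010 in bits 8-11
print(hex(bcsr_write(0x8001, 0b1010 << 8)))  # -> 0x8a01
```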
The configuration control logic takes the structure gene within the BbNN chromosome array as its input and sets the BCSR configuration bits of all the neuron blocks in the network. The translation process from the structure gene array into the internal configuration bits of the neuron blocks is illustrated in Figure
Gene translation process within the configuration control logic.
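In software terms, the translation step maps each block's entry in the structure gene to the 4-bit configuration field of its BCSR. The gene values and bit encodings below are hypothetical placeholders for illustration:

```python
BCSR_CFG_SHIFT = 8       # configuration field occupies BCSR bits 8-11
BCSR_CFG_MASK = 0x0F00

# Hypothetical structure-gene values naming the four internal configurations
CONFIG_BITS = {
    "1/3":       0b0001,  # one input, three outputs
    "2/2_left":  0b0010,  # two inputs, two outputs (left variant)
    "2/2_right": 0b0100,  # two inputs, two outputs (right variant)
    "3/1":       0b1000,  # three inputs, one output
}

def translate(structure_gene, bcsrs):
    """Write each block's configuration into bits 8-11 of its BCSR."""
    return [(bcsr & ~BCSR_CFG_MASK) | (CONFIG_BITS[cfg] << BCSR_CFG_SHIFT)
            for cfg, bcsr in zip(structure_gene, bcsrs)]

print(translate(["1/3", "3/1"], [0x0000, 0x0000]))  # -> [256, 2048]
```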
The address decoder provides a memory-mapped interface for the read/write registers and memories in the network design. It decodes address, data, and control signals for the input and output register files, the BbNN chromosome array, the BCSRs within each neuron block, and the activation function LUTs.
To enable network scalability across multiple FPGAs, the data synchronization between neuron blocks is asynchronous. Synchronization is achieved through the generation and consumption of tokens, as explained next. The logic associates a token with each input and output register of every neuron block in the network. Each neuron block can only compute outputs (i.e., fire) when all of its input tokens are valid. On each firing the neuron block consumes all of its input tokens and generates output tokens. The generated output tokens in turn validate the corresponding input tokens of the neuron blocks next in the dataflow. This is illustrated in Figure
Dataflow synchronization logic.
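The token discipline can be modeled in a few lines: a block fires only when every one of its input tokens is valid, and firing consumes those tokens while generating output tokens that validate the downstream blocks' inputs. The block and port names below are illustrative:

```python
def try_fire(block, tokens):
    """Fire a neuron block if and only if all of its input tokens are valid.
    Firing consumes the input tokens and validates the downstream input
    ports fed by this block's outputs."""
    if not all(tokens[p] for p in block["inputs"]):
        return False                  # not ready: some input token missing
    for p in block["inputs"]:
        tokens[p] = False             # consume all input tokens
    for p in block["outputs"]:
        tokens[p] = True              # generate output tokens downstream
    return True

# Two blocks in sequence: b0's output feeds b1's input port "b1.in0"
tokens = {"b0.in0": True, "b1.in0": False, "b1.out0": False}
b0 = {"inputs": ["b0.in0"], "outputs": ["b1.in0"]}
b1 = {"inputs": ["b1.in0"], "outputs": ["b1.out0"]}
print(try_fire(b0, tokens), try_fire(b1, tokens))  # -> True True
```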
The design features explained above enable dynamic adaptations to network structure and internal synaptic parameters. Hence the network can be trained online, intrinsically under the control of the evolutionary training algorithm described in Section
Escalating FPGA logic densities have enabled building a programmable system-on-chip (PSoC) with soft- or hard-core processors, memory, bus system, IO, and other custom cores on a single FPGA device. The Xilinx Virtex-II Pro FPGA used for prototyping our implementation has on-chip hard-core PPC405 embedded processors, which are used to design our PSoC prototyping platform. The platform is designed using the Xilinx EDK and ISE tools. A block diagram of the designed platform is shown in Figure
PSoC platform used for prototyping BbNN implementation.
In the case of target environments with less stringent area and power constraints, other higher-capacity embedded solutions such as single-board computers with FPGA accelerators can be used. The GA operators can thus be implemented on the on-board processor and the fitness evaluation can be performed in the FPGA using the BbNN core. Table
Peak and relative computational capacities and capacity per mW of commercial embedded processors. Relative values are normalized to PPC405 numbers.
Processor | Cycle freq | Power | MOPS | Relative MOPS | MOPS/mW | Relative MOPS/mW
---|---|---|---|---|---|---
MIPS 24Kc | 261 MHz | 363 mW | 87 | 0.65 | 0.24 | 0.14
MIPS 4KE | 233 MHz | 58 mW | 78 | 0.59 | 1.33 | 0.76
ARM 1026EJ-S | 266 MHz | 279 mW | 89 | 0.67 | 0.32 | 0.18
ARM 11MP | 320 MHz | 74 mW | 107 | 0.80 | 1.45 | 0.83
ARM 720T | 100 MHz | 20 mW | 33 | 0.25 | 1.67 | 0.95
PPC 405 | 400 MHz | 76 mW | 133 | 1.00 | 1.75 | 1.00
PPC 440 | 533 MHz | 800 mW | 178 | 1.34 | 0.22 | 0.13
PPC 750FX | 533 MHz | 6.75 W | 355 | 2.67 | 0.05 | 0.03
PPC 970FX | 1 GHz | 11 W | 667 | 5.02 | 0.06 | 0.03
The postsynthesis timing analysis for the design reports a clock frequency of 245 MHz on the Xilinx Virtex-II Pro FPGA (XC2VP30). Computational capacity is a measure of throughput defined as computational work per unit time; for an artificial neural network it is expressed as the number of synaptic connections processed per second (CPS). The designed block takes 10 clock cycles to compute the outputs for six synaptic connections, giving each block a computational capacity of 147 MCPS. The computational capacity of the network is determined by the number of concurrent block executions, which in turn depends on the network structure. At peak computational capacity one block from each network column computes concurrently. Hence an
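The per-block figure follows directly from the reported numbers, and the peak network capacity scales with the number of columns, as the short calculation below verifies (the five-column value is just an example):

```python
clock_hz = 245e6           # postsynthesis clock estimate
cycles_per_fire = 10       # cycles per block output computation
connections_per_fire = 6   # synaptic connections processed per firing

block_cps = clock_hz / cycles_per_fire * connections_per_fire
print(block_cps / 1e6)     # -> 147.0 MCPS per block

# At peak, one block per column fires concurrently, so an m x n
# network sustains roughly n * 147 MCPS regardless of m.
n_columns = 5
print(n_columns * block_cps / 1e6)  # -> 735.0 MCPS
```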
Table
Device utilization summary on Xilinx Virtex-II Pro FPGA (XC2VP30).
Network size | Slice registers used | Slice utilization | Block RAMs used | BRAM utilization | Multipliers used | MULT utilization
---|---|---|---|---|---|---
| 2724 | 19% | 8 | 5% | 12 | 8%
| 4929 | 35% | 16 | 11% | 24 | 17%
| 7896 | 57% | 24 | 17% | 36 | 26%
| 10589 | 77% | 32 | 23% | 48 | 35%
| 12408 | 90% | 40 | 29% | 60 | 44%
| 3661 | 26% | 8 | 5% | 18 | 13%
| 7327 | 53% | 16 | 11% | 36 | 26%
| 11025 | 80% | 24 | 17% | 54 | 39%
| 14763 | 107% | 32 | 23% | 72 | 52%
| 18456 | 134% | 40 | 29% | 90 | 66%
| 4783 | 34% | 8 | 5% | 24 | 17%
| 9646 | 70% | 16 | 11% | 48 | 35%
| 14587 | 106% | 24 | 17% | 72 | 52%
| 19508 | 142% | 32 | 23% | 96 | 70%
| 24461 | 178% | 40 | 29% | 120 | 88%
Device utilization summary on Xilinx Virtex-II Pro FPGA (XC2VP70).
Network size | Slice registers used | Slice utilization | Block RAMs used | BRAM utilization | Multipliers used | MULT utilization
---|---|---|---|---|---|---
| 2497 | 7% | 8 | 2% | 12 | 3%
| 4929 | 14% | 16 | 4% | 24 | 7%
| 7390 | 22% | 24 | 7% | 36 | 10%
| 9915 | 29% | 32 | 9% | 48 | 14%
| 12403 | 37% | 40 | 12% | 60 | 18%
| 3661 | 11% | 8 | 2% | 18 | 5%
| 7327 | 22% | 16 | 4% | 36 | 10%
| 11025 | 33% | 24 | 7% | 54 | 16%
| 14788 | 44% | 32 | 9% | 72 | 21%
| 18461 | 55% | 40 | 12% | 90 | 27%
| 22233 | 67% | 48 | 14% | 108 | 33%
| 25652 | 77% | 56 | 17% | 126 | 38%
| 29254 | 88% | 64 | 19% | 144 | 43%
| 4783 | 14% | 8 | 2% | 24 | 7%
| 9646 | 29% | 16 | 4% | 48 | 14%
| 14561 | 44% | 24 | 7% | 72 | 21%
| 19534 | 59% | 32 | 9% | 96 | 29%
| 24470 | 73% | 40 | 12% | 120 | 36%
| 29221 | 88% | 48 | 14% | 144 | 43%
| 34389 | 103% | 56 | 17% | 168 | 51%
Results from three case studies are presented. The case studies on
A parity classifier determines the value of the parity bit needed to obtain an even or odd number of 1’s in an “
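As a reference for what the network must learn, a software model of the parity target function is only a few lines; the even/odd flag below is illustrative:

```python
def parity_bit(bits, even=True):
    """Parity bit that makes the total number of 1's even (or odd)."""
    bit = sum(bits) % 2          # 1 if the count of 1's is currently odd
    return bit if even else 1 - bit

# 3-bit even-parity truth table, the target the BbNN must learn
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print((a, b, c), "->", parity_bit([a, b, c]))
```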
Fitness evolution trends for (a) 3-bit and (b) 4-bit parity examples.
Structure evolution trends for (a) 3-bit and (b) 4-bit parity examples. Each color represents a unique BbNN structure.
Evolved networks for (a) 3-bit and (b) 4-bit parity examples.
This case study uses a well-known dataset in the machine learning community, originally compiled by R. A. Fisher [
Training error for Iris data classification.
BbNN fitness evolution trends for Iris data classification.
BbNN structure evolution trends for Iris data classification. Each color represents a unique BbNN structure.
Evolved network for Iris data classification.
The objective of this case study is to demonstrate online training benefits of the BbNN platform. This feature is beneficial for applications in dynamic environments where changing conditions may otherwise require offline retraining and deployment to maintain system reliability. This simulation study will use a BbNN to predict future values of ambient luminosity levels in a room. The network will be pretrained offline to predict ambient luminosity levels in an ideal setup and then deployed in the test room. The actual ambient luminosity levels in the test room can be different from the training data due to various external factors such as a sunny or a cloudy day, number of open windows, and closed or open curtains. The actual levels can be recorded in real time using light sensors and used for online training of the BbNN predictor to improve its future predictions. This study could be applied to many applications sensitive to luminosity variations such as embryonic cell or plant tissue cultures.
The pretraining setup for our experiment is as follows. Figure
Pretraining result: (a) actual and predicted luminosity levels, and (b) training error.
(a) Average and maximum fitness values. (b) GA parameters used for training.
Evolved network after pretraining.
The evolved network from the previous step is deployed in the simulated test room. Two case studies are considered that simulate the ambient luminosity variations in the test room. The first represents a cloudy day with lower ambient luminosity levels as compared to the ones considered in the offline training step and the second represents a sunny day with higher luminosity levels. These are shown in Figure
Luminosity levels used for offline training, cloudy day, and sunny day test cases.
The deployed BbNN platform is set up to trigger an online retraining cycle upon observing a prediction error greater than 5%. For the cloudy day test case the BbNN predicts the ambient luminosity reasonably well until 7:50 hours, when the first retraining trigger is issued. The second retraining trigger is issued at 17:50 hours. The actual and predicted luminosity values and the corresponding trigger points are shown in Figure
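A minimal model of the trigger logic is sketched below; the use of relative error is an assumption, since the excerpt does not state how the 5% figure is computed:

```python
ERROR_THRESHOLD = 0.05   # retrain when prediction error exceeds 5%

def check_trigger(predicted, actual):
    """Return True when the relative prediction error exceeds the
    threshold, signaling that an online retraining cycle should start."""
    error = abs(predicted - actual) / abs(actual)
    return error > ERROR_THRESHOLD

# Example: the sensor reads 80 lux but the network predicted 70 lux
print(check_trigger(70.0, 80.0))  # 12.5% error -> True, retrain
```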
Online evolution operation for the cloudy day showing the two trigger points.
Prediction errors with and without online retraining.
Evolved networks after retraining triggers: (a) first trigger point and (b) second trigger point.
In the sunny day test case the pretrained network performs poorly and requires eight retraining trigger points as shown in Figure
Online training result for the sunny day test case.
Prediction errors with and without retraining cycles for the sunny day test case.
Evolved network after the eighth retraining cycle for the sunny day test case.
This case study demonstrates the online training benefits of the BbNN platform and its potential for applications in dynamic environments. Figures
In this paper we present an FPGA design for BbNNs that is capable of on-chip training under the control of genetic algorithms. The typical design process for ANNs involves a training phase performed using software simulations, after which the obtained network is implemented and deployed offline; the training and design processes thus have to be repeated offline for every new application of ANNs. The novel online adaptability features of our design, demonstrated using several case studies, expand the potential applications for BbNNs to dynamic environments and provide increased deployment lifetimes and improved system reliability. The platform has been prototyped on two FPGA boards: a stand-alone Digilent Inc. XUP development board [
This work was supported in part by the National Science Foundation under Grant nos. ECS-0319002 and CCF-0311500. The authors also acknowledge the support of the UT Exhibit, Performance, and Publication Expense Fund and thank the reviewers for their valuable comments which helped them in improving this manuscript.