On Training Efficiency and Computational Costs of a Feed Forward Neural Network: A Review

A comprehensive review on the problem of choosing a suitable activation function for the hidden layer of a feed forward neural network has been widely investigated. Since the nonlinear component of a neural network is the main contributor to the network mapping capabilities, the different choices that may lead to enhanced performances, in terms of training, generalization, or computational costs, are analyzed, both in general-purpose and in embedded computing environments. Finally, a strategy to convert a network configuration between different activation functions without altering the network mapping capabilities will be presented.


Introduction
Neural networks (NNs) are generally accepted in literature as a versatile and powerful tool for nonlinear mapping of a generic -dimensional nonlinear function. The mapping capabilities of a NN are strictly related to the nonlinear component found in the activation function (AF) of the neurons. Indeed, without the presence of a nonlinear activation function, the NN would be a simple linear interpolator. The most generic representation of a NN is a group of elementary processing units (neurons) characterized by a weighted connection to other input units. The processing of the unit consists of a linear part, where the inputs are linearly combined through the weights values, and a nonlinear part, where the weighted combination of the inputs is passed through an activation function, which is usually a threshold/squashing function. The nonlinear part of a NN is completely separated from the linear combination of the weighted inputs, thus opening a large number of possibilities for the choice of an activation function. Given the representation of the elementary unit, the inner architecture of the NN expresses the way those units are connected between themselves and to the inputs and outputs of the NN itself. Numerous authors studied the mapping capabilities of a NN, according to the inner architecture. In particular, it has been proved that the simple feed forward architecture with a single layer [1] and multiple layer [2][3][4][5][6] can be used as universal approximator given mild assumptions on hidden layer. A feed forward neural network (FFNN) is a NN where the inner architecture is organized in subsequent layers of neurons, and the connections are made according to the following rules: every neuron of a layer is connected to all (and only) the neurons of the subsequent layer. This topology rule excludes backward connections, found in many recurrent NNs [7][8][9][10][11], and layer-skipping, found in particular NN architectures like Fully Connected Cascade (FCC) [12]. Another focal point is the a-dynamicity of the architecture: in a FFNN no memory or delay is allowed, thus making the network useful only to represent static models. On this particular matter, several studies showed that, even by lacking dynamic capabilities, a FFNN can be used to represent both the function mapping and its derivatives [13]. Nevertheless, the choice of a suitable activation function for a FFNN, and in general, for a NN, is subject to different criterions. The most common considered criterions are training efficiency and computational cost. The former is especially important in the occurrence that a NN is trained in a general-purpose computing environment (e.g., using Matlab); the latter is critical in embedded systems 2 Computational Intelligence and Neuroscience (e.g., microcontrollers and FPGA (Field Programmable Gate Array)) where computational resources are inherently limited. The core of this work will be to give a comprehensive overview of the possibilities available in literature concerning the activation function for FFNN. Extending some of the concepts shown in this work could be possible to dynamic NNs as well; however training algorithms for dynamic NNs are completely different from the one used for FFNN, and thus the reader is advised to exert a critical analysis on the matter. The paper will be structured as follows. In the first part of this survey, different works focusing on the training efficiency of a FFNN with a specific AF will be presented. In particular, three subareas will be investigated: first, the analytic AFs, which enclose all the variants on the squashing functions proposed in the classic literature [14,15]; then, the fuzzy AFs, which exploit complex membership functions to achieve faster convergence during the training procedure; last, the adaptive AFs, which focus on shaping the NN nonlinear response to mimic as much as possible the mapping function properties. In the second part of this survey, the topic of computational efficiency will be explored through different works, focusing on different order approximations found in literature. In the third and last part of this survey, a method to transform FFNN weights and biases to change the AF of the hidden layer without need to retrain the network is presented. Conclusions will follow in the fourth part. An appendix, containing the relevant figures of merit reported by the authors in their work, closes this paper.

Analytic Activation Functions.
The commonly used backpropagation algorithm for FFNN training suffers from slow learning speed. One of the reasons for this drawback lies in the rule for the computation of the FFNN's weights correction matrix, which is calculated using the derivative of the activation function for the FFNN's neurons. The universal approximation theorem [1] states that one of the conditions for the FFNN to be a universal approximator is for the activation function to be bounded. For these reasons, most of the activation functions show a high derivative near the origin and a progressive flattening moving towards infinity. This means that, for neurons having a sum of weighted inputs very large in magnitude, learning rate will be very slow. A detailed comparison between different simple activation functions based on exponentials and logarithms can be found in [16], where the authors investigate the learning rate and convergence speed on a character recognition problem and the classic XOR classification problem, proposing the use of the inverse tangent as a fast-learning activation function. The authors compare the training performance, in terms of Epochs required to learn the task, of the proposed inverse tangent function, against the classic sigmoid and hyperbolic tangent functions, and the novel logarithmic activation function found in [17], finding a considerable performance gain. In [18], the sigmoid activation function is modified by introducing the square of the argument, enhancing the mapping capabilities of the NN. In [19], two activation functions, one based on integration of the triangular function and one on the difference between two sigmoids (log-exponential), are proposed and compared through a barycentric plotting technique, which projects the mapping capabilities of the network in a hyper dimensional cube. The study has shown that logexponential function has been slowly accelerated but it was effective in MLP network with backpropagation learning. In [20] a piecewise interpolation by means of cubic splines is used as an activation function, providing performances comparable to the sigmoid function with reduced computational costs. In [21], the proposed activation function is derived by Hermite orthonormal polynomials. The criterion is that every neuron in the hidden layer is characterized by a different AF, which is more complex for every neuron added. Through extensive simulations, the authors prove that such network shows great performance in comparison to analogous FFNN with identical sigmoid AFs. In [22], the authors propose a performance comparison between eight different AFs, including the stochastic AF and the novel "neural" activation function, obtained by the combination of a sinusoidal and sigmoid activation function. Two tests sets are used for comparison: breast cancer and thyroid diseases related data. The work shows that the sigmoid AFs yield, overall, the worst accuracy, and the hyperbolic tangent and the neural AF perform better on breast cancer dataset and thyroid disease dataset, respectively, pointing out the dataset dependence of the AF capabilities. The "neural" AF is investigated in [23] as well (in this work, it is referred to as "periodic"), where the NN is trained by the extended Kalman filter algorithm. The network is tested, against classic sigmoid and sinusoidal networks, in handwriting recognition, time series prediction, parity generation, and XOR mapping. The authors prove that the periodic function proposed outperforms both classic AFs in terms of training convergence. In [24], the authors suggest the combination of sigmoid and sinusoidal and Gaussian activation function, to exploit their independent space division properties. The authors compare the hybrid structure in a multifrequency signal classification problem, concluding that even if the combination of the three activation functions performs better than the sigmoid (in terms of convergence speed) and the Gaussian (in terms of noise rejection), the sinusoidal activation function by itself still achieves better results. Another work investigating an activation function based on sinusoidal modulation can be found in [25], where the authors propose a cosine modulated Gaussian function. The use of sinusoidal activation function is deeply investigated in [26], where the authors present a comprehensive comparison between eight different activation functions on eight different problems. Among other results, the Sinc activation function is proved as a valid alternative to the hyperbolic tangent, and the sinusoidal activation function has good training performance on small FFNNs. In Table 1, a summary of the different analytical AFs proposed in this paragraph is shown.

Fuzzy
Logic. An approach often found in literature is to combine FFNN with fuzzy logic (FL) techniques to create fast converging FFNN. In [27], the authors define the hyperbolic Computational Intelligence and Neuroscience 3

Ref.
Name Expression Notes [14,15] Step-like [19,22] Logarithmic-Exponential is responsible for centering the function, distances the sigmoid, and is a coefficient that controls the slope Where , , and are tunable constants [24,26] Wave Where controls the slope while governs the frequency of the resultant function [23] Sinusoidal Where is a parameter related to the frequency of the sine function [25] CosGauss The parameter in the function scales the gradient of the Gaussian function; meanwhile controls the frequency of repetition of the "hills" [26] Sinc Where and are frequencies and and are proportional parameters 4 Computational Intelligence and Neuroscience tangent transfer using three different membership functions, defining in fact the classical activation function by means of the fuzzy logic methodology. The main advantage during the training phase is a low computational cost, achieved since weight updating is not always necessary. The performance validation is verified through a comparison between two identically initialized FFNNs, one with the hyperbolic tangent and one with the proposed activation function. The two FFNNs are tested on the problems of XOR gate, onestep-ahead prediction of chaotic time series, equalization of communication channels, and independent components analysis. The authors in [28] use a Type 2 Fuzzy Function, originally proposed in [29,30], as the activation function for a FFNN, and compare the training performance on the classic XOR problem and induction motor speed prediction. An additional application of NN with fuzzy AFs can be found in [31], where the authors use the network for detection of epileptic seizure, processing and classifying EEG signals.
In [32], the authors propose an analytic training method, noted as extreme machine learning (EML), for FFNNs with fuzzy AFs. The authors test the procedure on problems of medical diagnosis, image classification, and satellite image analysis.

Adaptive Strategies.
Although the universal mapping capabilities of a FFNN have been proven, some assumptions on the activation function can be overlooked in favor of a more efficient training procedure. Indeed, it has been seen that a spectral similarity between the activation function and the desired mapping gives improved performance in terms of training [33]. The extreme version of this approach consists in having an activation function that is modified during the training procedure, creating in fact an ad hoc transfer function for neurons. Variations of this approach can be found for problems of biomedical signal processing [34], structural analysis [35], and data mining procedures [36]. The training algorithm for such networks requires taking into consideration the activation function adaptation, as well as the weights tuning. The authors in [37] propose a simple BP-like algorithm to train a NN with trainable AF and compare the training performance with a classic sigmoid activation function on both XOR problem and a nonlinear mapping.

Activation Functions for Fast Computation
The computational cost of a FFNN can be split into two main contributions. The first one is a linear cost, deriving from the operations needed to perform the sum of the weighted inputs of each neuron. The second one, nonlinear, is related to the computation of the activation function. In a computational environment, those operations are carried out considering a particular precision format for numbers. Considering the AF is usually limited in a small codomain, the use of integer arithmetic is unadvisable. Fixed-point and floating-point arithmetic are the most commonly used to compute the FFNN elementary operations. The linear part of the FFNN is straightforward: operations of products and sums are carried out by multipliers and adders (usually found in the floatingpoint unit (FPU) of an embedded device). The nonlinear part, featuring transcendental expressions, is carried out through complex arithmetic evaluations (IEEE 754 is the reference standard) that, in the end, still use elementary computational blocks like adders and multipliers. Addition and product with floating-point precision are complex and long operations, and the computational block that executes these operations often features pipeline architectures to speed up the arithmetic process. Although a careful optimization of the linear part is required to completely exploit pipeline capabilities [38], the ratio between the two costs shows, usually, that the linear quota of the operations is negligible when compared to the nonlinear part. In embedded environment, computational resources are scarce, in terms of both raw operations per second and available memory (or resources, for synthesizable digital circuits like FPGAs and ASICs).
Since embedded applications usually require real-time interaction, the development of NN applications in embedded environments shows the largest contributions in terms of fast and light solutions for AFs computation. Three branches of approaches can be found in literature: PWL (piecewise linear) interpolation, LUT (Lookup -Table) interpolation, and higher order/hybrid techniques. The following part of this survey will present the different approaches found in literature, grouped under these three sets. Table Approximations. Approximation by LUT is the simplest approach that can be used to reduce the computational cost of a complex function. The idea is to store inside the memory (i.e., a table) samples from a subdomain of the function and access those instead of calculating the function. The table either can be preloaded in the embedded device (e.g., in the flash memory of a microcontroller) or could be calculated at run-time with values stored in the heap. Both alternatives are valid, and the choice is strictly application dependent. In the first case, the static approach occupies a memory section that is usually more available.

Lookup-
In the second case, the LUT is saved in the RAM memory, which is generally smaller than the flash; however, in this case the LUT is adjustable in case a finer (or coarser) version is needed. A variation of the simple LUT is the RA-LUT (Range Addressable LUT), where each sample corresponds not only to a specific point in the domain, but to a neighborhood of the point. In [39] the authors propose a comparison between two FPGA architectures which uses floating-point accelerators based on RA-LUT to compute fast AFs. The first solution, refined from the one proposed in [40,41], implements the NN on a soft processor and computes the AFs through a smartly spaced RA-LUT. The second solution is an arithmetic chain coordinated by a VHDL finite state machine. Both sigmoid and hyperbolic tangent RA-LUTs are investigated. In [42] an efficient approach for LUT address decoding is proposed for the hyperbolic tangent function. The decoder is based on logic expressions derived by mapping the LUT ranges into subdomains of the original function. The result obtained is Computational Intelligence and Neuroscience 5 LUT implementation that requires less hardware resources to be synthesized. A particular consideration was given to AFs derivatives in [43] where the authors studied the LUT and RA-LUT for on-chip learning applications. The authors conclude that the RA-LUT implementation yields on-chip training performances comparable with the Full-Precision Floating Point even introducing more than 10% precision degradation.

Piecewise Linear Approximations.
Approximation through linear segments of a function can be easily carried out in embedded environment since, for every segment, the approximated value can be computed by one multiplication and one addition. In [44] the authors propose an implementation of a NN on two FPGA devices by Xilinx, the Virtex-5 and the Spartan 3. On both devices, the NN is implemented in VLSI language and as a high-level software running on microblaze soft processor. The authors conclude that although the implementation in VLSI is more complex, the parallel capabilities of the hardware neurons yield much higher performance overall. In [45] authors propose a low-error implementation of the sigmoid AF on a Virtex-4 FPGA device, with a comprehensive analysis on error minimization. To further push the precision versus speed tradeoff, in [46] the authors present a comparison between PWL and LUT techniques under both floating-point and fixed-point arithmetic, testing the approximation on a simple XOR problem. In [47] four different PWL techniques (three linear and one quadratic that will be discussed in the next section) are analyzed considering the hardware resources required for implementation, the errors caused by the approximation, the processing speed, and the power consumption. The techniques are all implemented using the System Generator, a Simulink/Matlab toolbox released by Xilinx. The first technique implemented by the authors is called A-Law approximation, which is based on a PWL approximation where every segment has a gradient expressed as a power of two, thus making it possible to replace multipliers with adders [48]. The second technique is the Alippi and Storti-Gajani approximation [49], based on a segmentation of the function in specific breakpoints where the value can be expressed as sum of power of two numbers. The third technique, called PLAN (Piecewise Linear Approximation of a Nonlinear Function), uses simple digital gate design to perform a direct transformation from input to output [50]. In this case, the shift/add operations, replacing the multiplications, were implemented with simple gate design that maps directly the input values to sigmoidal outputs. All the three techniques are compared together and against the classic LUT approach. The authors conclude that, overall, the best results are obtained through PLAN approximation. Approximation through PLAN method is used in [51] as well, where the authors propose a complete design in System Generator for on-chip learning through online backpropagation algorithm.

Hybrid and Higher Order Techniques.
Hybrid techniques try to combine both LUT and PWL approximations to obtain a solution that yields a compromise between the accuracy of the PWL approximation and the speed of the LUT. Higher order techniques push the boundaries and try to represent the AF through higher order approximation (e.g., polynomial fitting). An intuitive approach is to use Taylor series expansion around origin and is used by [52,53] at the 4th and 5th order, respectively. In [54,55] the authors propose two approaches, one composed of a PWL approximation coupled with a RA-LUT and one composed of the PWL approximation with a combinatorial logic circuit. The authors compare the solutions with classic (Alippi, PLAN) PWL approximation focusing on resources required and accuracy degradation. Combination of LUT and PWL approximation is also used in [56,57], where different authors investigate the approximation in fixed point for the synthesis of exponential and sigmoid AFs. In [39,58] authors propose a piecewise II-degree polynomial approximation of the activation function for both the sigmoid and the hyperbolic tangent AFs. Performance, resources required, and precision degradation are compared to full-precision and RA-LUT solutions. In [59] authors propose a semiquadratic approximation of the AF (similar to the one proposed in [19]) and compare the embedded performances in terms of throughput and consumed area against simple PWL approximation. In [60] a simple 2nd-order AF, which features an origin transition similar to the hyperbolic tangent, is proposed. The digital complexity of this function is in the order of a binary product, since one of the two products required to obtain a 2ndorder function is performed by a binary shift. A similar 2ndorder approximation is proposed by Zhang et al. in [61] and is applied in [47] where the authors comprehensively compare it to 1st PWL techniques and LUT interpolations. In [62] the proposed approach is based on a two-piece linear interpolation of the activation function, later refined by correction factors stored in a LUT. This hybrid approach reduces the dynamic range of the data stored in the LUT, thus making it possible to save space on both the adder and the table itself.

Weight Transformation.
Different papers shown in this survey pointed out advantages and drawbacks of using an AF instead of another one. Two very common AFs that are found in almost any comparison are the sigmoid activation function and the hyperbolic tangent activation function.
Considering an embedded application, as the one suggested in [39], where the activation function is directly computed by a floating-point arithmetic chain of blocks, using a sigmoid AF instead of a hyperbolic tangent AF allows synthesizing the chain with less arithmetic units. However, as shown in several papers in Section 2, the low derivative of the sigmoid AF makes it a poor candidate for training purposes when compared to the hyperbolic tangent. In this final part of the survey, a set of transformation rules, for a single layer FFNN with arbitrary inputs and outputs, is proposed. The rules allow modifying weights and biases of a NN so that 6 Computational Intelligence and Neuroscience Right side of (1) shows the output for multiple inputs single output (MISO) FFNN with sigmoid activation function. Left side of (1) shows the output MISO FFNN with hyperbolic tangent activation function. In (2) and (3), right hand side and left hand side have been, respectively, rearranged to highlight the matrix relations that allow AF transformations. Indeed (1), (2), and (3) express the relations for a MISO FFNN, but it is straightforward to generalize a translation procedure valid for a MIMO network. Tables 2 and 3 allow an easy translation of weights and biases to switch between sigmoid and hyperbolic function AFs ([1] denotes a 1-byvector, where is the number of output neurons). A possible strategy to exploit these relations would be to create a tanh based FFNN in a general-purpose environment, like Matlab. Then, after the FFNN has been trained, translate the weights in sigmoid form to obtain a FFNN that features a simpler AF.

Conclusions
The topic of the choice of a suitable activation function in a feed forward neural network was analyzed in this paper. Two aspects of the topic were investigated: the choice of an AF with fast training convergence and the choice of an AF with good computational performance.
The first aspect involved several works proposing alternative AFs, analytical or not, that featured quick and accurate convergence during the training process, while retaining generalization capabilities. Instead of finding a particular solution that performs better than the others, the aspect that was highlighted by several works is that the simple sigmoid functions that were introduced with the first FFNN, although granting the universal interpolation capabilities, are far from being the most efficient choice for a neural problem. Apart from this negative consideration, no dealbreaker conclusion can be drawn from the comparative analysis found in literature. Indeed, it has been shown that some preliminary analysis on the mapping function allows the choice of a "similar" AF that could perform better on a specific problem. However, this conclusion is a methodological consideration, rather than a position on a particular advantageous AF choice. On the other hand, the adaptive strategies proposed, which feature tunable AFs that can mimic some of the mapping function characteristics, suffer from a more complex and error prone training algorithm.
The second aspect involved both approximations of the classic AFs found in literature and brand new AFs specifically proposed for embedded solutions. The goal pursued by the works summarized in this section was to find a good deal between accuracy, computational costs, and memory footprint, in a limited resources environment like an embedded system. It is obvious that the direct use of a FFNN in embedded environment, without the aid of some kind of numerical approximation, is unfeasible for any real-time application. Even using a powerful computation system, featuring floating-point arithmetic units, the fullprecision IEEE 745 compliant AF computation is a plain waste of resources. This is especially true for many embedded applications where the FFNN is used as a controller [63] that process data obtained from low precision sources [64] (e.g., a 10-Bit Analog-to-Digital Converter, which is found on numerous commercial microcontroller units). The point of the analysis was to show the wide range of possibilities that are available for the implementation, since, in this case, the particular choice of AFs is strictly application dependent. The optimum trade point between accuracy, throughput, and memory footprint cannot be generalized.
Unfortunately, the outcome of a successful NN solution of a problem is influenced by several aspects that cannot be separated from the AF choice. The choice of the training algorithm, for example, can go from simple backpropagationlike algorithms to heuristics assisted training, with all the possible choices in between. The suitable sizing of the hidden layer and the training/validation sets [65], the a priori simplification of the problem dimensionality [66], and use of pruning techniques like OBD (Optimal Brain Damage) [67] and OBS (Optimal Brain Surgeon) [68] to remove unimportant weights from a network all influence the outcome as well. And even if a standard procedure could be thought and proposed, there is no universally recognized benchmark that could be used to test and compare different solutions: as it can be seen from the works proposed, NN approach is tested on real world problems. This is not due to the lack of recognized benchmarks in literature [69] but rather due to the attitude of authors proposing a particular NN solution (whether it is a novel AF, an architecture, or a training algorithm), to test it only for a specific application. This hinders the transfer potential of the solution, enclosing it in the specific application it was thought for.

Appendix
This appendix contains a summarized description, in form of tables (see Tables 4-9), of the results in terms of convergence error and performance reported by the authors in their articles. Some terms and acronyms, used in Tables 4, 5