A Novel Low-Bit Quantization Strategy for Compressing Deep Neural Networks

The increase in sophistication of neural network models in recent years has exponentially expanded memory consumption and computational cost, thereby hindering their applications on ASIC, FPGA, and other mobile devices. Therefore, compressing and accelerating the neural networks are necessary. In this study, we introduce a novel strategy to train low-bit networks with weights and activations quantized by several bits and address two corresponding fundamental issues. One is to approximate activations through low-bit discretization for decreasing network computational cost and dot-product memory. The other is to specify weight quantization and update mechanism for discrete weights to avoid gradient mismatch. With quantized low-bit weights and activations, the costly full-precision operation will be replaced by shift operation. We evaluate the proposed method on common datasets, and results show that this method can dramatically compress the neural network with slight accuracy loss.


Introduction
Deep neural networks have achieved great success in recent years in handwritten character recognition, image recognition, and many burgeoning AI applications [1][2][3]. All these achievements rely on complex deep models. In the 2012 ILSVRC contest, Krizhevsky constructed a multilayer network [4] with 60 million parameters, which exceeded all previous methods in classification accuracy. However, training the entire network required 2 to 3 days. Deep networks introduce a large number of layers due to their complicated structure, thereby increasing the model size (such as 50, 200, 250, and 500 MB for GoogleNet, ResNet-101, AlexNet, and VGG-Net, respectively) [5], computational complexity, and energy consumption. Therefore, deploying these models on mobile devices is a large challenge. In deep neural networks, the computational cost and memory consumption are mainly dominated by the convolution operation, which is exactly the dot-product between the weight and activation vectors. Most existing techniques focus on weight sharing, pruning, quantization, and activation discretization [6][7][8].
They also exhibit a large accuracy drop and high computational cost during training and testing with floating-point operations. In this work, we introduce a method to train low-bit networks. On one hand, this study approximates activations through low-bit discretization. On the other hand, weight quantization and a special update mechanism for discrete weights are introduced. With quantized low-bit network weights and output activations, the costly full-precision convolution operation is replaced by shift operations, at the cost of only a slight decrease in accuracy. Our method is relevant to embedded AI devices such as ASICs or FPGAs.

Related Work
In this section, we discuss related work from the following aspects. (i) Pruning and Sharing. Parameter pruning and sharing have been used both to reduce the complexity of neural networks and to avoid model overfitting. The methods in [6][9][10][11] find and prune redundant connections with small weight values and quantize the weights via weight sharing. The runtime memory saving and compression achieved by these simple methods are very limited. (ii) Structured Pruning and Sparsifying. Generally speaking, the L1 norm, L2 norm, Group Lasso, and other regularization terms are efficient ways to learn sparse weight structures and are used in numerous studies. Wen et al. [12] propose Structured Sparsity Learning, which uses Group Lasso to sparsify multiple DNN structures (filters, channels, and even layers). The authors of [13][14][15][16] also train networks with a sparsity regularizer and transform the problem of measuring channel importance into an optimization problem. (iii) Special Neural Architecture. Designing special architectures reduces the FLOPs and accelerates the inference of neural networks. Related work includes Mobile-Net [17,18], Squeeze-Net [19], and Shuffle-Net [20], which adopt small convolutional filters and depth-wise convolution operations. (iv) Weight and Activation Quantization. Our proposed quantization method falls into this category. In low-bit quantization methods, the network weights and activations are expressed by discrete values according to a specific mathematical scheme, which can replace the costly original floating-point operations with only accumulations or even binary logic operations. The authors of [21,22] first constrain the weights to the binary and ternary spaces.
Subsequent work maps both weights and activations into binary or ternary space, e.g., binary neural networks (BNN) [7], XNOR-Net [8], and ternary neural networks (TNN) [23], which directly replace multiply-accumulate operations with logic operations. DoReFa-Net [24] not only quantizes weights and activations, but also quantizes gradients to low-bit width floating-point numbers with discrete states in the backward propagation.

Low-Bit Neural Networks
In this section, we concentrate on training quantized low-bit networks. Specifically, the activations of each layer output are quantized to either zero or powers of two to reduce storage and calculations. The weights of the network are restricted in the same way to obtain a sparse model. By constraining weights and activations to zero or powers of two, the costly floating-point multiplication operation can be replaced by cheaper shift operations [13].
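To illustrate why power-of-two weights remove multiplications, the following minimal sketch assumes integer (fixed-point) activations and weights stored as a (sign, exponent) pair. The function names and this encoding are illustrative assumptions, not the implementation used in this study.

```python
def shift_multiply(activation: int, sign: int, exponent: int) -> int:
    """Multiply an integer activation by sign * 2**exponent using shifts only.

    sign is -1, 0, or +1 (0 encodes a zero weight); exponent may be negative.
    """
    if sign == 0:
        return 0
    if exponent >= 0:
        shifted = activation << exponent
    else:
        # Arithmetic right shift: rounds toward negative infinity.
        shifted = activation >> (-exponent)
    return shifted if sign > 0 else -shifted


def shift_dot(activations, weights):
    """Dot product where every weight is a (sign, exponent) pair."""
    return sum(shift_multiply(a, s, e) for a, (s, e) in zip(activations, weights))
```

For example, with activations scaled by 2^8, `shift_dot([256, 512, -128], [(1, -2), (-1, 0), (0, 0)])` evaluates 0.25·256 − 1·512 + 0·(−128) without any multiplication.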

Dot-product Function.
Deep neural networks generally consist of multiple layers, and each neuron in a layer computes the activation function $z = f(w^{T}x + b)$ (1), where z is the output activation, x is the input vector, w is the weight vector, b is the bias, and f is a nonlinear function, such as ReLU. In convolutional networks, the computational complexity is mainly dominated by the convolution operation. The key points of quantization for compression in hardware applications can be summarized in two aspects. One is the large memory required to store weights and activations. The other is the computational cost required to compute large numbers of dot-products. The difficulty lies in floating-point arithmetic, which is limited in practical applications [5] and is examined in this study. Figure 1 shows the schematic of the standard convolution process and our method (DST will be introduced in Section 3.3).

Low-Bit Activation Approximation.
In this section, we propose a novel approximation strategy for activation quantization and corresponding methods to keep backpropagation efficient.

Forward Approximation Process.
In accordance with the discussion above, the activations of the network are quantized to either zero or powers of two in this section. The optimization model is formulated as follows: where numerous parameter values within the same interval are quantized into a common value q_i (−q_i), and P(x) is our newly defined discrete activation function. We seek to minimize the mean-squared error over all values to obtain the optimal quantization method. Thus, optimization model (2) can be transformed into the following model: where φ(x) is the probability density function of x. Following Cai's implementation [4], we apply batch normalization to the dot-product in (1) so that its distribution is close to a Gaussian with zero mean and unit variance. Accordingly, the optimal solution of (3) can be obtained by Lloyd's algorithm [25]. As a result, the best partition is where P_v denotes the different value intervals of x. The endpoints of each interval are where we set q_1 = 0 and consider the symmetry of the intervals for x < 0. Therefore, the final optimization function of our quantizer is where φ(x) is the probability density function of the standard normal distribution and n is the number of bits of the activation function. Only one variable is considered in (6). Thus, the above-mentioned formula has a theoretical solution. However, we adopt a genetic algorithm in the experiments given the difficulty of solving the piecewise integral with variable limits. Table 1 shows the optimal error for different q_2 values. With further refinement of q_2, we still obtain the same error value of 0.0189.
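As a rough numerical illustration of the mean-squared-error objective described above, the sketch below evaluates a symmetric quantizer under a standard normal input. It rests on several assumptions not stated in the text: the levels are taken to be 0, q_2, 2q_2, 4q_2, ..., each input is mapped to the nearest level (the nearest-neighbour condition of Lloyd's algorithm), and the integral is approximated by a midpoint rule instead of the genetic algorithm used in the experiments. All function names are hypothetical.

```python
import math


def make_levels(q2: float, num_positive: int):
    """Assumed non-negative levels: 0, q2, 2*q2, 4*q2, ..."""
    return [0.0] + [q2 * 2 ** i for i in range(num_positive)]


def quantize(x: float, levels) -> float:
    """Map x to the nearest level in magnitude, keeping the sign of x."""
    nearest = min(levels, key=lambda q: abs(abs(x) - q))
    return math.copysign(nearest, x) if nearest else 0.0


def expected_mse(levels, lo=-8.0, hi=8.0, steps=16000) -> float:
    """Midpoint-rule estimate of E[(x - P(x))^2] for x ~ N(0, 1)."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        total += (x - quantize(x, levels)) ** 2 * pdf * dx
    return total
```

Sweeping `q2` over candidate values and comparing `expected_mse(make_levels(q2, k))` gives a simple grid-search counterpart to the one-variable optimization; the numbers it produces depend on the assumed level set and are not expected to reproduce Table 1.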

Backward Approximation Process.
Because all dot-product values within the same interval map to the same output after the proposed forward approximation, the quantization function has zero derivative almost everywhere. Thus, we propose an alternative backward solution here, and the final experimental results support its feasibility in the backpropagation process. For 0 ≤ x ≤ x_1, we approximate all values in this interval as zero, similar to the ReLU function, so no update is needed. Given the Gaussian distribution of the dot-product mentioned above, plenty of activations fall into the interval near zero. We keep the gradient of this part as it is. For our quantization method, activations that fall within the intervals P_v occur with tiny probability. In this case, we need to limit their updates to prevent them from moving to other intervals and to preserve network accuracy. The derivative of the quantization function has the following form: For x < 0, the symmetry of the intervals is considered. In our final experiments, we find that this method keeps backpropagation efficient and makes learning stable.

Low-Bit Weight Quantization.
The weight quantization shown above can be solved using various methods, such as BWN, DoReFa-Net, and XNOR [8,21,24]. However, these networks have to keep full-precision weights for the backward computation, which may cause frequent data exchange between external memory and parameter storage [26]. In this section, we propose a simple discretization function that maps weights to either zero or powers of two. This approach replaces the floating-point operations with shift operations on hardware in the backward process and avoids large computation and memory costs in hardware deployment.

Weight Quantization in Forward Process.
At the outset, we considered discretizing the weights in the forward process and updating them in the constrained discrete domain. However, the weights are quantized here into a discrete geometric sequence (values with equal ratios), which makes it difficult to update them toward the prescribed quantized values in backpropagation. The nonuniform distribution of the discrete values is the main problem. In similar works such as BWN, DoReFa-Net, and XNOR, the derivative of the quantized weight is zero almost everywhere, which makes it incompatible with backpropagation; the gradient computation therefore relies on stored full-precision weights, and frequent data exchange is required during the training phase. In view of this, we instead discretize the network weights directly to either zero or powers of two in the backward process, rather than in the forward process, to avoid the gradient mismatch problem.

Weight Quantization in Backward Process.
We introduce a weight update mechanism for discrete values in the backward process to avoid gradient mismatch. From previous works, we find that the weight value can be constrained to [−1, 1] in our quantization method. We first introduce the discrete state transformation (DST) for later use. Let Δw be the variation in weight, w be the updated weight, and w′ be the raw weight. L is the minimum interval of the defined quantization weights; for (0, ±2^{-2}, ±2^{-1}, ±2^{0}), L is 2^{-2}. For convenience, seven possible integer states (0, ±1, ±2, ±4) are considered when we constrain the weights to (0, ±2^{-2}, ±2^{-1}, ±2^{0}). Continuous weights need to be mapped into these discrete integer states. Accordingly, we adopt the round operation: where round is the mathematical rounding operation and x is an arbitrary value within [−1, 1]. However, w_state = ±3 is not one of the defined discrete weights stated above. Thus, we introduce a binomial distribution to jump to the defined integer state on either side: where the positive and negative signs are both positive or both negative at the same time, and p takes the value 0 or 1 (we use a random number for p, which is 0 or 1 with equal probability). Figure 2 shows the above-mentioned process. Finally, the weight state needs to be transformed into the defined weight value: In this way, we can transform continuous weights into the defined discrete weights. We then transform the weight variation into a defined discrete state transition. First, we decompose Δw into integer and decimal parts by the minimum interval of the quantization weights: where floor denotes rounding down, k is the integer number of weight state transitions, and v is the fine-tuning parameter of the weight state. Thus, the final state transition number is where gate follows a binomial distribution, taking the value 1 with probability p_1 and 0 with probability 1 − p_1.
p_1 is defined by the fine-tuning parameter v, where th is a positive constant that adjusts the state fine-tuning probability p_1 and will be explored in the experiments. Finally, we use the DST function introduced above to obtain the final quantized weight: In this way, we constrain all weights to (0, ±2^{-2}, ±2^{-1}, ±2^{0}). For other values, the same theory as described above applies.
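The state-mapping part of DST described above can be sketched as follows for the weight set (0, ±2^{-2}, ±2^{-1}, ±2^{0}). This is a minimal illustration under stated assumptions: rounding is taken to be half away from zero, inputs outside [−1, 1] are clipped, and the function names are hypothetical. The gated state transition is not included because it depends on the definition of p_1.

```python
import math
import random

L = 2 ** -2  # minimum interval for (0, ±2^-2, ±2^-1, ±2^0)


def round_half_away(x: float) -> int:
    """Round to the nearest integer, halves away from zero (assumed rule)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def to_state(x: float, rng: random.Random) -> int:
    """Map a continuous weight in [-1, 1] to an integer state in {0, ±1, ±2, ±4}."""
    state = round_half_away(x / L)
    if abs(state) == 3:
        # 3 is not a defined state: jump to 2 or 4 with equal probability,
        # keeping the sign of the original state.
        magnitude = 2 if rng.random() < 0.5 else 4
        state = magnitude if state > 0 else -magnitude
    return state


def dst(x: float, rng: random.Random) -> float:
    """Continuous weight -> defined discrete weight (state times L)."""
    x = max(-1.0, min(1.0, x))  # clip to [-1, 1] (assumption)
    return to_state(x, rng) * L
```

With `rng = random.Random(0)`, calling `dst(0.3, rng)` returns 0.25 deterministically, while `dst(0.75, rng)` returns either 0.5 or 1.0 depending on the random draw.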

Results and Discussion
In this section, we evaluate our proposed algorithm on MNIST (LeNet5), SVHN (VGG), and CIFAR10 (ResNet-18) for image classification using PyTorch. Most previous works do not quantize the first and last layers. In our method, only the first layer is left unquantized. Furthermore, we report the results averaged over three runs for each experiment, trained with the Adaptive Moment Estimation (ADAM) optimizer.

Exploring the Quantization Combination of Weights and Activations.
We illustrate the behavior of different combinations of weights and activations with a standard ResNet-18 on the CIFAR10 dataset. We quantize all weights into (0, ±2^{0}), (0, ±2^{-1}, ±2^{0}), and (0, ±2^{-2}, ±2^{-1}, ±2^{0}). For the activation approximation, we use q_2 = 0.125, 0.25, 0.5, and 1, as shown in Figure 1. For convenience, we use [p, q_2] to denote the quantization combination mode, where p = −2 represents the above (0, ±2^{-2}, ±2^{-1}, ±2^{0}), and the value of q_2 determines the degree of activation approximation. We set th = 0.5 here, and the results after cross combination are shown in Figure 3. In general, weight quantization causes some accuracy degradation. Figure 3 confirms that accuracy increases with the depth of weight quantization. However, different approximation methods for activation do not influence test accuracy dramatically, although fluctuation occurs during training. Our method is also evaluated on other datasets. Table 2 shows the comparison results under the same conditions and the results from [27]. As elaborated above, the BWN, TWN, and XNOR methods quantize weights to 1 or 2 bits of floating point in every layer but not in the entire network. By contrast, our method achieves 2 or 3 bits of fixed point in the entire network and can be used with shift operations on an ASIC or FPGA. To demonstrate the effectiveness of the proposed method, we also show comparison results on CIFAR100 with more complex models (ResNet-34, ResNet-50) in Table 3.
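The [p, q_2] notation can be made concrete with a small helper that expands p into its weight set. The text only states the case p = −2 explicitly; the reading that p = 0 and p = −1 give (0, ±2^{0}) and (0, ±2^{-1}, ±2^{0}) is an inference from the three weight sets listed above, and the function name is hypothetical.

```python
def weight_levels(p: int):
    """Return the assumed discrete weight set for mode p <= 0:
    zero together with ±2^k for k = p, ..., 0, in ascending order."""
    if p > 0:
        raise ValueError("p is expected to be zero or negative")
    positives = [2.0 ** k for k in range(p, 1)]
    return sorted([-v for v in positives] + [0.0] + positives)
```

Under this reading, `weight_levels(-2)` yields the seven values −1, −0.5, −0.25, 0, 0.25, 0.5, and 1, matching the set used for p = −2.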

Effect of Changing th.
We explore the effect of the parameter th in this section. As explained above, th adjusts the weight state fine-tuning probability and thereby influences the final learning accuracy. Figure 4 shows the results, which indicate excellent nonlinearity. Here, we test the combination [−3, 0.125]. Evidently, the curve has the best accuracy at approximately th = 0.5, whereas larger or smaller values may result in slight improvement. The same result is obtained for other combinations after several experiments. Thus, we adopt th = 0.5 for all experiments in this study.

Influence of the First and Last Layer Quantization.
The first and last layers are critical to network quantization research according to previous works. In the current study, all our experiments leave only the first layer unquantized. We attempt to investigate the influence of first layer quantization. The results are summarized in Table 4. We test the weight and activation quantization combination [−3, 0.125] here. "+" and "−" indicate with or without weight quantization of the corresponding layer.
Evidently, accuracy degradation may occur when quantizing the first or last layer. Our method is slightly better than BNN but is not better than BWN which quantizes weights only.

Parameter Sparsity.
Most of the current AI applications are based on ResNet. Thus, we analyze parameter sparsity on ResNet-18. Previous methods clip a large number of weights by setting most weights of small values to zero but not to be exactly zero [28]. By contrast, our method can obtain precise      Table 5.
Evidently, our method can obtain large sparsity in the convolutional layer parameters, and several top layers of the network may be valuable for the final evaluation. The back layers are sparser than the front ones and may be pruned in our future work. As an attempt, we prune the very sparse layers (conv19, conv20) and find that accuracy drops little while the layers become more compact. More meaningfully, training and inference time are reduced to a certain extent, which may be significant for hardware implementations.

Conclusions
In deep networks, computational cost and storage capacity are key factors that directly affect the learning performance. Compression and acceleration of networks aim to reduce the redundancy of complex models. Accordingly, we introduce a method to train networks with weights and activations quantized to several bits. We find that our method reduces network accuracy only slightly, whereas it decreases storage and computation sharply. Interestingly, our quantized model has evident sparsity, so it may be pruned for deployment on ASIC or FPGA AI hardware in the future.

Data Availability
The data used to support the findings of this study are open datasets that are freely available on public websites.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.