SuperPruner: Automatic Neural Network Pruning via Super Network


The traditional network pruning method [16,17,24] consists of three steps: (1) pretraining, (2) filter pruning, and (3) fine-tuning. In the filter pruning step, human experts design rules to evaluate the importance of each filter and delete the unimportant ones. In addition, the pruning rate of each layer, which determines the resulting network structure, has to be determined through many experiments. The results of traditional pruning are therefore highly dependent on human experts, which often leads to suboptimal pruning.
In order to reduce the impact of human experts on pruning results, automatic pruning methods [18,19,25,26] came into being. These methods use ideas from reinforcement learning [18] or intelligent search algorithms [19] and automatically prune the network model through continuous iteration. They free human experts from rule design and the choice of pruning rate, which not only saves a lot of time but also improves the performance of the pruned network model. In addition, Liu et al. [27] and Wang et al. [28] argued that the essence of network pruning is pruning the network structure rather than pruning unimportant filters. Therefore, we propose the SuperPruner algorithm, which automatically prunes the model by finding the optimal network structure.
Assuming that a deep neural network has l layers and each layer has a fixed number n of channels, the total search space is $l^n$. The search space therefore grows exponentially as the number of channels increases, and it is clearly infeasible to search all network structures in it. We limit the number of channels that can be retained in each layer to αn, with α ∈ {10%, 20%, ..., 100%}, which means that each convolutional layer of the pruned network has only |α| possible widths. This reduces the search space from the original $l^n$ to $l^{|\alpha|}$, and we build SuperPruner on this reduced search space. The algorithm is inspired by NAS [29][30][31][32], especially the one-shot model [32,33]. As shown in Figure 1, we first train a VerifyNet, which can quickly predict the performance of any network structure in the search space, and then find the optimal network structure with a search algorithm. Finally, we fine-tune the optimal network structure to obtain the pruned network model. When our algorithm predicts the performance of a network structure, it needs only one inference to obtain the accuracy on the validation set, without any fine-tuning, so the whole algorithm is simple and efficient. SuperPruner thus alleviates the slow and expensive performance evaluation in the search for the optimal network structure. Compared with the SOTA methods, our algorithm achieves a higher pruning ratio with a smaller accuracy cost.
Our contribution mainly includes the following three aspects: (1) We propose an automatic pruning algorithm, SuperPruner. The core of this algorithm is to train a VerifyNet, which can directly predict the performance of all pruning structures. Because network search and performance prediction are decoupled by VerifyNet, we can prune the network structure under arbitrary resource constraints. (2) Our algorithm can prune common network structures such as VGG [2], GoogLeNet [3], ResNet [4], and UNet [9]. We applied the model compression algorithm to the semantic segmentation task for the first time and achieved competitive results. (3) Compared with traditional network pruning algorithms, our algorithm obtains better pruning results with little participation of human experts. Compared with other automatic pruning methods, SuperPruner can directly obtain the performance of a network structure without any fine-tuning.

Related Work
Since the method proposed in this paper belongs to network pruning, we summarize recent work on network pruning below.

Traditional Network Pruning.
Traditional network pruning is divided into two categories, unstructured pruning and structured pruning. Unstructured pruning [16,34,35] is fine-grained, and its purpose is to cut off the unimportant weight connections in the pretrained neural network. This results in irregular, sparse CNNs, which usually require special software and hardware accelerators to speed up inference.
In contrast, structured pruning [17,24,36,37] is coarse-grained and completely removes unimportant filters, so it easily achieves computing acceleration. Li et al. [38] used the l1-norm, and Liu et al. [39] used the learnable scaling factor of the BN layer to remove unimportant filters. Luo et al. [36] proposed using reconstruction errors to prune filters. Lin et al. [20] proposed a new global and dynamic pruning scheme, which can prune redundant filters to achieve CNN acceleration. However, the abovementioned methods require a lot of experiments to determine hyperparameters (the pruning rate of each layer). Furthermore, the pruning results are affected by human subjectivity, which is likely to cause suboptimal pruning. Different from the traditional network pruning methods, we propose an automatic pruning algorithm. The whole pruning process hardly requires the participation of human experts, and the results obtained are better.

AutoML.
Recent years have seen the emergence of AutoML, which frees human experts from tedious rule-of-thumb and hyperparameter design. He et al. proposed the AMC [18] method, which automatically generates the pruning rate of each layer through DDPG [40] in reinforcement learning. Lin et al. [41] trained a GAN and let the generator directly generate the pruned network model. Dong et al. [21] proposed to train an unpruned network and search for the most suitable depth and width of a network that minimizes the computation cost. The parameters of the searched/pruned networks are then learned by knowledge transfer from the unpruned network. Liu et al. [25] proposed to train a PruningNet that predicts the weights of the network structure after pruning and then to search for the optimal network structure through PruningNet. However, when Metapruning evaluates the performance of a searched network structure, it needs to perform another computation based on the weights generated by PruningNet. Luo et al. [26] used search methods instead of reinforcement learning to compress the network model, with ADMM [35] as the core optimization algorithm. Lin et al. proposed ABCpruner [19], which uses the ABC algorithm to search the network structure. However, ABCpruner introduces retraining when evaluating the performance of a searched network structure, which takes a lot of time and computing resources. Even if the weights of the pretrained model are used as the initial weights of the searched model and only a few steps of fine-tuning are performed on this basis, the cost of performance evaluation in ABCpruner is unacceptable.
Different from the above methods, our algorithm trains a super network that requires only one forward propagation to predict the performance of the searched network. It does not require any fine-tuning during network performance evaluation, which saves resources.

NAS.
Our algorithm is inspired by one-shot architecture search in NAS. The core idea of one-shot architecture search is to reuse the trained network as much as possible through weight generation [25,31] or weight sharing [30,32], so that evaluating the performance of a searched network structure does not require retraining from scratch, which saves a lot of computation. For example, Brock et al. [31] and Liu et al. [25] trained hypernetworks to generate the weights of the searched network structure. Pham et al. [30] proposed a directed acyclic graph (DAG) to represent the search space, in which all subnetworks are forced to share parameters. Guo et al. [32] and Li et al. [33] proposed to train a super network that includes all substructures. The common edges of different substructures share the weights in the super network. After a single training run, all substructures can obtain their weights directly from the super network.
In NAS, the input and output of each layer are fixed. However, during channel pruning, the input of the current layer changes with the output of the previous convolutional layer. It is therefore not feasible to directly apply one-shot architecture search from NAS to the channel pruning task. Instead, we design a pruning module to replace the convolutional layer in the one-shot model, which solves the problem of the unfixed input of the convolutional layer in the network pruning task.

Materials and Methods
In this section, we introduce SuperPruner, which can efficiently prune convolutional neural networks. A represents the entire search space, and each pruned network structure a belongs to A. We define M(a, ω) as a network model with structure a and weights ω, and the accuracy of a on the test set is used to measure the network performance. As equation (1) shows, the purpose of network pruning is to find a compressed network structure with the optimal accuracy on the test set.
However, in real application scenarios, there are typically requirements on the model's number of parameters, FLOPs, inference speed, and energy consumption. A common practice is to limit the number of parameters, as in the following equation:

$\mathrm{Para}(a^{*}) \le \mathrm{Para}_{\max}$. (2)

Therefore, we need to optimize equation (1) under the condition of equation (2) to obtain the optimal network structure, that is, the pruned network with the highest accuracy.

Figure 1: (1) Train a VerifyNet. The input is the encoding vector of the network structure, and the output is the predicted accuracy on the given dataset. The encoding vector is updated once per iteration. (2) Search for the best network structure. We use the improved PSO algorithm to find the optimal network structure on VerifyNet. In the search process, only one inference is needed to predict the accuracy of a network structure on the given dataset, without any retraining. (3) Fine-tuning. The searched optimal network structure inherits its weights from VerifyNet and is fine-tuned to obtain the best pruned network.
However, obtaining the real accuracy requires retraining the searched network from scratch, which costs a lot of computing resources. As shown in Figure 1, to solve this problem, we propose to train an auxiliary network (VerifyNet), which can quickly predict the accuracy of all subnetworks on the test set without retraining. Then, the PSO algorithm is used to search for the optimal network structure in VerifyNet. Because structure search and performance evaluation are separated, our algorithm can obtain the optimal network structure under arbitrary hardware constraints.
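The parameter constraint of equation (2) can be illustrated with a small sketch. It assumes a plain stack of 3 × 3 convolutions without bias terms; the helper names `conv_params` and `satisfies_budget`, the example widths, and the 30% budget are hypothetical choices made only for this illustration.

```python
def conv_params(channels, in_channels=3, kernel=3):
    """Count the weights of a plain stack of k x k convolutions (no bias)."""
    total, prev = 0, in_channels
    for c in channels:
        total += prev * c * kernel * kernel
        prev = c
    return total


def satisfies_budget(channels, para_max):
    """Check the constraint Para(a) <= Para_max for a candidate structure."""
    return conv_params(channels) <= para_max


full = [64, 128, 256]      # hypothetical unpruned widths
pruned = [32, 64, 128]     # candidate keeping 50% of every layer
budget = 0.3 * conv_params(full)   # assumed budget: 30% of the original

print(conv_params(full), conv_params(pruned))
print(satisfies_budget(pruned, budget), satisfies_budget(full, budget))
```

Because the parameter count of a layer depends on both its input and output widths, halving every layer removes roughly three quarters of the weights in this toy example.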

VerifyNet Structure.
The input of VerifyNet is the encoding vector of the network structure, and the output is the predicted accuracy on the given dataset. When the test set is given, we can use the following equation to predict the true performance of the network structure:

$\mathrm{acc}(a) = \mathrm{VerifyNet}_{a \in A}(a;\, T_{\mathrm{test}})$. (3)

We limit the number of channels that can be retained in each layer to αn, with α ∈ {10%, 20%, ..., 100%}. This means that, for any layer of the neural network, there are only 10 possible numbers of retained channels. When constructing VerifyNet, we allocate one channel block to each possibility, so that 10 channel blocks correspond to the 10 feasible solutions of a layer. The same channel block is shared between different paths. In this way, through the sharing of channel blocks, only 10L channel blocks are needed for an L-layer convolutional neural network to represent all possible situations in the search space of $L^{10}$. Figure 2 shows the structure of VerifyNet with three convolutional layers. Given the encoding vector, we can predict the Top-1 accuracy of the network structure corresponding to the encoding vector.
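As a rough illustration of channel-block sharing, the sketch below builds a lookup table with one block per (layer, retention ratio) pair and resolves an encoding vector to the blocks on its path. The names `RATIOS`, `build_block_table`, and `decode`, and the choice of an encoding vector that holds one ratio index per layer, are assumptions made for this example rather than the paper's implementation.

```python
# Hypothetical sketch: one shared channel block per (layer, ratio) pair.
RATIOS = [r / 10 for r in range(1, 11)]  # 10%, 20%, ..., 100%


def build_block_table(layer_widths):
    """Return {(layer, ratio_index): kept_channels} for every layer."""
    table = {}
    for layer, n in enumerate(layer_widths):
        for idx, ratio in enumerate(RATIOS):
            table[(layer, idx)] = max(1, round(ratio * n))
    return table


def decode(encoding, table):
    """Map an encoding vector (one ratio index per layer) to channel counts."""
    return [table[(layer, idx)] for layer, idx in enumerate(encoding)]


widths = [64, 128, 256]            # a three-layer example
table = build_block_table(widths)  # 10 blocks per layer
path = decode([9, 4, 0], table)    # keep 100%, 50%, and 10% of the channels
print(len(table), path)
```

Any two encoding vectors that pick the same ratio for a layer resolve to the same table entry, which is the sense in which different paths share a block.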
In the network pruning task, the input of the current convolutional layer is determined by the output of the previous convolutional layer, so a single type of channel block is difficult to apply to different types of network models. As shown in Figure 3, in order to make our algorithm effective for various common network models, we design three different blocks. Block (a) is composed of Conv, BN, and ReLU and is suitable for LeNet, VGG, MobileNet, and other networks without shortcuts. For block (a), the output and the maximum input of the channel block are fixed, and the real input can change according to the output of the previous convolutional layer. We can easily implement this function by slicing the convolution kernel. Block (b) is suitable for networks with shortcuts such as ResNet and UNet. We keep the input and output of block (b) fixed so that the convolution kernel at the shortcut connection is not changed and only the middle layer of block (b) is pruned. Block (c) is suitable for GoogLeNet and similar network structures. For block (c), only the 1 × 1 convolutional layers on both sides of the branch are not pruned.
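The kernel slicing used by block (a) can be sketched in a few lines. This is a simplified, framework-free illustration that treats the kernel as a 1 × 1 convolution applied to a single pixel and omits BN and ReLU; the class name `ChannelBlock` and the constant weights are hypothetical.

```python
# Hypothetical sketch of block (a): fixed outputs, sliceable inputs.
class ChannelBlock:
    def __init__(self, out_channels, max_in_channels):
        # weight[o][i] plays the role of a 1 x 1 convolution kernel.
        self.weight = [[0.1 * (o + 1)] * max_in_channels
                       for o in range(out_channels)]

    def forward(self, x):
        """x holds one value per input channel; len(x) <= max_in_channels."""
        n_in = len(x)
        # Keep only the kernel slices that match the real input width.
        sliced = [row[:n_in] for row in self.weight]
        return [sum(w * v for w, v in zip(row, x)) for row in sliced]


block = ChannelBlock(out_channels=4, max_in_channels=8)
full = block.forward([1.0] * 8)   # previous layer kept all 8 channels
half = block.forward([1.0] * 4)   # previous layer kept only 4 channels
print(len(full), len(half))
```

The output width stays fixed at four channels in both calls; only the slice of the kernel that is actually used changes with the width of the incoming feature map.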
We use the appropriate block to construct a VerifyNet according to the type of network to be pruned. VerifyNet is 5.5x the size of the original network. After training, any network structure can be evaluated with only one inference. Compared with the computational power consumption of retraining the subnetwork, it is acceptable to have 5.5x more memory consumption in the training.

VerifyNet Training.
The purpose of VerifyNet is to predict the performance of all subnetworks through channel sharing after being trained only once. We hope that the subnetwork weights inherited from VerifyNet and the subnetwork weights trained from scratch are as close as possible. This requires equal training of all paths in VerifyNet. To solve this problem, we propose a random path sampling strategy to train VerifyNet.
In forward propagation, the encoding vector (representing the structure of the neural network) is randomly generated as the input of VerifyNet. The path corresponding to the encoding vector is activated, while the remaining paths are inactive. The encoding vector is updated after every training batch. In backpropagation, unlike traditional training, we do not update all weights. Only the activated path performs gradient calculation and updates the weights of the channel blocks on the path.
In our VerifyNet, overlapping parts in different paths share the same block. For the path that is not sampled, the blocks on this path will be trained in other paths. When all blocks have been trained, it also means that this path without sampling has also been trained. All paths in the solution space will be trained equally due to the special shared structure of VerifyNet. Because only one path is selected each time, our VerifyNet is not significantly different from the normal network during training, and it can quickly reach convergence.
We cannot guarantee that the ranking of network performance predicted by VerifyNet is the same as the real ranking, but the two should not differ much. Because each path in VerifyNet is trained by random sampling, its weights are as close as possible to those obtained by real training. When we choose another path for training, it affects the previously trained channel blocks on this path, but every channel block in VerifyNet is affected in this way, so the entire VerifyNet maintains a dynamic balance. The network performance predicted by VerifyNet is often lower than the real result, but the predicted ranking is almost the same as the real performance ranking.
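The random path sampling strategy described above can be simulated without any network. The sketch below only counts how often each channel block would receive a gradient update; the function names, the uniform sampling distribution, and the batch count are assumptions for illustration, and the counter increment stands in for a real optimizer step.

```python
import random


def sample_encoding(num_layers, num_choices=10, rng=random):
    """Draw one random path: a ratio index for every layer."""
    return [rng.randrange(num_choices) for _ in range(num_layers)]


def train_verifynet(num_layers, num_batches, seed=0):
    """Count the updates each (layer, choice) block would receive."""
    rng = random.Random(seed)
    updates = {(l, k): 0 for l in range(num_layers) for k in range(10)}
    for _ in range(num_batches):
        encoding = sample_encoding(num_layers, rng=rng)  # new path per batch
        for layer, choice in enumerate(encoding):
            # Stand-in for a gradient step on the activated block only.
            updates[(layer, choice)] += 1
    return updates


updates = train_verifynet(num_layers=3, num_batches=3000)
print(min(updates.values()), max(updates.values()))
```

With uniform sampling, every block is expected to be updated in about one tenth of the batches, so even paths that are never sampled as a whole consist of blocks that have been trained on other paths.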

Network Structure Search.
After training, VerifyNet is no longer a network in the traditional sense but a network estimator. Given the encoding vector and the validation set, a single inference yields the accuracy of the subnetwork corresponding to the encoding vector, without any fine-tuning. Because the search space is large, random search is not advisable. In order to find the optimal network structure, SuperPruner uses PSO to search the network structure on VerifyNet.
We first randomly initialize m one-dimensional particles $\{C_i\}_{i=1}^{m}$, with the position of a particle representing the network structure (the encoding vector input to VerifyNet) and the velocity $V_i$ representing the update direction of the particle. Then, we calculate the fitness function value of each particle according to equation (4) and update the local optimal particle $P_i$ (the best particle found by the i-th particle during the run of the algorithm) and the global optimal particle G (the best particle found by any particle during the run of the algorithm).
where μ ∈ (0%, 100%) is a preset constant that represents the proportion of the network performance in the fitness function. Finally, the velocity and position of each particle are updated according to equation (5), and the algorithm is executed iteratively.

Figure 3: Three blocks of the VerifyNet. The red font marks the parts that need to be pruned. We use block (a) to prune VGG16, block (b) to prune ResNet, and block (c) to prune GoogLeNet. For ResNet and GoogLeNet, we only prune the middle layer and do not change the number of input and output channels of the block. [Panel labels: Conv(α · c_in, β · c_out, 3, 3); BN(β · c_out); ReLU; input channels (α · c_in); max input channels (c_in).]

Scientific Programming
where r ∈ (0, 1) is a random number. The standard PSO algorithm fixes the inertia weight w, so the particles cannot balance global search and local search. This reduces the diversity of the swarm: the particles cannot explore new regions and eventually fall into a local optimal solution. To solve this problem, we dynamically change w according to equation (6) when updating the particle velocity, which helps the particles expand the search space and jump out of local optimal solutions.
where N is the maximum number of iterations of the PSO and n is the current number of iterations of the PSO.
In addition, we introduce the concept of detection particles. When a particle has not been updated for a long time, we consider that it has fallen into a local optimum. In this case, a detection particle is generated to replace the particle that has fallen into the local optimum. Detection particles expand the search space, which helps the particles jump out of local optimal solutions and avoids premature convergence.
After the algorithm is executed, the optimal network structure we searched for is the neural network represented by the global optimal particle. SuperPruner evaluates the performance of a searched network structure through VerifyNet, which has been trained before the search. Because performance evaluation and structure search are completely decoupled, we can easily search for the optimal network structure under arbitrary hardware constraints by modifying the fitness function in PSO. The weights of the searched optimal network structure are inherited from VerifyNet. We only need to fine-tune for a few steps on the training set to obtain the pruned network model. More details of the improved PSO algorithm are shown in Algorithm 1.
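The search loop can be sketched as follows. Since equations (4) to (6) are not reproduced in this text, everything specific here is an assumption: `toy_fitness` is a stand-in that weights a made-up accuracy proxy by μ, the velocity update is the textbook PSO rule, the inertia weight decreases linearly from `w_max` to `w_min`, and a detection particle is modeled as a fresh random particle that replaces one that has not improved for T iterations. Only the values M = 20, N = 100, and T = 10 are taken from the experimental settings.

```python
import random


def toy_fitness(encoding, mu=0.8):
    """Stand-in fitness; in SuperPruner the accuracy comes from VerifyNet."""
    keep = sum(k + 1 for k in encoding) / (10 * len(encoding))
    acc = keep ** 0.5                      # made-up accuracy proxy
    return mu * acc + (1 - mu) * (1 - keep)


def pso_search(fitness, dim, M=20, N=100, T=10, seed=0,
               w_max=0.9, w_min=0.4, c1=2.0, c2=2.0):
    rng = random.Random(seed)

    def new_pos():
        return [rng.randrange(10) for _ in range(dim)]

    pos = [new_pos() for _ in range(M)]
    vel = [[0.0] * dim for _ in range(M)]
    best_pos = [p[:] for p in pos]
    best_fit = [fitness(p) for p in pos]
    stale = [0] * M
    g = max(range(M), key=lambda i: best_fit[i])
    g_pos, g_fit = best_pos[g][:], best_fit[g]
    for n in range(N):
        w = w_max - (w_max - w_min) * n / N   # assumed inertia schedule
        for i in range(M):
            if stale[i] >= T:
                # Detection particle: restart a particle that stopped improving.
                pos[i], vel[i] = new_pos(), [0.0] * dim
                best_fit[i], stale[i] = float("-inf"), 0
            else:
                for d in range(dim):
                    vel[i][d] = (w * vel[i][d]
                                 + c1 * rng.random() * (best_pos[i][d] - pos[i][d])
                                 + c2 * rng.random() * (g_pos[d] - pos[i][d]))
                    pos[i][d] = min(9, max(0, round(pos[i][d] + vel[i][d])))
            f = fitness(pos[i])
            if f > best_fit[i]:
                best_pos[i], best_fit[i], stale[i] = pos[i][:], f, 0
            else:
                stale[i] += 1
            if f > g_fit:
                g_pos, g_fit = pos[i][:], f
    return g_pos, g_fit


g_pos, g_fit = pso_search(toy_fitness, dim=5)
print(g_pos, round(g_fit, 4))
```

Swapping `toy_fitness` for a function that penalizes structures exceeding a FLOPs or parameter budget is one way to express a hardware constraint without touching the search loop.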

Results and Discussion
We conducted experiments on object recognition and image segmentation tasks to verify the effectiveness of SuperPruner. The pruned network models include VGG, GoogLeNet, ResNet, and UNet. All experiments were run on one NVIDIA Tesla P40 GPU and implemented with PyTorch.

Experimental Settings
Datasets. On the object recognition task, we evaluated our method on CIFAR-10 and CIFAR-100. CIFAR-10 has 10 classes with 6K images each, split into 50K training images and 10K test images. CIFAR-100 is similar to CIFAR-10 but is divided into 100 classes, each with 600 images. We randomly divide the original training set into two parts: 10% of the images are used as the validation set and the remainder as the training set. The validation set is used to predict performance when the network structure is searched on VerifyNet, which helps ensure the generalization of the network.
Training Strategy. VerifyNet plays a very important role in quickly predicting network structure performance. For the CIFAR datasets, we use the stochastic gradient descent (SGD) algorithm with a momentum of 0.9 and a weight decay of 0.0001. We train each VerifyNet for 2K epochs with an initial learning rate of 0.1, which is scaled by 0.25 over 500 epochs. The batch size is set to 256. When training the optimal network structure, we reduce the number of epochs from 2K to 150, and the learning rate is divided by 10 every 50 epochs.
PSO Parameter. In order to find the optimal network structure, we experimentally set M = 20, N = 100, and T = 10 in Algorithm 1. The value of m changes according to the network structure; for example, m is set to 16 for VGG16 and to 27 for ResNet56. The value of μ belongs to {10%, 20%, ..., 100%}, and we can freely choose it according to the actual application. The influence of μ on the optimal network structure is discussed in the ablation studies.
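The fine-tuning schedule above amounts to a step decay. The sketch below assumes that fine-tuning starts from the same initial learning rate of 0.1 used for VerifyNet, which the text does not state explicitly; the function name `finetune_lr` is hypothetical.

```python
def finetune_lr(epoch, base_lr=0.1, step=50, factor=0.1):
    """Learning rate divided by 10 every `step` epochs (base_lr is assumed)."""
    return base_lr * factor ** (epoch // step)


# Learning rate at the start of each 50-epoch stage of the 150-epoch run.
schedule = [finetune_lr(e) for e in (0, 50, 100)]
print(schedule)
```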

Results on Object Recognition Task
VGG16. VGG16 has 13 convolutional layers and 3 fully connected layers without shortcuts, and the baseline achieves 93.45% accuracy on CIFAR-10. Using SuperPruner to prune VGG16, we can remove 74.19% of the FLOPs and 89.25% of the parameters while the accuracy is still kept at 93.18%. As seen from Table 1, compared with other methods such as GAL [41] and ABCpruner [19], our method is superior in FLOPs and parameter pruning ratio, with almost no reduction in accuracy. For example, our method reaches a higher pruning rate of FLOPs (74.19% vs. 45.26% by GAL and 73.68% by ABCpruner) and parameters (89.25% vs. 82.22% by GAL and 88.68% by ABCpruner) with a smaller accuracy loss (−0.3% vs. −0.54% by GAL and −0.37% by ABCpruner). On further analysis, Figure 4 shows that SuperPruner retains more channels and parameters in the first few layers of VGG16, and the parameter pruning rate increases significantly starting from Conv6. This is because each layer of the network has a different sensitivity to pruning, resulting in different pruning rates. The first few layers of VGG16 are mainly used for feature extraction, and retaining more channels and parameters there helps the compressed model maintain high accuracy. SuperPruner can therefore automatically learn network structure information through the particle swarm during the search to obtain the optimal pruned model.

GoogLeNet.
For GoogLeNet, our experimental results are given in Table 2. We can remove 55.27% of the parameters, and the accuracy is only 1.37% lower than the baseline. At the same time, SuperPruner achieves a 55.29% FLOPs pruning rate. The comparison of SuperPruner with other algorithms is also shown in Table 2. In terms of FLOPs pruning ratio and parameter pruning ratio, SuperPruner is better than L1 [16] (55.29% vs. 31.39% for FLOPs and 55.27% vs. 43.11% for parameters) but below L1 in accuracy (93.78% vs. 94.54%). Compared with ABCpruner [19] and GAL [41], our algorithm has a slightly lower accuracy than ABCpruner, but its pruning rate exceeds that of GAL.

ResNet.
VGG16 is a simple network that stacks convolutional layers and contains no Short-Block. In order to verify the effect of SuperPruner on models with Short-Blocks, we prune ResNet56 on CIFAR-10 and CIFAR-100. We construct VerifyNet for ResNets as shown in Figure 3(b), which does not change the input and output of the block and only trims the middle part. We summarize the pruning results of ResNet56 in Table 3. On CIFAR-10, with μ set to 80%, our algorithm reaches an 80.74% FLOPs pruning rate, while the accuracy declines by 1.10%. On CIFAR-100, SuperPruner achieves a 57.30% FLOPs pruning rate, and the accuracy is 2.23% lower than that of the unpruned network. This is because ResNet56 has many redundant connections by design, and SuperPruner can automatically find these redundant connections and prune them. Removing these connections can effectively prevent over-fitting and will not affect network performance. Compared with other methods, our model achieves competitive results. Compared with GAL, SuperPruner achieves better results: the accuracy is increased from 91.58% to 92.17%, and the FLOPs pruning rate is increased from 60.20% to 80.74%. Even compared with the state-of-the-art algorithm TAS [21], SuperPruner still reaches a higher FLOPs pruning rate (80.74% vs. 52.70%), with a slight accuracy loss (92.17% vs. 92.81%). For CIFAR-100, the performance of most algorithms declines, but our algorithm still achieves the highest FLOPs pruning rate (57.30%) with a small loss of accuracy (68.97%).

Results on Semantic Segmentation Task.
This paper presents results of pruning a UNet trained for semantic segmentation on the Carvana dataset of high-definition images from Kaggle's Carvana Image Masking Challenge. Since the challenge has ended, only the images and corresponding masks of the training set are available. In order to obtain the pruned model with the best effect and generalization ability, we redivide the original training set into a training set, a validation set, and a test set according to the ratio of 6 : 2 : 2. Consistent with the competition, we evaluated the pruned network on the mean dice coefficient. The dice is defined in our experiment as follows: where X is the predicted segmentation set of pixels and Y is the ground truth. The dice coefficient is defined to be 1 when both X and Y are exactly the same. We use block (b) to construct VerifyNet according to the structure of UNet.
The VerifyNet was trained from scratch with 4096 images (no data augmentation) for 20 epochs in total. We set the initial learning rate to 0.0001 and the batch size to 1, and the remaining parameters are the same as above. The mean of the dice coefficients over the images in the validation set is used as the network performance evaluation index. After the training, the PSO algorithm is used to iteratively search for the optimal network structure. Finally, the optimal network structure is retrained on the original training set to obtain the pruned UNet model. We compared the performance indicators of the pruned UNet and the original UNet, and the results are shown in Table 4. When μ is set to 30%, we can remove 78.34% of the FLOPs and 75.1% of the parameters while still keeping the dice score at 0.9945, which is even 0.002 higher than that of the original model. Figure 5 shows the segmentation results of 30%SuperPruner-UNet on the test image.
Apart from this, we tested the speed of the model. With an input resolution of 959 × 640, the original UNet segments 249 images per second, whereas 30%SuperPruner-UNet segments 311. The pruned network is about 20% faster than the original network. From the above analysis, the pruned network achieves better results on the test set than the baseline, which shows that our proposed algorithm is also effective in semantic segmentation tasks.
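For reference, the dice coefficient is commonly computed as twice the size of the intersection of X and Y divided by the sum of their sizes. The sketch below assumes this standard form is the one intended here and represents each mask as a set of foreground pixel coordinates; the function name `dice` and the empty-mask convention are choices made for this example.

```python
def dice(pred, truth):
    """Dice coefficient of two sets of foreground pixel coordinates."""
    if not pred and not truth:
        return 1.0          # convention: two empty masks match perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))


x = {(0, 0), (0, 1)}        # predicted foreground pixels
y = {(0, 1), (1, 1)}        # ground-truth foreground pixels
print(dice(x, x), dice(x, y))
```

The mean dice score used for evaluation is then the average of this value over all images in the evaluation set.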

Ablation Studies
To further illustrate the efficiency of SuperPruner in searching for the optimal network structure, we choose VGG16 for ablation experiments.

Effect of the VerifyNet.
We designed two sets of experiments. A set of experiments does not train VerifyNet and directly uses the PSO to search for the optimal network structure. We retrained the searched network structure for four epochs to obtain network performance. We define this experiment as PSOPruner. Another set of experiments trains VerifyNet and uses the PSO to search for the optimal network structure. VerifyNet is used to predict network performance of the searched network structure. We define this experiment as SuperPruner. To ensure the fairness of comparison, all the parameters of PSO are the same. We

Effect of μ.
We compare the pruned networks obtained with different values of μ. The experimental results are shown in Figure 6. It can easily be seen from the figure that, as μ increases, the pruning rates of FLOPs and parameters decrease, but the accuracy increases. We conjecture that this is because, as μ rises, more and more convolution kernel channels are retained, so the pruned model obtained by SuperPruner has sufficient capacity for image feature extraction. Hence, we can change μ or customize the fitness function to obtain a pruned model that satisfies the given constraints.

Comparison with Other Methods on Time Consumption.
We also analyzed ABCpruner [19] and Metapruner [25] in terms of time consumption. The experimental results are shown in Table 5. ABCpruner directly uses the ABC algorithm to search for the optimal network structure, and retraining is also used when calculating fitness. If the parameter settings of ABCpruner and PSO-Pruner are the same, the time overhead of the two algorithms is basically the same. Metapruner trains an auxiliary network, PrunerNet, which is used to accelerate the calculation of the fitness of the searched network. Regarding the training time of the auxiliary network, training PrunerNet for 2000 rounds requires 25 h, while VerifyNet requires only 10 h. This is because the output of PrunerNet is the weights of the network structure, whereas the output of VerifyNet is the accuracy of the network structure on a given dataset. VerifyNet therefore has a simpler structure and fewer parameters and takes less time to train. In the calculation of fitness, Metapruner needs 3.01 s for one calculation and SuperPruner needs 2.20 s. This is because Metapruner requires PrunerNet to perform one forward propagation to predict the weights of the network structure and then, with the predicted weights, a second forward propagation to obtain the accuracy on the given dataset. VerifyNet, however, needs only one forward propagation to obtain the accuracy of the network structure on a given dataset. From the above analysis, it can be concluded that SuperPruner is significantly better than Metapruner and ABCpruner in running time.
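A back-of-the-envelope estimate of the total cost can be derived from the figures quoted above. The sketch assumes one fitness evaluation per particle per iteration with M = 20 and N = 100, which is an assumption about how the search is scheduled; the resulting totals are therefore illustrative and are not values taken from Table 5.

```python
M, N = 20, 100            # particles and iterations from the PSO settings
evaluations = M * N       # assumption: one fitness call per particle per iteration


def total_hours(train_hours, seconds_per_eval):
    """Auxiliary-network training time plus the time spent on fitness calls."""
    return train_hours + evaluations * seconds_per_eval / 3600


superpruner_h = total_hours(10, 2.20)   # VerifyNet training + search
metapruner_h = total_hours(25, 3.01)    # PrunerNet training + search
print(round(superpruner_h, 2), round(metapruner_h, 2))
```

Under these assumptions, the training of the auxiliary network dominates the total time, so the shorter VerifyNet training accounts for most of the difference.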

Conclusions
In this paper, we introduce an efficient automatic pruning algorithm named SuperPruner. SuperPruner introduces VerifyNet to predict the performance of a network structure, which speeds up the search for the optimal network structure. On multiple datasets, SuperPruner achieves a higher pruning rate than other state-of-the-art methods with little loss of accuracy. Compared with other automatic pruning algorithms, our proposed algorithm significantly improves the pruning speed. Even if the hardware constraints change, the pruned model can be obtained quickly. More importantly, SuperPruner can be efficiently applied in multiple fields, such as object recognition and semantic segmentation.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.