Due to the inherent weaknesses of MBP, there are limited studies that explore its uses. However, the few studies that exist were done in recent years and showcase the possible usefulness of MBP.
Han et al. (2015) applied MBP to Convolutional Neural Networks (CNNs) to reduce the size of, and the computations needed by, the network without losing its achieved accuracy. The standard pruning procedure was used. A network was first trained, but instead of finding the final weights of the network, the goal was to learn which connections were important. This initial phase can be referred to as the pre-training phase, a term that is at times used interchangeably with the training phase in the context of the standard pruning procedure. The network was then pruned layer by layer, with each layer's threshold calculated as the standard deviation of that layer's weights multiplied by a user-set constant. The final network was obtained either by a single round of retraining or through iterative pruning.
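As a concrete illustration, the per-layer threshold described above can be sketched as follows; the function and parameter names, as well as the use of a binary mask, are illustrative assumptions rather than details taken from Han et al. (2015).

```python
import numpy as np

def prune_layer(weights, quality_param):
    """Prune a layer's weights by magnitude.

    The threshold is the layer's standard deviation scaled by a
    user-set constant (here called quality_param), as described above.
    Returns the pruned weights and a binary mask marking kept connections.
    """
    threshold = quality_param * np.std(weights)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask
```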
Other than MBP, other regularization techniques were used in the pre-training and retraining phases, namely L1 regularization, L2 regularization, and Dropout. L1 and L2 regularization add a regularization term to the loss, leading to smaller weights (Jain, 2018). This is also known as the penalty term or weight decay method. The idea is to drive weights that do not contribute much to the network close to zero, or even exactly to zero, depending on the formula of the penalty term.
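In their standard forms, the two penalty terms augment the data loss with a term scaled by a regularization strength λ (the symbols here follow common usage rather than the cited study):

```latex
L_{\mathrm{L1}} = L_{\mathrm{data}} + \lambda \sum_{i} |w_i|
\qquad
L_{\mathrm{L2}} = L_{\mathrm{data}} + \lambda \sum_{i} w_i^2
```

The L1 term tends to push unimportant weights exactly to zero, while the L2 term shrinks them toward zero without necessarily reaching it.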
L1 and L2 regularization were used in the pre-training phase, the retraining phase, or both, to determine which of the two would help achieve better networks. Overall, L2 regularization gave better results. However, the values used for the methods were not stated, and using L1 or L2 regularization adds another variable to consider when designing the network.
Dropout was used in the retraining phase to prevent the pruned network from overfitting. The study did not use the default value when choosing the Dropout Ratio; instead, a formula was introduced to adjust the Dropout Ratio as the network becomes sparser from pruning. However, the effects of using the original Dropout Ratio versus the new method were not tested.
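A minimal sketch of such a sparsity-aware adjustment is given below; it assumes the ratio is scaled by the square root of the fraction of connections that survive pruning, and the function and variable names are illustrative rather than taken from the study.

```python
import math

def adjusted_dropout_ratio(original_ratio, original_connections, remaining_connections):
    """Scale the dropout ratio down as a layer becomes sparser.

    Assumption for illustration: a sparser layer has less capacity and so
    needs less dropout, so the ratio is scaled by the square root of the
    surviving-connection fraction.
    """
    return original_ratio * math.sqrt(remaining_connections / original_connections)
```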
Other than pruning connections, the study also pruned neurons whenever a neuron no longer had any incoming or outgoing connections. To assess the effectiveness of the proposed method, several CNN models were used: LeNet (LeCun, Bottou, Bengio, Haffner, et al., 1998), AlexNet (Krizhevsky, Sutskever, & Hinton, 2012), and VGGNet (Simonyan & Zisserman, 2014). Again, the initial values of the hyperparameters were not stated. In the end, the networks were pruned by a significant amount, as much as 90%, without losing the original accuracy. However, the approach may still face pruning problems such as irretrievable network damage and the learning inefficiency of the standard pruning procedure, namely long training and retraining times.
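The dead-neuron removal described above can be sketched as a simple check on the pruned weight matrices of a fully connected layer; the matrix layout (rows as a neuron's incoming weights, columns as its outgoing weights) is an assumption made for illustration.

```python
import numpy as np

def dead_neurons(w_in, w_out):
    """Identify neurons that can be removed after connection pruning.

    w_in:  (n_neurons, n_inputs)  pruned weights feeding into the layer
    w_out: (n_outputs, n_neurons) pruned weights leaving the layer

    A neuron is dead when all of its incoming or all of its outgoing
    connections have been pruned (set to zero).
    """
    no_incoming = ~np.any(w_in != 0, axis=1)   # all-zero row
    no_outgoing = ~np.any(w_out != 0, axis=0)  # all-zero column
    return no_incoming | no_outgoing
```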
To resolve these issues, a method called Dynamic Network Surgery was proposed (Guo et al., 2016). The proposed method is composed of three major parts. First is a new pruning procedure: instead of following the typical pruning procedure and retraining for a set number of iterations, the pruning phase was placed within the retraining phase, so that after the pre-training phase, pruning was applied after every weight update during retraining. Second, a threshold interval was used instead of a single threshold. Lastly, the threshold interval enabled a new feature called splicing, which recovers important connections that were pruned.
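The interleaving of pruning with retraining can be sketched as follows; the model methods and the prune_and_splice callback are hypothetical placeholders standing in for the rule described in the next paragraph.

```python
def retrain_with_surgery(model, batches, num_steps, prune_and_splice):
    """Retraining loop where pruning/splicing follows every weight update.

    Unlike the standard procedure (train fully, then prune, then retrain),
    the connection masks are re-evaluated after each parameter update.
    """
    for step, batch in zip(range(num_steps), batches):
        loss = model.forward_loss(batch)    # hypothetical: forward pass + loss
        model.backward_and_update(loss)     # hypothetical: gradient step on all weights
        for layer in model.layers:          # re-evaluate the masks every update
            layer.mask = prune_and_splice(layer.weights, layer.mask)
```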
The threshold interval was made specifically for the pruning and splicing functions. The first threshold is obtained from the variance of the weights of the layer, and the second threshold is the first threshold plus a user-chosen value. If a weight's magnitude is less than the first threshold, the connection is pruned. If it is greater than the second threshold and the connection has been pruned, the connection is returned; otherwise it is kept. Any weight in between keeps its previous state of being either pruned or spliced. The problem with the threshold interval, however, is choosing the right values, as they affect what is pruned and spliced. The study did not state what values were used for the thresholds.
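A minimal sketch of the pruning-and-splicing rule built on this threshold interval is shown below; the function and variable names are illustrative, and the two thresholds are assumed to be computed per layer as described above. It could serve as the prune_and_splice step in the loop sketched earlier.

```python
import numpy as np

def update_mask(weights, mask, t_low, t_high):
    """Update a layer's connection mask after a weight update.

    t_low, t_high: the two ends of the threshold interval
    (t_high = t_low plus a user-chosen value).
    Weights below t_low are pruned, weights above t_high are spliced
    back in, and weights inside the interval keep their previous state.
    """
    magnitude = np.abs(weights)
    new_mask = mask.copy()
    new_mask[magnitude < t_low] = 0    # prune
    new_mask[magnitude > t_high] = 1   # splice (recover the connection)
    return new_mask
```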
Using Dynamic Network Surgery, the study was able to create networks comparable to those of Han et al. (2015) in terms of accuracy, but with regard to network size, the results showed roughly twice as much pruning. However, the study did not examine the individual effects of the threshold interval, the splicing feature, and the combined pruning and retraining.
Straying away from CNNs and image classification, MBP was also used for Neural Machine Translation (NMT) (See et al., 2016), where ANNs are designed to translate one language into another (Bahdanau, Cho, & Bengio, 2014). The study used a recurrent neural network (RNN) architecture called Long Short-Term Memory (LSTM) and applied MBP with the standard pruning procedure, comparing three different MBP schemes. The goal was to show that MBP with retraining, though simple, can be effective in creating pruned networks without loss of accuracy.
The first pruning scheme was class-blind, which is simply the percentile formula from statistics: the absolute values of all the network's weights are pooled and sorted, and the smallest weights are pruned according to a user-set percentage and the percentile formula. The second is class-uniform, similar to the first scheme but applied to weights grouped by class instead of pooling all of the weights together. The last is class-distribution, where instead of pruning based on a chosen percentage, the threshold is the standard deviation of the class multiplied by a user-set constant. It is similar to the threshold in the study of Han et al. (2015), but computed per class rather than per layer.
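A minimal sketch of the class-blind scheme, assuming NumPy and a user-set pruning percentage, is given below; the other two schemes would apply the same idea per weight class rather than over the pooled weights. The function name is an illustrative assumption.

```python
import numpy as np

def class_blind_prune(weight_arrays, prune_percent):
    """Prune the smallest-magnitude weights across the whole network.

    All weights are pooled together and the threshold is the
    prune_percent-th percentile of their absolute values, so roughly
    that percentage of connections (the smallest ones) is removed.
    """
    all_magnitudes = np.concatenate([np.abs(w).ravel() for w in weight_arrays])
    threshold = np.percentile(all_magnitudes, prune_percent)
    return [w * (np.abs(w) >= threshold) for w in weight_arrays]
```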
The study stated the hyperparameter configurations used and the optimization methods applied, examples of which are Dropout, learning rate scheduling, and maximum gradient norm. The problem, however, is that no networks were trained without these optimization methods, so there is no clear way to tell whether they affect the resulting pruned networks.
The initial results of the three pruning schemes showed that class-blind outperformed the other two, so the rest of the experiments used only class-blind. There was minimal performance loss when the network was pruned by 40% with no retraining phase; any higher percentage required retraining to regain lost performance or to improve further. Results also showed that at certain high percentages, pruning had a regularizing effect on the network when comparing the network before pruning with the network after retraining. This, however, was achieved by following the standard pruning procedure.
As seen in the different literature using MBP methods, these methods are able to produce networks with minimal performance loss and a reasonable amount of size compression, with the condition that most of the studies applied the standard pruning procedure. This shows that even with its downfalls, MBP can achieve reasonably pruned networks. Especially with the pruning procedure of Dynamic Network Surgery, there is room for exploration in applying MBP during the training phase.