Due to the inherent weaknesses of MBP, there are limited studies that explore its uses. However, the few studies that exist were done in recent years and showcase the possible usefulness of MBP.
Han et al. (2015) applied MBP to Convolutional Neural Networks (CNNs) to reduce the size of, and the computations needed by, the network without losing its achieved accuracy. The standard pruning procedure was used. A network was first trained, but instead of finding the final weights of the network, the goal was to learn which connections were important. This initial phase can be referred to as the pre-training phase, a term that is at times used interchangeably with the training phase in the context of the standard pruning procedure. The network was then pruned layer by layer, with each layer's threshold calculated as the standard deviation of that layer's weights multiplied by a user-set constant. The final network was obtained either by a single round of retraining or through iterative pruning.
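As a concrete illustration, the per-layer threshold described above can be sketched as follows; the function and parameter names, as well as the use of a binary mask, are illustrative assumptions rather than details taken from Han et al. (2015).

```python
import numpy as np

def prune_layer(weights, quality_param):
    """Prune a layer's weights by magnitude.

    The threshold is the layer's standard deviation scaled by a
    user-set constant (here called quality_param), as described above.
    Returns the pruned weights and a binary mask marking kept connections.
    """
    threshold = quality_param * np.std(weights)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask
```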
Other than MBP, other regularization techniques were used in the pre-training and retraining phases, namely L1 regularization, L2 regularization, and Dropout. L1 and L2 regularization add a regularization term to the loss, leading to smaller weights (Jain, 2018). This is also known as the penalty term or weight decay method. The idea is to drive weights that do not contribute much to the network close to zero, or even exactly to zero, depending on the formula of the penalty term.
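In their standard forms, the two penalty terms augment the data loss with a term scaled by a regularization strength λ (the symbols here follow common usage rather than the cited study):

```latex
L_{\mathrm{L1}} = L_{\mathrm{data}} + \lambda \sum_{i} |w_i|
\qquad
L_{\mathrm{L2}} = L_{\mathrm{data}} + \lambda \sum_{i} w_i^2
```

The L1 term tends to push unimportant weights exactly to zero, while the L2 term shrinks them toward zero without necessarily reaching it.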
L1 and L2 regularization were used in the pre-training phase, the retraining phase, or both, to determine which of the two would help achieve better networks. Overall, L2 regularization gave better results. However, the values used for the methods were not stated, and using L1 or L2 regularization adds another variable to consider when designing the network.
Dropout was used in the retraining phase to prevent the pruned network from overfitting. The study did not use the default value when choosing the Dropout Ratio; instead, a formula was introduced to adjust the Dropout Ratio as the network becomes sparser from pruning. However, the effects of using the original Dropout Ratio versus the new method were not tested.
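A minimal sketch of such a sparsity-aware adjustment is given below; it assumes the ratio is scaled by the square root of the fraction of connections that survive pruning, and the function and variable names are illustrative rather than taken from the study.

```python
import math

def adjusted_dropout_ratio(original_ratio, original_connections, remaining_connections):
    """Scale the dropout ratio down as a layer becomes sparser.

    Assumption for illustration: a sparser layer has less capacity and so
    needs less dropout, so the ratio is scaled by the square root of the
    surviving-connection fraction.
    """
    return original_ratio * math.sqrt(remaining_connections / original_connections)
```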
Other than pruning connections, the study also pruned neurons whenever a neuron no longer had any incoming or outgoing connections. To assess the effectiveness of the proposed method, several CNN models were used: LeNet (LeCun, Bottou, Bengio, Haffner, et al., 1998), AlexNet (Krizhevsky, Sutskever, & Hinton, 2012), and VGGNet (Simonyan & Zisserman, 2014). Again, the initial values of the hyperparameters were not stated. In the end, the networks were pruned by a significant amount, as much as 90%, without losing the original accuracy. However, the approach may still face pruning problems such as irretrievable network damage and the learning inefficiency of the standard pruning procedure, namely long training and retraining times.
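The dead-neuron removal described above can be sketched as a simple check on the pruned weight matrices of a fully connected layer; the matrix layout (rows as a neuron's incoming weights, columns as its outgoing weights) is an assumption made for illustration.

```python
import numpy as np

def dead_neurons(w_in, w_out):
    """Identify neurons that can be removed after connection pruning.

    w_in:  (n_neurons, n_inputs)  pruned weights feeding into the layer
    w_out: (n_outputs, n_neurons) pruned weights leaving the layer

    A neuron is dead when all of its incoming or all of its outgoing
    connections have been pruned (set to zero).
    """
    no_incoming = ~np.any(w_in != 0, axis=1)   # all-zero row
    no_outgoing = ~np.any(w_out != 0, axis=0)  # all-zero column
    return no_incoming | no_outgoing
```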
To resolve these issues, a method called Dynamic Network Surgery was proposed (Guo et al., 2016). The proposed method is composed of three major parts. First is a new pruning procedure: instead of following the typical pruning procedure and retraining for a set number of iterations, the pruning phase was placed within the retraining phase, so that after the pre-training phase, pruning was applied after every weight update during retraining. Second, a threshold interval was used instead of a single threshold. Lastly, the threshold interval enabled a new feature called splicing, which recovers important connections that were pruned.
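The interleaving of pruning with retraining can be sketched as follows; the model methods and the prune_and_splice callback are hypothetical placeholders standing in for the rule described in the next paragraph.

```python
def retrain_with_surgery(model, batches, num_steps, prune_and_splice):
    """Retraining loop where pruning/splicing follows every weight update.

    Unlike the standard procedure (train fully, then prune, then retrain),
    the connection masks are re-evaluated after each parameter update.
    """
    for step, batch in zip(range(num_steps), batches):
        loss = model.forward_loss(batch)    # hypothetical: forward pass + loss
        model.backward_and_update(loss)     # hypothetical: gradient step on all weights
        for layer in model.layers:          # re-evaluate the masks every update
            layer.mask = prune_and_splice(layer.weights, layer.mask)
```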
The threshold interval was made specifically for the pruning and splicing functions. The first threshold is obtained from the variance of the weights of the layer, and the second threshold is the first threshold plus a user-chosen value. If a weight's magnitude is less than the first threshold, the connection is pruned. If it is greater than the second threshold and the connection has been pruned, the connection is returned; otherwise it is kept. Any weight in between keeps its previous state of being either pruned or spliced. The problem with the threshold interval, however, is choosing the right values, as they affect what is pruned and spliced. The study did not state what values were used for the thresholds.
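A minimal sketch of the pruning-and-splicing rule built on this threshold interval is shown below; the function and variable names are illustrative, and the two thresholds are assumed to be computed per layer as described above. It could serve as the prune_and_splice step in the loop sketched earlier.

```python
import numpy as np

def update_mask(weights, mask, t_low, t_high):
    """Update a layer's connection mask after a weight update.

    t_low, t_high: the two ends of the threshold interval
    (t_high = t_low plus a user-chosen value).
    Weights below t_low are pruned, weights above t_high are spliced
    back in, and weights inside the interval keep their previous state.
    """
    magnitude = np.abs(weights)
    new_mask = mask.copy()
    new_mask[magnitude < t_low] = 0    # prune
    new_mask[magnitude > t_high] = 1   # splice (recover the connection)
    return new_mask
```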
Using Dynamic Network Surgery, the study was able to create networks comparable to those of Han et al. (2015) in terms of accuracy, but with regard to network size, the results showed roughly twice as much pruning. However, the study did not examine the individual effects of the threshold interval, the splicing feature, and the combined pruning and retraining.
Straying away from CNNs and image classification, MBP was also used for Neural Machine Translation (NMT) (See et al., 2016), where ANNs are designed to translate one language into another (Bahdanau, Cho, & Bengio, 2014). The study used a recurrent neural network (RNN) architecture called Long Short-Term Memory (LSTM) and applied MBP with the standard pruning procedure, comparing three different MBP schemes. The goal was to show that MBP with retraining, though simple, can be effective in creating pruned networks without loss of accuracy.
The first pruning scheme was class-blind, which is simply the percentile formula from statistics: the absolute values of all the network's weights are pooled and sorted, and the smallest weights are pruned according to a user-set percentage and the percentile formula. The second is class-uniform, similar to the first scheme but applied to weights grouped by class instead of pooling all of the weights together. The last is class-distribution, where instead of pruning based on a chosen percentage, the threshold is the standard deviation of the class multiplied by a user-set constant. It is similar to the threshold in the study of Han et al. (2015), but computed per class rather than per layer.
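A minimal sketch of the class-blind scheme, assuming NumPy and a user-set pruning percentage, is given below; the other two schemes would apply the same idea per weight class rather than over the pooled weights. The function name is an illustrative assumption.

```python
import numpy as np

def class_blind_prune(weight_arrays, prune_percent):
    """Prune the smallest-magnitude weights across the whole network.

    All weights are pooled together and the threshold is the
    prune_percent-th percentile of their absolute values, so roughly
    that percentage of connections (the smallest ones) is removed.
    """
    all_magnitudes = np.concatenate([np.abs(w).ravel() for w in weight_arrays])
    threshold = np.percentile(all_magnitudes, prune_percent)
    return [w * (np.abs(w) >= threshold) for w in weight_arrays]
```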
The study stated the hyperparameter configurations used and the optimization methods applied, examples of which are Dropout, learning rate scheduling, and maximum gradient norm. The problem, however, is that no networks were trained without these optimization methods, so there is no clear way to tell whether they affect the resulting pruned networks.
The initial results of the three pruning schemes showed that class-blind outperformed the other two, so the rest of the experiments used only class-blind. There was minimal performance loss when the network was pruned by 40% with no retraining phase; any higher percentage required retraining to regain lost performance or to improve further. Results also showed that at certain high percentages, pruning had a regularizing effect on the network when comparing the network before pruning with the network after retraining. This, however, was achieved by following the standard pruning procedure.
As seen in the different literature using MBP methods, these methods are able to produce networks with minimal performance loss and a reasonable amount of size compression, with the condition that most of the studies applied the standard pruning procedure. This shows that even with its downfalls, MBP can achieve reasonably pruned networks. Especially with the pruning procedure of Dynamic Network Surgery, there is room for exploration in applying MBP during the training phase.