Artificial neural networks (ANNs) have been applied to many areas and problems, such as classification, computer vision, and speech recognition. How well an ANN performs depends on its hyperparameter configuration, the dataset used, and the network topology. The network topology, also called the architecture or structure, is a representation of how the different layers and neurons of an ANN are connected (Fiesler & Beale, 1996). The design of an ANN's network topology greatly affects its success (Augasta & Kathirvalavakumar, 2013).
The main components of the network topology of an ANN are the neurons, connections, and layers. The neurons, also referred to as nodes or cells, are perceptrons where the calculations happen. The connections, or weights, represent the synapses connecting neurons to one another. An ANN is divided into three parts or layers: the input, hidden, and output layers. Layers are connected in sequence, with the outputs of one layer used as inputs for the next. The number of neurons in the input layer equals the size of the input data, and the number of neurons in the output layer equals the number of outputs in the problem. The number of hidden layers, and of hidden neurons per layer, is chosen by the user depending on the purpose of the ANN or study.
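The components above can be made concrete with a minimal sketch. The layer sizes below (4 inputs, one hidden layer of 8 neurons, 3 outputs) and the tanh activation are illustrative assumptions, not values from the text; the point is only that each layer's output feeds the next layer as input.

```python
import numpy as np

# Illustrative topology: 4 input neurons, 8 hidden neurons, 3 output neurons.
layer_sizes = [4, 8, 3]

rng = np.random.default_rng(0)
# One weight matrix (the "connections") per pair of consecutive layers.
weights = [rng.standard_normal((m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, weights):
    """Propagate an input through the layers: each layer's output
    becomes the next layer's input."""
    for w in weights:
        x = np.tanh(x @ w)  # tanh as an illustrative activation
    return x

out = forward(rng.standard_normal(4), weights)
print(out.shape)  # one value per output neuron
```

Note that the input and output sizes are fixed by the problem, while the hidden size (8 here) is the user's choice, which is exactly the design question discussed next.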
Configuring the number of hidden layers and the number of neurons per layer is the main issue in designing the topology of an ANN. In terms of deciding how many layers to use, Lippmann (1987) showed that two hidden layers were enough to obtain the desired shape of a classification region. Another study states that one hidden layer can approximate any function one needs (Goodfellow, Bengio, & Courville, 2016). According to Reed and Marks II (1999), even if most problems can be solved using a single large hidden layer, it can be more efficient to use more layers. However, these studies do not state the number of neurons needed per layer.
Regarding the optimal number of hidden neurons, there is no standard, definitive way to determine it. If the network has too many hidden neurons, it may have a high generalization error from overfitting to the training data (Augasta & Kathirvalavakumar, 2013). It can also cause the network to have redundant neurons, leading to an ANN with high computation cost and memory wastage (Denil, Shakibi, Dinh, De Freitas, et al., 2013; Han, Pool, Tran, & Dally, 2015). This is especially true for deep neural networks (DNNs), which are ANNs with more than one hidden layer.
A network with too few neurons, on the other hand, can end up underfitting and failing to find an optimal solution. The training time needed to reach an optimal solution, or at least an acceptable state, is also increased because the network does not have enough processing power (Reed, 1993).
Different approaches have been theorized and made to aid in designing ANNs. First, some books and articles have proposed "rules of thumb" for making the topology. An example is that the total number of hidden neurons should fall between the sizes of the input and output layers (Blum, 1992). These rules, however, cannot be used in most circumstances because the training dataset size, the complexity of the data to be learned, and the amount of noise in the targets are not considered. Second, a simple approach would be trial and error, but it yields sub-optimal designs and takes time (Stathakis, 2009). However, with experimentation, an understanding of the problem, and existing models and literature, a reasonable number of hidden nodes and hidden layers to start the trial and error with can be identified intuitively. Lastly, more dynamic approaches are pruning and constructive algorithms.
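As a toy illustration of the rule of thumb attributed to Blum (1992), the layer sizes below are assumed for the example, and taking the midpoint of the range is only one of many choices consistent with the rule:

```python
# Assumed, illustrative problem sizes (not from the text).
n_inputs, n_outputs = 64, 10

# Blum's rule of thumb: the total hidden-neuron count should fall
# somewhere between the output and input layer sizes.
lo, hi = sorted((n_outputs, n_inputs))
n_hidden = (lo + hi) // 2  # e.g. the midpoint of the admissible range
print(lo <= n_hidden <= hi)  # True
```

As the text notes, a rule like this ignores the dataset size, data complexity, and target noise, so it serves only as a starting point.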
Pruning and constructive algorithms adjust the network structure by removing or adding connections, neurons, or layers to optimize the network (Stathakis, 2009). The objective of pruning is to reduce the size of the network without losing the network's capabilities, or even to make it better (LeCun, Denker, & Solla, 1990). The standard approach is to first train an over-parameterized network, that is, a network with more hidden neurons than necessary, for a certain number of epochs. One can also use a pre-trained model made by others instead of training a new one. Next, the trained network is pruned based on the chosen pruning criterion or method. The resulting network is then fine-tuned by retraining it to adapt to and improve with its new structure (Liu, Sun, Zhou, Huang, & Darrell, 2018).
However, pruning still faces the issue of designing the initial network and configuring the hyperparameters to be used for training. There is also the problem of choosing the pruning method, and of deciding whether the fine-tuning phase will use the same configurations as the training phase. This may cause finding a reasonable network to take much longer than creating a basic ANN. This is especially true if iterative pruning is done, in which the network alternates between the pruning and fine-tuning phases until a stopping condition is met. The aim is to find and create the smallest possible network whose capabilities are preserved or even improved.
One of the first studies on pruning ANNs, by LeCun, Denker, and Solla (1990), created a technique called Optimal Brain Damage (OBD). The study states that parameters have a level of saliency, which is how much they affect the training error. To prune the network, parameters with small saliency are removed; when deleted, these parameters have little to no effect on the resulting training error. To compute the saliencies, the second derivatives of the objective function with respect to the weights are used. Iterative pruning is also utilized.
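A minimal sketch of the OBD saliency is possible for a model where the second derivatives are available in closed form. The least-squares setting, the trained weight values, and the pruning of a single parameter below are all illustrative assumptions; OBD's saliency for parameter $k$ is $s_k = h_{kk} w_k^2 / 2$, using only the diagonal of the Hessian.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
# Illustrative "trained" weights; two are near zero.
w = np.array([3.0, 0.01, -2.0, 0.5, 0.02])

# For L(w) = ||Xw - y||^2 / (2n), the Hessian is X.T @ X / n;
# OBD keeps only its diagonal entries h_kk.
h_diag = np.einsum('ij,ij->j', X, X) / len(X)

# OBD saliency: the estimated increase in training error from
# deleting parameter k, to second order.
saliency = h_diag * w**2 / 2

# Delete the parameter whose removal matters least.
k = int(np.argmin(saliency))
print(k)
```

Note how the saliency weighs a parameter's magnitude by its second derivative, so a small weight sitting in a high-curvature direction can still be considered important.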
The problem with OBD is that finding the saliencies of the weights is computationally expensive, especially on large networks. Since iterative pruning is also used, this computation is done multiple times. There is also the issue of choosing which parameters to delete: how many low-saliency parameters to remove, or whether a threshold will be used. Other studies have built on this idea of finding the significance of the weights with respect to the network by speeding up or improving the calculations needed to determine the parameters' importance.
Instead of requiring extra calculations to find the importance of the parameters, the most basic method is magnitude-based pruning (MBP). MBP assumes that weights with small values are irrelevant to the output of the network (Hagiwara, 1994). The only hyperparameter to set is the threshold: any weight with an absolute value less than the threshold is marked irrelevant and pruned. This makes MBP simple to implement and quick to run.
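MBP can be sketched in a few lines. The weight values and the threshold of 0.1 below are illustrative assumptions; the criterion itself (keep a weight only if its absolute value meets the threshold) is as described above.

```python
import numpy as np

def magnitude_prune(weights, threshold):
    """Zero out (prune) every weight whose absolute value is below
    the user-chosen threshold; return the pruned weights and the mask."""
    mask = np.abs(weights) >= threshold
    return np.where(mask, weights, 0.0), mask

# Illustrative weight vector: two entries are small in magnitude.
w = np.array([0.8, -0.03, 1.2, 0.05, -0.9])
pruned, mask = magnitude_prune(w, threshold=0.1)
print(int(mask.sum()))  # number of weights kept
```

Unlike OBD, no second derivatives are needed, which is what makes MBP cheap, at the cost of ignoring how sensitive the error actually is to each weight.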