Learning Rate and Batch Size
The learning rate determines how quickly the model adapts to the problem by controlling the size of the step taken when the weights are updated. It is one of the most important hyperparameters to tune (Goodfellow et al., 2016). If the learning rate is too large, it can produce large swings in the loss and cause the network to converge to a suboptimal solution. If it is too small, convergence is slower and the network can get stuck in a local optimum (Brownlee, 2019b). When choosing the learning rate, the batch size has to be taken into account, as tuning these two hyperparameters affects each other (Bengio, 2012).
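As a minimal illustration of how the learning rate scales each weight update, the sketch below applies one plain gradient descent step; the function name and values are purely illustrative and are not taken from the thesis code.

```python
import numpy as np

def gradient_descent_step(weights, gradients, learning_rate):
    """One gradient descent update: step size is scaled by the learning rate."""
    return weights - learning_rate * gradients

weights = np.array([0.5, -0.3])
gradients = np.array([0.2, -0.1])

# A larger learning rate (0.1) takes a bigger step than a smaller one (0.001).
print(gradient_descent_step(weights, gradients, 0.1))    # [ 0.48  -0.29 ]
print(gradient_descent_step(weights, gradients, 0.001))  # [ 0.4998 -0.2999]
```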
The batch size is the number of training examples processed before the model is updated. The error gradients of a batch are accumulated, and the weights are adjusted once the batch has been processed. When the batch size is one (1), also known as Stochastic Gradient Descent (SGD), the weights are updated after every training sample. Batch Gradient Descent (BGD) trains on all the training data in one epoch before updating the weights. Between these two lies Mini-batch Gradient Descent (MGD), where a batch size between one (1) and the size of the training data is chosen. BGD will not be used due to hardware limitations, and SGD will not be used due to training time constraints, so only MGD will be tested. The remaining issue is choosing the batch size values for MGD.
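As a rough sketch of how the batch size controls update frequency, the loop below runs one epoch of (mini-)batch gradient descent on a small linear model; the model and data are placeholders for illustration, not the thesis's FNN.

```python
import numpy as np

def train_one_epoch(w, X, y, batch_size, learning_rate):
    """One epoch of (mini-)batch gradient descent on a linear model y ~ X @ w."""
    n = len(X)
    order = np.random.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        error = X[idx] @ w - y[idx]
        grad = X[idx].T @ error / len(idx)  # gradient accumulated over the batch
        w = w - learning_rate * grad        # one weight update per batch
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)

# batch_size = 1   -> SGD: a weight update after every sample
# batch_size = 64  -> BGD: a single update per epoch
# batch_size = 16  -> MGD: a value between 1 and the training data size
w = train_one_epoch(w, X, y, batch_size=16, learning_rate=0.1)
```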
Tests were done on the learning rate and batch size to aid in choosing the values. A study states that batch sizes of 2, 4, or even 32 are good default values (Masters & Luschi, 2018), so batch sizes within the range of 4 to 32 were chosen. The batch sizes tested were 4, 16, and 32, and the learning rates were 0.1, 0.01, and 0.001. The tests were done on an FNN with 1024 hidden neurons using the already configured hyperparameters, trained for 50 epochs.
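The test setup can be sketched roughly as follows, assuming a Keras-style implementation; the layer configuration, loss, optimizer, and placeholder data are assumptions made for illustration and may differ from the actual thesis code.

```python
import numpy as np
import tensorflow as tf

def build_fnn(input_dim, num_classes, learning_rate):
    # Hypothetical FNN: one hidden layer with 1024 neurons, softmax output.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Placeholder data so the sketch runs on its own; the thesis uses its own dataset.
rng = np.random.default_rng(0)
x_train, x_test = rng.normal(size=(512, 20)), rng.normal(size=(128, 20))
y_train = tf.keras.utils.to_categorical(rng.integers(0, 3, 512), 3)
y_test = tf.keras.utils.to_categorical(rng.integers(0, 3, 128), 3)

# Grid of the tested batch sizes and learning rates, each trained for 50 epochs.
for batch_size in (4, 16, 32):
    for lr in (0.1, 0.01, 0.001):
        model = build_fnn(input_dim=20, num_classes=3, learning_rate=lr)
        history = model.fit(x_train, y_train, epochs=50, batch_size=batch_size,
                            validation_data=(x_test, y_test), verbose=0)
```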
To choose which batch size and learning rate to use for the final network design, unlike in the hidden neuron size test, the generalization ability of the network was taken into account. The goal was a network with high test accuracy while keeping the difference between the training and test loss reasonable. If the test loss is significantly higher than the training loss, the network may have overfitted to the training data: even if it achieves high accuracy, there is no guarantee it will perform as well on input data it has never seen beyond the test set. The relative time each network took to train may also be taken into account.
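Continuing the Keras-style sketch above, the gap between training and test loss at the final epoch can be read off the training history as a rough overfitting signal; the History keys shown are standard Keras names, and this only illustrates the selection criterion, not the thesis's exact evaluation code.

```python
# Uses the `history` object returned by model.fit() in the previous sketch.
train_loss = history.history["loss"][-1]
test_loss = history.history["val_loss"][-1]
test_accuracy = history.history["val_accuracy"][-1]
gap = test_loss - train_loss  # a large positive gap suggests overfitting
```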
With all batch sizes, a learning rate of 0.1 achieved a test accuracy of 98%. However, these networks showed a large difference between training and test loss. This can be seen in Figure 2, where for batch size 4 the loss changed erratically over the first few epochs; even though the network began to stabilize in the later epochs, the test loss failed to improve. With batch sizes 16 and 32, the test loss spikes less frequently but still fails to decrease and improve, as seen in Figure 3.