Learning Rate and Batch Size
The learning rate determines how quickly the model adapts to the problem by controlling the size of the step taken when the weights are updated. It is one of the most important hyperparameters to tune (Goodfellow et al., 2016). If the learning rate is too large, it can produce large swings in the loss and cause the network to converge to a suboptimal solution. If it is too small, convergence is slower and the network can get stuck in a local optimum (Brownlee, 2019b). When choosing the learning rate, the batch size has to be taken into account, as tuning these two hyperparameters affects each other (Bengio, 2012).
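As a minimal illustration of how the learning rate scales each weight update, the sketch below applies one plain gradient descent step; the function name and values are purely illustrative and are not taken from the thesis code.

```python
import numpy as np

def gradient_descent_step(weights, gradients, learning_rate):
    """One gradient descent update: step size is scaled by the learning rate."""
    return weights - learning_rate * gradients

weights = np.array([0.5, -0.3])
gradients = np.array([0.2, -0.1])

# A larger learning rate (0.1) takes a bigger step than a smaller one (0.001).
print(gradient_descent_step(weights, gradients, 0.1))    # [ 0.48  -0.29 ]
print(gradient_descent_step(weights, gradients, 0.001))  # [ 0.4998 -0.2999]
```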
The batch size is the number of training examples processed before the model is updated. The error gradients of a batch are accumulated, and the weights are adjusted once the batch has been processed. When the batch size is one (1), also known as Stochastic Gradient Descent (SGD), the weights are updated after every training sample. Batch Gradient Descent (BGD) trains on all the training data in one epoch before updating the weights. Between these two lies Mini-batch Gradient Descent (MGD), where a batch size between one (1) and the size of the training data is chosen. BGD will not be used due to hardware limitations, and SGD will not be used due to training time constraints, so only MGD will be tested. The remaining issue is choosing the batch size values for MGD.
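As a rough sketch of how the batch size controls update frequency, the loop below runs one epoch of (mini-)batch gradient descent on a small linear model; the model and data are placeholders for illustration, not the thesis's FNN.

```python
import numpy as np

def train_one_epoch(w, X, y, batch_size, learning_rate):
    """One epoch of (mini-)batch gradient descent on a linear model y ~ X @ w."""
    n = len(X)
    order = np.random.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        error = X[idx] @ w - y[idx]
        grad = X[idx].T @ error / len(idx)  # gradient accumulated over the batch
        w = w - learning_rate * grad        # one weight update per batch
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)

# batch_size = 1   -> SGD: a weight update after every sample
# batch_size = 64  -> BGD: a single update per epoch
# batch_size = 16  -> MGD: a value between 1 and the training data size
w = train_one_epoch(w, X, y, batch_size=16, learning_rate=0.1)
```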
Tests were done on the learning rate and batch size to aid in choosing the values. A study states that batch sizes of 2, 4, or even 32 are good default values (Masters & Luschi, 2018), so batch sizes within the range of 4 to 32 were chosen. The batch sizes tested were 4, 16, and 32, and the learning rates were 0.1, 0.01, and 0.001. The tests were done on an FNN with 1024 hidden neurons using the already configured hyperparameters, trained for 50 epochs.
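The test setup can be sketched roughly as follows, assuming a Keras-style implementation; the layer configuration, loss, optimizer, and placeholder data are assumptions made for illustration and may differ from the actual thesis code.

```python
import numpy as np
import tensorflow as tf

def build_fnn(input_dim, num_classes, learning_rate):
    # Hypothetical FNN: one hidden layer with 1024 neurons, softmax output.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Placeholder data so the sketch runs on its own; the thesis uses its own dataset.
rng = np.random.default_rng(0)
x_train, x_test = rng.normal(size=(512, 20)), rng.normal(size=(128, 20))
y_train = tf.keras.utils.to_categorical(rng.integers(0, 3, 512), 3)
y_test = tf.keras.utils.to_categorical(rng.integers(0, 3, 128), 3)

# Grid of the tested batch sizes and learning rates, each trained for 50 epochs.
for batch_size in (4, 16, 32):
    for lr in (0.1, 0.01, 0.001):
        model = build_fnn(input_dim=20, num_classes=3, learning_rate=lr)
        history = model.fit(x_train, y_train, epochs=50, batch_size=batch_size,
                            validation_data=(x_test, y_test), verbose=0)
```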
To choose which batch size and learning rate to use for the final network design, unlike in the hidden neuron size test, the generalization ability of the network was taken into account. The goal was a network with high test accuracy while keeping the difference between the training and test loss reasonable. If the test loss is significantly higher than the training loss, the network may have overfitted to the training data: even if it achieves high accuracy, there is no guarantee it will perform as well on input data it has never seen beyond the test set. The relative time each network took to train may also be taken into account.
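Continuing the Keras-style sketch above, the gap between training and test loss at the final epoch can be read off the training history as a rough overfitting signal; the History keys shown are standard Keras names, and this only illustrates the selection criterion, not the thesis's exact evaluation code.

```python
# Uses the `history` object returned by model.fit() in the previous sketch.
train_loss = history.history["loss"][-1]
test_loss = history.history["val_loss"][-1]
test_accuracy = history.history["val_accuracy"][-1]
gap = test_loss - train_loss  # a large positive gap suggests overfitting
```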
With all batch sizes, a learning rate of 0.1 achieved a test accuracy of 98%. However, these networks showed a large difference between training and test loss. This can be seen in Figure 2, where for batch size 4 the loss changed erratically over the first few epochs; even though the network began to stabilize in the later epochs, the test loss failed to improve. With batch sizes 16 and 32, the test loss spikes less frequently but still fails to decrease and improve, as seen in Figure 3.