In our computer program, the mini-batch size is the number of items used for each update of the weights.
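As a minimal sketch (the function and variable names below are only illustrative, not those of our actual program), forming mini-batches of a given size from a data set could look like this:

```python
# Minimal sketch: split a data set into mini-batches of a fixed size.
# `dataset` stands for a list of (image, mask) pairs; names are illustrative.
def make_mini_batches(dataset, batch_size):
    batches = []
    for start in range(0, len(dataset), batch_size):
        # Each mini-batch holds up to `batch_size` items;
        # the weights are updated once per mini-batch.
        batches.append(dataset[start:start + batch_size])
    return batches
```

With a batch size of 5 the weights are thus updated after every 5 images, and with a batch size of 1 after every single image.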
Here are two example segmentations obtained with different mini-batch sizes.
The first one uses a mini-batch size of 5, whereas the second uses a size of 1. Both were trained with a learning rate of 0.05 and the same number of epochs, 50.
There are clear differences between these two images. For instance, the edges in the second one seem sharper than in the first. Moreover, there are some white spots around the cow in the first image that do not appear in the second.
Segmentation with a mini-batch size of 5, a learning rate of 0.05 and 50 epochs.
Segmentation with a mini-batch size of 1, a learning rate of 0.05 and 50 epochs.
One of the major challenges in deep learning is to choose the learning rate efficiently.
Above all, the learning rate is the factor that speeds up or slows down the gradient descent. In other words, it defines how far the weights move after each mini-batch. If the learning rate is too low, the descent still converges to the minimum, but slowly. Conversely, if the learning rate is too high, the descent can diverge.
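As a toy illustration of this behaviour (a one-dimensional example, not our segmentation network), a plain gradient descent step scales the update by the learning rate:

```python
# Toy example: minimise f(w) = (w - 3)^2 by gradient descent.
# The learning rate scales each step: too low converges slowly,
# too high can make the iterates oscillate or diverge.
def gradient(w):
    return 2 * (w - 3)

w = 0.0
learning_rate = 0.05
for step in range(50):
    w -= learning_rate * gradient(w)   # move against the gradient
print(w)   # gets closer to the minimum at w = 3 as steps accumulate
```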
The challenge here was to find the value that allows the gradient descent to converge in the least amount of time. There are several options to help choose an efficient learning rate.
The first one is the naive method, where you try arbitrary values until you are satisfied. Another method consists in using tools such as the learning rate schedulers from PyTorch, which adjust the value as a function of the epoch. Other methods exist to set the learning rate, but we only used the first two approaches described above.
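As an indication of the second approach, a minimal PyTorch sketch could look like the following (the model and scheduler settings are placeholders, not the ones we actually used):

```python
from torch import nn, optim

# Placeholder model, just to have parameters to optimise.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
optimizer = optim.SGD(model.parameters(), lr=0.05)

# StepLR multiplies the learning rate by `gamma` every `step_size` epochs.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(50):
    # ... loop over the mini-batches, compute the loss, call optimizer.step() ...
    scheduler.step()   # adjust the learning rate as a function of the epoch
```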
Convergence of the gradient with a low learning rate. (1)
Convergence of the gradient with a high learning rate. (1)
The loss function measures the discrepancy between a prediction and the target. It is a non-negative value: the lower the loss, the more robust the neural network.
There are three main loss functions that we used during this project: the mean square error (also known as the L2 norm), the Dice loss function, and last but not least, the cross entropy loss function.
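For reference, and not as the exact code of our project, these losses can be set up as follows in PyTorch; the Dice loss shown here is a common hand-written variant for binary masks:

```python
from torch import nn

# Built-in PyTorch losses.
mse_loss = nn.MSELoss()           # mean square error (L2 norm)
ce_loss = nn.CrossEntropyLoss()   # cross entropy

# Hand-written Dice loss for a binary mask: `pred` holds probabilities
# in [0, 1] and `target` the ground-truth mask, both as tensors.
def dice_loss(pred, target, eps=1e-6):
    intersection = (pred * target).sum()
    return 1 - (2 * intersection + eps) / (pred.sum() + target.sum() + eps)
```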
At first glance, the mean square error (left picture) seems to perform better than the other loss functions, because the edges of the cow are sharper. However, our program tests the neural network on a random item, which is why the three pictures shown here are different. Even though we saw no significant differences between two runs with the same parameters but different test pictures, it would be interesting to use the same image when comparing different parameters.
Segmentation of a cow with the mean square error loss function.
Segmentation of a cow with the Dice loss function.
Segmentation of a cow with the cross entropy loss function.
So far, the program always went through the data set in the same order, meaning that for every training run the network worked on the same pictures in the same order. One problem that can arise from this is that, instead of learning how to segment pictures, the network learns how to recognize them.
To prevent this from happening, we first put in place a random shuffling of the data set, right before forming mini-batches. That way, each training would be different.
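A minimal sketch of this idea, with illustrative names rather than our exact code, is to shuffle the list of images at the start of every epoch before cutting it into mini-batches; with a PyTorch DataLoader the same effect is obtained by passing shuffle=True.

```python
import random

dataset = list(range(130))   # stand-in for our ~130 (image, mask) pairs

for epoch in range(50):
    random.shuffle(dataset)  # new random order at every epoch
    mini_batches = [dataset[i:i + 5] for i in range(0, len(dataset), 5)]
    # ... train on each mini-batch ...
```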
In the picture on the right-hand side, which is the result obtained when randomly shuffling the data set, we can see that there are fewer dark spots inside the cow: it appears whiter than in the left picture, obtained without random shuffling. However, there is still room for improvement.
Segmentation of a cow without random shuffle
Segmentation of a cow with random shuffle
Another way to prevent the network from remembering pictures is to feed it more pictures. We only had around 130 images in our data set, so we had to put in place data augmentation.
There are two ways to do so. The first is to create new pictures from the ones we already have, so that the data set is physically bigger. The second is to randomly alter pictures when forming mini-batches, so that the data set is not physically bigger and less memory is needed. We chose the latter, because it was easier to implement and because it is the most commonly used solution.
We put in place four different alterations: randomly rotating pictures, creating a symmetry (mirroring), adding noise (as in blurring the picture), and randomly cropping the picture so that only a part of the cow is visible. When forming mini-batches, a random number is picked and, depending on the range it falls in, one of these alterations, or none, is applied to the picture.
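A sketch of how such on-the-fly alterations can be implemented with torchvision (the probability ranges and parameters below are placeholders, not the exact values we used):

```python
import random
import torchvision.transforms.functional as TF

def random_alteration(image):
    # `image` is assumed to be a PIL image; a random number decides
    # which alteration, or none, is applied when forming a mini-batch.
    r = random.random()
    if r < 0.2:
        return TF.rotate(image, angle=random.uniform(-30, 30))  # random rotation
    if r < 0.4:
        return TF.hflip(image)                                   # symmetry (mirror)
    if r < 0.6:
        return TF.gaussian_blur(image, kernel_size=5)            # noise / blur
    if r < 0.8:
        w, h = image.size                                        # PIL size is (width, height)
        return TF.crop(image, top=0, left=0, height=h // 2, width=w // 2)  # partial crop
    return image                                                 # no alteration
```

In practice the same geometric alteration has to be applied to the ground-truth mask as well, so that the picture and its segmentation stay aligned.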
We can see in the picture on the right-hand side that data augmentation is still not enough to produce a reasonably good result, so we put in place one last improvement.
Segmentation of a cow without data augmentation
Segmentation of a cow with data augmentation
The last improvement we implemented in our program is adding drop-out. This means that some neurons are randomly deactivated during training, so that no neuron can dedicate itself to recognizing one particular picture from the data set.
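As an illustration (the layer sizes and drop-out probability below are placeholders, not those of our actual network), drop-out can be inserted between layers with PyTorch:

```python
from torch import nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout2d(p=0.5),   # randomly deactivates whole feature maps during training
    nn.Conv2d(16, 1, kernel_size=3, padding=1),
)

block.train()   # drop-out is active in training mode
block.eval()    # and automatically disabled at evaluation time
```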
In the picture on the right-hand side, we can see that the result inside the cow is now much better, with fewer dark spots. There is, however, a white spot in the background, much bigger than anything we had seen before; this is probably due to the number of epochs being too high, so that the network started over-fitting.
Segmentation of a cow without drop-out
Segmentation of a cow with drop-out