Loosely coupled neurons improve generalisation ability (refer: points below).
Neurons trained with dropout cannot co-adapt with their neighboring neurons. So, they have to be as useful as possible on their own.
They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons.
Hence, they end up being less sensitive to slight changes in the inputs. In the end you get a more robust network that generalises better.
Dropout can be used to handle overfitting. To understand intuitively why it helps, recall the observation above that dropout improves generalisation. Note that generalisation and overfitting are opposites: if a model overfits, it has not been trained to generalise well.
By using dropout, we also get the benefits of ensemble learning. Below is the reasoning behind this:
Since each neuron can be either present or absent, there are 2^N possible networks (where N is the total number of droppable neurons).
So, once you have run 10,000 training steps, you have essentially trained 10,000 different neural networks (each with just one training instance). These networks are obviously not independent since they share many of their weights, but they are nevertheless all different.
The resulting neural network can be seen as an averaging ensemble of all these smaller neural networks.
Ensemble learning improves performance; refer to the references below for details.
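A minimal NumPy sketch (not from the sources above; the layer size, keep probability, and step count are illustrative) of how each training step samples a different "thinned" sub-network sharing the same weights:

import numpy as np

rng = np.random.default_rng(0)
n_hidden = 5            # N droppable neurons -> 2**5 = 32 possible sub-networks
keep_prob = 0.5
n_steps = 10_000

seen_masks = set()
for _ in range(n_steps):
    # Sample which neurons are kept this step (True = present, False = dropped).
    mask = rng.random(n_hidden) < keep_prob
    seen_masks.add(tuple(mask))

print(f"distinct sub-networks sampled: {len(seen_masks)} out of {2 ** n_hidden}")

With a realistically large N, 10,000 steps cannot come close to visiting all 2^N sub-networks, which is why the final network is better thought of as an approximate averaging ensemble of the sub-networks that were actually sampled.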
One particular form of regularisation was found to be especially useful with dropout: constraining the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant c. [Verify whether this is true and why it works. Refer: https://www.reddit.com/r/MachineLearning/comments/2bopxs/question_about_the_maxnorm_constraint_used_with/, https://machinelearningmastery.com/introduction-to-weight-constraints-to-reduce-generalization-error-in-deep-learning/]
This is also called max-norm regularisation, since it implies that the maximum value the norm of any incoming weight vector can take is c.
The constant c is a tunable hyperparameter, which is determined using a validation set.
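A minimal Keras sketch of combining dropout with a max-norm constraint (the layer sizes, dropout rate, and c = 3 are illustrative values, not recommendations):

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.constraints import MaxNorm

# MaxNorm(3) caps the norm of each hidden unit's incoming weight vector at c = 3.
model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(128, activation="relu", kernel_constraint=MaxNorm(3)),
    layers.Dropout(0.5),
    layers.Dense(128, activation="relu", kernel_constraint=MaxNorm(3)),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")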
The default interpretation of the dropout hyperparameter is the probability of keeping (training) a given node in a layer, where 1.0 means no dropout and 0.0 means no outputs from the layer. This is the keep probability; note that some libraries (e.g., Keras's Dropout layer) instead take the drop probability, where 0.0 means no dropout.
If you observe that the model is overfitting, you can increase the dropout rate (i.e., reduce the keep_prob hyperparameter).
Conversely, you should try decreasing the dropout rate (i.e., increasing keep_prob) if the model underfits the training set.
It can also help to increase the dropout rate for large layers and reduce it for small ones. (Understand it)
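A minimal Keras sketch of these tuning knobs (the layer sizes and rates are illustrative). Keras's Dropout layer takes the drop probability, so rate = 1 - keep_prob:

from tensorflow import keras
from tensorflow.keras import layers

drop_large = 0.5   # illustrative: a larger rate for the bigger layer
drop_small = 0.2   # illustrative: a smaller rate for the smaller layer

model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(512, activation="relu"),
    layers.Dropout(drop_large),    # Dropout(rate) drops a fraction `rate` of the units
    layers.Dense(64, activation="relu"),
    layers.Dropout(drop_small),
    layers.Dense(10, activation="softmax"),
])

If the model overfits, raise these rates (lower keep_prob); if it underfits, lower them.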
We need to multiply each neuron's input connection weights by the keep probability (1 - p) after training. (Understand it)
For example, suppose the dropout probability is p = 0.5. During testing, a neuron will be connected to twice as many input neurons as it was (on average) during training. To compensate for this, we need to multiply each neuron's input connection weights by 0.5 after training. If we don't, each neuron will get a total input signal roughly twice as large as what the network was trained on, and it is unlikely to perform well.
Alternatively, we can divide each neuron’s output by the keep probability during training (these alternatives are not perfectly equivalent, but they work equally well).
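A minimal NumPy sketch of the two conventions (the shapes and p = 0.5 are illustrative):

import numpy as np

rng = np.random.default_rng(0)
p = 0.5                    # dropout probability
keep_prob = 1.0 - p

x = rng.standard_normal(20)           # activations feeding into the next layer
W = rng.standard_normal((20, 10))     # weights of the next layer
mask = rng.random(20) < keep_prob     # which inputs are kept this training step

# Standard dropout: drop at training time, scale the weights after training.
train_out = (x * mask) @ W
test_out = x @ (W * keep_prob)        # weights multiplied by keep_prob after training

# Inverted dropout: divide the kept activations by keep_prob during training,
# so the weights can be used unchanged at test time.
train_out_inv = (x * mask / keep_prob) @ W
test_out_inv = x @ W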
Applying dropout to the input layer increases the training time per epoch. This happens because dropout requires: [Verify it]
additional matrices for dropout masks
drawing random numbers for each entry of these matrices
elementwise multiplying the masks with the corresponding activations
Considering the above, dropout does tend to significantly slow down convergence, but it usually results in a much better model when tuned properly, so it is generally well worth the extra time and effort.
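A rough NumPy sketch of the extra per-batch work listed above (the batch size, layer sizes, and keep probability are arbitrary; absolute timings will vary by machine):

import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 784))     # one batch of inputs
W = rng.standard_normal((784, 256))
keep_prob = 0.8

def forward_plain():
    return X @ W

def forward_with_input_dropout():
    mask = rng.random(X.shape) < keep_prob    # extra mask matrix + random draws
    return (X * mask / keep_prob) @ W         # extra elementwise multiply and scaling

for fn in (forward_plain, forward_with_input_dropout):
    start = time.perf_counter()
    for _ in range(50):
        fn()
    print(fn.__name__, f"{time.perf_counter() - start:.3f}s")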
Colab example
https://colab.research.google.com/github/d2l-ai/d2l-en-colab/blob/master/chapter_multilayer-perceptrons/dropout.ipynb
https://towardsdatascience.com/machine-learning-part-20-dropout-keras-layers-explained-8c9f6dc4c9ab
Dropout is not used after training when making predictions with the fitted network.
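In Keras this is handled automatically: the Dropout layer is only active when the model is called in training mode, as in this minimal sketch (the shapes are illustrative):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dropout(0.5),
])

x = np.ones((1, 4), dtype="float32")
print(model(x, training=True))    # some entries zeroed, the rest scaled up (inverted dropout)
print(model(x, training=False))   # dropout disabled: output equals the input
print(model.predict(x))           # predict() also runs with dropout disabled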
https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291
https://machinelearningmastery.com/dropout-for-regularizing-deep-neural-networks/
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-regularisation-in-machine-learning
https://stats.stackexchange.com/questions/376993/why-does-dropout-increase-the-training-time-per-epoch-in-a-neural-network
https://www.reddit.com/r/MachineLearning/comments/2bopxs/question_about_the_maxnorm_constraint_used_with/