Overfitting is a phenomenon that occurs when a Machine Learning model is constrained to the training set and is not able to perform well on unseen data.
Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
The noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Overfitting can be identified by checking validation metrics such as accuracy and loss. Validation accuracy usually improves up to a point and then stagnates or starts declining once the model is affected by overfitting, even while the training metrics keep improving.
The ML model adjusts its weights to fit noisy training data as well. If the noise forms a significant proportion of the data, training will be biased towards the noisy values. Unwanted data is termed noise here; note that what counts as noise depends on the problem.
For example, in a human face detection use-case, a picture containing grass has the grass as noise. If the ML model learns the grass as well, it can overfit.
Training with such data results in overfitting (a low training error rate but a high test error rate). This can be handled at the data level by ensuring diversity in the input data, for example pictures with different backgrounds. Moreover, the regularisation methods mentioned below prevent the ML model from learning the noise.
In case the data doesn't represent the population well, this approach should be used. Data augmentation consists of generating new training instances from existing ones, artificially boosting the size of the training set.
It adds new varieties of data, which helps the model generalise better and so avoids overfitting. Refer here for the detail.
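As an illustrative sketch (not taken from the linked article), the snippet below uses plain NumPy to double a toy image training set by horizontal flipping; real pipelines typically also apply rotations, crops, colour shifts and similar transforms.

```python
import numpy as np

# Minimal data augmentation sketch: double a toy image training set by
# horizontal flipping (labels stay the same for the flipped copies).
rng = np.random.default_rng(0)
X = rng.random((100, 28, 28))   # placeholder images, shape (samples, height, width)
y = rng.integers(0, 2, 100)     # placeholder labels

X_flipped = X[:, :, ::-1]               # mirror each image left-right
X_aug = np.concatenate([X, X_flipped])  # augmented inputs
y_aug = np.concatenate([y, y])          # labels repeated for the flipped copies

print(X_aug.shape)  # (200, 28, 28) -- twice the original training set size
```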
The regularization term, or penalty, imposes a cost on the optimization function for overfitting, pushing the optimizer towards a simpler, better-generalising solution. Refer here for the detail.
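As a sketch of the idea, with model weights w and penalty strength lambda >= 0, the regularised training objective can be written as J(w) = Loss(w) + lambda * R(w), where R(w) = sum of |w_i| for L1 and R(w) = sum of w_i^2 for L2. A larger lambda pushes the optimizer towards smaller weights and hence a simpler model.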
Drop-out also helps in handling overfitting. Refer here for the detail.
It prevents overfitting caused due to the test data. Refer here for the detail.
Early stopping: just interrupt training when the model's performance on the validation set starts dropping.
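A minimal sketch of the early-stopping logic, using a made-up validation-loss curve in place of a real per-epoch training and evaluation loop:

```python
# Made-up validation losses standing in for evaluating a model after each epoch.
val_losses = [0.90, 0.70, 0.55, 0.48, 0.47, 0.49, 0.50, 0.52, 0.51, 0.53]

best_val_loss = float("inf")
patience = 3                     # how many non-improving epochs to tolerate
epochs_without_improvement = 0

for epoch, val_loss in enumerate(val_losses):
    # in a real setup: train for one epoch here, then evaluate on the validation set
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0   # improvement: reset the counter
    else:
        epochs_without_improvement += 1  # no improvement this epoch
        if epochs_without_improvement >= patience:
            print(f"Stopping early at epoch {epoch}, best validation loss {best_val_loss}")
            break
```

In practice, the model weights from the best epoch are also saved and restored when training is interrupted.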
This approach works indirectly. Note that normalisation handles incompatible scales among feature values. Without normalisation, training can be biased/skewed towards a specific feature, which can cause overfitting. Refer here for the detail.
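A minimal sketch of feature normalisation using scikit-learn's StandardScaler, one common way to put features on a comparable scale (the linked article may use a different method, such as batch normalisation):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Features with very different scales (e.g. age in years vs. income),
# brought to zero mean and unit variance so no single feature dominates training.
X_train = np.array([[25.0, 30000.0],
                    [40.0, 90000.0],
                    [33.0, 55000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on training data only
print(X_train_scaled)
# At prediction time, reuse the same fitted scaler: scaler.transform(X_test)
```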
All of the above approaches need to be tried to find out which one works.
Data cleanup is not easy for large datasets. However, note the points below:
Data source should be well examined to ensure good quality data.
Trivial noise (noise which is easy to find and clean) should be removed.
ML algorithms like autoencoders and PCA can be used for de-noising data, as sketched below. Refer here for the detail.
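A small illustrative sketch of PCA-based de-noising on synthetic data (the data, noise level and component count are assumptions made for the example):

```python
import numpy as np
from sklearn.decomposition import PCA

# Keep only the top principal components (the dominant structure) and
# reconstruct, discarding the small, noisy directions.
rng = np.random.default_rng(0)
clean = rng.random((200, 2)) @ rng.random((2, 10))       # low-rank "signal"
noisy = clean + 0.05 * rng.standard_normal(clean.shape)  # add random noise

pca = PCA(n_components=2)   # assume two components carry the signal
denoised = pca.inverse_transform(pca.fit_transform(noisy))

print("error before:", np.abs(noisy - clean).mean())
print("error after: ", np.abs(denoised - clean).mean())
# The reconstruction is typically closer to the clean data than the noisy input.
```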
A regression model which uses the L1 regularisation technique is called LASSO (Least Absolute Shrinkage and Selection Operator) regression.
A regression model that uses the L2 regularisation technique is called Ridge regression.
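A minimal scikit-learn sketch comparing the two on synthetic data (the data and alpha values are assumptions made for the example):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Only the first two features actually influence the target below.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 tends to drive useless weights to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 shrinks all weights towards 0

print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
```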
Dropout probabilistically removes inputs (unit activations) during training. To know why it works in handling overfitting, refer here.
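A minimal NumPy sketch of (inverted) dropout applied to one layer's activations during training (the keep probability is chosen arbitrarily for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.random(10)   # outputs of some hidden layer
keep_prob = 0.8                # each unit survives with probability 0.8

mask = rng.random(activations.shape) < keep_prob
dropped = activations * mask / keep_prob  # rescale so the expected value is unchanged

print(dropped)  # roughly 20% of the units are zeroed out on this forward pass
# At test time dropout is disabled and the full activations are used.
```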
Does regularisation impact convergence?
ML model convergence depends on
Learning rate
Loss function
Regularisation changes the loss function. This new loss function should still meet the convergence criteria (refer here).
Refer to the book here (a Stanford publication), which mentions the KKT conditions as the global-minimum convergence criteria for the loss function.
You may have an over-fitted model if you train too much on the training data. In general, too many epochs may cause your model to over-fit the training data. It means that your model does not learn the underlying pattern; it memorizes the data.
Regularisation hyperparameter
A high cost value (penalty) results in under-fitting. Similarly, a low penalty can still result in overfitting.
So the cost hyper-parameter (lambda) should be chosen properly. One approach is to try out various values in increasing order, as sketched below.
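A minimal sketch of this search using scikit-learn's Ridge and cross-validation (the candidate values and data are assumptions made for the example):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(100)

# Try penalty strengths in increasing order and keep the best cross-validated one.
best_alpha, best_score = None, -np.inf
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print("Best lambda (alpha):", best_alpha)
```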
Points to remember
During regularisation, the output function (y) does not change. The change is only in the loss function.
Regularisation and normalisation are different.
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-redundant-features-in-machine-learning
https://www.linkedin.com/posts/dpkumar_machinelearning-datascience-features-activity-6760734531889831936-ZiID
https://www.linkedin.com/posts/dpkumar_machinelearningtraining-datasciences-testingsolutions-activity-6765613272705200128-5y0k
https://www.linkedin.com/posts/dpkumar_convergence-machinelearningmodels-datasciences-activity-6769803533836521472-rjnu
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-dropout-in-machine-learning#TOC-Tackling-overfit
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/noise-removal-in-machine-learning
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-data-augmentation-in-machine-learning
https://www.geeksforgeeks.org/regularization-in-machine-learning/
https://en.wikipedia.org/wiki/Regularization_(mathematics)
https://machinelearningmastery.com/introduction-to-regularization-to-reduce-overfitting-and-improve-generalization-error/
https://images.app.goo.gl/if8UsUVWhi2jMpcj9
https://images.app.goo.gl/LaxVtp9ubZEpYFZf6
https://youtu.be/KvtGD37Rm5I
https://coursera.org/share/4c472e9c634228f0a18ef8e5242aabcf
https://www.reddit.com/r/MLQuestions/comments/6l6aze/what_is_the_state_of_the_art_on_preventing/
https://analyticsindiamag.com/everything-you-should-know-about-dropouts-and-batchnormalization-in-cnn/
https://datascience.stackexchange.com/questions/27561/can-the-number-of-epochs-influence-overfitting
https://analyticsindiamag.com/hands-on-guide-to-implement-batch-normalization-in-deep-learning-models/
https://web.stanford.edu/~hastie/StatLearnSparsity/index.html
https://en.wikipedia.org/wiki/Karush–Kuhn–Tucker_conditions
https://images.app.goo.gl/HwvdeMLtQGN2AetN6
https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291