Overfitting is a phenomenon that occurs when a Machine Learning model is constrained to the training set and is not able to perform well on unseen data.
Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
The noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.
Overfitting can be identified by checking validation metrics such as accuracy and loss. Validation accuracy usually improves up to a point and then stagnates or starts declining once the model is affected by overfitting, even while the training metrics keep improving.
The ML model adjusts its weights to fit noisy training data as well. If the noise forms a significant proportion of the data, training will be biased towards the noisy values. Unwanted data is termed noise here; note that what counts as noise depends on the problem.
For example, in a human face detection use-case, a picture containing grass has the grass as noise. If the ML model learns the grass as well, it can overfit.
Training with such data results in overfitting (a low training error rate but a high test error rate). This can be handled at the data level by ensuring diversity in the input data, for example pictures with different backgrounds. Moreover, the regularisation methods mentioned below prevent the ML model from learning the noise.
In case the data doesn't represent the population well, this approach should be used. Data augmentation consists of generating new training instances from existing ones, artificially boosting the size of the training set.
It adds new varieties of data, which helps the model generalise better and so avoids overfitting. Refer here for the detail.
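As an illustrative sketch (not taken from the linked article), the snippet below uses plain NumPy to double a toy image training set by horizontal flipping; real pipelines typically also apply rotations, crops, colour shifts and similar transforms.

```python
import numpy as np

# Minimal data augmentation sketch: double a toy image training set by
# horizontal flipping (labels stay the same for the flipped copies).
rng = np.random.default_rng(0)
X = rng.random((100, 28, 28))   # placeholder images, shape (samples, height, width)
y = rng.integers(0, 2, 100)     # placeholder labels

X_flipped = X[:, :, ::-1]               # mirror each image left-right
X_aug = np.concatenate([X, X_flipped])  # augmented inputs
y_aug = np.concatenate([y, y])          # labels repeated for the flipped copies

print(X_aug.shape)  # (200, 28, 28) -- twice the original training set size
```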
The regularization term, or penalty, imposes a cost on the optimization function for overfitting, pushing the optimizer towards a simpler, better-generalising solution. Refer here for the detail.
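As a sketch of the idea, with model weights w and penalty strength lambda >= 0, the regularised training objective can be written as J(w) = Loss(w) + lambda * R(w), where R(w) = sum of |w_i| for L1 and R(w) = sum of w_i^2 for L2. A larger lambda pushes the optimizer towards smaller weights and hence a simpler model.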
Drop-out also helps in handling overfitting. Refer here for the detail.
It prevents overfitting caused due to the test data. Refer here for the detail.
Early stopping: just interrupt training when the model's performance on the validation set starts dropping.
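A minimal sketch of the early-stopping logic, using a made-up validation-loss curve in place of a real per-epoch training and evaluation loop:

```python
# Made-up validation losses standing in for evaluating a model after each epoch.
val_losses = [0.90, 0.70, 0.55, 0.48, 0.47, 0.49, 0.50, 0.52, 0.51, 0.53]

best_val_loss = float("inf")
patience = 3                     # how many non-improving epochs to tolerate
epochs_without_improvement = 0

for epoch, val_loss in enumerate(val_losses):
    # in a real setup: train for one epoch here, then evaluate on the validation set
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0   # improvement: reset the counter
    else:
        epochs_without_improvement += 1  # no improvement this epoch
        if epochs_without_improvement >= patience:
            print(f"Stopping early at epoch {epoch}, best validation loss {best_val_loss}")
            break
```

In practice, the model weights from the best epoch are also saved and restored when training is interrupted.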
This approach works indirectly. Note that normalisation handles incompatible scales among feature values. Without normalisation, training can be biased/skewed towards a specific feature, which can cause overfitting. Refer here for the detail.
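A minimal sketch of feature normalisation using scikit-learn's StandardScaler, one common way to put features on a comparable scale (the linked article may use a different method, such as batch normalisation):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Features with very different scales (e.g. age in years vs. income),
# brought to zero mean and unit variance so no single feature dominates training.
X_train = np.array([[25.0, 30000.0],
                    [40.0, 90000.0],
                    [33.0, 55000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on training data only
print(X_train_scaled)
# At prediction time, reuse the same fitted scaler: scaler.transform(X_test)
```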
All of the above approaches need to be tried to find out which one works.
Data cleanup is not easy for large datasets. However, note the points below:
Data source should be well examined to ensure good quality data.
Trivial noise (noise which is easy to find and clean) should be removed.
ML algorithms like autoencoders and PCA can be used for de-noising data, as sketched below. Refer here for the detail.
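A small illustrative sketch of PCA-based de-noising on synthetic data (the data, noise level and component count are assumptions made for the example):

```python
import numpy as np
from sklearn.decomposition import PCA

# Keep only the top principal components (the dominant structure) and
# reconstruct, discarding the small, noisy directions.
rng = np.random.default_rng(0)
clean = rng.random((200, 2)) @ rng.random((2, 10))       # low-rank "signal"
noisy = clean + 0.05 * rng.standard_normal(clean.shape)  # add random noise

pca = PCA(n_components=2)   # assume two components carry the signal
denoised = pca.inverse_transform(pca.fit_transform(noisy))

print("error before:", np.abs(noisy - clean).mean())
print("error after: ", np.abs(denoised - clean).mean())
# The reconstruction is typically closer to the clean data than the noisy input.
```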
A regression model which uses the L1 regularisation technique is called LASSO (Least Absolute Shrinkage and Selection Operator) regression.
A regression model that uses the L2 regularisation technique is called Ridge regression.
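A minimal scikit-learn sketch comparing the two on synthetic data (the data and alpha values are assumptions made for the example):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Only the first two features actually influence the target below.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 tends to drive useless weights to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 shrinks all weights towards 0

print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
```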
Dropout probabilistically removes inputs (unit activations) during training. To know why it works in handling overfitting, refer here.
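A minimal NumPy sketch of (inverted) dropout applied to one layer's activations during training (the keep probability is chosen arbitrarily for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.random(10)   # outputs of some hidden layer
keep_prob = 0.8                # each unit survives with probability 0.8

mask = rng.random(activations.shape) < keep_prob
dropped = activations * mask / keep_prob  # rescale so the expected value is unchanged

print(dropped)  # roughly 20% of the units are zeroed out on this forward pass
# At test time dropout is disabled and the full activations are used.
```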
Does regularisation impact convergence?
ML model convergence depends on
Learning rate
Loss function
Regularisation changes the loss function. This new loss function should still meet the convergence criteria (refer here).
Refer to the book here (a Stanford publication), which mentions the KKT conditions as the global-minimum convergence criteria for the loss function.
You may have an over-fitted model if you train too much on the training data. In general, too many epochs may cause your model to over-fit the training data. It means that your model does not learn the underlying pattern; it memorizes the data.
Regularisation hyperparameter
A high cost value (penalty) results in under-fitting. Similarly, a low penalty can still result in overfitting.
So the cost hyper-parameter (lambda) should be chosen properly. One approach is to try out various values in increasing order, as sketched below.
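A minimal sketch of this search using scikit-learn's Ridge and cross-validation (the candidate values and data are assumptions made for the example):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(100)

# Try penalty strengths in increasing order and keep the best cross-validated one.
best_alpha, best_score = None, -np.inf
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print("Best lambda (alpha):", best_alpha)
```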
Points to remember
During regularisation, the output function (y) does not change. The change is only in the loss function.
Regularisation and normalisation are different.
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-redundant-features-in-machine-learning
https://www.linkedin.com/posts/dpkumar_machinelearning-datascience-features-activity-6760734531889831936-ZiID
https://www.linkedin.com/posts/dpkumar_machinelearningtraining-datasciences-testingsolutions-activity-6765613272705200128-5y0k
https://www.linkedin.com/posts/dpkumar_convergence-machinelearningmodels-datasciences-activity-6769803533836521472-rjnu
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-dropout-in-machine-learning#TOC-Tackling-overfit
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/noise-removal-in-machine-learning
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-data-augmentation-in-machine-learning
https://www.geeksforgeeks.org/regularization-in-machine-learning/
https://en.wikipedia.org/wiki/Regularization_(mathematics)
https://machinelearningmastery.com/introduction-to-regularization-to-reduce-overfitting-and-improve-generalization-error/
https://images.app.goo.gl/if8UsUVWhi2jMpcj9
https://images.app.goo.gl/LaxVtp9ubZEpYFZf6
https://youtu.be/KvtGD37Rm5I
https://coursera.org/share/4c472e9c634228f0a18ef8e5242aabcf
https://www.reddit.com/r/MLQuestions/comments/6l6aze/what_is_the_state_of_the_art_on_preventing/
https://analyticsindiamag.com/everything-you-should-know-about-dropouts-and-batchnormalization-in-cnn/
https://datascience.stackexchange.com/questions/27561/can-the-number-of-epochs-influence-overfitting
https://analyticsindiamag.com/hands-on-guide-to-implement-batch-normalization-in-deep-learning-models/
https://web.stanford.edu/~hastie/StatLearnSparsity/index.html
https://en.wikipedia.org/wiki/Karush–Kuhn–Tucker_conditions
https://images.app.goo.gl/HwvdeMLtQGN2AetN6
https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291