Cross-validation combines (averages) measures of predictive fitness to derive a more accurate estimate of a machine learning model's prediction performance. Without it, a model can appear to learn the training data well while failing to perform on unseen, real-world data. If you would like to understand this behaviour, this document will help.
The goal of cross-validation is to test the model's ability to predict new data that was not used in estimating it, in order to flag problems like overfitting or selection bias and to give insight into how the model will generalise to an independent dataset (i.e., an unseen dataset, for instance from a real-world problem).
Our main objective is for the model to work well on real-world data. Although the training dataset is itself real-world data, it represents only a small sample of all the possible data points (examples) out there. To know the model's true score, it should be tested on data it has never seen before.
Cross-validation is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data.
In k-fold cross-validation, you split the input data into k subsets of data (also known as folds). You train an ML model on all but one (k-1) of the subsets, and then evaluate the model on the subset that was not used for training. This process is repeated k times, with a different subset reserved for evaluation (and excluded from training) each time.
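As a concrete illustration, here is a minimal sketch of 5-fold cross-validation with scikit-learn; the dataset, the logistic regression model, and the choice of k = 5 are illustrative assumptions rather than anything prescribed above.

```python
# A minimal k-fold cross-validation sketch (assumed: scikit-learn, iris data,
# logistic regression, k = 5).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])      # train on k-1 folds
    preds = model.predict(X[val_idx])          # evaluate on the held-out fold
    scores.append(accuracy_score(y[val_idx], preds))

print("Per-fold accuracy:", np.round(scores, 3))
print("Mean accuracy:", np.mean(scores))
```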
Leave-p-out cross-validation, by contrast, is exhaustive: it trains and tests on every possible way of holding out p examples from the dataset. As a result, it can become computationally expensive for large values of p.
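scikit-learn exposes this exhaustive scheme as LeavePOut. The sketch below assumes p = 2 and deliberately uses a small subsample, since the number of train/test combinations grows combinatorially with p.

```python
# A sketch of exhaustive leave-p-out cross-validation with p = 2 (assumed:
# scikit-learn, a 30-sample subset of iris, logistic regression).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeavePOut, cross_val_score

X, y = load_iris(return_X_y=True)
X, y = X[::5], y[::5]            # 30 samples -> C(30, 2) = 435 train/test splits

lpo = LeavePOut(p=2)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=lpo)
print("Number of train/test combinations:", lpo.get_n_splits(X))
print("Mean accuracy:", scores.mean())
```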
Stratified K-fold maintains the class proportions by splitting the dataset in such a way that each fold contains approximately the same proportion of labels as the original dataset.
This strategy ensures that when the dataset is imbalanced, no single class is over-represented in any fold.
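A small sketch of how stratification preserves label proportions; the 90/10 class imbalance below is an assumed example, not data from the text.

```python
# Stratified K-fold on an imbalanced label vector: every validation fold keeps
# roughly the original 90/10 class proportions.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)      # imbalanced: 90% class 0, 10% class 1

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"fold {fold}: class counts in validation fold =",
          np.bincount(y[val_idx]))
```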
As a rule, the test set should never be used to change your model (e.g., its hyperparameters).
In the simplest scenario, you would collect one dataset and train your model via cross-validation to create your best model. Then you would collect another, completely independent dataset and test your model on it.
If you have a smaller dataset, you may not be able to afford a separate test set; in that case, validation is performed on every fold and your validation metric is aggregated across the folds, as sketched below.
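A minimal sketch of that workflow, assuming scikit-learn: hold out a test set up front, aggregate the validation metric across the cross-validation folds, and touch the test set only once at the end. The dataset and model are illustrative choices.

```python
# Hold-out test set + cross-validation on the training portion (assumed:
# scikit-learn, breast-cancer data, a scaled logistic regression pipeline).
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Validation metric aggregated across the folds (no test data involved).
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("Mean CV accuracy:", cv_scores.mean())

# Final, one-off check on the untouched test set.
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```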
When we have very little data, splitting it into a training and a test set might leave us with a very small test set, on which we could get almost any performance purely by chance. If we use cross-validation in this case, we build K different models, so we are able to make out-of-sample predictions on all of our data.
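One way to obtain such predictions on all of the data is scikit-learn's cross_val_predict, sketched below with an illustrative dataset and model; every prediction comes from a model that never saw that example during training.

```python
# Out-of-sample predictions for every example, made by K different models
# (assumed: scikit-learn, iris data, logistic regression, K = 5).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

X, y = load_iris(return_X_y=True)

# Each prediction is produced by a model that did not train on that example.
oof_preds = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Out-of-fold accuracy on all of the data:", accuracy_score(y, oof_preds))
```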
Sometimes we want to (or have to) build a pipeline of models to solve something, where the critical part is that our second model must learn from the predictions of our first model. We can't train both models on the same dataset, because then the second model would learn from predictions on examples the first model has already seen. By using cross-validation, we can train and evaluate the two models on different subsets of the data, as sketched below.
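A sketch of this idea using scikit-learn's StackingClassifier, which trains its final estimator on cross-validated (out-of-fold) predictions of the base models; the particular base and final estimators below are illustrative assumptions, not the author's pipeline.

```python
# Two-stage pipeline: the second-stage model learns only from out-of-fold
# predictions of the first-stage models (assumed: scikit-learn, breast-cancer
# data, random forest + SVM as base models, logistic regression on top).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # the meta-model is trained on out-of-fold predictions only
)
print("Stacked model CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```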
Most learning algorithms require some hyperparameter tuning: the number of trees in a gradient boosting classifier, the hidden layer sizes or activation functions in a neural network, the type of kernel in an SVM, and many more. There are many methods for this, whether a manual search, a grid search, or some more sophisticated optimization. In all those cases, however, we can't tune on our training set alone, and of course not on our test set. We have to use a third set, a validation set, and cross-validation can serve this role.
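Here is a sketch of hyperparameter tuning where cross-validation stands in for a fixed validation set, using GridSearchCV over an assumed grid of SVM kernels and C values; the test set is touched only once, at the very end.

```python
# Hyperparameter tuning with cross-validation instead of a fixed validation
# set (assumed: scikit-learn, iris data, an illustrative SVM parameter grid).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # tuning uses only training folds
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
print("Test accuracy (used only once):", search.score(X_test, y_test))
```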
Helps detect overfitting without relying on a single held-out test split
Enables effective training and evaluation with a small dataset
Is used for hyperparameter tuning
In ensemble learning, is used to train the various models
Refer to the Colab notebook linked in the references below for an example with training, validation, and test sets.
Cross-validation is essentially a practical and reliable technique to gauge the quality of a particular neural network. Knowing the quality of a neural network allows you to identify when over-fitting has occurred.
When applied to several neural networks with different free hyper-parameter values (such as the number of hidden nodes, back-propagation learning rate, and so on), the results of cross-validation can be used to select the best set of parameter values.
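A sketch of that idea with scikit-learn's MLPClassifier, using cross-validated grid search to pick between candidate hidden-layer sizes and learning rates; the candidate values below are illustrative assumptions, not ones given in the text.

```python
# Using cross-validation to choose between neural-network configurations
# (assumed: scikit-learn MLPClassifier on the digits data, an illustrative
# grid of hidden-layer sizes and initial learning rates).
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

pipe = make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0))
param_grid = {
    "mlpclassifier__hidden_layer_sizes": [(32,), (64,), (64, 32)],
    "mlpclassifier__learning_rate_init": [0.001, 0.01],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print("Best network configuration:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```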
There are various reasons for overfitting; the one mentioned above is only one of them.
https://en.wikipedia.org/wiki/Cross-validation_(statistics)
https://docs.aws.amazon.com/machine-learning/latest/dg/cross-validation.html
https://www.geeksforgeeks.org/cross-validation-machine-learning/
https://www.mygreatlearning.com/blog/cross-validation/
https://images.app.goo.gl/DDZRzmbEdEGpud5h8
https://towardsdatascience.com/5-reasons-why-you-should-use-cross-validation-in-your-data-science-project-8163311a1e79
https://images.app.goo.gl/BER935pexXtez8ep6
https://towardsdatascience.com/cross-validation-in-machine-learning-72924a69872f
https://towardsdatascience.com/understanding-8-types-of-cross-validation-80c935a4976d
https://images.app.goo.gl/U1gZzUyUaTCDqGwU7
https://aiaspirant.com/cross-validation/
https://images.app.goo.gl/eQJh54HAzimL5GeV7
https://images.app.goo.gl/6RnwF5APkgiRJC2E9
https://stats.stackexchange.com/questions/148688/cross-validation-with-test-data-set
https://visualstudiomagazine.com/articles/2013/10/01/understanding-and-using-kfold.aspx
https://machinelearningmastery.com/how-to-create-a-random-split-cross-validation-and-bagging-ensemble-for-deep-learning-in-keras/
https://www.researchgate.net/post/Is_cross_validation_necessary_in_neural_network_training_and_testing
https://youtu.be/MyBSkmUeIEs
https://colab.research.google.com/drive/1_J2MrBSvsJfOcVmYAN2-WSp36BtsFZCa?usp=sharing