[TBD] Handling little training data in machine learning

Introduction

Layman's explanation

It is impossible to estimate precisely the minimum amount of data an AI project requires. The nature of the project strongly influences how much data you will need. For example, text, images, and video usually require more data.

Technical explanation

In a real-world setting, you often only have a small dataset to work with. Models trained on a small number of observations tend to overfit and produce inaccurate results.

What is the minimum amount of data needed?

There should be enough data to represent the population; in other words, the sample should not be biased. Refer to the article on understanding bias listed under Reference for details.

As a rule of thumb, at least 10 samples per variable (feature) are needed. Refer here for more detail.
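
The rule of thumb above is simple arithmetic, and it can be sketched as a quick sanity check. The function names and the default of 10 are illustrative assumptions taken from the rule as stated; the result is a rough lower bound, not a guarantee.

```python
def min_samples_rule_of_thumb(n_features, samples_per_feature=10):
    """Return the rough minimum sample count suggested by the rule of thumb."""
    if n_features < 0:
        raise ValueError("n_features must be non-negative")
    return n_features * samples_per_feature


def has_enough_samples(n_samples, n_features, samples_per_feature=10):
    """True if the dataset meets the rule-of-thumb minimum."""
    return n_samples >= min_samples_rule_of_thumb(n_features, samples_per_feature)


# A dataset with 12 features would need roughly 120 samples.
needed = min_samples_rule_of_thumb(12)
enough = has_enough_samples(n_samples=80, n_features=12)
```

Here 80 samples fall short of the 120 suggested for 12 features, so the check returns False.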

Impact of less data

- Poor generalisation to unseen data
- Underfitting
- Less data left for the validation and test phases

Approaches

Use simpler classifiers that work well with less data
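
As one possible illustration of a simple classifier, the sketch below implements a nearest-centroid classifier in plain Python. It learns only one mean vector per class, so it has very few parameters to fit from a handful of samples. The helper names and the toy data are hypothetical; in practice a library implementation would usually be preferred.

```python
import math
from collections import defaultdict


def fit_centroids(X, y):
    """Return {label: centroid}, where a centroid is the per-class feature mean."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def predict(centroids, row):
    """Assign the label of the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))


# Toy data (made up for illustration): two classes, three samples each.
X = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]]
y = ["a", "a", "a", "b", "b", "b"]
centroids = fit_centroids(X, y)
label_near_a = predict(centroids, [1.0, 1.0])
label_near_b = predict(centroids, [3.0, 3.0])
```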

Increase data size

- Augmentation by modifying real data
- Artificial data synthesis
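
Both ways of increasing data size can be sketched on small numeric feature vectors. This is a minimal illustration under assumptions: augmentation is shown as adding small Gaussian jitter to real samples, and synthesis as interpolating between two real samples of the same class (the idea behind SMOTE-style methods). The function names, the noise scale, and the toy data are hypothetical, and whether such transformations preserve the label depends on the domain.

```python
import random


def augment_with_noise(sample, scale=0.05, rng=random):
    """Augmentation: return a copy of a real sample with small Gaussian jitter."""
    return [x + rng.gauss(0.0, scale) for x in sample]


def synthesize_between(a, b, rng=random):
    """Synthesis: return a new point on the segment between two same-class samples."""
    t = rng.random()
    return [x + t * (y - x) for x, y in zip(a, b)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
class_a = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]]  # toy samples of one class

augmented = [augment_with_noise(s, rng=rng) for s in class_a]
synthetic = [synthesize_between(*rng.sample(class_a, 2), rng=rng) for _ in range(5)]
enlarged = class_a + augmented + synthetic
```

The three real samples grow to eleven. The synthetic points always lie between two real samples, so they stay inside the region the class already occupies.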

Reduce the need for separate test and validation data

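k-fold cross-validation works well without setting aside a separate, dedicated validation set: the data is split into k folds, and each fold is used once for validation while the remaining folds are used for training. Refer to the article on the role of cross-validation data listed under Reference for details.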
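
The splitting step of k-fold cross-validation can be sketched in plain Python as follows. This is a minimal version that assumes the samples are already in random order (it does not shuffle), and `k_fold_indices` is a hypothetical helper name; libraries such as scikit-learn provide an equivalent `KFold` splitter.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    # Train on train_idx and evaluate on val_idx here, then average the k scores.
    pass
```

Across the k rounds every sample is used for training k - 1 times and for validation once, which is why no data has to be held out permanently for validation.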

Remove redundant features

With a fixed number of training samples, the average (expected) predictive power of a classifier or regressor first increases as the number of dimensions or features grows, but beyond a certain dimensionality it starts to deteriorate instead of improving steadily. This effect is part of the curse of dimensionality (see the Wikipedia entry listed under Reference). With little data, removing redundant features therefore keeps the dimensionality low relative to the number of samples.
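
One simple way to spot redundant features is to drop any feature that is almost perfectly correlated with one already kept. The sketch below assumes numeric columns and a Pearson correlation threshold of 0.95; the threshold, the helper names, and the toy columns are illustrative assumptions rather than a recommended setting.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant column: correlation undefined, treated as 0 here
    return cov / math.sqrt(var_a * var_b)


def drop_redundant(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # same information, other unit
    "age": [34.0, 21.0, 45.0, 29.0, 52.0],
}
reduced = drop_redundant(columns)
```

In this toy example `height_in` is dropped because it duplicates `height_cm`, while `age` is kept.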

Feature extraction

Feature extraction combines existing features into a new, more useful feature that can carry more importance in the model, so the model is trained on more informative inputs. (To be verified: whether this helps with small datasets.)
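
As a small illustration of combining existing features, the sketch below derives one ratio feature from two raw ones. The field names (`weight_kg`, `height_m`, `bmi`) and the records are hypothetical examples chosen for illustration, not taken from any dataset discussed here.

```python
def add_bmi(record):
    """Return a copy of the record with a combined feature derived from two raw ones."""
    enriched = dict(record)  # copy so the original record is left untouched
    enriched["bmi"] = record["weight_kg"] / record["height_m"] ** 2
    return enriched


records = [
    {"weight_kg": 70.0, "height_m": 1.75},
    {"weight_kg": 90.0, "height_m": 1.80},
]
enriched = [add_bmi(r) for r in records]
```

The derived feature expresses the relationship between the two raw values directly, so the model does not have to learn that relationship from scarce data.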

Remove noise

With a small dataset, outliers and noise can have a huge impact on the model, because each sample carries more weight. Noise in a small dataset can also cause the model to overfit (to be verified). So, when working with scarce data, you’ll need to identify and remove outliers and noise.
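
One common way to identify outliers in a single numeric feature is the interquartile-range (IQR) rule, sketched below with the standard library. The multiplier of 1.5 is the conventional default and is an assumption here, as are the helper name and the toy values. Whether a flagged point is really noise should still be checked against domain knowledge.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]; the input order is preserved."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 42.0]  # toy data with one obvious outlier
cleaned = remove_outliers_iqr(values)
```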

Dropout regularisation

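Dropout randomly omits a fraction of the hidden units (half, in the original paper) on each training case. This prevents units from co-adapting and reduces overfitting when a large network is trained on a small training set [Refer https://arxiv.org/abs/1207.0580]. How much it helps on very small datasets is still to be understood.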
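
The mechanism can be sketched on a plain list of activations. Note the assumption: this is the "inverted" variant, which rescales the surviving units during training, whereas the referenced paper instead halves the outgoing weights at test time. The function name and parameters are illustrative and do not correspond to any library's API.

```python
import random


def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p_drop during training
    and rescale the survivors so the expected activation is unchanged."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        return list(activations)  # at evaluation time all units are kept
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [1.0] * 10
train_out = dropout(hidden, p_drop=0.5, rng=rng)
eval_out = dropout(hidden, training=False)
```

During training each unit is either zeroed or scaled by 1 / (1 - p_drop); at evaluation time the activations pass through unchanged.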

Point to remember

Reference

https://www.kdnuggets.com/2019/06/5-ways-lack-data-machine-learning.html

https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-cross-validation-data-in-machine-learning

https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/understanding-bias-in-machine-learning

https://hackernoon.com/7-effective-ways-to-deal-with-a-small-dataset-2gyl407s

https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/handling-large-training-dataset-in-machine-learning#TOC-Point-to-remember

https://en.wikipedia.org/wiki/Curse_of_dimensionality#Machine_Learning

https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-redundant-features-in-machine-learning

https://towardsdatascience.com/problems-in-machine-learning-models-check-your-data-first-f6c2c88c5ec2