Choosing training dataset and test dataset in machine learning

Introduction

Layman's explanation

No element is more essential in machine learning than quality training data. If you want to understand why the quality of training data matters, this document can help.

Technical explanation

Training data refers to the initial data that is used to develop a machine learning model, from which the model creates and refines its rules. The quality of this data has profound implications for the model’s subsequent development, setting a powerful precedent for all future applications that use the same training data.

Machine learning models depend on data. Without a foundation of high-quality training data, even the most performant algorithms can be rendered useless.

The kind of dataset depends on the problem at hand

Your machine learning use case and goals will dictate the kind of data you need and where you can get it. If you are using natural language processing (NLP) to teach a machine to read, understand, and derive meaning from language, you will need a significant amount of text or audio data to train your algorithm.

You would need a different kind of training data if you are working on a computer vision project to teach a machine to recognize or gain understanding of objects that can be seen with the human eye. In this case, you would need labeled images or videos to train your machine learning model to “see” for itself.

Reasons to choose the right training dataset and test dataset distribution

- Performance is reliable only if dataset quality is good. Even a robust machine learning model can be crippled when it is trained on inadequate, inaccurate, or irrelevant data in the early stages.
- The dataset may have missing values. Such values need cleaning.
- The sampling probability distribution may be incorrect, which can cause a class imbalance problem for a classification model.
- The dataset may not have all the necessary features. This can adversely affect prediction accuracy.
- The dataset may have redundant features. Redundant features act as noise and hurt performance. The reference on the role of redundant features, listed under Reference, covers this.
- The dataset may not be a true representation of the entire population. This adversely affects the generalisation capability of the ML model.
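
As a rough illustration of some of the points above, the following minimal Python sketch counts missing values, reports the class distribution, and makes a stratified train/test split so that both sets keep roughly the same class proportions. The function names `audit` and `stratified_split`, the toy rows, and the 25% test fraction are all assumptions made for this example; they do not come from any particular library.

```python
import random
from collections import Counter, defaultdict


def audit(rows, label_key):
    """Return (missing-value count per feature, number of rows per class)."""
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None:
                missing[key] += 1
    class_counts = Counter(row[label_key] for row in rows)
    return dict(missing), dict(class_counts)


def stratified_split(rows, label_key, test_fraction=0.25, seed=0):
    """Split rows so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's ordering is untouched
        rng.shuffle(members)
        # At least one test row per class (a one-row class goes entirely to test).
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical toy dataset: 8 "ham" rows, 4 "spam" rows, one missing value.
rows = [{"length": i, "label": "ham"} for i in range(8)]
rows += [{"length": 100 + i, "label": "spam"} for i in range(4)]
rows[0]["length"] = None

missing, class_counts = audit(rows, "label")
train, test = stratified_split(rows, "label")
print("missing values:", missing)
print("class counts:", class_counts)
print("train size:", len(train), "test size:", len(test))
```

Splitting per class, rather than shuffling the whole dataset once, is one simple way to stop a minority class from disappearing from the test set by chance. The sketch does not address missing or redundant features, which usually need domain knowledge to detect.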

Real-life dataset vs artificial dataset

A real-life dataset is highly valuable for training. However, it has a limitation: some observations may never have been captured in it. An artificial dataset can be generated to cover such cases, which makes it possible to understand the limitations of a machine learning model.
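
The idea can be sketched with a toy example. Here a 1-nearest-neighbour classifier is fitted on a "real-life" sample that only covers the values 0 to 9, and is then evaluated on artificial points generated from a known rule, including a range that the real sample never captured. The rule in `true_label`, the value ranges, and the choice of classifier are all assumptions made purely for illustration.

```python
def true_label(x):
    # Assumed ground-truth rule; it is known only because the data is artificial.
    return 1 if 5 <= x < 15 else 0


def nearest_neighbour_predict(train, x):
    """Predict the label of the closest training point (1-nearest neighbour)."""
    _, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_y


def accuracy(train, points):
    """Fraction of points whose predicted label matches the true label."""
    correct = sum(nearest_neighbour_predict(train, x) == y for x, y in points)
    return correct / len(points)


# "Real-life" sample: only the values 0..9 were ever observed.
real = [(x, true_label(x)) for x in range(10)]

# Artificial points inside the observed range (slightly shifted from the real ones).
seen_range = [(x + 0.25, true_label(x + 0.25)) for x in range(10)]
# Artificial points in a range (10..29) that the real sample never captured.
unseen_range = [(x, true_label(x)) for x in range(10, 30)]

acc_seen = accuracy(real, seen_range)
acc_unseen = accuracy(real, unseen_range)
print("accuracy inside the observed range:", acc_seen)
print("accuracy outside the observed range:", acc_unseen)
```

In this toy setup the model looks perfect on points that resemble the real sample, yet it is wrong for most of the unseen range, a weakness that the real-life data alone could not have revealed.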

Reference

https://youtu.be/BqFt6hMDOzw?t=1824

https://www.cloudfactory.com/training-data-guide

https://towardsdatascience.com/what-to-do-when-bad-data-thwarts-machine-learning-success-fb82249aae8b

https://images.app.goo.gl/XRuVHJ7VonH1MPti6

https://youtu.be/PsGRGqMsKnY?t=97

https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-redundant-features-in-machine-learning
