[TBD] Handling Noise in Machine Learning Data

Introduction

Layman's explanation

ML model accuracy hinges directly on high-quality training data. However, when data is collected, humans make mistakes and instruments are imprecise, so the collected data carries some error. This error is referred to as noise.

Noise creates trouble for machine learning algorithms because, if not trained carefully, an algorithm can mistake noise for a pattern and start generalising from it, which is undesirable.

If you would like to learn about noise-handling approaches, this document can help.

Technical explanation

Noise (in the data science sense) is unwanted data items, features or records that don't help explain the feature itself or the relationship between feature and target. Noise often causes algorithms to miss patterns in the data.

Signal versus noise

Every ML dataset has two parts.

Signal - The signal is the real pattern that we hope to capture and describe. It is the information that we care about, and it is what lets the model generalise to new situations.
Noise - The noise is everything else: for example, imperfections in our sensors, typing errors, and variations driven by forces that we can't or don't try to model.
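To make the two parts concrete, here is a minimal sketch on synthetic data, where the signal and the noise are known separately (in real data they are not observed apart). The linear signal, the noise levels and the helper name signal_to_noise_ratio are all assumptions made for illustration. The ratio of signal variance to noise variance is one common way to summarise how noisy a dataset is.

```python
import random
import statistics


def signal_to_noise_ratio(signal, noise):
    """Ratio of signal variance to noise variance (higher means cleaner data)."""
    return statistics.pvariance(signal) / statistics.pvariance(noise)


rng = random.Random(0)  # fixed seed so the run is repeatable

xs = [i / 10 for i in range(200)]
signal = [2.0 * x + 1.0 for x in xs]              # the real pattern (assumed linear)
small_noise = [rng.gauss(0.0, 0.5) for _ in xs]   # e.g. a precise instrument
large_noise = [rng.gauss(0.0, 5.0) for _ in xs]   # e.g. a faulty instrument

# What we would actually record: signal and noise mixed together.
observed_clean = [s + n for s, n in zip(signal, small_noise)]
observed_noisy = [s + n for s, n in zip(signal, large_noise)]

snr_clean = signal_to_noise_ratio(signal, small_noise)
snr_noisy = signal_to_noise_ratio(signal, large_noise)
print(f"SNR with small noise: {snr_clean:.1f}")
print(f"SNR with large noise: {snr_noisy:.1f}")
```

The same signal yields a much lower ratio once the noise grows, which is the situation where a model is most likely to pick up noise instead of the pattern.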

Impact of noise

- Noise can cause an ML model to overfit.
- If the signal-to-noise ratio is low (that is, noise dominates), the data is highly biased. High data bias adversely impacts the model's ability to generalise.
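The overfitting effect can be illustrated with a small sketch, assuming NumPy is available. The linear signal, the noise level and the polynomial degrees are arbitrary choices for illustration. A flexible model fits the noisy training points more closely than a simple one, but ends up further from the underlying signal.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_signal(x):
    """The underlying pattern (assumed linear for this illustration)."""
    return 2.0 * x + 1.0


# A small noisy training set and a dense, noise-free evaluation grid.
x_train = np.linspace(-1.0, 1.0, 15)
y_train = true_signal(x_train) + rng.normal(scale=0.5, size=x_train.shape)
x_test = np.linspace(-1.0, 1.0, 200)
y_test = true_signal(x_test)  # noise-free, so this error measures distance from the signal


def fit_and_score(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse


simple_train, simple_test = fit_and_score(1)      # matches the true signal's form
flexible_train, flexible_test = fit_and_score(9)  # enough freedom to chase the noise

print(f"degree 1: train MSE {simple_train:.3f}, test MSE {simple_test:.3f}")
print(f"degree 9: train MSE {flexible_train:.3f}, test MSE {flexible_test:.3f}")
```

A lower training error paired with a higher error against the true signal is the usual signature of a model that has generalised from noise.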

How to know your data is noisy?

Noise estimation

Noise reduction approaches

Data cleanup at source

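This approach should be tried first. For example, fix the data sources themselves, such as broken sensors.

Dimensionality reduction approach

A dimensionality reduction approach can be used to remove noise. For example, PCA can smooth out noise by keeping only the components that explain most of the variance and discarding the rest (its effect on outliers still needs verifying). Refer here for the details.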
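As a rough illustration of PCA-based noise reduction, the sketch below projects noisy data onto its leading principal component and maps it back, assuming NumPy is available. The function name pca_denoise, the synthetic 5-dimensional data and the choice of one component are assumptions for this example; on real data the number of components has to be chosen, for instance from the explained variance.

```python
import numpy as np


def pca_denoise(X, n_components):
    """Project X onto its top principal components and map back to the original space."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:n_components]
    return Xc @ top.T @ top + mean


rng = np.random.default_rng(0)

# Synthetic data: a 1-D pattern embedded in 5 dimensions (hypothetical direction).
t = rng.uniform(-1.0, 1.0, size=(500, 1))
direction = np.array([[1.0, 2.0, -1.0, 0.5, 3.0]])
clean = t @ direction
noisy = clean + rng.normal(scale=0.2, size=clean.shape)

denoised = pca_denoise(noisy, n_components=1)

err_before = np.mean((noisy - clean) ** 2)
err_after = np.mean((denoised - clean) ** 2)
print(f"mean squared error before: {err_before:.4f}, after: {err_after:.4f}")
```

The noise that lies outside the retained component is discarded, while noise along that component remains, so the reduction is partial rather than complete.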

Anomaly detection approach

This approach can automatically remove outliers from a dataset before it is fed to another learning algorithm. An autoencoder model is one such approach. Refer here for example Colab Python code.
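The following is not the Colab code referred to above, and it does not train a neural autoencoder. It is a lighter stand-in that follows the same idea: compress the data, reconstruct it, and treat records with a large reconstruction error as outliers. Here PCA plays the role of a very simple linear encoder and decoder. It assumes NumPy is available; the function names, the synthetic data and the cut-off of 10 median absolute deviations are assumptions chosen for illustration.

```python
import numpy as np


def reconstruction_errors(X, n_components):
    """Squared error of each row after compressing to n_components and reconstructing."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:n_components]
    reconstructed = Xc @ top.T @ top
    return np.sum((Xc - reconstructed) ** 2, axis=1)


def remove_outliers(X, n_components=1, n_mads=10.0):
    """Drop rows whose reconstruction error is far above the median error."""
    errors = reconstruction_errors(X, n_components)
    median = np.median(errors)
    mad = np.median(np.abs(errors - median))  # robust measure of spread
    keep = errors <= median + n_mads * mad    # arbitrary cut-off for this sketch
    return X[keep], keep


rng = np.random.default_rng(1)

# 300 records that follow a 1-D pattern in 3-D, plus 3 records that do not.
t = rng.uniform(-1.0, 1.0, size=(300, 1))
inliers = t @ np.array([[2.0, 1.0, -1.0]]) + rng.normal(scale=0.05, size=(300, 3))
outliers = np.array([[0.0, 3.0, 3.0], [0.0, -3.0, -3.0], [1.0, -4.0, 4.0]])
X = np.vstack([inliers, outliers])

cleaned, keep = remove_outliers(X)
print(f"kept {cleaned.shape[0]} of {X.shape[0]} records")
```

The cleaned array can then be passed to the downstream learning algorithm. A threshold this simple may also drop a few legitimate records, so it is worth inspecting what gets removed.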

Using data and feature lists

What if the labels in the training samples are noisy?

Mislabelled data is termed label noise. Label noise can significantly degrade model performance. This paper discusses handling it. Another paper discusses detecting label noise.
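One simple heuristic for spotting suspicious labels, which is not the method of either paper mentioned above, is to flag samples whose label disagrees with the majority label of their nearest neighbours. The sketch below assumes NumPy is available, binary labels 0 and 1, and a small dataset (it builds a full pairwise distance matrix). The function name suspect_label_indices and the synthetic clusters are assumptions for illustration.

```python
import numpy as np


def suspect_label_indices(X, y, k=5):
    """Indices of samples whose label differs from the majority of their k nearest neighbours."""
    # Pairwise squared Euclidean distances between all samples.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Majority vote over binary labels (use an odd k to avoid ties).
    majority = (y[neighbours].mean(axis=1) > 0.5).astype(y.dtype)
    return np.flatnonzero(majority != y)


rng = np.random.default_rng(2)

# Two well-separated clusters, one per class.
class0 = rng.normal(loc=-3.0, scale=0.5, size=(50, 2))
class1 = rng.normal(loc=3.0, scale=0.5, size=(50, 2))
X = np.vstack([class0, class1])
y_true = np.array([0] * 50 + [1] * 50)

# Simulate label noise by flipping a few labels.
flipped = np.array([3, 17, 60, 88])
y_noisy = y_true.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

suspects = suspect_label_indices(X, y_noisy, k=5)
print("suspected mislabelled samples:", suspects.tolist())
```

Flagged samples are candidates for manual review rather than automatic deletion, since points near a genuine class boundary can also disagree with their neighbours.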

Can noise be removed fully?

Point to remember

Sometimes we may treat data as noise when it is not. For example, training on blurred pictures can help make a model more robust. In that case, the blurred pictures should not be treated as noise.
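The blurred-picture case can be sketched as a deliberate augmentation step, assuming NumPy is available and images are 2-D grayscale arrays. The helper names box_blur and augment_with_blur, and the choice of a mean filter, are assumptions for illustration; real pipelines often use Gaussian blur from an image library instead.

```python
import numpy as np


def box_blur(image, radius=1):
    """Blur a 2-D grayscale image with a (2 * radius + 1) squared mean filter."""
    size = 2 * radius + 1
    padded = np.pad(image.astype(float), radius, mode="edge")
    height, width = image.shape
    total = np.zeros((height, width), dtype=float)
    # Sum every shifted copy of the image that falls inside the filter window.
    for dy in range(size):
        for dx in range(size):
            total += padded[dy:dy + height, dx:dx + width]
    return total / (size * size)


def augment_with_blur(images, radius=1):
    """Return the original images followed by one blurred copy of each."""
    return list(images) + [box_blur(img, radius) for img in images]


# A tiny example image: one bright pixel on a dark background.
image = np.zeros((5, 5))
image[2, 2] = 9.0

augmented = augment_with_blur([image])
print(augmented[1])
```

Here the blurred copies are added on purpose so that the model also sees degraded inputs, which is the opposite of filtering them out as noise.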

References

https://www.linkedin.com/feed/update/urn:li:activity:6764073661415727104/?updateEntityUrn=urn%3Ali%3Afs_feedUpdate%3A%28V2%2Curn%3Ali%3Aactivity%3A6764073661415727104%29

https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291

https://www.pluralsight.com/guides/use-autoencoders-to-denoise-images

https://colab.research.google.com/drive/1xJUZ1M5_9g-00vTgGxjbUQ9LZtxTzx1P?usp=sharing

https://www.kdnuggets.com/2019/06/separating-signal-noise.html

https://www.quora.com/What-is-noise-in-data-science-machine-learning

https://magoosh.com/data-science/what-is-deep-learning-ai/

https://arxiv.org/pdf/1912.02911.pdf

https://medium.com/ai³-theory-practice-business/google-research-finds-a-way-to-reduce-noise-in-training-data-6d62543e5b4d

https://dl.acm.org/doi/10.1145/3410530.3414366

https://images.app.goo.gl/mexYYr96WRgAbp1E8

https://images.app.goo.gl/BTS5PYnTdWAqzvzK9
