Layman's explanation
The accuracy of an ML model hinges directly on high-quality training data. However, when data is collected, humans make mistakes and instruments are imprecise, so the collected data always carries some error. This error is referred to as noise.
Noise creates trouble for machine learning algorithms because, if a model is not trained properly, it can mistake noise for a pattern and start generalising from it, which is undesirable.
If you would like to learn about noise-handling approaches, this document should help.
In data science, noise refers to unwanted data items, features, or records that do not help explain the feature itself or the relationship between a feature and the target. Noise often causes algorithms to miss patterns in the data.
Every ML dataset has two parts:
Signal - The signal is the real pattern that we hope to capture and describe. It is the information that we care about, and it is what lets the model generalise to new situations.
Noise - The noise is everything else: for example, imperfections in our sensors, typing errors, and variations driven by forces that we can't or don't try to model.
Noise can cause an ML model to overfit.
If the signal-to-noise ratio is low (that is, the data contains a lot of noise relative to signal), it creates high data bias. Note that high data bias adversely impacts the model's ability to generalise.
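To make the overfitting risk concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the sine-wave signal, the Gaussian noise, the polynomial degrees, and the helper names `true_signal` and `fit_and_score`. It fits a simple and a flexible polynomial to noisy samples and reports the error on the training points and on the noise-free signal.

```python
import numpy as np

def true_signal(x):
    """Hypothetical underlying pattern (the signal) used for this illustration."""
    return np.sin(2 * np.pi * x)

def fit_and_score(noise_std, degree, seed=0):
    """Fit a polynomial to noisy samples of the signal.

    Returns (train_mse, test_mse): the error against the noisy training
    targets and the error against the clean signal on a dense grid.
    """
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, 20)
    x_test = np.linspace(0.0, 1.0, 200)
    # Observed data = signal + noise.
    y_train = true_signal(x_train) + rng.normal(0.0, noise_std, x_train.shape)
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - true_signal(x_test)) ** 2)
    return train_mse, test_mse

if __name__ == "__main__":
    for noise_std in (0.0, 0.5):
        for degree in (3, 12):
            train_mse, test_mse = fit_and_score(noise_std, degree)
            print(f"noise={noise_std} degree={degree} "
                  f"train_mse={train_mse:.4f} test_mse={test_mse:.4f}")
```

With noisy targets, the flexible degree-12 fit typically reaches a lower training error than the degree-3 fit but a higher error against the clean signal, which is the signature of a model that has started to fit the noise.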
??
??
This approach should be tried first. For example, fix the data sources, such as broken sensors.
Dimensionality reduction approach
Dimensionality reduction can be used to remove noise. For example, PCA can smooth out noise and outliers (to be verified). Refer here for the detail.
This can automatically remove outliers from a dataset before it is fed to another learning algorithm. An autoencoder model is one such approach. Refer here for example Colab Python code.
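As a rough illustration of PCA-based denoising, the sketch below projects noisy data onto its leading principal components and reconstructs it. The synthetic data (a 2-dimensional pattern embedded in 10 features), the noise level, and the helper name `pca_denoise` are assumptions for this example, and it covers Gaussian noise only, not outliers.

```python
import numpy as np

def pca_denoise(X, n_components):
    """Project X onto its top principal components and map it back."""
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Keep only the part of the data that lies in the leading subspace.
    return X_centered @ components.T @ components + mean

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))        # hidden 2-D signal
mixing = rng.normal(size=(2, 10))         # spreads the signal over 10 features
clean = latent @ mixing
noisy = clean + rng.normal(scale=0.3, size=clean.shape)

denoised = pca_denoise(noisy, n_components=2)
mse_before = np.mean((noisy - clean) ** 2)
mse_after = np.mean((denoised - clean) ** 2)
print(f"MSE vs clean signal: before={mse_before:.4f} after={mse_after:.4f}")
```

The reconstruction discards the variance that lies outside the two leading directions, which in this setup is almost entirely noise. In practice the number of components is not known in advance and has to be chosen, for example from the explained-variance curve.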
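The idea of filtering outliers by reconstruction error can be sketched as follows. This is not the referenced Colab code: it is a toy linear autoencoder written in plain NumPy, and the synthetic data, the helper name `train_linear_autoencoder`, the learning rate, and the "mean plus three standard deviations" threshold are all assumptions chosen for illustration. A real autoencoder would normally be nonlinear and built with a deep learning library.

```python
import numpy as np

def train_linear_autoencoder(X, n_hidden, lr=0.01, n_steps=3000, seed=0):
    """Train x -> (x @ W_enc) @ W_dec by gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    W_enc = rng.normal(scale=0.1, size=(n_features, n_hidden))
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_features))
    for _ in range(n_steps):
        code = X @ W_enc                      # compressed representation
        residual = code @ W_dec - X           # reconstruction error per entry
        grad_dec = 2.0 * code.T @ residual / n_samples
        grad_enc = 2.0 * X.T @ (residual @ W_dec.T) / n_samples
        W_enc -= lr * grad_enc
        W_dec -= lr * grad_dec
    return W_enc, W_dec

# Synthetic data: 200 points close to a line, plus 5 injected outliers (rows 200-204).
rng = np.random.default_rng(0)
u = np.array([1.0, 1.0]) / np.sqrt(2.0)       # direction of the real pattern
v = np.array([1.0, -1.0]) / np.sqrt(2.0)      # direction perpendicular to it
t = rng.normal(scale=2.0, size=205)
X = np.outer(t, u) + rng.normal(scale=0.05, size=(205, 2))
X[200:] += 3.0 * v                            # push the last 5 rows off the line

X_centered = X - X.mean(axis=0)
W_enc, W_dec = train_linear_autoencoder(X_centered, n_hidden=1)

# Rows the autoencoder reconstructs poorly are treated as outliers.
errors = np.sum((X_centered @ W_enc @ W_dec - X_centered) ** 2, axis=1)
threshold = errors.mean() + 3.0 * errors.std()   # assumed cut-off rule
keep = errors <= threshold
X_filtered = X[keep]
print(f"kept {keep.sum()} of {len(X)} rows")
```

The filtered array `X_filtered` is what would then be passed to the downstream learning algorithm. The cut-off rule matters: a threshold that is too strict also discards unusual but valid records.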
??
Sometimes we may treat data as noise when it is not. For example, including blurred training pictures helps to make a model more robust, so in this case the blurred pictures should not be treated as noise.
https://www.linkedin.com/feed/update/urn:li:activity:6764073661415727104/?updateEntityUrn=urn%3Ali%3Afs_feedUpdate%3A%28V2%2Curn%3Ali%3Aactivity%3A6764073661415727104%29
https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291
https://www.pluralsight.com/guides/use-autoencoders-to-denoise-images
https://colab.research.google.com/drive/1xJUZ1M5_9g-00vTgGxjbUQ9LZtxTzx1P?usp=sharing
https://www.kdnuggets.com/2019/06/separating-signal-noise.html
https://www.quora.com/What-is-noise-in-data-science-machine-learning
https://magoosh.com/data-science/what-is-deep-learning-ai/
https://arxiv.org/pdf/1912.02911.pdf
https://medium.com/ai³-theory-practice-business/google-research-finds-a-way-to-reduce-noise-in-training-data-6d62543e5b4d
https://dl.acm.org/doi/10.1145/3410530.3414366
https://images.app.goo.gl/mexYYr96WRgAbp1E8
https://images.app.goo.gl/BTS5PYnTdWAqzvzK9