The sample data used for training has to represent the real scenario as closely as possible; otherwise, ML performance suffers. Worse, ML can amplify existing bias. If you would like to know more about this impact, this document can help.
Machine learning bias, also sometimes called algorithm bias or AI bias, is a phenomenon that occurs when an algorithm produces results that are systemically prejudiced due to erroneous assumptions in the machine learning process.
Bias comes from models that are overly simple and fail to capture the trends present in the data set. For example, a linear model may be suitable for one group but not for another.
Similarly, ignoring an important feature during training introduces bias.
Data bias in machine learning is a type of error in which certain elements of a dataset are more heavily weighted and/or represented than others. A biased dataset does not accurately represent a model's use case, resulting in skewed outcomes, low accuracy levels, and analytical errors.
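As a concrete illustration of the definition above, the sketch below (a hypothetical device-type dataset, invented purely for illustration) compares class proportions in the "population" against a biased sample that over-represents one class:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: an even 50/50 split of two device classes.
population = rng.choice(["mobile", "desktop"], size=100_000, p=[0.5, 0.5])

# Biased sample: data happened to be collected from a desktop-heavy channel.
sample = rng.choice(["mobile", "desktop"], size=1_000, p=[0.1, 0.9])

def proportions(arr):
    """Return each class's share of the array as a dict."""
    values, counts = np.unique(arr, return_counts=True)
    return dict(zip(values, counts / counts.sum()))

pop_p = proportions(population)
smp_p = proportions(sample)
print("population:", pop_p)
print("sample:    ", smp_p)

# A large representation gap flags a biased, unrepresentative dataset:
# a model trained on this sample will over-weight desktop behaviour.
gap = abs(pop_p["desktop"] - smp_p["desktop"])
print(f"desktop representation gap: {gap:.2f}")
```

Comparing such per-class shares between the training sample and the deployment population is one of the simplest sanity checks for data bias.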
Sample bias - Sample bias occurs when a dataset does not reflect the realities of the environment in which a model will run. In a statistical sense, the actual PDF (probability density function) of the population doesn't match the sample PDF. Class imbalance is also a kind of sample bias (refer to the class-imbalance link in the references).
Exclusion bias - Deleting valuable data thought to be unimportant. For example, outliers can be useful in anomaly-detection problems; deleting them harms what the ML algorithm can learn.
Measurement bias - Measurement bias occurs when the data collected for training differs systematically from the data encountered in the real world, for example because a faulty sensor or distorting camera was used during collection.
Recall bias - Recall bias is a form of measurement bias that arises during data labeling, when similar examples are labeled inconsistently across the dataset.
Observer bias - Observer bias is the effect of seeing what you expect to see or want to see in data.
Racial bias - Racial bias occurs when data skews in favor of particular demographics.
Association bias - For example, a dataset may contain a collection of jobs in which all men are doctors and all women are nurses. This does not mean that women cannot be doctors and men cannot be nurses. However, as far as the machine learning model is concerned, female doctors and male nurses do not exist.
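The association-bias example above can be sketched as a toy model (the tiny dataset and the "most frequent job per gender" rule are both hypothetical, chosen only to make the failure visible):

```python
from collections import Counter

# Hypothetical toy dataset exhibiting association bias: every man is a
# doctor and every woman is a nurse, so gender perfectly predicts job.
records = [
    ("male", "doctor"), ("male", "doctor"), ("male", "doctor"),
    ("female", "nurse"), ("female", "nurse"), ("female", "nurse"),
]

# A naive "most frequent job per gender" rule learned from this data.
jobs_by_gender = {}
for gender, job in records:
    jobs_by_gender.setdefault(gender, Counter())[job] += 1

model = {g: c.most_common(1)[0][0] for g, c in jobs_by_gender.items()}
print(model)  # {'male': 'doctor', 'female': 'nurse'}

# As far as this model is concerned, a female doctor cannot exist:
print(model["female"])  # prints 'nurse', regardless of any other evidence
```

The spurious gender-to-job association in the data becomes a hard rule in the model, which is exactly the failure the paragraph describes.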
There should be a proper balance between model bias and variance. Here is the reason.
Using an insufficient number of features for training causes high bias, resulting in underfitting. Note that ignoring an important feature is a source of bias here (refer to the diagram linked in the references).
Similarly, using unnecessary features for training causes high variance, resulting in overfitting. Unnecessary features act as noise, and training on noise causes overfitting.
If a learning algorithm is suffering from high bias, getting more and more training examples doesn't help (refer to the learning-curve picture linked in the references).
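The underfitting/overfitting contrast above can be sketched with a toy polynomial fit on synthetic data (the sine target, noise level, and degrees 1, 4, and 15 are arbitrary illustrative choices, not from the referenced diagrams):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a nonlinear target (one sine period) plus noise.
x_train = np.linspace(0.0, 1.0, 30)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0.01, 0.99, 30)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

def fit_errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    p = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_err = float(np.mean((p(x_train) - y_train) ** 2))
    test_err = float(np.mean((p(x_test) - y_test) ** 2))
    return train_err, test_err

# Degree 1: too simple -> high bias -> underfitting
#   (both training and test error stay high).
# Degree 15: too flexible -> high variance -> overfitting
#   (training error shrinks while test error does not follow).
results = {d: fit_errors(d) for d in (1, 4, 15)}
for d, (tr, te) in results.items():
    print(f"degree {d:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

The degree of the polynomial plays the role of "number of features" in the text: too few terms underfits, too many fits the noise.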
Reference
https://stackoverflow.com/questions/2480650/what-is-the-role-of-the-bias-in-neural-networks
https://lionbridge.ai/articles/7-types-of-data-bias-in-machine-learning/
https://images.app.goo.gl/VSqLiK8mAuTDGfuh6
https://www.kdnuggets.com/2019/08/types-bias-machine-learning.html
https://images.app.goo.gl/x5juE7SudcEc8GuL7
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/handling-class-imbalance-in-machine-learning
https://machinelearningmastery.com/what-is-imbalanced-classification/
https://www.borealisai.com/en/blog/tutorial1-bias-and-fairness-ai/
https://www.datacamp.com/community/blog/measuring-bias-in-ml
https://towardsdatascience.com/is-your-machine-learning-model-biased-94f9ee176b67
https://towardsdatascience.com/introducing-model-bias-and-variance-187c5c447793
https://images.app.goo.gl/8SZZszKKpKVbm4w27
https://coursera.org/share/0626a6420beef982feb69f2505424e59
https://images.app.goo.gl/bF13bQToTsNqhZxP7
https://images.app.goo.gl/tTAXjsMjyKcZxaRn9
https://images.app.goo.gl/VwCApSReYyu28ykKA