Handling missing values in machine learning datasets

Introduction

Layman's explanation

Missing data is a well-known problem in data science. This document covers feature engineering methods for handling it.

Technical explanation

In statistics, imputation is the process of replacing missing data with substituted values.

Impact of missing data

Missing data can introduce substantial bias, make handling and analyzing the data more arduous, and reduce efficiency.

Benefits of imputation

Imputation preserves all cases by replacing missing data with an estimated value based on other available information.

Imputation Techniques

No single imputation approach fits all problems. Instead, we need to choose the right approach for the problem at hand. Some useful approaches are described below.

Dropping rows with null values

Ensure that dropping these rows will not reduce the generalizability of the models we build.
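
As a rough illustration of row dropping, the sketch below uses plain Python with made-up data. The column names, the values, and the use of None to mark a missing entry are assumptions for this example only. It also reports how much of the dataset is lost, which is one simple signal to look at before deciding whether dropping is acceptable.

```python
# Hypothetical toy data: each row is a dict, and None marks a missing value.
rows = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 41, "income": None},
    {"age": 33, "income": 58000},
]


def drop_rows_with_nulls(rows):
    """Keep only the rows in which every field has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete = drop_rows_with_nulls(rows)

# Share of rows removed; a large share suggests dropping may hurt the model.
fraction_dropped = 1 - len(complete) / len(rows)
print(f"kept {len(complete)} of {len(rows)} rows, dropped {fraction_dropped:.0%}")
```

With a pandas DataFrame, the same step is usually done with DataFrame.dropna.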

Dropping features with high nullity

Before dropping a feature outright, consider subsetting the dataset to the rows where the feature is available and checking its feature importance when a model is trained on that subset. If you discover that the variable is important in the subset where it is defined, consider making an effort to retain it.
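
The check described above might be sketched as follows. This is not a full feature importance analysis: as a stand-in, it uses the absolute Pearson correlation between the feature and the target, computed only on the rows where the feature is present. The data, the column names (rooms, garden_m2, price), and the helper functions are all hypothetical.

```python
import math

# Hypothetical toy data: None marks a missing value.
rows = [
    {"rooms": 3, "garden_m2": None, "price": 200},
    {"rooms": 4, "garden_m2": 50, "price": 260},
    {"rooms": 2, "garden_m2": None, "price": 150},
    {"rooms": 5, "garden_m2": 120, "price": 340},
    {"rooms": 3, "garden_m2": 80, "price": 290},
    {"rooms": 4, "garden_m2": None, "price": 250},
]


def nullity(rows, feature):
    """Fraction of rows in which `feature` is missing."""
    return sum(row[feature] is None for row in rows) / len(rows)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def importance_in_subset(rows, feature, target):
    """Crude importance proxy: |correlation| with the target,
    computed only on rows where both feature and target are present."""
    subset = [r for r in rows if r[feature] is not None and r[target] is not None]
    xs = [r[feature] for r in subset]
    ys = [r[target] for r in subset]
    return abs(pearson(xs, ys))


print("nullity of garden_m2:", nullity(rows, "garden_m2"))
print("importance proxy:", importance_in_subset(rows, "garden_m2", "price"))
```

Here half of the garden_m2 values are missing, yet the feature tracks price closely in the rows where it is defined, which is the situation in which retaining it is worth the effort. In practice, a model-based importance measure computed on the subset would replace the correlation proxy.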

Use statistics to approximate values

Mean substitution (see the picture below) and regression are two examples of statistical methods. This document describes several such approaches.
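
To make the two methods named above concrete, here is a minimal sketch in plain Python. The data and the function names are made up for illustration, None marks a missing value, and the regression variant assumes a single, fully observed predictor with a straight-line relationship.

```python
from statistics import fmean


def mean_substitution(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = fmean(observed)
    return [mean if v is None else v for v in values]


def regression_imputation(xs, ys):
    """Fill each missing y with the prediction of a least-squares line
    fitted on the pairs where y is observed. xs must be fully observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    mx = fmean([x for x, _ in pairs])
    my = fmean([y for _, y in pairs])
    slope = (sum((x - mx) * (y - my) for x, y in pairs)
             / sum((x - mx) ** 2 for x, _ in pairs))
    intercept = my - slope * mx
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]


# Hypothetical toy data: salary is missing for two of the five people.
years_experience = [1, 2, 3, 4, 5]
salary = [30.0, None, 50.0, None, 70.0]

print(mean_substitution(salary))
print(regression_imputation(years_experience, salary))
```

Mean substitution fills both gaps with the same value (the mean of the observed salaries), while regression imputation fills each gap with a value that depends on that person's years of experience.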

References

https://en.wikipedia.org/wiki/Imputation_(statistics)

https://www.kaggle.com/residentmario/simple-techniques-for-missing-data-imputation

https://expertseoinfo.com/missing-data-imputation-feature-engineering/

https://images.app.goo.gl/4QtWY4SvKVJuVQqu8

https://images.app.goo.gl/hZvmaMtzY7hzVmv86
