Handling missing values
Preprocessing - Data Cleaning
Summary
SimpleImputer
Fills missing values with one of the following strategies: 'mean', 'median', 'most_frequent', or 'constant'.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
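A minimal usage sketch (the toy matrix below is illustrative): each NaN is replaced by the mean of the non-missing values in its column.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan]])
imputer = SimpleImputer(strategy='mean')
# Column means ignore NaN: 6.0 for column 0, 3.0 for column 1
print(imputer.fit_transform(X))
# [[7. 2.]
#  [6. 4.]
#  [5. 3.]]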
KNNImputer
Uses a k-nearest-neighbours approach to fill missing values in a dataset.
The missing value of an attribute in a given sample is filled with the mean value of the same attribute over its n_neighbors closest neighbours.
The nearest neighbours are found using Euclidean distance, computed over the coordinates that are not missing.
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=5)
Example: KNNImputer
Consider the following feature matrix.
It has 4 samples and 2 missing values.
Let's fill in the missing values with KNNImputer.
Computing Euclidean distance in the presence of missing values
When a coordinate is missing in either sample, it is skipped and the distance is rescaled to compensate: dist(x, y) = sqrt(weight * squared distance over present coordinates), where weight = (total number of coordinates) / (number of coordinates present in both samples).
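Since the original matrix is not reproduced here, the sketch below uses a hypothetical 4 x 2 matrix with 2 missing values. The two incomplete rows share no non-missing coordinate, so each missing entry is filled from the two complete rows.
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: 4 samples, 2 features, 2 missing values
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [8.0, np.nan]])
knn_imputer = KNNImputer(n_neighbors=2)
print(knn_imputer.fit_transform(X))
# [[1. 2.]
#  [3. 4.]
#  [2. 6.]   <- mean of column 0 over the 2 nearest complete rows
#  [8. 3.]]  <- mean of column 1 over the 2 nearest complete rows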
1.2 Numeric transformers
1. Feature scaling
2. Polynomial transformation
3. Discretization
Feature scaling
Numerical features with different scales lead to slower convergence of iterative optimization procedures.
It is good practice to scale numerical features so that all of them are on the same scale.
StandardScaler
Transforms the original feature vector into a new feature vector using the formula x' = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the feature.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
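A short sketch with illustrative values; the fitted mean is 2 and the (population) standard deviation is about 0.816.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])
scaler = StandardScaler()
print(scaler.fit_transform(X))
# [[-1.2247...]
#  [ 0.     ]
#  [ 1.2247...]]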
MinMaxScaler
Scales each feature to a given range, [0, 1] by default, using x' = (x - min) / (max - min).
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
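A sketch with illustrative values; here min = 1 and max = 9, so x' = (x - 1) / 8.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [9.0]])
minmax_scaler = MinMaxScaler()
print(minmax_scaler.fit_transform(X))
# [[0. ]
#  [0.5]
#  [1. ]]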
MaxAbsScaler
Scales each feature by its maximum absolute value, so that values lie in [-1, 1]; the data is not shifted or centred.
from sklearn.preprocessing import MaxAbsScaler
maxabs_scaler = MaxAbsScaler()
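A sketch with illustrative values; the maximum absolute value is 8, so every entry is divided by 8 and signs are preserved.
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[-4.0], [2.0], [8.0]])
maxabs_scaler = MaxAbsScaler()
print(maxabs_scaler.fit_transform(X))
# [[-0.5 ]
#  [ 0.25]
#  [ 1.  ]]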
FunctionTransformer
Constructs a transformer from an arbitrary user-defined function, which is applied to the whole feature matrix.
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(func=lambda x: x ** 2, validate=True)
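Applying the squaring transformer defined above to a toy matrix (values are illustrative):
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(transformer.fit_transform(X))  # element-wise square
# [[ 1.  4.]
#  [ 9. 16.]]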
Polynomial transformation
Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
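A sketch: for a single sample [a, b] = [2, 3] and degree 2, the generated columns are 1, a, b, a^2, ab, b^2.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))
# [[1. 2. 3. 4. 6. 9.]]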
KBinsDiscretizer
Divides a continuous variable into bins.
One-hot or ordinal encoding is then applied to the bin labels.
from sklearn.preprocessing import KBinsDiscretizer
kbins = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
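A sketch with illustrative values spanning [0, 10]: with strategy='uniform' the 5 bins all have width 2, and encode='ordinal' returns the bin index.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [2.5], [5.0], [7.5], [10.0]])
kbins = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
print(kbins.fit_transform(X))
# [[0.]
#  [1.]
#  [2.]
#  [3.]
#  [4.]]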
1.3 Categorical transformers
1. Feature encoding
2. Label encoding
OneHotEncoder
Encodes a categorical feature or label as a one-hot numeric array.
Creates one binary column for each unique value.
In each row, exactly one column holds 1 and the rest hold 0.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
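A sketch with made-up colour categories; the categories are ordered alphabetically (blue, green, red), and each row has exactly one 1.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['red'], ['green'], ['blue']])
encoder = OneHotEncoder(sparse_output=False)
print(encoder.fit_transform(X))
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]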
LabelEncoder
Encodes target labels with values between 0 and K - 1, where K is the number of distinct values.
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
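A sketch with made-up labels; classes are sorted, so bird=0, cat=1, dog=2.
from sklearn.preprocessing import LabelEncoder

y = ['cat', 'dog', 'cat', 'bird']
label_encoder = LabelEncoder()
print(label_encoder.fit_transform(y))  # [1 2 1 0]
print(label_encoder.classes_)          # ['bird' 'cat' 'dog']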
OrdinalEncoder
Encodes categorical features with values between 0 and K - 1, where K is the number of distinct values of the feature.
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
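A sketch with made-up categories. By default the integer codes follow alphabetical order (high=0, low=1, medium=2); pass categories=[['low', 'medium', 'high']] when the feature has a meaningful order.
from sklearn.preprocessing import OrdinalEncoder

X = [['low'], ['high'], ['medium']]
ordinal_encoder = OrdinalEncoder()
print(ordinal_encoder.fit_transform(X))
# [[1.]
#  [0.]
#  [2.]]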
LabelBinarizer
Several regression and binary classification algorithms can be extended to the multi-class setting in a one-vs-all fashion.
This involves training a single regressor or classifier per class.
For this, we need to convert multi-class labels to binary labels, and LabelBinarizer performs this task.
If the estimator supports multiclass data natively, LabelBinarizer is not needed.
from sklearn.preprocessing import LabelBinarizer
label_binarizer = LabelBinarizer()
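A sketch with made-up multi-class labels; each label becomes a one-of-K binary row.
from sklearn.preprocessing import LabelBinarizer

y = [1, 2, 3, 1]
label_binarizer = LabelBinarizer()
print(label_binarizer.fit_transform(y))
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]]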
MultiLabelBinarizer
Encodes multi-label targets as a binary indicator matrix: one column per class, with a 1 in every column whose class is present in the sample.
For example, a movie can belong to several genres at once, so its row may contain more than one 1.
from sklearn.preprocessing import MultiLabelBinarizer
multi_label_binarizer = MultiLabelBinarizer()
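A sketch with made-up movie genres; unlike one-hot encoding, a row may contain several 1s.
from sklearn.preprocessing import MultiLabelBinarizer

movies = [{'Action', 'Comedy'}, {'Drama'}, {'Action', 'Drama'}]
multi_label_binarizer = MultiLabelBinarizer()
print(multi_label_binarizer.fit_transform(movies))
# [[1 1 0]
#  [0 0 1]
#  [1 0 1]]
print(multi_label_binarizer.classes_)  # ['Action' 'Comedy' 'Drama']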
add_dummy_feature
Augments the dataset with a column vector in which every value is 1; this is useful for representing an explicit intercept (bias) term.
from sklearn.preprocessing import add_dummy_feature
X_new = add_dummy_feature(X)
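A sketch with an illustrative 2 x 2 matrix: a column of 1s is prepended to the original features.
import numpy as np
from sklearn.preprocessing import add_dummy_feature

X = np.array([[7.0, 1.0],
              [2.0, 5.0]])
print(add_dummy_feature(X))
# [[1. 7. 1.]
#  [1. 2. 5.]]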