Handling missing values
Preprocessing - Data Cleaning
Summary
SimpleImputer
Fills missing values with one of the following strategies: 'mean', 'median', 'most_frequent', or 'constant'.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
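A minimal usage sketch (the toy matrix below is illustrative): each NaN is replaced by the mean of the non-missing values in its column.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan]])
imputer = SimpleImputer(strategy='mean')
# Column means ignore NaN: 6.0 for column 0, 3.0 for column 1
print(imputer.fit_transform(X))
# [[7. 2.]
#  [6. 4.]
#  [5. 3.]]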
KNNImputer
Uses a k-nearest-neighbours approach to fill missing values in a dataset.
The missing value of an attribute in a given sample is filled with the mean value of the same attribute over its n_neighbors closest neighbours.
The nearest neighbours are found using Euclidean distance, computed over the coordinates that are not missing.
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=5)
Example: KNNImputer
Consider the following feature matrix.
It has 4 samples and 2 missing values.
Let's fill in the missing values with KNNImputer.
Computing Euclidean distance in the presence of missing values
When a coordinate is missing in either sample, it is skipped and the distance is rescaled to compensate: dist(x, y) = sqrt(weight * squared distance over present coordinates), where weight = (total number of coordinates) / (number of coordinates present in both samples).
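Since the original matrix is not reproduced here, the sketch below uses a hypothetical 4 x 2 matrix with 2 missing values. The two incomplete rows share no non-missing coordinate, so each missing entry is filled from the two complete rows.
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: 4 samples, 2 features, 2 missing values
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [8.0, np.nan]])
knn_imputer = KNNImputer(n_neighbors=2)
print(knn_imputer.fit_transform(X))
# [[1. 2.]
#  [3. 4.]
#  [2. 6.]   <- mean of column 0 over the 2 nearest complete rows
#  [8. 3.]]  <- mean of column 1 over the 2 nearest complete rows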
1.2 Numeric transformers
1. Feature scaling
2. Polynomial transformation
3. Discretization
Feature scaling
Numerical features with different scales lead to slower convergence of iterative optimization procedures.
It is good practice to scale numerical features so that all of them are on the same scale.
StandardScaler
Transforms the original feature vector into a new feature vector using the formula x' = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the feature.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
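A short sketch with illustrative values; the fitted mean is 2 and the (population) standard deviation is about 0.816.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])
scaler = StandardScaler()
print(scaler.fit_transform(X))
# [[-1.2247...]
#  [ 0.     ]
#  [ 1.2247...]]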
MinMaxScaler
Scales each feature to a given range, [0, 1] by default, using x' = (x - min) / (max - min).
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
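A sketch with illustrative values; here min = 1 and max = 9, so x' = (x - 1) / 8.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [9.0]])
minmax_scaler = MinMaxScaler()
print(minmax_scaler.fit_transform(X))
# [[0. ]
#  [0.5]
#  [1. ]]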
MaxAbsScaler
Scales each feature by its maximum absolute value, so that values lie in [-1, 1]; the data is not shifted or centred.
from sklearn.preprocessing import MaxAbsScaler
maxabs_scaler = MaxAbsScaler()
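A sketch with illustrative values; the maximum absolute value is 8, so every entry is divided by 8 and signs are preserved.
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[-4.0], [2.0], [8.0]])
maxabs_scaler = MaxAbsScaler()
print(maxabs_scaler.fit_transform(X))
# [[-0.5 ]
#  [ 0.25]
#  [ 1.  ]]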
FunctionTransformer
Constructs a transformer from an arbitrary user-defined function, which is applied to the whole feature matrix.
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(func=lambda x: x ** 2, validate=True)
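Applying the squaring transformer defined above to a toy matrix (values are illustrative):
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(transformer.fit_transform(X))  # element-wise square
# [[ 1.  4.]
#  [ 9. 16.]]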
Polynomial transformation
Generates a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
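A sketch: for a single sample [a, b] = [2, 3] and degree 2, the generated columns are 1, a, b, a^2, ab, b^2.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))
# [[1. 2. 3. 4. 6. 9.]]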
KBinsDiscretizer
Divides a continuous variable into bins.
One-hot or ordinal encoding is then applied to the bin labels.
from sklearn.preprocessing import KBinsDiscretizer
kbins = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
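A sketch with illustrative values spanning [0, 10]: with strategy='uniform' the 5 bins all have width 2, and encode='ordinal' returns the bin index.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [2.5], [5.0], [7.5], [10.0]])
kbins = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
print(kbins.fit_transform(X))
# [[0.]
#  [1.]
#  [2.]
#  [3.]
#  [4.]]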
1.3 Categorical transformers
1. Feature encoding
2. Label encoding
OneHotEncoder
Encodes a categorical feature or label as a one-hot numeric array.
Creates one binary column for each unique value.
In each row, exactly one column holds 1 and the rest hold 0.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
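A sketch with made-up colour categories; the categories are ordered alphabetically (blue, green, red), and each row has exactly one 1.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([['red'], ['green'], ['blue']])
encoder = OneHotEncoder(sparse_output=False)
print(encoder.fit_transform(X))
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]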
LabelEncoder
Encodes target labels with values between 0 and K - 1, where K is the number of distinct values.
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
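A sketch with made-up labels; classes are sorted, so bird=0, cat=1, dog=2.
from sklearn.preprocessing import LabelEncoder

y = ['cat', 'dog', 'cat', 'bird']
label_encoder = LabelEncoder()
print(label_encoder.fit_transform(y))  # [1 2 1 0]
print(label_encoder.classes_)          # ['bird' 'cat' 'dog']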
OrdinalEncoder
Encodes categorical features with values between 0 and K - 1, where K is the number of distinct values of the feature.
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
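A sketch with made-up categories. By default the integer codes follow alphabetical order (high=0, low=1, medium=2); pass categories=[['low', 'medium', 'high']] when the feature has a meaningful order.
from sklearn.preprocessing import OrdinalEncoder

X = [['low'], ['high'], ['medium']]
ordinal_encoder = OrdinalEncoder()
print(ordinal_encoder.fit_transform(X))
# [[1.]
#  [0.]
#  [2.]]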
LabelBinarizer
Several regression and binary classification algorithms can be extended to the multi-class setting in a one-vs-all fashion.
This involves training a single regressor or classifier per class.
For this, we need to convert multi-class labels to binary labels, and LabelBinarizer performs this task.
If the estimator supports multiclass data natively, LabelBinarizer is not needed.
from sklearn.preprocessing import LabelBinarizer
label_binarizer = LabelBinarizer()
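A sketch with made-up multi-class labels; each label becomes a one-of-K binary row.
from sklearn.preprocessing import LabelBinarizer

y = [1, 2, 3, 1]
label_binarizer = LabelBinarizer()
print(label_binarizer.fit_transform(y))
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [1 0 0]]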
MultiLabelBinarizer
Encodes multi-label targets as a binary indicator matrix: one column per class, with a 1 in every column whose class is present in the sample.
For example, a movie can belong to several genres at once, so its row may contain more than one 1.
from sklearn.preprocessing import MultiLabelBinarizer
multi_label_binarizer = MultiLabelBinarizer()
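A sketch with made-up movie genres; unlike one-hot encoding, a row may contain several 1s.
from sklearn.preprocessing import MultiLabelBinarizer

movies = [{'Action', 'Comedy'}, {'Drama'}, {'Action', 'Drama'}]
multi_label_binarizer = MultiLabelBinarizer()
print(multi_label_binarizer.fit_transform(movies))
# [[1 1 0]
#  [0 0 1]
#  [1 0 1]]
print(multi_label_binarizer.classes_)  # ['Action' 'Comedy' 'Drama']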
add_dummy_feature
Augments the dataset with a column vector in which every value is 1; this is useful for representing an explicit intercept (bias) term.
from sklearn.preprocessing import add_dummy_feature
X_new = add_dummy_feature(X)
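A sketch with an illustrative 2 x 2 matrix: a column of 1s is prepended to the original features.
import numpy as np
from sklearn.preprocessing import add_dummy_feature

X = np.array([[7.0, 1.0],
              [2.0, 5.0]])
print(add_dummy_feature(X))
# [[1. 7. 1.]
#  [1. 2. 5.]]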