Sklearn provides a library of transformers for data preprocessing.
Data cleaning (sklearn.preprocessing, sklearn.impute), such as standardization and missing value imputation
Feature extraction (sklearn.feature_extraction)
Feature reduction (sklearn.decomposition, e.g. PCA)
Feature expansion (sklearn.kernel_approximation)
Part 1. Feature extraction
sklearn.feature_extraction has useful APIs to extract features from data:
DictVectorizer
Converts lists of mappings (dicts) of feature name to feature value into a matrix (a NumPy array if sparse=False, otherwise a scipy.sparse matrix).
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
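For illustration, a minimal sketch of how it is typically used (the measurements dicts below are made up; get_feature_names_out requires scikit-learn >= 1.0):

from sklearn.feature_extraction import DictVectorizer

# hypothetical samples: each sample is a dict of feature name -> feature value
measurements = [
    {'city': 'Dubai', 'temperature': 33.0},
    {'city': 'London', 'temperature': 12.0},
    {'city': 'San Francisco', 'temperature': 18.0},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(measurements)   # string features are one-hot encoded, numeric ones passed through
print(X.shape)                        # (3, 4)
print(vec.get_feature_names_out())    # ['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']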
FeatureHasher
High-speed, low-memory vectorizer that uses the feature hashing technique (the "hashing trick").
Instead of building a hash table of the features, as the vectorizers do, it applies a hash function to the features to determine their column index in sample matrices directly.
This results in increased speed and reduced memory usage, at the expense of inspectability; the hasher does not remember what the input features looked like and has no inverse_transform method.
The output of this transformer is always a scipy.sparse matrix.
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=10, input_type='string')
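Because hashing is stateless, no fitting is needed and transform can be called directly. A minimal sketch with made-up token lists (with input_type='string', each sample is an iterable of raw string tokens):

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=10, input_type='string')

# two illustrative samples, each a list of string tokens
raw_X = [['dog', 'cat', 'dog'], ['fox', 'run']]

X = hasher.transform(raw_X)   # each token's hash picks its column; repeated tokens accumulate
print(type(X), X.shape)       # scipy.sparse CSR matrix of shape (2, 10)
print(X.toarray())            # entries are +/-1 counts (signs come from the default alternate_sign=True)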