Sklearn provides a library of transformers for data preprocessing.
Data cleaning (sklearn.preprocessing, sklearn.impute), such as standardization and missing value imputation
Feature extraction (sklearn.feature_extraction)
Feature reduction (sklearn.decomposition, e.g. PCA)
Feature expansion (sklearn.kernel_approximation)
Part 1. Feature extraction
sklearn.feature_extraction has useful APIs to extract features from data:
DictVectorizer
Converts lists of mappings (dicts) of feature name to feature value into a matrix (a NumPy array if sparse=False, otherwise a scipy.sparse matrix).
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
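For illustration, a minimal sketch of how it is typically used (the measurements dicts below are made up; get_feature_names_out requires scikit-learn >= 1.0):

from sklearn.feature_extraction import DictVectorizer

# hypothetical samples: each sample is a dict of feature name -> feature value
measurements = [
    {'city': 'Dubai', 'temperature': 33.0},
    {'city': 'London', 'temperature': 12.0},
    {'city': 'San Francisco', 'temperature': 18.0},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(measurements)   # string features are one-hot encoded, numeric ones passed through
print(X.shape)                        # (3, 4)
print(vec.get_feature_names_out())    # ['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']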
FeatureHasher
High-speed, low-memory vectorizer that uses the feature hashing technique (the "hashing trick").
Instead of building a hash table of the features, as the vectorizers do, it applies a hash function to the features to determine their column index in sample matrices directly.
This results in increased speed and reduced memory usage, at the expense of inspectability; the hasher does not remember what the input features looked like and has no inverse_transform method.
The output of this transformer is always a scipy.sparse matrix.
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=10, input_type='string')
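Because hashing is stateless, no fitting is needed and transform can be called directly. A minimal sketch with made-up token lists (with input_type='string', each sample is an iterable of raw string tokens):

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=10, input_type='string')

# two illustrative samples, each a list of string tokens
raw_X = [['dog', 'cat', 'dog'], ['fox', 'run']]

X = hasher.transform(raw_X)   # each token's hash picks its column; repeated tokens accumulate
print(type(X), X.shape)       # scipy.sparse CSR matrix of shape (2, 10)
print(X.toarray())            # entries are +/-1 counts (signs come from the default alternate_sign=True)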