Part 2: Feature selection
Filter based
Wrapper based
Sometimes in a real-world dataset, not all features contribute meaningfully to fitting a model.
Features that do not contribute significantly can be removed. This reduces the size of the dataset and hence the computational cost of fitting a model. sklearn.feature_selection provides many APIs to
accomplish this task.
Filter based feature selection methods
Removing features with low variance
VarianceThreshold
Removes all features whose variance is below a user-specified threshold from the input feature matrix.
By default it removes features that have the same value in every sample, i.e. zero variance.
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1)
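A minimal usage sketch (the toy matrix X below is made up for illustration):
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0, 1.5],
              [0.0, 1.0, 0.5],
              [0.0, 3.0, 1.0]])        # first column is constant (zero variance)
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)   # drops the constant first column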
Univariate feature selection
Univariate feature selection selects features based on univariate statistical tests.
sklearn also provides a family of univariate feature selection methods that apply common univariate
statistical tests to each feature:
SelectFpr selects features based on a false positive rate test.
SelectFdr selects features based on an estimated false discovery rate.
SelectFwe selects features based on family-wise error rate.
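For example, a minimal sketch of SelectFpr with the chi-square test (the alpha value here is illustrative):
from sklearn.feature_selection import SelectFpr, chi2
selector = SelectFpr(chi2, alpha=0.05)   # keep features whose test p-value is below alpha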
Univariate scoring function
Mutual information (MI) and chi-squared scoring functions are recommended for sparse data.
Do not use a regression scoring function on a classification problem; it will lead to useless results.
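For reference, the scoring functions provided in sklearn.feature_selection, grouped by problem type:
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif        # classification
from sklearn.feature_selection import f_regression, mutual_info_regression        # regression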
SelectKBest
Selects the 20 best features based on the chi-square scoring function.
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(chi2, k=20)
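A minimal sketch on synthetic data (dataset parameters are illustrative; f_classif is used here because the synthetic features can be negative, while chi2 requires non-negative inputs):
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
X, y = make_classification(n_samples=200, n_features=40, random_state=0)
X_new = SelectKBest(f_classif, k=20).fit_transform(X, y)   # keeps the 20 highest-scoring features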
SelectPercentile
Selects features in the top 20 percentile based on the chi-square scoring function.
from sklearn.feature_selection import SelectPercentile, chi2
selector = SelectPercentile(chi2, percentile=20)
GenericUnivariateSelect
Selects a set of features based on a configurable feature selection mode and a scoring function.
The mode can be 'percentile' (default), 'k_best', 'fpr', 'fdr' or 'fwe'.
The param argument takes a value corresponding to the chosen mode, e.g. the number of features for 'k_best'.
The example below selects the 20 best features based on the chi-square scoring function.
from sklearn.feature_selection import GenericUnivariateSelect, chi2
selector = GenericUnivariateSelect(chi2, mode='k_best', param=20)
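The same transformer can run a different statistical test simply by changing the mode; a sketch (the alpha value is illustrative):
from sklearn.feature_selection import GenericUnivariateSelect, chi2
selector = GenericUnivariateSelect(chi2, mode='fpr', param=0.05)   # here param is interpreted as alpha for the FPR test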
Wrapper based feature selection methods
Unlike filter based methods, wrapper based methods use an estimator rather than a scoring function.
Recursive Feature Elimination (RFE)
Uses an estimator to recursively remove features.
Initially fits an estimator on all features.
Obtains feature importance from the estimator and removes the least important feature.
Repeats the process, removing features one by one, until the desired number of features is obtained.
RFECV: use this if we do not want to specify the desired number of features in RFE.
It performs RFE in a cross-validation loop to find the optimal number of features.
from sklearn.feature_selection import RFE
from sklearn.svm import SVC
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=5)
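A sketch of the cross-validated variant with the same linear SVC (the cv value is illustrative):
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC
selector = RFECV(estimator=SVC(kernel="linear"), cv=5)   # chooses the number of features via cross-validation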
SelectFromModel
The feature importance is obtained via coef_, feature_importances_ or an importance_getter callable from the trained estimator.
Selects the desired number of important features (as specified with the max_features parameter) whose importance is above a certain threshold, as obtained from the trained estimator.
The feature importance threshold can be specified numerically or through a string argument based on built-in heuristics such as 'mean', 'median' and float multiples of these like '0.1*mean'.
Let's look at a concrete example of SelectFromModel
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
selector = SelectFromModel(LinearSVC(C=0.01, penalty="l1", dual=False))
Here we use an L1-penalized linear support vector classifier to get feature coefficients for the SelectFromModel transformer.
It ends up selecting the features with non-zero weights (coefficients).
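Given a feature matrix X and labels y (assumed, not shown here), the transformer is used like any other selector:
selector.fit(X, y)
mask = selector.get_support()      # boolean mask of the selected (non-zero coefficient) features
X_new = selector.transform(X)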
Sequential feature selection
Performs feature selection by greedily adding or removing features one at a time.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
selector = SequentialFeatureSelector(LogisticRegression(), n_features_to_select=5, direction='forward')
The direction parameter controls whether forward or backward SFS is used.
In general, forward and backward selection do not yield equivalent results.
Choose the direction that is more efficient for the required number of selected features:
When we want to select 7 out of 10 features,
Forward selection would need to perform 7 iterations.
Backward selection would only need to perform 3.
Backward selection seems to be a reasonable choice here.
SFS does not require the underlying model to expose a coef_ or feature_importances_ attribute, unlike RFE and SelectFromModel.
SFS may be slower than RFE and SelectFromModel since it needs to evaluate more models than the other two approaches.
For example, in backward selection, the iteration going from m features to m - 1 features using k-fold cross-validation requires fitting m * k models, while RFE would require only a single fit, and
SelectFromModel performs a single fit and requires no iterations.
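A sketch of the 7-out-of-10 case above with backward selection (synthetic data; parameters are illustrative):
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=7, direction='backward')
X_new = selector.fit_transform(X, y)   # 3 backward iterations remove 3 features, leaving 7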