Part 2: Feature selection
Filter based
Wrapper based
Sometimes in a real-world dataset, not all features contribute meaningfully to fitting a model.
Features that do not contribute significantly can be removed. This reduces the size of the dataset and hence the computational cost of fitting a model. sklearn.feature_selection provides many APIs to
accomplish this task.
Filter based feature selection methods
Removing features with low variance
VarianceThreshold
Removes all features whose variance is below a user-specified threshold from the input feature matrix.
By default it removes features that have the same value in every sample, i.e. zero variance.
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1)
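A minimal usage sketch (the toy matrix X below is made up for illustration):
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0, 1.5],
              [0.0, 1.0, 0.5],
              [0.0, 3.0, 1.0]])        # first column is constant (zero variance)
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)   # drops the constant first column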
Univariate feature selection
Univariate feature selection selects features based on univariate statistical tests.
sklearn also provides a family of univariate feature selection methods that apply common univariate
statistical tests to each feature:
SelectFpr selects features based on a false positive rate test.
SelectFdr selects features based on an estimated false discovery rate.
SelectFwe selects features based on family-wise error rate.
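For example, a minimal sketch of SelectFpr with the chi-square test (the alpha value here is illustrative):
from sklearn.feature_selection import SelectFpr, chi2
selector = SelectFpr(chi2, alpha=0.05)   # keep features whose test p-value is below alpha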
Univariate scoring function
Mutual information (MI) and chi-squared scoring functions are recommended for sparse data.
Do not use a regression scoring function on a classification problem; it will lead to useless results.
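For reference, the scoring functions provided in sklearn.feature_selection, grouped by problem type:
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif        # classification
from sklearn.feature_selection import f_regression, mutual_info_regression        # regression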
SelectKBest
Selects the 20 best features based on the chi-square scoring function.
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(chi2, k=20)
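A minimal sketch on synthetic data (dataset parameters are illustrative; f_classif is used here because the synthetic features can be negative, while chi2 requires non-negative inputs):
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
X, y = make_classification(n_samples=200, n_features=40, random_state=0)
X_new = SelectKBest(f_classif, k=20).fit_transform(X, y)   # keeps the 20 highest-scoring features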
SelectPercentile
Selects features in the top 20 percentile based on the chi-square scoring function.
from sklearn.feature_selection import SelectPercentile, chi2
selector = SelectPercentile(chi2, percentile=20)
GenericUnivariateSelect
Selects a set of features based on a configurable feature selection mode and a scoring function.
The mode can be 'percentile' (default), 'k_best', 'fpr', 'fdr' or 'fwe'.
The param argument takes a value corresponding to the chosen mode, e.g. the number of features for 'k_best'.
The example below selects the 20 best features based on the chi-square scoring function.
from sklearn.feature_selection import GenericUnivariateSelect, chi2
selector = GenericUnivariateSelect(chi2, mode='k_best', param=20)
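The same transformer can run a different statistical test simply by changing the mode; a sketch (the alpha value is illustrative):
from sklearn.feature_selection import GenericUnivariateSelect, chi2
selector = GenericUnivariateSelect(chi2, mode='fpr', param=0.05)   # here param is interpreted as alpha for the FPR test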
Wrapper based feature selection methods
Unlike filter based methods, wrapper based methods use an estimator rather than a scoring function.
Recursive Feature Elimination (RFE)
Uses an estimator to recursively remove features.
Initially fits an estimator on all features.
Obtains feature importance from the estimator and removes the least important feature.
Repeats the process, removing features one by one, until the desired number of features is obtained.
RFECV: use this if we do not want to specify the desired number of features in RFE.
It performs RFE in a cross-validation loop to find the optimal number of features.
from sklearn.feature_selection import RFE
from sklearn.svm import SVC
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=5)
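A sketch of the cross-validated variant with the same linear SVC (the cv value is illustrative):
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC
selector = RFECV(estimator=SVC(kernel="linear"), cv=5)   # chooses the number of features via cross-validation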
SelectFromModel
The feature importance is obtained via coef_, feature_importances_ or an importance_getter callable from the trained estimator.
Selects the desired number of important features (as specified with the max_features parameter) whose importance is above a certain threshold, as obtained from the trained estimator.
The feature importance threshold can be specified numerically or through a string argument based on built-in heuristics such as 'mean', 'median' and float multiples of these like '0.1*mean'.
Let's look at a concrete example of SelectFromModel
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
selector = SelectFromModel(LinearSVC(C=0.01, penalty="l1", dual=False))
Here we use an L1-penalized linear support vector classifier to get feature coefficients for the SelectFromModel transformer.
It ends up selecting the features with non-zero weights (coefficients).
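Given a feature matrix X and labels y (assumed, not shown here), the transformer is used like any other selector:
selector.fit(X, y)
mask = selector.get_support()      # boolean mask of the selected (non-zero coefficient) features
X_new = selector.transform(X)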
Sequential feature selection
Performs feature selection by greedily adding or removing features one at a time.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
selector = SequentialFeatureSelector(LogisticRegression(), n_features_to_select=5, direction='forward')
The direction parameter controls whether forward or backward SFS is used.
In general, forward and backward selection do not yield equivalent results.
Choose the direction that is more efficient for the required number of selected features:
When we want to select 7 out of 10 features,
Forward selection would need to perform 7 iterations.
Backward selection would only need to perform 3.
Backward selection seems to be a reasonable choice here.
SFS does not require the underlying model to expose a coef_ or feature_importances_ attribute, unlike RFE and SelectFromModel.
SFS may be slower than RFE and SelectFromModel since it needs to evaluate more models than the other two approaches.
For example, in backward selection, the iteration going from m features to m - 1 features using k-fold cross-validation requires fitting m * k models, while RFE would require only a single fit, and
SelectFromModel performs a single fit and requires no iterations.
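A sketch of the 7-out-of-10 case above with backward selection (synthetic data; parameters are illustrative):
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=7, direction='backward')
X_new = selector.fit_transform(X, y)   # 3 backward iterations remove 3 features, leaving 7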