Machine learning isn't all about fancy models and complex optimization algorithms. Many Kagglers, myself included, are happy to acknowledge defeat when the winner employed a more advanced model or engineered some brilliant features, but only a few have the heart to accept defeat in this situation: being beaten by a simple model trained on fewer features. Yes, not all practitioners recognize the role of minimalism in model training.
I am talking here about redundant features in a dataset. Redundant features add no relevant information beyond your other features, either because they are correlated with them or because they can be obtained by a (linear) combination of them, and they act adversely on machine learning. If you are interested to know more, this document will help.
Redundant features act as noise: they increase learning time and reduce the generalisation capability of the ML algorithm.
Redundant features are those which are correlated. In other words, if a feature can be defined as a linear combination of other features, then one of them can be removed (see the example below with two features, X1 and X2).
X2 = X1 - 3
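As a minimal illustration (synthetic data, not from the article), the perfect linear relationship above shows up as a correlation of 1, so either column can be dropped:

```python
import numpy as np
import pandas as pd

# X2 is an exact linear function of X1, so the Pearson correlation
# between the two columns is 1 and either one can be removed without
# losing information.
rng = np.random.default_rng(0)
X1 = rng.normal(size=100)
df = pd.DataFrame({"X1": X1, "X2": X1 - 3})

print(df.corr())                        # off-diagonal entries are 1.0
df_reduced = df.drop(columns=["X2"])    # keep only one of the pair
```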
What about the case where the relationship is non-linear? Normally one of the features can still be removed, but it depends on our ability to identify the relationship function. For example, in the example below with two features (x, y), y can be removed.
It is not always feasible to identify a mathematical equation that captures the relationship.
A zero correlation coefficient does not help in obtaining an independent set of features: a correlation coefficient of zero does not confirm that two random variables are independent. See the parabola example below, where the correlation coefficient is zero even though the variables are fully dependent (a short sketch follows).
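A small sketch of that situation, using y = x² on a range symmetric around zero, where the Pearson coefficient is numerically zero even though y is completely determined by x:

```python
import numpy as np

# y is fully determined by x (y = x**2), yet the Pearson correlation
# coefficient is ~0 because the relationship is non-linear and x is
# symmetric around 0.
x = np.linspace(-1, 1, 1001)
y = x ** 2

print(np.corrcoef(x, y)[0, 1])   # ~0, even though y is redundant given x
```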
To the best of my knowledge, there is no existing algorithm that can optimally reduce the feature set for all cases.
However, for specific cases there are algorithms that can find an optimal set of features. Branch and bound is one such algorithm (see the example picture below and the sketch that follows). For a fairly large feature set (hundreds of features), however, this algorithm is difficult to apply because of the time complexity of the tree search.
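A minimal sketch of the branch-and-bound idea, assuming the criterion J is monotone (removing a feature never increases its value), which is what justifies pruning; the additive criterion used at the end is a toy placeholder, not one from the referenced paper:

```python
from math import inf

def branch_and_bound(n_features, d, J):
    """Select d of n_features indices maximizing a monotone criterion J.

    Assumes J(subset) never increases when a feature is removed, which is
    what allows whole branches of the search tree to be pruned.
    """
    best_score, best_subset = -inf, None

    def recurse(current, start):
        nonlocal best_score, best_subset
        score = J(current)
        if score <= best_score:
            return                      # prune: removing more features cannot help
        if len(current) == d:
            best_score, best_subset = score, tuple(current)
            return
        for i in range(start, len(current)):        # branch on which feature to drop
            recurse(current[:i] + current[i + 1:], i)

    recurse(list(range(n_features)), 0)
    return best_subset, best_score

# Toy (hypothetical) monotone criterion: sum of per-feature scores.
scores = [0.9, 0.1, 0.5, 0.7, 0.3]
subset, value = branch_and_bound(len(scores), 3,
                                 lambda s: sum(scores[i] for i in s))
print(subset, value)   # indices (0, 2, 3), value ~2.1
```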
Sequential forward selection (SFS) is a suboptimal method that avoids the time-complexity issue of branch and bound above. See the picture below for the approach, and the sketch that follows. This is a greedy approach in which the best feature is selected at each step. Note that once a feature (say Feature 2) is added, it is never removed from the list of selected features, so the final list might be suboptimal: the optimal solution might not even include that feature.
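A short sketch of greedy forward selection using scikit-learn's SequentialFeatureSelector; the dataset, estimator, and target of 5 features are illustrative choices only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Greedy forward selection: at each step the feature that most improves
# the cross-validated score of the estimator is added.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support(indices=True))   # indices of the 5 selected features
```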
To improve this further, the plus-L minus-R (LRS) method tries to take care of interdependencies among features while adding or removing them (refer here).
Feature clustering is another technique for finding correlated features. For example, a k-means clustering model can group correlated features together; this paper discusses it in detail, and a sketch follows.
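One possible sketch of the idea, assuming each feature is represented by its row of the absolute correlation matrix and one representative per k-means cluster is kept; the dataset and number of clusters are illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer

# Features with similar correlation profiles land in the same cluster,
# so keeping one feature per cluster removes much of the redundancy.
X, _ = load_breast_cancer(return_X_y=True)
corr = np.abs(np.corrcoef(X, rowvar=False))   # feature-by-feature correlations

k = 10
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(corr)

selected = [np.flatnonzero(labels == c)[0] for c in range(k)]  # one feature per cluster
print(sorted(selected))
```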
PCA is another method, which makes use of the correlation matrix. You can read more about PCA here.
Note that all of the above algorithms need the user to provide the number of features to be selected.
The recently popular LASSO model, by contrast, does not need the user to provide the number of features (see the sketch below).
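A small sketch with LassoCV: the regularization strength is chosen by cross-validation, and the features whose coefficients remain non-zero fall out of the fitted model, with no feature count supplied up front (the dataset is just an example):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# LASSO drives the coefficients of unhelpful features to exactly zero,
# so the selected feature count is a by-product of the fit.
X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(f"alpha={lasso.alpha_:.4f}, kept features: {selected}")
```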
Refer to the colab code for the various feature selection methods.
For selecting the right algorithm, the following criteria are useful.
We can select the algorithm which has the least misclassification error.
We might also be interested in knowing which feature contributed most to the ML learning.
For the first criterion, using statistics and probability, one can theoretically decide which feature selection algorithm has the least misclassification error. This approach uses the Bayes decision rule; this video talks about the approach in detail, and a sketch follows.
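A hypothetical one-dimensional illustration of the Bayes decision rule with known priors and Gaussian class-conditional densities; the numbers are made up for the example:

```python
import numpy as np
from scipy.stats import norm

# With known priors and class-conditional densities, the Bayes rule
# assigns x to the class with the larger posterior; the Bayes
# (misclassification) error is the area under the smaller of the two
# prior-weighted densities.
priors = (0.5, 0.5)
p0 = norm(loc=0.0, scale=1.0)    # p(x | class 0)
p1 = norm(loc=2.0, scale=1.0)    # p(x | class 1)

x = np.linspace(-10.0, 12.0, 20001)
joint0 = priors[0] * p0.pdf(x)
joint1 = priors[1] * p1.pdf(x)

bayes_error = np.minimum(joint0, joint1).sum() * (x[1] - x[0])
print(f"Bayes error ~ {bayes_error:.4f}")   # about 0.159 for these parameters
```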
However, there is a practical difficulty in obtaining priors and class-conditional PDFs in real life, and calculating the misclassification error is not easy in many cases. To handle this, there are techniques which estimate the probability distribution; the kernel density estimator is one such method (see the sketch below). This video and write-up talk about estimation approaches in detail.
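A short sketch of kernel density estimation with scikit-learn, where the unknown density is approximated from samples and could then be plugged into the Bayes rule above; the data and bandwidth are illustrative:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Approximate an unknown class-conditional PDF from samples drawn from it.
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=(500, 1))

kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(samples)
grid = np.linspace(-2, 6, 9).reshape(-1, 1)
print(np.exp(kde.score_samples(grid)))   # estimated density on the grid
```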
For the second criterion, we need an algorithm which reports the most contributing features. For example, LASSO and random-forest-based feature selection models (see the bar chart below and the sketch that follows) can be useful choices. Note that PCA cannot, since it transforms the feature set. Please refer here for the details.
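A sketch of the kind of bar chart mentioned above, using impurity-based importances from a random forest; the dataset is only a stand-in for whatever data you are working with:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Fit a forest and plot per-feature importances as a horizontal bar chart.
data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(data.data, data.target)

plt.figure(figsize=(8, 6))
plt.barh(data.feature_names, rf.feature_importances_)
plt.xlabel("importance")
plt.tight_layout()
plt.show()
```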
How many features should be selected? The value of this hyperparameter depends on the selected algorithm and on the problem at hand.
For example, although an algorithm may ask for a number of features, the actual need might be expressed differently. Take the PCA case: here it is better to decide on the percentage of variance which needs to be preserved (a sketch follows).
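A sketch of that choice with scikit-learn's PCA, where passing a fraction keeps just enough components to explain that share of the variance; the 95% threshold and dataset are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Instead of fixing the number of components, keep enough of them to
# explain 95% of the variance.
X, _ = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95).fit(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```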
Similarly, for LASSO the approach is different: the regularization strength is tuned (for example by cross-validation), and the number of retained features follows from it.
Redundant features adversely impact both training time (it increases) and accuracy (it decreases); you can verify this yourself.
https://youtu.be/y2Jsa4sgD5w
https://images.app.goo.gl/VPfzMfuQgzqN2djz5
https://images.app.goo.gl/SqP3sTSys1qECJrw8
https://youtu.be/y2Jsa4sgD5w?t=1729
https://www.semanticscholar.org/paper/A-Branch-and-Bound-Algorithm-for-Feature-Subset-Narendra-Fukunaga/8aee4e1022b18e7ecad7a963a5f6a3edb3832f2d
https://towardsdatascience.com/the-branch-and-bound-algorithm-a7ae4d227a69
https://youtu.be/QwS20ytvThE?t=1800
https://images.app.goo.gl/v1ids1DXrm4Zb7tK7
https://images.app.goo.gl/uSvvC9T7mEoNjcYy6
https://youtu.be/MtapFEwiiug?t=1810
https://youtu.be/aeEv3-tSvjM
https://www.linkedin.com/posts/dpkumar_machinelearningsolutions-features-datascience-activity-6764073661415727104-YcIL
https://www.researchgate.net/post/Should_a_feature_selection_method_to_remove_redundant_features_always_be_used_or_is_it_only_effective_when_a_certain_dimension_is_reached
https://towardsdatascience.com/feature-selection-why-how-explained-part-1-c2f638d24cdb
https://towardsdatascience.com/the-art-of-finding-the-best-features-for-machine-learning-a9074e2ca60d
https://www.sciencedirect.com/science/article/pii/S2314717218300059
https://youtu.be/q8gVpKl1f-4?t=1330
https://youtu.be/dLflHoGNofY?t=205
https://deepai.org/machine-learning-glossary-and-terms/kernel-density-estimation
https://www.linkedin.com/posts/dpkumar_machinelearningtraining-features-datasciences-activity-6762709881050017792-MhTm