Machine learning isn't all about fancy models and complex optimization algorithms. Many Kagglers, myself included, are happy to acknowledge defeat when the winner employed a more advanced model or engineered some brilliant features, but only a few have the heart to accept defeat in this situation: being beaten by a simple model trained on fewer features. Yes, not all practitioners recognize the role of minimalism in model training.
I am talking here about redundant features in a dataset. Redundant features add no relevant information beyond your other features, either because they are correlated with them or because they can be obtained by a (linear) combination of them, and they act adversely on machine learning. If you are interested to know more, this document will help.
Redundant features act as noise: they increase learning time and reduce the generalisation capability of the ML algorithm.
Redundant features are those which are correlated. In other words, if a feature can be defined as a linear combination of other features, then one of them can be removed (see the example below with two features, X1 and X2).
X2 = X1 - 3
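As a minimal illustration (synthetic data, not from the article), the perfect linear relationship above shows up as a correlation of 1, so either column can be dropped:

```python
import numpy as np
import pandas as pd

# X2 is an exact linear function of X1, so the Pearson correlation
# between the two columns is 1 and either one can be removed without
# losing information.
rng = np.random.default_rng(0)
X1 = rng.normal(size=100)
df = pd.DataFrame({"X1": X1, "X2": X1 - 3})

print(df.corr())                        # off-diagonal entries are 1.0
df_reduced = df.drop(columns=["X2"])    # keep only one of the pair
```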
What about the case where the relationship is non-linear? Normally one of the features can still be removed, but it depends on our ability to identify the relationship function. For example, in the example below with two features (x, y), y can be removed.
It is not always feasible to identify a mathematical equation that captures the relationship.
A zero correlation coefficient does not help in obtaining an independent set of features: a correlation coefficient of zero does not confirm that two random variables are independent. See the parabola example below, where the correlation coefficient is zero even though the variables are fully dependent (a short sketch follows).
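A small sketch of that situation, using y = x² on a range symmetric around zero, where the Pearson coefficient is numerically zero even though y is completely determined by x:

```python
import numpy as np

# y is fully determined by x (y = x**2), yet the Pearson correlation
# coefficient is ~0 because the relationship is non-linear and x is
# symmetric around 0.
x = np.linspace(-1, 1, 1001)
y = x ** 2

print(np.corrcoef(x, y)[0, 1])   # ~0, even though y is redundant given x
```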
To the best of my knowledge, there is no existing algorithm that can optimally reduce the feature set for all cases.
However, for specific cases there are algorithms that can find an optimal set of features. Branch and bound is one such algorithm (see the example picture below and the sketch that follows). For a fairly large feature set (hundreds of features), however, this algorithm is difficult to apply because of the time complexity of the tree search.
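A minimal sketch of the branch-and-bound idea, assuming the criterion J is monotone (removing a feature never increases its value), which is what justifies pruning; the additive criterion used at the end is a toy placeholder, not one from the referenced paper:

```python
from math import inf

def branch_and_bound(n_features, d, J):
    """Select d of n_features indices maximizing a monotone criterion J.

    Assumes J(subset) never increases when a feature is removed, which is
    what allows whole branches of the search tree to be pruned.
    """
    best_score, best_subset = -inf, None

    def recurse(current, start):
        nonlocal best_score, best_subset
        score = J(current)
        if score <= best_score:
            return                      # prune: removing more features cannot help
        if len(current) == d:
            best_score, best_subset = score, tuple(current)
            return
        for i in range(start, len(current)):        # branch on which feature to drop
            recurse(current[:i] + current[i + 1:], i)

    recurse(list(range(n_features)), 0)
    return best_subset, best_score

# Toy (hypothetical) monotone criterion: sum of per-feature scores.
scores = [0.9, 0.1, 0.5, 0.7, 0.3]
subset, value = branch_and_bound(len(scores), 3,
                                 lambda s: sum(scores[i] for i in s))
print(subset, value)   # indices (0, 2, 3), value ~2.1
```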
Sequential forward selection (SFS) is a suboptimal method that avoids the time-complexity issue of branch and bound above. See the picture below for the approach, and the sketch that follows. This is a greedy approach in which the best feature is selected at each step. Note that once a feature (say Feature 2) is added, it is never removed from the list of selected features, so the final list might be suboptimal: the optimal solution might not even include that feature.
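A short sketch of greedy forward selection using scikit-learn's SequentialFeatureSelector; the dataset, estimator, and target of 5 features are illustrative choices only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Greedy forward selection: at each step the feature that most improves
# the cross-validated score of the estimator is added.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support(indices=True))   # indices of the 5 selected features
```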
To improve this further, the plus-L minus-R (LRS) method tries to take care of interdependencies among features while adding or removing them (refer here).
Feature clustering is another technique for finding correlated features. For example, a k-means clustering model can group correlated features together; this paper discusses it in detail, and a sketch follows.
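One possible sketch of the idea, assuming each feature is represented by its row of the absolute correlation matrix and one representative per k-means cluster is kept; the dataset and number of clusters are illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer

# Features with similar correlation profiles land in the same cluster,
# so keeping one feature per cluster removes much of the redundancy.
X, _ = load_breast_cancer(return_X_y=True)
corr = np.abs(np.corrcoef(X, rowvar=False))   # feature-by-feature correlations

k = 10
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(corr)

selected = [np.flatnonzero(labels == c)[0] for c in range(k)]  # one feature per cluster
print(sorted(selected))
```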
PCA is another method, which makes use of the correlation matrix. You can read more about PCA here.
Note that all of the above algorithms need the user to provide the number of features to be selected.
The recently popular LASSO model, by contrast, does not need the user to provide the number of features (see the sketch below).
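A small sketch with LassoCV: the regularization strength is chosen by cross-validation, and the features whose coefficients remain non-zero fall out of the fitted model, with no feature count supplied up front (the dataset is just an example):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# LASSO drives the coefficients of unhelpful features to exactly zero,
# so the selected feature count is a by-product of the fit.
X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

lasso = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print(f"alpha={lasso.alpha_:.4f}, kept features: {selected}")
```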
Refer to the colab code for the various feature selection methods.
For selecting the right algorithm, the following criteria are useful.
We can select the algorithm which has the least misclassification error.
We might also be interested in knowing which feature contributed most to the ML learning.
For the first criterion, using statistics and probability, one can theoretically decide which feature selection algorithm has the least misclassification error. This approach uses the Bayes decision rule; this video talks about the approach in detail, and a sketch follows.
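A hypothetical one-dimensional illustration of the Bayes decision rule with known priors and Gaussian class-conditional densities; the numbers are made up for the example:

```python
import numpy as np
from scipy.stats import norm

# With known priors and class-conditional densities, the Bayes rule
# assigns x to the class with the larger posterior; the Bayes
# (misclassification) error is the area under the smaller of the two
# prior-weighted densities.
priors = (0.5, 0.5)
p0 = norm(loc=0.0, scale=1.0)    # p(x | class 0)
p1 = norm(loc=2.0, scale=1.0)    # p(x | class 1)

x = np.linspace(-10.0, 12.0, 20001)
joint0 = priors[0] * p0.pdf(x)
joint1 = priors[1] * p1.pdf(x)

bayes_error = np.minimum(joint0, joint1).sum() * (x[1] - x[0])
print(f"Bayes error ~ {bayes_error:.4f}")   # about 0.159 for these parameters
```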
However, there is a practical difficulty in obtaining priors and class-conditional PDFs in real life, and calculating the misclassification error is not easy in many cases. To handle this, there are techniques which estimate the probability distribution; the kernel density estimator is one such method (see the sketch below). This video and write-up talk about estimation approaches in detail.
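A short sketch of kernel density estimation with scikit-learn, where the unknown density is approximated from samples and could then be plugged into the Bayes rule above; the data and bandwidth are illustrative:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Approximate an unknown class-conditional PDF from samples drawn from it.
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.0, size=(500, 1))

kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(samples)
grid = np.linspace(-2, 6, 9).reshape(-1, 1)
print(np.exp(kde.score_samples(grid)))   # estimated density on the grid
```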
For the second criterion, we need an algorithm which reports the most contributing features. For example, LASSO and random-forest-based feature selection models (see the bar chart below and the sketch that follows) can be useful choices. Note that PCA cannot, since it transforms the feature set. Please refer here for the details.
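A sketch of the kind of bar chart mentioned above, using impurity-based importances from a random forest; the dataset is only a stand-in for whatever data you are working with:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Fit a forest and plot per-feature importances as a horizontal bar chart.
data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(data.data, data.target)

plt.figure(figsize=(8, 6))
plt.barh(data.feature_names, rf.feature_importances_)
plt.xlabel("importance")
plt.tight_layout()
plt.show()
```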
How many features should be selected? The value of this hyperparameter depends on the selected algorithm and on the problem at hand.
For example, although an algorithm may ask for a number of features, the actual need might be expressed differently. Take the PCA case: here it is better to decide on the percentage of variance which needs to be preserved (a sketch follows).
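A sketch of that choice with scikit-learn's PCA, where passing a fraction keeps just enough components to explain that share of the variance; the 95% threshold and dataset are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Instead of fixing the number of components, keep enough of them to
# explain 95% of the variance.
X, _ = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95).fit(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```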
Similarly, for LASSO the approach is different: the regularization strength is tuned (for example by cross-validation), and the number of retained features follows from it.
Redundant features adversely impact both training time (it increases) and accuracy (it decreases); you can verify this yourself.
https://youtu.be/y2Jsa4sgD5w
https://images.app.goo.gl/VPfzMfuQgzqN2djz5
https://images.app.goo.gl/SqP3sTSys1qECJrw8
https://youtu.be/y2Jsa4sgD5w?t=1729
https://www.semanticscholar.org/paper/A-Branch-and-Bound-Algorithm-for-Feature-Subset-Narendra-Fukunaga/8aee4e1022b18e7ecad7a963a5f6a3edb3832f2d
https://towardsdatascience.com/the-branch-and-bound-algorithm-a7ae4d227a69
https://youtu.be/QwS20ytvThE?t=1800
https://images.app.goo.gl/v1ids1DXrm4Zb7tK7
https://images.app.goo.gl/uSvvC9T7mEoNjcYy6
https://youtu.be/MtapFEwiiug?t=1810
https://youtu.be/aeEv3-tSvjM
https://www.linkedin.com/posts/dpkumar_machinelearningsolutions-features-datascience-activity-6764073661415727104-YcIL
https://www.researchgate.net/post/Should_a_feature_selection_method_to_remove_redundant_features_always_be_used_or_is_it_only_effective_when_a_certain_dimension_is_reached
https://towardsdatascience.com/feature-selection-why-how-explained-part-1-c2f638d24cdb
https://towardsdatascience.com/the-art-of-finding-the-best-features-for-machine-learning-a9074e2ca60d
https://www.sciencedirect.com/science/article/pii/S2314717218300059
https://youtu.be/q8gVpKl1f-4?t=1330
https://youtu.be/dLflHoGNofY?t=205
https://deepai.org/machine-learning-glossary-and-terms/kernel-density-estimation
https://www.linkedin.com/posts/dpkumar_machinelearningtraining-features-datasciences-activity-6762709881050017792-MhTm