Assume your input dataset contains one column with values ranging from 0 to 1, and another column with values ranging from 10,000 to 100,000. The large difference in scale can cause problems when you combine these values as features during modeling. If you want to understand why, this document will help.
Feature scaling is a technique often applied as part of data preparation for machine learning. The goal of feature scaling is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.
When all features are on a similar scale, training of the ML algorithm converges faster.
The result of standardization is that the features are rescaled so that they have the properties of a standard normal distribution with μ = 0 and σ = 1, where μ is the mean (average) and σ is the standard deviation. Each value x is transformed as z = (x - μ) / σ.
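As a minimal sketch, assuming scikit-learn is available and using made-up two-column data like the example above, standardization can be applied per column with StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: one column in [0, 1], another in [10,000, 100,000]
X = np.array([[0.2, 15000.0],
              [0.5, 42000.0],
              [0.9, 98000.0]])

# StandardScaler computes z = (x - mean) / std independently for each column
X_std = StandardScaler().fit_transform(X)

print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column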
In min-max scaling, the data is scaled to a fixed range, usually 0 to 1. In contrast to standardization, the cost of this bounded range is that we end up with smaller standard deviations, and a single extreme value determines the minimum or maximum, compressing the remaining values into a narrow band. Thus the Min-Max Scaler is sensitive to outliers.
Min-Max scaling is typically done via the following equation: X_scaled = (X - X_min) / (X_max - X_min).
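As a minimal sketch, again assuming scikit-learn and the same made-up two-column data, MinMaxScaler rescales each column to the [0, 1] range:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same illustrative two-column data as above
X = np.array([[0.2, 15000.0],
              [0.5, 42000.0],
              [0.9, 98000.0]])

# MinMaxScaler applies x' = (x - x_min) / (x_max - x_min) per column
X_mm = MinMaxScaler().fit_transform(X)

print(X_mm.min(axis=0))  # 0.0 for each column
print(X_mm.max(axis=0))  # 1.0 for each column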
Machine learning models require all input and output variables to be numeric. So if your data contains non-numeric features (for example, pincode or address), you first need to encode them as numbers. One-hot encoding is one such approach. Please refer here for more detail.
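As a minimal sketch (the pincode values are made up), pandas.get_dummies is one way to one-hot encode such a column:

import pandas as pd

# Illustrative categorical column; the pincode values are made up
df = pd.DataFrame({"pincode": ["560001", "110001", "560001"]})

# One-hot encoding creates one binary column per distinct pincode
encoded = pd.get_dummies(df, columns=["pincode"])
print(encoded)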
Feature scaling alters the mean and variance of the dataset. Similarly, normalisation can change the covariance between features, since not all features are normalised on the same scale. In cases where the mean and variance of the dataset must be preserved, feature scaling should be avoided. This paper talks about this.
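As a minimal sketch on randomly generated data, the snippet below illustrates how min-max scaling changes the per-feature mean and variance as well as the covariance between the two features:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two synthetic features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 1000.0], scale=[2.0, 300.0], size=(100, 2))

X_scaled = MinMaxScaler().fit_transform(X)

print(X.mean(axis=0), X.var(axis=0))                 # original mean / variance
print(X_scaled.mean(axis=0), X_scaled.var(axis=0))   # changed after scaling
print(np.cov(X.T)[0, 1], np.cov(X_scaled.T)[0, 1])   # covariance changes as well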
Some algorithms take care of scaling themselves. For example, Random Forests don't require feature scaling, since tree splits depend on the ordering of values rather than their absolute magnitude. Similarly, linear regression takes care of scale via coefficient adjustment (Ref: colab example).
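As a minimal sketch (the dataset and hyperparameters are arbitrary), the snippet below suggests why this holds for Random Forests: min-max scaling is a monotonic per-feature transform, so the tree splits, and hence the predictions, should be unchanged:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)  # monotonic, per-feature rescaling

# Same random_state so both forests draw the same bootstraps and feature subsets
rf_raw = RandomForestClassifier(random_state=42).fit(X, y)
rf_scaled = RandomForestClassifier(random_state=42).fit(X_scaled, y)

# Splits depend only on the ordering of values, so the predictions should match
print(np.array_equal(rf_raw.predict(X), rf_scaled.predict(X_scaled)))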
If you select the right distance metric for a K-NN classifier, then scaling is not needed. For example, the Mahalanobis distance metric takes care of scaling automatically (note the covariance matrix Σ in its formula). Refer to the example code here.
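As a minimal sketch (the dataset and neighbour count are arbitrary), scikit-learn's KNeighborsClassifier can use the Mahalanobis metric by passing the inverse covariance matrix, so no separate scaling step is applied to the raw features:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# The Mahalanobis distance uses the inverse covariance matrix (VI) of the features,
# which rescales and decorrelates them, so no separate feature scaling is needed
VI = np.linalg.inv(np.cov(X.T))

knn = KNeighborsClassifier(
    n_neighbors=5,
    metric="mahalanobis",
    metric_params={"VI": VI},
    algorithm="brute",  # brute-force search works with this metric
)
knn.fit(X, y)
print(knn.score(X, y))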
Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. This has the effect of stabilizing the learning process and dramatically reducing the number of training epochs required to train deep networks. Refer here for details.
Input feature scaling is also needed to avoid under-saturation/over-saturation problems caused by the activation functions in neural networks. For example, the ReLU activation function outputs 0 for negative values, so input normalisation should be done in such a way that it avoids feeding mostly negative values into the ReLU. In a 2015 paper, Sergey Ioffe and Christian Szegedy proposed a technique called Batch Normalization (BN) to address the vanishing/exploding gradients problems.
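As a minimal sketch using Keras (the layer sizes and input dimension are illustrative), Batch Normalization is inserted between a Dense layer and its ReLU activation so that each mini-batch's pre-activations are standardized:

import tensorflow as tf
from tensorflow.keras import layers

# Dense -> BatchNormalization -> ReLU, following the placement in the BN paper
model = tf.keras.Sequential([
    layers.Input(shape=(20,)),         # 20 input features (illustrative)
    layers.Dense(64, use_bias=False),  # BN's learned offset makes the bias redundant
    layers.BatchNormalization(),       # standardizes pre-activations per mini-batch
    layers.Activation("relu"),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()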
Some algorithms require that data be normalized before training a model; others take care of it themselves. Therefore, when you choose a machine learning algorithm for building a predictive model, be sure to review its data requirements before applying normalization to the training data. Refer here for a case where pre-normalisation is not needed.
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/normalize-data
https://machinelearningmastery.com/batch-normalization-for-training-of-deep-neural-networks/
https://www.youtube.com/watch?v=DtEq44FTPM4
https://youtu.be/0HOqOcln3Z4?t=534
https://youtu.be/0HOqOcln3Z4?t=548
http://theprofessionalspoint.blogspot.com/2019/02/which-machine-learning-algorithms.html
https://datascience.stackexchange.com/questions/62031/normalize-standardize-in-a-random-forest
https://stats.stackexchange.com/questions/41704/how-and-why-do-normalization-and-feature-scaling-work
https://images.app.goo.gl/7Vr3Di2T2dsoVPns9
https://en.wikipedia.org/wiki/Mahalanobis_distance
https://datascience.stackexchange.com/questions/25832/input-normalization-for-relu
https://gist.github.com/ajeyjoshi/e74e8c7f8bd389195efe163d1ab5bdc4
https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-batch-normalisation-in-machine-learning
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-batch-normalisation-in-machine-learning#TOC-Role-in-machine-learning
https://www.linkedin.com/posts/dpkumar_machinelearning-datascience-features-activity-6760734531889831936-ZiID