Data Scaling & Normalization

In a previous post, we discussed strategies for handling missing values. In this post, we will discuss data scaling and normalization.

Scaling vs. Normalization

Both scaling and normalization involve transforming the values of numeric variables so that the data points have specific helpful properties. The difference is that scaling changes the range of your data, while normalization changes the shape of its distribution.

Let's talk a little more in-depth about each of these techniques.

Scaling

A machine learning algorithm just sees numbers. If there is a vast difference in range, say a few features in the thousands and a few in the tens, it makes the implicit assumption that the larger numbers are more important. Those larger-valued features then start playing a more decisive role while training the model.

A machine learning algorithm works on numbers and does not know what those numbers represent. For example, say we have two features, weight and price. A value of "Weight" cannot be meaningfully compared with a value of "Price," yet because the weight values are numerically larger, the algorithm effectively treats "Weight" as more important than "Price."

Hence scaling these features is required to give every feature in our data set the same importance.

Feature scaling is especially important for machine learning algorithms that work by calculating distances between data points. If the features are not scaled, those with larger values will dominate the distances and disproportionately influence the output of the algorithm (the sketch after the list below illustrates this).

Some algorithms that require feature scaling include:

  • K-nearest neighbors (KNN), which measures Euclidean distance and is therefore sensitive to the magnitudes of the features; the features should be scaled so that they all weigh in equally.

  • K-Means, since it also uses Euclidean distance, same as KNN.

  • Principal Component Analysis (PCA), since it looks for the directions of maximum variance; variance is larger for high-magnitude features, which skews PCA towards them.

  • Gradient descent during backpropagation in neural networks, which converges faster when the input features are scaled.
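To make this concrete, here is a minimal sketch (with made-up weight and price values) of how the larger-magnitude feature dominates a Euclidean distance computed on raw values, which is exactly the issue KNN and K-Means face:

```python
# Minimal sketch: two made-up rows with a weight feature (hundreds of grams)
# and a price feature (a few dollars).
import numpy as np

a = np.array([300.0, 3.0])  # [weight, price]
b = np.array([800.0, 2.0])

diff = a - b
print(diff ** 2)                   # [250000.0, 1.0] -> the weight term dwarfs the price term
print(np.sqrt(np.sum(diff ** 2)))  # ~500.0, effectively just the weight gap
```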

There are various scaling methods available, but min-max scaling is the most commonly used.

Min-max scaling transforms features by scaling each one to a given range. It scales and translates each feature individually so that it falls within the given range on the training set, e.g., between zero and one. Because it relies on the observed minimum and maximum, this scaler is sensitive to outliers.

If X_sc is the scaled version of feature X, then as per min-max scaling: X_sc = (X − X_min) / (X_max − X_min)
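As a hedged illustration, the sketch below applies this formula by hand to a small made-up array and checks that it matches scikit-learn's MinMaxScaler; the values are purely illustrative.

```python
# Minimal sketch of min-max scaling on a made-up array, assuming scikit-learn is installed.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[300.0, 3.0],
              [250.0, 2.5],
              [800.0, 2.8]])  # columns: weight, price (illustrative values)

# Apply X_sc = (X - X_min) / (X_max - X_min) column by column.
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# scikit-learn's MinMaxScaler does the same thing with its default feature_range=(0, 1).
X_sklearn = MinMaxScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))  # True
print(X_sklearn)                         # every column now lies within [0, 1]
```

Note that the scaler learns X_min and X_max from the training set and then applies the same transformation to new data, so the test set does not leak into the fitted parameters.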

Normalization

By now, it should be clear that scaling just changes the range of our data. Normalization is a bit more radical: it changes our data so that it can be described by a normal distribution.

Normal Distribution

A normal distribution, also known as the Gaussian distribution or the "bell curve," is a specific statistical distribution in which roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and observations cluster close to the mean.

We should only use normalization if our machine learning algorithm or statistical test assumes that the data is normally distributed. Some examples include t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA), and Gaussian naive Bayes.
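As a rough illustration, the sketch below normalizes synthetic right-skewed data with the Box-Cox transform from SciPy, one common normalization technique; the data and parameters are illustrative assumptions, not from the original post.

```python
# Minimal sketch of normalization via the Box-Cox transform, assuming SciPy is installed;
# the right-skewed data below is synthetic and purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1000)  # strictly positive, strongly right-skewed

# Box-Cox requires positive values; it returns the transformed data and the fitted lambda.
normalized, fitted_lambda = stats.boxcox(skewed)

print(stats.skew(skewed))      # clearly positive skew before the transform
print(stats.skew(normalized))  # much closer to 0 afterwards
print(fitted_lambda)
```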

Practice

Kaggle has an interesting tutorial on Data Cleaning. It gives good hands-on experience with various methods of cleaning data, including scaling and normalization.