In supervised learning, especially regression, we often make assumptions about the variables, such as normality; sometimes our target variables do not satisfy these assumptions, so we need to apply a transformation.
In addition, to remove the influence of scale on the model, we also want to keep the variables on the same scale.
Common transformations include:
- Log transformation
- Box-Cox transformation
- Sine transformation
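As a quick illustration, here is a minimal sketch of the log and Box-Cox transformations using NumPy and SciPy; the exponential sample `y` is made up purely for demonstration, and the small positive shift before Box-Cox is a common guard, not part of the method itself.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed target variable, for illustration only
y = np.random.exponential(scale=2.0, size=1000)

# Log transformation: log1p computes log(1 + y), which handles zeros safely
y_log = np.log1p(y)

# Box-Cox transformation: requires strictly positive input;
# scipy estimates the lambda parameter by maximum likelihood
y_boxcox, lam = stats.boxcox(y + 1e-9)  # tiny shift guards against zeros

print(f"Estimated Box-Cox lambda: {lam:.3f}")
```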
Common scaling methods include:
- Min-max scaling
- Mean normalization
- Standardization (z-score)
Models that are sensitive to feature scale include:
- PCA
- Neural networks
- KNN
- SVM
- Regression
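Each of the three scaling methods above is a one-liner. Below is a minimal NumPy sketch on a made-up feature vector; for full pipelines, scikit-learn's MinMaxScaler and StandardScaler cover the first and last cases.

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 20.0, 50.0])  # hypothetical feature values

# Min-max scaling: rescales values into the [0, 1] range
x_minmax = (x - x.min()) / (x.max() - x.min())

# Mean normalization: centers on the mean, divides by the range
x_mean_norm = (x - x.mean()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit standard deviation
x_standard = (x - x.mean()) / x.std()
```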
Feature selection is about choosing which variables to keep. Some people wonder why this step comes so early: sometimes you are dealing with very big data, and reducing the number of features can definitely help you understand the data without taking everything into account.
Some simple filter rules:
- Drop a feature if its values never change (constant column)
- Drop a feature if more than 50% of its values are missing
- Drop a feature if its variance is very low (e.g., less than 1)
- Drop one of each pair of highly correlated variables
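These filter rules are easy to implement directly in pandas. The sketch below is one possible version; the function name `filter_features` and the threshold values are illustrative defaults, not fixed conventions.

```python
import numpy as np
import pandas as pd

def filter_features(df: pd.DataFrame,
                    missing_thresh: float = 0.5,
                    var_thresh: float = 1.0,
                    corr_thresh: float = 0.9) -> pd.DataFrame:
    # Drop constant columns (values never change)
    df = df.loc[:, df.nunique() > 1]
    # Drop columns with more than 50% missing values
    df = df.loc[:, df.isna().mean() <= missing_thresh]
    # Drop numeric columns whose variance is below the threshold
    num = df.select_dtypes(include=np.number)
    df = df.drop(columns=num.columns[num.var() < var_thresh])
    # Drop one of each pair of highly correlated columns
    corr = df.select_dtypes(include=np.number).corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_thresh).any()]
    return df.drop(columns=to_drop)
```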
Some models can also be used for feature selection, for example:
- Tree-based models: feature importance can be used to choose the important variables
- PCA: principal components can summarize the features
- Neural networks: can also be used for feature extraction
We won't go into detail about these methods here, but you should definitely know that some models can be applied to feature selection and dimensionality reduction.
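Still, here is a minimal sketch of the tree-based route using scikit-learn. The built-in breast cancer dataset and n_estimators=200 are arbitrary choices for illustration; SelectFromModel keeps features whose importance exceeds the mean importance by default.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# Fit a random forest and rank features by impurity-based importance
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(forest.feature_importances_)

# Keep only the features whose importance is above the mean
selector = SelectFromModel(forest, prefit=True)
X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)
```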
There are mainly two methods for checking outliers. After removing outliers, we may get better model results.
We can clearly see the outliers in a boxplot, which is mostly based on the IQR calculation: points beyond 1.5 × IQR from the quartiles are flagged.
The outlier-check function here can return the outliers in an array as well as the outliers in each column of a data frame.
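Since the function itself isn't shown here, below is a minimal sketch of what it might look like, assuming the standard 1.5 × IQR boxplot rule; the names `outlier_check` and `outlier_check_df` are illustrative.

```python
import numpy as np
import pandas as pd

def outlier_check(values: pd.Series, k: float = 1.5) -> pd.Series:
    # Boxplot-style IQR rule: flag points outside [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

def outlier_check_df(df: pd.DataFrame, k: float = 1.5) -> dict:
    # Apply the IQR rule to every numeric column of a data frame
    return {col: outlier_check(df[col], k)
            for col in df.select_dtypes(include=np.number).columns}
```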
For data that are approximately normally distributed, the empirical (68-95-99.7) rule tells us that about 99.7% of observations fall within three standard deviations of the mean. So, if we have some data points far beyond three standard deviations, they may be outliers.
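This rule translates directly into a z-score check; a minimal sketch, assuming the data are roughly normal:

```python
import numpy as np

def three_sigma_outliers(x: np.ndarray) -> np.ndarray:
    # Flag points more than 3 standard deviations from the mean
    z = (x - x.mean()) / x.std()
    return x[np.abs(z) > 3]
```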