In supervised learning, especially regression, we often make assumptions about the variables, such as normality; sometimes our target variables do not satisfy these assumptions, so we need to apply a transformation.
In addition, to remove the influence of scale on the model, we also want to keep the variables on the same scale.
Common transformations include:
- Log transformation
- Box-Cox transformation
- Sine transformation
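As a quick illustration, here is a minimal sketch of the log and Box-Cox transformations using NumPy and SciPy; the exponential sample `y` is made up purely for demonstration, and the small positive shift before Box-Cox is a common guard, not part of the method itself.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed target variable, for illustration only
y = np.random.exponential(scale=2.0, size=1000)

# Log transformation: log1p computes log(1 + y), which handles zeros safely
y_log = np.log1p(y)

# Box-Cox transformation: requires strictly positive input;
# scipy estimates the lambda parameter by maximum likelihood
y_boxcox, lam = stats.boxcox(y + 1e-9)  # tiny shift guards against zeros

print(f"Estimated Box-Cox lambda: {lam:.3f}")
```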
Common scaling methods include:
- Min-max scaling
- Mean normalization
- Standardization (z-score)
Models that are sensitive to feature scale include:
- PCA
- Neural networks
- KNN
- SVM
- Regression
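Each of the three scaling methods above is a one-liner. Below is a minimal NumPy sketch on a made-up feature vector; for full pipelines, scikit-learn's MinMaxScaler and StandardScaler cover the first and last cases.

```python
import numpy as np

x = np.array([1.0, 5.0, 10.0, 20.0, 50.0])  # hypothetical feature values

# Min-max scaling: rescales values into the [0, 1] range
x_minmax = (x - x.min()) / (x.max() - x.min())

# Mean normalization: centers on the mean, divides by the range
x_mean_norm = (x - x.mean()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit standard deviation
x_standard = (x - x.mean()) / x.std()
```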
Feature selection is about choosing which variables to keep. Some people wonder why this step comes so early: sometimes you are dealing with very big data, and reducing the number of features can definitely help you understand the data without taking everything into account.
Some simple filter rules:
- Drop a feature if its values never change (constant column)
- Drop a feature if more than 50% of its values are missing
- Drop a feature if its variance is very low (e.g., less than 1)
- Drop one of each pair of highly correlated variables
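These filter rules are easy to implement directly in pandas. The sketch below is one possible version; the function name `filter_features` and the threshold values are illustrative defaults, not fixed conventions.

```python
import numpy as np
import pandas as pd

def filter_features(df: pd.DataFrame,
                    missing_thresh: float = 0.5,
                    var_thresh: float = 1.0,
                    corr_thresh: float = 0.9) -> pd.DataFrame:
    # Drop constant columns (values never change)
    df = df.loc[:, df.nunique() > 1]
    # Drop columns with more than 50% missing values
    df = df.loc[:, df.isna().mean() <= missing_thresh]
    # Drop numeric columns whose variance is below the threshold
    num = df.select_dtypes(include=np.number)
    df = df.drop(columns=num.columns[num.var() < var_thresh])
    # Drop one of each pair of highly correlated columns
    corr = df.select_dtypes(include=np.number).corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_thresh).any()]
    return df.drop(columns=to_drop)
```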
Some models can also be used for feature selection, for example:
- Tree-based models: feature importance can be used to choose the important variables
- PCA: principal components can summarize the features
- Neural networks: can also be used for feature extraction
We won't go into detail about these methods here, but you should definitely know that some models can be applied to feature selection and dimensionality reduction.
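Still, here is a minimal sketch of the tree-based route using scikit-learn. The built-in breast cancer dataset and n_estimators=200 are arbitrary choices for illustration; SelectFromModel keeps features whose importance exceeds the mean importance by default.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# Fit a random forest and rank features by impurity-based importance
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(forest.feature_importances_)

# Keep only the features whose importance is above the mean
selector = SelectFromModel(forest, prefit=True)
X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)
```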
There are mainly two methods for checking outliers. After removing outliers, we may get better model results.
We can clearly see the outliers in a boxplot, which is mostly based on the IQR calculation: points beyond 1.5 × IQR from the quartiles are flagged.
The outlier-check function here can return the outliers in an array as well as the outliers in each column of a data frame.
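Since the function itself isn't shown here, below is a minimal sketch of what it might look like, assuming the standard 1.5 × IQR boxplot rule; the names `outlier_check` and `outlier_check_df` are illustrative.

```python
import numpy as np
import pandas as pd

def outlier_check(values: pd.Series, k: float = 1.5) -> pd.Series:
    # Boxplot-style IQR rule: flag points outside [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

def outlier_check_df(df: pd.DataFrame, k: float = 1.5) -> dict:
    # Apply the IQR rule to every numeric column of a data frame
    return {col: outlier_check(df[col], k)
            for col in df.select_dtypes(include=np.number).columns}
```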
For data that are approximately normally distributed, the empirical (68-95-99.7) rule tells us that about 99.7% of observations fall within three standard deviations of the mean. So, if we have some data points far beyond three standard deviations, they may be outliers.
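This rule translates directly into a z-score check; a minimal sketch, assuming the data are roughly normal:

```python
import numpy as np

def three_sigma_outliers(x: np.ndarray) -> np.ndarray:
    # Flag points more than 3 standard deviations from the mean
    z = (x - x.mean()) / x.std()
    return x[np.abs(z) > 3]
```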