When performing data analysis, you need to consider the importance of data transformation in preparation for the learning phase. Most machine learning algorithms work best when the Pearson's correlation is maximized between the variable you have to predict (the target) and the variables you use to predict it (the predictors). The following sections present an overview of the most common procedures used during data analysis to enhance the relationship between variables.
During data science practice, you'll meet a wide range of different distributions, some of them named by probability theory, others not. For some distributions, the assumption that they should behave as a normal distribution may hold, but for others, it may not, and that could be a problem depending on what algorithms you use for the learning process. As a general rule, if your model is a linear regression, or is part of the linear model family because it boils down to a weighted summation of coefficients, you should consider both variable standardization and distribution transformation.
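As a minimal sketch of these two steps, the following code generates a synthetic right-skewed (lognormal) feature, applies a distribution transformation (the logarithm), and then standardizes the result to z-scores. The data is made up for illustration; substitute your own feature array.

```python
# Sketch: distribution transformation plus standardization of a skewed feature.
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed feature

# Distribution transformation: a log makes a lognormal variable normal.
x_log = np.log(x)

# Standardization: zero mean, unit variance (z-scores).
x_std = (x_log - x_log.mean()) / x_log.std()

print("mean:", x_std.mean(), "std:", x_std.std())  # mean ~0, std ~1
```

After this treatment, the feature contributes to a linear model's weighted summation on the same footing as any other standardized feature.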
In your analysis process, you may have realized that your variables have different scales and are heterogeneous in their distributions. As a consequence, you need to transform the variables in a way that makes them easily comparable.
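One common way to make heterogeneous variables comparable is to convert each one to z-scores. The sketch below does this for two invented columns on very different scales (the column names and values are made up for illustration):

```python
# Sketch: putting two differently scaled variables on a comparable footing.
import pandas as pd
from scipy.stats import zscore

df = pd.DataFrame({
    "price": [120000.0, 250000.0, 310000.0, 189000.0],  # dollars
    "rooms": [3.0, 5.0, 6.0, 4.0],                      # a simple count
})

# Apply z-score scaling column by column: each column ends up
# with mean 0 and standard deviation 1.
scaled = df.apply(zscore)
print(scaled.round(2))
```

After scaling, a one-unit change means the same thing (one standard deviation) in every column, so the variables can be compared directly.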
When you check variables with high skewness and kurtosis for their correlation, the results may disappoint you. As you found out earlier in this chapter, using a nonparametric measure of correlation, such as Spearman's, may tell you more about two variables than Pearson's r can. In this case, you should turn your insight into a new, transformed feature.
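A short sketch can make this concrete. On synthetic data with an exponential (and therefore skewed) relationship, Spearman's correlation stays high because it only considers ranks, while Pearson's r is understated; after a log transformation, Pearson's r recovers. The variables here are invented for illustration:

```python
# Sketch: Spearman reveals a relation that Pearson understates on skewed data.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(42)
x = rng.uniform(1.0, 10.0, size=500)
y = np.exp(x) * rng.lognormal(0.0, 0.3, size=500)  # skewed, monotone link

print("Pearson :", round(pearsonr(x, y)[0], 3))
print("Spearman:", round(spearmanr(x, y)[0], 3))

# The transformed feature log(y) relates linearly to x, so Pearson recovers.
print("Pearson after log(y):", round(pearsonr(x, np.log(y))[0], 3))
```

The gap between the first two numbers is the hint; the third number shows the payoff of keeping log(y) as a new feature.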
Using homes.csv, try to do the following:
Scale the Sell and Taxes variables and then print the results.
For both scaled Sell and Taxes, try at least five transformations for each variable and see which combination yields a Pearson's r correlation between the two that comes closest to 1.
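One way the exercise could be sketched is shown below. Because homes.csv isn't bundled here, the code builds a synthetic stand-in for the Sell and Taxes columns; with the real file you would instead load it with pd.read_csv("homes.csv"). The five transformations (identity, log, sqrt, square, inverse) and the shift to positive values are choices made for this sketch, not prescribed by the exercise:

```python
# Sketch of the exercise using synthetic stand-ins for Sell and Taxes.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, zscore

rng = np.random.default_rng(7)
taxes = rng.uniform(500.0, 4500.0, size=50)
sell = taxes * 0.04 + rng.normal(0.0, 20.0, size=50)  # loosely linked prices
df = pd.DataFrame({"Sell": sell, "Taxes": taxes})

# Step 1: scale both variables and print them.
scaled = df.apply(zscore)
print(scaled.head())

# Step 2: try five transformations of each scaled variable. Shifting the
# z-scores into positive territory keeps log, sqrt, and inverse defined.
pos = scaled - scaled.min() + 1.0
transforms = {
    "identity": lambda s: s,
    "log": np.log,
    "sqrt": np.sqrt,
    "square": np.square,
    "inverse": lambda s: 1.0 / s,
}

results = {}
for name_s, f_s in transforms.items():
    for name_t, f_t in transforms.items():
        r = pearsonr(f_s(pos["Sell"]), f_t(pos["Taxes"]))[0]
        results[(name_s, name_t)] = r

# Report the five strongest transformation pairs.
for pair, r in sorted(results.items(), key=lambda kv: -abs(kv[1]))[:5]:
    print(pair, round(r, 3))
```

With the real homes.csv, the only change needed is replacing the synthetic DataFrame construction with the CSV load and keeping the Sell and Taxes columns.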