1. Concepts & Definitions
1.1. Regression versus Classification
1.3. Parameter versus Hyperparameter
1.4. Training, Validation, and Test
2. Problem & Solution
2.1. Gaussian Mixture versus K-means on HS6 Weight
2.2. Evaluation of a classification method using the ROC curve
2.3. Comparing logistic regression, neural network, and ensemble
2.4. Fruits or not, split or encode and scale first?
The Python code discussed next deals with a data set containing the following features:
1. si_transaction_id - shipment instruction ID
2. harmonized_system_code - declared commodity code
3. container_uncode - container code denoting the container characteristics
4. package_count - the number of packages
5. allocated_container_count - the number of containers in the shipment instruction
6. weight_kg - the container weight
7. pol_city_unlocode - the place of loading, identified by the city UN/LOCODE
Two additional columns can be computed from the previous features:
8. weight_per_unit - the container weight divided by the number of packages
9. volume_per_unit - the container volume (pre-calculated from the container code) divided by the number of packages
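As an illustration, the sketch below shows one way these two columns could be derived with pandas. It is not the original notebook code: the DataFrame name df and the helper column container_volume_m3 (the volume looked up from the container code) are assumptions.

```python
import numpy as np
import pandas as pd

def add_per_unit_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the two derived columns described above (sketch, assumed column names)."""
    out = df.copy()
    # 8. weight_per_unit: container weight divided by the number of packages
    out["weight_per_unit"] = out["weight_kg"] / out["package_count"]
    # 9. volume_per_unit: pre-calculated container volume divided by the number of packages
    out["volume_per_unit"] = out["container_volume_m3"] / out["package_count"]
    # Division by a zero or missing package count yields inf/NaN; mark those as missing
    return out.replace([np.inf, -np.inf], np.nan)
```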
After cleaning the data and selecting the proper input variables, the final result should present a confusion matrix and a ROC curve like the graphics shown in the next figures.
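A minimal sketch of how such evaluation figures can be produced with scikit-learn (version 1.0 or later) is shown below. The fitted classifier clf and the held-out arrays X_test and y_test are assumptions here; the actual models are built in the notebooks linked at the end.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

# Assumed to exist: a fitted binary classifier `clf` and a held-out test set.
y_pred = clf.predict(X_test)               # hard class predictions
y_proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

fig, (ax_cm, ax_roc) = plt.subplots(1, 2, figsize=(10, 4))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=ax_cm)  # confusion matrix
RocCurveDisplay.from_predictions(y_test, y_proba, ax=ax_roc)       # ROC curve with AUC
plt.tight_layout()
plt.show()
```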
It is important to note that these figures were obtained by applying the data-processing operations in the following order (a code sketch of this sequence is given after the list):
1. From the data frame, select the input variables and store them in a variable X; store the output variable in a variable y.
2. Apply a transformation to the categorical variable in the column "pol_city_unlocode". Although a transformation to dummy (binary) variables is possible, it is not practical here: the values in this column span roughly 400 cities, so this is not a proper strategy. A more adequate strategy is to employ an encoding that maps each city into a real number in the interval [0, 1].
3. Apply a scale transformation to avoid numerical instabilities during the application of the classification method.
4. Perform the train-test split.
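The sketch below illustrates this 1-2-3-4 ("Scale First") order. It is not the original notebook code: the DataFrame df, the target column name is_fruit, and the encoding choice (ordinal city codes rescaled to [0, 1]) are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Select the input variables (X) and the output variable (y)
feature_cols = ["package_count", "allocated_container_count", "weight_kg",
                "weight_per_unit", "volume_per_unit", "pol_city_unlocode"]
X = df[feature_cols].copy()
y = df["is_fruit"]  # assumed name of the binary target column

# 2. Encode the ~400 distinct city codes as real numbers in [0, 1]
codes, _ = pd.factorize(X["pol_city_unlocode"])
X["pol_city_unlocode"] = codes / max(codes.max(), 1)

# 3. Scale all inputs to avoid numerical instabilities
X_scaled = StandardScaler().fit_transform(X)

# 4. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y)
```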
Another possible sequence of data-processing steps is 1, 4, 2, 3, that is, performing the train-test split before encoding and scaling (see the sketch below). As can be seen, there is a slight difference in the final results.
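The alternative 1-4-2-3 ("Train-Test Split First") order can be sketched as follows, again with assumed names carried over from the previous sketch; the key difference is that the encoder and the scaler are fitted on the training portion only and then applied to the test portion.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, StandardScaler

city = ["pol_city_unlocode"]

# 1. Select X and y as in the previous sketch, then 4. split before any fitting
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols].copy(), df["is_fruit"], test_size=0.2, random_state=42,
    stratify=df["is_fruit"])
X_train, X_test = X_train.copy(), X_test.copy()

# 2. Encode the city codes into [0, 1] using statistics from the training data only
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train[city] = enc.fit_transform(X_train[city])
X_test[city] = enc.transform(X_test[city])          # cities unseen in training become -1
rng = MinMaxScaler().fit(X_train[city])
X_train[city] = rng.transform(X_train[city])
X_test[city] = rng.transform(X_test[city])

# 3. Scale all inputs with statistics learned from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```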
These results are aligned with the results and discussions found in specialized forums. The debate about scaling before or after splitting into train and test datasets usually falls into two positions:
1. Arguments for scaling after the train-test split, fitting the scaler only on the training data so that no information from the test set leaks into the preprocessing.
2. Arguments that, in practice, there is little or no difference between encoding and scaling before or after the train-test split.
The Python code with all the steps of the Scale First version is summarized in this Google Colab notebook (click on the link):
https://colab.research.google.com/drive/1OLSY5w65W9We5fkdX8CLUXmqpvjBK6px?usp=sharing
The Python code with all the steps of the Train-Test Split First version is summarized in this Google Colab notebook (click on the link):
https://colab.research.google.com/drive/1GyzYDj2Bp3KyBxqQq2ta22NEG_d1sHr1?usp=sharing