1. Concepts & Definitions
1.1. Regression versus Classification
1.3. Parameter versus Hyperparameter
1.4. Training, Validation, and Test
2. Problem & Solution
2.1. Gaussian Mixture versus K-means on HS6 Weight
2.2. Evaluation of a classification method using the ROC curve
2.3. Comparing logistic regression, neural network, and ensemble
2.4. Fruits or not, split or encode and scale first?
The Python code discussed next deals with a data set containing the following features:
1. si_transaction_id - shipment instruction ID
2. harmonized_system_code - declared commodity code
3. container_uncode - container code denoting the container characteristics
4. package_count - the number of packages
5. allocated_container_count - the number of containers in the shipment instruction
6. weight_kg - the container weight
7. pol_city_unlocode - the place of loading, identified by the city UN/LOCODE
Two additional columns can be computed from the previous features:
8. weight_per_unit - the container weight divided by the number of packages
9. volume_per_unit - the container volume (pre-calculated from the container code) divided by the number of packages
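As an illustration, the sketch below shows one way these two columns could be derived with pandas. It is not the original notebook code: the DataFrame name df and the helper column container_volume_m3 (the volume looked up from the container code) are assumptions.

```python
import numpy as np
import pandas as pd

def add_per_unit_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the two derived columns described above (sketch, assumed column names)."""
    out = df.copy()
    # 8. weight_per_unit: container weight divided by the number of packages
    out["weight_per_unit"] = out["weight_kg"] / out["package_count"]
    # 9. volume_per_unit: pre-calculated container volume divided by the number of packages
    out["volume_per_unit"] = out["container_volume_m3"] / out["package_count"]
    # Division by a zero or missing package count yields inf/NaN; mark those as missing
    return out.replace([np.inf, -np.inf], np.nan)
```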
After cleaning the data and selecting the proper input variables, the final result should present a confusion matrix and a ROC curve like the graphics shown in the next figures.
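A minimal sketch of how such evaluation figures can be produced with scikit-learn (version 1.0 or later) is shown below. The fitted classifier clf and the held-out arrays X_test and y_test are assumptions here; the actual models are built in the notebooks linked at the end.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

# Assumed to exist: a fitted binary classifier `clf` and a held-out test set.
y_pred = clf.predict(X_test)               # hard class predictions
y_proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

fig, (ax_cm, ax_roc) = plt.subplots(1, 2, figsize=(10, 4))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, ax=ax_cm)  # confusion matrix
RocCurveDisplay.from_predictions(y_test, y_proba, ax=ax_roc)       # ROC curve with AUC
plt.tight_layout()
plt.show()
```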
It is important to note that these figures were obtained by applying the data-processing operations in the following order (a code sketch of this sequence is given after the list):
1. From the data frame, select the input variables and store them in a variable X; store the output variable in a variable y.
2. Apply a transformation to the categorical variable in the column "pol_city_unlocode". Although a transformation to dummy (binary) variables is possible, it is not practical here: the values in this column span roughly 400 cities, so this is not a proper strategy. A more adequate strategy is to employ an encoding that maps each city into a real number in the interval [0, 1].
3. Apply a scale transformation to avoid numerical instabilities during the application of the classification method.
4. Perform the train-test split.
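The sketch below illustrates this 1-2-3-4 ("Scale First") order. It is not the original notebook code: the DataFrame df, the target column name is_fruit, and the encoding choice (ordinal city codes rescaled to [0, 1]) are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Select the input variables (X) and the output variable (y)
feature_cols = ["package_count", "allocated_container_count", "weight_kg",
                "weight_per_unit", "volume_per_unit", "pol_city_unlocode"]
X = df[feature_cols].copy()
y = df["is_fruit"]  # assumed name of the binary target column

# 2. Encode the ~400 distinct city codes as real numbers in [0, 1]
codes, _ = pd.factorize(X["pol_city_unlocode"])
X["pol_city_unlocode"] = codes / max(codes.max(), 1)

# 3. Scale all inputs to avoid numerical instabilities
X_scaled = StandardScaler().fit_transform(X)

# 4. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y)
```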
Another possible sequence of data-processing steps is 1, 4, 2, 3, that is, performing the train-test split before encoding and scaling (see the sketch below). As can be seen, there is a slight difference in the final results.
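The alternative 1-4-2-3 ("Train-Test Split First") order can be sketched as follows, again with assumed names carried over from the previous sketch; the key difference is that the encoder and the scaler are fitted on the training portion only and then applied to the test portion.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, StandardScaler

city = ["pol_city_unlocode"]

# 1. Select X and y as in the previous sketch, then 4. split before any fitting
X_train, X_test, y_train, y_test = train_test_split(
    df[feature_cols].copy(), df["is_fruit"], test_size=0.2, random_state=42,
    stratify=df["is_fruit"])
X_train, X_test = X_train.copy(), X_test.copy()

# 2. Encode the city codes into [0, 1] using statistics from the training data only
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_train[city] = enc.fit_transform(X_train[city])
X_test[city] = enc.transform(X_test[city])          # cities unseen in training become -1
rng = MinMaxScaler().fit(X_train[city])
X_train[city] = rng.transform(X_train[city])
X_test[city] = rng.transform(X_test[city])

# 3. Scale all inputs with statistics learned from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```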
These results are aligned with the results and discussions found in specialized forums. The debate about scaling before or after splitting into train and test datasets usually falls into two positions:
1. Arguments for scaling after the train-test split, fitting the scaler only on the training data so that no information from the test set leaks into the preprocessing.
2. Arguments that, in practice, there is little or no difference between encoding and scaling before or after the train-test split.
The Python code with all the steps of the Scale First version is summarized in this Google Colab notebook (click on the link):
https://colab.research.google.com/drive/1OLSY5w65W9We5fkdX8CLUXmqpvjBK6px?usp=sharing
The Python code with all the steps of the Train-Test Split First version is summarized in this Google Colab notebook (click on the link):
https://colab.research.google.com/drive/1GyzYDj2Bp3KyBxqQq2ta22NEG_d1sHr1?usp=sharing