Capstone Project 3 - From Binary to Multiple Classes - Mind Map
Baseline code for multi-class classification using clustering methods and ensembles - Google Colab notebook with Python code
https://colab.research.google.com/drive/1WcDmp_BaZ6KQtCQe85amITw8EbQ1EKWs?usp=sharing
Baseline code for multi-class classification using neural network methods - Google Colab notebook with Python code
https://colab.research.google.com/drive/1sLsTGkz8bOJnHO6p5f3g1N_UIB2YJuSt?usp=sharing
Modified code proposed by NotebookLM (which uses Gemini, with a format adapted from ChatGPT)
Prompt: "From the given code, propose a new code that is able to deal with multiple classes"
This guide explains how to transform an existing binary classification pipeline into a multi-class setup built around a logistic regression model. The hs2 column, which holds the first two digits of the Harmonized System code, serves as the new multi-class target variable.
Replace the original binary target xis_electronics with hs2, which groups products into broader categories (if hs2 is missing, see the derivation sketch after the inspection code below).
target_multiclass = 'hs2'
y = df[target_multiclass]
print(y.head())
print(y.unique()) # List of unique classes
print(y.value_counts()) # Distribution of classes
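If the dataset does not already contain an hs2 column, it can be derived from harmonized_system_code. A minimal sketch, assuming that column stores the full six-digit HS code as a string or integer:
# Hypothetical derivation: keep the first two digits (the HS chapter).
# zfill restores a leading zero lost if the code was stored as an integer.
df['hs2'] = df['harmonized_system_code'].astype(str).str.zfill(6).str[:2]
print(df['hs2'].nunique(), "distinct hs2 chapters")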
Drop irrelevant columns and make sure neither hs2 (the new target) nor xis_electronics (the former binary target) remains among the features.
drop_cols_multiclass = [
    'si_transaction_id', 'harmonized_system_code',
    'container_uncode', 'xis_electronics', target_multiclass
]
X = df.drop(columns=drop_cols_multiclass)
print(X.head())
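Before encoding, it helps to confirm which feature columns are still non-numeric. A quick check, assuming X is a pandas DataFrame:
# Columns of dtype 'object' still need categorical encoding.
print(X.dtypes)
print("Non-numeric columns:", X.select_dtypes(include='object').columns.tolist())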
Use target encoding to transform the pol_city_unlocode column. Note that category_encoders' TargetEncoder supports only binary and continuous targets: it replaces each category with the mean of the target, so an integer-coded multi-class target such as hs2 is treated as a number rather than as classes. A one-vs-rest workaround is sketched after the code below.
import category_encoders as ce
encoder = ce.TargetEncoder(return_df=True, smoothing=180)  # treats y as numeric; see the note above
X_ecd = encoder.fit_transform(X, y)
print(X_ecd.head())
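Because TargetEncoder replaces each category with the mean of the target, a class-aware alternative for multi-class problems is one-vs-rest target encoding, which produces one encoded column per class. A sketch, assuming pol_city_unlocode is the only column that needs encoding:
import pandas as pd
# One-vs-rest: encode the city against a binary indicator for each class.
encoded_parts = []
for cls in sorted(y.unique()):
    ovr_encoder = ce.TargetEncoder(smoothing=180)
    part = ovr_encoder.fit_transform(X[['pol_city_unlocode']], (y == cls).astype(int))
    encoded_parts.append(part.rename(columns={'pol_city_unlocode': f'pol_city_te_{cls}'}))
X_ecd_ovr = pd.concat([X.drop(columns=['pol_city_unlocode'])] + encoded_parts, axis=1)
print(X_ecd_ovr.head())
category_encoders also provides a PolynomialWrapper that automates this one-column-per-class pattern for multi-class targets.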
Standardize the feature values with StandardScaler, which helps the solver converge. For simplicity the scaler is fitted on the full dataset here; fitting it only on the training split avoids leaking test statistics (see the pipeline sketch after the split below).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_ss_ecd = scaler.fit_transform(X_ecd)
print(X_ss_ecd[:5])
Separate the data for training and testing using an 80/20 split. With an imbalanced multi-class target, passing stratify=y keeps the class proportions similar in both sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_ss_ecd, y, test_size=0.2, random_state=42
)
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
Use the multinomial setting in LogisticRegression to handle multiple classes. (In scikit-learn 1.5 and later the multi_class argument is deprecated; the lbfgs solver fits a multinomial model by default, so the argument can simply be dropped there.)
from sklearn.linear_model import LogisticRegression
logistic_model_multiclass = LogisticRegression(
    multi_class='multinomial',  # deprecated in scikit-learn >= 1.5; lbfgs is multinomial by default
    solver='lbfgs',
    random_state=1,
    max_iter=1000
)
logistic_model_multiclass.fit(X_train, y_train)
print("Multi-class Logistic Regression Model trained.")
Evaluate the predictions with accuracy, the confusion matrix, and the per-class classification report.
from sklearn.metrics import (
    accuracy_score, classification_report,
    confusion_matrix, ConfusionMatrixDisplay
)
import matplotlib.pyplot as plt
y_pred = logistic_model_multiclass.predict(X_test)
print("Accuracy (Multi-class): %.3f" % accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
# Plot the confusion matrix
labels_multiclass = sorted(y_test.unique())
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=labels_multiclass)
plt.show()
# Detailed per-class metrics
print("Classification Report:\n", classification_report(y_test, y_pred))
Use k-fold cross-validation to validate the model across different data splits; with an imbalanced multi-class target, a stratified variant (sketched after the code below) is usually preferable.
from sklearn.model_selection import KFold, cross_val_score
model_kfold = LogisticRegression(
    multi_class='multinomial',
    solver='lbfgs',
    random_state=1,
    max_iter=1000
)
kfold = KFold(n_splits=10)
scores = cross_val_score(model_kfold, X_train, y_train, cv=kfold)
print("Cross-validation scores:", scores)
print("Mean accuracy:", scores.mean())
print("Standard deviation:", scores.std())
This adaptation turns a binary classification setup into a multi-class machine learning pipeline with minimal changes to the structure, covering encoding, scaling, model training, and evaluation for a categorical target with many classes.