Capstone Project 3 - From Binary to Multiple Classes - Mind Map
Baseline code for multi-class classification using clustering methods and ensembles - Google Colab notebook with Python code
https://colab.research.google.com/drive/1WcDmp_BaZ6KQtCQe85amITw8EbQ1EKWs?usp=sharing
Baseline code for multi-class classification using neural network methods - Google Colab notebook with Python code
https://colab.research.google.com/drive/1sLsTGkz8bOJnHO6p5f3g1N_UIB2YJuSt?usp=sharing
Modified code proposed by NotebookLM (which uses Gemini, with a format adapted from ChatGPT)
Prompt: "From the given code, propose a new code that is able to deal with multiple classes"
This guide explains how to transform an existing binary classification pipeline into a multi-class setup built around a logistic regression model. The hs2 column, which holds the first two digits of the Harmonized System code, serves as the new multi-class target variable.
Replace the original binary target xis_electronics with hs2, which groups products into broader categories (if hs2 is missing, see the derivation sketch after the inspection code below).
target_multiclass = 'hs2'
y = df[target_multiclass]
print(y.head())
print(y.unique()) # List of unique classes
print(y.value_counts()) # Distribution of classes
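If the dataset does not already contain an hs2 column, it can be derived from harmonized_system_code. A minimal sketch, assuming that column stores the full six-digit HS code as a string or integer:
# Hypothetical derivation: keep the first two digits (the HS chapter).
# zfill restores a leading zero lost if the code was stored as an integer.
df['hs2'] = df['harmonized_system_code'].astype(str).str.zfill(6).str[:2]
print(df['hs2'].nunique(), "distinct hs2 chapters")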
Drop irrelevant columns and make sure neither hs2 (the new target) nor xis_electronics (the former binary target) remains among the features.
drop_cols_multiclass = [
    'si_transaction_id', 'harmonized_system_code',
    'container_uncode', 'xis_electronics', target_multiclass
]
X = df.drop(columns=drop_cols_multiclass)
print(X.head())
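Before encoding, it helps to confirm which feature columns are still non-numeric. A quick check, assuming X is a pandas DataFrame:
# Columns of dtype 'object' still need categorical encoding.
print(X.dtypes)
print("Non-numeric columns:", X.select_dtypes(include='object').columns.tolist())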
Use target encoding to transform the pol_city_unlocode column. Note that category_encoders' TargetEncoder supports only binary and continuous targets: it replaces each category with the mean of the target, so an integer-coded multi-class target such as hs2 is treated as a number rather than as classes. A one-vs-rest workaround is sketched after the code below.
import category_encoders as ce
encoder = ce.TargetEncoder(return_df=True, smoothing=180)  # treats y as numeric; see the note above
X_ecd = encoder.fit_transform(X, y)
print(X_ecd.head())
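Because TargetEncoder replaces each category with the mean of the target, a class-aware alternative for multi-class problems is one-vs-rest target encoding, which produces one encoded column per class. A sketch, assuming pol_city_unlocode is the only column that needs encoding:
import pandas as pd
# One-vs-rest: encode the city against a binary indicator for each class.
encoded_parts = []
for cls in sorted(y.unique()):
    ovr_encoder = ce.TargetEncoder(smoothing=180)
    part = ovr_encoder.fit_transform(X[['pol_city_unlocode']], (y == cls).astype(int))
    encoded_parts.append(part.rename(columns={'pol_city_unlocode': f'pol_city_te_{cls}'}))
X_ecd_ovr = pd.concat([X.drop(columns=['pol_city_unlocode'])] + encoded_parts, axis=1)
print(X_ecd_ovr.head())
category_encoders also provides a PolynomialWrapper that automates this one-column-per-class pattern for multi-class targets.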
Standardize the feature values with StandardScaler, which helps the solver converge. For simplicity the scaler is fitted on the full dataset here; fitting it only on the training split avoids leaking test statistics (see the pipeline sketch after the split below).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_ss_ecd = scaler.fit_transform(X_ecd)
print(X_ss_ecd[:5])
Separate the data for training and testing using an 80/20 split. With an imbalanced multi-class target, passing stratify=y keeps the class proportions similar in both sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X_ss_ecd, y, test_size=0.2, random_state=42
)
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
Use the multinomial setting in LogisticRegression to handle multiple classes. (In scikit-learn 1.5 and later the multi_class argument is deprecated; the lbfgs solver fits a multinomial model by default, so the argument can simply be dropped there.)
from sklearn.linear_model import LogisticRegression
logistic_model_multiclass = LogisticRegression(
    multi_class='multinomial',  # deprecated in scikit-learn >= 1.5; lbfgs is multinomial by default
    solver='lbfgs',
    random_state=1,
    max_iter=1000
)
logistic_model_multiclass.fit(X_train, y_train)
print("Multi-class Logistic Regression Model trained.")
Evaluate the predictions with accuracy, the confusion matrix, and the per-class classification report.
from sklearn.metrics import (
    accuracy_score, classification_report,
    confusion_matrix, ConfusionMatrixDisplay
)
import matplotlib.pyplot as plt
y_pred = logistic_model_multiclass.predict(X_test)
print("Accuracy (Multi-class): %.3f" % accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
# Plot the confusion matrix
labels_multiclass = sorted(y_test.unique())
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=labels_multiclass)
plt.show()
# Detailed per-class metrics
print("Classification Report:\n", classification_report(y_test, y_pred))
Use k-fold cross-validation to validate the model across different data splits; with an imbalanced multi-class target, a stratified variant (sketched after the code below) is usually preferable.
from sklearn.model_selection import KFold, cross_val_score
model_kfold = LogisticRegression(
    multi_class='multinomial',
    solver='lbfgs',
    random_state=1,
    max_iter=1000
)
kfold = KFold(n_splits=10)
scores = cross_val_score(model_kfold, X_train, y_train, cv=kfold)
print("Cross-validation scores:", scores)
print("Mean accuracy:", scores.mean())
print("Standard deviation:", scores.std())
This adaptation turns a binary classification setup into a multi-class machine learning pipeline with minimal changes to the structure, covering encoding, scaling, model training, and evaluation for a categorical target with many classes.