Capstone Project 2 - Two Classes only - Mind Map
Google colab notebook with Python code
https://colab.research.google.com/drive/1cuGko7Ev2iiCBk5as3L4ONaErtGk6D-9?usp=sharing
Explanation of the code, produced with ChatGPT (first part only)
This notebook is likely focused on logistic regression modeling for shipment classification or analysis, involving preprocessing, feature scaling, and dataset splitting.
1. Library Imports
import warnings
warnings.filterwarnings("ignore")

import pandas as pd, numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt, seaborn as sns
Uses standard libraries for data manipulation, visualization, and machine learning.
warnings.filterwarnings("ignore") suppresses warnings for cleaner output.
%matplotlib inline is a Jupyter magic command for inline plots.
2. Data Loading
url='https://drive.google.com/uc?export=download&id=...'
df = pd.read_excel(url, sheet_name='bdm_rs', dtype={'harmonized_system_code':str})
Loads a dataset from a Google Drive Excel file.
Uses sheet_name='bdm_rs' and forces harmonized_system_code to be read as a string, which preserves leading zeros and keeps the column suitable for categorical encoding later.
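Why the dtype matters can be shown with a small in-memory example (hypothetical data, using read_csv since the actual Drive URL is elided above): harmonized system codes can begin with leading zeros, which default numeric inference silently drops.

```python
import io
import pandas as pd

# Hypothetical sample: HS codes may start with a leading zero.
csv = io.StringIO("harmonized_system_code,weight\n010121,500\n840733,120\n")

numeric = pd.read_csv(csv)  # default type inference parses the code as int
csv.seek(0)
as_text = pd.read_csv(csv, dtype={'harmonized_system_code': str})

print(numeric['harmonized_system_code'].iloc[0])   # 10121  (leading zero lost)
print(as_text['harmonized_system_code'].iloc[0])   # 010121 (preserved)
```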
3. Initial Data Inspection
df.info()
df.isnull().sum()
len(df['pol_city_unlocode'].unique())
df.info() shows the structure and column types.
df.isnull().sum() counts missing values.
Checking for unique values in pol_city_unlocode suggests a focus on geographic/city-level logistics.
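On a small hypothetical frame, the same inspection calls look like this. One detail worth noting: nunique() drops NaN, while the notebook's len(df[col].unique()) counts NaN as a value, so the two can differ by one on columns with missing entries.

```python
import pandas as pd
import numpy as np

# Illustrative data, not the actual shipment dataset
df = pd.DataFrame({
    'pol_city_unlocode': ['BRSSZ', 'CNSHA', 'BRSSZ', None],
    'weight_kg': [500.0, np.nan, 120.0, 75.0],
})

df.info()                                    # column dtypes and non-null counts
missing = df.isnull().sum()                  # missing values per column
n_ports = df['pol_city_unlocode'].nunique()  # unique port codes (NaN excluded)

print(missing['weight_kg'], n_ports)         # 1 2
```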
Concept Explanation
Exploratory Data Analysis (EDA)
Understanding dataset structure, null values, and unique identifiers.
Feature Engineering
Likely next steps will involve preparing categorical/logistics features such as city codes and port information.
Logistic Regression
LogisticRegression is imported early, suggesting binary or multinomial classification.
Scaling First
From the notebook's title, scaling of features (e.g., StandardScaler, MinMaxScaler) is expected before data is split.
Splitting Second
Post-scaling, the dataset will be divided into training and test sets. Note that scaling before the split actually risks data leakage, since the scaler's statistics are computed over test rows too; the conventional order is to split first and fit the scaler on the training data only.
UN/LOCODE usage
Suggests focus on global shipment classification, where city and country codes are essential features.
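As a point of good practice, scalers are usually fit on the training split only, so test-set statistics cannot leak into preprocessing. A minimal sketch of that split-then-scale pattern on synthetic data (all names and values illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary target

# Split first, then fit the scaler on the training portion only,
# so test-set statistics never influence the transform.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler().fit(X_train)    # statistics from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_test_s.shape)    # (80, 3) (20, 3)
```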
Based on these first cells and the title, the likely next steps are:
Encoding categorical variables (e.g., harmonized_system_code, pol_city_unlocode).
Feature scaling using preprocessing tools.
Splitting into training/test sets with train_test_split.
Training a logistic regression model.
Evaluation using accuracy, confusion matrix, etc.
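Those anticipated steps can be sketched end to end on synthetic data. Everything below is illustrative: the column names, target, and values are stand-ins, not the actual shipment dataset.

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the shipment data
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    'pol_city_unlocode': rng.choice(['BRSSZ', 'CNSHA', 'NLRTM'], n),
    'weight_kg': rng.uniform(10, 1000, n),
})
df['target'] = (df['weight_kg'] > 500).astype(int)  # two classes

# 1. Encode the categorical port code as dummy columns
X = pd.get_dummies(df[['pol_city_unlocode', 'weight_kg']],
                   columns=['pol_city_unlocode'])
y = df['target']

# 2-3. Split, then scale (scaler fit on training data only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# 4. Train a logistic regression model
model = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)

# 5. Evaluate with accuracy and a confusion matrix
pred = model.predict(X_test_s)
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```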