Capstone Project 2 - Two Classes only - Mind Map
Google colab notebook with Python code
https://colab.research.google.com/drive/1cuGko7Ev2iiCBk5as3L4ONaErtGk6D-9?usp=sharing
Explanation of the code, produced with ChatGPT (first part only)
This notebook is likely focused on logistic regression modeling for shipment classification or analysis, involving preprocessing, feature scaling, and dataset splitting.
1. Library Imports
import warnings
warnings.filterwarnings("ignore")

import pandas as pd, numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt, seaborn as sns
Uses standard libraries for data manipulation, visualization, and machine learning.
warnings.filterwarnings("ignore") suppresses warnings for cleaner output.
%matplotlib inline is a Jupyter magic command for inline plots.
2. Data Loading
url='https://drive.google.com/uc?export=download&id=...'
df = pd.read_excel(url, sheet_name='bdm_rs', dtype={'harmonized_system_code':str})
Loads a dataset from a Google Drive Excel file.
Uses sheet_name='bdm_rs' and forces harmonized_system_code to be read as a string, which preserves leading zeros and keeps the column suitable for categorical encoding later.
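Why the dtype matters can be shown with a small in-memory example (hypothetical data, using read_csv since the actual Drive URL is elided above): harmonized system codes can begin with leading zeros, which default numeric inference silently drops.

```python
import io
import pandas as pd

# Hypothetical sample: HS codes may start with a leading zero.
csv = io.StringIO("harmonized_system_code,weight\n010121,500\n840733,120\n")

numeric = pd.read_csv(csv)  # default type inference parses the code as int
csv.seek(0)
as_text = pd.read_csv(csv, dtype={'harmonized_system_code': str})

print(numeric['harmonized_system_code'].iloc[0])   # 10121  (leading zero lost)
print(as_text['harmonized_system_code'].iloc[0])   # 010121 (preserved)
```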
3. Initial Data Inspection
df.info()
df.isnull().sum()
len(df['pol_city_unlocode'].unique())
df.info() shows the structure and column types.
df.isnull().sum() counts missing values.
Checking for unique values in pol_city_unlocode suggests a focus on geographic/city-level logistics.
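On a small hypothetical frame, the same inspection calls look like this. One detail worth noting: nunique() drops NaN, while the notebook's len(df[col].unique()) counts NaN as a value, so the two can differ by one on columns with missing entries.

```python
import pandas as pd
import numpy as np

# Illustrative data, not the actual shipment dataset
df = pd.DataFrame({
    'pol_city_unlocode': ['BRSSZ', 'CNSHA', 'BRSSZ', None],
    'weight_kg': [500.0, np.nan, 120.0, 75.0],
})

df.info()                                    # column dtypes and non-null counts
missing = df.isnull().sum()                  # missing values per column
n_ports = df['pol_city_unlocode'].nunique()  # unique port codes (NaN excluded)

print(missing['weight_kg'], n_ports)         # 1 2
```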
Concept Explanation
Exploratory Data Analysis (EDA)
Understanding dataset structure, null values, and unique identifiers.
Feature Engineering
Likely next steps will involve preparing categorical/logistics features such as city codes and port information.
Logistic Regression
LogisticRegression is imported early, suggesting binary or multinomial classification.
Scaling First
From the notebook's title, scaling of features (e.g., StandardScaler, MinMaxScaler) is expected before data is split.
Splitting Second
Post-scaling, the dataset will be divided into training and test sets. Note that scaling before the split actually risks data leakage, since the scaler's statistics are computed over test rows too; the conventional order is to split first and fit the scaler on the training data only.
UN/LOCODE usage
Suggests focus on global shipment classification, where city and country codes are essential features.
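As a point of good practice, scalers are usually fit on the training split only, so test-set statistics cannot leak into preprocessing. A minimal sketch of that split-then-scale pattern on synthetic data (all names and values illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary target

# Split first, then fit the scaler on the training portion only,
# so test-set statistics never influence the transform.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler().fit(X_train)    # statistics from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_test_s.shape)    # (80, 3) (20, 3)
```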
Based on these first cells and the title, the likely next steps are:
Encoding categorical variables (e.g., harmonized_system_code, pol_city_unlocode).
Feature scaling using preprocessing tools.
Splitting into training/test sets with train_test_split.
Training a logistic regression model.
Evaluation using accuracy, confusion matrix, etc.
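Those anticipated steps can be sketched end to end on synthetic data. Everything below is illustrative: the column names, target, and values are stand-ins, not the actual shipment dataset.

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the shipment data
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    'pol_city_unlocode': rng.choice(['BRSSZ', 'CNSHA', 'NLRTM'], n),
    'weight_kg': rng.uniform(10, 1000, n),
})
df['target'] = (df['weight_kg'] > 500).astype(int)  # two classes

# 1. Encode the categorical port code as dummy columns
X = pd.get_dummies(df[['pol_city_unlocode', 'weight_kg']],
                   columns=['pol_city_unlocode'])
y = df['target']

# 2-3. Split, then scale (scaler fit on training data only)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# 4. Train a logistic regression model
model = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)

# 5. Evaluate with accuracy and a confusion matrix
pred = model.predict(X_test_s)
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```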