This project builds and compares two machine learning models — Logistic Regression and Random Forest — to predict mortgage default risk using a real-world loan-level dataset.
The goal is to evaluate how different model architectures handle classification tasks in credit risk analysis, and to understand the trade‑offs in interpretability, predictive power, and practical deployment.
We obtained the mortgage default data from Kaggle's Loan Default Dataset.
The dataset contains borrower, loan, and property characteristics such as:
Loan amount, interest rate, LTV ratio
Property value, income, credit score
Loan purpose, occupancy type, and more
The target variable is Status (default or non‑default).
Missing values: Imputed with median (numeric) and most frequent value (categorical).
Categorical variables: Encoded via one‑hot encoding.
Feature scaling: Not required for tree-based models, but applied for logistic regression so that solver convergence and coefficient magnitudes are not distorted by differing feature scales.
Train-test split: 80% training, 20% testing, stratified to preserve the default/non-default class ratio.
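The preprocessing steps above can be sketched as a scikit-learn pipeline. The toy DataFrame and its column names below are illustrative stand-ins for the Kaggle table, not the actual schema:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the loan table; column names are illustrative.
df = pd.DataFrame({
    "loan_amount": [100_000, 250_000, np.nan, 180_000, 90_000, 300_000],
    "credit_score": [700, 620, 580, np.nan, 710, 650],
    "loan_purpose": ["purchase", "refi", "purchase", None, "refi", "purchase"],
    "Status": [0, 1, 1, 0, 0, 1],
})

numeric = ["loan_amount", "credit_score"]
categorical = ["loan_purpose"]

# Median imputation + scaling for numeric columns,
# most-frequent imputation + one-hot encoding for categorical columns.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

X = df.drop(columns="Status")
y = df["Status"]

# Stratified 80/20 split preserves the default/non-default ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

X_train_t = preprocess.fit_transform(X_train)
```

Fitting the transformer on the training split only (and later calling `transform` on the test split) keeps test-set information out of the imputation and scaling statistics.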
We implemented two core models:
Logistic Regression: A classical, interpretable baseline model for binary classification.
Random Forest Classifier: An ensemble of decision trees capable of capturing nonlinearities and complex interactions.
To limit overfitting in the Random Forest, we constrained two key hyperparameters:
max_depth = 10
min_samples_leaf = 20
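A minimal sketch of the two models with the stated hyperparameter caps, fitted here on a synthetic imbalanced problem (the toy data and the default `n_estimators` are assumptions, not details from the project):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced binary problem standing in for the loan data.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

# Interpretable baseline; max_iter raised so the solver reliably converges.
logit = LogisticRegression(max_iter=1000).fit(X, y)

# Depth and leaf-size caps from the text constrain tree complexity.
forest = RandomForestClassifier(max_depth=10, min_samples_leaf=20,
                                random_state=42).fit(X, y)

# Every tree in the ensemble respects the depth cap.
deepest = max(tree.get_depth() for tree in forest.estimators_)
```

Capping `max_depth` bounds each tree's complexity directly, while `min_samples_leaf` forbids leaves that memorize a handful of loans; together they trade a little training accuracy for better generalization.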
| Metric | Logistic Regression | Random Forest |
| --- | --- | --- |
| Accuracy | 0.543 | 0.987 |
| Precision (class 1) | 0.30 | 0.99 |
| Recall (class 1) | 0.62 | 0.96 |
| F1-score (class 1) | 0.40 | 0.97 |
| AUC | 0.59 | 1.00 |
Both models’ ROC curves clearly illustrate Random Forest’s superior discriminative power (AUC = 1.00) compared to logistic regression (AUC = 0.59).
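The metrics above can be computed with scikit-learn's scoring functions. This sketch runs on synthetic data, so the resulting numbers will not match the table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the loan table.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

model = RandomForestClassifier(max_depth=10, min_samples_leaf=20,
                               random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]   # scores for the ROC curve / AUC

report = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),  # class 1 = default
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
    "auc": roc_auc_score(y_te, proba),
}
```

Note that AUC is computed from the predicted probabilities rather than the hard labels, which is why it can differ sharply from accuracy on imbalanced data.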
Logistic Regression:
High false-positive and false-negative rates, reflecting its inability to capture nonlinear patterns.
Limited predictive power, with substantial misclassification of defaults.
Random Forest:
Strong performance with minimal misclassifications.
Captures complex relationships between features and default behavior.
Logistic Regression strengths:
Interpretable coefficients with a clear economic interpretation.
Logistic Regression limitations:
Poor predictive performance on complex, nonlinear data.
Highly sensitive to class imbalance and feature scaling.
Struggles with feature interactions and nonlinearity, leading to low precision and recall on the default class.
Random Forest strengths:
Substantial improvement in accuracy and AUC.
Captures nonlinear relationships and feature interactions.
Insensitive to feature scaling and relatively robust to noisy or imputed inputs.
Random Forest limitations:
Less interpretable than logistic regression.
Potential risk of overfitting, mitigated here by capping tree depth and minimum leaf size.
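One common way to recover some interpretability from the forest is to inspect impurity-based feature importances. The feature names below are hypothetical stand-ins for the real loan columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names standing in for the actual loan columns.
names = ["loan_amount", "interest_rate", "ltv", "property_value",
         "income", "credit_score", "loan_purpose_enc", "occupancy_enc"]

# Synthetic data; the real project would use the preprocessed loan table.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)

rf = RandomForestClassifier(max_depth=10, min_samples_leaf=20,
                            random_state=1).fit(X, y)

# Impurity-based importances sum to 1 and rank candidate default drivers.
importances = pd.Series(rf.feature_importances_,
                        index=names).sort_values(ascending=False)
```

Impurity-based importances can favor high-cardinality features, so permutation importance on a held-out set is often a sturdier check.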
This experiment highlights the significant performance gap between traditional logistic regression and modern ensemble methods in mortgage default prediction.
While logistic regression remains a useful baseline model for interpretability, Random Forest offers superior predictive power and practical utility for real‑world credit risk assessment.
From a business perspective, a model with high recall and precision (like Random Forest) can reduce financial losses by identifying high‑risk loans more accurately, while still offering insights into key drivers of default risk.