This project builds and compares two machine learning models — Logistic Regression and Random Forest — to predict mortgage default risk using a real-world loan-level dataset.
The goal is to evaluate how different model architectures handle classification tasks in credit risk analysis, and to understand the trade‑offs in interpretability, predictive power, and practical deployment.
We obtained the mortgage default data from Kaggle's Loan Default Dataset.
The dataset contains borrower, loan, and property characteristics such as:
Loan amount, interest rate, LTV ratio
Property value, income, credit score
Loan purpose, occupancy type, and more
The target variable is Status (default or non‑default).
Missing values: Imputed with median (numeric) and most frequent value (categorical).
Categorical variables: Encoded via one‑hot encoding.
Feature scaling: Not required for tree-based models, but applied for logistic regression so that solver convergence and coefficient magnitudes are not distorted by differing feature scales.
Train-test split: 80% training, 20% testing, stratified to preserve the default/non-default class ratio.
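The preprocessing steps above can be sketched as a scikit-learn pipeline. The toy DataFrame and its column names below are illustrative stand-ins for the Kaggle table, not the actual schema:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the loan table; column names are illustrative.
df = pd.DataFrame({
    "loan_amount": [100_000, 250_000, np.nan, 180_000, 90_000, 300_000],
    "credit_score": [700, 620, 580, np.nan, 710, 650],
    "loan_purpose": ["purchase", "refi", "purchase", None, "refi", "purchase"],
    "Status": [0, 1, 1, 0, 0, 1],
})

numeric = ["loan_amount", "credit_score"]
categorical = ["loan_purpose"]

# Median imputation + scaling for numeric columns,
# most-frequent imputation + one-hot encoding for categorical columns.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

X = df.drop(columns="Status")
y = df["Status"]

# Stratified 80/20 split preserves the default/non-default ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

X_train_t = preprocess.fit_transform(X_train)
```

Fitting the transformer on the training split only (and later calling `transform` on the test split) keeps test-set information out of the imputation and scaling statistics.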
We implemented two core models:
Logistic Regression: A classical, interpretable baseline model for binary classification.
Random Forest Classifier: An ensemble of decision trees capable of capturing nonlinearities and complex interactions.
To limit overfitting in the Random Forest, we constrained two key hyperparameters:
max_depth = 10
min_samples_leaf = 20
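A minimal sketch of the two models with the stated hyperparameter caps, fitted here on a synthetic imbalanced problem (the toy data and the default `n_estimators` are assumptions, not details from the project):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced binary problem standing in for the loan data.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

# Interpretable baseline; max_iter raised so the solver reliably converges.
logit = LogisticRegression(max_iter=1000).fit(X, y)

# Depth and leaf-size caps from the text constrain tree complexity.
forest = RandomForestClassifier(max_depth=10, min_samples_leaf=20,
                                random_state=42).fit(X, y)

# Every tree in the ensemble respects the depth cap.
deepest = max(tree.get_depth() for tree in forest.estimators_)
```

Capping `max_depth` bounds each tree's complexity directly, while `min_samples_leaf` forbids leaves that memorize a handful of loans; together they trade a little training accuracy for better generalization.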
| Metric | Logistic Regression | Random Forest |
| --- | --- | --- |
| Accuracy | 0.543 | 0.987 |
| Precision (class 1) | 0.30 | 0.99 |
| Recall (class 1) | 0.62 | 0.96 |
| F1-score (class 1) | 0.40 | 0.97 |
| AUC | 0.59 | 1.00 |
Both models’ ROC curves clearly illustrate Random Forest’s superior discriminative power (AUC = 1.00) compared to logistic regression (AUC = 0.59).
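The metrics above can be computed with scikit-learn's scoring functions. This sketch runs on synthetic data, so the resulting numbers will not match the table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the loan table.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

model = RandomForestClassifier(max_depth=10, min_samples_leaf=20,
                               random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]   # scores for the ROC curve / AUC

report = {
    "accuracy": accuracy_score(y_te, pred),
    "precision": precision_score(y_te, pred),  # class 1 = default
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
    "auc": roc_auc_score(y_te, proba),
}
```

Note that AUC is computed from the predicted probabilities rather than the hard labels, which is why it can differ sharply from accuracy on imbalanced data.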
Logistic Regression:
High false-positive and false-negative rates, reflecting its inability to capture nonlinear patterns.
Limited predictive power, with substantial misclassification of defaults.
Random Forest:
Strong performance with minimal misclassifications.
Captures complex relationships between features and default behavior.
Logistic Regression strengths:
Interpretable coefficients with a clear economic interpretation.
Logistic Regression limitations:
Poor predictive performance on complex, nonlinear data.
Highly sensitive to class imbalance and feature scaling.
Struggles with feature interactions and nonlinearity, leading to low precision and recall on the default class.
Random Forest strengths:
Substantial improvement in accuracy and AUC.
Captures nonlinear relationships and feature interactions.
Insensitive to feature scaling and relatively robust to noisy or imputed inputs.
Random Forest limitations:
Less interpretable than logistic regression.
Potential risk of overfitting, mitigated here by capping tree depth and minimum leaf size.
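One common way to recover some interpretability from the forest is to inspect impurity-based feature importances. The feature names below are hypothetical stand-ins for the real loan columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names standing in for the actual loan columns.
names = ["loan_amount", "interest_rate", "ltv", "property_value",
         "income", "credit_score", "loan_purpose_enc", "occupancy_enc"]

# Synthetic data; the real project would use the preprocessed loan table.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)

rf = RandomForestClassifier(max_depth=10, min_samples_leaf=20,
                            random_state=1).fit(X, y)

# Impurity-based importances sum to 1 and rank candidate default drivers.
importances = pd.Series(rf.feature_importances_,
                        index=names).sort_values(ascending=False)
```

Impurity-based importances can favor high-cardinality features, so permutation importance on a held-out set is often a sturdier check.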
This experiment highlights the significant performance gap between traditional logistic regression and modern ensemble methods in mortgage default prediction.
While logistic regression remains a useful baseline model for interpretability, Random Forest offers superior predictive power and practical utility for real‑world credit risk assessment.
From a business perspective, a model with high recall and precision (like Random Forest) can reduce financial losses by identifying high‑risk loans more accurately, while still offering insights into key drivers of default risk.