Models Implemented
- Housing, Demographics, Migration Data -
Objective: Predict the average housing price in each ZIP code to explore how cost of living varies between regions in Colorado and Utah.
Why it was chosen: Random Forest captures non-linear relationships and works well with mixed numeric features. It also provides feature importance, which supports interpretation.
Assumptions: No assumptions about linearity or data distribution; robust to irrelevant variables and, where the implementation supports it, to missing feature values.
Challenges & solutions:
Initial leakage from price_per_sqft feature → dropped it to avoid inflating model performance
The target contained missing values → filtered them out
Performance: R-squared = 0.3099
Top features: population_change, average_baths, average_beds, median_household_income, average_sqft
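The leakage fix and missing-target filter described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the target column name average_price, the generated values, and the hyperparameters are all assumptions, and scikit-learn's RandomForestRegressor is assumed as the implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the ZIP-level table (the real data is not shown here).
df = pd.DataFrame({
    "population_change": rng.normal(0, 500, n),
    "average_baths": rng.uniform(1, 4, n),
    "average_beds": rng.uniform(1, 5, n),
    "median_household_income": rng.normal(70_000, 15_000, n),
    "average_sqft": rng.uniform(800, 3500, n),
})
df["average_price"] = (
    150 * df["average_sqft"]
    + 2 * df["median_household_income"]
    + rng.normal(0, 100_000, n)
)
# This feature is computed from the target itself, which is what makes it leaky.
df["price_per_sqft"] = df["average_price"] / df["average_sqft"]
# Blank out 5% of targets to mimic the missing values mentioned above.
df.loc[df.sample(frac=0.05, random_state=0).index, "average_price"] = np.nan

TARGET = "average_price"      # hypothetical column name
LEAKY = ["price_per_sqft"]    # dropped to avoid inflating performance

clean = df.dropna(subset=[TARGET])           # filter rows with a missing target
X = clean.drop(columns=[TARGET] + LEAKY)     # remove target and leaky feature
y = clean[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

r2 = r2_score(y_test, model.predict(X_test))
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(f"R^2 on held-out rows: {r2:.3f}")
print(importances)
```

Scoring on a held-out split rather than on the training rows keeps the reported R-squared from being inflated in a second way, independent of the leaky feature.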
Objective: Classify each ZIP code as either Colorado or Utah based on demographic and housing characteristics.
Why it was chosen: Random Forest provides interpretable feature importance and handles categorical labels without preprocessing.
Assumptions: Non-parametric; handles class imbalance and mixed data types well.
Challenges & solutions:
Some engineered features caused divide-by-zero errors → resolved using median imputation
Needed to suppress seaborn palette warnings for visualization
Performance: Accuracy = 83.5%; F1-score, Precision, Recall, and Confusion Matrix all reported
Top features: hispanic_pct, average_sqft, median_age, housing_units, median_household_income
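One way the divide-by-zero fix could look is sketched below: ratios with a zero denominator are treated as missing and then filled with the median of the valid ratios. The helper name safe_ratio, its fallback parameter, and the population-per-housing-unit example are hypothetical; the source does not say which engineered features were affected.

```python
from statistics import median


def safe_ratio(numerators, denominators, fallback=0.0):
    """Element-wise ratio; zero-denominator entries get the median of the valid ratios.

    `fallback` is used only when every denominator is zero, so no median exists.
    """
    # Mark undefined ratios as None instead of raising ZeroDivisionError.
    ratios = [n / d if d != 0 else None for n, d in zip(numerators, denominators)]
    valid = [r for r in ratios if r is not None]
    fill = median(valid) if valid else fallback
    # Median imputation: replace each undefined ratio with the fill value.
    return [fill if r is None else r for r in ratios]


# Hypothetical engineered feature: people per housing unit.
population = [1200, 850, 0, 4300]
housing_units = [480, 0, 0, 1500]
print(safe_ratio(population, housing_units))
```

The median is less sensitive than the mean to a few extreme ratios, which matters when some ZIP codes have very small denominators.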
Objective: Predict the total number of people who moved in the last year (total_moved_last_year) based on housing and demographic features.
Why it was chosen: Predicting the raw total worked far better than predicting the normalized migration_rate, which proved too noisy.
Assumptions: The model handles a skewed target and population-size effects well.
Challenges & solutions:
Migration rate was too volatile and produced very low R^2 → switched to predicting raw migration totals
Final model yielded extremely strong results
Performance: R-squared = 0.973
Top features: housing_units, population_change, average_sqft, median_age, average_beds
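The gap between the two targets can be illustrated with a small simulation. This is a sketch on synthetic data under stated assumptions, not a reproduction of the reported results: the distributions are invented, and a one-feature least-squares line is swapped in for the actual model to keep the example dependency-free. When the per-capita rate is unrelated to the features, the raw total still scores a high R^2 simply because it scales with population size.

```python
import random


def r_squared(x, y):
    """R^2 of a one-feature least-squares line (with intercept) fitted to (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # For simple linear regression, R^2 equals the squared correlation.
    return sxy ** 2 / (sxx * syy)


rng = random.Random(0)
n = 500

# Assumed, synthetic ZIP codes: population spans orders of magnitude.
population = [rng.lognormvariate(8, 1) for _ in range(n)]
# Housing units track population closely (about 2.5 people per unit).
housing_units = [p / 2.5 * rng.uniform(0.9, 1.1) for p in population]
# The migration rate is pure noise with respect to the features.
migration_rate = [rng.uniform(0.10, 0.20) for _ in range(n)]
# The raw total is rate times population, so it inherits the size signal.
total_moved = [p * r for p, r in zip(population, migration_rate)]

r2_total = r_squared(housing_units, total_moved)
r2_rate = r_squared(housing_units, migration_rate)

print(f"R^2 predicting raw totals: {r2_total:.3f}")
print(f"R^2 predicting the rate:   {r2_rate:.3f}")
```

In this setup the high score for raw totals says little about migration behavior; it mostly reflects that larger ZIP codes have more movers.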
Best Model for Comparing Colorado and Utah
The Random Forest Classifier provided the clearest insight into what distinguishes Colorado and Utah ZIP codes. It performed well (83.5% accuracy), and its feature importances revealed strong demographic separation between the two states. The regression models for price and raw migration were useful in their own right but less helpful for comparing across states.
Cost of Living Regression: Moderate predictive power, useful for feature importance, but price is driven heavily by property-specific attributes.
Migration Regression: Impressive performance, but the signal was dominated by total population size. It explains movement patterns but not necessarily migration behaviors.