Models Implemented
- Housing, Demographics, Migration Data -

Random Forest Regression (Cost of Living)

Objective: Predict the average housing price in each ZIP code to explore how cost of living varies between regions in Colorado and Utah.

Why it was chosen: Random Forest captures non-linear relationships and works well with mixed numeric features. It also provides feature importance, which supports interpretation.

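Assumptions: No assumptions about linearity or the distribution of the data; the model is relatively robust to irrelevant variables and, depending on the implementation, can tolerate missing feature values. Rows with a missing target still had to be removed (see below).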

Challenges & solutions:

Initial leakage from price_per_sqft feature → dropped it to avoid inflating model performance

The target contained missing values → filtered them out
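
The two fixes above can be sketched in pandas. This is a minimal illustration rather than the project's actual code: the table values are invented, and the target column name average_price and the identifier column zip_code are hypothetical. Only price_per_sqft, average_sqft, and median_household_income are names taken from this page.

```python
import numpy as np
import pandas as pd

# Toy ZIP-level table; every value here is invented for illustration.
df = pd.DataFrame({
    "zip_code": ["80202", "80301", "84101", "84604"],        # hypothetical column
    "average_price": [650000.0, np.nan, 480000.0, 420000.0],  # hypothetical target name
    "average_sqft": [1800.0, 2100.0, 1900.0, 2000.0],
    "price_per_sqft": [361.1, np.nan, 252.6, 210.0],
    "median_household_income": [85000.0, 92000.0, 70000.0, 68000.0],
})

TARGET = "average_price"
LEAKY_FEATURES = ["price_per_sqft"]  # assumed to be derived from the price itself

# Keep only rows with a known target, then drop the leaky feature.
clean = df.dropna(subset=[TARGET]).drop(columns=LEAKY_FEATURES)

# Features exclude the target and the identifier column.
X = clean.drop(columns=[TARGET, "zip_code"])
y = clean[TARGET]

print(X.columns.tolist())
print(len(X), "rows kept of", len(df))
```

Dropping the leaky column before splitting the data means no later step can reintroduce it by accident.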

Performance: R-squared = 0.3099

Top features: population_change, average_baths, average_beds, median_household_income, average_sqft
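
As a rough sketch of how a score and a feature ranking like the ones above can be obtained with scikit-learn, the example below fits a RandomForestRegressor on synthetic data. The data, the target formula, and the hyperparameters are all assumptions; the page does not state the settings used in the project, and the numbers printed here are unrelated to the reported R-squared.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic features that reuse column names from this page; values are invented.
X = pd.DataFrame({
    "average_sqft": rng.normal(2000, 400, n),
    "average_beds": rng.normal(3.2, 0.6, n),
    "median_household_income": rng.normal(75000, 15000, n),
})
# Synthetic target: an arbitrary formula plus noise, not a real price model.
y = 150 * X["average_sqft"] + 2 * X["median_household_income"] + rng.normal(0, 40000, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters are illustrative choices, not the project's settings.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# R-squared on held-out rows only.
r2 = r2_score(y_test, model.predict(X_test))

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(f"R-squared on held-out data: {r2:.3f}")
print(importances)
```

Impurity-based importances are normalized to sum to 1, so they rank features relative to each other rather than measuring an absolute effect on price.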

Random Forest Classifier (Demographic Differences)

Objective: Classify each ZIP code as either Colorado or Utah based on demographic and housing characteristics.

Why it was chosen: Random Forest provides interpretable feature importance and handles categorical labels without preprocessing.

Assumptions: Non-parametric, with no distributional assumptions; tolerates mixed data types and moderate class imbalance.

Challenges & solutions:

Some engineered features caused divide-by-zero errors → resolved using median imputation

Needed to suppress seaborn palette warnings for visualization
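
One way to handle the divide-by-zero case is sketched below, assuming the affected feature is a percentage such as hispanic_pct computed from two count columns. The count column names total_population and hispanic_population are hypothetical, and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy counts; the second row has a zero denominator.
df = pd.DataFrame({
    "total_population": [12000, 0, 8500, 15000],    # hypothetical column
    "hispanic_population": [3000, 0, 1700, 2250],   # hypothetical column
})

# Turn zero denominators into NaN so the division yields NaN
# instead of inf or a runtime warning.
denominator = df["total_population"].replace(0, np.nan)
pct = 100 * df["hispanic_population"] / denominator

# Median imputation: fill undefined ratios with the median of the valid ones.
df["hispanic_pct"] = pct.fillna(pct.median())

print(df["hispanic_pct"].tolist())
```

The median is computed from valid rows only, because pandas skips NaN values by default.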

Performance: Accuracy = 83.5%. F1-score, precision, recall, and the confusion matrix were also reported.

Top features: hispanic_pct, average_sqft, median_age, housing_units, median_household_income
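
The metrics listed above can all be produced with scikit-learn, as in the following sketch. The data is synthetic: the offsets between the two classes are arbitrary values chosen so that the classes are separable, not estimates of real differences between Colorado and Utah. The hyperparameters are likewise assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600
labels = ["Colorado", "Utah"]

# Synthetic state labels and features; all numbers are invented.
state = rng.choice(labels, size=n)
shift = (state == "Utah").astype(float)  # arbitrary class offset, for illustration only
X = pd.DataFrame({
    "median_age": rng.normal(38, 5, n) - 5 * shift,
    "hispanic_pct": rng.normal(20, 8, n) - 4 * shift,
    "average_sqft": rng.normal(2000, 400, n) + 200 * shift,
})

# Stratify so both states keep the same proportion in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, state, test_size=0.25, random_state=0, stratify=state
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=labels)  # rows: true, columns: predicted

print(f"Accuracy: {acc:.3f}")
print(cm)
print(classification_report(y_test, y_pred, labels=labels))
```

The string labels are passed to the classifier directly, which matches the point above about not needing to encode the class labels first.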

Random Forest Regression (Migration Patterns)

Objective: Predict the total number of people who moved in the last year (total_moved_last_year) based on housing and demographic features.

Why it was chosen: Predicting the raw total worked far better than predicting the normalized migration_rate, which proved too noisy.

Assumptions: The model is assumed to handle a skewed target and population-size effects well.

Challenges & solutions:

Migration rate was too volatile and produced very low R^2 → switched to predicting raw migration totals

Final model yielded extremely strong results

Performance: R-squared = 0.973

Top features: housing_units, population_change, average_sqft, median_age, average_beds
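
The gap between the two targets can be reproduced on synthetic data. In the sketch below, the migration rate is constructed as pure noise that no feature can explain, while the raw total is simply population times that rate. The population range, noise levels, and the holdout_r2 helper are all assumptions made for illustration and do not come from the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Synthetic ZIP codes: population varies widely, the rate barely at all.
population = rng.uniform(1000, 50000, n)
X = pd.DataFrame({
    "housing_units": population / 2.5 + rng.normal(0, 200, n),  # tracks population
    "median_age": rng.normal(37, 5, n),
})

# By construction the rate is unrelated to every feature.
migration_rate = np.clip(rng.normal(0.15, 0.01, n), 0, None)
total_moved_last_year = population * migration_rate


def holdout_r2(target):
    """Fit a random forest on a train split and return R-squared on the test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, target, test_size=0.2, random_state=0
    )
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))


r2_total = holdout_r2(total_moved_last_year)
r2_rate = holdout_r2(migration_rate)

print(f"R-squared, raw totals:     {r2_total:.3f}")
print(f"R-squared, migration rate: {r2_rate:.3f}")
```

In this setup the totals are easy to predict only because housing_units stands in for population size, so a high R-squared on totals can coexist with almost no ability to predict the rate.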

Conclusion

Best Model for Comparing Colorado and Utah

The Random Forest Classifier provided the clearest insight into what distinguishes Colorado and Utah ZIP codes. It reached 83.5% accuracy, and its feature importances revealed strong demographic separation between the two states. The regression models were useful for predicting migration totals and housing prices, but less helpful for comparing across states.

Cost of Living Regression: Moderate predictive power, useful for feature importance, but price is driven heavily by property-specific attributes.

Migration Regression: Impressive performance, but the signal was dominated by total population size. The model explains the volume of movement, but not necessarily migration behavior.
