Models Implemented
K-Means Clustering
Objective: Group ZIP codes with similar uninsured rates into clusters. This reveals the geographic characteristics of areas with many uninsured residents and areas with few.
Why chosen: Ideal for grouping unlabeled, numeric data (e.g., uninsured rate, population).
Assumptions: Numeric, standardized features; spherical, equally sized clusters; predefined k.
Hyperparameter tuning: Tested k=2–7; selected best k using silhouette score and Davies-Bouldin index.
Challenges & solutions:
Missing data → used .dropna()
Index mismatch → aligned using .loc[X.index]
Scaling required → applied StandardScaler
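The K-Means steps above (drop missing rows, standardize, test k = 2 to 7, write labels back by index) could look roughly like the sketch below. It is a minimal illustration on synthetic data, not the project's actual code: the data values, the random seeds, and the choice to pick k by silhouette score alone are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ZIP-code table (values are illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Uninsured_Rate": np.concatenate([rng.normal(5, 1, 60), rng.normal(15, 2, 60)]),
    "Total_Population": np.concatenate([rng.normal(30000, 4000, 60), rng.normal(5000, 1000, 60)]),
})
df.loc[[3, 70], "Uninsured_Rate"] = np.nan  # simulate missing values

features = ["Uninsured_Rate", "Total_Population"]
X = df[features].dropna()                     # missing data: drop incomplete rows
X_scaled = StandardScaler().fit_transform(X)  # K-Means is distance-based, so scale first

# Score every candidate k with both metrics.
scores = {}
for k in range(2, 8):  # k = 2..7
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = {
        "silhouette": silhouette_score(X_scaled, labels),          # higher is better
        "davies_bouldin": davies_bouldin_score(X_scaled, labels),  # lower is better
    }

# Assumption: choose k by the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
final_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_scaled)

# Index mismatch: assign labels only to the rows that survived dropna().
df.loc[X.index, "Cluster"] = final_labels
```

Rows that were dropped keep a missing Cluster value, which makes the index alignment easy to check.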
Regression (Linear & Ridge)
Objective: Examine whether the uninsured rate (Uninsured_Rate) can be predicted from the total population (Total_Population) of each ZIP code.
Why chosen: To test if Total_Population can explain Uninsured_Rate using a simple numeric model.
Assumptions: Linearity, homoscedasticity, normal residuals; Ridge adds L2 regularization.
Hyperparameter tuning: Ridge alpha tuned via GridSearchCV; the best alpha was 1000, with no performance gain over plain Linear Regression.
Challenges & solutions:
Very low R² → Ridge applied, but still ineffective
Only one feature → suggested adding more variables
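The comparison described above can be sketched as follows. The data is synthetic and deliberately has no relationship between population and uninsured rate; the alpha grid, the train/test split, and the R² scoring choice are assumptions, not the project's recorded settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data: the target is generated independently of the feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Total_Population": rng.integers(500, 60000, size=300).astype(float),
    "Uninsured_Rate": rng.normal(9.0, 4.0, size=300),
}).dropna()

X = df[["Total_Population"]]
y = df["Uninsured_Rate"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Baseline: ordinary least squares.
linear = LinearRegression().fit(X_train, y_train)

# Ridge: search over a hypothetical alpha grid with 5-fold cross-validation.
param_grid = {"alpha": [0.01, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

r2_linear = r2_score(y_test, linear.predict(X_test))
r2_ridge = r2_score(y_test, search.best_estimator_.predict(X_test))
print(f"best alpha: {search.best_params_['alpha']}")
print(f"R2 linear: {r2_linear:.3f}  R2 ridge: {r2_ridge:.3f}")
```

When the single feature carries no signal, regularization has nothing to improve, so both R² values stay close to zero regardless of alpha.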
Decision Tree Classification
Objective: Classify whether the uninsured rate (Uninsured_Rate) of a ZIP code is high or low, based on its total population (Total_Population).
Why chosen: Suitable for small, tabular data; interpretable and handles numeric input directly.
Assumptions: No assumptions on feature distribution or linearity.
Hyperparameter tuning: max_depth and min_samples_split tuned via GridSearchCV (5-fold, F1-score).
Challenges & solutions:
Continuous target → created binary target with KBinsDiscretizer
Overfitting risk → tuned tree depth
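A minimal sketch of the classification setup above, on synthetic data. The two quantile bins (a median split), the column name High_Uninsured, and the parameter grid values are assumptions made for illustration; only KBinsDiscretizer, GridSearchCV with 5 folds, and F1 scoring come from the description.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: uninsured rate loosely decreases with population.
rng = np.random.default_rng(0)
population = rng.integers(500, 60000, size=400).astype(float)
uninsured = 14.0 - 0.0001 * population + rng.normal(0, 3.0, size=400)
df = pd.DataFrame({"Total_Population": population, "Uninsured_Rate": uninsured})

# Binary target: 0 = low, 1 = high (assumed split at the median).
binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile")
df["High_Uninsured"] = binner.fit_transform(df[["Uninsured_Rate"]]).ravel().astype(int)

X = df[["Total_Population"]]
y = df["High_Uninsured"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid; shallow depths limit overfitting.
param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_split": [2, 10, 20]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="f1"
)
search.fit(X_train, y_train)

# With scoring="f1", score() returns the F1-score on the held-out data.
test_f1 = search.score(X_test, y_test)
print(search.best_params_, round(test_f1, 3))
```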
Apriori Analysis
Objective: Extract the most common combinations of uninsured-rate level and total-population level across ZIP codes.
Why chosen: Best for discovering frequent patterns in binned insurance and population data.
Assumptions: One-hot encoded categorical inputs; relies on the Apriori property that every subset of a frequent itemset is also frequent (so any superset of an infrequent itemset can be pruned).
Hyperparameter tuning: Tuned min_support and min_lift to balance rule quantity and strength.
Challenges & solutions:
Numeric input → discretized into bins
Too few/many rules → tuned thresholds
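To make the roles of min_support and min_lift concrete, here is a small pure-Python sketch of the Apriori idea. It is written for illustration only and is not the library the project used; the function names apriori and rules, the item labels, and the eight toy transactions are all hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {c: s for c in current if (s := support(c)) >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

def rules(frequent, min_lift):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    out = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = supp / frequent[antecedent]
                lift = confidence / frequent[consequent]
                if lift >= min_lift:
                    out.append((set(antecedent), set(consequent), supp, confidence, lift))
    return out

# Toy transactions: one per ZIP code, one item per binned attribute.
transactions = [
    {"Uninsured=High", "Population=Small"},
    {"Uninsured=High", "Population=Small"},
    {"Uninsured=High", "Population=Small"},
    {"Uninsured=Low", "Population=Large"},
    {"Uninsured=Low", "Population=Large"},
    {"Uninsured=Low", "Population=Small"},
    {"Uninsured=High", "Population=Large"},
    {"Uninsured=Low", "Population=Large"},
]

frequent = apriori(transactions, min_support=0.3)
found = rules(frequent, min_lift=1.2)
for antecedent, consequent, supp, conf, lift in found:
    print(antecedent, "->", consequent, supp, conf, lift)
```

Raising min_support removes rare combinations before any rule is formed, while raising min_lift keeps only rules whose antecedent and consequent co-occur more often than chance would predict.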
Performance Evaluation
K-Means Clustering
Evaluation Metrics:
Silhouette Score (↑ better separation)
Davies-Bouldin Index (↓ less overlap)
Results: See this page.
Regression (Linear & Ridge)
Evaluation Metrics: RMSE, MSE, and R²; all three indicated a weak model fit.
Results: See this page.
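For reference, the regression metrics named above can be computed by hand as in this sketch. The function name regression_metrics and the four data points are hypothetical; the example predicts the mean for every row to show why an R² near 0 means the model does no better than a constant.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and R² for two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return {"MSE": mse, "RMSE": math.sqrt(mse), "R2": 1.0 - ss_res / ss_tot}

y_true = [4.0, 8.0, 12.0, 16.0]
mean_only = [10.0] * 4  # a "model" that always predicts the mean of y_true
metrics = regression_metrics(y_true, mean_only)
print(metrics)
```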
Decision Tree Classification
Evaluation Metrics: Used precision, recall, and F1-score.
Results: See this page.
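The classification metrics above follow directly from the counts of true positives, false positives, and false negatives. The sketch below uses a hypothetical helper and made-up labels, with 1 standing for a high uninsured rate.

```python
def binary_classification_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 1 = high uninsured rate, 0 = low.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = binary_classification_metrics(y_true, y_pred)
print(precision, recall, f1)
```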
Apriori Analysis
Evaluation Metrics: Support, Confidence, Lift (used to rank rules).
Results: See this page.
Conclusion
Best Model for Comparing Colorado and Utah
K-Means Clustering was the most effective model for comparing Colorado and Utah.
It grouped ZIP codes based on Uninsured_Rate and Total_Population, revealing clear spatial patterns and disparities.
The high silhouette score indicated strong, well-separated clusters, which makes the result well suited to understanding regional healthcare access.
Other Models
Linear & Ridge Regression: Very low R² (~0); population did not explain uninsured rate.
Decision Tree Classification: Moderate performance after tuning, but limited by use of only one feature.
Apriori Analysis: Provided interpretable rules on co-occurring attributes (e.g., high uninsured + small population), useful for pattern exploration, not prediction.
Conclusion
K-Means best supported the goal of identifying differences between Colorado and Utah.
Other models added interpretive or predictive insights but were limited by data structure or weak signal strength.
Data Formatting for Models
K-Means Clustering
Data formatting: Selected and standardized Uninsured_Rate & Total_Population.
Before-and-after data transformation snapshots: See this page.
Regression (Linear & Ridge)
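Data formatting: Dropped rows with missing values using .dropna(); features were not standardized (with a single feature this does not change the Linear Regression fit, though the Ridge penalty is sensitive to feature scale).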
Before-and-after data transformation snapshots: See this page.
Decision Tree Classification
Data formatting: Binned Uninsured_Rate; used Total_Population directly.
Before-and-after data transformation snapshots: See this page.
Apriori Analysis
Data formatting: Binned Uninsured_Rate & Total_Population, then one-hot encoded for Apriori.
Before-and-after data transformation snapshots: See this page.
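The binning and one-hot encoding used for the Apriori input could be done along these lines. This is a sketch with made-up values: the three quantile bins, the labels, and the use of pd.qcut and pd.get_dummies are assumptions, since the actual bin edges are not stated here.

```python
import pandas as pd

# Made-up ZIP-code rows for illustration.
df = pd.DataFrame({
    "Uninsured_Rate": [3.1, 4.8, 7.5, 9.9, 12.4, 18.0],
    "Total_Population": [42000, 35500, 18000, 9000, 2500, 800],
})

# Discretize each numeric column into three quantile-based bins (assumed).
binned = pd.DataFrame({
    "Uninsured": pd.qcut(df["Uninsured_Rate"], q=3, labels=["Low", "Medium", "High"]),
    "Population": pd.qcut(df["Total_Population"], q=3, labels=["Small", "Medium", "Large"]),
})

# One boolean column per (attribute, bin) pair, e.g. "Uninsured=High".
onehot = pd.get_dummies(binned, prefix_sep="=").astype(bool)
print(onehot)
```

Each row ends up with exactly two True values, one per attribute, which is the transaction format that frequent-itemset mining expects.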