Models Implemented
- Health Insurance Data -

Models Implemented
1. K-Means Clustering
  - Objective: Group ZIP codes with similar uninsured rates into clusters, to show which geographic areas have many uninsured residents and which have few.
  - Why chosen: Ideal for grouping unlabeled, numeric data (e.g., uninsured rate, population).
  - Assumptions: Numeric, standardized features; spherical, equally sized clusters; predefined k.
  - Hyperparameter tuning: Tested k=2–7; selected best k using silhouette score and Davies-Bouldin index.
  - Challenges & solutions:
    1. Missing data → used .dropna()
    2. Index mismatch → aligned using .loc[X.index]
    3. Scaling required → applied StandardScaler
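The k selection described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, and the names `X`, `scores`, and `best_k` are assumptions made for the example. It standardizes two features, fits K-Means for k=2 to 7, and records both metrics for each k.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ZIP-level table (hypothetical values, not project data).
# Columns: uninsured rate, total population.
rng = np.random.default_rng(0)
centers = [(0.05, 30000.0), (0.15, 8000.0), (0.30, 1500.0)]
X = np.vstack([
    np.column_stack([
        rng.normal(rate, 0.01, 60),
        rng.normal(pop, pop * 0.1, 60),
    ])
    for rate, pop in centers
])

# K-Means is distance-based, so both features are put on the same scale first.
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 8):  # k = 2..7, as in the text
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = {
        "silhouette": silhouette_score(X_scaled, labels),          # higher is better
        "davies_bouldin": davies_bouldin_score(X_scaled, labels),  # lower is better
    }

# Here k is picked by silhouette alone; the two metrics can disagree,
# in which case the choice is a judgment call.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
for k, s in scores.items():
    print(k, round(s["silhouette"], 3), round(s["davies_bouldin"], 3))
print("best k by silhouette:", best_k)
```

With real data, rows containing missing values would need to be dropped before scaling, as noted in the challenges above.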
2. Regression (Linear & Ridge)
  - Objective: Test whether the uninsured rate (Uninsured_Rate) of each ZIP code can be predicted from its total population (Total_Population).
  - Why chosen: To test if Total_Population can explain Uninsured_Rate using a simple numeric model.
  - Assumptions: Linearity, homoscedasticity, normally distributed residuals; Ridge adds L2 regularization.
  - Hyperparameter tuning: Ridge alpha tuned via GridSearchCV; the best alpha was 1000, with no performance gain over plain linear regression.
  - Challenges & solutions:
    1. Very low R² → Ridge applied, but still ineffective
    2. Only one feature → suggested adding more variables
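A rough sketch of the Ridge tuning step is shown here. The data is synthetic and deliberately built so that population carries no signal, and the alpha grid is an assumption (the source only reports that the best alpha was 1000). The point is to show how GridSearchCV is wired up and why Ridge cannot help when the single feature has no relationship to the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (hypothetical): uninsured rate is independent of population.
rng = np.random.default_rng(0)
population = rng.uniform(500, 50000, size=300).reshape(-1, 1)
uninsured_rate = rng.normal(0.10, 0.04, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    population, uninsured_rate, test_size=0.25, random_state=0
)

# Baseline: ordinary least squares.
linear = LinearRegression().fit(X_train, y_train)

# Ridge with alpha chosen by 5-fold cross-validation (grid values are assumed).
param_grid = {"alpha": [0.01, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

r2_linear = r2_score(y_test, linear.predict(X_test))
r2_ridge = r2_score(y_test, search.best_estimator_.predict(X_test))

print("best alpha:", search.best_params_["alpha"])
print("R2 linear:", round(r2_linear, 4))
print("R2 ridge: ", round(r2_ridge, 4))
```

Regularization shrinks coefficients to reduce variance; it does not create signal, so a near-zero R² stays near zero regardless of alpha.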
3. Decision Tree Classification
  - Objective: Predict whether the uninsured rate (Uninsured_Rate) of a ZIP code is high or low from its total population (Total_Population).
  - Why chosen: Suitable for small, tabular data; interpretable and handles numeric input directly.
  - Assumptions: No assumptions on feature distribution or linearity.
  - Hyperparameter tuning: max_depth and min_samples_split tuned via GridSearchCV (5-fold, F1-score).
  - Challenges & solutions:
    1. Continuous target → created binary target with KBinsDiscretizer
    2. Overfitting risk → tuned tree depth
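The binning and tuning steps above might look roughly like this. The data is synthetic, the two-bin quantile split is an assumption about how "high" and "low" were defined, and the grid values are illustrative; only the tuned parameters, the 5-fold setting, and the F1 scoring come from the text.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): rate falls slightly as population grows.
rng = np.random.default_rng(0)
population = rng.uniform(500, 50000, size=300).reshape(-1, 1)
noise = rng.normal(0, 0.03, size=300)
uninsured_rate = np.clip(0.20 - 0.000003 * population[:, 0] + noise, 0, 1)

# Turn the continuous rate into a binary target: 0 = low, 1 = high.
# strategy="quantile" with two bins splits at the median (assumed choice).
binner = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile")
y = binner.fit_transform(uninsured_rate.reshape(-1, 1)).ravel().astype(int)

# Tune depth and split size with 5-fold CV, scored by F1 (grid values assumed).
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 5, 10, 20],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="f1"
)
search.fit(population, y)

print("best params:", search.best_params_)
print("best CV F1:", round(search.best_score_, 3))
```

Limiting `max_depth` and raising `min_samples_split` both stop the tree from carving out tiny population ranges that only fit noise.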
4. Apriori Analysis
  - Objective: Extract the most frequent combinations of uninsured-rate and total-population levels across ZIP codes.
  - Why chosen: Best for discovering frequent patterns in binned insurance and population data.
  - Assumptions: One-hot encoded categorical inputs; relies on the Apriori property that every subset of a frequent itemset is also frequent.
  - Hyperparameter tuning: Tuned min_support and min_lift to balance rule quantity and strength.
  - Challenges & solutions:
    1. Numeric input → discretized into bins
    2. Too few/many rules → tuned thresholds
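To make the role of min_support and min_lift concrete, here is a small plain-Python sketch of an Apriori-style search. It is not the library code the project used; the records, the item labels, the threshold values, and the helper names `support`, `frequent_itemsets`, and `rules` are all hypothetical.

```python
from itertools import combinations

# Hypothetical, already-binned ZIP records (illustrative only, not project data).
records = [
    {"uninsured=high", "population=small"},
    {"uninsured=high", "population=small"},
    {"uninsured=high", "population=large"},
    {"uninsured=low", "population=large"},
    {"uninsured=low", "population=large"},
    {"uninsured=low", "population=small"},
    {"uninsured=high", "population=small"},
    {"uninsured=low", "population=large"},
]

def support(itemset, rows):
    """Fraction of rows that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= row for row in rows) / len(rows)

def frequent_itemsets(rows, min_support):
    """Level-wise search: candidates are built only from frequent itemsets."""
    items = sorted({item for row in rows for item in row})
    current = [frozenset([i]) for i in items if support([i], rows) >= min_support]
    found = {}
    while current:
        for itemset in current:
            found[itemset] = support(itemset, rows)
        size = len(current[0]) + 1
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune: keep a candidate only if all of its smaller subsets are frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in found for s in combinations(c, size - 1))
            and support(c, rows) >= min_support
        ]
    return found

def rules(found, min_lift):
    """Build antecedent -> consequent rules and keep those with enough lift."""
    out = []
    for itemset, sup in found.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                confidence = sup / found[antecedent]
                lift = confidence / found[consequent]
                if lift >= min_lift:
                    out.append((antecedent, consequent, sup, confidence, lift))
    return out

found = frequent_itemsets(records, min_support=0.25)
rule_list = rules(found, min_lift=1.2)
for antecedent, consequent, sup, confidence, lift in rule_list:
    print(sorted(antecedent), "->", sorted(consequent), sup, confidence, lift)
```

Raising min_support removes rare combinations before any rule is formed, while raising min_lift removes rules whose items co-occur no more often than chance would predict.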

Performance Evaluation
1. K-Means Clustering
  - Evaluation Metrics:
    1. Silhouette Score (↑ better separation)
    2. Davies-Bouldin Index (↓ less overlap)
  - Results: See this page.
2. Regression (Linear & Ridge)
  - Evaluation Metrics: Used RMSE, MSE, and R² (all showed weak model fit).
  - Results: See this page.
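For reference, the standard definitions of the three regression metrics are given here, with $y_i$ the observed uninsured rate, $\hat{y}_i$ the prediction, $\bar{y}$ the mean observed rate, and $n$ the number of ZIP codes (the symbols are chosen for this note and do not appear in the original).

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

An R² near 0 means the model predicts no better than always guessing the mean rate $\bar{y}$.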
3. Decision Tree Classification
  - Evaluation Metrics: Used precision, recall, and F1-score.
  - Results: See this page.
4. Apriori Analysis
  - Evaluation Metrics: Support, Confidence, Lift (used to rank rules).
  - Results: See this page.
5. Conclusion
  - Best Model for Comparing Colorado and Utah

K-Means Clustering was the most effective model for comparing Colorado and Utah.
- It grouped ZIP codes based on Uninsured_Rate and Total_Population, revealing clear spatial patterns and disparities.
- The high silhouette score indicated strong, well-separated clusters, which is ideal for understanding regional healthcare access.

- Other Models
  - Linear & Ridge Regression: Very low R² (~0); population did not explain uninsured rate.
  - Decision Tree Classification: Moderate performance after tuning, but limited by use of only one feature.
  - Apriori Analysis: Provided interpretable rules on co-occurring attributes (e.g., high uninsured + small population), useful for pattern exploration, not prediction.
- Conclusion

K-Means best supported the goal of identifying differences between Colorado and Utah.
The other models added interpretive or predictive insights but were limited by the data structure or by weak signal strength.

Data Formatting for Models
1. K-Means Clustering
  - Data Formatting: Selected and standardized Uninsured_Rate & Total_Population.
  - Before-and-after data transformation snapshots: See this page.
2. Regression (Linear & Ridge)
  - Data formatting: Used .dropna(); standardization not needed.
  - Before-and-after data transformation snapshots: See this page.
3. Decision Tree Classification
  - Data formatting: Binned Uninsured_Rate; used Total_Population directly.
  - Before-and-after data transformation snapshots: See this page.
4. Apriori Analysis
  - Data formatting: Binned Uninsured_Rate & Total_Population, then one-hot encoded for Apriori.
  - Before-and-after data transformation snapshots: See this page.
