Models Implemented
KMeans Clustering
Objective: We perform pattern discovery on the distribution of hospitals by ZIP code in Colorado (CO) and Utah (UT) using KMeans clustering.
Why chosen: Suitable for uncovering structure in unlabeled ZIP-level hospital data (CO & UT).
Assumptions: Numeric, standardized data; spherical and equally-sized clusters; predefined k.
Hyperparameter tuning: Tried k = 2–10; selected k = 5 using silhouette score (best score = 0.904).
Challenges & solutions:
ZIP codes carry no inherent spatial meaning → applied PCA to the ZIP, state, and hospital-count features
KMeans memory-leak warning with MKL on Windows → set the OMP_NUM_THREADS environment variable to 1 and suppressed the warning
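The k-selection loop described above can be sketched as follows. The data here is a synthetic stand-in (the real features are PCA components built from ZIP, state, and hospital count), so the printed k and score are illustrative only:

```python
# Sketch of the k = 2–10 silhouette sweep; X is synthetic stand-in data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for the PCA features: five loose blobs in 2D
X = rng.normal(size=(200, 2)) + np.repeat(np.arange(5), 40)[:, None] * 4.0
X = StandardScaler().fit_transform(X)  # KMeans assumes standardized features

best_k, best_score = None, -1.0
for k in range(2, 11):  # tried k = 2–10
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))
```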
Regression (Linear & Ridge)
Objective: We test whether the number of hospitals (Hospital_Count) in each ZIP code can be predicted from the ZIP code and state.
Why chosen: Suitable for small, numeric, tabular data; Ridge helps prevent overfitting.
Assumptions: Linearity, numeric input, homoscedasticity; Ridge assumes regularization improves generalization.
Hyperparameter tuning: GridSearchCV used for Ridge (alpha = 1000).
Challenges & solutions:
Low R² → added Ridge regularization
Limited features → used ZIP (numeric) + state (one-hot)
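A minimal sketch of this feature setup and the Ridge grid search, using hypothetical stand-in rows (column names `zip` and `state` are illustrative, not the dataset's actual schema):

```python
# Sketch: numeric ZIP + one-hot state, Ridge alpha tuned with GridSearchCV.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "zip": rng.integers(80000, 84800, size=300),   # numeric ZIP feature
    "state": rng.choice(["CO", "UT"], size=300),   # categorical state
})
X = pd.get_dummies(df, columns=["state"])          # one-hot encode state
y = rng.poisson(2, size=300)                       # stand-in Hospital_Count

grid = GridSearchCV(Ridge(), {"alpha": [0.1, 1, 10, 100, 1000]},
                    cv=5, scoring="r2")
grid.fit(X, y)
print(grid.best_params_)
```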
Apriori Analysis
Objective: What are the most common attribute patterns among the characteristics (type and ownership structure) of hospitals?
Why chosen: Ideal for discovering frequent patterns in categorical data (Hospital Type, Ownership).
Assumptions: Requires binary, one-hot encoded transactions; relies on the Apriori property that every subset of a frequent itemset is itself frequent (so itemsets containing an infrequent subset can be pruned).
Hyperparameter tuning: Grid search on min_support and lift to balance rule quantity and relevance.
Challenges & solutions:
Raw data not suitable → applied TransactionEncoder
Too many weak rules → filtered with support/lift thresholds
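The transaction encoding and the support/confidence/lift computations behind the rule filtering can be illustrated with pandas alone (the actual pipeline uses mlxtend's TransactionEncoder and apriori; the hospital records below are hypothetical):

```python
# Sketch: one-hot transaction matrix plus support/confidence/lift for one rule.
import pandas as pd

transactions = [
    ["Acute Care", "Voluntary non-profit"],
    ["Acute Care", "Proprietary"],
    ["Critical Access", "Government"],
    ["Acute Care", "Voluntary non-profit"],
]
# One-hot "transaction" matrix, the same shape TransactionEncoder produces
onehot = (pd.DataFrame([{item: True for item in t} for t in transactions])
            .fillna(False).astype(bool))

# Rule {Acute Care} -> {Voluntary non-profit}
sup_a = onehot["Acute Care"].mean()                               # 0.75
sup_b = onehot["Voluntary non-profit"].mean()                     # 0.50
sup_ab = (onehot["Acute Care"] & onehot["Voluntary non-profit"]).mean()
confidence = sup_ab / sup_a
lift = confidence / sup_b   # > 1 means items co-occur more than by chance
print(round(sup_ab, 2), round(confidence, 2), round(lift, 2))
```

Filtering then keeps only rules above chosen support and lift thresholds, which is what prunes the weak rules mentioned above.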
Random Forest
Objective: Predict a hospital's ownership class from its other attributes.
Why chosen: Handles mixed data types and imbalanced classes well; robust and non-linear.
Assumptions: Non-parametric; no assumptions on data distribution or linearity.
Hyperparameter tuning: GridSearchCV on n_estimators, max_depth, min_samples_split, with class_weight='balanced'.
Challenges & solutions:
Imbalanced classes → used class_weight='balanced'
Missing/invalid values → cleaned with pd.to_numeric and dropna
Classes absent from the test split → passed labels= and zero_division=0 to classification_report
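The tuning and the two guards above can be sketched on hypothetical imbalanced stand-in data (the real features are one-hot hospital attributes):

```python
# Sketch: GridSearchCV over a balanced-class Random Forest, with
# labels= and zero_division=0 guarding the classification report.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.choice([0, 0, 0, 1], size=300)  # imbalanced stand-in target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
grid = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    {"n_estimators": [50, 100], "max_depth": [3, None],
     "min_samples_split": [2, 5]},
    cv=3,
)
grid.fit(X_tr, y_tr)
# labels= fixes the class set; zero_division=0 avoids warnings when a
# class never appears in the predictions or the test split
report = classification_report(y_te, grid.predict(X_te),
                               labels=[0, 1], zero_division=0)
print(report)
```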
Performance Evaluation
KMeans Clustering
Evaluation Metrics: Silhouette Score and Davies-Bouldin Index are used.
Results: See this page.
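As a sketch, both metrics are one call each in scikit-learn; the toy blobs below stand in for the real clustered features:

```python
# Sketch: Silhouette Score and Davies-Bouldin Index on a fitted clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 6, 12)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)      # higher is better, in [-1, 1]
dbi = davies_bouldin_score(X, labels)  # lower is better, >= 0
```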
Regression (Linear & Ridge)
Evaluation Metrics: RMSE, MSE, and R² are used.
Results: See this page.
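A minimal sketch of the three metrics on hypothetical predictions (RMSE is the square root of MSE, so only MSE and R² need library calls):

```python
# Sketch: MSE, RMSE, and R² for a regression model's predictions.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([2, 0, 3, 1, 4])  # hypothetical hospital counts
y_pred = np.array([2, 1, 3, 0, 4])  # hypothetical model output

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
print(mse, round(rmse, 3), round(r2, 3))
```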
Apriori Analysis
Evaluation Metrics: Rules assessed using Support, Confidence, and Lift; sorted by Lift.
Results: See this page.
Random Forest
Evaluation Metrics: Precision, Recall, and F1-score (macro & weighted) are used.
Results: See this page.
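The difference between the macro and weighted averages can be sketched on hypothetical labels: macro averages each class's score equally, while weighted averaging scales by class frequency, which matters under the class imbalance noted above:

```python
# Sketch: macro vs. weighted precision/recall/F1 on an imbalanced toy set.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0, 0, 1, 1]  # class 0 is the majority
y_pred = [0, 0, 0, 1, 1, 0]

p_m, r_m, f_m, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
p_w, r_w, f_w, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
# Weighted F1 exceeds macro F1 here because the majority class scores higher.
```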
Conclusion
Best Model for Comparing Colorado and Utah
KMeans Clustering is best aligned with the goal.
→ It identifies geographic patterns in hospital distribution across ZIP codes, enabling structural comparison of facility availability between the two states.
Other Models
Linear/Ridge Regression: Poor predictive power; not useful for comparison.
Apriori: Reveals ownership/type patterns, but not suitable for state-level comparison.
Random Forest: Good for prediction, but focuses on ownership classification, not availability.
Conclusion
KMeans provides the most relevant insights for comparing the distribution and concentration of hospitals in Colorado and Utah.
Data Formatting for Models
KMeans Clustering
Data Formatting: Features are standardized (zero mean, unit variance), since KMeans is distance-based and sensitive to feature scale.
Before-and-after data transformation snapshots: See this page.
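A minimal before/after scaling snapshot, using a hypothetical hospital-count column:

```python
# Sketch: standardizing a single numeric column to zero mean, unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])   # raw counts ("before")
X_scaled = StandardScaler().fit_transform(X)  # standardized ("after")
```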
Regression (Linear & Ridge)
Data formatting: Categorical variables one-hot encoded; no scaling required for these models.
Before-and-after data transformation snapshots: See this page.
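A minimal before/after one-hot encoding snapshot (the `zip` and `state` column names are illustrative):

```python
# Sketch: one-hot encoding the categorical state column with pandas.
import pandas as pd

df = pd.DataFrame({"zip": [80202, 84101], "state": ["CO", "UT"]})  # before
encoded = pd.get_dummies(df, columns=["state"])                    # after
# encoded columns: zip, state_CO, state_UT
```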
Apriori Analysis
Data formatting: Converted attributes to one-hot transactions for Apriori compatibility.
Before-and-after data transformation snapshots: See this page.
Random Forest
Data formatting: One-hot encoding for categorical features; label encoding for the target.
Before-and-after data transformation snapshots: See this page.
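A minimal label-encoding snapshot for the target; the ownership class names below are illustrative:

```python
# Sketch: mapping categorical target classes to integer labels.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(
    ["Government", "Proprietary", "Voluntary non-profit", "Government"]
)
# le.classes_ keeps the sorted class names for mapping labels back
```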