Models Implemented
KMeans Clustering
Objective: We perform pattern discovery on the distribution of hospitals by ZIP code in Colorado (CO) and Utah (UT) using KMeans clustering.
Why chosen: Suitable for uncovering structure in unlabeled ZIP-level hospital data (CO & UT).
Assumptions: Numeric, standardized data; spherical and equally-sized clusters; predefined k.
Hyperparameter tuning: Tried k = 2–10; selected k = 5 using silhouette score (best score = 0.904).
Challenges & solutions:
ZIP codes carry no inherent spatial meaning → applied PCA to the ZIP, state, and hospital-count features
KMeans memory-leak warning with MKL on Windows → set the OMP_NUM_THREADS environment variable to 1 and suppressed the warning
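The k-selection loop described above can be sketched as follows. The data here is a synthetic stand-in (the real features are PCA components built from ZIP, state, and hospital count), so the printed k and score are illustrative only:

```python
# Sketch of the k = 2–10 silhouette sweep; X is synthetic stand-in data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for the PCA features: five loose blobs in 2D
X = rng.normal(size=(200, 2)) + np.repeat(np.arange(5), 40)[:, None] * 4.0
X = StandardScaler().fit_transform(X)  # KMeans assumes standardized features

best_k, best_score = None, -1.0
for k in range(2, 11):  # tried k = 2–10
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))
```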
Regression (Linear & Ridge)
Objective: We test whether the number of hospitals (Hospital_Count) in each ZIP code can be predicted from the ZIP code and state.
Why chosen: Suitable for small, numeric, tabular data; Ridge helps prevent overfitting.
Assumptions: Linearity, numeric input, homoscedasticity; Ridge assumes regularization improves generalization.
Hyperparameter tuning: GridSearchCV used for Ridge (alpha = 1000).
Challenges & solutions:
Low R² → added Ridge regularization
Limited features → used ZIP (numeric) + state (one-hot)
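A minimal sketch of this feature setup and the Ridge grid search, using hypothetical stand-in rows (column names `zip` and `state` are illustrative, not the dataset's actual schema):

```python
# Sketch: numeric ZIP + one-hot state, Ridge alpha tuned with GridSearchCV.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "zip": rng.integers(80000, 84800, size=300),   # numeric ZIP feature
    "state": rng.choice(["CO", "UT"], size=300),   # categorical state
})
X = pd.get_dummies(df, columns=["state"])          # one-hot encode state
y = rng.poisson(2, size=300)                       # stand-in Hospital_Count

grid = GridSearchCV(Ridge(), {"alpha": [0.1, 1, 10, 100, 1000]},
                    cv=5, scoring="r2")
grid.fit(X, y)
print(grid.best_params_)
```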
Apriori Analysis
Objective: What are the most common attribute patterns among the characteristics (type and ownership structure) of hospitals?
Why chosen: Ideal for discovering frequent patterns in categorical data (Hospital Type, Ownership).
Assumptions: Requires binary, one-hot encoded transactions; relies on the Apriori property that every subset of a frequent itemset is itself frequent (so itemsets containing an infrequent subset can be pruned).
Hyperparameter tuning: Grid search on min_support and lift to balance rule quantity and relevance.
Challenges & solutions:
Raw data not suitable → applied TransactionEncoder
Too many weak rules → filtered with support/lift thresholds
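The transaction encoding and the support/confidence/lift computations behind the rule filtering can be illustrated with pandas alone (the actual pipeline uses mlxtend's TransactionEncoder and apriori; the hospital records below are hypothetical):

```python
# Sketch: one-hot transaction matrix plus support/confidence/lift for one rule.
import pandas as pd

transactions = [
    ["Acute Care", "Voluntary non-profit"],
    ["Acute Care", "Proprietary"],
    ["Critical Access", "Government"],
    ["Acute Care", "Voluntary non-profit"],
]
# One-hot "transaction" matrix, the same shape TransactionEncoder produces
onehot = (pd.DataFrame([{item: True for item in t} for t in transactions])
            .fillna(False).astype(bool))

# Rule {Acute Care} -> {Voluntary non-profit}
sup_a = onehot["Acute Care"].mean()                               # 0.75
sup_b = onehot["Voluntary non-profit"].mean()                     # 0.50
sup_ab = (onehot["Acute Care"] & onehot["Voluntary non-profit"]).mean()
confidence = sup_ab / sup_a
lift = confidence / sup_b   # > 1 means items co-occur more than by chance
print(round(sup_ab, 2), round(confidence, 2), round(lift, 2))
```

Filtering then keeps only rules above chosen support and lift thresholds, which is what prunes the weak rules mentioned above.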
Random Forest
Objective: Predict a hospital's ownership class from its other attributes.
Why chosen: Handles mixed data types and imbalanced classes well; robust and non-linear.
Assumptions: Non-parametric; no assumptions on data distribution or linearity.
Hyperparameter tuning: GridSearchCV on n_estimators, max_depth, min_samples_split, with class_weight='balanced'.
Challenges & solutions:
Imbalanced classes → used class_weight='balanced'
Missing/invalid values → cleaned with pd.to_numeric and dropna
Classes absent from the test split → passed labels= and zero_division=0 to classification_report
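The tuning and the two guards above can be sketched on hypothetical imbalanced stand-in data (the real features are one-hot hospital attributes):

```python
# Sketch: GridSearchCV over a balanced-class Random Forest, with
# labels= and zero_division=0 guarding the classification report.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.choice([0, 0, 0, 1], size=300)  # imbalanced stand-in target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
grid = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    {"n_estimators": [50, 100], "max_depth": [3, None],
     "min_samples_split": [2, 5]},
    cv=3,
)
grid.fit(X_tr, y_tr)
# labels= fixes the class set; zero_division=0 avoids warnings when a
# class never appears in the predictions or the test split
report = classification_report(y_te, grid.predict(X_te),
                               labels=[0, 1], zero_division=0)
print(report)
```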
Performance Evaluation
KMeans Clustering
Evaluation Metrics: Silhouette Score and Davies-Bouldin Index are used.
Results: See this page.
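As a sketch, both metrics are one call each in scikit-learn; the toy blobs below stand in for the real clustered features:

```python
# Sketch: Silhouette Score and Davies-Bouldin Index on a fitted clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 6, 12)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)      # higher is better, in [-1, 1]
dbi = davies_bouldin_score(X, labels)  # lower is better, >= 0
```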
Regression (Linear & Ridge)
Evaluation Metrics: RMSE, MSE, and R² are used.
Results: See this page.
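A minimal sketch of the three metrics on hypothetical predictions (RMSE is the square root of MSE, so only MSE and R² need library calls):

```python
# Sketch: MSE, RMSE, and R² for a regression model's predictions.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([2, 0, 3, 1, 4])  # hypothetical hospital counts
y_pred = np.array([2, 1, 3, 0, 4])  # hypothetical model output

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
print(mse, round(rmse, 3), round(r2, 3))
```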
Apriori Analysis
Evaluation Metrics: Rules assessed using Support, Confidence, and Lift; sorted by Lift.
Results: See this page.
Random Forest
Evaluation Metrics: Precision, Recall, and F1-score (macro & weighted) are used.
Results: See this page.
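The difference between the macro and weighted averages can be sketched on hypothetical labels: macro averages each class's score equally, while weighted averaging scales by class frequency, which matters under the class imbalance noted above:

```python
# Sketch: macro vs. weighted precision/recall/F1 on an imbalanced toy set.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0, 0, 1, 1]  # class 0 is the majority
y_pred = [0, 0, 0, 1, 1, 0]

p_m, r_m, f_m, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
p_w, r_w, f_w, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
# Weighted F1 exceeds macro F1 here because the majority class scores higher.
```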
Conclusion
Best Model for Comparing Colorado and Utah
KMeans Clustering is best aligned with the goal.
→ It identifies geographic patterns in hospital distribution across ZIP codes, enabling structural comparison of facility availability between the two states.
Other Models
Linear/Ridge Regression: Poor predictive power; not useful for comparison.
Apriori: Reveals ownership/type patterns, but not suitable for state-level comparison.
Random Forest: Good for prediction, but focuses on ownership classification, not availability.
Conclusion
KMeans provides the most relevant insights for comparing the distribution and concentration of hospitals in Colorado and Utah.
Data Formatting for Models
KMeans Clustering
Data Formatting: Features are standardized (zero mean, unit variance), since KMeans is distance-based and sensitive to feature scale.
Before-and-after data transformation snapshots: See this page.
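A minimal before/after scaling snapshot, using a hypothetical hospital-count column:

```python
# Sketch: standardizing a single numeric column to zero mean, unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])   # raw counts ("before")
X_scaled = StandardScaler().fit_transform(X)  # standardized ("after")
```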
Regression (Linear & Ridge)
Data formatting: Categorical variables one-hot encoded; no scaling required for these models.
Before-and-after data transformation snapshots: See this page.
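A minimal before/after one-hot encoding snapshot (the `zip` and `state` column names are illustrative):

```python
# Sketch: one-hot encoding the categorical state column with pandas.
import pandas as pd

df = pd.DataFrame({"zip": [80202, 84101], "state": ["CO", "UT"]})  # before
encoded = pd.get_dummies(df, columns=["state"])                    # after
# encoded columns: zip, state_CO, state_UT
```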
Apriori Analysis
Data formatting: Converted attributes to one-hot transactions for Apriori compatibility.
Before-and-after data transformation snapshots: See this page.
Random Forest
Data formatting: One-hot encoding for categorical features; label encoding for the target.
Before-and-after data transformation snapshots: See this page.
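A minimal label-encoding snapshot for the target; the ownership class names below are illustrative:

```python
# Sketch: mapping categorical target classes to integer labels.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(
    ["Government", "Proprietary", "Voluntary non-profit", "Government"]
)
# le.classes_ keeps the sorted class names for mapping labels back
```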