Models Implemented
K-Means Clustering
Objective: To identify and group parks with similar geographic characteristics (latitude and longitude) across Colorado and Utah using K-Means clustering. This helps uncover regional patterns, supports spatial planning, and may inform resource allocation or facility development based on cluster similarity.
Why chosen: Ideal for grouping parks by geographic coordinates (latitude, longitude) without labels.
Assumptions: Assumes roughly spherical, similarly sized clusters; distances are Euclidean, so features should be on comparable scales.
Hyperparameter tuning: Tuned k over 2–9 using the Silhouette Score (higher is better) and the Davies-Bouldin Index (lower is better).
Challenges & solutions:
MKL memory warning → resolved by setting OMP_NUM_THREADS=1.
Optimal k chosen using internal clustering metrics.
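The tuning loop above can be sketched as follows. The coordinates here are randomly generated stand-ins for park latitude/longitude, not the project's data, and the OMP_NUM_THREADS setting mirrors the MKL-warning fix noted above:

```python
import os
os.environ["OMP_NUM_THREADS"] = "1"  # works around the MKL memory warning

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Illustrative lat/long pairs: one CO-like blob, one UT-like blob
coords = np.vstack([
    rng.normal([39.0, -105.5], 0.5, size=(20, 2)),
    rng.normal([38.5, -111.0], 0.5, size=(20, 2)),
])
X = StandardScaler().fit_transform(coords)

scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))

# Choose the k with the highest silhouette score
best_k = max(scores, key=lambda k: scores[k][0])
print(best_k, scores[best_k])
```

In practice both metrics are inspected together; here the silhouette score alone picks k.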
Decision Tree Classifier
Objective: To classify the state (Colorado or Utah) of each park based on its features, such as location and other attributes. This helps assess whether parks in the two states have distinguishable characteristics.
Why chosen: Handles both numeric (latitude, longitude) and categorical (designation) features with minimal preprocessing and interpretable rules.
Assumptions: Non-parametric; no assumptions about linearity or feature distribution.
Hyperparameter tuning: Tuned max_depth, min_samples_split, min_samples_leaf, and criterion using GridSearchCV with macro F1-score.
Challenges & solutions: Some planned features were missing, so designation was used in their place; categorical features were one-hot encoded.
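The grid search described above can be sketched as follows, using synthetic coordinates as stand-ins for the real features (the parameter grid values are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Illustrative lat/long features for two separable "states"
X = np.vstack([
    rng.normal([39.0, -105.5], 0.5, size=(30, 2)),
    rng.normal([38.5, -111.0], 0.5, size=(30, 2)),
])
y = np.array(["CO"] * 30 + ["UT"] * 30)

param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 3],
    "criterion": ["gini", "entropy"],
}
# Macro F1 weights both states equally despite any class imbalance
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring="f1_macro", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```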
Apriori Analysis
Objective: To discover frequent co-occurrence patterns among categorical features (e.g., park designations and state), using the Apriori algorithm. This helps identify which types of parks tend to occur together in the same state, revealing potential policy or geographic patterns.
Why chosen: Ideal for finding frequent patterns in categorical data like designation and State.
Assumptions: No distributional assumptions; relies on frequent itemsets.
Hyperparameter tuning: Tuned min_support (0.01–0.07) and min_confidence (0.5–0.9) to balance rule quality and quantity.
Challenges & solutions: Limited structured features → focused on designation and State as key items.
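The arithmetic behind the rule thresholds above (support, confidence, lift) can be sketched without an Apriori library on a handful of toy (designation, State) transactions; the item names and thresholds below are illustrative:

```python
from itertools import combinations

# Each transaction is the set of items for one park record
transactions = [
    {"National Park", "CO"},
    {"National Park", "UT"},
    {"National Monument", "CO"},
    {"National Monument", "CO"},
    {"National Park", "UT"},
    {"National Monument", "UT"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / n

min_support, min_confidence = 0.3, 0.5
items = sorted(set().union(*transactions))
rules = []
for a, b in combinations(items, 2):
    s = support({a, b})
    if s < min_support:
        continue  # infrequent pair: pruned, as in Apriori
    for ante, cons in [(a, b), (b, a)]:
        conf = s / support({ante})          # P(cons | ante)
        lift = conf / support({cons})       # > 1 means positive association
        if conf >= min_confidence:
            rules.append((ante, cons, round(s, 2), round(conf, 2), round(lift, 2)))

for r in rules:
    print(r)
```

A library such as mlxtend applies the same definitions but prunes candidate itemsets level by level, which matters on larger item vocabularies.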
Logistic Regression
Objective: To predict whether a park belongs to Colorado or Utah based on features such as its location (latitude, longitude) and type (designation), using logistic regression. This allows us to assess how well basic park characteristics explain the state classification.
Why chosen: Suitable for binary classification (CO vs UT) with interpretable results.
Assumptions: Assumes a linear relationship between the features and the log-odds, independent observations, and low multicollinearity among features.
Hyperparameter tuning: Tuned C, penalty, and solver using GridSearchCV with macro F1-score.
Challenges & solutions:
The model initially predicted only one class, likely due to class imbalance and weak feature separation.
Set zero_division=0 so precision and recall remain defined when a class receives no predictions.
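A sketch of how the imbalance issue can be handled: class_weight="balanced" reweights the minority state, and zero_division=0 keeps the report defined if a class still gets no predictions. The features and class sizes below are illustrative stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)
# Imbalanced illustrative sample: 45 "CO" parks vs 5 "UT" parks
X = np.vstack([
    rng.normal([39.0, -105.5], 0.5, size=(45, 2)),
    rng.normal([38.5, -111.0], 0.5, size=(5, 2)),
])
y = np.array(["CO"] * 45 + ["UT"] * 5)

clf = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
clf.fit(X, y)
report = classification_report(y, clf.predict(X),
                               zero_division=0, output_dict=True)
print(round(report["macro avg"]["f1-score"], 3))
```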
Performance Evaluation
K-Means Clustering
Evaluation Metrics: Used the Silhouette Score and the Davies-Bouldin Index to assess cluster quality.
Results: See this page.
Decision Tree Classifier
Evaluation Metrics: Assessed with precision, recall, and macro F1-score for imbalanced binary classification.
Results: See this page.
Apriori Analysis
Evaluation Metrics: Used support, confidence, and lift; lift > 1.0 indicated meaningful associations.
Results: See this page.
Logistic Regression
Evaluation Metrics: Used precision, recall, and macro F1-score for imbalanced data.
Results: See this page.
Conclusion
Best Model for Comparing Colorado and Utah
Decision Tree Classification performed best.
Accurately predicted state using latitude, longitude, and designation.
Handled categorical data well and provided interpretable rules.
Other Models
KMeans Clustering:
Found spatial patterns, but as an unsupervised method it is descriptive rather than predictive.
Apriori Analysis:
Discovered co-occurrence patterns (e.g., park types in CO), but not suited for classification.
Logistic Regression:
Performed poorly due to class imbalance and limited feature separation.
Summary
Decision Tree was the most effective for classification.
Other models provided useful exploratory insights but lacked predictive power.
Data Formatting for Models
K-Means Clustering
Data Formatting: Standardized latitude and longitude; PCA used for visualization.
Before-and-after data transformation snapshots: See this page.
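The standardize-then-PCA step can be sketched as follows; the coordinate values are illustrative, not the project's real data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative (latitude, longitude) rows
coords = np.array([[39.0, -105.5], [38.5, -111.0],
                   [40.3, -105.7], [37.6, -112.2]])

scaled = StandardScaler().fit_transform(coords)  # each column: mean 0, std 1
pc = PCA(n_components=2).fit_transform(scaled)   # axes for a 2-D cluster plot

print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Standardizing first matters because K-Means uses Euclidean distance, and raw latitude and longitude have different spreads.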
Decision Tree Classifier
Data formatting: One-hot encoded designation; numeric features used as-is.
Before-and-after data transformation snapshots: See this page.
Apriori Analysis
Data formatting: Converted each record into a transaction and applied one-hot encoding.
Before-and-after data transformation snapshots: See this page.
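The transaction-encoding step can be sketched with pandas; the designation and State values below are illustrative:

```python
import pandas as pd

# Each park record becomes one transaction
parks = pd.DataFrame({
    "designation": ["National Park", "National Monument", "National Park"],
    "State": ["CO", "UT", "UT"],
})

# One-hot encoding yields the boolean item matrix Apriori expects
onehot = pd.get_dummies(parks, prefix_sep="=").astype(bool)
print(onehot.columns.tolist())
```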
Logistic Regression
Data formatting: One-hot encoded designation; used latitude and longitude directly.
Before-and-after data transformation snapshots: See this page.