Models Implemented
K-Means Clustering
Objective: To identify and group parks with similar geographic characteristics (latitude and longitude) across Colorado and Utah using K-Means clustering. This helps uncover regional patterns, supports spatial planning, and may inform resource allocation or facility development based on cluster similarity.
Why chosen: Ideal for grouping parks by geographic coordinates (latitude, longitude) without labels.
Assumptions: Assumes roughly spherical, similarly sized clusters; distances are Euclidean, so features should be on comparable scales.
Hyperparameter tuning: Tuned k over 2–9 using the Silhouette Score (higher is better) and the Davies-Bouldin Index (lower is better).
Challenges & solutions:
MKL memory warning → resolved by setting OMP_NUM_THREADS=1.
Optimal k chosen using internal clustering metrics.
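The tuning loop above can be sketched as follows. The coordinates here are randomly generated stand-ins for park latitude/longitude, not the project's data, and the OMP_NUM_THREADS setting mirrors the MKL-warning fix noted above:

```python
import os
os.environ["OMP_NUM_THREADS"] = "1"  # works around the MKL memory warning

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Illustrative lat/long pairs: one CO-like blob, one UT-like blob
coords = np.vstack([
    rng.normal([39.0, -105.5], 0.5, size=(20, 2)),
    rng.normal([38.5, -111.0], 0.5, size=(20, 2)),
])
X = StandardScaler().fit_transform(coords)

scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = (silhouette_score(X, labels), davies_bouldin_score(X, labels))

# Choose the k with the highest silhouette score
best_k = max(scores, key=lambda k: scores[k][0])
print(best_k, scores[best_k])
```

In practice both metrics are inspected together; here the silhouette score alone picks k.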
Decision Tree Classifier
Objective: To classify the state (Colorado or Utah) of each park based on its features, such as location and other attributes. This helps assess whether parks in the two states have distinguishable characteristics.
Why chosen: Handles both numeric (latitude, longitude) and categorical (designation) features with minimal preprocessing and interpretable rules.
Assumptions: Non-parametric; no assumptions about linearity or feature distribution.
Hyperparameter tuning: Tuned max_depth, min_samples_split, min_samples_leaf, and criterion using GridSearchCV with macro F1-score.
Challenges & solutions: Some planned features were missing, so designation was used in their place; categorical features were one-hot encoded.
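The grid search described above can be sketched as follows, using synthetic coordinates as stand-ins for the real features (the parameter grid values are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Illustrative lat/long features for two separable "states"
X = np.vstack([
    rng.normal([39.0, -105.5], 0.5, size=(30, 2)),
    rng.normal([38.5, -111.0], 0.5, size=(30, 2)),
])
y = np.array(["CO"] * 30 + ["UT"] * 30)

param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 3],
    "criterion": ["gini", "entropy"],
}
# Macro F1 weights both states equally despite any class imbalance
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring="f1_macro", cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```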
Apriori Analysis
Objective: To discover frequent co-occurrence patterns among categorical features (e.g., park designations and state), using the Apriori algorithm. This helps identify which types of parks tend to occur together in the same state, revealing potential policy or geographic patterns.
Why chosen: Ideal for finding frequent patterns in categorical data like designation and State.
Assumptions: No distributional assumptions; relies on frequent itemsets.
Hyperparameter tuning: Tuned min_support (0.01–0.07) and min_confidence (0.5–0.9) to balance rule quality and quantity.
Challenges & solutions: Limited structured features → focused on designation and State as key items.
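The arithmetic behind the rule thresholds above (support, confidence, lift) can be sketched without an Apriori library on a handful of toy (designation, State) transactions; the item names and thresholds below are illustrative:

```python
from itertools import combinations

# Each transaction is the set of items for one park record
transactions = [
    {"National Park", "CO"},
    {"National Park", "UT"},
    {"National Monument", "CO"},
    {"National Monument", "CO"},
    {"National Park", "UT"},
    {"National Monument", "UT"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / n

min_support, min_confidence = 0.3, 0.5
items = sorted(set().union(*transactions))
rules = []
for a, b in combinations(items, 2):
    s = support({a, b})
    if s < min_support:
        continue  # infrequent pair: pruned, as in Apriori
    for ante, cons in [(a, b), (b, a)]:
        conf = s / support({ante})          # P(cons | ante)
        lift = conf / support({cons})       # > 1 means positive association
        if conf >= min_confidence:
            rules.append((ante, cons, round(s, 2), round(conf, 2), round(lift, 2)))

for r in rules:
    print(r)
```

A library such as mlxtend applies the same definitions but prunes candidate itemsets level by level, which matters on larger item vocabularies.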
Logistic Regression
Objective: To predict whether a park belongs to Colorado or Utah based on features such as its location (latitude, longitude) and type (designation), using logistic regression. This allows us to assess how well basic park characteristics explain the state classification.
Why chosen: Suitable for binary classification (CO vs UT) with interpretable results.
Assumptions: Assumes a linear relationship between the features and the log-odds, independent observations, and low multicollinearity among features.
Hyperparameter tuning: Tuned C, penalty, and solver using GridSearchCV with macro F1-score.
Challenges & solutions:
The model initially predicted only one class, likely due to class imbalance and weak feature separation.
Set zero_division=0 so precision and recall remain defined when a class receives no predictions.
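A sketch of how the imbalance issue can be handled: class_weight="balanced" reweights the minority state, and zero_division=0 keeps the report defined if a class still gets no predictions. The features and class sizes below are illustrative stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

rng = np.random.default_rng(1)
# Imbalanced illustrative sample: 45 "CO" parks vs 5 "UT" parks
X = np.vstack([
    rng.normal([39.0, -105.5], 0.5, size=(45, 2)),
    rng.normal([38.5, -111.0], 0.5, size=(5, 2)),
])
y = np.array(["CO"] * 45 + ["UT"] * 5)

clf = LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000)
clf.fit(X, y)
report = classification_report(y, clf.predict(X),
                               zero_division=0, output_dict=True)
print(round(report["macro avg"]["f1-score"], 3))
```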
Performance Evaluation
K-Means Clustering
Evaluation Metrics: Used the Silhouette Score and the Davies-Bouldin Index to assess cluster quality.
Results: See this page.
Decision Tree Classifier
Evaluation Metrics: Assessed with precision, recall, and macro F1-score for imbalanced binary classification.
Results: See this page.
Apriori Analysis
Evaluation Metrics: Used support, confidence, and lift; lift > 1.0 indicated meaningful associations.
Results: See this page.
Logistic Regression
Evaluation Metrics: Used precision, recall, and macro F1-score for imbalanced data.
Results: See this page.
Conclusion
Best Model for Comparing Colorado and Utah
Decision Tree Classification performed best.
Accurately predicted state using latitude, longitude, and designation.
Handled categorical data well and provided interpretable rules.
Other Models
KMeans Clustering:
Found spatial patterns, but as an unsupervised method it is descriptive rather than predictive.
Apriori Analysis:
Discovered co-occurrence patterns (e.g., park types in CO), but not suited for classification.
Logistic Regression:
Performed poorly due to class imbalance and limited feature separation.
Summary
Decision Tree was the most effective for classification.
Other models provided useful exploratory insights but lacked predictive power.
Data Formatting for Models
K-Means Clustering
Data Formatting: Standardized latitude and longitude; PCA used for visualization.
Before-and-after data transformation snapshots: See this page.
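The standardize-then-PCA step can be sketched as follows; the coordinate values are illustrative, not the project's real data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative (latitude, longitude) rows
coords = np.array([[39.0, -105.5], [38.5, -111.0],
                   [40.3, -105.7], [37.6, -112.2]])

scaled = StandardScaler().fit_transform(coords)  # each column: mean 0, std 1
pc = PCA(n_components=2).fit_transform(scaled)   # axes for a 2-D cluster plot

print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Standardizing first matters because K-Means uses Euclidean distance, and raw latitude and longitude have different spreads.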
Decision Tree Classifier
Data formatting: One-hot encoded designation; numeric features used as-is.
Before-and-after data transformation snapshots: See this page.
Apriori Analysis
Data formatting: Converted each record into a transaction and applied one-hot encoding.
Before-and-after data transformation snapshots: See this page.
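The transaction-encoding step can be sketched with pandas; the designation and State values below are illustrative:

```python
import pandas as pd

# Each park record becomes one transaction
parks = pd.DataFrame({
    "designation": ["National Park", "National Monument", "National Park"],
    "State": ["CO", "UT", "UT"],
})

# One-hot encoding yields the boolean item matrix Apriori expects
onehot = pd.get_dummies(parks, prefix_sep="=").astype(bool)
print(onehot.columns.tolist())
```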
Logistic Regression
Data formatting: One-hot encoded designation; used latitude and longitude directly.
Before-and-after data transformation snapshots: See this page.