Models Implemented
- Air Pollution Data -

Models Implemented
1. MiniBatch KMeans
  - Objective: The goal of this clustering analysis is to identify and compare air quality patterns across monitoring sites in Colorado and Utah, based on PM2.5 concentration (arithmetic_mean) and location (latitude, longitude). This helps reveal regional pollution clusters and supports state-level environmental comparisons.
  - Why chosen: Fast and scalable for large, numeric, unlabeled air quality data (PM2.5 + location).
  - Assumptions: Numeric, scaled data; spherical clusters; fixed k.
  - Hyperparameter tuning: Tested k = 2–7; selected the best k using the Silhouette Score and the Davies-Bouldin (DB) Index.
  - Challenges & solutions:
    1. Large dataset → computed the clustering scores and the PCA projection on a sample of the data.
    2. Mixed feature scales → applied StandardScaler.
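The clustering workflow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, and the sample size, batch size, and random seeds are assumptions chosen only to make the example run quickly.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for (arithmetic_mean, latitude, longitude); not the real data.
n = 3000
X = np.column_stack([
    rng.gamma(shape=2.0, scale=4.0, size=n),  # PM2.5-like values
    rng.uniform(37.0, 42.0, size=n),          # latitude
    rng.uniform(-114.0, -102.0, size=n),      # longitude
])

# Put all features on a comparable scale before distance-based clustering.
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 8):  # k = 2..7, as in the text
    model = MiniBatchKMeans(n_clusters=k, random_state=0, n_init=3, batch_size=1024)
    labels = model.fit_predict(X_scaled)
    # sample_size scores the silhouette on a random subset to keep it cheap.
    sil = silhouette_score(X_scaled, labels, sample_size=1000, random_state=0)
    dbi = davies_bouldin_score(X_scaled, labels)
    scores[k] = (sil, dbi)

# One possible selection rule (an assumption): pick the highest silhouette.
best_k = max(scores, key=lambda k: scores[k][0])
for k, (sil, dbi) in scores.items():
    print(f"k={k}  silhouette={sil:.3f}  davies_bouldin={dbi:.3f}")
print("best k by silhouette:", best_k)
```

When the two metrics disagree, the choice of k is a judgment call; the rule used here is only one reasonable option.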
2. Regression (Linear & Ridge)
  - Objective: To predict PM2.5 concentration (arithmetic_mean) at air quality monitoring sites in Colorado and Utah, using geographic features (latitude, longitude) and state as inputs. The goal is to determine whether these basic spatial features can explain variations in air quality, and to compare the performance of Linear Regression and Ridge Regression (with regularization).
  - Why chosen: Simple, interpretable models for numeric geographic data (latitude, longitude, state).
  - Assumptions: Linearity, constant variance, independent errors; Ridge adds L2 regularization.
  - Hyperparameter tuning: Ridge tuned with GridSearchCV; best alpha = 1, but with no performance gain over Linear Regression.
  - Challenges & solutions:
    1. Categorical state → one-hot encoded
    2. Weak predictive power → tried Ridge, but no improvement
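A comparison of this kind might look like the sketch below. The data is synthetic and mostly noise (to mimic a weak spatial signal), and the alpha grid, the train/test split, and the scoring choice are assumptions rather than the settings used in the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for the monitoring-site features; not the real data.
state = rng.choice(["Colorado", "Utah"], size=n)
lat = rng.uniform(37.0, 42.0, size=n)
lon = rng.uniform(-114.0, -102.0, size=n)
y = 8.0 + 0.3 * (lat - 39.5) + rng.normal(0.0, 6.0, size=n)  # weak signal, heavy noise

# One-hot encode state; drop="first" avoids a redundant indicator column.
state_onehot = OneHotEncoder(drop="first").fit_transform(state.reshape(-1, 1)).toarray()
X = np.column_stack([lat, lon, state_onehot])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

linear = LinearRegression().fit(X_train, y_train)

# Hypothetical alpha grid; the grid used in the project is not stated.
search = GridSearchCV(
    Ridge(),
    {"alpha": [0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

results = {}
for name, model in [("linear", linear), ("ridge", search.best_estimator_)]:
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    results[name] = (rmse, float(r2_score(y_test, pred)))
    print(f"{name}: RMSE={rmse:.2f}  R2={results[name][1]:.3f}")
print("best alpha:", search.best_params_["alpha"])
```

With so few features and little multicollinearity, L2 regularization has little to shrink, which is consistent with Ridge matching plain Linear Regression.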
3. Decision Tree Classifier
  - Objective: To build a classification model that predicts air quality categories (e.g., Good, Moderate, Unhealthy) based on the type of air pollutant and the state where the data was collected.
  - Why chosen: Handles categorical data (parameter, state), interpretable, no need for scaling.
  - Assumptions: No assumptions on linearity or distribution; flexible for non-linear patterns.
  - Hyperparameter tuning: Tuned with GridSearchCV; best max_depth = 3, criterion = 'gini'.
  - Challenges & solutions:
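    1. Class imbalance left some minority classes with no predicted samples, making their precision undefined → set zero_division=0 so those scores are reported as 0 instead of raising warnings.
    2. Rare labels not predicted → consider class weighting or merging rare categories in future work.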
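The tuning and the zero_division handling can be illustrated with the sketch below. The labels and features are synthetic, and the parameter grid, the split, and the use of classification_report are assumptions; the source only states that GridSearchCV and zero_division=0 were used.

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000

# Synthetic, deliberately imbalanced stand-in data; not the real dataset.
parameter = rng.choice(["PM2.5", "Ozone", "NO2"], size=n)
state = rng.choice(["Colorado", "Utah"], size=n)
p_moderate = np.where(parameter == "PM2.5", 0.35, 0.09)
u = rng.random(n)
category = np.where(
    u < 0.02, "Unhealthy", np.where(u < 0.02 + p_moderate, "Moderate", "Good")
)

# One-hot encode both categorical inputs; trees need no feature scaling.
X = OneHotEncoder().fit_transform(np.column_stack([parameter, state])).toarray()
X_train, X_test, y_train, y_test = train_test_split(
    X, category, test_size=0.2, stratify=category, random_state=0
)

# Hypothetical grid; only max_depth and criterion are named in the text.
grid = {"max_depth": [2, 3, 5, None], "criterion": ["gini", "entropy"]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5, scoring="f1_macro")
search.fit(X_train, y_train)

pred = search.predict(X_test)
# zero_division=0 reports 0 for classes that are never predicted.
macro_f1 = f1_score(y_test, pred, average="macro", zero_division=0)
report = classification_report(y_test, pred, zero_division=0)
print(search.best_params_)
print(report)
```

If class weighting is tried later, DecisionTreeClassifier accepts class_weight="balanced", which would be a natural first experiment.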
4. Apriori Analysis
  - Objective: To discover frequent combinations of air pollutants observed at the same site and date, using the Apriori algorithm. This helps identify common pollutant co-occurrence patterns, which may indicate shared sources or atmospheric conditions.
  - Why chosen: Suitable for discovering co-occurring air pollutants across monitoring sites. The data naturally forms transaction-like records by grouping pollutant types per site and date.
  - Assumptions: No statistical assumptions required. Assumes that meaningful patterns emerge from frequent co-occurrence of items (pollutants) in transactions.
  - Hyperparameter tuning: We varied min_support from 0.01 to 0.1 to capture both common and less frequent pollutant combinations. Rules with lift ≥ 1.0 were kept to exclude negatively associated combinations; a lift of exactly 1.0 indicates independence, and values above 1.0 indicate positive association.
  - Challenges & solutions:
    1. High-dimensional binary matrix → resolved with one-hot encoding using TransactionEncoder.
    2. Too few or too many frequent itemsets → addressed by tuning min_support across multiple thresholds.
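To show how min_support changes the number of frequent itemsets, here is a small pure-Python sketch of the level-wise Apriori idea. It is a hand-written illustration, not the library implementation used in the project; the transactions are hypothetical, and the thresholds are larger than the 0.01 to 0.1 range above because the toy dataset has only eight transactions.

```python
from itertools import combinations

# Hypothetical transactions: pollutants observed at one site on one date.
transactions = [
    {"NO2", "Ozone"},
    {"NO2", "Ozone", "PM2.5"},
    {"Ozone", "PM2.5"},
    {"NO2", "Ozone"},
    {"PM2.5"},
    {"NO2", "Ozone", "CO"},
    {"Ozone"},
    {"NO2", "PM2.5"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori(transactions, min_support):
    """Return {frozenset: support} for all itemsets meeting min_support."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        # Keep only the candidates of size k that are frequent enough.
        kept = {}
        for candidate in level:
            s = support(candidate, transactions)
            if s >= min_support:
                kept[candidate] = s
        frequent.update(kept)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        candidates = {a | b for a, b in combinations(kept, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = [
            c for c in candidates
            if all(frozenset(sub) in kept for sub in combinations(c, k - 1))
        ]
    return frequent


# Sweep a few illustrative thresholds and count the frequent itemsets found.
counts = {ms: len(apriori(transactions, ms)) for ms in (0.1, 0.25, 0.5)}
print(counts)
```

Lower thresholds admit rare items such as CO and quickly multiply the number of itemsets, which is the trade-off the tuning step has to balance.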

Performance Evaluation
1. MiniBatch KMeans
  - Evaluation Metrics: Silhouette Score (higher means better-separated clusters), Davies-Bouldin Index (lower means more compact, better-separated clusters)
  - Results: See this page.
2. Regression (Linear & Ridge)
  - Evaluation Metrics: RMSE ≈ 6.91, R² ≈ 0.03 → poor predictive performance (the models explain only about 3% of the variance in PM2.5).
  - Results: See this page.
3. Decision Tree Classifier
  - Evaluation Metrics: Macro F1-score used due to imbalance; precision/recall low for minority classes.
  - Results: See this page.
4. Apriori Analysis
  - Evaluation Metrics: Evaluated rules based on support, confidence, and lift. Lift > 1.0 used to ensure meaningful associations.
  - Results: See this page.
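For readers unfamiliar with the rule metrics named above, the following sketch computes support, confidence, and lift by hand for a single rule. The transactions are hypothetical and the helper names are illustrative only; the project's own rules came from its Apriori run, not from this code.

```python
# Hypothetical transactions: pollutants observed at one site on one date.
transactions = [
    {"NO2", "Ozone"},
    {"NO2", "Ozone", "PM2.5"},
    {"Ozone", "PM2.5"},
    {"NO2", "Ozone"},
    {"PM2.5"},
    {"NO2", "Ozone", "CO"},
    {"Ozone"},
    {"NO2", "PM2.5"},
]


def support(items):
    """Fraction of transactions containing all of the given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    s_both = support(antecedent | consequent)
    confidence = s_both / support(antecedent)  # P(consequent | antecedent)
    lift = confidence / support(consequent)    # > 1 means positive association
    return s_both, confidence, lift


no2_ozone = rule_metrics({"NO2"}, {"Ozone"})
pm25_ozone = rule_metrics({"PM2.5"}, {"Ozone"})
print("NO2 -> Ozone:", no2_ozone)
print("PM2.5 -> Ozone:", pm25_ozone)
```

In this toy data the first rule has lift above 1.0 and would pass the filter, while the second has lift below 1.0 and would be dropped.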
Conclusion

Best Model for Comparing Colorado and Utah
- The Decision Tree Classifier performed best.
  - Predicted air quality categories from pollutant type and state, although precision and recall were low for minority classes.
  - Provided interpretable rules and required minimal preprocessing.
Other Models

MiniBatch KMeans:
- Effectively revealed spatial clusters of PM2.5 concentration across monitoring sites.
- Useful for unsupervised insights, but not predictive.
Regression (Linear & Ridge):
- Very low R² (≈ 0.03) indicated weak linear relationships.
- Ridge regularization offered no improvement.
Apriori Analysis:
- Found meaningful pollutant co-occurrence patterns (e.g., NO2 & OZONE).
- Good for exploratory insights, but not suitable for prediction.
Overall Conclusion

Decision Tree was the most effective for label-based prediction and actionable insights.
Other models provided valuable exploratory context, but had limitations in accuracy or applicability.

Data Formatting for Models
1. MiniBatch KMeans
  - Data formatting: Selected the PM2.5 and location features and scaled them with StandardScaler; PCA used for 2D visualization.
  - Before-and-after data transformation snapshots: See this page.
2. Regression (Linear & Ridge)
  - Data formatting: One-hot encoding for state; no scaling required.
  - Before-and-after data transformation snapshots: See this page.
3. Decision Tree Classifier
  - Data formatting: One-hot encoding for parameter and state; no scaling needed.
  - Before-and-after data transformation snapshots: See this page.
4. Apriori Analysis
  - Data formatting: Grouped by site and date to form transactions; applied one-hot encoding for pollutant presence per transaction.
  - Before-and-after data transformation snapshots: See this page.
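The Apriori formatting step can be pictured with the short sketch below, which groups long-format records into per-site, per-date transactions and then builds a boolean presence matrix by hand. The records, site names, and dates are hypothetical; the page mentions TransactionEncoder for this step, and this code only approximates the kind of matrix it produces.

```python
from collections import defaultdict

# Hypothetical long-format records: (site_id, date, parameter).
records = [
    ("site_A", "2023-07-01", "NO2"),
    ("site_A", "2023-07-01", "Ozone"),
    ("site_A", "2023-07-02", "Ozone"),
    ("site_B", "2023-07-01", "PM2.5"),
    ("site_B", "2023-07-01", "NO2"),
    ("site_B", "2023-07-01", "NO2"),  # duplicate reading, collapsed by the set
]

# Before: one row per measurement. After: one set of pollutants per (site, date).
baskets = defaultdict(set)
for site, date, parameter in records:
    baskets[(site, date)].add(parameter)

# One-hot encode: one boolean column per pollutant, one row per transaction.
columns = sorted({p for items in baskets.values() for p in items})
onehot = {key: [p in items for p in columns] for key, items in baskets.items()}

print(columns)
for key, row in onehot.items():
    print(key, row)
```

Using sets for the baskets means repeated readings of the same pollutant on one day count once, which matches the presence/absence framing of the analysis.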
