Models Implemented
MiniBatch KMeans
Objective: The goal of this clustering analysis is to identify and compare air quality patterns across monitoring sites in Colorado and Utah, based on PM2.5 concentration (arithmetic_mean) and location (latitude, longitude). This helps reveal regional pollution clusters and supports state-level environmental comparisons.
Why chosen: Fast and scalable for large, numeric, unlabeled air quality data (PM2.5 + location).
Assumptions: Numeric, scaled data; spherical clusters; fixed k.
Hyperparameter tuning: Tested k = 2–7; selected the best k using the Silhouette Score and the Davies-Bouldin (DB) Index.
Challenges & solutions:
Large dataset → used sampling for scoring and PCA.
Mixed feature scales → applied StandardScaler.
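The k-selection loop described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's code: the cluster centers, batch_size, n_init, and sample_size values are all assumptions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (arithmetic_mean, latitude, longitude); NOT the real data.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 39.7, -105.0], [12.0, 40.7, -111.9], [8.0, 38.5, -108.5]])
X = np.vstack([c + rng.normal(scale=[1.0, 0.2, 0.2], size=(400, 3)) for c in centers])

# Put PM2.5 and coordinates on a common scale before computing distances.
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 8):  # k = 2..7, as in the tuning step
    model = MiniBatchKMeans(n_clusters=k, batch_size=256, n_init=3, random_state=0)
    labels = model.fit_predict(X_scaled)
    # sample_size scores the silhouette on a random subset to keep it cheap.
    sil = silhouette_score(X_scaled, labels, sample_size=500, random_state=0)
    dbi = davies_bouldin_score(X_scaled, labels)
    scores[k] = (sil, dbi)

# Pick k by the highest silhouette; the DB Index is printed as a cross-check.
best_k = max(scores, key=lambda k: scores[k][0])
for k, (sil, dbi) in scores.items():
    print(f"k={k}: silhouette={sil:.3f}, davies_bouldin={dbi:.3f}")
print("best k by silhouette:", best_k)
```

When the two metrics disagree on k, the choice is a judgment call; the sketch simply prefers the silhouette.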
Regression (Linear & Ridge)
Objective: To predict PM2.5 concentration (arithmetic_mean) at air quality monitoring sites in Colorado and Utah, using geographic features (latitude, longitude) and state as inputs. The goal is to determine whether these basic spatial features can explain variations in air quality, and to compare the performance of Linear Regression and Ridge Regression (with regularization).
Why chosen: Simple, interpretable models for numeric geographic data (latitude, longitude, state).
Assumptions: Linearity, constant variance, independent errors; Ridge adds L2 regularization.
Hyperparameter tuning: Ridge tuned with GridSearchCV; best alpha = 1, with no performance gain over plain Linear Regression.
Challenges & solutions:
Categorical state → one-hot encoded
Weak predictive power → tried Ridge, but no improvement
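A comparison of this kind might look like the sketch below. The data is synthetic and deliberately noisy so that the geographic features explain little; the alpha grid, the split ratio, and the scoring choice are assumptions, not the project's settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): the target is mostly noise.
rng = np.random.default_rng(0)
n = 1000
is_utah = rng.integers(0, 2, size=n)  # one-hot column for state; Colorado is the baseline
latitude = np.where(is_utah == 1, 40.5, 39.5) + rng.normal(scale=0.8, size=n)
longitude = np.where(is_utah == 1, -111.8, -105.2) + rng.normal(scale=0.8, size=n)
X = np.column_stack([latitude, longitude, is_utah])
y = 7.0 + 1.0 * is_utah + rng.normal(scale=6.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

linear = LinearRegression().fit(X_train, y_train)

alpha_grid = [0.01, 0.1, 1, 10, 100]  # hypothetical grid
search = GridSearchCV(Ridge(), {"alpha": alpha_grid}, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)
ridge = search.best_estimator_

# Evaluate both models on the same held-out split.
results = {}
for name, model in [("linear", linear), ("ridge", ridge)]:
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    results[name] = (rmse, float(r2_score(y_test, pred)))
    print(f"{name}: RMSE={results[name][0]:.2f}, R2={results[name][1]:.3f}")
print("best alpha:", search.best_params_["alpha"])
```

With only three weakly informative features there is little for the L2 penalty to shrink, which is consistent with Ridge matching rather than beating the unregularized fit.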
Decision Tree Classifier
Objective: To build a classification model that predicts air quality categories (e.g., Good, Moderate, Unhealthy) based on the type of air pollutant and the state where the data was collected.
Why chosen: Handles categorical data (parameter, state), interpretable, no need for scaling.
Assumptions: No assumptions on linearity or distribution; flexible for non-linear patterns.
Hyperparameter tuning: Tuned with GridSearchCV; best max_depth = 3, criterion = 'gini'.
Challenges & solutions:
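Class imbalance → some classes were never predicted, which leaves their precision undefined; set zero_division=0 in the metrics so these classes score 0 instead of triggering warnings. This changes the reporting only and does not correct the imbalance itself.
Rare labels not predicted → consider class weighting or merging rare categories in future work.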
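The tuning and reporting steps above can be sketched as follows. The labels are synthetic, with a rare "Unhealthy" class that the two features cannot explain; the parameter grid, the split, and the use of a pipeline are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), with a deliberately rare class.
rng = np.random.default_rng(0)
n = 600
parameter = rng.choice(["PM2.5", "Ozone", "NO2"], size=n)
state = rng.choice(["Colorado", "Utah"], size=n)
category = np.where(parameter == "Ozone", "Moderate", "Good").astype(object)
category[::50] = "Unhealthy"  # 12 rare rows unrelated to the features
category = category.astype(str)

X = np.column_stack([parameter, state])
X_train, X_test, y_train, y_test = train_test_split(
    X, category, test_size=0.25, stratify=category, random_state=0)

# One-hot encode the categorical inputs, then fit the tree; no scaling step.
pipeline = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                         DecisionTreeClassifier(random_state=0))
param_grid = {  # hypothetical grid
    "decisiontreeclassifier__max_depth": [2, 3, 5],
    "decisiontreeclassifier__criterion": ["gini", "entropy"],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_macro")
search.fit(X_train, y_train)

pred = search.predict(X_test)
print(search.best_params_)
# zero_division=0 reports 0 for classes that are never predicted.
print(classification_report(y_test, pred, zero_division=0))
macro_f1 = f1_score(y_test, pred, average="macro", zero_division=0)
print(f"macro F1: {macro_f1:.3f}")
```

Passing class_weight="balanced" to DecisionTreeClassifier is one way to try the class-weighting idea mentioned above.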
Apriori Analysis
Objective: To discover frequent combinations of air pollutants observed at the same site and date, using the Apriori algorithm. This helps identify common pollutant co-occurrence patterns, which may indicate shared sources or atmospheric conditions.
Why chosen: Suitable for discovering co-occurring air pollutants across monitoring sites. The data naturally forms transaction-like records by grouping pollutant types per site and date.
Model assumptions: No statistical assumptions required. Assumes that meaningful patterns emerge from frequent co-occurrence of items (pollutants) in transactions.
Hyperparameter tuning: We varied min_support from 0.01 to 0.1 to capture both common and less frequent pollutant combinations. Rules were then filtered with a lift threshold of 1.0; since lift = 1 means the items are independent, only rules with lift above 1 indicate a positive association.
Challenges & solutions:
Transactions had to be represented as a wide binary item matrix → built it with one-hot encoding via TransactionEncoder.
Too few or too many frequent itemsets → addressed by tuning min_support across multiple thresholds.
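To show how min_support and lift interact, here is a small dependency-free sketch. It brute-forces pair counts rather than running Apriori's level-wise pruning, and the transactions and thresholds are hypothetical toy values chosen for a tiny dataset.

```python
from itertools import combinations

# Hypothetical transactions: pollutants observed at one site on one date.
transactions = [
    {"NO2", "Ozone"},
    {"NO2", "Ozone", "PM2.5"},
    {"Ozone", "PM2.5"},
    {"NO2", "Ozone"},
    {"PM2.5"},
    {"NO2", "Ozone", "CO"},
    {"PM2.5", "CO"},
    {"NO2", "Ozone"},
]
n = len(transactions)


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / n


def pair_rules(min_support, min_lift=1.0):
    """Rules A -> B over item pairs with support >= min_support and lift > min_lift."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        s_ab = support({a, b})
        if s_ab < min_support:
            continue  # the pair is not frequent enough
        for ante, cons in ((a, b), (b, a)):
            confidence = s_ab / support({ante})
            lift = confidence / support({cons})
            if lift > min_lift:  # lift == 1 means independence, so it is excluded
                rules.append((ante, cons, s_ab, confidence, lift))
    return rules


for min_support in (0.1, 0.2, 0.5):
    for ante, cons, s, c, l in pair_rules(min_support):
        print(f"min_support={min_support}: {ante} -> {cons} "
              f"support={s:.3f} confidence={c:.3f} lift={l:.3f}")
```

In this toy set, CO and PM2.5 co-occur exactly as often as independence predicts (lift = 1), so the pair is dropped even though it clears the lowest support threshold.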
Performance Evaluation
MiniBatch KMeans
Evaluation Metrics: Silhouette Score (higher is better; ranges from −1 to 1) and Davies-Bouldin Index (lower is better; minimum 0).
Results: See this page.
Regression (Linear & Ridge)
Evaluation Metrics: RMSE ≈ 6.91, R² ≈ 0.03 → poor predictive performance; the models explain only about 3% of the variance in PM2.5.
Results: See this page.
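For reference, both regression metrics can be computed by hand. The sketch below uses made-up numbers to show why an R² near 0 means the model is barely better than always predicting the mean.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical values, not the project's predictions.
y_true = [2.0, 4.0, 6.0, 8.0, 10.0]
mean_baseline = [6.0] * 5                  # always predict the mean
weak_model = [5.5, 5.8, 6.0, 6.2, 6.5]     # barely tracks the target

print("baseline:", rmse(y_true, mean_baseline), r2(y_true, mean_baseline))
print("weak model:", rmse(y_true, weak_model), r2(y_true, weak_model))
```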
Decision Tree Classifier
Evaluation Metrics: Macro F1-score used due to imbalance; precision/recall low for minority classes.
Results: See this page.
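The reason macro F1 suits imbalanced labels can be seen in a small hand computation. The labels below are invented: accuracy looks high, but the never-predicted minority class contributes an F1 of 0 and pulls the macro average down.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class; undefined precision or recall is treated as 0."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)


# Hypothetical labels: the single "Unhealthy" row is predicted as "Good".
y_true = ["Good"] * 8 + ["Moderate"] * 3 + ["Unhealthy"]
y_pred = ["Good"] * 8 + ["Moderate"] * 3 + ["Good"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}, macro F1={macro_f1(y_true, y_pred):.3f}")
```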
Apriori Analysis
Evaluation Metrics: Evaluated rules based on support, confidence, and lift. Lift > 1.0 used to ensure meaningful associations.
Results: See this page.
Conclusion
Best Model for Comparing Colorado and Utah
Decision Tree Classification performed best.
Successfully predicted air quality categories using pollutant type and state.
Provided interpretable rules and required minimal preprocessing.
Other Models
MiniBatch KMeans:
Effectively revealed spatial clusters of PM2.5 concentration across monitoring sites.
Useful for unsupervised insights, but not predictive.
Regression (Linear & Ridge):
Very low R² (≈ 0.03) indicated weak linear relationships.
Ridge regularization offered no improvement.
Apriori Analysis:
Found meaningful pollutant co-occurrence patterns (e.g., NO2 & OZONE).
Good for exploratory insights, but not suitable for prediction.
Summary
Decision Tree was the most effective for label-based prediction and actionable insights.
Other models provided valuable exploratory context, but had limitations in accuracy or applicability.
Data Formatting for Models
MiniBatch KMeans
Data formatting: Selected PM2.5 (arithmetic_mean), latitude, and longitude; scaled them with StandardScaler; PCA used for 2D visualization.
Before-and-after data transformation snapshots: See this page.
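A before-and-after transformation of this kind could be produced along these lines. The input values are randomly generated placeholders, and the distributions used to generate them are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder rows for (arithmetic_mean, latitude, longitude); NOT the real data.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.gamma(shape=2.0, scale=4.0, size=200),   # PM2.5-like positive values
    rng.uniform(37.0, 42.0, size=200),           # latitude
    rng.uniform(-114.0, -102.0, size=200),       # longitude
])

# Standardize each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Project the scaled features to two components for plotting only.
pca = PCA(n_components=2, random_state=0)
X_2d = pca.fit_transform(X_scaled)

print("before:\n", X[:3].round(2))
print("after scaling:\n", X_scaled[:3].round(2))
print("2D projection:\n", X_2d[:3].round(2))
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```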
Regression (Linear & Ridge)
Data formatting: One-hot encoding for state; features were not scaled (note that the Ridge penalty is sensitive to feature scale, unlike ordinary least squares).
Before-and-after data transformation snapshots: See this page.
Decision Tree Classifier
Data formatting: One-hot encoding for parameter and state; no scaling needed.
Before-and-after data transformation snapshots: See this page.
Apriori Analysis
Data formatting: Grouped by site and date to form transactions; applied one-hot encoding for pollutant presence per transaction.
Before-and-after data transformation snapshots: See this page.
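The grouping and encoding step can be illustrated with plain Python. The project used TransactionEncoder for the encoding; this sketch builds the same kind of boolean matrix by hand, and the site names, dates, and records are hypothetical.

```python
from collections import defaultdict

# Hypothetical long-format records: (site, date, pollutant).
records = [
    ("site_A", "2023-01-01", "NO2"),
    ("site_A", "2023-01-01", "Ozone"),
    ("site_A", "2023-01-02", "PM2.5"),
    ("site_B", "2023-01-01", "Ozone"),
    ("site_B", "2023-01-01", "PM2.5"),
    ("site_B", "2023-01-01", "Ozone"),  # duplicate reading; the set drops it
]

# Before: one row per reading. Group into one transaction per (site, date).
transactions = defaultdict(set)
for site, date, pollutant in records:
    transactions[(site, date)].add(pollutant)

# After: one boolean column per pollutant, one row per transaction.
items = sorted({pollutant for _, _, pollutant in records})
matrix = {key: [item in present for item in items]
          for key, present in transactions.items()}

print("columns:", items)
for key, row in matrix.items():
    print(key, row)
```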