This project aims to analyze and predict the Air Quality Index (AQI) in Mumbai, India, using machine learning techniques.
Predicting weather and air quality has become much harder as the climate changes. Human activity, modernization, and industrialization release extra heat and pollutants into the atmosphere, which disturb weather systems and make them harder to forecast. To address this, the project compares several popular ML algorithms and identifies the one that performs best on this dataset.
Objectives:
Analyze pollution and meteorological data from Mumbai.
Predict AQI using supervised ML algorithms.
Compare model performance (MAE, RMSE, R²).
Provide actionable insights for urban and health planning.
Dataset:
The dataset was obtained from the Central Pollution Control Board (CPCB) data repository and covers daily air quality measurements in Mumbai from January 1, 2021 to July 31, 2023. It initially included the following features:
Timestamp: Date and time of measurement
Station: Location of the monitoring station
Pollutants: PM2.5, PM10, NO, NO2, NOx, NH3, SO2, CO, Ozone
Feature Engineering:
Calculated AQI: Computed AQI value using the CPCB's AQI Calculator.
AQI Category: Categorical representation of AQI (Good, Moderate, Poor, Unhealthy, Severe, Hazardous) based on predefined AQI ranges.
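As a rough illustration of how a sub-index-based AQI and its category can be derived, the sketch below linearly interpolates a PM2.5 concentration between breakpoints and maps the result to the categories listed above. The breakpoint table and the category cut-offs are assumptions for illustration (they follow the commonly cited CPCB PM2.5 bands and should be checked against CPCB's published tables). The names `sub_index`, `aqi_category`, and `overall_aqi` are hypothetical helpers, not the project's actual code.

```python
# Assumed PM2.5 breakpoints (µg/m³, 24-hour average):
# (conc_low, conc_high, index_low, index_high).
# Verify against the official CPCB table before relying on these values.
PM25_BREAKPOINTS = [
    (0, 30, 0, 50),
    (30, 60, 50, 100),
    (60, 90, 100, 200),
    (90, 120, 200, 300),
    (120, 250, 300, 400),
    (250, 380, 400, 500),
]

# Assumed cut-offs: upper AQI bound for each category label used in this project.
CATEGORY_BOUNDS = [
    (50, "Good"),
    (100, "Moderate"),
    (200, "Poor"),
    (300, "Unhealthy"),
    (400, "Severe"),
]


def sub_index(conc, breakpoints):
    """Linearly interpolate a pollutant concentration into a sub-index."""
    if conc < 0:
        raise ValueError("concentration must be non-negative")
    for c_lo, c_hi, i_lo, i_hi in breakpoints:
        if conc <= c_hi:
            return i_lo + (conc - c_lo) * (i_hi - i_lo) / (c_hi - c_lo)
    # Concentrations beyond the last band are capped at the top of the scale.
    return float(breakpoints[-1][3])


def aqi_category(aqi):
    """Map a numeric AQI to a category label."""
    for upper, label in CATEGORY_BOUNDS:
        if aqi <= upper:
            return label
    return "Hazardous"


def overall_aqi(sub_indices):
    """The overall AQI is taken as the worst (largest) pollutant sub-index."""
    return max(sub_indices)


pm25_index = sub_index(45, PM25_BREAKPOINTS)
category = aqi_category(pm25_index)
```

The same `sub_index` function would be reused with a separate breakpoint table per pollutant, and `overall_aqi` would then pick the dominant pollutant for each day and station.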
Data Preprocessing:
Handling Missing Values: Missing data points were filled hierarchically: first with the quarterly mean and then, where that was unavailable, with the semester, 9-month, and yearly means in turn. Starting from the shortest window preserves seasonal patterns while still filling as many gaps as possible.
Data Normalization: Min-Max scaling was applied to the pollutant features so that features measured on different scales contribute comparably to model training.
Feature Engineering: The "Calculated AQI" and "AQI Category" features were derived from raw pollutant concentrations using the CPCB's AQI calculation methodology.
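The imputation and scaling steps above can be sketched as follows. This is a minimal illustration with pandas, not the project's actual code: `hierarchical_impute` and `min_max_scale` are hypothetical helpers, the column names are assumed from the feature list, and the demo values are made up. The 9-month level is left out because its window definition is not given here. It would slot into `levels` in the same way.

```python
import numpy as np
import pandas as pd


def hierarchical_impute(df, value_col, station_col="Station", time_col="Timestamp"):
    """Fill NaNs with the station's quarterly mean, then half-yearly, then yearly."""
    out = df.copy()
    ts = pd.to_datetime(out[time_col])
    # Helper period columns (hypothetical names) used only for grouping.
    keyed = out.assign(
        _year=ts.dt.year,
        _quarter=ts.dt.quarter,
        _half=(ts.dt.month - 1) // 6,
    )
    # Grouping keys ordered from the finest period to the coarsest.
    levels = [
        [station_col, "_year", "_quarter"],
        [station_col, "_year", "_half"],
        [station_col, "_year"],
    ]
    for keys in levels:
        # Means are computed from observed values only; all-NaN groups stay NaN
        # and are handled by the next, coarser level.
        group_mean = keyed.groupby(keys)[value_col].transform("mean")
        out[value_col] = out[value_col].fillna(group_mean)
    return out


def min_max_scale(df, cols):
    """Rescale each column to [0, 1]; constant columns become 0."""
    out = df.copy()
    for col in cols:
        lo, hi = out[col].min(), out[col].max()
        span = hi - lo
        out[col] = 0.0 if span == 0 else (out[col] - lo) / span
    return out


# Made-up demo data: station A has gaps at every level, station B only has a yearly mean.
demo = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-01-05", "2021-02-10", "2021-03-15", "2021-05-01",
        "2021-08-20", "2021-11-01", "2021-02-01", "2021-09-01",
    ]),
    "Station": ["A"] * 6 + ["B"] * 2,
    "PM2.5": [40.0, np.nan, 60.0, np.nan, 80.0, np.nan, 20.0, np.nan],
})

filled = hierarchical_impute(demo, "PM2.5")
scaled = min_max_scale(filled, ["PM2.5"])
```

In a modelling pipeline the minimum and maximum would normally be learned on the training split only (for example with scikit-learn's `MinMaxScaler`) and then applied to the test split, to avoid leaking test information.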
Data Visualization:
Various visualizations were used to explore the data:
Pairplots: To examine relationships between pollutants and AQI categories. (See pairplot.png in the visualizations folder).
Boxplots: To compare pollutant distributions across stations. (See boxplot.png in the visualizations folder).
Time Series Plots: To track pollution trends over time for individual stations. (See timeseries_plots folder for individual station plots).
Bar Plots: To visualize average AQI levels across stations. (See avg_aqi_barplot.png in the visualizations folder).
Treemaps: To show pollutant distributions across stations. (See mumbai_pollution_treemap.html in the repository).
Radar Charts: To display average pollution levels for each station. (See radar_chart.png in the visualizations folder).
Feature Selection & Multicollinearity:
Variance Inflation Factor (VIF) was calculated to assess multicollinearity between features.
Features with high VIF were reviewed before modelling, because strongly collinear predictors make regression coefficients unstable and hard to interpret.
SO2, NOx (ppb), and NO (µg/m³) were removed due to high multicollinearity.
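For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it directly with NumPy on synthetic data. The helper name `vif_scores`, the synthetic columns, and the cut-off of 10 are assumptions for illustration. The threshold actually used in this project is not stated.

```python
import numpy as np


def vif_scores(X, names):
    """Return {feature: VIF}, where VIF = 1 / (1 - R^2) of the feature
    regressed on all remaining features (with an intercept)."""
    X = np.asarray(X, dtype=float)
    scores = {}
    for j, name in enumerate(names):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        scores[name] = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return scores


# Synthetic example: NOx is built as NO + NO2 plus small noise, so the three
# columns are nearly collinear, while Ozone is independent of them.
rng = np.random.default_rng(0)
no = rng.normal(size=200)
no2 = rng.normal(size=200)
nox = no + no2 + 0.05 * rng.normal(size=200)
ozone = rng.normal(size=200)

names = ["NO", "NO2", "NOx", "Ozone"]
vif = vif_scores(np.column_stack([no, no2, nox, ozone]), names)

# Assumed rule-of-thumb cut-off; adjust to the project's own criterion.
high_vif = [name for name in names if vif[name] > 10]
```

Dropping one feature from a collinear group and recomputing the scores is a common way to iterate, since removing a single column can bring the remaining VIFs back down.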
Model Development:
Several regression models were trained and evaluated:
Linear Regression
Decision Tree Regressor
Random Forest Regressor
Support Vector Regressor (SVR)
K-Nearest Neighbors Regressor (KNN)
Model performance was assessed using metrics such as MAE, MSE, R², and Adjusted R².
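A comparison of this kind can be sketched with scikit-learn as shown below. The data is synthetic (`make_regression`), the hyperparameters are defaults or arbitrary, and `evaluate` and `adjusted_r2` are hypothetical helper names, so the numbers it produces say nothing about the project's actual results. Adjusted R² is computed as 1 - (1 - R²)(n - 1)/(n - p - 1) for n samples and p features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor


def adjusted_r2(r2, n_samples, n_features):
    """Penalise R^2 for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model and return the evaluation metrics on the test split."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    r2 = r2_score(y_test, pred)
    return {
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2,
        "Adj_R2": adjusted_r2(r2, *X_test.shape),
    }


# Synthetic stand-in for the pollutant features and the AQI target.
X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "SVR": SVR(),
    "KNN": KNeighborsRegressor(),
}

results = {
    name: evaluate(model, X_train, X_test, y_train, y_test)
    for name, model in models.items()
}

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

SVR and KNN are sensitive to feature and target scale, so on real data they are usually evaluated on scaled inputs. The sketch keeps the raw synthetic values for brevity.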
Technologies Used:
Python, pandas, NumPy, scikit-learn, XGBoost, seaborn, matplotlib
Models: Linear Regression, Decision Tree, Random Forest, KNN, XGBoost, ANN
Techniques: Feature engineering, missing value handling, hyperparameter tuning (GridSearchCV), visualization
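As a small illustration of hyperparameter tuning with GridSearchCV, the sketch below searches a KNN regressor over a tiny grid using 5-fold cross-validation. The parameter grid, the scoring choice, and the synthetic data are assumptions for illustration, not the grid used in this project.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the training data.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Hypothetical grid: 3 neighbour counts x 2 weighting schemes = 6 candidates.
param_grid = {
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
}

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",  # scikit-learn maximises, hence the negation
)
search.fit(X, y)

best_params = search.best_params_
best_mae = -search.best_score_  # flip the sign back to a plain MAE
```

After fitting, `search.best_estimator_` is the model refit on the full training data with the best parameters, ready to be scored on a held-out test set.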
Outcome: The best-performing model was selected for its high accuracy and generalization. The solution has real-world application in public health advisories, urban planning, and predictive environmental monitoring.
Future Scope:
Real-time AQI dashboards
IoT sensor and satellite data integration
AI-based pollution alert systems
Role: Project Lead — designed and implemented the end-to-end pipeline, from data preprocessing to model evaluation.