Yoshita Narne

Master of Professional Studies in Data Science

(January 2020 - May 2021)

Instructor: Dr. Murat Gurner

Air Quality Data Prediction

GitHub Repository Link

PHASE 1

INTRODUCTION:

Despite dramatic progress in cleaning the air since 1970, air pollution continues to harm people’s health and the environment. It is a global problem that will remain a concern in the future. Effective air quality prediction has become an important issue for monitoring and control stations in many cities, which observe air pollutants such as NO2, CO, SO2, PM2.5, and PM10 and raise an alert when a pollutant exceeds its threshold.

Particulate matter PM2.5 is a fine atmospheric pollutant with a diameter of less than 2.5 micrometers, while PM10 is a coarse particulate that is 10 micrometers or less in diameter. Carbon monoxide (CO) is a product of the combustion of fuels such as coal, wood, or natural gas, and vehicular emissions contribute the majority of the carbon monoxide released into the atmosphere. Nitrogen dioxide (NO2) and other nitrogen oxides are expelled from high-temperature combustion. Sulfur dioxide (SO2) and other sulfur oxides are produced by volcanoes and by industrial processes; petroleum and coal often contain sulfur compounds, and their combustion generates sulfur dioxide. Air pollution is caused by the presence of poisonous gases and substances in the air, and it is also affected by the meteorological conditions of a particular place, such as temperature, humidity, rain, and wind.

OBJECTIVE:

The main objective is to build an ARIMA model on the given data that predicts PM2.5 levels, which are then used to determine the quality of the air.

DATASET:

The air quality dataset contains data from 12 air quality monitoring stations in Beijing, recorded from March 2013 to February 2017.
Each record contains an id, a timestamp, and the PM2.5, PM10, NO2, CO, O3, and SO2 concentrations measured at an air quality monitoring station.
The data is a multivariate time series.
The dataset contains 420768 instances and 18 attributes; missing values are marked as NaN.
Source: https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data
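
As a rough illustration of how one station file could be read into a PM2.5 series, here is a minimal sketch using only the Python standard library. The column names (year, month, day, hour, PM2.5), the "NA" marker for missing readings, and the forward-fill strategy are assumptions made for this example, not details confirmed by the project; the helper names read_pm25 and fill_forward are hypothetical.

```python
import csv
import io
from datetime import datetime


def read_pm25(file_obj):
    """Return a list of (timestamp, pm25) pairs; pm25 is None when missing."""
    rows = []
    for record in csv.DictReader(file_obj):
        stamp = datetime(
            int(record["year"]), int(record["month"]),
            int(record["day"]), int(record["hour"]),
        )
        raw = record["PM2.5"].strip()
        # Treat empty cells and NA/NaN markers as missing readings.
        value = None if raw.upper() in {"", "NA", "NAN"} else float(raw)
        rows.append((stamp, value))
    return rows


def fill_forward(values):
    """Replace each missing value with the most recent observed one.

    Missing values at the very start stay None because nothing precedes them.
    """
    filled, last = [], None
    for value in values:
        last = value if value is not None else last
        filled.append(last)
    return filled


# Tiny in-memory sample standing in for a real station file (assumed layout).
sample = io.StringIO(
    "No,year,month,day,hour,PM2.5\n"
    "1,2013,3,1,0,4\n"
    "2,2013,3,1,1,NA\n"
    "3,2013,3,1,2,7\n"
)
records = read_pm25(sample)
print(fill_forward([value for _, value in records]))  # [4.0, 4.0, 7.0]
```

Forward filling is only one way to handle the gaps; interpolation or dropping incomplete hours would be equally reasonable choices.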

METHODOLOGY:

ARIMA stands for "AutoRegressive Integrated Moving Average".

The autoregressive integrated moving average model is a generalisation of the autoregressive moving average (ARMA) model for time series prediction. A non-seasonal ARIMA model has three components: p, d, and q.

p - Specifies the order of the autoregressive term, i.e. the number of time lags.

d - Specifies the degree of differencing.

q - Specifies the order of the moving average term.
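
To make the role of d concrete, the sketch below applies first-order differencing d times to a short list. The function name difference is a hypothetical helper introduced for this illustration.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y[t] = x[t] - x[t-1]."""
    values = list(series)
    for _ in range(d):
        values = [later - earlier for earlier, later in zip(values, values[1:])]
    return values


squares = [1, 4, 9, 16, 25]
print(difference(squares, d=1))  # [3, 5, 7, 9]
print(difference(squares, d=2))  # [2, 2, 2]
```

Each round of differencing shortens the series by one observation and removes one order of polynomial trend, which is why a quadratic sequence becomes constant after d = 2.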

An ARIMA implementation from a Python statistics library will be used for training and prediction. This project uses the non-seasonal variant of ARIMA.

Based on the p-value of a stationarity test and a chosen threshold, we can reject the null hypothesis, which states that the data is not stationary.

If the data is stationary, no differencing is needed and only the AR and MA terms are taken into account.

The AIC (Akaike information criterion) is then calculated for each candidate model, and the values of p and q with the lowest AIC are selected, where p is the number of autoregressive (AR) terms and q is the number of moving average (MA) terms.

After that, the selected model will be used to generate predictions.

EVALUATION METRICS:

Mean absolute error (MAE): MAE is the average absolute difference between the target value and the value predicted by the model.
Root mean squared error (RMSE): RMSE is the square root of the average squared difference between the target value and the value predicted by the model.
Coefficient of determination (R²): R² compares the current model with a constant baseline that always predicts the mean of the target, and tells us how much better the model is.
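
The three metrics can be written out directly from their definitions. This is a plain-Python sketch with made-up numbers, not the evaluation code used in the project.

```python
import math


def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the mean squared error."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def r2(actual, predicted):
    """1 minus the ratio of model error to the error of a mean-only baseline."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]       # illustrative target values
predicted = [1.0, 2.0, 3.0, 5.0]    # illustrative model output
print(mae(actual, predicted), rmse(actual, predicted), r2(actual, predicted))
```

An R² of 1 means a perfect fit, 0 means the model is no better than predicting the mean, and negative values mean it is worse than that baseline.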

PHASE1

Power Point Presentation of PHASE-1

Presentation video of PHASE-1

PHASE 2

DATA DISTRIBUTION:

The density of PM2_5 is concentrated in the lower 25% of the value range; that is, as the PM2_5 value increases, the density decreases roughly exponentially. The main takeaway is that the PM2_5 data is not evenly distributed: it is right-skewed and does not follow a Gaussian distribution.
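
One simple way to quantify this kind of asymmetry is the sample skewness, sketched below. The readings are invented for illustration and are not taken from the dataset.

```python
def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical readings: many low values and a few very high ones.
readings = [5, 8, 10, 12, 15, 20, 30, 60, 150, 400]
print(skewness(readings))         # clearly positive, so right-skewed
print(skewness([1, 2, 3, 4, 5]))  # 0.0 for a symmetric sample
```

A Gaussian sample would give a value close to zero, so a large positive skewness supports the observation above.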

ESTIMATING THE TIME SERIES ANALYSIS CHARACTERISTICS:

IMPORTANT CHARACTERISTICS:

Is there a trend? That is, is any sustained increase or decrease observed?
Is there any seasonality? That is, is there a pattern of highs and lows tied to seasons, quarters, months, etc.?
Is the variance constant, or does it change over time?
Are there any periodic cycles repeating in the data?

THE ABOVE PLOT SHOWS THE AVERAGE WEEKLY PM2.5 VALUES:

The Trend plot indicates that there is an overall decreasing trend. The dataset starts at values around 95 and (after increasing in 2014) ends just below 90.
The Seasonal plot identifies repeating patterns that reach their lowest value about 60% of the way through each year. The maximum values are near the start of each year.
The Residual plot reflects the remaining noise in the dataset after removing the other variation types. There are no patterns present.
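
The trend, seasonal, and residual components described above can be obtained with a classical additive decomposition. The sketch below is a plain-Python version written for illustration; it is an assumption that the plots came from an additive decomposition, and the function names and the short synthetic series (period 4) are hypothetical.

```python
def centered_trend(series, period):
    """Centered moving average; None where the window does not fit."""
    n, half = len(series), period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: the window holds period + 1 points, so the two
            # end points each get half weight.
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    return trend


def decompose_additive(series, period):
    """Split a series into trend, seasonal, and residual parts."""
    trend = centered_trend(series, period)
    detrended = [x - m if m is not None else None for x, m in zip(series, trend)]
    # Average the detrended values at each position within the cycle.
    phase_means = []
    for phase in range(period):
        vals = [d for d in detrended[phase::period] if d is not None]
        phase_means.append(sum(vals) / len(vals))
    offset = sum(phase_means) / period  # make the seasonal part sum to zero
    seasonal = [phase_means[t % period] - offset for t in range(len(series))]
    residual = [
        x - m - s if m is not None else None
        for x, m, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, residual


# Synthetic series: linear trend plus a repeating pattern of length 4.
pattern = [2.0, -1.0, -3.0, 2.0]
series = [10 + 0.5 * t + pattern[t % 4] for t in range(16)]
trend, seasonal, residual = decompose_additive(series, period=4)
print(seasonal[:4])
```

For weekly averages with a yearly cycle the period would be about 52, and the series needs to cover at least two full cycles for the seasonal averages to be defined.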

TESTING SERIES STATIONARITY

Series stationarity can be tested with the Augmented Dickey-Fuller (ADF) test, which is a statistical hypothesis test. Without going into the details, its null hypothesis is that the series is not stationary. Therefore, when we run the test on our target variable, a p-value below the chosen threshold lets us reject the null hypothesis and treat the series as stationary. In that case we can neglect one of the three components of ARIMA, namely the degree of differencing (d).

ESTIMATING THE CORRELATION COEFFICIENTS

In the above ACF plot, the centre dotted line represents the mean, and the upper and lower dotted lines mark the boundaries of the 95% confidence band. Notice that we have good positive correlation for lags up to 77, which is the point where the ACF plot crosses the upper confidence threshold. Although correlation is good up to the 77th lag, we cannot use all of these lags because that would create a multicollinearity problem. That is why we turn to the PACF plot to get only the most relevant lags.

In the above PACF plot we can see that lags up to 1.7 have good correlation before the plot first crosses the upper confidence bound. This gives our value of p, i.e. the order of the AR process. We can model the given AR process using a linear combination of the first 1.7 lags.
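
To show what the two plots compute, the sketch below implements the sample ACF, the PACF via the Durbin-Levinson recursion, and a count of the leading lags that stay outside an approximate 95% band of ±1.96/sqrt(n). The band formula and the helper names are assumptions for illustration, and the data is a synthetic AR(1) series.

```python
import math
import random


def acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


def pacf(series, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    r = acf(series, max_lag)
    result, prev = [1.0], []  # prev[j] holds phi_{k-1, j+1}
    for k in range(1, max_lag + 1):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        result.append(phi_kk)
    return result


def lags_before_cutoff(correlations, n_obs):
    """Count leading lags (from lag 1) that lie outside the approximate 95% band."""
    band = 1.96 / math.sqrt(n_obs)
    count = 0
    for value in correlations[1:]:
        if abs(value) <= band:
            break
        count += 1
    return count


# Synthetic AR(1) series with coefficient 0.7.
random.seed(1)
ar1 = [0.0]
for _ in range(1999):
    ar1.append(0.7 * ar1[-1] + random.gauss(0, 1))

print([round(v, 2) for v in acf(ar1, 5)])
print([round(v, 2) for v in pacf(ar1, 5)])
print(lags_before_cutoff(pacf(ar1, 5), len(ar1)))
```

For an AR process the ACF decays slowly while the PACF drops off after the true order, which is why the PACF cut-off is used to pick p.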

PHASE2

Power Point Presentation of PHASE-2

Presentation video of PHASE-2

PHASE 3

ARIMA MODEL BUILDING AND PREDICTION

From the plot we can observe the predicted PM2_5 concentration values.

The Trend plot indicates that there are both increasing and decreasing trends overall.

2013 first quarter - 2014 first quarter: almost the same trend

2015 first quarter: a slightly different trend

2013 third quarter - 2014 third quarter: almost the same trend

2016 third quarter - 2017 first quarter: the model's predictions are almost similar to the values from 2013-01 to 2016-06

Our ARIMA time series model performed reasonably well on the PM2.5 concentration air quality data.

MODEL RESULTS:

From the above RMSE values we can say that the model is overfitted.

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model.
The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.

PHASE3.pptx

Power Point Presentation of PHASE-3

LIMITATIONS AND FUTURE WORK

Since the model is overfitted, the next step is to limit and constrain how much detail the model learns.
There are some additional steps that could be explored to improve the result.
One option is to apply a Box-Cox transformation to the original series, use the transformed series as input for the model, and apply a grid search on the transformed dataset to find the optimal parameters.
It is also often appropriate to use AIC for model selection.
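
For reference, the Box-Cox transformation and its inverse can be sketched in a few lines. The lambda value used below is arbitrary; in practice it would be estimated from the data or included in the grid search. The function names are hypothetical.

```python
import math


def boxcox(values, lam):
    """Box-Cox transform; defined only for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def inverse_boxcox(values, lam):
    """Map transformed values (for example, forecasts) back to the original scale."""
    if lam == 0:
        return [math.exp(v) for v in values]
    return [(lam * v + 1) ** (1 / lam) for v in values]


raw = [1.0, 4.0, 9.0]              # illustrative positive readings
transformed = boxcox(raw, lam=0.5)
print(transformed)
print(inverse_boxcox(transformed, lam=0.5))
```

Because the transform needs strictly positive inputs, any zero readings would have to be shifted by a small constant first, and forecasts made on the transformed scale must be passed through the inverse before computing RMSE.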

REFERENCES:

Peixeiro, M. (2021, February 15). The complete guide to time series analysis and forecasting. Retrieved March 08, 2021, from https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775
Time Series Analysis. (n.d.).

https://online.stat.psu.edu/stat510/lesson/3/3.1

https://towardsdatascience.com/time-series-analysis-modeling-validation-386378cd3369
https://github.com/jiwidi/time-series-forecasting-with-python
IEEE Xplore Full-Text PDF:. (n.d.). Retrieved March 07, 2021, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8637825
(n.d.). Retrieved from http://www.ijstr.org/final-print/mar2020/Air-Quality-Prediction-Through-Regression-Model.pdf