DATA 606 Capstone Project
Prof. Ergun Simsek
Predictive Climate Change
Introduction:
To serve as the hub for climate-related information, data, and tools, the World Bank (WB) created the Climate Change Knowledge Portal (CCKP).
The portal provides an online platform for access to comprehensive global, regional, and country data related to climate change and development.
The CCKP provides global data on historical climate, vulnerabilities, and impacts, organized into country, region, and watershed views.
Users can evaluate climate-related vulnerabilities, risks, and actions for a particular location on the globe by interpreting climate and climate-related data at different levels of detail.
Dataset:
The primary data contains various series (indicators that could impact temperature) and their statistics for the years 1990 to 2011 for all 233 countries. The other sheets in this file contain information about countries and their regions, series names and their IDs, the source of the series data, and so on. The data file contains around 13k rows and is around 5 megabytes in size.
The other, supportive data contains monthly average temperature and rainfall values for all countries from 1901 to 2016, split across four different files.
Download Dataset:
1) Primary Data: https://datacatalog.worldbank.org/dataset/climate-change-data#__sid=js3
2) Supportive Data: https://climateknowledgeportal.worldbank.org/download-data
AIM:
The aim is to predict climate changes over the next 10 years and to find abnormalities in temperature, rainfall, or weather. Using the impact of each series on the climate in previous years, the project aims to arrive at predictions that can help prevent unnecessarily severe climate changes in the future.
Phase I:
Introductory video covering the project information, the hypothesis, and detailed information about the dataset, followed by dataset cleaning and exploratory data analysis.
The figure shows the average monthly temperature and rainfall in the United States from 1901 to 2016.
The blue bars indicate the rainfall and the dotted black line shows the average temperature for each month.
The figure shows the aim of the project: to predict the temperature in the United States from 2020 onward for several more years.
The dotted blue line shows the average monthly temperature and the light blue shading suggests the possible range of change in that specific month.
Phase II:
Building the ARIMA model to predict the values that are missing from the dataset, in order to test the hypothesis, and also to forecast temperature and rainfall data for the future.
ARIMA Model:
ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a generalization of the simpler AutoRegressive Moving Average and adds the notion of integration.
AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.
I: Integrated. The use of differencing of raw observations (e.g. subtracting an observation from the observation at the previous time step) in order to make the time series stationary.
MA: Moving Average. A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.
A seasonal ARIMA model is formed by including additional seasonal terms in the ARIMA models.
The seasonal part of the model consists of terms that are similar to the non-seasonal components of the model, but involve backshifts of the seasonal period.
The modeling procedure is almost the same as for non-seasonal data, except that we need to select seasonal AR and MA terms as well as the non-seasonal components of the model.
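To make the idea of a backshift by the seasonal period concrete, the sketch below applies seasonal differencing to monthly data, subtracting from each value the value 12 months earlier. The function name seasonal_difference and the synthetic numbers are assumptions made for illustration; they are not taken from the project code or data.

```python
def seasonal_difference(series, period=12):
    """Subtract the value one seasonal period earlier: y_t - y_{t-period}."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


# Synthetic monthly values: a fixed 12-month pattern plus a rise of 1 per year.
pattern = [0, 1, 3, 6, 10, 14, 16, 15, 12, 8, 4, 1]
monthly = [pattern[m % 12] + (m // 12) for m in range(36)]

deseasonalised = seasonal_difference(monthly, period=12)
print(deseasonalised[:5])  # [1, 1, 1, 1, 1]
```

In this toy series the repeating monthly pattern cancels out and only the year-over-year rise remains, which is the kind of structure the seasonal terms are meant to capture.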
The standard notation is ARIMA(p,d,q), where the parameters are substituted with integer values to quickly indicate the specific ARIMA model being used.
p: the number of lag observations in the model; also known as the lag order.
d: the number of times that the raw observations are differenced; also known as the degree of difference.
q: the size of the moving average window; also known as the order of the moving average.
The forecasting equation is constructed as follows. First, let y denote the dth difference of Y, which means:
If d=0: $y_t = Y_t$
If d=1: $y_t = Y_t - Y_{t-1}$
If d=2: $y_t = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = Y_t - 2Y_{t-1} + Y_{t-2}$
Note that the second difference of Y (the d=2 case) is not the difference from 2 periods ago. Rather, it is the first-difference-of-the-first difference, which is the discrete analog of a second derivative, i.e., the local acceleration of the series rather than its local trend.
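As a small illustration of the cases above, the following Python sketch applies first differencing d times to a list of values and checks the d=2 result against the expanded formula. The function name difference and the sample numbers are made up for this example and do not come from the project.

```python
def difference(series, d=1):
    """Apply first differencing d times: y_t = Y_t - Y_{t-1}, repeated."""
    values = list(series)
    for _ in range(d):
        values = [curr - prev for prev, curr in zip(values, values[1:])]
    return values


# Illustrative values only (a quadratic sequence), not real temperatures.
Y = [1, 4, 9, 16, 25]

first = difference(Y, d=1)   # Y_t - Y_{t-1}
second = difference(Y, d=2)  # first difference of the first difference

# Expanded d=2 formula: Y_t - 2*Y_{t-1} + Y_{t-2}
expanded = [Y[t] - 2 * Y[t - 1] + Y[t - 2] for t in range(2, len(Y))]

print(first)   # [3, 5, 7, 9]
print(second)  # [2, 2, 2]
```

Each round of differencing shortens the series by one value, and for this quadratic sequence the second difference is constant, matching the "local acceleration" reading.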
In terms of y, the general forecasting equation is:
$\hat{y}_t = \mu + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} - \theta_1 e_{t-1} - \dots - \theta_q e_{t-q}$
Here the moving average parameters (θ’s) are defined so that their signs are negative in the equation, following the convention introduced by Box and Jenkins. Some authors and software (including the R programming language) define them so that they have plus signs instead. When actual numbers are plugged into the equation, there is no ambiguity, but it’s important to know which convention your software uses when you are reading the output. Often the parameters are denoted by AR(1), AR(2), …, and MA(1), MA(2), …, etc.
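The sketch below plugs numbers into the forecasting equation for a model with one AR term and one MA term, first with the Box-Jenkins (minus-sign) convention and then with the plus-sign convention. The helper name forecast_next and all coefficient values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def forecast_next(y_lags, e_lags, mu, phi, theta):
    """One-step forecast with the Box-Jenkins sign convention:
    y_hat = mu + sum(phi_i * y_{t-i}) - sum(theta_j * e_{t-j}).
    y_lags and e_lags are ordered most recent first."""
    ar_part = sum(p * y for p, y in zip(phi, y_lags))
    ma_part = sum(th * e for th, e in zip(theta, e_lags))
    return mu + ar_part - ma_part


# Made-up values for a model with p=1 and q=1.
mu, phi, theta = 1.0, [0.5], [0.25]
y_lags = [4.0]  # y_{t-1}
e_lags = [2.0]  # e_{t-1}

y_hat = forecast_next(y_lags, e_lags, mu, phi, theta)
print(y_hat)  # 1.0 + 0.5*4.0 - 0.25*2.0 = 2.5

# Software using plus signs would report this MA coefficient as -0.25;
# plugging that into the plus-sign form gives the same forecast.
theta_plus = [-t for t in theta]
y_hat_plus = mu + phi[0] * y_lags[0] + theta_plus[0] * e_lags[0]
print(y_hat_plus)  # 2.5
```

The forecast is identical under both conventions; only the sign of the reported MA coefficient differs.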
Phase III:
After training the model to a higher accuracy, it is run again on each result to get the next prediction, and this is repeated until the year 2026.
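The idea of feeding each result back in to obtain the next prediction can be sketched as follows. This is a simplified stand-in that uses only autoregressive terms with hand-picked coefficients; the function name recursive_forecast, the coefficients, and the history values are assumptions for illustration, not the fitted model from this project.

```python
def recursive_forecast(history, mu, phi, steps):
    """Forecast `steps` values ahead with an AR(p) rule, feeding each
    forecast back in as if it were an observation."""
    window = list(history)
    forecasts = []
    for _ in range(steps):
        lags = window[::-1][:len(phi)]  # most recent value first
        next_value = mu + sum(p * y for p, y in zip(phi, lags))
        forecasts.append(next_value)
        window.append(next_value)  # the forecast becomes an input
    return forecasts


# Hypothetical coefficients and yearly values, for illustration only.
history = [10.0, 12.0]
forecasts = recursive_forecast(history, mu=2.0, phi=[0.5], steps=3)
print(forecasts)  # [8.0, 6.0, 5.0]
```

Because every step builds on earlier forecasts rather than on observed values, errors can accumulate as the horizon grows, so later predictions generally deserve less trust than earlier ones.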
After merging both datasets, I have final data from 1990 to 2026, in which all series values can be compared with the yearly average temperature values for a specific country.
As shown in the graph, plotting the CO2 emissions series against the temperature for the United States shows that as CO2 emissions increase, the temperature also increases.
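A minimal sketch of this merging step is shown below: monthly temperatures are averaged per year and then joined with a yearly series on the year. The row layout, the function names yearly_average and merge_by_year, and every number are hypothetical; the real files have their own column layout.

```python
from collections import defaultdict
from statistics import mean


def yearly_average(monthly_rows):
    """Average monthly (year, month, value) rows into {year: mean value}."""
    by_year = defaultdict(list)
    for year, _month, value in monthly_rows:
        by_year[year].append(value)
    return {year: mean(values) for year, values in by_year.items()}


def merge_by_year(series_by_year, temperature_by_year):
    """Keep only years present in both inputs, as (year, series, temp)."""
    shared = sorted(set(series_by_year) & set(temperature_by_year))
    return [(y, series_by_year[y], temperature_by_year[y]) for y in shared]


# Made-up rows, for illustration only.
monthly_temps = [(1990, 1, 2.0), (1990, 2, 4.0), (1991, 1, 3.0), (1991, 2, 5.0)]
co2_by_year = {1990: 19.0, 1991: 19.5, 1992: 20.0}

merged = merge_by_year(co2_by_year, yearly_average(monthly_temps))
print(merged)  # [(1990, 19.0, 3.0), (1991, 19.5, 4.0)]
```

Years that appear in only one of the two inputs (1992 here) are dropped, so the merged rows always pair a series value with a temperature value.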
Applications of ARIMA Model:
In business and finance, the ARIMA model can be used to forecast future quantities (or even prices) based on historical data. Therefore, for the model to be reliable, the data must be reliable and must cover a relatively long time span. Some of the applications of the ARIMA model in business are listed below:
Forecasting the quantity of a good needed for the next time period based on historical data.
Forecasting sales and interpreting seasonal changes in sales.
Estimating the impact of marketing events, new product launches, and so on.
Limitations of ARIMA Model:
Although ARIMA models can be highly accurate and reliable under the appropriate conditions and data availability, one of the key limitations of the model is that the parameters (p, d, q) need to be manually defined; therefore, finding the most accurate fit can be a long trial-and-error process.
Similarly, the model depends highly on the reliability of historical data and the differencing of the data. It is important to ensure that data was collected accurately and over a long period of time so that the model provides accurate results and forecasts.
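One commonly cited rule of thumb for narrowing the manual search over d is to prefer the order of differencing at which the standard deviation of the differenced series is lowest. The sketch below applies that heuristic. It is only one possible aid and not the procedure used in this project; the function names difference and choose_d and the synthetic series are assumptions for illustration.

```python
from statistics import pstdev


def difference(series, d):
    """Apply first differencing d times."""
    values = list(series)
    for _ in range(d):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


def choose_d(series, max_d=2):
    """Return the d in 0..max_d whose differenced series has the lowest
    standard deviation, together with all the measured spreads."""
    spreads = {d: pstdev(difference(series, d)) for d in range(max_d + 1)}
    return min(spreads, key=spreads.get), spreads


# Synthetic series: a linear trend plus a small alternating wobble.
series = [2 * t + (0.5 if t % 2 else -0.5) for t in range(20)]

best_d, spreads = choose_d(series)
print(best_d)  # 1
```

Here one round of differencing removes the trend, while a second round only amplifies the wobble, so d=1 is selected. The orders p and q would still need to be chosen separately.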
What Next?:
Look for any anomalies in the temperature or rainfall predicted by the model, identify which variables cause the most fluctuation in temperature and rainfall in the coming years, and try to prevent them.
Try to get the latest data, up to 2019 or 2020, so that the model can be made more accurate and predict more accurate values.
Improve the model to catch minor abnormalities and use that knowledge to predict accurate values.
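One simple way to start looking for anomalies in predicted values is to flag points that lie far from the mean in units of standard deviation (a z-score). The sketch below is only one possible approach and is not part of the current project; the function name flag_anomalies, the threshold of 2.0, and the sample temperatures are all assumptions for illustration.

```python
from statistics import mean, pstdev


def flag_anomalies(values, threshold=2.0):
    """Return indices whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # a constant series has no outliers by this measure
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Made-up yearly temperatures with one obvious spike at index 5.
temps = [9.0, 9.1, 8.9, 9.0, 9.1, 12.0, 9.0, 8.9, 9.1, 9.0]
print(flag_anomalies(temps))  # [5]
```

A fixed threshold like this would miss minor abnormalities, so it is better read as a first filter than as a finished detector.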
Conclusion:
From two different datasets (temperature and series data values), I was able to predict temperature and rainfall for any country until 2026 (I already have data until 2016, and the ARIMA model was used for the next 10 predictions, over which it can give reasonably accurate values).
Both datasets were then predicted until 2016 using the AutoRegressive Integrated Moving Average (ARIMA) model, in which every forecast (output prediction) is used as input to the next prediction.
Finally, by combining both datasets, I obtained the relationship between the temperature and the series values, and then computed Pearson's r coefficient to find strong positive or negative relationships between the two. For example, there is a positive relationship between CO2 emissions and temperature increase in the United States over the last 20 years.
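For reference, Pearson's r can be computed from its definition as in the sketch below. The function name pearson_r and the sample values are made up for illustration; they are not the project's CO2 or temperature data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up numbers, for illustration only.
co2 = [1.0, 2.0, 3.0, 4.0, 5.0]
temp_up = [10.0, 10.5, 11.0, 11.5, 12.0]    # rises with co2
temp_down = [12.0, 11.5, 11.0, 10.5, 10.0]  # falls as co2 rises

print(round(pearson_r(co2, temp_up), 6))    # 1.0
print(round(pearson_r(co2, temp_down), 6))  # -1.0
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 a weak one. A high r shows that two series move together, which on its own does not establish that one causes the other.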