WEATHER FORECASTING

Vishal Panuganti

GitHub Repository : https://github.com/datajunkie-v/Data-606/

Introduction

Weather is a powerful natural phenomenon and has always been hard to predict. Traditional methods, such as barometer readings and nowcasting, are useful for predicting weather over a short range. Weather forecasting benefits many human activities, such as air travel, agriculture, marine operations, avoiding catastrophes, and even planning an event. With advancements in artificial intelligence and supercomputers, weather forecasting is improving significantly.

Traditional forecasting methods based on physical models of the atmosphere lose accuracy over long forecast periods. In this project, we explore the application of machine learning and statistical models to potentially obtain accurate long-range weather forecasts. The scope of this project is restricted to maximum temperature, minimum temperature, and precipitation.

Data Set

Source

GHCN (Global Historical Climatology Network) Daily Summaries is a database that addresses the need for historical daily temperature, precipitation, and snow records over global land areas. GHCN-Daily is a composite of climate records from numerous sources that were merged and then subjected to a suite of quality assurance reviews. The archive includes over 40 meteorological elements, including daily maximum and minimum temperature, temperature at observation time, precipitation, snowfall, snow depth, evaporation, wind movement, wind maximums, soil temperature, and cloudiness.

GHCN-Daily will serve as a replacement product for older NCDC-maintained data sets that are designated for daily temporal resolution. It will function as the official archive for daily data from the Global Climate Observing System (GCOS) Surface Network (GSN) and is particularly well suited for monitoring and assessment activities related to the frequency and magnitude of extremes. Containing observations of one or more of the above elements at more than 100,000 stations that are distributed across all continents, the dataset is the world's largest collection of daily climatological data. The total of 1.4 billion data values includes 250 million values each for maximum and minimum temperatures, 500 million precipitation totals, and 200 million observations each for snowfall and snow depth. Station records, some of which extend back to the 19th century, are updated daily where possible and are usually available one to two days after the date and time of the observation.

DATASET SAMPLE

Literature Review

Related work covers many different techniques for weather forecasting. While much of current forecasting technology involves simulations based on physics and differential equations, many newer approaches from artificial intelligence rely mainly on machine learning techniques, mostly neural networks, while some draw on probabilistic models such as Bayesian networks. [1]

In [1], two machine learning algorithms were implemented: linear regression and a variation of functional regression. A corpus of historical weather data for Stanford, CA was obtained and used to train these algorithms. The input to these algorithms was the weather data of the past two days, which included the maximum temperature, minimum temperature, mean humidity, mean atmospheric pressure, and weather classification for each day. The output was the maximum and minimum temperatures for each of the next seven days.

The first algorithm that was used was linear regression, which seeks to predict the high and low temperatures as a linear combination of the features. Since linear regression cannot be used with classification data, this algorithm did not use the weather classification of each day. As a result, only eight features were used: the maximum temperature, minimum temperature, mean humidity, and mean atmospheric pressure for each of the past two days.

The second algorithm that was used was a variation of functional regression, which searches for historical weather patterns that are most similar to the current weather patterns, then predicts the weather based upon these historical patterns.
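The pattern-matching idea described above can be illustrated with a small nearest-analog sketch. This is not the method from [1]; it is a simplified stand-in that assumes a single daily temperature series, Euclidean distance between windows, and hypothetical names (`analog_forecast`, `window`, `horizon`, `k`). The numbers are made up for illustration.

```python
import math


def analog_forecast(history, window=2, horizon=7, k=3):
    """Forecast `horizon` values by averaging what followed the k past
    windows most similar to the most recent `window` observations."""
    current = history[-window:]
    # A candidate window must be followed by a full `horizon` of observations.
    last_start = len(history) - window - horizon
    if last_start < 0:
        raise ValueError("history is too short for this window and horizon")

    candidates = []
    for start in range(last_start + 1):
        past = history[start:start + window]
        candidates.append((math.dist(past, current), start))

    nearest = sorted(candidates)[:k]  # smallest distances first
    forecast = []
    for step in range(horizon):
        followers = [history[start + window + step] for _, start in nearest]
        forecast.append(sum(followers) / len(followers))
    return forecast


# Made-up, perfectly periodic "temperatures" to show the mechanics.
history = [10, 12, 14, 12] * 10
print(analog_forecast(history))
```

With real data the nearest windows will not match exactly, so the forecast becomes an average over several similar historical episodes.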

Phase 1: Exploratory Data Analysis

Temperature (Minimum and Maximum)

I used data from 2010 to 2020, with key variables including precipitation, maximum temperature, minimum temperature, date, altitude, and station name. The data comes from multiple weather stations across Maryland.

This line plot shows the minimum and maximum temperature over the decade, with the highest value reaching almost 110 and the lowest about -5.

Precipitation

Precipitation is any product of the condensation of atmospheric water vapor that falls under gravitational pull from clouds. The main forms of precipitation include drizzle, rain, sleet, snow, ice pellets, graupel, and hail.

The graph shows precipitation in Laurel, MD over the decade. Precipitation and Temperature are key factors in planning for agriculture.

SEASONAL DECOMPOSE

Trend:

This refers to the overall direction of the data. It is a pattern in data that shows the movement of a series to relatively higher or lower values over a long period of time. In other words, trend is observed when there is an increasing or decreasing slope in the time series.

Seasonality:

Seasonality is the periodic component of the series. A seasonal effect is systematic and calendar-related. Examples include the sharp rise in most retail series around December in response to the Christmas period, or an increase in water consumption in summer due to warmer weather.

Cyclical:

A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency. Fluctuations with a fixed, calendar-related frequency are seasonal; those without one are cyclic.

Residual:

The “residuals” in a time series model are what is left over after fitting a model. This can also be referred to as the "noise".
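To make the components above concrete, here is a minimal sketch of a classical additive decomposition (series = trend + seasonal + residual). It is a dependency-free illustration, not necessarily the routine used in this project; the function names and the synthetic series are assumptions. The trend is a centered moving average, the seasonal component is the average detrended value at each position in the period, and the residual is what remains.

```python
def centered_moving_average(values, period):
    """Centered moving average; None where the window does not fit."""
    half = period // 2
    trend = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        if period % 2 == 0:
            # Even period: give the two end points half weight each.
            total = sum(window) - 0.5 * (window[0] + window[-1])
        else:
            total = sum(window)
        trend[i] = total / period
    return trend


def decompose_additive(values, period):
    """Split a series into trend, seasonal and residual components."""
    if len(values) < 2 * period:
        raise ValueError("need at least two full periods of data")
    trend = centered_moving_average(values, period)

    # Average the detrended values separately for each position in the period.
    buckets = [[] for _ in range(period)]
    for i, (value, level) in enumerate(zip(values, trend)):
        if level is not None:
            buckets[i % period].append(value - level)
    means = [sum(bucket) / len(bucket) for bucket in buckets]
    centre = sum(means) / period  # force the seasonal pattern to sum to zero
    pattern = [m - centre for m in means]

    seasonal = [pattern[i % period] for i in range(len(values))]
    residual = [
        None if level is None else value - level - season
        for value, level, season in zip(values, trend, seasonal)
    ]
    return trend, seasonal, residual


# Synthetic example: linear trend plus a repeating pattern of length 4.
pattern = [2.0, -1.0, -3.0, 2.0]
series = [0.5 * i + pattern[i % 4] for i in range(24)]
trend, seasonal, residual = decompose_additive(series, period=4)
print(seasonal[:4])
```

For daily weather data with a yearly cycle, a period of about 365 would be the natural choice.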

Minimum and Maximum Temperature Distribution

CORRELATION and AUTO CORRELATION

The graphs below show the correlation between the variables in the dataset. There seems to be no strong correlation between the variables. We will only be using the date and historic values going forward to forecast weather.
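For reference, the two quantities in this section can be computed as follows. This is a minimal, dependency-free sketch; the function names are hypothetical and the inputs are made up. `pearson` measures the linear relationship between two variables, and `autocorrelation` measures how strongly a series is related to a lagged copy of itself.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


# Made-up values: two perfectly related variables, and an alternating series.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(autocorrelation([1, -1] * 10, lag=2))
```

A slowly decaying autocorrelation across many lags is a common sign of trend or seasonality in the series.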

Rolling standard deviation is a statistical measurement of volatility. It indicates how much the temperature fluctuates within each window of time.

Moving average/Rolling Mean is a calculation to analyze data points by creating a series of averages of different subsets of the full data set.

The time series data can be considered stationary if the rolling statistics remain constant with time.
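The rolling statistics described above can be sketched in a few lines. This is a minimal illustration with hypothetical helper names and made-up data; it assumes a window of at least two points so that a standard deviation is defined. Comparing how far the rolling mean drifts gives a rough visual-style check of stationarity; formal tests such as the Augmented Dickey-Fuller test are a more rigorous alternative.

```python
import statistics


def rolling(values, window, stat):
    """Apply `stat` to every full window of `window` consecutive values."""
    return [
        stat(values[i - window + 1:i + 1])
        for i in range(window - 1, len(values))
    ]


def rolling_mean(values, window):
    return rolling(values, window, statistics.fmean)


def rolling_std(values, window):
    return rolling(values, window, statistics.stdev)


def mean_drift(values, window):
    """Range of the rolling mean; near zero suggests a stable level."""
    means = rolling_mean(values, window)
    return max(means) - min(means)


# Made-up series: one with a steady level, one with an upward trend.
flat = [20, 21, 19, 20, 21, 19, 20, 21, 19]
trending = [v + i for i, v in enumerate(flat)]
print(mean_drift(flat, 3), mean_drift(trending, 3))
```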

Hypothesis / Research Question(s)

Does adding variables to a model that uses only time variables (year, month, and day) affect the model's performance?
How do different models compare when forecasting time series data?

Phase 2: Model Construction & Implementation

The machine learning models used in this project are:

Random Forest Regression
Decision Tree Regression
ARIMA

Random Forest Regression

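A random regression forest is an ensemble of $M$ randomized regression trees. Denote by $m_n(x; \Theta_j, D_n)$ the predicted value at point $x$ by the $j$-th tree, where $\Theta_1, \dots, \Theta_M$ are independent random variables, distributed as a generic random variable $\Theta$, independent of the sample $D_n$. The forest prediction is the average of the $M$ tree predictions: $m_{M,n}(x) = \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, D_n)$.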

I used day, month and year as the features for the Random Forest Regression model. I added temperature as an additional variable to see if it would improve the performance of the model.
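As an illustration of this feature set, the sketch below turns dated observations into a day/month/year feature matrix, with temperature optionally appended as a fourth column. The record layout, the helper name `date_features`, and all values are assumptions made for the example, not data from the project.

```python
from datetime import date


def date_features(d, extra=None):
    """Return [day, month, year], optionally followed by one extra feature."""
    features = [d.day, d.month, d.year]
    if extra is not None:
        features.append(extra)
    return features


# Made-up records: (date, maximum temperature, precipitation).
records = [
    (date(2010, 1, 1), 41.0, 0.00),
    (date(2010, 1, 2), 38.0, 0.12),
    (date(2010, 7, 4), 93.0, 0.00),
]

# Time-only features for predicting precipitation.
X_time = [date_features(d) for d, _, _ in records]
# Same features with temperature added as an additional variable.
X_time_temp = [date_features(d, extra=tmax) for d, tmax, _ in records]
y_precip = [precip for _, _, precip in records]

print(X_time[0], X_time_temp[0])
```

Either matrix can then be passed, together with the target list, to whichever regression model is being compared.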

Predicting precipitation (temp as an additional feature)

Predicting precipitation using day, month and year

Predicting Maximum Temperature

The Random Forest Regression model did not perform well in predicting precipitation when temperature was added as an additional feature, and its performance was not significantly different without that feature.

However, the model had a mean accuracy of 0.81 when predicting the maximum temperature.

Decision Tree Regression

A decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
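The splitting process described above can be sketched as a tiny regression tree. This is a simplified, dependency-free illustration with hypothetical function names and made-up data, not the library implementation used for the results in this report. Each decision node picks the feature and threshold that minimise the summed squared error of the two resulting subsets, and each leaf predicts the mean target of its subset.

```python
def squared_error(targets):
    """Sum of squared deviations from the mean."""
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets)


def best_split(rows, targets):
    """Return (feature_index, threshold) that most reduces the error, or None."""
    best, best_error = None, squared_error(targets)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[feature] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[feature] > threshold]
            if not left or not right:
                continue
            error = squared_error(left) + squared_error(right)
            if error < best_error - 1e-12:
                best, best_error = (feature, threshold), error
    return best


def build_tree(rows, targets, depth=0, max_depth=3):
    """Recursively split the data; leaves store the mean target."""
    split = best_split(rows, targets) if depth < max_depth else None
    if split is None:
        return {"leaf": sum(targets) / len(targets)}
    feature, threshold = split
    left = [(r, t) for r, t in zip(rows, targets) if r[feature] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [t for _, t in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [t for _, t in right], depth + 1, max_depth),
    }


def predict(tree, row):
    """Follow decision nodes until a leaf is reached."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


# Made-up data: month as the only feature, maximum temperature as the target.
months = [[1], [2], [3], [6], [7], [8]]
temps = [40.0, 42.0, 50.0, 85.0, 90.0, 88.0]
tree = build_tree(months, temps)
print(predict(tree, [7]))
```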

The decision tree algorithm belongs to the family of supervised learning algorithms. Like several other supervised learning algorithms, it can be used for solving both regression and classification problems.

Precipitation

Maximum Temperature using Decision Tree Regression

Autoregressive Integrated Moving Average Model (ARIMA)

An ARIMA model belongs to a class of statistical models for analyzing and forecasting time series data. It explicitly caters to a suite of standard structures in time series data, and as such provides a simple yet powerful method for making skillful time series forecasts. ARIMA is a generalization of the simpler Autoregressive Moving Average (ARMA) model and adds the notion of integration, that is, differencing the series to make it stationary.
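To show how the autoregressive and integration parts fit together, here is a minimal sketch of the simplest non-trivial case, ARIMA(1,1,0) without a constant: difference the series once, fit a single autoregressive coefficient to the differences by least squares, then forecast the differences and add them back up. The moving-average part is omitted (q = 0), the function names are hypothetical, and the data is made up; the model orders used in this project may differ.

```python
def difference(series, lag=1):
    """First-order differencing: y[t] - y[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def fit_ar1(values):
    """Least-squares estimate of phi in values[t] = phi * values[t - 1] + noise."""
    num = sum(values[t] * values[t - 1] for t in range(1, len(values)))
    den = sum(values[t - 1] ** 2 for t in range(1, len(values)))
    return num / den


def forecast_arima_110(series, steps):
    """Forecast `steps` values with an ARIMA(1,1,0) model (no constant)."""
    diffs = difference(series)
    phi = fit_ar1(diffs)
    level, last_diff = series[-1], diffs[-1]
    forecasts = []
    for _ in range(steps):
        last_diff = phi * last_diff  # AR(1) step on the differenced series
        level += last_diff           # undo the differencing ("integration")
        forecasts.append(level)
    return forecasts


# Made-up series whose increases shrink by half each step.
series = [0, 8, 12, 14, 15]
print(forecast_arima_110(series, steps=2))
```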

Autocorrelation

Residual

Residual Error Density Plot

This is the density plot of the residual error values, suggesting the errors are Gaussian, but may not be centered on zero.
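A quick numerical companion to the density plot is to summarise the residuals directly. The sketch below uses made-up expected and predicted values and a hypothetical helper name; a residual mean far from zero points to a systematic bias in the forecasts, which is what an off-center density plot suggests.

```python
import math
import statistics


def residual_summary(expected, predicted):
    """Mean, standard deviation and RMSE of the forecast residuals."""
    residuals = [e - p for e, p in zip(expected, predicted)]
    return {
        "mean": statistics.fmean(residuals),
        "std": statistics.stdev(residuals),
        "rmse": math.sqrt(statistics.fmean(r * r for r in residuals)),
    }


# Made-up values: every forecast is one degree too low.
expected = [70.0, 72.0, 75.0, 71.0]
predicted = [69.0, 71.0, 74.0, 70.0]
print(residual_summary(expected, predicted))
```

When the mean residual is clearly non-zero, adding it to future forecasts is one simple bias correction to try.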

PREDICTED VS. EXPECTED

SARIMAX

Seasonal Autoregressive Integrated Moving Average (SARIMA) is an extension of ARIMA that adds parameters for the seasonal component, including the period of seasonality. SARIMAX additionally allows exogenous regressors.
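The role of the seasonal period can be illustrated with seasonal differencing, the seasonal counterpart of ordinary differencing. The sketch below is a deliberately minimal special case, a seasonal random walk with drift (one seasonal difference, no autoregressive or moving-average terms), not the full SARIMAX model fitted in this project. Function names and data are made up for illustration.

```python
def seasonal_difference(series, period):
    """Subtract the value one full season earlier: y[t] - y[t - period]."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


def seasonal_drift_forecast(series, period, steps):
    """Forecast each step as the value one season back plus the average
    season-over-season change."""
    diffs = seasonal_difference(series, period)
    drift = sum(diffs) / len(diffs)
    extended = list(series)
    for _ in range(steps):
        extended.append(extended[-period] + drift)
    return extended[len(series):]


# Made-up series: a pattern of length 3 that rises by 1 every season.
series = [1, 5, 3, 2, 6, 4, 3, 7, 5]
print(seasonal_difference(series, period=3))
print(seasonal_drift_forecast(series, period=3, steps=3))
```

For daily temperature data with a yearly cycle, the period would be about 365 observations.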

Predicting Maximum temperature with SARIMAX

Phase 3: Conclusion & Future work

Day, month, and year were the only features selected to train the machine learning models. For the Random Forest model, one additional feature was added, which did not make a significant difference in the model's performance. Overall, ARIMA had a lower error than the other two machine learning models.

SARIMAX was also trained on the weather data. Among the models considered, SARIMAX is the best suited to time series analysis, as it captures both trend and seasonality.

Future Work:

Current weather forecasting procedures are highly accurate over a short range, while statistical models can be used for long-range predictions. Adding more variables to the existing models or implementing deep learning models may improve accuracy for long-range forecasts.

References

1. Machine Learning Applied to Weather Forecasting - Mark Holmstrom, Dylan Liu, Christopher Vo, Dec 15, 2016

2. https://towardsdatascience.com/weather-forecasting-with-machine-learning-using-python-55e90c346647

3. https://stackoverflow.com/questions/52329220/convert-dataframe-to-series-for-multiple-column

4. https://towardsdatascience.com/machine-learning-part-19-time-series-and-autoregressive-integrated-moving-average-model-arima-c1005347b0d7

5. https://machinelearningmastery.com/sarima-for-time-series-forecasting-in-python/

6. https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/

7. https://medium.com/analytics-vidhya/time-series-forecasting-sarima-vs-auto-arima-models-f95e76d71d8f

Links

LinkedIn: https://www.linkedin.com/in/vishal-panuganti-8874401b4/

GitHub: https://github.com/datajunkie-v/