ARIMA

ARIMA is a method for predicting time series.

The following notes cover some basics of time series and ARIMA modeling.

A time series (TS) is said to be stationary if its mean, standard deviation (and possibly other statistical properties) stay constant over time.

A lot of time series models assume that the TS is stationary. The assumption is that if a TS has shown a particular behaviour over time, there is a high chance it will keep following the same behaviour in the future. For example, traffic patterns over time, water and energy consumption, etc. The stock market is probably not a good example, as there are too many factors that make it almost random (arguably).

Also, the theory for stationary series is more mature and easier to work with than for non-stationary series.

The Dickey-Fuller test can be used to check stationarity: it tests the null hypothesis that the series has a unit root (i.e. is non-stationary), so a small p-value suggests the series is stationary.
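A minimal sketch of running the test with statsmodels, assuming ts is a one-dimensional Pandas Series:

  from statsmodels.tsa.stattools import adfuller

  # adfuller returns (test statistic, p-value, #lags used, #observations, critical values, ...)
  result = adfuller(ts.dropna(), autolag='AIC')
  print('ADF statistic: %f' % result[0])
  print('p-value: %f' % result[1])
  # rule of thumb: p-value < 0.05 -> reject the unit root hypothesis, treat the series as stationary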

Why does traditional analysis need the time series to be stationary?

A stationary ts indicates that the mean & stddev are the same over time.

This guarantees that if you take N points to make a forecast, and later take another N points to make another forecast, the linear relationship between the N points and the forecast stays the same.

The original N points and the later N points have to come from the same distribution for the same linear model to be valid.

For example, consider a time series of how many burgers you can buy with $10 from 1990 to 2020. Back in the 1990s, $10 gets you 2 burgers, but in 2020 you can afford only 1 burger with $10 due to inflation.

Although there is fluctuation within each year, the time series has a downward trend, i.e. the mean goes down.

Now you train a linear model on the 1990s data, which gives you roughly #burgers = (#burgers in 1990 + #burgers in 1991 + ... + #burgers in 1999)/10, which is about 2.

When you use the same linear model on the 2020 data, it will be way off, because the #burgers you can now afford with $10 is around 1.

It is off because the mean #burgers in the 1990s is 2 while the mean #burgers around 2020 is 1: the distributions are different.

To address the problem, consider instead the difference in #burgers between consecutive years; you would find the year-on-year change is roughly a constant percentage, which is essentially the inflation rate.

So by differencing, you transform the time series of #burgers into a time series of burger_difference, which is much closer to stationary (assuming the inflation rate stays the same).

Now you can build a linear model to predict the burger_difference, which works OK for both the 1990s and the 2020s.

Adding the predicted burger_differences (accumulated over the years) back onto the base #burgers in 1990 then gives you the prediction of #burgers in 2020.

The function is something like y = w*x + b, where b is the base #burgers in 1990, w is -inflation_rate, and x is the number of years since 1990.
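A tiny sketch of the idea with made-up numbers (a hypothetical constant 3% yearly decline), showing that differencing and then cumulatively adding the differences back onto the 1990 base recovers the original series:

  import numpy as np
  import pandas as pd

  years = pd.RangeIndex(1990, 2021)
  burgers = pd.Series(2.0 * 0.97 ** np.arange(len(years)), index=years)  # made-up: 3% fewer burgers per year

  burger_difference = burgers.diff()                       # year-on-year change, roughly stationary
  rebuilt = burgers.iloc[0] + burger_difference.cumsum()   # add the differences back onto the 1990 base
  rebuilt.iloc[0] = burgers.iloc[0]                        # the first year has no difference

  print(np.allclose(rebuilt, burgers))                     # True: differencing is invertible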

Although the stationarity assumption is made in many models, very few practical time series are actually stationary.

However, there are ways to convert a time series into a stationary one, or at least close enough to stationary.

There are 2 main reasons behind non-stationarity:

1. Trend

   The mean grows or drops over time, e.g. GDP, consumer spending, etc.

2. Seasonality

   Fluctuations that repeat within a cycle, e.g. ice cream sales within a year.

The general approach is to remove the trend and seasonality from a time series, build a model, and then convert the model's forecast back to the original scale by adding the trend and seasonality back. This is a bit like Batch Normalisation.

Assuming a time series ts in Pandas, view the data along with its moving average & moving standard deviation.

If the moving average keeps going up / down, it means there is an underlying trend.

  moving_avg = ts.rolling(window=12).mean()

  moving_std = ts.rolling(window=12).std()

  plt.plot(ts, color='blue')

  plt.plot(moving_avg, color='yellow')

  plt.plot(moving_std, color='red')

One can transform a time series using log, square root, cube root, etc. to reduce the trend a little, and then view the time series and moving avg/std again.

  ts_log = numpy.log(ts)

  

Again, calculate the moving average of ts_log to estimate the underlying trend.

  moving_avg = ts_log.rolling(window=12).mean()

Sometimes an exponentially weighted moving average is used to represent the trend. Exponential weighting means bigger weights for more recent data; the weights decay exponentially going back in time.

  exp_moving_avg = ts_log.ewm(halflife=12).mean()

  

Subtract the trend from the time series

  ts_log_less_trend = ts_log - moving_avg

  or

  ts_log_less_trend = ts_log - exp_moving_avg

  

Then plot the moving average and moving std for the ts less trend.

The de-trended series' moving mean should now be flatter (closer to constant). However, this simple method of subtracting the trend doesn't work, or doesn't work well, in many cases.

There are other methods that are a little trickier.

Differencing

This looks at the speed of growth (the change from the previous period) by taking the difference between the current period and the previous period.

  ts_diff = ts_log - ts_log.shift(periods=1)

Note that this still uses the log of ts; otherwise the moving mean will be flat but the moving std will keep growing. The log suppresses the growing amplitude (deviation) of the time series.

Plot ts_diff and its moving average; it should be much flatter than before.

Decomposing

If there is seasonal behavior, the series can be decomposed into a base trend + a seasonal fluctuation + a residual.

We can model them separately. E.g. the base trend of sales is driven by GDP growth, the seasonal movements are caused by temperature, and the residual is harder to explain but may be caused by ad hoc incentives, campaigns, random behavior, etc.

We can do differencing first (or not) and then decompose. Build models for the base trend and the seasonal movements, and possibly the residuals.

When predicting the future, simply add up the future trend value, the future seasonal value and the future residual.

If we are lucky and the time series is strictly stationary and a value within a seasonal cycle doesn't depend on any previous value, i.e. the next value has no relation to previous values, the modeling is easy, but this is rare. Usually there is a dependency, so we need some modeling techniques.


Auto Regressive Integrated Moving Average (ARIMA)

There are 3 components:

AR: Autoregression

    A model that uses the dependent relationship between an observation and some lagged observations.

I: Integrated

    The use of differencing of raw observations (e.g. subtracting an observation from a previous observation) to make the time series stationary.

MA: Moving Average

    A model that uses the dependency between an observation, and a residual error from a moving average model applied to lagged observations.

The 3 components correspond to the parameters of the ARIMA model, ARIMA(p, d, q):

p: Number of AR (Auto-Regressive) terms

   The number of lag observations used as predictors. E.g. if p is 5, the predictors for x(t) will be x(t-1), ..., x(t-5).

d: Number of Differences

   The number of times that the raw observations are differenced, also called the degree of differencing.

   i.e. difference, difference of the difference, etc., similar to 1st and 2nd order derivatives.

   We can either difference ts ourselves as data preparation and set d=0, or pass the original ts and set d=1; both should produce essentially the same results.

q: Number of MA (Moving Average) terms

   MA terms are lagged forecast errors in the prediction equation. For instance, if q is 5, the predictors for x(t) will be e(t-1), ..., e(t-5), where e(i) is the forecast error at step i, i.e. the difference between the model's prediction at i and the actual value at i. So the model needs to keep track of the prediction error at every step.

ARIMA constructs a linear regression model using the previous observations and the residual errors from the moving average part; the differencing is used as data preparation to make the time series stationary.
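In the same spirit as the earlier y = wx + b example, the prediction equation looks roughly like the following, where y(t) is the series after differencing d times, the a's and b's are learned coefficients, c is a constant, and e(t) is the forecast error at time t:

  y(t) = c + a1*y(t-1) + ... + ap*y(t-p) + b1*e(t-1) + ... + bq*e(t-q) + e(t)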

When a p/d/q parameter is set to zero, ARIMA takes the corresponding component out. So if d = 0 there is no differencing and ARIMA becomes ARMA; if both p and d are 0, it becomes a pure MA model.

To determine q, i.e. the number of MA terms, check the Autocorrelation Function (ACF) between the current and lag observations.

    from pandas.plotting import autocorrelation_plot

    autocorrelation_plot(ts)

Pick the number of lags that is still significantly correlated: the lag where the plot first crosses the upper confidence interval.

To determine p, i.e. the number of AR terms, check the Partial Autocorrelation Function (PACF).

The PACF measures the correlation between the current and lag observations after eliminating the variation already explained by the intervening lags. E.g. at lag 5, it checks the correlation but removes the effects already explained by lags 1 to 4. Again, pick the lag where the PACF first crosses the upper confidence interval.

Check 'from statsmodels.tsa.stattools import acf, pacf' for calculating the ACF and PACF values directly.
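statsmodels also provides ready-made ACF/PACF plots with confidence bands; a minimal sketch (assuming ts is the series being analysed):

  import matplotlib.pyplot as plt
  from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

  # ACF plot with confidence band -> suggests q
  plot_acf(ts, lags=40)
  # PACF plot with confidence band -> suggests p
  plot_pacf(ts, lags=40)
  plt.show()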

An example Python script to train an ARIMA model is:

  from statsmodels.tsa.arima_model import ARIMA

  p=20

  d=1 #differencing once

  q=0 #dont use MA for now

  model = ARIMA(ts_log, order=(p,d,q))

  model_fit = model.fit(disp=0)

  print(model_fit.summary())
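Note that statsmodels.tsa.arima_model.ARIMA is an older API that has been deprecated and removed in newer statsmodels versions; a rough equivalent with the newer API (an assumption not used in the original script) is:

  from statsmodels.tsa.arima.model import ARIMA

  model = ARIMA(ts_log, order=(p, d, q))
  model_fit = model.fit()   # the new fit() takes no disp argument
  print(model_fit.summary())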

After that, one usually needs to check the residuals.

  # plot residual errors, to see anything else (obvious trend) to learn

  residuals = pd.DataFrame(model_fit.resid)

  residuals.plot()

  # also plot the distribution of error using kde, kernel density estimation

  # pure noise would be a mean 0 normal distribution

  residuals.plot(kind='kde')

  #further check the mean, std, etc.

  print(residuals.describe())

Then make a forecast for the future

  #make a forecast for 2 steps ahead

  #output[0] is the forecasts for every future step

  #output[1] is the std error

  #output[2] is the confidence interval

  output = model_fit.forecast(steps = 2)
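Since the model was fit on ts_log, the forecasts come out on the log scale; a minimal sketch of converting them back to the original scale (assuming numpy is imported as np):

  forecast_log = output[0]
  forecast_original = np.exp(forecast_log)   # undo the log transform
  print(forecast_original)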

  

A full Python script for testing:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from statsmodels.tsa.seasonal import seasonal_decompose

from statsmodels.tsa.stattools import adfuller

from statsmodels.tsa.arima_model import ARIMA

from pandas.plotting import autocorrelation_plot

#plot the moving average and moving std of a time series

def plot_moving_avg_std(ts):

    moving_avg = ts.rolling(window=12).mean()

    moving_std = ts.rolling(window=12).std()

    plt.plot(ts, color='blue',label='Original')

    plt.plot(moving_avg, color='red', label='Rolling Mean')

    plt.plot(moving_std, color='black', label = 'Rolling Std')

#print the dickey-fuller test result

def print_stationarity_test(ts):

    dftest = adfuller(ts.values.flatten(), autolag='AIC')

    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])

    for key,value in dftest[4].items():

        dfoutput['Critical Value (%s)'%key] = value

    print(dfoutput)

parse_dates = ['Month']

index_col = ['Month']

date_parser = lambda dates: pd.to_datetime(dates, format='%Y-%m')

ts = pd.read_csv('D:/programming/python/AirPassengers.csv', parse_dates = parse_dates, index_col = index_col, date_parser=date_parser)

 

plt.plot(ts)

print_stationarity_test(ts)

#ts['1949-01-01':'1949-05-01']

#ts.loc['1949-01-01']

plot_moving_avg_std(ts)

#take log of ts

ts_log = np.log(ts)

log_moving_avg = ts_log.rolling(window=12).mean()

plt.plot(ts_log, color='blue',label='Original')

plt.plot(log_moving_avg, color='red', label='Rolling Mean')

#remove moving average from ts

ts_log_less_moving_avg = ts_log - log_moving_avg

plot_moving_avg_std(ts_log_less_moving_avg)

#take exponential weight moving average

log_ewm = ts_log.ewm(halflife = 12).mean()

plt.plot(ts_log, color='blue',label='Original')

plt.plot(log_ewm, color='red', label='Rolling Mean')

#remove exponential weight moving average from ts

ts_log_less_ewm = ts_log - log_ewm

plot_moving_avg_std(ts_log_less_ewm)

#run the stationarity test again, much better now

print_stationarity_test(ts_log_less_ewm)

#differencing 

ts_log_shift = ts_log.shift(periods=1)

ts_diff = ts_log - ts_log_shift

plot_moving_avg_std(ts_diff)

#decompose time series into base trend, seasonal movements and residual

decomposition = seasonal_decompose(ts_log)

trend = decomposition.trend

seasonal = decomposition.seasonal

residual = decomposition.resid

plt.plot(trend, color = 'green')

plt.plot(seasonal, color = 'blue')

plt.plot(residual, color = 'red')

#the base trend and seasonal movements are easy, check the residual

plot_moving_avg_std(residual)

#ARIMA(p, d, q)

#plot the correlation between observations and the lag observations

#pick the # of lags that is still significantly correlated, i.e. parameter q (use the PACF plot for p)

autocorrelation_plot(ts)

p=20

d=1 #differencing once

q=2 

model = ARIMA(ts_log, order=(p,d,q))

model_fit = model.fit(disp=0)

print(model_fit.summary())

# plot residual errors, to see anything else (obvious trend) to learn

residuals = pd.DataFrame(model_fit.resid)

residuals.plot()

# also plot the distribution of error using kde, kernel density estimation

# pure noise would be a mean 0 normal distribution

residuals.plot(kind='kde')

#further check the mean, std, etc.

print(residuals.describe())

#make a forecast for 2 steps ahead

#output[0] is the forecasts for every future step

#output[1] is the std error

#output[2] is the confidence interval

output = model_fit.forecast(steps = 2)


Side notes:

When it comes to time series data, there are a few important things to consider first. This can roughly be done by plotting the data and having a look:

Any base trend? E.g. gdp grows every year

Any seasonality? E.g. sales shows a yearly up and down pattern

Any long-run cycle or period unrelated to seasonality factors? E.g. financial crisis, economy cycle

Any constant/non-constant variance? E.g. fluctuation could change over time

Any outliers?


Preprocess the dataset to make the series stationary

   e.g. decompose the series into seasonal, base trend and residual

   model the seasonal and base trend separately

   use ARIMA for the residual

   you may also use differencing, logging, etc to further transform

   When stationary, fit the ARIMA model and make a prediction

   Remember the prediction needs to be converted back by adding the seasonal and base trend components back; see the sketch below
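A rough sketch of that whole flow, using seasonal_decompose plus naive extrapolations of the trend and seasonal parts (the extrapolation choices and the use of the newer statsmodels ARIMA API are illustrative assumptions, not part of the notes above):

  import numpy as np
  import pandas as pd
  from statsmodels.tsa.seasonal import seasonal_decompose
  from statsmodels.tsa.arima.model import ARIMA   # newer statsmodels API

  # ts_log: a monthly pandas Series, e.g. ts_log = np.log(ts.iloc[:, 0]) if ts is a one-column DataFrame
  steps = 2  # how many months ahead to forecast

  # decompose the log series into trend + seasonal + residual
  decomposition = seasonal_decompose(ts_log)
  trend = decomposition.trend.dropna()
  seasonal = decomposition.seasonal
  residual = decomposition.resid.dropna()

  # naive base-trend model: extend the trend linearly using its average slope
  slope = trend.diff().mean()
  future_trend = trend.iloc[-1] + slope * np.arange(1, steps + 1)

  # naive seasonal model: repeat last year's seasonal pattern
  future_seasonal = seasonal.iloc[-12:].values[:steps]

  # ARIMA on the residual, which should already be close to stationary
  res_model = ARIMA(residual, order=(2, 0, 2)).fit()
  future_residual = res_model.forecast(steps=steps).values

  # recombine the three parts and undo the log transform to get back to the original scale
  future_log = future_trend + future_seasonal + future_residual
  print(np.exp(future_log))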

When fitting an ARIMA model, you need to determine a few parameters:

   Try different d values (order of differencing)

   Plot and check the ACF and PACF; from there, determine the q and p values respectively

