Trang chủ‎ > ‎IT‎ > ‎Data Mining‎ > ‎Time Series Analysis‎ > ‎

Another time series forecasting using R with accuracy testing on air passenger dataset

According to the Merriam-Webster dictionary (http://www.merriam-webster.com/dictionary/forecast), the act of forecasting consists in saying that (something) will happen in the future or to predict (something, such as weather) after looking at the information that is available

Forecasting is a form of prediction. In time series analysis we often want to predict something in the future on the basis of what we have observed in the past. The evaluation of the accuracy of our prediction is an important part of model evaluation. Poor accuracy means that our prediction will be scarcely helpful.

Herewith enclosed is an exercise I’ve done on the basis of the online book of Hyndman and Athanasopoulos “Forecasting: principles and practice”, a very good reading. You may find the book here: https://www.otexts.org/fpp

You may find details on the topic (accuracy in forecasting) in the online book of Hyndman and Athanasopoulos: https://www.otexts.org/fpp/2/5

The codes are a reworking of the codes I’ve found in the site and elsewhere in another site of Rob J. Hyndman (http://robjhyndman.com/hyndsight/)

Of course, all responsability is mine, so, if you find errors and mistakes, this is my responsability and not of Hyndman or Athanasopoulos.

I will appreciate that any error be signalled to me at the Twitter address below.

Let’s start!

# first load the necessary packages to have the file ready to be published in RPub
# you do not need these libraries to perform the analyses

library(knitr)
library(rmarkdown)
## Warning: package 'rmarkdown' was built under R version 3.1.3

The example is based on the dataset “AirPassengers”

# load the necessary library

library(forecast)
## Warning: package 'forecast' was built under R version 3.1.3
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: timeDate
## This is forecast 6.1
data(AirPassengers)

# inspect the series

AirPassengers
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
### data are assigned to a convenient vector
### this is a easy way to avoid changing the code every time

series <- AirPassengers


# plot the series 

plot(series, col="darkblue", ylab="Passegners on airplanes")

# plot the seasonal distribution of the series

windows(width=800,height=350)   # set the window with the dimensions you need

boxplot(split(series, cycle(series)), names = month.abb, col = "gold")

plot of chunk unnamed-chunk-2


The accuracy of forecasts can only be determined by considering how well a model performs on new data that were not used when fitting the model.

The size of the test set is typically about 20% of the total sample

So, we will split the series in a training set and a test set

### training set
### use data from 1949 to 1956 for forecasting

sr = window(series, start=1949, end=c(1956,12))

### test set
### use remaining data from 1957 to 1960 to test accuracy

ser = window(series, start=1957, end=c(1960,12))

Now we are ready to start.

Let’s rock the boat!

    ######################################################################
    # plot training set
    ######################################################################

    plot(sr, main="AirPassengers", ylab="", xlab="Months")

    # plot forecasting for 5 years according to four methods
    lines(meanf(sr,h=48)$mean, col=4)
    lines(rwf(sr,h=48)$mean, col=2)
    lines(rwf(sr,drift=TRUE,h=48)$mean, col=3)
    lines(snaive(sr,h=48)$mean, col=5)

    # legend
    legend("topleft", lty=1, col=c(4,2,3, 5),
    legend=c("Mean method","Naive method","Drift method", "Seasonal naïve method"),bty="n"

    # the test set
    lines(ser, col="red")

plot of chunk unnamed-chunk-4

    # accuracy for forecasting of sr (forecasted data) on ser (original data used as test set)
    # the best model had the lowest error (particularly the MAPE, Mean absolute percentage error)

    # Mean method
    accuracy(meanf(sr,h=48), ser)
##                      ME   RMSE    MAE    MPE  MAPE  MASE   ACF1 Theil's U
## Training set -9.442e-15  71.54  59.07 -11.68 30.89 2.023 0.9282        NA
## Test set      1.998e+02 214.34 199.77  46.60 46.60 6.841 0.7915     4.358
    # Naive method
    accuracy(rwf(sr,h=48), ser)
##                   ME   RMSE    MAE     MPE   MAPE   MASE   ACF1 Theil's U
## Training set   2.042  23.33  18.69  0.5236  8.716 0.6402 0.2472        NA
## Test set     107.479 132.61 107.73 23.5372 23.620 3.6891 0.7915     2.528
    # Drift method
    accuracy(rwf(sr,drift=TRUE,h=48), ser)
##                      ME  RMSE   MAE     MPE   MAPE   MASE   ACF1 Theil's U
## Training set -1.122e-14 23.24 18.61 -0.5356  8.713 0.6373 0.2472        NA
## Test set      5.745e+01 87.67 63.89 11.7513 13.729 2.1879 0.7296     1.646
    # Seasonal naïve method
    accuracy(snaive(sr,h=48), ser)
##                 ME  RMSE   MAE   MPE  MAPE  MASE   ACF1 Theil's U
## Training set 28.80 32.73 29.20 12.49 12.69 1.000 0.7740        NA
## Test set     85.23 97.80 85.23 19.59 19.59 2.919 0.8754     1.932
    ######################################################################
    # plot test set only with the predictions
    ######################################################################

    # calculate the forecasting

    sr.mean <- meanf(sr,h=48)$mean
    sr.naive <- rwf(sr,h=48)$mean
    sr.drift <- rwf(sr,drift=TRUE,h=48)$mean
    sr.seas <- snaive(sr,h=48)$mean

    # plot the test set
    plot(ser, main="AirPassengers", ylab="", xlab="Months", ylim = c(200,600))

    # plot forecasting for 4 years according to four methods
    lines(sr.mean, col=4)
    lines(sr.naive, col=2)
    lines(sr.drift, col=3)
    lines(sr.seas, col=5)

    # legend
    legend("topleft", lty=1, col=c(4,2,3,5),
    legend=c("Mean method","Naive method","Drift method", "Seasonal naïve method"),bty="n"

plot of chunk unnamed-chunk-4

It is rather obvious than none of these methods produce a good forecast of the series.

The mean method and the naive method do not detect nor the trend neither the seasonality in the series. The drift method does detect the trend but not the seasonality, while the seasonal naïve method does the reverse. The best method, on the basis of the Mean absolute percentage error (MAPE) is the drift method, which in my view suggests that the trend is more important than the seasonality in this series.

    ########################################################################
    # for ARIMA; Hyndman suggest to use auto-arima without stepwise 
    ########################################################################

    library(fpp)
## Loading required package: fma
## Loading required package: tseries
## Loading required package: expsmooth
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 3.1.3
    trainData <- sr
    testData <- ser

    #  the default value in auto.arima() is test="kpss". 
    # A KPSS test has a null hypothesis of stationarity
    # In general, all the defaults are set to the values that give the best forecasts on average.

    # CAUTION! Takes a while to compute

    arimaMod <- auto.arima(trainData, stepwise=FALSE, approximation=FALSE)
    arimaMod.Fr <-forecast(arimaMod,h=48)

    # plot of the prediction and of the test set

    plot(arimaMod.Fr)
    lines(testData, col="red")
    legend("topleft",lty=1,bty = "n",col=c("red","blue"),c("testData","ARIMAPred"))

plot of chunk unnamed-chunk-5

    # plot of the test set and its prediction only

    AR.mean <-forecast(arimaMod,h=48)$mean

    plot(testData, main="AirPassengers", ylab="", xlab="Months", col="darkblue")  
    lines(AR.mean, col="red")

plot of chunk unnamed-chunk-5

    # accuracy

    accuracy(arimaMod.Fr,testData)
##                   ME   RMSE    MAE     MPE  MAPE   MASE    ACF1 Theil's U
## Training set  0.7965  8.551  6.092  0.2873 2.768 0.2086 0.01757        NA
## Test set     -7.8983 26.104 21.080 -2.6868 5.186 0.7218 0.65516    0.5771
    # test residues of arima

    tsdisplay(residuals(arimaMod))

plot of chunk unnamed-chunk-5

# Best lag
# It is recommended using h=10 for non-seasonal data and h=2m for seasonal data, where m is the period of seasonality.
# for seasonality over 12 months, h = 2 * 12 = 24
# see: http://robjhyndman.com/hyndsight/ljung-box-test/

lb <- Box.test(residuals(arimaMod), lag = 24, type = "Ljung-Box")
lb
## 
##  Box-Ljung test
## 
## data:  residuals(arimaMod)
## X-squared = 22.94, df = 24, p-value = 0.5232
bp <- Box.test(residuals(arimaMod), lag = 24, type = "Box-Pierce")
bp
## 
##  Box-Pierce test
## 
## data:  residuals(arimaMod)
## X-squared = 18.55, df = 24, p-value = 0.7754
########################################################
#### Residuals diagnostics in forecasting
########################################################

res.fr <- residuals(arimaMod.Fr)

par(mfrow=c(1,3))

plot(res.fr, main="Residuals from ARIMA method",
  ylab="", xlab="Years")

Acf(res.fr, main="ACF of residuals")

u <- residuals(arimaMod)

m<-mean(u)
std<-sqrt(var(u))
hist(u, breaks=20, col="gray", prob=TRUE, 
xlab="Residuals", main="Histogram of residuals\n with Normal Curve")
curve(dnorm(x, mean=m, sd=std), 
      col="black", lwd=2, add=TRUE)

plot of chunk unnamed-chunk-5

The ARIMA model had the lowest MAPE of all forecasting methods, and it is obvious from the plot that the prediction based on the ARIMA model detects both the trend and the seasonality of the series.

The residuals of the model are reasonably good, and the Ljung–Box test shows that there is no serial correlation.

The superiority of ARIMA over the other models of forecasting depends, in part, on ARIMA including a term for differerincing.

Indeed, the ‘AirPassengers’ series is not stationary, as it can be proved with the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. The null hypothesis of the test is that the series is stationary, so a p < 0.05 (the conventional threshold for rejecting the null, whatever this means) indicates that the series is not stationary.

We will use the test as implemented in the ‘tseries’ package

library(tseries)

kpss.test(series)
## Warning: p-value smaller than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  series
## KPSS Level = 4.342, Truncation lag parameter = 2, p-value = 0.01

The test p-value = 0.01, so we can assume that the series is not stationary. To have it stationary we should difference the time series. To evaluate how much we must difference the series, we can use the ‘ndiffs’, then apply the number produced by the function with the appropriate command.

ndiffs(series)
## [1] 1
series.diff <- diff(series, differences=1) 

# you may automate the procedure assigning the result of 'ndiffs' to a vector and using the output for the required differences in 'diff'

# d <- ndiffs(series); series.diff <- diff(series, differences=d[1])


# check wether the differenced series is stationary

kpss.test(series.diff)
## Warning: p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  series.diff
## KPSS Level = 0.0115, Truncation lag parameter = 2, p-value = 0.1
# Now the series is stationary, as it can be seen by comparison with the original series

# The RColorBrewer can be used to create a set of diverging colors

library(RColorBrewer)
rb<-brewer.pal(7,"Blues")

# paired plot with enlarged window

windows(width=1200,height=350)

par(mfrow=c(1,2))

plot(series, col=rb[4], xlab = "Original Series", ylab="Passegners on airplanes")
plot(series.diff, col=rb[7], xlab = "Differenced Series",  ylab="Passegners on airplanes")  

plot of chunk unnamed-chunk-7


Anyway, this was not the point of the exercise. I wanted to compare some method of forecasting and evaluate how to measure the accuracy of the models. Indeed, even differencing the series, the forecasting of the other methods remain poor, with the seasonal naïve method being the closest to the target.

Comments