Lecture 14
For today you should:
1) Read Chapter 12 of Think Stats 2e
2) Read "Statistical inference is only mostly wrong" and (optionally) the discussion of the article on Reddit
3) Prep for a quiz.
Today:
1) Quiz
2) Time series analysis
For next time you should:
1) Read "Avoiding a common mistake with time series"
2) Install Bokeh
3) Work on completing and cleaning up your journal entries. Hint: make a checklist.
Optional: Interpretation of BBC survey of Muslims in the UK: same data, two stories
Headline #1: "1 in 4 UK Muslims: Violence okay over Muhammad images; BBC poll shows substantive minority sympathetic with Charlie Hebdo terrorists; one in 10 condones fighting the West"
Headline #2: "Most British Muslims 'oppose Muhammad cartoons reprisals'"
But 27% of the 1,000 Muslims polled by ComRes said they had some sympathy for the motives behind the Paris attacks.
11% feel sympathy for people who want to fight against western interests.
Asked if acts of violence against those who publish images of the Prophet Muhammad can "never be justified", 68% agreed that such violence was never justifiable.
But 24% disagreed with the statement, while the rest replied "don't know" or refused to answer.
Time series analysis
Comments from Chapter 12
1) Pandas provides excellent support for timestamps and time series, but it takes some getting used to.
2) The groupby function is a powerful tool, but the documentation is not great, so it takes some persistence to get comfortable with it.
3) One option for working with time series is to treat time just like any other variable. The example in the chapter is linear regression.
The problems are:
First, there is no reason to expect the long-term trend to be a line or any other simple function. In general, prices are determined by supply and demand, both of which vary over time in unpredictable ways.
Second, the linear regression model gives equal weight to all data, recent and past. For purposes of prediction, we should probably give more weight to recent data.
Finally, one of the assumptions of linear regression is that the residuals are uncorrelated noise. With time series data, this assumption is often false because successive values are correlated.
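A minimal sketch of the third problem, using made-up data rather than the data from the chapter: it builds synthetic transactions (a hypothetical "price" column that drifts as a random walk), uses groupby to get one mean price per day, fits a line against elapsed time, and then checks whether successive residuals are correlated. All names and numbers here are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(17)

# Synthetic transaction-level data (invented for illustration):
# five sales per day, with a price level that drifts as a random walk.
dates = pd.date_range("2014-01-01", periods=200, freq="D")
drift = np.cumsum(rng.normal(0, 0.1, size=len(dates)))
transactions = pd.DataFrame({
    "date": dates.repeat(5),
    "price": 10 + np.repeat(drift, 5) + rng.normal(0, 0.1, size=5 * len(dates)),
})

# groupby collapses the transactions into one mean price per day.
daily = transactions.groupby("date")["price"].mean()

# Treat time like any other variable: elapsed years since the first day.
years = np.asarray((daily.index - daily.index[0]).days) / 365.25
prices = daily.to_numpy()

# Ordinary least squares fit of price against time.
slope, intercept = np.polyfit(years, prices, 1)
residuals = prices - (intercept + slope * years)

# Lag-1 correlation of the residuals; for uncorrelated noise
# this would be close to 0.
resid_serial_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(f"slope per year: {slope:.2f}")
print(f"serial correlation of residuals: {resid_serial_corr:.2f}")
```

If the residuals were uncorrelated noise, the last number would be near zero. With a drifting series like this one it usually is not, which is the assumption violation described above.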
When people say "time series analysis" they are usually talking about methods that try to identify:
Trend: A smooth function that captures persistent changes.
Seasonality: Periodic variation, possibly including daily, weekly, monthly, or yearly cycles.
Noise: Random variation around the long-term trend.
Methods in the chapter include rolling means and exponentially-weighted moving averages (EWMA).
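Both smoothers are available in pandas. The sketch below applies them to a synthetic series (a made-up linear trend plus noise, not the data from the chapter); the 30-day window and span are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(18)

# Synthetic daily series: a slowly falling trend plus random noise.
dates = pd.date_range("2014-01-01", periods=365, freq="D")
trend = np.linspace(10, 8, len(dates))
series = pd.Series(trend + rng.normal(0, 0.5, size=len(dates)), index=dates)

# Rolling mean: the average of the most recent 30 values.
# The first 29 entries are NaN because the window is not yet full.
rolling_mean = series.rolling(window=30).mean()

# EWMA: a weighted average where the weights decay exponentially
# with age, so recent values count more. It is defined from the
# first observation onward.
ewma = series.ewm(span=30).mean()

print(rolling_mean.tail(3))
print(ewma.tail(3))
```

The rolling mean weights every value in the window equally, while the EWMA gives more weight to recent values, which is often what you want for prediction.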
Serial correlation
Correlation between successive elements in a time series.
If you generalize to lags greater than 1, you get the autocorrelation function (ACF), which is a powerful tool for identifying periodic behavior or "seasonality".
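A rough sketch of both ideas, using a synthetic series with a built-in weekly cycle. The helper names serial_corr and autocorr_function are made up for this example, and this pairwise-correlation version differs slightly from the standard ACF estimator in statistics libraries, which uses the overall mean and variance of the series.

```python
import numpy as np

def serial_corr(values, lag=1):
    """Correlation between a series and a copy of itself shifted by `lag`."""
    values = np.asarray(values, dtype=float)
    return np.corrcoef(values[:-lag], values[lag:])[0, 1]

def autocorr_function(values, max_lag):
    """Serial correlation at lags 1..max_lag."""
    return np.array([serial_corr(values, lag) for lag in range(1, max_lag + 1)])

rng = np.random.default_rng(19)

# Synthetic daily series with a 7-day cycle plus noise.
days = np.arange(364)
series = np.sin(2 * np.pi * days / 7) + rng.normal(0, 0.5, size=len(days))

acf = autocorr_function(series, max_lag=14)
best_lag = int(np.argmax(acf)) + 1  # acf[0] corresponds to lag 1

for lag, corr in enumerate(acf, start=1):
    print(f"lag {lag:2d}: {corr:+.2f}")
```

The correlations should be highest at lags that are multiples of 7 and negative at lags about half a cycle away, which is how the ACF reveals the weekly period.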
Prediction
When you use time series analysis to generate predictions, you should consider three sources of error:
Sampling error: The prediction is based on estimated parameters, which depend on random variation in the sample. If we run the experiment again, we expect the estimates to vary.
Random variation: Even if the estimated parameters are perfect, the observed data varies randomly around the long-term trend, and we expect this variation to continue in the future.
Modeling error: We have already seen evidence that the long-term trend is not linear, so predictions based on a linear model will eventually fail.
The first two are quantifiable; the last one is the real problem.
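One way to quantify the first two sources is by resampling residuals. The sketch below is an illustration under stated assumptions, not necessarily the exact procedure from the chapter: it uses a synthetic linear series, and the function name simulate_predictions and the add_resid flag are made up for this example. Note that it cannot say anything about modeling error, because it assumes the linear model is correct.

```python
import numpy as np

rng = np.random.default_rng(20)

# Synthetic data: a linear trend plus noise (invented for illustration).
x = np.arange(200, dtype=float)
y = 10 - 0.01 * x + rng.normal(0, 1.0, size=len(x))

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

def simulate_predictions(x_future, iters=1000, add_resid=False):
    """Simulate predictions at x_future by resampling residuals.

    Each iteration builds a fake dataset (fitted values plus resampled
    residuals) and refits the line, which captures sampling error.
    With add_resid=True, a resampled residual is also added to each
    prediction, which captures random variation.
    """
    predictions = np.empty(iters)
    for i in range(iters):
        fake_y = fitted + rng.choice(residuals, size=len(residuals), replace=True)
        fake_slope, fake_intercept = np.polyfit(x, fake_y, 1)
        predictions[i] = fake_intercept + fake_slope * x_future
        if add_resid:
            predictions[i] += rng.choice(residuals)
    return predictions

x_future = 250.0
sampling_only = simulate_predictions(x_future)
with_noise = simulate_predictions(x_future, add_resid=True)

# 90% intervals for each case.
low_s, high_s = np.percentile(sampling_only, [5, 95])
low_n, high_n = np.percentile(with_noise, [5, 95])
print(f"sampling error only:   ({low_s:.2f}, {high_s:.2f})")
print(f"plus random variation: ({low_n:.2f}, {high_n:.2f})")
```

The second interval is wider because it includes both sources of error. Neither interval accounts for the possibility that the trend stops being linear.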
According to Donald Rumsfeld, United States Secretary of Defense under George W. Bush:
"... as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones."
With uncharacteristic understatement, Wikipedia says, "The statement became the subject of much commentary."
Variables that are included in your model are known knowns.
The residuals left unexplained by your model are known unknowns.
Modeling errors are unknown unknowns.
The fundamental problem with all prediction is that non-trivial systems are under no obligation to behave in the future the way they behaved in the past.
But without making SOME assumption about which things will stay the same and which might change, we can't make predictions at all.
And virtually every decision we make is based on some kind of prediction, often implicit.
Exercise: Consider any decision you have made recently, either momentous or trivial. What goal were you trying to achieve? And what prediction was your choice predicated on? How reliable was that prediction?
Bertrand Russell: “The man who has fed the chicken every day throughout its life at last wrings its neck instead, showing that more refined views as to the uniformity of nature would have been useful to the chicken.”
As an aside, I highly recommend this documentary about Rumsfeld: