Lecture 14

For today you should:

1) Read Chapter 12 of Think Stats 2e

2) Read "Statistical inference is only mostly wrong" and (optionally) the discussion of the article on Reddit

3) Prep for a quiz.

Today:

1) Quiz

2) Time series analysis

For next time you should:

1) Read "Avoiding a common mistake with time series"

2) Install Bokeh

3) Work on completing and cleaning up your journal entries.  Hint: make a checklist.

Optional: Interpretation of BBC survey of Muslims in the UK: same data, two stories

Headline #1: "1 in 4 UK Muslims: Violence okay over Muhammad images; BBC poll shows substantive minority sympathetic with Charlie Hebdo terrorists; one in 10 condones fighting the West"

Headline #2: "Most British Muslims 'oppose Muhammad cartoons reprisals'"

But 27% of the 1,000 Muslims polled by ComRes said they had some sympathy for the motives behind the Paris attacks.

11% said they feel sympathy for people who want to fight against Western interests.

Asked if acts of violence against those who publish images of the Prophet Muhammad can "never be justified", 68% agreed that such violence was never justifiable.

But 24% disagreed with the statement, while the rest replied "don't know" or refused to answer.

Time series analysis

Comments from Chapter 12

1) Pandas provides excellent support for timestamps and time series, but it takes some getting used to.

2) The groupby function is a powerful tool, but the documentation is not great, so it takes some persistence to get comfortable with it.

3) One option for working with time series is to treat time just like any other variable.  The example in the chapter is linear regression.
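As a rough illustration of comments 1 to 3, here is a minimal sketch that builds a timestamped DataFrame, aggregates it with groupby, and then fits a line with time as the explanatory variable. The data are synthetic and the names (df, price, years) are invented for this example; they are not the dataset or the code from the chapter, and np.polyfit stands in for whatever regression tool you prefer.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): a daily price that
# falls by about 1 unit per year, plus random noise.
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=730, freq="D")
elapsed = np.arange(len(dates)) / 365.0
prices = 10.0 - 1.0 * elapsed + rng.normal(0, 0.5, size=len(dates))
df = pd.DataFrame({"date": dates, "price": prices})

# groupby: collapse the daily rows into one mean price per calendar month.
monthly = df.groupby(df["date"].dt.to_period("M"))["price"].mean()

# Treat time like any other variable: convert timestamps to
# "years since the first observation" and fit a straight line.
df["years"] = (df["date"] - df["date"].iloc[0]).dt.days / 365.0
slope, intercept = np.polyfit(df["years"], df["price"], deg=1)

print(monthly.head())
print(f"estimated slope: {slope:.2f} per year, intercept: {intercept:.2f}")
```

The conversion from timestamps to a plain number of years is the step that makes time "just another variable"; after that, the regression knows nothing about the ordering of the observations.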

The problems are:

When people say "time series analysis" they are usually talking about methods that try to identify:

Methods in the chapter include rolling means and EWMA.
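A minimal sketch of both smoothing methods, assuming a recent version of pandas where they are available as the rolling and ewm methods on a Series. The series here is synthetic (a linear trend plus noise), and the window and span of 30 are arbitrary choices for illustration, not values taken from the chapter.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (an assumption): linear trend plus noise.
rng = np.random.default_rng(1)
dates = pd.date_range("2021-01-01", periods=200, freq="D")
trend = np.linspace(0, 10, len(dates))
series = pd.Series(trend + rng.normal(0, 1, len(dates)), index=dates)

# Rolling mean: the average of the most recent 30 values.
# The first 29 entries are NaN because the window is not yet full.
roll = series.rolling(window=30).mean()

# EWMA: a weighted average in which recent values count the most and
# the weights decay exponentially with age. `span` controls how fast.
ewma = series.ewm(span=30).mean()

print(pd.DataFrame({"raw": series, "rolling": roll, "ewma": ewma}).tail())
```

One practical difference: the rolling mean has no value until the window fills, while the EWMA produces a value from the first observation onward.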

Serial correlation

Correlation between successive elements in a time series.

If you generalize to lags greater than 1, you get the ACF, which is a powerful tool for identifying periodic behavior or "seasonality".
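The idea can be sketched in a few lines: serial correlation is the correlation between a series and a shifted copy of itself, and computing it for a range of lags gives a simple version of the ACF. The function names serial_corr and acf are invented for this example, and the weekly signal is synthetic. Library implementations of the ACF typically use a slightly different estimator, so treat this as a sketch of the concept rather than a replacement for them.

```python
import numpy as np
import pandas as pd

def serial_corr(series, lag=1):
    """Pearson correlation between a series and itself shifted by `lag`."""
    # shift() introduces NaNs at the start; Series.corr ignores those pairs.
    return series.corr(series.shift(lag))

def acf(series, max_lag):
    """Serial correlation for each lag from 1 to max_lag (illustrative)."""
    return np.array([serial_corr(series, lag) for lag in range(1, max_lag + 1)])

# Synthetic series (an assumption): a 7-day cycle plus noise.
rng = np.random.default_rng(2)
t = np.arange(365)
weekly = pd.Series(np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.3, len(t)))

corrs = acf(weekly, max_lag=10)
best_lag = int(np.argmax(corrs)) + 1  # lags start at 1, array indices at 0

for lag, c in enumerate(corrs, start=1):
    print(f"lag {lag:2d}: {c:+.2f}")
print("lag with the highest correlation:", best_lag)
```

For a series with a weekly cycle, the correlations should be strongly negative around lags 3 and 4 (half a cycle) and peak near lag 7, which is the signature of periodic behavior.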

Prediction

When you use TSA to generate predictions, you should consider three sources of error:

The first two are quantifiable; the last one is the real problem.

According to Donald Rumsfeld, United States Secretary of Defense under George W. Bush:

"... as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones."

With uncharacteristic understatement, Wikipedia says, "The statement became the subject of much commentary."

The fundamental problem with all prediction is that non-trivial systems are under no obligation to behave in the future the way they behaved in the past.

But without making SOME assumption about which things will stay the same and which might change, we can't make predictions at all.

And virtually every decision we make is based on some kind of prediction, often implicit.

Exercise: Consider any decision you have made recently, either momentous or trivial.  What goal were you trying to achieve?  And what prediction was your choice predicated on?  How reliable was that prediction?

Bertrand Russell:  “The man who has fed the chicken every day throughout its life at last wrings its neck instead, showing that more refined views as to the uniformity of nature would have been useful to the chicken.”

As an aside, I highly recommend this documentary about Rumsfeld: