Lecture 08

For today you should:

1) Read Chapter 8 of Think Stats 2e

2) Complete data inventory, start Exploring relationships (see below)

3) Prepare for a quiz on Chapters 1-8

Today:

1) Project update

2) Estimation

3) Sampling distributions

4) Quiz during meetings

5) Quiz debrief (if possible)

For next time you should:

1) Read Chapter 9 of Think Stats 2e, sections 9.1 to 9.7

2) Complete Exploring relationships

Journal and meeting notes

1) Most important: if you are waiting for data, find useful things to do!

2) The journal is not a diary; it should not contain dates.

It should contain sections labeled

Project description

Data management plan

Data inventory

Relationship between variables

These sections should contain the results of your analysis (or pointers to them), not just words.

You will revise these sections continuously.  

3) The meeting notes should be organized.

Each meeting should have a date and a to-do list.

Take notes during the meeting, and then revise and extend them after the meeting.

I will suggest next moves, but then you should add, filter, and flesh out your to-do list.

4) Note revisions on the Project page

Estimation

Think of an estimator as a function that takes a sample and produces a statistic.

Examples:

Estimator = estimation process

Estimate = result of one execution of an estimation process

If you use the same estimator many times, you can characterize its long-term performance:

An estimator with mean error = 0 is unbiased.

Some estimators minimize MSE in the long run.

A maximum likelihood estimator (MLE) has the best chance of being exactly right.

(2) and (6) are the most relevant for engineering, but seldom used in statistics because they do not lend themselves to analysis.
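A minimal sketch of characterizing long-term performance by simulation. The population (normal, variance 1), the sample size of 7, and all function names are illustrative assumptions, not part of the notes above. It runs two variance estimators many times and reports mean error and MSE for each.

```python
import random

def var_n(xs):
    """Variance estimator that divides by n."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def var_n_minus_1(xs):
    """Variance estimator that divides by n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def run_many(estimator, sigma=1.0, n=7, iters=10_000, seed=17):
    """Apply one estimator to many simulated samples (one estimate each)."""
    rng = random.Random(seed)
    return [estimator([rng.gauss(0.0, sigma) for _ in range(n)])
            for _ in range(iters)]

def mean_error(estimates, actual):
    return sum(e - actual for e in estimates) / len(estimates)

def mean_squared_error(estimates, actual):
    return sum((e - actual) ** 2 for e in estimates) / len(estimates)

ACTUAL_VAR = 1.0  # sigma ** 2 for the simulated population (assumed)
for estimator in (var_n, var_n_minus_1):
    estimates = run_many(estimator)
    print(estimator.__name__,
          "mean error:", round(mean_error(estimates, ACTUAL_VAR), 3),
          "MSE:", round(mean_squared_error(estimates, ACTUAL_VAR), 3))
```

With these settings the divide-by-n estimator should show a clearly negative mean error (it is biased) while the n - 1 version stays near zero, yet the biased one tends to have the smaller MSE. That is why "unbiased" and "minimizes MSE" are separate properties.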

Sampling

The fundamental idea behind all statistics: using measurements from a sample to make inferences about a larger population.

What could go wrong?

1) Sampling bias: you accidentally oversample one group and undersample another.

2) Measurement error: you measure or record data incorrectly.

3) Sampling error: because of random variation, the measurement from your sample deviates from the measurement of the whole population.

When sample sizes are large, (1) and (2) almost always dominate (3).

But (3) is easy to compute, so it is often the only kind of error people think about.

Note on vocab: It is easy to confuse "sampling bias" and "sampling error", but they are different things! Sampling bias is systematic and does not shrink as the sample grows; sampling error is random and does.
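To see why bias dominates sampling error when samples are large, here is a rough simulation. The actual fraction (0.50), the size of the bias (0.03), and the function names are all made-up assumptions for illustration. Each simulated poll draws from a frame that over-represents one group; the sketch then separates the systematic part of the error from the random part.

```python
import random

TRUE_P = 0.50   # hypothetical actual fraction in the population
BIAS = 0.03     # hypothetical oversampling of one group

def poll(n, rng):
    """Fraction of 'yes' answers among n respondents from a biased frame."""
    p = TRUE_P + BIAS
    return sum(rng.random() < p for _ in range(n)) / n

def error_components(n, iters=200, seed=3):
    """Return (systematic error, random spread) over many simulated polls."""
    rng = random.Random(seed)
    results = [poll(n, rng) for _ in range(iters)]
    mean = sum(results) / iters
    systematic = mean - TRUE_P
    spread = (sum((r - mean) ** 2 for r in results) / iters) ** 0.5
    return systematic, spread

for n in (100, 1000, 10_000):
    systematic, spread = error_components(n)
    print(f"n={n:>6}  bias={systematic:+.3f}  sampling error={spread:.3f}")
```

The random spread should fall roughly as 1/sqrt(n), while the systematic part stays near the assumed bias no matter how large n gets.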

Example of bias: The Polls Were Skewed Toward Democrats

Definitely read this article; it's excellent.  But I have a quibble with the headline:  This article is about unintentional bias: the polling results consistently overestimated the fraction of votes for the Democrats.  I prefer to reserve the word skew to describe the asymmetry of distributions (but I understand that in common English they can be synonymous).

Because of this kind of bias, election results often fall outside the "plus or minus" error bars in the polling results, which quantify sampling error only.

Sampling distribution

You can quantify sampling error by computing the sampling distribution, either analytically or by simulation.

1) Assume that the estimates are correct.

2) Simulate the experiment many times.

3) Accumulate the sampling distribution.
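The three steps above can be sketched as follows. This assumes the statistic is a sample mean drawn from a normal population; the values mu=90.0, sigma=7.5, and n=9, and the function name, are hypothetical placeholders for whatever your own estimates are.

```python
import random

def simulate_sample_mean(mu, sigma, n, iters=1000, seed=8):
    """Simulate the experiment `iters` times and collect the estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(iters):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]  # step 2: one experiment
        estimates.append(sum(sample) / n)                  # step 3: accumulate
    return estimates

# Step 1: treat the (hypothetical) estimates as if they were correct.
xbars = simulate_sample_mean(mu=90.0, sigma=7.5, n=9)
print(len(xbars), "simulated estimates, from",
      round(min(xbars), 1), "to", round(max(xbars), 1))
```

The list xbars is the simulated sampling distribution; a histogram or CDF of it shows how much the estimate varies from one run of the experiment to the next.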

And now, the most confusing sentence in all of statistics:

"The RMSE of the sampling distribution is the standard error."

And for good measure:

"The 5th and 95th percentiles of the sampling distribution form a 90% confidence interval."

What does the 90% CI mean?

"If the estimates were correct and you ran the experiment many times, the result would fall in the 90% CI 90% of the time."

It DOES NOT MEAN "There is a 90% chance that the actual value falls in this range."

Why not?

1) As a practical matter, that's not true because of sampling bias, measurement error, etc.

2) As a theoretical matter, it doesn't mean that because the CI is simply not a statement about the actual value.

The sampling distribution quantifies the uncertainty of the estimate due to sampling error.
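The standard error, the 90% CI, and the interpretation of the CI given above can all be checked with a short simulation. This is a sketch under assumptions: the statistic is a sample mean, the population is normal, and MU, SIGMA, and N are hypothetical estimates treated as correct. It uses statistics.quantiles with n=20, whose first and last cut points are the 5th and 95th percentiles.

```python
import random
import statistics

MU, SIGMA, N = 90.0, 7.5, 9   # hypothetical estimates, treated as correct

def sample_means(iters, seed):
    """Run the simulated experiment `iters` times; return the sample means."""
    rng = random.Random(seed)
    return [sum(rng.gauss(MU, SIGMA) for _ in range(N)) / N
            for _ in range(iters)]

xbars = sample_means(iters=5000, seed=1)

# Standard error: RMSE of the simulated estimates around the assumed value.
stderr = (sum((x - MU) ** 2 for x in xbars) / len(xbars)) ** 0.5

# 90% CI: 5th and 95th percentiles (first and last of the 19 cut points).
cuts = statistics.quantiles(xbars, n=20)
ci = (cuts[0], cuts[-1])

# Interpretation check: rerun the experiment and count results inside the CI.
reruns = sample_means(iters=5000, seed=2)
inside = sum(ci[0] <= x <= ci[1] for x in reruns) / len(reruns)

print("standard error:", round(stderr, 2))
print("90% CI:", tuple(round(c, 1) for c in ci))
print("fraction of reruns inside CI:", round(inside, 3))
```

The fraction of reruns inside the interval should come out close to 0.9. Note what the check does and does not show: it describes how the result varies if the estimates are correct, and it says nothing about sampling bias, measurement error, or where the actual value lies.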