
Lecture 07

For today you should have:

  1. Done Homework 5.


Today:

  1. Relationship advice
  2. Rankits!
  3. Prosecutor's fallacy
  4. Practice quiz.

For next time:

  1. Read Chapter 6.
  2. Prepare for a quiz on Chapters 1-5.
  3. Midpoint survey.

Relationships between variables

Many of the questions you posed for your projects involve relationships between variables.

Chapter 9 presents correlation coefficients and linear regression (least squares fit).

I will also show an example of multivariate regression (Exercise 9.11).

Until we get there, you can do simple exploratory analysis.

  • First, make sure you have explored each of your variables individually.  You should have a PMF, CDF, and appropriate summary statistics.
  • If one of the variables is categorical, with a small number of categories, break up the data and plot multiple PMFs and CDFs.
  • If a variable is numerical, you can treat it as categorical by breaking it into groups.
  • Compare summary statistics between groups.
  • If both variables are numerical, a scatterplot is a good place to start.

Feel free to jump ahead to Chapter 9.
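
The grouping steps in the list above can be sketched in a few lines of Python. This is only an illustration: the function name summarize_by_group, the bin width, and the (age, weight) pairs are made up for the example and do not come from the course code or any real dataset.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_group(pairs, bin_width):
    """Treat x as categorical by binning it, then summarize y in each bin.

    pairs: iterable of (x, y) tuples, both numerical.
    bin_width: width of each x group (hypothetical choice; pick what suits the data).
    Returns a dict mapping the lower edge of each bin to summary statistics of y.
    """
    groups = defaultdict(list)
    for x, y in pairs:
        low = (x // bin_width) * bin_width  # lower edge of the bin containing x
        groups[low].append(y)
    return {
        low: {"n": len(ys), "mean": mean(ys), "median": median(ys)}
        for low, ys in sorted(groups.items())
    }


# Hypothetical data: (mother's age in years, birth weight in pounds).
pairs = [(21, 7.1), (24, 7.4), (29, 7.3), (31, 7.6), (35, 7.5), (38, 7.9)]
summary = summarize_by_group(pairs, bin_width=10)
for low, stats in summary.items():
    print(low, stats)
```

Comparing the per-group means and medians is a quick check for a relationship before moving on to a scatterplot or a correlation coefficient.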

Normal probability plots

A normal probability plot is a special case of a quantile-quantile, or Q-Q plot.

The underlying idea is to compare the observed values to the values we would expect if the distribution were normal.

Suppose we draw 100 values from a normal distribution with mean 0 and standard deviation 1.

What do you expect the lowest value to be?  25th percentile?  Median?  75th percentile? Highest? 

Now we can compare the actual values with these predictions.

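More generally, if we draw n values, we can predict the value with rank k.  It's the rankit for rank k out of n: the expected value of the kth smallest of n draws from a standard normal distribution.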

Or what's another way to predict the value with rank k out of n?
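
As a rough sketch of two ways to get these predictions: the first function below estimates rankits by simulation (draw many samples of size n, sort each, and average the values at each rank), and the second uses the inverse normal CDF evaluated at (k - 0.5) / n. The function names, the number of iterations, the seed, and the (k - 0.5) / n plotting position are assumptions made for this example; other plotting positions are also in use, and this is not necessarily the method in the book's code.

```python
import random
from statistics import NormalDist


def simulate_rankits(n, iters=500, seed=17):
    """Estimate the expected value at each rank in a standard normal sample of size n."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(iters):
        sample = sorted(rng.gauss(0, 1) for _ in range(n))
        for i, x in enumerate(sample):
            totals[i] += x
    return [t / iters for t in totals]


def approximate_rankits(n):
    """Approximate the rankits with the inverse normal CDF at (k - 0.5) / n."""
    std_normal = NormalDist(0, 1)
    return [std_normal.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]


n = 100
simulated = simulate_rankits(n)
approximate = approximate_rankits(n)

# Compare the two predictions for the lowest value, the median, and the highest value.
for k in (1, 50, 100):
    print(k, round(simulated[k - 1], 2), round(approximate[k - 1], 2))
```

Plotting the sorted observed values against either list of predictions gives a normal probability plot; if the data are roughly normal, the points fall close to a straight line.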

Ways to test whether a model is a good description for a dataset:

1) Plot the PMF and the model: bad.
2) Plot the CDF and the model: better, but depends on parameter estimation.
3) Plot an appropriate transform of the CDF: good.
4) Make a Q-Q plot: also good, emphasizes different kinds of deviation.

Two kinds of deviation: noisy and systematic.

Is the model good enough?  Depends on what it's for.

Prosecutor's fallacy

Example: accusing a lottery winner of cheating.

Example: searching a DNA database to find a suspect.

Example: accusing a woman of murder because two children died of SIDS.

Example: "To lose one parent may be regarded as a misfortune; to lose both looks like carelessness" (Oscar Wilde).

Example: blood evidence in the O.J. Simpson case (defendant's fallacy).

What is the error in these cases?
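
One way to make the DNA database example concrete is to put numbers on it. The sketch below uses made-up values: a population of 1,000,000 possible sources, a 1 in 100,000 chance that an unrelated person matches, a uniform prior, and the assumption that the true source always matches. None of these figures come from a real case, and the function name is hypothetical.

```python
def posterior_given_match(population, match_prob):
    """P(person is the source | DNA match), by Bayes's theorem.

    Simplifying assumptions (for illustration only):
      - exactly one person in the population is the source,
      - every person is equally likely a priori,
      - the true source always matches,
      - anyone else matches by chance with probability match_prob.
    """
    prior = 1 / population
    likelihood_if_source = 1.0
    likelihood_if_not = match_prob
    numerator = prior * likelihood_if_source
    denominator = numerator + (1 - prior) * likelihood_if_not
    return numerator / denominator


population = 1_000_000      # hypothetical number of possible sources
match_prob = 1 / 100_000    # hypothetical chance of a coincidental match

posterior = posterior_given_match(population, match_prob)
print(f"P(match | not the source) = {match_prob:.5f}")
print(f"P(source | match)         = {posterior:.3f}")
```

With these numbers the chance of a coincidental match is tiny, yet the probability that a matching person is the source is only about 9%, because roughly ten innocent people in the population are expected to match as well. Treating the first number as if it were the second is the mistake to look for in the examples above.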

Practice quiz

1) A Q-Q plot is a way of comparing two distributions.  Write a function 
called QQplot that takes two Cdf objects and a list of percentiles (that is, 
a sorted list of values between 0 and 1).  For each percentile in the list,
it should compute the corresponding value from each Cdf and return a
list of value pairs.

2) Doctors are taught two aphorisms: "An
  unusual presentation of a common disease is more common than the
  usual presentation of an uncommon disease," and the shorter
  version, "When you hear hoofbeats, think horses, not zebras."

  Suppose that the typical adult gets a cold 3 times per year,
  and that 1 person in 15,000 develops active tuberculosis (per year).  Of people
  who develop tuberculosis, 60% visit a doctor and report
  a "bad cough that lasts more than 3 weeks."  Of people who
  develop a cold, 1% visit a doctor and report the same symptom.

  If you are a doctor and a patient comes to see you with a bad cough
  that has lasted more than 3 weeks, what is your estimate of the
  probability that the patient has tuberculosis?

  For full credit, you should state the hypothesis and evidence,
  and use Bayes's theorem explicitly.

3) Suppose that in a survey 19 out of 60 male respondents say
they smoke, and 12 out of 40 female respondents say they smoke.

1) What is the probability that a randomly-chosen respondent is male?

2) What is the probability that a randomly-chosen respondent is a smoker?

3) What is the probability that a randomly-chosen respondent is a
  male smoker?

4) What is the probability that a randomly-chosen male is a smoker?

5) What is the probability that a randomly-chosen smoker is a male?

6) Does this mean that smoking is cool?
