Relationships between variables
Many of the questions you posed for your projects involve relationships between variables.
I will also show an example of multivariate regression (Exercise 9.11).
Until we get there, you can do simple exploratory analysis.
Feel free to jump into Chapter 9.
A normal probability plot is a special case of a quantile-quantile, or Q-Q plot.
The underlying idea is to compare the observed values to the values we would expect if the distribution were normal.
Suppose we draw a sample of 100 values from a normal distribution with mean 0 and standard deviation 1.
What do you expect the lowest value to be? 25th percentile? Median? 75th percentile? Highest?
Now we can compare the actual values with these predictions.
More generally, if we draw n values, we can predict the value with rank k. It's the k/n rankit.
Or what's another way to predict the value with rank k out of n?
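One way to make these predictions concrete is to estimate the expected value at each rank. The sketch below uses only Python's standard library and does it two ways: by averaging many sorted samples, and by evaluating the inverse normal CDF at (k - 0.5)/n. That second formula is one common approximation among several, not necessarily the one intended in these notes, and the function names are illustrative.

```python
import random
from statistics import NormalDist


def simulated_rankits(n, trials=2000, seed=17):
    """Estimate the expected value at each rank by averaging many sorted samples."""
    rng = random.Random(seed)
    totals = [0.0] * n
    for _ in range(trials):
        sample = sorted(rng.gauss(0, 1) for _ in range(n))
        for i, x in enumerate(sample):
            totals[i] += x
    return [t / trials for t in totals]


def approximate_rankits(n):
    """Approximate the value with rank k (1-based) as the inverse CDF at (k - 0.5) / n."""
    dist = NormalDist(0, 1)
    return [dist.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]


n = 100
sim = simulated_rankits(n)
approx = approximate_rankits(n)

# Compare the two predictions for the lowest value, the quartiles, and the highest value.
for k in (1, 25, 50, 75, 100):
    print(k, round(sim[k - 1], 2), round(approx[k - 1], 2))
```

The two columns should agree closely in the middle of the distribution and differ slightly at the extremes, where the simple approximation is weakest.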
Ways to test whether a model is a good description for a dataset:
1) Plot the PMF and the model: bad.
2) Plot the CDF and the model: better, but depends on parameter estimation.
3) Plot an appropriate transform of the CDF: good.
4) Make a Q-Q plot: also good, emphasizes different kinds of deviation.
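As an illustration of option 3, suppose the model is an exponential distribution (an assumption made here for the example, not something these notes specify). Its complementary CDF is exp(-λx), so plotting log(1 - CDF(x)) against x should give a straight line with slope -λ. The sketch below checks this numerically instead of plotting; the helper names and the cutoff for the noisy tail are illustrative choices.

```python
import math
import random


def log_ccdf_points(sample, min_ccdf=0.05):
    """Return (x, log(1 - CDF(x))) pairs from the empirical CDF, skipping the noisy tail."""
    xs = sorted(sample)
    n = len(xs)
    points = []
    for i, x in enumerate(xs):
        ccdf = 1 - (i + 1) / n  # fraction of the sample greater than x
        if ccdf >= min_ccdf:
            points.append((x, math.log(ccdf)))
    return points


def least_squares_slope(points):
    """Slope of the least-squares line through the points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    return sxy / sxx


rng = random.Random(42)
lam = 2.0  # assumed rate for the illustration
sample = [rng.expovariate(lam) for _ in range(10000)]
slope = least_squares_slope(log_ccdf_points(sample))
print(slope)  # close to -lam when the exponential model fits
```

A systematic bend in the transformed points, rather than random scatter around the line, would suggest that the model is the wrong shape.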
Two kinds of deviation: noisy and systematic.
Is the model good enough? Depends on what it's for.
Example: accusing a lottery winner of cheating.
Example: searching a DNA database to find a suspect.
Example: accusing a woman of murder because two children died of SIDS.
Example: "To lose one parent may be regarded as a misfortune; to lose both looks like carelessness" (Oscar Wilde).
Example: Blood evidence in the O.J. Simpson case (defendant's fallacy).
What is the error in these cases?
1) A Q-Q plot is a way of comparing two distributions. Write a function
called QQplot that takes two Cdf objects and a list of percentiles (that is,
a sorted list of values between 0 and 1). For each percentile in the list,
it should compute the corresponding value from each Cdf and return a
list of value pairs.
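A minimal sketch of one possible shape for this function follows. It assumes each Cdf object exposes a method Value(p) that returns the value at cumulative probability p. SimpleCdf is a hypothetical stand-in written for this example, and the Cdf class used in the course may have a different interface.

```python
import math


class SimpleCdf:
    """Hypothetical stand-in for a Cdf object: an empirical CDF over a sample."""

    def __init__(self, sample):
        self.xs = sorted(sample)

    def Value(self, p):
        """Return the value at cumulative probability p (0 <= p <= 1)."""
        if not 0 <= p <= 1:
            raise ValueError("p must be between 0 and 1")
        index = max(0, math.ceil(p * len(self.xs)) - 1)
        return self.xs[index]


def QQplot(cdf1, cdf2, percentiles):
    """Return a list of (value from cdf1, value from cdf2) pairs, one per percentile."""
    return [(cdf1.Value(p), cdf2.Value(p)) for p in percentiles]


a = SimpleCdf([1, 2, 3, 4, 5])
b = SimpleCdf([10, 20, 30, 40, 50])
pairs = QQplot(a, b, [0.1, 0.5, 1.0])
print(pairs)
```

If the two distributions have the same shape, the pairs fall on a straight line; here the second sample is ten times the first, so the line passes through the origin with slope 10.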
2) Doctors are taught two aphorisms: "An
unusual presentation of a common disease is more common than the
usual presentation of an uncommon disease," and the shorter
version, "When you hear hoofbeats, think horses, not zebras."
Suppose that the typical adult gets a cold 3 times per year,
and that 1 person in 15,000 develops active tuberculosis (per year). Of people
who develop tuberculosis, 60% visit a doctor and report
a "bad cough that lasts more than 3 weeks." Of people who
develop a cold, 1% visit a doctor and report the same symptom.
If you are a doctor and a patient comes to see you with a bad cough
that has lasted more than 3 weeks, what is your estimate of the
probability that the patient has tuberculosis?
For full credit, you should state the hypothesis and evidence,
and use Bayes's theorem explicitly.
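One possible way to set up the computation is sketched below, under assumptions the problem does not state: the only causes of the symptom are a cold or tuberculosis, each cold is an independent chance to report it, and the rates can be compared per person per year. The hypothesis H is "the patient has tuberculosis" and the evidence E is "the patient visits a doctor reporting the cough." Variable names are illustrative.

```python
from fractions import Fraction

# Rates per person per year, taken from the problem statement.
tb_rate = Fraction(1, 15000)   # new cases of active tuberculosis
cold_rate = Fraction(3)        # colds

# Probability of visiting a doctor and reporting the symptom, given each cause.
p_report_given_tb = Fraction(60, 100)
p_report_given_cold = Fraction(1, 100)

# Expected doctor visits with the symptom, per person per year, by cause.
tb_visits = tb_rate * p_report_given_tb
cold_visits = cold_rate * p_report_given_cold

# Bayes's theorem in rate form: the share of symptom visits that are due to tuberculosis.
posterior = tb_visits / (tb_visits + cold_visits)
print(posterior, float(posterior))
```

Under these assumptions the common disease dominates even though it rarely produces the symptom, which is the point of the aphorisms.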
3) Suppose that in a survey 19 out of 60 male respondents say
they smoke, and 12 out of 40 female respondents say they smoke.
1) What is the probability that a randomly-chosen respondent is male?
2) What is the probability that a randomly-chosen respondent is a smoker?
3) What is the probability that a randomly-chosen respondent is a
4) What is the probability that a randomly-chosen male is a smoker?
5) What is the probability that a randomly-chosen smoker is a male?
6) Does this mean that smoking is cool?
Adapted from an example at http://people.richland.edu/james/lecture/m170/ch05-cnd.html
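The survey questions come down to dividing counts from a two-way table. The sketch below computes the quantities asked for in the first, second, fourth, and fifth questions, using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the survey in the problem statement.
smokers = {"male": 19, "female": 12}
respondents = {"male": 60, "female": 40}

total = sum(respondents.values())
total_smokers = sum(smokers.values())

# Marginal probabilities: divide by the number of respondents.
p_male = Fraction(respondents["male"], total)
p_smoker = Fraction(total_smokers, total)

# Conditional probabilities: divide by the size of the group we condition on.
p_smoker_given_male = Fraction(smokers["male"], respondents["male"])
p_male_given_smoker = Fraction(smokers["male"], total_smokers)

print(p_male, p_smoker, p_smoker_given_male, p_male_given_smoker)
```

Note that the two conditional probabilities differ because they condition on different groups, which is the same distinction that underlies the fallacies discussed earlier.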