Lecture 12

For today you should:

1) Read Chapter 11 of Think Stats 2e, through Section 11.5.

2) Work on your project (no new journal sections)

3) Prepare for a quiz.  Hint: read Section 10.5 carefully.

Today:

1) Exercise 10.1

2) Effect size

3) Linear regression

For next time you should:

1) Read the rest of Chapter 11 of Think Stats 2e

2) Work on the regression journal entry (see below).

Optional reading:

Nature, "Psychology journal bans P values"

Here's the editorial explaining the decision.

Coming up:

March 13: All required journal elements done.  Evaluation and feedback during break.

March 31: Preliminary report draft due, where "draft" means you have done the best you can, and you would be proud to show it to your collaborator.

April 3: Preliminary report goes out to the collaborator.

April 6-17: Good time for a face-to-face meeting with collaborators.

April 30: Draft final report due.  See prior definition of "draft".

May 7: Final report goes out to the collaborator and final archive/portfolio is due (details to follow).

Regression

For the next section of your journal, you should:

1) Identify a quantity you would like to explain or predict and at least one variable that might have explanatory or predictive power.

2) Design a model that relates these variables and run a regression to fit the parameters of the model.

3) Interpret the results.

For many of you, an analysis like this will be an important component of your project, so I encourage you to think of this as a pilot study.

For others it might be a secondary result, but there might be something in your dataset worth investigating.

If there is really nothing in your dataset that's applicable, consider doing the Chapter 11 exercises instead.

Linear regression

Multiple regression is a versatile tool for

1) Finding relationships between variables while controlling for other factors.

2) Predicting dependent variables that can be expressed as a linear combination of explanatory variables.

Example from Chapter 10:

1) There is a StatSig difference in birth weight between first babies and others.

2) There is a StatSig relationship between birth weight and mother's age.  It might be non-linear.

3) There is a StatSig difference in age between mothers of first babies and other mothers (mothers of first babies tend to be younger).

So we might naturally wonder how much of (1) can be explained by (3).

A back-of-the-envelope calculation (multiplying the age gap between the two groups by the slope of birth weight versus mother's age) suggests that the age difference might account for 50% of the weight difference.

Results from linear regression support this conclusion.
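As a rough sketch, regressions along these lines can be fit with statsmodels, in the spirit of Chapter 11 (this assumes live is the NSFG live-births DataFrame, already loaded, with columns totalwgt_lb, agepreg, and birthord):

import statsmodels.formula.api as smf

# live is assumed to be the NSFG live-births DataFrame from Think Stats
live['isfirst'] = live.birthord == 1
live['agepreg2'] = live.agepreg ** 2

# birth weight vs. first-baby status, controlling for mother's age
results = smf.ols('totalwgt_lb ~ isfirst + agepreg', data=live).fit()
print(results.summary())

# same model with a quadratic term, to allow a nonlinear age effect
results2 = smf.ols('totalwgt_lb ~ isfirst + agepreg + agepreg2', data=live).fit()
print(results2.summary())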

These results suggest:

1) When we control for age, the apparent difference between first babies and others gets smaller and is only borderline StatSig.

2) With the nonlinear age model, the apparent difference is even smaller and no longer StatSig.

3) Most of the p-values are very small, which means that relationships as strong as these would be unlikely to occur by chance (if, in fact, there is no relationship).

4) However, the relationships are quite weak.  The estimated parameters are small, and the R2 values are very low.

In science, you might be interested in the existence of an effect, even if it is small.  But in engineering and medicine, we very often care about effect size, because we care about prediction.

And now let's get back to the papers from last class.

Odds

Odds and probabilities are equivalent ways to quantify uncertainty.

o = p / (1-p)

p = o / (1+o)
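In code, the conversions are one-liners (a minimal sketch; the function names are just for illustration):

def prob_to_odds(p):
    # odds in favor: o = p / (1 - p)
    return p / (1 - p)

def odds_to_prob(o):
    # inverse: p = o / (1 + o)
    return o / (1 + o)

prob_to_odds(0.75)   # 3.0, i.e. odds of 3 to 1
odds_to_prob(3.0)    # 0.75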

From http://en.wikipedia.org/wiki/Odds_ratio

"OR is a measure of effect size, describing the strength of association or non-independence between two binary data values. It is used as a descriptive statistic, and plays an important role in logistic regression."

WARNING: The numbers in the following exercise are based on a little research and some back-of-the-envelope calculations; they might be very wrong:

1) The lifetime risk of lung cancer if you don't smoke is something like 0.5%, but if you do smoke, it's more like 10%.

2) Lifetime risk of HPV infection for sexually active women might be as high as 80%.  It's too early to know how much this might be reduced by vaccination, but just for specificity, let's say that for people who are vaccinated, it's reduced to 10%.

For each of these scenarios, compute

1) The difference in percentages

2) The probability ratio (aka relative risk)

3) The odds ratio

4) The log odds ratio

What are the pros and cons of each way of reporting this info?
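Here is a minimal sketch of the computation, using the rough numbers above (which carry the same warning):

import math

def compare(p0, p1):
    # Return (difference, relative risk, odds ratio, log odds ratio)
    # for baseline risk p0 and exposed risk p1.
    diff = p1 - p0
    rr = p1 / p0
    odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))
    return diff, rr, odds_ratio, math.log(odds_ratio)

print(compare(0.005, 0.10))   # lung cancer: nonsmokers vs. smokers
print(compare(0.10, 0.80))    # HPV: vaccinated (hypothetical 10%) vs. unvaccinated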

Data mining

"Data mining" is a buzzword that can mean different things.  See http://en.wikipedia.org/wiki/Data_mining

What I mean in Chapter 11 is just an automated process for exploring a space of models.

Pro: fast way to explore big datasets and find leads for further exploration.

Con: High risk of false positives.  Also easy to be fooled by coding errors, outliers, etc.
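As a minimal sketch of what I mean (hypothetical names: df is your DataFrame and target is the column you want to predict), the automated search can be as simple as regressing the target on each candidate variable and ranking by R2:

import statsmodels.formula.api as smf

def mine(df, target):
    # Regress target on each other column, one at a time,
    # and return (r_squared, name) pairs sorted best-first.
    variables = []
    for name in df.columns:
        if name == target:
            continue
        try:
            results = smf.ols('%s ~ %s' % (target, name), data=df).fit()
        except Exception:
            # skip columns that can't be used in a formula
            continue
        variables.append((results.rsquared, name))
    return sorted(variables, reverse=True)

Anything near the top of the resulting list is a lead for further exploration, not a finding; that is where the false-positive risk comes in.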