Lecture 11

For today you should:

1) Read Chapter 10 of Think Stats 2e

2) Also read The Jimmy Nut Company Problem

3) Start the hypothesis test section of your journal

Today:

1) Quiz 4 debrief

2) Linear least squares

3) Exercise 10.1

For next time you should:

1) Read Chapter 11 of Think Stats 2e, through Section 11.5.

2) Work on your project (no new journal sections)

3) Prepare for a quiz.  Hint: read Section 10.5 carefully.

Linear least squares

There are many ways to draw a line through a scatterplot.  LLS is popular primarily because it is computationally cheap, which doesn't mean it's the best choice.

From http://en.wikipedia.org/wiki/Linear_least_squares_(mathematics)

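The normal equations can be derived directly from a matrix representation of the problem as follows. The objective is to minimize

$$S(\beta) = \|y - X\beta\|^2 = (y - X\beta)^T (y - X\beta) = y^T y - \beta^T X^T y - y^T X \beta + \beta^T X^T X \beta.$$

Note that $(\beta^T X^T y)^T = y^T X \beta$ has the dimension 1x1 (the number of columns of $y$), so it is a scalar and equal to its own transpose; hence $\beta^T X^T y = y^T X \beta$, and the quantity to minimize becomes

$$S(\beta) = y^T y - 2 \beta^T X^T y + \beta^T X^T X \beta.$$

Differentiating this with respect to $\beta$ and equating to zero to satisfy the first-order conditions gives

$$-X^T y + (X^T X) \beta = 0,$$

which is equivalent to the normal equations, $(X^T X) \beta = X^T y$.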

See also http://en.wikipedia.org/wiki/Matrix_calculus
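To make the normal equations concrete, here is a minimal sketch that solves (X^T X) beta = X^T y with NumPy for a one-variable fit and compares the result with np.polyfit. The function name least_squares and the data are made up for illustration; this is not the code from the book.

```python
import numpy as np

def least_squares(xs, ys):
    """Fit ys ~ inter + slope * xs by solving the normal equations."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # Design matrix X: a column of ones (intercept) and a column of xs.
    X = np.column_stack([np.ones_like(xs), xs])
    # Normal equations: (X^T X) beta = X^T y
    beta = np.linalg.solve(X.T @ X, X.T @ ys)
    inter, slope = beta
    return inter, slope

# Made-up data: a line with intercept 1 and slope 2, plus noise.
rng = np.random.default_rng(0)
xs = rng.uniform(0, 10, size=50)
ys = 1.0 + 2.0 * xs + rng.normal(0, 1, size=50)

inter, slope = least_squares(xs, ys)

# np.polyfit returns coefficients from highest degree down: (slope, intercept).
check_slope, check_inter = np.polyfit(xs, ys, 1)
```

Solving the linear system with np.linalg.solve avoids forming the inverse of X^T X explicitly, which is usually the safer choice numerically.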

Residuals

If the relationship between the variables is linear, the residuals should be random, and might be quite noisy, but there should be no remaining relationship with x.

You can check that by making a scatterplot of residuals versus x, or a visualization like this one:

What does this figure suggest?
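One way to run this check without a plot is to sort the residuals by x, split them into bins, and look at the mean residual in each bin. The sketch below does that for two made-up datasets, one truly linear and one curved; the helper names and the choice of five bins are assumptions for illustration.

```python
import numpy as np

def fit_line(xs, ys):
    # np.polyfit returns (slope, intercept) for degree 1.
    slope, inter = np.polyfit(xs, ys, 1)
    return inter, slope

def residuals(xs, ys, inter, slope):
    """Actual values minus the values predicted by the fitted line."""
    return ys - (inter + slope * xs)

def binned_residual_means(xs, res, num_bins=5):
    """Mean residual in each of num_bins equal-count bins of x."""
    order = np.argsort(xs)
    groups = np.array_split(res[order], num_bins)
    return np.array([g.mean() for g in groups])

rng = np.random.default_rng(17)
xs = rng.uniform(0, 10, size=1000)

# Case 1: the relationship really is linear.
ys_linear = 3 + 2 * xs + rng.normal(0, 1, size=xs.size)
inter, slope = fit_line(xs, ys_linear)
res_linear = residuals(xs, ys_linear, inter, slope)
means_linear = binned_residual_means(xs, res_linear)

# Case 2: the relationship is curved, but we fit a straight line anyway.
ys_curved = 3 + 0.5 * xs**2 + rng.normal(0, 1, size=xs.size)
inter2, slope2 = fit_line(xs, ys_curved)
res_curved = residuals(xs, ys_curved, inter2, slope2)
means_curved = binned_residual_means(xs, res_curved)

# The correlation between x and the residuals is zero in both cases,
# because least squares forces it to be; the bin means show the curve.
corr_curved = np.corrcoef(xs, res_curved)[0, 1]
```

In the linear case the bin means hover near zero. In the curved case they are positive at both ends and negative in the middle, even though the correlation between x and the residuals is zero, which is why a plot or binned summary is more informative than a single correlation.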

Estimation

You can compute the sampling distribution of the estimated parameters by resampling the inputs.

And if the records are weighted, you can take the weights into account during resampling by drawing each row with probability proportional to its weight.

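from thinkstats2 import Cdf  # assumes thinkstats2.py from the book's repository

def ResampleRowsWeighted(df, column='finalwgt'):
    """Resample the rows of df with replacement, weighted by df[column]."""
    weights = df[column]
    # Map each row label to its weight, so labels are drawn in proportion to weight.
    cdf = Cdf(dict(weights))
    indices = cdf.Sample(len(weights))
    # A row can appear more than once in the sample.
    sample = df.loc[indices]
    return sample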
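As a rough illustration of the whole procedure, the sketch below resamples rows many times, refits the line each time, and summarizes the resulting slopes. It uses pandas' built-in DataFrame.sample with its weights argument rather than the Cdf-based helper, and the column names x, y, and wgt and the data are made up.

```python
import numpy as np
import pandas as pd

def resample_rows(df, rng, weights=None):
    """Draw len(df) rows with replacement.

    weights is an optional column name; rows are drawn with
    probability proportional to that column.
    """
    return df.sample(n=len(df), replace=True, weights=weights,
                     random_state=rng)

def sampling_distribution(df, xcol, ycol, iters=200, weights=None, seed=0):
    """Return an array of (intercept, slope) estimates, one per resample."""
    rng = np.random.RandomState(seed)
    estimates = []
    for _ in range(iters):
        sample = resample_rows(df, rng, weights)
        slope, inter = np.polyfit(sample[xcol], sample[ycol], 1)
        estimates.append((inter, slope))
    return np.array(estimates)

# Made-up data: slope 0.5, intercept 1, with arbitrary sampling weights.
rng = np.random.RandomState(1)
n = 500
df = pd.DataFrame({'x': rng.uniform(0, 10, n)})
df['y'] = 1.0 + 0.5 * df['x'] + rng.normal(0, 1, n)
df['wgt'] = rng.uniform(1, 5, n)

est = sampling_distribution(df, 'x', 'y', iters=200, weights='wgt')
inters, slopes = est[:, 0], est[:, 1]

std_err = slopes.std()                # standard error of the slope
ci = np.percentile(slopes, [5, 95])   # 90% confidence interval
```

Passing weights=None gives the unweighted version of the same procedure.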

Hypothesis testing

Two options:

1) Test the whole model.  How likely is it that we would see a fit this good by chance?  This is equivalent to testing the coefficient of correlation.

2) Test whether one of the coefficients might actually be zero.

2a) Simulate a null hypothesis and see how often the simulated slope is at least as large as the actual slope.

2b) Compute the sampling distribution of the slope and see how often it falls on the other side of zero.

Option 2a is more rigorous, but awkward.  Option 2b is approximately correct, and nearly universal in practice.
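Both versions of the slope test can be sketched in a few lines. The code below is one possible implementation, not the one from the book: the null model shuffles the y values (a permutation test), the test statistic is the absolute value of the slope, and the function names and data are made up.

```python
import numpy as np

def slope_of(xs, ys):
    """Least squares slope of ys versus xs."""
    return np.polyfit(xs, ys, 1)[0]

def permutation_pvalue(xs, ys, iters=1000, seed=0):
    """Option 2a: shuffle ys to break any relationship with xs, then
    count how often the simulated slope is at least as large in
    magnitude as the actual slope."""
    rng = np.random.default_rng(seed)
    actual = abs(slope_of(xs, ys))
    count = 0
    for _ in range(iters):
        if abs(slope_of(xs, rng.permutation(ys))) >= actual:
            count += 1
    return count / iters

def resampling_pvalue(xs, ys, iters=1000, seed=0):
    """Option 2b: resample (x, y) pairs with replacement and count how
    often the resampled slope lands on the other side of zero."""
    rng = np.random.default_rng(seed)
    actual = slope_of(xs, ys)
    n = len(xs)
    count = 0
    for _ in range(iters):
        idx = rng.integers(0, n, size=n)
        if slope_of(xs[idx], ys[idx]) * actual <= 0:
            count += 1
    return count / iters

# Made-up data with a real but modest slope.
rng = np.random.default_rng(42)
xs = rng.uniform(0, 10, size=200)
ys = 0.3 * xs + rng.normal(0, 1, size=200)

p_perm = permutation_pvalue(xs, ys)
p_resample = resampling_pvalue(xs, ys)

# Made-up data with no relationship at all, for comparison.
ys_null = rng.normal(0, 1, size=200)
p_null = permutation_pvalue(xs, ys_null)
```

With a real effect, both p-values come out very small. With the unrelated data, the permutation p-value can land anywhere between 0 and 1.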

Exercise

1) git pull upstream master

2) Load chap10soln.ipynb and read through it.  We'll go over it at the end of class.

3) Read "Early Childhood Bereavement Linked to Later Psychosis" and try to find a description of the effect size.  What is the sample size of this study?

4) Read "Severe bereavement stress during the prenatal and childhood periods and risk of psychosis in later life: population based cohort study" and try to find a description of the effect size.

5) In the secondary article, Dr. Melhem said that the take-home message for clinicians is that they should pay particular attention to children exposed to bereavement of a close relative "and to screen and monitor them closely to detect those at risk early on in time for preventions and interventions."

6) Now contrast all that with this article, "Peanut allergies? For children, the best treatment may be peanuts"

Is this recommendation warranted by the results of the study?

Hint: no.