Lecture 13

For today you should have:

  1. Homework 9.
  2. Prepared for a quiz.

Today:

  1. Quiz.
  2. Correlation.
  3. Causation.
  4. Additional topics.

For next time:

  1. There is no next time.
  2. Finish your project.  See the Project page for details, and see the schedule for the due date.


Like variance, covariance is kind of useless by itself, but it is used in other computations.

Pearson's correlation is a standardized covariance, where "standardizing" means dividing the covariance by the product of the standard deviations of the two variables.
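
Here is a minimal sketch of both quantities, using population (divide by n) definitions.  The function names and the height and weight numbers are made up for illustration; they are not from the book or the course code.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    """Population standard deviation."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def cov(xs, ys):
    """Population covariance: mean product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson_corr(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return cov(xs, ys) / (std(xs) * std(ys))

# Hypothetical data, for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weights_kg = [52.0, 58.0, 69.0, 77.0, 90.0]

c = cov(heights_cm, weights_kg)
r = pearson_corr(heights_cm, weights_kg)
print(c, r)
```

Try rescaling one of the lists (say, centimeters to meters) and see which of the two numbers changes.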

What are the units of covariance?  Correlation?

What's Cov(X,X)?

What's the moral of Figure 9.1?

Anscombe's quartet makes a similar point:

Correlation measures linear relationships.  If the relationship is not linear, correlation tends to understate the strength of the relationship, possibly by a lot!
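
A quick way to see this, as a sketch: make y a deterministic but nonlinear function of x and compute Pearson's correlation.  The data here are synthetic (y = x squared on a grid symmetric about zero), chosen only to make the point as starkly as possible.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def pearson_corr(xs, ys):
    """Pearson's correlation from population covariance and std deviations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / len(xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / len(ys))
    return cov / (sx * sy)

# y is completely determined by x, but the relationship is not linear.
xs = [i / 10 for i in range(-50, 51)]
ys = [x * x for x in xs]

r = pearson_corr(xs, ys)
print(r)  # essentially zero, even though y depends entirely on x
```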

Linear least squares

What's so great about least squares fits?

Generally good properties and easy to compute.

But it's not always the right choice.

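Coefficient of determination: the fraction of the variability in the dependent variable that is explained by the model, OR equivalently the fractional reduction in MSE you get by guessing with the model instead of guessing the mean.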
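
To make the two readings concrete, here is a small sketch that fits a least squares line and computes the coefficient of determination as 1 minus (MSE using the model) / (MSE guessing the mean).  The function names and the five data points are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def coef_determination(xs, ys):
    """1 - MSE(model) / MSE(always guessing the mean of ys)."""
    inter, slope = least_squares(xs, ys)
    my = mean(ys)
    n = len(ys)
    mse_model = sum((y - (inter + slope * x)) ** 2 for x, y in zip(xs, ys)) / n
    mse_mean = sum((y - my) ** 2 for y in ys) / n
    return 1 - mse_model / mse_mean

# Hypothetical data, roughly y = 2x with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

inter, slope = least_squares(xs, ys)
r2 = coef_determination(xs, ys)
print(inter, slope, r2)
```

For a simple linear fit like this one, the result equals the square of Pearson's correlation.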

Exercise 9.8: SAT scores and IQ.

Correlation and causation

As always, xkcd says it best:

This is probably familiar territory for you, but just so I feel like I've done my job:

In general, a relationship between two variables does not tell you for sure whether one causes the other, or the other way around, or both, or whether they might both be caused by something else altogether.

So what can you do to provide evidence of causation?

1) Use time.  If A comes before B, then A can cause B but not the other way around.

But this does not preclude spurious relationships.

2) Use randomness.  If you divide a large population into two groups at random, then for any property, X, you expect the difference in the mean of X to be small (with what caveat?).

If the groups are nearly identical in all properties but one, you can eliminate spurious relationships.

This works even if you don't know what the confounding variables are.

But it works even better if you do, because you can check that the groups are identical.
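
A simulation sketch of the random-assignment idea: generate a population with some property X, split it at random into two halves, and compare the group means.  The distribution (Gaussian, mean 100, standard deviation 15), the population size, and the seed are all arbitrary choices for illustration.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(17)  # fixed seed so the run is repeatable

# Hypothetical property X for a population of 10,000 people.
population = [rng.gauss(100, 15) for _ in range(10000)]

# Random assignment: shuffle, then cut the list in half.
rng.shuffle(population)
half = len(population) // 2
group_a = population[:half]
group_b = population[half:]

diff = mean(group_a) - mean(group_b)
print(diff)  # small compared to the standard deviation of 15
```

Try shrinking the population to 20 and rerunning with different seeds to see how the difference behaves.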

If you combine these two ideas, the result is a randomized controlled trial, which is the most reliable way we know of to demonstrate a causal relationship, and the defining characteristic of so-called "Western medicine."

Everything is correlated with income

Unfortunately, controlled trials are only possible in a few domains of knowledge.

The alternatives are:

1) Natural experiments.

    Example: Haiti and the Dominican Republic, in Diamond and Robinson, Natural Experiments of History.

2) Statistical controls: regression analysis.

    Example: age, birth order and birthweight.  See age_lm.py.
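
As a rough illustration of what a statistical control does (this is not the contents of age_lm.py, and the variables are synthetic): below, z is a confounder that drives both x and y, while x has no effect on y at all.  A naive regression of y on x finds a clear slope.  Controlling for z, done here by regressing z out of both x and y and then fitting the residuals, which gives the same slope as including z in a multiple regression, makes the slope nearly vanish.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def fit_line(xs, ys):
    """Least squares (intercept, slope) for ys as a linear function of xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def residuals(xs, ys):
    """What is left of ys after removing the linear effect of xs."""
    inter, slope = fit_line(xs, ys)
    return [y - (inter + slope * x) for x, y in zip(xs, ys)]

rng = random.Random(3)  # fixed seed so the run is repeatable
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]      # confounder
x = [zi + rng.gauss(0, 1) for zi in z]       # x depends on z
y = [2 * zi + rng.gauss(0, 1) for zi in z]   # y depends on z only, not on x

_, naive_slope = fit_line(x, y)
_, controlled_slope = fit_line(residuals(z, x), residuals(z, y))
print(naive_slope, controlled_slope)
```

The catch, compared to a randomized trial, is that this only works for confounders you have thought of and measured.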

Some topics we would cover if we had more time

All good stuff.  If you get a chance to study this later, take it!

Note: (4) is being offered next semester:

Introduction to Stochastic Processes

The course will study basic random processes and their applications. Topics covered will include random walks, Markov chains, Bernoulli and Poisson processes, and if time permits, Brownian Motion, Gaussian Random Processes, and Martingale Theory. Applications in Operations Research (queuing, data networks, traffic), communication systems and information theory (modeling data, signals) and mathematical finance (portfolio theory, gambling).

Just for fun, let's talk a little about #3.

Information Theory

According to Martello, Midnight Ride, Industrial Dawn:

"[Paul Revere] told a group of Patriots in Charlestown to watch for signal lanterns hung from the Old North Church: one lantern meant the British troops would march through Boston neck, and two lanterns meant they planned to cross the Charles River." 

Suppose that you are one of the Patriots watching for this signal.  At the appointed time you see two lanterns at the top of the church.  How much information have you received?  [Hint: what units do we measure information in?]

For simplicity, consider only the explicit information in the message, not the additional information provided by the existence of the signal: for example, that Paul Revere has not been captured.
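
Following the hint, here is a sketch of the standard measure: a message that had probability p of being sent carries log2(1/p) bits.  The notes do not say how likely each route was beforehand, so the probabilities below are hypothetical; plug in your own prior to answer the question.

```python
import math

def information_bits(p):
    """Self-information, in bits, of a message that had probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1 / p)

# Hypothetical prior probabilities for the message you received.
for p in [0.5, 0.25, 0.9]:
    print(p, information_bits(p))
```

Notice that the less likely the message was, the more information it carries.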