How are causation and correlation related?

Week 6: Correlation and Causation

During the sixth week of class, we will discuss the connections and distinctions between causation and correlation via an activity. The assignments to be completed by the start of class are in this font, to distinguish them from the rest of the text.

Please read this article. Then, read this and this. In about half a page, describe your thoughts regarding the latter two articles, in light of the first. In particular, answer the following questions about each of the latter two articles:

Activity Introduction

This week we'll be doing the activity below. For each plot shown, we will try to determine whether the data represent a real causal relationship.

First, we will attempt to draw best fit lines and determine the extent to which the data correlate. For each plot, do your best to draw a line that seems to best fit the data; use your intuition. Then, for each line, compute the slope and y-intercept. Here is a reference that might help you. The plots can be downloaded here.
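
As a rough illustration of the slope and y-intercept step, the sketch below reads two points off a hand-drawn line and computes both quantities. The function name and the sample points are hypothetical and are not taken from the plots.

```python
def line_from_two_points(x1, y1, x2, y2):
    """Return (slope, intercept) of the straight line through two points."""
    if x1 == x2:
        raise ValueError("The two points must have different x-values.")
    slope = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - slope * x1     # solve y1 = slope * x1 + intercept
    return slope, intercept

# Hypothetical points read off a drawn line (not data from the plots)
slope, intercept = line_from_two_points(1.0, 3.0, 5.0, 11.0)
print(slope, intercept)  # 2.0 1.0
```

Picking two points that are far apart on the line tends to reduce the effect of small reading errors on the slope.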

Plot 1

Plot 2

Plot 3

Plot 4

Correlation/Causation Activity

For each of the above plots, we will compute the coefficient of determination. This is a measure of how well the best fit lines fit the data. First, let us calculate the average, or mean, of the data. This is given by the sum of the y-values divided by the number of y-values:

$$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$$

where y_i represent y-values (the subscript i is a label; the y-values are labeled y_1, y_2, etc.), and N is the total number of y-values. The y with a bar over it refers to the average of y. Compute this for each plot.
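
For those who prefer to check the arithmetic by computer, here is a minimal sketch of the mean calculation. The function name and the sample y-values are made up for illustration.

```python
def mean(y_values):
    """Sum of the y-values divided by the number of y-values."""
    if not y_values:
        raise ValueError("Need at least one y-value.")
    return sum(y_values) / len(y_values)

# Hypothetical y-values (not data from the plots)
y = [2.0, 4.0, 6.0, 8.0]
y_bar = mean(y)
print(y_bar)  # 5.0
```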

Next, we're going to compute the residual sum of squares, which is a measure of how different the data are from the best fit line. For each y_i, we are going to compute the difference between y_i and the associated y-value on the best fit line. (These values will share the same x-value.) Then, we will square each difference and sum these values. This procedure is defined by

$$S_r = \sum_{i=1}^{N} \left( y_i - f_i \right)^2$$

where f_i denotes the y-value on the best fit line at the same x-value as y_i.
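
The residual sum of squares can be sketched in a few lines, assuming the best fit line is written as y = slope * x + intercept. The function name, data points, and line parameters below are hypothetical.

```python
def residual_sum_of_squares(xs, ys, slope, intercept):
    """Sum of squared differences between each y-value and the line at the same x."""
    total = 0.0
    for x, y in zip(xs, ys):
        on_line = slope * x + intercept   # y-value on the best fit line at this x
        total += (y - on_line) ** 2
    return total

# Hypothetical data and a hypothetical hand-drawn line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 8.0, 9.0]
s_r = residual_sum_of_squares(xs, ys, slope=2.0, intercept=1.0)
print(s_r)  # 1.0
```

Only the third point lies off the line (8 versus 7), so it alone contributes to the sum.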

Finally, we will compute the total sum of squares, which we will call S_t. This is similar, except that we will compare each y_i to the average of y:

$$S_t = \sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2$$
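
A corresponding sketch for the total sum of squares is below. It needs only the y-values, since each one is compared to the mean rather than to the line. The function name and data are again hypothetical.

```python
def total_sum_of_squares(ys):
    """Sum of squared differences between each y-value and the mean of y."""
    y_bar = sum(ys) / len(ys)
    return sum((y - y_bar) ** 2 for y in ys)

# Hypothetical y-values (not data from the plots); their mean is 6.25
ys = [3.0, 5.0, 8.0, 9.0]
s_t = total_sum_of_squares(ys)
print(s_t)  # 22.75
```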

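The coefficient of determination is

$$R^2 = 1 - \frac{S_r}{S_t}$$

where S_r is the residual sum of squares and S_t is the total sum of squares.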

Let us discuss how to interpret the coefficient of determination. If the best fit line fits the data perfectly, all of the data points will fall along the line. Thus, S_r will be zero, and the coefficient of determination will be one. Conversely, if the line were drawn arbitrarily, the residual sum of squares would typically be much closer to the total sum of squares. When the best fit line fits very poorly, the two sums are similar in size, so the coefficient of determination will be close to zero.
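
The sketch below puts these pieces together and checks the interpretation on made-up data. It assumes the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares, with the line written as y = slope * x + intercept. The function name and data are hypothetical.

```python
def coefficient_of_determination(xs, ys, slope, intercept):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_bar = sum(ys) / len(ys)
    s_r = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    s_t = sum((y - y_bar) ** 2 for y in ys)
    if s_t == 0:
        raise ValueError("All y-values are equal; the ratio is undefined.")
    return 1 - s_r / s_t

# Hypothetical data lying exactly on y = 2x + 1 (mean of y is 4)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

r2_perfect = coefficient_of_determination(xs, ys, 2.0, 1.0)     # line through every point
r2_flat = coefficient_of_determination(xs, ys, 0.0, 4.0)        # horizontal line at the mean
r2_wrong_way = coefficient_of_determination(xs, ys, -2.0, 7.0)  # line sloping the wrong way
print(r2_perfect, r2_flat, r2_wrong_way)  # 1.0 0.0 -3.0
```

The last case shows that a hand-drawn line can fit worse than a flat line at the mean, in which case this formula gives a negative value.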

Compute the coefficient of determination for each plot. Make brief remarks about how well each of your best fit lines fits the data. 

Then, see if you can determine whether each plot represents a causal relationship. To guide you, select from the following possibilities. Each plot is one of the following:


I will then tell you the origin of each of these plots. Once I do so, please comment on what effect this has on your perception of correlation and causation. 

Submit Assignment

To sign in, you must input your CUNY credentials ("firstname.lastnameXX@login.cuny.edu", where "XX" are the last two digits of your student ID). You cannot use "qmail" credentials. If you get an error, please log out of your email/Office365 and then click on the link below.