Simpson's Paradox

Post date: Sep 04, 2017 6:46:21 PM

One instance of Simpson's paradox is the ecological fallacy, which is the (mistaken) assumption that relationships evident at an aggregate level also hold at the individual level. In other words, aggregated data can give misleading results about individuals.

For example, suppose that, within a given neighborhood, home prices tend to increase with the number of bedrooms. Below is a scatter plot showing data for four different areas (neighborhoods), identified by color. Each dot is a home. As you can see, within each area, home prices increase with the number of bedrooms.

Now imagine that instead of plotting individual homes, you had plotted the average home price in each area against the average number of bedrooms. This would yield just four data points, and they would fall along a line with a negative slope. In the aggregated data, it looks as though the more bedrooms a house has, the lower its price, which is wrong.

Of course, it isn't aggregation per se that causes the problem. A regression of price on bedrooms using the home-level data in the first plot, with all areas pooled together, would in fact yield a negative slope, just like the aggregated data. The real issue is that a key third variable, namely neighborhood, isn't being controlled for. That's Simpson's paradox: if, in correlating X and Y, you fail to control for a variable Z that affects both of them, the association you get can be the reverse of the one that holds within each level of Z. This is a case of omitted variable bias.
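
To make this concrete, here is a minimal sketch in Python using only the standard library. The numbers are invented for illustration and are not the data behind the plots. The helper ols_slope and the area names are hypothetical. Prices are in arbitrary units. The data are built so that price rises by 20 per extra bedroom within every area, while areas with bigger homes are cheaper overall.

```python
from statistics import mean

def ols_slope(xs, ys):
    """Slope of the least-squares line of ys regressed on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Made-up data: 4 areas, 3 homes each, stored as (area, bedrooms, price).
# Within an area, each extra bedroom adds 20 to the price.
# Across areas, homes get bigger but the base price drops by 100.
homes = []
for k in range(4):
    for extra in range(3):
        homes.append((f"area_{k}", k + 1 + extra, 500 - 100 * k + 20 * extra))

areas = sorted({area for area, _, _ in homes})

def in_area(area):
    """Bedroom counts and prices for the homes in one area."""
    beds = [b for a, b, _ in homes if a == area]
    prices = [p for a, _, p in homes if a == area]
    return beds, prices

# 1. Slope within each area (the individual-level relationship).
within_slopes = {area: ols_slope(*in_area(area)) for area in areas}

# 2. Slope through the four area averages (the aggregated data).
avg_beds = [mean(in_area(area)[0]) for area in areas]
avg_price = [mean(in_area(area)[1]) for area in areas]
aggregated_slope = ols_slope(avg_beds, avg_price)

# 3. Slope from pooling every home and ignoring area.
pooled_slope = ols_slope([b for _, b, _ in homes], [p for _, _, p in homes])

# 4. Controlling for area: subtract each area's averages before fitting,
#    which is equivalent to giving every area its own intercept.
centered_beds, centered_price = [], []
for area in areas:
    beds, prices = in_area(area)
    centered_beds += [b - mean(beds) for b in beds]
    centered_price += [p - mean(prices) for p in prices]
controlled_slope = ols_slope(centered_beds, centered_price)

print("within each area:", within_slopes)
print("area averages:   ", aggregated_slope)
print("pooled homes:    ", pooled_slope)
print("area controlled: ", controlled_slope)
```

With these made-up numbers, the slope is positive within every area, negative for both the area averages and the pooled homes, and positive again once area is controlled for. Subtracting area averages is just one simple way to control for neighborhood. Adding area indicator variables to the regression would be another.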