Discussion of Chapters 10 and 11

Chapter 10

Swapping order of axes

"Plot A vs. B" conventionally means A is on the y-axis and B is on the x-axis.  Even setting that convention aside, the problem says to predict someone's weight from their height, which should have made things less ambiguous.  The thing being predicted goes on the y-axis and the predictor variable goes on the x-axis.

How to best present parameters

Most people either didn't understand this question or interpreted it in different ways.  I took it to mean how to make the parameters intuitive for a person to understand.  See the solutions in the repository for my version of this.

How to interpret R^2

People were unsure how to interpret R^2 values.  An R^2 of .28 seems small, but for some problems it would be immense.  Working in the log domain makes interpretation even harder, so I suggest moving back to the original domain for more clarity.

Dropna issues

Some people did dropna on individual pandas series and then found that the number of elements no longer matched up.  The trick is to do dropna on the whole data frame (or on a subset of the columns using the subset kwarg).
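Here is a small sketch of the difference.  The column names and values are made up for illustration; they are not from the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are invented for this example.
df = pd.DataFrame({
    "height_cm": [170.0, np.nan, 182.0, np.nan, 165.0],
    "weight_kg": [68.0, 75.0, np.nan, 80.0, 59.0],
    "age": [34.0, 51.0, 29.0, np.nan, 40.0],
})

# Dropping NaNs one series at a time keeps different rows for each column,
# so the two results have different lengths and no longer line up.
heights_only = df["height_cm"].dropna()   # keeps rows 0, 2, 4
weights_only = df["weight_kg"].dropna()   # keeps rows 0, 1, 3, 4

# Dropping on the data frame removes a row if ANY listed column is missing,
# so the surviving columns stay aligned.  "age" is not in the subset, so a
# missing age does not cause a row to be dropped.
clean = df.dropna(subset=["height_cm", "weight_kg"])
heights = clean["height_cm"]
weights = clean["weight_kg"]

print(len(heights_only), len(weights_only), len(heights), len(weights))
```

Note that even when the per-series lengths happen to match, the rows can still be misaligned, so the data frame version is the safer habit.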

How much does it help?

There was some disagreement regarding "How much would it help?".  This question presumes a specific use for this type of prediction.  In what contexts would this be helpful or not helpful?

Vocabulary issue with correlation

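Some people said 28% correlation when they really meant a .28 coefficient of determination (R^2).  Another way to say it is that factor A explains 28% of the variance in factor B.  For a linear fit with a single predictor, the correlation coefficient is the square root of R^2 in magnitude, so an R^2 of .28 corresponds to a correlation of about .53.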
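To see how the two numbers relate, here is a minimal sketch on synthetic data (the data and the 0.6 slope are arbitrary choices for illustration).  It computes the correlation coefficient and, separately, R^2 from the residuals of a straight-line fit.

```python
import numpy as np

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.6 * x + rng.normal(size=1000)

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

# R^2 from a least squares line: the fraction of the variance in y
# that is left unexplained is var(residuals) / var(y).
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared = 1 - residuals.var() / y.var()

print(f"r = {r:.3f}, r squared = {r**2:.3f}, R^2 from fit = {r_squared:.3f}")
```

For a single predictor the last two numbers agree, which is why "28% of the variance explained" and "a correlation of .28" are not interchangeable statements.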

Be careful with the base of your logarithm

Make sure to label the base on any graphs.  Make sure to use the matching base when inverting the log transform: raise 10 to the value if you used log base 10, and use e if you used the natural log.
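A quick sketch of the round trip, using made-up values.  The point is only that the inverse has to match the base of the log you took.

```python
import numpy as np

# Made-up weights for illustration.
weights_kg = np.array([50.0, 70.0, 90.0])

log10_w = np.log10(weights_kg)   # base 10
ln_w = np.log(weights_kg)        # natural log (base e)

# Correct inverses: 10**x undoes log10, exp undoes the natural log.
back_from_log10 = 10 ** log10_w
back_from_ln = np.exp(ln_w)

# Mismatched inverse: exp applied to base-10 logs does NOT recover the data.
wrong = np.exp(log10_w)

print(back_from_log10, back_from_ln, wrong)
```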

Interpreting the Quantile Residual Plot

We'll go over this as a class to better understand what it shows us.

Some Vocabulary Used Before Being Defined

I wasn't as careful as I should have been when assigning the reading out of order.  This caused some confusion.  For instance, someone didn't understand the resampling part in 10.4.  This was not surprising given the chapters we skipped.  I will be more careful about this in the future.  If you get stuck on vocabulary, please post on Piazza.

Motivation for using a log transform

We'll discuss this as a class, but my answer is that it changes the error function that the least squares regression minimizes.
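One way to read this point, sketched with made-up numbers: in the original domain a squared error grows with the size of the value, while in the log domain the squared error depends only on the ratio between the prediction and the actual value.  This is an illustration of the idea, not the only motivation for a log transform.

```python
import numpy as np

# Made-up values; every prediction is 10% too high.
actual = np.array([50.0, 100.0, 200.0])
predicted = actual * 1.10

# Squared error in the original domain: larger values are penalized more.
raw_sq_err = (predicted - actual) ** 2

# Squared error in the log domain: the same 10% miss costs the same
# no matter how large the value is.
log_sq_err = (np.log10(predicted) - np.log10(actual)) ** 2

print(raw_sq_err)
print(log_sq_err)
```

So a least squares fit on log-transformed values is, in effect, minimizing relative rather than absolute error.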

What other tractable optimization problems are there?

Linear least squares is certainly one of the most common.  However, a larger class of tractable problems falls under the banner of convex optimization.  If you are interested in this, I can suggest more reading.

Chapter 11

Some Vocabulary Used Before Being Defined

This was also an issue with this chapter.  Again, sorry, and please post on Piazza so we can collectively work through any difficulties.

Coupling Factors

Someone asked whether combinations of factors (e.g. age + income) would be worth including in the data mining.  We'll discuss this as a class.

Using a Library Versus Building Your Own

Some people are uncomfortable using ThinkStats2 instead of implementing their own version of an algorithm.  I'll discuss some tradeoffs here.  The bottom line is that you should use external libraries to solve problems unless otherwise instructed.

Regression versus Classification

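At least one person was confused about predicting the birth date of the baby.  The issue was rooted in confusion regarding regression versus classification.  Regression predicts a numeric quantity, while classification predicts which category something belongs to.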

Patsy Syntax

There was confusion about logical expressions in a Patsy regression formula.  Think of a logical expression as making a binary indicator variable.  This indicator variable then gets a weight assigned to it through regression.
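To make the indicator idea concrete, here is a sketch that builds the indicator by hand with NumPy and fits it with ordinary least squares.  It does not call Patsy; it only mimics what a logical term in a formula amounts to.  The variable names and the data are made up.

```python
import numpy as np

# Made-up data for illustration.
birth_order = np.array([1, 2, 1, 3, 1, 2])
weight_lb = np.array([7.0, 7.6, 7.2, 7.8, 6.8, 7.4])

# The logical expression becomes a column of 0s and 1s.
is_first = (birth_order == 1).astype(float)

# Design matrix: an intercept column plus the indicator column.
X = np.column_stack([np.ones_like(is_first), is_first])

# Ordinary least squares assigns one weight to each column.
coef = np.linalg.lstsq(X, weight_lb, rcond=None)[0]
intercept, indicator_weight = coef

print(f"intercept = {intercept:.2f}, indicator weight = {indicator_weight:.2f}")
```

With only an intercept and one indicator, the intercept is the mean of the group where the expression is false, and the indicator's weight is the difference between the two group means.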

What Does Joining the Data Do?

You'll see this in my solution.  Basically, it puts two data frames together by matching up the values in a particular column.
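A minimal sketch of a join using pandas merge.  The frames, the column names, and the choice of merge with an inner join are assumptions for illustration; my solution may use a different call.

```python
import pandas as pd

# Made-up tables: one row per respondent, one row per pregnancy.
respondents = pd.DataFrame({
    "caseid": [1, 2, 3],
    "age": [25, 31, 40],
})
pregnancies = pd.DataFrame({
    "caseid": [1, 1, 3, 4],
    "birth_weight_lb": [7.1, 6.8, 8.0, 7.5],
})

# Match rows wherever the "caseid" values agree.  With how="inner",
# rows without a match in the other frame are dropped (caseid 4 here),
# and respondent 1's age is repeated for each of her pregnancies.
joined = pregnancies.merge(respondents, on="caseid", how="inner")

print(joined)
```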

Wading through Lots of Features

This is hard.  Make your life easier by writing modular code.  Having columns bubble up to the top using some criterion (e.g. correlation with a relevant variable) is a good way to navigate when you have massive numbers of variables to test.  Another filter you can apply is to use your intuition to pick out columns that you feel would be interesting or important.
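As a sketch of the "bubble up" idea, the helper below ranks columns by the absolute value of their correlation with a target column.  The function name, the column names, and the synthetic data are all hypothetical, and correlation is just one possible criterion.

```python
import numpy as np
import pandas as pd

# Synthetic data, made up for illustration.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
df["target"] = 2.0 * df["strong"] + 0.3 * df["weak"] + rng.normal(size=n)


def rank_by_correlation(frame, target):
    """Rank numeric columns by |correlation| with target, strongest first.

    Hypothetical helper for this example, not part of any library.
    """
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


ranking = rank_by_correlation(df, "target")
print(ranking)
```

Keeping the criterion inside one small function makes it easy to swap in a different one later without touching the rest of your code.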

Overview of ThinkStats Data Mining Code

We'll go over this version as a class, and I'll show you a similar one that I made using scikit-learn.