Lecture 07

For today you should:

1) Read Chapter 7 of Think Stats 2e

2) Work on journal entries, including data inventory (see below)

Today:

1) Relationships between variables

For next time you should:

1) Read Chapter 8 of Think Stats 2e

2) Complete data inventory, start Exploring relationships (see below)

3) Prepare for a quiz on Chapters 1-8

Quiz 2

Let's make some adjustments that will help with both quizzes and execution of the projects:

1) NINJAs, yay!  

2) We'll take some time to learn the APIs.

3) I'll try to go over exercises in class, but if you find that you can't do them, we should fix that.

4) You are responsible for maintaining a working software environment.  We'll help, but if something is broken, fixing it is a priority.

Some practical tips:

1) The libraries are your toolkit.  The better you know them, the more effective you will be.

2) Use functional decomposition on quizzes and in real life.  If I ask for a function, just write the function.

Relationships between variables

Jittering, which adds a small amount of random noise to each value, is useful for values that have been rounded off and for ordinal variables (like the income field in the NSFG data).
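As a rough illustration of jittering, here is a minimal sketch using numpy. The helper name `jitter`, the uniform noise, the default width of 0.5, and the sample heights are all assumptions made for this example; this is not the thinkstats2 implementation.

```python
import numpy as np

def jitter(values, width=0.5, seed=None):
    """Return a copy of values with uniform noise in [-width, +width) added.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-width, width, size=values.shape)

# Made-up heights reported to the nearest inch: the points pile up on integers.
heights = np.array([64, 64, 65, 66, 66, 66, 70])

# After jittering, repeated values no longer land on exactly the same spot.
jittered = jitter(heights, width=0.5, seed=17)
```

A width of 0.5 matches rounding to the nearest unit. A larger width spreads the points more but blurs the data, so jitter is best used for plotting only, not for analysis.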

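Scatterplots can be hard to get right.  Avoid being misled by ink saturation: where many points overlap, dense regions all look equally dark, so the plot hides how the data are actually distributed.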
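One way to sidestep saturation is to count the points in each cell of a grid instead of drawing every point. The sketch below does the counting with numpy; the data are synthetic and the choice of 20 bins is arbitrary, both assumptions made for this example.

```python
import numpy as np

# Synthetic data standing in for two related measurements.
rng = np.random.default_rng(1)
n = 5000
xs = rng.normal(170, 8, size=n)
ys = 0.9 * xs + rng.normal(0, 10, size=n)

# Count the points that fall in each cell of a 20 x 20 grid.
counts, x_edges, y_edges = np.histogram2d(xs, ys, bins=20)

# The counts keep growing in dense regions, where overlapping markers
# would have saturated to solid ink.
densest = counts.max()
```

The `counts` array can then be drawn as a color-coded grid, so that density maps to color rather than to overlapping markers.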

Another way to visualize relationships is to bin one variable and plot percentiles of the other variable within each bin.

But this has the same binning problem as histograms, and it is easy to get distracted by anomalies at the extremes that are due to small numbers.
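The binning approach can be sketched as follows. The function name `binned_percentiles`, the `min_count` cutoff, and the synthetic data are assumptions made for this example; dropping sparse bins is one possible way to limit the small-number anomalies at the extremes.

```python
import numpy as np

def binned_percentiles(xs, ys, bin_edges, percents=(25, 50, 75), min_count=10):
    """Group ys by bins of xs and compute percentiles of ys in each bin.

    Returns one row per bin: mean of xs in the bin, then the percentiles.
    Bins with fewer than min_count points are skipped.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # np.digitize returns i when bin_edges[i-1] <= x < bin_edges[i].
    indices = np.digitize(xs, bin_edges)
    rows = []
    for i in range(1, len(bin_edges)):
        mask = indices == i
        if mask.sum() < min_count:
            continue  # too few points for stable percentiles
        rows.append((xs[mask].mean(), *np.percentile(ys[mask], percents)))
    return np.array(rows)

# Synthetic data with a roughly linear relationship.
rng = np.random.default_rng(0)
xs = rng.uniform(0, 10, size=1000)
ys = 2 * xs + rng.normal(0, 1, size=1000)

table = binned_percentiles(xs, ys, bin_edges=np.arange(0, 11, 2))
```

Each column of `table` after the first can be plotted against the first column to get one line per percentile.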

Correlation

If you know you are dealing with a relationship that is approximately linear, you can quantify the strength of the relationship by computing a coefficient of correlation.

http://wikipedia.org/wiki/Correlation_and_dependence

If the relationship is non-linear but monotonic, you might be better off with Spearman's rank correlation.

Spearman's is also more robust to outliers and skewed distributions, because it depends only on the ranks of the values.
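To make the difference concrete, here is a minimal sketch of both coefficients in numpy. The function names and the example data are assumptions made for this illustration; Spearman's correlation is computed here as Pearson's correlation of the ranks, with tied values given their average rank.

```python
import numpy as np

def pearson_corr(xs, ys):
    """Pearson's correlation: covariance divided by the product of std devs."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    dx = xs - xs.mean()
    dy = ys - ys.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranked = np.empty(len(values))
    ranked[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranked[tied] = ranked[tied].mean()
    return ranked

def spearman_corr(xs, ys):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson_corr(ranks(xs), ranks(ys))

# A monotonic but strongly non-linear relationship.
xs = np.arange(1, 11)
ys = np.exp(xs)

pearson = pearson_corr(xs, ys)
spearman = spearman_corr(xs, ys)
```

Because `ys` increases whenever `xs` does, the ranks agree exactly and Spearman's correlation is 1 (up to floating-point error), while Pearson's is noticeably lower because the relationship is not linear.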

Exploring relationships

Your data inventory mostly describes variables one at a time.

The next step is to explore relationships between variables.

For continuous (and ordinal) variables, you can apply the techniques in the book.

For categorical variables, you might have to improvise.  We can talk about this during our meetings.

The next section of your journal should document these explorations.

Exercises

thinkstats2

1) Read the documentation of thinkstats2 and answer the following questions:

a) What are the attributes of a Cdf?

b) What methods modify a Cdf?

c) How do you iterate through the values and probabilities in a Cdf?

d) How do you iterate through the values and frequencies in a Hist?

e) What methods modify Pmfs and Hists?

Note: In the list of functions, there are several MakePmfFromX and MakeCdfFromY functions that are no longer needed because the Pmf and Cdf constructors can handle more types now.

pandas

2) Read the documentation of pandas.DataFrame and find three methods we have seen so far in the book.

3) Read the documentation of pandas.Series and find three methods we have seen so far in the book.

numpy

4) Read the documentation of numpy.array and find three ways to create a new array.

Note: most operations on DataFrames and Series are provided as methods, but many numpy operations are only available as functions.
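As a small illustration of this note, the sketch below computes a median both ways. The weights are made-up values used only for this example.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
weights = [7.1, 6.5, 8.2, 7.7, 6.9]
series = pd.Series(weights)
array = np.array(weights)

# pandas provides the median as a method on the Series.
series_median = series.median()

# numpy provides it only as a function; arrays have no median method.
array_median = np.median(array)
has_median_method = hasattr(array, "median")

# Some operations, like the mean, are available both ways in numpy.
array_mean = array.mean()
function_mean = np.mean(array)
```

When a method you expect is missing on an array, it is worth checking whether numpy offers it as a function instead.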

Do Exercise 7.1: Using data from the NSFG, make a scatter plot of birth weight versus mother’s age. Plot percentiles of birth weight versus mother’s age. Compute Pearson’s and Spearman’s correlations. How would you characterize the relationship between these variables?