Lecture 06

For today you should:

1) Read Chapter 6 of Think Stats 2e

2) Prepare for a quiz

3) Start your journal (see Lecture 05)

Today:

1) Quiz

2) PDFs

3) Chapter 6 exercises

4) Meetings

For next time you should:

1) Read Chapter 7 of Think Stats 2e

2) Work on journal entries, including data inventory (see below)

Optional reading: Are your data normal?

PDFs

For continuous quantities, the PDF maps from values to "probability density". A density is not itself a probability, but you can integrate it over an interval to get one.

For analytic distributions, we usually have the PDF in closed form.
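As a quick illustration (using SciPy rather than the book's code), the standard normal PDF is available in closed form, and integrating it over an interval yields a probability:

    from scipy.stats import norm
    from scipy.integrate import quad

    # A density is not a probability; for a narrow distribution it can exceed 1.
    print(norm.pdf(0))            # ~0.3989, density of the standard normal at 0

    # Integrating the density from -1 to 1 gives P(-1 < X < 1), about 0.683.
    p, _ = quad(norm.pdf, -1, 1)
    print(p)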

For empirical distributions, we can estimate the PDF by kernel density estimation (KDE), which is necessarily based on assumptions about smoothness.
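Here is a sketch of KDE using scipy.stats.gaussian_kde, which is the machinery thinkstats2's EstimatedPdf is built on; the smoothness assumption enters through the choice of kernel and bandwidth:

    import numpy as np
    from scipy.stats import gaussian_kde

    np.random.seed(17)
    sample = np.random.normal(loc=0, scale=1, size=500)

    kde = gaussian_kde(sample)       # Gaussian kernels, bandwidth by Scott's rule
    xs = np.linspace(-4, 4, 101)
    densities = kde(xs)              # estimated density at each point in xs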

The Pdf class is just a convenience for working with analytic and estimated PDFs.
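Usage looks roughly like this (a sketch; see Chapter 6 for the actual interface):

    import thinkstats2
    import thinkplot

    # Analytic PDF: a normal distribution with given parameters.
    pdf = thinkstats2.NormalPdf(163, 7.3)
    print(pdf.Density(170))              # density at a single point

    # Estimated PDF: KDE fit to a sample.
    sample = [160, 162, 163, 165, 168, 171, 173]
    est_pdf = thinkstats2.EstimatedPdf(sample)
    thinkplot.Pdf(est_pdf, label='KDE')
    thinkplot.Show(xlabel='value', ylabel='density')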

Distributions

Chapter 6 has some information about the implementation of these classes.

Moments

From Wikipedia: Moment (mathematics)
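To make the definitions concrete, here they are written out directly with NumPy (a sketch; thinkstats2 provides equivalent functions):

    import numpy as np

    def raw_moment(xs, k):
        """kth raw moment: the mean of x**k."""
        return np.mean(np.asarray(xs) ** k)

    def central_moment(xs, k):
        """kth central moment: the mean of (x - mean)**k."""
        xs = np.asarray(xs)
        return np.mean((xs - xs.mean()) ** k)

    def skewness(xs):
        """Sample skewness, g1 = m3 / m2**1.5."""
        return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

    def pearson_median_skewness(xs):
        """Pearson's median skewness, 3 * (mean - median) / std."""
        xs = np.asarray(xs)
        return 3 * (xs.mean() - np.median(xs)) / xs.std()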

Exercise

The distribution of income is famously skewed to the right. In this exercise, we’ll measure how strong that skew is.

The Current Population Survey (CPS) is a joint effort of the Bureau of Labor Statistics and the Census Bureau to study income and related variables. Data collected in 2013 is available from http://www.census.gov/hhes/www/cpstables/032013/hhinc/toc.htm. I downloaded hinc06.xls, which is an Excel spreadsheet with information about household income, and converted it to hinc06.csv, a CSV file you will find in the repository for this book. You will also find hinc2.py, which reads this file and transforms the data.

The dataset is in the form of a series of income ranges and the number of respondents who fell in each range. The lowest range includes respondents who reported annual household income “Under $5000.” The highest range includes respondents who made “$250,000 or more.”

To estimate the mean and other statistics from these data, we have to make some assumptions about the lower and upper bounds, and how the values are distributed in each range. hinc2.py provides InterpolateSample, which shows one way to model this data. It takes a DataFrame with a column, income, that contains the upper bound of each range, and freq, which contains the number of respondents in each range.

It also takes log_upper, which is an assumed upper bound on the highest range, expressed in log10 dollars. The default value, log_upper=6.0, represents the assumption that the largest income among the respondents is 10^6, or one million dollars.

InterpolateSample generates a pseudo-sample; that is, a sample of household incomes that yields the same number of respondents in each range as the actual data. It assumes that incomes in each range are equally spaced on a log10 scale.
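Here is a rough sketch of the idea (the $1000 lower bound on the lowest range and the handling of the open-ended top range are assumptions of this sketch; the real InterpolateSample in hinc2.py is the reference):

    import numpy as np

    def interpolate_sample(df, log_upper=6.0):
        """Sketch: df has columns income (upper bound of each range) and freq.

        Returns a pseudo-sample of log10 incomes, with freq values
        equally spaced on a log10 scale within each range.
        """
        log_uppers = np.log10(df['income'].values.astype(float))
        log_uppers[-1] = log_upper        # assumed cap on the open-ended top range
        # assume the lowest range starts at $1000, i.e. 10**3
        log_lowers = np.concatenate([[3.0], log_uppers[:-1]])

        arrays = [np.linspace(low, high, int(freq))
                  for low, high, freq in zip(log_lowers, log_uppers, df['freq'])]
        return np.concatenate(arrays)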

Compute the median, mean, skewness and Pearson’s skewness of the resulting sample. What fraction of households reports a taxable income below the mean? How do the results depend on the assumed upper bound?
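Once you have the pseudo-sample, computing the requested statistics might look like this (a sketch: log_sample stands in for the output of InterpolateSample, and skew comes from scipy.stats):

    import numpy as np
    from scipy.stats import skew

    # log_sample = InterpolateSample(df, log_upper=6.0)   # the real thing, from hinc2.py
    log_sample = np.random.uniform(3, 6, size=10000)      # stand-in data for illustration
    sample = np.power(10, log_sample)                     # convert log10 dollars to dollars

    mean, median = sample.mean(), np.median(sample)
    print('mean:', mean, 'median:', median)
    print('skewness:', skew(sample))
    print('Pearson median skewness:', 3 * (mean - median) / sample.std())
    print('fraction below mean:', (sample < mean).mean())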

Data inventory

After your project description and data management plan, the next section in your journal should be a data inventory, which should answer (but not include verbatim) these questions:

1) Where does the data come from?  Who collected it, how, and why?  Based on that background, what problems are likely to be present?

2) How much data is there in terms of storage size, and also in terms of rows and columns (or whatever other metric is appropriate)?  If there are multiple files/tables, explain the structure.

3) Give an overview of the variables in the dataset, and details about the ones you are most likely to work with.

4) For the most important variables, generate distributions and/or summaries.  I generally start with CDFs (see the sketch after this list), but for each variable, you should choose the best visualization/summary.

5) Validate the data, both internally and (if possible) externally.  If there are validation issues, quantify them.  Postpone value judgments about the quality of the data.  You don't know yet whether it will be good enough; for now, just describe.

6) What additional questions should you address, given the unique nature of your data?
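For item 4, a minimal first-look CDF might look like this (the sample is a stand-in for one of your own variables):

    import numpy as np
    import thinkstats2
    import thinkplot

    values = np.random.normal(50, 10, size=1000)   # stand-in for one of your variables
    cdf = thinkstats2.Cdf(values, label='example variable')
    thinkplot.Cdf(cdf)
    thinkplot.Show(xlabel='value', ylabel='CDF')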

The target audience of the data inventory is the instructors.  You should answer the questions an interested, curious person would naturally ask about your data.

Exploring a new dataset should be like opening presents!