For today you should have:
For next time:
Let's review Pmf.html.
Draw a class diagram for the classes in Pmf.py.
Review the methods for Hist and Pmf.
What is the difference between a Hist and a Pmf?
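One way to frame the difference, as a minimal sketch using plain dictionaries rather than the actual classes in Pmf.py: a histogram maps each value to a count, while a PMF maps each value to a probability, so its values sum to 1. The function names make_hist and make_pmf are made up for this illustration, and the values are toy data.

```python
from collections import Counter


def make_hist(values):
    """Map each value to the number of times it appears."""
    return dict(Counter(values))


def make_pmf(hist):
    """Normalize the counts in a histogram so they sum to 1."""
    total = sum(hist.values())
    return {x: count / total for x, count in hist.items()}


# Toy values, not real data.
hist = make_hist([1, 2, 2, 3, 5])
pmf = make_pmf(hist)
print(hist)
print(pmf)
```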
Let's do exercise 2.5: Write Mean(pmf) and Var(pmf).
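A possible starting point for the exercise, assuming the PMF is represented as a plain dictionary from value to probability (not the Pmf class itself). The mean is the sum of p * x over all values, and the variance is the sum of p * (x - mean)^2. The names pmf_mean and pmf_var are placeholders; adapt them to the Pmf interface.

```python
def pmf_mean(pmf):
    """Mean of a PMF given as {value: probability}."""
    return sum(p * x for x, p in pmf.items())


def pmf_var(pmf):
    """Variance of a PMF: the probability-weighted squared deviation from the mean."""
    mu = pmf_mean(pmf)
    return sum(p * (x - mu) ** 2 for x, p in pmf.items())


# Toy PMF for illustration only.
example = {1: 0.2, 2: 0.4, 3: 0.2, 5: 0.2}
print(pmf_mean(example), pmf_var(example))
```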
Any questions from Chapters 1 and 2?
Let's read and discuss this article.
What is the primary statistical claim this article makes?
What summary statistics does it use to support this claim?
How could we use the NSFG to test this claim?
How should we define "average family size"?
How should we track it over time?
Let's collect some data: How many children are there in your family?
Make a histogram. Compute the mean.
How can we use this data to estimate the distribution of family sizes?
What population can we use this data to talk about?
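One thing to keep in mind: when we ask children about their families, a family with k children is k times as likely to show up in the sample as a family with one child, and families with no children never show up at all. The sketch below shows one possible correction, assuming each child is equally likely to be in the room: weight each reported size x by 1/x and renormalize. The responses are made-up numbers, and the function names are hypothetical.

```python
from collections import Counter


def make_pmf(values):
    """Build {value: probability} from a list of observed values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}


def unbias(pmf):
    """Undo size-biased sampling: weight each size x by 1/x, then renormalize."""
    weighted = {x: p / x for x, p in pmf.items()}
    total = sum(weighted.values())
    return {x: w / total for x, w in weighted.items()}


def pmf_mean(pmf):
    return sum(p * x for x, p in pmf.items())


# Made-up responses: one answer per child surveyed.
responses = [1, 2, 2, 3, 3, 3]
observed = make_pmf(responses)
corrected = unbias(observed)
print(pmf_mean(observed), pmf_mean(corrected))
```

Even after this correction, the result only describes families with at least one child, which matters when comparing with published figures.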
According to this table from the U.S. Census, the average number of children per family, for families that have children, is 1.86.
One of the reasons we are using a general-purpose language rather than a stats language like R is that for many projects the "hard" part is preparing the data, not doing the analysis.
1) Parsing the data. Of course, this depends on what format it is in:
Anybody dealing with anything else?
2) Cleaning. Survey responses and other data files are almost always incomplete. Sometimes there are multiple codes for things like "not asked," "did not know," and "declined to answer."
And there are almost always errors.
One of the first steps when you work with a new dataset is to explore the weird stuff and make a strategy for dealing with it.
For example, a simple strategy is to remove or ignore incomplete records.
Any problems with that strategy?
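Here is a minimal sketch of that simple strategy, so we have something concrete to criticize. The codes 97, 98, and 99 and the field names are hypothetical; a real dataset defines its own codes in its codebook, and they can differ from field to field.

```python
# Hypothetical special codes; check the codebook for the real ones.
MISSING_CODES = {97: "not asked", 98: "did not know", 99: "declined to answer"}


def clean_record(record):
    """Replace any special code with None, leaving other values unchanged."""
    return {field: (None if value in MISSING_CODES else value)
            for field, value in record.items()}


def is_complete(record):
    """A record is complete if no field is None."""
    return all(value is not None for value in record.values())


# Made-up records for illustration.
records = [
    {"pregnancies": 3, "live_births": 2},
    {"pregnancies": 98, "live_births": 1},
    {"pregnancies": 2, "live_births": 99},
]
cleaned = [clean_record(r) for r in records]
complete = [r for r in cleaned if is_complete(r)]
print(len(records), "records read,", len(complete), "kept")
```

Printing how many records were dropped is a useful habit: it shows how much of the sample the strategy throws away.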
3) Confirmation. Check the sanity of the data by checking for internal consistency, comparing with published summaries, and applying common sense.
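A sketch of what these checks might look like in code. The field names and records are made up, and the 5% tolerance in matches_published is an arbitrary choice for illustration, not a standard.

```python
def find_inconsistencies(records):
    """Return (index, message) pairs for records that fail a sanity check."""
    problems = []
    for i, r in enumerate(records):
        if r["num_children"] < 0:
            problems.append((i, "negative number of children"))
        first = r["age_first_birth"]
        if first is not None and first > r["age"]:
            problems.append((i, "first birth after interview age"))
    return problems


def matches_published(computed, published, rel_tol=0.05):
    """True if a computed summary is within rel_tol of a published value."""
    return abs(computed - published) <= rel_tol * abs(published)


# Made-up records for illustration.
records = [
    {"age": 30, "age_first_birth": 24, "num_children": 2},
    {"age": 22, "age_first_birth": 25, "num_children": 1},
    {"age": 35, "age_first_birth": None, "num_children": 0},
]
print(find_inconsistencies(records))
```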
4) Data structures.
Once you read the data, you usually want to store it in a data structure that lends itself to the analysis you want to do.
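For instance, if the analysis is about respondents but the file has one row per pregnancy, a dictionary keyed by respondent may be more convenient than a flat list. This is a generic sketch with made-up records and assumed field names, not the structure used in the NSFG code.

```python
from collections import defaultdict

# Made-up pregnancy records; "caseid" is assumed to identify the respondent.
pregnancies = [
    {"caseid": 1, "outcome": "live birth"},
    {"caseid": 2, "outcome": "live birth"},
    {"caseid": 1, "outcome": "miscarriage"},
]

# Group the flat list into {caseid: [records for that respondent]}.
by_respondent = defaultdict(list)
for preg in pregnancies:
    by_respondent[preg["caseid"]].append(preg)

# Per-respondent questions are now one lookup away.
pregnancy_counts = {caseid: len(pregs) for caseid, pregs in by_respondent.items()}
print(pregnancy_counts)
```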
Read my NSFG code and let's talk about the data structure.
Most databases provide a mapping from keys to values, so they are like dictionaries.
One strategy is to design and develop your structure using dictionaries and then translate it into a database.
With appropriate encapsulation, you can replace the dictionaries with a database seamlessly.
Another reason for using a database: persistence across multiple runs.
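To make the encapsulation idea concrete, here is a minimal sketch in which the analysis code only calls put and get, so it does not care whether the records live in a dictionary or in a database. The class names, table layout, and use of JSON for the values are all assumptions made for this example. It uses sqlite3 from the standard library with an in-memory database; passing a file name instead would make the data persist across runs.

```python
import json
import sqlite3


class DictStore:
    """Keeps records in an in-memory dictionary."""

    def __init__(self):
        self._data = {}

    def put(self, key, record):
        self._data[key] = record

    def get(self, key):
        return self._data[key]


class SqliteStore:
    """Same interface, backed by a SQLite table of (key, JSON value) rows."""

    def __init__(self, path):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, value TEXT)")

    def put(self, key, record):
        self._conn.execute(
            "INSERT OR REPLACE INTO records (key, value) VALUES (?, ?)",
            (str(key), json.dumps(record)))
        self._conn.commit()

    def get(self, key):
        row = self._conn.execute(
            "SELECT value FROM records WHERE key = ?", (str(key),)).fetchone()
        if row is None:
            raise KeyError(key)
        return json.loads(row[0])


def mean_children(store, keys):
    """Analysis code that depends only on the store's get method."""
    return sum(store.get(k)["num_children"] for k in keys) / len(keys)


for store in (DictStore(), SqliteStore(":memory:")):
    store.put("r1", {"num_children": 2})
    store.put("r2", {"num_children": 1})
    print(type(store).__name__, mean_children(store, ["r1", "r2"]))
```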