
### Lecture 02

For today you should have:
1. Read Chapters 1 and 2.
2. Done Homework 1 (turn in by noon today; future homeworks due in class).
3. Prepared a question/dataset pitch: what do you want to explore, where are you getting data?
4. Signed up for the mailing list.

Today:
1. Share your folder with arjun.aletty@students.olin.edu and abekim0607@gmail.com.
2. Pmf API.
3. Family sizes.
4. Project pitches.
5. Parsing, cleaning and data structure design.

For next time:
2. Homework 2.
3. Write a project proposal (see the Project page).

## Chapters 1 and 2

Let's review Pmf.html.

Draw a class diagram for the classes in Pmf.py.

Review the methods for Hist and Pmf.

What is the difference between a Hist and a Pmf?

Let's do exercise 2.5: Write Mean(pmf) and Var(pmf).
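As a starting point for the exercise, here is a minimal sketch that assumes the PMF is a plain dictionary mapping each value to its probability, rather than a Pmf object from Pmf.py. The names `pmf_mean` and `pmf_var` are made up for this example. It uses the definitions mean = Σ pᵢ xᵢ and variance = Σ pᵢ (xᵢ − mean)².

```python
def pmf_mean(pmf):
    """Mean of a PMF given as a dict mapping value -> probability."""
    return sum(p * x for x, p in pmf.items())


def pmf_var(pmf):
    """Variance of a PMF given as a dict mapping value -> probability."""
    mu = pmf_mean(pmf)
    return sum(p * (x - mu) ** 2 for x, p in pmf.items())


# A small made-up PMF; the probabilities add up to 1.
example_pmf = {1: 0.2, 2: 0.4, 3: 0.4}

print(pmf_mean(example_pmf))
print(pmf_var(example_pmf))
```

To adapt this to the Pmf class, you would loop over the value-probability pairs the class provides instead of `pmf.items()`.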

Any questions from Chapters 1 and 2?

## Family size

What summary statistics does it use to support this claim?

How could we use the NSFG to test this claim?

How should we define "average family size"?

How should we track it over time?

Let's collect some data: How many children are there in your family?

Make a histogram.  Compute the mean.

How can we use this data to estimate the distribution of family sizes?
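One possible approach, sketched under a strong assumption: every child in a family is equally likely to be sitting in this room, so a family with k children is k times as likely to show up in our sample as a family with one child. Under that assumption, we can weight each response by 1/k and renormalize. The responses below are invented for illustration, not real class data, and the function name `unbias` is a choice made for this sketch.

```python
from collections import Counter

# Hypothetical answers to "How many children are there in your family?"
responses = [1, 2, 2, 2, 3, 3, 4]

# Histogram: maps each family size to the number of times it was reported.
hist = Counter(responses)

# Observed PMF: divide each count by the number of responses.
n = len(responses)
observed_pmf = {k: count / n for k, count in hist.items()}
observed_mean = sum(responses) / n


def unbias(pmf):
    """Weight each family size k by 1/k, then renormalize.

    Assumes a family with k children is k times as likely to be sampled.
    """
    weighted = {k: p / k for k, p in pmf.items()}
    total = sum(weighted.values())
    return {k: w / total for k, w in weighted.items()}


family_pmf = unbias(observed_pmf)
family_mean = sum(k * p for k, p in family_pmf.items())

print(observed_mean, family_mean)
```

Note that families with no children can never appear in this sample, so the estimate only describes families that have at least one child.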

What population can we use this data to talk about?

According to this table from the U.S. Census, the average number of children per family, for families that have children, is 1.86.

Credits: this exercise is a variation of an example from Gelman and Nolan, Teaching Statistics.

## Parsing, cleaning and data structure design

One of the reasons we are using a general-purpose language rather than a stats language like R is that for many projects the "hard" part is preparing the data, not doing the analysis.

Primary steps:

1) Parsing the data.  Of course, this depends on what format it is in:

• Plain text: government datasets are often in plain text because it is so universal.
• Fixed columns: some plain text files use the same number of columns in each line, and the codebook tells you the start and end column for each field.
• CSV: "A comma-separated values or character-separated values (CSV) file is a simple text format for a database table."  Almost any spreadsheet program can write CSV; you can use the csv module to read it.
• XML: "Extensible Markup Language (XML) is a set of rules for encoding documents in machine-readable form."  Python's standard library includes modules to parse it, such as xml.etree.ElementTree.
• HTML: if the data is on a web page meant for people, and not in a machine-friendly format, sometimes you have to parse the HTML (this is a form of data scraping).  Python provides several modules that can help, such as html.parser in the standard library.
Anybody dealing with anything else?
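For the two most common cases above, fixed columns and CSV, here is a minimal sketch. The codebook, field names and sample lines are all invented for illustration; a real codebook tells you the actual column positions. The data is read from strings here so the example is self-contained, but the same code works on an open file.

```python
import csv
import io

# Hypothetical codebook: field name -> (start, end) columns,
# 1-indexed and inclusive, the way codebooks usually list them.
CODEBOOK = {"caseid": (1, 4), "age": (5, 6), "children": (7, 7)}


def parse_fixed(line, codebook):
    """Slice one fixed-column line into a dict of field name -> string."""
    return {name: line[start - 1:end].strip()
            for name, (start, end) in codebook.items()}


# Made-up fixed-column data.
fixed_text = "0001342\n0002281\n"
fixed_records = [parse_fixed(line, CODEBOOK)
                 for line in fixed_text.splitlines()]

# The same made-up data as CSV, with a header row.
csv_text = "caseid,age,children\n1,34,2\n2,28,1\n"
csv_records = list(csv.DictReader(io.StringIO(csv_text)))

print(fixed_records[0])
print(csv_records[0])
```

Either way, every field comes back as a string, so converting to numbers is a separate step.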

2) Cleaning.  Survey responses and other data files are almost always incomplete.  Sometimes there are multiple codes for things like "not asked," "did not know," and "declined to answer."

And there are almost always errors.

One of the first steps when you work with a new dataset is to explore the weird stuff and make a strategy for dealing with it.

For example, a simple strategy is to remove or ignore incomplete records.

Any problems with that strategy?
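To make the question concrete, here is a minimal sketch of the remove-incomplete-records strategy. The special codes (97, 98, 99), the field names and the records are hypothetical; a real codebook defines its own codes, often different ones for each field.

```python
# Hypothetical special codes; check the codebook for the real ones.
MISSING_CODES = {97: "not asked", 98: "did not know", 99: "declined to answer"}


def clean_record(record):
    """Replace any special code with None so it cannot pass as a real value."""
    return {field: (None if value in MISSING_CODES else value)
            for field, value in record.items()}


def is_complete(record):
    """True if no field in the record is missing."""
    return all(value is not None for value in record.values())


# Made-up records for illustration.
raw_records = [
    {"age": 34, "children": 2},
    {"age": 99, "children": 1},   # declined to give age
    {"age": 28, "children": 98},  # did not know
    {"age": 41, "children": 3},
]

cleaned = [clean_record(r) for r in raw_records]
complete = [r for r in cleaned if is_complete(r)]
dropped = len(cleaned) - len(complete)

print(len(complete), "kept,", dropped, "dropped")
```

Counting how many records get dropped, and checking whether they differ from the ones you keep, is one way to see what the strategy costs.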

3) Confirmation.  Check the sanity of the data by checking for internal consistency, comparing with published summaries, and applying common sense.
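A few of these checks can be automated. The sketch below is one way to do it, assuming records are dictionaries with the hypothetical fields `age`, `children` and `children_at_home`; the expected record count would come from the dataset's documentation.

```python
def check_records(records, expected_count):
    """Return a list of human-readable problems found in the records."""
    problems = []

    # Compare with a published summary: the documented number of records.
    if len(records) != expected_count:
        problems.append(
            f"expected {expected_count} records, found {len(records)}")

    for i, r in enumerate(records):
        # Common sense: ages outside this range are almost surely errors.
        if not 0 <= r["age"] <= 120:
            problems.append(f"record {i}: implausible age {r['age']}")
        # Internal consistency: can't have more children at home than in total.
        if r["children_at_home"] > r["children"]:
            problems.append(
                f"record {i}: more children at home than children in total")

    return problems


# Made-up records; the second one has two deliberate errors.
sample = [
    {"age": 34, "children": 2, "children_at_home": 1},
    {"age": 340, "children": 1, "children_at_home": 2},
]

for problem in check_records(sample, expected_count=2):
    print(problem)
```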

4) Data structures.

Once you read the data, you usually want to store it in a data structure that lends itself to the analysis you want to do.

• If the data fits into memory, building a data structure is usually the way to go.
• If not, you could make multiple passes, reading and writing text files.
• Or you could build a database, which is an out-of-memory data structure.
Most databases provide a mapping from keys to values, so they are like dictionaries.

One strategy is to design and develop your structure using dictionaries and then translate it into a database.

With appropriate encapsulation, you can replace the dictionaries with a database seamlessly.

Another reason for using a database: persistence across multiple runs.
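Here is a minimal sketch of that encapsulation idea, using Python's standard `shelve` module as a stand-in for a database (a shelf is a persistent, dictionary-like object stored on disk). The class name `RecordStore` and the record fields are invented for this example; a real project might use a different database behind the same interface.

```python
import os
import shelve
import tempfile


class RecordStore:
    """Maps a case id to a record; the backing mapping is swappable."""

    def __init__(self, mapping):
        # Any dict-like object works: a plain dict or an open shelf.
        self._mapping = mapping

    def put(self, caseid, record):
        # shelve requires string keys, so convert in one place.
        self._mapping[str(caseid)] = record

    def get(self, caseid):
        return self._mapping[str(caseid)]

    def __len__(self):
        return len(self._mapping)


# During development: an ordinary in-memory dictionary.
store = RecordStore({})
store.put(1, {"age": 34, "children": 2})

# Later: the same class backed by a file on disk.
path = os.path.join(tempfile.mkdtemp(), "records")
with shelve.open(path) as db:
    RecordStore(db).put(1, {"age": 34, "children": 2})

# Reopening the shelf stands in for a second run of the program.
with shelve.open(path) as db:
    reloaded = RecordStore(db).get(1)

print(reloaded)
```

The analysis code only calls `put` and `get`, so it does not need to change when the storage does.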