Lecture 05

For today you should:

1) Read Chapter 5 of Think Stats 2e.

2) Talk with your liaison, get data!

Today:

1) Analytic distributions

2) Chapter 5 exercises

3) Meetings

For next time you should:

1) Read Chapter 6 of Think Stats 2e

2) Prepare for a quiz

3) Start your journal (see below)

Reminder: follow instructions.

Exponential distribution

If an event is equally likely to occur at any time, the time between events is distributed exponentially.

More generally, the exponential distribution is an analytic model that often fits data well enough for purposes of analysis and prediction.

It is characterized by a single parameter, λ, the "arrival rate" in events per unit time.

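If you suspect that your data are approximately exponential, you can check by plotting the CCDF on a log-y scale.  For exponential data the result is approximately a straight line, with slope equal to minus the arrival rate.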
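The check above can be sketched without any plotting library: compute the CCDF, take the log of the y values, and see whether a fitted line has slope close to minus the arrival rate.  This is only an illustration; the helper names ccdf and fit_slope, the rate of 2.0, and the cutoff of 0.01 are made up for this example and are not from the book's code.

```python
import math
import random


def ccdf(sample):
    """Return sorted values and, for each, the fraction of the sample above it."""
    xs = sorted(sample)
    n = len(xs)
    ps = [1 - (i + 1) / n for i in range(n)]
    return xs, ps


def fit_slope(xs, ys):
    """Least-squares slope of ys as a function of xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


random.seed(17)
lam = 2.0  # made-up arrival rate, in events per unit time
sample = [random.expovariate(lam) for _ in range(10000)]

xs, ps = ccdf(sample)
# Keep points with CCDF > 0.01: the last point has CCDF = 0 (no logarithm),
# and the extreme tail is noisy.
points = [(x, math.log(p)) for x, p in zip(xs, ps) if p > 0.01]
slope = fit_slope([x for x, _ in points], [y for _, y in points])
print(slope)  # should be close to -lam
```

With real data you would plot xs against ps with a log-scaled y axis and judge the straightness by eye; the fitted slope is just a numeric stand-in for that picture.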

Normal distribution

Two-parameter model: the mean, µ, and the standard deviation, σ.

More precisely, the equation of the distribution has two parameters, and it turns out that they are the mean, µ, and the standard deviation, σ.

CDF doesn't have a simple analytic form, but it's easy to evaluate computationally.
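One way to evaluate the normal CDF computationally is through the error function, which Python provides as math.erf.  The function name normal_cdf below is made up for this sketch; the identity it uses, CDF(x) = (1 + erf((x - µ) / (σ√2))) / 2, is standard.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Evaluate the CDF of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / (sigma * math.sqrt(2))
    return 0.5 * (1 + math.erf(z))


# Probability of falling within one standard deviation of the mean.
within_one_sigma = normal_cdf(1.0) - normal_cdf(-1.0)
print(within_one_sigma)  # roughly 0.68
```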

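The visual test for normality is a little more complicated: make a normal probability plot, which plots the sorted data against the corresponding values from a standard normal distribution.  If the data are normal, the result is approximately a straight line.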
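A common visual test for normality is the normal probability plot.  The sketch below builds the points for such a plot using quantiles of the standard normal from statistics.NormalDist, which is one of several reasonable ways to do it and not necessarily how the book's code does it.  The helper names and the sample parameters (7.3 and 1.2) are made up.  If the data are normal, a line fitted to the points has slope near σ and intercept near µ.

```python
import random
from statistics import NormalDist


def normal_probability_points(sample):
    """Pair each sorted data value with the matching standard normal quantile."""
    ys = sorted(sample)
    n = len(ys)
    std_normal = NormalDist()
    xs = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares slope and intercept of ys as a function of xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


random.seed(5)
mu, sigma = 7.3, 1.2  # made-up parameters for the example
sample = [random.gauss(mu, sigma) for _ in range(5000)]

xs, ys = normal_probability_points(sample)
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # should be close to sigma and mu
```

Data that are not normal show up as systematic curvature away from the fitted line, usually most visible in the tails.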

Lognormal distribution

If log(x) is normal, then x is lognormal.  Example: adult weights.
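In practice, the definition suggests a recipe: take the log of the data, then apply any test for normality to the logs.  A minimal sketch, using random.lognormvariate to generate a sample; the parameters 4.3 and 0.2 are made up and are not fitted to real weight data.

```python
import math
import random
import statistics

random.seed(3)
mu, sigma = 4.3, 0.2  # made-up mean and std of log(x)
weights = [random.lognormvariate(mu, sigma) for _ in range(5000)]

# If the weights are lognormal, their logs are normal with mean mu and std sigma.
logs = [math.log(w) for w in weights]
log_mean = statistics.mean(logs)
log_std = statistics.stdev(logs)
print(log_mean, log_std)  # should be close to mu and sigma
```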

Pareto distribution

Common model of long-tailed distributions.

Visual test: plot CCDF on log-log scale.  For Pareto data the result is approximately a straight line.

For many datasets, lognormal and Pareto are competing models.
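The log-log test for a Pareto distribution can be sketched numerically.  For a Pareto distribution with minimum 1 and shape parameter α, the CCDF is x^(-α), so log CCDF plotted against log x is a line with slope -α.  The helper names, the value of alpha, and the 0.01 cutoff are made up for this example; random.paretovariate generates values with minimum 1.

```python
import math
import random


def ccdf(sample):
    """Return sorted values and, for each, the fraction of the sample above it."""
    xs = sorted(sample)
    n = len(xs)
    ps = [1 - (i + 1) / n for i in range(n)]
    return xs, ps


def fit_slope(xs, ys):
    """Least-squares slope of ys as a function of xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


random.seed(11)
alpha = 1.7  # made-up shape parameter
sample = [random.paretovariate(alpha) for _ in range(10000)]

xs, ps = ccdf(sample)
# Take logs of both axes; skip the noisy extreme tail and the final CCDF = 0.
points = [(math.log(x), math.log(p)) for x, p in zip(xs, ps) if p > 0.01]
slope = fit_slope([x for x, _ in points], [y for _, y in points])
print(slope)  # should be close to -alpha
```

Running the same transform on a lognormal sample gives a curve that bends downward, which is one way to tell the two competing models apart, although in real data the difference is often visible only in the tail.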

Exercises

1) git pull upstream master

2) Open chap05ex.ipynb and work on the exercises.

3) Do Exercise 5.5

In the repository for this book, you’ll find a set of data files called mystery0.dat, mystery1.dat, and so on. Each contains a sequence of random numbers generated from an analytic distribution.

You will also find test_models.py, a script that reads data from a file and plots the CDF under a variety of transforms. You can run it like this:

$ python test_models.py mystery0.dat

Based on these plots, you should be able to infer what kind of distribution generated each file. If you are stumped, you can look in mystery.py, which contains the code that generated the files.

Journal

1) Ongoing documentation of your work

2) Primary thing we will work with during meetings

3) Source of the material that will go into reports

First three elements

1) Project description

2) Data management documentation

3) Data inventory

Format

1) Google Doc, shared with Paul and me (allendowney@gmail.com).  Title: "Data Science 2015 <Sponsor name> journal"

2) Some sections might be good as IPython notebooks; in that case, embed a link to nbviewer if the notebook is public, and paste in the material otherwise.

Project description

1) What is the institution, company or agency involved, and what is their mission?

2) What group or people will you be working with, and what are their goals?

3) What kind of data will you work with?  Who collected it and why?

4) What are the goals of the project?

5) If the project is successful, what kind of impact is possible?  Without being too dramatic, how will this project make the world a better place?

The project descriptions on the web page contain answers to these questions, provided by the liaison.  I am asking for your answers to these questions.

Data management

Technical issues

1) Collection / extraction

2) Transmission / transformation

3) Storage

4) Security [note on third-party storage]

5) Integrity [validation and version control]

6) Reproducibility [automation]

See Software engineering practices for graduate students

1) Version control

2) Automation

3) Agile development

Legal issues

1) Data sharing agreements:  DSA info

2) Non-disclosure agreements: NDA info

3) IP assignment: example contract

Ethical issues:

1) Privacy

2) Discrimination

3) Handling emotionally sensitive data

4) Openness / transparency

5) Reproducibility

6) ?