Lecture 17

For today you should:

1) Have a good break.

Today:

1) From journal to preliminary report

2) Estimation redux

For next time you should:

1) Start your preliminary report.  Put it in a Google Doc named

Data Science 2015 <collaborator> preliminary report

and share it with allendowney and paullundyruvolo at gmail.  And please don't create extra work for me by improvising.

Revised schedule:

April 3: Preliminary report draft due, where "draft" means you have done the best you can, and you would be proud to show it to your collaborator.

April 10: Preliminary report goes out to the collaborator.

April 6-17: Good time for a f2f meeting with collaborators.  If you have not already, schedule now!

April 30: Draft final report due.  See prior definition of "draft".

May 7: Final report goes out to the collaborator and final archive/portfolio is due (details to follow).

Journal debrief: technical

Saturated scatterplots: use jitter, adjust alpha, use pcolor or hexbin.
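
As a rough illustration of the jitter suggestion, here is a minimal sketch using only the standard library.  The function name jitter, the width parameter, and the heights data are made up for this example; they are not from the course code.

```python
import random

def jitter(values, width=0.5, seed=None):
    """Return a copy of values with uniform noise in [-width, +width] added."""
    rng = random.Random(seed)
    return [v + rng.uniform(-width, width) for v in values]

# Hypothetical data: heights rounded to the nearest inch, so many points
# land on top of each other in a scatterplot.
heights = [66, 66, 67, 67, 67, 68, 68, 70]
jittered = jitter(heights, width=0.5, seed=17)

# The jittered values could then be plotted with a low alpha, for example
# with matplotlib's plt.scatter(x, y, alpha=0.1), or binned with plt.hexbin(x, y).
```

A width of half the rounding interval spreads the points without moving any of them into a neighboring bin.  Say in the caption that the data were jittered.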

Scatterplots with obvious structure: explore variables one at a time and make appropriate transformations.

Estimation: with resampling, each resample is drawn with replacement and is the same size as the original sample.  See exercise below.
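
The point about resample size can be sketched as follows.  This is not the code from the exercise notebook; the function names and the sample values are invented for illustration, and it uses only the standard library.

```python
import random
import statistics

def resample(sample, rng):
    """Draw one resample: with replacement, same size as the sample."""
    return rng.choices(sample, k=len(sample))

def bootstrap_standard_error(sample, iters=1000, seed=None):
    """Estimate the standard error of the mean from resampled means."""
    rng = random.Random(seed)
    means = [statistics.mean(resample(sample, rng)) for _ in range(iters)]
    return statistics.stdev(means)

# Hypothetical measurements.
sample = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.1, 5.7, 4.4, 5.2]
se = bootstrap_standard_error(sample, iters=1000, seed=17)
print(f"mean = {statistics.mean(sample):.2f}, standard error ~ {se:.2f}")
```

If the resamples were larger than the sample, the estimated standard error would come out too small, because the sampling distribution would reflect data you do not have.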

Hypothesis testing: p is never 0.  If you ran 1000 iterations and got 0 hits, report p < 0.001.
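
One way to build that rule into your reporting is sketched below, using a permutation test on the difference in means as an assumed example.  The function names and the two groups are hypothetical.

```python
import random
import statistics

def count_hits(group1, group2, iters=1000, seed=None):
    """Count shuffles whose difference in means is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group1) - statistics.mean(group2))
    pooled = list(group1) + list(group2)
    n = len(group1)
    hits = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n]) - statistics.mean(pooled[n:]))
        if diff >= observed:
            hits += 1
    return hits

def format_p_value(hits, iters):
    """Report an upper bound, not zero, when no simulated run was a hit."""
    if hits == 0:
        return f"p < {1 / iters:g}"
    return f"p = {hits / iters:.3g}"

# Hypothetical data for two groups.
group1 = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]
group2 = [4.2, 4.8, 4.5, 5.0, 4.1, 4.7]
iters = 1000
hits = count_hits(group1, group2, iters=iters, seed=17)
print(format_p_value(hits, iters))
```

Zero hits tells you only that the p-value is smaller than the resolution of the simulation; more iterations give a tighter bound.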

Handle NaNs and explain what you did (again, this is part of the single-variable explorations).
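
A minimal sketch of "handle and explain", assuming the simplest policy of dropping missing values.  The helper drop_nans and the weights data are hypothetical; if your data are in pandas, Series.dropna does the filtering step.

```python
import math

def drop_nans(values):
    """Return the non-NaN values and the number of values dropped."""
    clean = [v for v in values if not math.isnan(v)]
    return clean, len(values) - len(clean)

# Hypothetical variable with two missing entries.
weights = [61.2, float("nan"), 58.9, 70.4, float("nan"), 64.0]
clean, dropped = drop_nans(weights)

# Keep the count so the report can say what was done.
print(f"Dropped {dropped} of {len(weights)} values because they were NaN.")
```

Dropping is not always the right choice; whatever you do, the count and the reason belong in the report.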

Report an appropriate number of significant digits.  And units!  And label the axes.  C'mon people!
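
For the significant digits and units, a small formatting helper can take care of both at once.  This is only a sketch; format_measurement is a made-up name and the numbers are invented.

```python
def format_measurement(value, unit, sig_digits=3):
    """Format value with the given number of significant digits, plus its unit."""
    # The "g" format counts significant digits rather than decimal places.
    # It drops trailing zeros and switches to exponent notation for very
    # large or very small values, so check the result before publishing it.
    return f"{value:.{sig_digits}g} {unit}"

# Hypothetical estimate straight out of a computation.
mean_weight = 7.2654321
print(format_measurement(mean_weight, "kg"))
```

How many digits are appropriate depends on the precision of the estimate, which is one more reason to compute a standard error or confidence interval first.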

Journal debrief: communication

Start with this style guide.  Don't be fooled by the word "style".  These recommendations are not optional, and they are not just one lunatic's opinion.  Every one of these errors brands you as a rookie and undermines the credibility of everything you write.

One more suggestion, on a higher level: words have meaning.  Some of you write like random Markov chains, choosing whatever word seems statistically most likely to come next.  Write like you are an autonomous being with an idea you are trying to convey.

The preliminary report

Who are the audiences?

What are the goals?

What should be at the top of the first page?

Three suggestions for getting started:

1) Make a copy of your journal and edit.

Pro: easy.

Con: dangerous!

Choose this only if you are willing to edit aggressively!  The audience and goal for the preliminary report are not the same as for the journal.  The same document cannot do both.

2) Outline the report, pull material from the journal and identify the remaining pieces you need to add.

Pro: Top-down organization can be good.

Con: Top-down organization is hard.

3) Grab someone who does not know anything about your project and explain it to them in 20 minutes using a whiteboard.

Partner who is not explaining, take notes on what was confusing, and what could be rearranged or clarified.

Find another naive victim and repeat.

Now use the organization you just discovered and write it down.

Pro: I believe that this method will substantially improve your writing.

Con: We will never know if I'm right because no one ever does it.

Estimation exercise

1) If you already have CompStats, do a git pull; otherwise git clone.

2) Load sampling.ipynb, run the examples, and do the exercises.

3) Check your solutions against sampling_soln.ipynb.