Lecture 20

For today you should:

1) Finish a draft of your preliminary report by 2pm today.  If you have not already shared the document with Paul and me, do so immediately!

Today:

1) Quiz debrief

2) Report work time

3) Meetings

For next time you should:

1) Prepare for a quiz

2) Prepare for ML lectures by reading "A few useful things to know about machine learning"

Suggestions for quiz prep:

1) Do the in-class activities.

2) Keep your environment up to date so you can try examples quickly.

3) Keep track of examples we have seen so far.

4) Keep a link to the documentation handy.

5) Don't be afraid to read source code.

6) thinkstats2 is higher level than pandas, which is higher level than NumPy/SciPy, which is higher level than Python data structures and for loops.  Use the highest-level API that can do the job unless there is a compelling reason to drop down a level.

7) For every operation you perform, keep track of the return type (and check it), then take an inventory of what capabilities each type provides.  Many computations can be expressed as a sequence of conversions from one type to another, like stacking adapters to connect mismatched plugs.
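As an example of this adapter-stacking, here is a sketch (with made-up data) of a computation that starts with a plain Python list and passes through several types on its way to a single summary value:

```python
import pandas as pd

# Start with a plain Python list of weights (hypothetical data)
weights = [68.2, 71.5, 59.8, 80.1, 65.4]

# ...convert to a pandas Series to get vectorized methods...
series = pd.Series(weights)        # type: pandas.Series

# ...drop down to a NumPy array for numeric work...
arr = series.to_numpy()            # type: numpy.ndarray

# ...and end with a plain Python float
mean = float(arr.mean())           # type: float
print(round(mean, 2))
```

At each step, knowing the return type tells you which operations are available next.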

Quiz 8

1) Suppose I have four large urns full of blue and green marbles. Urn 0 contains no blue marbles, only green.  Urn 1 is 1/3 blue.  Urn 2 is 2/3 blue, and Urn 3 is all blue.

You choose an urn at random, draw two marbles, and get one blue and one green.  What is the probability that you chose from each urn?
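One way to check the answer is a short Bayesian update.  This is a sketch, assuming the urns are large enough that the two draws are effectively independent with a fixed blue fraction p, so the likelihood of one blue and one green is 2 p (1 - p):

```python
# Blue fractions for Urns 0..3, and a uniform prior over urns
fractions_blue = [0, 1/3, 2/3, 1]
priors = [1/4] * 4

# Likelihood of one blue and one green in two draws: 2 * p * (1 - p)
likelihoods = [2 * p * (1 - p) for p in fractions_blue]

# Bayes: posterior is proportional to prior times likelihood
unnorm = [pr * lk for pr, lk in zip(priors, likelihoods)]
total = sum(unnorm)
posteriors = [u / total for u in unnorm]
print([round(p, 3) for p in posteriors])
```

Urns 0 and 3 are ruled out (they cannot produce both colors), and Urns 1 and 2 have equal likelihoods, so each gets posterior probability 1/2.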

2) sampling.ipynb defines a class called Resampler that takes a sample and estimates the sampling distribution of the mean.  Suppose you have a sample of adult weights in a NumPy array named weight_sample.  Write a few lines of code that use Resampler to compute the 90% CI of the sample mean.

resampler = Resampler(weight_sample)

stats = resampler.compute_sample_statistics()

cdf = thinkstats2.Cdf(stats)

ci = cdf.ConfidenceInterval(90)

3) One drawback of simple resampling is that the resampled data can only contain values that appear in the original sample.  An alternative is to use the original sample to estimate a PDF using KDE, then draw samples from the estimated PDF.

Suppose you have a pandas DataFrame with a column named height that contains mostly valid heights in cm, but also some NaNs.  Write a few lines of code to estimate the PDF of the valid values, then fill in the NaNs with random values drawn from the PDF.

See pandas_demos.ipynb in ThinkStats2/code
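The steps can also be sketched directly.  This is not the notebook's solution; it assumes SciPy's gaussian_kde and a small made-up DataFrame:

```python
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

np.random.seed(17)

# Hypothetical data: heights in cm, with some NaNs
df = pd.DataFrame({'height': [165.2, np.nan, 172.8, 158.1,
                              np.nan, 180.4, 169.9]})

# Estimate the PDF of the valid values using KDE
valid = df['height'].dropna()
kde = gaussian_kde(valid)

# Draw one replacement value per NaN and fill them in
missing = df['height'].isna()
fills = kde.resample(missing.sum())[0]   # resample returns shape (1, n)
df.loc[missing, 'height'] = fills

print(df['height'].isna().sum())
```

After the fill, the column contains no NaNs, and the imputed values follow the estimated distribution rather than repeating observed values.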

4) When you report statistical results, which result is the most important to report prominently: the effect size, the confidence interval, or the p-value?  Which is the least important? 

In my opinion, effect size is the most important, CI is second, and p-value is third.