The goal of the project is for you to use statistical analysis with real data to answer an interesting question.  Often the biggest challenge is to find a question that can be addressed effectively with a dataset you can access.  It is easy to generate an interesting question you can't answer.  And it is easy to generate uninteresting questions about a dataset.  It will take some work, and luck, to find a good pairing.


Some of my blog posts at Probably Overthinking It are examples of the kind of exploration I have in mind.

This Income inequality paper is a good model.  It is more professional than what I expect this semester, but we can emulate the tone, style, level of detail, etc.

Here are examples of reports written by previous students in this class:

HINT: Some of the best projects involve working with an external "client": you have a natural source of questions and data, you get feedback as the project goes along, and there's a good chance you will create something with real value.  If you make contact with a possible client, try to arrange a face-to-face meeting in their work space if at all possible.  To avoid overselling your services, I suggest you tell them that you will be able to do "basic exploratory data analysis."

Data sources:

Here are some suggestions for data sources.  They are intended to get you thinking; you should certainly not limit yourself to these options.

1) The datasets I used in the book are NSFG, BRFSS, and NLS.  (Brad Minch suggests investigating whether births are correlated with the lunar cycle.)

2) Cancer statistics from SEER.  Also, I have a contact at the Massachusetts Department of Public Health who might be a good person to talk to.

3) Crime statistics from the FBI.

4) Running results from has results from just many foot races.  Also, I have a contact at who might be able to get you some data!

5) ManyEyes is a social site that includes datasets and visualizations contributed from many sources.

7) Jon Stolk has data he has collected as part of his educational research, and he has many interesting questions to investigate.

8) From Caitrin Lynch:

"How about the Health and Retirement Study dataset, see info at and the publicly accessible data at

"You could have the students look, for ex., we hear that retirement leads to quick death (idleness leads to death?), is this true? What are the health impacts of retirement, and how does that vary by economic circumstance--so, is it more likely to lead to quick death in a retired janitor than a retired stock broker? And how does it vary by gender, etc. etc.

"There is similar data for England, maybe they can do comparative things?

"I have some retirement-related questions I'd actually like quantitative answers to, but I can't think of them in a clear manner right now--but I could come up with some if your students would actually figure some things out for me. :)"

9) Ann Schaffner has data from NSSE (National Survey of Student Engagement) and lots of interesting questions to explore.  Also, Ann is working on the Olin Dashboard, which you can find on the portal under "Olin in Numbers".

10) I have used data from HERI's Freshman Survey.  We have hard copies in the library, and you can request the data in electronic form.

11) Debbie Chachra recommends, which is a portal to datasets from the gub'ment.  Daniel Bathgate points out that has even more!

12) Steve Gold suggested, which hosts competitions that often involve extracting information from large datasets.

13) Alisha Sieminiski suggested the Framingham Heart Study, which reminded me of the Nurses' Health Study.  And if you Google "longitudinal study" you will find many more.

14) From Mark Sheldon:

"I ... suggest NTSB aviation accident data (see for a query interface, see for monthly breakdowns, and they have other items, including downloadable yearly data sets).

"One might also check out the Aviation Safety Foundations data ( which I think is a subset of the NTSB data intended to focus on general aviation (I think they throw out aircraft with gross weights of 12,500 pounds or more).

"There are lots of questions that come up, many of which are addressed in publications.  Is the accident rate higher at night?  Are night accidents more fatal?  Are particular aircraft more/less likely to be involved in, say takeoff accidents?  Do solar flare or full moons affect the accident rate (actually, new moons might!).

[Addendum: students could investigate the relative safety of light sport aircraft, which is a topic of current controversy among pilots and air safety administrators.]

15) From Chris Morse:

"My data is concerning my math diagnostic tool. I have the individual answers to all 25 questions for just over 1000K students. For each student, I also have what class year they were, gender, and the scores on every problem set, test, and final grade.

"So, there is tons of stuff to play with, if people want to correlate pretty much anything."

16) For demographic data, try the Census, Occupational Employment Statistics, and the National Compensation Survey.  Also the Statistical Abstract published by the U.S. Census.  We have an Olin alum working at the Census, so if you have ideas in this area, I will put you in touch.

17) Check out the alleged cancer cluster among TSA workers at Logan Airport.

18) This paper presents some statistical analysis, but there are several errors.  Get their data and fix them.

19) Millennium Development Goals

20) Read reddit/r/statistics for a few weeks and volunteer to help someone with their data.

21) I have a contact who works in a state health department who wants to look at HIV/AIDS rates in different groups, and the effect of the economic downturn.

22) Here is a topic without a data source: This article in the Economist speculates about the effect of the economy on drug use, but claims, "Few academics have studied the link between drug use and macroeconomic performance, and what work exists is inconclusive."

23) Mark Chang has data from Twitter on geotagged tweets, and tools for collecting more data.  


At the beginning of Meeting 3 (see the schedule) you should turn in a one-page project proposal with the following information:

1) Your name and a tentative title for the report.

2) A few sentences describing the questions you plan to address.

3) A description of the data source(s) you plan to work with.

4) A status report: Do you have the data in hand?  Have you been able to process it and generate summary statistics?  Do you foresee any issues getting or processing the data?

Please bring a hard copy to class and also put a PDF version of the proposal in our shared folder in a file named proposal.pdf.


Over the course of the semester you should perform most of the investigations described below.  Not all of them are relevant to every project.  If you don't see how one applies to yours, I might be able to help.

1) Generate effective summary statistics that describe several variables from your dataset.

2) Generate a PMF for at least one variable, or compare PMFs for different groups.

3) Compute a CDF for at least one variable, or compare CDFs for different groups.

4) Find a variable in your dataset that is well-modeled by a continuous CDF.  Generate a plot to demonstrate this behavior.

5) State and test a difference-in-mean hypothesis about your dataset.  Compute a p-value for the null hypothesis and/or a Bayesian likelihood ratio.

6) State and test a chi-square hypothesis about your dataset.  Compute a p-value for the null hypothesis and/or a Bayesian likelihood ratio.

7) Estimate a parameter for at least one variable in your dataset.

8) Perform a Bayesian estimation.

9) Measure a correlation between a pair of variables.

10) Perform a linear regression.

11) Perform a statistical analysis not described in the book.
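To make a few of the investigations above concrete, here is a minimal sketch of a PMF, a CDF, and a difference-in-means test done by shuffling group labels.  It uses only the Python standard library rather than the code from the book, and the function names (pmf, cdf, permutation_p_value) and the two small samples are made up for illustration.  Treat it as a starting point under those assumptions, not as the required method.

```python
import random
from collections import Counter


def pmf(values):
    """Map each distinct value to its relative frequency."""
    counts = Counter(values)
    n = len(values)
    return {x: c / n for x, c in sorted(counts.items())}


def cdf(values):
    """Map each distinct value x to the fraction of values <= x."""
    total = 0.0
    result = {}
    for x, p in pmf(values).items():
        total += p
        result[x] = total
    return result


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group1, group2, iters=1000, seed=17):
    """Two-sided p-value for a difference in means.

    Under the null hypothesis the group labels are arbitrary, so we
    shuffle the pooled values and count how often the shuffled
    difference is at least as big as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group1) - mean(group2))
    pooled = list(group1) + list(group2)
    n = len(group1)
    count = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n]) - mean(pooled[n:]))
        if diff >= observed:
            count += 1
    return count / iters


# Made-up values for illustration only, not real data.
first = [38, 39, 39, 40, 40, 41]
others = [37, 38, 39, 39, 40, 40]

print(pmf(first))
print(cdf(first))
print(permutation_p_value(first, others))
```

With samples this small the p-value is not meaningful; the point is only the shape of the computation.  The same resampling idea can be adapted to other test statistics.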

Draft Report:

A draft report is due two weeks before the final report.  By then you will be able to write a background section, and you should have finished at least a few explorations.  In general it is better to turn something in on time, and get prompt feedback, than to turn in something more complete later, in which case my feedback might be too late to be useful.

Please turn in your draft report by giving me a copy on paper and putting a PDF file in our shared folder with the name lastname_draft.pdf (where lastname is your last name).

Final Report:

Please turn in your final report by giving me a copy on paper and putting a PDF file in our shared folder with the name lastname_report.pdf (where lastname is your last name).  Also, please put the code you used in our shared folder in a subdirectory named project_code.  Include your data, too, unless it is very big.

The audience for the final report is other students in this class.  You can assume that they are familiar with the statistical concepts we studied this semester.

The goal of the final report is to take up an interesting question, address it by doing statistical analysis of real data, and present the results in a clear and entertaining way.

Some suggestions on organization:
  1. Try to present one exploration per section.
  2. Present the explorations in an order that makes sense for the project.
  3. You can address one big question or a few smaller ones.
  4. Keep it simple!  Better to ask small, precise questions you can answer than big, vague ones you can't.
  5. Don't refer explicitly to the list of explorations.  Your explorations should be motivated by your questions and results, not by a checklist.
Each section should have:
  1. A motivating question,
  2. An explanation of what you did,
  3. Results, and
  4. Interpretation of the results as an answer to the question you posed.

"Next I investigated variation in height for men and women.  Again I used data from the BFRSS, which includes self-reported height from 12,345 respondents [or whatever].  The following table shows the mean height for men and women (in cm), the variance and standard deviation, and the coefficient of variation:

      Mean height   Variance      Standard dev  CV
Men   178.090966766 59.4275328443 7.70892553112 0.0432864488925
Women 163.226104443 52.7684723388 7.26419110011 0.0445038563217

For variance and standard deviation, I used $S_n^2$.  Since the sample size is large, the difference between the population and sample variance is small.

The variance for men is higher (and the standard deviation, of course), which suggests that males are more variable, but the coefficient of variation, which expresses standard deviation as a fraction of the mean, tells a different story.  By that measure, which is probably more meaningful, men are slightly less variable.  This difference is small, so it may be due to chance (See exploration 12)." 

Things to notice:
  1. First person singular, active voice (but not a personal narrative).   Don't write your report in the passive voice.  The passive voice is not required or desirable in technical writing.  Anyone who tells you different is just plain wrong.  Please show them this: The Passive Voice is a Hoax.
  2. Logical flow.
  3. More detail than would be typical in a scientific paper -- for this exercise, I want you to be a little pedantic.
  4. One simple question at a time.
If you were grading that exploration, what would you correct?
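For reference, the quantities in the sample exploration (mean, variance, standard deviation, and coefficient of variation) could be computed along these lines.  This is a sketch using Python's statistics module; the function name summarize and the height values are made up for illustration, and the variance divides by n, which is what the sample exploration says it used.

```python
import statistics


def summarize(values):
    """Return mean, variance, standard deviation, and CV for a sample.

    The variance divides by n (not n - 1).  The coefficient of
    variation is the standard deviation as a fraction of the mean.
    """
    mu = statistics.fmean(values)
    var = statistics.pvariance(values, mu)
    std = var ** 0.5
    return {"mean": mu, "var": var, "std": std, "cv": std / mu}


# Made-up heights in cm, for illustration only.
heights = [165.0, 172.5, 178.0, 181.0, 190.5]

for name, value in summarize(heights).items():
    print(name, round(value, 3))
```

Rounding the printed values to a few digits is a deliberate choice here: it keeps the reported precision in line with what the data can support.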

Please use the following words carefully:

Significant: Usually means statistically significant.  If you mean something else, be really clear.

Trend: Usually means something is changing in time.  "Pattern" is a good word to use for an effect that does not involve time.

Correlation: Means a specific kind of relationship between variables; if you just mean relationship, say "relationship".  For example "Based on this scatterplot, it looks like there is a linear relationship between these variables, so I calculated their correlation coefficient."
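To see why "correlation" and "relationship" are not interchangeable, here is a small sketch that computes Pearson's correlation coefficient by hand.  The function name pearson and both datasets are made up for illustration.  The second dataset has an obvious relationship (y is the square of x) but a correlation of essentially zero, because the relationship is not linear.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient for paired sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up data: a perfectly linear relationship.
xs = [1, 2, 3, 4]
linear = [2 * x + 1 for x in xs]
print(pearson(xs, linear))

# Made-up data: a strong but nonlinear relationship.
sym = [-2, -1, 0, 1, 2]
squares = [x * x for x in sym]
print(pearson(sym, squares))
```

This is one reason to look at a scatterplot before reporting a correlation coefficient.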

Allen Downey,
Sep 27, 2011, 1:25 PM