The goal of the project is for you to use statistical analysis with real data to answer an interesting question. Often the biggest challenge is to find a question that can be addressed effectively with a dataset you can access. It is easy to generate an interesting question you can't answer. And it is easy to generate uninteresting questions about a dataset. It will take some work, and luck, to find a good pairing.
Some of my blog posts at Probably Overthinking It are examples of the kind of exploration I have in mind.
This Income inequality paper is a good model. It is more professional than what I expect this semester, but we can emulate the tone, style, level of detail, etc.
Here are examples of reports written by previous students in this class:
HINT: Some of the best projects involve working with an external "client": you have a natural source of questions and data, you get feedback as the project goes along, and there's a good chance you will create something with real value. If you make contact with a possible client, try to arrange a face-to-face meeting in their work space if at all possible. To avoid overselling your services, I suggest you tell them that you will be able to do "basic exploratory data analysis."
Here are some suggestions for data sources. They are intended to get you thinking; you should certainly not limit yourself to these options.
1) The datasets I used in the book are NSFG, BRFSS and NLS. (Brad Minch suggests investigating whether births are correlated with the lunar cycle.)
2) Cancer statistics from SEER. Also, I have a contact at the Massachusetts Department of Public Health who might be a good person to talk to.
3) Crime statistics from the FBI.
4) Running results: coolrunning.com has results from many foot races. Also, I have a contact at MapMyRun.com who might be able to get you some data!
5) ManyEyes is a social site that includes datasets and visualizations contributed from many sources.
6) Eco-Economy indicators from the Earth Policy Institute (suggested by Dee Magnoni)
7) Jon Stolk has data he has collected as part of his educational research, and he has many interesting questions to investigate.
8) From Caitrin Lynch:
9) Ann Schaffner has data from NSSE (National Survey of Student Engagement) and lots of interesting questions to explore. Also, Ann is working on the Olin Dashboard, which you can find on the portal under "Olin in Numbers".
10) I have used data from HERI's Freshman Survey. We have hard copies in the library, and you can request it in electronic form.
11) Debbie Chachra recommends http://data.dc.gov/, which is a portal to datasets from the gub'ment. Daniel Bathgate points out that http://data.gov has even more!
12) Steve Gold suggested kaggle.com, which hosts competitions that often involve extracting information from large datasets.
13) Alisha Sieminiski suggested the Framingham Heart Study, which reminded me of the Nurses' Health Study. And if you Google "longitudinal study" you will find many more.
14) From Mark Sheldon:
15) From Chris Morse:
"My data is concerning my math diagnostic tool. I have the individual answers to all 25 questions for just over 1000K students. For each student, I also have what class year they were, gender, and the scores on every problem set, test, and final grade.
"So, there is tons of stuff to play with, if people want to correlate pretty much anything."
16) For demographic data, try the Census, Occupational Employment Statistics, and the National Compensation Survey. Also the Statistical Abstract published by the U.S. Census. We have an Olin alum working at the Census, so if you have ideas in this area, I will put you in touch.
17) Check out the alleged cancer cluster among TSA workers at Logan Airport.
18) This paper presents some statistical analysis, but there are several errors. Get their data and fix them.
19) Millennium Development Goals: http://mdgs.un.org/unsd/mdg/Data.aspx
20) Read reddit/r/statistics for a few weeks and volunteer to help someone with their data.
21) I have a contact who works in a state health department who wants to look at HIV/AIDS rates in different groups, and the effect of the economic downturn.
22) Here is a topic without a data source: This article in the Economist speculates about the effect of the economy on drug use, but claims, "Few academics have studied the link between drug use and macroeconomic performance, and what work exists is inconclusive."
23) Mark Chang has data from Twitter on geotagged tweets, and tools for collecting more data.
At the beginning of Meeting 3 (see the schedule) you should turn in a one-page project proposal with the following information:
1) Your name and a tentative title for the report.
2) A few sentences describing the questions you plan to address.
3) A description of the data source(s) you plan to work with.
4) A status report: Do you have the data in hand? Have you been able to process it and generate summary statistics? Do you foresee any issues getting or processing the data?
Please bring a hard copy to class and also put a PDF version of the proposal in our shared folder in a file named proposal.pdf.
Over the course of the semester you should perform most of the investigations described below. Not every investigation is relevant to every project. If you don't see how one of them applies to yours, I might be able to help.
1) Generate effective summary statistics that describe several variables from your dataset.
2) Generate a PMF for at least one variable, or compare PMFs for different groups.
3) Compute a CDF for at least one variable, or compare CDFs for different groups.
4) Find a variable in your dataset that is well-modeled by a continuous CDF. Generate a plot to demonstrate this behavior.
5) State and test a difference-in-mean hypothesis about your dataset. Compute a p-value for the null hypothesis and/or a Bayesian likelihood ratio.
6) State and test a chi-square hypothesis about your dataset. Compute a p-value for the null hypothesis and/or a Bayesian likelihood ratio.
7) Estimate a parameter for at least one variable in your dataset.
8) Perform a Bayesian estimation.
9) Measure a correlation between a pair of variables.
10) Perform a linear regression.
11) Perform a statistical analysis not described in the book.
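As a rough illustration of investigations 2, 3 and 5, here is a minimal sketch in plain Python. It is not the code from the book: the function names (make_pmf, make_cdf, permutation_p_value) and the two small samples are made up for this example, and the hypothesis test shown is a simple permutation (shuffling) test of the difference in means, which is only one of several reasonable ways to compute a p-value.

```python
import random
from collections import Counter
from statistics import mean


def make_pmf(values):
    """Map each distinct value to its relative frequency."""
    counts = Counter(values)
    n = len(values)
    return {x: c / n for x, c in sorted(counts.items())}


def make_cdf(values):
    """Return sorted distinct values and cumulative probabilities P(X <= x)."""
    xs, ps, total = [], [], 0.0
    for x, p in make_pmf(values).items():
        total += p
        xs.append(x)
        ps.append(total)
    return xs, ps


def permutation_p_value(group1, group2, iters=1000, seed=0):
    """Two-sided p-value for the difference in means, by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(mean(group1) - mean(group2))
    pooled = list(group1) + list(group2)
    n = len(group1)
    hits = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        # Difference in means when group labels are assigned at random.
        diff = abs(mean(pooled[:n]) - mean(pooled[n:]))
        if diff >= observed:
            hits += 1
    return hits / iters


# Made-up illustrative samples, not taken from any real dataset.
first = [38, 39, 39, 40, 40, 41]
others = [37, 38, 39, 39, 40, 40]

pmf = make_pmf(first)
xs, ps = make_cdf(first)
p_value = permutation_p_value(first, others)
```

With your own data you would replace the two lists with the values for the groups you want to compare, and plot the PMF or CDF rather than inspecting the numbers directly.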
A draft report is due two weeks before the final report. By then you will be able to write a background section, and you should have finished at least a few explorations. In general it is better to turn something in on time, and get prompt feedback, than to turn in something more complete later, in which case my feedback might be too late to be useful.
Please turn in your draft report by giving me a copy on paper and putting a PDF file in our shared folder with the name lastname_draft.pdf (where lastname is your last name).
Please turn in your final report by giving me a copy on paper and putting a PDF file in our shared folder with the name lastname_report.pdf (where lastname is your last name). Also, please put the code you used in our shared folder in a subdirectory named project_code. Include your data, too, unless it is very big.
The audience for the final report is other students in this class. You can assume that they are familiar with the statistical concepts we studied this semester.
The goal of the final report is to take up an interesting question, address it by doing statistical analysis of real data, and present the results in a clear and entertaining way.
Some suggestions on organization:
Each section should have:
Things to notice:
If you were grading that exploration, what would you correct?
Please use the following words carefully:
Significant: Usually means statistically significant. If you mean something else, be really clear.
Trend: Usually means something is changing in time. "Pattern" is a good word to use for an effect that does not involve time.
Correlation: Means a specific kind of relationship between variables; if you just mean relationship, say "relationship". For example "Based on this scatterplot, it looks like there is a linear relationship between these variables, so I calculated their correlation coefficient."
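To see why "correlation" and "relationship" are not interchangeable, here is a small sketch, assuming Pearson's correlation coefficient is the one intended. The function name pearson_r and the data are made up for illustration. The second dataset has an obvious relationship (a parabola), but its correlation coefficient is zero because the relationship is not linear.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
linear = [2 * x + 1 for x in xs]      # perfectly linear relationship
curved = [(x - 3) ** 2 for x in xs]   # clear relationship, but not linear

r_linear = pearson_r(xs, linear)
r_curved = pearson_r(xs, curved)
```

Here r_linear comes out at 1 (up to rounding) and r_curved at 0, even though the second variable is completely determined by the first.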